# iWildCam 2021 - Starter Notebook

## Let's contribute to the study of wild animals with a camera trap dataset!📷

In [None]:
import collections
from datetime import datetime as dt
import gc
import glob 
import json
import os
import warnings
warnings.filterwarnings('ignore')

import cv2
from imblearn.under_sampling import RandomUnderSampler
from IPython.display import YouTubeVideo
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
import plotly.express as px
import seaborn as sns

%matplotlib inline

SEED = 2021

## Contents
- [Competition goal and metric](#0)
- [Motivation of competition](#1)
- [Explanation for image file](#2)
- [Explanation for submission file](#3)
- [Explanation for metadata file](#4)
- [Exploratory data analysis](#5)
- [Sample solution](#6)

### <u>About this notebook</u>
This notebook gives an overview of the background knowledge and data you may need for this challenge.

This competition aims to detect wildlife in camera trap pictures taken at new monitoring locations. Camera traps give us great insight into wildlife. They are a popular method, and a huge amount of data exists around the world. Because the volume is so large, however, the data has not always been accessed and used effectively.

In this competition, we aim to develop a model that effectively classifies animals photographed at different observation points around the world. This challenge will surely bring great insights.

<a id="0"></a> <br>
# <div class="alert alert-block alert-warning">Competition goal and metric</div>

The goal of this competition is to categorize species and count the number of individuals across camera trap image bursts. Each image burst is assigned a unique ID, and so is each individual image within a burst. Taking an image burst as input, we count how many animals of each of the 204 annotated species are present and output the result as a CSV file.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/iwildcam2021/task_image.png" width="500">

Submissions are evaluated with the Mean Columnwise Root Mean Squared Error (MCRMSE).

$$
\frac{1}{m}\sum_{j=1}^{m}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-y_{ij})^2}
$$

Here $j$ indexes a species, $i$ indexes a sequence, $x_{ij}$ is the predicted count for that species in that sequence, and $y_{ij}$ is the ground truth count; $m$ is the number of species and $n$ is the number of sequences. In other words, the metric takes the RMSE for each species and then averages it across species.
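
To make the metric concrete, here is a minimal pure-Python sketch of MCRMSE that follows the formula above. The function name `mcrmse` and the toy numbers are illustrative assumptions, not part of the competition code; in practice a vectorized numpy version would be more convenient.

```python
import math


def mcrmse(y_pred, y_true):
    """Mean columnwise RMSE.

    y_pred, y_true: lists of rows, one row per sequence and one column per species.
    """
    n = len(y_true)      # number of sequences
    m = len(y_true[0])   # number of species
    rmse_per_species = []
    for j in range(m):
        squared_error = sum((y_pred[i][j] - y_true[i][j]) ** 2 for i in range(n))
        rmse_per_species.append(math.sqrt(squared_error / n))
    return sum(rmse_per_species) / m


# Toy example (made-up numbers): 2 sequences, 2 species.
y_true = [[1, 0], [3, 0]]
y_pred = [[1, 0], [1, 0]]
score = mcrmse(y_pred, y_true)
print(score)
```

In this toy case only the first species has any error, so the score is the RMSE of that single column divided by the number of species.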

<a id="1"></a> <br>
# <div class="alert alert-block alert-info">Motivation of competition</div>

## How can we take wildlife pictures and use them?
We can take wildlife pictures with camera traps. Camera traps have an infrared or motion sensor, so they can detect animals: when an animal comes near, the camera takes its picture. With camera traps, we can monitor wildlife continuously, at several points, and at the same time, so we can understand how animals live in the area researchers are interested in. [1]

For example, in Kaeng Krachan National Park in Thailand, indian sinatra were not being directly observed, so it was pointed out that they might no longer survive there. But by using camera traps, we could confirm their existence. [2]

Traditionally, camera traps have been considered an excellent method for investigating ecological information about wildlife in a given area. The data is used for population estimation, calculating population indices, 24-hour monitoring and so on. [3]

Recently there has been a movement to make effective use of camera trap photos with machine learning. One example is [4]: Google has succeeded in making a large body of wildlife photo knowledge accessible. Wildlife Insights, shown in the following video, is the platform leading this initiative.

In [None]:
YouTubeVideo('qKgRbkCkRFY')

Thus, creating a model for effectively classifying wild animals is a very significant effort.

### References

[1] https://www.wwf.org.uk/project/conservationtechnology/camera-trap

[2] https://www.wwf.or.jp/campaign/2015_camera/

[3] https://en.wikipedia.org/wiki/Camera_trap

[4] https://www.blog.google/products/earth/ai-finds-where-the-wild-things-are/

<a id="2"></a> <br>
# <div class="alert alert-block alert-info">Explanation for image file</div>

First, let's take a look at what kind of images are available.

In [None]:
TRAIN_DATA_PATH = '../input/iwildcam2021-fgvc8/train/'
TEST_DATA_PATH = '../input/iwildcam2021-fgvc8/test/'

train_jpeg = glob.glob(TRAIN_DATA_PATH + '*')
test_jpeg = glob.glob(TEST_DATA_PATH + '*')

print("number of train jpeg data:", len(train_jpeg))
print("number of test jpeg data:", len(test_jpeg))

### Train data

In [None]:
fig = plt.figure(figsize=(25, 16))
for i,im_path in enumerate(train_jpeg[:16]):
    ax = fig.add_subplot(4, 4, i+1, xticks=[], yticks=[])
    im = Image.open(im_path)
    im = im.resize((480,270))
    plt.imshow(im)

In [None]:
fig = plt.figure(figsize=(25, 16))
for i,im_path in enumerate(train_jpeg[16:32]):
    ax = fig.add_subplot(4, 4, i+1, xticks=[], yticks=[])
    im = Image.open(im_path)
    im = im.resize((480,270))
    plt.imshow(im)

### Test data

In [None]:
fig = plt.figure(figsize=(25, 16))
for i,im_path in enumerate(test_jpeg[:16]):
    ax = fig.add_subplot(4, 4, i+1, xticks=[], yticks=[])
    im = Image.open(im_path)
    im = im.resize((480,270))
    plt.imshow(im)

Camera traps capture images in bursts triggered by motion, so images are taken in quick succession. The dataset therefore contains series of images, and sequence IDs are assigned in addition to image IDs. We will use image IDs to load the images and sequence IDs to build the submission.

Take a look at [the submission file for details](#3).

Additionally, notice that some images are in color and some are in black and white. This depends on whether the image was taken during the day or at night.

<a id="3"></a> <br>
# <div class="alert alert-block alert-success">Explanation for submission file</div>

In the iWildCam 2021 - FGVC8 competition, we have to count the number of individuals of each animal species in each image sequence. In contrast, [last year (in iWildCam 2020 - FGVC7)](https://www.kaggle.com/c/iwildcam-2020-fgvc7/overview/evaluation) we identified which category of animal was shown.

In [None]:
sub = pd.read_csv("../input/iwildcam2021-fgvc8/sample_submission.csv")
sub.head()

The Id column holds the sequence ID, as we will see in [Checking iwildcam2021_train_annotations.json](#4-1). The PredictedX columns hold the number of individuals of each species in the sequence. There are 111057 sequences in the test dataset and 205 categories (204 annotated species plus empty) in the given dataset.

In [None]:
sub.shape
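
As a rough sketch of how a file with this layout could be assembled from per-sequence counts, the snippet below builds rows with the standard library only. The helper `build_submission_rows`, the toy sequence ids, and the `{column_index: count}` layout are hypothetical; the `Id` and `PredictedX` column names follow the description above, and how a column index maps to a category id is an assumption that should be checked against `sample_submission.csv`.

```python
import csv
import io


def build_submission_rows(sequence_ids, counts_by_sequence, num_columns):
    """Build one row per sequence; counts that are not given default to 0.

    counts_by_sequence: {seq_id: {column_index: count}} (hypothetical layout).
    """
    rows = []
    for seq_id in sequence_ids:
        counts = counts_by_sequence.get(seq_id, {})
        row = {"Id": seq_id}
        for k in range(num_columns):
            row[f"Predicted{k}"] = counts.get(k, 0)
        rows.append(row)
    return rows


# Toy example with made-up sequence ids and only 3 prediction columns.
rows = build_submission_rows(["seq-a", "seq-b"], {"seq-a": {1: 2}}, num_columns=3)

# Write the rows as CSV text (an in-memory buffer stands in for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Defaulting every missing count to zero means a sequence with no prediction falls back to the all-zero row.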

<a id="4"></a> <br>
# <div class="alert alert-block alert-success">Explanation for metadata file</div>

<a id="4-1"></a>
## Checking iwildcam2021_train_annotations.json


We are provided with annotations for the train data as "iwildcam2021_train_annotations.json". This JSON follows the COCO-CameraTraps format with additional fields.

If we load the JSON, we find three keys in it.

In [None]:
with open('../input/iwildcam2021-fgvc8/metadata/iwildcam2021_train_annotations.json', encoding='utf-8') as json_file:
    train_annotations =json.load(json_file)
    
train_annotations.keys()

The images value holds the data for each camera trap image. A camera trap takes several frames in a row. 'seq_num_frames' is the number of frames in the sequence, 'id' is the image ID, and 'seq_id' is the ID shared by the images shot in that sequence. This 'seq_id' is the same as the 'Id' in the submission file.

Let's extract the data for the sequence with seq_id 302ad820-7d42-11eb-8fb5-0242ac1c0002.

In [None]:
train_annotations_seq = train_annotations["images"][94:104]
train_annotations_seq
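
The slice above relies on the position of the records in the list. A more general approach is to group the image records by their 'seq_id'. The sketch below is one way to do that; the helper name `group_images_by_sequence` and the toy records are made up, and the 'seq_frame_num' field used for ordering is an assumption about the metadata (the code falls back to 0 when it is absent).

```python
from collections import defaultdict


def group_images_by_sequence(images):
    """Map each seq_id to the list of its image records, ordered by frame."""
    sequences = defaultdict(list)
    for image in images:
        sequences[image["seq_id"]].append(image)
    for frames in sequences.values():
        # "seq_frame_num" is assumed to hold the frame order; default to 0 if absent.
        frames.sort(key=lambda item: item.get("seq_frame_num", 0))
    return dict(sequences)


# Toy records with made-up ids, shaped like entries of train_annotations["images"].
toy_images = [
    {"id": "img-2", "seq_id": "seq-a", "seq_frame_num": 1},
    {"id": "img-3", "seq_id": "seq-b", "seq_frame_num": 0},
    {"id": "img-1", "seq_id": "seq-a", "seq_frame_num": 0},
]
sequences = group_images_by_sequence(toy_images)
print([image["id"] for image in sequences["seq-a"]])
```

With the real metadata, the same function could be called on `train_annotations["images"]` and the result indexed by the seq_id of interest.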

If we look at the images, we can see that they form a series.

In [None]:
train_images_seq = [(TRAIN_DATA_PATH+item["id"]+'.jpg') for item in train_annotations_seq]
img_array = []
size = (480,270)

fig = plt.figure(figsize=(25, 16))
for i,im_path in enumerate(train_images_seq):
    ax = fig.add_subplot(4, 3, i+1, xticks=[], yticks=[])
    im = Image.open(im_path)
    im = im.resize(size)
    plt.imshow(im)
    
    img_array.append(im)

The value of the categories key contains the list of annotated animal species. id 0 is empty. The ids go up to 571, but there are only 204 annotated species, so we have to classify at most 204 species + empty (not 571 + empty)!

In [None]:
df_categories = pd.DataFrame.from_records(train_annotations["categories"])
df_categories

Each training image has at least one associated annotation. The annotated category_ids are in the value of the "annotations" key.

In [None]:
train_annotations["annotations"][:10]

The train data seems to contain examples of every annotated category.

In [None]:
train_annotated_category = set([ annotation["category_id"] for annotation in train_annotations["annotations"]])
len(train_annotated_category)

<a id="4-2"></a> <br>
## Checking iwildcam2021_test_information.json

This file holds information about the test dataset. The format is similar to iwildcam2021_train_annotations.json, but it has only the images key.

In [None]:
with open('../input/iwildcam2021-fgvc8/metadata/iwildcam2021_test_information.json', encoding='utf-8') as json_file:
    test_information =json.load(json_file)
    
test_information.keys()

In [None]:
test_information['images'][:5]

<a id="4-3"></a> <br>
## Checking iwildcam2021_megadetector_results.json

We can also use [Microsoft AI for Earth MegaDetector](https://github.com/microsoft/CameraTraps/blob/master/megadetector.md). This model is trained to detect animals, people, and vehicles in camera trap images using hundreds of thousands of bounding boxes from various ecosystems. The model does not identify animals, it only finds them.

We are provided with sample detection results as "iwildcam2021_megadetector_results.json".

In [None]:
with open('../input/iwildcam2021-fgvc8/metadata/iwildcam2021_megadetector_results.json', encoding='utf-8') as json_file:
    megadetector_results =json.load(json_file)
    
megadetector_results.keys()

There are three keys in the JSON.

The detection results are in the images value.

In [None]:
megadetector_results_df = pd.DataFrame(megadetector_results["images"])
megadetector_results_df.head()

Since there are 263504 detection records, it seems that all of the train and test data has been processed, so we do not need to run MegaDetector ourselves. If you want to fine-tune from MegaDetector's weights or rerun the detection yourself, please refer to this [notebook](https://www.kaggle.com/nayuts/try-megadetector-crop-animals-on-kaggle-notebook).

In [None]:
print(f"There are {len(megadetector_results_df)} detection data.")

In particular, the detected bounding boxes are in the detections value.

In [None]:
megadetector_results_df.iloc[100]["detections"]
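
Each detection comes with a confidence score, so low-confidence boxes can be dropped before further processing. The sketch below assumes every detection is a dict with a 'conf' key next to 'bbox' (the usual MegaDetector output layout); the helper name `filter_detections`, the toy detections, and the 0.5 threshold are illustrative choices, not values from this dataset.

```python
def filter_detections(detections, min_conf=0.5):
    """Keep only the detections whose confidence is at least min_conf."""
    return [d for d in detections if d.get("conf", 0.0) >= min_conf]


# Toy detections (made-up values) in the assumed MegaDetector layout.
toy_detections = [
    {"category": "1", "conf": 0.98, "bbox": [0.10, 0.20, 0.30, 0.40]},
    {"category": "1", "conf": 0.12, "bbox": [0.50, 0.50, 0.10, 0.10]},
]
confident = filter_detections(toy_detections)
print(len(confident))
```

The right threshold depends on how many false positives you are willing to accept, so it is worth tuning on the train data.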

We can visualize the result like this.

In [None]:
# Reference: https://www.kaggle.com/qinhui1999/how-to-use-bbox-for-iwildcam-2020

def draw_bboxs(detections_list, im):
    """
    Draw each detected bounding box on the image in place.

    detections_list: list of dicts, each with a "bbox" entry [x, y, width, height]
        given relative to the image size.
    im: image read by Pillow.
    """
    draw = ImageDraw.Draw(im)
    image_width, image_height = im.size

    for detection in detections_list:
        x1, y1, w_box, h_box = detection["bbox"]
        # Convert the relative box to absolute pixel coordinates.
        left, right = x1 * image_width, (x1 + w_box) * image_width
        top, bottom = y1 * image_height, (y1 + h_box) * image_height

        draw.line([(left, top), (left, bottom), (right, bottom),
                   (right, top), (left, top)], width=4, fill='Red')

In [None]:
# Let's look at the record at index 100 of the detection results.
data_index = 100

# Load the corresponding image (assumed here to be in the train directory).
im = Image.open(TRAIN_DATA_PATH + megadetector_results_df.loc[data_index]['id'] + ".jpg")
im = im.resize((480,270))

# Draw the bounding boxes on the image
draw_bboxs(megadetector_results_df.loc[data_index]['detections'], im)

# Show
plt.imshow(im)
plt.title(f"image {data_index} with bbox")

In [None]:
megadetector_results_df.loc[data_index]['detections'][0]["bbox"]

It is also possible to crop the detected area like this. If you save the cropped images, you can create a dataset.

In [None]:
def get_crop_area(bbox, image_size):
    """Convert a relative [x, y, width, height] bbox to a pixel (left, top, right, bottom) box."""
    x1, y1, w_box, h_box = bbox
    xmin, ymin, xmax, ymax = x1, y1, x1 + w_box, y1 + h_box
    area = (xmin * image_size[0], ymin * image_size[1],
            xmax * image_size[0], ymax * image_size[1])
    return area

crop_area = get_crop_area(megadetector_results_df.loc[data_index]['detections'][0]["bbox"], im.size)
im_cropped = im.crop(crop_area)
plt.imshow(im_cropped)

The info and detection_categories values hold supplementary information.

In [None]:
megadetector_results["info"]

In [None]:
megadetector_results["detection_categories"]

I created a dataset of images cropped to MegaDetector's detection areas. Because of the processing time involved, I put it in a [separate notebook](https://www.kaggle.com/nayuts/256-x-256-cropped-images).

<a id="5"></a> <br>
# <div class="alert alert-block alert-success">Exploratory data analysis</div>

I'll check the distribution of the data from three main perspectives:
- Category ID

- Time point

- Location

As far as the following EDA results show, the given dataset seems to be the same as last year's. If you are interested, compare it with [my notes from last year](https://www.kaggle.com/nayuts/iwildcam-2020-overviewing-for-start). To a greater or lesser extent, we can reuse the findings from last year's competition.

## How much data is there per animal category ID?

There are a lot of categories in the dataset. To check how many samples each category has, I plot a bar chart.

In [None]:
# Preparation for visualization
df_categories = pd.DataFrame(train_annotations["categories"])
labels_id = [item["id"] for item in train_annotations["categories"]]
cnt = collections.Counter([item["category_id"] for item in train_annotations["annotations"]])
df_categories_count = pd.DataFrame.from_dict(cnt, orient='index').reset_index()
df_categories_count = df_categories_count.rename(columns={'index':'id', 0:'count'})

df_categories_count = df_categories_count.merge(df_categories, on='id').sort_values(by=['count'], ascending=False)

In [None]:
fig = plt.figure(figsize=(30, 4))
ax = sns.barplot(x="id", y="count",data=df_categories_count, order=labels_id)
ax.set(ylabel='count')
ax.set(ylim=(0,80000))
plt.title('distribution of count per id in train')

Since there are many categories, I will also provide a plotly interactive bar chart to make it easier to check the details.

In [None]:
fig = px.bar(df_categories_count, x="id", y="count", 
             title='distribution of count per id in train',
             width=800, height=400, color='id')
fig.show()

The annotations seem to be biased to some extent. To see the breakdown, let's look at the top 10 categories.

Empty is the most frequent, but the counts of annotations that do contain animals also vary within the top 10.

In [None]:
df_categories_count.iloc[:10]

On the other hand, the rarest categories have only about one sample each. We need to be careful when splitting the dataset into train and validation data for model training.

In [None]:
df_categories_count.iloc[-10:]
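
One simple way to handle such rare categories when splitting is to keep every category with too few samples entirely in the training set and split the rest per category. The sketch below illustrates this idea with the standard library only; the helper name `split_keep_rare_in_train`, the 20% validation fraction, and the toy labels are assumptions for illustration, not the split used in the sample solution.

```python
import random
from collections import defaultdict


def split_keep_rare_in_train(labels, val_fraction=0.2, min_count=2, seed=2021):
    """Return (train_indices, val_indices), split per category.

    Categories with fewer than min_count samples go entirely to train.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        indices = indices[:]
        rng.shuffle(indices)
        if len(indices) < min_count:
            train_idx.extend(indices)
            continue
        # At least one validation sample, but always keep one for training.
        n_val = max(1, int(len(indices) * val_fraction))
        n_val = min(n_val, len(indices) - 1)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: category 0 has 10 samples, category 1 has 5, category 2 has only 1.
toy_labels = [0] * 10 + [1] * 5 + [2]
train_idx, val_idx = split_keep_rare_in_train(toy_labels)
print(len(train_idx), len(val_idx))
```

Because train and test locations also differ in this competition, a split that holds out whole locations may be worth considering on top of this.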

Let's look at the cumulative ratio. If we take the cumulative sum in descending order of count, we can see that the 40th category reaches 95% and the 90th category reaches 99%.

In [None]:
# Reference: https://www.kaggle.com/kushal1506/deciding-n-components-in-pca

fig, ax = plt.subplots(figsize=(30, 10))
xi = np.arange(1, len(df_categories_count)+1, step=1)

plt.ylim(0.0,1.1)
plt.plot(xi, df_categories_count["count"].cumsum()/sum(df_categories_count["count"]), marker='o', linestyle='--', color='b')


plt.xlabel('Number of categories', fontsize=30)
plt.xticks(np.arange(0, len(df_categories_count), step=10))  # one tick every 10 categories
plt.ylabel('Cumulative ratio', fontsize=30)
plt.title('Cumulative ratio of annotations, categories sorted by descending count', fontsize=30)

plt.axhline(y=0.99, color='g', linestyle='-')
plt.text(0.5, 1.00, '99%', color = 'green', fontsize=30)

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.92, '95%', color = 'red', fontsize=30)

plt.axvline(x=40, color='r', linestyle='-')
plt.text(40, 0.5, '40th', color = 'red', fontsize=30)
         
plt.axvline(x=90, color='g', linestyle='-')
plt.text(90, 0.5, '90th', color = 'green', fontsize=30)

ax.grid(axis='x')
plt.show()
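
The same reading can be done numerically instead of from the plot. The sketch below finds the smallest number of categories, taken from the largest count down, whose cumulative share reaches a given coverage. The helper name `categories_needed` and the toy counts are made up for illustration.

```python
def categories_needed(counts, coverage):
    """Smallest number of categories (largest first) whose cumulative share reaches coverage."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        if running / total >= coverage:
            return rank
    return len(ordered)


# Toy counts (made-up): 6 categories, 100 annotations in total.
toy_counts = [50, 30, 10, 5, 3, 2]
print(categories_needed(toy_counts, 0.90))
print(categories_needed(toy_counts, 0.95))
```

On the real data, the counts would come from `df_categories_count["count"]`.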

If we use such imbalanced data as is, we may not be able to train our models well. As a quick fix, using [RandomUnderSampler](https://imbalanced-learn.org/stable/under_sampling.html#controlled-under-sampling-techniques) from [imblearn](https://imbalanced-learn.org/stable/) may improve the situation.

In [None]:
# Convert annotation data to pandas DataFrame
df_train_annotations = pd.DataFrame(train_annotations["annotations"])

# Under sampling
rus = RandomUnderSampler(random_state=SEED, replacement=True)
df_train_annotations_resampled, _ = rus.fit_resample(df_train_annotations, df_train_annotations["category_id"])

In [None]:
df_train_annotations_resampled.reset_index(drop=True)
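
As far as I understand, RandomUnderSampler with its default settings shrinks every class toward the size of the smallest one, which is very aggressive when some categories have about one sample. A gentler alternative is to cap each category at a fixed maximum and leave the small categories untouched. The sketch below shows that idea in plain Python; the helper name `cap_per_category`, the cap value, and the toy annotations are assumptions for illustration and are not part of imblearn.

```python
import random
from collections import Counter, defaultdict


def cap_per_category(annotations, max_per_category, seed=2021):
    """Randomly keep at most max_per_category annotations for each category_id."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for annotation in annotations:
        by_category[annotation["category_id"]].append(annotation)

    kept = []
    for category_id in sorted(by_category):
        items = by_category[category_id]
        if len(items) > max_per_category:
            # Sample without replacement, so no annotation is duplicated.
            items = rng.sample(items, max_per_category)
        kept.extend(items)
    return kept


# Toy annotations (made-up): 5 of category 0, 2 of category 1, 1 of category 2.
toy_annotations = [{"id": i, "category_id": c} for i, c in enumerate([0] * 5 + [1] * 2 + [2])]
capped = cap_per_category(toy_annotations, max_per_category=2)
print(Counter(a["category_id"] for a in capped))
```

The cap trades off balance against the amount of data thrown away, so it is a hyperparameter to tune.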

## When was the data taken?

Because animals change their activity over time, we want to understand how the data is distributed over time.

In [None]:
df_images = pd.DataFrame(train_annotations["images"])
df_images_test = pd.DataFrame(test_information["images"])

In [None]:
# 'datetime' looks like 'YYYY-mm-dd HH:MM:SS.f', so [2:7] gives 'YY-mm'.
month_year = df_images['datetime'].map(lambda s: s[2:7])
labels_month_year = sorted(set(month_year))

month_year_test = df_images_test['datetime'].map(lambda s: s[2:7])

### Year and Month Perspective

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(30, 7))

# Pass the data as x= and the target axes explicitly.
ax = sns.countplot(x=month_year, order=labels_month_year, ax=axes[0])
ax.set(title='Count of train data per month & year',
       xlabel='YY-mm', ylabel='count', ylim=(0, 50000))

ax = sns.countplot(x=month_year_test, order=labels_month_year, ax=axes[1])
ax.set(title='Count of test data per month & year',
       xlabel='YY-mm', ylabel='count', ylim=(0, 50000))

The data starts in 2013-01, but there seem to be some gaps. For example, train data between 2013-11 and 2014-02 is missing.

We can also see that train data between 2013-01 and 2013-07 is more plentiful than at other times.

The train data covers the test data in terms of time period.

### Monthly perspective

In [None]:
# Characters [5:7] of the datetime string are the month ('mm').
month = df_images['datetime'].map(lambda s: s[5:7])
month_test = df_images_test['datetime'].map(lambda s: s[5:7])
labels_month = sorted(set(month))

fig, axes = plt.subplots(1, 2, figsize=(20, 7))

ax = sns.countplot(x=month, order=labels_month, ax=axes[0])
ax.set(title='Count of train data per month',
       xlabel='mm', ylabel='count', ylim=(0, 55000))

ax = sns.countplot(x=month_test, order=labels_month, ax=axes[1])
ax.set(title='Count of test data per month',
       xlabel='mm', ylabel='count', ylim=(0, 55000))

The train data is biased: February, March, June and July have more data than other months.

The train data covers the test data in terms of month.

Data for November and December is missing. Do the animals hibernate?

### Hourly perspectives

In [None]:
train_taken_hour = df_images['datetime'].map(lambda x: dt.strptime(x, '%Y-%m-%d %H:%M:%S.%f').hour)
test_taken_hour = df_images_test['datetime'].map(lambda x: dt.strptime(x, '%Y-%m-%d %H:%M:%S.%f').hour)

fig, axes = plt.subplots(1, 2, figsize=(20, 7))

ax = sns.countplot(x=train_taken_hour, ax=axes[0])
ax.set(title='Count of train data per hour',
       xlabel='hour', ylabel='count', ylim=(0, 20000))

ax = sns.countplot(x=test_taken_hour, ax=axes[1])
ax.set(title='Count of test data per hour',
       xlabel='hour', ylabel='count', ylim=(0, 20000))

If we arbitrarily define daytime and night, we can also count diurnal and nocturnal data.

For example, we define "daytime" as hours 6-17 and "night" as hours 18-5.

### Day and night perspective

In [None]:
train_taken_phase = train_taken_hour.map(lambda x: "daytime" if x >= 6 and x < 18 else "night")
test_taken_phase = test_taken_hour.map(lambda x: "daytime" if x >= 6 and x < 18 else "night")
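
To make the boundaries of this definition explicit, here is a small standalone sketch that maps a timestamp string directly to a phase. The helper name `taken_phase` and the example timestamps are made up; the timestamp format is the one used in the hourly cell above.

```python
from datetime import datetime


def taken_phase(timestamp, day_start=6, day_end=18):
    """Return "daytime" for hours in [day_start, day_end), otherwise "night"."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f").hour
    return "daytime" if day_start <= hour < day_end else "night"


# Made-up timestamps around the two boundaries.
for ts in ["2013-06-05 05:59:59.000", "2013-06-05 06:00:00.000",
           "2013-06-05 17:59:59.000", "2013-06-05 18:00:00.000"]:
    print(ts, taken_phase(ts))
```

Sunrise and sunset vary by location and season, so fixed hours are only a rough proxy for the light conditions.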

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 7))

ax = sns.countplot(x=train_taken_phase, order=["daytime", "night"], ax=axes[0])
ax.set(title='Count of train data per phase',
       xlabel='phase', ylabel='count', ylim=(0, 200000))

ax = sns.countplot(x=test_taken_phase, order=["daytime", "night"], ax=axes[1])
ax.set(title='Count of test data per phase',
       xlabel='phase', ylabel='count', ylim=(0, 200000))

When we looked at the image files at the beginning of this notebook, we saw a mixture of color and black-and-white images. Using the time of day as a feature, we may be able to tell whether an image is in color or not.

## Where was the data taken?

We are required to make predictions for photographs taken at different locations, but how is the data distributed in terms of location?

In [None]:
labels_location_train = sorted(list(set(df_images['location'])))
labels_location_test = sorted(list(set(df_images_test['location'])))
labels_location = labels_location_train + labels_location_test

fig = plt.figure(figsize=(30, 4))
ax = sns.countplot(x=df_images['location'], order=labels_location)
ax.set(xlabel='location', ylabel='count')
plt.title('Count of train data per location')

In [None]:
fig = plt.figure(figsize=(30, 4))
ax = sns.countplot(x=df_images_test['location'], order=labels_location)
ax.set(xlabel='location', ylabel='count')
plt.title('Count of test data per location')

The train data and test data seem to have been taken at completely different locations.

The number of pictures differs greatly by location.
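
The separation can also be checked directly with set operations rather than by eye. The sketch below is a minimal example; the helper name `location_overlap` and the toy location ids are made up.

```python
def location_overlap(train_locations, test_locations):
    """Summarize how the sets of train and test locations overlap."""
    train_set, test_set = set(train_locations), set(test_locations)
    return {
        "train_only": len(train_set - test_set),
        "test_only": len(test_set - train_set),
        "shared": sorted(train_set & test_set),
    }


# Toy location ids (made-up); repeated values stand for several images per location.
overlap = location_overlap([1, 1, 2, 3], [4, 5, 5])
print(overlap)
```

On the real data, the inputs would be `df_images['location']` and `df_images_test['location']`; an empty `shared` list would confirm the impression from the plots.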

<a id="6"></a> <br>
# <div class="alert alert-block alert-danger">Sample solution</div>

I have created a sample solution, although it is not very accurate. Check out [this notebook](https://www.kaggle.com/nayuts/efficientnet-with-undersampling). It requires the GPU to be turned on. I separated it so that people who fork this notebook and do some trial and error don't waste GPU time without realizing it.

I haven't beaten kaggle_sample_all_zero_iwildcam_2021.csv yet, but I will share the idea.

1. First we crop the image based on the bbox detected by MegaDetector.
2. In the training data, the correct answer labels are given as annotations, so we can use them to train the model.
3. Classify the cropped images of the test data with the trained model.
4. We choose the animal species and their counts of the image with the highest count among the images in the same image burst.
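
Step 4 can be sketched as follows. The input layout (one list of predicted species ids per image, one id per cropped detection) and the helper name `counts_for_sequence` are assumptions for illustration; the linked notebook may organize its predictions differently.

```python
from collections import Counter


def counts_for_sequence(per_image_species):
    """Pick the per-species counts of the image with the most predicted animals.

    per_image_species: one list per image in the burst, holding the predicted
    species id of each cropped detection in that image (hypothetical layout).
    """
    best = Counter()
    for species_in_image in per_image_species:
        counts = Counter(species_in_image)
        # Strictly greater, so the earliest image wins a tie.
        if sum(counts.values()) > sum(best.values()):
            best = counts
    return dict(best)


# Toy burst (made-up ids): three images with 1, 3 and 0 classified crops.
burst = [[3], [3, 3, 7], []]
print(counts_for_sequence(burst))
```

Taking the busiest image avoids counting the same animal several times across the frames of one burst, at the risk of missing animals that never appear together in a single frame.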

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/iwildcam2021/model_image.png" width="300">

## How can we improve accuracy?

Let's also check how much of the training and test data is empty, because when we submit our inference results we find that "kaggle_sample_all_zero_iwildcam_2021.csv" is a very strong baseline. This is probably because much of the test data is empty. From the MegaDetector detection results, we can see roughly how many images are empty. Let's take a look.

In [None]:
def is_in_test(image_id):
    """Return True if the image file exists in the test directory."""
    return os.path.exists(TEST_DATA_PATH + image_id + ".jpg")

are_images_in_test = [is_in_test(x) for x in megadetector_results_df["id"]]
# Everything that is not a test image is treated as a train image.
are_images_in_train = [not in_test for in_test in are_images_in_test]

train_megadetector_results_df = megadetector_results_df[are_images_in_train]
test_megadetector_results_df = megadetector_results_df[are_images_in_test]

In [None]:
fig = plt.figure(figsize=(15, 4))
ax = sns.countplot(x=[len(detection) for detection in train_megadetector_results_df["detections"]])
ax.set(xlabel='Number of detections', ylabel='count')
plt.title('Distribution of number of detection by MegaDetector for train data.')

In [None]:
fig = plt.figure(figsize=(15, 4))
ax = sns.countplot(x=[len(detection) for detection in test_megadetector_results_df["detections"]])
ax.set(xlabel='Number of detections', ylabel='count')
plt.title('Distribution of number of detection by MegaDetector for test data.')

We can see that nearly half of the training data is empty, and almost half of the test data is empty as well. In other words, to improve accuracy, it is better to predict zero for all columns whenever a sequence can be determined to be empty with certainty.
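
A minimal sketch of such a rule is shown below: if no image in a sequence has a detection above a confidence threshold, the predicted counts are replaced by an empty dict, which corresponds to an all-zero row in the submission. The helper names, the 'conf' key, the 0.5 threshold, and the toy data are all assumptions for illustration.

```python
def is_sequence_empty(detections_per_image, min_conf=0.5):
    """True when no image in the sequence has a detection with conf >= min_conf."""
    return not any(
        detection.get("conf", 0.0) >= min_conf
        for detections in detections_per_image
        for detection in detections
    )


def apply_empty_rule(predicted_counts, detections_per_image, min_conf=0.5):
    """Return {} (all zeros on submission) for sequences judged to be empty."""
    if is_sequence_empty(detections_per_image, min_conf):
        return {}
    return predicted_counts


# Toy sequences (made-up): the first has only a weak detection, the second a strong one.
weak_sequence = [[{"conf": 0.05, "bbox": [0.1, 0.1, 0.2, 0.2]}], []]
strong_sequence = [[], [{"conf": 0.91, "bbox": [0.3, 0.3, 0.2, 0.2]}]]

print(apply_empty_rule({3: 1}, weak_sequence))
print(apply_empty_rule({3: 1}, strong_sequence))
```

A lower threshold makes the rule more conservative, since fewer sequences are then forced to zero.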