# Imports

In this competition, our goal is to detect crown-of-thorns starfish, which have become a serious threat to the Great Barrier Reef. We have to build a model that can detect these starfish in real time. Predictions take the form of bounding boxes, and an image can contain one or more bounding boxes, each marking a starfish. Images must be evaluated in the same order as they were recorded in the video.
##### **Reference kernel**

Thank you for sharing your work. 

- [Kernel 1](https://www.kaggle.com/werooring/basic-eda-starter-for-everyone/notebook)
- [Kernel 2](https://www.kaggle.com/matthieubritoantunes/great-barrier-reef-exploratory-data-analysis)
- [Kernel 3](https://www.kaggle.com/kartik2khandelwal/data-analysis-and-prediction)
- [Kernel 4](https://www.kaggle.com/sarabhian/gbr-extremely-beginner-level-guide-1#%F0%9F%90%A0%F0%9F%90%9F%F0%9F%90%A1%F0%9F%A6%91%F0%9F%90%99%F0%9F%A6%88%F0%9F%90%AC%F0%9F%90%B3%F0%9F%90%8B%F0%9F%A6%80%F0%9F%90%9A%F0%9F%8F%8A%E2%80%8D%E2%99%80%EF%B8%8F%F0%9F%8D%80%E2%98%98%F0%9F%92%BA%F0%9F%9A%A4%E2%9A%93%F0%9F%8F%9D%F0%9F%8C%8A%F0%9F%8C%8A-%F0%9F%90%A0%F0%9F%90%9F%F0%9F%90%A1)
- [Kernel 5](https://www.kaggle.com/sjyangkevin/eda-bouding-box-analysis-annotated-videos)

In [None]:
import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.patches  # Rectangle is used below to draw bounding boxes
import ast
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth',None)

# 📊 Exploratory Data Analysis | EDA 👩‍💻 

In [None]:
train = pd.read_csv('../input/tensorflow-great-barrier-reef/train.csv')
test = pd.read_csv('../input/tensorflow-great-barrier-reef/test.csv')
sample_submission = pd.read_csv('../input/tensorflow-great-barrier-reef/example_sample_submission.csv')

In [None]:
train.head(2)

In [None]:
test.head()

In [None]:
print('train dataset shape: ', train.shape)
print('test dataset shape: ', test.shape)
print('\n')
print('submission dataset shape: ', sample_submission.shape)
print('\n')
# DataFrame.info() prints its report itself and returns None, so call it directly
print('train dataset Info: ')
train.info()
print('test dataset Info: ')
test.info()

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(train.isna(), cbar=False)

No nulls detected!

In [None]:
train.dtypes

The number of `video_frame` per video

In [None]:
train.groupby('video_id')['video_frame'].count()

Checking out the `video_id` distribution / number of unique videos

In [None]:
plt.figure(figsize=(10,5))
sns.set_palette("pastel")
sns.histplot(data=train, x='video_id', kde = True)
plt.axvline(train['video_id'].mean(),c = 'red', ls = '--', lw = 3)

The training data spans 3 different videos.

### Adding image path from the `greatbarrierreef` folder to the dataset

I would also like to see what these images look like, so I will first build an `image_path` for every row in the `train` dataset and then plot a few randomly chosen images.

In [None]:
train['image_path'] = '../input/tensorflow-great-barrier-reef/train_images/video_'+train['video_id'].astype(str)+'/'+train['image_id'].apply(lambda x: x.split('-')[1])+'.jpg'

In [None]:
train.tail()
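
The path construction above can also be written as a small standalone helper, which makes the naming scheme easier to see. This is only a sketch: the function name `build_image_path` and the `root` default are illustrative, and it assumes `image_id` has the form `<video_id>-<video_frame>`, as the `split('-')` call above implies.

```python
def build_image_path(video_id, image_id,
                     root='../input/tensorflow-great-barrier-reef/train_images'):
    """Return the JPEG path for one row (hypothetical helper, not part of the notebook)."""
    # image_id is assumed to look like '0-16': video id, dash, frame number
    frame = image_id.split('-')[1]
    return f'{root}/video_{video_id}/{frame}.jpg'


example_path = build_image_path(0, '0-16')
print(example_path)
```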

### Plotting random image 🐠

In [None]:
rows, cols = 2, 2
fig, axs = plt.subplots(rows, cols, figsize=(12,10))
fig.subplots_adjust(top = 0.99, bottom=0.01, hspace=-0.6, wspace=0.4)
for ax in axs.ravel():
  random_image = random.randint(0,len(train)-1)
  img = mpimg.imread(train['image_path'][random_image])
  ax.imshow(img)
  ax.axis('off')
  ax.set_title(f'Image ID: {train["image_id"][random_image]}',{'fontsize': 20})

### All About Decoding `Annotations`
Exploring `annotations`

Annotations have the following format: (there can be multiple records in one list)
> [{'x': 645, 'y': 182, 'width': 41, 'height': 45}]

where,

- '645' -- x coordinate
- '182' -- y coordinate
- '41'  -- width of the box
- '45'  -- height of the box
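
As a minimal sketch of how one such record can be decoded, the snippet below parses the example string with `ast.literal_eval` and converts the box to corner coordinates. The helper names `parse_annotations` and `to_corners` are made up for illustration, and the conversion assumes `x` and `y` mark the top-left corner of the box, which matches how rectangles are drawn later in this notebook.

```python
import ast


def parse_annotations(raw):
    """Turn an annotation string from train.csv into a list of dicts."""
    return ast.literal_eval(raw)


def to_corners(box):
    """Convert an x/y/width/height box to (x_min, y_min, x_max, y_max).

    Assumes (x, y) is the top-left corner of the box.
    """
    return (box['x'], box['y'],
            box['x'] + box['width'], box['y'] + box['height'])


raw = "[{'x': 645, 'y': 182, 'width': 41, 'height': 45}]"
boxes = parse_annotations(raw)
corners = [to_corners(b) for b in boxes]
print(corners)  # [(645, 182, 686, 227)]
```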

In [None]:
len(train['annotations'])  

In [None]:
train['annotations'].dtype

In [None]:
train.annotations.describe()

Unique `annotations`

In [None]:
train.annotations.unique()

In [None]:
record_with_annotations = train[train['annotations'] != '[]']['annotations'].count()
record_without_annotations = train[train['annotations'] == '[]']['annotations'].count()
plt.rcParams['figure.figsize'] = (11, 5)
sns.barplot(x = ['record with annotations', 'record without annotations'], y = [record_with_annotations, record_without_annotations], palette = 'colorblind')
plt.title('with/without Annotation Distribution', fontsize = 30)
plt.xlabel('Annotation', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

In [None]:
annotation_len = train[train['annotations'] != '[]'].copy()  # copy so later column assignments do not write to a view of train
annotation_len

Number of records without `annotations`

In [None]:
len(train['annotations'])  - len(annotation_len['annotations'])

It looks like a large share of the records have no annotations, which can make it harder to train a good model.
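
To put a number on that share, the fraction of empty records can be computed directly from a boolean mask. The sketch below uses a tiny made-up frame (`toy`) rather than the real `train` data, so the printed value is illustrative only.

```python
import pandas as pd

# Toy stand-in for train: three frames without boxes, one frame with a box
toy = pd.DataFrame({
    'annotations': [
        '[]',
        "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]",
        '[]',
        '[]',
    ]
})

has_boxes = toy['annotations'] != '[]'  # True where at least one box exists
share_empty = 1 - has_boxes.mean()      # the mean of a boolean mask is a fraction
print(f'{share_empty:.0%} of frames have no annotations')
```

On the real data, the same two lines applied to `train` would give the actual proportion.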

Checking the annotation length distribution for records whose `annotations` are not empty.

A non-empty `annotations` value indicates the presence of starfish, so let's check the distribution of its length:

In [None]:
plt.rcParams['figure.figsize'] = (15, 9)
# annotations are still strings at this point, so len() is the character length of each record
sns.countplot(x=annotation_len['annotations'].apply(len), palette = 'colorblind')
plt.title('Annotation Length Distribution', fontsize = 30)
plt.xlabel('Annotation', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

### `Annotation` distribution per `video_id`

**Reference Kernel** is [here](https://www.kaggle.com/icaram/eda-non-annotated-starfish)

In [None]:
# for video_id in train['video_id'].unique():
video_1 = (annotation_len[annotation_len['video_id'] == 0]['annotations']).count()
video_2 = (annotation_len[annotation_len['video_id'] == 1]['annotations']).count()
video_3 = (annotation_len[annotation_len['video_id'] == 2]['annotations']).count()

In [None]:
ax = sns.barplot(x=['Video id: 0', 'Video id: 1', 'Video id: 2'], y=[video_1, video_2, video_3])
ax.set_ylabel('Count')

It seems the video with `video_id` 2 has the fewest annotated records.
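
The three per-video counts above can also be obtained in one step with `groupby`, without hard-coding the video ids. This is a sketch on a made-up frame (`toy`), so the counts shown are not the real ones.

```python
import pandas as pd

box = "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]"
toy = pd.DataFrame({
    'video_id':    [0,    0,   1,   2,    2,    2],
    'annotations': ['[]', box, box, '[]', '[]', '[]'],
})

# Group the boolean "has annotations" mask by video and sum the True values,
# so videos with zero annotated frames still appear in the result
has_boxes = toy['annotations'] != '[]'
per_video = has_boxes.groupby(toy['video_id']).sum()
print(per_video)
```

Grouping the mask instead of the filtered frame keeps videos with no annotated frames in the output, which a filter followed by `count()` would silently drop.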

### Checking for correlation

No noticeable positive or negative correlation detected.

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(train.select_dtypes('number').corr(), linewidths = 0.5, linecolor = 'white', annot = True,
           cmap = 'RdYlGn', cbar_kws = {'shrink' : 0.5})

Adding `image_path` to the `annotation_len` dataframe

In [None]:
annotation_len['image_path'] = '../input/tensorflow-great-barrier-reef/train_images/video_'+annotation_len['video_id'].astype(str)+'/'+annotation_len['image_id'].apply(lambda x: x.split('-')[1])+'.jpg'

In [None]:
annotation_len.tail(2)

### Detecting object with the help of a **`bounding box`** from a random image file

Each record in `annotations` represents a bounding box. As the exploration above shows, one `annotations` list can hold multiple records, which correspond to multiple bounding boxes, one per detected starfish 🐟.

To draw a bounding box, we must first convert `annotations` from its string format into a list of dictionaries, so that the box coordinates become integers.

In [None]:
annotation_len['annotations'] = annotation_len['annotations'].apply(ast.literal_eval)

In [None]:
annotation_len.tail(2)

Adding the number of bounding boxes in each `annotations` list.

In [None]:
annotation_len['number_of_bounding_box'] = annotation_len['annotations'].apply(lambda x: len(x))
annotation_len.head(2)

Maximum number of starfish 🐟 present in a single image of the dataframe

In [None]:
annotation_len['number_of_bounding_box'].max()

It looks like 18 is the maximum number of starfish 🐟 in a single image in this dataframe.

In [None]:
max_bbox = annotation_len[annotation_len['number_of_bounding_box'] == 18]['annotations']

In [None]:
max_bbox

In [None]:
len(max_bbox)
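
The cells above hard-code the value 18. A small variation, sketched here on a made-up frame (`toy`), looks up the maximum first and filters on it, so the selection keeps working if the data changes.

```python
import pandas as pd

# Toy stand-in for annotation_len with a precomputed box count per image
toy = pd.DataFrame({
    'image_id': ['0-1', '0-2', '1-5'],
    'number_of_bounding_box': [2, 5, 5],
})

max_boxes = toy['number_of_bounding_box'].max()
busiest = toy[toy['number_of_bounding_box'] == max_boxes]  # all rows tied for the maximum
print(max_boxes, list(busiest['image_id']))
```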

#### Bounding Box distribution

In [None]:
plt.rcParams['figure.figsize'] = (15, 9)
sns.countplot(x=annotation_len['number_of_bounding_box'], palette = 'colorblind')
plt.title('Bounding Box Distribution', fontsize = 30)
plt.xlabel('Bounding Box', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

In [None]:
rows, cols = 1, 2
fig, axs = plt.subplots(rows, cols, figsize=(24,10))
fig.subplots_adjust(top = 0.99, bottom=0.01, hspace=-0.6, wspace=0.4)
for ax in axs.ravel():
  random_image = random.choice(annotation_len.index)
  img = mpimg.imread(annotation_len.loc[random_image, 'image_path'])

  # create bounding boxes from the annotation data
  annotations = annotation_len.loc[random_image, 'annotations']
  total = len(annotations)
  for bbox in annotations:
    x, y, w, h = bbox['x'], bbox['y'], bbox['width'], bbox['height']
    rect = matplotlib.patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='darkorange', facecolor='orange', alpha=.5)
    ax.add_patch(rect)
  ax.imshow(img)
  ax.axis('off')
  ax.set_title(f'Image ID: {annotation_len["image_id"][random_image]} with {total} starfish(s)',{'fontsize': 15})

**WOW!! Looks great**