# [TensorFlow - Help Protect the Great Barrier Reef](https://www.kaggle.com/c/tensorflow-great-barrier-reef)

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/31703/logos/header.png?t=2021-10-29-00-30-04">

# GOAL

### 1. Accurately identify starfish in real-time by building an object detection model trained on underwater videos of coral reefs.

### 2. Predict the presence and position of crown-of-thorns starfish in sequences of underwater images taken at various times and locations around the Great Barrier Reef.

# NEED

### 1. The Great Barrier Reef is under threat because of the overpopulation of one particular starfish – the coral-eating crown-of-thorns starfish (or COTS for short).

### 2. Underwater cameras will collect thousands of reef images and AI technology could drastically improve the efficiency and scale at which reef managers detect and control COTS outbreaks.

# INSTRUCTIONS

### 1. Predictions take the form of a bounding box together with a confidence score for each identified starfish. 

### 2. An image may contain zero or more starfish.

### 3. This competition uses a hidden test set that will be served by an API to ensure you evaluate the images in the same order they were recorded within each video. When your submitted notebook is scored, the actual test data (including a sample submission) will be available to your notebook.
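
As a rough illustration of item 1, the predictions for one image can be thought of as a list of (confidence, box) entries serialized into a single string. The sketch below assumes a space-separated `confidence x_min y_min width height` layout. That layout and the helper name `format_predictions` are assumptions, not something stated in this notebook, so check the competition's evaluation page before relying on them.

```python
def format_predictions(detections):
    """Serialize detections into one space-separated string.

    `detections` is a list of (confidence, x_min, y_min, width, height) tuples.
    The field order is an assumption, not taken from this notebook.
    """
    parts = []
    for conf, x_min, y_min, width, height in detections:
        parts.append(f"{conf:.2f} {int(x_min)} {int(y_min)} {int(width)} {int(height)}")
    return " ".join(parts)


# Made-up detections for a single image.
example = [(0.87, 559, 213, 50, 32), (0.4, 10, 20, 30, 40)]
print(format_predictions(example))

# An image with no starfish yields an empty string.
print(repr(format_predictions([])))
```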

# IMPORTING LIBRARIES

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from PIL import Image, ImageDraw
import random

# DATA DESCRIPTION

* `video_id` - ID number of the video the image was part of. The video ids are not meaningfully ordered.
* `video_frame` - The frame number of the image within the video. Expect to see occasional gaps in the frame number from when the diver surfaced.
* `sequence` - ID of a gap-free subset of a given video. The sequence ids are not meaningfully ordered.
* `sequence_frame` - The frame number within a given sequence.
* `image_id` - ID code for the image, in the format `{video_id}-{video_frame}`
* `annotations` - The bounding boxes of any starfish detections in a string format that can be evaluated directly with Python. Does not use the same format as the predictions you will submit. Not available in test.csv. A bounding box is described by the pixel coordinate `(x_min, y_min)` of its upper-left corner within the image (image coordinates start at the top left), together with its `width` and `height` in pixels (COCO format).
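
To make the `annotations` format concrete, here is a minimal sketch that parses one annotation string and converts each box from `(x, y, width, height)` to corner coordinates. The sample string is made up for illustration, and `coco_to_corners` is a hypothetical helper name.

```python
import ast


def coco_to_corners(box):
    """Convert a {'x', 'y', 'width', 'height'} dict to (x_min, y_min, x_max, y_max)."""
    x_min, y_min = box["x"], box["y"]
    return (x_min, y_min, x_min + box["width"], y_min + box["height"])


# Made-up annotation string in the same shape as the `annotations` column.
raw = "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]"

boxes = ast.literal_eval(raw)  # list of dicts, one per starfish
corners = [coco_to_corners(b) for b in boxes]
print(corners)
```

The corner form is what drawing routines such as PIL's `ImageDraw.rectangle` expect, which is why the conversion shows up again in the visualization code.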

In [None]:
train_metaData = pd.read_csv('../input/tensorflow-great-barrier-reef/train.csv')
train_metaData.head()

In [None]:
test_metaData = pd.read_csv('../input/tensorflow-great-barrier-reef/test.csv')
test_metaData.head()

# DATA VISUALIZATION

In [None]:
df_train = train_metaData.copy()
train_dir = "../input/tensorflow-great-barrier-reef/train_images"
df_train['image_path'] = train_dir + "/video_" + df_train['video_id'].astype(str) + "/" + df_train['video_frame'].astype(str) + ".jpg"
df_train.head()

In [None]:
df_train['video_id'].value_counts()

In [None]:
plt.bar(x = df_train['video_id'].value_counts().index, height = df_train['video_id'].value_counts())

In [None]:
df_train.info()

Number of training images

In [None]:
num_training_images = len(df_train)
num_training_images

Looking at ten random images from the training set.

In [None]:
plt.figure(figsize = (20, 20))
for i in range(0, 10):
    plt.subplot(5, 2, i+1)
    index = random.randint(0, len(df_train) - 1)  # randint includes both endpoints
    img_path = df_train['image_path'].iloc[index]
    img = Image.open(img_path)
    plt.imshow(img)

In [None]:
img_video_2_10 = plt.imread('../input/tensorflow-great-barrier-reef/train_images/video_2/10.jpg')
img_video_2_10.shape

Assuming all images share the size of the one checked above, there is no need for resizing.  
As you can see, the images are quite blurry and will likely need to be enhanced before modelling.

In [None]:
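# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df_train_annotated = df_train[df_train['annotations'] != '[]'].copy()
df_train_annotated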

In [None]:
(len(df_train_annotated)/ num_training_images) * 100

Only about 20% of the data is annotated.

The `annotations` column contains the coordinates of the bounding boxes that mark where the starfish are.

In [None]:
df_train.dtypes

Now we will look at the number of bounding boxes in each image, because it is clearly stated that an image may contain zero or more starfish.

In [None]:
df_train['No_bbox'] = df_train['annotations'].apply(lambda x:x.count('{')) 
df_train.head()
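
Counting `{` characters works because each box is serialized as one dict. An alternative sketch parses the string and counts the resulting list, which does not depend on how the text is formatted. The sample strings are made up, and `count_boxes` is a hypothetical helper name.

```python
import ast


def count_boxes(annotation_str):
    """Return the number of bounding boxes encoded in an annotation string."""
    return len(ast.literal_eval(annotation_str))


# Made-up annotation strings with zero, one, and two boxes.
samples = [
    "[]",
    "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]",
    "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}, {'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
]

for s in samples:
    # Both approaches should agree on well-formed annotation strings.
    print(count_boxes(s), s.count('{'))
```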

In [None]:
df_train['No_bbox'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x = df_train['No_bbox'])

Doing the same for the annotated data

In [None]:
df_train_annotated['No_bbox'] = df_train_annotated['annotations'].apply(lambda x:x.count('{')) 
df_train_annotated.head()

The filters below give the **indexes** of images that contain many bounding boxes, which are useful for **visualization**.

In [None]:
df_train_annotated[df_train_annotated['No_bbox'] >= 5]

In [None]:
df_train_annotated[df_train_annotated['No_bbox'] >= 8]

In [None]:
# ast.literal_eval safely parses the annotation string into a Python list of dicts
import ast
ast.literal_eval(df_train_annotated.iloc[2345].annotations)

To learn more about the **ast** library, see:
1. https://docs.python.org/3/library/ast.html
2. https://stackoverflow.com/questions/29552950/when-to-use-ast-literal-eval/29556591
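
As a short sketch of why `ast.literal_eval` is preferred over `eval` for strings like these: it accepts only Python literals (lists, dicts, numbers, strings and so on) and raises an error for anything else, such as a function call. `safe_parse` is a hypothetical helper name used only for this illustration.

```python
import ast


def safe_parse(text):
    """Parse a Python literal from text, returning None if it is not a plain literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


# A literal list of dicts parses normally.
print(safe_parse("[{'x': 1, 'y': 2}]"))

# A function call is not a literal, so it is rejected instead of being executed.
print(safe_parse("__import__('os').getcwd()"))
```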

In [None]:
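def visualize_img_annots(df, id):
    """Draw the annotated bounding boxes on the image whose index label is `id`."""
    img = Image.open(df['image_path'].loc[id])
    bounding_boxes = ast.literal_eval(df['annotations'].loc[id])
    draw = ImageDraw.Draw(img)

    for box in bounding_boxes:
        # Convert (x, y, width, height) to the corner pair (x0, y0, x1, y1) that PIL expects
        shape = (box['x'], box['y'], box['x'] + box['width'], box['y'] + box['height'])
        draw.rectangle(shape, outline=180, width=3)
    display(img)  # `display` is provided by the notebook (IPython) environment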

In [None]:
# Number of bounding boxes = 5
visualize_img_annots(df_train_annotated, id=19824)

In [None]:
# Number of bounding boxes = 11
visualize_img_annots(df_train_annotated, id = 9292)

# NOTEBOOK STILL IN PROGRESS