First, I would like to thank the hosts of this competition for coming up with something meaningful and impactful. A few months back I watched the Netflix documentary [Seaspiracy](https://www.youtube.com/watch?v=1Q5CXN7soQg), which showed the deep impact that human-created pollution has on coral reefs. I was shocked to learn that we have lost JUST SO MUCH of one of the largest ocean species. I have always wanted to contribute to this cause, which is why this competition is close to my heart.

This kernel will explore the dataset. We will look at different sequences of the dataset using various Weights and Biases features. 

# Imports and Setup

In [None]:
# For video/gif creation
!pip -qq install imageio pygifsicle
# Install W&B
!pip install -q --upgrade wandb

In [None]:
# Required by pygifsicle
!apt-get install gifsicle

In [None]:
# Login 
import wandb
print(wandb.__version__)
wandb.login()

In [None]:
import os
import cv2
import ast
import numpy as np
import pandas as pd
import imageio as iio
from tqdm import tqdm
from pygifsicle import optimize
import matplotlib.pyplot as plt

# Load Dataset and Quick EDA

In [None]:
df = pd.read_csv('../input/tensorflow-great-barrier-reef/train.csv')
print("Number of training images: ", len(df))
df.head()

📍 Note: There are 23501 training images, but they are frames taken from video sequences. Because of this, some images have no annotations at all.

In [None]:
pd.DataFrame(df.annotations.value_counts())

📍 Note: Out of 23501 images, 18582 have no annotations. This is something we need to take care of while modeling.
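
As a complement to `value_counts`, the annotation strings can be parsed to count boxes per frame. The snippet below is a minimal sketch using only the standard library. `count_boxes` and `sample_annotations` are hypothetical names introduced for illustration, and the sample strings are made up in the same format as the `annotations` column.

```python
import ast
from collections import Counter


def count_boxes(annotation_str: str) -> int:
    """Return the number of bounding boxes encoded in one `annotations` cell."""
    return len(ast.literal_eval(annotation_str))


# Hypothetical sample values, formatted like the `annotations` column of train.csv
sample_annotations = [
    "[]",
    "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]",
    "[]",
    "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}, "
    "{'x': 100, 'y': 120, 'width': 25, 'height': 25}]",
]

# Number of boxes in each frame
boxes_per_frame = [count_boxes(a) for a in sample_annotations]
# Frames that contain at least one box
num_annotated = sum(n > 0 for n in boxes_per_frame)
# How many frames have 0, 1, 2, ... boxes
histogram = Counter(boxes_per_frame)

print(f"annotated frames: {num_annotated} / {len(boxes_per_frame)}")
print(f"boxes per frame: {dict(histogram)}")
```

On the real data, something like `df.annotations.map(count_boxes)` should give the same per-frame counts as a new column.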

In [None]:
pd.DataFrame(df.sequence.value_counts())

📍 Note: There are a total of 20 sequences. A sequence is a gap-free subset of a given video. We can use this fact to create a meaningful training and validation split. However, we also need to consider that only 4919 images are annotated.
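
One possible way to act on this is to split by sequence, so that frames from the same sequence never end up in both training and validation. The snippet below is only a sketch under that assumption. `split_by_sequence`, `val_fraction`, and `seed` are hypothetical names, and the sequence ids are made up. It does not balance the number of annotated frames between the two sets, which would be a natural next step.

```python
import random


def split_by_sequence(sequence_ids, val_fraction=0.2, seed=42):
    """Split row indices so every sequence lands entirely in train or in validation."""
    unique_sequences = sorted(set(sequence_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_sequences)

    # Hold out at least one whole sequence for validation
    num_val = max(1, round(len(unique_sequences) * val_fraction))
    val_sequences = set(unique_sequences[:num_val])

    train_idx = [i for i, s in enumerate(sequence_ids) if s not in val_sequences]
    val_idx = [i for i, s in enumerate(sequence_ids) if s in val_sequences]
    return train_idx, val_idx, val_sequences


# Hypothetical sequence ids, one entry per frame
sequence_ids = [101, 101, 101, 202, 202, 303, 303, 303, 404, 505]
train_idx, val_idx, val_sequences = split_by_sequence(sequence_ids, val_fraction=0.2)

print("validation sequences:", val_sequences)
print("train frames:", len(train_idx), "| validation frames:", len(val_idx))
```

With the real dataframe, `df.sequence.tolist()` could be passed as `sequence_ids`, and the returned indices used with `df.iloc`.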

# Utilities

In [None]:
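def get_frames_with_annotations(df_row: pd.Series) -> tuple:
    """
    Get the frame (numpy.ndarray) and the associated annotations (bounding boxes).
    
    Arguments:
        df_row (pd.Series): A row of the dataframe

    Returns:
        (frame, annotations): RGB frame and a list of bounding box dicts.
        If the image cannot be read, `frame` is None and the list is empty.
    """
    # Get frame path (TRAIN_PATH must be defined before this function is called)
    frame_path = f"{TRAIN_PATH}/video_{df_row.video_id}/{df_row.video_frame}.jpg"
    # Open the image with OpenCV
    frame = cv2.imread(frame_path)
    if frame is None:
        # Return a pair so that callers can still unpack the result
        return None, []
    # OpenCV loads images as BGR, so convert to RGB
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Parse the annotations string into a list of dicts
    annotations = ast.literal_eval(df_row.annotations)
    
    return frame, annotations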

In [None]:
def draw_bounding_boxes(frame: np.ndarray, 
                        annotations: list,
                        color: tuple=(0, 0, 255),
                        thickness: int=2) -> np.ndarray:
    """
    Draw bounding boxes given an image and bounding box annotations.
    
    Arguments:
        frame (np.ndarray): Image 
        annotations (list): Bounding box coordinates in the format `x_min, y_min, width, height`.
        color (tuple): Color of the bounding boxes drawn on the frame.
        thickness (int): Thickness of the bounding boxes.
    """
    frame = frame.copy()
    
    # Draw bounding boxes
    for ant in annotations:
        start_point = (ant['x'], ant['y'])
        end_point = (ant['x']+ant['width'], ant['y']+ant['height'])

        frame = cv2.rectangle(frame, start_point, end_point, color, thickness)
        
    return frame

In [None]:
def wandb_bbox(image, bboxes, true_label, class_id_to_label, class_set):
    """
    Create wandb.Image object. `class_set` is required to log bounding box onto W&B Tables.
    
    Arguments:
    
        image (np.ndarray): Image 
        bboxes (list): Bounding box coordinates as dictionary
        true_label (int): Class id
        class_id_to_label (dict): Dictionary mapping class id to class name
        class_set (wandb.Classes): Needed to log image overlays onto W&B Tables. Might not be needed in future.
    """
    all_boxes = []
    for bbox in bboxes:
        box_data = {"position": {
                        "minX": bbox['x'],
                        "minY": bbox['y'],
                        "maxX": bbox['x']+bbox['width'],
                        "maxY": bbox['y']+bbox['height']
                    },
                     "class_id" : int(true_label),
                     "box_caption": class_id_to_label[true_label],
                     "domain" : "pixel"}
        all_boxes.append(box_data)
    
    return wandb.Image(image, boxes={
        "ground_truth": {
            "box_data": all_boxes,
            "class_labels": class_id_to_label
        }
    }, classes=class_set)

# Visualize One Sequence

In [None]:
SEQ_NUMBER = 18048
TRAIN_PATH = '../input/tensorflow-great-barrier-reef/train_images'

In [None]:
# Get one sequence
seq_df = df.loc[df.sequence == SEQ_NUMBER]
seq_df.head()

In [None]:
frames = []

# Iterate over the sequence
for i in tqdm(range(len(seq_df))):
    # Get the ith row
    row = seq_df.iloc[i]
    
    # Get frame and annotations
    frame, annotations = get_frames_with_annotations(row)
    if frame is None:
        continue
        
    # Draw bounding boxes
    frame = draw_bounding_boxes(frame, annotations)
    frames.append(frame)

frames = np.array(frames)

In [None]:
# Create gif
gif_path = f"sequence_{SEQ_NUMBER}.gif"

with iio.get_writer(gif_path, mode='I') as writer:
    for frame in frames:
        writer.append_data(frame)
        
optimize(gif_path)

# Log GIF as W&B Video
run = wandb.init(project='barrier_reef_viz')
wandb.log({f"{SEQ_NUMBER}": wandb.Video(gif_path)})
wandb.finish()

run

### [Check out the run page $\rightarrow$](https://wandb.ai/ayush-thakur/barrier_reef_viz/runs/guwmlt1u?workspace=)

# Visualize Sequences with W&B Tables

In [None]:
unique_sequences = df.sequence.unique()

# Define columns
columns = ["video_id", "video_frame", "sequence_frame_no", "is_annotation"] # W&B Code 1

# Set up a W&B Classes object. This gives additional metadata for the visuals.
# Note that we need to pass class_set to wandb.Image. In the future, this extra step might not be needed.
class_set = wandb.Classes([{'name': 'starfish', 'id': 0}]) # W&B Code 2

# Iterate over each sequence
for seq in unique_sequences:
    seq_df = df.loc[df.sequence == seq]
    
    # Log each sequence as separate W&B run
    run = wandb.init(project='barrier_reef_viz', group='viz-tables', name=f"table_{seq}") # W&B Code 3
    # Initialize W&B Tables
    seq_table = wandb.Table(columns=columns) # W&B Code 4
    
    # Iterate over the sequence
    for i in tqdm(range(len(seq_df))):
        # Get the ith row
        row = seq_df.iloc[i]
        
        # Get frame and annotations
        frame, annotations = get_frames_with_annotations(row)
        # Skip frames that could not be read from disk
        if frame is None:
            continue
        
        # Flag whether this frame has at least one bounding box
        is_annotation = len(annotations) > 0
        
        wandb_img = wandb_bbox(frame, annotations, 0, {0:'starfish'}, class_set)
            
        seq_table.add_data(row.video_id, 
                           wandb_img,
                           row.sequence_frame,
                           is_annotation) # W&B Code 5

    wandb.log({"tables_viz": seq_table}) # W&B Code 6
    
    # Close W&B run
    run.finish()

## Why log it to W&B?

You might wonder why we take the extra step of logging to a W&B dashboard. That's a fair question, and here are a few reasons:
* You can easily log all the sequences, or the data (GIF) generated here in any other way, for future reference.
* You will probably visualize the dataset and then start training models. Logging model predictions alongside the ground truth like this will help document the experiments better.
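
To make the second point concrete, here is a rough sketch of how predictions and ground truth could be combined into one overlay dictionary. It follows the same box format used in `wandb_bbox` above. `to_box_data`, `build_overlays`, the `"predictions"` group name, and the sample boxes and score are all hypothetical. The `scores` field is included on the assumption that W&B box entries accept a dictionary of numeric scores.

```python
def to_box_data(bboxes, class_id, label, scores=None):
    """Convert `x, y, width, height` dicts to W&B pixel-domain box entries."""
    box_data = []
    for i, bbox in enumerate(bboxes):
        entry = {
            "position": {
                "minX": bbox["x"],
                "minY": bbox["y"],
                "maxX": bbox["x"] + bbox["width"],
                "maxY": bbox["y"] + bbox["height"],
            },
            "class_id": int(class_id),
            "box_caption": label,
            "domain": "pixel",
        }
        if scores is not None:
            # Attach the (hypothetical) model confidence to predicted boxes
            entry["scores"] = {"confidence": float(scores[i])}
            entry["box_caption"] = f"{label} {scores[i]:.2f}"
        box_data.append(entry)
    return box_data


def build_overlays(gt_bboxes, pred_bboxes, pred_scores, class_id_to_label, class_id=0):
    """Build a `boxes` dict with one group for ground truth and one for predictions."""
    label = class_id_to_label[class_id]
    return {
        "ground_truth": {
            "box_data": to_box_data(gt_bboxes, class_id, label),
            "class_labels": class_id_to_label,
        },
        "predictions": {
            "box_data": to_box_data(pred_bboxes, class_id, label, pred_scores),
            "class_labels": class_id_to_label,
        },
    }


# Hypothetical ground truth and model output for a single frame
gt = [{"x": 100, "y": 50, "width": 40, "height": 30}]
preds = [{"x": 104, "y": 48, "width": 38, "height": 33}]
overlays = build_overlays(gt, preds, [0.87], {0: "starfish"})
```

The result could then be passed as `wandb.Image(frame, boxes=overlays, classes=class_set)` and added to a table in the same way as above.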

# WORK IN PROGRESS

Better ways to visualize the dataset in the context of model performance comparison. 