This is a starter kernel for training a YOLOv5 model on the TensorFlow - Help Protect the Great Barrier Reef dataset. Given an input image, the task is to detect crown-of-thorns starfish (COTS) by predicting bounding boxes.

### Other works in this competition

You can check out my visualization kernel [here](https://www.kaggle.com/ayuraj/visualize-the-cots-interactively-with-w-b?scriptVersionId=80596542).

# 🖼️ What is YOLOv5?

YOLO, an acronym for "You Only Look Once", is an object detection algorithm that divides an image into a grid. Each grid cell is responsible for detecting the objects whose centers fall inside it.

The [Ultralytics YOLOv5](https://github.com/ultralytics/yolov5) model family enables real-time object detection with convolutional neural networks.
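
To make the grid idea concrete, here is a minimal sketch of how a box center maps to the grid cell that is responsible for it. This is a simplified illustration of the original YOLO formulation, not YOLOv5's actual code (YOLOv5 predicts at several scales with anchors). The function name `responsible_cell` and the 8x8 grid are assumptions made for the example.

```python
def responsible_cell(x_center, y_center, img_w, img_h, grid_size):
    """Return the (row, col) of the grid cell that contains the box center.

    Hypothetical helper for illustration only; coordinates are in pixels.
    """
    # Scale the center into grid units and truncate to a cell index.
    col = int(x_center / img_w * grid_size)
    row = int(y_center / img_h * grid_size)
    # A center lying exactly on the right/bottom edge belongs to the last cell.
    col = min(col, grid_size - 1)
    row = min(row, grid_size - 1)
    return row, col


# Example with a 1280x720 image split into an assumed 8x8 grid.
center_cell = responsible_cell(640, 360, 1280, 720, 8)
corner_cell = responsible_cell(1280, 720, 1280, 720, 8)
print(center_cell, corner_cell)
```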

# 🦄 What is Weights and Biases?

[Weights & Biases](https://wandb.ai/site) (W&B) is a set of machine learning tools that helps you build better models faster. W&B is directly integrated into YOLOv5, providing experiment metric tracking, model and dataset versioning, rich model prediction visualization, and more. To learn more, check out the kernel [Experiment Tracking with Weights and Biases](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases).

# ☀️ Imports and Setup

According to the official Train Custom Data guide, YOLOv5 expects the dataset and the repository to sit side by side in the following directory structure:

```
/parent_folder
    /dataset
         /images
         /labels
    /yolov5
```

* Create a `tmp` working directory one level above the current directory.
* Clone the YOLOv5 repository and pip install the required dependencies.
* Install the latest version of W&B and log in with your W&B account. You can create a free W&B account [here](https://wandb.ai/signup).

In [None]:
%cd ../
!mkdir tmp
%cd tmp

In [None]:
# Download YOLOv5
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
# Install dependencies
%pip install -qr requirements.txt  # install dependencies

%cd ../
import torch
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

## How to log in to W&B?

There are two ways to log in to W&B from a Kaggle kernel:

* Run a cell with `wandb.login()`. It will ask for the API key, which you can copy and paste in. This is ideal if you Quick Save your kernel.

* Store your API key in Kaggle Secrets and use the code snippet below to log in. If you are not familiar with Kaggle Secrets, check this [forum post](https://www.kaggle.com/product-feedback/114053). This is ideal if you do Run and Save All.

```
import wandb
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

# I have saved my API token with "wandb_api" as the Label.
# If you use a different Label, change it below as well.
wandb_api = user_secrets.get_secret("wandb_api")

wandb.login(key=wandb_api)
```

More on W&B login [here](https://docs.wandb.ai/ref/cli/wandb-login).
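
The two login routes can be combined into one fallback. The sketch below only decides which keyword arguments to hand to `wandb.login`: it tries a secret lookup first and falls back to the anonymous mode used later in this kernel. The helper name `wandb_login_kwargs` and the injected `get_secret` callable are assumptions for illustration, so the logic can be tried without Kaggle or W&B installed.

```python
def wandb_login_kwargs(get_secret, label="wandb_api"):
    """Pick keyword arguments for wandb.login().

    get_secret: any callable that takes a label and returns the API key
                (for example UserSecretsClient().get_secret on Kaggle).
    label:      the secret's Label; "wandb_api" matches the snippet above.
    """
    try:
        key = get_secret(label)
    except Exception:
        # The secret is missing or the secrets service is unavailable.
        key = None
    if key:
        return {"key": key}
    # No key available: fall back to an anonymous run.
    return {"anonymous": "must"}


def _missing_secret(label):
    # Stand-in for a lookup that fails because the secret does not exist.
    raise KeyError(label)


with_key = wandb_login_kwargs(lambda label: "dummy-key-for-demo")
without_key = wandb_login_kwargs(_missing_secret)
print(with_key, without_key)
```

In a Kaggle kernel this would be used as `wandb.login(**wandb_login_kwargs(UserSecretsClient().get_secret))`.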


In [None]:
# Install W&B 
# !pip install -q --upgrade wandb

# Login 
import wandb
print(wandb.__version__)
wandb.login(anonymous='must')


📍 Note: W&B comes pre-installed in Kaggle kernels, but to ensure you have the latest version of W&B, upgrade it with pip install.



In [None]:
# Necessary/extra dependencies. 
import os
import gc
import ast
import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
from shutil import copyfile
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# 🦆 Hyperparameters


In [None]:
%cd ../
TRAIN_PATH = 'input/tensorflow-great-barrier-reef/train_images'
IMG_SIZE = 256
BATCH_SIZE = 16
EPOCHS = 10

# 🔨 Prepare Dataset

This is the most important section when it comes to training an object detector with YOLOv5. The directory structure, bounding box format, etc. must be exactly what YOLOv5 expects. This section builds every piece needed to train a YOLOv5 model.

* Find the number of images with annotations and remove the ones without annotations.
* Create a train-validation split.
* Create the required dataset folder structure and copy the images to that folder.
* Create the `data.yaml` file needed to train the model.
* Create bounding box coordinates in the required YOLO format.
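
The first step relies on the `annotations` column holding a string representation of a list of boxes. Here is a small standalone sketch of that parsing and filtering on toy data. The three annotation strings and the name `count_boxes` are made up for illustration; the kernel itself does the same thing on the pandas DataFrame.

```python
import ast

# Toy annotation strings in the same shape as the `annotations` column:
# a Python-literal list of {x, y, width, height} dicts (values are made up).
toy_annotations = [
    "[]",
    "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]",
    "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}, {'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
]


def count_boxes(annotation_string):
    """Parse the string safely and return the number of boxes it contains."""
    return len(ast.literal_eval(annotation_string))


box_counts = [count_boxes(a) for a in toy_annotations]

# Keep only the entries that have at least one box.
annotated = [a for a, n in zip(toy_annotations, box_counts) if n > 0]
pct_annotated = 100 * len(annotated) / len(toy_annotations)
print(box_counts, f"{pct_annotated:0.2f}% annotated")
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a CSV.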

In [None]:
df = pd.read_csv('input/tensorflow-great-barrier-reef/train.csv')
print('Number of training images: ', len(df))
df.head()

📍 Notes:

* There are three unique video_ids. These are 0, 1, and 2 and are three distinct directories (video_0, video_1, and video_2).
* There are 20 unique sequences. 

In [None]:
def add_path(row):
    return f"{TRAIN_PATH}/video_{row.video_id}/{row.video_frame}.jpg"

def num_boxes(annotations):
    annotations = ast.literal_eval(annotations)
    return len(annotations)

# Add path
df['path'] = df.apply(add_path, axis=1)
# Find number of annotations per image
df['num_bbox'] = df['annotations'].apply(num_boxes)

# Percentage of images with at least one bounding box
pct_with_bbox = (df.num_bbox > 0).mean() * 100
print(f"Images without bboxes: {100 - pct_with_bbox:0.2f}% | with bboxes: {pct_with_bbox:0.2f}%")

In [None]:
# Remove images without annotations. 
df = df.query("num_bbox>0")
print(f"Number of images with annotations: {len(df)}")
df.head()

In [None]:
# Create train and validation split
train_df, valid_df = train_test_split(df, test_size=0.15, random_state=42)

# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning
train_df = train_df.assign(split='train')
valid_df = valid_df.assign(split='valid')

df = pd.concat([train_df, valid_df]).reset_index(drop=True)
print(f"Number of train images: {len(train_df)}, validation images: {len(valid_df)}")

# 🍚 Prepare Required Folder Structure

The required folder structure for the dataset directory is:

```
/parent_folder
    /dataset
         /images
             /train
             /val
         /labels
             /train
             /val
    /yolov5
```

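Note that the dataset directory is named `barrier_reef` here, and the validation folder is named `valid` to match the `split` column and the `data.yaml` paths.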
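
As a rough sketch, the skeleton can be created up front with `os.makedirs`. The layout below (`barrier_reef/images|labels/train|valid`) follows the paths used for `data.yaml` and the label files in this kernel. The helper name `make_yolo_dirs` is hypothetical, and a temporary directory stands in for `tmp` so the example can run anywhere.

```python
import os
import tempfile


def make_yolo_dirs(root, dataset_name="barrier_reef", splits=("train", "valid")):
    """Create <root>/<dataset_name>/{images,labels}/<split> and return the paths."""
    created = []
    for kind in ("images", "labels"):
        for split in splits:
            path = os.path.join(root, dataset_name, kind, split)
            # exist_ok=True makes the call safe to repeat.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# A throwaway directory plays the role of the kernel's tmp folder.
demo_root = tempfile.mkdtemp()
created_dirs = make_yolo_dirs(demo_root)
for path in created_dirs:
    print(os.path.relpath(path, demo_root))
```

The kernel does not need this step explicitly, because the copy and label-writing cells create the folders on demand.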


In [None]:
def add_new_path(row):
    # Must match the train/val paths in data.yaml and sit next to the labels directory
    return f"tmp/barrier_reef/images/{row.split}/{row.video_frame}.jpg"

# Add new path
df['new_path'] = df.apply(add_new_path, axis=1)

In [None]:
def copy_file(row):
    os.makedirs(os.path.dirname(row.new_path), exist_ok=True)
    copyfile(row.path, row.new_path)

_ = df.progress_apply(lambda row: copy_file(row), axis=1)

# 🍜 Create .YAML file

`data.yaml` is the dataset configuration file. It defines:

- an optional download command/URL for auto-downloading,
- a path to a directory of training images (or a path to a *.txt file with a list of training images),
- a path to a directory of validation images (or a path to a *.txt file with a list of validation images),
- the number of classes,
- a list of class names.

📍 Important: There is just one class, which is why `nc=1`. <br>
📍 Note: The `data.yaml` is created in the `yolov5/data` directory as required.
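
Notice that `data.yaml` lists only image directories. As far as I understand the YOLOv5 data loader, it finds each label file by swapping the last `images` path component for `labels` and the file extension for `.txt`, which is why the two directory trees must mirror each other. The sketch below imitates that convention; `image_to_label_path` is a hypothetical name, not the function YOLOv5 uses.

```python
import os


def image_to_label_path(image_path):
    """Map .../images/<split>/<name>.jpg to .../labels/<split>/<name>.txt."""
    images_part = os.sep + "images" + os.sep
    labels_part = os.sep + "labels" + os.sep
    # Split on the LAST "/images/" so earlier folders with that name are untouched.
    head, found, tail = image_path.rpartition(images_part)
    if not found:
        raise ValueError(f"no 'images' directory in path: {image_path}")
    stem, _extension = os.path.splitext(tail)
    return head + labels_part + stem + ".txt"


example_image = os.path.join("tmp", "barrier_reef", "images", "train", "16.jpg")
example_label = image_to_label_path(example_image)
print(example_label)
```

If the images and labels do not line up this way, training typically proceeds as if the images had no labels, so it is worth checking one or two paths by hand.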

In [None]:
# Create .yaml file 
import yaml

data_yaml = dict(
    train = '../barrier_reef/images/train',
    val = '../barrier_reef/images/valid',
    nc = 1,
    names = ['cots']
)

# Note that I am creating the file in the yolov5/data/ directory.
with open('tmp/yolov5/data/data.yaml', 'w') as outfile:
    yaml.dump(data_yaml, outfile, default_flow_style=True)
    
%cat tmp/yolov5/data/data.yaml

# 🍮 Prepare Bounding Box Coordinates for YOLOv5

For every image with bounding box(es) a `.txt` file with the same name as the image will be created in the format shown below:

* One row per object.
* Each row is in `class x_center y_center width height` format.
* Box coordinates must be in normalized xywh format (from 0 to 1). Boxes given in pixels can be normalized by dividing `x_center` and `width` by the image width, and `y_center` and `height` by the image height.
* Class numbers are zero-indexed (start from 0).

📍 Note: We don't have to remove the images without bounding boxes from the training or validation sets. But in our case we are not considering images without bboxes.
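
Here is a small worked example of the conversion described above, written as a standalone sketch. The sample box values are illustrative, and the helper names (`to_yolo_xywh`, `from_yolo_xywh`, `to_label_line`) are hypothetical. The inverse function is included only as a sanity check that the round trip recovers the pixel box.

```python
def to_yolo_xywh(box, img_w, img_h):
    """Convert a pixel box {x, y, width, height} (x, y = top-left) to normalized xywh."""
    x_center = box["x"] + box["width"] / 2
    y_center = box["y"] + box["height"] / 2
    return [x_center / img_w, y_center / img_h, box["width"] / img_w, box["height"] / img_h]


def from_yolo_xywh(yolo_box, img_w, img_h):
    """Inverse of to_yolo_xywh: recover the pixel box from normalized xywh."""
    x_center, y_center, w, h = yolo_box
    width, height = w * img_w, h * img_h
    return {
        "x": x_center * img_w - width / 2,
        "y": y_center * img_h - height / 2,
        "width": width,
        "height": height,
    }


def to_label_line(yolo_box, class_id=0):
    """Format one row of a label file: class x_center y_center width height."""
    return " ".join([str(class_id)] + [f"{value:.6f}" for value in yolo_box])


# Illustrative box on a 1280x720 frame.
sample_box = {"x": 559, "y": 213, "width": 50, "height": 32}
yolo_box = to_yolo_xywh(sample_box, 1280, 720)
label_line = to_label_line(yolo_box)
restored_box = from_yolo_xywh(yolo_box, 1280, 720)
print(label_line)
```

For this box the center is at (584, 229) pixels, so the normalized center is 584/1280 = 0.45625 and 229/720 ≈ 0.31806.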

In [None]:
IMG_WIDTH, IMG_HEIGHT = 1280, 720

def get_yolo_format_bbox(img_w, img_h, box):
    """
    Convert a bounding box to YOLO format.
    
    Input:
    img_w - Original/Scaled image width
    img_h - Original/Scaled image height
    box - Bounding box coordinates in the format, "x_min, y_min, width, height"
    
    Output:
    Return YOLO formatted bounding box coordinates, "x_center y_center width height".
    """
    w = box['width'] # width 
    h = box['height'] # height
    # Keep the exact center; rounding the half-width to an integer would shift it by up to half a pixel
    xc = box['x'] + w/2 # xmin + width/2
    yc = box['y'] + h/2 # ymin + height/2

    return [xc/img_w, yc/img_h, w/img_w, h/img_h] # x_center y_center width height
    

# Iterate over each row and write the class label and bbox coordinates to a .txt file.
for idx in tqdm(range(len(df))):
    row = df.loc[idx]
    annotations = ast.literal_eval(row.annotations)
    # Convert every box of this image to YOLO format
    bboxes = [get_yolo_format_bbox(IMG_WIDTH, IMG_HEIGHT, box) for box in annotations]

    # Label files are grouped by split and named after the video frame
    file_name = f"tmp/barrier_reef/labels/{row.split}/{row.video_frame}.txt"
    os.makedirs(os.path.dirname(file_name), exist_ok=True)

    # Write one "class x_center y_center width height" row per object
    with open(file_name, 'w') as f:
        for bbox in bboxes:
            label = 0  # single class: cots
            f.write(' '.join(str(value) for value in [label] + bbox))
            f.write('\n')

# 🚅 Train with W&B

In [None]:
%cd tmp/yolov5/

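`train.py` accepts the following arguments. Note that the command below hard-codes its values (for example `--img 768`) rather than reusing the hyperparameters defined earlier.

```
--img {IMG_SIZE}             # Input image size
--batch {BATCH_SIZE}         # Batch size
--epochs {EPOCHS}            # Number of epochs
--data data.yaml             # Dataset configuration file
--weights yolov5s.pt         # Pretrained weights to start from
--save_period 1              # Save the model every N epochs
--project barrier-reef-yolo  # W&B project name
```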
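
One way to keep the command in sync with the hyperparameters is to assemble it in Python. The sketch below builds the argument list from values like `IMG_SIZE`, `BATCH_SIZE`, and `EPOCHS`. It uses only flags that appear in this kernel; the helper name `build_train_command` is hypothetical, and the command is printed here rather than executed.

```python
import shlex


def build_train_command(img_size, batch_size, epochs,
                        data="data.yaml",
                        weights="yolov5s.pt",
                        project="barrier-reef-yolo"):
    """Return the train.py invocation as a list of arguments."""
    return [
        "python", "train.py",
        "--img", str(img_size),
        "--batch", str(batch_size),
        "--epochs", str(epochs),
        "--data", data,
        "--weights", weights,
        "--project", project,
    ]


# Values taken from the Hyperparameters section of this kernel.
command = build_train_command(img_size=256, batch_size=16, epochs=10)
command_line = shlex.join(command)
print(command_line)
```

From inside the `yolov5` directory, the list could be passed to `subprocess.run(command, check=True)`. In a notebook cell, `!python train.py --img {IMG_SIZE} ...` achieves the same thing through IPython's variable interpolation.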

In [None]:
!python train.py --img 768\
                 --batch 16\
                 --epochs 10\
                 --data data.yaml\
                 --weights yolov5s.pt\
                 --project barrier-reef-yolo

# WORK IN PROGRESS