### TensorFlow - Help Protect the Great Barrier Reef

In [5]:
# install libraries
!pip3 install -qU wandb
!pip3 install -qU bbox-utility 

In [7]:
# import libraries
import os
import sys
import shutil      # high-level file operations: copy, move, archive

import cv2
import glob       # find files by wildcard pattern (an alternative to filtering os.listdir)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm   # progress bar; wrap any iterable as tqdm(iterable)
tqdm.pandas()                    # enables .progress_apply on DataFrames and Series
from joblib import Parallel, delayed    # run function calls in parallel across workers
from IPython.display import display

#### Key Points

* Predictions must be submitted through the provided Python time-series API, which sets this competition apart from previous object detection competitions.
* Each prediction row must include all bounding boxes for the image. The submission format also appears to be COCO, i.e. [x_min, y_min, width, height].
* The competition metric, F2, tolerates some false positives (FP) in order to ensure that very few starfish are missed, so reducing false negatives (FN) matters more than reducing false positives (FP).

F2 = 5 * p * r / (4 * p + r), where p is precision and r is recall
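
To make the recall bias of F2 concrete, here is a minimal sketch that computes the general F-beta score from raw detection counts. The helper name `f_beta` and the counts are illustrative assumptions, not part of the competition API.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from raw detection counts; beta > 1 weights recall above precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # For beta = 2 this reduces to 5 * p * r / (4 * p + r)
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Same number of errors (5) in both cases, but of different kinds (made-up counts):
few_misses = f_beta(tp=10, fp=5, fn=0)   # 5 false positives, no missed starfish
few_alarms = f_beta(tp=10, fp=0, fn=5)   # 5 missed starfish, no false positives
print(f"5 extra FP: {few_misses:.3f} | 5 extra FN: {few_alarms:.3f}")
```

With these counts, five false negatives lower the score more than five false positives do, which is why a lower confidence threshold can pay off under this metric.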

#### Track our experiments using Weights & Biases

* Track, compare, and visualize ML experiments
* Get live metrics, terminal logs, and system stats streamed to the centralized dashboard.
* Explain how your model works, show graphs of how model versions improved, discuss bugs, and demonstrate progress towards milestones.

In [None]:
import wandb
wandb.login()
wandb.init(project='great-barrier-reef-yolov5')

#### MetaData

* train_images/ - Folder containing training set photos of the form video_{video_id}/{video_frame}.jpg.

* [train/test].csv - Metadata for the images. As with other test files, most of the test metadata is only available to your notebook upon submission. Only the first few rows are available for download.

* video_id - ID number of the video the image was part of. The video ids are not meaningfully ordered.

* video_frame - The frame number of the image within the video. Expect to see occasional gaps in the frame number from when the diver surfaced.
* sequence - ID of a gap-free subset of a given video. The sequence ids are not meaningfully ordered.
* sequence_frame - The frame number within a given sequence.
* image_id - ID code for the image, in the format {video_id}-{video_frame}
* annotations - The bounding boxes of any starfish detections in a string format that can be evaluated directly with Python. Does not use the same format as the predictions you will submit. Not available in test.csv. A bounding box is described by the pixel coordinate (x_min, y_min) of its upper left corner within the image (image coordinates start at the top left), together with its width and height in pixels, i.e. COCO format.
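
YOLO-style detectors usually expect labels as normalized center coordinates rather than COCO pixel boxes, so the annotations need converting. The sketch below parses one annotation string and performs that conversion. The helper `coco_to_yolo`, the dict keys (`x`, `y`, `width`, `height`), the 1280x720 frame size, the single class index 0, and the sample values are all assumptions for illustration.

```python
import ast

# Assumed frame size, for illustration only; read the real size from each image in practice.
IMG_W, IMG_H = 1280, 720

def coco_to_yolo(box, img_w, img_h):
    """Convert a pixel box {'x', 'y', 'width', 'height'} to normalized
    (x_center, y_center, width, height), each in the range [0, 1]."""
    x_center = (box['x'] + box['width'] / 2) / img_w
    y_center = (box['y'] + box['height'] / 2) / img_h
    return x_center, y_center, box['width'] / img_w, box['height'] / img_h

# Made-up annotation string; the dict keys are an assumption about the CSV format.
raw = "[{'x': 320, 'y': 180, 'width': 64, 'height': 36}]"
boxes = ast.literal_eval(raw)  # parses Python literals only, never executes code

# One line per box: assumed class index 0, then the four normalized values.
label_lines = [
    "0 " + " ".join(f"{v:.6f}" for v in coco_to_yolo(b, IMG_W, IMG_H))
    for b in boxes
]
print("\n".join(label_lines))
```

An image without annotations yields an empty list and therefore no label lines.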

In [34]:
# Parameters
FOLD = 1
DIM = 3000
MODEL = 'yolov5s6'
BATCH = 4
EPOCHS = 7
OPTIMIZER = 'Adam'
PROJECT = 'great-barrier-reef'

import time
TIME = time.time_ns()
NAME = f'{MODEL}-dim{DIM}-fold{FOLD}-{TIME}'              # the timestamp keeps each run name unique, in case we forget to change it

ROOT_DIR = '/home/dave117/MLOps/projects/Kaggle_Competitions/Protect-Great-Barrier-Reef/Data/raw'
IMAGE_DIR = '/home/dave117/MLOps/projects/Kaggle_Competitions/Protect-Great-Barrier-Reef/Data/interim/images'
LABLE_DIR = '/home/dave117/MLOps/projects/Kaggle_Competitions/Protect-Great-Barrier-Reef/Data/interim/lables'

#### Create directories and paths

In [35]:
df = pd.read_csv(f'{ROOT_DIR}/train.csv')
df['old_image_path'] = f'{ROOT_DIR}/train_images/video_'+df.video_id.astype(str)+'/'+df.video_frame.astype(str)+'.jpg'
df['image_path'] = f'{IMAGE_DIR}/'+ df.image_id + '.jpg'
df['label_path'] = f'{LABLE_DIR}/'+df.image_id+'.txt'
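import ast  # literal_eval parses the annotation strings without executing arbitrary code
df['annotations'] = df['annotations'].progress_apply(ast.literal_eval)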
display(df.head(5))

| | video_id | sequence | video_frame | sequence_frame | image_id | annotations | old_image_path | image_path | label_path |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 40258 | 0 | 0 | 0-0 | [] | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... |
| 1 | 0 | 40258 | 1 | 1 | 0-1 | [] | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... |
| 2 | 0 | 40258 | 2 | 2 | 0-2 | [] | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... |
| 3 | 0 | 40258 | 3 | 3 | 0-3 | [] | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... |
| 4 | 0 | 40258 | 4 | 4 | 0-4 | [] | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... | /home/dave117/MLOps/projects/Kaggle_Competitio... |
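
The cell above only builds path strings; the target folders themselves must exist before any image or label file is written to them. A minimal sketch of creating them follows, using a temporary directory in place of the real `IMAGE_DIR` and `LABLE_DIR` values (the `base`, `image_dir`, and `label_dir` names are placeholders for illustration).

```python
import os
import tempfile

# Temporary stand-in for the project data folder (assumption for this example).
base = tempfile.mkdtemp()
image_dir = os.path.join(base, 'interim', 'images')
label_dir = os.path.join(base, 'interim', 'lables')

for directory in (image_dir, label_dir):
    # exist_ok=True makes the call safe to repeat when the cell is re-run
    os.makedirs(directory, exist_ok=True)

print(os.path.isdir(image_dir), os.path.isdir(label_dir))
```

In the notebook, the same loop could be run over `IMAGE_DIR` and `LABLE_DIR` directly.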


#### Number of BBoxes

* Shows the percentage of images that contain at least one bounding box around a starfish, compared with those that contain none.

In [36]:
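df['num_bbox'] = df['annotations'].progress_apply(len)
data = (df.num_bbox > 0).value_counts(normalize=True) * 100
# Index by label (False/True) rather than position: value_counts sorts by frequency,
# so positional access would swap the two numbers if annotated images were the majority
print(f"No BBox: {data.loc[False]:0.2f}% | with BBox: {data.loc[True]:0.2f}%")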

No BBox: 79.07% | with BBox: 20.93%


#### Data Cleaning

In [40]:
REMOVE_NOBBOX = True  # set to False to keep images without annotations
if REMOVE_NOBBOX:
    df = df.query('num_bbox > 0')

#### Write images to new directory

In [None]:
cl