# Cropped Crown of Thorns Dataset Builder

This notebook builds a dataset of cropped COTS (crown-of-thorns starfish) images, i.e. the contents of each bounding box, which may be useful for training or data augmentation. It also shows how to work with the data at a basic level, for example displaying images and drawing bounding boxes. The code is written for readability with beginners in mind; efficiency is not a serious consideration.

**Plan**
 1. Extract all COTS images from the bounding box regions and save them to new image files.
 2. Create a new dataset from these COTS images, which could be used for augmentation or other training.
 
## 😅 If you use the cropped pics upvote the notebook and/or dataset. Lot of people just copying code/forking on Kaggle these days. 👀

* In the next notebook I'll build an augmented dataset for easy use.
* Check back here for the link:

## FINAL DATASET: [COTS v NotCOTS Cropped Crown of Thorns Dataset](https://www.kaggle.com/alexteboul/binary-cropped-crown-of-thorns-dataset)

![COTS BANNER](https://storage.googleapis.com/kaggle-datasets-images/1912529/3140514/88cd069275dfbb414e8b94783a781450/data-original.png?t=2022-02-05-01-08-03)

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
import time
import ast

In [None]:
#create directories to save cropped images to. Cropped images are just what is inside the bounding boxes.
#exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('cots_crops', exist_ok=True)
os.makedirs('notcots_crops', exist_ok=True)
os.listdir()

In [None]:
#make it so the whole df.head() column width is shown.
pd.set_option('display.max_colwidth', None)

# 1. Extract COTS images from bounding boxes

## 1.1 Get the data

In [None]:
#get data
path = '/kaggle/input/tensorflow-great-barrier-reef/'
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path+ 'test.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.dtypes

## 1.2 add a column for the img_path

In [None]:
#the file path is a combination of the video_id and video_frame columns
train['img_path'] = '/kaggle/input/tensorflow-great-barrier-reef/train_images/video_'+train['video_id'].astype(str)+'/'+train['video_frame'].astype(str)+'.jpg'
train.head()

In [None]:
#what do annotations look like
train['annotations'].iloc[35]

## 1.3 separate out the images that have annotations from those that do not

In [None]:
#grab just the rows that have annotations (aka they have cots in them)
train_onlycots = train[train['annotations'] != '[]']
train_nocots = train[train['annotations'] == '[]']

In [None]:
#all the rows with annotations
train_onlycots.head()

In [None]:
#all the rows without annotations
train_nocots.head()

In [None]:
#compare the row counts of the full, annotated, and unannotated sets
print(f'original:{train.shape}\ntrain_onlycots:{train_onlycots.shape}\ntrain_nocots:{train_nocots.shape}')
print(f'percentage of images with annotations/cots in them: {round(train_onlycots.shape[0]/train.shape[0]*100)}%')

## 1.4 try viewing some images
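OpenCV loads images in BGR channel order while matplotlib expects RGB, which is why the cells in this section reverse the last axis. A minimal sketch on a made-up 1x2 pixel array (no real image file needed; `bgr_pixels` is a hypothetical name) shows what `[:, :, ::-1]` does:

```python
import numpy as np

# Hypothetical 1x2 "image": one pure-blue and one pure-red pixel in BGR order
bgr_pixels = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels, giving RGB order
rgb_pixels = bgr_pixels[:, :, ::-1]

print(rgb_pixels[0, 0])  # the blue pixel, now written as (R, G, B)
print(rgb_pixels[0, 1])  # the red pixel, now written as (R, G, B)
```

The slice returns a view rather than a copy, which is consistent with the timing cells showing no noticeable slowdown.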

In [None]:
#example pic without COTS
ex_nocots = train_nocots['img_path'].iloc[0]
print(ex_nocots)

In [None]:
#read the image in as a numpy array with cv2
start_time = time.time()
ex_nocots_img = cv2.imread(ex_nocots)
print("--- %s seconds ---" % (time.time() - start_time))
print(type(ex_nocots_img))

In [None]:
#notice how the color looks off - this is because cv2 loads the color channels as BGR, while matplotlib expects RGB
plt.figure(figsize=(18, 18))
plt.imshow(ex_nocots_img)

In [None]:
#adding [:,:,::-1] will flip it around to display the RGB colors - and note that it doesn't slow you down to do so
start_time = time.time()
ex_nocots_img = cv2.imread(ex_nocots)[:,:,::-1]
print("--- %s seconds ---" % (time.time() - start_time))
print(type(ex_nocots_img))

In [None]:
plt.figure(figsize=(18, 18))
plt.imshow(ex_nocots_img)

In [None]:
#example pic with COTS
ex_yescots = train_onlycots['img_path'].iloc[28]
print(ex_yescots)
print(type(ex_yescots))

In [None]:
ex_yescots_img = cv2.imread(ex_yescots)[:,:,::-1]
plt.figure(figsize=(18, 18))
plt.imshow(ex_yescots_img)

## 1.5 let's add a bounding box to the annotated image because I can't see the COT starfish 😅
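Each entry in the `annotations` column is a string that looks like a Python list of dictionaries with `x`, `y`, `width`, and `height` keys. Here is a rough sketch of turning one such string into the two corner points that `cv2.rectangle` needs. The sample string and the helper name `box_corners` are made up for illustration:

```python
import ast

def box_corners(annotation):
    """Return a list of ((x1, y1), (x2, y2)) corner pairs for a stringified annotation."""
    boxes = ast.literal_eval(annotation)  # parse the string into a list of dicts
    corners = []
    for box in boxes:
        top_left = (box['x'], box['y'])
        bottom_right = (box['x'] + box['width'], box['y'] + box['height'])
        corners.append((top_left, bottom_right))
    return corners

# Made-up annotation string with two boxes
sample = "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}, {'x': 100, 'y': 50, 'width': 5, 'height': 5}]"
print(box_corners(sample))
```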

In [None]:
#check out the annotation - this is the bounding box area
#note that there is only 1 bounding box in this row
#also note that it looks like a list but is actually a string
ex_annotation = train_onlycots['annotations'].iloc[0]
print(ex_annotation, type(ex_annotation))
#so we have to turn it into a real list to work with it

In [None]:
#ast.literal_eval only evaluates Python literals, so it is much safer than eval(); still, this is Kaggle and the goal is readability
#turns the stringified list into a normal one
print(ast.literal_eval(ex_annotation))
print(type(ast.literal_eval(ex_annotation)))

In [None]:
def bbox_drawer(img_path, annotation):
    '''Accepts an image path as a string and an annotation as a stringified list of dictionaries.
    Displays the image in RGB with the bounding boxes drawn on it.'''
    #box parameters
    #window_name = 'COTS'
    color = (0, 0, 255) #as (B,G,R) this means this color is Red
    thickness = 2
    
    #read the image from the path (cv2 loads it as BGR)
    img = cv2.imread(img_path)#[:,:,::-1]
    
    #fix stringified list
    annotation_fixed = ast.literal_eval(annotation)

    #loop through the list of annotations and draw the box on the image for each
    #start_point is the top left coordinate as a tuple
    #end_point is the bottom right coordinate as a tuple
    for ann in annotation_fixed:
        start_point,end_point = (ann['x'], ann['y']) , (ann['x'] + ann['width'], ann['y'] + ann['height'])
        #print(start_point,end_point)
        img = cv2.rectangle(img, start_point, end_point, color, thickness)
    img = img[:,:,::-1]
    plt.figure(figsize=(18, 18))
    plt.imshow(img)
    return

In [None]:
image_index_selector = 28 #change this number to see different images
bbox_drawer(train_onlycots['img_path'].iloc[image_index_selector],train_onlycots['annotations'].iloc[image_index_selector])

In [None]:
image_index_selector = 888 #change this number to see different images
bbox_drawer(train_onlycots['img_path'].iloc[image_index_selector],train_onlycots['annotations'].iloc[image_index_selector])

* So we are now able to visualize the images and bounding boxes
* It also works for any number of COTS in the images
* Next I'm going to save the bounding box regions as their own .jpg files. 
* This way we can end up with a folder that just has pictures of COTS.
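Cropping a bounding box comes down to NumPy slicing, with rows (y) first and columns (x) second. Below is a small sketch on a synthetic array standing in for a frame, assuming the box lies fully inside the image. `crop_box` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def crop_box(img, x, y, w, h):
    """Return the pixels inside the box; note that the row (y) index comes first."""
    return img[y:y + h, x:x + w]

# Synthetic 720x1280 3-channel "frame" (the size is an arbitrary choice for this sketch)
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
frame[100:150, 200:260] = 255  # paint a white patch 60 wide and 50 tall where the fake box sits

crop = crop_box(frame, x=200, y=100, w=60, h=50)
print(crop.shape)  # (height, width, channels)
```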

## 1.6 lets extract just the COT starfish from the annotated image and save it

In [None]:
def img_bb_cropper(img_path, annotation):
    '''Accepts an image path as a string and an annotation as a stringified list of dictionaries.
    Saves each bounding box crop to the cots_crops/ folder and displays the last crop.'''
    #turn img_path into just 'video_id-video_frame'
    #example '/kaggle/input/tensorflow-great-barrier-reef/train_images/video_0/16.jpg' --> 'video_0-16'
    img_name = img_path[57:-4].replace('/','-')
    
    #read the image from the path (cv2 loads it as BGR)
    img = cv2.imread(img_path)  #[:,:,::-1]
    
    #fix stringified list annotation
    annotation_fixed = ast.literal_eval(annotation)

    #loop through the list of annotations and crop the box out of the image for each
    #in each loop grab the x, y, width, and height
    #use numpy array slicing to grab the bounding box area and save it as its own file
    #cv2.imwrite expects BGR, so only flip to RGB afterwards for display
    #ann_counter is just a counter that gets added to the image name because each pic can have multiple annotations
    ann_counter = 0
    for ann in annotation_fixed:
        x,y,w,h = ann['x'], ann['y'], ann['width'], ann['height']
        cropped_img = img[y:y+h,x:x+w]
        cv2.imwrite(f'cots_crops/cotscrop-{img_name}-{ann_counter}.jpg',cropped_img)
        cropped_img = cropped_img[:,:,::-1]
        ann_counter+=1
    plt.figure(figsize=(6, 6))
    plt.imshow(cropped_img)
    return #cropped_img

In [None]:
image_index_selector = 888 #change this number to see different images
img_bb_cropper(train_onlycots['img_path'].iloc[image_index_selector],train_onlycots['annotations'].iloc[image_index_selector])

* The output image is pretty low res. That makes sense: the box covers a tiny portion of the full frame, and the crop is just those pixels.
* But it works at least!
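To put a number on how small these crops are, a quick sketch can compare the box area with the frame area. The 1280x720 frame size and the sample box are assumptions for illustration; check `img.shape` on a real frame before relying on them:

```python
import ast

FRAME_W, FRAME_H = 1280, 720  # assumed frame size, verify with img.shape

def box_area_fractions(annotation, frame_w=FRAME_W, frame_h=FRAME_H):
    """Return the fraction of the frame covered by each box in a stringified annotation."""
    boxes = ast.literal_eval(annotation)
    frame_area = frame_w * frame_h
    return [box['width'] * box['height'] / frame_area for box in boxes]

# Made-up box, 64 pixels wide and 48 pixels tall
sample = "[{'x': 500, 'y': 300, 'width': 64, 'height': 48}]"
for fraction in box_area_fractions(sample):
    print(f'{fraction:.2%} of the frame')
```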

In [None]:
#check that a cropped pic saved correctly to the crops folder
diditwork = cv2.imread('cots_crops/cotscrop-video_0-4703-0.jpg')[:,:,::-1]
plt.imshow(diditwork)

# 2. Create new dataset of all these COTS only images
For now let's just make a dataset of only COTS images from the bounding boxes. The next notebook will build an augmented dataset in a format better suited to training an actual object detection model.

We currently have a function called 'img_bb_cropper' that accepts an 'img_path' and an 'annotation' and saves those crops to the 'cots_crops/' folder. Now let's apply that to the whole dataframe of COTS-only images, 'train_onlycots'.
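The cropping functions build the output name with a fixed slice, `img_path[57:-4]`, which only works while the input directory prefix is exactly 57 characters long. As a possible alternative, here is a sketch using `pathlib` that takes the last folder name and the file stem instead. `crop_name` is a hypothetical helper, not something the notebook defines:

```python
from pathlib import Path

def crop_name(img_path):
    """Turn '.../video_0/16.jpg' into 'video_0-16' regardless of the prefix length."""
    p = Path(img_path)
    return f'{p.parent.name}-{p.stem}'

path = '/kaggle/input/tensorflow-great-barrier-reef/train_images/video_0/16.jpg'
print(crop_name(path))
# Compare against the fixed-slice approach used in the notebook
print(crop_name(path) == path[57:-4].replace('/', '-'))
```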

In [None]:
#let's modify the function a bit so it doesn't show the pics each time. That will speed it up a bit. Also no need to flip the BGR to RGB in the loop anymore.
def img_bb_cropper_saver(img_path, annotation):
    '''Accepts an image path as a string and an annotation as a stringified list of dictionaries.
    Saves each bounding box crop to the cots_crops/ folder.'''
    #get image name from the path
    img_name = img_path[57:-4].replace('/','-')
    
    #read the image from the path (cv2 loads it as BGR, which is what cv2.imwrite expects)
    img = cv2.imread(img_path)  #[:,:,::-1]
    
    #fix stringified list annotation
    annotation_fixed = ast.literal_eval(annotation)

    #save the cots image from each annotated bounding box to the crops folder
    ann_counter = 0
    for ann in annotation_fixed:
        x,y,w,h = ann['x'], ann['y'], ann['width'], ann['height']
        cropped_img = img[y:y+h,x:x+w]
        cv2.imwrite(f'cots_crops/cotscrop-{img_name}-{ann_counter}.jpg',cropped_img)
        ann_counter+=1
    return 

In [None]:
#this is ugly but fast enough for this use case. Roughly 120-160 seconds to save all the COTS images from the bounding boxes into the cots_crops folder
start_time = time.time()
run_it = train_onlycots.apply(lambda row: img_bb_cropper_saver(row['img_path'], row['annotations']), axis=1)
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
#check the first five file names and the total number of saved crops
print(os.listdir('cots_crops')[:5])
print(len(os.listdir('cots_crops')))

* Awesome, so we currently have 11,898 images of COTS in the cots_crops folder
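One way to sanity-check that number is to count the boxes in the annotation strings and compare the total with the number of saved files. Here is a sketch on a made-up list of annotation strings; on the real data the input would be `train_onlycots['annotations']`:

```python
import ast

def count_boxes(annotations):
    """Total number of bounding boxes across an iterable of stringified annotations."""
    return sum(len(ast.literal_eval(a)) for a in annotations)

# Made-up annotation strings: 1 box, 2 boxes, 0 boxes
samples = [
    "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]",
    "[{'x': 5, 'y': 6, 'width': 7, 'height': 8}, {'x': 9, 'y': 10, 'width': 11, 'height': 12}]",
    "[]",
]
print(count_boxes(samples))
```

If the total matches `len(os.listdir('cots_crops'))`, every box most likely produced one file.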

# 3. Let's also grab some not cots images in case anyone wants to build a binary cots-not cots classifier
* To do this we can grab the same bounding box regions from images that don't actually contain COTS
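The idea is to borrow box coordinates from annotated frames and apply them to frames that contain no COTS. Below is a plain-Python sketch of that pairing, with made-up paths and annotation strings standing in for the two dataframes. It assumes all frames share the same dimensions, so a borrowed box still lands inside the image:

```python
# Made-up stand-ins for train_nocots['img_path'] and train_onlycots['annotations']
nocots_paths = ['video_0/0.jpg', 'video_0/1.jpg', 'video_0/2.jpg']
cots_annotations = [
    "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}]",
    "[{'x': 50, 'y': 60, 'width': 70, 'height': 80}]",
]

# zip stops at the shorter input, so each borrowed annotation gets exactly one COTS-free frame
pairs = list(zip(nocots_paths, cots_annotations))
for img_path, annotation in pairs:
    print(img_path, annotation)
```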

In [None]:
print(train_onlycots.shape)

In [None]:
#take as many no-COTS rows as there are annotated rows (4919 in this notebook)
train_nocots_2 = train_nocots[:len(train_onlycots)]
train_nocots_2.head()

In [None]:
#we want equal number of rows here
train_onlycots_2 = train_onlycots
print(train_nocots_2.shape)
print(train_onlycots_2.shape)

In [None]:
#dumb but reset index values
train_onlycots_2 = train_onlycots_2.reset_index(drop=True)
train_nocots_2 = train_nocots_2.reset_index(drop=True)


In [None]:
#now let's copy the annotations from train_onlycots_2 into a new column of train_nocots_2
train_nocots_2['annotations2'] = train_onlycots_2['annotations'].values
train_nocots_2.head(20)

In [None]:
train_nocots_2.tail()

In [None]:
#let's modify the function so it saves to notcots_crops. The crop uses the same box coordinates as the borrowed annotation, which works because these images contain no COTS.
def img_notcots_cropper_saver(img_path, annotation):
    '''Accepts an image path as a string and an annotation as a stringified list of dictionaries.
    Saves each bounding box crop to the notcots_crops/ folder.'''
    #get image name from the path
    img_name = img_path[57:-4].replace('/','-')
    
    #read the image from the path (cv2 loads it as BGR, which is what cv2.imwrite expects)
    img = cv2.imread(img_path)  #[:,:,::-1]
    
    #fix stringified list annotation
    annotation_fixed = ast.literal_eval(annotation)

    #save the not-cots crop from each borrowed bounding box to the notcots_crops folder
    ann_counter = 0
    for ann in annotation_fixed:
        x,y,w,h = ann['x'], ann['y'], ann['width'], ann['height']
        cropped_img = img[y:y+h,x:x+w]
        cv2.imwrite(f'notcots_crops/notcotscrop-{img_name}-{ann_counter}.jpg',cropped_img)
        ann_counter+=1
    return 

In [None]:
#this is ugly but fast enough for this use case. Roughly 120-160 seconds to save all the not-COTS crops from the borrowed bounding boxes into the notcots_crops folder
start_time = time.time()
run_it_again = train_nocots_2.apply(lambda row: img_notcots_cropper_saver(row['img_path'], row['annotations2']), axis=1)
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
#check the first five file names and the total number of saved crops
print(os.listdir('notcots_crops')[:5])
print(len(os.listdir('notcots_crops')))

In [None]:
#check that a cropped pic saved correctly to the notcots_crops folder
diditwork2 = cv2.imread('notcots_crops/notcotscrop-video_0-11553-5.jpg')[:,:,::-1]
plt.imshow(diditwork2)

* Cool, so now we have an equal number of COTS and not-COTS images saved, 11,898 of each. We did this by taking the bounding boxes from the annotations and applying them to images without any bounding boxes (images that didn't have COTS).
* There are a couple of ways to use this: for data augmentation, or directly if you want to stack a binary COTS/not-COTS classifier onto an object detector.

### Save both sets to zip files
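The cells in this section use the `zip` command line tool. Where that tool is not available, Python's standard library can do the same job. Here is a sketch with `shutil.make_archive`, run against a temporary folder so it does not touch the real crops. The folder and file names are made up:

```python
import os
import shutil
import tempfile
import zipfile

# Build a throwaway folder with one dummy file to stand in for cots_crops
workdir = tempfile.mkdtemp()
demo_dir = os.path.join(workdir, 'demo_crops')
os.makedirs(demo_dir)
with open(os.path.join(demo_dir, 'example.txt'), 'w') as f:
    f.write('placeholder')

# Creates <workdir>/demo.zip containing the demo_crops folder
archive_path = shutil.make_archive(
    os.path.join(workdir, 'demo'), 'zip', root_dir=workdir, base_dir='demo_crops'
)

# List the archive contents to confirm the file made it in
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
print(names)
```

For the real folders, the equivalent call would be along the lines of `shutil.make_archive('cots', 'zip', root_dir='.', base_dir='cots_crops')`.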

In [None]:
!ls

In [None]:
%%capture
!zip -r cots.zip cots_crops

In [None]:
%%capture
!zip -r notcots.zip notcots_crops

In [None]:
!ls