# Reef-Net

Let's start the project by downloading the data set from Kaggle.

To download the data-set from Kaggle, first go to [Kaggle](https://www.kaggle.com/competitions/tensorflow-great-barrier-reef) and accept the terms and conditions of the competition.

### Install Kaggle if not already installed
!pip install Kaggle

### Add json file to Kaggle folder
C:/Users/.../.Kaggle/Kaggle.json

This can be done using the Kaggle account. Go to [Kaggle Account](https://www.kaggle.com/harrykt/account).

Then create a new API token. This downloads the Kaggle.json file, which holds your username and API key so that Kaggle knows who is downloading the data-set. Treat it like a password and keep it private.
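
If you would rather not move the file by hand, the step above can be sketched in Python. This is only a sketch: `install_kaggle_token` and its `home` parameter are made-up names for illustration, and it assumes the Kaggle client reads the token from a `.kaggle` folder inside your home directory.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(downloaded_json, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle/ and return the new path."""
    # `home` is a hypothetical override, mainly useful for trying this out safely.
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(downloaded_json, target)
    # Restrict the file to the current user; this has little effect on Windows.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Call it with the path your browser saved the file to, for example `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")`, assuming that is where the download ended up.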

### Download the Data-set
This can be done using the following command.
!kaggle competitions download -c tensorflow-great-barrier-reef

---

## Unzipping the data-set

After downloading the data set you can unzip it using the following script, rather than doing it manually.

Why?
Just to look cool. Haha

    import zipfile
    with zipfile.ZipFile("tensorflow-great-barrier-reef.zip","r") as zip_ref:
        zip_ref.extractall("tensorflow-great-barrier-reef")

---

## Data-set Structure

In the data set we have 3 videos

    1. Video 0
    2. Video 1
    3. Video 2
    
These videos are split into images in their respective folders. For example:

    1. tensorflow-great-barrier-reef\train_images\video_0
    2. tensorflow-great-barrier-reef\train_images\video_1
    3. tensorflow-great-barrier-reef\train_images\video_2
    
We are mainly concerned with the meta-data, which is stored in

    tensorflow-great-barrier-reef/train.csv
    
Once we open this CSV file there are 6 columns we are mainly interested in:

    1. Video ID (Which video the frame comes from).
    2. Sequence (A random ID for a subset of a video. Its purpose is not clear to us yet).
    3. Video Frame (The frame number of the image within its video).
    4. Sequence Frame (Similar to video frame, but numbered within the sequence).
    5. Image ID (Similar to the frame, but sometimes the image names are different. Be careful while loading data).
    6. Annotation (The bounding boxes where our starfish are present in the image).
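
To make the Annotation column more concrete, here is a small sketch of what one non-empty value looks like and how it can be read. The numbers in `sample` are made up for illustration. As far as the competition description goes, `x` and `y` are the top-left corner of the box in pixels, but treat that as an assumption and check it against a few images.

```python
import ast

# Illustrative value only; real rows come from the "annotations" column of train.csv.
sample = "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]"

# The string is a Python literal, so literal_eval turns it into a list of dicts.
boxes = ast.literal_eval(sample)
for box in boxes:
    print(box["x"], box["y"], box["width"], box["height"])

# Frames without a starfish carry an empty list.
empty = ast.literal_eval("[]")
```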

---

In [15]:
import os

# The data-set has three videos, each split into multiple images
path = os.path.abspath('../../tensorflow-great-barrier-reef/train.csv')
image_path = os.path.abspath('tensorflow-great-barrier-reef/train_images/')

## Load CSV File

The meta-data we have is in a CSV file, so we now have to load it into Python to actually use it.

Reference: [Tensorflow CSV](https://www.tensorflow.org/tutorials/load_data/csv)

Install tensorflow using:
    
    !pip install tensorflow

In [16]:
import pandas as pd
import numpy as np

# Make numpy values easier to read.
# np.set_printoptions(precision=3, suppress=True)

import tensorflow as tf
from tensorflow.keras import layers

In [17]:
# The first row of our data-set already contains the names of the columns
# Thus we don't have to pass names to the pd.read_csv function
training_data = pd.read_csv(path)

In [18]:
training_data.head()

Unnamed: 0,video_id,sequence,video_frame,sequence_frame,image_id,annotations
0,0,40258,0,0,0-0,[]
1,0,40258,1,1,0-1,[]
2,0,40258,2,2,0-2,[]
3,0,40258,3,3,0-3,[]
4,0,40258,4,4,0-4,[]


In [19]:
training_data["annotations"]

0        []
1        []
2        []
3        []
4        []
         ..
23496    []
23497    []
23498    []
23499    []
23500    []
Name: annotations, Length: 23501, dtype: object
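
The head and tail above only show empty lists, so it is easy to think nothing is annotated. A quick way to check is to count the rows whose string is not `"[]"`. The sketch below uses a tiny made-up frame called `demo` as a stand-in for `training_data`, so the values are illustrative only.

```python
import pandas as pd

# Tiny stand-in for training_data; values are made up for illustration.
demo = pd.DataFrame({
    "image_id": ["0-0", "0-1", "0-2"],
    "annotations": ["[]", "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}]", "[]"],
})

# The column holds strings, so an un-annotated frame is literally "[]".
has_boxes = demo["annotations"] != "[]"
print(has_boxes.sum(), "of", len(demo), "frames are annotated")
```

The same comparison on `training_data["annotations"]` should tell you how many frames actually contain a target.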

In [20]:
video_id = training_data["video_id"]
sequence = training_data["sequence"]
video_frame = training_data["video_frame"]
sequence_frame = training_data["sequence_frame"]
image_id = training_data["image_id"]
annotations = training_data["annotations"]

In [21]:
tf_video_id = tf.data.Dataset.from_tensor_slices(video_id)
tf_sequence = tf.data.Dataset.from_tensor_slices(sequence)
tf_video_frame = tf.data.Dataset.from_tensor_slices(video_frame)
tf_sequence_frame = tf.data.Dataset.from_tensor_slices(sequence_frame)
tf_image_id = tf.data.Dataset.from_tensor_slices(image_id)
tf_annotations = tf.data.Dataset.from_tensor_slices(annotations)

In [22]:
data_set = tf.data.Dataset.zip(
    (
        tf_video_id,
        tf_sequence,
        tf_video_frame,
        tf_sequence_frame,
        tf_image_id,
        tf_annotations,
    )
)

In [23]:
data_set

<ZipDataset element_spec=(TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None))>

---

# Creating our own data-set folder which just has target images in it

In [193]:
import json
import cv2
import shutil

from matplotlib import pyplot as plt
from pascal_voc_writer import Writer

In [194]:
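# Let's create the output folders (os.path.join keeps the paths portable across operating systems)
data_img_dir = os.path.abspath(os.path.join('..', '..', 'data', 'images'))
data_ann_dir = os.path.abspath(os.path.join('..', '..', 'data', 'annotations'))
data_set_dir = os.path.abspath(os.path.join('..', '..', 'data', 'image_set', 'main'))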

os.makedirs(data_img_dir, exist_ok=True)
os.makedirs(data_ann_dir, exist_ok=True)
os.makedirs(data_set_dir, exist_ok=True)

In [195]:
# Path to CSV File and Training images
base_path = ('../../tensorflow-great-barrier-reef/')
csv_path = os.path.abspath(os.path.join(base_path, "train.csv"))
image_base_path = os.path.abspath(os.path.join(base_path, "train_images"))

In [196]:
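def get_bboxes(annotations):
    """Parse one annotations string from train.csv into [x, y, width, height] lists."""
    result = []
    # The CSV stores Python-style single quotes; JSON needs double quotes
    annotations = annotations.replace("'", '"')
    annotations = json.loads(annotations)
    for bbox in annotations:
        box = [bbox["x"], bbox["y"], bbox["width"], bbox["height"]]
        box = [float(x) for x in box]
        result.append(box)
    return result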
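
The boxes returned by `get_bboxes` are `[x, y, width, height]`, while Pascal VOC XML files store the two corners of the box. A minimal sketch of the conversion, assuming `x` and `y` are the top-left corner (`xywh_to_corners` is a made-up helper name, not part of any library):

```python
def xywh_to_corners(box):
    """Convert [x, y, width, height] to (xmin, ymin, xmax, ymax) as integers."""
    x, y, w, h = box
    # The far corner is the near corner shifted by the box size.
    return int(x), int(y), int(x + w), int(y + h)


print(xywh_to_corners([559.0, 213.0, 50.0, 32.0]))  # (559, 213, 609, 245)
```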

In [197]:
def visualize_bounding_boxes(img, annotations, category):
    # Each annotation is read as (x_center, y_center, width, height).
    # Nothing is drawn when category is empty.
    if category.size == 0:
        return img
    for annotation in annotations:
        x, y, w, h = np.asarray(annotation, dtype=np.float32)
        x1, x2 = int(x - w / 2), int(x + w / 2)
        y1, y2 = int(y - h / 2), int(y + h / 2)
        img = cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 0), 3)
    return img

In [215]:
def generate_dataset(image_paths, boxes):
    np.random.seed(56)
    len_i = len(str(len(image_paths)))

    # trainval.txt lists every sample; train.txt and val.txt hold a random split of them
    train = os.path.join(data_set_dir, "train.txt")
    val = os.path.join(data_set_dir, "val.txt")
    trainval = os.path.join(data_set_dir, "trainval.txt")

    with open(train, "w") as train_f, open(val, "w") as val_f, open(trainval, "w") as trainval_f:
        for i, (image_path, annotations) in enumerate(zip(image_paths, boxes)):
            img = cv2.imread(image_path)
            if img is None:
                # cv2.imread returns None when the file is missing or unreadable
                print("Skipping unreadable image:", image_path)
                continue

            name = str(i).zfill(len_i)
            new_img_path = os.path.join(data_img_dir, name + ".jpg")
            new_ann_path = os.path.join(data_ann_dir, name + ".xml")

            # Saving the Image
            cv2.imwrite(new_img_path, img)

            # Saving the XML File (Writer takes the width first, then the height)
            img_h, img_w, img_c = img.shape
            writer = Writer(new_img_path, img_w, img_h)
            c_name = "star_fish"
            for x, y, w, h in annotations:
                # get_bboxes returns (x, y, width, height); Pascal VOC needs the two corners
                writer.addObject(c_name, int(x), int(y), int(x + w), int(y + h))
            writer.save(new_ann_path)

            # Write the name in the file
            trainval_f.write(name + "\n")

            # choice is 1 with probability 0.7, so about 70% of the samples go to train
            choice = np.random.choice(2, 1, p=[0.3, 0.7])[0]
            if choice == 1:
                train_f.write(name + "\n")
            else:
                val_f.write(name + "\n")
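
The train/val split inside `generate_dataset` is a coin flip per image, with roughly 70% landing in train. Here is the same idea on its own, as a sketch: `split_names` is a made-up helper, and it uses `np.random.default_rng` instead of `np.random.seed` plus `np.random.choice`, so it will not reproduce the exact split produced by the function above.

```python
import numpy as np


def split_names(names, train_fraction=0.7, seed=56):
    """Randomly assign each name to a train or val list."""
    rng = np.random.default_rng(seed)
    # One uniform draw per name; draws below train_fraction go to train.
    to_train = rng.random(len(names)) < train_fraction
    train = [n for n, keep in zip(names, to_train) if keep]
    val = [n for n, keep in zip(names, to_train) if not keep]
    return train, val


names = [str(i).zfill(4) for i in range(1000)]
train, val = split_names(names)
print(len(train), len(val))
```

Because every image is drawn independently, the split is only approximately 70/30. Fixing the seed keeps it the same from run to run.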

In [216]:
# Build the path of every image and keep only the images that contain the target
df = pd.read_csv(csv_path)

df["image_path"] = (image_base_path + "/video_" + df["video_id"].astype(str) + "/" + df["video_frame"].astype(str) + ".jpg")
df["boxes"] = df["annotations"].apply(get_bboxes)
df["num_bboxes"] = df["boxes"].apply(len)
max_boxes = df["num_bboxes"].max()

# Selecting the images with more than 0 annotations
min_boxes_per_image = 0
df = df[df["num_bboxes"] > min_boxes_per_image]
image_paths = df["image_path"]
boxes = df["boxes"]

total_images = df.shape[0]
print("Total Number of Rows: ", total_images)

# Generating the data-set
generate_dataset(image_paths, boxes)

Total Number of Rows:  4919
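
Once the data-set is generated, it is worth checking that every image got its XML file. A small sketch, assuming the images are saved as `.jpg` and the annotations as `.xml` with the same base name (`unmatched_stems` is a made-up helper name):

```python
from pathlib import Path


def unmatched_stems(img_dir, ann_dir):
    """Return names that have an image but no XML, and names with an XML but no image."""
    images = {p.stem for p in Path(img_dir).glob("*.jpg")}
    xmls = {p.stem for p in Path(ann_dir).glob("*.xml")}
    return sorted(images - xmls), sorted(xmls - images)
```

Calling `unmatched_stems(data_img_dir, data_ann_dir)` should give back two empty lists if everything was written correctly.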
