<center>
    <h1>Exploratory data analysis for the Great Barrier Reef</h1>
</center>


In this notebook, we explore the features of our data. This helps us understand which features are useful for solving the problem assigned to us.

In [18]:
#set datapath here
data_Path = "/home/sitwala/linuxdevs/DSI/module1/cots-detection/dataset"

In [26]:
!ls {data_Path}

example_sample_submission.csv  greatbarrierreef  train.csv
example_test.npy	       test.csv		 train_images


In [27]:
!ls {data_Path}/train_images


video_0  video_1  video_2


Our folder contains the train and test data in CSV form, along with the images. The images are divided into three video folders.
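As a cross-check against `train.csv`, the number of `.jpg` files in each video folder can be counted directly. Below is a minimal sketch using only the standard library. The helper `count_jpgs` and the temporary folder tree are illustrative stand-ins, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def count_jpgs(train_images_dir):
    """Return {folder_name: number of .jpg files} for each sub-folder."""
    root = Path(train_images_dir)
    return {
        folder.name: sum(1 for _ in folder.glob("*.jpg"))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


# Throwaway directory tree standing in for train_images/ (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    for name, n_files in [("video_0", 2), ("video_1", 1)]:
        folder = Path(tmp) / name
        folder.mkdir()
        for i in range(n_files):
            (folder / f"{i}.jpg").touch()
    jpg_counts = count_jpgs(tmp)

print(jpg_counts)
```

On the real data, the call would presumably be `count_jpgs(data_Path + "/train_images")`.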

In [44]:
image_folder = data_Path+"/train_images/video_0"

## Visualization

First, let us visualize the images that we are going to use.

In [2]:
%%capture
pip install easyimages # tool to visualize the images

In [3]:
from easyimages import EasyImageList # used to show image in the notebook


In [41]:
!ls {image_folder}

video_0  video_1  video_2


In [45]:
Li = EasyImageList.from_folder(image_folder)
Li.symlink_images()
Li.html(sample=500,size=44)

## Loading the data

Now, we are going to load the training data.

In [58]:
import pandas as pd

DATA_PATH = data_Path
train = pd.read_csv(DATA_PATH + '/train.csv')
train.tail(5)

Unnamed: 0,video_id,sequence,video_frame,sequence_frame,image_id,annotations
23496,2,29859,10755,2983,2-10755,[]
23497,2,29859,10756,2984,2-10756,[]
23498,2,29859,10757,2985,2-10757,[]
23499,2,29859,10758,2986,2-10758,[]
23500,2,29859,10759,2987,2-10759,[]


In [62]:
for i in range(0,3):
    count = len(train[train['video_id']==i])
    print("Number of images in Video_{}:{}".format(i,count))
   

Number of images in Video_0:6708
Number of images in Video_1:8232
Number of images in Video_2:8561
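The same per-video counts can be obtained without hard-coding the number of videos. Below is a minimal sketch, assuming a DataFrame with a `video_id` column like `train`. The small `demo` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `train`; only the `video_id` column matters here.
demo = pd.DataFrame({"video_id": [0, 0, 1, 2, 2, 2]})

# value_counts() counts rows per distinct video_id; sort_index() orders by video id.
counts = demo["video_id"].value_counts().sort_index()

for video_id, count in counts.items():
    print("Number of images in Video_{}: {}".format(video_id, count))
```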


## Check duplicated data

In [88]:
# Count rows that repeat an earlier (video_id, video_frame) pair
num_of_duplicates = train.duplicated(subset=["video_id", "video_frame"]).sum()

print("Number of duplicates: " + str(num_of_duplicates))

Number of duplicates: 0
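When duplicates do exist, it helps to see every copy and not only the repeats. Below is a minimal sketch of the two `keep` modes of `DataFrame.duplicated`. The `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame with one repeated (video_id, video_frame) pair.
demo = pd.DataFrame({
    "video_id":    [0, 0, 0, 1],
    "video_frame": [0, 1, 1, 1],
})

# By default, duplicated() marks every repeat after the first occurrence.
num_of_duplicates = int(demo.duplicated(subset=["video_id", "video_frame"]).sum())

# keep=False marks all members of a duplicated group, which helps when inspecting them.
all_copies = demo[demo.duplicated(subset=["video_id", "video_frame"], keep=False)]

print(num_of_duplicates, len(all_copies))
```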


In [89]:
train.head()

Unnamed: 0,video_id,sequence,video_frame,sequence_frame,image_id,annotations,num_bboxes
0,0,40258,0,0,0-0,[],0
1,0,40258,1,1,0-1,[],0
2,0,40258,2,2,0-2,[],0
3,0,40258,3,3,0-3,[],0
4,0,40258,4,4,0-4,[],0


## Check for images with no annotations


In this section, the column 'num_bboxes' stores the number of bounding boxes of each image, which tells us whether an image has any annotations.

In [60]:
import ast

# Convert String to List Type
train['annotations'] = train['annotations'].apply(ast.literal_eval)

# Get the number of bounding boxes for each image
train['num_bboxes'] = train['annotations'].apply(lambda x: len(x))

In [90]:
print("Number of images with annotations: " + str(len(train[train['num_bboxes'] > 0])))

Number of images with annotations: 4919
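To see annotated and unannotated frames side by side, both conditions on 'num_bboxes' can be counted. Below is a minimal sketch on made-up rows that follow the string format of the `annotations` column.

```python
import ast

import pandas as pd

# Hypothetical rows in the same string format as the `annotations` column.
demo = pd.DataFrame({
    "annotations": [
        "[]",
        "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]",
        "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}, {'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
        "[]",
    ]
})

# Parse each string into a list of dicts, then count the boxes per image.
demo["annotations"] = demo["annotations"].apply(ast.literal_eval)
demo["num_bboxes"] = demo["annotations"].apply(len)

with_boxes = int((demo["num_bboxes"] > 0).sum())
without_boxes = int((demo["num_bboxes"] == 0).sum())
print("with annotations:", with_boxes, "| without annotations:", without_boxes)
```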


## Verify if there is corrupted data


This section is to check if all our images have valid format.

In [93]:
from os import listdir
from PIL import Image

def verify_images(video_id):
    path = f'{DATA_PATH}/train_images/video_{video_id}/'
    bad_files = []
    for filename in listdir(path):
        if filename.endswith('.jpg'):
            try:
                with Image.open(path + filename) as img:
                    img.verify()  # Raises if the file is not a valid image
            except (IOError, SyntaxError):
                bad_files.append(filename)
                print('Bad file:', filename)  # Print out the names of corrupt files
    # Only report success when no corrupt file was found
    if not bad_files:
        print(f'Video {video_id} has all valid images. Verified!')

for video_id in range(3):
    verify_images(video_id)

Video 0 has all valid images. Verified!
Video 1 has all valid images. Verified!
Video 2 has all valid images. Verified!



## Verify that all images are readable and study their sizes

In this section, we check whether some images are not readable and explore the distribution of the sizes and aspect ratios of the images.


The code used to read the image sizes follows https://note.nkmk.me/en/python-opencv-pillow-image-size/.

In [94]:
train.head()

Unnamed: 0,video_id,sequence,video_frame,sequence_frame,image_id,annotations,num_bboxes
0,0,40258,0,0,0-0,[],0
1,0,40258,1,1,0-1,[],0
2,0,40258,2,2,0-2,[],0
3,0,40258,3,3,0-3,[],0
4,0,40258,4,4,0-4,[],0


In [105]:
path = data_Path + '/train_images/video_'+str(0)+'/'+str(0)+'.jpg'

In [106]:
import cv2

cv2.imread(path).shape

(720, 1280, 3)

We use lists to store the height, width, and number of channels of each image.

In [107]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23501 entries, 0 to 23500
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   video_id        23501 non-null  int64 
 1   sequence        23501 non-null  int64 
 2   video_frame     23501 non-null  int64 
 3   sequence_frame  23501 non-null  int64 
 4   image_id        23501 non-null  object
 5   annotations     23501 non-null  object
 6   num_bboxes      23501 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 1.3+ MB


In [108]:
import csv 

In [110]:
row, column, color = [], [], []
listexception = []
for i in range(len(train)):
    record = train.iloc[i]
    # Note: the file name is built from sequence_frame here
    path = data_Path + '/train_images/video_' + str(record.video_id) + '/' + str(record.sequence_frame) + '.jpg'
    img = cv2.imread(path)
    if img is None:
        # cv2.imread returns None when the file is missing or cannot be decoded
        listexception.append(i)
        continue
    r, c1, c2 = img.shape
    row.append(r)
    column.append(c1)
    color.append(c2)

print(len(listexception))



3554


The list `listexception` stores the row numbers of the images that imread could not read.
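`cv2.imread` returns `None` both when a file is missing and when it cannot be decoded, so it can help to test for existence before drawing conclusions. Below is a minimal sketch using only the standard library. The helper `split_missing` and the temporary files are made up for illustration.

```python
import os
import tempfile


def split_missing(paths):
    """Split paths into those that exist on disk and those that do not."""
    existing = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return existing, missing


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "0.jpg")
    with open(present, "wb") as handle:
        handle.write(b"placeholder")  # content is irrelevant to the existence check
    absent = os.path.join(tmp, "1.jpg")  # never created

    existing, missing = split_missing([present, absent])

print(len(existing), "existing,", len(missing), "missing")
```

Applied to the paths built in the loop above, a non-empty `missing` list would point to a path-construction problem, not to corrupted images.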

In [111]:
import numpy as np
# np.savetxt(data_Path + '/tensorflow-great-barrier-reef/listexception.csv', np.array(listexception), delimiter=',')


In [112]:
row_describe = pd.DataFrame(np.array(row))
row_describe.describe()

Unnamed: 0,0
count,19947.0
mean,720.0
std,0.0
min,720.0
25%,720.0
50%,720.0
75%,720.0
max,720.0


In [113]:
column_describe = pd.DataFrame(np.array(column))
column_describe.describe()

Unnamed: 0,0
count,19947.0
mean,1280.0
std,0.0
min,1280.0
25%,1280.0
50%,1280.0
75%,1280.0
max,1280.0


In [114]:
color_describe = pd.DataFrame(np.array(color))
color_describe.describe()

Unnamed: 0,0
count,19947.0
mean,3.0
std,0.0
min,3.0
25%,3.0
50%,3.0
75%,3.0
max,3.0


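All the images that were read have the same size (720, 1280, 3), and there are 3554 rows for which imread could not read an image. Because every .jpg file passed the PIL check in the previous section, these failures are probably paths that do not exist on disk (the file name is built from sequence_frame) and not corrupted images.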
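The aspect ratio mentioned at the start of this section follows directly from the shape. Below is a minimal sketch, assuming the `(height, width, channels)` order that `cv2.imread(...).shape` reports. The arrays are illustrative values, not measured data.

```python
import numpy as np

# Shape reported by cv2.imread above: (height, width, channels)
height, width, channels = 720, 1280, 3

# Aspect ratio as width divided by height
aspect_ratio = width / height
print(round(aspect_ratio, 3))

# For many images, the same computation vectorizes over arrays of heights and widths.
heights = np.array([720, 720, 720])
widths = np.array([1280, 1280, 1280])
ratios = widths / heights

# A single unique value means every image shares the same aspect ratio.
print(np.unique(ratios))
```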
 



## Exploring the target. 

In this section, we analyze the annotations in the training data.

In [115]:
listanotation = train[train['num_bboxes']>0]['annotations']

In [116]:
listanotation.head()

16    [{'x': 559, 'y': 213, 'width': 50, 'height': 32}]
17    [{'x': 558, 'y': 213, 'width': 50, 'height': 32}]
18    [{'x': 557, 'y': 213, 'width': 50, 'height': 32}]
19    [{'x': 556, 'y': 214, 'width': 50, 'height': 32}]
20    [{'x': 555, 'y': 214, 'width': 50, 'height': 32}]
Name: annotations, dtype: object

In [117]:
type(listanotation.iloc[1])

list

In [118]:
type(listanotation.iloc[1][0])

dict

In [119]:
listanotation.iloc[1][0]['x']

558

In [120]:
n = listanotation.shape[0]

In [121]:
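# Collect [x, y, width, height] for every bounding box, not only the first one per image
listcoordinate = []
for annotations in listanotation:
    for annot in annotations:
        listcoordinate.append([annot['x'], annot['y'], annot['width'], annot['height']])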
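Once the coordinates are in a list, a DataFrame makes it easy to summarize box sizes. Below is a minimal sketch that uses the five boxes shown in the `listanotation.head()` output as sample data. The column names `area` and `area_fraction` are introduced here for illustration, and the 720 x 1280 frame size is taken from the shape check earlier in the notebook.

```python
import pandas as pd

# Sample boxes copied from the listanotation.head() output: [x, y, width, height]
listcoordinate = [
    [559, 213, 50, 32],
    [558, 213, 50, 32],
    [557, 213, 50, 32],
    [556, 214, 50, 32],
    [555, 214, 50, 32],
]

boxes = pd.DataFrame(listcoordinate, columns=["x", "y", "width", "height"])

# Box area in pixels and as a fraction of a 720 x 1280 frame
boxes["area"] = boxes["width"] * boxes["height"]
boxes["area_fraction"] = boxes["area"] / (720 * 1280)

print(boxes[["width", "height", "area", "area_fraction"]].describe())
```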

#### References
- https://www.kaggle.com/diegoalejogm/great-barrier-reefs-eda-with-animations
- https://www.kaggle.com/debarshichanda/w-b-tables-great-barrier-reef-eda
- https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection