# Corals health monitoring project
## Data collection and cleaning
---

## Collect dataset from Kaggle

### Objectives:
* Collect data
* Clean data (remove files which are not images)

### Input:
* Kaggle authentication token (kaggle.json)

### Output:
* Generate dataset:<br>
inputs/corals-dataset/Dataset

<hr>

## NOTES
##### **During model fitting, it was discovered that the dataset contained some images in the wrong folders: many images of dead corals were found in 'Bleached', and some bleached corals were found in 'Healthy'. To reduce error during model training, validation and prediction, the dataset was inspected visually and the most obviously misplaced images (clearly dead, bleached or healthy) were moved to the corresponding folders.**

---

### Import packages

In [None]:
%pip install -r /workspace/corals_health/requirements.txt

In [5]:
import numpy
import os
from matplotlib.image import imread

### Setting up directory

In [6]:
current_dir = os.getcwd()
current_dir

'/workspace/corals_health/jupyter_notebooks'

In [7]:
os.chdir('/workspace/corals_health')
print(f"Your current working directory is:\n {os.getcwd()}")

Your current working directory is:
 /workspace/corals_health


## Install Kaggle

In [None]:
%pip install --upgrade kaggle

In [13]:
# change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
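
The same configuration step can be done in pure Python, which also makes it possible to fail early if the token is missing. This is a minimal sketch: `prepare_kaggle_token` is a hypothetical helper name, not part of the notebook or of the Kaggle package; only the `KAGGLE_CONFIG_DIR` variable and the `kaggle.json` file name come from the cell above.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir):
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper: raises FileNotFoundError if the token is absent.
    """
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Owner read/write only, equivalent to `chmod 600 kaggle.json`
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Calling `prepare_kaggle_token(os.getcwd())` would mirror the cell above, assuming the token has already been copied into the working directory.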

* Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/sonainjamil/bhd-corals).
* Set your destination folder.

![Kaggle dataset summary page](../assets/images/kaggle-dataset.jpg)

### Set the Kaggle Dataset and Download it.

In [14]:
KaggleDatasetPath = "sonainjamil/bhd-corals"
DestinationFolder = "/workspace/corals_health/inputs/corals-dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/sonainjamil/bhd-corals
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading bhd-corals.zip to /workspace/corals_health/inputs/corals-dataset
 91%|████████████████████████████████████▏   | 113M/125M [00:03<00:00, 36.0MB/s]
100%|████████████████████████████████████████| 125M/125M [00:03<00:00, 36.1MB/s]


#### Unzip the downloaded file, and delete the zip file.

In [15]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/bhd-corals.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/bhd-corals.zip')
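
The unzip-and-delete step above could also be wrapped in a small helper that checks the archive before extracting and only deletes it once extraction has succeeded. This is a sketch under that assumption; `extract_and_remove` is a hypothetical name and is not part of the original notebook.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Hypothetical helper: returns the names of the extracted members.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(destination)
    os.remove(zip_path)  # only reached if extraction did not raise
    return members
```

With the variables defined earlier, a call might look like `extract_and_remove(os.path.join(DestinationFolder, 'bhd-corals.zip'), DestinationFolder)`.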

## Data cleaning
### In the '/inputs/' folder, check for and remove files that are not images

In [1]:
import os
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each class folder and report counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level of the dataset
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                file_location = os.path.join(folder_path, given_file)
                print(file_location)
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)


In [2]:
remove_non_image_file(my_data_dir='/workspace/corals_health/inputs/corals-dataset/Dataset')

Folder: Healthy - has image file 661
Folder: Healthy - has non-image file 0
Folder: Bleached - has image file 648
Folder: Bleached - has non-image file 0
Folder: Dead - has image file 161
Folder: Dead - has non-image file 0
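
Because the cleaning function deletes files as it goes, it may be useful to do a dry run first that only reports which file extensions each class folder contains. The following is a sketch of such a check; `count_extensions` is a hypothetical helper and does not modify the dataset.

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """Return {class_folder: Counter of lower-cased file extensions}; deletes nothing."""
    summary = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore files sitting next to the class folders
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return summary
```

Any extension in the result other than `.png`, `.jpg` or `.jpeg` corresponds to a file that `remove_non_image_file` would delete.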


### Remove black and white images

In [3]:
def remove_black_white(my_data_dir):
    """Remove greyscale images, i.e. those that load as 2-D arrays without a colour-channel axis."""
    folders = os.listdir(my_data_dir)
    print(folders)
    count = 0  # initialise once so the total accumulates across all folders
    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        for given_image in files:
            image_path = os.path.join(my_data_dir, folder, given_image)
            image = imread(image_path)
            if len(image.shape) != 3:
                os.remove(image_path)
                count += 1
    print(f"{count} black-white images were detected and removed.")
        
    

In [8]:
remove_black_white(my_data_dir='/workspace/corals_health/inputs/corals-dataset/Dataset')

['Healthy', 'Bleached', 'Dead']


0 black-white images were detected and removed.


### Manual data check
The dataset used in this work was assembled and preprocessed as part of the project published by [Jamil <em>et al.</em>](https://www.mdpi.com/2504-2289/5/4/53). Although the [dataset](https://www.kaggle.com/datasets/sonainjamil/bhd-corals) of coral images is labelled as 'Healthy', 'Bleached' and 'Dead', that work focused on distinguishing between 'Healthy' and 'Bleached' (a binary classification task) using 'specific deep convolutional neural networks such as AlexNet, GoogLeNet, VGG-19, ResNet-50, Inception v3, and CoralNet'. The subset labelled as 'Dead' was treated as 'Bleached'.

An attempt to train the model to categorise the data into three groups ('Bleached', 'Dead' and 'Healthy') resulted in poor generalisation and overfitting. Manual inspection of the dataset revealed that some of the 'Dead' corals were labelled as 'Bleached' and vice versa. Furthermore, some of the 'Bleached' corals were marked as 'Healthy'. Such misplacement may be less critical for binary classification, but it is crucial when training models with more categories. Therefore, the author manually moved images in the downloaded dataset into more appropriate folders wherever the misplacement was obvious, following this [description](https://en.wikipedia.org/wiki/Coral_bleaching) of coral bleaching.
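
The manual relabelling was done by hand, and the notebook does not record which files were moved. If the decisions were written down as a list, they could be replayed with a small script such as the sketch below. `relabel_images` and the file name in the usage note are hypothetical and only illustrate the idea.

```python
import os
import shutil


def relabel_images(data_dir, moves):
    """Move each (file_name, old_label, new_label) entry to its new class folder.

    Returns the entries that were skipped because the source file was missing
    or a file with the same name already exists in the destination folder.
    """
    skipped = []
    for file_name, old_label, new_label in moves:
        source = os.path.join(data_dir, old_label, file_name)
        target = os.path.join(data_dir, new_label, file_name)
        if not os.path.isfile(source) or os.path.exists(target):
            skipped.append((file_name, old_label, new_label))
            continue
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(source, target)
    return skipped
```

For example, `relabel_images(data_dir, [("example_001.jpg", "Bleached", "Dead")])` would move one (made-up) image from 'Bleached' to 'Dead'. Keeping such a list under version control would make the manual correction reproducible after a fresh download.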

## Split train validation test set

In [9]:
import math
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: floating-point sums of ratios may not equal 1.0 exactly
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    print(labels)
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folders, each with class-label sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(os.path.join(my_data_dir, folder, label))

        for label in labels:

            files = os.listdir(os.path.join(my_data_dir, label))
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(os.path.join(my_data_dir, label, file_name), os.path.join(
                                my_data_dir, 'train', label, file_name))

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(os.path.join( my_data_dir, label, file_name), os.path.join(
                                my_data_dir, 'validation', label, file_name))

                else:
                    # move given file to test set
                    shutil.move(os.path.join(my_data_dir, label, file_name), os.path.join(
                                my_data_dir, 'test', label, file_name))

                count += 1

            os.rmdir(os.path.join(my_data_dir,label))


In [10]:
split_train_validation_test_images(my_data_dir="/workspace/corals_health/inputs/corals-dataset/Dataset",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )


['Healthy', 'Bleached', 'Dead']
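
To see how many images each subset receives, the arithmetic inside the split function can be reproduced on its own. The sketch below uses a hypothetical helper, `split_sizes`, and the per-class counts printed by the cleaning step earlier (661, 648 and 161); these counts are illustrative only, since the manual relabelling may have changed them.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the split function: truncate train and validation sizes, test gets the rest."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    test = n_files - train - validation
    return train, validation, test


# Illustrative counts taken from the cleaning output above
for label, n_files in {"Healthy": 661, "Bleached": 648, "Dead": 161}.items():
    print(label, split_sizes(n_files, 0.7, 0.1))
```

With these counts the helper gives (462, 66, 133) for 'Healthy', (453, 64, 131) for 'Bleached' and (112, 16, 33) for 'Dead'. Note that `test_set_ratio` is only used in the sum check: the test set simply receives whatever remains after the train and validation files have been moved, so it can be slightly larger than its nominal share.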
