# **Data Collection Notebook**

## Objectives

* Fetch the cherry leaves image dataset from Kaggle and prepare the data for analysis and modelling.

## Inputs

* Kaggle JSON file - authentication token.

## Outputs

* Generate dataset: inputs/cherry_leaves_dataset

## Additional Comments

* Clean the data and split the dataset into train, validation and test sets.



---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir("/workspace/cherry-leaves-mildew-detection")
print("You set a new current directory.")

In [None]:
current_dir = os.getcwd()
current_dir

---

# Install Kaggle

In [None]:
# Install kaggle
%pip install kaggle

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
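The permission change above can also be done from Python, which makes it easy to fail early if the token file is missing. This is a minimal sketch rather than part of the original workflow: the helper name `secure_kaggle_token` is hypothetical, and it assumes a POSIX system with `kaggle.json` in the current working directory.

```python
import os
import stat


def secure_kaggle_token(token_path="kaggle.json"):
    """
    Restrict the token file to owner read/write (the same
    effect as chmod 600) and return the resulting mode bits.
    """
    # Fail early with a clear message if the token is missing
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found at: {token_path}")
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

Calling `secure_kaggle_token()` from the project root would then take the place of the `chmod` shell command.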

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
import zipfile

# Extract the downloaded archive, then delete the zip file
zip_path = os.path.join(DestinationFolder, "cherry-leaves.zip")
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(DestinationFolder)
os.remove(zip_path)

---

# Data Cleaning

### Removing non-image files

In [None]:

def remove_non_image(data_dir):
    """
    Search each label folder in data_dir and delete any file
    whose name does not end with one of the listed image
    extensions. Print the number of image files kept and
    non-image files removed for each label.
    """
    image_extension = (".png", ".jpg", ".jpeg", ".tiff")
    for folder in os.listdir(data_dir):
        folder_path = os.path.join(data_dir, folder)
        image = []
        non_image = []
        for file in os.listdir(folder_path):
            if not file.lower().endswith(image_extension):
                non_image.append(file)
                os.remove(os.path.join(folder_path, file))
            else:
                image.append(file)
        print(f"Label: {folder} has {len(image)} image files; removed {len(non_image)} non-image files.")

In [None]:
remove_non_image(data_dir="inputs/cherry_leaves_dataset/cherry_leaves")
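Because the cleaning step deletes files permanently, a read-only summary of the file extensions per label can be used to confirm the result, or run beforehand as a dry run. The following is a minimal sketch; `count_extensions` is a hypothetical helper, not part of the original notebook, and it assumes `data_dir` contains one subfolder per label.

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """
    Return {label: Counter of lowercase file extensions}
    for each label folder, without deleting anything.
    """
    summary = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        # Skip stray files that sit next to the label folders
        if not os.path.isdir(folder_path):
            continue
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(folder_path)
        )
    return summary
```

For example, `count_extensions("inputs/cherry_leaves_dataset/cherry_leaves")` would return one counter per label, so any unexpected extension is easy to spot.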

---

# Data Preparation

### Split Dataset into Train, Validation and Test Sets

In [None]:
import math
import os
import random
import shutil

def split_image_dataset(data_dir, train_ratio, val_ratio, test_ratio):
    """
    Split the dataset into train, validation and test sets,
    creating new folders. The ratios must total 1.
    """
    # Compare with a tolerance, since float sums are not always exact
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        print("The ratios need to sum up to 1.")
        return

    labels = os.listdir(data_dir)
    if "test" in labels:
        print("A test folder is already present.")
        return

    # Create a subfolder for each label within each set
    for folder in ["train", "validation", "test"]:
        for label in labels:
            os.makedirs(os.path.join(data_dir, folder, label))

    # Distribute the files into the sets at random
    for label in labels:
        label_dir = os.path.join(data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        # Number of files per set; the test set takes the remainder
        no_train_files = int(len(files) * train_ratio)
        no_val_files = int(len(files) * val_ratio)
        sum_train_val_files = no_train_files + no_val_files

        for count, file in enumerate(files):
            if count < no_train_files:
                target_set = "train"
            elif count < sum_train_val_files:
                target_set = "validation"
            else:
                target_set = "test"
            shutil.move(os.path.join(label_dir, file),
                        os.path.join(data_dir, target_set, label, file))

        # Remove the now-empty label directory
        os.rmdir(label_dir)

In [None]:
split_image_dataset(data_dir="inputs/cherry_leaves_dataset/cherry_leaves",
                    train_ratio=0.7,
                    val_ratio=0.15,
                    test_ratio=0.15)
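After the split, it may be worth checking how many files ended up in each set. The sketch below counts files per set and label; `count_split_files` is a hypothetical helper, and it assumes the `train`, `validation` and `test` folders created above, each holding one subfolder per label.

```python
import os


def count_split_files(data_dir, sets=("train", "validation", "test")):
    """
    Return {set_name: {label: file_count}} for a dataset
    that has already been split into the given sets.
    """
    counts = {}
    for set_name in sets:
        set_path = os.path.join(data_dir, set_name)
        counts[set_name] = {
            label: len(os.listdir(os.path.join(set_path, label)))
            for label in sorted(os.listdir(set_path))
        }
    return counts
```

Dividing each count by the label's total should give values close to the requested ratios.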

---

# Conclusion

The data was cleaned by removing any non-image files, then split into train, validation and test sets in a 0.7 : 0.15 : 0.15 ratio.
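As a rough illustration of what the ratio means in file counts, the sketch below works through the arithmetic for a single label. It assumes the train and validation counts are rounded down and the test set receives the remainder; `split_sizes` is a hypothetical helper and the file counts are illustrative, not taken from the dataset.

```python
def split_sizes(n_files, train_ratio, val_ratio):
    """
    Return (train, validation, test) file counts for one label.
    Train and validation are rounded down; test gets the rest.
    """
    n_train = int(n_files * train_ratio)
    n_val = int(n_files * val_ratio)
    n_test = n_files - n_train - n_val
    return n_train, n_val, n_test


# Illustrative label with 100 files and a 0.7 / 0.15 / 0.15 split
print(split_sizes(100, 0.7, 0.15))  # (70, 15, 15)
```

When the file count does not divide evenly, the leftover files fall into the test set, so its share can be slightly above the nominal ratio.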

---

# Next Steps

The data will be analysed for differences between the two classes (healthy and powdery mildew). The mean and standard deviation of the images will be calculated to show the average image and the image variability. An image montage will be created to display the images per label.

---