# **DATA COLLECTION**

## Objectives

* Fetch data from Kaggle and organize it into train, validation and test sets using folders

## Inputs

* Kaggle JSON file with authentication token

## Outputs

* Organized dataset:
    - inputs/dataset/cherry-leaves/train folder
    - inputs/dataset/cherry-leaves/validation folder
    - inputs/dataset/cherry-leaves/test folder

---

# Change working directory

In [None]:
import os

os.chdir("./..")  # change to parent directory
working_dir = os.getcwd()
working_dir  # check output for correct directory

# Fetch dataset from Kaggle

Set the Kaggle configuration directory and download the dataset

In [None]:
os.environ["KAGGLE_CONFIG_DIR"] = working_dir

dataset_url = "codeinstitute/cherry-leaves"
dl_destination = "inputs/dataset/raw"

! kaggle datasets download -d {dataset_url} -p {dl_destination}
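
The Kaggle CLI looks for its credentials in a `kaggle.json` file inside the directory named by `KAGGLE_CONFIG_DIR`. A minimal sketch of a check that could run before the download, assuming the token was saved under that default file name; `kaggle_token_present` is a hypothetical helper, not part of the Kaggle package:

```python
import os


def kaggle_token_present(config_dir):
    """Return True if a kaggle.json file exists in config_dir (hypothetical helper)."""
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))
```

Calling `kaggle_token_present(working_dir)` before the download cell would show whether the token file is where the CLI expects it.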

Unzip the dataset file, then remove the zip file and its folder

In [None]:
from zipfile import ZipFile

zip_path = dl_destination + "/cherry-leaves.zip"
destination_folder = working_dir + "/inputs/dataset/"

with ZipFile(zip_path, "r") as zip_file:
    zip_file.extractall(destination_folder)

os.remove(zip_path)
os.rmdir(dl_destination)
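
The later cells assume the archive unpacks into a `cherry-leaves` folder. A small sketch for checking the top-level entries of an archive, assuming it would be run before the zip file is removed; `top_level_entries` is a hypothetical helper introduced here for illustration:

```python
from zipfile import ZipFile


def top_level_entries(zip_path):
    """Return the sorted first path components found in the archive."""
    with ZipFile(zip_path, "r") as zip_file:
        # names inside a zip archive use "/" as the separator
        return sorted({name.split("/")[0] for name in zip_file.namelist()})
```

If the result of `top_level_entries(zip_path)` is a single folder name, that name is the one to append to `destination_folder` in the cells below.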

---

# Organize data into train, validation and test folders

All files in the dataset are in .jpg format and contain images of interest, so no additional preparation is needed.
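
One way to check the file-format claim is to count files by extension. This is a minimal sketch, not part of the original notebook; `count_extensions` is a hypothetical helper name:

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """Count files under data_dir (recursively) by lowercase file extension."""
    counts = Counter()
    for _root, _dirs, files in os.walk(data_dir):
        for file_name in files:
            counts[os.path.splitext(file_name)[1].lower()] += 1
    return counts
```

If every file is a JPEG, `count_extensions` on the dataset folder should return a single `".jpg"` key.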

Make a function to shuffle the dataset and split it into train, validation and test sets with given ratios.
Also, make a function to list the number of files per folder.

In [None]:
import random, shutil


def train_validation_test_split(data_dir, train_ratio, validation_ratio, test_ratio):
    
    # compare with a tolerance: float sums such as 0.6 + 0.3 + 0.1 may not equal 1.0 exactly
    if abs(train_ratio + validation_ratio + test_ratio - 1.0) > 1e-9:
        print("Ratios must add up to 1.0")
        return
    
    categories = os.listdir(data_dir)

    if "test" in categories:  # prevent creation of sub-divisions on multiple function calls
        print("Dataset already organized")
        return

    for category in categories:
        all_files = os.listdir(data_dir + "/" + category)
        random.shuffle(all_files)

        number_of_files = len(all_files)
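        # index where the train set ends (int() truncates towards zero)
        train_files_number = int(number_of_files * train_ratio)
        # index where the validation set ends; the remaining files go to the test set
        validation_files_number = int(number_of_files * validation_ratio) + train_files_number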

        set_names = ["train", "validation", "test"]

        for name in set_names:
            os.makedirs(data_dir + "/" + name + "/" + category)

        for i in range(number_of_files):
            # pick the target set from the position in the shuffled list
            if i < train_files_number:
                set_index = 0
            elif i < validation_files_number:
                set_index = 1
            else:
                set_index = 2

            shutil.move(
                data_dir + "/" + category + "/" + all_files[i],
                data_dir + "/" + set_names[set_index] + "/" + category + "/" + all_files[i],
            )
    
        os.rmdir(data_dir + "/" + category)


def show_number_of_files(data_dir):
    for folder in os.listdir(data_dir):
        current_dir = data_dir + "/" + folder
        
        for subfolder in os.listdir(current_dir):
            print_dir = current_dir + "/" + subfolder
            nr = len(os.listdir(print_dir))
            print(f"{nr} files in {print_dir}")

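
The call to `random.shuffle` in the function above is unseeded, so each run of the notebook on a fresh download can produce a different split. A hedged sketch of a reproducible alternative, assuming a fixed seed is acceptable for the project; `shuffled_copy` and its `seed` parameter are hypothetical and not part of the original function:

```python
import random


def shuffled_copy(file_names, seed):
    """Return a shuffled copy of file_names that is identical for a given seed."""
    ordered = sorted(file_names)  # os.listdir returns names in arbitrary order, so sort first
    random.Random(seed).shuffle(ordered)  # private generator, global random state untouched
    return ordered
```

Replacing `random.shuffle(all_files)` with `all_files = shuffled_copy(all_files, seed)` would be one way to wire this in.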
Split dataset: 70% train, 10% validation, 20% test

In [None]:
train_validation_test_split(
    os.path.join(destination_folder, "cherry-leaves"),  # avoids a doubled "/" in the path
    0.7,
    0.1,
    0.2
    )
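
The split sizes follow from the index arithmetic in `train_validation_test_split`: the train and validation counts are truncated with `int()`, and the test set receives whatever remains. A small sketch that mirrors this arithmetic for a single category; `split_counts` is a hypothetical helper added only for illustration:

```python
def split_counts(number_of_files, train_ratio, validation_ratio):
    """Mirror the per-category index arithmetic of train_validation_test_split."""
    train_count = int(number_of_files * train_ratio)
    validation_count = int(number_of_files * validation_ratio)
    # the test set takes the remainder, including anything lost to truncation
    test_count = number_of_files - train_count - validation_count
    return train_count, validation_count, test_count
```

For example, seven files with ratios 0.5 and 0.25 give `(3, 1, 3)`, so the test set can end up slightly larger than its nominal share.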

Inspect number of files per set

In [None]:
show_number_of_files(os.path.join(destination_folder, "cherry-leaves"))

Data is now organized and ready for the next steps.