# **Data Collection Notebook**

## Objectives

* Fetch the cherry leaves image dataset from Kaggle and prepare the data for analysis and modelling.

## Inputs

* Kaggle JSON file - authentication token.

## Outputs

* Generate dataset: inputs/cherry_leaves_dataset

## Additional Comments

* Clean the data and split the dataset into train, validation and test sets.



---

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/cherry-leaves-mildew-detection/jupyter_notebooks'

In [2]:
os.chdir("/workspace/cherry-leaves-mildew-detection")
print("You set a new current directory.")

You set a new current directory.


In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/cherry-leaves-mildew-detection'
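The hard-coded path in the `os.chdir` call only works in this particular workspace. A minimal sketch of a more portable alternative, assuming the notebook sits one level below the project root (the helper name `project_root` is hypothetical and not part of the original notebook):

```python
import os

def project_root(notebook_dir):
    """Return the parent of the notebook directory.

    Assumption: the notebook folder is a direct child of the project root.
    """
    # normpath drops any trailing separator so dirname returns the true parent
    return os.path.dirname(os.path.normpath(notebook_dir))

# Possible usage inside the notebook:
# os.chdir(project_root(os.getcwd()))
```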

# Install Kaggle

In [4]:
# Install kaggle
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.13.tar.gz (63 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.13-py3-none-any.whl size=77717 sha256=2ad0cbed825fdaee04011b8b58acb56a59a1c99d4b2540b57712ef1329f5cbfa
  Stored in directory: /home/gitpod/.cache/pip/wheels/e6/8e/67/e07554a720a493dc6b39b30488590ba92ed45448ad0134d253
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, 

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
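The Kaggle CLI looks for `kaggle.json` in the directory named by `KAGGLE_CONFIG_DIR`, and the token file holds a `username` and a `key` field. Before attempting the download it can help to confirm the token is in place. The following is a sketch only; `check_kaggle_token` is a hypothetical helper, not part of the Kaggle package:

```python
import json
import os

def check_kaggle_token(config_dir):
    """Return True if config_dir/kaggle.json exists and has 'username' and 'key' fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path) as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file counts as a missing token
        return False
    return isinstance(token, dict) and all(k in token for k in ("username", "key"))
```

In the notebook this could be called as `check_kaggle_token(os.getcwd())` right after setting `KAGGLE_CONFIG_DIR`.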

In [5]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:02<00:00, 31.6MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 23.7MB/s]


In [6]:
import zipfile
zip_path = os.path.join(DestinationFolder, "cherry-leaves.zip")
# Use a name other than `zip` so the built-in function is not shadowed
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(DestinationFolder)
os.remove(zip_path)
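The cell above deletes the archive straight after extraction. A slightly more defensive variant, sketched here under the assumption that a corrupt download should stop the notebook rather than leave a partial dataset, checks the archive first (`extract_and_remove` is a hypothetical helper name):

```python
import os
import zipfile

def extract_and_remove(zip_path, dest_dir):
    """Extract zip_path into dest_dir and delete the archive only if extraction succeeds."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"Corrupt member in archive: {bad_member}")
        names = archive.namelist()
        archive.extractall(dest_dir)
    # Reached only when no exception was raised above
    os.remove(zip_path)
    return names
```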

---

# Data Cleaning

### Removing non-image files

In [7]:

def remove_non_image(data_dir):
    """
    Search each label folder in data_dir and delete every
    file whose name does not end with one of the listed
    image extensions, then print the number of image and
    non-image files found per label.
    """
    image_extension = (".png", ".jpg", ".jpeg", ".tiff")
    folders = os.listdir(data_dir)
    for folder in folders:
        files = os.listdir(data_dir + "/" + folder)
        image = []
        non_image = []
        for file in files:
            if not file.lower().endswith(image_extension):
                non_image.append(file)
                os.remove(data_dir + "/" + folder + "/" + file)
            else:
                image.append(file)
        print(f"Label: {folder} has {len(image)} image files and {len(non_image)} non-image files.")

In [9]:
remove_non_image(data_dir="inputs/cherry_leaves_dataset/cherry_leaves")

Label: healthy has 2104 image files and 0 non-image files.
Label: powdery_mildew has 2104 image files and 0 non-image files.
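Because `remove_non_image` deletes files as soon as it finds them, it can be useful to preview what would be removed first. A minimal dry-run sketch, assuming the same extension list as above (`find_non_images` and `IMAGE_EXTENSIONS` are hypothetical names introduced for illustration):

```python
import os

# Same extensions as in remove_non_image
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tiff")

def find_non_images(data_dir, extensions=IMAGE_EXTENSIONS):
    """Return a sorted list of file paths under data_dir that lack an image extension.

    Nothing is deleted; this only reports candidates.
    """
    non_images = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if not name.lower().endswith(extensions):
                non_images.append(os.path.join(root, name))
    return sorted(non_images)
```

An empty list from `find_non_images("inputs/cherry_leaves_dataset/cherry_leaves")` would agree with the zero non-image counts printed above.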


# Data Preparation

### Split Dataset into Train, Validation and Test Sets

In [15]:
import os
import shutil
import random

def split_image_dataset(data_dir, train_ratio, val_ratio, test_ratio):
    """
    Function to split the dataset into train, validation and
    test sets, creating new folders. The ratios must total 1.
    """
    # Compare with a tolerance: floating-point sums are not always exactly 1
    if abs(train_ratio + val_ratio + test_ratio - 1) > 1e-9:
        print("The ratios need to sum up to 1.")
        return
    
    labels = os.listdir(data_dir)
    if "test" in labels:
        print("A test folder is already present.")
        return
    else:
        for folder in ["train", "validation", "test"]:
            # Create a subfolder for each label within each set
            for label in labels:
                os.makedirs(name=data_dir + "/" + folder + "/" + label)
        # To distribute the data into the sets at random
        for label in labels:
            files = os.listdir(data_dir + "/" + label)
            random.shuffle(files)

            # Number of files per set; whatever remains goes to test
            no_train_files = int(len(files) * train_ratio)
            no_val_files = int(len(files) * val_ratio)
            sum_train_val_files = no_train_files + no_val_files

            count = 0

            for file in files:
                # Strict comparisons: count starts at 0, so "<=" would
                # place one extra file in the train set
                if count < no_train_files:
                    shutil.move(data_dir + "/" + label + "/" + file,
                                data_dir + "/train/" + label + "/" + file)
                elif count < sum_train_val_files:
                    shutil.move(data_dir + "/" + label + "/" + file,
                                data_dir + "/validation/" + label + "/" + file)
                else:
                    shutil.move(data_dir + "/" + label + "/" + file,
                                data_dir + "/test/" + label + "/" + file)
                count += 1

            # Remove old label directory 
            os.rmdir(data_dir + "/" + label)

In [16]:
split_image_dataset(data_dir="inputs/cherry_leaves_dataset/cherry_leaves",
                    train_ratio=0.7,
                    val_ratio=0.15,
                    test_ratio=0.15)
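The split cell produces no output, so the result is worth checking by counting the files that ended up in each set. A short sketch, assuming the `train`, `validation` and `test` folder layout created above (`count_split_files` is a hypothetical helper):

```python
import os

def count_split_files(data_dir, sets=("train", "validation", "test")):
    """Return {set_name: {label: file_count}} for each set folder that exists."""
    counts = {}
    for set_name in sets:
        set_dir = os.path.join(data_dir, set_name)
        if not os.path.isdir(set_dir):
            continue  # skip sets that have not been created
        counts[set_name] = {
            label: len(os.listdir(os.path.join(set_dir, label)))
            for label in sorted(os.listdir(set_dir))
        }
    return counts
```

Calling `count_split_files("inputs/cherry_leaves_dataset/cherry_leaves")` after the split should show per-label counts roughly in the 70/15/15 proportions requested.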

---


# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create here your folder
    # os.makedirs(name='')
    pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
    print(e)
