# **Data Collection**

## Objectives

* Fetch data from Kaggle
* Check for and remove non-image files
* Split the data into test, train and validation sets

## Inputs

* Kaggle JSON file - authentication token

## Outputs

```bash
.
├── inputs
│   └── cherry-leaves
│       └── cherry-leaves
│           ├── test
│           │   ├── healthy
│           │   └── powdery_mildew
│           ├── train
│           │   ├── healthy
│           │   └── powdery_mildew
│           └── validation
│               ├── healthy
│               └── powdery_mildew
└── ...
```





---

**Import packages**

In [1]:
pip install -r ../requirements.txt

Collecting numpy==1.26.1 (from -r ../requirements.txt (line 1))
  Obtaining dependency information for numpy==1.26.1 from https://files.pythonhosted.org/packages/ad/00/adb57a4974931c97a9bbbc92fd2cc998aa47569fcd7fb65ded4b81b72455/numpy-1.26.1-cp312-cp312-macosx_10_9_x86_64.whl.metadata
  Downloading numpy-1.26.1-cp312-cp312-macosx_10_9_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 876.0 kB/s eta 0:00:00
Collecting pandas==2.1.1 (from -r ../requirements.txt (line 2))
  Obtaining dependency information for pandas==2.1.1 from https://files.pythonhosted.org/packages/a5/d2/9e130353d2358b463095a42aaa4432d6a91c42ff22e55c39dae4597e3ae5/pandas-2.1.1-cp312-cp312-macosx_10_9_x86_64.whl.metadata
  Downloading pandas-2.1.1-cp312-cp312-macosx_10_9_x86_64.whl.metadata (18 kB)
Collecting matplotlib==3.8.0 (from -r ../requirements.txt (line 3))
  Obtaining dependency information for matplotlib==3.8.0 f

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running a notebook in the editor.

We need to change the working directory from the notebooks folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/Users/lukenicklin/mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/Users/lukenicklin/mildew-detection-in-cherry-leaves'
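
Running the `os.chdir` cell a second time moves one more level up the tree. The following is a minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier. The helper name `resolve_project_root` is hypothetical and not part of the original notebook.

```python
import os


def resolve_project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged."""
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebooks_dir:
        return os.path.dirname(normalized)
    return cwd
```

Calling `os.chdir(resolve_project_root(os.getcwd()))` should then be safe to repeat, because only the first call changes the directory.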

# Install Kaggle

In [5]:
pip install kaggle

Collecting kaggle
  Obtaining dependency information for kaggle from https://files.pythonhosted.org/packages/14/83/7f29c7abe0d5dc769dad7da993382c3e4239ad63e1dd58414d129e0a4da2/kaggle-1.7.4.5-py3-none-any.whl.metadata
  Downloading kaggle-1.7.4.5-py3-none-any.whl.metadata (16 kB)
Collecting bleach (from kaggle)
  Obtaining dependency information for bleach from https://files.pythonhosted.org/packages/fc/55/96142937f66150805c25c4d0f31ee4132fd33497753400734f9dfdcbdc66/bleach-6.2.0-py3-none-any.whl.metadata
  Downloading bleach-6.2.0-py3-none-any.whl.metadata (30 kB)
Collecting python-slugify (from kaggle)
  Obtaining dependency information for python-slugify from https://files.pythonhosted.org/packages/a4/62/02da182e544a51a5c3ccf4b03ab79df279f9c60c5e82d5e8bec7ca26ac11/python_slugify-8.0.4-py2.py3-none-any.whl.metadata
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode (from kaggle)
  Obtaining dependency information for text-unidecode from 

Run this code to point the Kaggle configuration directory at the current working directory and to restrict the permissions on the Kaggle authentication JSON file to owner read/write only

In [8]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
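
The shell call above depends on `chmod` being available. As a hedged alternative, the same two steps can be sketched in plain Python with `os.chmod`. The function name `configure_kaggle_token` is hypothetical, and the sketch assumes `kaggle.json` sits directly in the chosen configuration directory.

```python
import os
import stat


def configure_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and make kaggle.json owner read/write only."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600` on POSIX systems: owner read + write only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Failing early with `FileNotFoundError` makes a missing token visible before the download step runs.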

Set the Kaggle Dataset and download it

In [9]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry-leaves
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:00<00:00, 55.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 63.4MB/s]


Unzip the downloaded file and then delete the zip file

In [10]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
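
If the download was interrupted, the zip file may be missing or incomplete. A minimal sketch of a more defensive version of this step follows. The helper name `extract_and_remove` is an assumption for illustration. It checks the archive with `zipfile.is_zipfile` and deletes it only after extraction has finished.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return the member names."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a readable zip archive")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    os.remove(zip_path)  # only reached if extraction succeeded
    return names
```

In this notebook it could be called as `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.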

---

# Data cleaning

Check all files and remove files that are not images

In [11]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each class folder and print the counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [12]:
remove_non_image_file(my_data_dir='inputs/cherry-leaves/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
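
Because the cleaning step deletes files, it can help to preview what would be removed first. The following is a minimal dry-run sketch that uses the same extension rule as the function above. The name `list_non_image_files` is hypothetical and not part of the original notebook.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def list_non_image_files(my_data_dir):
    """Return sorted paths of files that lack an image extension, without deleting anything."""
    non_images = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for file_name in sorted(os.listdir(folder_path)):
            if not file_name.lower().endswith(IMAGE_EXTENSIONS):
                non_images.append(os.path.join(folder_path, file_name))
    return non_images
```

An empty list for `inputs/cherry-leaves/cherry-leaves` would be consistent with the zero non-image counts reported above.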


# Split the data set into train, validation and test sets

In [13]:
import math
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: sums such as 0.7 + 0.1 + 0.2 are not exactly 1.0 in floating point
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels
    labels = os.listdir(my_data_dir)  # should contain only the class folder names
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folders, each with a sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

- The training set receives 70% of the data (ratio 0.70)
- The validation set receives 15% of the data (ratio 0.15)
- The test set receives 15% of the data (ratio 0.15)

In [14]:
split_train_validation_test_images(my_data_dir='inputs/cherry-leaves/cherry-leaves',
                                    train_set_ratio=0.7,
                                    validation_set_ratio=0.15,
                                    test_set_ratio=0.15
                                    )
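
To make the resulting set sizes concrete, here is a small sketch of the arithmetic the split function performs for each class folder. The helper `split_counts` is hypothetical. It mirrors the `int()` truncation used for the train and validation counts and gives the remainder to the test set. The figure of 2104 files per class comes from the cleaning output earlier in this notebook.

```python
import math


def split_counts(n_files, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Return (train, validation, test) file counts for one class folder."""
    total = train_set_ratio + validation_set_ratio + test_set_ratio
    if not math.isclose(total, 1.0):
        raise ValueError("ratios should sum to 1.0")
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty  # remainder goes to test
    return train_qty, validation_qty, test_qty


print(split_counts(2104, 0.7, 0.15, 0.15))  # prints (1472, 315, 317)
```

Because both products are truncated, the test set ends up slightly larger than the validation set even though the two ratios are equal.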

---


# Next steps

- Carry out steps to differentiate a healthy cherry leaf from a leaf that's infected with mildew.