# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* This notebook populates inputs/datasets/raw with the downloaded images, one folder per class, and creates inputs/datasets/train, inputs/datasets/val and inputs/datasets/test, each containing the same class folders.

## Additional Comments

* The dataset consists of labelled images. The class labels are healthy and powdery_mildew; in this notebook the corresponding folders are named healthy and mildew.
* It is a balanced dataset: each class has 2104 images.
* No non-image files were found in the dataset.



---

## Import packages

In [2]:
import numpy
import os

# Change working directory

In [4]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mildew_cherry_detection/jupyter_notebooks'

In [5]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [6]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew_cherry_detection'

# Install Kaggle

In [5]:
# install kaggle package
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.4-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=09d482501a66a6f445760f48bb5a2089f3610c52428bbc2e38f4febde469e068
  Stored in directory: /home/gitpod/.cache/pip/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b3

Run the cell below to change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
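
Before downloading, it can help to confirm that the token file is actually present in the configuration directory. The following is a minimal sketch, not part of the original notebook: `check_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields. It also applies the same owner-only permissions as `chmod 600`. The demonstration uses placeholder values in a temporary directory rather than real credentials.

```python
import json
import os
import stat
import tempfile


def check_kaggle_token(config_dir):
    """Return True if kaggle.json exists in config_dir and has the expected fields.

    Assumption: the token is a JSON object with "username" and "key" entries.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    with open(token_path) as token_file:
        token = json.load(token_file)
    # Restrict the file to owner read/write, the equivalent of `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return {"username", "key"} <= set(token)


# Demonstration with a throwaway token (placeholder values, not real credentials)
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "kaggle.json"), "w") as token_file:
        json.dump({"username": "example_user", "key": "example_key"}, token_file)
    token_ok = check_kaggle_token(tmp_dir)

with tempfile.TemporaryDirectory() as empty_dir:
    token_missing = check_kaggle_token(empty_dir)

print(token_ok, token_missing)
```

In the notebook itself, the call would be `check_kaggle_token(os.getcwd())`, since `KAGGLE_CONFIG_DIR` is set to the current working directory.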

We can now download the zip file containing the datasets.

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/datasets/raw
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:02<00:00, 22.7MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 20.1MB/s]


Unzip the downloaded file, then delete the zip file and the Kaggle token JSON file.

In [5]:

! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open {DestinationFolder}/*.zip, {DestinationFolder}/*.zip.zip or {DestinationFolder}/*.zip.ZIP.

No zipfiles found.
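
In the output above, the shell received the literal text `{DestinationFolder}`, which suggests the variable was not defined in the kernel session when the cell ran (its execution count, `In [5]`, is lower than the download cell's `In [7]`). Re-running the cell that defines `DestinationFolder` first should resolve this. As an alternative that does not rely on shell interpolation, the same step can be sketched in plain Python. `extract_and_remove_zips` is a hypothetical helper name, and the demonstration runs on a temporary directory with a dummy archive.

```python
import glob
import os
import tempfile
import zipfile


def extract_and_remove_zips(destination_folder):
    """Extract every .zip found in destination_folder into it, then delete the archive.

    Returns the list of archive paths that were processed.
    """
    processed = []
    for zip_path in glob.glob(os.path.join(destination_folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(destination_folder)
        os.remove(zip_path)
        processed.append(zip_path)
    return processed


# Demonstration on a throwaway directory containing a dummy archive
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "sample.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("healthy/leaf.jpg", b"placeholder bytes, not a real image")
    processed = extract_and_remove_zips(tmp_dir)
    extracted_ok = os.path.isfile(os.path.join(tmp_dir, "healthy", "leaf.jpg"))
    zip_gone = not os.path.exists(demo_zip)

print(len(processed), extracted_ok, zip_gone)
```

In the notebook, the call would be `extract_and_remove_zips("inputs/datasets/raw")`, followed by `os.remove("kaggle.json")` to delete the token.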


---

# Data Preparation

---

## Data cleaning
Check and remove non-image files

In [3]:
os.sep

'/'

In [7]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each class folder and report counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        files = os.listdir(folder_path)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # Not an image: delete the file and count the removal
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has {image_count} image file(s)")
        print(f"Folder: {folder} - has {non_image_count} non-image file(s)")

In [8]:
remove_non_image_file(my_data_dir='inputs/datasets/raw')

Folder: healthy - has 2104 image file(s)
Folder: healthy - has 0 non-image file(s)
Folder: mildew - has 2104 image file(s)
Folder: mildew - has 0 non-image file(s)


In [9]:
os.path.join("..","inputs","datasets","raw")

'../inputs/datasets/raw'

## Split train, validation and test dataset
We will split the data into 70% training, 15% validation, and 15% test sets.

In [10]:
import os

# Define paths
data_dir = os.path.join(".","inputs","datasets","raw")
train_dir = os.path.join(".","inputs","datasets","train")
val_dir = os.path.join(".","inputs","datasets","val")
test_dir = os.path.join(".","inputs","datasets","test")

# Create directories if they don't exist
for dir_path in [train_dir, val_dir, test_dir]:
    for class_dir in ['healthy', 'mildew']:
        os.makedirs(os.path.join(dir_path, class_dir), exist_ok=True)

In [11]:
os.path.exists(os.path.join(".","inputs","datasets","raw","healthy"))

True

In [12]:
import os
import shutil
from sklearn.model_selection import train_test_split

# Helper function to split data
def split_data(source_dir, train_dir, val_dir, test_dir, test_size=0.15, val_size=0.15):
    all_files = [os.path.join(source_dir, f) for f in os.listdir(source_dir) if os.path.isfile(os.path.join(source_dir, f))]
    train_files, test_files = train_test_split(all_files, test_size=test_size, random_state=42)
    train_files, val_files = train_test_split(train_files, test_size=val_size/(1-test_size), random_state=42)

    # Copy files to their respective directories
    for file in train_files:
        shutil.copy(file, os.path.join(train_dir, os.path.basename(file)))
    for file in val_files:
        shutil.copy(file, os.path.join(val_dir, os.path.basename(file)))
    for file in test_files:
        shutil.copy(file, os.path.join(test_dir, os.path.basename(file)))

# Split healthy leaves
split_data(os.path.join(data_dir, 'healthy'), os.path.join(train_dir, 'healthy'), os.path.join(val_dir, 'healthy'), os.path.join(test_dir, 'healthy'))

# Split mildew leaves
split_data(os.path.join(data_dir, 'mildew'), os.path.join(train_dir, 'mildew'), os.path.join(val_dir, 'mildew'), os.path.join(test_dir, 'mildew'))
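
The second call to `train_test_split` uses `val_size/(1-test_size)` because it operates on the 85% of files left after the test set is removed: 0.15 / 0.85 ≈ 0.176 of that remainder equals 15% of the original total. The sketch below works through the arithmetic for one class of 2104 images. `split_sizes` is a hypothetical helper, and it assumes that scikit-learn rounds a fractional test size up to the next whole sample, so the figures are an estimate rather than counts read from the notebook's output.

```python
import math


def split_sizes(n_samples, test_size=0.15, val_size=0.15):
    """Return (train, val, test) counts for the two-stage split.

    Assumption: each held-out count is rounded up, and the remainder
    goes to the other side of the split.
    """
    n_test = math.ceil(test_size * n_samples)
    n_remaining = n_samples - n_test
    # Validation share expressed relative to what is left after the test split
    val_fraction = val_size / (1 - test_size)
    n_val = math.ceil(val_fraction * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


sizes = split_sizes(2104)
print(sizes)  # (1472, 316, 316)
```

Under that assumption, each class would contribute 1472 training, 316 validation and 316 test images, which is close to the intended 70/15/15 split.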

In [13]:
import os
data_dir
os.path.join(data_dir, r"healthy")

'./inputs/datasets/raw/healthy'

In [14]:
os.path.normpath(os.path.join(data_dir, r"healthy"))

'inputs/datasets/raw/healthy'

---

## Conclusion

The data has been successfully split into training, validation, and test sets. We now have separate directories for each set and class.
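
One way to confirm the split is to count the files in each set and class folder. The following is a minimal sketch: `count_files` is a hypothetical helper, the folder names `train`, `val`, `test`, `healthy` and `mildew` are taken from the cells above, and the demonstration runs on a temporary directory with empty placeholder files.

```python
import os
import tempfile


def count_files(base_dir, splits=("train", "val", "test"), classes=("healthy", "mildew")):
    """Return a dict mapping (split, class) to the number of entries in that folder.

    Folders that do not exist are reported with a count of 0.
    """
    counts = {}
    for split in splits:
        for label in classes:
            folder = os.path.join(base_dir, split, label)
            counts[(split, label)] = len(os.listdir(folder)) if os.path.isdir(folder) else 0
    return counts


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as base_dir:
    for split, n_files in (("train", 3), ("val", 1), ("test", 1)):
        folder = os.path.join(base_dir, split, "healthy")
        os.makedirs(folder)
        for i in range(n_files):
            open(os.path.join(folder, f"leaf_{i}.jpg"), "w").close()
    demo_counts = count_files(base_dir)

print(demo_counts)
```

In the notebook, `count_files(os.path.join("inputs", "datasets"))` would report the counts for the directories created above.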


---

## Next Steps
Next, we'll move on to data visualization. This involves:

   - Visualizing the average and variability of images per label to identify any patterns or inconsistencies.
   - Comparing average images of different labels to understand the visual differences.
   - Creating image montages to get a visual overview of the dataset, enhancing our understanding of the data's diversity and characteristics.