# **Data Collection**

## Objectives

- Download data from Kaggle and prepare it for processing

## Inputs

- kaggle.json - authentication token
- dataset - images

## Outputs

- Generated dataset: inputs/datasets/mildew_dataset
- Split dataset - train, test, validation

## Change working directory

By default, the working directory is "jupyter_notebooks", where the notebook is running. However, we need to change the working directory to its parent folder so that file references align with the broader project structure.

To do this, we first check the current working directory. The output below shows only the last two folders of the path rather than the full system path; this is intentional, to avoid exposing the full local file path.

**Any time you revisit this notebook after logging out, or open a different notebook for the first time, you must repeat these steps to ensure the working directory is always correctly set.**

In [1]:
import os
from pathlib import Path # ensure file path consistency

# Get the current working directory
current_dir = Path.cwd()

# Extract the last two directory names
filtered_path = Path(*current_dir.parts[-2:])

# Print with a folder emoji 🗂️
print(f"📂 {filtered_path}")  # Example output: 📂 mildew_detector/jupyter_notebooks

📂 mildew_detector\jupyter_notebooks


Now we change the working directory from "jupyter_notebooks" to the parent directory.

In [2]:
# Change the working directory to its parent folder
os.chdir(os.path.dirname(os.getcwd()))

# ✅ Confirmation message
print("✅ You set a new current directory")

✅ You set a new current directory


Confirm the new current directory.

In [3]:
# Get the current working directory
current_dir = Path.cwd()

# Extract the last two directory names
filtered_path = Path(*current_dir.parts[-2:])

# Print with a folder emoji 📂
print(f"📂 {filtered_path}")  # Example output: 📂 Projects/mildew_detector

📂 Projects\mildew_detector
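
The `os.chdir` cell above moves up one level every time it runs, so running it twice in the same session would leave the project root. A guarded variant can be sketched as follows. The helper name `resolve_project_root` is hypothetical, and the sketch assumes the notebooks folder is always called "jupyter_notebooks".

```python
import os
from pathlib import Path


def resolve_project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` if `cwd` is the notebooks folder, otherwise `cwd` itself."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to re-run: the directory only changes while we are still inside the notebooks folder
project_root = resolve_project_root(Path.cwd())
os.chdir(project_root)

# Show only the last two folders, as in the cells above
print(Path(*project_root.parts[-2:]))
```

With this version, re-running the cell after the directory has already been changed leaves it where it is.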


## Import Packages

In [4]:
# TODO: replace with a requirements file curated from the packages actually used
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Install Kaggle

Next we gather the data. The images are hosted on kaggle.com, so we first install the kaggle package to handle the download.

For this you need to have your Kaggle Token (kaggle.json) handy.

In [5]:
# install kaggle package
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.
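
Since the install may only take effect after a kernel restart, it can help to confirm which version is actually visible to the notebook. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the expected version is simply the one pinned in the cell above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


expected = "1.5.12"  # the version pinned in the install cell
found = installed_version("kaggle")

if found is None:
    print("kaggle is not installed in this environment")
elif found != expected:
    print(f"kaggle {found} is installed, expected {expected}")
else:
    print(f"kaggle {found} is installed")
```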


Drag and drop your kaggle.json file (Kaggle Token) into the same directory as README.md.

The code below will check that kaggle.json appears in the directory by listing its contents. You should see a list of entries in this directory, including kaggle.json.

In [6]:
print(os.listdir())  # Should list `kaggle.json`

['.git', '.gitignore', '.python-version', '.venv', 'jupyter_notebooks', 'kaggle.json', 'README.md', 'requirements.txt', 'test.py']
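
Listing the directory shows that the file is present, but not that it is usable. As a rough sketch, the check below also confirms that the token contains the `username` and `key` fields and, if so, points the `KAGGLE_CONFIG_DIR` environment variable at the current directory so the Kaggle client looks for the token there. The helper name `has_valid_token` is hypothetical, and whether setting `KAGGLE_CONFIG_DIR` is needed depends on where your token normally lives.

```python
import json
import os
from pathlib import Path


def has_valid_token(token_path: Path) -> bool:
    """Return True if `token_path` is a JSON file containing 'username' and 'key'."""
    if not token_path.is_file():
        return False
    try:
        data = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"username", "key"} <= data.keys()


token_path = Path("kaggle.json")

if has_valid_token(token_path):
    # Tell the Kaggle client to read the token from the project root
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path.cwd())
    print("kaggle.json found and looks valid")
else:
    print("kaggle.json is missing or incomplete")
```

The token contents are never printed, so the output is safe to keep in a saved notebook.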


Now we get the path for the dataset and set the destination folder where the downloaded images will be stored.

This code will download the dataset as a zip file into "inputs/cherry-leaves", creating the "inputs" and "cherry-leaves" folders if they do not already exist.

In [7]:
# Define Kaggle dataset and destination folder using pathlib
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = Path("inputs") / "cherry-leaves"  # Ensures correct path handling across OS

# Download the Kaggle dataset into the specified folder
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs\cherry-leaves

Now we need to unzip the downloaded file and get hold of the images.

The cell below will unzip the file and store the images inside a new directory within the "inputs" folder.

The code will also delete the zip file and your Kaggle Token for security.

In [None]:
import zipfile
from pathlib import Path

# Define paths
zip_file = Path("inputs") / "cherry-leaves" / "cherry-leaves.zip"
extract_folder = Path("inputs") / "cherry-leaves"
kaggle_token = Path("kaggle.json")

# Unzip the file
with zipfile.ZipFile(zip_file, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"📂 Extracted files into: {extract_folder}")

# Delete the zip file after extraction
zip_file.unlink()
print(f"🗑️ Deleted: {zip_file}")

# Remove Kaggle token for security
if kaggle_token.exists():
    kaggle_token.unlink()
    print(f"🛡️ Kaggle Token removed: {kaggle_token}")

📂 Extracted files into: inputs\cherry-leaves
🗑️ Deleted: inputs\cherry-leaves\cherry-leaves.zip
🛡️ Kaggle Token removed: kaggle.json
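
After extraction, a quick way to confirm that the images arrived is to count the files in each subfolder. This is a minimal sketch: the helper name `count_files_by_folder` is hypothetical, and it makes no assumption about which subfolders the archive contains.

```python
from collections import Counter
from pathlib import Path


def count_files_by_folder(root: Path) -> Counter:
    """Count files under `root`, keyed by their parent folder relative to `root`."""
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.parent.relative_to(root).as_posix()] += 1
    return counts


extract_folder = Path("inputs") / "cherry-leaves"

if extract_folder.is_dir():
    for folder, n_files in sorted(count_files_by_folder(extract_folder).items()):
        print(f"{folder}: {n_files} files")
else:
    print(f"Folder not found: {extract_folder}")
```

An empty result, or a folder with far fewer files than expected, suggests the download or extraction should be repeated.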
