# **1 – Data Collection**


## Objectives

* Authenticate with Kaggle and download the cherry leaf image dataset  

## Inputs

* kaggle.json authentication token 
* Cherry leaf image dataset from Kaggle

## Outputs
 
* Dataset saved to raw data folder
* Folder structure with train, validation and test data


---

# Change working directory

Set the working directory to the project root folder so that relative paths resolve from there

In [None]:
import os

project_dir = r"C:\Users\amyno\OneDrive\Documents\CherryLeafProject\milestone-project-mildew-detection-in-cherry-leaves"

os.chdir(project_dir)

print(f"✅ Current working directory is now: {os.getcwd()}")

✅ Current working directory is now: C:\Users\amyno\OneDrive\Documents\CherryLeafProject\milestone-project-mildew-detection-in-cherry-leaves


Make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [12]:
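# Read the current directory first so that current_dir is defined in this cell
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")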

You set a new current directory


Confirm the new current directory

In [13]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\amyno\\OneDrive\\Documents\\CherryLeafProject'
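
The parent-directory lookup can also be checked without changing the working directory. The following is a minimal sketch, not part of the original workflow: the helper `parent_dir` and the subfolder name `jupyter_notebooks` are hypothetical and used only for illustration.

```python
import os


def parent_dir(path: str) -> str:
    """Return the absolute parent directory of `path`."""
    return os.path.dirname(os.path.abspath(path))


# Hypothetical subfolder name, used only to demonstrate the call.
here = os.path.abspath(os.getcwd())
child = os.path.join(here, "jupyter_notebooks")

# The parent of the hypothetical subfolder is the current directory.
print(parent_dir(child) == here)
```

Because the helper only computes a path, it can be run repeatedly without moving the notebook further up the folder tree, which is a risk when a cell that calls `os.chdir()` is re-run.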

---


# Downloading the cherry leaves dataset


This section will:

* Authenticate with Kaggle using the kaggle.json API token  
* Download the cherry leaves image dataset 
* Unzip the dataset into the directory  
* Clean up by deleting the zip file and the Kaggle token for security


Install Kaggle to fetch data

In [39]:
!pip install kaggle




[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Set the `KAGGLE_CONFIG_DIR` environment variable so that Kaggle looks for the API key (kaggle.json) in the current directory

In [40]:
import os

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

Confirm that kaggle.json is in the current directory, then use the Kaggle command to download the cherry leaf dataset

In [41]:
assert os.path.exists('kaggle.json'), "kaggle.json not found in current directory"
print("✅ kaggle.json found!")

✅ kaggle.json found!
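
The existence check could be extended to confirm that the token file is readable JSON. The sketch below is one possible approach, assuming the token is a JSON object with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_token(config_dir: str) -> bool:
    """Return True if config_dir/kaggle.json exists and holds both expected fields."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    # Assumption: the token is a JSON object with "username" and "key" fields.
    return isinstance(token, dict) and all(k in token for k in ("username", "key"))
```

A call such as `check_kaggle_token(os.getcwd())` would match the directory that `KAGGLE_CONFIG_DIR` points to in this notebook.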


In [42]:
import os
print(os.getcwd())

C:\Users\amyno\OneDrive\Documents\CherryLeafProject\milestone-project-mildew-detection-in-cherry-leaves


In [43]:
!kaggle datasets download -d codeinstitute/cherry-leaves

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
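
The `!kaggle` shell call does not stop the notebook if the download fails. As a hedged alternative, the same command could be run through `subprocess` so that a non-zero exit code raises an error. The helper names below are hypothetical, and the `-p` (download path) flag is taken from the Kaggle CLI help; it is worth confirming with `kaggle datasets download -h` before relying on it.

```python
import subprocess
from typing import List


def build_download_command(dataset: str, dest: str) -> List[str]:
    """Build the Kaggle CLI call that downloads `dataset` into `dest`."""
    # Assumption: -d names the dataset and -p sets the download folder.
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest]


def download_dataset(dataset: str, dest: str) -> None:
    """Run the download and raise CalledProcessError if the CLI fails."""
    subprocess.run(build_download_command(dataset, dest), check=True)
```

For this notebook the call would be `download_dataset("codeinstitute/cherry-leaves", ".")`, which needs network access and a valid token.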


Unzip downloaded dataset into the appropriate folder

In [44]:
import zipfile

with zipfile.ZipFile("cherry-leaves.zip", "r") as zip_ref:
    zip_ref.extractall("inputs/dataset/raw")
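
The extraction step can be wrapped in a small function that validates the archive and creates the target folder first. This is a minimal sketch under those assumptions; `extract_dataset` is a hypothetical helper, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path: str, dest_dir: str) -> list:
    """Extract zip_path into dest_dir and return the sorted top-level names."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    dest = Path(dest_dir)
    # Create the destination (and any missing parents) before extracting.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

Calling `extract_dataset("cherry-leaves.zip", "inputs/dataset/raw")` would mirror the cell above while failing early on a missing or corrupt download.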

Delete the zip file and kaggle.json token after use for security

In [45]:
os.remove("cherry-leaves.zip")
os.remove("kaggle.json")
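
`os.remove()` raises `FileNotFoundError` if the cell is run a second time. A tolerant variant, sketched here with a hypothetical helper `remove_if_present`, deletes only the files that still exist and reports which ones it removed.

```python
from pathlib import Path


def remove_if_present(*paths: str) -> list:
    """Delete each existing file and return the paths that were removed."""
    removed = []
    for path in paths:
        target = Path(path)
        if target.is_file():
            target.unlink()
            removed.append(path)
    return removed
```

With this helper, `remove_if_present("cherry-leaves.zip", "kaggle.json")` can be re-run safely and returns an empty list once both files are gone.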

Check that the folder contains the expected contents

In [46]:
import os

os.listdir("inputs/dataset/raw")


['cherry-leaves']
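
Listing the folder confirms only the top-level name. A slightly deeper check could count the images in each class subfolder. The sketch below assumes one subfolder per class directly under the dataset folder; the helper `count_images_per_class` and the set of accepted file extensions are assumptions for illustration.

```python
from pathlib import Path

# Assumption: images use one of these extensions (compared case-insensitively).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root: str) -> dict:
    """Map each immediate subfolder of `root` to its number of image files."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

For this notebook the call would be `count_images_per_class("inputs/dataset/raw/cherry-leaves")`; very uneven counts between classes would be worth noting before modelling.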

---

### Credits
* The dataset used is from Code Institute on Kaggle and can be found [here](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves)

* Code from code blocks 3 through 7 was helpfully provided by Code Institute
* Code from code block 9 was inspired by Kaggle (see the README for the reference)

---

# Conclusions and next steps

## Conclusions

* Authenticated with Kaggle using the API key
* Downloaded the cherry leaves image dataset
* Unzipped and organized the dataset into the appropriate directory structure
* Deleted the Kaggle token and zip file after use

#### The dataset is now ready for analysis and modeling.

## Next steps

* Begin a visual study of the dataset to explore visual differences between healthy and diseased leaves
* Prepare insights to be included in the dashboard and support later model development
