# **Data Collection**

<p style="text-align: center;">
    <img style="width: 35%; height: 20%; float: left;" src="../assets/images/data_collection.jpg" alt="Data Collection image">
</p>

## Objectives

**1. Data Gathering:**
* Download the dataset through the Kaggle API to automate the download process.

**2. Preparing and Refining Data:**
* Perform comprehensive cleaning and preprocessing on the data to ensure its quality and readiness.

**3. Splitting and Organizing Data:**
* Divide the refined dataset into separate Train, Validation, and Test subsets, optimizing their composition for accurate model training and evaluation.

## Inputs Required

**1. Authentication file (kaggle.json):**

* The Kaggle API authentication token (`kaggle.json`) is required to access and download the data.

**2. Kaggle API Integration:**

* Utilize the Kaggle API to facilitate the systematic download and integration of the dataset.

## Generated Outputs

**1. Split Dataset Distribution:**

* The processed Train, Validation, and Test datasets are structured within the `inputs/cherry_leaves_dataset/cherry-leaves` directory.

**2. Visual representation of Data Distribution:**

* A pie chart showing how the data is distributed across the Train, Validation, and Test folders.


---

# Set up the working environment

## Install the required packages

In [1]:
! pip install -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt

Collecting typing-extensions~=3.7.4 (from tensorflow-cpu==2.6.0->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.7.1
    Uninstalling typing_extensions-4.7.1:
      Successfully uninstalled typing_extensions-4.7.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
astroid 2.15.6 requires typing-extensions>=4.0.0; python_version < "3.11", but you have typing-extensions 3.7.4.3 which is incompatible.
async-lru 2.0.4 requires typing-extensions>=4.0.0; python_version < "3.11", but you have typing-extensions 3.7.4.3 which is incompatible.
filelock 3.12.3 requires typing-extensions>=4.7.1; python_version < "3.11", but you have typ

## Import libraries

In [2]:
import numpy
import os
import matplotlib.pyplot as plt
import shutil
import random
import zipfile
print("\033[92mLibraries Imported Successfully!\033[0m")

Libraries Imported Successfully!


# Change working directory

* To keep the application's folder structure simple, we move from the current folder (the notebooks folder) to its parent folder. First, use `os.getcwd()` to get the current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks'

* To update the current directory to its parent directory, we follow these steps:

  * Use `os.path.dirname()` to obtain the parent directory.
  * Utilize `os.chdir()` to set the new current directory to the parent directory.

In [4]:
os.chdir(os.path.dirname(current_dir))
print("\033[92mYou set a new current directory!\033[0m")

You set a new current directory!


* Confirm the new current directory.

In [5]:
new_current_dir = os.getcwd()
new_current_dir

'/workspaces/mildew-detection-in-cherry-leaves'
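
Because `os.chdir()` changes the working directory for the whole kernel, re-running the cells above can move one level higher than intended. The following is a minimal sketch of an alternative that is safe to re-run. It assumes the project root is the folder that contains `requirements.txt` (as the install path used earlier suggests); the helper name `find_project_root` is hypothetical and not part of the original notebook.

```python
import os


def find_project_root(start_dir, marker="requirements.txt"):
    """Walk upwards from start_dir until a folder containing `marker` is found.

    Returns the absolute path of that folder, or None if the filesystem
    root is reached without finding the marker.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.exists(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


# Assumption: the project root is the folder that holds requirements.txt.
project_root = find_project_root(os.getcwd())
if project_root is not None:
    os.chdir(project_root)
```

Running this cell a second time leaves the working directory unchanged, since the search starts from the root it already found.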

# Set input and output directory paths

**Inputs**

In [6]:
data_dir = 'inputs/cherry_leaves_dataset/cherry-leaves'
train_path = data_dir + '/train'
validation_path = data_dir + '/validation'
test_path = data_dir + '/test'

**Outputs**

In [7]:
version = 'V_1'

file_path = f'outputs/{version}'
version_file_path = os.path.join(new_current_dir, file_path)

if os.path.exists(version_file_path):
    # The version folder already exists, so ask for a new version name.
    print(f"\033[91mVersion {version} already exists. Create a new version please! \033[0m")
else:
    # The version folder does not exist yet, so create it.
    os.makedirs(name=version_file_path)
    print(f"\033[92mVersion {version} created successfully! \033[0m")

Version V_1 created successfully!
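
When the version folder already exists, the cell above asks for a new version name. As a rough sketch, the next unused name could be suggested automatically. The helper `next_free_version` is hypothetical, and it assumes version folders follow the `V_<number>` pattern used above.

```python
import os
import re


def next_free_version(outputs_dir, prefix="V_"):
    """Return the version name one above the highest existing `prefix<number>` folder.

    If outputs_dir does not exist or holds no version folders, the result
    is `prefix` followed by 1.
    """
    numbers = []
    if os.path.isdir(outputs_dir):
        for name in os.listdir(outputs_dir):
            match = re.fullmatch(re.escape(prefix) + r"(\d+)", name)
            if match:
                numbers.append(int(match.group(1)))
    return f"{prefix}{max(numbers, default=0) + 1}"
```

For example, `next_free_version('outputs')` could replace the hard-coded `version` string when a fresh output folder is wanted on every run.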


---

# Install Kaggle Package

In [8]:
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.16.tar.gz (83 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.6/83.6 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Collecting tqdm (from kaggle)
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.6/57.6 kB[0m [31m16.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.2/78.2 kB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
Downlo

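Set the Kaggle configuration directory to the current working directory so that the Kaggle client can find `kaggle.json` there, then change the file's permissions to 600 (read and write for the owner only) so that the API token is not readable by other users.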

In [9]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
print(os.environ['KAGGLE_CONFIG_DIR'])
! chmod 600 kaggle.json

/workspaces/mildew-detection-in-cherry-leaves
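
The permission change can also be done from Python, which makes it possible to fail early with a clear message when `kaggle.json` is missing. This is a minimal sketch for POSIX systems; the helper name `secure_kaggle_credentials` is hypothetical and not part of the Kaggle package.

```python
import os
import stat


def secure_kaggle_credentials(config_dir, filename="kaggle.json"):
    """Restrict the credentials file to owner read/write (mode 600).

    Raises FileNotFoundError if the file is missing, so the problem is
    reported before any download is attempted. Returns the resulting
    permission bits.
    """
    path = os.path.join(config_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{filename} not found in {config_dir}")
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `secure_kaggle_credentials(os.environ['KAGGLE_CONFIG_DIR'])` after setting the environment variable would have the same effect as the `chmod 600` shell command above.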


## Download dataset from Kaggle API

* First, set the Kaggle dataset path and the destination folder, then download the dataset so it can be used in the following steps.

In [10]:
KaggleDatasetPath = 'codeinstitute/cherry-leaves'
DestinationFolder = 'inputs/cherry_leaves_dataset'
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 89%|█████████████████████████████████▊    | 49.0M/55.0M [00:01<00:00, 39.1MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 38.5MB/s]


* Next, extract the contents of the downloaded zip file, then delete the zip file.

In [11]:
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
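
If the download was interrupted, the zip file may be missing or corrupt, and the cell above would fail partway through. The following is a minimal sketch of a more defensive version that validates the archive first and only deletes it after a successful extraction. The helper name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns the sorted names of the top-level entries in the archive.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # The first path component of every member is a top-level entry.
        top_level = sorted({name.split("/")[0] for name in zip_ref.namelist()})
        zip_ref.extractall(destination)
    os.remove(zip_path)  # only reached if extraction succeeded
    return top_level
```

The returned list gives a quick check of which top-level folders were extracted into the destination.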