# **Data Collection**

## Objectives

- Import the necessary Python libraries: numpy for numerical operations, pandas for data handling, matplotlib and seaborn for plotting, plotly for interactive visualizations, and streamlit for web app development. Also include machine learning packages: scikit-learn for modeling, and tensorflow-cpu with keras for neural networks
- Set the working directory to ensure proper access to data files
- Fetch data from Kaggle
- Unzip the dataset
- Load the image data into a suitable structure
- Consider resizing the images to a uniform size if they vary in dimensions
- Assess and handle corrupt or missing data
- Normalize the pixel values of the images for better performance of machine learning models
- Perform any additional preprocessing needed, such as image augmentation, to improve the model's ability to generalize
- Split the dataset into training, validation, and test sets
- Ensure that the split is stratified if the dataset is imbalanced, so that each set has a representative distribution of the classes
- Document each step of the project's development process


## Inputs

- Kaggle authentication token (kaggle.json) for access to datasets on Kaggle

## Outputs  

         .
         ├── inputs   
         │   └──datasets_devided
         │      └──potatoe  
         │           ├── test  
         │           │   ├── healthy  
         │           │   ├── Black_Scurf  
         │           │   ├── Blackleg  
         │           │   ├── Common_Scab  
         │           │   ├── Dry_Rot  
         │           │   └── Pink_Rot
         │           ├── train
         │           │   ├── healthy  
         │           │   ├── Black_Scurf  
         │           │   ├── Blackleg  
         │           │   ├── Common_Scab  
         │           │   ├── Dry_Rot  
         │           │   └── Pink_Rot
         │           └── validation
         │               ├── healthy  
         │               ├── Black_Scurf  
         │               ├── Blackleg  
         │               ├── Common_Scab  
         │               ├── Dry_Rot  
         │               ├── Pink_Rot 
         │               └── Miscellaneous  
         └── ...

## Additional Comments

- The next step is data visualization, to understand the data and discover patterns.



---

# Import Packages

In [1]:
! pip install -r /workspace/Potato-Diseases-Detector/requirements.txt

Defaulting to user installation because normal site-packages is not writeable


---

# Change working directory

We store our Jupyter notebooks in a subfolder of the project. Therefore, when we run the notebooks in the editor, we need to change the working directory. This is necessary to ensure proper access to data files and other project resources that might be located outside the notebook's subfolder.

First, we check the current working directory with the os.getcwd() command.

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Potato-Diseases-Detector/jupyter_notebooks'

Then, we change the working directory from its current folder to its parent folder to facilitate the correct file path references within our notebooks.

- os.path.dirname() returns the parent directory
- os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/Potato-Diseases-Detector'
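
Running the os.chdir() cell a second time would move one more level up, out of the project folder. A minimal sketch of a guard against that, assuming the notebooks live in a folder named jupyter_notebooks as shown above (the helper name move_to_project_root is hypothetical):

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Move to the parent directory only while we are still inside the notebook folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Still in the notebooks subfolder: step up to the project root.
        os.chdir(cwd.parent)
    # Otherwise the directory was already changed, so leave it alone.
    return Path.cwd()


print(move_to_project_root())
```

Because the function checks the folder name first, re-running the cell leaves the working directory unchanged.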

---

# Load and Fetch data from Kaggle

Install the kaggle package to fetch the data.

In [5]:
%pip install kaggle

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


To run this section, you must first upload your personal kaggle.json file into the workspace; the Kaggle API uses it to authenticate your requests. In this code block, we set the KAGGLE_CONFIG_DIR environment variable to point to the project's directory, so the Kaggle client looks for the token there. We also set the permissions of kaggle.json to 600, which makes the file readable and writable by its owner only. Restricting access this way keeps the API key private from other users on the system.

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
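
The same setup can be done in pure Python, with an explicit check that the token is present. This is a minimal sketch, not part of the original notebook; the helper name prepare_kaggle_token is hypothetical, and it assumes kaggle.json sits directly in the given directory.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Check that kaggle.json exists, restrict it to the owner, and point Kaggle at it."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"kaggle.json not found in {config_dir}; upload it before running this cell"
        )
    # Owner read + write only, equivalent to `chmod 600 kaggle.json`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

In this notebook it would be called as prepare_kaggle_token(os.getcwd()) after the working directory has been set to the project root.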

Now, we'll set the path for the Kaggle dataset and create a specific directory for it. Following that, we'll execute a command through Kaggle's interface to download the dataset into this newly created directory.

In [7]:
KaggleDatasetPath = "mukaffimoin/potato-diseases-datasets"
DestinationFolder = "inputs/datasets_raw"
if not os.path.isdir(DestinationFolder):
    os.makedirs(DestinationFolder)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading potato-diseases-datasets.zip to inputs/datasets_raw
 64%|████████████████████████▎             | 8.00M/12.5M [00:00<00:00, 21.4MB/s]
100%|██████████████████████████████████████| 12.5M/12.5M [00:00<00:00, 21.3MB/s]


Unzip the downloaded file and delete the zip file

In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip

Archive:  inputs/datasets_raw/potato-diseases-datasets.zip
  inflating: inputs/datasets_raw/Black Scurf/1.jpg  
  inflating: inputs/datasets_raw/Black Scurf/10.jpg  
  inflating: inputs/datasets_raw/Black Scurf/11.jpg  
  inflating: inputs/datasets_raw/Black Scurf/12.jpg  
  inflating: inputs/datasets_raw/Black Scurf/13.jpg  
  inflating: inputs/datasets_raw/Black Scurf/14.jpg  
  inflating: inputs/datasets_raw/Black Scurf/15.jpg  
  inflating: inputs/datasets_raw/Black Scurf/16.jpg  
  inflating: inputs/datasets_raw/Black Scurf/17.jpg  
  inflating: inputs/datasets_raw/Black Scurf/18.jpg  
  inflating: inputs/datasets_raw/Black Scurf/19.jpg  
  inflating: inputs/datasets_raw/Black Scurf/2.jpg  
  inflating: inputs/datasets_raw/Black Scurf/20.jpg  
  inflating: inputs/datasets_raw/Black Scurf/21.jpg  
  inflating: inputs/datasets_raw/Black Scurf/22.jpg  
  inflating: inputs/datasets_raw/Black Scurf/23.jpg  
  inflating: inputs/datasets_raw/Black Scurf/24.jpg  
  inflating: inputs/datas
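
Where the unzip command is not available, the same step can be sketched with Python's standard zipfile module. The helper name extract_and_remove_zips is hypothetical; like the shell command above, it extracts every zip archive in the folder and then deletes the archive.

```python
import os
import zipfile


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip file in `folder` into `folder`, then delete the archive."""
    extracted = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".zip"):
            continue
        zip_path = os.path.join(folder, name)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        # Remove the archive only after extraction finished without raising.
        os.remove(zip_path)
        extracted.append(name)
    return extracted
```

In this notebook it would be called as extract_and_remove_zips(DestinationFolder).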

---

# Data Preprocessing and Cleaning

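Check for and remove all non-image files. A file counts as an image here only if its name ends in .png, .jpg, or .jpeg; the file contents are not inspected.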

In [9]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        if os.path.isdir(folder_path):  # Check that this is a directory
            files = os.listdir(folder_path)
            non_image_count = 0
            image_count = 0
            for given_file in files:
                file_path = os.path.join(folder_path, given_file)
                if not given_file.lower().endswith(image_extension):
                    os.remove(file_path)  # Delete the file because it is not an image
                    non_image_count += 1
                else:
                    image_count += 1
            print(f"Folder: {folder} - has {image_count} image files")
            print(f"Folder: {folder} - has {non_image_count} non-image files")
        else:
            print(f"{folder_path} is not a directory")

remove_non_image_file(my_data_dir='inputs/datasets_raw/')

Folder: Black Scurf - has 58 image files
Folder: Black Scurf - has 0 non-image files
Folder: Blackleg - has 60 image files
Folder: Blackleg - has 0 non-image files
Folder: Common Scab - has 62 image files
Folder: Common Scab - has 0 non-image files
Folder: Dry Rot - has 60 image files
Folder: Dry Rot - has 0 non-image files
Folder: Healthy Potatoes - has 80 image files
Folder: Healthy Potatoes - has 0 non-image files
Folder: Miscellaneous - has 74 image files
Folder: Miscellaneous - has 0 non-image files
Folder: Pink Rot - has 57 image files
Folder: Pink Rot - has 0 non-image files
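
The extension check above cannot tell whether a file with an image extension actually contains image data. As a rough complement, the sketch below flags files whose first bytes do not match the JPEG or PNG signature. It is an assumption-laden illustration rather than part of the original notebook: the helper name find_suspect_images is hypothetical, it only reports files without deleting them, and a header match does not prove the rest of the file decodes correctly.

```python
import os

# Leading bytes ("magic numbers") that identify the two formats used here.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
)


def find_suspect_images(my_data_dir):
    """Return paths of files whose header is neither a JPEG nor a PNG signature."""
    suspects = []
    for root, _dirs, files in os.walk(my_data_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                header = handle.read(8)
            if not header.startswith(IMAGE_SIGNATURES):
                suspects.append(path)
    return sorted(suspects)
```

Calling find_suspect_images('inputs/datasets_raw/') returns an empty list when every file starts with a valid signature.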


---

# Split data

Splitting Image Data into Train, Validation, Test and Miscellaneous Sets 

This script is designed to manage datasets containing multiple categories of images, each initially stored in separate folders. It systematically organizes these images into dedicated 'train', 'validation', and 'test' folders for the purposes of machine learning model development.

The script works as follows:

- It creates the target directory if it does not exist, along with 'train', 'validation', and 'test' subdirectories for each category except 'Miscellaneous'.
- It processes each category except 'Miscellaneous' and distributes its images among 'train', 'validation', and 'test' based on predefined ratios.
- It moves the original image files rather than copying them. This avoids data duplication and saves disk space, but it alters the original dataset organization.
- After the images are moved, it removes the processed source category folders, leaving only 'Miscellaneous' in the source directory.

The 'Miscellaneous' category, containing a diverse mix of healthy and diseased potato images, is set aside for the final testing phase of the model, which enables evaluation on data that simulates real-world variability and complexity. It stays in its original location and is neither moved nor altered by the script.

In [10]:
import os
import shutil
import random

def split_train_validation_test_images(source_dir, target_dir, train_set_ratio, validation_set_ratio):
    # Create the target directory if it does not exist yet
    os.makedirs(target_dir, exist_ok=True)

    # Get the category folders in the source directory, ignoring stray files
    labels = [label for label in os.listdir(source_dir)
              if os.path.isdir(os.path.join(source_dir, label))]

    # Create folder structure in the target directory
    for label in labels:
        if label == 'Miscellaneous':  # Skip creating 'Miscellaneous' in train/validation/test
            continue

        for folder in ['train', 'validation', 'test']:
            new_dir = os.path.join(target_dir, folder, label)
            os.makedirs(new_dir, exist_ok=True)

    for label in labels:
        if label == 'Miscellaneous':  # Skip 'Miscellaneous' for splitting
            continue

        files = os.listdir(os.path.join(source_dir, label))
        random.shuffle(files)

        train_end = int(len(files) * train_set_ratio)
        validation_end = train_end + int(len(files) * validation_set_ratio)

        for i, file in enumerate(files):
            src_file = os.path.join(source_dir, label, file)

            if i < train_end:
                dest = os.path.join(target_dir, 'train', label, file)
            elif i < validation_end:
                dest = os.path.join(target_dir, 'validation', label, file)
            else:
                dest = os.path.join(target_dir, 'test', label, file)

            shutil.move(src_file, dest)
        
        # Remove processed category folder
        shutil.rmtree(os.path.join(source_dir, label))

# Source directory with categories
source_dir = "inputs/datasets_raw"

# Target directory for split sets
target_dir = "inputs/datasets_devided"

# Splitting data
split_train_validation_test_images(source_dir, target_dir, 0.7, 0.15)
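
Since the files are moved rather than copied, it is worth confirming where they ended up. The sketch below counts the files in each split and category folder; the helper name count_images_per_split is hypothetical and it assumes the target_dir/split/label layout created by the function above.

```python
import os


def count_images_per_split(target_dir):
    """Return {split: {label: number_of_files}} for a target_dir/split/label layout."""
    counts = {}
    for split in sorted(os.listdir(target_dir)):
        split_path = os.path.join(target_dir, split)
        if not os.path.isdir(split_path):
            continue
        counts[split] = {}
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            if os.path.isdir(label_path):
                counts[split][label] = len(os.listdir(label_path))
    return counts
```

Calling count_images_per_split(target_dir) after the split gives one dictionary per set, which makes it easy to compare the class sizes across train, validation, and test.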

Notes  
- The training set receives 70% of the images in each category.  
- The validation set receives 15% of the images in each category.  
- The test set receives the remaining images, roughly 15%; because the split points are truncated to whole numbers, it can be slightly larger.  
- Because every category is split with the same ratios, the class proportions are approximately preserved across the three sets.  

The validation set is used to tune the model and monitor overfitting during training, while the test set is held back for a final evaluation on images the model has never seen. Allocating 15% to each keeps both evaluations meaningful on this small dataset, which matters in projects where the model's reliability and ability to generalize are a priority.
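
To make the ratios concrete, the sketch below reproduces the index arithmetic of split_train_validation_test_images for a single category. The helper name split_counts is hypothetical; it mirrors the int() truncation used in the function, so the test set receives whatever remains after the training and validation sets.

```python
def split_counts(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) sizes using the same truncation as the split script."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    # Everything not assigned to train or validation falls through to test.
    test = n_files - train - validation
    return train, validation, test


# The 58 Black Scurf images with ratios 0.7 and 0.15
print(split_counts(58, 0.7, 0.15))  # (40, 8, 10)
```

For the 58 Black Scurf images this gives 40 training, 8 validation, and 10 test images, so the test share is about 17% rather than exactly 15%.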