# Preparation of the PlantVillage Dataset (Introduction/Initial Steps)
This notebook demonstrates the initial steps to prepare the PlantVillage dataset
for training a Convolutional Neural Network (CNN) model to classify plant diseases.
We will extract the dataset, check its structure, and prepare it for further analysis.
Link to the dataset: [PlantVillage Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease)

## Requirements
Install all dependencies from the `requirements.txt` file
and verify that the environment is correctly set up.

In [1]:
%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip list

Package                      Version
---------------------------- -----------
absl-py                      2.3.1
anyio                        4.11.0
argon2-cffi                  25.1.0
argon2-cffi-bindings         25.1.0
arrow                        1.4.0
asttokens                    3.0.0
astunparse                   1.6.3
async-lru                    2.0.5
attrs                        25.4.0
babel                        2.17.0
beautifulsoup4               4.14.2
bleach                       6.3.0
cachetools                   6.2.1
certifi                      2025.10.5
cffi                         2.0.0
charset-normalizer           3.4.4
colorama                     0.4.6
comm                         0.2.3
contourpy                    1.3.2
cycler                       0.12.1
debugpy                      1.8.17
decorator                    5.2.1
defusedxml                   0.7.1
exceptiongroup               1.3.0
executing                    2.2.1
fastjsonschema               2.21.2
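
Scanning the full `pip list` output by eye is error-prone, so a more targeted check can help. The sketch below uses the standard-library `importlib.metadata` module to look up specific packages. The package names in `required` are placeholders (an assumption); replace them with the ones actually pinned in `requirements.txt`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical package names; substitute the ones listed in requirements.txt.
required = ["tensorflow", "numpy", "matplotlib"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` indicates that the install step did not complete or that the kernel needs a restart.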

## Step 1: Import dependencies and set up paths
We import all necessary libraries and configuration paths from `src/config.py`.
The configuration file keeps directory paths centralized to avoid hardcoding them across the project.

In [3]:
import zipfile
import shutil
import random
from pathlib import Path
from src.config import DATA_RAW_DIR, DATA_PROCESSED_DIR, MODELS_DIR

# Set random seed for reproducibility
random.seed(42)

print("Libraries and configuration loaded successfully.")
print(f"Raw data folder: {DATA_RAW_DIR}")
print(f"Processed data folder: {DATA_PROCESSED_DIR}")
print(f"Models folder: {MODELS_DIR}")

Libraries and configuration loaded successfully.
Raw data folder: C:\Users\Alexandre\PycharmProjects\MachineLearningProject\data\raw
Processed data folder: C:\Users\Alexandre\PycharmProjects\MachineLearningProject\data\processed
Models folder: C:\Users\Alexandre\PycharmProjects\MachineLearningProject\models
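
The contents of `src/config.py` are not shown in this notebook. As a rough illustration of the centralization idea, a config module could derive every folder from a single project root. The helper name `build_paths` is hypothetical; only the `data/raw`, `data/processed`, and `models` layout is taken from the output above.

```python
from pathlib import Path


def build_paths(project_root):
    """Derive the project's standard folders from a single root directory."""
    root = Path(project_root)
    return {
        "DATA_RAW_DIR": root / "data" / "raw",
        "DATA_PROCESSED_DIR": root / "data" / "processed",
        "MODELS_DIR": root / "models",
    }


# Illustration only: the real config may locate the project root differently.
paths = build_paths(Path.cwd())
for name, path in paths.items():
    print(f"{name} = {path}")
```

With this pattern, moving the project to another machine only requires the root to resolve correctly; no other path needs to change.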


## Step 2: Extract the PlantVillage ZIP archive
We first check if the dataset has already been extracted.
If not, we extract it into `data/processed/PlantVillage`.

In [4]:
# Ensure base folders exist
for folder in [DATA_RAW_DIR, DATA_PROCESSED_DIR, MODELS_DIR]:
    if not folder.exists():
        folder.mkdir(parents=True, exist_ok=True)
        print(f"Created missing directory: {folder.resolve()}")

# Sort the matches so the archive picked below is deterministic
zip_files = sorted(Path(DATA_RAW_DIR).glob("*.zip"))
if not zip_files:
    raise FileNotFoundError(
        f"No ZIP file found in: {DATA_RAW_DIR.resolve()}\n"
        "Please download the PlantVillage dataset from Kaggle:\n"
        "https://www.kaggle.com/datasets/emmarex/plantdisease\n"
        "Then place the ZIP file inside the /data/raw directory."
    )
else:
    raw_zip = zip_files[0]
    print(f"Found ZIP file: {raw_zip.name}")

extract_dir = Path(DATA_PROCESSED_DIR) / "PlantVillage"

if not extract_dir.exists():
    # Create temporary extraction directory
    temp_extract = Path(DATA_PROCESSED_DIR) / "temp_extract"
    if temp_extract.exists():
        shutil.rmtree(temp_extract)
    temp_extract.mkdir(parents=True, exist_ok=True)

    print("Extracting PlantVillage dataset... please wait.")
    with zipfile.ZipFile(raw_zip, "r") as z:
        z.extractall(temp_extract)

    # Check the structure in temporary directory
    temp_entries = list(temp_extract.iterdir())
    temp_subdirs = [p for p in temp_entries if p.is_dir()]
    image_exts = ('.jpg', '.jpeg', '.png')
    temp_has_images = any(p.is_file() and p.suffix.lower() in image_exts for p in temp_entries)

    # If there's a single subdirectory and no images at root, use the subdirectory content
    if len(temp_subdirs) == 1 and not temp_has_images:
        print(f"Detected nested structure. Moving content from {temp_subdirs[0].name} to avoid nesting...")
        source_dir = temp_subdirs[0]
    else:
        print("No nested structure detected.")
        source_dir = temp_extract

    # Create final extract directory and move content
    extract_dir.mkdir(parents=True, exist_ok=True)
    for item in source_dir.iterdir():
        shutil.move(str(item), extract_dir / item.name)

    # Clean up temporary directory
    shutil.rmtree(temp_extract)
    print(f"Dataset extracted successfully to: {extract_dir.resolve()}")
else:
    print(f"Dataset already extracted at: {extract_dir.resolve()}")

dataset_root = extract_dir
print(f"Dataset root set to: {dataset_root}")

Created missing directory: C:\Users\Alexandre\PycharmProjects\MachineLearningProject\data\processed
Found ZIP file: PlantVillageDataset.zip
Extracting PlantVillage dataset... please wait.
Detected nested structure. Moving content from PlantVillage to avoid nesting...
Dataset extracted successfully to: C:\Users\Alexandre\PycharmProjects\MachineLearningProject\data\processed\PlantVillage
Dataset root set to: C:\Users\Alexandre\PycharmProjects\MachineLearningProject\data\processed\PlantVillage
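
The nesting check above runs after extraction. A similar check can be done beforehand by reading the archive's member list. The sketch below is one possible way to do it with `zipfile.ZipFile.namelist()`; the helper name `top_level_entries` and the demo member names are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def top_level_entries(zip_path):
    """Return the sorted, de-duplicated first path component of every archive member."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        return sorted({PurePosixPath(name).parts[0] for name in archive.namelist()})


# Demo on a throwaway archive with made-up member names.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("PlantVillage/ClassA/img1.jpg", b"")
        archive.writestr("PlantVillage/ClassB/img2.jpg", b"")
    demo_entries = top_level_entries(demo_zip)

print(demo_entries)
```

A single top-level entry, as in the demo, suggests the archive wraps everything in one folder. In the notebook this would be called as `top_level_entries(raw_zip)`.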


## Step 4: Identify valid classes
We identify valid classes by checking which directories directly contain image files (`.jpg`, `.jpeg`, or `.png`).
This prevents us from including empty folders or system directories.

In [5]:
image_exts = ('.jpg', '.jpeg', '.png')

# Find all directories that contain images (actual classes)
potential_classes = [p for p in dataset_root.iterdir() if p.is_dir()]
valid_classes = []

for class_dir in potential_classes:
    images = [p for p in class_dir.iterdir() if p.is_file() and p.suffix.lower() in image_exts]
    if images:
        valid_classes.append(class_dir)

print(f"Found {len(valid_classes)} valid classes:")
for class_path in valid_classes:
    images_count = len([p for p in class_path.iterdir() if p.is_file() and p.suffix.lower() in image_exts])
    print(f" - {class_path.name}: {images_count} images")


Found 15 valid classes:
 - Pepper__bell___Bacterial_spot: 997 images
 - Pepper__bell___healthy: 1478 images
 - Potato___Early_blight: 1000 images
 - Potato___healthy: 152 images
 - Potato___Late_blight: 1000 images
 - Tomato_Bacterial_spot: 2127 images
 - Tomato_Early_blight: 1000 images
 - Tomato_healthy: 1591 images
 - Tomato_Late_blight: 1909 images
 - Tomato_Leaf_Mold: 952 images
 - Tomato_Septoria_leaf_spot: 1771 images
 - Tomato_Spider_mites_Two_spotted_spider_mite: 1676 images
 - Tomato__Target_Spot: 1404 images
 - Tomato__Tomato_mosaic_virus: 373 images
 - Tomato__Tomato_YellowLeaf__Curl_Virus: 3208 images
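
The per-class counts printed above can be summarized to get a total for later verification and a sense of how uneven the classes are. The snippet below hard-codes the counts from that output purely for illustration; in the notebook they would be computed from `valid_classes`.

```python
# Counts copied from the output above.
class_counts = {
    "Pepper__bell___Bacterial_spot": 997,
    "Pepper__bell___healthy": 1478,
    "Potato___Early_blight": 1000,
    "Potato___healthy": 152,
    "Potato___Late_blight": 1000,
    "Tomato_Bacterial_spot": 2127,
    "Tomato_Early_blight": 1000,
    "Tomato_healthy": 1591,
    "Tomato_Late_blight": 1909,
    "Tomato_Leaf_Mold": 952,
    "Tomato_Septoria_leaf_spot": 1771,
    "Tomato_Spider_mites_Two_spotted_spider_mite": 1676,
    "Tomato__Target_Spot": 1404,
    "Tomato__Tomato_mosaic_virus": 373,
    "Tomato__Tomato_YellowLeaf__Curl_Virus": 3208,
}

total = sum(class_counts.values())
smallest = min(class_counts, key=class_counts.get)
largest = max(class_counts, key=class_counts.get)
ratio = class_counts[largest] / class_counts[smallest]

print(f"Total images: {total}")
print(f"Smallest class: {smallest} ({class_counts[smallest]})")
print(f"Largest class: {largest} ({class_counts[largest]})")
print(f"Largest/smallest ratio: {ratio:.1f}")
```

The total is 20638 images, and the largest class holds roughly 21 times as many images as the smallest.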


## Step 5: Prepare train and validation folders
We create new `train` and `val` directories under `data/processed/PlantVillage/`,
deleting any existing ones to ensure a fresh split.

In [6]:
plantvillage_root = Path(DATA_PROCESSED_DIR) / "PlantVillage"
train_dir = plantvillage_root / "train"
val_dir = plantvillage_root / "val"

for d in (train_dir, val_dir):
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True, exist_ok=True)

print("Train and validation directories are ready.")

Train and validation directories are ready.


## Step 6: Split the dataset
We shuffle all images in each class and split them into:
* 80% training data
* 20% validation data

Then, we copy the images into their respective class folders.

In [7]:
split_ratio = 0.8

for class_path in valid_classes:
    print(f"Processing {class_path.name}...")

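    # Sort first: iterdir() order depends on the filesystem, so a fixed seed alone
    # does not make the shuffle reproducible across machines
    images = sorted(p for p in class_path.iterdir() if p.is_file() and p.suffix.lower() in image_exts)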

    random.shuffle(images)
    split_idx = int(len(images) * split_ratio)
    train_images = images[:split_idx]
    val_images = images[split_idx:]

    (train_dir / class_path.name).mkdir(parents=True, exist_ok=True)
    (val_dir / class_path.name).mkdir(parents=True, exist_ok=True)

    for img in train_images:
        shutil.copy(img, train_dir / class_path.name / img.name)
    for img in val_images:
        shutil.copy(img, val_dir / class_path.name / img.name)

    print(f"{class_path.name}: {len(train_images)} train, {len(val_images)} val")

print("Dataset split completed.")

Processing Pepper__bell___Bacterial_spot...
Pepper__bell___Bacterial_spot: 797 train, 200 val
Processing Pepper__bell___healthy...
Pepper__bell___healthy: 1182 train, 296 val
Processing Potato___Early_blight...
Potato___Early_blight: 800 train, 200 val
Processing Potato___healthy...
Potato___healthy: 121 train, 31 val
Processing Potato___Late_blight...
Potato___Late_blight: 800 train, 200 val
Processing Tomato_Bacterial_spot...
Tomato_Bacterial_spot: 1701 train, 426 val
Processing Tomato_Early_blight...
Tomato_Early_blight: 800 train, 200 val
Processing Tomato_healthy...
Tomato_healthy: 1272 train, 319 val
Processing Tomato_Late_blight...
Tomato_Late_blight: 1527 train, 382 val
Processing Tomato_Leaf_Mold...
Tomato_Leaf_Mold: 761 train, 191 val
Processing Tomato_Septoria_leaf_spot...
Tomato_Septoria_leaf_spot: 1416 train, 355 val
Processing Tomato_Spider_mites_Two_spotted_spider_mite...
Tomato_Spider_mites_Two_spotted_spider_mite: 1340 train, 336 val
Processing Tomato__Target_Spot...
T
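
The split sizes follow directly from `int(len(images) * split_ratio)`: the training count is rounded down and the remainder goes to validation. The helper below (`split_sizes` is a name introduced here for illustration) reproduces that arithmetic for three of the class sizes reported earlier.

```python
def split_sizes(n_images, split_ratio=0.8):
    """Return (train, val) sizes using the same truncating rule as the split loop."""
    split_idx = int(n_images * split_ratio)
    return split_idx, n_images - split_idx


for n in (997, 1478, 152):
    train_n, val_n = split_sizes(n)
    print(f"{n} images -> {train_n} train, {val_n} val")
```

Because `int()` truncates, classes whose size is not a multiple of 5 end up with slightly less than 80% in training. For example, 152 images give 121 train and 31 val.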

## Step 7: Verify the split
We check that both the training and validation sets contain the expected number of images.

In [8]:
def count_images(folder):
    exts = ('.jpg', '.jpeg', '.png')
    return sum(1 for p in folder.rglob("*") if p.is_file() and p.suffix.lower() in exts)

train_count = count_images(train_dir)
val_count = count_images(val_dir)

print(f"Total training images: {train_count}")
print(f"Total validation images: {val_count}")

Total training images: 16504
Total validation images: 4134
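
The totals above confirm the overall size, but not that each class was split correctly or that no file landed in both sets. A possible per-class check is sketched below. The function name `check_split` and the demo folder contents are illustrative assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = (".jpg", ".jpeg", ".png")


def image_names(folder):
    """Return the set of image file names found directly inside folder."""
    return {
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    }


def check_split(train_dir, val_dir):
    """Return {class_name: (n_train, n_val, n_overlap)} for each class folder in train_dir."""
    report = {}
    for class_dir in sorted(Path(train_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        train_names = image_names(class_dir)
        val_class_dir = Path(val_dir) / class_dir.name
        val_names = image_names(val_class_dir) if val_class_dir.is_dir() else set()
        report[class_dir.name] = (
            len(train_names),
            len(val_names),
            len(train_names & val_names),  # file names present in both sets
        )
    return report


# Demo on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    demo_train = Path(tmp) / "train" / "ClassA"
    demo_val = Path(tmp) / "val" / "ClassA"
    demo_train.mkdir(parents=True)
    demo_val.mkdir(parents=True)
    for name in ("a.jpg", "b.jpg", "c.jpg", "d.jpg"):
        (demo_train / name).touch()
    (demo_val / "e.jpg").touch()
    demo_report = check_split(Path(tmp) / "train", Path(tmp) / "val")

print(demo_report)
```

In the notebook, `check_split(train_dir, val_dir)` should report an overlap of 0 for every class, and each class's train and val counts should add up to the count listed when the valid classes were identified.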


## Conclusion
We now have a clean and well-structured dataset ready for model training.

* The dataset was extracted and split successfully.
* Train and validation directories follow the same class hierarchy.
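* Images are shuffled within each class using a fixed random seed (42) before the 80/20 split, so every class is represented in both sets in roughly the same proportion.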

Next step: load the `train` and `val` folders with `tf.keras.utils.image_dataset_from_directory()` in TensorFlow (or `torchvision.datasets.ImageFolder` in PyTorch) and start training the CNN model.
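
Directory-based loaders such as `image_dataset_from_directory()` infer one label per class subfolder, typically by sorting the folder names and numbering them from 0. The plain-Python sketch below mimics that convention to show which integer each class would receive. It is an illustration under that assumption, not the implementation of any library, and `index_image_folder` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def index_image_folder(root, exts=(".jpg", ".jpeg", ".png")):
    """Return (class_to_idx, samples), where samples is a list of (path, label) pairs."""
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(class_names)}
    samples = [
        (path, class_to_idx[name])
        for name in class_names
        for path in sorted((root / name).iterdir())
        if path.is_file() and path.suffix.lower() in exts
    ]
    return class_to_idx, samples


# Demo on a throwaway tree; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    for class_name, files in {
        "Tomato_healthy": ["t1.jpg", "t2.jpg"],
        "Potato___healthy": ["p1.jpg"],
    }.items():
        class_dir = Path(tmp) / class_name
        class_dir.mkdir()
        for file_name in files:
            (class_dir / file_name).touch()
    demo_class_to_idx, demo_samples = index_image_folder(tmp)
    demo_labels = [label for _, label in demo_samples]

print(demo_class_to_idx)
print(demo_labels)
```

Running `index_image_folder(train_dir)` and `index_image_folder(val_dir)` and comparing the two `class_to_idx` mappings is a quick way to confirm that both folders expose the same classes in the same order.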