# Exploration of the PlantVillage Dataset (Introduction/Initial Steps)
This notebook demonstrates the initial steps to explore the PlantVillage dataset.
We will extract the dataset, check its structure, and prepare it for further analysis.
Link to the dataset: [PlantVillage Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease)

### Requirements:
Install all dependencies from the `requirements.txt` file
and verify that the environment is correctly set up.

In [None]:
%pip install -r ../requirements.txt

In [None]:
%pip list

## Step 1: Import dependencies and set up paths
We import the standard-library modules we need, plus the data directory paths defined in `src/config.py`.
The configuration file keeps directory paths centralized to avoid hardcoding them across the project.

In [1]:
import zipfile
import shutil
import random
from pathlib import Path
from src.config import DATA_RAW_DIR, DATA_PROCESSED_DIR

# Set random seed for reproducibility
random.seed(42)

print("Libraries and configuration loaded successfully.")
print(f"Raw data folder: {DATA_RAW_DIR}")
print(f"Processed data folder: {DATA_PROCESSED_DIR}")

Libraries and configuration loaded successfully.
Raw data folder: C:\Users\Alexandre\PycharmProjects\PlantVillageMachineLearning\data\raw
Processed data folder: C:\Users\Alexandre\PycharmProjects\PlantVillageMachineLearning\data\processed
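
The contents of `src/config.py` are not shown in this notebook. As a rough sketch only, a module of this kind could build both folders from a single project root. Everything below except the names `DATA_RAW_DIR` and `DATA_PROCESSED_DIR` is an assumption, including the helper `build_data_dirs` and the use of the current working directory as the root:

```python
from pathlib import Path


def build_data_dirs(project_root):
    """Return (raw, processed) data folders under a hypothetical project root."""
    data_dir = Path(project_root) / "data"
    return data_dir / "raw", data_dir / "processed"


# Assumption: the current working directory stands in for the project root.
# In a real src/config.py the root could instead be derived from the module's
# own location, e.g. Path(__file__).resolve().parents[1].
DATA_RAW_DIR, DATA_PROCESSED_DIR = build_data_dirs(Path.cwd())

print(f"Raw data folder: {DATA_RAW_DIR}")
print(f"Processed data folder: {DATA_PROCESSED_DIR}")
```

Keeping the paths as `Path` objects means the rest of the notebook can join them with `/` without caring about the operating system's path separator.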


## Step 2: Extract the PlantVillage ZIP archive
We first check if the dataset has already been extracted.
If not, we extract it into `data/processed/PlantVillage`.

In [2]:
raw_zip = Path(DATA_RAW_DIR) / "PlantVillageDataset.zip"
extract_dir = Path(DATA_PROCESSED_DIR) / "PlantVillage"

if extract_dir.exists():
    print(f"Dataset already extracted at: {extract_dir.resolve()}")
else:
    # Check for the archive before creating the target folder; otherwise a
    # failed run leaves an empty folder that later looks like a finished extraction.
    if not raw_zip.exists():
        raise FileNotFoundError(f"Zip file not found at: {raw_zip.resolve()}")
    extract_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(raw_zip, "r") as z:
        z.extractall(extract_dir)
    print(f"Dataset extracted to: {extract_dir.resolve()}")

Dataset extracted to: C:\Users\Alexandre\PycharmProjects\PlantVillageMachineLearning\data\processed\PlantVillage
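
Before extracting, it can help to look at the top-level entries stored in the archive, since that shows whether the ZIP wraps everything in an extra folder. The helper below is a minimal sketch using only `zipfile.ZipFile.namelist()`; the function name `top_level_entries` is hypothetical and not part of the project:

```python
import zipfile
from pathlib import PurePosixPath


def top_level_entries(zip_path):
    """Return the sorted first path components of all members in a ZIP archive."""
    with zipfile.ZipFile(zip_path, "r") as z:
        names = z.namelist()
    # Member names inside a ZIP always use forward slashes, hence PurePosixPath.
    return sorted(
        {PurePosixPath(name).parts[0] for name in names if name.strip("/")}
    )
```

Calling `top_level_entries(raw_zip)` and getting a single folder name back would suggest that the archive unpacks into one wrapper directory.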


## Step 3: Check dataset structure
Some datasets include an extra inner folder (e.g. `PlantVillage/PlantVillage/`).
If the extraction folder contains exactly one subfolder and no image files, we use that subfolder as the dataset root.

In [3]:
entries = list(extract_dir.iterdir())
subdirs = [p for p in entries if p.is_dir()]
image_exts = ('.jpg', '.jpeg', '.png')
has_images = any(p.is_file() and p.suffix.lower() in image_exts for p in entries)

if len(subdirs) == 1 and not has_images:
    dataset_root = subdirs[0]
    print(f"Detected single inner folder: {dataset_root.name}")
else:
    dataset_root = extract_dir

print(f"Dataset root set to: {dataset_root}")

Detected single inner folder: PlantVillage
Dataset root set to: C:\Users\Alexandre\PycharmProjects\PlantVillageMachineLearning\data\processed\PlantVillage\PlantVillage
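
The check above unwraps one level of nesting. If an archive were wrapped more than once, the same test could be repeated. The following is a sketch of that variation, not the notebook's method; `find_dataset_root` and its `max_depth` parameter are hypothetical names:

```python
from pathlib import Path

IMAGE_EXTS = (".jpg", ".jpeg", ".png")


def find_dataset_root(folder, max_depth=3):
    """Descend through wrapper folders that hold one subfolder and no images."""
    root = Path(folder)
    for _ in range(max_depth):
        entries = list(root.iterdir())
        subdirs = [p for p in entries if p.is_dir()]
        has_images = any(
            p.is_file() and p.suffix.lower() in IMAGE_EXTS for p in entries
        )
        # Stop as soon as the folder looks like a real dataset level.
        if len(subdirs) != 1 or has_images:
            break
        root = subdirs[0]
    return root
```

One caveat: a dataset with a single class folder looks the same as a wrapper folder to this heuristic, so the result is worth printing and checking by eye.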


## Step 4: Prepare train and validation folders
We create `train` and `val` directories under `data/processed/`,
deleting any existing ones (and their contents) so that every run starts from a fresh split.

In [4]:
train_dir = Path(DATA_PROCESSED_DIR) / "train"
val_dir = Path(DATA_PROCESSED_DIR) / "val"

for d in (train_dir, val_dir):
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True, exist_ok=True)

print("Train and validation directories are ready.")

Train and validation directories are ready.


## Step 5: Split the dataset
We shuffle all images in each class and split them into:
* 80% training data
* 20% validation data

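Then we copy the images into their respective class folders with `shutil.copy`, leaving the extracted originals in place.
Subfolders that contain no images directly (such as the nested `PlantVillage` folder that appears in the output) are skipped, but the reported class count still includes them.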

In [5]:
split_ratio = 0.8
classes = [p for p in dataset_root.iterdir() if p.is_dir()]
print(f"Found {len(classes)} classes:")

image_exts = ('.jpg', '.jpeg', '.png')

for class_path in classes:
    print(" -", class_path.name)

    images = [p for p in class_path.iterdir() if p.is_file() and p.suffix.lower() in image_exts]
    if not images:
        print(f"No images found in {class_path.name}, skipping.")
        continue

    random.shuffle(images)
    split_idx = int(len(images) * split_ratio)
    train_images = images[:split_idx]
    val_images = images[split_idx:]

    (train_dir / class_path.name).mkdir(parents=True, exist_ok=True)
    (val_dir / class_path.name).mkdir(parents=True, exist_ok=True)

    for img in train_images:
        shutil.copy(img, train_dir / class_path.name / img.name)
    for img in val_images:
        shutil.copy(img, val_dir / class_path.name / img.name)

    print(f"{class_path.name}: {len(train_images)} train, {len(val_images)} val")

print("Dataset split completed.")

Found 16 classes:
 - Pepper__bell___Bacterial_spot
Pepper__bell___Bacterial_spot: 797 train, 200 val
 - Pepper__bell___healthy
Pepper__bell___healthy: 1182 train, 296 val
 - PlantVillage
No images found in PlantVillage, skipping.
 - Potato___Early_blight
Potato___Early_blight: 800 train, 200 val
 - Potato___healthy
Potato___healthy: 121 train, 31 val
 - Potato___Late_blight
Potato___Late_blight: 800 train, 200 val
 - Tomato_Bacterial_spot
Tomato_Bacterial_spot: 1701 train, 426 val
 - Tomato_Early_blight
Tomato_Early_blight: 800 train, 200 val
 - Tomato_healthy
Tomato_healthy: 1272 train, 319 val
 - Tomato_Late_blight
Tomato_Late_blight: 1527 train, 382 val
 - Tomato_Leaf_Mold
Tomato_Leaf_Mold: 761 train, 191 val
 - Tomato_Septoria_leaf_spot
Tomato_Septoria_leaf_spot: 1416 train, 355 val
 - Tomato_Spider_mites_Two_spotted_spider_mite
Tomato_Spider_mites_Two_spotted_spider_mite: 1340 train, 336 val
 - Tomato__Target_Spot
Tomato__Target_Spot: 1123 train, 281 val
 - Tomato__Tomato_mosaic_v
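
The split above depends on the global seed set in Step 1 and on the order in which `iterdir()` returns files, which is not guaranteed to be stable across systems. As a sketch of a more reproducible variation (not the code used in this notebook), the items can be sorted first and shuffled with a local random generator. The name `split_items` and the file names are hypothetical:

```python
import random


def split_items(items, ratio=0.8, seed=42):
    """Shuffle a sorted copy of items with a local RNG and split it at ratio."""
    ordered = sorted(items)          # fixed starting order, independent of the OS
    rng = random.Random(seed)        # local generator, global state is untouched
    rng.shuffle(ordered)
    split_idx = int(len(ordered) * ratio)
    return ordered[:split_idx], ordered[split_idx:]


# Hypothetical file names standing in for one class of 997 images.
names = [f"img_{i:04d}.jpg" for i in range(997)]
train, val = split_items(names)
print(len(train), len(val))
```

With 997 items, `int(997 * 0.8)` truncates 797.6 to 797, so this yields 797 training and 200 validation items, the same counts reported for `Pepper__bell___Bacterial_spot`.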

## Step 6: Verify the split
We count the images under `train` and `val` to confirm that the totals are consistent with an 80/20 split.

In [6]:
def count_images(folder):
    exts = ('.jpg', '.jpeg', '.png')
    return sum(1 for p in folder.rglob("*") if p.is_file() and p.suffix.lower() in exts)

train_count = count_images(train_dir)
val_count = count_images(val_dir)

print(f"Total training images: {train_count}")
print(f"Total validation images: {val_count}")

Total training images: 16504
Total validation images: 4134
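
The totals confirm the overall ratio but not the per-class balance. A per-class breakdown could be sketched as follows; `count_per_class` and `train_fraction` are hypothetical helpers, and the sketch assumes images sit directly inside each class folder, as they do after Step 5:

```python
from pathlib import Path

IMAGE_EXTS = (".jpg", ".jpeg", ".png")


def count_per_class(folder):
    """Map each class subfolder name to the number of images directly inside it."""
    counts = {}
    for class_dir in sorted(Path(folder).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for p in class_dir.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_EXTS
            )
    return counts


def train_fraction(train_counts, val_counts):
    """Return the training share per class; both sets must have the same classes."""
    if set(train_counts) != set(val_counts):
        raise ValueError("train and val contain different class folders")
    return {
        name: train_counts[name] / (train_counts[name] + val_counts[name])
        for name in train_counts
        if train_counts[name] + val_counts[name] > 0
    }
```

In this notebook it would be called as `train_fraction(count_per_class(train_dir), count_per_class(val_dir))`, and every value should be close to 0.8.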


## Conclusion
We now have a clean and well-structured dataset ready for model training.

* The dataset was extracted and split successfully.
* Train and validation directories follow the same class hierarchy.
* Shuffling each class before splitting gives every class roughly the same 80/20 ratio in both sets.

Next step: we can load the datasets with `tf.keras.utils.image_dataset_from_directory()` in TensorFlow or `torchvision.datasets.ImageFolder` in PyTorch and start training the CNN model.