# Preparation of the PlantVillage Dataset (Introduction/Initial Steps)
This notebook demonstrates the initial steps to prepare the PlantVillage dataset
for training a Convolutional Neural Network (CNN) model to classify plant diseases.
We extract the dataset, check its structure, and split it into training and validation sets.
Link to the dataset: [PlantVillage Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease)

## Requirements
Install all dependencies from the `requirements.txt` file
and verify that the environment is correctly set up.

In [None]:
%pip install -r ../requirements.txt

In [None]:
%pip list

## Step 1: Import dependencies and set up paths
We import the standard-library modules we need, plus the directory paths defined in `src/config.py`.
The configuration file keeps directory paths centralized to avoid hardcoding them across the project.

In [None]:
import zipfile
import shutil
import random
from pathlib import Path
from src.config import DATA_RAW_DIR, DATA_PROCESSED_DIR, MODELS_DIR

# Set random seed for reproducibility
random.seed(42)

print("Libraries and configuration loaded successfully.")
print(f"Raw data folder: {DATA_RAW_DIR}")
print(f"Processed data folder: {DATA_PROCESSED_DIR}")
print(f"Models folder: {MODELS_DIR}")
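
The contents of `src/config.py` are not shown in this notebook. A minimal sketch of what such a module might compute, assuming a project layout with `data/raw`, `data/processed`, and a `models` folder at the project root (the `build_paths` helper and the location of `models` are hypothetical, not taken from the project):

```python
from pathlib import Path


def build_paths(project_root):
    """Return the three project directories as Path objects (hypothetical layout)."""
    root = Path(project_root)
    return {
        "DATA_RAW_DIR": root / "data" / "raw",
        "DATA_PROCESSED_DIR": root / "data" / "processed",
        "MODELS_DIR": root / "models",  # assumption: models live at the project root
    }


# Illustration only: treat the current working directory as the project root.
paths = build_paths(Path.cwd())
for name, path in paths.items():
    print(f"{name}: {path}")
```

Keeping these as `Path` objects is what lets the later cells call `.exists()`, `.mkdir()`, and `.resolve()` on them directly.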

## Step 2: Extract the PlantVillage ZIP archive
We locate the ZIP archive in `data/raw`, then check whether the dataset has already been extracted.
If not, we extract it into `data/processed/PlantVillage`.

In [None]:
# Ensure base folders exist
for folder in [DATA_RAW_DIR, DATA_PROCESSED_DIR, MODELS_DIR]:
    if not folder.exists():
        folder.mkdir(parents=True, exist_ok=True)
        print(f"Created missing directory: {folder.resolve()}")

# Sort so the same archive is picked on every run if several ZIP files are present
zip_files = sorted(Path(DATA_RAW_DIR).glob("*.zip"))
if not zip_files:
    raise FileNotFoundError(
        f"No ZIP file found in: {Path(DATA_RAW_DIR).resolve()}\n"
        "Please download the PlantVillage dataset from Kaggle:\n"
        "https://www.kaggle.com/datasets/emmarex/plantdisease\n"
        "Then place the ZIP file inside the data/raw directory."
    )

raw_zip = zip_files[0]
print(f"Found ZIP file: {raw_zip.name}")

extract_dir = Path(DATA_PROCESSED_DIR) / "PlantVillage"

if not extract_dir.exists():
    extract_dir.mkdir(parents=True, exist_ok=True)
    print("Extracting PlantVillage dataset... please wait.")
    with zipfile.ZipFile(raw_zip, "r") as z:
        z.extractall(extract_dir)
    print(f"Dataset extracted successfully to: {extract_dir.resolve()}")
else:
    print(f"Dataset already extracted at: {extract_dir.resolve()}")
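
The cell above treats an existing `PlantVillage` folder as a finished extraction, so an interrupted run could leave a partial dataset that is never repaired. One possible safeguard, sketched here under the assumption that a temporary sibling folder is acceptable, is to extract there first and rename it only on success. The function name `extract_zip_once` is hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_zip_once(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already exists.

    Returns True if an extraction happened, False if it was skipped.
    """
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    if target_dir.exists():
        return False

    target_dir.parent.mkdir(parents=True, exist_ok=True)
    # Extract next to the final location so the rename stays on one filesystem.
    tmp_dir = Path(tempfile.mkdtemp(prefix=target_dir.name + "_", dir=target_dir.parent))
    try:
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(tmp_dir)
        tmp_dir.rename(target_dir)  # only a complete extraction gets the final name
    except BaseException:
        # Remove the partial folder so the next run starts clean.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return True
```

In this notebook it would be called as `extract_zip_once(raw_zip, extract_dir)`.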

## Step 3: Check dataset structure
Some archives unpack into an extra inner folder (e.g. `PlantVillage/PlantVillage/`).
If the extraction folder contains exactly one subfolder and no images, we move that subfolder's contents up one level.

In [None]:
entries = list(extract_dir.iterdir())
subdirs = [p for p in entries if p.is_dir()]
image_exts = ('.jpg', '.jpeg', '.png')
has_images = any(p.is_file() and p.suffix.lower() in image_exts for p in entries)

if len(subdirs) == 1 and not has_images:
    inner = subdirs[0]
    print(f"Detected extra inner folder: {inner.name}. Flattening...")
    for item in inner.iterdir():
        shutil.move(str(item), extract_dir / item.name)
    inner.rmdir()
else:
    print("No flatten needed.")

dataset_root = extract_dir
print(f"Dataset root set to: {dataset_root}")
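
Once the root is settled, it can help to see how many images each class folder holds before splitting. A small sketch, assuming the one-folder-per-class layout used in this notebook (the helper name `count_images_per_class` is illustrative and not part of the project):

```python
from pathlib import Path

IMAGE_EXTS = (".jpg", ".jpeg", ".png")


def count_images_per_class(root):
    """Map each immediate subfolder name to the number of image files directly inside it."""
    counts = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTS
        )
    return counts
```

Calling `count_images_per_class(dataset_root)` returns a dict keyed by folder name; folders with zero or very few images are worth a look before the split.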

## Step 4: Prepare train and validation folders
We create new `train` and `val` directories under `data/processed/PlantVillage/`,
deleting any existing ones to ensure a fresh split.

In [None]:
plantvillage_root = Path(DATA_PROCESSED_DIR) / "PlantVillage"
train_dir = plantvillage_root / "train"
val_dir = plantvillage_root / "val"

for d in (train_dir, val_dir):
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True, exist_ok=True)

print("Train and validation directories are ready.")

## Step 5: Split the dataset
We shuffle all images in each class and split them into:
* 80% training data
* 20% validation data

Then we copy the images into the matching class folders under `train` and `val`; the original files stay in place.

In [None]:
split_ratio = 0.8
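# train/ and val/ live inside the dataset root, so exclude them from the class list
classes = sorted(p for p in dataset_root.iterdir() if p.is_dir() and p not in (train_dir, val_dir))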
print(f"Found {len(classes)} classes:")

image_exts = ('.jpg', '.jpeg', '.png')

for class_path in classes:
    print(" -", class_path.name)

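    # Sort first: iterdir() order depends on the filesystem, so the seed alone does not make the shuffle reproducible
    images = sorted(p for p in class_path.iterdir() if p.is_file() and p.suffix.lower() in image_exts)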
    if not images:
        print(f"No images found in {class_path.name}, skipping.")
        continue

    random.shuffle(images)
    split_idx = int(len(images) * split_ratio)
    train_images = images[:split_idx]
    val_images = images[split_idx:]

    (train_dir / class_path.name).mkdir(parents=True, exist_ok=True)
    (val_dir / class_path.name).mkdir(parents=True, exist_ok=True)

    for img in train_images:
        shutil.copy(img, train_dir / class_path.name / img.name)
    for img in val_images:
        shutil.copy(img, val_dir / class_path.name / img.name)

    print(f"{class_path.name}: {len(train_images)} train, {len(val_images)} val")

print("Dataset split completed.")
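
The split arithmetic can be checked in isolation. The sketch below is not the notebook's cell but a standalone variant under two assumptions: it sorts the input first, and it uses a local `random.Random(seed)` instead of the global seed, so the result does not depend on directory listing order or on earlier `random` calls. The name `split_items` is hypothetical.

```python
import random


def split_items(items, ratio=0.8, seed=42):
    """Shuffle a sorted copy of items and cut it at int(len(items) * ratio)."""
    ordered = sorted(items)
    random.Random(seed).shuffle(ordered)
    cut = int(len(ordered) * ratio)
    return ordered[:cut], ordered[cut:]


names = [f"leaf_{i:02d}.jpg" for i in range(10)]
train, val = split_items(names)
print(len(train), len(val))  # 8 2
```

Because `int()` truncates, small classes lean toward validation: a class with a single image ends up entirely in the validation set.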

## Step 6: Verify the split
We count the images in the training and validation sets; the totals should be close to 80% and 20% of the dataset.

In [None]:
def count_images(folder):
    exts = ('.jpg', '.jpeg', '.png')
    return sum(1 for p in folder.rglob("*") if p.is_file() and p.suffix.lower() in exts)

train_count = count_images(train_dir)
val_count = count_images(val_dir)

print(f"Total training images: {train_count}")
print(f"Total validation images: {val_count}")
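
Totals alone do not show whether the two sets match class by class, or whether a file ended up in both. A minimal sketch of a stricter check, assuming the `train/<class>/` and `val/<class>/` layout produced above (`image_names` and `find_split_problems` are hypothetical helpers):

```python
from pathlib import Path

IMAGE_EXTS = (".jpg", ".jpeg", ".png")


def image_names(class_dir):
    """Return the set of image file names directly inside one class folder."""
    return {
        p.name
        for p in Path(class_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    }


def find_split_problems(train_dir, val_dir):
    """Return a list of problem descriptions; an empty list means no issue was found."""
    train_dir, val_dir = Path(train_dir), Path(val_dir)
    train_classes = {p.name for p in train_dir.iterdir() if p.is_dir()}
    val_classes = {p.name for p in val_dir.iterdir() if p.is_dir()}

    problems = []
    # Classes present in only one of the two sets.
    for name in sorted(train_classes ^ val_classes):
        problems.append(f"class '{name}' is missing from one of the two sets")
    # File names that appear in both sets of the same class.
    for name in sorted(train_classes & val_classes):
        shared = image_names(train_dir / name) & image_names(val_dir / name)
        if shared:
            problems.append(f"class '{name}' has {len(shared)} file(s) in both sets")
    return problems
```

With the directories from Step 4, `find_split_problems(train_dir, val_dir)` should return an empty list after a clean split.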

## Conclusion
We now have a clean and well-structured dataset ready for model training.

* The dataset was extracted and split successfully.
* Train and validation directories follow the same class hierarchy.
* Images are shuffled within each class before splitting, so both sets keep roughly the same class proportions.

Next step: we can load the datasets with `image_dataset_from_directory()` in TensorFlow/Keras or `torchvision.datasets.ImageFolder` in PyTorch, and start training the CNN model.