## Importing Required Libraries and Defining Constants

This cell imports the libraries used in the notebook and defines two constants:
- `TARGET_SIZE = (64, 64)`: the fixed size to which every image is resized.
- `DATASET_DIR`: the path to the PlantVillage color dataset directory.

In [1]:
import os
import cv2
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

TARGET_SIZE = (64, 64)
DATASET_DIR = "/kaggle/input/plantvillage-dataset/color"

## Loading and Preprocessing Image Data

The `load_data()` function walks the PlantVillage directory structure and converts each image into a fixed-length feature vector that classical machine learning models can use.

Steps involved:
- Each subfolder is treated as one class, and its name is stored as the label.
- Images are read with OpenCV (which loads color images in BGR channel order) and resized to a fixed resolution of 64×64.
- Pixel values are scaled to the range 0 to 1 by dividing by 255.
- Images are flattened into 1D feature vectors.
- Files that OpenCV cannot read are skipped.

The function finally returns:
- `X`: NumPy array of image feature vectors
- `y`: NumPy array of class labels

This preprocessing is what makes the images usable by traditional machine learning models, which expect every sample to be a numeric vector of the same length.

In [2]:
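def load_data(directory):
    """Load images from class subfolders; return flattened features and labels."""
    X = []
    y = []

    # Each subfolder name is used as the class label
    for label in os.listdir(directory):
        label_path = os.path.join(directory, label)
        if not os.path.isdir(label_path):
            continue

        for img_name in os.listdir(label_path):
            img_path = os.path.join(label_path, img_name)

            # cv2.imread returns None for files it cannot decode; skip those
            img = cv2.imread(img_path)
            if img is None:
                continue

            img = cv2.resize(img, TARGET_SIZE)
            img = img / 255.0  # scale pixel values to [0, 1]

            vector = img.flatten()  # 64 * 64 * 3 = 12,288 values

            X.append(vector)
            y.append(label)

    return np.array(X), np.array(y)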


## Dataset Loading and Shape Verification

In this cell, we load the dataset using the `load_data()` function and verify the dimensions of the extracted features and labels.

- Each image is converted into a 12,288-dimensional feature vector.
- The number of samples matches the number of labels.

The vector length follows from the image shape:

64 (height) × 64 (width) × 3 (color channels) = 12,288

Each image is resized to 64 by 64 pixels for consistency, and since it is a color image, each pixel has three channel values (blue, green, and red, in the order OpenCV stores them). So one image becomes 64×64×3 values, which are flattened into a 12,288-dimensional feature vector.

Checking the shapes confirms that loading and preprocessing produced arrays of the expected size.
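
The arithmetic above can be checked on a synthetic image. This is a minimal sketch that assumes only NumPy; the random array `fake_img` is a hypothetical stand-in for a resized image, not real dataset content.

```python
import numpy as np

TARGET_SIZE = (64, 64)

# Hypothetical stand-in for a resized color image: height x width x 3 channels, uint8.
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(TARGET_SIZE[1], TARGET_SIZE[0], 3), dtype=np.uint8)

scaled = fake_img / 255.0   # same normalization as in load_data
vector = scaled.flatten()   # row-major flattening into one feature vector

print(vector.shape)         # (12288,)
print(vector.min() >= 0.0, vector.max() <= 1.0)

# Flattening loses no information: reshaping restores the original layout.
restored = vector.reshape(fake_img.shape)
assert np.array_equal(restored, scaled)
```

Because flattening is just a reshape, the spatial layout is still recoverable, but the model itself sees only an unordered list of 12,288 numbers.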



In [3]:
X, y = load_data(DATASET_DIR)

print("Feature shape:", X.shape)
print("Labels shape:", y.shape)


Feature shape: (54305, 12288)
Labels shape: (54305,)
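
The reported shape also implies a sizable memory footprint, because dividing a `uint8` image by `255.0` yields `float64` values. A rough back-of-the-envelope sketch follows; the helper `array_gib` is a hypothetical name introduced here for illustration.

```python
def array_gib(n_samples, n_features, bytes_per_value):
    """Approximate size of a dense 2-D array in GiB."""
    return n_samples * n_features * bytes_per_value / 2**30

n_samples, n_features = 54305, 64 * 64 * 3

print(f"float64: {array_gib(n_samples, n_features, 8):.2f} GiB")  # float64: 4.97 GiB
print(f"float32: {array_gib(n_samples, n_features, 4):.2f} GiB")  # float32: 2.49 GiB
```

If memory is tight, storing the features as `float32` would roughly halve the footprint. That is an option to consider, not something this notebook does.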


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

## Why `stratify=y` Is Used in Train-Test Split

When splitting the dataset into training and testing sets, it is important to preserve the original class distribution. Many real-world datasets, including plant disease datasets, are imbalanced, so a purely random split can over- or under-sample small classes.

Using `stratify=y` keeps each class at approximately the same proportion in both the training and testing sets.

### Example

If a disease class represents:
- **10% of the total dataset**

Then after applying a stratified train-test split:
- Approximately **10% of the training data** will belong to that disease
- Approximately **10% of the testing data** will belong to that disease

This helps ensure fair evaluation and prevents rare classes from being underrepresented or missing in the test set.
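
The effect can be seen on synthetic labels. The sketch below assumes scikit-learn and uses made-up class names (`"rare"`, `"common"`) and placeholder features; it is an illustration, not part of the PlantVillage pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10% "rare", 90% "common" (illustrative only).
y_demo = np.array(["rare"] * 100 + ["common"] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = np.mean(y_tr == "rare")
test_share = np.mean(y_te == "rare")
print(f"rare share in train: {train_share:.3f}, in test: {test_share:.3f}")
```

Both shares should come out at or very close to 0.1. Dropping `stratify=y_demo` would let them drift apart from split to split.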


In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
# Relative frequency of the five most common classes in the training set
pd.Series(y_train).value_counts(normalize=True).head()


Orange___Haunglongbing_(Citrus_greening)    0.101395
Tomato___Tomato_Yellow_Leaf_Curl_Virus      0.098656
Soybean___healthy                           0.093730
Peach___Bacterial_spot                      0.042284
Tomato___Bacterial_spot                     0.039177
Name: proportion, dtype: float64

## Baseline Model using Dummy Classifier

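A Dummy Classifier is used as a baseline model.

- With `strategy="most_frequent"`, it ignores the input features and always predicts the most frequent class in the training labels.
- Its accuracy therefore equals that class's share of the test set, which gives a reference that any trained model must outperform.

The baseline accuracy is low (about 10%) because no single class dominates the dataset, so a model has to learn from the pixels to do better.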
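
What the `most_frequent` strategy does can be written out by hand. This is a plain-Python sketch on made-up labels (`"A"`, `"B"`, `"C"`), not the scikit-learn implementation itself.

```python
from collections import Counter

# Illustrative labels only; not taken from the dataset.
y_train_demo = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
y_test_demo = ["A"] * 5 + ["B"] * 3 + ["C"] * 2

# The baseline learns a single constant: the majority class of the training labels.
majority_class, _ = Counter(y_train_demo).most_common(1)[0]
preds = [majority_class] * len(y_test_demo)

accuracy = sum(p == t for p, t in zip(preds, y_test_demo)) / len(y_test_demo)
print(majority_class, accuracy)  # A 0.5
```

The accuracy is simply the fraction of test samples that belong to the majority class, which is why a stratified split makes the baseline match the class proportion seen in training.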


In [7]:
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_scaled, y_train)

dummy_preds = dummy.predict(X_test_scaled)
dummy_acc = accuracy_score(y_test, dummy_preds)

print("Dummy Baseline Accuracy:", dummy_acc)


Dummy Baseline Accuracy: 0.1014639535954332


## Random Forest Classifier Training and Evaluation

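A Random Forest Classifier is trained on the image feature vectors.

- It is an ensemble learning method that combines the predictions of multiple decision trees.
- Each tree is trained on a bootstrap sample and considers only a random subset of features at each split, which makes the ensemble less prone to overfitting than a single tree and keeps training feasible on 12,288 features.

The model reaches about 66.8% test accuracy, well above the 10.1% baseline, which shows that even raw pixel features carry useful signal for a traditional machine learning model.

Here, `n_estimators` controls the number of trees, `random_state` makes the results reproducible, and `n_jobs=-1` trains the trees in parallel on all available CPU cores. The forest is fitted on the unscaled `X_train`: tree splits depend only on the ordering of feature values, so standardization is not needed for this model.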
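
How the trees are combined can be checked on a toy problem. In scikit-learn, the forest's class probabilities are documented as the mean of the per-tree probabilities; the sketch below verifies that on synthetic data from `make_classification` (illustrative only, unrelated to the PlantVillage features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic 3-class problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand...
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# ...and compare with what the forest itself reports.
matches = np.allclose(tree_probas, forest.predict_proba(X_demo))
print("Averaged tree probabilities match the forest:", matches)
```

The predicted class is the one with the highest averaged probability, so adding trees smooths out the errors of individual trees rather than changing what each tree learns.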


In [8]:
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

rf_acc = accuracy_score(y_test, rf_preds)
print("Random Forest Accuracy:", rf_acc)


Random Forest Accuracy: 0.6683546634748182
