# Project 1 — Vegetable Image Classification

This notebook contains the project code scaffold for the class assignment.
Goals: build and evaluate two supervised models (CNN and Decision Tree) using two different feature representations (raw images for CNN, handcrafted color/texture features for Decision Tree).
Follow the submission rules: convert this notebook to `.py` and place the `.py` files in a single folder named `{family_name}_{first_name}_{group}` (no subfolders). The report (PDF) goes into `{family_name}_{first_name}_{group}_doc`. Do not include the dataset in the submission zip.

## Environment and requirements
Install the required packages in a virtual environment first (the commands are in the README), then run the following cell to confirm that the imports work and to print the installed versions.

In [5]:
# Minimal imports and versions
import sys
print('Python', sys.version)
try:
    import tensorflow as tf; print('TensorFlow', tf.__version__)
except Exception as e:
    print('TensorFlow not available:', e)
import sklearn; print('scikit-learn', sklearn.__version__)
import cv2
import numpy as np
import matplotlib.pyplot as plt

Python 3.11.3 (tags/v3.11.3:f3909b8, Apr  4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)]
TensorFlow 2.18.0
scikit-learn 1.4.1.post1


## 1) Download dataset (local-only)
The dataset must NOT be included in the final zip. Use the following code to download and prepare the data locally. The snippet uses `kagglehub`; adapt it if you prefer the Kaggle CLI or a manual download.

In [1]:
# Example: download dataset using kagglehub
# NOTE: this will place dataset files on your local disk. DO NOT include dataset files in the submission zip.
try:
    import kagglehub
    path = kagglehub.dataset_download("misrakahmed/vegetable-image-dataset")
    print('Path to dataset files:', path)
except Exception as e:
    print('kagglehub not available or failed:', e)
    # Alternative instructions: use Kaggle CLI or manually download from Kaggle website.

Downloading from https://www.kaggle.com/api/v1/datasets/download/misrakahmed/vegetable-image-dataset?dataset_version_number=1...


100%|██████████| 534M/534M [00:16<00:00, 34.7MB/s] 

Extracting files...





Path to dataset files: C:\Users\iTarantula PC\.cache\kagglehub\datasets\misrakahmed\vegetable-image-dataset\versions\1
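
The later cells expect `train`, `validation` and `test` subfolders under `DATA_DIR`, but the layout of the downloaded archive is not shown in this notebook. As a minimal sketch, the hypothetical helper `find_split_root` below walks the download folder and returns the first directory that contains all three split folders. The helper name and the assumption that the three folders share one parent are illustrative, not part of the dataset's documentation.

```python
import os


def find_split_root(root, splits=("train", "validation", "test")):
    """Return the first directory under `root` that contains all `splits`, or None."""
    for current, dirnames, _ in os.walk(root):
        if all(s in dirnames for s in splits):
            return current
    return None


# Usage (assumes `path` is the folder printed by kagglehub above):
# DATA_DIR = find_split_root(path)
# if DATA_DIR is None:
#     raise FileNotFoundError('No folder with train/validation/test found under ' + path)
```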


## 2) Project contract (inputs/outputs & evaluation)
- Inputs: image files from the dataset; for Decision Tree also extracted feature vectors (color histograms, texture).
- Outputs: trained models, confusion matrices, hyperparameter tuning plots, short report (PDF).
- Success: code runs end-to-end locally, produces saved model files and evaluation plots.
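
One way to make the success criterion checkable is a small script that reports which expected output files are still missing after a run. This is only a sketch: `missing_outputs` is a hypothetical helper, and the two file names are the ones written by the training cell later in this notebook (extend the list with your plot files if you save them).

```python
from pathlib import Path


def missing_outputs(output_dir, expected_files):
    """Return the expected file names that are absent from `output_dir`."""
    out = Path(output_dir)
    return [name for name in expected_files if not (out / name).is_file()]


# File names taken from the training cell later in this notebook
EXPECTED = ["cnn_model.h5", "decision_tree_best.pkl"]

# Usage after an end-to-end run:
# missing = missing_outputs(".", EXPECTED)
# print("All outputs present" if not missing else f"Missing: {missing}")
```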

## 3) Data loading and preprocessing (images)
Load the images, resize them to a consistent size (e.g., 128x128), and scale pixel values to the range [0, 1]. Create a train/validation/test split (for example 70/15/15) and fix it with a random seed so that it is reproducible.

In [2]:
import os, glob
from sklearn.model_selection import train_test_split

DATA_DIR = 'path_to_downloaded_dataset'  # replace with actual path printed by kagglehub
IMAGE_SIZE = (128, 128)
RANDOM_SEED = 42

def load_image_paths(data_dir):
    # Sort once so that label index i always corresponds to classes[i]
    classes = sorted(d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d)))
    paths = []
    labels = []
    for i, cls in enumerate(classes):
        cls_dir = os.path.join(data_dir, cls)
        files = sorted(glob.glob(os.path.join(cls_dir, '*')))  # sorted for a reproducible order
        for f in files:
            paths.append(f)
            labels.append(i)
    return paths, labels, classes

# Example usage (uncomment and set DATA_DIR)
# paths, labels, classes = load_image_paths(DATA_DIR)
# train_paths, test_paths, y_train, y_test = train_test_split(paths, labels, test_size=0.3, random_state=RANDOM_SEED, stratify=labels)
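
The commented example above produces a 70/30 train/test split only. A 70/15/15 split can be obtained by calling `train_test_split` twice, as in the sketch below. The function name `split_70_15_15` is hypothetical; the seed and the stratification follow the conventions already used in this cell.

```python
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # same seed as in the cell above


def split_70_15_15(paths, labels, seed=RANDOM_SEED):
    """Stratified 70/15/15 split done in two stages."""
    # Stage 1: keep 70% for training, hold out 30%
    train_p, hold_p, train_y, hold_y = train_test_split(
        paths, labels, test_size=0.30, random_state=seed, stratify=labels
    )
    # Stage 2: split the held-out 30% in half -> 15% validation, 15% test
    val_p, test_p, val_y, test_y = train_test_split(
        hold_p, hold_y, test_size=0.50, random_state=seed, stratify=hold_y
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)


# Usage:
# (train_paths, y_train), (val_paths, y_val), (test_paths, y_test) = split_70_15_15(paths, labels)
```

Because both calls use the same seed and stratify on the labels, rerunning the cell gives the same three subsets with roughly the same class proportions in each.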

## 4) Feature representations (two types)
We will create two distinct representations: (A) raw images resized and fed to a CNN; (B) handcrafted numeric features (e.g., color histograms + simple texture descriptors) fed to a Decision Tree. These are different feature types as required by the rubric.

In [6]:
import cv2
import numpy as np

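def extract_color_histogram(image_path, size=IMAGE_SIZE, bins=(8,8,8)):
    """Return a flattened, normalized 3D HSV color histogram (8*8*8 = 512 values by default)."""
    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None instead of raising on unreadable files
        raise ValueError(f'Could not read image: {image_path}')
    image = cv2.resize(image, size)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    # OpenCV stores 8-bit hue in [0, 180); saturation and value in [0, 256)
    hist = cv2.calcHist([hsv], [0,1,2], None, bins, [0,180,0,256,0,256])
    cv2.normalize(hist, hist)
    return hist.flatten()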

def extract_features_for_paths(paths):
    feats = [extract_color_histogram(p) for p in paths]
    return np.vstack(feats)

# Example: X_handcrafted = extract_features_for_paths(train_paths)
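
The cell above implements only the color histogram. For the "simple texture descriptor" mentioned in the section text, one possible choice (an assumption, not something prescribed by the assignment) is a histogram of gradient magnitudes computed on the grayscale image. The sketch below uses NumPy only; the function name `gradient_texture_histogram` and the bin count are illustrative.

```python
import numpy as np


def gradient_texture_histogram(gray, bins=16):
    """Normalized histogram of gradient magnitudes for a 2-D grayscale array (values 0-255)."""
    gray = np.asarray(gray, dtype=np.float32)
    gy, gx = np.gradient(gray)               # finite differences along rows and columns
    magnitude = np.hypot(gx, gy)
    magnitude = np.clip(magnitude, 0.0, 255.0)  # keep every pixel inside the histogram range
    hist, _ = np.histogram(magnitude, bins=bins, range=(0.0, 255.0))
    return hist.astype(np.float32) / max(hist.sum(), 1)


# Usage sketch inside a feature extractor (assumes `image` is the resized BGR array):
# gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# features = np.concatenate([color_hist, gradient_texture_histogram(gray)])
```

Smooth surfaces put most of their mass in the first bins, while rough or strongly patterned surfaces spread it over higher bins, which gives the Decision Tree a signal that color alone does not provide.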

In [None]:
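# --- Run full pipeline on workspace dataset (train/validation/test assumed in ./data)
# NOTE: this cell calls build_simple_cnn, train_decision_tree and plot_cm, which are defined
# in sections 5-7 below; run those cells first (and move them above this code in the exported .py).
# Set DATA_DIR to the workspace copy of the dataset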
DATA_DIR = 'data'
train_dir = os.path.join(DATA_DIR, 'train')
val_dir = os.path.join(DATA_DIR, 'validation')
test_dir = os.path.join(DATA_DIR, 'test')

# Quick sanity checks and class counts
import glob

def get_class_counts(folder):
    classes = sorted([d for d in os.listdir(folder) if os.path.isdir(os.path.join(folder, d))])
    counts = {c: len(glob.glob(os.path.join(folder, c, '*'))) for c in classes}
    return classes, counts

train_classes, train_counts = get_class_counts(train_dir)
val_classes, val_counts = get_class_counts(val_dir)
test_classes, test_counts = get_class_counts(test_dir)

print('Number of classes (train):', len(train_classes))
print('Example class counts (train):')
for k in list(train_counts.keys())[:10]:
    print('  ', k, train_counts[k])
print('Totals - train:', sum(train_counts.values()), 'val:', sum(val_counts.values()), 'test:', sum(test_counts.values()))

# Build path lists and tf.data pipelines for images
try:
    import tensorflow as tf
    TF_AVAILABLE = True
except Exception:
    TF_AVAILABLE = False
    print('TensorFlow not available; CNN training cells will fail unless you install TensorFlow.')

IMG_SIZE = IMAGE_SIZE if 'IMAGE_SIZE' in globals() else (128, 128)
BATCH_SIZE = 32

def paths_labels_from_dir(base_dir):
    pts = []
    labs = []
    classes = sorted([d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d))])
    cls_to_idx = {c: i for i, c in enumerate(classes)}
    for c in classes:
        files = glob.glob(os.path.join(base_dir, c, '*'))
        for f in files:
            pts.append(f)
            labs.append(cls_to_idx[c])
    return pts, labs, classes

train_paths, train_labels, classes = paths_labels_from_dir(train_dir)
val_paths, val_labels, _ = paths_labels_from_dir(val_dir)
test_paths, test_labels, _ = paths_labels_from_dir(test_dir)
print('Found classes:', classes)

if TF_AVAILABLE:
    AUTOTUNE = tf.data.AUTOTUNE
    def preprocess_image(path):
        image = tf.io.read_file(path)
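        # expand_animations=False guarantees a 3-D tensor even if a GIF slips into the dataset
        image = tf.image.decode_image(image, channels=3, expand_animations=False)
        image.set_shape([None, None, 3])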
        image = tf.image.resize(image, IMG_SIZE)
        image = tf.cast(image, tf.float32) / 255.0
        return image

    def make_dataset(paths, labels, batch_size=BATCH_SIZE, shuffle=True):
        ds = tf.data.Dataset.from_tensor_slices((paths, labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(paths), seed=RANDOM_SEED)
        ds = ds.map(lambda p, l: (preprocess_image(p), l), num_parallel_calls=AUTOTUNE)
        ds = ds.batch(batch_size).prefetch(AUTOTUNE)
        return ds

    train_ds = make_dataset(train_paths, train_labels)
    val_ds = make_dataset(val_paths, val_labels, shuffle=False)
    test_ds = make_dataset(test_paths, test_labels, shuffle=False)
else:
    train_ds = val_ds = test_ds = None

# --- CNN training (small/quick by default) ---
RUN_QUICK = True
EPOCHS = 3 if RUN_QUICK else 20

if TF_AVAILABLE:
    print('Building and training CNN for', EPOCHS, 'epochs (RUN_QUICK=', RUN_QUICK, ')')
    cnn = build_simple_cnn(input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3), num_classes=len(classes))
    history = cnn.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
    cnn.save(os.path.join('.', 'cnn_model.h5'))

    # Evaluate on test and plot confusion matrix
    import numpy as np
    y_pred_batches = []
    y_true_batches = []
    for x_batch, y_batch in test_ds:
        preds = cnn.predict(x_batch, verbose=0)  # verbose=0 avoids one progress bar per batch
        y_pred_batches.append(np.argmax(preds, axis=1))
        y_true_batches.append(y_batch.numpy())
    y_pred = np.concatenate(y_pred_batches)
    y_true = np.concatenate(y_true_batches)
    plot_cm(y_true, y_pred, classes, 'CNN Confusion Matrix')
else:
    print('Skipping CNN training because TensorFlow is not available.')

# --- Handcrafted features (color histograms) and Decision Tree training ---
print('\nExtracting handcrafted features for Decision Tree...')
X_train = extract_features_for_paths(train_paths)
X_val = extract_features_for_paths(val_paths)
X_test = extract_features_for_paths(test_paths)
y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

# Optionally subsample for speed if RUN_QUICK
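if RUN_QUICK:
    from sklearn.utils import resample
    sample_n = min(2000, X_train.shape[0])
    # replace=False draws a true subsample (the default would bootstrap with duplicates);
    # stratify keeps the class proportions of the full training set
    X_train_sample, y_train_sample = resample(X_train, y_train, n_samples=sample_n, replace=False, random_state=RANDOM_SEED, stratify=y_train)
else:
    X_train_sample, y_train_sample = X_train, y_train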

print('Training Decision Tree (grid search) on', X_train_sample.shape[0], 'samples')
dt_clf = train_decision_tree(X_train_sample, y_train_sample)
print('Best params:', dt_clf.best_params_)

# Evaluate Decision Tree on test set
from sklearn.metrics import accuracy_score
y_pred_dt = dt_clf.predict(X_test)
print('Decision Tree test acc:', accuracy_score(y_test, y_pred_dt))
plot_cm(y_test, y_pred_dt, classes, 'Decision Tree Confusion Matrix')

# Save the decision tree model (best estimator)
import joblib
joblib.dump(dt_clf.best_estimator_, 'decision_tree_best.pkl')
print('Saved Decision Tree as decision_tree_best.pkl')


## 5) Model 1 — CNN (raw images)
A simple CNN implemented with Keras. We'll keep the model small to run on CPU but you can scale it up for GPU.

In [7]:
from tensorflow.keras import layers, models, optimizers

def build_simple_cnn(input_shape=(128,128,3), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Example: model = build_simple_cnn(input_shape=(128,128,3), num_classes=len(classes))
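
The model has few layers, but most of its parameters sit in the first `Dense` layer that follows `Flatten`. The arithmetic sketch below reproduces the shape and parameter calculation for the architecture above, assuming the Keras defaults used there (`padding='valid'`, stride 1 for convolutions, pooling stride equal to the pool size). The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output size of max pooling with stride equal to the pool size."""
    return size // pool


def simple_cnn_param_count(side=128, channels=3, num_classes=10):
    """Parameter counts for the Conv(32)-Pool-Conv(64)-Pool-Dense(128)-Dense model."""
    s = pool_out(conv_out(side))          # 128 -> 126 -> 63
    conv1 = 3 * 3 * channels * 32 + 32    # weights + biases
    s = pool_out(conv_out(s))             # 63 -> 61 -> 30
    conv2 = 3 * 3 * 32 * 64 + 64
    flat = s * s * 64                     # 30 * 30 * 64 = 57600
    dense1 = flat * 128 + 128
    dense2 = 128 * num_classes + num_classes
    return {"flatten": flat, "conv1": conv1, "conv2": conv2,
            "dense1": dense1, "dense2": dense2,
            "total": conv1 + conv2 + dense1 + dense2}


counts = simple_cnn_param_count()
print(counts["dense1"], "of", counts["total"], "parameters are in the first Dense layer")
```

If this turns out to be too heavy for CPU training, one option is to add another Conv/Pool pair or to replace `Flatten` with `layers.GlobalAveragePooling2D()`, both of which shrink the input to the first `Dense` layer.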

## 6) Model 2 — Decision Tree (handcrafted features)
Train a Decision Tree on the extracted color/texture features. We'll perform grid search for `max_depth` and `min_samples_split` as the tuning example.

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

def train_decision_tree(X_train, y_train):
    param_grid = {'max_depth': [5, 10, 20, None], 'min_samples_split': [2,5,10]}
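    # Fix the tree's random_state so repeated grid searches give the same result
    clf = GridSearchCV(DecisionTreeClassifier(random_state=RANDOM_SEED), param_grid, cv=3, scoring='accuracy', n_jobs=1)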
    clf.fit(X_train, y_train)
    return clf

# Example usage: dt_clf = train_decision_tree(X_handcrafted_train, y_train)
# print('Best params:', dt_clf.best_params_)
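
The project contract lists hyperparameter tuning plots as an output, and the fitted `GridSearchCV` object already stores the numbers needed for them in `cv_results_`. As a sketch, the hypothetical helper `best_score_per_value` below collapses the grid to the best mean cross-validation score for each value of one parameter. It relies only on the `params` and `mean_test_score` entries of `cv_results_`; the output file name in the usage comment is illustrative.

```python
def best_score_per_value(cv_results, param_name):
    """Best mean CV score for each value of `param_name` in a GridSearchCV cv_results_ dict."""
    best = {}
    for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
        key = str(params[param_name])     # str() so that None can be used as an axis label
        best[key] = max(best.get(key, float("-inf")), float(score))
    return best


# Usage sketch (assumes dt_clf is the fitted GridSearchCV returned by train_decision_tree):
# import matplotlib.pyplot as plt
# scores = best_score_per_value(dt_clf.cv_results_, "max_depth")
# plt.bar(list(scores.keys()), list(scores.values()))
# plt.xlabel("max_depth"); plt.ylabel("best mean CV accuracy")
# plt.savefig("dt_tuning_max_depth.png", dpi=150, bbox_inches="tight")
```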

## 7) Evaluation and confusion matrix
Produce confusion matrices for both models and plot them. Save plots and include them in the report.

In [9]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

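def plot_cm(y_true, y_pred, labels, title='Confusion matrix', save_path=None):
    disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=labels, cmap='Blues', xticks_rotation='vertical')
    plt.title(title)
    if save_path is not None:  # optionally write the figure to disk for the report
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()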

# Example usage after prediction: plot_cm(y_test, y_pred_dt, classes, 'Decision Tree CM')
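
A confusion matrix is easier to discuss in the report when it is reduced to a per-class number. The sketch below builds the matrix with NumPy and derives the recall of each class (the diagonal divided by the row sums). The function name `per_class_recall` is hypothetical, and it assumes integer labels in the range `0 .. num_classes - 1`, as produced by the loading code in this notebook.

```python
import numpy as np


def per_class_recall(y_true, y_pred, num_classes):
    """Return (confusion matrix, recall per class) for integer labels."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                     # rows: true class, columns: predicted class
    support = cm.sum(axis=1)
    # Classes with no true samples get recall 0 instead of a division error
    recall = np.divide(np.diag(cm), support, out=np.zeros(num_classes), where=support > 0)
    return cm, recall


# Usage sketch: list the three classes the Decision Tree recognizes worst
# cm, recall = per_class_recall(y_test, y_pred_dt, len(classes))
# for i in np.argsort(recall)[:3]:
#     print(classes[i], round(float(recall[i]), 3))
```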

## 8) Comparison with literature & report notes
Add a short paragraph comparing your best result with a method from the literature. Cite a paper or blog and briefly discuss differences (dataset, preprocessing, model capacity).

## 9) Export to .py and submission checklist
Use nbconvert or the helper script in the project to convert the notebook to a single `.py` file. Place the `.py` file in the submission folder (no subfolders). Write a PDF report of at least 2 pages (excluding figures) and put it into the `{family_name}_{first_name}_{group}_doc` folder.

In [10]:
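# Example programmatic export (also see export_notebook.py in the repo).
# Save the notebook to disk first: an empty .ipynb file (for example, one that has never
# been saved) fails with "Notebook does not appear to be JSON".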
try:
    import nbformat
    from nbconvert import PythonExporter
    nb = nbformat.read('project1_vegetable_images.ipynb', as_version=4)
    exporter = PythonExporter()
    source, meta = exporter.from_notebook_node(nb)
    with open('project1_vegetable_images.py', 'w', encoding='utf-8') as f:
        f.write(source)
    print('Exported to project1_vegetable_images.py')
except Exception as e:
    print('Programmatic export failed, you can run: jupyter nbconvert --to script project1_vegetable_images.ipynb', e)

Programmatic export failed, you can run: jupyter nbconvert --to script project1_vegetable_images.ipynb Notebook does not appear to be JSON: ''
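
To reduce the risk of shipping dataset files by accident, the archive can be built from an explicit whitelist rather than by zipping a whole directory. The sketch below is one way to do that with the standard library. It assumes that the code folder and the `_doc` folder sit side by side in the zip, which should be checked against the assignment rules. The function name `build_submission_zip` is hypothetical.

```python
import zipfile
from pathlib import Path


def build_submission_zip(code_dir, doc_dir, zip_path):
    """Zip only the .py files from code_dir and the .pdf files from doc_dir."""
    code_dir, doc_dir = Path(code_dir), Path(doc_dir)
    # Non-recursive globs: subfolders and image files are never picked up
    files = sorted(code_dir.glob("*.py")) + sorted(doc_dir.glob("*.pdf"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Keep the parent folder name so the archive preserves the two-folder layout
            zf.write(f, arcname=f"{f.parent.name}/{f.name}")
    return [f"{f.parent.name}/{f.name}" for f in files]


# Usage (replace the placeholders with your own folder names):
# build_submission_zip("{family_name}_{first_name}_{group}",
#                      "{family_name}_{first_name}_{group}_doc",
#                      "submission.zip")
```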
