# Final Model Evaluation on V2 (Cropped) Dataset

This notebook loads the two final models trained on the refined V2 dataset and evaluates their performance on the test set.

In [None]:
import numpy as np
import tensorflow as tf
from pathlib import Path
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# --- Configuration ---
PROJECT_ROOT = Path.cwd().parent # Assumes notebook is in /notebooks
TEST_DIR = PROJECT_ROOT / "data/test"
SCRATCH_MODEL_PATH = PROJECT_ROOT / "saved_models/scratch_model.keras"
TRANSFER_MODEL_PATH = PROJECT_ROOT / "saved_models/transfer_model.keras"
IMAGE_SIZE = (150, 150)
BATCH_SIZE = 32

# --- Load Test Data ---
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
    TEST_DIR,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=False
)

class_names = list(test_generator.class_indices.keys())

## 1. Evaluation of Model #1-V2 (Custom CNN from Scratch)

In [None]:
# Load the model
scratch_model = tf.keras.models.load_model(SCRATCH_MODEL_PATH)

# Generate predictions
y_pred_scratch = np.argmax(scratch_model.predict(test_generator), axis=1)
y_true = test_generator.classes

# Print Classification Report
print("Classification Report for Scratch Model (V2 Data):")
print(classification_report(y_true, y_pred_scratch, target_names=class_names))

# Display Confusion Matrix
cm_scratch = confusion_matrix(y_true, y_pred_scratch)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_scratch, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix - Scratch Model (V2 Data)')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### Analysis of Scratch Model (V2)

The scratch model reached an overall accuracy of **63%** on the test set. This is an improvement over the V1 baseline, but performance remains limited.

The confusion matrix reveals a key weakness: heavy confusion between `paper` and `scissors`. Of the test images whose true label was `scissors`, 24 were predicted as `paper`. The model has learned some distinctive features (the closed fist gives `rock` high precision), but it struggles to separate the two open-fingered gestures. This illustrates how hard it is to train a deep network from scratch on a limited dataset, even a clean one.
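The counts discussed above can be read off the confusion matrix programmatically instead of from the heatmap. The following is a minimal sketch, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes. The helpers `confusion_count` and `per_class_precision_recall` are hypothetical names introduced here, and the toy matrix is made-up data, not a result from this notebook.

```python
import numpy as np

def confusion_count(cm, class_names, actual, predicted):
    """Number of samples whose true class is `actual` and predicted class is `predicted`."""
    i = class_names.index(actual)      # row index = actual class
    j = class_names.index(predicted)   # column index = predicted class
    return int(cm[i, j])

def per_class_precision_recall(cm):
    """Per-class precision and recall computed directly from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # column sums: TP + FP
    actual_totals = cm.sum(axis=1)     # row sums: TP + FN
    precision = np.divide(tp, predicted_totals,
                          out=np.zeros_like(tp), where=predicted_totals > 0)
    recall = np.divide(tp, actual_totals,
                       out=np.zeros_like(tp), where=actual_totals > 0)
    return precision, recall

# Toy example (illustrative numbers only)
toy_names = ["paper", "rock", "scissors"]
toy_cm = np.array([[8, 1, 1],
                   [0, 9, 1],
                   [5, 0, 5]])

toy_misses = confusion_count(toy_cm, toy_names, "scissors", "paper")
toy_precision, toy_recall = per_class_precision_recall(toy_cm)
```

In this notebook, the equivalent call would be `confusion_count(cm_scratch, class_names, "scissors", "paper")`, assuming the class folders under `TEST_DIR` use those names.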

## 2. Evaluation of Model #2-V2 (Transfer Learning)

In [None]:
# Load the model
transfer_model = tf.keras.models.load_model(TRANSFER_MODEL_PATH)

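# Generate predictions (reset the generator so batches start from the first sample
# and stay aligned with y_true after the earlier predict() call)
test_generator.reset()
y_pred_transfer = np.argmax(transfer_model.predict(test_generator), axis=1)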

# Print Classification Report
print("Classification Report for Transfer Learning Model (V2 Data):")
print(classification_report(y_true, y_pred_transfer, target_names=class_names))

# Display Confusion Matrix
cm_transfer = confusion_matrix(y_true, y_pred_transfer)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_transfer, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix - Transfer Model (V2 Data)')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### Analysis of Transfer Learning Model (V2)

The transfer learning model achieved a final accuracy of **74%**. Although this is numerically lower than the 83% achieved on the flawed V1 data, it is the more meaningful result.

The model is now **more balanced and robust**. Notably, the F1-score for the previously problematic `scissors` class has improved to **0.75**. With the background "cheat" removed from the V1 data, the model had to solve the harder, correct problem. The 74% score is therefore more likely to reflect features of the hand gestures themselves, as suggested by the more even performance across classes. The broader lesson is that a lower accuracy on the correct problem is more valuable than a higher accuracy on the wrong one.
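One way to make "more balanced" concrete is to compare the spread of per-class F1-scores between two models. The sketch below is illustrative only: `per_class_f1` and `balance_summary` are hypothetical helpers, and both toy matrices are made-up numbers chosen so that one model has higher accuracy but a wider gap between classes. It uses the identity F1 = 2·TP / (2·TP + FP + FN), with rows as actual classes and columns as predicted classes.

```python
import numpy as np

def per_class_f1(cm):
    """Per-class F1 from a confusion matrix (rows = actual, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    # Column sum is TP + FP, row sum is TP + FN, so their total is 2*TP + FP + FN.
    denom = cm.sum(axis=0) + cm.sum(axis=1)
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

def balance_summary(cm):
    """Accuracy plus simple indicators of how evenly a model performs across classes."""
    cm = np.asarray(cm, dtype=float)
    f1 = per_class_f1(cm)
    return {
        "accuracy": float(np.trace(cm) / cm.sum()),
        "macro_f1": float(f1.mean()),
        "min_f1": float(f1.min()),
        "f1_range": float(f1.max() - f1.min()),
    }

# Toy matrices (illustrative numbers only, not results from this notebook)
lopsided_cm = np.array([[10, 0, 0],
                        [0, 10, 0],
                        [5, 0, 5]])   # higher accuracy, one weak class
balanced_cm = np.array([[7, 1, 2],
                        [1, 8, 1],
                        [2, 1, 7]])   # lower accuracy, even performance

lopsided = balance_summary(lopsided_cm)
balanced = balance_summary(balanced_cm)
```

Applied to `cm_transfer` and to the corresponding V1 confusion matrix (not loaded in this notebook), a smaller `f1_range` and a higher `min_f1` on V2 would support the claim of more even performance.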