# Notebook 3: Model Performance Evaluation

**Objective:** To quantitatively assess the performance of our two trained models on the unseen test dataset and determine the superior architecture.

This is the final step in the modeling process, and arguably the most important: a model is only useful if it generalizes to data it has never seen. Here, we load the saved models from the previous notebook and evaluate them on the test set using several key metrics.

The evaluation will include:
- **Training History Plots:** Learning curves for accuracy and loss over epochs (generated during training in Notebook 2).
- **Classification Report:** A detailed breakdown of precision, recall, and F1-score for each class.
- **Confusion Matrix:** A heatmap of actual versus predicted classes, showing which classes the model confuses with one another.
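
For reference, the per-class metrics in the classification report follow the standard definitions below, where $TP$, $FP$, and $FN$ are the true positives, false positives, and false negatives for a given class:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

As an illustration with made-up numbers: if a model labels 10 images as `paper` and 8 of them really are `paper`, its precision for `paper` is 8/10 = 0.8.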

### Setup: Imports and Configuration

We import the necessary libraries for evaluation, including TensorFlow for loading models, Scikit-learn for metrics, and Matplotlib/Seaborn for plotting.

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import logging

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import classification_report, confusion_matrix

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S')
logger = logging.getLogger(__name__)

# Configuration
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"
SAVED_MODELS_DIR = PROJECT_ROOT / "saved_models" / "v2_cropped"
RESULTS_DIR = PROJECT_ROOT / "results" / "v2_cropped"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# Model Hyperparameters
IMAGE_SIZE = (150, 150)
BATCH_SIZE = 32

### Part 1: Load Data and Models

First, we create a data generator for the `test` set. This data must **not be augmented**, and `shuffle` must be set to `False`: `test_generator.classes` lists the true labels in file order, so the model's predictions only line up with it if the generator yields images in that same order.

Then, we load the two `.keras` model files we saved during training.
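
The effect of a mismatched order can be seen in a small toy sketch. The labels below are hypothetical and no model is involved; reversing the array simply stands in for any ordering that differs from the label order.

```python
import numpy as np

# Hypothetical true labels, in the order the files appear on disk.
y_true = np.array([0, 0, 1, 1, 2, 2])

# Pretend the model is perfect: every prediction equals the true label.
perfect_preds = y_true.copy()

# Same order as the labels (what shuffle=False gives us): accuracy is 1.0.
aligned_accuracy = np.mean(perfect_preds == y_true)

# Same predictions, but arriving in a different order than the labels.
# Reversing stands in for whatever order a shuffled generator would produce.
reordered_preds = perfect_preds[::-1]
misaligned_accuracy = np.mean(reordered_preds == y_true)

print(f"aligned:    {aligned_accuracy:.2f}")
print(f"misaligned: {misaligned_accuracy:.2f}")
```

Even a perfect model appears to perform poorly once predictions and labels are out of step, which is why the generator below disables shuffling.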

In [None]:
# Create test data generator (no augmentation, no shuffling)
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
    directory=DATA_DIR / "test",
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=False
)

# Load the trained models
logger.info("Loading saved models...")
scratch_model = tf.keras.models.load_model(SAVED_MODELS_DIR / "scratch_model.keras")
transfer_model = tf.keras.models.load_model(SAVED_MODELS_DIR / "transfer_model.keras")
logger.info("Models loaded successfully.")

### Part 2: Evaluation Helper Function

To keep our code clean and reusable, we reuse the evaluation function from `src/train.py`, reproduced in the next cell. It runs the full evaluation for any given model:

1.  It uses the model to `predict` class probabilities for the entire test set.
2.  It converts these probabilities into final class predictions using `argmax`.
3.  It generates and prints a `classification_report` from Scikit-learn.
4.  It generates and plots a `confusion_matrix` using Seaborn.
5.  It saves both the report and the matrix plot under the given `save_path_prefix` (here, inside the `results/` directory).
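
Steps 2 to 4 can be tried without a trained model. The sketch below uses hand-written probabilities as a stand-in for `model.predict` output; the class order and all values are assumptions chosen for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Assumed class order (alphabetical, as flow_from_directory would assign it).
class_names = ["paper", "rock", "scissors"]

# Hypothetical true labels and predicted probabilities for six images.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred_probs = np.array([
    [0.7, 0.2, 0.1],  # paper    -> predicted paper
    [0.3, 0.1, 0.6],  # paper    -> predicted scissors (an error)
    [0.1, 0.8, 0.1],  # rock     -> predicted rock
    [0.2, 0.7, 0.1],  # rock     -> predicted rock
    [0.1, 0.1, 0.8],  # scissors -> predicted scissors
    [0.2, 0.2, 0.6],  # scissors -> predicted scissors
])

# Step 2: pick the most probable class for each image.
y_pred = np.argmax(y_pred_probs, axis=1)

# Step 3: per-class precision, recall, and F1-score.
report = classification_report(y_true, y_pred, target_names=class_names)

# Step 4: rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(report)
print(cm)
```

The single off-diagonal entry in the printed matrix (row `paper`, column `scissors`) corresponds to the one misclassified image.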

In [None]:
# This function is from src/train.py

def evaluate_model(model, test_generator, save_path_prefix):
    """Evaluates the model and saves the classification report and confusion matrix."""
    model_name = Path(save_path_prefix).name
    logger.info(f"Evaluating Model: {model_name}")
    class_names = list(test_generator.class_indices.keys())

    # Get predictions
    y_pred_probs = model.predict(test_generator)
    y_pred_indices = np.argmax(y_pred_probs, axis=1)
    y_true_indices = test_generator.classes

    # Classification report
    report = classification_report(y_true_indices, y_pred_indices, target_names=class_names)
    print("Classification Report:")
    print(report)
    report_path = f"{save_path_prefix}_classification_report.txt"
    with open(report_path, "w") as f:
        f.write(report)
    logger.info(f"Classification report saved to {report_path}")

    # Confusion matrix
    cm = confusion_matrix(y_true_indices, y_pred_indices)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    cm_save_path = f"{save_path_prefix}_confusion_matrix.png"
    plt.savefig(cm_save_path)
    logger.info(f"Confusion matrix saved to {cm_save_path}")
    plt.show()
    plt.close()

*(Note: The training history plots are generated during the training phase in Notebook 2. The resulting `.png` files are also saved in the `results/` directory.)*
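
For readers who want to regenerate such curves, the following is a minimal sketch, not the code used in Notebook 2. It assumes the training history is available as a dictionary with the keys `accuracy`, `val_accuracy`, `loss`, and `val_loss` (the shape of Keras's `History.history` when accuracy is tracked with validation data). The function name `plot_history` and the example values are hypothetical.

```python
import matplotlib.pyplot as plt


def plot_history(history, title):
    """Plot accuracy and loss curves from a Keras-style history dict."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))

    # Left panel: training vs. validation accuracy per epoch.
    ax_acc.plot(history["accuracy"], label="train")
    ax_acc.plot(history["val_accuracy"], label="validation")
    ax_acc.set_title(f"Accuracy - {title}")
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.legend()

    # Right panel: training vs. validation loss per epoch.
    ax_loss.plot(history["loss"], label="train")
    ax_loss.plot(history["val_loss"], label="validation")
    ax_loss.set_title(f"Loss - {title}")
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    fig.tight_layout()
    return fig


# Made-up values for three epochs, purely to exercise the function.
example_history = {
    "accuracy": [0.50, 0.65, 0.72],
    "val_accuracy": [0.48, 0.60, 0.68],
    "loss": [1.10, 0.85, 0.70],
    "val_loss": [1.15, 0.95, 0.80],
}
fig = plot_history(example_history, "example")
```

A widening gap between the training and validation curves is the usual sign of overfitting to look for in these plots.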

### Part 3: Evaluating Model #1 (Scratch CNN)

In [None]:
evaluate_model(
    model=scratch_model,
    test_generator=test_generator,
    save_path_prefix=str(RESULTS_DIR / "scratch_model")
)

### Part 4: Evaluating Model #2 (Transfer Learning)

In [None]:
evaluate_model(
    model=transfer_model,
    test_generator=test_generator,
    save_path_prefix=str(RESULTS_DIR / "transfer_model")
)

### Conclusion and Final Model Selection

Based on the evaluation metrics:

- **Scratch Model:** Achieved an accuracy of **78%**. It performed well on `rock` and `scissors` but struggled with `paper`, which it often misclassified as `scissors`. This is a strong result for a model trained from scratch on a relatively small dataset, and it suggests that our data curation pipeline was effective.

- **Transfer Learning Model:** Achieved an accuracy of **82%**, meeting the project's success criterion. It showed superior and more balanced performance across all classes, particularly improving the recall for `scissors` and `rock`.

**Decision:** The **Transfer Learning (MobileNetV2) model is the clear winner.** It is more accurate (82% vs. 78%) and more balanced across classes, so it is selected as the final model for the interactive game application.
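
A side-by-side comparison of accuracy and per-class recall can be scripted with Scikit-learn's `accuracy_score` and `recall_score`. The sketch below uses hypothetical predictions for two placeholder models, A and B. The numbers are illustrative only and are not the results reported above; in practice the inputs would be the `y_true_indices` and `y_pred_indices` arrays computed inside `evaluate_model`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Assumed class order and hypothetical labels/predictions (not project results).
class_names = ["paper", "rock", "scissors"]
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
preds_a = np.array([0, 2, 2, 1, 1, 1, 2, 2, 2])  # placeholder model A
preds_b = np.array([0, 0, 2, 1, 1, 1, 2, 2, 2])  # placeholder model B


def summarize(y_true, y_pred):
    """Return overall accuracy and an array of per-class recall values."""
    accuracy = accuracy_score(y_true, y_pred)
    per_class_recall = recall_score(y_true, y_pred, average=None)
    return accuracy, per_class_recall


acc_a, recall_a = summarize(y_true, preds_a)
acc_b, recall_b = summarize(y_true, preds_b)

print(f"accuracy: A={acc_a:.2f}  B={acc_b:.2f}")
for name, r_a, r_b in zip(class_names, recall_a, recall_b):
    print(f"recall {name:>8}: A={r_a:.2f}  B={r_b:.2f}")
```

Printing the two models' recall values in adjacent columns makes it easy to see which classes account for a difference in overall accuracy.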