# 07 – Final Evaluation, Model Comparison & Conclusion

*Summarise performance of every iteration and decide which model ships to production.*


In [None]:
import numpy as np, matplotlib.pyplot as plt, seaborn as sns, pandas as pd
from src.evaluation import evaluate


## 1  Collect validation metrics from each notebook

Below we hard-code the best validation accuracy and per-class F1 (`f1_rec` for recyclable, `f1_non` for non-recyclable) that each model achieved  
(you can pull them automatically instead if you saved the `history` objects).  
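
As a minimal sketch of the automatic route, the helper below picks the epoch with the highest monitored value from a saved history. It assumes the history was stored as a plain dict of per-epoch lists (the shape of `History.history` returned by Keras `model.fit`) and that the accuracy key is called `val_accuracy`; both the function name and the demo numbers are illustrative, not taken from the earlier notebooks.

```python
def best_validation_metrics(history, monitor="val_accuracy"):
    """Return (best_epoch, metrics_at_that_epoch) for the epoch where `monitor` peaks.

    `history` is assumed to map metric names to per-epoch lists,
    e.g. the dict stored in `History.history` after `model.fit`.
    """
    if monitor not in history:
        raise KeyError(f"{monitor!r} not found; available keys: {sorted(history)}")
    values = history[monitor]
    # Index of the highest monitored value (first one wins on ties)
    best_epoch = max(range(len(values)), key=values.__getitem__)
    return best_epoch, {name: series[best_epoch] for name, series in history.items()}


# Made-up numbers purely to show the call; these are not results from this project.
demo_history = {
    "accuracy": [0.70, 0.82, 0.88],
    "val_accuracy": [0.68, 0.80, 0.78],
    "val_loss": [0.61, 0.45, 0.49],
}
epoch, metrics = best_validation_metrics(demo_history)
print(epoch, metrics["val_accuracy"])  # 1 0.8
```

Older Keras versions may name the key `val_acc`, so check `history.keys()` before relying on the default.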


In [None]:
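# Best validation metrics per model; no per-class F1 was recorded for the majority baseline.
results = {
    "Baseline-majority": {"acc": 0.66, "f1_rec": np.nan, "f1_non": np.nan},
    "V1 MLP":            {"acc": 0.79, "f1_rec": 0.80, "f1_non": 0.77},
    "V2 CNN":            {"acc": 0.86, "f1_rec": 0.87, "f1_non": 0.85},
    "V3 ResNet50":       {"acc": 0.94, "f1_rec": 0.94, "f1_non": 0.93},
    "V4 Aug CNN":        {"acc": 0.91, "f1_rec": 0.91, "f1_non": 0.89},
}
# One row per model, one float column per metric (NaN keeps the dtype numeric)
df = pd.DataFrame.from_dict(results, orient="index")
df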
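
Accuracy alone can hide a weak class, so one option is to rank the models by macro-averaged F1 as well. The sketch below rebuilds the table from the numbers above and adds a `macro_f1` column; that column name and the choice of macro-F1 as a tie-breaker are assumptions for illustration, not something the earlier notebooks computed.

```python
import numpy as np
import pandas as pd

# Same validation metrics as in the table above
results = {
    "Baseline-majority": {"acc": 0.66, "f1_rec": np.nan, "f1_non": np.nan},
    "V1 MLP":            {"acc": 0.79, "f1_rec": 0.80, "f1_non": 0.77},
    "V2 CNN":            {"acc": 0.86, "f1_rec": 0.87, "f1_non": 0.85},
    "V3 ResNet50":       {"acc": 0.94, "f1_rec": 0.94, "f1_non": 0.93},
    "V4 Aug CNN":        {"acc": 0.91, "f1_rec": 0.91, "f1_non": 0.89},
}
df = pd.DataFrame.from_dict(results, orient="index")

# Macro-F1 = unweighted mean of the two per-class F1 scores.
# skipna=False keeps the baseline as NaN instead of averaging over missing values.
df["macro_f1"] = df[["f1_rec", "f1_non"]].mean(axis=1, skipna=False)

# Highest macro-F1 first; rows with NaN are placed last by default
ranking = df.sort_values("macro_f1", ascending=False)
print(ranking.index[0])  # V3 ResNet50
```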


## 2  Accuracy bar-chart


In [None]:
plt.figure(figsize=(6, 4))
sns.barplot(x=df.index, y=df["acc"], palette="viridis")
plt.xticks(rotation=25)
plt.ylim(0.6, 1)  # truncated y-axis: bar heights exaggerate the gaps between models
plt.ylabel("validation accuracy")
plt.title("Model comparison")
# Print each accuracy just above its bar
for i, v in enumerate(df["acc"]):
    plt.text(i, v + 0.01, f"{v:.2f}", ha="center")
plt.tight_layout()
plt.show()


## 3  Best confusion matrix (ResNet50)


In [None]:
# Option A: reload the trained ResNet50 from disk if it was saved as .h5:
# from tensorflow.keras.models import load_model
# best_model = load_model("../models/resnet50_best.h5")

# Option B: reuse the notebook 04 variables if they are still in memory
# (this raises NameError in a fresh kernel).
best_model   = resnet        # <-- rename if different
X_val_images = X_val         # <-- same validation set as used in notebook 04
y_val_hot    = y_val         # <-- assumed one-hot labels, as the name suggests

cm = evaluate(best_model, X_val_images, y_val_hot,
              labels=["recyclable", "non-recyclable"])
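
To connect the confusion matrix back to the per-class F1 scores in section 1, here is a minimal sketch of the arithmetic. It assumes the matrix holds raw counts with rows as true classes and columns as predicted classes; whether `evaluate` returns `cm` in that layout is not confirmed here, and the counts in the demo are invented for illustration only.

```python
import numpy as np


def per_class_f1(cm):
    """Per-class precision, recall and F1 from a square confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)               # correct predictions per class
    predicted = cm.sum(axis=0)     # column sums: how often each class was predicted
    actual = cm.sum(axis=1)        # row sums: how often each class truly occurs
    # Guard against division by zero for classes that never appear
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Invented counts, order: ["recyclable", "non-recyclable"]
demo_cm = [[90, 10],
           [20, 80]]
precision, recall, f1 = per_class_f1(demo_cm)
print(np.round(f1, 3))  # [0.857 0.842]
```

If scikit-learn is available, `sklearn.metrics.classification_report` gives the same quantities directly from the label arrays.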


## 4  Summary & Recommendation

* **ResNet50 (frozen) wins** with **94 %** validation accuracy and balanced per-class F1 scores (0.94 recyclable, 0.93 non-recyclable).  
* The augmented CNN (V4) is close behind at 91 % with 10 × fewer parameters, making it a candidate for edge deployment.  
* Error analysis (notebook 06) showed residual confusion on non-recyclable items, caused by texture blur and images containing multiple small objects.

### Next steps
1. Collect 500 high-quality non-recyclable images to balance subclasses.  
2. Fine-tune the top 20 layers of ResNet50 at lr = 1e-5 for a potential +1 % accuracy.  
3. Integrate Grad-CAM heatmaps into operator dashboard for transparency.

*Decision*: **Ship ResNet50 (transfer learning)** for the first pilot at WMCN’s Utrecht facility and keep the augmented CNN (V4) as the edge fallback.
