# 05 – V4 CNN on Augmented Dataset

This iteration trains the same 2-convolution CNN as notebook 03,  
but on a dataset expanded via **data augmentation** (rotation, shift, zoom, flip).  
Goal: check whether more (synthetic) data improves generalisation.
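
To make the idea concrete, here is a minimal sketch of two of the listed transforms (flip and shift) in plain NumPy. It is an illustration only and is not the code in `src/data_augmentation.py`, which is not shown in this notebook. The function names, the `max_shift` value, and the random stand-in image are assumptions. Rotation and zoom are left out because they need interpolation.

```python
import numpy as np

def flip_horizontal(img):
    """Mirror an H x W x C image left to right."""
    return img[:, ::-1, :]

def shift_horizontal(img, pixels):
    """Shift right by `pixels` columns (left if negative), padding with zeros."""
    out = np.zeros_like(img)
    if pixels > 0:
        out[:, pixels:, :] = img[:, :-pixels, :]
    elif pixels < 0:
        out[:, :pixels, :] = img[:, -pixels:, :]
    else:
        out[...] = img
    return out

def augment(img, rng, max_shift=6):
    """Return one randomly flipped and shifted copy of `img` (hypothetical helper)."""
    out = flip_horizontal(img) if rng.random() < 0.5 else img
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return shift_horizontal(out, shift)

rng = np.random.default_rng(0)
# Random pixels stand in for a real 64x64 RGB photo
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
copies = np.stack([augment(image, rng) for _ in range(4)])
print(copies.shape)  # (4, 64, 64, 3)
```

Each call produces a new variant of the same source image, which is how the dataset grows without new photos being collected.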


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

from src.data_loader     import load_images
from src.model_cnn       import build_cnn
from src.compile_utils   import compile_model, early_stop
from src.plotting        import plot_history
from src.evaluation      import evaluate


## 1  Load augmented images
Replace the folder paths with the output directories generated by
`src/data_augmentation.py` (e.g. **`AUG-REC2`** and **`AUG-NON`**).


In [None]:
folder_paths = [
    "../data/AUG-REC2",      # augmented recyclable
    "../data/AUG-NON"        # augmented non-recyclable
]
class_names  = ["recyclable", "non-recyclable"]
X, y, _ = load_images(folder_paths, class_names, target_size=(64, 64))
print("Augmented dataset:", X.shape)


### Train / validation / test split
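
The split happens in two stages, so the second `test_size` applies only to what remains after the first. A quick check of the resulting fractions, using just the two `test_size` values from this notebook (the variable names are illustrative):

```python
test_frac = 0.10          # first split: share of the full dataset held out for testing
val_frac_of_rest = 0.22   # second split: share of the remaining 90% used for validation

val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = (1 - test_frac) * (1 - val_frac_of_rest)

print(f"train {train_frac:.3f}  val {val_frac:.3f}  test {test_frac:.3f}")
# train 0.702  val 0.198  test 0.100
```

In other words, the split is roughly 70 / 20 / 10.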


In [None]:
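# Encode labels as integers, then one-hot encode them for categorical cross-entropy
le = LabelEncoder()
y_hot = to_categorical(le.fit_transform(y), 2)

# Hold out 10% for testing, then take 22% of the remaining 90%
# (about 20% of the total) for validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y_hot, test_size=0.10, random_state=42, stratify=y_hot)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.22, random_state=42, stratify=y_temp)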

X_train = X_train.astype("float32")/255.0
X_val   = X_val.astype("float32")/255.0
X_test  = X_test.astype("float32")/255.0

print("Train", X_train.shape, " Val", X_val.shape, " Test", X_test.shape)
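
One caveat with a random split on augmented data: if several augmented copies of the same original photo exist, some can land in the training set and others in the test set, which may make the test score look better than true generalisation. Whether this applies here depends on how `src/data_augmentation.py` stores its outputs, which this notebook does not show. Below is a minimal sketch of a group-aware split, assuming a hypothetical `source_ids` list that records which original image each augmented sample came from.

```python
import random

def group_split(source_ids, test_frac=0.10, seed=42):
    """Split sample indices so all copies of one source image stay on the same side."""
    groups = sorted(set(source_ids))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(source_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(source_ids) if g in test_groups]
    return train_idx, test_idx

# Hypothetical example: 10 original images with 3 augmented copies each
source_ids = [f"img_{k:02d}" for k in range(10) for _ in range(3)]
train_idx, test_idx = group_split(source_ids)

train_sources = {source_ids[i] for i in train_idx}
test_sources = {source_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))   # 27 3
print(train_sources & test_sources)    # set()
```

This sketch does not stratify by class. scikit-learn's `GroupShuffleSplit` implements the same grouping idea if a library routine is preferred.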


## 2  Build & compile CNN (same architecture as notebook 03)


In [None]:
cnn_aug = build_cnn(shape=(64,64,3), classes=2)
cnn_aug = compile_model(cnn_aug, lr=1e-3, loss="categorical_crossentropy")
cnn_aug.summary()


## 3  Train


In [None]:
H = cnn_aug.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=30,
    batch_size=32,
    callbacks=[early_stop(patience=4)],
    verbose=2
)
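
`early_stop` is a project helper whose implementation is not shown here. Presumably it applies a patience-based stopping rule to a validation metric. Here is a minimal pure-Python sketch of such a rule, assuming it monitors validation loss and that lower is better. The loss values are made up.

```python
def stopping_epoch(val_losses, patience=4):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value so far is reached at epoch 3
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.63, 0.50]
print(stopping_epoch(losses, patience=4))  # 7
```

In this example the run stops at epoch 7, after four epochs without improvement, so the better value at epoch 8 is never seen. A larger patience trades longer training for fewer premature stops.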



## 4  Learning curves


In [None]:
plot_history(H)


## 5  Evaluation on test set


In [None]:
evaluate(cnn_aug, X_test, y_test, labels=le.classes_)
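
`evaluate` is also a project helper, so its exact output is not reproduced here. As a sketch of the quantities discussed in the next section, the following computes a confusion matrix and per-class recall from one-hot labels and predicted probabilities. The arrays are made up for illustration and are not results from this model.

```python
def argmax(row):
    """Index of the largest value in a sequence."""
    return max(range(len(row)), key=row.__getitem__)

def confusion_matrix(y_true_hot, y_prob, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for true_row, prob_row in zip(y_true_hot, y_prob):
        cm[argmax(true_row)][argmax(prob_row)] += 1
    return cm

def recall_per_class(cm):
    """Recall for class i = correct predictions of i / all true samples of i."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

# Made-up example: four samples of class 0, two of class 1 (the minority)
y_true = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
y_prob = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]

cm = confusion_matrix(y_true, y_prob)
print(cm)                    # [[3, 1], [1, 1]]
print(recall_per_class(cm))  # [0.75, 0.5]
```

Comparing the minority-class recall between the raw-data and augmented-data models shows whether augmentation helped the class that has fewer examples.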


## 6  Discussion

Data augmentation increased the dataset size from **N_raw → N_aug** and produced:  
* Accuracy = … (compare to raw-data CNN ≈ 0.86).  
* Confusion matrix shows improved recall on the minority class?  

Trade-off: longer training time and a slight risk of over-fitting to synthetic artefacts.  
The next notebook turns to **error analysis** to see where the model still fails.
