## 📂 Data Upload and Extraction



First, upload the zip file containing the dataset.

Note: the dataset is available on Kaggle:
https://www.kaggle.com/datasets/raniastudentlitim/cleaned-dataset

In [None]:
from google.colab import files
import zipfile
import os

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

  # Assuming the uploaded file is a zip file; the context manager closes it
  # even if extraction fails
  with zipfile.ZipFile(fn, 'r') as zip_ref:
    zip_ref.extractall('.')

  print(f"Extracted contents of {fn}")

# Assuming the archive contains a folder named 'cleaned_dataset' that holds one
# subfolder per class. Adjust this to match the actual name inside the zip.
dataset_path = 'cleaned_dataset'

if os.path.exists(dataset_path):
  print(f"Found dataset at: {dataset_path}")
  # Now you can proceed to load and visualize the dataset
else:
  print(f"Could not find dataset at: {dataset_path}. Please check the folder name inside the zip.")

In [None]:
import os

extracted_files = os.listdir('.')
print("Files extracted:")
for file in extracted_files:
  print(file)
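
Rather than guessing the folder name, it can help to inspect the archive before extracting it. The sketch below lists the top-level entries of a zip file using the standard `zipfile` module. The helper name `top_level_entries` and the in-memory demo archive are illustrative assumptions, not part of the original notebook.

```python
import io
import zipfile


def top_level_entries(zip_source):
    """Return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_source) as zf:
        # Member names use '/' as separator; keep only the first component
        return sorted({name.split("/")[0] for name in zf.namelist() if name})


# Demo with a small in-memory archive (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cleaned_dataset/Glass/img_001.jpg", b"fake")
    zf.writestr("cleaned_dataset/Paper/img_002.jpg", b"fake")

buffer.seek(0)
print(top_level_entries(buffer))
```

In Colab, the same helper could be called with the uploaded filename (for example `top_level_entries(fn)`) to decide what `dataset_path` should be.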

## 🧹 Dataset Preparation and Splitting


- Scan all class subdirectories (e.g., `Plastic`, `Glass`, etc.) to build a DataFrame containing image file paths and their associated labels.
- Split the dataset into training and validation sets using an 80/20 ratio, stratified by label so that both sets keep similar class proportions.
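
The scanning step can be sketched on its own as follows. This is a minimal illustration, assuming one subfolder per class. The extension filter `IMAGE_EXTENSIONS` and the helper `scan_class_folders` are assumptions added here so that stray non-image files are skipped; the notebook cells below do not filter by extension.

```python
import os
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed set of image types


def scan_class_folders(root):
    """Return (filepath, label) pairs for image files under root/<class>/."""
    records = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files such as the uploaded zip
        for file_path in sorted(class_dir.iterdir()):
            if file_path.suffix.lower() in IMAGE_EXTENSIONS:
                records.append((str(file_path), class_dir.name))
    return records


# Demo on a temporary directory with hypothetical files
with tempfile.TemporaryDirectory() as root:
    for label, filename in [("Glass", "a.jpg"), ("Glass", "notes.txt"), ("Plastics", "b.PNG")]:
        os.makedirs(os.path.join(root, label), exist_ok=True)
        Path(root, label, filename).write_bytes(b"")
    demo_records = scan_class_folders(root)

print([(os.path.basename(path), label) for path, label in demo_records])
```

The resulting pairs could be turned into a DataFrame with `pd.DataFrame(records, columns=['filepath', 'label'])`.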




In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# --- Verify the path to your dataset ---
# This path should point to the directory containing subdirectories for each class (e.g., 'Cardboard', 'Plastic', etc.)
dataset_path = "/content/cleaned_dataset" # Assuming it's extracted to the root /content/ directory

if not os.path.exists(dataset_path):
    print(f"ERROR: The directory '{dataset_path}' does not exist.")
    print("Please make sure you have unzipped your data and the path is correct.")
else:
    print(f"Dataset found at: {dataset_path}")

    # --- Create a DataFrame of filepaths and labels ---
    filepaths = []
    labels = []

    # Iterate through each folder (which represents a class)
    for class_folder in os.listdir(dataset_path):
        class_path = os.path.join(dataset_path, class_folder)
        if os.path.isdir(class_path):
            for image_file in os.listdir(class_path):
                filepaths.append(os.path.join(class_path, image_file))
                labels.append(class_folder)

    # Create the main DataFrame
    df = pd.DataFrame({'filepath': filepaths, 'label': labels})

    # --- Split into Training and Validation Sets ---
    # We stratify by label to ensure both train and validation sets have a similar distribution of classes.
    train_df, val_df = train_test_split(
        df,
        test_size=0.2,       # 20% for validation
        random_state=42,
        stratify=df['label']
    )

    print("\nData successfully organized into training and validation DataFrames.")
    print(f"Total images: {len(df)}")
    print(f"Training samples: {len(train_df)}")
    print(f"Validation samples: {len(val_df)}")

In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# --- Correct the path to your dataset ---
# Since the category folders (Cardboard, Glass, etc.) sit directly in /content/,
# the dataset path is simply "/content/".
dataset_path = "/content/"

print(f"Checking path: {dataset_path}")

# --- Create a DataFrame of filepaths and labels ---
filepaths = []
labels = []

# List of folders that are image categories (so other files/folders are ignored)
# These are the categories shown in the extraction output above
image_folders = [
    'Textiles', 'Metals', 'Paper', 'Electronic Waste',
    'Glass', 'Plastics', 'Cardboard', 'General Waste', 'Organic Waste'
]

# Iterate through each image category folder
for class_folder in image_folders:
    class_path = os.path.join(dataset_path, class_folder)
    if os.path.isdir(class_path):
        for image_file in os.listdir(class_path):
            filepaths.append(os.path.join(class_path, image_file))
            labels.append(class_folder)
    else:
        print(f"Warning: expected category folder not found: {class_path}")


# Create the main DataFrame
df = pd.DataFrame({'filepath': filepaths, 'label': labels})

# --- Split into Training and Validation Sets ---
# We stratify by label so the training and validation sets have a similar class distribution.
train_df, val_df = train_test_split(
    df,
    test_size=0.2,       # 20% for validation
    random_state=42,
    stratify=df['label']
)

print("\nData successfully organized into training and validation DataFrames.")
print(f"Total images found: {len(df)}")
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
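
To make the effect of stratification concrete, here is a small standard-library sketch that splits each class separately so the 80/20 ratio holds per class. It illustrates the idea only and is not how `train_test_split` is implemented; the function `stratified_split` and the label counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split sample indices so each class contributes ~test_fraction to validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Hypothetical imbalanced label list: 100 'Plastics', 20 'Glass'
labels = ["Plastics"] * 100 + ["Glass"] * 20
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts)
print(val_counts)
```

Both subsets keep the original 5:1 ratio between the two classes, which a purely random split would only match approximately.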

## 🧪 Step 2: Create Data Generators and Compute Class Weights

In this step, we prepare the data for training and address class imbalance:

- **Image Generators**: We use `ImageDataGenerator` to normalize pixel values by scaling them to the range [0, 1].
  - The training generator shuffles data for better learning.
  - The validation generator does not shuffle to ensure consistent evaluation.

- **Class Weights Calculation**:
  - We compute the class weights using `compute_class_weight` from `sklearn`, which balances the impact of underrepresented classes.
  - The weights are mapped to each class index so they can be used during model training to compensate for class imbalance.

This step ensures proper data feeding and handles any imbalance across waste categories to improve the model’s generalization.
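
For intuition, the `'balanced'` mode of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * count)`, so rarer classes get larger weights. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c))"""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical training labels: 60 / 30 / 10 samples per class
labels = ["Plastics"] * 60 + ["Glass"] * 30 + ["Textiles"] * 10
weights = balanced_class_weights(labels)
for label in sorted(weights):
    print(f"{label:10s} {weights[label]:.2f}")
```

With these counts, the rarest class (`Textiles`) receives a weight six times larger than the most common one (`Plastics`).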


In [None]:
# --- Step 2: Create Data Generators and Compute Class Weights ---

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# --- Preprocessing parameters ---
IMG_SIZE = (224, 224)
BATCH_SIZE = 32

# Define the generators (rescaling only)
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)

# --- Create the training generator ---
train_generator = train_datagen.flow_from_dataframe(
    train_df,
    x_col="filepath",
    y_col="label",
    target_size=IMG_SIZE,
    class_mode="categorical",
    batch_size=BATCH_SIZE,
    shuffle=True,
    seed=42
)

# --- Create the validation generator ---
val_generator = val_datagen.flow_from_dataframe(
    val_df,
    x_col="filepath",
    y_col="label",
    target_size=IMG_SIZE,
    class_mode="categorical",
    batch_size=BATCH_SIZE,
    shuffle=False  # No need to shuffle the validation data
)

# --- Compute Class Weights ---
# Get the class indices from the generator
class_indices = train_generator.class_indices

# Invert the dictionary to map each index to its label name
index_to_label = {v: k for k, v in class_indices.items()}

# Get the class labels from the training DataFrame
y_labels = train_df['label']

# Compute the class weights
class_weights_array = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_labels),
    y=y_labels
)
label_to_weight = dict(zip(np.unique(y_labels), class_weights_array))

# Build the final class_weights dictionary keyed by integer class index
class_weights_dict = {
    class_indices[label]: weight for label, weight in label_to_weight.items()
}

print("\n✅ Final class_weights_dict (index → weight):\n")
for idx, weight in sorted(class_weights_dict.items()):
    label_name = index_to_label.get(idx, 'Unknown')
    print(f"Class {idx:2d} ({label_name:20s}): {weight:.2f}")

## 🧠 Step 3: Build and Compile the MobileNetV2 Model

In this step, we create a transfer learning model based on **MobileNetV2**, a lightweight CNN architecture pre-trained on ImageNet.

- We **load MobileNetV2** without its top classification layer and freeze its convolutional base to preserve pre-trained features.
- We add new layers:
  - A `GlobalAveragePooling2D` layer to reduce the spatial dimensions.
  - A fully connected `Dense` layer with ReLU activation.
  - A final `Dense` layer with softmax activation for multiclass classification, using the correct number of classes inferred from the training generator.
- The model is then compiled using the **Adam optimizer**, with **categorical crossentropy** loss (suitable for one-hot encoded labels), and **accuracy** as the performance metric.

This approach leverages pre-trained visual features while allowing the model to adapt to our waste classification task.
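
As a quick sanity check on the size of the new head, the trainable parameter count can be worked out by hand. The sketch assumes the standard MobileNetV2 (width multiplier 1.0), whose final feature map has 1280 channels, the 1024-unit dense layer used in this notebook, and 9 classes (the number of category folders listed in the dataset preparation cell). The helper `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


FEATURE_CHANNELS = 1280  # MobileNetV2 channels after global average pooling (alpha=1.0)
HIDDEN_UNITS = 1024      # size of the added Dense layer
NUM_CLASSES = 9          # assumed from the nine category folders

hidden = dense_params(FEATURE_CHANNELS, HIDDEN_UNITS)
output = dense_params(HIDDEN_UNITS, NUM_CLASSES)
print(hidden, output, hidden + output)  # 1311744 9225 1320969
```

With the base frozen and no other trainable layers, the total should match the trainable parameter count reported by `model.summary()`, since `GlobalAveragePooling2D` has no parameters of its own.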


In [None]:
# --- Step 3: Build and Compile the MobileNetV2 Model ---

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Load the MobileNetV2 base model, excluding its top classification layer
base_model = MobileNetV2(
    input_shape=IMG_SIZE + (3,),
    include_top=False,
    weights='imagenet'
)

# Freeze the convolutional layers of the base model
base_model.trainable = False

# Infer the number of classes from the length of the generator's class_indices dictionary
num_classes = len(train_generator.class_indices)
print(f"Found {num_classes} classes.")

# --- Build the new model on top of the base model ---
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(num_classes, activation='softmax')(x)

# Combine the base model and our new custom layers into a single, final model
model = Model(inputs=base_model.input, outputs=predictions)

# --- Compile the model ---
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Print a summary of the model's architecture
print("\n--- Model Summary ---")
model.summary()

## 🏋️ Step 4: Train the Model

In this step, we train the MobileNetV2-based classification model:

- **Epochs**: The model is trained for 15 epochs.
- **Callbacks**: We use `ModelCheckpoint` to save the model only when it achieves the best validation accuracy. This ensures we retain the most performant version.
- **Class Weights**: We provide `class_weight` to the training process to handle class imbalance, allowing the model to give more importance to underrepresented categories.

The training process will output accuracy and loss for both the training and validation sets across all epochs. The best-performing model is saved as `best_model.keras`.
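
The "save only the best" rule can be sketched without Keras: a checkpoint is written only at epochs where the monitored metric strictly improves on the best value seen so far. The function `best_checkpoint_epochs` and the accuracy values are hypothetical and only illustrate the rule.

```python
def best_checkpoint_epochs(val_accuracies):
    """Return the 1-based epochs at which a 'save best only' rule would save."""
    saved_epochs = []
    best = float("-inf")
    for epoch, val_acc in enumerate(val_accuracies, start=1):
        if val_acc > best:  # mode="max": strictly better than the best so far
            best = val_acc
            saved_epochs.append(epoch)
    return saved_epochs


# Hypothetical validation accuracies over five epochs
val_accuracies = [0.81, 0.88, 0.87, 0.91, 0.90]
saved = best_checkpoint_epochs(val_accuracies)
print(saved)
```

In this example the file on disk after training would hold the model from epoch 4, not from the final epoch.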


In [None]:
# --- Step 4: Train the Model ---

import tensorflow as tf

# Define the number of times the model will see the entire training dataset
EPOCHS = 15

# Define a callback that saves the full model only when the validation accuracy improves.
# This ensures we keep the best performing model.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras",        # Filepath to save the model
    save_best_only=True,       # Only save if the model is the best so far
    monitor="val_accuracy",    # The metric to monitor for improvement
    mode="max"                 # We want to maximize this metric
)

print("\n--- Starting Model Training ---")
print(f"The model will be trained for {EPOCHS} epochs.")

# Train the model using the .fit() method
history = model.fit(
    train_generator,
    epochs=EPOCHS,
    validation_data=val_generator,
    class_weight=class_weights_dict, # This is crucial for handling the imbalanced data!
    callbacks=[checkpoint_cb]        # Pass in our callback to save the best model
)

print("\n--- Model Training Complete ---")
# The history object holds a record of the loss and accuracy values during training
print("Best model has been saved to 'best_model.keras'")

## 📊 Step 5: Evaluate Model Performance and Visualize Results

In this step, we assess the model’s performance and visualize its learning progress:

### 1. Evaluation
- We load the **best model** saved during training (`best_model.keras`) so that we evaluate the best checkpoint rather than the weights from the final epoch.
- The model is evaluated on the validation set to report the **final loss and accuracy**.

### 2. Visualization
- We use the `history` object to plot:
  - **Training vs. Validation Accuracy**
  - **Training vs. Validation Loss**
- These plots show how the model learned over time and whether it overfits or underfits.

Visual inspection of these graphs is crucial for diagnosing model behavior and identifying potential improvements.
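
Alongside the plots, a couple of numbers can be read off the history directly: the epoch with the best validation accuracy and the gap between training and validation accuracy at that epoch. The sketch below assumes a dictionary shaped like `history.history`; the function `summarize_history` and the values are hypothetical.

```python
def summarize_history(history):
    """Report the best validation epoch and the train/validation accuracy gap there."""
    val_acc = history["val_accuracy"]
    best_index = max(range(len(val_acc)), key=val_acc.__getitem__)
    gap = history["accuracy"][best_index] - val_acc[best_index]
    return {
        "best_epoch": best_index + 1,  # 1-based epoch number
        "val_accuracy": val_acc[best_index],
        "gap": gap,
    }


# Hypothetical values, shaped like `history.history` from model.fit
demo_history = {
    "accuracy": [0.70, 0.85, 0.92, 0.97],
    "val_accuracy": [0.72, 0.84, 0.90, 0.89],
}
summary = summarize_history(demo_history)
print(summary)
```

A large positive gap that keeps growing after the best epoch is one common sign of overfitting. In the hypothetical data above, training accuracy continues to rise while validation accuracy drops in the last epoch.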


In [None]:
# --- Step 5: Evaluate Performance and Visualize Results ---

import matplotlib.pyplot as plt
import tensorflow as tf

# --- 1. Evaluate the Model's Final Performance ---
print("\n--- Evaluating Final Model Performance ---")

# It's best practice to load the best saved model to ensure we evaluate the absolute best checkpoint.
print("Loading the best model from 'best_model.keras'...")
best_model = tf.keras.models.load_model("best_model.keras")

# Evaluate the best model on the validation generator
results = best_model.evaluate(val_generator)

print("\n--- Final Evaluation Metrics ---")
print(f"Validation Loss: {results[0]:.4f}")
print(f"Validation Accuracy: {results[1] * 100:.2f}%") # Display as a percentage


# --- 2. Visualize Training History ---
print("\n--- Plotting Training History ---")

# Extracting metrics from the 'history' object
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

# Get the number of epochs the training actually ran for
epochs_range = range(len(acc))

# Create a figure to plot on
plt.figure(figsize=(14, 6))

# Plot Training & Validation Accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')

# Plot Training & Validation Loss
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')

# Display the plots
plt.show()

## 📈 Interpretation of Results

### ✅ Final Evaluation Metrics:
- **Validation Accuracy**: **96.83%**
- **Validation Loss**: **0.1536**

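These results indicate that the trained model performs **very well** on the held-out validation data. An accuracy of nearly 97% means the model has learned to distinguish between the different waste categories reliably. Because the best checkpoint was also selected on this validation set, a separate test set would be needed for an unbiased estimate of performance on truly unseen data.

The low validation loss is consistent with the model assigning high probability to the correct class, although it does not by itself establish that the predictions are well calibrated.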
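
Because the classes are imbalanced, overall accuracy can hide weak performance on rare categories, so a per-class breakdown is a useful complement. The following is a minimal sketch with hypothetical labels; the function `per_class_precision_recall` is not part of the notebook.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Compute (precision, recall) for each class from two label lists."""
    true_pos = Counter()
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            true_pos[truth] += 1

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        report[label] = (precision, recall)
    return report


# Hypothetical ground truth and predictions
y_true = ["Glass", "Glass", "Paper", "Paper", "Paper", "Metals"]
y_pred = ["Glass", "Paper", "Paper", "Paper", "Paper", "Paper"]
report = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(f"{label:8s} precision={precision:.2f} recall={recall:.2f}")
```

In this notebook, the true labels could plausibly be taken from `val_generator.classes` and the predictions from the argmax of `best_model.predict(val_generator)`, which line up because the validation generator is created with `shuffle=False`.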

---

### 📉 Training History Analysis:

- The **training and validation accuracy curves** show consistent improvement and closely follow each other, which suggests that:
  - The model is **not overfitting** (no divergence between training and validation performance).
  - The training was **stable** under the chosen class weights and preprocessing.

- The **loss curves** steadily decrease, and the validation loss does not spike, indicating that:
  - The model is not underfitting.
  - The optimizer was able to effectively minimize the error across epochs.

---

### 📌 Inference:
- The combination of a well-prepared dataset, class weighting, and transfer learning from MobileNetV2 led to a **high-performing image classifier**.
- This model is now well-suited for deployment or further fine-tuning (e.g., unfreezing the base layers for even better results).


In [None]:
from google.colab import files

files.download('best_model.keras')