# SVM Image Classification - Garbage Dataset

This notebook trains Support Vector Machine (SVM) classifiers on the Kaggle garbage classification dataset. Images are resized to 64×64, flattened, reduced with PCA and standardized; linear, RBF and polynomial kernels are then compared.


In [1]:
# Install required packages
!pip install opendatasets scikit-learn matplotlib seaborn numpy pandas pillow tqdm


Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [2]:
# Import required libraries
import opendatasets as od
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
import os
from PIL import Image
import glob
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")


Libraries imported successfully!


## Dataset Download and Loading


In [3]:
# Download the garbage classification dataset from Kaggle
dataset_url = "https://www.kaggle.com/datasets/zlatan599/garbage-dataset-classification"
od.download(dataset_url)


Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: bumithaekanayake
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/zlatan599/garbage-dataset-classification
Downloading garbage-dataset-classification.zip to ./garbage-dataset-classification


100%|██████████| 121M/121M [00:00<00:00, 1.06GB/s]







**Note for Google Colab users**: If you encounter authentication issues when downloading the dataset, you may need to:

1. Go to your Kaggle account → Account → API → Create New API Token
2. Download the `kaggle.json` file
3. Upload it to Colab when prompted, or run:
```python
from google.colab import files
files.upload()  # Upload kaggle.json
```
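
Before retrying the download, it can help to confirm that the uploaded file is where the notebook expects it and has the expected fields. A minimal sketch, assuming `kaggle.json` sits in the current working directory and contains `username` and `key` entries (the helper name `check_kaggle_credentials` is hypothetical):

```python
import json
from pathlib import Path


def check_kaggle_credentials(path="kaggle.json"):
    """Return True if the file exists and holds non-empty 'username' and 'key' fields."""
    cred_file = Path(path)
    if not cred_file.is_file():
        return False
    try:
        creds = json.loads(cred_file.read_text())
    except json.JSONDecodeError:
        return False
    if not isinstance(creds, dict):
        return False
    return bool(creds.get("username")) and bool(creds.get("key"))


print("Credentials file looks usable:", check_kaggle_credentials())
```

The check only inspects the file's shape; it does not contact Kaggle or validate the key itself.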


In [4]:
# Explore the dataset structure
data_dir = "./garbage-dataset-classification/Garbage_Dataset_Classification/images/"
print("Dataset directory:", data_dir)
print("\nDirectory contents:")
for item in os.listdir(data_dir):
    item_path = os.path.join(data_dir, item)
    if os.path.isdir(item_path):
        print(f"📁 {item}/")
        subdir_contents = os.listdir(item_path)
        print(f"   Contains {len(subdir_contents)} items")
        # Show up to five entries, then summarize whatever remains
        for subitem in subdir_contents[:5]:
            print(f"   - {subitem}")
        if len(subdir_contents) > 5:
            print(f"   ... and {len(subdir_contents) - 5} more items")
    else:
        print(f"📄 {item}")


Dataset directory: ./garbage-dataset-classification/Garbage_Dataset_Classification/images/

Directory contents:
📁 glass/
   Contains 2500 items
   - glass_03245.jpg
   - glass_02661.jpg
   - glass_00833.jpg
   - glass_00960.jpg
   - glass_03149.jpg
   ... and 2495 more items
📁 paper/
   Contains 2315 items
   - paper_02086.jpg
   - paper_02643.jpg
   - paper_02011.jpg
   - paper_02352.jpg
   - paper_01373.jpg
   ... and 2310 more items
📁 trash/
   Contains 2500 items
   - trash_12702.jpg
   - trash_19143.jpg
   - trash_13674.jpg
   - trash_11007.jpg
   - trash_07523.jpg
   ... and 2495 more items
📁 plastic/
   Contains 2288 items
   - plastic_01588.jpg
   - plastic_02534.jpg
   - plastic_02422.jpg
   - plastic_01650.jpg
   - plastic_01485.jpg
   ... and 2283 more items
📁 cardboard/
   Contains 2214 items
   - cardboard_02624.jpg
   - cardboard_02562.jpg
   - cardboard_01421.jpg
   - cardboard_00521.jpg
   - cardboard_02370.jpg
   ... and 2209 more items
📁 metal/
   Contains 2084 items
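
The per-class counts above can also be collected programmatically, which makes class imbalance easy to spot. A sketch using only the standard library; the helper name `count_images_per_class` and the set of extensions are assumptions:

```python
from pathlib import Path


def count_images_per_class(root, extensions=(".jpg", ".jpeg", ".png")):
    """Map each class subdirectory of `root` to the number of image files it holds."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in extensions
            )
    return counts


# Path taken from the cell above; the loop is skipped if the dataset is absent
images_root = Path("./garbage-dataset-classification/Garbage_Dataset_Classification/images/")
if images_root.is_dir():
    for name, n_images in count_images_per_class(images_root).items():
        print(f"{name:<10} {n_images}")
```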


## Image Preprocessing for SVM


In [7]:
def load_and_preprocess_images(data_dir, target_size=(64, 64), max_samples_per_class=1000):
    """
    Load and preprocess images for SVM classification

    Args:
        data_dir: Path to the dataset directory
        target_size: Target size for resizing images (width, height)
        max_samples_per_class: Maximum number of samples per class to avoid memory issues

    Returns:
        X: Flattened image features
        y: Class labels
        class_names: List of class names
    """
    X = []
    y = []
    class_names = []

    # Get all subdirectories (classes)
    subdirs = [d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d))]
    subdirs.sort()  # Sort for consistent ordering

    print(f"Found {len(subdirs)} classes: {subdirs}")

    for class_idx, class_name in enumerate(subdirs):
        class_path = os.path.join(data_dir, class_name)
        image_files = glob.glob(os.path.join(class_path, "*.jpg")) + glob.glob(os.path.join(class_path, "*.png"))

        print(f"Processing class '{class_name}' with {len(image_files)} images...")

        # Limit samples per class to avoid memory issues
        if len(image_files) > max_samples_per_class:
            image_files = image_files[:max_samples_per_class]
            print(f"  Limited to {max_samples_per_class} samples")

        class_names.append(class_name)

        for img_path in tqdm(image_files, desc=f"Loading {class_name}"):
            try:
                # Load and resize image
                img = Image.open(img_path).convert('RGB')
                img = img.resize(target_size)

                # Convert to numpy array and normalize
                img_array = np.array(img) / 255.0

                # Flatten the image
                img_flattened = img_array.flatten()

                X.append(img_flattened)
                y.append(class_idx)

            except Exception as e:
                print(f"Error loading {img_path}: {e}")
                continue

    return np.array(X), np.array(y), class_names

# Load and preprocess the dataset
print("Loading and preprocessing images...")
X, y, class_names = load_and_preprocess_images(data_dir, target_size=(64, 64), max_samples_per_class=500)

print(f"\nDataset loaded successfully!")
print(f"Total samples: {len(X)}")
print(f"Feature dimension: {X.shape[1]}")
print(f"Number of classes: {len(class_names)}")
print(f"Classes: {class_names}")


Loading and preprocessing images...
Found 6 classes: ['cardboard', 'glass', 'metal', 'paper', 'plastic', 'trash']
Processing class 'cardboard' with 2214 images...
  Limited to 500 samples


Loading cardboard: 100%|██████████| 500/500 [00:01<00:00, 283.91it/s]


Processing class 'glass' with 2500 images...
  Limited to 500 samples


Loading glass: 100%|██████████| 500/500 [00:01<00:00, 309.69it/s]


Processing class 'metal' with 2084 images...
  Limited to 500 samples


Loading metal: 100%|██████████| 500/500 [00:00<00:00, 688.00it/s]


Processing class 'paper' with 2315 images...
  Limited to 500 samples


Loading paper: 100%|██████████| 500/500 [00:00<00:00, 659.72it/s]


Processing class 'plastic' with 2288 images...
  Limited to 500 samples


Loading plastic: 100%|██████████| 500/500 [00:00<00:00, 690.71it/s]


Processing class 'trash' with 2500 images...
  Limited to 500 samples


Loading trash: 100%|██████████| 500/500 [00:00<00:00, 697.04it/s]



Dataset loaded successfully!
Total samples: 3000
Feature dimension: 12288
Number of classes: 6
Classes: ['cardboard', 'glass', 'metal', 'paper', 'plastic', 'trash']
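
The loader keeps the first `max_samples_per_class` paths returned by `glob`, whose order depends on the file system, so the selected subset may differ between machines. One way to make the subset reproducible is to sort the paths and draw a seeded random sample. A sketch; `sample_paths` and the seed value are illustrative, not part of the original loader:

```python
import random


def sample_paths(paths, max_samples, seed=42):
    """Return a reproducible subset of at most `max_samples` paths."""
    ordered = sorted(paths)      # fix the order independently of the file system
    if len(ordered) <= max_samples:
        return ordered
    rng = random.Random(seed)    # local generator; global random state is untouched
    return sorted(rng.sample(ordered, max_samples))


# Hypothetical file names in the style of the dataset
files = [f"glass_{i:05d}.jpg" for i in range(2500)]
subset = sample_paths(files, 500)
print(len(subset), subset[:3])
```

Inside `load_and_preprocess_images`, the slice `image_files[:max_samples_per_class]` could be swapped for a call like this one.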


In [8]:
# Apply PCA for dimensionality reduction (optional but recommended for SVM)
print("Applying PCA for dimensionality reduction...")

# First, let's see how much variance we can retain with different numbers of components
pca_full = PCA()
pca_full.fit(X)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Find number of components for 95% variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
n_components_90 = np.argmax(cumulative_variance >= 0.90) + 1

print(f"Components needed for 90% variance: {n_components_90}")
print(f"Components needed for 95% variance: {n_components_95}")

# Apply PCA with 95% variance retention
n_components = min(n_components_95, 1000)  # Cap at 1000 components for computational efficiency
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

print(f"Reduced feature dimension from {X.shape[1]} to {X_pca.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_.sum():.4f}")

# Use PCA-transformed features
X_final = X_pca


Applying PCA for dimensionality reduction...
Components needed for 90% variance: 118
Components needed for 95% variance: 311
Reduced feature dimension from 12288 to 311
Explained variance ratio: 0.9495
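
The component counts above come from thresholding the cumulative explained-variance curve. A toy illustration of that step with made-up variance ratios (the numbers are not from this dataset, and `components_for_variance` is an illustrative name):

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative variance reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # argmax returns the index of the first True; +1 turns the index into a count
    return int(np.argmax(cumulative >= threshold) + 1)


toy_ratios = np.array([0.50, 0.25, 0.15, 0.06, 0.04])  # illustrative values only
print(components_for_variance(toy_ratios, 0.80))  # 3 components (0.50 + 0.25 + 0.15)
print(components_for_variance(toy_ratios, 0.95))  # 4 components
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95, svd_solver='full')`), which performs this selection internally.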


In [9]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize the features (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train_scaled.shape}")
print(f"Test set size: {X_test_scaled.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Class distribution in training set: {np.bincount(y_train)}")
print(f"Class distribution in test set: {np.bincount(y_test)}")


Training set size: (2400, 311)
Test set size: (600, 311)
Number of classes: 6
Class distribution in training set: [400 400 400 400 400 400]
Class distribution in test set: [100 100 100 100 100 100]
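
In the cells above, PCA is fitted on all 3000 samples before the train/test split, so the projection has already seen the test images. A common way to avoid this is to wrap PCA, scaling and the classifier in a `Pipeline` and fit it on the training split only. A sketch on synthetic data (the data, sizes and `n_components=20` are placeholders, not the notebook's values):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=10, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same order as the notebook: PCA, then standardization, then the SVM
pipeline = Pipeline([
    ("pca", PCA(n_components=20, random_state=42)),
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipeline.fit(X_tr, y_tr)  # PCA and scaler statistics come from the training data only
demo_accuracy = pipeline.score(X_te, y_te)
print(f"Held-out accuracy on synthetic data: {demo_accuracy:.3f}")
```

A pipeline like this can also be passed to `GridSearchCV` with parameter names such as `svm__C`, so that each cross-validation fold refits PCA and the scaler on its own training portion.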


## SVM Classification


In [10]:
# Train SVM with different kernels
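def train_svm_model(X_train, y_train, kernel='rbf', C=1.0, gamma='scale', degree=3):
    """
    Train an SVM model with the specified parameters.

    `gamma` is ignored by the linear kernel; `degree` is only used by the
    polynomial kernel.
    """
    # Pass `degree` through to SVC; previously the argument was accepted but never used
    svm_model = SVC(kernel=kernel, C=C, gamma=gamma, degree=degree, random_state=42)
    svm_model.fit(X_train, y_train)
    return svm_model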

# Test different SVM kernels
kernels = ['linear', 'rbf', 'poly']
svm_models = {}
results = {}

print("Training SVM models with different kernels...")

for kernel in kernels:
    print(f"\nTraining SVM with {kernel} kernel...")

    if kernel == 'linear':
        model = train_svm_model(X_train_scaled, y_train, kernel='linear', C=1.0)
    elif kernel == 'rbf':
        model = train_svm_model(X_train_scaled, y_train, kernel='rbf', C=1.0, gamma='scale')
    elif kernel == 'poly':
        model = train_svm_model(X_train_scaled, y_train, kernel='poly', C=1.0, gamma='scale', degree=3)

    svm_models[kernel] = model

    # Make predictions
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)

    results[kernel] = {
        'model': model,
        'predictions': y_pred,
        'accuracy': accuracy
    }

    print(f"Accuracy with {kernel} kernel: {accuracy:.4f}")

# Display results
print("\n" + "="*50)
print("SVM PERFORMANCE SUMMARY")
print("="*50)
for kernel, result in results.items():
    print(f"{kernel.upper()} Kernel: {result['accuracy']:.4f}")


Training SVM models with different kernels...

Training SVM with linear kernel...
Accuracy with linear kernel: 0.3217

Training SVM with rbf kernel...
Accuracy with rbf kernel: 0.3983

Training SVM with poly kernel...
Accuracy with poly kernel: 0.2050

SVM PERFORMANCE SUMMARY
LINEAR Kernel: 0.3217
RBF Kernel: 0.3983
POLY Kernel: 0.2050
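
With six balanced classes, uniform random guessing would score about 1/6 ≈ 0.167, which gives a reference point for the accuracies above. A small sketch using the reported numbers (the variable names are illustrative):

```python
# Test accuracies reported in the output above
reported_accuracy = {"linear": 0.3217, "rbf": 0.3983, "poly": 0.2050}

n_classes = 6
chance_level = 1 / n_classes  # expected accuracy of uniform guessing on balanced classes

for kernel_name, acc in reported_accuracy.items():
    lift = acc / chance_level
    print(f"{kernel_name:<6} accuracy={acc:.4f}  ({lift:.2f}x chance)")
```

Read this way, the polynomial kernel is only slightly above chance, while the RBF kernel is a little over twice the chance level.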


In [11]:
# Hyperparameter tuning for the best performing kernel
print("Performing hyperparameter tuning...")

# Find the best kernel (this choice uses test-set accuracy, so the final test score is optimistic)
best_kernel = max(results.keys(), key=lambda k: results[k]['accuracy'])
print(f"Best performing kernel: {best_kernel}")

# Define parameter grid for hyperparameter tuning
if best_kernel == 'linear':
    param_grid = {'C': [0.1, 1, 10, 100]}
elif best_kernel == 'rbf':
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]}
elif best_kernel == 'poly':
    param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto'], 'degree': [2, 3, 4]}

# Perform grid search
print(f"Tuning hyperparameters for {best_kernel} kernel...")
grid_search = GridSearchCV(
    SVC(kernel=best_kernel, random_state=42),
    param_grid,
    cv=3,  # Use 3-fold CV for faster computation
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# GridSearchCV has already refit the best configuration on the full training set (refit=True by default)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
best_accuracy = accuracy_score(y_test, y_pred_best)

print(f"Final test accuracy: {best_accuracy:.4f}")


Performing hyperparameter tuning...
Best performing kernel: rbf
Tuning hyperparameters for rbf kernel...
Best parameters: {'C': 1, 'gamma': 'scale'}
Best cross-validation score: 0.3779
Final test accuracy: 0.3983
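
The cost of the search scales with the number of parameter combinations times the number of folds. For the RBF grid above, that count can be worked out as follows (a sketch; `GridSearchCV` additionally refits the best configuration once on the full training set when `refit=True`, its default):

```python
from itertools import product

# Same values as the RBF branch of param_grid above
rbf_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1, 1],
}
n_folds = 3

combinations = list(product(*rbf_grid.values()))
n_fits = len(combinations) * n_folds
print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} fits")
```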


In [None]:
print(X.shape)


## Results Visualization and Analysis


In [None]:
# Visualize results safely (for tabular or image data)
plt.figure(figsize=(15, 10))

# 1. Accuracy comparison
plt.subplot(2, 3, 1)
kernels = list(results.keys())
accuracies = [results[k]['accuracy'] for k in kernels]
bars = plt.bar(kernels, accuracies, color=['skyblue', 'lightgreen', 'lightcoral'])
plt.title('SVM Accuracy by Kernel')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

# 2. Confusion Matrix for best model
plt.subplot(2, 3, 2)
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)  # label axes with class names
plt.title(f'Confusion Matrix ({best_kernel.upper()} Kernel)')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# 3. Classification Report
plt.subplot(2, 3, 3)
report = classification_report(y_test, y_pred_best, output_dict=True)
# Drop the support row and the accuracy / macro avg / weighted avg columns
metrics_df = pd.DataFrame(report).iloc[:-1, :-3].T
sns.heatmap(metrics_df, annot=True, cmap='YlOrRd', fmt='.3f')
plt.title('Classification Metrics')
plt.xlabel('Metrics')
plt.ylabel('Classes')

# 4. Feature importance (only for linear kernel)
if best_kernel == 'linear':
    plt.subplot(2, 3, 4)
    feature_importance = np.abs(best_model.coef_[0])
    top_features = np.argsort(feature_importance)[-20:]
    plt.barh(range(len(top_features)), feature_importance[top_features])
    plt.title('Top 20 Feature Importances (Linear SVM)')
    plt.xlabel('Importance')
    plt.ylabel('Feature Index')

# 5. Cross-validation scores
plt.subplot(2, 3, 5)
cv_scores = cross_val_score(best_model, X_train_scaled, y_train, cv=5)
plt.boxplot([cv_scores], labels=[f'{best_kernel.upper()} SVM'])
plt.title('Cross-Validation Scores')
plt.ylabel('Accuracy')
plt.text(1, np.mean(cv_scores), f'Mean: {np.mean(cv_scores):.3f}',
         ha='center', va='bottom')

# 6. Sample predictions visualization (works for tabular data)
plt.subplot(2, 3, 6)
sample_indices = np.random.choice(len(X_test), 10, replace=False)
sample_true = y_test[sample_indices]
sample_pred = y_pred_best[sample_indices]

comparison_df = pd.DataFrame({
    'True Label': sample_true,
    'Predicted Label': sample_pred
})
sns.heatmap(pd.crosstab(comparison_df['True Label'], comparison_df['Predicted Label']),
            annot=True, fmt='d', cmap='Greens')
plt.title('Sample Predictions Overview')

plt.tight_layout()
plt.show()

# Print detailed classification report
print("\n" + "="*60)
print("DETAILED CLASSIFICATION REPORT")
print("="*60)
print(classification_report(y_test, y_pred_best))
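
The confusion matrix and the per-class metrics in the report are two views of the same counts: recall divides each diagonal entry by its row total, precision by its column total. A sketch with a made-up 3-class matrix (values are illustrative, not results from this notebook):

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn's convention)
cm_demo = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
])

true_positives = np.diag(cm_demo)
recall_demo = true_positives / cm_demo.sum(axis=1)     # per true class (row totals)
precision_demo = true_positives / cm_demo.sum(axis=0)  # per predicted class (column totals)
accuracy_demo = true_positives.sum() / cm_demo.sum()

print("recall:   ", np.round(recall_demo, 3))
print("precision:", np.round(precision_demo, 3))
print("accuracy: ", round(float(accuracy_demo), 3))
```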



## Summary and Conclusions


## Model Comparison Across SVM Kernels


In [13]:
# Define models for comparison
import time # Import the time module

models = {
    'SVM (Linear)': SVC(kernel='linear', C=1.0, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
    'SVM (Polynomial)': SVC(kernel='poly', C=1.0, gamma='scale', degree=3, random_state=42),
}

# Train and evaluate all models
model_results = {}
training_times = {}
prediction_times = {}

print("Training and evaluating multiple models...")
print("="*60)

for name, model in models.items():
    print(f"\nTraining {name}...")

    # Measure training time
    start_time = time.time()
    model.fit(X_train_scaled, y_train)
    training_time = time.time() - start_time

    # Measure prediction time
    start_time = time.time()
    y_pred = model.predict(X_test_scaled)
    prediction_time = time.time() - start_time

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)

    # Store results
    model_results[name] = {
        'model': model,
        'predictions': y_pred,
        'accuracy': accuracy
    }
    training_times[name] = training_time
    prediction_times[name] = prediction_time

    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Training time: {training_time:.2f}s")
    print(f"  Prediction time: {prediction_time:.2f}s")

print("\n" + "="*60)
print("MODEL COMPARISON SUMMARY")
print("="*60)


Training and evaluating multiple models...

Training SVM (Linear)...
  Accuracy: 0.3217
  Training time: 29.30s
  Prediction time: 0.16s

Training SVM (RBF)...
  Accuracy: 0.3983
  Training time: 1.77s
  Prediction time: 0.73s

Training SVM (Polynomial)...
  Accuracy: 0.2050
  Training time: 2.08s
  Prediction time: 0.25s

MODEL COMPARISON SUMMARY
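
The timings above come from single runs measured with `time.time()`, so they include whatever else the machine was doing at that moment. For short intervals `time.perf_counter()` is generally the better clock, and repeating the measurement gives a sense of the spread. A sketch; `time_call` is a hypothetical helper:

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Run `func` several times and return (last_result, list_of_elapsed_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, durations


total, durations = time_call(sum, range(100_000))
print(f"best of {len(durations)}: {min(durations):.6f}s")
```

In the comparison loop, a call such as `time_call(model.predict, X_test_scaled)` could replace the manual start/stop bookkeeping for the prediction step.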


In [14]:
# Create comprehensive comparison table
comparison_df = pd.DataFrame({
    'Model': list(model_results.keys()),
    'Accuracy': [model_results[name]['accuracy'] for name in model_results.keys()],
    'Training Time (s)': [training_times[name] for name in model_results.keys()],
    'Prediction Time (s)': [prediction_times[name] for name in model_results.keys()]
})

# Sort by accuracy (descending)
comparison_df = comparison_df.sort_values('Accuracy', ascending=False).reset_index(drop=True)

print("RANKED MODEL PERFORMANCE:")
print("="*60)
print(comparison_df.to_string(index=False, float_format='%.4f'))

# Find best model
best_model_name = comparison_df.iloc[0]['Model']
best_accuracy = comparison_df.iloc[0]['Accuracy']

print(f"\n🏆 BEST PERFORMING MODEL: {best_model_name}")
print(f"   Accuracy: {best_accuracy:.4f}")
print(f"   Training Time: {comparison_df.iloc[0]['Training Time (s)']:.2f}s")
print(f"   Prediction Time: {comparison_df.iloc[0]['Prediction Time (s)']:.2f}s")

# Performance insights
print(f"\n📊 PERFORMANCE INSIGHTS:")
print(f"   • Accuracy Range: {comparison_df['Accuracy'].min():.4f} - {comparison_df['Accuracy'].max():.4f}")
print(f"   • Fastest Training: {comparison_df.loc[comparison_df['Training Time (s)'].idxmin(), 'Model']}")
print(f"   • Fastest Prediction: {comparison_df.loc[comparison_df['Prediction Time (s)'].idxmin(), 'Model']}")
print(f"   • Most Balanced: {comparison_df.loc[(comparison_df['Accuracy'] * 0.8 + (1/comparison_df['Training Time (s)']) * 0.2).idxmax(), 'Model']}")


RANKED MODEL PERFORMANCE:
           Model  Accuracy  Training Time (s)  Prediction Time (s)
       SVM (RBF)    0.3983             1.7681               0.7326
    SVM (Linear)    0.3217            29.2969               0.1646
SVM (Polynomial)    0.2050             2.0830               0.2534

🏆 BEST PERFORMING MODEL: SVM (RBF)
   Accuracy: 0.3983
   Training Time: 1.77s
   Prediction Time: 0.73s

📊 PERFORMANCE INSIGHTS:
   • Accuracy Range: 0.2050 - 0.3983
   • Fastest Training: SVM (RBF)
   • Fastest Prediction: SVM (Linear)
   • Most Balanced: SVM (RBF)
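
The "Most Balanced" score adds accuracy to the reciprocal of the training time, and the reciprocal is unbounded: a model that trains in a few milliseconds would dominate regardless of its accuracy. One alternative is to rescale training time to [0, 1] first. A sketch using the values from the table above; the 0.8/0.2 weights are kept from the original cell, and the variable names are illustrative:

```python
# (accuracy, training time in seconds) from the ranked table above
ranked_table = {
    "SVM (RBF)": (0.3983, 1.7681),
    "SVM (Linear)": (0.3217, 29.2969),
    "SVM (Polynomial)": (0.2050, 2.0830),
}

train_times = [t for _, t in ranked_table.values()]
t_min, t_max = min(train_times), max(train_times)

balanced_scores = {}
for name, (acc, t) in ranked_table.items():
    # 1.0 for the fastest model, 0.0 for the slowest
    speed = (t_max - t) / (t_max - t_min) if t_max > t_min else 1.0
    balanced_scores[name] = 0.8 * acc + 0.2 * speed

most_balanced_demo = max(balanced_scores, key=balanced_scores.get)
print(most_balanced_demo)
```

With these numbers the RBF kernel still comes out on top, but the speed term can no longer exceed its 0.2 weight.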


In [None]:
# Visualize model comparison
plt.figure(figsize=(20, 12))

# 1. Accuracy Comparison
plt.subplot(2, 4, 1)
model_names = comparison_df['Model']
accuracies = comparison_df['Accuracy']
bars = plt.bar(range(len(model_names)), accuracies, color=plt.cm.viridis(np.linspace(0, 1, len(model_names))))
plt.title('Model Accuracy Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Accuracy')
plt.xticks(range(len(model_names)), model_names, rotation=45, ha='right')
plt.ylim(0, 1)
for i, (bar, acc) in enumerate(zip(bars, accuracies)):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom', fontsize=8)

# 2. Training Time Comparison
plt.subplot(2, 4, 2)
training_times_list = comparison_df['Training Time (s)']
bars = plt.bar(range(len(model_names)), training_times_list, color=plt.cm.plasma(np.linspace(0, 1, len(model_names))))
plt.title('Training Time Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Time (seconds)')
plt.xticks(range(len(model_names)), model_names, rotation=45, ha='right')
for i, (bar, time_val) in enumerate(zip(bars, training_times_list)):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(training_times_list)*0.01,
             f'{time_val:.1f}s', ha='center', va='bottom', fontsize=8)

# 3. Prediction Time Comparison
plt.subplot(2, 4, 3)
pred_times_list = comparison_df['Prediction Time (s)']
bars = plt.bar(range(len(model_names)), pred_times_list, color=plt.cm.inferno(np.linspace(0, 1, len(model_names))))
plt.title('Prediction Time Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Time (seconds)')
plt.xticks(range(len(model_names)), model_names, rotation=45, ha='right')
for i, (bar, time_val) in enumerate(zip(bars, pred_times_list)):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(pred_times_list)*0.01,
             f'{time_val:.3f}s', ha='center', va='bottom', fontsize=8)

# 4. Accuracy vs Training Time Scatter
plt.subplot(2, 4, 4)
plt.scatter(training_times_list, accuracies, s=100, alpha=0.7, c=range(len(model_names)), cmap='tab10')
for i, name in enumerate(model_names):
    plt.annotate(name, (training_times_list.iloc[i], accuracies.iloc[i]),
                xytext=(5, 5), textcoords='offset points', fontsize=8)
plt.xlabel('Training Time (s)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Training Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# 5. Top 5 Models Performance
plt.subplot(2, 4, 5)
top_5 = comparison_df.head(5)
bars = plt.bar(range(len(top_5)), top_5['Accuracy'], color=plt.cm.Set3(np.linspace(0, 1, len(top_5))))
plt.title('Top 5 Models by Accuracy', fontsize=14, fontweight='bold')
plt.ylabel('Accuracy')
plt.xticks(range(len(top_5)), top_5['Model'], rotation=45, ha='right')
plt.ylim(0, 1)
for i, (bar, acc) in enumerate(zip(bars, top_5['Accuracy'])):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

# 6. Model Categories Performance
plt.subplot(2, 4, 6)
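# Use *_rows names so the svm_models dict from the kernel comparison cell is not overwritten
svm_rows = comparison_df[comparison_df['Model'].str.contains('SVM')]
ensemble_rows = comparison_df[comparison_df['Model'].str.contains('Forest|Boosting')]
other_rows = comparison_df[~comparison_df['Model'].str.contains('SVM|Forest|Boosting')]

# Keep only categories that contain at least one model: the mean of an empty
# selection is NaN and would be drawn as a missing bar labelled "nan"
category_groups = {
    'SVM Models': svm_rows,
    'Ensemble Models': ensemble_rows,
    'Other Models': other_rows
}
categories = [name for name, rows in category_groups.items() if len(rows) > 0]
category_accuracies = [category_groups[name]['Accuracy'].mean() for name in categories]

category_colors = ['skyblue', 'lightgreen', 'lightcoral'][:len(categories)]
bars = plt.bar(categories, category_accuracies, color=category_colors)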
plt.title('Performance by Model Category', fontsize=14, fontweight='bold')
plt.ylabel('Average Accuracy')
plt.ylim(0, 1)
for bar, acc in zip(bars, category_accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# 7. Confusion Matrix for Best Model
plt.subplot(2, 4, 7)
best_model_predictions = model_results[best_model_name]['predictions']
cm = confusion_matrix(y_test, best_model_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title(f'Best Model Confusion Matrix\n({best_model_name})', fontsize=12, fontweight='bold')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# 8. Performance Summary Table
plt.subplot(2, 4, 8)
plt.axis('off')
table_data = comparison_df.head(8)[['Model', 'Accuracy']].values
table = plt.table(cellText=table_data,
                  colLabels=['Model', 'Accuracy'],
                  cellLoc='center',
                  loc='center',
                  colWidths=[0.6, 0.3])
table.auto_set_font_size(False)
table.set_fontsize(8)
table.scale(1, 2)
plt.title('Top 8 Models Summary', fontsize=12, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()


In [15]:
# Detailed analysis of best performing models
print("="*70)
print("DETAILED ANALYSIS OF TOP PERFORMING MODELS")
print("="*70)

# Get top 3 models
top_3_models = comparison_df.head(3)

for idx, row in top_3_models.iterrows():
    model_name = row['Model']
    print(f"\n🏆 RANK #{idx+1}: {model_name}")
    print(f"   Accuracy: {row['Accuracy']:.4f}")
    print(f"   Training Time: {row['Training Time (s)']:.2f}s")
    print(f"   Prediction Time: {row['Prediction Time (s)']:.2f}s")

    # Get detailed classification report for top 3
    predictions = model_results[model_name]['predictions']
    report = classification_report(y_test, predictions, target_names=class_names, output_dict=True)

    print(f"   Per-class Performance:")
    for class_name in class_names:
        if class_name in report:
            precision = report[class_name]['precision']
            recall = report[class_name]['recall']
            f1 = report[class_name]['f1-score']
            print(f"     {class_name}: P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}")

# Model recommendations
print(f"\n💡 MODEL RECOMMENDATIONS:")
print(f"="*50)

# Best overall accuracy
best_accuracy_model = comparison_df.iloc[0]['Model']
print(f"🎯 For Maximum Accuracy: {best_accuracy_model}")

# Fastest training
fastest_training = comparison_df.loc[comparison_df['Training Time (s)'].idxmin(), 'Model']
print(f"⚡ For Fastest Training: {fastest_training}")

# Fastest prediction
fastest_prediction = comparison_df.loc[comparison_df['Prediction Time (s)'].idxmin(), 'Model']
print(f"🚀 For Fastest Prediction: {fastest_prediction}")

# Most balanced (good accuracy + reasonable speed)
balanced_score = comparison_df['Accuracy'] * 0.7 + (1 / comparison_df['Training Time (s)']) * 0.3
most_balanced = comparison_df.loc[balanced_score.idxmax(), 'Model']
print(f"⚖️  For Balanced Performance: {most_balanced}")

# SVM specific analysis
svm_performance = comparison_df[comparison_df['Model'].str.contains('SVM')]
if len(svm_performance) > 0:
    best_svm = svm_performance.iloc[0]['Model']
    svm_rank = comparison_df[comparison_df['Model'] == best_svm].index[0] + 1
    print(f"🔬 Best SVM Kernel: {best_svm} (Overall Rank: #{svm_rank})")

print(f"\n📈 PERFORMANCE INSIGHTS:")
print(f"   • Accuracy improvement over worst model: {((comparison_df.iloc[0]['Accuracy'] - comparison_df.iloc[-1]['Accuracy']) / comparison_df.iloc[-1]['Accuracy'] * 100):.1f}%")
print(f"   • Training time range: {comparison_df['Training Time (s)'].min():.2f}s - {comparison_df['Training Time (s)'].max():.2f}s")
print(f"   • Prediction time range: {comparison_df['Prediction Time (s)'].min():.3f}s - {comparison_df['Prediction Time (s)'].max():.3f}s")
print(f"   • Models with >80% accuracy: {len(comparison_df[comparison_df['Accuracy'] > 0.8])}")
print(f"   • Models with <1s training time: {len(comparison_df[comparison_df['Training Time (s)'] < 1.0])}")


DETAILED ANALYSIS OF TOP PERFORMING MODELS

🏆 RANK #1: SVM (RBF)
   Accuracy: 0.3983
   Training Time: 1.77s
   Prediction Time: 0.73s
   Per-class Performance:
     cardboard: P=0.380, R=0.350, F1=0.365
     glass: P=0.352, R=0.250, F1=0.292
     metal: P=0.480, R=0.470, F1=0.475
     paper: P=0.449, R=0.310, F1=0.367
     plastic: P=0.381, R=0.510, F1=0.436
     trash: P=0.368, R=0.500, F1=0.424

🏆 RANK #2: SVM (Linear)
   Accuracy: 0.3217
   Training Time: 29.30s
   Prediction Time: 0.16s
   Per-class Performance:
     cardboard: P=0.312, R=0.390, F1=0.347
     glass: P=0.277, R=0.360, F1=0.313
     metal: P=0.367, R=0.360, F1=0.364
     paper: P=0.354, R=0.290, F1=0.319
     plastic: P=0.279, R=0.290, F1=0.284
     trash: P=0.393, R=0.240, F1=0.298

🏆 RANK #3: SVM (Polynomial)
   Accuracy: 0.2050
   Training Time: 2.08s
   Prediction Time: 0.25s
   Per-class Performance:
     cardboard: P=0.833, R=0.050, F1=0.094
     glass: P=1.000, R=0.060, F1=0.113
     metal: P=0.818, R=0.090, 
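
The polynomial kernel's rows above show why precision alone can mislead: precision is high but recall is near zero, and F1, the harmonic mean of the two, stays close to the smaller value. A quick check against the reported (rounded) numbers, with `f1_from` as an illustrative helper name:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Polynomial-kernel rows from the output above
print(f"cardboard: {f1_from(0.833, 0.050):.3f}")
print(f"glass:     {f1_from(1.000, 0.060):.3f}")
```

Both values agree with the F1 column of the report (0.094 and 0.113).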

In [None]:
# Final summary
print("="*70)
print("SVM IMAGE CLASSIFICATION - FINAL SUMMARY")
print("="*70)

print(f"Dataset: Garbage Classification Dataset")
print(f"Total samples processed: {len(X)}")
print(f"Number of classes: {len(class_names)}")
print(f"Classes: {', '.join(class_names)}")
print(f"Original feature dimension: {X.shape[1]}")
print(f"PCA reduced dimension: {X_final.shape[1]}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

print(f"\nSVM Performance Comparison:")
for kernel, result in results.items():
    print(f"  {kernel.upper()} Kernel: {result['accuracy']:.4f}")

print(f"\nBest Model:")
print(f"  Kernel: {best_kernel.upper()}")
print(f"  Best Parameters: {grid_search.best_params_}")
print(f"  Cross-validation Score: {grid_search.best_score_:.4f}")
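# best_accuracy was reassigned in the model comparison cell, so recompute it from the tuned model
print(f"  Test Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")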

print(f"\nKey Insights:")
print(f"  1. PCA reduced the feature dimension from {X.shape[1]} to {X_final.shape[1]}, which lowers training cost")
print(f"  2. Features were standardized before training; the effect was not measured separately")
print(f"  3. Grid search selected {grid_search.best_params_} (CV accuracy {grid_search.best_score_:.4f})")
print(f"  4. The {best_kernel} kernel performed best among the kernels tested")

print(f"\nRecommendations for Improvement:")
print(f"  1. Try different image preprocessing techniques (edge detection, texture features)")
print(f"  2. Experiment with different feature extraction methods (HOG, LBP)")
print(f"  3. Consider ensemble methods combining multiple SVM models")
print(f"  4. Use more sophisticated data augmentation techniques")
print(f"  5. Try deep learning approaches for comparison")

print("="*70)
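
The recommendations mention gradient-based descriptors such as HOG. As a rough illustration of the underlying idea only, the sketch below computes a single magnitude-weighted histogram of gradient orientations for a whole grayscale image. It is a simplification and not the full HOG descriptor (no cells, no block normalization); `orientation_histogram` and its parameters are assumptions made for this example:

```python
import numpy as np


def orientation_histogram(gray, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for a 2-D array."""
    gy, gx = np.gradient(gray.astype(float))   # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in degrees
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


# Toy image whose intensity increases from left to right
ramp = np.tile(np.arange(8.0), (8, 1))
features = orientation_histogram(ramp)
print(features.shape, int(np.argmax(features)))
```

A vector like this has far fewer dimensions than the 12288 raw pixel values, which is one reason such descriptors are often paired with SVMs.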
