# Class-Specific Data Augmentation

## Purpose

This script addresses class imbalance in the training dataset by generating synthetic samples for underrepresented classes. Instead of augmenting the entire dataset, it creates additional samples only for classes smaller than the largest class, bringing each one up to that class's size.

## Why This Is Needed

Real-world datasets often have class imbalance. For example, you might have:
- Cardboard: 500 images
- Glass: 300 images  
- Metal: 150 images

Training on imbalanced data tends to bias the model toward the majority classes. This script generates augmented samples for the minority classes (glass and metal in this example) to create a balanced training set.

---

## What This Script Does

The script performs five main steps:

1. **Loads the training data** - Reads the manifest files created by the EDA script
2. **Analyzes class imbalance** - Counts samples per class and identifies which classes need augmentation
3. **Defines augmentation strategy** - Sets up various transformations (rotation, flipping, color changes)
4. **Generates synthetic samples** - Creates augmented images for minority classes only
5. **Saves augmented data** - Exports the new samples as numpy arrays

---

## Augmentation Techniques Used

The script applies random combinations of:

- **Geometric transformations**:
  - Random cropping (224x224 from 256x256, then resized back to 256x256)
  - Horizontal flipping (50% chance)
  - Vertical flipping (50% chance)
  - Rotation (±30 degrees)
  - Translation (±10% in x and y)
  - Scaling (90-110%)
  - Perspective distortion (20% scale, 30% chance)

- **Color transformations**:
  - Brightness adjustment (±30%)
  - Contrast adjustment (±30%)
  - Saturation adjustment (±20%)
  - Hue adjustment (±10%)

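Each augmented image is a randomly transformed variation of an original image from the same class. Source images are sampled with replacement, so one original can produce several different variants.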
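
To make the geometric steps concrete, here is a minimal NumPy sketch of a random crop followed by random flips. It is an illustration only, not the torchvision pipeline the script uses: the function name `random_crop_and_flip` and its parameters are hypothetical, and the resize back to 256x256, rotation, affine, perspective, and color steps are omitted.

```python
import numpy as np


def random_crop_and_flip(image, crop_size=224, flip_prob=0.5, rng=None):
    """Crop a random square window, then flip it horizontally and/or vertically."""
    rng = rng if rng is not None else np.random.default_rng()
    height, width = image.shape[:2]
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size must not exceed the image dimensions")

    # Top-left corner of the crop window, chosen uniformly at random.
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    patch = image[top:top + crop_size, left:left + crop_size]

    if rng.random() < flip_prob:
        patch = patch[:, ::-1]  # horizontal flip: reverse the column order
    if rng.random() < flip_prob:
        patch = patch[::-1, :]  # vertical flip: reverse the row order
    return np.ascontiguousarray(patch)


# A blank 256x256 RGB image stands in for a real photo.
demo = np.zeros((256, 256, 3), dtype=np.uint8)
patch = random_crop_and_flip(demo, rng=np.random.default_rng(42))
print(patch.shape, patch.dtype)
```

Passing an explicit `rng` keeps the sketch repeatable without touching global random state.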

---

## Configuration

Key parameters at the top of the script:

```python
RANDOM_SEED = 42      # For reproducibility
IMAGE_SIZE = 256      # Output image size
```

---

## Output Files

### Generated Files

- **augmented_train_images.npy** - Numpy array containing augmented images
  - Shape: (N, 256, 256, 3) where N = number of augmented samples
  - Data type: uint8 (0-255 RGB values)
  
- **augmented_train_labels.npy** - Numpy array with corresponding class labels
  - Shape: (N,) 
  - Contains string labels matching the augmented images
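
Before combining these files with the original data, it can help to confirm they match the layout described above. The following is a small sketch under that assumption; `check_augmented_arrays` is a hypothetical helper, and synthetic arrays stand in for the saved files.

```python
import numpy as np


def check_augmented_arrays(images, labels, image_size=256):
    """Raise ValueError if the arrays do not match the documented layout."""
    if images.ndim != 4 or images.shape[1:] != (image_size, image_size, 3):
        raise ValueError(f"unexpected image shape: {images.shape}")
    if images.dtype != np.uint8:
        raise ValueError(f"unexpected image dtype: {images.dtype}")
    if labels.shape != (images.shape[0],):
        raise ValueError(
            f"labels shape {labels.shape} does not match {images.shape[0]} images"
        )
    # Per-class counts of the augmented samples, as plain Python ints.
    classes, counts = np.unique(labels, return_counts=True)
    return {str(c): int(n) for c, n in zip(classes, counts)}


# Synthetic stand-ins for the two saved arrays.
images = np.zeros((3, 256, 256, 3), dtype=np.uint8)
labels = np.array(["glass", "metal", "metal"])
summary = check_augmented_arrays(images, labels)
print(summary)  # {'glass': 1, 'metal': 2}
```

With the real files, the same check would be applied to the results of `np.load("augmented_train_images.npy")` and `np.load("augmented_train_labels.npy")`.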

### Console Output

The script prints:
- Original class distribution
- Target count (size of largest class)
- Number of augmented samples needed per class
- Progress bars during generation
- Final balanced distribution
- Memory usage statistics

---

## How It Works

### Example Scenario

Original training distribution:
```
cardboard: 500 samples
glass:     300 samples  
metal:     150 samples
```

The script will:
1. Identify target count = 500 (maximum)
2. Calculate needed augmentations:
   - cardboard: 0 (already at target)
   - glass: 200 (500 - 300)
   - metal: 350 (500 - 150)
3. Randomly select source images from each minority class
4. Apply random augmentations to create 550 total new images
5. Save only these augmented samples
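
The arithmetic in steps 1 and 2 can be sketched in a few lines of plain Python. The function name `augmentation_plan` is hypothetical; the counts are the ones from the example scenario.

```python
def augmentation_plan(class_counts):
    """Return how many augmented samples each class needs to reach the largest class."""
    target = max(class_counts.values())
    return {cls: target - count for cls, count in class_counts.items()}


counts = {"cardboard": 500, "glass": 300, "metal": 150}
plan = augmentation_plan(counts)
print(plan)                # {'cardboard': 0, 'glass': 200, 'metal': 350}
print(sum(plan.values()))  # 550
```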

### Augmentation Process

For each needed sample:
1. Randomly pick an existing image from that class
2. Load and convert to RGB
3. Apply the augmentation pipeline
4. Store the result as a numpy array

This ensures diversity while maintaining class characteristics.
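
The loop above can be sketched as follows. This is a simplified illustration: `generate_for_class` is a hypothetical name, images are taken from an in-memory array rather than loaded from paths, and a plain horizontal flip stands in for the real augmentation pipeline.

```python
import numpy as np


def generate_for_class(class_images, needed, augment, rng):
    """Create `needed` augmented images, sampling sources with replacement."""
    results = []
    for _ in range(needed):
        # 1. Randomly pick an existing image from this class.
        source = class_images[rng.integers(len(class_images))]
        # 2-3. Apply the augmentation (any callable taking and returning an array).
        results.append(augment(source))
    if not results:
        return np.empty((0,) + class_images.shape[1:], dtype=class_images.dtype)
    # 4. Store the results as one array.
    return np.stack(results)


# Two tiny 4x4 RGB "images" stand in for a minority class.
class_images = np.arange(2 * 4 * 4 * 3, dtype=np.uint8).reshape(2, 4, 4, 3)
flipped = generate_for_class(
    class_images,
    needed=5,
    augment=lambda img: img[:, ::-1],
    rng=np.random.default_rng(42),
)
print(flipped.shape)  # (5, 4, 4, 3)
```

Because sources are drawn with replacement, a class with only two originals can still yield five outputs.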

---

## How to Run

```bash
python augmentation_notebook.py
```

**Prerequisites:**
- Run the EDA script first to generate manifest files
- Required files:
  - train_manifest.csv
  - val_manifest.csv
  - test_manifest.csv
  - classes.json

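**Runtime:** Scales with the number of samples to generate. Allow 5-20 minutes for large augmentation needs; the sample run recorded in this document generated 2,479 images at roughly 65-80 images per second.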

---

## Memory Considerations

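The script holds every augmented image in memory before saving. Each 256x256x3 `uint8` image takes 196,608 bytes, so large augmentation needs add up quickly (1 MB = 1024² bytes here, as in the script's memory printout):

- 1000 images at 256x256x3: ~188 MB
- 5000 images at 256x256x3: ~938 MB
- 10000 images at 256x256x3: ~1.8 GB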
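
These estimates follow from one byte per channel value. A short sketch of the calculation, with a hypothetical helper name `estimate_mib`:

```python
def estimate_mib(num_images, image_size=256, channels=3, bytes_per_value=1):
    """Estimate the in-memory size of a uint8 image array in MiB (1024**2 bytes)."""
    total_bytes = num_images * image_size * image_size * channels * bytes_per_value
    return total_bytes / (1024 ** 2)


for n in (1000, 5000, 10000):
    print(f"{n:>6d} images: {estimate_mib(n):8.1f} MiB")
```

For the 2,479 images in the sample run, `estimate_mib(2479)` gives about 464.8, which matches the memory figure the script printed.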

If you encounter memory errors, consider:
- Reducing IMAGE_SIZE
- Processing in batches
- Using online augmentation during training instead
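
One way to process in batches is to write directly into a memory-mapped `.npy` file, so only one batch is held in RAM at a time. The sketch below is not part of the script: `write_in_batches` and `make_batch` are hypothetical names, and the demo uses tiny 8x8 images filled with a constant value in place of real augmentations.

```python
import os
import tempfile

import numpy as np


def write_in_batches(path, total, batch_size, make_batch, image_size=256):
    """Fill an .npy file batch by batch using a memory-mapped array."""
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.uint8, shape=(total, image_size, image_size, 3)
    )
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        # make_batch(n) must return an array of shape (n, image_size, image_size, 3).
        out[start:stop] = make_batch(stop - start)
    out.flush()
    del out  # release the memory map so the file can be reopened or removed


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "augmented_demo.npy")
    write_in_batches(
        demo_path,
        total=10,
        batch_size=4,
        make_batch=lambda n: np.full((n, 8, 8, 3), 7, dtype=np.uint8),
        image_size=8,
    )
    loaded = np.load(demo_path)
    print(loaded.shape)  # (10, 8, 8, 3)
```

The resulting file is a regular `.npy` file, so downstream code can still read it with `np.load`.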

---

## Integration with Training

After running this script, the next notebook should:

1. Load original training images
2. Load augmented images from this script
3. Combine both datasets
4. Encode labels
5. Export final training arrays

Example:
```python
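import numpy as np
import pandas as pd

train_df = pd.read_csv("train_manifest.csv")

# Load original images. load_from_manifests is a placeholder for a helper that
# reads every path in the manifest and returns a uint8 array of shape
# (N, 256, 256, 3); it is not defined in this script.
original_images = load_from_manifests(train_df)
original_labels = train_df['label'].values

# Load augmented
aug_images = np.load('augmented_train_images.npy')
aug_labels = np.load('augmented_train_labels.npy')

# Combine (both image arrays must share the same height, width, and channels)
all_images = np.concatenate([original_images, aug_images])
all_labels = np.concatenate([original_labels, aug_labels])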
```
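
Step 4 (encoding labels) is not shown in the example above. A minimal sketch, assuming integer codes should follow the order of the class list in `classes.json`; the helper name `encode_labels` and the sample values are hypothetical:

```python
import numpy as np


def encode_labels(labels, class_names):
    """Map string labels to integer indices following the order of class_names."""
    index = {name: i for i, name in enumerate(class_names)}
    unknown = sorted(set(labels) - set(index))
    if unknown:
        raise ValueError(f"labels not present in class_names: {unknown}")
    return np.array([index[label] for label in labels], dtype=np.int64)


class_names = ["cardboard", "glass", "metal"]
all_labels = np.array(["glass", "metal", "cardboard", "metal"])
encoded = encode_labels(all_labels, class_names)
print(encoded.tolist())  # [1, 2, 0, 2]
```

Using a fixed class list keeps the integer codes identical across the train, validation, and test sets.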

---

## Important Notes

### This Script Only Augments Training Data

Validation and test sets are NOT augmented because:
- They need to represent real-world distribution
- Augmentation would artificially inflate performance metrics
- Model evaluation must be on genuine unseen data

### Already Balanced Datasets

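If your dataset is already balanced (every class has exactly as many samples as the largest class), the script will:
- Detect this automatically
- Skip augmentation generation
- Report that no augmentation is needed
- Print the summary without creating any files

Any class with fewer samples than the largest class, even by one, still triggers augmentation for that class.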

### Reproducibility

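Seeding with `RANDOM_SEED` makes runs repeatable:
- NumPy's seed fixes which source images are selected for each class
- Recent torchvision versions draw transform parameters from PyTorch's global random generator, so identical augmented images also depend on calling `torch.manual_seed(RANDOM_SEED)`
- With all generators seeded, experiments in your report can be reproduced exactly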
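
A minimal sketch of seeding every random generator involved. The helper name `seed_everything` is hypothetical, and the PyTorch call is wrapped so the sketch also runs where `torch` is not installed; it assumes the transforms sample from PyTorch's global generator, as recent torchvision versions do.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed Python, NumPy and (if installed) PyTorch; return True if torch was seeded."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return False
    torch.manual_seed(seed)
    return True


seed_everything(42)
first = np.random.choice(100, size=5).tolist()
seed_everything(42)
second = np.random.choice(100, size=5).tolist()
print(first == second)  # True
```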

---

## Advantages of This Approach

**Compared to oversampling by duplication:**
- Creates diverse variations instead of exact duplicates
- Helps model generalize better
- Reduces overfitting risk

**Compared to undersampling:**
- Doesn't discard valuable majority class samples
- Uses all available information
- Better for small datasets

**Compared to augmenting everything:**
- More efficient (only augments what's needed)
- Faster processing
- Less storage required



```python
"""
Notebook 02: Class-Specific Augmentation for Imbalanced Data
=============================================================
Generates synthetic samples for minority classes ONLY
"""

import json
import random

import numpy as np
import pandas as pd
import torch
import torchvision.transforms as T
from PIL import Image
from tqdm import tqdm

# ==========================================
# 0) Configuration
# ==========================================
RANDOM_SEED = 42
IMAGE_SIZE = 256
np.random.seed(RANDOM_SEED)     # controls source-image selection (np.random.choice)
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)  # torchvision transforms sample from PyTorch's RNG

print("="*70)
print("CLASS-SPECIFIC AUGMENTATION FOR IMBALANCED DATA")
print("="*70)

# ==========================================
# 1) Load Training Data
# ==========================================
print("\n[1/5] Loading training data...")

train_df = pd.read_csv("train_manifest.csv")
val_df = pd.read_csv("val_manifest.csv")
test_df = pd.read_csv("test_manifest.csv")

with open("classes.json") as f:
    class_names = json.load(f)

print(f"✓ Train: {len(train_df)} images")
print(f"✓ Val:   {len(val_df)} images")
print(f"✓ Test:  {len(test_df)} images")

# ==========================================
# 2) Analyze Class Imbalance
# ==========================================
print("\n[2/5] Analyzing class distribution...")

class_counts = train_df['label'].value_counts()
print("\nOriginal class distribution (TRAIN only):")
for cls in class_names:
    count = class_counts.get(cls, 0)
    print(f"  {cls:20s}: {count:4d} samples")

# Determine target count (size of the largest class)
target_count = class_counts.max()
print(f"\n✓ Target count per class: {target_count}")

# Calculate needed augmentations
augmentation_needed = {}
for cls in class_names:
    current = class_counts.get(cls, 0)
    needed = max(0, target_count - current)
    augmentation_needed[cls] = needed
    if needed > 0:
        print(f"  {cls:20s}: needs {needed} augmented samples")

# ==========================================
# 3) Define Augmentation Transform
# ==========================================
print("\n[3/5] Defining augmentation strategy...")

augment_transform = T.Compose([
    T.ToPILImage(),
    T.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    T.RandomCrop(224),
    T.Resize((IMAGE_SIZE, IMAGE_SIZE)),  # Resize back
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=30),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.1),
    T.RandomPerspective(distortion_scale=0.2, p=0.3),
])

print("✓ Augmentation strategy defined")

# ==========================================
# 4) Generate Augmented Samples
# ==========================================
print("\n[4/5] Generating augmented samples...")

augmented_images = []
augmented_labels = []

total_to_generate = sum(augmentation_needed.values())
print(f"Total augmented samples to generate: {total_to_generate}")

if total_to_generate == 0:
    print("\n⚠ Dataset is already balanced! No augmentation needed.")
else:
    for class_name in class_names:
        needed = augmentation_needed[class_name]

        if needed == 0:
            continue

        print(f"\nProcessing {class_name}...")

        # Get all images from this class
        class_df = train_df[train_df['label'] == class_name]
        class_paths = class_df['path'].values

        # Generate augmented samples
        for _ in tqdm(range(needed), desc=f"Augmenting {class_name}"):
            # Randomly select a source image (with replacement)
            source_path = np.random.choice(class_paths)

            # Load image
            img = Image.open(source_path).convert('RGB')
            img_array = np.array(img)

            # Apply augmentation
            aug_img = augment_transform(img_array)
            aug_array = np.array(aug_img, dtype=np.uint8)

            # Store
            augmented_images.append(aug_array)
            augmented_labels.append(class_name)

    # Convert to numpy arrays
    augmented_images = np.array(augmented_images)
    augmented_labels = np.array(augmented_labels)

    print(f"\n✓ Generated {len(augmented_images)} augmented images")
    print(f"  Shape: {augmented_images.shape}")
    print(f"  Memory: {augmented_images.nbytes / (1024**2):.1f} MB")

# ==========================================
# 5) Save Augmented Data
# ==========================================
print("\n[5/5] Saving augmented data...")

if total_to_generate > 0:
    np.save("augmented_train_images.npy", augmented_images)
    np.save("augmented_train_labels.npy", augmented_labels)

    print("✓ Saved files:")
    print("  - augmented_train_images.npy")
    print("  - augmented_train_labels.npy")
else:
    print("✓ No augmented data to save (dataset already balanced)")

# ==========================================
# Summary
# ==========================================
print("\n" + "="*70)
print("AUGMENTATION SUMMARY")
print("="*70)

print("\nClass distribution after augmentation:")
for cls in class_names:
    original = class_counts.get(cls, 0)
    augmented = augmentation_needed[cls]
    total = original + augmented
    print(f"  {cls:20s}: {original:4d} (original) + {augmented:4d} (augmented) = {total:4d} (total)")

print(f"\nTotal training samples: {len(train_df)} → {len(train_df) + total_to_generate}")

print("\nNext step: Run Notebook 03 to combine and export to final NPY format")
print("="*70)
```

```
CLASS-SPECIFIC AUGMENTATION FOR IMBALANCED DATA

[1/5] Loading training data...
✓ Train: 3326 images
✓ Val:   713 images
✓ Test:  713 images

[2/5] Analyzing class distribution...

Original class distribution (TRAIN only):
  Cardboard           :  323 samples
  Food Organics       :  288 samples
  Glass               :  294 samples
  Metal               :  553 samples
  Miscellaneous Trash :  346 samples
  Paper               :  350 samples
  Plastic             :  645 samples
  Textile Trash       :  222 samples
  Vegetation          :  305 samples

✓ Target count per class: 645
  Cardboard           : needs 322 augmented samples
  Food Organics       : needs 357 augmented samples
  Glass               : needs 351 augmented samples
  Metal               : needs 92 augmented samples
  Miscellaneous Trash : needs 299 augmented samples
  Paper               : needs 295 augmented samples
  Textile Trash       : needs 423 augmented samples
  Vegetation          : needs 340 augmented samples

Augmenting Cardboard: 100%|██████████| 322/322 [00:04<00:00, 78.77it/s]

Processing Food Organics...
Augmenting Food Organics: 100%|██████████| 357/357 [00:05<00:00, 70.77it/s]

Processing Glass...
Augmenting Glass: 100%|██████████| 351/351 [00:04<00:00, 73.98it/s]

Processing Metal...
Augmenting Metal: 100%|██████████| 92/92 [00:01<00:00, 71.46it/s]

Processing Miscellaneous Trash...
Augmenting Miscellaneous Trash: 100%|██████████| 299/299 [00:04<00:00, 69.59it/s]

Processing Paper...
Augmenting Paper: 100%|██████████| 295/295 [00:04<00:00, 69.36it/s]

Processing Textile Trash...
Augmenting Textile Trash: 100%|██████████| 423/423 [00:06<00:00, 69.05it/s]

Processing Vegetation...
Augmenting Vegetation: 100%|██████████| 340/340 [00:05<00:00, 64.66it/s]

✓ Generated 2479 augmented images
  Shape: (2479, 256, 256, 3)
  Memory: 464.8 MB

[5/5] Saving augmented data...
✓ Saved files:
  - augmented_train_images.npy
  - augmented_train_labels.npy

AUGMENTATION SUMMARY

Class distribution after augmentation:
  Cardboard           :  323 (original) +  322 (augmented) =  645 (total)
  Food Organics       :  288 (original) +  357 (augmented) =  645 (total)
  Glass               :  294 (original) +  351 (augmented) =  645 (total)
  Metal               :  553 (original) +   92 (augmented) =  645 (total)
  Miscellaneous Trash :  346 (original) +  299 (augmented) =  645 (total)
  Paper               :  350 (original) +  295 (augmented) =  645 (total)
  Plastic             :  645 (original) +    0 (augmented) =  645 (total)
  Textile Trash       :  222 (original) +  423 (augmented) =  645 (total)
  Vegetation          :  305 (original) +  340 (augmented) =  645 (total)

Total training samples: 3326 → 5805
```

Next step: Run Notebook 03 to combine and 