# Dataset Export to NPY Format

## Purpose

This script combines the original dataset with augmented minority-class samples and exports everything to NumPy's NPY format. The result is the final set of training-ready arrays used by the CNN training scripts.

## Position in the Pipeline

This is the third and final preprocessing step:

1. **EDA script** - Cleans data and creates train/val/test splits
2. **Augmentation script** - Generates synthetic samples for minority classes
3. **This export script** - Combines everything and exports to NPY format

---

## What This Script Does

The script performs six main steps:

1. **Loads split information** - Reads the manifest files from the EDA script
2. **Loads augmented data** - Imports synthetic samples (if they exist)
3. **Processes original images** - Loads and resizes all original images to 256x256
4. **Creates balanced training set** - Combines original training images with augmented samples
5. **Saves NPY files** - Exports everything as numpy arrays for fast loading during training
6. **Verifies the output** - Checks class distribution across all splits
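
As a rough illustration of step 6, the per-class counts for each split can be computed with `np.bincount`. This is a minimal sketch on toy data; the label values and split indices below are made up for the example and are not taken from the real dataset.

```python
import numpy as np

# Toy stand-ins for labels.npy and the three split index files.
labels = np.array([0, 0, 1, 2, 2, 2, 1, 0])
splits = {
    "train": np.array([0, 1, 2, 3, 4]),
    "val": np.array([5, 6]),
    "test": np.array([7]),
}
num_classes = 3

# minlength keeps one slot per class even if a class is absent from a split.
counts = {
    name: np.bincount(labels[idx], minlength=num_classes)
    for name, idx in splits.items()
}

for name, per_class in counts.items():
    print(f"{name:<6} {per_class.tolist()}")
```

A class that shows a zero count in any split is worth investigating before training.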

---

## Why NPY Format?

### Advantages over loading images directly during training:

**Speed:**
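- Decoding and resizing an image file: ~50-200ms per image (rough estimate, hardware-dependent)
- Reading a preprocessed image from an NPY array: ~1-5ms per image (rough estimate)
- Data loading becomes roughly 10-40x faster; the end-to-end training speed-up depends on how much of each step is spent loading data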

**Consistency:**
- All images preprocessed identically
- No runtime resizing errors
- Deterministic data loading

**Simplicity:**
- A handful of NPY files instead of thousands of image files
- Easier to share and backup
- No broken path issues
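
If the full array does not fit comfortably in RAM, NumPy can memory-map an NPY file so that only the rows you index are read. The sketch below uses a small temporary file as a stand-in for `images.npy`; the file name `images_demo.npy` is hypothetical.

```python
import os
import tempfile
import numpy as np

# Toy stand-in for images.npy (the real file has shape (N, 256, 256, 3)).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "images_demo.npy")
np.save(demo_path, np.zeros((8, 256, 256, 3), dtype=np.uint8))

# mmap_mode="r" maps the file read-only instead of loading it all into RAM.
images = np.load(demo_path, mmap_mode="r")

# np.array(...) copies just the requested rows into memory.
batch = np.array(images[2:6])
print(images.shape, batch.shape, batch.dtype)
```

Whether memory-mapping helps depends on the storage device and access pattern; for a dataset that fits in memory, a plain `np.load` is usually simpler.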

---

## Output Files

### Main Data Files

**images.npy**
- Contains all images (original + augmented)
- Shape: (N, 256, 256, 3)
- Data type: uint8 (0-255 RGB values)
- N = total number of images

**labels.npy**
- Contains encoded labels for all images
- Shape: (N,)
- Data type: int64
- Values: 0 to (num_classes-1)

### Split Index Files

**split_train.npy**
- Indices for training set
- Includes both original and augmented images
- Use: `train_images = images[split_train]`

**split_val.npy**
- Indices for validation set
- Contains only original images (no augmentation)
- Use: `val_images = images[split_val]`

**split_test.npy**
- Indices for test set
- Contains only original images (no augmentation)
- Use: `test_images = images[split_test]`
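
Putting the data files and split indices together, a training script might load one split as follows. This is a sketch, not the project's actual loader: the helper name `load_split` is hypothetical, and the toy files written to a temporary directory only mimic the names and layout of the real outputs.

```python
import os
import tempfile
import numpy as np

def load_split(data_dir, split):
    """Return (images, labels) for one split: 'train', 'val' or 'test'."""
    images = np.load(os.path.join(data_dir, "images.npy"))
    labels = np.load(os.path.join(data_dir, "labels.npy"))
    idx = np.load(os.path.join(data_dir, f"split_{split}.npy"))
    return images[idx], labels[idx]

# Toy files with the same names as the real outputs (6 tiny 4x4 "images").
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "images.npy"), np.zeros((6, 4, 4, 3), dtype=np.uint8))
np.save(os.path.join(demo_dir, "labels.npy"), np.array([0, 1, 2, 0, 1, 2], dtype=np.int64))
np.save(os.path.join(demo_dir, "split_train.npy"), np.array([0, 1, 2, 3]))
np.save(os.path.join(demo_dir, "split_val.npy"), np.array([4]))
np.save(os.path.join(demo_dir, "split_test.npy"), np.array([5]))

train_images, train_labels = load_split(demo_dir, "train")
val_images, val_labels = load_split(demo_dir, "val")
print(train_images.shape, train_labels.tolist(), val_labels.tolist())
```

Because the splits are stored as index arrays, the same `images.npy` serves all three splits without duplicating pixel data on disk.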

### Metadata Files

**class_names.npy**
- Array of class names in order
- Example: ['cardboard', 'glass', 'metal', ...]
- Use for decoding predictions back to class names

**filepaths.npy**
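- Original file paths, in `manifest_clean.csv` order
- Covers the original images only; augmented samples appended to `images.npy` have no entry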
- Useful for debugging and error analysis
- Maps back to source images if needed

---

## How the Combination Works

### Without Augmentation:
```
images.npy = [original_images]
labels.npy = [original_labels]
split_train = [indices to original train images]
```

### With Augmentation:
```
images.npy = [original_images | augmented_images]
                  ↑                    ↑
              0 to N-1            N to N+M-1

split_train = [original train indices | augmented indices]
                     ↑                         ↑
                  subset of 0:N            all of N:N+M
```

The augmented images are appended to the end, and training indices include both original and augmented samples.
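
The index arithmetic can be checked on a toy example. The sizes `N = 5` and `M = 3` and the original training indices below are made up for illustration.

```python
import numpy as np

N, M = 5, 3  # toy counts of original and augmented images

original_images = np.zeros((N, 2, 2, 3), dtype=np.uint8)
augmented_images = np.ones((M, 2, 2, 3), dtype=np.uint8)
train_idx_original = np.array([0, 2, 3])  # a subset of 0..N-1

# Augmented images are appended after the originals ...
all_images = np.concatenate([original_images, augmented_images], axis=0)

# ... so they occupy indices N..N+M-1, all of which join the training split.
augmented_idx = np.arange(N, N + M)
split_train = np.concatenate([train_idx_original, augmented_idx])

print(split_train)  # [0 2 3 5 6 7]
```

Validation and test indices always stay below `N`, so they can never point at an augmented image.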

---

## Configuration

```python
FINAL_SIZE = (256, 256)  # All images resized to this dimension
```

You can modify this if needed, but 256x256 is recommended based on:
- RealWaste paper findings
- Balance between detail preservation and memory
- Common practice for waste classification
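
To make the memory side of that trade-off concrete, the size of `images.npy` can be estimated before changing `FINAL_SIZE`. This is a back-of-the-envelope sketch: the helper `estimate_npy_bytes` is hypothetical, it ignores the small NPY header, and the image count 4752 is the total reported in the sample run of this notebook.

```python
def estimate_npy_bytes(num_images, size=(256, 256), channels=3):
    """Approximate size of a uint8 image array (one byte per channel value)."""
    width, height = size
    return num_images * height * width * channels

n_images = 4752  # total images reported in the sample output

for size in [(128, 128), (224, 224), (256, 256)]:
    gib = estimate_npy_bytes(n_images, size) / 2**30
    print(f"{size[0]}x{size[1]}: {gib:.2f} GiB")
```

At 256x256 this comes to roughly 0.87 GiB for the original images alone; augmented samples add to that in proportion.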

---

## How to Run

**Prerequisites:**
- Completed EDA script (manifest files exist)
- Optionally completed augmentation script

**Runtime:** 5-15 minutes depending on dataset size
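
Before starting the export, it can help to confirm that the prerequisites are actually in place. The sketch below checks for the input files this script reads and, if the clean manifest is present, counts image paths that no longer exist. The helper name `missing_files` is hypothetical; the file names and the `path` column are the ones used in the script.

```python
import csv
import os

REQUIRED = [
    "train_manifest.csv",
    "val_manifest.csv",
    "test_manifest.csv",
    "manifest_clean.csv",
    "classes.json",
]
OPTIONAL = ["augmented_train_images.npy", "augmented_train_labels.npy"]

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

print("Missing required inputs:", missing_files(REQUIRED) or "none")
print("Missing optional inputs:", missing_files(OPTIONAL) or "none")

# If the manifest exists, also check that every listed image is still on disk.
if os.path.exists("manifest_clean.csv"):
    with open("manifest_clean.csv", newline="") as f:
        image_paths = [row["path"] for row in csv.DictReader(f) if row.get("path")]
    missing_images = missing_files(image_paths)
    print(f"{len(missing_images)} of {len(image_paths)} image files are missing")
```

Running a check like this first avoids exporting an array in which unreadable images were replaced by placeholders.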


## Notes

### Preprocessing is Done Once

After exporting to NPY, you don't need to rerun the preprocessing steps unless:
- You add more images to the dataset
- You want to change augmentation strategy
- You need different train/val/test splits

### Reproducibility

The exported NPY files ensure:
- Same preprocessing across all experiments
- No variation in image loading
- Fair comparison between models
- Consistent results for your report

### Storage Considerations

Keep the original images even after exporting to NPY because:
- You might need different resolutions later
- NPY files can be regenerated if lost
- Original data is always valuable

### Label Encoding

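Labels are encoded as integers by scikit-learn's `LabelEncoder`, which sorts the class names alphabetically:
- Class 0 = first class alphabetically
- Class 1 = second class alphabetically
- etc.

Use `class_names.npy` to decode predictions back to readable names. Decoding is only correct if `class_names.npy` is stored in that same alphabetical order.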
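
Decoding amounts to indexing the class-name array with the predicted integer IDs. In this sketch the class names and probabilities are toy values; in practice `class_names` would come from `np.load("class_names.npy")` and `probs` from the model's output.

```python
import numpy as np

# Toy stand-ins for class_names.npy and a batch of model outputs.
class_names = np.array(["cardboard", "glass", "metal"])
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
])

pred_ids = probs.argmax(axis=1)      # integer class IDs, one per sample
pred_names = class_names[pred_ids]   # fancy indexing maps IDs to names

print(pred_names.tolist())  # ['glass', 'cardboard']
```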

In [1]:
"""
Notebook 03: Export Combined (Original + Augmented) Images to NPY
==================================================================
Combines original images with augmented minority class samples
"""

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
import json
import os

print("="*70)
print("EXPORTING BALANCED DATASET TO NPY FORMAT")
print("="*70)

# ==========================================
# 0) Configuration
# ==========================================
FINAL_SIZE = (256, 256)
print(f"\nTarget image size: {FINAL_SIZE[0]}×{FINAL_SIZE[1]}")

# ==========================================
# 1) Load Original Split Information
# ==========================================
print("\n[1/5] Loading original split information...")

train_df = pd.read_csv("train_manifest.csv")
val_df = pd.read_csv("val_manifest.csv")
test_df = pd.read_csv("test_manifest.csv")
df = pd.read_csv("manifest_clean.csv")

with open("classes.json") as f:
    class_names = json.load(f)

print(f"✓ Original data:")
print(f"  Total: {len(df)}")
print(f"  Train: {len(train_df)}")
print(f"  Val: {len(val_df)}")
print(f"  Test: {len(test_df)}")

# ==========================================
# 2) Load Augmented Data (if exists)
# ==========================================
print("\n[2/5] Loading augmented data...")

has_augmented = os.path.exists("augmented_train_images.npy")

if has_augmented:
    augmented_images = np.load("augmented_train_images.npy")
    augmented_labels = np.load("augmented_train_labels.npy", allow_pickle=True)
    print(f"✓ Loaded {len(augmented_images)} augmented images")
else:
    augmented_images = np.array([])
    augmented_labels = np.array([])
    print("⚠ No augmented data found - using original data only")

# ==========================================
# 3) Load and Resize Original Images
# ==========================================
print("\n[3/5] Loading and resizing ORIGINAL images...")

from sklearn.preprocessing import LabelEncoder

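le = LabelEncoder()
le.fit(class_names)
# LabelEncoder sorts the classes; reuse its order so index i always means class_names[i]
class_names = le.classes_.tolist()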

paths = df["path"].values
labels_encoded = le.transform(df["label"].values)
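original_images = []
failed_paths = []  # images that could not be read (stored as black placeholders)

for path in tqdm(paths, desc="Processing original images"):
    try:
        # The context manager closes the file handle once the image is decoded
        with Image.open(path) as img:
            img_resized = img.convert('RGB').resize(FINAL_SIZE, resample=Image.BILINEAR)
        img_array = np.array(img_resized, dtype=np.uint8)
        original_images.append(img_array)
    except Exception as e:
        print(f"\n⚠ Error loading {path}: {e}")
        failed_paths.append(path)
        # Keep a black placeholder so indices stay aligned with the manifest.
        # PIL sizes are (width, height); array shapes are (height, width, channels).
        original_images.append(np.zeros((FINAL_SIZE[1], FINAL_SIZE[0], 3), dtype=np.uint8))

original_images = np.stack(original_images, axis=0)
print(f"✓ Original images shape: {original_images.shape}")
if failed_paths:
    print(f"⚠ {len(failed_paths)} of {len(paths)} images failed to load and are all-black placeholders")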

# ==========================================
# 4) Create Balanced Training Split
# ==========================================
print("\n[4/5] Creating balanced training split...")

# Create index mapping
path_to_idx = {path: idx for idx, path in enumerate(df["path"].values)}

# Original train indices
train_indices_original = np.array([path_to_idx[path] for path in train_df["path"].values])

# Val and test indices (unchanged)
val_indices = np.array([path_to_idx[path] for path in val_df["path"].values])
test_indices = np.array([path_to_idx[path] for path in test_df["path"].values])

if has_augmented:
    # Combine original images with augmented ones
    all_images = np.concatenate([original_images, augmented_images], axis=0)
    
    # Encode augmented labels
    augmented_labels_encoded = le.transform(augmented_labels)
    all_labels = np.concatenate([labels_encoded, augmented_labels_encoded], axis=0)
    
    # Train indices = original train + new augmented indices
    augmented_start_idx = len(original_images)
    augmented_indices = np.arange(augmented_start_idx, augmented_start_idx + len(augmented_images))
    train_indices = np.concatenate([train_indices_original, augmented_indices])
    
    print(f"✓ Combined dataset:")
    print(f"  Original train: {len(train_indices_original)} images")
    print(f"  Augmented: {len(augmented_images)} images")
    print(f"  Total train: {len(train_indices)} images")
else:
    all_images = original_images
    all_labels = labels_encoded
    train_indices = train_indices_original
    print(f"✓ Using original data only (no augmentation)")

print(f"  Val: {len(val_indices)} images")
print(f"  Test: {len(test_indices)} images")
print(f"  Total: {len(all_images)} images")

# ==========================================
# 5) Save Everything
# ==========================================
print("\n[5/5] Saving final NPY files...")

np.save("images.npy", all_images)
np.save("labels.npy", all_labels.astype(np.int64))
np.save("split_train.npy", train_indices)
np.save("split_val.npy", val_indices)
np.save("split_test.npy", test_indices)
np.save("class_names.npy", np.array(class_names))
# Cast to a fixed-width string dtype so the file loads without allow_pickle=True
np.save("filepaths.npy", paths.astype(str))

print("✓ Saved files:")
print("  - images.npy (all images including augmented)")
print("  - labels.npy (all labels)")
print("  - split_train.npy (train indices - includes augmented)")
print("  - split_val.npy")
print("  - split_test.npy")
print("  - class_names.npy")
print("  - filepaths.npy")

# ==========================================
# 6) Verification
# ==========================================
print("\n" + "="*70)
print("VERIFICATION")
print("="*70)

train_labels = all_labels[train_indices]
val_labels = all_labels[val_indices]
test_labels = all_labels[test_indices]

print("\n✓ Class distribution per split:")
print(f"\n{'Class':<20} {'Train':>8} {'Val':>8} {'Test':>8}")
print("-" * 60)

for i, cls in enumerate(class_names):
    train_count = np.sum(train_labels == i)
    val_count = np.sum(val_labels == i)
    test_count = np.sum(test_labels == i)
    print(f"{cls:<20} {train_count:>8} {val_count:>8} {test_count:>8}")

print("-" * 60)
print(f"{'TOTAL':<20} {len(train_indices):>8} {len(val_indices):>8} {len(test_indices):>8}")

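if has_augmented:
    print("\n✓ Dataset combined with augmented samples and ready for training!")
else:
    print("\n✓ Dataset exported with original data only (no augmentation applied)")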
print("="*70)

EXPORTING BALANCED DATASET TO NPY FORMAT

Target image size: 256×256

[1/5] Loading original split information...
✓ Original data:
  Total: 4752
  Train: 3433
  Val: 606
  Test: 713

[2/5] Loading augmented data...
⚠ No augmented data found - using original data only

[3/5] Loading and resizing ORIGINAL images...


Processing original images:  42%|████▏     | 2000/4752 [00:00<00:00, 9906.47it/s]


⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Cardboard/Cardboard_395.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Cardboard/Cardboard_395.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Cardboard/Cardboard_341.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Cardboard/Cardboard_341.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Cardboard/Cardboard_207.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Cardboard/Cardboard_207.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/real

Processing original images:  63%|██████▎   | 2991/4752 [00:00<00:00, 5231.06it/s]


⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_345.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_345.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_78.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_78.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_177.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_177.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Paper/Paper_427.jpg: [Errno 2

Processing original images:  90%|████████▉ | 4256/4752 [00:00<00:00, 4584.76it/s]


⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plastic/Plastic_774.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plastic/Plastic_774.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plastic/Plastic_59.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plastic/Plastic_59.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plastic/Plastic_222.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plastic/Plastic_222.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Plast

Processing original images: 100%|██████████| 4752/4752 [00:00<00:00, 5098.08it/s]



⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Textile Trash/Textile Trash_68.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Textile Trash/Textile Trash_68.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Textile Trash/Textile Trash_277.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Textile Trash/Textile Trash_277.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Textile Trash/Textile Trash_212.jpg: [Errno 2] No such file or directory: '/home/user/Desktop/PR_GROUP_ASSIGNMENT/EN3150-Assignment-03/realwaste/realwaste-main/RealWaste/Textile Trash/Textile Trash_212.jpg'

⚠ Error loading /home/user/Desktop/PR_GROUP_