# Hyperspectral Material Classification - Google Colab Training

This notebook trains the material classification model on Google Colab GPU.

**Before running:**
1. Runtime → Change runtime type → GPU (A100 or V100 for Pro+, T4 for free tier)
2. Upload your data to Google Drive in folder: `dl-plastics-data`
3. Run cells in order

**Note:** All models are saved directly to Google Drive to prevent loss on disconnect!

## 1. Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [42]:
# Clone repository
!git clone https://github.com/PlugNawapong/my-ml-project.git
%cd my-ml-project
!pwd  # Verify we're in the right directory

Cloning into 'my-ml-project'...
remote: Enumerating objects: 103, done.[K
remote: Counting objects: 100% (8/8), done.[K
remote: Compressing objects: 100% (7/7), done.[K
remote: Total 103 (delta 1), reused 4 (delta 1), pack-reused 95 (from 1)[K
Receiving objects: 100% (103/103), 44.95 MiB | 10.97 MiB/s, done.
Resolving deltas: 100% (19/19), done.
/content/my-ml-project/my-ml-project/my-ml-project/my-ml-project/my-ml-project
/content/my-ml-project/my-ml-project/my-ml-project/my-ml-project/my-ml-project


In [None]:
# Install dependencies
!pip install -q torch torchvision tqdm Pillow numpy matplotlib scikit-learn

## 2. Mount Google Drive and Load Data

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create output directory in Google Drive (models will be saved here)
!mkdir -p /content/drive/MyDrive/dl-plastics-models
!mkdir -p /content/drive/MyDrive/dl-plastics-predictions

print('✓ Google Drive mounted')
print('✓ Output directories created')

In [None]:
# Copy data from Google Drive to Colab workspace
# ADJUST THE PATH to match your Google Drive folder structure

import os

# Path to your data in Google Drive
drive_data_path = '/content/drive/MyDrive/dl-plastics-data'

# Verify data exists
if not os.path.exists(drive_data_path):
    print(f'⚠ ERROR: {drive_data_path} not found!')
    print('Please upload your data to Google Drive first.')
else:
    print(f'✓ Data folder found: {drive_data_path}')
    !ls -la {drive_data_path}

# Copy training data
if os.path.exists(f'{drive_data_path}/data'):
    !cp -r {drive_data_path}/data ./
    print('✓ Training data copied')
else:
    print('⚠ Training data not found at {drive_data_path}/data')

# Copy inference datasets
if os.path.exists(f'{drive_data_path}/inference_data_set1'):
    !cp -r {drive_data_path}/inference_data_set1 ./
    print('✓ Inference dataset 1 copied')

if os.path.exists(f'{drive_data_path}/inference_data_set2'):
    !cp -r {drive_data_path}/inference_data_set2 ./
    print('✓ Inference dataset 2 copied')

# Verify data is copied
print('\n=== Files in workspace ===')
!ls -la

## 3. Inspect Data (Optional but Recommended)

In [None]:
# Inspect training data
!python inspect_data.py --data_dir data

# Display generated plots
from IPython.display import Image, display

print('\n=== Band Visualization ===')
display(Image('data_inspection_bands.png'))

print('\n=== Label Visualization ===')
display(Image('data_inspection_labels.png'))

print('\n=== Raw Spectral Signatures ===')
display(Image('data_inspection_spectra_raw.png'))

print('\n=== Normalized Spectral Signatures (Used in Training) ===')
display(Image('data_inspection_spectra_normalized.png'))

## 4. Train Model (Saved to Google Drive)

**Choose one of the training options below:**

All models will be automatically saved to: `/content/drive/MyDrive/dl-plastics-models/`

### Option A: Fast 1D CNN with Spectral Augmentation (RECOMMENDED)

In [None]:
# Fast 1D CNN with spectral augmentation for better generalization
!python train.py \
    --model spectral_cnn_1d \
    --epochs 100 \
    --batch_size 4096 \
    --max_samples_per_class 20000 \
    --dropout 0.6 \
    --lr 0.001 \
    --spectral_augment medium \
    --norm_method percentile \
    --output_dir /content/drive/MyDrive/dl-plastics-models \
    --num_workers 0

print('\n✓ Training complete! Model saved to Google Drive.')

### Option B: Ultra-Fast with 4x4 Binning (For Quick Experiments)

In [None]:
# Ultra-fast training with binning (16x fewer pixels)
!python train.py \
    --model spectral_cnn_1d \
    --epochs 150 \
    --batch_size 8192 \
    --max_samples_per_class 10000 \
    --dropout 0.6 \
    --bin_factor 4 \
    --spectral_augment heavy \
    --norm_method percentile \
    --output_dir /content/drive/MyDrive/dl-plastics-models \
    --num_workers 0

print('\n✓ Training complete! Model saved to Google Drive.')

### Option C: 2D CNN with Spatial Patches

In [None]:
!python train.py \
    --model spectral_cnn_2d \
    --use_patches \
    --patch_size 3 \
    --epochs 100 \
    --batch_size 1024 \
    --max_samples_per_class 10000 \
    --dropout 0.6 \
    --bin_factor 2 \
    --norm_method percentile \
    --output_dir /content/drive/MyDrive/dl-plastics-models \
    --num_workers 0

print('\n✓ Training complete! Model saved to Google Drive.')

### Option D: Hybrid Model (Best Accuracy)

In [31]:
!python train.py \
    --model hybrid \
    --use_patches \
    --patch_size 5 \
    --epochs 150 \
    --batch_size 512 \
    --max_samples_per_class 10000 \
    --dropout 0.6 \
    --augment \
    --bin_factor 2 \
    --norm_method percentile \
    --output_dir /content/drive/MyDrive/dl-plastics-models \
    --num_workers 0

print('\n✓ Training complete! Model saved to Google Drive.')

Using device: cuda

NORMALIZATION CHECK

Visualizing normalization for training data...
  ✓ Saved normalization check: /content/drive/MyDrive/dl-plastics-models/hybrid_20251004_124105/normalization_check_training_data.png
  Raw mean intensity: 18.5, Normalized mean: 22.2

Loading dataset from data...
Training samples: 88000, Validation samples: 22000
Creating model: hybrid
Model parameters: 181,995

Starting training for 150 epochs...

Epoch 1/150
Training: 100% 172/172 [00:07<00:00, 22.85it/s, loss=0.636, acc=63.7]
Train Loss: 0.9865, Train Acc: 63.71%
Validation: 100% 43/43 [00:02<00:00, 16.09it/s, loss=0.516, acc=79.9]
Val Loss: 0.5386, Val Acc: 79.92%
Per-class accuracy:
  Background: 91.21%
  95PU: 56.04%
  HIPS: 91.86%
  HVDF-HFP: 98.91%
  GPSS: 87.20%
  PU: 70.16%
  75PU: 87.63%
  85PU: 60.44%
  PETE: 88.31%
  PET: 53.42%
  PMMA: 92.73%
Best model saved with validation accuracy: 79.92%

Epoch 2/150
Training: 100% 172/172 [00:07<00:00, 24.34it/s, loss=0.503, acc=80.5]
Train Loss:

## 5. Find and Load Trained Model

In [40]:
# Find the latest trained model from Google Drive
import glob
import os
import re

# Search for all models in Google Drive
model_files = glob.glob('/content/drive/MyDrive/dl-plastics-models/*/best_model.pth')

print(f'Found {len(model_files)} model(s):')
for m in model_files:
    print(f'  - {m}')

if model_files:
    # Sort by timestamp extracted from path (YYYYMMDD_HHMMSS)
    def extract_timestamp(path):
        # Extract timestamp from path like: .../model_name_20251004_124105/...
        match = re.search(r'(\d{8}_\d{6})', path)
        if match:
            return match.group(1)
        return '00000000_000000'  # fallback for paths without timestamp

    latest_model = sorted(model_files, key=extract_timestamp)[-1]
    print(f'\n✓ Using latest model: {latest_model}')

    # Extract model type from path
    if 'spectral_cnn_1d' in latest_model:
        model_type = 'spectral_cnn_1d'
    elif 'spectral_cnn_2d' in latest_model:
        model_type = 'spectral_cnn_2d'
    elif 'hybrid' in latest_model:
        model_type = 'hybrid'
    elif 'resnet' in latest_model:
        model_type = 'resnet'
    else:
        model_type = 'spectral_cnn_1d'  # default

    print(f'Model type: {model_type}')
    print(f'Timestamp: {extract_timestamp(latest_model)}')
else:
    print('\n⚠ No trained model found!')
    print('Please run a training cell first.')

Found 4 model(s):
  - /content/drive/MyDrive/dl-plastics-models/spectral_cnn_1d_20251004_004340/best_model.pth
  - /content/drive/MyDrive/dl-plastics-models/spectral_cnn_1d_20251004_011920/best_model.pth
  - /content/drive/MyDrive/dl-plastics-models/spectral_cnn_1d_20251004_114550/best_model.pth
  - /content/drive/MyDrive/dl-plastics-models/hybrid_20251004_124105/best_model.pth

✓ Using latest model: /content/drive/MyDrive/dl-plastics-models/hybrid_20251004_124105/best_model.pth
Model type: hybrid
Timestamp: 20251004_124105


## 6. Run Inference on Both Datasets

In [44]:
# Run inference on inference_data_set1
if model_files:
    # Determine if patches are needed
    use_patches_flag = '--use_patches' if model_type in ['spectral_cnn_2d', 'hybrid', 'resnet'] else ''

    print(f'Running inference on inference_data_set1...')
    !python inference.py \
        --checkpoint {latest_model} \
        --model {model_type} \
        {use_patches_flag} \
        --data_dir /content/my-ml-project/inference_data_set1 \
        --norm_method percentile \
        --output_dir /content/drive/MyDrive/dl-plastics-predictions

    print('\n✓ Inference complete for dataset 1!')
else:
    print('⚠ No model available for inference')

Running inference on inference_data_set1...
Using device: cuda
Loading model from: /content/drive/MyDrive/dl-plastics-models/hybrid_20251004_124105/best_model.pth
Model loaded successfully
Found 1 dataset(s) to process

Processing: inference_data_set1

NORMALIZATION CHECK

Visualizing normalization for inference_data_set1...
  ✓ Saved normalization check: /content/drive/MyDrive/dl-plastics-predictions/normalization_check_inference_data_set1.png
  Raw mean intensity: 18.5, Normalized mean: 22.2

Predicting: 100% 9611/9611 [02:47<00:00, 57.33it/s]
Visualization saved to: /content/drive/MyDrive/dl-plastics-predictions/inference_data_set1/prediction_visualization.png
Results saved to: /content/drive/MyDrive/dl-plastics-predictions/inference_data_set1
Mean confidence: 0.9448

Class distribution:
  Background: 67.28% (1655136 pixels, conf: 0.9494)
  95PU: 3.37% (82977 pixels, conf: 0.8721)
  HIPS: 4.21% (103566 pixels, conf: 0.8695)
  HVDF-HFP: 4.05% (99660 pixels, conf: 0.9755)
  GPSS: 3.07

In [None]:
# Run inference on inference_data_set2 with industry-standard post-processing
if model_files:
    use_patches_flag = '--use_patches' if model_type in ['spectral_cnn_2d', 'hybrid', 'resnet'] else ''
    
    print(f'Running inference on inference_data_set2...')
    !python inference.py \
        --checkpoint {latest_model} \
        --model {model_type} \
        {use_patches_flag} \
        --data_dir /content/my-ml-project/inference_data_set2 \
        --norm_method percentile \
        --post_process \
        --min_region_size 50 \
        --smooth_sigma 0.8 \
        --show_edges \
        --output_dir /content/drive/MyDrive/dl-plastics-predictions
    
    print('\n✓ Inference complete for dataset 2!')
    print('\nGenerated visualizations:')
    print('  - prediction_visualization.png (raw)')
    print('  - prediction_filtered_visualization.png (cleaned with industry-standard post-processing)')
    print('  - prediction_enhanced_labeled.png (with edges, legend only)')
else:
    print('⚠ No model available for inference')

## 7. Visualize Results

In [None]:
# Display prediction visualizations
from IPython.display import Image, display
import json

# Dataset 1
print('\n' + '='*80)
print('INFERENCE DATA SET 1 RESULTS')
print('='*80)

pred_path_1 = '/content/drive/MyDrive/dl-plastics-predictions/inference_data_set1/prediction_visualization.png'
stats_path_1 = '/content/drive/MyDrive/dl-plastics-predictions/inference_data_set1/statistics.json'

if os.path.exists(pred_path_1):
    display(Image(pred_path_1))

    with open(stats_path_1, 'r') as f:
        stats1 = json.load(f)

    print(f"\nMean Confidence: {stats1['mean_confidence']:.4f}")
    print("\nClass Distribution:")
    for class_name, class_stats in stats1['class_distribution'].items():
        if class_stats['percentage'] > 0.01:  # Only show classes with >0.01%
            print(f"  {class_name:<15}: {class_stats['percentage']:>6.2f}% (conf: {class_stats['mean_confidence']:.4f})")
else:
    print('⚠ Results not found. Run inference first.')

# Dataset 2
print('\n' + '='*80)
print('INFERENCE DATA SET 2 RESULTS')
print('='*80)

pred_path_2 = '/content/drive/MyDrive/dl-plastics-predictions/inference_data_set2/prediction_visualization.png'
stats_path_2 = '/content/drive/MyDrive/dl-plastics-predictions/inference_data_set2/statistics.json'

if os.path.exists(pred_path_2):
    display(Image(pred_path_2))

    with open(stats_path_2, 'r') as f:
        stats2 = json.load(f)

    print(f"\nMean Confidence: {stats2['mean_confidence']:.4f}")
    print("\nClass Distribution:")
    for class_name, class_stats in stats2['class_distribution'].items():
        if class_stats['percentage'] > 0.01:
            print(f"  {class_name:<15}: {class_stats['percentage']:>6.2f}% (conf: {class_stats['mean_confidence']:.4f})")
else:
    print('⚠ Results not found. Run inference first.')

## 8. Download Results (Optional)

**Note:** Results are already saved to Google Drive, but you can download them to your computer if needed.

In [None]:
# Copy results from Google Drive to local workspace for download
!cp -r /content/drive/MyDrive/dl-plastics-models ./models-backup
!cp -r /content/drive/MyDrive/dl-plastics-predictions ./predictions-backup

# Zip all results
!zip -r results.zip models-backup/ predictions-backup/

# Download to your computer
from google.colab import files
files.download('results.zip')

print('\n✓ Results downloaded!')
print('  - models-backup/ : Trained models and training history')
print('  - predictions-backup/ : Inference results and visualizations')

## 9. Summary

**Your models and results are saved in Google Drive:**
- Models: `/MyDrive/dl-plastics-models/`
- Predictions: `/MyDrive/dl-plastics-predictions/`

You can access them anytime, even after this Colab session ends!