# Promoter CNN with Best Hyperparameters

This notebook trains the Promoter CNN with the best configuration found during hyperparameter tuning, loaded from `best_hyperparameters.json`. It reuses the project's `src/` modules for data loading, training, evaluation, and device management.

## Key Features
- **Optimal hyperparameters** from comprehensive tuning results
- **One-hot encoding** of DNA sequences (A, T, G, C, N)
- **Optimized CNN architecture** with best depth and channel configuration
- **Advanced training setup** with optimal optimizer, scheduler, and loss function
- **Device-agnostic training** (CUDA/MPS/CPU)
- **Comprehensive evaluation** and visualization

## Hyperparameter Configuration
Based on `best_hyperparameters.json`:
- **Architecture**: 4-layer depth, 64 base channels, 0.3 dropout
- **Training**: AdamW optimizer, 0.0005 learning rate, 5e-05 weight decay
- **Advanced**: Cosine scheduler, KL-divergence loss, gradient clipping
- **Batch size**: 64, Max epochs: 50, Early stopping: 15 patience

## Table of Contents
1. [Environment Setup](#Environment-Setup)
2. [Hyperparameter Loading and Configuration](#Hyperparameter-Loading-and-Configuration)
3. [Data Loading and Preprocessing](#Data-Loading-and-Preprocessing)
4. [Model Architecture](#Model-Architecture)
5. [Training Configuration](#Training-Configuration)
6. [Training Loop](#Training-Loop)
7. [Evaluation and Results](#Evaluation-and-Results)
8. [Visualization](#Visualization)
9. [Model Saving](#Model-Saving)


## Environment Setup

First, let's import all necessary libraries and modules from our codebase.


In [None]:
# Standard libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import time
from typing import Dict, List, Tuple, Any
import warnings
warnings.filterwarnings('ignore')

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torch.optim import Adam, AdamW, SGD, RMSprop
from torch.optim.lr_scheduler import ReduceLROnPlateau, CosineAnnealingLR, StepLR, ExponentialLR

# Sklearn for metrics
from sklearn.metrics import r2_score, mean_squared_error

# Add project root to path
project_root = Path().resolve().parent.parent
sys.path.append(str(project_root))

print(f"Project root: {project_root}")
print("Python path updated")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if hasattr(torch.backends, 'mps'):
    print(f"MPS available: {torch.backends.mps.is_available()}")
else:
    print("MPS not available")


Project root: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression
Python path updated
PyTorch version: 2.5.1
CUDA available: False
MPS available: True


In [None]:
# Import custom modules from our codebase
try:
    # Core model and data utilities
    from src.models.cnn.model import PromoterCNN
    from src.utils.data import PromoterDataset, load_and_prepare_data
    
    # Training and evaluation utilities
    from src.utils.training import train_epoch, validate_epoch, evaluate_model
    from src.utils.device import create_device_manager

    from improved_hyperparameter_tuning import EnhancedPromoterCNN
    
    # Visualization utilities
    from src.utils.viz import plot_results
    
    print("✅ Successfully imported all custom modules")
    print("   📁 CNN Model: PromoterCNN")
    print("   📁 Data Utils: PromoterDataset, load_and_prepare_data")
    print("   📁 Training Utils: train_epoch, validate_epoch, evaluate_model")
    print("   📁 Device Utils: create_device_manager")
    print("   📁 Visualization: plot_results")
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Please ensure the project structure is correct and src/ modules are available")
    raise  # Stop here: every later cell depends on these imports


✅ Successfully imported all custom modules
   📁 CNN Model: PromoterCNN
   📁 Data Utils: PromoterDataset, load_and_prepare_data
   📁 Training Utils: train_epoch, validate_epoch, evaluate_model
   📁 Device Utils: create_device_manager
   📁 Visualization: plot_results


## Hyperparameter Loading and Configuration

Load the best hyperparameters from the JSON file and set up the experiment configuration.
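
Every later cell indexes `params` directly, so a missing key would only surface as a `KeyError` partway through the notebook. A minimal sketch of an up-front check follows. `REQUIRED_KEYS` and `validate_params` are hypothetical names introduced for illustration; the key list mirrors the fields this notebook reads, and the example values are the ones printed by the loading cell.

```python
import json

# Hypothetical helper: the key list mirrors the fields read in this notebook.
REQUIRED_KEYS = {
    "depth", "base_channels", "dropout", "num_classes",
    "optimizer", "learning_rate", "weight_decay", "batch_size",
    "max_epochs", "early_stopping_patience",
    "loss_function", "scheduler", "gradient_clipping", "label_smoothing",
}


def validate_params(params: dict) -> dict:
    """Raise early if the loaded configuration is missing a required key."""
    missing = sorted(REQUIRED_KEYS - set(params))
    if missing:
        raise KeyError(f"Missing hyperparameters: {missing}")
    # max_grad_norm only matters when clipping is switched on
    if params["gradient_clipping"] and "max_grad_norm" not in params:
        raise KeyError("gradient_clipping is enabled but max_grad_norm is missing")
    return params


# Example values copied from the configuration printed by this notebook.
example = json.loads("""
{
  "depth": 4, "base_channels": 64, "dropout": 0.3, "num_classes": 5,
  "optimizer": "adamw", "learning_rate": 0.0005, "weight_decay": 5e-05,
  "batch_size": 64, "max_epochs": 50, "early_stopping_patience": 15,
  "loss_function": "kldiv", "scheduler": "cosine",
  "gradient_clipping": true, "max_grad_norm": 1.0, "label_smoothing": 0.0
}
""")
validate_params(example)
print("Configuration looks complete")
```

Calling `validate_params(params)` right after `json.load` would turn a late failure into an immediate, readable one.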


In [12]:
# Load best hyperparameters from JSON file
print("⚙️  Loading best hyperparameters from JSON...")

# Load the best hyperparameters
best_hyperparams_path = project_root / "results" / "analysis" / "best_hyperparameters.json"

if not best_hyperparams_path.exists():
    raise FileNotFoundError(f"Best hyperparameters file not found: {best_hyperparams_path}")

with open(best_hyperparams_path, 'r') as f:
    params = json.load(f)

print(f"✅ Loaded best hyperparameters from: {best_hyperparams_path}")

# Display the loaded hyperparameters
print(f"\n🏗️  Architecture Configuration:")
print(f"   Depth (conv blocks): {params['depth']}")
print(f"   Base channels: {params['base_channels']}")
print(f"   Dropout: {params['dropout']}")
print(f"   Number of classes: {params['num_classes']}")

print(f"\n🎓 Training Configuration:")
print(f"   Optimizer: {params['optimizer']}")
print(f"   Learning rate: {params['learning_rate']}")
print(f"   Weight decay: {params['weight_decay']}")
print(f"   Batch size: {params['batch_size']}")
print(f"   Max epochs: {params['max_epochs']}")
print(f"   Early stopping patience: {params['early_stopping_patience']}")

print(f"\n📊 Advanced Configuration:")
print(f"   Loss function: {params['loss_function']}")
print(f"   Scheduler: {params['scheduler']}")
print(f"   Gradient clipping: {params['gradient_clipping']}")
if params['gradient_clipping']:
    print(f"   Max grad norm: {params['max_grad_norm']}")
print(f"   Label smoothing: {params['label_smoothing']}")

# Experiment configuration
EXPERIMENT_NAME = "promoter_cnn_best_hyperparams"
SEQUENCE_LENGTH = 600
RANDOM_SEED = 42

print(f"\n🔧 Experiment Setup:")
print(f"   Name: {EXPERIMENT_NAME}")
print(f"   Sequence Length: {SEQUENCE_LENGTH}")
print(f"   Random Seed: {RANDOM_SEED}")


⚙️  Loading best hyperparameters from JSON...
✅ Loaded best hyperparameters from: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression/results/analysis/best_hyperparameters.json

🏗️  Architecture Configuration:
   Depth (conv blocks): 4
   Base channels: 64
   Dropout: 0.3
   Number of classes: 5

🎓 Training Configuration:
   Optimizer: adamw
   Learning rate: 0.0005
   Weight decay: 5e-05
   Batch size: 64
   Max epochs: 50
   Early stopping patience: 15

📊 Advanced Configuration:
   Loss function: kldiv
   Scheduler: cosine
   Gradient clipping: True
   Max grad norm: 1.0
   Label smoothing: 0.0

🔧 Experiment Setup:
   Name: promoter_cnn_best_hyperparams
   Sequence Length: 600
   Random Seed: 42


In [13]:
# Setup paths and device management
DATA_PATH = project_root / "data" / "processed" / "ProSeq_with_5component_analysis.csv"
OUTPUT_DIR = project_root / "results"
MODEL_WEIGHTS_DIR = OUTPUT_DIR / "model_weights"
PLOTS_DIR = OUTPUT_DIR / "plots"

# Create output directories
MODEL_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

# Set random seeds for reproducibility
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(RANDOM_SEED)

print(f"📁 Paths Configuration:")
print(f"   Data Path: {DATA_PATH}")
print(f"   Output Directory: {OUTPUT_DIR}")
print(f"   Model Weights: {MODEL_WEIGHTS_DIR}")
print(f"   Plots: {PLOTS_DIR}")

# Device setup using our custom device manager
print(f"\n🖥️  Setting up device management...")
device_manager = create_device_manager(prefer_cuda=True, verbose=True)

# Get device information
device = device_manager.device
device_name = device_manager.device_name
loader_kwargs = device_manager.get_dataloader_kwargs()

print(f"\n📱 Device Configuration:")
print(f"   Device: {device}")
print(f"   Device Name: {device_name}")
print(f"   DataLoader kwargs: {loader_kwargs}")

# Display memory info if available
memory_info = device_manager.get_memory_info()
if memory_info['total'] > 0:
    print(f"   Memory Total: {memory_info['total']:.1f} GB")
    print(f"   Memory Free: {memory_info['free']:.1f} GB")


📁 Paths Configuration:
   Data Path: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression/data/processed/ProSeq_with_5component_analysis.csv
   Output Directory: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression/results
   Model Weights: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression/results/model_weights
   Plots: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression/results/plots

🖥️  Setting up device management...

📱 Device Configuration:
   Device: mps
   Device Name: mps
   DataLoader kwargs: {'num_workers': 0}
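
The device manager comes from `src.utils.device`, whose source is not shown here. As a rough sketch of what CUDA/MPS/CPU fallback usually looks like in plain PyTorch, the hypothetical `pick_device` helper below prefers CUDA, then Apple's MPS backend, then the CPU. It is an illustration under those assumptions, not the project's implementation.

```python
import torch


def pick_device(prefer_cuda: bool = True) -> torch.device:
    """Hypothetical fallback chain: CUDA, then MPS, then CPU."""
    if prefer_cuda and torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
batch = torch.zeros(2, 5, 600, device=device)  # same layout as the encoded sequences
print(f"Selected device: {device}, tensor lives on: {batch.device}")
```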


## Data Loading and Preprocessing

Load the promoter sequence data and create datasets with automatic one-hot encoding.
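
`PromoterDataset` performs the encoding internally and its source is not shown in this notebook. The sketch below illustrates one plausible scheme that produces the same `(5, 600)` layout. The channel order `A, T, G, C, N`, the mapping of unknown characters to `N`, and the padding of short sequences with `N` are all assumptions made for illustration; `one_hot_encode` is a hypothetical helper.

```python
import numpy as np

ALPHABET = "ATGCN"  # assumed channel order, not confirmed by the project code
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot_encode(seq: str, max_length: int = 600) -> np.ndarray:
    """Encode a DNA string as a (channels, length) array for Conv1d input."""
    encoded = np.zeros((len(ALPHABET), max_length), dtype=np.float32)
    for pos, base in enumerate(seq.upper()[:max_length]):
        # Characters outside A/T/G/C fall back to the N channel (assumption)
        encoded[INDEX.get(base, INDEX["N"]), pos] = 1.0
    # Sequences shorter than max_length are padded with N (assumption)
    encoded[INDEX["N"], len(seq):] = 1.0
    return encoded


example = one_hot_encode("AAGCTGCACAGT", max_length=16)
print(example.shape)        # (5, 16)
print(example.sum(axis=0))  # every position has exactly one active channel
```

The channels-first layout matters because `nn.Conv1d` expects input of shape `(batch, channels, length)`.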


In [14]:
# Load data using our custom data loading function
print("📊 Loading promoter sequence data...")
print(f"   Data path: {DATA_PATH}")

if not DATA_PATH.exists():
    raise FileNotFoundError(f"Data file not found: {DATA_PATH}")

sequences, targets = load_and_prepare_data(str(DATA_PATH))

print(f"\n📈 Data Overview:")
print(f"   Total sequences: {len(sequences)}")
print(f"   Target shape: {targets.shape}")
print(f"   Sequence length range: {min(len(seq) for seq in sequences)} - {max(len(seq) for seq in sequences)}")

# Display data statistics
print(f"\n📊 Target Statistics:")
for i in range(targets.shape[1]):
    print(f"   Component {i+1}: mean={targets[:, i].mean():.3f}, std={targets[:, i].std():.3f}")

# Show example sequence
print(f"\n🧬 Example DNA Sequence (first 100 bp):")
print(f"   {sequences[0][:100]}...")
print(f"   Length: {len(sequences[0])} bp")
print(f"   Targets: {targets[0]}")

# Demonstrate one-hot encoding
print(f"\n🔢 One-Hot Encoding Demonstration:")
sample_dataset = PromoterDataset([sequences[0]], targets[:1], max_length=SEQUENCE_LENGTH)
sample_batch = sample_dataset[0]

encoded_sequence = sample_batch["sequence"]
target_probs = sample_batch["target"]

print(f"   Encoded sequence shape: {encoded_sequence.shape}")
print(f"   Expected: (5, {SEQUENCE_LENGTH}) for Conv1d input")
print(f"   Target shape: {target_probs.shape}")
print(f"   Target sum: {target_probs.sum():.6f} (should be ~1.0)")

print("✅ Data loading and encoding validation complete")


📊 Loading promoter sequence data...
   Data path: /Users/jaydenthai/Dev/University Work/2025sem2/Data Science Research Project /DS-Research-Project-Tumor-Expression/data/processed/ProSeq_with_5component_analysis.csv

📈 Data Overview:
   Total sequences: 8304
   Target shape: (8304, 5)
   Sequence length range: 472 - 600

📊 Target Statistics:
   Component 1: mean=0.088, std=0.250
   Component 2: mean=0.320, std=0.376
   Component 3: mean=0.115, std=0.274
   Component 4: mean=0.214, std=0.313
   Component 5: mean=0.264, std=0.381

🧬 Example DNA Sequence (first 100 bp):
   AAGCTGCACAGTCGAGCCTGCGGCTCCGCAGCCGAATAGAGCGGAAATGCCCTCTCAGGGCATCAAAGAGCAACAAGCTGCCACTGTAAGAGGGGCCCAG...
   Length: 600 bp
   Targets: [6.32430892e-10 1.75677897e-04 2.10593641e-01 4.88930448e-07
 7.89230191e-01]

🔢 One-Hot Encoding Demonstration:
   Encoded sequence shape: torch.Size([5, 600])
   Expected: (5, 600) for Conv1d input
   Target shape: torch.Size([5])
   Target sum: 1.000000 (should be ~1.0)
✅ Data loading and encoding validation complete

In [15]:
# Create train/validation/test splits
print("🔀 Creating train/validation/test splits...")

# Split ratios
train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15

# Calculate split sizes
total_size = len(sequences)
train_size = int(train_ratio * total_size)
val_size = int(val_ratio * total_size)
test_size = total_size - train_size - val_size

print(f"   Total samples: {total_size}")
print(f"   Train: {train_size} ({train_ratio:.1%})")
print(f"   Validation: {val_size} ({val_ratio:.1%})")
print(f"   Test: {test_size} ({test_ratio:.1%})")

# Create full dataset first
full_dataset = PromoterDataset(sequences, targets, max_length=SEQUENCE_LENGTH)

# Split the dataset
train_dataset, temp_dataset = random_split(
    full_dataset, 
    [train_size, val_size + test_size],
    generator=torch.Generator().manual_seed(RANDOM_SEED)
)

val_dataset, test_dataset = random_split(
    temp_dataset, 
    [val_size, test_size],
    generator=torch.Generator().manual_seed(RANDOM_SEED)
)

print(f"\n✅ Dataset splits created:")
print(f"   Train dataset: {len(train_dataset)} samples")
print(f"   Validation dataset: {len(val_dataset)} samples")
print(f"   Test dataset: {len(test_dataset)} samples")


🔀 Creating train/validation/test splits...
   Total samples: 8304
   Train: 5812 (70.0%)
   Validation: 1245 (15.0%)
   Test: 1247 (15.0%)

✅ Dataset splits created:
   Train dataset: 5812 samples
   Validation dataset: 1245 samples
   Test dataset: 1247 samples
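
The test split ends up two samples larger than the validation split even though both ratios are 0.15. The reason is that `int()` truncates the train and validation sizes, and the test split absorbs the remainder. The arithmetic, using the sample count reported above:

```python
total_size = 8304  # number of sequences reported by the data-loading cell

train_size = int(0.7 * total_size)   # 5812.8 is truncated to 5812
val_size = int(0.15 * total_size)    # 1245.6 is truncated to 1245
test_size = total_size - train_size - val_size  # remainder goes to the test split

print(train_size, val_size, test_size)  # 5812 1245 1247
```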


## Model Architecture

Create the PromoterCNN model using the optimal hyperparameters from the JSON configuration.


In [16]:
# Create PromoterCNN model using best hyperparameters
print("🚀 Creating PromoterCNN model with optimal hyperparameters...")

# Create model using the loaded parameters
model = PromoterCNN(
    sequence_length=SEQUENCE_LENGTH, 
    num_blocks=params['depth'], 
    base_channels=params['base_channels'], 
    dropout=params['dropout'], 
    num_classes=params['num_classes']
)

print(f"✅ Model created with optimal configuration:")
print(f"   Architecture: {params['depth']} blocks, {params['base_channels']} base channels")
print(f"   Dropout: {params['dropout']}")
print(f"   Sequence length: {SEQUENCE_LENGTH}")
print(f"   Output classes: {params['num_classes']}")

# Move model to device
model = device_manager.create_model_wrapper(model)

# Display model summary
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n📋 Model Summary:")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Model device: {next(model.parameters()).device}")

# Print model architecture
print(f"\n🏗️  Model Architecture:")
print(model)


🚀 Creating PromoterCNN model with optimal hyperparameters...
✅ Model created with optimal configuration:
   Architecture: 4 blocks, 64 base channels
   Dropout: 0.3
   Sequence length: 600
   Output classes: 5

📋 Model Summary:
   Total parameters: 32,645
   Trainable parameters: 32,645
   Model device: mps:0

🏗️  Model Architecture:
PromoterCNN(
  (sequence_conv): Sequential(
    (0): Conv1d(5, 64, kernel_size=(11,), stride=(1,), padding=(5,))
    (1): ReLU()
    (2): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
    (3): Dropout(p=0.3, inplace=False)
    (4): Conv1d(64, 64, kernel_size=(7,), stride=(1,), padding=(3,))
    (5): ReLU()
    (6): Dropout(p=0.3, inplace=False)
    (7): AdaptiveAvgPool1d(output_size=1)
  )
  (classifier): Linear(in_features=64, out_features=5, bias=True)
)
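
The reported parameter count and the tensor lengths can be checked by hand from the printed modules. The sketch below uses the standard formulas for `Conv1d`, `MaxPool1d`, and `Linear`; the helper names are hypothetical. The printout lists two `Conv1d` layers although `depth` is 4; how `num_blocks` maps to layers is decided inside `PromoterCNN`, which is not shown in this notebook, so the sketch follows the printed modules.

```python
def conv1d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel + out_ch


def linear_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features


def conv1d_out_len(length: int, kernel: int, padding: int, stride: int = 1) -> int:
    return (length + 2 * padding - kernel) // stride + 1


def maxpool_out_len(length: int, kernel: int, stride: int) -> int:
    return (length - kernel) // stride + 1


# Parameter count, layer by layer, as listed in the printed architecture
total = (
    conv1d_params(5, 64, 11)    # first Conv1d: 3,584
    + conv1d_params(64, 64, 7)  # second Conv1d: 28,736
    + linear_params(64, 5)      # classifier: 325
)
print(f"Total parameters: {total:,}")

# Sequence length through the convolutional stack
length = conv1d_out_len(600, kernel=11, padding=5)    # padding keeps 600
length = maxpool_out_len(length, kernel=4, stride=4)  # 600 -> 150
length = conv1d_out_len(length, kernel=7, padding=3)  # padding keeps 150
print(f"Length before AdaptiveAvgPool1d: {length}")
```

The total agrees with the 32,645 parameters in the model summary, and `AdaptiveAvgPool1d(1)` then collapses the remaining 150 positions to a single 64-dimensional vector for the classifier.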


In [17]:
# Create data loaders with optimal batch size
print("🔄 Creating data loaders with optimal batch size...")

# Use the optimal batch size from hyperparameters
dataloader_kwargs = device_manager.get_dataloader_kwargs(
    batch_size=params['batch_size'],
    shuffle=True,
    drop_last=True
)

train_loader = DataLoader(train_dataset, **dataloader_kwargs)

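# Validation and test loaders don't need shuffling
# NOTE: the copy also inherits drop_last=True, so the final partial batch of each
# evaluation set is skipped (19 batches instead of 20 in the output below).
# Set val_kwargs['drop_last'] = False to score every validation and test sample.
val_kwargs = dataloader_kwargs.copy()
val_kwargs['shuffle'] = False
val_loader = DataLoader(val_dataset, **val_kwargs)
test_loader = DataLoader(test_dataset, **val_kwargs)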

print(f"✅ Data loaders created with optimal configuration:")
print(f"   Train loader: {len(train_loader)} batches")
print(f"   Validation loader: {len(val_loader)} batches")
print(f"   Test loader: {len(test_loader)} batches")
print(f"   Batch size: {params['batch_size']} (optimized)")

# Test a batch to ensure everything works
print(f"\n🧪 Testing data loading and model forward pass...")
sample_batch = next(iter(train_loader))
sample_sequences = sample_batch["sequence"]
sample_targets = sample_batch["target"]

print(f"   Batch sequences shape: {sample_sequences.shape}")
print(f"   Batch targets shape: {sample_targets.shape}")

# Test model forward pass
with torch.no_grad():
    sample_sequences = device_manager.to_device(sample_sequences)
    sample_output = model(sample_sequences)
    print(f"   Model output shape: {sample_output.shape}")
    print(f"   Output device: {sample_output.device}")

print("✅ Data loading and model forward pass successful!")


🔄 Creating data loaders with optimal batch size...
✅ Data loaders created with optimal configuration:
   Train loader: 90 batches
   Validation loader: 19 batches
   Test loader: 19 batches
   Batch size: 64 (optimized)

🧪 Testing data loading and model forward pass...
   Batch sequences shape: torch.Size([64, 5, 600])
   Batch targets shape: torch.Size([64, 5])
   Model output shape: torch.Size([64, 5])
   Output device: mps:0
✅ Data loading and model forward pass successful!


## Training Configuration

Set up optimizer, loss function, and scheduler using the optimal hyperparameters.
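
For the cosine scheduler, PyTorch documents `CosineAnnealingLR` with the closed form `eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T_max))`, where `eta_min` defaults to 0. The sketch below evaluates that formula with this notebook's learning rate and `max_epochs` to show how the rate decays; `cosine_annealing_lr` is a hypothetical helper, not part of PyTorch.

```python
import math


def cosine_annealing_lr(epoch: int, base_lr: float = 5e-4,
                        t_max: int = 50, eta_min: float = 0.0) -> float:
    """Closed-form learning rate after `epoch` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 10, 25, 40, 50):
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

Because `T_max` equals `max_epochs`, the rate only reaches its minimum at the final epoch; a run that stops early ends at a higher learning rate.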


In [18]:
# Set up optimizer based on optimal hyperparameters
print(f"🎯 Setting up optimizer: {params['optimizer']}")

# Create optimizer with optimal parameters
optimizer_params = {
    'lr': params['learning_rate'],
    'weight_decay': params['weight_decay']
}

if params['optimizer'].lower() == 'adamw':
    optimizer = AdamW(model.parameters(), **optimizer_params)
elif params['optimizer'].lower() == 'adam':
    optimizer = Adam(model.parameters(), **optimizer_params)
elif params['optimizer'].lower() == 'sgd':
    optimizer = SGD(model.parameters(), **optimizer_params, momentum=0.9)
elif params['optimizer'].lower() == 'rmsprop':
    optimizer = RMSprop(model.parameters(), **optimizer_params)
else:
    print(f"⚠️  Unknown optimizer {params['optimizer']}, using AdamW")
    optimizer = AdamW(model.parameters(), **optimizer_params)

print(f"✅ Optimizer created: {type(optimizer).__name__}")
print(f"   Learning rate: {params['learning_rate']}")
print(f"   Weight decay: {params['weight_decay']}")

# Set up loss function
print(f"\n🎯 Setting up loss function: {params['loss_function']}")
label_smoothing = params.get('label_smoothing', 0.0)

if params['loss_function'].lower() == 'kldiv':
    # KLDivLoss expects log-probabilities as input and probabilities as target
    criterion = nn.KLDivLoss(reduction='batchmean')
elif params['loss_function'].lower() == 'mse':
    criterion = nn.MSELoss()
elif params['loss_function'].lower() == 'crossentropy':
    # label_smoothing=0.0 is the default, so one call covers both cases
    criterion = nn.CrossEntropyLoss(label_smoothing=label_smoothing)
else:
    print(f"⚠️  Unknown loss function {params['loss_function']}, using KLDivLoss")
    criterion = nn.KLDivLoss(reduction='batchmean')

print(f"✅ Loss function created: {type(criterion).__name__}")
if label_smoothing > 0:
    print(f"   Label smoothing: {label_smoothing}")

# Set up learning rate scheduler
print(f"\n📊 Setting up scheduler: {params['scheduler']}")

if params['scheduler'].lower() == 'cosine':
    scheduler = CosineAnnealingLR(optimizer, T_max=params['max_epochs'])
elif params['scheduler'].lower() == 'plateau':
    scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.5)
elif params['scheduler'].lower() == 'step':
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
elif params['scheduler'].lower() == 'exponential':
    scheduler = ExponentialLR(optimizer, gamma=0.95)
elif params['scheduler'].lower() == 'none':
    scheduler = None
    print("   No scheduler selected")
else:
    print(f"⚠️  Unknown scheduler {params['scheduler']}, using CosineAnnealingLR")
    scheduler = CosineAnnealingLR(optimizer, T_max=params['max_epochs'])

if scheduler:
    print(f"✅ Scheduler created: {type(scheduler).__name__}")

# Set up gradient clipping if specified
gradient_clipping = params.get('gradient_clipping', False)
max_grad_norm = params.get('max_grad_norm', 1.0)

print(f"\n🚀 Optimal training setup complete!")
print(f"   Model: {type(model).__name__}")
print(f"   Optimizer: {type(optimizer).__name__}")
print(f"   Loss function: {type(criterion).__name__}")
print(f"   Scheduler: {type(scheduler).__name__ if scheduler else 'None'}")
print(f"   Gradient clipping: {gradient_clipping}")
if gradient_clipping:
    print(f"   Max grad norm: {max_grad_norm}")
print(f"   Device: {device}")

print(f"\n📋 Complete Optimal Configuration Summary:")
print(f"   🏗️  Architecture: {params['depth']} blocks, {params['base_channels']} channels, {params['dropout']} dropout")
print(f"   🎯 Training: {params['optimizer']}, lr={params['learning_rate']}, wd={params['weight_decay']}")
print(f"   📊 Schedule: {params['scheduler']}, batch_size={params['batch_size']}, epochs={params['max_epochs']}")
print(f"   🔧 Advanced: patience={params['early_stopping_patience']}, clipping={gradient_clipping}")


🎯 Setting up optimizer: adamw
✅ Optimizer created: AdamW
   Learning rate: 0.0005
   Weight decay: 5e-05

🎯 Setting up loss function: kldiv
✅ Loss function created: KLDivLoss

📊 Setting up scheduler: cosine
✅ Scheduler created: CosineAnnealingLR

🚀 Optimal training setup complete!
   Model: PromoterCNN
   Optimizer: AdamW
   Loss function: KLDivLoss
   Scheduler: CosineAnnealingLR
   Gradient clipping: True
   Max grad norm: 1.0
   Device: mps

📋 Complete Optimal Configuration Summary:
   🏗️  Architecture: 4 blocks, 64 channels, 0.3 dropout
   🎯 Training: adamw, lr=0.0005, wd=5e-05
   📊 Schedule: cosine, batch_size=64, epochs=50
   🔧 Advanced: patience=15, clipping=True
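
The pieces configured above meet inside a single optimization step. The sketch below shows one such step with `KLDivLoss` and gradient clipping. It uses a small stand-in model and random tensors purely for illustration; the real loop lives in `train_epoch`, whose source is not shown here, so whether it applies `log_softmax` or clipping in exactly this way is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in model for illustration only; the notebook trains PromoterCNN.
model = nn.Sequential(nn.Flatten(), nn.Linear(5 * 600, 5))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=5e-5)
criterion = nn.KLDivLoss(reduction="batchmean")

sequences = torch.randn(8, 5, 600)                 # random stand-in batch
targets = torch.softmax(torch.randn(8, 5), dim=1)  # each row sums to 1

optimizer.zero_grad()
logits = model(sequences)
# KLDivLoss expects log-probabilities as input and probabilities as target
log_probs = F.log_softmax(logits, dim=1)
loss = criterion(log_probs, targets)
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

print(f"loss = {loss.item():.4f}, grad norm before clipping = {grad_norm.item():.4f}")
```

Passing raw logits to `KLDivLoss` without `log_softmax` silently produces a meaningless loss, which is why the conversion is spelled out. With the cosine scheduler, `scheduler.step()` would follow once per epoch rather than once per batch.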
