# Week 3 Checkpoint: Advanced Machine Learning for Drug Discovery

## 🎯 **Learning Objectives Verification**
By completing this checkpoint, you will demonstrate:
- [ ] Implementation of neural networks for molecular property prediction
- [ ] Understanding of graph neural networks for molecular data
- [ ] Advanced feature engineering techniques
- [ ] Cross-validation and model evaluation best practices
- [ ] Integration with existing QSAR pipeline

## 📊 **Progress Tracking**
- **Prerequisites**: Weeks 1-2 completion, basic neural network understanding
- **Time Estimate**: 4-5 hours
- **Skill Level**: Intermediate → Advanced
- **Portfolio Contribution**: Deep learning QSAR models and comparison framework

## 🔄 **Connection to Previous Weeks**
This week extends your QSAR pipeline with deep learning capabilities, building on:
- Week 1: ML fundamentals and model evaluation
- Week 2: Molecular descriptors and cheminformatics
- New: Graph representations and neural network architectures

In [None]:
# Environment Setup and Verification
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch_geometric.data import Data, Batch
from torch_geometric.nn import GCNConv, global_mean_pool

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
import deepchem as dc

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

print("✅ All required libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"RDKit version: {rdkit.__version__}")
print(f"DeepChem version: {dc.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 📚 **Knowledge Check (25 minutes)**

### Question 1: Neural Networks vs Traditional ML
Compare neural networks with traditional ML methods for molecular property prediction. What are the advantages and disadvantages of each approach?

**Your Answer:** 
<!-- Write your answer here -->

### Question 2: Graph Neural Networks
Explain why graph neural networks are particularly suitable for molecular data. What information is captured in the molecular graph representation?

**Your Answer:**
<!-- Write your answer here -->

### Question 3: Overfitting in Deep Learning
What are the main strategies to prevent overfitting in deep learning models for drug discovery? How do you detect overfitting?

**Your Answer:**
<!-- Write your answer here -->

### Question 4: Cross-Validation Strategy
Why are temporal or scaffold splits often preferred over random splits in drug discovery ML? What biases do they help avoid?

**Your Answer:**
<!-- Write your answer here -->

## 🔬 **Practical Challenge 1: Neural Network for Molecular Properties (60 minutes)**

Build a feedforward neural network that predicts molecular properties from descriptors.

In [None]:
# Load and prepare the dataset
# We'll use a larger synthetic dataset for this exercise
np.random.seed(42)
torch.manual_seed(42)

# Generate synthetic molecular dataset
n_compounds = 5000
n_descriptors = 50

# Create synthetic molecular descriptors
X = np.random.randn(n_compounds, n_descriptors)
# Add correlations loosely modeled on real descriptors (column 0 plays the role of MW)
X[:, 1] = X[:, 0] * 0.7 + np.random.randn(n_compounds) * 0.3  # Mimics a MW-LogP correlation
X[:, 2] = np.abs(X[:, 0]) * 0.5 + np.random.randn(n_compounds) * 0.2  # Mimics a MW-PSA relationship

# Generate realistic target values with non-linear relationships
y = (
    0.3 * X[:, 0] + 
    0.2 * X[:, 1] + 
    0.1 * X[:, 2] +
    0.15 * X[:, 0] * X[:, 1] +  # Interaction term
    0.1 * np.sin(X[:, 0]) +  # Non-linear term
    np.random.randn(n_compounds) * 0.1
)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1)

print(f"Dataset shape: {X_tensor.shape}")
print(f"Target shape: {y_tensor.shape}")
print(f"Target statistics: mean={y_tensor.mean():.3f}, std={y_tensor.std():.3f}")
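
Before training, the descriptors need to be split and scaled. The sketch below shows one possible way to do that. It assumes a 70/15/15 split and a batch size of 64, both arbitrary choices rather than course requirements, and the helper name `make_loaders` is hypothetical. The scaler is fitted on the training portion only, so no validation or test statistics leak into training.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset


def make_loaders(X, y, batch_size=64, seed=42):
    """Hypothetical helper: 70/15/15 split, scaling fitted on train only."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, random_state=seed)

    # Fit the scaler on training data only to avoid information leakage
    scaler = StandardScaler().fit(X_train)

    def to_loader(features, targets, shuffle):
        features = torch.tensor(scaler.transform(features), dtype=torch.float32)
        targets = torch.tensor(targets, dtype=torch.float32).view(-1, 1)
        return DataLoader(TensorDataset(features, targets),
                          batch_size=batch_size, shuffle=shuffle)

    return (to_loader(X_train, y_train, True),
            to_loader(X_val, y_val, False),
            to_loader(X_test, y_test, False),
            scaler)


# Small stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)
train_loader, val_loader, test_loader, scaler = make_loaders(X_demo, y_demo)
print(len(train_loader.dataset), len(val_loader.dataset), len(test_loader.dataset))
```

With the notebook's own data you would pass `X` and `y` (the NumPy arrays) instead of the stand-in arrays.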

In [None]:
# TODO: Implement a comprehensive neural network architecture

class MolecularPropertyPredictor(nn.Module):
    """
    Neural network for molecular property prediction.
    
    Your task: Complete this implementation with:
    - Multiple hidden layers with dropout
    - Batch normalization
    - Appropriate activation functions
    - Configurable architecture
    """
    
    def __init__(self, input_dim, hidden_dims, output_dim=1, dropout_rate=0.2):
        super(MolecularPropertyPredictor, self).__init__()
        
        # TODO: Implement the network architecture
        # Consider: Linear layers, batch norm, dropout, activation functions
        pass
    
    def forward(self, x):
        # TODO: Implement forward pass
        pass

# TODO: Implement training and evaluation functions

def train_model(model, train_loader, val_loader, num_epochs=100, learning_rate=0.001):
    """
    Train the neural network model.
    
    Returns:
    - Training history (losses, metrics)
    """
    # TODO: Implement training loop with:
    # - Appropriate loss function
    # - Optimizer (Adam recommended)
    # - Learning rate scheduling
    # - Early stopping
    # - Validation monitoring
    pass

def evaluate_model(model, test_loader):
    """
    Evaluate the trained model.
    
    Returns:
    - Test metrics (R², RMSE, MAE)
    - Predictions for visualization
    """
    # TODO: Implement evaluation with appropriate metrics
    pass

# Test your implementation
# model = MolecularPropertyPredictor(input_dim=n_descriptors, hidden_dims=[128, 64, 32])
# print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
# print(model)
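
If you get stuck on the skeleton above, the following sketch shows one way the pieces could fit together. It is not a reference solution: the names `MLPSketch` and `train_sketch` are hypothetical, and the choices of ReLU, MSE loss, `ReduceLROnPlateau`, and a patience of 10 epochs are assumptions you should feel free to change.

```python
import copy

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class MLPSketch(nn.Module):
    """Illustrative feedforward net: Linear -> BatchNorm -> ReLU -> Dropout per hidden layer."""

    def __init__(self, input_dim, hidden_dims, output_dim=1, dropout_rate=0.2):
        super().__init__()
        layers, prev = [], input_dim
        for width in hidden_dims:
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width),
                       nn.ReLU(), nn.Dropout(dropout_rate)]
            prev = width
        layers.append(nn.Linear(prev, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


def train_sketch(model, train_loader, val_loader, num_epochs=100,
                 learning_rate=1e-3, patience=10):
    """Train with Adam, LR scheduling on validation loss, and early stopping."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
    history = {"train_loss": [], "val_loss": []}
    best_val, best_state, stale = float("inf"), None, 0

    for _ in range(num_epochs):
        model.train()
        total = 0.0
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)
        history["train_loss"].append(total / len(train_loader.dataset))

        model.eval()
        total = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                total += criterion(model(xb), yb).item() * len(xb)
        val_loss = total / len(val_loader.dataset)
        history["val_loss"].append(val_loss)
        scheduler.step(val_loss)

        # Early stopping: keep the best weights, stop after `patience` stale epochs
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return history


# Stand-in data so the sketch runs on its own
torch.manual_seed(0)
X_demo = torch.randn(320, 8)
y_demo = 0.5 * X_demo[:, :1] + 0.1 * torch.randn(320, 1)
train_loader = DataLoader(TensorDataset(X_demo[:256], y_demo[:256]),
                          batch_size=32, shuffle=True, drop_last=True)
val_loader = DataLoader(TensorDataset(X_demo[256:], y_demo[256:]), batch_size=32)

model = MLPSketch(input_dim=8, hidden_dims=[32, 16])
history = train_sketch(model, train_loader, val_loader, num_epochs=20)
print(f"epochs run: {len(history['train_loss'])}, best val loss: {min(history['val_loss']):.4f}")
```

A gap that keeps widening between `train_loss` and `val_loss` in the returned history is the usual sign of overfitting. Note that `BatchNorm1d` fails on a training batch of size 1, which is why the demo loader drops the last incomplete batch.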

## 🧬 **Practical Challenge 2: Graph Neural Network Implementation (75 minutes)**

Implement a Graph Convolutional Network (GCN) for molecular property prediction using molecular graphs.

In [None]:
# Molecular graph data preparation
def mol_to_graph(smiles):
    """
    Convert SMILES string to molecular graph representation.
    
    Returns:
    - PyTorch Geometric Data object, or None if the SMILES cannot be parsed
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    
    # Node features (atoms)
    atom_features = []
    for atom in mol.GetAtoms():
        features = [
            atom.GetAtomicNum(),
            atom.GetDegree(),
            atom.GetFormalCharge(),
            int(atom.GetHybridization()),
            int(atom.GetIsAromatic())
        ]
        atom_features.append(features)
    
    # Edge indices (bonds)
    edge_indices = []
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()
        edge_indices.extend([[i, j], [j, i]])  # Undirected graph
    
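    # Convert to tensors
    x = torch.tensor(atom_features, dtype=torch.float)
    if edge_indices:
        edge_index = torch.tensor(edge_indices, dtype=torch.long).t().contiguous()
    else:
        # Molecules without bonds (e.g. a single atom) still need a [2, 0] edge index
        edge_index = torch.empty((2, 0), dtype=torch.long)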
    
    return Data(x=x, edge_index=edge_index)

# Sample molecules for testing
sample_smiles = [
    'CCO',  # Ethanol
    'CC(=O)O',  # Acetic acid
    'c1ccccc1',  # Benzene
    'CC(=O)Nc1ccc(O)cc1',  # Paracetamol
]

# Test graph conversion
for smiles in sample_smiles:
    graph = mol_to_graph(smiles)
    if graph is not None:
        print(f"SMILES: {smiles}")
        print(f"  Nodes: {graph.x.shape[0]}, Edges: {graph.edge_index.shape[1]}")
        print(f"  Node features shape: {graph.x.shape}")
    else:
        print(f"Failed to process: {smiles}")

In [None]:
# TODO: Implement Graph Convolutional Network

class MolecularGCN(nn.Module):
    """
    Graph Convolutional Network for molecular property prediction.
    
    Your task: Complete this implementation with:
    - Multiple GCN layers
    - Global pooling for graph-level prediction
    - Dropout and batch normalization
    - Final prediction layers
    """
    
    def __init__(self, num_features, hidden_dim=64, num_layers=3, output_dim=1, dropout=0.2):
        super(MolecularGCN, self).__init__()
        
        # TODO: Implement GCN architecture
        # Consider: GCN layers, activation functions, global pooling, final layers
        pass
    
    def forward(self, data):
        # TODO: Implement forward pass
        # x: node features, edge_index: graph connectivity, batch: batch assignment
        pass

class GraphDataset(Dataset):
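    """
    Dataset class for molecular graphs.

    Each item is a (graph, target) pair. Batch these with
    torch_geometric.loader.DataLoader; the default collate function of
    torch.utils.data.DataLoader cannot combine Data objects.
    """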
    
    def __init__(self, smiles_list, targets):
        self.graphs = []
        self.targets = []
        
        for smiles, target in zip(smiles_list, targets):
            graph = mol_to_graph(smiles)
            if graph is not None:
                self.graphs.append(graph)
                self.targets.append(target)
    
    def __len__(self):
        return len(self.graphs)
    
    def __getitem__(self, idx):
        return self.graphs[idx], self.targets[idx]

# TODO: Generate synthetic SMILES data for training
# You can use RDKit to generate random molecules or use a predefined set

# TODO: Implement GCN training and evaluation pipeline
# Similar to the feedforward network but adapted for graph data

## 📊 **Practical Challenge 3: Model Comparison and Cross-Validation (45 minutes)**

Compare different models using proper cross-validation strategies.
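
For the scikit-learn models, a random K-fold comparison can be sketched in a few lines. The data below is a synthetic stand-in and the hyperparameters are arbitrary assumptions; the point is only to show every model being scored on the same folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
y_demo = 0.3 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + 0.1 * rng.normal(size=300)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Scaling lives inside the pipeline so it is refitted on each training fold
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}

# Reusing one KFold object gives every model identical folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between two models is meaningful.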

In [None]:
# TODO: Implement comprehensive model comparison framework

from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, StratifiedKFold

class ModelComparison:
    """
    Framework for comparing different ML models for molecular property prediction.
    """
    
    def __init__(self):
        self.models = {}
        self.results = {}
    
    def add_model(self, name, model, model_type='sklearn'):
        """
        Add a model to the comparison.
        
        model_type: 'sklearn', 'pytorch', 'graph'
        """
        self.models[name] = {'model': model, 'type': model_type}
    
    def cross_validate(self, X, y, cv_folds=5, cv_strategy='random'):
        """
        Perform cross-validation for all models.
        
        cv_strategy: 'random', 'stratified', 'temporal', 'scaffold'
        """
        # TODO: Implement different CV strategies
        # - Random split
        # - Stratified split (for classification)
        # - Temporal split (by date/time)
        # - Scaffold split (by molecular scaffold)
        pass
    
    def evaluate_models(self, X_test, y_test):
        """
        Evaluate all models on test set.
        """
        # TODO: Implement evaluation for different model types
        pass
    
    def create_comparison_plots(self):
        """
        Create comprehensive comparison visualizations.
        """
        # TODO: Implement comparison plots:
        # - Performance metrics comparison
        # - Prediction vs actual scatter plots
        # - Residual analysis
        # - Learning curves
        # - Feature importance comparison
        pass

# TODO: Set up and run the model comparison
# Include: Random Forest, SVR, Neural Network, GCN

# comparison = ModelComparison()
# comparison.add_model('Random Forest', RandomForestRegressor(n_estimators=100))
# comparison.add_model('SVR', SVR(kernel='rbf'))
# comparison.add_model('Neural Network', your_nn_model, 'pytorch')
# comparison.add_model('GCN', your_gcn_model, 'graph')
# 
# comparison.cross_validate(X, y, cv_folds=5)
# comparison.evaluate_models(X_test, y_test)
# comparison.create_comparison_plots()
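
For the scaffold strategy, one possible approach is to group molecules by their Bemis-Murcko scaffold and pass those groups to scikit-learn's `GroupKFold`, so that no scaffold appears in both the training and test folds. The helper name `scaffold_groups` and the tiny SMILES list are illustrative assumptions.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupKFold


def scaffold_groups(smiles_list):
    """Hypothetical helper: map each SMILES to an integer id of its Murcko scaffold."""
    scaffold_to_id, groups = {}, []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi}")
        # Acyclic molecules have an empty scaffold and therefore share one group
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        groups.append(scaffold_to_id.setdefault(scaffold, len(scaffold_to_id)))
    return groups


smiles = [
    "c1ccccc1", "Cc1ccccc1", "Oc1ccccc1",  # benzene scaffold
    "c1ccncc1", "Cc1ccncc1",               # pyridine scaffold
    "C1CCCCC1", "CC1CCCCC1",               # cyclohexane scaffold
    "CCO", "CCCO",                         # acyclic, empty scaffold
]
groups = scaffold_groups(smiles)
indices = np.arange(len(smiles))

splitter = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(splitter.split(indices, groups=groups)):
    print(f"fold {fold}: test molecules = {[smiles[i] for i in test_idx]}")
```

Inside `cross_validate`, the same `groups` list could be passed to whichever splitter you choose when `cv_strategy='scaffold'`.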

## 📂 **Portfolio Integration: Advanced QSAR Pipeline (60 minutes)**

Integrate all models into a comprehensive pipeline that extends your Week 1-2 work.

In [None]:
# TODO: Extend the ComprehensiveQSARPipeline from Week 2

class AdvancedQSARPipeline:
    """
    Advanced QSAR pipeline integrating classical ML, neural networks, and GNNs.
    """
    
    def __init__(self):
        self.molecules = None
        self.descriptors = None
        self.graphs = None
        self.models = {
            'classical': {},
            'neural_network': None,
            'graph_nn': None
        }
        self.results = {}
        self.ensemble_model = None
    
    def prepare_data(self, smiles_list, targets):
        """
        Prepare all data representations.
        """
        # TODO: Implement data preparation for:
        # - Molecular descriptors (for classical ML and NN)
        # - Molecular graphs (for GNN)
        # - Data validation and cleaning
        pass
    
    def train_all_models(self, train_indices, val_indices):
        """
        Train all model types.
        """
        # TODO: Implement training for:
        # - Classical ML models (RF, SVR, etc.)
        # - Neural networks
        # - Graph neural networks
        pass
    
    def create_ensemble(self, method='stacking'):
        """
        Create ensemble model combining all approaches.
        
        method: 'voting', 'stacking', 'blending'
        """
        # TODO: Implement ensemble methods
        pass
    
    def comprehensive_evaluation(self, test_indices):
        """
        Perform comprehensive evaluation and analysis.
        """
        # TODO: Implement evaluation including:
        # - Individual model performance
        # - Ensemble performance
        # - Uncertainty quantification
        # - Applicability domain analysis
        # - Chemical space analysis
        pass
    
    def generate_insights(self):
        """
        Generate insights and interpretations.
        """
        # TODO: Implement insight generation:
        # - Feature importance across models
        # - Molecular substructure analysis
        # - Model agreement/disagreement analysis
        # - Recommendations for model selection
        pass
    
    def export_model_pipeline(self, filepath):
        """
        Export the complete pipeline for deployment.
        """
        # TODO: Implement model serialization and export
        pass

# TODO: Test the advanced pipeline
# pipeline = AdvancedQSARPipeline()
# pipeline.prepare_data(smiles_list, targets)
# pipeline.train_all_models(train_idx, val_idx)
# pipeline.create_ensemble()
# pipeline.comprehensive_evaluation(test_idx)
# insights = pipeline.generate_insights()
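
For `create_ensemble`, the stacking option can be prototyped with scikit-learn's `StackingRegressor` before any PyTorch models are involved. The sketch below uses synthetic stand-in data, and the Ridge meta-learner is an assumption. PyTorch models would need a scikit-learn-compatible wrapper to take part, which is not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with a linear and an interaction term
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 10))
y_demo = (0.3 * X_demo[:, 0]
          + 0.15 * X_demo[:, 0] * X_demo[:, 1]
          + 0.1 * rng.normal(size=400))
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("svr", make_pipeline(StandardScaler(), SVR(kernel="rbf"))),
    ],
    # The meta-learner is trained on out-of-fold predictions of the base models
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)

stacked_pred = stack.predict(X_test)
stacked_r2 = r2_score(y_test, stacked_pred)
print(f"Stacked R^2: {stacked_r2:.3f}")
```

Because the meta-learner only sees out-of-fold predictions, it is less likely to reward a base model simply for memorizing the training set.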

## 🔍 **Self-Assessment and Reflection (20 minutes)**

### Technical Skills Assessment
Rate your confidence (1-5 scale) in the following areas:

| Skill Area | Confidence (1-5) | Evidence/Notes |
|------------|------------------|----------------|
| PyTorch Neural Networks | __ | |
| Graph Neural Networks | __ | |
| Cross-Validation Strategies | __ | |
| Model Comparison & Evaluation | __ | |
| Ensemble Methods | __ | |
| Deep Learning Best Practices | __ | |
| Code Organization & Modularity | __ | |

### Reflection Questions

1. **Which model type performed best on your dataset and why do you think that is?**
   <!-- Your reflection here -->

2. **What are the main challenges in applying deep learning to molecular data?**
   <!-- Your reflection here -->

3. **How do you decide between different cross-validation strategies in drug discovery?**
   <!-- Your reflection here -->

4. **What insights did you gain from the model comparison exercise?**
   <!-- Your reflection here -->

### Progress Indicators
- [ ] Successfully implemented neural network architecture
- [ ] Built functional graph neural network
- [ ] Completed comprehensive model comparison
- [ ] Applied proper cross-validation strategies
- [ ] Created ensemble models
- [ ] Integrated with previous weeks' work
- [ ] Generated meaningful insights from model comparison
- [ ] Documented code clearly with proper structure

## 🚀 **Next Week Preview: Molecular Modeling and Simulation**

### Coming Up in Week 4:
- **Molecular Dynamics**: Introduction to MD simulations
- **Protein-Ligand Interactions**: Docking and binding analysis
- **Conformational Analysis**: Exploring molecular flexibility
- **Free Energy Calculations**: Thermodynamic predictions

### Preparation Tasks:
1. Install OpenMM and MDTraj
2. Review basic physical chemistry concepts
3. Read about protein-drug interactions
4. Complete any remaining Week 3 challenges

### Resources for Next Week:
- [OpenMM User Guide](http://docs.openmm.org/)
- [MDTraj Documentation](https://mdtraj.org/)
- [Protein-Ligand Docking Tutorial](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4014821/)

---

## 📝 **Submission Guidelines**

### Portfolio Submission:
1. Complete all practical challenges with working code
2. Implement the AdvancedQSARPipeline class
3. Create comprehensive model comparison report
4. Document your code with clear explanations
5. Upload notebook to your portfolio repository
6. Update your progress tracking dashboard

### Peer Review Assignment:
- Review 2 peer submissions focusing on:
  - Neural network architecture choices
  - Cross-validation implementation
  - Model comparison methodology
  - Code quality and documentation
- Submit reviews by [deadline]

### Assessment Criteria:
- **Technical Implementation** (40%): Working neural networks and GNNs
- **Methodology** (30%): Proper CV and evaluation practices
- **Analysis Quality** (20%): Insights from model comparisons
- **Code Quality** (10%): Documentation and organization

**Completion Criteria**: All challenges completed, comprehensive comparison performed, portfolio updated, peer reviews submitted