# Hands-on Exercise: Processing Legacy Lab Data for ML

## Duration: 90 minutes

### Overview

In this exercise, you'll learn how to: 
1. Clean and standardize heterogeneous lab data 
2. Merge data from multiple sources
3. Handle missing values and inconsistencies
4. Prepare data for ML model training
5. Build a simple ML model for antibody property prediction

### Part 1: Data Loading and Initial Assessment (20 mins)

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import re

# Load data files
def load_binding_data(file_path):
    df = pd.read_csv(file_path)
    return df

def load_stability_data(file_path):
    df = pd.read_excel(file_path)
    return df

def load_aggregation_data(file_path):
    df = pd.read_csv(file_path)
    return df

# Function to assess data quality
def assess_data_quality(df, dataset_name):
    """Analyze common data quality issues"""
    print(f"\nAnalyzing {dataset_name}:")
    print(f"Shape: {df.shape}")
    print("\nMissing values:")
    print(df.isnull().sum())
    print("\nUnique values per column:")
    for col in df.columns:
        print(f"{col}: {df[col].nunique()} unique values")
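
`assess_data_quality` prints its findings, which is convenient interactively but awkward to compare across the three files. As a minimal sketch, the same checks can be collected into a table. The helper name `quality_summary` and the toy data below are hypothetical and only mimic the kind of inconsistencies found in legacy lab files.

```python
import numpy as np
import pandas as pd

def quality_summary(df):
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
    })

# Hypothetical toy data (not from the exercise files)
toy_binding = pd.DataFrame({
    "Sample ID": ["mAb-001", "mAb-001", "mAb-002", "mAb-003"],
    "KD (nM)": ["1.2", "<0.1", "n.b.", np.nan],
    "Temperature": ["25C", "RT", "37 C", None],
})

summary = quality_summary(toy_binding)
print(summary)
```

A text-typed `KD (nM)` column in the summary is a quick hint that the column contains non-numeric entries such as `<0.1` or `n.b.`.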

### Part 2: Data Cleaning and Standardization (30 mins)

In [6]:
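def clean_binding_data(df):
    """Clean and standardize binding data"""
    # Handle temperature standardization (result in degrees Celsius)
    def standardize_temp(temp):
        if pd.isna(temp):
            return np.nan
        temp = str(temp).lower()
        # Room temperature is taken as 25 °C
        if 'rt' in temp or 'room' in temp:
            return 25.0
        # Take the first number in the entry; entries without a number become NaN
        match = re.search(r'[-+]?\d*\.?\d+', temp)
        return float(match.group()) if match else np.nan
    
    # Clean KD values
    def clean_kd(kd):
        if pd.isna(kd):
            return np.nan
        if isinstance(kd, str):
            kd = kd.strip()
            # 'n.b.' (no binding) has no numeric KD
            if kd.lower() == 'n.b.':
                return np.nan
            # '<x' is a value below the detection limit; the limit x is kept as an upper bound
            kd = kd.replace('<', '')
        try:
            return float(kd)
        except ValueError:
            # Unparseable entries become NaN instead of aborting the whole column
            return np.nan
    
    df_clean = df.copy()
    df_clean['Temperature'] = df_clean['Temperature'].apply(standardize_temp)
    df_clean['KD (nM)'] = df_clean['KD (nM)'].apply(clean_kd)
    
    return df_clean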

def clean_stability_data(df):
    """Clean and standardize stability data"""
    df_clean = df.copy()
    # Ensure numeric Tm values
    df_clean['Tm1'] = pd.to_numeric(df_clean['Tm1'], errors='coerce')
    df_clean['Tm2'] = pd.to_numeric(df_clean['Tm2'], errors='coerce')
    
    return df_clean

def clean_aggregation_data(df):
    """Clean and standardize aggregation data"""
    df_clean = df.copy()
    # Convert percentage strings such as '95.2%' to floats. The '%' sign is
    # stripped first, because pd.to_numeric would otherwise coerce those entries to NaN.
    for col in ['% Monomer', '% Aggregate']:
        values = df_clean[col].astype(str).str.replace('%', '', regex=False).str.strip()
        df_clean[col] = pd.to_numeric(values, errors='coerce')
    
    return df_clean
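
`clean_kd` turns a censored entry such as `<0.1` into the plain number 0.1, so the fact that it was only an upper bound is lost. One possible alternative, sketched below, keeps the qualifier in a separate column so that censored values can be filtered or treated differently later. The function name `parse_kd`, the column names `kd_nM` and `qualifier`, and the sample entries are assumptions for illustration.

```python
import re
import numpy as np
import pandas as pd

# Optional '<' or '>', optional spaces, then a plain or scientific-notation number
KD_PATTERN = re.compile(r"([<>]?)\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")

def parse_kd(value):
    """Split a raw KD entry into (numeric value, qualifier).

    The qualifier is '<' or '>' for censored values, '=' for exact values,
    and None when no number could be parsed (for example 'n.b.').
    """
    if pd.isna(value):
        return np.nan, None
    match = KD_PATTERN.fullmatch(str(value).strip())
    if match is None:
        return np.nan, None
    qualifier, number = match.groups()
    return float(number), qualifier or "="

# Hypothetical raw entries
raw = pd.Series(["1.2", "<0.1", "n.b.", None, "> 1000"])
parsed = pd.DataFrame(raw.map(parse_kd).tolist(), columns=["kd_nM", "qualifier"])
print(parsed)
```

Rows with a `<` or `>` qualifier can then be excluded from averages, or kept with a flag, depending on how the downstream model should treat detection limits.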


### Part 3: Data Integration (20 mins)

In [5]:
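def integrate_datasets(binding_df, stability_df, aggregation_df):
    """Merge all datasets on sample ID"""
    # Standardize sample ID columns. rename() returns new DataFrames,
    # so the caller's inputs are not modified.
    binding_df = binding_df.rename(columns={'Sample ID': 'Sample_ID'})
    stability_df = stability_df.rename(columns={'Sample': 'Sample_ID'})
    aggregation_df = aggregation_df.rename(columns={'ID': 'Sample_ID'})
    
    # Get mean values for repeated binding measurements
    binding_summary = binding_df.groupby('Sample_ID')['KD (nM)'].mean().reset_index()
    
    # Merge all datasets. The outer join keeps samples that are missing from
    # some datasets. This assumes one row per sample in the stability and
    # aggregation data; repeated sample IDs there would produce duplicated rows.
    merged_df = binding_summary.merge(
        stability_df[['Sample_ID', 'Tm1', 'Tm2']], 
        on='Sample_ID', 
        how='outer'
    ).merge(
        aggregation_df[['Sample_ID', '% Monomer', '% Aggregate']], 
        on='Sample_ID', 
        how='outer'
    )
    
    return merged_df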
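
An outer merge silently fills gaps with NaN, so it is worth checking which samples actually matched. The sketch below uses the `indicator` and `validate` options of `DataFrame.merge` on two small hypothetical tables; the sample IDs and values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-sample tables with partially overlapping sample IDs
binding = pd.DataFrame({"Sample_ID": ["A1", "A2", "A3"], "KD (nM)": [1.2, 0.4, 15.0]})
stability = pd.DataFrame({"Sample_ID": ["A1", "A2", "A4"], "Tm1": [68.5, 71.0, 66.2]})

check = binding.merge(
    stability,
    on="Sample_ID",
    how="outer",
    indicator=True,          # adds a "_merge" column: left_only / right_only / both
    validate="one_to_one",   # raises MergeError if a sample ID repeats on either side
)

print(check["_merge"].value_counts())

# Samples present in only one of the two tables
unmatched = check.loc[check["_merge"] != "both", ["Sample_ID", "_merge"]]
print(unmatched)
```

Here `A3` has binding data only and `A4` has stability data only, which is the kind of mismatch that often traces back to inconsistent sample naming.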

### Part 4: Feature Engineering and ML Preparation (20 mins)

In [4]:
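def prepare_for_ml(merged_df):
    """Prepare merged data for ML modeling"""
    # Create feature matrix
    features = [
        'KD (nM)',
        'Tm1',
        'Tm2',
        '% Monomer'
    ]
    
    # Handle missing values with column means.
    # Note: imputing before a train/test split lets test rows influence the
    # imputed values; for a real evaluation, fit the imputation on training data only.
    X = merged_df[features].copy()
    X = X.fillna(X.mean())
    
    # Create a simple target variable (could be customized)
    # Here we're creating a "quality score" from 0 to 3 combining multiple properties:
    # one point each for above-median Tm1, above-median % Monomer, and below-median KD.
    # Note: y is computed from columns that are also in X, so a model trained on
    # (X, y) only relearns these thresholds. For real work, use an independently
    # measured property as the target.
    y = (
        (X['Tm1'] > X['Tm1'].median()).astype(int) + 
        (X['% Monomer'] > X['% Monomer'].median()).astype(int) +
        (X['KD (nM)'] < X['KD (nM)'].median()).astype(int)
    )
    
    return X, y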
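
The exercise asks for a simple model but leaves the choice open. The sketch below is one possible starting point, assuming scikit-learn is installed: a random forest inside a pipeline, so that mean imputation is fitted on the training folds only, evaluated with 5-fold cross-validation. The data are synthetic and the label `passed_developability` is a hypothetical, randomly generated stand-in for an independently measured target, so accuracy near chance is expected here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples = 60

# Synthetic feature matrix with the same column names as prepare_for_ml
X = pd.DataFrame({
    "KD (nM)": rng.lognormal(mean=0.0, sigma=1.0, size=n_samples),
    "Tm1": rng.normal(68.0, 3.0, size=n_samples),
    "Tm2": rng.normal(80.0, 2.0, size=n_samples),
    "% Monomer": rng.uniform(85.0, 100.0, size=n_samples),
})
X.loc[::10, "Tm1"] = np.nan  # simulate missing measurements

# Hypothetical binary label, random on purpose (no real signal)
y = pd.Series(rng.integers(0, 2, size=n_samples), name="passed_developability")

# Imputation lives inside the pipeline, so each CV fold imputes from its own training rows
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy per fold: {np.round(scores, 2)}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

With real data, replace the synthetic `X` and `y` with the output of the earlier steps and a measured target, and compare the cross-validated score against a trivial baseline before trusting it.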

### Exercise Tasks

1.  Data Loading and Assessment
    -   Load all three data files
    -   Run initial quality assessment
    -   Identify key data issues
2.  Data Cleaning
    -   Implement temperature standardization
    -   Clean binding affinity values
    -   Handle missing values
    -   Standardize units
3.  Data Integration
    -   Merge datasets
    -   Handle duplicate measurements
    -   Create final feature matrix
4.  ML Model Development
    -   Create feature matrix
    -   Define target variable
    -   Train simple model
    -   Evaluate results

### Bonus Challenges

1.  Advanced Data Cleaning
    -   Implement outlier detection
    -   Add data validation rules
    -   Create data quality reports
2.  Feature Engineering
    -   Create derived features
    -   Implement domain-specific transformations
    -   Add sequence-based features
3.  Model Improvements
    -   Implement cross-validation
    -   Try different ML algorithms
    -   Add uncertainty quantification
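
For the outlier-detection challenge, a common first pass is the interquartile range (IQR) rule. The sketch below is one option among several; the function name `flag_iqr_outliers`, the default multiplier of 1.5, and the `Tm1` values are assumptions for illustration.

```python
import pandas as pd

def flag_iqr_outliers(series, k=1.5):
    """Return a boolean mask that is True where a value lies more than
    k * IQR below the first quartile or above the third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical Tm1 values in °C; the last entry looks like a misplaced decimal point
tm1 = pd.Series([67.9, 68.4, 68.8, 69.1, 69.5, 70.2, 6.9], name="Tm1")
outliers = flag_iqr_outliers(tm1)
print(tm1[outliers])
```

Flagged values are candidates for review rather than automatic removal: a value of 6.9 may be a typo for 69, which only the original lab record can confirm.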

### Tips for Success

1.  Data Quality
    -   Always plot your data
    -   Check for outliers
    -   Validate units
    -   Document assumptions
2.  Integration
    -   Verify sample IDs
    -   Check for duplicates
    -   Validate merged data
3.  ML Development
    -   Start simple
    -   Test assumptions
    -   Validate results
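
The integration tips can be made concrete with a small check on sample IDs. The sketch below normalizes case, whitespace, and separators before looking for duplicates. The normalization rules, the function name `normalize_sample_id`, and the example IDs are assumptions; real naming conventions should come from the lab's own records.

```python
import re
import pandas as pd

def normalize_sample_id(raw_id):
    """Trim, uppercase, and replace runs of spaces or underscores with a hyphen."""
    text = str(raw_id).strip().upper()
    return re.sub(r"[\s_]+", "-", text)

# Hypothetical IDs as they might appear across different files
raw_ids = pd.Series(["mAb-001", " mab_001", "MAB 001", "mAb-002"])
clean_ids = raw_ids.map(normalize_sample_id)

# keep=False marks every member of a duplicated group, not only the repeats
duplicates = clean_ids[clean_ids.duplicated(keep=False)]
print(clean_ids.tolist())
print(duplicates)
```

Three differently written IDs collapse to the same sample here. Whether they are true replicates or a naming collision is a question to settle before averaging or merging.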
