# AML/MDS Machine Learning Analysis

## Project Overview

This notebook analyzes AML (Acute Myeloid Leukemia) and MDS (Myelodysplastic Syndrome) patient data to predict transplant outcomes using machine learning techniques. The primary focus is whether dynamic changes in CD3+ chimerism levels across three time points (Day 30, 60, and 100) predict relapse and other transplant outcomes.

### Research Questions

**Tier 1 (Primary):**
- Can dynamic changes of CD3+ chimerism at Day 30, 60, and 100 predict disease relapse?
- Do trend patterns (upward, downward, fluctuating) correlate with transplant outcomes?

**Tier 2:**
- Can CD3+ chimerism dynamics predict other outcomes: overall survival (OS), graft-versus-host disease (GVHD), or GVHD-free, relapse-free survival (GRFS)?

**Tier 3:**
- Do interactions between CD3+ chimerism and other biomarkers improve prediction accuracy?
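
The trend patterns named in Tier 1 (upward, downward, fluctuating) need an operational definition before they can be used as features. The sketch below is one possible way to label a three-point series; the column names (`cd3_d30`, `cd3_d60`, `cd3_d100`), the `classify_trend` helper, and the 1-point tolerance `tol` are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def classify_trend(day30, day60, day100, tol=1.0):
    """Label a three-point chimerism series.

    `tol` (in percentage points) is an assumed noise threshold:
    changes smaller than `tol` are treated as no change.
    """
    values = [day30, day60, day100]
    if any(pd.isna(v) for v in values):
        return "unknown"
    d1 = day60 - day30
    d2 = day100 - day60
    s1 = 0 if abs(d1) < tol else int(np.sign(d1))
    s2 = 0 if abs(d2) < tol else int(np.sign(d2))
    if s1 == 0 and s2 == 0:
        return "stable"
    if s1 >= 0 and s2 >= 0:
        return "upward"
    if s1 <= 0 and s2 <= 0:
        return "downward"
    return "fluctuating"

# Hypothetical column names and values, for illustration only
demo = pd.DataFrame({
    "cd3_d30": [80.0, 95.0, 70.0, 90.0],
    "cd3_d60": [90.0, 85.0, 90.0, 90.5],
    "cd3_d100": [98.0, 60.0, 75.0, 90.2],
})
demo["trend"] = [
    classify_trend(a, b, c)
    for a, b, c in zip(demo["cd3_d30"], demo["cd3_d60"], demo["cd3_d100"])
]
# Net change between the first and last measurement as a numeric feature
demo["delta_30_100"] = demo["cd3_d100"] - demo["cd3_d30"]
print(demo)
```

The categorical label and the numeric delta could both be appended to the feature matrix; which threshold is clinically sensible is a question for the study team.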

### Target Variables
- **aGVHD**: Acute graft-versus-host disease
- **cGVHD**: Chronic graft-versus-host disease  
- **Relapse**: Disease relapse
- **Death**: Death from any cause (the overall survival endpoint)
- **RFS**: Relapse-free survival

## 1. Library Imports and Configuration

Import all necessary libraries for data processing, machine learning, and visualization.

In [None]:
# Core data processing libraries
import pandas as pd
import numpy as np
import logging
import sys
import warnings

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, mean_absolute_error, mean_squared_error
from sklearn.feature_selection import SelectKBest, f_classif, f_regression, RFE
from sklearn.decomposition import PCA

# Configuration
warnings.filterwarnings("ignore")
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s", force=True)

# Set random seed for reproducibility
np.random.seed(42)

# Plotting configuration
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully")

## 2. Data Loading and Initial Exploration

Load the dataset and perform initial data exploration to understand the structure and quality of the data.

In [None]:
def load_data(file_path, sheet_name="Sheet1", skip_rows=2):
    """
    Load dataset from Excel file with proper handling of headers and empty rows.
    
    Parameters:
    -----------
    file_path : str
        Path to the Excel file
    sheet_name : str
        Name of the sheet to read
    skip_rows : int
        Number of rows to skip from the top
        
    Returns:
    --------
    pd.DataFrame
        Loaded and cleaned dataset
    """
    logging.info("Loading dataset...")
    df = pd.read_excel(file_path, sheet_name=sheet_name, skiprows=skip_rows)
    
    # Remove completely empty rows
    df.dropna(axis=0, how='all', inplace=True)
    
    logging.info(f"Dataset loaded successfully. Shape: {df.shape}")
    return df

def map_excel_columns(df):
    """
    Create a mapping from Excel column letters (A, B, C...) to actual column names.
    This helps maintain consistency when referencing columns by their Excel positions.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
        
    Returns:
    --------
    dict
        Mapping from Excel column letters to column names
    """
    column_mapping = {}
    for idx, col_name in enumerate(df.columns):
        col_letter = ""
        col_number = idx + 1
        while col_number > 0:
            col_number, remainder = divmod(col_number - 1, 26)
            col_letter = chr(65 + remainder) + col_letter
        column_mapping[col_letter] = col_name
    return column_mapping

# Load the dataset
# Note: Update the file path to match your local environment
file_path = "main_dataset.xlsx"  # Update this path
df = load_data(file_path)
excel_column_mapping = map_excel_columns(df)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names (first 10): {list(df.columns[:10])}")
print(f"\nExcel column mapping (first 5): {dict(list(excel_column_mapping.items())[:5])}")
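
The letter mapping built by `map_excel_columns` uses bijective base-26 numbering, which is easy to get wrong at the `Z` to `AA` boundary. The standalone sketch below reproduces the same conversion and adds an inverse; the helper names `index_to_excel_letter` and `excel_letter_to_index` are introduced here for illustration only.

```python
def index_to_excel_letter(idx):
    """Convert a 0-based column index to its Excel letter (0 -> 'A', 26 -> 'AA')."""
    letters = ""
    number = idx + 1
    while number > 0:
        # Subtract 1 before dividing because Excel letters have no zero digit
        number, remainder = divmod(number - 1, 26)
        letters = chr(65 + remainder) + letters
    return letters

def excel_letter_to_index(letters):
    """Inverse conversion: Excel letter to 0-based column index ('A' -> 0)."""
    number = 0
    for ch in letters.upper():
        number = number * 26 + (ord(ch) - 64)
    return number - 1

# Boundary cases around one-, two-, and three-letter columns
for idx in (0, 25, 26, 27, 701, 702):
    print(idx, index_to_excel_letter(idx))

# Position of one of the feature columns referenced later in the notebook
print(excel_letter_to_index("BK"))
```

A round trip through both helpers is a quick sanity check when the sheet layout changes.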

In [None]:
# Display dataset overview
print("=== Dataset Overview ===")
print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")
print(f"\nMissing values per column (top 10):")
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts.head(10))

# Display data types
print(f"\nData types:")
print(df.dtypes.value_counts())

# Show first few rows
print(f"\nFirst 3 rows:")
df.head(3)

## 3. Feature and Label Definition

Define the feature columns and target variables based on the research objectives.

In [None]:
def extract_features_labels(df, excel_mapping, feature_cols, class_labels, reg_labels):
    """
    Extract features and labels from the dataset using Excel column mappings.
    Filter for AML/MDS patients (Disease == 1) and prepare data for ML pipeline.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataset
    excel_mapping : dict
        Mapping from Excel columns to actual column names
    feature_cols : list
        List of Excel column letters for features
    class_labels : list
        List of Excel column letters for classification targets
    reg_labels : list
        List of Excel column letters for regression targets
        
    Returns:
    --------
    tuple
        (X, y_classification, y_regression) dataframes
    """
    # Map Excel column letters to actual column names
    features = [excel_mapping[col] for col in feature_cols if col in excel_mapping]
    class_names = [excel_mapping[col] for col in class_labels if col in excel_mapping]
    reg_names = [excel_mapping[col] for col in reg_labels if col in excel_mapping]

    # Filter for AML/MDS patients only
    df_filtered = df[df["Disease"] == 1].copy()
    
    # Extract features and labels
    X = df_filtered[features].copy()
    y_classification = df_filtered[class_names].copy()
    y_regression = df_filtered[reg_names].copy()

    # Reset indices
    X.reset_index(drop=True, inplace=True)
    y_classification.reset_index(drop=True, inplace=True)
    y_regression.reset_index(drop=True, inplace=True)
    
    logging.info(f"Extracted features: {X.shape}")
    logging.info(f"Classification labels: {y_classification.shape}")
    logging.info(f"Regression labels: {y_regression.shape}")
    
    return X, y_classification, y_regression

# Define feature and label columns using Excel column references
# These correspond to clinically relevant features identified in the original analysis
feature_cols = [
    "E", "G", "K", "M", "O", "Q", "T", "U",           # Patient demographics and disease characteristics
    "BK", "BM", "BT", "BV", "BW",                       # Transplant-related features
    "CB", "CC", "CJ", "CK", "CL", "CM", "CN", "CO", "CP", # Treatment and conditioning regimens
    "CT", "DB", "DD", "DF", "DH", "DK", "DL",          # Chimerism measurements and biomarkers
    "DN", "DO", "DU", "DV", "EI", "ES",                # Additional clinical parameters
    "FB", "FC", "HQ", "HR", "HS", "HT"                 # Outcome-related features
]

# Classification target variables (binary outcomes)
classification_labels = [
    "FD",  # aGVHD (Acute Graft-versus-Host Disease)
    "FP",  # cGVHD (Chronic Graft-versus-Host Disease)
    "GD",  # Relapse
    "GO",  # Death
    "GW"   # RFS (Relapse-Free Survival)
]

# Regression target variables (continuous outcomes)
regression_labels = [
    "GR",  # Time to event 1
    "GY"   # Time to event 2
]

# Extract features and labels
X, y_classification, y_regression = extract_features_labels(
    df, excel_column_mapping, feature_cols, classification_labels, regression_labels
)

print(f"\n=== Feature and Label Extraction Complete ===")
print(f"Features shape: {X.shape}")
print(f"Classification targets shape: {y_classification.shape}")
print(f"Regression targets shape: {y_regression.shape}")
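
`extract_features_labels` silently skips any Excel letter that is not present in the mapping, so a shifted or truncated sheet can shrink the feature set without any error. A minimal check along the following lines could be run before extraction; `find_missing_columns` and the toy mapping are hypothetical and only illustrate the idea.

```python
def find_missing_columns(excel_mapping, *requested_lists):
    """Return requested Excel letters that are absent from the mapping, in order."""
    missing = []
    for requested in requested_lists:
        for letter in requested:
            if letter not in excel_mapping and letter not in missing:
                missing.append(letter)
    return missing

# Toy mapping standing in for excel_column_mapping (names are made up)
toy_mapping = {"A": "PatientID", "B": "Disease", "E": "Age", "G": "Sex"}
toy_features = ["E", "G", "K"]
toy_targets = ["FD"]

missing = find_missing_columns(toy_mapping, toy_features, toy_targets)
if missing:
    print(f"Columns not found in the sheet: {missing}")
```

In the notebook the same call would take `excel_column_mapping`, `feature_cols`, `classification_labels`, and `regression_labels`.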

## 4. Data Preprocessing Pipeline

Implement comprehensive data preprocessing including missing value imputation, encoding, and scaling.

In [None]:
def preprocess_data(X):
    """
    Comprehensive preprocessing pipeline for features.
    
    Steps:
    1. Handle datetime columns by converting to ordinal
    2. Encode categorical variables using LabelEncoder
    3. Impute missing values using mean strategy
    4. Scale features using StandardScaler
    
    Parameters:
    -----------
    X : pd.DataFrame
        Feature matrix
        
    Returns:
    --------
    pd.DataFrame
        Preprocessed feature matrix
    """
    logging.info("Starting feature preprocessing...")
    X_processed = X.copy()
    
    # Handle datetime columns
    datetime_cols = X_processed.select_dtypes(include=['datetime64']).columns
    if len(datetime_cols) > 0:
        logging.info(f"Converting {len(datetime_cols)} datetime columns to ordinal")
        for col in datetime_cols:
            X_processed[col] = X_processed[col].apply(
                lambda x: x.toordinal() if pd.notnull(x) else np.nan
            )
    
    # Encode categorical variables
    categorical_cols = X_processed.select_dtypes(include=['object']).columns
    if len(categorical_cols) > 0:
        logging.info(f"Encoding {len(categorical_cols)} categorical columns")
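        for col in categorical_cols:
            # Note: astype(str) turns NaN into the string "nan", so missing values
            # become their own category and are not touched by the imputer below
            le = LabelEncoder()
            X_processed[col] = le.fit_transform(X_processed[col].astype(str))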
    
    # Impute missing values
    missing_count = X_processed.isnull().sum().sum()
    if missing_count > 0:
        logging.info(f"Imputing {missing_count} missing values using mean strategy")
        imputer = SimpleImputer(strategy="mean")
        X_imputed = pd.DataFrame(
            imputer.fit_transform(X_processed), 
            columns=X_processed.columns,
            index=X_processed.index
        )
    else:
        X_imputed = X_processed
    
    # Scale features
    logging.info("Scaling features using StandardScaler")
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(
        scaler.fit_transform(X_imputed), 
        columns=X_imputed.columns,
        index=X_imputed.index
    )
    
    logging.info("Feature preprocessing completed")
    return X_scaled

def preprocess_labels(y_classification):
    """
    Preprocess classification labels.
    
    Steps:
    1. Binarize numeric variables at a threshold of 0 (values > 0 become 1)
    2. Encode categorical variables
    3. Handle missing values
    4. Remove constant columns
    
    Parameters:
    -----------
    y_classification : pd.DataFrame
        Classification target matrix
        
    Returns:
    --------
    pd.DataFrame
        Preprocessed classification targets
    """
    logging.info("Preprocessing classification labels...")
    y_processed = y_classification.copy()
    
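    for col in y_processed.columns:
        observed = y_processed[col].notna()
        if pd.api.types.is_numeric_dtype(y_processed[col]):
            # Binarize numeric outcomes at 0 (values > 0 become 1)
            encoded = (y_processed[col] > 0).astype(float)
        else:
            # Encode observed non-numeric labels as integer codes
            encoded = pd.Series(np.nan, index=y_processed.index)
            codes, _ = pd.factorize(y_processed.loc[observed, col].astype(str), sort=True)
            encoded[observed] = codes
        # Keep missing labels as NaN instead of encoding them as an extra "nan" class
        y_processed[col] = encoded.where(observed)
    
    # Handle missing values by forward fill, then mode for any leading gaps
    y_processed = y_processed.ffill()
    modes = y_processed.mode()
    if y_processed.isnull().sum().sum() > 0 and not modes.empty:
        y_processed = y_processed.fillna(modes.iloc[0])
    
    # Remove constant (or entirely missing) columns, then store labels as integers
    initial_cols = len(y_processed.columns)
    y_processed = y_processed.loc[:, y_processed.nunique() > 1].astype(int)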
    final_cols = len(y_processed.columns)
    
    if initial_cols != final_cols:
        logging.info(f"Removed {initial_cols - final_cols} constant columns")
    
    logging.info("Classification label preprocessing completed")
    return y_processed

def preprocess_regression_labels(y_regression):
    """
    Preprocess regression labels.
    
    Steps:
    1. Convert to numeric
    2. Impute missing values with median
    
    Parameters:
    -----------
    y_regression : pd.DataFrame
        Regression target matrix
        
    Returns:
    --------
    pd.DataFrame
        Preprocessed regression targets
    """
    logging.info("Preprocessing regression labels...")
    y_processed = y_regression.copy()
    
    for col in y_processed.columns:
        # Convert to numeric
        y_processed[col] = pd.to_numeric(y_processed[col], errors='coerce')
        
        # Impute with median
        y_processed[col] = y_processed[col].fillna(y_processed[col].median())
    
    logging.info("Regression label preprocessing completed")
    return y_processed

# Apply preprocessing
X_preprocessed = preprocess_data(X)
y_classification_processed = preprocess_labels(y_classification)
y_regression_processed = preprocess_regression_labels(y_regression)

print("\n=== Preprocessing Summary ===")
print(f"Preprocessed features shape: {X_preprocessed.shape}")
print(f"Preprocessed classification targets shape: {y_classification_processed.shape}")
print(f"Preprocessed regression targets shape: {y_regression_processed.shape}")
print(f"\nFeature statistics:")
print(X_preprocessed.describe())
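
`preprocess_data` fits the imputer and scaler on every row before any cross-validation split, so statistics from held-out patients influence the training folds. One common way to avoid this is to place the same steps inside a scikit-learn `Pipeline`, which refits them on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the patient table; it is not the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix, with about 10% missing values
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

# Imputation and scaling are refit on the training part of every fold
leak_free = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(leak_free, X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With a small cohort the difference between the two approaches can be noticeable, so it may be worth comparing both on the real data.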

## 5. Exploratory Data Analysis

Analyze correlations between target variables and visualize data distributions.

In [None]:
# Analyze target variable correlations
def plot_target_correlations(y_classification):
    """
    Plot correlation matrix for classification targets to understand relationships
    between different outcomes.
    """
    correlation_matrix = y_classification.corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        correlation_matrix, 
        annot=True, 
        cmap='coolwarm', 
        fmt=".3f", 
        annot_kws={"size": 12}, 
        linewidths=0.5,
        center=0
    )
    plt.title("Target Variable Correlation Matrix", fontsize=16, pad=20)
    plt.tight_layout()
    plt.show()
    
    # Print correlation insights
    print("\n=== Target Correlation Insights ===")
    print(correlation_matrix)
    
    # Find strongest correlations (excluding diagonal)
    mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
    corr_values = correlation_matrix.mask(mask).stack().reset_index()
    corr_values.columns = ['Variable 1', 'Variable 2', 'Correlation']
    corr_values = corr_values.sort_values('Correlation', key=abs, ascending=False)
    
    print("\nStrongest correlations between targets:")
    print(corr_values.head())

plot_target_correlations(y_classification_processed)
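
Because the targets are binary, the Pearson correlation shown in the heatmap is the phi coefficient of the corresponding 2x2 contingency table. The sketch below checks that equivalence on a small made-up example; `phi_coefficient` and the toy values are illustrative assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd

def phi_coefficient(a, b):
    """Phi coefficient of two 0/1 series, computed from the 2x2 contingency table."""
    table = pd.crosstab(a, b).reindex(index=[0, 1], columns=[0, 1], fill_value=0)
    n00, n01 = table.loc[0, 0], table.loc[0, 1]
    n10, n11 = table.loc[1, 0], table.loc[1, 1]
    # Product of the four marginal totals
    denom = np.sqrt(float((n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11)))
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else np.nan

# Made-up outcomes for ten hypothetical patients
toy = pd.DataFrame({
    "relapse": [0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    "death":   [0, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})
phi = phi_coefficient(toy["relapse"], toy["death"])
pearson = toy["relapse"].corr(toy["death"])
print(f"phi = {phi:.3f}, Pearson = {pearson:.3f}")
```

Reading the heatmap as a table of phi coefficients makes it clear that the values depend on how unbalanced each outcome is, not only on how often two outcomes co-occur.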

In [None]:
# Analyze class distributions
def analyze_class_distributions(y_classification):
    """
    Analyze the distribution of classes in each target variable.
    """
    target_names = ["aGVHD", "cGVHD", "Relapse", "Death", "RFS"]
    
    fig, axes = plt.subplots(
        1, len(y_classification.columns), figsize=(20, 4), squeeze=False
    )
    axes = axes.ravel()  # Always index axes as a 1-D array, even with one target
    
    for i, (col, name) in enumerate(zip(y_classification.columns, target_names)):
        counts = y_classification[col].value_counts().sort_index()
        
        axes[i].bar(counts.index, counts.values, color=['lightcoral', 'lightblue'])
        axes[i].set_title(f'{name}\n(0: No, 1: Yes)', fontsize=12)
        axes[i].set_xlabel('Class')
        axes[i].set_ylabel('Count')
        
        # Add value labels on bars
        for j, v in enumerate(counts.values):
            axes[i].text(j, v + 0.5, str(v), ha='center', va='bottom')
    
    plt.suptitle('Target Variable Class Distributions', fontsize=16, y=1.05)
    plt.tight_layout()
    plt.show()
    
    # Print distribution statistics
    print("\n=== Class Distribution Summary ===")
    for i, (col, name) in enumerate(zip(y_classification.columns, target_names)):
        counts = y_classification[col].value_counts().sort_index()
        total = counts.sum()
        print(f"{name}:")
        n_no, n_yes = counts.get(0, 0), counts.get(1, 0)
        print(f"  No (0): {n_no} ({n_no/total:.1%})")
        print(f"  Yes (1): {n_yes} ({n_yes/total:.1%})")
        print()

analyze_class_distributions(y_classification_processed)

## 6. Feature Selection and Optimization

Select features with three methods (SelectKBest, RFE, and random-forest importance thresholds), choosing the number of features by cross-validated performance.

In [None]:
def optimize_feature_selection(X, y, task="classification", method="select_k_best"):
    """
    Perform hyperparameter tuning for feature selection methods.
    
    This function automatically finds the optimal number of features to select
    by testing different values and choosing the one that gives the best
    cross-validation performance.
    
    Parameters:
    -----------
    X : pd.DataFrame
        Feature matrix
    y : pd.Series
        Target variable
    task : str
        'classification' or 'regression'
    method : str
        Feature selection method: 'select_k_best', 'rfe', or 'feature_importance'
        
    Returns:
    --------
    np.ndarray
        Selected features matrix
    """
    # Choose appropriate scoring function and model
    if task == "classification":
        score_func = f_classif
        model = RandomForestClassifier(random_state=42, n_estimators=100)
        scoring = 'accuracy'
    else:
        score_func = f_regression
        model = RandomForestRegressor(random_state=42, n_estimators=100)
        scoring = 'neg_mean_squared_error'
    
    if method == "select_k_best":
        # Test different values of k
        param_grid = {'k': [5, 10, 15, 20, min(25, X.shape[1])]}
        best_k = param_grid['k'][0]
        best_score = -np.inf
        
        for k in param_grid['k']:
            if k > X.shape[1]:
                continue
                
            selector = SelectKBest(score_func=score_func, k=k)
            X_selected = selector.fit_transform(X, y)
            
            # Cross-validation score
            scores = cross_val_score(model, X_selected, y, cv=5, scoring=scoring)
            score = np.mean(scores)
            
            if score > best_score:
                best_k, best_score = k, score
        
        # Return best selection
        final_selector = SelectKBest(score_func=score_func, k=best_k)
        return final_selector.fit_transform(X, y)
    
    elif method == "rfe":
        # Test different numbers of features
        param_grid = {'n_features': [5, 10, 15, 20, min(25, X.shape[1])]}
        best_n = param_grid['n_features'][0]
        best_score = -np.inf
        
        for n in param_grid['n_features']:
            if n > X.shape[1]:
                continue
                
            selector = RFE(model, n_features_to_select=n)
            X_selected = selector.fit_transform(X, y)
            
            scores = cross_val_score(model, X_selected, y, cv=5, scoring=scoring)
            score = np.mean(scores)
            
            if score > best_score:
                best_n, best_score = n, score
        
        final_selector = RFE(model, n_features_to_select=best_n)
        return final_selector.fit_transform(X, y)
    
    elif method == "feature_importance":
        # Test different importance thresholds
        model.fit(X, y)
        importance_scores = model.feature_importances_
        thresholds = np.linspace(0.005, 0.05, 5)
        
        best_threshold = thresholds[0]
        best_score = -np.inf
        
        for threshold in thresholds:
            selected_features = X.columns[importance_scores > threshold]
            
            if len(selected_features) == 0:
                continue
                
            X_selected = X[selected_features]
            scores = cross_val_score(model, X_selected, y, cv=5, scoring=scoring)
            score = np.mean(scores)
            
            if score > best_score:
                best_threshold, best_score = threshold, score
        
        selected_features = X.columns[importance_scores > best_threshold]
        return X[selected_features].values
    
    else:
        raise ValueError("Invalid method. Choose from 'select_k_best', 'rfe', or 'feature_importance'.")

# Test feature selection on a sample target
print("=== Testing Feature Selection Methods ===")
sample_target = y_classification_processed.iloc[:, 0]  # First target (aGVHD)

for method in ['select_k_best', 'rfe', 'feature_importance']:
    print(f"\nTesting {method}...")
    X_selected = optimize_feature_selection(X_preprocessed, sample_target, method=method)
    print(f"Selected {X_selected.shape[1]} features out of {X_preprocessed.shape[1]}")
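
`optimize_feature_selection` fits each selector on the full dataset and then cross-validates on the already reduced matrix, which tends to give optimistic scores. An alternative is to put the selector inside a `Pipeline` and let `GridSearchCV` tune `k`, so that selection is repeated within every training fold. The following is a sketch on synthetic data under that assumption, not a drop-in replacement for the function above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 30 features, 5 of them informative
X_toy, y_toy = make_classification(
    n_samples=150, n_features=30, n_informative=5, random_state=42
)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
# "select__k" addresses the k parameter of the "select" step
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [5, 10, 15, 20, 25]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

best_k = search.best_params_["select__k"]
selected_mask = search.best_estimator_.named_steps["select"].get_support()
print(f"Best k: {best_k}, CV accuracy: {search.best_score_:.3f}")
```

The boolean `selected_mask` can be applied to the column names to report which features were kept.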

## 7. Machine Learning Model Training

Train Random Forest, SVM, and Naive Bayes classifiers (plus Random Forest and SVM regressors) with per-target feature selection, and evaluate them with cross-validation.

In [None]:
def train_classification_models(X, y_classification, feature_selection_method="select_k_best", cv=5):
    """
    Train classification models with cross-validation and feature selection for each target.
    
    This function:
    1. Applies feature selection for each target independently
    2. Performs hyperparameter tuning using GridSearchCV
    3. Evaluates models using cross-validation
    4. Returns comprehensive results for all targets
    
    Parameters:
    -----------
    X : pd.DataFrame
        Feature matrix
    y_classification : pd.DataFrame
        Classification targets matrix
    feature_selection_method : str
        Method for feature selection
    cv : int
        Number of cross-validation folds
        
    Returns:
    --------
    dict
        Nested dictionary with results for each target and model
    """
    # Define hyperparameter grids
    param_grids = {
        "Random Forest": {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, 20, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        },
        "SVM": {
            'C': [0.1, 1, 10],
            'kernel': ['rbf', 'linear'],
            'gamma': ['scale', 'auto']
        },
        "Naive Bayes": {}
    }
    
    # Initialize classifiers
    classifiers = {
        "Random Forest": RandomForestClassifier(random_state=42),
        "SVM": SVC(random_state=42),
        "Naive Bayes": GaussianNB()
    }
    
    target_names = ["aGVHD", "cGVHD", "Relapse", "Death", "RFS"]
    classification_results = {}
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)

    for idx, (label, target_name) in enumerate(zip(y_classification.columns, target_names)):
        print(f"\n🔹 Training models for {target_name} ({label}) 🔹")
        
        # Get target variable and ensure no missing values
        y_label = y_classification[label].dropna()
        X_label = X.loc[y_label.index]
        
        # Check if we have multiple classes
        if len(y_label.unique()) <= 1:
            print(f"⚠️ Skipping {target_name}: Only one class present")
            continue
        
        # Apply feature selection
        print(f"Applying {feature_selection_method} feature selection...")
        X_selected = optimize_feature_selection(
            X_label, y_label, task="classification", method=feature_selection_method
        )
        print(f"Selected {X_selected.shape[1]} features")
        
        classification_results[target_name] = {}

        for model_name, model in classifiers.items():
            print(f"  Training {model_name}...")
            
            # Hyperparameter tuning
            if param_grids[model_name]:  # If there are parameters to tune
                grid_search = GridSearchCV(
                    model, param_grids[model_name], cv=3, scoring='accuracy', n_jobs=-1
                )
                grid_search.fit(X_selected, y_label)
                best_model = grid_search.best_estimator_
                best_params = grid_search.best_params_
            else:
                # For models without hyperparameters (like Naive Bayes)
                best_model = model
                best_params = {}
                best_model.fit(X_selected, y_label)
            
            # Cross-validation evaluation
            accuracy_scores = cross_val_score(
                best_model, X_selected, y_label, cv=kf, scoring='accuracy'
            )
            f1_scores = cross_val_score(
                best_model, X_selected, y_label, cv=kf, scoring='f1_macro'
            )
            
            # Generate predictions for detailed report
            y_pred = cross_val_predict(best_model, X_selected, y_label, cv=kf)
            
            # Store results
            classification_results[target_name][model_name] = {
                "best_model_params": best_params,
                "accuracy": np.mean(accuracy_scores),
                "accuracy_std": np.std(accuracy_scores),
                "f1_score": np.mean(f1_scores),
                "f1_std": np.std(f1_scores),
                "report": classification_report(y_label, y_pred, output_dict=True),
                "n_features": X_selected.shape[1]
            }
            
            print(f"    Accuracy: {np.mean(accuracy_scores):.3f} (±{np.std(accuracy_scores):.3f})")
            print(f"    F1-Score: {np.mean(f1_scores):.3f} (±{np.std(f1_scores):.3f})")

    return classification_results

def train_regression_models(X, y_regression, feature_selection_method="select_k_best", cv=5):
    """
    Train regression models with cross-validation and feature selection for each target.
    """
    regressors = {
        "Random Forest": RandomForestRegressor(random_state=42),
        "SVM": SVR()
    }

    regression_results = {}
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)

    for label in y_regression.columns:
        print(f"\n🔹 Training regression models for {label} 🔹")
        
        y_label = y_regression[label].dropna()
        X_label = X.loc[y_label.index]

        # Apply feature selection
        X_selected = optimize_feature_selection(
            X_label, y_label, task="regression", method=feature_selection_method
        )

        regression_results[label] = {}

        for model_name, model in regressors.items():
            print(f"  Training {model_name}...")
            
            # Cross-validated predictions
            y_pred = cross_val_predict(model, X_selected, y_label, cv=kf)

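            # Compute metrics (percentage error is undefined where the true value is 0)
            y_true = y_label.to_numpy(dtype=float)
            nonzero = y_true != 0
            percentage_errors = np.abs((y_pred[nonzero] - y_true[nonzero]) / y_true[nonzero]) * 100
            mean_percentage_error = np.mean(percentage_errors) if nonzero.any() else np.nan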
            mae = mean_absolute_error(y_label, y_pred)
            mse = mean_squared_error(y_label, y_pred)
            rmse = np.sqrt(mse)

            regression_results[label][model_name] = {
                "mae": mae,
                "mse": mse,
                "rmse": rmse,
                "percentage_error": mean_percentage_error,
                "n_features": X_selected.shape[1]
            }
            
            print(f"    MAE: {mae:.3f}")
            print(f"    RMSE: {rmse:.3f}")
            print(f"    Mean % Error: {mean_percentage_error:.1f}%")

    return regression_results

print("=== Model Training Functions Defined ===")
print("Ready to train models with different feature selection methods.")
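
In `train_classification_models`, hyperparameters are tuned with `GridSearchCV` on the same rows that are later used for the reported cross-validation scores, and the folds are not stratified. A nested scheme, with tuning in an inner loop and scoring in an outer loop, gives a less optimistic estimate. The sketch below shows the pattern on synthetic, mildly imbalanced data; the grid is shortened and all names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in with roughly a 70/30 class split
X_toy, y_toy = make_classification(
    n_samples=150, n_features=10, weights=[0.7, 0.3], random_state=42
)

# Stratified folds keep the class ratio similar in every split
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: hyperparameter search, refit on each outer training fold
tuned_svm = GridSearchCV(
    SVC(random_state=42),
    param_grid={"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
    cv=inner_cv,
    scoring="accuracy",
)
# Outer loop: both metrics computed in a single pass
nested = cross_validate(
    tuned_svm, X_toy, y_toy, cv=outer_cv, scoring=["accuracy", "f1_macro"]
)
print(f"Accuracy: {nested['test_accuracy'].mean():.3f}")
print(f"Macro F1: {nested['test_f1_macro'].mean():.3f}")
```

`cross_validate` also avoids running three separate cross-validation passes for accuracy, F1, and predictions.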

## 8. Model Training and Evaluation

Train models using different feature selection methods and compare performance.

In [None]:
# Train models with different feature selection methods
feature_selection_methods = ["select_k_best", "rfe", "feature_importance"]
classification_results_all = {}
regression_results_all = {}

for method in feature_selection_methods:
    print(f"\n{'='*60}")
    print(f"🔹 RUNNING FEATURE SELECTION WITH {method.upper()} 🔹")
    print(f"{'='*60}")
    
    # Train Classification Models
    classification_results = train_classification_models(
        X_preprocessed, y_classification_processed, 
        feature_selection_method=method, cv=5
    )

    # Train Regression Models (if regression targets exist)
    if len(y_regression_processed.columns) > 0:
        regression_results = train_regression_models(
            X_preprocessed, y_regression_processed, 
            feature_selection_method=method, cv=5
        )
        regression_results_all[method] = regression_results

    # Store Results
    classification_results_all[method] = classification_results

    print(f"\n✅ COMPLETED FEATURE SELECTION WITH {method.upper()} ✅")

print(f"\n{'='*60}")
print("🎉 ALL MODEL TRAINING COMPLETED 🎉")
print(f"{'='*60}")
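
The nested dictionary in `classification_results_all` (method, then target, then model) is awkward to compare by eye. Flattening it into a table is one option; the helper `results_to_frame` and the numbers in `toy_results` below are made up purely to illustrate the structure and do not represent real results.

```python
import pandas as pd

def results_to_frame(results_all):
    """Flatten {method: {target: {model: metrics}}} into one row per combination."""
    rows = []
    for method, targets in results_all.items():
        for target, models in targets.items():
            for model_name, metrics in models.items():
                rows.append({
                    "method": method,
                    "target": target,
                    "model": model_name,
                    "accuracy": metrics["accuracy"],
                    "f1_score": metrics["f1_score"],
                })
    return pd.DataFrame(
        rows, columns=["method", "target", "model", "accuracy", "f1_score"]
    )

# Made-up numbers, only to show the expected dictionary shape
toy_results = {
    "select_k_best": {
        "Relapse": {
            "SVM": {"accuracy": 0.70, "f1_score": 0.60},
            "Naive Bayes": {"accuracy": 0.65, "f1_score": 0.62},
        }
    },
    "rfe": {"Relapse": {"SVM": {"accuracy": 0.72, "f1_score": 0.61}}},
}
summary = results_to_frame(toy_results)
best = summary.sort_values("accuracy", ascending=False).iloc[0]
print(summary)
```

Calling `results_to_frame(classification_results_all)` after training would give a table that can be sorted, grouped, or exported.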

## 9. Results Visualization and Analysis

Create comprehensive visualizations to compare model performance across different methods.

In [None]:
def plot_classification_results(results):
    """
    Create comprehensive visualization of classification results.
    
    Plots accuracy and F1-scores for all feature selection methods and models
    across all target variables with error bars showing standard deviation.
    """
    methods = list(results.keys())
    classifiers = ["Random Forest", "SVM", "Naive Bayes"]
    
    # Create figure with subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(22, 8))
    
    num_methods = len(methods)
    
    # Get all unique targets across all methods
    all_targets = set()
    for method in methods:
        all_targets.update(results[method].keys())
    all_targets = sorted(list(all_targets))
    
    x = np.arange(len(all_targets))
    width = 0.25 / num_methods
    
    colors = plt.cm.Set3(np.linspace(0, 1, len(methods) * len(classifiers)))
    color_idx = 0
    
    for method_idx, method in enumerate(methods):
        for model_idx, model in enumerate(classifiers):
            accuracies = []
            accuracy_stds = []
            f1_scores = []
            f1_stds = []
            
            for target in all_targets:
                if target in results[method] and model in results[method][target]:
                    accuracies.append(results[method][target][model]["accuracy"])
                    accuracy_stds.append(results[method][target][model]["accuracy_std"])
                    f1_scores.append(results[method][target][model]["f1_score"])
                    f1_stds.append(results[method][target][model]["f1_std"])
                else:
                    accuracies.append(0)
                    accuracy_stds.append(0)
                    f1_scores.append(0)
                    f1_stds.append(0)
            
            position = x + (method_idx * len(classifiers) + model_idx) * width
            
            # Plot accuracy
            ax1.bar(position, accuracies, width=width, 
                   label=f"{method.replace('_', ' ').title()} - {model}",
                   color=colors[color_idx], alpha=0.8)
            ax1.errorbar(position, accuracies, yerr=accuracy_stds, 
                        fmt='none', color='black', capsize=2, linewidth=1)
            
            # Plot F1 scores
            ax2.bar(position, f1_scores, width=width, 
                   label=f"{method.replace('_', ' ').title()} - {model}",
                   color=colors[color_idx], alpha=0.8)
            ax2.errorbar(position, f1_scores, yerr=f1_stds, 
                        fmt='none', color='black', capsize=2, linewidth=1)
            
            color_idx += 1
    
    # Customize accuracy subplot
    # Center each tick under its group of bars (bars are centered on their positions)
    ax1.set_xticks(x + (num_methods * len(classifiers) - 1) * width / 2)
    ax1.set_xticklabels(all_targets, rotation=0)
    ax1.set_ylabel('Accuracy', fontsize=12)
    ax1.set_ylim(0, 1)
    ax1.set_title('Classification Accuracy Across Feature Selection Methods', fontsize=14, pad=20)
    ax1.grid(axis="y", linestyle="--", alpha=0.3)
    
    # Customize F1 score subplot
    # Center each tick under its group of bars (bars are centered on their positions)
    ax2.set_xticks(x + (num_methods * len(classifiers) - 1) * width / 2)
    ax2.set_xticklabels(all_targets, rotation=0)
    ax2.set_ylabel('F1 Score', fontsize=12)
    ax2.set_ylim(0, 1)
    ax2.set_title('F1 Scores Across Feature Selection Methods', fontsize=14, pad=20)
    ax2.grid(axis="y", linestyle="--", alpha=0.3)
    
    # Add legend
    handles, labels = ax1.get_legend_handles_labels()
    fig.legend(handles, labels, loc='center right', bbox_to_anchor=(1.15, 0.5), fontsize=10)
    
    plt.suptitle('Model Performance Comparison', fontsize=16, y=0.98)
    plt.tight_layout()
    plt.subplots_adjust(right=0.85)
    plt.show()

def plot_regression_results(results):
    """
    Create visualization of regression results showing percentage errors.
    """
    if not results:
        print("No regression results to plot.")
        return
        
    methods = list(results.keys())
    regressors = ["Random Forest", "SVM"]
    num_methods = len(methods)

    plt.figure(figsize=(14, 8))

    # Get all targets
    all_targets = set()
    for method in methods:
        all_targets.update(results[method].keys())
    all_targets = sorted(list(all_targets))
    
    x = np.arange(len(all_targets))
    width = 0.35 / num_methods

    for method_idx, method in enumerate(methods):
        for model_idx, model in enumerate(regressors):
            percentage_errors = []
            for target in all_targets:
                if (target in results[method] and 
                    model in results[method][target]):
                    percentage_errors.append(
                        results[method][target][model]["percentage_error"]
                    )
                else:
                    percentage_errors.append(0)
            
            position = x + (method_idx * len(regressors) + model_idx) * width
            plt.bar(position, percentage_errors, width=width, 
                   label=f"{method.replace('_', ' ').title()} - {model}",
                   alpha=0.8)

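    # Bars are center-aligned, so the group midpoint is (n_bars - 1) * width / 2 from the first bar
    plt.xticks(x + (num_methods * len(regressors) - 1) * width / 2, all_targets)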
    plt.ylabel('Mean Percentage Error (%)', fontsize=12)
    plt.title('Regression Performance: Mean Percentage Error', fontsize=14, pad=20)
    plt.grid(axis="y", linestyle="--", alpha=0.3)
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

# Plot results
print("\n=== VISUALIZING RESULTS ===")
plot_classification_results(classification_results_all)

if regression_results_all:
    plot_regression_results(regression_results_all)
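
The grouped-bar layout used by both plotting functions can be checked in isolation. The following is a minimal sketch, assuming matplotlib's default center-aligned bars; the helper name `grouped_bar_layout` and its `group_width` parameter are hypothetical and not part of the notebook. With center-aligned bars, the midpoint of a group of `n_bars` bars is `(n_bars - 1) * width / 2` to the right of the first bar's center.

```python
import numpy as np

def grouped_bar_layout(n_groups, n_bars, group_width=0.8):
    """Return bar centers (n_bars x n_groups), tick positions, and bar width."""
    x = np.arange(n_groups)                  # center of the first bar in each group
    width = group_width / n_bars             # width of a single bar
    offsets = np.arange(n_bars) * width      # offset of each bar within its group
    positions = x[None, :] + offsets[:, None]
    # The tick for each group sits at the mean of that group's bar centers
    tick_positions = x + (n_bars - 1) * width / 2
    return positions, tick_positions, width

positions, ticks, width = grouped_bar_layout(n_groups=3, n_bars=4)
```

Each row of `positions` can be passed to `ax.bar(..., width=width)`, and `ticks` to `ax.set_xticks`, so the target labels sit under the middle of each group.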

## 10. Feature Importance Analysis

Analyze which features are most important for predicting each outcome.
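
When PCA is used to rank features, summing absolute loadings treats every retained component equally, even though later components explain less variance. One common alternative weights each component by its explained variance ratio. The sketch below is illustrative only: it runs on synthetic data rather than `X_preprocessed`, uses NumPy's SVD to stay dependency-light, and the function name `variance_weighted_loadings` is hypothetical. scikit-learn's `PCA` exposes the same quantities as `components_` and `explained_variance_ratio_`.

```python
import numpy as np

def variance_weighted_loadings(X, n_components):
    """Score each feature by |loading| weighted by explained variance ratio."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes; squared singular values are
    # proportional to the variance explained along each axis.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    k = min(n_components, Vt.shape[0])
    return (np.abs(Vt[:k]) * explained_ratio[:k, None]).sum(axis=0)

# Synthetic data: features 0-2 share a latent factor, features 3-4 are pure noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X_demo = np.hstack([latent + 0.3 * rng.normal(size=(300, 3)),
                    rng.normal(size=(300, 2))])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scores = variance_weighted_loadings(X_demo, n_components=2)
ranking = np.argsort(scores)[::-1]  # feature indices, highest score first
```

In this synthetic setup the three correlated features should rank above the two noise features, because the component they dominate explains the largest share of variance.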

In [None]:
def plot_feature_importance(X, y, task="classification", method="feature_importance", n_components=15):
    """
    Plot feature importance using Random Forest or PCA.
    
    Parameters:
    -----------
    X : pd.DataFrame
        Feature matrix
    y : pd.Series
        Target variable
    task : str
        'classification' or 'regression'
    method : str
        'feature_importance' or 'pca'
    n_components : int
        Number of top features to display; with method='pca' it is also the
        number of principal components fitted
    """
    if task == "classification":
        model = RandomForestClassifier(random_state=42, n_estimators=100)
    else:
        model = RandomForestRegressor(random_state=42, n_estimators=100)

    if method == "feature_importance":
        model.fit(X, y)
        importance_scores = model.feature_importances_
        sorted_indices = np.argsort(importance_scores)[::-1][:n_components]
        sorted_features = X.columns[sorted_indices]
        sorted_scores = importance_scores[sorted_indices]
        xlabel = "Feature Importance"
        title_suffix = "Random Forest Feature Importance"
        
    elif method == "pca":
        pca = PCA(n_components=min(n_components, X.shape[1]))
        pca.fit(X)
        
        # Sum absolute loadings over the retained components
        # (components are not weighted by their explained variance)
        feature_weights = np.abs(pca.components_).sum(axis=0)
        sorted_indices = np.argsort(feature_weights)[::-1][:n_components]
        sorted_features = X.columns[sorted_indices]
        sorted_scores = feature_weights[sorted_indices]
        xlabel = "PCA Component Weight"
        title_suffix = "PCA Feature Weights"
        
    else:
        raise ValueError("Invalid method. Choose 'feature_importance' or 'pca'.")
    
    # Create plot
    plt.figure(figsize=(12, 8))
    bars = plt.barh(range(len(sorted_features)), sorted_scores, color='steelblue', alpha=0.8)
    plt.yticks(range(len(sorted_features)), sorted_features)
    plt.gca().invert_yaxis()  # Show the highest-ranked feature at the top
    plt.xlabel(xlabel, fontsize=12)
    plt.ylabel("Features", fontsize=12)
    plt.title(f"Top {len(sorted_features)} Features - {title_suffix}", fontsize=14, pad=20)
    plt.grid(axis='x', linestyle='--', alpha=0.3)
    
    # Add value labels on bars
    for i, (bar, score) in enumerate(zip(bars, sorted_scores)):
        plt.text(score + 0.001, bar.get_y() + bar.get_height()/2, 
                f'{score:.3f}', va='center', fontsize=9)
    
    plt.tight_layout()
    plt.show()

def plot_feature_importance_comparison(X, y_list, target_names, method="feature_importance", n_components=10):
    """
    Compare feature importance across multiple targets in subplots.
    """
    num_plots = len(y_list)
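    # Each target has its own top features, so the y-axes must not be shared;
    # with sharey=True every panel would display the last target's labels.
    fig, axes = plt.subplots(1, num_plots, figsize=(6 * num_plots, 8), sharey=False)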

    if num_plots == 1:
        axes = [axes]

    for idx, (y, target_name, ax) in enumerate(zip(y_list, target_names, axes)):
        model = RandomForestClassifier(random_state=42, n_estimators=100)
        
        if method == "feature_importance":
            model.fit(X, y)
            importance_scores = model.feature_importances_
            sorted_indices = np.argsort(importance_scores)[::-1][:n_components]
            sorted_features = X.columns[sorted_indices]
            sorted_scores = importance_scores[sorted_indices]
            xlabel = "Importance"
            
        elif method == "pca":
            pca = PCA(n_components=min(n_components, X.shape[1]))
            pca.fit(X)
            feature_weights = np.abs(pca.components_).sum(axis=0)
            sorted_indices = np.argsort(feature_weights)[::-1][:n_components]
            sorted_features = X.columns[sorted_indices]
            sorted_scores = feature_weights[sorted_indices]
            xlabel = "PCA Weight"

        else:
            raise ValueError("Invalid method. Choose 'feature_importance' or 'pca'.")

        # Plot
        bars = ax.barh(range(len(sorted_features)), sorted_scores, color='steelblue', alpha=0.8)
        ax.set_yticks(range(len(sorted_features)))
        ax.set_yticklabels(sorted_features, fontsize=9)
        ax.set_xlabel(xlabel, fontsize=10)
        ax.set_title(f"{target_name}", fontsize=12, pad=10)
        ax.grid(axis='x', linestyle='--', alpha=0.3)
        
        # Add value labels
        for bar, score in zip(bars, sorted_scores):
            ax.text(score + max(sorted_scores) * 0.01, bar.get_y() + bar.get_height()/2, 
                   f'{score:.3f}', va='center', fontsize=8)

    plt.suptitle(f'Feature Importance Comparison - {method.replace("_", " ").title()}', 
                 fontsize=16, y=0.98)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

# Generate feature importance plots
print("\n=== FEATURE IMPORTANCE ANALYSIS ===")

# Overall PCA analysis
print("\nOverall PCA Feature Analysis:")
plot_feature_importance(X_preprocessed, y_classification_processed.iloc[:, 0], 
                       method="pca", n_components=15)

# Individual target analysis
target_names = ["aGVHD", "cGVHD", "Relapse", "Death", "RFS"]
available_targets = []
available_names = []

for i, name in enumerate(target_names):
    if i < len(y_classification_processed.columns):
        available_targets.append(y_classification_processed.iloc[:, i])
        available_names.append(name)

if available_targets:
    print("\nFeature Importance Comparison Across Targets:")
    plot_feature_importance_comparison(
        X_preprocessed, available_targets, available_names, 
        method="feature_importance", n_components=10
    )

## 11. Performance Summary and Best Models

Summarize the best performing models and provide actionable insights.
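
Selecting the best row per target with `idxmax` returns the first maximum, so ties in accuracy are resolved by row order. The sketch below shows one way to make the tie-break explicit by sorting on accuracy and then F1. It is a minimal illustration: the helper `best_per_target` and the method names `method_a` and `method_b` are hypothetical, and the numbers are synthetic placeholders, not results from this analysis.

```python
import pandas as pd

# Synthetic placeholder values for illustration only (not study results)
toy_summary = pd.DataFrame({
    "Target": ["Relapse", "Relapse", "Death", "Death"],
    "Feature_Selection": ["method_a", "method_b", "method_a", "method_b"],
    "Model": ["Random Forest", "SVM", "Random Forest", "SVM"],
    "Accuracy": [0.70, 0.70, 0.65, 0.60],
    "F1_Score": [0.55, 0.60, 0.50, 0.58],
})

def best_per_target(summary, primary="Accuracy", tiebreak="F1_Score"):
    """Return one row per target: highest `primary`, ties broken by `tiebreak`."""
    ranked = summary.sort_values([primary, tiebreak], ascending=False)
    # After sorting, the first row seen for each target is its best row
    return ranked.drop_duplicates("Target").set_index("Target")

best = best_per_target(toy_summary)
```

Here the two Relapse rows tie on accuracy, so the row with the higher F1 score is kept.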

In [None]:
def summarize_best_models(classification_results):
    """
    Create a summary table of the best performing models for each target.
    """
    summary_data = []
    
    for method, method_results in classification_results.items():
        for target, target_results in method_results.items():
            for model, model_results in target_results.items():
                summary_data.append({
                    'Feature_Selection': method,
                    'Target': target,
                    'Model': model,
                    'Accuracy': model_results['accuracy'],
                    'Accuracy_Std': model_results['accuracy_std'],
                    'F1_Score': model_results['f1_score'],
                    'F1_Std': model_results['f1_std'],
                    'N_Features': model_results['n_features']
                })
    
    summary_df = pd.DataFrame(summary_data)
    
    # Find best model for each target
    print("=== BEST PERFORMING MODELS BY TARGET ===")
    print()
    
    for target in summary_df['Target'].unique():
        target_data = summary_df[summary_df['Target'] == target]
        best_model = target_data.loc[target_data['Accuracy'].idxmax()]
        
        print(f"🎯 {target}:")
        print(f"   Best Model: {best_model['Model']}")
        print(f"   Feature Selection: {best_model['Feature_Selection']}")
        print(f"   Accuracy: {best_model['Accuracy']:.3f} (±{best_model['Accuracy_Std']:.3f})")
        print(f"   F1-Score: {best_model['F1_Score']:.3f} (±{best_model['F1_Std']:.3f})")
        print(f"   Features Used: {best_model['N_Features']}")
        print()
    
    # Create summary table
    print("\n=== COMPLETE RESULTS SUMMARY ===")
    summary_pivot = summary_df.pivot_table(
        index=['Target', 'Feature_Selection'], 
        columns='Model', 
        values='Accuracy', 
        aggfunc='mean'
    ).round(3)
    
    print(summary_pivot)
    
    return summary_df

def generate_insights(classification_results, summary_df):
    """
    Generate actionable insights from the model results.
    """
    print("\n" + "="*60)
    print("🔍 KEY INSIGHTS AND RECOMMENDATIONS")
    print("="*60)
    
    # Best overall feature selection method
    method_performance = summary_df.groupby('Feature_Selection')['Accuracy'].mean().sort_values(ascending=False)
    best_method = method_performance.index[0]
    print(f"\n1. 📊 FEATURE SELECTION PERFORMANCE:")
    print(f"   Best overall method: {best_method} (avg accuracy: {method_performance.iloc[0]:.3f})")
    for method, score in method_performance.items():
        print(f"   {method}: {score:.3f}")
    
    # Best overall model
    model_performance = summary_df.groupby('Model')['Accuracy'].mean().sort_values(ascending=False)
    best_model = model_performance.index[0]
    print(f"\n2. 🤖 MODEL PERFORMANCE:")
    print(f"   Best overall model: {best_model} (avg accuracy: {model_performance.iloc[0]:.3f})")
    for model, score in model_performance.items():
        print(f"   {model}: {score:.3f}")
    
    # Target difficulty analysis
    target_performance = summary_df.groupby('Target')['Accuracy'].mean().sort_values(ascending=False)
    print(f"\n3. 🎯 TARGET PREDICTION DIFFICULTY:")
    print("   From easiest to hardest to predict:")
    for i, (target, score) in enumerate(target_performance.items(), 1):
        difficulty = "Easy" if score > 0.8 else "Moderate" if score > 0.7 else "Challenging"
        print(f"   {i}. {target}: {score:.3f} ({difficulty})")
    
    # Feature efficiency analysis
    feature_efficiency = summary_df.groupby('N_Features')['Accuracy'].mean().sort_values(ascending=False)
    print(f"\n4. 📈 FEATURE EFFICIENCY:")
    print("   Performance by number of features:")
    for n_features, accuracy in feature_efficiency.head().items():
        print(f"   {n_features} features: {accuracy:.3f} accuracy")
    
    # Clinical recommendations
    print(f"\n5. 🏥 CLINICAL RECOMMENDATIONS:")
    print(f"   • Focus on {target_performance.index[0]} prediction (highest accuracy: {target_performance.iloc[0]:.3f})")
    print(f"   • Use {best_method.replace('_', ' ')} for feature selection")
    print(f"   • {best_model} has the highest mean accuracy across targets")
    print(f"   • Consider ensemble methods combining top performers")
    
    if target_performance.iloc[-1] < 0.7:
        hardest_target = target_performance.index[-1]
        print(f"   • {hardest_target} prediction needs additional biomarkers or longer follow-up")

# Generate comprehensive analysis
if classification_results_all:
    summary_df = summarize_best_models(classification_results_all)
    generate_insights(classification_results_all, summary_df)
else:
    print("No classification results available for analysis.")
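
Fixed accuracy thresholds such as 0.7 and 0.8 are easier to interpret next to the accuracy of always predicting the most frequent class, which can already be high when an outcome is imbalanced. The following is a minimal sketch under that assumption; the helpers `majority_baseline` and `accuracy_lift` are hypothetical and the labels are synthetic.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()

def accuracy_lift(accuracy, y):
    """Difference between a model's accuracy and the majority-class baseline."""
    return accuracy - majority_baseline(y)

# Synthetic labels: 8 negatives and 2 positives (illustrative, not study data)
y_demo = [0] * 8 + [1] * 2
baseline = majority_baseline(y_demo)   # 8 of 10 labels are the majority class
lift = accuracy_lift(0.82, y_demo)     # improvement over that baseline
```

A model scoring 0.82 on these labels would fall in the "Easy" band, yet it improves on the baseline by only about two percentage points.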

## 12. Export Results and Models

Save the best performing models and create exportable results summaries.
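
A saved model is only reusable if new data is presented with the same columns, in the same order, as the matrix it was trained on. The sketch below assumes the saved bundle also records the selected column names under a hypothetical `feature_names` key; the helper `align_features` and the column names in the example are likewise illustrative and not part of the notebook.

```python
import pandas as pd

def align_features(X_new, feature_names):
    """Reorder X_new to the training column order; fail loudly on missing columns."""
    missing = [name for name in feature_names if name not in X_new.columns]
    if missing:
        raise KeyError(f"Missing features required by the model: {missing}")
    # Selecting by an ordered list drops extra columns and fixes the order
    return X_new.loc[:, list(feature_names)]

# Hypothetical column names and values, used only to exercise the helper
X_new = pd.DataFrame({"age": [54], "cd3_day60": [91.0], "cd3_day30": [85.0]})
X_aligned = align_features(X_new, ["cd3_day30", "cd3_day60"])
```

Raising on missing columns, rather than filling them silently, keeps a mismatch between training and prediction data visible.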

In [None]:
import joblib
import os
from datetime import datetime

def save_best_models(classification_results, X_preprocessed, y_classification_processed, save_dir="models"):
    """
    Train and save the best performing models for each target.
    """
    # Create models directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)
    
    saved_models = {}
    
    print(f"\n=== SAVING BEST MODELS TO {save_dir}/ ===")
    
    # Pick the best (method, model) pair per target across ALL feature selection
    # methods. Choosing per method would let later methods overwrite the file
    # already saved for the same target.
    best_by_target = {}
    for method, method_results in classification_results.items():
        for target, target_results in method_results.items():
            for model_name, model_results in target_results.items():
                current = best_by_target.get(target)
                if current is None or model_results['accuracy'] > current['accuracy']:
                    best_by_target[target] = {
                        'model_name': model_name,
                        'method': method,
                        'params': model_results['best_model_params'],
                        'accuracy': model_results['accuracy'],
                        'n_features': model_results['n_features']
                    }
    
    for target, best_model_info in best_by_target.items():
        method = best_model_info['method']
        
        # Look up the target column by name when possible; otherwise fall back
        # to the target's position in the results dict (assumed to match the
        # column order of y_classification_processed).
        if target in y_classification_processed.columns:
            y_target = y_classification_processed[target]
        else:
            target_idx = list(classification_results[method].keys()).index(target)
            y_target = y_classification_processed.iloc[:, target_idx]
        
        # Apply the same feature selection used during evaluation
        X_selected = optimize_feature_selection(
            X_preprocessed, y_target, task="classification", method=method
        )
        
        # Create the model and train it on the full data
        if best_model_info['model_name'] == 'Random Forest':
            model = RandomForestClassifier(random_state=42, **best_model_info['params'])
        elif best_model_info['model_name'] == 'SVM':
            model = SVC(random_state=42, **best_model_info['params'])
        else:  # Naive Bayes
            model = GaussianNB()
        
        model.fit(X_selected, y_target)
        
        # Save the model together with its metadata
        model_filename = f"{save_dir}/best_{target.lower().replace(' ', '_')}_model.joblib"
        joblib.dump({
            'model': model,
            'feature_selection_method': method,
            'model_name': best_model_info['model_name'],
            'accuracy': best_model_info['accuracy'],
            'n_features': best_model_info['n_features'],
            'target': target,
            'trained_on': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        }, model_filename)
        
        saved_models[target] = {
            'filename': model_filename,
            'info': best_model_info
        }
        
        print(f"✅ {target}: {best_model_info['model_name']} (acc: {best_model_info['accuracy']:.3f})")
    
    return saved_models

def export_results_summary(summary_df, classification_results, filename="aml_mds_results_summary.xlsx"):
    """
    Export comprehensive results to Excel file.
    """
    print(f"\n=== EXPORTING RESULTS TO {filename} ===")
    
    with pd.ExcelWriter(filename, engine='openpyxl') as writer:
        # Summary sheet
        summary_df.to_excel(writer, sheet_name='Summary', index=False)
        
        # Best models sheet
        best_models = []
        for target in summary_df['Target'].unique():
            target_data = summary_df[summary_df['Target'] == target]
            best_model = target_data.loc[target_data['Accuracy'].idxmax()]
            best_models.append(best_model)
        
        best_models_df = pd.DataFrame(best_models)
        best_models_df.to_excel(writer, sheet_name='Best_Models', index=False)
        
        # Detailed results for each method
        for method, method_results in classification_results.items():
            method_data = []
            for target, target_results in method_results.items():
                for model, model_results in target_results.items():
                    method_data.append({
                        'Target': target,
                        'Model': model,
                        'Accuracy': model_results['accuracy'],
                        'Accuracy_Std': model_results['accuracy_std'],
                        'F1_Score': model_results['f1_score'],
                        'F1_Std': model_results['f1_std'],
                        'N_Features': model_results['n_features']
                    })
            
            if method_data:
                method_df = pd.DataFrame(method_data)
                sheet_name = method.replace('_', ' ').title()[:31]  # Excel sheet name limit
                method_df.to_excel(writer, sheet_name=sheet_name, index=False)
    
    print(f"✅ Results exported successfully!")

# Save models and export results
if classification_results_all:
    # Save best models
    saved_models = save_best_models(
        classification_results_all, X_preprocessed, y_classification_processed
    )
    
    # Export results summary
    if 'summary_df' in locals():
        export_results_summary(summary_df, classification_results_all)
    
    print(f"\n🎉 Analysis Complete! 🎉")
    print(f"📁 Models saved in: ./models/")
    print(f"📊 Results exported to: aml_mds_results_summary.xlsx")
else:
    print("⚠️ No results available to save.")

## 13. Conclusion and Next Steps

### Key Findings

This analysis evaluated the predictive potential of CD3+ chimerism dynamics for AML/MDS transplant outcomes; how strong that potential is depends on the cross-validated results reported above. The machine learning pipeline implemented here provides a reusable framework for:

1. **Feature Selection Optimization**: Automated selection of optimal features using multiple methods
2. **Multi-target Prediction**: Separate models for each transplant outcome (aGVHD, cGVHD, relapse, death, RFS)
3. **Model Comparison**: Systematic evaluation of different algorithms
4. **Clinical Insights**: Actionable recommendations for transplant outcome prediction

### Next Steps

1. **Validation Studies**: Validate models on independent cohorts
2. **Temporal Analysis**: Incorporate time-series analysis for dynamic prediction
3. **Biomarker Integration**: Include additional biomarkers (MRD, cytokines)
4. **Clinical Implementation**: Develop real-time prediction tools
5. **Prospective Studies**: Design prospective validation studies

### Research Impact

Once validated on independent cohorts, the findings from this analysis could contribute to:
- **Personalized Medicine**: Individual risk stratification
- **Clinical Decision Making**: Evidence-based treatment modifications
- **Resource Allocation**: Optimized monitoring schedules
- **Patient Outcomes**: Improved survival and quality of life

---

*This notebook provides a comprehensive, modular framework for analyzing transplant outcome data. All functions are reusable and can be adapted for different datasets or research questions.*