# GMM Health Phenotype Discovery

## MSc Public Health Data Science - SDS6217 Advanced Machine Learning

---

**Student ID:** SDS6/46982/2025  
**Date:** January 2025  
**Institution:** University of Nairobi  

---

### Project Overview

This project applies Gaussian Mixture Models (GMM) to identify latent subpopulations in a synthetic, BRFSS-like public health dataset. It demonstrates how probabilistic clustering can capture population heterogeneity that hard-clustering methods may miss.

### Key Features

- **Probabilistic Clustering**: Captures uncertainty in cluster assignments
- **Hyperparameter Tuning**: Exhaustive grid search over `n_components`, `covariance_type`, `n_init`, and `reg_covar`, ranked by BIC
- **Population Phenotype Discovery**: Identifies distinct health subgroups
- **Academic Rigor**: Comprehensive methodology suitable for MSc-level assessment

### Why GMM for Public Health?

1. **Probabilistic Cluster Assignment**: Unlike K-Means, which forces hard assignments, GMM provides posterior probabilities: each individual receives a probability of belonging to each cluster. This matters for health decisions where uncertainty must be quantified.

2. **Modeling Population Heterogeneity**: Risk factors in health populations are typically distributed continuously rather than in sharply separated groups. GMM models latent subgroups as overlapping distributions instead of imposing hard boundaries between them.

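3. **Flexibility Through Covariance Structures**: scikit-learn's `GaussianMixture` supports four covariance types (`full`, `tied`, `diag`, `spherical`) that model different cluster shapes. Full covariance captures elongated, correlated clusters; tied covariance shares one matrix across all components; diagonal and spherical options use fewer parameters and are cheaper to fit.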

4. **Uncertainty Quantification**: Confidence in cluster assignments can be assessed, which is important for clinical decision-making and resource allocation.
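
The contrast in point 1 can be made concrete with a small sketch. It assumes scikit-learn is installed and uses two overlapping one-dimensional Gaussian groups invented purely for illustration (they are not the project's data). It compares the hard labels from K-Means with the posterior probabilities from a GMM for one point midway between the groups and one point deep inside a group.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy data (illustrative only): two overlapping 1-D groups
rng = np.random.default_rng(0)
X_toy = np.concatenate([rng.normal(0.0, 1.0, 300),
                        rng.normal(3.0, 1.0, 300)]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_toy)

# Row 0: a point midway between the group means; row 1: a point deep in one group
points = np.array([[1.5], [-1.0]])

hard_labels = kmeans.predict(points)      # one integer label per point
posteriors = gmm.predict_proba(points)    # one probability per component

print("K-Means labels:", hard_labels)
print("GMM posteriors:\n", posteriors.round(3))
```

K-Means returns a single label for the midway point with no indication that the call was close, whereas the GMM posterior for that point is spread across both components.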

In [None]:
"""
================================================================================
PROJECT CONFIGURATION - EMBEDDED PATHS AND UTILITIES
================================================================================

This module provides centralized project configuration, path management, and 
utility functions for the GMM Health Phenotype Discovery project.

Author: Cavin Otieno
Student ID: SDS6/46982/2025
MSc Public Health Data Science - SDS6217 Advanced Machine Learning
University of Nairobi
"""

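import os
import sys
import joblib
import json
import warnings
from datetime import datetime

import matplotlib.pyplot as plt  # used by save_fig_multi_format
import pandas as pd              # used by save_data and load_data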

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# =============================================================================
# PROJECT ROOT AND DIRECTORY PATHS
# =============================================================================

# Define project root directory (current working directory).
# Notebooks do not define __file__, so the working directory is used directly.
PROJECT_ROOT = os.getcwd()

# Define main directory paths
DATA_DIR = os.path.join(PROJECT_ROOT, 'data')
OUTPUT_DIR = os.path.join(PROJECT_ROOT, 'output_v2')
MODELS_DIR = os.path.join(PROJECT_ROOT, 'models')
FIGURES_DIR = os.path.join(PROJECT_ROOT, 'figures')

# Define phase-specific subdirectories
PHASE_DIRS = {
    'data': os.path.join(DATA_DIR, 'raw'),
    'processed': os.path.join(DATA_DIR, 'processed'),
    'reports': os.path.join(OUTPUT_DIR, 'reports'),
    'logs': os.path.join(OUTPUT_DIR, 'logs'),
    'plots': os.path.join(FIGURES_DIR, 'plots')
}

# Define model subdirectories
MODEL_SUBDIRS = {
    'gmm_clustering': os.path.join(MODELS_DIR, 'gmm_clustering'),
    'baseline': os.path.join(MODELS_DIR, 'baseline'),
    'tuned': os.path.join(MODELS_DIR, 'tuned'),
    'final': os.path.join(MODELS_DIR, 'final'),
    'comparison': os.path.join(MODELS_DIR, 'comparison')
}

# Define output subdirectories
OUTPUT_SUBDIRS = {
    'metrics': os.path.join(OUTPUT_DIR, 'metrics'),
    'predictions': os.path.join(OUTPUT_DIR, 'predictions'),
    'thresholds': os.path.join(OUTPUT_DIR, 'thresholds'),
    'fairness': os.path.join(OUTPUT_DIR, 'fairness'),
    'validation': os.path.join(OUTPUT_DIR, 'validation'),
    'cluster_profiles': os.path.join(OUTPUT_DIR, 'cluster_profiles')
}

# =============================================================================
# UTILITY FUNCTIONS
# =============================================================================

def create_directory_structure():
    """Create all project directories if they don't exist."""
    all_dirs = [
        PROJECT_ROOT, DATA_DIR, OUTPUT_DIR, MODELS_DIR, FIGURES_DIR,
        *PHASE_DIRS.values(), *MODEL_SUBDIRS.values(), *OUTPUT_SUBDIRS.values()
    ]
    created = []
    for dir_path in all_dirs:
        if dir_path and not os.path.exists(dir_path):
            os.makedirs(dir_path, exist_ok=True)
            created.append(dir_path)
    if created:
        print(f"Created {len(created)} directory(ies)")
    return created

def save_fig(figure, filename, subdir=None, formats=('png', 'pdf', 'svg')):
    """Save a matplotlib figure in multiple formats."""
    save_dir = FIGURES_DIR
    if subdir:
        save_dir = os.path.join(FIGURES_DIR, subdir)
        os.makedirs(save_dir, exist_ok=True)
    saved_files = []
    for fmt in formats:
        filepath = os.path.join(save_dir, f"{filename}.{fmt}")
        figure.savefig(filepath, dpi=300, bbox_inches='tight', format=fmt)
        saved_files.append(filepath)
    return saved_files

def save_fig_multi_format(filename, figure=None, subdir=None,
                          dpi=300, bbox_inches='tight',
                          formats=('png', 'pdf', 'svg')):
    """Save figure in multiple formats with consistent naming."""
    if figure is None:
        figure = plt.gcf()
    save_dir = FIGURES_DIR
    if subdir:
        save_dir = os.path.join(FIGURES_DIR, subdir)
        os.makedirs(save_dir, exist_ok=True)
    saved = []
    for fmt in formats:
        filepath = os.path.join(save_dir, f"{filename}.{fmt}")
        figure.savefig(filepath, dpi=dpi, bbox_inches=bbox_inches, format=fmt)
        saved.append(filepath)
    return saved

def save_model(model, filename, subdir=None, model_type=None):
    """Save a trained model using joblib.
    
    Parameters:
    -----------
    model : sklearn model
        The trained model to save
    filename : str
        The filename for the model (without extension)
    subdir : str, optional
        Subdirectory within MODELS_DIR to save to
    model_type : str, optional
        Alias for subdir - category of model (e.g., 'tuned', 'final')
    """
    directory = subdir if subdir else model_type
    if directory and directory in MODEL_SUBDIRS:
        save_dir = MODEL_SUBDIRS[directory]
    else:
        save_dir = MODELS_DIR
    os.makedirs(save_dir, exist_ok=True)
    filepath = os.path.join(save_dir, f"{filename}.joblib")
    joblib.dump(model, filepath)
    return filepath

def load_model(filepath):
    """Load a trained model using joblib."""
    return joblib.load(filepath)

def save_data(data, filename, subdir=None, fmt='csv'):
    """Save data (DataFrame or array) to file."""
    if subdir and subdir in OUTPUT_SUBDIRS:
        save_dir = OUTPUT_SUBDIRS[subdir]
    else:
        save_dir = OUTPUT_DIR
    os.makedirs(save_dir, exist_ok=True)
    filepath = os.path.join(save_dir, f"{filename}.{fmt}")
    if fmt == 'csv':
        if hasattr(data, 'to_csv'):
            data.to_csv(filepath, index=False)
        else:
            pd.DataFrame(data).to_csv(filepath, index=False)
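    elif fmt == 'json':
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
    else:
        # Fail loudly instead of returning a path to a file that was never written
        raise ValueError(f"Unsupported format: {fmt}")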
    return filepath

def load_data(filepath):
    """Load data from file."""
    if filepath.endswith('.csv'):
        return pd.read_csv(filepath)
    elif filepath.endswith('.json'):
        with open(filepath, 'r') as f:
            return json.load(f)
    else:
        raise ValueError(f"Unsupported file format: {filepath}")

def get_data_path(filename):
    """Get the full path to a data file in the raw data directory."""
    return os.path.join(PHASE_DIRS['data'], filename)

def display_configuration():
    """Display current project configuration."""
    config = {
        'Project Root': PROJECT_ROOT,
        'Data Directory': DATA_DIR,
        'Output Directory': OUTPUT_DIR,
        'Models Directory': MODELS_DIR,
        'Figures Directory': FIGURES_DIR,
        'Raw Data Path': PHASE_DIRS['data'],
        'Processed Data Path': PHASE_DIRS['processed'],
        'Timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }
    
    print("=" * 60)
    print("PROJECT CONFIGURATION")
    print("=" * 60)
    for key, value in config.items():
        print(f"  {key}: {value}")
    print("=" * 60)

# Create all directories on import
created_dirs = create_directory_structure()

# Display configuration
display_configuration()

print("\nProject configuration loaded successfully!")
print(f"Raw data directory: {PHASE_DIRS['data']}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Models directory: {MODELS_DIR}")
print(f"Figures directory: {FIGURES_DIR}")

In [None]:
# =============================================================================
# PHASE 1: LIBRARY IMPORTS AND ENVIRONMENT SETUP
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
import os
from datetime import datetime

# Set random seed for reproducibility
np.random.seed(42)

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set matplotlib style for academic publications
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Print versions for reproducibility
print("=" * 70)
print("ENVIRONMENT AND VERSION INFORMATION")
print("=" * 70)
print(f"Python Version: {sys.version}")
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Matplotlib Version: {plt.matplotlib.__version__}")
print(f"Seaborn Version: {sns.__version__}")
print("=" * 70)

In [None]:
# =============================================================================
# PHASE 2: DATA ACQUISITION AND DOWNLOAD
# =============================================================================

"""
Dataset: Behavioral Risk Factor Surveillance System (BRFSS) 2023
Source: CDC BRFSS Annual Data
URL: https://www.cdc.gov/brfss/annual_data/annual_2023.html

The BRFSS is the nation's premier system of health-related telephone surveys 
that collect state data about U.S. residents regarding their health-related 
risk behaviors, chronic health conditions, and use of preventive services.
"""

import urllib.request
import zipfile

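def download_brfss_data(output_path, year=2023):
    """
    Placeholder for BRFSS data acquisition; no network request is made.

    The function creates the target directory and reports where the raw data
    will live. This project uses a synthetic BRFSS-like dataset instead.
    
    Parameters:
    -----------
    output_path : str
        Directory in which the raw data is stored
    year : int
        BRFSS survey year (used only in the printed message)
    
    Returns:
    --------
    str : Path to the raw data directory
    """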
    os.makedirs(output_path, exist_ok=True)
    
    # For this project, we'll use a synthetic dataset that simulates BRFSS
    # The synthetic data is created to have realistic distributions
    # In production, you would download from: https://www.cdc.gov/brfss/annual_data/
    
    print(f"BRFSS {year} data would be downloaded from CDC")
    print(f"For this demonstration, we use a synthetic BRFSS-like dataset")
    print(f"Raw data directory: {output_path}")
    
    return output_path

# Set up the raw data directory (no download is performed)
raw_data_path = PHASE_DIRS['data']
download_brfss_data(raw_data_path, year=2023)

print(f"\n[INFO] Raw data will be saved to: {raw_data_path}")

In [None]:
def generate_synthetic_brfss_data(n_samples=3000):
    """
    Generate synthetic public health dataset simulating BRFSS data.
    This creates realistic health indicators with known cluster structure.
    
    Parameters:
    -----------
    n_samples : int
        Number of samples to generate
    
    Returns:
    --------
    pd.DataFrame : Synthetic health dataset
    """
    print("Generating synthetic BRFSS-like health dataset...")
    
    # Define cluster parameters to simulate distinct health phenotypes
    # Cluster 1: Health-conscious individuals (35% of population)
    n_cluster1 = int(n_samples * 0.35)
    cluster1 = {
        'age': np.random.normal(45, 12, n_cluster1),
        'bmi': np.random.normal(23, 3, n_cluster1),
        'physical_activity': np.random.normal(5, 1.5, n_cluster1),
        'sleep_hours': np.random.normal(7.5, 0.8, n_cluster1),
        'fruit_vegetable_intake': np.random.normal(4, 1, n_cluster1),
        'alcohol_consumption': np.random.normal(2, 2, n_cluster1),
        'smoking_status': np.random.binomial(1, 0.1, n_cluster1),
        'healthcare_visits': np.random.poisson(2, n_cluster1),
        'chronic_conditions': np.random.poisson(0.3, n_cluster1),
        'mental_health_days': np.random.normal(3, 2, n_cluster1),
        'stress_level': np.random.normal(4, 1.5, n_cluster1),
        'blood_pressure_systolic': np.random.normal(118, 10, n_cluster1),
        'cholesterol_total': np.random.normal(185, 25, n_cluster1),
        'glucose_level': np.random.normal(95, 10, n_cluster1)
    }
    
    # Cluster 2: Moderate risk individuals (40% of population)
    n_cluster2 = int(n_samples * 0.40)
    cluster2 = {
        'age': np.random.normal(52, 10, n_cluster2),
        'bmi': np.random.normal(27, 4, n_cluster2),
        'physical_activity': np.random.normal(2.5, 1.5, n_cluster2),
        'sleep_hours': np.random.normal(6.5, 1, n_cluster2),
        'fruit_vegetable_intake': np.random.normal(2, 0.8, n_cluster2),
        'alcohol_consumption': np.random.normal(6, 4, n_cluster2),
        'smoking_status': np.random.binomial(1, 0.25, n_cluster2),
        'healthcare_visits': np.random.poisson(4, n_cluster2),
        'chronic_conditions': np.random.poisson(1.2, n_cluster2),
        'mental_health_days': np.random.normal(8, 3, n_cluster2),
        'stress_level': np.random.normal(6, 2, n_cluster2),
        'blood_pressure_systolic': np.random.normal(128, 12, n_cluster2),
        'cholesterol_total': np.random.normal(210, 30, n_cluster2),
        'glucose_level': np.random.normal(105, 15, n_cluster2)
    }
    
    # Cluster 3: High-risk individuals (25% of population)
    n_cluster3 = n_samples - n_cluster1 - n_cluster2
    cluster3 = {
        'age': np.random.normal(58, 8, n_cluster3),
        'bmi': np.random.normal(32, 5, n_cluster3),
        'physical_activity': np.random.normal(1, 1, n_cluster3),
        'sleep_hours': np.random.normal(5.5, 1.5, n_cluster3),
        'fruit_vegetable_intake': np.random.normal(1, 0.5, n_cluster3),
        'alcohol_consumption': np.random.normal(10, 5, n_cluster3),
        'smoking_status': np.random.binomial(1, 0.45, n_cluster3),
        'healthcare_visits': np.random.poisson(7, n_cluster3),
        'chronic_conditions': np.random.poisson(2.5, n_cluster3),
        'mental_health_days': np.random.normal(15, 4, n_cluster3),
        'stress_level': np.random.normal(8, 1.5, n_cluster3),
        'blood_pressure_systolic': np.random.normal(145, 15, n_cluster3),
        'cholesterol_total': np.random.normal(240, 35, n_cluster3),
        'glucose_level': np.random.normal(120, 20, n_cluster3)
    }
    
    # Combine clusters with noise to create overlap
    df1 = pd.DataFrame(cluster1)
    df2 = pd.DataFrame(cluster2)
    df3 = pd.DataFrame(cluster3)
    
    # Concatenate all clusters
    df = pd.concat([df1, df2, df3], ignore_index=True)
    
    # Add noise to create realistic overlap between clusters
    noise_level = 0.3
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df[col] = df[col] + np.random.normal(0, df[col].std() * noise_level, len(df))
    
    # Clip unrealistic values. The added noise can push counts off the integer
    # grid and consumption below zero, so counts are re-rounded and floored at 0.
    df['age'] = df['age'].clip(18, 100)
    df['bmi'] = df['bmi'].clip(15, 55)
    df['physical_activity'] = df['physical_activity'].clip(0, 7)
    df['sleep_hours'] = df['sleep_hours'].clip(3, 12)
    df['fruit_vegetable_intake'] = df['fruit_vegetable_intake'].clip(0, 10)
    df['alcohol_consumption'] = df['alcohol_consumption'].clip(lower=0)
    df['smoking_status'] = (df['smoking_status'] > 0.5).astype(int)
    df['healthcare_visits'] = df['healthcare_visits'].round().clip(0, 30).astype(int)
    df['chronic_conditions'] = df['chronic_conditions'].round().clip(0, 10).astype(int)
    df['mental_health_days'] = df['mental_health_days'].clip(0, 30)
    df['stress_level'] = df['stress_level'].clip(1, 10)
    
    # Add demographic variables
    df['sex'] = np.random.binomial(1, 0.52, len(df))
    df['education'] = np.random.choice([1, 2, 3, 4], len(df), p=[0.1, 0.3, 0.4, 0.2])
    df['income'] = np.random.choice([1, 2, 3, 4, 5], len(df), p=[0.15, 0.25, 0.3, 0.2, 0.1])
    df['race'] = np.random.choice([1, 2, 3, 4, 5], len(df), p=[0.6, 0.13, 0.17, 0.08, 0.02])
    
    # Add true cluster labels for validation (hidden from model)
    true_labels = np.concatenate([
        np.zeros(n_cluster1),
        np.ones(n_cluster2),
        np.full(n_cluster3, 2)
    ])
    df['true_cluster'] = true_labels.astype(int)
    
    # Round numeric values
    numeric_round_cols = ['age', 'bmi', 'physical_activity', 'sleep_hours', 
                          'fruit_vegetable_intake', 'alcohol_consumption',
                          'blood_pressure_systolic', 'cholesterol_total', 
                          'glucose_level', 'stress_level', 'mental_health_days']
    for col in numeric_round_cols:
        df[col] = df[col].round(1)
    
    print(f"[OK] Synthetic dataset generated: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"     Features: {df.shape[1] - 1} (health indicators + demographics) + 1 true-label column")
    
    return df

# Generate the dataset
data = generate_synthetic_brfss_data(n_samples=3000)

# Save raw data
raw_data_file = os.path.join(PHASE_DIRS['data'], 'brfss_synthetic_data.csv')
data.to_csv(raw_data_file, index=False)
print(f"[OK] Raw data saved to: {raw_data_file}")

In [None]:
# =============================================================================
# PHASE 3: DATASET DESCRIPTION AND EXPLORATORY DATA ANALYSIS
# =============================================================================

def describe_dataset(df):
    """Provide comprehensive description of the dataset."""
    print("=" * 70)
    print("DATASET METADATA AND DESCRIPTION")
    print("=" * 70)
    
    print("\n[INFO] BASIC INFORMATION")
    print("-" * 50)
    print(f"Number of observations: {df.shape[0]:,}")
    print(f"Number of variables: {df.shape[1]}")
    print(f"Missing values: {df.isnull().sum().sum():,} ({100*df.isnull().sum().sum()/(df.shape[0]*df.shape[1]):.2f}%)")
    
    print("\n[INFO] VARIABLE LIST")
    print("-" * 50)
    
    # Classify variables by type
    numeric_vars = df.select_dtypes(include=[np.number]).columns.tolist()
    
    print("\nNumeric Variables:")
    for i, col in enumerate(numeric_vars, 1):
        print(f"  {i:2d}. {col:<25} Range: [{df[col].min():.1f}, {df[col].max():.1f}]")
    
    print("\n" + "=" * 70)
    
    return numeric_vars

# Run dataset description
numeric_vars = describe_dataset(data)

# Define key health indicators for analysis
health_indicators = [
    'age', 'bmi', 'physical_activity', 'sleep_hours', 
    'fruit_vegetable_intake', 'alcohol_consumption',
    'healthcare_visits', 'chronic_conditions', 'mental_health_days',
    'stress_level', 'blood_pressure_systolic', 'cholesterol_total', 'glucose_level'
]

# Summary statistics
print("\n[INFO] SUMMARY STATISTICS FOR HEALTH INDICATORS")
print("-" * 70)
summary_stats = data[health_indicators].describe().T
summary_stats['median'] = data[health_indicators].median()
summary_stats['skewness'] = data[health_indicators].skew()
summary_stats['kurtosis'] = data[health_indicators].kurtosis()
print(summary_stats.round(2).to_string())

In [None]:
# =============================================================================
# PHASE 4: DATA PREPROCESSING
# =============================================================================

from sklearn.preprocessing import StandardScaler

# Select features for clustering
feature_columns = [
    'age', 'bmi', 'physical_activity', 'sleep_hours', 
    'fruit_vegetable_intake', 'alcohol_consumption',
    'healthcare_visits', 'chronic_conditions', 'mental_health_days',
    'stress_level', 'blood_pressure_systolic', 'cholesterol_total', 'glucose_level'
]

# Create feature matrix
X = data[feature_columns].copy()

print(f"[INFO] Features selected for clustering: {len(feature_columns)}")
for i, col in enumerate(feature_columns, 1):
    print(f"  {i:2d}. {col}")

# Handle missing values if any
if X.isnull().sum().sum() > 0:
    X = X.fillna(X.median())
    print("\n[OK] Missing values imputed with median values")

# Apply Standard Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=feature_columns)

print("\n[OK] Feature scaling applied using StandardScaler")
print(f"  Scaled data shape: {X_scaled.shape}")

# Save scaler for future use
save_model(scaler, 'standard_scaler', 'gmm_clustering')
print("[OK] Scaler saved to models/gmm_clustering/")
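
Note that the scaler above is fitted on all rows before any train/test split takes place. A common alternative, sketched here under the assumption that held-out rows should not influence preprocessing, fits `StandardScaler` on the training rows only and reuses it for the test rows. The array `X_demo` and the names `demo_scaler`, `X_tr`, and `X_te` are placeholders for illustration, not the project's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(500, 3))  # placeholder features

# Split first, then fit the scaler on the training rows only
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)
demo_scaler = StandardScaler().fit(X_tr)

X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuses the training mean and std

print("Train means:", X_tr_scaled.mean(axis=0).round(3))
print("Test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance by construction; the test columns are close to, but not exactly, standardized, which is the expected behavior when the test set is treated as unseen data.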

In [None]:
# =============================================================================
# PHASE 5: GAUSSIAN MIXTURE MODELS IMPLEMENTATION
# =============================================================================

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Split data for model validation
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)

print(f"[INFO] Data Split:")
print(f"  Training set: {X_train.shape[0]} samples")
print(f"  Test set: {X_test.shape[0]} samples")

# =============================================================================
# HYPERPARAMETER TUNING WITH GRID SEARCH
# =============================================================================

print("\n" + "=" * 70)
print("HYPERPARAMETER TUNING WITH GRID SEARCH")
print("=" * 70)

param_grid = {
    'n_components': [2, 3, 4, 5, 6, 7, 8],
    'covariance_type': ['full', 'tied', 'diag', 'spherical'],
    'n_init': [5, 10, 15],
    'reg_covar': [1e-6, 1e-5, 1e-4]
}

print("\nHyperparameter Grid:")
print(f"  n_components: {param_grid['n_components']}")
print(f"  covariance_type: {param_grid['covariance_type']}")
print(f"  n_init: {param_grid['n_init']}")
print(f"  reg_covar: {param_grid['reg_covar']}")
print(f"\nTotal combinations: {len(param_grid['n_components']) * len(param_grid['covariance_type']) * len(param_grid['n_init']) * len(param_grid['reg_covar'])}")

from itertools import product

def run_grid_search_gmm(X, param_grid):
    """Perform exhaustive grid search for GMM hyperparameters."""
    results = []
    
    # Generate all combinations
    keys = list(param_grid.keys())
    values = [param_grid[k] for k in keys]
    combinations = [dict(zip(keys, v)) for v in product(*values)]
    
    total = len(combinations)
    print(f"\n[INFO] Evaluating {total} model configurations...")
    
    for i, params in enumerate(combinations):
        try:
            gmm = GaussianMixture(
                n_components=params['n_components'],
                covariance_type=params['covariance_type'],
                n_init=params['n_init'],
                reg_covar=params['reg_covar'],
                random_state=42,
                max_iter=200
            )
            
            gmm.fit(X)
            labels = gmm.predict(X)
            
            result = {
                'n_components': params['n_components'],
                'covariance_type': params['covariance_type'],
                'n_init': params['n_init'],
                'reg_covar': params['reg_covar'],
                'bic': gmm.bic(X),
                'aic': gmm.aic(X),
                'log_likelihood': gmm.score(X),  # mean per-sample log-likelihood
                'silhouette': silhouette_score(X, labels),
                'calinski_harabasz': calinski_harabasz_score(X, labels),
                'davies_bouldin': davies_bouldin_score(X, labels),
                'converged': gmm.converged_
            }
            
            results.append(result)
            
            if (i + 1) % 100 == 0:
                print(f"    Progress: {i+1}/{total} ({100*(i+1)/total:.1f}%)")
                
        except Exception as e:
            print(f"    Error with parameters {params}: {e}")
            continue
    
    return pd.DataFrame(results)

# Run grid search
print("\n[INFO] Running Grid Search with BIC optimization...")
grid_results = run_grid_search_gmm(X_train, param_grid)

# Sort by BIC to find best model
grid_results_sorted = grid_results.sort_values('bic').reset_index(drop=True)

print("\n[INFO] TOP 10 MODELS BY BIC (Best to Worst):")
print("-" * 100)
top_models = grid_results_sorted.head(10)[['n_components', 'covariance_type', 'n_init', 
                                            'bic', 'aic', 'silhouette', 'converged']]
print(top_models.to_string(index=False))
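
The ranking above relies on BIC, which penalizes the log-likelihood by the number of free parameters: BIC = -2 ln L + p ln n. The value of p depends on the covariance type, which is why `full` models are penalized more heavily than `diag` or `spherical` ones. The following sketch counts parameters by hand and compares the result with `GaussianMixture.bic`. The helper `n_free_params` and the random data `X_demo` are illustrative assumptions, not part of the project code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def n_free_params(k, d, covariance_type):
    """Count free parameters of a k-component GMM in d dimensions (hypothetical helper)."""
    cov = {
        'full': k * d * (d + 1) // 2,   # one symmetric matrix per component
        'tied': d * (d + 1) // 2,       # one symmetric matrix shared by all
        'diag': k * d,                  # one variance per feature per component
        'spherical': k,                 # one variance per component
    }[covariance_type]
    return cov + k * d + (k - 1)        # covariances + means + mixing weights

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))      # placeholder data
n = X_demo.shape[0]

for cov_type in ['full', 'tied', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X_demo)
    total_loglik = gmm.score(X_demo) * n   # score() returns the per-sample mean
    p = n_free_params(3, 4, cov_type)
    manual_bic = -2 * total_loglik + p * np.log(n)
    print(f"{cov_type:<10} params={p:>3} "
          f"manual BIC={manual_bic:.2f} sklearn BIC={gmm.bic(X_demo):.2f}")
```

For the 13 features used in this notebook, the same count gives 314 free parameters for a 3-component `full` model and 80 for a 3-component `diag` model, which illustrates how strongly the penalty term can differ between covariance types.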

In [None]:
# =============================================================================
# PHASE 6: TRAIN OPTIMAL MODEL
# =============================================================================

# Get best parameters
best_idx = grid_results_sorted.index[0]
best_params = {
    'n_components': int(grid_results_sorted.loc[best_idx, 'n_components']),
    'covariance_type': grid_results_sorted.loc[best_idx, 'covariance_type'],
    'n_init': int(grid_results_sorted.loc[best_idx, 'n_init']),
    'reg_covar': float(grid_results_sorted.loc[best_idx, 'reg_covar'])
}

print("=" * 70)
print("OPTIMAL MODEL CONFIGURATION")
print("=" * 70)
print(f"\n  Number of components: {best_params['n_components']}")
print(f"  Covariance type: {best_params['covariance_type']}")
print(f"  Number of initializations: {best_params['n_init']}")
print(f"  Regularization: {best_params['reg_covar']}")
print(f"\n  BIC Score: {grid_results_sorted.loc[best_idx, 'bic']:.2f}")
print(f"  AIC Score: {grid_results_sorted.loc[best_idx, 'aic']:.2f}")
print(f"  Silhouette Score: {grid_results_sorted.loc[best_idx, 'silhouette']:.4f}")

# Train the optimal model
gmm_optimal = GaussianMixture(
    n_components=best_params['n_components'],
    covariance_type=best_params['covariance_type'],
    n_init=best_params['n_init'],
    reg_covar=best_params['reg_covar'],
    random_state=42,
    max_iter=500
)

gmm_optimal.fit(X_train)

print("\n[OK] Optimal GMM Model Trained Successfully")
print(f"  Convergence: {gmm_optimal.converged_}")
print(f"  Number of iterations: {gmm_optimal.n_iter_}")

# Save the optimal model
model_filepath = save_model(gmm_optimal, 'gmm_optimal_model', 'gmm_clustering')
print(f"[OK] Model saved to: {model_filepath}")

In [None]:
# =============================================================================
# PHASE 7: CLUSTER ANALYSIS AND INTERPRETATION
# =============================================================================

# Get cluster labels for full dataset
full_labels = gmm_optimal.predict(X_scaled)
data['cluster'] = full_labels

cluster_counts = pd.Series(full_labels).value_counts().sort_index()
cluster_proportions = cluster_counts / len(full_labels) * 100

print("=" * 70)
print("CLUSTER DISTRIBUTION")
print("=" * 70)

for cluster, count in cluster_counts.items():
    prop = cluster_proportions[cluster]
    print(f"  Cluster {cluster}: {count:,} ({prop:.1f}%)")

# Calculate cluster profiles
cluster_profiles = data.groupby('cluster')[feature_columns].mean()

print("\n[INFO] CLUSTER PROFILES (Mean Values):")
print("-" * 100)
print(cluster_profiles.round(2).to_string())

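# Save cluster profiles (reset_index keeps the cluster label as a column,
# because save_data writes CSV files without the index)
save_data(cluster_profiles.reset_index(), 'cluster_profiles', 'cluster_profiles')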
print("\n[OK] Cluster profiles saved to output_v2/cluster_profiles/")

In [None]:
# =============================================================================
# PHASE 8: MODEL EVALUATION AND VALIDATION
# =============================================================================

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Evaluate on training and test sets
train_labels = gmm_optimal.predict(X_train)
test_labels = gmm_optimal.predict(X_test)

print("=" * 70)
print("MODEL EVALUATION METRICS")
print("=" * 70)

def evaluate_gmm(X, labels, model):
    """Comprehensive evaluation of GMM model performance."""
    metrics = {}
    metrics['silhouette'] = silhouette_score(X, labels)
    metrics['calinski_harabasz'] = calinski_harabasz_score(X, labels)
    metrics['davies_bouldin'] = davies_bouldin_score(X, labels)
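    # BIC and AIC scale with sample size, so compare them across models fitted
    # to the same data rather than between the training and test columns
    metrics['bic'] = model.bic(X)
    metrics['aic'] = model.aic(X)
    metrics['log_likelihood'] = model.score(X)  # mean per-sample log-likelihood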
    return metrics

train_eval = evaluate_gmm(X_train, train_labels, gmm_optimal)
test_eval = evaluate_gmm(X_test, test_labels, gmm_optimal)

print(f"\n{'Metric':<25} {'Training':>12} {'Test':>12}")
print("-" * 50)
for key in train_eval:
    print(f"{key:<25} {train_eval[key]:>12.4f} {test_eval[key]:>12.4f}")

# External validation against true labels
if 'true_cluster' in data.columns:
    ari = adjusted_rand_score(data['true_cluster'], data['cluster'])
    nmi = normalized_mutual_info_score(data['true_cluster'], data['cluster'])
    
    print("\n[INFO] EXTERNAL VALIDATION (vs True Clusters):")
    print(f"  Adjusted Rand Index (ARI): {ari:.4f}")
    print(f"  Normalized Mutual Information (NMI): {nmi:.4f}")
    
    if ari > 0.7:
        print("  [OK] Strong agreement with true cluster structure")
    elif ari > 0.4:
        print("  [~] Moderate agreement with true cluster structure")
    else:
        print("  [!] Weak agreement with true cluster structure")
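
Cluster indices from a GMM are arbitrary, so cluster 0 in the fitted model need not correspond to `true_cluster` 0. ARI and NMI are invariant to such relabeling, which is why they are used above instead of plain accuracy. A tiny illustration with made-up labels (not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # made-up ground truth
pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])   # same partition, permuted names

accuracy = (true == pred).mean()               # naive label matching
ari_demo = adjusted_rand_score(true, pred)
nmi_demo = normalized_mutual_info_score(true, pred)

print(f"Naive accuracy: {accuracy:.2f}")
print(f"ARI: {ari_demo:.2f}, NMI: {nmi_demo:.2f}")
```

Naive label matching scores this perfect partition as entirely wrong, while ARI and NMI both recognize it as an exact match.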

In [None]:
# =============================================================================
# PHASE 9: PROBABILISTIC MEMBERSHIP ANALYSIS
# =============================================================================

# Get membership probabilities
membership_probs = gmm_optimal.predict_proba(X_scaled)
data_probs = data.copy()

for i in range(best_params['n_components']):
    data_probs[f'prob_cluster_{i}'] = membership_probs[:, i]

print("=" * 70)
print("PROBABILISTIC MEMBERSHIP ANALYSIS")
print("=" * 70)

print("\nMembership Probability Statistics:")
print("-" * 60)

for i in range(best_params['n_components']):
    probs = data_probs[f'prob_cluster_{i}']
    cluster_high_conf = (probs >= 0.8).sum()  # per-cluster count, distinct from the overall high_conf
    print(f"\n  Cluster {i}:")
    print(f"    Mean:   {probs.mean():.4f}")
    print(f"    Std:    {probs.std():.4f}")
    print(f"    High confidence (>=0.8): {cluster_high_conf:,} ({100*cluster_high_conf/len(probs):.1f}%)")

# Certainty analysis
data_probs['max_prob'] = membership_probs.max(axis=1)

high_conf = (data_probs['max_prob'] >= 0.8).sum()
mod_conf = ((data_probs['max_prob'] >= 0.5) & (data_probs['max_prob'] < 0.8)).sum()
low_conf = (data_probs['max_prob'] < 0.5).sum()

print("\n[INFO] Cluster Assignment Certainty:")
print(f"  High Confidence (>=0.8): {high_conf:,} ({100*high_conf/len(data_probs):.1f}%)")
print(f"  Moderate Confidence (0.5-0.8): {mod_conf:,} ({100*mod_conf/len(data_probs):.1f}%)")
print(f"  Low Confidence (<0.5): {low_conf:,} ({100*low_conf/len(data_probs):.1f}%)")
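
The maximum posterior probability used above looks only at the leading component. A complementary summary, offered here as an optional sketch rather than as part of the original analysis, is the Shannon entropy of each row of posterior probabilities: it is 0 for a certain assignment and ln(K) when all K components are equally likely. The function `posterior_entropy` and the matrix `demo_probs` are hypothetical and hand-written for illustration.

```python
import numpy as np

def posterior_entropy(probs, eps=1e-12):
    """Row-wise Shannon entropy (in nats) of a posterior probability matrix."""
    p = np.clip(probs, eps, 1.0)        # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

demo_probs = np.array([
    [1.00, 0.00, 0.00],   # certain assignment
    [0.80, 0.15, 0.05],   # fairly confident
    [1/3, 1/3, 1/3],      # maximally uncertain
])

entropy = posterior_entropy(demo_probs)
normalized = entropy / np.log(demo_probs.shape[1])  # rescale to [0, 1]
print(normalized.round(3))
```

With the fitted model, the same function could be applied to `membership_probs` to rank individuals by how ambiguous their cluster assignment is.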

In [None]:
# =============================================================================
# PHASE 10: CONCLUSIONS AND FUTURE WORK
# =============================================================================

print("=" * 70)
print("CONCLUSIONS")
print("=" * 70)

n_clusters = best_params['n_components']
silhouette_final = silhouette_score(X_scaled, data['cluster'])
bic_final = gmm_optimal.bic(X_scaled)

print(f"""
PROJECT SUMMARY
---------------
This project applied Gaussian Mixture Models (GMM) to identify latent 
subpopulations within a synthetic public health dataset (simulating BRFSS).

KEY FINDINGS:
1. Optimal Number of Clusters: {n_clusters}
   - BIC Score: {bic_final:.2f}
   - Silhouette Score: {silhouette_final:.4f}

2. Cluster Characteristics:
   - Identified {n_clusters} distinct health phenotypes
   - Probabilistic assignments with {100*high_conf/len(data_probs):.1f}% high-confidence
   - Separation between risk profiles should be judged from the silhouette score and ARI

PUBLIC HEALTH IMPLICATIONS:
- The identified clusters represent distinct health phenotypes with different
  risk profiles and intervention needs.
- Probabilistic cluster assignments allow for uncertainty-aware decision making.
- This approach can support targeted intervention design and resource allocation.

LIMITATIONS:
- Synthetic dataset may not fully represent real-world complexity
- External validation with real BRFSS data recommended
- Clinical validation required before operational deployment

FUTURE WORK:
- Compare with real BRFSS 2023 data from CDC
- Extend to longitudinal analysis
- Implement semi-supervised GMM with known outcomes
""")

# Save final configuration
config = {
    'student_id': 'SDS6/46982/2025',
    'course': 'SDS6217 Advanced Machine Learning',
    'institution': 'University of Nairobi',
    'best_params': best_params,
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'metrics': {
        'bic': float(bic_final),
        'silhouette': float(silhouette_final),
        'n_clusters': n_clusters
    }
}

save_data(config, 'project_config', 'metrics', fmt='json')
print("[OK] Configuration saved to output_v2/metrics/")

print("\n" + "=" * 70)
print("PROJECT COMPLETE")
print("=" * 70)
print(f"Student ID: SDS6/46982/2025")
print(f"Course: SDS6217 Advanced Machine Learning")
print(f"Institution: University of Nairobi")
print("=" * 70)

## References

1. McLachlan, G.J., & Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.
2. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
3. CDC. (2023). Behavioral Risk Factor Surveillance System Survey Data.
4. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464.