# Health Risk Assessment

## 1. Project Overview

**Objective**:  
Predict health risks, here the 10-year likelihood of developing coronary heart disease, from demographic, lifestyle, and medical data.

**Scope**:  
This project analyzes health-related data to build predictive models that estimate the likelihood of an individual developing coronary heart disease (CHD) within the next 10 years. The analysis involves preprocessing the data, performing exploratory data analysis (EDA), and comparing three classification models (logistic regression, random forest, and gradient boosting) to determine which best predicts the risk of CHD.

**Dataset**:  
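The dataset (`./datasets/framingham.csv`, 4240 rows and 16 columns) includes health-related features such as:
- **Age** (`age`): The age of the individual.
- **Gender** (`male`): The gender of the individual, stored as an integer column.
- **BMI** (`BMI`): Body Mass Index.
- **Smoking Status** (`currentSmoker`, `cigsPerDay`): Whether the individual currently smokes and the number of cigarettes per day.
- **Blood Pressure** (`sysBP`, `diaBP`): Systolic and diastolic blood pressure.
- **Cholesterol Levels** (`totChol`): Total cholesterol level.
- **Heart Rate** (`heartRate`): Resting heart rate.
- **Glucose** (`glucose`): Blood glucose level.
- **Education** (`education`): Level of education.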
- **BPMeds**: Use of blood pressure medication.
- **TenYearCHD**: Target variable indicating whether the individual has a 10-year risk of coronary heart disease.
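
Because `TenYearCHD` is the prediction target, it can help to check how its classes are distributed before modeling: if one class dominates, a model that always predicts that class already reaches a high accuracy. The following is a minimal sketch on made-up labels; the toy DataFrame and the helper name `target_balance` are assumptions for illustration and are not part of the project.

```python
import pandas as pd


def target_balance(df, target_col):
    """Return the share of rows in each class of the target column."""
    return df[target_col].value_counts(normalize=True).sort_index()


# Toy labels for illustration only (NOT the real data): 8 negatives, 2 positives.
toy = pd.DataFrame({"TenYearCHD": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

shares = target_balance(toy, "TenYearCHD")
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = shares.max()

print(shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Comparing each model's accuracy against this baseline, alongside recall and ROC AUC, gives a fairer picture than accuracy alone.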

## 2. Imports

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

## 3. Pipeline

In [2]:
def load_data(filepath):
    """Load the dataset from the given filepath."""
    return pd.read_csv(filepath)

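def inspect_data(df):
    """Print summary information and return the percentage of missing values per column."""
    print(df.info())  # info() prints its own report and returns None, so "None" is printed after it
    print(df.describe())
    return df.isna().mean() * 100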

def impute_missing_data(df):
    """Fill missing values: median for continuous columns, mode for coded categories."""
    median_columns = ['cigsPerDay', 'totChol', 'BMI', 'heartRate', 'glucose']
    mode_columns = ['education', 'BPMeds']

    df = df.copy()  # work on a copy so the caller's DataFrame is not modified

    for column in median_columns:
        df[column] = df[column].fillna(df[column].median())

    for column in mode_columns:
        df[column] = df[column].fillna(df[column].mode()[0])

    return df

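def scale_features(df, target_col):
    """Standardize every feature column; the target column is left unscaled."""
    scaler = StandardScaler()
    features = df.drop(target_col, axis=1)
    # fit_transform learns the mean and standard deviation from every row of df.
    scaled_features = scaler.fit_transform(features)
    # Reuse the original index so the target column aligns row by row.
    scaled_df = pd.DataFrame(scaled_features, columns=features.columns, index=features.index)
    scaled_df[target_col] = df[target_col]
    return scaled_df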

def train_models(X_train, y_train):
    """Train multiple models and return them in a dictionary."""
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(random_state=42)
    }
    
    for model in models.values():
        model.fit(X_train, y_train)
    
    return models

def evaluate_models(models, X_test, y_test):
    """Evaluate models and return a DataFrame with the results."""
    model_results = {}
    
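    for model_name, model in models.items():
        y_pred = model.predict(X_test)
        # Probability of the positive class, needed for ROC AUC.
        y_proba = model.predict_proba(X_test)[:, 1]
        model_results[model_name] = {
            'Accuracy': accuracy_score(y_test, y_pred),
            # zero_division=0 reports 0 without a warning when a model predicts no positives.
            'Precision': precision_score(y_test, y_pred, zero_division=0),
            'Recall': recall_score(y_test, y_pred),
            'F1 Score': f1_score(y_test, y_pred),
            'ROC AUC': roc_auc_score(y_test, y_proba)
        }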
    
    return pd.DataFrame(model_results).T

# Pipeline
def main():
    # Load and inspect data
    health = load_data('./datasets/framingham.csv')
    missing_values = inspect_data(health)
    
    # Data cleaning
    health = impute_missing_data(health)
    
    # Feature scaling
    health_scaled = scale_features(health, 'TenYearCHD')
    
    # Split data
    X = health_scaled.drop('TenYearCHD', axis=1)
    y = health_scaled['TenYearCHD']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train and evaluate models
    models = train_models(X_train, y_train)
    results_df = evaluate_models(models, X_test, y_test)
    
    print(results_df)
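
In the pipeline above, `scale_features` fits the `StandardScaler` on all rows before `train_test_split` runs in `main`, and `impute_missing_data` likewise computes medians and modes on the full dataset, so statistics from the eventual test rows influence the preprocessing. One common way to avoid this is to place imputation and scaling inside a scikit-learn `Pipeline`, which fits them on the training split only. The following is a minimal sketch of that idea, not the notebook's method: it uses synthetic data in place of `framingham.csv`, median imputation for every column, and logistic regression only. The helper name `make_leak_free_model` and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_leak_free_model():
    """Hypothetical helper: impute, scale, and classify in one estimator."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, random_state=42)),
    ])


# Synthetic stand-in for the real data (assumption: NOT the Framingham dataset).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "age": rng.integers(30, 70, size=n).astype(float),
    "sysBP": rng.normal(130, 20, size=n),
    "glucose": rng.normal(80, 15, size=n),
})
# Make the label depend on the features so there is something to learn.
signal = (X["age"] - 50) / 10 + (X["sysBP"] - 130) / 20 + rng.normal(0, 1, size=n)
y = (signal > 1).astype(int)
# Remove some glucose values to mimic missing data.
X.loc[rng.choice(n, size=40, replace=False), "glucose"] = np.nan

# Split first, so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_leak_free_model()
model.fit(X_train, y_train)  # imputer and scaler are fit on the training rows only
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"ROC AUC on held-out rows: {auc:.3f}")
```

Because the preprocessing lives inside the estimator, passing the same pipeline to a cross-validation utility refits the imputer and scaler within each fold as well.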


## 4. Execution

In [3]:
if __name__ == "__main__":
    main()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4240 non-null   int64  
 1   age              4240 non-null   int64  
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64  
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64  
 7   prevalentHyp     4240 non-null   int64  
 8   diabetes         4240 non-null   int64  
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  TenYearCHD       4240 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 530.1 KB
None
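
`inspect_data` returns `df.isna().mean() * 100`, the percentage of missing values per column. As a worked check, the same percentages can be derived from the non-null counts in the `df.info()` report above, which lists 4240 rows in total. The sketch below copies only the columns that have missing values from that report; the names `TOTAL_ROWS`, `non_null_counts`, and `missing_pct` are introduced here for illustration.

```python
import pandas as pd

TOTAL_ROWS = 4240  # from "RangeIndex: 4240 entries"

# Non-null counts copied from the df.info() report, for columns with gaps.
non_null_counts = pd.Series({
    "education": 4135,
    "cigsPerDay": 4211,
    "BPMeds": 4187,
    "totChol": 4190,
    "BMI": 4221,
    "heartRate": 4239,
    "glucose": 3852,
})

# Percentage missing = (total rows - non-null rows) / total rows * 100.
missing_pct = ((TOTAL_ROWS - non_null_counts) / TOTAL_ROWS * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```

Glucose stands out at roughly 9% missing (388 of 4240 rows), while every other column is below 2.5%.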
        