# Linear Regression Model - Heart Disease Prediction

This notebook trains and compares linear regression models (ordinary least squares, Ridge, Lasso, and ElasticNet) to predict heart disease prevalence using preprocessed county-level sociodemographic and health data.

## Objectives:
1. **Load Data**: Import preprocessed training and test datasets
2. **Baseline Model**: Train a linear regression model for heart disease prediction
3. **Model Evaluation**: Assess performance using regression metrics
4. **Results Analysis**: Interpret model performance and feature importance

**Target Variable**: Heart disease prevalence (%) across US counties

## 1. Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from scipy.stats import mstats
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

### Library Import Summary

Essential libraries have been imported for:
- **Data Manipulation**: pandas and numpy for data handling
- **Machine Learning**: scikit-learn for linear regression and metrics
- **Visualization**: matplotlib and seaborn (available if needed)
- **Statistical Analysis**: scipy for additional statistical functions

All warnings are suppressed to maintain clean output during model training and evaluation.

### Load Preprocessed Data

We'll load the preprocessed training and test datasets that were created in the EDA notebook. The data has already been cleaned, scaled, and split into appropriate sets ready for machine learning modeling.

In [2]:
# Load the preprocessed heart disease prediction dataset
import os

# Define path to heart disease prediction dataset (relative path)
dataset_dir = '../data/processed/heart_disease_prediction_dataset'

# Load training and test sets (already scaled and ready for modeling)
X_train = pd.read_csv(os.path.join(dataset_dir, 'X_train.csv'))
X_test = pd.read_csv(os.path.join(dataset_dir, 'X_test.csv'))

# Load target variables
y_train = pd.read_csv(os.path.join(dataset_dir, 'y_train.csv')).squeeze()
y_test = pd.read_csv(os.path.join(dataset_dir, 'y_test.csv')).squeeze()

print("Data loaded successfully!")
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Training target range: {y_train.min():.2f}% to {y_train.max():.2f}%")
print(f"Test target range: {y_test.min():.2f}% to {y_test.max():.2f}%")
print(f"Number of features: {X_train.shape[1]}")
print(f"Sample feature names: {list(X_train.columns[:5])}")  # Show first 5 features

Data loaded successfully!
Training set: (2512, 29)
Test set: (628, 29)
Training target range: 3.80% to 15.10%
Test target range: 3.50% to 14.30%
Number of features: 29
Sample feature names: ['young_adults_pct', 'middle_aged_pct', 'older_adults_pct', 'Percent of Population Aged 60+', '% Asian-alone']


### Data Loading Summary

The data has been successfully loaded with the following characteristics:
- Training and test sets follow an 80/20 split (2,512 and 628 rows) and cover similar target ranges
- Features are already scaled and preprocessed
- Target variable represents heart disease prevalence as a percentage
- All data is ready for model training without additional preprocessing
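
The claims in this summary (split proportions, matching columns, scaled features) can be checked directly. The sketch below is one way to do that; `check_preprocessing` and its `tol` threshold are hypothetical helpers, not part of the original notebook, and it assumes the features were standardized to roughly zero mean and unit variance. The demo uses synthetic data with the same shapes as the real splits.

```python
import numpy as np
import pandas as pd


def check_preprocessing(X_train, X_test, tol=0.1):
    """Report split proportions and whether training features look standardized.

    Hypothetical helper: assumes "scaled" means zero mean and unit variance.
    """
    n_train, n_test = len(X_train), len(X_test)
    means = X_train.mean()
    stds = X_train.std(ddof=0)
    return {
        "test_fraction": n_test / (n_train + n_test),
        "same_columns": list(X_train.columns) == list(X_test.columns),
        "missing_values": int(X_train.isna().sum().sum() + X_test.isna().sum().sum()),
        # Columns whose mean or standard deviation is further than tol from 0 / 1
        "not_standardized": [
            col for col in X_train.columns
            if abs(means[col]) > tol or abs(stds[col] - 1.0) > tol
        ],
    }


# Synthetic stand-in data with the same shapes as the notebook's splits
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(29)]
demo_train = pd.DataFrame(rng.standard_normal((2512, 29)), columns=cols)
demo_test = pd.DataFrame(rng.standard_normal((628, 29)), columns=cols)

report = check_preprocessing(demo_train, demo_test)
print(report)
```

With the real data, the call would be `check_preprocessing(X_train, X_test)`. A non-empty `not_standardized` list would suggest that a different scaler was used or that some columns were left unscaled.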

## 2. Linear Regression Models with Regularization

Train and compare Linear Regression, Ridge, Lasso, and ElasticNet models.
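
For reference, these are the objectives that the scikit-learn documentation gives for the four estimators, with the intercept omitted. Here $w$ is the coefficient vector, $n$ the number of training samples, $\alpha$ the `alpha` argument, and $\rho$ the `l1_ratio` argument:

$$
\begin{aligned}
\text{Linear Regression:}\quad & \min_w \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_w \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_w \tfrac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:}\quad & \min_w \tfrac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha\rho \lVert w \rVert_1 + \tfrac{\alpha(1-\rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Because the Lasso and ElasticNet losses are divided by $2n$ while the Ridge loss is not, the same `alpha=1.0` is a much stronger penalty relative to the data-fit term for Lasso and ElasticNet than it is for Ridge.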

In [3]:
# Train multiple regression models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
    'ElasticNet': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

# Calculate performance metrics
def calculate_metrics(y_true, y_pred):
    """Calculate regression metrics"""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return {'R2': r2, 'RMSE': rmse, 'MAE': mae, 'MAPE': mape}

# Train and evaluate all models
results = {}
predictions = {}

for name, model in models.items():
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Store test predictions for every model
    predictions[name] = y_test_pred
    
    # Calculate metrics
    train_metrics = calculate_metrics(y_train, y_train_pred)
    test_metrics = calculate_metrics(y_test, y_test_pred)
    
    results[name] = {
        'Train R2': train_metrics['R2'],
        'Test R2': test_metrics['R2'],
        'Test RMSE': test_metrics['RMSE'],
        'Test MAE': test_metrics['MAE'],
        'Overfitting': train_metrics['R2'] - test_metrics['R2']
    }

# Display results
print("Model Comparison:")
print("-" * 70)
for name, metrics in results.items():
    print(f"{name:15} | R2: {metrics['Test R2']:.4f} | RMSE: {metrics['Test RMSE']:.4f} | Overfitting: {metrics['Overfitting']:.4f}")

# Find best model
best_model_name = max(results.keys(), key=lambda k: results[k]['Test R2'])
print(f"\nBest Model: {best_model_name} (R² = {results[best_model_name]['Test R2']:.4f})")

# Store best model for feature analysis
best_model = models[best_model_name]

Model Comparison:
----------------------------------------------------------------------
Linear Regression | R2: 0.9668 | RMSE: 0.3271 | Overfitting: 0.0030
Ridge           | R2: 0.9672 | RMSE: 0.3251 | Overfitting: 0.0025
Lasso           | R2: 0.5398 | RMSE: 1.2184 | Overfitting: -0.0047
ElasticNet      | R2: 0.7694 | RMSE: 0.8626 | Overfitting: -0.0033

Best Model: Ridge (R² = 0.9672)


### Regularization Models Performance Analysis

Results show clear differences between regularization techniques:

- **Ridge (best)**: R² = 0.9672, the highest test score, with a train-test R² gap of 0.0025
- **Linear Regression**: R² = 0.9668, nearly identical, with a slightly larger gap (0.0030)
- **ElasticNet**: R² = 0.7694; the negative gap (-0.0033) points to underfitting rather than good generalization
- **Lasso**: R² = 0.5398; at `alpha=1.0` the L1 penalty likely shrinks most coefficients toward zero, so the model underfits

**Ridge regression** scored highest, but its margin over unregularized Linear Regression is only 0.0004 in R², so regularization adds little on this dataset. An R² of 0.9672 means the model explains about 96.7% of the variance in heart disease prevalence on the test set. The `alpha` values were left at 1.0 rather than tuned, so the Lasso and ElasticNet scores reflect that setting and not the best these methods can do.
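
Since every model above uses a fixed `alpha=1.0`, one natural follow-up is to pick the penalty strength by cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_regression` in place of the county dataset, and the `alphas` grid is an assumption, not a value taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the county data (29 features, as in the notebook)
X, y = make_regression(n_samples=600, n_features=29, n_informative=10,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Assumed search grid: 30 values from 0.001 to 10 on a log scale
alphas = np.logspace(-3, 1, 30)

# RidgeCV defaults to an efficient leave-one-out scheme; LassoCV uses 5-fold CV here
ridge_cv = RidgeCV(alphas=alphas).fit(X_tr, y_tr)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10_000).fit(X_tr, y_tr)

ridge_r2 = ridge_cv.score(X_te, y_te)
lasso_r2 = lasso_cv.score(X_te, y_te)
print(f"Ridge: alpha={ridge_cv.alpha_:.4f}, test R2={ridge_r2:.4f}")
print(f"Lasso: alpha={lasso_cv.alpha_:.4f}, test R2={lasso_r2:.4f}")
```

On the real data, `X_train` and `y_train` would replace the synthetic split, and the scaling step would be skipped because the features are already scaled.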

## 3. Feature Importance Analysis

Understanding which features have the strongest impact on heart disease prevalence is crucial for:
- **Clinical Insights**: Identifying key risk factors for heart disease
- **Policy Making**: Focusing public health interventions on the most influential factors
- **Model Interpretation**: Ensuring the model's predictions align with medical knowledge

The feature importance analysis will rank features by their coefficient magnitude, showing both positive and negative predictors of heart disease prevalence.

### Model Analysis

The Ridge model performs almost identically on the training and test sets (R² gap of 0.0025), which indicates that it generalizes well without overfitting. The cell below ranks its features by absolute coefficient.

In [4]:
# Feature importance analysis using best model
feature_names = X_train.columns
coefficients = best_model.coef_

# Create feature importance dataframe
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients,
    'Abs_Coefficient': np.abs(coefficients)
}).sort_values('Abs_Coefficient', ascending=False)

print(f"Feature Importance Analysis ({best_model_name}):")
print("Top 10 Most Important Features:")
print(feature_importance.head(10)[['Feature', 'Coefficient']].to_string(index=False))

# Basic statistics
print(f"\nModel Summary:")
print(f"Total features: {len(feature_names)}")
print(f"Non-zero features: {sum(np.abs(coefficients) > 1e-6)}")  # For Lasso feature selection
print(f"Features with positive impact: {sum(coefficients > 0)}")
print(f"Features with negative impact: {sum(coefficients < 0)}")
if len(feature_importance) > 0:
    # Rows are sorted by absolute coefficient, so filter by sign before taking the top row
    positive_features = feature_importance[feature_importance['Coefficient'] > 0]
    if len(positive_features) > 0:
        print(f"Strongest positive predictor: {positive_features.iloc[0]['Feature']} ({positive_features.iloc[0]['Coefficient']:.4f})")
    negative_features = feature_importance[feature_importance['Coefficient'] < 0]
    if len(negative_features) > 0:
        print(f"Strongest negative predictor: {negative_features.iloc[0]['Feature']} ({negative_features.iloc[0]['Coefficient']:.4f})")

Feature Importance Analysis (Ridge):
Top 10 Most Important Features:
                  Feature  Coefficient
          COPD_prevalence     0.854022
           CKD_prevalence     0.745077
  anycondition_prevalence     0.256354
         older_adults_pct     0.213297
           PCTPOV017_2018    -0.199034
            Employed_2018     0.190107
           PCTPOVALL_2018     0.174112
         young_adults_pct    -0.165090
      diabetes_prevalence    -0.162272
Civilian_labor_force_2018    -0.161072

Model Summary:
Total features: 29
Non-zero features: 29
Features with positive impact: 17
Features with negative impact: 12
Strongest positive predictor: COPD_prevalence (0.8540)
Strongest negative predictor: PCTPOV017_2018 (-0.1990)

### Feature Importance Interpretation

The feature importance analysis reveals key insights:

- **Positive Coefficients**: Features associated with higher predicted heart disease prevalence as their values increase, holding the other features fixed
- **Negative Coefficients**: Features associated with lower predicted heart disease prevalence as their values increase, holding the other features fixed
- **Magnitude**: Because the features were scaled beforehand, larger absolute coefficient values indicate stronger influence on the prediction

This analysis helps validate that our model is capturing meaningful relationships between sociodemographic factors and heart disease prevalence, which can inform public health strategies.
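
Coefficient signs are harder to interpret when predictors are strongly correlated. In the table above, `PCTPOV017_2018` and `PCTPOVALL_2018` carry opposite signs, as do `Employed_2018` and `Civilian_labor_force_2018`, which may reflect collinearity rather than opposing effects. One common diagnostic is the variance inflation factor (VIF). The sketch below computes it with scikit-learn on synthetic data; `variance_inflation_factors` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R2_j), with R2_j from regressing feature j on the rest."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)


# Synthetic demo: two nearly duplicate columns and one independent column
rng = np.random.default_rng(0)
a = rng.standard_normal(500)
demo = pd.DataFrame({
    "a": a,
    "a_noisy_copy": a + 0.1 * rng.standard_normal(500),
    "independent": rng.standard_normal(500),
})

vif = variance_inflation_factors(demo)
print(vif)
```

With the real data the call would be `variance_inflation_factors(X_train)`. A frequently used rule of thumb treats VIF values above roughly 5 to 10 as a sign that the corresponding coefficients should be interpreted with caution.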

## 4. Final Summary and Conclusions

### Model Performance Summary

Our Ridge regression model has achieved excellent results for predicting heart disease prevalence:

### Key Findings:
- **Strong Predictive Power**: Ridge achieved test R² = 0.9672 (about 96.7% of variance explained)
- **Good Generalization**: The train-test R² gap is small (0.0025), so there is little sign of overfitting
- **Low Error**: Test RMSE = 0.3251 percentage points, against a test target range of 3.50% to 14.30%
- **Regularization Effect**: Ridge (`alpha=1.0`, untuned) scored highest, but only 0.0004 R² above baseline Linear Regression

### Model Comparison Results:
- **Ridge (Best)**: R² = 0.9672, RMSE = 0.3251, Overfitting = 0.0025
- **Linear Regression**: R² = 0.9668, RMSE = 0.3271, Overfitting = 0.0030  
- **ElasticNet**: R² = 0.7694, RMSE = 0.8626
- **Lasso**: R² = 0.5398, RMSE = 1.2184
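
The comparison above ranks models by test R², so the test set also influenced which model was chosen, and the reported score for the winner is slightly optimistic as an estimate of performance on new data. A common alternative is to rank candidates by cross-validated R² on the training data and touch the test set only once at the end. The sketch below shows that pattern on synthetic data; the names `candidates`, `cv_scores`, and `best_by_cv` are illustrative and not from the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training split (29 features, as in the notebook)
X_demo, y_demo = make_regression(n_samples=500, n_features=29, n_informative=10,
                                 noise=5.0, random_state=0)

# Same four estimators and settings as the comparison cell
candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

# Mean 5-fold cross-validated R2, computed on training data only
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

best_by_cv = max(cv_scores, key=cv_scores.get)
for name, score in cv_scores.items():
    print(f"{name:<17} | CV R2: {score:.4f}")
print(f"Selected by cross-validation: {best_by_cv}")
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays, and the selected model would then be refit on the full training set and scored once on `X_test`.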

In [5]:
# Final model summary
print("Heart Disease Prediction Model Summary")
print("=" * 40)
print(f"Best Model: {best_model_name}")
print(f"Training Samples: {X_train.shape[0]}")
print(f"Test Samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")
print(f"Test R² Score: {results[best_model_name]['Test R2']:.4f}")
print(f"Test RMSE: {results[best_model_name]['Test RMSE']:.4f}")
print(f"Overfitting Score: {results[best_model_name]['Overfitting']:.4f}")
print("\nRegularization comparison completed. Best model selected for deployment.")

Heart Disease Prediction Model Summary
========================================
Best Model: Ridge
Training Samples: 2512
Test Samples: 628
Features: 29
Test R² Score: 0.9672
Test RMSE: 0.3251
Overfitting Score: 0.0025

Regularization comparison completed. Best model selected for deployment.


## 5. Model Saving

Save the best-performing model (Ridge) using pickle for future use and deployment.

In [6]:
import pickle
import os

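# Save best model using relative path; create the target directory if it is missing
model_dir = '../models/Linear_regressions_models'
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, 'heart_disease_linear_regression.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(best_model, f)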

print("Best model saved successfully:")
print(f"- {best_model_name} model: {model_path}")
print(f"- Performance: R² = {results[best_model_name]['Test R2']:.4f}")

Best model saved successfully:
- Ridge model: ../models/Linear_regressions_models/heart_disease_linear_regression.pkl
- Performance: R² = 0.9672
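
Before relying on the saved file, it can help to confirm that the model survives a save and load round trip. The sketch below does this with a small synthetic Ridge model and a temporary directory, so it does not touch the real model path; `demo_path` and the demo data are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic model standing in for the notebook's best_model
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
model = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Write to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_ridge.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(model, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
round_trip_ok = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(f"Round trip preserved predictions: {round_trip_ok}")
```

Pickle files should only be loaded from trusted sources, and a pickled estimator is generally only safe to load with the same scikit-learn version that created it.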
