
# Regularized Regression Analysis of CVD Healthcare Costs
    


This notebook applies three regularized regression models (Lasso, Ridge, and Elastic Net) to cardiovascular disease (CVD) healthcare cost data. These models help with multicollinearity among predictors, and Lasso and Elastic Net can also perform feature selection; both are important considerations for this dataset. The analysis follows up on the initial linear regression.


## 1. Data Loading and Preparation
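We start by loading the dataset and preparing it for analysis. This includes dropping rows with missing values and standardizing the predictors. Scaling is a crucial step for regularized regression because the penalty acts on coefficient magnitudes, which depend on the units of each predictor.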
    

In [1]:

import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
    
# Load the dataset
df = pd.read_csv('CVD_data2.csv')
    
# Select and clean the relevant variables
regression_vars = ['TOTTCHY1', 'TOTTCHY1_rank', 'total_comorbidities', 'AGEY1X', 'PRVEVY1']
df_clean = df[regression_vars].dropna()
    
# Define independent and dependent variables
X = df_clean[['total_comorbidities', 'AGEY1X', 'PRVEVY1']]
y_raw = df_clean['TOTTCHY1']
y_rank = df_clean['TOTTCHY1_rank']
    
# Scale the independent variables
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
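
The cell above fits the scaler on all rows before any train/test split, so the test rows contribute to the scaling statistics. A minimal sketch of one alternative, assuming a scikit-learn `Pipeline` is acceptable here: split first and let the pipeline fit the scaler on the training rows only. The data and the names `X_demo`, `y_demo`, and `pipe` are hypothetical stand-ins, not the CVD file.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three predictors on different scales (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[2.0, 60.0, 0.5], scale=[1.5, 12.0, 0.5], size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(size=500)

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then fits the model
pipe = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
pipe.fit(X_tr, y_tr)

# The scaler's learned means come from the training split, not the full data
train_means = pipe.named_steps["standardscaler"].mean_
print("Scaler means (training rows only):", train_means)
print("Test R^2:", pipe.score(X_te, y_te))
```

Calling `pipe.predict` or `pipe.score` on new rows reuses the training-set means and standard deviations automatically.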
    


## 2. Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) regression adds an L1 penalty on the coefficients. A sufficiently strong penalty shrinks the coefficients of less important features exactly to zero, which is why Lasso can be used for feature selection.
    

In [2]:

# Split data for raw cost prediction
X_train, X_test, y_train_raw, y_test_raw = train_test_split(X_scaled, y_raw, test_size=0.2, random_state=42)
    
# Fit Lasso model on raw costs
lasso_raw = Lasso(alpha=1.0)
lasso_raw.fit(X_train, y_train_raw)
print("Lasso Coefficients (Raw Costs):", lasso_raw.coef_)
    
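# Split data for ranked cost prediction. Reusing random_state=42 on the same rows
# reproduces the split above, so X_train and X_test are unchanged.
X_train, X_test, y_train_rank, y_test_rank = train_test_split(X_scaled, y_rank, test_size=0.2, random_state=42)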
    
# Fit Lasso model on ranked costs
lasso_rank = Lasso(alpha=1.0)
lasso_rank.fit(X_train, y_train_rank)
print("Lasso Coefficients (Ranked Costs):", lasso_rank.coef_)
    

Lasso Coefficients (Raw Costs): [9792.57325841 4498.16546045  369.48427867]
Lasso Coefficients (Ranked Costs): [ 525.66367354  469.97952929 -162.0848049 ]
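
None of the Lasso coefficients above is zero. One possible reason is that a penalty of alpha=1.0 is small relative to coefficients in the hundreds or thousands. The sketch below illustrates, on synthetic data, how the number of surviving coefficients depends on `alpha`. The data, the coefficient values, and the alpha grid are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two informative predictors and one noise predictor
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(500, 3)))
y_demo = 5.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(size=500)

# Refit Lasso over a range of penalty strengths and count nonzero coefficients
nonzero_by_alpha = {}
for alpha in [0.01, 0.5, 3.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    nonzero_by_alpha[alpha] = int(np.count_nonzero(coef))
    print(f"alpha={alpha:>5}: coef={np.round(coef, 3)}, nonzero={nonzero_by_alpha[alpha]}")
```

In practice the penalty strength is usually chosen by cross-validation (for example with scikit-learn's `LassoCV`) rather than fixed in advance.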



## 3. Ridge Regression
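Ridge regression adds an L2 penalty that shrinks coefficients toward zero without setting any of them exactly to zero. This stabilizes the estimates when predictors are correlated, which mitigates the impact of multicollinearity.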
    

In [3]:

# Fit Ridge model on raw costs
ridge_raw = Ridge(alpha=1.0)
ridge_raw.fit(X_train, y_train_raw)
print("Ridge Coefficients (Raw Costs):", ridge_raw.coef_)
    
# Fit Ridge model on ranked costs
ridge_rank = Ridge(alpha=1.0)
ridge_rank.fit(X_train, y_train_rank)
print("Ridge Coefficients (Ranked Costs):", ridge_rank.coef_)
    

Ridge Coefficients (Raw Costs): [9790.94945895 4499.02388136  370.61128528]
Ridge Coefficients (Ranked Costs): [ 526.44046452  470.60056032 -163.24934298]
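
To make the Ridge penalty concrete, the following sketch compares scikit-learn's `Ridge` with the closed-form solution (X'X + alpha*I)^-1 X'y computed on centered data. The synthetic, nearly collinear predictors and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (hypothetical) with two strongly correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)  # nearly collinear with x1
x3 = rng.normal(size=300)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 2.0 * x1 + 2.0 * x2 + x3 + rng.normal(size=300)

alpha = 1.0

# Closed-form ridge solution on centered data: (X'X + alpha*I)^-1 X'y
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
coef_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(3), Xc.T @ yc)

# scikit-learn centers the data internally when fit_intercept=True (the default)
coef_sklearn = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_

print("Closed form: ", coef_closed)
print("scikit-learn:", coef_sklearn)
```

Adding `alpha` to the diagonal of X'X keeps the matrix well conditioned even when two columns are almost identical, which is the mechanism behind Ridge's robustness to multicollinearity.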



## 4. Elastic Net Regression
Elastic Net is a hybrid of Lasso and Ridge regression: it combines the L1 and L2 penalties, with l1_ratio controlling the mix between them (0.5 weights them equally). It can therefore perform feature selection while handling multicollinearity.
    

In [4]:

# Fit Elastic Net model on raw costs
elastic_net_raw = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic_net_raw.fit(X_train, y_train_raw)
print("Elastic Net Coefficients (Raw Costs):", elastic_net_raw.coef_)
    
# Fit Elastic Net model on ranked costs
elastic_net_rank = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic_net_rank.fit(X_train, y_train_rank)
print("Elastic Net Coefficients (Ranked Costs):", elastic_net_rank.coef_)
    

Elastic Net Coefficients (Raw Costs): [6657.3176066  4057.90028358  639.23582512]
Elastic Net Coefficients (Ranked Costs): [380.17369617 359.03082249 -84.26335847]
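
The Elastic Net coefficients above are noticeably smaller than the Lasso and Ridge ones at the same nominal alpha. This may be partly due to how scikit-learn scales the objectives: `Ridge` penalizes the unscaled sum of squared errors, whereas `ElasticNet` divides the squared-error term by 2n, so its L2 part corresponds to a Ridge penalty of roughly n * alpha * (1 - l1_ratio). The sketch below checks this on synthetic data; the data and names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic standardized predictors (hypothetical data)
rng = np.random.default_rng(0)
n = 500
X_demo = StandardScaler().fit_transform(rng.normal(size=(n, 3)))
y_demo = 100.0 * X_demo[:, 0] + 50.0 * X_demo[:, 1] + rng.normal(scale=10.0, size=n)

alpha, l1_ratio = 1.0, 0.5

ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_demo, y_demo)

# Ridge with a penalty matched to the L2 part of the Elastic Net objective
ridge_matched = Ridge(alpha=n * alpha * (1 - l1_ratio)).fit(X_demo, y_demo)

print("Ridge(alpha=1.0):       ", np.round(ridge.coef_, 2))
print("ElasticNet(alpha=1.0):  ", np.round(enet.coef_, 2))
print("Ridge(matched penalty): ", np.round(ridge_matched.coef_, 2))
```

The matched Ridge fit lands close to the Elastic Net fit, while `Ridge(alpha=1.0)` barely shrinks the coefficients. Comparing the three models at the same alpha value is therefore not a like-for-like comparison of penalty strength.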



## 5. Conclusion
This notebook applied Lasso, Ridge, and Elastic Net regression to CVD healthcare costs. Key observations include:
- Because the predictors are standardized, coefficient magnitudes are comparable within each model. In every model, the number of comorbidities has the largest coefficient, followed by age, with private insurance status the smallest.
- The number of comorbidities and age show a positive relationship with healthcare costs in all models. Private insurance status has a small positive coefficient in the raw cost models and a negative coefficient in the ranked cost models.
- At alpha=1.0, the Lasso and Ridge coefficients are nearly identical and no coefficient is shrunk to zero, so no feature is dropped at this penalty strength; only Elastic Net shrinks the coefficients noticeably. With a tuned penalty, these regularized models can be used to select the most influential features and to build more robust predictive models, as proposed in the project document.
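
The cells above create test splits but report only coefficients. As a hedged sketch of how predictive performance could be compared, the example below scores each model on held-out rows using R^2 and mean absolute error. It runs on synthetic data; the alpha values, data, and names are assumptions, not results from the CVD dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two informative predictors and one noise predictor
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Score each model on the held-out test rows
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr_s, y_tr).predict(X_te_s)
    scores[name] = (r2_score(y_te, pred), mean_absolute_error(y_te, pred))
    print(f"{name:<12} R^2={scores[name][0]:.3f}  MAE={scores[name][1]:.3f}")
```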

    