
# Advanced Regression Techniques for CVD Healthcare Cost Analysis
    


This notebook explores advanced regression techniques for analyzing the data, including:
1. Linear Regression with Forward and Backward Feature Selection.
2. Principal Component Regression (PCR).
3. Partial Least Squares Regression (PLSR).
    
These methods help identify the most significant predictors of healthcare costs and build more robust predictive models. We use the rank of healthcare costs as our target variable, as it proved more suitable for linear models in our previous analyses.
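As a rough illustration of what a rank target looks like, the sketch below ranks a toy cost column with pandas. The raw column name `TOTTCHY1` is an assumption inferred from the target name `TOTTCHY1_rank`, the toy values are made up, and the tie-handling method used to build the real column is not stated in this notebook.

```python
import pandas as pd

# Toy cost values (made up); TOTTCHY1 is an assumed name for the raw cost column
toy = pd.DataFrame({"TOTTCHY1": [0.0, 120.5, 120.5, 98000.0, 3400.0]})

# Tied values share the average of their ranks; the extreme cost of 98000
# simply becomes the highest rank instead of dominating the scale
toy["TOTTCHY1_rank"] = toy["TOTTCHY1"].rank(method="average")
print(toy)
```

Ranking compresses the long right tail that is typical of cost data, which is one plausible reason a linear model fits the ranks better than the raw amounts.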
    


## 1. Data Loading and Preparation
Load the data, select relevant features, handle missing values, and scale the data.
    

In [3]:

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
    
# Load the dataset
df = pd.read_csv('CVD_data2.csv')
    
# Select a broader set of predictors for feature selection
features = ['AGEY1X', 'ADSEX4', 'PRVEVY1', 'DIABDXY1_M18', 'ANGIDXY1', 'ARTHDXY1', 'CHDDXY1', 'MIDXY1', 'total_comorbidities']
target = 'TOTTCHY1_rank'
    
# Keep complete rows only; .copy() avoids pandas' SettingWithCopyWarning below
df_model = df[features + [target]].dropna().copy()

# Binarize categorical predictors (assuming 1 is 'Yes' and 2 is 'No');
# any value other than 1 is mapped to 0
for col in ['ADSEX4', 'PRVEVY1', 'DIABDXY1_M18', 'ANGIDXY1', 'ARTHDXY1', 'CHDDXY1', 'MIDXY1']:
    df_model[col] = (df_model[col] == 1).astype(int)
    
# Define X and y
X = df_model[features]
y = df_model[target]
    
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=features)
    


## 2. Feature Selection
We will implement forward and backward selection to identify the most significant predictors for our linear regression model. We will use the p-value from the `statsmodels` library as our selection criterion.
    

In [7]:

def forward_selection(X, y, significance_level=0.05):
    initial_features = X.columns.tolist()
    best_features = []
    
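    while True:
        # Keep the original column order so ties are broken deterministically
        remaining_features = [f for f in initial_features if f not in best_features]
        if not remaining_features:
            break  # every candidate has already been selected
        new_pval = pd.Series(index=remaining_features, dtype=float)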
        
        for new_column in remaining_features:
            model = sm.OLS(y, sm.add_constant(X[best_features + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        
        min_p_value = new_pval.min()
        if min_p_value < significance_level:
            best_features.append(new_pval.idxmin())
        else:
            break
    
    return best_features

X_scaled = X_scaled.reset_index(drop=True)
y = y.reset_index(drop=True)
# Run forward selection
forward_features = forward_selection(X_scaled, y)
print("Selected features (Forward Selection):", forward_features)

# Fit final model
model_forward = sm.OLS(y, sm.add_constant(X_scaled[forward_features])).fit()
print(model_forward.summary())

Selected features (Forward Selection): ['total_comorbidities', 'AGEY1X', 'ADSEX4', 'PRVEVY1', 'ARTHDXY1']
                            OLS Regression Results                            
Dep. Variable:          TOTTCHY1_rank   R-squared:                       0.194
Model:                            OLS   Adj. R-squared:                  0.193
Method:                 Least Squares   F-statistic:                     167.9
Date:                Mon, 20 Oct 2025   Prob (F-statistic):          2.16e-160
Time:                        00:24:59   Log-Likelihood:                -30853.
No. Observations:                3487   AIC:                         6.172e+04
Df Residuals:                    3481   BIC:                         6.175e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
----------------
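The p-value criterion above is one way to drive stepwise selection. As an alternative sketch (not the method used in this notebook), scikit-learn's `SequentialFeatureSelector` adds features greedily by cross-validated score instead of p-values. The helper name `forward_select_cv`, the synthetic data, and the choice of two features are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

def forward_select_cv(X, y, n_features, cv=5):
    """Greedy forward selection scored by cross-validated R-squared (hypothetical helper)."""
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=n_features,
        direction="forward",
        scoring="r2",
        cv=cv,
    )
    selector.fit(X, y)
    return selector.get_support()  # boolean mask over the columns of X

# Synthetic data: only the first two of six columns carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=300)

mask = forward_select_cv(X_demo, y_demo, n_features=2)
print("Selected column indices:", np.flatnonzero(mask))
```

With the notebook's data, a call such as `forward_select_cv(X_scaled, y, n_features=5)` could serve as a cross-check on the p-value-based result; the number of features to keep would itself need to be chosen.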

In [None]:

def backward_elimination(X, y, significance_level=0.05):
    features = X.columns.tolist()
    while features:
        # Refit on the current feature set; skip the intercept (first entry)
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        p_values = model.pvalues.iloc[1:]
        max_p_value = p_values.max()
        if max_p_value >= significance_level:
            # Drop the least significant predictor and refit
            features.remove(p_values.idxmax())
        else:
            break
    return features
    
backward_features = backward_elimination(X_scaled, y)
print("Selected features (Backward Elimination):", backward_features)
    
# Fit a model with the selected features
model_backward = sm.OLS(y, sm.add_constant(X_scaled[backward_features])).fit()
print(model_backward.summary())
    

Selected features (Backward Elimination): ['AGEY1X', 'ADSEX4', 'PRVEVY1', 'DIABDXY1_M18', 'ARTHDXY1', 'CHDDXY1', 'MIDXY1']
                            OLS Regression Results                            
Dep. Variable:          TOTTCHY1_rank   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.194
Method:                 Least Squares   F-statistic:                     120.6
Date:                Mon, 20 Oct 2025   Prob (F-statistic):          4.63e-159
Time:                        00:33:41   Log-Likelihood:                -30851.
No. Observations:                3487   AIC:                         6.172e+04
Df Residuals:                    3479   BIC:                         6.177e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
------
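Forward selection kept `total_comorbidities`, while backward elimination dropped it and kept the individual diagnosis indicators. One possible explanation is correlation among the predictors, which can be checked with variance inflation factors (VIF). The sketch below is a minimal NumPy version; the helper name `vif_table` and the synthetic data are assumptions, and it presumes no column is constant.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    values = X.to_numpy(dtype=float)
    values = values - values.mean(axis=0)  # centering removes the need for an intercept
    vifs = {}
    for j, name in enumerate(X.columns):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / (target @ target)
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)

# Synthetic example: one column is nearly the sum of two others
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
demo["a_plus_b"] = demo["a"] + demo["b"] + rng.normal(scale=0.1, size=500)
demo["unrelated"] = rng.normal(size=500)

vifs = vif_table(demo)
print(vifs)
```

Calling `vif_table(X_scaled)` on the notebook's predictors would show whether any of them is close to a linear combination of the others; a VIF near 1 indicates little overlap, and very large values indicate near-redundancy.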


## 3. Principal Component Regression (PCR)
PCR is a regression technique that uses the principal components of the predictor variables as the independent variables in a linear regression model.
    

In [None]:

# Perform PCA (n_components is left at its default, so all components are kept)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
    
# Split data
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
    
# Fit a linear regression model on the principal components
pcr_model = LinearRegression()
pcr_model.fit(X_train_pca, y_train)
    
# Evaluate the model
y_pred_pcr = pcr_model.predict(X_test_pca)
print("PCR R-squared:", r2_score(y_test, y_pred_pcr))
    

PCR R-squared: 0.16537369045171757
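Because the cell above keeps every principal component, the fitted model spans the same predictor space as ordinary least squares on all scaled features. PCR only reduces dimensionality when fewer components are retained. The sketch below shows one way to compare component counts by cross-validation, with scaling and PCA refit inside each fold via a pipeline. The helper name `pcr_cv_scores` and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pcr_cv_scores(X, y, max_components, cv=5):
    """Mean cross-validated R-squared of PCR for 1..max_components (hypothetical helper)."""
    scores = {}
    for k in range(1, max_components + 1):
        # Scaler and PCA are fit on each training fold only
        pipe = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
        scores[k] = cross_val_score(pipe, X, y, cv=cv, scoring="r2").mean()
    return scores

# Synthetic data with six predictors, two of which drive the response
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=300)

scores = pcr_cv_scores(X_demo, y_demo, max_components=6)
best_k = max(scores, key=scores.get)
print("Best number of components:", best_k)
```

For this notebook the unscaled `X` and `y` could be passed directly, for example `pcr_cv_scores(X, y, max_components=len(features))`, since the pipeline handles scaling itself.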



## 4. Partial Least Squares Regression (PLSR)
PLSR is similar to PCR, but it creates the components by considering both the predictors and the response variable.
    

In [11]:

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
    
# Fit a PLS regression model
plsr_model = PLSRegression(n_components=5) # n_components can be tuned
plsr_model.fit(X_train, y_train)
    
# Evaluate the model
y_pred_plsr = plsr_model.predict(X_test)
print("PLSR R-squared:", r2_score(y_test, y_pred_plsr))
    

PLSR R-squared: 0.1653503745735313



## 5. Conclusion
This notebook applied feature selection and dimensionality-reduction regression to the ranked healthcare cost target.
- **Forward and Backward Selection** identified subsets of statistically significant predictors: five features (R-squared 0.194) and seven features (R-squared 0.195), respectively, both measured in-sample.
- **PCR and PLSR** provide alternative ways to handle multicollinearity and high-dimensional data; on the held-out 20% test split, each reached an R-squared of about 0.165.
    