
# Logistic Regression Analysis for Predicting Coronary Heart Disease (CHD)
    


This notebook demonstrates how to use logistic regression to predict the likelihood of a person having Coronary Heart Disease (CHD). We will explore the effects of feature scaling (Standardization and Normalization) on the model's performance.
    


## 1. Data Loading and Preparation
We'll load the data, select our features and target variable, and prepare the data for modeling. Our target variable is `CHDDXY1`, and we will use `AGEY1X`, `ADSEX4`, `PRVEVY1`, and `total_comorbidities` as predictors. We will also handle missing values and binarize our target variable.
    

In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    
# Load the dataset
df = pd.read_csv('CVD_data2.csv')
    
# Select relevant columns and drop rows with missing values in these columns.
# .copy() makes df_log_reg an independent DataFrame, which avoids a possible
# SettingWithCopyWarning when the new column is assigned below.
log_reg_vars = ['CHDDXY1', 'AGEY1X', 'ADSEX4', 'PRVEVY1', 'total_comorbidities']
df_log_reg = df[log_reg_vars].dropna().copy()

# Binarize the target variable 'CHDDXY1': a value of 1 becomes 1, anything else becomes 0
df_log_reg['CHD_Diagnosis'] = (df_log_reg['CHDDXY1'] == 1).astype('int64')
df_log_reg = df_log_reg.drop(columns='CHDDXY1')
    
print("Cleaned DataFrame for Logistic Regression:")
print(df_log_reg.info())
print("\nTarget variable distribution:")
print(df_log_reg['CHD_Diagnosis'].value_counts())
print("\nDataFrame Head:")
print(df_log_reg.head())
    

Cleaned DataFrame for Logistic Regression:
<class 'pandas.core.frame.DataFrame'>
Index: 3489 entries, 2 to 6740
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   AGEY1X               3489 non-null   float64
 1   ADSEX4               3489 non-null   float64
 2   PRVEVY1              3489 non-null   float64
 3   total_comorbidities  3489 non-null   int64  
 4   CHD_Diagnosis        3489 non-null   int64  
dtypes: float64(3), int64(2)
memory usage: 163.5 KB
None

Target variable distribution:
CHD_Diagnosis
0    3243
1     246
Name: count, dtype: int64

DataFrame Head:
    AGEY1X  ADSEX4  PRVEVY1  total_comorbidities  CHD_Diagnosis
2     62.0     1.0      1.0                    1              0
3     67.0     1.0      2.0                    1              0
5     79.0     2.0      2.0                    0              0
9     70.0     1.0      1.0                    0              0
11    71.0     2.0   


## 2. Feature Scaling
We will apply two common feature scaling techniques to the predictors: Standardization (`StandardScaler`) and Normalization (`MinMaxScaler`).
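
To make the two transformations concrete, here is a minimal sketch in plain Python that applies them by hand to the five `AGEY1X` values shown in the DataFrame head above. Standardization computes z = (x - mean) / std, using the population standard deviation as `StandardScaler` does. Normalization computes x' = (x - min) / (max - min). The variable names (`ages`, `ages_standardized`, `ages_normalized`) are illustrative and not part of the notebook.

```python
from statistics import mean, pstdev

# Ages taken from the DataFrame head printed above
ages = [62.0, 67.0, 79.0, 70.0, 71.0]

# Standardization (z-score): subtract the mean, then divide by the
# population standard deviation (ddof=0), as StandardScaler does
mu, sigma = mean(ages), pstdev(ages)
ages_standardized = [(a - mu) / sigma for a in ages]

# Normalization (min-max): the smallest value maps to 0, the largest to 1
lo, hi = min(ages), max(ages)
ages_normalized = [(a - lo) / (hi - lo) for a in ages]

print([round(z, 2) for z in ages_standardized])
print([round(v, 2) for v in ages_normalized])
```

After standardization the values have mean 0 and standard deviation 1. After normalization they lie in the range [0, 1]. Both are linear rescalings, so the ordering of the values is unchanged.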
    

In [2]:

# Define predictors and target
X = df_log_reg[['AGEY1X', 'ADSEX4', 'PRVEVY1', 'total_comorbidities']]
y = df_log_reg['CHD_Diagnosis']
    
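# Standardization: rescale each feature to zero mean and unit variance
scaler_standard = StandardScaler()
X_standardized = scaler_standard.fit_transform(X)

# Normalization (min-max scaling): rescale each feature to the [0, 1] range
scaler_minmax = MinMaxScaler()
X_normalized = scaler_minmax.fit_transform(X)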
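
Note that the scalers above are fitted on all rows, including those that later end up in the test set. A common alternative is to split first and fit the scaler on the training rows only, for example by wrapping the scaler and the model in a scikit-learn pipeline. The sketch below illustrates that pattern on synthetic data, since the CHD dataset is not reproduced here. The arrays `X_demo` and `y_demo` and the rule used to generate the labels are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, NOT the CHD dataset):
# four numeric features on different scales and a binary label
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[60.0, 1.5, 1.5, 1.0],
                    scale=[10.0, 0.5, 0.5, 1.0],
                    size=(500, 4))
noise = rng.normal(scale=5.0, size=500)
y_demo = (X_demo[:, 0] + 5.0 * X_demo[:, 3] + noise > 68.0).astype(int)

# Split first; stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses
# those training statistics when transforming the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print("Accuracy on synthetic test data:", demo_accuracy)
```

Replacing `StandardScaler()` with `MinMaxScaler()` gives the normalized variant of the same pattern.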
    


## 3. Logistic Regression Modeling
We will now build and evaluate three logistic regression models: one without feature scaling, one with standardized data, and one with normalized data.
    

In [3]:

# 1. Model without Feature Scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model_no_scaling = LogisticRegression(random_state=42)
model_no_scaling.fit(X_train, y_train)
y_pred = model_no_scaling.predict(X_test)
    
print("--- Model without Feature Scaling ---")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
    

--- Model without Feature Scaling ---
Accuracy: 0.9627507163323782

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047



In [4]:

# 2. Model with Standardized Data
X_train_std, X_test_std, y_train_std, y_test_std = train_test_split(X_standardized, y, test_size=0.3, random_state=42)
model_standardized = LogisticRegression(random_state=42)
model_standardized.fit(X_train_std, y_train_std)
y_pred_std = model_standardized.predict(X_test_std)
    
print("--- Model with Standardized Data ---")
print("Accuracy:", accuracy_score(y_test_std, y_pred_std))
print("\nClassification Report:\n", classification_report(y_test_std, y_pred_std))
    

--- Model with Standardized Data ---
Accuracy: 0.9627507163323782

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047



In [5]:

# 3. Model with Normalized Data
X_train_norm, X_test_norm, y_train_norm, y_test_norm = train_test_split(X_normalized, y, test_size=0.3, random_state=42)
model_normalized = LogisticRegression(random_state=42)
model_normalized.fit(X_train_norm, y_train_norm)
y_pred_norm = model_normalized.predict(X_test_norm)
    
print("--- Model with Normalized Data ---")
print("Accuracy:", accuracy_score(y_test_norm, y_pred_norm))
print("\nClassification Report:\n", classification_report(y_test_norm, y_pred_norm))
    

--- Model with Normalized Data ---
Accuracy: 0.9627507163323782

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047




## 4. Conclusion
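This notebook demonstrated the use of logistic regression to predict Coronary Heart Disease. In this case, both scaling methods (Standardization and Normalization) gave the same results as the unscaled model: an accuracy of about 0.963 and identical classification reports. Because the classes are imbalanced (246 CHD cases out of 3,489 rows), accuracy alone gives a limited picture; recall for the CHD class was 0.59 in all three models.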
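
The figures in the classification reports can be cross-checked by hand. The notebook never prints the confusion matrix, so the counts below are an inference: they are the values consistent with the reported accuracy, the class supports (966 and 81), and the rounded precision and recall. The sketch recomputes the metrics from those counts and compares the accuracy with a baseline that always predicts "no CHD".

```python
# Counts inferred from the classification report above (assumption:
# they are not printed anywhere in the notebook)
tn, fp, fn, tp = 960, 6, 33, 48

total = tn + fp + fn + tp                 # 1047 test rows
accuracy = (tp + tn) / total
precision_chd = tp / (tp + fp)            # of predicted CHD cases, share correct
recall_chd = tp / (tp + fn)               # of actual CHD cases, share found
f1_chd = 2 * precision_chd * recall_chd / (precision_chd + recall_chd)

# Baseline: predict class 0 ("no CHD") for every row
baseline_accuracy = (tn + fp) / total

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_chd:.2f}")
print(f"recall:    {recall_chd:.2f}")
print(f"f1-score:  {f1_chd:.2f}")
print(f"baseline:  {baseline_accuracy:.4f}")
```

Under these inferred counts, the model misses 33 of the 81 CHD cases in the test set, and always predicting the majority class would already reach an accuracy of about 0.92.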
    