# Logistic Regression

As discussed earlier this week, logistic regression is used for classification, i.e., cases where the dependent variable is discrete and drawn from a finite set of possibilities. Much like linear regression, logistic regression can be used in both an inferential context and a predictive context.

## Inference with Logistic Regression
In the inferential context, we typically have a binary dependent variable and want to assess which factors increase or decrease the likelihood of that variable being 1, i.e., the class is true. For example, as we will see below, we might want to determine which factors increase the risk of cardiovascular disease. Logistic regression in an inferential context also allows us to compute odds ratios. The odds ratio of an independent variable is the factor by which the odds of the class being 1 are multiplied for a unit increase in that variable, with all else held constant. Subtracting 1 from the odds ratio and multiplying by 100 expresses it as a percentage change in the odds.
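
As a brief sketch of where the odds ratio comes from (the symbols $p$, $\beta_j$, and $x_j$ are notation introduced here for illustration), logistic regression models the log-odds of the positive class as a linear function of the independent variables:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
$$

Increasing $x_j$ by one unit while holding the other variables fixed adds $\beta_j$ to the log-odds, so the odds are multiplied by $e^{\beta_j}$:

$$
\mathrm{OR}_j = \frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = e^{\beta_j},
\qquad
\text{percentage change in odds} = \left(e^{\beta_j} - 1\right) \times 100
$$

Note that this is a change in the odds $p/(1-p)$, not directly in the probability $p$.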

In [4]:
# imports
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Load the heart dataset
heart = pd.read_csv('heart.csv')



In [5]:
heart.head() # description of dataset and its fields: https://www.kaggle.com/code/christophergd/introduction-to-seaborn-heart-attack-data/input


Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [6]:
heart.describe()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [27]:
heart['high_chol'] = (heart['chol'] >= 200).astype(int) # high cholesterol is over 200 mg/dL
X = heart[['age', 'sex', 'trtbps', 'high_chol', 'thalachh','fbs']]  # Independent variables
y = heart['output']  # Dependent variable
X = sm.add_constant(X)

In [28]:
# Fit the logistic regression model
model = sm.Logit(y, X).fit()

Optimization terminated successfully.
         Current function value: 0.531237
         Iterations 6


In [29]:
print(model.summary())

                           Logit Regression Results                           
Dep. Variable:                 output   No. Observations:                  303
Model:                          Logit   Df Residuals:                      296
Method:                           MLE   Df Model:                            6
Date:                Mon, 22 Jul 2024   Pseudo R-squ.:                  0.2292
Time:                        03:29:17   Log-Likelihood:                -160.96
converged:                       True   LL-Null:                       -208.82
Covariance Type:            nonrobust   LLR p-value:                 1.968e-18
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2593      1.783     -0.706      0.480      -4.754       2.236
age           -0.0189      0.017     -1.086      0.277      -0.053       0.015
sex           -1.6628      0.323     -5.143      0.0

Notice from the above results that `trtbps`, `thalachh`, and `sex` demonstrate a statistically significant relationship with heart disease at a cut-off of 0.05. However, the coefficients are on the log-odds scale and are not as easy to interpret as those in linear regression, so we turn to the calculation of odds ratios.
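
The `P>|z|` column in the summary is a two-sided p-value under a standard normal distribution, computed from the `z` column. The following is a minimal sketch that reproduces it with the standard library for the two rows whose values are fully printed above. The helper name `two_sided_p` and the constant `ALPHA` are illustrative names, not part of statsmodels.

```python
import math

ALPHA = 0.05  # significance cut-off used in the text


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))


# z statistics copied from the printed summary
z_stats = {"const": -0.706, "age": -1.086}

p_values = {name: two_sided_p(z) for name, z in z_stats.items()}
significant = [name for name, p in p_values.items() if p < ALPHA]

for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
print("Significant at 0.05:", significant)
```

Neither of these two terms falls below the cut-off, which agrees with their absence from the list of significant variables.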

In [26]:
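odds_ratios = np.exp(model.params) # an odds ratio is the factor by which the odds are multiplied per unit increase
percentage_changes = (odds_ratios - 1) * 100 # subtract 1 and multiply by 100 to get the percentage change in the odds


# 95% confidence intervals for the coefficients (log-odds scale, not exponentiated),
# shown alongside the odds ratios and percentage changes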
conf = model.conf_int()
conf['Odds Ratio'] = odds_ratios
conf['Percentage Change'] = percentage_changes
conf.columns = ['2.5%', '97.5%', 'Odds Ratio', 'Percentage Change']
print("\n95% Confidence Intervals for Odds Ratios and Percentage Changes:")
print(conf)


95% Confidence Intervals for Odds Ratios and Percentage Changes:
              2.5%     97.5%  Odds Ratio  Percentage Change
const    -3.903622  3.316792    0.745713         -25.428749
age      -0.049448  0.019720    0.985246          -1.475407
sex      -2.537509 -1.191401    0.154981         -84.501934
trtbps   -0.036979 -0.003880    0.979778          -2.022228
chol     -0.012503 -0.001667    0.992940          -0.706017
thalachh  0.031838  0.062045    1.048060           4.806038
fbs      -0.631147  0.845675    1.113228          11.322807


From the above odds ratios and percentage changes, we see that (all else being equal):

1. Having `sex` equal to 1 rather than 0 decreases the odds of heart disease by 84.5% (check the dataset description for which sex is coded as 1)
2. Every unit increase in maximum heart rate achieved (`thalachh`) increases the odds by 4.8%
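
The `2.5%` and `97.5%` columns printed above are on the coefficient (log-odds) scale, while the `Odds Ratio` column is exponentiated. The sketch below, using only the standard library and values copied from that table, exponentiates the bounds to put them on the odds-ratio scale. It also shows what an odds ratio implies for a probability; the baseline probability of 0.5 and the helper `apply_odds_ratio` are hypothetical and chosen only for illustration.

```python
import math

# Coefficient confidence bounds (log-odds scale) copied from the printed table
log_odds_ci = {
    "sex": (-2.537509, -1.191401),
    "thalachh": (0.031838, 0.062045),
}

# Exponentiating each bound gives the interval on the odds-ratio scale
odds_ratio_ci = {
    name: (math.exp(lo), math.exp(hi)) for name, (lo, hi) in log_odds_ci.items()
}

for name, (lo, hi) in odds_ratio_ci.items():
    print(f"{name}: odds ratio 95% CI = ({lo:.3f}, {hi:.3f})")


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline probability of 0.5, used only for illustration
new_prob = apply_odds_ratio(0.5, 0.154981)
print(f"Probability moves from 0.500 to {new_prob:.3f}")
```

An interval that lies entirely below 1 (as for `sex`) or entirely above 1 (as for `thalachh`) corresponds to a coefficient interval that excludes 0. The last lines illustrate that an 84.5% reduction in the odds is not the same as an 84.5% reduction in the probability.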


## Prediction

With inference demonstrated, let us also consider a predictive setting: predicting heart disease.

In [30]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

In [31]:
# Initialize the logistic regression model
model = LogisticRegression(max_iter=1000)

# Set up K-Fold Cross-Validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)
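
To make the splitting step concrete, here is a minimal standard-library sketch of shuffled k-fold index generation. It is not scikit-learn's implementation and will not reproduce the exact splits of `KFold` for the same seed; the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 303 rows and 10 folds, matching the dataset size and n_splits used here
folds = list(k_fold_indices(n_samples=303, n_splits=10, seed=1))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

Each observation appears in exactly one test fold, so every row is used for evaluation once and for training in the other nine folds.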

In [34]:
cv_roc_auc_scores = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')
cv_f1_scores = cross_val_score(model, X, y, cv=kf, scoring='f1')

In [35]:
# Print the mean and standard deviation of the cross-validation ROC AUC scores
print(f"Mean ROC AUC Score: {np.mean(cv_roc_auc_scores)}")
print(f"Standard Deviation of ROC AUC Scores: {np.std(cv_roc_auc_scores)}")

# Print the mean and standard deviation of the cross-validation F1 scores
print(f"Mean F1 Score: {np.mean(cv_f1_scores)}")
print(f"Standard Deviation of F1 Scores: {np.std(cv_f1_scores)}")


Mean ROC AUC Score: 0.7938799693312077
Standard Deviation of ROC AUC Scores: 0.09089208647796708
Mean F1 Score: 0.7391292284187222
Standard Deviation of F1 Scores: 0.07307016927708843


The classifier's performance appears fairly stable across folds, with standard deviations below 0.1 for both metrics. However, performance, especially as judged by AUC, can be improved. Perhaps logistic regression is too restrictive in its linear assumptions? As an optional task, research non-linear classifiers in sklearn and try to use one here.
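
For readers less familiar with the two metrics, the sketch below computes them by hand on a small made-up example (the labels and scores are invented for illustration and are not from the heart dataset). ROC AUC is computed as the fraction of positive/negative pairs in which the positive case receives the higher score, and F1 as the harmonic mean of precision and recall at a 0.5 threshold. The function names `roc_auc` and `f1` are illustrative, not scikit-learn's.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(y_true, y_pred):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]  # hard labels at a 0.5 threshold

auc_value = roc_auc(y_true, scores)
f1_value = f1(y_true, y_pred)
print(f"ROC AUC: {auc_value:.3f}, F1: {f1_value:.3f}")
```

AUC depends only on how the scores rank the cases, whereas F1 depends on the chosen threshold, which is one reason the two metrics can move differently between models.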

In [36]:
from sklearn.ensemble import RandomForestClassifier

In [37]:
# Initialize the Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=1)

# Set up K-Fold Cross-Validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

cv_roc_auc_scores = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')
cv_f1_scores = cross_val_score(model, X, y, cv=kf, scoring='f1')

# Print the mean and standard deviation of the cross-validation ROC AUC scores
print(f"Mean ROC AUC Score: {np.mean(cv_roc_auc_scores)}")
print(f"Standard Deviation of ROC AUC Scores: {np.std(cv_roc_auc_scores)}")

# Print the mean and standard deviation of the cross-validation F1 scores
print(f"Mean F1 Score: {np.mean(cv_f1_scores)}")
print(f"Standard Deviation of F1 Scores: {np.std(cv_f1_scores)}")


Mean ROC AUC Score: 0.7592828042700334
Standard Deviation of ROC AUC Scores: 0.10158698026748882
Mean F1 Score: 0.7094160279124783
Standard Deviation of F1 Scores: 0.06643521089957283
