<font size = "5" > Model Specification in Logistic Regression </font> <br>

This dataset is used to predict whether a patient is likely to have a stroke, based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

In [13]:
library(dplyr)

In [2]:
df = readr::read_csv("stroke_data.csv", show_col_types = FALSE, na = c("N/A", "NA"))
head(df)

id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>
9046,Male,67,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
31112,Male,80,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
1665,Female,79,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
56669,Male,81,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
53882,Male,74,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1


In [3]:
#Any null values?
colSums(is.na(df))
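
As a small illustration of what this check returns, here is a sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the stroke data). `is.na()` gives a logical matrix, and `colSums()` counts the `TRUE` entries per column. In the real data, the `na = c("N/A", "NA")` argument in the read step is what turns those strings into `NA` in the first place.

```r
# Toy data frame (hypothetical values), standing in for the stroke data
toy <- data.frame(
  bmi = c(36.6, NA, 32.5, NA),
  smoking_status = c("smokes", "never smoked", NA, "smokes")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```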

In [4]:
#Remove rows with missing values
df = na.omit(df)

#Convert target column to factors
df <- df %>% mutate(stroke = as.factor(stroke))
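
A brief sketch of what the factor conversion means for `glm()`: with a factor response and `family = binomial()`, R treats the first factor level as failure and every other level as success, so with levels "0" and "1" the modeled event is "1". The vectors below are made up for illustration only.

```r
# Synthetic 0/1 outcome and predictor (hypothetical values)
y_num <- c(0, 0, 1, 0, 1, 1, 0, 1)
x     <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Same outcome stored as a factor; the first level ("0") is the failure class
y_fac <- as.factor(y_num)
levels(y_fac)

# Both encodings should give the same fitted coefficients
fit_num <- glm(y_num ~ x, family = binomial())
fit_fac <- glm(y_fac ~ x, family = binomial())
all.equal(coef(fit_num), coef(fit_fac))
```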

In [5]:
#Save the clean dataset for later use
#Note: this overwrites the input file that was read above
write.csv(df, file = "stroke_data.csv", row.names = FALSE)

We will use two continuous predictors, bmi and avg_glucose_level, and compare six model specifications built from them.

**y ~ x1**

In [6]:
model1 <- glm(formula = stroke ~ bmi,
              data = df,
              family = binomial()
             )
model1


Call:  glm(formula = stroke ~ bmi, family = binomial(), data = df)

Coefficients:
(Intercept)          bmi  
   -3.82844      0.02416  

Degrees of Freedom: 4908 Total (i.e. Null);  4907 Residual
Null Deviance:	    1728 
Residual Deviance: 1720 	AIC: 1724
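
For reference, the output above can be read as follows. This is a worked sketch using the rounded numbers printed by R, where p is the probability of stroke and k is the number of estimated coefficients. The AIC identity holds here because the response is 0/1, in which case the residual deviance equals minus twice the log-likelihood.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{bmi},
\qquad \hat\beta_0 = -3.82844,\quad \hat\beta_1 = 0.02416
$$

$$
e^{\hat\beta_1} = e^{0.02416} \approx 1.024
$$

$$
\mathrm{AIC} = \text{Residual Deviance} + 2k \approx 1720 + 2 \cdot 2 = 1724
$$

So each additional unit of bmi multiplies the estimated odds of stroke by about 1.024, an increase of roughly 2.4%.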

**y ~ x2**


In [7]:
model2 <- glm(formula = stroke ~ avg_glucose_level,
              data = df,
              family = binomial()
              )
model2


Call:  glm(formula = stroke ~ avg_glucose_level, family = binomial(), 
    data = df)

Coefficients:
      (Intercept)  avg_glucose_level  
         -4.43638            0.01125  

Degrees of Freedom: 4908 Total (i.e. Null);  4907 Residual
Null Deviance:	    1728 
Residual Deviance: 1653 	AIC: 1657

**y ~ poly(x1,2)**


In [8]:
model3 <- glm(formula = stroke ~ poly(x = bmi, degree = 2),
              data = df,
              family = binomial()
              )
model3


Call:  glm(formula = stroke ~ poly(x = bmi, degree = 2), family = binomial(), 
    data = df)

Coefficients:
               (Intercept)  poly(x = bmi, degree = 2)1  
                    -3.219                      12.193  
poly(x = bmi, degree = 2)2  
                   -42.465  

Degrees of Freedom: 4908 Total (i.e. Null);  4906 Residual
Null Deviance:	    1728 
Residual Deviance: 1695 	AIC: 1701
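
The coefficients above are large because `poly()` uses orthogonal polynomial columns by default, which are scaled to unit length over all rows, so they are not slopes per unit of bmi. The sketch below uses synthetic data (the variable `x` and the simulation parameters are assumptions, not the stroke data) to show that `raw = TRUE` changes the coefficients but should leave the fitted model unchanged.

```r
set.seed(1)
n <- 500
x <- rnorm(n, mean = 29, sd = 8)   # synthetic stand-in for bmi
p <- plogis(-2 + 0.15 * (x - 29) - 0.01 * (x - 29)^2)
y <- rbinom(n, size = 1, prob = p)

# Orthogonal polynomial basis (the default) versus raw powers x and x^2
fit_orth <- glm(y ~ poly(x, 2), family = binomial())
fit_raw  <- glm(y ~ poly(x, 2, raw = TRUE), family = binomial())

# The coefficients differ because the basis differs
coef(fit_orth)
coef(fit_raw)

# The fit itself should agree up to numerical tolerance
same_deviance <- all.equal(deviance(fit_orth), deviance(fit_raw), tolerance = 1e-5)
same_fitted   <- all.equal(fitted(fit_orth), fitted(fit_raw), tolerance = 1e-5)
```

Orthogonal columns are usually better conditioned numerically; raw powers are easier to interpret on the original scale.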

**y ~ poly(x2,2)**


In [9]:
model4 <- glm(formula = stroke ~ poly(x = avg_glucose_level, degree = 2),
              data = df,
              family = binomial()
              )
model4


Call:  glm(formula = stroke ~ poly(x = avg_glucose_level, degree = 2), 
    family = binomial(), data = df)

Coefficients:
                             (Intercept)  
                                  -3.244  
poly(x = avg_glucose_level, degree = 2)1  
                                  31.759  
poly(x = avg_glucose_level, degree = 2)2  
                                   7.767  

Degrees of Freedom: 4908 Total (i.e. Null);  4906 Residual
Null Deviance:	    1728 
Residual Deviance: 1650 	AIC: 1656

**y ~ x1*x2**


In [10]:
model5 <- glm(formula = stroke ~ bmi * avg_glucose_level ,
              data = df,
              family = binomial()
              )
model5


Call:  glm(formula = stroke ~ bmi * avg_glucose_level, family = binomial(), 
    data = df)

Coefficients:
          (Intercept)                    bmi      avg_glucose_level  
           -4.636e+00              8.272e-03              1.050e-02  
bmi:avg_glucose_level  
            1.234e-05  

Degrees of Freedom: 4908 Total (i.e. Null);  4905 Residual
Null Deviance:	    1728 
Residual Deviance: 1652 	AIC: 1660
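
The formula operator `*` is shorthand for both main effects plus their interaction, which is why the output above lists three slope terms. A minimal sketch with placeholder names (`y`, `x1`, `x2` are symbols only, no data is needed):

```r
f_star <- y ~ x1 * x2          # shorthand
f_long <- y ~ x1 + x2 + x1:x2  # explicit form
f_only <- y ~ x1:x2            # interaction term alone, no main effects

labels_star <- attr(terms(f_star), "term.labels")
labels_long <- attr(terms(f_long), "term.labels")
labels_only <- attr(terms(f_only), "term.labels")

labels_star  # "x1" "x2" "x1:x2"
identical(labels_star, labels_long)
```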

**y ~ poly(x1,2) + poly(x2,2) + x1:x2**


In [11]:
model6 <- glm(formula = stroke ~ poly(bmi, 2) + poly(avg_glucose_level, 2) + bmi:avg_glucose_level ,
              data = df,
              family = binomial()
              )
model6


Call:  glm(formula = stroke ~ poly(bmi, 2) + poly(avg_glucose_level, 
    2) + bmi:avg_glucose_level, family = binomial(), data = df)

Coefficients:
                (Intercept)                poly(bmi, 2)1  
                 -4.484e+00                   -2.595e+01  
              poly(bmi, 2)2  poly(avg_glucose_level, 2)1  
                 -4.935e+01                   -5.906e+00  
poly(avg_glucose_level, 2)2        bmi:avg_glucose_level  
                  6.194e+00                    3.685e-04  

Degrees of Freedom: 4908 Total (i.e. Null);  4903 Residual
Null Deviance:	    1728 
Residual Deviance: 1625 	AIC: 1637
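
The interaction coefficient printed above (3.685e-04) is expressed per unit of the raw product bmi × avg_glucose_level, while the `poly()` terms are on a rescaled orthogonal basis, so the magnitudes are not directly comparable. The sketch below uses synthetic data (the variable `x` and all simulation parameters are assumptions) to show that rescaling a predictor rescales its coefficient without changing the fit.

```r
set.seed(7)
n <- 400
x <- rnorm(n, mean = 3000, sd = 800)   # synthetic predictor on a large scale
y <- rbinom(n, size = 1, prob = plogis(-2 + 0.0005 * (x - 3000)))

fit_raw    <- glm(y ~ x, family = binomial())
fit_scaled <- glm(y ~ I(x / 1000), family = binomial())

# Slope per unit of x versus slope per 1000 units of x
slope_raw    <- unname(coef(fit_raw)[2])
slope_scaled <- unname(coef(fit_scaled)[2])
c(raw = slope_raw, scaled = slope_scaled)

# Same model, different units
same_slope    <- all.equal(slope_scaled, slope_raw * 1000, tolerance = 1e-5)
same_deviance <- all.equal(deviance(fit_raw), deviance(fit_scaled), tolerance = 1e-5)
```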

Let's compare the AIC scores.

In [12]:
comparison = data.frame(model = c('model1', 'model2', 'model3', 'model4', 'model5', 'model6'), 
           aic = c(AIC(model1), AIC(model2), AIC(model3), AIC(model4), AIC(model5), AIC(model6))
           )
comparison$aic = round(comparison$aic)
comparison[order(comparison$aic),]

,model,aic
,<chr>,<dbl>
6,model6,1637
4,model4,1656
2,model2,1657
5,model5,1660
3,model3,1701
1,model1,1724
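
An alternative way to build such a table, sketched on synthetic data (the models `m_a`, `m_b`, `m_c` and the simulated variables are hypothetical): `AIC()` accepts several fitted models at once and returns a data frame with the number of parameters and the AIC of each. Adding the difference from the best model makes the gaps easier to read. The last line checks the identity AIC = residual deviance + 2k, which holds for a 0/1 response.

```r
set.seed(42)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x2))

m_a <- glm(y ~ x1, family = binomial())
m_b <- glm(y ~ x2, family = binomial())
m_c <- glm(y ~ x1 * x2, family = binomial())

# One row per model, with columns df (parameters) and AIC
tab <- AIC(m_a, m_b, m_c)
tab$delta <- tab$AIC - min(tab$AIC)   # gap to the lowest AIC
tab[order(tab$AIC), ]

# For a 0/1 response: AIC = residual deviance + 2 * number of coefficients
aic_check <- all.equal(AIC(m_b), deviance(m_b) + 2 * length(coef(m_b)))
```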


Model 6 has the lowest AIC (1637), so among these six specifications, orthogonal polynomials for both predictors together with their interaction give the best trade-off between fit and complexity. The coefficient for bmi:avg_glucose_level is very small (3.685e-04), but it is expressed per unit of the raw product of the two predictors, so its size alone does not show that the interaction is unimportant.<br>

Model 2 has an AIC of 1657, about 20 points above model 6, but within a few points of models 4 and 5 (1656 and 1660) while using only a linear term in avg_glucose_level. If simplicity matters, using avg_glucose_level alone is therefore a reasonable choice among the six models. <br>

Model 1, which uses only bmi as the predictor, has the highest AIC (1724) and is the worst of the six models by this criterion.
