
In a GLM with a binary response variable $Y$ we may use the logit link function, which leads to logistic regression, given by the following equation:

$$
E(y_i | \mathbf{x}_i) = Pr(y_i = 1 | \mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^T\mathbf{\beta})}{1 + \exp(\mathbf{x}_i^T\mathbf{\beta})},
$$

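and, after applying the logit transformation $\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$, we obtain a model that is linear in the parameters:

$$
\text{logit}(Pr(y_i = 1 | \mathbf{x}_i)) = \mathbf{x}_i^T\mathbf{\beta}.
$$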
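The two equations above are inverses of each other, which can be checked numerically. The sketch below is only an illustration: the helper names `inv_logit` and `logit` and the values of `eta` are my own choices, not part of the lecture data. Base R already provides the same functions as `plogis()` and `qlogis()`.

```r
# Inverse logit (logistic function) and logit, written out explicitly
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))
logit <- function(p) log(p / (1 - p))

eta <- c(-2, 0, 1.5)   # illustrative values of the linear predictor x'beta
p <- inv_logit(eta)    # corresponding probabilities, all inside (0, 1)

# logit() undoes inv_logit(), and both agree with base R's plogis()/qlogis()
all.equal(logit(p), eta)
all.equal(p, plogis(eta))
all.equal(logit(p), qlogis(p))
```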

In this lecture I will use the term `logistic regression` for a `GLM with logit link`. Analogously, a `GLM with probit link` is often called `probit regression`.

Load the `tidyverse` package.

In [3]:
library(tidyverse)

In [4]:
df <- readRDS("data_for_lecture.rds") %>%
    mutate_at(vars(locality, woj, gender), as.factor)
    
head(df)

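| respid | cluster | locality | gmina | woj | gender | age | status | status_detail |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<fct>` | `<dbl+lbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl+lbl>` | `<dbl+lbl>` |
| 30 | 3 | 3 | 202011 | 1 | 0 | 47 | 1 | 1 |
| 33 | 3 | 3 | 202011 | 1 | 0 | 47 | 1 | 1 |
| 27 | 3 | 3 | 202011 | 1 | 0 | 47 | 0 | 3 |
| 28 | 3 | 3 | 202011 | 1 | 0 | 47 | 0 | 3 |
| 29 | 3 | 3 | 202011 | 1 | 0 | 47 | 0 | 3 |
| 31 | 3 | 3 | 202011 | 1 | 0 | 47 | 0 | 2 |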


Let's build a model using the `glm` function. If speed is crucial, e.g. because the data set is large, we can use the `speedglm` package instead.

In [5]:
model1 <- glm(formula = status ~ gender, data = df, family = binomial(link = "logit"))
summary(model1)


Call:
glm(formula = status ~ gender, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9609  -0.9609  -0.9148   1.4106   1.4651  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.65479    0.03154  -20.76  < 2e-16 ***
gender1      0.12165    0.04028    3.02  0.00253 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 14751  on 11299  degrees of freedom
Residual deviance: 14742  on 11298  degrees of freedom
AIC: 14746

Number of Fisher Scoring iterations: 4


The `Coefficients` table reports the estimates of the linear part, i.e. $\hat{\mathbf{\beta}}$. So we may write the fitted model as:

$$
Pr(\text{Status} = 1 | \text{gender}) = \frac{\exp(-0.65 + 0.12 \times \text{Women}) }{1 + \exp(-0.65 + 0.12 \times \text{Women})}.
$$
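To see what this equation implies, we can plug the estimates into it by hand. The sketch below copies the two coefficients from the `summary(model1)` output; the variable names `b0`, `b1`, `p_men` and `p_women` are introduced here only for illustration.

```r
# Coefficients copied from the summary(model1) output
b0 <- -0.65479   # intercept: log-odds for men (gender = 0)
b1 <-  0.12165   # gender1: difference in log-odds for women

# plogis(eta) computes exp(eta) / (1 + exp(eta))
p_men   <- plogis(b0)
p_women <- plogis(b0 + b1)

round(c(men = p_men, women = p_women), 3)
```

The fitted probabilities are roughly 0.34 for men and 0.37 for women. With the fitted model at hand, `predict(model1, type = "response")` should return the same two values for the corresponding rows of `df`.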

To interpret the results of a logistic regression we should calculate odds ratios, which can be done simply by computing $\exp(\hat{\mathbf{\beta}})$.

In [6]:
print(exp(coef(model1)))

(Intercept)     gender1 
  0.5195512   1.1293586 


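Interpretation of `1.129`:

+ This value is an odds ratio: the odds that a woman participates in the survey are 1.13 times the odds that a man participates, i.e. about 13% higher. Note that this is a ratio of odds, not a ratio of probabilities.
+ Another interpretation: among men there are about 52 participants per 100 non-participants (odds of 0.52), while among women there are about 59 participants per 100 non-participants (0.52 × 1.13 ≈ 0.59).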
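It may help to check how the odds ratio relates to the underlying probabilities. The sketch below starts from the two values printed by `exp(coef(model1))`; all variable names are my own and serve only as an illustration.

```r
# Values copied from the exp(coef(model1)) output
odds_men <- 0.5195512   # exp(intercept): odds of participation for men
or_women <- 1.1293586   # exp(gender1): odds ratio, women vs. men

odds_women <- odds_men * or_women

# Convert odds to probabilities: p = odds / (1 + odds)
p_men   <- odds_men / (1 + odds_men)
p_women <- odds_women / (1 + odds_women)

# The ratio of probabilities is smaller than the odds ratio
c(odds_ratio = odds_women / odds_men, prob_ratio = p_women / p_men)
```

The ratio of probabilities comes out at about 1.08, compared with an odds ratio of about 1.13. The two are close only when the event is rare.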

Now, let's update our model and include other variables.

In [7]:
model2 <- update(model1,  . ~ . + age + locality)
summary(model2)


Call:
glm(formula = status ~ gender + age + locality, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1720  -0.9316  -0.8192   1.2781   1.7640  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.633509   0.089013  -7.117 1.10e-12 ***
gender1      0.183999   0.041133   4.473 7.70e-06 ***
age          0.006152   0.001461   4.212 2.53e-05 ***
locality2   -0.475442   0.057503  -8.268  < 2e-16 ***
locality3   -0.597314   0.055034 -10.854  < 2e-16 ***
locality4   -0.808302   0.052366 -15.436  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 14751  on 11299  degrees of freedom
Residual deviance: 14437  on 11294  degrees of freedom
AIC: 14449

Number of Fisher Scoring iterations: 4


Let's transform the parameters and calculate 95% confidence intervals.

In [8]:
exp_betas <- exp(coef(model2))
exp_betas_ci <- exp(confint(model2))
exp_df <- data.frame(exp_betas, exp_betas_ci)
colnames(exp_df) <- c("exp", "exp_lo", "exp_hi")
exp_df

Waiting for profiling to be done...



| | exp | exp_lo | exp_hi |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 0.5307264 | 0.4455589 | 0.6316222 |
| gender1 | 1.202015 | 1.1090259 | 1.3030775 |
| age | 1.0061714 | 1.0033001 | 1.0090621 |
| locality2 | 0.6216105 | 0.5551659 | 0.6955494 |
| locality3 | 0.5502875 | 0.4938552 | 0.6127723 |
| locality4 | 0.4456142 | 0.4020122 | 0.493622 |
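The intervals above come from `confint()`, which for a `glm` profiles the likelihood (hence the "Waiting for profiling to be done..." message). As a rough cross-check, a Wald interval can be computed by hand from the estimate and its standard error. The sketch below does this for `gender1` using the numbers reported by `summary(model2)`; the variable names are illustrative.

```r
# Estimate and standard error of gender1, copied from summary(model2)
est <- 0.183999
se  <- 0.041133

# 95% Wald interval on the log-odds scale, then exponentiated
z <- qnorm(0.975)
wald_ci <- exp(est + c(-1, 1) * z * se)

round(c(exp = exp(est), exp_lo = wald_ci[1], exp_hi = wald_ci[2]), 4)
```

With a sample of this size the Wald limits (about 1.109 and 1.303) nearly coincide with the profile-likelihood limits in the table. For a fitted model, `confint.default(model2)` returns Wald intervals for all coefficients at once.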


Often, to make a parameter easier to interpret, we may change the unit of the corresponding variable. For example, for age the default unit is 1 year (the odds increase by about 0.62% per year), but we could use 10 years instead, in which case the odds ratio is $\exp(10 \times 0.006152) \approx 1.063$, i.e. the odds increase by about 6.3% per 10 years.
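The rescaling can be done directly on the coefficient, without refitting the model. The sketch below uses the `age` estimate reported by `summary(model2)`; the variable names are illustrative.

```r
# Coefficient of age (per 1 year), copied from summary(model2)
b_age <- 0.006152

or_1  <- exp(b_age)        # odds ratio for a 1-year increase
or_10 <- exp(10 * b_age)   # odds ratio for a 10-year increase

# Percentage change in the odds for each unit
round(100 * (c(per_1_year = or_1, per_10_years = or_10) - 1), 2)

# Multiplying the coefficient by 10 is the same as raising the odds ratio
# to the 10th power
all.equal(or_10, or_1^10)
```

An alternative, assuming we are happy to refit, is to put the rescaled variable into the formula, e.g. `I(age / 10)` in place of `age`, so that the reported coefficient already refers to 10 years.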

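In addition, if we would like to calculate marginal effects for a GLM (not only for logistic regression but also, for example, for Poisson regression), we may use the `margins` package. Before doing so, we extend the model with an interaction between gender and locality.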

In [9]:
model3 <- update(model2, . ~ . + gender*locality) ## interaction between gender and locality
summary(model3)


Call:
glm(formula = status ~ gender + age + locality + gender:locality, 
    family = binomial(link = "logit"), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1738  -0.9397  -0.8103   1.2790   1.7455  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -0.642083   0.094702  -6.780 1.20e-11 ***
gender1            0.189632   0.062312   3.043  0.00234 ** 
age                0.006253   0.001468   4.259 2.06e-05 ***
locality2         -0.467263   0.092261  -5.065 4.09e-07 ***
locality3         -0.646976   0.092468  -6.997 2.62e-12 ***
locality4         -0.760575   0.084558  -8.995  < 2e-16 ***
gender1:locality2 -0.013694   0.118020  -0.116  0.90763    
gender1:locality3  0.073104   0.115288   0.634  0.52602    
gender1:locality4 -0.074981   0.108077  -0.694  0.48783    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 14
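Note that the `Call` shows `gender:locality` although we typed `gender*locality`. This can be reproduced on the formulas alone, without any data; the objects `f2` and `f3` below are introduced only for this illustration.

```r
# Formula of model2, and the same update that produced model3
f2 <- status ~ gender + age + locality
f3 <- update(f2, . ~ . + gender * locality)

# gender * locality expands to gender + locality + gender:locality;
# the main effects are already present, so only the interaction is added
f3
attr(terms(f3), "term.labels")
```

The term labels are `gender`, `age`, `locality` and `gender:locality`, which matches the three `gender1:locality*` rows in the coefficient table.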

We calculate marginal effects with the `margins` package rather than `mfx`, because `mfx` does not support interaction terms.

In [10]:
install.packages("margins")


In [11]:
library(margins)

The `margins` function calculates average marginal effects (AME) on the scale of the linear predictor (log-odds) if we use `type = "link"`, or on the probability scale if we set `type = "response"`.

In [14]:
model3_margeff <- margins(model3, type = "link")
summary(model3_margeff)

| | factor | AME | SE | z | p | lower | upper |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | age | 0.006252817 | 0.001468203 | 4.258822 | 2.055067e-05 | 0.003375191 | 0.009130442 |
| 2 | gender1 | 0.182964994 | 0.041422923 | 4.416999 | 1.000809e-05 | 0.101777558 | 0.264152431 |
| 3 | locality2 | -0.475541519 | 0.057549145 | -8.263225 | 1.417888e-16 | -0.58833577 | -0.362747269 |
| 4 | locality3 | -0.602783269 | 0.055383565 | -10.883793 | 1.377104e-27 | -0.711333062 | -0.494233477 |
| 5 | locality4 | -0.805901499 | 0.052387001 | -15.383616 | 2.108372e-53 | -0.908578135 | -0.703224863 |


In [15]:
model3_margeff_resp <- margins(model3, type = "response")
summary(model3_margeff_resp)

| | factor | AME | SE | z | p | lower | upper |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | age | 0.001398436 | 0.0003274262 | 4.270995 | 1.946024e-05 | 0.0007566922 | 0.002040179 |
| 2 | gender1 | 0.041063246 | 0.0090829878 | 4.520896 | 6.157835e-06 | 0.0232609171 | 0.058865575 |
| 3 | locality2 | -0.112725905 | 0.0132485026 | -8.508577 | 1.76081e-17 | -0.1386924927 | -0.086759317 |
| 4 | locality3 | -0.140193986 | 0.0123802009 | -11.324048 | 9.975708e-30 | -0.164458734 | -0.115929238 |
| 5 | locality4 | -0.182405897 | 0.0112698703 | -16.18527 | 6.406792e-59 | -0.2044944368 | -0.160317357 |
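To see why the two scales give different numbers for a continuous variable such as age, the average marginal effect can be computed by hand. The sketch below does not use the lecture data: it simulates a small data set (the sample size, coefficients and variable names are all assumptions made for illustration) and relies on the fact that in a logit model without interactions the derivative of the probability with respect to $x$ is $\beta \, p (1 - p)$. As far as I understand, `margins` obtains the same quantity numerically.

```r
set.seed(123)

# Simulated data: one continuous covariate and a binary response
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))

b_x   <- coef(fit)[["x"]]   # estimated coefficient of x
p_hat <- fitted(fit)        # fitted probabilities for each observation

# On the link scale the marginal effect of x is constant and equals b_x
ame_link <- b_x

# On the response scale it varies by observation, so we average it
ame_response <- mean(b_x * p_hat * (1 - p_hat))

c(link = ame_link, response = ame_response)
```

Because $p(1 - p)$ never exceeds 0.25, the response-scale effect is at most a quarter of the coefficient in absolute value. This agrees with the tables above, where the AME of age is 0.0063 on the link scale and 0.0014 on the probability scale.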
