We will start with coding categorical data in regression.

In [1]:
head(mtcars)

| | mpg (dbl) | cyl (dbl) | disp (dbl) | hp (dbl) | drat (dbl) | wt (dbl) | qsec (dbl) | vs (dbl) | am (dbl) | gear (dbl) | carb (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |


Based on this dataset we would like to build the following model:

$$
mpg \sim cyl
$$

In [2]:
model1 <- lm(formula = mpg ~ cyl, data = mtcars)
summary(model1)


Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,	Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10


If we do not specify that the variable `cyl` is categorical, it is treated as a continuous variable. The fitted model therefore has the following form:

$$
mpg = 37.88 - 2.88 \times cyl
$$
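To see what the continuous treatment implies, here is a minimal sketch using base R only (the object names `fit_num` and `pred` are illustrative and not part of the lecture code). It predicts `mpg` for 4, 6 and 8 cylinders; each step of two cylinders changes the prediction by the same amount, namely twice the slope.

```r
# Refit the model with cyl as a numeric predictor
fit_num <- lm(mpg ~ cyl, data = mtcars)

# Predicted mpg for the three cylinder counts present in the data
pred <- predict(fit_num, newdata = data.frame(cyl = c(4, 6, 8)))
print(pred)

# Consecutive differences: both equal 2 * slope
print(diff(pred))
print(2 * coef(fit_num)[["cyl"]])
```

A linear term forces the change from 4 to 6 cylinders to equal the change from 6 to 8, which is exactly the restriction that a categorical coding relaxes.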

In [3]:
class(mtcars$cyl)

In [5]:
print(sapply(mtcars, class))

      mpg       cyl      disp        hp      drat        wt      qsec        vs 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       am      gear      carb 
"numeric" "numeric" "numeric" 


How can we change the type of a given variable? We have two options:

1. change the type in the dataset using the `as.factor` or `factor` function,
2. specify the conversion within the formula (`mpg ~ as.factor(cyl)`).
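The two options can be compared side by side. The sketch below works on a copy of the data (the names `cars`, `fit_col` and `fit_formula` are illustrative assumptions) and checks that both routes give the same estimates.

```r
# Option 1: store the factor as a new column in a copy of the data
cars <- mtcars
cars$cyl_f <- as.factor(cars$cyl)
fit_col <- lm(mpg ~ cyl_f, data = cars)

# Option 2: convert inside the formula, leaving the data untouched
fit_formula <- lm(mpg ~ as.factor(cyl), data = mtcars)

# Same estimates; only the coefficient names differ
print(unname(coef(fit_col)))
print(unname(coef(fit_formula)))

# Levels of the new factor; the first one is the reference level
print(levels(cars$cyl_f))
```

Option 1 keeps coefficient names short, while option 2 avoids modifying the dataset.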

In [8]:
mtcars$cyl_f <- as.factor(mtcars$cyl)
head(mtcars, n = 2)

| | mpg (dbl) | cyl (dbl) | disp (dbl) | hp (dbl) | drat (dbl) | wt (dbl) | qsec (dbl) | vs (dbl) | am (dbl) | gear (dbl) | carb (dbl) | cyl_f (fct) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 | 6 |
| Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 6 |


Now, let's run the model once again.

In [9]:
model2 <- lm(formula = mpg ~ cyl_f, data = mtcars)
summary(model2)


Call:
lm(formula = mpg ~ cyl_f, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
cyl_f6       -6.9208     1.5583  -4.441 0.000119 ***
cyl_f8      -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,	Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09


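The above model may be written as follows, where each term in parentheses is an indicator equal to 1 when the condition holds and 0 otherwise, and `cyl == 4` is the reference level: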

$$
mpg = 26.66 - 6.92 \times (cyl == 6) - 11.56 \times (cyl == 8)
$$
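With this coding and a single categorical predictor, the intercept is the mean of `mpg` in the reference group and the other coefficients are differences from that mean. A minimal sketch in base R (the names `fit_f` and `group_means` are illustrative):

```r
fit_f <- lm(mpg ~ factor(cyl), data = mtcars)

# Indicator (dummy) columns that lm() builds for the factor
print(head(model.matrix(fit_f), n = 3))

# Group means of mpg by number of cylinders
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(group_means)

# Intercept = mean of the reference group (4 cylinders);
# the other coefficients = differences from that mean
print(group_means[["4"]])
print(group_means[c("6", "8")] - group_means[["4"]])
print(coef(fit_f))
```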

Now, let's change the reference level. How can we do that? We may use the `relevel` function on the dataset or within the formula.

In [11]:
model3 <- lm(formula = mpg ~ relevel(cyl_f, ref = "8"), data = mtcars)
summary(model3)


Call:
lm(formula = mpg ~ relevel(cyl_f, ref = "8"), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 15.1000     0.8614  17.529  < 2e-16 ***
relevel(cyl_f, ref = "8")4  11.5636     1.2986   8.905 8.57e-10 ***
relevel(cyl_f, ref = "8")6   4.6429     1.4920   3.112  0.00415 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,	Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09


The above model may be written in the following way, with `cyl == 8` now serving as the reference level:

$$
mpg = 15.1 + 11.56 \times (cyl == 4) + 4.64 \times (cyl == 6)
$$
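The other route, releveling in the dataset itself, might look like the sketch below. The column name `cyl_f8` and the object names are illustrative assumptions. Changing the reference level only reparameterizes the model, so the fitted values stay the same.

```r
cars <- mtcars
cars$cyl_f <- as.factor(cars$cyl)

# Change the reference level in the data rather than in the formula
cars$cyl_f8 <- relevel(cars$cyl_f, ref = "8")
print(levels(cars$cyl_f8))

fit_ref4 <- lm(mpg ~ cyl_f, data = cars)
fit_ref8 <- lm(mpg ~ cyl_f8, data = cars)
print(coef(fit_ref8))

# Fitted values are unchanged by releveling
print(all.equal(fitted(fit_ref4), fitted(fit_ref8)))
```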

# Marginal effects

Let's assume the following model, in which $X_1$ enters both on its own and through an interaction with $X_2$:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2
$$

Then, the marginal effect of $X_1$ is

$$
ME(X_1) = \frac{\partial Y}{\partial X_1} = \beta_1 + \beta_2 X_2.
$$

Because this effect depends on $X_2$, we may summarize the marginal effect of $X_1$ in one of the following ways:

1. Marginal effects at means (MEM)

$$
MEM(X_1) = \beta_1 + \beta_2 \overline{X}_2,
$$

where $\overline{X}_2$ is the overall average of the $X_2$ variable.

2. Average marginal effect (AME)

$$
AME(X_1) = \frac{1}{n} \sum_{i=1}^{n} \left( \beta_1 + \beta_2 X_{2i} \right),
$$

where $n$ is the sample size and $i$ indexes the rows of our data. So, this is just the average of $ME(X_1)$ over all observations.
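Both quantities can be computed by hand from the fitted coefficients. The sketch below assumes, purely for illustration, that $X_1$ is `wt` and $X_2$ is `hp` from `mtcars`; the object names are not part of the lecture code.

```r
# Y = b0 + b1 * X1 + b2 * X1 * X2, with X1 = wt and X2 = hp (illustrative choice)
fit_int <- lm(mpg ~ wt + wt:hp, data = mtcars)
b <- coef(fit_int)

# Marginal effect of wt for every row: b1 + b2 * hp_i
me_wt <- b[["wt"]] + b[["wt:hp"]] * mtcars$hp

# MEM: effect evaluated at the mean of hp
mem <- b[["wt"]] + b[["wt:hp"]] * mean(mtcars$hp)

# AME: mean of the row-wise effects
ame <- mean(me_wt)

print(c(MEM = mem, AME = ame))
```

In this model the marginal effect is linear in $X_2$, so MEM and AME coincide. They generally differ when the effect is a nonlinear function of the covariates, for example in logistic regression.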

In [12]:
install.packages("margins")
library(margins)


In [13]:
model1 <- lm(mpg ~ wt, mtcars)
model2 <- lm(mpg ~ wt + I(wt^2), mtcars) ## I() allows for transformations within formula

Let's calculate the marginal effect of `wt` using the `margins` function.

In [20]:
print(coef(model1))

(Intercept)          wt 
  37.285126   -5.344472 


In [18]:
summary(margins(model1))

| | factor (chr) | AME (dbl) | SE (dbl) | z (dbl) | p (dbl) | lower (dbl) | upper (dbl) |
|---|---|---|---|---|---|---|---|
| 1 | wt | -5.344472 | 0.5591013 | -9.55904 | 1.188543e-21 | -6.44029 | -4.248653 |


In [21]:
print(coef(model2))

(Intercept)          wt     I(wt^2) 
  49.930811  -13.380337    1.171087 


In [22]:
summary(margins(model2))

| | factor (chr) | AME (dbl) | SE (dbl) | z (dbl) | p (dbl) | lower (dbl) | upper (dbl) |
|---|---|---|---|---|---|---|---|
| 1 | wt | -5.844978 | 0.5102247 | -11.45569 | 2.201917e-30 | -6.845001 | -4.844956 |
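These two AME values can be checked by hand without the `margins` package. In the linear model the derivative with respect to `wt` is just the slope. In the quadratic model it is $\beta_1 + 2 \beta_2 \, wt$, which we average over the rows. The sketch below uses illustrative object names and should reproduce the reported values up to numerical error.

```r
fit_lin <- lm(mpg ~ wt, data = mtcars)
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)
b <- coef(fit_quad)

# Linear model: the derivative is the slope, identical for every row
ame_lin <- coef(fit_lin)[["wt"]]

# Quadratic model: d mpg / d wt = b1 + 2 * b2 * wt, averaged over rows
ame_quad <- mean(b[["wt"]] + 2 * b[["I(wt^2)"]] * mtcars$wt)

print(c(linear = ame_lin, quadratic = ame_quad))
```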


In [25]:
model3 <- lm(mpg ~ factor(cyl)*factor(am), mtcars)
summary(model3)


Call:
lm(formula = mpg ~ factor(cyl) * factor(am), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6750 -1.1000  0.1125  1.6875  5.8250 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                22.900      1.751  13.081 6.06e-13 ***
factor(cyl)6               -3.775      2.316  -1.630 0.115155    
factor(cyl)8               -7.850      1.957  -4.011 0.000455 ***
factor(am)1                 5.175      2.053   2.521 0.018176 *  
factor(cyl)6:factor(am)1   -3.733      3.095  -1.206 0.238553    
factor(cyl)8:factor(am)1   -4.825      3.095  -1.559 0.131069    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.032 on 26 degrees of freedom
Multiple R-squared:  0.7877,	Adjusted R-squared:  0.7469 
F-statistic: 19.29 on 5 and 26 DF,  p-value: 5.179e-08


In [26]:
summary(margins(model3))

| | factor (chr) | AME (dbl) | SE (dbl) | z (dbl) | p (dbl) | lower (dbl) | upper (dbl) |
|---|---|---|---|---|---|---|---|
| 1 | am1 | 2.247396 | 1.334626 | 1.683915 | 0.09219818 | -0.3684227 | 4.863214 |
| 2 | cyl6 | -5.291667 | 1.608214 | -3.2904 | 0.001000449 | -8.4437073 | -2.139626 |
| 3 | cyl8 | -9.810156 | 1.516252 | -6.470004 | 9.800018e-11 | -12.7819554 | -6.838357 |
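For a categorical predictor there is no derivative to take. The sketch below assumes that `margins` reports, for factor variables, the average difference in predictions between a given level and the reference level. Under that assumption, the AME for `am1` can be reproduced with base R by predicting every car twice (object names are illustrative):

```r
fit_fac <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)

# Predict every car twice: once with am = 0 and once with am = 1,
# keeping each car's observed number of cylinders
pred_am0 <- predict(fit_fac, newdata = transform(mtcars, am = 0))
pred_am1 <- predict(fit_fac, newdata = transform(mtcars, am = 1))

# Average difference over the observed distribution of cyl
ame_am <- mean(pred_am1 - pred_am0)
print(ame_am)
```

Because of the interaction, the effect of `am` differs by cylinder group, and the AME weights these group-specific effects by how many cars fall into each group.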
