# Econometrics 1 

## TD 1 - 26/09/2024

TA: Pedro Vergara Merino ([pedro.vergaramerino@ensae.fr](mailto:pedro.vergaramerino@ensae.fr), office 4081) 

We aim to explain log wages using age and education. Henceforth, `lnw` (or `lnwage`) denotes the logarithm of the hourly wage, `eduy` is the number of years of education (the count starts at 6 years old), and `age` is the age measured in years.

![Regression_Short](figures/reg_lnw_eduy.png)

![Regression_Long](figures/reg_lnw_eduy_age.png)

![Summary](figures/sum_lnw_eduy_age.png)

#### 1. Given the outcome in the previous tables, what can we learn about the correlation between `age` and `eduy`? Explain.

To learn something about the correlation between `age` and `eduy`, it is useful to look at the **omitted variable bias (OVB)** formula (see Proposition 4 of Chapter 1).
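As a reminder, a standard way of writing the formula in this setting is sketched below. The notation ($S$ for the short regression of `lnw` on `eduy` alone, $L$ for the long regression of `lnw` on `eduy` and `age`, and $\widehat{\delta}$) is an assumption of this note and may differ from the notation of Proposition 4.

$$
\widehat{\beta}^{S}_{\text{eduy}} = \widehat{\beta}^{L}_{\text{eduy}} + \widehat{\beta}^{L}_{\text{age}}\,\widehat{\delta},
\qquad
\widehat{\delta} = \frac{\widehat{\mathrm{Cov}}(\text{eduy}, \text{age})}{\widehat{\mathrm{Var}}(\text{eduy})}
$$

Here $\widehat{\delta}$ is the slope of the regression of `age` on `eduy`, so it has the same sign as the sample correlation between the two variables. Comparing the coefficient of `eduy` across the two regression tables, together with the sign of the coefficient of `age` in the long regression, therefore reveals the sign of that correlation.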

#### 2. Recompute the second regression using the data `emp2007.dta`. Interpret the coefficient of `age` in this new regression.

We start by loading the required packages and the data.

In [1]:

# Load packages
library('haven')
library('dplyr')
library('margins')


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

We can then load data in .dta format using the command `read_dta` from the package `haven`.

In [3]:
#Load data
data <- haven::read_dta("./emp2007.dta")

The command `summary` gives descriptive statistics of the data.
To display the first rows of the dataset, we select them with `head` and format them as a table with the command `kable` from the package `knitr`.

In [4]:
summary(data)

      nsup           salred           hhc            nivet      
 Min.   : 0.00   Min.   :  100   Min.   : 1.00   Min.   :21.00  
 1st Qu.: 0.00   1st Qu.: 1137   1st Qu.:35.00   1st Qu.:32.00  
 Median : 0.00   Median : 1438   Median :36.00   Median :43.00  
 Mean   :19.15   Mean   : 1620   Mean   :36.50   Mean   :42.13  
 3rd Qu.:42.00   3rd Qu.: 1900   3rd Qu.:40.00   3rd Qu.:52.00  
 Max.   :72.00   Max.   :85000   Max.   :99.59   Max.   :72.00  
      age            femme             eduy      
 Min.   :20.00   Min.   :0.0000   Min.   : 5.00  
 1st Qu.:29.00   1st Qu.:0.0000   1st Qu.:10.00  
 Median :37.00   Median :0.0000   Median :12.00  
 Mean   :36.47   Mean   :0.4888   Mean   :12.07  
 3rd Qu.:44.00   3rd Qu.:1.0000   3rd Qu.:14.00  
 Max.   :50.00   Max.   :1.0000   Max.   :17.00  

In [5]:
knitr::kable(head(data), "simple")



 nsup   salred   hhc   nivet   age   femme   eduy
-----  -------  ----  ------  ----  ------  -----
   55     2257    32      21    46       1     17
    0     1400    40      41    40       1     12
    0     2015    35      52    49       0     10
    0     1392    39      61    49       1      9
    0     1486    35      52    25       0     10
    0     1526    35      52    32       1     10

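Notice that the data contain the monthly wage (`salred`) and the weekly hours (`hhc`), not the hourly wage. So we first compute the hourly wage as annual earnings (12 monthly wages) divided by annual hours (52 weeks of weekly hours), using the command `mutate` from the package `dplyr`.

The pipe operator `%>%` passes the dataset on its left as the first argument of the function on its right, so the dataset is named only once.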

In [6]:
data <- data %>% mutate(w=(salred*12)/(hhc*52)) # Create variable w
data <- data %>% mutate(lnw=log(w)) # Apply the logarithm

In [7]:
knitr::kable(head(data), "simple") # See new dataframe



 nsup   salred   hhc   nivet   age   femme   eduy           w        lnw
-----  -------  ----  ------  ----  ------  -----  ----------  ---------
   55     2257    32      21    46       1     17   16.276442   2.789719
    0     1400    40      41    40       1     12    8.076923   2.089011
    0     2015    35      52    49       0     10   13.285714   2.586689
    0     1392    39      61    49       1      9    8.236686   2.108598
    0     1486    35      52    25       0     10    9.797802   2.282158
    0     1526    35      52    32       1     10   10.061538   2.308720
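As a quick sanity check of the transformation, the sketch below redoes the computation by hand for the first row of the table (monthly wage 2257, weekly hours 32). The scalar names `salred`, `hhc`, `w` and `lnw` simply mirror the column names and are introduced here for illustration.

```r
# Values copied from the first row of the table above
salred <- 2257  # monthly wage
hhc <- 32       # weekly hours

# Hourly wage = annual earnings / annual hours
w <- (salred * 12) / (hhc * 52)
lnw <- log(w)

print(round(w, 6))    # 16.276442, as in the table
print(round(lnw, 6))  # 2.789719, as in the table
```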

We now run the regression of the logarithm of hourly wages on education level and age. For this, we use the command `lm`.

In [8]:
model1 <- lm(lnw ~ eduy + age, data=data) # Create the model
summary(model1) # Gives the regression coefficients 


Call:
lm(formula = lnw ~ eduy + age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1422 -0.1906 -0.0021  0.1924  3.8563 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.9515069  0.0142606   66.72   <2e-16 ***
eduy        0.0629493  0.0007651   82.27   <2e-16 ***
age         0.0145086  0.0002423   59.87   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3553 on 31234 degrees of freedom
Multiple R-squared:  0.2159,	Adjusted R-squared:  0.2158 
F-statistic:  4300 on 2 and 31234 DF,  p-value: < 2.2e-16
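Since the dependent variable is in logs, a coefficient $b$ corresponds to a relative change of $e^{b} - 1$ in the hourly wage for a one-unit increase in the regressor, the other regressor being held fixed. The sketch below applies this to the estimates printed above. The coefficients are copied by hand from the output, and the names `b_age`, `b_eduy`, `pct_age` and `pct_eduy` are introduced here for illustration.

```r
# Coefficients copied from the regression output above
b_eduy <- 0.0629493
b_age <- 0.0145086

# Exact percentage change in the hourly wage implied by a log-linear model
pct_age <- 100 * (exp(b_age) - 1)
pct_eduy <- 100 * (exp(b_eduy) - 1)

cat("One more year of age:", round(pct_age, 2), "% higher hourly wage\n")
cat("One more year of education:", round(pct_eduy, 2), "% higher hourly wage\n")
```

At a given education level, one additional year of age is thus associated with an hourly wage that is roughly 1.5% higher. This is a description of the fitted conditional mean, not necessarily a causal effect.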


#### 3. Compute the coefficients of the regression of `lnw` on `eduy`, `age`, and `age`$^2$. What is the marginal effect of `age`? What is the magnitude of such a marginal effect for a person who is 20 years old? And for someone who is 50 years old? Compute the average effect in the sample, first "by hand" and then with the command `margins`.

We start with the manual computation of the marginal effects.

In [9]:
data <- data %>% mutate(age2=age**2) # Create age^2
knitr::kable(head(data), "simple") # See new dataframe



 nsup   salred   hhc   nivet   age   femme   eduy           w        lnw   age2
-----  -------  ----  ------  ----  ------  -----  ----------  ---------  -----
   55     2257    32      21    46       1     17   16.276442   2.789719   2116
    0     1400    40      41    40       1     12    8.076923   2.089011   1600
    0     2015    35      52    49       0     10   13.285714   2.586689   2401
    0     1392    39      61    49       1      9    8.236686   2.108598   2401
    0     1486    35      52    25       0     10    9.797802   2.282158    625
    0     1526    35      52    32       1     10   10.061538   2.308720   1024

We now run the new regression.

In [10]:
model2 <- lm(lnw ~ eduy + age + age2, data=data)
summary(model2)


Call:
lm(formula = lnw ~ eduy + age + age2, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1176 -0.1913 -0.0023  0.1917  3.8862 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.116e-01  3.818e-02   10.78   <2e-16 ***
eduy         6.187e-02  7.656e-04   80.82   <2e-16 ***
age          4.731e-02  2.167e-03   21.84   <2e-16 ***
age2        -4.588e-04  3.011e-05  -15.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.354 on 31233 degrees of freedom
Multiple R-squared:  0.2217,	Adjusted R-squared:  0.2216 
F-statistic:  2965 on 3 and 31233 DF,  p-value: < 2.2e-16


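The marginal effect of age is estimated by $\widehat{\beta_2}+ 2\widehat{\beta_3}\,\text{age}$, where $\widehat{\beta_2}$ and $\widehat{\beta_3}$ are the estimated coefficients of `age` and `age`$^2$. It depends on age, so we compute its value at ages 20 and 50 and at the average age.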

In [11]:
mean_age <- mean(data$age)
effect_20 <- round((coef(model2)["age"] + 2*coef(model2)["age2"]*20),digits=3)
effect_50 <- round(coef(model2)["age"] + 2*coef(model2)["age2"]*50,digits=3)
effect_mean <- round(coef(model2)["age"] + 2*coef(model2)["age2"]*mean_age,digits=3)
mean_age <- round(mean_age,digits=3)

In [12]:
cat("The marginal effect of age on the log-hourly wages at age 20 is", effect_20,"\n")
cat("The marginal effect of age on the log-hourly wages at age 50 is", effect_50,"\n")
cat("The marginal effect of age on the log-hourly wages at the mean age",mean_age,"is", effect_mean)

The marginal effect of age on the log-hourly wages at age 20 is 0.029 
The marginal effect of age on the log-hourly wages at age 50 is 0.001 
The marginal effect of age on the log-hourly wages at the mean age 36.468 is 0.014
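The question asks for the average effect in the sample, while the code above evaluates the effect at the mean age. The two are equal here because the marginal effect is linear in age. The sketch below checks this on simulated data; the data frame `sim` and all parameter values are made up for illustration and are not taken from `emp2007.dta`.

```r
set.seed(1)
n <- 500

# Simulated data (hypothetical values, for illustration only)
sim <- data.frame(age = sample(20:50, n, replace = TRUE),
                  eduy = sample(5:17, n, replace = TRUE))
sim$lnw <- 0.4 + 0.06 * sim$eduy + 0.047 * sim$age -
  0.00046 * sim$age^2 + rnorm(n, sd = 0.35)

fit <- lm(lnw ~ eduy + age + I(age^2), data = sim)
b <- coef(fit)

# Individual marginal effects, then their sample average
me_i <- b["age"] + 2 * b["I(age^2)"] * sim$age
ame <- mean(me_i)

# Marginal effect evaluated at the mean age
mem <- unname(b["age"] + 2 * b["I(age^2)"] * mean(sim$age))

print(all.equal(ame, mem))  # the two quantities agree up to rounding error
```

With a specification that is not linear in the variable of interest after differentiation (for example a cubic in age), the two quantities would generally differ.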

Now we compute the marginal effects using the command `margins`. So that `R` understands that `age`$^2$ is a function of `age`, we use the syntax `I(age^2)` inside the regression formula instead of `age2`.

In [13]:
model2_margins <- lm(lnw ~ eduy + age + I(age^2), data=data)
summary(model2_margins)


Call:
lm(formula = lnw ~ eduy + age + I(age^2), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1176 -0.1913 -0.0023  0.1917  3.8862 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.116e-01  3.818e-02   10.78   <2e-16 ***
eduy         6.187e-02  7.656e-04   80.82   <2e-16 ***
age          4.731e-02  2.167e-03   21.84   <2e-16 ***
I(age^2)    -4.588e-04  3.011e-05  -15.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.354 on 31233 degrees of freedom
Multiple R-squared:  0.2217,	Adjusted R-squared:  0.2216 
F-statistic:  2965 on 3 and 31233 DF,  p-value: < 2.2e-16


In [14]:
margins_20 <- margins(model2_margins, at=list(age=20))
margins_50 <- margins(model2_margins, at=list(age=50))
margins_mean <- margins(model2_margins, atmeans = TRUE)

In [15]:
print(margins_20)
print(margins_50)
print(margins_mean)

Average marginal effects at specified values

lm(formula = lnw ~ eduy + age + I(age^2), data = data)




 at(age)    eduy     age
      20 0.06187 0.02896


Average marginal effects at specified values

lm(formula = lnw ~ eduy + age + I(age^2), data = data)




 at(age)    eduy      age
      50 0.06187 0.001434


Average marginal effects

lm(formula = lnw ~ eduy + age + I(age^2), data = data)




    eduy     age
 0.06187 0.01385


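If we do not specify an option, the command `margins` computes the **average marginal effects (AME)**, which in general can differ from the marginal effects at the means. Here the two coincide for `age` because the marginal effect is linear in age, so its sample average equals its value at the mean age.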

In [16]:
margins_AME <- margins(model2_margins)
summary(margins_AME)

 factor          AME             SE          z    p        lower        upper
-------  -----------  -------------  ---------  ---  -----------  -----------
 age      0.01384922   0.0002452814   56.46259    0   0.01336848   0.01432997
 eduy     0.0618749    0.0007655666   80.82237    0   0.06037442   0.06337539


#### 4. Recover the coefficient of education in the previous regression using the Frisch-Waugh Theorem.

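We start by regressing `eduy` on the other covariates, `age` and `age`$^2$. Note that inside an `R` formula `age^2` is read as a formula operator (crossing `age` with itself), which reduces to `age`. The call below therefore regresses `eduy` on `age` alone, as its output shows (there is no squared term). As a consequence, the coefficient recovered at the end of this question, 0.0629, is the coefficient of `eduy` in the regression of question 2, not the 0.0619 of question 3. Writing `I(age^2)`, or using the column `age2`, would include the squared term.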

In [17]:
model3 <- lm(eduy ~ age + age^2, data=data)
summary(model3)


Call:
lm(formula = eduy ~ age + age^2, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2174 -1.8907 -0.4493  2.0394  5.8773 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.613835   0.065454  223.27   <2e-16 ***
age         -0.069823   0.001748  -39.95   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.627 on 31235 degrees of freedom
Multiple R-squared:  0.0486,	Adjusted R-squared:  0.04857 
F-statistic:  1596 on 1 and 31235 DF,  p-value: < 2.2e-16


We then recover the residuals from the regression.

In [18]:
data$u <- residuals(model3)

Finally, we run a regression of `lnw` on the residuals `u`.

In [19]:
model4 <- lm(lnw ~ u , data=data)
summary(model4)


Call:
lm(formula = lnw ~ u, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0357 -0.1978 -0.0074  0.2000  3.7612 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.2402492  0.0020683 1083.13   <2e-16 ***
u           0.0629493  0.0007872   79.96   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3656 on 31235 degrees of freedom
Multiple R-squared:  0.1699,	Adjusted R-squared:  0.1699 
F-statistic:  6394 on 1 and 31235 DF,  p-value: < 2.2e-16
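The theorem can also be checked with the squared term properly included, by writing `I(age^2)` in the first-stage formula. The sketch below does so on simulated data; the data frame `sim`, the object names and all parameter values are made up for illustration and are not taken from `emp2007.dta`.

```r
set.seed(2)
n <- 500

# Simulated data (hypothetical values, for illustration only)
sim <- data.frame(age = sample(20:50, n, replace = TRUE))
sim$eduy <- round(15 - 0.07 * sim$age + rnorm(n, sd = 2.5))
sim$lnw <- 0.4 + 0.06 * sim$eduy + 0.047 * sim$age -
  0.00046 * sim$age^2 + rnorm(n, sd = 0.35)

# Full regression with the quadratic term
full <- lm(lnw ~ eduy + age + I(age^2), data = sim)

# Frisch-Waugh: residualize eduy on the other covariates, then regress lnw on the residuals
first <- lm(eduy ~ age + I(age^2), data = sim)
sim$u <- residuals(first)
fwl <- lm(lnw ~ u, data = sim)

# The two coefficients agree up to rounding error
print(c(full = unname(coef(full)["eduy"]), fwl = unname(coef(fwl)["u"])))

# For contrast: without I(), age^2 collapses to age in the formula
wrong <- lm(eduy ~ age + age^2, data = sim)
print(names(coef(wrong)))  # "(Intercept)" "age"
```

Only the point estimate is reproduced by this two-step procedure. The standard error reported in the second step differs from that of the full regression, because the second step does not partial `age` and `age`$^2$ out of `lnw`.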


#### 5. Build the variable `pexp = age - eduy - 6`. Explain why it is a proxy of the professional experience. Compute the regression of `lnw` on `eduy`, `age` and `pexp`. Explain the results.

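We first create the new variable. Since the count of years of education starts at 6 years old, `age - eduy - 6` is the number of years elapsed since the end of schooling. It equals actual professional experience only for people who have worked continuously since leaving school, which is why it is a proxy (often called potential experience).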

In [None]:
data <- data %>% mutate(pexp = age - eduy - 6)


We then run the new regression.

In [None]:
model5 <- lm(lnw ~ eduy + age + pexp, data = data)
summary(model5)
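By construction, `pexp` is an exact linear combination of `age`, `eduy` and the constant, so the three regressors are perfectly collinear and their coefficients cannot be identified separately. The sketch below illustrates how `lm` handles such a case on simulated data; the data frame `sim` and all parameter values are made up for illustration, and one would expect the same behavior with `emp2007.dta`.

```r
set.seed(3)
n <- 200

# Simulated data (hypothetical values, for illustration only)
sim <- data.frame(age = sample(20:50, n, replace = TRUE),
                  eduy = sample(5:17, n, replace = TRUE))
sim$pexp <- sim$age - sim$eduy - 6
sim$lnw <- 0.9 + 0.06 * sim$eduy + 0.015 * sim$age + rnorm(n, sd = 0.35)

fit <- lm(lnw ~ eduy + age + pexp, data = sim)

# lm drops the redundant regressor: the coefficient of pexp is reported as NA
print(coef(fit))
```

Which variable is dropped depends only on the order of the regressors in the formula: `lm` keeps the first ones and reports `NA` for the last one that is linearly dependent on them.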