# <center> ECONOMETRICS 2: L3 MIASH, S2</center>
## <center> ASSIGNMENT 3 </center>
## <center> OMITTED RELEVANT EXPLANATORY VARIABLES AND ENDOGENEITY: THE CASE OF WAGE EQUATIONS AND AN APPLICATION TO CARD (1995)</center>
### <center>Michal Urdanivia (UGA)</center>
#### <center> michal.wong-urdanivia@univ-grenoble-alpes.fr </center>


In [None]:
# Install the environment:
#!conda env create -f environment.yml

### Application: Card (1995)
Throughout this notebook we illustrate the discussion empirically using the data from David Card (1995), "Using Geographic Variation in College Proximity to Estimate the Return to Schooling". A 1993 version of the paper is available here:

https://davidcard.berkeley.edu/papers/geo_var_schooling.pdf

We download the data from Bruce Hansen's website (they are also available on David Card's own site). A description of the file is available here:

https://www.ssc.wisc.edu/~bhansen/econometrics/Card1995_description.pdf

It is also advisable to read the presentation of the data in David Card's paper.

In [1]:
# Reading the data
## Import the libraries used below.
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns
import statsmodels.api as sm
from linearmodels.iv import IV2SLS
## Data
url = "https://www.ssc.wisc.edu/~bhansen/econometrics/Card1995.dta"
df = pd.read_stata(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3613 entries, 0 to 3612
Data columns (total 52 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        3613 non-null   int16  
 1   nearc2    3613 non-null   int8   
 2   nearc4    3613 non-null   int8   
 3   nearc4a   3613 non-null   int8   
 4   nearc4b   3613 non-null   int8   
 5   ed76      3613 non-null   int8   
 6   ed66      3613 non-null   int8   
 7   age76     3613 non-null   int8   
 8   daded     3613 non-null   float32
 9   nodaded   3613 non-null   int8   
 10  momed     3613 non-null   float32
 11  nomomed   3613 non-null   int8   
 12  weight    3613 non-null   int32  
 13  momdad14  3613 non-null   int8   
 14  sinmom14  3613 non-null   int8   
 15  step14    3613 non-null   int8   
 16  reg661    3613 non-null   int8   
 17  reg662    3613 non-null   int8   
 18  reg663    3613 non-null   int8   
 19  reg664    3613 non-null   int8   
 20  reg665    3613 non-null   int8

In [24]:
# Working sample
## We keep only the variables used in the estimations of column 2 of Table 2 in the paper,
## and the observations with no missing values for these variables.
## Note: 'smsa66r' is listed twice below, so it appears twice in the resulting DataFrame.
card = df[["lwage76", "ed76", "age76", "south66", "black", "smsa76r", "smsa66r", "nearc4", "daded", "nearc2",
           'smsa66r','reg661', 'reg662', 'reg663', 'reg664', 'reg665', 'reg666', 'reg667', 'reg668', 'reg669']]
card = card.dropna()
## Convert the variables to floating-point numbers
for col in list(card):
    card[col] = card[col].astype(float)
card.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3010 entries, 0 to 3612
Data columns (total 20 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   lwage76  3010 non-null   float64
 1   ed76     3010 non-null   float64
 2   age76    3010 non-null   float64
 3   south66  3010 non-null   float64
 4   black    3010 non-null   float64
 5   smsa76r  3010 non-null   float64
 6   smsa66r  3010 non-null   float64
 7   nearc4   3010 non-null   float64
 8   daded    3010 non-null   float64
 9   nearc2   3010 non-null   float64
 10  smsa66r  3010 non-null   float64
 11  reg661   3010 non-null   float64
 12  reg662   3010 non-null   float64
 13  reg663   3010 non-null   float64
 14  reg664   3010 non-null   float64
 15  reg665   3010 non-null   float64
 16  reg666   3010 non-null   float64
 17  reg667   3010 non-null   float64
 18  reg668   3010 non-null   float64
 19  reg669   3010 non-null   float64
dtypes: float64(20)
memory usage: 493.8 KB


In [25]:
# Creating some variables.
## Experience ('exp'), experience squared ('expsq'),
## and experience squared divided by 100 ('expsq100').
## Note: exp is computed here as age76 - ed76 + 6 (potential experience is more commonly defined as age - education - 6).
card['exp'] = card.age76 - card.ed76 + 6 
card['expsq'] = card['exp']**2
card['expsq100'] = card['expsq']/100
card.head()

,lwage76,ed76,age76,south66,black,smsa76r,smsa66r,nearc4,daded,nearc2,...,reg663,reg664,reg665,reg666,reg667,reg668,reg669,exp,expsq,expsq100
0,6.306275,7.0,29.0,0.0,1.0,1.0,1.0,0.0,9.94,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,784.0,7.84
1,6.175867,12.0,27.0,0.0,0.0,1.0,1.0,0.0,8.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.0,441.0,4.41
2,6.580639,12.0,34.0,0.0,0.0,1.0,1.0,0.0,14.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,784.0,7.84
3,5.521461,11.0,27.0,0.0,0.0,1.0,1.0,1.0,11.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,484.0,4.84
4,6.591674,12.0,34.0,0.0,0.0,1.0,1.0,1.0,8.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,784.0,7.84
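
The cell above computes `exp` as `age76 - ed76 + 6`, whereas potential experience is usually defined as age minus education minus 6; the two measures differ by a constant 12. The sketch below, on synthetic data (all names and numeric values are hypothetical, not the Card sample), suggests that in a specification with a constant, `exp` and `exp**2/100`, such a shift leaves the coefficients on education and on the squared term unchanged and only moves the intercept and the coefficient on `exp`.

```python
import numpy as np

# Synthetic data: names and values are hypothetical, not the Card (1995) sample.
rng = np.random.default_rng(1)
n = 5_000
age = rng.integers(24, 35, size=n).astype(float)    # ages 24..34
educ = rng.integers(8, 19, size=n).astype(float)    # 8..18 years of schooling
exp_true = age - educ - 6
lwage = (4.5 + 0.07 * educ + 0.08 * exp_true - 0.2 * exp_true**2 / 100
         + rng.normal(scale=0.4, size=n))

def fit(exp):
    """OLS of lwage on (const, educ, exp, exp^2/100)."""
    X = np.column_stack([np.ones(n), educ, exp, exp**2 / 100])
    return np.linalg.lstsq(X, lwage, rcond=None)[0]

b_minus = fit(age - educ - 6)   # usual definition of potential experience
b_plus = fit(age - educ + 6)    # definition shifted by 12, as in the cell above

print("educ     :", b_minus[1], b_plus[1])   # equal up to rounding
print("expsq100 :", b_minus[3], b_plus[3])   # equal up to rounding
print("exp      :", b_minus[2], b_plus[2])   # differ by -0.24 * (expsq100 coefficient)
```

The relation for the `exp` coefficient follows from expanding $(e' - 12)^2/100$, where $e'$ denotes the shifted measure: the linear term picks up $-0.24$ times the coefficient on the squared term.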


In [26]:
# Regressors and dependent variable.
## Add a constant regressor.
card = sm.add_constant(card, has_constant="add")

In [20]:
card.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3010 entries, 0 to 3612
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   const     3010 non-null   float64
 1   lwage76   3010 non-null   float64
 2   ed76      3010 non-null   float64
 3   age76     3010 non-null   float64
 4   south66   3010 non-null   float64
 5   black     3010 non-null   float64
 6   smsa76r   3010 non-null   float64
 7   smsa66r   3010 non-null   float64
 8   nearc4    3010 non-null   float64
 9   daded     3010 non-null   float64
 10  nearc2    3010 non-null   float64
 11  exp       3010 non-null   float64
 12  expsq     3010 non-null   float64
 13  expsq100  3010 non-null   float64
dtypes: float64(14)
memory usage: 352.7 KB


In [6]:
card.describe()

,const,lwage76,ed76,age76,south66,black,smsa76r,smsa66r,nearc4,daded,exp,expsq,expsq100
count,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0
mean,1.0,6.261832,13.263455,28.119601,0.414286,0.233555,0.712957,0.649502,0.68206,9.988904,20.856146,452.126578,4.521266
std,0.0,0.443798,2.676913,3.137004,0.49268,0.423162,0.452457,0.477205,0.465753,3.266501,4.141672,182.513175,1.825132
min,1.0,4.60517,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,144.0,1.44
25%,1.0,5.976985,12.0,25.0,0.0,0.0,0.0,0.0,0.0,8.0,18.0,324.0,3.24
50%,1.0,6.286928,13.0,28.0,0.0,0.0,1.0,1.0,1.0,9.94,20.0,400.0,4.0
75%,1.0,6.563503,16.0,31.0,1.0,0.0,1.0,1.0,1.0,12.0,23.0,529.0,5.29
max,1.0,7.784889,18.0,34.0,1.0,1.0,1.0,1.0,1.0,18.0,35.0,1225.0,12.25


For this application, the equation of the model whose parameters we estimate is therefore:
$$
lwage76_i = \beta_0 + \beta_1 ed76_i + \beta_2 exp_i + \beta_3 expsq100_i + \beta_4 black_i + \beta_5 south66_i + \beta_6 smsa76r_i + U_i, \ \ i= 1,\ldots, n
$$

for which we assume, in order to give the parameters a causal interpretation, that $\mathrm{E}(U_i| X_i) = 0$ (where $Y_i := lwage76_i$ and $X_i :=(1, ed76_i, exp_i, expsq100_i, black_i, south66_i, smsa76r_i)^\top$).
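
The exogeneity assumption fails when a relevant variable correlated with education is left in the error term. As a minimal sketch on synthetic data (the variable `ability` and every numeric value are hypothetical and unrelated to the Card sample), the simulation below compares the OLS slope on education with and without the omitted variable, and checks it against the usual omitted-variable bias formula.

```python
import numpy as np

# Synthetic illustration only: 'ability', 'educ' and all numeric values are
# hypothetical and unrelated to the Card (1995) data.
rng = np.random.default_rng(0)
n = 100_000
ability = rng.normal(size=n)                    # unobserved in practice
educ = 12 + 2 * ability + rng.normal(size=n)    # education depends on ability
u = rng.normal(size=n)
lwage = 1.0 + 0.08 * educ + 0.10 * ability + u  # true return to education: 0.08

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
beta_short = ols(lwage, np.column_stack([const, educ]))           # ability omitted
beta_long = ols(lwage, np.column_stack([const, educ, ability]))   # ability included

# Omitted-variable bias: gamma * Cov(educ, ability) / Var(educ)
bias_theory = 0.10 * 2 / (2**2 + 1)   # = 0.04
print(beta_short[1], beta_long[1], 0.08 + bias_theory)
```

In this design the short regression converges to the true return plus the bias term, which is why an instrument for education is needed when ability is not observed.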


# Ordinary least squares estimation.

In [7]:
from linearmodels.iv import IV2SLS

In [8]:
# With no endogenous regressor and no instrument (both None), IV2SLS reduces to OLS.
res_ols = IV2SLS(card['lwage76'],
                 card[['ed76', 'exp', 'expsq100', 'black', 'south66', 'smsa76r', 'const']],
                 None, None).fit(cov_type="robust")
print(res_ols)

                            OLS Estimation Summary                            
Dep. Variable:                lwage76   R-squared:                      0.2836
Estimator:                        OLS   Adj. R-squared:                 0.2822
No. Observations:                3010   F-statistic:                    1252.7
Date:                Tue, Apr 25 2023   P-value (F-stat)                0.0000
Time:                        11:04:02   Distribution:                  chi2(6)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
ed76           0.0743     0.0037     20.325     0.0000      0.0671      0.0814
exp            0.1395     0.0142     9.8419     0.00
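
The `cov_type="robust"` option requests heteroskedasticity-robust standard errors. The sketch below shows, on synthetic data with hypothetical values, the textbook White (HC0) sandwich formula. Whether `linearmodels` applies exactly this variant (for instance with or without a degrees-of-freedom correction) is an assumption to check against its documentation.

```python
import numpy as np

def ols_hc0(y, X):
    """OLS coefficients and White (HC0) heteroskedasticity-robust standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = (X * resid[:, None] ** 2).T @ X      # sum_i e_i^2 x_i x_i'
    cov = XtX_inv @ meat @ XtX_inv              # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

# Synthetic, heteroskedastic data (hypothetical values, not the Card sample).
rng = np.random.default_rng(2)
n = 3_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))   # error variance depends on x

beta, se_robust = ols_hc0(y, X)
print(beta, se_robust, beta / se_robust)   # estimates, robust s.e., t-statistics
```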

In [9]:
# 2SLS: first stage (regression of ed76 on the exogenous regressors and the instrument nearc4)
res_first = IV2SLS(card['ed76'],  card[['exp', 'expsq100', 'black', 'south66', 'smsa76r', 'nearc4', 'const']], None, None).fit(
    cov_type="robust"
)
print(res_first)

                            OLS Estimation Summary                            
Dep. Variable:                   ed76   R-squared:                      0.4752
Estimator:                        OLS   Adj. R-squared:                 0.4742
No. Observations:                3010   F-statistic:                    3673.0
Date:                Tue, Apr 25 2023   P-value (F-stat)                0.0000
Time:                        11:07:43   Distribution:                  chi2(6)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exp           -0.4250     0.0720    -5.9025     0.0000     -0.5662     -0.2839
expsq100       0.0677     0.1694     0.3995     0.68
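
The first stage is where the relevance of the instrument is assessed. The F-statistic printed in the summary tests all slopes jointly (hence the chi2(6) distribution), so it is not specific to the instrument. As a hedged sketch on synthetic data (hypothetical names and values, and a classical rather than robust F), the statistic for the excluded instrument can be obtained by comparing the first stage with and without it.

```python
import numpy as np

# Synthetic data: 'z' plays the role of a binary proximity dummy; values are hypothetical.
rng = np.random.default_rng(3)
n = 3_000
z = rng.integers(0, 2, size=n).astype(float)   # binary instrument
w = rng.normal(size=n)                         # an exogenous control
educ = 12 + 1.0 * z + 0.5 * w + rng.normal(scale=2.0, size=n)

def ssr(y, X):
    """Sum of squared OLS residuals of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

const = np.ones(n)
X_restricted = np.column_stack([const, w])        # instrument excluded
X_unrestricted = np.column_stack([const, w, z])   # instrument included

q = 1                                   # number of excluded instruments
k = X_unrestricted.shape[1]
ssr_r, ssr_u = ssr(educ, X_restricted), ssr(educ, X_unrestricted)
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(F)   # values well above 10 are commonly read as a sign of a strong instrument
```

With a single instrument this F equals the square of the classical t-statistic on the instrument in the first stage.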

In [10]:
# 2SLS: second stage (ed76 is the endogenous regressor, instrumented by nearc4)
res_second = IV2SLS(card['lwage76'],  
                           card[['exp', 'expsq100', 'black', 'south66', 'smsa76r', 'const']], 
                           card['ed76'], card['nearc4']).fit(
    cov_type="robust"
)
print(res_second)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                lwage76   R-squared:                      0.2022
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2006
No. Observations:                3010   F-statistic:                    732.22
Date:                Tue, Apr 25 2023   P-value (F-stat)                0.0000
Time:                        11:10:09   Distribution:                  chi2(6)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exp            0.1672     0.0277     6.0449     0.0000      0.1130      0.2214
expsq100      -0.2336     0.0352    -6.6357     0.00
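
The two-stage logic behind this estimator can be sketched by hand. The example below uses synthetic data (the confounder `ability`, the instrument `z` and all numeric values are hypothetical) in which OLS is biased by construction while 2SLS recovers a value close to the true coefficient.

```python
import numpy as np

# Synthetic data with an endogenous regressor; values are hypothetical.
rng = np.random.default_rng(4)
n = 200_000
ability = rng.normal(size=n)                         # unobserved confounder
z = rng.integers(0, 2, size=n).astype(float)         # binary instrument
educ = 12 + 1.0 * z + 1.5 * ability + rng.normal(size=n)
lwage = 1.0 + 0.08 * educ + 0.3 * ability + rng.normal(scale=0.3, size=n)

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
X = np.column_stack([const, educ])     # structural regressors
Z = np.column_stack([const, z])        # exogenous regressors + excluded instrument

# Stage 1: project the endogenous regressor on the instruments.
educ_hat = Z @ ols(educ, Z)
# Stage 2: replace educ by its fitted value.
b_2sls = ols(lwage, np.column_stack([const, educ_hat]))
b_ols = ols(lwage, X)

# Residuals used for inference must be built with the observed educ, not educ_hat.
resid_2sls = lwage - X @ b_2sls
print(b_ols[1], b_2sls[1])   # OLS is biased upward here; 2SLS is close to 0.08
```

Running the second stage as a plain OLS gives the right coefficients but the wrong standard errors, which is one reason to rely on a dedicated routine such as `IV2SLS`.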

In [11]:
from linearmodels.iv import compare

print(compare({"OLS": res_ols, "2SLS": res_second}))

                Model Comparison                
                               OLS          2SLS
------------------------------------------------
Dep. Variable              lwage76       lwage76
Estimator                      OLS       IV-2SLS
No. Observations              3010          3010
Cov. Est.                   robust        robust
R-squared                   0.2836        0.2022
Adj. R-squared              0.2822        0.2006
F-statistic                 1252.7        732.22
P-value (F-stat)            0.0000        0.0000
ed76                        0.0743        0.1394
                          (20.325)      (2.5946)
exp                         0.1395        0.1672
                          (9.8419)      (6.0449)
expsq100                   -0.2291       -0.2336
                         (-7.1910)     (-6.6357)
black                      -0.1896       -0.1284
                         (-10.417)     (-2.4200)
south66                    -0.0975       -0.0706
                    

An additional instrument for ed76: daded (father's education level)

In [13]:
res_second_daded = IV2SLS(card['lwage76'],  
                           card[['exp', 'expsq100', 'black', 'south66', 'smsa76r', 'const']], 
                           card['ed76'], card['daded']).fit(
    cov_type="robust"
)
print(res_second_daded)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                lwage76   R-squared:                      0.2828
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2814
No. Observations:                3010   F-statistic:                    834.00
Date:                Tue, Apr 25 2023   P-value (F-stat)                0.0000
Time:                        11:16:12   Distribution:                  chi2(6)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exp            0.1421     0.0155     9.1709     0.0000      0.1118      0.1725
expsq100      -0.2296     0.0320    -7.1748     0.00

Two instruments for ed76: nearc4 and daded

In [14]:
res_second_2VIs = IV2SLS(card['lwage76'],  
                           card[['exp', 'expsq100', 'black', 'south66', 'smsa76r', 'const']], 
                           card['ed76'], card[['daded', 'nearc4']]).fit(
    cov_type="robust"
)
print(res_second_2VIs)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                lwage76   R-squared:                      0.2815
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2801
No. Observations:                3010   F-statistic:                    834.60
Date:                Tue, Apr 25 2023   P-value (F-stat)                0.0000
Time:                        11:22:39   Distribution:                  chi2(6)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exp            0.1439     0.0155     9.3060     0.0000      0.1136      0.1742
expsq100      -0.2299     0.0321    -7.1568     0.00
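
With two instruments for a single endogenous regressor the model is overidentified, so the validity of the instruments can be examined. As a hedged sketch on synthetic data (all names and values are hypothetical, and homoskedastic errors are assumed), the Sargan statistic is $n R^2$ from regressing the 2SLS residuals on all instruments, compared here with a $\chi^2(1)$ distribution.

```python
import math
import numpy as np

def tsls(y, x_endog, W, Z_excl):
    """2SLS of y on (W, x_endog), instrumenting x_endog with Z_excl."""
    Z = np.column_stack([W, Z_excl])                       # full instrument matrix
    x_hat = Z @ np.linalg.lstsq(Z, x_endog, rcond=None)[0]
    beta = np.linalg.lstsq(np.column_stack([W, x_hat]), y, rcond=None)[0]
    resid = y - np.column_stack([W, x_endog]) @ beta       # residuals use observed x
    return beta, resid, Z

def sargan_df1(resid, Z):
    """n * R^2 of 2SLS residuals on all instruments; p-value for one overidentifying restriction."""
    fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]
    stat = len(resid) * (fitted @ fitted) / (resid @ resid)
    return stat, math.erfc(math.sqrt(stat / 2))            # survival function of chi2(1)

# Synthetic data (hypothetical values): z1 and z2 are instruments, a is unobserved.
rng = np.random.default_rng(5)
n = 20_000
a = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x = z1 + z2 + a + rng.normal(size=n)
W = np.ones((n, 1))                                        # constant only
y_valid = 1.0 + 0.08 * x + 0.3 * a + rng.normal(scale=0.3, size=n)
y_invalid = y_valid + 0.5 * z2      # z2 now affects y directly: exclusion fails

Z_excl = np.column_stack([z1, z2])
_, e_valid, Z = tsls(y_valid, x, W, Z_excl)
_, e_invalid, _ = tsls(y_invalid, x, W, Z_excl)
stat_valid, p_valid = sargan_df1(e_valid, Z)
stat_invalid, p_invalid = sargan_df1(e_invalid, Z)
print(stat_valid, p_valid)       # valid instruments: no systematic rejection expected
print(stat_invalid, p_invalid)   # invalid instrument: very large statistic
```

A rejection indicates that at least one instrument is invalid but does not say which one, and the test has no power if all instruments are invalid in the same direction.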

In [15]:
print(compare({"OLS": res_ols, "2SLS": res_second, 
               "2SLS: daded": res_second_daded, "2SLS: 2VIs": res_second_2VIs}))

                              Model Comparison                              
                               OLS          2SLS   2SLS: daded    2SLS: 2VIs
----------------------------------------------------------------------------
Dep. Variable              lwage76       lwage76       lwage76       lwage76
Estimator                      OLS       IV-2SLS       IV-2SLS       IV-2SLS
No. Observations              3010          3010          3010          3010
Cov. Est.                   robust        robust        robust        robust
R-squared                   0.2836        0.2022        0.2828        0.2815
Adj. R-squared              0.2822        0.2006        0.2814        0.2801
F-statistic                 1252.7        732.22        834.00        834.60
P-value (F-stat)            0.0000        0.0000        0.0000        0.0000
ed76                        0.0743        0.1394        0.0805        0.0847
                          (20.325)      (2.5946)      (5.0427)      (5.4982)
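
Comparing the OLS and 2SLS coefficients on education raises the question of whether education is in fact endogenous. A regression-based version of the Durbin-Wu-Hausman test adds the first-stage residuals to the wage equation and tests their coefficient. The sketch below uses synthetic data (hypothetical names and values) and classical standard errors as a simplification.

```python
import numpy as np

# Synthetic data with an endogenous regressor; values are hypothetical.
rng = np.random.default_rng(6)
n = 50_000
a = rng.normal(size=n)                                   # unobserved confounder
z = rng.integers(0, 2, size=n).astype(float)             # binary instrument
educ = 12 + 1.0 * z + 1.5 * a + rng.normal(size=n)
lwage = 1.0 + 0.08 * educ + 0.3 * a + rng.normal(scale=0.3, size=n)

def ols(y, X):
    """OLS coefficients, classical standard errors and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se, resid

const = np.ones(n)
_, _, v_hat = ols(educ, np.column_stack([const, z]))     # first-stage residuals

# Augmented regression: wage equation plus first-stage residuals.
b_cf, se_cf, _ = ols(lwage, np.column_stack([const, educ, v_hat]))
t_endog = b_cf[2] / se_cf[2]     # large |t| suggests educ is endogenous

# The coefficient on educ in the augmented regression equals the 2SLS estimate.
educ_hat = educ - v_hat
b_2sls = np.linalg.lstsq(np.column_stack([const, educ_hat]), lwage, rcond=None)[0]
print(t_endog, b_cf[1], b_2sls[1])
```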

In [21]:
# 2SLS with two instruments for ed76: nearc2 and nearc4 (this reuses the name res_second_2VIs)
res_second_2VIs = IV2SLS(card['lwage76'],  
                           card[['exp', 'expsq100', 'black', 'south66', 'smsa76r', 'const']], 
                           card['ed76'], card[['nearc2', 'nearc4']]).fit(
    cov_type="robust"
)
print(res_second_2VIs)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                lwage76   R-squared:                      0.0885
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0867
No. Observations:                3010   F-statistic:                    643.73
Date:                Tue, Apr 25 2023   P-value (F-stat)                0.0000
Time:                        11:27:50   Distribution:                  chi2(6)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exp            0.1824     0.0287     6.3572     0.0000      0.1261      0.2386
expsq100      -0.2360     0.0382    -6.1796     0.00

OLS with regional dummies: column 2 of Table 2