# <center> UGA </center>
# <center> ECONOMETRICS 2: L3 MIASH, S2</center>
## <center> ASSIGNMENT 2: </center>
## <center> LINEAR PROJECTION AND REGRESSION</center>
#### <center>Michal Urdanivia (UGA)</center>
#### <center> michal.wong-urdanivia@univ-grenoble-alpes.fr </center>

In [1]:
# Import pandas and the other libraries used below.
# Notes:
# - anything written after "#" is a comment and is not executed as code.
# - pd, np, etc. below are aliases given to the corresponding libraries
#   (they are conventional, as a quick look on the web will show)


import pandas as pd   
import numpy as np
import statsmodels.api as sm
from sklearn import linear_model

In [2]:
# Reading the data.
# We use the pandas function "read_stata" to read the Stata-format (".dta") file available
# on Bruce Hansen's website. You can also download it to your machine and read it from there.
# We call it cps_df (for "CPS data frame").

cps_df = pd.read_stata("https://www.ssc.wisc.edu/~bhansen/econometrics/cps09mar.dta")
cps_df.info()   # Display summary information.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 50742 entries, 0 to 50741
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        50742 non-null  float64
 1   female     50742 non-null  float64
 2   hisp       50742 non-null  float64
 3   education  50742 non-null  float64
 4   earnings   50742 non-null  float64
 5   hours      50742 non-null  float64
 6   week       50742 non-null  float64
 7   union      50742 non-null  float64
 8   uncov      50742 non-null  float64
 9   region     50742 non-null  float64
 10  race       50742 non-null  float64
 11  marital    50742 non-null  float64
dtypes: float64(12)
memory usage: 5.0 MB


In [3]:
# Display the first rows (5 by default) and basic descriptive statistics (means, standard deviations, etc.)
print(cps_df.head())
cps_df.describe()

    age  female  hisp  education  earnings  hours  week  union  uncov  region  \
0  52.0     0.0   0.0       12.0  146000.0   45.0  52.0    0.0    0.0     1.0   
1  38.0     0.0   0.0       18.0   50000.0   45.0  52.0    0.0    0.0     1.0   
2  38.0     0.0   0.0       14.0   32000.0   40.0  51.0    0.0    0.0     1.0   
3  41.0     1.0   0.0       13.0   47000.0   40.0  52.0    0.0    0.0     1.0   
4  42.0     0.0   0.0       13.0  161525.0   50.0  52.0    1.0    0.0     1.0   

   race  marital  
0   1.0      1.0  
1   1.0      1.0  
2   1.0      1.0  
3   1.0      1.0  
4   1.0      1.0  


statistic,age,female,hisp,education,earnings,hours,week,union,uncov,region,race,marital
count,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0,50742.0
mean,42.131725,0.425722,0.148792,13.924619,55091.530685,43.827244,51.879272,0.021521,0.002207,2.635627,1.433507,2.763174
std,11.48762,0.494457,0.355887,2.744447,52222.071166,7.704467,0.598646,0.145113,0.04693,1.060051,1.31743,2.503158
min,15.0,0.0,0.0,0.0,1.0,36.0,48.0,0.0,0.0,1.0,1.0,1.0
25%,33.0,0.0,0.0,12.0,28000.0,40.0,52.0,0.0,0.0,2.0,1.0,1.0
50%,42.0,0.0,0.0,13.0,42000.0,40.0,52.0,0.0,0.0,3.0,1.0,1.0
75%,51.0,1.0,0.0,16.0,65000.0,45.0,52.0,0.0,0.0,4.0,1.0,5.0
max,85.0,1.0,1.0,20.0,561087.0,99.0,52.0,1.0,1.0,4.0,21.0,7.0


In [4]:
# Sample: keep observations with race code 1 or 2 (renamed "white" and "black" below)

cps_df2 = cps_df[(cps_df.race == 1.0) | (cps_df.race == 2.0)]

# Variables

cps_df2 = cps_df2.assign(exper = cps_df2.age - cps_df2.education - 6) # potential experience
cps_df2 = cps_df2.assign(expersq = cps_df2.exper**2/100) # experience squared, divided by 100
cps_df2 = cps_df2.assign(lwage = np.log(cps_df2.earnings / ( cps_df2.hours * cps_df2.week))) # log hourly wage
cps_df2 = pd.get_dummies(data = cps_df2, columns= ['race']) # race indicator variables
cps_df2 = cps_df2.rename(columns={"race_1.0": "white", "race_2.0": "black"}) # rename them
print(cps_df2.shape)
cps_df2.describe()
#cps_df2[['exper', 'age', 'education', 'expersq', 'lwage', 'earnings', 'week', 'hours']].head()


(46411, 16)


statistic,age,female,hisp,education,earnings,hours,week,union,uncov,region,marital,exper,expersq,lwage,white,black
count,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0,46411.0
mean,42.213915,0.423477,0.154468,13.882269,55082.729181,43.879964,51.879554,0.021934,0.002262,2.597789,2.754584,22.331646,6.337939,2.945706,0.889358,0.110642
std,11.468616,0.494115,0.3614,2.713667,52324.915589,7.701222,0.596815,0.146471,0.047511,1.047513,2.497897,11.623014,5.635276,0.673137,0.313692,0.313692
min,15.0,0.0,0.0,0.0,1.0,36.0,48.0,0.0,0.0,1.0,1.0,-4.0,0.0,-7.863267,0.0,0.0
25%,33.0,0.0,0.0,12.0,28000.0,40.0,52.0,0.0,0.0,2.0,1.0,13.0,1.69,2.560096,1.0,0.0
50%,42.0,0.0,0.0,13.0,42000.0,40.0,52.0,0.0,0.0,3.0,1.0,22.0,4.84,2.956512,1.0,0.0
75%,51.0,1.0,0.0,16.0,65000.0,45.0,52.0,0.0,0.0,3.0,5.0,31.0,9.61,3.354542,1.0,0.0
max,85.0,1.0,1.0,20.0,561087.0,99.0,52.0,1.0,1.0,4.0,7.0,75.0,56.25,5.583706,1.0,1.0
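
As a quick check of the variable construction, the sketch below repeats the same transformations on a small frame. The first two rows reuse values printed by `head()` above; the third row and the name `toy` are hypothetical and serve only to produce a non-zero `black` indicator.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 come from the head() output; row 2 is made up for illustration.
toy = pd.DataFrame({
    "age": [52.0, 38.0, 30.0],
    "education": [12.0, 18.0, 16.0],
    "earnings": [146000.0, 50000.0, 40000.0],
    "hours": [45.0, 45.0, 40.0],
    "week": [52.0, 52.0, 50.0],
    "race": [1.0, 1.0, 2.0],
})

toy = toy.assign(
    exper=toy.age - toy.education - 6,                      # potential experience
    lwage=np.log(toy.earnings / (toy.hours * toy.week)),    # log hourly wage
)
toy = toy.assign(expersq=toy.exper**2 / 100)                # experience squared / 100

# A float-coded column yields dummy names such as "race_1.0" and "race_2.0".
toy = pd.get_dummies(toy, columns=["race"])
toy = toy.rename(columns={"race_1.0": "white", "race_2.0": "black"})

print(toy[["exper", "expersq", "lwage", "white", "black"]])
```

Note that `exper` can be negative when `education + 6` exceeds `age`, which is consistent with the minimum of -4 reported in the descriptive statistics.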


## <center>  Frisch-Waugh-Lovell theorem: empirical application </center>

**Question (a), section 2.1.**

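We estimate the following linear equation by OLS:

$$
\begin{align}
  lwage_i &=  \alpha_0 + b_{1, 0} education_i + b_{2, 0} exper_i + b_{3, 0} exper_i^2 + b_{4, 0} female_i +  b_{5, 0} black_i + u_i,
\end{align}
$$

where $u_i$ is the error term of the model, which, depending on the assumptions made, is either a regression model or a linear projection. In the code, the quadratic term is the variable `expersq`, defined as $exper_i^2/100$, so the estimate of $b_{3, 0}$ is scaled accordingly.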
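
To make explicit what the OLS coefficients are, here is a minimal sketch on simulated data (not the CPS sample). The regressors, the sample size and `beta_true` are hypothetical; the point is only that the OLS coefficients solve the normal equations $X'Xb = X'y$, so the residuals are orthogonal to every regressor, including the constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical design: a constant and two simulated regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 0.5, -0.3])          # assumed coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# OLS coefficients: solve the normal equations (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals are orthogonal to the columns of X (up to rounding error).
resid = y - X @ beta_hat
print("beta_hat:", beta_hat)
print("X'resid :", X.T @ resid)
```

Solving the linear system is preferred here to forming the inverse of $X'X$ explicitly, which is a common numerical choice rather than anything specific to this assignment.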

In [5]:
# Variables (dependent variable and regressors)

dep_var = cps_df2['lwage']
reg_var = cps_df2[['education', 'female', 'black', 'exper', 'expersq']]

In [6]:
model = sm.OLS(dep_var, sm.add_constant(reg_var), missing = 'drop')
results = model.fit(cov_type='HC0')
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.280
Method:                 Least Squares   F-statistic:                     2869.
Date:                Fri, 24 Mar 2023   Prob (F-statistic):               0.00
Time:                        14:15:35   Log-Likelihood:                -39873.
No. Observations:               46411   AIC:                         7.976e+04
Df Residuals:                   46405   BIC:                         7.981e+04
Df Model:                           5                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0208      0.019     55.132      0.0

The standard errors reported above are robust to heteroskedasticity: this is what the argument `cov_type='HC0'` passed to `fit()` requests.
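
As an illustration of what the `HC0` option is understood to compute, the sketch below builds the White sandwich estimator $(X'X)^{-1} X' \mathrm{diag}(\hat u_i^2) X (X'X)^{-1}$ by hand with NumPy and compares it with the classical formula. The data are simulated and the helper name `ols_hc0` is hypothetical; this is a sketch of the formula, not the statsmodels implementation.

```python
import numpy as np

def ols_hc0(X, y):
    """Return OLS coefficients and HC0 (White) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                        # OLS residuals
    meat = (X * (u ** 2)[:, None]).T @ X    # sum_i u_i^2 x_i x_i'
    cov = XtX_inv @ meat @ XtX_inv          # sandwich formula
    return beta, np.sqrt(np.diag(cov))

# Simulated heteroskedastic data: the error variance grows with x^2.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + x * rng.normal(size=n)

beta, se_hc0 = ols_hc0(X, y)

# Classical (homoskedastic) standard errors, for comparison.
u = y - X @ beta
sigma2 = u @ u / (n - X.shape[1])
se_classical = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

print("HC0      :", se_hc0)
print("classical:", se_classical)
```

In this design the classical formula understates the uncertainty of the slope, because the error variance is largest where the regressor is far from its mean.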

**Question (b), section 2.1.**

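We apply the following procedure:

 1. Estimate the linear projection of $female_i$ on the other regressors (including the constant) and compute the residuals.
 2. Estimate the linear projection of $lwage_i$ on the same regressors, that is, without $female_i$, and compute the residuals.
 3. Estimate the projection of the residuals from step 2 on those from step 1.

By the Frisch-Waugh-Lovell theorem, the slope obtained in step 3 equals the OLS coefficient on $female_i$ in the full regression of question (a).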
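
The three steps can be checked on simulated data before running them on the CPS sample. In the sketch below the controls `W`, the binary regressor `D`, the coefficients and the helper `ols` are all hypothetical; only the partialling-out logic mirrors the procedure above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical controls (with constant) and a binary regressor correlated with them.
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
D = (W[:, 1] + rng.normal(size=n) > 0).astype(float)
y = 1.0 + 0.5 * W[:, 1] - 0.2 * W[:, 2] - 0.25 * D + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Full regression of y on (W, D): the coefficient on D is the last one.
X = np.column_stack([W, D])
b_full = ols(X, y)
u_full = y - X @ b_full

# Step 1: residuals of D on W.  Step 2: residuals of y on W.
D_res = D - W @ ols(W, D)
y_res = y - W @ ols(W, y)

# Step 3: regress y_res on D_res without a constant.
b_fwl = (D_res @ y_res) / (D_res @ D_res)
u_fwl = y_res - b_fwl * D_res

# HC0 standard error of the coefficient on D, computed both ways.
XtX_inv = np.linalg.inv(X.T @ X)
cov_full = XtX_inv @ (X * (u_full ** 2)[:, None]).T @ X @ XtX_inv
se_full = np.sqrt(cov_full[-1, -1])
se_fwl = np.sqrt(np.sum(D_res ** 2 * u_fwl ** 2)) / (D_res @ D_res)

print("coefficient:", b_full[-1], b_fwl)
print("HC0 s.e.   :", se_full, se_fwl)
```

With `HC0` the robust standard error is also the same in both regressions, because the residuals coincide and `HC0` applies no degrees-of-freedom correction. Classical standard errors would differ slightly, since step 3 counts only one estimated parameter.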

In [7]:
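W = reg_var.drop(columns = ['female'])   # controls: all regressors except female
D = reg_var['female']                    # regressor of interest


# Step 1: project female on the controls (plus constant) and keep the residuals.
Dreg = sm.OLS(endog = D, exog = sm.add_constant(W), missing = 'drop').fit(cov_type='HC0')
Dhat = Dreg.predict()
Dres = D - Dhat
# Step 2: project lwage on the same controls and keep the residuals.
Yreg = sm.OLS(endog = dep_var, exog = sm.add_constant(W), missing = 'drop').fit(cov_type='HC0')
Yhat = Yreg.predict()
Yres = dep_var - Yhat

# Step 3: regress the residuals from step 2 on those from step 1
# (no constant is needed: both residual series have mean zero).
Y_partialReg = sm.OLS(endog = Yres, exog = Dres, missing = 'drop').fit(cov_type='HC0')
print(Y_partialReg.summary())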

                                 OLS Regression Results                                
Dep. Variable:                  lwage   R-squared (uncentered):                   0.049
Model:                            OLS   Adj. R-squared (uncentered):              0.049
Method:                 Least Squares   F-statistic:                              2470.
Date:                Fri, 24 Mar 2023   Prob (F-statistic):                        0.00
Time:                        14:15:39   Log-Likelihood:                         -39873.
No. Observations:               46411   AIC:                                  7.975e+04
Df Residuals:                   46410   BIC:                                  7.976e+04
Df Model:                           1                                                  
Covariance Type:                  HC0                                                  
                 coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------