In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

$\textbf{Empirical Exercise-Week 3}$  Page 189, Problem 4.27
Let's look at a similar example from your book, relating wages to experience in the work force, as well as to gender and racial differences. It is a larger data set with the same variables. We are asked to compare regression coefficients for four different groups based on gender and racial categories. It is, de facto, an empirical investigation of work-force discrimination.

In [2]:
Tab1 = pd.read_csv('cps5.csv');
Tab1.head(1)

Unnamed: 0,age,asian,black,divorced,educ,exper,faminc,female,hrswork,insure,...,metro,midwest,nchild,northeast,single,south,union,wage,west,white
0,45,0,0,0,13,26,39200,1,38,1,...,0,0,0,1,0,0,1,14.38,0,1


In [3]:
# Select columns by name; positional indices shift if the CSV carries an extra index column.
Aframer = Tab1['black']; exper = Tab1['exper']; female = Tab1['female']; edu = Tab1['educ']
wage = Tab1['wage']; white = Tab1['white']

In [4]:
Aframer1= np.array(Aframer); exper1 = np.array(exper); female1 = np.array(female);
wage1 = np.array(wage); white1 = np.array(white);
wage2 = np.log(wage1);


We extract from the data set subsets of data by gender and race (African American and white).

In [5]:
# Columns: wage, constant, experience. The row count comes from the data, not a hard-coded 9799.
Data = np.column_stack([wage1, np.ones(len(wage1)), exper1])
WhiteMale = np.logical_and(female1==0 , white1==1)
AframerMale = np.logical_and(Aframer1== 1 , female1==0)
WhiteFemale =  np.logical_and(female1==1 , white1== 1)
AframerFemale = np.logical_and(Aframer1== 1 , female1== 1)

In [6]:
Data_WhiteMale = Data[WhiteMale,:];
Data_AframerMale = Data[AframerMale,:];
Data_WhiteFemale = Data[WhiteFemale,:];
Data_AframerFemale = Data[AframerFemale,:];
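
The boolean-mask selection used above can be tried on a few made-up rows. This is only a sketch: the toy arrays below are invented for illustration and are not taken from cps5.csv.

```python
import numpy as np

# Made-up indicator variables and wages for five hypothetical workers
female_toy = np.array([0, 1, 0, 1, 1])
white_toy = np.array([1, 1, 0, 0, 1])
wage_toy = np.array([20.0, 15.0, 18.0, 12.0, 25.0])

# A mask is True only where both conditions hold (same idea as np.logical_and)
mask_white_female = (female_toy == 1) & (white_toy == 1)

# Indexing with the mask keeps only the rows where it is True
wage_white_female = wage_toy[mask_white_female]

print(mask_white_female.sum())   # number of rows selected
print(wage_white_female)
```

Summing a mask gives the size of the subset, which is a quick way to check each group before running a regression on it.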

In [7]:
CV_WhiteMale =  100 * np.std(Data_WhiteMale[:,0],ddof=1)/np.mean(Data_WhiteMale[:,0]);
CV_AframerMale =  100 * np.std(Data_AframerMale[:,0],ddof=1)/np.mean(Data_AframerMale[:,0]);
CV_WhiteFemale =  100 * np.std(Data_WhiteFemale[:,0],ddof=1)/np.mean(Data_WhiteFemale[:,0]);
CV_AframerFemale =  100 * np.std(Data_AframerFemale[:,0],ddof=1)/np.mean(Data_AframerFemale[:,0]);

In [8]:
[CV_WhiteMale, CV_AframerMale, CV_WhiteFemale, CV_AframerFemale]

[61.078005175743385, 56.643433405471995, 79.88263907959818, 68.7235291246936]

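We see that wages vary more, relative to their mean, for women than for men within each racial group. The coefficient of variation is largest for white females (about 80%), followed by African American females (about 69%), white males (about 61%), and African American males (about 57%).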
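
The statistic computed above is the coefficient of variation, 100 times the sample standard deviation divided by the mean. As a minimal sketch, it can be wrapped in a small helper; the name `coef_var` and the toy wages are assumptions made for illustration only.

```python
import numpy as np

def coef_var(x):
    """Coefficient of variation in percent: 100 * sample std (ddof=1) / mean."""
    x = np.asarray(x, dtype=float)
    return 100 * np.std(x, ddof=1) / np.mean(x)

# Made-up wages, not from the data set
toy_wages = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
cv_toy = coef_var(toy_wages)
print(round(cv_toy, 2))
```

Because the standard deviation is divided by the mean, the measure is unit-free, which is what makes it comparable across groups with different average wages.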

Let's run a regression, for each subset, of log(wage) on a constant and experience.

In [9]:
yy_WhiteMale = np.log(Data_WhiteMale[:,0]); xx_WhiteMale = Data_WhiteMale[:,1:];
yy_AframerMale = np.log(Data_AframerMale[:,0]); xx_AframerMale = Data_AframerMale[:,1:];
yy_WhiteFemale = np.log(Data_WhiteFemale[:,0]); xx_WhiteFemale = Data_WhiteFemale[:,1:];
yy_AframerFemale = np.log(Data_AframerFemale[:,0]); 
xx_AframerFemale = Data_AframerFemale[:,1:];


In [10]:
results_WhiteMale = sm.OLS(yy_WhiteMale, xx_WhiteMale).fit()
results_WhiteFemale = sm.OLS(yy_WhiteFemale, xx_WhiteFemale).fit()
results_AframerMale = sm.OLS(yy_AframerMale, xx_AframerMale).fit()
results_AframerFemale = sm.OLS(yy_AframerFemale, xx_AframerFemale).fit()
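
To see what each of these fits estimates, here is a rough sketch of the same log-linear regression on simulated data, using only NumPy least squares instead of `sm.OLS`. The sample size, the "true" coefficients, and the noise level are assumed values chosen for illustration; they are not estimates from cps5.csv.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated experience and log wages under assumed coefficients
exper_sim = rng.uniform(0, 40, size=n)
true_b0, true_b1 = 2.9, 0.007          # assumed, for illustration only
log_wage_sim = true_b0 + true_b1 * exper_sim + rng.normal(0, 0.5, size=n)

# Design matrix with a constant and experience, as in the cells above
X_sim = np.column_stack([np.ones(n), exper_sim])

# Least-squares coefficients: [intercept, slope on experience]
beta_sim, *_ = np.linalg.lstsq(X_sim, log_wage_sim, rcond=None)
resid_sim = log_wage_sim - X_sim @ beta_sim

print(beta_sim)
# In a log-linear model, 100 * slope is roughly the percent change in wage per year
print(100 * beta_sim[1])
```

With log(wage) as the dependent variable, a slope of 0.007 would correspond to roughly a 0.7% higher wage per additional year of experience.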

Let's look at the 95% confidence intervals for the experience coefficient in the four subsets (rows: white males, white females, African American males, African American females).

In [12]:
np.array([results_WhiteMale.conf_int()[1,:], results_WhiteFemale.conf_int()[1,:], results_AframerMale.conf_int()[1,:],results_AframerFemale.conf_int()[1,:]])

array([[0.00539516, 0.00787116],
       [0.00219849, 0.00492729],
       [0.00150336, 0.00898454],
       [0.00062896, 0.00826427]])

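You can readily see that the interval for white males lies entirely above the interval for white females, so the two do not overlap. It does overlap the much wider intervals for African American males and females, so the white male coefficient is not always above theirs.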
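
The comparison of intervals can be checked mechanically. The sketch below copies the four intervals printed above and tests each one against the white male interval; the helper name `overlaps` is an assumption for illustration. Keep in mind that overlap is only an informal check: two intervals can overlap even when a formal test of the difference would reject equality.

```python
# 95% confidence intervals for the experience coefficient, copied from the output above
ci = {
    "WhiteMale": (0.00539516, 0.00787116),
    "WhiteFemale": (0.00219849, 0.00492729),
    "AframerMale": (0.00150336, 0.00898454),
    "AframerFemale": (0.00062896, 0.00826427),
}

def overlaps(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Compare every other group with white males
overlap_with_wm = {
    name: overlaps(ci["WhiteMale"], interval)
    for name, interval in ci.items()
    if name != "WhiteMale"
}
print(overlap_with_wm)
```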

In [13]:
[results_WhiteMale.rsquared, results_WhiteFemale.rsquared, results_AframerMale.rsquared,results_AframerFemale.rsquared]

[0.022757519927815273,
 0.0070877934153377176,
 0.018543712882534535,
 0.010999470074775752]

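Above we see the overall goodness of fit. The highest R-squared is for white males (about 0.023), but all four values are small: experience alone explains less than 3% of the variation in log wages in every group.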
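
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The following sketch computes it by hand on simulated data (the coefficients and noise level are assumed for illustration) and compares it with the squared correlation between x and y, which it equals in a regression with a constant and a single regressor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data under assumed coefficients, not taken from cps5.csv
x_sim = rng.uniform(0, 40, size=200)
y_sim = 2.9 + 0.007 * x_sim + rng.normal(0, 0.5, size=200)

# Fit y on a constant and x by least squares
X_r2 = np.column_stack([np.ones_like(x_sim), x_sim])
beta_r2, *_ = np.linalg.lstsq(X_r2, y_sim, rcond=None)
fitted = X_r2 @ beta_r2

ssr = np.sum((y_sim - fitted) ** 2)        # residual sum of squares
sst = np.sum((y_sim - y_sim.mean()) ** 2)  # total sum of squares
r2 = 1 - ssr / sst

# With one regressor plus a constant, R-squared equals the squared correlation
r2_corr = np.corrcoef(x_sim, y_sim)[0, 1] ** 2

print(r2, r2_corr)
```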

In [14]:
print(results_WhiteMale.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.023
Method:                 Least Squares   F-statistic:                     110.3
Date:                Tue, 01 Feb 2022   Prob (F-statistic):           1.57e-25
Time:                        19:36:47   Log-Likelihood:                -3982.6
No. Observations:                4740   AIC:                             7969.
Df Residuals:                    4738   BIC:                             7982.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9009      0.017    173.446      0.0