# Mastering 'Metrics
## Chapter 2
### Regression

In [3]:
import pandas as pd

Table 2.1:
<img src="img/T2.1.png" width="50%" height="50%">

In [32]:
# Encode Table 2.1 as a DataFrame, one row per student.
# Application outcome codes for each school:
#   0: did not apply
#  -1: rejected
#  +1: admitted, did not attend
#  +2: admitted and attended

data = {'Ivy': [0, 0, 0, 2, 1, 0, 0, -1, -1],
        'Leafy': [-1, -1, -1, 0, 0, 2, 2, 0, 0],
        'Smart': [2, 2, 1, 0, 0, 0, 0, 0, 0],
        'AllState': [0, 0, 0, 1, 1, 0, 0, 2, 1],
        'TallState': [1, 1, 2, 0, 0, 0, 0, 1, 2],
        'AlteredState': [0, 0, 0, 1, 2, 0, 0, 0, 0],
        'Earnings': [110000, 100000, 110000, 60000, 30000, 115000, 75000, 90000, 60000]
        }

# Create the DataFrame and number the students from 1
df = pd.DataFrame(data, columns=['Ivy', 'Leafy', 'Smart', 'AllState', 'TallState', 'AlteredState', 'Earnings'])
df.index += 1

# Show the data
df

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
1,0,-1,2,0,1,0,110000
2,0,-1,2,0,1,0,100000
3,0,-1,1,0,2,0,110000
4,2,0,0,1,0,1,60000
5,1,0,0,1,0,2,30000
6,0,2,0,0,0,0,115000
7,0,2,0,0,0,0,75000
8,-1,0,0,2,1,0,90000
9,-1,0,0,1,2,0,60000


In [33]:
# Students who attended a private school (Ivy, Leafy, or Smart)
Avg_private_df = df[(df['Leafy'] == 2) | (df['Ivy'] == 2) | (df['Smart'] == 2)]
Avg_private_df

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
1,0,-1,2,0,1,0,110000
2,0,-1,2,0,1,0,100000
4,2,0,0,1,0,1,60000
6,0,2,0,0,0,0,115000
7,0,2,0,0,0,0,75000


In [34]:
Avg_private = Avg_private_df['Earnings'].mean()
Avg_private

92000.0

In [35]:
# Students who attended a public school (All State, Tall State, or Altered State)
Avg_public_df = df[(df['AllState'] == 2) | (df['TallState'] == 2) | (df['AlteredState'] == 2)]
Avg_public_df

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
3,0,-1,1,0,2,0,110000
5,1,0,0,1,0,2,30000
8,-1,0,0,2,1,0,90000
9,-1,0,0,1,2,0,60000


In [36]:
Avg_public = Avg_public_df['Earnings'].mean()
Avg_public

72500.0

The almost \$20,000 gap between these two groups (\$92,000 - \$72,500 = \$19,500) suggests a large private school advantage.
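The same uncontrolled gap can be computed in one step. The following is a minimal sketch, assuming a hypothetical `private` flag (1 if the student attended Ivy, Leafy, or Smart) that is not a column in the original table but is read off it by hand:

```python
import pandas as pd

# Attendance flags and earnings copied from Table 2.1.
# 'private' is an illustrative column: 1 = attended a private school, 0 = public.
earnings = pd.DataFrame({
    'private':  [1, 1, 0, 1, 0, 1, 1, 0, 0],
    'Earnings': [110000, 100000, 110000, 60000, 30000, 115000, 75000, 90000, 60000],
}, index=range(1, 10))

# Mean earnings by attendance status, then the private-minus-public difference
group_means = earnings.groupby('private')['Earnings'].mean()
raw_gap = group_means.loc[1] - group_means.loc[0]
print(raw_gap)  # 19500.0
```

This comparison ignores which schools each student applied to and was admitted by, which is why it is only a starting point.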



#### Groups
The students are organized into four groups: A, B, C, and D. Within each group, students are likely to have similar career ambitions, and they were judged to be of similar ability by admissions staff at the schools to which they applied. This lets us make apples-to-apples and oranges-to-oranges comparisons.

In [39]:
# Group A:
df[df.index<4]

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
1,0,-1,2,0,1,0,110000
2,0,-1,2,0,1,0,100000
3,0,-1,1,0,2,0,110000


In [41]:
# The mean earnings in group A
df[df.index<4]['Earnings'].mean()

106666.66666666667

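Within group A, the private school differential is negative: the two private school students (students 1 and 2) earn on average \$5,000 less than the public school student (student 3). In thousands of dollars:

$$\frac{110+100}{2}-110 = -5$$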

In [43]:
# Group B:
df[(df.index>3) & (df.index<6)]

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
4,2,0,0,1,0,1,60000
5,1,0,0,1,0,2,30000


In [44]:
# The mean earnings in group B
df[(df.index>3) & (df.index<6)]['Earnings'].mean()

45000.0

Group B has lower average earnings than group A.

The earnings differential between the student who attended a private school and the student who attended a public school in group B is \$30,000.

$$60000-30000=30000$$
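The two within-group comparisons can also be computed together. This is a sketch under the assumption that each student is labeled with a hypothetical `group` column and a `private` flag; neither exists in the original DataFrame:

```python
import pandas as pd

# Groups A and B only; 'group' and 'private' are illustrative labels
ab = pd.DataFrame({
    'group':    ['A', 'A', 'A', 'B', 'B'],
    'private':  [1, 1, 0, 1, 0],
    'Earnings': [110000, 100000, 110000, 60000, 30000],
})

# Mean earnings per group and attendance status, one column per status
means = ab.groupby(['group', 'private'])['Earnings'].mean().unstack('private')

# Private-minus-public differential within each group
differential = means[1] - means[0]
print(differential)
```

The result is -5000 for group A and 30000 for group B, the same figures as the hand calculations.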

In [45]:
# Group C:
df[(df.index>5) & (df.index<8)]

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
6,0,2,0,0,0,0,115000
7,0,2,0,0,0,0,75000


In [46]:
# Group D:
df[(df.index>7) & (df.index<10)]

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
8,-1,0,0,2,1,0,90000
9,-1,0,0,1,2,0,60000


The two students in group C both chose a private school, and the two students in group D both chose a public school. Groups C and D are uninformative because, from the perspective of our effort to estimate a private school treatment effect, each is composed of either all-treated or all-control individuals. In other words, their earnings reveal nothing about the value of a private education.
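One way to check which groups contain both treated and control students is to count the distinct attendance values per group. A small sketch, again assuming hypothetical `group` and `private` columns:

```python
import pandas as pd

# One row per student; labels are illustrative, taken from the table by hand
students = pd.DataFrame({
    'group':   ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'private': [1, 1, 0, 1, 0, 1, 1, 0, 0],
})

# A group is informative only if it has both private and public attendees
has_both = students.groupby('group')['private'].nunique() == 2
informative_groups = has_both[has_both].index.tolist()
print(informative_groups)  # ['A', 'B']
```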

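Using only groups A and B, the simple average of the two differentials, -\$5,000 for group A and \$30,000 for group B, is \$12,500.

$$\frac{30000-5000}{2} = 12500$$

Weighting each group by its share of the five students gives:

$$\left(\frac{3}{5} \times (-5000)\right)+\left(\frac{2}{5} \times 30000\right) = 9000$$

Because it emphasizes the larger group, the weighted average uses the data more efficiently and may therefore generate a statistically more precise summary of the private-public earnings differential.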
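Both averages are easy to reproduce. The dictionary names below are illustrative only:

```python
# Within-group private-minus-public differentials and group sizes from the text
effects = {'A': -5000, 'B': 30000}
sizes = {'A': 3, 'B': 2}

# Simple average: each group counts equally
simple_avg = sum(effects.values()) / len(effects)

# Weighted average: each group counts in proportion to its number of students
weighted_avg = sum(sizes[g] * effects[g] for g in effects) / sum(sizes.values())

print(simple_avg, weighted_avg)  # 12500.0 9000.0
```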

In [47]:
# Group A & B:
df[df.index<6]

Unnamed: 0,Ivy,Leafy,Smart,AllState,TallState,AlteredState,Earnings
1,0,-1,2,0,1,0,110000
2,0,-1,2,0,1,0,100000
3,0,-1,1,0,2,0,110000
4,2,0,0,1,0,1,60000
5,1,0,0,1,0,2,30000


Let $P_i$ indicate whether student $i$ attended a private college or university, and let $A_i$ indicate whether the student belongs to group A. Both $A_i$ and $P_i$ are dummy variables: they equal 1 for observations in a specific state or condition and 0 otherwise.
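The two dummies can be derived from the attendance codes rather than typed by hand. A minimal sketch, assuming the coding used in this notebook (2 means admitted and attended) and that students 1 to 3 form group A:

```python
import pandas as pd

# Private school columns for the five students in groups A and B
private_cols = ['Ivy', 'Leafy', 'Smart']
attended = pd.DataFrame({
    'Ivy':   [0, 0, 0, 2, 1],
    'Leafy': [-1, -1, -1, 0, 0],
    'Smart': [2, 2, 1, 0, 0],
}, index=range(1, 6))

dummies = pd.DataFrame(index=attended.index)
# P_i = 1 if the student attended (code 2) any private school
dummies['P_i'] = (attended[private_cols] == 2).any(axis=1).astype(int)
# A_i = 1 for students 1-3 (group A), 0 for students 4-5 (group B)
dummies['A_i'] = (attended.index < 4).astype(int)
print(dummies)
```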

The regression model in this context is an equation linking the treatment variable to the dependent variable while holding control variables fixed by including them in the model.

$$Y_i = \hat{Y}_i + e_i = \alpha + \beta P_i + \gamma A_i +e_i,$$

where $\alpha$ is the intercept, $\beta$ is the causal effect of the treatment (private school attendance), and $\gamma$ is the effect of being a group A student. $\hat{Y}_i$ is the fitted value from the regression model and $e_i$ is the residual.

The distinction between the control variable and the treatment variable is defined by your research question.

Regression analysis assigns values to the model parameters ($\alpha$, $\beta$ and $\gamma$) so as to make $\hat{Y}_i$ as close as possible to $Y_i$. This is accomplished by choosing the values that minimize the sum of squared residuals, $\sum_i e_i^2$, which gives the method its name: *Ordinary Least Squares* (OLS).
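To make the least-squares idea concrete, here is a sketch that solves the same minimization directly with NumPy instead of a regression package. The design matrix is typed in by hand for the five students in groups A and B; `np.linalg.lstsq` is a swapped-in solver chosen for illustration:

```python
import numpy as np

# Columns: constant, P_i, A_i for the five students in groups A and B
X = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, 0, 1],
              [1, 1, 0],
              [1, 0, 0]], dtype=float)
y = np.array([110000, 100000, 110000, 60000, 30000], dtype=float)

# lstsq returns the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta, gamma = coef

# Residuals e_i = Y_i - Y_hat_i and their sum of squares at the minimum
residuals = y - X @ coef
ssr = float(residuals @ residuals)
print(np.round(coef), round(ssr))
```

Any other choice of $\alpha$, $\beta$ and $\gamma$ would produce a larger value of `ssr` on these data.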

In [49]:
newdata = {'P_i': [1, 1, 0, 1, 0],
           'A_i': [1, 1, 1, 0, 0],
           'Earnings': [110000, 100000, 110000, 60000, 30000]
           }
df2 = pd.DataFrame(newdata, columns=['P_i', 'A_i', 'Earnings'])
df2.index += 1
df2

Unnamed: 0,P_i,A_i,Earnings
1,1,1,110000
2,1,1,100000
3,0,1,110000
4,1,0,60000
5,0,0,30000


In [53]:
import statsmodels.api as sm

X = df2[['P_i', 'A_i']]   # regressors: treatment dummy and group A dummy
y = df2['Earnings']       # dependent variable
X = sm.add_constant(X)    # add a column of ones for the intercept (alpha)

# Note the argument order: sm.OLS(dependent, regressors)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)  # fitted values

# Print out the statistics
model.summary()



0,1,2,3
Dep. Variable:,Earnings,R-squared:,0.921
Model:,OLS,Adj. R-squared:,0.843
Method:,Least Squares,F-statistic:,11.7
Date:,"Sun, 08 Mar 2020",Prob (F-statistic):,0.0787
Time:,22:07:31,Log-Likelihood:,-52.589
No. Observations:,5,AIC:,111.2
Df Residuals:,2,BIC:,110.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4e+04,1.2e+04,3.347,0.079,-1.14e+04,9.14e+04
P_i,1e+04,1.31e+04,0.764,0.525,-4.63e+04,6.63e+04
A_i,6e+04,1.31e+04,4.583,0.044,3665.052,1.16e+05

0,1,2,3
Omnibus:,,Durbin-Watson:,2.25
Prob(Omnibus):,,Jarque-Bera (JB):,0.638
Skew:,0.0,Prob(JB):,0.727
Kurtosis:,1.25,Cond. No.,3.49


As the summary shows, the estimated values of the model's parameters are:

$$\alpha = 40000$$
$$\beta = 10000$$
$$\gamma = 60000$$

The private school coefficient in this case is 10,000, implying a private-public earnings differential of \$10,000. This is indeed a weighted average of our two group-specific effects (recall the group A effect is −5,000 and the group B effect is 30,000). While this is neither the simple unweighted average (12,500) nor the group-size weighted average (9,000), it’s not too far from either of them. In this case, regression assigns a weight of 4/7 to group A and 3/7 to group B. As with these other averages, the regression-weighted average is considerably smaller than the uncontrolled earnings gap between private and public school alumni.
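The 4/7 and 3/7 weights can be reproduced by hand. The sketch below relies on a standard property of OLS with a full set of group dummies, which the text above does not state explicitly: each group's effect is weighted in proportion to its size times the within-group variance of the treatment dummy, $n_g \, p_g (1 - p_g)$, where $p_g$ is the share of the group attending a private school.

```python
# Group size n, private-attendance share p, and within-group effect
groups = {'A': {'n': 3, 'p': 2 / 3, 'effect': -5000},
          'B': {'n': 2, 'p': 1 / 2, 'effect': 30000}}

# Unnormalized weights n * p * (1 - p), assumed from the standard OLS result
raw = {g: v['n'] * v['p'] * (1 - v['p']) for g, v in groups.items()}
total = sum(raw.values())
weights = {g: w / total for g, w in raw.items()}

# Regression-weighted average of the two group effects
beta = sum(weights[g] * groups[g]['effect'] for g in groups)
print(weights, beta)
```

Up to floating-point rounding, the weights come out as 4/7 and 3/7, and the weighted average is the 10,000 reported for `P_i`.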