# Advanced IV
IV is pretty handy for working out how one variable affects another. It can also be used to correct for attenuation bias.

**Attenu-what?**  
Attenuation bias is the underestimation of a treatment effect caused by measurement error in the data. For example, suppose we have a group of twins and want to use them to work out the value of education (twins are a natural experiment that controls for heaps of stuff: parents, even genes, etc.). If the reported years of schooling contain mistakes, the estimated effect of education on salary gets pulled toward zero.



In [2]:
#meta-params:
num_twins = 100

import pandas as pd
import numpy as np

twins_db = pd.DataFrame()

#Generate data for number of years educating
twins_db['Education_1'] = [round(np.random.normal(loc= 14, scale=2)) for x in range(num_twins)]
twins_db['Education_2'] = [round(np.random.normal(loc= 14, scale=2)) for x in range(num_twins)]

#Function for generating salary from education
def sal_f_edu(edu):
    #True return to a year of education is 3, plus random noise
    slope = 3
    rand = np.random.normal(0, 15)
    return round(slope * edu + rand)

#Generate data for salary
twins_db['Salary_1'] = twins_db['Education_1'].apply(sal_f_edu)
twins_db['Salary_2'] = twins_db['Education_2'].apply(sal_f_edu)

#Store the differences in schooling and salary:
twins_db['diff_ed']   = twins_db['Education_1'] - twins_db['Education_2']
twins_db['diff_sal']  = twins_db['Salary_1'] - twins_db['Salary_2']

#Add a constant and convert the whole df to float:
twins_db['const'] = 1
twins_db = twins_db.astype(float)

In [3]:
twins_db.sample(5)

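|    | Education_1 | Education_2 | Salary_1 | Salary_2 | diff_ed | diff_sal | const |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 6  | 10.0 | 13.0 | 37.0 | 67.0 | -3.0 | -30.0 | 1.0 |
| 24 | 13.0 | 18.0 | 70.0 | 55.0 | -5.0 | 15.0 | 1.0 |
| 35 | 13.0 | 12.0 | 34.0 | 8.0 | 1.0 | 26.0 | 1.0 |
| 82 | 15.0 | 13.0 | 15.0 | 42.0 | 2.0 | -27.0 | 1.0 |
| 1  | 17.0 | 13.0 | 73.0 | 44.0 | 4.0 | 29.0 | 1.0 |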


In [4]:
X = 'diff_ed'
y = 'diff_sal'

#Get the unattenuated result:
import statsmodels.api as sm
lr = sm.OLS(twins_db[y],
            twins_db[[X, 'const']])
result = lr.fit()
normal_coef = result.params[X]  #select the coefficient by name, not position

#Get the attenuated results:
coef = []
for _ in range(100):
    
    #Add some Attenuation Bias:
    twins_db['Ed_1_errors'] = twins_db['Education_1'].apply(lambda x : 
                                                            x + round(np.random.normal(0, 0.5))
                                                           )
    twins_db['Ed_2_errors'] = twins_db['Education_2'].apply(lambda x : 
                                                            x + round(np.random.normal(0, 0.5))
                                                           )
    twins_db['diff_ed']   = twins_db['Ed_1_errors'] - twins_db['Ed_2_errors']
    
    #Run a linear regression
    lr = sm.OLS(twins_db[y],
                twins_db[[X, 'const']])
    result = lr.fit()
    
    #Select the coefficient by name rather than by position
    coef.append(result.params[X])

In [5]:
print(normal_coef, np.average(coef))

2.70322178398 2.47020682854


The difference between these numbers is the "attenuation bias": the random mistakes in reported schooling pull the estimated coefficient toward zero.
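How big should the shrinkage be? Under classical measurement error (mistakes that are mean-zero and independent of everything else), the OLS slope is scaled by roughly the share of the observed variance that is real signal, i.e. var(true x) / (var(true x) + var(error)). Here is a minimal sketch to check that. It is not part of the twins data above: the seed, the large sample size and the unrounded values are assumptions chosen to keep sampling noise small.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
n = 100_000                     # much larger than the 100 twin pairs, to reduce sampling noise

# True regressor and outcome, using the same slope of 3 as the notebook
x_true = rng.normal(0, 2, n)
y = 3 * x_true + rng.normal(0, 15, n)

# Mismeasured regressor: classical (independent, mean-zero) error
u = rng.normal(0, 0.5, n)
x_obs = x_true + u

def ols_slope(x, y):
    """Slope of a simple regression of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = ols_slope(x_true, y)
slope_obs = ols_slope(x_obs, y)

# Share of the observed variance that is signal rather than error
reliability = np.var(x_true, ddof=1) / np.var(x_obs, ddof=1)

print("slope on true x:     ", slope_true)
print("slope on noisy x:    ", slope_obs)
print("predicted by formula:", slope_true * reliability)
```

The last two printed numbers should be close to each other, and both should be smaller than the first.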

**How do we solve Attenuation Bias?**  
We can use twin 1's report of twin 2's schooling as an instrument for twin 2's self-reported schooling (assuming we have, or can collect, this information). As long as the mistakes in the two reports are random and unrelated to each other, the instrument only picks up the true schooling, so the mistakes no longer bias the estimate.

### Remember IVs?
Reduced Form (effect of the instrument on the outcome):
$$Y_i = \alpha_0 + \rho Z_i + e_{0i}$$
First Stage (effect of the instrument on the treatment):
$$D_i = \alpha_1 + \phi Z_i + e_{1i}$$

2SLS (fitted values from the first stage):
$$\hat{D}_i = \alpha_1 + \phi Z_i$$
Finally (effect of treatment on the outcome):
$$Y_i = \alpha_2 + \lambda_{2SLS} \hat{D}_i + e_{2i}$$
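
With a single instrument, the same effect can also be written as the reduced-form coefficient divided by the first-stage coefficient. As a short worked identity (using the standard OLS slope formula, with $\lambda_{IV}$ as a label introduced here for illustration):

$$\lambda_{IV} = \frac{\rho}{\phi} = \frac{Cov(Y_i, Z_i) / Var(Z_i)}{Cov(D_i, Z_i) / Var(Z_i)} = \frac{Cov(Y_i, Z_i)}{Cov(D_i, Z_i)}$$

This ratio is what the `lambda_late` list in the code further down collects.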

**IVs in our example:**  
$Y_i$ = Difference in Salary  
$D_i$ = Difference in education (self reported)  
$Z_i$ = Difference in education (twin reported)
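
Why does this get rid of the bias? Here is a sketch of the argument, under assumptions that are not spelled out in the notebook: write $D_i^*$ for the true difference in education, $D_i = D_i^* + u_i$ for the self report, $Z_i = D_i^* + v_i$ for the twin report, and $Y_i = \alpha + \beta D_i^* + \varepsilon_i$ for the outcome, where $u_i$, $v_i$ and $\varepsilon_i$ are uncorrelated with each other and with $D_i^*$ (the symbols $D_i^*$, $u_i$, $v_i$, $\beta$ and $\varepsilon_i$ are introduced here for illustration).

$$Cov(Y_i, Z_i) = \beta \, Var(D_i^*), \qquad Cov(D_i, Z_i) = Var(D_i^*)$$

$$\frac{Cov(Y_i, Z_i)}{Cov(D_i, Z_i)} = \beta$$

The mistakes $u_i$ and $v_i$ drop out because they are unrelated to each other. In plain OLS of $Y_i$ on $D_i$, the denominator is $Var(D_i) = Var(D_i^*) + Var(u_i)$ instead, which is what shrinks the estimate.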

In [6]:
#Get the IV results:

import statsmodels.api as sm

lambda_late = []
lambda_2sls = []
for _ in range(300):
    
    #Add some mistakes (self reporting):
    twins_db['Education_1_self'] = twins_db['Education_1'].apply(lambda x : 
                                                            x + round(np.random.normal(0, 0.5))
                                                           )
    twins_db['Education_2_self'] = twins_db['Education_2'].apply(lambda x : 
                                                            x + round(np.random.normal(0, 0.5))
                                                           )
    
    #Add some mistakes (twin reporting):
    twins_db['Education_1_twin'] = twins_db['Education_1'].apply(lambda x : 
                                                            x + round(np.random.normal(0, 0.5))
                                                           )
    twins_db['Education_2_twin'] = twins_db['Education_2'].apply(lambda x : 
                                                            x + round(np.random.normal(0, 0.5))
                                                           )
    #Save this into the dataframe
    twins_db['diff_ed_self']   = twins_db['Education_1_self'] - twins_db['Education_2_self']
    twins_db['diff_ed_twin']   = twins_db['Education_1_twin'] - twins_db['Education_2_twin']
                                               
    
    #Specify the columns of interest:
    X = 'diff_ed_self'
    Z = 'diff_ed_twin'
    y = 'diff_sal'
    
    #Run a linear regression (first stage)
    lr_fs = sm.OLS(twins_db[X],
                twins_db[[Z, 'const']])
    result_fs = lr_fs.fit()
    
    #And another (reduced form)
    lr_rf = sm.OLS(twins_db[y],
                twins_db[[Z, 'const']])
    result_rf = lr_rf.fit()
    
    #save result;
    twins_db['fs_predict'] = result_fs.predict(twins_db[[Z, 'const']])
    
    #And another (2sls)
    lr_2sls = sm.OLS(twins_db[y],
                twins_db[['fs_predict', 'const']])
    result_2sls = lr_2sls.fit()
    
    lambda_late.append(result_rf.params[Z] / result_fs.params[Z])
    lambda_2sls.append(result_2sls.params['fs_predict'])

In [7]:
print("Target:", normal_coef, "\nAttenuated:", np.average(coef))
print("LATE:", np.mean(lambda_late))
print("2SLS:", np.mean(lambda_2sls))

Target: 2.70322178398 
Attenuated: 2.47020682854
LATE: 2.6694713072
2SLS: 2.6694713072


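That seems to work :) Both IV estimates (about 2.67) sit much closer to the target (2.70) than the attenuated one (2.47). The LATE and 2SLS numbers are identical because, with a single instrument, the two calculations are algebraically the same.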
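
To see that the ratio estimate and the two-stage estimate match with one instrument, here is a small numpy-only sketch. It generates its own toy data rather than reusing `twins_db`; the seed, the variable names and the unrounded noise are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (an assumption, for reproducibility)
n = 100

# Toy stand-ins for the notebook's columns (hypothetical names)
d_true = rng.normal(14, 2, n) - rng.normal(14, 2, n)  # true difference in education
y = 3 * d_true + rng.normal(0, 15, n)                 # difference in salary
d_self = d_true + rng.normal(0, 0.5, n)               # self-reported difference (treatment)
z_twin = d_true + rng.normal(0, 0.5, n)               # twin-reported difference (instrument)

def slope(x, y):
    """Slope of a simple regression of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Ratio estimate: reduced form divided by first stage
phi = slope(z_twin, d_self)   # first stage
rho = slope(z_twin, y)        # reduced form
iv_ratio = rho / phi

# Two-stage estimate: regress y on the first-stage fitted values
alpha_1 = d_self.mean() - phi * z_twin.mean()
d_hat = alpha_1 + phi * z_twin
iv_2sls = slope(d_hat, y)

print("ratio:", iv_ratio)
print("2SLS: ", iv_2sls)
```

The two printed values should agree up to floating-point error, since the fitted values are just a rescaled copy of the instrument.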