# Advanced IV
Instrumental variables (IV) are pretty handy for estimating the causal link between variables. They can also be used to correct for attenuation bias.

**Attenu-what?**  
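Attenuation bias is the underestimation of a treatment effect caused by mistakes in the data, typically measurement error in the treatment variable. For example, suppose we have a group of twins and want to use them to work out the value of education (twins are a natural experiment that controls for heaps of stuff: parents, even genes, etc.). If the reported years of education contain random mistakes, the estimated effect of education on salary gets pulled toward zero.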
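To make the definition concrete, here is a minimal simulation sketch that is separate from the notebook's own cells. It assumes classical measurement error: random noise, independent of the true value, added to the regressor. Under that assumption the OLS slope shrinks by roughly var(x) / (var(x) + var(noise)). The names `true_slope`, `x_true`, `x_noisy`, `y_sim` and the noise scale of 2 are illustrative choices, not taken from the original.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n = 100_000
true_slope = 3.0
x_true = rng.normal(14, 2, size=n)                        # true years of education
y_sim = true_slope * x_true + rng.normal(0, 15, size=n)   # salary
x_noisy = x_true + rng.normal(0, 2, size=n)               # misreported education (assumed noise sd = 2)

def ols_slope(x, y):
    """Slope of a simple regression of y on x (intercept included)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_clean = ols_slope(x_true, y_sim)
slope_noisy = ols_slope(x_noisy, y_sim)

# Shrinkage predicted under classical measurement error
shrink_factor = np.var(x_true) / (np.var(x_true) + 2**2)

print("clean slope:    ", slope_clean)
print("noisy slope:    ", slope_noisy)
print("predicted noisy:", true_slope * shrink_factor)
```

With equal signal and noise variance, as assumed here, about half of the true slope should survive.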



In [1]:
#meta-params:
num_twins = 100

import pandas as pd
import numpy as np

twins_db = pd.DataFrame()

#Generate data for number of years educating
twins_db['Education_1'] = [round(np.random.normal(loc= 14, scale=2)) for x in range(num_twins)]
twins_db['Education_2'] = [round(np.random.normal(loc= 14, scale=2)) for x in range(num_twins)]

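#Function for generating salary from education
def sal_f_edu(edu):
    """Return a noisy salary: slope * years of education + Gaussian noise."""
    slope = 3  #true effect of one extra year of education on salary
    rand = np.random.normal(0, 15)
    return round(slope * edu + rand)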

#Generate data for salary
twins_db['Salary_1'] = twins_db['Education_1'].apply(sal_f_edu)
twins_db['Salary_2'] = twins_db['Education_2'].apply(sal_f_edu)

#Add a constant and convert the whole df to float:
twins_db['const'] = 1
twins_db = twins_db.astype(float)

In [2]:
twins_db.sample(5)

| | Education_1 | Education_2 | Salary_1 | Salary_2 | const |
|---|---|---|---|---|---|
| 37 | 11.0 | 13.0 | 41.0 | 21.0 | 1.0 |
| 88 | 11.0 | 13.0 | 33.0 | -9.0 | 1.0 |
| 0 | 14.0 | 16.0 | 37.0 | 29.0 | 1.0 |
| 10 | 15.0 | 18.0 | 40.0 | 68.0 | 1.0 |
| 30 | 15.0 | 15.0 | 40.0 | 71.0 | 1.0 |


In [7]:
#Set up the data:
X1 = 'Education_1'
X2 = 'Education_2'
y1 = 'Salary_1'
y2 = 'Salary_2'

y = pd.concat([twins_db[y1], twins_db[y2]])
X = pd.DataFrame(columns = ["Education"], data = pd.concat([twins_db[X1], twins_db[X2]]))
X['const'] = 1

In [10]:
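#Get the unattenuated result:
#Imports
import statsmodels.api as sm

lr = sm.OLS(y, X)
result = lr.fit()
normal_coef = result.params['Education']

#Get the attenuated results:
coef = []
for _ in range(100):
    #Add some mistakes (self reporting) to a fresh copy of the data.
    #NOTE: this noise step is missing from the original cell. The noise scale of 2
    #is an assumption, so the printed numbers below will not be reproduced exactly.
    X_noisy = X.copy()
    X_noisy['Education'] = X_noisy['Education'] + np.random.normal(0, 2, size=len(X_noisy)).round()

    #Run a linear regression
    lr = sm.OLS(y, X_noisy)
    result = lr.fit()

    coef.append(result.params['Education'])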

In [14]:
print("Target:\t\t", normal_coef, "\nAttenuated:\t", np.average(coef))

Target:		 3.12760991826 
Attenuated:	 1.75869837235


The gap between these two numbers is the "Attenuation Bias": mistakes in the education data pull the estimated coefficient toward zero.
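As a quick check on the printed numbers, the ratio of the attenuated coefficient to the target gives the implied shrinkage. Under classical measurement error (an assumption here, not something the notebook states) this ratio approximates the share of the variance in reported education that is real signal. The variable names are illustrative.

```python
target = 3.12760991826       # coefficient printed above, without reporting mistakes
attenuated = 1.75869837235   # average coefficient printed above, with mistakes

shrinkage = attenuated / target   # fraction of the effect that survives
bias = target - attenuated        # size of the attenuation bias

print("shrinkage factor:", round(shrinkage, 3))
print("attenuation bias:", round(bias, 3))
```

So roughly 56% of the target effect survives the mistakes in this run.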

**How do we solve Attenuation Bias?**  
We can use twin 1's report of twin 2's schooling as an instrument for twin 2's self-reported schooling (assuming we have, or can collect, this information). If the mistakes in the two reports are just random and unrelated to each other, they should mostly cancel out.
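A minimal sketch of that idea, separate from the notebook's own cells: two independent noisy reports of the same true schooling, one used as the regressor and the other as the instrument. It assumes the errors in the two reports are independent of each other and of salary. The names `edu_true`, `self_report`, `twin_report` and the noise scales are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

n = 100_000
edu_true = rng.normal(14, 2, size=n)                  # true schooling (never observed)
salary = 3.0 * edu_true + rng.normal(0, 15, size=n)   # true slope is 3
self_report = edu_true + rng.normal(0, 2, size=n)     # noisy regressor
twin_report = edu_true + rng.normal(0, 2, size=n)     # noisy instrument, independent error

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

# OLS on the noisy self-report: attenuated toward zero
ols_est = cov(self_report, salary) / cov(self_report, self_report)

# IV estimate: reduced-form covariance over first-stage covariance
iv_est = cov(twin_report, salary) / cov(twin_report, self_report)

print("OLS on noisy report:", ols_est)
print("IV with twin report:", iv_est)
```

The instrument works here because it is correlated with true schooling but not with the mistake in the self-report.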

### Remember IVs?
Reduced Form (effect of the instrument on the outcome):
$$Y_i = \alpha_0 + \rho Z_i + e_{0i}$$
First Stage (effect of the instrument on the treatment):
$$D_i = \alpha_1 + \phi Z_i + e_{1i}$$

2SLS (fitted values from the first stage):
$$\hat{D}_i = \alpha_1 + \phi Z_i$$
Finally (effect of treatment on the outcome):
$$Y_i = \alpha_2 + \lambda_{2SLS} \hat{D}_i + \gamma_2 A_i + e_{2i}$$
Here $A_i$ denotes any additional control variables (none are used in this example).
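
With a single instrument and no controls, the 2SLS coefficient reduces to the ratio of the reduced-form slope to the first-stage slope. A short worked step, using the symbols defined above:

$$\lambda_{IV} = \frac{\rho}{\phi} = \frac{Cov(Y_i, Z_i) / Var(Z_i)}{Cov(D_i, Z_i) / Var(Z_i)} = \frac{Cov(Y_i, Z_i)}{Cov(D_i, Z_i)}$$

The ratio only recovers the effect of the treatment if the instrument is relevant ($\phi \neq 0$) and unrelated to the error in the outcome equation. When $\phi$ is close to zero the ratio becomes unstable.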

**IVs in our example:**  
$Y_i$ = Salary  
$D_i$ = Education (self reported)  
$Z_i$ = Difference in education (self reported - twin reported)

In [17]:
#Set up the data:
X1 = 'Education_1'
X2 = 'Education_2'
y1 = 'Salary_1'
y2 = 'Salary_2'

y = pd.concat([twins_db[y1], twins_db[y2]])
X = pd.DataFrame(columns = ["Education"], data = pd.concat([twins_db[X1], twins_db[X2]]))
X['Education_twin'] = X['Education'].copy()
X['const'] = 1

In [24]:
#Get the IV estimates:
lambda_late = []
lambda_2sls = []
for _ in range(300):
    
    #Add some mistakes (self reporting).
    #NOTE: X is modified in place, so these mistakes accumulate across iterations.
    X['Education'] = X['Education'].apply(lambda x : x + round(np.random.normal(0, 0.3)))
    X['Education_twin'] = X['Education_twin'].apply(lambda x : x + round(np.random.normal(0, 0.3)))
    
    
    #Get the difference (Education_twin is kept, since the next iteration needs it):
    X['Education_diff'] = X['Education'] - X['Education_twin']
    
    D_i = 'Education'
    Z_i = 'Education_diff'
                                               
    #Run a linear regression (first stage)
    lr_fs = sm.OLS(X[D_i],
                   X[[Z_i, 'const']])
    result_fs = lr_fs.fit()
    
    #And another (reduced form)
    lr_rf = sm.OLS(y,
                   X[[Z_i, 'const']])
    result_rf = lr_rf.fit()
    
    #save result;
    X['fs_predict'] = result_fs.predict(X[[Z_i, 'const']])
    
    #And another (2sls)
    lr_2sls = sm.OLS(y,
                     X[['fs_predict', 'const']])
    result_2sls = lr_2sls.fit()
    
    #IV ratio (reduced form over first stage) and the 2SLS coefficient
    lambda_late.append(result_rf.params[Z_i] / result_fs.params[Z_i])
    lambda_2sls.append(result_2sls.params['fs_predict'])

In [25]:
print("Target:", normal_coef, "\nAttenuated:", np.average(coef))
print("LATE:", np.mean(lambda_late))
print("2SLS:", np.mean(lambda_2sls))

Target: 3.12760991826 
Attenuated: 1.75869837235
LATE: 0.452725708673
2SLS: 0.452725708673
