# PSET 2 - Econometric Theory - Saverio Pietro Capra

In [2]:
# I import the Python modules (or libraries) that I'll need to solve the problems
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

In [3]:
# I read the csv of the dataset and I assign it to a variable
data = pd.read_csv("tracking.csv")

### Functions

Here below you can see some functions that I've defined in order to make the code below look neater.

$$\hat{ATE} = \frac{1}{n_{1}} \sum^{n}_{i=1} D_{i}y_{i}(1)- \frac{1}{n-n_{1}} \sum^{n}_{i=1} (1-D_{i})y_{i}(0)$$

In [11]:
# I define a function to compute the difference between the mean of the treatment and non-treatment effect

def ATE_estimator(df, col_name:str, variable_name:str):
    """
    df -> dataframe which contains the data on which we want to estimate the ATE
    col -> the column which indicates whether the unit received a treatment or not (1 if treatment, 0 if no treatment)
    variable -> the variable with respect to which you want to compute the ATE, so the variable of interest
    """

    treatment_mean = df[df[col_name]== 1][variable_name].mean()
    non_treatment_mean = df[df[col_name]== 0][variable_name].mean()

    ATE = treatment_mean - non_treatment_mean

    return ATE

## Exercise 1

### Part 1

Estimate the effect of tracking on students' end of first grade test scores

In [4]:
# I group the data by schoolid and remove the column which says whether students were above or below the median, since I don't need it right now
# I also calculate the mean scoreendfirstgrade for each school
grouped_data = data[["tracking", "scoreendfirstgrade", "schoolid"]].groupby(["schoolid"]).mean()

In [51]:
# I estimate the ATE of tracking on student's end of first grade test scores
ATE_estimation = ATE_estimator(grouped_data, "tracking", "scoreendfirstgrade")
ATE_estimation

0.13391256466638107

From this simple estimate it seems that tracking has a positive effect on end of first grade scores

### Part 2

Use a 10% level randomization inference test to assess whether the finding of question 1
is robust or whether it rests on an unappropriate asymptotic approximation. Conclude based
on the result of this randomization test.

In [96]:
print("The total numbe of schools involved in the study is",grouped_data.shape[0])

treatment_schools = round(grouped_data[grouped_data["tracking"]==1].sum()[0])
no_treatment_schools = round(grouped_data[grouped_data["tracking"]==0].count()[0])
print(f"Among these schools {treatment_schools} were assigned the treatment, and {no_treatment_schools} were not assigned the treatment")

The total numbe of schools involved in the study is 108
Among these schools 60 were assigned the treatment, and 48 were not assigned the treatment


In [99]:
alpha = 10

# Number of permutations that I will run to carry out the randomization inference
num_permutations = 1000

# Here I store the outcome of each simulation
permutation_ATEs = []

# This is a for-loop that repeats for 10000 times
for i in range(num_permutations):
    # Creates a permuted version of the DataFrame by randomly shuffling the rows without replacement
    permuted_df = grouped_data.sample(frac=1, replace=False)

    # I assign to this variable the number of units to which I will have to assign the treatment in the MC simulation (half the number of schools)
    # n_treatment = round(permuted_df.shape[0]/2)
    n_treatment = treatment_schools

    # I assign to the first half of the rows in the permuted list the treatment, so the value treatment = 1
    treatment_df = permuted_df.copy().iloc[:n_treatment]
    treatment_df["scoreendfirstgrade"] ==  1

    no_treatment_df = permuted_df.copy().iloc[n_treatment:]
    no_treatment_df["scoreendfirstgrade"] = 0
 
    complete_df = pd.concat([treatment_df, no_treatment_df], axis = 0)

    permutation_ATE = ATE_estimator(complete_df, "tracking", "scoreendfirstgrade")

    # I append to the list permutation_t_stats the t_test value that I got in the current MC trial
    permutation_ATEs.append(permutation_ATE)

# Determine critical value, which is the 10 percentile value of the permutation_t_stats list, which has all the t_stats of the MC simulations
critical_value1 = np.percentile(permutation_ATEs, alpha)
critical_value2 = np.percentile(permutation_ATEs, 100-alpha)

print("Critical values:", (critical_value1, critical_value2))

# Compare observed statistic to critical value
if abs(ATE_estimation) > critical_value1 and abs(ATE_estimation) > critical_value2:
    print("Finding is robust at 10% level randomization inference")
else:
    print("Finding is not robust at 10% level randomization inference")

Critical values: (0.02307167081794145, 0.12241837502955495)
Finding is robust at 10% level randomization inference


### Part 3

One might fear that tracking benefits strong students while harming weaker ones. Assess
whether this is a legitimate concern using the dataset at hand.

In [19]:
# First of all I try to compute a simple difference in means

ATE_bottom_half = ATE_estimator(data[data["bottomhalf"]==1], "tracking", "scoreendfirstgrade")
print("The difference in means between treated and non-treated for students BELOW the median is", ATE_bottom_half)

ATE_upper_half = ATE_estimator(data[data["bottomhalf"]==0], "tracking", "scoreendfirstgrade")
print("The difference in means between treated and non-treated for students ABOVE the median is", ATE_upper_half)

The difference in means between treated and non-treated for students BELOW the median is 0.13009291684380003
The difference in means between treated and non-treated for students ABOVE the median is 0.14925006025040954


The average treatment effect seems to have a positive effect in both cases (students below and above the mean).

In [57]:
grouped_schools_wrtmedian = data.groupby(["schoolid","bottomhalf"]).mean().reset_index()
grouped_schools_wrtmedian = grouped_schools_wrtmedian.set_index("schoolid")

In [60]:
grouped_schools_bottomhalf = grouped_schools_wrtmedian[grouped_schools_wrtmedian["bottomhalf"]==1]
grouped_schools_upperhalf = grouped_schools_wrtmedian[grouped_schools_wrtmedian["bottomhalf"]==0]


In [69]:
model_bottomhalf = sm.OLS(grouped_schools_bottomhalf["scoreendfirstgrade"], sm.add_constant(grouped_schools_bottomhalf["tracking"])).fit()

In [70]:
model_bottomhalf.summary()

0,1,2,3
Dep. Variable:,scoreendfirstgrade,R-squared:,0.031
Model:,OLS,Adj. R-squared:,0.022
Method:,Least Squares,F-statistic:,3.429
Date:,"Sun, 06 Oct 2024",Prob (F-statistic):,0.0668
Time:,13:22:16,Log-Likelihood:,-53.205
No. Observations:,108,AIC:,110.4
Df Residuals:,106,BIC:,115.8
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.4793,0.058,-8.308,0.000,-0.594,-0.365
tracking,0.1433,0.077,1.852,0.067,-0.010,0.297

0,1,2,3
Omnibus:,7.707,Durbin-Watson:,1.574
Prob(Omnibus):,0.021,Jarque-Bera (JB):,7.717
Skew:,0.653,Prob(JB):,0.0211
Kurtosis:,3.097,Cond. No.,2.77


In [73]:
model_upperhalf = sm.OLS(grouped_schools_upperhalf["scoreendfirstgrade"], sm.add_constant(grouped_schools_upperhalf["tracking"])).fit()

In [74]:
model_upperhalf.summary()

0,1,2,3
Dep. Variable:,scoreendfirstgrade,R-squared:,0.023
Model:,OLS,Adj. R-squared:,0.014
Method:,Least Squares,F-statistic:,2.518
Date:,"Sun, 06 Oct 2024",Prob (F-statistic):,0.116
Time:,13:22:47,Log-Likelihood:,-78.779
No. Observations:,108,AIC:,161.6
Df Residuals:,106,BIC:,166.9
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3063,0.073,4.189,0.000,0.161,0.451
tracking,0.1556,0.098,1.587,0.116,-0.039,0.350

0,1,2,3
Omnibus:,1.585,Durbin-Watson:,1.53
Prob(Omnibus):,0.453,Jarque-Bera (JB):,1.508
Skew:,0.184,Prob(JB):,0.47
Kurtosis:,2.553,Cond. No.,2.77


In both cases (for students originally below or above the median), it seems that tracking has a positive effect

## Exercise 2

### Part 1

### Part 2

In [86]:
obs = 1000
V = np.random.normal(loc=0, scale=1, size=obs)

y_0 = np.random.normal(loc=0, scale=1, size=obs)
y_1 = 0.5*y_0+0.5*V+0.2

Given the way we created `y_0` and `y_1` the treatment effect is heterogeneous across units, since `y_1 - y_0 = 0.5(V-y_0)+0.2`, where `V` and `y_0` are different for each unit.

In [87]:
ATE_1000 = np.mean(y_1-y_0)
print("ATE_1000:", ATE_1000)

ATE_1000: 0.19773442938929975


In [91]:
# Correlation between y_0 and y_1
corr_y0y1 = np.corrcoef(y_1, y_0)
print("The correlation coefficient is", corr_y0y1[0][1])

# Variance between y_0  and y_1
var_y0y1 = np.var(y_1-y_0)
print("The variance of (y_1-y_0) is", var_y0y1)

The correlation coefficient is 0.6926758302401267
The variance of (y_1-y_0) is 0.5096856397322104
