## Exercise 5: Logistic Regression and Causal Inference

In [65]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

### Task 1: Logistic Regression

Again, we revisit the Student Performance dataset. This time, however, we focus not on predicting test performance but on predicting whether a student has taken the test preparation course.

#### a) Fitting and model analysis

Preprocess the data as in the previous exercise, i.e. transform categorical variables and remove highly correlated predictors. Then, use statsmodels to fit a logistic regression model that predicts whether a student completed the test preparation course. Which predictors appear significant?

In [75]:
df = pd.read_csv("StudentsPerformance.csv")

# print(df['race/ethnicity'].value_counts())

df_gender = pd.get_dummies(df['gender'])
df_race = pd.get_dummies(df['race/ethnicity'])
df_edu = pd.get_dummies(df['parental level of education'])
df_lunch = pd.get_dummies(df['lunch'])
df_course = pd.get_dummies(df['test preparation course'])

col = [df_gender,df_race,df_edu,df_lunch,df_course]
for i in col:
    df = df.join(i)

# df = df.iloc[:, 5:]
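# numeric_only=True skips the original string columns, which recent pandas versions no longer drop silently
corr_matrix = df.corr(numeric_only=True).abs()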

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# print(np.tril(arr, k=-1))  # Lower triangle of an array.
# print(np.triu(arr, k=1))  # Upper triangle of an array.

# Find index of feature columns with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]

# Drop features
df = df.drop(columns=to_drop)

# print(df.columns)
columns = ['math score', 'reading score', 'female', 'group A', 'group B', 'group C', 'group D', 'group E', "associate's degree", "bachelor's degree", 'high school', "master's degree", 'some college', 'some high school', 'free/reduced']
# columns = ['math score']

X = df[columns]
Y = df['completed']

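# initialize model: OLS = ordinary least squares
# (note: OLS on a 0/1 target is a linear probability model; sm.Logit is the statsmodels class for logistic regression)
model = sm.OLS(Y,X)
# fit model: only now is the model fitted, i.e. the parameters are computed
results = model.fit()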

# print a summary, i.e. an overview on parameters and diagnostics
results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | completed | R-squared: | 0.092 |
| Model: | OLS | Adj. R-squared: | 0.08 |
| Method: | Least Squares | F-statistic: | 7.649 |
| Date: | Thu, 07 Nov 2019 | Prob (F-statistic): | 1.37e-14 |
| Time: | 17:55:35 | Log-Likelihood: | -635.71 |
| No. Observations: | 1000 | AIC: | 1299.0 |
| Df Residuals: | 986 | BIC: | 1368.0 |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| math score | -0.0069 | 0.002 | -2.850 | 0.004 | -0.012 | -0.002 |
| reading score | 0.0156 | 0.002 | 6.420 | 0.000 | 0.011 | 0.020 |
| female | -0.1536 | 0.041 | -3.758 | 0.000 | -0.234 | -0.073 |
| group A | -0.1203 | 0.061 | -1.979 | 0.048 | -0.240 | -0.001 |
| group B | -0.1073 | 0.054 | -1.999 | 0.046 | -0.213 | -0.002 |
| group C | -0.1109 | 0.051 | -2.192 | 0.029 | -0.210 | -0.012 |
| group D | -0.1731 | 0.053 | -3.244 | 0.001 | -0.278 | -0.068 |
| group E | -0.0517 | 0.063 | -0.822 | 0.411 | -0.175 | 0.072 |
| associate's degree | -0.0941 | 0.048 | -1.947 | 0.052 | -0.189 | 0.001 |

| | | | |
|---|---|---|---|
| Omnibus: | 9397.675 | Durbin-Watson: | 1.986 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 120.86 |
| Skew: | 0.518 | Prob(JB): | 5.70e-27 |
| Kurtosis: | 1.648 | Cond. No. | 1.09e+18 |


#### b) Diagnostics 1: Accuracy and Confusion Matrix

Write two functions that each take as input a vector y of the true classes and a vector y_hat of the predicted classes. The first should return the accuracy of the prediction, i.e. the ratio of correctly predicted samples, and the second should compute the confusion matrix as introduced in class.
Apply your functions to your model from a).
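
One possible sketch of the two functions, assuming binary labels coded as 0 and 1. The function names and the `[[TN, FP], [FN, TP]]` layout are assumptions and may differ from the convention introduced in class.

```python
import numpy as np

def accuracy(y, y_hat):
    """Return the fraction of samples whose predicted class equals the true class."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y == y_hat))

def confusion_matrix(y, y_hat):
    """Return a 2x2 array [[TN, FP], [FN, TP]] for binary labels in {0, 1}.

    Rows index the true class and columns the predicted class (assumed layout).
    """
    y = np.asarray(y).astype(int)
    y_hat = np.asarray(y_hat).astype(int)
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (y, y_hat), 1)   # count each (true, predicted) pair
    return cm

# Toy example with made-up labels.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print(acc)
print(cm)
```

For a fitted model, `y_hat` would typically be obtained by thresholding the predicted probabilities, for example at 0.5.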

#### c) Diagnostics 2: The ROC curve

Write a function that takes as input a vector y of the true classes and a vector yp of the predicted probabilities resulting from the logistic regression, plots the ROC curve of the model, and returns the corresponding AUC score.
Apply your function to your model from a).
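
A minimal sketch of such a function, assuming binary labels in {0, 1} with at least one sample of each class. The name `roc_auc` and the `plot` switch are hypothetical; the AUC is computed with the trapezoidal rule over all distinct thresholds.

```python
import numpy as np

def roc_auc(y, yp, plot=True):
    """Compute the ROC curve and return the AUC; optionally plot the curve.

    y  : true binary labels in {0, 1}
    yp : predicted probabilities for class 1
    """
    y, yp = np.asarray(y), np.asarray(yp, dtype=float)
    # Sweep the threshold from high to low over every distinct score.
    thresholds = np.r_[np.inf, np.unique(yp)[::-1]]
    pos, neg = np.sum(y == 1), np.sum(y == 0)
    tpr = np.array([np.sum((yp >= t) & (y == 1)) / pos for t in thresholds])
    fpr = np.array([np.sum((yp >= t) & (y == 0)) / neg for t in thresholds])
    # Trapezoidal rule for the area under the piecewise-linear curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    if plot:
        import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
        plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
        plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
        plt.xlabel("False positive rate")
        plt.ylabel("True positive rate")
        plt.legend()
        plt.show()
    return auc

# Toy example with made-up scores; plotting is switched off here.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], plot=False)
print(auc)
```

An AUC of 0.5 corresponds to the dashed chance line, and 1.0 to a model whose scores separate the two classes perfectly.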

### Task 2: Causal Inference

In this task we use a dataset (NSW.csv) that was collected to evaluate the effect of participating in a job training program on salary. The data was taken from the website of Gelman and Hill's book (http://www.stat.columbia.edu/~gelman/arm/) and was originally constructed in two independent studies (see Gelman and Hill, chapter 10, ex. 1).
The data contains demographic information on its population, the real earnings in 1974 and 1975, an indicator of whether job training, i.e. the treatment, was conducted in 1976/77, and the earnings in 1978, which is our target variable. A brief documentation can be found in "NSW.doc". When loading the data, make sure to omit the sample variable, which simply indicates the source that a specific observation originated from.  
Note that there are only very few treated individuals in the dataset.

In [None]:
# you may load and preprocess your data here

#### a) Mean and Regression analysis with one predictor

We first simply consider the treatment as a predictor for the earnings in 1978. 
Investigate the effect of the treatment by (i) computing the difference in means between control and treatment groups and (ii) performing a linear regression with only one predictor.
What do you observe?
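
The two approaches are closely linked: with a single 0/1 predictor and an intercept, the OLS slope equals the difference in group means. The sketch below checks this on synthetic data; the variable names and numbers are made up and are not taken from NSW.csv.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
treat = rng.binomial(1, 0.2, size=n)                   # hypothetical 0/1 treatment indicator
earn = 5000 + 1500 * treat + rng.normal(0, 3000, n)    # hypothetical outcome

# (i) Difference in group means.
diff_means = earn[treat == 1].mean() - earn[treat == 0].mean()

# (ii) OLS of the outcome on an intercept and the treatment indicator.
X = np.column_stack([np.ones(n), treat])
beta, *_ = np.linalg.lstsq(X, earn, rcond=None)

print(diff_means, beta[1])   # the two numbers agree up to floating-point error
```

The intercept `beta[0]` is the control-group mean, so the regression adds standard errors and tests but no new point estimate.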

#### b) Omitted variable bias

Intuitively, it makes sense that the income in 1975 has strong predictive power for the earnings three years later. Recompute your regression model so that it additionally includes that income. 
Further, compute the omitted variable bias between the treatment and the income from 1975.
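
One common way to compute the omitted variable bias is via the identity "short coefficient = long coefficient + (coefficient of the omitted variable) × (slope of the omitted variable regressed on the treatment)". The sketch below checks that identity on synthetic data; the helper `ols_coefs` and all variable names are hypothetical.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 1000
z = rng.normal(10, 3, n)                                       # hypothetical prior income
t = (rng.normal(size=n) - 0.3 * (z - 10) > 1).astype(float)    # treatment more likely at low z
y = 2.0 + 1.0 * t + 0.8 * z + rng.normal(size=n)               # hypothetical later earnings

ones = np.ones(n)
b_short = ols_coefs(np.column_stack([ones, t]), y)       # y ~ 1 + t
b_long = ols_coefs(np.column_stack([ones, t, z]), y)     # y ~ 1 + t + z
delta = ols_coefs(np.column_stack([ones, t]), z)         # z ~ 1 + t

# Bias of the short-regression treatment coefficient caused by omitting z.
bias = b_long[2] * delta[1]
print(b_short[1], b_long[1] + bias)   # identical up to floating-point error
```

In this synthetic setup the bias is negative because treated units tend to have lower prior income, which pulls the short-regression estimate below the long-regression one.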

#### c) Adding more predictors

Our data provides many more potential predictors. Add some more predictors to your regression model and observe how sensitive the model is to the new predictors.

#### d) Greedy matching

We now consider ALL columns in the data as predictors. Due to the numerical imbalance in the data, there are many control samples that we would not want to compare with the treatment group in our analysis.  
Implement a function that takes as input two matrices corresponding to the confounders of the control and treatment groups, and returns a matching of their indices based on Mahalanobis distance.
Apply your function to compute a matching on the given data. Note that due to the strong imbalance between the cardinalities of the control group and the treatment group, you do not need to consider a maximum distance threshold in this task.

#### e) Analyzing matched groups

Recompute the means of the control group and the treated group, and refit the regression model that includes all columns as predictors. What do you observe?
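
The matched groups used here come from the greedy Mahalanobis matching described above. A minimal sketch of one such matching, assuming at least as many controls as treated units, no distance threshold, and matching without replacement in the given order of the treated rows. The function name is hypothetical.

```python
import numpy as np

def greedy_mahalanobis_match(X_control, X_treated):
    """Greedily pair each treated row with its nearest unused control row.

    Returns a list of (treated_index, control_index) tuples.
    """
    X_control = np.asarray(X_control, dtype=float)
    X_treated = np.asarray(X_treated, dtype=float)
    # Inverse covariance of the pooled confounders (pinv tolerates singular cases).
    pooled = np.vstack([X_control, X_treated])
    vi = np.linalg.pinv(np.atleast_2d(np.cov(pooled, rowvar=False)))
    available = np.ones(len(X_control), dtype=bool)
    pairs = []
    for t, x in enumerate(X_treated):
        diff = X_control - x
        # Squared Mahalanobis distance to every control; same ordering as the distance.
        d2 = np.einsum("ci,ij,cj->c", diff, vi, diff)
        d2[~available] = np.inf        # exclude controls that are already matched
        c = int(np.argmin(d2))
        pairs.append((t, c))
        available[c] = False
    return pairs

# Toy example with made-up confounders.
controls = np.array([[0.0, 0.0], [1.0, 2.0], [5.0, 4.0], [9.0, 10.0]])
treated = np.array([[1.0, 2.0], [5.0, 4.0]])
pairs = greedy_mahalanobis_match(controls, treated)
print(pairs)
```

The control indices in `pairs` select the matched control subset, for example `controls[[c for _, c in pairs]]`, on which the group means and the regression can then be recomputed. Because the matching is greedy, the result can depend on the order of the treated rows.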