# Step 0: Import Required Packages 

#### If the cell below does not run, install rpy2 with 'pip install rpy2' or 'conda install -c r rpy2', then run 'conda install tzlocal' in the Anaconda Prompt
#### If you still get error messages, change the os.environ paths below to match your R installation directory and version

In [3]:
import os
import rpy2

try:
    import rpy2.robjects as robjects
except Exception:
    # rpy2 could not locate R: point it at the local R installation and retry
    os.environ["R_HOME"] = r"C:\Program Files\R\R-4.0.2"
    os.environ["PATH"]   = r"C:\Program Files\R\R-4.0.2\bin\x64" + ";" + os.environ["PATH"]
    import rpy2.robjects as robjects
    
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.vectors import StrVector
from rpy2.robjects import FloatVector, Formula

#### If tabulate is not installed, run 'pip install tabulate'

In [8]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
import math
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import GridSearchCV
import time

from IPython.display import HTML, display
import tabulate

# Step 1: Read the files

We first read in the low dimensional and high dimensional datasets from the data folder. Each has feature columns whose names start with ‘V’, a binary treatment/control column A, and a continuous response Y.

For this project, we know that the true ATE for the low dimensional data is 2.5, and for the high dimensional data it’s -3.

In [9]:
lowDim_dataset = pd.read_csv('../data/lowDim_dataset.csv')
highDim_dataset = pd.read_csv('../data/highDim_dataset.csv')

lowDim_true_ATE = 2.5
highDim_true_ATE = -3

# Step 2: Calculate Propensity and Linear Propensity Scores

To calculate the propensity scores, we first fit a GBM classifier on the features, with the binary treatment indicator A as the target (0 for the control group, 1 for the treatment group). To choose GBM parameters without overfitting, we performed a grid search with 5-fold cross-validation. The two grid search cells below are commented out because they take a long time to run; the parameters they selected are recorded in the comments.
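To give a rough sense of why the search is slow, the sketch below counts the model fits implied by the grid in the commented-out cells. It only does arithmetic over the parameter lists; `cv_folds` is an illustrative name for the `cv=5` argument, and the count ignores the final refit on the full data.

```python
from itertools import product

# Same grid as in the commented-out grid search cells
params = {
    'learning_rate': [0.01, 0.05, 0.1, 0.5],
    'max_depth': [1, 2, 3, 4],
    'n_estimators': [50, 100, 150],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 4, 6],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination of parameter values is one candidate model
n_candidates = len(list(product(*params.values())))
# Each candidate is fit once per cross-validation fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 432 2160
```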

In [4]:
#low dim grid search (commented out since it takes a few minutes to run)

#X=lowDim_dataset.iloc[:,2:].values
#A=lowDim_dataset['A'].values
#Y=lowDim_dataset['Y'].values

#params = {'learning_rate':[0.01,0.05,0.1,0.5], 'max_depth': [1,2,3,4], 'n_estimators':[50,100,150],
#          'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
#gscv = GridSearchCV(GradientBoostingClassifier(),params,cv=5).fit(X,A)
#gscv.best_params_

#output: {'learning_rate': 0.01,
# 'max_depth': 2,
# 'min_samples_leaf': 1,
# 'min_samples_split': 2,
# 'n_estimators': 150}

In [5]:
#high dim grid search (commented out since it takes a few minutes to run)

#X=highDim_dataset.iloc[:,2:].values
#A=highDim_dataset['A'].values
#Y=highDim_dataset['Y'].values

#params = {'learning_rate':[0.01,0.05,0.1,0.5], 'max_depth': [1,2,3,4], 'n_estimators':[50,100,150],
#          'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
#gscv = GridSearchCV(GradientBoostingClassifier(),params,cv=5).fit(X,A)
#gscv.best_params_


#output: {'learning_rate': 0.05,
# 'max_depth': 1,
# 'min_samples_leaf': 5,
# 'min_samples_split': 2,
# 'n_estimators': 100}

Here we define the logit function, which is used to get linear propensity scores from the propensity scores.

In [6]:
def logit(x):
    return math.log(x/(1-x))
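
The function above raises an error if a score is exactly 0 or 1, which a classifier could in principle return. A minimal sketch of a guarded variant follows; the name `safe_logit` and the clipping bound `eps` are assumptions for illustration and are not used elsewhere in this notebook.

```python
import math

def safe_logit(p, eps=1e-6):
    """Logit of p after clipping p into [eps, 1 - eps] (eps is an assumed bound)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

print(safe_logit(0.5))                 # 0.0
print(math.isfinite(safe_logit(0.0)))  # True: clipped instead of raising
```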

Now we will use the parameters we got from the grid search to get the propensity scores and linear propensity scores from the low and high dimensional datasets. We save the scores to csv files in the output folder.

### Low Dimensional Dataset

In [7]:
X=lowDim_dataset.iloc[:,2:].values
A=lowDim_dataset['A'].values
Y=lowDim_dataset['Y'].values

gbm = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, min_samples_leaf = 1,
                                min_samples_split = 2, n_estimators = 150).fit(X,A)

low_dim_propensity_scores = [x[1] for x in gbm.predict_proba(X)]
low_dim_linear_propensity_scores = [logit(x) for x in low_dim_propensity_scores]

In [8]:
lowDim_dataset_propensity = lowDim_dataset.copy(deep=True)
lowDim_dataset_propensity['propensity_score'] = low_dim_propensity_scores

In [9]:
lowDim_dataset_linear_propensity = lowDim_dataset.copy(deep=True)
lowDim_dataset_linear_propensity['linear_propensity_score'] = low_dim_linear_propensity_scores

In [10]:
pd.DataFrame({'propensity_scores':low_dim_propensity_scores}).to_csv('../output/low_dim_propensity_scores.csv')
pd.DataFrame({'linear_propensity_scores':low_dim_linear_propensity_scores}).to_csv('../output/low_dim_linear_propensity_scores.csv')

### High Dimensional Dataset

In [11]:
X=highDim_dataset.iloc[:,2:].values
A=highDim_dataset['A'].values
Y=highDim_dataset['Y'].values

gbm = GradientBoostingClassifier(learning_rate = 0.05, max_depth = 1, min_samples_leaf = 5,
                                min_samples_split = 2, n_estimators = 100).fit(X,A)

high_dim_propensity_scores = [x[1] for x in gbm.predict_proba(X)]
high_dim_linear_propensity_scores = [logit(x) for x in high_dim_propensity_scores]

In [12]:
highDim_dataset_propensity = highDim_dataset.copy(deep=True)
highDim_dataset_propensity['propensity_score'] = high_dim_propensity_scores

In [13]:
highDim_dataset_linear_propensity = highDim_dataset.copy(deep=True)
highDim_dataset_linear_propensity['linear_propensity_score'] = high_dim_linear_propensity_scores

In [14]:
pd.DataFrame({'propensity_scores':high_dim_propensity_scores}).to_csv('../output/high_dim_propensity_scores.csv')
pd.DataFrame({'linear_propensity_scores':high_dim_linear_propensity_scores}).to_csv('../output/high_dim_linear_propensity_scores.csv')

# Step 3: Perform Full Matching

Full matching is a way to break the dataset into subsets so that each subset has at least one treatment member and at least one control member. Subsets are created based on how close people are in terms of a specific distance metric, and the subsets do not have to be the same size. Instead, treated individuals who are close to many comparison individuals will be grouped with many comparison individuals, and treated individuals with few similar comparison individuals will be grouped with fewer comparison individuals.

Full matching minimizes the sum of the distances between all pairs of treated and comparison individuals within each matched set, across all matched sets.


The motivation behind full matching is that people within the same subset are ideally similar enough to each other that their response values can serve as counterfactuals for each other (e.g. if person A received treatment and is in the same subset as person B who was under control, then the response value for B would be close to the response value for A if A had been under control instead of treatment and vice versa). Once we’ve created the subsets, we can estimate the ATE by taking a weighted average of all of the differences between mean treatment response and mean control response within the subsets. 

One problem with full matching is that it sometimes leads to matched sets with widely varying ratios of treated to comparison individuals, which can lead to large variance of the resulting effect estimates.
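
As a toy illustration of the weighted average described above, the sketch below computes the estimate for three hand-made matched sets. The data values and the helper name `matched_set_ate` are made up for the example; weighting each set by its size is the same choice made in the ATE cells of this notebook.

```python
import numpy as np

def matched_set_ate(assign, A, Y):
    """Size-weighted average of (mean treated Y - mean control Y) over matched sets."""
    assign, A = np.asarray(assign), np.asarray(A)
    Y = np.asarray(Y, dtype=float)
    diffs, weights = [], []
    for s in np.unique(assign):
        in_set = assign == s
        treated = Y[in_set & (A == 1)]
        control = Y[in_set & (A == 0)]
        diffs.append(treated.mean() - control.mean())
        weights.append(in_set.sum())  # weight = number of individuals in the set
    return np.average(diffs, weights=weights)

# Hypothetical toy data: set 1 has 1 treated and 2 controls,
# set 2 has 2 treated and 1 control, set 3 is a plain pair
assign = [1, 1, 1, 2, 2, 2, 3, 3]
A      = [1, 0, 0, 1, 1, 0, 1, 0]
Y      = [5.0, 2.0, 4.0, 6.0, 8.0, 4.0, 3.0, 1.0]

print(matched_set_ate(assign, A, Y))  # 2.375
```

The within-set differences here are 2, 3, and 2 with set sizes 3, 3, and 2, so the estimate is (6 + 9 + 4) / 8.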

### Set Up rpy2 (Python Interface to R)

To implement full matching, we use the fullmatch function from the R package optmatch. To use the function, we first set up the Python interface to R and install the necessary packages from CRAN.

In [15]:
%%capture 
utils = importr('utils')
utils.chooseCRANmirror(ind=1)
packnames = ('optmatch',)  # trailing comma makes this a tuple of package names, not a plain string


In [16]:
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))

In [17]:
%%capture 
utils.chooseCRANmirror(ind=1)
robjects.r('install.packages("optmatch")')

In [18]:
optmatch = rpackages.importr('optmatch')

Here we convert the pandas dataframes to R dataframes, which are compatible with the fullmatch function, and record how long each conversion takes.

The try-except block accounts for the fact that Windows and Mac do not have the same latest version of the rpy2 package, and the two versions use different syntax for dataframe conversion.

In [19]:
with localconverter(robjects.default_converter + pandas2ri.converter):
    try:
        lowDim_R_runtime = time.time()
        lowDim_dataset_R = robjects.conversion.py2rpy(lowDim_dataset)
        lowDim_R_runtime = time.time()-lowDim_R_runtime
        
        lowDim_propensity_R_runtime = time.time()
        lowDim_dataset_propensity_R = robjects.conversion.py2rpy(lowDim_dataset_propensity)
        lowDim_propensity_R_runtime = time.time()-lowDim_propensity_R_runtime
        
        lowDim_linear_propensity_R_runtime = time.time()
        lowDim_dataset_linear_propensity_R = robjects.conversion.py2rpy(lowDim_dataset_linear_propensity)
        lowDim_linear_propensity_R_runtime = time.time()-lowDim_linear_propensity_R_runtime
        
    except:
        lowDim_R_runtime = time.time()
        lowDim_dataset_R = pandas2ri.py2ri(lowDim_dataset)
        lowDim_R_runtime = time.time()-lowDim_R_runtime
        
        lowDim_propensity_R_runtime = time.time()
        lowDim_dataset_propensity_R = pandas2ri.py2ri(lowDim_dataset_propensity)
        lowDim_propensity_R_runtime = time.time()-lowDim_propensity_R_runtime
        
        lowDim_linear_propensity_R_runtime = time.time()
        lowDim_dataset_linear_propensity_R = pandas2ri.py2ri(lowDim_dataset_linear_propensity)
        lowDim_linear_propensity_R_runtime = time.time()-lowDim_linear_propensity_R_runtime

In [20]:
with localconverter(robjects.default_converter + pandas2ri.converter):
    try:
        highDim_R_runtime = time.time()
        highDim_dataset_R = robjects.conversion.py2rpy(highDim_dataset)
        highDim_R_runtime = time.time()-highDim_R_runtime
        
        highDim_propensity_R_runtime = time.time()
        highDim_dataset_propensity_R = robjects.conversion.py2rpy(highDim_dataset_propensity)
        highDim_propensity_R_runtime = time.time()-highDim_propensity_R_runtime
        
        highDim_linear_propensity_R_runtime = time.time()
        highDim_dataset_linear_propensity_R = robjects.conversion.py2rpy(highDim_dataset_linear_propensity)
        highDim_linear_propensity_R_runtime = time.time()-highDim_linear_propensity_R_runtime
        
    except:
        highDim_R_runtime = time.time()
        highDim_dataset_R = pandas2ri.py2ri(highDim_dataset)
        highDim_R_runtime = time.time()-highDim_R_runtime
        
        highDim_propensity_R_runtime = time.time()
        highDim_dataset_propensity_R = pandas2ri.py2ri(highDim_dataset_propensity)
        highDim_propensity_R_runtime = time.time()-highDim_propensity_R_runtime
        
        highDim_linear_propensity_R_runtime = time.time()
        highDim_dataset_linear_propensity_R = pandas2ri.py2ri(highDim_dataset_linear_propensity)
        highDim_linear_propensity_R_runtime = time.time()-highDim_linear_propensity_R_runtime

### Method 1: Mahalanobis

The Mahalanobis distance matrix is given by
$$D_{ij} = (X_i-X_j)^T\Sigma^{-1}(X_i-X_j)$$
where $\Sigma$ is the covariance matrix of $X$ in the pooled treatment and full control groups.

Mahalanobis does not require propensity scores and instead uses the features and covariance matrix of the pooled treatment and full control groups to create a distance matrix. Intuitively, the Mahalanobis distance measures the distance of two points relative to the centroid of all of the data points with the axes being determined by the direction of greatest variance in the cloud of points. That is, we let the data itself determine the coordinate system. For uncorrelated variables, the covariance matrix becomes a diagonal matrix, so the Mahalanobis distance between two points is equal to their standardized Euclidean distance in this case. 
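
The formula can be checked numerically on a small example. The sketch below is an illustration on random toy data with a helper named `mahalanobis_sq` (both assumptions), not the computation that optmatch performs internally. It evaluates $D_{ij}$ as defined above and confirms that, with a diagonal covariance matrix, it matches the standardized Euclidean form (here in squared units, since $D_{ij}$ is a quadratic form).

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, cov):
    """Quadratic form (x_i - x_j)^T cov^{-1} (x_i - x_j)."""
    d = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    # Solve cov @ z = d rather than forming the explicit inverse
    return float(d @ np.linalg.solve(cov, d))

rng = np.random.default_rng(0)
# Toy features on very different scales
X = rng.normal(size=(200, 3)) * np.array([1.0, 5.0, 0.2])

# Diagonal covariance: keep only the per-feature variances
variances = X.var(axis=0, ddof=1)
cov_diag = np.diag(variances)

d_mahal = mahalanobis_sq(X[0], X[1], cov_diag)
d_std_euclid = float(np.sum((X[0] - X[1]) ** 2 / variances))

print(np.isclose(d_mahal, d_std_euclid))  # True
```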

#### a. Low Dim Data

Mahalanobis generally works well, in terms of both runtime and error, when there are relatively few covariates, because the covariance matrix is easier to invert. Here it works well because its distance calculation takes advantage of the correlations between features.

In [21]:
start = time.time()
full_match_Mahalanobis_factor = optmatch.fullmatch(optmatch.match_on(Formula('A~.-Y'),data=lowDim_dataset_R,method='mahalanobis'),data=lowDim_dataset_R)
lowDim_dataset['assign'] = list(full_match_Mahalanobis_factor)

In [22]:
#compute ATE
ATE_vec = []
weights = []

for i in range(max(list(full_match_Mahalanobis_factor))):
    temp = lowDim_dataset.loc[lowDim_dataset['assign']==i+1]
    
    treatment_Y = temp.loc[temp['A']==1]['Y'].values
    control_Y = temp.loc[temp['A']==0]['Y'].values
    
    ATE_vec.append(np.mean(treatment_Y)-np.mean(control_Y))
    weights.append(len(treatment_Y)+len(control_Y))

lowDim_mahalanobis_est_ATE = np.average(ATE_vec, weights = weights)

end = time.time()
lowDim_mahalanobis_match_runtime = end-start

In [23]:
#runtime is time to convert to R data frame + time to do matching
lowDim_mahalanobis_runtime = "{:,.3f}".format(lowDim_R_runtime+lowDim_mahalanobis_match_runtime)
lowDim_mahalanobis_runtime

'0.461'

In [24]:
lowDim_mahalanobis_error = abs(lowDim_true_ATE-lowDim_mahalanobis_est_ATE)
lowDim_mahalanobis_error ="{:,.3f}".format(lowDim_mahalanobis_error)
print(lowDim_mahalanobis_error)

0.406


#### b. High Dim Data

Mahalanobis has a higher error on the high dimensional dataset because the distance matrix treats all interactions between features as equally important. Full matching with Mahalanobis therefore tries to capture many multi-way interactions, and it does not perform well when there are too many interactions to keep track of in the matching criteria. Mahalanobis also takes considerably longer to run on high dimensional data because of the cost of inverting the covariance matrix.

In [6]:
start = time.time()
full_match_Mahalanobis_factor = optmatch.fullmatch(optmatch.match_on(Formula('A~.-Y'),data=highDim_dataset_R,method='mahalanobis'),data=highDim_dataset_R)
highDim_dataset['assign'] = list(full_match_Mahalanobis_factor)

In [26]:
#compute ATE
ATE_vec = []
weights = []

for i in range(max(list(full_match_Mahalanobis_factor))):
    temp = highDim_dataset.loc[highDim_dataset['assign']==i+1]
    
    treatment_Y = temp.loc[temp['A']==1]['Y'].values
    control_Y = temp.loc[temp['A']==0]['Y'].values
    
    ATE_vec.append(np.mean(treatment_Y)-np.mean(control_Y))
    weights.append(len(treatment_Y)+len(control_Y))
    
highDim_mahalanobis_est_ATE = np.average(ATE_vec, weights=weights)
    
end = time.time()
highDim_mahalanobis_match_runtime = end-start

In [27]:
highDim_mahalanobis_runtime = "{:,.3f}".format(highDim_R_runtime+highDim_mahalanobis_match_runtime)
highDim_mahalanobis_runtime

'51.508'

In [28]:
highDim_mahalanobis_error = abs(highDim_true_ATE-highDim_mahalanobis_est_ATE)
highDim_mahalanobis_error= "{:,.3f}".format(highDim_mahalanobis_error)
print(highDim_mahalanobis_error)

1.447


### Method 2: Propensity Score

#### a. Low Dim Data

In [29]:
start = time.time()
full_match_propensity_factor = optmatch.fullmatch(optmatch.match_on(Formula('A~propensity_score'),data=lowDim_dataset_propensity_R,method='euclidean'),data=lowDim_dataset_propensity_R)
lowDim_dataset_propensity['assign'] = list(full_match_propensity_factor)

In [30]:
#compute ATE
ATE_vec = []
weights = []

for i in range(max(list(full_match_propensity_factor))):
    temp = lowDim_dataset_propensity.loc[lowDim_dataset_propensity['assign']==i+1]
    
    treatment_Y = temp.loc[temp['A']==1]['Y'].values
    control_Y = temp.loc[temp['A']==0]['Y'].values
    
    ATE_vec.append(np.mean(treatment_Y)-np.mean(control_Y))
    weights.append(len(treatment_Y)+len(control_Y))

lowDim_propensity_est_ATE = np.average(ATE_vec, weights=weights)
    
end = time.time()
lowDim_propensity_match_runtime = end-start
    

In [31]:
lowDim_propensity_runtime = "{:,.3f}".format(lowDim_propensity_R_runtime+lowDim_propensity_match_runtime)
lowDim_propensity_runtime

'0.307'

In [32]:
lowDim_propensity_error = abs(lowDim_true_ATE-lowDim_propensity_est_ATE)
lowDim_propensity_error ="{:,.3f}".format(lowDim_propensity_error)
print(lowDim_propensity_error)

0.888


#### b. High Dim Data

In [33]:
start = time.time()
full_match_propensity_factor = optmatch.fullmatch(optmatch.match_on(Formula('A~propensity_score'),data=highDim_dataset_propensity_R,method='euclidean'),data=highDim_dataset_propensity_R)
highDim_dataset_propensity['assign'] = list(full_match_propensity_factor)

In [34]:
#compute ATE
ATE_vec = []
weights = []

for i in range(max(list(full_match_propensity_factor))):
    temp = highDim_dataset_propensity.loc[highDim_dataset_propensity['assign']==i+1]
    
    treatment_Y = temp.loc[temp['A']==1]['Y'].values
    control_Y = temp.loc[temp['A']==0]['Y'].values
    
    ATE_vec.append(np.mean(treatment_Y)-np.mean(control_Y))
    weights.append(len(treatment_Y)+len(control_Y))

highDim_propensity_est_ATE = np.average(ATE_vec, weights=weights)    

end = time.time()
highDim_propensity_match_runtime = end-start

In [35]:
highDim_propensity_runtime = "{:,.3f}".format(highDim_propensity_R_runtime+highDim_propensity_match_runtime)
highDim_propensity_runtime

'5.227'

In [36]:
highDim_propensity_error = abs(highDim_true_ATE-highDim_propensity_est_ATE)
highDim_propensity_error ="{:,.3f}".format(highDim_propensity_error)
print(highDim_propensity_error)

0.292


### Method 3: Linear Propensity Score

#### a. Low Dim Data

In [37]:
start = time.time()
full_match_linear_propensity_factor = optmatch.fullmatch(optmatch.match_on(Formula('A~linear_propensity_score'),data=lowDim_dataset_linear_propensity_R,method='euclidean'),data=lowDim_dataset_linear_propensity_R)
lowDim_dataset_linear_propensity['assign'] = list(full_match_linear_propensity_factor)

In [38]:
#compute ATE
ATE_vec = []
weights = []

for i in range(max(list(full_match_linear_propensity_factor))):
    temp = lowDim_dataset_linear_propensity.loc[lowDim_dataset_linear_propensity['assign']==i+1]
    
    treatment_Y = temp.loc[temp['A']==1]['Y'].values
    control_Y = temp.loc[temp['A']==0]['Y'].values
    
    ATE_vec.append(np.mean(treatment_Y)-np.mean(control_Y))
    weights.append(len(treatment_Y)+len(control_Y))

lowDim_linear_propensity_est_ATE = np.average(ATE_vec, weights=weights)

end = time.time()
lowDim_linear_propensity_match_runtime = end-start

In [39]:
lowDim_linear_propensity_runtime = "{:,.3f}".format(lowDim_linear_propensity_R_runtime+lowDim_linear_propensity_match_runtime)
lowDim_linear_propensity_runtime

'0.304'

In [40]:
lowDim_linear_propensity_error = abs(lowDim_true_ATE-lowDim_linear_propensity_est_ATE)
lowDim_linear_propensity_error ="{:,.3f}".format(lowDim_linear_propensity_error)
print(lowDim_linear_propensity_error)

0.976


#### b. High Dim Data

In [41]:
start = time.time()
full_match_linear_propensity_factor = optmatch.fullmatch(optmatch.match_on(Formula('A~linear_propensity_score'),data=highDim_dataset_linear_propensity_R,
                                                                           method='euclidean'),data=highDim_dataset_linear_propensity_R)
highDim_dataset_linear_propensity['assign'] = list(full_match_linear_propensity_factor)

In [42]:
#compute ATE
ATE_vec = []
weights = []

for i in range(max(list(full_match_linear_propensity_factor))):
    temp = highDim_dataset_linear_propensity.loc[highDim_dataset_linear_propensity['assign']==i+1]
    
    treatment_Y = temp.loc[temp['A']==1]['Y'].values
    control_Y = temp.loc[temp['A']==0]['Y'].values
    
    ATE_vec.append(np.mean(treatment_Y)-np.mean(control_Y))
    weights.append(len(treatment_Y)+len(control_Y))
    
highDim_linear_propensity_est_ATE=np.average(ATE_vec, weights=weights)

end = time.time()
highDim_linear_propensity_match_runtime = end-start

In [43]:
highDim_linear_propensity_runtime = "{:,.3f}".format(highDim_linear_propensity_R_runtime+highDim_linear_propensity_match_runtime)
highDim_linear_propensity_runtime

'5.231'

In [44]:
highDim_linear_propensity_error = abs(highDim_true_ATE-highDim_linear_propensity_est_ATE)
highDim_linear_propensity_error ="{:,.3f}".format(highDim_linear_propensity_error)
print(highDim_linear_propensity_error)

0.232


# Step 4: Inverse Propensity Weighting Algorithm

When comparing outcomes between treatment groups in observational data, ignoring confounding factors biases the effect estimates. Inverse probability weighting (IPW), which is based on the marginal structural model, is a widely used method for estimating treatment effects from observational data and can handle a large number of confounding variables. Applying these weights in statistical tests or regression models reduces or removes the impact of confounders. For inverse probability of treatment weighting (IPTW), we use the inverse of the propensity score as the weight when estimating the ATE.

The weight $w_i$ is
$$w_i = \frac{T_i}{\hat{e}_i} + \frac{1 - T_i}{1 - \hat{e}_i}$$
where $\hat{e}_i$ is the estimated propensity score for individual $i$ and $T_i$ is the treatment indicator: $T_i = 1$ if individual $i$ is in the treatment group and $T_i = 0$ if individual $i$ is in the control group.

In this project, IPW does not work well on either the low or the high dimensional dataset. A likely reason is a known limitation of the Inverse Probability Weighted Estimator (IPWE): it can be unstable when estimated propensities are small. If the probability of either treatment assignment is small, the propensity score model can become unstable around the tails, which makes the IPWE less stable as well. IPW also relies on several assumptions: no unmeasured confounding, positivity, the stable unit treatment value assumption (SUTVA), and a correctly specified model for the weights.

The estimated ATE using IPW is
$$\hat{\Delta}_{IPW} = N^{-1} \left(\sum_{i \in \text{treated}}{w_i Y_i} - \sum_{i \in \text{controlled}}{w_i Y_i} \right)$$
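
A vectorized sketch of the weight and estimator formulas above, on made-up numbers, is shown here. The function name `ipw_ate_vectorized` and the toy arrays are assumptions for illustration; the arithmetic follows $w_i$ and $\hat{\Delta}_{IPW}$ as written.

```python
import numpy as np

def ipw_ate_vectorized(Y, T, e_hat):
    """N^{-1} * (sum of w_i * Y_i over treated - sum of w_i * Y_i over controls)."""
    Y, T, e_hat = (np.asarray(v, dtype=float) for v in (Y, T, e_hat))
    # w_i = T_i / e_i + (1 - T_i) / (1 - e_i)
    w = T / e_hat + (1 - T) / (1 - e_hat)
    treated_sum = np.sum(w * Y * T)
    control_sum = np.sum(w * Y * (1 - T))
    return (treated_sum - control_sum) / len(Y)

# Hypothetical toy data: two treated and two control individuals
Y = [4.0, 6.0, 1.0, 3.0]
T = [1, 1, 0, 0]
e_hat = [0.5, 0.8, 0.5, 0.2]

print(ipw_ate_vectorized(Y, T, e_hat))
```

The weights here are 2, 1.25, 2, and 1.25, so the weighted sums are 15.5 for the treated and 5.75 for the controls. Because a treated individual's weight is $1/\hat{e}_i$, a propensity of 0.01 would give a weight of 100, which is how a few small propensities can dominate the estimate.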

### 1. Reset data & Define Functions

In [10]:
lowDim_dataset = pd.read_csv('../data/lowDim_dataset.csv')
highDim_dataset = pd.read_csv('../data/highDim_dataset.csv')

In [11]:
def ipw_ate(dataset):
    treated = 0
    controlled = 0
    for i in range(dataset.shape[0]):
        if dataset['A'][i] == 1:
            treated += dataset['Y'][i] * dataset['weight'][i]
        else:
            controlled += dataset['Y'][i] * dataset['weight'][i]

    print(treated - controlled)
    ate = (treated - controlled)/dataset.shape[0]
    return ate

### a. Low Dim Data

In [12]:
runtime = time.time()
X=lowDim_dataset.iloc[:,2:].values
A=lowDim_dataset['A'].values
gbm = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, min_samples_leaf = 1,
                                min_samples_split = 2, n_estimators = 150).fit(X,A)
low_dim_propensity_scores = [x[1] for x in gbm.predict_proba(X)]
lowDim_dataset_ipw = lowDim_dataset.copy()  # copy so the score/weight columns are not added to lowDim_dataset
lowDim_dataset_ipw['score'] = low_dim_propensity_scores
lowDim_dataset_ipw['weight'] = lowDim_dataset_ipw['A']/lowDim_dataset_ipw['score'] + (1 - lowDim_dataset_ipw['A'])/(1 - lowDim_dataset_ipw['score'])
ate_low = ipw_ate(lowDim_dataset_ipw)
runtime_low_ipw = time.time()-runtime
lowDim_ipw_error = abs(ate_low - lowDim_true_ATE)
lowDim_ipw_error = "{:,.3f}".format(lowDim_ipw_error)
print("ATE for low dimension is: ", ate_low)
print("Runtime for low dimension is: ", runtime_low_ipw)
print("ATE error for low dimension is: ", lowDim_ipw_error)

398.383166304905
ATE for low dimension is:  0.8387014027471684
Runtime for low dimension is:  0.28824400901794434
ATE error for low dimension is:  1.661


### b. High Dim Data

In [13]:
runtime = time.time()
X=highDim_dataset.iloc[:,2:].values
A=highDim_dataset['A'].values
Y=highDim_dataset['Y'].values

gbm = GradientBoostingClassifier(learning_rate = 0.05, max_depth = 1, min_samples_leaf = 5,
                                min_samples_split = 2, n_estimators = 100).fit(X,A)
high_dim_propensity_scores = [x[1] for x in gbm.predict_proba(X)]
highDim_dataset_ipw = highDim_dataset.copy()  # copy so the score/weight columns are not added to highDim_dataset
highDim_dataset_ipw['score'] = high_dim_propensity_scores
highDim_dataset_ipw['weight'] = highDim_dataset_ipw['A']/highDim_dataset_ipw['score'] + (1 - highDim_dataset_ipw['A'])/(1 - highDim_dataset_ipw['score'])
ate_high = ipw_ate(highDim_dataset_ipw)
runtime_high_ipw = time.time()-runtime
highDim_ipw_error = abs(ate_high-highDim_true_ATE)
highDim_ipw_error = "{:,.3f}".format(highDim_ipw_error)
print("ATE for high dimension is: ", ate_high)
print("Runtime for high dimension is: ", runtime_high_ipw)
print("ATE error for high dimension is: ", highDim_ipw_error)

-3693.534034205346
ATE for high dimension is:  -1.8467670171026729
Runtime for high dimension is:  0.6256771087646484
ATE error for high dimension is:  1.153


# Step 5: Stratification

In [52]:
# Method to stratify data 
def stratify(df):
    
    Y = df['Y']
    D = df['A']
    scores = df['propensity_scores']
    
    # Create stratum and stratum limits
    Q1 = np.quantile(scores, .20)
    Q2 = np.quantile(scores, .40)
    Q3 = np.quantile(scores, .60)
    Q4 = np.quantile(scores, .80)
    Q5 = np.quantile(scores, 1.0)
    
    quin1 = df[df['propensity_scores']<= Q1]
    quin2 = df[(df['propensity_scores']> Q1) & (df['propensity_scores']<= Q2)]
    quin3 = df[(df['propensity_scores']> Q2) & (df['propensity_scores']<= Q3)]
    quin4 = df[(df['propensity_scores']> Q3) & (df['propensity_scores']<= Q4)]
    quin5 = df[df['propensity_scores']> Q4]

    quintiles = [quin1, quin2, quin3, quin4, quin5]
    Q_ranges = [None, Q1, Q2, Q3, Q4, Q5]

    return [quintiles, Q_ranges]

In [53]:
# Method to calc ATE
def strat_ATE(quintiles, Q_ranges):
    results = []
    N = sum(len(stratum) for stratum in quintiles)   # Total number of ind

    for stratum in quintiles:
        Nj = len(stratum)                      # Number of ind in stratum
        N1j = (stratum['A'] == 1).sum()        # Number of treated ind
        N0j = (stratum['A'] == 0).sum()        # Number of control ind

        # Summation of treated responses within stratum
        sum1 = (stratum['Y'] * stratum['A']).sum()
        # Summation of control responses within stratum
        sum2 = (stratum['Y'] * (1 - stratum['A'])).sum()

        # Weight the stratum's mean difference by its share of the sample
        results.append(Nj/N * ((sum1/N1j)-(sum2/N0j)))

    return sum(results)
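
The two functions above compute $\sum_j \frac{N_j}{N}(\bar{Y}_{1j} - \bar{Y}_{0j})$ over the five propensity score quintiles. A compact numpy sketch of the same calculation is given below for illustration; it is not the code used for the reported results, and the function name `stratified_ate`, the `n_strata` parameter, and the toy data are all assumptions.

```python
import numpy as np

def stratified_ate(Y, A, scores, n_strata=5):
    """Sum over strata of (N_j / N) * (mean treated Y - mean control Y)."""
    Y = np.asarray(Y, dtype=float)
    A = np.asarray(A)
    scores = np.asarray(scores, dtype=float)
    # Interior quantile cut points, e.g. 20%, 40%, 60%, 80% for quintiles
    cuts = np.quantile(scores, np.linspace(0, 1, n_strata + 1)[1:-1])
    # right=True places a score equal to a cut point in the lower stratum (score <= cut)
    stratum = np.digitize(scores, cuts, right=True)
    ate = 0.0
    for j in range(n_strata):
        in_j = stratum == j
        diff = Y[in_j & (A == 1)].mean() - Y[in_j & (A == 0)].mean()
        ate += in_j.sum() / len(Y) * diff
    return ate

# Hypothetical toy data: ten individuals, one control and one treated per quintile
scores = np.arange(1, 11) / 20
A = np.array([0, 1] * 5)
Y = np.array([1, 2, 2, 4, 3, 6, 4, 6, 5, 7], dtype=float)

print(stratified_ate(Y, A, scores))
```

In this toy data the within-stratum differences are 1, 2, 3, 2, and 2, each with weight 2/10, so the estimate is about 2. A stratum with no treated or no control individuals would produce a NaN, so the sketch assumes every stratum contains both groups.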

### 1. Reload data

In [54]:
lowDim_dataset = pd.read_csv('../data/lowDim_dataset.csv')
highDim_dataset = pd.read_csv('../data/highDim_dataset.csv')

### a. Low Dim data

In [55]:
lowDim_scores = pd.read_csv('../output/low_dim_propensity_scores.csv') 
lowDim_scores.insert( 1 , "Y" , lowDim_dataset['Y']) 
lowDim_scores.insert( 2 , "A" , lowDim_dataset['A'])

In [56]:
start = time.time()

quintiles , Q_ranges = stratify(lowDim_scores) 

lowDim_stratification_ATE = strat_ATE(quintiles ,Q_ranges)
end = time.time()
lowdim_strat_runtime = end - start
print( "Estimated ATE: " , lowDim_stratification_ATE)

Estimated ATE:  2.463529123502176


In [57]:
lowdim_strat_runtime = "{:,.3f}".format(lowdim_strat_runtime)
#lowDim_stratification_ATE = "{:,.3f}".format(lowDim_stratification_ATE)
#print(lowDim_stratification_ATE)
lowDim_stratification_error = abs(lowDim_stratification_ATE-lowDim_true_ATE)
lowDim_stratification_error = "{:,.3f}".format(lowDim_stratification_error)
print(lowDim_stratification_error)
print(lowdim_strat_runtime)

0.036
0.015


### b. High Dim Data

In [58]:
# Get Calculated Propensity Scores 
highDim_scores = pd.read_csv('../output/high_dim_propensity_scores.csv') 
highDim_scores.insert( 1 , "Y" , highDim_dataset['Y']) 
highDim_scores.insert( 2 , "A" , highDim_dataset['A'])

In [59]:
start = time.time()

quintiles , Q_ranges = stratify(highDim_scores)

highDim_stratification_ATE = strat_ATE(quintiles ,Q_ranges)

end = time.time()

In [60]:
highdim_strat_runtime="{:,.3f}".format(end-start)
#highDim_stratification_ATE = "{:,.3f}".format(highDim_stratification_ATE)
#print(highDim_stratification_ATE)
highDim_stratification_error = abs(highDim_stratification_ATE-highDim_true_ATE)
highDim_stratification_error = "{:,.3f}".format(highDim_stratification_error)
print(highDim_stratification_error)
print(highdim_strat_runtime)

0.010
0.014


# Step 6: Comparison

Finally, we have a table of all of the absolute errors and the runtimes for each method across the low dimensional and high dimensional datasets. Note that this notebook was run on a Mac with 64 GB of RAM and an i9 CPU, so the runtimes may be different if you run this notebook on your system.

We can see that stratification is the best overall method on both the low dimensional and the high dimensional dataset, in terms of both runtime and absolute error. The reason is because…

In [61]:
table = [["ATE Absolute Error",'Low Dim',lowDim_mahalanobis_error,lowDim_propensity_error,
         lowDim_linear_propensity_error,lowDim_ipw_error,lowDim_stratification_error],
        ["",'High Dim',highDim_mahalanobis_error,highDim_propensity_error,
         highDim_linear_propensity_error,highDim_ipw_error,highDim_stratification_error],
        ["Run Time (sec)",'Low Dim',lowDim_mahalanobis_runtime,lowDim_propensity_runtime,lowDim_linear_propensity_runtime,
         "{:,.3f}".format(runtime_low_ipw), lowdim_strat_runtime],
        ["",'High Dim',highDim_mahalanobis_runtime,highDim_propensity_runtime,highDim_linear_propensity_runtime,
         "{:,.3f}".format(runtime_high_ipw), highdim_strat_runtime],
        ["Computer Used",'','PC','PC','PC','PC','PC'],
        ["Stable/Nonstable",'','Stable','Stable','Stable','Stable','Stable']]

display(HTML(tabulate.tabulate(table, headers=["Metric","Dimension", "Full Matching-\nMahalanobis",
                                               "Full Matching-\nPropensity score", "Full Matching-\nLinear Propensity Score",
                                               "Inverse Propensity\nWeighting", 'Stratification'],
                                tablefmt='html')))

| Metric | Dimension | Full Matching- Mahalanobis | Full Matching- Propensity score | Full Matching- Linear Propensity Score | Inverse Propensity Weighting | Stratification |
|---|---|---|---|---|---|---|
| ATE Absolute Error | Low Dim | 0.406 | 0.888 | 0.976 | 0.705 | 0.036 |
| | High Dim | 1.447 | 0.292 | 0.232 | 0.947 | 0.010 |
| Run Time (sec) | Low Dim | 0.461 | 0.307 | 0.304 | 52.604 | 0.015 |
| | High Dim | 51.508 | 5.227 | 5.231 | 372.688 | 0.014 |
| Computer Used | | PC | PC | PC | PC | PC |
| Stable/Nonstable | | Stable | Stable | Stable | Stable | Stable |


# Reference Papers

Chan, D., Ge, R., Gershony, O., Hesterberg, T., & Lambert, D. (2010). Evaluating online ad campaigns in a pipeline: causal models at scale. KDD '10.

McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4), 403–425. https://doi.org/10.1037/1082-989X.9.4.403

Raad, H., Cornelius, V., Chan, S., Williamson, E., & Cro, S. (2020). An evaluation of inverse probability weighting using the propensity score for baseline covariate adjustment in smaller population randomised controlled trials with a continuous outcome. BMC Medical Research Methodology, 20(1), 1–12. https://link.springer.com/article/10.1186/s12874-020-00947-7

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21. https://doi.org/10.1214/09-STS313

Stuart, E. A., & Green, K. M. (2008). Using full matching to estimate causal effects in nonexperimental studies: examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology, 44(2), 395–406. https://doi.org/10.1037/0012-1649.44.2.395