# Augmented Inverse Probability of Treatment Weights
Augmented-IPTW (AIPTW) is a doubly robust estimator. Essentially, AIPTW combines the IPTW estimator and g-formula into a single estimate. Before continuing, I will briefly outline what a doubly-robust estimator is and why you would want to use one. In observational research with high-dimensional data, we (generally) are forced to use parametric models to adjust for many confounders. In this scenario, we assume that our parametric models are correctly specified. Our statistical model, $\mathcal{M}$, must include the distribution that the data came from. 

With other estimators, like IPTW or g-formula, we have one chance to specify $\mathcal{M}$ correctly. Doubly-robust estimators use a model to predict the treatment (like IPTW) and another model to predict the outcome (like g-formula). The estimator then combines the estimates, such that if either is correct, then our estimate will be consistent. Essentially, we get two chances to get the statistical model correct.

A more in-depth description of doubly robust estimators is available in [this pre-print](https://statnav.files.wordpress.com/2017/10/doublerobustness-preprint.pdf).

## AIPTW

AIPTW takes the following form
$$\hat{E}[Y^a] = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y \times I(A=a)}{\widehat{\Pr}(A=a|L)} - \frac{\hat{E}[Y|A=a, L] \times (I(A=a) - \widehat{\Pr}(A=a|L))}{\widehat{\Pr}(A=a|L)}\right)$$
where $\widehat{\Pr}(A=a|L)$ comes from the IPTW model and $\hat{E}[Y|A=a,L]$ comes from the g-formula. If we do some manipulations and assume an infinite sample size, we can end up with
$$\hat{E}^{IPW}[Y^a] \times \frac{\Pr(A=a|L)}{\widehat{\Pr}(A=a|L, \mathcal{M})} - \hat{E}^{STD}[Y^a] \times \frac{\Pr(A=a|L) - \widehat{\Pr}(A=a|L, \mathcal{M})}{\widehat{\Pr}(A=a|L, \mathcal{M})}$$
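From this form, we can see that as long as one of the two estimates is correct, AIPTW will be unbiased. For example, if the treatment model is correct, the ratio in the first term is 1 and the second term is 0, leaving the IPTW estimate.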
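
To make the estimator concrete, here is a minimal sketch in plain Python. The function name `aiptw_mean` and the 20-person data set with a single binary confounder are hypothetical and are not part of *zEpid*. The nuisance estimates are saturated (stratum-specific proportions and means), which is why the double-robustness property holds exactly in this small sample rather than only asymptotically.

```python
def aiptw_mean(y, a, pr_a, y_hat, a_star=1):
    """AIPTW estimate of E[Y^a] for a = a_star.

    pr_a  : estimated Pr(A = a_star | L) for each person
    y_hat : estimated E[Y | A = a_star, L] for each person
    """
    total = 0.0
    for yi, ai, pi, mi in zip(y, a, pr_a, y_hat):
        ind = 1.0 if ai == a_star else 0.0
        # Weighted outcome minus the augmentation term
        total += yi * ind / pi - mi * (ind - pi) / pi
    return total / len(y)


# Hypothetical data: (L, A, Y) with one binary confounder L
rows = (
    [(0, 1, 1)] * 1 + [(0, 1, 0)] * 3      # L=0, treated:   risk 0.25
    + [(0, 0, 1)] * 3 + [(0, 0, 0)] * 3    # L=0, untreated: risk 0.50
    + [(1, 1, 1)] * 3 + [(1, 1, 0)] * 3    # L=1, treated:   risk 0.50
    + [(1, 0, 1)] * 3 + [(1, 0, 0)] * 1    # L=1, untreated: risk 0.75
)
l = [r[0] for r in rows]
a = [r[1] for r in rows]
y = [r[2] for r in rows]

# Correct (saturated) nuisance estimates for a = 1
pr_true = [0.4 if li == 0 else 0.6 for li in l]     # Pr(A=1 | L)
y1_true = [0.25 if li == 0 else 0.50 for li in l]   # E[Y | A=1, L]

# Deliberately wrong nuisance estimates that ignore L
pr_wrong = [0.5] * len(l)
y1_wrong = [0.9] * len(l)

both_right = aiptw_mean(y, a, pr_true, y1_true)
wrong_treatment = aiptw_mean(y, a, pr_wrong, y1_true)
wrong_outcome = aiptw_mean(y, a, pr_true, y1_wrong)
both_wrong = aiptw_mean(y, a, pr_wrong, y1_wrong)
```

Here `both_right`, `wrong_treatment`, and `wrong_outcome` all equal 0.375, the standardized risk under treatment in these data, while `both_wrong` gives 0.4. One correct model is enough; two wrong models are not.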

## An example
To motivate our example, we will use a simulated data set included with *zEpid*. In the data set, we have a cohort of HIV-positive individuals. We are interested in the sample average treatment effect of antiretroviral therapy (ART) on all-cause mortality at 45 weeks. Based on substantive background knowledge, we believe that the treated and untreated populations are exchangeable conditional on gender, age, CD4 T-cell count, and detectable viral load.

In this tutorial, we will focus on a complete case analysis. Therefore, we will drop the `cd4_wk45` column and all observations with a missing value for `dead`. This leaves 517 observations with no missing data.

In [1]:
import numpy as np
import pandas as pd

from zepid import load_sample_data, spline
from zepid.causal.doublyrobust import AIPTW

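df = load_sample_data(False)
# Restricted quadratic splines (3 knots) for age and CD4 T-cell count
df[['age_rs1', 'age_rs2']] = spline(df, 'age0', n_knots=3, term=2, restricted=True)
df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)

# Complete-case data: drop `cd4_wk45`, then drop rows with any missing value
dfcc = df.drop(columns=['cd4_wk45']).dropna()
dfcc.info()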

<class 'pandas.core.frame.DataFrame'>
Int64Index: 517 entries, 0 to 546
Data columns (total 12 columns):
id         517 non-null int64
male       517 non-null int64
age0       517 non-null int64
cd40       517 non-null int64
dvl0       517 non-null int64
art        517 non-null int64
dead       517 non-null float64
t          517 non-null float64
age_rs1    517 non-null float64
age_rs2    517 non-null float64
cd4_rs1    517 non-null float64
cd4_rs2    517 non-null float64
dtypes: float64(6), int64(6)
memory usage: 52.5 KB


Our data is now ready for a complete case analysis using AIPTW. First, we initialize `AIPTW` with our complete-case data (`dfcc`), the treatment (`art`), and the outcome (`dead`).

In [2]:
aipw = AIPTW(dfcc, exposure='art', outcome='dead')

### Treatment Model
First, we will specify our treatment model. We believe the sufficient set for the treatment model is gender (`male`), age (`age0`), CD4 T-cell count (`cd40`), and detectable viral load (`dvl0`). To relax the functional form assumptions, we will model age and CD4 using restricted quadratic splines.

In [3]:
aipw.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')


----------------------------------------------------------------
MODEL: art ~ male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0
-----------------------------------------------------------------
                 Generalized Linear Model Regression Results                  
Dep. Variable:                    art   No. Observations:                  517
Model:                            GLM   Df Residuals:                      508
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -206.06
Date:                Sun, 31 Mar 2019   Deviance:                       412.12
Time:                        15:31:32   Pearson chi2:                     510.
No. Iterations:                     5   Covariance Type:             nonrobust
                 coef    std err          z      P>|z|      [0.025      0.975]

`AIPTW` uses a logistic regression model to estimate the probabilities of treatment, and the corresponding summary of the model fit is printed to the console.
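
For readers unfamiliar with the spline terms (`age_rs1`, `age_rs2`, `cd4_rs1`, `cd4_rs2`) entering this model, the sketch below shows one common construction of a restricted quadratic spline basis. The function name, the ages, and the knot locations are hypothetical, and I am not claiming this reproduces the internals of *zEpid*'s `spline`. It only illustrates why three knots give two columns and what "restricted" means.

```python
def restricted_quadratic_spline(x, knots):
    """Return one list of basis values per knot except the last.

    Each term is (x - k_j)_+^2 - (x - k_last)_+^2, so the squared
    pieces cancel above the last knot and the basis is linear there.
    """
    def pos_sq(v, k):
        # Squared distance above the knot, zero at or below it
        return (v - k) ** 2 if v > k else 0.0

    k_last = knots[-1]
    return [[pos_sq(v, k) - pos_sq(v, k_last) for v in x] for k in knots[:-1]]


# Hypothetical ages and knot locations chosen for illustration
age = [20, 30, 40, 50, 60, 70]
rs1, rs2 = restricted_quadratic_spline(age, knots=[25, 40, 55])
```

Above the last knot (55 here), each column changes by a constant amount per year of age, which is the restriction: the fitted curve is quadratic between knots but linear in the upper tail.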

### Outcome Model
Now, we will estimate the outcome model. We will model the outcome as a function of ART (`art`), gender (`male`), age (`age0`), CD4 T-cell count (`cd40`), and detectable viral load (`dvl0`). Again, we will model age and CD4 using restricted quadratic splines.

In [4]:
aipw.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')


----------------------------------------------------------------
MODEL: dead ~ art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0
-----------------------------------------------------------------
                 Generalized Linear Model Regression Results                  
Dep. Variable:                   dead   No. Observations:                  517
Model:                            GLM   Df Residuals:                      507
Model Family:                Binomial   Df Model:                            9
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -202.85
Date:                Sun, 31 Mar 2019   Deviance:                       405.71
Time:                        15:31:32   Pearson chi2:                     535.
No. Iterations:                     6   Covariance Type:             nonrobust
                 coef    std err          z      P>|z|      [0.025     

Again, logistic regression is used to predict the outcome data.
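
The role of the outcome model in AIPTW is to supply a predicted $\hat{E}[Y|A=a, L]$ for every person at each treatment level. The sketch below illustrates that step with a reduced logistic model. The coefficients and covariate patterns are hypothetical (they are not the fitted values from the model above), and the helper names are my own.

```python
import math


def expit(z):
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_risk(coefs, art, male, dvl0):
    """Predicted Pr(dead = 1) from a logistic model with hypothetical coefficients."""
    lp = (coefs["intercept"] + coefs["art"] * art
          + coefs["male"] * male + coefs["dvl0"] * dvl0)
    return expit(lp)


# Hypothetical coefficients on the log-odds scale, NOT the fitted values
coefs = {"intercept": -1.5, "art": -0.7, "male": 0.1, "dvl0": 0.4}

# Hypothetical covariate patterns: (male, dvl0)
people = [(1, 0), (0, 1), (1, 1)]

# Predict each person's risk twice: with art set to 1, then with art set to 0
risk_treated = [predict_risk(coefs, 1, m, d) for m, d in people]
risk_untreated = [predict_risk(coefs, 0, m, d) for m, d in people]

# Averaging the predictions gives the g-formula (standardized) risks
std_risk_1 = sum(risk_treated) / len(people)
std_risk_0 = sum(risk_untreated) / len(people)
```

The key point is that predictions are made for everyone under both treatment levels, regardless of the treatment each person actually received.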

### Estimation
To estimate the risk difference and risk ratio, we will now call the `fit()` function. After this, `AIPTW` gains the attributes `risk_difference` and `risk_ratio`. Additionally, results can be printed to the console using the `summary()` function.

In [5]:
aipw.fit()

print('RD:', aipw.risk_difference)
print('RR:', aipw.risk_ratio)

aipw.summary()

RD: -0.08485106054489719
RR: 0.5319812234515429
           Augment Inverse Probability of Treatment Weights           
Risk Difference:    -0.085
95.0% two-sided CI: (-0.155 , -0.015)
----------------------------------------------------------------------
Risk Ratio:         0.532
95.0% two-sided CI: -


Interpreting the risk difference, we would conclude that had everyone in our cohort been treated with ART, the risk of all-cause mortality would have been 8.5 percentage points lower (95% CL: -0.155, -0.015) than had no one been treated.

Confidence intervals come from influence curves. They are currently only available for the risk difference.
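
As a rough illustration of an influence-curve interval, the sketch below uses a common recipe: treat each person's contribution to the AIPTW risk difference as a draw, and take the sample variance of those contributions divided by $n$ as the variance of the estimate. I am assuming, not asserting, that `AIPTW` follows the same recipe. The data set and the function name `aiptw_terms` are hypothetical.

```python
import math
import statistics


def aiptw_terms(y, a, pr_a, y_hat, a_star):
    """Per-person AIPTW contributions for treatment level a_star."""
    terms = []
    for yi, ai, pi, mi in zip(y, a, pr_a, y_hat):
        ind = 1.0 if ai == a_star else 0.0
        terms.append(yi * ind / pi - mi * (ind - pi) / pi)
    return terms


# Hypothetical data: (L, A, Y) with one binary confounder L
rows = (
    [(0, 1, 1)] * 1 + [(0, 1, 0)] * 3      # L=0, treated:   risk 0.25
    + [(0, 0, 1)] * 3 + [(0, 0, 0)] * 3    # L=0, untreated: risk 0.50
    + [(1, 1, 1)] * 3 + [(1, 1, 0)] * 3    # L=1, treated:   risk 0.50
    + [(1, 0, 1)] * 3 + [(1, 0, 0)] * 1    # L=1, untreated: risk 0.75
)
l, a, y = zip(*rows)

# Saturated nuisance estimates read off the hypothetical data
pr1 = [0.4 if li == 0 else 0.6 for li in l]     # Pr(A=1 | L)
pr0 = [1 - p for p in pr1]                      # Pr(A=0 | L)
y1 = [0.25 if li == 0 else 0.50 for li in l]    # E[Y | A=1, L]
y0 = [0.50 if li == 0 else 0.75 for li in l]    # E[Y | A=0, L]

phi1 = aiptw_terms(y, a, pr1, y1, a_star=1)
phi0 = aiptw_terms(y, a, pr0, y0, a_star=0)
diff = [p1 - p0 for p1, p0 in zip(phi1, phi0)]

n = len(diff)
rd = sum(diff) / n                               # point estimate
se = math.sqrt(statistics.variance(diff) / n)    # influence-curve standard error
z = statistics.NormalDist().inv_cdf(0.975)       # about 1.96
rd_lcl, rd_ucl = rd - z * se, rd + z * se
```

With only 20 hypothetical observations the interval is wide; the point of the sketch is the mechanics, not the numbers.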

### Confidence Intervals
Alternatively, confidence intervals can be obtained with a nonparametric bootstrap. For the risk ratio, the bootstrap is the only option at this point.

In [6]:
rd_results = []
for i in range(1000):
    # Resample the complete-case data with replacement
    dfs = dfcc.sample(n=dfcc.shape[0], replace=True)
    # Refit both models and the estimator in every resample
    s = AIPTW(dfs, exposure='art', outcome='dead')
    s.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',
                     print_results=False)
    s.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',
                    print_results=False)
    s.fit()
    rd_results.append(s.risk_difference)


print('95% LCL:', np.percentile(rd_results, q=2.5))
print('95% UCL', np.percentile(rd_results, q=97.5))

95% LCL: -0.15155098761494876
95% UCL -0.01620381403340392


Under the counterfactual of everyone receiving treatment with ART, the risk of all-cause mortality was 8.5 percentage points lower (95% CL: -0.152, -0.016) than under the counterfactual where no one had been treated. As you can see, the confidence intervals from the bootstrap and the influence curves are approximately the same.
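
The loop above collects only `risk_difference`. The same percentile approach carries over to a ratio measure, and in the loop above the analogous change would be to also append `s.risk_ratio`. So that it runs without *zEpid*, the sketch below shows the mechanics on hypothetical data, with a crude risk ratio standing in for the AIPTW estimate. The cohort, seed, and helper names are all assumptions for illustration.

```python
import random


def risk_ratio(sample):
    """Crude risk ratio from (treated, died) pairs; None if undefined."""
    n1 = sum(1 for t, _ in sample if t == 1)
    n0 = len(sample) - n1
    d1 = sum(d for t, d in sample if t == 1)
    d0 = sum(d for t, d in sample if t == 0)
    if n1 == 0 or n0 == 0 or d0 == 0:
        return None
    return (d1 / n1) / (d0 / n0)


def percentile(sorted_vals, q):
    """Linear-interpolation percentile for q in [0, 100] on a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


# Hypothetical cohort: 50 treated (10 deaths), 50 untreated (20 deaths)
data = [(1, 1)] * 10 + [(1, 0)] * 40 + [(0, 1)] * 20 + [(0, 0)] * 30

rng = random.Random(2019)  # fixed seed so the resamples are reproducible
rr_results = []
for _ in range(1000):
    resample = rng.choices(data, k=len(data))  # sample rows with replacement
    rr = risk_ratio(resample)
    if rr is not None:  # skip resamples where the ratio is undefined
        rr_results.append(rr)

rr_results.sort()
rr_lcl = percentile(rr_results, 2.5)
rr_ucl = percentile(rr_results, 97.5)
```

Setting a seed is a design choice worth copying into the real analysis, since it makes the reported confidence limits reproducible.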

## Conclusion
In this tutorial, I introduced the concept of doubly-robust estimators and detailed augmented-IPTW. I demonstrated estimation with `AIPTW` using *zEpid* and how to obtain confidence intervals. Please view other tutorials for information on other functionality within *zEpid*.

## References
Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. (2011). Doubly robust estimation of causal effects. *AJE*, 173(7), 761-767.

Lunceford JK, Davidian M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. *SiM*, 23(19), 2937-2960.

Keil AP et al. (2018). Resolving an apparent paradox in doubly robust estimators. *AJE*, 187(4), 891-892.

Robins JM, Rotnitzky A, Zhao LP. (1994). Estimation of regression coefficients when some regressors are not always observed. *JASA*, 89(427), 846-866.