# TMLE - Targeted Maximum Likelihood Estimation

Targeted learning is a method developed by Mark van der Laan that establishes a theoretically guaranteed way of applying complex machine learning estimators to causal effect estimation.  

### Motivation
Let's denote $X$ as our confounding features, $Y$ our target variable and $A$ a binary intervention variable whose effect we seek.  
Recall that a regular outcome model (a Standardization model like an S-Learner) simply works by estimating $E[Y|X,A]$. 
However, the way it adjusts for $X$ and $A$ depends on the core statistical estimator used. 
For example, a strictly linear (i.e. no interactions or polynomials) regression model will adjust for $X$ linearly, but that might not describe the response surface properly, especially in high-dimensional complex data.
To account for more complex data, we can use a more complex core estimator.  
However, applying an expressive estimator might introduce bias into the estimated causal effect of the treatment.
In many real-world scenarios, the treatment effect is pretty small, and consequently, the treatment might have very little predictive power over the outcome. 
For example, imagine feeding a tree-based estimator a feature matrix joined with a treatment assignment column (like an S-learner). 
It is not unreasonable that the tree might simply ignore the treatment variable altogether. 
And so, it will estimate a causal effect of exactly zero, since there's no difference in the predicted outcome when we plug in $A=1$ and $A=0$.  
This happens because we optimize the prediction $E[Y|X,A]$, which is not the same as optimizing for the causal parameter of interest $E[Y|X,A=1]-E[Y|X,A=0]$.  
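
To make this concrete, here is a minimal sketch in plain Python. It uses a hand-rolled depth-1 regression tree (a "stump") as a stand-in for a real tree-based estimator, and the simulated data and all names in it are assumptions made for illustration only.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical data: X drives most of Y, while the treatment A adds only 0.1.
n = 2000
X = [random.random() for _ in range(n)]
A = [random.randint(0, 1) for _ in range(n)]
Y = [10 * x + 0.1 * a + random.gauss(0, 1) for x, a in zip(X, A)]


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_cost(feature, threshold):
    """Total squared error after splitting the outcomes at feature <= threshold."""
    left = [y for f, y in zip(feature, Y) if f <= threshold]
    right = [y for f, y in zip(feature, Y) if f > threshold]
    return sse(left) + sse(right)


# Candidate splits: the treatment column, or X at a few thresholds.
candidates = [("A", A, 0.5)] + [("X", X, t / 10) for t in range(1, 10)]
name, feature, threshold = min(candidates, key=lambda c: split_cost(c[1], c[2]))

left_mean = mean(y for f, y in zip(feature, Y) if f <= threshold)
right_mean = mean(y for f, y in zip(feature, Y) if f > threshold)


def predict(x, a):
    """Stump prediction for covariate x under treatment value a."""
    value = a if name == "A" else x
    return left_mean if value <= threshold else right_mean


# S-learner style effect: average difference between plugging in A=1 and A=0.
effect = mean(predict(x, 1) - predict(x, 0) for x in X)
print(name, effect)
```

The stump spends its only split on $X$, because that reduces the prediction error far more than splitting on $A$ would. The plug-in effect is therefore exactly zero, even though the simulated effect is 0.1.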

### Intuition
A naive statistical estimator will maximize the global likelihood - it will try to estimate correctly _all_ the coefficients of all covariates, only one of which is the treatment assignment.
However, we care for one single parameter - the treatment effect - more than we care for other parameters.
Therefore, we would like to focus our estimator's attention on that parameter of interest, even at the price of neglecting some other parameters.  
Using Dr. Susan Gruber's example, this is just like a picture - where the person smiling is of greater importance and we'd like to focus on them, so we allow the background to become a bit more blurry in exchange.  

While the math behind it is complex, the basic principle is pretty simple: 
In order to focus our estimator on the treatment effect, we will use information from the treatment mechanism to update and re-target the initial outcome model prediction.
This will allow us to use highly data-adaptive prediction models, but still estimate the treatment effect properly.


### Steps
Fitting a TMLE can be summarized into a few simple steps:
1. Fit an outcome model $Q_0(A,X)$, estimating $E[Y|X,A]$ by predicting the outcome $Y$ using the covariates $X$ and the treatment $A$.  
   $Q_0$ can be a highly expressive method, and a common use is a "Super Learner", which is basically a stacking meta-learner using a broad library (pool) of base-estimators.  
   In `causallib`, this will be done by specifying a `Standardization` model with any kind of core estimator.
1. Fit a propensity model $g(A,X)$, estimating $\Pr[A|X]$ by predicting the treatment assignment $A$ using the covariates $X$.  
   In `causallib`, this will be done by specifying an `IPW` model with any kind of core estimator.  
   Note that `causallib` also allows this set of `X` to be different than the `X` used for the outcome model in step 1.
1. Generate a "clever covariate"* $H(A,X)$ using the propensity scores: $\frac{2A-1}{g(A,X)}$.
   Namely: take the inverse propensity scores of the treated units, and the negative inverse propensity scores of the controls.
1. Update the initial outcome prediction using treatment information from the "clever covariate": estimate an $\epsilon$ parameter such that:  
   $Q_*(A,X) = \text{expit}(\text{logit}(Q_0(A,X)) + \epsilon H(A,X))$  
   Namely, we update the initial $Q_0$ with some contribution of $H(A,X)$ estimated in logit space.  
   In `causallib`, this is estimated by applying a univariable logistic regression, regressing $Y$ on $H(A,X)$ with $\text{logit}(Q_0(A,X))$ as offset (i.e. forcing its coefficient to 1).  
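
Steps 3 and 4 can be sketched in plain Python as follows. This is not how `causallib` implements them: the simulated data, the crude per-arm initial model, the use of the true propensities in place of a fitted propensity model, and the hand-written Newton-Raphson loop are all assumptions made to keep the example self-contained.

```python
import math
import random

random.seed(0)


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical simulated data: one confounder X, binary treatment A, binary outcome Y.
n = 5000
X = [random.gauss(0, 1) for _ in range(n)]
g1 = [expit(0.8 * x) for x in X]  # Pr[A=1|X]; stands in for the fitted propensity model
A = [int(random.random() < p) for p in g1]
Y = [int(random.random() < expit(-0.5 + a + x)) for a, x in zip(A, X)]

# Step 1 stand-in: a deliberately crude Q_0 that ignores X (mean outcome per arm).
arm_mean = {a: sum(y for y, t in zip(Y, A) if t == a) / A.count(a) for a in (0, 1)}
Q0 = [arm_mean[a] for a in A]

# Step 3: clever covariate H(A,X) = (2A-1) / g(A,X), where g(A,X) = Pr[A=a|X].
H = [(2 * a - 1) / (p if a == 1 else 1.0 - p) for a, p in zip(A, g1)]

# Step 4: fit epsilon by maximizing the logistic log-likelihood of Y given H,
# with logit(Q_0) as a fixed offset and no intercept (1-D Newton-Raphson).
offset = [logit(q) for q in Q0]
epsilon = 0.0
for _ in range(50):
    Q = [expit(o + epsilon * h) for o, h in zip(offset, H)]
    score = sum(h * (y - q) for h, y, q in zip(H, Y, Q))        # first derivative
    information = sum(h * h * q * (1.0 - q) for h, q in zip(H, Q))  # minus second derivative
    step = score / information
    epsilon += step
    if abs(step) < 1e-10:
        break

# Updated (targeted) predictions Q_*(A,X).
Q_star = [expit(o + epsilon * h) for o, h in zip(offset, H)]
print(epsilon)
```

At convergence, the residuals of the updated predictions, weighted by the clever covariate, sum to (numerically) zero.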
   
The intuition behind step (4) is that we are basically regressing the residuals of the initial outcome prediction on the "clever covariate" (which carries the treatment mechanism information). 
If the initial prediction is perfect, then the residuals are just random noise unrelated to $H(A,X)$, so $\epsilon \approx 0$ and the update step contributes nothing.
However, in case there _is_ residual bias in the initial estimator, $\epsilon$ will control the magnitude of correction needed: a small residual bias will lead to a small update and vice versa. 
Moreover, "surprising" units, those with small $g(A,X)$, have a large $\frac{1}{g(A,X)}$, so even small changes in $\epsilon$ will have a bigger impact on their fit.  
This is also why we need to be extra careful to avoid overfitting the initial estimator $Q_0$, because overfitting will falsely shrink the signal in the residuals that the updating step needs. 
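
A short worked equation may make this concrete. It is a sketch that assumes $\epsilon$ is fitted by maximum likelihood in the offset logistic regression of step 4. The symbols $Q_\epsilon(A,X) = \text{expit}(\text{logit}(Q_0(A,X)) + \epsilon H(A,X))$ and $\ell(\epsilon)$ (the log-likelihood) are notation introduced here. The derivative of the log-likelihood with respect to $\epsilon$ is

$$
\frac{\partial \ell}{\partial \epsilon} = \sum_i H(A_i,X_i)\,\big(Y_i - Q_\epsilon(A_i,X_i)\big)
$$

At $\epsilon = 0$ this is the sum of the initial residuals $Y_i - Q_0(A_i,X_i)$, each weighted by its clever covariate. If those residuals are pure noise, the sum is close to zero, so the likelihood is already (almost) maximized at $\epsilon \approx 0$. Otherwise, $\epsilon$ moves until $\sum_i H(A_i,X_i)\,(Y_i - Q_*(A_i,X_i)) = 0$, and units with a large $|H(A_i,X_i)|$ carry the most weight in that equation.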

Note that there are several flavors of the "clever covariate"; `causallib` implements four of them, and they are described further down. 

#### Counterfactual prediction
For counterfactual prediction we assign a specific treatment value $a \in \{0,1\}$ and propagate it through the model's components:  
$Q_*(a,X) = \text{expit}(\text{logit}(Q_0(a,X)) + \epsilon H(a,X))$    
And then we can calculate any contrast of two intervention values to obtain an effect, like risk difference ($Q_*(1,X)-Q_*(0,X)$) or risk ratio ($\frac{Q_*(1,X)}{Q_*(0,X)}$).
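
As a rough illustration, the counterfactual predictions and both contrasts can be computed as below. The three units, their initial predictions, their propensities, and the value of $\epsilon$ are all made-up inputs, and averaging over units before taking the contrast is an assumption of this sketch (it yields population-level effects).

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def counterfactual(q0_a, g1, a, epsilon):
    """Q_*(a, X) for every unit, given Q_0(a, X) and Pr[A=1|X]."""
    updated = []
    for q, p in zip(q0_a, g1):
        g_a = p if a == 1 else 1.0 - p  # g(a, X) = Pr[A=a|X]
        h_a = (2 * a - 1) / g_a         # clever covariate H(a, X)
        updated.append(expit(logit(q) + epsilon * h_a))
    return updated


# Hypothetical inputs for three units; epsilon is assumed to come from the targeting step.
q0_treated = [0.30, 0.55, 0.80]  # Q_0(1, X)
q0_control = [0.20, 0.50, 0.70]  # Q_0(0, X)
g1 = [0.25, 0.50, 0.80]          # Pr[A=1|X]
epsilon = 0.02

q_star_1 = counterfactual(q0_treated, g1, 1, epsilon)
q_star_0 = counterfactual(q0_control, g1, 0, epsilon)

# Average the counterfactual predictions over units, then contrast them.
mean_1 = sum(q_star_1) / len(q_star_1)
mean_0 = sum(q_star_0) / len(q_star_0)
risk_difference = mean_1 - mean_0
risk_ratio = mean_1 / mean_0
print(risk_difference, risk_ratio)
```

With a positive $\epsilon$, the update pushes the predictions under treatment up and the predictions under control down, most strongly for units whose assigned value $a$ is unlikely given their $X$.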


### Doubly robust
TMLE combines an outcome model with a treatment model in a way that makes it doubly robust: we get two chances to get things right.  
As we have seen above, if we correctly specify the outcome model, then there's no signal left for correction by the treatment model.
Conversely, in cases where the initial model is strongly misspecified (think a simple `Y~A` regression), $\epsilon$ will be large and compensate for it, like an IPW model.
Therefore, the targeting step is a second chance to get things right.

### Conclusion
TMLE allows us to apply flexible machine learning estimators to high-dimensional complex data and still obtain valid causal effect estimates.


In [1]:
import pandas as pd
import numpy as np

from causallib.estimation import TMLE
from mlxtend.classifier import StackingCVClassifier

from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures



## DATA
Synthesize data, so we know the true effect.

MECHANISM

