#**Heterogeneous treatment effect estimation**

### Predictive Analytics Seminar SoSe 2020

<br>

Authors : Li Zhi (601150) & Malgorzata Paulina Olesiewicz (598939)

GitHub page: [https://github.com/Olesiewitch/Predictive_Analytics](https://github.com/Olesiewitch/Predictive_Analytics)

# **Table of contents**
1. [Introduction](#introduction) <br>
  1.1 [ CATE ](#cate) <br>
  1.2 [Uplift](#uplift)

2. [Direct and Indirect Models](#models) <br>
    2.1  [ Direct Models](#direct) <br>
    2.2 [ Indirect Models](#indirect)

3. [Python Packages](#packages) <br>
   3.1.[Comparative Review](#review) <br>
   3.2.[Discussion](#discussion) <br>

4. [Use cases](#implementation) <br>
   4.1 [Data & simulation method](#data) <br>
     4.1.1. [Data for the uplift modelling](#uplift_data) <br>
     4.1.2. [Simulation for the CATE estimation](#cate_data) <br>
   4.2 [Examples of the implementation](#examples) <br>
      4.2.1. [Uplift](#uplift_ex) <br>
      4.2.2. [CATE](#cate_ex) <br>
5. [Conclusion](#paragraph3)
6. [References](#paragraph4)





## **1. Introduction <a name="introduction"></a>**
The causal effect of a treatment varies among individuals, because people differ in their characteristics. Heterogeneous treatment effect estimation, which explores the patterns of these individual-level differences, has been gaining popularity in the past few years. With the increasing availability of data, it has found important applications in fields such as marketing campaigns, drug trials and political science.

In this seminar paper, we first present a review of the current state-of-the-art heterogeneous treatment effect estimation methods and organise them into two categories: direct and indirect methods. We then review and compare the latest Python libraries developed for causal modelling. For the application, we use an e-mail marketing dataset as well as a simulated dataset with selected libraries. Finally, the results are presented and discussed.


<br>

We use the following **notation** in the next sections:
* $T$ is the  treatment indicator, 1 for the treatment and 0 for the control group 
* $X$ and $Y$ are the sample data where: <br>
$X_{i} \in \mathbb{R}^{d}$ is d-dimensional covariate or feature vector for individual $i$ <br>
$Y_{i}$ is the outcome variable for the individual $i$ and $Y_{i}(T)$ is the outcome under group assignment $T$ 
* ${\mu}_{1}(x)$ is the estimated conditional outcome for the treatment group and ${\mu}_{0}(x)$ for the control group 
* ${τ_i}$ is the estimated treatment effect for individual $i$ 
* $\ e(X)=P(T=1 \mid X)$ is the estimated propensity score

<br>

####**1.1 Conditional Average Treatment Effect  <a name="cate"></a>**

Conditional Average Treatment Effect, or CATE, is the causal effect of a treatment on an outcome of interest for a sub-population with a particular set of features. In other words, it measures the treatment effect for a group within the population. It is of interest because effects are heterogeneous: a treatment can have a different effect on different groups of people.

It is particularly helpful in answering questions such as whether a drug works better on the young than on the elderly, or whether women respond more positively to ads than men. Therefore, there has been growing interest in estimating such causal effects in a multitude of domains, such as medical treatment, policymaking and marketing, to allow for better decision making.

Formally, CATE is defined as

$$\tau(x)=E[Y(1)-Y(0) \mid X=x]$$


 <br>
####**1.2 Uplift  <a name="uplift"></a>**
The second metric we will discuss here is the uplift, which predicts the incremental effect of a treatment on the outcome of an individual. It is often used in the context of binary outcomes such as a purchase, a website visit or employment status. Uplift modelling is often used as a tool to maximise the return from the treatment and to balance it against the treatment cost.

To find the optimal fraction of the population which should be targeted by the treatment, first, the lift for every observation in the sample needs to be calculated. Lift is defined as a change in the probability of a favoured outcome given treatment in comparison to no treatment. 

 <br>

$$Lift = P\ (outcome \ |\ treatment) - P\ (outcome \ | no\ treatment)$$
 
 <br>
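As a minimal illustration of the lift definition above, the sketch below computes the difference in favourable-outcome rates between a treated and a control group with NumPy. The function name `lift` and the toy arrays are made up for this example and assume a randomised treatment assignment.

```python
import numpy as np

def lift(outcome, treated):
    """Rate of the favourable outcome among treated minus the rate among controls."""
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    return outcome[treated].mean() - outcome[~treated].mean()

# Toy data (illustrative only): 1 = favourable outcome, e.g. a website visit
outcome = np.array([1, 0, 1, 1, 0, 0, 1, 0])
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0])

print(lift(outcome, treated))  # 0.75 - 0.25 = 0.5
```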

The sub-group of the population with the highest lift is often called the *persuadables* and is the most responsive to the treatment. By targeting only this group, one can maximise the return and minimise the cost, since for this group the treatment brings the desired change in the outcome. Targeting *sure things* or *lost causes* brings no change in the outcome while still incurring the cost of the treatment. Moreover, one wants to avoid targeting the *sleeping dogs*, as for them the treatment has the opposite of the intended effect. Consequently, the objective of uplift models is to identify the *persuadables*.

The summary of the different population groups can be found in the figure below. 

<br>
<center>

<img src="https://storage.googleapis.com/wf-blogs-engineering-media/2018/10/e45e2d97-confmatrix_alt-768x768.png"
     alt="Persuadables"
      width="250" height="250"/>


Figure 1 : Population subgroups with respect to their treatment status. <br>
<small> Graphic Source: https://tech.wayfair.com/ </small>

 </center>    

    
   

##**2. Direct and Indirect Models  <a name="models"></a>**
An extensive review of the state-of-the-art in heterogeneous treatment effects modelling is provided by several authors such as Devriendt (2018) and Künzel (2019).

Broadly, models used for the estimation of CATE and uplift can be divided into two categories: direct and indirect. The list of models reviewed in this article can be found in Table 1. Note that this list is not meant to be exhaustive, as this is still an area of constant development.

The direct approaches are mostly tree-based: CATE or uplift is calculated directly from the data, based on the observations in each leaf. The models in this category differ in how the splits are chosen to grow the tree and how the estimates of the individual trees are combined to calculate the final CATE or uplift.

The indirect models can be divided into metalearners and transformed outcome models. In both cases, the indirect model reformulates the uplift or CATE estimation into a different kind of problem(s) and uses existing machine learning algorithms to solve it (them).

Both groups of models are discussed in more detail in the following sections.

<center>

Table 1 : List of direct & indirect models

</center>


 <br>

|Direct Models | Indirect  Models |
| --- | --- |
| Information Theory Based Trees <br><br> Causal Conditional Inference Trees <br><br> Context treatment selection (CTS) <br><br> Generalized Random Forest | S-Learner <br><br> R-Learner <br><br> X-Learner <br><br> T-Learner <br><br> Transformed  <br> Outcome

###**2.1 Direct  Models  <a name="direct"></a>**
In the direct approach, the algorithm directly optimises for the uplift or CATE. In the case of tree-based models, the metric of interest is optimised at every split. The estimates of individual trees can be combined using ensemble methods to obtain a more stable prediction. Figure 2 shows a simple conceptual visualisation of a decision tree for the uplift problem.

<center>

<img src="https://storage.googleapis.com/wf-blogs-engineering-media/2019/10/fe4e20b1-header-image.png"
     alt="Uplift decision tree"
      width="525" height="300"/>


Figure 2 : A conceptual example of Uplift Decision Tree.   
<small> Graphic Source: https://tech.wayfair.com/ </small>
 </center>    
 
  <br>

Each split is chosen to optimise a pre-defined performance criterion, e.g. the divergence gain (see Table 2). Each node of the tree must contain observations from both (all) groups for the model to be able to calculate the estimated CATE/uplift at each step.
 
 <br>

Conceptual **pseudo-code** for the direct tree-based models: 

***1.*** Pre-define the performance criterion for the split <br> ***2.*** Grow the tree in a greedy fashion: <br> <br> ***a)*** Optimise each split using the pre-defined criterion <br> ***b)*** Allow a split only if both child nodes contain observations from the treatment and control groups <br> ***c)*** Stop if the stopping criterion has been reached <br> <br> ***3.*** Obtain the estimates for CATE or uplift within each leaf <br> <br> ***4.*** If using an ensemble method, combine the estimates from all the trees to obtain the final estimate.
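Steps 2a and 2b of the pseudo-code can be sketched for a single candidate split as follows. This is a simplified illustration for a binary outcome, using an unnormalised Kullback-Leibler divergence gain; the function names and the toy data are hypothetical, and the normalisation terms used in published uplift tree criteria are omitted.

```python
import numpy as np

def kl_binary(p, q, eps=1e-6):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def node_divergence(y, t):
    """Divergence between the treated and control outcome rates within one node."""
    return kl_binary(y[t == 1].mean(), y[t == 0].mean())

def divergence_gain(y, t, left_mask):
    """Unnormalised divergence gain of a split, or None if the split is not allowed."""
    children = [left_mask, ~left_mask]
    for mask in children:
        # Step 2b: each child must contain both treated and control observations
        if not ((t[mask] == 1).any() and (t[mask] == 0).any()):
            return None
    # Step 2a: size-weighted divergence of the children minus that of the parent
    weighted = sum(mask.mean() * node_divergence(y[mask], t[mask]) for mask in children)
    return weighted - node_divergence(y, t)

# Toy data (illustrative only): the treatment only helps when x > 0.5
y = np.array([0, 1, 0, 1, 1, 1, 0, 0])
t = np.array([1, 1, 0, 0, 1, 1, 0, 0])
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

gain = divergence_gain(y, t, left_mask=x <= 0.5)
print(gain)  # positive: the split separates responsive from unresponsive units
```

A greedy tree builder would evaluate this gain for every candidate feature and threshold, keep the best valid split, and recurse until the stopping criterion is met.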
  
  <br>

The summary of the most commonly used direct models for the heterogeneous treatment effect estimation can be found in Table 2.

<center> Table 2 : Comparative review of direct models <br>

Method | Authors | Method description | Advantages | Disadvantages 
 --- | --- | --- | --- | --- 
Information Theory <br> Based Trees   | Rzepakowski & <br> Jaroszewicz (2012) | The tree is grown by maximizing the distance in the <br> class distributions of the response. The splits are <br> evaluated based  on the **Divergence Gain**.  There are <br> three different distribution  divergence measures, <br>  which can be used in the model :<br> <br> - Kullback-Leibler <br> - Squared Euclidean <br> - Chi-squared |- Information Theory Based Trees are still<br> widely  used in practice <br> <br> - Their methodology was a stepping stone <br>  in the development of many subsequent <br>  methods | - An empty control group makes causal inference <br> impossible <br> <br> - Splitting conditions are independent of population size,<br> which may cause problems when dealing with real-world <br>data 
Causal Forest | Athey et al. (2015) | The tree is grown using the exact loss criterion and  <br> nearest neighbor  approximation. Within each leaf a <br>  **propensity score weighted two model**  approach is <br> used to obtain the treatment effect. The final <br> estimate is an average over all the trees. <br>| - Inference is straightforward due to the <br> honest  tree assumption  and normality of <br> Causal Forest | - Due to the nearest neighbor approach, individual treatment <br> effects are still unobtainable <br> <br> - Model performs well only in high dimensions
Context Treatment  <br> Selection (CTS)| Zhao et al. (2017)  | The model allows estimation of the **expected response <br> under each treatment**, which leads to optimal <br>  treatment selection. The tree is grown with each <br> split directly maximizing the expected response <br> (lift).  The model allows partition of the feature space.| - Allows modelling of multiple and continuous <br>  response treatments <br> <br> - Shows significant improvement in <br> performance  for symmetric data |
Generalized  <br> Random Forest | Wager & Athey (2018) | The tree is grown with splitting using a **gradient loss <br> criterion** and  pseudo-outcomes. **Honest problem- <br> specific splitting** improves the accuracy of the model <br> and ensures its unbiasedness. The estimated <br> treatment effect is derived using the estimating <br> equation weighting scheme.| - Model can better express treatment <br> heterogeneity than Causal Forest <br> <br> - Model can be used for the estimation of any <br> heterogeneous parameter, <br> such as in Quantile Regression. |  -  Biased confidence intervals <br> <br> - Model performs well only in high dimensions <br> <br> - Prediction unstable around the edge of the <br>parameter space


##**2.2 Indirect  Models  <a name="indirect"></a>**

In contrast to direct models, indirect models estimate treatment effects indirectly, using one or more models of the observed outcome, or via some transformation of the treatment and response classes. We will discuss two main classes of indirect approaches, namely the **Meta-learner family** (Künzel et al. 2019, Nie and Wager 2017) and the **Transformed outcome model** (Athey and Imbens 2015).

###**Metalearners**
A metalearner is a framework that builds on some base learner, such as Random forests, Bayesian additive regression trees, or neural networks to estimate the CATE.

* **T-learner** is the most standard model, often also called the two-model approach in the literature. It predicts the target outcome with and without treatment, and takes the difference between the predicted outcomes to derive the impact of the treatment. It follows two stages: 

Stage 1: Estimate the conditional mean outcomes for the control and the treatment group:
$$\begin{array}{l}
\mu_{0}(x)=E[Y(0) \mid X=x] \\
\mu_{1}(x)=E[Y(1) \mid X=x]
\end{array}$$
Stage 2: Take the difference in conditional means to calculate the CATE estimate as:
$${\tau}(x)={\mu}_{1}(x)-{\mu}_{0}(x)$$

The T-learner is straightforward and easy to implement, and it tends to be more effective when the treatment effect is very complicated, i.e. when there are not many common trends between the control and the treatment group. The downside is that both models need to be accurate for the estimate to be reliable, since their errors add up in the difference.
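The two stages of the T-learner can be sketched as follows. This is a minimal illustration, not the implementation of any of the reviewed packages: the base learner is ordinary least squares via NumPy (`fit_linear` is a hypothetical helper), and the data are simulated with a made-up linear treatment effect so that the base learner is correctly specified.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def t_learner(X, y, t, fit=fit_linear):
    """Fit one outcome model per group and return the CATE estimator."""
    mu0 = fit(X[t == 0], y[t == 0])  # stage 1: control outcome model
    mu1 = fit(X[t == 1], y[t == 1])  # stage 1: treatment outcome model
    return lambda X_new: mu1(X_new) - mu0(X_new)  # stage 2: difference

# Simulated data (illustrative only): tau(x) = 1 + 2 * x_0
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
tau_true = 1.0 + 2.0 * X[:, 0]
y = X[:, 1] + t * tau_true + rng.normal(scale=0.1, size=n)

tau_hat = t_learner(X, y, t)(X)
print(np.abs(tau_hat - tau_true).mean())  # mean absolute error of the CATE estimate
```

Any regressor exposing a fit and predict step could replace `fit_linear`; the structure of the two stages stays the same.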

* **S-learner** estimates the treatment effect using a single machine learning model, including the treatment indicator as a covariate into the model. We could estimate the average treatment effect on the population through standard regression analysis. The formula of calculating CATE is  

$${\tau}(x)={\mu}(x, T=1)-{\mu}(x, T=0)$$

Intuitively, this single estimator avoids the amplified error problem of the T-learner, since there is only one model to consider. Empirically, it is suitable when the treatment effect is simple, in which case treating the treatment indicator like any other covariate has proved to be more effective.

As pointed out by Künzel et al. (2019), however, the S-learner estimate tends to be biased towards 0. For example, when base algorithms such as the Lasso or random forests are used, the treatment indicator can simply be ignored by not selecting or not splitting on it. In that case, the CATE estimate would be 0.
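A minimal sketch of the S-learner is given below, again with an ordinary least squares base learner as a stand-in (`fit_linear` is a hypothetical helper) and made-up simulated data with a constant treatment effect of 2.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def s_learner(X, y, t, fit=fit_linear):
    """Fit a single model on (X, T) and contrast its predictions at T=1 and T=0."""
    mu = fit(np.column_stack([X, t]), y)

    def cate(X_new):
        ones = np.ones(len(X_new))
        zeros = np.zeros(len(X_new))
        return mu(np.column_stack([X_new, ones])) - mu(np.column_stack([X_new, zeros]))

    return cate

# Simulated data (illustrative only): constant treatment effect of 2
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)

tau_hat = s_learner(X, y, t)(X)
print(tau_hat.mean())  # close to the simulated effect of 2
```

With an additive linear base learner, the estimate is the same for every individual (the coefficient of the treatment indicator), so heterogeneity can only be captured with a more flexible base learner. With such learners, the shrinkage towards 0 described above can occur.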

* **X-learner** is an extension of the T-learner. It uses imputed treatment effects and a weighted average to estimate CATE. 
The three stages are as follows:    

*Stage 1*: 
Estimate the conditional mean outcomes just as in the T-learner:
$$\begin{array}{l}
\mu_{0}(x)=E[Y(0) \mid X=x] \\
\mu_{1}(x)=E[Y(1) \mid X=x]
\end{array}$$

*Stage 2*: Impute the treatment effects $D_{i}^{1}$ for each individual $i$ in the treatment group based on $\mu_{0}(x)$, and $D_{j}^{0}$ for each individual $j$ in the control group based on $\mu_{1}(x)$ (we write $D$ for the imputed effects to avoid confusion with the treatment indicator $T$):

$$D_{i}^{1}=Y_{i}(1)-\mu_{0}(X_{i}), \quad \text{and} \quad D_{j}^{0}=\mu_{1}(X_{j})-Y_{j}(0)$$
then construct a treatment effect estimate for the treatment and control group separately: $$\tau_{1}(x)=E\left[D^{1} \mid X=x\right], \text { and } \tau_{0}(x)=E\left[D^{0} \mid X=x\right]$$ 

*Stage 3*:  
Combine the treatment effect estimates from both models using a weighted average: 


$$ \quad {\tau}(x)=g(x) {\tau}_{0}(x)+(1-g(x)){\tau}_{1}(x)$$
where $g(x) \in [0,1]$ is a weighting function, often chosen to be the estimated propensity score $e(x)$. 



The X-learner has advantages when structural assumptions can be made about the CATE (its sparsity or smoothness, for example). Moreover, it performs more effectively in settings where the treatment and control groups differ in size and we want to emphasise the conditional mean model estimated on the larger group.
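The three stages of the X-learner can be sketched as below. As in a stand-in setup, the base learner is ordinary least squares (`fit_linear` is a hypothetical helper), the propensity score is assumed to be known and constant, and the simulated data are made up, with only about 10% of the units treated.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def x_learner(X, y, t, propensity, fit=fit_linear):
    X0, y0 = X[t == 0], y[t == 0]
    X1, y1 = X[t == 1], y[t == 1]
    # Stage 1: outcome models for the control and the treatment group
    mu0, mu1 = fit(X0, y0), fit(X1, y1)
    # Stage 2: imputed effects, then one CATE model per group
    d1 = y1 - mu0(X1)  # treated: observed outcome minus predicted control outcome
    d0 = mu1(X0) - y0  # control: predicted treated outcome minus observed outcome
    tau1, tau0 = fit(X1, d1), fit(X0, d0)

    # Stage 3: weighted average, with the propensity score as the weight g(x)
    def cate(X_new):
        g = propensity(X_new)
        return g * tau0(X_new) + (1 - g) * tau1(X_new)

    return cate

# Simulated data (illustrative only): unbalanced groups, tau(x) = 1 + 2 * x_0
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
t = (rng.random(n) < 0.1).astype(int)
tau_true = 1.0 + 2.0 * X[:, 0]
y = X[:, 1] + t * tau_true + rng.normal(scale=0.1, size=n)

known_propensity = lambda X_new: np.full(len(X_new), 0.1)  # assumption: known e(x)
tau_hat = x_learner(X, y, t, known_propensity)(X)
print(np.abs(tau_hat - tau_true).mean())  # mean absolute error of the CATE estimate
```

With a small propensity score, most of the weight goes to $\tau_1$, which relies on the outcome model $\mu_0$ fitted on the large control group.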

* **R-learner** uses cross-validated out-of-fold estimates of the outcomes and propensity scores. As explained in Nie and Wager (2017), the CATE estimation problem is recast as minimising the R-loss:
$$\underset{\tau}{\arg \min } \frac{1}{n} \sum_{i}\left(\left(Y_{i}-E\left[Y \mid X_{i}\right]\right)-\left(T_{i}-e\left(X_{i}\right)\right) \tau\left(X_{i}\right)\right)^{2}$$  
where $E[Y \mid X_{i}]$ and the propensity score $e(X_{i})$ are replaced by their out-of-fold estimates.
  
The R-learner is flexible and easy to use: any loss-minimisation method can be used in both steps (estimating the outcome and propensity models, and minimising the R-loss), and these methods can be tuned by cross-validation.  
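The sketch below illustrates the R-learner under strong simplifying assumptions that are not part of the original method description: the CATE is modelled as linear in the covariates, the conditional mean outcome is estimated by ordinary least squares with 2-fold cross-fitting, and the propensity score is taken as the out-of-fold treated share (i.e. a randomised experiment). All function names and the simulated data are hypothetical.

```python
import numpy as np

def ols(A, y):
    """Least-squares coefficients for design matrix A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def r_learner_linear(X, y, t, n_folds=2, seed=0):
    """R-learner with a linear CATE model tau(x) = [1, x] @ beta."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    A = add_intercept(X)
    m_hat = np.empty(n)  # out-of-fold estimate of E[Y | X]
    e_hat = np.empty(n)  # out-of-fold estimate of the propensity score
    for k in range(n_folds):
        train, test = folds != k, folds == k
        m_hat[test] = A[test] @ ols(A[train], y[train])
        e_hat[test] = t[train].mean()  # assumption: constant propensity
    y_res = y - m_hat
    t_res = t - e_hat
    # Minimising sum((y_res - t_res * (A @ beta))**2) is OLS of y_res on t_res * A
    beta = ols(t_res[:, None] * A, y_res)
    return lambda X_new: add_intercept(X_new) @ beta

# Simulated data (illustrative only): tau(x) = 1 + 2 * x_0
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
tau_true = 1.0 + 2.0 * X[:, 0]
y = X[:, 1] + t * tau_true + rng.normal(scale=0.1, size=n)

tau_hat = r_learner_linear(X, y, t)(X)
print(np.abs(tau_hat - tau_true).mean())  # mean absolute error of the CATE estimate
```

In practice, the least-squares steps would be replaced by flexible learners, and a regularisation term on $\tau$ is usually added to the loss.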


###**Transformed Outcome**
<br>
Athey's and Imbens' (2015) Transformed Outcome model is another indirect approach used for the calculation of the uplift. 

The idea is quite simple: the outcome labels are transformed in such a way that taking the expectation of the outcome is equivalent to estimating the uplift. 

Transformation of the outcome is conducted using the following formula:
\begin{equation}
    Y_{i}^{*}=Y_{i} \frac{T_{i}- e(X_{i})}{ e(X_{i})(1- e(X_{i}))}
\end{equation}

where $Y_{i}^{*}$ is the Transformed Outcome.

For $e(X)=0.5$, the positive outcome 1 (e.g. buy) is transformed to -2 for those in the control group and to 2 for those in the treatment group. The negative outcome label 0 remains 0 in both groups.

The main advantage of this transformation is that taking the average of the transformed outcomes within the sample is equivalent to estimating the lift. 

This can be illustrated by an example with a sample size of $2n$, where half of the sample is randomly assigned to the treatment ($T=1$) and the other half to the control ($T=0$) group, so that $e(X)=0.5$. Let observations $i=1,\dots,n$ be the treated ones and $i=n+1,\dots,2n$ the controls. For each observation, the original and transformed outcomes are denoted by $Y_{i}$ and $Y^{*}_{i}$, respectively. Consequently, we can write down the lift equation as: 

  \begin{align*} \text{Lift} &= E[Y|T=1] - E[Y|T=0] = \frac{1}{n} \sum_{i=1}^n Y_i - \frac{1}{n}\sum_{i=n+1}^{2n} Y_i = \frac{1}{2n} \left[ \sum_{i=1}^{n} 2Y_i - \sum_{i=n+1}^{2n} 2 Y_i \right] = \frac{1}{2n} \sum_{i=1}^{2n} Y^{*}_{i} = E[Y^{*}]. \end{align*}

Furthermore, conditioning on $x$ allows us to obtain the lift for the relevant subsample. This model has been implemented in the Python package ***Pylift*** by the company Wayfair. More information regarding the technical specifications of the implementation can be found on their [blog](https://tech.wayfair.com/data-science/2018/10/pylift-a-fast-python-package-for-uplift-modeling/).
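The equivalence between the mean of the transformed outcome and the lift can be checked numerically. The sketch below uses a hypothetical helper `transform_outcome` and made-up toy data with a known propensity score of 0.5; it is not the Pylift implementation.

```python
import numpy as np

def transform_outcome(y, t, e):
    """Transformed outcome Y* = Y * (T - e) / (e * (1 - e))."""
    return y * (t - e) / (e * (1 - e))

# Toy data (illustrative only): first half treated, second half control
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])

y_star = transform_outcome(y, t, e=0.5)
lift = y[t == 1].mean() - y[t == 0].mean()

# Treated buyers map to 2, control buyers to -2, non-buyers to 0
print(y_star)
print(y_star.mean(), lift)  # both equal 0.5
```

An uplift model is then obtained by regressing `y_star` on the covariates with any standard regressor.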

#**3. Application with Python Packages <a name="packages"></a>**

##**3.1 Comparative review  <a name="review"></a>**
There are several Python packages  available for causal modeling and we have reviewed the following:
* [CausalML](https://github.com/uber/causalml) 
* [PyLift](https://github.com/wayfair/pylift/)
* [DoWhy](https://github.com/microsoft/dowhy) 
* [EconML](https://www.microsoft.com/en-us/research/project/alice/) 
* [CausalLift](https://github.com/Minyus/causallift) 
* [PyUplift](https://github.com/duketemon/pyuplift) 

To compare relevant packages, the following aspects have been considered: ***functionality, interpretability of the results*** and ***user-friendliness***. 

All the technical aspects of the packages, such as available algorithms, flexibility and additional features, were considered under functionality. When reviewing interpretability, we focused on the evaluation methods and results available by default. Finally, user-friendliness was evaluated based on documentation, interface and the level of additional expertise required to use the package. A summary of the features of the above-mentioned packages can be found in Table 3. The advantages and disadvantages of each package are discussed in section 3.2. 
 
<br/>


<center>Table 3 : Python packages for causal modelling: comparison of the features

  <br/><br/>
  
Feature \ Package | CausalML |PyLift | DoWhy |EconML |CausalLift|PyUplift|
 --- | :-:| :-: | :-: | :-:| :-: | :-:|
 **FUNCTIONALITY** |||||| |
  **Heterogeneous treatment effects  applications:** ||||||
CATE |||✓|✓|✓| 
Uplift|✓|✓|||✓|✓
 **Types of treatment supported:**||||||
Single |✓|✓|✓|✓|✓|✓|
Multiple|✓|||✓||
Binary|✓|✓|✓|✓|✓|✓
Continuous||||✓||
**Implemented Models:**||||||
*Indirect algorithms:* ||||||
S-learner |✓|||✓||
T-learner|✓|||✓||
X-learner|✓|||✓||
R-learner|✓|||✓||
*Direct algorithms:* ||||||
Information Theory Based Trees |✓|||✓||
Orthogonal Random Forest||||✓||
CTS |✓|||||
Transformed Outcome Model||✓||||
*Other algorithms:* ||||||
Propensity-based Stratification|||✓|||
Propensity Score Matching|||✓|||
Inverse Propensity Weighting |||✓|||
Regression |||✓||✓|✓
Two model approach|||||✓|
**Other features:** ||||||
Direct connection to data API||||||✓
**INTERPRETABILITY** |||||
*Evaluation methods:* ||||||
Qini Curve|✓|✓||✓||
Conversion rate|||||✓|
Feature importance|✓|✓||||
Tree visualisation |✓|||✓||
Causal graphs |||✓|||
**USER FRIENDLINESS** ||||||
Documentation is completed |✓|✓|✓|✓||
Tutorial notebooks are available |✓|✓|✓|✓|✓|
Specific version of supporting packages required  |✓|||||
Package is beginner friendly ||✓|✓||✓|✓

##**3.2 Discussion  <a name="discussion"></a>**

As can be seen from Table 3, the functionality and potential applications vary among the packages significantly. 

Two of the most advanced packages, CausalML and EconML, were developed by Uber and Microsoft respectively. DoWhy is also a product of Microsoft; however, it focuses more on older models, with an emphasis on instrumental variables (IV). It is important to note that the resources available to the research teams at these large companies far exceed the market standard. Therefore, comparisons between these well-developed packages and the ones developed by individuals may not be entirely fair.

***CausalML*** and ***EconML*** implement most of the models discussed in section 2 and allow for the simultaneous analysis of multiple treatments. The tree visualisation and the policy interpreter in EconML allow for an intuitive presentation of the results. However, both packages are very advanced and require additional statistical expertise to navigate. Moreover, EconML requires TensorFlow version 1.*, which is no longer available for Python 3, and therefore we were not able to run the package using a desktop code editor. This also causes issues in Colab when it is used together with other packages that require a later version of TensorFlow. Finally, the visualisation tools should be extended to the metalearners.

***DoWhy's*** biggest advantage is the implementation of causal graphs for the treatment models. It allows for a clear interpretation of the causal effect and helps prevent a false model structure, e.g. conditioning on colliders. It supports linear regression models for CATE estimation. In practice, it is recommended to combine it with **EconML** for more advanced estimation methods, while making use of its 'Refutation' methods to test the robustness of the estimate.
Depending on the model one wants to use, the package can be a great fit for CATE estimation, but it is not suitable for uplift modelling. 


***Pylift*** was developed by Wayfair, an American e-commerce company. Among all the packages we inspected, it is the only one that implements the transformed outcome method proposed by Athey and Imbens (2015). It is an easy and intuitive package for someone new to uplift modelling. However, it is rather a 'black box', with limited results available to the user. Moreover, only the default learner worked well: during our experiments, switching to any other learner caused the algorithm to break. The documentation is rather limited, which makes it difficult to understand the source of such errors.  

Finally, ***CausalLift*** and ***PyUplift*** are two packages developed by individual researchers. They are rather limited in functionality. CausalLift outputs a clean table that shows the estimated CATE for each observation, which is intuitive to interpret. PyUplift offers a direct connection to an API with datasets used for uplift modelling, such as Hillstrom Email Marketing and Criteo. Unfortunately, both packages have very limited documentation. Moreover, PyUplift only returns the predicted values for the test set; no sorting or selection of the optimal target subgroup has been implemented. Therefore, we would not recommend using these packages, and we decided not to include them in the use case examples.

#**4. Use cases  <a name="implementation"></a>**



##**4.1 Data & simulation methods  <a name="data"></a>**  

###**4.1.1. Data for the uplift modelling  <a name= "uplift_data"></a>**
 
For the uplift modelling application with the **CausalML** and **PyLift** packages we used one of the classic causal inference example datasets, the Hillstrom e-mail marketing data from [Kevin Hillstrom’s MineThatData](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html) blog. It contains 64,000 customers who made their last purchase within the past 12 months. In the marketing campaign test, about 1/3 of the customers received an e-mail featuring Women's merchandise, 1/3 received an e-mail featuring Men's merchandise, and the remaining 1/3 received nothing. The campaign results were recorded in three ways: whether the customer visited the website in the following two weeks, whether they made a purchase, and how much money they spent.   

The dataset contains three outcome variables, **Visit, Conversion and Spend**, as explained above, and the following covariates:
* Recency: indicates the number of months since the last purchase
* History: the actual dollar value spent in the past year
* Men, Women: {0,1} indicates if the customer purchased Men's or Women's merchandise in the past year separately
* Zip-code: indicates if the customers are from Urban, Suburban, or Rural. 
* Newbie: whether it is a new customer in the past 12 months, and 
* Channel: describes through which channel the customers have purchased from in the past year.   


###**4.1.2 Simulation for the CATE estimation <a name="cate_data"></a>**
In real-world data, we never observe both potential outcomes for an observation; consequently, the accuracy of the estimated treatment effects can only be evaluated on simulated data. 

To implement CATE estimation using different metalearners, we simulated the data using the data generating process (DGP) suggested by Künzel et al. (2019) in their Simulation 1. The DGP specification is given as: 

 <br>

\begin{equation}
\begin{aligned}
 Y_{i} = Y_i(1) \cdot T + Y_i(0) \cdot (1-T) + \epsilon \\
T \sim Bern(e(x)), \; e(x) = P(T=1|X=x) 
\end{aligned}
\end{equation}
 
 <br>

The potential outcomes without the treatment ($Y_i(0)$) and with the treatment ($Y_i(1)$) are described by the following equations: 
 <br>

\begin{equation}
\begin{aligned}
Y_i(0) &=X_{i}^{\top} \beta+5 \mathbb{I}\left(x_{i1}>0.5\right), \quad \text { with } \quad \beta \sim \operatorname{Unif}\left([-5,5]^{20}\right)  \text{ and } X_i  \sim \operatorname{N}(0, \Sigma) \\
Y_i(1) &=Y_i(0)+8 \mathbb{I}\left(x_{i2}>0.1\right)
\end{aligned}
\end{equation}

 <br>

In order to generate data, we followed EconML's [example notebook](https://github.com/microsoft/EconML/blob/master/notebooks/Metalearners%20Examples.ipynb).
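For orientation, the DGP above can be sketched in NumPy as follows. This is not the code of the EconML notebook: the function name `simulate`, the identity covariance matrix, the constant propensity score of 0.5 and the standard normal noise are all assumptions made for this illustration.

```python
import numpy as np

def simulate(n, d=20, propensity=0.5, noise_sd=1.0, seed=0):
    """Draw (X, T, Y) and the true CATE from the DGP described above."""
    rng = np.random.default_rng(seed)
    beta = rng.uniform(-5, 5, size=d)
    X = rng.normal(size=(n, d))              # assumption: Sigma is the identity
    tau = 8.0 * (X[:, 1] > 0.1)              # true treatment effect, 0 or 8
    y0 = X @ beta + 5.0 * (X[:, 0] > 0.5)    # potential outcome without treatment
    y1 = y0 + tau                            # potential outcome with treatment
    t = rng.binomial(1, propensity, size=n)  # assumption: constant e(x)
    y = y1 * t + y0 * (1 - t) + rng.normal(scale=noise_sd, size=n)
    return X, t, y, tau

X, t, y, tau = simulate(n=1000)
print(X.shape, t.mean(), tau.mean())
```

Because the true effect `tau` is returned alongside the observed data, the error of any CATE estimator can be computed directly on the simulated sample.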





##**4.2 Examples of the implementation  <a name="examples"></a>**


**4.2.1 Uplift**  

**Segment distributions**    

First, we looked at how the variable 'Segment' (the treatment assignment) is linked to the target variable and the other feature variables. From the distribution plots below, it is clear that customers in the Men's campaign, Women's campaign and no-campaign groups were chosen at random. The distribution of features across the groups is almost identical, which suggests that there is no selection bias.   
As explained in the data overview, there are three outcome variables: **Visit, Conversion, Spend**. The actual conversion rate, however, is less than 1%; therefore, for the purpose of studying the causal effect, we focus on **'Visit'** as our target outcome variable. That is, if the customer visited the website within two weeks of receiving the e-mail, we consider the campaign to have worked. 

<img src="http://drive.google.com/uc?export=view&id=1wOUeszzY3zgj1BeY4wcY46EcZMxjdhxS
"
alt="Feature distributions by treatment group"
      width="1500" height="700"/>
<center>

Figure 3 : Features distribution among Treatment assignment groups
 </center> 



**DoWhy Causal Graph** 

To further investigate the causal relationships between the covariates, the treatment and the outcome, we used DoWhy to produce the causal graph of the model. As shown in Figure 4, there is a causal relationship between all the covariates and the outcome, and some of the covariates also affect the treatment. Therefore, conditioning on those covariates would satisfy the conditional independence assumption. However, according to the graph, there seems to be a causal relationship between unobserved covariates and the treatment. Therefore, even though the covariate distributions seem to be the same in all of the groups, indicating no selection bias, the causal graph suggests otherwise. The potential selection bias in the sample should be investigated further.

 <br>

<img src="http://drive.google.com/uc?export=view&id=1oK33WaO_zl624BsygSiK0aZwNKk5y5oC
"
alt="Causal Graph"
      width="2500" height="120"/>


<center>

Figure 4: Part of the causal graph for the uplift model.* <br>
<small> *Due to the large number of covariates, the graph is unreadable when added to the notebook. The full graph can be found on the project's Github. </small>
 </center> 

 <br>
 **CausalML and PyLift Qini Curve** 

CausalML supports multiple treatments. Since the Hillstrom dataset has two treatment groups, the uplift model predicts an uplift for each treatment. The table below shows the model's prediction output, where the first individual is predicted to have a higher uplift from the Men's campaign than from the Women's campaign.

<center> Table 4 : Uplift model predictions

<img src="http://drive.google.com/uc?export=view&id=1_LBFawfTnPpCM4etMzK-7AeWfWJUXp5W
"
alt="Uplift model predict"
      width="250" height="300"/>  

</center>

In the case of multiple treatments, we could either draw a separate uplift curve for each treatment or combine the treated subjects into one group and compare their outcomes per ranked quantile to the control group. We applied the second option in **CausalML**. That is, if a customer's estimated uplift is negative for both treatments, we assign this person to the control group. Otherwise, we assign him/her to the treatment group with the greater estimated uplift. For example, the first individual is assigned to the Men's campaign group, since the estimated uplift is higher under that treatment. 

Finally, with the newly assigned treatment groups, the uplift curve looks like this:

<center>
      
<img src="http://drive.google.com/uc?export=view&id=1iYtwpWWod3vbF5yXCFdN4ce3u7ni9FB7
"
alt="CausalML Upliftcurve"
      width="500" height="400"/>  
  


Figure 5 : CausalML UpliftCurve
 </center> 

The red line is the benchmark, which represents the result of random assignment, and the blue line shows the result when individuals are assigned to the treatment group for which the maximum 'lift' can be expected. 
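The assignment rule used for the multiple-treatment case (control if the estimated uplift is negative for every treatment, otherwise the treatment with the largest estimated uplift) can be sketched as follows. The function name, the treatment labels and the uplift values are made up for illustration and are not the output of CausalML.

```python
import numpy as np

def recommend_treatment(uplift, treatment_names, control_name="control"):
    """Pick the treatment with the largest uplift, or control if all are negative."""
    uplift = np.asarray(uplift, dtype=float)
    best = uplift.argmax(axis=1)        # column index of the best treatment
    best_uplift = uplift.max(axis=1)
    names = np.array(treatment_names, dtype=object)[best]
    names[best_uplift < 0] = control_name
    return names

# Hypothetical predicted uplifts; columns: Men's campaign, Women's campaign
predicted = [[0.05, 0.02],
             [-0.01, -0.03],
             [0.01, 0.04]]
groups = recommend_treatment(predicted, ["mens_email", "womens_email"])
print(list(groups))  # ['mens_email', 'control', 'womens_email']
```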

The next two figures were produced using the **pylift** package. The package does not support multiple treatments, therefore we present the results for the two campaigns separately. 

After the outcomes are transformed using `TransformedOutcome()`, we used the default regressor `XGBRegressor` with 5-fold cross-validation. 
The Qini curves presented below show the estimates for each of the folds as faded lines and the final averaged estimate as a black line. The 5-fold estimation was also used to estimate a confidence interval of the prediction, which is marked by the purple area. 

As can be seen, the men campaign shows almost no uplift gain, whereas the results for the women campaign are slightly better. 
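For readers unfamiliar with Qini curves, the sketch below shows one common unnormalised definition: individuals are ranked by predicted uplift, and for the top $k$ individuals the curve reports the number of positive outcomes among the treated minus the number among the controls, rescaled by the ratio of group sizes. This is an illustration on made-up data; the curves produced by pylift and CausalML may be normalised or computed differently.

```python
import numpy as np

def qini_curve(y, t, score):
    """Unnormalised Qini values after targeting the top-1, top-2, ... individuals."""
    order = np.argsort(-np.asarray(score), kind="stable")  # highest score first
    y, t = np.asarray(y)[order], np.asarray(t)[order]
    n_t = np.cumsum(t)            # treated individuals among the top k
    n_c = np.cumsum(1 - t)        # control individuals among the top k
    r_t = np.cumsum(y * t)        # positive outcomes among the treated
    r_c = np.cumsum(y * (1 - t))  # positive outcomes among the controls
    # Rescale the control responses to the treated group size (0 if no controls yet)
    ratio = np.divide(n_t, n_c, out=np.zeros(len(y)), where=n_c > 0)
    return r_t - r_c * ratio

# Toy data (illustrative only) with hypothetical predicted uplift scores
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])
score = np.array([0.9, 0.1, 0.8, 0.7, 0.6, 0.2, 0.3, 0.4])

q = qini_curve(y, t, score)
print(q)
```

The last value of the curve does not depend on the ranking; random targeting corresponds to a straight line from 0 to that value, and a useful model produces a curve above this line.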

<center>
      
<img src="http://drive.google.com/uc?export=view&id=1wYeTcGNvOeHQHibkmWjYQTo1iQffR0FJ
"
alt="Pylift uplift gain chart, men campaign"
      width="500" height="400"/>  





Figure 6 : Pylift Uplift gain chart for men campaign
 </center> 

<center>
      
<img src="http://drive.google.com/uc?export=view&id=1kKlMDwaufXlPUQJw35uK8Xa58mibiL1e
"
alt="Pylift uplift gain chart, women campaign"
      width="500" height="400"/>  


Figure 7: Pylift Uplift gain chart for women campaign
 </center> 
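
To make the construction of such curves concrete, the following sketch computes an unnormalized Qini curve: individuals are ranked by predicted uplift, and at each cutoff the number of control responders (rescaled to the treated group size) is subtracted from the number of treated responders. This is one common definition and not pylift's implementation, which applies its own normalization; the function name and toy data are assumptions.

```python
def qini_curve(scores, outcomes, treated):
    """Cumulative Qini values after targeting the top-k scored individuals.

    Q(k) = R_t(k) - R_c(k) * N_t(k) / N_c(k), where R is the number of
    responders and N the group size among the k highest-scored individuals.
    While no control unit has been seen yet, Q(k) is set to R_t(k).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_t = n_c = r_t = r_c = 0
    curve = []
    for i in order:
        if treated[i]:
            n_t += 1
            r_t += outcomes[i]
        else:
            n_c += 1
            r_c += outcomes[i]
        # Rescale control responders to the treated group size seen so far.
        curve.append(r_t - r_c * n_t / n_c if n_c else float(r_t))
    return curve


# Toy data (illustrative): predicted uplift scores, outcomes and treatment flags.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcomes = [1, 0, 1, 0, 0, 1, 0, 1]

curve = qini_curve(scores, outcomes, treated)
print([round(q, 2) for q in curve])
```

In this toy sample the overall response rates of the two groups are equal, so the curve ends at zero, while a good ranking still lifts it above zero for the highest-scored individuals.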


 <br>


**NIV Pylift** 

To evaluate which covariates have the biggest impact on the outcome, we calculated the Net Information Value (NIV) for each campaign using the pylift package. Weight of evidence (WOE) describes the relationship between a predictive variable and a binary target outcome, and NIV summarises how strongly that relationship differs between the treatment and control groups. Generally, if IV is smaller than 0.05, the variable has a very weak relationship with the outcome and adds little predictive power to the model. A detailed description of the WOE and NIV implementation can be found in the package [documentation](https://pylift.readthedocs.io/en/latest/eda.html). We present the top 5 variables with the strongest relationship with the outcome for both campaigns.

<img src="http://drive.google.com/uc?export=view&id=1tx-9vZQfX0HOMHxDAvgePgMxjFJu4FFt
"
alt="Woman NIV"
      width="900" height="300"/>  


<center>

Figure 8 : Top 5 NIV values for the Women campaign.
 </center> 

For the women campaign, whether someone purchased Men's (`men`) or Women's (`women`) merchandise in the last year has the highest predictive power for the outcome. For the men campaign, rural residency and whether someone is a new customer seem to have the biggest impact on the outcome.

<img src="http://drive.google.com/uc?export=view&id=1DknPKYmHc5F2zyXRX22RMm5uYfyryFtk
"
alt="Men NIV"
      width="900" height="300"/>  


<center>

Figure 9 : Top 5 NIV values for the Men campaign.
 </center> 
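
To make the WOE and NIV quantities used in this subsection concrete, the sketch below computes the NIV of a single categorical variable from scratch. It follows one common definition, in which the net weight of evidence of a bin is the WOE in the treatment group minus the WOE in the control group. Scaling and smoothing conventions in pylift may differ, and the helper names and toy counts are assumptions. The sketch also assumes that every bin occurs in every treatment/outcome cell, since no smoothing is applied.

```python
import math
from collections import Counter


def bin_shares(values, outcomes, treated, group, outcome):
    """Share of each bin among rows with the given treatment group and outcome."""
    picked = [v for v, y, w in zip(values, outcomes, treated)
              if w == group and y == outcome]
    counts = Counter(picked)
    return {b: c / len(picked) for b, c in counts.items()}


def net_information_value(values, outcomes, treated):
    """NIV = sum over bins of (pos - neg) * ln(pos / neg), where
    pos = P(x | y=1, treated) * P(x | y=0, control) and
    neg = P(x | y=0, treated) * P(x | y=1, control).
    ln(pos / neg) is the net weight of evidence (WOE_treated - WOE_control).
    """
    p1t = bin_shares(values, outcomes, treated, 1, 1)
    p0t = bin_shares(values, outcomes, treated, 1, 0)
    p1c = bin_shares(values, outcomes, treated, 0, 1)
    p0c = bin_shares(values, outcomes, treated, 0, 0)
    niv = 0.0
    for x in set(values):
        pos = p1t[x] * p0c[x]
        neg = p0t[x] * p1c[x]
        niv += (pos - neg) * math.log(pos / neg)
    return niv


def rows(bin_label, group, n_resp, n_non):
    """Build (bin, outcome, treated) rows for a toy contingency table."""
    return [(bin_label, 1, group)] * n_resp + [(bin_label, 0, group)] * n_non


# Toy counts (illustrative): the bin only matters in the treatment group.
data = (rows("new", 1, 30, 20) + rows("old", 1, 10, 40)
        + rows("new", 0, 10, 40) + rows("old", 0, 10, 40))
values, outcomes, treated = zip(*data)
niv = net_information_value(values, outcomes, treated)
print(round(niv, 4))
```

Each summand has the form $(a - b)\ln(a/b)$ and is therefore non-negative, so a variable whose relationship with the outcome is identical in both groups receives an NIV of zero.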
 


 <br>

**Tree from CausalML**   

We applied CausalML to demonstrate the implementation of tree-based methods. We used the `UpliftRandomForestClassifier`, which is based on the ensemble methods for uplift modelling by Sołtys et al. (2015); its base learner, `UpliftTreeClassifier`, is based on the decision trees for uplift modelling with single and multiple treatments by Rzepakowski and Jaroszewicz (2012).

CausalML offers a visualization tool for uplift trees that improves model interpretability. The code is fairly simple; it assumes a fitted `UpliftRandomForestClassifier` named `uplift_model` and the following two imports:

`from IPython.display import Image`  

`from causalml.inference.tree import uplift_tree_plot`  

`uplift_tree = uplift_model.uplift_forest[0]`  
  
`uplift_tree.fill(X=df_test[x_names].values, 
                 treatment=df_test['Treatment'].values, 
                 y=df_test['visit'].values)`

`graph = uplift_tree_plot(uplift_tree.fitted_uplift_tree, x_names)`  

`Image(graph.create_png())`


The plot below is an example that shows the first tree from the trained forest. 


<img src="http://drive.google.com/uc?export=view&id=1sYY0bc5o6_a8mgljXcJy2K9_hmtON4i8
"
alt="Uplift tree"
      width="1500" height="400"/>
<center>

Figure 10 : Visualization of Uplift tree
 </center> 
 <br>

A closer look at one of its branches:
<img src="http://drive.google.com/uc?export=view&id=1Loq-BlsgkY-XshFiiFZtuSVLU_hV_ECJ
"
alt="Uplift tree branch"
      width="1500" height="400"/>
<center>

Figure 11 : Branch of Uplift tree
 </center> 
 <br>

Each node shows the feature and splitting rule used to divide it into its child nodes, together with the uplift score within each group. Note that the tree plot also includes a **validation uplift score**, which can be compared with the training uplift score to assess whether the tree overfits. 
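
The comparison of training and validation uplift scores can be illustrated for a single leaf. In this simplified sketch, the uplift score of a leaf is taken as the mean outcome of the treated minus the mean outcome of the controls that fall into it. The function name and the observations are hypothetical, and CausalML computes these scores internally.

```python
def leaf_uplift(outcomes, treated):
    """Uplift in a leaf: mean outcome of treated minus mean outcome of controls."""
    y_t = [y for y, w in zip(outcomes, treated) if w == 1]
    y_c = [y for y, w in zip(outcomes, treated) if w == 0]
    return sum(y_t) / len(y_t) - sum(y_c) / len(y_c)


# Hypothetical observations that fall into the same leaf.
train = {"y": [1, 1, 1, 0, 0, 0, 1, 0], "w": [1, 1, 1, 1, 0, 0, 0, 0]}
valid = {"y": [1, 0, 1, 0, 1, 0, 0, 1], "w": [1, 1, 1, 1, 0, 0, 0, 0]}

train_uplift = leaf_uplift(train["y"], train["w"])
valid_uplift = leaf_uplift(valid["y"], valid["w"])
gap = train_uplift - valid_uplift
print(train_uplift, valid_uplift, gap)
```

A large gap between the two scores, as in this toy leaf, suggests that the split captured noise in the training data rather than a stable difference in treatment response.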

**Metalearners and EconML** 

To review the EconML implementation of the metalearners discussed in section 2.2, we generated a dataset according to the specifications in section 4.1.2. The complete code for the DGP can be found on [Github](https://github.com/Olesiewitch/Predictive_Analytics/blob/master/EconML_CATE.ipynb).

The code required to apply the different metalearners in EconML is straightforward. For the `T-learner`, the estimated conditional treatment effects can be obtained with three lines of code: 

`T_learner = TLearner(models=models)` \\
`T_learner.fit(Y, T, X=X)` \\
`T_te = T_learner.effect(X_test)`

where `models` is the gradient boosting regressor constructed using the sklearn package, `Y` is the vector of observed outcomes, `T` is the vector of treatment dummies and `X` is the matrix of covariates. The other metalearners are implemented similarly (see [Github](https://github.com/Olesiewitch/Predictive_Analytics/blob/master/EconML_CATE.ipynb)).
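
To show what the T-learner does under the hood, the sketch below implements it from scratch for a single covariate: one outcome model is fitted on the treated units, one on the controls, and the estimated effect is the difference of their predictions. To keep the example dependency-free, a simple least-squares line replaces the gradient boosting regressor used in our application. The DGP and all helper names are assumptions for illustration and differ from the simulation in section 4.1.2.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single covariate."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def t_learner_effect(x_train, t_train, y_train, x_new):
    """T-learner: fit one outcome model per arm, then take the difference."""
    a1, b1 = fit_line([x for x, t in zip(x_train, t_train) if t == 1],
                      [y for y, t in zip(y_train, t_train) if t == 1])
    a0, b0 = fit_line([x for x, t in zip(x_train, t_train) if t == 0],
                      [y for y, t in zip(y_train, t_train) if t == 0])
    return [(a1 + b1 * x) - (a0 + b0 * x) for x in x_new]


# Simulated data with a known effect tau(x) = 2 * x (illustrative DGP only).
rng = random.Random(0)
X = [rng.uniform(0, 1) for _ in range(2000)]
T = [rng.randint(0, 1) for _ in range(2000)]
Y = [1 + x + t * 2 * x + rng.gauss(0, 0.1) for x, t in zip(X, T)]

X_test = [0.0, 0.5, 1.0]
tau_hat = t_learner_effect(X, T, Y, X_test)
print([round(v, 2) for v in tau_hat])  # should be close to [0.0, 1.0, 2.0]
```

Because the two arms are modelled separately, the T-learner can capture heterogeneity without any explicit interaction terms, at the cost of using only part of the data for each model.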

The comparison of the T-learner, S-learner and X-learner estimates can be found in Figure 12. The true treatment effect is marked with the dashed line. 

<center>
<img src="http://drive.google.com/uc?export=view&id=1RI0AOFTthC_iz3DvzYQ0uGmfrNpcFfiO
"
alt="metalearners"
      width="400" height="300"/>  



Figure 12: Estimates of heterogeneous treatment effect by T, S and X-learner

</center>

 <br>
As can be observed from the plot, all three metalearners correctly captured the heterogeneity of the treatment effect. The X-learner performed best in terms of prediction accuracy. 

Figure 12 and the models' [bias comparison plot](https://github.com/Olesiewitch/Predictive_Analytics/blob/master/EconML_CATE.ipynb) were created using matplotlib. An automated comparison visualisation function would be a useful addition to the package. For the complete code of the 'Examples of the implementation' section, see the project's [Github page](https://github.com/Olesiewitch/Predictive_Analytics). 

## **5. Conclusion <a name="conclusion"></a>**

In conclusion, we have reviewed some of the most popular direct and indirect methods for heterogeneous treatment effect estimation. The literature review was complemented by a comparison of the Python libraries in which these methods are implemented. It is important to stress that the suitability of each method depends heavily on the data: there is no "one-size-fits-all" model for every CATE or uplift modelling task. The same applies to the currently available Python packages.

Advanced packages such as CausalML and EconML offer the widest range of functions for heterogeneous treatment effect modelling. However, many of the smaller packages implement alternative methods and approaches. *R* remains the preferred language in the statistical community, and some of the related functions therefore still await implementation in Python. As this is a rapidly developing research area and Python is becoming the norm among researchers, there is a clear need for further package development.



## **6. References <a name="refrences"></a>**

[1] Athey, S., & Imbens, G. W. (2015). Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5).

[2] Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.

[3] Devriendt, F., Moldovan, D., & Verbeke, W. (2018). A literature survey and experimental evaluation of the state-of-the-art in uplift modeling: A stepping stone toward the development of prescriptive analytics. Big Data, 6(1), 13-41.

[4] Gubela, R. M., Bequé, A., Gebert, F., & Lessmann, S. (2019). Conversion uplift in e-commerce: A systematic benchmark of modeling strategies. International Journal of Information Technology & Decision Making, 18(3), 747-791.

[5] Gutierrez, P., & Gérardy, J. Y. (2017, July). Causal inference and uplift modelling: A review of the literature. In International Conference on Predictive Applications and APIs (pp. 1-13).

[6] Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence. CoRR, arXiv:1810.13237.

[7] Künzel, S. R., et al. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156-4165.

[8] Rzepakowski, P., & Jaroszewicz, S. (2012). Uplift modeling in direct marketing. Journal of Telecommunications and Information Technology, 2012, 43-50.

[9] Sołtys, M., Jaroszewicz, S., & Rzepakowski, P. (2015). Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery, Issue 6/2015.

[10] Zhao, Y., Fang, X., & Simchi-Levi, D. (2017). Uplift modeling with multiple treatments and general response types. doi:10.1137/1.9781611974973.66.

