# Part 1: Heterogeneous treatment effects using causal trees and forests

For this part, we will be using experimental data for computing heterogeneous effects through causal trees and forests. For all exercises, the predictors X are all variables that are not the outcome Y or the treatment D.

1.1. Load the data (1 point). This is data from an experiment on the National Supported Work Demonstration (NSW) job-training program. You can find the data here, and read a description of the data here. For further details of the experiment and the program, you can use this link.

In [None]:
using CSV
using DataFrames  

url = "https://raw.githubusercontent.com/d2cml-ai/CausalAI-Course/main/Labs/Assignment/Assignment_5/data/experimental/experimental_control.csv"
df = CSV.read(download(url), DataFrame)

first(df, 5)

In [17]:
summary_table = DataFrame(
    names = names(df),                
    scitypes = scitype.(eachcol(df)), 
    types = eltype.(eachcol(df))   
)   

 Row │ names     scitypes                    types
     │ String    DataType                    DataType
─────┼────────────────────────────────────────────────
   1 │ treat     AbstractVector{Count}       Int64
   2 │ age       AbstractVector{Count}       Int64
   3 │ educ      AbstractVector{Count}       Int64
   4 │ black     AbstractVector{Count}       Int64
   5 │ hisp      AbstractVector{Count}       Int64
   6 │ marr      AbstractVector{Count}       Int64
   7 │ nodegree  AbstractVector{Count}       Int64
   8 │ re74      AbstractVector{Continuous}  Float64
   9 │ re75      AbstractVector{Continuous}  Float64
  10 │ re78      AbstractVector{Continuous}  Float64


1.2. Find the ATE (1.5 points). With re78 as the outcome variable of interest, find the Average Treatment Effect of participation in the program. Specifically, you should find it by calculating the difference between the means of the treatment group and the control group (the Simple Difference of Means or SDM). What can you say about the program?

In [3]:
# First method
using Statistics 

mean_treat = mean(df[df.treat .== 1, :re78])
mean_control = mean(df[df.treat .== 0, :re78])

# Compute the ATE as the simple difference of means
ATE = mean_treat - mean_control

println("The Average Treatment Effect (ATE) is: $ATE")

The Average Treatment Effect (ATE) is: 1794.3423818501024


In [4]:
# Second method
using GLM

model = lm(@formula(re78 ~ treat), df)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

re78 ~ 1 + treat

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  4554.8      408.046  11.16    <1e-24   3752.85     5356.75
treat        1794.34     632.853   2.84    0.0048    550.574    3038.11
───────────────────────────────────────────────────────────────────────

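The simple difference of means is 1794.34: on average, participants had re78 earnings about 1,794 higher than the control group. The regression of re78 on the treatment indicator reproduces the same estimate, with a standard error of 632.85, a p-value of 0.0048, and a 95% confidence interval of [550.57, 3038.11], so the effect is statistically significant at the 1% level. Because treatment was assigned experimentally, this difference can be read as the causal effect of the program. The evidence therefore suggests that the program raised participants' earnings, although the wide confidence interval means the size of the gain is estimated imprecisely.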
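
As a cross-check on the regression output, the difference of means and its standard error can also be computed by hand. The sketch below uses synthetic stand-in data with a built-in effect of 2.0 (not the NSW sample), and the names `d`, `y`, `sdm`, and `se` are illustrative. It uses the unequal-variance (Welch) standard error, which can differ slightly from the pooled-variance standard error that `lm` reports.

```julia
using Random, Statistics

# Synthetic stand-in data (NOT the NSW sample): a true effect of 2.0 is built in.
rng = MersenneTwister(42)
n = 1_000
d = rand(rng, n) .< 0.4                  # random assignment, about 40% treated
y = 5.0 .+ 2.0 .* d .+ randn(rng, n)     # outcome with a constant effect of 2.0

y1, y0 = y[d], y[.!d]                    # treated and control outcomes

# Simple difference of means and its unequal-variance (Welch) standard error
sdm = mean(y1) - mean(y0)
se = sqrt(var(y1) / length(y1) + var(y0) / length(y0))
ci = (sdm - 1.96 * se, sdm + 1.96 * se)  # normal-approximation 95% interval

println("SDM = ", round(sdm, digits = 3), ", SE = ", round(se, digits = 3))
println("95% CI = ", round.(ci, digits = 3))
```

With the real data, `y` would be `df.re78` and `d` would be `df.treat .== 1`.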

1.3. Heterogeneous effects with causal trees (3 points). Use causal trees like we saw in class. For Python, you should use the econml package; for R, use the grf package; and for Julia, you will need to create the auxiliary variable Y∗ and fit a decision tree regressor. Report the splits the tree finds and interpret them.
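
For reference, the auxiliary variable can be written out explicitly. The block below assumes the usual transformed-outcome construction, where $e(X) = P(D = 1 \mid X)$ is the propensity score, $Y(1)$ and $Y(0)$ are potential outcomes, and $\tau(X)$ is the conditional treatment effect (this notation is introduced here for illustration). It also assumes that treatment is independent of the potential outcomes given $X$ and that $0 < e(X) < 1$, which is plausible in a randomized experiment.

```latex
\begin{aligned}
Y^{*} &= \frac{Y}{D\,e(X) - (1 - D)\,\bigl(1 - e(X)\bigr)}
       = \frac{D\,Y}{e(X)} - \frac{(1 - D)\,Y}{1 - e(X)}, \\
E[Y^{*} \mid X] &= \frac{e(X)\,E[Y(1) \mid X]}{e(X)}
       - \frac{\bigl(1 - e(X)\bigr)\,E[Y(0) \mid X]}{1 - e(X)}
       = E[Y(1) - Y(0) \mid X] = \tau(X).
\end{aligned}
```

Under these assumptions the conditional mean of $Y^{*}$ equals the conditional treatment effect, which is why a regressor fitted to $Y^{*}$ can be read as an estimate of heterogeneous effects.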


In [8]:
using Pkg
Pkg.add("MLJ")
Pkg.add("MLJModels")
Pkg.add("RDatasets")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\KARL\.julia\environments\v1.11\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\KARL\.julia\environments\v1.11\Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `C:\Users\KARL\.julia\environments\v1.11\Project.toml`
  [90m[d491faf4] [39m[92m+ MLJModels v0.17.4[39m
[32m[1m  No Changes[22m[39m to `C:\Users\KARL\.julia\environments\v1.11\Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\KARL\.julia\environments\v1.11\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\KARL\.julia\environments\v1.11\Manifest.toml`


In [12]:
Pkg.add("MLJScikitLearnInterface")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m CondaPkg ──────────────── v0.2.24
[32m[1m   Installed[22m[39m micromamba_jll ────────── v1.5.8+0
[32m[1m   Installed[22m[39m UnsafePointers ────────── v1.0.0
[32m[1m   Installed[22m[39m Pidfile ───────────────── v1.3.0
[32m[1m   Installed[22m[39m StructTypes ───────────── v1.11.0
[32m[1m   Installed[22m[39m MLJScikitLearnInterface ─ v0.7.0
[32m[1m   Installed[22m[39m JSON3 ─────────────────── v1.14.1
[32m[1m   Installed[22m[39m PythonCall ────────────── v0.9.23
[32m[1m   Installed[22m[39m MicroMamba ────────────── v0.1.14
[32m[1m    Updating[22m[39m `C:\Users\KARL\.julia\environments\v1.11\Project.toml`
  [90m[5ae90465] [39m[92m+ MLJScikitLearnInterface v0.7.0[39m
[32m[1m    Updating[22m[39m `C:\Users\KARL\.julia\environments\v1.11\Manifest.toml`
  [90m[992eb4ea] [39m[92m+ CondaPkg v0.2.24[39m
  [90m[0f8b85d8] [39m[92m+ JSON3 v1.14.1[39m
  [90m[5ae904

In [9]:
using MLJ, MLJModels, RDatasets

In [19]:
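# Split off the outcome (re78); everything else stays in X for now
y, X = unpack(df, ==(:re78), !=(:re78))
# Note: this turns every integer column (including age and educ) into Multiclass,
# which is what triggers the scitype warning when the logistic classifier is fit
coerce!(X, Count => Multiclass);

# Split off the treatment indicator (treat); the remaining columns are the predictors
D, X = unpack(X, ==(:treat), !=(:treat));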

In [20]:
LogisticClassifier = @load LogisticClassifier pkg=MLJScikitLearnInterface verbosity=0

log_model = LogisticClassifier()

log_model_machine = machine(log_model, X, D)

fit!(log_model_machine)

│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc MLJScikitLearnInterface.LogisticClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{34}}, AbstractVector{Multiclass{14}}, AbstractVector{Multiclass{2}}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Tuple{Table{<:AbstractVector{<:Continuous}}, AbstractVector{<:Finite}}
└ @ MLJBase C:\Users\KARL\.julia\packages\MLJBase\7nGJF\src\machines.jl:237
┌ Info: T

trained Machine; caches model-specific representations of data
  model: LogisticClassifier(penalty = l2, …)
  args: 
    1:	Source @333 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{34}}, AbstractVector{Multiclass{14}}, AbstractVector{Multiclass{2}}}}
    2:	Source @194 ⏎ AbstractVector{Multiclass{2}}
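
The propensity score can also be estimated with the GLM package that was already used for the ATE regression. The sketch below is one possible alternative, not the method required by the exercise. It fits a logit on a small synthetic table (the `toy` data frame and its values are made up for illustration), keeping the covariates numeric so that no scitype coercion is involved.

```julia
using DataFrames, GLM, Random, Statistics

# Synthetic stand-in table (NOT the NSW data); column names mimic the real ones.
rng = MersenneTwister(7)
n = 500
toy = DataFrame(age = rand(rng, 18:55, n), educ = rand(rng, 6:16, n))
toy.treat = Int.(rand(rng, n) .< 0.5)   # randomized assignment, unrelated to covariates

# Logistic regression of treatment on the covariates
logit = glm(@formula(treat ~ age + educ), toy, Binomial(), LogitLink())

# Fitted P(treat = 1 | age, educ) for every row
pscore = predict(logit, toy)

println("mean propensity = ", round(mean(pscore), digits = 3))
println("treated share   = ", round(mean(toy.treat), digits = 3))
```

In a randomized experiment the estimated scores should stay close to the treated share, since assignment does not depend on the covariates.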


In [2]:
# Class example data from PD10 (not the NSW data); note that this overwrites `df`
using CSV, DataFrames  # DataFrames provides `select` and `Not`
url = "https://raw.githubusercontent.com/d2cml-ai/CausalAI-Course/refs/heads/main/Labs/PD/PD10/online_discounts.csv"
df = select(CSV.read(download(url), DataFrame), Not(:Column1));

In [None]:
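# Propensity score: predicted probability that treat == 1
pscore = pdf.(MLJ.predict(log_model_machine, X), 1)
# Treatment indicator as 0/1 numbers (D was coerced to Multiclass, so use its level codes)
d = Float64.(MLJ.int.(D)) .- 1
# Transformed outcome for the NSW data: the outcome is y (re78), the treatment is d (treat)
y_star = y ./ (d .* pscore .- (1 .- d) .* (1 .- pscore));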
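
Once Y∗ is available, the causal tree step amounts to fitting a shallow regression tree on it. The sketch below is one way to do this with the DecisionTree.jl package, which is an assumption here since the notebook does not name a tree library. It uses synthetic data in which the true effect is 2.0 when the first covariate exceeds 0.5 and 0.0 otherwise, so the root split can be checked against a known answer.

```julia
using DecisionTree, Random, Statistics

# Synthetic stand-in data (NOT the NSW sample)
rng = MersenneTwister(1)
n = 4_000
features = rand(rng, n, 2)                       # two covariates, uniform on [0, 1]
d = Float64.(rand(rng, n) .< 0.5)                # randomized treatment, e(X) = 0.5
tau = ifelse.(features[:, 1] .> 0.5, 2.0, 0.0)   # true effect depends on covariate 1 only
y = tau .* d .+ randn(rng, n)

# Transformed outcome with a known, constant propensity score
pscore = fill(0.5, n)
y_star = y ./ (d .* pscore .- (1 .- d) .* (1 .- pscore))

# Shallow regression tree on Y*; positional arguments are
# n_subfeatures = 0 (use all features), max_depth = 2, min_samples_leaf = 200
tree = build_tree(y_star, features, 0, 2, 200; rng = rng)
print_tree(tree, 2)

# Leaf means are the predicted treatment effects
cate_hat = apply_tree(tree, features)
hi = mean(cate_hat[features[:, 1] .> 0.5])
lo = mean(cate_hat[features[:, 1] .<= 0.5])
println("mean predicted effect, covariate 1 > 0.5:  ", round(hi, digits = 2))
println("mean predicted effect, covariate 1 <= 0.5: ", round(lo, digits = 2))
```

Keeping the tree shallow and the leaves large matters because Y∗ is very noisy, so deep trees mostly fit noise.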

1.4. Heterogeneous effects with causal forests (3 points). Use causal forests like we saw in class. For Python, you should use the econml package; for R, use the grf package; and for Julia, you will need to use the auxiliary variable Y∗ computed in the previous exercise and fit a random forest regressor. Report the importance of the prediction variables.
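
A minimal sketch of the Julia route, assuming the DecisionTree.jl package (the notebook does not name a forest library) and synthetic data in which only the first covariate drives the effect. Variable importance is computed here with a hand-rolled permutation measure: each column is shuffled in turn, and the importance is the mean squared change in the forest's predictions. This is an illustrative choice, not the only possible importance measure.

```julia
using DecisionTree, Random, Statistics

# Synthetic stand-in data (NOT the NSW sample)
rng = MersenneTwister(2)
n = 4_000
features = rand(rng, n, 3)
d = Float64.(rand(rng, n) .< 0.5)        # randomized treatment, e(X) = 0.5
tau = 2.0 .* features[:, 1]              # true effect grows with covariate 1 only
y = tau .* d .+ randn(rng, n)
pscore = fill(0.5, n)
y_star = y ./ (d .* pscore .- (1 .- d) .* (1 .- pscore))

# Random forest regressor on Y*; positional arguments are
# n_subfeatures = 2, n_trees = 200, partial_sampling = 0.7, max_depth = 3, min_samples_leaf = 100
forest = build_forest(y_star, features, 2, 200, 0.7, 3, 100; rng = rng)
base_pred = apply_forest(forest, features)

# Permutation importance: how much predictions move when one column is shuffled
importance = map(1:size(features, 2)) do j
    shuffled = copy(features)
    shuffled[:, j] = shuffle(rng, features[:, j])
    mean(abs2, apply_forest(forest, shuffled) .- base_pred)
end

for (j, imp) in enumerate(importance)
    println("covariate ", j, ": importance = ", round(imp, digits = 4))
end
```

With the real data, `features` would be the numeric predictor matrix and `y_star` the auxiliary variable from the previous exercise.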


1.5. Plot heterogeneous effects (1.5 points). Plot how the predicted treatment effect changes depending on a variable of your choice. (You can see the last example in PD11 for clarification of what you should do in this exercise)
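
One simple way to prepare such a plot is to bin the chosen variable and average the predicted effects within each bin. The sketch below does this with hypothetical inputs: `age` and `cate_hat` are simulated here and stand in for a real covariate and for the predictions of a fitted tree or forest.

```julia
using Random, Statistics

# Hypothetical inputs: a covariate and predicted treatment effects (simulated, not real output)
rng = MersenneTwister(3)
n = 1_000
age = rand(rng, 18:55, n)
cate_hat = 500.0 .+ 40.0 .* (age .- 18) .+ 200.0 .* randn(rng, n)

# Bins [18, 26), [26, 34), ..., [50, 58)
edges = 18:8:58
nbins = length(edges) - 1
bin_of = [searchsortedlast(edges, a) for a in age]   # bin index for each observation

# Bin midpoints (x-axis) and mean predicted effect per bin (y-axis)
bin_mid = [(edges[b] + edges[b + 1]) / 2 for b in 1:nbins]
bin_mean = [mean(cate_hat[bin_of .== b]) for b in 1:nbins]

for b in 1:nbins
    println("age ", edges[b], "-", edges[b + 1] - 1,
            ": mean predicted effect = ", round(bin_mean[b], digits = 1))
end
```

The vectors `bin_mid` and `bin_mean` can then be handed to a plotting package, for example `plot(bin_mid, bin_mean)` from Plots.jl, to draw the effect curve.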
