
# Machine Learning + Heterogeneous Treatment Effects
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1iJfH3FN55H4XLgzu--1ECZGv-vWVWjel/view?usp=sharing)

1. Causality primer/review
2. Machine learning (ML) prediction primer/review
3. Heterogeneous treatment effects
    - When they matter
    - Conceptual framework
    - Using ML to predict treatment effects
4. Random Causal Forests

**Causal Assumption for Estimates:** Causal estimates rely on the assumption that potential outcomes are independent of treatment assignment, given covariates:

1. Target (for now!):

$$ ATE = E[Y_i(1) - Y_i(0)] = E[\tau_i]$$

2. Key identifying assumption:
$$(Y_i(0), Y_i (1)) \perp D_i|X_i$$

3. Estimation:
- Multiple linear regression (OLS)
$$Y_i = \beta_0 + \tau D_i + \beta_1 \times X_{1i} + \cdots + \beta_k \times X_{ki} + \epsilon_i$$
- Matching
- Propensity score methods
- Machine-assisted:
    - Post-Double Selection Lasso
    - Double/De-biased Machine Learning 

## Predicting heterogeneous treatment effects

What is the effect of job training on the probability of finding a job...
- for more-educated vs. less-educated individuals?
- for men vs. women?
- for married vs. single?
- for high-earning vs. low-earning (prior to training)?
- for minorities vs. non-minorities?
- Why does it matter?
- Other examples where heterogeneity in treatment effects matters?

## Traditional heterogeneity analysis: Interacted regression

To estimate the overall average effect:
$$Y_i = \tau D_i + \epsilon_i, \quad i \in \{1, \ldots, n\}$$

To explore heterogeneity by sex:

$$Y_i = \tau^{female}D_i + \epsilon_i, \quad i : Female_i = 1$$

$$Y_i = \tau^{male}D_i + \epsilon_i, \quad i : Female_i = 0$$

or, equivalently:

$$Y_i = \tau^{male}D_i + \beta Female_i + \gamma D_i \times Female_i + \epsilon_i$$

$$\tau^{female} = \tau^{male} + \gamma$$

More generally,
$$Y_i = \tau D_i + X_i' \beta + D_i X_i' \gamma + \epsilon_i$$

$$\tau (x) = \tau + x'\gamma$$

---
$$Y_i = \tau D_i + X_i' \beta + D_i X_i' \gamma + \epsilon_i$$

This specification has two limitations:

- Functional form: treatment effects may not vary linearly with $X_i$
- Curse of dimensionality: when $X_i$ includes many variables, OLS impractical or infeasible
- These are problems ML was born to solve
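
As a small illustration of the interacted regression above, here is a sketch on simulated data. The data-generating process, the variable names (`df_sim`, `fit_int`), and the true effects (1.0 for men, 1.5 for women) are assumptions made up for this example, not part of the lab data. It shows that the fully interacted model and the split-sample regression give the same $\tau^{female}$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): randomized treatment D and a female indicator
rng = np.random.default_rng(0)
n = 5000
df_sim = pd.DataFrame(
    {
        "D": rng.integers(0, 2, size=n),
        "female": rng.integers(0, 2, size=n),
    }
)
# Assumed truth: effect is 1.0 for men and 1.0 + 0.5 = 1.5 for women
df_sim["Y"] = (
    1.0 * df_sim["D"]
    + 0.3 * df_sim["female"]
    + 0.5 * df_sim["D"] * df_sim["female"]
    + rng.normal(size=n)
)

# "D * female" expands to D + female + D:female (plus an intercept)
fit_int = smf.ols("Y ~ D * female", data=df_sim).fit(cov_type="HC3")
tau_male = fit_int.params["D"]
gamma = fit_int.params["D:female"]
tau_female = tau_male + gamma

# The same effect from a regression on the female subsample only
fit_f = smf.ols("Y ~ D", data=df_sim[df_sim["female"] == 1]).fit()
tau_female_split = fit_f.params["D"]

print("tau_male:", tau_male)
print("tau_female (interacted):", tau_female)
print("tau_female (split sample):", tau_female_split)
```

With a single binary covariate this is easy. The difficulty described in the bullets appears once $X_i$ has many columns and every interaction needs its own coefficient.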

# Causal Effects via Regression

Let's take up the example from the slides: what is the effect of going to a fancy college on later-life earnings? We'll use data on about 1,000 American men in the NLSY born 1980-1984 who finished college, and look at the effect of going to a private college ($D_i$) on earnings ($Y_i$) in 2015-2019 (when they were about 30-39 years old). We will be estimating an equation like this:

$$
Y_i = \delta D_i + X_i'\beta+\varepsilon_i,
$$

where $X_i$ is a vector of controls, conditional on which we are willing to assume $D_i$ is as good as randomly assigned.

What kinds of variables should we include in $X_i$?


In [None]:
# Import useful packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd  # for loading and managing datasets
import statsmodels.api as sm  # for running regressions and getting standard errors

In [None]:
# Load NLSY data
nlsy = pd.read_csv(
    "https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/data/nlsy97.csv"
)

In [None]:
# Clean data (drop obs with missing values)
nlsy = nlsy.dropna()
nlsy

Let's start with a simple (uncontrolled) regression.


In [None]:
# Simple regression
D = nlsy["privatecollege"]
y = nlsy["annualearnings"]

rhs = sm.add_constant(
    D
)  # you have to add the constant yourself with statsmodels!
model = sm.OLS(y, rhs)
results = model.fit(cov_type="HC3")  # heteroskedasticity-robust
print(results.summary())

How should we interpret the coefficient on $privatecollege$? Can we read it as a causal effect?


Now let's add controls for parents' education and cognitive ability as measured by ASVAB:

In [None]:
# Practice!
# Regression with controls
# Create D, y, x objects

# Add constant

# Obtain the model

# Fit the model and create "results" object

print(results.summary())

How did the inclusion of controls change the estimate? Why?


## Bootstrapping Clustered Standard Errors
- Handling within-group dependence
- Resampling with replacement
- Practical details and Stata resource [link](https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf)
- [R resource](https://www.r-bloggers.com/2013/01/the-cluster-bootstrap/)

### Start with the first iteration

In [None]:
import numpy as np
import statsmodels.api as sm

In [None]:
# Generate random data
np.random.seed(0) # for reproducibility
clusters = np.random.choice(['A', 'B', 'C', 'D'], size=100) 
X = np.random.rand(100, 1)
y = 2.5 * X.squeeze() + 3 + np.random.randn(100) * 2
X[:5]

In [None]:
row_indices = np.random.choice(X.shape[0], size=4, replace=False)  # Get 4 random row indices
X_sample = X[row_indices]
print(len(X_sample))
print(X_sample)

In [None]:
# When Right Hand Side is Pandas DataFrame object:
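row_indices = np.random.choice(rhs.shape[0], size=2, replace=False)  # Get 2 random row positions

# These are positions (0..n-1); the index labels may have gaps after dropna(), so select with .iloc
X_sample = rhs.iloc[row_indices]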
X_sample

In [None]:
# With Replacement Example
import math
import numpy as np
for _ in range(math.factorial(3)):
    print(np.random.choice(['A','B','C'], size=3, replace=True))

In [None]:
# Without Replacement Example
import math
import numpy as np
for _ in range(math.factorial(3)):
    print(np.random.choice(['A','B','C'], size=3, replace=False))

In [None]:
# Now we want to resample clusters
unique_clusters = np.unique(clusters)
unique_clusters

In [None]:
# Sample clusters with replacement
sampled_clusters = np.random.choice(unique_clusters, size=len(unique_clusters), replace=True)
sampled_clusters

In [None]:
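# Get the observations corresponding to the sampled clusters.
# A cluster drawn k times must contribute its rows k times, so collect row
# positions cluster by cluster (np.isin would keep each cluster only once).
idx = np.concatenate([np.flatnonzero(clusters == c) for c in sampled_clusters])
idx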

In [None]:
# Get the first sample
X_sample = X[idx]
y_sample = y[idx]

In [None]:
# Fit the model and store the parameters
model = sm.OLS(y_sample, sm.add_constant(X_sample)).fit()
model.params

In [None]:
# Now we want to repeat this across multiple iterations
# Practice!
# Write a for loop to repeat this process 1,000 times
# Save the parameters in the "params" object

---

## Define function

In [None]:
import numpy as np
import statsmodels.api as sm

def clustered_bootstrap(X, y, clusters, n_iterations):
    unique_clusters = np.unique(clusters)
    params = []

    for _ in range(n_iterations):
        # Sample clusters with replacement
        sampled_clusters = np.random.choice(unique_clusters, size=len(unique_clusters), replace=True)
        
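        # Get the observations corresponding to the sampled clusters; a cluster
        # drawn k times contributes its rows k times (np.isin would keep it once)
        idx = np.concatenate([np.flatnonzero(clusters == c) for c in sampled_clusters])
        X_sample = X[idx]
        y_sample = y[idx]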

        # Fit the model and store the parameters
        model = sm.OLS(y_sample, sm.add_constant(X_sample)).fit()
        params.append(model.params)

    return np.array(params)

In [None]:
# Generate some sample data
np.random.seed(0)
clusters = np.random.choice(['A', 'B', 'C', 'D'], size=100) 
X = np.random.rand(100, 1)
y = 2.5 * X.squeeze() + 3 + np.random.randn(100) * 2

# Bootstrap
params = clustered_bootstrap(X, y, clusters, 1000)

In [None]:
# Compute standard errors
standard_errors = params.std(axis=0)
print("Clustered Bootstrapped Standard Errors:", standard_errors)

## With real data

In [None]:
# Pretend we have a cluster
nlsy['cluster'] = np.random.choice(['A', 'B', 'C', 'D', 'F'], size=len(nlsy))

In [None]:
# Practice!
# Generate the y and X variable from nlsy data

# rhs = ?

rhs = sm.add_constant(
    pd.concat([D,X],axis=1)
)

# Run the predetermined bootstrap function

# Compute standard errors and print

# Remember we randomly assign the clusters here

---

# Prediction Primer

Let's use decision trees to predict which participants of the National JTPA Study were likely to find a job. We will use prior earnings, education, sex, race, and marital status as our prediction features.

In [None]:
data = pd.read_csv(
    "https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/data/jtpahet.csv"
)
data

Import some utilities:


In [None]:
import requests

url1 = "https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/python/plot_2d_separator.py"
r1 = requests.get(url1)

r1.text.split("\n")

In [None]:
# @title
import requests
url1 = 'https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/python/plot_2d_separator.py'
url2 = 'https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/python/plot_interactive_tree.py'
url3 = 'https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/python/plot_helpers.py'
url4 = 'https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/python/tools.py'
r1 = requests.get(url1)
r2 = requests.get(url2)
r3 = requests.get(url3)
r4 = requests.get(url4)

# make sure your filename is the same as how you want to import
with open('plot_2d_separator.py', 'w') as f1:
    f1.write(r1.text)

with open('plot_interactive_tree.py', 'w') as f2:
    f2.write(r2.text)

with open('plot_helpers.py', 'w') as f3:
    f3.write(r3.text)

with open('tools.py', 'w') as f4:
    f4.write(r4.text)

# now we can import
import plot_helpers
import tools
import plot_2d_separator
import plot_interactive_tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

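# matplotlib 3.6 renamed the seaborn styles to "seaborn-v0_8-*"; fall back to the old name on older versions
plt.style.use("seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid")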
from sklearn.tree import plot_tree

We'll first grow a tree using just two features (education and prior earnings) so we can visualize it easily. Let's visualize the feature space: triangles are individuals who found a job, circles are those who didn't.


In [None]:
data.columns

In [None]:
plot_helpers.discrete_scatter(
    data["educ"], # x axis
    data["priorearn"], # y axis
    data["foundjob"], # shape
)
plt.show()

In [None]:
# Initiate the trees
tree = DecisionTreeRegressor(max_depth=3).fit(
    data[["educ", "priorearn"]].values, data["foundjob"]
)

In [None]:
fig1, ax = plt.subplots(1, 1, figsize=(12, 8))

plot_interactive_tree.plot_tree_partition(
    data[["educ", "priorearn"]].values,
    data["foundjob"],
    tree,
    ax=ax,
)
plot_tree(
    tree,
    feature_names=["education", "Prior earnings"],
    class_names=["No job", "Found job"],
    impurity=False,
    filled=True,
)
plt.show()

Now let's do a random forest:


In [None]:
# Create X with "educ" and "priorearn" columns from data and y with "foundjob"
X = data[["educ", "priorearn"]]
y = data['foundjob']
# Initiate a random forest classifier
forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(
    X.values, y
)

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
    ax.set_title("Tree {}".format(i))
    plot_interactive_tree.plot_tree_partition(
        X.values,
        y,
        tree,
        ax=ax,
    )

plot_2d_separator.plot_2d_separator(
    forest,
    X.values,
    fill=True,
    ax=axes[-1, -1],
    alpha=0.4,
)
axes[-1, -1].set_title("Random Forest")
plot_helpers.discrete_scatter(
    data["educ"],
    data["priorearn"],
    data["foundjob"],
)
plt.show()

We only used two prediction features (prior earnings and education) for visualization. To get the best predictions, we should use all of our features. And to evaluate the quality of the prediction, we should hold out a test set.


In [None]:
data.columns

In [None]:
# Practice!
# Define X and y with sensible variables from data:

# Hold out a test set:
# Remember `train_test_split` function


In [None]:
# How would you check if the split function is correctly specified?

In [None]:
len(X_train)+len(X_test)==len(X)

Try on your own: grow a forest with 500 trees using the training set, and evaluate the prediction accuracy on the test set. Hint: you can evaluate the prediction accuracy by doing `forest.score(X_test,y_test)`.


In [None]:
# Practice!
# Initiate a random forest classifier with 500 trees

# Evaluate the prediction accuracy



## Key Challenge: Algorithms tailored for predicting outcomes can do poorly when predicting treatment effects
### Factors that strongly predict outcomes may not strongly predict treatment effects
$Y_i$: spending on a Lexus

$D_i$: seeing an online ad for a Lexus

$\ln Y_i=\beta_0+\beta_1 age_i +\beta_2 male_i + \beta_3 D_i+\beta_4 D_i \times male_i +\varepsilon_i$

How do outcomes vary by age? (A lot if $\beta_1$ is big)

How do treatment effects vary by age? (not at all!)

What do treatment effects vary by? (gender!)


So much for predicting _outcomes_. We want to predict causal effects. Back to the whiteboard!


--- 

**Predicting Outcome Vs. Predicting Treatment Effects**

Target: $\hat{y}(x) = E[Y_i|X_i=x]$ 

Criterion: $\min E[(Y_i-\hat{y}(x))^2|X_i = x]$

Training data: $\{Y_i, X_i\}^n_{i=1}$


---

Target: $\hat{\tau}(x) = E[\tau_i|X_i=x]$ 

Criterion: $\min E[(\tau_i-\hat{\tau}(x))^2|X_i = x]$

Training data: $\{\tau_i, X_i\}^n_{i=1}$


**What would be the potential issue?**

$$
\begin{align}
E(\tau_i|X_i) &:=  E[Y_i(1)-Y_i(0)|X_i] \\
&= E[Y_i|X_i, D_i=1] -E[Y_i|X_i, D_i=0]
\end{align}
$$
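
The issue is that $\tau_i$ is never observed, because each unit reveals only one of its two potential outcomes. The identity above says the conditional average effect can still be recovered as a difference of conditional means. Below is a minimal sketch on simulated data. The single binary covariate, the variable names, and the true effects (0.2 when `x = 0`, 0.8 when `x = 1`) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20000
x_sim = rng.integers(0, 2, size=n)  # one binary covariate
d_sim = rng.integers(0, 2, size=n)  # randomized treatment, independent of potential outcomes

y0 = 1.0 + 0.5 * x_sim + rng.normal(size=n)  # potential outcome without treatment
tau_i = 0.2 + 0.6 * x_sim  # individual effect: 0.2 if x=0, 0.8 if x=1
y1 = y0 + tau_i  # potential outcome with treatment

# Only one potential outcome per unit ends up in the data
y_obs = np.where(d_sim == 1, y1, y0)

df_sim = pd.DataFrame({"x": x_sim, "d": d_sim, "y": y_obs})

# E[Y | X, D=1] - E[Y | X, D=0], computed cell by cell
cell_means = df_sim.groupby(["x", "d"])["y"].mean().unstack("d")
cate_hat = cell_means[1] - cell_means[0]
print(cate_hat)
```

The estimates in `cate_hat` use only observed outcomes, yet they approximate the averages of the unobserved `tau_i` within each value of `x`.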


---

- Final criterion (Athey and Imbens, 2016)
$$
\min \sum_i (\tau_i - \tau(X_i))^2 \equiv \max \sum_i \tau(X_i)^2
$$

- Be `honest`: use one set of observations to select the tree structure, and another to generate predictions
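
To make the honesty idea concrete, here is a rough sketch using an ordinary outcome-prediction tree from scikit-learn rather than a causal tree. The simulated data, the 50/50 split, `min_samples_leaf=100`, and the helper `honest_predict` are all assumptions for this illustration. One half of the data chooses the splits, and the other half supplies the leaf averages.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 4000
X_demo = rng.uniform(size=(n, 2))
# Assumed truth: the outcome jumps by 1 when the first feature exceeds 0.5
y_demo = (X_demo[:, 0] > 0.5).astype(float) + rng.normal(scale=0.5, size=n)

# Half the sample picks the tree structure, the other half estimates leaf values
X_split, X_est, y_split, y_est = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

# Step 1: choose splits using the splitting sample only
honest_tree = DecisionTreeRegressor(
    max_depth=2, min_samples_leaf=100, random_state=0
).fit(X_split, y_split)

# Step 2: drop the estimation sample down the tree and average within each leaf
leaf_of_est = honest_tree.apply(X_est)
honest_means = {
    leaf: y_est[leaf_of_est == leaf].mean() for leaf in np.unique(leaf_of_est)
}


def honest_predict(X_new):
    """Predict with leaf means computed from the estimation sample.

    Assumes every leaf received at least one estimation observation.
    """
    return np.array([honest_means[leaf] for leaf in honest_tree.apply(X_new)])


print(honest_means)
print(honest_predict(X_est[:5]))
```

Because the leaf averages come from observations that played no role in choosing the splits, they are not biased by the tree having searched for extreme-looking leaves.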

---

- Target: 

$$
CATE := \tau(x) =E[\tau_i|X_i=x]
$$

- Key identifying assumption:
$$(Y_i(0), Y_i (1)) \perp D_i|X_i$$

- Estimation: Random Causal Forest
    - Grow decision trees on many bootstrapped samples
    - Choose splits using the training set to $\max \sum_i \tau(X_i)^2$
    - Generate predictions in each leaf using the estimation set
    - Average predictions over the trees in the forest
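
A short sketch of why maximizing $\sum_i \tau(X_i)^2$ is the same as minimizing the squared prediction error. It assumes $\tau(X_i)$ is the average of $\tau_j$ over the leaf containing unit $i$. The symbols $\ell$, $n_\ell$, and $\tau_\ell$ (for a leaf, its size, and its average effect) are introduced here only for this illustration.

$$
\begin{align}
\sum_i (\tau_i - \tau(X_i))^2 &= \sum_i \tau_i^2 - 2\sum_i \tau_i \tau(X_i) + \sum_i \tau(X_i)^2 \\
&= \sum_i \tau_i^2 - \sum_i \tau(X_i)^2
\end{align}
$$

The second line holds because, within each leaf $\ell$, $\sum_{i \in \ell} \tau_i \tau(X_i) = n_\ell \tau_\ell^2 = \sum_{i \in \ell} \tau(X_i)^2$. The term $\sum_i \tau_i^2$ does not depend on how the tree partitions the data, so minimizing the left-hand side is equivalent to maximizing $\sum_i \tau(X_i)^2$, and the latter can be estimated without observing any individual $\tau_i$.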
---

Let's simulate some data to show what happens when we try to use an algorithm tailored to predicting outcomes for predicting treatment effects.


In [None]:
import pandas as pd
import numpy as np
from sklearn import tree

In [None]:
# define parameters
n = 1000  # sample size
p = 0.5  # probability of seeing the ad
beta0 = 0
beta1 = 0.2  # effect of age
beta2 = -0.025  # difference in average spending between males and females who don't see the ad
beta3 = 0  # effect of treatment among females
beta4 = 0.05  # differential effect of treatment among males compared to females
sigeps = 0.02  # residual standard deviation of the outcome

In [None]:
# generate some fake data
age = np.random.randint(low=18, high=61, size=(n, 1))
male = np.random.randint(low=0, high=2, size=(n, 1))
d = np.random.rand(n, 1) > (1 - p)
epsilon = sigeps * np.random.randn(n, 1)

In [None]:
# True Data Generating Process
lny = beta0 + beta1 * age + beta2 * male + beta3 * d + beta4 * d * male + epsilon

In [None]:
# assemble as dataframe
fakedata = pd.DataFrame(
    np.concatenate((lny, d, age, male), axis=1), columns=["lny", "d", "age", "male"]
)

In [None]:
fakedata.feature_names = ["age", "male"]

- Note: subset data

In [None]:
condition = data.z == 1
subset = data[condition]

In [None]:
condition[:5]

In [None]:
data.z[:5]

In [None]:
sum(condition)

In [None]:
# Practice!
# Find the untreated group

# Find people older than 35 among the untreated group


In [None]:
# Alternatively,
condition = (data.z==0)&(data.age>35)
subset = data[condition]

In [None]:
# Practice! 
# Define x0, x1, y0, y1
# where x0 contains the "age" and "male" variables for untreated units

# x1 contains treated units

# y0 is the outcome among untreated units "lny"

# y1 is the outcome among treated units "lny"

# You can also combine the two lines

In [None]:
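# Alternatively, use .loc with a boolean mask built from the "d" column
# (the NumPy array `d` has shape (n, 1), which .loc cannot use as a row mask)
x0 = fakedata.loc[fakedata["d"] == 0, ["age", "male"]]
x1 = fakedata.loc[fakedata["d"] == 1, ["age", "male"]]
y0 = fakedata.loc[fakedata["d"] == 0, ["lny"]]
y1 = fakedata.loc[fakedata["d"] == 1, ["lny"]]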

Try on your own: fit two trees (call them `tree0` and `tree1`), each with `max_depth=2` to predict the outcome separately in the untreated ($D_i=0$) and treated ($D_i=1$) samples, using `x0` and `x1`, respectively.


In [None]:
# Practice!
# fit trees
# tree1 among treated

# tree0 among untreated


In [None]:
# display trees
print("Treated tree:")
plot_tree(tree1, filled=True, feature_names=fakedata.feature_names)
plt.show()

In [None]:
print("Untreated tree:")
plot_tree(tree0, filled=True, feature_names=fakedata.feature_names)
plt.show()

Which variable(s) did the trees key in on? Why? Would these trees be useful for predicting treatment effects? Why or why not?

How do we fix the problem?


---

## Random Causal Forest: Simulated Example


[Resource](https://lost-stats.github.io/Machine_Learning/causal_forest.html)

For those who have trouble installing the `econml` package, here are several things to try. Uncomment the commands you want to run.

## Installing `econml` Attempt 1

In [None]:
# pip install econml

## Installing `econml` Attempt 2

In [None]:
# pip install pyproject

In [None]:
# pip install --upgrade pip setuptools wheel

In [None]:
# pip install shap

## Installing `econml` Attempt 3

In [None]:
# Run anaconda prompt 
# conda install -c conda-forge econml on conda prompt

## Installing `econml` Attempt 4

In [None]:
# pip uninstall scikit-learn

In [None]:
# pip install scikit-learn

---

In [None]:
from econml.dml import CausalForestDML as CausalForest

In [None]:
# NOTE: If you are getting `np.int` error, do the following:
# pip install --force-reinstall numpy==1.23.5 
# There is a fix for the new numpy version, but it's not released yet:
# https://github.com/py-why/EconML/commit/0be16255f10853fc9fe0774cb5649e051dc55dff

# Instantiate the Causal Forest
estimator = CausalForest(n_estimators=500, discrete_treatment=True, criterion="het")

In [None]:
# Can you guess how to call the function to fit the estimator?
y = fakedata["lny"]
d = fakedata["d"]
X = fakedata[["age", "male"]]

# Grow the forest
estimator.fit(
    y,  # outcome
    d,  # treatment
    X=X,  # prediction features
)

# Predict effects for each observation based on its characteristics:
effects = estimator.effect(fakedata[["age", "male"]])

Let's see how well it did at estimating effects among men and women:


In [None]:
# Practice!
# Subset "effects" object where the observation is male in fakedata
# Generate 'malefx' object


In [None]:
malefx.mean()

In [None]:
# Practice!
# Subset "effects" object where the observation is female in fakedata
# Generate "femalefx" object


In [None]:
femalefx.mean()

How did our causal forest do at getting effects right for men and women? Let's see how it does on the age profile:


In [None]:
# Practice!
# Generate "maleage" and "femaleage" objects
# that contain the age array from male observations

# and contain the age array from female observations


In [None]:
# Alternatively we can use iloc function
maleage = fakedata["age"].iloc[fakedata["male"].values == 1]
femaleage = fakedata["age"].iloc[fakedata["male"].values == 0]

In [None]:
fig = plt.figure()
ax = plt.axes()

ax.scatter(maleage, malefx, label="males")
ax.scatter(femaleage, femalefx, label="females")
ax.legend()

# add title, x label, and y label to the plt object
plt.title("Estimated Treatment effects")
plt.xlabel("age")
plt.ylabel("treatment effect")

A little noisy on the age profile (which should be flat) but does get the difference between men and women!


## Random Causal Forest: Predict the effects of job training
We are ready to apply machine learning to predict causal effects in a real-life setting: how do the effects of job training vary by an individual's characteristics? We will use data from the National Job Training Partnership study, a large-scale randomized evaluation of a publicly subsidized job training program for disadvantaged youth and young adults. Why would we care how the effects of a subsidized job training program vary by a person's characteristics?


We will use the JTPA evaluation dataset, which contains observations on about 14,000 individuals, some of whom were randomized to participate in job training ($z_i = 1$) and others who were not ($z_i = 0$).

To do on your own:

- load the dataset from the url `https://github.com/Mixtape-Sessions/Heterogeneous-Effects/raw/main/Labs/data/jtpahet.csv`
- define the outcome vector (call it `y`) to be the column labeled `foundjob`
- define the randomized assignment indicator (call it `z`) to be the column labeled `z`
- define the feature vector (call it `x`) to be all columns except `foundjob`, `z`, and `enroll`.


In [None]:
# Practice!
# load data and set things up


On your own: run a linear regression of the outcome on the random assignment indicator, `z`. Since this was a randomized experiment, we don't need controls!


In [None]:
# Practice

print(results.summary())

### Set up random forest
So far, so good? Now create a random causal forest object, and fit it with outcome `y`, treatment variable `z`, and feature matrix `x`.


In [None]:
# Practice! 
# Create and fit random causal forest object


### Explore effects
Let's see what kind of heterogeneous effects our random causal forest predicted:


In [None]:
# calculate the predicted effects:
insamplefx = rcf.effect(x)
print(rcf.ate_[0])

In [None]:
# Recap practice!
# Use .format function to keep the four digits


In [None]:
# Alternatively, flexibly keeping significant digits
"ATE: {:.3g}".format(rcf.ate_[0])

In [None]:
# plot a histogram of the estimated effects, with average effect overlaid
fig = plt.figure()
ax = plt.axes()
ax.hist(insamplefx, bins=30, density=True)
plt.axvline(rcf.ate_[0], color="k", linestyle="dashed", linewidth=1)
plt.suptitle("Estimated Treatment effects")
plt.title("ATE: {:.3g}".format(rcf.ate_[0]))
plt.show()

Let's visualize how these effects vary by prior earnings and education by making a heatmap:


In [None]:
import itertools

In [None]:
# create a grid of values for education and prior earnings:
educgrid = np.arange(data["educ"].values.min(), data["educ"].values.max() + 1)
earngrid = np.arange(
    data["priorearn"].values.min(), data["priorearn"].values.max(), 5000
)
grid = pd.DataFrame(
    itertools.product(educgrid, earngrid), columns=["educ", "priorearn"]
)

In [None]:
grid

We'll first visualize the effects among married, nonwhite females of average age:


In [None]:
# Adding columns
grid["age"] = data["age"].mean()  # set age to the average
grid["female"] = 1  # set female = 1
grid["nonwhite"] = 1  # set nonwhite = 1
grid["married"] = 1  # set married = 1
grid

To do on your own: calculate the predicted effects for each "observation" in the grid:


In [None]:
# Practice!
# gridfx = # uncomment and fill in on your own!
gridfx = rcf.effect(grid)

### Visualize effects with a heatmap:


In [None]:
from mpl_toolkits.axes_grid1 import make_axes_locatable

fig = plt.figure()
ax = plt.subplot()
main = ax.scatter(
    grid["educ"], grid["priorearn"], c=gridfx, cmap="plasma", marker="s", s=300
)
plt.suptitle("Estimated Treatment effects")
plt.title("Nonwhite married females")
plt.xlabel("years of education")
plt.ylabel("prior earnings")

# create an Axes on the right side of ax. The width of cax will be 5%
# of ax and the padding between cax and ax will be fixed at 0.05 inch.
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(main, cax=cax)
plt.show()

### Comparison to Residualization?
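
For orientation, residualization (partialling out) predicts both the outcome and the treatment from the features, then regresses the outcome residual on the treatment residual. The sketch below runs this on simulated data with a single train/test split. The data-generating process (true effect 0.5), the use of `RandomForestRegressor` for both predictions, and all names ending in `_sim` are assumptions for this example rather than the lab's prescribed solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 4000
X_sim = rng.normal(size=(n, 3))
# Treatment probability depends on the first feature, so X confounds d and y
p_sim = 1 / (1 + np.exp(-X_sim[:, 0]))
d_sim = (rng.uniform(size=n) < p_sim).astype(float)
# Assumed truth: constant treatment effect of 0.5
y_sim = 0.5 * d_sim + np.sin(X_sim[:, 0]) + rng.normal(scale=0.5, size=n)

# Fit the prediction models on one half, residualize on the other half
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.5, random_state=0)

model_y = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=20, random_state=0
).fit(X_sim[idx_train], y_sim[idx_train])
model_d = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=20, random_state=0
).fit(X_sim[idx_train], d_sim[idx_train])

# Residuals: the part of y and d that the features do not predict
y_res = y_sim[idx_test] - model_y.predict(X_sim[idx_test])
d_res = d_sim[idx_test] - model_d.predict(X_sim[idx_test])

# Regress the outcome residual on the treatment residual
theta_hat = (
    LinearRegression(fit_intercept=False)
    .fit(d_res.reshape(-1, 1), y_res)
    .coef_[0]
)
print("Estimated effect:", theta_hat)
```

Fitting on one half and residualizing on the other keeps overfitting in the prediction step from leaking into the effect estimate.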

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import numpy as np
y = data["foundjob"]
d = data["z"]
X = data.drop(["foundjob", "z", "enroll"], axis=1)

In [None]:
X.columns

In [None]:
# Practice!
# Manual one-fold cross validation

# 1. Subset data
# females and earnings below 10,000
condition = (X.female==1)&(X.priorearn<10000)

X_female = X[condition]
len(X_female)

# 2. Generate a training and test set
X_train_female, X_test_female, y_train_female, y_test_female = train_test_split(X_female, y[condition], test_size=0.5, random_state=0)

# 3. How would you split the sets for D?
d_train, d_test = d[X_train_female.index], d[X_test_female.index]

In [None]:
X_train_female[:3]

In [None]:
y_train_female[:3]

In [None]:
# Practice!
# Create a Random Forest Classifier with max_depth=3

# Train the classifier on the training data

# Calculate residuals for y 

# Calculate residuals for d on entire data


In [None]:
# Linear projection of y tilde on d tilde
d_reshaped = d_residual.values.reshape(-1,1) # 1 dimensional array needs reshaping for regression packages
y_reshaped = y_residual.values.reshape(-1,1)
lm=LinearRegression()
lm.fit(d_reshaped,y_reshaped)
lm.coef_

## Stratified sample splitting
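
An alternative to splitting each group by hand is the `stratify` argument of `train_test_split`, which keeps group shares (nearly) equal across the two sets. The sketch below uses a made-up toy DataFrame. The column names mirror the JTPA data, but the values are simulated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
toy = pd.DataFrame(
    {
        "nonwhite": rng.integers(0, 2, size=1000),
        "priorearn": rng.gamma(2.0, 3000.0, size=1000),
    }
)

# stratify=... splits within each value of "nonwhite" and recombines the pieces
toy_train, toy_test = train_test_split(
    toy, test_size=0.5, random_state=0, stratify=toy["nonwhite"]
)

print("Share nonwhite, train:", toy_train["nonwhite"].mean())
print("Share nonwhite, test: ", toy_test["nonwhite"].mean())
```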

In [None]:
# Practice!
# Split training/test set by "nonwhite" status
# Concatenate the subsets into final training/test set