# A Case for using Machine Learning Models to Improve Causal Inference in Economics

In this notebook, I will provide a detailed analysis of a randomized controlled experiment and evaluate its impact. Additionally, I will explore meta-learners, with a primary focus on the T-Learner model.The goal of this analysis is to demonstrate how machine learning models can be used to improve causal inference in economics, particularly in the context of randomized controlled trials (RCTs).

Randomized controlled trials (RCTs) are widely regarded as the gold standard for causal impact evaluation. However, they are primarily limited to estimating average treatment effects (ATE). In cases where scaling up a program involves substantial costs, it is often more valuable to estimate conditional treatment effects (CTE) and conditional average treatment effect (CATE), which help identify the subpopulations that benefit the most from the program.

This notebook enables a two-part analysis. First, I will assess the effectiveness of the program using the randomized controlled experiment. Then, leveraging the rich set of descriptive features available, I will build a predictive model. Then, I will develop a T-learners to perform causal predictive analytics, allowing for a more granular understanding of the treatment effect across different subgroups.

With the results from the RCT and T-Learners, I will apply these insights to a new dataset to evaluate how the program can be effectively scaled up.

#### Import Libraries

In [1]:
import polars as pl
from scipy.stats import ttest_ind
import numpy as np
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

#### Read the experiment dataset

In [2]:
df = pl.read_csv("Datagen/dataset.csv")
df.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income,treatment,expenditure
i64,i64,str,i64,i64,f64,f64,i64,f64
18,1,"""Asian""",1,0,0.0,12000.0,0,8156.3
18,0,"""Black""",0,0,0.0,12000.0,0,9915.55
32,0,"""White""",1,1,99157.9,0.0,0,83903.93
33,1,"""Hispanic""",1,0,0.0,10000.0,1,11803.33
43,0,"""White""",0,1,124127.2,0.0,1,123086.25


This dataset contains information about a legislation aimed at providing accidental and medical coverage to individuals in order to increase aggregate expenditure. The underlying rationale behind the experiment is that individuals with access to medical insurance are more likely to spend money without the concern of saving for unexpected emergencies. To test this hypothesis, the government launched a pilot program through a randomized controlled trial (RCT).

The dataset includes information from this RCT, designed to assess the impact of accidental and medical insurance on employment outcomes. It contains various demographic and background characteristics, such as age, education level, and prior work experience, along with treatment assignment (indicating whether an individual received medical insurance) and the outcome variable (employment status after the program).

20% of the population of this experiment are treated.

## Result of Randomized Controlled Trials (RCTs) in Expenditures 

The experiment is designed with robust randomized controlled trial framework. Now, let’s examine the average treatment effect on total expenditure.

In [3]:
treated = (df
.filter(pl.col('treatment') == 1)
.select('expenditure').to_numpy()
)

controlled = (df
.filter(pl.col('treatment') == 0)
.select('expenditure').to_numpy()
)


#### Function to run  t-test and output t-test values

In [4]:
def p_test(treated, controlled):
    t_stat, p_val = ttest_ind(treated, controlled)
    # Store results in a dictionary
    results = {
        "Mean Expenditure (Treated)": treated.mean(),
        "Mean Expenditure (Controlled)": controlled.mean(),
        "T-statistic": t_stat[0],
        "P-value": p_val[0]
    }

    # Print readable output
    print("T-Test Results:")
    for key, value in results.items():
        print(f"{key}: {value:.4f}")

In [5]:
p_test(treated,controlled)

T-Test Results:
Mean Expenditure (Treated): 54484.0057
Mean Expenditure (Controlled): 53094.3690
T-statistic: 2.5590
P-value: 0.0105


In [6]:
54484.0057 - 53094.3690


1389.6367000000027

The mean expenditure for the treatment group (those who received medical insurance) is 55484, while the mean for the control group (those without insurance) is 53094.36. The difference between the two groups is **1389.6**. With a t-statistic of **2.56** and a p-value of less than **0.051**, this difference is statistically significant. 

These findings suggest that having accidental and medical insurance leads to increased spending. This implies that the security of a financial safety net encourages individuals to spend more, potentially stimulating economic activity and generating a multiplier effect throughout the economy.

### Factoring for different variable

However, the average treatment effect does not account for the heterogeneity in treatment effects across different subpopulations. Do male and female benefit equally from the program? Are there differences in treatment effects based on age, education level, or other demographic factors? 

I will analyze further by sub grouping the demographic variables.

### Male and Female

#### Treatment impact on Male

In [7]:
#Treatment and Control Impact by Gender
treated_male = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 1)
).select("expenditure").to_numpy()

controlled_male = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==1)
).select("expenditure").to_numpy()


In [8]:
p_test(treated_male, controlled_male)

T-Test Results:
Mean Expenditure (Treated): 55069.0839
Mean Expenditure (Controlled): 53402.7015
T-statistic: 2.2370
P-value: 0.0253


#### Female

#### Treatment impact on Female

In [9]:
#Treatment and Control Impact by Gender
treated_female = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 0)
).select("expenditure").to_numpy()

controlled_female = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==0)
).select("expenditure").to_numpy()

In [10]:
p_test(treated_female, controlled_female)

T-Test Results:
Mean Expenditure (Treated): 53791.6865
Mean Expenditure (Controlled): 52723.9074
T-statistic: 1.3470
P-value: 0.1780


The results show an increase in mean expenditure for both males and females. However, the increase is **statistically significant at the 5% level** only for males. This raises an important question: **should the program prioritize targeting males to maximize its overall impact?**

### Employment Status

#### Employed

In [11]:
treated_employed = df.filter(
    (pl.col('treatment')==1) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

controlled_employed = df.filter(
    (pl.col('treatment')==0) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

In [12]:
p_test(treated_employed, controlled_employed)

T-Test Results:
Mean Expenditure (Treated): 88414.1740
Mean Expenditure (Controlled): 87370.9478
T-statistic: 1.9788
P-value: 0.0478


#### Unemployed

In [13]:
treated_unemployed = df.filter(
    (pl.col('treatment')==1) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

controlled_unemployed = df.filter(
    (pl.col('treatment')==0) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

In [14]:
p_test(treated_unemployed, controlled_unemployed)

T-Test Results:
Mean Expenditure (Treated): 6597.3795
Mean Expenditure (Controlled): 6236.2265
T-statistic: 4.7301
P-value: 0.0000


The results indicate that both employed and unemployed individuals increase their expenditure when treated. However, the **proportional increase is greater for the unemployed**, and the **very low p-value** suggests this effect is highly statistically significant. This finding aligns with intuition—**unemployed individuals are likely to benefit more from social security support**. It's also worth noting that, on average, **unemployed individuals spend significantly less than those who are employed**, highlighting the potential for targeted interventions to reduce inequality.

#### Having house or not

In [15]:
treated_house = df.filter(
    (pl.col('treatment')==1) & (pl.col('ownes_house')==1)
).select('expenditure').to_numpy()

controlled_house = df.filter(
    (pl.col('treatment')==0) & (pl.col('ownes_house')==1)
).select('expenditure').to_numpy()

In [16]:
p_test(treated_house, controlled_house)

T-Test Results:
Mean Expenditure (Treated): 54246.9453
Mean Expenditure (Controlled): 53244.6892
T-statistic: 1.0138
P-value: 0.3107


In [17]:
treated_rental = df.filter(
    (pl.col('treatment')==1) & (pl.col('ownes_house')==0)
).select('expenditure').to_numpy()

controlled_rental = df.filter(
    (pl.col('treatment')==0) & (pl.col('ownes_house')==0)
).select('expenditure').to_numpy()

In [18]:
p_test(treated_rental, controlled_rental)

T-Test Results:
Mean Expenditure (Treated): 54586.5793
Mean Expenditure (Controlled): 53028.8203
T-statistic: 2.3971
P-value: 0.0165



When comparing homeowners to renters, the data shows that **renters experience a more significant increase in expenditure** after receiving the treatment. This result isn't surprising—**owning a home is often an indicator of greater financial stability and existing spending power**. In contrast, **renters may benefit more from additional support**, as they might have more immediate needs and fewer financial resources.

### Further Subgroup Analysis


Now, let’s take the subgroup analysis a step further. What happens when we combine characteristics—like looking at treatment effects based on both **gender** and **employment status**? For instance, how does a **treated male who is unemployed** compare to a **treated female who is also unemployed** and **treated male who is employed** vs **treated female who is also unemployed** and in terms of expenditure change? These combinations open the door to a wide range of targeted insights and questions. While the number of subgroups increases quickly, analyzing just this one example illustrates the potential of more granular policy targeting.

#### For employed and gender

In [19]:
#Treated female and employed
treated_female_employed = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 0) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

controlled_female_employed = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==0) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

#Treated male and employed
treated_male_employed = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 1) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

#Control male and employed
controlled_male_employed = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==0) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

In [20]:
p_test(treated_female_employed, controlled_female_employed)

T-Test Results:
Mean Expenditure (Treated): 86569.4846
Mean Expenditure (Controlled): 86407.3314
T-statistic: 0.2111
P-value: 0.8328


In [21]:
p_test(treated_male_employed, controlled_male_employed)

T-Test Results:
Mean Expenditure (Treated): 89997.0778
Mean Expenditure (Controlled): 86407.3314
T-statistic: 4.9218
P-value: 0.0000


Among the employed group, we observe that expenditure rises significantly for males, but shows almost no change for females. Earlier, we saw that while being employed generally increases expenditure, and being female does not, the combination of the two suggests that gender has a stronger influence on expenditure than employment status—at least in this case. As a result, the expenditure increase for employed females is not statistically significant.

This could be due to several factors: for example, the proportion of females who are employed may be smaller, or their average earnings might be lower. If female incomes are relatively modest, the income effect from employment might not be strong enough to significantly boost expenditure within this subgroup.

#### For unemployed and gender

In [22]:
#Treated female and employed
treated_female_unemployed = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 0) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

controlled_female_unemployed = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==0) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

#Treated male and employed
treated_male_unemployed = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 1) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

#Control male and employed
controlled_male_unemployed = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==0) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

In [23]:
p_test(treated_female_unemployed, controlled_female_unemployed)

T-Test Results:
Mean Expenditure (Treated): 6597.8874
Mean Expenditure (Controlled): 6227.8678
T-statistic: 3.2483
P-value: 0.0012


In [24]:
p_test(treated_male_unemployed, controlled_male_unemployed)

T-Test Results:
Mean Expenditure (Treated): 6596.9595
Mean Expenditure (Controlled): 6227.8678
T-statistic: 3.4971
P-value: 0.0005


Among the unemployed, the treatment has a significant impact on expenditure for both males and females. In fact, the increase in expenditure for unemployed females is nearly as large as it is for unemployed males. This is a particularly interesting observation, as it contrasts with the results from the employed group—where the impact was only significant for males.

This raises further questions. There are many possible ways to subgroup the population to better understand who benefits the most from the treatment. However, such analysis involves extensive filtering and coding, which can be time-consuming and complex.

A more efficient approach is to run a linear regression model. This allows us to control for multiple variables simultaneously and assess how each factor influences the outcome. As we continue, I will explore how we can use this model to identify individuals who benefit the most from the treatment—and how such insights can help us design more targeted and effective interventions in the future.

## Linear Model

The RCT results demonstrated that the treatment is effective, and subgroup analysis confirmed its broad impact across different types of people. Now, we can conduct further analysis using linear regression. Linear regression will help us quantify the impact of each variable and determine whether these variables are significant in predicting expenditure.  

Unlike simple mean comparisons between treatment and control groups, a linear regression model allows us to estimate the impact of multiple variables simultaneously. This approach provides a more precise measure of the treatment effect while accounting for potential confounding factors such as age, education level, previous work experience, and employment status. By incorporating all available variables, the model helps us isolate the true effect of medical insurance on expenditure and identify key determinants of spending behavior.  

This deeper analysis will provide a more nuanced understanding of how different factors interact, offering valuable insights for policymakers and stakeholders looking to design effective interventions.

#### Using pyrsm package to port R's statistical output

The scikit-learn (sklearn) package is excellent for machine learning but lacks detailed statistical outputs that can be useful for interpretation and analysis. Instead, I will use the pyrsm library, which brings R's comprehensive statistical output capabilities into the Python environment. 

In [25]:
import pyrsm as rsm

In [26]:
#using rsm
feature_names = df.drop('expenditure').columns
dfp = df.to_pandas() #note that pyrsm requires pandas dataframe, I could not find a way to run it on polars dataframe
rsm.model.regress({"df":dfp},rvar="expenditure",evar=feature_names).summary(rmse=True, ssq= True)

Linear regression (OLS)
Data                 : df
Response variable    : expenditure
Explanatory variables: age, is_male, race, ownes_house, employment_status, employment_income, non_employment_income, treatment
Null hyp.: the effect of x on expenditure is zero
Alt. hyp.: the effect of x on expenditure is not zero

                       coefficient  std.error  t.value p.value     
Intercept                 3850.909    136.758   28.159  < .001  ***
race[Black]                107.962    109.759    0.984   0.325     
race[Hispanic]             -33.703    100.479   -0.335   0.737     
race[Other]                -34.561    146.886   -0.235   0.814     
race[White]                 77.972     93.923    0.830   0.406     
age                          3.768      4.007    0.940   0.347     
is_male                     29.948     53.792    0.557   0.578     
ownes_house                 32.166     58.227    0.552   0.581     
employment_status        -5190.535    147.308  -35.236  < .001  ***
emp

##### Summary of Linear Regression (OLS) Results
The linear regression model evaluates the impact of various factors on expenditure, using a dataset of 50,000 observations. The R-squared value of 0.985indicates that the model explains 97.8% of the variance in expenditure, suggesting a strong fit.

Key Findings:
- Income Matters Most: Both employment and non-employment income are strongly predictive of expenditure. This is economically intuitive.

- Treatment Works: The treatment is statistically and practically significant, with a ~$1,032 increase in expenditure on average. This supports rolling out the program more broadly.

- Unemployment Reduces Spending: Being unemployed strongly reduces expenditure, suggesting vulnerable populations benefit the most.

- Demographics Less Relevant: Race, gender, age, and home ownership are not statistically significant in this model. Their effects on expenditure are relatively minor or inconsistent.

### Only treated people vs Not Treated people

From the regression output, we can observe how the treatment effect significantly impacts expenditure. However, it would be interesting to explore what the outcome would have been if the treatment had never been implemented.

To begin, let's run a regression model that excludes the treatment variable. I'll divide the dataset into two groups: the treated group and the control group, and then run a separate regression for each group. By excluding the treatment variable, we can assess how the other predictors affect expenditure in both the presence and absence of the treatment.

After running both regressions, we can compare the results to see if there are any notable differences in the predictors between the treated and control groups. This will help us understand how the treatment may have altered the relationships between the predictors and expenditure.

In [27]:
df_treated = df.filter(pl.col('treatment')==1).drop('treatment')
df_controled = df.filter(pl.col('treatment')==0).drop('treatment')

#### Linear Regression for treated group only

In [28]:
feature_names_noTreat = df.drop('expenditure','treatment').columns
dft = df_treated.to_pandas()
rsm.model.regress({"df treated":dft},rvar="expenditure",evar=feature_names_noTreat).summary(rmse=True, ssq= True)

Linear regression (OLS)
Data                 : df treated
Response variable    : expenditure
Explanatory variables: age, is_male, race, ownes_house, employment_status, employment_income, non_employment_income
Null hyp.: the effect of x on expenditure is zero
Alt. hyp.: the effect of x on expenditure is not zero

                       coefficient  std.error  t.value p.value     
Intercept                 4162.760    455.904    9.131  < .001  ***
race[Black]                213.500    367.206    0.581   0.561     
race[Hispanic]            -231.865    336.747   -0.689   0.491     
race[Other]               -301.942    484.381   -0.623   0.533     
race[White]                298.268    313.562    0.951   0.342     
age                          6.084     13.904    0.438   0.662     
is_male                    183.949    181.840    1.012   0.312     
ownes_house                249.373    197.138    1.265   0.206     
employment_status        -5313.958    503.319  -10.558  < .001  ***
employ

#### Linear Regression for control group only

In [29]:
dfc = df_controled.to_pandas()
rsm.model.regress({"df controled":dfc},rvar="expenditure",evar=feature_names_noTreat).summary(rmse=True, ssq= True)

Linear regression (OLS)
Data                 : df controled
Response variable    : expenditure
Explanatory variables: age, is_male, race, ownes_house, employment_status, employment_income, non_employment_income
Null hyp.: the effect of x on expenditure is zero
Alt. hyp.: the effect of x on expenditure is not zero

                       coefficient  std.error  t.value p.value     
Intercept                 4041.565    124.726   32.403  < .001  ***
race[Black]                 76.733    100.666    0.762   0.446     
race[Hispanic]               5.172     92.121    0.056   0.955     
race[Other]                 26.784    135.233    0.198   0.843     
race[White]                 17.063     86.185    0.198   0.843     
age                          2.992      3.642    0.822   0.411     
is_male                    -10.469     49.204   -0.213   0.832     
ownes_house                -19.927     53.237   -0.374   0.708     
employment_status        -5161.413    134.406  -38.402  < .001  ***
empl


#### Overall Model Comparison

 Both models explain a very large proportion of variance in expenditure, with very high F-statistics, suggesting the models are statistically significant. However, the treated group has a higher Root Mean Square Error (RMSE) of 9,082.49 compared to the control group's 4,839.51. This indicates that there is more variability in the expenditure outcomes for the treated group compared to the control group.

Variable-Level Comparison

Gender (is_male)
In the treated group, the coefficient for gender is 183.95 with a p-value of 0.312, which is not statistically significant. Similarly, in the control group, the coefficient for gender is -16.97 with a p-value of 0.727, also not significant. This suggests that gender does not significantly influence expenditure in either group.

Home Ownership (ownes_house)
In the treated group, the coefficient for homeownership is 249.37 with a p-value of 0.206, which is not significant. In the control group, the coefficient for homeownership is 36.10 with a p-value of 0.494, also not significant. Thus, owning a house does not significantly affect expenditure in either group.

Employment Status
In both the treated and control groups, employment status has a highly significant negative impact on expenditure. The coefficients for employment status are -5,313.96 (p < 0.001) in the treated group and -5,212.03 (p < 0.001) in the control group, indicating that unemployment significantly reduces expenditure in both groups. The magnitude of the coefficient is almost identical for both groups, suggesting that unemployment remains a crucial factor regardless of treatment.

Employment Income
In both groups, employment income has a significant positive effect on expenditure. The coefficient for employment income in the treated group is 0.799 (p < 0.001), and in the control group, it is 0.699 (p < 0.001). This means that for every additional dollar of employment income, expenditure increases by approximately 70-80 cents. The effect of employment income is slightly stronger for the treated group, indicating that treatment may amplify the effect of income on expenditure.

Non-Employment Income
In both the treated and control groups, non-employment income also has a significant positive impact on expenditure. The coefficients for non-employment income are 0.448 (p < 0.001) in the treated group and 0.446 (p < 0.001) in the control group, which suggests that non-employment income consistently increases expenditure in both groups.

Age
The coefficient for age in both the treated and control groups is not statistically significant. In the treated group, the coefficient is 6.08 (p = 0.662), and in the control group, it is 0.93 (p = 0.794). This suggests that age does not have a significant effect on expenditure in either group.

Race
Race does not significantly influence expenditure in either group, except for a marginally significant result for race[White] in the control group, where the coefficient is 166.09 (p = 0.049). In the treated group, no racial category shows a significant difference in expenditure. Therefore, race has minimal to no impact on expenditure in both groups.

Key Takeaways

1. **Income (especially employment income)** is the most important predictor of spending in both groups.
2. **Unemployment** significantly reduces expenditure in both groups, indicating that employment support is critical for increasing expenditure.
3. The effect of treatment appears to increase income sensitivity, especially for employment income, meaning that treated individuals tend to respond more strongly to income changes.
4. The treated individuals have more variability in spending (higher RMSE), which may be due to the different ways they respond to the intervention.
5. **Demographics like gender, age, and race** do not significantly explain expenditure differences within either group, post-treatment or in control.

Policy Implications

1. Targeting individuals based on **employment status and income** is likely the most effective strategy for increasing expenditure.
2. A **randomized treatment across demographic groups** appears equitable, as there is no strong evidence of a demographic bias in how the treatment influences expenditure.
3. The treatment program has the potential to increase spending, but not equally for everyone—income and employment status are the most critical factors.
4. To refine targeting strategies, further analysis should consider **interaction terms**, such as treatment × employment status, to capture more nuanced effects.

##### Predicting Expenditure with Linear Regression

Although we have gained valuable insights into how each variable impacts the coefficients in our regression models, we can take the analysis a step further. To enhance our understanding, we will now use Scikit-Learn to fit a linear regression model and evaluate its predictive capabilities. By leveraging Scikit-Learn, we can assess the model's performance using various metrics such as R-squared, Mean Squared Error (MSE), and others. This will allow us to better understand how well the model generalizes to new data, which is crucial for making data-driven decisions in policy interventions.

[Note: We are using the full dataset here, which includes both treated and control groups.]

In [30]:
from sklearn.linear_model import LinearRegression,Ridge, Lasso
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


In [31]:
#### One Hot encoding categorical variables
race_array = df['race'].to_numpy()
label_encoder = LabelEncoder()
race_encoded = label_encoder.fit_transform(race_array)
X = df.with_columns(pl.Series('race_encoded', race_encoded))
X = X.drop('race')
y = X['expenditure']

In [32]:
X.head()

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,treatment,expenditure,race_encoded
i64,i64,i64,i64,f64,f64,i64,f64,i32
18,1,1,0,0.0,12000.0,0,8156.3,0
18,0,0,0,0.0,12000.0,0,9915.55,1
32,0,1,1,99157.9,0.0,0,83903.93,4
33,1,1,0,0.0,10000.0,1,11803.33,2
43,0,0,1,124127.2,0.0,1,123086.25,4


In [33]:
lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X.drop('expenditure'),y,test_size=0.25,random_state=99)
feature_names = X_train.columns

In [34]:
reg = lr.fit(X_train,y_train)

In [35]:
# Get feature names and coefficients
{'Intercept': reg.intercept_, 
              **{feature_names[i]: reg.coef_[i] for i in range(len(feature_names))}}



{'Intercept': 3750.045833338423,
 'age': 5.223903310191677,
 'is_male': 71.87384429801835,
 'ownes_house': 72.56523430901153,
 'employment_status': -5199.925387378324,
 'employment_income': 0.7909507960945765,
 'non_employment_income': 0.4556498400688992,
 'treatment': 1109.6872684667007,
 'race_encoded': 15.518792544321574}

The values are slightly different since the training set is now different from the full set that I used before

In [36]:
predicted_expenditure = reg.predict(X_test)

In [37]:
result = X_test.with_columns(
    pl.Series(y_test).alias('actual_expenditure'),
    pl.Series(predicted_expenditure).alias('pred_expenditure')
    
)
result

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,treatment,race_encoded,actual_expenditure,pred_expenditure
i64,i64,i64,i64,f64,f64,i64,i32,f64,f64
50,1,0,1,147487.0,10000.0,0,1,125990.96,120110.166713
60,1,0,1,172787.8,0.0,1,4,144926.01,136773.838893
70,1,1,1,181656.4,0.0,1,4,171010.97,143913.269391
18,0,1,1,62453.2,0.0,0,1,49217.38,48129.642991
27,0,1,0,0.0,0.0,0,2,2779.7,3994.694042
…,…,…,…,…,…,…,…,…,…
48,1,0,0,0.0,0.0,0,2,2446.25,4103.704622
30,0,0,0,0.0,10000.0,0,4,10354.48,8525.336504
22,0,0,0,0.0,0.0,1,2,2929.6,5005.69656
27,1,0,0,0.0,0.0,0,1,2500.35,3978.48386


The dataset above shows the predicted expenditure values. Let's check out how much error it makes in the test set and if it is fit for further analysis.

The root mean square error between actual expenditure and predicted expenditure

In [38]:
mse = mean_squared_error(y_test, predicted_expenditure)
rmse = mse ** 0.5
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)

Mean Squared Error: 35906504.080156565
Root Mean Squared Error: 5992.203608035742


An RMSE of 5992.20 means that, on average, the model's predictions are off by 6467.15. But how good or bad is that? Looking back at some of the model's outputs, we realized that the mean differences for particular subgroups were around 5000 to 10,000. Is an RMSE of 6467.15 a concern? Let me calculate the RMSE percentage, which should provide a clearer picture of our errors.

In [39]:
mean_actual = result["actual_expenditure"].mean()
mean_actual

53879.97462000001

In [40]:
rmse_percentage = (rmse / mean_actual) * 100
print(f"RMSE as % of Mean Expenditure: {rmse_percentage:.2f}%")

RMSE as % of Mean Expenditure: 11.12%


The RMSE is 11.12%. In most cases, a small RMSE percentage is considered a sign of good model performance. A score of less than 20% is generally regarded as good. Based on this, we can say that this model is performing well and should be suitable for further analysis.

## Scaling Up to 1 Million People

Now that we have two key insights—the treatment is effective across the board, and variables like employment status, employment income, and non-employment income have a significant impact on expenditure behavior—let's consider the next step.

Suppose the government, based on the success of the results, wants to roll out the program statewide, which has a population of roughly one million people. However, the government only has enough budget to extend the program to another 200,000 people.

We can confidently assert that the treatment increases expenditure, as supported by both the RCT and linear regression analysis. The regression results also highlight that certain factors—such as employment status and income—have strong, statistically significant effects on spending behavior.

But the key question remains: **How should the government scale up the program?** In a completely new population, who should be prioritized for treatment? Should we focus on unemployed individuals to boost their spending capacity? Or should we target the employed, who might have a greater propensity to spend and potentially generate a larger multiplier effect? Should women be prioritized as a form of economic empowerment?

These are complex economic and socio-political questions. However, with tools like **T-learners**, we can approach this challenge through the lens of **data-driven decision-making**, turning it into a prescriptive analytics problem—where we don't just evaluate impact, but optimize for it.


In [41]:
target = pl.read_csv("Datagen/new_target.csv")
target.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income
i64,i64,str,i64,i64,f64,f64
50,0,"""White""",0,0,0.0,9000.0
35,1,"""White""",0,1,108244.6,12000.0
18,1,"""Asian""",1,0,0.0,0.0
18,1,"""Hispanic""",0,0,0.0,0.0
49,1,"""Black""",0,1,137030.7,0.0


### Scenarios 1: Randomly Assigning Treatment to 200,000 People

One method would be to randomly assign the treatment to 200,000 people, given that we observed a consistent impact of the treatment across all subgroups. Let’s explore how this approach would look. 

#### I have developed a simulation program that generates the outcomes for individuals after the treatment is assigned. Think of this as a hypothetical scenario after the treatment following the experiment.

In [42]:
RCT_target = target
np.random.seed(259)
treatment = np.random.choice([1, 0], RCT_target.shape[0], p=[0.2, 0.8])
RCT_target = RCT_target.with_columns(pl.Series('treatment', treatment))
RCT_target.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income,treatment
i64,i64,str,i64,i64,f64,f64,i32
50,0,"""White""",0,0,0.0,9000.0,0
35,1,"""White""",0,1,108244.6,12000.0,1
18,1,"""Asian""",1,0,0.0,0.0,0
18,1,"""Hispanic""",0,0,0.0,0.0,0
49,1,"""Black""",0,1,137030.7,0.0,0


In [43]:
%run data_processor.ipynb  # Executes the notebook
# Then call your function directly
RCT_target = process_dataframe(RCT_target)
RCT_target.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income,treatment,expenditure
i64,i64,str,i64,i64,f64,f64,i32,f64
50,0,"""White""",0,0,0.0,9000.0,0,9986.9
35,1,"""White""",0,1,108244.6,12000.0,1,122066.69
18,1,"""Asian""",1,0,0.0,0.0,0,2159.8
18,1,"""Hispanic""",0,0,0.0,0.0,0,2159.8
49,1,"""Black""",0,1,137030.7,0.0,0,78744.31


We have now obtained the simulated expenditure values. Next, let's calculate the total increase in expenditure among the treated individuals and across the entire economy.

Total Expenditure Increase Among Treated Individuals: This will be the sum of the differences in expenditure for the treated individuals compared to their hypothetical expenditure if they had not received the treatment.

Total Expenditure for the Whole Economy (Including Both Treated and Control Groups): This represents the sum of all expenditures (both treated and control groups) in the entire population after the treatment has been applied.

In [44]:
treatment_group_1 = RCT_target.filter(pl.col('treatment') == 1)
treatment_group_0 = RCT_target.filter(pl.col('treatment') == 0)

# Mean and sum for treatment group 1
mean_treatment_1 = treatment_group_1.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_1 = treatment_group_1.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for treatment group 0
mean_treatment_0 = treatment_group_0.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_0 = treatment_group_0.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for total population
mean_total = RCT_target.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_total = RCT_target.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Output the results
print(f"Treatment Group 1 - Mean Expenditure: {mean_treatment_1}, Sum Expenditure: {sum_treatment_1}")
print(f"Treatment Group 0 - Mean Expenditure: {mean_treatment_0}, Sum Expenditure: {sum_treatment_0}")
print(f"Total - Mean Expenditure: {mean_total}, Sum Expenditure: {sum_total}")

Treatment Group 1 - Mean Expenditure: 68943.99619016469, Sum Expenditure: 13806586789.05
Treatment Group 0 - Mean Expenditure: 37351.423360508765, Sum Expenditure: 29871502021.18
Total - Mean Expenditure: 43678.088810229994, Sum Expenditure: 43678088810.229996



Treatment Group 1: The mean expenditure for individuals in the treatment group is approximately $68,943.99,  with a total expenditure of around $13.81 billion.

Treatment Group 0: The mean expenditure for individuals in the control group is approximately $37,351.42,  with a total expenditure of about $29.87 billion.

Overall: The overall mean expenditure is approximately $43,678.08,  with the total expenditure across all groups amounting to about $43.69 billion.

## Scenario 2: Using Predictive Models to Assign Treatment 

Instead of randomly selecting 200,000 people for treatment, I can use linear regression to target individuals who are likely to benefit the most. To achieve this, I will leverage both the results from the RCT and the linear regression model. This approach is known as T-learners, a type of Causal Machine Learning model.  

T-learners work by developing two separate models: one from treated group and one from the control group. These models allow us to make a prediction of two counterfactual scenarios; one where everyone receives treatment and another where no one does. This enables us to estimate the individual treatment effect for each person in the dataset. While these estimates are hypothetical, they provide valuable insight into the potential outcomes of individuals with or without treatment. It’s important to note that our previous models demonstrated strong predictive performance, with a low RMSE and a high R² statistic, suggesting that these estimates should be fairly reliable.

Here is a more step-wise breakdown of how T-learners work:

Train Two Models:

- Model 1: Train a model using the data for the treated group (where treatment = 1).

- Model 2: Train another model using the data for the control group (where treatment = 0).

Make Predictions:

- Prediction 1: Use Model 1 (treated group) to predict the outcome as if everyone received the treatment.

- Prediction 2: Use Model 2 (control group) to predict the outcome as if no one received the treatment.

Calculate the Difference:

By comparing the predicted outcomes for the treated and control groups, we can find the difference in predicted expenditure for each individual.

The individuals who show the highest difference between the treated and control predictions are likely to benefit the most from the treatment. For example, individuals with higher predicted outcomes in the treated group compared to the control group would be the ones who see the largest benefit from the insurance treatment.

In this way, the T-Learner allows us to estimate the individualized treatment effect by predicting how each person would behave in both treatment and control conditions, and then comparing those predictions.

For instance, being a woman, unemployed, and older might result in the most benefit from the treatment. The difference between the two predictions should be the highest for such a person (just an example)

I will be using the whole dataset from out earlier analysis to train the model. The reason is that I want to use all the data available to make the best prediction possible. The T-Learner will be able to learn from both treated and control groups, and then we can apply it to the new dataset.

In [45]:
X_treated = X.filter(pl.col('treatment')==1).drop('expenditure','treatment')
X_control = X.filter(pl.col('treatment')==0).drop('expenditure','treatment')

y_treated = df.filter(pl.col('treatment')==1).select('expenditure').to_numpy()
y_control = df.filter(pl.col('treatment')==0).select('expenditure').to_numpy()


In [46]:
lr= LinearRegression()
# make a linear regression model for treated group only
lr_treated = lr.fit(X_treated.to_numpy(),y_treated) 
lr= LinearRegression()
# make a linear regression model for control group only
lr_control = lr.fit(X_control.to_numpy(),y_control)


Now, using the model, I will make predictions on a new dataset representing a population of 1 million. I will generate predictions for two scenarios: one where everyone receives treatment and another where no one does. The difference between these two predictions will indicate the individuals who are likely to benefit the most from the treatment.

This is my hypothesis: those with the largest predicted difference will experience the greatest impact from the treatment. Based on this, I will select the top 200,000 individuals with the highest predicted benefit to receive the treatment. Finally, I will compare their actual (simulated) outcomes to evaluate whether this targeted approach leads to a better overall result than random assignment.

In [47]:
target_tlearner = target
target_tlearner.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income
i64,i64,str,i64,i64,f64,f64
50,0,"""White""",0,0,0.0,9000.0
35,1,"""White""",0,1,108244.6,12000.0
18,1,"""Asian""",1,0,0.0,0.0
18,1,"""Hispanic""",0,0,0.0,0.0
49,1,"""Black""",0,1,137030.7,0.0


In [None]:
#encode race from target_tlearner
race_encoded = label_encoder.fit_transform(target_tlearner['race'].to_numpy())
target_tlearner = target_tlearner.with_columns(pl.Series('race_encoded', race_encoded))
target_tlearner= target_tlearner.drop('race')

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded
i64,i64,i64,i64,f64,f64,i32
50,0,0,0,0.0,9000.0,4
35,1,0,1,108244.6,12000.0,4
18,1,1,0,0.0,0.0,0
18,1,0,0,0.0,0.0,2
49,1,0,1,137030.7,0.0,1


In [49]:
#predict outcome for all individual with the model built on treated group
predicted_expenditure_treated = lr_treated.predict(target_tlearner)
#predict outcome for all individual with the model built on control group
predicted_expenditure_control = lr_control.predict(target_tlearner)



In [50]:
target_tlearner = target_tlearner.with_columns(
    pl.Series(predicted_expenditure_treated.flatten()).alias('predicted_expenditure_treated'),
    pl.Series(predicted_expenditure_control.flatten()).alias('predicted_expenditure_control')
)

In [51]:
target_tlearner.head(10)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control
i64,i64,i64,i64,f64,f64,i32,f64,f64
50,0,0,0,0.0,9000.0,4,8700.086814,8284.063161
35,1,0,1,108244.6,12000.0,4,91328.241418,89791.513214
18,1,1,0,0.0,0.0,0,4596.710038,4097.409711
18,1,0,0,0.0,0.0,2,4508.574865,4109.400776
49,1,0,1,137030.7,0.0,1,108811.194898,107109.685601
18,0,1,1,59443.8,9000.0,4,50947.992835,49886.321986
18,0,0,1,65769.2,0.0,4,51730.069015,50817.006997
22,0,0,0,0.0,9000.0,2,8376.286623,8208.839592
18,0,0,1,61014.4,9000.0,3,51874.604561,51148.604399
49,1,0,1,187961.9,12000.0,2,154968.260549,152708.516017


In [52]:
#find the difference between the two predicted outcomes
# and add it to the dataframe
target_tlearner= target_tlearner.with_columns(
    (pl.col('predicted_expenditure_treated') - pl.col('predicted_expenditure_control')).alias('difference')
)
target_tlearner.head(5)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control,difference
i64,i64,i64,i64,f64,f64,i32,f64,f64,f64
50,0,0,0,0.0,9000.0,4,8700.086814,8284.063161,416.023653
35,1,0,1,108244.6,12000.0,4,91328.241418,89791.513214,1536.728204
18,1,1,0,0.0,0.0,0,4596.710038,4097.409711,499.300327
18,1,0,0,0.0,0.0,2,4508.574865,4109.400776,399.17409
49,1,0,1,137030.7,0.0,1,108811.194898,107109.685601,1701.509297


Based on the two model regression output, I will now assign the treatment (we can now call it a benefit since we are no longer evaluating the treatment effect) to the top 200,000 individuals with the highest predicted difference. This will hopefully allow us to see if this targeted approach yields better results than random assignment.

In [53]:
target_tlearner = target_tlearner.with_columns(
    pl.lit(0).alias("treatment")  # Create a new column with default value 0
)

# Sort the DataFrame in descending order based on predicted benefit
target_tlearner = target_tlearner.sort("difference", descending=True)

# Assign 1 to the top 200,000 people
target_tlearner = target_tlearner.with_columns(
    pl.when(pl.arange(0, target_tlearner.height) < 200000)
    .then(1)
    .otherwise(0)
    .alias("treatment")
)

target_tlearner = target_tlearner.with_columns(
    pl.Series(target['race']).alias('race')
)
target_tlearner.head(10)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control,difference,treatment,race
i64,i64,i64,i64,f64,f64,i32,f64,f64,f64,i32,str
70,1,1,1,251542.0,0.0,4,200952.865824,197447.94329,3504.922534,1,"""White"""
70,1,1,1,251358.8,0.0,4,200806.432388,197303.465647,3502.966742,1,"""White"""
70,1,1,1,249812.0,0.0,4,199570.060999,196083.607465,3486.453534,1,"""Asian"""
70,1,1,1,249606.0,0.0,4,199405.403315,195921.148979,3484.254336,1,"""Hispanic"""
70,1,1,1,250049.0,1000.0,4,200206.753567,196723.569555,3483.184012,1,"""Black"""
70,1,1,1,249918.7,1000.0,4,200102.603585,196620.81062,3481.792965,1,"""White"""
70,1,1,1,249061.8,0.0,4,198970.419277,195491.97467,3478.444608,1,"""White"""
70,1,1,1,249484.8,1000.0,4,199755.783346,196278.622577,3477.160769,1,"""Hispanic"""
70,1,1,1,248630.6,0.0,4,198625.757173,195151.915937,3473.841236,1,"""Other"""
70,1,1,1,248520.2,0.0,4,198537.513443,195064.850807,3472.662636,1,"""Hispanic"""


Now, using the simulation, let see what impact this targeted approach has on the total expenditure. 

In [54]:
%run data_processor.ipynb  # Executes the notebook
# Then call your function directly
target_tlearner = process_dataframe(target_tlearner)
target_tlearner.head(5)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control,difference,treatment,race,expenditure
i64,i64,i64,i64,f64,f64,i32,f64,f64,f64,i32,str,f64
70,1,1,1,251542.0,0.0,4,200952.865824,197447.94329,3504.922534,1,"""White""",197364.92
70,1,1,1,251358.8,0.0,4,200806.432388,197303.465647,3502.966742,1,"""White""",197221.37
70,1,1,1,249812.0,0.0,4,199570.060999,196083.607465,3486.453534,1,"""Asian""",196009.32
70,1,1,1,249606.0,0.0,4,199405.403315,195921.148979,3484.254336,1,"""Hispanic""",196067.9
70,1,1,1,250049.0,1000.0,4,200206.753567,196723.569555,3483.184012,1,"""Black""",197025.64


Finally, let us evaluate what the impact of this targeted approach has on the total expenditure.

In [55]:
treatment_group_1 = target_tlearner .filter(pl.col('treatment') == 1)
treatment_group_0 = target_tlearner .filter(pl.col('treatment') == 0)

# Mean and sum for treatment group 1
mean_treatment_1 = treatment_group_1.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_1 = treatment_group_1.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for treatment group 0
mean_treatment_0 = treatment_group_0.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_0 = treatment_group_0.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for total population
mean_total = target_tlearner .select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_total = target_tlearner .select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Output the results
print(f"Treatment Group 1 - Mean Expenditure: {mean_treatment_1}, Sum Expenditure: {sum_treatment_1}")
print(f"Treatment Group 0 - Mean Expenditure: {mean_treatment_0}, Sum Expenditure: {sum_treatment_0}")
print(f"Total - Mean Expenditure: {mean_total}, Sum Expenditure: {sum_total}")

Treatment Group 1 - Mean Expenditure: 119106.8910883, Sum Expenditure: 23821378217.66
Treatment Group 0 - Mean Expenditure: 32528.46748395, Sum Expenditure: 26022773987.16
Total - Mean Expenditure: 49844.15220482, Sum Expenditure: 49844152204.82


The results indicate a significant difference in expenditure between the treated and untreated groups. Individuals who received the treatment had an **average expenditure of $119,106.89**, contributing to a **total expenditure of $23.8 billion**. In contrast, those who did not receive the treatment had a **much lower average expenditure of $32,528.47**, with a **total expenditure of $26.02 billion**. 

Overall, across the entire population, the **mean expenditure was $47,822.22**, leading to a **total economic expenditure of $49.84 billion**. This highlights the substantial impact of the treatment on spending behavior.

### Comparing the two sets of results, we can see a clear difference in expenditure patterns between the **random assignment method** and the **T-learner targeting approach**:


#### Treatment Group (Assigned Treatment)

1. Random Assignment: Mean expenditure = $68,943.99, Total expenditure = $13.81 billion

2. T-Learner Targeting: Mean expenditure = $119,106.89, Total expenditure = $23.8 billion

1.Comparison: The T-learner approach results in a higher mean expenditure (+$50,162.90) and an increase of $9.99 billion in total expenditure, suggesting that targeting individuals most likely to benefit significantly increases the program's economic impact.

#### Control Group (No Treatment)

1. Random Assignment: Mean expenditure = $37,351.42, Total expenditure = $29.87 billion

2. T-Learner Targeting: Mean expenditure = $32,528.47, Total expenditure = $26.02 billion

Comparison: The mean expenditure in the control group is slightly lower (-$4,822.95) when using the T-learner method, and total expenditure decreases by about $3.85 billion. 

You can see the gains in the treatment group are much much more larger than the losses in the control group. While the average expenditure in the control group is lower by around $5000, the average expenditure in the treatment group is higher by around $50,000 (almost 10 times). This suggests that the T-learner method is more effective in targeting individuals who are likely to benefit from the treatment, leading to a more efficient allocation of resources.

#### Overall Economic Impact

1. Random Assignment: Mean expenditure = $43,678.08, Total expenditure = $43.69 billion

2. T-Learner Targeting: Mean expenditure = $47,822.22, Total expenditure = $47.82 billion

Comparison: The T-learner method results in an increase of $4,144.14 in mean expenditure and a higher total expenditure by $4.13 billion, indicating a more efficient allocation of resources.

However, it is worth noting that while the T-learner approach significantly increases expenditure within the treatment group, the overall impact on the entire economy is comparatively smaller. This is primarily because only 20% of the population is treated, meaning the control group carries more weight in the overall expenditure calculation, thus lowering the mean expenditure. Nevertheless, it is still evident how the T-learner method improves both mean expenditures and total expenditures by targeting those who are most likely to benefit.

If we were to scale the program to a larger population, the impact could become much more substantial, as the proportion of treated individuals would increase, leading to a larger increase in total expenditures.

However, this also brings up an important policy consideration: fairness vs. equity. Randomized assignment is often viewed as a fair approach because everyone has an equal chance of receiving the treatment, regardless of their characteristics. On the other hand, the T-learner method focuses on those most likely to benefit, making it a more equitable strategy, as resources are allocated where they can generate the greatest impact.

Ultimately, the choice between these approaches depends on the policy goal: maximizing overall impact (T-learner) or ensuring equal access (random assignment).

#### Conclusion

The T-learner approach outperforms random assignment by targeting individuals who are expected to benefit the most from treatment. This leads to a higher mean expenditure per treated individual and a greater overall economic impact. By prioritizing those with the highest predicted treatment effect, the program can maximize its effectiveness while using the same budget constraints.