# A Case for using Machine Learning Models to Improve Causal Inference in Economics

In this notebook, I will provide a detailed analysis of a randomized controlled experiment and evaluate its impact. Additionally, I will explore meta-learners, with a primary focus on the T-Learner model.

Randomized controlled trials (RCTs) are widely regarded as the gold standard for causal impact evaluation. However, they are primarily limited to estimating average treatment effects (ATE). In cases where scaling up a program involves substantial costs, it is often more valuable to estimate conditional treatment effects (CTE), which help identify the subpopulations that benefit the most from the program.

This dataset enables a two-part analysis. First, I will assess the effectiveness of the program using the randomized controlled experiment. Then, leveraging the rich set of descriptive features available, I will build a predictive model. Specifically, I will implement T-learners to perform causal predictive analytics, allowing for a more granular understanding of the treatment effect across different subgroups.
With the results from the RCT and T-Learners, I will apply these insights to a new dataset to evaluate how the program can be effectively scaled up.

#### Import Libraries

In [1]:
import polars as pl
from scipy.stats import ttest_ind
import numpy as np
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

#### Read the experiment dataset

In [2]:
df = pl.read_csv("dataset.csv")
df.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income,treatment,expenditure
i64,i64,str,i64,i64,f64,f64,i64,f64
32,0,"""Black""",0,1,104995.2,0.0,1,92102.73
69,1,"""Asian""",1,1,190858.9,12000.0,0,139499.88
18,0,"""Hispanic""",0,0,0.0,0.0,0,2332.43
44,1,"""White""",0,1,174413.2,0.0,0,112242.89
23,0,"""White""",0,0,0.0,12000.0,1,13224.75


This dataset contains information about a legislation aimed at providing accidental and medical coverage to individuals in order to increase aggregate expenditure. The underlying rationale behind the experiment is that individuals with access to medical insurance are more likely to spend money without the concern of saving for unexpected emergencies. To test this hypothesis, the government launched a pilot program through a randomized controlled trial (RCT).

The dataset includes information from this RCT, designed to assess the impact of accidental and medical insurance on employment outcomes. It contains various demographic and background characteristics, such as age, education level, and prior work experience, along with treatment assignment (indicating whether an individual received medical insurance) and the outcome variable (employment status after the program).

20% of the population of this experiment are treated.

## Result of Randomized Controlled Trials (RCTs) in Expenditures 

The experiment is designed as a robust randomized controlled trial. Now, let’s examine the average treatment effect on total expenditure.

In [3]:
treated = (df
.filter(pl.col('treatment') == 1)
.select('expenditure').to_numpy()
)

controlled = (df
.filter(pl.col('treatment') == 0)
.select('expenditure').to_numpy()
)


#### Function to run  t-test and output t-test values

In [4]:
def p_test(treated, controlled):
    t_stat, p_val = ttest_ind(treated, controlled)
    # Store results in a dictionary
    results = {
        "Mean Expenditure (Treated)": treated.mean(),
        "Mean Expenditure (Controlled)": controlled.mean(),
        "T-statistic": t_stat[0],
        "P-value": p_val[0]
    }

    # Print readable output
    print("T-Test Results:")
    for key, value in results.items():
        print(f"{key}: {value:.4f}")

In [5]:
p_test(treated,controlled)

T-Test Results:
Mean Expenditure (Treated): 55040.9798
Mean Expenditure (Controlled): 47944.4373
T-statistic: 14.3681
P-value: 0.0000


The mean expenditure for the treatment group (those who received medical insurance) is 55040.97, while the mean for the control group (those without insurance) is 47,944.44. The difference between the two groups is **7106**. With a t-statistic of **14.3681** and a p-value of less than **0.001**, this difference is statistically significant. 

These findings suggest that having accidental and medical insurance leads to increased spending. This implies that the security of a financial safety net encourages individuals to spend more, potentially stimulating economic activity and generating a multiplier effect throughout the economy.

### Factoring for different variable

#### Male

Now, I will examine whether there is variation within subgroups. The treatment effect may be influenced by certain factors, which we can identify through subgroup analysis.

In [6]:
#Treatment and Control Impact by Gender
treated_male = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 1)
).select("expenditure").to_numpy()

controlled_male = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==1)
).select("expenditure").to_numpy()


In [7]:
p_test(treated_male, controlled_male)

T-Test Results:
Mean Expenditure (Treated): 55091.0328
Mean Expenditure (Controlled): 48511.5612
T-statistic: 9.7950
P-value: 0.0000


#### Female

In [8]:
#Treatment and Control Impact by Gender
treated_female = df.filter(
    (pl.col('treatment') ==1) & (pl.col('is_male') == 0)
).select("expenditure").to_numpy()

controlled_female = df.filter(
    (pl.col('treatment')==0) & (pl.col('is_male')==0)
).select("expenditure").to_numpy()

In [9]:
p_test(treated_female, controlled_female)

T-Test Results:
Mean Expenditure (Treated): 54980.5565
Mean Expenditure (Controlled): 47254.9465
T-statistic: 10.6050
P-value: 0.0000


As seen above, both males and females experience a significant increase in expenditure when treated. In the control group, men have a higher mean expenditure than women. While the absolute increase in expenditure is similar for both genders, women show a slightly higher proportional increase compared to their respective control group.

#### Employment Status

#### On the employed

In [10]:
treated_employed = df.filter(
    (pl.col('treatment')==1) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

controlled_employed = df.filter(
    (pl.col('treatment')==0) & (pl.col('employment_status')==1)
).select("expenditure").to_numpy()

In [11]:
p_test(treated_employed, controlled_employed)

T-Test Results:
Mean Expenditure (Treated): 88545.3040
Mean Expenditure (Controlled): 77207.2134
T-statistic: 23.5961
P-value: 0.0000


#### On the unemployed

In [12]:
treated_unemployed = df.filter(
    (pl.col('treatment')==1) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

controlled_unemployed = df.filter(
    (pl.col('treatment')==0) & (pl.col('employment_status')==0)
).select("expenditure").to_numpy()

In [13]:
p_test(treated_unemployed, controlled_unemployed)

T-Test Results:
Mean Expenditure (Treated): 6837.4546
Mean Expenditure (Controlled): 6268.7604
T-statistic: 7.2956
P-value: 0.0000


Another subgroup of interest is employment status. As expected, employed individuals have higher expenditures than unemployed individuals in both the control and treatment groups. Furthermore, even after receiving treatment, the unemployed do not increase their spending as much as the employed (both absolute and proportion amount). If the goal is to support unemployed individuals to encourage higher spending, this policy is somewhat effective, but it is clear that employed individuals are the bigger beneficiaries.

#### Having house or not

In [14]:
treated_house = df.filter(
    (pl.col('treatment')==1) & (pl.col('ownes_house')==1)
).select('expenditure').to_numpy()

controlled_house = df.filter(
    (pl.col('treatment')==0) & (pl.col('ownes_house')==1)
).select('expenditure').to_numpy()

In [15]:
p_test(treated_house, controlled_house)

T-Test Results:
Mean Expenditure (Treated): 54822.1543
Mean Expenditure (Controlled): 47645.9506
T-statistic: 7.9444
P-value: 0.0000


In [16]:
treated_rental = df.filter(
    (pl.col('treatment')==1) & (pl.col('ownes_house')==0)
).select('expenditure').to_numpy()

controlled_rental = df.filter(
    (pl.col('treatment')==0) & (pl.col('ownes_house')==0)
).select('expenditure').to_numpy()

In [17]:
p_test(treated_rental, controlled_rental)

T-Test Results:
Mean Expenditure (Treated): 55135.0317
Mean Expenditure (Controlled): 48073.1068
T-statistic: 11.9713
P-value: 0.0000


Both homeowners and non-homeowners benefit significantly from the treatment. This finding suggests that the treatment effect is broad and has a positive impact across all groups in the experiment

## Linear Model

The RCT results demonstrated that the treatment is effective, and subgroup analysis confirmed its broad impact across different types of people. Now, we can conduct further analysis using linear regression. Linear regression will help us quantify the impact of each variable and determine whether these variables are significant in predicting expenditure.  

Unlike simple mean comparisons between treatment and control groups, a linear regression model allows us to estimate the impact of multiple variables simultaneously. This approach provides a more precise measure of the treatment effect while accounting for potential confounding factors such as age, education level, previous work experience, and employment status. By incorporating all available variables, the model helps us isolate the true effect of medical insurance on expenditure and identify key determinants of spending behavior.  

This deeper analysis will provide a more nuanced understanding of how different factors interact, offering valuable insights for policymakers and stakeholders looking to design effective interventions.

In [18]:
from sklearn.linear_model import LinearRegression,Ridge, Lasso
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


#### Using pyrsm package to port R's statistical output

The scikit-learn (sklearn) package is excellent for machine learning but lacks detailed statistical outputs that can be useful for interpretation and analysis. Instead, I will use the pyrsm library, which brings R's comprehensive statistical output capabilities into the Python environment. 

In [19]:
import pyrsm as rsm

In [20]:
#using rsm
feature_names = df.drop('expenditure').columns
dfp = df.to_pandas() #note that pyrsm requires pandas dataframe, I could not find a way to run it on polars dataframe
rsm.model.regress({"df":dfp},rvar="expenditure",evar=feature_names).summary(rmse=True, ssq= True)

Linear regression (OLS)
Data                 : df
Response variable    : expenditure
Explanatory variables: age, is_male, race, ownes_house, employment_status, employment_income, non_employment_income, treatment
Null hyp.: the effect of x on expenditure is zero
Alt. hyp.: the effect of x on expenditure is not zero

                       coefficient  std.error  t.value p.value     
Intercept                 2658.000    147.086   18.071  < .001  ***
race[Black]                176.270    117.985    1.494   0.135     
race[Hispanic]              83.079    107.857    0.770   0.441     
race[Other]                116.452    159.066    0.732   0.464     
race[White]                150.733    100.811    1.495   0.135     
age                          5.685      4.303    1.321   0.186     
is_male                    -59.410     58.313   -1.019   0.308     
ownes_house                109.335     63.230    1.729   0.084    .
employment_status        -5087.014    159.240  -31.946  < .001  ***
emp

##### Summary of Linear Regression (OLS) Results
The linear regression model evaluates the impact of various factors on expenditure, using a dataset of 50,000 observations. The R-squared value of 0.978 indicates that the model explains 97.8% of the variance in expenditure, suggesting a strong fit.

Key Findings:
- Treatment Effect: The treatment variable has the largest and most significant impact on expenditure (coefficient = 6847.9, p < 0.001), confirming that receiving insurance significantly increases spending.

- Income Impact: Both employment income (coefficient = 0.717, p < 0.001) and non-employment income (coefficient = 0.449, p < 0.001) are strong predictors of expenditure.

- Employment Status: Being employed decreases expenditure (coefficient = -5087, p < 0.001), (which is strange)
- Housing : Owning a house (p = 0.085) and race (p = 0.3) have weaker and less statistically significant effects.

- Gender and Age: Gender (p = 0.307) and age (p = 0.187) are not statistically significant, meaning their effects on expenditure are minimal.

Model Performance:
- The F-statistic (283,844.4, p < 0.001) confirms the model is highly significant overall.

The Root Mean Square Error (RMSE) of 6484.77 suggests that, on average, predictions deviate by this amount from actual expenditures.

#### Let's see how the regression result would have been in absence of any treatment

From the regression output, we can see how the treatment effect significantly impacts expenditure. But wouldn’t it be interesting to explore what the outcome would have been if the treatment had never been implemented?

To begin, let's run a regression model that excludes the treatment. So, I will divide the dataset between treated and control group and run the regression. The output will show how these variables would have affected the expenditure in presence or absense of treatment. After running both the regression, we can see if there is any notable difference in predictor between those in treatment and those in control

### Only treated people vs Not Treated people

From the regression output, we can observe how the treatment effect significantly impacts expenditure. However, it would be interesting to explore what the outcome would have been if the treatment had never been implemented.

To begin, let's run a regression model that excludes the treatment variable. I'll divide the dataset into two groups: the treated group and the control group, and then run a separate regression for each group. By excluding the treatment variable, we can assess how the other predictors affect expenditure in both the presence and absence of the treatment.

After running both regressions, we can compare the results to see if there are any notable differences in the predictors between the treated and control groups. This will help us understand how the treatment may have altered the relationships between the predictors and expenditure.

In [21]:
df_treated = df.filter(pl.col('treatment')==1).drop('treatment')
df_controled = df.filter(pl.col('treatment')==0).drop('treatment')

#### Linear Regression for treated group only

In [22]:
feature_names_noTreat = df.drop('expenditure','treatment').columns
dft = df_treated.to_pandas()
rsm.model.regress({"df treated":dft},rvar="expenditure",evar=feature_names_noTreat).summary(rmse=True, ssq= True)

Linear regression (OLS)
Data                 : df treated
Response variable    : expenditure
Explanatory variables: age, is_male, race, ownes_house, employment_status, employment_income, non_employment_income
Null hyp.: the effect of x on expenditure is zero
Alt. hyp.: the effect of x on expenditure is not zero

                       coefficient  std.error  t.value p.value     
Intercept                 4012.755    475.529    8.439  < .001  ***
race[Black]                360.091    375.278    0.960   0.337     
race[Hispanic]            -166.056    344.996   -0.481    0.63     
race[Other]               -217.746    506.380   -0.430   0.667     
race[White]                -38.207    321.357   -0.119   0.905     
age                         21.105     14.267    1.479   0.139     
is_male                    -98.186    186.981   -0.525     0.6     
ownes_house                413.863    202.886    2.040   0.041    *
employment_status        -4461.077    514.903   -8.664  < .001  ***
employ

#### Linear Regression for control group only

In [23]:
dfc = df_controled.to_pandas()
rsm.model.regress({"df controled":dfc},rvar="expenditure",evar=feature_names_noTreat).summary(rmse=True, ssq= True)

Linear regression (OLS)
Data                 : df controled
Response variable    : expenditure
Explanatory variables: age, is_male, race, ownes_house, employment_status, employment_income, non_employment_income
Null hyp.: the effect of x on expenditure is zero
Alt. hyp.: the effect of x on expenditure is not zero

                       coefficient  std.error  t.value p.value     
Intercept                 4017.127    121.580   33.041  < .001  ***
race[Black]                 75.942     98.595    0.770   0.441     
race[Hispanic]             140.919     90.001    1.566   0.117     
race[Other]                199.731    132.891    1.503   0.133     
race[White]                166.092     84.195    1.973   0.049    *
age                          0.927      3.560    0.260   0.794     
is_male                    -16.965     48.632   -0.349   0.727     
ownes_house                 36.100     52.723    0.685   0.494     
employment_status        -5212.025    132.557  -39.319  < .001  ***
empl

##### Significance of Predictors:

Owns_house: This variable is significant in the treated group (p = 0.04), but not in the control group (p = 0.492). This suggests that owning a house has a stronger effect on expenditure when individuals are receiving social welfare benefits. For those in the treatment group, housing status seems to be a more important factor in determining spending behavior.

Race_encoded: This variable is significant in the control group (p = 0.045), but not in the treated group (p = 0.492). This may suggest that the treatment (i.e., receiving welfare) has a leveling effect on expenditure across different racial groups, reducing racial disparities in spending behavior. In the absence of treatment, race appears to have a more noticeable impact on expenditure.

Employment Status:
Both the treated and control groups show significant negative effects for employment status (p < 0.001). However, the coefficient for the control group is more negative (-5211.47) compared to the treated group (-4476.49). This might suggest that employed individuals in the control group have a larger decrease in expenditure than those in the treated group, which is somewhat counter-intuitive.

A possible explanation for this could be that employment status and employment income are highly correlated. This suggests that one of these variables might be capturing similar information. Therefore, to improve the model and reduce multicollinearity, we should consider dropping one of these variables. For further analysis, I will drop the employment status variable to assess the model more effectively.

Income Variables:
Employment_income and non_employment_income both have significant positive coefficients in both groups, with the coefficients being slightly higher in the treated group. This suggests that income, regardless of its source, plays a crucial role in determining expenditure in both the treated and control groups. The marginally higher coefficients in the treated group indicate that income might have a more pronounced effect on expenditure when individuals have access to social welfare.

##### Running the Regression Estimate Again with Predictive Capabilities
Now, let's take the modeling a step further by using Scikit-Learn to fit the linear regression model and also evaluate its predictive capabilities.

In [24]:
#### One Hot encoding categorical variables
race_array = df['race'].to_numpy()
label_encoder = LabelEncoder()
race_encoded = label_encoder.fit_transform(race_array)
X = df.with_columns(pl.Series('race_encoded', race_encoded))
X = X.drop('race')
y = X['expenditure']

In [25]:
X.head()

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,treatment,expenditure,race_encoded
i64,i64,i64,i64,f64,f64,i64,f64,i32
32,0,0,1,104995.2,0.0,1,92102.73,1
69,1,1,1,190858.9,12000.0,0,139499.88,0
18,0,0,0,0.0,0.0,0,2332.43,2
44,1,0,1,174413.2,0.0,0,112242.89,4
23,0,0,0,0.0,12000.0,1,13224.75,4


In [26]:
lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X.drop('expenditure'),y,test_size=0.25,random_state=99)
feature_names = X_train.columns

In [27]:
reg = lr.fit(X_train,y_train)

In [28]:
# Get feature names and coefficients
{'Intercept': reg.intercept_, 
              **{feature_names[i]: reg.coef_[i] for i in range(len(feature_names))}}



{'Intercept': 2720.851986918402,
 'age': 6.240473745453834,
 'is_male': -39.793653190603855,
 'ownes_house': 45.97058389654647,
 'employment_status': -5066.151702257103,
 'employment_income': 0.7164030483992931,
 'non_employment_income': 0.45243828695602145,
 'treatment': 6777.709694468725,
 'race_encoded': 11.843171893217766}

The values are slightly different since the training set is not different from the full set that I used before

In [29]:
predicted_expenditure = reg.predict(X_test)

In [30]:
result = X_test.with_columns(
    pl.Series(y_test).alias('actual_expenditure'),
    pl.Series(predicted_expenditure).alias('pred_expenditure')
    
)
result

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,treatment,race_encoded,actual_expenditure,pred_expenditure
i64,i64,i64,i64,f64,f64,i64,i32,f64,f64
36,0,0,1,110679.4,0.0,1,4,80617.68,83995.499277
22,1,1,0,0.0,10000.0,1,4,9517.9,14213.784592
21,1,0,1,76957.2,0.0,0,4,48806.14,52925.701944
22,1,0,1,81637.1,0.0,0,4,64406.44,56284.637044
55,0,1,0,0.0,5000.0,0,4,6397.49,5419.612749
…,…,…,…,…,…,…,…,…,…
18,0,0,0,0.0,12000.0,1,4,10637.32,15087.52234
18,0,0,0,0.0,0.0,1,4,3201.58,9658.262896
68,1,0,1,250590.3,10000.0,0,4,193134.09,182134.669223
28,0,0,0,0.0,0.0,0,4,2690.25,2942.957939


The dataset above shows the predicted expenditure values. Let's check out how much error it makes in the test set and if it is fit for further analysis.

The root mean square error between actual expenditure and predicted expenditure

In [31]:
mse = mean_squared_error(y_test, predicted_expenditure)
rmse = mse ** 0.5
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)

Mean Squared Error: 41824028.693792306
Root Mean Squared Error: 6467.149966854975


An RMSE of 6467.15 means that, on average, the model's predictions are off by 6467.15. But how good or bad is that? Looking back at some of the model's outputs, we realized that the mean differences for particular subgroups were around 5000 to 10,000. Is an RMSE of 6467.15 a concern? Let me calculate the RMSE percentage, which should provide a clearer picture of our errors.

In [32]:
mean_actual = result["actual_expenditure"].mean()
mean_actual

49758.84115200001

In [33]:
rmse_percentage = (rmse / mean_actual) * 100
print(f"RMSE as % of Mean Expenditure: {rmse_percentage:.2f}%")

RMSE as % of Mean Expenditure: 13.00%


The RMSE is 13%. In most cases, a small RMSE percentage is considered a sign of good model performance. A score of less than 20% is generally regarded as good. Based on this, we can say that our model is performing well.

## Scaling Up to 1 Million People

Now that we have two key insights—the treatment is effective across the board, and variables like employment status, employment income, and non-employment income have a significant impact on expenditure behavior—let's consider the next step.

Suppose the government, based on the success of the results, wants to roll out the program statewide, which has a population of roughly one million people. However, the government only has enough budget to extend the program to another 200,000 people.

We can confidently say that the treatment will increase expenditure, as both the RCT and the linear regression analysis have confirmed this. Additionally, the linear regression showed that certain variables, such as employment status and income, have large coefficients and significantly influence expenditure.

So, how should the government scale up the program?



In [34]:
target = pl.read_csv("new_target.csv")
target.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income
i64,i64,str,i64,i64,f64,f64
18,1,"""Asian""",0,1,59449.6,12000.0
18,1,"""White""",0,0,0.0,1000.0
19,0,"""Black""",0,0,0.0,12000.0
18,0,"""White""",0,0,0.0,12000.0
59,0,"""White""",0,1,216939.1,0.0


### Scenarios 1: Randomly Assigning Treatment to 200,000 People

One method would be to randomly assign the treatment to 200,000 people, given that we observed a consistent impact of the treatment across all subgroups. Let’s explore how this approach would look. I have developed a program that generates the outcomes for individuals after the treatment is assigned. Think of this as a hypothetical scenario in the future, following the experiment.

In [35]:
RCT_target = target
np.random.seed(259)
treatment = np.random.choice([1, 0], RCT_target.shape[0], p=[0.2, 0.8])
RCT_target = RCT_target.with_columns(pl.Series('treatment', treatment))
RCT_target.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income,treatment
i64,i64,str,i64,i64,f64,f64,i32
18,1,"""Asian""",0,1,59449.6,12000.0,0
18,1,"""White""",0,0,0.0,1000.0,1
19,0,"""Black""",0,0,0.0,12000.0,0
18,0,"""White""",0,0,0.0,12000.0,0
59,0,"""White""",0,1,216939.1,0.0,0


In [36]:
%run data_processor.ipynb  # Executes the notebook
# Then call your function directly
RCT_target = process_dataframe(RCT_target)
RCT_target.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income,treatment,expenditure
i64,i64,str,i64,i64,f64,f64,i32,f64
18,1,"""Asian""",0,1,59449.6,12000.0,0,36261.96
18,1,"""White""",0,0,0.0,1000.0,1,3753.78
19,0,"""Black""",0,0,0.0,12000.0,0,12595.93
18,0,"""White""",0,0,0.0,12000.0,0,12595.93
59,0,"""White""",0,1,216939.1,0.0,0,124663.44


We have now obtained the simulated expenditure values. Next, let's calculate the total increase in expenditure among the treated individuals and across the entire economy.

Total Expenditure Increase Among Treated Individuals: This will be the sum of the differences in expenditure for the treated individuals compared to their hypothetical expenditure if they had not received the treatment.

Total Expenditure for the Whole Economy (Including Both Treated and Control Groups): This represents the sum of all expenditures (both treated and control groups) in the entire population after the treatment has been applied.

In [37]:
treatment_group_1 = RCT_target.filter(pl.col('treatment') == 1)
treatment_group_0 = RCT_target.filter(pl.col('treatment') == 0)

# Mean and sum for treatment group 1
mean_treatment_1 = treatment_group_1.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_1 = treatment_group_1.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for treatment group 0
mean_treatment_0 = treatment_group_0.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_0 = treatment_group_0.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for total population
mean_total = RCT_target.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_total = RCT_target.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Output the results
print(f"Treatment Group 1 - Mean Expenditure: {mean_treatment_1}, Sum Expenditure: {sum_treatment_1}")
print(f"Treatment Group 0 - Mean Expenditure: {mean_treatment_0}, Sum Expenditure: {sum_treatment_0}")
print(f"Total - Mean Expenditure: {mean_total}, Sum Expenditure: {sum_total}")

Treatment Group 1 - Mean Expenditure: 68959.48440247081, Sum Expenditure: 13809688427.47
Treatment Group 0 - Mean Expenditure: 37273.36951645656, Sum Expenditure: 29809079083.829998
Total - Mean Expenditure: 43618.76751129999, Sum Expenditure: 43618767511.299995



Treatment Group 1: The mean expenditure for individuals in the treatment group is approximately $68,959.48, with a total expenditure of around $13.81 billion.

Treatment Group 0: The mean expenditure for individuals in the control group is approximately $37,273.37, with a total expenditure of about $29.81 billion.

Overall: The overall mean expenditure is approximately $43,618.77, with the total expenditure across all groups amounting to about $43.62 billion.

## Scenario 2: Using Predictive Models to Assign Treatment 

Instead of randomly selecting 200,000 people for treatment, I can use linear regression to target individuals who are likely to benefit the most. To achieve this, I will leverage both the results from the RCT and the linear regression model. This approach is known as T-learners, a type of Causal Machine Learning model.  

T-learners work by developing two separate models: one from treated group and one from the control group. These models allow us to make a prediction of two counterfactual scenarios; one where everyone receives treatment and another where no one does. This enables us to estimate the individual treatment effect for each person in the dataset. While these estimates are hypothetical, they provide valuable insight into the potential outcomes of individuals with or without treatment. It’s important to note that our previous models demonstrated strong predictive performance, with a low RMSE and a high R² statistic, suggesting that these estimates should be fairly reliable.

Here is a more step-wise breakdown of how T-learners work:

Train Two Models:

- Model 1: Train a model using the data for the treated group (where treatment = 1).

- Model 2: Train another model using the data for the control group (where treatment = 0).

Make Predictions:

- Prediction 1: Use Model 1 (treated group) to predict the outcome as if everyone received the treatment.

- Prediction 2: Use Model 2 (control group) to predict the outcome as if no one received the treatment.

Calculate the Difference:

By comparing the predicted outcomes for the treated and control groups, we can find the difference in predicted expenditure for each individual.

The individuals who show the highest difference between the treated and control predictions are likely to benefit the most from the treatment. For example, individuals with higher predicted outcomes in the treated group compared to the control group would be the ones who see the largest benefit from the insurance treatment.

In this way, the T-Learner allows us to estimate the individualized treatment effect by predicting how each person would behave in both treatment and control conditions, and then comparing those predictions.

For instance, being a woman, unemployed, and older might result in the most benefit from the treatment. The difference between the two predictions should be the highest for such a person (just an example)

I will be using the whole dataset from out earlier analysis to train the model. The reason is that I want to use all the data available to make the best prediction possible. The T-Learner will be able to learn from both treated and control groups, and then we can apply it to the new dataset.

In [38]:
X_treated = X.filter(pl.col('treatment')==1).drop('expenditure','treatment')
X_control = X.filter(pl.col('treatment')==0).drop('expenditure','treatment')

y_treated = df.filter(pl.col('treatment')==1).select('expenditure').to_numpy()
y_control = df.filter(pl.col('treatment')==0).select('expenditure').to_numpy()


In [39]:
lr= LinearRegression()
# make a linear regression model for treated group only
lr_treated = lr.fit(X_treated.to_numpy(),y_treated) 
lr= LinearRegression()
# make a linear regression model for control group only
lr_control = lr.fit(X_control.to_numpy(),y_control)


Now, using the model, I will make predictions on a new dataset representing a population of 1 million. I will generate predictions for two scenarios: one where everyone receives treatment and another where no one does. The difference between these two predictions will indicate the individuals who are likely to benefit the most from the treatment.

This is my hypothesis: those with the largest predicted difference will experience the greatest impact from the treatment. Based on this, I will select the top 50,000 individuals with the highest predicted benefit to receive the treatment. Finally, I will compare their actual (simulated) outcomes to evaluate whether this targeted approach leads to a better overall result than random assignment.

In [40]:
target_tlearner = target
target_tlearner.head(5)

age,is_male,race,ownes_house,employment_status,employment_income,non_employment_income
i64,i64,str,i64,i64,f64,f64
18,1,"""Asian""",0,1,59449.6,12000.0
18,1,"""White""",0,0,0.0,1000.0
19,0,"""Black""",0,0,0.0,12000.0
18,0,"""White""",0,0,0.0,12000.0
59,0,"""White""",0,1,216939.1,0.0


In [41]:
#encode race from target_tlearner
race_encoded = label_encoder.fit_transform(target_tlearner['race'].to_numpy())
target_tlearner = target_tlearner.with_columns(pl.Series('race_encoded', race_encoded))
target_tlearner= target_tlearner.drop('race')
target_tlearner.head(5)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded
i64,i64,i64,i64,f64,f64,i32
18,1,0,1,59449.6,12000.0,0
18,1,0,0,0.0,1000.0,4
19,0,0,0,0.0,12000.0,1
18,0,0,0,0.0,12000.0,4
59,0,0,1,216939.1,0.0,4


In [42]:
#predict outcome for all individual with the model built on treated group
predicted_expenditure_treated = lr_treated.predict(target_tlearner)
#predict outcome for all individual with the model built on control group
predicted_expenditure_control = lr_control.predict(target_tlearner)



In [43]:
target_tlearner = target_tlearner.with_columns(
    pl.Series(predicted_expenditure_treated.flatten()).alias('predicted_expenditure_treated'),
    pl.Series(predicted_expenditure_control.flatten()).alias('predicted_expenditure_control')
)

In [44]:
target_tlearner.head(10)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control
i64,i64,i64,i64,f64,f64,i32,f64,f64
18,1,0,1,59449.6,12000.0,0,52501.939139,45741.457678
18,1,0,0,0.0,1000.0,4,4696.871974,4642.01674
19,0,0,0,0.0,12000.0,1,10215.128617,9468.894997
18,0,0,0,0.0,12000.0,4,10060.754148,9569.934858
59,0,0,1,216939.1,0.0,4,171616.699071,150609.609621
18,0,1,1,58568.7,0.0,3,46446.636512,39923.485584
60,0,0,0,0.0,0.0,0,5371.830229,4116.57903
70,0,0,0,0.0,0.0,2,5491.537127,4194.145118
50,0,1,1,139415.0,0.0,0,110944.544022,96337.223454
18,1,0,0,0.0,0.0,2,4307.248708,4127.55669


In [45]:
#find the difference between the two predicted outcomes
# and add it to the dataframe
target_tlearner= target_tlearner.with_columns(
    (pl.col('predicted_expenditure_treated') - pl.col('predicted_expenditure_control')).alias('difference')
)
target_tlearner.head(5)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control,difference
i64,i64,i64,i64,f64,f64,i32,f64,f64,f64
18,1,0,1,59449.6,12000.0,0,52501.939139,45741.457678,6760.481462
18,1,0,0,0.0,1000.0,4,4696.871974,4642.01674,54.855233
19,0,0,0,0.0,12000.0,1,10215.128617,9468.894997,746.233621
18,0,0,0,0.0,12000.0,4,10060.754148,9569.934858,490.81929
59,0,0,1,216939.1,0.0,4,171616.699071,150609.609621,21007.08945


Based on the two model regression output, I will now assign the treatment (we can now call it a benefit since we are no longer evaluating the treatment effect) to the top 200,000 individuals with the highest predicted difference. This will hopefully allow us to see if this targeted approach yields better results than random assignment.

In [46]:
target_tlearner = target_tlearner.with_columns(
    pl.lit(0).alias("treatment")  # Create a new column with default value 0
)

# Sort the DataFrame in descending order based on predicted benefit
target_tlearner = target_tlearner.sort("difference", descending=True)

# Assign 1 to the top 200,000 people
target_tlearner = target_tlearner.with_columns(
    pl.when(pl.arange(0, target_tlearner.height) < 200000)
    .then(1)
    .otherwise(0)
    .alias("treatment")
)

target_tlearner = target_tlearner.with_columns(
    pl.Series(target['race']).alias('race')
)
target_tlearner.head(10)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control,difference,treatment,race
i64,i64,i64,i64,f64,f64,i32,f64,f64,f64,i32,str
70,1,1,1,253594.0,9000.0,1,205484.526377,180165.454961,25319.071416,1,"""Asian"""
70,0,1,1,254046.6,0.0,0,201676.988032,176446.348398,25230.639634,1,"""White"""
70,1,1,1,250818.5,12000.0,1,204733.664584,179565.675193,25167.989391,1,"""Black"""
70,1,1,1,253753.3,9000.0,4,205476.530839,180378.750239,25097.780601,1,"""White"""
70,1,1,1,248760.6,12000.0,0,203156.80402,178093.877408,25062.926612,1,"""White"""
70,1,1,1,251454.7,9000.0,2,203754.530672,178704.782998,25049.747674,1,"""Other"""
70,1,1,1,248541.0,12000.0,0,202983.787334,177940.448847,25043.338487,1,"""Asian"""
70,1,1,1,249611.7,9000.0,0,202391.484978,177349.131115,25042.353863,1,"""Hispanic"""
70,1,1,1,249244.4,12000.0,1,203493.475303,178465.894237,25027.581066,1,"""Asian"""
70,1,1,1,251846.8,12000.0,4,205410.329796,180386.117191,25024.212606,1,"""Hispanic"""


Now, using the simulation, let see what impact this targeted approach has on the total expenditure. 

In [47]:
%run data_processor.ipynb  # Executes the notebook
# Then call your function directly
target_tlearner = process_dataframe(target_tlearner)
target_tlearner.head(5)

age,is_male,ownes_house,employment_status,employment_income,non_employment_income,race_encoded,predicted_expenditure_treated,predicted_expenditure_control,difference,treatment,race,expenditure
i64,i64,i64,i64,f64,f64,i32,f64,f64,f64,i32,str,f64
70,1,1,1,253594.0,9000.0,1,205484.526377,180165.454961,25319.071416,1,"""Asian""",201318.37
70,0,1,1,254046.6,0.0,0,201676.988032,176446.348398,25230.639634,1,"""White""",199927.49
70,1,1,1,250818.5,12000.0,1,204733.664584,179565.675193,25167.989391,1,"""Black""",200495.37
70,1,1,1,253753.3,9000.0,4,205476.530839,180378.750239,25097.780601,1,"""White""",201443.19
70,1,1,1,248760.6,12000.0,0,203156.80402,178093.877408,25062.926612,1,"""White""",198312.82


Finally, let us evaluate what the impact of this targeted approach has on the total expenditure.

In [48]:
treatment_group_1 = target_tlearner .filter(pl.col('treatment') == 1)
treatment_group_0 = target_tlearner .filter(pl.col('treatment') == 0)

# Mean and sum for treatment group 1
mean_treatment_1 = treatment_group_1.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_1 = treatment_group_1.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for treatment group 0
mean_treatment_0 = treatment_group_0.select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_treatment_0 = treatment_group_0.select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Mean and sum for total population
mean_total = target_tlearner .select(pl.col('expenditure').mean()).to_numpy()[0][0]
sum_total = target_tlearner .select(pl.col('expenditure').sum()).to_numpy()[0][0]

# Output the results
print(f"Treatment Group 1 - Mean Expenditure: {mean_treatment_1}, Sum Expenditure: {sum_treatment_1}")
print(f"Treatment Group 0 - Mean Expenditure: {mean_treatment_0}, Sum Expenditure: {sum_treatment_0}")
print(f"Total - Mean Expenditure: {mean_total}, Sum Expenditure: {sum_total}")

Treatment Group 1 - Mean Expenditure: 121163.04413224998, Sum Expenditure: 24232608826.449997
Treatment Group 0 - Mean Expenditure: 31919.385659100004, Sum Expenditure: 25535508527.280003
Total - Mean Expenditure: 49768.11735373, Sum Expenditure: 49768117353.729996


The results indicate a significant difference in expenditure between the treated and untreated groups. Individuals who received the treatment had an **average expenditure of $109,695.19**, contributing to a **total expenditure of $21.94 billion**. In contrast, those who did not receive the treatment had a **much lower average expenditure of $32,353.97**, with a **total expenditure of $25.88 billion**. 

Overall, across the entire population, the **mean expenditure was $47,822.22**, leading to a **total economic expenditure of $47.82 billion**. This highlights the substantial impact of the treatment on spending behavior.

### Comparing the two sets of results, we can see a clear difference in expenditure patterns between the **random assignment method** and the **T-learner targeting approach**:

1. **Treatment Group (Assigned Treatment)**
   - **Random Assignment**: Mean expenditure = **$68,959.48**, Total expenditure = **$13.81 billion**  
   - **T-Learner Targeting**: Mean expenditure = **$109,695.19**, Total expenditure = **$21.94 billion**  
   - **Comparison**: The T-learner approach results in a **higher mean expenditure** (+$40,735.71) and an **increase of $8.13 billion in total expenditure**, suggesting that targeting individuals most likely to benefit significantly increases the program's economic impact.

2. **Control Group (No Treatment)**
   - **Random Assignment**: Mean expenditure = **$37,273.37**, Total expenditure = **$29.81 billion**  
   - **T-Learner Targeting**: Mean expenditure = **$32,353.97**, Total expenditure = **$25.88 billion**  
   - **Comparison**: The mean expenditure in the control group is slightly lower (-$4,919.40) when using the T-learner method, and total expenditure decreases by about **$3.93 billion**. Since the T-learner method is designed to target individuals who are expected to benefit the most from treatment, it is not surprising that the control group shows a decrease in expenditure. It is effectively leaving out people who would not increase expenditure even when getting the treatment. Thus, this indicates that the T-learner method effectively identifies individuals who would have spent more if treated, leading to a more efficient allocation of resources.

3. **Overall Economic Impact**
   - **Random Assignment**: Mean expenditure = **$43,618.77**, Total expenditure = **$43.62 billion**  
   - **T-Learner Targeting**: Mean expenditure = **$47,822.22**, Total expenditure = **$47.82 billion**  
   - **Comparison**: The T-learner method results in an **increase of $4,203.45 in mean expenditure** and a **higher total expenditure by $4.2 billion**, indicating a more efficient allocation of resources. 

However, it is worth noting that while the **T-learner approach significantly increases expenditure within the treatment group**, the overall impact on the entire economy is comparatively smaller. The total expenditure only increased by **$4 billion**, which, while meaningful, is not as dramatic as the difference seen within the treated group.  

If we were to **scale the program to a larger population**, the impact might become more substantial. However, this also brings up an important policy consideration: **fairness vs. equity**. **Randomized assignment** is often seen as a **fair** approach since everyone has an equal chance of receiving the treatment. On the other hand, the **T-learner method prioritizes those most likely to benefit**, making it a **more equitable** strategy, as resources are allocated where they can create the most impact.  

Ultimately, the choice between these approaches depends on the policy goal: **maximizing overall impact (T-learner) vs. ensuring equal access (random assignment).**

### **Conclusion**
The **T-learner approach** outperforms **random assignment** by targeting individuals who are expected to benefit the most from treatment. This leads to a **higher mean expenditure per treated individual** and a **greater overall economic impact**. By prioritizing those with the highest predicted treatment effect, the program can maximize its effectiveness while using the same budget constraints.