# Question 1a
a) Use linear regression to estimate the parameters α, τ, and β. You are permitted to
use Python or R.

In [1]:
import statsmodels.api as sm
import pandas as pd
import numpy as np

# Load and prepare data
W = np.array([[0], [1], [1], [1], [0], [1], [1], [0], [0], [1],
              [1], [0], [0], [1], [0], [1], [0], [0], [1], [1]])
X = np.array([[19.8], [23.4], [27.7], [24.6], [21.5], [25.1], [22.4], [29.3], [20.8], [20.2],
              [27.3], [24.5], [22.9], [18.4], [24.2], [21.0], [25.9], [23.2], [21.6], [22.8]])
Y = np.array([137, 118, 124, 124, 120, 129, 122, 142, 128, 114,
              132, 130, 130, 112, 132, 117, 134, 132, 121, 128])

# Construct dataset
data = pd.DataFrame({
    "Y": Y,
    "W": W.flatten(),
    "X": X.flatten()
})

# Add constant (α)
X_reg = sm.add_constant(data[["W", "X"]])  # Includes α (intercept)
model_stats = sm.OLS(data["Y"], X_reg).fit()

# Extract coefficients
alpha = model_stats.params["const"]
tau = model_stats.params["W"]
beta = model_stats.params["X"]
p_value = model_stats.pvalues["W"]

print("Intercept (α):", round(alpha, 2))
print("Treatment Effect (τ̂):", round(tau, 2))
print("Spending Effect (β):", round(beta, 2))
print("P-value for τ̂:", round(p_value, 4))

print("\nStatistical Summary:")
print(model_stats.summary())

Intercept (α): 95.97
Treatment Effect (τ̂): -9.11
Spending Effect (β): 1.51
P-value for τ̂: 0.0004

Statistical Summary:
                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.662
Method:                 Least Squares   F-statistic:                     19.61
Date:                Sun, 13 Apr 2025   Prob (F-statistic):           3.84e-05
Time:                        16:23:01   Log-Likelihood:                -57.076
No. Observations:                  20   AIC:                             120.2
Df Residuals:                      17   BIC:                             123.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
----------

# Question 1b
b) Report the estimated ATE (τ̂) and its statistical significance.

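The estimated ATE is τ̂ = -9.11: holding spending (X) fixed, treatment is associated with an engagement score about 9.11 points lower. The p-value (P>|t|) is 0.0004, so the estimate is statistically significant at conventional levels (p < 0.001). The 95% confidence interval, [-13.438, -4.773], excludes zero, which is consistent with this conclusion.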
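As a cross-check on the reported numbers, the sketch below recomputes τ̂, its standard error, and the 95% confidence interval without statsmodels. It uses the closed-form expressions that OLS reduces to when W is binary and X is the only covariate: β̂ is the pooled within-group slope, and τ̂ is the difference in group means of Y adjusted by β̂ times the difference in group means of X. The function name `ancova_estimates` and the hard-coded critical value 2.110 (the two-sided 5% t quantile for 17 residual degrees of freedom, taken from a standard t table) are assumptions of this illustration, not part of the original notebook.

```python
import math

# Data copied from the notebook cell in Question 1a.
W = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
     1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
X = [19.8, 23.4, 27.7, 24.6, 21.5, 25.1, 22.4, 29.3, 20.8, 20.2,
     27.3, 24.5, 22.9, 18.4, 24.2, 21.0, 25.9, 23.2, 21.6, 22.8]
Y = [137, 118, 124, 124, 120, 129, 122, 142, 128, 114,
     132, 130, 130, 112, 132, 117, 134, 132, 121, 128]

# Assumed: t quantile (0.975, 17 df) read from a standard t table.
T_CRIT_17DF = 2.110


def ancova_estimates(w, x, y, t_crit):
    """OLS of y on (1, w, x) for binary w, via within-group sums of squares."""
    stats = {}
    for g in (0, 1):
        xs = [xi for wi, xi in zip(w, x) if wi == g]
        ys = [yi for wi, yi in zip(w, y) if wi == g]
        xbar = sum(xs) / len(xs)
        ybar = sum(ys) / len(ys)
        stats[g] = {
            "n": len(xs),
            "xbar": xbar,
            "ybar": ybar,
            "sxx": sum((xi - xbar) ** 2 for xi in xs),
            "sxy": sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)),
            "syy": sum((yi - ybar) ** 2 for yi in ys),
        }

    # Pool the within-group sums over control (0) and treated (1).
    sxx = stats[0]["sxx"] + stats[1]["sxx"]
    sxy = stats[0]["sxy"] + stats[1]["sxy"]
    syy = stats[0]["syy"] + stats[1]["syy"]

    beta = sxy / sxx                                   # slope on X
    dx = stats[1]["xbar"] - stats[0]["xbar"]           # covariate imbalance
    tau = (stats[1]["ybar"] - stats[0]["ybar"]) - beta * dx
    alpha = stats[0]["ybar"] - beta * stats[0]["xbar"]

    df_resid = len(y) - 3                              # three parameters
    sigma2 = (syy - beta * sxy) / df_resid             # residual variance
    se_tau = math.sqrt(
        sigma2 * (1 / stats[1]["n"] + 1 / stats[0]["n"] + dx ** 2 / sxx)
    )
    return {
        "alpha": alpha,
        "tau": tau,
        "beta": beta,
        "se_tau": se_tau,
        "t_tau": tau / se_tau,
        "ci_low": tau - t_crit * se_tau,
        "ci_high": tau + t_crit * se_tau,
    }


est = ancova_estimates(W, X, Y, T_CRIT_17DF)
for name, value in est.items():
    print(f"{name}: {value:.3f}")
```

On this data the printed values should agree, up to rounding, with the coefficients and the confidence interval reported above.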


# Question 1c
c) Briefly explain under what assumptions τ̂ can be given a causal interpretation.

τ̂ can be interpreted causally if the following assumptions hold:

1. Unconfoundedness (no unmeasured confounders): every variable that influences both treatment assignment (W) and the outcome (Y) is included in the model, so that treatment assignment is as good as random conditional on X.
2. Overlap (positivity): for all values of X, every unit has a non-zero probability of receiving either treatment (W = 1 or W = 0).
3. Stable Unit Treatment Value Assumption (SUTVA):
    one unit's treatment does not affect another unit's outcome (for example, Corporation A's participation does not influence Corporation B's engagement score), and
    the treatment is well defined and identical for all units.
4. Correct functional form: the linear regression model accurately reflects the relationship between X and Y.
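Of these assumptions, only overlap can be examined directly in the data; unconfoundedness and SUTVA have to be argued from the study design. A minimal sketch of such a check follows, assuming that comparing the range and mean of X across the two treatment groups is an adequate diagnostic for a sample this small. The helper name `covariate_overlap` and the choice of the standardized mean difference as a balance measure are assumptions of this illustration.

```python
import math

# Treatment indicator and spending covariate, copied from Question 1a.
W = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
     1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
X = [19.8, 23.4, 27.7, 24.6, 21.5, 25.1, 22.4, 29.3, 20.8, 20.2,
     27.3, 24.5, 22.9, 18.4, 24.2, 21.0, 25.9, 23.2, 21.6, 22.8]


def covariate_overlap(w, x):
    """Summarize how similar the distribution of x is across the two groups."""
    treated = [xi for wi, xi in zip(w, x) if wi == 1]
    control = [xi for wi, xi in zip(w, x) if wi == 0]

    def mean(values):
        return sum(values) / len(values)

    def sample_var(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    # Standardized mean difference: gap in means in pooled-SD units.
    pooled_sd = math.sqrt((sample_var(treated) + sample_var(control)) / 2)
    smd = (mean(treated) - mean(control)) / pooled_sd

    # Region of X in which both treated and control units are observed.
    common_support = (
        max(min(treated), min(control)),
        min(max(treated), max(control)),
    )
    return {
        "n_treated": len(treated),
        "n_control": len(control),
        "treated_range": (min(treated), max(treated)),
        "control_range": (min(control), max(control)),
        "common_support": common_support,
        "smd": smd,
    }


report = covariate_overlap(W, X)
for key, value in report.items():
    print(key, value)
```

A wide common range and a small standardized mean difference are consistent with overlap but do not prove it, and they say nothing about confounders that were not measured.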

# Question 2
A brief (2–3 sentence) explanation of each component in your setup (e.g., what app.py does, why the Dockerfile is needed, and how containerization improves reproducibility).

Explanation of components:
1. **train_model.py**: Prepares the data and fits the linear regression that estimates the parameters α, τ, and β, which gives the ATE and its statistical significance.
2. **app.py**: Implements the Flask API with a /predict endpoint that uses the regression model and the parameters estimated by **train_model.py** to predict engagement scores from treatment status and spending.
3. **Dockerfile**: Defines the container environment, ensuring consistent dependencies and runtime across different systems, and is set up to run **train_model.py** before **app.py**.
4. **Containerization benefits**: Ensures reproducibility by packaging the application with all its dependencies, eliminating "works on my machine" problems and making deployment easier across different environments.
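The source of **app.py** is not shown here, so the following is only a framework-free sketch of the computation a /predict handler would perform, not the actual implementation. It assumes the request body is JSON with fields named `W` and `X` and that the coefficients are the rounded values printed in Question 1a; the function names `predict_engagement` and `handle_predict` and the response field `predicted_engagement` are hypothetical.

```python
import json

# Assumed: coefficients rounded from the regression output in Question 1a.
ALPHA = 95.97   # intercept
TAU = -9.11     # treatment effect
BETA = 1.51     # spending effect


def predict_engagement(w, x):
    """Fitted value of the linear model: alpha + tau * W + beta * X."""
    return ALPHA + TAU * w + BETA * x


def handle_predict(body):
    """Hypothetical handler: parse a JSON request body, return a JSON reply."""
    payload = json.loads(body)
    w = payload["W"]
    x = float(payload["X"])
    if w not in (0, 1):
        raise ValueError("W must be 0 (control) or 1 (treated)")
    return json.dumps({"predicted_engagement": round(predict_engagement(w, x), 2)})


response = handle_predict('{"W": 1, "X": 20.0}')
print(response)
```

In a Flask application the same logic would sit inside the route function for /predict, with the request body read from the incoming request instead of being passed in as a string.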
