# HW 4 -- Last Dance

# Task 1. Linear regression for causal effect

## Context
You work on a ride-sharing app.
A new feature **Upfront Price Lock** was introduced.
Some users got access to it earlier than others.

Goal: estimate the **causal effect** of the feature on **weekly spend**.

You are given a user-level dataset.
Treatment assignment is not random and depends on user characteristics.

---

## Dataset
Target

- `spend` weekly spend in EUR

Treatment

- `treatment` 1 if the user had access to the feature, 0 otherwise

User characteristics (pre-treatment)

- `age` user age
- `tenure_w` weeks since signup
- `trip_freq_pre` average trips per week before the feature
- `peak_share_pre` share of peak-hour trips before the feature
- `price_sens` price sensitivity score; higher means more price sensitive
- `city_income` city income index proxy
- `is_ios` 1 if iOS user, 0 otherwise
- `profile_completeness` profile completeness score

Post-treatment behavior (measured after exposure)

- `feature_usage` how often the new feature is used
- `price_lock_savings` average savings from the feature

Important note  
Some variables are dangerous to include as controls when you estimate a causal effect: conditioning on them can bias the estimate  
Recommended reading: [Graphical Causal Models](https://matheusfacure.github.io/python-causality-handbook/04-Graphical-Causal-Models.html)


### 1. Specify your model (what to include and what to exclude)

Write down your proposed regression model to estimate the causal effect of `treatment` on `spend`.

You must provide two lists

- variables_to_include
- variables_to_exclude

In [3]:
variables_to_include = []
variables_to_exclude = []

variables_to_include, variables_to_exclude


([], [])

### 2. Fit OLS with your formula

Report

coefficient on `treatment`  
standard error  
p-value

Interpret the coefficient as a causal estimate under your assumptions
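
As a rough illustration of what this step reports, here is a minimal sketch using NumPy only on simulated data. The simulated numbers, the single confounder `x`, and the helper `ols_classic` are hypothetical and are not part of the homework dataset. The p-value uses a normal approximation rather than the t distribution, which is an assumption of this sketch; a formula-based regression library would normally do this for you.

```python
import math
import numpy as np

def ols_classic(X, y):
    """OLS coefficients with classic (homoskedastic) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)              # unbiased residual variance
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    z = beta / se
    # Two-sided p-values from the normal approximation (assumption of this sketch)
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, se, p

# Simulated example: x drives both treatment and spend; the true effect is 2.0
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
treatment = (x + rng.normal(size=n) > 0).astype(float)
spend = 10 + 2.0 * treatment + 3.0 * x + rng.normal(size=n)

# Columns: intercept, treatment, confounder
X = np.column_stack([np.ones(n), treatment, x])
beta, se, p = ols_classic(X, spend)
print(f"treatment coef={beta[1]:.3f}, se={se[1]:.3f}, p={p[1]:.3g}")
```

In this simulation the coefficient on `treatment` recovers the true effect only because the confounder `x` is in the design matrix; dropping it would mix the effect of `x` into the treatment coefficient.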

In [4]:
## your code here

### 3. Residual check for heteroskedasticity

Make a residuals vs fitted plot  
State whether variance looks constant or not

In [5]:
## your code here

### 4. Robust inference

Refit the same model using heteroskedasticity-robust (HC3) standard errors

Report again

standard error for `treatment`  
p-value for `treatment`

Compare with classic OLS SE and explain the difference if any
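
To make the comparison concrete, the sketch below computes both classic and HC3 standard errors by hand with NumPy. The helper `ols_classic_and_hc3` and the simulated data (a small treated group with noisier outcomes) are hypothetical choices for illustration, not the homework data.

```python
import numpy as np

def ols_classic_and_hc3(X, y):
    """Return OLS coefficients with classic and HC3 standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Classic SE: one common error variance for all observations
    se_classic = np.sqrt(np.diag(XtX_inv) * (resid @ resid) / (n - k))
    # Leverage h_ii = x_i' (X'X)^{-1} x_i
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    # HC3 sandwich: squared residuals inflated by 1 / (1 - h_ii)^2
    w = (resid / (1.0 - h)) ** 2
    meat = X.T @ (X * w[:, None])
    se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classic, se_hc3

# Simulated example: 200 treated units with noise sd 3, 1800 controls with sd 1
rng = np.random.default_rng(0)
n = 2000
d = np.zeros(n)
d[:200] = 1.0
noise_sd = np.where(d == 1.0, 3.0, 1.0)
y = 1.0 + 2.0 * d + noise_sd * rng.normal(size=n)

X = np.column_stack([np.ones(n), d])
beta, se_classic, se_hc3 = ols_classic_and_hc3(X, y)
print(f"classic SE={se_classic[1]:.3f}, HC3 SE={se_hc3[1]:.3f}")
```

Here the classic SE pools the error variance across groups and therefore understates the uncertainty coming from the small, noisy treated group, so the HC3 SE comes out larger. With constant variance the two would be close.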


In [6]:
## your code here

### 5. Reflection

Answer briefly

- What could still bias your estimate  
- What important variables might be missing  
- What assumption is the most fragile in your approach


In [7]:
## your code here

# Task 2. Difference in Differences

## Context
You work on a food delivery app  
A new feature **Smart Tips UI** was launched only in one region

Outcome  
`order_value` average order value per order

We have repeated cross-sections  
Different orders in pre and post

Goal  
Estimate the causal effect of the feature using DiD  
Assume parallel trends and iid samples inside each cell

$$
\widehat{DiD}
=
(\bar Y^{T}_{Post}-\bar Y^{T}_{Pre})
-
(\bar Y^{C}_{Post}-\bar Y^{C}_{Pre})
$$


### 1. Visualize the data

Dataset file  
`hw4_diff_in_diff_data.csv`

Make a simple plot of average `order_value` over time for both groups

Compute the sample means

$$
\bar Y^{C}_{Pre}
\qquad
\bar Y^{C}_{Post}
\qquad
\bar Y^{T}_{Pre}
\qquad
\bar Y^{T}_{Post}
$$

Plot two lines

Control group C from Pre to Post  
Treated group T from Pre to Post  


In [8]:
## your code here

### 2. Compute the DiD estimate

Using the sample means from section 1, compute

Naive before-after in treated  
$$
\widehat{\Delta}_T=\bar Y^{T}_{Post}-\bar Y^{T}_{Pre}
$$

Naive post difference  
$$
\widehat{\Delta}_{Post}=\bar Y^{T}_{Post}-\bar Y^{C}_{Post}
$$

Difference in Differences  
$$
\widehat{DiD}
=
(\bar Y^{T}_{Post}-\bar Y^{T}_{Pre})
-
(\bar Y^{C}_{Post}-\bar Y^{C}_{Pre})
$$
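
The three estimators above can be sketched as follows. The four simulated cells and their means are hypothetical stand-ins for the groups in `hw4_diff_in_diff_data.csv`, whose column layout is not assumed here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical repeated cross-sections: one array of order values per cell
cells = {
    ("C", "Pre"):  rng.normal(20.0, 5.0, size=400),
    ("C", "Post"): rng.normal(21.0, 5.0, size=400),
    ("T", "Pre"):  rng.normal(22.0, 5.0, size=400),
    ("T", "Post"): rng.normal(24.5, 5.0, size=400),
}
means = {cell: values.mean() for cell, values in cells.items()}

# Naive before-after change in the treated group
delta_T = means[("T", "Post")] - means[("T", "Pre")]
# Naive treated-vs-control difference in the post period
delta_post = means[("T", "Post")] - means[("C", "Post")]
# DiD: treated change minus control change
did = delta_T - (means[("C", "Post")] - means[("C", "Pre")])

print(f"before-after={delta_T:.2f}, post diff={delta_post:.2f}, DiD={did:.2f}")
```

In this simulation the true DiD is 1.5: the before-after estimate also absorbs the common time trend (1.0), and the post difference also absorbs the pre-existing gap between groups (2.0).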

In [9]:
## your code here

### 3. Asymptotic z test for DiD

Two-sided hypothesis

$$
H_0: DiD = 0
\qquad
H_1: DiD \neq 0
$$

Assume iid inside each cell and independent samples across cells

Compute the plug-in standard error

$$
\widehat{se}(\widehat{DiD})
=
\sqrt{
\frac{(s^{T}_{Post})^2}{n^{T}_{Post}}
+\frac{(s^{T}_{Pre})^2}{n^{T}_{Pre}}
+\frac{(s^{C}_{Post})^2}{n^{C}_{Post}}
+\frac{(s^{C}_{Pre})^2}{n^{C}_{Pre}}
}
$$

Compute the statistic

$$
z=\frac{\widehat{DiD}}{\widehat{se}(\widehat{DiD})}
$$

Compute a two-sided p-value using the standard normal distribution

Report

- $\widehat{DiD}$  
- $\widehat{se}(\widehat{DiD})$  
- $z$  
- p-value  
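
A minimal sketch of the test, assuming the four cells are available as plain arrays. The function name `did_z_test` and the simulated inputs are hypothetical.

```python
import math
import numpy as np

def did_z_test(t_post, t_pre, c_post, c_pre):
    """Plug-in z test for DiD with four independent iid cells."""
    cells = [np.asarray(a, dtype=float) for a in (t_post, t_pre, c_post, c_pre)]
    m = [a.mean() for a in cells]
    did = (m[0] - m[1]) - (m[2] - m[3])
    # Sum of s^2 / n over the four cells (sample variance with ddof=1)
    se = math.sqrt(sum(a.var(ddof=1) / a.size for a in cells))
    z = did / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p = math.erfc(abs(z) / math.sqrt(2))
    return did, se, z, p

# Simulated cells with a true DiD of 1.5
rng = np.random.default_rng(2)
did, se, z, p = did_z_test(
    t_post=rng.normal(24.5, 5.0, size=2000),
    t_pre=rng.normal(22.0, 5.0, size=2000),
    c_post=rng.normal(21.0, 5.0, size=2000),
    c_pre=rng.normal(20.0, 5.0, size=2000),
)
print(f"DiD={did:.3f}, se={se:.3f}, z={z:.2f}, p={p:.3g}")
```

The variances add because the four cell means are independent, which is exactly the assumption stated above.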

In [10]:
## your code here

### 4. DiD via linear regression

Define indicators

$$
D=
\begin{cases}
1 & G=T\\
0 & G=C
\end{cases}
\qquad
Post=
\begin{cases}
1 & S=Post\\
0 & S=Pre
\end{cases}
$$

Fit the regression

$$
Y=\beta_0+\beta_1D+\beta_2Post+\tau(D\cdot Post)+\varepsilon
$$

Report

- $\hat\tau$  
- standard error for $\hat\tau$  
- p-value for $\hat\tau$  
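
The sketch below fits this regression with NumPy least squares on simulated data and checks that $\hat\tau$ coincides with the DiD computed from cell means. The simulated coefficients and the helper `cell_mean` are hypothetical; the SE shown is the classic OLS one, which pools the error variance and so need not match the plug-in SE from section 3.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
D = rng.integers(0, 2, size=n).astype(float)      # 1 = treated group
Post = rng.integers(0, 2, size=n).astype(float)   # 1 = post period
Y = 20 + 2.0 * D + 1.0 * Post + 1.5 * D * Post + rng.normal(0, 5, size=n)

# Design matrix: intercept, D, Post, interaction D * Post
X = np.column_stack([np.ones(n), D, Post, D * Post])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
tau_hat = beta[3]

# Classic OLS standard error for the interaction coefficient
resid = Y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se_tau = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[3, 3])

def cell_mean(d, s):
    """Mean outcome in the cell with group indicator d and period indicator s."""
    return Y[(D == d) & (Post == s)].mean()

did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(f"tau_hat={tau_hat:.4f}, DiD from means={did_means:.4f}, se={se_tau:.4f}")
```

The two estimates agree because the model is saturated: four coefficients fit the four cell means exactly.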

In [11]:
## your code here

# Task 3. Synthetic Control

## Context
You work on a marketplace.
Region **U01** launched a new policy at time **T0**.
Other regions did not.

Outcome `y` is observed weekly for many regions.

Dataset file  
`hw4_synth_control_data.csv`

Columns  
`unit` region id  
`t` time index  
`treated` 1 for U01, 0 otherwise  
`post` 1 if t >= T0  
`y` outcome  

Goal  
Estimate the treatment effect for U01 after T0 using a synthetic control.

### 1. Visual inspection

Dataset file  
`hw4_synth_control_data.csv`

1. Load the dataset  
2. Plot `y` over time for the treated unit `U01` and several donor units  
3. Mark the intervention time `T0` with a vertical line


In [12]:
## your code here

### 2. Build synthetic control weights

Use only the pre-treatment period

$$
t < T0
$$

Define objects from the pre period

Treated vector

$$
y^{T}_{Pre} =
\begin{bmatrix}
y_{U01,1}\\
\vdots\\
y_{U01,T0-1}
\end{bmatrix}
$$

Donor matrix (each column is one donor unit)

$$
Y^{D}_{Pre} =
\begin{bmatrix}
y_{U02,1} & \cdots & y_{U18,1}\\
\vdots & \ddots & \vdots\\
y_{U02,T0-1} & \cdots & y_{U18,T0-1}
\end{bmatrix}
$$

Find weights $w$ by solving

$$
\min_{w}\ \left\lVert y^{T}_{Pre} - Y^{D}_{Pre} w \right\rVert^2
$$

Subject to

$$
w_j \ge 0
\qquad
\sum_j w_j = 1
$$

Report the top 5 donor weights.
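
One possible way to solve this constrained least-squares problem is projected gradient descent onto the probability simplex, sketched below with NumPy only. This is a swapped-in technique chosen to keep the example dependency-free, not a prescribed method; a general constrained optimizer such as `scipy.optimize.minimize` with `method="SLSQP"` is a common alternative and may behave better on strongly trending series. The helpers `project_to_simplex` and `synth_weights` and the simulated donors are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def synth_weights(y_pre, Y_pre, n_iter=5000):
    """Minimize ||y_pre - Y_pre w||^2 over the simplex by projected gradient."""
    J = Y_pre.shape[1]
    w = np.full(J, 1.0 / J)                       # start from equal weights
    # Step 1/L, where L = 2 * sigma_max(Y_pre)^2 bounds the gradient's Lipschitz constant
    step = 1.0 / (2.0 * np.linalg.norm(Y_pre, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * Y_pre.T @ (Y_pre @ w - y_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Simulated pre-period: 30 time points, 5 donors, treated unit is an exact mix
rng = np.random.default_rng(4)
Y_pre = rng.normal(size=(30, 5))
true_w = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
y_pre = Y_pre @ true_w

w = synth_weights(y_pre, Y_pre)
top = np.argsort(w)[::-1][:5]                     # donors sorted by weight, largest first
for j in top:
    print(f"donor {j}: weight {w[j]:.4f}")
```

Both constraints hold after every iteration because each update ends with a projection onto the simplex.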


In [13]:
## your code here

### 3. Construct synthetic control series

Using the weights from section 2 build the synthetic series for all time periods

$$
\hat y^{Synth}_{t}=\sum_{j \in Donors} w_j \cdot y_{j,t}
$$

Plot on the same chart

1. Actual outcome for U01  
2. Synthetic control outcome  
3. Add a vertical line at $T0$


In [14]:
## your code here

### 4. Estimate and report the effect

Using your synthetic control series compute the pointwise effect:

$$
\widehat{\tau}_t = y_{U01,t} - \hat y^{Synth}_t
$$

1. Plot $\widehat{\tau}_t$ over time  
2. Add a horizontal line at 0  
3. Add a vertical line at $T0$  


Report two numbers
- Average post-period effect
- Effect at the final period
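
A small sketch of these computations, assuming the weights are already available. The simulated donors, the weights `w`, the value `T0 = 31`, and the constant post-period effect of 3.0 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(1, 41)                  # time index 1..40
T0 = 31                               # hypothetical intervention time
post = t >= T0                        # same convention as the `post` column

Y_donors = rng.normal(size=(t.size, 3))           # one column per donor
w = np.array([0.5, 0.3, 0.2])                     # weights assumed already fitted

# Treated series: exact donor mix plus a constant effect of 3.0 after T0
y_treated = Y_donors @ w + 3.0 * post

y_synth = Y_donors @ w                # synthetic control for all periods
tau = y_treated - y_synth             # pointwise effect

avg_post_effect = tau[post].mean()
final_effect = tau[-1]
print(f"average post effect={avg_post_effect:.3f}, final effect={final_effect:.3f}")
```

In real data the pre-period values of `tau` will not be exactly zero; their size gives a rough sense of how well the synthetic control fits before `T0`.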

In [15]:
## your code here

## 🎉 Final note

That was the last homework of the course.

**Congrats on finishing the course.**

Statistics is a skill that compounds:  
keep practicing and it will pay you back many times over.

Thank you
