# Causal Inference Training Exercises
## Fixed Effects

These exercises are designed to introduce the concept of a fixed effects regression using simulated data.

We are working with the UK team to assess the impact of a recent advertising campaign. The team spent significant money on digital advertising across zones in London to boost sales. To maximise impact, they decided to spend more in higher-density zones. In each zone, the advertising spend varied randomly from day to day between a high, medium and low amount.

We are given data for a four-week period covering 06/01/2020 to 02/02/2020 (dd/mm/yyyy), with details on the advertising spend and the number of sessions for each zone and day.

In [1]:
import pandas as pd
import numpy as np
from linearmodels import PanelOLS
import statsmodels.api as sm
%matplotlib inline

df = pd.read_csv('london_ad_spend.csv')
df['date'] = pd.to_datetime(df['date'])
df.head()

| | zone_code | date | num_sessions | ad_spend |
|---|---|---|---|---|
| 0 | BAL | 2020-01-06 | 1886.0 | 450.0 |
| 1 | BAL | 2020-01-07 | 1689.0 | 300.0 |
| 2 | BAL | 2020-01-08 | 1918.0 | 450.0 |
| 3 | BAL | 2020-01-09 | 1968.0 | 150.0 |
| 4 | BAL | 2020-01-10 | 3477.0 | 150.0 |


## Part 1
When using any complex regression technique, it is useful to have a standard regression as a baseline for comparison. Let's do some preliminary analysis to understand the data before we start using fixed effects.

1. Perform a linear regression of `num_sessions` on `ad_spend` with a constant. What is the `ad_spend` coefficient estimate? 
1. What are possible omitted variables that prevent this coefficient estimate from having a causal interpretation?
1. If we were planning to use a controlled regression approach here - what would be good controls? 
1. Would we expect a fixed effects regression to have a lower or higher `ad_spend` coefficient estimate than the estimate we have obtained here? Why? (Hint: is the omitted variable bias likely to be negative or positive?)
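
To build intuition for the hint in question 4, the sketch below simulates a hypothetical version of the set-up described in the introduction: denser zones have more sessions and also receive more ad spend. Every number in it (`true_effect`, the density range, the noise level) is an assumption chosen for illustration and is not taken from `london_ad_spend.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_zones, n_days = 100, 28

# Hypothetical data-generating process: all values below are assumptions.
density = rng.uniform(0.5, 2.0, size=n_zones)   # zone density (the omitted variable)
base_sessions = 2000 * density                  # denser zones have more sessions anyway
spend_level = 200 * density                     # ...and are given a higher spend level
true_effect = 0.5                               # assumed sessions gained per unit of spend

zone = np.repeat(np.arange(n_zones), n_days)
daily_multiplier = rng.choice([0.5, 1.0, 1.5], size=zone.size)  # low / medium / high days
ad_spend = spend_level[zone] * daily_multiplier
num_sessions = (base_sessions[zone] + true_effect * ad_spend
                + rng.normal(0, 50, size=zone.size))

# Pooled OLS of sessions on spend and a constant (density omitted).
X_pooled = np.column_stack([np.ones(zone.size), ad_spend])
pooled_slope = np.linalg.lstsq(X_pooled, num_sessions, rcond=None)[0][1]

# Same regression with density included as a control.
X_control = np.column_stack([np.ones(zone.size), ad_spend, density[zone]])
controlled_slope = np.linalg.lstsq(X_control, num_sessions, rcond=None)[0][1]

print(f"assumed true effect: {true_effect}")
print(f"pooled slope:        {pooled_slope:.3f}")
print(f"controlled slope:    {controlled_slope:.3f}")
```

In this simulated world the pooled slope is far from the assumed effect, while the controlled slope is close to it. Whether the real data behaves the same way is what the exercise asks you to reason about.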

Note: We are conducting this regression for illustrative purposes. To perform correct inference, the standard errors would have to be adjusted to account for the non-i.i.d. nature of the data.
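
The adjustment mentioned in the note can be illustrated on simulated data. The sketch below is a minimal hand-rolled comparison of classical and cluster-robust (sandwich) standard errors, assuming both the regressor and the error term share a zone-level component. The data, variable names and coefficients are hypothetical, and the small-sample correction that libraries often apply is deliberately omitted to keep the formula visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n_zones, n_days = 100, 28
zone = np.repeat(np.arange(n_zones), n_days)
n = zone.size

# Hypothetical data: regressor and error both contain a zone-level component,
# so observations within a zone are correlated (not i.i.d.).
x = rng.normal(size=n_zones)[zone] + rng.normal(size=n)
u = rng.normal(size=n_zones)[zone] + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors: assume i.i.d. errors with a common variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust standard errors: sum the scores X_i * u_i within each zone,
# then form the sandwich (X'X)^-1 [sum_g s_g s_g'] (X'X)^-1.
scores = X * resid[:, None]
zone_scores = np.zeros((n_zones, X.shape[1]))
np.add.at(zone_scores, zone, scores)
meat = zone_scores.T @ zone_scores
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE (slope):", se_classical[1])
print("clustered SE (slope):", se_cluster[1])
```

With within-zone correlation in both the regressor and the errors, the classical formula understates the uncertainty in this simulation, which is why the p-values from the illustrative regression should not be taken at face value.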

In [4]:
df['const'] = 1

reg = sm.OLS(endog = df['num_sessions'], exog = df[['ad_spend', 'const']])
results = reg.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | num_sessions | R-squared: | 0.269 |
| Model: | OLS | Adj. R-squared: | 0.268 |
| Method: | Least Squares | F-statistic: | 1027.0 |
| Date: | Fri, 07 May 2021 | Prob (F-statistic): | 2.74e-192 |
| Time: | 11:27:48 | Log-Likelihood: | -25746.0 |
| No. Observations: | 2800 | AIC: | 51500.0 |
| Df Residuals: | 2798 | BIC: | 51510.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ad_spend | 9.1058 | 0.284 | 32.053 | 0.000 | 8.549 | 9.663 |
| const | 179.9987 | 84.053 | 2.141 | 0.032 | 15.186 | 344.811 |

| | | | |
|---|---|---|---|
| Omnibus: | 1773.25 | Durbin-Watson: | 0.463 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 19064.403 |
| Skew: | 2.918 | Prob(JB): | 0.0 |
| Kurtosis: | 14.374 | Cond. No. | 552.0 |


## Part 2
For illustrative purposes - let's use the "dummy variable" set-up of a fixed effects regression, where we add a dummy as an indicator for each zone. 

1. Add dummies for each zone in the dataset. Hint: `pd.get_dummies`
1. Run an OLS regression of `num_sessions` on `ad_spend` and the zone dummies. What is the new coefficient estimate for `ad_spend`? (Beware multicollinearity!)
1. How would you apply clustered standard errors to this regression? How much does the p-value on the `ad_spend` coefficient change?
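
The multicollinearity warning in question 2 refers to the so-called dummy variable trap: a constant plus a dummy for every zone makes the design matrix rank-deficient, because the dummies sum to the constant column. The toy example below is a sketch using only three of the zone codes from the data and two rows per zone; the array names are illustrative.

```python
import numpy as np

# Toy panel: three zones, two observations each.
zones = np.array(["BAL", "BAL", "BAR", "BAR", "BEC", "BEC"])
labels, codes = np.unique(zones, return_inverse=True)

all_dummies = np.eye(len(labels))[codes]      # one indicator column per zone
const = np.ones((len(zones), 1))

# Constant + every dummy: the dummy columns sum to the constant column.
X_trap = np.hstack([const, all_dummies])
# Constant + all but the first dummy: the dropped zone becomes the baseline.
X_ok = np.hstack([const, all_dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(f"with all dummies:   {X_trap.shape[1]} columns, rank {rank_trap}")
print(f"with one dropped:   {X_ok.shape[1]} columns, rank {rank_ok}")
```

Dropping one dummy (or, alternatively, dropping the constant) restores full column rank, and the remaining dummy coefficients are then read relative to the omitted zone.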



In [12]:
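# One indicator column per zone. drop_first=True omits one zone so the dummies
# are not perfectly collinear with the constant. dtype=float keeps the design
# matrix numeric (recent pandas versions return boolean dummies by default).
zone_dummies = pd.get_dummies(df['zone_code'], prefix='zone', drop_first=True, dtype=float)
zone_dummies.head()

df_dummies = pd.concat([df, zone_dummies], axis=1)
df_dummies.head()

exog_cols = ['ad_spend'] + list(zone_dummies.columns)

y = df_dummies['num_sessions']
X = sm.add_constant(df_dummies[exog_cols])

# Cluster the standard errors by zone
model_1 = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': df_dummies['zone_code']})
# Without clustered standard errors
# model_1 = sm.OLS(y, X).fit()
model_1.summary()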




| | | | |
|---|---|---|---|
| Dep. Variable: | num_sessions | R-squared: | 0.936 |
| Model: | OLS | Adj. R-squared: | 0.934 |
| Method: | Least Squares | F-statistic: | 15390.0 |
| Date: | Fri, 07 May 2021 | Prob (F-statistic): | 1.91e-110 |
| Time: | 17:37:18 | Log-Likelihood: | -22331.0 |
| No. Observations: | 2800 | AIC: | 44860.0 |
| Df Residuals: | 2699 | BIC: | 45460.0 |
| Df Model: | 100 | | |
| Covariance Type: | cluster | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2685.5942 | 35.037 | 76.651 | 0.000 | 2616.924 | 2754.265 |
| ad_spend | 0.4699 | 0.136 | 3.449 | 0.001 | 0.203 | 0.737 |
| zone_BAR | -1945.6489 | 9.246 | -210.436 | 0.000 | -1963.770 | -1927.527 |
| zone_BARK | -1806.1833 | 7.299 | -247.445 | 0.000 | -1820.490 | -1791.877 |
| zone_BEC | -1951.0398 | 6.326 | -308.412 | 0.000 | -1963.439 | -1938.641 |
| zone_BEL | -64.0319 | 5.839 | -10.965 | 0.000 | -75.477 | -52.587 |
| zone_BEX | -1804.1486 | 8.759 | -205.972 | 0.000 | -1821.316 | -1786.981 |
| zone_BLW | -2426.7456 | 20.681 | -117.339 | 0.000 | -2467.280 | -2386.211 |
| zone_BRK | -684.8379 | 2.190 | -312.740 | 0.000 | -689.130 | -680.546 |

| | | | |
|---|---|---|---|
| Omnibus: | 426.569 | Durbin-Watson: | 1.023 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2024.886 |
| Skew: | 0.65 | Prob(JB): | 0.0 |
| Kurtosis: | 6.958 | Cond. No. | 29700.0 |


## Part 3
The dummy variable approach becomes impractical as the number of entities grows, so we instead use the mean-deviation (within) implementation from the `linearmodels` package, via its `PanelOLS` class. Let's prepare our data for the fixed effects regression. First we have to index our data by the relevant panel components: entity and time period.

In [2]:
df = df.set_index(['zone_code','date']).sort_index()
df.head()

| zone_code | date | num_sessions | ad_spend |
|---|---|---|---|
| BAL | 2020-01-06 | 1886.0 | 450.0 |
| BAL | 2020-01-07 | 1689.0 | 300.0 |
| BAL | 2020-01-08 | 1918.0 | 450.0 |
| BAL | 2020-01-09 | 1968.0 | 150.0 |
| BAL | 2020-01-10 | 3477.0 | 150.0 |
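
The mean-deviation implementation mentioned at the start of this part subtracts each zone's own average from its observations before running the regression. The sketch below checks, on simulated data, that this gives the same slope as the dummy variable approach from Part 2. The `toy` DataFrame, its zone effects and the assumed slope of 0.5 are hypothetical; only the column names and the 150/300/450 spend levels mirror the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_zones, n_days = 20, 28

# Hypothetical panel with a zone-specific intercept and an assumed slope of 0.5.
zone = np.repeat(np.arange(n_zones), n_days)
zone_effect = rng.normal(2000, 500, size=n_zones)
toy = pd.DataFrame({
    "zone_code": zone,
    "ad_spend": rng.choice([150.0, 300.0, 450.0], size=zone.size),
})
toy["num_sessions"] = (zone_effect[zone] + 0.5 * toy["ad_spend"].to_numpy()
                       + rng.normal(0, 50, size=zone.size))

# Within (mean-deviation) estimator: demean both variables by zone, then
# regress demeaned sessions on demeaned spend (no constant needed).
cols = ["ad_spend", "num_sessions"]
demeaned = toy[cols] - toy.groupby("zone_code")[cols].transform("mean")
x_w = demeaned["ad_spend"].to_numpy()
y_w = demeaned["num_sessions"].to_numpy()
within_slope = (x_w @ y_w) / (x_w @ x_w)

# Dummy variable estimator: spend plus one indicator per zone (no constant).
dummies = pd.get_dummies(toy["zone_code"], dtype=float).to_numpy()
X = np.column_stack([toy["ad_spend"].to_numpy(), dummies])
dummy_slope = np.linalg.lstsq(X, toy["num_sessions"].to_numpy(), rcond=None)[0][0]

print(f"within slope: {within_slope:.6f}")
print(f"dummy slope:  {dummy_slope:.6f}")
```

The two slopes agree up to floating-point error, but the within version never has to build or invert a matrix with one column per zone, which is what makes it practical for large panels.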


Now that our data is ready, we can use `PanelOLS` to perform the regression. We need clustered standard errors here to ensure correct inference. [The `linearmodels` panel examples](https://bashtage.github.io/linearmodels/devel/panel/examples/examples.html) provide details on `PanelOLS` and on applying clustered standard errors.

1. What are the coefficients from the panel data regression? How do they differ from those of the standard OLS regression in Part 1?
1. Describe in words the relationship we have measured here
1. What are the limitations of fixed effects regressions we should be concerned about in this setting? 
1. If we had other data available, what further analysis could we perform to alleviate these concerns?
1. Comment on the R^2 of the regression compared to the dummy variable approach above. Should we be concerned?
1. What are other approaches we could use instead of fixed effects regressions here?

In [3]:
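exog = sm.add_constant(df['ad_spend'])
y = df['num_sessions']

# entity_effects=True includes a fixed effect for each zone
mod = PanelOLS(y, exog, entity_effects=True, time_effects=False)

# Cluster the standard errors by entity (zone)
model_1 = mod.fit(cov_type='clustered', cluster_entity=True)

model_1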


| | | | |
|---|---|---|---|
| Dep. Variable: | num_sessions | R-squared: | 0.0054 |
| Estimator: | PanelOLS | R-squared (Between): | 0.0285 |
| No. Observations: | 2800 | R-squared (Within): | 0.0054 |
| Date: | Wed, May 19 2021 | R-squared (Overall): | 0.0270 |
| Time: | 15:21:03 | Log-likelihood | -2.233e+04 |
| Cov. Estimator: | Clustered | | |
| | | F-statistic: | 14.683 |
| Entities: | 100 | P-value | 0.0001 |
| Avg Obs: | 28.000 | Distribution: | F(1,2699) |
| Min Obs: | 28.000 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 2337.1 | 33.265 | 70.257 | 0.0000 | 2271.9 | 2402.4 |
| ad_spend | 0.4699 | 0.1332 | 3.5285 | 0.0004 | 0.2088 | 0.7310 |


## Part 4 [Optional Extension]
**(Not covered in the lectures but will be covered in the workbook solutions if interested)**

Your manager is keen to understand the proportional relationship: what is the percentage increase in sessions caused by a 1% increase in ad spending?

1. How can you modify the panel data regression to obtain this estimate?
1. Perform the regression with these modifications. What is your estimate for this relationship?
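
One common way to estimate a proportional relationship is a log-log specification, where both the outcome and the regressor are log-transformed and the slope is read as an elasticity. The sketch below illustrates this on simulated data with an assumed elasticity of 0.2, using the same by-zone demeaning as a fixed effects regression. All names and values in it are hypothetical; on the real data the analogous step would be to pass log-transformed columns to `PanelOLS`, which requires strictly positive values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_zones, n_days = 50, 28
zone = np.repeat(np.arange(n_zones), n_days)

# Hypothetical multiplicative model: sessions = zone_scale * spend^elasticity * noise.
true_elasticity = 0.2
ad_spend = rng.choice([150.0, 300.0, 450.0], size=zone.size)
zone_scale = rng.uniform(500, 3000, size=n_zones)
noise = rng.normal(0, 0.05, size=zone.size)
num_sessions = zone_scale[zone] * ad_spend ** true_elasticity * np.exp(noise)

toy = pd.DataFrame({
    "zone_code": zone,
    "log_spend": np.log(ad_spend),
    "log_sessions": np.log(num_sessions),
})

# Fixed effects on the logged variables: demean by zone, then regress.
cols = ["log_spend", "log_sessions"]
demeaned = toy[cols] - toy.groupby("zone_code")[cols].transform("mean")
x_w = demeaned["log_spend"].to_numpy()
y_w = demeaned["log_sessions"].to_numpy()
elasticity_hat = (x_w @ y_w) / (x_w @ x_w)

print(f"assumed elasticity:   {true_elasticity}")
print(f"estimated elasticity: {elasticity_hat:.3f}")
```

In a log-log regression the coefficient is interpreted as the approximate percentage change in sessions associated with a 1% change in spend, holding the zone fixed.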

**[Interpret Regression Coefficient Estimates - {level-level, log-level, level-log & log-log regression}](http://www.cazaar.com/ta/econ113/interpreting-beta)**