# Causal inference practice 

Tutorial ref: https://medium.com/analytics-vidhya/identify-causality-by-fixed-effects-model-585554bd9735  

Question of interest: does market value drive companies' investment?  

Data: the well-known Grunfeld dataset, in the form of panel data (also called longitudinal or cross-sectional time-series data).  

Approach: we use PanelOLS to fit three models to the data:

- model1 includes fixed effects for both firm and year.
- model2 includes fixed effects for firm only.
- model3 does not specify any fixed effects.

We then discuss the statistical inference we can draw from each model.

<img src='img/stephen-dawson-qwtCeJ5cLYs-unsplash.jpg' width=400/>

In [1]:
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data = data.set_index(['firm','year'])
print(data.head())

                       invest   value  capital
firm           year                           
General Motors 1935.0   317.6  3078.5      2.8
               1936.0   391.8  4661.7     52.6
               1937.0   410.6  5387.1    156.9
               1938.0   257.7  2792.2    209.2
               1939.0   330.8  4313.2    203.4


In [9]:
firms = data.index.get_level_values('firm').unique().tolist()
years = data.index.get_level_values('year').unique().tolist()

In [10]:
print('There are {} firms for {} years'.format(len(firms),len(years)))

There are 11 firms for 20 years


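In the PanelOLS formulas below, 'EntityEffects' adds a fixed effect for each firm, which controls for firm-specific characteristics that do not change over time (whether measured or not).  
'TimeEffects' adds a fixed effect for each year, which controls for omitted (unmeasured) variables that change over time but affect every firm alike, such as macroeconomic conditions.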
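
As a rough illustration of what the two effects do, the sketch below builds a small synthetic panel (not the Grunfeld data) in which an unobserved firm trait drives both x and y. All names (`alpha`, `gamma`, `two_way_demean`, `ols_slope`) and the true slope of 2.0 are assumptions made for the example. Subtracting firm and year means, which is what the two fixed effects amount to in a balanced panel, should recover a slope close to 2.0, while a pooled regression is pulled away from it by the omitted firm trait.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_firms, n_years = 11, 20  # same shape as the panel above, but synthetic values

df = pd.DataFrame({
    "firm": np.repeat(np.arange(n_firms), n_years),
    "year": np.tile(np.arange(n_years), n_firms),
})
alpha = rng.normal(0, 5, n_firms)  # unobserved, time-invariant firm trait
gamma = rng.normal(0, 2, n_years)  # unobserved shock shared by all firms in a year

firm_trait = alpha[df["firm"].to_numpy()]
year_shock = gamma[df["year"].to_numpy()]

# x is correlated with the firm trait, so omitting the trait biases pooled OLS.
df["x"] = firm_trait + rng.normal(0, 1, len(df))
df["y"] = 2.0 * df["x"] + 3.0 * firm_trait + year_shock + rng.normal(0, 0.5, len(df))


def ols_slope(x, y):
    """Slope of a one-regressor OLS fit with an intercept."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())


def two_way_demean(col):
    """Remove firm means and year means (valid as written for a balanced panel)."""
    s = df[col]
    return (s
            - s.groupby(df["firm"]).transform("mean")
            - s.groupby(df["year"]).transform("mean")
            + s.mean())


pooled_slope = ols_slope(df["x"], df["y"])
within_slope = ols_slope(two_way_demean("x"), two_way_demean("y"))

print(f"pooled slope:         {pooled_slope:.2f}")
print(f"two-way within slope: {within_slope:.2f}")
```

The demeaning formula used here only holds for balanced panels like this one (every firm observed in every year); unbalanced panels need an iterative or dummy-variable approach.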

In [18]:
# Fit a fixed-effects model, treating firm and year as fixed effects.
from linearmodels.panel import PanelOLS


# with both EntityEffects and TimeEffects
model1 = PanelOLS.from_formula("invest ~ value + capital + EntityEffects + TimeEffects", data=data)
print(model1.fit())

                          PanelOLS Estimation Summary                           
Dep. Variable:                 invest   R-squared:                        0.7253
Estimator:                   PanelOLS   R-squared (Between):              0.7637
No. Observations:                 220   R-squared (Within):               0.7566
Date:                Sun, Sep 27 2020   R-squared (Overall):              0.7625
Time:                        16:44:10   Log-likelihood                   -1153.0
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      248.15
Entities:                          11   P-value                           0.0000
Avg Obs:                       20.000   Distribution:                   F(2,188)
Min Obs:                       20.000                                           
Max Obs:                       20.000   F-statistic (robust):             248.15
                            

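**Summary**  
Both 'value' and 'capital' are significantly and positively associated with 'invest' after controlling for firm and year effects.  
The 'Poolability Test' checks whether the fixed effects are actually needed: its null hypothesis is that all the effects are jointly zero, in which case a pooled model would suffice. In this case,  
F-test for Poolability = 18.476 and p-value = 0.0, so we can reject the null hypothesis and conclude that there are fixed effects.  
Because model1 controls for time-invariant firm characteristics and for shocks common to all firms in a given year, the result supports the conclusion that value and capital both contribute to an increase in investment (i.e., a causal relationship), provided no confounders that vary across both firm and time remain.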
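
The poolability test is an ordinary F-test comparing a pooled regression (restricted) with a regression that adds one dummy per firm (unrestricted). linearmodels prints it at the bottom of the summary and, to our knowledge, also exposes it as the `f_pooled` attribute of the fitted result. The sketch below computes the same kind of statistic by hand on synthetic data, so the numbers are not those of the Grunfeld models; the helper `rss` and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_firms, n_years = 11, 20
n_obs = n_firms * n_years

firm = np.repeat(np.arange(n_firms), n_years)
alpha = rng.normal(0, 3, n_firms)  # synthetic firm effects (non-zero by design)
x = rng.normal(0, 1, n_obs)
y = 1.5 * x + alpha[firm] + rng.normal(0, 1, n_obs)


def rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


# Restricted model: one common intercept (no firm effects).
X_pooled = np.column_stack([np.ones(n_obs), x])

# Unrestricted model: one dummy per firm replaces the common intercept.
dummies = (firm[:, None] == np.arange(n_firms)).astype(float)
X_fe = np.column_stack([x, dummies])

rss_restricted = rss(X_pooled, y)
rss_unrestricted = rss(X_fe, y)

q = n_firms - 1                   # number of restrictions being tested
df_resid = n_obs - X_fe.shape[1]  # residual degrees of freedom, unrestricted model
f_stat = ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_resid)

print(f"F({q}, {df_resid}) = {f_stat:.2f}")
```

A large F-statistic relative to the F(q, df_resid) distribution means the firm dummies explain a lot of variation that the pooled model misses, which is the evidence for fixed effects referred to above.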

What if we remove TimeEffects? Let's fit model2:

In [19]:
# with only entity effects
model2 = PanelOLS.from_formula('invest ~ value + capital + EntityEffects', data=data)
print(model2.fit())

                          PanelOLS Estimation Summary                           
Dep. Variable:                 invest   R-squared:                        0.7667
Estimator:                   PanelOLS   R-squared (Between):              0.8223
No. Observations:                 220   R-squared (Within):               0.7667
Date:                Sun, Sep 27 2020   R-squared (Overall):              0.8132
Time:                        16:44:14   Log-likelihood                   -1167.4
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      340.08
Entities:                          11   P-value                           0.0000
Avg Obs:                       20.000   Distribution:                   F(2,207)
Min Obs:                       20.000                                           
Max Obs:                       20.000   F-statistic (robust):             340.08
                            

**Summary**  
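The R-squared increased from 0.7253 in model1 to 0.7667 in model2.  
At face value this suggests that model2 fits better than model1, but the two numbers are hard to compare directly: the headline R-squared appears to be computed after the included effects have been removed (in model2 it equals the within R-squared, 0.7667), so each model is scored against a different amount of remaining variation.  
More importantly, the TimeEffects in model1 absorb the 'omitted' or 'unmeasured' variables that change over time and also affect the dependent variable, and model2 no longer controls for them.  
So although model2 has a higher R-squared, we would be less confident in drawing a causal relationship between value, capital and investment from it.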
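
To illustrate why R-squared values from models with different effects are hard to compare, the sketch below scores the same synthetic data twice: once on the raw values (pooled) and once after firm means are removed (within). The data and the helper names (`group_demean`, `r2_and_tss`) are assumptions made for the example and are unrelated to the Grunfeld results. The point is only that the total sum of squares, the denominator of R-squared, changes once effects are removed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_firms, n_years = 11, 20
n_obs = n_firms * n_years

firm = np.repeat(np.arange(n_firms), n_years)
alpha = rng.normal(0, 4, n_firms)  # synthetic firm effects
x = 0.5 * alpha[firm] + rng.normal(0, 1, n_obs)
y = 1.0 * x + alpha[firm] + rng.normal(0, 1, n_obs)


def group_demean(v, groups, n_groups):
    """Subtract each group's mean from its members."""
    sums = np.bincount(groups, weights=v, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return v - (sums / counts)[groups]


def r2_and_tss(x, y):
    """R-squared and total sum of squares of a one-regressor OLS fit."""
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    resid = yc - slope * xc
    tss = float(yc @ yc)
    return 1.0 - float(resid @ resid) / tss, tss


r2_pooled, tss_pooled = r2_and_tss(x, y)
r2_within, tss_within = r2_and_tss(
    group_demean(x, firm, n_firms),
    group_demean(y, firm, n_firms),
)

print(f"pooled: R2 = {r2_pooled:.3f}, total SS = {tss_pooled:.1f}")
print(f"within: R2 = {r2_within:.3f}, total SS = {tss_within:.1f}")
```

Because the within fit is judged only on the variation left after firm means are removed, a lower or higher R-squared there says little about which specification is more credible for causal purposes.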

The Poolability test for model2 also rejects the null hypothesis (F = 49.207, p = 0.0), so firm fixed effects are present.

Lastly, what if we also remove EntityEffects (i.e., remove all fixed effects)?  
We fit model3:

In [20]:
# model without specifying fixed effects
model3 = PanelOLS.from_formula('invest ~ value + capital',data=data)
print(model3.fit())

                          PanelOLS Estimation Summary                           
Dep. Variable:                 invest   R-squared:                        0.8577
Estimator:                   PanelOLS   R-squared (Between):              0.8914
No. Observations:                 220   R-squared (Within):               0.6868
Date:                Sun, Sep 27 2020   R-squared (Overall):              0.8577
Time:                        16:49:28   Log-likelihood                   -1311.4
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      656.92
Entities:                          11   P-value                           0.0000
Avg Obs:                       20.000   Distribution:                   F(2,218)
Min Obs:                       20.000                                           
Max Obs:                       20.000   F-statistic (robust):             656.92
                            

**Summary**  
Model3 yielded the highest R-squared (0.8577) among the three models.  
We can say that model3 explained about 86% of the variance in investment, and that value and capital positively **correlate** with investment.  
However, because model3 does not address **endogeneity** (i.e., there may be confounding factors, such as unobserved firm characteristics or year-specific shocks, that explain the relationship between x and y), we can NOT draw any causal conclusions from its results.