# Difference in Differences
To get a basic understanding of difference in differences, we will work through [this video](https://www.youtube.com/watch?v=J7q2H8aB8bQ), implementing its examples in Python.

## Example - Did free lunches improve student test scores in Sao Paolo?

In Brazil, 5th graders take a standardised maths test at the end of the year.

In 2009, Sao Paolo (Brazil) introduced free lunches. Did this have an impact on test scores?

In [23]:
import pandas as pd

pd.DataFrame([[20, 90]], columns=['2008', '2010'], index=['Sao Paolo'])

|  | 2008 | 2010 |
|---|---|---|
| Sao Paolo | 20 | 90 |


The 70-point rise in test scores may be at least partly due to the program, but other things changed between the two years as well.

Suppose the World Cup took place during the week of the exam in 2008, but not in 2010. This would also influence the difference in test scores between the two years. More generally, scores are likely to follow some trend over time regardless of the program.

The difference between these scores is therefore $D_{SP} = D_{free\ lunch} + D_{trend}$
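
Plugging in the numbers from the table makes the problem concrete (a worked step, using only the two scores shown above):

$$D_{free\ lunch} + D_{trend} = 90 - 20 = 70$$

This is one equation with two unknowns, so the Sao Paolo scores alone cannot tell us how much of the 70 points is due to free lunches and how much is trend.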

If we also had test scores in Rio...

In [24]:

pd.DataFrame([[20, 90], [30, 70]], columns=['2008', '2010'], index=['Sao Paolo', 'Rio'])

|  | 2008 | 2010 |
|---|---|---|
| Sao Paolo | 20 | 90 |
| Rio | 30 | 70 |


If we're willing to assume that the change over time in Rio reflects what would have happened in Sao Paolo without the program, then we can get our difference-in-differences estimate.

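$$Difference\ in\ Differences = D_{SP} - D_{Rio}$$

where $D_{SP}$ and $D_{Rio}$ are the changes in each city's score from 2008 to 2010.

This can also be calculated as:
$$Difference\ in\ Differences = D_{2010} - D_{2008}$$

where $D_{2010}$ and $D_{2008}$ are the gaps between Sao Paolo and Rio in each year.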
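
As a quick check, both routes to the estimate can be computed from the city table. This is a minimal sketch; the variable names (`scores`, `d_sp`, `counterfactual_sp_2010` and so on) are illustrative and do not come from the video.

```python
import pandas as pd

# City-level scores from the table above
scores = pd.DataFrame(
    [[20, 90], [30, 70]],
    columns=['2008', '2010'],
    index=['Sao Paolo', 'Rio'],
)

# Route 1: change over time within each city, then compare the cities
d_sp = scores.loc['Sao Paolo', '2010'] - scores.loc['Sao Paolo', '2008']   # 70
d_rio = scores.loc['Rio', '2010'] - scores.loc['Rio', '2008']              # 40
did_by_city = d_sp - d_rio

# Route 2: gap between the cities within each year, then compare the years
d_2010 = scores.loc['Sao Paolo', '2010'] - scores.loc['Rio', '2010']       # 20
d_2008 = scores.loc['Sao Paolo', '2008'] - scores.loc['Rio', '2008']       # -10
did_by_year = d_2010 - d_2008

# Counterfactual: Sao Paolo's 2008 score plus the change observed in Rio
counterfactual_sp_2010 = scores.loc['Sao Paolo', '2008'] + d_rio           # 60

print(did_by_city, did_by_year, counterfactual_sp_2010)  # 30 30 60
```

Both routes give 30 points. Put differently, if Sao Paolo had followed Rio's 40-point change, its 2010 score without free lunches would have been 60, and the estimated effect is 90 - 60 = 30.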

If any of the assumptions that have to be made sound fishy to you, you should be worried about the validity of the estimate.

# When can you use diff-in-diff?

* You want to evaluate a program or treatment.
* You have a control and treatment group.
* You have observations for both of them before and after.

If the treatment is random, you don't need difference-in-differences to get an unbiased estimate of the effect; you can simply look at the difference between the treatment and control groups.

If you're sure nothing else changed between the measurements of your outcome before and after implementation, you could do a simple before/after difference to get the effect.

If the treatment was assigned to different groups based entirely on observable characteristics, you could use multiple regression and control for these characteristics to get an estimate of the program effect. Unfortunately, you often don't know how the program was assigned or what other differences might exist between the groups.
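
To see how much the choice of estimator matters, the two simple differences described above can be compared with the difference-in-differences estimate on the city-level scores from the free lunch example. This is only an illustrative sketch with made-up variable names; each number equals the program effect only if its own assumption holds.

```python
# City-level scores from the free lunch example
sp_2008, sp_2010 = 20, 90      # treated city
rio_2008, rio_2010 = 30, 70    # comparison city

# Treated vs control after the program: valid if treatment was random
post_difference = sp_2010 - rio_2010                 # 20

# Before vs after in the treated city: valid if nothing else changed
before_after = sp_2010 - sp_2008                     # 70

# Difference in differences: valid if both cities share the same trend
did = (sp_2010 - sp_2008) - (rio_2010 - rio_2008)    # 30

print(post_difference, before_after, did)  # 20 70 30
```

The three estimates disagree (20, 70 and 30 points), which is why it matters which assumption you are prepared to defend.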

In [27]:
data = [['Miguel', 40, 0, 0],
        ['Julia', 80, 1, 1],
        ['Davi', 20, 0, 0],
        ['Sophia', 100, 1, 1],
        ['Gabriel', 30, 0, 0],
        ['Isabella', 0, 1, 0],
        ['Arthur', 60, 0, 1],
        ['Manuela', 40, 1, 0],
        ['Lucas', 60, 0, 1],
        ['Giovanna', 90, 0, 1]]

# One row per student: score, treatment-group dummy, post-period dummy
df = pd.DataFrame(data, columns=['name', 'score', 'D(Treatment)', 'D(Post)'])

In [26]:
df

|  | name | score | D(Treatment) | D(Post) |
|---|---|---|---|---|
| 0 | Miguel | 40 | 0 | 0 |
| 1 | Julia | 80 | 1 | 1 |
| 2 | Davi | 20 | 0 | 0 |
| 3 | Sophia | 100 | 1 | 1 |
| 4 | Gabriel | 30 | 0 | 0 |
| 5 | Isabella | 0 | 1 | 0 |
| 6 | Arthur | 60 | 0 | 1 |
| 7 | Manuela | 40 | 1 | 0 |
| 8 | Lucas | 60 | 0 | 1 |
| 9 | Giovanna | 90 | 0 | 1 |
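
Before running a regression, it can help to collapse the student-level data into the four group means and take the difference in differences by hand. The sketch below rebuilds the table with one row per student (an assumption of this example); `students`, `means` and `change` are illustrative names.

```python
import pandas as pd

# One row per student: name, score, treatment dummy, post-period dummy
data = [['Miguel', 40, 0, 0], ['Julia', 80, 1, 1], ['Davi', 20, 0, 0],
        ['Sophia', 100, 1, 1], ['Gabriel', 30, 0, 0], ['Isabella', 0, 1, 0],
        ['Arthur', 60, 0, 1], ['Manuela', 40, 1, 0], ['Lucas', 60, 0, 1],
        ['Giovanna', 90, 0, 1]]
students = pd.DataFrame(data, columns=['name', 'score', 'D(Treatment)', 'D(Post)'])

# Mean score per group: rows are D(Treatment) = 0/1, columns are D(Post) = 0/1
means = students.groupby(['D(Treatment)', 'D(Post)'])['score'].mean().unstack('D(Post)')

# Change over time within each group, then the difference between the groups
change = means[1] - means[0]
did = change.loc[1] - change.loc[0]

print(means)
print(did)  # 30.0
```

With one row per student, the group means are 30 and 70 for the control group and 20 and 90 for the treated group, which matches the city-level table, so the estimate is again 30.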


# DD with a regression

$$y = \beta_{0}\ +\ \beta_{1}D^{Post}\ +\ \beta_{2}D^{Tr}\ +\ \beta_{3}D^{Post}D^{Tr}\ +\ \epsilon$$
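
In this regression, $\beta_{0}$ is the control group's mean before the program, $\beta_{1}$ is the control group's change over time, $\beta_{2}$ is the pre-program gap between the groups, and $\beta_{3}$, the coefficient on the interaction, is the difference-in-differences estimate. A minimal sketch of the fit with plain least squares in NumPy follows. It assumes one row per student and only recovers point estimates; standard errors would need a statistics package.

```python
import numpy as np

# (score, D^Tr, D^Post), one row per student (an assumption of this sketch)
rows = [(40, 0, 0), (80, 1, 1), (20, 0, 0), (100, 1, 1), (30, 0, 0),
        (0, 1, 0), (60, 0, 1), (40, 1, 0), (60, 0, 1), (90, 0, 1)]

y = np.array([r[0] for r in rows], dtype=float)
treat = np.array([r[1] for r in rows], dtype=float)
post = np.array([r[2] for r in rows], dtype=float)

# Design matrix columns: intercept, D^Post, D^Tr, D^Post * D^Tr
X = np.column_stack([np.ones_like(y), post, treat, post * treat])

# Ordinary least squares fit of y on X
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b3 = beta

print(np.round(beta, 6))
```

Up to floating-point error, the coefficients are $\beta_{0} = 30$, $\beta_{1} = 40$, $\beta_{2} = -10$ and $\beta_{3} = 30$. Because the model has one parameter for each of the four groups, it reproduces the group means exactly, so $\beta_{3}$ equals the estimate obtained by subtracting the means by hand.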