<h1>Introductory Econometrics in Python</h1>

### Table of Contents ###

1. Welcome and Introduction
2. The Nature of Econometrics and Economic Data
3. Simple Regression with Cross-sectional Data
4. Multiple Regression with Cross-sectional Data, including Inference and Hypothesis Testing
5. Binary Dependent Variables
6. Regression Analysis with Panel Data
7. ***Estimation of Treatment Effects: Difference-in-differences Analysis***

# Differences-in-Differences Estimation #

1. Introductory Examples
2. Experiments and Quasi-Experiments
3. Estimation of Treatment Effects by DiD
4. Applied Case Studies

## Experiments (or randomized controlled trials)

**Experimental setups** are common in fields such as psychology and medicine

- Randomly selected patients: some receive the drug (**treatment group**) while others receive a harmless ineffective alternative (**placebo group**)
    - E.g., experimental trials for new drugs before approval by the FDA
    - **Treatment effect**: run an OLS regression that includes a dummy equal to one for treated subjects and zero otherwise
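
The dummy-variable regression can be illustrated with a minimal simulation. The sketch below uses made-up data (the sample size, the baseline of 5 and the effect of 1.5 are assumptions chosen for illustration only) and plain NumPy least squares rather than a dedicated regression package. It shows that the OLS coefficient on the treatment dummy equals the difference in mean outcomes between the treatment and the placebo group.

```python
import numpy as np

rng = np.random.default_rng(0)            # hypothetical simulated trial, not real data
n = 200
treated = rng.integers(0, 2, size=n)      # random assignment: 1 = drug, 0 = placebo
outcome = 5.0 + 1.5 * treated + rng.normal(size=n)   # assumed true effect: 1.5

# OLS of the outcome on a constant and the treatment dummy
X = np.column_stack([np.ones(n), treated])
coef = np.linalg.lstsq(X, outcome, rcond=None)[0]
ols_effect = coef[1]

# difference in mean outcomes between the two groups
diff_in_means = outcome[treated == 1].mean() - outcome[treated == 0].mean()

print(f"OLS dummy coefficient: {ols_effect:.4f}")
print(f"Difference in means:   {diff_in_means:.4f}")
```

Because assignment is random, both numbers estimate the treatment effect without any need for a comparison over time.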
    
Such experimental designs are rare in economics. Some exceptions:
- Effectiveness of mosquito nets on malaria cases
- Effect of deworming children on schooling. Which incentives work best (lentils!)
- Randomizing class size at schools to study performance
- Giving people in poor areas access to electricity
- Equipping people with mobile phones

TED Talk by Esther Duflo
https://www.youtube.com/watch?v=0zvrGiPkVcs&t=300s

**Quasi- (or natural) experiments**

More common in economics: 
- Randomness is introduced by variations in individual circumstances, so that it appears as if the treatment is randomly assigned.
- E.g., discrepancies in legal institutions, location, timing of policy or program implementation, natural randomness (birth dates, rainfall, etc.)

We consider two types of quasi-experiments:
1. We have "as-if" random variation that only partially determines the treatment.
     - The causal effect is estimated by **instrumental variables regression**,
     - the as-if random source of variation provides the instrumental variable.
1. The treatment of individuals (or other entities) is viewed as randomly determined
    - Causal effect estimated by OLS using the treatment, $x_i$, as a regressor.
    - The differences-in-differences estimator belongs to this category


-  Differences-in-Differences estimation has become the single most popular research design in the quantitative social sciences
-  Applied in situations where a consequential treatment was given to some people or units but denied to others “haphazardly.”
- This is sometimes called a “natural experiment” because it is based on naturally occurring variation in some treatment variable that affects only some units over time.
- All good difference-in-differences designs are based on some kind of natural experiment. 
- Perhaps the first application was John Snow's ingenious natural experiment, which convinced the world that cholera is transmitted by water, not air (Snow 1855).

## Introductory Examples ##

**The cholera epidemics in London of the 19th century**

- Cholera killed tens of thousands of people in London 
- Doctors could not help the victims because the prevailing miasma theory held that diseases were spread by microscopic poisonous particles floating through the air.
- Everything that was tried to stop cholera (quarantining, cleaning the air) was ineffective
- The physician John Snow then developed the theory that cholera may be transmitted through microorganisms in the water supply
    - Symptoms were vomiting and diarrhea
    - With each evacuation the organism left the body and flowed into London's water supply
- To test this idea, Snow needed to find a situation where uncontaminated water had been distributed to a large number of people as if by random chance, and then calculate the difference in outcomes between those who did and did not drink contaminated water.


- Two companies served different areas of London back then.
- One company, the Lambeth Water Company, had recently moved its intake pipes further upstream on the Thames where, if cholera was actually transmitted through water, the water should no longer be contaminated
- A path through London's districts determined which of the two companies supplied a household with water

Experimental setup:
- Cross-sectional variation: 
    - districts supplied by two water companies (the Southwark and Vauxhall and the Lambeth Company)
- Time series variation:
    - 1849: both companies obtained their water from the dirty Thames in central London
    - 1852: the Lambeth Company moved its water plant upriver to an area that was free of sewage


Test: Compare changes in death rates from cholera in both districts before and after the intervention 

<br>  

<img src="figs/StatsLondon.gif" width="300"/> 
    
<p>

Death rates in Lambeth districts fell sharply in comparison with changes of death rates in Southwark and Vauxhall districts

***Introductory example II: James Lind and the Magical Sour Cabbage***

- Scurvy (caused by vitamin C deficiency) killed two million sailors between 1500 and 1800
    - Many theories about what could help: malt, blood-letting, turf placed in the mouth, magic elixirs

- James Lind's research hypothesis: sauerkraut prevents scurvy on long sailing trips (it is storable and rich in vitamin C)

- Cross Sectional Variation: 
    - Captain James Cook left with sauerkraut aboard his ship for the South Pacific in 1768
    - 3 other captains left without Sauerkraut on the same journey 

- Time Series Variation: The number of deaths before and after the trip for each ship

- Result: Only James Cook returned without a single death

**Quasi- (or natural) experiments**

Typical Research Question: What are the effects of political interventions?
 - Example: How does a certain regulation affect prices in country X?

Typical Problem: Comparing outcomes before and after the political intervention is often not informative
  - Price changes might also be due to demand or supply developments

Solution: Compare the before-after changes to before-after changes of a comparison or control group


**Graphical representation of Difference-in-Differences**
<br>  
<img src="figs/DiD_graph2.png" width="400"/>     
<p>
    
 - Something exogenous has happened in unit $x$ between the "before" and "after" periods
 - Let's call this event $T$
 - However, nothing has happened in unit $y$ over the same period of time
 - $T$ has an effect on an outcome variable $p$, e.g. the price
 - The difference between the price change in $x$ and the price change in $y$ can help us identify the causal effect of $T$ on $p$
 - This is the basic idea of a DiD framework

We can translate this idea into a Difference-in-Differences regression framework

- $Y_i$ denotes an outcome over a population of individuals $i=1,...,N$
- Two groups indexed by treatment status $treatment=0,1$ 
    - where 0 indicates individuals who did not receive the treatment, i.e. the **control group**, 
    - 1 indicates individuals who received the treatment, i.e. the **treatment group**.
- We observe individuals in two time periods, $post=0,1$ 
    - where 0 indicates a time period before the treatment group receives the treatment, i.e. **pre-treatment period**, 
    - and 1 indicates a time period after the treatment group receives the treatment, i.e. **post-treatment period**.

The outcome $Y_i$ is explained by

$Y_i=\alpha + \beta post_i +\gamma treatment_i + \delta (post_i \times treatment_i) + e_i$

where 
<br>  

<img src="figs/DiD_table2.png" width="500"/> 
    
<p>
The purpose is to find a "good" estimate $\hat{\delta}$ of $\delta$, given the data that we have available
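
To see why $\delta$ is the coefficient of interest, the expected outcome of each of the four group-period cells can be computed from the regression equation. The sketch below assumes the error term has mean zero in every cell; the function name `expected_outcome` and the parameter values are hypothetical and chosen only for illustration.

```python
def expected_outcome(alpha, beta, gamma, delta, post, treatment):
    """Mean outcome implied by Y = alpha + beta*post + gamma*treatment + delta*post*treatment."""
    return alpha + beta * post + gamma * treatment + delta * post * treatment

# Hypothetical parameter values (assumptions, not estimates from any data set)
params = dict(alpha=10.0, beta=-1.0, gamma=3.0, delta=2.5)

# before-after change in each group
change_treated = (expected_outcome(**params, post=1, treatment=1)
                  - expected_outcome(**params, post=0, treatment=1))
change_control = (expected_outcome(**params, post=1, treatment=0)
                  - expected_outcome(**params, post=0, treatment=0))

# the difference of the two changes removes alpha, beta and gamma
did = change_treated - change_control
print(f"change treated: {change_treated}, change control: {change_control}, DiD: {did}")
```

Whatever values are chosen for `alpha`, `beta` and `gamma`, the double difference returns `delta`: the level difference between groups and the common time effect cancel out.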

***How many workers would lose their jobs if we increase the minimum wage?***

The Bureau of Labor Statistics (BLS) reported on the effects of the 1913 Oregon minimum wage law
- Minimum wage of 9 dollars for experienced women
- For inexperienced women, and for girls aged 16–18, the minimum was set at $6 per week

Cross sectional variation
- Payroll, working hours, and sales data from 33 retail stores in Portland and Salem (Oregon)
- 1546 women and 868 men

Time series information
- March and April of 1913: about five months before minimum wage introduction, 
- March and April of 1914: about five months after the minimum wage introduction. 

<br>  

<img src="figs/Kennan.png" width="600"/> 
    
<p>

Main Limitations: 
- Recession: total sales in these Portland stores fell by 8.6 percent over this period.
- As noted by the authors, the jobs held by men were less vulnerable to this decline than those held by women
    - The difference-in-differences estimator therefore overstates the employment effects (the estimate is more negative than the true effect)
    - The authors argue that there would have been no employment effect without the recession (p. 12)

**Example Card and Krueger (1994, AER)**

"*Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania*"

**Time Series Variation**: On April 1, 1992, New Jersey's minimum wage rose from \\$4.25 to \\$5.05

**Cross Sectional Variation**: 410 fast-food restaurants in New Jersey (treatment group) and Eastern Pennsylvania (control group) before and after

Empirical Implementation:
- $Y_i$: the employment of a fast-food restaurant (the outcome), 
- $treatment_i$: whether or not a restaurant is in New Jersey, 
- $post_i$: whether the observation is from before or after the minimum wage hike

***Application*** 

Consider the following sample averages for New Jersey (Treated) and Pennsylvania (Control) before and after the minimum wage implementation in April


|Time     | New Jersey | Pennsylvania | 
| :---    | ---        | ---          | 
|February | 20.44      | 23.33        | 
|November | 21.03      | 21.17        |


What is the effect of the minimum wages on employment, measured as FTEs (full-time equivalents)?
-  1 FTE corresponds to one full-time employee


***Solution*** 

|Time     | New Jersey | Pennsylvania | Difference | 
| :---     | ---        | ---          | ---        | 
|February | 20.44      | 23.33        | −2.89      | 
|November | 21.03      | 21.17        | −0.14      |
|Change   |  0.59      | −2.16        | 2.75       |

- Are these results as expected? 
- What problems, other than a causal effect, might explain this finding?
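
The arithmetic behind the solution table can be reproduced in a few lines. This is only a worked check using the four sample averages from the table above; the dictionary layout and variable names are illustrative choices.

```python
# Average FTE employment per restaurant, taken from the table above
fte = {
    ("New Jersey", "February"): 20.44,
    ("New Jersey", "November"): 21.03,
    ("Pennsylvania", "February"): 23.33,
    ("Pennsylvania", "November"): 21.17,
}

# before-after change in each state
change_nj = fte[("New Jersey", "November")] - fte[("New Jersey", "February")]
change_pa = fte[("Pennsylvania", "November")] - fte[("Pennsylvania", "February")]

# difference-in-differences: change in the treated state minus change in the control state
did = change_nj - change_pa

print(f"Change in New Jersey:   {change_nj:+.2f}")
print(f"Change in Pennsylvania: {change_pa:+.2f}")
print(f"DiD estimate:           {did:+.2f}")
```

The same number results from taking the November gap between the states minus the February gap, which is the "Difference" column of the table.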

**Unbiased Estimator**

- To get an unbiased estimator the estimate has to be correct _on average_, i.e. $E[\hat{\delta}]=\delta$
- This is the case if the following assumptions hold:
    
    1. The model is correctly specified
    1. The error term is on average zero: $E[\varepsilon_i]=0$
    1. The error term is uncorrelated with the other variables in the equation:
		
		$cov(\varepsilon_i,T_i)=0$
		
		$cov(\varepsilon_i,t_i)=0$
		
		$cov(\varepsilon_i,T_i\times t_i)=0$  
        
        
- The last assumption is known as the **parallel trend assumption** or **common trend assumption** and is the most critical one.

**Unbiased Estimator**


- Under these assumptions we can use the equation $Y_i=\alpha+\beta T_i +\gamma t_i + \delta(T_i \times t_i)+e_i$ to determine the expected values of the outcomes, where superscripts $T$ and $C$ denote the treatment and control group and subscripts 0 and 1 the pre- and post-treatment period:

    - $E[Y_0^T]=\alpha+\beta$
    - $E[Y_1^T]=\alpha+\beta+\gamma+\delta$
    - $E[Y_0^C]=\alpha$
    - $E[Y_1^C]=\alpha+\gamma$

**Simple Pre vs Post Estimator (Before-After)**

- Why not just compare "before" and "after" treatment outcomes in the treatment group alone?  


\begin{equation*}
\hat{\delta}_1=\bar{Y}_1^T-\bar{Y}_0^T
\end{equation*}


- This would be a regression of the form $Y_i=\alpha_1+\delta_1t_i+\varepsilon_i$ on the sample from the treatment group only.

- We then have  

\begin{equation*}
E[\hat{\delta}_1]=E[\bar{Y}_1^T]-E[\bar{Y}_0^T]=[\alpha+\beta+\gamma+\delta]-[\alpha+\beta]=\gamma+\delta
\end{equation*}

- The estimator is biased if $\gamma\neq 0$, i.e. if the outcome would have changed over time even without the treatment


**Simple Treatment vs Control Estimator**

- Why not compare average differences in outcome $Y_i$ post-treatment between treatment and control groups, ignoring pre-treatment outcomes?

\begin{equation*}
\hat{\delta}_2=\bar{Y}_1^T-\bar{Y}_1^C
\end{equation*}


- Regression of the form  

\begin{equation*}
Y_i=\alpha_2+\beta_2T_i+\varepsilon_i
\end{equation*}  


- We then have 

\begin{equation*}
E[\hat{\delta}_2]=E[\bar{Y}_1^T]-E[\bar{Y}_1^C]=[\alpha+\beta+\gamma+\delta]-[\alpha+\gamma]=\beta+\delta
\end{equation*}

- The estimator is biased as long as $\beta\neq 0$, i.e. there exists a permanent average difference in outcome $Y_i$ between treatment and control group (no randomized treatment as in controlled experiments)

**Difference in Difference Estimator (DiD)**

- The DiD (or "double difference") estimator is defined as the before-after change in the average outcome of the treatment group minus the before-after change in the average outcome of the control group
- Regression of the form 

\begin{equation*}
Y_i=\alpha+\beta T_i +\gamma t_i + \delta(T_i \times t_i)+e_i
\end{equation*}

- The estimator is unbiased

\begin{equation*}
\hat{\delta}_{DiD}=(\bar{Y}_1^T-\bar{Y}_0^T)-(\bar{Y}_1^C-\bar{Y}_0^C)
\end{equation*}

\begin{equation*}
E[\hat{\delta}_{DiD}]=[(\alpha+\beta+\gamma+\delta)-(\alpha+\beta)]-[(\alpha+\gamma)-\alpha]=(\gamma+\delta)-\gamma=\delta
\end{equation*}


- BUT: only unbiased if treatment and control group have a _common (parallel) trend_, i.e.

\begin{equation*}
\gamma^T=\gamma^C
\end{equation*}


!!!

**Treatment Effect in DiD**


<br>  


<center><img src="figs/graph_did.png" width="500"/> 
    

<p>

- Assume we have a treated and a non-treated group of individuals, e.g. firms in two countries but one country introduced a new law on something and the other country didn't
- The countries already differ in the outcome levels prior to the treatment by 1 (the coefficient on $T$ in the true model)
- The true model is 

\begin{equation*}
Y_i=1+1\times T_i-0.5\times t_i+2\times T_i \times t_i+\varepsilon_i
\end{equation*}  

- How do we have to interpret the coefficients?
- Let's estimate such a model to see the differences between before-after, treatment-control and differences-in-differences estimators 

In [1]:
## generate some random variables with the above characteristics
import pandas as pd
import numpy as np
from numpy.random import seed
from numpy.random import rand
from numpy.random import randn
from scipy import stats
import statsmodels.formula.api as smf    # for the ols and robust ols model
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

## generate control and treatment group pre- and post treatment

seed(1)
y = np.array(randn(1000)+1)
#print(y[:10])
stats.describe(y)

## generate treatment dummy
## which is 0 for the first 500 observations
## and 1 for the second 500 observations

treat=np.ones(1000)
#print(treat[0:10])

treat[:500]=0
#print(treat[490:510])

## generate two periods (before and after) for treatment and control group
post=np.ones(1000)
post[250:750]=0
#print(post[240:260])
#print(post[740:760])

## generate a general time trend which is -0.5 in the post period
y = np.where(post==1, y-0.5, y)

## generate a general level difference between treatment and control group of 1
y = np.where(treat==1, y+1, y)

#generate the treatment effect
y=np.where((treat==1) & (post==1),y+2,y)

df = pd.DataFrame({'y' : y, 'treat': treat, 'post':post})
#df.head()


           before-after control-treatment   DiD   
               (1)             (2)          (3)   
--------------------------------------------------
Intercept  2.01***      0.58***           1.03*** 
           (0.06)       (0.06)            (0.06)  
post       1.54***                        -0.45***
           (0.09)                         (0.09)  
treat                   2.97***           0.97*** 
                        (0.08)            (0.09)  
treat:post                                1.99*** 
                                          (0.12)  
N          500          500               1000    
R2         0.38         0.71              0.57    
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01


**Common Trend assumption**

- The most common problem with DiD estimates is the failure of the common trend assumption
- If the common trend assumption does not hold, we get biased estimates (even the sign can be wrong)
- Suppose that $cov(\varepsilon_i,T_i\times t_i)=E(\varepsilon_i(T_i\times t_i))=\Delta$ so that $Y$ follows a different trend for the treatment and the control group
- The control group has a time trend of $\gamma^C=\gamma$ and the treatment group has a trend of $\gamma^T=\gamma+\Delta$
- In this case the DiD estimator will be biased as 

\begin{equation*}
E[\hat{\delta}_{DiD}]=(\gamma^T+\delta)-\gamma^C=\gamma+\Delta+\delta-\gamma=\delta+\Delta
\end{equation*}

- Think about the New Jersey minimum wage example: How could the common trend assumption be violated?


**Failure of the Common Trend Assumption**

<br>  


<center><img src="figs/fail_common_trend.png" width="500"/> 
    

<p>


**The Common Trend Assumption**

Do you think this assumption is testable?



**Testing the common trend assumption**


- The validity of the common trend assumption cannot be tested formally because we do not observe the "real" counterfactual
- However, some techniques are frequently applied to add confidence in its validity; they require a sufficiently long pre-treatment period
    - Graphical illustration of pre-treatment trends
        - Problem: a graphical check cannot account for factors that we are able to control for in our regression
    - Adding pre-treatment dummies to the baseline model and checking whether they are significant (e.g. $t_{-1}\cdot T$)
        - This tests whether there was already a difference in the slope of $y$ prior to the treatment
        - Problem: it is hard to distinguish between a violation of the common trend assumption and a potential anticipation effect
    - Running placebo estimations with randomized fake treatments for the treatment group in the pre-treatment period when we have a heterogeneous treatment (treatment occurs at different points in time for different entities)
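
The idea behind the pre-treatment checks can be sketched with a placebo DiD on simulated data. The block below is a simplified illustration under stated assumptions: the data are made up, there are two pre-treatment periods, and the placebo estimate is computed from cell means rather than from a regression with pre-treatment dummies. The functions `cell_means` and `placebo_did` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(42)     # hypothetical simulated data
n = 2000                            # observations per group and period
periods = [-2, -1, 0, 1]            # the treatment starts in period 0

def cell_means(extra_trend):
    """Sample mean of y for every (group, period) cell; group 1 is treated.

    extra_trend is an additional per-period drift of the treated group,
    i.e. a violation of the common trend assumption when it is non-zero.
    """
    means = {}
    for g in (0, 1):
        for t in periods:
            mu = 1.0 + 1.0 * g - 0.5 * t + extra_trend * g * t + 2.0 * g * (t >= 0)
            means[(g, t)] = (mu + rng.normal(size=n)).mean()
    return means

def placebo_did(means):
    """DiD on pre-treatment periods only, pretending treatment started in period -1."""
    change_treated = means[(1, -1)] - means[(1, -2)]
    change_control = means[(0, -1)] - means[(0, -2)]
    return change_treated - change_control

placebo_parallel = placebo_did(cell_means(extra_trend=0.0))
placebo_diverging = placebo_did(cell_means(extra_trend=0.8))

print(f"placebo DiD, parallel pre-trends:  {placebo_parallel:.3f}")
print(f"placebo DiD, diverging pre-trends: {placebo_diverging:.3f}")
```

A placebo estimate near zero is consistent with common pre-treatment trends, while a clearly non-zero estimate signals diverging trends or anticipation. Neither outcome proves anything about the post-treatment counterfactual.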
        

In [5]:
## generate some random variables with the above characteristics
import pandas as pd
import numpy as np
from numpy.random import seed
from numpy.random import rand
from numpy.random import randn
from scipy import stats
import statsmodels.formula.api as smf    # for the ols and robust ols model
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

## generate control and treatment group pre- and post treatment

seed(1)
y = np.array(randn(1000)+1)
#print(y[:10])
stats.describe(y)

## generate treatment dummy
## which is 0 for the first 500 observations
## and 1 for the second 500 observations

treat=np.ones(1000)
#print(treat[0:10])

treat[:500]=0
#print(treat[490:510])

## generate two periods (before and after) for treatment and control group
post=np.ones(1000)
post[250:750]=0
#print(post[240:260])
#print(post[740:760])

## generate a general time trend which is -0.5 in the post period
y = np.where(post==1, y-0.5, y)

## generate a general level difference between treatment and control group of 1
y = np.where(treat==1, y+1, y)

#generate the treatment effect
y=np.where((treat==1) & (post==1),y+2,y)

df = pd.DataFrame({'y' : y, 'treat': treat, 'post':post})
df.head()

## now generate y for the case when the common trend assumption doesn't hold
## in this example we do this by adding an increase of 4 in the post period for the non-treated
y2=np.where((treat==0) & (post==1),y+4,y)
df['y2'] = y2   # store y2 in the DataFrame so that formulas using data=df can find it


In [6]:
## DiD model where common trend holds (as was the case in the previous example)
reg_common = smf.ols(formula='y ~ treat*post', data=df)
results_common = reg_common.fit()


## DiD model where the common trend doesn't hold (control group gets an extra +4 in the post period)
reg_nocommon = smf.ols(formula='y2 ~ treat*post', data=df)
results_nocommon = reg_nocommon.fit()


output = summary_col([results_common,results_nocommon],stars=True,float_format='%0.2f',
                 model_names=['common trend\n(1)','no common trend\n(2)'],
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                 'R2':lambda x: "{:.2f}".format(x.rsquared)})
print(output)



           common trend no common trend
               (1)            (2)      
---------------------------------------
Intercept  1.03***      1.03***        
           (0.06)       (0.06)         
treat      0.97***      0.97***        
           (0.09)       (0.09)         
post       -0.45***     3.55***        
           (0.09)       (0.09)         
treat:post 1.99***      -2.01***       
           (0.12)       (0.12)         
N          1000         1000           
R2         0.57         0.66           
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
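
The interaction coefficient of about -2 in column (2) follows directly from the bias formula $\delta+\Delta$. The short check below plugs in the values used to construct the simulated data; the variable names are illustrative.

```python
# Parameters used to construct the simulated data above
delta = 2.0                  # true treatment effect
gamma_treated = -0.5         # time trend of the treatment group
gamma_control = -0.5 + 4.0   # control group trend after the extra +4 in the post period

# trend difference between treatment and control group
Delta = gamma_treated - gamma_control

# expected value of the DiD estimator when the common trend assumption fails
expected_did = delta + Delta

print(f"Delta = {Delta}, expected DiD estimate = {expected_did}")
```

The estimate has the wrong sign even though the true treatment effect is positive, because the control group's trend is mistaken for the counterfactual trend of the treated group.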


**Real data application**

**Effect of a Garbage Incinerator’s Location on Housing Prices**

- We are interested in whether and how much the construction of a new garbage incinerator affected
the value of nearby houses.
- We analyze this using the data set KIELMc.
- We first estimate a model for 1981 (when the construction began).
- In 1981, the houses close to the construction site were cheaper by an average of \$30,688.27.


In [8]:
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

#  regressions for 1981:
y81 = (kielmc['year'] == 1981)
reg81 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y81)
results81 = reg81.fit()



# print regression tables:
table_81 = pd.DataFrame({'b': round(results81.params, 4),
                         'se': round(results81.bse, 4),
                         't': round(results81.tvalues, 4),
                         'pval': round(results81.pvalues, 4)})
print(f'table_81: \n{table_81}\n')

table_81: 
                     b         se        t  pval
Intercept  101307.5136  3093.0267  32.7535   0.0
nearinc    -30688.2738  5827.7088  -5.2659   0.0



- But this was not only due to the new incinerator, since even in 1978 nearby houses were cheaper by an average of \$18,824.37.
- The difference of these differences, $\hat{\delta}$ = −\\$30,688.27 − (−\\$18,824.37) = −\\$11,863.90, is the DiD estimator and is arguably a better indicator of the actual effect.


In [16]:
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# separate regressions for 1978 and 1981:
y78 = (kielmc['year'] == 1978)
reg78 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y78)
results78 = reg78.fit()

y81 = (kielmc['year'] == 1981)
reg81 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y81)
results81 = reg81.fit()

# print regression tables:
table_78 = pd.DataFrame({'b78': round(results78.params, 4),
                         'se78': round(results78.bse, 4),
                         't78': round(results78.tvalues, 4),
                         'pval78': round(results78.pvalues, 4)})
print(f'table_78: \n{table_78}\n')

table_81 = pd.DataFrame({'b81': round(results81.params, 4),
                         'se81': round(results81.bse, 4),
                         't81': round(results81.tvalues, 4),
                         'pval81': round(results81.pvalues, 4)})
print(f'table_81: \n{table_81}\n')

diff_78_81=results81.params-results78.params

print(f'diff_78_81: \n{diff_78_81}\n')

#kielmc.describe()

table_78: 
                  b78      se78      t78  pval78
Intercept  82517.2276  2653.790  31.0941  0.0000
nearinc   -18824.3705  4744.594  -3.9675  0.0001

table_81: 
                   b81       se81      t81  pval81
Intercept  101307.5136  3093.0267  32.7535     0.0
nearinc    -30688.2738  5827.7088  -5.2659     0.0

diff_78_81: 
Intercept    18790.285953
nearinc     -11863.903252
dtype: float64



- The DiD estimator can be obtained more conveniently using a joint regression model with the interaction term, as described above.
- The estimate $\hat{\delta}$ = −\\$11,863.90 can be read directly as the coefficient of the interaction term.
- For a one-sided test, the p-value is 0.5 $\cdot$ 0.113 = 0.056, so there is some statistical evidence of a negative impact.

In [13]:
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# joint regression including an interaction term:
reg_joint = smf.ols(formula='rprice ~ nearinc * C(year)', data=kielmc)
results_joint = reg_joint.fit()

table_joint = pd.DataFrame({'b': round(results_joint.params, 4),
                            'se': round(results_joint.bse, 4),
                            't': round(results_joint.tvalues, 4),
                            'pval': round(results_joint.pvalues, 4)})
print(f'table_joint: \n{table_joint}\n')

table_joint: 
                                  b         se        t    pval
Intercept                82517.2276  2726.9101  30.2603  0.0000
C(year)[T.1981]          18790.2860  4050.0650   4.6395  0.0000
nearinc                 -18824.3705  4875.3221  -3.8612  0.0001
nearinc:C(year)[T.1981] -11863.9033  7456.6462  -1.5911  0.1126
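
Halving the two-sided p-value is valid only when the estimate has the sign stated in the alternative hypothesis. The small check below uses the t statistic and p-value reported for the interaction term in the table above and assumes the alternative $H_1: \delta < 0$.

```python
# Values reported for nearinc:C(year)[T.1981] in the joint regression table
t_stat = -1.5911
p_two_sided = 0.1126

# H1: delta < 0. The two-sided p-value is halved only if the t statistic is negative;
# otherwise the one-sided p-value is one minus half the two-sided value.
if t_stat < 0:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

print(f"one-sided p-value: {p_one_sided:.3f}")
```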



In [19]:
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# separate regressions for the treatment group (nearinc == 1) and the control group (nearinc == 0):
treat = (kielmc['nearinc'] == 1)
regtreat = smf.ols(formula='rprice ~ C(year)', data=kielmc, subset=treat)
resultstreat = regtreat.fit()

control = (kielmc['nearinc'] == 0)
regcontrol = smf.ols(formula='rprice ~ C(year)', data=kielmc, subset=control)
resultscontrol = regcontrol.fit()

# joint regression including an interaction term:
reg_joint = smf.ols(formula='rprice ~ nearinc * C(year)', data=kielmc)
results_joint = reg_joint.fit()

# print regression tables:
table_treat = pd.DataFrame({'b': round(resultstreat.params, 4),
                         'se': round(resultstreat.bse, 4),
                         't': round(resultstreat.tvalues, 4),
                         'pval': round(resultstreat.pvalues, 4)})
print(f'table_treat: \n{table_treat}\n')

table_control = pd.DataFrame({'b': round(resultscontrol.params, 4),
                         'se': round(resultscontrol.bse, 4),
                         't': round(resultscontrol.tvalues, 4),
                         'pval': round(resultscontrol.pvalues, 4)})
print(f'table_control: \n{table_control}\n')

diff_treat_control=resultstreat.params-resultscontrol.params

print(f'diff_treat_control: \n{diff_treat_control}\n')

table_joint = pd.DataFrame({'b': round(results_joint.params, 4),
                            'se': round(results_joint.bse, 4),
                            't': round(results_joint.tvalues, 4),
                            'pval': round(results_joint.pvalues, 4)})
print(f'table_joint: \n{table_joint}\n')

table_treat: 
                          b         se        t    pval
Intercept        63692.8571  5296.3219  12.0259  0.0000
C(year)[T.1981]   6926.3827  8205.0266   0.8442  0.4007

table_control: 
                          b         se        t  pval
Intercept        82517.2276  2277.5287  36.2310   0.0
C(year)[T.1981]  18790.2860  3382.6342   5.5549   0.0

diff_treat_control: 
Intercept         -18824.370499
C(year)[T.1981]   -11863.903252
dtype: float64

table_joint: 
                                  b         se        t    pval
Intercept                82517.2276  2726.9101  30.2603  0.0000
C(year)[T.1981]          18790.2860  4050.0650   4.6395  0.0000
nearinc                 -18824.3705  4875.3221  -3.8612  0.0001
nearinc:C(year)[T.1981] -11863.9033  7456.6462  -1.5911  0.1126



- The DiD can be improved:
<br>

    - A logarithmic specification is more plausible since it implies a constant percentage effect on the house values.
    - We can also add additional regressors to control for incidental changes in the composition of the houses traded.

In [20]:
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# difference in difference (DiD):
reg_did = smf.ols(formula='np.log(rprice) ~ nearinc*C(year)', data=kielmc)
results_did = reg_did.fit()

# print regression table:
table_did = pd.DataFrame({'b': round(results_did.params, 4),
                          'se': round(results_did.bse, 4),
                          't': round(results_did.tvalues, 4),
                          'pval': round(results_did.pvalues, 4)})
print(f'table_did: \n{table_did}\n')

# DiD with control variables:
reg_didC = smf.ols(formula='np.log(rprice) ~ nearinc*C(year) + age +'
                           'I(age**2) + np.log(intst) + np.log(land) +'
                           'np.log(area) + rooms + baths',
                   data=kielmc)
results_didC = reg_didC.fit()

# print regression table:
table_didC = pd.DataFrame({'b': round(results_didC.params, 4),
                           'se': round(results_didC.bse, 4),
                           't': round(results_didC.tvalues, 4),
                           'pval': round(results_didC.pvalues, 4)})
print(f'table_didC: \n{table_didC}\n')

table_did: 
                               b      se         t    pval
Intercept                11.2854  0.0305  369.8386  0.0000
C(year)[T.1981]           0.1931  0.0453    4.2606  0.0000
nearinc                  -0.3399  0.0546   -6.2308  0.0000
nearinc:C(year)[T.1981]  -0.0626  0.0834   -0.7508  0.4533

table_didC: 
                              b      se        t    pval
Intercept                7.6517  0.4159  18.3986  0.0000
C(year)[T.1981]          0.1621  0.0285   5.6868  0.0000
nearinc                  0.0322  0.0475   0.6789  0.4977
nearinc:C(year)[T.1981] -0.1315  0.0520  -2.5305  0.0119
age                     -0.0084  0.0014  -5.9236  0.0000
I(age ** 2)              0.0000  0.0000   4.3415  0.0000
np.log(intst)           -0.0614  0.0315  -1.9500  0.0521
np.log(land)             0.0998  0.0245   4.0766  0.0001
np.log(area)             0.3508  0.0515   6.8129  0.0000
rooms                    0.0473  0.0173   2.7317  0.0067
baths                    0.0943  0.0277   3.4003  0.
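
In the logarithmic specification, the interaction coefficient is only approximately a percentage effect. As a sketch of the usual conversion, the block below turns the two interaction coefficients reported above into exact percentage changes using $100\cdot(e^{b}-1)$; the function name `pct_effect` is a hypothetical helper.

```python
import math

# Interaction coefficients taken from the two regression tables above
coef_simple = -0.0626      # log specification without controls
coef_controls = -0.1315    # log specification with control variables

def pct_effect(b):
    """Exact percentage change in the outcome implied by a coefficient b on a dummy in a log model."""
    return 100 * (math.exp(b) - 1)

pct_simple = pct_effect(coef_simple)
pct_controls = pct_effect(coef_controls)

print(f"without controls: {pct_simple:.2f} %")
print(f"with controls:    {pct_controls:.2f} %")
```

With controls, the estimate implies that houses near the incinerator lost roughly 12 percent of their value relative to the comparison group, slightly less in absolute terms than the raw coefficient suggests.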

**MEDICAID AND MORTALITY: NEW EVIDENCE FROM LINKED SURVEY AND ADMINISTRATIVE DATA**
(Miller et al., 2021)

- Low-income individuals in the U.S. experience dramatically worse health than those with high incomes.
- The low-income group also experiences higher risks of dying from diabetes (by 787%), cardiovascular disease (552%), and respiratory disease (813%) in a given year relative to those in higher-income families
- Research has also shown that men at the bottom of the income distribution live on average nearly 15 years less, and women over 10 years less, than those at the top of the income distribution


**MEDICAID AND MORTALITY: NEW EVIDENCE FROM LINKED SURVEY AND ADMINISTRATIVE DATA**
(Miller et al., 2021)

- What is the research question?
- What is the "identification strategy"? (i.e., what is the research design aiming to provide a causal answer to the research question?)
- Why does it work?
- How is the common trend assumption justified?
    
https://www.nber.org/system/files/working_papers/w26081/w26081.pdf

**Example: The Anti-Competitive Effect of Minority Share Acquisitions: Evidence from the Introduction of National Leniency Programs**

https://www.tau.ac.il/~spiegel/papers/MS-20200528.pdf

**Unbundling, Regulation, and Pricing: Evidence from Electricity Distribution**

https://ftp.zew.de/pub/zew-docs/dp/dp18050.pdf

***Case Study 1: The Effect of Mergers on Prices***

Relevance: 
- Increasing retail concentration in Europe (CR5>70\%) 
- Prices may rise due to increase in concentration or decrease due to efficiency gains (e.g., cost synergies or bargaining)

Research question
- What are the effects of a merger between two German retailers (a discounter and a supermarket) on grocery prices?

Methodology
- Exploit the fact that retailers price at the local market level
- DiD estimator to compare markets with pre-merger overlap of acquirer and target to a control group of unaffected markets

Results
- Prices increase, on average, after the merger by 0.5%
- immediate 1-7\% price increase for supermarkets
- 1.5% Efficiency gains of discounters 2 years after merger


The Market and the Merger
- We consider the merger of R1 and R2 with pre-merger market shares of 25\% and 5\%
- Outsiders O1-O3 split the remainder rather equally (20\%, 15\%, 15\%)

German retailers adopt a local pricing strategy and national bargaining
- Local pricing: R1 owns 11,400 stores operated by roughly 4,500 independent merchants with high degree of freedom
- National purchasing: Derived from former (regional) buying groups built to join forces in purchasing activities

We observe three local market types in Germany 

|Type | Firms          | Concentration | Efficiency | 
| :-: | :-:            | :-:           | :-:        | 
|A    | R1 *and* R2, O | X             | X          | 
|B    | R1 *or* R2, O  | -             | X          |
|C    |  O             | -             | -          |

Previous studies identify the net price effect as the sum of market power and efficiency gains. The three market types allow the two effects to be separated:
- comparing markets of type A to markets of type C: joint effect of market power and efficiency gains
- comparing markets of type A to markets of type B: market power effect (both types are affected by the efficiency gains, but only type A by the change in concentration)


Market Power Effect

<br>  

<center><img src="figs/Design3.png" width="400"/> 
    
<p>


Efficiency Gains + Market power

<br>  

<center><img src="figs/Design4.png" width="400"/> 
    
<p>


***Case Study 2:***
The effect of Lockdowns on Corona incidences

https://www.dropbox.com/s/3c5w01tjpm2lt4c/curfew_20210619_2.pdf?dl=0