In [1]:
import pandas as pd

This notebook prepares datasets for use in class.

Resources:

* http://economics.mit.edu/faculty/angrist/data1/mhe

* http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html

* https://www.cambridge.org/gb/academic/subjects/sociology/sociology-general-interest/counterfactuals-and-causal-inference-methods-and-principles-social-research-2nd-edition?format=PB

In [2]:
def from_dta_to_csv(textbook, dataset):
    """This function transfers the dataset from *.dta to a csv file."""
    substring = textbook + '/' + dataset
    df = pd.read_stata('sources/' + substring + '.dta')
    df.to_csv('processed/' + substring + '.csv', index=False) 
    return df
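The helper above assumes that the target directory under `processed/` already exists. A minimal sketch of a variant that creates it on demand is shown here; the name `convert_dta_to_csv` and the `src_root` / `dst_root` parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def convert_dta_to_csv(textbook, dataset, src_root="sources", dst_root="processed"):
    """Read <src_root>/<textbook>/<dataset>.dta and write it as CSV under <dst_root>."""
    src = Path(src_root) / textbook / f"{dataset}.dta"
    dst = Path(dst_root) / textbook / f"{dataset}.csv"

    # Create the output directory if it is missing (assumption: this is desired).
    dst.parent.mkdir(parents=True, exist_ok=True)

    df = pd.read_stata(src)
    df.to_csv(dst, index=False)
    return df
```

With the default roots it reads and writes the same paths as `from_dta_to_csv`, so either function could be used in the cells that follow.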

## lowbrth.dta

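This dataset on low birth weight contains observations for both 1987 and 1990 (see the `year` column), so it is a two-period panel rather than a single cross section.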

In [3]:
df = from_dta_to_csv('wooldrige', 'lowbrth')
df.head()

Unnamed: 0,year,lowbrth,infmort,afdcprt,popul,pcinc,physic,afdcprc,d90,lpcinc,...,clbedspc,povrate,cpovrate,afdcpsq,cafdcpsq,physicpc,lphypc,clphypc,lpopul,clpopul
0,1987,8.0,12.2,132,4084,12039,151,3.232125,0,9.395906,...,,21.299999,,10.446634,,0.036974,-3.297552,,8.314832,
1,1990,8.4,10.8,132,4041,14899,158,3.266518,1,9.60905,...,-0.018767,19.200001,-2.099998,10.67014,0.223506,0.039099,-3.241652,0.0559,8.304248,-0.010584
2,1987,4.8,10.4,19,524,18461,138,3.625954,0,9.823416,...,,12.0,,13.147544,,0.263359,-1.334238,,6.261492,
3,1990,4.8,10.5,24,550,20867,146,4.363636,1,9.945924,...,-0.048427,11.4,-0.6,19.041323,5.893779,0.265455,-1.326312,0.007926,6.309918,0.048427
4,1987,6.4,9.5,91,3400,14322,191,2.676471,0,9.569552,...,,12.8,,7.163495,,0.056176,-2.879257,,8.131531,


## Lee (2008), regression discontinuity design

* https://rdrr.io/cran/rddtools/man/house.html

I converted the `rda` file to `csv` manually.

In [4]:
df = pd.read_csv('sources/msc/house.csv', index_col=0)

# We want more interpretable column names.
df.rename(columns={'x': 'vote_last', 'y': 'vote_next'}, inplace=True)
df.to_csv('processed/msc/house.csv', index=False) 
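As a small sanity check on the renaming step, the sketch below applies the same mapping to a toy frame and verifies that the new column names are present. The toy values are made up for illustration and do not come from `house.csv`.

```python
import pandas as pd

# Toy frame standing in for house.csv; the values are made up for illustration.
toy = pd.DataFrame({"x": [-0.10, 0.05], "y": [0.45, 0.58]})

# Same mapping as in the cell above, written without inplace=True
# so that the original frame stays untouched.
renamed = toy.rename(columns={"x": "vote_last", "y": "vote_next"})

# Fail early if an expected column is missing after the rename.
expected = ["vote_last", "vote_next"]
missing = [col for col in expected if col not in renamed.columns]
assert not missing, f"missing columns: {missing}"
```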

## Krueger (1999), STAR experiment, clustering on group level

In [8]:
# A lot of pre-processing was required, using the replication material
# from the MHE website.
df = from_dta_to_csv('angrist_pischke', 'webstar')
df.head()

Unnamed: 0,schidkn,pscore,classid,cs,female,nwhite,n
0,22,0.0,6317.0,15.0,0.0,1.0,1.0
1,33,0.079981,9700.0,26.0,1.0,1.0,1.0
2,56,0.030598,16647.0,21.0,0.0,0.0,1.0
3,33,1.073276,9700.0,26.0,0.0,1.0,1.0
4,45,0.357227,13400.0,24.0,0.0,1.0,1.0


## nswre74.dta

* https://users.nber.org/~rdehejia/nswdata.html


In [6]:
df = from_dta_to_csv('angrist_pischke', 'nswre74')

## Morgan & Winship

These are the datasets for the matching illustration in Chapter 5.

In [7]:
for num in range(1, 11):
    fname = 'mw_cath{}'.format(num) 
    df = from_dta_to_csv('morgan_winship', fname)
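The loop above expects ten source files, `mw_cath1` through `mw_cath10`. A minimal sketch for checking that they are all in place before converting is shown here; it assumes the same `sources/morgan_winship/` layout that `from_dta_to_csv` uses.

```python
from pathlib import Path

# Assumed layout: sources/morgan_winship/mw_cath<N>.dta for N = 1..10.
fnames = [f"mw_cath{num}" for num in range(1, 11)]
paths = [Path("sources") / "morgan_winship" / f"{name}.dta" for name in fnames]

# Collect missing files instead of failing on the first one.
missing = [str(path) for path in paths if not path.exists()]
print(f"{len(paths) - len(missing)} of {len(paths)} source files found")
```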