## Clean IPUMS CPS data

This is one of the Jupyter notebooks I used in my preparation of *[Probably Overthinking It: How to Use Data to Answer Questions, Avoid Statistical Traps, and Make Better Decisions](https://greenteapress.com/wp/probably-overthinking-it)*, University of Chicago Press, 2023.

Before you read these notebooks, please keep in mind:

* There is some explanatory text in the notebooks, but some of the examples will not make sense until you have read the corresponding chapter in the book.

* While preparing these notebooks, I made some changes to improve the readability of the code. There might be small differences between what appears in the book and what you get when you run the code.

[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ProbablyOverthinkingIt/blob/book/notebooks/clean_cps.ipynb).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

This is the notebook I used to download and clean IPUMS CPS data for the example in Chapter 10 about real wages and education.

To create and download a data extract, you'll need an [IPUMS API key](https://developer.ipums.org/docs/v2/get-started/).

In [None]:
api_key = "PUT YOUR API KEY HERE"
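
Rather than pasting the key into the notebook, one option is to read it from an environment variable. This is a minimal sketch: the variable name `IPUMS_API_KEY` and the helper `get_api_key` are assumptions for illustration, not something `ipumspy` requires.

```python
import os


def get_api_key(env_var="IPUMS_API_KEY", default="PUT YOUR API KEY HERE"):
    """Return the API key from the environment, or a placeholder.

    env_var: string name of the environment variable (hypothetical name)
    default: string to return if the variable is not set
    returns: string
    """
    return os.environ.get(env_var, default)


api_key = get_api_key()
```

If the variable is not set, `api_key` falls back to the same placeholder string used above.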

Installing `ipumspy` has become a bit of a challenge because it requires an older version of pandas, so you might want to run this notebook in an environment that is separate from the one you use to run the other notebooks.

In [None]:
# !pip install ipumspy

# !conda install --experimental-solver=libmamba -c conda-forge ipumspy

In [None]:
from ipumspy import IpumsApiClient, CpsExtract

In [None]:
ipums = IpumsApiClient(api_key)

```
Type 	Variable 	Label
H 	YEAR 	Survey year
H 	SERIAL 	Household serial number
H 	MONTH 	Month
H 	HWTFINL 	Household weight, Basic Monthly
H 	CPSID 	CPSID, household record
P 	PERNUM 	Person number in sample unit
P 	WTFINL 	Final Basic Weight
P 	CPSIDP 	CPSID, person record
P 	AGE 	Age
P 	SEX 	Sex
P 	RACE 	Race
P 	FREVER 	Number of live births ever had
P 	FREXPECT 	Expect to have additional children
P 	FRSUPPWT 	Fertility Supplement Weight
```



```
cps1976_06s

cps1989_06s
Invalid sample name: cps1993_06s
Invalid sample name: cps1996_06s
Invalid sample name: cps1997_06s
Invalid sample name: cps1999_06s
Invalid sample name: cps2005_06s
Invalid sample name: cps2007_06s
Invalid sample name: cps2009_06s
```

* Discovering sample names is non-obvious. A name combines the year, the month, and a suffix, `b` or `s`, and you also have to know which month goes with which supplement.

* It might be nice to have a synchronous version of `download_extract`.
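
The naming pattern in the notes above can be sketched as a small helper. The function name `sample_name` is hypothetical, and the pattern is inferred from names like `cps1976_06s`. It only builds the string; it does not check whether IPUMS actually provides that sample.

```python
def sample_name(year, month, suffix="s"):
    """Build a CPS sample name like 'cps1976_06s'.

    year: int survey year
    month: int month number, 1-12
    suffix: string, 'b' or 's'
    returns: string
    """
    if suffix not in ("b", "s"):
        raise ValueError("suffix must be 'b' or 's'")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return f"cps{year}_{month:02d}{suffix}"


print(sample_name(1976, 6))
print(sample_name(1993, 6, "b"))
```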

In [None]:
years = range(1970, 2022)

In [None]:
missing = [1978, 1989, 1993, 1996, 1997, 1999, 2005, 2007, 2009]
suffixes = ["b" if year in missing else "s" for year in years]

In [None]:
# Every sample is the March supplement ("_03s"), so `suffixes` is not used here
samples = [f"cps{year}_03s" for year in years]
# samples

In [None]:
variables = [
    "cpi99",
    "AGE",
    "SEX",
    "RACE",
    "EMPSTAT",
    "LABFORCE",
    "EDUC",
    "FTOTVAL",
    "INCTOT",
    "INCWAGE",
]

In [None]:
extract = CpsExtract(
    samples, variables, data_format="stata", description="Extract for Simpson paradox"
)
ipums.submit_extract(extract)

In [None]:
extract_status = ipums.extract_status(extract)
extract_status
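
Extracts take a while to process, so the status usually has to be checked more than once. Here is a rough sketch of a polling loop. `wait_for_status` is a hypothetical helper that takes any zero-argument function returning a status string, and the `"completed"` value is an assumption about what `extract_status` returns, so check it against the output of the previous cell.

```python
import time


def wait_for_status(get_status, target="completed", interval=30, max_tries=120):
    """Call get_status until it returns target or max_tries is reached.

    get_status: function with no arguments that returns a status string
    target: string status to wait for (assumed value)
    interval: seconds to sleep between checks
    max_tries: maximum number of checks
    returns: the last status seen
    """
    status = None
    for i in range(max_tries):
        status = get_status()
        if status == target:
            break
        # Don't sleep after the final check
        if i < max_tries - 1:
            time.sleep(interval)
    return status


# Possible use with the client and extract from this notebook (not run here):
# wait_for_status(lambda: ipums.extract_status(extract))
```

Passing a function rather than the client itself keeps the helper independent of `ipumspy`, which also makes it easy to try out with a fake status source.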

In [None]:
ipums.download_extract(6, collection="cps")

In [None]:
recent_extracts = ipums.retrieve_previous_extracts("cps")

# display extract IDs and descriptions
for ex in recent_extracts:
    print(f"{ex['number']}: {ex['description']}")

In [None]:
from ipumspy import readers

# read ddi and data
ddi_codebook = readers.read_ipums_ddi("cps_00006.xml")

In [None]:
info = ddi_codebook.get_variable_info("asecwt")
print(info.description)

In [None]:
# if you get the FWF format, which seems to be the default, you can use `readers` to read it

# ipums_df = readers.read_microdata(ddi_codebook, 'cps_00002.dat.gz')
# ipums_df.shape

In [None]:
ipums_df = pd.read_stata("cps_00006.dta.gz", convert_categoricals=False)
ipums_df.head()

In [None]:
ipums_df["year"].value_counts()

In [None]:
ipums_df.groupby("year")["cpi99"].mean().plot()

In [None]:
# 2 means yes, in the labor force
ipums_df["labforce"].value_counts()

In [None]:
educ = ipums_df["educ"]
educ.value_counts().sort_index()

In [None]:
negative = ipums_df["asecwt"] < 0
negative.sum()

In [None]:
zero = ipums_df["asecwt"] == 0
zero.sum()

In [None]:
ipums_df.loc[negative, "asecwt"] = 0

In [None]:
def resample_rows_weighted(df, column="finalwgt"):
    """Resamples a DataFrame using probabilities proportional to given column.
    df: DataFrame
    column: string column name to use as weights
    returns: DataFrame
    """
    sample = df.sample(frac=1, replace=True, weights=df[column])
    return sample

In [None]:
def resample_by_year(df, column="finalwgt"):
    """Resample rows within each year.
    df: DataFrame
    column: string column name to use as weights
    returns DataFrame
    """
    grouped = df.groupby("year")
    samples = [resample_rows_weighted(group, column) for _, group in grouped]
    sample = pd.concat(samples, ignore_index=True)
    return sample
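
To see what weighted resampling within each year does, here is a toy example with made-up data. It inlines the same `sample` call used above so the block runs on its own, then checks two properties: each year keeps its number of rows, and rows with weight 0 are never drawn.

```python
import numpy as np
import pandas as pd

# Made-up data: two years, with one zero-weight row in each year
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2001, 2001],
        "value": [1, 2, 3, 4, 5],
        "weight": [0.0, 1.0, 3.0, 0.0, 2.0],
    }
)

np.random.seed(17)
parts = [
    group.sample(frac=1, replace=True, weights=group["weight"])
    for _, group in toy.groupby("year")
]
toy_sample = pd.concat(parts, ignore_index=True)

# Each year has as many rows as it had before resampling
print(toy_sample["year"].value_counts().sort_index())

# Rows with weight 0 (values 1 and 4) never appear in the sample
print(toy_sample["value"].isin([1, 4]).any())
```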

In [None]:
np.random.seed(17)
sample = resample_by_year(ipums_df, "asecwt")

## Recode educ into degree

In [None]:
# Treat codes 1 and 999 as missing; work on a new Series rather than
# modifying a column of `sample` in place
educ = sample["educ"].replace([1, 999], np.nan)
educ.isna().sum()

In [None]:
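# Stop at 70 so this range does not overlap with `hs` (71-73) below;
# the labels in `degree` are the same either way, because `hs` is assigned later
nohs = (educ >= 2) & (educ <= 70)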
nohs.sum()

In [None]:
hs = (educ >= 71) & (educ <= 73)
hs.sum()

In [None]:
assc = (educ >= 91) & (educ <= 92)
assc.sum()

In [None]:
bach = educ == 111
bach.sum()

In [None]:
adv = (educ >= 123) & (educ <= 125)
adv.sum()

In [None]:
some_college = educ == 81
some_college.sum()

In [None]:
degree = pd.Series("", index=educ.index, dtype=str, name="degree")
degree[nohs] = "nohs"
degree[hs] = "hs"
degree[some_college] = "college"
degree[assc] = "assc"
degree[bach] = "bach"
degree[adv] = "adv"
degree.value_counts()
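
As a quick check of the recoding, here is a sketch that applies the same labels to a few hand-picked codes. It uses `np.select`, which is a different technique from the mask assignments above, and the ranges are written to be disjoint while matching the labels the cells above produce. The toy codes are made up.

```python
import numpy as np
import pandas as pd

# Made-up EDUC codes, including two that are treated as missing
toy_educ = pd.Series([1, 2, 60, 71, 73, 81, 91, 111, 123, 125, 999])
toy_educ = toy_educ.replace([1, 999], np.nan)

conditions = [
    toy_educ.between(2, 70),
    toy_educ.between(71, 73),
    toy_educ == 81,
    toy_educ.between(91, 92),
    toy_educ == 111,
    toy_educ.between(123, 125),
]
labels = ["nohs", "hs", "college", "assc", "bach", "adv"]

# Codes that match no condition, including NaN, get the empty string
toy_degree = pd.Series(
    np.select(conditions, labels, default=""),
    index=toy_educ.index,
    name="degree",
)
print(toy_degree.tolist())
```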

## Adjust wage income to real wage

In [None]:
wage = sample["incwage"]
wage.describe()

In [None]:
real_wage = wage * sample["cpi99"]
real_wage.describe()
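
The adjustment is a row-by-row multiplication: each nominal wage is scaled by the `cpi99` factor on the same row, which expresses it in 1999 dollars. The numbers below are made up for illustration; they are not actual `cpi99` values.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "year": [1980, 2000, 2010],
        "incwage": [10_000, 20_000, 30_000],
        "cpi99": [2.0, 1.25, 0.75],  # hypothetical adjustment factors
    }
)

# Same operation as above: nominal wage times the adjustment factor
toy["real_wage"] = toy["incwage"] * toy["cpi99"]
print(toy)
```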

In [None]:
year = sample["year"]
df = pd.DataFrame(dict(year=year, degree=degree, real_wage=real_wage))
df.shape

In [None]:
# Use `sample`, not `ipums_df`: `degree` and `df` come from the resampled rows
valid = (degree != "") & (sample["labforce"] == 2)
valid.sum()

In [None]:
selected = df.loc[valid]
selected.shape

In [None]:
selected.to_hdf("ipums_cps.hdf", key="ipums_cps")

In [None]:
start = 1995
recent_df = selected[selected["year"] >= start]
recent_df.shape

In [None]:
recent_df["degree"].value_counts()

In [None]:
overall = recent_df.groupby("year")["real_wage"].mean()

In [None]:
from scipy.stats import linregress

res = linregress(overall.index, overall)
res.slope

In [None]:
table = pd.pivot_table(
    recent_df, index="year", columns="degree", values="real_wage", aggfunc="mean"
)

In [None]:
# `items` works in both older and newer versions of pandas
for name, column in table.items():
    res = linregress(column.index, column)
    print(name, res.slope)
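
Comparing the overall slope with the per-degree slopes is the point of this example: the overall mean can rise even when the mean within every group falls, if the mix of groups shifts toward the higher-paid ones. Here is a sketch with made-up numbers, using `np.polyfit` for the slopes rather than `linregress`.

```python
import numpy as np
import pandas as pd

# Made-up data: wages fall within each group, but the share of the
# higher-paid group grows from 1/4 to 3/4
rows = []
for year, n_hs, n_bach, w_hs, w_bach in [
    (2000, 3, 1, 30.0, 60.0),
    (2001, 2, 2, 29.0, 59.0),
    (2002, 1, 3, 28.0, 58.0),
]:
    rows += [(year, "hs", w_hs)] * n_hs + [(year, "bach", w_bach)] * n_bach
toy = pd.DataFrame(rows, columns=["year", "degree", "real_wage"])

toy_overall = toy.groupby("year")["real_wage"].mean()
toy_table = pd.pivot_table(
    toy, index="year", columns="degree", values="real_wage", aggfunc="mean"
)


def slope(series):
    """Least-squares slope of a Series against its index."""
    return np.polyfit(series.index.to_numpy(), series.to_numpy(), 1)[0]


print("overall", slope(toy_overall))
for name, column in toy_table.items():
    print(name, slope(column))
```

In this toy data the overall slope is positive while both group slopes are negative.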

In [None]:
table.plot()
overall.plot(ls=":", color="gray")

In [None]:
xtab = pd.crosstab(year, degree, normalize="index")
xtab.drop("", axis=1, inplace=True)
xtab

In [None]:
recent = xtab.index >= start

In [None]:
xtab.loc[recent].plot()