This is one of the Jupyter notebooks I used in my preparation of *Probably Overthinking It: How to Use Data to Answer Questions, Avoid Statistical Traps, and Make Better Decisions*.

The book is scheduled to be published by University of Chicago Press in 2023.
If you would like to get infrequent email announcements about the book, please
[sign up for my mailing list](http://eepurl.com/h0nfbX).


[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ProbablyOverthinkingIt/blob/book/notebooks/clean_nchs.ipynb).

In [2]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from os.path import basename, exists

def download(url):
    """Download `url` into the current directory unless the file already exists."""
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

## Get the 1991 Data

In [4]:
download("https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/cohortlinkedus/LinkCO91.zip")

Downloaded LinkCO91.zip


In [5]:
from zipfile import ZipFile

zf = ZipFile("LinkCO91.zip")
filenames = zf.namelist()
print(filenames)

['LinkCO91USnum.dat', 'LinkCO91USden.dat']
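
The cell above lists the archive members; the cells that follow read one of them straight from the zip without extracting it. Here is a minimal sketch of that pattern on a tiny in-memory archive. The member name `demo.dat` and its contents are made up for illustration and are not part of the NCHS data.

```python
import io
from zipfile import ZipFile

# Build a tiny zip archive in memory; the member name and contents are made up.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf_demo:
    zf_demo.writestr("demo.dat", "12345\n67890\n")

# Reopen it for reading, list the members, and read one without extracting it.
with ZipFile(buffer) as zf_demo:
    demo_names = zf_demo.namelist()
    with zf_demo.open("demo.dat") as fp_demo:
        first_line = fp_demo.readline()   # bytes, because members open in binary mode

print(demo_names)
print(first_line)
```

The object returned by `ZipFile.open` is a binary file-like object, which is why it can be handed directly to a pandas reader.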


In [7]:
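# Positions (0-based, half-open) of the fields to extract, with names in the same order
colspecs = [(0, 1), (117, 118), (78, 82), (164, 186), (212, 215)]
columns = ["matchs", "tobacco", "birthweight", "congenit", "aged"]

# Read the denominator file directly from the zip archive
fp = zf.open('LinkCO91USden.dat')
# Note: read_fwf treats the first line as a header by default, so the first record
# becomes the column names; pass header=None, names=columns to keep it as data.
%time vs1991 = pd.read_fwf(fp, colspecs=colspecs, nrows=None, compression='infer')
vs1991.columns = columns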

CPU times: user 20 s, sys: 522 ms, total: 20.6 s
Wall time: 20.6 s


In [8]:
vs1991.shape

(4115493, 5)
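
To make the role of `colspecs` concrete, here is a small sketch that parses three made-up fixed-width records. The record layout, the values, and the names `flag` and `weight` are assumptions for illustration only; they do not come from the NCHS file layout. The sketch also compares the default header handling with `header=None`.

```python
import io
import pandas as pd

# Three made-up fixed-width records: a flag in column 0, a weight in columns 1-4.
raw = "13200\n22950\n19999\n"

demo_colspecs = [(0, 1), (1, 5)]   # 0-based, half-open [start, end) intervals
demo_columns = ["flag", "weight"]

# Default behavior: the first line is used as the header, leaving two data rows.
with_header = pd.read_fwf(io.StringIO(raw), colspecs=demo_colspecs)

# header=None keeps the first record as data; names= labels the columns directly.
no_header = pd.read_fwf(io.StringIO(raw), colspecs=demo_colspecs,
                        header=None, names=demo_columns)

print(len(with_header), len(no_header))
print(no_header)
```

With the default settings the first record is consumed as column labels, so for a file that has no header line, `header=None` with `names=` is the safer choice.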

In [4]:
# Start from a clean file; -f avoids an error if it does not exist yet
!rm -f nchs.hdf

In [5]:
vs1991.to_hdf("nchs.hdf", key="vs1991", complevel=6)
!ls -lh nchs.hdf

-rw-rw-r-- 1 downey downey 18M Feb 23 20:22 nchs.hdf


## Get the 2019 Data

This release covers the 2018 birth cohort.

In [10]:
download("https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/period-cohort-linked/2019PE2018CO.zip")

Downloaded 2019PE2018CO.zip


In [11]:
zf = ZipFile("2019PE2018CO.zip")
filenames = zf.namelist()
print(filenames)

['VS18LINK.Public.USDENPUB_add_Seq_for2018Cohort_r2021_08_25', 'VS18LINK.Public.USNUMPUB_R2020_04_03', 'VS19LINK.Public.USDENPUB_R2021_08_30', 'VS19LINK.Public.USNUMPUB_R2021_08_30']


In [12]:
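# Positions (0-based, half-open) of the fields to extract, with names in the same order
colspecs = [(268, 269), (511, 515), (371, 375)]
columns = ["tobacco", "birthweight", "yod"]

# Read the 2018 cohort denominator file directly from the zip archive.
# read_fwf uses the first line as a header unless header=None is passed.
fp = zf.open('VS18LINK.Public.USDENPUB_add_Seq_for2018Cohort_r2021_08_25')
vs2018 = pd.read_fwf(fp, colspecs=colspecs, nrows=None)
vs2018.columns = columns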

In [7]:
vs2018.to_hdf("nchs.hdf", key="vs2018", complevel=6)
!ls -lh nchs.hdf

-rw-rw-r-- 1 downey downey 33M Feb 23 20:23 nchs.hdf
