# Clean NSFG Data

This is one of the Jupyter notebooks I used in my preparation of *Probably Overthinking It: How to Use Data to Answer Questions, Avoid Statistical Traps, and Make Better Decisions*.

The book is scheduled to be published by University of Chicago Press in 2023.
If you would like to get infrequent email announcements about the book, please
[sign up for my mailing list](http://eepurl.com/h0nfbX).

Before you read these notebooks, please keep in mind:

* There is some explanatory text in the notebooks, but some of the examples will not make sense until you have read the corresponding chapter in the book.

* While preparing these notebooks, I made some changes to improve the readability of the code. There might be small differences between what appears in the book and what you get when you run the code.

[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ProbablyOverthinkingIt/blob/book/notebooks/nsfg_clean.ipynb).

In [1]:
# Install empiricaldist if we don't already have it

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
    import empiricaldist

In [2]:
# Install statadict if we don't already have it

try:
    import statadict
except ImportError:
    !pip install statadict
    import statadict

In [3]:
# download utils.py

from os.path import basename, exists

def download(url):
    """Download url into the current directory unless the file already exists."""
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
        
download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/notebooks/utils.py")

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import decorate

# Set the random seed so we get the same results every time
np.random.seed(17)

## Loading the Data

Before you download data from the NSFG, please read the [Data User's Agreement](https://www.cdc.gov/nchs/data_access/ftp_dua.htm?url_redirect=ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG) and follow the link that says "I Accept These Terms".

Then come back here and run the following cell to download the data.



In [5]:
download('https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/stata/2015_2017_FemPregSetup.dct')
download('https://ftp.cdc.gov/pub/health_statistics/nchs/datasets/NSFG/2015_2017_FemPregData.dat')

In [6]:
from statadict import parse_stata_dict

stata_dict = parse_stata_dict("2015_2017_FemPregSetup.dct")

In [7]:
nsfg = pd.read_fwf("2015_2017_FemPregData.dat", 
                   names=stata_dict.names, 
                   colspecs=stata_dict.colspecs)
assert nsfg.shape == (9553, 248)
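
The `read_fwf` call above relies on `names` and `colspecs` taken from the parsed Stata dictionary. As a minimal sketch of how those two arguments work, here is the same call on a tiny in-memory file. The three-column layout and the values are made up for illustration and are not the real NSFG layout.

```python
import io

import pandas as pd

# Hypothetical fixed-width data: each field sits at a fixed character position
raw = (
    "70627 1 6\n"
    "70627 2 1\n"
    "70628 1 6\n"
)

# Column names and half-open [start, end) character ranges, one pair per column
names = ["CASEID", "PREGORDR", "PREGEND1"]
colspecs = [(0, 5), (6, 7), (8, 9)]

toy = pd.read_fwf(io.StringIO(raw), names=names, colspecs=colspecs)
print(toy)
```

In the notebook, `stata_dict.names` and `stata_dict.colspecs` play the roles of the hand-written `names` and `colspecs` here.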

In [8]:
nsfg.head()

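|  | CASEID | PREGORDR | HOWPREG_N | HOWPREG_P | MOSCURRP | NOWPRGDK | PREGEND1 | PREGEND2 | HOWENDDK | NBRNALIV | ... | SECU | SEST | CMINTVW | CMLSTYR | CMJAN3YR | CMJAN4YR | CMJAN5YR | QUARTER | PHASE | INTVWYEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70627 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | 1.0 | ... | 3 | 322 | 1394 | 1382 | 1357 | 1345 | 1333 | 18 | 1 | 2016 |
| 1 | 70627 | 2 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | 3 | 322 | 1394 | 1382 | 1357 | 1345 | 1333 | 18 | 1 | 2016 |
| 2 | 70627 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | 1.0 | ... | 3 | 322 | 1394 | 1382 | 1357 | 1345 | 1333 | 18 | 1 | 2016 |
| 3 | 70628 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | 1.0 | ... | 2 | 366 | 1409 | 1397 | 1369 | 1357 | 1345 | 23 | 1 | 2017 |
| 4 | 70628 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | 1.0 | ... | 2 | 366 | 1409 | 1397 | 1369 | 1357 | 1345 | 23 | 1 | 2017 |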


Convert the column names to lowercase to be consistent with the documentation.

In [9]:
nsfg.columns = nsfg.columns.str.lower()

Select the columns we need.

In [10]:
variables = [
    "caseid",
    "outcome",
    "pregend1",
    "birthwgt_lb1",
    "birthwgt_oz1",
    "babysex1",
    "prglngth",
    "nbrnaliv",
    "agecon",
    "agepreg",
    "birthord",
    "hpagelb",
    "wgt2015_2017",
]

nsfg = nsfg[variables]
nsfg.shape

(9553, 13)
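
Selecting with a list of names raises a `KeyError` if any name is absent, for example when the names have not yet been lowercased. The following is a small sketch of checking the list before selecting. The toy frame and the `wanted` list are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame with uppercase names, standing in for the raw NSFG columns
toy = pd.DataFrame({"CASEID": [70627, 70628], "OUTCOME": [1, 2], "PHASE": [1, 1]})
toy.columns = toy.columns.str.lower()

wanted = ["caseid", "outcome"]

# Report any requested names that are absent before attempting the selection
missing = [name for name in wanted if name not in toy.columns]
assert not missing, f"missing columns: {missing}"

subset = toy[wanted]
print(subset.shape)
```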

Write the data to an HDF file.

In [11]:
# Pass key by keyword; recent pandas versions deprecate passing it positionally
nsfg.to_hdf("nsfg2015.hdf5", key="nsfg")

## Resampling the Data

The NSFG uses stratified sampling, so different respondents represent different numbers of people in the general population.
One way to compensate is to resample the data: draw rows with replacement, choosing each row with probability proportional to its sampling weight (here, the column `wgt2015_2017`).

In [12]:
def resample_rows_weighted(df, column="finalwgt"):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
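    # Normalize the weights so they sum to 1, as np.random.choice requires
    weights = df[column] / df[column].sum()
    # Draw row labels with replacement; this assumes the index labels are unique
    indices = np.random.choice(df.index, len(df), replace=True, p=weights)
    sample = df.loc[indices]
    return sample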

In [13]:
np.random.seed(18)

sample = resample_rows_weighted(nsfg, "wgt2015_2017")
sample.shape

(9553, 13)
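
To see what weighted resampling accomplishes, here is a minimal sketch on made-up data. It uses pandas' built-in `DataFrame.sample` with its `weights` argument rather than `resample_rows_weighted`, and the column names `value` and `wgt` are hypothetical. The point is that the plain mean of the resampled rows should land near the weighted mean of the original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'wgt' plays the role of a sampling weight
toy = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0],
    "wgt": [1.0, 1.0, 1.0, 7.0],
})

# Weighted mean of the original rows: (1 + 2 + 3 + 4 * 7) / 10
weighted_mean = np.average(toy["value"], weights=toy["wgt"])

# Draw rows with replacement, with probability proportional to 'wgt'
resampled = toy.sample(n=10_000, replace=True, weights="wgt", random_state=18)

print(weighted_mean)
print(resampled["value"].mean())  # near the weighted mean, not the unweighted 2.5
```

After resampling, unweighted statistics computed on the sample approximate weighted statistics on the original data, which is why later steps can ignore the weight column.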

In [14]:
# Pass key by keyword; recent pandas versions deprecate passing it positionally
sample.to_hdf("nsfg_sample.hdf5", key="nsfg")

## Loading the Resampled Data

In [15]:
%time nsfg = pd.read_hdf('nsfg_sample.hdf5', 'nsfg')

CPU times: user 8.18 ms, sys: 6 µs, total: 8.18 ms
Wall time: 7.22 ms


In [16]:
type(nsfg)

pandas.core.frame.DataFrame

In [17]:
assert nsfg.shape == (9553, 13)

In [18]:
nsfg.head()

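|  | caseid | outcome | pregend1 | birthwgt_lb1 | birthwgt_oz1 | babysex1 | prglngth | nbrnaliv | agecon | agepreg | birthord | hpagelb | wgt2015_2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6211 | 77112 | 2 | 3.0 | NaN | NaN | NaN | 6 | NaN | 22 | 22.0 | NaN | NaN | 10615.059866 |
| 4791 | 75616 | 1 | 6.0 | 6.0 | 5.0 | 1.0 | 36 | 1.0 | 22 | 23.0 | 1.0 | 3.0 | 6180.620518 |
| 8461 | 79511 | 2 | 3.0 | NaN | NaN | NaN | 9 | NaN | 26 | 26.0 | NaN | NaN | 16758.073622 |
| 1591 | 72308 | 1 | 6.0 | 7.0 | 10.0 | 1.0 | 38 | 1.0 | 22 | 23.0 | 2.0 | 4.0 | 30773.407658 |
| 8199 | 79222 | 1 | 5.0 | 5.0 | 0.0 | 2.0 | 34 | 1.0 | 18 | 19.0 | 1.0 | 2.0 | 29677.015066 |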


In [19]:
nsfg.columns

Index(['caseid', 'outcome', 'pregend1', 'birthwgt_lb1', 'birthwgt_oz1',
       'babysex1', 'prglngth', 'nbrnaliv', 'agecon', 'agepreg', 'birthord',
       'hpagelb', 'wgt2015_2017'],
      dtype='object')

In [20]:
for column in nsfg.columns:
    print(column)

caseid
outcome
pregend1
birthwgt_lb1
birthwgt_oz1
babysex1
prglngth
nbrnaliv
agecon
agepreg
birthord
hpagelb
wgt2015_2017


Probably Overthinking It

Copyright 2022 Allen Downey 

The code in this notebook and `utils.py` is under the [MIT license](https://mit-license.org/).