# Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [1]:
from __future__ import print_function, division

import nsfg

## Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [2]:
preg = nsfg.ReadFemPreg()
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,cmotpreg,prgoutcome,cmprgend,flgdkmo1,cmprgbeg,ageatend,hpageend,gestasun_m,gestasun_w,wksgest,Unnamed: 21
0,1,1,,,,,6,,1,,,1,1093,,1084,,,9,0,39,...
1,1,2,,,,,6,,1,,,1,1166,,1157,,,9,0,39,...
2,2,1,,,,,5,,3,5.0,,1,1156,,1147,,,0,39,39,...
3,2,2,,,,,6,,1,,,1,1198,,1189,,,0,39,39,...
4,2,3,,,,,6,,1,,,1,1204,,1195,,,0,39,39,...


Print the column names.

In [4]:
preg.columns

Index([u'caseid', u'pregordr', u'howpreg_n', u'howpreg_p', u'moscurrp', u'nowprgdk', u'pregend1', u'pregend2', u'nbrnaliv', u'multbrth', u'cmotpreg', u'prgoutcome', u'cmprgend', u'flgdkmo1', u'cmprgbeg', u'ageatend', u'hpageend', u'gestasun_m', u'gestasun_w', u'wksgest', u'mosgest', u'dk1gest', u'dk2gest', u'dk3gest', u'bpa_bdscheck1', u'bpa_bdscheck2', u'bpa_bdscheck3', u'babysex', u'birthwgt_lb', u'birthwgt_oz', u'lobthwgt', u'babysex2', u'birthwgt_lb2', u'birthwgt_oz2', u'lobthwgt2', u'babysex3', u'birthwgt_lb3', u'birthwgt_oz3', u'lobthwgt3', u'cmbabdob', u'kidage', u'hpagelb', u'birthplc', u'paybirth1', u'paybirth2', u'paybirth3', u'knewpreg', u'trimestr', u'ltrimest', u'priorsmk', u'postsmks', u'npostsmk', u'getprena', u'bgnprena', u'pnctrim', u'lpnctri', u'workpreg', u'workborn', u'didwork', u'matweeks', u'weeksdk', u'matleave', u'matchfound', u'livehere', u'alivenow', u'cmkidied', u'cmkidlft', u'lastage', u'wherenow', u'legagree', u'parenend', u'anynurse', u'fedsolid', u'frstea

Select a single column name.

In [5]:
preg.columns[1]

u'pregordr'

Select a column and check what type it is.

In [6]:
pregordr = preg['pregordr']
type(pregordr)

pandas.core.series.Series

Print a column.

In [7]:
pregordr

0     1
1     2
2     1
3     2
4     3
5     1
6     2
7     3
8     1
9     2
10    1
11    1
12    2
13    3
14    1
...
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.

In [7]:
pregordr[0]

1

Select a slice from a column.

In [8]:
pregordr[2:5]

2    1
3    2
4    3
Name: pregordr, dtype: int64
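
The element and slice selections above work because `preg` has the default index 0, 1, 2, ..., so positions and index labels coincide. When that is not guaranteed, `.iloc` (by position) and `.loc` (by label) make the intent explicit. A minimal sketch, using a small made-up Series in place of the real column:

```python
import pandas as pd

# Hypothetical stand-in for preg['pregordr'], with a non-default index.
pregordr_demo = pd.Series([1, 2, 1, 2, 3],
                          index=[10, 11, 12, 13, 14],
                          name='pregordr')

first = pregordr_demo.iloc[0]            # by position: the first element
by_label = pregordr_demo.loc[12]         # by index label 12
pos_slice = pregordr_demo.iloc[2:5]      # positions 2, 3, 4 (end excluded)
label_slice = pregordr_demo.loc[12:14]   # labels 12 through 14 (end included)

print(first, by_label)
print(pos_slice.tolist(), label_slice.tolist())
```

Note that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.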

Select a column using dot notation.

In [8]:
pregordr = preg.pregordr

Count the number of times each value occurs.

In [10]:
preg.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
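
`value_counts` orders its result by frequency, which is why the cell above chains `sort_index` to list the codes in numerical order. A small sketch on made-up codes (not NSFG data) shows the difference:

```python
import pandas as pd

# Made-up outcome codes, for illustration only.
outcome_demo = pd.Series([1, 4, 1, 2, 1, 4], name='outcome')

counts = outcome_demo.value_counts()   # most frequent value first
by_code = counts.sort_index()          # reordered by the code itself

print(by_code)
```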

Check the values of another variable.

In [11]:
preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [12]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1])
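
One way such a map could be built is with a `defaultdict` of lists. The function below is a sketch under that assumption, not necessarily how `nsfg.MakePregMap` is written; the name `make_preg_map` and the sample data are hypothetical.

```python
from collections import defaultdict

import pandas as pd


def make_preg_map(df):
    """Map each caseid to the list of row labels where it appears."""
    d = defaultdict(list)
    for index, caseid in df['caseid'].items():
        d[caseid].append(index)
    return d


# Made-up records: two respondents with 2 and 3 pregnancies.
preg_demo = pd.DataFrame({'caseid': [1, 1, 2, 2, 2],
                          'outcome': [1, 1, 4, 1, 1]})

demo_map = make_preg_map(preg_demo)
demo_indices = demo_map[2]
demo_outcomes = preg_demo['outcome'].loc[demo_indices].values

print(demo_indices, demo_outcomes)
```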

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [15]:
counts = preg.birthord.value_counts().sort_index()
total = counts.sum()+preg.birthord.isnull().sum()
print(counts)
print('Total:', total)

1     4413
2     2874
3     1234
4      421
5      126
6       50
7       20
8        7
9        2
10       1
dtype: int64
Total: 13593


We can also use `isnull` to count the number of NaNs.

In [14]:
preg.birthord.isnull().sum()

4445
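
By default `value_counts` leaves NaN out, which is why the missing values have to be added back to reach the full row count. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up birth orders; NaN marks pregnancies that did not end in a live birth.
birthord_demo = pd.Series([1, 2, np.nan, 1, np.nan, 3])

counted = birthord_demo.value_counts().sort_index()      # NaN excluded
n_missing = birthord_demo.isnull().sum()                 # number of NaNs
with_nan = birthord_demo.value_counts(dropna=False)      # NaN gets its own row

print(counted.sum(), n_missing, with_nan.sum())
```

Passing `dropna=False` is an alternative when the NaN count should appear alongside the other values.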

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [28]:
preg_length = preg.prglngth.value_counts().sort_index()

# Slice by week label (end included) rather than by position,
# so a week with no records cannot shift the bins.
early_total = preg_length.loc[0:13].sum()
mid_total = preg_length.loc[14:26].sum()
late_total = preg_length.loc[27:].sum()
print('0-13 weeks:', early_total)
print('14-26 weeks:', mid_total)
print('27-50 weeks:', late_total)
print('Total:', early_total+mid_total+late_total) # sanity check: the bins should add up to the number of rows

0-13 weeks: 3522
14-26 weeks: 793
27-50 weeks: 9278
Total: 13593
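
Binning by week depends on whether a slice is read as positions or as index labels. The two agree only when every week from 0 upward appears in the counts. A sketch on made-up counts, with week 2 deliberately missing, shows how they can diverge:

```python
import pandas as pd

# Made-up counts by week; week 2 has no records.
by_week = pd.Series({0: 5, 1: 3, 3: 7, 4: 2}).sort_index()

positional = by_week.iloc[0:3].sum()   # first three rows: weeks 0, 1 and 3
by_label = by_week.loc[0:2].sum()      # weeks 0 through 2: only 0 and 1 exist

print(positional, by_label)
```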


To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [16]:
preg.totalwgt_lb.mean()

7.265628457623368
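
`mean` skips NaN by default, so rows with no recorded weight do not affect the result. A small sketch on made-up weights:

```python
import numpy as np
import pandas as pd

# Made-up weights in pounds, with one missing value.
weights_demo = pd.Series([7.5, np.nan, 6.0, 8.5])

mean_default = weights_demo.mean()             # NaN skipped: 22.0 / 3
mean_strict = weights_demo.mean(skipna=False)  # NaN propagates

print(mean_default, mean_strict)
```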

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [33]:
preg['totalwgt_kg'] = preg.totalwgt_lb * 0.453592
preg.totalwgt_kg.mean()

3.295630943350299
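
The reason for the dictionary-syntax rule is that assigning to a new name with dot notation sets an ordinary Python attribute on the `DataFrame` object and does not add a column. A sketch on a made-up frame (the name `wgt_g` is hypothetical):

```python
import warnings

import pandas as pd

df_demo = pd.DataFrame({'totalwgt_lb': [7.5, 6.0, 8.25]})

# Bracket (dictionary) syntax creates the column.
df_demo['totalwgt_kg'] = df_demo['totalwgt_lb'] * 0.453592

# Dot notation on a new name only sets an attribute; pandas warns about it.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df_demo.wgt_g = df_demo['totalwgt_kg'] * 1000

print(list(df_demo.columns))
```

Once the column exists, reading it with dot notation (`df_demo.totalwgt_kg`) works as usual.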

`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [34]:
resp = nsfg.ReadFemResp()

`DataFrame` provides a method `head` that displays the first five rows:

In [36]:
resp.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,marstat,fmarstat,fmarit,evrmarry,hisp,hispgrp,numrace,roscnt,hplocale,manrel,Unnamed: 21
0,2298,1,5,5,1,5,27,27,902,27,2,6.0,5,0,1,1.0,1,5,1.0,2.0,...
1,5012,1,5,1,5,5,42,42,718,42,1,,1,1,5,,1,2,1.0,1.0,...
2,11586,1,5,1,5,5,43,43,708,43,4,,3,1,5,,1,1,,,...
3,6794,5,5,4,1,5,15,15,1042,15,6,,5,0,1,2.0,1,4,,,...
4,616,1,5,4,1,5,20,20,991,20,6,,5,0,1,1.0,1,4,,,...


Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [40]:
print(resp.age_r.value_counts().sort_index())
print('Youngest Respondent:', min(resp.age_r))
print('Oldest Respondent:', max(resp.age_r))

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
dtype: int64
Youngest Respondent: 15
Oldest Respondent: 44


We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [21]:
resp[resp.caseid==2298]

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


And we can get the corresponding rows from `preg` like this:

In [22]:
preg[preg.caseid==2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
2610,2298,1,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875
2611,2298,2,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,5.5
2612,2298,3,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,4.1875
2613,2298,4,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875
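
Filtering each table by `caseid` handles one respondent at a time. To attach respondent fields to every pregnancy row at once, a join on `caseid` is one option. The sketch below uses made-up frames and `DataFrame.merge`; it is an illustration, not something the notebook itself does.

```python
import pandas as pd

# Made-up miniature versions of resp and preg.
resp_demo = pd.DataFrame({'caseid': [1, 2, 3], 'age_r': [44, 27, 31]})
preg_demo = pd.DataFrame({'caseid': [2, 2, 1], 'prglngth': [40, 36, 39]})

# Boolean mask: all pregnancies for one respondent.
one_resp = preg_demo[preg_demo.caseid == 2]

# Left join: every pregnancy row gains the respondent's age.
joined = preg_demo.merge(resp_demo, on='caseid', how='left')

print(joined)
```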


How old is the respondent with `caseid` 1?

In [48]:
case1 = resp[resp.caseid==1]
print('Respondent 1:', case1.age_r)

Respondent 1: 1069    44
Name: age_r, dtype: int64


What are the pregnancy lengths for the respondent with `caseid` 2298?

In [54]:
case2298 = preg[preg.caseid==2298]
print("Weeks pregnant, number of pregnancies")
case2298.prglngth.value_counts().sort_index()

Weeks pregnant, number of pregnancies


30    1
36    1
40    2
dtype: int64

What was the birthweight of the first baby born to the respondent with `caseid` 5515?

In [88]:
# No pregnancy record has caseid 5515, so this loop prints nothing.
for i in range(len(preg.caseid)):
    if preg.caseid[i]==5515:
        print(preg.caseid[i])

# `in` applied to a Series tests the index labels, not the values,
# so this is True only because a row labeled 5515 exists.
if 5515 in preg.caseid:
    print('Found respondent')

# According to the solutions, 5515 here is a row label in preg,
# and that row belongs to caseid 5012.

# Two ways to reach the same record.
print(preg[preg.caseid==5012].birthwgt_lb)
print(preg.birthwgt_lb[5515], 'pounds')

Found respondent
5515    6
Name: birthwgt_lb, dtype: float64
6.0 pounds
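
The membership test in this cell is easy to misread: `x in series` checks the index labels, much as `x in dict` checks keys. To test the values, compare and call `any`, or test against `.values`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up caseids; the default index labels are 0, 1, 2.
caseid_demo = pd.Series([70, 80, 90], name='caseid')

label_hit = 1 in caseid_demo                   # True: 1 is an index label
value_via_in = 80 in caseid_demo               # False: 80 is a value, not a label
value_hit = bool((caseid_demo == 80).any())    # True: compares the values
value_hit_np = 80 in caseid_demo.values        # True: NumPy array membership

print(label_hit, value_via_in, value_hit, value_hit_np)
```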
