# Behavioral Risk Factor Surveillance System (BRFSS) 2014

## Topics & Techniques Covered:

* Extracting text data from a PDF file
* Eliminating whitespace from text strings
* Using a dictionary to replace numerical data with text-based categories

## Imports

The `io` (input/output) library handles "file objects" which are representations of files in text, bytes, or raw format. This is a bit of an abstract concept, but essentially it lets you treat streamed data from a website or other source as if it was a file being read from a hard drive. We will be using this library to read in PDF files.

In [1]:
import requests
from io import StringIO, BytesIO
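
As a minimal sketch of the "file object" idea (the sample bytes below are made up for illustration), `BytesIO` and `StringIO` wrap in-memory data so that it can be read with the usual file methods:

```python
from io import BytesIO, StringIO

# Made-up sample data standing in for bytes received from a website.
raw = b"first line\nsecond line\n"

# BytesIO wraps bytes so they can be read like a file opened in binary mode.
binary_file = BytesIO(raw)
first = binary_file.readline()  # reads up to and including the first newline
rest = binary_file.read()       # reads everything that is left

# StringIO plays the same role for text (str) instead of bytes.
text_file = StringIO(raw.decode("utf-8"))
lines = text_file.read().splitlines()

print(first)
print(rest)
print(lines)
```

Like a file on disk, the in-memory object keeps track of a read position, which is why `read()` returns only what `readline()` has not already consumed.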

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The [pdfminer](https://pdfminersix.readthedocs.io/en/latest/) package lets you extract data from PDF documents. It doesn't work perfectly all the time and usually takes some fiddling, but it is a potential tool to *reproducibly* convert tables in PDF documents to tabular data in Python. How usable it is will depend largely on how well-formatted the PDF is.

In [3]:
!pip install pdfminer.six



In [4]:
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTTextLineHorizontal, LTTextBoxHorizontal

## Behavioral Risk Factor Surveillance System (BRFSS) 2014 Survey Codebook

The Behavioral Risk Factor Surveillance System is a United States public health survey conducted by the Centers for Disease Control and Prevention (CDC) to assess behavioral health risks. The dataset on the CDC website is large, but it is not easily readable because every field is coded as a number rather than containing the category labels themselves.

The categories are kept in a codebook, which serves as a dictionary so users can translate the survey data. However, the fact that the codebook is a PDF document instead of being in a tabular data format like a spreadsheet makes it difficult to read these codes programmatically. 

This is why tools like `pdfminer.six` are useful; they let you make tables out of data that isn't formatted in a table to begin with.

[FIPS (Federal Information Processing Standard) codes](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code) are numeric codes that the Census Bureau and other institutions have used as unique identifiers for U.S. states and territories.

Here, they are used in the BRFSS dataset as state identifiers, but without a link between the FIPS code and the state name/postal abbreviation, it's harder to match the data at a glance.

### Codebook Download Links:

Available here: https://www.cdc.gov/brfss/annual_data/annual_2014.html

PLEASE NOTE: The CDC website has been unreliable over the past several weeks; the codebook was unavailable for a few days during that period. It may or may not be available when you access this link.

We have uploaded the codebook to our GitHub page, and, because the CDC website may continue to be unreliable, we are downloading the BRFSS 2014 dataset from the Open Science Framework (OSF) repository: https://osf.io/n7wm8.


# `requests`

We will use the `requests` module to send an HTTP GET request for the BRFSS resources.

For a more extensive tutorial on the `requests` module and on web-scraping, please see the archived "Practical Python" workshop materials on the Library's "[Introduction to Python](https://libguides.libraries.claremont.edu/intro-to-python)" Research Guide.

First, we use the `requests` module to make an HTTP GET request that downloads the codebook PDF.

In [5]:
pdf_response = requests.get("https://raw.githubusercontent.com/ClaremontCollegesLibrary/PersnicketyPython/refs/heads/main/brfss_2014_codebook.pdf")

In [6]:
pdf_content = pdf_response.content

Here are the first two thousand bytes of the PDF file, returned as a Python [bytes object](https://docs.python.org/3/library/stdtypes.html#bytes-objects).

Bytes objects display similarly to Python strings (they are formatted like a string, with a "b" at the start before the quotes) but they are fundamentally different.

In [7]:
pdf_content[0:2000]

b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n86468 0 obj\r<</Linearized 1/L 2918480/O 86470/E 30240/N 126/T 2913838/H [ 493 740]>>\rendobj\r      \r\n86475 0 obj\r<</DecodeParms<</Columns 5/Predictor 12>>/Filter/FlateDecode/ID[<0E16E69224B5F14C903517BEA588862D><E8E3BF2B2FDC8A4F8090137FCB34818C>]/Index[86468 17 86791 1]/Info 86467 0 R/Length 65/Prev 2913839/Root 86469 0 R/Size 86792/Type/XRef/W[1 3 1]>>stream\r\nh\xdebbd`\x10``b``9\x08"\x19\xaf\x82\xc9`\x10\xc9<\x01D\xb2\x9d\x07\x91\xca9 \xf2\xdc\x06\x06&F\xc6@\xb0,P\x15Q\xe4\xff\xff\x1b\xd4~2\xfdg\xf8/\xcc\x00\x10`\x00\x03\xbc\x0c%\r\nendstream\rendobj\rstartxref\r\n0\r\n%%EOF\r\n        \r\n86484 0 obj\r<</C 1237/Filter/FlateDecode/I 1259/Length 632/O 1199/S 1153/V 1215>>stream\r\nh\xdeb```\x02\xa2C\x0c,\x0c\x0c\xfc\xd7\x19\xf8\x19\x10\x80\x9f\x81\x19(\xca\xc2\xc0a\xd0\xc0@\t8\xe8\x96\xcc\xb2\xd8\xe2\xce\x86\x9dG\xc3\xdbs{g\xca\x8bX\xbd\xc9\xd6L\x9c\x98\xe6:\xdf#S(@\xbbB\xfbN\xc9\x8e\xd7\x17W\x1ef\xed6\xe5\x89\x90\x9d\x16\xa4\xd0\xech\x93\x14\xd3
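
To make the difference between bytes and strings concrete, here is a small sketch using only the PDF header shown above. It decodes the header to text and compares how the two types behave:

```python
# A short bytes literal (the PDF header seen above) and its decoded text form.
data = b"%PDF-1.6"
text = data.decode("ascii")

first_byte = data[0]   # indexing bytes gives an integer byte value
first_char = text[0]   # indexing a str gives a one-character string

print(type(data).__name__, type(text).__name__)
print(first_byte, first_char)
print(data == text)    # bytes and str never compare equal
```

The bytes object stores raw byte values, so text only appears once those bytes are decoded with a suitable encoding.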

The `BytesIO` class from the `io` library allows us to read the Bytes object into a chunk of memory so that it behaves like a file. In this case, the "%PDF-1.6" header at the start of the Bytes object indicates that the file is a PDF, and `BytesIO` lets us treat it as if it was a PDF file on the drive for purposes of the `extract_text()` function from `pdfminer`.

In [8]:
pdf = BytesIO(pdf_content)
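
Before handing the buffer to `pdfminer`, it can be worth confirming that the download really is a PDF. The sketch below is one possible check; `looks_like_pdf` is a hypothetical helper, and the sample bytes are a made-up stand-in for `pdf_response.content`:

```python
from io import BytesIO


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Made-up stand-in for pdf_response.content; a real PDF is much longer.
sample = b"%PDF-1.6\r%rest of the file"

buffer = BytesIO(sample)
header = buffer.read(8)  # read the first eight bytes, as with a file on disk
buffer.seek(0)           # rewind so the next reader starts from the beginning

print(looks_like_pdf(sample), header, buffer.tell())
```

A check like this may help catch the case where a server returned an error page instead of the expected file.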

# pdfminer

## `extract_text()`

At its simplest, pdfminer converts PDF files to plain text.

In [9]:
text = extract_text(pdf)
print(text[0:2000])

Behavioral Risk Factor Surveillance System 

2014 Codebook Report 

Land-Line and Cell-Phone data 

August 12, 2015 

 
 
 
    
  
  
   
 
 
BEHAVIORAL RISK FACTOR SURVEILLANCE SYSTEM 
CODEBOOK REPORT, 2014 
Land-Line and Cell-Phone data 

State FIPS Code 

Section:  0.1        Record Identification 

Column:  1-2 

Prologue:   

Description:  State FIPS Code 

Value 

Value Label 

1 

2 

4 

5 

6 

8 

9 

10 

11 

12 

13 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

Alabama 

Alaska 

Arizona 

Arkansas 

California 

Colorado 

Connecticut 

Delaware 

District of Columbia 

Florida 

Georgia 

Hawaii 

Idaho 

Illinois 

Indiana 

Iowa 

Kansas 

Kentucky 

Louisiana 

Maine 

Maryland 

Massachusetts 

Michigan 

Minnesota 

Mississippi 

Missouri 

Montana 

Nebraska 

Nevada 

New Hampshire 

New Jersey 

New Mexico 

New York 

North Carolina 

North Dakota 

Type:  Num 

SAS Variable Name:  _ST

## `extract_pages()`

The `extract_pages()` function yields one layout object per page of the PDF. Each page layout can in turn be iterated over to get the individual elements (text boxes, rectangles, and so on) that make up that page.

In [10]:
pages = [page for page in extract_pages(pdf)]

In [11]:
for page_layout in pages[0:2]:
    for element in page_layout:
        print(element)

<LTTextBoxHorizontal(0) 73.224,688.852,545.572,712.852 'Behavioral Risk Factor Surveillance System \n'>
<LTTextBoxHorizontal(1) 181.940,614.572,436.822,638.572 '2014 Codebook Report \n'>
<LTTextBoxHorizontal(2) 139.940,540.292,478.942,564.292 'Land-Line and Cell-Phone data \n'>
<LTTextBoxHorizontal(3) 233.210,468.242,384.481,488.282 'August 12, 2015 \n'>
<LTRect 30.360,665.020,581.740,739.444>
<LTRect 32.400,685.060,579.700,719.404>
<LTRect 27.720,739.440,30.360,742.200>
<LTRect 27.720,739.560,30.360,742.200>
<LTRect 30.360,739.560,581.740,742.200>
<LTRect 30.360,739.440,581.740,739.560>
<LTRect 581.740,739.440,584.380,742.200>
<LTRect 581.740,739.560,584.380,742.200>
<LTRect 27.720,665.020,30.360,739.444>
<LTRect 581.740,665.020,584.380,739.444>
<LTRect 30.360,590.740,581.740,665.020>
<LTRect 32.400,610.780,579.700,645.100>
<LTRect 27.720,590.740,30.360,665.020>
<LTRect 581.740,590.740,584.380,665.020>
<LTRect 30.360,516.550,581.740,590.734>
<LTRect 32.400,536.500,579.700,570.820>
<LT

## Identifying Elements and Extracting Text

In [12]:
table_text = []

# Keep only the text-bearing layout elements; skip non-text elements such as LTRect.
for page_layout in extract_pages(pdf):
    for element in page_layout:
        if isinstance(element, (LTTextLineHorizontal, LTTextBoxHorizontal)):
            table_text.append(element.get_text())

Once we've identified the length of the tables on each page, we can locate the starting points in the list `table_text` and take segments of that list to use as columns in a DataFrame object.
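
The slicing idea can be sketched on a toy list first. The list below is a made-up, shortened stand-in for `table_text`, in which the headers come first and each column's values follow in turn:

```python
# Toy stand-in for table_text: two headers, then each column's values in turn.
flat = [
    "Value \n", "Value Label \n",
    "1 \n", "2 \n", "4 \n",
    "Alabama \n", "Alaska \n", "Arizona \n",
]

length = 3                   # number of rows in the toy table
codes = flat[2:2 + length]   # first column starts at index 2
labels = flat[5:5 + length]  # second column starts at index 5

# Pair the two columns back up row by row.
rows = list(zip(codes, labels))
print(rows)
```

Each slice is defined by a start index and a shared length, which is the same pattern applied to the real list further on.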



In [13]:
for n, line in enumerate(table_text[0:200]):
    print(n, line)

0 Behavioral Risk Factor Surveillance System 

1 2014 Codebook Report 

2 Land-Line and Cell-Phone data 

3 August 12, 2015 

4  

5  

6  

7     

8   

9   

10    

11  

12  

13 BEHAVIORAL RISK FACTOR SURVEILLANCE SYSTEM 
CODEBOOK REPORT, 2014 
Land-Line and Cell-Phone data 

14 State FIPS Code 

15 Section:  0.1        Record Identification 

16 Column:  1-2 

17 Prologue:   

18 Description:  State FIPS Code 

19 Value 

20 Value Label 

21 1 

22 2 

23 4 

24 5 

25 6 

26 8 

27 9 

28 10 

29 11 

30 12 

31 13 

32 15 

33 16 

34 17 

35 18 

36 19 

37 20 

38 21 

39 22 

40 23 

41 24 

42 25 

43 26 

44 27 

45 28 

46 29 

47 30 

48 31 

49 32 

50 33 

51 34 

52 35 

53 36 

54 37 

55 38 

56 Alabama 

57 Alaska 

58 Arizona 

59 Arkansas 

60 California 

61 Colorado 

62 Connecticut 

63 Delaware 

64 District of Columbia 

65 Florida 

66 Georgia 

67 Hawaii 

68 Idaho 

69 Illinois 

70 Indiana 

71 Iowa 

72 Kansas 

73 Kentucky 

74 Louisiana 

75 Maine 



The column headers that we're interested in are at indices 19, 20, 93, 94, and 95 of `table_text`. In the first chunk, the corresponding columns of data start at indices 21, 56, 96, 131, and 166, and each one is 35 entries long.
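
Reading indices off a printout works, but they can also be looked up in code. The sketch below searches a made-up, shortened stand-in for `table_text` and records where each header text occurs; the names `elements`, `headers`, and `positions` are illustrative only:

```python
# Toy stand-in for table_text; the values are made up for illustration.
elements = [
    "Title \n", "Value \n", "Value Label \n", "1 \n", "2 \n",
    "Alabama \n", "Alaska \n", "Frequency \n", "10 \n", "20 \n",
]

headers = ["Value \n", "Value Label \n", "Frequency \n"]

# Record every index at which each header text appears.
positions = {
    header: [i for i, element in enumerate(elements) if element == header]
    for header in headers
}
print(positions)
```

On the real list, a header that repeats on every page would show up with several indices, one per occurrence.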


In [14]:
for n, line in enumerate(table_text[200:400], start=200):
    print(n, line)

200 0.23 

201 2 of 126 

202       August 12, 2015 

203  

204  

205  

206 BEHAVIORAL RISK FACTOR SURVEILLANCE SYSTEM 
CODEBOOK REPORT, 2014 
Land-Line and Cell-Phone data 

207 State FIPS Code 

208 Section:  0.1        Record Identification 

209 Column:  1-2 

210 Prologue:   

211 Description:  State FIPS Code 

212 Value 

213 Value Label 

214 39 

215 40 

216 41 

217 42 

218 44 

219 45 

220 46 

221 47 

222 48 

223 49 

224 50 

225 51 

226 53 

227 54 

228 55 

229 56 

230 66 

231 72 

232 Ohio 

233 Oklahoma 

234 Oregon 

235 Pennsylvania 

236 Rhode Island 

237 South Carolina 

238 South Dakota 

239 Tennessee 

240 Texas 

241 Utah 

242 Vermont 

243 Virginia 

244 Washington 

245 West Virginia 

246 Wisconsin 

247 Wyoming 

248 Guam 

249 Puerto Rico 

250 Type:  Num 

251 SAS Variable Name:  _STATE 

252 Frequency 

253 Percentage 

254 Weighted 
Percentage 

255 10,933 

256 8,448 

257 5,227 

258 11,000 

259 6,450 

260 11,027 

261 7,401 

262 5,14

In the second chunk, the columns start at indices 214, 232, 255, 273, and 291, and each one is 18 entries long.
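
One way to sanity-check the column lengths is to compare the start indices with each other. The sketch below uses the start indices quoted in the text; `gaps` is a hypothetical helper. The smallest gap between consecutive starts matches the number of rows in a chunk, while a larger gap shows where other elements, such as the header cells, sit between two columns:

```python
# Column start indices quoted in the text for each chunk.
first_chunk_starts = [21, 56, 96, 131, 166]
second_chunk_starts = [214, 232, 255, 273, 291]


def gaps(starts):
    """Distances between consecutive column start indices."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


first_gaps = gaps(first_chunk_starts)
second_gaps = gaps(second_chunk_starts)

print(first_gaps, min(first_gaps))
print(second_gaps, min(second_gaps))
```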

The column headers will need to be cleaned as well. Fortunately, the pattern is consistent. Every element's text has trailing whitespace and a newline character (`\n`), so we can use the string method `.replace()` to pare down each string.

In [15]:
table_text[19]

'Value \n'

In [16]:
table_text[19].replace(' \n','')

'Value'

In [17]:
table_text[21]

'1 \n'

In [18]:
table_text[21].replace(' \n','')

'1'
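
For comparison, here is a short sketch of three ways to clean these strings, using element texts that appear in the printouts above. Note that `.replace(' \n','')` removes every occurrence of the pattern, including one in the middle of a two-line header:

```python
samples = ["Value \n", "1 \n", "Weighted \nPercentage \n"]

# .replace(" \n", "") deletes every occurrence, including ones mid-string.
replaced = [s.replace(" \n", "") for s in samples]

# .strip() only trims whitespace at the two ends of the string.
stripped = [s.strip() for s in samples]

# split() + join() collapses every run of whitespace to a single space.
normalized = [" ".join(s.split()) for s in samples]

print(replaced)
print(stripped)
print(normalized)
```

This is why the two-line header comes out as `WeightedPercentage` when `.replace()` is used; the split-and-join variant would keep a space between the two words.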

Here we can use multiple list comprehensions to create columns for a DataFrame:

In [19]:
table_length = 35
table_length2 = 18

state_fips = pd.DataFrame()

state_fips[table_text[19].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[21:21+table_length] + table_text[214:214+table_length2]
]

state_fips[table_text[20].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[56:56+table_length] + table_text[232:232+table_length2]
]

state_fips[table_text[93].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[96:96+table_length] + table_text[255:255+table_length2]
]

state_fips[table_text[94].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[131:131+table_length] + table_text[273:273+table_length2]
]

state_fips[table_text[95].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[166:166+table_length] + table_text[291:291+table_length2]
]

In [20]:
state_fips

Unnamed: 0,Value,Value Label,Frequency,Percentage,WeightedPercentage
0,1,Alabama,8652,1.86,1.5
1,2,Alaska,4388,0.94,0.22
2,4,Arizona,14867,3.2,2.05
3,5,Arkansas,5258,1.13,0.91
4,6,California,8832,1.9,11.89
5,8,Colorado,13399,2.88,1.66
6,9,Connecticut,7950,1.71,1.14
7,10,Delaware,4300,0.93,0.29
8,11,District of Columbia,4074,0.88,0.22
9,12,Florida,9821,2.11,6.37


In [21]:
zipped_fips = zip(state_fips['Value'].values, state_fips['Value Label'].values)

fips_dict = {int(value):label for value, label in zipped_fips}
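
The cell above combines `zip()` with a dictionary comprehension. A minimal sketch of the same pattern on a three-row excerpt, using plain lists in place of the DataFrame columns:

```python
# Three-row excerpt of the two columns, as plain lists of strings.
values = ["1", "2", "4"]
labels = ["Alabama", "Alaska", "Arizona"]

# zip() pairs items by position; the comprehension turns each pair into
# a key/value entry, converting the code from str to int along the way.
lookup = {int(value): label for value, label in zip(values, labels)}

print(lookup)
print(lookup[4])
```

The `int()` conversion matters because the codes extracted from the PDF are strings, whereas the lookup later in this module is done with numeric keys.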

In [22]:
fips_dict

{1: 'Alabama',
 2: 'Alaska',
 4: 'Arizona',
 5: 'Arkansas',
 6: 'California',
 8: 'Colorado',
 9: 'Connecticut',
 10: 'Delaware',
 11: 'District of Columbia',
 12: 'Florida',
 13: 'Georgia',
 15: 'Hawaii',
 16: 'Idaho',
 17: 'Illinois',
 18: 'Indiana',
 19: 'Iowa',
 20: 'Kansas',
 21: 'Kentucky',
 22: 'Louisiana',
 23: 'Maine',
 24: 'Maryland',
 25: 'Massachusetts',
 26: 'Michigan',
 27: 'Minnesota',
 28: 'Mississippi',
 29: 'Missouri',
 30: 'Montana',
 31: 'Nebraska',
 32: 'Nevada',
 33: 'New Hampshire',
 34: 'New Jersey',
 35: 'New Mexico',
 36: 'New York',
 37: 'North Carolina',
 38: 'North Dakota',
 39: 'Ohio',
 40: 'Oklahoma',
 41: 'Oregon',
 42: 'Pennsylvania',
 44: 'Rhode Island',
 45: 'South Carolina',
 46: 'South Dakota',
 47: 'Tennessee',
 48: 'Texas',
 49: 'Utah',
 50: 'Vermont',
 51: 'Virginia',
 53: 'Washington',
 54: 'West Virginia',
 55: 'Wisconsin',
 56: 'Wyoming',
 66: 'Guam',
 72: 'Puerto Rico'}

# Read In BRFSS 2014 data

Next, we can read in the survey data. A copy of it is hosted on the [Open Science Framework](https://osf.io/).

In [23]:
osf = requests.get('https://osf.io/download/n7wm8/')

In [24]:
osf.content[0:1000]

b'_state,fmonth,idate,imonth,iday,iyear,dispcode,seqno,_psu,ctelenum,pvtresd1,colghous,stateres,ladult,numadult,nummen,numwomen,genhlth,physhlth,menthlth,poorhlth,hlthpln1,persdoc2,medcost,checkup1,exerany2,sleptim1,cvdinfr4,cvdcrhd4,cvdstrk3,asthma3,asthnow,chcscncr,chcocncr,chccopd1,havarth3,addepev2,chckidny,diabete3,diabage2,lastden3,rmvteth3,veteran3,marital,children,educa,employ1,income2,weight2,height3,numhhol2,numphon2,cpdemo1,internet,renthom1,sex,pregnant,qlactlm2,useequip,blind,decide,diffwalk,diffdres,diffalon,smoke100,smokday2,stopsmk2,lastsmk2,usenow3,alcday5,avedrnk2,drnk3ge5,maxdrnks,flushot6,flshtmy2,pneuvac3,shingle2,fall12mn,fallinj2,seatbelt,drnkdri2,hadmam,howlong,profexam,lengexam,hadpap2,lastpap2,hadhyst2,pcpsaad2,pcpsadi1,pcpsare1,psatest1,psatime,pcpsars1,bldstool,lstblds3,hadsigm3,hadsgco1,lastsig3,hivtst6,hivtstd3,whrtst10,pdiabtst,prediab1,insulin,bldsugar,feetchk2,doctdiab,chkhemo3,feetchk,eyeexam,diabeye,diabedu,painact2,qlmentl2,qlstres2,qlhlth2,medicare,

This is a .csv file, encoded using the 'latin-1' encoding. The bytes must be decoded using the correct encoding in order for the data to be accessible.
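
To see why the choice of encoding matters, consider a made-up example (not taken from the survey file): the single byte `0xE9` is a valid character in latin-1 but cannot be decoded on its own as UTF-8.

```python
# The byte 0xE9 is "é" in latin-1 but is not valid on its own in UTF-8.
raw = b"caf\xe9"

as_latin1 = raw.decode("latin-1")

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(as_latin1, utf8_failed)
```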

In [25]:
osf.content.decode('latin-1')[0:3000]

'_state,fmonth,idate,imonth,iday,iyear,dispcode,seqno,_psu,ctelenum,pvtresd1,colghous,stateres,ladult,numadult,nummen,numwomen,genhlth,physhlth,menthlth,poorhlth,hlthpln1,persdoc2,medcost,checkup1,exerany2,sleptim1,cvdinfr4,cvdcrhd4,cvdstrk3,asthma3,asthnow,chcscncr,chcocncr,chccopd1,havarth3,addepev2,chckidny,diabete3,diabage2,lastden3,rmvteth3,veteran3,marital,children,educa,employ1,income2,weight2,height3,numhhol2,numphon2,cpdemo1,internet,renthom1,sex,pregnant,qlactlm2,useequip,blind,decide,diffwalk,diffdres,diffalon,smoke100,smokday2,stopsmk2,lastsmk2,usenow3,alcday5,avedrnk2,drnk3ge5,maxdrnks,flushot6,flshtmy2,pneuvac3,shingle2,fall12mn,fallinj2,seatbelt,drnkdri2,hadmam,howlong,profexam,lengexam,hadpap2,lastpap2,hadhyst2,pcpsaad2,pcpsadi1,pcpsare1,psatest1,psatime,pcpsars1,bldstool,lstblds3,hadsigm3,hadsgco1,lastsig3,hivtst6,hivtstd3,whrtst10,pdiabtst,prediab1,insulin,bldsugar,feetchk2,doctdiab,chkhemo3,feetchk,eyeexam,diabeye,diabedu,painact2,qlmentl2,qlstres2,qlhlth2,medicare,h

To read the bytes object as a CSV file, we wrap it in `BytesIO` inside a context manager. A context manager opens and closes a file-like object in one sequence, so that system resources aren't left occupied and can be freed up for other processes. In Python, context managers typically take the form of a `with ... as` statement.
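
The behavior of a context manager can be seen in a small sketch that uses only the standard library. Here the `csv` module stands in for pandas, and the two-row sample is made up to resemble the survey file:

```python
import csv
from io import StringIO

# Made-up two-row excerpt in the same shape as the survey file.
raw = b"_state,fmonth\n1,1\n72,11\n"

# The with-block closes the in-memory file automatically when it ends.
with StringIO(raw.decode("latin-1")) as handle:
    rows = list(csv.DictReader(handle))

print(rows)
print(handle.closed)
```

Once the block ends, `handle.closed` is `True`, so anything that needs the data has to be read inside the block, as with `df_osf`.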

In [26]:
# Use a context manager to read in the bytes as a .csv into pandas:

with BytesIO(osf.content) as osf_data:
    print(type(osf_data))
    df_osf = pd.read_csv(osf_data, encoding='latin-1', low_memory=False)

<class '_io.BytesIO'>


In [27]:
df_osf

Unnamed: 0,_state,fmonth,idate,imonth,iday,iyear,dispcode,seqno,_psu,ctelenum,...,_fobtfs,_crcrec,_aidtst3,_impeduc,_impmrtl,_imphome,rcsbrac1,rcsrace1,rchisla1,rcsbirth
0,1,1,1172014,1,17,2014,1100,2014000001,2014000001,1.0,...,2.0,1.0,2.0,5,1,1,,,,
1,1,1,1072014,1,7,2014,1100,2014000002,2014000002,1.0,...,2.0,2.0,2.0,4,1,1,,,,
2,1,1,1092014,1,9,2014,1100,2014000003,2014000003,1.0,...,2.0,2.0,2.0,6,1,1,,,,
3,1,1,1072014,1,7,2014,1100,2014000004,2014000004,1.0,...,2.0,1.0,2.0,6,3,1,,,,
4,1,1,1162014,1,16,2014,1100,2014000005,2014000005,1.0,...,2.0,1.0,2.0,5,1,1,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
464659,72,11,12112014,12,11,2014,1100,2014005984,2014005984,1.0,...,,,2.0,4,1,1,,,,
464660,72,11,12102014,12,10,2014,1100,2014005985,2014005985,1.0,...,,,2.0,6,2,1,,,,
464661,72,11,12152014,12,15,2014,1100,2014005986,2014005986,1.0,...,2.0,2.0,1.0,5,1,2,,,,
464662,72,11,12132014,12,13,2014,1100,2014005987,2014005987,1.0,...,,,1.0,4,1,1,,,,


# Replace Codes with States Using Dictionary

We can now use our dictionary to replace the FIPS codes in the `_state` column.

In [28]:
df_osf['_state'] = df_osf['_state'].apply(lambda x: fips_dict[x])

In [29]:
df_osf.head()

Unnamed: 0,_state,fmonth,idate,imonth,iday,iyear,dispcode,seqno,_psu,ctelenum,...,_fobtfs,_crcrec,_aidtst3,_impeduc,_impmrtl,_imphome,rcsbrac1,rcsrace1,rchisla1,rcsbirth
0,Alabama,1,1172014,1,17,2014,1100,2014000001,2014000001,1.0,...,2.0,1.0,2.0,5,1,1,,,,
1,Alabama,1,1072014,1,7,2014,1100,2014000002,2014000002,1.0,...,2.0,2.0,2.0,4,1,1,,,,
2,Alabama,1,1092014,1,9,2014,1100,2014000003,2014000003,1.0,...,2.0,2.0,2.0,6,1,1,,,,
3,Alabama,1,1072014,1,7,2014,1100,2014000004,2014000004,1.0,...,2.0,1.0,2.0,6,3,1,,,,
4,Alabama,1,1162014,1,16,2014,1100,2014000005,2014000005,1.0,...,2.0,1.0,2.0,5,1,1,,,,


It's best to double-check the results; if we see any numbers in the `_state` column, we'll know something didn't work right.

In [30]:
df_osf['_state'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming', 'Guam', 'Puerto Rico'],
      dtype=object)
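
Indexing the dictionary directly, as in `fips_dict[x]`, raises a `KeyError` if a code is missing from the dictionary. A more forgiving variant, sketched here on made-up codes with a three-entry excerpt of the lookup table, leaves unknown codes in place and then reports them:

```python
# Short excerpt of the lookup table built from the codebook.
fips_excerpt = {1: "Alabama", 2: "Alaska", 72: "Puerto Rico"}

# Made-up column of codes; 99 is deliberately not in the lookup table.
codes = [1, 72, 2, 99, 1]

# dict.get() falls back to the original code instead of raising KeyError.
decoded = [fips_excerpt.get(code, code) for code in codes]

# Anything that is still not a string was left untranslated.
leftover = sorted({value for value in decoded if not isinstance(value, str)})

print(decoded)
print(leftover)
```

An empty `leftover` list would indicate that every code was translated.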

In [31]:
df_osf.columns

Index(['_state', 'fmonth', 'idate', 'imonth', 'iday', 'iyear', 'dispcode',
       'seqno', '_psu', 'ctelenum',
       ...
       '_fobtfs', '_crcrec', '_aidtst3', '_impeduc', '_impmrtl', '_imphome',
       'rcsbrac1', 'rcsrace1', 'rchisla1', 'rcsbirth'],
      dtype='object', length=279)

In [32]:
df_osf.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 464664 entries, 0 to 464663
Data columns (total 279 columns):
 #    Column    Dtype  
---   ------    -----  
 0    _state    object 
 1    fmonth    int64  
 2    idate     int64  
 3    imonth    int64  
 4    iday      int64  
 5    iyear     int64  
 6    dispcode  int64  
 7    seqno     int64  
 8    _psu      int64  
 9    ctelenum  float64
 10   pvtresd1  float64
 11   colghous  float64
 12   stateres  float64
 13   ladult    float64
 14   numadult  float64
 15   nummen    float64
 16   numwomen  float64
 17   genhlth   float64
 18   physhlth  float64
 19   menthlth  float64
 20   poorhlth  float64
 21   hlthpln1  int64  
 22   persdoc2  float64
 23   medcost   float64
 24   checkup1  float64
 25   exerany2  float64
 26   sleptim1  int64  
 27   cvdinfr4  int64  
 28   cvdcrhd4  int64  
 29   cvdstrk3  int64  
 30   asthma3   int64  
 31   asthnow   float64
 32   chcscncr  float64
 33   chcocncr  float64
 34   chccopd1  float64


As you can see, the other 278 columns each contain a different coded variable; to decode these, we could construct lookup tables using `pdfminer` the same way we did for the FIPS codes.

# End of Module 4