# Behavioral Risk Factor Surveillance System (BRFSS) 2014

## Topics & Techniques Covered:

* Extracting text data from a PDF file
* Eliminating whitespace from text strings
* Using a dictionary to replace numerical data with text-based categories

## Imports

The `io` (input/output) library handles "file objects", which are representations of files in text, bytes, or raw format. This is a bit of an abstract concept, but essentially it lets you treat streamed data from a website or other source as if it were a file being read from a hard drive. We will be using this library to read in PDF files.

In [None]:
import requests
from io import StringIO, BytesIO

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The [pdfminer](https://pdfminersix.readthedocs.io/en/latest/) package lets you extract data from PDF documents. It doesn't work perfectly all the time and usually takes some fiddling, but it is a potential tool to *reproducibly* convert tables in PDF documents to tabular data in Python. How usable it is will depend largely on how well-formatted the PDF is.

In [None]:
!pip install pdfminer.six

In [None]:
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTTextLineHorizontal, LTTextBoxHorizontal

## Behavioral Risk Factor Surveillance System (BRFSS) 2014 Survey Codebook

The Behavioral Risk Factor Surveillance System is a United States public health survey conducted by the Centers for Disease Control and Prevention (CDC) to assess behavioral health risks in the United States. The dataset available from the CDC website is large, but it is not easily readable because every field is coded as a number rather than containing the actual category labels.

The categories are kept in a codebook, which serves as a dictionary so users can translate the survey data. However, the fact that the codebook is a PDF document instead of being in a tabular data format like a spreadsheet makes it difficult to read these codes programmatically. 

This is why tools like `pdfminer.six` are useful; they let you make tables out of data that isn't formatted in a table to begin with.

[FIPS (Federal Information Processing Standard) codes](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code) are codes that the Census Bureau and other institutions have used as unique identifiers for U.S. states and territories.

Here, they are used in the BRFSS dataset as state identifiers, but without a link between the FIPS code and the state name/postal abbreviation, it's harder to match the data at a glance.

### Codebook Download Links:

Available here: https://www.cdc.gov/brfss/annual_data/annual_2014.html

PLEASE NOTE: The CDC website has been unreliable over the past several weeks; the codebook was unavailable for a few days during that period. It may or may not be available when you access this link.

We have uploaded the codebook to our GitHub page. Because the CDC website may continue to be unreliable, we are downloading the BRFSS 2014 dataset from the Open Science Framework (OSF) repository: https://osf.io/n7wm8.


# `requests`

We will be using the `requests` module to perform an HTTP "GET" request for the BRFSS resources.

For a more extensive tutorial on the `requests` module and on web-scraping, please see the archived "Practical Python" workshop materials on the Library's "[Introduction to Python](https://libguides.libraries.claremont.edu/intro-to-python)" Research Guide.

First, we use the `requests` module to make an HTTP "GET" request to pull the PDF data.

In [None]:
pdf_response = requests.get("https://raw.githubusercontent.com/ClaremontCollegesLibrary/PersnicketyPython/refs/heads/main/brfss_2014_codebook.pdf")

In [None]:
pdf_content = pdf_response.content

Here are the first two thousand bytes of the PDF file, returned as a Python [Bytes object](https://docs.python.org/3/library/stdtypes.html#bytes-objects).

Bytes objects display similarly to Python strings (they are formatted like a string, with a "b" at the start before the quotes) but they are fundamentally different.

In [None]:
pdf_content[0:2000]
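To make the difference between bytes and strings concrete, here is a minimal sketch. It uses a short, made-up bytes literal rather than the actual PDF contents: indexing a Bytes object returns an integer, slicing returns more bytes, and the data must be decoded before it becomes a string.

```python
# Hypothetical sample standing in for the start of a downloaded file.
sample = b"%PDF-1.6 example"

first_item = sample[0]             # indexing bytes returns an integer
first_slice = sample[0:4]          # slicing bytes returns bytes
as_text = sample.decode("ascii")   # decoding returns a str

print(type(sample), type(as_text))
print(first_item, first_slice, as_text[0])
```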

The `BytesIO` class from the `io` library allows us to read the Bytes object into a chunk of memory so that it behaves like a file. In this case, the "%PDF-1.6" header at the start of the Bytes object indicates that the file is a PDF, and `BytesIO` lets us treat it as if it were a PDF file on the drive when we pass it to the `extract_text()` function from `pdfminer`.

In [None]:
pdf = BytesIO(pdf_content)
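As a rough illustration of what "behaves like a file" means, the sketch below wraps a made-up bytes value (a stand-in for `pdf_response.content`, not the real codebook) in `BytesIO`, reads the header, and rewinds the buffer. The variable names are assumptions for this example only.

```python
from io import BytesIO

# Hypothetical stand-in for pdf_response.content.
fake_content = b"%PDF-1.6\nrest of the file"

buffer = BytesIO(fake_content)

header = buffer.read(8)     # read the first 8 bytes, as you would from a file
position = buffer.tell()    # the read position has advanced to byte 8
buffer.seek(0)              # rewind so the next reader starts at the beginning

is_pdf = header.startswith(b"%PDF-")
print(header, position, is_pdf)
```

A quick check like `is_pdf` can be a cheap way to confirm that a download returned a PDF rather than, say, an error page.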

# pdfminer

## `extract_text()`

At its simplest, pdfminer converts PDF files to plain text.

In [None]:
text = extract_text(pdf)
print(text[0:2000])

## `extract_pages()`

The `extract_pages()` function segments the text data by page. Each page's data may be further segmented into the individual elements of that page's layout.

In [None]:
pages = list(extract_pages(pdf))

In [None]:
for page_layout in pages[0:2]:
    for element in page_layout:
        print(element)

## Identifying Elements and Extracting Text

In [None]:
table_text = []

# Collect the text of every horizontal text line or text box, in page order.
for page_layout in extract_pages(pdf):
    for element in page_layout:
        if isinstance(element, (LTTextLineHorizontal, LTTextBoxHorizontal)):
            table_text.append(element.get_text())

Once we've identified the length of the tables on each page, we can locate the starting points in the list `table_text` and take segments of that list to use as columns in a DataFrame object.



In [None]:
# Print each of the first 200 elements alongside its index in table_text.
for n, line in enumerate(table_text[0:200]):
    print(n, line)

The column headers that we're interested in are at indices 19, 20, 93, 94, and 95 of `table_text`, and the corresponding data segments in the first chunk start at indices 21, 56, 96, 131, and 166, respectively. Each of these segments is 35 entries long.


In [None]:
# Print elements 200-399, keeping the index aligned with table_text.
for n, line in enumerate(table_text[200:400], start=200):
    print(n, line)

In the second chunk, the columns start at indices 214, 232, 255, 273, and 291, and each one is 18 entries long.

The column headers will need to be cleaned as well. Fortunately, the pattern is consistent. Every element's text has trailing whitespace and a newline character (`\n`), so we can use the string method `.replace()` to pare down each string.

In [None]:
table_text[19]

In [None]:
table_text[19].replace(' \n','')

In [None]:
table_text[21]

In [None]:
table_text[21].replace(' \n','')
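For comparison, here is a small sketch of how `.replace(' \n','')` and the string method `.strip()` behave on a few made-up strings. These samples are illustrative and not taken from the codebook. `.replace()` removes only that exact space-plus-newline sequence, while `.strip()` removes any leading or trailing whitespace, which may be more forgiving if the pattern is less consistent in another PDF.

```python
# Made-up examples of extracted text; real values from the codebook may differ.
samples = ["Value \n", "Alabama \n", "Frequency  \n", "42\n"]

# Removes only the exact sequence: one space followed by a newline.
replaced = [s.replace(" \n", "") for s in samples]

# Removes all leading and trailing whitespace, whatever its form.
stripped = [s.strip() for s in samples]

print(replaced)
print(stripped)
```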

Here we can use multiple list comprehensions to create columns for a DataFrame:

In [None]:
table_length = 35
table_length2 = 18

state_fips = pd.DataFrame()

state_fips[table_text[19].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[21:21+table_length] + table_text[214:214+table_length2]
        ]

state_fips[table_text[20].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[56:56+table_length] + table_text[232:232+table_length2]
        ]

state_fips[table_text[93].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[96:96+table_length]+ table_text[255:255+table_length2]
        ]

state_fips[table_text[94].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[131:131+table_length] + table_text[273:273+table_length2]
        ]

state_fips[table_text[95].replace(' \n','')] = [
    value.replace(' \n','') for value in table_text[166:166+table_length] + table_text[291:291+table_length2]
        ]
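The five assignments above all follow the same pattern: take a slice from each chunk, join the slices, and clean every value. As one possible way to express that pattern once, the sketch below uses a hypothetical helper, `stitch_column`, on a toy list standing in for `table_text`. It uses `.strip()` rather than `.replace(' \n','')`, which is an assumption about what cleaning is wanted.

```python
def stitch_column(cells, segments):
    """Join several (start, length) slices of a flat list into one cleaned column."""
    column = []
    for start, length in segments:
        column.extend(value.strip() for value in cells[start:start + length])
    return column

# Toy list standing in for table_text: a header, two values, filler, two more values.
toy_cells = ["Value \n", "1 \n", "2 \n", "page footer \n", "4 \n", "5 \n"]

header = toy_cells[0].strip()
values = stitch_column(toy_cells, [(1, 2), (4, 2)])

print(header, values)
```

With the real data, the segments for the first column would be `[(21, 35), (214, 18)]`, and the returned list could be assigned to a DataFrame column as before.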

In [None]:
state_fips

In [None]:
zipped_fips = zip(state_fips['Value'].values, state_fips['Value Label'].values)

fips_dict = {int(value):label for value, label in zipped_fips}

In [None]:
fips_dict

# Read In BRFSS 2014 data

Next, we can read in the survey data. A copy of it is hosted on the [Open Science Framework](https://osf.io/) (OSF).

In [None]:
osf = requests.get('https://osf.io/download/n7wm8/')

In [None]:
osf.content[0:1000]

This is a .csv file, encoded using the 'latin-1' encoding. The bytes must be decoded using the correct encoding in order for the data to be accessible.

In [None]:
osf.content.decode('latin-1')[0:3000]
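To show why the choice of encoding matters, here is a minimal sketch with made-up bytes (not from the BRFSS file) that contain an "é" stored as a single Latin-1 byte. Decoding with `'latin-1'` succeeds, while decoding the same bytes as UTF-8 raises an error.

```python
# Made-up bytes containing "é" encoded as Latin-1 (a single byte, 0xE9).
raw = b"name,city\nJos\xe9,Claremont\n"

decoded = raw.decode("latin-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 followed by a comma is not a valid UTF-8 sequence.
    utf8_ok = False

print(decoded)
print("Decodes as UTF-8:", utf8_ok)
```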

To read the Bytes object in as a CSV file, we can use a mechanism called a context manager. This is essentially a way of opening and closing a file all in one sequence, so that system resources aren't left occupied and may be freed up for other processes. In Python, context managers typically take the form of a "with... as" statement.

In [None]:
# Use a context manager to read in the bytes as a .csv into pandas:

with BytesIO(osf.content) as osf_data:
    print(type(osf_data))
    df_osf = pd.read_csv(osf_data, encoding='latin-1', low_memory=False)
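The same open-use-close sequence can be seen on a small scale with only the standard library. This sketch is an illustration under assumptions: the CSV bytes are made up, and it uses `StringIO` with the `csv` module instead of `BytesIO` with pandas. It also shows that the file object is closed once the `with` block ends.

```python
import csv
from io import StringIO

# Made-up CSV bytes standing in for osf.content.
raw = b"_state,age\n1,34\n6,51\n"

# Decode first, then treat the resulting text as a file for the csv reader.
with StringIO(raw.decode("latin-1")) as text_file:
    rows = list(csv.DictReader(text_file))

print(rows)
print("Closed after the block:", text_file.closed)
```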

In [None]:
df_osf

We can now use our dictionary to replace the FIPS codes in the `_state` column.

# Replace Codes with States Using Dictionary

In [None]:
df_osf['_state'] = df_osf['_state'].apply(lambda x: fips_dict[x])

In [None]:
df_osf.head()

It's best to double-check the results; if we see any numbers in the `_state` column, we'll know something didn't work right.

In [None]:
df_osf['_state'].unique()
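Another way to double-check, sketched here with toy stand-ins for `fips_dict` and the `_state` column, is to list any codes that appear in the data but have no entry in the dictionary. The code `99` is a made-up value used only to trigger the check.

```python
# Toy stand-ins for fips_dict and the `_state` column; 99 is a made-up code.
toy_fips = {1: "Alabama", 2: "Alaska", 4: "Arizona"}
toy_codes = [1, 4, 4, 2, 99]

# Codes present in the data but absent from the dictionary.
unmapped = sorted(set(toy_codes) - set(toy_fips))

# .get() with a fallback leaves unknown codes visible instead of raising KeyError.
labels = [toy_fips.get(code, code) for code in toy_codes]

print(unmapped)
print(labels)
```

Running a check like this before replacing the column shows in advance whether a plain `fips_dict[x]` lookup would fail on any value.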

In [None]:
df_osf.columns

In [None]:
df_osf.info(verbose=True)

As you can see, the other 225 columns each contain a different variable; in order to access these, we could construct tables using `pdfminer` the same way we did for the FIPS codes.

# End of Module 4