# Table of Contents

- [Why This Data Was Collected](#Why_This_Data_Was_Collected)
- [Data Import / Housekeeping](#Housekeeping)
- [Data Understanding](#Data_Understanding)
- [References](#References)

Data Mining - MSDS 7331 - Thurs 6:30, Summer 2016
Team 3 (AKA Team Super Awesome):  Sal Melendez, Rahn Lieberman, Thomas Rogers


# Lab 1 - Visualization and Data Preprocessing

## Why This Data Was Collected

Our team has selected the 2014 Behavioral Risk Factor Surveillance System (BRFSS) data from the Centers for Disease Control and Prevention (CDC) to understand the relationship between quality of health and a number of behavioral, demographic and environmental factors.

The purpose of the BRFSS project is to survey a large population of Americans on a wide range of topics to inform policy, research and healthcare delivery. The same or similar questions are asked each year, so the resulting dataset gives not only a broad, comprehensive view of health quality in the United States but also a longitudinal view of how quality of care (among other factors) is changing over time.

There are 279 variables in the dataset and over 460,000 completed surveys. The sheer breadth and complexity of this data, with its missing, weighted and calculated variables, requires a clear question of interest and some sense of which variables might help answer it. We have chosen to focus on one particular survey question as our response variable and will attempt to better understand the impact that reported behaviors have on responses to that question.

Our response variable becomes the answer to the following question on quality of health:
•	"Would you say that in general your health is: (1) excellent, (2) very good, (3) good, (4) fair, (5) poor."  (section 1.1, column 80)

We will limit the 279 variables to those related to behavioral survey questions. The behavior-related questions correspond to XXX variables, so our dataset is roughly 450,000 rows by XXX columns.

If we start with a model-building methodology in mind and identify logistic regression as an approach, we need to transform these ordinal responses into a binary response. We’ll do so by combining the “excellent”, “very good” and “good” responses into a “good or better” health category and the “fair” and “poor” responses into a “fair or poor” category.
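The recoding described above can be sketched as a small helper. This is a minimal illustration, not the team's final preprocessing: the function name `binarize_health`, the 1/0 encoding, and the choice to treat every code outside 1 to 5 as missing are all assumptions.

```python
# Codes follow the survey question quoted above:
# 1-3 are "good or better", 4-5 are "fair or poor".
GOOD_OR_BETTER = {1, 2, 3}
FAIR_OR_POOR = {4, 5}


def binarize_health(code):
    """Return 1 for good-or-better, 0 for fair-or-poor, None otherwise.

    The 1/0 encoding and the None fallback are illustrative assumptions.
    """
    if code in GOOD_OR_BETTER:
        return 1
    if code in FAIR_OR_POOR:
        return 0
    return None  # any other code, or a blank response


# Illustrative codes only, not actual survey records.
responses = [1, 2, 3, 4, 5, 7, 9, None]
binary = [binarize_health(c) for c in responses]
print(binary)
```

With the DataFrame loaded in the Housekeeping section, the same helper could be applied to the response column with `Series.map`, since codes read as floats (for example `3.0`) still match the integer sets.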

We will know we’ve mined useful knowledge from this dataset if we’re able to translate the data into actionable insight. If the information we discover has the potential to inform policy or change individual behavior to increase health quality, we’ve been helpful.

We plan to measure the effectiveness of a prediction algorithm by its ability to classify whether someone will report good or better health or fair or poor health with a high degree of sensitivity and specificity.
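For reference, sensitivity is the share of actual positives that are predicted positive, and specificity is the share of actual negatives that are predicted negative. The sketch below computes both from paired label lists. The function name, the `positive` parameter, and the toy labels are hypothetical, and which health category counts as the positive class is a choice the text above leaves open.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for paired binary labels."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # positive case correctly flagged
            else:
                fn += 1  # positive case missed
        else:
            if predicted == positive:
                fp += 1  # negative case wrongly flagged
            else:
                tn += 1  # negative case correctly passed
    # Guard against a class that never appears in y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```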

[Source data, reference 1](#References)


# Housekeeping
Loading the data, imports, and such.

In [None]:
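import pandas as pd
import numpy as np

# Load the tab-delimited survey file into a DataFrame.
df = pd.read_csv("data/LLCP2014XPT.txt", sep="\t")
df.info()  # column dtypes and non-null counts
df.head()  # first five rows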

# Data Understanding

The data consists of 279 fields. The majority of the data is answers to the survey questions on a Likert scale (1 to 5+), a yes/no scale (1 = yes, 2 = no), or a true/false scale (1 = true, 2 = false).

For our question of interest, we are interested in the GENHLTH variable. The answer scale is:

Key | 1 | 2 | 3 | 4 | 5 | 7 | 9 | Blank
----- | --- | --- | --- | --- | --- | --- | --- | --- |
Value | Excellent | Very Good | Good | Fair | Poor | Don't Know | Refused | Not asked or missing
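As a rough illustration of how this key could be used, the sketch below maps the numeric codes to their labels and counts the responses. The dictionary name `GENHLTH_LABELS` and the toy `sample` series are hypothetical stand-ins. In the notebook the same calls would run on the general-health column of `df`.

```python
import pandas as pd

# Label lookup taken from the table above.
GENHLTH_LABELS = {
    1: "Excellent",
    2: "Very Good",
    3: "Good",
    4: "Fair",
    5: "Poor",
    7: "Don't Know",
    9: "Refused",
}

# Hypothetical toy column; None stands in for a blank response.
sample = pd.Series([1, 2, 2, 3, 5, 7, 9, None], name="GENHLTH")

# Codes without a label (here only the blank) become NaN, then get a label.
labeled = sample.map(GENHLTH_LABELS).fillna("Not asked or missing")
counts = labeled.value_counts()
print(counts)
```

Counting the labeled values this way gives a quick view of how many responses fall outside the five substantive categories.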


To get a feel for the data, here are some of the other variables, with their meaning.

The STATE field is a numerically coded list of all states and territories participating in the survey.

Key | Value
--- | ---
1 | Alabama
2 | Alaska
4 | Arizona
... | ...
66 | Guam
72 | Puerto Rico


The number of adults in the household (measured separately for men and women):

Key | Answer to "number of adult [men]/[women] in household"
--- | ---
0 | 0
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 - 99 | 6 or more

A detailed listing of all keys and values is found in the codebook. [Codebook, Reference #2](#References)

# References

(1) The main page for the data: http://www.cdc.gov/brfss/annual_data/annual_2014.html

(2) Codebook of lookup values: http://www.cdc.gov/brfss/annual_data/2014/pdf/codebook14_llcp.pdf