Data Mining - MSDS 7331 - Thurs 6:30, Summer 2016
Team 3 (AKA Team Super Awesome):  Sal Melendez, Rahn Lieberman, Thomas Rogers
<hr>

# Lab 1 - Visualization and Data Preprocessing

## Table of Contents

- [Why This Data Was Collected](#Why-This-Data-Was-Collected)
- [Data Import / Housekeeping](#Housekeeping)
- [Data Understanding](#Data-Understanding)
- [Data Quality](#Data-Quality)
- [Simple Statistics](#Simple-Statistics)
- [Visualization and Attributes](#Visualization-and-Attributes)
- [Interesting Features](#Interesting-Features)
- [Exceptional Work](#Exceptional-Work)
- [References](#References)

<hr>
## Why This Data Was Collected

Our team selected the 2014 Behavioral Risk Factor Surveillance System (BRFSS) data from the Centers for Disease Control and Prevention (CDC) to understand the relationship between quality of health and a number of behavioral, demographic and environmental factors.

The purpose of the BRFSS project is to survey a large population of Americans on a wide range of topics to inform policy, research and healthcare delivery. The same or similar questions are asked each year, so the resulting dataset gives not only a broad, comprehensive view of health quality in the United States but also a longitudinal view of how quality of care (among other factors) is changing over time.

There are 279 variables in the dataset and over 460,000 completed surveys. The sheer breadth and complexity of this data, with its missing, weighted and calculated variables, requires a clear and distinct question of interest and some sense of which variables might help answer it. We have chosen to focus on one particular survey question as our response variable and will attempt to better understand the impact that reported behaviors have on responses to that question.

Our response variable is the answer to the following question on quality of health:

- "Would you say that in general your health is: (1) excellent, (2) very good, (3) good, (4) fair, (5) poor." (section 1.1, column 80)

We will limit the 279 variables to focus on those related to behavioral survey questions. The corresponding variables from the questions related to behavior number XXX, so our dataset is roughly 450,000 rows by XXX columns. 

If we start now with a model-building methodology in mind and identify logistic regression as an approach, we need to transform these ordinal responses into a binary response. We’ll do so by combining the “excellent”, “very good” and “good” responses as measures of “good or better” health and the “fair” and “poor” measures as “fair and poor”.
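As a rough sketch of that recoding in pandas, the snippet below collapses the five ordinal answers into a 0/1 indicator. The toy series stands in for the real `GENHLTH` column, and treating the "don't know" (7), "refused" (9) and blank answers as missing is an assumption on our part, not a decision the team has recorded.

```python
import numpy as np
import pandas as pd

# Toy GENHLTH responses; in the notebook this would be df["GENHLTH"].
genhlth = pd.Series([1, 2, 3, 4, 5, 7, 9, np.nan], name="GENHLTH")

# Only codes 1-5 are real answers. Codes 7, 9 and blanks are treated
# as missing here (an assumption, see the lead-in).
valid = genhlth.isin([1, 2, 3, 4, 5])

# 1.0 = "good or better" (codes 1-3), 0.0 = "fair or poor" (codes 4-5).
good_or_better = genhlth.le(3).astype(float).where(valid)

print(good_or_better.tolist())
```

Rows left as missing would need to be dropped or imputed before fitting a logistic regression.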

To measure success, we will know we’ve mined useful knowledge from this dataset if we’re able to translate the data into actionable insight. If the information we discover has the potential to inform policy or change individual behavior to increase health quality, we’ve been helpful. 

We plan to measure the effectiveness of a prediction algorithm by its ability to classify whether someone will report good or better health or fair and poor health with a high degree of sensitivity and specificity.
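For reference, sensitivity is the share of actual positives that are predicted positive, and specificity is the share of actual negatives that are predicted negative. The sketch below computes both from 0/1 labels. The helper name `sensitivity_specificity` and the label values are made up for illustration, and taking "good or better" as the positive class is an assumption.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels for illustration only: 1 = "good or better", 0 = "fair or poor".
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```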

[Source data, reference 1](#References)


[toc](#Table-of-Contents)
<hr>
## Housekeeping
Load the libraries and the data.

In [4]:
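import pandas as pd
import numpy as np

# Tab-separated text export of the 2014 BRFSS data
df = pd.read_csv("data/LLCP2014XPT.txt", sep="\t")
df.info()  # row count, column count, dtypes and memory use
df.head()  # first five rows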

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 464664 entries, 0 to 464663
Columns: 279 entries, _STATE to RCSBIRTH
dtypes: float64(226), int64(52), object(1)
memory usage: 989.1+ MB


Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENUM,...,_FOBTFS,_CRCREC,_AIDTST3,_IMPEDUC,_IMPMRTL,_IMPHOME,RCSBRAC1,RCSRACE1,RCHISLA1,RCSBIRTH
0,1,1,1172014,1,17,2014,1100,2014000001,2014000001,1.0,...,2.0,1.0,2.0,5,1,1,,,,
1,1,1,1072014,1,7,2014,1100,2014000002,2014000002,1.0,...,2.0,2.0,2.0,4,1,1,,,,
2,1,1,1092014,1,9,2014,1100,2014000003,2014000003,1.0,...,2.0,2.0,2.0,6,1,1,,,,
3,1,1,1072014,1,7,2014,1100,2014000004,2014000004,1.0,...,2.0,1.0,2.0,6,3,1,,,,
4,1,1,1162014,1,16,2014,1100,2014000005,2014000005,1.0,...,2.0,1.0,2.0,5,1,1,,,,


[toc](#Table-of-Contents)
<hr>
## Data Understanding
The data consists of 279 fields. Most fields hold answers to survey questions, coded either on a Likert-type scale (1 to 5+), as yes/no (1 = yes, 2 = no), or as true/false (1 = true, 2 = false).



For our question of interest, we focus on the GENHLTH variable. The answer scale is:

Key | 1 | 2 | 3 | 4 | 5 | 7 | 9 | Blank |
----- | --- | --- | --- | --- | --- | --- | --- | --- |
Value | Excellent | Very Good | Good | Fair | Poor | Don't Know | Refused | Not asked or missing
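To make the coded answers readable when exploring, the key/value table above can be turned into a lookup dictionary. This is a minimal sketch on a toy series. The dictionary name `GENHLTH_LABELS` is our own, and the real column would be `df["GENHLTH"]`.

```python
import numpy as np
import pandas as pd

# Lookup taken from the key/value table above; blanks stay as NaN.
GENHLTH_LABELS = {
    1: "Excellent", 2: "Very Good", 3: "Good", 4: "Fair", 5: "Poor",
    7: "Don't Know", 9: "Refused",
}

# Toy codes standing in for df["GENHLTH"].
codes = pd.Series([1, 2, 2, 3, 5, 7, 9, np.nan], name="GENHLTH")

labels = codes.map(GENHLTH_LABELS)

# dropna=False keeps the "not asked or missing" rows visible in the tally.
counts = labels.value_counts(dropna=False)
print(counts)
```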


To get a feel for the data, here are some of the other variables, with their meaning.

The _STATE field is a numerically coded list of all states and territories participating in the survey.

Key | Value
--- | ---
1 | Alabama
2 | Alaska
4 | Arizona
... | ...
66 | Guam
72 | Puerto Rico


The number of adults in the household (measured separately for men and women):

Key | Answer to "Number of adult [men]/[women] in household"
--- | ---
0 | 0
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 - 99 | 6 or more

As mentioned in the first section, to make the data somewhat manageable, we chose one response variable and 30 explanatory variables. The table below shows whether each variable is a response variable (RV) or an explanatory variable (EV), the survey section it belongs to, its column number, its name and a description.

<table class="table table-bordered table-hover table-condensed">
<tbody><tr><td>RV or EV</td>
<td>Section</td>
<td>Column</td>
<td>Variable Name</td>
<td>Description</td>
</tr>
<tr><td>RV</td>
<td>1.1</td>
<td>80</td>
<td>GENHLTH</td>
<td>Would you say that in general your health is:</td>
</tr>
<tr><td>EV</td>
<td>3.1</td>
<td>87</td>
<td>HLTHPLN1</td>
<td>Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?</td>
</tr>
<tr><td>EV</td>
<td>3.2</td>
<td>88</td>
<td>PERSDOC2</td>
<td>Do you have one person you think of as your personal doctor or health care provider? (If &quot;No&quot; ask &quot;Is there more than one or is there no person who you think of as your personal doctor or health care provider?&quot;.)</td>
</tr>
<tr><td>EV</td>
<td>3.3</td>
<td>89</td>
<td>MEDCOST</td>
<td>Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?</td>
</tr>
<tr><td>EV</td>
<td>3.4</td>
<td>90</td>
<td>CHECKUP1</td>
<td>About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]</td>
</tr>
<tr><td>EV</td>
<td>4.1</td>
<td>91</td>
<td>EXERANY2</td>
<td>During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?</td>
</tr>
<tr><td>EV</td>
<td>5.1</td>
<td>92</td>
<td>SLEPTIM1</td>
<td>On average, how many hours of sleep do you get in a 24-hour period?</td>
</tr>
<tr><td>EV</td>
<td>6.1</td>
<td>94</td>
<td>CVDINFR4</td>
<td>(Ever told) you had a heart attack, also called a myocardial infarction?</td>
</tr>
<tr><td>EV</td>
<td>6.2</td>
<td>95</td>
<td>CVDCRHD4</td>
<td>(Ever told) you had angina or coronary heart disease?</td>
</tr>
<tr><td>EV</td>
<td>6.3</td>
<td>96</td>
<td>CVDSTRK3</td>
<td>(Ever told) you had a stroke.</td>
</tr>
<tr><td>EV</td>
<td>6.4</td>
<td>97</td>
<td>ASTHMA3</td>
<td>(Ever told) you had asthma?</td>
</tr>
<tr><td>EV</td>
<td>6.5</td>
<td>98</td>
<td>ASTHNOW</td>
<td>Do you still have asthma?</td>
</tr>
<tr><td>EV</td>
<td>6.6</td>
<td>99</td>
<td>CHCSCNCR</td>
<td>(Ever told) you had skin cancer?</td>
</tr>
<tr><td>EV</td>
<td>6.7</td>
<td>100</td>
<td>CHCOCNCR</td>
<td>(Ever told) you had any other types of cancer?</td>
</tr>
<tr><td>EV</td>
<td>6.8</td>
<td>101</td>
<td>CHCCOPD1</td>
<td>(Ever told) you have Chronic Obstructive Pulmonary Disease or COPD, emphysema or chronic bronchitis?</td>
</tr>
<tr><td>EV</td>
<td>6.9</td>
<td>102</td>
<td>HAVARTH3</td>
<td>(Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteoporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)</td>
</tr>
<tr><td>EV</td>
<td>6.10</td>
<td>103</td>
<td>ADDEPEV2</td>
<td>(Ever told) you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?</td>
</tr>
<tr><td>EV</td>
<td>6.11</td>
<td>104</td>
<td>CHCKIDNY</td>
<td>(Ever told) you have kidney disease? Do NOT include kidney stones, bladder infection or incontinence. (Incontinence is not being able to control urine flow.)</td>
</tr>
<tr><td>EV</td>
<td>7.1</td>
<td>108</td>
<td>LASTDEN3</td>
<td>How long has it been since you last visited a dentist or a dental clinic for any reason? Include visits to dental specialists, such as orthodontists.</td>
</tr>
<tr><td>EV</td>
<td>7.2</td>
<td>109</td>
<td>RMVTETH3</td>
<td>How many of your permanent teeth have been removed because of tooth decay or gum disease? Include teeth lost to infection, but do not include teeth lost for other reasons, such as injury or orthodontics. (If wisdom teeth are removed because of tooth decay or gum disease, they should be included in the count for lost teeth)</td>
</tr>
<tr><td>EV</td>
<td>8.24</td>
<td>181</td>
<td>USEEQUIP</td>
<td>Do you now have any health problem that requires you to use special equipment, such as a cane, a wheelchair, a special bed, or a special telephone? (Include occasional use or use in certain circumstances.)</td>
</tr>
<tr><td>EV</td>
<td>8.25</td>
<td>182</td>
<td>BLIND</td>
<td>Are you blind or do you have serious difficulty seeing, even when wearing glasses?</td>
</tr>
<tr><td>EV</td>
<td>8.26</td>
<td>183</td>
<td>DECIDE</td>
<td>Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?</td>
</tr>
<tr><td>EV</td>
<td>8.27</td>
<td>184</td>
<td>DIFFWALK</td>
<td>Do you have serious difficulty walking or climbing stairs?</td>
</tr>
<tr><td>EV</td>
<td>8.28</td>
<td>185</td>
<td>DIFFDRES</td>
<td>Do you have difficulty dressing or bathing?</td>
</tr>
<tr><td>EV</td>
<td>8.29</td>
<td>186</td>
<td>DIFFALON</td>
<td>Because of a physical, mental, or emotional condition, do you have difficulty doing errands alone such as visiting a doctor’s office or shopping?</td>
</tr>
<tr><td>EV</td>
<td>9.1</td>
<td>187</td>
<td>SMOKE100</td>
<td>Have you smoked at least 100 cigarettes in your entire life?</td>
</tr>
<tr><td>EV</td>
<td>9.2</td>
<td>188</td>
<td>SMOKDAY2</td>
<td>Do you now smoke cigarettes every day, some days, or not at all?</td>
</tr>
<tr><td>EV</td>
<td>9.3</td>
<td>189</td>
<td>STOPSMK2</td>
<td>During the past 12 months, have you stopped smoking for one day or longer because you were trying to quit smoking?</td>
</tr>
<tr><td>EV</td>
<td>9.4</td>
<td>190</td>
<td>LASTSMK2</td>
<td>How long has it been since you last smoked a cigarette, even one or two puffs?</td>
</tr>
<tr><td>EV</td>
<td>9.5</td>
<td>192</td>
<td>USENOW3</td>
<td>Do you currently use chewing tobacco, snuff, or snus every day, some days, or not at all? (Snus (Swedish for snuff) is a moist smokeless tobacco, usually sold in small pouches that are placed under the lip against the gum.)[Snus (rhymes with ´goose´)]</td>
</tr>
</tbody></table>

A detailed listing of all keys and values is found in the codebook. [Codebook, Reference #2](#References)
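One way to narrow the full frame down to the variables in the table is to keep a list of their names and select by it. The sketch below uses a small toy frame in place of the 279-column `df`. The helper `select_columns` and the constants `RESPONSE` and `EXPLANATORY` are illustrative names, not part of the notebook.

```python
import pandas as pd

RESPONSE = "GENHLTH"
EXPLANATORY = [
    "HLTHPLN1", "PERSDOC2", "MEDCOST", "CHECKUP1", "EXERANY2", "SLEPTIM1",
    "CVDINFR4", "CVDCRHD4", "CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR",
    "CHCOCNCR", "CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "LASTDEN3",
    "RMVTETH3", "USEEQUIP", "BLIND", "DECIDE", "DIFFWALK", "DIFFDRES",
    "DIFFALON", "SMOKE100", "SMOKDAY2", "STOPSMK2", "LASTSMK2", "USENOW3",
]

def select_columns(frame):
    """Keep the response and explanatory columns that exist in `frame`."""
    wanted = [RESPONSE] + EXPLANATORY
    # Skip names that are absent rather than raising a KeyError.
    return frame[[c for c in wanted if c in frame.columns]]

# Toy stand-in for the full BRFSS frame.
toy = pd.DataFrame(
    {"_STATE": [1, 2], "GENHLTH": [2, 4], "SMOKE100": [1, 2], "BLIND": [2, 2]}
)
subset = select_columns(toy)
print(subset.columns.tolist())
```

Skipping absent names keeps the sketch forgiving on the toy frame. On the real data it may be safer to raise an error so that a misspelled variable name is noticed.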

[toc](#Table-of-Contents)
<hr>
## Data Quality

[toc](#Table-of-Contents)
<hr>
## Simple Statistics

[toc](#Table-of-Contents)
<hr>
## Visualization and Attributes

[toc](#Table-of-Contents)
<hr>
## Interesting Features
we should find something....

[toc](#Table-of-Contents)
<hr>
## Exceptional Work
1. This notebook is fairly large. As simple as it sounds, we found that a table of contents with links to each section made it much easier to navigate.

[toc](#Table-of-Contents)
<hr>
## References

(1) The main page for the data: http://www.cdc.gov/brfss/annual_data/annual_2014.html

(2) Codebook of lookup values: http://www.cdc.gov/brfss/annual_data/2014/pdf/codebook14_llcp.pdf
        