
Data Mining - MSDS 7331 - Thurs 6:30, Summer 2016

Team 3 (AKA Team Super Awesome):  Sal Melendez, Rahn Lieberman, Thomas Rogers

Github page:
https://github.com/RahnL/DataScience-SMU/tree/master/DataMining

Note: Code borrowed heavily from Eric Larson's github pages for this class.
https://github.com/eclarson/DataMiningNotebooks/blob/master/04.%20Logits%20and%20SVM.ipynb

Code also borrowed from other projects we're working on using the same dataset.

https://github.com/RahnL/DataScience-SMU/blob/master/DataMining/DataMining-MiniLab1-Lieberman-Melendez-Rogers.ipynb
https://github.com/rlshuhart/MSDS6210-Immersion_Project/blob/master/Study/Closing%20the%20Gap%20Study%20Revisited.ipynb


## Data Preparation


Our team selected the 2014 Behavioral Risk Factor Surveillance System (BRFSS) data from the Centers for Disease Control and Prevention (CDC) to understand the relationship between quality of health and a number of behavioral, demographic and environmental factors.

The purpose of the BRFSS project is to survey a large population of Americans on a wide range of topics to inform policy, research and healthcare delivery. The same or similar questions are asked each year and the resulting dataset gives not only a broad, comprehensive view of health quality in the United States, but it also provides a longitudinal view on how quality of care (among other factors) is changing over time.

There are 279 variables in the dataset and over 460,000 completed surveys. Given the breadth and complexity of this data, which includes missing, weighted and calculated variables, we need a clear question of interest and some sense of which variables might help answer it. We have chosen one particular survey question as our response variable and will attempt to better understand the impact reported behaviors have on responses to that question.

Our response variable becomes the answer to the following question on quality of health: "Would you say that in general your health is: (1) excellent, (2) very good, (3) good, (4) fair, (5) poor?" (section 1.1, column 80)

We will limit the 279 variables to those related to behavioral survey questions. There are 30 such variables, so our dataset is roughly 450,000 rows by 30 columns.

In [46]:
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns

# plot graphs in the notebook

%matplotlib inline

In [50]:
df = pd.read_csv("data/LLCP2014XPT.txt", sep="\t", encoding = "ISO-8859-1")
df.head()

print("Starting length is %.f " % len(df))

# Age 18 to 64 - Excludes 65 or older, refused, or missing
df = df[df['_AGE65YR'] == 1].drop('_AGE65YR', axis=1)

# Exclude blank, 'Don't know', 'Not Sure', or 'Refused'
df = df[((df['GENHLTH'].notnull()) & (~df['GENHLTH'].isin([7,9])))] 

# Reduce Ethnicity to White, Black, or Hispanic (ex. Asian 2%, American Indian/Alaskan Native 1.55%, other 2.8%)
df = df[df['_IMPRACE'].isin([1,2,5])]
# Has Health plan --Excludes 'Don't know', 'Not Sure', or 'Refused'. drops .6%
df = df[df['HLTHPLN1'].isin([1,2])]

# Translate GENHLTH to a binary classification:
# the “excellent”, “very good” and “good” responses become “good or better” (1) health,
# and the “fair” and “poor” responses become “fair or poor” (0).
df.loc[(df['GENHLTH'] < 4), 'health'] = 1
df.loc[(df['GENHLTH'] >= 4), 'health'] = 0

# Extract the survey year from the sequence number. IYEAR sometimes ran into the next year,
# so this is one way to designate the year of the data publication.
# Also, if we add other years to the data, this separates them.
df['Rec_Year'] = df['SEQNO'].astype(str).str[:4].astype(int)

df.info()
df.head()



Starting length is 464664 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 275270 entries, 2 to 464663
Columns: 280 entries, _STATE to Rec_Year
dtypes: float64(227), int32(1), int64(51), object(1)
memory usage: 589.1+ MB


Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENUM,...,_AIDTST3,_IMPEDUC,_IMPMRTL,_IMPHOME,RCSBRAC1,RCSRACE1,RCHISLA1,RCSBIRTH,health,Rec_Year
2,1,1,1092014,1,9,2014,1100,2014000003,2014000003,1.0,...,2.0,6,1,1,,,,,1.0,2014
6,1,1,1062014,1,6,2014,1100,2014000007,2014000007,1.0,...,1.0,6,1,1,,,,,1.0,2014
7,1,1,1112014,1,11,2014,1100,2014000008,2014000008,1.0,...,1.0,4,2,1,,,,,1.0,2014
8,1,1,1022014,1,2,2014,1100,2014000009,2014000009,1.0,...,2.0,3,1,1,,,,,1.0,2014
9,1,1,1082014,1,8,2014,1100,2014000010,2014000010,1.0,...,2.0,5,1,1,,,,,1.0,2014


## Data Reduction and Pre-processing

Because we're interested in the relationship between behaviors, demographics and other factors, and the impact they have on general health quality, we'll reduce the data frame down to those variables we think will have the biggest impact, including:

#### Behaviors:
- Whether someone smokes or not (represented by _SMOKER3)
- Physical health (represented by PHYSHLTH, the number of days in the past 30 on which physical health was not good)

#### Demographics:
- Age (represented by _AGE_G)
- Education level (represented by EDUCA)
- Income level (represented by _INCOMG)
- Race (represented by _IMPRACE, an imputed value based on the initial data set)

#### Other Factors:
- The cost of health care (represented by MEDCOST)
- Health coverage (represented by HLTHPLN1)

In [51]:
# take an explicit copy so the cleanup below does not operate on a slice of df
df_reduced = df[['health','_SMOKER3','PHYSHLTH','_AGE_G','EDUCA','_INCOMG','MEDCOST','HLTHPLN1','_IMPRACE']].copy()

# Cleanup
# 7 is the "Don't know / Not sure" choice and 9 is the "Refused" choice in the categorical columns.
# Caveat: 7 and 9 are also valid day counts in PHYSHLTH, so this blanket replace drops those rows too.
df_reduced = df_reduced.replace([7, 9], np.nan)
df_reduced = df_reduced.dropna() # this drops the rows that were refused/don't know

df_reduced.info()
df_reduced.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 231507 entries, 2 to 464663
Data columns (total 9 columns):
health      231507 non-null float64
_SMOKER3    231507 non-null float64
PHYSHLTH    231507 non-null float64
_AGE_G      231507 non-null int64
EDUCA       231507 non-null float64
_INCOMG     231507 non-null float64
MEDCOST     231507 non-null float64
HLTHPLN1    231507 non-null int64
_IMPRACE    231507 non-null int64
dtypes: float64(6), int64(3)
memory usage: 17.7 MB


Unnamed: 0,health,_SMOKER3,PHYSHLTH,_AGE_G,EDUCA,_INCOMG,MEDCOST,HLTHPLN1,_IMPRACE
2,1.0,3.0,88.0,4,6.0,5.0,2.0,1,1
6,1.0,3.0,2.0,5,6.0,5.0,2.0,1,1
7,1.0,4.0,3.0,5,4.0,1.0,2.0,1,1
9,1.0,2.0,1.0,3,5.0,5.0,2.0,1,1
12,1.0,4.0,88.0,3,6.0,5.0,2.0,1,1
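
The cleanup above applies the same 7/9 replacement to every column. A minimal sketch of a column-specific alternative is shown here, assuming (from BRFSS codebook conventions, which should be checked against the 2014 codebook) that PHYSHLTH holds a day count from 1 to 30 with 88 meaning "none", 77 "don't know" and 99 "refused", while the categorical columns use 7 and/or 9 for non-response. The helper name `clean_brfss_codes`, the `NONRESPONSE_CODES` mapping and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column non-response codes (an assumption to verify
# against the 2014 BRFSS codebook before use).
NONRESPONSE_CODES = {
    'PHYSHLTH': [77, 99],
    '_SMOKER3': [9],
    'EDUCA': [9],
    '_INCOMG': [9],
    'MEDCOST': [7, 9],
}

def clean_brfss_codes(frame):
    """Return a cleaned copy: non-response codes become NaN, then rows with NaN are dropped."""
    out = frame.copy()
    for col, codes in NONRESPONSE_CODES.items():
        if col in out.columns:
            out[col] = out[col].replace(codes, np.nan)
    if 'PHYSHLTH' in out.columns:
        # 88 means "no days of poor physical health", so store it as 0 days
        out['PHYSHLTH'] = out['PHYSHLTH'].replace(88, 0)
    return out.dropna()

# Toy data: rows with PHYSHLTH 7 and 9 are real day counts and should survive
toy = pd.DataFrame({
    'PHYSHLTH': [88, 7, 9, 77, 30],
    'MEDCOST':  [2, 1, 2, 2, 9],
})
cleaned = clean_brfss_codes(toy)
print(cleaned)
```

With this approach, respondents who reported 7 or 9 days of poor physical health stay in the data, and the 88 code becomes a usable numeric zero instead of an outlier.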


In [52]:
## Trying with one-hot encoding...failing with one-hot encoding

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([df])  
OneHotEncoder(categorical_features='health', dtype='float64', handle_unknown='error', n_values='auto', sparse=True)

ValueError: cannot copy sequence with size 275270 to array axis with dimension 280
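
The failure above is likely related to passing the whole frame wrapped in a list (`enc.fit([df])`) instead of a 2-D array of categorical columns. As a hedged alternative, the sketch below one-hot encodes nominal columns with `pd.get_dummies`. The toy frame and the choice of `_SMOKER3` and `_IMPRACE` as nominal columns are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for df_reduced (values are illustrative, not real survey rows)
toy = pd.DataFrame({
    '_SMOKER3': [1, 3, 4, 3],
    '_IMPRACE': [1, 2, 5, 1],
    'PHYSHLTH': [0, 2, 30, 5],
})

# Columns treated as nominal categories (an assumption for this sketch)
nominal_cols = ['_SMOKER3', '_IMPRACE']

# Each category level becomes its own 0/1 indicator column;
# columns not listed in nominal_cols pass through unchanged
encoded = pd.get_dummies(toy, columns=nominal_cols, dtype=int)
print(encoded.columns.tolist())
```

Indicator columns keep a linear model from treating category codes such as race 1, 2 and 5 as points on a numeric scale.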

In [53]:
from sklearn.cross_validation import ShuffleSplit

#... setup x, y
if 'health' in df_reduced:
    y = df_reduced['health'].values # get the labels we want
    del df_reduced['health'] # get rid of the class label

X = df_reduced.values # use everything else to predict!

# do the cross validation
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n=num_instances,
                         n_iter=num_cv_iterations,
                         test_size  = 0.2)
print (cv_object)

ShuffleSplit(231507, n_iter=3, test_size=0.2, random_state=None)
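
The `sklearn.cross_validation` module used above was deprecated and later removed from scikit-learn (from version 0.20 on). A rough sketch of the equivalent setup with `sklearn.model_selection` follows, using synthetic stand-in data. The `_demo` names and `random_state=0` are assumptions added for reproducibility, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in for X and y
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = rng.randint(0, 2, size=100)

# In the newer API the splitter no longer takes the number of rows;
# the data is passed to .split() instead
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    split_sizes.append((len(train_idx), len(test_idx)))
print(split_sizes)
```

The training loop is otherwise unchanged: iterate over `cv_demo.split(X, y)` rather than over the splitter object itself.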


In [54]:
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
import datetime

# *** RESUBMISSION COMMENTARY ***
# We are going to try each of the different penalties to see how it does.

penalties = ('l1', 'l2')
for p in penalties:
    print("\n\nNow testing penalty:", p)
    
    lr_clf = LogisticRegression(penalty=p, C=1.0, class_weight=None) # get object

    iter_num = 0
    accuracy = 0

    # the indices are the rows used for training and testing in each iteration
    for train_indices, test_indices in cv_object: 
        X_train = X[train_indices]
        y_train = y[train_indices]

        X_test = X[test_indices]
        y_test = y[test_indices]

        # train the reusable logistic regression model on the training data
        lr_clf.fit(X_train,y_train)  # train object
        y_hat = lr_clf.predict(X_test) # get test set predictions

        # now let's get the accuracy and confusion matrix for this iteration of training/testing
        acc = mt.accuracy_score(y_test,y_hat)
        conf = mt.confusion_matrix(y_test,y_hat)
        print ("====Iteration",iter_num," ====")
        print ("accuracy", acc)
        print ("confusion matrix\n",conf)
        iter_num+=1
        accuracy = accuracy + acc

    print ('\nAverage accuracy: ', accuracy/iter_num)



Now testing penalty:
====Iteration 0  ====
accuracy 0.86804025744
confusion matrix
 [[ 2269  4763]
 [ 1347 37923]]
====Iteration 1  ====
accuracy 0.867111571854
confusion matrix
 [[ 2357  4783]
 [ 1370 37792]]
====Iteration 2  ====
accuracy 0.866874001123
confusion matrix
 [[ 2322  4870]
 [ 1294 37816]]

Average accuracy:  0.867341943473


Now testing penalty:
====Iteration 0  ====
accuracy 0.864843851238
confusion matrix
 [[ 2254  4856]
 [ 1402 37790]]
====Iteration 1  ====
accuracy 0.864195931061
confusion matrix
 [[ 2314  4853]
 [ 1435 37700]]
====Iteration 2  ====
accuracy 0.865513368753
confusion matrix
 [[ 2316  4894]
 [ 1333 37759]]

Average accuracy:  0.864851050351
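
Accuracy alone can hide how the model treats each class. The sketch below works through the arithmetic for the iteration 0 confusion matrix of the first penalty in the loop, assuming the scikit-learn convention that rows are true labels and columns are predicted labels in sorted order (0 = fair/poor, 1 = good or better). The variable names are illustrative.

```python
# Confusion matrix copied from iteration 0 of the first penalty run above.
# Rows: true class (0, 1). Columns: predicted class (0, 1).
conf = [[2269, 4763],
        [1347, 37923]]

total = sum(sum(row) for row in conf)

# Fraction of all test rows predicted correctly
accuracy = (conf[0][0] + conf[1][1]) / total

# Accuracy of a trivial model that always predicts class 1
majority_baseline = sum(conf[1]) / total

# Share of each true class that the model recovers
recall_fair_poor = conf[0][0] / sum(conf[0])
recall_good = conf[1][1] / sum(conf[1])

print("accuracy:", accuracy)
print("majority baseline:", majority_baseline)
print("recall (fair/poor):", recall_fair_poor)
print("recall (good or better):", recall_good)
```

On these numbers the model is only about two percentage points better than always predicting "good or better", and it identifies roughly a third of the fair/poor respondents, which suggests the classes are imbalanced enough that per-class metrics are worth reporting alongside accuracy.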


In [55]:
classweight = (None, 'balanced')
for c in classweight:
    print ('\n\nNow testing weight:', c)
    
    lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=c) # get object

    iter_num = 0
    accuracy = 0

    # the indices are the rows used for training and testing in each iteration
    for train_indices, test_indices in cv_object: 
        X_train = X[train_indices]
        y_train = y[train_indices]

        X_test = X[test_indices]
        y_test = y[test_indices]

        # train the reusable logistic regression model on the training data
        lr_clf.fit(X_train,y_train)  # train object
        y_hat = lr_clf.predict(X_test) # get test set predictions

        # now let's get the accuracy and confusion matrix for this iteration of training/testing
        acc = mt.accuracy_score(y_test,y_hat)
        conf = mt.confusion_matrix(y_test,y_hat)
        print ("====Iteration",iter_num," ====")
        print ("accuracy", acc)
        # print("confusion matrix\n", conf)  # not shown to save space
        iter_num+=1
        accuracy = accuracy + acc

    print ('\nAverage accuracy: ', accuracy/iter_num)



Now testing weight: None
====Iteration 0  ====
accuracy 0.867759492031
====Iteration 1  ====
accuracy 0.867262753229
====Iteration 2  ====
accuracy 0.865707744806

Average accuracy:  0.866909996688


Now testing weight: balanced
====Iteration 0  ====
accuracy 0.764373029243
====Iteration 1  ====
accuracy 0.768238952961
====Iteration 2  ====
accuracy 0.766273595093

Average accuracy:  0.766295192432
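
The drop in accuracy with `class_weight='balanced'` is consistent with how that option reweights classes: scikit-learn documents the weight for each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to the class counts from one test split shown earlier (7,032 fair/poor and 39,270 good or better), on the assumption that the training splits have similar proportions. The function name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts taken from the iteration 0 confusion matrix rows (an
# illustrative proxy for the training distribution)
counts = {0: 7032, 1: 39270}
weights = balanced_class_weights(counts)
print(weights)
```

Under this weighting each class contributes the same total weight to the loss, so mistakes on the minority fair/poor class cost several times more than mistakes on the majority class. The model then trades some overall accuracy for better coverage of the minority class, which is why the confusion matrix is more informative than accuracy for comparing the two settings.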


### Couldn't Get Code to Work Past Here

In [41]:
# import k-fold cross validation from scikit learn
from sklearn.cross_validation import KFold

if 'health' in df:
    y = df['health'].values # get the labels we want
    del df['health'] # get rid of the class label
    X = df.values # use everything else to predict!
    
    
KFoldCrossObject = KFold(len(y), n_folds=10)

In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.preprocessing import StandardScaler

for train_indices, test_indices in KFoldCrossObject: 
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]

# scale attributes by the training set
scale = StandardScaler()
scale.fit(X_train) # find scalings for each column that make this zero mean and unit std


X_train_scaled = scale.transform(X_train) # apply to training
X_test_scaled = scale.transform(X_test) # apply those means and std to the test set (without snooping at the test set values)

# train the model just as before
logReg = LogisticRegression(penalty='l2', C=0.05, n_jobs=-1) 
logReg.fit(X_train_scaled,y_train)  # train object

y_hat = logReg.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf )

# sort these attributes and spit them out
# (df.columns matches X here because X was built from df.values; zip() has no .sort() in Python 3)
zip_vars = sorted(zip(logReg.coef_[0], df.columns), key=lambda t: abs(t[0])) # sort by the magnitude of the weight
for coef, name in zip_vars:
    print(name, 'has weight of', coef) # now print them out

ValueError: could not convert string to float: 'Too expensive'
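
The error above suggests that the feature matrix contained non-numeric values; `StandardScaler` needs a purely numeric array. The cell also fits the scaler and model after the fold loop has finished, so only the last fold would be evaluated. A minimal sketch of per-fold scaling and fitting is shown below on synthetic numeric data, using the newer `sklearn.model_selection` API. The `_demo` data, fold count and `random_state` are assumptions, not the notebook's actual inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features on very different scales
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4)) * [1.0, 10.0, 100.0, 1.0]
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 > 0).astype(int)

fold_accuracies = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X_demo):
    # Fit the scaler on the training fold only, so no test-fold
    # statistics leak into the model
    scaler = StandardScaler().fit(X_demo[train_idx])
    X_train_scaled = scaler.transform(X_demo[train_idx])
    X_test_scaled = scaler.transform(X_demo[test_idx])

    # L2 is the default penalty for LogisticRegression
    clf = LogisticRegression(C=0.05)
    clf.fit(X_train_scaled, y_demo[train_idx])

    y_hat = clf.predict(X_test_scaled)
    fold_accuracies.append(float(np.mean(y_hat == y_demo[test_idx])))

print("mean accuracy over folds:", np.mean(fold_accuracies))
```

Applied to this notebook, the same loop would run on the numeric `X` and `y` built from `df_reduced` rather than on the full `df`, which still holds non-numeric and missing columns.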

In [47]:
plt.figure(figsize=[15,4])

weights = pd.Series(logReg.coef_[0],index=df.columns) # df.columns matches the columns of X
weights.plot(kind='bar')
plt.show()

NameError: name 'logReg' is not defined
