# GA Data Science (DAT16) - Lab 11
## Pair Programming

### Heart Disease Dataset
ref: [https://archive.ics.uci.edu/ml/datasets/Heart+Disease](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

#### Features

    The dataset has 76 attributes in total; 14 of them are used:
    1. #3 (age)
    2. #4 (sex)
    3. #9 (cp)
    4. #10 (trestbps)
    5. #12 (chol)
    6. #16 (fbs)
    7. #19 (restecg)
    8. #32 (thalach)
    9. #38 (exang)
    10. #40 (oldpeak)
    11. #41 (slope)
    12. #44 (ca)
    13. #51 (thal)
    14. #58 (num) (the predicted attribute: 0 is healthy; 1, 2, 3, and 4 indicate heart disease)

### Class Exercise: Implement Random Forest

#### Import the dataset into a pandas dataframe:

Note: You'll have to add the column labels manually.

In [34]:
import pandas as pd

# The CSV has no header row, so supply the 14 column labels explicitly.
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
df = pd.read_csv('heart_disease.csv', names=columns, index_col=None)
df.head()

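|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|-----|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |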


#### Prepare and validate the data:

Investigate the data and check for missing values. We've used `.info()` before:

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null float64
sex         303 non-null float64
cp          303 non-null float64
trestbps    303 non-null float64
chol        303 non-null float64
fbs         303 non-null float64
restecg     303 non-null float64
thalach     303 non-null float64
exang       303 non-null float64
oldpeak     303 non-null float64
slope       303 non-null float64
ca          303 non-null object
thal        303 non-null object
num         303 non-null int64
dtypes: float64(11), int64(1), object(2)
memory usage: 35.5+ KB
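
Every column reports 303 non-null entries, yet `ca` and `thal` are stored as `object`, which suggests that some of their values are not numeric. Below is a minimal sketch of one way to locate such values. It assumes a small made-up frame named `sample` in place of `heart_disease.csv`; the rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative stand-in for df; the values are made up for this example.
sample = pd.DataFrame({
    'age': [63.0, 67.0, 53.0],
    'ca': ['0.0', '3.0', '?'],
    'thal': ['6.0', '?', '7.0'],
})

# Columns that pandas did not parse as numbers are the suspects.
suspect_cols = [c for c in sample.columns
                if not pd.api.types.is_numeric_dtype(sample[c])]

# Entries that cannot be parsed as numbers become NaN under errors='coerce',
# so the NaN positions point at the offending raw values.
bad_values = {}
for col in suspect_cols:
    parsed = pd.to_numeric(sample[col], errors='coerce')
    bad_values[col] = sample.loc[parsed.isna(), col].unique().tolist()

print(bad_values)
```

On the real data, the same loop over `df` should surface whatever placeholder the file uses for missing entries.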


#### Clean the data to ensure it can be used in a random forest algorithm

In [36]:
df.ca.value_counts()

0.0    176
1.0     65
2.0     38
3.0     20
?        4
dtype: int64

In [37]:
df.thal.value_counts()

3.0    166
7.0    117
6.0     18
?        2
dtype: int64

In [38]:
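# Drop the rows where 'ca' or 'thal' holds the missing-value marker '?'.
# .copy() gives an independent frame, so adding a column later does not
# raise SettingWithCopyWarning.
df2 = df[(df['ca'] != '?') & (df['thal'] != '?')].copy()
df2.thal.value_counts()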

3.0    164
7.0    115
6.0     18
dtype: int64
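
An alternative to filtering on the string `'?'` is to let pandas treat that marker as missing at load time, so that `ca` and `thal` parse as floats. The sketch below assumes a three-row stand-in for `heart_disease.csv`: the first two rows come from the `df.head()` output above, and the third is made up to show a missing `ca` value. The names `raw`, `sample`, and `clean` are illustrative.

```python
import io
import pandas as pd

columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Stand-in for the CSV file; the last row is invented for this example.
raw = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2\n"
    "50,0,3,120,200,0,0,160,0,0.0,1,?,3.0,0\n"
)

# na_values='?' converts the marker to NaN while the file is being parsed.
sample = pd.read_csv(raw, names=columns, na_values='?')

# Keep only the rows where both columns are present.
clean = sample.dropna(subset=['ca', 'thal'])

print(sample['ca'].dtype, len(sample), len(clean))
```

With this approach the missing entries also show up in `.info()` as reduced non-null counts, rather than hiding behind an `object` dtype.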

#### Select Features and convert Target to Boolean Class for Heart Disease (i.e., values 1, 2, 3 and 4 all indicate heart disease)

In [39]:
# Create a binary target column: 0 stays 0 (healthy); 1-4 all map to 1 (heart disease)
df2['target'] = df2['num'].map({0:0, 1:1, 2:1, 3:1, 4:1})
target = df2.target
target.head()

0    0
1    1
2    1
3    0
4    0
Name: target, dtype: int64
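
Because every non-zero value of `num` indicates heart disease, the same binary target can also be written as a comparison. The sketch below uses a short made-up series in place of `df2['num']` and checks that both forms agree.

```python
import pandas as pd

# Illustrative values standing in for df2['num'].
num = pd.Series([0, 2, 1, 0, 4, 3], name='num')

# Explicit mapping, as in the cell above.
target_map = num.map({0: 0, 1: 1, 2: 1, 3: 1, 4: 1})

# Equivalent comparison: anything greater than 0 counts as disease.
target_cmp = (num > 0).astype(int)

print(target_map.tolist() == target_cmp.tolist())
```

The explicit mapping has one advantage: an unexpected value in `num` becomes NaN instead of being silently counted as disease.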

In [40]:
df2.head()

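|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|-----|--------|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 | 0 |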


#### Build the model and score with cross-validation

In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

# Predictors: the 13 columns before 'num' (assumed; this matches the 13
# importances reported below). 'ca' and 'thal' were read as strings.
features = df2[columns[:-1]].astype(float)

In [46]:
rfc = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=3)

score = cross_val_score(rfc, features, target, cv=5)

print(score.mean())



0.811073446328
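
The mean alone hides how much the score varies between folds, and without a fixed `random_state` the number changes from run to run. The sketch below shows one way to make the evaluation repeatable. Since the CSV is not available here, it assumes synthetic data from `make_classification` with the same shape as the cleaned frame (297 rows, 13 features); `X`, `y`, `rfc_demo`, and `folds` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the heart disease features and target.
X, y = make_classification(n_samples=297, n_features=13,
                           n_informative=6, random_state=0)

# Same hyperparameters as above, plus a seed so the forest is reproducible.
rfc_demo = RandomForestClassifier(max_depth=5, n_estimators=10,
                                  max_features=3, random_state=0)

# Shuffled, stratified folds keep the class balance similar in each split.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(rfc_demo, X, y, cv=folds)

print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how stable the estimate is on a dataset this small.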


In [53]:
rfc.fit(features, target)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=3, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

#### How important are the various features?

In [54]:
print(rfc.feature_importances_)

[ 0.09546497  0.01068881  0.1462995   0.04645298  0.03888874  0.00712661
  0.01831161  0.13413866  0.04689816  0.12004335  0.01483112  0.13279744
  0.18805807]


#### Plot feature importances

In [57]:
# zip stops at the shorter sequence, so the 14th label ('num') is dropped.
imp = list(zip(columns, rfc.feature_importances_))
imp

[('age', 0.095464966051443459),
 ('sex', 0.010688805231364677),
 ('cp', 0.14629950205628345),
 ('trestbps', 0.04645298293701601),
 ('chol', 0.038888735069127246),
 ('fbs', 0.0071266115209164399),
 ('restecg', 0.018311613498569849),
 ('thalach', 0.13413865829354749),
 ('exang', 0.046898155751482984),
 ('oldpeak', 0.12004334587827101),
 ('slope', 0.014831117278118028),
 ('ca', 0.13279743893309554),
 ('thal', 0.18805806750076384)]
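
The pairs are easier to compare once they are sorted. A minimal sketch, assuming the importances printed above rounded to four decimals (`feature_names`, `importances`, and `ranked` are illustrative names):

```python
import pandas as pd

feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Importances copied from the output above, rounded to four decimals.
importances = [0.0955, 0.0107, 0.1463, 0.0465, 0.0389, 0.0071, 0.0183,
               0.1341, 0.0469, 0.1200, 0.0148, 0.1328, 0.1881]

# A labelled Series sorts the values and keeps each name attached.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print(ranked.head(5))
```

In this particular fit, `thal`, `cp`, `thalach`, `ca`, and `oldpeak` carry the most weight. A forest with only 10 trees and no fixed seed may rank them differently on another run.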

In [60]:
from bokeh.charts import Bar, show
p = Bar(rfc.feature_importances_, cat=list(columns))
show (p)

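ValueError: operands could not be broadcast together with shapes (14,) (13,) 

The call fails because `columns` holds 14 labels (it includes the target `num`) while `rfc.feature_importances_` has only 13 values. Passing `list(columns[:-1])` as the labels makes the two lengths match.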
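
Below is a sketch of the bar chart with matching lengths. It drops `num` from the labels so that 13 names pair with 13 importances. It also swaps in matplotlib for `bokeh.charts`, which was dropped from later Bokeh releases; that substitution is an assumption about what is installed, not the lab's prescribed tool. The importances are the values printed above, rounded to four decimals.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; omit this line inside a notebook
import matplotlib.pyplot as plt

columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
feature_names = columns[:-1]  # drop the target 'num', leaving 13 labels

# Importances copied from the notebook output, rounded to four decimals.
importances = [0.0955, 0.0107, 0.1463, 0.0465, 0.0389, 0.0071, 0.0183,
               0.1341, 0.0469, 0.1200, 0.0148, 0.1328, 0.1881]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(feature_names, importances)  # one bar per feature
ax.set_ylabel('Importance')
ax.set_title('Random forest feature importances')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
fig.tight_layout()
```

In a notebook the figure renders inline. In a script, `fig.savefig(...)` writes it to a file.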

#### Bonus: Repeat the classification with Support Vector Machine
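
One possible starting point for the bonus is sketched below. It assumes synthetic data from `make_classification` in place of the cleaned heart disease frame, and the kernel and `C` value are untuned defaults rather than recommended settings; `svm_demo` and `svm_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the cleaned data.
X, y = make_classification(n_samples=297, n_features=13,
                           n_informative=6, random_state=0)

# SVMs are sensitive to feature scale, so standardize inside the pipeline;
# the scaler is then refit on the training part of every fold.
svm_demo = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))

svm_scores = cross_val_score(svm_demo, X, y, cv=5)
print(svm_scores.mean())
```

Scaling matters more here than for the random forest, because columns such as `chol` and `oldpeak` live on very different ranges.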