## The Data: Pima Indian Classification

### Problem statement 
Use the UCI Pima Indians Diabetes dataset to predict whether a person has diabetes from the medical attributes provided. The target is the `Outcome` column (column index 8).

### Assumptions

- The data is sufficient to split and still predict reliably whether a patient has diabetes, even though the dataset has only 768 data points.
- These attributes alone are enough to diagnose the condition.

### Similar Problems

This closely resembles other common two-class classification problems, such as classifying mail as spam or ham based on the contents of the email. The attributes there would be strings rather than numbers as in this dataset, so at least some of the features would have to be processed differently.

In [47]:
import pandas as pd
%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [48]:
df = pd.read_csv("diabetes.csv")

In [49]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [50]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [51]:
X=df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]
y=df['Outcome']

In [52]:
len(df)

768
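
The rows printed by `df.head()` already contain zeros for `SkinThickness` and `Insulin`, which are not plausible measurements. A minimal sketch for counting such zeros per column follows. It assumes (the notebook does not say so) that a zero in these columns means "not recorded". The small frame `sample` is hypothetical scaffolding that re-types the five printed rows so the snippet runs on its own.

```python
import pandas as pd

# The five rows printed by df.head(), re-typed so the sketch is self-contained.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full data the same expression would be applied to the corresponding columns of `df`. If the counts turn out to be large, the second assumption above (that the attributes are enough) deserves a closer look.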

## Train/Test Split

In [53]:
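from sklearn.model_selection import train_test_split
# Hold out 10% of the rows as a test set. The model searches below are fit
# on the full X and y, so this split is not used later in the notebook.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)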
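
A held-out split only pays off if the model is fit on the training part and scored on the test part. The sketch below shows one way to do that. It is not what the notebook does: it uses a synthetic stand-in (`X_demo`, `y_demo`) of the same shape as `X` and `y`, and the `stratify` and `random_state` arguments are additions chosen here for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the notebook's data (768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# Fit on the training part only, then score on rows the model has never seen.
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=0)
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(len(X_tr), len(X_te), round(test_accuracy, 3))
```

With 768 rows and `test_size=0.1`, this leaves 691 rows for training and 77 for testing.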

## Decision Tree

In [54]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

grid=GridSearchCV(DecisionTreeClassifier(),
                  param_grid={'max_depth':range(1,20),'min_samples_leaf':range(20,100),'min_samples_split':[20]},
                  cv=6,
                  scoring='accuracy',
                  n_jobs=-1,
                  verbose=True)


In [55]:
grid.fit(X,y)

Fitting 6 folds for each of 1520 candidates, totalling 9120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 203 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 2903 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 7403 tasks      | elapsed:   19.6s
[Parallel(n_jobs=-1)]: Done 9120 out of 9120 | elapsed:   23.1s finished


GridSearchCV(cv=6, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'max_depth': range(1, 20), 'min_samples_leaf': range(20, 100), 'min_samples_split': [20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=True)

In [56]:
tree_final = grid.best_estimator_
tree_final

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=20, min_samples_split=20,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [57]:
grid.best_params_

{'max_depth': 5, 'min_samples_leaf': 20, 'min_samples_split': 20}

In [58]:
grid.best_score_

0.7682291666666666
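
The log line "1520 candidates, totalling 9120 fits" follows directly from the size of the parameter grid. The short sketch below reproduces the arithmetic with plain Python. It only counts combinations and does not fit any model.

```python
from itertools import product

# Same grid as passed to GridSearchCV above.
param_grid = {
    "max_depth": range(1, 20),          # 19 values
    "min_samples_leaf": range(20, 100), # 80 values
    "min_samples_split": [20],          # 1 value
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
total_fits = n_candidates * 6  # cv=6 means six fits per candidate
print(n_candidates, total_fits)  # 1520 9120
```

One observation on the grid itself: a split must leave at least `min_samples_leaf` samples on each side, so with `min_samples_leaf` of 20 or more a node needs at least 40 samples before it can split. The fixed `min_samples_split=20` therefore should never be the binding constraint in this search.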

## KNeighbors

In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

grid=GridSearchCV(KNeighborsClassifier(),
                  param_grid={'n_neighbors':range(1,100),'weights':['distance','uniform']},
                  cv=5,
                  scoring='accuracy',
                  n_jobs=-1,
                  verbose=True)

In [60]:
grid.fit(X,y)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 198 candidates, totalling 990 fits


[Parallel(n_jobs=-1)]: Done 416 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 990 out of 990 | elapsed:    9.6s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_neighbors': range(1, 100), 'weights': ['distance', 'uniform']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=True)

In [61]:
knn_final = grid.best_estimator_
knn_final

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=43, p=2,
           weights='distance')

In [62]:
grid.best_params_

{'n_neighbors': 43, 'weights': 'distance'}

In [63]:
grid.best_score_

0.7591145833333334
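
The search picked `weights='distance'` over `'uniform'`. To illustrate the difference, here is a minimal sketch of the two voting rules in plain Python. The function `knn_vote` and the neighbor list are hypothetical and only mimic the idea: with uniform weights each neighbor counts once, and with distance weights each neighbor counts in proportion to the inverse of its distance.

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Return the winning label for a list of (distance, label) pairs.

    Assumes every distance is greater than zero.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        # 'uniform': each neighbor adds 1; 'distance': each adds 1 / distance.
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(totals, key=totals.get)

# Hypothetical neighbors: one very close positive, two farther negatives.
neighbors = [(0.5, 1), (2.0, 0), (2.5, 0)]
print(knn_vote(neighbors, "uniform"))   # 0: two votes against one
print(knn_vote(neighbors, "distance"))  # 1: weight 2.0 against 0.5 + 0.4
```

With a neighborhood as large as 43, distance weighting lets the closest points dominate, which may be why it was preferred here.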

## SVM

In [65]:
from sklearn.svm import SVC

grid=GridSearchCV(SVC(),
                  param_grid={'kernel':['poly'],'C':range(1,10001,500),'degree':range(2,5)},
                  cv=5,
                  scoring='accuracy',
                  n_jobs=-1,
                  verbose=True)

grid.fit(X,y)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


KeyboardInterrupt: 

In [None]:
svm_final = grid.best_estimator_
svm_final

In [None]:
grid.best_params_

In [None]:
grid.best_score_
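
The SVM search above was interrupted, so the three cells after it have no output. One plausible reason, stated here as an assumption, is that a polynomial kernel on unscaled features with `C` values up to 9501 converges slowly. The sketch below shows a variant that might finish faster: features are standardized inside a pipeline and the grid is much smaller. It runs on synthetic stand-in data, and the `C` and `degree` values are illustrative choices, not the author's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling inside the pipeline is refit on each training fold,
# so the validation fold never leaks into the scaler.
pipe = make_pipeline(StandardScaler(), SVC(kernel="poly"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
svm_grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__degree": [2, 3]},
    cv=5,
    scoring="accuracy",
)
svm_grid.fit(X_demo, y_demo)
print(svm_grid.best_params_, round(svm_grid.best_score_, 3))
```

If the search completes, `svm_grid.best_estimator_` plays the role that `svm_final` was meant to play above.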

## Ensemble

### Voting Classifier

In [66]:
from sklearn.ensemble import VotingClassifier

clf = VotingClassifier(estimators=[('dectree',tree_final),('knn',knn_final)])

In [67]:
from sklearn.model_selection import cross_val_score
cross_val_score(clf,X,y,cv=5,scoring='accuracy').mean()

0.7487394957983193
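
The voting ensemble scores slightly below both of its members. A possible explanation lies in how hard voting handles ties: `VotingClassifier` defaults to `voting='hard'`, and scikit-learn documents that ties are resolved by ascending class order. With only two voters, every disagreement is a tie. The sketch below imitates that rule in plain Python. `hard_vote` is a hypothetical helper, not the library's code.

```python
def hard_vote(predictions):
    """Majority vote over class labels; ties go to the smallest label."""
    counts = {}
    for label in predictions:
        counts[label] = counts.get(label, 0) + 1
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)

# Two voters, as in the ensemble above (tree, knn).
for tree_pred, knn_pred in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(tree_pred, knn_pred, "->", hard_vote([tree_pred, knn_pred]))
```

Under this rule the ensemble predicts class 1 only when both models agree on it. Adding a third estimator or passing `voting='soft'` would avoid such ties.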

### Bagging Classifier

In [68]:
from sklearn.ensemble import BaggingClassifier

# Note: scikit-learn 1.2 and later name this parameter `estimator` instead of `base_estimator`.
clf = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=4),n_estimators=100,oob_score=True)

In [69]:
clf.fit(X,y)

BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=None, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

In [70]:
from sklearn.model_selection import cross_val_score
cross_val_score(clf,X,y,scoring='accuracy').mean()



0.7148052249434146

In [71]:
clf.oob_score_

0.7083333333333334
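
The out-of-bag score works because each bootstrap sample leaves some rows out, and those rows can serve as a validation set for that estimator. As a rough worked example, the chance that a given row is never drawn in a bootstrap sample of size n is (1 - 1/n)^n. The sketch assumes the defaults shown in the repr above (`bootstrap=True`, `max_samples=1.0`).

```python
n = 768  # number of rows, from len(df)

# Each of the n draws misses a given row with probability (1 - 1/n).
p_out = (1 - 1 / n) ** n
expected_oob_rows = n * p_out
print(round(p_out, 3), round(expected_oob_rows))
```

So each of the 100 estimators is scored on roughly 37% of the rows it did not see, which is why `oob_score_` can stand in for a cross-validated estimate.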

### Random Forest

In [72]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=4,min_samples_leaf=20,n_estimators=500,n_jobs=-1)

In [73]:
cross_val_score(clf,X,y,cv =5,scoring='accuracy').mean()

0.7604872251931075
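
To compare the models side by side, the accuracies recorded in the outputs above can be collected and ranked. The sketch below only re-types those numbers. The comparison is approximate: the scores come from different fold counts, and the two grid-search scores are the best of many candidates, so they are likely somewhat optimistic.

```python
# Accuracies copied from the cell outputs above.
scores = {
    "decision tree (grid search, cv=6)": 0.7682291666666666,
    "k-nearest neighbors (grid search, cv=5)": 0.7591145833333334,
    "voting: tree + knn (cv=5)": 0.7487394957983193,
    "bagged knn (default cv)": 0.7148052249434146,
    "random forest (cv=5)": 0.7604872251931075,
}

# Sort from highest to lowest accuracy.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{accuracy:.3f}  {name}")
```

The top three results lie within about one percentage point of each other, so the ranking among them should not be read as decisive.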