# Test Your Skills

*Author: Francesco Mosconi*

*Copyright &copy; 2017 CATALIT LLC*

## Exercise

The [Pima Indians dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) is a well-known dataset distributed by UCI and originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains clinical exam data for women aged 21 and above of Pima Indian heritage. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

It has the following features:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)

The last column, `Outcome`, is the target. It is a binary variable.

In this exercise we will explore it through the following steps:

1. Load the `../../../data/diabetes.csv` dataset and use pandas to explore the range of each feature.
2. Inspect the features using `sns.pairplot`. Can you guess which features are more predictive?
3. Look at the scale of each feature and rescale all of them using the `StandardScaler` from scikit-learn.
4. Prepare your final `X` and `y` variables to be used by an ML model. Make sure you define your target variable well.
5. Split your data into train and test sets with a test size of 20% and `random_state=22`.
6. Define a sequential model with at least one inner layer. You will have to make choices for the following things:
    - what is the size of the input?
    - how many nodes will you use in each layer?
    - what is the size of the output?
    - what activation functions will you use in the inner layers?
    - what activation function will you use at the output?
    - what loss function will you use?
    - what optimizer will you use?
7. Fit your model on the training set, using a `validation_split` of 0.1.
8. Test your trained model on the test data from the train/test split.
9. Check the accuracy score, the confusion matrix and the classification report.
10. Compare your neural network model with another model from scikit-learn.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('../../../data/diabetes.csv')
df.head()
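
The first step asks for the range of each feature, which `df.head()` alone does not show. A minimal sketch of one way to get it with `describe()` follows. The small `toy` frame and its values are made up for illustration and stand in for `df`. Zeros in columns such as `Glucose` or `BMI` are physiologically implausible and are commonly treated as missing measurements in this dataset, so counting them is a useful extra check.

```python
import pandas as pd

# Hypothetical toy rows standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Age": [50, 31, 32, 21],
})

# Min and max per column give the range of each feature.
feature_range = toy.describe().loc[["min", "max"]]
print(feature_range)

# Count zeros per column to spot likely missing measurements.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the real data the same two calls would be applied to `df` instead of `toy`.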

In [None]:
import seaborn as sns

In [None]:
sns.pairplot(df, hue='Outcome')

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
from keras.utils import to_categorical

In [None]:
sc = StandardScaler()
X = sc.fit_transform(df.drop('Outcome', axis=1))
y = df['Outcome'].values
y_cat = to_categorical(y)
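
To make the rescaling step concrete, here is a sketch of what standardization computes, written with plain NumPy rather than `StandardScaler`. The array `X_raw` is a hypothetical toy matrix, not the diabetes data.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on different scales.
X_raw = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)  # population std (ddof=0), as StandardScaler uses
X_scaled = (X_raw - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

The solution above fits the scaler on the full dataset before splitting, as the exercise asks. A stricter variant would compute `mean` and `std` on the training rows only and reuse them for the test rows, so that no test information enters the preprocessing.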

In [None]:
X.shape

In [None]:
y_cat.shape
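
`to_categorical` turns the 0/1 labels into one column per class, which is why `y_cat` has two columns. The sketch below reproduces that encoding with NumPy instead of Keras, on hypothetical toy labels, and shows that `np.argmax` recovers the original classes.

```python
import numpy as np

# Hypothetical binary labels standing in for y.
y_toy = np.array([0, 1, 1, 0])

# One-hot encode: row i of the identity matrix represents class i.
y_toy_cat = np.eye(2)[y_toy]
print(y_toy_cat.shape)

# argmax along axis 1 recovers the original class labels.
recovered = np.argmax(y_toy_cat, axis=1)
print(recovered)
```

A two-column target pairs naturally with a two-unit softmax output and `categorical_crossentropy`. A common alternative for binary problems is to keep `y` as a single 0/1 column and use one sigmoid output unit with `binary_crossentropy`.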

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_cat,
                                                    random_state=22,
                                                    test_size=0.2)

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

In [None]:
model = Sequential()
model.add(Dense(32, input_shape=(8,), activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(Adam(learning_rate=0.05),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
model.fit(X_train, y_train, epochs=20, verbose=2, validation_split=0.1)

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_test_class = np.argmax(y_test, axis=1)
y_pred_class = np.argmax(y_pred, axis=1)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
pd.Series(y_test_class).value_counts() / len(y_test_class)
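
The class proportions matter because they set the baseline: a model that always predicts the most frequent class reaches an accuracy equal to that class's share of the data. The sketch below illustrates this on hypothetical labels, not on the actual test set.

```python
import numpy as np

# Hypothetical test labels; values are illustrative only.
labels = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Fraction of samples in each class.
counts = np.bincount(labels)
proportions = counts / len(labels)

# Always predicting the most frequent class achieves this accuracy.
baseline_accuracy = proportions.max()
print(baseline_accuracy)
```

Any accuracy score reported for a trained model is best read against this baseline.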

In [None]:
accuracy_score(y_test_class, y_pred_class)

In [None]:
print(classification_report(y_test_class, y_pred_class))

In [None]:
confusion_matrix(y_test_class, y_pred_class)
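
To connect the confusion matrix to the numbers in the classification report, the sketch below builds a 2x2 matrix by hand and derives accuracy, precision and recall from it. The `true` and `pred` arrays are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical true and predicted labels; values are illustrative only.
true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Build the 2x2 matrix with rows = true class, columns = predicted class,
# the same layout sklearn's confusion_matrix uses.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(true, pred):
    cm[t, p] += 1

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many were found
print(cm, accuracy, precision, recall)
```

For a diagnostic task such as this one, recall on the positive class is often worth watching alongside accuracy, since a false negative is a missed case of diabetes.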

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

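# Fit each scikit-learn model with its default settings and evaluate it on the same test split.
for mod in [RandomForestClassifier(), SVC(), GaussianNB()]:
    # Column 1 of the one-hot target is the original 0/1 label.
    mod.fit(X_train, y_train[:, 1])
    y_pred = mod.predict(X_test)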
    print("="*80)
    print(mod)
    print("-"*80)
    print("Accuracy score: {:0.3}".format(accuracy_score(y_test_class,
                                                         y_pred)))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test_class, y_pred))
    print()