# Machine Learning Training
Training lets specific data teach a machine learning algorithm to create a specific prediction model.

- New data => better predictions
- Verify the trained model's performance with new data it has not seen.

## Training Overview

1. Split Data
	- Prepared Data
		- Training (70%)
		- Testing (30%)
2. Train Model
	- Train Model
		1. Algorithm
		2. Model
3. Evaluate Model


## Dependencies

- The *Scikit-learn* library is designed to work with:
	1. NumPy
	2. SciPy
	3. Pandas


## Toolset for training and evaluation tasks

- Data splitting
- Pre-processing
- Feature selection
- Model tuning

Scikit-learn exposes a common interface across algorithms.
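As a rough illustration of what a common interface means in practice, the sketch below trains two different classifiers with identical `fit`, `predict`, and `score` calls. The synthetic data from `make_classification` and the choice of `LogisticRegression` as the second algorithm are assumptions made for this example; neither comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: not the Pima data set)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Every estimator exposes the same fit / predict / score methods,
# so swapping algorithms does not change the surrounding code.
models = [GaussianNB(), LogisticRegression(max_iter=1000)]
scores = {}
for model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[type(model).__name__] = model.score(X_test, y_test)
    print(type(model).__name__, round(scores[type(model).__name__], 3))
```

Because only the constructor call differs, trying another algorithm is usually a one-line change.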

	

In [1]:
import pandas 
import matplotlib.pyplot
import numpy

# do plotting inline instead of in a separate window
%matplotlib inline

# load the Pima diabetes data set
df = pandas.read_csv("./data/pima-data.csv")
df.shape

(768, 10)

## Splitting the data
Split the data 70% for training and 30% for testing.

In [28]:
from sklearn.model_selection import train_test_split

feature_col_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness', 'insulin','bmi', 'diab_pred', 'age']
predicted_class_names = ['diabetes']

X = df[feature_col_names].values # predictor feature columns (m x 8)
Y = df[predicted_class_names].values # predicted class (1=True, 0=False) column (m x 1)
split_test_size = 0.3

# test_size=0.3 holds out 30% of the rows for testing;
# random_state=42 fixes the random seed so the split is repeatable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=split_test_size, random_state=42)

We check to ensure we have the desired 70% train, 30% test split of the data.

In [3]:
from __future__ import division
print("training set: \t {0:0.2f}%".format(len(X_train)/len(df.index) * 100))
print("testing set: \t {0:0.2f}%".format(len(X_test)/len(df.index) * 100))

training set: 	 69.92%
testing set: 	 30.08%


Verify that the predicted class values are distributed in the split data the same way as in the original data.

In [6]:
print("Original True  : {0} ({1:0.2f}%)".format(len(df.loc[df['diabetes'] == 1]), (len(df.loc[df['diabetes'] == 1])/len(df.index)) * 100.0))
print("Original False : {0} ({1:0.2f}%)".format(len(df.loc[df['diabetes'] == 0]), (len(df.loc[df['diabetes'] == 0])/len(df.index)) * 100.0))
print("")
print("Training True  : {0} ({1:0.2f}%)".format(len(Y_train[Y_train[:] == 1]), (len(Y_train[Y_train[:] == 1])/len(Y_train) * 100.0)))
print("Training False : {0} ({1:0.2f}%)".format(len(Y_train[Y_train[:] == 0]), (len(Y_train[Y_train[:] == 0])/len(Y_train) * 100.0)))
print("")
print("Test True      : {0} ({1:0.2f}%)".format(len(Y_test[Y_test[:] == 1]), (len(Y_test[Y_test[:] == 1])/len(Y_test) * 100.0)))
print("Test False     : {0} ({1:0.2f}%)".format(len(Y_test[Y_test[:] == 0]), (len(Y_test[Y_test[:] == 0])/len(Y_test) * 100.0)))


Original True  : 268 (34.90%)
Original False : 500 (65.10%)

Training True  : 188 (35.01%)
Training False : 349 (64.99%)

Test True      : 80 (34.63%)
Test False     : 151 (65.37%)
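A plain random split does not guarantee that the class proportions match this closely. One option, if they drift apart, is the `stratify` parameter of `train_test_split`, which keeps the class ratio the same in both subsets. A minimal sketch, assuming a label vector with the same 268/500 class counts as above and a single placeholder feature column (the real feature data is not used here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the original data: 268 True, 500 False
y = np.array([1] * 268 + [0] * 500)
# Placeholder features (assumption: one dummy column, for illustration only)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks the splitter to preserve the True/False ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

for name, labels in [("Original", y), ("Training", y_train), ("Test", y_test)]:
    print("{0:<8} True: {1:0.2f}%".format(name, labels.mean() * 100))
```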


### Post-split Data Preparation


#### Hidden Missing Values

Some columns use 0 to stand for a missing reading. Count the rows affected in each column.



In [7]:
print("# rows in dataframe {0}".format(len(df)))
print("# rows missing glucose_conc: {0}".format(len(df.loc[df['glucose_conc'] == 0])))
print("# rows missing diastolic_bp: {0}".format(len(df.loc[df['diastolic_bp'] == 0])))
print("# rows missing thickness: {0}".format(len(df.loc[df['thickness'] == 0])))
print("# rows missing insulin: {0}".format(len(df.loc[df['insulin'] == 0])))
print("# rows missing bmi: {0}".format(len(df.loc[df['bmi'] == 0])))
print("# rows missing diab_pred: {0}".format(len(df.loc[df['diab_pred'] == 0])))
print("# rows missing age: {0}".format(len(df.loc[df['age'] == 0])))

# rows in dataframe 768
# rows missing glucose_conc: 5
# rows missing diastolic_bp: 35
# rows missing thickness: 227
# rows missing insulin: 374
# rows missing bmi: 11
# rows missing diab_pred: 0
# rows missing age: 0
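The per-column counts can also be produced in one pass by comparing the whole frame to 0 and summing. A small sketch, assuming a made-up three-row frame that reuses two of the column names above (the values are invented for illustration):

```python
import pandas

# Tiny invented frame (assumption: values are illustrative, not real readings)
sample = pandas.DataFrame({
    "glucose_conc": [148, 0, 183],
    "insulin": [0, 0, 94],
})

# (sample == 0) gives a boolean frame; sum() counts the True cells per column
zero_counts = (sample == 0).sum()
print(zero_counts)
```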


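Impute the missing values with the column mean. The imputer is fitted on the training data only, and the same means are then applied to the test data.

In [21]:
from sklearn.impute import SimpleImputer as Imputer

# Impute all 0 readings with the column mean
fill_0 = Imputer(missing_values=0, strategy="mean")

# learn the column means from the training data, then reuse them for the test data
X_train = fill_0.fit_transform(X_train)
X_test = fill_0.transform(X_test)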
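To see what mean imputation does to the zeros, here is a small sketch on invented numbers. The imputer learns each column's mean from the non-zero training values and reuses those means on the test rows. The arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented data (assumption): 0 marks a missing reading
train = np.array([[1.0, 0.0],
                  [3.0, 4.0],
                  [0.0, 8.0]])
test = np.array([[0.0, 0.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")

# fit_transform learns the column means (2.0 and 6.0) from the non-zero training values
train_filled = imputer.fit_transform(train)
# transform reuses the training means instead of computing new ones from the test rows
test_filled = imputer.transform(test)

print(train_filled)
print(test_filled)
```

Reusing the training means keeps information from the test set out of the preparation step, so the test set still behaves like unseen data.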


## Training Initial Algorithm - Naive Bayes

In [29]:
from sklearn.naive_bayes import GaussianNB

# create Gaussian Naive Bayes model object and train it with the data
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train.ravel())

GaussianNB(priors=None, var_smoothing=1e-09)
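Once a model is trained, the Evaluate Model step from the overview can be sketched by comparing accuracy on the training data with accuracy on the held-out test data, here via `sklearn.metrics.accuracy_score`. The sketch assumes synthetic data from `make_classification` rather than the Pima data, so its numbers say nothing about the diabetes model.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: not the Pima data set)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Accuracy on the data the model was trained on
train_accuracy = metrics.accuracy_score(y_train, nb_model.predict(X_train))
# Accuracy on data the model has never seen
test_accuracy = metrics.accuracy_score(y_test, nb_model.predict(X_test))

print("Training accuracy: {0:0.4f}".format(train_accuracy))
print("Test accuracy:     {0:0.4f}".format(test_accuracy))
```

A training accuracy far above the test accuracy usually suggests the model has fitted the training data too closely.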