# Summary and Illustration of Machine Learning Algorithms

This document succinctly illustrates various machine learning algorithms. The Table of Contents below indicates its scope.

This is a living document: it will change over time to improve its readability and to correct any typos or errors discovered.

To avoid redundancy and to keep notebook cells small, the cells are not self-contained. For example, data is read in one cell and used in many of the cells that follow. If you skip the cell that loads the data into variables, the cells that depend on it will fail. It is best to execute the cells in sequence unless you can work out the data dependencies and execute only the relevant cells.

# Table of Contents

1. [Pima Indians Diabetes Dataset](#pima_indians_diabetes_dataset)
2. [Data Summaries](#data_summaries)
3. [Data Visualization with Pandas](#data_visualization_with_pandas)
4. [Data Cleaning](#data_cleaning)
5. [Resampling Methods](#resampling_methods)
6. [Classification Algorithms](#classification_algorithms)
   1. [Classification/Decision Trees](#classification_decision_trees)
   2. [Naive Bayes](#naive_bayes)
   3. [K-Nearest Neighbors](#k_nearest_neighbors)
   4. [Linear Discriminant Analysis](#linear_discriminant_analysis)
   5. [Quadratic Discriminant Analysis](#quadratic_discriminant_analysis)
   6. [Logistic Regression](#logistic_regression)
   7. [Perceptron](#perceptron)
   8. [Support Vector Machines](#support_vector_machines)
   9. [Comparing Classification Algorithms](#comparing_classification_algorithms)
   10. [Metrics for Evaluating Classification Algorithms](#metrics-for-evaluating-classification-algorithms)
7. [Regression Algorithms](#regression_algorithms)
   1. [Regression Trees](#regression_trees)
   1. [K-Nearest Neighbors](#k_nearest_neighbors)
   1. [LassoLars](#lassolars)
   1. [Linear](#linear)
   1. [Ridge](#ridge)
   1. [Lasso](#lasso)
   1. [Support Vector Machines](#support_vector_machines)
   1. [Least Angle](#least_angle)
   1. [ElasticNet](#elasticnet)
   1. [Metrics for Evaluating Regression Algorithms](#metrics_for_evaluating_regression_algorithms)
8. [Creating Machine Learning Algorithm Pipelines](#algorithm_pipelines)
9. [Improving Model Performance](#improving_model_performance)
10. [Ensemble Models for Regression](#ensemble_models_for_regression)
11. [Ensemble Models for Classification](#ensemble_models_for_classification)
12. [Saving and Retrieving Models](#saving_and_retrieving_models)

# Pima Indians Diabetes Dataset <a name="pima_indians_diabetes_dataset"></a>

We use the [diabetes.csv](diabetes.csv) dataset to illustrate some machine learning algorithms.

**Predictor variables**:

    - Pregnancies: Number of times pregnant
    
    - Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    
    - BloodPressure: Diastolic blood pressure (mm Hg)
    
    - SkinThickness: Triceps skin fold thickness (mm)
    
    - Insulin: 2-Hour serum insulin (mu U/ml)
    
    - BMI: Body mass index (weight in kg/(height in m)^2)
    
    - DiabetesPedigreeFunction: Diabetes pedigree function
    
    - Age: Age in years

**Target variable**: 

    - Outcome: Class variable (0 or 1)

In [None]:
# pandas for data manipulation
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy
import scipy
import seaborn
from pandas.plotting import scatter_matrix

# Data Summaries <a name="data_summaries"></a>

In [None]:
# read Pima Indians diabetes data
myNames = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

data = pd.read_csv('diabetes.csv', skiprows=1, names=myNames)

In [None]:
# data information (info() prints its summary and returns None)
data.info()

In [None]:
# explore first 20 data instances
print(data.head(20))

In [None]:
# data types of predictor and target variables
types = data.dtypes

# print data types
print(types)

In [None]:
# data dimensions
shape = data.shape

print(shape)

In [None]:
# class counts
print(data.groupby('Outcome').size())

In [None]:
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)

# statistical summary of variables
print(data.describe())

In [None]:
# compute pairwise Spearman correlations
cor = data.corr(method='spearman')

# print correlation coefficients
print(cor)

In [None]:
# compute skew for each attribute
print(data.skew())

# Data Visualization with Pandas <a name="data_visualization_with_pandas"></a>


In [None]:
# box and whisker plots
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()

In [None]:
# correlation matrix plot
correlations = data.corr()

# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
plt.show()

In [None]:
# correlation matrix plot
correlations = data.corr()

# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(myNames)
ax.set_yticklabels(myNames)
plt.show()

In [None]:
# correlation matrix plot

matplotlib.style.use('ggplot')
correlations = data.corr()
seaborn.heatmap(correlations)
plt.show()

In [None]:
# univariate density plots
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

In [None]:
# univariate histograms
data.hist()
plt.show()

In [None]:
# scatterplot matrix
pd.plotting.scatter_matrix(data)
plt.show()

# Data Cleaning <a name="data_cleaning"></a>

In [None]:
# data type conversion

# before conversion
print(data.dtypes)

# convert dataframe to float
data = data.astype(float)

# print two blank lines
print('\n' * 2)

# after conversion
print(data.dtypes)

In [None]:
# breast-cancer-wisconsin.csv - a breast cancer dataset
myNames = ['Code', 'Clump-Thickness', 'Cell-Size', 'Cell-Shape', 'Adhesion', 'Single-Cell-Size', 'Bare-Nuclei', 'Chromatin', 'Nucleoli', 'Mitoses', 'Class']
data2 = pd.read_csv('breast-cancer-wisconsin.csv', names=myNames)

# dataset size
print(data2.shape)

In [None]:
# replace missing data (coded as ?) with NaN
data2[['Bare-Nuclei']] = data2[['Bare-Nuclei']].replace('?', numpy.nan)

# drop rows that have missing data
data2.dropna(axis=0, how='any', inplace=True)

# dataset size
print(data2.shape)

In [None]:
# delete a column

# size of the dataset before deletion of a column
print(data2.shape)

# delete the column named Adhesion
data2.drop('Adhesion', axis=1, inplace=True)

# size of the dataset after deletion of a column
print(data2.shape)

print(data2.head(20))

In [None]:
# feature selection with extra trees classifier

from sklearn.ensemble import ExtraTreesClassifier

print(data.head(4))
print('\n')

# convert dataframe to an array
array = data.values

# predictor variables - first eight columns (0 through 7)
X = array[:,0:8]

print(X)
print('\n')

# target variable
Y = array[:,8]

print(Y)
print('\n')

# feature extraction using extra trees classifier
model = ExtraTreesClassifier()
model.fit(X, Y)

# important features
print(model.feature_importances_)

In [None]:
# identify features with low variance

from sklearn.feature_selection import VarianceThreshold

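# feature selection: drop features whose variance is below that of a
# Bernoulli variable with p = 0.8, i.e. p * (1 - p) = 0.16
threshold = 0.8 * (1 - 0.8)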

test = VarianceThreshold(threshold)
fit = test.fit(X)

print(fit.variances_)
print('\n')

features = fit.transform(X)
print(features)

In [None]:
# feature selection with Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

# feature extraction
pca = PCA(n_components=3)
pcaModel = pca.fit(X)

# summarize PCA components
print("Explained variance: %s" % pcaModel.explained_variance_ratio_)
print('\n')

print(pcaModel.components_)

In [None]:
# recursive feature elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# recursively eliminate features until three remain
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)

fit = rfe.fit(X, Y)

print("Number of features: %d" % fit.n_features_)
print('\n')

print("Selected features: %s" % fit.support_)
print('\n')

print("Feature ranking: %s" % fit.ranking_)


In [None]:
# feature extraction with univariate statistical tests (Chi-squared for classification)

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
print('\n')

features = fit.transform(X)

# summarize selected features
print(features[0:5,:])


In [None]:
# binarization

from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)

binaryX = binarizer.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

In [None]:
# Box-Cox transform

from scipy.stats import boxcox

X_boxcox = boxcox(1+X[:,2])[0]
print(X_boxcox)

### Sonar Dataset

This dataset has 208 instances and 60 variables. Based on the sonar signal characteristics, the goal is to predict whether the object is a rock (R) or a mine (M).

In [None]:
# convert a string class label to an integer

from sklearn.preprocessing import LabelEncoder

# sonar dataset
data3 = pd.read_csv('sonar.csv', header=None)

# data3 dimensions
print(data3.shape)
print('\n')

# explore first 5  data3 instances
print(data3.head(5))
print('\n')

# explore last 5  data3 instances
print(data3.tail(5))
print('\n')

# dataframe to array
array = data3.values

# class labels: all rows, last column (index 60)
y = array[:, 60]

encoder = LabelEncoder()
encoder.fit(y)
print(encoder.classes_)
print('\n')

encoded_y = encoder.transform(y)
print(encoded_y)

In [None]:
# impute missing values with mean attribute values

from sklearn.impute import SimpleImputer

# treat zeros as missing values
X[X == 0] = numpy.nan

print(X)
print('\n')

# replace missing values (NaN) with mean values of the variable
imputer = SimpleImputer(missing_values=numpy.nan, strategy='mean')

# perform imputation
imputedX = imputer.fit_transform(X)
print(imputedX)

In [None]:
# normalize data (rescale each row to a length of 1)

from sklearn.preprocessing import Normalizer

# restore original data - read Pima Indians diabetes data
myNames = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv('diabetes.csv', skiprows=1, names=myNames)

# dataframe to array
array = data.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

In [None]:
# rescale data (between 0 and 1) - pass a different feature_range to use other bounds

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

In [None]:
# standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

# Resampling Methods <a name="resampling_methods"></a>

- Train/Test Split

- Shuffle Split

- Cross Validation (CV)

- Leave One Out Cross Validation (LOOCV)
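
The schemes listed above differ mainly in how they choose the test indices. As a rough, library-free sketch (the helper `kfold_indices` is hypothetical and is not part of scikit-learn), K-fold partitioning can be written as follows; LOOCV is the special case in which the number of folds equals the number of samples.

```python
import random

def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    if seed is not None:
        # optional shuffling, made reproducible by the seed
        random.Random(seed).shuffle(indices)
    # the first (n_samples % n_splits) folds receive one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=3, seed=7))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))

# leave-one-out: as many folds as samples, so every test fold has one sample
loo = list(kfold_indices(n_samples=5, n_splits=5))
```

Every sample is used for testing exactly once, which is what distinguishes cross validation from a single train/test split.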

In [None]:
# evaluate a model by splitting data into training and test sets
# Train/Test Split

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# divide the dataset into training (67%) and test (33%) sets
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression()

# build the model using the training data set
model.fit(X_train, Y_train)

# evaluate the model using test dataset
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))


In [None]:
# shuffle split

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# ten random splits: 67% of data used for building the model and 33% for testing
kfold = model_selection.ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f)" % (results.mean()*100.0, results.std()*100.0))

In [None]:
# Cross Validation (CV)

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# 10 folds; shuffle=True is needed for random_state to have an effect
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)

print("Accuracy: %.3f%% (%.3f)" % (results.mean()*100.0, results.std()*100.0))


In [None]:
# Leave One Out Cross Validation (LOOCV)

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# one fold per instance
loocv = model_selection.LeaveOneOut()
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=loocv)

print("Accuracy: %.3f%% (%.3f)" % (results.mean()*100.0, results.std()*100.0))

# Classification Algorithms <a name="classification_algorithms"></a>

- Classification/Decision Trees

- Naive Bayes

- K-Nearest Neighbors

- Linear Discriminant Analysis

- Quadratic Discriminant Analysis

- Logistic Regression

- Perceptron

- Support Vector Machines



### Data Preparation

We will use the same dataset (Pima Indians diabetes) for all classification algorithms.

In [None]:
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier

# read Pima Indians diabetes data
# myNames = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
myNames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('diabetes.csv', skiprows=1, names=myNames)

# convert dataframe to an array
array = data.values

# predictor variables - first eight columns (0 through 7)
X = array[:,0:8]

# target variable
Y = array[:,8]

### Classification/Decision Trees <a name="classification_decision_trees"></a>


In [None]:
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = DecisionTreeClassifier()

results = model_selection.cross_val_score(model, X, Y, cv=kfold)

print(results.mean())

### Naive Bayes <a name="naive_bayes"></a>

In [None]:
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = GaussianNB()

results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### K-Nearest Neighbors <a name="k_nearest_neighbors"></a>

In [None]:
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = KNeighborsClassifier()

results = model_selection.cross_val_score(model, X, Y, cv=kfold)

print(results.mean())

### Linear Discriminant Analysis <a name="linear_discriminant_analysis"></a>

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearDiscriminantAnalysis()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### Quadratic Discriminant Analysis <a name="quadratic_discriminant_analysis"></a>

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = QuadraticDiscriminantAnalysis()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)

print(results.mean())

### Logistic Regression <a name="logistic_regression"></a>

In [None]:
from sklearn.linear_model import LogisticRegression

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = LogisticRegression()

results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### Perceptron <a name="perceptron"></a>

In [None]:
from sklearn.linear_model import Perceptron

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = Perceptron()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### Support Vector Machines <a name="support_vector_machines"></a>

In [None]:
from sklearn.svm import SVC

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)

model = SVC()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### Comparing Classification Algorithms <a name="comparing_classification_algorithms"></a>

We will compare the performance of the following algorithms using the Pima Indians Diabetes dataset:

- Logistic Regression

- K-Nearest Neighbors

- Linear Discriminant Analysis

- Quadratic Discriminant Analysis

- Decision Tree Classifier

- Naive Bayes

- Support Vector Machines



In [None]:
# list of classification models
models = []
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('LR', LogisticRegression()))
# models.append(('PR', Perceptron()))
models.append(('SVM', SVC()))

# we will evaluate each model in turn
results = []
names = []

# for each model, compute mean accuracy and standard deviation
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# visualize accuracy results using boxplots
figure = plt.figure()
figure.suptitle('Comparison of Accuracy of Classification Algorithms')
ax = figure.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()


### Metrics for Evaluating Classification Algorithms <a name="metrics-for-evaluating-classification-algorithms"></a>

- Accuracy

- Confusion Matrix

- Area Under the Curve (AUC)

- F1

- LogLoss

- Classification Report
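
Several of these metrics can be derived by hand from the four cells of a binary confusion matrix. The sketch below uses made-up labels and a hypothetical helper, `binary_metrics`; the matrix layout (rows are true classes, columns are predicted classes) follows the convention of `sklearn.metrics.confusion_matrix`.

```python
def binary_metrics(y_true, y_pred):
    """Return the confusion matrix and derived scores for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

AUC and LogLoss are not covered by this sketch because they are computed from predicted probabilities rather than from predicted labels.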


In [None]:
# accuracy
from sklearn.linear_model import LogisticRegression

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f %.3f" % (results.mean(), results.std()))

In [None]:
# confusion matrix
from sklearn.metrics import confusion_matrix

# divide data into training (67%) and test sets (33%)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

In [None]:
# Area Under the Curve (AUC)

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='roc_auc')
print("AUC: %.3f %.3f" % (results.mean(), results.std()))

In [None]:
# F1

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='f1')
print("F1: %.3f %.3f" % (results.mean(), results.std()))


In [None]:
# LogLoss

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
# the scorer returns the negated log loss (larger is better), so flip the sign for reporting
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_log_loss')
print("LogLoss: %.3f %.3f" % (-results.mean(), results.std()))

In [None]:
# classification report
from sklearn.metrics import classification_report

# divide data into training (67%) and test sets (33%)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)

report = classification_report(Y_test, predicted)
print(report)


# Regression Algorithms <a name="regression_algorithms"></a>

- Regression Trees

- K-Nearest Neighbors

- LassoLars

- Linear

- Ridge

- Lasso

- Support Vector Machines

- Least Angle

- ElasticNet

### Boston Housing Dataset <a name="boston_housing_dataset"></a>

We will use the [Boston Housing dataset](housing.dat) to illustrate some machine learning algorithms. It is not a CSV file; columns are separated by whitespace.
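
A whitespace-separated file can still be read with `pd.read_csv` by passing a regular expression as the separator. The sketch below is an assumption about the file layout, not a description of the actual file: it assumes no header row and the column order of the variable lists in this section, and it reads two made-up rows from memory instead of `housing.dat`.

```python
import io
import pandas as pd

# assumed column order: the predictor variables followed by the target (medv)
columns = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
           'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']

# two made-up rows standing in for the real file
sample = io.StringIO(
    "0.01  18.0  2.3  0  0.54  6.6  65.2  4.1  1  296.0  15.3  396.9  5.0  24.0\n"
    "0.03   0.0  7.1  0  0.47  6.4  78.9  5.0  2  242.0  17.8  396.9  9.1  21.6\n"
)

# sep=r'\s+' splits on runs of whitespace; pass 'housing.dat' instead of
# `sample` to read the file itself
housing = pd.read_csv(sample, sep=r'\s+', header=None, names=columns)
print(housing.shape)
```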

Predictor variables are:

- **crim**: per capita crime rate by town.

- **zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

- **indus**: proportion of non-retail business acres per town.

- **chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- **nox**: nitrogen oxides concentration (parts per 10 million).

- **rm**: average number of rooms per dwelling.

- **age**: proportion of owner-occupied units built prior to 1940.

- **dis**: weighted mean of distances to five Boston employment centres.

- **rad**: index of accessibility to radial highways.

- **tax**: full-value property-tax rate per $10,000.

- **ptratio**: pupil-teacher ratio by town.

- **black**: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

- **lstat**: lower status of the population (percent).



Target variable is: 

- **medv**: median value of owner-occupied homes in $1000s.

In [None]:
from sklearn.datasets import load_boston
boston = load_boston()

print(boston.keys())

In [None]:
print(boston.data.shape)

In [None]:
print(boston.feature_names)

In [None]:
print(boston.DESCR)

In [None]:
# data type of the predictor array
print(boston.data.dtype)

In [None]:
# convert the data into a dataframe
bData = pd.DataFrame(boston.data)

# print first few lines
print(bData.head())

In [None]:
bData = bData.astype(float)

In [None]:
# data types of variables
print(bData.dtypes)

In [None]:
# associate names to columns
bData.columns = boston.feature_names
print(bData.head())

In [None]:
# the target variable (price) is stored in boston.target
print(boston.target.shape)

In [None]:
# add target attribute to the dataframe as price column
bData['PRICE'] = boston.target
print(bData.head())

In [None]:
# summary statistics
print(bData.describe())

In [None]:
# data types of predictor and target variables
print(bData.dtypes)

In [None]:
# predictor variables - all columns except PRICE
X = bData.drop('PRICE', axis=1)

# target variable
Y = bData['PRICE']

print(X)

### Regression Trees <a name="regression_trees"></a>

In [None]:
from sklearn.tree import DecisionTreeRegressor

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = DecisionTreeRegressor()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### K-Nearest Neighbors <a name="k_nearest_neighbors"></a>

In [None]:
from sklearn.neighbors import KNeighborsRegressor

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = KNeighborsRegressor()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### Linear <a name="linear"></a>

In [None]:
from sklearn.linear_model import LinearRegression

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearRegression()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### LassoLars <a name="lassolars"></a>

A Lasso model fit with Least Angle Regression (LARS).

In [None]:
from sklearn.linear_model import LassoLars

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LassoLars()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### Ridge <a name="ridge"></a>

Ridge regression is suitable for multiple regression data that suffer from multicollinearity, that is, when one predictor variable can be linearly predicted from the others with a high degree of accuracy.

When multicollinearity occurs, least squares estimates are unbiased, but their variances are large. Ridge regression adds an L2 penalty on the coefficients, accepting a small bias in exchange for a lower variance.
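
To make the effect of multicollinearity concrete, here is a small pure-Python sketch with two nearly identical predictors. The data values and the helpers `solve_2x2` and `fit_two_predictors` are made up for illustration, and the model has no intercept to keep the algebra short.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] @ [u, v] = [e, f] with Cramer's rule."""
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - e * c) / det)

def fit_two_predictors(x1, x2, y, alpha=0.0):
    """Least squares without intercept; alpha > 0 adds the ridge (L2) penalty."""
    s11 = sum(u * u for u in x1) + alpha
    s22 = sum(v * v for v in x2) + alpha
    s12 = sum(u * v for u, v in zip(x1, x2))
    t1 = sum(u * w for u, w in zip(x1, y))
    t2 = sum(v * w for v, w in zip(x2, y))
    # normal equations: (X'X + alpha * I) beta = X'y
    return solve_2x2(s11, s12, s12, s22, t1, t2)

# x2 is almost a copy of x1, and y is roughly x1 + x2 (made-up values)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = fit_two_predictors(x1, x2, y, alpha=0.0)
ridge = fit_two_predictors(x1, x2, y, alpha=1.0)
print("least squares:", ols)
print("ridge:        ", ridge)
```

With these numbers, least squares assigns large coefficients of opposite sign to the two near-duplicate predictors, while the ridge fit keeps both close to 1.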

In [None]:
from sklearn.linear_model import Ridge

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = Ridge()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### Lasso <a name="lasso"></a>

Lasso performs both variable selection and regularization (through an L1 penalty on the coefficients) to improve prediction accuracy.
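
One way to see why Lasso selects variables: in the idealized case of an orthonormal design matrix, the Lasso solution for the objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is the least squares solution passed through a soft-thresholding operator, which sets small coefficients exactly to zero. The sketch below assumes that idealized case; the coefficient values are made up, and `soft_threshold` is a hypothetical helper.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; values within the penalty band become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# made-up least squares coefficients for an orthonormal design
ols_coefficients = [3.0, -0.4, 0.1, -2.5]
penalty = 0.5

lasso = [soft_threshold(b, penalty) for b in ols_coefficients]

# for comparison: ridge with penalty (penalty / 2) * ||beta||^2 on the same
# design rescales every coefficient but never sets one exactly to zero
ridge = [b / (1.0 + penalty) for b in ols_coefficients]

print(lasso)
print(ridge)
```

The two small coefficients are dropped by the Lasso step and only shrunk by the ridge step, which is the practical difference between the L1 and L2 penalties.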

In [None]:
from sklearn.linear_model import Lasso

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = Lasso()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### ElasticNet <a name="elasticnet"></a>
ElasticNet is a regularized regression method that combines the penalties of Ridge regression (L2) and Lasso regression (L1).
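
For reference, the objective that scikit-learn documents for `ElasticNet` can be written as follows, where $n$ is the number of samples, $w$ is the coefficient vector, $\alpha$ is the overall penalty strength (the `alpha` parameter) and $\rho$ is the L1 mixing weight (the `l1_ratio` parameter):

$$
\min_{w} \; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 (Lasso) penalty, and setting $\rho = 0$ leaves only the L2 (Ridge-type) penalty.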

In [None]:
from sklearn.linear_model import ElasticNet

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = ElasticNet()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### Support Vector Machines <a name="support_vector_machines"></a>

In [None]:
from sklearn.svm import SVR

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = SVR()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### Least Angle <a name="least_angle"></a>

Suitable for developing linear regression models for high-dimensional data.

In [None]:
from sklearn.linear_model import Lars

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = Lars()

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')

print(results.mean())

### Metrics for Evaluating Regression Algorithms <a name="metrics_for_evaluating_regression_algorithms"></a>

- $R^2$

- Mean Absolute Error (MAE)

- Mean Squared Error (MSE)
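
The three metrics can be computed by hand from true and predicted values. The sketch below uses made-up numbers and a hypothetical helper, `regression_metrics`. Note that the scikit-learn scorers named `neg_mean_absolute_error` and `neg_mean_squared_error` return the negated error, so that a larger score is always better.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for two equally long sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # R^2 compares the model's squared error with that of predicting the mean
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_total
    return mae, mse, r2

# made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, r2)
```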



In [None]:
# R^2
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='r2')
print("R^2: %.3f %.3f" % (results.mean(), results.std()))

In [None]:
# Mean Absolute Error (MAE)

from sklearn import model_selection
from sklearn.linear_model import LinearRegression

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearRegression()
# the scorer returns the negated error (larger is better), so flip the sign for reporting
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_absolute_error')
print("MAE: %.3f %.3f" % (-results.mean(), results.std()))

In [None]:
# Mean Squared Error (MSE)

from sklearn import model_selection
from sklearn.linear_model import LinearRegression

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearRegression()
# the scorer returns the negated error (larger is better), so flip the sign for reporting
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print("MSE: %.3f %.3f" % (-results.mean(), results.std()))

# Creating Machine Learning Algorithm Pipelines <a name="algorithm_pipelines"></a>

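**Feature union**: combine the outputs of several feature extraction or selection steps (for example, PCA components and the best univariate features) into one feature matrix, and then build the model on it.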
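
Conceptually, a feature union applies every transformer to the same input rows and concatenates the resulting columns side by side. The following library-free sketch illustrates only that idea; `feature_union`, `keep_first_two` and `row_sum` are hypothetical stand-ins, not scikit-learn functions.

```python
def feature_union(transformers, rows):
    """Apply each transformer to the same rows and concatenate the output columns."""
    outputs = [transform(rows) for transform in transformers]
    combined = []
    for i in range(len(rows)):
        combined_row = []
        for output in outputs:
            combined_row.extend(output[i])
        combined.append(combined_row)
    return combined

def keep_first_two(rows):
    """Stand-in for a feature selector: keep the first two columns."""
    return [row[:2] for row in rows]

def row_sum(rows):
    """Stand-in for a feature extractor: derive one new column per row."""
    return [[sum(row)] for row in rows]

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
combined = feature_union([keep_first_two, row_sum], rows)
print(combined)
```

By the same logic, a union of `PCA(n_components=3)` and `SelectKBest(k=6)` hands 3 + 6 = 9 columns to the estimator that follows it in the pipeline.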


In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# the pipeline ends in a classifier, so reload the Pima Indians diabetes data
# (X and Y hold the Boston housing regression data at this point)
diabetes = pd.read_csv('diabetes.csv', skiprows=1, names=myNames)
X_clf = diabetes.values[:, 0:8]
Y_clf = diabetes.values[:, 8]

# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

# create a pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))

model = Pipeline(estimators)

# evaluate pipeline
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X_clf, Y_clf, cv=kfold)
print(results.mean())

Create a pipeline that standardizes the data first and then builds a model.

In [None]:
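from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA is a classifier, so use the Pima Indians diabetes data
diabetes = pd.read_csv('diabetes.csv', skiprows=1, names=myNames)
X_clf = diabetes.values[:, 0:8]
Y_clf = diabetes.values[:, 8]

# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)

# evaluate pipeline
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X_clf, Y_clf, cv=kfold)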
print(results.mean())

# Improving Model Performance <a name="improving_model_performance"></a>

- Grid Search

- Randomized Search
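
The difference between the two strategies can be sketched without any library. Grid search evaluates a fixed list of candidate values, whereas randomized search draws candidates from a distribution. In the sketch below, `validation_error` is a made-up stand-in for a cross-validated error curve (its minimum is placed at 0.3 purely for illustration); the grid reuses the alpha values of the grid search cell, and the random draws come from the interval [0, 1].

```python
import random

def validation_error(alpha):
    """Hypothetical error curve standing in for a cross-validated score."""
    return (alpha - 0.3) ** 2

# grid search: try every value in a fixed list
grid = [1, 0.1, 0.01, 0.001, 0.0001, 0]
best_grid = min(grid, key=validation_error)

# randomized search: try 100 values drawn uniformly from [0, 1]
rng = random.Random(7)
candidates = [rng.uniform(0.0, 1.0) for _ in range(100)]
best_random = min(candidates, key=validation_error)

print("grid search best alpha:      ", best_grid)
print("randomized search best alpha:", best_random)
```

Grid search can only return a value that is on the grid, so it misses an optimum that lies between grid points; randomized search trades that restriction for a fixed budget of random draws.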


In [None]:
# grid search

import numpy
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

alphas = numpy.array([1,0.1,0.01,0.001,0.0001,0])

grid = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas))
grid.fit(X, Y)
print(grid.best_score_)
print(grid.best_estimator_.alpha)

In [None]:
# Randomized Search

import pandas
import numpy
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'alpha': uniform()}
rsearch = RandomizedSearchCV(estimator=Ridge(), param_distributions=param_grid, n_iter=100, random_state=7)
rsearch.fit(X, Y)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)


# Ensemble Models for Regression <a name="ensemble_models_for_regression"></a>

- Gradient Boosting
    
- Adaboost
    
- Random Forest
    
- Extra Trees
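
To give an idea of what a boosting ensemble does, the sketch below implements a bare-bones form of gradient boosting for squared error on a single feature: start from the mean of the target, then repeatedly fit a one-split tree (a stump) to the current residuals and add a damped copy of it to the prediction. It is a simplified illustration with made-up data and hypothetical helpers, not the algorithm as implemented in `GradientBoostingRegressor`.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    # every distinct value except the largest is a candidate split point
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_rounds, learning_rate=0.5):
    """Return training predictions after n_rounds of boosting with stumps."""
    prediction = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        # for squared error, the negative gradient is the residual
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        prediction = [pi + learning_rate * (left_mean if xi <= threshold else right_mean)
                      for xi, pi in zip(x, prediction)]
    return prediction

def mean_squared_error(y, prediction):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y)

# made-up data for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

mse_0 = mean_squared_error(y, boost(x, y, 0))
mse_1 = mean_squared_error(y, boost(x, y, 1))
mse_20 = mean_squared_error(y, boost(x, y, 20))
print(mse_0, mse_1, mse_20)
```

The training error falls as rounds are added; the cross-validated scores computed in this section are what indicate whether that improvement carries over to unseen data.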

In [None]:
# Gradient Boosting regression

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = GradientBoostingRegressor()

# the score is the negated mean squared error, so values closer to 0 are better
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())


In [None]:
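# AdaBoost regression

from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = AdaBoostRegressor()

# the score is the negated mean squared error, so values closer to 0 are better
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())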


In [None]:
# Random Forest regression

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = RandomForestRegressor()

# the score is the negated mean squared error, so values closer to 0 are better
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())


In [None]:
# Extra Trees regression

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = ExtraTreesRegressor()

# the score is the negated mean squared error, so values closer to 0 are better
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())


# Ensemble Models for Classification <a name="ensemble_models_for_classification"></a>

- Bagging

- Gradient Boosting
    
- AdaBoost
    
- Random Forest
    
- Extra Trees

- Voting Ensemble

In [None]:
# Bagging classifier

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cart = DecisionTreeClassifier()
num_trees = 100
# the base model is passed positionally because its keyword name differs across scikit-learn versions
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
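
Bagging trains each base model on a bootstrap sample, that is, a sample of the training rows drawn with replacement. A minimal sketch of that sampling step in plain Python follows. The helper `bootstrap_sample` and the row indices are made up for illustration and are not part of scikit-learn.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one sample per base model."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(7)
rows = list(range(10))  # stand-in for the indices of 10 training rows

# one bootstrap sample for each of three base models
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))
```

Because sampling is with replacement, some rows usually appear more than once in a sample and others not at all, which is what makes the base models differ from one another.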


In [None]:
# Gradient Boosting classifier

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

num_trees = 100
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

In [None]:
# AdaBoost classifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

num_trees = 30
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


In [None]:
# Random Forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

num_trees = 100
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
# each split considers a random subset of 3 features
model = RandomForestClassifier(n_estimators=num_trees, max_features=3)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

In [None]:
# Extra Trees classifier

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score

num_trees = 100
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
# each split considers a random subset of 7 features
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


In [None]:
# Voting Ensemble classifier

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# create the submodels
estimators = []

# logistic regression classifier
model1 = LogisticRegression()
estimators.append(('logistic', model1))

# decision tree classifier
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))

# support vector machine classifier
model3 = SVC()
estimators.append(('svm', model3))

# create the ensemble model
ensemble = VotingClassifier(estimators)

# evaluate the performance of the ensemble
results = cross_val_score(ensemble, X, Y, cv=kfold)

print(results.mean())
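
By default, `VotingClassifier` uses hard voting: each submodel predicts a class label and the most frequent label wins. The sketch below reproduces that rule in plain Python. The prediction lists are made-up values for illustration, the helper `majority_vote` is hypothetical, and tie-breaking is not handled because three voters on two classes cannot tie.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label predictions by hard (majority) voting."""
    combined = []
    # zip(*...) groups the votes that all models cast for the same sample
    for votes in zip(*predictions_per_model):
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined

# made-up predictions from three submodels on four samples
logistic_preds = [1, 0, 1, 1]
cart_preds = [1, 1, 0, 1]
svm_preds = [0, 0, 1, 1]

ensemble_preds = majority_vote([logistic_preds, cart_preds, svm_preds])
print(ensemble_preds)
```

Each sample gets the label that at least two of the three submodels agree on, so a single submodel's mistake is outvoted when the other two are right.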


# Saving and Retrieving Models <a name="saving_and_retrieving_models"></a>

- joblib

- Pickle



In [None]:
# save and retrieve model using joblib

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# build model using training data
model = LogisticRegression()
model.fit(X_train, Y_train)

# evaluate model performance using the test data
result = model.score(X_test, Y_test)
print(result)

# save the model to disk
filename = 'final_lr__model.sav'
joblib.dump(model, filename)


# some time later, load the model back from disk and check that it scores the same
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)


In [None]:
# save and retrieve model using pickle

import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# build model using training data
model = LogisticRegression()
model.fit(X_train, Y_train)

# evaluate model performance using the test data
result = model.score(X_test, Y_test)
print(result)

# save the model to disk; the with block closes the file afterwards
filename = 'final_lr__model2.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)


# some time later, load the model back from disk with pickle (not joblib)
with open(filename, 'rb') as f:
    loaded_model_2 = pickle.load(f)
result = loaded_model_2.score(X_test, Y_test)
print(result)
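
The save and load steps can be tried in isolation, without training a model. The sketch below round-trips a plain dictionary through a temporary file. The dictionary `fitted_state` and its values are made up and stand in for a fitted estimator, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# made-up parameters standing in for a fitted model
fitted_state = {"coef": [0.4, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_state.pkl")

    # write in binary mode; the with block closes the file
    with open(path, "wb") as f:
        pickle.dump(fitted_state, f)

    # read it back as a new, equal object
    with open(path, "rb") as f:
        restored_state = pickle.load(f)

print(restored_state == fitted_state)
```

Unpickling can execute arbitrary code, so only load pickle or joblib files that come from a trusted source.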