# Spot-Check Classification Algorithms

><small><i>from the book 
"Machine Learning Mastery With Python: Understand Your Data, Create Accurate Models and Work Projects End-To-End"
by Jason Brownlee, Migrated to Jupyter with additions by Mitch Sanders 2017</i></small>




Spot-checking is a way of discovering which algorithms perform well on your machine learning
problem. You cannot know which algorithms are best suited to your problem beforehand. You
must trial a number of methods and focus attention on those that prove themselves the most
promising. In this chapter you will discover six machine learning algorithms that you can use
when spot-checking your classification problem in Python with scikit-learn. After completing
this lesson you will know:

1. How to spot-check machine learning algorithms on a classification problem.
2. How to spot-check two linear classification algorithms.
3. How to spot-check four nonlinear classification algorithms.

Let’s get started.

## Algorithm Spot-Checking

You cannot know which algorithm will work best on your dataset beforehand. You must use
trial and error to discover a shortlist of algorithms that do well on your problem, which you can
then double down on and tune further. I call this process spot-checking.

The question is not: What algorithm should I use on my dataset? Instead it is: What
algorithms should I spot-check on my dataset? You can guess at what algorithms might do
well on your dataset, and this can be a good starting point. I recommend trying a mixture of
algorithms and seeing which are good at picking out the structure in your data. Below are some
suggestions when spot-checking algorithms on your dataset:

- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. different algorithms for learning the same type
of representation).
- Try a mixture of modeling types (e.g. linear and nonlinear functions or parametric and
nonparametric).
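
The suggestions above can be sketched as a single loop that evaluates a mixture of model types with one shared test harness. This is a minimal sketch, not the book's recipe: the synthetic dataset from `make_classification`, the choice of three models and the `models` and `scores` names are all assumptions made for illustration.

```python
# Spot-check a mixture of classifiers with one shared test harness.
# Assumption: synthetic data stands in for a real dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, Y = make_classification(n_samples=200, n_features=8, random_state=7)

# A linear model, an instance-based model and a tree-based model.
models = {
    'LR': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'CART': DecisionTreeClassifier(random_state=7),
}

# Every model is scored on the same 10 folds so the results are comparable.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for name, model in models.items():
    results = cross_val_score(model, X, Y, cv=kfold)
    scores[name] = results.mean()
    print('%s: %.3f' % (name, scores[name]))
```

Models that score well in a loop like this would form the shortlist to tune further.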

Let’s get specific. In the next section, we will look at algorithms that you can use to
spot-check on your next classification machine learning project in Python.

## Algorithms Overview
We are going to take a look at six classification algorithms that you can spot-check on your
dataset. We start with two linear machine learning algorithms:

- Logistic Regression.
- Linear Discriminant Analysis.

We then look at four nonlinear machine learning algorithms:

- k-Nearest Neighbors.
- Naive Bayes.
- Classification and Regression Trees.
- Support Vector Machines.

Each recipe is demonstrated on the Pima Indians onset of Diabetes dataset. A test harness
using 10-fold cross-validation is used to demonstrate how to spot-check each machine learning
algorithm and mean accuracy measures are used to indicate algorithm performance. The recipes
assume that you know about each machine learning algorithm and how to use them. We will
not go into the API or parameterization of each algorithm.
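
To make the test harness concrete, here is a rough sketch of what 10-fold cross-validation with mean accuracy amounts to when written out by hand. It assumes synthetic data from `make_classification` and uses `GaussianNB` only because it is deterministic. The `fold_scores`, `manual_mean` and `library_mean` names are illustrative.

```python
# Write the cross-validation loop by hand and compare it with cross_val_score.
# Assumption: synthetic data stands in for the Pima Indians dataset.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, Y = make_classification(n_samples=200, n_features=8, random_state=7)
kfold = KFold(n_splits=10)
model = GaussianNB()

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Fit a fresh copy of the model on 9 folds, then score it on the held-out fold.
    fold_model = clone(model)
    fold_model.fit(X[train_idx], Y[train_idx])
    fold_scores.append(fold_model.score(X[test_idx], Y[test_idx]))
manual_mean = np.mean(fold_scores)

# cross_val_score performs the same split, fit and score steps internally.
library_mean = cross_val_score(model, X, Y, cv=kfold).mean()
print(manual_mean, library_mean)
```

For a classifier, `score` returns accuracy, so the two means should agree.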


## Linear Machine Learning Algorithms
This section demonstrates minimal recipes for how to use two linear machine learning algorithms:
logistic regression and linear discriminant analysis.

### Logistic Regression
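Logistic regression models the log-odds of the positive class as a linear function of the
numeric input variables and can model binary classification problems. It is often described as
assuming a Gaussian distribution for the inputs, although this is not a strict requirement of
the model. You can construct a logistic regression model using the LogisticRegression class.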

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
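
As a small illustration of how a fitted logistic regression model turns inputs into class probabilities, the sketch below recomputes the probabilities by hand from the learned coefficients. It assumes a binary problem and synthetic data from `make_classification`. The `z`, `p_manual` and `p_library` names are illustrative.

```python
# Recompute logistic regression probabilities from the fitted coefficients.
# Assumption: a binary problem on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, Y = make_classification(n_samples=200, n_features=8, random_state=7)
model = LogisticRegression(max_iter=1000).fit(X, Y)

# Linear score for each row: weighted sum of the inputs plus the intercept.
z = X @ model.coef_.ravel() + model.intercept_[0]
# The logistic (sigmoid) function maps the score to a probability for class 1.
p_manual = 1.0 / (1.0 + np.exp(-z))
p_library = model.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_library))
```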



```python
# Logistic Regression Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# load the dataset and split it into input (X) and output (Y) columns
filename = '../pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# random_state is omitted: it has no effect without shuffle=True
# and recent scikit-learn releases reject it
num_folds = 10
kfold = KFold(n_splits=num_folds)
# liblinear was the default solver when the result below was recorded
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example prints the mean estimated accuracy.

```
0.76951469583
```


### Linear Discriminant Analysis
Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass
classification. It too assumes a Gaussian distribution for the numerical input variables. You can
construct an LDA model using the LinearDiscriminantAnalysis class.

http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
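
Because LDA also handles multiclass problems, the same kind of harness can be pointed at a three-class dataset. The sketch below uses the iris dataset bundled with scikit-learn, which is an assumption made for illustration and not part of the original recipes. The folds are shuffled because the iris rows are ordered by class.

```python
# LDA on a three-class problem.
# Assumption: the bundled iris dataset is used only for illustration.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

X, Y = load_iris(return_X_y=True)
# Shuffle before splitting so each fold contains a mix of classes.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
iris_accuracy = results.mean()
print(iris_accuracy)
```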



```python
# LDA Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load the dataset and split it into input (X) and output (Y) columns
filename = '../pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# random_state is omitted: it has no effect without shuffle=True
# and recent scikit-learn releases reject it
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example prints the mean estimated accuracy.

```
0.773462064252
```


## Nonlinear Machine Learning Algorithms

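This section demonstrates minimal recipes for how to use four nonlinear machine learning algorithms:
k-Nearest Neighbors, Naive Bayes, Classification and Regression Trees and Support Vector Machines.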

### k-Nearest Neighbors
The k-Nearest Neighbors algorithm (or KNN) uses a distance metric to find the k most similar
instances in the training data for a new instance and takes the most common class among those
neighbors as the prediction. You can construct a KNN model using the KNeighborsClassifier class.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
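
The idea behind KNN classification can be sketched in a few lines of NumPy. This is a simplified illustration, not how `KNeighborsClassifier` is implemented. It assumes Euclidean distance and a plain majority vote, and the `knn_predict` helper and the tiny dataset are hypothetical.

```python
# A from-scratch sketch of KNN classification.
# Assumptions: Euclidean distance, simple majority vote, toy data.
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from the new instance to every training instance.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training instances.
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors is the prediction.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # near the class 0 cluster
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # near the class 1 cluster
```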


```python
# KNN Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# load the dataset and split it into input (X) and output (Y) columns
filename = '../pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# random_state is omitted: it has no effect without shuffle=True
# and recent scikit-learn releases reject it
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = KNeighborsClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example prints the mean estimated accuracy.

```
0.726555023923
```


### Naive Bayes
Naive Bayes calculates the probability of each class and the conditional probability of each input
value given each class. These probabilities are estimated for new data and multiplied together,
assuming that the inputs are all independent given the class (a simple or naive assumption). When
working with real-valued data, a Gaussian distribution is assumed so that the probabilities for
input variables can be estimated easily using the Gaussian Probability Density Function. You can
construct a Naive Bayes model using the GaussianNB class.

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
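
The calculation described above can be sketched by hand. This is a simplified illustration and not the `GaussianNB` implementation, which also applies variance smoothing and works with log probabilities. The helper functions and the toy data are hypothetical.

```python
# A from-scratch sketch of Gaussian Naive Bayes.
# Assumptions: toy data, no variance smoothing, plain (not log) probabilities.
import numpy as np

def gaussian_pdf(x, mean, var):
    # Gaussian Probability Density Function, evaluated per input variable.
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def fit_gaussian_nb(X, y):
    # For each class, store its prior and the per-variable mean and variance.
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0))
    return params

def predict_gaussian_nb(params, x_new):
    # Multiply the class prior by the density of each input value,
    # treating the inputs as independent given the class.
    scores = {}
    for label, (prior, mean, var) in params.items():
        scores[label] = prior * np.prod(gaussian_pdf(x_new, mean, var))
    return max(scores, key=scores.get)

X = np.array([[1.0, 2.1], [1.5, 1.8], [0.8, 2.4],
              [6.0, 7.1], [6.4, 6.8], [5.7, 7.3]])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nb(X, y)

print(predict_gaussian_nb(params, np.array([1.2, 2.0])))  # close to the class 0 rows
print(predict_gaussian_nb(params, np.array([6.1, 7.0])))  # close to the class 1 rows
```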


```python
# Gaussian Naive Bayes Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
# load the dataset and split it into input (X) and output (Y) columns
filename = '../pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# random_state is omitted: it has no effect without shuffle=True
# and recent scikit-learn releases reject it
kfold = KFold(n_splits=10)
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example prints the mean estimated accuracy.

```
0.75517771702
```


### Classification and Regression Trees

Classification and Regression Trees (CART or just decision trees) construct a binary tree from
the training data. Split points are chosen greedily by evaluating each attribute and each value
of each attribute in the training data in order to minimize a cost function (like the Gini index).
You can construct a CART model using the DecisionTreeClassifier class.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
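
The greedy search for a split point can be sketched directly. This is a simplified illustration of choosing one split by weighted Gini index, not the `DecisionTreeClassifier` implementation. The `gini` and `best_split` helpers and the toy data are hypothetical.

```python
# A from-scratch sketch of choosing one CART split by Gini index.
# Assumptions: toy data, candidate thresholds taken from the observed values.
import numpy as np

def gini(labels):
    # Gini index of a group: 1 minus the sum of squared class proportions.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Evaluate every attribute and every observed value as a split point.
    best = (None, None, float('inf'))
    n = len(y)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            # Cost of the split: Gini of each side, weighted by its size.
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if cost < best[2]:
                best = (feature, threshold, cost)
    return best

X = np.array([[2.0, 7.0], [3.0, 1.0], [4.0, 6.0],
              [7.0, 2.0], [8.0, 8.0], [9.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

feature, threshold, cost = best_split(X, y)
print(feature, threshold, cost)
```

In this toy data the first attribute separates the classes perfectly at a value of 4.0, so the search settles on that split with a cost of zero. A full tree would repeat the search recursively on each side of the split.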


```python
# CART Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# load the dataset and split it into input (X) and output (Y) columns
filename = '../pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# random_state is omitted: it has no effect without shuffle=True
# and recent scikit-learn releases reject it
kfold = KFold(n_splits=10)
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example prints the mean estimated accuracy.

```
0.683526999316
```


### Support Vector Machines

Support Vector Machines (or SVM) seek a line (more generally, a hyperplane) that best separates
two classes. Those data instances that are closest to the line that best separates the classes are
called support vectors and influence where the line is placed. SVM has been extended to support
multiple classes. Of particular importance is the use of different kernel functions via the kernel
parameter. A powerful Radial Basis Function (RBF) kernel is used by default. You can construct an
SVM model using the SVC class.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
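
To illustrate the kernel parameter, the sketch below spot-checks the four built-in kernel names accepted by `SVC` and then inspects the support vectors of a fitted model. It assumes synthetic data from `make_classification`. The `kernel_scores` name is illustrative, and the scores say nothing about how the kernels would rank on a real dataset.

```python
# Compare SVC kernels and inspect the support vectors.
# Assumption: synthetic data stands in for a real dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, Y = make_classification(n_samples=200, n_features=8, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Evaluate each kernel with the same folds.
kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = SVC(kernel=kernel)
    kernel_scores[kernel] = cross_val_score(model, X, Y, cv=kfold).mean()
    print('%s: %.3f' % (kernel, kernel_scores[kernel]))

# After fitting, the training instances that define the boundary are exposed
# as support_vectors_ (one row per support vector, one column per input).
model = SVC(kernel='rbf').fit(X, Y)
print(model.support_vectors_.shape)
```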


```python
# SVM Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# load the dataset and split it into input (X) and output (Y) columns
filename = '../pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# random_state is omitted: it has no effect without shuffle=True
# and recent scikit-learn releases reject it
kfold = KFold(n_splits=10)
# gamma='auto' was the default when the result below was recorded
model = SVC(gamma='auto')
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example prints the mean estimated accuracy.

```
0.651025290499
```


## Summary
In this chapter you discovered six machine learning algorithms that you can use to spot-check
on your classification problem in Python using scikit-learn. Specifically, you learned how to
spot-check two linear machine learning algorithms: Logistic Regression and Linear Discriminant
Analysis. You also learned how to spot-check four nonlinear algorithms: k-Nearest Neighbors,
Naive Bayes, Classification and Regression Trees and Support Vector Machines.

### Next
In the next lesson you will discover how you can use spot-checking on regression machine learning
problems and practice with seven different regression algorithms.


<hr>

### About the Boston House Price dataset:
Maintained at the UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets/housing

Included in scikit-learn datasets module
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html


#### Attribute Information:

1. CRIM: per capita crime rate by town 
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft. 
3. INDUS: proportion of non-retail business acres per town 
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 
5. NOX: nitric oxides concentration (parts per 10 million) 
6. RM: average number of rooms per dwelling 
7. AGE: proportion of owner-occupied units built prior to 1940 
8. DIS: weighted distances to five Boston employment centres 
9. RAD: index of accessibility to radial highways 
10. TAX: full-value property-tax rate per 10,000 US Dollars
11. PTRATIO: pupil-teacher ratio by town 
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
13. LSTAT: % lower status of the population 
14. MEDV: Median value of owner-occupied homes in 1000's US Dollars


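```python
# demo using Boston Housing data in scikit-learn
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
```

```
(506, 13)
```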


### About the Pima Indian Dataset 

#### Attribute Information:

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 