# Chapter 11
# Spot-Check Classification Algorithms

Spot-checking is a way of discovering which algorithms perform well on your machine learning problem. You cannot know which algorithms are best suited to your problem beforehand. You must trial a number of methods and focus attention on those that prove themselves the most promising. In this chapter you will discover six machine learning algorithms that you can use when spot-checking your **classification problem** in Python with scikit-learn. After completing this chapter you will know:
1. How to spot-check machine learning algorithms on a classification problem.
2. How to spot-check two linear classification algorithms.
3. How to spot-check four nonlinear classification algorithms.

Let's get started.

## 11.1 Algorithm Spot-Checking 
`(spot-checking: random verification)`

You cannot know which algorithm will work best on your dataset beforehand. <a>You must use trial and error to discover a shortlist of algorithms that do well on your problem, which you can then double down on and tune further</a>. I call this process **spot-checking**.

The question is not: **What algorithm should I `use` on my dataset**? Instead it is: **What algorithms should I `spot-check` on my dataset**? You can guess at what algorithms might do well on your dataset, and this can be a good starting point. I recommend trying a mixture of algorithms and seeing which are good at picking out the structure in your data. Below are some suggestions when spot-checking algorithms on your dataset:
***
- Try a mixture of `algorithm representations` (e.g. instances and trees).
- Try a mixture of `learning algorithms` (e.g. different algorithms for learning the same type of representation).
- Try a mixture of `modeling types` (e.g. linear and nonlinear functions or parametric and nonparametric).
***
Let's get specific. In the next section, we will look at algorithms that you can use to spot-check on your next classification machine learning project in Python.
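The trial-and-error process described above can be sketched as a single loop that scores several candidate models with the same test harness and ranks them. This is a minimal sketch, not the chapter's own recipe: the synthetic data from `make_classification`, the dictionary of short model names, and the shuffled folds are all assumptions made so the example runs without any data file.

```python
# Spot-check several classifiers with one shared cross-validation harness.
# Assumption: synthetic data stands in for a real dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=7)

# Hypothetical short names mapped to default-configured models.
models = {
    'LR': LogisticRegression(solver='liblinear'),
    'LDA': LinearDiscriminantAnalysis(),
    'KNN': KNeighborsClassifier(),
    'NB': GaussianNB(),
    'CART': DecisionTreeClassifier(random_state=7),
    'SVM': SVC(gamma='scale'),
}

# Every model is evaluated on exactly the same folds.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for name, model in models.items():
    results = cross_val_score(model, X, y, cv=kfold)
    scores[name] = results.mean()

# Rank the candidates by mean accuracy, best first.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

The ranking only suggests which algorithms deserve further tuning; the numbers on synthetic data say nothing about how the models behave on your own problem.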

## 11.2 Algorithms Overview

We are going to take a look at six **classification algorithms** that you can spot-check on your dataset. We start with two `linear machine learning` algorithms:
- Logistic Regression.
- Linear Discriminant Analysis.

We then look at four `nonlinear machine learning` algorithms:
- k-Nearest Neighbors.
- Naive Bayes.
- Classification and Regression Trees.
- Support Vector Machines.

Each recipe is demonstrated on the Pima Indians diabetes dataset (loaded from pima-indians-diabetes.data.csv). A test harness using 10-fold cross-validation demonstrates how to spot-check each machine learning algorithm, and mean accuracy is used to indicate algorithm performance. The recipes assume that you know about each machine learning algorithm and how to use it. We will not go into the API or parameterization of each algorithm.
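To make the test harness less of a black box, here is a minimal standard-library sketch of what 10-fold cross-validation with mean accuracy does. The helper names `kfold_indices` and `majority_class_accuracy` are hypothetical, the labels are toy data, and a trivial majority-class predictor stands in for a real classifier.

```python
# Minimal sketch of k-fold cross-validation with the standard library only.
# Assumption: a trivial majority-class "model" stands in for a real classifier.
from collections import Counter

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def majority_class_accuracy(labels, train_idx, test_idx):
    """Predict the most common training label and score it on the test fold."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    hits = sum(1 for i in test_idx if labels[i] == majority)
    return hits / len(test_idx)

labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0] * 3  # 30 toy labels, 70% class 0

# Each fold is held out once; the model is "trained" on the other nine.
fold_scores = [majority_class_accuracy(labels, train, test)
               for train, test in kfold_indices(len(labels), 10)]
mean_accuracy = sum(fold_scores) / len(fold_scores)
print(len(fold_scores), round(mean_accuracy, 3))
```

Every observation is used for testing exactly once, which is why the mean of the fold scores is a more stable estimate than a single train/test split.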

## 11.3 Linear Machine Learning Algorithms

This section demonstrates minimal recipes for how to use two linear machine learning algorithms: `logistic regression` and `linear discriminant analysis`.

### 11.3.1 Logistic Regression

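Logistic regression <a>models the probability of a class as a logistic (sigmoid) function of a weighted sum of the numeric input variables and can model binary classification problems</a>. It does not require the inputs to follow a Gaussian distribution. You can construct a logistic regression model using the LogisticRegression class.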

In [3]:
# Logistic Regression Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

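# load data (forward slashes avoid invalid escape sequences and work on every platform)
filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# split into input features (first 8 columns) and class label (last column)
X = array[:,0:8]
Y = array[:,8]

# unshuffled folds are deterministic, so random_state is omitted
# (recent scikit-learn versions raise an error if it is set while shuffle=False)
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = LogisticRegression(solver='liblinear')

# evaluate the model and report the mean accuracy across folds
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())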

0.7695146958304853


Running the example prints the mean estimated accuracy.
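To make the model behind the recipe above concrete, the following standard-library sketch shows how a fitted logistic regression turns inputs into a class probability. The weights, bias, and input values are hypothetical numbers chosen for illustration; they are not coefficients learned from the diabetes data.

```python
# Sketch of a logistic regression prediction with hand-picked coefficients.
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of the positive class: sigmoid of a weighted sum."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

weights = [0.8, -0.5]  # hypothetical coefficients
bias = -0.2            # hypothetical intercept

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(round(p, 3), label)
```

The decision boundary is the set of points where the weighted sum is zero, which is why logistic regression is a linear classifier.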

### 11.3.2 Linear Discriminant Analysis

Linear Discriminant Analysis or `LDA` is a <a>statistical technique for **binary and multiclass** classification. It assumes a Gaussian distribution for the numerical input variables within each class</a>. You can construct an LDA model using the LinearDiscriminantAnalysis class.

In [4]:
# LDA Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

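# load data (forward slashes avoid invalid escape sequences and work on every platform)
filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# split into input features (first 8 columns) and class label (last column)
X = array[:,0:8]
Y = array[:,8]

# unshuffled folds are deterministic, so random_state is omitted
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = LinearDiscriminantAnalysis()

# evaluate the model and report the mean accuracy across folds
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())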

0.773462064251538


Running the example prints the mean estimated accuracy.
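The Gaussian assumption behind LDA can be illustrated with a one-variable sketch: each class gets its own mean, all classes share one pooled variance, and a new value is assigned to the class with the highest discriminant score. The function names and toy data are hypothetical, and this simplified single-feature version is only an illustration of the idea, not how scikit-learn implements it.

```python
# One-dimensional LDA sketch: per-class means, one shared (pooled) variance.
import math

def fit_lda_1d(values, labels):
    classes = sorted(set(labels))
    n = len(values)
    means, priors = {}, {}
    squared_dev = 0.0
    for c in classes:
        members = [v for v, l in zip(values, labels) if l == c]
        means[c] = sum(members) / len(members)
        priors[c] = len(members) / n
        squared_dev += sum((v - means[c]) ** 2 for v in members)
    variance = squared_dev / (n - len(classes))  # pooled across classes
    return means, priors, variance

def predict_lda_1d(x, means, priors, variance):
    def score(c):
        # Linear discriminant: x*mu/var - mu^2/(2*var) + log(prior)
        return (x * means[c] / variance
                - means[c] ** 2 / (2 * variance)
                + math.log(priors[c]))
    return max(means, key=score)

values = [1.0, 1.5, 2.0, 2.5, 5.0, 5.5, 6.0, 6.5]  # toy feature values
labels = [0, 0, 0, 0, 1, 1, 1, 1]

means, priors, variance = fit_lda_1d(values, labels)
pred_low = predict_lda_1d(2.2, means, priors, variance)
pred_high = predict_lda_1d(5.8, means, priors, variance)
print(pred_low, pred_high)
```

Because the score is linear in `x`, the boundary between two classes with equal priors falls midway between their means.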

## 11.4 Nonlinear Machine Learning Algorithms

This section demonstrates minimal recipes for how to use four nonlinear machine learning algorithms.

### 11.4.1 k-Nearest Neighbors
The k-Nearest Neighbors algorithm (or `KNN`) uses a <a>distance metric to find the k most similar instances in the training data for a new instance and takes the most common class among those neighbors as the prediction</a>. You can construct a KNN model using the KNeighborsClassifier class.

In [5]:
# KNN Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

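# load data (forward slashes avoid invalid escape sequences and work on every platform)
filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# split into input features (first 8 columns) and class label (last column)
X = array[:,0:8]
Y = array[:,8]

# unshuffled folds are deterministic, so random_state is omitted
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = KNeighborsClassifier()

# evaluate the model and report the mean accuracy across folds
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())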

0.7265550239234451


Running the example prints the mean estimated accuracy.
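The neighbor search and vote behind KNN classification can be sketched in a few lines of standard-library Python. The function name `knn_predict` and the toy points are hypothetical, and Euclidean distance is assumed as the metric.

```python
# Sketch of KNN classification: find the k closest points, take a majority vote.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Pair each training row's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

train_X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]  # toy points
train_y = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, [1.5, 1.5])
pred_b = knn_predict(train_X, train_y, [6.5, 6.5])
print(pred_a, pred_b)
```

There is no training step beyond storing the data; all of the work happens at prediction time.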

### 11.4.2 Naive Bayes

Naive Bayes calculates the <a>probability of each class and the conditional probability of each input value given each class. For new data, these probabilities are multiplied together, assuming that the inputs are all independent (a simple or naive assumption)</a>. When working with <a>real-valued data, a Gaussian distribution is assumed so that the probabilities for input variables can be estimated easily using the Gaussian Probability Density Function</a>. You can construct a Naive Bayes model using the GaussianNB class.

In [6]:
# Gaussian Naive Bayes Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

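# load data (forward slashes avoid invalid escape sequences and work on every platform)
filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# split into input features (first 8 columns) and class label (last column)
X = array[:,0:8]
Y = array[:,8]

# unshuffled folds are deterministic, so random_state is omitted
kfold = KFold(n_splits=10)
model = GaussianNB()

# evaluate the model and report the mean accuracy across folds
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())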

0.7551777170198223


Running the example prints the mean estimated accuracy.
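The calculation described for Gaussian Naive Bayes can be sketched with the standard library. The function names and toy data are hypothetical, and log-probabilities are summed instead of multiplying raw probabilities, an assumption made here to avoid numerical underflow.

```python
# Sketch of Gaussian Naive Bayes: class priors plus per-feature Gaussian likelihoods.
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian probability density function."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit_gaussian_nb(X, y):
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        stats = []
        for column in zip(*rows):  # iterate over features
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var))
        model[c] = (len(rows) / len(X), stats)  # (prior, per-feature stats)
    return model

def predict_gaussian_nb(model, row):
    def log_posterior(c):
        prior, stats = model[c]
        # Naive assumption: features are independent, so log-likelihoods add up.
        return math.log(prior) + sum(
            gaussian_log_pdf(x, mean, var) for x, (mean, var) in zip(row, stats)
        )
    return max(model, key=log_posterior)

X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.2], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X, y)
pred_first = predict_gaussian_nb(nb_model, [1.1, 2.0])
pred_second = predict_gaussian_nb(nb_model, [3.1, 4.1])
print(pred_first, pred_second)
```

A feature with zero variance within a class would break this sketch; production implementations add a small smoothing term to the variance for that reason.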

### 11.4.3 Classification and Regression Trees

Classification and Regression Trees (`CART` or just `decision trees`) construct a binary tree from the training data. <a>Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (like the Gini index)</a>. 
You can construct a CART model using the DecisionTreeClassifier class.

In [7]:
# CART Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

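# load data (forward slashes avoid invalid escape sequences and work on every platform)
filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# split into input features (first 8 columns) and class label (last column)
X = array[:,0:8]
Y = array[:,8]

# unshuffled folds are deterministic, so random_state is omitted
kfold = KFold(n_splits=10)
model = DecisionTreeClassifier()

# evaluate the model and report the mean accuracy across folds
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())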

0.7004955570745044


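Running the example prints the mean estimated accuracy. Because no random_state is passed to DecisionTreeClassifier, the figure can vary slightly between runs.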
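The greedy split search described for CART can be sketched as an exhaustive scan over every attribute and every observed value, scoring each candidate by the weighted Gini index of the two groups it creates. The function names and toy data are hypothetical, and this covers only the choice of a single split, not the growth of a full tree.

```python
# Sketch of how a decision tree picks one split by minimizing the Gini index.

def gini_impurity(labels):
    """1 - sum of squared class proportions; 0.0 means a pure group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Return (weighted_gini, feature_index, threshold) of the best split."""
    best = None
    n = len(y)
    for feature in range(len(X[0])):
        for row in X:
            threshold = row[feature]  # every observed value is a candidate
            left = [label for r, label in zip(X, y) if r[feature] < threshold]
            right = [label for r, label in zip(X, y) if r[feature] >= threshold]
            cost = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best

X = [[2.0, 7.0], [3.0, 1.0], [4.0, 6.0], [7.0, 2.0], [8.0, 8.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]

split = best_split(X, y)
print(split)
```

A full tree repeats this search recursively on the left and right groups until a stopping rule, such as a maximum depth or minimum group size, is reached.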

### 11.4.4 Support Vector Machines

Support Vector Machines (or `SVM`) <a>seek a line that best separates two classes. Those data instances that are closest to the line that best separates the classes are called support vectors and influence where the line is placed</a>. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the kernel parameter. A Radial Basis Function (RBF) kernel is used by default. You can construct an SVM model using the SVC class.

In [14]:
# SVM Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# load data (forward slashes avoid invalid escape sequences and work on every platform)
filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# split into input features (first 8 columns) and class label (last column)
X = array[:,0:8]
Y = array[:,8]

# unshuffled folds are deterministic, so random_state is omitted
kfold = KFold(n_splits=10)

# create and configure model

# To maintain the old behavior, you can specify the argument as follows:
# model = SVC(gamma='auto')  # 0.6510252904989747

# To use the new behavior (recommended), you can specify the argument as follows:
model = SVC(gamma='scale')  # 0.7604237867395763

# evaluate the model and report the mean accuracy across folds
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7604237867395763


Running the example prints the mean estimated accuracy.
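The gap between the two `gamma` settings in the recipe above is easier to see with the RBF kernel written out. The sketch below uses only the standard library; the function names and toy rows are hypothetical. The two formulas follow the scikit-learn documentation for `SVC`: `gamma='auto'` is `1 / n_features`, and `gamma='scale'` is `1 / (n_features * X.var())`.

```python
# Sketch of the RBF kernel and the two gamma heuristics used by SVC.
import math

def rbf_kernel(a, b, gamma):
    """Similarity in (0, 1]: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

def gamma_auto(X):
    """gamma='auto': one over the number of features."""
    return 1.0 / len(X[0])

def gamma_scale(X):
    """gamma='scale': one over (number of features * variance of all values)."""
    flat = [v for row in X for v in row]
    mean = sum(flat) / len(flat)
    variance = sum((v - mean) ** 2 for v in flat) / len(flat)
    return 1.0 / (len(X[0]) * variance)

# Toy rows with unscaled features (hypothetical values).
X = [[0.0, 100.0], [10.0, 120.0], [5.0, 110.0]]

k_auto = rbf_kernel(X[0], X[1], gamma_auto(X))
k_scale = rbf_kernel(X[0], X[1], gamma_scale(X))
print(k_auto, k_scale)
```

With unscaled inputs, `gamma='auto'` drives the similarity between distinct rows toward zero, so the model can barely generalize beyond the training points. That is one plausible reason the `gamma='scale'` setting scores higher in the recipe above, although this sketch does not prove it for the diabetes data.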