## Chapter 11 Spot-Check Classification Algorithms

You cannot know beforehand which algorithm will work best on your dataset. You must use trial and error to discover a shortlist of algorithms that do well on your problem, which you can then double down on and tune further. I call this process spot-checking.
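
As a rough sketch of what spot-checking can look like in practice, the loop below evaluates several classifiers on the same cross-validation folds and ranks them by mean accuracy. It assumes a synthetic dataset from `make_classification` (not the chapter's data), and the names `X_demo`, `y_demo`, `candidates` and `scores` are illustrative, so the numbers it prints say nothing about the diabetes problem.

```python
# Spot-check sketch: compare several classifiers on identical CV folds.
# Assumption: synthetic data stands in for a real dataset.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

# candidate algorithms, all with default settings except where noted
candidates = {
    'LR': LogisticRegression(max_iter=200),
    'LDA': LinearDiscriminantAnalysis(),
    'KNN': KNeighborsClassifier(),
    'NB': GaussianNB(),
    'CART': DecisionTreeClassifier(random_state=7),
    'SVM': SVC(),
}

# the same folds are reused for every model so the scores are comparable
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for name, model in candidates.items():
    results = cross_val_score(model, X_demo, y_demo, cv=kfold)
    scores[name] = results.mean()

# print the shortlist, best mean accuracy first
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

The algorithms near the top of such a ranking are the ones worth tuning further.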



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

#### 1. Linear Machine Learning Algorithms

##### (1) Logistic Regression

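Logistic regression is commonly described as assuming a Gaussian distribution for the numeric input variables, although strictly it only assumes that the log-odds of the class are a linear function of the inputs. It can model *binary classification* problems.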

In [3]:
# logistic regression classification
from sklearn.linear_model import LogisticRegression

filename = 'data/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(filename, names=names)
array = df.values
# separate the eight input columns (X) from the class label (Y)
X = array[:,:-1]
Y = array[:,-1]
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=200)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7721633629528366
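
To make the model behind this score more concrete: a fitted binary logistic regression computes a weighted sum of the inputs and passes it through the sigmoid function to obtain a class probability. The sketch below checks that relationship against scikit-learn's `predict_proba`. It assumes synthetic data from `make_classification` rather than the diabetes file, and the variable names are illustrative.

```python
# Sketch: reproduce logistic regression probabilities by hand.
# Assumption: synthetic binary data instead of the chapter's dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=7)
clf = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

# linear score w.x + b for every row
linear_score = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# the sigmoid maps the linear score to a probability for class 1
manual_proba = 1.0 / (1.0 + np.exp(-linear_score))
sklearn_proba = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```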


##### (2) Linear Discriminant Analysis

Linear Discriminant Analysis or **LDA** is a statistical technique for *binary and multiclass classification*. It too assumes a Gaussian distribution for the numerical input variables.

In [4]:
# LDA classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7669685577580315
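
The diabetes problem is binary, so it does not exercise the multiclass side of LDA. As a small illustration, the sketch below applies the same evaluation recipe to the three-class iris data bundled with scikit-learn. The choice of dataset is an assumption made for this example and is not part of the chapter.

```python
# Sketch: LDA on a three-class problem (iris), same 10-fold recipe.
# Assumption: iris is used only to illustrate multiclass support.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
lda = LinearDiscriminantAnalysis()
iris_results = cross_val_score(lda, X_iris, y_iris, cv=kfold)
print(iris_results.mean())
```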


#### 2. Nonlinear Machine Learning Algorithms

##### (1) k-Nearest Neighbors

The k-Nearest Neighbors algorithm, or **KNN**, uses a distance metric to find the k most similar instances in the training data for a new instance and takes the most common class among those neighbors as the prediction (for regression, the mean outcome of the neighbors is used instead).

In [5]:
# KNN classification
from sklearn.neighbors import KNeighborsClassifier

num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = KNeighborsClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7109876965140123
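
The mechanics of KNN are simple enough to write out by hand. The sketch below is a minimal version using Euclidean distance and a majority vote, compared against `KNeighborsClassifier` with its default of 5 neighbors. The synthetic data and the helper `knn_predict` are assumptions made for illustration; scikit-learn's own implementation uses faster search structures.

```python
# Sketch: k-nearest neighbors classification by hand.
# Assumptions: synthetic data, Euclidean distance, k=5, integer class labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=7)
X_train, y_train = X_demo[:150], y_demo[:150]
X_new = X_demo[150:]

def knn_predict(X_train, y_train, x, k=5):
    """Predict the class of one instance x by majority vote of its k nearest neighbors."""
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]      # indices of the k closest training rows
    votes = np.bincount(y_train[nearest])    # count of each class among the neighbors
    return votes.argmax()

manual = np.array([knn_predict(X_train, y_train, x) for x in X_new])
reference = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_new)
print((manual == reference).all())
```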


##### (2) Naive Bayes

Naive Bayes calculates the prior probability of each class and the conditional probability of each input value given each class. For new data, these probabilities are multiplied together, assuming that the inputs are all independent given the class (a simple or naive assumption), and the class with the largest product is predicted. When working with real-valued data, a Gaussian distribution is assumed so that the probabilities for input variables can be estimated easily using the Gaussian Probability Density Function.

In [6]:
# Gaussian Naive Bayes classification
from sklearn.naive_bayes import GaussianNB

num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7591421736158578
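
The calculation described above can be sketched directly with NumPy. The example below estimates a per-class mean and variance for each input, sums the log of the Gaussian densities with the log prior (adding logs is equivalent to multiplying probabilities), and picks the class with the highest total. It assumes synthetic data, and it ignores the small variance smoothing term that `GaussianNB` adds, so agreement with scikit-learn is expected to be very close rather than guaranteed bit for bit.

```python
# Sketch: Gaussian Naive Bayes by hand.
# Assumptions: synthetic data; no variance smoothing (GaussianNB adds a tiny one).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=7)

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian probability density function, elementwise."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

classes = np.unique(y_demo)
log_posteriors = []
for c in classes:
    X_c = X_demo[y_demo == c]
    log_prior = np.log(len(X_c) / len(X_demo))        # P(class)
    mean, var = X_c.mean(axis=0), X_c.var(axis=0)     # per-input Gaussian parameters
    # independence assumption: per-input log densities are simply added
    log_likelihood = gaussian_log_pdf(X_demo, mean, var).sum(axis=1)
    log_posteriors.append(log_prior + log_likelihood)

manual = classes[np.argmax(np.column_stack(log_posteriors), axis=1)]
reference = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
print((manual == reference).mean())
```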


##### (3) Classification and Regression Trees

Classification and Regression Trees (**CART or just decision trees**) construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (like the **Gini index**).

In [7]:
# CART classification
from sklearn.tree import DecisionTreeClassifier

num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.6992993848257005
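
The greedy split search can be illustrated on a toy dataset. The sketch below computes the Gini index for every candidate threshold of every attribute and keeps the split with the lowest weighted cost. The helpers `gini` and `best_split` and the toy arrays are hypothetical and written for this example; this is a simplified picture of one split, not scikit-learn's actual implementation.

```python
# Sketch: choose a single CART split by minimizing the weighted Gini index.
# Assumption: toy data and helper names are made up for illustration.
import numpy as np

def gini(labels):
    """Gini index of a set of class labels: 1 - sum(p_k^2)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Evaluate every attribute and every midpoint between its sorted values."""
    best = (None, None, np.inf)
    n = len(y)
    for feature in range(X.shape[1]):
        values = np.unique(X[:, feature])
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = y[X[:, feature] <= t]
            right = y[X[:, feature] > t]
            # cost of a split: size-weighted average of the two child Gini values
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if cost < best[2]:
                best = (feature, t, cost)
    return best

# attribute 0 is uninformative; attribute 1 separates the classes cleanly
X_toy = np.array([[5.0, 1.0], [1.0, 2.0], [4.0, 3.0],
                  [2.0, 10.0], [6.0, 11.0], [3.0, 12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

feature, threshold, cost = best_split(X_toy, y_toy)
print(feature, threshold, cost)  # best split: attribute 1 at 6.5, cost 0.0
```

A full tree repeats this search recursively on the left and right subsets until a stopping rule is met.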


##### (4) Support Vector Machines

Support Vector Machines, or **SVM**, seek a line (a hyperplane when there are more than two inputs) that best separates two classes. Those data instances that are closest to the line that best separates the classes are called support vectors and influence where the line is placed. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the `kernel` parameter. A powerful **Radial Basis Function** kernel is used by default.

In [9]:
# SVM classification
from sklearn.svm import SVC

num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = SVC()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.760457963089542
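
To show why the `kernel` parameter matters, the sketch below compares three kernels on a dataset where the classes form concentric circles, so no straight line can separate them. It assumes synthetic data from `make_circles`, which is not part of the chapter, and the scores are illustrative only.

```python
# Sketch: effect of the SVC kernel parameter on non-linearly separable data.
# Assumption: concentric-circle data generated with make_circles.
from sklearn.datasets import make_circles
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X_circ, y_circ = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf']:
    results = cross_val_score(SVC(kernel=kernel), X_circ, y_circ, cv=kfold)
    kernel_scores[kernel] = results.mean()
    print(kernel, round(kernel_scores[kernel], 3))
```

On data like this the default RBF kernel should clearly outperform the linear kernel, which is one reason it is worth spot-checking more than one kernel.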
