Spot-checking is a way of discovering which algorithms perform well on your machine learning problem.
### 11.1 Algorithm Spot-Checking
You must use trial and error to discover a short-list of algorithms that do well on your problem that you can then double down on and tune further. I call this process spot-checking.

- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
- Try a mixture of modeling types (e.g. linear and nonlinear functions or parametric and nonparametric).
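
The suggestions above can be sketched as a small spot-checking harness. This is only an illustration under stated assumptions: the `spot_check` helper and the `models` dictionary are hypothetical names, the data is a synthetic stand-in generated with `make_classification` rather than a real dataset, and the decision tree is included only to represent the "trees" family mentioned in the list.

```python
# Spot-check several algorithm families on one dataset (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def spot_check(models, X, y, num_folds=10):
    """Return {name: mean cross-validated accuracy} for each model (hypothetical helper)."""
    kfold = KFold(n_splits=num_folds)
    return {name: cross_val_score(model, X, y, cv=kfold).mean()
            for name, model in models.items()}


# Synthetic stand-in data (an assumption, not a real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# A mixture of representations and modeling types.
models = {
    "LR": LogisticRegression(solver="liblinear"),    # linear
    "LDA": LinearDiscriminantAnalysis(),             # linear
    "KNN": KNeighborsClassifier(),                   # instance-based, nonlinear
    "NB": GaussianNB(),                              # probabilistic
    "CART": DecisionTreeClassifier(random_state=7),  # tree-based, nonlinear
}

scores = spot_check(models, X, y)
# Print the candidates from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The algorithms at the top of the printed ranking would be the short-list to tune further; the ranking itself depends entirely on the data used.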

### 11.3 Linear Machine Learning Algorithms
#### 11.3.1 Logistic Regression
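Logistic regression models binary classification problems by passing a linear combination of the numeric input variables through the logistic function. It is often presented as assuming a Gaussian distribution for those inputs; strictly, the model makes no such assumption, although heavily skewed or unscaled inputs can make fitting harder.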

In [1]:
# Logistic Regression Classification
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the dataset: 8 numeric input columns followed by the class column
url = "http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]  # input variables
Y = array[:, 8]    # class labels

# 10-fold cross-validation; folds are taken in order (no shuffling), so no seed is needed
num_folds = 10
kfold = KFold(n_splits=num_folds)
# liblinear was the default solver in the scikit-learn release that shipped sklearn.cross_validation
model = LogisticRegression(solver="liblinear")
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())



0.7695146958304853
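
To make the model behind this result more concrete, here is a minimal sketch of how a fitted logistic regression turns one row of inputs into a class probability. The weights, bias, and input row are made-up values for illustration, and `predict_proba` is a hypothetical helper, not the scikit-learn method of the same name.

```python
import math


def predict_proba(weights, bias, row):
    """Probability of the positive class for one row of numeric inputs."""
    # Linear combination of the inputs
    z = bias + sum(w * x for w, x in zip(weights, row))
    # The logistic function squashes z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and input row, chosen only for illustration
weights = [0.8, -0.4]
bias = -0.2
row = [1.5, 2.0]

probability = predict_proba(weights, bias, row)
label = 1 if probability >= 0.5 else 0  # threshold at 0.5 for a binary decision
print(probability, label)
```

Training amounts to choosing the weights and bias that make these probabilities agree with the observed class labels as closely as possible.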


#### 11.3.2 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a statistical technique for binary and multiclass classification. It assumes a Gaussian distribution for the numerical input variables within each class, with a covariance matrix shared across classes.

In [1]:
# LDA Classification
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the dataset: 8 numeric input columns followed by the class column
url = "http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]  # input variables
Y = array[:, 8]    # class labels

# 10-fold cross-validation; folds are taken in order (no shuffling), so no seed is needed
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())



0.773462064251538


### 11.4 Nonlinear Machine Learning Algorithms
#### 11.4.1 K-Nearest Neighbors
K-Nearest Neighbors (or KNN) uses a distance metric to find the K most similar instances in the training data for a new instance and summarizes the outcomes of those neighbors as the prediction: a majority vote for classification, or the mean for regression.

In [2]:
# KNN Classification
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset: 8 numeric input columns followed by the class column
url = "http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]  # input variables
Y = array[:, 8]    # class labels

# 10-fold cross-validation; folds are taken in order (no shuffling), so no seed is needed
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = KNeighborsClassifier()  # uses 5 neighbors by default
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7265550239234451
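
The neighbor search described at the start of this section can be sketched in a few lines of plain Python. This is a simplified illustration, not how scikit-learn implements it: it assumes Euclidean distance, combines the neighbors' class labels by a majority vote (the usual choice for classification), and uses a hypothetical `knn_predict` helper with made-up toy data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every training instance
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_X, train_y)]
    # Keep the k closest instances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' class labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters in two dimensions
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # query near the first cluster
print(knn_predict(train_X, train_y, (5.1, 5.0)))  # query near the second cluster
```

Because the prediction depends directly on distances, inputs measured on very different scales can dominate the result, which is one reason rescaling is often tried alongside KNN.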


#### 11.4.2 Naive Bayes
Naive Bayes calculates the prior probability of each class and the conditional probability of each input value given each class.

These probabilities are estimated for new data and multiplied together, assuming that they are all independent (a simple or naive assumption). 

When working with real-valued data, a Gaussian distribution is assumed to easily estimate the probabilities for input variables using the Gaussian Probability Density Function. 

In [4]:
# Gaussian Naive Bayes Classification
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Load the dataset: 8 numeric input columns followed by the class column
url = "http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]  # input variables
Y = array[:, 8]    # class labels

# 10-fold cross-validation; folds are taken in order (no shuffling), so no seed is needed
num_folds = 10
kfold = KFold(n_splits=num_folds)
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7551777170198223
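
The calculation described at the start of this section can be sketched from scratch as follows. This is an illustrative sketch rather than scikit-learn's implementation: `fit`, `predict`, and `gaussian_log_pdf` are hypothetical helpers, the toy data is made up, and the probabilities are combined by summing logarithms, which is equivalent to multiplying them but avoids numerical underflow.

```python
import math
from collections import defaultdict


def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian probability density of x for a given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - ((x - mean) ** 2) / (2.0 * var)


def fit(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant guards against zero variance
        model[label] = (prior, stats)
    return model


def predict(model, row):
    """Return the class with the largest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            # Adding logs treats the inputs as independent (the naive assumption)
            score += gaussian_log_pdf(x, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): two well-separated classes with two numeric inputs
X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.9, 5.0), (4.1, 5.2), (4.0, 4.8)]
y = [0, 0, 0, 1, 1, 1]

model = fit(X, y)
print(predict(model, (1.1, 2.0)))  # row close to the class 0 examples
print(predict(model, (4.0, 5.1)))  # row close to the class 1 examples
```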
