# Algorithm Comparison

## Things to note:
1. Consistent comparison: ensure all algorithms are evaluated on the *same data*, in the *same way*!
2. Use resampling methods, like *cross validation*.
3. Look at the estimated accuracy in more than one way (visualise various metrics).
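
To make points 1 and 2 concrete, here is a minimal hand-rolled sketch of how k-fold cross validation partitions sample indices. The helper `k_fold_indices` is hypothetical and written only for illustration; it is not the scikit-learn implementation, although it follows the same idea of consecutive, unshuffled folds.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for consecutive, unshuffled folds.

    Hypothetical helper for illustration only.
    """
    # The first (n_samples % n_splits) folds receive one extra sample.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Build the folds once and reuse them for every algorithm being compared,
# so that all algorithms are trained and tested on identical splits.
folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold, and because the list of folds is fixed, any two algorithms scored against it are compared on the same data in the same way.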

## Classification

We are going to use the Pima Indians dataset (from Lecture 6):

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

In [7]:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

filename = "../Lecture_6-AppliedMachineLearning/data/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)

# Split the data into the input features (X, first 8 columns) and the class label (Y, last column).
array = df.values
X = array[:,0:8]
Y = array[:,8]
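
As an alternative to positional slicing, the same split can be expressed with column labels, which may be easier to read and less sensitive to column order. The sketch below is self-contained: it uses two toy rows in an in-memory CSV instead of the real file, and the names `toy_csv`, `toy_df`, `X_toy` and `Y_toy` are assumptions made for this example.

```python
import io

import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Two toy rows in the same 9-column layout, standing in for the real CSV file.
toy_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
toy_df = pd.read_csv(toy_csv, names=names)

# Select the features and the target by column label rather than by position.
X_toy = toy_df.drop(columns='class').to_numpy()
Y_toy = toy_df['class'].to_numpy()

print(X_toy.shape, Y_toy.shape)
```

Unlike `df.values` followed by slicing, this keeps the label column in its own integer dtype instead of converting everything to floats.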

### We are going to compare a set of algorithms:
1. Logistic Regression
2. Linear Discriminant Analysis
3. k-Nearest Neighbors
4. Decision Trees
5. Naive Bayes
6. Support Vector Machines

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Create a list, with one item per algorithm. Each item has a name, and a classifier object.
models = []
models.append(('LR',  LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('kNN', KNeighborsClassifier()))
models.append(('DT',  DecisionTreeClassifier()))
models.append(('NB',  GaussianNB()))
models.append(('SVM', SVC()))

In [9]:
# We are going to evaluate all classifiers, and store results in two lists:
results = []
names   = []  # Note: this rebinds 'names', which previously held the column names.
# The scoring function to use
scoring = 'accuracy'
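
Accuracy is only one view of performance. As a sketch of how several metrics could be collected in one pass, the example below uses scikit-learn's `cross_validate` with a list of scorer names. It runs on synthetic data from `make_classification` rather than the Pima dataset, and the names `X_demo`, `Y_demo` and `metrics` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data (an assumption, not the Pima dataset).
X_demo, Y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

metrics = ['accuracy', 'f1', 'roc_auc']
kfold = KFold(n_splits=10)

# cross_validate returns a dict with one 'test_<metric>' array per scorer,
# each holding one score per fold.
scores = cross_validate(LogisticRegression(max_iter=1000), X_demo, Y_demo,
                        cv=kfold, scoring=metrics)

for metric in metrics:
    fold_scores = scores['test_' + metric]
    print("%8s: %f (+/- %f)" % (metric, fold_scores.mean(), fold_scores.std()))
```

The synthetic scores say nothing about the Pima results; the point is only the shape of the call and of the returned dictionary.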

In [13]:
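# Build the folds once so that every model is scored on exactly the same splits.
# Without shuffle=True the folds are deterministic, so no random_state is needed
# (recent scikit-learn versions raise an error if random_state is set without shuffling).
kfold = KFold(n_splits=10)
for name, model in models:
  cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
  results.append(cv_results)
  names.append(name)
  print("%3s: %f (+/- %f)" % (name, cv_results.mean(), cv_results.std()))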

 LR: 0.769515 (+/- 0.048411)
LDA: 0.773462 (+/- 0.051592)
kNN: 0.726555 (+/- 0.061821)
 DT: 0.696548 (+/- 0.060179)
 NB: 0.755178 (+/- 0.042766)
SVM: 0.651025 (+/- 0.072141)
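
The mean and standard deviation hide how the per-fold scores are distributed. One common way to visualise them is a box plot with one box per algorithm. The sketch below is self-contained and therefore uses synthetic data and only two classifiers; `demo_models`, `demo_results` and `demo_names` are assumptions that mirror the `models`, `results` and `names` lists above, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; drop this line inside a notebook.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, not the Pima dataset).
X_demo, Y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_models = [('NB', GaussianNB()),
               ('DT', DecisionTreeClassifier(random_state=0))]

# Score every model on the same 10 folds and keep the per-fold accuracies.
kfold = KFold(n_splits=10)
demo_names = [name for name, _ in demo_models]
demo_results = [cross_val_score(model, X_demo, Y_demo, cv=kfold, scoring='accuracy')
                for _, model in demo_models]

# One box per algorithm, drawn from its 10 per-fold accuracy scores.
fig, ax = plt.subplots()
ax.boxplot(demo_results)
ax.set_xticks(range(1, len(demo_names) + 1))
ax.set_xticklabels(demo_names)
ax.set_ylabel('Accuracy')
ax.set_title('Algorithm comparison (synthetic data)')
fig.savefig('algorithm_comparison.png')
```

With the lists built in the loop above, the same plotting calls should work by passing `results` and `names` in place of `demo_results` and `demo_names`.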
