# Compare Machine Learning Algorithms
It is important to compare the performance of multiple machine learning algorithms consistently, using the same data and the same evaluation procedure for each one. This notebook covers:

1. How to formulate an experiment to directly compare machine learning algorithms.
2. A reusable template for evaluating the performance of multiple algorithms on one dataset.
3. How to report and visualize the results when comparing algorithm performance.

## Choose The Best Machine Learning Model
When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics.

Using resampling methods like cross-validation, you can get an estimate of how accurate each model may be on unseen data. You need to be able to use these estimates to choose the one or two best models from the suite of models that you have created.
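To make the resampling idea concrete, here is a minimal sketch of what k-fold cross-validation does for a single model. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Pima dataset loaded later), and the names `X_demo`, `y_demo` and `fold_scores` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the Pima dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=5)  # 5 consecutive folds, no shuffling
fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Fit on the training folds, then score on the single held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

manual_mean = np.mean(fold_scores)

# cross_val_score runs the same fit/score loop in a single call
auto_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_demo, y_demo, cv=kfold, scoring='accuracy')
print(manual_mean, auto_scores.mean())
```

Each entry in `fold_scores` is the accuracy on data the model did not see during fitting, which is why the average serves as an estimate of performance on unseen data.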

When you have a new dataset, it is a good idea to visualize the data using different techniques in order to look at the data from different perspectives.

The same idea applies to model selection. You should look at the estimated accuracy of your machine learning algorithms in a number of different ways in order to choose the one or two algorithms to finalize.

A way to do this is to use visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.
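One possible visualization is a box plot of the per-fold accuracies, which shows the median, spread and outliers for each algorithm side by side. The sketch below is a hedged illustration on synthetic data (an assumption, so that it runs on its own); `demo_models`, `all_scores` and the output file name are hypothetical names chosen for this example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('CART', DecisionTreeClassifier(random_state=0)),
    ('NB', GaussianNB()),
]

# Keep every fold's accuracy, not just the mean, so the distribution can be plotted
kfold = KFold(n_splits=10)
all_scores = {}
for name, model in demo_models:
    all_scores[name] = cross_val_score(model, X_demo, y_demo,
                                       cv=kfold, scoring='accuracy')

# One box per algorithm; boxes are drawn at x positions 1..n
fig, ax = plt.subplots()
ax.boxplot(list(all_scores.values()))
ax.set_xticks(range(1, len(all_scores) + 1))
ax.set_xticklabels(list(all_scores.keys()))
ax.set_ylabel('Accuracy')
ax.set_title('Algorithm Comparison')
fig.savefig('algorithm_comparison.png')
```

Because every model is scored on the same folds, differences between the boxes reflect the algorithms rather than the data split.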

In [1]:
# Pima Indians Diabetes Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [2]:
#Loading dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('pima-indians-diabetes.data',names=names)

# separate array into input and output components
X = df.drop('class',axis='columns')
Y = df['class']

In [3]:
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

In [8]:
# evaluate each model in turn
result = []
namess = []

In [9]:
for name, model in models:
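    # Without shuffle=True the folds are already deterministic; recent
    # scikit-learn versions raise an error if random_state is set without shuffling
    kfold = KFold(n_splits=10)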
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    result.append(cv_results.mean())
    namess.append(name)

In [10]:
namess

['LR', 'LDA', 'KNN', 'CART', 'NB', 'SVM']

In [11]:
result

[0.76951469583048526,
 0.77346206425153796,
 0.72655502392344506,
 0.69002050580997953,
 0.75517771701982228,
 0.65102529049897473]

In [12]:
resdict = pd.DataFrame.from_dict(dict(zip(namess,result)),orient='index')
resdict.sort_values(0,ascending=False)

             0
LDA   0.773462
LR    0.769515
NB    0.755178
KNN   0.726555
CART  0.690021
SVM   0.651025
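The table above ranks models by mean accuracy only, and the top entries (LDA and LR) are very close. A report that also includes the standard deviation across folds can help judge whether such a gap is meaningful. The following is a minimal sketch of that kind of summary, again on synthetic data (an assumption); `demo_models`, `rows` and `summary` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('NB', GaussianNB()),
]

kfold = KFold(n_splits=10)
rows = []
for name, model in demo_models:
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring='accuracy')
    # Record both the centre and the spread of the per-fold accuracies
    rows.append({'model': name,
                 'mean_accuracy': scores.mean(),
                 'std_accuracy': scores.std()})

# One row per model, best mean accuracy first
summary = (pd.DataFrame(rows)
           .set_index('model')
           .sort_values('mean_accuracy', ascending=False))
print(summary)
```

When the difference between two means is small relative to their standard deviations, it is reasonable to treat the models as comparable and choose on other grounds, such as simplicity.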
