# Algorithm Selection

This notebook applies a set of algorithms to a dataset and compares their 10-fold cross-validation scores, aiming to select the one with the best performance.

We are going to use the Pima Indians dataset (from Lecture 6):

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

In [2]:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

filename = "../datasets/pima_indians_diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
preg     768 non-null int64
plas     768 non-null int64
pres     768 non-null int64
skin     768 non-null int64
test     768 non-null int64
mass     768 non-null float64
pedi     768 non-null float64
age      768 non-null int64
class    768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [3]:
array = df.values
X = array[:,0:8]
Y = array[:,8]
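# No shuffling: the folds are consecutive blocks of rows. random_state is omitted because
# it has no effect without shuffle=True, and recent scikit-learn versions raise an error for it.
kfold = KFold(n_splits=10)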
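
Since `KFold` is used here without shuffling, each test fold is a consecutive block of rows. The sketch below reproduces that partitioning in plain Python to show how 768 rows are divided into 10 folds. It is only an illustration, not scikit-learn's own code, and `contiguous_folds` is a hypothetical helper name.

```python
def contiguous_folds(n_samples, n_splits):
    """Return (start, stop) row ranges of the test folds for unshuffled k-fold."""
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + 1 if i < extra else base
        folds.append((start, start + size))
        start += size
    return folds

folds = contiguous_folds(768, 10)
sizes = [stop - start for start, stop in folds]
print(sizes)      # eight folds of 77 rows, then two folds of 76 rows
print(folds[0])   # rows 0..76 form the first test fold
```

Each model is therefore trained on roughly 691 rows and tested on the remaining 76 or 77, ten times over, and `cross_val_score` returns one score per fold.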

## Logistic Regression

In [7]:
from sklearn.linear_model import LogisticRegression
results = cross_val_score(LogisticRegression(), X, Y, cv=kfold)
print(results.mean())  # prints the mean estimated accuracy

0.7695146958304853




## Linear Discriminant Analysis

In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
results = cross_val_score(LinearDiscriminantAnalysis(), X, Y, cv=kfold)
print(results.mean())

0.773462064251538


## k-Nearest Neighbors

In [9]:
from sklearn.neighbors import KNeighborsClassifier
results = cross_val_score(KNeighborsClassifier(), X, Y, cv=kfold)
print(results.mean())

0.7265550239234451


## Naive Bayes

In [10]:
from sklearn.naive_bayes import GaussianNB
results = cross_val_score(GaussianNB(), X, Y, cv=kfold)
print(results.mean())

0.7551777170198223


## Decision Trees

In [11]:
from sklearn.tree import DecisionTreeClassifier
results = cross_val_score(DecisionTreeClassifier(), X, Y, cv=kfold)
print(results.mean())

0.6925495557074505


## Support Vector Machines

In [12]:
from sklearn.svm import SVC
results = cross_val_score(SVC(), X, Y, cv=kfold)
print(results.mean())



0.6510252904989747
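
To make the selection step explicit, the mean accuracies printed above can be collected and ranked. This is a small sketch in plain Python: the numbers are copied from the cell outputs, and the variable names are our own.

```python
# Mean 10-fold accuracies copied from the cell outputs above.
accuracy = {
    "Logistic Regression": 0.7695146958304853,
    "Linear Discriminant Analysis": 0.773462064251538,
    "k-Nearest Neighbors": 0.7265550239234451,
    "Naive Bayes": 0.7551777170198223,
    "Decision Trees": 0.6925495557074505,
    "Support Vector Machines": 0.6510252904989747,
}

# Higher accuracy is better, so sort in descending order.
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:30s} {score:.3f}")

best_name, best_score = ranking[0]
```

Linear Discriminant Analysis comes out on top, but only about 0.004 ahead of Logistic Regression. A gap that small may well lie within the fold-to-fold variation, so it is worth looking at `results.std()` as well before declaring a winner.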




# What about Regression?

We are going to use the "Boston House Price" dataset:

http://lib.stat.cmu.edu/datasets/boston

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

https://www.kaggle.com/vikrishnan/boston-house-prices

There are 14 attributes in each case of the dataset. They are:

 * CRIM - per capita crime rate by town
 * ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
 * INDUS - proportion of non-retail business acres per town.
 * CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
 * NOX - nitric oxides concentration (parts per 10 million)
 * RM - average number of rooms per dwelling
 * AGE - proportion of owner-occupied units built prior to 1940
 * DIS - weighted distances to five Boston employment centres
 * RAD - index of accessibility to radial highways
 * TAX - full-value property-tax rate per \$10,000
 * PTRATIO - pupil-teacher ratio by town
 * B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
 * LSTAT - % lower status of the population
 * MEDV - median value of owner-occupied homes in \$1000's

In [13]:
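# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
boston = load_boston()

# Combine the 13 feature columns and the target (MEDV) into one data frame
df = pd.DataFrame(data=np.c_[boston['data'], boston['target']],
                  columns=boston['feature_names'].tolist() + ['target'])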
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
target     506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB


In [14]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
array = df.values
X = array[:,0:13]
Y = array[:,13]
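# random_state is omitted: it has no effect without shuffle=True,
# and recent scikit-learn versions raise an error for it.
kfold = KFold(n_splits=10)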

## Linear Regression

In [15]:
from sklearn.linear_model import LinearRegression
results = cross_val_score(LinearRegression(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)

-34.705255944524815
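
The score is negative because scikit-learn scorers follow a "higher is better" convention, so the mean squared error is reported with its sign flipped. The sketch below shows the calculation on three made-up values. The function is our own plain-Python illustration, not the library's implementation, and the prices are hypothetical.

```python
def neg_mean_squared_error(y_true, y_pred):
    """Negated mean of the squared residuals, so that larger values are better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -sum(squared) / len(squared)

# Hypothetical house prices in $1000's, for illustration only.
y_true = [20.0, 30.0, 40.0]
y_pred = [21.0, 29.0, 42.0]

score = neg_mean_squared_error(y_true, y_pred)
print(score)  # -2.0, i.e. -(1 + 1 + 4) / 3
```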


## Ridge Regression

In [16]:
from sklearn.linear_model import Ridge
results = cross_val_score(Ridge(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)

-34.07824620925929


## Lasso Regression

In [17]:
from sklearn.linear_model import Lasso
results = cross_val_score(Lasso(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)

-34.46408458830233


## ElasticNet Regression
ElasticNet combines the L1 penalty of Lasso with the L2 penalty of Ridge regression.
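
The combination can be written out explicitly. As far as we understand the scikit-learn documentation, `ElasticNet` minimizes the objective below, where $n$ is the number of samples, $\alpha$ is the `alpha` parameter (default 1.0) and $\rho$ is `l1_ratio` (default 0.5). The symbols $w$, $\alpha$ and $\rho$ are our notation.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With $\rho = 1$ only the L1 term remains, which gives the Lasso penalty. With $\rho = 0$ only the L2 term remains, which gives a ridge-style penalty.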

In [18]:
from sklearn.linear_model import ElasticNet
results = cross_val_score(ElasticNet(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)

-31.164573714249762


## k-Nearest Neighbors

In [19]:
from sklearn.neighbors import KNeighborsRegressor
results = cross_val_score(KNeighborsRegressor(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)

-107.28683898039215


## Classification and Regression Trees

In [20]:
from sklearn.tree import DecisionTreeRegressor
results = cross_val_score(DecisionTreeRegressor(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)

-38.71028470588235


## Support Vector Machines

In [21]:
from sklearn.svm import SVR
results = cross_val_score(SVR(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # prints the negated mean squared error (closer to 0 is better)



-91.04782433324428
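
As with the classifiers, the regression scores can be ranked to pick a model. The sketch below uses the values printed in the cells above. Because the scores are negated errors, the best model is still the one with the largest value. Taking the square root of the positive MSE gives an RMSE in the units of the target, i.e. thousands of dollars. The variable names are our own.

```python
import math

# Mean negated MSE values copied from the cell outputs above.
neg_mse = {
    "Linear Regression": -34.705255944524815,
    "Ridge Regression": -34.07824620925929,
    "Lasso Regression": -34.46408458830233,
    "ElasticNet Regression": -31.164573714249762,
    "k-Nearest Neighbors": -107.28683898039215,
    "Classification and Regression Trees": -38.71028470588235,
    "Support Vector Machines": -91.04782433324428,
}

# Largest (least negative) score first.
ranking = sorted(neg_mse.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    rmse = math.sqrt(-score)  # flip the sign back before taking the root
    print(f"{name:36s} MSE={-score:7.2f}  RMSE={rmse:5.2f}")

best_name = ranking[0][0]
best_rmse = math.sqrt(-neg_mse[best_name])
```

ElasticNet has the lowest error here, with an RMSE of roughly 5.6, which corresponds to a typical error of about \$5,600 in the median home value. All models were run with default parameters, so the ranking could change after tuning.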


