# Algorithm Selection

This notebook applies a set of algorithms to each dataset and compares their cross-validated scores, aiming to select the one with the best performance.

We are going to use the Pima Indians dataset (from Lecture 6):

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

In [None]:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

filename = "../Lecture_6-AppliedMachineLearning/data/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)
df.info()

In [None]:
array = df.values
X = array[:,0:8]
Y = array[:,8]
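# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set without shuffling.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)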
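
As a rough illustration of what the 10-fold splitter does, the sketch below partitions row indices only. It assumes the 768 rows of the Pima data; `X_demo`, `kfold_demo`, and `fold_sizes` are illustrative names, not part of the notebook, and `shuffle=True` is an assumption made so that `random_state` is accepted.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 768-row Pima feature matrix; only the row count matters here.
X_demo = np.zeros((768, 8))

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)
fold_sizes = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # Each row appears in exactly one test fold; the other rows form the training set.
    fold_sizes.append((len(train_idx), len(test_idx)))

for i, (n_train, n_test) in enumerate(fold_sizes, start=1):
    print(f"fold {i}: train={n_train}, test={n_test}")
```

`cross_val_score` fits the model once per fold on the training rows and scores it on the held-out rows, which is why it returns one score per fold.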

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
results = cross_val_score(LogisticRegression(), X, Y, cv=kfold)
print(results.mean())  # mean estimated accuracy across the folds

## Linear Discriminant Analysis

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
results = cross_val_score(LinearDiscriminantAnalysis(), X, Y, cv=kfold)
print(results.mean())

## k-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
results = cross_val_score(KNeighborsClassifier(), X, Y, cv=kfold)
print(results.mean())

## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
results = cross_val_score(GaussianNB(), X, Y, cv=kfold)
print(results.mean())

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier
results = cross_val_score(DecisionTreeClassifier(), X, Y, cv=kfold)
print(results.mean())

## Support Vector Machines

In [None]:
from sklearn.svm import SVC
results = cross_val_score(SVC(), X, Y, cv=kfold)
print(results.mean())
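
The per-algorithm cells above can also be collected into one loop so that the scores are compared side by side. This is a minimal sketch, not the notebook's own method: it runs on synthetic data from `make_classification` so that it is self-contained, and the names `models`, `scores`, and `best_name`, as well as `max_iter=1000` and the fixed `random_state` values, are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in the Pima X and Y to reproduce the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=7),
    "SVM": SVC(),
}

cv = KFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    # Keep both the mean and the spread; a high mean with a large spread is less convincing.
    scores[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {scores[name][0]:.3f} (+/- {scores[name][1]:.3f})")

# Accuracy is "higher is better", so the best model has the largest mean.
best_name = max(scores, key=lambda name: scores[name][0])
print("best:", best_name)
```

Reusing the same `cv` object for every model means each algorithm is evaluated on identical folds, which keeps the comparison fair.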

# What about Regression?

We are going to use the "Boston House Price" dataset:

http://lib.stat.cmu.edu/datasets/boston

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

https://www.kaggle.com/vikrishnan/boston-house-prices

There are 14 attributes in each case of the dataset. They are:

 * CRIM - per capita crime rate by town
 * ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
 * INDUS - proportion of non-retail business acres per town
 * CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
 * NOX - nitric oxides concentration (parts per 10 million)
 * RM - average number of rooms per dwelling
 * AGE - proportion of owner-occupied units built prior to 1940
 * DIS - weighted distances to five Boston employment centres
 * RAD - index of accessibility to radial highways
 * TAX - full-value property-tax rate per \$10,000
 * PTRATIO - pupil-teacher ratio by town
 * B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
 * LSTAT - % lower status of the population
 * MEDV - median value of owner-occupied homes in \$1000s

In [None]:
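import numpy as np
import pandas as pd

# load_boston was removed in scikit-learn 1.2, so read the raw file from the
# CMU URL listed above instead. Each record spans two physical lines.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df = pd.DataFrame(data=np.c_[data, target], columns=feature_names + ['target'])
df.info()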

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
array = df.values
X = array[:,0:13]
Y = array[:,13]
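# shuffle=True is required for random_state to take effect (and to avoid a ValueError).
kfold = KFold(n_splits=10, shuffle=True, random_state=7)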

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
results = cross_val_score(LinearRegression(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error (closer to 0 is better)
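
The score is negative because scikit-learn scorers follow a "higher is better" convention, so the mean squared error is reported with its sign flipped. The sketch below shows one way to turn it back into an error in the units of the target. It uses synthetic data from `make_regression` as a stand-in (an assumption); `neg_mse`, `mse`, and `rmse` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with 13 features, like the Boston data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

cv = KFold(n_splits=10, shuffle=True, random_state=7)
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=cv, scoring="neg_mean_squared_error")

mse = -neg_mse.mean()   # flip the sign to recover the usual mean squared error
rmse = np.sqrt(mse)     # square root of the averaged MSE, in the target's units
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

For the housing data, where the target is in thousands of dollars, an RMSE computed this way would be read in thousands of dollars as well.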

## Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
results = cross_val_score(Ridge(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error

## Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
results = cross_val_score(Lasso(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error

## ElasticNet Regression
Combines the L1 penalty of Lasso with the L2 penalty of Ridge regression.

In [None]:
from sklearn.linear_model import ElasticNet
results = cross_val_score(ElasticNet(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error
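
How ElasticNet blends the two penalties is controlled by its `l1_ratio` parameter (default 0.5). The sketch below, on synthetic `make_regression` data used as a stand-in (an assumption), checks that `l1_ratio=1.0` reproduces the Lasso coefficients and fits a 50/50 mix for comparison. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data (assumption), not the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)   # pure L1 penalty
enet_mix = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)  # equal L1/L2 mix

same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print("l1_ratio=1.0 matches Lasso:", same_as_lasso)
print("Lasso coefficient norm:     ", np.linalg.norm(lasso.coef_))
print("ElasticNet(0.5) coeff. norm:", np.linalg.norm(enet_mix.coef_))
```

Both `alpha` (overall penalty strength) and `l1_ratio` are worth tuning; the cell above uses the defaults.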

## k-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsRegressor
results = cross_val_score(KNeighborsRegressor(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error

## Classification and Regression Trees

In [None]:
from sklearn.tree import DecisionTreeRegressor
results = cross_val_score(DecisionTreeRegressor(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error

## Support Vector Machines

In [None]:
from sklearn.svm import SVR
results = cross_val_score(SVR(), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())  # negative mean squared error
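
To pick a regressor, the seven scores above can be ranked directly: since the metric is the negated MSE, the largest value (closest to zero) corresponds to the lowest error. The following is a sketch under stated assumptions: it runs on synthetic `make_regression` data rather than the housing data, and `regressors`, `neg_mse_by_model`, and `best_name` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); swap in the housing X and Y to reproduce the notebook.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

regressors = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=7),
    "SVR": SVR(),
}

cv = KFold(n_splits=10, shuffle=True, random_state=7)
neg_mse_by_model = {}
for name, model in regressors.items():
    fold_scores = cross_val_score(model, X_demo, y_demo,
                                  cv=cv, scoring="neg_mean_squared_error")
    neg_mse_by_model[name] = fold_scores.mean()
    print(f"{name}: {neg_mse_by_model[name]:.2f}")

# The maximum of the negated MSE is the model with the smallest error.
best_name = max(neg_mse_by_model, key=neg_mse_by_model.get)
print("best:", best_name)
```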