Spot-checking is a way of discovering which algorithms perform well on your machine learning
problem. You cannot know which algorithms are best suited to your problem beforehand. You
must trial a number of methods and focus attention on those that prove themselves the most
promising. This lesson presents seven machine learning algorithms that you can use
when spot-checking your regression problem in Python with scikit-learn.

Takeaways:
* How to spot-check machine learning algorithms on a regression problem.
* How to spot-check four linear regression algorithms.
* How to spot-check three nonlinear regression algorithms.

### Algorithms Overview

In this lesson we are going to take a look at seven regression algorithms that you can spot-check
on your dataset. Starting with four linear machine learning algorithms:

* Linear Regression.
* Ridge Regression.
* LASSO Linear Regression.
* Elastic Net Regression.

Then looking at three nonlinear machine learning algorithms:

* k-Nearest Neighbors.
* Classification and Regression Trees.
* Support Vector Machines.

Each recipe is demonstrated on the `Boston House Price dataset`. This is a `regression problem` where all attributes are `numeric`. A test harness with `10-fold cross-validation` is used
to demonstrate how to `spot-check` each machine learning algorithm and `mean squared error`
measures are used to indicate `algorithm performance`.
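
The harness described above can be sketched as a single loop over all seven algorithms. This is a minimal sketch, not the lesson's own code: it uses synthetic data from `make_regression` as a stand-in for `housing.csv` (an assumption, so the snippet runs on its own), and the short labels such as `'LR'` and `'CART'` are arbitrary names chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with 13 numeric inputs (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

# The seven algorithms covered in this lesson, all with default settings
models = [
    ('LR', LinearRegression()),
    ('RIDGE', Ridge()),
    ('LASSO', Lasso()),
    ('EN', ElasticNet()),
    ('KNN', KNeighborsRegressor()),
    ('CART', DecisionTreeRegressor(random_state=7)),
    ('SVR', SVR()),
]

kfold = KFold(n_splits=10)
results = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    results[name] = scores.mean()
    print("%-6s mean neg MSE: %.3f" % (name, results[name]))
```

Scores closer to zero indicate a lower mean squared error on this particular dataset; the ranking on real data may differ.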

Note that the `mean squared error` values are reported as negative numbers. This is a quirk of the `cross_val_score()` function,
which treats every score as something to maximize (a larger value is better), so error metrics such as mean squared error are negated. The recipes assume
that you know about each machine learning algorithm and how to use them.
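
The sign convention can be undone with a single negation. The following is a minimal sketch on synthetic data from `make_regression` (an assumption, so that the snippet does not depend on `housing.csv`); it recovers the ordinary mean squared error and also reports the root mean squared error, which is in the same units as the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

kfold = KFold(n_splits=10)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=kfold,
                          scoring='neg_mean_squared_error')

mse = -neg_mse        # flip the sign to recover the ordinary MSE per fold
rmse = np.sqrt(mse)   # RMSE is in the same units as the target variable

print("mean MSE:  %.3f" % mse.mean())
print("mean RMSE: %.3f" % rmse.mean())
```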

In [6]:
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_csv('housing.csv', names=names, sep=r'\s+')

In [7]:
data.head()

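| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |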


### Linear Machine Learning Algorithms

This section provides examples of how to use four different linear machine learning algorithms
for regression in Python with scikit-learn.

#### Linear Regression

`Linear regression` assumes that the input variables have a `Gaussian distribution`. It is also
assumed that input variables are relevant to the output variable and that they are not `highly correlated` with each other (a problem called `collinearity`). You can construct a `linear regression
model` using the `LinearRegression class`.

In [9]:
# Linear Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=10)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:",  results.mean())

estimate of mean squared error: -34.70525594452499
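
Because collinearity matters for linear regression, it can help to inspect pairwise correlations between the inputs before fitting. The following is a minimal sketch on a small made-up DataFrame; the column names `a`, `b`, `c` and the 0.9 threshold are illustrative assumptions, not values from this lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(7)
a = rng.normal(size=200)
frame = pd.DataFrame({
    'a': a,
    'b': 2.0 * a + rng.normal(scale=0.05, size=200),  # nearly a rescaled copy of 'a'
    'c': rng.normal(size=200),                          # generated independently of 'a'
})

# Absolute pairwise Pearson correlations between the input columns
corr = frame.corr().abs()

# Report each pair of distinct columns whose correlation exceeds the threshold
threshold = 0.9
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and corr.loc[i, j] > threshold]
print(pairs)
```

On the housing data the same check could be run on the 13 input columns of `dataframe`.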


#### Ridge Regression

`Ridge regression` is an extension of `linear regression` where the `loss function` is modified to
minimize the `complexity of the model`, measured as the sum of the squared coefficient
values (the squared L2-norm, often called an L2 penalty). You can construct a `ridge regression model` by using the `Ridge
class`.

In [10]:
# Ridge Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
num_folds = 10
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=num_folds)
model = Ridge()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:", results.mean())

estimate of mean squared error: -34.07824620925929
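
The effect of the L2 penalty can be seen by varying the `alpha` argument of `Ridge`, which controls the penalty strength. A minimal sketch on synthetic data (an assumption; the three `alpha` values are arbitrary) shows the sum of squared coefficients shrinking as `alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=7)

norms = []
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    l2 = float(np.sum(model.coef_ ** 2))  # the quantity the ridge penalty targets
    norms.append(l2)
    print("alpha=%g  sum of squared coefficients=%.2f" % (alpha, l2))
```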


#### LASSO Regression

The `Least Absolute Shrinkage and Selection Operator` (or LASSO for short) is a modification
of linear regression, like ridge regression, where the `loss function` is modified to minimize the
complexity of the model, measured as the sum of the absolute `coefficient values` (also called
the `L1-norm`). You can construct a LASSO model by using the `Lasso class`.

In [11]:
# Lasso Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=10)
model = Lasso()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:", results.mean())

estimate of mean squared error: -34.46408458830232
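
The "selection" part of the LASSO name comes from the L1 penalty driving some coefficients to exactly zero. The following is a minimal sketch on synthetic data in which only 3 of 10 inputs drive the target (an assumption made for illustration, as is `alpha=5.0`); it counts zero coefficients for plain linear regression and for LASSO.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# 10 inputs, only 3 of which actually influence the target (synthetic, illustrative)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=7)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero in each fitted model
ols_zeros = int(np.sum(ols.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("zero coefficients, linear regression:", ols_zeros)
print("zero coefficients, LASSO:            ", lasso_zeros)
```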


#### ElasticNet Regression

`ElasticNet` is a form of `regularization regression` that combines the properties of both `Ridge Regression` and `LASSO regression`. It seeks to minimize the complexity of the `regression model` (`magnitude and number of regression coefficients`) by penalizing the model using both the `L2-norm` (`sum squared coefficient values`) and the `L1-norm` (`sum absolute coefficient values`). You can construct an ElasticNet model using the `ElasticNet class`.

In [12]:
# ElasticNet Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import ElasticNet

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=10)
model = ElasticNet()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:", results.mean())

estimate of mean squared error: -31.164573714249762
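
In scikit-learn the balance between the two penalties is set by the `l1_ratio` argument of `ElasticNet` (0.5 by default). A minimal sketch on synthetic data (an assumption; the ratios tried are arbitrary) shows how the number of zeroed coefficients changes with `l1_ratio`, and that `l1_ratio=1.0` reproduces the `Lasso` fit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=7)

# Larger l1_ratio puts more weight on the L1 part of the penalty
for l1_ratio in [0.1, 0.5, 0.9]:
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X, y)
    zeros = int(np.sum(model.coef_ == 0))
    print("l1_ratio=%.1f  zero coefficients=%d" % (l1_ratio, zeros))

# With l1_ratio=1.0 the penalty is pure L1, so the fit matches Lasso
enet = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
same_as_lasso = bool(np.allclose(enet.coef_, lasso.coef_))
print("matches Lasso when l1_ratio=1.0:", same_as_lasso)
```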


### Nonlinear Machine Learning Algorithms

This section provides examples of how to use three different nonlinear machine learning algorithms
for regression in Python with scikit-learn.

#### K-Nearest Neighbors

The `k-Nearest Neighbors algorithm` (or KNN) locates the `k` most `similar instances` in the training dataset for a new data instance. From the `k neighbors`, a `mean` or `median` output
variable is taken as the prediction. Of note is the distance metric used (the `metric argument`). The `Minkowski distance` is used by default, which is a generalization of both the `Euclidean distance` (used when all inputs have the same scale) and `Manhattan distance` (for when the scales of the input variables differ). You can construct a `KNN model` for regression using the
`KNeighborsRegressor class`.

In [13]:
# KNN Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=10)
model = KNeighborsRegressor()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:", results.mean())

estimate of mean squared error: -107.28683898039215
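
With the default Minkowski metric, the `p` argument of `KNeighborsRegressor` selects the specific distance: `p=1` gives Manhattan distance and `p=2` (the default) gives Euclidean distance. The following is a minimal sketch comparing the two on synthetic data (an assumption, used so the snippet runs without `housing.csv`).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

kfold = KFold(n_splits=10)
knn_scores = {}
for name, p in [('manhattan', 1), ('euclidean', 2)]:
    model = KNeighborsRegressor(n_neighbors=5, metric='minkowski', p=p)
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    knn_scores[name] = scores.mean()
    print("%-9s mean neg MSE: %.3f" % (name, knn_scores[name]))
```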


#### Classification and Regression Trees

`Decision trees`, or `Classification and Regression Trees` (`CART`) as they are known, use the training data to select the best points to split the data in order to minimize a cost metric. The default
`cost metric` for regression decision trees is the `mean squared error`, specified in the criterion parameter. You can create a `CART model` for regression using the `DecisionTreeRegressor class`.

In [14]:
# Decision Tree Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=10)
model = DecisionTreeRegressor()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:", results.mean())

estimate of mean squared error: -38.275562352941186
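
The idea of choosing a split point that minimizes squared error can be worked through by hand on a tiny one-dimensional example. This is a minimal sketch with made-up values for `x` and `y` (assumptions for illustration); it searches the midpoints between neighbouring inputs and compares the result with the root split of a depth-1 `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: two well-separated groups (made-up values, already sorted by x)
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])

def sse(values):
    # Sum of squared errors around the mean of one side of a split
    return float(np.sum((values - values.mean()) ** 2))

# Candidate thresholds are the midpoints between neighbouring x values
candidates = (x[:-1] + x[1:]) / 2.0
costs = [sse(y[x <= t]) + sse(y[x > t]) for t in candidates]
best_threshold = float(candidates[int(np.argmin(costs))])

# A depth-1 tree makes a single split, which can be read from its root node
stump = DecisionTreeRegressor(max_depth=1, random_state=7)
stump.fit(x.reshape(-1, 1), y)
tree_threshold = float(stump.tree_.threshold[0])

print("manual best threshold:", best_threshold)
print("tree root threshold:  ", tree_threshold)
```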


#### Support Vector Machines

`Support Vector Machines` (SVM) were developed for `binary classification`. The technique has
been extended to the prediction of real-valued quantities, where it is called `Support Vector Regression` `(SVR)`.
Like the classification example, `SVR` is built upon the `LIBSVM library`. You can create an `SVM
model` for regression using the `SVR class`.

In [16]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


# SVM Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

X = array[:,0:13]
Y = array[:,13]
num_folds = 10
# random_state only applies when shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=num_folds)
model = SVR()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("estimate of mean squared error:", results.mean())

estimate of mean squared error: -91.04782433324428
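
The score from `SVR()` can depend on the `gamma` setting of its RBF kernel, which accepts `'scale'` and `'auto'`; which one is the default depends on the installed scikit-learn version. The following is a minimal sketch on synthetic data (an assumption; the inputs are multiplied by an arbitrary factor of 50 so the two settings differ noticeably) that sets `gamma` explicitly, so the result does not depend on the library default.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
X = X * 50.0  # arbitrary rescaling so the inputs are far from unit variance

kfold = KFold(n_splits=10)
svr_scores = {}
for gamma in ['scale', 'auto']:
    # 'scale' adapts gamma to the variance of X; 'auto' uses 1 / n_features
    model = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=gamma)
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    svr_scores[gamma] = scores.mean()
    print("gamma=%-5s mean neg MSE: %.3f" % (gamma, svr_scores[gamma]))
```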
