# Spot-Check Regression Algorithms

><small><i>from the book 
"Machine Learning Mastery With Python: Understand Your Data, Create Accurate Models and Work Projects End-To-End"
by Jason Brownlee, Migrated to Jupyter with additions by Mitch Sanders 2017</i></small>




Spot-checking is a way of discovering which algorithms perform well on your machine learning
problem. You cannot know which algorithms are best suited to your problem beforehand. You
must trial a number of methods and focus attention on those that prove themselves the most
promising. In this chapter you will discover seven machine learning algorithms that you can use
when spot-checking your regression problem in Python with scikit-learn. After completing this
lesson you will know:

1. How to spot-check machine learning algorithms on a regression problem.
2. How to spot-check four linear regression algorithms.
3. How to spot-check three nonlinear regression algorithms.

Let’s get started.

## Algorithms Overview
In this lesson we are going to take a look at seven regression algorithms that you can spot-check
on your dataset. Starting with four linear machine learning algorithms:

- Linear Regression.
- Ridge Regression.
- LASSO Linear Regression.
- Elastic Net Regression.

Then looking at three nonlinear machine learning algorithms:

- k-Nearest Neighbors.
- Classification and Regression Trees.
- Support Vector Machines.

Each recipe is demonstrated on the **Boston House Price** dataset. This is a regression
problem where all attributes are numeric. A test harness with 10-fold cross-validation is used
to demonstrate how to spot-check each machine learning algorithm and mean squared error
measures are used to indicate algorithm performance. Note that mean squared error values are
negated (reported as negative numbers). This is a quirk of the **cross_val_score()** function, which
expects every metric to follow the convention that a larger value is better, so scores closer to zero indicate lower error. The recipes assume
that you know about each machine learning algorithm and how to use them. We will not go
into the API or parameterization of each algorithm.

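The sign convention described above can be checked with a minimal sketch. It uses synthetic data from `make_regression` as a stand-in for the housing file (an assumption made only so the example runs without external data), and the variable names `neg_mse`, `mse` and `rmse` are illustrative.

```python
# Sketch: why cross_val_score reports negative mean squared error.
# Synthetic data is an assumption; it replaces the housing CSV for illustration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

kfold = KFold(n_splits=10)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=kfold,
                          scoring='neg_mean_squared_error')

mse = -neg_mse        # flip the sign to recover the ordinary MSE of each fold
rmse = np.sqrt(mse)   # RMSE is expressed in the units of the target variable

print(neg_mse.mean())            # negative: closer to zero is better
print(mse.mean(), rmse.mean())   # the same information as positive errors
```

Negating the scores back is usually the easiest way to report them, since readers tend to expect error values to be positive.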


## Linear Machine Learning Algorithms
This section provides examples of how to use four different linear machine learning algorithms
for regression in Python with scikit-learn.

### Linear Regression
Linear regression assumes that the input variables have a Gaussian distribution. It is also
assumed that input variables are relevant to the output variable and that they are not highly
correlated with each other (a problem called collinearity). You can construct a linear regression
model using the LinearRegression class.


http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

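One rough way to look for the collinearity mentioned above is to inspect pairwise correlations between the input columns. The sketch below builds a small synthetic matrix in which the second column is almost a multiple of the first; the data and the 0.9 cutoff are assumptions chosen for illustration, not values from the book.

```python
# Sketch: flagging highly correlated input columns with a correlation matrix.
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=100)   # nearly a scaled copy of x1
x3 = rng.normal(size=100)                          # independent of x1 and x2
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)   # 3x3 matrix, one row/column per input

threshold = 0.9   # hypothetical cutoff for "highly correlated"
pairs = [(i, j)
         for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) > threshold]

print(pairs)   # column index pairs whose correlation exceeds the cutoff
```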

In [1]:
# Linear Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
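dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=10)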
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

# Running the example provides an estimate of the mean squared error.


-34.7052559445


### Ridge Regression
Ridge regression is an extension of linear regression where the loss function is modified to
minimize the complexity of the model measured as the sum squared value of the coefficient
values (also called the L2-norm). You can construct a ridge regression model by using the Ridge
class.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

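The effect of the L2 penalty can be seen by fitting Ridge with increasing values of its `alpha` parameter and watching the sum of squared coefficients shrink. This is a sketch on synthetic data; the list of alphas is an arbitrary choice for illustration.

```python
# Sketch: a stronger L2 penalty (larger alpha) shrinks the coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data is an assumption standing in for the housing dataset.
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=7)

alphas = [0.1, 1.0, 10.0, 100.0]   # illustrative penalty strengths
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.sum(model.coef_ ** 2)))   # sum of squared coefficients
    print(alpha, norms[-1])
```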

In [2]:
# Ridge Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
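dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
num_folds = 10
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=num_folds)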
model = Ridge()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())


-34.0782462093


### LASSO Regression

The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification
of linear regression, like ridge regression, where the loss function is modified to minimize the
complexity of the model measured as the sum absolute value of the coefficient values (also called
the L1-norm). You can construct a LASSO model by using the Lasso class.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

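A practical consequence of the L1 penalty is that some coefficients are driven exactly to zero. The sketch below compares ordinary least squares with Lasso on synthetic data where only 5 of 20 inputs are informative; the data and `alpha=5.0` are assumptions picked to make the effect visible.

```python
# Sketch: the L1 penalty sets some coefficients exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Only 5 of the 20 synthetic inputs actually influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=7)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)   # alpha chosen for illustration only

n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print('zero coefficients, least squares:', n_zero_ols)
print('zero coefficients, Lasso:', n_zero_lasso)
```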

In [3]:
# Lasso Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
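dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=10)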
model = Lasso()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())


# Running the example provides an estimate of the mean squared error.

-34.4640845883


### ElasticNet Regression

ElasticNet is a form of regularization regression that combines the properties of both Ridge
Regression and LASSO regression. It seeks to minimize the complexity of the regression model
(magnitude and number of regression coefficients) by penalizing the model using both the
L2-norm (sum squared coefficient values) and the L1-norm (sum absolute coefficient values).
You can construct an ElasticNet model using the ElasticNet class.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

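In scikit-learn the balance between the two penalties is set by the `l1_ratio` parameter of ElasticNet. As a sketch of how the pieces relate, the example below checks that `l1_ratio=1.0` reproduces a Lasso fit and also fits the default mix of 0.5; the synthetic data and `alpha=1.0` are assumptions for illustration.

```python
# Sketch: ElasticNet with l1_ratio=1.0 behaves like Lasso.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=7)

lasso = Lasso(alpha=1.0).fit(X, y)
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)    # pure L1 penalty
enet_mix = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # default: half L1, half L2

same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print(same_as_lasso)
print(np.abs(lasso.coef_).sum(), np.abs(enet_mix.coef_).sum())
```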

In [4]:
# ElasticNet Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import ElasticNet
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
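dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=10)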
model = ElasticNet()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

# Running the example provides an estimate of the mean squared error.

-31.1645737142


## Nonlinear Machine Learning Algorithms

This section provides examples of how to use three different nonlinear machine learning algorithms
for regression in Python with scikit-learn.

### K-Nearest Neighbors

The k-Nearest Neighbors algorithm (or KNN) locates the k most similar instances in the
training dataset for a new data instance. From the k neighbors, a mean or median output
variable is taken as the prediction. Of note is the distance metric used (the metric argument).
The Minkowski distance is used by default, which is a generalization of both the Euclidean
distance (used when all inputs have the same scale) and Manhattan distance (for when the
scales of the input variables differ). You can construct a KNN model for regression using the
KNeighborsRegressor class.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

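With the default Minkowski metric, the `p` argument of KNeighborsRegressor selects the distance: `p=2` gives Euclidean and `p=1` gives Manhattan. The toy points below are made up so that the two distances disagree about which training instance is nearest to the query.

```python
# Sketch: the choice of distance changes which neighbor is "nearest".
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Three hand-picked training points (illustrative, not from the housing data).
X_train = np.array([[3.0, 0.0], [2.0, 2.0], [0.0, 3.5]])
y_train = np.array([1.0, 2.0, 3.0])
query = np.array([[0.0, 0.0]])

# Euclidean distances from the origin: 3.0, about 2.83, 3.5 -> (2, 2) is nearest.
euclid = KNeighborsRegressor(n_neighbors=1, p=2).fit(X_train, y_train)
# Manhattan distances from the origin: 3.0, 4.0, 3.5 -> (3, 0) is nearest.
manhat = KNeighborsRegressor(n_neighbors=1, p=1).fit(X_train, y_train)

print(euclid.predict(query))   # target of the point (2, 2)
print(manhat.predict(query))   # target of the point (3, 0)
```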


In [5]:
# KNN Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
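dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=10)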
model = KNeighborsRegressor()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

# Running the example provides an estimate of the mean squared error.

-107.28683898


### Classification and Regression Trees

Decision trees or the Classification and Regression Trees (CART as they are known) use the training
data to select the best points to split the data in order to minimize a cost metric. The default
cost metric for regression decision trees is the mean squared error, specified in the criterion
parameter. You can create a CART model for regression using the DecisionTreeRegressor
class.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

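To make the idea of choosing a split point concrete, here is a sketch with a tree limited to a single split (`max_depth=1`) on a tiny made-up dataset with two obvious groups. The data is an assumption for illustration; the default criterion is left unchanged.

```python
# Sketch: a one-split regression tree on data with two clear groups.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# max_depth=1 allows exactly one split; each leaf predicts the mean of its targets.
stump = DecisionTreeRegressor(max_depth=1, random_state=7).fit(X, y)

split_point = stump.tree_.threshold[0]       # chosen between the two groups
predictions = stump.predict([[2.5], [11.5]])

print(split_point)
print(predictions)
```

The split lands between the two groups because that choice leaves each leaf with identical target values, which minimizes the squared error.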

In [6]:
# Decision Tree Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
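dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=10)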
model = DecisionTreeRegressor()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

# Running the example provides an estimate of the mean squared error.

-39.6383321569


### Support Vector Machines
Support Vector Machines (SVM) were developed for binary classification. The technique has
been extended to the prediction of real-valued outputs, where it is called Support Vector Regression (SVR).
Like the classification example, SVR is built upon the LIBSVM library. You can create an SVM
model for regression using the SVR class.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

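SVR with its default RBF kernel is sensitive to the scale of the inputs, so a common variation when spot-checking is to standardize the features inside the cross-validation loop. The sketch below is one possible way to do that with a pipeline; it is not part of the original recipe, and the synthetic data with one deliberately inflated column is an assumption.

```python
# Sketch: standardizing inputs inside each fold before fitting SVR.
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
X[:, 0] *= 1000.0   # inflate one column to mimic inputs on very different scales

# The scaler is refit on the training part of every fold, avoiding data leakage.
model = make_pipeline(StandardScaler(), SVR())
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, y, cv=kfold,
                          scoring='neg_mean_squared_error')

print(results.mean())
```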

In [7]:
# SVM Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
filename = '../housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
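dataframe = read_csv(filename, sep=r'\s+', names=names)  # whitespace-delimited file
array = dataframe.values
X = array[:,0:13]  # the 13 input attributes
Y = array[:,13]    # the target, MEDV
num_folds = 10
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold = KFold(n_splits=num_folds)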
model = SVR()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

# Running the example provides an estimate of the mean squared error.

-91.0478243332


## Summary
In this chapter you discovered how to spot-check machine learning algorithms for regression
problems in Python using scikit-learn. Specifically, you learned about four linear machine
learning algorithms: Linear Regression, Ridge Regression, LASSO Linear Regression and Elastic
Net Regression. You also learned about three nonlinear algorithms: k-Nearest Neighbors,
Classification and Regression Trees and Support Vector Machines.
### Next
Now that you know how to use classification and regression algorithms you need to know how
to compare the results of different algorithms to each other. In the next lesson you will discover
how to design simple experiments to directly compare machine learning algorithms to each other
on your dataset.

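As a bridge to that comparison, the seven recipes from this lesson can be run in a single loop over the same folds. This is a sketch under stated assumptions: synthetic data replaces the housing file so the block is self-contained, and the `models` and `results` dictionaries are illustrative names.

```python
# Sketch: spot-checking all seven regressors on the same 10 folds.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (13 inputs, one numeric target).
X, y = make_regression(n_samples=300, n_features=13, noise=15.0, random_state=7)

models = {
    'LR': LinearRegression(),
    'Ridge': Ridge(),
    'LASSO': Lasso(),
    'EN': ElasticNet(),
    'KNN': KNeighborsRegressor(),
    'CART': DecisionTreeRegressor(random_state=7),
    'SVR': SVR(),
}

kfold = KFold(n_splits=10)   # the same folds are reused for every model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    results[name] = scores.mean()

# Larger (closer to zero) is better, so sort in descending order.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```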
<hr>

### About the Boston House Price dataset:
Maintained at the UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets/housing

Included in the scikit-learn datasets module (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html


#### Attribute Information:

1. CRIM: per capita crime rate by town 
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft. 
3. INDUS: proportion of non-retail business acres per town 
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 
5. NOX: nitric oxides concentration (parts per 10 million) 
6. RM: average number of rooms per dwelling 
7. AGE: proportion of owner-occupied units built prior to 1940 
8. DIS: weighted distances to five Boston employment centres 
9. RAD: index of accessibility to radial highways 
10. TAX: full-value property-tax rate per 10,000 US Dollars
11. PTRATIO: pupil-teacher ratio by town 
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
13. LSTAT: % lower status of the population 
14. MEDV: Median value of owner-occupied homes in 1000's US Dollars


In [8]:
# Demo using the Boston Housing data in scikit-learn (requires scikit-learn < 1.2, where load_boston still exists)
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)

(506, 13)


### About the Pima Indian Dataset 

#### Attribute Information:

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 