# Supervised learning - regression

In supervised regression, we want to train the algorithm to predict a continuous numerical value, rather than a categorical value as we do in classification. These techniques are essentially the same as regression in statistics.

In this notebook we will look at how to create a regression pipeline using the tools available in the scikit-learn library. See the scikit-learn documentation for more information on the functions used:

* User Guide: https://scikit-learn.org/stable/user_guide.html
* API reference: https://scikit-learn.org/stable/modules/classes.html

In [1]:
import os

In [24]:
import numpy
import pandas
import matplotlib
import matplotlib.pyplot
import mpl_toolkits.mplot3d

In [95]:
import sklearn
import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.metrics

## Data loading and exploration
We will use a generated dataset from scikit-learn to demonstrate the regression pipeline. To better demonstrate a standard pipeline, I am changing the mean and standard deviation of the input features so that the dataset is closer to a real example where preprocessing is required.

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression

In [23]:
n_samples = 1000
n_features = 10
n_informative = 5
n_targets = 1
noise_std = 0.2

In [81]:
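# Per-feature scale factors in [0, 5), repeated for every sample
scale_random = numpy.repeat(numpy.random.random([1,n_features])*5, n_samples, axis=0)
# Per-feature integer offsets, truncated from values in [-5, 5), repeated for every sample
translate_random = numpy.repeat(numpy.array(numpy.random.random([1,n_features])*10-5,dtype=numpy.int32), n_samples, axis=0)
# Generate the input features, a noisy linear target, and the true coefficients
input_features, target_feature, coef1 = sklearn.datasets.make_regression(n_samples=n_samples,
                                         n_features=n_features,
                                         n_informative=n_informative,
                                         n_targets=n_targets,
                                         noise=noise_std,
                                         coef=True)
# Rescale and shift each feature so that preprocessing is actually needed
input_features = (input_features * scale_random) + translate_random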
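
The rescaling step above can also be written with NumPy broadcasting instead of `numpy.repeat`. The following is a minimal sketch of that idea, not the notebook's own code: the seed, the `rng` generator, and the names `scale`, `shift` and `X_shifted` are assumptions introduced for illustration only.

```python
import numpy
import sklearn.datasets

# Hypothetical seed, used only to make this sketch reproducible
rng = numpy.random.default_rng(0)
n_samples, n_features = 1000, 10

X, y, coef = sklearn.datasets.make_regression(
    n_samples=n_samples, n_features=n_features, n_informative=5,
    n_targets=1, noise=0.2, coef=True, random_state=0)

# One scale factor and one integer offset per feature, shape (n_features,)
scale = rng.random(n_features) * 5
shift = numpy.trunc(rng.random(n_features) * 10 - 5)

# Broadcasting applies the per-feature values to every row of X
X_shifted = X * scale + shift

print(X_shifted.mean(axis=0))
print(X_shifted.std(axis=0))
```

Because the transformation is affine, each feature's standard deviation is multiplied by its scale factor and its mean is scaled and then shifted by the offset.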

In [82]:
coef1

array([ 0.        , 90.91238786, 93.36253291,  0.        , 89.10777337,
        0.        ,  0.        , 66.89337894, 18.09660984,  0.        ])

In [83]:
input_features.mean(axis=0)

array([-2.08635773e+00, -3.08418204e+00,  3.19026051e-03, -1.94716593e+00,
       -3.54290089e-02,  1.99152353e+00, -3.09384437e+00, -4.01785782e+00,
       -3.00026743e+00, -6.06944446e-02])

In [84]:
input_features.std(axis=0)

array([4.43414389, 1.78519554, 3.44177786, 2.47964482, 3.72483471,
       1.22571448, 4.71310688, 0.7202224 , 0.45396689, 1.89564103])

In [85]:
print(f'target variable range=[{target_feature.min()},{target_feature.max()}] mean={target_feature.mean()} std={target_feature.std()}')

target variable range=[-562.8482776890254,659.9814371374064] mean=-6.7749403837336954 std=177.67507055910667


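## Data preprocessing

We split the data into training and test sets, then standardise the input features using the mean and standard deviation computed on the training set only.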


Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [86]:
input_train, input_test, target_train, target_test = sklearn.model_selection.train_test_split(input_features, target_feature)
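
As a small illustration of what the split does, the sketch below uses a toy array rather than the notebook's data. The explicit `test_size=0.25` matches the scikit-learn default, and `random_state=42` is an assumption added here only so the split is repeatable.

```python
import numpy
import sklearn.model_selection

# Toy data: row i of X is [2*i, 2*i + 1] and y[i] is i, so alignment is easy to check
X = numpy.arange(2000, dtype=float).reshape(1000, 2)
y = numpy.arange(1000, dtype=float)

X_tr, X_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_tr.shape, X_te.shape)
# Inputs and targets are shuffled together, so rows stay paired
print(numpy.array_equal(X_tr[:, 0], 2 * y_tr))
```

With 1000 samples this gives 750 training rows and 250 test rows.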

In [87]:
input_scaler = sklearn.preprocessing.StandardScaler()
input_scaler.fit(input_train)
X_train = input_scaler.transform(input_train)
X_test = input_scaler.transform(input_test)
y_train = target_train
y_test = target_test
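
To make the scaling step more concrete, here is a minimal sketch on synthetic data (the arrays `train` and `test` and the seed are hypothetical). It shows that `transform` subtracts the training mean and divides by the training standard deviation, which are stored in the fitted scaler's `mean_` and `scale_` attributes.

```python
import numpy
import sklearn.preprocessing

# Hypothetical data standing in for a train/test split
rng = numpy.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(750, 4))
test = rng.normal(loc=3.0, scale=2.0, size=(250, 4))

# Fit on the training set only, then apply the same statistics to both sets
scaler = sklearn.preprocessing.StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Manual equivalent of transform(), using the statistics learned from `train`
manual = (test - scaler.mean_) / scaler.scale_

print(numpy.allclose(manual, test_scaled))
print(test_scaled.mean(axis=0))
```

The test set is scaled with the training statistics, so its mean is close to zero but not exactly zero. Fitting the scaler on the training set alone avoids leaking information from the test set into the model.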

In [88]:
X_train.mean(axis=0)

array([ 5.28466160e-17, -3.59712260e-17, -2.53130850e-17,  7.44737605e-16,
       -3.70074342e-18,  3.76365605e-16,  7.98028310e-16,  4.32128407e-15,
        5.31064082e-15, -1.83556873e-17])

In [89]:
X_train.std(axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

## Training regression algorithms

Documentation: 
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [92]:
linear_regressor1 = sklearn.linear_model.LinearRegression()
linear_regressor1.fit(X_train, y_train)
linear_regressor1.coef_

array([ 6.03809056e-03,  9.23329128e+01,  9.35872814e+01, -1.99057648e-03,
        9.32641654e+01,  6.03841251e-03,  5.01223321e-03,  6.53609042e+01,
        1.76552912e+01, -3.60678660e-03])

In [91]:
linear_regressor1.coef_ - coef1

array([ 6.03809056e-03,  1.42052495e+00,  2.24748462e-01, -1.99057648e-03,
        4.15639204e+00,  6.03841251e-03,  5.01223321e-03, -1.53247474e+00,
       -4.41318657e-01, -3.60678660e-03])
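
When comparing fitted and true coefficients, the units matter: a model trained on standardised inputs has coefficients per standard deviation of each feature. The sketch below is my own illustration under stated assumptions (the seed, the `uniform(0.5, 5.0)` scale range and all variable names are hypothetical, not the notebook's values). It converts the fitted coefficients back to the units of the generated features before comparing.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.preprocessing

rng = numpy.random.default_rng(0)  # hypothetical seed for reproducibility
X0, y, true_coef = sklearn.datasets.make_regression(
    n_samples=1000, n_features=10, n_informative=5,
    noise=0.2, coef=True, random_state=0)

# Rescale and shift the features, as an assumed stand-in for the notebook's step
scale = rng.uniform(0.5, 5.0, size=10)
shift = rng.integers(-4, 5, size=10)
X = X0 * scale + shift

scaler = sklearn.preprocessing.StandardScaler().fit(X)
model = sklearn.linear_model.LinearRegression().fit(scaler.transform(X), y)

# Coefficient per unit of the rescaled feature X
coef_per_raw_unit = model.coef_ / scaler.scale_
# Coefficient per unit of the originally generated feature X0
coef_recovered = coef_per_raw_unit * scale

max_abs_error = numpy.abs(coef_recovered - true_coef).max()
print(max_abs_error)
```

After the conversion, the remaining difference comes only from the noise added to the target, so it should be small relative to the coefficients themselves.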

In [93]:
y_out_train = linear_regressor1.predict(X_train)

In [94]:
y_out_test = linear_regressor1.predict(X_test)
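
The scaling and regression steps can also be chained into a single estimator. This is a sketch of one possible way to do it with `sklearn.pipeline.make_pipeline`, which the notebook itself does not use; the data here is regenerated with an assumed `random_state=0` so the block runs on its own.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

X, y = sklearn.datasets.make_regression(
    n_samples=1000, n_features=10, n_informative=5,
    noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    X, y, random_state=0)

# fit() fits the scaler on the training data, then the regressor on the scaled data
pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.linear_model.LinearRegression())
pipeline.fit(X_tr, y_tr)

# predict() and score() apply the same fitted scaling before the regressor
y_pred = pipeline.predict(X_te)
r2_test = pipeline.score(X_te, y_te)
print(r2_test)
```

Wrapping the steps this way means the scaler can never be fitted on test data by accident, and the whole object can be passed to tools such as cross-validation.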

## Performance metrics

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

In [96]:
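# Root mean squared error (RMSE); the `squared=False` argument is deprecated in recent scikit-learn releases
err_train = numpy.sqrt(sklearn.metrics.mean_squared_error(y_train, y_out_train))

In [102]:
err_test = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test, y_out_test))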

In [103]:
err_train

0.042411641945348466

In [104]:
err_test

0.18094292582481272
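
For reference, the sketch below computes the root mean squared error by hand on a tiny made-up example and compares it with scikit-learn, alongside two other common regression metrics. The arrays `y_true` and `y_pred` are hypothetical values, not results from this notebook.

```python
import numpy
import sklearn.metrics

# Hypothetical targets and predictions
y_true = numpy.array([3.0, -0.5, 2.0, 7.0])
y_pred = numpy.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error and its square root (RMSE, in the units of the target)
mse = sklearn.metrics.mean_squared_error(y_true, y_pred)
rmse = numpy.sqrt(mse)
rmse_manual = numpy.sqrt(numpy.mean((y_true - y_pred) ** 2))

# Mean absolute error and coefficient of determination
mae = sklearn.metrics.mean_absolute_error(y_true, y_pred)
r2 = sklearn.metrics.r2_score(y_true, y_pred)

print(mse, rmse, mae, r2)
```

RMSE and MAE are expressed in the units of the target, so they should be read against the spread of the target variable. R² is unitless, which makes it easier to compare across datasets.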