# Supervised learning - regression

In supervised regression, we want to train the algorithm to predict a continuous numerical value, rather than a categorical value as we do in classification. These techniques are essentially the same as regression in statistics.

In this notebook we will look at how to create a regression pipeline using the tools available in the scikit-learn library. The examples are quite simple, but demonstrate the key steps in doing regression. The interface for each type of class in the process is consistent across all the different examples of that type, so even when dealing with much more complicated examples, the interface will be the same and the code will look similar. See the scikit-learn documentation for more information on the functions used:

Scikit-learn documentation
* User Guide - Supervised learning: https://scikit-learn.org/stable/supervised_learning.html
* API reference: https://scikit-learn.org/stable/modules/classes.html

In [1]:
import os
import numpy
import pandas
import matplotlib
import matplotlib.pyplot
import mpl_toolkits.mplot3d

In [2]:
import sklearn
import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.metrics
import sklearn.neural_network
import sklearn.pipeline

## Data loading and exploration
We will use a generated dataset from scikit-learn to demonstrate the regression pipeline. To better demonstrate a standard pipeline, we change the mean and standard deviation of the input features so that the dataset is closer to a real example where preprocessing is required.

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression

In [3]:
n_samples = 10000
n_features = 10
n_informative = 5
n_targets = 1
noise_std = 0.2

In [4]:
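# Random per-feature scale factors in [0, 5), repeated for every sample
scale_random = numpy.repeat(numpy.random.random([1, n_features]) * 5, n_samples, axis=0)
# Random per-feature integer offsets (values drawn from [-5, 5), truncated towards zero)
translate_random = numpy.repeat(
    numpy.array(numpy.random.random([1, n_features]) * 10 - 5, dtype=numpy.int32),
    n_samples, axis=0)
# Generate a linear regression problem; coef=True also returns the true coefficients
input_features, target_feature, coef1 = sklearn.datasets.make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_informative,
    n_targets=n_targets,
    noise=noise_std,
    coef=True)
# Rescale and shift the features so that they are no longer standardised
input_features = (input_features * scale_random) + translate_random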

In [5]:
coef1

array([35.80209775,  0.        ,  0.        , 93.77302891,  0.        ,
       18.43683699, 28.70239851,  0.        ,  0.        ,  0.21332367])
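
The role of `n_informative` can be checked directly on the returned coefficients. The following is a minimal sketch on a separate, smaller dataset; the sample count and `random_state=0` are arbitrary choices for illustration and are not part of the notebook above.

```python
import numpy
import sklearn.datasets

# Small illustrative dataset; sizes and random_state are arbitrary assumptions.
X_demo, y_demo, coef_demo = sklearn.datasets.make_regression(
    n_samples=200, n_features=10, n_informative=5,
    noise=0.2, coef=True, random_state=0)

# Only the informative features receive a non-zero ground-truth coefficient.
informative = numpy.flatnonzero(coef_demo)
print(f'informative feature indices: {informative}')
print(f'number of informative features: {informative.size}')
```

The remaining features have a true coefficient of exactly zero, so a good fit should assign them coefficients close to zero.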

In [6]:
input_features.mean(axis=0)

array([-1.01691928e+00, -4.04241352e+00, -1.94301898e+00, -2.00357408e+00,
       -2.01575347e+00,  2.04532891e+00,  3.14286665e-03,  1.98208087e+00,
       -3.96945976e+00,  8.22689894e-02])

In [7]:
input_features.std(axis=0)

array([3.20446505, 4.57197064, 4.42213003, 0.83012378, 2.53573358,
       3.65422809, 2.071285  , 3.69052539, 3.04286174, 4.73437527])

In [8]:
print(f'target variable range=[{target_feature.min()},{target_feature.max()}] mean={target_feature.mean()} std={target_feature.std()}')

target variable range=[-431.25759755496983,450.59596123977815] mean=-0.31874233551843845 std=107.21340980646025


## Data preprocessing

Before applying a supervised learning algorithm to our data, we usually have to do some preprocessing to get it into a suitable state for training. The most common step is to normalise the data, so that all features are on a comparable scale. This lets the algorithm give equal importance to each of the features, which is a usual starting assumption when applying machine learning techniques.

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Before we normalise the data, we should split our data into training and test sets. To properly evaluate our regressor's performance, we must evaluate it on unseen data to see how well it generalises. We don't want any part of the pipeline to be trained on the data to be used for testing, so we do the split as early as possible.

In [9]:
input_train, input_test, target_train, target_test = sklearn.model_selection.train_test_split(input_features, target_feature)
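
The call above relies on the defaults of `train_test_split`. As a small sketch of what those defaults do, the example below uses stand-in arrays with the same number of samples; the `random_state=0` seed is an assumption added here to make the split repeatable and is not used in the notebook.

```python
import numpy
import sklearn.model_selection

# Stand-in arrays with the same number of samples as the dataset above.
X_demo = numpy.arange(10000 * 10, dtype=float).reshape(10000, 10)
y_demo = numpy.arange(10000, dtype=float)

# random_state=0 is an arbitrary seed; repeating the call gives the same split.
split_a = sklearn.model_selection.train_test_split(X_demo, y_demo, random_state=0)
split_b = sklearn.model_selection.train_test_split(X_demo, y_demo, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
# With no test_size given, 25% of the samples are held out for testing.
print(X_tr.shape, X_te.shape)
print(numpy.array_equal(X_tr, split_b[0]))
```

Without a fixed `random_state`, each run selects a different random split, so the error values reported later will vary slightly from run to run.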

For this data, to normalise the input features, we want each feature to have zero mean and a standard deviation of 1. The StandardScaler class does this. There are many other preprocessing classes for different sorts of data and different preprocessing requirements.

In [10]:
input_scaler = sklearn.preprocessing.StandardScaler()
input_scaler.fit(input_train)
X_train = input_scaler.transform(input_train)
X_test = input_scaler.transform(input_test)
y_train = target_train
y_test = target_test

We can now see that the mean of each column of the training data is nearly zero, and the standard deviation is 1. Our features are now ready for training.

In [11]:
X_train.mean(axis=0)

array([ 1.68673964e-15, -1.49035229e-15, -7.46632386e-16, -7.52945854e-15,
        1.97018698e-15, -7.71027686e-16, -6.51330841e-19,  1.92574105e-15,
       -3.21372558e-16, -9.17784367e-18])

In [12]:
X_train.std(axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
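
The scaling step can also be written out by hand, which makes clear that the test data is transformed with the mean and standard deviation of the training data. This is a minimal sketch on hypothetical data (the feature means and spreads are made up for illustration), not the dataset used above.

```python
import numpy
import sklearn.preprocessing

rng = numpy.random.default_rng(0)
# Hypothetical features with a different mean and spread in each column.
raw_train = rng.normal(loc=[3.0, -2.0, 0.5], scale=[5.0, 0.1, 2.0], size=(1000, 3))
raw_test = rng.normal(loc=[3.0, -2.0, 0.5], scale=[5.0, 0.1, 2.0], size=(250, 3))

scaler = sklearn.preprocessing.StandardScaler().fit(raw_train)

# Manual equivalent: subtract the training mean and divide by the training
# standard deviation (population form, ddof=0, which is the numpy default).
train_mean = raw_train.mean(axis=0)
train_std = raw_train.std(axis=0)
manual_test = (raw_test - train_mean) / train_std

print(numpy.allclose(manual_test, scaler.transform(raw_test)))
print(manual_test.mean(axis=0))  # close to, but not exactly, zero
```

Because the statistics come from the training set only, the transformed test set will generally have a mean and standard deviation that are close to, but not exactly, 0 and 1.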

## Training regression algorithms

The next step is to train the regression algorithm using our data. We are using a linear regression, so the training involves calculating the linear coefficients using an ordinary least squares solver, which in the case of scikit-learn uses the scipy implementation of this algorithm.

The regressor, as with most steps in a scikit-learn pipeline, has two key methods. For a regressor these are fit and predict. The call to fit uses the training data to calculate the parameters of the algorithm that give the best match to the target data. The call to predict then gives a prediction for each of the observations, which for the training data should be as close as possible to the known answer. One can then apply the algorithm to the unseen test data to see how well the algorithm generalises with those parameters.

Documentation: 
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

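Here we create a regressor object, then train the algorithm (i.e. calculate the parameters, in this case the linear coefficients). The calculated linear coefficients are displayed after fitting. Because we generated the data, we can compare our calculated coefficients to those used for generating the data. They are not an exact match, partly because we added noise to our data to make it more "real", and partly because the scaler standardises the features using the sample mean and standard deviation of the training set, which differ slightly from those of the distribution used to generate the data.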

In [13]:
linear_regressor1 = sklearn.linear_model.LinearRegression()
linear_regressor1.fit(X_train, y_train)
linear_regressor1.coef_

array([ 3.61270538e+01, -1.90992567e-03, -3.20748807e-03,  9.48344741e+01,
        3.24327448e-03,  1.85440416e+01,  2.84925574e+01, -1.89352136e-03,
        1.38633154e-04,  2.17598731e-01])

In [14]:
linear_regressor1.coef_ - coef1

array([ 3.24956074e-01, -1.90992567e-03, -3.20748807e-03,  1.06144515e+00,
        3.24327448e-03,  1.07204592e-01, -2.09841163e-01, -1.89352136e-03,
        1.38633154e-04,  4.27505954e-03])
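
To illustrate what fitting by ordinary least squares means, the sketch below solves the same kind of problem directly with `numpy.linalg.lstsq` and compares the result with `LinearRegression`. It is an illustration on a small, separately generated dataset (sizes and `random_state=0` are arbitrary assumptions), not a description of the exact internal code path in scikit-learn.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model

# Small illustrative dataset; sizes and random_state are arbitrary assumptions.
X_ols, y_ols, true_coef = sklearn.datasets.make_regression(
    n_samples=500, n_features=4, n_informative=2,
    noise=0.2, coef=True, random_state=0)

model = sklearn.linear_model.LinearRegression().fit(X_ols, y_ols)

# Append a column of ones so the intercept is estimated along with the slopes.
X_design = numpy.column_stack([X_ols, numpy.ones(len(X_ols))])
solution, *_ = numpy.linalg.lstsq(X_design, y_ols, rcond=None)
slopes, intercept = solution[:-1], solution[-1]

print(numpy.allclose(slopes, model.coef_))
print(numpy.isclose(intercept, model.intercept_))
```

Both approaches minimise the sum of squared differences between the predictions and the targets, so they agree up to floating point precision.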

Now that we've trained our algorithm, we use it to calculate results both for the observations used in training, to see how good our fit is, and for the unseen observations, to see how well the trained algorithm generalises.

In [15]:
y_out_train = linear_regressor1.predict(X_train)

In [16]:
y_out_test = linear_regressor1.predict(X_test)

## Performance metrics
The last step is to evaluate the performance of the trained algorithms. There are many different metrics available, from which one chooses based on the nature of the problem. They all have the same interface, where one passes the true target values and the predicted target values and the result is calculated. In this case we are using the root mean squared error to measure the regression fit.

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

In [17]:
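# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
err_train = numpy.sqrt(sklearn.metrics.mean_squared_error(y_train, y_out_train))

In [18]:
err_test = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test, y_out_test))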

In [19]:
err_train

0.199508276160511

In [20]:
err_test

0.19606198922339482
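
As a small worked example of the shared metric interface, the sketch below computes the root mean squared error by hand and alongside two other regression metrics from `sklearn.metrics`. The four target and prediction values are made up purely for illustration.

```python
import numpy
import sklearn.metrics

# Made-up true and predicted values, for illustration only.
y_true = numpy.array([3.0, -0.5, 2.0, 7.0])
y_pred = numpy.array([2.5, 0.0, 2.0, 8.0])

# Every metric takes the true values first and the predictions second.
mse = sklearn.metrics.mean_squared_error(y_true, y_pred)
mae = sklearn.metrics.mean_absolute_error(y_true, y_pred)
r2 = sklearn.metrics.r2_score(y_true, y_pred)

# Root mean squared error, from the metric and written out by hand.
rmse = numpy.sqrt(mse)
rmse_manual = numpy.sqrt(numpy.mean((y_true - y_pred) ** 2))

print(f'MSE={mse:0.3f} RMSE={rmse:0.3f} MAE={mae:0.3f} R2={r2:0.3f}')
```

The root mean squared error is in the same units as the target, which is why the values reported above (about 0.2) can be compared directly with the noise standard deviation of 0.2 used to generate the data.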

# A non-linear example

So far we have looked at an example with quite simple, generated data. Now we look at an example that again uses generated data, but this time generated through a non-linear process so we can try other regression algorithms. This is still not a particularly "real world" example, but the algorithms used here are a lot more powerful and would be useful in a real-world problem. What this demonstrates is that, although our regressor is now a lot more powerful, the interfaces for the various parts of our processing pipeline look exactly the same as for our linear case.

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html
* https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html

In [21]:
input_nl, target_nl = sklearn.datasets.make_friedman1(n_samples=n_samples, n_features=n_features, noise=noise_std)
input_nl_train, input_nl_test, target_nl_train, target_nl_test = sklearn.model_selection.train_test_split(input_nl, target_nl)

In [22]:
scaler_nl = sklearn.preprocessing.StandardScaler()
scaler_nl.fit(input_nl_train)
X_nl_train = scaler_nl.transform(input_nl_train)
X_nl_test = scaler_nl.transform(input_nl_test)
y_nl_train = target_nl_train
y_nl_test = target_nl_test
reg_mlp1 = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=[20,20,20], max_iter=10000, tol=1e-5)
reg_mlp1.fit(X_nl_train, y_nl_train)
y_nl_out_train = reg_mlp1.predict(X_nl_train)
y_nl_out_test = reg_mlp1.predict(X_nl_test)


In [36]:
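# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
err_nl_train = numpy.sqrt(sklearn.metrics.mean_squared_error(y_nl_train, y_nl_out_train))
err_nl_test = numpy.sqrt(sklearn.metrics.mean_squared_error(y_nl_test, y_nl_out_test))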

In [37]:
print(f'RMS error: training data for non-linear example = {err_nl_train:0.3f}; testing data = {err_nl_test:0.3f}')

RMS error: training data for non-linear example = 0.260; testing data = 0.271
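
To see why a non-linear regressor is useful for this data, one can compare it against a linear baseline through the same fit/predict interface. The sketch below does this on a smaller Friedman dataset so that it runs quickly; the sample count, `max_iter` and the `random_state=0` seeds are assumptions chosen for illustration, and the exact error values will depend on them.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.neural_network
import sklearn.preprocessing

# Smaller dataset than in the notebook, with fixed seeds for repeatability.
X_f, y_f = sklearn.datasets.make_friedman1(
    n_samples=1000, n_features=10, noise=0.2, random_state=0)
X_f_train, X_f_test, y_f_train, y_f_test = sklearn.model_selection.train_test_split(
    X_f, y_f, random_state=0)

scaler_f = sklearn.preprocessing.StandardScaler().fit(X_f_train)
X_f_train_s = scaler_f.transform(X_f_train)
X_f_test_s = scaler_f.transform(X_f_test)

# Both regressors are trained and evaluated through the same interface.
regressors = {
    'linear': sklearn.linear_model.LinearRegression(),
    'mlp': sklearn.neural_network.MLPRegressor(
        hidden_layer_sizes=(20, 20, 20), max_iter=2000, random_state=0),
}
rmse_by_model = {}
for name, regressor in regressors.items():
    regressor.fit(X_f_train_s, y_f_train)
    predictions = regressor.predict(X_f_test_s)
    rmse_by_model[name] = float(numpy.sqrt(
        sklearn.metrics.mean_squared_error(y_f_test, predictions)))
    print(f'{name}: test RMS error = {rmse_by_model[name]:0.3f}')
```

The linear model cannot represent the sine and quadratic terms in the Friedman function, so its error stays well above the noise level; one would typically expect the neural network to do noticeably better.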


## Scikit-learn processing pipeline

More complicated real-world examples of regression may involve many more steps. Scikit-learn makes this easier by supplying a pipeline class for linking the steps together. This example shows how the above non-linear example could use a pipeline. In this case we are only using 2 elements in our pipeline, but there could be many more elements, which must all have the standard scikit-learn fit/predict or fit/transform interface.

Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [24]:
pipeline1 = sklearn.pipeline.Pipeline(
    steps=[('scaler', sklearn.preprocessing.StandardScaler()),
           ('neural_net', sklearn.neural_network.MLPRegressor(
               hidden_layer_sizes=[20,20,20], max_iter=10000, tol=1e-5))
          ]
)
pipeline1.fit(input_nl_train, target_nl_train)
y_pipe_out_train = pipeline1.predict(input_nl_train)
y_pipe_out_test = pipeline1.predict(input_nl_test)


In [38]:
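# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
err_pipe_train = numpy.sqrt(sklearn.metrics.mean_squared_error(target_nl_train, y_pipe_out_train))
err_pipe_test = numpy.sqrt(sklearn.metrics.mean_squared_error(target_nl_test, y_pipe_out_test))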

In [39]:
print(f'RMS error: training data for pipeline example = {err_pipe_train:0.3f}; testing data = {err_pipe_test:0.3f}')

RMS error: training data for pipeline example = 0.230; testing data = 0.249
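
A pipeline carries out the same fit and transform calls that were previously written by hand. The sketch below checks this with a deterministic regressor (`LinearRegression` is swapped in for the neural network so that the two results can be compared exactly); the step name `'regressor'`, the dataset size and the seeds are assumptions for illustration.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

X_p, y_p = sklearn.datasets.make_friedman1(
    n_samples=400, n_features=10, noise=0.2, random_state=0)
X_p_train, X_p_test, y_p_train, y_p_test = sklearn.model_selection.train_test_split(
    X_p, y_p, random_state=0)

# Scaler and regressor linked together in a pipeline.
pipe = sklearn.pipeline.Pipeline(steps=[
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('regressor', sklearn.linear_model.LinearRegression()),
])
pipe.fit(X_p_train, y_p_train)

# The same two steps carried out by hand.
manual_scaler = sklearn.preprocessing.StandardScaler().fit(X_p_train)
manual_regressor = sklearn.linear_model.LinearRegression().fit(
    manual_scaler.transform(X_p_train), y_p_train)
manual_predictions = manual_regressor.predict(manual_scaler.transform(X_p_test))

print(numpy.allclose(pipe.predict(X_p_test), manual_predictions))
# Individual fitted steps remain accessible by name.
print(numpy.allclose(pipe.named_steps['scaler'].mean_, X_p_train.mean(axis=0)))
```

With the neural network, the pipeline and the hand-written version give slightly different errors (as seen above) because each training run starts from different random weights, not because the pipeline does anything different.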


### Further Steps

In a real project there are further steps we would take to check our results. The first is cross-validation. This involves using several different train/test splits and checking that the results are fairly consistent between the different randomly selected splits. This checks that we haven't achieved spuriously good results by an accident of the random train/test split selection.
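
A minimal sketch of cross-validation with scikit-learn is shown here, using `cross_val_score` on a scaler plus linear regression pipeline. The dataset size, the choice of 5 folds and the `random_state=0` seeds are assumptions made for illustration.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

X_cv, y_cv = sklearn.datasets.make_regression(
    n_samples=1000, n_features=10, n_informative=5, noise=0.2, random_state=0)

pipe_cv = sklearn.pipeline.Pipeline(steps=[
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('regressor', sklearn.linear_model.LinearRegression()),
])

# Five shuffled folds; each fold is used once as the held-out test set.
folds = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
scores = sklearn.model_selection.cross_val_score(
    pipe_cv, X_cv, y_cv, cv=folds, scoring='neg_mean_squared_error')

# Scores are negated errors (higher is better), so flip the sign before the root.
rmse_per_fold = numpy.sqrt(-scores)
print(rmse_per_fold)
print(f'mean={rmse_per_fold.mean():0.3f} std={rmse_per_fold.std():0.3f}')
```

Passing the whole pipeline to `cross_val_score` means the scaler is refitted on the training part of each fold, so no information from the held-out part leaks into the preprocessing.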

In addition, we would usually do some hyperparameter tuning. Hyperparameters are the parameters of the model that are not adjusted by the standard training process (in scikit-learn, training is what is done by the call to the "fit" function of a classifier or regressor). For example, in a neural network, the number of hidden layers and the number of neurons in a hidden layer are hyperparameters. We are likely to get different results when selecting different hyperparameters. Hyperparameter tuning refers to a systematic process of varying the hyperparameters to find a set of hyperparameters that are suitable for the problem being solved.
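
One systematic way of varying hyperparameters is a grid search with cross-validation. The sketch below tries two hidden layer configurations for the neural network inside a pipeline; the candidate values, the small dataset, `cv=3`, `max_iter=1000` and the seeds are all assumptions chosen to keep the example quick, not recommended settings.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neural_network
import sklearn.pipeline
import sklearn.preprocessing

X_gs, y_gs = sklearn.datasets.make_friedman1(
    n_samples=300, n_features=10, noise=0.2, random_state=0)

pipe_gs = sklearn.pipeline.Pipeline(steps=[
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('neural_net', sklearn.neural_network.MLPRegressor(max_iter=1000, random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
candidate_layers = [(10,), (20, 20)]
param_grid = {'neural_net__hidden_layer_sizes': candidate_layers}

search = sklearn.model_selection.GridSearchCV(
    pipe_gs, param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(X_gs, y_gs)

print(search.best_params_)
print(f'best cross-validated MSE = {-search.best_score_:0.3f}')
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all of the supplied data with the best hyperparameters found, and it can be used with predict like any other regressor.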

Further reading:
* Cross validation - https://scikit-learn.org/stable/modules/cross_validation.html 
* Hyperparameter tuning - https://scikit-learn.org/stable/modules/grid_search.html
