# Validating performance of regression models
This notebook explains how to use CNTK metric functions to validate the performance of a regression model.
We're using the [car MPG dataset](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) from the UCI dataset library. This dataset is perfect for demonstrating how to build a regression model using CNTK. 

In the dataset, you'll find 9 columns:

1. mpg: continuous 
2. cylinders: multi-valued discrete 
3. displacement: continuous 
4. horsepower: continuous 
5. weight: continuous 
6. acceleration: continuous 
7. model year: multi-valued discrete 
8. origin: multi-valued discrete 
9. car name: string (unique for each instance)

All columns except `car name` contain numeric values. The `origin` column is numeric too, but its values are category codes rather than quantities, so it needs special treatment.
We'll strip the `car name` column as it cannot be used in our model.

## The model
The model we're using features two hidden layers, each with 64 neurons and a ReLU (Rectified Linear Unit) activation function. The output is a single neuron without an activation function, so the network can produce any real value. This is what turns the neural network into a regression model.

The network takes 9 input features (the `origin` column becomes three one-hot columns during preprocessing) and uses the miles per gallon as the target.

In [11]:
from cntk import default_options, input_variable
from cntk.layers import Dense, Sequential
from cntk.ops import relu

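# Hidden layers pick up ReLU from default_options; the output layer overrides it.
with default_options(activation=relu):
    model = Sequential([
        Dense(64),
        Dense(64),
        Dense(1, activation=None)  # linear output for regression
    ])

# 9 features after one-hot encoding `origin`; 1 target value (mpg).
features = input_variable(9)
target = input_variable(1)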

z = model(features)
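To make the shapes concrete, here is a rough NumPy sketch of the computation a network like this performs. The weights are random placeholders, and `forward` and `params` are illustrative names rather than CNTK API; CNTK creates and initializes its own parameters.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and clamps negatives to zero.
    return np.maximum(x, 0.0)

def forward(x, params):
    # Two hidden layers with ReLU, then a linear output layer.
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return h2 @ w3 + b3  # no activation: any real value is possible

rng = np.random.default_rng(0)
layer_shapes = [(9, 64), (64, 64), (64, 1)]
params = [(rng.normal(scale=0.1, size=s), np.zeros(s[1])) for s in layer_shapes]

batch = rng.normal(size=(4, 9))      # 4 samples with 9 features each
predictions = forward(batch, params)
print(predictions.shape)
```

Each sample produces exactly one unbounded number, which is what a regression target such as miles per gallon requires.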

## Preprocessing
In this section we'll first preprocess the data so that it is compatible for use with our neural network.
We need to load the data and then clean it up.

In [12]:
import pandas as pd
import numpy as np

In [13]:
df_cars = pd.read_csv('auto-mpg.csv', na_values=['?'])
df_cars = df_cars.dropna()

The `origin` column contains three possible values, listed in the dictionary below. To use the origin in the neural network we need to split it into three separate columns. First we replace the numeric codes with string labels. Then we ask pandas to generate dummy columns, which creates three columns: `origin_usa`, `origin_europe`, and `origin_japan`. For each sample in the dataset, one of these columns contains a value of 1 and the other two contain a value of 0.

In [14]:
origin_mapping = {
    1: 'usa',
    2: 'europe',
    3: 'japan'
}

df_cars.replace({'origin': origin_mapping}, inplace=True)

categorical_origin = pd.get_dummies(df_cars['origin'], prefix='origin')

df_cars = pd.concat([df_cars, categorical_origin], axis=1)
df_cars = df_cars.drop(columns=['origin', 'car name'])
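The effect of the encoding step is easier to see on a tiny frame. The sketch below uses three made-up rows, and it uses `Series.map` and `dtype=int` for readability, which are choices of this example rather than of the cell above.

```python
import pandas as pd

# Three hypothetical rows, one per origin code.
toy = pd.DataFrame({'mpg': [18.0, 26.0, 24.0], 'origin': [1, 2, 3]})

origin_mapping = {1: 'usa', 2: 'europe', 3: 'japan'}
toy['origin'] = toy['origin'].map(origin_mapping)

# One indicator column per label; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(toy['origin'], prefix='origin', dtype=int)
encoded = pd.concat([toy.drop(columns=['origin']), dummies], axis=1)
print(encoded)
```

Every row has exactly one indicator set to 1, and pandas orders the new columns alphabetically by label.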

The final result of this operation is the following dataset. It contains 10 columns. The `mpg` column is used as the target output, and the remaining 9 columns are used as features for the model.

In [15]:
df_cars.head()

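| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin_europe | origin_japan | origin_usa |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 0 | 0 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 0 | 0 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 0 | 0 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 0 | 0 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 0 | 0 | 1 |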


In [16]:
X = df_cars.drop(columns=['mpg']).values.astype(np.float32)
y = df_cars.iloc[:,0].values.reshape(-1,1).astype(np.float32)

The features live on very different scales: weight is in the thousands while acceleration is around 10 to 12. When you run the training process without scaling the inputs, you end up with exploding gradients in your neural network. So we apply standard scaling, which shifts each feature to a mean of 0 and scales it to a standard deviation of 1.

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
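As a rough illustration of what standard scaling computes, the sketch below standardizes the five `weight` values from the preview above by hand. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# The five weight values from the first rows of the dataset.
weights = np.array([[3504.0], [3693.0], [3436.0], [3433.0], [3449.0]])

mean = weights.mean(axis=0)
std = weights.std(axis=0)            # population standard deviation (ddof=0)
scaled = (weights - mean) / std

print(scaled.ravel())
print(scaled.mean(axis=0), scaled.std(axis=0))
```

The scaled column has mean 0 and standard deviation 1, but individual values are not confined to the range -1 to +1: the heaviest car here ends up close to 2.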

Now that we have a good dataset, we need to create a hold-out set to ensure that we validate the performance on data that we haven't used for training. This is important as this will tell us how the model performs on unseen data.

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
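The cells above fit the scaler on the full dataset before splitting. A common alternative is to split first and compute the scaling statistics from the training portion only, so that nothing about the hold-out set influences preprocessing. The sketch below shows the idea with NumPy on synthetic data; `X_raw` and the seed are placeholders, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=100.0, scale=20.0, size=(50, 3))  # synthetic stand-in data

# Shuffle the row indices, then hold out 20% for testing.
indices = rng.permutation(len(X_raw))
n_test = int(0.2 * len(X_raw))
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Scaling statistics come from the training rows only.
train_mean = X_raw[train_idx].mean(axis=0)
train_std = X_raw[train_idx].std(axis=0)

X_train_scaled = (X_raw[train_idx] - train_mean) / train_std
X_test_scaled = (X_raw[test_idx] - train_mean) / train_std
print(X_train_scaled.shape, X_test_scaled.shape)
```

With scikit-learn the same pattern would be `scaler.fit(X_train)` followed by `scaler.transform` on both sets.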

## Training the neural network
Now that we have a neural network and a prepared dataset, let's train the network using the training set.
We're using a squared error loss function, the standard choice for regression models. We'll train the model using an SGD (stochastic gradient descent) learner, which is the most basic learner in CNTK.
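For intuition about what an SGD learner does on each minibatch, here is a minimal NumPy sketch for a plain linear model with a hand-derived gradient. It is not how CNTK implements `sgd`; the data, the `sgd_step` helper, and the learning rate are assumptions made for this example.

```python
import numpy as np

def sgd_step(w, b, x_batch, y_batch, lr):
    """One SGD update for a linear model under mean squared error."""
    error = x_batch @ w + b - y_batch                 # shape (batch, 1)
    grad_w = 2.0 * x_batch.T @ error / len(x_batch)   # gradient w.r.t. weights
    grad_b = 2.0 * error.mean(axis=0)                 # gradient w.r.t. bias
    return w - lr * grad_w, b - lr * grad_b

def mse(w, b, x, y):
    return float(np.mean((x @ w + b - y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 2))
y = x @ np.array([[3.0], [-2.0]]) + 1.0               # synthetic linear target

w, b = np.zeros((2, 1)), np.zeros(1)
loss_before = mse(w, b, x, y)
for epoch in range(50):
    for start in range(0, len(x), 16):                # minibatches of 16 samples
        w, b = sgd_step(w, b, x[start:start + 16], y[start:start + 16], lr=0.05)
loss_after = mse(w, b, x, y)
print(loss_before, loss_after)
```

Each step moves the parameters a small distance against the gradient of the loss on one minibatch, and repeating this over many epochs drives the loss down.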

The criterion for training is the mean squared error function. Additionally, we would like to measure the mean absolute error of the model. To do this, we need to create a CNTK function factory that produces a combination of the two.

When you mark a Python function with the `cntk.Function` decorator, CNTK automatically converts it to a function object that has a `train` method for training and a `test` method for validation.

Since we don't have a mean absolute error function out-of-the-box we'll create it here as well using the standard CNTK operators.

In [19]:
import cntk 
from cntk.losses import squared_error

def absolute_error(output, target):
    return cntk.ops.reduce_mean(cntk.ops.abs(output - target))

@cntk.Function
def criterion_factory(output, target):
    loss = squared_error(output, target)
    metric = absolute_error(output, target)
    
    return loss, metric
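To see how the two quantities in the criterion differ, the following NumPy sketch computes both on four made-up predictions. The numbers are illustrative only and do not come from the dataset.

```python
import numpy as np

predicted = np.array([[20.0], [31.0], [15.0], [26.0]])
actual = np.array([[18.0], [30.0], [15.0], [32.0]])

difference = predicted - actual                  # 2, 1, 0, -6
mean_squared = np.mean(difference ** 2)          # (4 + 1 + 0 + 36) / 4 = 10.25
mean_absolute = np.mean(np.abs(difference))      # (2 + 1 + 0 + 6) / 4 = 2.25
print(mean_squared, mean_absolute)
```

The absolute error stays in the unit of the target, while the squared error is in squared units and is dominated by the largest miss.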

In [21]:
from cntk.logging import ProgressPrinter
from cntk.learners import sgd

loss = criterion_factory(z, target)
learner = sgd(z.parameters, 0.001)

progress_printer = ProgressPrinter(0)

train_summary = loss.train((X_train,y_train), 
                           parameter_learners=[learner], 
                           callbacks=[progress_printer],
                           minibatch_size=16,
                           max_epochs=50)

 average      since    average      since      examples
    loss       last     metric       last              
 ------------------------------------------------------
Learning rate per minibatch: 0.001
      627        627         24         24            16
      587        567       22.9       22.4            48
      541        506         22       21.3           112
      424        322       19.3       16.9           240
     63.1       63.1       7.29       7.29            16
     33.9       19.3       4.73       3.45            48
     29.4         26       4.39       4.14           112
     25.5         22       3.98       3.62           240
     19.1       19.1       3.99       3.99            16
     9.76       5.12       2.56       1.85            48
     9.42       9.16       2.44       2.36           112
     11.7       13.7       2.63       2.78           240
     16.1       16.1       3.44       3.44            16
     8.64       4.91       2.27       1.69            48

     3.67        2.4       1.45       1.14            48
     3.86          4       1.51       1.56           112
      5.6       7.12        1.7       1.86           240
     6.18       6.18       2.05       2.05            16
     3.65       2.39       1.44       1.14            48
     3.84       3.98       1.51       1.56           112
     5.57       7.09       1.69       1.85           240
     6.13       6.13       2.04       2.04            16
     3.63       2.38       1.44       1.14            48
     3.81       3.95        1.5       1.55           112
     5.55       7.07       1.69       1.85           240
     6.08       6.08       2.02       2.02            16
      3.6       2.36       1.43       1.13            48
     3.79       3.93        1.5       1.55           112
     5.53       7.05       1.68       1.85           240
     6.05       6.05       2.02       2.02            16
     3.59       2.36       1.43       1.13            48
     3.77       3.91       1.49

The output of the training session looks promising. The average loss falls from over 600 at the start to single digits in the later epochs, and the metric follows the same trend. It's not perfect, but not bad for a first attempt.

## Evaluating model performance
To measure the performance of our model, we're first going to use the squared error function from CNTK.
This gives us a rough idea of the error rate of the model. But the value is in squared units (miles per gallon squared), so it is hard to interpret directly.

As an alternative we'll also use the mean absolute error metric. This gives a more understandable error rate.
This metric gives us a good idea of just how much we're off predicting the miles per gallon.
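One way to compare the two views is to take the square root of the mean squared error (RMSE), which brings it back to the unit of the target. The sketch below uses two made-up sets of errors with the same mean absolute error; the helper names are hypothetical.

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [1.0, 1.0, 1.0, 1.0]    # every prediction is off by 1
spiky_errors = [0.0, 0.0, 0.0, 4.0]   # three exact hits and one large miss

print(mae(even_errors), rmse(even_errors))    # 1.0 1.0
print(mae(spiky_errors), rmse(spiky_errors))  # 1.0 2.0
```

Both sets are off by 1 on average, but the squared measure doubles for the set with a single large miss. This is why the mean absolute error reads more directly as "how far off are we on a typical prediction".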

CNTK doesn't include a mean absolute error function, which is why we built `absolute_error` from the standard CNTK ops earlier.

We call the `test` method on the criterion function to determine how well our model is doing. This is different from the classification model, where we had to do quite a bit more to measure the performance of our model.

The output of the `test` method tells us how many miles per gallon the model is off on average when predicting on the test set we created earlier. In the run shown below, that is about 2.4 miles per gallon over 79 test samples.

In [22]:
loss.test((X_test, y_test))

{'metric': 2.4377947457229037, 'samples': 79}