# Exercise: Supervised learning using different cost functions

In this exercise, we will have a deeper look at how cost functions can change:

1. how well models appear to have fit data
2. the kinds of relationships a model represents

## Loading the data

Let's start by having a look at the data. To make this exercise simpler, we will only use a few datapoints this time.

In [1]:
import pandas
from datetime import datetime

# Load a file containing our weather data
dataset = pandas.read_csv('Data/seattleWeather_1948-2017.csv', parse_dates=['date'])

# Convert the dates into numbers so we can use them in our models
# We make a year column which can contain fractions. For example,
# 1948.5 is halfway through the year 1948
dataset["year"] = [(d.year + d.timetuple().tm_yday / 365.25) for d in dataset.date]


# For the sake of this exercise, let's look at Feb 1st for the following years:
desired_dates = [
    datetime(1950,2,1),
    datetime(1960,2,1),
    datetime(1970,2,1),
    datetime(1980,2,1),
    datetime(1990,2,1),
    datetime(2000,2,1),
    datetime(2010,2,1),
    datetime(2017,2,1),
]

dataset = dataset[dataset.date.isin(desired_dates)].copy()

# Print the dataset
dataset


|  | date | amount_of_precipitation | max_temperature | min_temperature | rain | year |
|---|---|---|---|---|---|---|
| 762 | 1950-02-01 | 0.0 | 27 | 1 | False | 1950.087611 |
| 4414 | 1960-02-01 | 0.15 | 52 | 44 | True | 1960.087611 |
| 8067 | 1970-02-01 | 0.0 | 50 | 42 | False | 1970.087611 |
| 11719 | 1980-02-01 | 0.37 | 54 | 36 | True | 1980.087611 |
| 15372 | 1990-02-01 | 0.08 | 45 | 37 | True | 1990.087611 |
| 19024 | 2000-02-01 | 1.34 | 49 | 41 | True | 2000.087611 |
| 22677 | 2010-02-01 | 0.08 | 49 | 40 | True | 2010.087611 |
| 25234 | 2017-02-01 | 0.0 | 43 | 29 | False | 2017.087611 |
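
As a quick sanity check, the fractional `year` value for a single date can be recomputed by hand. This is a minimal sketch that repeats the same formula used in the loading code for 1 February 1950; the variable names `day_of_year` and `fractional_year` are only illustrative.

```python
from datetime import datetime

# Recompute the fractional year for 1 February 1950
d = datetime(1950, 2, 1)

# 1 February is the 32nd day of the year (31 days of January, plus 1)
day_of_year = d.timetuple().tm_yday

# Same formula as in the loading code: year plus the fraction of the year elapsed
fractional_year = d.year + day_of_year / 365.25

print("Day of year:", day_of_year)
print("Fractional year:", round(fractional_year, 6))
```

The result should agree with the `year` column for the 1950 row, to the precision shown there.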


## Comparing two cost functions

Let's compare two common cost functions: the _sum of squared differences_ (SSD) and the _sum of absolute differences_ (SAD). Both start from the difference between each predicted value and the expected value. They differ only in what they do with that difference:

* SSD squares that difference, and sums the result;
* SAD converts differences into absolute differences, then sums them.
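
Written as formulas, the two bullet points look like this. The symbols are notation assumed here for illustration and do not appear elsewhere in the exercise: $\hat{y}_i$ is the model's estimate for datapoint $i$, $y_i$ is the actual value, and $n$ is the number of datapoints.

$$
\text{SSD} = \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2
\qquad
\text{SAD} = \sum_{i=1}^{n} \left|\hat{y}_i - y_i\right|
$$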

To see these in action, we need to first implement these functions:

In [2]:
import numpy

def sum_of_square_differences(estimate, actual):
    return numpy.sum((estimate - actual)**2)

def sum_of_absolute_differences(estimate, actual):
    return numpy.sum(numpy.abs(estimate - actual))

They're very similar. How do they behave? Let's test with some fake model estimates.

Let's say that the correct answers are `1` and `3`, but the model estimates `2` and `2`:

In [3]:
actual_label = numpy.array([1, 3])
model_estimate = numpy.array([2, 2])

print("SSD:", sum_of_square_differences(model_estimate, actual_label))
print("SAD:", sum_of_absolute_differences(model_estimate, actual_label))

SSD: 2
SAD: 2


We have an error of `1` for each estimate, and both methods have returned the same cost.

What happens if we distribute these errors differently? Let's pretend we estimated the first value perfectly, but were off by `2` for the second value: 

In [4]:
actual_label = numpy.array([1, 3])
model_estimate = numpy.array([1, 1])

print("SSD:", sum_of_square_differences(model_estimate, actual_label))
print("SAD:", sum_of_absolute_differences(model_estimate, actual_label))

SSD: 4
SAD: 2


SAD has calculated the same cost as before, because the total absolute error is unchanged (`1 + 1 = 0 + 2`). According to SAD, the first and second sets of estimates were equally good.

By contrast, SSD has given a higher (worse) cost for the second set of estimates (<code>1<sup>2</sup> + 1<sup>2</sup> < 0<sup>2</sup> + 2<sup>2</sup></code>). When we use SSD, we not only encourage models to be accurate, but to be consistent in their accuracy.
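
To see why SSD favours consistency, it can help to look at how much a single error of a given size adds to each cost. The sketch below uses made-up error sizes (`1`, `2`, `4` and `8`) purely for illustration; they are not taken from the weather data.

```python
import numpy

# Hypothetical error sizes, chosen only for illustration
errors = numpy.array([1, 2, 4, 8])

# What each individual error contributes to the two costs
absolute_costs = numpy.abs(errors)   # contribution to SAD
squared_costs = errors ** 2          # contribution to SSD

for error, sad_part, ssd_part in zip(errors, absolute_costs, squared_costs):
    print(f"An error of {error} adds {sad_part} to SAD and {ssd_part} to SSD")
```

Each time the error doubles, its contribution to SAD doubles, but its contribution to SSD quadruples. One large error therefore costs more under SSD than several small errors that add up to the same total.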


## Differences in action

Let's compare how our two cost functions affect model fitting.

First, fit a model using the SSD cost function:

In [5]:
from microsoft_custom_linear_regressor import MicrosoftCustomLinearRegressor
import graphing

# Create and fit the model
# We use a custom object that we have hidden from this notebook as
# you do not need to understand its details. This fits a linear model
# using a provided cost function

# Fit a model using sum of square differences
model = MicrosoftCustomLinearRegressor().fit(X = dataset.year, 
                                             y = dataset.min_temperature, 
                                             cost_function = sum_of_square_differences)

# Graph the model
graphing.scatter_2D(dataset, 
                    label_x="year", 
                    label_y="min_temperature", 
                    trendline=model.predict)


Our SSD method normally does well, but here it did a poor job: the line is far from the values for many years. Why? Notice that the datapoint at the bottom left doesn't follow the trend of the other datapoints. The winter of 1950 was very cold in Seattle, and because SSD squares each error, this single datapoint strongly influences our final model (the blue line). What happens if we change the cost function?
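
One way to gauge how much the 1950 datapoint matters is to fit a least-squares line with and without it. The sketch below does not use the hidden `MicrosoftCustomLinearRegressor`; as a stand-in it uses `numpy.polyfit`, which fits a straight line by minimising the sum of squared differences. The values are copied from the table printed earlier, and names such as `slope_all` and `slope_rest` are only illustrative.

```python
import numpy

# Values copied from the table printed earlier in this exercise
year = numpy.array([1950, 1960, 1970, 1980, 1990, 2000, 2010, 2017]) + 0.087611
min_temperature = numpy.array([1, 44, 42, 36, 37, 41, 40, 29])

# Measure time in years since the first datapoint. This shifts the
# intercept but leaves the slope unchanged.
x = year - year[0]

# A degree-1 polyfit is a least-squares straight line (it minimises SSD)
slope_all, intercept_all = numpy.polyfit(x, min_temperature, 1)

# Repeat the fit with the 1950 datapoint left out
slope_rest, intercept_rest = numpy.polyfit(x[1:], min_temperature[1:], 1)

print("Slope with 1950 included:", round(slope_all, 3))
print("Slope with 1950 excluded:", round(slope_rest, 3))
```

With these eight values, the fitted slope is positive when 1950 is included and negative when it is left out, so a single datapoint decides the direction of the trend under SSD.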

### Sum of absolute differences

Let's repeat what we've just done, but using SAD.

In [6]:
# Fit a model using sum of absolute differences
model = MicrosoftCustomLinearRegressor().fit(X = dataset.year, 
                                             y = dataset.min_temperature, 
                                             cost_function = sum_of_absolute_differences)

# Graph the model
graphing.scatter_2D(dataset, 
                    label_x="year", 
                    label_y="min_temperature", 
                    trendline=model.predict)


This line passes much closer to the majority of points than before, at the expense of almost ignoring the measurement taken in 1950.
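
The trade-off can also be checked numerically. The sketch below is not the hidden regressor used above; it is a small stand-in under two assumptions. First, the least-squares line comes from `numpy.polyfit`. Second, the SAD line is found by brute force, relying on the known property of least-absolute-deviation fitting that at least one best line passes through two of the datapoints. The `x` values are the years from the table, measured from the 1950 datapoint.

```python
import itertools
import numpy

def sum_of_square_differences(estimate, actual):
    return numpy.sum((estimate - actual)**2)

def sum_of_absolute_differences(estimate, actual):
    return numpy.sum(numpy.abs(estimate - actual))

# Years since the 1950 datapoint, and minimum temperatures from the table
x = numpy.array([0, 10, 20, 30, 40, 50, 60, 67], dtype=float)
y = numpy.array([1, 44, 42, 36, 37, 41, 40, 29], dtype=float)

# SAD fit: try the line through every pair of points, keep the cheapest
best_cost, sad_slope, sad_intercept = numpy.inf, None, None
for i, j in itertools.combinations(range(len(x)), 2):
    slope = (y[j] - y[i]) / (x[j] - x[i])
    intercept = y[i] - slope * x[i]
    cost = sum_of_absolute_differences(slope * x + intercept, y)
    if cost < best_cost:
        best_cost, sad_slope, sad_intercept = cost, slope, intercept

# SSD fit: an ordinary least-squares straight line
ssd_slope, ssd_intercept = numpy.polyfit(x, y, 1)

sad_line = sad_slope * x + sad_intercept
ssd_line = ssd_slope * x + ssd_intercept

# Score both lines with both cost functions
print("SAD line -> SAD:", sum_of_absolute_differences(sad_line, y),
      " SSD:", sum_of_square_differences(sad_line, y))
print("SSD line -> SAD:", sum_of_absolute_differences(ssd_line, y),
      " SSD:", sum_of_square_differences(ssd_line, y))
print("Error at 1950 -> SAD line:", abs(sad_line[0] - y[0]),
      " SSD line:", abs(ssd_line[0] - y[0]))
```

Each line wins on the cost function it was fitted with and loses on the other, which is why neither fit is 'best' in an absolute sense.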

In our farming scenario, we're interested in how average temperatures are changing over time. We don't have much interest in 1950 specifically, so for us this is a better result. In other situations, of course, we might consider this result worse.


## Summary

In this exercise, you learned how changing the cost function used during fitting can change the final model.

You also learned why this happens: the cost function defines what the 'best' fit means. SSD penalises large errors much more heavily than small ones, so a single unusual datapoint can pull the line towards it, while SAD weighs each error in proportion to its size. From a data analyst's point of view, there can be drawbacks no matter which cost function is chosen.