# A little more complicated

### Introduction

In the last lesson, we started with a machine learning algorithm that found the true relationship between customers and temperature beneath our data: 

$$customers = 3*temperature + 10$$

Then we showed that even when our model finds this true relationship, it cannot perfectly predict data with random errors.  This is because our machine learning algorithm cannot predict the random influences on future outcomes, which makes sense.  This is called *irreducible error*.

Now we'll make things harder for our machine learning algorithm.  Unlike in the last lesson, we won't *train* our algorithm on a perfectly clean dataset.  In other words, this time our machine learning algorithm won't have the benefit of training on data that perfectly reflects the true underlying model.

This is more realistic.  In real life, we don't train our model on data that perfectly matches a linear model.  The data we collect will be subject to a degree of randomness -- or in other words, noise.  In this lesson, we'll see the complications that arise from training on noisy data.

### Training on a noisy dataset

Once again, we have our customer model of the following:

$$customers = 3*temperature + 10 + \epsilon $$

Let's get our data.

> We'll import a function, `build_data_set`, that constructs a noisy dataset.

In [2]:
from random import seed
from data import build_data_set
seed(4)

dataset = build_data_set()

Our `dataset` is a numpy array in which each row holds a temperature (the feature) followed by the corresponding number of customers (the target variable).

In [3]:
temperatures = dataset[:, 0]
noisy_customers = dataset[:, 1]
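We don't see the inside of `build_data_set` in this lesson, but a minimal sketch of how such a function might work is below.  The function name `build_noisy_dataset`, the number of rows, the temperature range, and the size of the noise are all assumptions made for illustration, not the values used above.

```python
import numpy as np

def build_noisy_dataset(n_rows=50, noise_scale=25.0, rng=None):
    """Hypothetical stand-in for build_data_set; every default here is an assumption."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumed temperature range; the lesson does not state the real one.
    sample_temperatures = rng.uniform(30, 100, size=n_rows)
    # epsilon: zero-mean random noise added on top of the true relationship.
    epsilon = rng.normal(0, noise_scale, size=n_rows)
    sample_customers = 3 * sample_temperatures + 10 + epsilon
    # Column 0 holds the feature, column 1 holds the target.
    return np.column_stack([sample_temperatures, sample_customers])

sketch_dataset = build_noisy_dataset(rng=np.random.default_rng(4))
print(sketch_dataset.shape)  # (50, 2)
```

Each row pairs one temperature with one noisy customer count, which is why slicing `[:, 0]` and `[:, 1]` separates the feature from the target.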

In [5]:
from graph import trace_values, plot
import plotly.plotly as py
data_trace = trace_values(temperatures, noisy_customers, name = 'initial data')
layout = {'yaxis': {'title': 'customers'}, 'xaxis': {'title': 'temperature'}}
py.plot([data_trace], layout = layout)

'https://plot.ly/~JeffKatzy/233'

Ok, now it's time for us to train our model on this noisy dataset.

In [10]:
from sklearn.linear_model import LinearRegression
initial_model = LinearRegression()
initial_model.fit(temperatures.reshape(-1, 1), noisy_customers)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Now let's look at what the model discovered.

In [11]:
initial_model.coef_

array([3.49140605])

In [12]:
initial_model.intercept_

-24.212252962233208

Now notice that when we train our model on the noisy data, it does not recover the true model of $y = 3x + 10$.  The slope of roughly 3.49 is in the neighborhood of 3, but the intercept of roughly -24.2 is far from 10.  Why *is* this?  Why did it not say that the best fit line has the underlying parameters of $y = 3x + 10$?

### Introducing Variance

Well here, our hypothesis function did not find the parameters of the true model because whenever we train our model, the model is simply trying to draw a line that minimizes the sum of the squared errors through the random data.  And so the resulting hypothesis function is influenced by the whims and randomness of the data it sees.
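To make "minimizes the sum of the squared errors" concrete, here is a minimal sketch of the closed-form least squares fit for a single feature.  The helper name `fit_line_least_squares` and the simulated data (temperature range, noise size, seed) are assumptions for illustration; this is not how `LinearRegression` is implemented internally.

```python
import numpy as np

def fit_line_least_squares(x, y):
    """Return the (slope, intercept) that minimize the sum of squared errors."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

rng = np.random.default_rng(0)
x = rng.uniform(30, 100, size=50)               # assumed temperature range
clean_y = 3 * x + 10                            # the true relationship, no noise
noisy_y = clean_y + rng.normal(0, 25, size=50)  # assumed noise size

# On clean data the fit recovers slope 3 and intercept 10 (up to floating point).
print(fit_line_least_squares(x, clean_y))
# On noisy data the same procedure is pulled away from (3, 10) by the noise.
print(fit_line_least_squares(x, noisy_y))
```

The procedure is identical in both calls.  Only the data changed, so any gap between the fitted parameters and the true ones comes from the noise in the training data.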

In [7]:
import numpy as np
from graph import trace_values
import plotly.plotly as py
from data import temperatures, initial_model, noisy_customers
sorted_temps = np.sort(temperatures)
layout = {'yaxis': {'title': 'customers'}, 'xaxis': {'title': 'temperature'}}
data_trace = trace_values(temperatures, noisy_customers, name = 'initial data')
customer_predictions = initial_model.predict(sorted_temps.reshape(-1, 1))
model_trace = trace_values(sorted_temps, customer_predictions, name = 'expected', mode = 'lines')
py.plot([data_trace, model_trace], layout = layout)

'https://plot.ly/~JeffKatzy/235'

Now let's see how well the model performs.

In [24]:
from sklearn.metrics import mean_squared_error
from math import sqrt
sqrt(mean_squared_error(noisy_customers, initial_model.predict(temperatures.reshape(-1, 1))))

25.303665437086153
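The number above is a root mean squared error (RMSE): we square each prediction error, average the squares, and take the square root.  A minimal sketch of that calculation by hand, using a small made-up set of values (the numbers below are illustrative and not taken from our dataset):

```python
import numpy as np

def root_mean_squared_error(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_errors = (actual - predicted) ** 2
    return float(np.sqrt(squared_errors.mean()))

# Made-up values: the errors are 6, -8, 0 and 0.
example_actual = [100, 130, 160, 190]
example_predicted = [94, 138, 160, 190]
print(root_mean_squared_error(example_actual, example_predicted))  # 5.0
```

Because the errors are squared before averaging, large misses count for more than small ones, so RMSE is not quite the same thing as the average absolute miss.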

So this says that when we make a prediction, we will typically be off by about 25 customers.  One source of error in our model is called variance.  Variance measures the amount that our parameters change each time that we estimate the model.  And our parameters would change because our models are trained on different samples of our random data.

By way of example, let's train our model on six hundred different sets of data, each of which has an element of randomness in it.

In [26]:
from data import build_data_set

models = []
for training_set in range(0, 600):
    dataset = build_data_set()
    model = LinearRegression()
    temperatures = dataset[:, 0]
    noisy_customers = dataset[:, 1]
    
    model.fit(temperatures.reshape(-1, 1), noisy_customers)
    models.append(model)

parameters = np.array([(model.coef_, model.intercept_) for model in models])

In [27]:
np.average(parameters[:, 0])

array([3.01825515])

In [28]:
np.average(parameters[:, 1])

8.661770692678378

Here, we can see that our parameters are much closer to the underlying model than when we just performed one estimate.

So we just saw the two main points about variance.  
1. If we train our model many times on data that has random variations, the parameters of our model will vary each time, and this variation from the true parameter is called error due to variance.  
2. However, if we fit our model many times, we expect each estimated parameter to hover around the true parameter, and the average of each parameter to approach the parameter of the true model. 
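Both points can be checked numerically by looking at the spread of the fitted parameters as well as their average.  Here is a minimal sketch using simulated data; the temperature range, noise size, rows per dataset, and seed are assumptions, and `np.polyfit` stands in for `LinearRegression` to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
slopes, intercepts = [], []

for _ in range(600):
    # Draw a fresh noisy dataset from the same true model each time.
    x = rng.uniform(30, 100, size=50)
    y = 3 * x + 10 + rng.normal(0, 25, size=50)
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes.append(slope)
    intercepts.append(intercept)

slopes = np.array(slopes)
intercepts = np.array(intercepts)

# The standard deviation shows how much each parameter varies from fit to fit (point 1).
# The mean shows where the estimates are centered (point 2).
print("slope:     mean", slopes.mean(), "std", slopes.std())
print("intercept: mean", intercepts.mean(), "std", intercepts.std())
```

Under these assumptions the intercept should vary far more from fit to fit than the slope, which is consistent with how far off the intercept of a single fit can be.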

### Summary

In this lesson, we learned about error due to variance.  Error due to variance occurs because we train on data that has randomness built into it.  Because of this, if we imagined training (or actually did train) our model multiple times on different datasets, the parameters of our model would vary each time.  This fluctuation is called variance.  Now because this variance is random, if we were to take the average of the parameters we would expect the errors due to variance to cancel each other out, and thus average to zero.  We saw a demonstration of this when we averaged the parameters of our models, and the averages approached the true model parameters.

One danger of error due to variance is overfitting.  We have overfitting when our model performs better on data it trained on than on data it has not yet seen.  This problem is referred to as a problem of generalization.
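One way to look for this is to compare the error on the training data with the error on data held out from training.  The sketch below does so on simulated data; the dataset sizes, temperature range, noise size, and seed are assumptions, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_noisy_data(n_rows):
    """Draw temperatures and noisy customers from the true model (assumed ranges)."""
    x = rng.uniform(30, 100, size=n_rows)
    y = 3 * x + 10 + rng.normal(0, 25, size=n_rows)
    return x, y

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

train_x, train_y = sample_noisy_data(20)
test_x, test_y = sample_noisy_data(20)  # data the model never trains on

slope, intercept = np.polyfit(train_x, train_y, deg=1)

train_rmse = rmse(train_y, slope * train_x + intercept)
test_rmse = rmse(test_y, slope * test_x + intercept)
# The fitted line can never do worse on the training data than the true line does,
# because least squares picks the line with the smallest training error.
true_line_train_rmse = rmse(train_y, 3 * train_x + 10)

print("training RMSE (fitted line):", train_rmse)
print("training RMSE (true line):  ", true_line_train_rmse)
print("held-out RMSE (fitted line):", test_rmse)
```

On average we would expect the training error to be lower than the held-out error, although on any single split this is not guaranteed.  The comparison with the true line shows why: part of the fitted line's low training error comes from fitting the noise.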

