# A perfect world

### Introduction

In the last lesson, we discussed the different types of error that we see when we train our machine learning models: irreducible error, variance and bias.  Now the first two of these errors are due to randomness.  In the lessons that follow, we'll explore these errors in turn.  

Our technique for exploring errors is to first construct our data in such a way that these errors do not exist.  Doing so will have two benefits:

1. The easiest way to appreciate and visualize our different errors is to see what a machine learning model looks like when they do not exist.
2. The ability to create artificial datasets is crucial to our ability to explore new techniques and algorithms.

Let's get going.

### Imagining a world without error

Ok, so in this lesson, we'll create a dataset that our machine learning model can perfectly predict.  Then we'll train our model on this dataset.

Here's our scenario.

Imagine that we are hired by a restaurant to predict the number of customers who enter the restaurant each day.  The restaurant wants to reduce the amount it wastes on purchasing extra food supplies.  It knows that temperature affects the number of customers it sees each day, but it doesn't know by how much.

In other words, our model looks like the following:

$$ customers = \theta_1*temperature + \theta_0 $$

> Here we use the Greek letter theta, $\theta$, to represent the parameters that we are trying to discover.  Previously, we have used $x_1$ or $m$ for this.  $\theta_1$ is equivalent to our coefficient parameter and $\theta_0$ is equivalent to our intercept parameter.

Now we don't have any data yet from the restaurant -- they haven't yet hired us -- so we need to create a mock dataset ourselves.  Let's first explore how to create a synthetic dataset, and then we'll see what our model from this data looks like.

### 1. Creating random numbers

We know that our only feature is temperature and our target is the number of customers.  So we can start off by creating our feature variable data: a list of 50 random temperatures between 30 degrees and 101 degrees.

In [10]:
from random import randint
random_temperatures = [randint(30,101) for num in range(0, 50)]
random_temperatures[0:3]

[38, 62, 45]

> Press shift + enter on the cell above multiple times.  

Notice that every time the cell is executed, a different set of numbers is generated.  This makes sense, but it can also be problematic.  We don't want this process to generate a list of random numbers that we cannot reproduce.  What if something goes wrong?  We'll need the ability to recreate our data so that we can then carefully explore everything that went wrong.  

So we want the ability to generate the same list of random (really, pseudo-random) numbers each time.  We do this by *seeding* our random number generator.

Let's see how seeding works.

In [5]:
from random import randint
import random
random.seed(1)
randint(1, 10)

3

Seeding configures our random number generator to generate the same sequence of numbers.  Try re-executing the cell above, and notice that the first random number is always 3.

So now let's use seeding to generate a list of 50 random numbers that will be the same list even if we re-execute the cell.

In [13]:
from random import randint
import random
random.seed(1)  # fix the seed so the same list is generated on every run
random_temperatures = [randint(30,101) for num in range(0, 50)]
random_temperatures[0:3]

[47, 38, 62]

So the first three numbers will always be 47, 38, and 62.
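As a complementary sketch, seeding can also be done on a private generator object instead of the module-level one, which keeps the reproducible sequence isolated from any other code that calls `random`. The helper name `make_temperatures` and its default arguments are made up for this illustration; `random.Random` itself is part of the standard library.

```python
import random

def make_temperatures(seed, count=50, low=30, high=101):
    """Return `count` pseudo-random integer temperatures in [low, high]."""
    # A private generator leaves the module-level random state untouched.
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(count)]

first_run = make_temperatures(seed=1)
second_run = make_temperatures(seed=1)
other_seed = make_temperatures(seed=2)

# Same seed: the two lists are identical.
print(first_run == second_run)
# Different seed: the lists will almost certainly differ.
print(first_run == other_seed)
```

Because the seed is an explicit argument, anyone re-running the function with the same seed gets the same temperatures back.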

### 2. Generate perfect outcomes

Ok, so now that we have generated a list of random temperatures, our inputs, it's time to generate our outcomes.  

Remember that our goal for now is to construct a perfect dataset, so we want to do this in a way that these outcomes have no randomness in them.  Also remember that the restaurant believes that the true model has the following form:

$$ customers = \theta_1*temperature + \theta_0 $$

Let's assume that $\theta_1 = 3$ and $\theta_0 = 10$.  We then have the following model:

$$ customers = 3*temperature + 10 $$

And we should be able to construct a dataset that perfectly matches this model.  Here it is.

In [11]:
perfect_customers = [3*temp + 10 for temp in random_temperatures]
perfect_customers[0:3]

[151, 124, 196]
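One way to convince ourselves that this dataset really is "perfect" is to check the residuals, meaning the difference between each outcome and what the true model says it should be. The following is a minimal sketch that rebuilds the data with the same seed and confirms that every residual is zero; the variable name `residuals` is introduced here only for illustration.

```python
import random

random.seed(1)
random_temperatures = [random.randint(30, 101) for _ in range(50)]
perfect_customers = [3 * temp + 10 for temp in random_temperatures]

# Residual = observed outcome minus the value given by the true model.
residuals = [
    customers - (3 * temp + 10)
    for temp, customers in zip(random_temperatures, perfect_customers)
]

# With no randomness in the outcomes, 0 is the only residual that appears.
print(set(residuals))
```

Since the inputs are integers and the model uses only integer arithmetic, the residuals are exactly zero rather than merely close to it.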

Ok, so let's plot our data.

In [10]:
import plotly.plotly as py
from graph import trace_values, plot
data_trace = trace_values(random_temperatures, perfect_customers, name = 'initial data')
layout = {'yaxis': {'title': 'customers'}, 'xaxis': {'title': 'temperature'}}
py.plot([data_trace], layout = layout)

'https://plot.ly/~JeffKatzy/225'

### 3. Training our linear model

So far we have created a synthetic dataset.  And to create a perfect synthetic dataset, we started off with our true model, and then created our dataset from there.

But normally, we do not know the true underlying model.  Instead we use a machine learning algorithm to approximate it.  Let's imagine we did not know that our underlying model was $customers = 3*temperature + 10$ and see how well our regression model approximates this underlying model.

1. Transform dataset into rows

Scikit-learn takes feature variables in the form of a nested list, where each inner list holds the feature values for one observation.  So we wrap each of our temperatures in a list like so:

In [8]:
from data import random_temperatures
temperature_inputs = [[temperature] for temperature in random_temperatures]

This takes our data from here:

In [9]:
random_temperatures[0:3]

[47, 38, 62]

To here:

In [10]:
temperature_inputs[0:3]

[[47], [38], [62]]
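The nested shape can look redundant when there is only one feature. It may help to see that each inner list is one observation (a row) and each position inside it is one feature (a column). The sketch below contrasts the one-feature case with a two-feature case; the `is_weekend` values are hypothetical and are not part of this lesson's dataset.

```python
temperatures = [47, 38, 62]
is_weekend = [0, 1, 0]  # hypothetical second feature, made up for illustration

# One feature: each row is a list holding a single value.
one_feature_rows = [[temp] for temp in temperatures]

# Two features: each row holds one value per feature, in a fixed order.
two_feature_rows = [
    [temp, weekend] for temp, weekend in zip(temperatures, is_weekend)
]

print(one_feature_rows)
print(two_feature_rows)
```

Keeping the row-per-observation shape even for a single feature means the same fitting code works unchanged if more features are added later.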

2. Fit the model and view the estimated parameters

Next we go through our standard steps of training our linear regression model, to see if the model can detect the underlying pattern in the data.

In [12]:
from data import perfect_customers
from sklearn.linear_model import LinearRegression
perfect_model = LinearRegression()
perfect_model.fit(temperature_inputs, perfect_customers)

perfect_model.coef_

array([3.])

In [13]:
perfect_model.intercept_

10.000000000000028

Hot damn! Our regression model started off not knowing the underlying true model, but by feeding it data, it was able to discover this model.  
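To see why the recovery is essentially exact, it can help to work out the least-squares line by hand. For a single feature, the textbook closed-form solution is:

$$ \theta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x} $$

The sketch below applies these formulas to the same seeded data. It is an illustration of the math only, not a description of how scikit-learn computes the fit internally, and the function name `fit_line` is made up for this example.

```python
import random

random.seed(1)
temperatures = [random.randint(30, 101) for _ in range(50)]
customers = [3 * temp + 10 for temp in temperatures]

def fit_line(xs, ys):
    """Return (theta_1, theta_0) for the least-squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how x and y vary together; denominator: how x varies alone.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    theta_1 = covariance / variance
    theta_0 = mean_y - theta_1 * mean_x
    return theta_1, theta_0

theta_1, theta_0 = fit_line(temperatures, customers)
print(theta_1, theta_0)
```

Because every point lies exactly on one line, the slope and intercept come out at 3 and 10 up to floating-point rounding, which is the same kind of tiny discrepancy seen in the intercept above.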

We can confirm that this model matches the data well by scoring the model against this data with the root mean squared error.

In [14]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [15]:
sqrt(mean_squared_error(perfect_customers, perfect_model.predict(temperature_inputs)))

6.029155041345696e-15

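That number is extremely small.  It is written in scientific notation, and means roughly $6 \times 10^{-15}$, which is effectively zero.  The tiny remainder comes from floating-point rounding rather than from any real disagreement between the model and the data.  So just as we expect, our model predicts essentially perfectly.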
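For reference, the root mean squared error is the square root of the average squared difference between actual and predicted values:

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

Here is a minimal pure-Python sketch of that calculation. The function name `root_mean_squared_error` and the `off_by_two` predictions are introduced only for illustration, to contrast a perfect fit with an imperfect one.

```python
from math import sqrt

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [151, 124, 196]
perfect_predictions = [3 * temp + 10 for temp in [47, 38, 62]]
# Deliberately wrong predictions: every value is too high by 2 customers.
off_by_two = [prediction + 2 for prediction in perfect_predictions]

print(root_mean_squared_error(actual, perfect_predictions))  # 0.0
print(root_mean_squared_error(actual, off_by_two))           # 2.0
```

The second result shows how to read the metric: it is in the same units as the target, so an RMSE of 2.0 means predictions are off by about 2 customers.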

### 4. Predicting with our model

Now that this model has been trained on our data, we can try the model on data it has not yet seen.  Because we have a world where the underlying model is $customers = 3*temperature + 10$, and this is what our model has learned, our model will predict the new number of customers perfectly.

In [5]:
from random import randint
new_random_temperatures  = [randint(30,101) for num in range(0, 50)]
new_perfect_customers = [3*temp + 10 for temp in new_random_temperatures]
new_random_temperatures[0:3]

[36, 31, 51]

In [6]:
input_new_temperatures = [[temp] for temp in new_random_temperatures]
input_new_temperatures[0:3]

[[36], [31], [51]]

In [11]:
from data import perfect_model
from math import sqrt
from sklearn.metrics import mean_squared_error
new_predictions = perfect_model.predict(input_new_temperatures)
sqrt(mean_squared_error(new_perfect_customers, new_predictions))

5.317214951750688e-15

So our model predicts our future data with almost no error as well.

In [13]:
import plotly.plotly as py
from graph import trace_values, plot
model_trace = trace_values(new_random_temperatures, new_predictions, name = 'trained model', mode = 'lines')
new_data_trace = trace_values(new_random_temperatures, new_perfect_customers, name = 'new data')
layout = {'yaxis': {'title': 'customers'}, 'xaxis': {'title': 'temperature'}}
py.plot([new_data_trace, model_trace], layout = layout)

'https://plot.ly/~JeffKatzy/227'

### Summary

In this lesson, we saw a couple of things.

First, we saw how to generate a dataset.  To do this, we first created a list of random inputs, our temperatures.  Then we created the corresponding outputs by applying a fixed function to each input.

The second thing we saw was what occurs when we train our model on a dataset that perfectly follows a linear equation.  What happens is that our model can perfectly discover the underlying model.  We saw this when we trained our model on our dataset.  It perfectly discovered the parameters underlying the data, $customers = 3*temperature + 10$.  We also saw that our model accurately predicted data it had not previously seen, so long as the outputs also followed the underlying formula.

We'll use this perfect scenario by way of comparison to see what occurs when we do not have perfect data.