# Overfitting

### Introduction

At this point, we have seen three sources of error in a machine learning model: irreducible error, variance, and bias.  We've seen that bias can be caused by underfitting our model.  Here, let's see how overfitting makes our model more subject to variance.

### Setting up our data

We have stored our feature data in the `data.py` file.  Our data adheres to the following model:

$$customer\_amount = 3*temp + 40*is\_weekend + 10 + \epsilon_i$$
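To make the formula concrete, here is a small sketch that evaluates its deterministic part for a single day.  The function name `expected_customers` is hypothetical and is not part of `data.py`.  The noise term is left out, so the function returns the expected value rather than an observed count.

```python
def expected_customers(temp, is_weekend):
    """Deterministic part of the model (hypothetical helper, no noise term)."""
    return 3 * temp + 40 * is_weekend + 10

# An 80-degree weekend day versus an 80-degree weekday
weekend = expected_customers(80, 1)   # 3*80 + 40*1 + 10 = 290
weekday = expected_customers(80, 0)   # 3*80 + 40*0 + 10 = 250
print(weekend, weekday)
```

Any observed customer count differs from these values by the random error for that day.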

Below is the error of our model when we properly fit it with temperatures and weekends against our data.

In [2]:
from data import temps_and_is_weekends, customers_with_errors
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt

model = LinearRegression()
model.fit(temps_and_is_weekends, customers_with_errors)

sqrt(mean_squared_error(customers_with_errors, model.predict(temps_and_is_weekends)))

21.565735597602885

That error is caused by randomness in our training data.  Here is our related plot.

In [6]:
from graph import plot, trace_values, build_layout
import numpy as np
temps = np.array(temps_and_is_weekends)[:, 0]
predictions = model.predict(temps_and_is_weekends)

trace_1 = dict(x = temps, y = customers_with_errors, mode = 'markers')
model_trace = trace_values(temps, predictions, mode = 'lines', name = 'model')
plot([trace_1, model_trace])
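The claim that the remaining error comes from randomness can be checked on synthetic data.  The sketch below does not use `data.py`.  It generates its own data from the formula above with an assumed noise standard deviation of 20 (the lesson does not state the real value), and it fits by least squares with `np.linalg.lstsq` rather than scikit-learn.  All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                   # large sample so estimates are stable
noise_sd = 20                              # assumed; not taken from data.py

temps = rng.uniform(30, 100, size=n)
is_weekend = rng.integers(0, 2, size=n)
noise = rng.normal(0, noise_sd, size=n)
customers = 3 * temps + 40 * is_weekend + 10 + noise

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([temps, is_weekend, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, customers, rcond=None)

rmse = np.sqrt(np.mean((customers - X @ coefs) ** 2))
print(coefs)   # should land near [3, 40, 10]
print(rmse)    # should land near noise_sd
```

Even with exactly the right features, the RMSE stays close to the noise level.  That floor is the irreducible error.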

### We want more

Now our model is pretty good, with an RMSE of about 21.57.  Let's try to do better.  One way to try to improve our model is by adding another feature.  But it's difficult to know beforehand whether a feature can be used to explain our outcome.  So we may be adding something to our model that is completely irrelevant.  Let's see what happens when we add an irrelevant feature to our model.

Our irrelevant feature is called `random_ages`, and it represents the average age of the cashiers who were working that day.  But really it's just a list of random data that we'll produce now, so it has no relevance to our list of customer amounts.  Still, let's throw it into our model and see what happens.

In [12]:
from random import randint, seed
seed(2)
random_ages = [randint(25, 65) for num in range(0, 50)] 
random_ages[0:3]

[28, 30, 30]

Ok, let's add it to our list of independent variables and throw it into our model.

In [15]:
from data import temps, random_ages, is_weekends
updated_independent_vars = list(zip(temps, random_ages, is_weekends))

from sklearn.linear_model import LinearRegression
updated_model = LinearRegression()
updated_model.fit(updated_independent_vars, customers_with_errors)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [17]:
from sklearn.metrics import mean_squared_error
from math import sqrt

sqrt(mean_squared_error(customers_with_errors, updated_model.predict(updated_independent_vars)))

21.556603670493445

Now as you can see, by introducing `random_ages`, our RMSE did decrease, even if just a little bit.  It went from about 21.57 to about 21.56.

### Why did this work?

So now we have two different models, and the one that includes our random list of average ages has the lower error.  But that seems like it can't be right.  Remember, we just randomly generated our list of ages.  They obviously had no impact on our previously formulated list of customer amounts.  So how do we account for the decrease in our error?

The reason why training with our list of random numbers improved the score is overfitting.  We are essentially including a noisy, irrelevant feature, and with it an extra parameter, in our model.  Our linear regression algorithm takes the numbers in this feature and tries to find an association with the number of customers.  But this association isn't really there.  The model is just picking up on a coincidental association between the random numbers and the customer amounts in our training data.
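This effect is not specific to our dataset.  With least squares, adding any column can only lower the training error or leave it unchanged, because the fit can always set the new coefficient to zero and recover the old model.  Here is a minimal sketch on synthetic data, assuming a noise standard deviation of 20 and using `np.linalg.lstsq` in place of scikit-learn.  The helper `train_rmse` and the other names are illustrative.

```python
import numpy as np

def train_rmse(X, y):
    """Fit ordinary least squares (with intercept) and return the training RMSE."""
    X1 = np.column_stack([X, np.ones(len(y))])
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sqrt(np.mean((y - X1 @ coefs) ** 2))

results = []
for trial in range(5):
    rng = np.random.default_rng(trial)
    temps = rng.uniform(30, 100, size=50)
    is_weekend = rng.integers(0, 2, size=50)
    # Assumed noise level; the target never depends on ages
    y = 3 * temps + 40 * is_weekend + 10 + rng.normal(0, 20, size=50)
    ages = rng.integers(25, 66, size=50)          # irrelevant feature

    base = train_rmse(np.column_stack([temps, is_weekend]), y)
    with_ages = train_rmse(np.column_stack([temps, ages, is_weekend]), y)
    results.append((base, with_ages))
    print(f"trial {trial}: without ages {base:.3f}, with ages {with_ages:.3f}")
```

In every trial the training RMSE with the random column is no larger than without it, even though the column carries no information about the target.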

We call this kind of error variance.  As we see, introducing *more* parameters makes our model more flexible and thus introduces more variance.
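One way to see this variance directly is to refit the same model on many training sets that differ only in their noise, and measure how much the fitted predictions move.  The sketch below uses synthetic data with an assumed noise standard deviation of 20.  It compares a model with only the real features to one with ten extra random columns.  The number of extra columns and all names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_refits, n_noise_cols = 50, 300, 10

temps = rng.uniform(30, 100, size=n)
is_weekend = rng.integers(0, 2, size=n)
signal = 3 * temps + 40 * is_weekend + 10

X_small = np.column_stack([temps, is_weekend, np.ones(n)])
# Same real features plus columns of pure noise
X_big = np.column_stack([X_small, rng.normal(size=(n, n_noise_cols))])

def prediction_variance(X):
    """Average variance of fitted values across refits on fresh noise."""
    preds = []
    for _ in range(n_refits):
        y = signal + rng.normal(0, 20, size=n)    # assumed noise level
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds.append(X @ coefs)
    return np.var(np.array(preds), axis=0).mean()

var_small = prediction_variance(X_small)
var_big = prediction_variance(X_big)
print(var_small, var_big)
```

For least squares, the average variance of the fitted values grows in proportion to the number of fitted coefficients, so every extra column adds variance whether or not it carries signal.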

This struggle to balance between adding too many parameters, which introduces error due to variance, and not including enough parameters, which introduces error due to bias, is called the bias-variance tradeoff.  We'll continue to explore this, as well as a technique to help us strike the right balance: again using a holdout set.
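As a preview of the holdout idea, the sketch below fits on one part of a synthetic dataset and scores on a part the model never saw.  The split sizes, the noise standard deviation of 20, and the choice of 15 random columns are all assumptions made for illustration, and the fit uses `np.linalg.lstsq` rather than scikit-learn.

```python
import numpy as np

def fit(X, y):
    """Ordinary least squares coefficients for design matrix X."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

def rmse(X, y, coefs):
    return np.sqrt(np.mean((y - X @ coefs) ** 2))

rng = np.random.default_rng(2)
n_train, n_holdout, n_noise_cols = 40, 1000, 15
n = n_train + n_holdout

temps = rng.uniform(30, 100, size=n)
is_weekend = rng.integers(0, 2, size=n)
y = 3 * temps + 40 * is_weekend + 10 + rng.normal(0, 20, size=n)  # assumed noise

X_small = np.column_stack([temps, is_weekend, np.ones(n)])
X_big = np.column_stack([X_small, rng.normal(size=(n, n_noise_cols))])

scores = {}
for name, X in [("real features", X_small), ("plus random columns", X_big)]:
    coefs = fit(X[:n_train], y[:n_train])         # train on the first rows only
    scores[name] = (rmse(X[:n_train], y[:n_train], coefs),
                    rmse(X[n_train:], y[n_train:], coefs))
    print(name, "train:", round(scores[name][0], 2),
          "holdout:", round(scores[name][1], 2))
```

The model with random columns always scores at least as well on the training rows.  On the holdout rows it typically scores worse, which is the signal a holdout set is meant to expose.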

### Summary 

In this lesson, we saw another source of error called variance.  Error due to variance occurs when our model is too flexible, and we include parameters that do not have predictive value.  Error due to variance can be deceptive because when we introduce variance the score on our training data improves.  However, this is due to our model fitting to randomness in the training data and not detecting an underlying association between our new independent variable and our dependent variable.

We have now seen that error can occur from including too few parameters, which introduces bias, and from too many parameters, which introduces variance.