# Overfitting Demo
## Purpose
Illustrate how overfitting a model affects its training and test accuracy.

## References
The details in this lesson come from the scikit-learn example linked below. It is recommended reading if you want to learn more!

http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html


# Comments from Lee
The notebook is super clear. I think we should have no problem going through this. One thing that might be nice though is if we could supply a starter notebook that contains the following:
- Initializing the libraries
- Initializing the data and true functions
- Initializing X_test

Also a note about this line of code:

    pipeline.fit(X[:, np.newaxis],y)

I have been using 

    X.reshape(30,1)

Both methods work fine, but I think we should try to be consistent in the lesson. I am fine with using either.
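
To make the comparison concrete, here is a minimal sketch showing that the two idioms produce the same column vector. The 30-element array is a hypothetical stand-in for the lesson's `X`, and `reshape(-1, 1)` is included as a third option that does not hard-code the sample count.

```python
import numpy as np

# Hypothetical 1-D feature array standing in for the lesson's X
X = np.sort(np.random.RandomState(0).rand(30))

a = X[:, np.newaxis]   # add a trailing axis: shape (30, 1)
b = X.reshape(30, 1)   # explicit shape: only valid when len(X) == 30
c = X.reshape(-1, 1)   # -1 lets NumPy infer the row count from len(X)

print(a.shape, b.shape, c.shape)

# All three hold the same values in the same layout
same = np.array_equal(a, b) and np.array_equal(a, c)
print(same)
```

If consistency is the goal, `X[:, np.newaxis]` or `X.reshape(-1, 1)` may be the safer choice, since neither breaks when the number of samples changes.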

We may also want to show the training scores, cross validation scores, and test scores after each regression to show why test data is so important. 
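
One way this might look is sketched below. It is not the notebook's code: the data, the 40/20 split, the step names `poly` and `linreg`, and the choice of R^2 and mean squared error as metrics are all assumptions made for illustration. It assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy cosine data (same true function as the lesson)
rng = np.random.RandomState(0)
x_all = rng.rand(60)
y_all = np.cos(5 * np.pi * x_all) + rng.randn(60) * 0.1

# The samples are unsorted, so holding out the last 20 is a random split
X_train, y_train = x_all[:40, np.newaxis], y_all[:40]
X_test, y_test = x_all[40:, np.newaxis], y_all[40:]

results = {}
for degree in [1, 4, 15]:
    pipeline = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linreg", LinearRegression()),
    ])
    pipeline.fit(X_train, y_train)

    # score() on a regressor returns R^2
    train_r2 = pipeline.score(X_train, y_train)
    test_r2 = pipeline.score(X_test, y_test)

    # cross_val_score refits a clone of the pipeline on each of the 5 folds;
    # the scorer returns negative MSE, so flip the sign
    cv_mse = -cross_val_score(
        pipeline, X_train, y_train,
        scoring="neg_mean_squared_error", cv=5,
    ).mean()

    results[degree] = (train_r2, cv_mse, test_r2)
    print("degree %2d: train R^2 = %.3f, CV MSE = %.3g, test R^2 = %.3f"
          % (degree, train_r2, cv_mse, test_r2))
```

Printing the three numbers side by side for each degree lets students see directly whether a fit that looks good on the training data still holds up on data the model has not seen.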

In [228]:
# Initialize all our libraries

%matplotlib inline
import matplotlib  # Not strictly necessary: pyplot is imported separately below
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

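# Cross validation
# sklearn.model_selection replaced sklearn.cross_validation in scikit-learn 0.18,
# and the old module was removed in 0.20, so fall back to it only on old installs
try:
    from sklearn.model_selection import cross_val_score
except ImportError:
    from sklearn.cross_validation import cross_val_score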
# Seed the random number generator so our "random" data is reproducible
np.random.seed(2)

# Define number of samples we take
n_samples = 20
degrees = [1, 4, 15]

# Create a function which generates our cos function data
true_fun = lambda X: np.cos(5 * np.pi * X)

# Set up our x data: sorted uniform random values in [0, 1)
X = np.sort(np.random.rand(n_samples))

# The observed y values that we fit and check the model against: the true function
# plus some random Gaussian noise (scaled by 0.1) to keep things interesting
y = true_fun(X) + np.random.randn(n_samples) * 0.1

# Let's check the shape of our data
print(X.shape)
print(y.shape)

# We notice that X and y are 1-D arrays with shape (20,) rather than column vectors.
# NumPy is fine with this, but scikit-learn expects the features as a 2-D array of
# shape (n_samples, n_features), so we reshape both into columns

X = np.reshape(X,(len(X),1))
y = np.reshape(y,(len(y),1))

# Confirm that both are now column vectors
print(X.shape, y.shape)
# Excellent!

# Let's cut our data into a training set and a testing set. Grab the last 10 points as test data;
# the rest is training data. Because X is sorted, the test points are the 10 largest x values
slice = 10

x_test = X[-slice:]
y_test = y[-slice:]

# Cut X and y down to make the training data
X = X[:len(X)-slice]
y = y[:len(y)-slice]
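
For comparison only, here is a minimal sketch of a shuffled split, which spreads the test points across the whole x range instead of taking them from the end. This is not what the notebook does. The names `n_test`, `train_idx`, and `test_idx` are hypothetical, and the data is regenerated here so the block runs on its own.

```python
import numpy as np

rng = np.random.RandomState(2)  # seed chosen only for reproducibility
n_samples, n_test = 20, 10

# Regenerate data in the same style as the lesson (column vectors)
X = np.sort(rng.rand(n_samples)).reshape(-1, 1)
y = np.cos(5 * np.pi * X) + rng.randn(n_samples, 1) * 0.1

# Shuffle the row indices, then take the first n_test as the test set
idx = rng.permutation(n_samples)
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
x_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, x_test.shape)
```

Which split to teach with is a design choice: the tail split tests the model on x values outside the training range, while a shuffled split tests it on points interleaved with the training data.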
