# Learning Unit 5 - Overfitting - Example

Overfitting is a scary thing. 

It is generally where contact with reality breaks the pretty illusions of the data scientist about how brilliant (s)he is, and keeps one's ego in check. 

There are a number of ways to reduce overfitting (regularization, proper setting of hyperparameters, bagging, etc.), but the first step is always to diagnose its presence.

In [None]:
cd ..

In [None]:
from ipywidgets import interact   # <-- did you know you could do this in jupyter? Is that cool or what? 
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from bokeh.plotting import figure, output_notebook
from utils import load_data, visualizations
from sklearn.model_selection import cross_val_score
output_notebook()

Let's get our Yin-Yang dataset again.

In [None]:
data = load_data.get_ying_yang(200)  

Let's take a look at it with 2 neighbors:

In [None]:
visualizations.plot_data(model=KNeighborsClassifier(2), 
                     data=data, 
                     target='c', 
                     feature1='a', 
                     feature2='b', 
                     out_of_sample=False, 
                     probabilities=False)

It looks fine, actually, but why are we allowing those errors to go unchallenged? Maybe switching to 1 neighbor would catch all of them?

In [None]:
visualizations.plot_data(model=KNeighborsClassifier(1), 
                     data=data, 
                     target='c', 
                     feature1='a', 
                     feature2='b', 
                     out_of_sample=False, 
                     probabilities=False)

Brilliant! We have the perfect model! Except, of course, we know that this isn't the real distribution we want: we're just overfitting the training set. So what would happen if we were to train on half of the dataset and save the other half for evaluation?
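To put numbers on that question, here is a minimal sketch of a 50/50 hold-out split. It does not use the course's `utils` helpers: the data comes from scikit-learn's `make_moons`, which is only a stand-in (an assumption) for the dataset used in this notebook, and the noise level and random seeds are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): two noisy, interleaved half-moons, 200 points.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Train on half of the points and keep the other half for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# With 1 neighbor, every training point is its own nearest neighbor.
train_acc = model.score(X_train, y_train)
# The held-out half contains points the model has never seen.
test_acc = model.score(X_test, y_test)

print('train accuracy: %0.2f' % train_acc)
print('test accuracy:  %0.2f' % test_acc)
```

A perfect training score next to a lower held-out score is the usual signature of overfitting.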

In [None]:
visualizations.plot_data(model=KNeighborsClassifier(1), 
                     data=data, 
                     target='c', 
                     feature1='a', 
                     feature2='b', 
                     out_of_sample=True, # <--- I kept everything else the same, but changed this to True    
                     probabilities=False)

Now we can clearly see the overfit. The separate red blob in the middle is the result of overfitting some noise. We also have a slightly worse model because we have less data to train on (we just trained on 50% this time), but we can control for that by doubling the dataset size: 

In [None]:
data = load_data.get_ying_yang(400)  # <-- this was 200 in the original dataset  

In [None]:
model = KNeighborsClassifier(1)

visualizations.plot_data(model=model, 
                         data=data, 
                         target='c', 
                         feature1='a', 
                         feature2='b', 
                         out_of_sample=True,  
                         probabilities=False)

print('CV score: %0.2f' % cross_val_score(estimator=model, 
                X=data[['a', 'b']], 
                y=data['c']).mean())

Wow, dramatically overfit. What if we changed this to use 2 neighbors?

In [None]:
model = KNeighborsClassifier(2)

visualizations.plot_data(model=model, 
                         data=data,  # <-- plot the same 400-point dataset that is scored below
                         target='c', 
                         feature1='a', 
                         feature2='b', 
                         out_of_sample=True,  
                         probabilities=False)

print('CV score: %0.2f' % cross_val_score(estimator=model, 
                X=data[['a', 'b']], 
                y=data['c']).mean())

Better... but still way too much variance. 
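The `cross_val_score` calls above report only the mean. Looking at the individual fold scores gives a rough, indirect feel for how much the result depends on which points ended up in training. The sketch below assumes `make_moons` as a stand-in dataset (not the notebook's own data) and passes `cv=5` explicitly, which is also the default in recent scikit-learn versions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 400 noisy half-moon points.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

results = {}
for k in (1, 2):
    # One accuracy value per fold, 5 folds in total.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    results[k] = scores
    print('k=%d  folds=%s  mean=%0.2f  std=%0.2f'
          % (k, scores.round(2), scores.mean(), scores.std()))
```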

OK, hand-tuning is boring. Let's use Jupyter Interact! (If it doesn't work the first time, try [this](http://ipywidgets.readthedocs.io/en/stable/user_install.html))

In [None]:
def s(n_neighbors):
    # Redraw the held-out decision regions for the chosen number of neighbors.
    visualizations.plot_data(model=KNeighborsClassifier(n_neighbors=n_neighbors),
                             data=data,
                             target='c',
                             feature1='a',
                             feature2='b',
                             out_of_sample=True,
                             probabilities=False)


interact(s, n_neighbors=(1, 12))
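If the widget is not available, the same sweep can be done without it. This is a sketch under assumptions: it scores each setting with cross-validation instead of plotting, and it uses `make_moons` as a stand-in for the notebook's dataset.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 400 noisy half-moon points.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

cv_scores = {}
for k in range(1, 13):  # same range as the slider: 1 to 12 neighbors
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print('n_neighbors=%2d  CV score: %0.3f' % (k, cv_scores[k]))

# Setting with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print('best n_neighbors by CV score:', best_k)
```

Picking the best setting this way is itself a form of tuning on the data, so a final check on data that played no part in the sweep is still worthwhile.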

Let's see this with a different model: the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), which averages the votes of many randomized trees and so tends to be more resistant to overfitting:

In [None]:
def s(max_depth, n_estimators):
    # Redraw the held-out decision regions for the chosen forest settings.
    model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators)
    visualizations.plot_data(model=model,
                             data=data,
                             target='c',
                             feature1='a',
                             feature2='b',
                             out_of_sample=True,
                             probabilities=False)

interact(s, max_depth=(1, 15), n_estimators=(10, 100, 10))
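As a numeric companion to the sliders, the gap between training and held-out accuracy can be tabulated for a few depths. This is only a sketch: `make_moons` stands in for the notebook's dataset, and the depths, the fixed `n_estimators=50`, and the seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 400 noisy half-moon points, split 50/50.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

gaps = {}
for max_depth in (1, 3, 5, 10, 15):
    forest = RandomForestClassifier(max_depth=max_depth, n_estimators=50,
                                    random_state=0)
    forest.fit(X_train, y_train)
    train_acc = forest.score(X_train, y_train)
    test_acc = forest.score(X_test, y_test)
    # A large positive gap suggests the forest is fitting noise.
    gaps[max_depth] = train_acc - test_acc
    print('max_depth=%2d  train=%0.2f  test=%0.2f  gap=%+0.2f'
          % (max_depth, train_acc, test_acc, gaps[max_depth]))
```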