It's time to address another potential source of error in our models: overfitting. __Overfitting__ is when your model is so complex that it starts to capture random noise instead of the true underlying relationships. It typically shows up as a model that looks more accurate in evaluation than it really is. In most situations you shouldn't be able to build a perfect model; some error is to be expected. Overfitting is extremely common and easy to do, but there are ways to guard against it. The main one is how you evaluate your model.

Thus far we've been using our training data to evaluate our model. That is, we've used the same data to train the model and to see how well it is doing. When you think about it, the danger of that approach becomes apparent. If we create a very elaborate model, it will pick up on nuances of the data that are just random noise. If we then evaluate the model on the training data, that ability to fit noise gets reported as accuracy, even though it won't carry over to anything else. That isn't how we want to evaluate a model. Generally we don't care about predicting things we already know; we care about other data, new information, or other situations. This is why testing with training data really isn't what we want to do.

But if that's the case, what can we do?

## Holdout Groups

The simplest way to combat overfitting is with a **holdout group** (or sometimes "holdback group"). All this means is that you do not include all of your data in your training set, instead reserving some of it exclusively for testing. While there is a cost to having less training data, your evaluation will be far more reliable.

When you directly compare two models based on different techniques or different specifications, the holdout method guards against favoring the one that overfits. Overfit models see a drop in success rate outside of their training data, so their performance is not artificially inflated the way it would be if you trained and validated on the whole data set. The drop happens because they got really good at matching patterns in the data they were trained on, but much of what they learned was random noise rather than the things that matter. When they try to match that noise on new data, their accuracy suffers.
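
To make that drop concrete, here is a minimal sketch on synthetic data rather than the spam set. The labels are coin flips, so there is nothing real to learn. The unrestricted decision tree, the array shapes, and the seeds are all assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Pure noise: 10 random features and labels that are coin flips.
X = rng.normal(size=(400, 10))
y = rng.randint(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until it has memorized the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print('Scored on training data:', train_acc)
print('Scored on holdout data: ', test_acc)
```

The training score should look perfect while the holdout score should hover near a coin flip, which is the only honest answer for labels like these.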

How much data you keep in a holdout is up to you. It depends on how much and what kind of data you have to begin with, as well as what kind of model you're training. Check how much your model's performance varies as you add more training data, and how much data it takes to keep the test sample reasonably representative. It is a balance: every row held out for testing is a row the model can't learn from. 30% is a common starting point, but anything from 50% down to 1% of the original dataset could be reasonable.
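
One rough way to feel out that balance is to repeat the split several times at each holdout size and watch how much the score moves. The sketch below does this on made-up data; the `holdout_spread` helper, the synthetic features, and the number of repeats are assumptions, not part of the spam model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
n = 1000
y = rng.randint(0, 2, size=n)
# Three binary features that each agree with the label about 80% of the time.
X = np.column_stack(
    [np.where(rng.rand(n) < 0.8, y, 1 - y) for _ in range(3)]
)

def holdout_spread(test_size, n_repeats=20):
    """Mean and standard deviation of holdout accuracy over repeated splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        scores.append(BernoulliNB().fit(X_tr, y_tr).score(X_te, y_te))
    return np.mean(scores), np.std(scores)

spread = {size: holdout_spread(size) for size in (0.5, 0.3, 0.1, 0.01)}
for size, (mean, std) in spread.items():
    print('test_size={:<5} mean={:.3f} std={:.3f}'.format(size, mean, std))
```

A very small holdout leaves plenty of training data but gives a noisy score, so the spread across splits tends to widen as `test_size` shrinks.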

This seems relatively simple to code up. We'll try it below with our spam model:


In [3]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    sms_raw[key] = sms_raw.message.str.contains(
        ' ' + key + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

In [5]:
# Test your model with different holdout groups.

from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.884304932735426
Testing on Sample: 0.8916008614501076


These scores look really consistent! It doesn't seem like our model is overfitting. Part of the reason for that is that it's so simple (more on that in a bit). But we should look and see if any other issues are lurking here. So let's try a more robust evaluation technique, cross validation.

## Cross Validation

Cross validation is a more robust version of holdout groups. Instead of creating just one holdout, you create several.

The way it works is this: start by breaking up your data into several equally sized pieces, or __folds__. Let's say you make _x_ folds. You then go through the training and testing process _x_ times, each time with a different fold held out from the training data and used as the test set. The number of folds you create is up to you, but it will depend on how much data you want in your testing set. At its most extreme, you're creating the same number of folds as you have observations in your data set. This kind of cross validation has a special name: __Leave One Out__. Leave one out is useful if you're worried about single observations skewing your model, whereas large folds combat more general overfitting.
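
To see what the folds look like in practice, the sketch below prints the row positions that SKLearn's `KFold` and `LeaveOneOut` splitters hold out on a toy array of ten rows. The toy array and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)  # ten toy observations

# Five folds: each pass holds out two rows and trains on the other eight.
kf = KFold(n_splits=5)
folds = [test_idx.tolist() for _, test_idx in kf.split(X)]
print(folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Leave One Out: as many folds as observations, one row held out each time.
loo = LeaveOneOut()
n_loo_folds = loo.get_n_splits(X)
print(n_loo_folds)  # 10
```

Every row lands in a test fold exactly once, so across all passes the whole data set gets used for evaluation without any row being scored by a model that trained on it.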



In [6]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([0.89784946, 0.89426523, 0.89426523, 0.890681  , 0.89605735,
       0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345])

That's exactly what we'd hope to see. The array that `cross_val_score` returns is a series of accuracy scores, each computed with a different holdout group. If our model were overfitting, those scores would tend to fluctuate from fold to fold. Instead, ours are relatively consistent.

Above we used SKLearn's built-in functions for both the holdout split and cross validation; the documentation can be found [here](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels). However, the outputs are somewhat limited. By default `cross_val_score` uses the estimator's `score` method, which for a classifier is accuracy. You can adjust what is returned, but you don't get all of the error types or outputs you may be interested in. That's why it's not uncommon for people to code up their own cross validation.
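
As one example of adjusting what is returned, SKLearn's `cross_validate` accepts a list of scorer names and reports each one per fold. The sketch below runs it on synthetic, imbalanced data; the single made-up keyword feature and its probabilities are assumptions, not the spam data.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
n = 2000
# Roughly 15% positives, loosely imitating an imbalanced spam problem.
y = (rng.rand(n) < 0.15).astype(int)
# One hypothetical keyword flag: common in positives, rare in negatives.
p_keyword = np.where(y == 1, 0.6, 0.02)
X = (rng.rand(n, 1) < p_keyword[:, None]).astype(int)

results = cross_validate(
    BernoulliNB(), X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall'],
)

for name in ('test_accuracy', 'test_precision', 'test_recall'):
    print(name, results[name].round(3))
```

Accuracy alone can look healthy on imbalanced data while recall for the rare class lags well behind, which is the kind of detail a single default score hides.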

To make sure you understand how cross validation works, try to code it up yourself below, not relying on SKLearn:


In [7]:
# Implement your own cross validation with your spam model.
# Perform your additional evaluation here.
from sklearn.metrics import confusion_matrix

holdout_data = []
holdout_target = []
training_data = []
training_target = []
x_val_score = []

def x_validate(d, trg, folds):
    """Return one holdout accuracy per fold, using contiguous unshuffled folds."""
    size = len(d)/folds
    current_index = 0
    # Keep the scores local: appending to the module-level list would make
    # results pile up every time the function is called again.
    x_val_score = []

    for i in range(folds):
        # Row positions [current_index, stop) are this fold's holdout group
        stop = int(size*(i+1))

        # Create holdout data and target
        holdout_data = d.iloc[current_index:stop]
        holdout_target = trg.iloc[current_index:stop]

        # Everything outside the holdout rows is used for training
        training_data = d.drop(d.index[current_index:stop])
        training_target = trg.drop(trg.index[current_index:stop])

        current_index = stop

        x_val_score.append(
            bnb.fit(training_data, training_target).score(holdout_data, holdout_target)
        )

    return x_val_score

x_validate(data, target, 10)


[0.8886894075403949,
 0.8761220825852782,
 0.9048473967684022,
 0.895870736086176,
 0.899641577060932,
 0.8994614003590664,
 0.8850987432675045,
 0.8833034111310593,
 0.8868940754039497,
 0.8960573476702509]

## What's a good score?

When we look at this model, we've been getting accuracy scores around .89. Intuitively that seems like a pretty good score, but at the start of this lesson we mentioned different kinds of error. We also mentioned class imbalance. Both are at play here. Using the topics introduced earlier in this lesson, try a more in-depth evaluation of the model: look at the kinds of errors we're generating and at what accuracy we'd get if we just randomly guessed. You may want to use what's known as a [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to show the different kinds of errors.


In [10]:
# Perform your additional evaluation here.
def x_validate(d, trg, folds):
    size = len(d)/folds
    current_index = 0
    # Keep the scores local: appending to the module-level list makes
    # results pile up every time this cell is re-run.
    x_val_score = []

    for i in range(folds):
        # Row positions [current_index, stop) are this fold's holdout group
        stop = int(size*(i+1))

        # Create holdout data and target
        holdout_data = d.iloc[current_index:stop]
        holdout_target = trg.iloc[current_index:stop]

        # Everything outside the holdout rows is used for training
        training_data = d.drop(d.index[current_index:stop])
        training_target = trg.drop(trg.index[current_index:stop])

        current_index = stop

        # Fit once per fold, then reuse the fitted model for score and predictions
        model = bnb.fit(training_data, training_target)
        x_val_score.append(model.score(holdout_data, holdout_target))
        y_pred = model.predict(holdout_data)

        # Rows are actual classes (ham, spam); columns are predicted classes
        c_mat = confusion_matrix(holdout_target, y_pred)
        print(c_mat)
        print(holdout_target)

        # Sensitivity (Rate of correct prediction for spam)
        # True-pos / sum(True-pos & False-neg)
        sens = c_mat[1, 1]/sum(c_mat[1])

        # Specificity (Rate of correct prediction for ham)
        # True-neg / sum(True-neg & False-pos)
        spec = c_mat[0, 0]/sum(c_mat[0])

        print("Sensitivity: " + str(sens) + '\n',
              "Specificity: " + str(spec))

    return x_val_score

x_validate(data, target, 10)



[[470   7]
 [ 55  25]]
0      False
1      False
2       True
3      False
4      False
5       True
6      False
7      False
8       True
9       True
10     False
11      True
12      True
13     False
14     False
15      True
16     False
17     False
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
527     True
528    False
529     True
530    False
531     True
532    False
533    False
534    False
535    False
536    False
537    False
538    False
539    False
540    False
541     True
542    False
543    False
544    False
545    False
546    False
547    False
548    False
549    False
550    False
551    False
552    False
553    False
554    False
555    False
556    False
Name: spam, Length: 557, dtype: bool
Sensitivity: 0.3125
 Specificity: 0.9853249475890985
[[464   5]
 [ 64  24]]
557     False
558     False
559     False
560     False
561     False
...

[0.8886894075403949,
 0.8761220825852782,
 0.9048473967684022,
 0.895870736086176,
 0.899641577060932,
 0.8994614003590664,
 0.8850987432675045,
 0.8833034111310593,
 0.8868940754039497,
 0.8960573476702509]
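
As a worked check, the arithmetic below takes the first fold's confusion matrix printed above and derives the usual rates from it. The variable names are illustrative, and precision is an extra metric added here for comparison.

```python
import numpy as np

# First fold's confusion matrix: rows are actual (ham, spam),
# columns are predicted (ham, spam).
c_mat = np.array([[470, 7],
                  [55, 25]])
tn, fp, fn, tp = c_mat.ravel()

accuracy = (tp + tn) / c_mat.sum()   # 495 / 557
sensitivity = tp / (tp + fn)         # 25 / 80: share of spam that was caught
specificity = tn / (tn + fp)         # 470 / 477: share of ham left alone
precision = tp / (tp + fp)           # 25 / 32: share of spam flags that were right

print('Accuracy:   ', accuracy)
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('Precision:  ', precision)
```

The accuracy of about .89 comes almost entirely from ham being classified correctly; fewer than a third of the actual spam messages in this fold are caught.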

In [26]:
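# Random guesses: replace every feature with a coin flip.
# np.random.randint excludes the upper bound, so randint(0, 1) would return
# only zeros; randint(0, 2) draws from {0, 1}.
random_data = pd.DataFrame(np.random.randint(0, 2, size=data.shape))
random_data.head()

cross_val_score(bnb, random_data, target, cv=10)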




array([0.8655914 , 0.8655914 , 0.8655914 , 0.8655914 , 0.8655914 ,
       0.86535009, 0.86535009, 0.86690647, 0.86690647, 0.86690647])

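So it looks like even with random data, we get around 86% accuracy. That is the share of ham in the data set: with features that carry no information, the model falls back on always predicting the dominant class, so our real model's .89 is only a few points better than knowing nothing.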
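
Another way to get this kind of baseline is SKLearn's `DummyClassifier`, which ignores the features entirely. The sketch below uses made-up labels with 13.4% positives to roughly mirror the class balance seen above; those counts are an assumption, not the actual spam data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 134 spam (1) and 866 ham (0).
y = np.array([1] * 134 + [0] * 866)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict whichever class is most common in the training labels.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_acc = baseline.score(X, y)
print('Majority-class baseline accuracy:', baseline_acc)
```

Any real model should be judged against this number rather than against zero.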


## Thinking like a Data Scientist

How you choose to validate your model in real life will depend on the kind of data you're working with and the concerns you have about the model's performance. Remember, your model is trained to fit the data you feed it, so if the situation changes, your model will become less accurate. For example, if there are seasonal changes to your observed variable but you only train on one month's data, you're going to have a problem. You could test for that by seeing how accurate your model is with a specific time period as your holdout, rather than a random sample. We'll cover techniques for dealing with time in more depth later.
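
A time-based holdout can be as simple as splitting on a cutoff date instead of sampling at random. The sketch below assumes a hypothetical DataFrame with one row per month; the column names and the cutoff are illustrative only.

```python
import pandas as pd

# Hypothetical monthly observations covering one year.
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=12, freq='MS'),
    'value': range(12),
})

# Train on everything before the cutoff, hold out everything from it onward.
cutoff = pd.Timestamp('2023-10-01')
train = df[df['date'] < cutoff]
test = df[df['date'] >= cutoff]

print('Training months:', len(train))
print('Holdout months: ', len(test))
```

Because every holdout row comes after every training row, the score reflects how the model copes with a period it has never seen.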

## Overfitting and Naive Bayes

Overfitting is always possible, but some models are more susceptible to it than others. Naive Bayes is actually pretty good at avoiding overfitting. This is largely because its assumptions are so simple, particularly the assumed independence between features given the class. One source of overfitting is a model trying to map complex interactions between variables that aren't really there or aren't significant. Naive Bayes cannot do this because it assumes the features are all independent and therefore not interacting. It's a nice characteristic at times, but it does mean the model doesn't take into account how your features affect each other.
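
The flip side of that simplicity can be seen with a toy case where the label depends only on an interaction. In the sketch below the label is the XOR of two binary features, a construction assumed purely for illustration. It is scored on the same rows it was trained on because the point is what the model can represent, not how well it generalizes.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Four feature combinations, repeated so each appears 50 times.
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
X = np.tile(patterns, (50, 1))
y = X[:, 0] ^ X[:, 1]  # label is 1 only when the two features differ

# Taken one at a time, neither feature says anything about the label.
acc_plain = BernoulliNB().fit(X, y).score(X, y)

# Hand the interaction to the model as an explicit third feature.
X_inter = np.column_stack([X, X[:, 0] ^ X[:, 1]])
acc_inter = BernoulliNB().fit(X_inter, y).score(X_inter, y)

print('Two features only:      ', acc_plain)
print('With interaction column:', acc_inter)
```

Naive Bayes only does well here once the interaction is supplied as a feature of its own, so any interplay between features has to come from feature engineering rather than from the model.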

Also, one final note on our models here. They weren't overfitting, but they weren't telling us much either. They were just barely more accurate than always predicting the dominant class. Discuss with your mentor why that is and what you could do to improve the model.
