## Preprocessing Email Data 

A problem arose while preparing Chris and Sara’s email for the author identification project: one feature was a little too powerful (effectively acting like a signature, which gives an arguably unfair advantage to an algorithm). You’ll work through that discovery process here.

### #1 If a decision tree is overfit, would you expect the accuracy on a test set to be very high or pretty low?

Low. This bug was found when Katie was trying to build an overfit decision tree to use as an example in the decision tree mini-project. Decision trees are classically easy to overfit; one of the easiest ways to get an overfit decision tree is to use a small training set and lots of features.

### #2 If a decision tree is overfit, would you expect high or low accuracy on the training set?

The accuracy would be very high on the training set, but would plummet once it was actually tested.
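
The two answers above can be checked on synthetic data. The following is a minimal sketch, not the project's own code: it assumes scikit-learn's `make_classification` as a stand-in for the email features, and the sample sizes are arbitrary choices for illustration. With far more features than training points, an unconstrained tree should fit the training set perfectly while scoring lower on held-out points.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email data: many features, only a few informative.
X, y = make_classification(n_samples=1050, n_features=500, n_informative=5,
                           n_redundant=0, random_state=42)

# Keep only 50 points for training so features vastly outnumber examples.
X_train, y_train = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]

# Default settings let the tree grow until every training point is classified.
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The gap between the two printed numbers is the symptom to look for: a large gap suggests the tree has memorized the training points rather than learned something general.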

### #3 How many training points are there, according to the starter code?

In [65]:
import pickle
import numpy as np
np.random.seed(42)

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn import tree
from sklearn.metrics import accuracy_score

import os

In [72]:
pkl_files = os.listdir('data')

In [76]:
pkl_files

['email_authors.pkl',
 'email_authors_overfit.pkl',
 'unix',
 'word_data.pkl',
 'word_data_overfit.pkl']

In [77]:
"""
convert dos linefeeds (crlf) to unix (lf)
usage: dos2unix.py 
"""

for original in pkl_files:
    content = ''
    outsize = 0
    if 'pkl' in original:
        with open('data/' + original, 'rb') as infile:
            content = infile.read()
        with open('data/unix/' + original, 'wb') as output:
            for line in content.splitlines():
                outsize += len(line) + 1
                output.write(line + str.encode('\n'))

        print("Done. Saved %s bytes." % (len(content)-outsize))

Done. Saved 199 bytes.
Done. Saved 17578 bytes.
Done. Saved 398 bytes.
Done. Saved 17390 bytes.
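
The cell above rebuilds each file line by line with `splitlines()`, which for bytes also breaks on a lone carriage return. A narrower alternative, sketched here with a hypothetical helper name `crlf_to_lf`, replaces only CRLF pairs. Either approach assumes the pickles were written in the text-based protocol; a binary-protocol pickle can be corrupted by this kind of byte rewriting.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


# Small in-memory sample instead of a real file, for illustration only.
sample = b"line one\r\nline two\r\n"
converted = crlf_to_lf(sample)

print(converted)
print("Saved %d bytes." % (len(sample) - len(converted)))
```

Each converted line ending drops one byte, so the byte count saved equals the number of CRLF pairs in the input.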


#### Load the features (words) and labels (authors)

In [60]:
### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../10_text_learning/your_word_data.pkl" 
authors_file = "../10_text_learning/your_email_authors.pkl"
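# open each pickle in a context manager so the file handle is closed afterwards
with open(words_file, "rb") as f:
    word_data = pickle.load(f)
with open(authors_file, "rb") as f:
    authors = pickle.load(f)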

#### Create training/testing set

In [61]:
### test_size is the fraction of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)

#### Create the Vectorizer

In [62]:
# create the vectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
# create the tf-idf matrix
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()
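
The fit/transform split in the cell above matters: the vocabulary and idf weights are learned from the training documents only, and the test documents are then encoded with that same vocabulary. A minimal sketch on made-up documents (the toy corpus is an assumption, and `max_df` is left out because it is not meaningful on so few documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the email text (made up for illustration).
train_docs = ["gas deal closed today", "swap agreement draft attached",
              "gas volumes for the deal", "please review the swap agreement"]
test_docs = ["new gas deal", "agreement attached for review"]

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")

# Learn the vocabulary and idf weights from the training documents only...
train_matrix = vectorizer.fit_transform(train_docs)
# ...then reuse that fitted vocabulary to encode the test documents.
test_matrix = vectorizer.transform(test_docs)

# Both matrices have one column per training-vocabulary word.
print(train_matrix.shape, test_matrix.shape)
```

Words that appear only in the test documents have no column and are silently dropped, which keeps information from the test set out of the features.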

In [50]:
### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

#### Get a decision tree up and training on the training data, and print out the accuracy. How many training points are there?

In [51]:
len(features_train), len(labels_train)

(150, 150)

#### #1 Create the Classifier

In [11]:
def classify(features_train, labels_train, criterion="gini", min_samples_split=2):
    
    ### returns a trained decision tree classifier
    
    # instantiate the classifier
    clf = tree.DecisionTreeClassifier(criterion=criterion, min_samples_split=min_samples_split)
    # fit the data
    clf = clf.fit(features_train, labels_train)
    
    return clf

In [12]:
# create the classifier
clf = classify(features_train, labels_train)

#### #2 Make Predictions

In [13]:
# store predictions in a list named pred
pred = clf.predict(features_test)

#### #3 Accuracy

In [14]:
# calculate accuracy
accuracy = accuracy_score(labels_test, pred)
accuracy

0.9476678043230944
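
For reference, accuracy is simply the fraction of test points whose predicted label matches the true label. A hand-rolled sketch (the function name and toy labels are hypothetical) shows the arithmetic:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels: 0 and 1 stand for the two authors.
true_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]

# 4 of 5 predictions match, so this prints 0.8
print(accuracy(true_labels, predicted_labels))
```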

### #4 What’s the importance of the most important feature? What is the number of this feature?

Take your (overfit) decision tree and use the `feature_importances_` attribute to get a list of the relative importance of all the features being used. We suggest iterating through this list (it’s long, since this is text data) and printing a feature’s importance only if it exceeds some threshold (say, 0.2; remember, if all words were equally important, each one would have an importance far below 0.01).

To figure out which words are causing the problem, go back to the `TfidfVectorizer` and look up the feature number. Calling `get_feature_names()` on the vectorizer returns a list of all the words (scikit-learn 1.0 and later provide `get_feature_names_out()` for this, and 1.2 removed the old method).
* Pull out the word that’s causing most of the discrimination of the decision tree.
* What is it? Does it make sense as a word that’s uniquely tied to either Chris Germany or Sara Shackleton, a signature of sorts?

In [18]:
clf.feature_importances_[clf.feature_importances_ > 0.01]

array([0.07495003, 0.02631579, 0.13402829, 0.76470588])

In [34]:
vectorizer.get_feature_names()[1]

'000'

In [33]:
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.2:
        print("Feature name {} has the importance {} and the number {}".format(vectorizer.get_feature_names()[i], clf.feature_importances_[i], i))

Feature name sshacklensf has the importance 0.7647058823529412 and the number 33614
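
The loop above can be wrapped in a small helper. This is a sketch with a hypothetical function name and toy values. In the notebook, the inputs would be `clf.feature_importances_` and the vectorizer's feature-name list, fetched once instead of on every iteration.

```python
def features_above(importances, feature_names, threshold=0.2):
    """Return (index, name, importance) for every feature above the threshold."""
    return [(i, feature_names[i], imp)
            for i, imp in enumerate(importances)
            if imp > threshold]


# Toy values for illustration only.
importances = [0.0, 0.07, 0.0, 0.76, 0.13, 0.03]
feature_names = ["000", "deal", "gas", "sshacklensf", "swap", "volumes"]

outliers = features_above(importances, feature_names)
print(outliers)

# If importance were spread evenly, each feature would get 1 / n_features.
even_share = 1 / len(importances)
print(even_share)
```

With tens of thousands of words in the real vocabulary, the even share is far smaller than 0.01, which is why a single feature above 0.2 stands out so sharply.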


This word seems like an outlier in a certain sense, so let’s remove it and refit. Go back to text_learning/vectorize_text.py, and remove this word from the emails using the same method you used to remove “sara”, “chris”, etc. Rerun vectorize_text.py, and once that finishes, rerun find_signature.py. Any other outliers pop up? What word is it? Seem like a signature-type word? (Define an outlier as a feature with importance >0.2, as before).
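
As a rough sketch of that removal step, assuming the earlier method was a plain substring replacement (the function name and the sample text are hypothetical, and the actual code in `vectorize_text.py` may differ):

```python
def strip_signature_words(text, signature_words):
    """Remove every occurrence of each signature word from the text."""
    for word in signature_words:
        text = text.replace(word, "")
    return text


# "sara" and "chris" mirror the names removed earlier;
# "sshacklensf" is the outlier found above.
signature_words = ["sara", "chris", "sshacklensf"]
email_text = "pleas review the draft sara sshacklensf"

cleaned = strip_signature_words(email_text, signature_words)
print(cleaned.split())
```

Note that substring replacement also strips the word from inside longer words (for example, "sara" inside "sarah"), so a word-boundary match may be preferable if that matters for your data.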

In [44]:
# create the classifier
clf = classify(features_train, labels_train)
# store predictions in a list named pred
pred = clf.predict(features_test)
# calculate accuracy
accuracy = accuracy_score(labels_test, pred)
accuracy

0.9692832764505119

In [45]:
clf.feature_importances_[clf.feature_importances_ > 0.01]

array([0.16260163, 0.05060729, 0.66666667, 0.09380863, 0.02631579])

In [46]:
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.2:
        print("Feature name {} has the importance {} and the number {}".format(vectorizer.get_feature_names()[i], clf.feature_importances_[i], i))

Feature name cgermannsf has the importance 0.6666666666666667 and the number 14343


### #5 Update `vectorize_text.py` one more time, and rerun. Then run `find_signature.py` again. Any other important features (importance>0.2) arise? How many? Do any of them look like “signature words”, or are they more “email content” words, that look like they legitimately come from the text of the messages?

In [52]:
# create the classifier
clf = classify(features_train, labels_train)
# store predictions in a list named pred
pred = clf.predict(features_test)
# calculate accuracy
accuracy = accuracy_score(labels_test, pred)
accuracy

0.8134243458475541

In [53]:
clf.feature_importances_[clf.feature_importances_ > 0.01]

array([0.02481019, 0.10537858, 0.02628019, 0.01777778, 0.02552933,
       0.04740741, 0.04266667, 0.18692724, 0.36363636, 0.08406921,
       0.07551703])

In [55]:
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.2:
        print("Feature name {} has the importance {} and the number {}".format(vectorizer.get_feature_names()[i], clf.feature_importances_[i], i))

Feature name houectect has the importance 0.36363636363636365 and the number 21323


Yes, there is one more word ("houectect").  It doesn't look like an obvious signature word so let's keep moving without removing it.

### #6 What’s the accuracy of the decision tree now? We've removed two "signature words", so it will be more difficult for the algorithm to fit to our limited training set without overfitting. Remember, the whole point was to see if we could get the algorithm to overfit--a sensible result is one where the accuracy isn't that great!

0.8134243458475541

Now that we've removed the outlier "signature words", the decision tree is starting to overfit the training data using the words that remain.

### #7 Use all the training data

In [63]:
# create the classifier
clf = classify(features_train, labels_train)
# store predictions in a list named pred
pred = clf.predict(features_test)
# calculate accuracy
accuracy = accuracy_score(labels_test, pred)
accuracy

0.9948805460750854