# Author ID Accuracy

We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify each email as written by one person or the other, based only on the text of the email. We will start with Naive Bayes in this mini-project, and then expand in later projects to other algorithms.

We will start by giving you a list of strings. Each string is the text of an email, which has undergone some basic preprocessing; we will then provide the code to split the dataset into training and testing sets. 

Naive Bayes is particularly well suited to text classification. When dealing with text, it’s very common to treat each unique word as a feature, and since the typical person’s vocabulary is many thousands of words, this makes for a large number of features. The relative simplicity of the algorithm and its assumption that features are independent make Naive Bayes a strong performer for classifying texts. In this mini-project, you will download and install sklearn on your computer and use Naive Bayes to classify emails by author.
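To make the "each unique word is a feature" idea concrete, here is a minimal from-scratch sketch of a word-count (multinomial) Naive Bayes classifier with add-one smoothing. It is only an illustration: the notebook itself uses sklearn's `GaussianNB` on TF-IDF features, and the toy emails, the labels, and the helper names `train_nb` and `predict_nb` are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab


def predict_nb(text, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior for `text`."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # "naive" step: word likelihoods multiply
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy emails, not taken from the Enron data.
toy_texts = [
    "please review the attached contract",
    "contract terms need legal review",
    "lunch on friday sounds great",
    "great game last friday",
]
toy_labels = ["sara", "sara", "chris", "chris"]

model = train_nb(toy_texts, toy_labels)
print(predict_nb("legal contract review", *model))
print(predict_nb("friday lunch", *model))
```

Each word contributes its own independent factor to the score, which is why the method stays cheap even when the vocabulary runs to thousands of words.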

In [0]:
import warnings
warnings.filterwarnings("ignore")

import matplotlib 
matplotlib.use('agg')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
# import sklearn
import nltk
import urllib
import tarfile
import os
import sys

from time import time

# features and labels creation
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# SVM
from sklearn.svm import SVC

# Decision Trees
from sklearn import tree

import pickle
import _pickle as cPickle

In [4]:
!wget https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz

--2019-03-18 10:49:17--  https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz
Resolving www.cs.cmu.edu (www.cs.cmu.edu)... 128.2.42.95
Connecting to www.cs.cmu.edu (www.cs.cmu.edu)|128.2.42.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 443254787 (423M) [application/x-tar]
Saving to: ‘enron_mail_20150507.tar.gz.1’


2019-03-18 10:57:50 (844 KB/s) - ‘enron_mail_20150507.tar.gz.1’ saved [443254787/443254787]



In [5]:
%%time
tfile = tarfile.open("enron_mail_20150507.tar.gz", "r:gz")
tfile.extractall(".")

CPU times: user 1min 34s, sys: 49.3 s, total: 2min 23s
Wall time: 2min 25s


In [0]:
sys.path.append("../tools/")


In [0]:
!mv email_authors.pkl tools/email_authors.pkl

In [13]:
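# Rewrite the pickle with Unix (LF) line endings so that it loads cleanly
# on non-Windows systems.
original = "tools/word_data.pkl"
destination = "tools/word_data_unix.pkl"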

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content)-outsize))

Done. Saved 35156 bytes.


In [0]:
def preprocess(words_file = "tools/word_data_unix.pkl", authors_file="tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default tools/word_data_unix.pkl)
        and the corresponding authors (by default tools/email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features
        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions
        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = pickle.load(words_file_handler)
    words_file_handler.close()

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test

In [26]:
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884
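The vectorization step inside `preprocess` turns each email into a row of TF-IDF weights. As a rough sketch of what that weighting does, the function below computes sublinear term frequency, a smoothed inverse document frequency, and L2 normalization by hand. It assumes plain whitespace tokenization and skips the stop-word removal and `max_df` filtering that the notebook's `TfidfVectorizer` applies; the name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each word.
    df = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        weights = {}
        for word, count in Counter(words).items():
            tf = 1 + math.log(count)  # sublinear term frequency
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1  # smoothed idf
            weights[word] = tf * idf
        # Scale each document vector to unit (L2) length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({word: w / norm for word, w in weights.items()})
    return vectors


# Made-up documents for illustration.
toy_docs = [
    "meeting about the contract",
    "the contract the contract",
    "lunch meeting",
]
toy_vectors = tfidf(toy_docs)
for vector in toy_vectors:
    print({word: round(weight, 3) for word, weight in vector.items()})
```

In the third toy document, "lunch" appears in only one document and therefore gets a larger weight than "meeting", which appears in two.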


### #1 Naive Bayes

In [0]:
# instantiate the classifier
clf = GaussianNB()

In [0]:
%%time
# train
clf.fit(features_train, labels_train)

CPU times: user 1.06 s, sys: 13.9 ms, total: 1.07 s
Wall time: 1.07 s


GaussianNB(priors=None, var_smoothing=1e-09)

In [0]:
%%time
# predict
pred = clf.predict(features_test)

CPU times: user 104 ms, sys: 7.62 ms, total: 112 ms
Wall time: 117 ms


In [0]:
accuracy = accuracy_score(pred, labels_test)
print('\naccuracy = {0}'.format(accuracy))


accuracy = 0.9732650739476678
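The accuracy reported here is simply the fraction of test emails whose predicted label matches the true label. A minimal sketch of that calculation, using a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(predictions, truth):
    """Fraction of positions where the prediction equals the true label."""
    if len(predictions) != len(truth):
        raise ValueError("predictions and truth must have the same length")
    matches = sum(p == t for p, t in zip(predictions, truth))
    return matches / len(truth)


# Made-up labels: three of the four predictions are correct.
toy_pred = [1, 0, 1, 1]
toy_true = [1, 0, 0, 1]
print(accuracy(toy_pred, toy_true))  # 0.75
```

Because the comparison is symmetric, passing the predictions and the true labels in either order gives the same accuracy.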


### #2 SVM

In [0]:
# classifier
clf = SVC(C=1.0, kernel='linear')

In [0]:
t0 = time()
# train
clf.fit(features_train, labels_train)
print("\ntraining time:", round(time()-t0, 3), "s")


training time: 268.667 s


In [0]:
t0 = time()
# predict
pred = clf.predict(features_test)
print("predicting time:", round(time()-t0, 3), "s")

predicting time: 27.241 s


In [0]:
# calculate accuracy
accuracy = accuracy_score(pred, labels_test)
print('\naccuracy = {0}'.format(accuracy))


accuracy = 0.9840728100113766


#### Create a function for SVM classifier, training, predicting and calculating accuracy

* it returns the predictions

In [0]:
def ml_svm(features_train, features_test, labels_train, labels_test, kernel="linear", C=1.0):
  # classifier
  clf = SVC(kernel=kernel, C=C)
  
  t0 = time()
  # train
  clf.fit(features_train, labels_train)
  print("\ntraining time:", round(time()-t0, 3), "s")
  
  t0 = time()
  # predict
  pred = clf.predict(features_test)
  print("predicting time:", round(time()-t0, 3), "s")
  
  # calculate accuracy
  accuracy = accuracy_score(pred, labels_test)
  print('\naccuracy = {0}'.format(accuracy))
  
  return pred

In [0]:
pred = ml_svm(features_train, features_test, labels_train, labels_test)


training time: 265.231 s
predicting time: 27.534 s

accuracy = 0.9840728100113766
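The train, predict, and score pattern in `ml_svm` does not depend on the classifier being an SVM. One possible generalization, sketched below, takes any object with `fit` and `predict` methods and returns the timings instead of printing them. The names `fit_predict_score` and `MajorityClassifier` are hypothetical; the stand-in classifier exists only so the sketch runs without sklearn.

```python
from time import time


def fit_predict_score(clf, features_train, features_test, labels_train, labels_test):
    """Train clf, predict on the test set, and report timings and accuracy."""
    t0 = time()
    clf.fit(features_train, labels_train)
    train_s = time() - t0

    t0 = time()
    pred = clf.predict(features_test)
    predict_s = time() - t0

    correct = sum(p == t for p, t in zip(pred, labels_test))
    stats = {
        "train_s": round(train_s, 3),
        "predict_s": round(predict_s, 3),
        "accuracy": correct / len(labels_test),
    }
    return pred, stats


class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""

    def fit(self, features, labels):
        labels = list(labels)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return [self.label_ for _ in features]


toy_pred, toy_stats = fit_predict_score(
    MajorityClassifier(), [[0], [1], [2]], [[3], [4]], [1, 1, 0], [1, 0]
)
print(toy_pred, toy_stats["accuracy"])
```

With sklearn installed, the same helper should accept `SVC(kernel="linear")` or `GaussianNB()` in place of the stand-in, since both expose `fit` and `predict`.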


#### Speed up an Algorithm
* Create a smaller training set
* Trade-off: the accuracy goes down

In [0]:
# one percent of the data set
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

In [0]:
pred = ml_svm(features_train, features_test, labels_train, labels_test)


training time: 0.139 s
predicting time: 1.398 s

accuracy = 0.8845278725824801


Voice recognition and transaction blocking need to happen in real time, with almost no delay.  There's no obvious need to predict an email author instantly.
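The slice above keeps the first 1% of the training rows, which is a fair sample here because `train_test_split` shuffles the data by default. If the rows were ordered (for example, all of one author's emails first), a random subsample would be safer. The sketch below shows one way to draw such a sample while keeping features and labels aligned; `subsample` is a hypothetical helper and the data is made up.

```python
import random


def subsample(features, labels, fraction, seed=42):
    """Return a random fraction of the rows, keeping features and labels aligned."""
    n_keep = max(1, int(len(features) * fraction))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    indices = sorted(rng.sample(range(len(features)), n_keep))
    return [features[i] for i in indices], [labels[i] for i in indices]


# Made-up data: the label is the parity of the single feature value.
toy_features = [[i] for i in range(1000)]
toy_labels = [i % 2 for i in range(1000)]

small_features, small_labels = subsample(toy_features, toy_labels, 0.01)
print(len(small_features), len(small_labels))  # 10 10
```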

#### Deploy an RBF Kernel

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”.

In [0]:
pred = ml_svm(features_train, features_test, labels_train, labels_test, kernel="rbf")


training time: 0.167 s
predicting time: 1.655 s

accuracy = 0.6160409556313993
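For intuition about what changes when the kernel is switched, the sketch below compares the linear kernel (a plain dot product) with the RBF kernel, which scores two points by how close they are: K(x, y) = exp(-gamma * ||x - y||^2). The function names and the `gamma` value are chosen for illustration only and are not the settings sklearn picks by default.

```python
import math


def linear_kernel(x, y):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """exp(-gamma * squared Euclidean distance); 1.0 for identical points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))  # similarity decays with distance
print(linear_kernel([1.0, 2.0], [3.0, 4.0]))
```

Because the RBF similarity decays with distance, each support vector influences only its neighbourhood, which lets the boundary bend in ways a linear kernel cannot.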


#### Optimize C Parameter

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?

In [0]:
for C in [10, 100, 1000, 10000]:
    print('C =', C)
    pred = ml_svm(features_train, features_test, labels_train, labels_test, kernel='rbf', C=C)
    print('\n\n')

C = 10

training time: 0.157 s
predicting time: 1.633 s

accuracy = 0.6160409556313993



C = 100

training time: 0.154 s
predicting time: 1.633 s

accuracy = 0.6160409556313993



C = 1000

training time: 0.146 s
predicting time: 1.553 s

accuracy = 0.8213879408418657



C = 10000

training time: 0.144 s
predicting time: 1.306 s

accuracy = 0.8924914675767918





##### Accuracy after Optimizing C

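A larger C penalizes misclassified training points more heavily, so the decision boundary becomes more complex. On the 1% training set with the RBF kernel, accuracy rises from about 0.616 at C=10 and C=100 to about 0.821 at C=1000 and about 0.892 at C=10000.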
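One way to see why C controls complexity is the standard soft-margin objective for a linear SVM: 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below evaluates that objective for a fixed, made-up boundary; it is not how `SVC` solves the problem internally, and `soft_margin_objective` is a hypothetical helper.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels in y must be -1 or +1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, yi in zip(X, y):
        decision = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge_total += max(0.0, 1 - yi * decision)  # zero if safely on the correct side
    return margin_term + C * hinge_total


# Made-up 1-D data: the third point lies on the wrong side of the boundary.
toy_X = [[2.0], [-2.0], [0.5]]
toy_y = [1, -1, -1]
toy_w, toy_b = [1.0], 0.0

print(soft_margin_objective(toy_w, toy_b, toy_X, toy_y, C=1))      # 2.0
print(soft_margin_objective(toy_w, toy_b, toy_X, toy_y, C=10000))  # 15000.5
```

With C=1 the single misclassified point adds little to the objective, so a simple boundary is acceptable. With C=10000 that same point dominates, so the optimizer is pushed toward a boundary that fits every training point.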

#### Optimized RBF vs Linear SVM: Accuracy

* What is the accuracy of the optimized SVM?

In [0]:
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [0]:
# optimized parameters 
pred = ml_svm(features_train, features_test, labels_train, labels_test, kernel='rbf', C=10000)


training time: 179.25 s
predicting time: 20.657 s

accuracy = 0.9908987485779295


#### Extracting Predictions from an SVM

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set.)

In [0]:
elem = [10, 26, 40]
for i in elem:
  print(pred[i])

1
0
1


#### How Many Chris Emails Are Predicted
There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? 

In [0]:
sum(pred)

877
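`sum(pred)` works because the "Chris" class is encoded as 1, so summing the predictions counts that class. A small sketch that reports both classes by name, using made-up predictions and a hypothetical helper `count_predictions`:

```python
from collections import Counter

# Label encoding as stated in the text above: 0 is Sara, 1 is Chris.
LABEL_NAMES = {0: "Sara", 1: "Chris"}


def count_predictions(predictions):
    """Count how many predictions fall into each named class."""
    counts = Counter(int(p) for p in predictions)
    return {name: counts[label] for label, name in LABEL_NAMES.items()}


toy_pred = [1, 0, 1, 1, 0]  # made-up predictions, not the notebook's output
print(count_predictions(toy_pred))  # {'Sara': 2, 'Chris': 3}
```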

#### Final Thoughts

Hopefully it’s becoming clearer what Sebastian meant when he said Naive Bayes is great for text--on this particular problem it trains far faster than an SVM (about 1 s versus several minutes on the full training set above) while staying within about two percentage points of the SVM’s accuracy (0.973 versus 0.984 for the linear kernel and 0.991 for the tuned RBF kernel). Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art and science of machine learning. In addition to picking your algorithm, depending on which one you try, there are parameters to tune as well, and the possibility of overfitting (especially if you don’t have lots of training data).

Our general suggestion is to try a few different algorithms for each problem. Tuning the parameters can be a lot of work, but just sit tight for now--toward the end of the class we will introduce you to `GridSearchCV`, a great sklearn tool that can find an optimal parameter tune almost automatically.
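The grid-search tool mentioned above is available in scikit-learn as `GridSearchCV` in `sklearn.model_selection`. As a rough preview, the sketch below automates the kind of manual loop over `C` used earlier. It assumes scikit-learn is installed and runs on a small synthetic dataset from `make_classification` rather than the email features, so the parameters it selects say nothing about the Enron problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset, used only to keep the example fast.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Candidate values to try; every combination is cross-validated.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [1, 10, 100, 1000],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Cross-validating on the training data keeps the test set out of the tuning loop, which the manual loop over `C` earlier in this notebook does not do.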


### #3 Decision Tree

#### #1 First Email DT: Accuracy

* get a decision tree up and running as a classifier, setting min_samples_split=40

In [0]:
def ml_dt(features_train, features_test, labels_train, labels_test, min_samples_split=2):
  # classifier
  clf = tree.DecisionTreeClassifier(min_samples_split=min_samples_split)
  
  t0 = time()
  # train
  clf = clf.fit(features_train, labels_train)
  print("\ntraining time:", round(time()-t0, 3), "s")
  
  t0 = time()
  # predict
  pred = clf.predict(features_test)
  print("predicting time:", round(time()-t0, 3), "s")
  
  # calculate accuracy
  accuracy = accuracy_score(pred, labels_test)
  print('\naccuracy = {0}'.format(accuracy))
  
  return pred

In [19]:
pred = ml_dt(features_train, features_test, labels_train, labels_test, min_samples_split=40)


training time: 84.351 s
predicting time: 0.019 s

accuracy = 0.9789533560864618
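`min_samples_split` is the minimum number of samples a node must contain before the tree is allowed to split it, so larger values stop the tree from chasing small groups of points. The sketch below builds a greedy one-dimensional tree with Gini impurity and counts its leaves. It is a simplified illustration rather than sklearn's implementation, and `gini`, `count_leaves`, and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def count_leaves(points, min_samples_split):
    """Count the leaves of a greedy 1-D tree built on (x, label) pairs."""
    labels = [label for _, label in points]
    # Stop if the node is too small to split or already pure.
    if len(points) < min_samples_split or gini(labels) == 0.0:
        return 1
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical x values
        left = [label for _, label in points[:i]]
        right = [label for _, label in points[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or score < best[0]:
            best = (score, i)
    if best is None:
        return 1
    split = best[1]
    return (count_leaves(points[:split], min_samples_split)
            + count_leaves(points[split:], min_samples_split))


# Made-up noisy data: labels alternate, so there is no real pattern to learn.
toy_points = [(i, i % 2) for i in range(40)]
print(count_leaves(toy_points, min_samples_split=2))   # 40
print(count_leaves(toy_points, min_samples_split=40))  # 2
```

With `min_samples_split=2` the toy tree memorizes every point with one leaf each, while `min_samples_split=40` allows only the root to split.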


#### #2 What's the number of features in your data? 

In [24]:
features_train.shape[1]

3785

#### #3 Changing the number of features

In the `preprocess` function, change the percentile in

`selector = SelectPercentile(f_classif, percentile=10)`

from 10 to 1, and rerun.

In [27]:
features_train.shape[1]

379

#### #4 Select Percentile and Complexity

What do you think SelectPercentile is doing? Would a large value for percentile lead to a more complex or less complex decision tree, all other things being equal? 

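SelectPercentile keeps only the highest-scoring features: with `percentile=10` it retains the top 10% (3785 features here), and with `percentile=1` the top 1% (379). Having fewer features around means there are fewer chances for the decision tree to carve out very specific little spots when finding a decision surface.  These specific little spots (what we'd also call evidence of a high-variance result) indicate a more complex decision-making process.  So, all other things being equal, a larger percentile keeps more features and tends to produce a more complex decision tree, not a less complex one.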
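The selection step can be pictured as "score every feature, then keep the top few percent". The sketch below does this by hand for a small dense matrix. As a loudly stated simplification, it scores each feature by the absolute difference between the two class means, whereas the notebook's `f_classif` uses an ANOVA F-value; `select_percentile` and the toy data are hypothetical.

```python
def select_percentile(X, y, percentile):
    """Keep the top `percentile` percent of columns by a simple score.

    Assumes binary labels (0 and 1) with at least one row of each class.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        col0 = [row[j] for row, label in zip(X, y) if label == 0]
        col1 = [row[j] for row, label in zip(X, y) if label == 1]
        # Stand-in score: absolute difference of class means (not the F-value).
        scores.append(abs(sum(col1) / len(col1) - sum(col0) / len(col0)))
    n_keep = max(1, round(n_features * percentile / 100))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:n_keep])
    return keep, [[row[j] for j in keep] for row in X]


# Made-up data: 10 features, of which only column 3 tracks the label.
toy_y = [0, 1, 0, 1, 0, 1]
toy_X = [[0.5] * 10 for _ in toy_y]
for row, label in zip(toy_X, toy_y):
    row[3] = float(label)

keep_10, reduced_10 = select_percentile(toy_X, toy_y, percentile=10)
keep_50, reduced_50 = select_percentile(toy_X, toy_y, percentile=50)
print(keep_10, len(reduced_10[0]))  # [3] 1
print(len(keep_50))                 # 5
```

Lowering the percentile shrinks the feature matrix the tree has to search, which is consistent with the shorter training time reported for the 1% run.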

In [28]:
pred = ml_dt(features_train, features_test, labels_train, labels_test, min_samples_split=40)


training time: 6.267 s
predicting time: 0.003 s

accuracy = 0.9670079635949943
