In [1]:
#!/usr/bin/python3

""" 
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use an SVM to identify the authors of emails from the Enron corpus:
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
sys.path.append(r"D:\machine_learning\ud120-projects\tools")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()


#########################################################

No. of Chris training emails :  7936
No. of Sara training emails :  7884


Go to the svm directory to find the starter code (svm/svm_author_id.py). Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?

In [2]:
# SVM classifier and accuracy metric
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

In [3]:
# Create classifier
clf = SVC(kernel='linear', C= 1.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 385.116 s
Predicting time: 38.287 s
Accuracy: 0.9840728100113766


One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.

features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?
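
A slice bound has to be an integer. Under Python 3, `/` always returns a float, so `len(features_train)/100` cannot be used directly as a slice bound; floor division (`//`) or an `int()` cast is needed. A minimal sketch, using a hypothetical helper `first_percent` and a plain list standing in for the real feature matrix:

```python
def first_percent(items):
    """Return the first 1% of a sequence, rounding the cut-off down."""
    # Floor division keeps the slice bound an int under Python 3.
    return items[:len(items) // 100]


# Stand-in for the training data: 7936 + 7884 = 15820 items,
# matching the counts printed by preprocess() above.
stand_in = list(range(7936 + 7884))
subset = first_percent(stand_in)
print(len(subset))  # 158

# True division yields a float, which a list slice rejects.
try:
    stand_in[:len(stand_in) / 100]
except TypeError as err:
    print("TypeError:", err)
```

So the 1% subset used in the following cells holds 158 training emails.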

In [4]:
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

# Create classifier
clf = SVC(kernel='linear', C= 1.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 0.413 s
Predicting time: 2.15 s
Accuracy: 0.8845278725824801


Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?
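
To make "more complex kernel" concrete, here is a small pure-Python sketch of the two similarity functions involved. The function names and the value `gamma=0.5` are illustrative assumptions; this is not how `SVC` computes kernels internally, and it is not the gamma that `SVC` chooses by default.

```python
import math


def linear_kernel(x, y):
    """Plain dot product: the similarity a linear SVM works with."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


a = [1.0, 0.0]
b = [0.0, 1.0]

print(linear_kernel(a, b))          # 0.0
print(rbf_kernel(a, a, gamma=0.5))  # 1.0
print(rbf_kernel(a, b, gamma=0.5))  # exp(-1), about 0.368
```

The RBF kernel depends only on the distance between two points, which is what lets the resulting decision boundary curve around clusters instead of staying a flat hyperplane.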

In [5]:
# Create classifier
clf = SVC(kernel='rbf', C= 1.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 0.715 s
Predicting time: 3.586 s
Accuracy: 0.8953356086461889


Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?
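
Rather than repeating the train/predict/score cell once per value, the sweep can be written as a loop. The sketch below is one possible arrangement, not the course's solution: `sweep_c`, `accuracy`, and the stand-in classifier `MajorityClf` are hypothetical names, and the tiny dataset is made up so the block runs without the Enron data.

```python
from time import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def sweep_c(make_clf, c_values, x_train, y_train, x_test, y_test):
    """Fit one classifier per C value; return (C, accuracy, seconds) rows."""
    rows = []
    for c in c_values:
        clf = make_clf(c)
        t0 = time()
        clf.fit(x_train, y_train)
        pred = clf.predict(x_test)
        rows.append((c, accuracy(y_test, pred), time() - t0))
    return rows


class MajorityClf:
    """Hypothetical stand-in with a fit/predict interface; it ignores C."""

    def __init__(self, C):
        self.C = C

    def fit(self, x, y):
        labels = list(y)
        # Remember the most frequent training label.
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


# Made-up data for illustration only.
rows = sweep_c(MajorityClf, [10.0, 100.0, 1000.0, 10000.0],
               [[0], [1], [2]], [1, 1, 0], [[3], [4]], [1, 0])
for c, acc, seconds in rows:
    print(f"C={c}: accuracy={acc:.3f}")

# max() returns the first of tied rows, so ties go to the smallest C listed.
best_c, best_acc, _ = max(rows, key=lambda row: row[1])
print("Best C:", best_c)
```

With scikit-learn, the same helper could be called as `sweep_c(lambda c: SVC(kernel='rbf', C=c), [10.0, 100.0, 1000.0, 10000.0], features_train, labels_train, features_test, labels_test)`.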

In [6]:
# Create classifier
clf = SVC(kernel='rbf', C= 10.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 0.429 s
Predicting time: 3.097 s
Accuracy: 0.8998862343572241


In [7]:
# Create classifier
clf = SVC(kernel='rbf', C= 100.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 0.509 s
Predicting time: 3.47 s
Accuracy: 0.8998862343572241


In [8]:
# Create classifier
clf = SVC(kernel='rbf', C= 1000.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 0.419 s
Predicting time: 3.308 s
Accuracy: 0.8998862343572241


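C=10000 gives the best accuracy. In the runs shown above, C = 10, 100, and 1000 all tie at about 0.8999, up from 0.8953 at C = 1.0; the C=10000 run on the 1% subset is not shown here.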

Once you've optimized the C value for your RBF kernel, what accuracy does it give? Does this C value correspond to a simpler or more complex decision boundary?

(If you're not sure about the complexity, go back a few videos to the "SVM C Parameter" part of the lesson. The result that you found there is also applicable here, even though it's now much harder or even impossible to draw the decision boundary in a simple scatterplot.)
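
For reference, the standard soft-margin SVM objective shows where C enters. This is the textbook formulation rather than something stated in the lesson text; here $y_i \in \{-1, +1\}$ is the label of training point $i$, $\phi$ is the feature map implied by the kernel, and $\xi_i$ is the slack (margin violation) allowed for point $i$:

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

A larger C makes every margin violation more expensive, so the optimizer bends the boundary to classify more training points correctly, which gives a more complex boundary. A smaller C tolerates violations in exchange for a wider, smoother margin.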

In [9]:
print('More complex decision boundary')

More complex decision boundary


Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?

In [10]:
features_train, features_test, labels_train, labels_test = preprocess()

# Create classifier
clf = SVC(kernel='rbf', C= 1000.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

No. of Chris training emails :  7936
No. of Sara training emails :  7884
Training time: 423.775 s
Predicting time: 54.614 s
Accuracy: 0.9960182025028441


What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]
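
As a small illustration of the zero-based lookup, the sketch below picks several positions out of a list of predictions and maps the labels back to names. The helper `predictions_at` is a hypothetical name and the prediction values are made up; the label mapping is the one given in the docstring at the top.

```python
LABEL_NAMES = {0: "Sara", 1: "Chris"}  # label mapping from the docstring above


def predictions_at(predictions, indices):
    """Return the predicted labels at the given zero-based positions."""
    return [predictions[i] for i in indices]


# Made-up predictions for illustration only, not real classifier output.
example_predictions = [0, 1, 1, 0, 1]
picked = predictions_at(example_predictions, [0, 2, 3])
print(picked)                            # [0, 1, 0]
print([LABEL_NAMES[p] for p in picked])  # ['Sara', 'Chris', 'Sara']
```

The list comprehension works for plain lists as well as NumPy arrays; with a NumPy array, passing a list of indices directly (as in `pred[[10, 26, 50]]`) does the same lookup.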

In [11]:
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

# Create classifier
# Note: the quiz text asks for C=10000; this run used C=1000.0.
clf = SVC(kernel='rbf', C=1000.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

Training time: 0.526 s
Predicting time: 3.55 s
Accuracy: 0.8998862343572241


In [12]:
print("Predictions for elements 10, 26, and 50:", pred[[10, 26, 50]])

Predictions for elements 10, 26, and 50: [1 0 1]


There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)
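
One way to count predictions per class is `collections.Counter`, which tallies both labels in a single pass. The prediction values below are made up for illustration and are not real classifier output.

```python
from collections import Counter

# Made-up predictions for illustration only, not real classifier output.
example_predictions = [1, 0, 1, 1, 0]

counts = Counter(example_predictions)
print(counts[1])  # 3 predicted as Chris (label 1)
print(counts[0])  # 2 predicted as Sara (label 0)
```

For a NumPy array of predictions, converting with `pred.tolist()` first keeps the counter keys as plain Python integers.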

In [13]:
features_train, features_test, labels_train, labels_test = preprocess()

# Create classifier
clf = SVC(kernel='rbf', C= 10000.0)

# Train
t0 = time()
clf.fit(features_train, labels_train)
print("Training time:", round(time()-t0, 3), "s")

# Predict
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s")

# Calculate accuracy
acc = accuracy_score(labels_test, pred)
print("Accuracy:", acc)

No. of Chris training emails :  7936
No. of Sara training emails :  7884
Training time: 367.897 s
Predicting time: 49.927 s
Accuracy: 0.9960182025028441


In [14]:
# Count the test emails predicted as Chris (label 1)
num_chris = int((pred == 1).sum())
print("Number of emails predicted to be from Chris (1):", num_chris)

Number of emails predicted to be from Chris (1): 866
