## SVM Author ID Accuracy

Go to the svm directory to find the starter code (svm/svm_author_id.py).

Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?

In [1]:
import pickle
import cPickle
import numpy

In [2]:
from sklearn import cross_validation



In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
from sklearn.feature_selection import SelectPercentile, f_classif

In [14]:
def preprocess(words_file = "../ud120-projects/tools/word_data.pkl", authors_file="../ud120-projects/tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features
        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions
        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "r")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "r")
    word_data = cPickle.load(words_file_handler)
    words_file_handler.close()

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print "no. of Chris training emails:", sum(labels_train)
    print "no. of Sara training emails:", len(labels_train)-sum(labels_train)
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test

In [15]:
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [16]:
from sklearn.svm import SVC

In [17]:
clf = SVC(kernel='linear')

In [18]:
clf.fit(features_train, labels_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [19]:
pred = clf.predict(features_test)

Accuracy:

In [20]:
import numpy as np
np.mean(pred == labels_test)

0.98407281001137659

In [21]:
from sklearn.metrics import accuracy_score

In [22]:
accuracy_score(pred, labels_test)

0.98407281001137659
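Both cells compute the same quantity: the fraction of test predictions that match the true labels. A minimal pure-Python sketch of that definition follows, written in Python 3 syntax (unlike the Python 2 cells above). The `accuracy` helper and the toy lists are illustrative assumptions, not part of the starter code.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(1 for p, t in zip(predictions, labels) if p == t)
    return matches / len(labels)

# Toy data (hypothetical): 4 of the 5 predictions match the labels.
pred = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 0]
print(accuracy(pred, labels))  # 0.8
```

`np.mean(pred == labels_test)` and `accuracy_score` give this same ratio on array inputs.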

### SVM Author ID Timing

In [23]:
from time import time

In [24]:
clf = SVC(kernel='linear')

In [25]:
t0 = time()
clf.fit(features_train, labels_train)
print "training time: ", round(time()-t0, 3), "s"

training time:  240.542 s


In [27]:
t1 = time()
pred = clf.predict(features_test)
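print "predicting time: ", round(time()-t1, 3), "s"

predicting time:  295.806 s

Note: the figure above was measured from t0 (the start of training), so it includes the 240.542 s of training time. Prediction alone accounts for at most about 55 s of it.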


Conclusion: the SVM takes much longer to train and to predict than Naive Bayes.
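When timing training and prediction separately, each timer should start immediately before the call it measures. The sketch below, in Python 3 syntax, wraps that pattern in a small helper. The `timed` function and the `fake_fit` / `fake_predict` stand-ins are hypothetical; with a real classifier one would pass `clf.fit` and `clf.predict` instead.

```python
from time import perf_counter, sleep

def timed(label, func, *args):
    """Run func(*args), print the elapsed wall-clock time, return (result, seconds)."""
    start = perf_counter()           # timer starts right before the call
    result = func(*args)
    elapsed = perf_counter() - start
    print("%s time: %.3f s" % (label, elapsed))
    return result, elapsed

# Stand-ins for clf.fit and clf.predict (assumptions: they only sleep).
def fake_fit():
    sleep(0.05)

def fake_predict():
    sleep(0.01)
    return [0, 1, 1]

_, fit_seconds = timed("training", fake_fit)
pred, predict_seconds = timed("predicting", fake_predict)
```

Because each call gets its own start time, the prediction figure never includes the training time.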

### A Smaller Training Set

One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier. 

```
features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100] 
```

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?
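The two lines above rely on Python 2, where `/` between integers performs integer division. In Python 3 the same expression yields a float, which cannot be used as a slice index, so `//` is needed. A minimal sketch, using stand-in lists (an assumption) with the same number of rows as the training set reported above (7936 + 7884 = 15820):

```python
# Stand-ins for the real arrays (assumption): only the length matters here.
features_train = list(range(15820))
labels_train = [i % 2 for i in range(15820)]

# Floor division keeps the slice index an integer in Python 3.
n_small = len(features_train) // 100
features_small = features_train[:n_small]
labels_small = labels_train[:n_small]

print(n_small)  # 158
```

So the 1% subset contains only 158 training emails.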

In [28]:
features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100]

In [29]:
clf = SVC(kernel='linear')

In [30]:
t0 = time()
clf.fit(features_train, labels_train)
print "training time: ", round(time()-t0, 3), "s"

training time:  0.138 s


In [31]:
t1 = time()
pred = clf.predict(features_test)
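print "predicting time: ", round(time()-t1, 3), "s"

predicting time:  10.315 s

Note: the figure above was measured from t0 (the start of training), so it includes the 0.138 s of training time.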


In [32]:
print "accuracy: ", np.mean(pred == labels_test)

accuracy:  0.884527872582


Conclusion: accuracy drops from 0.984 to 0.885, but training takes 0.138 s instead of 240.542 s.

### Speed-Accuracy Tradeoff

If speed is a major consideration (and for many real-time machine learning applications, it certainly is) then you may want to sacrifice a bit of accuracy if it means you can train/predict faster. Which of these are applications where you can imagine a very quick-running algorithm is especially important?

- predicting the author of an email
- flagging credit card fraud, and blocking a transaction before it goes through
- voice recognition, like Siri

Answer:
The 2nd and 3rd options (fraud flagging and voice recognition), because both have to respond in real time.

### Deploy an RBF Kernel

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?

In [33]:
clf = SVC(kernel='rbf')
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "accuracy: ", round(np.mean(pred==labels_test), 3)

accuracy:  0.616


-- Accuracy drops to 0.616, down from 0.885 with the linear kernel on the same 1% training subset.

### Optimize C Parameter

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?

In [34]:
clf = SVC(kernel='rbf', C=10)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "accuracy: ", round(np.mean(pred==labels_test), 3)

accuracy:  0.616


In [35]:
clf = SVC(kernel='rbf', C=100)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "accuracy: ", round(np.mean(pred==labels_test), 3)

accuracy:  0.616


In [36]:
clf = SVC(kernel='rbf', C=1000)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "accuracy: ", round(np.mean(pred==labels_test), 3)

accuracy:  0.821


In [37]:
clf = SVC(kernel='rbf', C=10000)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "accuracy: ", round(np.mean(pred==labels_test), 3)

accuracy:  0.892


-- C = 10000 gives the best accuracy (0.892) of the four values tried.
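The four cells above differ only in the value of C, so the sweep can also be written as a loop. The sketch below uses Python 3 syntax and a recent scikit-learn. The synthetic data from `make_classification` is a stand-in (an assumption), so its accuracies will not match the email results, and the default `gamma` in recent scikit-learn releases differs from the `gamma='auto'` shown in the output earlier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the email features (assumption, not the course data).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# Train one RBF-kernel SVM per candidate C and record its test accuracy.
results = {}
for c in (10.0, 100.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", C=c)
    clf.fit(X_train, y_train)
    results[c] = clf.score(X_test, y_test)

# Pick the C with the highest recorded accuracy.
best_c = max(results, key=results.get)
for c, acc in sorted(results.items()):
    print("C = %g: accuracy = %.3f" % (c, acc))
print("best C:", best_c)
```

Keeping the results in a dictionary makes it easy to compare the values side by side.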

### Accuracy after Optimizing C

Once you've optimized the C value for your RBF kernel, what accuracy does it give? Does this C value correspond to a simpler or more complex decision boundary?

(If you're not sure about the complexity, go back a few videos to the "SVM C Parameter" part of the lesson. The result that you found there is also applicable here, even though it's now much harder or even impossible to draw the decision boundary in a simple scatterplot.)

-- Accuracy is now 89.2%. A large C penalizes misclassified training points more heavily, so the decision boundary is more complex than with the default value of C = 1.0.

### Optimized RBF vs. Linear SVM: Accuracy

Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?

In [38]:
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [39]:
clf = SVC(kernel='rbf', C=10000)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "accuracy: ", round(np.mean(pred==labels_test), 3)

accuracy:  0.991


### Extracting Predictions from an SVM

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]

In [52]:
clf.predict(features_test[10].reshape(1, -1))



array([1])

In [53]:
clf.predict(features_test[26].reshape(1, -1))

array([0])

In [54]:
# 50th element:
clf.predict(features_test[50].reshape(1, -1))

array([1])
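`predict` expects a 2-D array with one row per sample, which is why the cells above call `reshape(1, -1)` on a single test row. A minimal NumPy sketch of what that reshape does, using a small made-up matrix (an assumption) in place of `features_test`:

```python
import numpy as np

# Stand-in for features_test (assumption): 4 samples, 3 features each.
features_test = np.arange(12.0).reshape(4, 3)

row = features_test[2]        # indexing one row gives a 1-D array
print(row.shape)              # (3,)

single = row.reshape(1, -1)   # 1 sample, as many columns as needed
print(single.shape)           # (1, 3)

# Slicing instead of indexing keeps the 2-D shape directly.
also_single = features_test[2:3]
print(np.array_equal(single, also_single))  # True
```

An alternative is to predict the whole test set once and index the result, for example `pred[10]`.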

### How many Chris Emails Predicted?

There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)

In [55]:
from collections import Counter

In [56]:
print Counter(pred)

Counter({0: 881, 1: 877})


-- 877 test emails are predicted to be from Chris (1) and 881 from Sara (0).
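Because the labels are 0 and 1, the number of predicted Chris emails can be read either from the `Counter` or from the sum of the predictions. A small Python 3 sketch with a made-up prediction list (an assumption, not the real output):

```python
from collections import Counter

# Toy predictions (hypothetical): 0 = Sara, 1 = Chris.
pred = [0, 1, 1, 0, 1, 0, 0]

counts = Counter(pred)
n_chris = counts[1]           # entries equal to 1
n_sara = counts[0]            # entries equal to 0

# With 0/1 labels, summing the predictions counts the 1s as well.
print(n_chris, sum(pred))     # 3 3
print(n_sara)                 # 4
```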

### Final Thoughts on Deploying SVMs

Hopefully it’s becoming clearer what Sebastian meant when he said Naive Bayes is great for text--it’s faster and generally gives better performance than an SVM for this particular problem. Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art and science of machine learning. In addition to picking your algorithm, depending on which one you try, there are parameter tunes to worry about as well, and the possibility of overfitting (especially if you don’t have lots of training data).

Our general suggestion is to try a few different algorithms for each problem. Tuning the parameters can be a lot of work, but just sit tight for now--toward the end of the class we will introduce you to GridSearchCV, a great sklearn tool that can find an optimal parameter tune almost automatically.
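As a preview, here is a minimal sketch of how scikit-learn's `GridSearchCV` could search over the same C values tried by hand earlier. It uses Python 3 syntax and a recent scikit-learn. The data is synthetic (an assumption), and the 3-fold setting is an arbitrary choice for illustration. Note that it selects C by cross-validation on the training data rather than by accuracy on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 samples, 20 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Candidate values: the C grid explored manually, with the RBF kernel fixed.
param_grid = {"C": [10.0, 100.0, 1000.0, 10000.0]}

# 3-fold cross-validation scores each candidate; the best one is refit on all of X.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("mean cross-validated accuracy: %.3f" % search.best_score_)
```

After fitting, `search.predict` uses the classifier refit with the best C.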