# Using SVM to categorize emails

In [1]:
from email_preprocess import preprocess_emails
from sklearn.svm import SVC
from time import time
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
features_train, features_test, labels_train, labels_test = preprocess_emails()

## How long does SVM take to compute?

In [3]:
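def using_less_training_data(amount_of_training_data):
    """Fit a default SVC on a fraction of the training set; print timings and accuracy."""
    if amount_of_training_data == 1:
        # An integer train_size means an absolute sample count, so 1 would keep
        # a single sample; use the full training set directly instead.
        features = features_train
        labels = labels_train
    else:
        # Keep only the requested fraction; the fixed seed makes the subset repeatable.
        features,_,labels,_ = train_test_split(
            features_train,
            labels_train,
            train_size=amount_of_training_data,
            random_state=91,
        )

    print("training on", len(features), "out of", len(features_train),
          "(", len(features)/len(features_train)*100 ,"%)"
         )

    clf = SVC()  # default kernel is rbf

    t = time()
    clf.fit(features, labels)
    print("clf fit time:", round(time()-t, 3), "s")

    t = time()
    labels_pred = clf.predict(features_test)
    print("clf predict time:", round(time()-t, 3), "s")
    # accuracy_score expects (y_true, y_pred)
    print("accuracy:", accuracy_score(labels_test, labels_pred))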
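The fit and predict timings above use `time()`, which follows the wall clock. A minimal sketch of an alternative, assuming a small helper named `timed` (hypothetical, not part of the notebook) built on `time.perf_counter`, which is monotonic and better suited to measuring intervals:

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the elapsed seconds."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label} time: {elapsed:.3f} s")


timings = {}
with timed("sleep", timings):
    sleep(0.05)  # stand-in for clf.fit(features, labels)
```

Storing the elapsed values in a dictionary makes it easier to compare runs afterwards than reading them back from printed output.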

In [4]:
using_less_training_data(0.10)

training on 1582 out of 15820 ( 10.0 %)
clf fit time: 10.58 s
clf predict time: 8.738 s
accuracy: 0.9687144482366326


In [4]:
using_less_training_data(0.20)

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 24.239 s
clf predict time: 12.988 s
accuracy: 0.981797497155859


In [5]:
using_less_training_data(0.40)

training on 6328 out of 15820 ( 40.0 %)
clf fit time: 70.775 s
clf predict time: 19.084 s
accuracy: 0.987485779294653


In [10]:
using_less_training_data(0.80)

training on 12656 out of 15820 ( 80.0 %)
clf fit time: 199.186 s
clf predict time: 26.44 s
accuracy: 0.9920364050056882


In [13]:
using_less_training_data(1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 270.564 s
clf predict time: 29.974 s
accuracy: 0.9926052332195677


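Training on the full set is slow (about 270 s to fit), but the default SVC reaches 99.26% accuracy with a very simple implementation. Fit time also grows faster than linearly: ten times the training data takes roughly 26 times as long to fit.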
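To put a rough number on how the fit time scales, the measurements above can be fitted with a power law, time ≈ c · n^k, by taking a least-squares line through the log-log points. This is only a sketch over five timings from a single machine, so the exponent is indicative at best; the helper name `scaling_exponent` is made up for this example.

```python
import math

# (training samples, fit time in seconds) copied from the runs above
runs = [
    (1582, 10.58),
    (3164, 24.239),
    (6328, 70.775),
    (12656, 199.186),
    (15820, 270.564),
]


def scaling_exponent(points):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


exponent = scaling_exponent(runs)
print(f"fit time grows roughly like n^{exponent:.2f}")
```

An exponent above 1 means that doubling the training data more than doubles the fit time.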

## How do the different kernels perform on this dataset?

In [14]:
def switch_kernels(kernel, amount_of_training_data=0.20):
    """Fit an SVC with the given kernel on a fraction of the training set."""
    if amount_of_training_data == 1:
        # An integer train_size means an absolute sample count, so 1 would keep
        # a single sample; use the full training set directly instead.
        features = features_train
        labels = labels_train
    else:
        # Keep only the requested fraction; the fixed seed makes the subset repeatable.
        features,_,labels,_ = train_test_split(
            features_train,
            labels_train,
            train_size=amount_of_training_data,
            random_state=91,
        )

    print("training on", len(features), "out of", len(features_train),
          "(", len(features)/len(features_train)*100 ,"%)"
         )

    clf = SVC(kernel=kernel)

    t = time()
    clf.fit(features, labels)
    print("clf fit time:", round(time()-t, 3), "s")

    t = time()
    labels_pred = clf.predict(features_test)
    print("clf predict time:", round(time()-t, 3), "s")
    # accuracy_score expects (y_true, y_pred)
    print("accuracy:", accuracy_score(labels_test, labels_pred))

In [8]:
switch_kernels("linear")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 16.07 s
clf predict time: 8.437 s
accuracy: 0.9726962457337884


In [9]:
switch_kernels("poly")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 43.96 s
clf predict time: 23.428 s
accuracy: 0.8526734926052332


In [10]:
switch_kernels("rbf")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 24.575 s
clf predict time: 13.312 s
accuracy: 0.981797497155859


In [11]:
switch_kernels("sigmoid")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 13.183 s
clf predict time: 6.683 s
accuracy: 0.9732650739476678


In [12]:
switch_kernels("precomputed")

training on 3164 out of 15820 ( 20.0 %)


ValueError: Precomputed matrix must be a square matrix. Input is a 3164x3785 matrix.
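The error above arises because `kernel="precomputed"` expects `fit` to receive a square kernel (Gram) matrix of shape (n_train, n_train) rather than the raw feature matrix, and `predict` to receive a matrix of shape (n_test, n_train). A minimal sketch of how that might look, assuming a hand-computed linear kernel and small synthetic data in place of the email features:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(91)

# Synthetic stand-in data (an assumption, not the email features)
X_train = rng.normal(size=(60, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 5))

# Linear kernel computed by hand: K[i, j] = <x_i, x_j>
gram_train = X_train @ X_train.T  # shape (n_train, n_train), square
gram_test = X_test @ X_train.T    # shape (n_test, n_train)

clf = SVC(kernel="precomputed")
clf.fit(gram_train, y_train)
pred_precomputed = clf.predict(gram_test)

# The same model with the built-in linear kernel, for comparison
pred_linear = SVC(kernel="linear").fit(X_train, y_train).predict(X_test)
print("fraction of matching predictions:", (pred_precomputed == pred_linear).mean())
```

For this dataset the precomputed option adds nothing over the built-in kernels; it is mainly useful when the kernel is a custom similarity function.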

- Linear and sigmoid are the fastest on this subset of the data.
- Linear, sigmoid, and rbf all reach at least 97% accuracy; poly lags at about 85%.

I'll run the linear and sigmoid kernels against the full dataset to see how they perform.

In [15]:
switch_kernels("linear", 1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 199.196 s
clf predict time: 21.761 s
accuracy: 0.9840728100113766


In [16]:
switch_kernels("sigmoid", 1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 168.383 s
clf predict time: 16.947 s
accuracy: 0.9857792946530148


All three kernels are more accurate on the full training data than on 20% of it.

The rbf kernel has the best accuracy at 99.26% but also takes the longest to train: 270.6 s, versus 199.2 s for linear and 168.4 s for sigmoid.

| kernel | accuracy | train time |
|-|-|-|
| linear | 0.9840728100113766 | 199.196s |
| rbf | 0.9926052332195677 | 270.564s |
| sigmoid | 0.9857792946530148 | 168.383s |

At this point I would start tuning the parameters of the rbf kernel; however, I'll first go through the course exercises.
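For reference, one common way to tune an rbf SVC is a cross-validated grid search over `C` and `gamma`. The sketch below uses scikit-learn's `GridSearchCV` on small synthetic data; the grid values are illustrative assumptions, not settings tuned for the email dataset, and given the fit times above a search on the full training set would be slow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the email features)
X, y = make_classification(n_samples=200, n_features=10, random_state=91)

# Hypothetical grid; the values are illustrative only
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# 3-fold cross-validation over every (C, gamma) combination
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("best params:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Running the search on a subset first, such as the 20% split used earlier, would keep the total fit time manageable.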