# Using SVM to categorize emails

In [1]:
from email_preprocess import preprocess_emails
from sklearn.svm import SVC, LinearSVC
from time import time
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
features_train, features_test, labels_train, labels_test = preprocess_emails()

In [3]:
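def optimize_svm(kernel = 'rbf',
                 amount_of_training_data = 0.20,
                 C = 1.0,
                 features = None,
                 labels = None):
    """Fit an SVM on (a subset of) the training data, then print timings and test accuracy.

    Pass `features` and `labels` to train on an explicit subset; otherwise
    `amount_of_training_data` is the fraction of the training set to sample.
    """
    if features is not None and labels is not None:
        pass  # caller supplied an explicit training subset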
    elif amount_of_training_data == 1:
        features = features_train
        labels = labels_train
    else:
        features,_,labels,_ = train_test_split(
            features_train,
            labels_train,
            train_size=amount_of_training_data,
            random_state=91,
        )

    print("training on", len(features), "out of", len(features_train),
          "(", len(features)/len(features_train)*100 ,"%)"
         )
    
    clf = LinearSVC(C=C) if kernel == "LinearSVC" else SVC(kernel=kernel, C=C)

    t = time()
    clf.fit(features, labels)
    fit_delta = round(time()-t, 3)
    print("clf fit time:", fit_delta, "s")

    t = time()
    labels_pred = clf.predict(features_test)
    print("clf predict time:", round(time()-t, 3), "s")
    
    accuracy = accuracy_score(labels_test, labels_pred)  # signature is (y_true, y_pred)
    print("accuracy:", accuracy)
    
    print(f'| {kernel} | {amount_of_training_data} | {C} | {accuracy} | {fit_delta}s |')

## How long does SVM take to compute?

In [4]:
optimize_svm(amount_of_training_data = 0.10)

training on 1582 out of 15820 ( 10.0 %)
clf fit time: 10.604 s
clf predict time: 8.995 s
accuracy: 0.9687144482366326
| rbf | 0.1 | 1.0 | 0.9687144482366326 | 10.604s |


In [5]:
optimize_svm(amount_of_training_data = 0.20)

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 25.309 s
clf predict time: 13.144 s
accuracy: 0.981797497155859
| rbf | 0.2 | 1.0 | 0.981797497155859 | 25.309s |


In [6]:
optimize_svm(amount_of_training_data = 0.40)

training on 6328 out of 15820 ( 40.0 %)
clf fit time: 72.54 s
clf predict time: 19.364 s
accuracy: 0.987485779294653
| rbf | 0.4 | 1.0 | 0.987485779294653 | 72.54s |


In [7]:
optimize_svm(amount_of_training_data = 0.80)

training on 12656 out of 15820 ( 80.0 %)
clf fit time: 197.733 s
clf predict time: 26.531 s
accuracy: 0.9920364050056882
| rbf | 0.8 | 1.0 | 0.9920364050056882 | 197.733s |


In [8]:
optimize_svm(amount_of_training_data = 1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 272.625 s
clf predict time: 29.404 s
accuracy: 0.9926052332195677
| rbf | 1 | 1.0 | 0.9926052332195677 | 272.625s |


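Training takes a LONG time (about 273 s on the full training set), and fit time grows faster than the amount of data: going from 20% to 40% of the training set nearly tripled it (25.3 s to 72.5 s). Even so, this very simple implementation gets very accurate results (99.26% when trained on the full training set).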
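To put a rough number on how fit time scales, one option is a least-squares line through the log of the sample counts and the log of the fit times reported above. This is only a sketch over five single-run measurements, so the exponent is indicative at best; `loglog_slope` is a helper name introduced here for illustration.

```python
import math

# (training samples, fit time in seconds) copied from the rbf runs above
runs = [
    (1582, 10.604),
    (3164, 25.309),
    (6328, 72.54),
    (12656, 197.733),
    (15820, 272.625),
]

def loglog_slope(points):
    """Least-squares slope of log(time) against log(samples)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

exponent = loglog_slope(runs)
print(f"fit time grows roughly as n^{exponent:.2f}")
```

An exponent above 1 means that doubling the training data more than doubles the fit time.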

## How do the different kernels perform on this dataset?

In [9]:
optimize_svm("linear")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 16.306 s
clf predict time: 8.467 s
accuracy: 0.9726962457337884
| linear | 0.2 | 1.0 | 0.9726962457337884 | 16.306s |


In [10]:
optimize_svm("poly")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 44.582 s
clf predict time: 22.979 s
accuracy: 0.8526734926052332
| poly | 0.2 | 1.0 | 0.8526734926052332 | 44.582s |


In [11]:
optimize_svm("rbf")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 24.518 s
clf predict time: 13.166 s
accuracy: 0.981797497155859
| rbf | 0.2 | 1.0 | 0.981797497155859 | 24.518s |


In [12]:
optimize_svm("sigmoid")

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 13.106 s
clf predict time: 6.409 s
accuracy: 0.9732650739476678
| sigmoid | 0.2 | 1.0 | 0.9732650739476678 | 13.106s |


In [13]:
# optimize_svm("precomputed")
# >>> ValueError: Precomputed matrix must be a square matrix. Input is a 3164x3785 matrix.

training on 3164 out of 15820 ( 20.0 %)


ValueError: Precomputed matrix must be a square matrix. Input is a 3164x3785 matrix.
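The error above occurs because `kernel="precomputed"` expects a square matrix of kernel values between training samples rather than the raw feature matrix. A minimal sketch of how that kernel is meant to be used follows; it runs on synthetic data standing in for the email features, and the hand-computed linear kernel is an illustrative choice, not something from the course.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the email features: two well-separated clusters.
X_train = np.vstack([rng.normal(-3, 1, (40, 5)), rng.normal(3, 1, (40, 5))])
y_train = np.array([0] * 40 + [1] * 40)
X_test = np.vstack([rng.normal(-3, 1, (10, 5)), rng.normal(3, 1, (10, 5))])
y_test = np.array([0] * 10 + [1] * 10)

# Linear kernel computed by hand: K[i, j] = <x_i, x_j>.
gram_train = X_train @ X_train.T   # shape (n_train, n_train), square
gram_test = X_test @ X_train.T     # shape (n_test, n_train)

clf = SVC(kernel="precomputed")
clf.fit(gram_train, y_train)
pred = clf.predict(gram_test)
precomputed_accuracy = float((pred == y_test).mean())
print("accuracy:", precomputed_accuracy)
```

At prediction time the matrix holds kernel values between each test sample and every training sample, so it is rectangular; only the matrix passed to `fit` must be square.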

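- The linear and sigmoid kernels are the fastest on this 20% subset (16.3 s and 13.1 s to fit, versus 24.5 s for rbf and 44.6 s for poly).
- The linear, sigmoid, and rbf kernels all reach 97% to 98% accuracy; poly lags behind at about 85%.

I'll run the linear and sigmoid kernels against the full training set to see how they perform.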

In [14]:
optimize_svm("linear", 1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 208.987 s
clf predict time: 22.715 s
accuracy: 0.9840728100113766
| linear | 1 | 1.0 | 0.9840728100113766 | 208.987s |


In [15]:
optimize_svm("sigmoid", 1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 179.525 s
clf predict time: 17.315 s
accuracy: 0.9857792946530148
| sigmoid | 1 | 1.0 | 0.9857792946530148 | 179.525s |


All three kernels are more accurate when trained on the full training set than on 20% of it.

The rbf kernel has the best accuracy at 99.26%, but it also takes the longest to fit: 272.6 s, versus 209.0 s for linear and 179.5 s for sigmoid.

| kernel | training fraction | C | accuracy | fit time |
|-|-|-|-|-|
| linear | 1 | 1.0 | 0.9840728100113766 | 208.987s |
| rbf | 1 | 1.0 | 0.9926052332195677 | 272.625s |
| sigmoid | 1 | 1.0 | 0.9857792946530148 | 179.525s |

At this point I would start tuning the parameters of the rbf kernel, but first I'll go through the course exercises.

### Quiz: A Smaller Training Set

In [16]:
optimize_svm(
    kernel="linear",
    amount_of_training_data = 0.01,
    features = features_train[:round(len(features_train)/100)],
    labels = labels_train[:round(len(labels_train)/100)],
)

training on 158 out of 15820 ( 0.9987357774968394 %)
clf fit time: 0.104 s
clf predict time: 1.1 s
accuracy: 0.8845278725824801
| linear | 0.01 | 1.0 | 0.8845278725824801 | 0.104s |


### Quiz: Deploy an RBF Kernel

In [17]:
optimize_svm(
    amount_of_training_data = 0.01,
    features = features_train[:round(len(features_train)/100)],
    labels = labels_train[:round(len(labels_train)/100)],
)

training on 158 out of 15820 ( 0.9987357774968394 %)
clf fit time: 0.117 s
clf predict time: 1.145 s
accuracy: 0.8953356086461889
| rbf | 0.01 | 1.0 | 0.8953356086461889 | 0.117s |


### Quiz: Optimize C Parameter

In [18]:
optimize_svm(
    C=10.0,
    amount_of_training_data = 0.01,
    features = features_train[:round(len(features_train)/100)],
    labels = labels_train[:round(len(labels_train)/100)],
)

training on 158 out of 15820 ( 0.9987357774968394 %)
clf fit time: 0.112 s
clf predict time: 1.112 s
accuracy: 0.8998862343572241
| rbf | 0.01 | 10.0 | 0.8998862343572241 | 0.112s |


In [19]:
optimize_svm(
    C=100.0,
    amount_of_training_data = 0.01,
    features = features_train[:round(len(features_train)/100)],
    labels = labels_train[:round(len(labels_train)/100)],
)

training on 158 out of 15820 ( 0.9987357774968394 %)
clf fit time: 0.109 s
clf predict time: 1.108 s
accuracy: 0.8998862343572241
| rbf | 0.01 | 100.0 | 0.8998862343572241 | 0.109s |


In [20]:
optimize_svm(
    C=1000.0,
    amount_of_training_data = 0.01,
    features = features_train[:round(len(features_train)/100)],
    labels = labels_train[:round(len(labels_train)/100)],
)

training on 158 out of 15820 ( 0.9987357774968394 %)
clf fit time: 0.11 s
clf predict time: 1.106 s
accuracy: 0.8998862343572241
| rbf | 0.01 | 1000.0 | 0.8998862343572241 | 0.11s |


In [21]:
optimize_svm(
    C=10000.0,
    amount_of_training_data = 0.01,
    features = features_train[:round(len(features_train)/100)],
    labels = labels_train[:round(len(labels_train)/100)],
)

training on 158 out of 15820 ( 0.9987357774968394 %)
clf fit time: 0.115 s
clf predict time: 1.12 s
accuracy: 0.8998862343572241
| rbf | 0.01 | 10000.0 | 0.8998862343572241 | 0.115s |
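The four cells above differ only in `C`, so the same sweep can be written as a loop. The sketch below uses synthetic data from `make_classification` because the email features are not reproduced here; the dataset sizes and the `sweep` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the email features are not reproduced here.
X, y = make_classification(n_samples=400, n_features=20, random_state=91)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=91
)

# Map each C value to the test accuracy of an rbf SVC trained with it.
sweep = {}
for C in (10.0, 100.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    sweep[C] = accuracy_score(y_test, clf.predict(X_test))

for C, acc in sweep.items():
    print(f"| rbf | {C} | {acc:.4f} |")
```

Collecting the results in one dictionary also makes it easy to spot a plateau like the one above, where every `C` from 10 to 10000 gave the same accuracy.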


## The Optimized RBF model

In [22]:
clf = SVC(C=10000.0)
clf.fit(features_train, labels_train)

SVC(C=10000.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [23]:
labels_pred = clf.predict(features_test)

### Quiz: Optimized RBF vs. Linear SVM: Accuracy

In [24]:
print("accuracy:", accuracy_score(labels_test, labels_pred))

accuracy: 0.9960182025028441


### Quiz: Extracting Predictions from an SVM

In [25]:
print("labels_pred[10] =", labels_pred[10])
print("labels_pred[26] =", labels_pred[26])
print("labels_pred[50] =", labels_pred[50])

labels_pred[10] = 1
labels_pred[26] = 0
labels_pred[50] = 1


### Quiz: How Many Chris Emails Predicted?

In [26]:
print("number of emails in test set predicted to be authored by Chris(1)", sum(labels_pred))

number of emails in test set predicted to be authored by Chris(1) 866


### What's the difference in performance between `SVC(kernel="linear")` and `LinearSVC`?

In [27]:
optimize_svm(kernel="LinearSVC", amount_of_training_data = 1, C=16)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 0.483 s
clf predict time: 0.005 s
accuracy: 0.9931740614334471
| LinearSVC | 1 | 16 | 0.9931740614334471 | 0.483s |



| kernel | training fraction | C | accuracy | fit time |
|-|-|-|-|-|
| linear | 1 | 1.0 | 0.9840728100113766 | 208.987s |
| rbf | 1 | 1.0 | 0.9926052332195677 | 272.625s |
| rbf | 1 | 10000.0 | 0.9960182025028441 | ~272.625s |
| sigmoid | 1 | 1.0 | 0.9857792946530148 | 179.525s |
| LinearSVC | 1 | 16 | 0.9931740614334471 | 0.483s |

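LinearSVC is FAST!!! It fits in 0.483 s, versus 208.987 s for `SVC(kernel="linear")`. At C=16 its accuracy (0.9932) is not as high as the rbf model with C=10000 (0.9960), though it is above every SVC run at C=1.0. The C values differ between these runs, so this is not a like-for-like comparison.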
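Part of the gap is that the two classes are different solvers: `SVC` wraps libsvm, while `LinearSVC` wraps liblinear and defaults to a squared hinge loss, so they do not optimize exactly the same problem. A rough way to compare them at the same `C` is sketched below on synthetic data; the dataset size, `max_iter` value, and the `fit_and_time` helper are assumptions for illustration, and the timings will not match the email-data numbers.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data, much smaller than the email set.
X, y = make_classification(n_samples=2000, n_features=50, random_state=91)

def fit_and_time(clf):
    """Return (training accuracy, fit seconds) for an unfitted classifier."""
    start = perf_counter()
    clf.fit(X, y)
    elapsed = perf_counter() - start
    return clf.score(X, y), elapsed

comparison = {
    "SVC(kernel='linear')": fit_and_time(SVC(kernel="linear", C=1.0)),
    "LinearSVC": fit_and_time(LinearSVC(C=1.0, max_iter=10000)),
}
for name, (acc, seconds) in comparison.items():
    print(f"| {name} | {acc:.4f} | {seconds:.3f}s |")
```

This reports training accuracy only to keep the sketch short; a fair comparison on the email data would use the held-out test set as in the cells above.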

## Conclusion
SVC has a lot more parameters to manage than Naive Bayes.
This was a good order to learn them in.
Looking back, the Naive Bayes implementation seems very simple.

Compared with Naive Bayes, SVC is:

- slower than NB for the most part
- more accurate (0.992 vs 0.973)
- sometimes unacceptably slow at inference

Some takeaways:
- Refactoring the code into a single optimization function made it much easier to see what was happening.
- With long runtimes it became important to plan what work to run in the notebook:
    - Where could I line up multiple long-running tasks to answer a question?
    - Where could I combine multiple questions to answer them with fewer runs?
    - Could I have cached the results / models so that re-running something with the same parameters would return instantly?