# Building an E-Mail Importance Scorer Using scikit-learn
* I begin with a set of e-mails, each labeled as either "important" or "unimportant".
* I split them into two sets: an important set containing the important e-mails and an unimportant set containing the unimportant e-mails.
* Then I fit two TF-IDF vectorizers. The "important" vectorizer is fitted on the important e-mails. The "unimportant" vectorizer is fitted on the unimportant e-mails.
* I then transform the important set with the important vectorizer to get an important TF-IDF matrix (one row vector per e-mail), and the unimportant set with the unimportant vectorizer to get an unimportant TF-IDF matrix.
* Given a new e-mail, I transform it with the important vectorizer and compute its average cosine similarity to the rows of the important matrix. Then I transform the same e-mail with the unimportant vectorizer and compute its average cosine similarity to the rows of the unimportant matrix.
* Finally, I calculate the score `0.5 + (similarity_to_important - similarity_to_unimportant) / 2`. A score above 0.5 means the e-mail is, on average, more similar to the important set than to the unimportant set.
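
As a minimal sketch of the scoring arithmetic in the steps above, the following pure-Python example uses hand-made toy vectors in place of real TF-IDF output. The helper names (`cosine`, `mean_similarity`, `importance_score`) and all vectors are hypothetical and chosen for illustration only; the notebook itself relies on scikit-learn for these steps.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(doc_vec: Sequence[float],
                    class_vecs: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity of one document vector to every vector of a class."""
    return sum(cosine(doc_vec, v) for v in class_vecs) / len(class_vecs)


def importance_score(sim_important: float, sim_unimportant: float) -> float:
    """Map the difference of the two average similarities to a score around 0.5."""
    return 0.5 + (sim_important - sim_unimportant) / 2


# Toy stand-ins for the two TF-IDF matrices (assumed values, not real data).
important_vecs = [[1.0, 0.0], [1.0, 0.0]]
unimportant_vecs = [[0.0, 1.0], [0.0, 1.0]]

# The new e-mail is vectorized twice, once per vector space.
email_in_important_space = [1.0, 0.0]
email_in_unimportant_space = [1.0, 0.0]

sim_i = mean_similarity(email_in_important_space, important_vecs)
sim_u = mean_similarity(email_in_unimportant_space, unimportant_vecs)
example_score = importance_score(sim_i, sim_u)
print(sim_i, sim_u, example_score)
```

Because TF-IDF vectors have no negative entries, each cosine similarity lies between 0 and 1, so the score also lies between 0 and 1, with 0.5 as the neutral point where both average similarities are equal.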

In [151]:
import numpy as np
import pandas as pd
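import sklearn.feature_extraction.text

In [152]:
# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text; the older
# sklearn.feature_extraction.stop_words path is not available in recent releases.
STOP_WORDS = sklearn.feature_extraction.text.ENGLISH_STOP_WORDS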

def clean(doc: str) -> str:
    """ Prepare a document for processing. """
    doc = doc.lower()
    doc = "".join([c for c in doc if c.isalpha() or c.isspace()])
    doc = " ".join([word for word in doc.split() if word not in STOP_WORDS])
    return doc

In [153]:
columns = ["Content", "Important"]
train = pd.concat([
    pd.read_excel("Data/emails_300_set_1.xlsx", header=1)[columns],
    pd.read_excel("Data/emails_300_set_2.xlsx", header=1)[columns]
])
test = pd.read_excel("Data/emails_300_set_3.xlsx", header=1)[columns]

In [154]:
train.head(3)

Unnamed: 0,Content,Important
0,"\r\n\r\nAt the request of Patrice Thurston, pl...",0.0
1,Pursuant to the various discussions over the p...,1.0
2,"Get ready. Beginning in November, electronic ...",1.0


In [155]:
test.head(3)

Unnamed: 0,Content,Important
0,Dear Louise and Greg:\r\r\n\r\r\nFortune magaz...,False
1,"Hey guys,\r\r\n\r\r\nJust wanted to make known...",True
2,"Dear Ken,\r\r\n\r\r\nI hope you are faring oka...",True


In [156]:
tr_important   = [clean(d) for d in train[train["Important"] == 1]["Content"]]
tr_unimportant = [clean(d) for d in train[train["Important"] == 0]["Content"]]
te_important   = [clean(d) for d in test[test["Important"] == 1]["Content"]]
te_unimportant = [clean(d) for d in test[test["Important"] == 0]["Content"]]

In [157]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [169]:
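from typing import List

# Fit one TF-IDF vectorizer per class; each learns its own vocabulary and IDF weights.
vec_i = TfidfVectorizer()
vec_u = TfidfVectorizer()
trans_i = vec_i.fit_transform(tr_important)    # important TF-IDF matrix
trans_u = vec_u.fit_transform(tr_unimportant)  # unimportant TF-IDF matrix

def score(docs: List[str]) -> np.ndarray:
    """ Score cleaned documents; values above 0.5 lean "important". """
    trans_doc_i = vec_i.transform(docs)
    trans_doc_u = vec_u.transform(docs)
    # Mean cosine similarity of each document to all training e-mails of a class.
    sim_i = cosine_similarity(trans_doc_i, trans_i).mean(axis=1)
    sim_u = cosine_similarity(trans_doc_u, trans_u).mean(axis=1)
    return 0.5 + (sim_i - sim_u) / 2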

In [170]:
print("Fraction classified as important:", (score(te_important) > 0.5).mean())
print(score(te_important))

Fraction classified as important: 0.9292929292929293
[0.50150519 0.50067201 0.50110205 0.50781734 0.5031217  0.50147309
 0.50104478 0.50099364 0.50128166 0.5011009  0.50039853 0.5046933
 0.50180411 0.50143349 0.50326282 0.50287289 0.49963065 0.50266377
 0.50305432 0.50072138 0.50040259 0.50148278 0.49984627 0.50420662
 0.50351152 0.50310949 0.50005209 0.50038197 0.50227731 0.50175771
 0.5011842  0.50090374 0.50174217 0.50423661 0.50003605 0.50151282
 0.50183769 0.50181869 0.5008755  0.50104496 0.50341875 0.50219764
 0.49493427 0.50388944 0.50135183 0.50129513 0.49983187 0.50306095
 0.50379419 0.50251769 0.50209894 0.50309446 0.50163591 0.49977558
 0.4998795  0.50136686 0.50146167 0.50066877 0.50505038 0.5017485
 0.50292082 0.50275816 0.50373073 0.50137574 0.50180273 0.50398679
 0.50134761 0.5019294  0.49954458 0.50066097 0.50246768 0.50066092
 0.50266694 0.50150533 0.5011317  0.50212163 0.50399296 0.50088664
 0.50143359 0.50228953 0.50283545 0.50099391 0.50100684 0.50271956
 0.50389476 

In [171]:
print("Fraction classified as important:", (score(te_unimportant) > 0.5).mean())
print(score(te_unimportant))

Fraction classified as important: 0.7761194029850746
[0.50216358 0.49866878 0.50345915 0.5028035  0.50251898 0.50575787
 0.50112383 0.50272306 0.50331736 0.50056681 0.50530604 0.50040787
 0.50058973 0.49985998 0.50038944 0.49858586 0.50357208 0.50466133
 0.50059664 0.50107763 0.50197344 0.50148135 0.50221642 0.50334748
 0.50015269 0.50077929 0.50230439 0.50275368 0.50220206 0.50039398
 0.50136748 0.49811215 0.50078295 0.5012593  0.50203964 0.50046991
 0.49769813 0.50098469 0.50008859 0.49996757 0.50218436 0.50037214
 0.50083323 0.502708   0.5048503  0.50068891 0.50025614 0.49951834
 0.50297047 0.50138035 0.50054933 0.50234459 0.50558529 0.49855503
 0.50282112 0.4986806  0.49907473 0.49866517 0.50093291 0.4992527
 0.50162046 0.50175962 0.50253579 0.49931245 0.49984686 0.50156146
 0.49930681 0.50119103 0.50322641 0.50090404 0.50172339 0.50150958
 0.50616419 0.50017516 0.50044753 0.50165621 0.50314462 0.50241621
 0.49884559 0.49918095 0.49883827 0.50241539 0.50365373 0.50342866
 0.50566312