# Build a Spam Classifier using Machine Learning and Elasticsearch




Consider the trec07_spam set of documents annotated for spam, available under “data resources”.
First read and accept the agreement at http://plg.uwaterloo.ca/~gvcormac/treccorpus07/. Then download the 255 MB corpus (trec07p.tgz). The HTML data is in data/; the labels ("spam" or "ham") are in full/.

Index the documents with Elasticsearch, but first use a library to clean the HTML into plain text. Stemming and stopword removal are optional (up to you); eliminating some punctuation might be useful.
Cleaning the data is required: by "unigram" we mean an English word, so reading/processing the data must include a filter step that removes anything that does not look like an English word or a small number. Some mistaken unigrams passing the filter are acceptable if they look like words (e.g. "artist_", "newyork", "grande"), as long as they do not overwhelm the set of valid unigrams. You can use any library/script/package for cleaning, or share your cleaning code (but only the cleaning code) with the other students.
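
The filter step described above could be sketched as follows. The exact limits are assumptions made for illustration, not part of the assignment: a "word" is taken to be 1 to 20 ASCII letters and a "small number" 1 to 4 digits, and `keep_unigrams` is a hypothetical helper name.

```python
import re

# Assumed definitions (not given in the assignment):
# a word is 1-20 ASCII letters, a small number is 1-4 digits.
WORD_RE = re.compile(r"[a-z]{1,20}")
SMALL_NUMBER_RE = re.compile(r"\d{1,4}")


def keep_unigrams(text):
    """Lowercase the text, split on non-alphanumerics, keep word-like tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [
        tok for tok in tokens
        if WORD_RE.fullmatch(tok) or SMALL_NUMBER_RE.fullmatch(tok)
    ]


sample = "FREE!!! v1agra 100% off, call 18005551234 newyork artist_"
print(keep_unigrams(sample))
```

Tokens that mix letters and digits ("v1agra") and long digit runs are dropped, while word-like mistakes such as "newyork" pass, which the assignment explicitly tolerates.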
Make sure each document has a field “label” with values “yes” or “no” (or "spam"/"ham").
Partition the spam data set into TRAIN (80%) and TEST (20%). One easy way to do so is to add to each document in ES a field "split" whose value is randomly set to "train" or "test", following the 80%-20% rule. There will thus be two feature matrices, one for training and one for testing (different documents, exactly the same columns/features). The spam/ham distribution is roughly one third ham and two thirds spam; you should have a similar distribution in both the TRAIN and TEST sets.
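
One way to keep the spam/ham ratio similar in both sets is to split each class separately. The sketch below is an illustration under assumptions: `stratified_split` is a hypothetical helper, the labels are toy data, and the fixed seed is only there to make the example repeatable.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Map each doc id to 'train' or 'test', splitting every label group 80/20.

    labels: dict of doc_id -> label (for example "spam" or "ham").
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for doc_id, label in labels.items():
        by_label[label].append(doc_id)

    split = {}
    for doc_ids in by_label.values():
        doc_ids.sort()          # fixed starting order so the seed is meaningful
        rng.shuffle(doc_ids)
        cut = int(len(doc_ids) * train_fraction)
        for pos, doc_id in enumerate(doc_ids):
            split[doc_id] = "train" if pos < cut else "test"
    return split


# Toy data: 100 ham and 200 spam documents.
toy_labels = {f"inmail.{n}": ("ham" if n % 3 == 0 else "spam") for n in range(1, 301)}
toy_split = stratified_split(toy_labels)
train_ham = sum(1 for d, s in toy_split.items() if s == "train" and toy_labels[d] == "ham")
train_total = sum(1 for s in toy_split.values() if s == "train")
print(train_ham, train_total)
```

The resulting value can be stored as the "split" field of each document when it is indexed.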

In [4]:
import os
import re
import time
import sys
import copy
import email as e
import random 
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch 
from sklearn.naive_bayes import GaussianNB
from sklearn import tree

In [5]:
## list of all email names

location = r"C:\Users\mm199\DM and IR\IR-hw\HW7_data\trec07p\data"
list_of_email = os.listdir(location)

In [6]:
## Elasticsearch instance
es = Elasticsearch()

In [8]:
## Parse the email data
def get_html_content(html):
    soup = BeautifulSoup(html, "lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text


def check_content_type(part, text_content):
    if part.get_content_type() == "text/html":
        html = part.get_payload()
        text = get_html_content(html)
        text_content += text
    elif part.get_content_type() == "text/plain":
        text = part.get_payload()
        text_content += text
    else:
        pass
    return text_content

import string

# Assumption: `punctuations` is never defined in this notebook;
# string.punctuation is used here as a stand-in.
punctuations = set(string.punctuation)

email_content_dict = {}
t1 = time.time()
for i, email in enumerate(list_of_email, start=1):
    # build the path from `location`: the original non-raw "C:\Users..." literal is a
    # syntax error in Python 3 and had no separator before the file name
    filename = os.path.join(location, email)
    with open(filename, "rt", encoding="utf8", errors="ignore") as file:
        txt = e.message_from_string(file.read())
        # start with the subject line, if there is one
        if "Subject" in txt:
            text_content = str(txt["Subject"]) + " "
        else:
            text_content = ""
        if txt.is_multipart():
            for each_part in txt.get_payload():
                text_content = check_content_type(each_part, text_content)
        else:
            text_content = check_content_type(txt, text_content)
        # drop runs of underscores and equals signs, then keep only word tokens
        clean_text = re.sub(r"[_=]+", "", text_content)
        words_list = [m.group() for m in re.finditer(r"\w+", clean_text) if m.group() not in punctuations]
        clean_text = " ".join(words_list)
        email_content_dict[email] = clean_text
        if i % 1000 == 0:
            t2 = time.time()
            print ("At email: ", i, "time taken: ", t2 - t1)
            t1 = t2
        
   
            

At email:  1000 time taken:  21.968018054962158
At email:  2000 time taken:  29.30673360824585
At email:  3000 time taken:  26.245988130569458
At email:  4000 time taken:  25.899571657180786
At email:  5000 time taken:  25.4673068523407
At email:  6000 time taken:  28.58362603187561
At email:  7000 time taken:  28.211841106414795
At email:  8000 time taken:  29.224868059158325
At email:  9000 time taken:  28.640576601028442
At email:  10000 time taken:  24.476439237594604
At email:  11000 time taken:  22.500931978225708
At email:  12000 time taken:  19.837464570999146
At email:  13000 time taken:  20.590888500213623
At email:  14000 time taken:  20.617992877960205
At email:  15000 time taken:  20.86680793762207
At email:  16000 time taken:  20.409116983413696
At email:  17000 time taken:  20.139161825180054
At email:  18000 time taken:  19.491543292999268
At email:  19000 time taken:  21.534316062927246
At email:  20000 time taken:  20.420719385147095
At email:  21000 time taken:  21.5
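
A note on the parsing step: `get_payload()` without arguments returns the body as it appears in the file, so parts sent as base64 or quoted-printable stay encoded. The sketch below shows one possible way to decode them with the standard `email` package; `decoded_text` is a hypothetical helper and the sample message is made up for illustration.

```python
import base64
import email


def decoded_text(part):
    """Return the body of a non-multipart MIME part as a str."""
    # decode=True undoes base64 / quoted-printable and returns bytes
    raw = part.get_payload(decode=True)
    if raw is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="ignore")
    except LookupError:
        # the header named a charset Python does not know
        return raw.decode("utf-8", errors="ignore")


# Made-up message whose body is base64-encoded.
encoded_body = base64.b64encode(b"Free cash now").decode("ascii")
sample_message = (
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded_body + "\n"
)
msg = email.message_from_string(sample_message)
print(msg.get_payload().strip())   # still base64 text
print(decoded_text(msg))           # readable body
```

Whether this changes the keyword counts noticeably depends on how many messages in the corpus use a transfer encoding, which has not been measured here.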

In [9]:
## put the data in ES and also get a email to integer mapping
def put_data_in_ES(email_content_dict):
    i = 1
    docid_to_es_id_mapping = {}
    for key in email_content_dict:
        es_doc = {"doc_id" : key, "content":email_content_dict[key]}
        es.index(index = "classifier", doc_type = "emails", id = i, body = es_doc)
        if i % 100 == 0:
            print ("At email: ", i)
        docid_to_es_id_mapping[key] = i
        i += 1
    return docid_to_es_id_mapping
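
Indexing one document per request is slow for roughly 75,000 emails. A possible alternative, sketched here under assumptions, is the `helpers.bulk` function of the Python client. The `labels` and `split` dictionaries are hypothetical inputs (doc id to label, doc id to "train"/"test") that would also supply the "label" and "split" fields the assignment asks for; older Elasticsearch versions that still use document types would additionally need a `_type` key in each action.

```python
def build_bulk_actions(email_content_dict, labels, split, index_name="classifier"):
    """Yield one bulk action per email, numbering documents from 1."""
    for es_id, (doc_id, content) in enumerate(email_content_dict.items(), start=1):
        yield {
            "_index": index_name,
            "_id": es_id,
            "_source": {
                "doc_id": doc_id,
                "content": content,
                "label": labels.get(doc_id),   # hypothetical input dict
                "split": split.get(doc_id),    # hypothetical input dict
            },
        }


def index_in_bulk(es, actions):
    """Send the actions to Elasticsearch in batches."""
    # imported lazily so build_bulk_actions can be used without the client installed
    from elasticsearch import helpers
    return helpers.bulk(es, actions)


# Toy data to show the shape of the generated actions.
toy_actions = list(build_bulk_actions(
    {"inmail.1": "free cash", "inmail.2": "meeting notes"},
    {"inmail.1": "spam", "inmail.2": "ham"},
    {"inmail.1": "train", "inmail.2": "test"},
))
print(toy_actions[0])
```

With a running cluster, `index_in_bulk(es, build_bulk_actions(...))` would replace the per-document loop.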


In [10]:
## call the function to put the data in ES
docid_to_es_id_mapping = put_data_in_ES(email_content_dict)

At email:  100
At email:  200
At email:  300
At email:  400
At email:  500
At email:  600
At email:  700
At email:  800
At email:  900
At email:  1000
At email:  1100
At email:  1200
At email:  1300
At email:  1400
At email:  1500
At email:  1600
At email:  1700
At email:  1800
At email:  1900
At email:  2000
At email:  2100
At email:  2200
At email:  2300
At email:  2400
At email:  2500
At email:  2600
At email:  2700
At email:  2800
At email:  2900
At email:  3000
At email:  3100
At email:  3200
At email:  3300
At email:  3400
At email:  3500
At email:  3600
At email:  3700
At email:  3800
At email:  3900
At email:  4000
At email:  4100
At email:  4200
At email:  4300
At email:  4400
At email:  4500
At email:  4600
At email:  4700
At email:  4800
At email:  4900
At email:  5000
At email:  5100
At email:  5200
At email:  5300
At email:  5400
At email:  5500
At email:  5600
At email:  5700
At email:  5800
At email:  5900
At email:  6000
At email:  6100
At email:  6200
At email:  6300
...

At email:  71900
At email:  72000
At email:  72100
At email:  72200
At email:  72300
At email:  72400
At email:  72500
At email:  72600
At email:  72700
At email:  72800
At email:  72900
At email:  73000
At email:  73100
At email:  73200
At email:  73300
At email:  73400
At email:  73500
At email:  73600
At email:  73700
At email:  73800
At email:  73900
At email:  74000
At email:  74100
At email:  74200
At email:  74300
At email:  74400
At email:  74500
At email:  74600
At email:  74700
At email:  74800
At email:  74900
At email:  75000
At email:  75100
At email:  75200
At email:  75300
At email:  75400


In [12]:
## mapping of doc id to integer
docid_to_es_id_mapping

{'inmail.1': 1,
 'inmail.10': 2,
 'inmail.100': 3,
 'inmail.1000': 4,
 'inmail.10000': 5,
 'inmail.10001': 6,
 'inmail.10002': 7,
 'inmail.10003': 8,
 'inmail.10004': 9,
 'inmail.10005': 10,
 'inmail.10006': 11,
 'inmail.10007': 12,
 'inmail.10008': 13,
 'inmail.10009': 14,
 'inmail.1001': 15,
 'inmail.10010': 16,
 'inmail.10011': 17,
 'inmail.10012': 18,
 'inmail.10013': 19,
 'inmail.10014': 20,
 'inmail.10015': 21,
 'inmail.10016': 22,
 'inmail.10017': 23,
 'inmail.10018': 24,
 'inmail.10019': 25,
 'inmail.1002': 26,
 'inmail.10020': 27,
 'inmail.10021': 28,
 'inmail.10022': 29,
 'inmail.10023': 30,
 'inmail.10024': 31,
 'inmail.10025': 32,
 'inmail.10026': 33,
 'inmail.10027': 34,
 'inmail.10028': 35,
 'inmail.10029': 36,
 'inmail.1003': 37,
 'inmail.10030': 38,
 'inmail.10031': 39,
 'inmail.10032': 40,
 'inmail.10033': 41,
 'inmail.10034': 42,
 'inmail.10035': 43,
 'inmail.10036': 44,
 'inmail.10037': 45,
 'inmail.10038': 46,
 'inmail.10039': 47,
 'inmail.1004': 48,
 'inmail.10040'

In [None]:
## get labels for all emails
def label_collector():
    location = r"C:\Users\mm199\IR-hw\HW7_data\trec07p\full"
    filename = os.path.join(location, "index")
    dict_label = {}
    with open(filename) as file:
        for line in file:
            # the first field is the label, the last path component is the email name
            element = line.strip("\n").split("/")
            label = element[0].split(" ")[0]
            # encode spam as 1 and everything else (ham) as 0
            if label.lower() == "spam":
                label = 1
            else:
                label = 0
            dict_label[element[-1]] = label
    return dict_label
df_y_label = label_collector()
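
The assignment states that the corpus is roughly one third ham and two thirds spam. A quick check of that ratio could look like the sketch below; `label_distribution` is a hypothetical helper and the dictionary is toy data in the same 1/0 encoding as `df_y_label`.

```python
from collections import Counter


def label_distribution(label_by_doc):
    """Return the fraction of documents carrying each label."""
    counts = Counter(label_by_doc.values())
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


toy_label_by_doc = {"inmail.1": 1, "inmail.2": 0, "inmail.3": 1}
toy_distribution = label_distribution(toy_label_by_doc)
print(toy_distribution)
```

Running the same function on the training and testing labels separately would show whether both sets keep a similar ratio.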
        

In [13]:
## create a data frame with columns as key words and index as email ids (email name)
keyword_list = "free spam click buy insurance claim clearance shopper percent order earn cash extra money double collect credit check affordable fast price loans profit refinance hidden freedom chance miracle lose home remove success virus malware ad subscribe sales performance viagra valium medicine diagnostics million join deal unsolicited trial prize now legal bonus limited instant luxury legal celebrity only compare win viagra dollar discount click here meet singles incredible deal lose weight cialis sex medication love act now 100 percent c free fast cash million dollars lower interest rate visit our website no credit check"
column = list(set(keyword_list.split(" ")))
index = list(email_content_dict.keys())
df = pd.DataFrame(0, columns = column, index = index)


In [14]:
## populate dataframe with values from Elastic Search
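for doc_no in index:
    doc_id = docid_to_es_id_mapping[doc_no]
    dict_all = es.termvectors(index = "classifier", doc_type = "emails", id = doc_id, term_statistics = True, fields = ["content"])
    # a document with empty content has no term vectors; fall back to an empty dict
    # so that counts from the previous document are never reused
    terms = dict_all.get("term_vectors", {}).get("content", {}).get("terms", {})
    for word in column:
        # term frequency of the keyword in this document, 0 if absent
        tfwd = terms.get(word, {}).get("term_freq", 0)
        # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
        df.at[doc_no, word] = tfwd
    if doc_id % 1000 == 0 :
        print ("done",doc_id)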


done 1000
done 2000
done 3000
done 4000
done 5000
done 6000
done 7000
done 8000
done 9000
done 10000
done 11000
done 12000
done 13000
done 14000
done 15000
done 16000
done 17000
done 18000
done 19000
done 20000
done 21000
done 22000
done 23000
done 24000
done 25000
done 26000
done 27000
done 28000
done 29000
done 30000
done 31000
done 32000
done 33000
done 34000
done 35000
done 36000
done 37000
done 38000
done 39000
done 40000
done 41000
done 42000
done 43000
done 44000
done 45000
done 46000
done 47000
done 48000
done 49000
done 50000
done 51000
done 52000
done 53000
done 54000
done 55000
done 56000
done 57000
done 58000
done 59000
done 60000
done 61000
done 62000
done 63000
done 64000
done 65000
done 66000
done 67000
done 68000
done 69000
done 70000
done 71000
done 72000
done 73000
done 74000
done 75000
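
The loop above makes one `termvectors` request per email. As a cross-check, or as a faster alternative, the same keyword counts can be approximated locally from the cleaned text. This is a swapped-in technique, not what the notebook does: `keyword_counts` is a hypothetical helper, and its simple lowercase tokenization is only assumed to be close to the analyzer Elasticsearch applied.

```python
import re
from collections import Counter


def keyword_counts(text, keywords):
    """Count how often each keyword occurs as a whole token in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for tokens that never occurred
    return {word: counts[word] for word in keywords}


toy_text = "Free cash now click here for FREE money"
toy_counts = keyword_counts(toy_text, ["free", "cash", "viagra"])
print(toy_counts)
```

Applied to every value of `email_content_dict`, the resulting dictionaries could fill the rows of the feature data frame without any round trips to the server.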


In [56]:
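## randomly sample 80% of data for training
index_training = random.sample(index, k = int(len(index) * 0.80))
## use the remaining 20% for testing; keep it a list so the row order is fixed
## and matches the labels (recent pandas versions reject a set as a .loc indexer)
training_set = set(index_training)
index_testing = [doc for doc in index if doc not in training_set]
## get train and test dataframe
training_df = df.loc[index_training]
testing_df = df.loc[index_testing]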
## get right label for randomly sampled train and test data


In [None]:
## function to get actual labels -->spam (1) or no spam(0) after splitting the data 
def get_actual_label(index_list):
    true_result = []
    for i in index_list:
        true_result.append(df_y_label[i])
    return true_result

training_result = get_actual_label(index_training)
testing_result = get_actual_label(index_testing)

In [57]:
## Run Decision Tree
dec_tree = tree.DecisionTreeClassifier()
fit_data = dec_tree.fit(training_df, training_result)
test_result = fit_data.predict(testing_df)
score = accuracy_score(y_true = testing_result, y_pred = test_result)
print (round((score * 100),3), "%")

85.19 %
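
Because about two thirds of the corpus is spam, a classifier that always answers "spam" would already reach roughly 67% accuracy, so precision and recall are worth reporting next to the accuracy. The sketch below computes them by hand on made-up predictions; `precision_recall` is a hypothetical helper, and `sklearn.metrics.precision_score` and `recall_score` provide the same quantities.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 4 spam, 2 ham; one spam missed, one ham flagged.
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0]
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)
print(toy_precision, toy_recall)
```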


In [105]:
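## Logistic regression to get coefficients and then the top 50 spam emails
log_reg = linear_model.LogisticRegression()
fit_data = log_reg.fit(training_df, training_result)
coeff = (fit_data.coef_)
## score each test email as the dot product of its keyword counts with the learned
## coefficients; the intercept is left out because it does not change the ranking
test_set = testing_df.dot(np.array(fit_data.coef_).T)
sorted_df = test_set.sort_values(by = [0], ascending=False)
## walk down the ranking and keep the first 50 emails whose true label is spam
top_ranked_docs = []
for i in sorted_df.index:
    if df_y_label[i] == 1:
        top_ranked_docs.append(i)
    if len(top_ranked_docs) == 50:
        break
top_50_spam_df = df.loc[top_ranked_docs]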
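
The ranking above uses only the dot product of features and coefficients. The small sketch below, with made-up weights and documents, illustrates why that is enough: the logistic function is monotonic, so sorting by the raw score gives the same order as sorting by the predicted spam probability, with or without the intercept.

```python
import math


def linear_score(weights, features, intercept=0.0):
    """Weighted sum of the features plus an optional intercept."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up coefficients and keyword counts for three documents.
toy_weights = [0.8, -0.5, 1.2]
toy_intercept = -0.3
toy_docs = {"a": [2, 0, 1], "b": [0, 3, 0], "c": [1, 1, 0]}

by_score = sorted(toy_docs, key=lambda d: linear_score(toy_weights, toy_docs[d]), reverse=True)
by_prob = sorted(
    toy_docs,
    key=lambda d: sigmoid(linear_score(toy_weights, toy_docs[d], toy_intercept)),
    reverse=True,
)
print(by_score, by_prob)
```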

In [106]:
## get accuracy of top 50 docs
## (these 50 emails were selected as true spam, so this is the share of them predicted as spam)
log_reg = linear_model.LogisticRegression()
fit_data = log_reg.fit(training_df, training_result)

test_result = fit_data.predict(top_50_spam_df)
top_50_test_result = get_actual_label(top_ranked_docs)
score = accuracy_score(y_true = top_50_test_result, y_pred = test_result)
print (round((score * 100),3), "%")


100.0 %


In [108]:
top_50_spam_df

Unnamed: 0,buy,miracle,virus,refinance,performance,prize,credit,million,diagnostics,profit,...,success,love,sales,home,clearance,sex,c,lower,our,extra
inmail.49843,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,30,0
inmail.49260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,24,0
inmail.69829,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,5,0
inmail.47807,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
inmail.59511,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
inmail.52242,0,0,0,0,0,0,0,0,0,0,...,0,2,0,2,0,0,0,0,38,0
inmail.71393,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
inmail.12803,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
inmail.15242,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
inmail.13935,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


