# Build a Spam Classifier Using Machine Learning and Elasticsearch




Consider the trec07_spam set of documents annotated for spam, available under “data resources”. 
First read and accept the agreement at http://plg.uwaterloo.ca/~gvcormac/treccorpus07/. Then download the 255 MB corpus (trec07p.tgz). The HTML data is in data/; the labels ("spam" or "ham") are in full/.

Index the documents with Elasticsearch, but first use a library to clean the HTML into plain text. Stemming and stopword removal are optional (up to you); eliminating some punctuation might be useful. 
Cleaning the data is required: by "unigram" we mean an English word, so reading and processing the data must include a filter step that removes anything that does not look like an English word or a small number. Some mistaken unigrams passing the filter are acceptable if they look like words (e.g. "artist_", "newyork", "grande"), as long as they do not overwhelm the set of valid unigrams. You can use any library, script, or package for cleaning, or share your cleaning code (but only the cleaning code) with the other students.
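
A minimal sketch of the cleaning step, assuming the raw message body has already been decoded to a string. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages. The names `clean_text` and `_TextExtractor`, and the four-digit cutoff for a "small number", are illustrative assumptions and not part of the assignment.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# A token is kept if it is purely alphabetic or a number of at most 4 digits
# (the 4-digit cutoff is an assumption for "small number").
_UNIGRAM = re.compile(r"[a-z]+|\d{1,4}")


def clean_text(raw_html):
    """Strip markup, lowercase, and keep only word-like unigrams."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Split on anything that is not a letter or digit, then filter.
    tokens = re.split(r"[^a-z0-9]+", text)
    return " ".join(t for t in tokens if _UNIGRAM.fullmatch(t))


sample = "<html><body><p>Buy NOW!!! 50% off</p><script>var x=1;</script> abc123 2007 1234567</body></html>"
print(clean_text(sample))
```

Mixed tokens such as `abc123` and long digit runs are dropped here; a looser filter would also be acceptable under the rule above.
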
Make sure each document has a field “label” with values “yes” or “no” (or "spam"/"ham").
Partition the spam data set into TRAIN (80%) and TEST (20%). One easy way to do so is to add to each document in ES a field "split" whose value is randomly set to either "train" or "test", following the 80%-20% rule. There will thus be two feature matrices, one for training and one for testing (different documents, exactly the same columns/features). The spam/ham distribution is roughly one third ham and two thirds spam; you should have a similar distribution in both the TRAIN and TEST sets.
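
One way to keep the spam/ham ratio similar in both sets is to shuffle and cut each class separately. The sketch below does that on a toy label dictionary; `assign_splits`, the seed, and the toy data are assumptions for illustration. The resulting value per document could then be stored in the "split" field.

```python
import random


def assign_splits(labels, train_fraction=0.8, seed=0):
    """Return {doc_id: "train" | "test"}, splitting each class separately."""
    rng = random.Random(seed)
    by_label = {}
    for doc_id, label in labels.items():
        by_label.setdefault(label, []).append(doc_id)

    split = {}
    for label, doc_ids in by_label.items():
        doc_ids = sorted(doc_ids)  # fixed starting order so the seed is reproducible
        rng.shuffle(doc_ids)
        n_train = int(len(doc_ids) * train_fraction)
        for i, doc_id in enumerate(doc_ids):
            split[doc_id] = "train" if i < n_train else "test"
    return split


# Toy data: 20 spam and 10 ham documents (two thirds / one third).
labels = {f"inmail.{i}": ("spam" if i % 3 else "ham") for i in range(1, 31)}
split = assign_splits(labels)
```

With this toy input, 16 spam and 8 ham documents land in "train" and the remaining 4 spam and 2 ham in "test", so both sets keep the two-to-one ratio.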

In [3]:
import os
import re
import time
import sys
import copy
import email as e
import random 
import numpy as np
import pandas as pd
import nltk
import string
import pickle

from sklearn import linear_model
from sklearn.metrics import accuracy_score
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch 
from liblinear import *


In [4]:
## list of all email names
location = r"C:\Users\mm199\IR-hw\HW7_data\trec07p\data"
list_of_email = os.listdir(location)

In [5]:
## Elasticsearch instance
es = Elasticsearch()

In [7]:
## load email and content from pickle saved from first part
with open("email_content", "rb") as file:
    email_content_dict = pickle.load(file)

In [52]:
email_content_dict

{'inmail.1': 'from sun apr smtp id apr apr generic branded apr microsoft outlook express normal ro do feel pressure perform rising anxiety thing past back old ',
 'inmail.10': 'from sun apr esmtp id apr esmtp id apr spamassassin esmtp id apr esmtp id apr apr list r mailing primary o i r find sensitivity specificity following diagnostic a particular diagnostic test multiple sclerosis conducted ms patients healthy ms patients classified healthy healthy subjects classified suffering i find number ms patients required sensitivity is simple i completely new help jochen view message sent r help mailing list archive mailing list please read posting guide reproducible ',
 'inmail.100': 'from sun apr friend esmtp id apr she wants better all apr normal microsoft outlook express produced by microsoft mimeole o this message mime html public html ',
 'inmail.1000': 'from mon apr esmtp id apr windows nt smtp id apr larry king live apr could immortality become o this message mime if mime compliant ma

In [11]:
## map email name to an integer
dict_index_values = list(email_content_dict.keys())
docid_to_es_id_mapping = {dict_index_values[i]:i+1 for i in range(len(dict_index_values))}

In [17]:
## liblinear format = {docid: {term1: termfreq1, term2: termfreq2, ...}}
liblinear_dict = {}
t1 = time.time()

for doc_id in docid_to_es_id_mapping:
    _id = docid_to_es_id_mapping[doc_id]   # sequential number, used only for progress reporting
    ## fetch the term vector of the "body" field for this document
    dict_all = es.termvectors(index = "spam_test2", doc_type = "document", id = doc_id, term_statistics = True, fields = ["body"])
    try:
        words_dictionary = dict_all["term_vectors"]["body"]["terms"]
        save_for_doc = {}
        for word in words_dictionary:
            ## keep terms that start with a letter (re.match only anchors at the start)
            if re.match("[a-zA-Z]+",word):
                save_for_doc[word] = words_dictionary[word]["term_freq"]
        liblinear_dict[doc_id] = save_for_doc
        if _id % 1000 == 0:
            print("Time taken to complete: ",doc_id,"-->", time.time()-t1, "secs")
            t1 = time.time()
    except KeyError:
        ## skip documents whose response has no term vector for "body"
        continue
    

Time taken to complete:  inmail.1114 --> 4.597630977630615 secs
Time taken to complete:  inmail.12248 --> 3.8342018127441406 secs
Time taken to complete:  inmail.13362 --> 4.154789447784424 secs
Time taken to complete:  inmail.14499 --> 4.193894863128662 secs
Time taken to complete:  inmail.15624 --> 4.014488458633423 secs
Time taken to complete:  inmail.16755 --> 3.6645243167877197 secs
Time taken to complete:  inmail.1788 --> 3.794894218444824 secs
Time taken to complete:  inmail.19005 --> 3.9887447357177734 secs
Time taken to complete:  inmail.20121 --> 3.4716434478759766 secs
Time taken to complete:  inmail.21221 --> 3.5013256072998047 secs
Time taken to complete:  inmail.22353 --> 3.526991844177246 secs
Time taken to complete:  inmail.23473 --> 3.885139226913452 secs
Time taken to complete:  inmail.24598 --> 3.0518481731414795 secs
Time taken to complete:  inmail.25741 --> 2.8676905632019043 secs
Time taken to complete:  inmail.26854 --> 3.5618553161621094 secs
Time taken to compl
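
The loop above depends on the nested shape of the term vectors response. The sketch below runs the same kind of extraction on a hand-written dictionary that mimics that shape, so it can be tried without a running cluster. The sample terms and counts are made up, and `term_freqs` is a hypothetical helper. Unlike the loop, it uses `re.fullmatch`, so a term must be alphabetic throughout.

```python
import re

# Hand-written stand-in for one termvectors response; values are illustrative.
sample_response = {
    "_id": "inmail.1",
    "found": True,
    "term_vectors": {
        "body": {
            "terms": {
                "apr": {"term_freq": 4},
                "anxieti": {"term_freq": 1},
                "2007": {"term_freq": 2},
            }
        }
    },
}


def term_freqs(response, field="body"):
    """Return {term: term_freq} for alphabetic terms, or {} if the field is missing."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {
        term: stats["term_freq"]
        for term, stats in terms.items()
        if re.fullmatch(r"[a-zA-Z]+", term)
    }


print(term_freqs(sample_response))
```

Using `.get` with empty defaults handles a missing field by returning an empty dictionary instead of raising.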

In [47]:
liblinear_dict

{'inmail.1': {'anxieti': 1,
  'apr': 4,
  'back': 1,
  'brand': 1,
  'express': 1,
  'feel': 1,
  'gener': 1,
  'id': 1,
  'microsoft': 1,
  'normal': 1,
  'old': 1,
  'outlook': 1,
  'past': 1,
  'perform': 1,
  'pressur': 1,
  'rise': 1,
  'ro': 1,
  'smtp': 1,
  'sun': 1,
  'thing': 1},
 'inmail.10': {'apr': 6,
  'archiv': 1,
  'classifi': 2,
  'complet': 1,
  'conduct': 1,
  'diagnost': 2,
  'esmtp': 4,
  'find': 2,
  'follow': 1,
  'guid': 1,
  'healthi': 3,
  'help': 2,
  'id': 4,
  'jochen': 1,
  'list': 3,
  'mail': 3,
  'messag': 1,
  'multipl': 1,
  'new': 1,
  'number': 1,
  'o': 1,
  'particular': 1,
  'patient': 3,
  'pleas': 1,
  'post': 1,
  'primari': 1,
  'r': 3,
  'read': 1,
  'reproduc': 1,
  'requir': 1,
  'sclerosi': 1,
  'sensit': 2,
  'simpl': 1,
  'spamassassin': 1,
  'specif': 1,
  'subject': 1,
  'suffer': 1,
  'sun': 1,
  'test': 1,
  'view': 1},
 'inmail.100': {'apr': 3,
  'better': 1,
  'esmtp': 1,
  'express': 1,
  'friend': 1,
  'html': 2,
  'id': 1,
  'm

In [18]:
## get labels for all emails
def label_collector():
    location = r"C:\Users\mm199\IR-hw\HW7_data\trec07p\full"
    filename = os.path.join(location, "index")
    dict_label = {}
    with open(filename) as file:
        ## each line looks like "spam ../data/inmail.1"
        for line in file:
            element = line.strip("\n").split("/")
            label = element[0].split(" ")[0]
            ## encode spam as 1 and ham as 0, keyed by the email file name
            dict_label[element[-1]] = 1 if label.lower() == "spam" else 0
    return dict_label

In [None]:
## call the label function
df_y_label = label_collector()       


In [21]:
## map each word to an integer; first collect all words from the liblinear dict values
words_list_for_all = []
for key in liblinear_dict:
    for words in liblinear_dict[key]:
        words_list_for_all.append(words)

## feature ids start at 1
flatted = list(set(words_list_for_all))
flatted_dict = {word : index+1 for index,word in enumerate(flatted)}
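
Because the iteration order of a set of strings can differ between interpreter sessions, the word-to-index mapping above is not guaranteed to be the same from run to run. A small sketch of a reproducible variant that sorts the vocabulary first; `build_vocabulary` and the toy input are hypothetical.

```python
def build_vocabulary(doc_term_freqs):
    """Map every distinct term to a 1-based feature index, in sorted order."""
    terms = sorted({term for freqs in doc_term_freqs.values() for term in freqs})
    return {term: i for i, term in enumerate(terms, start=1)}


toy = {
    "inmail.1": {"apr": 4, "back": 1},
    "inmail.10": {"apr": 6, "archiv": 1},
}
vocab = build_vocabulary(toy)
print(vocab)
```

A stable mapping matters if the feature matrices or the trained model are saved and reused in a later session.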

In [22]:
## split the data into 80% training and 20% test
index = list(liblinear_dict.keys())
index_training = random.sample(index, k = int(len(index) * 0.80))
## everything not sampled for training goes to the test set
index_testing = set(index) - set(index_training)

In [23]:
## get labels list in split format
def separate_file(index, df_y_label):
    partial_dict = {}
    for i in index:
        partial_dict[i] = df_y_label[i]
    return partial_dict
training_liblinear = separate_file(index_training, df_y_label)
testing_liblinear = separate_file(index_testing, df_y_label)

In [24]:
## get liblinear dict in split format
def lib_linear_input_func(index_list, flatted_dict, liblinear_dict):
    liblinear_split = {}
    for key in index_list:
        doc_id = int(key.split(".")[-1])   # "inmail.42" -> 42
        ## re-key this document's term frequencies by integer feature id
        temp_dict = {}
        for word in liblinear_dict[key]:
            temp_dict[flatted_dict[word]] = liblinear_dict[key][word]
        liblinear_split[doc_id] = temp_dict
    ## one {feature_id: term_freq} row per document, in the order of index_list
    liblinear_input_set = list(liblinear_split.values())
    return liblinear_input_set
liblinear_training_input_set = lib_linear_input_func(index_list = index_training, flatted_dict = flatted_dict, liblinear_dict = liblinear_dict)
liblinear_testing_input_set = lib_linear_input_func(index_list = index_testing, flatted_dict = flatted_dict, liblinear_dict = liblinear_dict)
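
Each row built above is a dictionary from feature index to term frequency. To inspect the data, or to feed it to command-line tools, the same rows can be written in the LIBSVM sparse text format (`label index:value ...`, with indices ascending). A small sketch with made-up rows; `to_libsvm_line` is a hypothetical helper.

```python
def to_libsvm_line(label, row):
    """Format one example as '<label> <index>:<value> ...' with ascending indices."""
    features = " ".join(f"{idx}:{val}" for idx, val in sorted(row.items()))
    return f"{label} {features}".rstrip()


# Toy rows: feature index -> term frequency.
rows = [{3: 1, 1: 4}, {2: 1, 1: 6}]
labels = [1, 0]
lines = [to_libsvm_line(y, x) for y, x in zip(labels, rows)]
print("\n".join(lines))
```

Only non-zero features are written, which keeps the file small when the vocabulary is large and each email uses a small part of it.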

In [26]:
## input of labels to liblinear function 
input_label_training = list(training_liblinear.values())
input_label_testing = list(testing_liblinear.values())

In [27]:
## train and predict with liblinear
prob = problem(input_label_training, liblinear_training_input_set)
param = parameter('-s 0')   # solver 0: L2-regularized logistic regression
m = train(prob, param) # m is a ctype pointer to a model
## p_labels: predicted labels, p_acc: accuracy statistics, p_vals: decision values
p_labels, p_acc, p_vals = predict(input_label_testing, liblinear_testing_input_set, m)

Accuracy = 99.6271% (12023/12068) (classification)
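
Accuracy alone does not show how the errors divide between the two classes, which matters when about two thirds of the corpus is spam. The sketch below derives precision and recall for the spam class from lists of true and predicted labels. The toy labels are invented for illustration and `spam_metrics` is a hypothetical helper; the predictions are written as floats on the assumption that the classifier returns them that way, and `1.0 == 1` holds in Python either way.

```python
def spam_metrics(y_true, y_pred, positive=1):
    """Accuracy, plus precision and recall for the `positive` (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy example: 4 spam and 2 ham, with one error in each direction.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```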


In [50]:
## p_labels holds predicted classes (1 = spam, 0 = ham), not probabilities,
## so this sort only places the predicted-spam documents first
probability_dict = {i: p_labels[index] for index,i in enumerate(list(testing_liblinear.keys()))}
sorted_top_all = sorted(probability_dict.items(), key = lambda x:x[1], reverse = True)
## keep the first 50 documents whose true label is spam
sorted_top_50 = [i for i in sorted_top_all if df_y_label[i[0]] == 1][:50]
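
The cell above sorts on `p_labels`, which holds predicted classes. Ranking documents by confidence needs a real-valued score per document instead. A minimal sketch of that ranking step, using made-up scores in place of classifier output; `top_k_spam` and the sample values are hypothetical.

```python
def top_k_spam(doc_ids, scores, k=50):
    """Return the k (doc_id, score) pairs with the highest spam score."""
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up spam scores for four documents.
doc_ids = ["inmail.3", "inmail.7", "inmail.9", "inmail.12"]
scores = [0.91, 0.15, 0.99, 0.60]
top_2 = top_k_spam(doc_ids, scores, k=2)
print(top_2)
```

In the notebook the scores would come from the classifier output (for example the values in `p_vals`), after checking which class each value refers to.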

In [45]:
## get the liblinear feature dictionaries and the predicted labels for these 50 documents
sorted_top_50_indices = [i[0] for i in sorted_top_50]
liblinear_top_50_testing_input_set = lib_linear_input_func(index_list = sorted_top_50_indices, flatted_dict = flatted_dict, liblinear_dict = liblinear_dict)
sorted_top_50_result = [i[1] for i in sorted_top_50]


In [46]:
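## run predict on these 50 documents; the labels passed in are the earlier predictions,
## so the reported accuracy checks consistency with them rather than with the ground truth
top_50_p_labels, top_50_p_acc, top_50_p_vals = predict(sorted_top_50_result, liblinear_top_50_testing_input_set, m)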

Accuracy = 100% (50/50) (classification)
