# Simple spam classification
**author:** Surya K

**dataset info:** The well-known Enron email dataset is used for the spam classification task. The training data contains about 6000 emails, most of them several sentences long, with spam and ham (non-spam) in a roughly 1:3 ratio.

**usage:** The notebook contains the code to train the model. There is also an HTML-embedded version stored in the repo, which can be used instead.

You can use the saved model to classify emails without retraining (this requires the spaCy model to be loaded):

![demo](spam_classifier.png)

In [1]:
import os
import numpy as np
import spacy
import joblib # save model

### load datasets into memory

In [2]:
spam_dir = 'dataset/spam/'
ham_dir = 'dataset/ham/'

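def getfilecontents(dir_path):
    '''Read every file in dir_path and return the stripped contents as a list.'''
    contents = []
    files = os.listdir(dir_path)

    for file in files:
        # os.path.join works whether or not dir_path ends with a separator
        with open(os.path.join(dir_path, file), 'r') as f:
            content = f.read()
            contents.append(content.strip())

    return contents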

ham = getfilecontents(ham_dir)
spam = getfilecontents(spam_dir)

data = ham + spam
labels = [0] * len(ham) + [1] * len(spam) # 0 - not spam, 1 - spam
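
Since the README describes a roughly 1:3 class ratio, it can be worth confirming the balance of `labels` before training. The snippet below is a minimal sketch of that check; the toy `labels` list is a stand-in (an assumption for illustration), not the real dataset.

```python
from collections import Counter

# Toy labels standing in for the notebook's `labels` list (0 = ham, 1 = spam).
labels = [0] * 9 + [1] * 3

counts = Counter(labels)
total = sum(counts.values())

# Report the size and share of each class.
for label, name in ((0, 'ham'), (1, 'spam')):
    print(f'{name}: {counts[label]} ({counts[label] / total:.1%})')

# Spam-to-ham ratio; 1/3 corresponds to the 1:3 ratio mentioned above.
spam_to_ham = counts[1] / counts[0]
```

Running the same lines on the real `labels` list (skipping the toy assignment) shows how close the loaded data is to the stated ratio.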

In [3]:
data[3], labels[3]

('Subject: re : issue\nfyi - see note below - already done .\nstella\n- - - - - - - - - - - - - - - - - - - - - - forwarded by stella l morris / hou / ect on 12 / 14 / 99 10 : 18\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\nfrom : sherlyn schumack on 12 / 14 / 99 10 : 06 am\nto : stella l morris / hou / ect @ ect\ncc : howard b camp / hou / ect @ ect\nsubject : re : issue\nstella ,\nthis has already been taken care of . you did this for me yesterday .\nthanks .\nhoward b camp\n12 / 14 / 99 09 : 10 am\nto : stella l morris / hou / ect @ ect\ncc : sherlyn schumack / hou / ect @ ect , howard b camp / hou / ect @ ect , stacey\nneuweiler / hou / ect @ ect , daren j farmer / hou / ect @ ect\nsubject : issue\nstella ,\ncan you work with stacey or daren to resolve\nhc\n- - - - - - - - - - - - - - - - - - - - - - forwarded by howard b camp / hou / ect on 12 / 14 / 99 09 : 08\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\nfrom : sherlyn schumack 12 / 13 / 99 01 : 14 pm\

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    data, labels, test_size=0.3)
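
The split above is random and unseeded, so the spam share in each part and the reported accuracy can vary between runs. One possible variation, sketched here on toy data, passes `stratify` and `random_state` to `train_test_split` to keep the class ratio identical in both parts and make the split reproducible. The toy `data` and `labels` and the seed value 42 are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the real emails: 30 ham (0) and 10 spam (1).
data = [f'email {i}' for i in range(40)]
labels = [0] * 30 + [1] * 10

# stratify=labels preserves the 3:1 ham-to-spam ratio in both splits;
# random_state fixes the shuffle so the split is the same on every run.
X_train, X_test, Y_train, Y_test = train_test_split(
    data, labels, test_size=0.3, stratify=labels, random_state=42)

print('train spam share:', sum(Y_train) / len(Y_train))
print('test spam share: ', sum(Y_test) / len(Y_test))
```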

### preprocess text data

In [5]:
nlp = spacy.load('en_core_web_sm')

In [6]:
def tokenizer(text):
    '''
    text: string to tokenize
    special character removal -> lemmatization
    '''
    doc = nlp(text)
    # preprocess during tokenizing
    tokens = [token.lemma_ for token in doc 
              if not (token.is_stop or token.is_digit or token.is_quote or token.is_space
                     or token.is_punct or token.is_bracket)]
    
    return tokens

In [7]:
# testing tokenizer
print('before tokenization:\n', data[3])
print('\n\nafter tokenization:\n', tokenizer(data[3]))

before tokenization:
 Subject: re : issue
fyi - see note below - already done .
stella
- - - - - - - - - - - - - - - - - - - - - - forwarded by stella l morris / hou / ect on 12 / 14 / 99 10 : 18
am - - - - - - - - - - - - - - - - - - - - - - - - - - -
from : sherlyn schumack on 12 / 14 / 99 10 : 06 am
to : stella l morris / hou / ect @ ect
cc : howard b camp / hou / ect @ ect
subject : re : issue
stella ,
this has already been taken care of . you did this for me yesterday .
thanks .
howard b camp
12 / 14 / 99 09 : 10 am
to : stella l morris / hou / ect @ ect
cc : sherlyn schumack / hou / ect @ ect , howard b camp / hou / ect @ ect , stacey
neuweiler / hou / ect @ ect , daren j farmer / hou / ect @ ect
subject : issue
stella ,
can you work with stacey or daren to resolve
hc
- - - - - - - - - - - - - - - - - - - - - - forwarded by howard b camp / hou / ect on 12 / 14 / 99 09 : 08
am - - - - - - - - - - - - - - - - - - - - - - - - - - -
from : sherlyn schumack 12 / 13 / 99 01 : 14 pm
to 

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenizer)

### build the pipeline

In [9]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

In [10]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vectorizer', vectorizer), ('classifier', classifier)])
pipe.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...      vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [11]:
acc = pipe.score(X_test, Y_test)
print('Model validation accuracy: ' + str(acc))

Model validation accuracy: 0.9194068343004513
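
Accuracy alone can be flattering on imbalanced data: with roughly three ham emails for every spam email, a model that always answers "not spam" would already score around 75%. A hedged sketch of a more detailed check uses `sklearn.metrics` to report the confusion matrix, precision and recall. The `y_true` and `y_pred` lists below are made-up placeholders; in the notebook they would be `Y_test` and `pipe.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f'tn={tn} fp={fp} fn={fn} tp={tp}')
print(f'precision={precision:.2f} recall={recall:.2f}')
```

For a spam filter, precision reflects how often flagged mail really is spam, while recall reflects how much spam gets caught.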


In [12]:
# save the trained pipeline to disk
save_path = 'spamclassifier.pkl'  # this is a file path, not a directory

joblib.dump(pipe, save_path)
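
The saved file can later be read back with `joblib.load`. The sketch below shows the round trip on a toy pipeline that uses the default `TfidfVectorizer` tokenizer, so it runs without spaCy; the toy texts, labels and file name are assumptions for illustration. For the real `spamclassifier.pkl`, the pickled pipeline refers to the notebook's `tokenizer` function by name, so that function (and the `nlp` object it uses) has to be defined in the loading session before calling `joblib.load`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (0 = ham, 1 = spam), for illustration only.
texts = ['win free money now', 'free lottery prize claim now',
         'meeting at noon tomorrow', 'please review the attached report']
y = [1, 1, 0, 0]

toy_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                     ('classifier', MultinomialNB())])
toy_pipe.fit(texts, y)

# Save to a temporary location and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_spamclassifier.pkl')
    joblib.dump(toy_pipe, path)
    loaded = joblib.load(path)

# The loaded pipeline should reproduce the original predictions.
samples = ['claim your free prize', 'see you at the meeting']
original_preds = toy_pipe.predict(samples).tolist()
loaded_preds = loaded.predict(samples).tolist()
print(original_preds, loaded_preds)
```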

### test model performance

In [13]:
def classify(text):
    pred = pipe.predict([text])[0]
    
    if pred:
        print('It is spam!!')
    else:
        print('Not spam')

In [14]:
classify('hello, how are you? hope you are well')

Not spam


In [15]:
classify('Buy our products at 50% offer. visit your nearest store now!')

It is spam!!


In [16]:
classify('Hey. I want to receive some guidance in a project do u think you can help me?')

Not spam


In [17]:
classify('You are one of the winners of the surprise lottery. reply your bank account numbers to get the money')

It is spam!!
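
`classify` prints only the hard decision. If a confidence value is useful, the pipeline's `predict_proba` method can be used instead, since `MultinomialNB` provides class probabilities. The following is a minimal sketch on a toy pipeline; `classify_with_confidence`, the toy training texts and the default tokenizer are assumptions for illustration, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (0 = ham, 1 = spam), for illustration only.
texts = ['win free money now', 'free lottery prize claim now',
         'meeting at noon tomorrow', 'please review the attached report']
y = [1, 1, 0, 0]

toy_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                     ('classifier', MultinomialNB())])
toy_pipe.fit(texts, y)


def classify_with_confidence(model, text):
    '''Return (label, spam_probability) for a single text (hypothetical helper).'''
    # Columns follow the sorted class labels, so column 1 is spam.
    p_spam = float(model.predict_proba([text])[0][1])
    label = 'spam' if p_spam >= 0.5 else 'ham'
    return label, p_spam


label, p_spam = classify_with_confidence(toy_pipe, 'claim your free prize')
print(label, round(p_spam, 3))
```

With the real model, the same helper could be called as `classify_with_confidence(pipe, text)`.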


The spam (positive) examples in the dataset are relatively skewed towards medicinal-drug advertisements, but the model still performs well.