# Spam Classifier

The objective of this project is to build a simple email spam classifier using machine learning techniques. The data used in this notebook comes from the [Enron-Spam corpus](https://www2.aueb.gr/users/ion/data/enron-spam/) and consists of 33,716 preprocessed emails. I've implemented Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) vectors to represent the data and feed the ML models.

## Get the data

First, let's fetch the data from the dataset's website. Let's create a simple function that allows us to download the data:

In [2]:
import os
import tarfile
import urllib.request
import re

DOWNLOAD_ROOT = "http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/"
DATA_PATH = os.path.join(os.curdir, "data")

def fetch_spam_data(data_path=DATA_PATH):
    if not os.path.isdir(data_path):
        os.makedirs(data_path)
    
    for i in range(1,7):
        filename = f"enron{i}.tar.gz"
        print(f"Current filename: {filename}")
        
        url = DOWNLOAD_ROOT + filename
        path = os.path.join(data_path,filename)
        
        urllib.request.urlretrieve(url, path)
        # The archives are gzip-compressed tarballs; the context manager closes the file
        with tarfile.open(path) as tar_file:
            tar_file.extractall(path=data_path)

In [3]:
fetch_spam_data()

Current filename: enron1.tar.gz
Current filename: enron2.tar.gz
Current filename: enron3.tar.gz
Current filename: enron4.tar.gz
Current filename: enron5.tar.gz
Current filename: enron6.tar.gz
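
Re-running the cell above downloads all six archives again. A minimal sketch of how one might check which archives are still missing before downloading; `missing_archives` is a hypothetical helper, not part of the notebook, and it assumes the archives are kept in `data_path` under their original names:

```python
import os

def missing_archives(data_path, n_parts=6):
    """Return the enron{i}.tar.gz filenames that are not yet present in data_path.

    Hypothetical helper for illustration; the notebook always downloads everything.
    """
    expected = [f"enron{i}.tar.gz" for i in range(1, n_parts + 1)]
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_path, name))]
```

The download loop could then iterate over `missing_archives(DATA_PATH)` instead of the full range.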


Now let's load the data

In [4]:
HAM_DIRS = [os.path.join(DATA_PATH, f"enron{i}", "ham") for i in range(1,7)]
SPAM_DIRS = [os.path.join(DATA_PATH, f"enron{i}", "spam") for i in range(1,7)]

ham_filenames=list()
spam_filenames=list()
for ham_dir in HAM_DIRS:
    for filename in sorted(os.listdir(ham_dir)):
        ham_filenames.append(os.path.join(ham_dir, filename))

for spam_dir in SPAM_DIRS:
    for filename in sorted(os.listdir(spam_dir)):
        spam_filenames.append(os.path.join(spam_dir, filename))

In [5]:
print(f"Number of ham emails: {len(ham_filenames)}\nNumber of spam emails: {len(spam_filenames)}")

Number of ham emails: 16545
Number of spam emails: 17171


In [6]:
sample_ham_file = ham_filenames[2]

In [7]:
with open(sample_ham_file,'r') as f:
    print(f.read())

Subject: calpine daily gas nomination
- calpine daily gas nomination 1 . doc


In [8]:
import numpy as np
from sklearn.model_selection import train_test_split

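ham_emails=list()
spam_emails=list()
# Read each file into the list for its class. Files that cannot be decoded
# with the default encoding are skipped.
for filenames, emails in ((ham_filenames, ham_emails), (spam_filenames, spam_emails)):
    for file in filenames:
        try:
            with open(file, 'r') as f:
                emails.append(f.read())
        except UnicodeDecodeError:
            pass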

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
len(X_train)

26960

In [10]:
X_train[:2]

array(['Subject: localized software , all languages available .\nhello , we would like to offer localized software versions ( german , french , spanish , uk , and many others ) .\naii iisted software is availabie for immediate downioad !\nno need to wait 2 - 3 week for cd deiivery !\njust few exampies :\n- norton lnternet security pro 2005 - $ 29 . 95\n- windows xp professionai with sp 2 fuil version - $ 59 . 95\n- corei draw graphics suite 12 - $ 49 . 95\n- dreamweaver mx 2004 ( homesite 5 . 5 inciudinq ) - $ 39 . 95\n- macromedia studio mx 2004 - $ 119 . 95\njust browse our site and find any software you need in your native lanquaqe !\nbest regards ,\nmae\n',
       'Subject: industry forum # 136\nthe industry forum\nminute man ii\n160 lbs . light - requires no electricity - under $ 6000 complete ! now everybody can be a foamer !\nsmall , one time project ? froth - pak is the answer ! smallest self - contained out - of - box foam application for repairs and small jobs ! also availabl

## Data Preprocessing

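As you might have seen, we've split the dataset into a training set and a test set containing 26,960 and 6,741 emails respectively. The total is slightly below 33,716 because emails that raise a `UnicodeDecodeError` are skipped while loading.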

In any text mining problem, text cleaning is the first step: we remove the parts of the document that are unlikely to contribute to the information we want to extract. Emails contain many such elements, like punctuation marks, stop words and digits, which may not help in detecting spam.

That's why we will implement the following preprocessing steps:
- **Remove stop words**: Stop words like “and”, “the” and “of” are very common in English sentences and carry little information about whether an email is spam or legitimate, so we remove them from the emails.
- **Lemmatization**: The process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes” and “included” would all be represented as “include”. Unlike stemming, which chops off word endings without regard to meaning, lemmatization maps each word to a valid dictionary form.
- **Replace numbers and URLs** with the placeholder tokens `NUMBER` and `URL`.
- **Transform all the text to lowercase**.
- **Remove non-word characters**.
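
Before the full transformers, here is a minimal standard-library sketch of the steps above applied to one string. It is only an approximation: `simple_clean` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a toy one (the notebook uses NLTK's English list), the URL pattern is a simple regex rather than `urlextract`, and lemmatization is left out.

```python
import re

# Toy stop-word list for illustration only; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "and", "of", "to", "a", "is", "at"}

def simple_clean(text, stop_words=TOY_STOP_WORDS):
    text = text.lower()                                     # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)  # replace URLs
    text = re.sub(r"\d+(?:\.\d+)?", " NUMBER ", text)       # replace numbers
    text = re.sub(r"[^\w\s]", " ", text)                    # drop non-word characters
    # Split on whitespace and drop stop words
    return [word for word in text.split() if word.lower() not in stop_words]

print(simple_clean("Visit http://example.com and pay $ 29.95 at the store!"))
# ['visit', 'URL', 'pay', 'NUMBER', 'store']
```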

In [11]:
from sklearn.base import BaseEstimator, TransformerMixin
import urlextract # may require an Internet connection to download root domain names
    

class EmailToTextTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_numbers=True, remove_punctuation=True, replace_urls=True, to_lower=True):
        self.replace_numbers = replace_numbers
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.to_lower = to_lower
        
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        # Build the extractor once instead of once per email
        url_extractor = urlextract.URLExtract() if self.replace_urls else None
        for email in X:
            email = re.sub(r'[_-]+', ' ', email)
            email = re.sub(r'(?i)subject:', '', email)
            if self.to_lower:
                email = email.lower()
            if self.replace_urls:
                urls = list(set(url_extractor.find_urls(email)))
                # Replace longer URLs first so that shorter URLs contained in them
                # do not break the longer match
                urls.sort(key=len, reverse=True)
                for url in urls:
                    email = email.replace(url, " URL ")
            if self.replace_numbers:
                email = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', email)
            if self.remove_punctuation:
                email = re.sub(r'\W+', ' ', email)
            X_transformed.append(email)
        return X_transformed

### Word Tokenizer

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.tokenize import word_tokenize

class WordTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for text_email in X:
            # Tokenize the text into words
            words = word_tokenize(text_email)
            X_transformed.append(words)
            
        return X_transformed

### Data Cleaning

In [13]:
from sklearn.base import BaseEstimator, TransformerMixin
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the stop word list and the Punkt tokenizer models if not already present
nltk.download('stopwords')
nltk.download('punkt')

# Download WordNet, a lexical database of English words and their semantic relations,
# together with the Open Multilingual WordNet data
nltk.download('omw-1.4')
nltk.download('wordnet')

class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, remove_stopwords=True, lemmatization=True):
        self.remove_stopwords = remove_stopwords
        self.lemmatization = lemmatization
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Build the stop word set and the lemmatizer once, not once per email
        stop_words = set(stopwords.words('english')) if self.remove_stopwords else set()
        lemmatizer = WordNetLemmatizer() if self.lemmatization else None
        X_transformed = []
        for word_list in X:
            if self.remove_stopwords:
                word_list = [word for word in word_list if word.lower() not in stop_words]
            if self.lemmatization:
                word_list = [lemmatizer.lemmatize(word) for word in word_list]
            X_transformed.append(word_list)
        return X_transformed

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Let's try these transformers in a pipeline on a few emails:

In [14]:
from sklearn.pipeline import Pipeline

preprocessing_pipeline = Pipeline([
    ('email_to_text', EmailToTextTransformer()),
    ('word_tokenizer', WordTokenizer()),
    ('data_cleaner', DataCleaner()),
])
X_few = X_train[:3]
X_few_wordlist = preprocessing_pipeline.fit_transform(X_few)
X_few_wordlist[0][:10]

['localized',
 'software',
 'language',
 'available',
 'hello',
 'would',
 'like',
 'offer',
 'localized',
 'software']

## Feature Representation

### Word Counter

In [15]:
from sklearn.base import BaseEstimator, TransformerMixin
from collections import Counter
import numpy as np


class WordCounter(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for word_list in X:
            word_counts = Counter(word_list)
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [16]:
X_few_wordcounts = WordCounter().fit_transform(X_few_wordlist)
X_few_wordcounts[0].most_common(5)

[('NUMBER', 19),
 ('software', 4),
 ('localized', 2),
 ('version', 2),
 ('need', 2)]

### Bag of Words

Now that we have the word counts, we need to convert them to vectors. First, we will use a **Bag of Words (BoW) vectorizer**. It converts a piece of text, such as an email or message, into a numerical feature vector based on how often each word occurs in that text. The idea behind BoW is that the presence and frequency of specific words can help determine whether a text is spam or not.
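
As a rough illustration of the idea, the sketch below builds Bag of Words vectors for two toy token lists using only the standard library. `bag_of_words` is a hypothetical helper: unlike the vectorizer defined next, it keeps every word and has no out-of-vocabulary column.

```python
from collections import Counter

def bag_of_words(token_lists):
    # Vocabulary: every distinct word, in alphabetical order
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # One count per vocabulary word; absent words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words([["free", "software", "free"], ["meeting", "software"]])
print(vocab)    # ['free', 'meeting', 'software']
print(vectors)  # [[2, 0, 1], [0, 1, 1]]
```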

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.sparse import csr_matrix


class BagsOfWordsVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    
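    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                # Cap each email's contribution so a single long email cannot dominate the vocabulary
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        # Indices start at 1; column 0 is reserved for out-of-vocabulary words
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self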
    
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [18]:
vocab_transformer = BagsOfWordsVectorizer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<3x11 sparse matrix of type '<class 'numpy.intc'>'
	with 15 stored elements in Compressed Sparse Row format>

In [19]:
X_few_vectors.toarray()

array([[ 60,  19,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [563,  69,  12,  11,  10,   9,   7,   7,   7,   5,   5],
       [ 30,   4,   0,   0,   0,   0,   0,   0,   0,   0,   0]],
      dtype=int32)

What does this matrix mean? Well, the 563 in the second row, first column, means that the second email contains 563 word occurrences that are not part of the vocabulary. The 69 next to it means that the first word in the vocabulary is present 69 times in this email. The 12 next to it means that the second word is present 12 times, and so on. You can look at the vocabulary to know which words we are talking about. The first word is "NUMBER", the placeholder that replaced every number, the second word is "industry", etc.

In [20]:
vocab_transformer.vocabulary_

{'NUMBER': 1,
 'industry': 2,
 'forum': 3,
 'foam': 4,
 'cpi': 5,
 'com': 6,
 'free': 7,
 'cpillc': 8,
 'test': 9,
 'u': 10}
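
To read a row of the matrix by word rather than by column index, one can invert the vocabulary. A small sketch using the values printed above; `label_row` and the `<out-of-vocabulary>` label are hypothetical names chosen for illustration:

```python
vocabulary = {'NUMBER': 1, 'industry': 2, 'forum': 3, 'foam': 4, 'cpi': 5,
              'com': 6, 'free': 7, 'cpillc': 8, 'test': 9, 'u': 10}
second_row = [563, 69, 12, 11, 10, 9, 7, 7, 7, 5, 5]  # second email, copied from the output above

def label_row(row, vocabulary, oov_label="<out-of-vocabulary>"):
    # Map column index -> word; column 0 has no word because it collects unknown words
    index_to_word = {index: word for word, index in vocabulary.items()}
    return {index_to_word.get(col, oov_label): count for col, count in enumerate(row)}

labeled = label_row(second_row, vocabulary)
print(labeled["<out-of-vocabulary>"], labeled["NUMBER"], labeled["industry"])  # 563 69 12
```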

### TF-IDF

**TF-IDF (Term Frequency-Inverse Document Frequency)** vectors are another commonly used technique for text representation, particularly in information retrieval and text mining tasks, including spam classification. TF-IDF takes into account not only the frequency of occurrence of words but also their importance in the context of the entire corpus.
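
To make the weighting concrete, here is a standard-library sketch of one common TF-IDF variant: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. To my understanding this matches scikit-learn's default `TfidfVectorizer` settings, but treat that as an assumption; the `tfidf` function itself is a hypothetical helper and is not used in the notebook.

```python
import math
from collections import Counter

def tfidf(token_lists):
    n_docs = len(token_lists)
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    # Smoothed IDF: rarer words get a larger weight
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocabulary]
        # L2-normalize so document length does not dominate
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        vectors.append([x / norm for x in weights])
    return vocabulary, vectors

tfidf_vocab, tfidf_vectors = tfidf([["free", "software", "free"], ["meeting", "software"]])
```

In the second toy document, "meeting" and "software" each occur once, yet "meeting" receives the larger weight because it appears in only one of the two documents.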

## Full Pipeline

Once the transformers are ready we can apply these transformations to our dataset with pipelines and then train different models.

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

preprocessing_param_grid = {
    "preprocessing__data_cleaner__remove_stopwords": [True, False],
    "preprocessing__data_cleaner__lemmatization": [True, False],
}

bow_full_vectorizer = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("word_counter", WordCounter()),
    ("bags_of_words", BagsOfWordsVectorizer()),
])

tfidf_vectorizer = TfidfVectorizer()

X_train_transformed_bow = bow_full_vectorizer.fit_transform(X_train)
X_train_transformed_tfidf = tfidf_vectorizer.fit_transform(X_train)

We are now ready to train our first spam classifier! Let's run a grid search over the logistic regression hyperparameters on both representations and keep the better model:

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import joblib

param_grid = {'C': [0.1, 0.5, 1, 10], 'solver': ['lbfgs', 'liblinear']}

lr_grid_search_bow = GridSearchCV(LogisticRegression(max_iter=5000), param_grid=param_grid, cv=5, verbose=5, n_jobs=-1)
lr_grid_search_bow.fit(X_train_transformed_bow, y_train)

lr_bow_best_score = lr_grid_search_bow.best_score_
print(f"Best bow score: {lr_bow_best_score}")

lr_grid_search_tfidf = GridSearchCV(LogisticRegression(max_iter=5000), param_grid=param_grid, cv=5, verbose=5, n_jobs=-1)
lr_grid_search_tfidf.fit(X_train_transformed_tfidf, y_train)

lr_tfidf_best_score = lr_grid_search_tfidf.best_score_
print(f"Best tfidf score: {lr_tfidf_best_score}")

if lr_bow_best_score > lr_tfidf_best_score:
    lr_best_model = lr_grid_search_bow.best_estimator_
    # Save the models to a file
    joblib.dump(lr_best_model, 'bow_grid_search_model.joblib')
else:
    lr_best_model = lr_grid_search_tfidf.best_estimator_
    # Save the models to a file
    joblib.dump(lr_best_model, 'tfidf_grid_search_model.joblib')

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best bow score: 0.9810459940652819
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best tfidf score: 0.9897255192878338
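
The log line "Fitting 5 folds for each of 8 candidates, totalling 40 fits" follows directly from the grid: every combination of `C` and `solver` is one candidate, and each candidate is fitted once per fold. A short sketch that enumerates the combinations with `itertools.product` (this only counts them, it does not train anything):

```python
from itertools import product

param_grid = {'C': [0.1, 0.5, 1, 10], 'solver': ['lbfgs', 'liblinear']}
cv = 5

# One candidate per combination of parameter values
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv

print(len(candidates), total_fits)  # 8 40
```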


## Predictions & Results

The TF-IDF model achieved the higher cross-validation score, so let's transform the test set with the TF-IDF vectorizer and use the best model to make predictions:

In [24]:
X_test_transformed = tfidf_vectorizer.transform(X_test)

y_pred = lr_best_model.predict(X_test_transformed)

Let's compute the confusion matrix:

In [25]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[3260,   60],
       [   9, 3412]], dtype=int64)

In the following cell we can see that the TF-IDF representation achieves high performance on the test set.

In [26]:
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))
print("F1 score: {:.2f}%".format(100 * f1_score(y_test, y_pred)))

Precision: 98.27%
Recall: 99.74%
F1 score: 99.00%
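
These figures can be checked by hand from the confusion matrix above. A short sketch, assuming the usual scikit-learn layout for binary labels 0 (ham) and 1 (spam), where rows are true classes and columns are predicted classes:

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
confusion = [[3260, 60], [9, 3412]]
(tn, fp), (fn, tp) = confusion

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {100 * precision:.2f}%")  # Precision: 98.27%
print(f"Recall: {100 * recall:.2f}%")        # Recall: 99.74%
print(f"F1 score: {100 * f1:.2f}%")          # F1 score: 99.00%
```

So 60 legitimate emails were flagged as spam and 9 spam emails slipped through, out of 6,741 test emails.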
