
In [None]:
import os
import tarfile
import urllib.request
%pip install -q -U urlextract
%pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        # The context manager closes the archive even if extraction fails
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
fetch_spam_data()

In [None]:
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")

In [None]:
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 10]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 10]

In [None]:
len(ham_filenames)

2500

In [None]:
len(spam_filenames)

500

In [None]:
import email
import email.parser  # imported explicitly because load_email uses email.parser.BytesParser
import email.policy

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [None]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

In [None]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()

In [None]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

Some emails are actually multipart, with images and attachments.

In [None]:
structures_counter(spam_emails)

Counter({'text/html': 183,
         'text/plain': 218,
         'multipart(text/plain, application/octet-stream)': 1,
         'multipart(text/html)': 20,
         'multipart(text/plain, text/html)': 45,
         'multipart(text/plain)': 19,
         'multipart(text/html, text/plain)': 1,
         'multipart(text/html, application/octet-stream)': 2,
         'multipart(multipart(text/html))': 5,
         'multipart(text/plain, image/jpeg)': 3,
         'multipart(multipart(text/html), application/octet-stream, image/jpeg)': 1,
         'multipart(multipart(text/plain, text/html), image/gif)': 1,
         'multipart/alternative': 1})

Most spam emails are a single 'text/plain' or 'text/html' part.

The rest are multipart messages that combine several parts, such as plain text, HTML, images and other attachments.
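To make the structure strings above more concrete, here is a minimal sketch that builds a two-part message with the standard library and lists the content type of each part. The subject, bodies and variable names are made up for illustration and do not come from the corpus.

```python
from email.message import EmailMessage

# Build a toy message with a plain-text body and an HTML alternative
# (illustrative content, not taken from the SpamAssassin corpus).
msg = EmailMessage()
msg["Subject"] = "Hello"
msg.set_content("Plain text body")
msg.add_alternative("<p>HTML body</p>", subtype="html")

# walk() visits the multipart container first, then each sub-part in order.
part_types = [part.get_content_type() for part in msg.walk()]
print(part_types)
```

A message like this would be counted as 'multipart(text/plain, text/html)' by get_email_structure.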

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from bs4 import BeautifulSoup

def html_to_plain_text(html):
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    return BeautifulSoup(html, "html.parser").get_text()


In [None]:
def email_to_text(email):
    html = None
    for part in email.walk():  # visit every part of a (possibly multipart) email
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # fall back to the raw payload on encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content  # prefer plain text as soon as we find it
        else:
            html = content
    if html:
        return html_to_plain_text(html)

**Preparing the text data.**

Text cleaning is the first step: we remove the parts of each document that are unlikely to contribute to the information we want to extract.

1. Strip the headers of email messages

2. Convert all words to lower case

3. Remove punctuation marks

4. Replace links with the word 'URL'

5. Replace numbers with the word 'NUMBER'

6. Remove stop words

7. Remove single characters and non-alphabetic tokens, which carry no meaning in the dictionary

8. Lemmatization - Unlike stemming, which simply chops off suffixes, lemmatization uses morphological analysis to reduce each word to its dictionary form.

stemming : studies - studi

lemmatization : studies - study
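The cleaning steps listed above can be sketched with the standard library alone. This is only an illustration: the tiny stop-word set and the simple URL pattern are assumptions standing in for NLTK's stop-word list and urlextract, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "is", "at", "to", "for", "and"}

def clean_text(text):
    text = text.lower()                               # lower case
    text = re.sub(r"https?://\S+", " URL ", text)     # links -> 'URL' (simplified pattern)
    text = re.sub(r"\d+(?:\.\d*)?", "NUMBER", text)   # numbers -> 'NUMBER'
    text = re.sub(r"\W+", " ", text)                  # punctuation -> spaces
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]         # drop stop words
    tokens = [t for t in tokens if t.isalpha() and len(t) > 1]  # drop single chars
    return tokens

cleaned = clean_text("Win $1000 now at http://example.com, call 555 for a FREE prize!")
print(cleaned)
```

Lower-casing runs first so that the inserted 'URL' and 'NUMBER' placeholders stay upper case and remain easy to tell apart from ordinary words.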

In [None]:
import nltk
import urlextract
import re
from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
url_extractor = urlextract.URLExtract()

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, remove_stop_words=True, 
                 remove_single_char = True, lemmatization=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.remove_stop_words = remove_stop_words
        self.remove_single_char = remove_single_char
        self.lemmatization = lemmatization
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            
            if self.replace_urls and url_extractor is not None:# replace links with word'URL'
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            
            if self.replace_numbers:# replace digits with word'NUMBER'
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            
            if self.remove_punctuation:# remove punctuation
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            
            # Tokenize up front so later steps work even if stop-word removal is disabled
            tokens = text.split()

            if self.remove_stop_words:# remove stop words
                stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
                tokens = [w for w in tokens if w not in stop_words]

            if self.remove_single_char: # remove single characters and non-alphabetic tokens
                # Build a new list: deleting items while enumerating skips the next element
                tokens = [w for w in tokens if w.isalpha() and len(w) > 1]

            word_counts = Counter(tokens)

            if self.lemmatization and lemmatizer is not None: # lemmatization
                lemmatize_word_counts = Counter()
                for word, count in word_counts.items():
                    lemmatize_word = lemmatizer.lemmatize(word)
                    lemmatize_word_counts[lemmatize_word] += count
                word_counts = lemmatize_word_counts
            
            X_transformed.append(word_counts)
        
        return X_transformed

In [None]:
X_few = X_train[:2]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts


[Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1}),
 Counter({'interesting': 1,
          'quote': 1,
          'URL': 1,
          'thomas': 1,
          'jefferson': 2,
          'examined': 1,
          'known': 1,
          'superstition': 2,
          'word': 1,
          'find': 1,
          'particular': 1,
          'christianity': 3,
          'one': 2,
          'redeeming': 1,
          'feature': 1,
          'alike': 1,
          'founded': 1,
          'fable': 1,
          'mythology': 1,
          'million': 1,
          'innocent': 1,
          'men': 1,
          'woman': 1,
          'child': 1,
          'since': 1,
          'introduction': 1,
          'burnt': 1,
          'tortured': 1,
          'fined': 1,
          'imprisoned': 1,
          'effect': 1,
          'coercion': 1,
          'make': 1,
          'half': 2,
          'world': 1,
          'fool': 1,
          'hypocrite': 1,
          'support': 1,
          'roguery': 2,
    

Now we have the word counts, and we need to convert them to vectors. For this, we will build another transformer whose **fit()** method will build the vocabulary (an ordered list of the most common words) and whose **transform()** method will use the vocabulary to convert word counts to vectors. The output is a sparse matrix.
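The idea can be sketched on toy data with plain Python lists before using a sparse matrix. The word counts below are hypothetical, and the sketch skips details of the real transformer such as capping each email's contribution to the vocabulary totals.

```python
from collections import Counter

# Toy word counts for two hypothetical emails.
word_counts = [
    Counter({"free": 3, "prize": 1}),
    Counter({"meeting": 2, "free": 1, "agenda": 1}),
]

# "fit": keep the 2 most common words; index 0 is reserved for unknown words.
totals = Counter()
for wc in word_counts:
    totals.update(wc)
vocabulary = {word: i + 1 for i, (word, _) in enumerate(totals.most_common(2))}

# "transform": one row per email, one column per vocabulary slot.
vectors = []
for wc in word_counts:
    row = [0] * (len(vocabulary) + 1)
    for word, count in wc.items():
        row[vocabulary.get(word, 0)] += count  # unknown words accumulate in column 0
    vectors.append(row)

print(vocabulary)
print(vectors)
```

Column 0 collects the total count of words that did not make it into the vocabulary, which is why its value can be much larger than the other columns.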

In [None]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [None]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)

The 62 in the second row, first column, means that the second email contains 62 words that are not part of the vocabulary. The 3 next to it means that the first word in the vocabulary is present 3 times in this email. The 2 next to it means that the second word is present 2 times, and so on.

In [None]:
X_few_vectors.toarray()

array([[ 3,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1],
       [62,  3,  2,  2,  2,  2,  2,  2,  2,  0,  0]])

You can look at the vocabulary to see which words these columns refer to:

In [None]:
vocab_transformer.vocabulary_

{'christianity': 1,
 'jefferson': 2,
 'superstition': 3,
 'one': 4,
 'half': 5,
 'roguery': 6,
 'teaching': 7,
 'jesus': 8,
 'chuck': 9,
 'murcko': 10}

In [None]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=10, verbose=10)
score.mean()

[CV] END ................................ score: (test=0.992) total time=   0.3s
[CV] END ................................ score: (test=0.975) total time=   0.3s
[CV] END ................................ score: (test=0.983) total time=   0.3s
[CV] END ................................ score: (test=0.983) total time=   0.3s
[CV] END ................................ score: (test=0.992) total time=   0.3s
[CV] END ................................ score: (test=0.971) total time=   0.3s
[CV] END ................................ score: (test=0.971) total time=   0.3s
[CV] END ................................ score: (test=1.000) total time=   0.3s
[CV] END ................................ score: (test=0.992) total time=   0.3s
[CV] END ................................ score: (test=0.996) total time=   0.3s


0.9854166666666668

In [None]:
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import confusion_matrix 

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))
print(confusion_matrix(y_test,y_pred))

Precision: 97.80%
Recall: 93.68%
[[503   2]
 [  6  89]]


The number of false negatives (FN) is 6: the actual class is 1 (spam) but the prediction is 0 (non-spam).

The classifier let a few spam emails through as ham, which is a minor annoyance for the user.

The number of false positives (FP) is 2: the actual class is 0 (non-spam) but the prediction is 1 (spam).

Only 2 of the 505 non-spam messages in the test set were flagged as spam, so the user will rarely miss an important email.
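As a quick check, the precision and recall printed above can be recomputed by hand from the confusion matrix. The four counts are copied from that output; the variable names are just for illustration.

```python
# Values copied from the confusion matrix: rows are actual classes, columns are predictions.
tn, fp = 503, 2   # actual non-spam: predicted non-spam, predicted spam
fn, tp = 6, 89    # actual spam:     predicted non-spam, predicted spam

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print("Precision: {:.2f}%".format(100 * precision))
print("Recall: {:.2f}%".format(100 * recall))
```

Precision depends on the false positives and recall on the false negatives, so the two error types discussed above map directly onto these two scores.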