# Spam Classifier

The objective of this project is to build a spam classifier using machine learning techniques. The data used in this notebook comes from the Apache SpamAssassin public corpus and consists of around 6,000 emails. I've implemented Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) vectors to represent the data and feed the ML models. The only model I've implemented is Logistic Regression, which trains fast enough and performs well on this dataset (an F1 score of around 97% on the test set with the Bag of Words representation).

## Get the data

First, let's fetch the data:

In [1]:
import os
import tarfile
import urllib.request
import re

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"

HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
HAM2_URL = DOWNLOAD_ROOT + "20030228_easy_ham_2.tar.bz2"
HAM3_URL = DOWNLOAD_ROOT + "20030228_hard_ham.tar.bz2"
SPAM2_URL = DOWNLOAD_ROOT + "20050311_spam_2.tar.bz2"

DATA_PATH = os.path.join(os.curdir, "data")

def fetch_spam_data(urls, data_path=DATA_PATH):
    if not os.path.isdir(data_path):
        os.makedirs(data_path)
    
    for url in urls:
        filename = re.search(r"_(.*)", url).group(1)
        print(f"Current filename: {filename}")
        
        path = os.path.join(data_path,filename)
        
        urllib.request.urlretrieve(url, path)
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=data_path)
        tar_bz2_file.close()

In [2]:
urls = [HAM_URL, HAM2_URL, HAM3_URL, SPAM_URL, SPAM2_URL]
fetch_spam_data(urls)

Current filename: easy_ham.tar.bz2
Current filename: easy_ham_2.tar.bz2
Current filename: hard_ham.tar.bz2
Current filename: spam.tar.bz2
Current filename: spam_2.tar.bz2
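
The local file names printed above come from the regular expression in `fetch_spam_data`, which keeps everything after the first underscore in the URL. The short sketch below replays that step on two of the URLs; `local_name` is a hypothetical helper name used only for illustration.

```python
import re

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"

def local_name(url):
    # Keep everything after the first underscore, i.e. drop the date prefix
    return re.search(r"_(.*)", url).group(1)

names = [local_name(DOWNLOAD_ROOT + name)
         for name in ("20030228_easy_ham.tar.bz2", "20050311_spam_2.tar.bz2")]
print(names)
```

This only works because the download root itself contains no underscore; for URLs that might, taking the part after the last `/` first would be a safer starting point.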


Now let's load the data:

In [3]:
HAM_DIRS = [os.path.join(DATA_PATH, "easy_ham"),
           os.path.join(DATA_PATH, "easy_ham_2"),
           os.path.join(DATA_PATH, "hard_ham")]
SPAM_DIRS = [os.path.join(DATA_PATH, "spam"),
            os.path.join(DATA_PATH, "spam_2")]

ham_filenames=list()
spam_filenames=list()
for ham_dir in HAM_DIRS:
    for filename in sorted(os.listdir(ham_dir)):
        if len(filename) > 20:
            ham_filenames.append(os.path.join(ham_dir,filename))

for spam_dir in SPAM_DIRS:
    for filename in sorted(os.listdir(spam_dir)):
        if len(filename) > 20:
            spam_filenames.append(os.path.join(spam_dir,filename))

In [4]:
print(f"Number of ham emails: {len(ham_filenames)}\nNumber of spam emails: {len(spam_filenames)}")

Number of ham emails: 4150
Number of spam emails: 1896


In [5]:
ham_filenames[0]

'.\\data\\easy_ham\\00001.7c53336b37003a9286aba55d2945844c'

We can use Python's **email** module to parse these emails (this handles headers, encoding, and so on):

In [6]:
import email
import email.parser
import email.policy

def load_emails(filenames):
    emails = []
    for filename in filenames:
        # Open in binary mode and let the parser deal with the encoding
        with open(filename, "rb") as f:
            emails.append(email.parser.BytesParser(policy=email.policy.default).parse(f))
    return emails

In [7]:
ham_emails = load_emails(ham_filenames)
spam_emails = load_emails(spam_filenames)
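
To see what the parser returns, here is a minimal sketch that parses a hand-written message from bytes instead of a file. The message is made up for illustration and is not part of the corpus; only `BytesParser`, `email.policy.default`, and `get_content()` are taken from the cells above.

```python
import email.parser
import email.policy

# A made-up message, not taken from the corpus
raw = (b"From: sender@example.com\r\n"
       b"Subject: Lunch on Friday?\r\n"
       b"Content-Type: text/plain; charset=utf-8\r\n"
       b"\r\n"
       b"Are you free at noon?\r\n")

msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)

subject = msg["Subject"]            # headers can be read like dictionary entries
body = msg.get_content().strip()    # decoded text of the body
print(subject, "|", body)
```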

Let's look at some examples:

In [8]:
print(spam_emails[6].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40


In [9]:
print(ham_emails[1].get_content().strip())

Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


Some emails are actually multipart, with images and attachments (which can have their own attachments). Let's look at the various types of structures we have:

In [10]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()

In [11]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [12]:
structures_counter(ham_emails).most_common()[:5]

[('text/plain', 3832),
 ('text/html', 120),
 ('multipart(text/plain, application/pgp-signature)', 101),
 ('multipart(text/plain, text/html)', 63),
 ('multipart(text/plain, text/plain)', 5)]

In [13]:
structures_counter(spam_emails).most_common()[:5]

[('text/plain', 815),
 ('text/html', 772),
 ('multipart(text/plain, text/html)', 159),
 ('multipart(text/html)', 49),
 ('multipart(text/plain)', 44)]

It seems that the ham emails are more often plain text, while spam has quite a lot of HTML. Moreover, quite a few ham emails are signed using PGP, while no spam is. In short, it seems that the email structure is useful information to have.
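
As a small illustration of how these structure strings are produced, the sketch below builds a two-part message by hand and runs the same recursive logic on it. The message content is invented, and the function is repeated here only to keep the example self-contained.

```python
from email.message import EmailMessage

def get_email_structure(message):
    # Same logic as the function defined above, repeated for a standalone example
    if isinstance(message, str):
        return message
    payload = message.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join(
            get_email_structure(part) for part in payload))
    return message.get_content_type()

# A made-up message with a plain-text body and an HTML alternative
msg = EmailMessage()
msg.set_content("Plain text version")
msg.add_alternative("<p>HTML version</p>", subtype="html")

structure = get_email_structure(msg)
print(structure)
```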

Let's split the data into a training set and a test set:

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Data Preprocessing

### HTML to plain text

Okay, let's start writing the preprocessing functions. First, we will need a function to convert HTML to plain text using BeautifulSoup.

In [15]:
from bs4 import BeautifulSoup
from html import unescape
import re

def html_to_plain_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Remove script and style tags
    for script in soup(["script", "style"]):
        script.extract()
    
    # Replace line breaks with space
    text = soup.get_text(separator=' ')
    text = text.replace('\n', ' ')
    
    # Unescape HTML entities
    text = unescape(text)
    
    return re.sub(r'\s+', ' ', text.strip())

Let's see if this works:

In [16]:
html_spam_emails = [email for email in X_train[y_train==1]
                    if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:200], "....")

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>

<head>
<meta http-equiv="Content-Language" content="en-ie">
<meta name="Microsoft Theme 2.00" content="9to6 1011 011">
<meta http ....


In [17]:
print(html_to_plain_text(sample_html_spam.get_content())[:200], "....")

ink_refill_toner PRINTER INK CARTRIDGES & REFILL KITS from  4.85... BULK ORDERS or TRADE welcome... please contact us at info@9to6.ie for discounted prices guaranteed to give you huge savings of betw ....
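
If BeautifulSoup is not available, a rough equivalent can be sketched with the standard library's `html.parser`. This is an alternative shown for illustration, not the function used in the rest of the notebook: `TextExtractor` and `html_to_plain_text_stdlib` are hypothetical names, and the output may differ from BeautifulSoup's on malformed HTML.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities such as &amp; are unescaped
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def html_to_plain_text_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()

sample = ("<html><head><style>p {color: red}</style></head><body>"
          "<p>Buy &amp; save</p><script>alert(1)</script><p>now</p></body></html>")
plain = html_to_plain_text_stdlib(sample)
print(plain)
```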


Great! Now let's write a function that takes an email as input and returns its content as plain text, whatever its format is:

In [18]:
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception: # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)

In [19]:
print(email_to_text(sample_html_spam)[:100], "...")

ink_refill_toner PRINTER INK CARTRIDGES & REFILL KITS from  4.85... BULK ORDERS or TRADE welcome... ...


Now let's wrap this in a transformer so that we can use it in a data pipeline:

In [20]:
from sklearn.base import BaseEstimator, TransformerMixin
import urlextract # may require an Internet connection to download root domain names
    

class EmailToTextTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_numbers=True, remove_punctuation=True, replace_urls=True, to_lower=True):
        self.replace_numbers = replace_numbers
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.to_lower = to_lower
        
    def fit(self, X, y=None):
        return self
    
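    def transform(self, X, y=None):
        # Build the URL extractor once instead of once per email
        url_extractor = urlextract.URLExtract() if self.replace_urls else None
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            # Drop underscores: the \W+ pattern below treats them as word characters
            text = re.sub(r'_', '', text)
            if self.to_lower:
                text = text.lower()
            if self.replace_urls:
                urls = list(set(url_extractor.find_urls(text)))
                # Replace longer URLs first so that a URL contained in another is handled correctly
                urls.sort(key=len, reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")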
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            X_transformed.append(text)
        return X_transformed
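
To make the effect of the two regular expressions concrete, the sketch below applies them to a made-up sentence, in the same order as the transformer (lowercasing, number replacement, punctuation removal). It also shows where tokens such as `cNUMBER` in the pipeline output come from.

```python
import re

# A made-up sentence, lowercased as the transformer does by default
text = "Dell Latitude C400 costs 1,299.99".lower()

# Same patterns as in EmailToTextTransformer
text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
text = re.sub(r'\W+', ' ', text)
print(text)
```

Because the number pattern does not treat the comma as part of a number, `1,299.99` becomes two `NUMBER` tokens, and a model name like `c400` keeps its letter prefix.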

### Word Tokenizer

In [21]:
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.tokenize import word_tokenize

class WordTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for text_email in X:
            # Tokenize the text into words
            words = word_tokenize(text_email)
            X_transformed.append(words)
            
        return X_transformed

### Data Cleaning

In [22]:
from sklearn.base import BaseEstimator, TransformerMixin
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the stop word list and the Punkt tokenizer models (skipped if already up to date)
nltk.download('stopwords')
nltk.download('punkt')

# Download WordNet, a lexical database of English words used by the lemmatizer,
# together with the Open Multilingual WordNet
nltk.download('omw-1.4')
nltk.download('wordnet')

class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, remove_stopwords=True, stemming=True, lemmatization=True):
        self.remove_stopwords = remove_stopwords
        self.stemming = stemming
        self.lemmatization = lemmatization
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Create the helpers once rather than once per email
        stop_words = set(stopwords.words('english')) if self.remove_stopwords else set()
        lemmatizer = WordNetLemmatizer() if self.lemmatization else None
        stemmer = nltk.PorterStemmer() if self.stemming else None

        X_transformed = []
        for word_list in X:
            if self.remove_stopwords:
                word_list = [word for word in word_list if word.lower() not in stop_words]
            if self.lemmatization:
                word_list = [lemmatizer.lemmatize(word) for word in word_list]
            if self.stemming:
                word_list = [stemmer.stem(word) for word in word_list]
            X_transformed.append(word_list)
        return X_transformed

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Let's try these transformers in a Pipeline with a few emails:

In [23]:
from sklearn.pipeline import Pipeline

preprocessing_pipeline = Pipeline([
    ('email_to_text', EmailToTextTransformer()),
    ('word_tokenizer', WordTokenizer()),
    ('data_cleaner', DataCleaner(stemming=False)),
])
"""
    """
X_few = X_train[:3]
X_few_wordlist = preprocessing_pipeline.fit_transform(X_few)
X_few_wordlist

[['digital',
  'dispatch',
  'weekly',
  'newsletter',
  'cnet',
  'web',
  'apple',
  'expand',
  'imac',
  'lcd',
  'display',
  'heavy',
  'laptop',
  'gateway',
  'tout',
  'chic',
  'yet',
  'cheap',
  'pc',
  'apple',
  'ipod',
  'come',
  'linux',
  'dell',
  'pc',
  'coming',
  'mall',
  'near',
  'cnet',
  'news',
  'quintessential',
  'player',
  'NUMBER',
  'ai',
  'picture',
  'utility',
  'NUMBER',
  'NUMBER',
  'icq',
  'NUMBERa',
  'build',
  'NUMBER',
  'deck',
  'NUMBER',
  'mac',
  'dell',
  'latitude',
  'cNUMBER',
  'cNUMBER',
  'series',
  'hardware',
  'toshiba',
  'pocket',
  'pc',
  'eNUMBER',
  'electronics',
  'autocad',
  'lt',
  'NUMBER',
  'software',
  'sony',
  'ericsson',
  'tNUMBER',
  'wireless',
  'july',
  'NUMBER',
  'NUMBER',
  'janice',
  'chen',
  'editor',
  'chief',
  'cnet',
  'review',
  'dear',
  'reader',
  'crushing',
  'blow',
  'discover',
  'vindigo',
  'time',
  'favorite',
  'palm',
  'app',
  'longer',
  'free',
  'twenty',
  'five',

## Feature Representation

### Word Counter

In [24]:
from sklearn.base import BaseEstimator, TransformerMixin
from collections import Counter
import numpy as np


class WordCounter(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for word_list in X:
            word_counts = Counter(word_list)
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [25]:
X_few_wordcounts = WordCounter().fit_transform(X_few_wordlist)
X_few_wordcounts

array([Counter({'NUMBER': 32, 'free': 11, 'cnet': 10, 'player': 10, 'mpNUMBER': 8, 'web': 7, 'e': 7, 'time': 6, 'software': 5, 'review': 5, 'worldcom': 5, 'one': 5, 'tech': 5, 'pc': 4, 'editor': 4, 'mailer': 4, 'service': 4, 'top': 4, 'week': 4, 'woe': 4, 'new': 4, 'perfect': 4, 'peachtree': 4, 'army': 4, 'apple': 3, 'news': 3, 'hard': 3, 'based': 3, 'mail': 3, 'account': 3, 'see': 3, 'another': 3, 'version': 3, 'popular': 3, 'bearshare': 3, 'without': 3, 'ad': 3, 'best': 3, 'continue': 3, 'jukebox': 3, 'notebook': 3, 'test': 3, 'find': 3, 'trend': 3, 'right': 3, 'game': 3, 'come': 2, 'dell': 2, 'deck': 2, 'cNUMBER': 2, 'series': 2, 'wireless': 2, 'vindigo': 2, 'app': 2, 'dish': 2, 'dough': 2, 'u': 2, 'read': 2, 'four': 2, 'want': 2, 'stay': 2, 'buzz': 2, 'fee': 2, 'latest': 2, 'file': 2, 'offer': 2, 'paid': 2, 'search': 2, 'last': 2, 'year': 2, 'model': 2, 'nomad': 2, 'audio': 2, 'sonicblue': 2, 'rio': 2, 'take': 2, 'look': 2, 'feature': 2, 'good': 2, 'check': 2, 'price': 2, 'personal

### Bags of Words

Now that we have the word counts, we need to convert them to vectors. First, we will use a **Bag of Words (BoW) vectorizer**. It converts a piece of text, such as an email, into a numerical feature vector based on how often each vocabulary word occurs in that text. The idea behind BoW is that the presence and frequency of specific words can help determine whether a text is spam or not. The vectorizer below keeps only the `vocabulary_size` most common words and counts every other word in column 0.

In [26]:
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.sparse import csr_matrix


class BagsOfWordsVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [27]:
vocab_transformer = BagsOfWordsVectorizer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<3x11 sparse matrix of type '<class 'numpy.intc'>'
	with 19 stored elements in Compressed Sparse Row format>

In [28]:
X_few_vectors.toarray()

array([[597,  32,  10,  10,  11,   0,   7,   3,   8,   7,   6],
       [147,  37,   0,   0,   0,  10,   1,   4,   0,   0,   1],
       [153,   2,   0,   0,   0,   0,   0,   1,   0,   0,   0]],
      dtype=int32)

What does this matrix mean? The 147 in the second row, first column, means that the second email contains 147 words that are not part of the vocabulary. The 37 next to it means that the first word in the vocabulary appears 37 times in this email. The 0 next to it means that the second word appears 0 times, and so on. You can look at the vocabulary to see which words these are: the first entry is "NUMBER", the placeholder that replaced every number, the second is "cnet", and so on.

In [29]:
vocab_transformer.vocabulary_

{'NUMBER': 1,
 'cnet': 2,
 'player': 3,
 'free': 4,
 'openssl': 5,
 'e': 6,
 'version': 7,
 'mpNUMBER': 8,
 'web': 9,
 'time': 10}
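
The same bookkeeping can be followed by hand on a tiny example. The sketch below mirrors the logic of `BagsOfWordsVectorizer` using plain lists instead of a sparse matrix; the two word-count dictionaries are made up for illustration. Column 0 collects every word that did not make it into the vocabulary.

```python
from collections import Counter

# Two made-up emails, already reduced to word counts
word_counts = [
    Counter({"free": 3, "offer": 2, "hello": 1}),
    Counter({"meeting": 5, "free": 1, "agenda": 1}),
]
vocabulary_size = 2

# Fit: rank words by total count, capping each email's contribution at 10
total = Counter()
for counts in word_counts:
    for word, count in counts.items():
        total[word] += min(count, 10)
vocabulary = {word: index + 1
              for index, (word, _) in enumerate(total.most_common(vocabulary_size))}

# Transform: index 0 is reserved for out-of-vocabulary words
rows = []
for counts in word_counts:
    row = [0] * (vocabulary_size + 1)
    for word, count in counts.items():
        row[vocabulary.get(word, 0)] += count
    rows.append(row)
print(vocabulary, rows)
```

In the first row, `offer` and `hello` are outside the vocabulary, so their counts (2 and 1) add up to 3 in column 0. The sparse version relies on `csr_matrix` summing duplicate `(row, col)` entries to get the same result.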

### TF-IDF

**TF-IDF (Term Frequency-Inverse Document Frequency)** vectors are another commonly used technique for text representation, particularly in information retrieval and text mining tasks, including spam classification. TF-IDF takes into account not only how often a word occurs in an email but also how rare it is across the entire corpus: a word that appears in many emails receives a lower weight.

In [30]:
from sklearn.base import BaseEstimator, TransformerMixin
import math
import numpy as np
from collections import Counter


class TFIDFVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vocabulary_ = None
        self.idf_scores = None
    
    def fit(self, X, y=None):
        # Create vocabulary
        self.vocabulary_ = self._create_vocabulary(X)

        # Compute IDF scores
        self.idf_scores = self._compute_idf(X)
        
        return self

    def transform(self, X, y=None):
        # Compute TF scores
        tf_scores = self._compute_tf(X)
    
        tfidf_scores = []
        
        for tf in tf_scores:
            tfidf_scores.append({word: tf[word] * self.idf_scores.get(word, 0.0) for word in tf})

        # Convert the tf-idf scores to a numpy array
        tfidf_matrix = np.zeros((len(X), len(self.vocabulary_)), dtype=np.float32)

        for i, tfidf_dict in enumerate(tfidf_scores):
            for j, word in enumerate(self.vocabulary_):
                tfidf_matrix[i, j] = tfidf_dict.get(word, 0.0)

        return tfidf_matrix
        
    def _create_vocabulary(self, X):
        vocab = set()
        for word_list in X:
            vocab.update(word_list)
        return sorted(list(vocab))
    
    def _compute_tf(self, X):
        tf_scores = []
        for word_list in X:
            word_counts = Counter(word_list)
            total_words = len(word_list)
            tf_scores.append({word: word_counts[word] / total_words for word in word_list})
        return tf_scores
    
    def _compute_idf(self, X):
        idf_scores = {}
        num_documents = len(X)
        for word_list in X:
            unique_words = set(word_list)
            for word in unique_words:
                if word in idf_scores:
                    idf_scores[word] += 1
                else:
                    idf_scores[word] = 1
        idf_scores = {word: math.log(num_documents / count) for word, count in idf_scores.items()}
        return idf_scores
    
        

In [31]:
X_few_tfidf_vector = TFIDFVectorizer().fit_transform(X_few_wordlist)
X_few_tfidf_vector.shape

(3, 623)
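
The weights can be checked by hand on a tiny corpus. The sketch below uses the same definitions as `TFIDFVectorizer` above (term frequency as count divided by document length, inverse document frequency as the natural log of the number of documents over the document frequency); the two documents are made up for illustration.

```python
import math
from collections import Counter

# Two made-up tokenized documents
docs = [["free", "money", "free"], ["meeting", "money"]]
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(word for doc in docs for word in set(doc))
idf = {word: math.log(n_docs / count) for word, count in df.items()}

def tfidf(doc):
    counts = Counter(doc)
    return {word: (count / len(doc)) * idf.get(word, 0.0)
            for word, count in counts.items()}

weights = tfidf(docs[0])
print(weights)
```

Here `money` appears in every document, so its IDF is `log(2 / 2) = 0` and it contributes nothing, while `free` gets `(2 / 3) * log(2)`. Note that scikit-learn's own `TfidfVectorizer` uses a smoothed IDF by default, so its values would not match these exactly.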

## Full Pipeline

Once the data vectors are ready, we can write pipelines to train different models. In this case I've only tried Logistic Regression models because they converge quickly and perform well on this dataset.

Here I've implemented several preprocessing options: whether to remove **stop words** from the vectorized dataset and whether to apply **stemming** and **lemmatization**. I've compared these options, together with two solvers, using `GridSearchCV`:

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

preprocessing_param_grid = {
    "preprocessing__data_cleaner__remove_stopwords": [True, False],
    "preprocessing__data_cleaner__stemming": [True, False],
    "preprocessing__data_cleaner__lemmatization": [True, False],
    "model__solver": ['lbfgs', 'liblinear']
}

bow_full_pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("word_counter", WordCounter()),
    ("bags_of_words", BagsOfWordsVectorizer()),
    ("model", LogisticRegression(max_iter=5000)),
])


tfidf_full_pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("tfidf_vectorizer", TFIDFVectorizer()),
    ("model", LogisticRegression(max_iter=5000)),
])
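
As a quick sanity check on the size of the search, the sketch below enumerates the combinations in the parameter grid with `itertools.product`. It only counts candidates and does not train anything; the value `cv = 3` is the number of folds passed to the grid search in the next cell.

```python
from itertools import product

param_grid = {
    "preprocessing__data_cleaner__remove_stopwords": [True, False],
    "preprocessing__data_cleaner__stemming": [True, False],
    "preprocessing__data_cleaner__lemmatization": [True, False],
    "model__solver": ["lbfgs", "liblinear"],
}

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

cv = 3
n_candidates = len(candidates)
n_fits = n_candidates * cv
print(n_candidates, n_fits)
```

Four binary options give 16 candidates, and 3 folds each give 48 fits per pipeline, which matches the grid search log.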

We are now ready to train our first spam classifier! Let's run the grid search for both pipelines on the training set:

In [33]:
from sklearn.model_selection import GridSearchCV

bow_grid_srch = GridSearchCV(bow_full_pipeline, param_grid=preprocessing_param_grid, cv=3, verbose=5)
bow_grid_srch.fit(X_train, y_train)

tfidf_grid_srch = GridSearchCV(tfidf_full_pipeline, param_grid=preprocessing_param_grid, cv=3, verbose=5)
tfidf_grid_srch.fit(X_train, y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV 1/3] END model__solver=lbfgs, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=True;, score=0.973 total time= 4.5min
[CV 2/3] END model__solver=lbfgs, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=True;, score=0.972 total time= 4.1min
[CV 3/3] END model__solver=lbfgs, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=True;, score=0.976 total time= 3.9min
[CV 1/3] END model__solver=lbfgs, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=False;, score=0.971 total time= 4.3min
[CV 2/3] END model__solver=lbfgs, preprocessing__data_cleaner__lemmatization=True, preprocessi

[CV 1/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=False, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=False;, score=0.969 total time=10.9min
[CV 2/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=False, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=False;, score=0.972 total time=13.6min
[CV 3/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=False, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=False;, score=0.974 total time=11.6min
[CV 1/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=False, preprocessing__data_cleaner__remove_stopwords=False, preprocessing__data_cleaner__stemming=True;, score=0.976 total time=12.4min
[CV 2/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=False, preprocessing__data_cleaner__remove_stopword

[CV 3/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=True, preprocessing__data_cleaner__stemming=False;, score=0.914 total time= 4.8min
[CV 1/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=False, preprocessing__data_cleaner__stemming=True;, score=0.815 total time= 4.9min
[CV 2/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=False, preprocessing__data_cleaner__stemming=True;, score=0.813 total time= 4.9min
[CV 3/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=False, preprocessing__data_cleaner__stemming=True;, score=0.826 total time= 5.1min
[CV 1/3] END model__solver=liblinear, preprocessing__data_cleaner__lemmatization=True, preprocessing__data_cleaner__remove_stopwords=Fal

Save and load the models:

In [40]:
import joblib

# Save the models to a file
joblib.dump(bow_grid_srch, 'bow_grid_search_model.joblib')
joblib.dump(tfidf_grid_srch, 'tfidf_grid_search_model.joblib')

['tfidf_grid_search_model.joblib']

In [41]:
loaded_bow_grid_search = joblib.load('bow_grid_search_model.joblib')
loaded_tfidf_grid_search = joblib.load('tfidf_grid_search_model.joblib')

## Predictions & Results

Let's transform the test set and use the best models to make predictions:

In [42]:
# Get the best models for BoW and TF-IDF vectorization techniques:
bow_best_model = loaded_bow_grid_search.best_estimator_
tfidf_best_model = loaded_tfidf_grid_search.best_estimator_

bow_y_pred = bow_best_model.predict(X_test)
tfidf_y_pred = tfidf_best_model.predict(X_test)

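In the following cell we can see that the Bag of Words vectorization achieves high performance on the test set (an F1 score of 97.53%), while the TF-IDF vectorization reaches perfect precision but a much lower recall (80.61%).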

In [43]:
from sklearn.metrics import precision_score, recall_score, f1_score

print("Bag of Words Vectorization:")
print("Precision: {:.2f}%".format(100 * precision_score(y_test, bow_y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, bow_y_pred)))
print("F1 score: {:.2f}%".format(100 * f1_score(y_test, bow_y_pred)))
print("-"*10)
print("TF-IDF Vectorization:")
print("Precision: {:.2f}%".format(100 * precision_score(y_test, tfidf_y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, tfidf_y_pred)))
print("F1 score: {:.2f}%".format(100 * f1_score(y_test, tfidf_y_pred)))

Bag of Words Vectorization:
Precision: 96.73%
Recall: 98.34%
F1 score: 97.53%
----------
TF-IDF Vectorization:
Precision: 100.00%
Recall: 80.61%
F1 score: 89.26%
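
The F1 score is the harmonic mean of precision and recall, so the figures above can be cross-checked from the reported precision and recall alone. The sketch below does that with a small hypothetical helper, `f1_from`, using the rounded percentages printed in the output.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the output above
bow_f1 = 100 * f1_from(0.9673, 0.9834)
tfidf_f1 = 100 * f1_from(1.0000, 0.8061)
print(f"BoW F1: {bow_f1:.2f}%  TF-IDF F1: {tfidf_f1:.2f}%")
```

The harmonic mean is pulled towards the lower of the two values, which is why perfect precision does not compensate for the lower recall of the TF-IDF model.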
