<a href="https://colab.research.google.com/github/Nuri-Tas/NLP/blob/main/Text%20Classification/Spam_Detection_with_Multinomial_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will build a spam detection method with Multinomial Naive Bayes method. We start with sklearn's own NB model and then proceed to build our own NB classifier. At the end, we compare our results with the sklearn's classifier based on different metrics such as macro and micro f1 scores.

# Imports

In [225]:
import pandas as pd
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix

from string import punctuation


# Load Data

In [226]:
# https://www.kaggle.com/uciml/sms-spam-collection-dataset
!wget https://lazyprogrammer.me/course_files/spam.csv

--2023-04-15 17:02:17--  https://lazyprogrammer.me/course_files/spam.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 172.67.213.166, 104.21.23.210, 2606:4700:3031::6815:17d2, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|172.67.213.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503663 (492K) [text/csv]
Saving to: ‘spam.csv.2’


2023-04-15 17:02:17 (18.8 MB/s) - ‘spam.csv.2’ saved [503663/503663]



# Basic EDA and Preprocessing

In [227]:
df = pd.read_csv("spam.csv", usecols=[0,1], encoding="latin-1")
df['v2'] = df['v2'].str.lower()
df.head()

Unnamed: 0,v1,v2
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."


In [228]:
# remove punctuations 
puncs = punctuation
df['v2'] = df['v2'].apply(lambda x: x.translate(str.maketrans("", "", puncs)))
df.head()

Unnamed: 0,v1,v2
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...


In [229]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(df["v2"], df["v1"], test_size=0.2)

# Build sklearn's NB Model

get_scores calculates the mean accuracy of sklearn's NB model. We additionally include `use_stopwords` parameter to compare models depending whether it ignores the stop words.

In [230]:
def get_scores(x_train, x_test, y_train, y_test, use_stopwords=False):
  if use_stopwords:
    cv = CountVectorizer(stop_words="english")
  else:
    cv = CountVectorizer()
    
  train_vectors = cv.fit_transform(x_train)
  test_vectors = cv.transform(x_test)

  model = MultinomialNB()
  model.fit(train_vectors, y_train)
  preds = model.predict(test_vectors)
  scores = model.score(test_vectors, y_test)
  return preds, scores

Apparently, removing stopwords yields in a lower accuracy.

In [231]:
pred_no_stopwords,  score_no_stopwords = get_scores(X_train, X_test, y_train, y_test)
pred_with_stopwords, score_with_stopwords = get_scores(X_train, X_test, y_train, y_test, True)

print("Score without removing stopwords: ", score_no_stopwords)
print("Score with removing stopwords: ", score_with_stopwords)

Score without removing stopwords:  0.9829596412556054
Score with removing stopwords:  0.9847533632286996


# Building Naive Bayes from Scratch

We start with creating a vocabulary dictionary for the training set. Additionally, we also include the term `<UNK>` to store the words that are in only one of the classes, or in the test set but not in the training set. In the dictionary, keys correspond to the unique index for each word, whereas the values are the words.

In [232]:
# note, we construct vocabulary for the training set. We will denote the words that are only in the test set as <UNK>
vocabulary = {word: index for index, word in enumerate(X_train.str.split().explode().unique())}

# add <UNK>
vocabulary["<UNK>"] = len(vocabulary)

In [233]:
ham_index = y_train[y_train.eq("ham")].index
spam_index = y_train[y_train.eq("spam")].index

ham_train = X_train[ham_index]
spam_train = X_train[spam_index]

Build dictionaries for ham and spam classes. 

In [234]:
# we will initialize the likelihood dictionary with 1 for each word
ham_dict = {word: 1 for index, word in enumerate(vocabulary)}

for word in ham_train.str.split().explode():
  ham_dict[word] += 1

# normalize the values
ham_dict = {k: v / sum(ham_dict.values()) for k, v in ham_dict.items()}

In [235]:
spam_dict = {word: 1 for index, word in enumerate(vocabulary)}

for word in spam_train.str.split().explode():
  spam_dict[word] += 1

spam_dict = {k: v / sum(spam_dict.values()) for k, v in spam_dict.items()}

Get prior values for ham and spam mails

In [236]:
ham_prior = len(ham_train) / len(X_train)
spam_prior = len(spam_train) / len(X_train)

print(f"The prior values for ham and spam classes are {round(ham_prior,3)} and {round(spam_prior, 3)}, respectively.")

The prior values for ham and spam classes are 0.868 and 0.132, respectively.


In [237]:
def get_likelihood(row):
  ham_likelihood = 0
  spam_likelihood = 0
  for word in row.split():
      if word in vocabulary:
        ham_likelihood += np.log2(ham_dict[word])
        spam_likelihood += np.log2(spam_dict[word])
      else:
        ham_likelihood += np.log2(ham_dict["<UNK>"])
        spam_likelihood += np.log2(spam_dict["<UNK>"])
    
  # don't forget to add one to likelihood values in case every word is encountered only once, hence resulting in 0 likelihood
  ham_posterior = (ham_likelihood + 1) +  ham_prior
  spam_posterior = (spam_likelihood + 1) + spam_prior

  if spam_posterior >= ham_posterior:
    return "spam"
  else:
    return "ham"


We achieve 96% mean accuracy, a competitive score against the sklearn's own model. However, as our dataset was unbalanced in favor of ham mails, we are going to check other metrics as well.

In [238]:
results = X_test.apply(lambda row: get_likelihood(row))
our_score = np.mean(results == y_test)
print(our_score)

0.9668161434977578


Compare macro and micro f1 scores and scores only for the spam mails. First, display the confusion matrix. Here, the rows are actual classes, whereas columns correspond to predicted classes. 

In [244]:
confused_matrix = confusion_matrix(y_test, results)
confused_matrix

array([[928,  29],
       [  8, 150]])

In [240]:
# macro f1 scores 
ours_macro = f1_score(y_test, results, average="macro")
actual_macro = f1_score(y_test, pred_no_stopwords, average="macro")
print(f"Actual macro: {actual_macro}, Ours: {ours_macro}")

Actual macro: 0.9639173940813286, Ours: 0.9353310102344887


In [241]:
# micro f1 scores
ours_micro = f1_score(y_test, results, average="micro")
actual_micro = f1_score(y_test, pred_no_stopwords, average="micro")
print(f"Actual micro: {actual_micro}, Ours: {ours_micro}")

Actual micro: 0.9829596412556054, Ours: 0.9668161434977578


In [243]:
# f1 scores for only spam mails 
ours_spam = f1_score(y_test, results, average="binary", pos_label="spam")
actual_spam = f1_score(y_test, pred_no_stopwords,  average="binary", pos_label="spam")
print(f"Actual spam-only f1 score : {actual_spam}, Our spam-only f1 score: {ours_spam}")

Actual spam-only f1 score : 0.9377049180327869, Our spam-only f1 score: 0.8902077151335311
