## SMS Spam Detection

<p>You are hired as an AI expert in the development department of a telecommunications company. The first thing on your orientation plan is a small project that your boss has assigned you for the following given situation. Your supervisor has given away his private cell phone number on too many websites and is now complaining about daily spam SMS. Therefore, it is your job to write a spam detector in Python. </p>

<p>In doing so, you need to use a Naive Bayes classifier that can handle both bag-of-words (BoW) and tf-idf features as input. For the evaluation of your spam detector, an SMS collection is available as a dataset - this has yet to be suitably split into train and test data. To keep the costs as low as possible and to avoid problems with copyrights, your boss insists on a new development with Python.</p>

<p>Include a short description of the data preprocessing steps, method, experiment design, hyper-parameters, and evaluation metric. Also, document your findings, drawbacks, and potential improvements.</p>

<p>Note: You need to implement the bag-of-words (BoW) and tf-idf feature extractors from scratch. You can use existing Python libraries for other tasks.</p>

**Dataset and Resources**

* SMS Spam Collection Dataset: https://archive.ics.uci.edu/dataset/228/sms+spam+collection

## Solution

### Data preparation

In [16]:
from collections import defaultdict
import numpy as np
import re
import pandas as pd

In [17]:
def preprocess_text(text):
    """
    Lowercases the text, tokenises it on spaces and tabs, removes every
    character that is not a letter, digit or whitespace, and drops tokens
    that end up empty. Returns the remaining tokens joined by single spaces.
    """
    text = text.replace('\n', ' ')  # convert new line character to space
    text = text.lower()  # convert to lower case

    # tokenisation with spaces and tabs
    tokens = re.split(r'[ \t]', text)

    # remove special characters and skip tokens that become empty
    cleaned = (re.sub(r'[^a-zA-Z\d\s]', '', token) for token in tokens)
    return ' '.join(token for token in cleaned if token != '')

### Preprocessing
* Converting the text to lowercase
* Tokenizing on spaces and tabs
* Removing special characters and empty tokens
* Splitting each line into its classification label and the message text
* Converting the ham/spam labels to binary values (0/1) to make things easier
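
The steps listed above can be sketched on a single line as follows. This is only an illustration: `split_label_and_message` is a hypothetical helper, the sample line is made up, and it tokenises on any whitespace rather than reproducing the notebook's `preprocess_text` exactly.

```python
import re

def split_label_and_message(raw_line):
    """Hypothetical helper: clean one raw 'label<TAB>message' line."""
    # lowercase, then tokenise on any run of whitespace (spaces, tabs, newlines)
    tokens = raw_line.lower().split()
    # strip every character that is not a letter or digit, drop empty tokens
    tokens = [re.sub(r'[^a-z\d]', '', token) for token in tokens]
    tokens = [token for token in tokens if token]
    # the first token is the class label, the rest is the message
    label, message = tokens[0], ' '.join(tokens[1:])
    # encode ham as 0 and spam as 1
    return (0 if label == 'ham' else 1), message

# made-up sample line in the 'label<TAB>message' layout
label, message = split_label_and_message("spam\tWINNER!! Call 0800-123 now.\n")
print(label, message)  # 1 winner call 0800123 now
```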

In [18]:
with open('D:\\BS. CS & AI\\Natural Language Understanding\\Practical Course\\Assignment_2\\Assignment_2\\_sms_spam_collection\\SMSSpamCollection') as f:
    corpus = f.readlines()

In [19]:
# preprocess the corpus
corpus = [preprocess_text(document) for document in corpus]

# remove the empty documents
corpus = [document for document in corpus if document.strip() != '']

# remove classification word and create classification_labels list
classification_labels = []
for i, document in enumerate(corpus):
    # the first token is the label, the rest is the message
    label, _, message = document.partition(' ')

    # extract label from document and add it to classification_labels
    classification_labels.append(label)

    # remove label from document in the corpus
    corpus[i] = message

In [20]:
# convert ham to 0 and spam to 1 in classification list
classification_labels = [0 if label=='ham' else 1 for label in classification_labels]

In [21]:
# prepare vocabulary
vocab = set()
for document in corpus:
	for word in document.split(' '):
		vocab.add(word)

### Bag of Words implemented in Python

In [22]:
def bag_of_words(corpus, vocabulary):
    '''
        Returns one bag-of-words count vector per document in the corpus
    '''
    vectors = []

    # fix the column order and map each word to its column index,
    # so that lookups do not have to scan the whole vocabulary list
    vocabulary = list(vocabulary)
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    for document in corpus:
        # Tokenize the document into words
        words = document.lower().split()

        # Create a dictionary to store the word counts
        word_counts = defaultdict(int)

        # Count the occurrences of each word
        for word in words:
            word_counts[word] += 1

        # Create the bag-of-words vector
        vector = np.zeros(len(vocabulary))
        for word, count in word_counts.items():
            if word in word_to_index:
                vector[word_to_index[word]] = count

        vectors.append(vector)

    return vectors

In [23]:
bow_vector = bag_of_words(corpus, vocab)
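
To make the representation concrete, here is a minimal pure-Python sketch of the same idea on two made-up messages. `bow_vectors` is a hypothetical helper, and unlike the notebook it sorts the vocabulary so that the column order is reproducible.

```python
from collections import Counter

def bow_vectors(documents):
    """Hypothetical helper: count vectors over a sorted vocabulary."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        vector = [0] * len(vocabulary)
        # one column per vocabulary word, holding its count in this document
        for word, count in Counter(doc.split()).items():
            vector[index[word]] = count
        vectors.append(vector)
    return vocabulary, vectors

vocabulary, vectors = bow_vectors(["free prize free", "see you soon"])
print(vocabulary)  # ['free', 'prize', 'see', 'soon', 'you']
print(vectors)     # [[2, 1, 0, 0, 0], [0, 0, 1, 1, 1]]
```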

### Tf-idf implemented in Python

In [24]:
def tf(word, document):
    # Calculate tf.
    # `document` is a string here, so count() counts substring matches
    # and len() is the length in characters, not the number of tokens.
    return document.count(word) / len(document)

def idf(word, corpus):
    # Calculate idf (smoothed, base-10 logarithm)
    # +1 to avoid division by 0
    count_of_documents = len(corpus) + 1
    # `word in doc` is a substring test because every doc is a string
    count_of_documents_with_word = sum(1 for doc in corpus if word in doc) + 1
    return np.log10(count_of_documents / count_of_documents_with_word) + 1

def tf_idf(word, document, corpus):
    # Calculate tf-idf
    return tf(word, document) * idf(word, corpus)

In [25]:
word_to_index = {word:i for i, word in enumerate(vocab)}
num_words = len(vocab)

In [26]:
# create an empty list to store our word vectors
tf_idf_vector = []
for document in corpus:
    # for our new document create a new word vector
    new_word_vector = [0 for i in range(num_words)]
    
    # now we loop through each word in our document and compute the tf-idf score and populate our vector with it,
    # we only care about words in this document because words outside of it will remain zero
    for word in document.split():
        # get the score
        tf_idf_score = tf_idf(word, document, corpus)
        # next get the index for this word in our word vector
        word_index = word_to_index[word] 
        # populate the vector
        new_word_vector[word_index] = tf_idf_score
    
    # don't forget to add this new word vector to our list of existing tf_idf_vector
    tf_idf_vector.append(new_word_vector)
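
For comparison, here is a hedged sketch of a token-level variant that keeps the same smoothed idf formula but counts whole tokens instead of substrings and normalises tf by the number of tokens. `tf_idf_vectors` is a hypothetical helper and the messages are made up; this is not the code that produced the results in this notebook.

```python
import math

def tf_idf_vectors(documents):
    """Hypothetical helper: token-level tf-idf with a smoothed log10 idf."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    n_docs = len(tokenised)
    # document frequency: number of documents that contain the token
    df = {word: sum(1 for tokens in tokenised if word in tokens)
          for word in vocabulary}
    idf = {word: math.log10((n_docs + 1) / (df[word] + 1)) + 1
           for word in vocabulary}
    vectors = []
    for tokens in tokenised:
        vector = [0.0] * len(vocabulary)
        for word in set(tokens):
            # term frequency relative to the number of tokens in the document
            tf = tokens.count(word) / len(tokens)
            vector[index[word]] = tf * idf[word]
        vectors.append(vector)
    return vocabulary, vectors

vocabulary, vectors = tf_idf_vectors(["free prize free", "free call"])
print(vocabulary)
print(vectors)
```

With token-level matching, a short word such as "am" is no longer counted inside a longer word such as "spam".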

In [27]:
print(len(bow_vector))
print(len(bow_vector[0]))
# print(bow_vector)

5574
9578


In [28]:
print(len(tf_idf_vector))
print(len(tf_idf_vector[0]))
# print(tf_idf_vector)

5574
9578


In [29]:
len(classification_labels)

5574

In [30]:
len(vocab)

9578

### Model building

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

#### Model using Bag of words vector

In [32]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(bow_vector, classification_labels, test_size=0.2, random_state=4, stratify=classification_labels) 

In [33]:
df = pd.DataFrame(y_train)
df.value_counts()

0    3861
1     598
dtype: int64

In [34]:
df = pd.DataFrame(y_test)
df.value_counts()

0    966
1    149
dtype: int64
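
As a quick sanity check on the stratified split, the spam share in both sets can be computed from the counts printed above. The snippet below is a small sketch that only reuses those numbers; `spam_share` is a hypothetical helper.

```python
# class counts copied from the value_counts() outputs above
train = {"ham": 3861, "spam": 598}
test = {"ham": 966, "spam": 149}

def spam_share(counts):
    """Fraction of messages in the split that are spam."""
    return counts["spam"] / (counts["ham"] + counts["spam"])

print(f"{spam_share(train):.4f}")  # 0.1341
print(f"{spam_share(test):.4f}")   # 0.1336

# accuracy of a trivial baseline that always predicts ham on the test set
print(f"{test['ham'] / sum(test.values()):.4f}")  # 0.8664
```

Both splits contain about 13% spam, so any model has to beat roughly 86.6% accuracy to be better than always predicting ham.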

In [35]:
# Models
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

In [36]:
gnb.fit(X_train, y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test, y_pred1))
print(confusion_matrix(y_test, y_pred1))
print(precision_score(y_test, y_pred1))

0.9076233183856502
[[872  94]
 [  9 140]]
0.5982905982905983


The Gaussian Naive Bayes classifier reaches an accuracy of about 90.76%, which looks good at first glance. The confusion matrix tells a different story: only 9 spam messages are missed (false negatives), but 94 ham messages are flagged as spam (false positives). As a result the precision is only 59.83%, so roughly four out of ten messages flagged as spam are actually ham, which is not good for a spam filter.

In [38]:
mnb.fit(X_train, y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))
print(precision_score(y_test, y_pred2))

0.9829596412556054
[[955  11]
 [  8 141]]
0.9276315789473685


The Multinomial Naive Bayes classifier gives much better results than GNB in both accuracy (98.30%) and precision (92.76%). With only 11 false positives and 8 false negatives, it learns both classes well despite the class imbalance.

In [40]:
bnb.fit(X_train, y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(precision_score(y_test, y_pred3))

0.979372197309417
[[963   3]
 [ 20 129]]
0.9772727272727273


The Bernoulli Naive Bayes classifier also clearly outperforms GNB, with 97.94% accuracy and 97.73% precision. It produces the fewest false positives (3) of the three models, but it misses more spam (20 false negatives) than GNB (9) or MNB (8).
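
The three confusion matrices above can be compared more directly by deriving the metrics from their cells. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]` for the labels 0 and 1. Recall and F1 are not reported in this notebook and are added here only for illustration; `metrics` is a hypothetical helper.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion-matrix cells."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# (TN, FP, FN, TP) copied from the bag-of-words outputs above
results = {
    "GaussianNB": (872, 94, 9, 140),
    "MultinomialNB": (955, 11, 8, 141),
    "BernoulliNB": (963, 3, 20, 129),
}

for name, cells in results.items():
    accuracy, precision, recall, f1 = metrics(*cells)
    print(f"{name}: accuracy={accuracy:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} f1={f1:.4f}")
```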

#### Model using tf-idf vector

In [42]:
X_train, X_test, y_train, y_test = train_test_split(tf_idf_vector, classification_labels, test_size=0.2, random_state=4, stratify=classification_labels)

In [45]:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

In [46]:
gnb.fit(X_train, y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test, y_pred1))
print(confusion_matrix(y_test, y_pred1))
print(precision_score(y_test, y_pred1))

0.9103139013452914
[[876  90]
 [ 10 139]]
0.6069868995633187


Here, too, the Gaussian NB classifier performs poorly: with tf-idf vectors instead of bag of words, accuracy (91.03%) and precision (60.70%) are almost unchanged, and there are still 90 false positives.

In [48]:
mnb.fit(X_train, y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))
print(precision_score(y_test, y_pred2))

0.8663677130044843
[[966   0]
 [149   0]]
0.0



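The Multinomial NB classifier fails completely here: it predicts ham for every test message, so none of the 149 spam messages is detected. The accuracy of 86.64% is exactly the share of ham in the test set (966 of 1115), and the precision is reported as 0 because there are no positive predictions at all. Class imbalance is one likely factor. Another plausible contributor is the scale of the features: `tf` above divides by the length of the document string in characters, so the tf-idf values are very small and the class prior (about 87% ham) can outweigh the word likelihoods.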
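
The following toy example illustrates how very small feature values can let the class prior decide a multinomial Naive Bayes prediction. It is a sketch under stated assumptions, not an analysis of the actual run: `mnb_predict` is a hypothetical pure-Python implementation with Laplace smoothing (`alpha=1.0`), and the data are made up.

```python
import math

def mnb_predict(train, x, alpha=1.0):
    """Hypothetical toy multinomial Naive Bayes with Laplace smoothing.

    `train` maps a class name to its list of feature vectors.
    """
    n_docs = sum(len(vectors) for vectors in train.values())
    scores = {}
    for label, vectors in train.items():
        # summed feature values per column for this class
        totals = [sum(column) for column in zip(*vectors)]
        denominator = sum(totals) + alpha * len(totals)
        score = math.log(len(vectors) / n_docs)  # log prior
        for value, total in zip(x, totals):
            score += value * math.log((total + alpha) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

def scaled(vector, factor):
    return [value * factor for value in vector]

# made-up data: columns are counts of "free", "win", "hello"
ham = [[0, 0, 2]] * 6
spam = [[2, 2, 0]]
message = [2, 2, 0]  # same words as the spam training example

for factor in (1.0, 0.001):
    train = {"ham": [scaled(v, factor) for v in ham],
             "spam": [scaled(v, factor) for v in spam]}
    print(factor, mnb_predict(train, scaled(message, factor)))
# prints "1.0 spam" and then "0.001 ham"
```

With the original counts the message is classified as spam. After scaling every feature by 0.001, the smoothing term dominates the likelihoods and the majority class wins.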

In [50]:
bnb.fit(X_train, y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(precision_score(y_test, y_pred3))

0.979372197309417
[[963   3]
 [ 20 129]]
0.9772727272727273


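The Bernoulli NB classifier performs exactly the same on the tf-idf vectors as on bag of words. This is expected: by default, BernoulliNB binarizes its input (`binarize=0.0`), so it only sees whether a word is present, and that is the same in both representations.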
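
The identical scores are consistent with how scikit-learn's `BernoulliNB` treats its input: with the default `binarize=0.0`, every value greater than 0 becomes 1. A minimal sketch of that step follows. The `binarize` function is a hypothetical stand-in written for this illustration, and both rows are made up.

```python
def binarize(vector, threshold=0.0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [1 if value > threshold else 0 for value in vector]

bow_row = [2, 0, 1, 0]               # made-up word counts for one message
tf_idf_row = [0.08, 0.0, 0.05, 0.0]  # made-up tf-idf weights for the same message

print(binarize(bow_row))                          # [1, 0, 1, 0]
print(binarize(bow_row) == binarize(tf_idf_row))  # True
```

Because both feature types are non-zero for exactly the same words, the classifier receives the same binary matrix in both experiments.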

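Based on these results, the Bernoulli NB classifier is the most robust choice for this imbalanced dataset: it reaches about 98% accuracy and about 98% precision with both feature types. Multinomial NB on bag of words has a slightly higher accuracy (98.30%) and misses less spam, but its precision is lower (92.76%).
For Gaussian NB and Bernoulli NB it makes almost no difference whether we use bag of words or tf-idf vectors. For Multinomial NB it matters a lot: it works well on bag of words but fails on the tf-idf vectors built here.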

### Additional Experiments *(5 additional points - <span style="color: red;">Optional</span>)*