# SMS Spam Filter #

## Introduction ##

In this notebook I will create an SMS spam filter using the Naive Bayes algorithm.

## Import Data ##

For this exercise I will use the UCI SMS Spam Collection Data Set. The dataset is available [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
After downloading the file we can load up the .csv file using pandas and take a look at the first few rows.

In [150]:
import pandas as pd

# the encoding in this file is not utf-8, so we need to specify the encoding
df = pd.read_csv("spam.csv", encoding = "latin-1")

df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Two columns in this dataframe are important. The first, v1, contains the __class__ of each SMS and is what I will be teaching the model to predict. From the description of the dataset we know that there are only two __classes__: spam and ham (a common name for non-spam content).

The second, v2, contains the __text__ of the SMS. This is what the model will use to predict the __class__.

To make it easier to work with the dataframe I will drop the unnecessary columns and rename v1 and v2 to __class__ and __text__.

In [140]:
# drop every column whose name starts with "Unnamed"; listing the columns explicitly may have been faster
df.drop([c for c in df.columns.values.tolist() if c.startswith("Unnamed")], axis = 1, inplace = True)
df.rename(columns = {"v1" : "class", "v2" : "text"}, inplace = True)

df.head()

,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Data Analysis ##

Before pre-processing our data so we can feed it to a model, it is a good idea to take a look at some features of our dataframe. For this we can use the __describe__ method, and by combining it with __groupby__ we can look at the two classes (ham/spam) in more detail.

In [129]:
df.groupby("class").describe()

class,count,unique,top,freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


I can see that the dataset is rather small (5,572 messages) and that there are far more ham messages than spam. I can also see that there is some repetition in the text, but it is not too bad. With this little data a simple algorithm is a sensible choice, for example __Naive Bayes__.
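
As a quick sanity check, the counts printed by __describe__ can be turned into proportions. This is a small sketch that uses only the numbers from the table above (the variable names are my own). It shows how unbalanced the classes are and what accuracy a trivial "always ham" classifier would reach, which is a useful baseline to keep in mind when judging the model later.

```python
# Counts copied from the groupby/describe output above.
counts = {"ham": 4825, "spam": 747}
unique = {"ham": 4516, "spam": 653}

total = sum(counts.values())

# Share of each class in the whole dataset.
class_share = {label: n / total for label, n in counts.items()}

# Number of distinct texts in each class relative to its message count.
unique_share = {label: unique[label] / counts[label] for label in counts}

# Accuracy of a classifier that labels every message as the majority class.
majority_baseline = max(class_share.values())

print(f"total messages: {total}")
for label in counts:
    print(f"{label}: {class_share[label]:.1%} of data, "
          f"{unique_share[label]:.1%} distinct texts")
print(f"always-ham baseline accuracy: {majority_baseline:.1%}")
```

Any model we train should clearly beat that baseline, and accuracy alone will not tell us much about how well spam is caught.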

## Data pre-processing ##

Before I can feed the data to the algorithm, I need to clean up the text a bit. Initially we can rely on our knowledge of SMS and the examples we have seen so far. Here are 3 things we can consider:
- Punctuation is not particularly consistent in SMS, so we should get rid of it.
- As with any text, stopwords are generally unimportant to the meaning of the phrase, so we should get rid of them.
- [Stemming](https://en.wikipedia.org/wiki/Stemming) (reducing a word to its stem by stripping affixes) or [Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) (reducing a word to its dictionary form) is generally a good idea when working with text, since it lets words with similar meanings map to the same feature. However, in our case this can be unreliable, since the text contains non-standard words like "wkly" (weekly), which standard __stemming__ and __lemmatizing__ methods do not handle well. So it is best to avoid them altogether for now.

Now let's create a function that cleans our data the way we described. Thankfully __nltk__ contains a preset list of stopwords for English, and we can get punctuation symbols from python's __string__ module. We can also use __word_tokenize__ from nltk to split our sentences into separate __tokens__.

In [130]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

def clean_text(text):
    text = word_tokenize(text)
    text = [w for w in text if w not in stopwords.words("english")]
    text = [w for w in text if w not in string.punctuation]
    return text

Let's use the rows from __df.head()__ to test this out.

In [131]:
sample = df["text"].head()

sample = sample.apply(clean_text)
print(sample)

0    [Go, jurong, point, crazy.., Available, bugis,...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3    [U, dun, say, early, hor, ..., U, c, already, ...
4    [Nah, I, n't, think, goes, usf, lives, around,...
Name: text, dtype: object


We can notice two oddities here:
- __...__ is present in the text.
- After tokenizing __don't__ turned into __n't__.

Let's look at the __n't__ first. The tokenizer split __don't__ into __do__ and __n't__, and __do__ was then removed as it is a stopword. The same happens to __haven't__, __didn't__ and so on. There is no reason we can't just keep __n't__, but to make it more consistent I will replace it with __not__.

Now to the punctuation. __...__ was not removed because __string.punctuation__ is a string of single punctuation characters, and the multi-character token __...__ does not occur in it. The __word_tokenize__ function is useful, but it does have some quirks. For example:


In [132]:
test_phrase = "I'm a WHAT!? OMG.. I wont believe he said that! I won't!!!"
print(word_tokenize(test_phrase))


['I', "'m", 'a', 'WHAT', '!', '?', 'OMG..', 'I', 'wont', 'believe', 'he', 'said', 'that', '!', 'I', 'wo', "n't", '!', '!', '!']


We can see that it didn't separate __..__ from OMG. That is easy to fix if we replace __"."__ with __" "__ before tokenizing. However, there are some issues with __wont__ and __won't__. The former isn't recognised as a misspelling of __won't__, and even if it were, __won't__ is split into __wo__ and __n't__, and __wo__ is not a stopword, so our function __clean_text__ wouldn't get rid of it. Fixing these is a bit tedious, so let's leave this for future iterations of our model and for now just implement the __"."__ and __n't__ fixes.

In [133]:
def clean_text(text):
    text = text.replace(".", " ")
    text = word_tokenize(text)
    text = [w for w in text if w not in stopwords.words("english")]
    text = [w for w in text if w not in string.punctuation]
    text = [w if w.lower() != "n't" else "not" for w in text]
    return text

sample = df["text"].head()

sample = sample.apply(clean_text)
print(sample)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, I, not, think, goes, usf, lives, around,...
Name: text, dtype: object
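
One possible refinement, sketched here as an assumption rather than as part of the notebook's pipeline: the check __w not in string.punctuation__ is a substring test, which is why a token such as __...__ slipped through earlier. A helper (the name __is_punctuation__ is hypothetical) could instead test whether every character of a token is a punctuation character.

```python
import string

# `in` between two strings is a substring test, not a per-character test.
print("." in string.punctuation)    # a single dot is found
print("..." in string.punctuation)  # three dots in a row never occur in it

def is_punctuation(token):
    """Return True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Tokens similar to the ones seen in the outputs above.
tokens = ["Ok", "lar", "...", "Joking", "!", "crazy..", "n't"]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

Tokens that mix letters and punctuation, such as __crazy..__ or __n't__, are kept by this helper, so it complements the __"."__ replacement rather than replacing it.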


We could continue normalizing our text for a long time (and that tends to be most of the job when it comes to ML), but for now let's get on with making a model so we have a baseline to compare to.

## Vectorization ##

Now we need to vectorize our data. For this we will use the most common method - __bag-of-words__.

The first step is to represent each SMS text as a vector with as many dimensions as there are unique words in the whole body of text. For this we can use __sklearn__'s __CountVectorizer__. Let's test it on the rows from __df.head()__.

In [134]:
from sklearn.feature_extraction.text import CountVectorizer


sample = df.head()
bow_trans = CountVectorizer(analyzer = clean_text).fit(sample["text"])
bow = bow_trans.transform(sample["text"])
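
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what such a vectorizer produces. It is not how __CountVectorizer__ is implemented (which, among other things, returns a sparse matrix); the toy messages and the helper __to_count_vector__ are made up for illustration.

```python
from collections import Counter

# Toy, already-tokenized messages (made up for illustration).
docs = [
    ["free", "entry", "win", "free"],
    ["ok", "lar"],
    ["win", "cup", "entry"],
]

# The vocabulary is every unique token in the corpus, in a fixed order.
vocab = sorted({token for doc in docs for token in doc})

def to_count_vector(doc, vocab):
    """Map a token list to a list with one count per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

# One row per message, one column per vocabulary word.
bow_matrix = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)
for row in bow_matrix:
    print(row)
```

Each row has as many entries as there are unique words, and most of them are zero, which is why real implementations store these vectors sparsely.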

Now we can proceed to weight and normalize the vectors using [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). __TF__ stands for __Term Frequency__: the number of occurrences of each word in an SMS divided by the total number of tokens in it. __IDF__ stands for __Inverse Document Frequency__, which measures the importance of each word based on how many messages in the whole body of text contain it: the more messages a word appears in, the less weight it gets. Let's apply it to our __bag-of-words__.

In [135]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_trans = TfidfTransformer().fit(bow)
tfidf = tfidf_trans.transform(bow)
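
The textbook form of this weighting can be worked through by hand. The sketch below uses made-up toy messages and hypothetical helper names, and it implements the plain definitions (count divided by message length, times the log of total messages over messages containing the word). As far as I know, __TfidfTransformer__ by default uses a smoothed IDF and normalizes each vector, so its numbers will not match this sketch exactly.

```python
import math

# Toy, already-tokenized messages (made up for illustration).
docs = [
    ["free", "entry", "win", "free"],
    ["ok", "lar"],
    ["win", "cup", "entry"],
]

def term_frequency(word, doc):
    """Occurrences of `word` in `doc` divided by the number of tokens in `doc`."""
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    """log(N / df), where df of the N documents contain `word`.

    Assumes `word` occurs in at least one document (otherwise df is zero).
    """
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    """Textbook TF-IDF weight of `word` within `doc`."""
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

# "free" appears only in the first message; "win" appears in two of the three.
print(round(tf_idf("free", docs[0], docs), 3))
print(round(tf_idf("win", docs[0], docs), 3))
```

The word that is specific to one message gets a noticeably larger weight than the word shared across messages, which is the effect we want from the weighting.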

At this point we can fit our Naive Bayes classifier to the data.

In [136]:
from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB().fit(tfidf, sample["class"].tolist())
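
For readers who want to see what a multinomial Naive Bayes classifier does under the hood, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not scikit-learn's implementation: it works on raw token counts rather than TF-IDF weights, the toy messages are made up, and the function names __train_nb__ and __predict_nb__ are hypothetical. The smoothing parameter __alpha__ plays the same role as the __alpha__ argument of __MultinomialNB__.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and additively smoothed log likelihoods from token lists."""
    vocab = {token for doc in docs for token in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # P(class): share of training messages with that label.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}

    # P(word | class) with add-alpha smoothing so unseen pairs never get probability 0.
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; out-of-vocabulary words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (made up for illustration).
train_docs = [
    ["free", "entry", "win"],
    ["win", "cash", "free"],
    ["ok", "see", "you", "later"],
    ["call", "you", "later"],
]
train_labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["free", "cash"], log_prior, log_likelihood))
print(predict_nb(["see", "you"], log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer messages.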

## Training the model ##

Before training the model we should first split our data into a training set and a testing set. scikit-learn has a built-in function for this. Here is an example of how we get a train/test split where the test set is 30% of the total data set.

In [147]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['text'],df['class'],test_size=0.3)
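
Two things are worth keeping in mind with a random split on unbalanced data: the split changes on every run unless the random seed is fixed, and the share of spam in the test set can drift from the share in the full data. __train_test_split__ has __random_state__ and __stratify__ arguments for these purposes. The sketch below shows the underlying idea in plain Python; the function __stratified_split__ and the toy data are made up for illustration and are not what scikit-learn does internally.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_size=0.3, seed=0):
    """Split items into train/test so each class keeps roughly the same share in both."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test += [(item, label) for item in group[:n_test]]
        train += [(item, label) for item in group[n_test:]]
    return train, test

# Toy data with a ham/spam imbalance (made up for illustration).
labels = ["ham"] * 80 + ["spam"] * 20
items = [f"message {i}" for i in range(len(labels))]

train, test = stratified_split(items, labels, test_size=0.3)
print(len(train), len(test))
print(sum(1 for _, label in test if label == "spam"))
```

Because each class is split separately, the test set keeps the same 80/20 ratio as the full toy data.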

Now we can repeat the steps above on our training data.

In [99]:
bow_trans = CountVectorizer(analyzer = clean_text).fit(x_train)
bow = bow_trans.transform(x_train)

tfidf_trans = TfidfTransformer().fit(bow)
tfidf = tfidf_trans.transform(bow)

model_nb = MultinomialNB().fit(tfidf, y_train)

Sklearn has a built-in __Pipeline__ class, which chains these steps together so that building and fitting the model takes just 2 lines. So let's use that instead.


In [148]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("bow", CountVectorizer(analyzer = clean_text)), ("tfidf", TfidfTransformer()), ("classifier", MultinomialNB())])
pipeline.fit(x_train, y_train)


Pipeline(memory=None,
     steps=[('bow', CountVectorizer(analyzer=<function clean_text at 0x7f004f3e17b8>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

## Testing the model ##

Now we can test the model on our test set and see how well it did with the __classification_report__ function built into scikit-learn.

In [149]:
from sklearn.metrics import classification_report

predictions = pipeline.predict(x_test)
print(classification_report(predictions,y_test))

              precision    recall  f1-score   support

         ham       1.00      0.95      0.97      1512
        spam       0.67      1.00      0.80       160

   micro avg       0.95      0.95      0.95      1672
   macro avg       0.83      0.97      0.89      1672
weighted avg       0.97      0.95      0.96      1672



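Before reading the results, note that __classification_report__ expects the true labels as its first argument and the predictions as its second. The call above passes them the other way around, so in this report the __precision__ and __recall__ columns trade places and __support__ counts predicted labels rather than actual ones. Calling __classification_report(y_test, predictions)__ gives the conventional layout.

Let's go over the results:
- [Precision](https://en.wikipedia.org/wiki/Precision_and_recall) is the fraction of the messages predicted as a class that really belong to it. Because of the swapped arguments, __spam precision__ is the value shown in the recall column: every one of the 160 messages our model flagged as spam actually was spam.
- [Recall](https://en.wikipedia.org/wiki/Precision_and_recall) is the fraction of the messages of a class that the model found. __Spam recall__ is the value shown in the precision column: our model caught about 67 out of every 100 spam messages in the test set.
- [F1-score](https://en.wikipedia.org/wiki/F1_score) is the harmonic mean of __precision__ and __recall__, so it is unaffected by the swap.
- __Support__ is the number of instances of that class, counted here over the predictions (160 messages predicted as spam, 1512 as ham).

Looking at the metrics we can see that on this test set our model never mistakes ham for spam, but it lets roughly a third of the spam through as ham.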
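
Since __classification_report__ takes the true labels first and the predictions second, argument order matters when reading a report. The sketch below computes precision and recall by hand on made-up labels (the helper __precision_recall__ is hypothetical) and shows that swapping the two arguments makes the two numbers trade places.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for one class from two equally long label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_positive = sum(1 for p in y_pred if p == positive)
    actual_positive = sum(1 for t in y_true if t == positive)
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    return precision, recall

# Toy labels (made up): 3 real spam messages; the model catches 2 and raises no false alarms.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham"]

print(precision_recall(y_true, y_pred, "spam"))  # true labels first
print(precision_recall(y_pred, y_true, "spam"))  # swapped arguments
```

With the true labels first, precision is perfect and recall is two thirds; with the arguments swapped, the same two values appear in the opposite positions.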

If we wanted to improve our model we would want to return to the data pre-processing step and tweak the data there. Ideally we would also acquire more data to feed the model as well, which should increase the accuracy. But for now we are done with this exercise.

Thank you for your time!