# Detecting Spam Messages Using NLP & Naive Bayes

In this project, we will build a very simple SPAM detector for SMS messages.

The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research.

It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam: 4,827 legitimate messages (86.6%) and 747 spam messages (13.4%).
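The class shares quoted above follow directly from the counts. A quick arithmetic check (the variable names are only illustrative):

```python
# Counts quoted in the dataset description above.
n_ham, n_spam = 4827, 747
n_total = n_ham + n_spam

ham_share = n_ham / n_total
spam_share = n_spam / n_total

print(n_total)                # 5574
print(f"{ham_share:.1%}")     # 86.6%
print(f"{spam_share:.1%}")    # 13.4%
```

The ham share is also the accuracy that a trivial "always ham" rule would reach on the full collection, which is a useful baseline to keep in mind when the model is evaluated later.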

## Introduction To The Dataset

We will import the data, specifying a tab as the separator, and name the two columns `label` and `message`.

In [1]:
import pandas as pd

df = pd.read_table(r'C:\projectdatasets\SMSSpamCollection',  
                   sep='\t', 
                   header=None,
                   names=['label', 'message'])

We can see that each row of the data has the format [label] [tab] [message].
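To make the layout concrete, here is a minimal sketch that parses two lines in that format by hand. The sample text is copied from the preview of the data (as displayed, truncation included), and `parse_line` is a hypothetical helper, not part of pandas:

```python
# Two lines in the "[label] [tab] [message]" layout.
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! As a valued network customer you have...\n"
)

def parse_line(line):
    """Split one raw line into (label, message) at the first tab only."""
    label, message = line.split("\t", 1)
    return label, message

rows = [parse_line(line) for line in sample.splitlines()]
for label, message in rows:
    print(label, "|", message)
```

Splitting at the first tab only keeps any later tab characters inside the message text.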

In [3]:
# show the first ten rows
df.head(10)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [24]:
# there are 5572 rows and two columns
df.shape

(5572, 2)

## Pre-Processing

We need to clean the data before we apply our model.

In [2]:
# convert labels from strings to binary values for our classifier
df['label'] = df.label.map({'ham': 0, 'spam': 1})  

In [3]:
# check the label field now has 0s and 1s to indicate ham and spam
df['label'].head()

0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64

In [4]:
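# convert all characters in the message to lower case
df['message'] = df['message'].str.lower()

# remove any punctuation
# (regex=True is needed in recent pandas, where str.replace treats the pattern as a literal string by default)
df['message'] = df['message'].str.replace(r'[^\w\s]', '', regex=True)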

In [5]:
# check the message field has been cleansed
df['message'].head()

0    go until jurong point crazy available only in ...
1                              ok lar joking wif u oni
2    free entry in 2 a wkly comp to win fa cup fina...
3          u dun say so early hor u c already then say
4    nah i dont think he goes to usf he lives aroun...
Name: message, dtype: object
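The same two cleaning steps can be tried on single strings with the standard `re` module. This is a small sketch for illustration; `clean` is a hypothetical helper name, and the inputs are messages shown in the preview of the data:

```python
import re

def clean(message):
    """Lower-case the text, then drop every character that is neither
    a word character (letter, digit, underscore) nor whitespace."""
    return re.sub(r"[^\w\s]", "", message.lower())

print(clean("Ok lar... Joking wif u oni..."))   # ok lar joking wif u oni
print(clean("U dun say so early hor... U c already then say..."))
```

Note that apostrophes are removed as well, which is why "don't" appears as "dont" in the cleaned output.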

The Natural Language Toolkit (NLTK) is a suite of Python libraries and programs for natural language processing.
We will import the `nltk` package (after installing it) and use it for tokenisation.

In [6]:
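import nltk
# run once to fetch the tokenizer data used by word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
# nltk.download('punkt')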

In [7]:
# tokenize the messages into single words using nltk
df['message'] = df['message'].apply(nltk.word_tokenize)  

In [8]:
# check the message field has been tokenised
df['message'].head()

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: message, dtype: object
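On text that has already been lower-cased and stripped of punctuation, tokenisation mostly amounts to splitting on whitespace. A minimal sketch using `str.split` as a stand-in (an assumption made for illustration only; `nltk.word_tokenize` applies further rules that matter on raw text with punctuation):

```python
# Cleaned form of the second message in the dataset preview.
cleaned = "ok lar joking wif u oni"

# Split on runs of whitespace to get one token per word.
tokens = cleaned.split()
print(tokens)   # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```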

In [9]:
# perform word stemming (normalize text for all variations of words that carry the same meaning, regardless of the tense) 
# a popular stemming algorithm is the Porter Stemmer

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])  

In [10]:
# check the message field has been stemmed (e.g. 'crazy' and 'entry' have been replaced with 'crazi' and 'entri')
df['message'].head()

0    [go, until, jurong, point, crazi, avail, onli,...
1                         [ok, lar, joke, wif, u, oni]
2    [free, entri, in, 2, a, wkli, comp, to, win, f...
3    [u, dun, say, so, earli, hor, u, c, alreadi, t...
4    [nah, i, dont, think, he, goe, to, usf, he, li...
Name: message, dtype: object

In [11]:
# we will transform the data into occurrences, which will be the features that will feed into the model
from sklearn.feature_extraction.text import CountVectorizer

# convert the list of words into space-separated strings
df['message'] = df['message'].apply(lambda x: ' '.join(x))

In [12]:
# check the message field has been converted to space separated strings (like it was at the beginning)
df['message'].head()

0    go until jurong point crazi avail onli in bugi...
1                                ok lar joke wif u oni
2    free entri in 2 a wkli comp to win fa cup fina...
3          u dun say so earli hor u c alreadi then say
4    nah i dont think he goe to usf he live around ...
Name: message, dtype: object

In [13]:
count_vect = CountVectorizer()  
counts = count_vect.fit_transform(df['message'])

counts

<5572x8169 sparse matrix of type '<class 'numpy.int64'>'
	with 72500 stored elements in Compressed Sparse Row format>
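What the count matrix holds can be sketched in plain Python on three of the stemmed messages. This is an illustrative re-implementation, not scikit-learn's code: the names `TOKEN`, `vocab`, `index` and `matrix` are hypothetical, and the matrix is dense here whereas the real one is sparse. The regular expression is the default `token_pattern` documented for `CountVectorizer`, which keeps only tokens of two or more word characters:

```python
import re
from collections import Counter

docs = [
    "ok lar joke wif u oni",
    "free entri in 2 a wkli comp to win",
    "u dun say so earli hor u c alreadi then say",
]

# Default CountVectorizer token rule: runs of 2+ word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

tokenised = [TOKEN.findall(d) for d in docs]
vocab = sorted({t for toks in tokenised for t in toks})
index = {term: j for j, term in enumerate(vocab)}

# One row per message, one column per vocabulary term.
matrix = [[0] * len(vocab) for _ in docs]
for i, toks in enumerate(tokenised):
    for term, n in Counter(toks).items():
        matrix[i][index[term]] = n

print(vocab)
print(matrix[2][index["say"]])   # 2
print("u" in index)              # False
```

Single-character tokens such as "u", "c", "a" and "2" never reach the vocabulary under this rule, which matters for SMS text where such abbreviations are common.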

In [14]:
# we could keep the simple word count per message, but weighting by Term Frequency-Inverse Document Frequency usually works better
# this weighting is commonly known as tf-idf
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)

In [34]:
transformer

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
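With the settings shown in that output (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), each count is multiplied by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and every row is then scaled to unit length. A minimal pure-Python sketch of that weighting on a made-up 3x3 count matrix (the data and variable names are hypothetical):

```python
import math

# Toy term counts: rows are documents, columns are terms.
toy_counts = [
    [1, 1, 0],
    [0, 1, 2],
    [0, 1, 0],
]
n_docs = len(toy_counts)
n_terms = len(toy_counts[0])

# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in toy_counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf = []
for row in toy_counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))   # l2 norm of the row
    tfidf.append([v / norm if norm else 0.0 for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

The middle term appears in every document, so its idf is exactly 1 and it gets the lowest weight relative to its count; rarer terms are boosted.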


In [33]:
print(counts)

  (0, 7925)	0.22378642176936625
  (0, 7715)	0.18293604147358436
  (0, 7497)	0.232012730496152
  (0, 7130)	0.15808501470085967
  (0, 5635)	0.22485506312666312
  (0, 5292)	0.1588008730270491
  (0, 4273)	0.2781965206152583
  (0, 4128)	0.32930301835453774
  (0, 3872)	0.10860920003212803
  (0, 3425)	0.18328548053939198
  (0, 3388)	0.15280952404957904
  (0, 3336)	0.132266862568599
  (0, 2248)	0.255022519528138
  (0, 2029)	0.2781965206152583
  (0, 1750)	0.2781965206152583
  (0, 1748)	0.31435532599420324
  (0, 1340)	0.2504083119963028
  (0, 1146)	0.32930301835453774
  (1, 7835)	0.44483654514496557
  (1, 5289)	0.5633498837724461
  (1, 5257)	0.2825014776211812
  (1, 4308)	0.42081977871680865
  (1, 4094)	0.4773478663822099
  (2, 7883)	0.18653623125647448
  (2, 7848)	0.14242759355834578
  :	:
  (5570, 6587)	0.19054252105358732
  (5570, 5048)	0.21643786562194572
  (5570, 4396)	0.16284308112975754
  (5570, 3987)	0.11780359009346424
  (5570, 3940)	0.27149395792904457
  (5570, 3872)	0.1156240697440695

## Training The Model

Now that we have extracted features from our data, we can build our model.

We will start by splitting our data into training and test sets.

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.1, random_state=69)  
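The sizes of the two sets follow from `test_size=0.1` and the 5572 rows reported by `df.shape`. The sketch below works them out and performs a simple seeded shuffle split on row positions with the standard library. It illustrates the idea only: reusing the seed 69 here does not reproduce the exact partition scikit-learn produces.

```python
import math
import random

n_samples = 5572      # rows in df, from df.shape
test_size = 0.1

# scikit-learn rounds the test share up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)    # 5014 558

# Shuffle the row positions with a fixed seed, then cut once.
rng = random.Random(69)
indices = list(range(n_samples))
rng.shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Because the split is random and not stratified, the spam share in the test set can differ slightly from the share in the full dataset.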

In [38]:
print(X_train)

  (0, 989)	0.28052144773473153
  (0, 1072)	0.3432827956011483
  (0, 1184)	0.412788938318317
  (0, 1304)	0.1705583078355282
  (0, 1773)	0.16702897405622036
  (0, 3732)	0.26222639184071767
  (0, 3850)	0.16406777394283956
  (0, 3976)	0.1388949438216507
  (0, 5088)	0.17825595732136162
  (0, 5540)	0.21779783920283172
  (0, 5745)	0.3166941755901548
  (0, 5951)	0.3041276257048051
  (0, 6508)	0.3940516607477597
  (0, 7797)	0.1839341921120709
  (1, 211)	0.28267413293218063
  (1, 359)	0.238804249783713
  (1, 582)	0.2536777174054881
  (1, 1228)	0.2698430146557444
  (1, 1356)	0.17557364291658975
  (1, 1805)	0.10254684297717932
  (1, 1905)	0.22109496673870935
  (1, 2166)	0.21686930820947917
  (1, 2746)	0.20538500729287698
  (1, 3193)	0.12744459627924393
  (1, 3307)	0.20031089495440416
  :	:
  (5011, 8130)	0.22336964227394898
  (5012, 1132)	0.16877711316090904
  (5012, 1160)	0.23405780291790354
  (5012, 1631)	0.24796824405972512
  (5012, 1852)	0.19136883768188218
  (5012, 2368)	0.16661539279372278
 

In [39]:
print(X_test)

  (0, 895)	0.28999797162758306
  (0, 1124)	0.17503953462676497
  (0, 1260)	0.1806775784485796
  (0, 1605)	0.28999797162758306
  (0, 1773)	0.11734341495843983
  (0, 2055)	0.26025028442328957
  (0, 2109)	0.12945014502730604
  (0, 3028)	0.1882030959165165
  (0, 3930)	0.28999797162758306
  (0, 3976)	0.2927349662641272
  (0, 3985)	0.24116759238885288
  (0, 5227)	0.10770668790321371
  (0, 5380)	0.28999797162758306
  (0, 5491)	0.27683441033783157
  (0, 6049)	0.28999797162758306
  (0, 6579)	0.267494714968356
  (0, 6780)	0.14918636698714263
  (0, 7109)	0.08721601188417048
  (0, 7846)	0.12158774440143195
  (0, 7919)	0.15447508303473523
  (0, 8130)	0.10192583382482392
  (1, 1160)	0.1073266226344332
  (1, 1417)	0.2846605090910865
  (1, 1597)	0.26992717420861423
  (1, 3336)	0.13037008878027842
  :	:
  (554, 7681)	0.139088862729437
  (554, 7871)	0.25318299772379493
  (554, 8113)	0.06505727058380065
  (554, 8130)	0.2669593308715244
  (555, 1304)	0.4081000495497748
  (555, 1805)	0.3583092712410307

In [41]:
# y_train is a pandas Series, so we can call head() to inspect it
y_train.head()

4096    0
866     1
1732    0
1260    0
5415    0
Name: label, dtype: int64

In [42]:
# y_test is a pandas Series, so we can call head() to inspect it
y_test.head()

3444    0
378     0
3330    0
4606    0
2050    0
Name: label, dtype: int64

In [16]:
# now we need to initialize the Naive Bayes Classifier and fit the training data
# for text classification problems, the Multinomial Naive Bayes Classifier is well-suited

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, y_train)  

In [46]:
model

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
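To show what the classifier estimates, here is a from-scratch sketch of multinomial Naive Bayes with additive smoothing, where `alpha` plays the same role as the `alpha=1.0` shown above. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it works on raw word counts rather than tf-idf weights, and the corpus, `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class word log-likelihoods."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary word gets alpha extra counts.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy corpus: 0 = ham, 1 = spam.
docs = [["free", "prize", "win"], ["win", "cash", "free"],
        ["see", "you", "later"], ["call", "you", "later"]]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, ["free", "cash"]))   # 1
print(predict_nb(model, ["see", "you"]))     # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that was never seen in one class from forcing that class's probability to zero.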

## Evaluating The Model

We have now trained our classifier, so we can evaluate its performance on the test set.

In [17]:
import numpy as np

# run predictions using the test dataset
predicted = model.predict(X_test)

predicted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,

In [18]:
print(np.mean(predicted == y_test))  

0.9480286738351255


The Naive Bayes classifier reaches 94.8% accuracy on this test set. Note that accuracy on its own can be misleading here, since the labels are imbalanced (86.6% legitimate vs 13.4% spam): a model that labelled every message as legitimate would already score about 86%.

It could be that our classifier favours the legitimate class while missing much of the spam class. To check this, we will look at a confusion matrix.

In [19]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predicted))  

[[482   0]
 [ 29  47]]


The errors all fall on one side: 0 legitimate messages were classified as spam, while 29 of the 76 spam messages (about 38%) were classified as legitimate. The classifier never flags a legitimate message, but it lets a sizeable share of spam through. Overall, these are reasonable results for our simple classifier.
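The confusion matrix can be turned into per-class metrics with a few lines of arithmetic. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, with ham = 0 first) and uses the four numbers printed above:

```python
# Confusion matrix values: rows = true class, columns = predicted class.
tn, fp = 482, 0     # true ham:  predicted ham, predicted spam
fn, tp = 29, 47     # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)      # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)         # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total    # accuracy of always predicting "ham"

print(f"accuracy  {accuracy:.3f}")    # 0.948
print(f"precision {precision:.3f}")   # 1.000
print(f"recall    {recall:.3f}")      # 0.618
print(f"f1        {f1:.3f}")          # 0.764
print(f"baseline  {baseline:.3f}")    # 0.864
```

Precision and recall for the spam class say more than accuracy here: every message flagged as spam really was spam, but only about 62% of the spam was caught.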