NLP (Natural Language Processing) for Spam Detection

Part 1: Data

Part 2: Text Pre-Processing

Part 3: Vectorization

Part 4: Training a Naive Bayes Model

Part 5: Testing It on an Example


#### Part 1: Data

In [6]:
# Requirements: NLTK installed, along with its stopwords corpus.
# Download the SMS Spam Collection dataset from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# Extract it into the current working directory (run pwd to see which directory that is).
# Read the raw file. It holds more than 5,000 SMS messages, each labelled ham or spam.

# Open the file in a with-block so it is closed once the lines are read
with open('smsspamcollection/SMSSpamCollection', encoding='utf-8') as f:
    messages = [line.rstrip() for line in f]
print(len(messages))

5574


In [3]:
#Standard imports
import pandas as pd

# Use pandas to read the tab-separated file into a DataFrame and name its two columns
messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t', names=['label', 'message'])

messages.head()

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


#### Part 2: Text Pre-Processing

1) Remove punctuation from sentences

2) Remove stop words
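To see what these two steps do to a single message, here is a minimal sketch that needs only the standard library. The `STOP_WORDS` set and the `clean` helper are hypothetical names made up for this illustration; the notebook itself relies on NLTK's English stop-word list, which is much longer.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "in", "to", "the", "is"}

def clean(text):
    # str.translate deletes every punctuation character in one pass
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Compare in lower case, but keep the original casing of the kept words
    return [w for w in no_punct.split() if w.lower() not in STOP_WORDS]

tokens = clean("Free entry in 2 a wkly comp to win FA Cup!")
print(tokens)
```

The exclamation mark and the words "in", "a" and "to" are dropped; everything else is kept as it was written.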


In [31]:
import string
from nltk.corpus import stopwords

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the remaining words as a list
    """
    # Keep only the characters that are not punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Load the stop-word list once, as a set, instead of once per word
    stop_words = set(stopwords.words('english'))

    # Return the remaining words, dropping any stopwords
    return [word for word in nopunc.split() if word.lower() not in stop_words]

In [32]:
messages['message'].head(5).apply(text_process)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: message, dtype: object

#### Part 3: Vectorization Using TF-IDF

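Here we convert the text into numeric vectors that a model can analyse. Note that the vectorizer below tokenizes the raw messages itself; the `text_process` function from Part 2 is not passed to it.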
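To make the idea of a TF-IDF vector concrete, the sketch below computes the weights by hand for a hypothetical three-message corpus (not taken from the SMS dataset). It assumes the smoothed IDF formula and L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default, and it uses single words only, whereas the notebook also counts 2- and 3-word n-grams.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the SMS dataset
docs = ["free prize now", "call me now", "free free entry"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed IDF (assumed to match scikit-learn's default):
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    # Raw term count times IDF, then scaled to unit Euclidean length
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

# Show the weights of the first message, one vocabulary term per line
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:>6}: {weight:.3f}")
```

Each message becomes one row with a slot per vocabulary term. Terms that appear in fewer messages ("prize") get a larger weight than terms shared across messages ("free", "now"), and terms absent from a message get zero, which is why the real matrix is stored in sparse form.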

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(messages['message'])

X[0]

<1x104967 sparse matrix of type '<class 'numpy.float64'>'
	with 51 stored elements in Compressed Sparse Row format>

In [38]:
# Non-zero TF-IDF weights of the first message, printed as (row, column) weight
print(X[0])

  (0, 33323)	0.06964811198313603
  (0, 92177)	0.10833755405777373
  (0, 46426)	0.15376692251360957
  (0, 68459)	0.12026955655973237
  (0, 20910)	0.11908189665388873
  (0, 10391)	0.11501264329702043
  (0, 64211)	0.07348935881678281
  (0, 42085)	0.050393325538225064
  (0, 14457)	0.12990291750969232
  (0, 35018)	0.08494093271861272
  (0, 100459)	0.10401136696342005
  (0, 48003)	0.12990291750969232
  (0, 14449)	0.14678714849145277
  (0, 18506)	0.12990291750969232
  (0, 85357)	0.07324746854860877
  (0, 34496)	0.07208651809860835
  (0, 6017)	0.15376692251360957
  (0, 95423)	0.08591558399587033
  (0, 33650)	0.15376692251360957
  (0, 92193)	0.15376692251360957
  (0, 46427)	0.15376692251360957
  (0, 68464)	0.15376692251360957
  (0, 20916)	0.15376692251360957
  (0, 10409)	0.15376692251360957
  (0, 64328)	0.14678714849145277
  :	:
  (0, 14459)	0.15376692251360957
  (0, 35154)	0.15376692251360957
  (0, 100478)	0.15376692251360957
  (0, 48004)	0.15376692251360957
  (0, 14450)	0.15376692251360957
  

In [39]:
X.shape

(5572, 104967)

#### Part 4: Training a Model

In [44]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X, messages['label'])
spam_detect_model

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
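The model above is fit on every message, so nothing is left over to measure how it behaves on unseen text. One possible way to check that, sketched here under assumptions, is to hold out part of the data and score the model on it. The `texts` and `labels` lists are a hypothetical toy stand-in for `messages['message']` and `messages['label']`, and the default `TfidfVectorizer()` settings are used rather than the notebook's `ngram_range=(1, 3)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for messages['message'] and messages['label']
texts = [
    "win a free prize now",
    "free entry claim your prize",
    "claim free cash now",
    "urgent win cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "call me when you get home",
    "lunch at noon sounds good",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Keep 25% of the messages aside; stratify so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fitting the vectorizer inside the pipeline means its vocabulary and IDF
# weights are learned from the training messages only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

With only eight messages the score itself means little; the point is the shape of the workflow. On the real DataFrame the same split would be applied to the two columns, and `sklearn.metrics.classification_report` would give per-class precision and recall.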

#### Part 5: Testing It

In [51]:
import numpy as np

sms_msg_array=np.array(["GENT! Claim Valid 150ppm "])

sms_msg_vector = vectorizer.transform(sms_msg_array)

print(spam_detect_model.predict(sms_msg_vector))

['spam']
