# SMS Spam Classifier using Decision Trees and Naive Bayes

Let us now use the NLP techniques that we have learnt to build an SMS spam classifier.

To build the classifier, we have collected historical data in which each SMS message is labelled 'ham' (not spam) or 'spam'. You can download this data from here.

To build the model, we shall follow the steps below:

# Step 1: Load the data into the environment

In [3]:
import numpy as np
import pandas as pd
# Load the data with pandas
# Note: adjust the filename and path to match your environment
sms_data = pd.read_csv("spam.csv", encoding='latin-1')
# Review the loaded data
print('sms_data: \n',sms_data.head())
# Keep only the first two columns: the label (v1) and the message text (v2)
cols = sms_data.columns[:2]
data = sms_data[cols]
print(data.shape)
# Give the two columns descriptive names
data = data.rename(columns={"v1":"Value","v2":"Text"})
print(data.head())
# Check how many messages belong to each class
print(data.Value.value_counts())


sms_data: 
      v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
(5572, 2)
  Value                                               Text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
ham     4825
spam     747
Nam
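The class counts printed above show that the data is imbalanced: most messages are 'ham'. As a rough reference point for the accuracy figures reported later, the short calculation below works out the accuracy of a trivial classifier that always predicts the majority class. It uses only the counts from the output above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
class_counts = {"ham": 4825, "spam": 747}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(f"Total messages: {total}")
print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any model we build should clearly beat this baseline to be considered useful.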

# Step 2: Feature Engineering

In [4]:
import re
import nltk
from nltk import word_tokenize
# Note: word_tokenize and nltk.corpus.words rely on NLTK data that must be downloaded once,
# e.g. nltk.download("punkt") and nltk.download("words") (resource names can vary by NLTK version)
# Creating a new feature called Punctuations.
# This feature counts the characters in the sms message that are neither word characters
# nor whitespace ('+', '&' and '^' are excluded from the count)
data["Punctuations"] = data["Text"].apply(lambda x: len(re.findall(r"[^\w\s+&^]",x)))
# Creating a new feature called Phonenumbers.
# This feature counts the runs of 10 consecutive digits in the sms text
data["Phonenumbers"] = data["Text"].apply(lambda x: len(re.findall(r"[0-9]{10}",x)))
# Creating a new feature called Links.
# This feature indicates if the sms text contains a URL or not
is_link = lambda x: 1 if re.search(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+",x) is not None else 0
data["Links"] = data["Text"].apply(is_link)
# Creating a new feature called Uppercase.
# This feature counts how many words in the sms text are in upper case
count_upper = lambda x : sum(word.isupper() for word in x.split())
data["Uppercase"] = data["Text"].apply(count_upper)
# Identifying and counting how many unusual words are there in the sms text.
# The English vocabulary is built once, outside the function, because rebuilding it for every message is slow
english_vocab_set = set(w.lower() for w in nltk.corpus.words.words())
def find_unusual_words(text):
    text_vocab_set = set(w.lower() for w in text if w.isalpha())
    unusual_set = text_vocab_set - english_vocab_set
    return len(unusual_set)
data["unusualwords"] = data["Text"].apply(lambda x: find_unusual_words(word_tokenize(x)))
# View a few records of the data after creating these features
print(data[14:25])




   Value                                               Text  Punctuations  \
14   ham                I HAVE A DATE ON SUNDAY WITH WILL!!             2   
15  spam  XXXMobileMovieClub: To use your credit, click ...            11   
16   ham                         Oh k...i'm watching here:)             6   
17   ham  Eh u remember how 2 spell his name... Yes i di...             5   
18   ham  Fine if thatåÕs the way u feel. ThatåÕs the wa...             1   
19  spam  England v Macedonia - dont miss the goals/team...             7   
20   ham          Is that seriously how you spell his name?             1   
21   ham  IÛ÷m going to try for 2 months ha ha only joking             2   
22   ham  So Ì_ pay first lar... Then when is da stock c...             6   
23   ham  Aft i finish my lunch then i go str down lor. ...             3   
24   ham  Ffffffffff. Alright no way I can meet up with ...             2   

    Phonenumbers  Links  Uppercase  unusualwords  
14             0      0 

In the code snippet above, we created new features based on our understanding of the content of the data. This kind of feature engineering is a critical step when you are building NLP applications.
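To see what these hand-crafted features look like for a single message, here is a minimal sketch that applies the same kind of regular expressions to one made-up SMS. The sample text and the `extract_features` helper are illustrative assumptions, not part of the dataset or the notebook, and the unusual-words feature is left out so that the sketch needs only the standard library.

```python
import re

# Patterns in the spirit of the feature-engineering cell above.
PUNCT_RE = re.compile(r"[^\w\s+&^]")   # not a word char, whitespace, '+', '&' or '^'
PHONE_RE = re.compile(r"[0-9]{10}")    # a run of 10 consecutive digits
LINK_RE = re.compile(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+")


def extract_features(text):
    """Return the hand-crafted features for one message (hypothetical helper)."""
    return {
        "Punctuations": len(PUNCT_RE.findall(text)),
        "Phonenumbers": len(PHONE_RE.findall(text)),
        "Links": int(LINK_RE.search(text) is not None),
        "Uppercase": sum(word.isupper() for word in text.split()),
    }


# A made-up message, used only for illustration.
sample = "WINNER!! Call 09061701461 now or visit http://example.com"
features = extract_features(sample)
print(features)
```

Note that the 11-digit number still counts as one phone number, because the pattern matches its first 10 digits.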

The code snippet below converts the text into a TF-IDF matrix. Recall that the TF-IDF matrix is a numeric representation of the text.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf= TfidfVectorizer(stop_words="english",strip_accents='ascii',max_features=300)
tf_idf_matrix = tf_idf.fit_transform(data["Text"])


TF-IDF vectorization also performs some of the required cleaning and normalization steps, such as removing punctuation, stop words, and accents. In the snippet above, we set max_features to 300, so the TF-IDF matrix contains only the 300 most frequent terms in the corpus. This reduces the dimensionality of the TF-IDF vectors.
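To make the idea concrete, the following is a simplified, standard-library sketch of how a TF-IDF matrix with a vocabulary cap can be computed. It assumes whitespace tokenization, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2-normalized rows, which mirrors the default settings of `TfidfVectorizer`; it skips stop-word and accent handling. The toy corpus and the `tfidf_matrix` function are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs, max_features):
    """Return (vocabulary, rows) for a toy TF-IDF matrix (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in docs]

    # Keep only the max_features terms with the highest corpus frequency.
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)
    top_terms = sorted(corpus_counts, key=lambda t: (-corpus_counts[t], t))[:max_features]
    vocab = sorted(top_terms)

    # Smoothed inverse document frequency for each kept term.
    n_docs = len(docs)
    doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid division by zero
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["free entry win cup", "ok see you later", "win free prize now", "see you at home"]
vocab, rows = tfidf_matrix(docs, max_features=4)
print(vocab)
for row in rows:
    print([round(value, 4) for value in row])
```

With `max_features=4`, only the four terms that occur twice in the toy corpus survive, so each message is represented by a vector of length 4 instead of one entry per distinct word.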

Finally, the code snippet below combines the TF-IDF matrix and the other features we created earlier into a single data frame.

In [6]:
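# Note: get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
data_extra_features = pd.concat([data,pd.DataFrame(tf_idf_matrix.toarray(),columns=tf_idf.get_feature_names_out())],axis=1)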
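One detail worth keeping in mind when combining frames column-wise is that `pd.concat(..., axis=1)` aligns rows by index label, not by position. In this notebook `data` keeps its default 0..n-1 index, so the rows line up. The sketch below, which uses two small made-up frames, shows what can happen if the index has changed (for example after filtering rows) and one way to guard against it.

```python
import pandas as pd

# Hypothetical toy frame whose index is NOT 0..n-1, e.g. after filtering rows.
data = pd.DataFrame({"Text": ["free win", "ok later"], "Links": [1, 0]}, index=[5, 9])
# Hypothetical TF-IDF frame, which always starts with a fresh 0..n-1 index.
tfidf = pd.DataFrame([[0.7, 0.7], [0.0, 1.0]], columns=["free", "ok"])

# Index labels differ, so pandas creates extra rows filled with NaN.
misaligned = pd.concat([data, tfidf], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([data.reset_index(drop=True), tfidf], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

Checking the shape of the combined frame against the number of messages is a cheap way to catch this kind of misalignment.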


In [7]:
data_extra_features

Unnamed: 0,Value,Text,Punctuations,Phonenumbers,Links,Uppercase,unusualwords,000,10,150p,...,world,www,xmas,xxx,ya,yeah,year,yes,yo,yup
0,ham,"Go until jurong point, crazy.. Available only ...",9,0,0,0,3,0.0,0.0,0.0,...,0.594379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,Ok lar... Joking wif u oni...,6,0,0,0,3,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,5,1,0,2,5,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ham,U dun say so early hor... U c already then say...,6,0,0,2,1,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,"Nah I don't think he goes to usf, he lives aro...",2,0,0,1,3,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,9,1,0,2,0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5568,ham,Will Ì_ b going to esplanade fr home?,1,0,0,1,1,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5569,ham,"Pity, * was in mood for that. So...any other s...",7,0,0,0,1,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5570,ham,The guy did some bitching but I acted like i'd...,1,0,0,1,3,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
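from sklearn.model_selection import train_test_split
X=data_extra_features
# Use every column except the label and the raw text as a model feature
features = X.columns.drop(["Value","Text"])
# Select the label as a 1-D Series, which is the shape scikit-learn estimators expect for y
target = "Value"
X_train,X_test,y_train,y_test = train_test_split(X[features],X[target])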
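Because 'spam' is the minority class, a purely random split can leave the train and test sets with slightly different spam proportions, and the split changes on every run. A possible refinement, sketched below on made-up labels, is to pass `stratify` and `random_state` to `train_test_split`. The 80/20 toy data and the seed value are assumptions for illustration; `test_size=0.25` is written out explicitly but matches the function's default.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 80 ham and 20 spam messages.
labels = ["ham"] * 80 + ["spam"] * 20
messages = [f"message {i}" for i in range(len(labels))]

# stratify keeps the ham/spam ratio the same in both parts;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```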


In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
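# min_samples_split=40: a node is split further only if it holds at least 40 samples
dt = DecisionTreeClassifier(min_samples_split=40)
dt.fit(X_train,y_train)
pred = dt.predict(X_test)
# Accuracy on the training set, followed by accuracy on the held-out test set
print(accuracy_score(y_train, dt.predict(X_train)))
print(accuracy_score(y_test, pred))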


0.9837281646326872
0.9755922469490309
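Accuracy alone can look flattering on imbalanced data, so it can help to also look at precision and recall for the 'spam' class. The sketch below computes them from first principles on a small set of made-up labels; these are not the predictions of the model above, and the `spam_metrics` helper is hypothetical.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Return (precision, recall) for the positive class (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 8 ham and 2 spam messages, with one spam message missed.
y_true = ["ham"] * 8 + ["spam", "spam"]
y_pred = ["ham"] * 8 + ["ham", "spam"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = spam_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy is high even though half of the spam messages are missed, which is exactly the situation that recall exposes.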


In the code snippet below, we build two more classifiers: a Naive Bayes classifier and a Maximum Entropy classifier (logistic regression).

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Building a Naive Bayes Model
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
pred_mnb = mnb.predict(X_test)
print(accuracy_score(y_test, pred_mnb))
# Building a Logistic Regression Model
lr = LogisticRegression()
lr.fit(X_train,y_train)
pred_lr = lr.predict(X_test)
print(accuracy_score(y_test, pred_lr))




0.964824120603015




0.9791816223977028


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
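The message above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both, on synthetic data, is shown below. The data is made up to imitate our feature mix (one count-like column on a large scale and one TF-IDF-like column in [0, 1]), and `max_iter=1000` is an arbitrary illustrative value. `MinMaxScaler` is used here because it keeps all features non-negative, which would also suit `MultinomialNB`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features (an assumption for illustration).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)
counts = y * 10 + rng.uniform(0, 5, size=y.size)   # count-like, large scale
tfidf_like = rng.uniform(0, 1, size=y.size)        # TF-IDF-like, in [0, 1]
X = np.column_stack([counts, tfidf_like])

# Scale every feature to [0, 1], then fit with a higher iteration limit.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

train_accuracy = model.score(X, y)
n_iter = model.named_steps["logisticregression"].n_iter_[0]
print(f"training accuracy: {train_accuracy:.3f}")
print(f"solver iterations used: {n_iter}")
```

Wrapping the scaler and the classifier in one pipeline also ensures that the scaling learned on the training data is reused unchanged when predicting on the test data.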
