# Building an SMS Spam Classifier


### Dataset information

The dataset contains 5,572 text messages, each with a corresponding label (the target):
- **ham**: 4,825 observations
- **spam**: 747 observations

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

In [2]:
fileName = ".\\data\\sms.tsv"
sms = pd.read_table(fileName, header = None, names = ['label', 'message'])
sms.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# Print the first six messages in full
for message in sms["message"].head(6):
    print(message, "\n")

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 

Ok lar... Joking wif u oni... 

Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 

U dun say so early hor... U c already then say... 

Nah I don't think he goes to usf, he lives around here though 

FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv 



**Let's convert the label to a numerical value: 1 (the positive class) denotes spam and 0 denotes ham.**

In [4]:
sms["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
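
The counts above show that the two classes are imbalanced. As a rough reference point (a sketch added for context, not part of the original analysis), the share of each class and the accuracy of a hypothetical baseline that always predicts ham can be computed directly from those counts:

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the data set.
shares = {label: n / total for label, n in counts.items()}

# A hypothetical baseline that labels every message as ham is correct
# exactly as often as ham occurs in the data.
baseline_accuracy = shares["ham"]

print(total)
print({label: round(100 * share, 1) for label, share in shares.items()})
print(round(100 * baseline_accuracy, 1))
```

Because roughly 87 percent of the messages are ham, accuracy alone can look high even for a weak classifier, which is one reason to report precision as well.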

In [5]:
sms["target"] = (sms["label"] == "spam").astype(int)
sms.drop(["label"], axis = 1, inplace = True)
sms.head()

| | message | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |


#### A quick example of producing a bag of words

There are two steps to produce a vector representation of each text document in a corpus:
1. Learn the vocabulary of the corpus **using the fit method**.
2. Use that vocabulary to produce the vector representation of each document **using the transform method**.

In [6]:
corpus = [
    "This is the first document",
    "This is the second second document",
    "And the third one, Yes, yes, yes this is",
    "Is this the first document"
]

Steps 1 & 2: learning the vocabulary of the training data and vectorizing the documents
(dtm: document-token matrix)

In [7]:
CV = CountVectorizer()
CV.fit(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
CV.get_feature_names_out().tolist()

['and',
 'document',
 'first',
 'is',
 'one',
 'second',
 'the',
 'third',
 'this',
 'yes']

In [8]:
X_dtm = CV.transform(corpus)

Bag of words representation:


In [9]:
pd.DataFrame(data = X_dtm.toarray(), columns = CV.get_feature_names_out(),
             index = ["doc" + str(i + 1) for i in range(len(corpus))])

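| | and | document | first | is | one | second | the | third | this | yes |
|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| doc2 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1 | 0 |
| doc3 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 3 |
| doc4 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |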
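
To make the two steps concrete, here is a hand-rolled sketch of the same idea in plain Python. It is an illustration, not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and the token rule (lowercase words of two or more characters) is an assumption chosen to mirror CountVectorizer's default settings.

```python
import re
from collections import Counter

# Assumed token rule: lowercase words of two or more characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Split a document into lowercase tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(documents):
    """Step 1: learn the sorted vocabulary of the corpus."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def transform(documents, vocabulary):
    """Step 2: count how often each vocabulary word occurs in each document."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Words outside the vocabulary are simply never looked up, so they are ignored.
        rows.append([counts[word] for word in vocabulary])
    return rows

corpus = [
    "This is the first document",
    "This is the second second document",
    "And the third one, Yes, yes, yes this is",
    "Is this the first document",
]

vocabulary = fit_vocabulary(corpus)
dtm = transform(corpus, vocabulary)

print(vocabulary)
for row in dtm:
    print(row)

# A document containing words that were not seen during fitting.
unseen = transform(["A brand new document"], vocabulary)
print(unseen)
```

The last call shows why fitting and transforming are separate steps: a new document is always mapped onto the vocabulary learned earlier, so unseen words such as "brand" and "new" contribute nothing and the vector keeps the same length.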


## Bag of words for SMS

Splitting the data into training and test sets:

In [10]:
# Select the target as a 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning
y = sms["target"]
X = sms["message"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 123, stratify = y)
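
The `stratify = y` argument asks for a split in which both parts keep roughly the same ham/spam ratio as the full data. The sketch below illustrates that idea in plain Python on toy labels with the same class counts. It is a simplified illustration, not scikit-learn's algorithm; the function name `stratified_split` is hypothetical, and the exact split sizes can differ slightly from what `train_test_split` produces.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=123):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels with the same class counts as the SMS data (0 = ham, 1 = spam).
labels = [0] * 4825 + [1] * 747
train_idx, test_idx = stratified_split(labels)

spam_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
spam_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))
print(round(spam_share_train, 3), round(spam_share_test, 3))
```

Without stratification, a random split of an imbalanced data set could leave the test set with noticeably more or fewer spam messages than the training set, which would distort the evaluation.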

The next steps are:
- Creating an instance of the vectorizer
- Producing the document-token matrix for the training data in one step
- Transforming the test data into a document-token matrix using the fitted vocabulary

In [11]:
CV1 = CountVectorizer()
X_train_dtm = CV1.fit_transform(X_train)
X_test_dtm = CV1.transform(X_test)

In [12]:
X_train_dtm.shape

(4457, 7778)

In [13]:
X_test_dtm.shape

(1115, 7778)

**Defining a function to print the confusion matrix:**

In [14]:
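def printMatrix(CM, labels = ["ham", "spam"]):
    """Return the confusion matrix as a labelled DataFrame with row and column totals."""
    df = pd.DataFrame(data = CM, columns = labels, index = labels)
    df.index.name = "TRUE"
    df.columns.name = "PREDICTION"
    # Append a row of column totals, then a column of row totals
    df.loc["Total"] = df.sum()
    df["Total"] = df.sum(axis = 1)
    return df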

## Building the classifier

#### Multinomial Naive Bayes classifier
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).
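
As a rough illustration of what such a classifier computes, the sketch below implements multinomial Naive Bayes from scratch on a tiny, made-up word-count matrix. The function names, the toy data and the four-word vocabulary are all hypothetical; `alpha = 1.0` is Laplace smoothing, which is also the default of scikit-learn's `MultinomialNB`. This is a minimal sketch of the technique, not a replacement for the library class used in this notebook.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        # Prior: fraction of training documents that belong to class c.
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed total count of each word across the documents of class c.
        word_totals = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        denominator = sum(word_totals)
        log_likelihood[c] = [math.log(t / denominator) for t in word_totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Return the class with the highest log posterior score for one count vector."""
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy word-count matrix; columns are counts of ["free", "win", "home", "later"].
X = [
    [2, 1, 0, 0],  # spam
    [1, 2, 0, 0],  # spam
    [0, 0, 1, 2],  # ham
    [0, 0, 2, 1],  # ham
    [0, 1, 1, 1],  # ham
]
y = [1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

log_prior, log_likelihood = fit_multinomial_nb(X, y)
print(predict([1, 1, 0, 0], log_prior, log_likelihood))  # message with "free" and "win"
print(predict([0, 0, 1, 1], log_prior, log_likelihood))  # message with "home" and "later"
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that never appeared in one class from forcing that class's score to minus infinity.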

In [15]:
MNB = MultinomialNB()
MNB.fit(X_train_dtm, y_train)
y_pred_test = MNB.predict(X_test_dtm)

# Scale to percent before rounding so the printed value carries no floating-point noise
accuracy = round(100 * accuracy_score(y_true = y_test, y_pred = y_pred_test), 1)
precision = round(100 * precision_score(y_true = y_test, y_pred = y_pred_test), 1)
print("Accuracy is ", accuracy)
print("Precision is ", precision)

Accuracy is  98.7
Precision is  97.2



In [16]:
CM = confusion_matrix(y_pred = y_pred_test, y_true = y_test)
printMatrix(CM)

| TRUE \ PREDICTION | ham | spam | Total |
|---|---|---|---|
| ham | 962 | 4 | 966 |
| spam | 10 | 139 | 149 |
| Total | 972 | 143 | 1115 |


#### Insight:
- **Precision** is the proportion of messages predicted as spam that really are spam.
- **Accuracy** is the proportion of all messages that are classified correctly.
- The accuracy is 98.7 percent, which is high. The confusion matrix shows that, out of 143 spam predictions, only 4 were wrong (ham messages flagged as spam), which yields a precision of 97.2 percent. The classifier also missed 10 of the 149 spam messages in the test set.
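
Both metrics can be recomputed by hand from the four cells of the confusion matrix, which is a useful sanity check. The short sketch below does this; recall is not reported in this notebook and is included only for context.

```python
# Cell values copied from the confusion matrix above (rows are the true class).
tn, fp = 962, 4    # true ham:  predicted ham, predicted spam
fn, tp = 10, 139   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

# Accuracy: share of all messages classified correctly.
accuracy = (tp + tn) / total
# Precision: share of spam predictions that are really spam.
precision = tp / (tp + fp)
# Recall: share of real spam messages that were caught (extra, for context).
recall = tp / (tp + fn)

print(round(100 * accuracy, 1))
print(round(100 * precision, 1))
print(round(100 * recall, 1))
```

The first two values agree with the 98.7 and 97.2 printed by `accuracy_score` and `precision_score`.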

## Making predictions

**Let's predict the class of the following SMS messages:**
1. "Today is your lucky day. You win 100 $"
2. "I want to go to school"

A function that accepts a string containing a text message and classifies it as spam or ham:

In [17]:
def spam_filter(text):
    text1 = CV1.transform([text])
    pred = MNB.predict(text1)[0]
    print("Text Message:", text)
    if pred:
        return "Type of the message: Spam"
    else:
        return "Type of the message: Ham"

In [18]:
sms1 = "Today is your lucky day. You win 100 $"
sms2 = "I want to go to school"

In [19]:
spam_filter(sms1)

Text Message: Today is your lucky day. You win 100 $


'Type of the message: Spam'

In [20]:
spam_filter(sms2)

Text Message: I want to go to school


'Type of the message: Ham'