# Spam Detection using Multinomial Naive Bayes Model & Pipeline

> - The `SMS Spam Collection` is a set of tagged SMS messages collected for SMS spam research. It contains one set of `5,574 messages` in English, each tagged as either ham (legitimate) or spam.
> - A collection of `425` SMS spam messages was manually extracted from the `Grumbletext Web site`. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. 
> - A subset of `3,375` randomly chosen SMS ham messages from the `NUS SMS Corpus (NSC)`, which is a dataset of about `10,000` legitimate messages collected for research at the `Department of Computer Science at the National University of Singapore`. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. 
> - A list of `450` SMS ham messages collected from `Caroline Tag's PhD Thesis`.
> - Finally, we have incorporated the `SMS Spam Corpus v.0.1 Big`. It has 1,002 SMS ham messages and 322 spam messages.

<div class="alert alert-block alert-info">
<h1>Table of Contents</h1></div><a class="anchor" id="0.1"></a>

1. [Importing Libraries](#1)
2. [Importing Dataset](#2)
3. [Exploratory Data Analysis](#3)
4. [Text Processing](#4)
5. [Splitting Dataset to Train and Test](#5)
6. [CountVectorizer](#6)
7. [TF/IDF Vectorization](#7)
8. [Multinomial Naive Bayes Model](#8)
9. [Pipeline](#9)

<div class="alert alert-block alert-info">
<h1>1. Importing Libraries</h1></div><a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Essentials
import pandas as pd
import numpy as np
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("darkgrid")

# Ignore useless warnings
import warnings
warnings.filterwarnings("ignore")

#Limiting floats output to 2 decimal points
pd.set_option('display.float_format', lambda x: '{:.2f}'.format(x)) 

# Text Analysis
from collections import Counter
import nltk
from nltk.corpus import stopwords
import string

# Modelling Library
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.pipeline import Pipeline


#print(os.getcwd())

<div class="alert alert-block alert-info">
<h1>2. Importing Dataset</h1></div><a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
sms = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='latin1')[['v1', 'v2']]
sms.columns = ['Label','Message']
print('Dataset Dimension:', sms.shape)
sms.head()

<div class="alert alert-block alert-info">
<h1>3. Exploratory Data Analysis</h1></div><a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Count of messages per label
sns.countplot(data=sms, x='Label')
plt.title('Spam/ham Count')

In [None]:
count = sms["Label"].value_counts(sort=True)
count.plot(kind='pie', figsize=(15,5), autopct='%1.0f%%')
plt.title('Spam/ham Distribution')
plt.ylabel('')

In [None]:
# Calculate and plot the length of each message
sms['Length'] = sms.Message.apply(len)
sms.hist(column='Length',by='Label',bins=50, figsize=(15,6))

In [None]:
sms.groupby('Label').describe()

In [None]:
# longest message in the dataset
sms[sms.Length == 910].Message.iloc[0]

In [None]:
# Example of spam message
sms[sms.Length == 157].Message.iloc[0]

<div class="alert alert-block alert-warning">
<h1>Observation</h1></div><a class="anchor"></a>

> - There are `4825` ham messages and `747` spam messages
> - `87%` of the messages are ham while `13%` are spam
> - Most ham messages are about `100` characters long, while spam messages are typically around `130-150` characters long.
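
The class counts above also give a useful reference point for any accuracy figure reported later. The following is a small arithmetic sketch, using only the counts quoted in this observation, of the accuracy a trivial classifier would reach by always predicting ham. The variable names are illustrative.

```python
# Class counts quoted in the observation above
ham, spam = 4825, 747
total = ham + spam

# Accuracy of a trivial classifier that always predicts "ham"
baseline_accuracy = ham / total

print(f"Total messages: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Because the classes are imbalanced, a model needs to beat this baseline by a clear margin before its accuracy says much about spam detection.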

<div class="alert alert-block alert-info">
<h1>4. Text Processing</h1></div><a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [None]:
def process_text(text):
    '''
    Clean a message:
    1. Remove punctuation
    2. Remove stopwords
    3. Return the remaining words joined into a single string
    '''
    STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', '16' ,'im', 'dont', 'doin', 'ure']
    # Check characters to see if they are in punctuation
    nopunc = [char for char in text if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])


sms['clean_msg'] = sms.Message.apply(process_text)
sms.head()

In [None]:
# Testing the process_text function:
process_text('Hi. My name is Rhea Das, I am a Data Scientist. It\'s amazing!!')

In [None]:
# Visualizing the most common words occurring in Ham

ham_count = Counter(" ".join(sms[sms['Label'] == 'ham']['clean_msg']).split()).most_common(20)
ham_count = pd.DataFrame.from_dict(ham_count)
ham_count = ham_count.rename(columns={0: "words in non-spam", 1 : "count"})

ham_count.plot.bar(legend=False, figsize=(12,5),color = 'black')
y_pos = np.arange(len(ham_count["words in non-spam"]))
plt.xticks(y_pos, ham_count["words in non-spam"])
plt.title('Most frequent words in non-spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

In [None]:
ham_count

In [None]:
# Visualizing the most common words occurring in Spam

spam_count = Counter(" ".join(sms[sms['Label'] == 'spam']['clean_msg']).split()).most_common(20)
spam_count = pd.DataFrame.from_dict(spam_count)
spam_count = spam_count.rename(columns={0: "words in spam", 1 : "count"})

spam_count.plot.bar(legend=False, figsize=(12,5),color = 'blue')
y_pos1 = np.arange(len(spam_count["words in spam"]))
plt.xticks(y_pos1, spam_count["words in spam"])
plt.title('Most frequent words in Spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

In [None]:
spam_count

<div class="alert alert-block alert-warning">
<h1>Observation</h1></div><a class="anchor"></a>

> - In the raw messages, the majority of frequent words in both classes are stop words such as `'to', 'a', 'or'` and so on. <br>
> - `Stop words` are the most common words in a language. <br>
> - After removing these common words, we visualized the 20 most frequent words in the spam messages as well as in the ham messages.

<div class="alert alert-block alert-info">
<h1>5. Splitting Dataset to Train and Test</h1></div><a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Split into X and Y

X = sms['clean_msg']
Y = sms['Label'].map({'ham':0,'spam':1})
print("X Dimension", X.shape)
print("Y Dimension", Y.shape)

In [None]:
# Splitting X & Y into train and test

x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=1)
print('X_train Dimension:', x_train.shape)
print('X_test Dimension:', x_test.shape)
print('Y_train Dimension:', y_train.shape)
print('Y_test Dimension:', y_test.shape)

<div class="alert alert-block alert-comment">
<h1>Word to the wise</h1></div><a class="anchor"></a>

- From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> - Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**. <br>

-  We will use `CountVectorizer` to convert text into a matrix of token counts. In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a `feature`.
> - The vector of all the token frequencies for a given document is considered a `multivariate sample`.

> - A `corpus of documents` can thus be represented by a matrix with **one row per document** and **one column per token** occurring in the corpus.
> - We call `vectorization` the general process of turning a collection of text documents into numerical feature vectors. <br>
> - This specific strategy (tokenization, counting and normalization) is called the `Bag of Words` or `Bag of n-grams` representation. <br>
> - Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

### What we should do with this dataset - Vectorization

> - Each cleaned message is currently a sequence of tokens (loosely, [lemmas](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)). We now need to convert each message into a numerical vector that scikit-learn's models can work with.

> **We'll do that in three steps using the bag-of-words model:**
1. Count how many times a word occurs in each message (known as term frequency)
2. Weight the counts, so that frequent tokens get lower weight (inverse document frequency)
3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
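
The three steps can be sketched on a toy corpus as follows. This is an illustrative example with made-up messages. It assumes scikit-learn's `TfidfTransformer` with `use_idf=True` and `norm="l2"`, which are its default settings, written out here to make steps 2 and 3 visible.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up messages (not from the SMS dataset)
docs = ["free prize call now", "call me now", "free free prize"]

# Step 1: count how often each token occurs in each message
counts = CountVectorizer().fit_transform(docs)

# Steps 2 and 3: apply IDF weighting, then L2-normalize each row
tfidf = TfidfTransformer(use_idf=True, norm="l2").fit_transform(counts)

# After normalization every non-empty message has Euclidean length 1
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(row_norms)
```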

**Summary:**

1. `vect.fit(train)` **learns the vocabulary** of the training data

2. `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data

3. `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
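
The summary above can be checked on a tiny example. The sketch below uses made-up training and test messages to show that a token seen only at test time (`tomorrow`) has no column in the matrix and is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages (not from the SMS dataset)
train = ["win a free prize", "see you at lunch"]
test = ["free lunch tomorrow"]

vect = CountVectorizer()
vect.fit(train)                       # learns the vocabulary from train only

train_dtm = vect.transform(train)     # document-term matrix for train
test_dtm = vect.transform(test)       # same columns, built from the fitted vocabulary

print(sorted(vect.vocabulary_))       # 'tomorrow' is absent
print(test_dtm.toarray())             # only 'free' and 'lunch' are counted
```

Note that the default tokenizer keeps only tokens of two or more characters, so the single-letter word `a` is not part of the vocabulary either.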

<div class="alert alert-block alert-info">
<h1>6. CountVectorizer</h1></div><a class="anchor" id="6"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Initialize the vectorizer
vect = CountVectorizer()

# Learn the training data vocabulary
vect.fit(x_train)

# Use the fitted vocabulary to create a document-term matrix (dtm)
x_train_dtm = vect.transform(x_train)

# Equivalent shortcut: fit and transform in a single step
x_train_dtm = vect.fit_transform(x_train)

# Transform the test dataset into a document-term matrix, reusing the training vocabulary
x_test_dtm = vect.transform(x_test)

<div class="alert alert-block alert-info">
<h1>7. TF/IDF Vectorization</h1></div><a class="anchor" id="7"></a>

> - `Term Frequency - Inverse Document Frequency` is a numerical statistic intended to reflect how important a word is to a document. It is used as a weighting factor in information retrieval and text mining.
> - The TF-IDF value increases proportionally to the number of times a word appears in a document, but is offset by how frequently the word occurs in the corpus.
<br><center>where `TF-IDF(t) = Term Frequency (TF) * Inverse Document Frequency (IDF)`</center>

> - `Term Frequency (TF)` measures how frequently a term occurs in a document.
<br><center>`TF(t) = Number of times term t appears in a document / Total number of terms in that document`</center>
> - `Inverse Document Frequency (IDF)` measures how informative a term is. TF treats all terms equally, whereas IDF assigns less weight to words that occur in many documents, such as 'is', 'the', and 'of', and more weight to rare terms that help identify the class of a message.
<br><center>`IDF(t) = log_e(Total number of documents / Number of documents with term t in it)`</center>
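
The formulas above can be worked through by hand on a toy corpus. The sketch below uses made-up messages and hypothetical helper names (`tf`, `idf`). It also compares the result with scikit-learn, whose `TfidfTransformer` by default uses a smoothed variant, `ln((1 + n) / (1 + df)) + 1`, so its weights differ from the textbook formula.

```python
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up messages (not from the SMS dataset)
docs = ["free prize call now", "call me now", "free free prize"]
tokens = [d.split() for d in docs]
n_docs = len(docs)

def tf(term, doc_tokens):
    """Textbook term frequency: occurrences of term / number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    """Textbook inverse document frequency: ln(N / number of documents containing term)."""
    df = sum(term in t for t in tokens)
    return math.log(n_docs / df)

# 'free' appears twice in the 3-token third message and in 2 of the 3 documents
tf_free = tf("free", tokens[2])        # 2 / 3
idf_free = idf("free")                 # ln(3 / 2)
tfidf_free = tf_free * idf_free
print(tf_free, idf_free, tfidf_free)

# scikit-learn's default IDF is smoothed: ln((1 + N) / (1 + df)) + 1
vect = CountVectorizer()
counts = vect.fit_transform(docs)
sk_idf = TfidfTransformer().fit(counts).idf_
sk_idf_free = sk_idf[vect.vocabulary_["free"]]
smoothed_idf_free = math.log((1 + n_docs) / (1 + 2)) + 1
print(sk_idf_free, smoothed_idf_free)
```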

[Back to Table of Contents](#0.1)

In [None]:
# Initialize the transformer
tfidf_transformer = TfidfTransformer()

# Fit on the training document-term matrix
tfidf_transformer.fit(x_train_dtm)

# Transform the training document-term matrix into TF-IDF weights
tfidf_transformer.transform(x_train_dtm)

<div class="alert alert-block alert-info">
<h1>8. Multinomial Naive Bayes Model</h1></div><a class="anchor" id="8"></a>

> - The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). 
> - The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
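
To illustrate both points, the sketch below fits `MultinomialNB` once on integer word counts and once on fractional TF-IDF weights. The four training messages and their labels are made up for illustration, with 1 standing for spam and 0 for ham as elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages: 1 = spam, 0 = ham
train_docs = ["free prize call now", "win free cash prize",
              "see you at lunch", "call me after lunch"]
train_labels = [1, 1, 0, 0]
new_doc = ["free cash prize"]

# Integer features: raw word counts
vect = CountVectorizer()
counts = vect.fit_transform(train_docs)
nb_counts = MultinomialNB().fit(counts, train_labels)
pred_counts = nb_counts.predict(vect.transform(new_doc))[0]

# Fractional features: TF-IDF weights computed from the same counts
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)
nb_tfidf = MultinomialNB().fit(weights, train_labels)
pred_tfidf = nb_tfidf.predict(tfidf.transform(vect.transform(new_doc)))[0]

print("Prediction from counts:", pred_counts)
print("Prediction from TF-IDF:", pred_tfidf)
```

On a corpus this small the two variants agree. On real data they can produce different error profiles, so it is worth comparing both.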

[Back to Table of Contents](#0.1)

In [None]:
# Initialize the model
nb = MultinomialNB()

# Train the model using x_train_dtm
nb.fit(x_train_dtm, y_train)

# Make class predictions for x_test_dtm
y_pred_class = nb.predict(x_test_dtm)

# Calculate the accuracy of the class predictions
print('Accuracy of Multinomial Naive-Bayes Model:',round(metrics.accuracy_score(y_test, y_pred_class)*100,2))

# Print the confusion matrix
print("\nConfusion Matrix\n", metrics.confusion_matrix(y_test, y_pred_class))

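# Calculate ROC/AUC from the predicted spam probabilities rather than the hard class labels
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]
print('\nROC/AUC:',round(metrics.roc_auc_score(y_test, y_pred_prob)*100,2))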

In [None]:
# Printing the false positive predictions - messages that are actually HAM but predicted as SPAM (#7)

x_test[y_pred_class > y_test]

In [None]:
# Printing the false negative predictions - messages that are actually SPAM but predicted as HAM (#16)

x_test[y_pred_class < y_test]

<div class="alert alert-block alert-warning">
<h1>Observation</h1></div><a class="anchor"></a>

> - The goal of the algorithm is to predict whether a new SMS is HAM or SPAM, and it can make two kinds of mistakes:
> -  `False Positive Prediction`: The message is actually HAM but the model predicts SPAM
<br><center> **OUTCOME: I delete it!!** </center>  
> -  `False Negative Prediction`: The message is actually SPAM but the model predicts HAM
<br><center> **OUTCOME: I probably do not read it!!** </center>
    
> - The second option is preferable: leaving a spam message unread costs less than deleting a legitimate one.
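
With spam encoded as the positive class (label 1), as in the code above, a false positive is a ham message flagged as spam and a false negative is a spam message that slips through. The sketch below uses made-up labels, not the model's actual predictions, to show how the four cells of scikit-learn's confusion matrix map to these terms.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("True negatives  (ham kept as ham):     ", tn)
print("False positives (ham flagged as spam): ", fp)
print("False negatives (spam passed as ham):  ", fn)
print("True positives  (spam caught):         ", tp)

# Precision: share of flagged messages that really are spam
precision = tp / (tp + fp)
# Recall: share of spam messages that were caught
recall = tp / (tp + fn)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```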

<div class="alert alert-block alert-info">
<h1>9. Pipeline</h1></div><a class="anchor" id="9"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Initialize the pipeline: counts -> TF-IDF -> Naive Bayes
pipe = Pipeline([('bow', CountVectorizer()), 
                 ('tfid', TfidfTransformer()),  
                 ('model', MultinomialNB())])

# Fitting the training dataset
pipe.fit(x_train, y_train)

# Predicting on test dataset
y_pred_pipe = pipe.predict(x_test)

# Calculate the accuracy of the class predictions
print('Accuracy of Pipeline:',round(metrics.accuracy_score(y_test, y_pred_pipe)*100,2))

# Print the confusion matrix
print("\nConfusion Matrix\n", metrics.confusion_matrix(y_test, y_pred_pipe))

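# Calculate ROC/AUC from the predicted spam probabilities rather than the hard class labels
y_prob_pipe = pipe.predict_proba(x_test)[:, 1]
print('\nROC/AUC:',round(metrics.roc_auc_score(y_test, y_prob_pipe)*100,2))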

In [None]:
def detect_spam(s):
    return pipe.predict([s])[0]
detect_spam('Hi, this is Rhea.')

<div class="alert alert-block alert-warning">
<h1>Observation</h1></div><a class="anchor"></a>

> The pipeline classifies the messages with 96% accuracy and the lowest number of false positives.

___