<a href="https://colab.research.google.com/github/Estrada-John/SpamClassification/blob/master/Spam_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3><center>ECE 49500/59500 Machine Learning</center></h3>
<h3><center>Spring 2020</center></h3>
<h2><center>Spam Email Classification using Naive Bayes Classifier</center></h2>

## Context

In this exercise, you will use the SMS Spam Collection dataset, a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

## Content

The files contain one message per line. Each line is composed of two columns: **v1 contains the label (ham or spam) and v2 contains the raw text.** This corpus has been collected from free or free-for-research sources on the Internet. More details can be found here: https://www.kaggle.com/uciml/sms-spam-collection-dataset

## Objective

Apply a Naive Bayes classifier to this dataset to accurately predict which texts are spam.

## 1. Contents of this notebook

*  Text Analysis
        - Explore the Data
        - Developing Insights
*  Text Transformation
        - Data Cleaning (Removing unimportant data/ Stopwords/ Stemming)
        - Converting data into a model usable format (Bag of words Model)
*  Naive Bayes Model for Spam Classification


#### TEXT ANALYSIS

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Text Preprocessing
import nltk
nltk.download("all")   # you will need to download it if you have not done so
from nltk.corpus import stopwords
import string
from nltk.tokenize import word_tokenize


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

* #####  Load the dataset. We will use the pandas library to load the dataset. More information about pandas can be found at https://pandas.pydata.org/

In [2]:
messages = pd.read_csv("spam.csv", encoding = 'latin-1')

# Drop the extra columns and rename columns
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["category", "text"]

FileNotFoundError: ignored

In [0]:
display(messages.head(n = 20))

* ##### Check overall information of the dataset

In [0]:
messages.info()

* ##### Let us see what percentage of our data is spam or ham (legitimate)

In [0]:
messages["category"].value_counts().plot(kind = 'pie', figsize = (6, 6), fontsize=14, autopct = '%1.1f%%', shadow = True)
plt.ylabel("Spam vs Ham")
plt.legend(["Ham", "Spam"])
plt.show()

From the pie chart above, it can be seen that about 86% of our dataset consists of non-spam messages. 

*  When we split our dataset into training and test sets, **stratified sampling** is recommended; otherwise the training set may be skewed even further towards normal messages. A model trained mostly on normal messages may end up predicting everything as ham, and we might not notice, because most messages really are ham and such a model would still score a high accuracy.
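
As a minimal sketch of what stratification does, the snippet below splits a hypothetical label vector with the same 86/14 ham/spam ratio and prints the spam share in each split. The labels and placeholder features are made up for illustration; only `train_test_split` and its `stratify` argument come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 860 ham (0) and 140 spam (1), i.e. 14% spam
y = np.array([0] * 860 + [1] * 140)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# With stratify=y, both splits keep (approximately) the original spam share
print("spam share overall:", y.mean())
print("spam share in train:", y_tr.mean())
print("spam share in test:", y_te.mean())
```

Without `stratify`, the spam share in each split would vary from one random split to the next.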

* #####  Now let us check the individual spam/ham words

In [0]:
spam_messages = messages[messages["category"] == "spam"]["text"]
ham_messages = messages[messages["category"] == "ham"]["text"]

spam_words = []
ham_words = []

# Since we are only classifying messages as spam or ham, we can filter tokens with isalpha().
# Note that this also drops contraction fragments such as the "n't" in "can't".
# In a sentiment analysis setting, it is better to strip punctuation with
# sentence.translate(str.maketrans("", "", chars_to_remove)) instead.

# Build the stop-word set once instead of rebuilding the list for every token
stop_words = set(stopwords.words("english"))

def extractSpamWords(spamMessages):
    global spam_words
    words = [word.lower() for word in word_tokenize(spamMessages) if word.lower() not in stop_words and word.lower().isalpha()]
    spam_words = spam_words + words
    
def extractHamWords(hamMessages):
    global ham_words
    words = [word.lower() for word in word_tokenize(hamMessages) if word.lower() not in stop_words and word.lower().isalpha()]
    ham_words = ham_words + words

spam_messages.apply(extractSpamWords)
ham_messages.apply(extractHamWords)

In [0]:
print("Total Messages:" , len(ham_messages) + len(spam_messages))

In [0]:
# Top 10 spam words
spam_words = np.array(spam_words)
print("Top 10 Spam words are :\n")
pd.Series(spam_words).value_counts().head(n = 10)

In [0]:
# Top 10 Ham words
ham_words = np.array(ham_words)
print("Top 10 Ham words are :\n")
pd.Series(ham_words).value_counts().head(n = 10)

* #### Does the length of the message tell us anything?

In [0]:
messages["messageLength"] = messages["text"].apply(len)
messages["messageLength"].describe()

In [0]:
f, ax = plt.subplots(2, 1, figsize = (6, 10))

sns.distplot(messages[messages["category"] == "spam"]["messageLength"], bins = 20, ax = ax[0])
ax[0].set_xlabel("Spam Message Length (characters)")

sns.distplot(messages[messages["category"] == "ham"]["messageLength"], bins = 20, ax = ax[1])
ax[1].set_xlabel("Ham Message Length (characters)")

plt.show()

**It can be observed that spam messages are usually longer, which could be a feature for predicting whether a message is spam or ham. Right?**

#### TEXT TRANSFORMATION

#### Let's clean our data by removing punctuation and stopwords and by stemming words
* __Stemming__ reduces related words to a common stem, e.g., fish and fishes become 'fish'.
* __Stop words__ are commonly used words that are unlikely to be of any benefit in natural language processing. These include words such as ‘a’, ‘the’, ‘is’.

More references: https://pythonhealthcare.org/2018/12/14/101-pre-processing-data-tokenization-stemming-and-removal-of-stop-words/
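
To make the cleaning steps concrete, here is a small standard-library sketch of punctuation and stop-word removal on a single made-up message. The stop-word list `TOY_STOPWORDS` and the helper `clean` are hypothetical and deliberately tiny; NLTK's English stop-word list is much longer. Stemming is left out of this sketch.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only
TOY_STOPWORDS = {"a", "the", "is", "to", "you"}

def clean(text):
    # Remove punctuation characters, then drop stop words (case-insensitive)
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w.lower() for w in no_punct.split() if w.lower() not in TOY_STOPWORDS]

cleaned = clean("Congratulations! You have won a FREE prize, call now.")
print(cleaned)
# ['congratulations', 'have', 'won', 'free', 'prize', 'call', 'now']
```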

In [0]:
from nltk.stem import SnowballStemmer

# Create the stemmer and the stop-word set once and reuse them for every message
snowball = SnowballStemmer("english")
english_stopwords = set(stopwords.words('english'))

def stemmer(text):
    # Stem each whitespace-separated token and rejoin the tokens with spaces
    words = ""
    for i in text.split():
        words += snowball.stem(i) + " "
    return words

def puncStopW(text):
    # Strip punctuation, then drop stop words
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = [word for word in text.split() if word.lower() not in english_stopwords]
    return " ".join(text)

messages["text"] = messages["text"].apply(stemmer)
messages["text"] = messages["text"].apply(puncStopW)
messages.head(n = 10)     # Compare the original text with the filtered text to see the difference

##### Convert the clean text into a feature representation

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
features_np = vec.fit_transform(messages["text"]).toarray()  # converting to array
print(features_np.shape)
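
As an illustration of the bag-of-words format, the sketch below runs `CountVectorizer` on three made-up messages. Each row of the result is one message and each column counts one vocabulary word. The toy documents and the names `toy_vec`, `counts`, and `vocab` are assumptions for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now", "call me now", "free free call"]  # hypothetical messages
toy_vec = CountVectorizer()
counts = toy_vec.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(vocab)   # ['call', 'free', 'me', 'now', 'prize']
print(counts)  # the third row is [1, 2, 0, 0, 0]: "call" once, "free" twice
```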

## 2. MODEL APPLICATION

In this section, you will implement a Naive Bayes classifier, apply it to the input data, and predict whether a given message is spam or ham. 

### 2.1 First, convert the category of spam and ham messages into 1 and 0, respectively. Then split the data into a training set and a test set 

In [0]:
print(messages["category"])
def encodeCategory(cat):
    if cat == "spam":
        return 1
    else:
        return 0
       
messages["category"] = messages["category"].apply(encodeCategory)

# convert dataframe to numpy array
messages_np = messages["category"].to_numpy()
print(messages_np)


In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_np, messages_np, stratify = messages_np, test_size = 0.3, random_state=1)

In [0]:
print("The size of ham messages:", len(ham_messages))
print("The size of spam messages:", len(spam_messages))
print("The size of total samples:", len(messages["category"]))
print("The size of training samples:", X_train.shape)
print("The size of testing samples:", X_test.shape)
print("The size of spam messages in training samples:", len(y_train[y_train==1]))
print("The size of ham messages in training samples:", len(y_train[y_train==0]))

### 2.2 Naive Bayes Classifier Implementation

#### 1) Sort data into two classes: spam and non-spam

In [0]:
Xy0 = X_train[y_train == 0]
Xy1 = X_train[y_train == 1]
print("The size of ham samples in training data:", Xy0.shape)
print("The size of spam samples in training data:", Xy1.shape)

#### 2) Calculate conditional Probability, prior probability (Feel free to build functions and use multiple cells to complete this step)
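
For reference, one common way to write the quantities computed in this step is given below. The notation is an assumption introduced here: $N_c$ is the number of training messages in class $c$, $N$ the total number of training messages, $n_{jc}$ the total count of word $j$ over all training messages of class $c$, and $|V|$ the vocabulary size. The "+1" in the numerator is Laplace (add-one) smoothing, which keeps the probability of a word that never appears in a class from being zero.

```latex
P(c) = \frac{N_c}{N},
\qquad
P(w_j \mid c) = \frac{n_{jc} + 1}{\sum_{k=1}^{|V|} n_{kc} + |V|},
\qquad c \in \{\text{ham}, \text{spam}\}
```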

In [0]:
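# 1) Prior probability of each class: fraction of training messages in that class
priorHam = Xy0.shape[0] / X_train.shape[0]
priorSpam = Xy1.shape[0] / X_train.shape[0]

# Total number of word occurrences per class: word count per message, summed over all messages
totalRowHam = np.sum(Xy0, axis=1)
wordsHam = np.sum(totalRowHam, axis=0) 

totalRowSpam = np.sum(Xy1, axis=1)
wordsSpam = np.sum(totalRowSpam, axis=0) 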

In [0]:
# 2.0) probability of each word given a specific class, with Laplace (add-one) smoothing
def conditional(Spam, Ham):
    # Occurrences of each word across the messages of each class
    countSpam = np.sum(Spam, axis=0)
    countHam = np.sum(Ham, axis=0)
    vocabSize = Spam.shape[1]
    probabilityS = (countSpam + 1) / (np.sum(countSpam) + vocabSize)
    probabilityH = (countHam + 1) / (np.sum(countHam) + vocabSize)
    return probabilityS, probabilityH

In [0]:
# 2.1) get probability
ConditionalSpam, ConditionalHam = conditional(Xy1, Xy0)
print(ConditionalHam)
print(ConditionalSpam)

#### 3) Classify test examples

In [0]:
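# Work in log space: products of many small probabilities can underflow to 0
logCondSpam = np.log(ConditionalSpam)
logCondHam = np.log(ConditionalHam)
posteProb_spam = np.array([])
posteProb_ham = np.array([])
for i in range(0, X_test.shape[0]):
    # The scores are recomputed for every message rather than accumulated across messages.
    # Each word contributes count * log P(word | class)
    prob_s = np.dot(X_test[i], logCondSpam)
    prob_h = np.dot(X_test[i], logCondHam)
    # Unnormalized log posterior = log likelihood + log prior
    posteProb_spam = np.append(posteProb_spam, prob_s + np.log(priorSpam))
    posteProb_ham = np.append(posteProb_ham, prob_h + np.log(priorHam))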
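
The decision rule can be checked by hand on a toy example. The sketch below assumes a hypothetical 3-word vocabulary with made-up smoothed word probabilities and priors, and compares log posteriors (log prior plus the count-weighted sum of log word probabilities). All names and numbers are illustrative and are not taken from the dataset.

```python
import numpy as np

# Hypothetical smoothed word probabilities for a 3-word vocabulary
cond_spam = np.array([0.6, 0.3, 0.1])
cond_ham = np.array([0.1, 0.3, 0.6])
prior_spam, prior_ham = 0.14, 0.86

# Two toy messages as word-count vectors
X_toy = np.array([[3, 1, 0],
                  [0, 1, 2]])

# Log posterior up to a shared constant: log prior + sum of count * log P(word | class)
log_spam = np.log(prior_spam) + X_toy @ np.log(cond_spam)
log_ham = np.log(prior_ham) + X_toy @ np.log(cond_ham)

pred = (log_spam > log_ham).astype(int)  # 1 = spam, 0 = ham
print(pred)  # [1 0]
```

The first message repeats a word that is far more likely under spam, which outweighs the small spam prior; the second is dominated by a ham-typical word.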

In [0]:
# Vectorized alternative: per-message likelihoods under each class.
# Raising each word probability to the word count leaves absent words (count 0) as a factor of 1.
# These raw products may underflow for very long messages.
a = np.power(ConditionalSpam, X_test)
b = np.power(ConditionalHam, X_test)
w1 = np.prod(a, axis=1)   # P(message | spam) for every test message
w2 = np.prod(b, axis=1)   # P(message | ham) for every test message

# Element-wise decision: 1 (spam) where the spam score is larger, else 0 (ham)
preAcc = lambda x, y: np.where(x > y, 1, 0)
#zx = preAcc(w1 * priorSpam, w2 * priorHam)
print(w1)


In [0]:
# Number of word occurrences in each test message
totalRowTest = np.sum(X_test, axis=1)
totalRowTest[0:50]

In [0]:
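# Predict spam (1) when a message's spam posterior exceeds its ham posterior, else ham (0)
temp = np.array([])
for i in range(0, X_test.shape[0]):
    if (posteProb_spam[i] > posteProb_ham[i]):
        temp = np.append(temp, 1)
    else:
        temp = np.append(temp, 0)

# Accuracy = fraction of test messages whose prediction matches the true label
accuracy = 0
for i in range(0, X_test.shape[0]):
    if (temp[i] == y_test[i]):
        accuracy += 1 
        
print(accuracy / y_test.shape[0])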

#### 4) Compute prediction accuracy
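
Because about 86% of the messages are ham, a model that predicts ham for everything would already reach roughly 86% accuracy, so it can help to look at precision and recall for the spam class as well. The sketch below computes these with NumPy on hypothetical labels and predictions; the arrays `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1])

accuracy = np.mean(y_true == y_pred)          # fraction of correct predictions
tp = np.sum((y_true == 1) & (y_pred == 1))    # spam correctly flagged
fp = np.sum((y_true == 0) & (y_pred == 1))    # ham wrongly flagged as spam
fn = np.sum((y_true == 1) & (y_pred == 0))    # spam that was missed

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
print(accuracy, precision, recall)
```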

<h2><center>Enjoy!</center></h2>

### <center>Acknowledgement</center>
**Section 1** of this exercise is adapted from an online source: https://www.kaggle.com/ishansoni/sms-spam-collection-dataset

The original dataset can be found here (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). More references can be found at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/, and a comprehensive study of this corpus is presented in the following paper. 

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.