# COMP5318 - Machine Learning and Data Mining 

## Tutorial 6 - Naive Bayes

**Semester 2, 2019**

**Objectives:**

* To learn about bag-of-words features and the naïve Bayes classifier.

**Instructions:**
* Complete the exercises in an IPython (Jupyter) notebook, using either: 
   * Jupyter Notebook installed on your computer http://jupyter.org/install (you need to have Python installed first https://docs.python.org/3/using/index.html )
   * A web-based notebook service such as Google Colaboratory https://colab.research.google.com/ 
   
* If you are using Jupyter installed on your computer, go to File->Open, then drag and drop the "lab6.ipynb" file onto the home interface and click Upload. 
* If you are using Google Colaboratory, click File->Upload notebook and upload the "lab6.ipynb" file.
* Complete exercises in "lab6.ipynb".
* To run the cell you can press Ctrl-Enter or hit the Play button at the top.
* Complete all exercises marked with **TODO**.
* Save your file when you are done with the exercises, so you can show your tutor next week.

Lecturers: Nguyen Hoang Tran

Tutors: Fengxiang He, Shaojun Zhang, Fangzhou Shi, Yang Lin, Iwan Budiman, Zhiyi Wang, Canh Dinh, Yixuan Zhang, Rui Dong, Haoyu He, Dai Hoang Tran, Peibo Duan

**Suppose we have a dataset of SMS messages, each labelled as spam or ham (not spam). In this tutorial we will build a Naive Bayes model to classify SMS messages as spam or ham.** 

Read and view the dataset

In [1]:
import pandas as pd
import numpy as np
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# Read a comma-separated values (csv) file into a DataFrame
df = pd.read_csv('data/SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])
#TODO: output the first 5 rows of the DataFrame
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Process the data set

We need to transform the labels to binary values so that the classifier can work with numeric class labels. Here 1 = "spam" and 0 = "ham".

In [2]:
#Series.map replaces each value in a df column, here by looking it up in a dictionary.
#TODO: map the label to binary values where 1 = "spam" and 0 = "ham"
df['label'] = df.label.map({'ham':0, 'spam':1})
#TODO: output the first 5 rows of the DataFrame
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


## 1. Naïve Bayes Classifier (NBC)

The main assumption in naïve Bayes is ***conditional independence***: given the class, the features are assumed to be independent of one another.

Given a random vector $X \in \mathbb{R}^K$ and a class label $Y \in [C] = \{1, \dots, C\}$, the Naive Bayes model defines the joint distribution:
\begin{equation}
 P(X = \textbf{x}, Y = c) = P(Y = c)P(X= \textbf{x}|Y = c) = P(Y=c)\prod_{k=1}^K P(x_k|Y=c)
\end{equation}

Given an unlabeled point $\textbf{x} = \{x_k, k = 1,2,\dots,K\}$ we label $\textbf{x}$ by finding:
\begin{equation}
\hat{Y} = \arg\max_{c \in [C]} P(Y = c|\textbf{x}) = \arg\max_{c \in [C]} \frac{P(\textbf{x}|Y = c)P(Y = c)}{P(\textbf{x})} = \arg\max_{c \in [C]} P(\textbf{x}|Y = c)P(Y = c)
\end{equation}

The last equality holds because $P(\textbf{x})$ does not depend on $c$.

Taking logarithms does not change the maximizer, so we can equivalently write this in log-likelihood form:
\begin{align}
\label{log}
\hat{Y} = \arg\max_{c \in [C]} \left[\log P(Y=c) + \sum_{k=1}^K \log P(x_k|Y=c)\right]
\end{align}
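
As a rough illustration of why the log form is preferred in practice, the sketch below scores one message under both forms. All probabilities are made-up numbers chosen for illustration (they are not estimated from the SMS dataset), and the helper names `joint` and `log_joint` are hypothetical.

```python
import math  # math.prod requires Python 3.8+

# Hypothetical numbers, chosen only for illustration
prior = {"ham": 0.87, "spam": 0.13}           # P(Y = c)
likelihood = {"ham": [0.01, 0.02, 0.05],      # P(x_k | Y = ham) for 3 words
              "spam": [0.20, 0.10, 0.02]}     # P(x_k | Y = spam) for 3 words

def joint(c):
    """Product form: P(Y = c) * prod_k P(x_k | Y = c)."""
    return prior[c] * math.prod(likelihood[c])

def log_joint(c):
    """Log form: log P(Y = c) + sum_k log P(x_k | Y = c)."""
    return math.log(prior[c]) + sum(math.log(p) for p in likelihood[c])

# Both forms select the same class
pred_product = max(prior, key=joint)
pred_log = max(prior, key=log_joint)
print(pred_product, pred_log)

# With many features the raw product underflows, while the log-sum stays finite
tiny_product = math.prod([0.01] * 200)
tiny_log_sum = sum(math.log(0.01) for _ in range(200))
print(tiny_product, tiny_log_sum)
```

Both forms agree on the predicted class here, but multiplying hundreds of small probabilities quickly reaches zero in floating point, which is the usual reason for working with sums of logs.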

### 1.1 Bag of Words
Since we're dealing with text data and the naive Bayes classifier needs numerical inputs, we have to transform the messages into numbers. To do this we'll use the ["bag of words"](https://en.wikipedia.org/wiki/Bag-of-words_model) method, which counts how often each word occurs in a document. Note: the bag of words method gives equal weight to every word in the "bag" and ignores the order in which the words occur.
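
To make the "order is ignored" point concrete, here is a small sketch using two made-up messages that contain the same words in a different order. Under a bag-of-words representation they are indistinguishable.

```python
from collections import Counter

# Two hypothetical messages with the same words in a different order
message_a = "win money now"
message_b = "now money win"

bag_a = Counter(message_a.split())
bag_b = Counter(message_b.split())

# The word counts are identical, so the two messages get the same features
print(bag_a == bag_b)
```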

**1.1.1 Bag of Words from scratch**

There are modules that will do this for us, but we will first implement bag of words from scratch on a few simple documents to understand what's happening under the hood. In the next section we will learn how to use an existing library.

The steps are as follows:

1. Convert all text to lowercase. 
2. Remove punctuation from sentences. 
3. Break on each word. 
4. Count the frequency of each word.

In [3]:
import string #punctuation
import pprint
from collections import Counter #frequencies

#Bag of Words from scratch
# Assume we have a few sample documents
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

lower_case_documents = []

for i in documents:
    lower_case_documents.append(i.lower())
print ("lower case:", lower_case_documents)

# Remove punctuation.
sans_punctuation_documents = []

for i in lower_case_documents:
    # Keep every character that is not a punctuation mark
    sans_punctuation_documents.append("".join(j for j in i if j not in string.punctuation))
print("no punctuation:", sans_punctuation_documents)

#Break each sentence into words using the list of lowercase sentences with punctuation removed as input
preprocessed_documents = []

for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' ')) #split on space
print("break words:", (preprocessed_documents))

#Count frequency of words using counter
frequency_list = []

#Use Counter function to count the frequency of words in each sentence.
for i in preprocessed_documents:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)
# pprint.pprint prints directly and returns None, so call it on its own line
print("tokenized counts:")
pprint.pprint(frequency_list)

lower case: ['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']
no punctuation: ['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']
break words: [['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
tokenized counts:
[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]


**1.1.2 Using SciKit-Learn**

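That was pretty simple, but scikit-learn makes the above process even easier. Let's try it using the CountVectorizer class from the sklearn.feature_extraction.text module. By default it lowercases the text and treats punctuation as a separator between words, so it covers the same steps as above.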

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
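count_vector = CountVectorizer() # create the vectorizer

count_vector.fit(documents) # learn the vocabulary from the documents
# List the learned vocabulary. In scikit-learn 1.0 and later, use get_feature_names_out()
# instead; get_feature_names() was deprecated in 1.0 and removed in 1.2.
count_vector.get_feature_names()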

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

Create an array where each row represents one of the 4 documents and each column holds the count of one vocabulary word within that document.



In [5]:
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

Convert the array to a data frame and apply get_feature_names as the column names.

In [6]:
frequency_matrix = pd.DataFrame(doc_array,columns = count_vector.get_feature_names())
frequency_matrix

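| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |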


### 1.2 Working on the sms spam and ham dataset

We'll split our dataset into training and testing sets using scikit-learn's train_test_split function, so we can estimate the model's accuracy on data it hasn't been trained on. With the default settings, 25% of the rows are held out for testing.


In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)

print("Our original set contains", df.shape[0], "observations")
print("Our training set contains", X_train.shape[0], "observations")
print("Our testing set contains", X_test.shape[0], "observations")

Our original set contains 5572 observations
Our training set contains 4179 observations
Our testing set contains 1393 observations


Fit the CountVectorizer on the training data only, then use the fitted vocabulary to transform both the training and testing data into count matrices.

In [8]:
train = count_vector.fit_transform(X_train).toarray()
test = count_vector.transform(X_test).toarray()
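
The sketch below illustrates, on made-up documents, what happens when a fitted vectorizer transforms text containing words it never saw during fitting. The variable names and example messages are hypothetical; only `fit_transform`, `transform`, and the `vocabulary_` attribute of CountVectorizer are used.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test messages
train_docs = ["win money now", "call me now"]
test_docs = ["win a free prize now"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs).toarray()  # learns the vocabulary
test_counts = vectorizer.transform(test_docs).toarray()        # reuses that vocabulary

# vocabulary_ maps each word to its column index; sort the words by column
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(test_counts)
```

Words such as "free" and "prize" do not appear in the training vocabulary, so they are silently dropped from the test vector. This keeps the training and testing matrices in the same column layout, which the classifier relies on.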

**1.2.1 Implementing Multinomial Naive Bayes from Scratch using the NB definition**

In [9]:
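class MultinomialNB(object):
    def __init__(self, alpha= 1.0):
        self.alpha = alpha # Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing)

    def fit(self, X, y):
        count_sample = X.shape[0]
        self.classes_ = np.unique(y) # sorted class labels
        # Group the training rows by class
        separated = [[x for x, t in zip(X, y) if t == c] for c in self.classes_]
        # log P(Y = c): fraction of training samples in each class
        self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
        # Total count of each word per class, plus the smoothing term
        count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
        # log P(x_k | Y = c): normalise each class row so it sums to 1
        self.feature_log_prob_ = np.log(count / count.sum(axis=1)[np.newaxis].T)
        return self

    def predict_log_proba(self, X): # apply log-likelihood notation
        # Unnormalised log posterior per class:
        # log P(Y = c) + sum over words of (word count * log P(word | Y = c))
        return [(self.feature_log_prob_ * x).sum(axis=1) + self.class_log_prior_
                for x in X]

    def predict(self, X):
        # Return the label of the highest-scoring class for each sample
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]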
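
The `fit` method above estimates each word probability with additive smoothing, which can be written as $\hat{P}(x_k|Y=c) = \frac{N_{ck} + \alpha}{N_c + \alpha K}$, where $N_{ck}$ is the total count of word $k$ in class $c$, $N_c = \sum_k N_{ck}$, and $K$ is the vocabulary size. The following sketch works through this on a tiny, made-up count table (the words and counts are hypothetical) to show why smoothing matters.

```python
import numpy as np

# Hypothetical word counts summed over all training messages of each class.
# Rows: class 0 (ham), class 1 (spam). Columns: vocabulary ["call", "free", "win"].
counts = np.array([[3, 0, 1],
                   [1, 2, 2]])
alpha = 1.0

# Without smoothing, "free" never occurs in ham, so its probability is exactly 0
unsmoothed = counts / counts.sum(axis=1, keepdims=True)

# With smoothing, every word gets alpha added to its count before normalising
smoothed = counts + alpha
probs = smoothed / smoothed.sum(axis=1, keepdims=True)

print(unsmoothed)
print(probs)
```

A zero probability would give a log of negative infinity, so a single unseen word would rule out a class entirely. With `alpha = 1.0` every word keeps a small non-zero probability and each row still sums to 1.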

**1.2.2 Train the model and evaluate its accuracy**

In [10]:
model = MultinomialNB()
model.fit(train,y_train)
y_pred = model.predict(test)
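
Prediction with the class above loops over the test rows one at a time, which can be slow for a large count matrix. As a possible alternative, the same scores can be computed with a single matrix product. The sketch below checks this equivalence on small, made-up parameters; the arrays are hypothetical and are not taken from the fitted model.

```python
import numpy as np

# Hypothetical fitted parameters for 2 classes and 3 vocabulary words
feature_log_prob = np.log(np.array([[0.5, 0.3, 0.2],
                                    [0.1, 0.3, 0.6]]))
class_log_prior = np.log(np.array([0.8, 0.2]))

# Hypothetical count vectors for 2 messages
X = np.array([[2, 0, 1],
              [0, 1, 3]])

# Row-by-row scoring, in the same style as predict_log_proba above
loop_scores = np.array([(feature_log_prob * x).sum(axis=1) + class_log_prior
                        for x in X])

# Equivalent matrix form: (n_messages, n_words) @ (n_words, n_classes)
matrix_scores = X @ feature_log_prob.T + class_log_prior

predictions = matrix_scores.argmax(axis=1)
print(np.allclose(loop_scores, matrix_scores), predictions)
```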

In [11]:
from sklearn.metrics import accuracy_score
print('accuracy score: {}'.format(accuracy_score(y_test, y_pred)))

accuracy score: 0.9885139985642498