# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [22]:
import pandas as pd
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'text'])
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [31]:
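# Keep every spam message (the minority class)
spam = df[df['label']=='spam']
# Keep only the first len(spam) ham messages so both classes are the same size
ham = df[df['label']=='ham'][0:len(spam)]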

In [35]:
len(spam) == len(ham)

True

In [39]:
df = pd.concat([spam, ham])
df

Unnamed: 0,label,text
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
5,spam,FreeMsg Hey there darling it's been 3 week's n...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...
11,spam,"SIX chances to win CASH! From 100 to 20,000 po..."
...,...,...
883,ham,I love to give massages. I use lots of baby oi...
884,ham,Dude we should go sup again
885,ham,Yoyyooo u know how to change permissions for a...
886,ham,Gibbs unsold.mike hussey


In [88]:
# Class priors P(class): the share of each label in the balanced dataset
p_classes = dict(df['label'].value_counts(normalize=True))
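
Taking the first `len(spam)` ham rows keeps whatever order the file happens to have. A random sample is one alternative. The sketch below illustrates it on a small made-up DataFrame; the `toy` data and the `random_state` value are assumptions for illustration, not part of the lab's dataset.

```python
import pandas as pd

# Hypothetical toy data: 6 ham messages and 2 spam messages
toy = pd.DataFrame({
    'label': ['ham'] * 6 + ['spam'] * 2,
    'text': ['see you soon', 'ok lar', 'on my way', 'call me later',
             'lunch today?', 'thanks a lot', 'WIN a FREE prize', 'claim cash now'],
})

spam_toy = toy[toy['label'] == 'spam']
# Draw as many ham rows as there are spam rows, at random but reproducibly
ham_toy = toy[toy['label'] == 'ham'].sample(n=len(spam_toy), random_state=0)

balanced = pd.concat([spam_toy, ham_toy])
# Class priors of the balanced data: each class has a share of 0.5
priors = dict(balanced['label'].value_counts(normalize=True))
print(priors)
```

With equal class sizes the priors are both 0.5, so they do not favor either class and the word likelihoods alone decide the prediction.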

## Train-test split

Now implement a train-test split on the dataset: 

In [40]:
from sklearn.model_selection import train_test_split

X = df['text']
y = df['label']
# Default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Recombine features and labels so each split can be filtered by class
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
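
`train_test_split` shuffles at random, so each run produces a different split. As a hedged sketch, passing `random_state` makes the split reproducible and `stratify` keeps the class proportions equal in both parts. The toy lists and the particular values `test_size=0.25` and `random_state=42` are assumptions chosen for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 spam and 4 ham messages
texts = ['win cash', 'free prize', 'claim now', 'txt stop',
         'see you', 'ok lar', 'on my way', 'call me']
labels = ['spam'] * 4 + ['ham'] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,     # hold out a quarter of the messages
    stratify=labels,    # preserve the spam/ham ratio in both parts
    random_state=42,    # fixed seed so the split is reproducible
)

print(Counter(y_tr), Counter(y_te))
```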

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [56]:
spam_freq = {}
for row in spam['text']:
    word_list = row.split()
    for word in word_list:
        if word in spam_freq:
            spam_freq[word] += 1
        else:
            spam_freq[word] = 1

In [58]:
import operator
sorted(spam_freq.items(), key=operator.itemgetter(1), reverse=True)

[('to', 607),
 ('a', 360),
 ('your', 187),
 ('call', 185),
 ('or', 185),
 ('the', 178),
 ('2', 169),
 ('for', 169),
 ('you', 164),
 ('is', 143),
 ('Call', 136),
 ('on', 136),
 ('have', 128),
 ('and', 119),
 ('from', 116),
 ('ur', 107),
 ('with', 101),
 ('&', 98),
 ('4', 93),
 ('of', 93),
 ('FREE', 89),
 ('mobile', 81),
 ('You', 77),
 ('are', 77),
 ('our', 76),
 ('To', 73),
 ('claim', 73),
 ('Your', 71),
 ('U', 70),
 ('txt', 68),
 ('text', 68),
 ('in', 64),
 ('now', 64),
 ('Txt', 63),
 ('reply', 58),
 ('free', 56),
 ('contact', 56),
 ('-', 55),
 ('now!', 49),
 ('be', 48),
 ('just', 48),
 ('u', 47),
 ('send', 46),
 ('this', 46),
 ('won', 45),
 ('get', 45),
 ('only', 45),
 ('Nokia', 45),
 ('prize', 44),
 ('per', 44),
 ('STOP', 44),
 ('been', 43),
 ('service', 43),
 ('who', 43),
 ('Reply', 42),
 ('new', 42),
 ('cash', 42),
 ('out', 40),
 ('Text', 39),
 ('will', 39),
 ('This', 39),
 ('stop', 38),
 ('awarded', 37),
 ('We', 36),
 ('Free', 35),
 ('Please', 34),
 ('by', 34),
 ('£1000', 33),
 ...]

In [59]:
ham_freq = {}
for row in ham['text']:
    word_list = row.split()
    for word in word_list:
        if word in ham_freq:
            ham_freq[word] += 1
        else:
            ham_freq[word] = 1

In [60]:
import operator
sorted(ham_freq.items(), key=operator.itemgetter(1), reverse=True)

[('you', 270),
 ('to', 245),
 ('I', 235),
 ('the', 178),
 ('a', 170),
 ('and', 127),
 ('in', 122),
 ('i', 121),
 ('my', 118),
 ('is', 104),
 ('of', 88),
 ('u', 85),
 ('me', 79),
 ('for', 79),
 ('that', 71),
 ('have', 70),
 ('your', 66),
 ('on', 59),
 ('are', 57),
 ('not', 52),
 ('it', 51),
 ('so', 50),
 ('be', 50),
 ('with', 45),
 ('at', 45),
 ('will', 44),
 ("I'm", 43),
 ('can', 43),
 ('You', 40),
 ('but', 39),
 ('get', 39),
 ('like', 38),
 ('call', 38),
 ('if', 36),
 ('or', 36),
 ('am', 36),
 ('&lt;#&gt;', 36),
 ('know', 35),
 ('got', 34),
 ('U', 34),
 ('when', 34),
 ('out', 34),
 ('we', 33),
 ('just', 33),
 ('all', 31),
 ('go', 30),
 ('this', 28),
 ('up', 26),
 ('need', 26),
 ('want', 25),
 ('come', 25),
 ('...', 24),
 ('2', 23),
 ('going', 23),
 ('?', 23),
 ('4', 23),
 ('been', 22),
 ('about', 22),
 ("I'll", 22),
 ('still', 22),
 ('was', 22),
 ('do', 22),
 ('as', 21),
 ('So', 21),
 ('But', 21),
 ('then', 20),
 ('ur', 20),
 ('there', 19),
 ('think', 19),
 ('love', 19),
 ('now', 19),
 ...]

In [112]:
class_word_freq = {}
classes = ['spam', 'ham']

# Iterate through the two classes
for class_ in classes:
    # Subset of the training data whose label matches the class
    # (named class_df so the full df is not overwritten)
    class_df = train_df[train_df['label'] == class_]
    # Empty bag-of-words dict for this class
    bag = {}
    # Iterate through the messages in the class subset
    for row_text in class_df['text']:
        # Split the text into words and increment each word's count
        for word in row_text.split():
            bag[word] = bag.get(word, 0) + 1
    # Store the finished frequency dict for this class
    class_word_freq[class_] = bag
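
The counts above treat `call` and `Call`, or `now` and `now!`, as different words because the text is only split on whitespace. One possible refinement, sketched here with the standard library, lowercases each token and strips surrounding punctuation before counting with `collections.Counter`. The `tokenize` helper and the two example messages are hypothetical, not part of the lab's solution.

```python
from collections import Counter
import string

def tokenize(text):
    """Lowercase each whitespace token and strip leading/trailing punctuation."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    # Drop tokens that were punctuation only and are now empty
    return [tok for tok in tokens if tok]

# Hypothetical example messages
docs = ["Call now! To claim your FREE prize", "call me when you are free"]

# Counter builds the word -> frequency mapping in one pass
freq = Counter(tok for doc in docs for tok in tokenize(doc))
print(freq.most_common(3))
```

If a tokenizer like this is used, it would need to be applied consistently when building the class dictionaries, when computing `V`, and inside `bag_it()`.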

## Count the total corpus words
Calculate V, the number of unique words (the vocabulary size) in the training corpus:

In [113]:
corpus = set()
for text in train_df['text']:
    for word in text.split():
        corpus.add(word)
V = len(corpus)
V

6090

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [114]:
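def bag_it(doc):
    """Return a dict mapping each word in doc to its count."""
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

# Try the helper on a short example document
doc = 'Was the farm open?'
bag = bag_it(doc)
bag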

{'Was': 1, 'the': 1, 'farm': 1, 'open?': 1}

## Implementing Naive Bayes

Now, implement a function that classifies a document with Naive Bayes. Use log probabilities so that multiplying many small probabilities does not underflow to zero.
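
For reference, a common way to write the decision rule with Laplace (add-one) smoothing is shown below. The notation is introduced here for illustration: $n_{d,w}$ is the count of word $w$ in document $d$, $\operatorname{count}(w, c)$ is how often $w$ appears in the training documents of class $c$, $N_c$ is the total number of word occurrences in class $c$, and $V$ is the vocabulary size.

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in d} n_{d,w} \, \log \frac{\operatorname{count}(w, c) + 1}{N_c + V} \right]
```

The $+1$ in the numerator and the $+V$ in the denominator keep a word that never appears in a class from producing a zero probability.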

In [115]:
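import numpy as np

def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    # Bag of words for the document being classified
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        # Start from the log prior of the class
        p = np.log(p_classes[class_])
        # Total number of word occurrences seen in this class
        n_class_words = sum(class_word_freq[class_].values())
        for word in bag.keys():
            # Laplace-smoothed likelihood P(word | class)
            num = class_word_freq[class_].get(word, 0) + 1
            denom = n_class_words + V
            # Add the log-likelihood once per occurrence of the word
            p += bag[word] * np.log(num/denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]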

In [102]:
p_classes

{'spam': 0.5, 'ham': 0.5}
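
To see why log probabilities matter, the following sketch multiplies many small probabilities directly and compares the result with summing their logs. The probability value `1e-3` and the count of 400 words are arbitrary assumptions for illustration.

```python
import math

word_probs = [1e-3] * 400   # hypothetical per-word likelihoods

# Direct product: falls below the smallest representable float and becomes 0.0
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: stays a finite number that can still be compared across classes
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Once the product reaches zero for every class, the classes can no longer be ranked, whereas the log sums remain distinct.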

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [103]:
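# Predict a label for every training message
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
# True where the prediction matches the label, False otherwise
correct = y_train == y_hat_train
# Share of correct vs. incorrect predictions; exact values depend on the random split and your implementation
correct.value_counts(normalize=True)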

True     0.513393
False    0.486607
dtype: float64
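
The cell above scores the classifier on the data it was trained on. A held-out estimate would apply the same `classify_doc` call to `X_test` and compare the result against `y_test`. The sketch below shows one way to summarize such predictions with plain Python; the `accuracy` and `confusion_counts` helpers and the toy label lists are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)

def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) combination."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels; in the lab these would be y_test and, for example,
# X_test.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
y_true = ['spam', 'ham', 'spam', 'ham', 'ham']
y_pred = ['spam', 'ham', 'ham', 'ham', 'spam']

print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

The confusion counts separate spam that slipped through (`('spam', 'ham')`) from legitimate messages flagged as spam (`('ham', 'spam')`), which a single accuracy figure hides.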

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!