# Question 1: Text Classification using Multinomial Naive Bayes

You are required to implement a multinomial naive Bayes classifier to predict whether an SMS is spam.

The training data is a set of SMS messages categorized as `spam` or `ham`.

In [1]:
import pandas as pd

raw_data = pd.read_csv('./asset/data.txt', sep='\t')
raw_data.head()

Unnamed: 0,category,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


To build the unigram model, we first need to tokenize the text. As in the L7 notebook, we use the count of each token (word) in an SMS as its features (i.e., the bag-of-words model). We store the features of each SMS in a `dictionary` and pair that dictionary with the SMS's category.

In [3]:
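def tokenize(sms):
    # Split on single spaces only; punctuation stays attached to the tokens.
    return sms.split(' ')

def get_freq_of_tokens(sms):
    # Map each token to the number of times it occurs in the SMS.
    tokens = {}
    for token in tokenize(sms):
        if token not in tokens:
            tokens[token] = 1
        else:
            tokens[token] += 1
    return tokens

# Each training example is a (token-count dictionary, category) pair.
training_data = []
for index in range(len(raw_data)):
    row = raw_data.iloc[index]
    training_data.append((get_freq_of_tokens(row.text), row.category))
print(training_data)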

[({'bugis': 1, 'in': 1, 'la': 1, 'Available': 1, 'amore': 1, 'crazy..': 1, 'n': 1, 'world': 1, 'buffet...': 1, 'great': 1, 'Cine': 1, 'Go': 1, 'point,': 1, 'there': 1, 'until': 1, 'got': 1, 'jurong': 1, 'wat...': 1, 'only': 1, 'e': 1}, 'ham'), ({'oni...': 1, 'u': 1, 'Joking': 1, 'Ok': 1, 'wif': 1, 'lar...': 1}, 'ham'), ({'2005.': 1, "08452810075over18's": 1, 'wkly': 1, 'apply': 1, 'to': 3, 'question(std': 1, 'FA': 2, '2': 1, 'comp': 1, 'Text': 1, 'tkts': 1, 'final': 1, 'May': 1, 'in': 1, 'win': 1, 'a': 1, '21st': 1, "rate)T&C's": 1, 'receive': 1, 'Free': 1, 'entry': 2, 'txt': 1, '87121': 1, 'Cup': 1}, 'spam'), ({'already': 1, 'U': 2, 'say': 1, 'c': 1, 'say...': 1, 'early': 1, 'dun': 1, 'then': 1, 'hor...': 1, 'so': 1}, 'ham'), ({'he': 2, 'to': 1, 'around': 1, 'though': 1, 'usf,': 1, 'lives': 1, 'think': 1, 'Nah': 1, 'I': 1, 'here': 1, 'goes': 1, "don't": 1}, 'ham'), ({'for': 1, 'Hey': 1, "I'd": 1, 'word': 1, 'to': 2, 'fun': 1, 'some': 1, 'back!': 1, 'it': 1, 'darling': 1, 'and': 1, 'th
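
As a side note, the same token counts can be produced with `collections.Counter` from the standard library. The sketch below is only an illustration; the helper name `get_freq_of_tokens_counter` is hypothetical and is not part of the lab code.

```python
from collections import Counter

def tokenize(sms):
    # Same whitespace tokenizer as in the cell above.
    return sms.split(' ')

def get_freq_of_tokens_counter(sms):
    # Hypothetical alternative to get_freq_of_tokens():
    # Counter tallies the tokens, dict() converts back to a plain dictionary.
    return dict(Counter(tokenize(sms)))

print(get_freq_of_tokens_counter('Ok lar... Joking wif u oni...'))
```

Either version yields one dictionary per SMS, so the rest of the notebook works the same way with both.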

Now you need to **implement** a multinomial naive Bayes classifier (i.e., `multinomial_nb()` in `submission.py`) with add-1 smoothing. The input arguments of `multinomial_nb()` are:
* `training_data`: the pre-processed data built above, i.e., a list of pairs, each holding the token-count `dictionary` of an SMS and its category
* `sms`: a list of tokens from the SMS that you need to classify as ham or spam

The return value of `multinomial_nb()` is the **ratio** of the probability that the SMS is spam to the probability that it is ham. A returned value larger than 1 means the SMS is classified as spam; a value smaller than 1 means it is classified as ham.
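
For reference, the ratio can be written out in the standard textbook form of multinomial naive Bayes with add-1 (Laplace) smoothing. The notation is an assumption introduced here, not part of the lab material: $d$ is the SMS with tokens $t_1, \dots, t_n$, $V$ is the vocabulary of the training data, and $\mathrm{count}(t, c)$ is the number of times token $t$ occurs in training messages of class $c$.

$$
\frac{P(\text{spam} \mid d)}{P(\text{ham} \mid d)}
= \frac{P(\text{spam}) \prod_{i=1}^{n} P(t_i \mid \text{spam})}
       {P(\text{ham}) \prod_{i=1}^{n} P(t_i \mid \text{ham})},
\qquad
P(t \mid c) = \frac{\mathrm{count}(t, c) + 1}{\sum_{t' \in V} \mathrm{count}(t', c) + |V|}
$$

The normalizing term $P(d)$ appears in both posteriors and cancels, so it never has to be computed.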

For example, your implementation of `multinomial_nb()` should behave as follows:

In [3]:
import submission

sms = 'I am not spam'
print(submission.multinomial_nb(training_data, tokenize(sms)))

0.23427672956
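
The value above comes from the lab's own data and reference solution. The following is only a minimal sketch of one way such a ratio could be computed, run on a made-up toy data set. The function name `multinomial_nb_sketch`, the toy data, and the decision to skip tokens that never occur in the training data are all assumptions; the reference solution may treat unseen tokens differently, so this sketch is not guaranteed to reproduce the number above.

```python
import math
from collections import Counter

def multinomial_nb_sketch(training_data, sms):
    # Gather per-class message counts, per-class token counts, and the vocabulary.
    doc_counts = Counter()
    token_counts = {'spam': Counter(), 'ham': Counter()}
    vocabulary = set()
    for freqs, category in training_data:
        doc_counts[category] += 1
        token_counts[category].update(freqs)  # adds the counts token by token
        vocabulary.update(freqs)              # adds the dictionary keys

    def log_score(category):
        # log prior + sum of log add-1 smoothed likelihoods
        total_tokens = sum(token_counts[category].values())
        score = math.log(doc_counts[category] / len(training_data))
        for token in sms:
            if token not in vocabulary:
                continue  # ASSUMPTION: tokens unseen in training are ignored
            smoothed = (token_counts[category][token] + 1) / (total_tokens + len(vocabulary))
            score += math.log(smoothed)
        return score

    # Work in log space to avoid underflow, then convert back to a ratio.
    return math.exp(log_score('spam') - log_score('ham'))

# Made-up toy data in the same (token-count dictionary, category) format.
toy_training_data = [
    ({'win': 1, 'cash': 1}, 'spam'),
    ({'hello': 1, 'friend': 1}, 'ham'),
    ({'hello': 1, 'cash': 1}, 'ham'),
]
print(multinomial_nb_sketch(toy_training_data, ['win', 'cash']))  # about 1.78, i.e. spam
print(multinomial_nb_sketch(toy_training_data, ['hello']))        # about 0.22, i.e. ham
```

Summing logarithms instead of multiplying probabilities is a common design choice here: long messages would otherwise multiply many small numbers and could underflow to zero.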


# Submission

You need to complete the `multinomial_nb()` function in `submission.py` and submit your code to the online submission system. As in the previous labs, you will receive your score within a few minutes.

You may submit as many times as you like.

# Test Environment

We have pre-installed the following modules. You may use only these modules, together with Python's built-in modules and functions.
* python: 3.5.2
* pandas: 0.19.2

NOTE: You need to implement the classifier yourself, so you cannot import sklearn in Lab3.