# COMP9318 Lab3

## Instructions
* This notebook contains instructions for **COMP9318-lab3**.

* You are required to complete your implementation in a file `submission.py` provided along with this notebook.

* Do not print unnecessary output. We will ignore anything printed to the screen; all results must be returned in the appropriate data structures via the corresponding functions.

* You can submit your implementation for **lab3** via the following link: https://kg.cse.unsw.edu.au:8318/lab3/ .

* For each question, we have provided detailed instructions along with question headings. If you run into any problem, you can post your query on Piazza.


* You are allowed to add other functions and/or import modules (you may have to in this lab), but you are not allowed to define global variables. **Only functions are allowed** in `submission.py`. 

* Do not import unnecessary modules/libraries. If a module cannot be imported at test time, your submission will fail with an error.

* We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day. 

* For the **Final Evaluation** we will use a different dataset, so your final scores may vary.

* You are allowed to submit as many times as you want before the deadline, but **ONLY the latest version will be kept and marked**.

* The submission deadline for this assignment is **23:59:59 on 2nd May, 2018**. We will **not** accept any late submissions.

# Question-1: Text Classification using Multinomial Naive Bayes

You are required to implement a multinomial Naive Bayes classifier to detect spam SMS messages.

The training data is a set of SMS messages categorised as `spam` or `ham`.

In [1]:
import pandas as pd
import numpy as np
raw_data = pd.read_csv('./asset/data.txt', sep='\t')
raw_data.head()

| Unnamed: 0 | category | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


To implement a unigram model, we first tokenize the text. We use the count of each token (word) in an SMS as a feature (i.e., a bag-of-words representation). For each SMS, we store these features in a `dictionary` together with its category label.
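
As a small illustration of the bag-of-words features described above, the sketch below counts tokens with `collections.Counter`. The helper name `bag_of_words` and the sample message are hypothetical and not part of the lab code; the sketch assumes tokens are separated by single spaces.

```python
from collections import Counter

def bag_of_words(sms):
    """Hypothetical helper: split an SMS on single spaces and count each token."""
    return Counter(sms.split(' '))

# Made-up message used only for illustration.
features = bag_of_words("free entry free prize")
print(features["free"])   # 2
print(features["prize"])  # 1
```

A `Counter` behaves like a dictionary, so it can stand in wherever a plain token-to-count `dict` is expected.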

In [50]:
import pandas as pd
import numpy as np
from math import log
import operator

raw_data = pd.read_csv('./asset/data.txt', sep='\t')

def tokenize(sms):
    # Input that is already a list of tokens is returned unchanged.
    if isinstance(sms, list):
        return sms
    # Otherwise split the raw string on single spaces.
    return sms.split(' ')

def get_freq_of_tokens(sms):
    tokens = {}
    for token in tokenize(sms):
        if token not in tokens:
            tokens[token] = 1
        else:
            tokens[token] += 1
    return tokens

training_data = []
test_data = []
total_samples = len(raw_data)

test_samples_amount = 0
for index in range(total_samples-test_samples_amount):
    training_data.append((get_freq_of_tokens(raw_data.iloc[index].text), raw_data.iloc[index].category))
for index in range(total_samples-test_samples_amount, total_samples):
    test_data.append(raw_data.iloc[index].text)
#training_data[:5]
def multinomial_nb(training_data, sms):
    spam_vector = {}
    spam_sum = 0.0
    spam_rows = 0.0
    ham_vector = {}
    ham_sum = 0.0
    ham_rows = 0.0
    count_vector = {}
    sms_frequency = {}
    vector_length = 0
    p_spam = 0.0
    p_ham = 0.0
    sms_frequency = get_freq_of_tokens(sms)
    #smoothing
    for pair in training_data:
        for feature in pair[0]:
            #smoothing
            spam_vector[feature] = 1.0
            ham_vector[feature] = 1.0
            count_vector[feature] = 0
    vector_length = len(spam_vector)
    #smoothing
    spam_sum += vector_length
    ham_sum += vector_length
    for pair in training_data:
        if pair[1] == 'spam':
            spam_rows += 1
            for feature in pair[0]:
                spam_vector[feature] += pair[0][feature]
                spam_sum += pair[0][feature]
        else:
            ham_rows += 1
            for feature in pair[0]:
                ham_vector[feature] += pair[0][feature]
                ham_sum += pair[0][feature]
    p_spam = spam_rows / (spam_rows + ham_rows)
    p_ham = ham_rows / (spam_rows + ham_rows)
    #print(p_spam, p_ham)
    for token in sms_frequency:
        if token not in count_vector:
            count_vector[token] = 0
        count_vector[token] += sms_frequency[token]
    for feature in spam_vector:
        spam_vector[feature] = spam_vector[feature]/spam_sum
        ham_vector[feature] = ham_vector[feature]/ham_sum
    p_spam = log(p_spam)
    p_ham = log(p_ham)
    stat_ham = {}
    stat_spam = {}
    stat = {}
    for feature in spam_vector:
        stat_spam[feature] = log(spam_vector[feature])
        stat_ham[feature] = log(ham_vector[feature])
        stat[feature] = stat_spam[feature]/stat_ham[feature]
        #print(feature, count_vector[feature])
        p_spam = p_spam + log(spam_vector[feature])*count_vector[feature]
        p_ham = p_ham + log(ham_vector[feature])*count_vector[feature]
        #print('pham', p_ham)
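    # spam: 1, ham: 0
    # NOTE: p_spam and p_ham are log-probabilities at this point, so this value
    # is a ratio of logs, not the probability ratio exp(p_spam - p_ham).
    ratio = p_spam/p_ham

    # Also return the 50 highest-scoring tokens per class and all tokens ranked by stat.
    return (ratio,
            sorted(stat_spam.items(), key=operator.itemgetter(1))[::-1][:50],
            sorted(stat_ham.items(), key=operator.itemgetter(1))[::-1][:50],
            sorted(stat.items(), key=operator.itemgetter(1))[::-1])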


#for index in range(len(test_data)):
#    sms = tokenize(test_data[index])
#    result = multinomial_nb(training_data, sms)
#    print(total_samples - test_samples_amount + index + 4, sms[:3], result, 'spam' if result>1 else 'ham')

sms = "I am not spam"
sms = ['david', 'yeltsin', 'executive', 'powell', 'replacement', 'phone', 'brazilian', 'wimbledon', 'mubarak', 'mr.', 'speech',\
       'm.', 'autonomy', 'jan', 'bhutto', 'cypriot', 'bulgaria', 'lap', 'inspire', 'midfielder']
sms = ['kashmir', 'fail', 'crude', 'october', 'himself', 'sea', 'ship', 'period', 'mogadishu', 'present', 'less', 'poll', 'pct',\
       'east', 'iraq', 'you', 'peninsula', 'niger', 'education', 'whose']
c = multinomial_nb(training_data, tokenize(sms))
print(c[0])
test_data = './asset/modified_data.txt'
with open(test_data,'r') as data:
    i = 0
    res = []
    for line in data:
        result, stat_spam, stat_ham, stat = multinomial_nb(training_data, tokenize(line.strip()))
        if result >=1:
            res.append(1)
            #print(i, '1')
        else:
            res.append(0)
            #print(i, '0')
        i += 1
    spam = [item[0] for item in stat_spam]
    ham = [item[0] for item in stat_ham]
    stat1 = [item[0] for item in stat if item[1]>=1]
    stat0 = [item[0] for item in stat if item[1]<1][::-1]
    print('1:')
    print(stat1)
    print('0:')
    print(stat0)
    score = sum(res)/len(res)
    print('score', score)
        
            

1.206219310002343
1:
0:
['david', 'yeltsin', 'executive', 'powell', 'replacement', 'phone', 'brazilian', 'wimbledon', 'mubarak', 'mr.', 'speech', 'm.', 'autonomy', 'jan', 'bhutto', 'cypriot', 'bulgaria', 'lap', 'inspire', 'midfielder', 'maryland', 'wild', 'maliki', 'confront', 'gyanendra', 'title', 'apparently', 'uribe', 'giant', 'benedict', 'denial', 'insult', 'hosni', 'straight', 'legal', 'sell', 'restore', 'rival', 'rice', 'vice', 'initiate', 'reluctant', 'plea', 'penalty', 'recovery', 'enrich', 'abdullahi', 'veto', 'medal', 'karimov', 'bhutan', 'levee', 'inability', 'really', 'armenia', 'marseille', 'industrialize', 'sick', 'contribution', 'these', 'savic', 'visegrad', 'aggressive', 'stake', 'awareness', 'view', 'society', 'jose', 'internal', 'mbeki', 'puk', 'kyrgyzstan', 'voice', 'universal', 'forum', 'eighth', 'suleiman', 'rush', 'roe', 'reunification', 'performance', 'cancel', 'guy', 'flush', 'meadows', 'huber', 'maleeva', 'atoll', 'constitute', 'wildlife', 'judith', 'wiesner', 

For this lab, you need to **implement** a multinomial Naive Bayes classifier (i.e., `multinomial_nb()` in the file `submission.py`) with add-1 smoothing. The input arguments of `multinomial_nb()` are:
* `training_data`: pre-processed data, stored as a list of `(dictionary, category)` pairs, where each `dictionary` maps a token to its count in one SMS
* `sms`: the test SMS (i.e., a list of tokens) that you need to categorize as `spam` or `ham`
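
For reference, add-1 (Laplace) smoothing for a multinomial model is commonly written as follows. The symbols are notation introduced here rather than taken from the lab material: $V$ is the set of distinct tokens in the training data, $c$ is a class (`spam` or `ham`), and $\mathrm{count}(w, c)$ is the number of times token $w$ occurs in training messages of class $c$.

$$
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

The added 1 in the numerator keeps every estimate strictly positive, and the added $|V|$ in the denominator keeps the estimates summing to 1 over the vocabulary.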

The return value of `multinomial_nb()` should be the **ratio** of the probability that the `sms` is spam to the probability that it is ham. A return value greater than 1 means the `sms` is classified as spam; a value smaller than 1 means it is classified as ham.
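
The ratio described above can be computed in log space to avoid numerical underflow and exponentiated only at the end. The following is a minimal sketch, not the reference solution. It assumes that `training_data` is a list of `(token_counts, category)` pairs, that both classes appear at least once, and that tokens unseen in training are ignored. The name `multinomial_nb_sketch` and the toy data are hypothetical.

```python
from math import exp, log

def multinomial_nb_sketch(training_data, sms):
    """Return P(spam | sms) / P(ham | sms) with add-1 smoothing (illustrative sketch)."""
    vocab = set()
    token_counts = {'spam': {}, 'ham': {}}  # per-class count of each token
    class_tokens = {'spam': 0, 'ham': 0}    # per-class total number of tokens
    class_docs = {'spam': 0, 'ham': 0}      # per-class number of messages

    for counts, category in training_data:
        class_docs[category] += 1
        for token, n in counts.items():
            vocab.add(token)
            token_counts[category][token] = token_counts[category].get(token, 0) + n
            class_tokens[category] += n

    total_docs = class_docs['spam'] + class_docs['ham']
    log_score = {}
    for category in ('spam', 'ham'):
        # Start from the log prior of the class.
        score = log(class_docs[category] / total_docs)
        denominator = class_tokens[category] + len(vocab)
        for token in sms:
            if token not in vocab:
                continue  # assumption: tokens unseen in training are skipped
            score += log((token_counts[category].get(token, 0) + 1) / denominator)
        log_score[category] = score

    # Subtract in log space, then exponentiate to obtain the probability ratio.
    return exp(log_score['spam'] - log_score['ham'])

# Made-up toy data used only for illustration.
toy_training = [
    ({'win': 2, 'cash': 1}, 'spam'),
    ({'hello': 1, 'friend': 1}, 'ham'),
]
print(multinomial_nb_sketch(toy_training, ['win', 'cash']) > 1)  # True
print(multinomial_nb_sketch(toy_training, ['hello']) > 1)        # False
```

Subtracting log scores before exponentiating gives the ratio of the two probabilities, whereas dividing one log score by the other does not.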

A sample output is shown in the cell below:

In [3]:
## How we test your implementation...
import submission_ans as submission

sms = 'I am not spam'
print(submission.multinomial_nb(training_data, tokenize(sms)))

0.2342767295597484


# Submission

You need to complete the function `multinomial_nb()` in the file `submission.py`. You can test your submission against sample test cases using the online submission system (i.e., https://kg.cse.unsw.edu.au:8318/lab3/).


# Test Environment

For testing, we have pre-installed the requisite modules and libraries in the testing environment. You are only allowed to use the following:
* python: 3.5.2
* pandas: 0.19.2

NOTE: You are required to implement the classifier yourself. You are not allowed to import **sklearn** or any other library not listed above in Lab3.

In [17]:
a = {'aaa':1,}
if 'b' not in a:
    print(a)
sms = ['a','b']
if type(sms) is list:
    print('aaaa')

{'aaa': 1}
aaaa
