### Load Dataset

In [1]:
import random
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# Dev Process

Suppose we want to predict the label $y$ of an observation $X=(x_1, x_2)$, where $x_1$ and $x_2$ are the attributes of $X$. Attribute $x_1$ takes values in $A_1 = \{1, 2, 3\}$, attribute $x_2$ takes values in $A_2 = \{S, M, L\}$, and the label $y \in C = \{1, -1\}$. The observed data and labels are shown below.

The goal is to classify the observation $X = (2, S)^T$, for which Multinomial Naive Bayes should predict $y = -1$.

In [2]:
import gen_synthetic as gs

# Test the randomly generated data
x1_dom = [1, 2, 3]
x2_dom = ['S', 'M', 'L']
x_dom = np.array([x1_dom, x2_dom])
y_dom = np.array([['-1', '1'], ])

X_random = gs.gen_naive_bayes_synthetic(x_dom, y_dom, 15)
print 'X_random: \n', X_random

# Test embedded hard code data
X_hard_code = gs.gen_naive_bayes_synthetic([], [], 0, hard_code=True)
print 'X_hard_code: \n', X_hard_code

X_random: 
[[('2', 'L') '-1']
 [('3', 'M') '-1']
 [('1', 'S') '-1']
 [('2', 'S') '-1']
 [('1', 'S') '1']
 [('2', 'M') '-1']
 [('2', 'S') '-1']
 [('2', 'M') '1']
 [('3', 'L') '-1']
 [('2', 'M') '1']
 [('1', 'S') '-1']
 [('1', 'L') '-1']
 [('1', 'L') '-1']
 [('3', 'M') '-1']
 [('2', 'S') '-1']]
X_hard_code: 
[[('1', 'S') '-1']
 [('1', 'M') '-1']
 [('1', 'M') '1']
 [('1', 'S') '1']
 [('1', 'S') '-1']
 [('2', 'S') '-1']
 [('2', 'M') '-1']
 [('2', 'M') '1']
 [('2', 'L') '1']
 [('2', 'L') '1']
 [('3', 'L') '1']
 [('3', 'M') '1']
 [('3', 'M') '1']
 [('3', 'L') '1']
 [('3', 'L') '-1']]


### Naive Bayes without Laplace smoothing

In [3]:
from classifiers import MultinomialNB
mnNB = MultinomialNB()

mnNB.train(x_dom, y_dom, X_random)
df = mnNB.counts_table
print df[df[0] == '1']

    0  1   2
2   1  S  -1
4   1  S   1
10  1  S  -1
11  1  L  -1
12  1  L  -1


In [4]:
mnNB.train(x_dom, y_dom, X_hard_code)
df = mnNB.counts_table
print df[df[0] == '1']
print 'number of above rows: ', len(df[df[0] == '1'])

print df[(df[0] == '1') & (df[1] == 'M')]
print 'number of above rows: ', len(df[(df[0] == '1') & (df[1] == 'M')])

print 'number of non-existing rows:', len(df[df[0] == '7'])

print '\nThe probability table is:'
for k in sorted(mnNB.prob_table):
    print 'k = %s, prob = %f' % (k, mnNB.prob_table[k])

   0  1   2
0  1  S  -1
1  1  M  -1
2  1  M   1
3  1  S   1
4  1  S  -1
number of above rows:  5
   0  1   2
1  1  M  -1
2  1  M   1
number of above rows:  2
number of non-existing rows: 0

The probability table is:
k = -1, prob = 0.400000
k = 1, prob = 0.600000
k = 1|-1, prob = 0.500000
k = 1|1, prob = 0.222222
k = 2|-1, prob = 0.333333
k = 2|1, prob = 0.333333
k = 3|-1, prob = 0.166667
k = 3|1, prob = 0.444444
k = L|-1, prob = 0.166667
k = L|1, prob = 0.444444
k = M|-1, prob = 0.333333
k = M|1, prob = 0.444444
k = S|-1, prob = 0.500000
k = S|1, prob = 0.111111
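
The probability table above can be reproduced by plain relative-frequency counting. The sketch below hard-codes the 15 rows printed for `X_hard_code`; it is not the `MultinomialNB` code from `classifiers`, and the helper name `estimate_tables` is hypothetical.

```python
from collections import Counter

# The 15 (x1, x2, y) rows printed above for X_hard_code.
ROWS = [
    ('1', 'S', '-1'), ('1', 'M', '-1'), ('1', 'M', '1'), ('1', 'S', '1'),
    ('1', 'S', '-1'), ('2', 'S', '-1'), ('2', 'M', '-1'), ('2', 'M', '1'),
    ('2', 'L', '1'), ('2', 'L', '1'), ('3', 'L', '1'), ('3', 'M', '1'),
    ('3', 'M', '1'), ('3', 'L', '1'), ('3', 'L', '-1'),
]


def estimate_tables(rows):
    """Return (prior, cond): prior[y] = P(y), cond[(v, y)] = P(v | y)."""
    class_counts = Counter(y for _, _, y in rows)
    pair_counts = Counter()
    for x1, x2, y in rows:
        # The two attribute domains are disjoint, so (value, label) keys
        # cannot collide between x1 and x2.
        pair_counts[(x1, y)] += 1
        pair_counts[(x2, y)] += 1
    n = float(len(rows))
    prior = dict((y, c / n) for y, c in class_counts.items())
    cond = dict(((v, y), c / float(class_counts[y]))
                for (v, y), c in pair_counts.items())
    return prior, cond


prior, cond = estimate_tables(ROWS)
```

For example, `cond[('S', '1')]` is 1/9, matching the `S|1` entry of the table.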


### Laplace smoothing

In [5]:
mnNB.train(x_dom, y_dom, X_hard_code, laplaceS=1)
df = mnNB.counts_table
print df[df[0] == '1']
print 'number of above rows: ', len(df[df[0] == '1'])

print df[(df[0] == '1') & (df[1] == 'M')]
print 'number of above rows: ', len(df[(df[0] == '1') & (df[1] == 'M')])

print 'number of non-existing rows:', len(df[df[0] == '7'])

print '\nThe probability table is:'
for k in sorted(mnNB.prob_table):
    print 'k = %s, prob = %f' % (k, mnNB.prob_table[k])

   0  1   2
0  1  S  -1
1  1  M  -1
2  1  M   1
3  1  S   1
4  1  S  -1
number of above rows:  5
   0  1   2
1  1  M  -1
2  1  M   1
number of above rows:  2
number of non-existing rows: 0

The probability table is:
k = -1, prob = 0.400000
k = 1, prob = 0.600000
k = 1|-1, prob = 0.444444
k = 1|1, prob = 0.250000
k = 2|-1, prob = 0.333333
k = 2|1, prob = 0.333333
k = 3|-1, prob = 0.222222
k = 3|1, prob = 0.416667
k = L|-1, prob = 0.222222
k = L|1, prob = 0.416667
k = M|-1, prob = 0.333333
k = M|1, prob = 0.416667
k = S|-1, prob = 0.444444
k = S|1, prob = 0.166667
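
The smoothed values above are consistent with the usual additive-smoothing estimate $(N_{v,y} + \alpha) / (N_y + \alpha K)$, where $K = 3$ is the number of values each attribute can take, with the class priors left unsmoothed. This formula is inferred from the printed numbers rather than read from the `classifiers` source, and the helper name `smoothed` is hypothetical.

```python
def smoothed(count, class_total, n_values, alpha=1.0):
    """Additive (Laplace) smoothing of a conditional probability estimate."""
    return (count + alpha) / (class_total + alpha * n_values)


# Counts taken from the X_hard_code rows: class '-1' has 6 rows, class '1' has 9.
p_1_given_neg = smoothed(3, 6, 3)  # x1 = 1 occurs 3 times in class -1
p_s_given_pos = smoothed(1, 9, 3)  # x2 = S occurs once in class 1
print('%.6f %.6f' % (p_1_given_neg, p_s_given_pos))  # 0.444444 0.166667
```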


In [6]:
XX_test = ['2', 'S']
mnNB.predict(XX_test)

['-1']
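
This prediction can be checked by hand: multiply each class prior by the conditional probability of every attribute value and keep the class with the larger product. The sketch below copies the relevant entries from the smoothed probability table above; it only illustrates the decision rule and is not the `predict` method itself.

```python
# Values copied from the smoothed probability table above.
prior = {'-1': 0.4, '1': 0.6}
cond = {
    ('2', '-1'): 0.333333, ('S', '-1'): 0.444444,
    ('2', '1'): 0.333333, ('S', '1'): 0.166667,
}

x = ['2', 'S']
scores = {}
for y in prior:
    score = prior[y]
    for v in x:
        score *= cond[(v, y)]  # naive independence assumption
    scores[y] = score

best = max(scores, key=scores.get)
print(best)  # -1
```

The two scores are roughly 0.059 for class `-1` and 0.033 for class `1`, so `-1` is chosen.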


# Assignment

In this problem, you will implement a Naive Bayes classifier that classifies movie reviews as positive or negative. The dataset is the IMDB Large Movie Review dataset (Maas et al., ACL 2011). The processed dataset can be found [here](https://www.dropbox.com/s/liz0o40f5mpj8ye/hw1_dataset_nb.tar.gz?dl=0). The task is to estimate appropriate parameters from the training data and use them to classify each review in the test data as either positive or negative.

We employ the *Multinomial Naive Bayes model for modeling each $P(X_i | Y = y_k)$ ($i = 1, \dots, n$), with appropriate word counts* (note that $n$ is the number of dimensions).

Please use Matlab, Python, R, C/C++ or Java for your implementation. Note that you will have to submit your code to Autolab and answer the questions in the subsections below in your report.

### Preprocessing
The dataset is partitioned into two folders, `train` and `test`, each of which contains two subfolders (`pos` and `neg`, for positive and negative samples respectively). The content of each file has to be converted to a bag-of-words representation, so the first task is to go through all the files in the `train` folder and construct the vocabulary $V$ of all unique words. Please ignore all the stop words given in the file `sw.txt` (provided along with the dataset). The words from each file (in both the training and testing phases) must be extracted by splitting the raw text only on whitespace characters and **converting them to lowercase**.
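
The tokenization rule described above could be sketched as follows. The function name `tokenize` and the two-word stop list are illustrative assumptions; the real stop words come from `sw.txt`.

```python
def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


# Illustrative stop-word set; the assignment's list is read from sw.txt.
sample_stop_words = frozenset(['a', 'of'])
tokens = tokenize('A terrific example of absurd comedy', sample_stop_words)
print(tokens)  # ['terrific', 'example', 'absurd', 'comedy']
```

Lowercasing before the stop-word test matters: without it, a capitalized `A` would not match the stop word `a`.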

The next step is to count the occurrences of each individual word separately for the positive and the negative class, in order to estimate $P(word | class)$.
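
A minimal sketch of the per-class counting step, using `collections.Counter`. The toy documents and the helper name `count_words` are made up for illustration; in the assignment the input would be the tokenized training reviews.

```python
from collections import Counter


def count_words(docs_by_class):
    """Map each class label to a Counter of word occurrences over its documents."""
    counts = {}
    for label, docs in docs_by_class.items():
        c = Counter()
        for doc in docs:
            c.update(doc)
        counts[label] = c
    return counts


# Toy tokenized documents (not taken from the IMDB data).
toy = {'pos': [['great', 'fun'], ['great']], 'neg': [['dull', 'fun']]}
word_counts = count_words(toy)
vocabulary = set(w for c in word_counts.values() for w in c)

# Unsmoothed estimate: occurrences of the word divided by all word
# occurrences in that class.
p_great_pos = word_counts['pos']['great'] / float(sum(word_counts['pos'].values()))
```

Here `p_great_pos` is 2/3, since `great` accounts for two of the three word occurrences in the toy positive class.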

### Classification
In this step, you need to go through all the negative and positive samples in the test data and classify each sample according to the parameters learned earlier. The classification should be done by comparing the (un-normalized) log-posterior, which is given by $\log(P(X|Y)P(Y))$, for both classes.
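
Working in log space turns the product $P(X|Y)P(Y)$ into a sum, which avoids floating-point underflow on long reviews. The sketch below uses made-up probabilities and hypothetical names (`log_posterior`, `classify`, `model`); skipping words that have no entry in the table is one possible convention, assumed here for simplicity.

```python
import math


def log_posterior(doc, log_prior, log_cond):
    """Un-normalized log-posterior: log P(Y) + sum_i log P(word_i | Y)."""
    # Assumption: words without an entry in log_cond are ignored.
    return log_prior + sum(log_cond[w] for w in doc if w in log_cond)


def classify(doc, model):
    """model maps label -> (log_prior, {word: log P(word | label)})."""
    return max(model, key=lambda y: log_posterior(doc, *model[y]))


# Made-up parameters for illustration only.
model = {
    'pos': (math.log(0.5), {'great': math.log(0.6), 'fun': math.log(0.3), 'dull': math.log(0.1)}),
    'neg': (math.log(0.5), {'great': math.log(0.1), 'fun': math.log(0.3), 'dull': math.log(0.6)}),
}
label = classify(['great', 'fun'], model)
print(label)  # pos
```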

### Laplace smoothing
An issue with the original Naive Bayes setup is that if a test sample contains a word which is not present in the dictionary, the $P(word|label)$ goes to $0$. To mitigate this issue, one solution is to employ Laplace smoothing (it has a parameter $\alpha$). Augment your $P(word | class)$ calculations by including the appropriate terms for doing Laplace smoothing. 
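
One common way to write the smoothed estimate is shown below. The notation is introduced here and is not part of the assignment text: $N_{w,c}$ denotes the number of times word $w$ occurs in the training documents of class $c$, and $|V|$ is the vocabulary size.

$$P(w \mid c) = \frac{N_{w,c} + \alpha}{\sum_{w' \in V} N_{w',c} + \alpha |V|}$$

With $\alpha = 0$ this reduces to the unsmoothed relative frequency, and for any $\alpha > 0$ every word in $V$ receives a non-zero probability under both classes.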

Report the confusion matrix and overall accuracy of your classifier on the test dataset with $\alpha = 1$. Recall that the confusion matrix for such a 2-class classification problem is a matrix of the number of true positives (positive samples correctly classified as positive), true negatives (negative samples correctly classified as negative), false positives (negative samples incorrectly classified as positive), and false negatives (positive samples incorrectly classified as negative). The accuracy is the ratio of the sum of true positives and true negatives to the total number of samples (in the test dataset).
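
The four confusion-matrix entries and the accuracy defined above can be computed as in this sketch. The label lists are toy values, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive='pos'):
    """Return (tp, tn, fp, fn) for a two-class problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p != positive:
            tn += 1
        elif t != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples that were classified correctly."""
    return (tp + tn) / float(tp + tn + fp + fn)


# Toy labels for illustration only.
y_true = ['pos', 'pos', 'neg', 'neg', 'pos']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos']
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
```

For these toy labels the counts are 2 true positives, 1 true negative, 1 false positive and 1 false negative, giving an accuracy of 0.6.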

Now vary the value of $\alpha$ from $0.0001$ to $1000$ (multiplying $\alpha$ by 10 each time), and report a plot of the accuracy on the test dataset for the corresponding values of $\alpha$. (The x-axis should show the $\alpha$ values on a $\log$ scale.)
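
A possible skeleton for the sweep is sketched below. It builds the eight $\alpha$ values and defines a plotting helper with a logarithmic x-axis. The accuracies themselves must come from your own classifier: the `evaluate` function mentioned in the comment is hypothetical, and `plot_accuracy` is likewise just an illustrative name.

```python
# alpha = 1e-4, 1e-3, ..., 1e3 (each value is 10 times the previous one).
alphas = [10.0 ** k for k in range(-4, 4)]


def plot_accuracy(alphas, accuracies):
    """Plot test accuracy against alpha with a logarithmic x-axis."""
    # Imported here so that building the grid does not require matplotlib.
    import matplotlib.pyplot as plt
    plt.semilogx(alphas, accuracies, marker='o')
    plt.xlabel('alpha')
    plt.ylabel('test accuracy')
    plt.show()


# Usage sketch, assuming evaluate(alpha) trains with the given alpha and
# returns the test accuracy (evaluate is not defined here):
# plot_accuracy(alphas, [evaluate(a) for a in alphas])
```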

Why do you think the accuracy suffers when $\alpha$ is too high or too low?

In [7]:
import numpy as np
import pandas as pd
import os

fileName = '0_3.txt'
dataPath = os.path.join('dataset', 'hw1_dataset_nb', 'train', 'neg', fileName)
# Read the whole review; the with-block closes the file afterwards.
with open(dataPath, 'r') as f:
    data = f.read()
print data

Story of a man who has unnatural feelings for a pig  Starts out with a opening scene that is a terrific example of absurd comedy  A formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it's singers  Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting  Even those from the era should be turned off  The cryptic dialogue would make Shakespeare seem easy to a third grader  On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond  Future stars Sally Kirkland and Frederic Forrest can be seen briefly 


### Read training, testing data and stop words

In [15]:
from dataset import removePunctuation

def getData(dir_pre, data_type, cg_list, sw=None):
    """
    Read the documents under 'dir_pre'/'data_type', one sub-folder
    per category name in 'cg_list'. If the stop-word list 'sw' is
    given, its words are removed from every document.
    
    Input:
    ------
    - dir_pre  : the prefix of the data set's directory.
    - data_type: 'train' or 'test'.
    - cg_list  : list of category names for the given data type.
    - sw       : stop-word list (optional).
    
    Output
    ------
    A dictionary keyed by the names in cg_list. For the IMDB data,
    e.g. 'neg' --> list of reviews, each a list of words.
    """
    # A set makes the stop-word membership test fast.
    sw_set = set(sw) if sw is not None else None

    docs = {}
    for cg in cg_list:
        texts = []
        pathPre = os.path.join(dir_pre, data_type, cg)
        dir_contents = os.listdir(pathPre)

        for fN in dir_contents:
            # Read one review, strip punctuation, split on whitespace.
            # Note: tokens keep their original case at this point.
            with open(os.path.join(pathPre, fN), 'r') as f:
                contents = removePunctuation(f.read()).split()
            
            # Remove stop words.
            if sw_set is not None:
                contents = [x for x in contents if x not in sw_set]
            
            texts.append(contents)

        docs[cg] = texts
        
    return docs

stop_words = open(os.path.join('dataset', 'hw1_dataset_nb', 'sw.txt'), 'r').read().split()

cgList = ['neg', 'pos']
dirP = os.path.join('dataset', 'hw1_dataset_nb')
X_train = getData(dirP, 'train', cgList, sw=stop_words)
X_test = getData(dirP, 'test', cgList, sw=stop_words)

In [9]:
print stop_words[0:5]

['a', 'about', 'above', 'across', 'after']


In [10]:
print X_train['neg'][0]

['Story', 'unnatural', 'feelings', 'pig', 'Starts', 'scene', 'terrific', 'example', 'absurd', 'comedy', 'A', 'formal', 'orchestra', 'audience', 'insane', 'violent', 'mob', 'crazy', 'chantings', "it's", 'singers', 'Unfortunately', 'stays', 'absurd', 'WHOLE', 'time', 'narrative', 'eventually', 'putting', 'Even', 'era', 'The', 'cryptic', 'dialogue', 'Shakespeare', 'easy', 'third', 'grader', 'On', 'technical', 'level', "it's", 'cinematography', 'future', 'Vilmos', 'Zsigmond', 'Future', 'stars', 'Sally', 'Kirkland', 'Frederic', 'Forrest', 'seen', 'briefly']


In [11]:
print '****** training data: ******'
for i, tx in enumerate(X_train['neg'][0:5]):
    print '%d -----------' % i
    print tx
       
print '\n neg ================> pos \n'

for i, tx in enumerate(X_train['pos'][0:5]):
    print '%d -----------' % i
    print tx

print '\n------------------------------------------\n'

print '****** testing data: ******'    
for i, tx in enumerate(X_test['neg'][0:5]):
    print '%d -----------' % i
    print tx
       
print '\n neg ================> pos \n'

for i, tx in enumerate(X_test['pos'][0:5]):
    print '%d -----------' % i
    print tx


****** training data: ******
0 -----------
['Story', 'unnatural', 'feelings', 'pig', 'Starts', 'scene', 'terrific', 'example', 'absurd', 'comedy', 'A', 'formal', 'orchestra', 'audience', 'insane', 'violent', 'mob', 'crazy', 'chantings', "it's", 'singers', 'Unfortunately', 'stays', 'absurd', 'WHOLE', 'time', 'narrative', 'eventually', 'putting', 'Even', 'era', 'The', 'cryptic', 'dialogue', 'Shakespeare', 'easy', 'third', 'grader', 'On', 'technical', 'level', "it's", 'cinematography', 'future', 'Vilmos', 'Zsigmond', 'Future', 'stars', 'Sally', 'Kirkland', 'Frederic', 'Forrest', 'seen', 'briefly']
1 -----------
['Airport', "'77", 'starts', 'brand', 'luxury', '747', 'plane', 'loaded', 'valuable', 'paintings', 'belonging', 'rich', 'businessman', 'Philip', 'Stevens', 'James', 'Stewart', 'flying', 'bunch', "VIP's", 'estate', 'preparation', 'public', 'museum', 'board', 'Stevens', 'daughter', 'Julie', 'Kathleen', 'Quinlan', 'son', 'The', 'luxury', 'jetliner', 'takes', 'planned', 'mid-air', 'pla

### Construct vocabulary list

In [12]:
vl = []
for d in ['neg', 'pos']:
    for doc in X_train[d]:
        for wd in doc:
            vl.append(wd)
        
voc_list = list(set(vl))

In [13]:
len(voc_list)

127583