# Text Classification with Naive Bayes

The goal of this lab is to build a Naive Bayes model that classifies movie reviews as positive or negative, and then to test the classifier on new movie reviews.

## 1. Data preparation
Our input for this problem is two groups of movie reviews, pos and neg, with each review stored in a separate text file. The dataset can be downloaded here: [movies_reviews](https://drive.google.com/file/d/1BzgcXlSRqFj1RoadBupx_OQm322-rHTt/view).

The dataset is from the following publication: "Thumbs up? Sentiment Classification using Machine Learning Techniques". Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79-86, 2002.

### 1.1 Read files into arrays 
We first read the files and store their contents in two lists of strings, where each element is the text of one movie review.

In [11]:
import os

data_dir = "../data_set/movies_reviews"
pos_entries = os.listdir(os.path.join(data_dir, "pos"))
neg_entries = os.listdir(os.path.join(data_dir, "neg"))

# Read every positive review into a list of strings
pos_ls = []
for name in pos_entries:
    path = os.path.join(data_dir, "pos", name)
    with open(path, "r") as f:  # the context manager closes the file
        pos_ls.append(f.read())

# Read every negative review into a list of strings
neg_ls = []
for name in neg_entries:
    path = os.path.join(data_dir, "neg", name)
    with open(path, "r") as f:
        neg_ls.append(f.read())

len(pos_ls), len(neg_ls)

(1005, 1000)

### 1.2 Vectorization of documents
To convert the words in each document into a vector of word occurrences, we use the <code>CountVectorizer</code> class from <code>sklearn.feature_extraction.text</code>. It first tokenizes each text, i.e., splits it into words, then assigns each unique word an integer index, and finally counts how often each word occurs in each document.

Before applying the vectorization, we need to identify the stop words: words that are used very commonly in all documents regardless of the class of the document. They are listed in the file "stop_words.txt". For example, "the" appears frequently in both negative and positive movie reviews. These words would interfere with our classification result, so we remove them before doing anything further. To do that, we pass the list to <code>CountVectorizer</code> through its <code>stop_words</code> parameter, which removes the stop words from the tokens of all texts.
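
As a small illustration of the two steps described above, the sketch below runs <code>CountVectorizer</code> on a two-sentence corpus. The sentences and the three-word stop list are made up for this example and are not part of the lab's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and stop-word list, made up for illustration only
toy_docs = ["the movie was great", "the plot was dull and the acting was dull"]
toy_stopwords = ["the", "was", "and"]

toy_vectorizer = CountVectorizer(stop_words=toy_stopwords)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# The vocabulary maps each remaining word to a column index
# (indices follow the alphabetical order of the words)
print(toy_vectorizer.vocabulary_)

# One row per document, one column per vocabulary word
print(toy_counts.toarray())
```

The stop words never reach the vocabulary, so the count matrix has only five columns (acting, dull, great, movie, plot), and "dull" is counted twice in the second row.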

In [22]:
stop_words_file = "../data_set/movies_reviews/stop_words.txt"

# Read one stop word per line, stripping surrounding whitespace
with open(stop_words_file, "r", encoding="utf-8") as f:
    stopwords = [line.strip() for line in f]

print(stopwords[:20])

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst']


In [24]:
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer(stop_words=stopwords)
text_ls = pos_ls + neg_ls

# learn the vocabulary: a dictionary mapping each token to a column index
vector.fit(text_ls)
voc_dic = vector.vocabulary_

# count the occurrences of each word in each document (sparse matrix)
counts = vector.transform(text_ls)
print("The shape of count is: " + str(counts.shape))

# convert the sparse matrix to a dense NumPy array
counts = counts.toarray()

The shape of count is: (2005, 39373)


### 1.3 Add class labels
We label each review 1 (pos) or 0 (neg) according to the directory it came from.

In [28]:
import numpy as np

# the first len(pos_ls) rows are positive reviews, so their label is 1
Y_orig = np.ones((len(pos_ls),))

# the remaining len(neg_ls) rows are negative reviews, so their label is 0
Y_orig = np.concatenate((Y_orig, np.zeros((len(neg_ls),))))
Y = Y_orig.reshape(-1)

print(Y.shape)

(2005,)

## 2. Classification

### 2.1 Train test split

In [None]:
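# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(counts, Y, test_size=0.3, random_state=109)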
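
To see what the split returns, here is a minimal sketch on stand-in data (ten made-up rows, not the review counts). The <code>stratify</code> argument is an optional addition that the cell above does not use; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 "documents" with 3 count features each, 5 per class
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.array([1] * 5 + [0] * 5)

# stratify is optional and is an assumption of this sketch, not part of the lab
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=109, stratify=y_toy
)

# 30% of 10 rows go to the test set, the rest to the training set
print(X_tr.shape, X_te.shape)
```

Fixing <code>random_state</code> makes the shuffle reproducible, so the same rows end up in the test set on every run.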


### 2.2 Classify with Naive Bayes Model
We use the multinomial model from the sklearn library because...
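
A minimal sketch of fitting scikit-learn's <code>MultinomialNB</code> on a hand-made count matrix is shown below. The matrix, the column meanings, and the variable names are invented for illustration; in the lab the same calls would presumably be applied to <code>X_train</code> and <code>y_train</code>.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: columns stand for the words "great", "dull", "plot"
X_demo = np.array([
    [3, 0, 1],  # positive review
    [2, 0, 0],  # positive review
    [0, 2, 1],  # negative review
    [0, 3, 0],  # negative review
])
y_demo = np.array([1, 1, 0, 0])

# alpha=1.0 is Laplace smoothing, the scikit-learn default
demo_clf = MultinomialNB(alpha=1.0)
demo_clf.fit(X_demo, y_demo)

# A new document that mentions "great" twice and "plot" once
new_doc = np.array([[2, 0, 1]])
print(demo_clf.predict(new_doc))
```

The smoothing term matters here: without it, a word that never occurred in one class during training would give that class a probability of zero for any document containing the word.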

### 2.3 Test accuracy and predict new reviews

Our result shows that...
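
The following sketch shows one way to measure accuracy and classify unseen text. It uses a tiny made-up corpus, so its numbers say nothing about the lab's actual result, and the names <code>mini_vec</code> and <code>mini_clf</code> are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus; 1 = positive, 0 = negative
train_texts = [
    "great acting and a great story",
    "wonderful and moving film",
    "dull plot and terrible acting",
    "boring and terrible film",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a wonderful story", "a boring plot"]
test_labels = [1, 0]

mini_vec = CountVectorizer()
mini_clf = MultinomialNB()
mini_clf.fit(mini_vec.fit_transform(train_texts), train_labels)

# New text must pass through the already fitted vectorizer:
# call transform, not fit_transform, so the columns match the training data
test_pred = mini_clf.predict(mini_vec.transform(test_texts))
print(accuracy_score(test_labels, test_pred))
```

Words in a new review that were never seen during fitting are simply ignored by <code>transform</code>, so a review made up entirely of unknown words is classified from the class priors alone.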