# **Session 7: Probability and Statistics for AI & Machine Learning II**
## Document Classification using Naive Bayes Classifier.

## PY599 (Fall 2018): Applied Artificial Intelligence
## NC State University
### Dr. Behnam Kia
### https://appliedai.wordpress.ncsu.edu/


**Disclaimer**: Please note that this code is a simplified version of the algorithms and may not give the best, or the expected, performance that these algorithms can achieve. The aim of this notebook is to help you understand the basics and the essence of these algorithms and to experiment with them. This basic code is not deployment-ready or error-free for real-world applications. To learn more about these algorithms, please refer to textbooks that specifically study them, or contact me. - Behnam Kia

# Dataset 

Method 1: You can download the dataset from Keras. In this version of the dataset, each word is replaced by a unique integer. According to the Keras website, "Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word."
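To make the frequency-based encoding concrete, here is a minimal sketch on a made-up three-sentence corpus. The helper names `build_word_index` and `encode` are hypothetical and are not part of Keras. The sketch follows the convention quoted above (rank 1 is the most frequent word, 0 marks an unknown or filtered word) and ignores the extra offsets (`start_char`, `oov_char`, `index_from`) that the Keras loader applies.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by overall frequency: rank 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words=None, skip_top=0, oov_code=0):
    """Replace each word by its rank; unknown or filtered words become oov_code."""
    encoded = []
    for word in text.lower().split():
        rank = word_index.get(word)
        too_rare = num_words is not None and rank is not None and rank > num_words
        too_common = rank is not None and rank <= skip_top
        if rank is None or too_rare or too_common:
            encoded.append(oov_code)
        else:
            encoded.append(rank)
    return encoded

# Toy corpus (an assumption for illustration, not the IMDB data).
corpus = ["the movie was great", "the movie was bad", "the plot was thin"]
word_index = build_word_index(corpus)

full = encode("the movie was great", word_index)
filtered = encode("the movie was great", word_index, num_words=3, skip_top=1)
print(full)      # [1, 3, 2, 5]
print(filtered)  # [0, 3, 2, 0]
```

With `num_words=3` and `skip_top=1`, the most frequent word ("the") and every word ranked below third place are mapped to 0, which is the kind of filtering the quotation describes.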





In [None]:
# Collaboration: Richard Watson, Mountain Chan
import collections
import math
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)

In [None]:
def build_counts(x_train, y_train):
  # Count how often each word index occurs in positive (label 1)
  # and negative (label 0) training reviews.
  # (Renamed from dict to avoid shadowing the Python built-in.)
  cnt_pos = collections.Counter()
  cnt_neg = collections.Counter()
  for review, label in zip(x_train, y_train):
    if label == 1:
      cnt_pos.update(review)
    elif label == 0:
      cnt_neg.update(review)

  # Total number of word occurrences in each class
  sum_pos = sum(cnt_pos.values())
  sum_neg = sum(cnt_neg.values())

  return cnt_pos, cnt_neg, sum_pos, sum_neg

def naive_bayes(review, cnt_pos, cnt_neg, sum_pos, sum_neg):
  # Returns True if the review is classified as positive.
  # Class priors are omitted, which is reasonable here because the
  # training set has equally many positive and negative reviews.
  pos_prob = 0
  neg_prob = 0
  # Vocabulary size used in the add-one (Laplace) smoothing denominator.
  # Note: this is a fixed assumption; with num_words=None the dataset
  # contains more than 10000 distinct words.
  vocab = 10000

  # Count how often each word occurs in this review
  cnt = collections.Counter(review)

  # Accumulate the smoothed log-likelihood of the review under each class
  for word, n in cnt.items():
    pos_prob += n * math.log((cnt_pos[word] + 1)/(sum_pos + vocab))
    neg_prob += n * math.log((cnt_neg[word] + 1)/(sum_neg + vocab))

  return pos_prob > neg_prob

# Train on the training set, then measure accuracy on the test set
cnt_pos, cnt_neg, sum_pos, sum_neg = build_counts(x_train, y_train)
correct = 0
for review, label in zip(x_test, y_test):
  correct += naive_bayes(review, cnt_pos, cnt_neg, sum_pos, sum_neg) == label
print(correct/len(x_test))

0.81272
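
The scoring rule used above can be checked by hand on a tiny example. The sketch below is an illustration under stated assumptions, not the notebook's method verbatim: the toy reviews, the vocabulary size of 5, and the helper names `train_counts`, `log_score`, and `classify` are all made up. For each class it sums log((count of word in class + 1) / (total words in class + vocabulary size)) over the words of the review and picks the class with the larger sum.

```python
import collections
import math

# Toy training data (hypothetical): word indexes 1..5, label 1 = positive.
toy_reviews = [[1, 2, 2], [2, 3], [4, 4, 1], [4, 5]]
toy_labels = [1, 1, 0, 0]
VOCAB_SIZE = 5  # number of distinct word indexes in the toy data

def train_counts(reviews, labels):
    """Per-class word counts and per-class total word occurrences."""
    counts = {0: collections.Counter(), 1: collections.Counter()}
    for review, label in zip(reviews, labels):
        counts[label].update(review)
    totals = {label: sum(c.values()) for label, c in counts.items()}
    return counts, totals

def log_score(review, class_counts, class_total, vocab_size):
    """Log-likelihood of a review under one class with add-one smoothing."""
    return sum(
        math.log((class_counts[word] + 1) / (class_total + vocab_size))
        for word in review
    )

def classify(review, counts, totals, vocab_size):
    """True if the positive class scores higher (equal priors assumed)."""
    pos = log_score(review, counts[1], totals[1], vocab_size)
    neg = log_score(review, counts[0], totals[0], vocab_size)
    return pos > neg

counts, totals = train_counts(toy_reviews, toy_labels)
print(classify([2, 2, 1], counts, totals, VOCAB_SIZE))  # True
print(classify([4, 5], counts, totals, VOCAB_SIZE))     # False
```

For the review `[2, 2, 1]`, the positive score is 2·log(4/10) + log(2/10) and the negative score is 2·log(1/10) + log(2/10), so the positive class wins. Working in log space avoids numerical underflow when many small probabilities are multiplied.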


Method 2: You can download the original dataset with readable reviews. Please go to:
http://ai.stanford.edu/~amaas/data/sentiment/

and download "Large Movie Review Dataset v1.0."

There are many different ways to upload a dataset to Colab.
One method is to download the dataset to your local computer, upload it to Colab, and then unzip it. Please see the code below:

In [None]:
from google.colab import files

uploaded = files.upload()
!tar xzvf aclImdb_v1.tar.gz >/dev/null
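
Once the archive is extracted, the raw reviews can be read into Python lists. The following is a minimal sketch and not part of the original notebook: `load_reviews` is a hypothetical helper, and it assumes the archive unpacks to an `aclImdb/` folder containing `train/pos`, `train/neg`, `test/pos`, and `test/neg` subfolders with one `.txt` file per review. Please verify this layout against your extracted copy.

```python
from pathlib import Path

def load_reviews(root, split):
    """Read all reviews of one split ("train" or "test").

    Returns (texts, labels) where label 1 = positive and 0 = negative.
    The folder layout root/split/{pos,neg}/*.txt is an assumption.
    """
    texts, labels = [], []
    for folder_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / folder_name
        # Sort the file names so the order is reproducible across runs.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

# Only runs if the extracted dataset is present in the working directory.
if Path("aclImdb").is_dir():
    train_texts, train_labels = load_reviews("aclImdb", "train")
    print(len(train_texts), "training reviews loaded")
```

The texts returned here are plain strings, so they still need to be split into words before they can be counted the way the integer-encoded reviews are in Method 1.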
