# CSC421 Assignment 3 - Part II Naive Bayes Classification (5 points) #
### Author: George Tzanetakis 

This notebook is based on the supporting material for topics covered in **Chapter 13 Quantifying Uncertainty** and **Chapter 20 - Statistical Learning Method** from the book *Artificial Intelligence: A Modern Approach.* This part does NOT rely on the provided code, so you can complete it using just basic Python. 

```
Misunderstanding of probability may be the greatest of all impediments
to scientific literacy.

Gould, Stephen Jay
```



# Introduction 


Text categorization is the task of assigning a given document to one of a fixed set of categories, on the basis of the text it contains. Naive Bayes models are often used for this task. In these models, the query variable is
the document category, and the effect variables are the presence/absence
of each word in the language; the assumption is that words occur independently in documents within a given category (conditional independence), with frequencies determined by the document category. Download the following file: http://www.cs.cornell.edu/People/pabo/movie-review-data/review_polarity.tar.gz containing a dataset that has been used for text mining, consisting of movie reviews classified as negative or positive. You
will see that there are two folders, one for the positive and one for the negative category, and each contains multiple text files with the reviews. You can find more information about the dataset at: 
http://www.cs.cornell.edu/People/pabo/movie-review-data/


Our goal will be to build a simple Naive Bayes classifier for this dataset. More complicated approaches using term frequency and inverse document frequency weighting and many more words are possible but the basic concepts
are the same. The goal is to understand the whole process so DO NOT use existing machine learning packages but rather build the classifier from scratch.

Our feature vector representation for each text file will simply be a binary vector that shows which of the following words are present in the text file: Awful Bad Boring Dull Effective Enjoyable Great Hilarious. For example, the text file cv996_11592.txt would be represented as (0, 0, 0, 0, 1, 0, 1, 0) because it contains Effective and Great but none of the other words.
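
As a minimal sketch of this representation (not the code asked for in the questions below), the mapping from text to binary vector could look as follows. The helper name `binary_features` and the sample sentence are made up for illustration; only the eight dictionary words come from the assignment text.

```python
# Dictionary words from the assignment text, in the order used for the feature vector.
KEYWORDS = ('awful', 'bad', 'boring', 'dull', 'effective', 'enjoyable', 'great', 'hilarious')

def binary_features(text, keywords=KEYWORDS):
    """Hypothetical helper: 1 where a dictionary word occurs as a token in `text`, else 0."""
    tokens = set(text.lower().split())  # split on any whitespace, compare in lower case
    return tuple(1 if word in tokens else 0 for word in keywords)

# Made-up sentence, not taken from the dataset.
sample = "an effective thriller with a great cast"
print(binary_features(sample))  # (0, 0, 0, 0, 1, 0, 1, 0)
```

Only presence is recorded, so a word that appears several times in a review still contributes a single 1.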

# Question 2A (Minimum) (CSC421 - 1 point, CSC581C - 0 points) 

Write code that parses the text files and calculates the probabilities for
each dictionary word given the review polarity.

In [3]:
#Jonathan Kalmar V00762777
import os

#NOTE: this program assumes that the current working directory contains the folder
#'review_polarity', containing the subfolders with positive and negative movie reviews.

def get_representation(filename, keywords):
    #takes a text file and returns a list representing word presence in the form
    #[0,0,0,0,0,0,0,0]
    #the with-statement closes the file once it has been read
    with open(filename, 'r') as f:
        text_words = f.read().split(' ')
    #note: split(' ') breaks on spaces only, so a token next to a line break keeps its '\n'
    return [1 if keyword in text_words else 0 for keyword in keywords]

def mass_representation(folder,keywords):
    #takes a folder directory containing text files of movie reviews, and a keyword list
    #returns a dictionary where keys are the review numbers 0-999 (taken from the file names),
    #and their values are the associated binary word representation in form [0,0,0,0,0,0,0,0]
    feature_vector = {}
    for filename in os.listdir(folder):
        rep = get_representation(folder + '/' + filename,keywords)
        feature_vector[int(filename[2:5])]=rep
    return feature_vector

def probabilities(data, keywords):
    #takes a dictionary as created by mass_representation, and returns a list
    #with probabilities for keyword occurrence in the given dataset.
    word_probs=[0 for i in range(len(keywords))]
    for i in range(0,len(keywords)):
        total = 0
        for key in data:
            if data[key][i]==1: total += 1
        #divide by the number of reviews in this class (1000 for this dataset)
        word_probs[i]=total/len(data)
    return word_probs
        
key_words = ('awful','bad','boring','dull','effective','enjoyable','great','hilarious')
filepath = os.getcwd() + '/review_polarity/txt_sentoken/'
positive_reviews = mass_representation(filepath + 'pos', key_words)
negative_reviews = mass_representation(filepath + 'neg', key_words)
word_probs_pos = probabilities(positive_reviews, key_words)
word_probs_neg = probabilities(negative_reviews, key_words)

print('Positive Reviews:')
print('-'*20)
for i, probability in enumerate(word_probs_pos):
    print('{:<10s}{:>8.1f}%'.format(key_words[i],probability*100))

print('\nNegative Reviews:')
print('-'*20)
for i, probability in enumerate(word_probs_neg):
    print('{:<10s}{:>8.1f}%'.format(key_words[i],probability*100))

Positive Reviews:
--------------------
awful          1.9%
bad           25.4%
boring         4.8%
dull           2.3%
effective     12.0%
enjoyable      9.4%
great         40.5%
hilarious     12.5%

Negative Reviews:
--------------------
awful          9.9%
bad           50.3%
boring        16.6%
dull           9.0%
effective      4.6%
enjoyable      5.3%
great         28.2%
hilarious      4.8%


# Question 2B (Minimum) (CSC421 - 1 point, CSC581C - 0 point) 


Explain how the probability estimates for each dictionary word given the review polarity can be combined to form a Naive Bayes classifier. You can look up the Bernoulli Naive Bayes model, which is the simple model used here, where only the presence/absence of a word is modeled.

Your answer should be a description of the process with equations and a specific example as markdown text NOT python code. You will write the code in the next question. 

# ANSWER
We can score how well a review fits the positive class by taking the occurrence of each of our keywords in the review (0 or 1), taking the probability of that word occurring in a positive review as calculated above (or, if the word is absent, 1 minus that probability), and multiplying these probabilities together. The words are assumed to be conditionally independent given the review polarity, so this product is the likelihood of the whole feature vector given that the review is positive. To score the negative class, repeat the process using the probabilities of the words occurring in negative reviews. Because the dataset contains 1000 positive and 1000 negative reviews, the two class priors are equal, so comparing the two likelihoods is equivalent to comparing the posterior probabilities; the review is assigned to the class with the larger value.
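
The process can also be written as equations. The notation here is an assumption introduced for illustration and is not part of the original answer: $x_i \in \{0, 1\}$ marks the presence of dictionary word $i$, $c$ is the review polarity, and $\theta_{i,c}$ is the estimated probability of word $i$ appearing in a review of class $c$, as computed in Question 2A.

$$P(x_1, \dots, x_8 \mid c) = \prod_{i=1}^{8} \theta_{i,c}^{\,x_i} \, (1 - \theta_{i,c})^{1 - x_i}$$

$$P(c \mid x_1, \dots, x_8) = \frac{P(c) \, P(x_1, \dots, x_8 \mid c)}{\sum_{c'} P(c') \, P(x_1, \dots, x_8 \mid c')}$$

As a worked example, take the feature vector $(0, 0, 0, 0, 0, 1, 1, 0)$ (only Enjoyable and Great present) and the rounded percentages printed in Question 2A:

$$P(x \mid \text{pos}) = (1 - 0.019)(1 - 0.254)(1 - 0.048)(1 - 0.023)(1 - 0.120)(0.094)(0.405)(1 - 0.125) \approx 0.0200$$

$$P(x \mid \text{neg}) = (1 - 0.099)(1 - 0.503)(1 - 0.166)(1 - 0.090)(1 - 0.046)(0.053)(0.282)(1 - 0.048) \approx 0.0046$$

With equal priors $P(\text{pos}) = P(\text{neg}) = 0.5$, the posterior is

$$P(\text{pos} \mid x) \approx \frac{0.0200}{0.0200 + 0.0046} \approx 0.81$$

so this review would be classified as positive.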

# Question 2C (Expected) 1 point 

Write Python code for classifying a particular test instance (in our case a movie review) following a Bernoulli Bayes approach. Your code should calculate the likelihood the review is positive given the corresponding conditional probabilities for each dictionary word as well as the likelihood the review is negative given the corresponding conditional probabilities for each dictionary word. Check that your code works by providing a few example cases of prediction. Your code should be written from "scratch" and only use numpy/scipy but not machine learning libraries like scikit-learn or tensorflow. 


In [7]:
import numpy as np

def likelihood(review, word_probs): 
    #given a binary keyword occurrence representation of a movie review, and the keyword
    #probabilities for one class of review, gives the likelihood of the review under that class
    probability_product = 1.0 
    for (i,w) in enumerate(review):
        if (w==1): 
            probability = word_probs[i]
        else: 
            probability = 1.0 - word_probs[i]
        probability_product *= probability
    return probability_product 

def predict(review):
    #given a binary keyword occurrence representation of a movie review, return the likelihood
    #under the positive class, the likelihood under the negative class, and the resulting prediction.
    scores = [likelihood(review, word_probs_pos), 
             likelihood(review, word_probs_neg),
             "unclassified"]
    if scores[0]>scores[1]:
        scores[2] = "positive"
    elif scores[0]<scores[1]:
        scores[2] = "negative"
    return scores

i = np.random.randint(1000)
random_review = positive_reviews[i]
print(random_review)
result = predict(random_review)

print('Chance that review #{:} is positive: {:0.1f}%'.format(i,result[0]*100))
print('Chance that review #{:} is negative: {:0.1f}%'.format(i,result[1]*100))
print('Prediction: review is', result[2])

[0, 0, 0, 0, 0, 1, 1, 0]
Chance that review #449 is positive: 2.0%
Chance that review #449 is negative: 0.5%
Prediction: review is positive
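
The two percentages printed above are class likelihoods, so they do not sum to 100%. A possible refinement, sketched here under stated assumptions, is to work with log-likelihoods (which avoids underflow if the dictionary grows) and to normalize into a posterior probability. The helper names are hypothetical, the probabilities are the rounded values printed in Question 2A, and the equal priors are an assumption based on the 1000/1000 class split.

```python
import math

# Rounded keyword probabilities copied from the tables printed in Question 2A.
WORD_PROBS_POS = [0.019, 0.254, 0.048, 0.023, 0.120, 0.094, 0.405, 0.125]
WORD_PROBS_NEG = [0.099, 0.503, 0.166, 0.090, 0.046, 0.053, 0.282, 0.048]

def log_likelihood(review, word_probs):
    """Hypothetical helper: sum of log P(x_i | class) for a binary feature vector.
    Assumes every probability is strictly between 0 and 1, as is the case here."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x, p in zip(review, word_probs))

def posterior_positive(review, prior_pos=0.5):
    """Posterior P(positive | review); the default prior assumes balanced classes."""
    log_pos = log_likelihood(review, WORD_PROBS_POS) + math.log(prior_pos)
    log_neg = log_likelihood(review, WORD_PROBS_NEG) + math.log(1.0 - prior_pos)
    # Subtract the larger value before exponentiating to keep the ratio numerically stable.
    top = max(log_pos, log_neg)
    pos, neg = math.exp(log_pos - top), math.exp(log_neg - top)
    return pos / (pos + neg)

review = [0, 0, 0, 0, 0, 1, 1, 0]
print('P(positive | review) = {:.2f}'.format(posterior_positive(review)))  # 0.81
```

The predicted class is the same as when comparing raw likelihoods, because the normalization divides both scores by the same constant.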


# QUESTION 2D (Expected) 1 point

Calculate the classification accuracy and confusion matrix that you would obtain using the whole data set for both training and testing. Do not use machine learning libraries like scikit-learn or tensorflow for this; use only basic numpy/scipy functionality. 

In [5]:
from tabulate import tabulate

tp=0
tn=0
fp=0
fn=0

#tally up the true positive, true negative, false positive, and false negative predictions
for review in positive_reviews:
    prediction = predict(positive_reviews[review])[2]
    if prediction == "positive":
        tp += 1
    elif prediction == "negative":
        fn += 1
for review in negative_reviews:
    prediction = predict(negative_reviews[review])[2]
    if prediction == "positive":
        fp += 1
    elif prediction == "negative":
        tn += 1

#print confusion matrix
headers = ['','predicted positive','predicted negative']
data = [['actual positive', tp, fn],
        ['actual negative', fp, tn]]
print(tabulate(data, headers, tablefmt="fancy_grid", numalign = "center"))

#calculate and print accuracy over all reviews (a tie left "unclassified" counts as an error)
accuracy = (tp + tn)/(len(positive_reviews) + len(negative_reviews))
print('\nPrediction accuracy: {:>0.2f}%'.format(accuracy*100))

╒═════════════════╤══════════════════════╤══════════════════════╕
│                 │  predicted positive  │  predicted negative  │
╞═════════════════╪══════════════════════╪══════════════════════╡
│ actual positive │         756          │         244          │
├─────────────────┼──────────────────────┼──────────────────────┤
│ actual negative │         411          │         589          │
╘═════════════════╧══════════════════════╧══════════════════════╛

Prediction accuracy: 67.25%
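
As a cross-check, the accuracy can be recomputed directly from the four counts in the table. This is a small sketch using numpy only; the variable names are chosen for illustration, and the counts are copied from the confusion matrix printed above.

```python
import numpy as np

# Rows: actual positive, actual negative. Columns: predicted positive, predicted negative.
confusion = np.array([[756, 244],
                      [411, 589]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(confusion) / confusion.sum()
# Fraction of each actual class that was classified correctly.
per_class_correct = np.diag(confusion) / confusion.sum(axis=1)

print('accuracy: {:.4f}'.format(accuracy))                                   # 0.6725
print('positive reviews classified correctly: {:.3f}'.format(per_class_correct[0]))  # 0.756
print('negative reviews classified correctly: {:.3f}'.format(per_class_correct[1]))  # 0.589
```

The per-class fractions show that most of the errors come from negative reviews being predicted as positive.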


# QUESTION 2E (Advanced) 1 point 

One can consider the Naive Bayes classifier a generative model that can generate binary feature vectors using the associated probabilities from the training data. The idea is similar to how we do direct sampling in
Bayesian Networks and depends on generating random numbers from a discrete distribution. Describe how you would generate random movie reviews consisting solely of the words from the dictionary using your model. Show 5 examples of randomly generated positive reviews and 5 examples of randomly generated negative reviews. Each example should consist of a subset of the words in the dictionary. Hint: use probabilities to generate both the presence and absence of a word.

In [6]:
#for each word in our dictionary, generate a random float uniformly in [0, 1)
#if the generated number is lower than the word's probability for that class, include the word
#if the generated number is higher than (or equal to) the probability, omit it from the review.

print('Random Positive Reviews: ')
print('-'*25)
for _ in range(0,5):
    print([w for (i,w) in enumerate(key_words) if np.random.random() < word_probs_pos[i]])
print('\nRandom Negative Reviews: ')
print('-'*25)
for _ in range(0,5):
    print([w for (i,w) in enumerate(key_words) if np.random.random() < word_probs_neg[i]])


Random Positive Reviews: 
-------------------------
[]
['dull', 'hilarious']
[]
['bad', 'hilarious']
['bad']

Random Negative Reviews: 
-------------------------
['bad', 'boring', 'dull']
['bad']
['awful', 'bad', 'enjoyable']
['bad']
['hilarious']
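
One way to check that such a generator behaves as intended is to draw many feature vectors and compare the sampled word frequencies with the model probabilities. The sketch below is an illustration, not part of the original solution: the helper `sample_vectors`, the seed, and the sample size are arbitrary choices, and the probabilities are the rounded negative-review values printed in Question 2A.

```python
import numpy as np

KEYWORDS = ('awful', 'bad', 'boring', 'dull', 'effective', 'enjoyable', 'great', 'hilarious')
# Rounded probabilities copied from the negative-review table in Question 2A.
WORD_PROBS_NEG = np.array([0.099, 0.503, 0.166, 0.090, 0.046, 0.053, 0.282, 0.048])

def sample_vectors(word_probs, n, rng):
    """Hypothetical helper: draw n binary vectors; column i is 1 with probability word_probs[i]."""
    return (rng.random((n, len(word_probs))) < word_probs).astype(int)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability
samples = sample_vectors(WORD_PROBS_NEG, 20000, rng)

# The sampled frequency of each word should be close to the probability used to generate it.
for word, p, freq in zip(KEYWORDS, WORD_PROBS_NEG, samples.mean(axis=0)):
    print('{:<10s} model {:.3f}  sampled {:.3f}'.format(word, p, freq))

# Turn one sampled vector back into a "review" made of dictionary words.
print([w for w, x in zip(KEYWORDS, samples[0]) if x == 1])
```

Each word is drawn independently of the others, which is exactly the conditional independence assumption of the classifier.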
