# Naive Bayes and Sentiment Classification

## Problem 1

**Build a Naive Bayes sentiment classifier using add-1 smoothing, as described in the lecture notes (regular Naive Bayes, not binary Naive Bayes). Add an unknown word UNK as a separate word with count 0. Here is our training corpus:**

In [1]:
# Training Set
negatives = ["just plain boring",
             "entirely predictable and lacks energy",
             "no surprises and very few laughs"]
positives = ["very powerful",
             "the most fun film of the summer"]
documents = {'negative': negatives,
             'positive': positives}
classes = documents.keys()

In [2]:
# Testing Set
test_sentence = "predictable with no originality"

In [3]:
UNK = 'UNK'

**Compute the prior for the two classes + and -, and the likelihoods for each word given the class.**
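
For reference, the estimates computed in the cells below can be written as follows. The notation is an assumption made here and may differ from the lecture notes: $N_c$ is the number of training documents in class $c$, $N_{doc}$ is the total number of training documents, and $V$ is the vocabulary including UNK.

$$\hat{P}(c) = \frac{N_c}{N_{doc}}, \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \left(\mathrm{count}(w', c) + 1\right)}$$

In this corpus $|V| = 21$, the negative class has 14 tokens and the positive class has 9, so the denominators work out to $14 + 21 = 35$ and $9 + 21 = 30$.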

In [4]:
def tokenize(document):
    """
    Helper function to tokenize a string. For now we just split on whitespace,
    but this is a function so we can easily swap in an NLTK tokenizer in the
    future if we want to.
    """
    tokens = document.split()  # basic tokenization
    return tokens

In [5]:
# Find vocabulary - unique tokens in training set
vocabulary = set()
for c in classes:
    for d in documents[c]:
        tokens = tokenize(d)
        for token in tokens:
            vocabulary.add(token)
            
vocabulary.add(UNK)  # unknown word, will always have 0 count

print(vocabulary)

{'plain', 'fun', 'the', 'of', 'predictable', 'very', 'summer', 'and', 'energy', 'powerful', 'boring', 'laughs', 'UNK', 'no', 'surprises', 'film', 'lacks', 'most', 'entirely', 'few', 'just'}


In [6]:
def count(w, c, D):
    """
    helper function to count occurrences
    of token w in documents D of class c
    
    uses add-1 smoothing
    """
    n = 0
    for d in D[c]:
        tokens = tokenize(d)
        n += tokens.count(w)  # list-counting method
    return (n + 1)  # add-1 smoothing

In [7]:
def train_naive_bayes(D, C):
    """
    train a naive bayes model. 
    The conditional probability of each token
    in the vocabulary of our training set is
    calculated, as well as priors
    """
    # vocabulary is read from module scope (built in the cell above)
    priors = dict()
    likelihoods = dict()
    
    # Count total number of documents in training set,
    # using the arguments C and D rather than the globals
    N_doc = 0
    for c in C:
        N_doc += len(D[c])

    # Compute priors and likelihoods
    for c in C:

        # Compute prior for this class
        c_documents_list = D[c]
        N_c = len(c_documents_list)
        priors[c] = N_c / N_doc
        
        # sum of count(w,c) for all words in the vocabulary
        c_vocab_total_count = sum([count(w,c,D) for w in vocabulary])
        #print([count(w,c,D) for w in vocabulary])

        for token in vocabulary:
            likelihoods[(token, c)] = count(token, c, D) / c_vocab_total_count
            
    return (priors, likelihoods)

priors, liks = train_naive_bayes(documents, classes)
print(priors)
print(liks)

{'positive': 0.4, 'negative': 0.6}
{('powerful', 'negative'): 0.02857142857142857, ('surprises', 'negative'): 0.05714285714285714, ('plain', 'negative'): 0.05714285714285714, ('very', 'negative'): 0.05714285714285714, ('just', 'positive'): 0.03333333333333333, ('few', 'negative'): 0.05714285714285714, ('few', 'positive'): 0.03333333333333333, ('boring', 'negative'): 0.05714285714285714, ('the', 'positive'): 0.1, ('energy', 'negative'): 0.05714285714285714, ('just', 'negative'): 0.05714285714285714, ('energy', 'positive'): 0.03333333333333333, ('plain', 'positive'): 0.03333333333333333, ('boring', 'positive'): 0.03333333333333333, ('summer', 'positive'): 0.06666666666666667, ('film', 'negative'): 0.02857142857142857, ('very', 'positive'): 0.06666666666666667, ('surprises', 'positive'): 0.03333333333333333, ('UNK', 'positive'): 0.03333333333333333, ('no', 'negative'): 0.05714285714285714, ('lacks', 'positive'): 0.03333333333333333, ('entirely', 'negative'): 0.05714285714285714, ('of', 'n

**Then compute whether the sentence in the test set is of class positive or negative. Make sure you know the correct Bayes equation to use to compute a value for each class in order to answer this question.**

In [8]:
test_sentence_tokens = tokenize(test_sentence)
print(test_sentence_tokens)

['predictable', 'with', 'no', 'originality']


In [9]:
# replace unknown test set words with UNK token
for i in range(len(test_sentence_tokens)):
    if test_sentence_tokens[i] not in vocabulary:
        test_sentence_tokens[i] = UNK
        
print(test_sentence_tokens)

['predictable', 'UNK', 'no', 'UNK']


In [10]:
def test_naive_bayes(test_tokens, priors, liks, C):
    """
    Score each class for test_tokens, a list of tokens that has
    already been preprocessed (UNK replacement etc.), and return
    the class with the highest score.
    """
    
    class_probabilities = dict()
    
    for c in C:
        class_probabilities[c] = priors[c]

        for token in test_tokens:
            class_probabilities[c] *= liks[(token, c)]
            
    print(class_probabilities)
    return max(class_probabilities, key=class_probabilities.get)  # argmax

test_naive_bayes(test_sentence_tokens, priors, liks, classes)

{'positive': 4.938271604938272e-07, 'negative': 1.599333610995418e-06}


'negative'
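
The products above are already tiny for a four-token sentence, and longer documents can underflow to 0.0. A common workaround is to sum log probabilities instead of multiplying. The sketch below is not part of the original solution: it hard-codes the smoothed values derived above (denominators 35 and 30), and the name `log_score` is made up for illustration.

```python
import math

# Add-1 smoothed values from the model above:
# |V| = 21, 14 negative tokens, 9 positive tokens.
priors = {"negative": 3 / 5, "positive": 2 / 5}
liks = {
    ("predictable", "negative"): 2 / 35,
    ("no", "negative"): 2 / 35,
    ("UNK", "negative"): 1 / 35,
    ("predictable", "positive"): 1 / 30,
    ("no", "positive"): 1 / 30,
    ("UNK", "positive"): 1 / 30,
}
test_tokens = ["predictable", "UNK", "no", "UNK"]


def log_score(c):
    """Log prior plus the sum of log likelihoods (hypothetical helper)."""
    return math.log(priors[c]) + sum(math.log(liks[(t, c)]) for t in test_tokens)


log_scores = {c: log_score(c) for c in priors}
prediction = max(log_scores, key=log_scores.get)  # argmax is unchanged by log
print(log_scores, prediction)
```

Because the logarithm is monotonic, the argmax is the same as for the raw products.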

**What would the answer be without add-1 smoothing?**

In [11]:
def count(w, c, D):
    """
    the power of modularity.
    redefine our helper count function,
    to do the same thing but not use add-1
    smoothing.
    
    count occurrences
    of token w in class c
    """
    n = 0
    for d in D[c]:
        tokens = tokenize(d)
        n += tokens.count(w)  # list-counting method
    return n

priors, liks = train_naive_bayes(documents, classes)
test_naive_bayes(test_sentence_tokens, priors, liks, classes)

{'positive': 0.0, 'negative': 0.0}


'positive'

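Without add-1 smoothing, the unknown words have a likelihood of 0 under both classes, so the score for every class is 0. The `'positive'` returned above is not a real prediction: both scores are tied at 0, and `max` simply returns the first key it sees.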

**Would using binary multinomial Naive Bayes change anything?**

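No. Binary Naive Bayes clips each word's count at 1 per document, which still leaves the unknown words in the test sentence with a count of 0, and thus a conditional probability of 0. In this corpus the only word repeated within a document is "the" in the second positive review, so the clipping barely changes the counts anyway. We either need to use add-1 smoothing, or drop the unknown words entirely from the test sentence.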
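
As a rough check, here is a standalone sketch that assumes binary NB means counting each word at most once per document before summing over the class. The helper names `binary_counts` and `score` and the `smooth` parameter are made up for this example.

```python
from collections import Counter

documents = {
    "negative": ["just plain boring",
                 "entirely predictable and lacks energy",
                 "no surprises and very few laughs"],
    "positive": ["very powerful",
                 "the most fun film of the summer"],
}
UNK = "UNK"
vocabulary = {t for docs in documents.values() for d in docs for t in d.split()} | {UNK}


def binary_counts(docs):
    """Count each word at most once per document (assumed binary NB rule)."""
    counts = Counter()
    for d in docs:
        counts.update(set(d.split()))
    return counts


def score(tokens, c, smooth):
    """Prior times likelihoods; smooth=1 is add-1, smooth=0 is unsmoothed."""
    counts = binary_counts(documents[c])
    n_doc = sum(len(docs) for docs in documents.values())
    total = sum(counts.values()) + smooth * len(vocabulary)
    p = len(documents[c]) / n_doc
    for t in tokens:
        p *= (counts[t] + smooth) / total
    return p


test_tokens = ["predictable", UNK, "no", UNK]
unsmoothed = {c: score(test_tokens, c, 0) for c in documents}
smoothed = {c: score(test_tokens, c, 1) for c in documents}
print(unsmoothed)
print(smoothed)
```

Under these assumptions the unsmoothed scores are still both 0. With add-1 smoothing, the positive denominator drops from 30 to 29 and negative still wins.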

**What would happen if you used the second alternative method in Section 3.3.1 of J&M to determine the count of UNK?**

The second method replaces some of the words that occur infrequently in the training set with UNK, and then computes the count of UNK like that of any other word.

In [12]:
# count words
token_counts = {}
for c in classes:
    for d in documents[c]:
        for token in tokenize(d):
            token_counts[token] = token_counts.get(token, 0) + 1
            
print(token_counts)

{'plain': 1, 'fun': 1, 'the': 2, 'of': 1, 'summer': 1, 'boring': 1, 'energy': 1, 'no': 1, 'powerful': 1, 'very': 2, 'surprises': 1, 'film': 1, 'lacks': 1, 'most': 1, 'just': 1, 'entirely': 1, 'few': 1, 'predictable': 1, 'and': 2, 'laughs': 1}


It turns out that most words occur infrequently in our training set: 17 of the 20 word types occur exactly once. If we set tokens with a count less than 2 to UNK, we would end up removing most of the useful information from our model. So we won't do that.
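
To make that concrete, here is a small sketch of what the replacement would do to this corpus. The threshold `min_count = 2` and the helper `replace_rare` are illustrative choices, not something prescribed by J&M.

```python
from collections import Counter

documents = {
    "negative": ["just plain boring",
                 "entirely predictable and lacks energy",
                 "no surprises and very few laughs"],
    "positive": ["very powerful",
                 "the most fun film of the summer"],
}
min_count = 2  # hypothetical cutoff: rarer tokens become UNK

token_counts = Counter(t for docs in documents.values() for d in docs for t in d.split())


def replace_rare(doc):
    """Replace every token seen fewer than min_count times with UNK."""
    return [t if token_counts[t] >= min_count else "UNK" for t in doc.split()]


reduced = {c: [replace_rare(d) for d in docs] for c, docs in documents.items()}
reduced_vocabulary = {t for docs in reduced.values() for d in docs for t in d}
unk_count = sum(t == "UNK" for docs in reduced.values() for d in docs for t in d)

print(reduced)
print(reduced_vocabulary, unk_count)
```

With this cutoff only "the", "very", "and", and UNK survive, and 17 of the 23 training tokens become UNK.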

## Problem 2

We are given the following corpus, modified from the one in the chapter:

&lt;s&gt; I am Sam &lt;/s&gt;

&lt;s&gt; Sam I am &lt;/s&gt;

&lt;s&gt; I am Sam &lt;/s&gt;

&lt;s&gt; I do not like green eggs and Sam &lt;/s&gt;

If we use linear interpolation smoothing between a maximum-likelihood bigram model and a maximum-likelihood unigram model with $\lambda_1 = \frac{1}{2}$ and $\lambda_2 = \frac{1}{2}$, what is $P(\text{Sam} \mid \text{am})$? Include &lt;s&gt; and &lt;/s&gt; in your counts just like any other token.
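
One way to check the arithmetic is the sketch below. It assumes that $\lambda_1$ weights the bigram estimate and $\lambda_2$ the unigram estimate (the two are equal here, so the choice does not affect the result), and it uses exact fractions to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]
sentences = [s.split() for s in corpus]

# Unigram and bigram counts, with <s> and </s> treated like any other token
unigrams = Counter(t for s in sentences for t in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
total_tokens = sum(unigrams.values())

lambda_1 = lambda_2 = Fraction(1, 2)
p_bigram = Fraction(bigrams[("am", "Sam")], unigrams["am"])  # count(am Sam) / count(am)
p_unigram = Fraction(unigrams["Sam"], total_tokens)          # count(Sam) / N
p_interp = lambda_1 * p_bigram + lambda_2 * p_unigram

print(p_bigram, p_unigram, p_interp, float(p_interp))
```

Under these assumptions the bigram estimate is 2/3, the unigram estimate is 4/25, and the interpolated value is 31/75, or about 0.413.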

## Problem 3

Suppose we didn’t use the end-symbol &lt;/s&gt;. Train an unsmoothed bigram grammar on the following training corpus without using the end-symbol &lt;/s&gt;:

&lt;s&gt; a b

&lt;s&gt; b b

&lt;s&gt; b a

&lt;s&gt; a a

Demonstrate that your bigram model does not assign a single probability distribution across all sentence lengths by showing that the sum of the probabilities of the four possible two-word sentences over the alphabet {a,b} is 1.0, and the sum of the probabilities of all possible three-word sentences over the alphabet {a,b} is also 1.0.
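
A brute-force check is sketched below. It assumes a sentence's probability is the product of its bigram probabilities starting from &lt;s&gt;, with no end-of-sentence term; the helper names are made up for this example.

```python
from collections import Counter
from itertools import product

corpus = ["<s> a b", "<s> b b", "<s> b a", "<s> a a"]
sentences = [s.split() for s in corpus]

# Unsmoothed bigram counts, and how often each token appears as a context
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
context = Counter()
for (prev, _), n in bigrams.items():
    context[prev] += n


def sentence_probability(words):
    """Product of P(w_i | w_{i-1}), starting from <s>, with no </s> term."""
    p = 1.0
    for prev, nxt in zip(["<s>"] + list(words), words):
        p *= bigrams[(prev, nxt)] / context[prev]
    return p


def total_mass(length):
    """Sum of probabilities over all sentences of a given length over {a, b}."""
    return sum(sentence_probability(w) for w in product("ab", repeat=length))


print(total_mass(2), total_mass(3))
```

Every bigram probability here is 1/2, so each two-word sentence gets 1/4 and each three-word sentence gets 1/8. Each length sums to 1.0 on its own, which means the model defines a separate distribution per length rather than one distribution over all sentences.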

## Problem 4

A robot, which only has a camera as a sensor, can either be in one of two locations: L1 or L2. The robot doesn’t know exactly where it is and it represents this uncertainty by keeping track
of two probabilities: P(L1) and P(L2). Based on all past observations, the robot thinks that there is a 0.8 probability it is in L1 and a 0.2 probability that it is in L2.

The robot’s vision algorithm detects a window, and although there is only a window in L2, it can’t conclude that it is in fact in L2: its image recognition algorithm is not perfect. The probability of observing a window given there is no window at its location is 0.2 and the probability of observing a window given there is a window is 0.9. After incorporating the observation of a window, what are the robot’s new values for P(L1) and P(L2)?
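
The update is a direct application of Bayes' rule, sketched below. The dictionary names are illustrative. The likelihood for L1 is the false-positive rate 0.2 because L1 has no window, and the likelihood for L2 is the detection rate 0.9.

```python
# Prior belief over locations, from the problem statement
prior = {"L1": 0.8, "L2": 0.2}

# P(observe window | location): L1 has no window, L2 has one
p_window_obs = {"L1": 0.2, "L2": 0.9}

# Bayes' rule: posterior is proportional to prior times likelihood
unnormalized = {loc: prior[loc] * p_window_obs[loc] for loc in prior}
evidence = sum(unnormalized.values())  # P(observe window)
posterior = {loc: v / evidence for loc, v in unnormalized.items()}

print(posterior)
```

Under this reading, P(L1) becomes 0.16 / 0.34 = 8/17, about 0.47, and P(L2) becomes 0.18 / 0.34 = 9/17, about 0.53.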

## Problem 5

Binary multinomial NB seems to work better on some problems than full count NB, but full count works better on others. For what kinds of problems might binary NB be better, and why? (There is no known right answer to this question, but I'd like you to think about the possibilities.) Come up with an example.