## ❓ Pre-module quiz

Why is Naive Bayes "naive"?

A. Because it's the most basic, i.e. "naive" classifier we can build

B. Because it "naively" assumes that the probabilities of features (i.e., in our case, words) are independent of each other

C. Because the guy who invented it thought it was a cool name

D. Because it "naively" assumes that the probabilities of features (i.e., in our case, words) are dependent on each other

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>The correct answer is B - Naive Bayes assumes that the probability of finding a certain word is independent from the probability of finding another word. So, for example, in the domain of movie reviews, it assumes that the probabilities of finding the words <code>Indiana</code> and <code>Jones</code> are not correlated, even if in practice we know that this is not the case.</p>

</details> 


# Python for Computational Linguists 1.2: Breaking down Naive Bayes

## Introduction

Welcome to module 2.1! In this module we will review Naive Bayes and we will write our own implementation of the algorithm. We will use it to classify some toy examples, and then we will turn to Sentiment Analysis, by training our simple model on a research dataset.

### (⚠️ UPDATE ME!) Recap of last notebook

Module 2.1 introduced lots of new concepts and libraries - pre-processing, feature vectors, `numpy`, and so on. In fact, we have seen how to import data from a research dataset, how to clean them by removing punctuation and stop-words, how to use `numpy` to prepare test and training data for a model, and how to use `scikit-learn` to train, test and evaluate a simple Naive Bayes classifier for sentiment analysis.

However, we didn't look 'behind the scenes' at how Naive Bayes actually works. This module will guide you through your first hand-written machine learning model, by showing you how to write the maths for Naive Bayes yourself. 

### Naive Bayes refresher

> Note: this section borrows heavily from the Naive Bayes chapter of the lecture notes. Please refer to them or to J&M for more details.

Naive Bayes is a simple classifier based on two assumptions:
- The **bag-of-words** assumption: word ordering doesn't matter. We represent each document in our dataset as a list of pairs $(word_i,frequency_i)$.
- The **conditional independence** assumption: given the class, the probability of one word appearing in a document does not depend on the occurrence of any other word.
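To make the bag-of-words representation concrete, here is a minimal sketch using `collections.Counter` from the standard library. The toy `document` is just an illustrative example, not part of any dataset.

```python
from collections import Counter

# A toy document, already split into words (illustrative example).
document = ['fun', 'couple', 'love', 'love']

# Counter maps each distinct word to the number of times it occurs;
# the order of the words in the document is discarded.
bag_of_words = Counter(document)

print(sorted(bag_of_words.items()))
# [('couple', 1), ('fun', 1), ('love', 2)]
```

Any reordering of the same words gives the same bag, which is exactly what "word ordering doesn't matter" means.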

These two assumptions heavily simplify our model, since they completely disregard grammar and any domain knowledge (e.g. the `Indiana Jones` example in the pre-module quiz); but as we know now, they allow us to build a surprisingly efficient classifier.

Let's quickly go through the maths of Naive Bayes. Remember that given a document $d$ and a set of classes $C$, we need to assign the document to the class $\hat{c} \in C$ which has the maximum *a posteriori probability*, i.e. where $\hat{c}$ is defined as follows:

$$ \hat{c} = \text{argmax}_{c \in C} P(c \mid d)$$

How do we calculate $\hat{c}$ efficiently? Well, Bayes' rule tells us that

$$ P(c \mid d) = \frac{P(c) \ P(d \mid c)}{P(d)} $$

Allowing us to rewrite 

$$\begin{align} 
\hat{c} &= \text{argmax}_{c \in C} P(c \mid d) \\
        &= \text{argmax}_{c \in C} \frac{P(c) \ P(d \mid c)}{P(d)}
\end{align}$$

However, $P(d)$ is the same for every class $c$, so dividing by it does not change which class attains the maximum. Hence we can remove it, leaving only:

$$
\hat{c} = \text{argmax}_{c \in C} 
    \underbrace{P(c)}_\text{prior}
    \underbrace{P(d \mid c)}_\text{likelihood}
$$

Where the $prior$ is the **prior probability** of the class $c$ and the $likelihood$ is the probability of finding $d$ given the class $c$.
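The following sketch illustrates why dropping $P(d)$ is harmless. The numbers are made up for illustration; the point is only that the unnormalised scores and the normalised posteriors pick the same class.

```python
# Made-up numbers for one document d and two classes.
prior = {'comedy': 0.4, 'action': 0.6}           # P(c)
likelihood = {'comedy': 0.002, 'action': 0.001}  # P(d | c)

# Unnormalised scores: P(c) * P(d | c)
scores = {c: prior[c] * likelihood[c] for c in prior}

# P(d) is the sum of the scores over all classes
# (law of total probability).
p_d = sum(scores.values())
posteriors = {c: scores[c] / p_d for c in scores}

# Dividing every score by the same positive constant
# cannot change which one is the largest.
best_unnormalised = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
print(best_unnormalised, best_posterior)
```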

Using words as features, we can represent $d$ as a list of words $w_1, \dots , w_n$, hence 

$$
\hat{c} = \text{argmax}_{c \in C} 
    \underbrace{P(c)}_\text{prior}
    \underbrace{P(w_1, \dots , w_n \mid c)}_\text{likelihood} 
$$

However, $P(w_1, \dots , w_n \mid c)$ may be prohibitively hard to calculate, since we would need to estimate the probability of every possible combination of words. Here, the **conditional independence** assumption comes to the rescue: by assuming that the probabilities of the words (i.e. features) are independent of each other given the class, it allows us to finally rewrite

$$
\begin{align}
\hat{c} &= \text{argmax}_{c \in C} 
       \underbrace{P(c)}_\text{prior}
       \underbrace{P(w_1 \mid c) \times \dots \times P(w_n \mid c)}_\text{likelihood} 
       \\
       &= \text{argmax}_{c \in C} 
       \underbrace{P(c)}_\text{prior}
       \underbrace{\prod_{w \in d}{P(w \mid c)}}_\text{likelihood} 
\end{align}
$$
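To get a feeling for how much the conditional independence assumption saves us, the sketch below counts how many probabilities we would have to estimate per class with and without it. The vocabulary size and document length are arbitrary illustrative values.

```python
vocabulary_size = 7   # e.g. seven distinct words
document_length = 4   # a four-word document

# Without the independence assumption: one probability for every
# possible word sequence of this length (per class).
joint_outcomes = vocabulary_size ** document_length

# With the assumption: one probability per word (per class).
per_word_parameters = vocabulary_size

print(joint_outcomes, per_word_parameters)
# 2401 7
```

With a realistic vocabulary of tens of thousands of words, the first number quickly becomes astronomically large, while the second grows only linearly.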

What does all of this mean in practice? Well, that if we have a document $d$, all we need to know to classify it is:
- The *priors*, i.e. the probability $P(c)$ of each class $c$, before we look at the words of the document
- The *likelihoods*, i.e. the probability $P(w_i \mid c)$ of each word $w_i$ of the document given each class $c$.

## Naive Bayes: a simple implementation

Let's begin with a simple example from the Post Lecture exercises (taken from J&M-3, exercise 4.2). Given the following short movie reviews, each labeled with a genre, either comedy or action:

| review                      | class  |
|-----------------------------|--------|
| fun, couple, love, love     | comedy |
| fast, furious, shoot        | action |
| couple, fly, fast, fun, fun | comedy |
| furious, shoot, shoot, fun  | action |
| fly, fast, shoot, love      | action |

And a new document D: 

| review                     | class  |
|----------------------------|--------|
| fast, couple, shoot, fly   | ?      |

We have to compute the most likely class for D.

Let's start by saving our documents in some vectors:

In [1]:
train_docs = [
    ['fun', 'couple', 'love', 'love'],
    ['fast', 'furious', 'shoot'],
    ['couple', 'fly', 'fast', 'fun', 'fun'],
    ['furious', 'shoot', 'shoot', 'fun'],    
    ['fly', 'fast', 'shoot', 'love']]

train_labels = ['comedy', 'action', 'comedy', 'action', 'action']
test_doc = ['fast', 'couple', 'shoot', 'fly']

### Computing the priors

Remember what we needed to do? The first step is to compute the **priors**. Let's do that with a simple function:

In [2]:
# What are our classes?
classes = set(train_labels)
print(classes)

# initialise the priors
priors = {}
for _class in classes:
    priors[_class] = 0

# count how many training examples are in each class
for _class in classes:
    for label in train_labels:
        if _class == label:
            priors[_class] += 1

print(priors)

{'comedy', 'action'}
{'comedy': 2, 'action': 3}


> **<h3>💻 Try it yourself!</h3>**

Now the priors are not *normalised*: they are raw counts rather than probabilities. We have to bring each prior into the range $[0,1]$, so that together they sum to $1$. Can you do that in the following cell?

In [3]:
# write your code here

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p><pre><code>
for _class in classes:
    priors[_class] = priors[_class] / len(train_labels)
    </code></pre></p>

</details> 

If you've got the priors correct, you should have $P(comedy) = 0.4$ and $P(action)=0.6$.

In [6]:
print(priors)

{'comedy': 0.4, 'action': 0.6}


> **<h3>💻 Try it yourself!</h3>**

Now let's wrap everything nicely into a function. Can you complete the cell below?

In [None]:
def compute_priors(labels):
    '''
    Computes the priors for a set of labels.
    '''

    # What are our classes?
    classes = set(labels)

    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # initialise the priors
    priors = {}

    # ...?

    # count how many training examples are in each class
    
    # ...?
    
    # normalise the priors
    
    # ...?
    
    # ~      end exercise    ~ #
    
    return priors

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p><pre><code>
def compute_priors(labels):
    '''
    Computes the priors for a set of labels.
    '''    
    # What are our classes?
    classes = set(labels)
    # initialise the priors
    priors = {}
    for _class in classes:
        priors[_class] = 0
    # count how many training examples are in each class
    for _class in classes:
        for label in labels:
            if _class == label:
                priors[_class] += 1 
    # normalise the priors
    for _class in classes:
        priors[_class] = priors[_class] / len(labels)        
    return priors
  </code></pre></p>
</details> 

In [8]:
priors = compute_priors(train_labels)
print(priors)

{'comedy': 0.4, 'action': 0.6}
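For comparison, the same computation can be written more compactly with `collections.Counter`. This is only a sketch of an alternative; `compute_priors_counter` is a hypothetical name, and the rest of this module keeps using `compute_priors`.

```python
from collections import Counter

def compute_priors_counter(labels):
    '''
    Hypothetical compact variant of compute_priors:
    counts each label and divides by the number of labels.
    '''
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

labels = ['comedy', 'action', 'comedy', 'action', 'action']
print(compute_priors_counter(labels))
# {'comedy': 0.4, 'action': 0.6}
```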


### Find the vocabulary

Since we need to compute the likelihoods for all words in our vocabulary, we need to find all the words in our corpus first. Let's build our vocabulary with a simple function:

In [9]:
def create_vocabulary(lines):
    '''
    Creates a vocabulary from text.
    
    Parameters
    ----------
    lines: a list of lists of words
    
    Returns
    -------
    a set with all the words in the lines.
    
    '''
    
    words = set()
    for line in lines:
        for word in line:
            words.add(word)
            
    return words

In [10]:
create_vocabulary(train_docs)

{'couple', 'fast', 'fly', 'fun', 'furious', 'love', 'shoot'}
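The two nested loops can also be expressed as a set comprehension. The sketch below is an equivalent alternative under the same input format (a list of lists of words); `create_vocabulary_compact` is a hypothetical name.

```python
def create_vocabulary_compact(lines):
    '''
    Hypothetical one-line variant of create_vocabulary:
    collects every word of every line into a set.
    '''
    return {word for line in lines for word in line}

docs = [
    ['fun', 'couple', 'love', 'love'],
    ['fast', 'furious', 'shoot'],
    ['couple', 'fly', 'fast', 'fun', 'fun'],
    ['furious', 'shoot', 'shoot', 'fun'],
    ['fly', 'fast', 'shoot', 'love']]

print(sorted(create_vocabulary_compact(docs)))
# ['couple', 'fast', 'fly', 'fun', 'furious', 'love', 'shoot']
```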

### Computing the likelihoods

Now we need to compute the likelihoods of the words w.r.t. each class. To do that, we can build a `dict` for each class, where `dict[word] = P(word|class)`. We can get there by slightly modifying the function `create_vocab_dict` that we defined in [Module 1.4](../../module_1/module_1.4/module_1.4.ipynb). The function we defined was the following:

In [11]:
def create_vocab_dict(lines):
    '''
    Collect vocabulary counts from text

    Parameters
    ----------
    lines: a list of lists of words

    Returns
    -------
    a dictionary with words (str) as keys and counts(int) as values
    vocab={
    'SONNETS': 1
    }
    '''
    vocab={}# create an empty vocabulary dictionary to store words as keys and counts as values later. 
    for line in lines:
        for word in line:
            if word in vocab:
                vocab[word]+=1 # update the count for an existing word
            else:
                vocab[word]=1 # initialise the count for a new word
    return vocab

In [12]:
print(create_vocab_dict(train_docs))

{'fun': 4, 'couple': 2, 'love': 3, 'fast': 3, 'furious': 2, 'shoot': 4, 'fly': 2}


> **<h3>💻 Try it yourself!</h3>**

How can we modify this function to give us the likelihoods for each class? Modify the function below, directly derived from `create_vocab_dict`, to return the likelihoods instead of the raw counts.

In [None]:
def compute_likelihoods(lines, vocabulary):
    '''
    Computes the likelihoods of words in a list of strings.

    Parameters
    ----------
    lines: a list of list of words
    vocabulary: the vocabulary of the full corpus

    Returns
    -------
    a dictionary with words (str) as keys and likelihoods(floats) as values
    vocab={
    'SONNETS': 0.01
    }
    '''
    
    
    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # create an empty vocabulary dictionary to store words 
    # as keys and counts as values later. 
    
    likelihoods = {}
    
    # initialise the likelihoods
    # hint: iterate through the vocabulary and initialise
    # a new element of the likelihoods dict to 0

    
    
    # ~      end exercise    ~ #    
    
    # Now we iterate through the lines
    for line in lines:
        for word in line:
            likelihoods[word] += 1 

    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # how long are our documents?
    total_tokens = 0
    
    # write your code here
    # (hint: sum the length of the lines in total_tokens!)
    
    for word in likelihoods:
        likelihoods[word] = # Write your code here
    
    # ~      end exercise    ~ #

    return likelihoods

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p><pre><code>
def compute_likelihoods(lines, vocabulary):
    '''
    Computes the likelihoods of words in a list of strings.
    Parameters
    ----------
    lines: a list of list of words
    vocabulary: the vocabulary of the full corpus
    Returns
    -------
    a dictionary with words (str) as keys and likelihoods(floats) as values
    vocab={
    'SONNETS': 0.01
    }
    '''
    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # create an empty vocabulary dictionary to store words 
    # as keys and counts as values later. 
    likelihoods = {}
    # initialise the likelihoods
    # hint: iterate through the vocabulary and initialise
    # a new element of the likelihoods dict to 0
    for word in vocabulary:
        likelihoods[word] = 0
    # ~      end exercise    ~ #    
    # Now we iterate through the lines
    for line in lines:
        for word in line:
            likelihoods[word] +=1 
    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # how long are our documents?
    total_tokens = 0
    # write your code here
    # (hint: sum the length of the lines in total_tokens!)
    for line in lines:
        for word in line:
            total_tokens +=1    
    for word in likelihoods:
        likelihoods[word] = likelihoods[word]/total_tokens
    # ~      end exercise    ~ #
    return likelihoods
  </code></pre></p>
</details> 

In [15]:
print(compute_likelihoods(train_docs, create_vocabulary(train_docs)))

{'fun': 0.2, 'fly': 0.1, 'fast': 0.15, 'love': 0.15, 'couple': 0.1, 'furious': 0.1, 'shoot': 0.2}
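A quick sanity check for any set of likelihoods is that they form a probability distribution, i.e. that they sum to $1$. The sketch below copies the values printed above and compares the sum with `math.isclose`, since floating-point sums are rarely exact.

```python
import math

# Values copied from the output above.
likelihoods_all = {'fun': 0.2, 'fly': 0.1, 'fast': 0.15, 'love': 0.15,
                   'couple': 0.1, 'furious': 0.1, 'shoot': 0.2}

total = sum(likelihoods_all.values())

# Compare with a tolerance instead of ==, because of rounding errors.
print(math.isclose(total, 1.0))
```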


Note that this call computes the likelihoods of the words over the whole training corpus, regardless of the classes. To get the likelihoods for a single class, we first need to select the documents of that class:

In [16]:
target_class = 'action'
target_docs = []
vocabulary = create_vocabulary(train_docs)

# enumerate yields (index, doc) pairs, hence allowing
# us to retrieve the label for each doc
for i, doc in enumerate(train_docs):
    if train_labels[i] == target_class:
        target_docs.append(doc)
        
print(target_docs)
print(compute_likelihoods(target_docs, vocabulary))

[['fast', 'furious', 'shoot'], ['furious', 'shoot', 'shoot', 'fun'], ['fly', 'fast', 'shoot', 'love']]
{'fun': 0.09090909090909091, 'fly': 0.09090909090909091, 'fast': 0.18181818181818182, 'love': 0.09090909090909091, 'couple': 0.0, 'furious': 0.18181818181818182, 'shoot': 0.36363636363636365}
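The selection of the documents of one class can also be written with `zip`, which pairs each document with its label directly. This is a sketch of an equivalent alternative to the `enumerate` loop above, using the same toy data.

```python
train_docs = [
    ['fun', 'couple', 'love', 'love'],
    ['fast', 'furious', 'shoot'],
    ['couple', 'fly', 'fast', 'fun', 'fun'],
    ['furious', 'shoot', 'shoot', 'fun'],
    ['fly', 'fast', 'shoot', 'love']]
train_labels = ['comedy', 'action', 'comedy', 'action', 'action']

target_class = 'action'

# zip pairs the i-th document with the i-th label,
# so we don't need to index train_labels by hand.
target_docs = [doc for doc, label in zip(train_docs, train_labels)
               if label == target_class]

print(target_docs)
```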


### The training function

> **<h3>💻 Try it yourself!</h3>**

Now we have all the instruments to build our training function. Can you complete it?

In [None]:
def train_naive_bayes(documents, labels):
    
    classes = set(labels)
    
    # compute the priors
    priors = compute_priors(labels)
    vocabulary = create_vocabulary(documents)
    
    # this dict will contain the likelihoods, e.g.
    # likelihoods['action'] = {'fast': 0.2, 'furious': 0.1...
    likelihoods = {}
    
    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    for _class in classes:
        # get the documents belonging to each class:
        class_docs = []

        # put your code here
        for #...
        
        # compute the likelihood of this class
        likelihoods[_class] = # hint: use compute_likelihoods defined above
        
    # ~      end exercise    ~ #
        
    return priors, likelihoods    

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p><pre><code>
def train_naive_bayes(documents, labels):
    classes = set(labels)
    # compute the priors
    priors = compute_priors(labels)
    vocabulary = create_vocabulary(documents)
    # this dict will contain the likelihoods, e.g.
    # likelihoods['action'] = {'fast': 0.2, 'furious': 0.1...
    likelihoods = {}
    for _class in classes:
        # get the documents belonging to each class:
        class_docs = []
        for i, doc in enumerate(documents):
            if labels[i] == _class:
                class_docs.append(doc)
        # compute the likelihood of this class
        likelihoods[_class] = compute_likelihoods(class_docs, vocabulary)
    return priors, likelihoods 
    </code></pre></p>
</details> 

In [18]:
priors, likelihoods = train_naive_bayes(train_docs, train_labels)

print('Priors:')
print(priors)

print('')
print('Likelihoods:')
print(likelihoods)

Priors:
{'comedy': 0.4, 'action': 0.6}

Likelihoods:
{'comedy': {'fun': 0.3333333333333333, 'fly': 0.1111111111111111, 'fast': 0.1111111111111111, 'love': 0.2222222222222222, 'couple': 0.2222222222222222, 'furious': 0.0, 'shoot': 0.0}, 'action': {'fun': 0.09090909090909091, 'fly': 0.09090909090909091, 'fast': 0.18181818181818182, 'love': 0.09090909090909091, 'couple': 0.0, 'furious': 0.18181818181818182, 'shoot': 0.36363636363636365}}


### Predict unknown classes

So now we have trained our model. How can we predict the likelihood of new sentences belonging to each class? Remember from above that

$$
\hat{c} = \text{argmax}_{c \in C} 
       \underbrace{P(c)}_\text{prior}
       \underbrace{\prod_{w \in d}{P(w \mid c)}}_\text{likelihood} 
$$

So, to predict the class of a new sentence, we need to:
- for each class `c`, calculate `prob_c` by
    - starting from the prior of the class `c`
    - multiplying it by the probability of each word `w` given that class `c`

Then, we'll just have to look at the maximum of our `prob_c`s. So, can we predict the class of `fast, fun, love, fly`? Let's write a function to do this.

In [19]:
def bayes_predict(document, priors, likelihoods):
    '''
    Predicts the label for a document given the trained
    priors and likelihoods.
    
    Parameters
    ----------
    document: the document to analyse
    priors: the trained priors
    likelihoods: the trained likelihoods
    
    Return
    ------
    The probability for each class.
    '''
    
    classes_probabilities = {}

    # unpack the dictionary and iterate 
    # through the priors
    for label, prior in priors.items():
        
        # initialise the probability of a class to its prior
        prob_class = prior
        for word in document:
            if word in likelihoods[label]:
                # multiply the running probability by the likelihood of each word
                prob_class = prob_class*likelihoods[label][word]
        classes_probabilities[label] = prob_class
        
    return classes_probabilities
    

Good! Let's try it with:

In [20]:
document = ['fast', 'fun', 'love', 'fly']
bayes_predict(document, priors, likelihoods)

{'action': 8.19616146438085e-05, 'comedy': 0.0003657978966620942}

So for this document, `comedy` is the most likely class! 
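As a sanity check, the two scores above can be reproduced by hand from the unsmoothed likelihoods (the comedy reviews contain 9 tokens in total, the action reviews 11), multiplying the factors in the order `fast`, `fun`, `love`, `fly`:

$$
\begin{align}
P(comedy) \prod_{w \in d} P(w \mid comedy) &= 0.4 \times \frac{1}{9} \times \frac{3}{9} \times \frac{2}{9} \times \frac{1}{9} = \frac{2.4}{6561} \approx 3.66 \times 10^{-4} \\
P(action) \prod_{w \in d} P(w \mid action) &= 0.6 \times \frac{2}{11} \times \frac{1}{11} \times \frac{1}{11} \times \frac{1}{11} = \frac{1.2}{14641} \approx 8.20 \times 10^{-5}
\end{align}
$$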

Now let's try with the document for the assignment:

In [21]:
print(test_doc)
bayes_predict(test_doc, priors, likelihoods)

['fast', 'couple', 'shoot', 'fly']


{'action': 0.0, 'comedy': 0.0}

Now both probabilities are zero! How come? Well, we didn't apply any smoothing, so at some point we multiply by a zero likelihood: $P(couple \mid action) = 0$ and $P(shoot \mid comedy) = 0$, since `couple` never appears in an action review and `shoot` never appears in a comedy review.
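The effect is easy to reproduce in isolation: a single zero factor wipes out the whole product, no matter how large the other factors are. The numbers below are made up for illustration.

```python
import math

# Made-up likelihoods for one class; one word was never
# seen with this class, so its likelihood is 0.
factors = [0.6, 0.18, 0.0, 0.36, 0.09]

# math.prod multiplies all the elements together (Python 3.8+).
print(math.prod(factors))
# 0.0
```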

### Adding 1-smoothing

> **<h3>💻 Try it yourself!</h3>**

How can we modify `compute_likelihoods` to apply add-1 smoothing? Please update the function below.

Note that:
- We added a parameter (`smoothing`) to select the smoothing mode
- You will need to change how the likelihoods are normalised.

In [22]:
def compute_likelihoods(lines, vocabulary, smoothing):
    '''
    Computes the likelihoods of words in a list of strings.

    Parameters
    ----------
    lines: a list of list of words
    vocabulary: the vocabulary of the training corpus
    smoothing: the smoothing method to use ('none' or 'add1')

    Returns
    -------
    a dictionary with words (str) as keys and likelihoods (floats) as values
    '''
    
    likelihoods = {}

    # populate the likelihoods
    for word in vocabulary:
        likelihoods[word] = 0
 

    # Now we iterate through the lines to count
    # the appearances of each word
    for line in lines:
        for word in line:
            likelihoods[word] +=1 

    # how long are our documents?
    total_tokens = 0
    for line in lines:
        for word in line:
            total_tokens +=1

    # Apply smoothing, if needed
    for word in likelihoods:
        if smoothing == 'none':
            # no smoothing: plain relative frequency.
            # total_tokens was already computed above, so we
            # must not count the tokens again here.
            likelihoods[word] = likelihoods[word]/total_tokens
            
        elif smoothing == 'add1':
            # -------------------------#
            #      E X E R C I S E     #
            # -------------------------#
            # calculate the smoothing parameter for each word.
            smoothing_param = # ...?
            
            likelihoods[word] = # ...?
            # ~      end exercise    ~ #
        else:
            print('Unknown smoothing!')
            return

    return likelihoods


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p><pre><code>
def compute_likelihoods(lines, vocabulary, smoothing):
    '''
    Computes the likelihoods of words in a list of strings.
    Parameters
    ----------
    lines: a list of list of words
    vocabulary: the vocabulary of the training corpus
    smoothing: the smoothing method to use ('none' or 'add1')
    Returns
    -------
    a dictionary with words (str) as keys and likelihoods (floats) as values
    '''
    likelihoods = {}
    # populate the likelihoods
    for word in vocabulary:
        likelihoods[word] = 0
    # Now we iterate through the lines to count
    # the appearances of each word
    for line in lines:
        for word in line:
            likelihoods[word] +=1 
    # how long are our documents?
    total_tokens = 0
    for line in lines:
        for word in line:
            total_tokens +=1
    # Apply smoothing, if needed
    for word in likelihoods:
        if smoothing == 'none':
            # no smoothing: plain relative frequency
            # (total_tokens was already computed above)
            likelihoods[word] = likelihoods[word]/total_tokens
        elif smoothing == 'add1':
            # -------------------------#
            #      E X E R C I S E     #
            # -------------------------#
            # calculate the smoothing parameter for each word.
            smoothing_param = total_tokens + len(vocabulary)
            likelihoods[word] = (likelihoods[word] + 1)/smoothing_param
            # ~      end exercise    ~ #
        else:
            print('Unknown smoothing!')
            return
    return likelihoods
    </code></pre></p>

</details> 

Now let's update `train_naive_bayes` to instruct it to use smoothing, and let's see if the results we obtain are correct:

In [24]:
def train_naive_bayes(documents, labels, smoothing):

    classes = set(labels)

    # compute the priors
    priors = compute_priors(labels)
    vocabulary = create_vocabulary(documents)

    # this dict will contain the likelihoods, e.g.
    # likelihoods['action'] = {'fast': 0.2, 'furious': 0.1...
    likelihoods = {}

    for _class in classes:
        # get the documents belonging to each class:
        class_docs = []

        for i, doc in enumerate(documents):
            if labels[i] == _class:
                class_docs.append(doc)

        # compute the likelihood of this class
        likelihoods[_class] = compute_likelihoods(class_docs, vocabulary, smoothing)

    return priors, likelihoods 

In [25]:
priors, likelihoods = train_naive_bayes(train_docs, train_labels, 'add1')

print('Priors:')
print(priors)

print('')
print('Likelihoods:')
print(likelihoods)

Priors:
{'comedy': 0.4, 'action': 0.6}

Likelihoods:
{'comedy': {'fun': 0.25, 'fly': 0.125, 'fast': 0.125, 'love': 0.1875, 'couple': 0.1875, 'furious': 0.0625, 'shoot': 0.0625}, 'action': {'fun': 0.1111111111111111, 'fly': 0.1111111111111111, 'fast': 0.16666666666666666, 'love': 0.1111111111111111, 'couple': 0.05555555555555555, 'furious': 0.16666666666666666, 'shoot': 0.2777777777777778}}
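The smoothed values above can be checked by hand. Assuming the usual add-1 (Laplace) estimate, where $count(w, c)$ is the number of times the word $w$ occurs in the documents of class $c$ and $V$ is the vocabulary:

$$
P(w \mid c) = \frac{count(w, c) + 1}{\sum_{w' \in V} count(w', c) + |V|}
$$

In our toy corpus $|V| = 7$, the comedy reviews contain 9 tokens and the action reviews contain 11, so for example:

$$
\begin{align}
P(fun \mid comedy) &= \frac{3 + 1}{9 + 7} = \frac{4}{16} = 0.25 \\
P(shoot \mid comedy) &= \frac{0 + 1}{9 + 7} = \frac{1}{16} = 0.0625 \\
P(shoot \mid action) &= \frac{4 + 1}{11 + 7} = \frac{5}{18} \approx 0.278 \\
P(couple \mid action) &= \frac{0 + 1}{11 + 7} = \frac{1}{18} \approx 0.056
\end{align}
$$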


In [26]:
bayes_predict(test_doc, priors, likelihoods)

{'action': 0.00017146776406035664, 'comedy': 7.324218750000001e-05}

Great! We successfully implemented Naive Bayes in Python. 

### Improving Naive Bayes: `argmax` and log-likelihoods

What we've done until now is good - but there are still a couple of open issues. 
- First, we should note that the original function of Naive Bayes is based on $\text{argmax}$: how do we implement it in Python?
- Then, we may note that the probabilities returned by our model are very small (i.e. on the order of $10^{-5}$), which means that for long documents and big vocabularies we may reach probabilities extremely close to zero.

While the latter may not sound like a big issue for you, unfortunately dealing with very small numbers may be problematic for a computer - as mentioned in Module 1.2, computers use [floating-point arithmetic](https://en.wikipedia.org/wiki/Floating-point_arithmetic), which in practice means that they aren't actually very good at dealing with very small or very big numbers.

Actually, machines can't exactly represent any number that can't be expressed as a finite sum of powers of $2$ (including negative powers such as $2^{-1} = 0.5$). Take for example $0.1$. If we ask Python to show it, it will print

In [27]:
0.1

0.1

Which looks fine. But what if we ask Python to show the first 40 decimal digits of $0.1$? That shouldn't make sense, right? Let's see:

In [28]:
print(f'{0.1:.40f}')

0.1000000000000000055511151231257827021182


See? In binary, 0.1 is the infinitely repeating fraction `0.0001100110011001100110011001100110011001100110011...`. A Python `float` is a 64-bit double-precision number with only 53 bits of precision, so it can't represent $0.1$ exactly; it stores the closest approximation instead, i.e. the value printed in the cell above.
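The sketch below shows two more ways to observe this behaviour, using only the standard library: `decimal.Decimal` reveals the exact value stored in a float, and `math.isclose` is the usual way to compare floats with a tolerance.

```python
from decimal import Decimal
import math

# Decimal(float) converts the float exactly, showing the
# value that is really stored for the literal 0.1.
print(Decimal(0.1))

# Rounding errors show up in ordinary arithmetic...
print(0.1 + 0.2 == 0.3)              # False

# ...so floats should be compared with a tolerance.
print(math.isclose(0.1 + 0.2, 0.3))  # True
```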

This does not mean that our code is buggy or that Python is getting it wrong. It's just how computers work! Unfortunately, Naive Bayes multiplies many small numbers together, so the result loses precision and, for long documents, can even underflow to zero. For this reason, we use the **log-likelihoods** instead of the actual likelihoods. Since the logarithm is monotonically increasing, taking the log does not change which class attains the maximum, so we can rewrite:
$$
\begin{align}
\hat{c} &= \text{argmax}_{c \in C} 
       \underbrace{P(c)}_\text{prior}
       \underbrace{\prod_{w \in d}{P(w \mid c)}}_\text{likelihood} \\
       &= \text{argmax}_{c \in C} \log(  
       \underbrace{P(c)}_\text{prior}
       \underbrace{\prod_{w \in d}{P(w \mid c)}}_\text{likelihood} ) \\
       &= \text{argmax}_{c \in C} 
       \underbrace{\log P(c)}_\text{prior} +
       \underbrace{\sum_{w \in d}{\log P(w \mid c)}}_\text{likelihood} 
\end{align}
$$

To do that, we will use `numpy`, arguably Python's most popular mathematical library, which offers the function `log`.

In the next cells we'll install numpy (if it's not already installed), and then we'll import it. You'll almost always find numpy imported as `np` - so if in any Python code you see something like `np.` you can assume that it's based on numpy.

In [29]:
!pip install numpy



In [30]:
# import and test with ln(e), i.e. the logarithm to the base e of e
import numpy as np
np.log(np.e)

1.0

Numpy also offers a convenient `argmax` function. Remember that `argmax` works by selecting the *index* of the biggest element of a list:

In [31]:
np.argmax([0,-2,10,3])   # Remember that we use 0-based indexing!

2

Now we can wrap everything in a new `bayes_predict_log` function, which will be the log-based version of `bayes_predict`. Can you write it yourself?

> **<h3>💻 Try it yourself!</h3>**

In [None]:
def bayes_predict_log(document, priors, likelihoods):
    '''
    Predicts the label for a document given the trained
    priors and likelihoods.
    
    Parameters
    ----------
    document: the document to analyse
    priors: the trained priors
    likelihoods: the trained likelihoods
    
    Return
    ------
    A tuple (best_class, probabilities), where the first element
    is the name of the best class, and the second element is the dictionary of
    the computed probabilities.
    '''
    
    classes_probabilities = {}

    # unpack the dictionary and iterate 
    # through the priors
    for label, prior in priors.items():
        
        # -------------------------#
        #      E X E R C I S E     #
        # -------------------------#
        # initialise the probability of a class to the log of its prior
        prob_class = # ...?
        for word in document:
            if word in likelihoods[label]:
                # sum the prior with the log-likelihood of each word
                prob_class =  # ...?
        classes_probabilities[label] = prob_class
    
    # get the names of the classes
    class_names = list(priors.keys())

    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # complete the next line by selecting the name of the best class.
    # hint: the probabilities are saved in the dictionaries, and
    # you can access them by using classes_probabilities.values().
    # note that you can use more than 1 line if you need!
    
    best_class = # ...?

    # ~      end exercise    ~ #
    
    return best_class, classes_probabilities    

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p><pre><code>
def bayes_predict_log(document, priors, likelihoods):
    '''
    Predicts the label for a document given the trained
    priors and likelihoods.
    Parameters
    ----------
    document: the document to analyse
    priors: the trained priors
    likelihoods: the trained likelihoods
    Return
    ------
    A tuple (best_class, probabilities), where the first element
    is the name of the best class, and the second element is the dictionary of
    the computed probabilities.
    '''
    classes_probabilities = {}
    # unpack the dictionary and iterate 
    # through the priors
    for label, prior in priors.items():   
        # -------------------------#
        #      E X E R C I S E     #
        # -------------------------#
        # initialise the probability of a class to the log of its prior
        prob_class = np.log(prior)
        for word in document:
            if word in likelihoods[label]:
                # sum the prior with the log-likelihood of each word
                prob_class = prob_class + np.log(likelihoods[label][word])
        classes_probabilities[label] = prob_class
    # get the names of the classes
    class_names = list(priors.keys())
    # -------------------------#
    #      E X E R C I S E     #
    # -------------------------#
    # complete the next line by selecting the name of the best class.
    # hint: the probabilities are saved in the dictionaries, and
    # you can access them by using classes_probabilities.values(), which
    # can be converted to a list using list(x)
    best_class = class_names[np.argmax(list(classes_probabilities.values()))]
    # ~      end exercise    ~ #
    return best_class, classes_probabilities          
  </code></pre></p>

</details> 

In [33]:
document = ['fast', 'couple', 'shoot', 'fly']
bayes_predict_log(document, priors, likelihoods)

('action', {'action': -8.671115273688494, 'comedy': -9.52173897104528})

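Good! Now the values are in a much more comfortable range: the absolute value of a sum of log-probabilities grows only linearly with the length of the document, so we are far less likely to run into underflow. Whether the code also runs quicker is less clear-cut: additions are cheap, but we now pay for a call to `np.log` for every word unless the logs are cached.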
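To see the underflow problem in action, here is a minimal sketch that multiplies the same small probability many times and, in parallel, sums its logarithm. The likelihood `p` and the document length `n_words` are made-up values chosen to trigger the effect.

```python
import math

p = 1e-5        # a made-up per-word likelihood
n_words = 100   # a made-up document length

product = 1.0
log_sum = 0.0
for _ in range(n_words):
    product *= p            # shrinks towards zero very quickly
    log_sum += math.log(p)  # decreases by a fixed amount each step

print(product)   # the true value, 1e-500, is too small for a float
print(log_sum)   # the log-space score is still perfectly usable
```

The product ends up as exactly `0.0`, so every class would get the same score, while the sum of logs is roughly $-1151$ and can still be compared across classes.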

### Other improvements

Naturally, we could improve our Naive Bayes implementation even further. We could try different smoothing techniques, we could cache the log-likelihoods in the `likelihoods` dictionary instead of applying `np.log` every time, and so on. For now, let's just be happy with what we've done, and let's apply our code to the data of the previous lecture.
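As an illustration of the caching idea mentioned above, the sketch below takes the logarithms once after training and then only sums them at prediction time. It is a sketch under assumptions: `precompute_logs` and `predict_from_logs` are hypothetical names, it uses `math.log` from the standard library instead of `np.log`, and it assumes smoothed likelihoods, since the logarithm of zero is undefined. The toy values are the add-1 likelihoods of the four test words from the example above.

```python
import math

def precompute_logs(priors, likelihoods):
    '''Hypothetical helper: take all the logs once, after training.'''
    log_priors = {c: math.log(p) for c, p in priors.items()}
    log_likelihoods = {
        c: {w: math.log(p) for w, p in word_probs.items()}
        for c, word_probs in likelihoods.items()
    }
    return log_priors, log_likelihoods

def predict_from_logs(document, log_priors, log_likelihoods):
    '''Hypothetical predictor that only sums the cached logs.'''
    scores = {}
    for c, log_prior in log_priors.items():
        # words outside the vocabulary are skipped, as in bayes_predict
        scores[c] = log_prior + sum(
            log_likelihoods[c][w] for w in document
            if w in log_likelihoods[c])
    # max with key=scores.get returns the class with the highest score
    return max(scores, key=scores.get), scores

toy_priors = {'comedy': 0.4, 'action': 0.6}
toy_likelihoods = {
    'comedy': {'fast': 2/16, 'couple': 3/16, 'shoot': 1/16, 'fly': 2/16},
    'action': {'fast': 3/18, 'couple': 1/18, 'shoot': 5/18, 'fly': 2/18},
}

log_priors, log_likelihoods = precompute_logs(toy_priors, toy_likelihoods)
best, scores = predict_from_logs(['fast', 'couple', 'shoot', 'fly'],
                                 log_priors, log_likelihoods)
print(best, scores)
```

Note that `max(scores, key=scores.get)` plays the role of `argmax` directly on the dictionary, so no conversion to a list is needed.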

## Classify the *Thumbs up?* dataset

Now let's test our model on a real research dataset. We will use the data from the paper [*Thumbs up? Sentiment Classification using Machine Learning Techniques*](https://www.aclweb.org/anthology/W02-1011.pdf). The dataset consists of 1301 positive reviews and 752 negative reviews from [IMDb](https://www.imdb.com/), a large online database of facts and reviews of movies.

For this part, you are **not** required to understand how the data loading and preparation works, but we encourage you to study the code anyway. After we have loaded the dataset, you will use the model you just wrote to classify it.
The paper reports 78.7% accuracy for their Naive Bayes unigrams model; let's see if your code is able to reach this performance.

### Data preprocessing

In this section, we will download the data from GitHub and we'll preprocess them so that they are in the same format as above. As a reminder, training data is formatted this way:

```python
train_docs = [['fun', 'couple', 'love', 'love'],
              ['fast', 'furious', 'shoot'],
              ['couple', 'fly', 'fast', 'fun', 'fun'],
              ['furious', 'shoot', 'shoot', 'fun'],
              ['fly', 'fast', 'shoot', 'love']]

train_labels = ['comedy', 'action', 'comedy', 'action', 'action']
```

Our goal is to transform the *Thumbs up?* dataset into this format, so that we can use
the functions we defined above without any modification.

Let's start by downloading the data. We use the UNIX utility `wget` for this; please be aware that if you're running this code on a Windows machine it most likely won't work, and you'll have to download the dataset yourself and unzip it in the same folder as this notebook.

In [122]:
!wget https://github.com/cambridgeltl/python4cl/raw/module_2.1/module_2/module_2.1/data.zip
!unzip -n -q data.zip

--2020-11-04 14:39:06--  https://github.com/cambridgeltl/python4cl/raw/module_2.1/module_2/module_2.1/data.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cambridgeltl/python4cl/module_2.1/module_2/module_2.1/data.zip [following]
--2020-11-04 14:39:07--  https://raw.githubusercontent.com/cambridgeltl/python4cl/module_2.1/module_2/module_2.1/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4848090 (4.6M) [application/zip]
Saving to: ‘data.zip.1’


2020-11-04 14:39:08 (13.2 MB/s) - ‘data.zip.1’ saved [4848090/4848090]



Let's now load the data. We will define a function that:

1. Finds all the files in a directory;
2. For each file:
  - Loads the content of the file in a string, and
  - Extracts all the words from the document by using `re.findall`;
3. Finally, it returns the data as a list of lists of words, e.g.:

```python
[['hello', 'world', '...' ],   # document 1
 ['lorem', 'ipsum', '...' ],   # document 2
 ...
]
```

You should remember `re.findall` from Module 1, but we encourage you to have a look at the [documentation](https://docs.python.org/3/library/re.html#re.findall)
and play with this function.
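
As a quick toy example (the sentence is made up), this is what the pattern `\w+` extracts from a string:

```python
import re

sentence = "Indiana Jones isn't afraid of snakes, is he?"

# \w+ matches runs of letters, digits and underscores, so punctuation
# and whitespace act as separators
words = re.findall(r'\w+', sentence)
print(words)
# ['Indiana', 'Jones', 'isn', 't', 'afraid', 'of', 'snakes', 'is', 'he']
```

Note that the apostrophe splits `isn't` into two tokens; this is a side effect of such a simple pattern, and something to keep in mind when you inspect the loaded documents.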

In [125]:
# modules needed to locate and split the data
import os
import random
import re


def process_docs(directory):
    """
    Parameters
    ----------
    directory: a directory containing positive/negative samples from the
    Thumbs up? dataset.

    Returns
    -------
    A list of documents, where each document is a list of words.
    """
    
    
    docs=[] # this will contain the documents we'll find

    # walk through all files in the folder
    for filename in os.listdir(directory):

        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        
        # Load the document:
        # open the file as read only
        with open(os.path.join(directory, filename), 'r') as f:
            # read all text and put it in a single long string
            doc = f.read()
            # extract all the words;
            # re.findall finds all the substrings matching a regular expression.
            # try the line below on some toy examples to understand how it works.
            doc = re.findall(r'\w+', doc)

        # add the new document to the list of documents
        docs.append(doc)

    # return everything we've found
    return docs
 
# load positive examples
positive_docs = process_docs(os.path.join('data', 'pos'))
# load negative examples
negative_docs = process_docs(os.path.join('data', 'neg'))

Now we have our lists of documents:

In [129]:
print('First 10 words of the first 2 positive documents:')
print(positive_docs[0][:10])
print(positive_docs[1][:10])
print('First 10 words of the first 2 negative documents:')
print(negative_docs[0][:10])
print(negative_docs[1][:10])

First 10 words of the first 2 positive documents:
['ingredients', 'starving', 'artist', 'lusting', 'after', 'a', 'beautiful', 'woman', 'from', 'his']
['since', '1990', 'the', 'dramatic', 'picture', 'has', 'undergone', 'a', 'certain', 'change']
First 10 words of the first 2 negative documents:
['jessica', 'lange', 'is', 'one', 'of', 'the', 'most', 'inconsistent', 'actresses', 'working']
['writer', 'director', 'lawrence', 'kasdan', 'had', 'a', 'hand', 'in', 'penning', 'some']


Now we need to shuffle this data to obtain a training and a test set. Before
shuffling, we pair each document with its label using [`zip()`](https://docs.python.org/3/library/functions.html#zip), which
allows us to create two lists of tuples `[(positive_doc_1, 'pos'), ... (positive_doc_n, 'pos')]`
and `[(negative_doc_1, 'neg'), ... (negative_doc_n, 'neg')]` for positive and
negative documents respectively.

In [103]:
all_docs = list(zip(positive_docs, ['pos'] * len(positive_docs))) +\
  list(zip(negative_docs, ['neg'] * len(negative_docs)))
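
To make the pairing step more concrete, here is the same pattern on two made-up toy documents:

```python
toy_docs = [['fun', 'couple'], ['fast', 'shoot']]

# ['pos'] * 2 builds the list ['pos', 'pos'], i.e. one label per document;
# zip() then pairs the i-th document with the i-th label
labelled = list(zip(toy_docs, ['pos'] * len(toy_docs)))
print(labelled)
# [(['fun', 'couple'], 'pos'), (['fast', 'shoot'], 'pos')]
```

Concatenating two such lists with `+` gives a single list in which all the positive examples come first, which is why we need to shuffle it before splitting.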

> 💻 Try it yourself!

Have a look at the list `all_docs` in the cell below. Reassure yourself that the data is in the 
format defined above; check that the first element has a `pos` label and that the last element has a `neg` label.

### Shuffle the data

Now we need to shuffle the data. We will use the function [`random.shuffle()`](https://docs.python.org/3/library/random.html#random.shuffle), which shuffles a list in place using a [pseudorandom number generator](https://en.wikipedia.org/wiki/Pseudorandom_number_generator). You don't really need to know the nuts and bolts
of random number generation, but it's important to stress that in computer
science, random numbers are not truly random: they are generated by
a deterministic function that emulates a random distribution.

This function starts from a value called the seed. If we manually set the seed, we
can ensure that the data generated by our random function is the same
across different runs, which makes our experiments repeatable.

> 💻 Try it yourself!

Look at the cell below and run it. Then, change the parameter of `random.seed()` to a value of your choice and run it again. Now change it back to the original value (`1203`). What happens? Why? Have a very quick look at the
Wikipedia page on pseudorandom number generators using the link above.

In [138]:
random.seed(1203)
print(random.randrange(100))  # generates a random integer between 0 and 99 (100 is excluded)
print(random.randrange(100))  
print(random.randrange(100))  

26
46
23


Now that you understand how pseudorandom number generation works, let's shuffle
the list:

In [104]:
random.seed(18)
random.shuffle(all_docs)
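
The sketch below illustrates the same idea on a toy list: with the same seed, `random.shuffle()` produces the same order every time. The helper `seeded_shuffle` is a hypothetical name introduced only for this example. Note that it resets the seed of the global generator, just as the cell above does.

```python
import random


def seeded_shuffle(items, seed):
    """Return a shuffled copy of `items` (hypothetical helper for illustration)."""
    shuffled = list(items)    # copy, because random.shuffle works in place
    random.seed(seed)         # fix the starting point of the generator
    random.shuffle(shuffled)  # shuffles the copy and returns None
    return shuffled


letters = ['a', 'b', 'c', 'd', 'e', 'f']
first_run = seeded_shuffle(letters, 18)
second_run = seeded_shuffle(letters, 18)

print(first_run == second_run)  # True: same seed, same order
```

This is why anyone who runs this notebook with the same seed and the same list of documents should obtain the same training/test split.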

Retain the first 70% of the documents for training, and the remaining 30% for testing:

In [106]:
train_docs_thumbsup = all_docs[:int(len(all_docs) * 0.7)]
test_docs_thumbsup = all_docs[int(len(all_docs) * 0.7):]

Now, extract the documents and the labels:

In [107]:
train_txts_thumbsup = [txt for txt, _ in train_docs_thumbsup]
train_labels_thumbsup = [label for _, label in train_docs_thumbsup]

Is the list really shuffled? Let's have a look at the first 10 labels; they should no longer all be positive:

In [109]:
train_labels_thumbsup[:10]

['neg', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos']

### Run the model

Now we have our documents and labels arrays in the same format as before. Let's feed them to our `train_naive_bayes()` function and train the model.

In [112]:
tu_priors, tu_likelihoods = train_naive_bayes(train_txts_thumbsup, train_labels_thumbsup, 'add1')

> 💻 Try it yourself!

Very well! Now let's see how well our classifier works. In the cell below, let's
calculate the accuracy of our algorithm.

Accuracy is the fraction of documents for which the algorithm returns the correct label, i.e. the number of correct predictions divided by the total number of predictions. Update the cell below to obtain the accuracy of our Naive Bayes classifier on the test set.

In [140]:
correct_answers = 0
# iterate over the documents in the test set;
# unpack each (document, label) tuple
for doc, true_label in test_docs_thumbsup:
  # predict the label for the document
  predicted_label, _ = bayes_predict_log(doc, tu_priors, tu_likelihoods)
  # if the predicted label is the same as the true label, 
  # update the counter
  # write your code here!

# calculate the accuracy score
accuracy = # write your code here!

# print the accuracy
print(accuracy)

0.7883333333333333


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p><pre><code>correct_answers = 0
# iterate over the documents in the test set;
# unpack each (document, label) tuple
for doc, true_label in test_docs_thumbsup:
  # predict the label for the document
  predicted_label, _ = bayes_predict_log(doc, tu_priors, tu_likelihoods)
  # if the predicted label is the same as the true label, 
  # update the counter
  if true_label == predicted_label:
    correct_answers += 1
# calculate the accuracy score
accuracy = correct_answers/len(test_docs_thumbsup)
# print the accuracy
print(accuracy)
    </code></pre></p>
</details> 

Great! $78.8\overline{3}\%$ accuracy with such a simple model. The original paper obtained
$78.7\%$ with a similar configuration; however, we must note that their score is an average over models trained on three different training/test splits, and they used further preprocessing steps, so the scores are not directly comparable.

Still, your score is very close to the one reported in the paper, which is
very good! In the next module we will use another popular Python library, `scikit-learn`, to try and improve this result.
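
If you are curious about how a score can be averaged over several training/test splits, the sketch below shows one simple way to generate such splits. The helper `k_fold_splits` is a hypothetical name, and this is not the paper's code; it only illustrates the general idea on a toy list.

```python
def k_fold_splits(items, k=3):
    """Yield (train, test) pairs; each fold is used as the test set exactly once.

    Hypothetical helper for illustration only.
    """
    fold_size = len(items) // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any leftover items
        end = (i + 1) * fold_size if i < k - 1 else len(items)
        test = items[start:end]
        train = items[:start] + items[end:]
        yield train, test


toy_items = list(range(7))
for train, test in k_fold_splits(toy_items, k=3):
    print('test:', test, 'train:', train)
```

Applied to the shuffled `all_docs`, one could train and evaluate the classifier on each pair and then average the three accuracy scores, instead of relying on a single 70/30 split.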

# Final Test

Possibilities:

1. Calculate P/R/F of positives
2. Run 5-fold cross validation to introduce the concepts of statistical significance
3. Find another smoothing function and have the students update the training function with it

## Wrapping up

### Additional resources

- Naive Bayes chapter from [Manning, Raghavan and Schütze's Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
- Python's official documentation on [floating-point arithmetic and its issues](https://docs.python.org/3/tutorial/floatingpoint.html)
- `scikit-learn`'s [implementation of Naive Bayes](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/naive_bayes.py#L669)
- Zhai and Lafferty, *A study of smoothing methods for language models applied to information retrieval*: a paper comparing different smoothing techniques for language models. [[pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.8978&rep=rep1&type=pdf)]