# Assignment 2: Parts-of-Speech Tagging (POS)

Welcome to the second assignment of Course 2 in the Natural Language Processing specialization. This assignment will develop skills in part-of-speech (POS) tagging, the process of assigning a part-of-speech tag (Noun, Verb, Adjective...) to each word in an input text. Tagging is difficult because some words can represent more than one part of speech at different times. Such words are **ambiguous**. Let's look at the following example: 

- The whole team played **well**. [adverb]
- You are doing **well** for yourself. [adjective]
- **Well**, this assignment took me forever to complete. [interjection]
- The **well** is dry. [noun]
- Tears were beginning to **well** in her eyes. [verb]

Often distinguishing the parts-of-speech of a word in a sentence will help you better understand the meaning of a sentence. This would be critically important in search queries. Identifying the proper noun, the organization, the stock symbol, or anything similar would greatly improve everything ranging from speech recognition to search. By completing this assignment, you will: 

- Learn how parts-of-speech tagging works
- Compute the transition matrix A in a Hidden Markov Model
- Compute the emission matrix B in a Hidden Markov Model
- Implement the Viterbi algorithm 
- Compute the accuracy of your own model 


In [1]:
# Importing packages and loading in the data set 
from utils_pos import get_word_tag, preprocess  
import pandas as pd
from collections import defaultdict
import math
import numpy as np

#### Data Sources: 
This assignment will use two tagged data sets collected from the Wall Street Journal (WSJ). [Here](https://www.clips.uantwerpen.be/pages/mbsp-tags) is an example 'tag-set' or part-of-speech designation describing the two- or three-letter tags and their meaning. One data set (`WSJ_02-21.pos`) will be used for training, the other (`WSJ_24.pos`) for testing. The tagged training data has been preprocessed to form a vocabulary (`hmm_vocab.txt`). The words in the vocabulary are words from the training set that were used two or more times. The vocabulary is augmented with a set of 'unknown word tokens', described below. The training set will be used to create the emission, transition and tag counts. 
The test set (`WSJ_24.pos`) is read in to create `y`. This contains both the test text and the true tag. The test set has also been preprocessed to remove the tags to form `test.words`. This is read in and further processed to identify the end of sentences and handle words not in the vocabulary using functions provided in `utils_pos.py`. This forms the list `prep`, the preprocessed text used to test our POS taggers.

A POS tagger will necessarily encounter words that are not in its datasets. To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag. For example, the suffix 'ize' is a hint that the word is a verb, as in 'final-ize' or 'character-ize'. A set of unknown-tokens, such as '--unk-verb--' or '--unk-noun--', will replace the unknown words in both the training and test corpus and will appear in the emission, transition and tag data structures.


<img src = "DataSources1.PNG" />

Implementation note: For Python 3.7 and beyond, dictionaries are guaranteed to retain insertion order (in CPython 3.6 this was an implementation detail). Further, their hash-based lookup makes them suitable for rapid membership tests. If _di_ is a dictionary, `key in di` will return `True` if _di_ has a key _key_, else `False`. The dictionary `vocab` will utilize these features.
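To make the note above concrete, here is a minimal sketch of building a word-to-index dictionary and testing membership. The three words and the name `toy_vocab` are made up for illustration and are not taken from `hmm_vocab.txt`.

```python
# Toy vocabulary; the words are illustrative only.
words = ["the", "economy", "well"]

# Same pattern as the vocab-building loop in this notebook: sorted word -> index.
toy_vocab = {word: i for i, word in enumerate(sorted(words))}

print(list(toy_vocab))        # keys come back in insertion (here: sorted) order
print("well" in toy_vocab)    # membership test on the keys
print("banana" in toy_vocab)  # a word that was never inserted
```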

In [2]:
# load in the training corpus
training_corpus = open("WSJ_02-21.pos", 'r').readlines()           #corpus with tags

voc= open("hmm_vocab.txt", 'r').read().split('\n')
vocab = {} # this dictionary has the index of the corresponding words
for i, word in enumerate(sorted(voc)): # this gets you the index of the corresponding words. 
    vocab[word] = i       

# load in the test corpus
y = open('WSJ_24.pos').readlines()            #corpus with tags
_, prep = preprocess(vocab, "test.words")     #corpus without tags, preprocessed

print('The length of the preprocessed test corpus: ', len(prep))
print('This is an example of the test_corpus: ', prep[0])
print('This is an example of your y: ', y[0])

The length of the preprocessed test corpus:  34199
This is an example of the test_corpus:  The
This is an example of your y:  The	DT



# Part 1: Parts-of-speech tagging 

## Part 1.1 - Training
We start with the simplest possible parts-of-speech tagger and we will build up to the state of the art. In this section, you will find the words that are not ambiguous. For example, the word `is` is a verb and it is not ambiguous. In the `WSJ` corpus, $86\%$ of the tokens are unambiguous (meaning they have one tag) and around $14\%$ have more than one tag. 

<img src = "pos.png" style="width:400px;height:250px;"/>



Before we start predicting the tags of each word, we will need to compute a few dictionaries that will help us generate the tables. The first dictionary is the `transition_counts` dictionary, which records the number of times each tag directly followed another tag. This dictionary will then be used to compute: 
$$P(t_i |t_{i-1}) \tag{1}$$
In order for you to compute equation 1, you will create a `transition_counts` dictionary where the keys are `(prev_tag, tag)` and the values are the number of times those two tags appeared in that order. 

The second dictionary you will compute is the `emission_counts` dictionary. This dictionary would then be used to compute:

$$P(w_i|t_i)\tag{2}$$

In other words, you will use it to compute the probability of a word given a tag. In order for you to compute equation 2, you will create an `emission_counts` dictionary where the keys are `(tag, word)` and the values are the number of times that pair showed up in your training set. 

The last dictionary you will compute is the `tag_counts` dictionary. The keys of this dictionary are the tags and the values are the number of times each tag appeared. 

**Instructions:** Write a program that takes in the  `training_corpus` and returns the three dictionaries mentioned above `transition_counts`, `emission_counts`, and `tag_counts`. 
- `emission_counts`: maps (tag, word) to the number of times it happened. 
- `transition_counts`: maps (prev_tag, tag) to the number of times it has appeared. 
- `tag_counts`: maps (tag) to the number of times it has occurred. 

Implementation note: This routine utilises a subclass of *dict*. A standard Python dictionary throws a *KeyError* if you try to access an item with a key that is not currently in the dictionary. In contrast, the *defaultdict* will create an item of the type of the argument, in this case an integer. See [defaultdict](https://docs.python.org/3.3/library/collections.html#defaultdict-objects).
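The difference between a plain `dict` and a `defaultdict(int)` can be seen on a tiny example. The tag sequence below is hypothetical (a hand-tagged "The well is dry ."), and `toy_transition_counts` is an illustrative name, not the graded dictionary.

```python
from collections import defaultdict

# Hypothetical tag sequence for the sentence "The well is dry ."
toy_tags = ["DT", "NN", "VBZ", "JJ", "."]

# A plain dict raises KeyError when incrementing a key it has never seen.
plain_counts = {}
try:
    plain_counts[("--s--", "DT")] += 1
    raised = False
except KeyError:
    raised = True

# defaultdict(int) creates the missing entry with value 0 before incrementing.
toy_transition_counts = defaultdict(int)
prev = "--s--"  # start-of-sentence state, as in this notebook
for tag in toy_tags:
    toy_transition_counts[(prev, tag)] += 1
    prev = tag

print(raised)
print(dict(toy_transition_counts))
```

Note that merely reading a missing key from a `defaultdict` also inserts it, so use `key in d` or `d.get(key, 0)` when you only want to look.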

In [3]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: create_dictionaries
def create_dictionaries(training_corpus, vocab):
    """
    Input: 
        training_corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output: 
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    m = len(training_corpus)
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    prev = '--s--' # this is the start state
    i = 0 
    for line in training_corpus:
        i+=1
        if i % 50000 == 0:
            print("line count = ",i)
        
        ### START CODE HERE ###
        word, tag = get_word_tag(line, vocab)
        emission_counts[(tag, word)] += 1
        transition_counts[(prev, tag)] += 1
        tag_counts[tag] += 1
        prev = tag
        ### END CODE HERE ###
        
    return emission_counts, transition_counts, tag_counts

In [4]:
emission_counts, transition_counts, tag_counts = create_dictionaries(training_corpus, vocab)

line count =  50000
line count =  100000
line count =  150000
line count =  200000
line count =  250000
line count =  300000
line count =  350000
line count =  400000
line count =  450000
line count =  500000
line count =  550000
line count =  600000
line count =  650000
line count =  700000
line count =  750000
line count =  800000
line count =  850000
line count =  900000
line count =  950000


In [5]:
# get all the POS states
states = sorted(tag_counts.keys())  # sorted() already returns a list
print(len(states))

46


**Expected Output:**
46 

Let's examine some items in our tables:

In [6]:
print(states)

['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']



The 'states' are the parts-of-speech designations found in the training data. They will also be referred to as 'tags' or POS in this assignment. Above, 'NN' is noun, singular, while 'NNS' is noun, plural. You can get a more complete description at [Penn Treebank II tag set](https://www.clips.uantwerpen.be/pages/mbsp-tags). In addition, there are helpful tags like '--s--' which indicate the start of a sentence.

In [7]:
print("transition examples: ", list(transition_counts.items())[:3])
print("emission examples: ", list(emission_counts.items())[200:203])
print("ambiguous word example: ")
for tup,cnt in emission_counts.items():
    if tup[1] == 'back': print (tup, cnt) 

transition examples:  [(('--s--', 'IN'), 5050), (('IN', 'DT'), 32364), (('DT', 'NNP'), 9044)]
emission examples:  [(('DT', 'any'), 721), (('NN', 'decrease'), 7), (('NN', 'insider-trading'), 5)]
ambiguous word example: 
('RB', 'back') 304
('VB', 'back') 20
('RP', 'back') 84
('JJ', 'back') 25
('NN', 'back') 29
('VBP', 'back') 4


**Expected Output:**

transition examples:  [(('--s--', 'IN'), 5050), (('IN', 'DT'), 32364), (('DT', 'NNP'), 9044)]  
emission examples:  [(('DT', 'any'), 721), (('NN', 'decrease'), 7), (('NN', 'insider-trading'), 5)]   
('RB', 'back')&nbsp; 304  
('VB', 'back')&nbsp; 20  
('RP', 'back')&nbsp; 84  
('JJ', 'back')&nbsp; 25  
('NN', 'back')&nbsp; 29  
('VBP', 'back')&nbsp; 4  


## Part 1.2 - Testing

Now you will test the accuracy of your parts-of-speech tagger using your `emission_counts` dictionary. Given your preprocessed test corpus `prep`, you will assign a parts-of-speech tag to every word in that corpus. Using the original tagged test corpus `y`, you will then compute what percent of the tags you got correct. 

**Instructions:** Implement `predict_pos` that computes the accuracy of your model. This is a warmup exercise. To assign a part of speech to a word, simply assign the most frequent POS for that word in the training set. 
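As a small illustration of the "most frequent tag" rule, the sketch below picks a tag for the word 'back' using the counts printed in the ambiguous word example earlier in this notebook. The helper `most_frequent_tag` is a hypothetical name used only here; it is not the graded `predict_pos`.

```python
# Counts for 'back' as printed in the emission_counts example above.
toy_emission_counts = {
    ('RB', 'back'): 304,
    ('VB', 'back'): 20,
    ('RP', 'back'): 84,
    ('JJ', 'back'): 25,
    ('NN', 'back'): 29,
    ('VBP', 'back'): 4,
}

def most_frequent_tag(word, emission_counts):
    """Return the tag seen most often with `word`, or '' if the word was never seen."""
    candidates = {tag: cnt for (tag, w), cnt in emission_counts.items() if w == word}
    return max(candidates, key=candidates.get) if candidates else ''

print(most_frequent_tag('back', toy_emission_counts))
```

With this rule, 'back' is always tagged as an adverb, so every occurrence where it is actually a particle, noun, adjective, or verb counts as an error. That is the weakness the HMM in Part 2 addresses.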

In [8]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: predict_pos

def predict_pos(prep, y, emission_counts, vocab, states):
    '''
    Input: 
        prep: a preprocessed version of 'y'. A list with the 'word' component of each line.
        y: the tagged test corpus, a list of strings that each hold a word and its POS separated by whitespace
        emission_counts: a dictionary where the keys are (tag,word) tuples and the value is the count
        vocab: a dictionary where keys are words in vocabulary and value is an index
        states: a sorted list of all possible tags for this assignment
    Output: 
        accuracy: fraction of words classified correctly (num_correct / total)
    '''
    
    num_correct = 0
    all_words = set(emission_counts.keys())
    total = len(y)
    for word, y_tup in zip(prep, y): 
        if not y_tup.split():
            continue 
        _, true_label = y_tup.split()
        count_final = 0
        pos_final = ''
        if word in vocab:
            for pos in states:
    
    ### START CODE HERE ###
                if ((pos, word) in all_words) and (emission_counts[(pos, word)] > count_final):
                    count_final = emission_counts[(pos, word)]
                    pos_final = pos
        if pos_final == true_label:
            num_correct += 1
    ### END CODE HERE ###
    return num_correct/total

In [9]:
p = predict_pos(prep, y, emission_counts, vocab, states)
print(p)

0.8888563993099213


**Expected Output:** 0.8888563993099213

88.9% is really good for this warm up exercise. With some dynamic programming, you should be able to get **95% accuracy.** Let's get started!

# Part 2: Hidden Markov Models for POS

In this part you will build something more context specific. Concretely, you will be implementing a Hidden Markov Model (HMM) with a Viterbi decoder, one of the most commonly used algorithms in Natural Language Processing, which acts as a foundation for many deep learning techniques you will see later in the specialization. In addition to parts-of-speech tagging, it is used in speech recognition, speech synthesis, etc. By completing this part of the assignment you will get 95% accuracy on the same dataset you used in Part 1.

The Markov Model contains a number of states and the probability of transition between those states. In our case, the states are the parts-of-speech. It utilizes a transition matrix, `A`. A Hidden Markov Model adds an observation or emission matrix `B` which describes the probability of a visible observation when in a particular state. In our case, the emissions are the words in the corpus, and the state, which is hidden, is the tag of that word.

## Part 2.1 Generating Matrices

### Creating the A transition probabilities matrix
Now that you have your emission_counts, transition_counts, and tag_counts, you will start implementing the Hidden Markov Model. This will allow you to quickly construct the `A` transition probabilities matrix and the `B` emission probabilities matrix. You will also use some smoothing when computing them. Here is an example of what the `A` transition matrix would look like (simplified to 5 tags for viewing. It is 46x46 in this assignment.): 
<p style='text-align: center;'> <b>A Transitions Probability Matrix (subset)</b>  </p>

|**A**    |...|         RBS  |          RP  |         SYM  |      TO  |          UH|...|
| ------- |---| -----------: | -----------: | -----------: | -------: | ---------: |---|
|**RBS**  |...|2.217069e-06  |2.217069e-06  |2.217069e-06  |0.008870  |2.217069e-06|...|
|**RP**   |...|3.756509e-07  |7.516775e-04  |3.756509e-07  |0.051089  |3.756509e-07|...|
|**SYM**  |...|1.722772e-05  |1.722772e-05  |1.722772e-05  |0.000017  |1.722772e-05|...|
|**TO**   |...|4.477336e-05  |4.472863e-08  |4.472863e-08  |0.000090  |4.477336e-05|...|
|**UH**   |...|1.030439e-05  |1.030439e-05  |1.030439e-05  |0.061837  |3.092348e-02|...|
| ...     |...| ...          | ...          | ...          | ...      | ...        |...|

Note that the matrix above was computed with smoothing. Each cell gives you the probability of going from one part of speech to another. For example, the probability of going from `TO` to `RP` is about 4.47e-8. As you might have guessed, the sum of each row has to equal 1. The smoothing was done as follows: 

$$ P(t_i | t_{i-1}) = \frac{C(t_{i-1}, t_{i}) + \alpha }{C(t_{i-1}) +\alpha * N}\tag{3}$$

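Where $N$ is the total number of tags, $C(t_{i-1}, t_{i})$ is the count of the tuple (previous POS, current POS) in `transition_counts`, $C(t_{i-1})$ is the count of the previous POS in `tag_counts`, and $\alpha$ is a smoothing parameter.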
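Equation 3 can be checked on a toy example. The sketch below uses three made-up tags with made-up counts (the names `toy_tags`, `toy_tag_counts`, `toy_transition_counts` and `A_toy` are illustrative). It assumes that each tag's count equals the total number of transitions leaving it, which is what makes every row sum to exactly 1.

```python
import numpy as np

# Made-up counts; each tag count equals the number of transitions out of that tag.
toy_tags = ["DT", "NN", "VB"]
toy_tag_counts = {"DT": 3, "NN": 4, "VB": 3}
toy_transition_counts = {("DT", "NN"): 3, ("NN", "NN"): 1, ("NN", "VB"): 3, ("VB", "DT"): 3}

alpha = 0.001
N = len(toy_tags)
A_toy = np.zeros((N, N))
for i, prev_tag in enumerate(toy_tags):
    for j, tag in enumerate(toy_tags):
        # Unseen pairs get a count of 0, so their probability is alpha / (C + alpha * N).
        count = toy_transition_counts.get((prev_tag, tag), 0)
        A_toy[i, j] = (count + alpha) / (toy_tag_counts[prev_tag] + alpha * N)

print(A_toy)
print(A_toy.sum(axis=1))  # each row should sum to 1 under the assumption above
```

The unseen transition `DT -> DT` ends up with a small but non-zero probability, which is the point of smoothing: no path is ruled out entirely just because it was absent from the training data.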


**Instructions:** Implement the `create_transition_matrix` below for all tags. Your task is to output a matrix that computes equation 3 for each cell in matrix `A`. 

In [10]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: create_transition_matrix
def create_transition_matrix(alpha, tag_counts, transition_counts):
    ''' 
    Input: 
        alpha: number used for smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
    Output:
        A: matrix of dimension (num_tags,num_tags)
    '''
    all_tags = sorted(tag_counts.keys())
    num_tags = len(all_tags)
    A = np.zeros((num_tags,num_tags))
    trans_keys = set(transition_counts.keys())
    
    ### START CODE HERE ### 
    for i in range(num_tags):
        for j in range(num_tags):
            A[i, j] = (transition_counts.get((all_tags[i], all_tags[j]), 0) + alpha) / (tag_counts[all_tags[i]] + alpha * num_tags)
    ### END CODE HERE ###
    
    return A

In [11]:
alpha = 0.001
A = create_transition_matrix(alpha, tag_counts, transition_counts)
# Testing your function
print(A[0,0])
print(A[3,1])
A_sub = pd.DataFrame(A[30:35,30:35], index=states[30:35], columns = states[30:35] )
print(A_sub)

7.039972966503809e-06
0.16910191896905374
              RBS            RP           SYM        TO            UH
RBS  2.217069e-06  2.217069e-06  2.217069e-06  0.008870  2.217069e-06
RP   3.756509e-07  7.516775e-04  3.756509e-07  0.051089  3.756509e-07
SYM  1.722772e-05  1.722772e-05  1.722772e-05  0.000017  1.722772e-05
TO   4.477336e-05  4.472863e-08  4.472863e-08  0.000090  4.477336e-05
UH   1.030439e-05  1.030439e-05  1.030439e-05  0.061837  3.092348e-02


**Expected Output**:
- 7.039972966503809e-06
- 0.16910191896905374
- And 

|  -  |         RBS  |          RP  |         SYM  |      TO  |          UH|
| --- | ------------ | ------------ | ------------ | -------- | ---------- |
|RBS  |2.217069e-06  |2.217069e-06  |2.217069e-06  |0.008870  |2.217069e-06|
|RP   |3.756509e-07  |7.516775e-04  |3.756509e-07  |0.051089  |3.756509e-07|
|SYM  |1.722772e-05  |1.722772e-05  |1.722772e-05  |0.000017  |1.722772e-05|
|TO   |4.477336e-05  |4.472863e-08  |4.472863e-08  |0.000090  |4.477336e-05|
|UH   |1.030439e-05  |1.030439e-05  |1.030439e-05  |0.061837  |3.092348e-02|

### Creating the B emission probabilities matrix

Now you will create the `B` emission matrix, which holds the emission probabilities. You will use smoothing as defined below: 


$$P(w_i | t_i) = \frac{C(t_i, word_i)+ \alpha}{C(t_{i}) +\alpha * N}\tag{4}$$

Where $C(t_i, word_i)$ is the number of times $word_i$ was associated with $tag_i$ in the training data, $C(t_i)$ is the number of times $tag_i$ was in the training data, N is the number of words in the vocabulary and alpha is a smoothing parameter. Your matrix `B` is of dimension (num_tags, N), where num_tags is the number of possible parts-of-speech. Here is an example of the matrix, only a subset of tags and words are shown: 
<p style='text-align: center;'> <b>B Emissions Probability Matrix (subset)</b>  </p>

|**B**| ...|          725 |     adroitly |    engineers |     promoted |      synergy| ...|
|----|----|--------------|--------------|--------------|--------------|-------------|----|
|**CD**  | ...| **8.201296e-05** | 2.732854e-08 | 2.732854e-08 | 2.732854e-08 | 2.732854e-08| ...|
|**NN**  | ...| 7.521128e-09 | 7.521128e-09 | 7.521128e-09 | 7.521128e-09 | **2.257091e-05**| ...|
|**NNS** | ...| 1.670013e-08 | 1.670013e-08 |**4.676203e-04** | 1.670013e-08 | 1.670013e-08| ...|
|**VB**  | ...| 3.779036e-08 | 3.779036e-08 | 3.779036e-08 | 3.779036e-08 | 3.779036e-08| ...|
|**RB**  | ...| 3.226454e-08 | **6.456135e-05** | 3.226454e-08 | 3.226454e-08 | 3.226454e-08| ...|
|**RP**  | ...| 3.723317e-07 | 3.723317e-07 | 3.723317e-07 | **3.723317e-07** | 3.723317e-07| ...|
| ...    | ...|     ...      |     ...      |     ...      |     ...      |     ...      | ...|


**Instructions:** Implement the `create_emission_matrix` below that computes the `B` emission probabilities matrix. Your function takes in $\alpha$, the smoothing parameter, `tag_counts`, which is a dictionary mapping each tag to its respective count, the `emission_counts` dictionary where the keys are (tag, word) and the values are the counts. Your task is to output a matrix that computes equation 4 for each cell in matrix `B`. 
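Because most (tag, word) pairs never occur, almost every cell of `B` holds the same "unseen" value for its row. The sketch below is one possible alternative to a full double loop, not the assignment's reference approach: it fills each row with the smoothed unseen value from equation 4 and then overwrites only the observed pairs. The function name `emission_matrix_sparse_fill` and the toy data are hypothetical, and here `vocab` is assumed to be a dictionary mapping word to column index.

```python
import numpy as np

def emission_matrix_sparse_fill(alpha, tag_counts, emission_counts, vocab):
    """Hypothetical variant: start from alpha / denom, then overwrite observed (tag, word) pairs."""
    all_tags = sorted(tag_counts)
    tag_index = {tag: i for i, tag in enumerate(all_tags)}
    num_words = len(vocab)
    # Denominator of equation 4, one value per tag (row).
    denom = np.array([tag_counts[tag] + alpha * num_words for tag in all_tags])
    B = np.full((len(all_tags), num_words), alpha) / denom[:, None]
    for (tag, word), count in emission_counts.items():
        if tag in tag_index and word in vocab:
            B[tag_index[tag], vocab[word]] = (count + alpha) / denom[tag_index[tag]]
    return B

# Made-up toy data for illustration only.
toy_vocab = {"dog": 0, "runs": 1}
toy_tag_counts = {"NN": 2, "VB": 1}
toy_emission_counts = {("NN", "dog"): 2, ("VB", "runs"): 1}

B_toy = emission_matrix_sparse_fill(0.001, toy_tag_counts, toy_emission_counts, toy_vocab)
print(B_toy)
```

The design trade-off is that the work in the loop is proportional to the number of observed pairs rather than to `num_tags * num_words`.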

In [12]:
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: create_emission_matrix

def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):
    '''
    Input: 
        alpha: tuning parameter used in smoothing 
        tag_counts: a dictionary mapping each tag to its respective count
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        vocab: a list of the vocabulary words, ordered so that vocab[j] is the word for column j of B
    Output:
        B: a matrix of dimension (num_tags, len(vocab))
    '''
    num_tags = len(tag_counts)
    all_tags = sorted(tag_counts.keys())
    num_words = len(vocab)
    B = np.zeros((num_tags, num_words))
    emis_keys = set(list(emission_counts.keys()))
    
    ### START CODE HERE ###
    for i in range(num_tags):
        for j in range(num_words):
            B[i, j] = (emission_counts.get((all_tags[i], vocab[j]), 0) + alpha) / (tag_counts[all_tags[i]] + alpha * num_words)
    ### END CODE HERE ###
    
    return B

In [13]:
# creating your emission probability matrix. this takes a few minutes to run. 
B = create_emission_matrix(alpha, tag_counts, emission_counts, list(vocab))
print(B[0,0])
print(B[3,1])
cidx  = ['725','adroitly','engineers', 'promoted', 'synergy']
cols = [vocab[a] for a in cidx]
rvals =['CD','NN','NNS', 'VB','RB','RP']
rows = [states.index(a) for a in rvals]
B_sub = pd.DataFrame(B[np.ix_(rows,cols)], index=rvals, columns = cidx )
print(B_sub)

6.032199882975323e-06
7.195398974080014e-07
              725      adroitly     engineers      promoted       synergy
CD   8.201296e-05  2.732854e-08  2.732854e-08  2.732854e-08  2.732854e-08
NN   7.521128e-09  7.521128e-09  7.521128e-09  7.521128e-09  2.257091e-05
NNS  1.670013e-08  1.670013e-08  4.676203e-04  1.670013e-08  1.670013e-08
VB   3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08
RB   3.226454e-08  6.456135e-05  3.226454e-08  3.226454e-08  3.226454e-08
RP   3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07


**Expected Output:**
- 6.032199882975323e-06
- 7.195398974080014e-07   and

|  \ |          725 |     adroitly |    engineers |     promoted |      synergy
|----|--------------|--------------|--------------|--------------|-------------|
|CD  | 8.201296e-05 | 2.732854e-08 | 2.732854e-08 | 2.732854e-08 | 2.732854e-08
|NN  | 7.521128e-09 | 7.521128e-09 | 7.521128e-09 | 7.521128e-09 | 2.257091e-05
|NNS | 1.670013e-08 | 1.670013e-08 | 4.676203e-04 | 1.670013e-08 | 1.670013e-08
|VB  | 3.779036e-08 | 3.779036e-08 | 3.779036e-08 | 3.779036e-08 | 3.779036e-08
|RB  | 3.226454e-08 | 6.456135e-05 | 3.226454e-08 | 3.226454e-08 | 3.226454e-08
|RP  | 3.723317e-07 | 3.723317e-07 | 3.723317e-07 | 3.723317e-07 | 3.723317e-07


# Part 3: Viterbi Algorithm and Dynamic Programming

In this part of the assignment you will implement the Viterbi algorithm, which makes use of dynamic programming. Specifically, you will use your two matrices, `A` and `B`, to run the Viterbi algorithm. We have decomposed this process into three main steps for you. 

* **Initialization** - In this part you initialize the `best_paths` and `best_probs` matrices that you will be populating in the feed-forward step.
* **Feed forward** - At each step, you calculate the probability of each path happening and the best paths up to that point. 
* **Feed backward** - This allows you to find the best path with the highest probabilities. 

## Part 3.1:  Initialization 

You will start by initializing two matrices of the same dimension. 

- best_probs: Each cell `[i, j]` contains the log probability of the most likely tag path that ends with tag `i` at word `j` of the corpus.

- best_paths: A matrix that helps you trace through the best possible path in the corpus. 

**Instructions**: Write a program below that initializes the `best_probs` and the `best_paths` matrix. Both matrices will be initialized to zero except for the first column of `best_probs`. That column is initialized assuming the first word of the corpus was preceded by a start token ("--s--"). This allows you to reference the `A` matrix for the transition probability.
The values in column zero are set to: 

$$\text{if } A[s_{idx}, i] \neq 0:\quad best\_probs[i,0] = \log(A[s_{idx}, i]) + \log(B[i, vocab[corpus[0]]])$$

$$\text{if } A[s_{idx}, i] = 0:\quad best\_probs[i,0] = \text{float('-inf')}$$

$vocab[corpus[0]]$ gives you the column index into the `B` matrix of the first word of the corpus. Taking the log is an implementation trick. If you were to just multiply $A[s_{idx}, i] \times B[i, vocab[corpus[0]]]$ and keep multiplying along the path, the products would become vanishingly small as your corpus gets larger. The $A = 0$ case has special handling to avoid taking the log of 0. 
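The reason for working in log space can be seen with a few lines of arithmetic. In this sketch the per-step probability `p` and the number of steps are arbitrary illustrative values: the running product underflows to zero in double precision, while the running sum of logs stays a perfectly ordinary number.

```python
import math

p = 1e-5      # illustrative per-step probability
steps = 100   # illustrative path length

product = 1.0
log_sum = 0.0
for _ in range(steps):
    product *= p            # multiplying probabilities directly
    log_sum += math.log(p)  # adding log probabilities instead

print(product)  # underflows: 1e-500 is below the smallest positive double
print(log_sum)
```

Because the logarithm is monotonic, comparing sums of logs selects the same best path as comparing the products would, without the underflow.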

Utilize [math.log](https://docs.python.org/3/library/math.html) to compute the natural logarithm.

The example below shows the initialization assuming the corpus starts with the phrase "Loss tracks upward".

<img src = "Initialize4.PNG"/>

In [14]:
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: initialize
def initialize(states, tag_counts, A, B, corpus, vocab):
    '''
    Input: 
        states: a list of all possible parts-of-speech
        tag_counts: a dictionary mapping each tag to its respective count
        A: Transition Matrix of dimension (num_tags, num_tags)
        B: Emission   Matrix of dimension (num_tags, len(vocab))
        corpus: a sequence of words whose POS is to be identified in a list 
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        best_probs: matrix of dimension (num_tags, len(corpus)) of floats
        best_paths: matrix of dimension (num_tags, len(corpus)) of integers
    '''
    num_tags = len(tag_counts)
    best_probs = np.zeros((num_tags, len(corpus)))
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)
    s_idx = states.index("--s--")
    
    ### START CODE HERE ###
    for i in range(num_tags):
        if A[s_idx, i] != 0:
            best_probs[i, 0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])
        else:
            best_probs[i, 0] = float('-inf')
    ### END CODE HERE ### 
    
    return best_probs, best_paths

In [15]:
best_probs, best_paths = initialize(states, tag_counts, A, B, prep, vocab)

In [16]:
print(best_probs[0,0]) # -22.60.....
print(best_paths[2,3])

-22.60982633354825
0


**Expected Output:** 
* -22.60982633354825
* 0


## Part 3.2 Viterbi Forward

In this part of the assignment, you will implement the viterbi forward segment. In other words, you will populate your `best_probs` and `best_paths` matrices.

The Viterbi forward algorithm will walk forward through the corpus and for each word compute a probability for each possible tag. Unlike our previous algorithm this will include the path up to that word,tag combination. 
An example makes this clearer. Note that, in the example, only a subset of states is shown due to space limitations. In the diagram below, the first word is already initialized. The algorithm will compute a probability for each of the potential tags in the second and future words. For example, to compute the probability that the tag of 'tracks' is verb, 3rd person singular present (VBZ), highlighted in orange below, it will examine each of the paths from the tags of 'Loss' and choose the most likely. An example of the calculation for **one** of those paths is the path from ('Loss', NN). The log of the probability of the path to and including 'Loss' being a noun (NN) is -14.32. We add to that the log of the probability of NN transitioning to VBZ from the `A` matrix, circled in the diagram, log(4.37e-02). To that we add the log of the probability that the tag VBZ would have an 'emission' 'tracks', log(4.61e-4), for a total of -25.13. This will turn out to be the most likely, but all other paths from 'Loss' are examined as well. The most likely path is stored in the `best_paths` table, shown in orange. 
The formulas to compute the probability and path for corpus[i], current tag j and previous tag k are:

`` prob = best_probs[k, i-1] + log(A[k, j]) + log(B[j, vocab[corpus[i]]]) ``

`` path = k ``
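The arithmetic in the worked example above can be reproduced directly. The three numbers are the ones quoted in the text for the path from ('Loss', NN) to ('tracks', VBZ); the variable names are illustrative.

```python
import math

prev_best = -14.32                   # log probability of the path ending in ('Loss', NN)
log_transition = math.log(4.37e-02)  # log A[NN, VBZ]
log_emission = math.log(4.61e-4)     # log B[VBZ, 'tracks']

prob = prev_best + log_transition + log_emission
print(round(prob, 2))
```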


**Instructions:** Implement the Viterbi forward algorithm and store the best path and best probability for every possible tag for each word in the matrices `best_paths` and `best_probs` using the pseudo code below.

`for each word in the corpus (1 in diagram)  
    for each tag type (2 in diagram)   
        for each input probability from the previous entry (3 in diagram)   
            compute the probability using the formula above   
            retain the highest probability computed   
        set best_probs to that value  
        set best_paths to the index of the previous entry that produced the highest probability `

Utilize [math.log](https://docs.python.org/3/library/math.html) to compute the natural logarithm.
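The pseudo code can be traced on a model small enough to check by hand. The sketch below uses two tags and a three-word vocabulary; every probability in it is made up, the start distribution `[0.6, 0.4]` stands in for the `--s--` row of `A`, and the inner loop over previous tags is replaced by a NumPy `argmax` over a column, which is an illustrative shortcut rather than the graded implementation.

```python
import numpy as np

# Toy model: 2 tags, 3 vocabulary words. All numbers are made up.
log_A = np.log(np.array([[0.3, 0.7],
                         [0.5, 0.5]]))       # log A[k, j]: tag k -> tag j
log_B = np.log(np.array([[0.5, 0.4, 0.1],
                         [0.1, 0.3, 0.6]]))  # log B[j, w]: tag j emits word w
toy_corpus = [0, 1, 2]                       # word indices into the columns of B

num_tags = log_A.shape[0]
best_probs = np.zeros((num_tags, len(toy_corpus)))
best_paths = np.zeros((num_tags, len(toy_corpus)), dtype=int)

# Initialization with an assumed start distribution over the two tags.
best_probs[:, 0] = np.log([0.6, 0.4]) + log_B[:, toy_corpus[0]]

for i in range(1, len(toy_corpus)):          # each word
    for j in range(num_tags):                # each current tag
        # Scores of arriving at tag j from every previous tag k at once.
        candidates = best_probs[:, i - 1] + log_A[:, j] + log_B[j, toy_corpus[i]]
        best_paths[j, i] = np.argmax(candidates)
        best_probs[j, i] = candidates[best_paths[j, i]]

print(best_paths)
print(np.exp(best_probs))  # back in probability space, for hand-checking
```

For the last word, both tags are best reached from tag 1, because the path through tag 1 at the second word carries the higher accumulated probability.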

<img src = "Forward4.PNG"/>

In [17]:
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: viterbi_forward
def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab):
    '''
    Input: 
        A, B: The transition and emission matrices respectively
        test_corpus: a list containing a preprocessed corpus
        best_probs: an initialized matrix of dimension (num_tags, len(corpus))
        best_paths: an initialized matrix of dimension (num_tags, len(corpus))
        vocab: a dictionary where keys are words in vocabulary and value is an index 
    Output: 
        best_probs: a completed matrix of dimension (num_tags, len(corpus))
        best_paths: a completed matrix of dimension (num_tags, len(corpus))
    '''
    num_tags = best_probs.shape[0]
    
    for i in range(1, len(test_corpus)): # for every word in the corpus
        if i % 5000 == 0:
            print("Words processed: {:>8}".format(i))
            
        ### START CODE HERE ###
        for j in range(num_tags):
            best_prob = float('-inf')
            best_path = -1
            for k in range(num_tags):
                prob = best_probs[k, i-1] + math.log(A[k, j]) + math.log(B[j, vocab[test_corpus[i]]])
                path = k
                if prob > best_prob:
                    best_prob = prob
                    best_path = path
            best_probs[j, i] = best_prob
            best_paths[j, i] = best_path
        ### END CODE HERE ###
        
    return best_probs, best_paths

In [18]:
# this will take a few minutes to run => processes ~ 30,000 words
best_probs, best_paths = viterbi_forward(A, B, prep, best_probs, best_paths, vocab)

Words processed:     5000
Words processed:    10000
Words processed:    15000
Words processed:    20000
Words processed:    25000
Words processed:    30000


In [19]:
# Testing this function 
print(best_probs[0,1]) 
print(best_paths[0,4]) 

-24.78215632717346
20


**Expected Output:**
* -24.78215632717346
* 20


## Part 3.3 Viterbi backward

Now you will implement the Viterbi backward algorithm which allows you to get the predictions from the `best_paths` and the `best_probs` matrices you have already implemented. 

You have filled in the `best_paths` and the `best_probs` matrices in the forward pass. The example below shows how to proceed. You select the most likely entry for the last word in the corpus, 'upward', in the `best_probs` table. It is `RB`, an adverb, at offset 28. Store this in the last entry of `pred`. Then read offset 28 of the last column of `best_paths`, which holds 40. This points back to 'VBZ' (verb, 3rd person singular present) for the word 'tracks'. Following links backward to the start, each word can be assigned its most likely POS. These are stored in the corresponding entry of `pred`. 


<img src = "Backwards5.PNG"/>

**Instructions:** Implement the `viterbi_backward` algorithm, which returns a list of predictions.  
The indexing can be a bit confusing. Note that in the small example above m = 3, so the last `best_probs`/`best_paths` entry is column 2, or m-1.  
_In Step 1:_  
Loop through all the rows/tags in the last column of `best_probs` and find the row/tag with the maximum value.
Convert the index to a tag using `states`.  
In our small example above,  
`z[2] = 28` and
`pred[2] = states[z[2]]`; `states[z[2]]` is 'RB' in this case.

_In Step 2:_  
Starting at the last column of `best_paths`, use the index from Step 1 as the starting point and follow the pointers back to the first word. Record the index at each step and convert it to a tag using `states`. 

In our small example above, in the first iteration of Step 2, you would read `best_paths[:,2]` and fill in `z[1]`:  
`z[1] = best_paths[z[2],2]`  
The small test following the routine prints the last few words of the corpus and their tags to aid debugging.
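
Steps 1 and 2 can be traced by hand on a tiny table. The sketch below uses three made-up tags and invented values (`toy_states`, `toy_best_probs`, `toy_best_paths` are hypothetical and unrelated to the real matrices); it only illustrates the indexing.

```python
import numpy as np

# Toy tables (invented values): rows are tags, columns are words
toy_states = ['NN', 'RB', 'VBZ']
toy_best_probs = np.array([[-1.0, -9.0, -12.0],
                           [-5.0, -8.0,  -6.0],
                           [-4.0, -3.0, -11.0]])
toy_best_paths = np.array([[0, 0, 2],
                           [0, 0, 2],
                           [0, 0, 1]])

m = toy_best_paths.shape[1]
z = [None] * m
toy_pred = [None] * m

# Step 1: the most likely tag for the last word
z[m - 1] = int(np.argmax(toy_best_probs[:, m - 1]))
toy_pred[m - 1] = toy_states[z[m - 1]]

# Step 2: follow the back-pointers from the last word to the first
for i in range(m - 1, 0, -1):
    z[i - 1] = int(toy_best_paths[z[i], i])
    toy_pred[i - 1] = toy_states[z[i - 1]]

print(z)         # [0, 2, 1]
print(toy_pred)  # ['NN', 'VBZ', 'RB']
```

Note that column 0 of `toy_best_paths` is never read: the pointer stored in column `i` already names the tag of word `i-1`.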


In [20]:
# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: viterbi_backward
def viterbi_backward(best_probs, best_paths, corpus, states):
    '''
    This function returns the best path.
    
    '''
    m = best_paths.shape[1] 
    z = [None] * m
    num_tags = best_probs.shape[0]
    argmax = best_probs[0, m - 1]
    pred = [None] * m
    
    ### START CODE HERE ###
    z[m-1] = np.argmax(best_probs[:, m-1])
    pred[m-1] = states[z[m-1]]
    for j in range(m-2, -1, -1):
        z[j] = best_paths[z[j+1], j+1]
        pred[j] = states[z[j]]
    ### END CODE HERE ###
        
    return pred

In [21]:
pred = viterbi_backward(best_probs, best_paths, prep, states)
m=len(pred)
print('The prediction for pred[-7:m-1] is: \n', prep[-7:m-1], "\n", pred[-7:m-1], "\n")
print('The prediction for pred[0:8] is: \n', pred[0:7], "\n", prep[0:7])

The prediction for pred[-7:m-1] is: 
 ['see', 'them', 'here', 'with', 'us', '.'] 
 ['VB', 'PRP', 'RB', 'IN', 'PRP', '.'] 

The prediction for pred[0:8] is: 
 ['DT', 'NN', 'POS', 'NN', 'MD', 'VB', 'VBN'] 
 ['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken']


**Expected Output:**   
 <span style="font-family:Courier">
The prediction for pred[-7:m-1] is:  
 ['see', 'them', 'here', 'with', 'us', '.']  
 ['VB', 'PRP', 'RB', 'IN', 'PRP', '.']   
The prediction for pred[0:8] is:    
 ['DT', 'NN', 'POS', 'NN', 'MD', 'VB', 'VBN']   
 ['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken'] 
</span>

Now you just have to compare the predicted labels to the true labels and you are done! You can then see the accuracy on the corpus. 

# Part 4: Predicting on a data set

In this part of the assignment you will compute the accuracy of your predictions against the true labels in `y`. `pred` is a list of predicted tags, one for each word of your `test_corpus`. 



In [22]:
print('The fourth word is:', prep[3])
print('Your prediction is:', pred[3])
print('Your corresponding y is: ', y[3])

The fourth word is: temperature
Your prediction is: NN
Your corresponding y is:  temperature	NN



You will now implement a function to compute the accuracy of your predictions.

**Instructions:** Implement a function that computes the accuracy of your predictions. To split each line of `y` into the word and its tag, you can use `y.split()`. 
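
As a rough illustration of the bookkeeping, the sketch below runs the same idea on four hand-written lines. `toy_y` and `toy_pred` are invented for this example, and the `word\ttag\n` line format is an assumption based on the `y[3]` output shown above.

```python
# Hypothetical labelled lines in "word<TAB>tag" form; the third line is empty
toy_y = ["The\tDT\n", "economy\tNN\n", "\n", "will\tMD\n"]
# Hypothetical predictions, one per line (including the empty one)
toy_pred = ["DT", "NNS", "--s--", "MD"]

num_correct = 0
total = 0
for prediction, line in zip(toy_pred, toy_y):
    parts = line.split()      # split() with no argument also strips the newline
    if len(parts) != 2:       # skip empty or malformed lines
        continue
    word, tag = parts
    if prediction == tag:
        num_correct += 1
    total += 1

toy_accuracy = num_correct / total
print(num_correct, total)  # 2 3
```

Only lines that contain a word and a tag count toward the denominator, so the empty line changes neither `num_correct` nor `total`.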

In [23]:
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: compute_accuracy
def compute_accuracy(pred, y):
    '''
    Input: 
        pred: a list of the predicted parts-of-speech 
        y: a list of lines where each word is separated by a '\t' (i.e. word \t tag)
    Output: 
        accuracy: the fraction of non-empty lines in y whose tag matches the prediction
    '''
    num_correct = 0
    total = 0
    for prediction, y_line in zip(pred, y):
        # Skip empty lines, which carry no word/tag pair
        if not y_line.split():
            continue
            
        ### START CODE HERE ###
        word, tag = y_line.strip().split('\t')
        if prediction == tag:
            num_correct += 1
        total += 1
        ### END CODE HERE ###
        
    return num_correct/total

In [24]:
compute_accuracy(pred, y)

0.953063647155511

**Expected Output:** 0.953063647155511

Congratulations! You were able to classify the parts of speech with about 95% accuracy. 

### Takeaways and overview

In this assignment you learned about parts-of-speech tagging. You computed POS tags by going forward through a corpus. Some implementations use bidirectional tagging, in which knowing both the previous word and the next word tells you more about the POS than knowing the previous word alone. Still, being able to implement this unidirectional approach lays the foundation for other POS taggers used in industry, which are only a little more complicated than what you have just implemented. This assignment is critical to understanding other POS tagging methods. With this implementation you reached 95% accuracy. Congratulations, and see you next week, when we introduce **N-gram Language Models**, which are also extremely useful for speech recognition and other applications you will see later in the specialization. 

References: 

- ["Speech and Language Processing", Dan Jurafsky and James H. Martin](https://web.stanford.edu/~jurafsky/slp3/)
- We would like to thank Melanie Tosik for her help and inspiration

# Optional: Create and test your own mini-corpus

In [25]:
#modify sentence to test a small corpus
sentence = "Loss tracks upward".split()

srvals = {"--s--"} # start with start (set)
for s in states:
    for w in sentence:
        if emission_counts[(s,w)] > 0:
            print(s,w)
            srvals.add(s)
print(srvals)
rvals = sorted(list(srvals))
rcvals = ["({}){}".format(states.index(i),i) for i in rvals]
print("Interesting states: ",rcvals)

JJ upward
NN Loss
NNS tracks
RB upward
VBZ tracks
{'--s--', 'VBZ', 'NN', 'NNS', 'RB', 'JJ'}
Interesting states:  ['(6)--s--', '(15)JJ', '(20)NN', '(23)NNS', '(28)RB', '(40)VBZ']


In [26]:
# Display interesting parts of B
cols = [list(vocab).index(a) for a in sentence]
rows = [states.index(a) for a in rvals]
B_sub = pd.DataFrame(B[np.ix_(rows,cols)], index=rcvals, columns = sentence )
pd.options.display.float_format = '{:,.2e}'.format
print(B_sub)
pd.reset_option("display.float_format")

             Loss   tracks   upward
(6)--s-- 2.51e-08 2.51e-08 2.51e-08
(15)JJ   1.63e-08 1.63e-08 1.47e-04
(20)NN   1.50e-05 7.52e-09 7.52e-09
(23)NNS  1.67e-08 1.34e-04 1.67e-08
(28)RB   3.23e-08 3.23e-08 3.87e-04
(40)VBZ  4.61e-08 4.61e-04 4.61e-08


In [27]:
# Display interesting parts of A
rows = [states.index(a) for a in rvals]
A_sub = pd.DataFrame(A[np.ix_(rows,rows)], index=rcvals, columns =rvals )
pd.options.display.float_format = '{:,.2e}'.format
print(A_sub)
pd.reset_option("display.float_format")

            --s--       JJ       NN      NNS       RB      VBZ
(6)--s-- 2.51e-08 4.20e-02 4.01e-02 4.19e-02 5.68e-02 1.41e-03
(15)JJ   6.54e-05 7.47e-02 4.50e-01 2.32e-01 3.64e-03 1.37e-03
(20)NN   4.96e-04 8.79e-03 1.22e-01 7.78e-02 1.83e-02 4.37e-02
(23)NNS  9.86e-04 1.68e-02 2.08e-02 1.05e-02 3.14e-02 8.12e-03
(28)RB   4.52e-04 1.03e-01 1.16e-02 4.33e-03 7.31e-02 3.95e-02
(40)VBZ  4.62e-05 7.52e-02 3.47e-02 1.63e-02 1.35e-01 9.23e-04


In [28]:
#initialize, display results
mini_best_probs, mini_best_paths = initialize(states, tag_counts, A, B,sentence,vocab)

rows = [states.index(a) for a in rvals]
bpr_sub = pd.DataFrame(mini_best_probs[np.ix_(rows,range(len(sentence)))], index=rcvals, columns =sentence )
bpa_sub = pd.DataFrame(mini_best_paths[np.ix_(rows,range(len(sentence)))], index=rcvals, columns =sentence )
pd.options.display.float_format = '{:,.2f}'.format
print(bpr_sub)
print(bpa_sub)
print(sentence)
pd.reset_option("display.float_format")

           Loss  tracks  upward
(6)--s-- -35.00    0.00    0.00
(15)JJ   -21.10    0.00    0.00
(20)NN   -14.32    0.00    0.00
(23)NNS  -21.08    0.00    0.00
(28)RB   -20.12    0.00    0.00
(40)VBZ  -23.46    0.00    0.00
          Loss  tracks  upward
(6)--s--     0       0       0
(15)JJ       0       0       0
(20)NN       0       0       0
(23)NNS      0       0       0
(28)RB       0       0       0
(40)VBZ      0       0       0
['Loss', 'tracks', 'upward']
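
The first column can be checked by hand from the `A` and `B` excerpts printed above. The sketch below assumes that `initialize` sets each entry of column 0 to the log transition probability from `--s--` plus the log emission probability of the first word. The inputs are the rounded values from the printed tables, so the results agree only to about two decimals.

```python
import math

# Rounded values copied from the printed A and B excerpts
a_start_to_nn = 4.01e-02   # A['--s--', 'NN']
b_nn_loss = 1.50e-05       # B['NN', 'Loss']
a_start_to_jj = 4.20e-02   # A['--s--', 'JJ']
b_jj_loss = 1.63e-08       # B['JJ', 'Loss']

# Assumed initialization: log A[start, tag] + log B[tag, first word]
init_nn = math.log(a_start_to_nn) + math.log(b_nn_loss)
init_jj = math.log(a_start_to_jj) + math.log(b_jj_loss)

print(round(init_nn, 2))  # -14.32
print(round(init_jj, 2))  # -21.1
```

Both values match the `Loss` column of `mini_best_probs` for the `NN` and `JJ` rows.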


In [29]:
# Run forward, backward, display results
mini_best_probs, mini_best_paths = viterbi_forward(A, B, sentence, mini_best_probs, mini_best_paths,vocab)
mini_pred = viterbi_backward(mini_best_probs, mini_best_paths, sentence, states)
rows = [states.index(a) for a in rvals]
bpr_sub = pd.DataFrame(mini_best_probs[np.ix_(rows,range(len(sentence)))], index=rcvals, columns =sentence )
bpa_sub = pd.DataFrame(mini_best_paths[np.ix_(rows,range(len(sentence)))], index=rcvals, columns =sentence )
pd.options.display.float_format = '{:,.2f}'.format
print(bpr_sub)
print(bpa_sub)
print(sentence)
print(mini_pred)
pd.reset_option("display.float_format")

           Loss  tracks  upward
(6)--s-- -35.00  -38.03  -50.22
(15)JJ   -21.10  -36.98  -36.55
(20)NN   -14.32  -35.13  -47.20
(23)NNS  -21.08  -25.79  -47.15
(28)RB   -20.12  -35.57  -34.99
(40)VBZ  -23.46  -25.13  -47.50
          Loss  tracks  upward
(6)--s--     0      32      23
(15)JJ       0      20      40
(20)NN       0      20      40
(23)NNS      0      20      40
(28)RB       0      20      40
(40)VBZ      0      20      23
['Loss', 'tracks', 'upward']
['NN', 'VBZ', 'RB']
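
The winning path `NN` → `VBZ` → `RB` can be reproduced from the printed tables with the forward recurrence (previous best log-probability, plus log transition, plus log emission). This is a hand check on rounded table values, so small differences in the last digit are possible.

```python
import math

# Rounded values copied from the tables printed above
prev_nn = -14.32           # mini_best_probs['NN', 'Loss']
a_nn_vbz = 4.37e-02        # A['NN', 'VBZ']
b_vbz_tracks = 4.61e-04    # B['VBZ', 'tracks']
a_vbz_rb = 1.35e-01        # A['VBZ', 'RB']
b_rb_upward = 3.87e-04     # B['RB', 'upward']

# 'tracks' tagged VBZ, reached from NN (back-pointer 20 in the table)
vbz_tracks = prev_nn + math.log(a_nn_vbz) + math.log(b_vbz_tracks)

# 'upward' tagged RB, reached from VBZ (back-pointer 40 in the table)
rb_upward = vbz_tracks + math.log(a_vbz_rb) + math.log(b_rb_upward)

print(round(vbz_tracks, 2))  # -25.13
print(round(rb_upward, 2))   # -34.99
```

These agree with the `(40)VBZ`/`tracks` and `(28)RB`/`upward` cells of `mini_best_probs`, and `-34.99` is the largest value in the last column, which is why the backward pass starts from `RB`.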
