# Exercise 8: Part-of-Speech tagging using Markov Models

We will be doing Part-of-Speech (POS) tagging. This is a common problem in natural language processing that tries to assign grammatical tags to each word in a sentence. 
Often this task is posed as a sequence labeling task that can be modeled as a Markov chain.

Let's consider an example:
We have a sentence $Y_{0:T} = $ `Time flies like an arrow.` and want to predict the corresponding POS tags, which we model as a latent sequence $X_{0:T} =$ `NOUN VERB ADP DET NOUN`. 
This example already reveals the main challenge of the task: we have to take context into account, since the word `flies` could also be tagged as `NOUN` in another sentence.


In the following exercise, we will apply the algorithm that we introduced in lecture 15 (slide 36) to this problem. 

### 0. Download and prepare the data

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Importing libraries
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from functools import partial

 
# download the treebank corpus from nltk
nltk.download('treebank')
 
# download the universal tagset from nltk
nltk.download('universal_tagset')
 
# reading the Treebank tagged sentences
nltk_data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))

# split data into training and test set in the ratio 80:20
train_set, test_set = train_test_split(nltk_data, train_size=0.80, test_size=0.20, random_state=101)

[nltk_data] Downloading package treebank to /home/jovyan/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


In [3]:
# create list of train and test tagged words (ignore sentence boundaries)
train_data = [ tup for sent in train_set for tup in sent ]
test_data = [ tup for sent in test_set for tup in sent ]
print(len(train_data))
print(len(test_data))

80310
20366


In [4]:
# check some of the tagged words.
train_data[:23]

[('Drink', 'NOUN'),
 ('Carrier', 'NOUN'),
 ('Competes', 'VERB'),
 ('With', 'ADP'),
 ('Cartons', 'NOUN'),
 ('At', 'ADP'),
 ('last', 'ADJ'),
 ('count', 'NOUN'),
 (',', '.'),
 ('Candela', 'NOUN'),
 ('had', 'VERB'),
 ('sold', 'VERB'),
 ('$', '.'),
 ('4', 'NUM'),
 ('million', 'NUM'),
 ('*U*', 'X'),
 ('of', 'ADP'),
 ('its', 'PRON'),
 ('medical', 'ADJ'),
 ('devices', 'NOUN'),
 ('in', 'ADP'),
 ('Japan', 'NOUN'),
 ('.', '.')]

In [5]:
train_data[1]

('Carrier', 'NOUN')

In [6]:
train_data[1][1]

'NOUN'

In [7]:
# use set datatype to check how many unique tags are present in training data
tags = {tag for word,tag in train_data}
tag_vocab = {tag: i for i, tag in enumerate(tags)}
 
# check total words in vocabulary
words = {word for word,tag in train_data}
word_vocab = {word: i for i, word in enumerate(words)}

In [8]:
# Get tag for noun
tag_vocab[train_data[1][1]]
# Attention! The tag indices do not stay the same after restarting the notebook kernel.
# This is not a problem and simply results in a different order of the matrices/data frames.
# The values do not change!

1

In [9]:
print(f'tags: {tags}')
print(f'tag_vocab: {tag_vocab}')

#print(f'words: {words}')
#print(f'word_vocab: {word_vocab}')

tags: {'CONJ', 'NOUN', 'ADP', 'ADV', '.', 'PRON', 'DET', 'PRT', 'VERB', 'ADJ', 'X', 'NUM'}
tag_vocab: {'CONJ': 0, 'NOUN': 1, 'ADP': 2, 'ADV': 3, '.': 4, 'PRON': 5, 'DET': 6, 'PRT': 7, 'VERB': 8, 'ADJ': 9, 'X': 10, 'NUM': 11}


## 1. Precomputing probabilities ("Training")  
#### [1] (a) Compute the transition probability
First we need to compute the transition probability, i.e. the probability that one POS tag follows another: $p(x_t|x_{t-1})$

For this purpose, we can just pre-compute the entire table of conditional probabilities based on the counts from the training set. 

In [10]:
def compute_transition(train_data, tag_vocab):
    # Initialize a matrix to store transition probabilities
    p_transition = np.zeros((len(tag_vocab), len(tag_vocab)))
    
    # Count tag transitions in the training data
    for i in range(1, len(train_data)):  # start counting at index 1 so we can refer to the tag at position 0
        
        # get current and previous tag
        current_tag = tag_vocab[train_data[i][1]]                    
        previous_tag = tag_vocab[train_data[i - 1][1]]
        
        # increase transition count from previous_tag to current_tag
        p_transition[previous_tag, current_tag] += 1
    
    # get row-wise sum of transition counts
    row_sums = p_transition.sum(axis=1, keepdims=True)
    
    # normalize counts to get probabilities
    p_transition = p_transition / row_sums
    
    return p_transition

In [11]:
# Declare dimension for better readability
print(f'Row:    Previous tag\nColumn: Current tag')

# Get transition matrix
p_transition = compute_transition(train_data, tag_vocab)
tags_df = pd.DataFrame(p_transition, columns = list(tags), index=list(tags))
display(tags_df)

Row:    Previous tag
Column: Current tag


Unnamed: 0,CONJ,NOUN,ADP,ADV,.,PRON,DET,PRT,VERB,ADJ,X,NUM
CONJ,0.000549,0.349067,0.055982,0.05708,0.035126,0.060373,0.123491,0.004391,0.150384,0.113611,0.00933,0.040615
NOUN,0.042454,0.262344,0.176827,0.016895,0.240094,0.004659,0.013106,0.043935,0.149134,0.012584,0.028825,0.009144
ADP,0.001012,0.323589,0.016958,0.014553,0.038724,0.069603,0.320931,0.001266,0.008479,0.107062,0.034548,0.063275
ADV,0.006982,0.032196,0.119472,0.081458,0.139255,0.012025,0.071373,0.01474,0.339022,0.130721,0.022886,0.029868
.,0.060086,0.218562,0.092918,0.052575,0.092382,0.068777,0.17221,0.00279,0.0897,0.046137,0.025644,0.078219
PRON,0.005011,0.212756,0.022323,0.036902,0.041913,0.006834,0.009567,0.014123,0.484738,0.070615,0.088383,0.006834
DET,0.000431,0.635906,0.009918,0.012074,0.017393,0.003306,0.006037,0.000287,0.040247,0.206411,0.045134,0.022855
PRT,0.002348,0.250489,0.019569,0.009393,0.04501,0.017613,0.10137,0.001174,0.401174,0.082975,0.012133,0.056751
VERB,0.005433,0.110589,0.092357,0.083886,0.034807,0.035543,0.13361,0.030663,0.167956,0.06639,0.21593,0.022836
ADJ,0.016893,0.696893,0.080583,0.005243,0.066019,0.000194,0.005243,0.011456,0.011456,0.063301,0.020971,0.021748


#### [1] (b) Compute the emission probability
Additionally, we need to compute the emission probability, i.e. the probability that a POS (Part of Speech) tag is associated with a specific word: $p(y_t|x_t)$.

For this purpose, we can just pre-compute the entire table of conditional probabilities based on the counts from the training set.

In [12]:
# check total words in vocabulary
words = {word for word,tag in train_data}
word_vocab = {word: i for i, word in enumerate(words)}

# Take a look
# print(f'words: {words}')
# print(f'word_vocab: {word_vocab}')

In [13]:
def compute_emission(train_data, tag_vocab, word_vocab):
    # Initialize a matrix to store emission probabilities
    p_emission = np.zeros((len(word_vocab), len(tag_vocab)))
    
    # Count word emissions for each POS tag in the training data
    for i in range(len(train_data)):
        # get current word and tag
        current_word, current_tag = train_data[i]
        
        # convert to indices using the vocabularies
        word_index = word_vocab[current_word]
        tag_index = tag_vocab[current_tag]
        
        # increase emission count for current_word and current_tag
        p_emission[word_index, tag_index] += 1
    
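    # get row-wise sum of emission counts (one row per word)
    # note: dividing by the per-word totals makes each row sum to 1, i.e. the table
    # holds p(tag | word); normalizing per tag (axis=0) would give p(word | tag)
    row_sums = p_emission.sum(axis=1, keepdims=True)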
    
    # normalize counts to get probabilities
    p_emission = p_emission / row_sums
    
    return p_emission

In [14]:
p_emission = compute_emission(train_data, tag_vocab, word_vocab)
words_df = pd.DataFrame(p_emission, columns = list(tags), index=list(words))
display(words_df)

Unnamed: 0,CONJ,NOUN,ADP,ADV,.,PRON,DET,PRT,VERB,ADJ,X,NUM
Sacramento-based,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,1.0,0.0,0.0
Orchestra,0.0,1.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
Dai-Ichi,0.0,1.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
concept,0.0,1.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
bundles,0.0,1.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
any,0.0,0.0,0.0,0.01087,0.0,0.0,0.98913,0.0,0.0,0.0,0.0,0.0
CAT,0.0,1.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
300-a-share,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,1.0,0.0,0.0
Reserve,0.0,1.0,0.0,0.00000,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0


In [15]:
# Look at words with multiple possible tags to check that the function works as intended

# filter dataframe for rows without a 1 (in this case words that can have different tags)
filtered_words_df = words_df[~words_df.eq(1).any(axis=1)]

# words_df.eq(1): create boolean DataFrame of shape(words_df) where ones are marked as True others as False
# .any(axis=1): for each row, check if any element is True
# ~: bitwise NOT operator
filtered_words_df

Unnamed: 0,CONJ,NOUN,ADP,ADV,.,PRON,DET,PRT,VERB,ADJ,X,NUM
most,0.0,0.000000,0.0,0.509091,0.0,0.0,0.000000,0.0,0.000000,0.490909,0.0,0.0
spending,0.0,0.950000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.050000,0.000000,0.0,0.0
marks,0.0,0.888889,0.0,0.000000,0.0,0.0,0.000000,0.0,0.111111,0.000000,0.0,0.0
black,0.0,0.333333,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.666667,0.0,0.0
buffet,0.0,0.500000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.500000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
The,0.0,0.007181,0.0,0.000000,0.0,0.0,0.992819,0.0,0.000000,0.000000,0.0,0.0
giant,0.0,0.200000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.800000,0.0,0.0
slide,0.0,0.800000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.200000,0.000000,0.0,0.0
diversified,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.333333,0.666667,0.0,0.0


In [16]:
# Example interpretation: fine
# "fine" can be an adverb, e.g. "This works fine"
# it can also be used as a noun, e.g. "I had to pay a fine for parking in the wrong spot"

display(words_df.loc['fine'])

CONJ    0.000000
NOUN    0.857143
ADP     0.000000
ADV     0.142857
.       0.000000
PRON    0.000000
DET     0.000000
PRT     0.000000
VERB    0.000000
ADJ     0.000000
X       0.000000
NUM     0.000000
Name: fine, dtype: float64

#### [1] (c) Compute the unigram probabilities
Finally, we need to compute the unigram probability of observing a given word: $p(y)$.
Analogously for a given POS-tag: $p(x)$

Again, we compute this probability on the counts from the training set.

In [17]:
def compute_unigrams(train_data, word_vocab, tag_vocab):
    p_y = np.zeros(len(word_vocab))
    p_x = np.zeros(len(tag_vocab))
    
    # count words and POS tags in the training data
    for word, tag in train_data:
        if word in word_vocab:
            p_y[word_vocab[word]] += 1
        if tag in tag_vocab:
            p_x[tag_vocab[tag]] += 1
    
    # normalize counts to get probabilities
    p_y = p_y / p_y.sum()
    p_x = p_x / p_x.sum()
    
    return p_y, p_x

In [18]:
p_y, p_x = compute_unigrams(train_data, word_vocab, tag_vocab)

In [19]:
p_x

array([0.02268709, 0.28596688, 0.09839372, 0.03210061, 0.11606276,
       0.02733159, 0.08662682, 0.03181422, 0.135226  , 0.06412651,
       0.06478645, 0.03487735])

In [20]:
p_y.sum()

1.0

In [21]:
p_x.sum()

1.0

In [22]:
p_y, p_x = compute_unigrams(train_data, word_vocab, tag_vocab)
words_df = pd.DataFrame(p_y, index=list(words))
display(words_df)
tag_df = pd.DataFrame(p_x, index=list(tags))
display(tag_df)

Unnamed: 0,0
Sacramento-based,0.000012
Orchestra,0.000012
Dai-Ichi,0.000012
concept,0.000062
bundles,0.000012
...,...
any,0.001146
CAT,0.000050
300-a-share,0.000025
Reserve,0.000075


Unnamed: 0,0
CONJ,0.022687
NOUN,0.285967
ADP,0.098394
ADV,0.032101
.,0.116063
PRON,0.027332
DET,0.086627
PRT,0.031814
VERB,0.135226
ADJ,0.064127


In [23]:
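# 0.000012 appears surprisingly often as a word share. This is not a mistake:
# a word that occurs exactly once among the 80310 training tokens has probability 1/80310 ≈ 0.000012.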

#### Add an unknown token:
We have now computed these probabilities from the training data. Unfortunately, evaluating on unseen data leads to problems whenever we encounter a word that did not occur during training. There are many ways to deal with this. One option is to replace unseen words with an `<UNK>` token. We set its emission probability to be uniform over all tags, so that the algorithm has to rely on the transition probability to figure out the POS tag.

In [24]:
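eps = 1e-12

# reserve a new index for the unknown-word token
unk_id = len(word_vocab)
word_vocab["<UNK>"] = unk_id

# shift a tiny amount of probability mass from every known word to <UNK>,
# so that p_y still sums to 1
p_y -= eps
p_y = np.concatenate([p_y, np.array([p_y.shape[0] * eps])], axis=0)

# append a uniform row over all tags as the emission entry for <UNK>
p_emission = np.concatenate([p_emission, np.ones_like(p_emission[:1]) / len(p_emission[0])], axis=0)

def turn_unk(w):
    # map words that are not in the training vocabulary to <UNK>
    return w if w in word_vocab else "<UNK>"

# replace unseen words in the test data
test_data = [(turn_unk(w), t) for w, t in test_data]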

## [1] 2. Numerical Stability
We have now computed all the probabilities that we need to apply the algorithm from the lecture.
However, if we just apply this algorithm naively, we will run into problems of numerical stability. 
To illustrate this, let's compute the probability of a random sequence of tags $p(x_0,x_1,...,x_N)$:

In [38]:
N = 100 
idx = ... # TODO
p_seq = ... # TODO
print(p_seq)

0.13522599925289502


As you can see, although we are multiplying non-zero probabilities, we end up with a probability of zero very quickly as we increase $N$. 

This problem can be avoided by performing computations in log-space. This means that we take the logarithm of all probability values before performing any computations with them. 
This will turn all products into summations and divisions into subtractions. 
Let's try this trick on the task above and compute the value of $\log p(x_0,x_1,...,x_N)$

In [None]:
log_p_seq = ... # TODO
print(log_p_seq)

Now if we convert this back to $p(x_0,...,x_N)$ by taking the exponential, we will still end up with zero. So how does this help us? 

Here we can make use of the fact that the logarithm is a monotone increasing function, which means in particular that 
$$
\mathrm{argmax}_x p(x) = \mathrm{argmax}_x \log p(x)
$$ 
This is useful in cases like this one where the exact probability is not of interest, but we just want to know the position of the maximum value of a particular function. 
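
The underflow and the argmax argument can be illustrated with toy numbers. The following is a minimal sketch, not part of the exercise: `probs_a` and `probs_b` are hypothetical per-step probabilities of two candidate sequences and are not taken from the treebank data.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# two hypothetical candidate sequences, each with N per-step probabilities
probs_a = rng.uniform(0.05, 0.30, size=N)
probs_b = probs_a * 0.9  # every step is slightly less likely than in a

# naive products: all factors are non-zero, yet both results underflow
products = np.array([np.prod(probs_a), np.prod(probs_b)])

# log-space: products become sums of logs, which stay finite
log_scores = np.array([np.sum(np.log(probs_a)), np.sum(np.log(probs_b))])

print(products)               # both entries are 0.0
print(log_scores)             # finite log-probabilities
print(np.argmax(log_scores))  # index 0: candidate a has the higher score
```

The products can no longer distinguish the two candidates, whereas the log-scores still rank them correctly.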

Finally, we need to deal with the situation where we want to take the logarithm of a sum of probabilities. 
In that case, we cannot just "pull the log through". 
Instead we need to use the log-sum-exp trick:
$$
\log \sum_{i=1}^{M} p(x_i) = b + \log \sum_{i=1}^M \exp ((\log p(x_i)) - b)   
$$
with $b = \max_i \log p(x_i)$. 
Please read this short explanation to understand why/how this works: [http://wittawat.com/posts/log-sum_exp_underflow.html](http://wittawat.com/posts/log-sum_exp_underflow.html)
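
As a small illustration of the formula above, here is a sketch of the trick in plain NumPy. The helper name `log_sum_exp` and the example values are assumptions made for this illustration only.

```python
import numpy as np

def log_sum_exp(log_p, axis=None):
    """Compute log(sum(exp(log_p))) along `axis` by shifting with the maximum."""
    b = np.max(log_p, axis=axis, keepdims=True)  # b = max_i log p(x_i)
    out = b + np.log(np.sum(np.exp(log_p - b), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

# log-probabilities that are far too small to be exponentiated directly
log_p = np.array([-1000.0, -1001.0, -1002.0])

with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(log_p)))  # exp underflows to 0, log(0) = -inf

stable = log_sum_exp(log_p)
print(naive, stable)
```

After the shift, the largest term inside the sum is `exp(0) = 1`, so the sum cannot underflow to zero. SciPy also provides this operation as `scipy.special.logsumexp`.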

In the following implementation, we will exclusively work with log-probabilities and use the log-sum-exp trick wherever we encounter a sum.
To prepare for this, we now convert all probabilities into log-probabilities and add a small `eps` term to all values to avoid $\log(0)$.

In [None]:
eps= 1e-12
log_p_y = np.log(p_y + eps)
log_p_x = np.log(p_x + eps)
log_p_transition = np.log(p_transition + eps)
log_p_emission = np.log(p_emission + eps)

## 3. The algorithm
Now that we have implemented the emission and transition probabilities, we can use them to compute the predictions $p(x_t|Y)$ for all $t=0,\ldots,T$ by applying the algorithm from the lecture.
Note that $x$ is a discrete variable (we only have a limited vocabulary $\mathcal{X}$), which allows us to evaluate the integral expressions by carrying out a sum.

#### Please read all subtasks carefully before starting your implementation! 

#### [2] (a) Predict: 
Please compute the prediction step here. Note that you need to apply the log-sum-exp trick here and use numpy operations to efficiently compute the probabilities for all possible $x_t$ at the current timestep $t$. 

$$ p(x_t|Y_{0:t-1}) = \sum_{x_{t-1} \in \mathcal{X}} p(x_t|x_{t-1}) p(x_{t-1}|Y_{0:t-1}) $$
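
To see how the sum over $x_{t-1}$ maps onto array operations, consider a toy chain with two tags. This is a sketch under assumptions: `toy_transition` and `toy_filtered` are made-up values, and the layout (rows index the previous tag, columns the current tag) follows `compute_transition` above.

```python
import numpy as np

# hypothetical 2-tag chain; rows index the previous tag, columns the current tag
toy_transition = np.array([[0.7, 0.3],
                           [0.4, 0.6]])
toy_filtered = np.array([0.9, 0.1])  # stands in for p(x_{t-1} | Y_{0:t-1})

log_T = np.log(toy_transition)
log_prev = np.log(toy_filtered)

# log of each summand p(x_t | x_{t-1}) * p(x_{t-1} | Y_{0:t-1});
# log_prev is broadcast so that each row (previous tag) gets its own value
log_joint = log_T + log_prev[:, None]

# log-sum-exp over the previous tag (axis 0)
b = log_joint.max(axis=0)
log_pred = b + np.log(np.exp(log_joint - b).sum(axis=0))

print(np.exp(log_pred))  # same as toy_filtered @ toy_transition
```

In probability space this is just the vector-matrix product `toy_filtered @ toy_transition`. The log-space version computes the same quantity without ever forming the small products explicitly.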

In [None]:
def predict(t, log_p_update, log_p_transition):
    """
    Args:
        t (int): Current timestep.
        log_p_update (np.array): `log p(x_t|y_{0:t})` array of shape [T+1,len(tag_vocab)].
        log_p_transition (np.array): `log p(x_t|x_{t-1})` array of shape [len(tag_vocab),len(tag_vocab)].
    
    Returns: 
        np.array: `log p(x_t|y_{0:t-1})` for timestep t. Array of shape [len(tag_vocab)]
    """
    # TODO
    return ...

#### [2] (b) Update: 
Please compute the update step here. Note that you need to perform the operations in log-space here and use numpy operations to efficiently compute the probabilities for all possible $x_t$ at the current timestep $t$. 

$$ p(x_t|Y_{0:t}) = p(y_t|x_t) \frac{p(x_t|Y_{0:t-1})}{p(y_t)} $$

In [None]:
def update(Y, t, log_p_predict, log_p_y, log_p_emission):
    """
    Args:
        Y (np.array): Observed sequence of word indices in array of shape [T+1] (index 0 is a placeholder).
        t (int): Current timestep.
        log_p_predict (np.array): `log p(x_t|y_{0:t-1})` array of shape [T+1,len(tag_vocab)].
        log_p_y (np.array): `log p(y_t)` array of shape [len(word_vocab)].
        log_p_emission (np.array): `log p(y_t|x_t)` array of shape [len(word_vocab),len(tag_vocab)].
    
    Returns: 
        np.array: `log p(x_t|y_{0:t})` for timestep t. Array of shape [len(tag_vocab)]
    """
    # TODO
    return ...

#### [2] (c) Smoothing: 
Please compute the smoothing step here. Note that you need to apply the log-sum-exp trick here and use numpy operations to efficiently compute the probabilities for all possible $x_t$ at the current timestep $t$. 

$$ p(x_t|Y) = p(x_t|Y_{0:t}) \sum_{x_{t+1} \in \mathcal{X}} p(x_{t+1}|x_t) \frac{p(x_{t+1}|Y)}{ p(x_{t+1}| Y_{0:t}) } $$
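
The denominator $p(x_{t+1}|Y_{0:t})$ is the prediction for timestep $t+1$, and the sum now runs over the next tag rather than the previous one. A toy sketch with two tags makes the axes explicit. All `toy_*` values are made up for illustration, and the layout of `toy_transition` (rows: $x_t$, columns: $x_{t+1}$) is assumed to match `compute_transition` above.

```python
import numpy as np

toy_transition = np.array([[0.7, 0.3],
                           [0.4, 0.6]])        # rows: x_t, columns: x_{t+1}
toy_filtered = np.array([0.9, 0.1])            # stands in for p(x_t | Y_{0:t})
toy_pred_next = toy_filtered @ toy_transition  # p(x_{t+1} | Y_{0:t})
toy_smooth_next = np.array([0.2, 0.8])         # stands in for p(x_{t+1} | Y)

# ratio term of the formula, one value per x_{t+1}
log_ratio = np.log(toy_smooth_next) - np.log(toy_pred_next)

# add the ratio along the columns (x_{t+1}), then sum it out with log-sum-exp
log_terms = np.log(toy_transition) + log_ratio[None, :]
b = log_terms.max(axis=1, keepdims=True)
log_sum = (b + np.log(np.exp(log_terms - b).sum(axis=1, keepdims=True)))[:, 0]

log_smooth = np.log(toy_filtered) + log_sum
print(np.exp(log_smooth), np.exp(log_smooth).sum())
```

Because `toy_pred_next` is consistent with `toy_filtered` and `toy_transition`, the smoothed values sum to one, which is a quick sanity check for an implementation.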

In [None]:
def smoothing(t, log_p_predict, log_p_update, log_p_marginal, log_p_transition): 
    """
    Args:
        t (int): Current timestep.
        log_p_predict (np.array): `log p(x_t|y_{0:t-1})` array of shape [T+1,len(tag_vocab)].
        log_p_update (np.array): `log p(x_t|y_{0:t})` array of shape [T+1,len(tag_vocab)].
        log_p_marginal (np.array): `log p(x_t|Y)` array of shape [T+1,len(tag_vocab)].
        log_p_transition (np.array): `log p(x_t|x_{t-1})` array of shape [len(tag_vocab),len(tag_vocab)].
    
    Returns: 
        np.array: `log p(x_t|Y)` for timestep t. Array of shape [len(tag_vocab)] 
    """
    # TODO
    return ...

#### Plugging it together
The functions above are applied in the following function that executes our algorithm to compute the marginals.
Please do not change anything in this function or the signature of the functions above. 
Take a look at their docstrings to understand what you need to compute.

In [None]:
def compute_marginals(Y_, log_p_transition, log_p_emission, log_p_y, log_p_x, tag_vocab):
    T = len(Y_)
    Y = np.zeros(T+1, dtype=int)
    Y[1:] = Y_ # since we are counting time from 1
    log_p_predict = np.zeros((T+1,len(tag_vocab)))
    log_p_update = np.zeros((T+1,len(tag_vocab)))
    log_p_update[0] = log_p_x  
    for t in range(1,T+1):
        log_p_predict[t] = predict(t, log_p_update, log_p_transition)
        log_p_update[t] = update(Y, t, log_p_predict, log_p_y, log_p_emission)
        
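    log_p_marginal = np.zeros((T+1,len(tag_vocab)))
    # the backward pass starts from the last filtered estimate: p(x_T|Y) = p(x_T|Y_{0:T})
    log_p_marginal[T] = log_p_update[T]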
    for t in range(T-1,0,-1):
        log_p_marginal[t] = smoothing(t, log_p_predict, log_p_update, log_p_marginal, log_p_transition) 
    return log_p_marginal[1:]

## 4. Evaluate 
Now we can evaluate the algorithm we implemented on the test data.

Hint: you should get an accuracy > 90% if your implementation is correct.

In [None]:
def to_tokens(X, vocab):
    inv_map = {v: k for k, v in vocab.items()}
    return [inv_map[x] for x in X]

In [None]:
Y_test = np.array([word_vocab[w] for w,t in test_data])
X_test = np.array([tag_vocab[t] for w,t in test_data])
p_X_Y = compute_marginals(Y_test, log_p_transition, log_p_emission, log_p_y, log_p_x, tag_vocab)
X_pred = np.argmax(p_X_Y, axis=1)
acc = np.sum(X_pred == X_test) / X_pred.shape[0] * 100
print(to_tokens(X_test[:10], tag_vocab))
print(to_tokens(X_pred[:10], tag_vocab))
print(to_tokens(Y_test[:10], word_vocab))
print(f"Accuracy: {acc}%")