# Week 5 - NLP and Deep Learning

---

# Lecture 9: Sequence Prediction with HMMs

In this exercise you will implement the Viterbi algorithm for decoding in sequence tagging. More concretely, we are going to build a POS tagger for English web data.

## 1. Emissions and transition probabilities

In this part of the exercise you are going to prepare the emission and transition probabilities to use in the Viterbi algorithm. We are going to focus on the task of part-of-speech (POS) tagging. You can use the following data reader for the assignments below:

In [2]:
def read_conll_file(path):
    """
    read in conll file
    
    :param path: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    for line in open(path, encoding='utf-8'):
        line = line.strip()

        if line:
            if line[0] == '#':
                continue # skip comments
            tok = line.split('\t')

            current_words.append(tok[0])
            current_tags.append(tok[1])
        else:
            if current_words:  # skip empty lines
                data.append((current_words, current_tags))
            current_words = []
            current_tags = []

    # check for last one
    if current_tags != []:
        data.append((current_words, current_tags))
    return data

Because we are going to use the POS labels as indices in our Viterbi matrix, we need to know all labels beforehand, and they need to have a static order. We also need a special beginning and end label:

In [3]:
train_data = read_conll_file('pos-data/en_ewt-train.conll')
dev_data = read_conll_file('pos-data/en_ewt-dev.conll')

SMOOTH = 0.1
BEG = '<S>'
END = '</S>'
UNK = '<UNK>'

label_set = set([pos_label for sentence in train_data for pos_label in sentence[1]])
label_set.add(BEG)
label_set.add(END) #? Why aren't we adding <UNK>?
# put labels in a list, so that they are guaranteed to have the same order
label_list = list(sorted(label_set))

print('Length train data: ' + str(len(train_data)))
print('Length dev data: ' + str(len(dev_data)))

# the data is a list of pairs, containing 1: a list of words 2: a list of POS labels
print('Random datapoint:')
print(train_data[70][0])
print(train_data[70][1])

print('All labels:')
print(label_list)

Length train data: 12543
Length dev data: 2000
Random datapoint:
['It', 'is', 'a', 'time', 'to', 'learn', 'what', 'happened', 'and', 'how', 'it', 'may', 'affect', 'the', 'future', '.']
['PRON', 'AUX', 'DET', 'NOUN', 'PART', 'VERB', 'PRON', 'VERB', 'CCONJ', 'ADV', 'PRON', 'AUX', 'VERB', 'DET', 'NOUN', 'PUNCT']
All labels:
['</S>', '<S>', 'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']


* a) Calculate the transition probabilities based on the training data. Use a special label for the beginning of a sentence (`<S>`) and the end of a sentence (`</S>`). Use Laplace smoothing with a value of 0.1 to avoid probabilities of 0.0.

**Hint**: The transition probability $P(t_i|t_{i-1})$ is the probability that a tag $t_{i-1}$ is followed by a tag $t_i$.
$$P(t_i|t_{i-1}) = {C(t_{i-1},t_{i}) \over C(t_{i-1})}$$
With smoothing:
$$P(t_i|t_{i-1}) = {C(t_{i-1},t_{i}) + \gamma \over C(t_{i-1}) + (|t|) * \gamma} $$

Where $|t| * \gamma$ is added to the denominator because we add probability mass $\gamma$ to each of the $|t|$ labels.

**Hint2**: Every sentence in the data looks like `['DET', 'NOUN', 'VERB']` without any `<S>` or `</S>` tags. So the beginning and end of every data sample need to be handled differently when counting occurrences of transitions. Alternatively, you can add these tokens to each data sample so that they look like `['<S>', 'DET', 'NOUN', 'VERB', '</S>']`.
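
The smoothed formula and the padding from Hint2 can be tried out on a tiny example first. The sketch below is only an illustration under stated assumptions: the tag sequences in `toy_tags` are invented, and `smoothed_transition` is a hypothetical helper, not part of the provided code.

```python
from collections import Counter

BEG, END = '<S>', '</S>'
GAMMA = 0.1  # smoothing value from the exercise

# Toy tag sequences (invented for illustration, not taken from the treebank)
toy_tags = [['DET', 'NOUN', 'VERB'], ['NOUN', 'VERB']]

labels = sorted({t for seq in toy_tags for t in seq} | {BEG, END})

# Count tag bigrams after padding each sequence with <S> and </S>
bigrams = Counter()
for seq in toy_tags:
    padded = [BEG] + seq + [END]
    bigrams.update(zip(padded, padded[1:]))

def smoothed_transition(prev, cur):
    """P(cur | prev) with add-gamma smoothing over all labels."""
    total = sum(bigrams[(prev, other)] for other in labels)
    return (bigrams[(prev, cur)] + GAMMA) / (total + len(labels) * GAMMA)

print(smoothed_transition('NOUN', 'VERB'))
# The outgoing probabilities of one label sum to 1 (up to float rounding)
print(sum(smoothed_transition('NOUN', t) for t in labels))
```

Here `NOUN` is followed by `VERB` in both toy sequences, so the estimate is (2 + 0.1) / (2 + 5 * 0.1).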

In [4]:
from collections import Counter
# Make a dictionary of dictionaries, so that we can easily query for a certain probability.
# We add the smoothing value as a starting point, as it has to be added to each count.
# Note that a list of lists with POS indices is more efficient (but a bit more cumbersome to implement).
transition_counts = {label: {label: SMOOTH for label in label_list} for label in label_list}
# The count that a NOUN follows an ADJ (empty now)
print(transition_counts['ADJ']['NOUN'])

# First obtain the raw counts
pairs = []
for sentence in train_data:
    sentence = [[BEG] + sentence[0] + [END], [BEG] + sentence[1] + [END]]
    labels = sentence[1]
    pairs += [(labels[i], labels[i+1]) for i in range(len(labels)-1)]

for label in label_list:
    this_tuples = filter(lambda x: x[0] == label, pairs)
    counts = Counter(this_tuples)
    for (l_from, l_to), count in counts.items():
        transition_counts[l_from][l_to] += count

print(transition_counts)

print(transition_counts['ADJ']['NOUN']) # should be 6803.1

# Now fill the transition matrix, note that the outgoing probability of each label should sum to 1.0

transition_probs = {label: {label: 0.0 for label in label_list} for label in label_list}

for label in label_list:
    total_count = sum([count for count in transition_counts[label].values()])
    transition_probs[label] = { oth_label: oth_count / total_count for oth_label, oth_count in transition_counts[label].items()}

print(sum(transition_probs['ADJ'].values())) # sums to 1, we are happy.

0.1
{'</S>': {'</S>': 0.1, '<S>': 0.1, 'ADJ': 0.1, 'ADP': 0.1, 'ADV': 0.1, 'AUX': 0.1, 'CCONJ': 0.1, 'DET': 0.1, 'INTJ': 0.1, 'NOUN': 0.1, 'NUM': 0.1, 'PART': 0.1, 'PRON': 0.1, 'PROPN': 0.1, 'PUNCT': 0.1, 'SCONJ': 0.1, 'SYM': 0.1, 'VERB': 0.1, 'X': 0.1}, '<S>': {'</S>': 0.1, '<S>': 0.1, 'ADJ': 512.1, 'ADP': 548.1, 'ADV': 958.1, 'AUX': 362.1, 'CCONJ': 291.1, 'DET': 1260.1, 'INTJ': 405.1, 'NOUN': 783.1, 'NUM': 495.1, 'PART': 59.1, 'PRON': 3535.1, 'PROPN': 1442.1, 'PUNCT': 443.1, 'SCONJ': 446.1, 'SYM': 91.1, 'VERB': 759.1, 'X': 154.1}, 'ADJ': {'</S>': 50.1, '<S>': 0.1, 'ADJ': 714.1, 'ADP': 1006.1, 'ADV': 181.1, 'AUX': 45.1, 'CCONJ': 562.1, 'DET': 66.1, 'INTJ': 4.1, 'NOUN': 6803.1, 'NUM': 109.1, 'PART': 414.1, 'PRON': 171.1, 'PROPN': 872.1, 'PUNCT': 1702.1, 'SCONJ': 300.1, 'SYM': 20.1, 'VERB': 117.1, 'X': 16.1}, 'ADP': {'</S>': 1.1, '<S>': 0.1, 'ADJ': 1267.1, 'ADP': 540.1, 'ADV': 284.1, 'AUX': 15.1, 'CCONJ': 104.1, 'DET': 6377.1, 'INTJ': 5.1, 'NOUN': 2903.1, 'NUM': 717.1, 'PART': 15.1, 'PR

* b) Calculate the emission probabilities based on the training data. Make sure that every POS tag can be assigned to an `<UNK>` token. Use Laplace smoothing with a value of 0.01 to avoid probabilities of 0.0.

**Hint**: The emission probability $P(w_i|t_{i})$ is the probability that a tag $t_i$ is associated with a given word $w_i$. The formula below shows the counts $C$ needed to calculate the probability.

$$P(w_i|t_{i}) = {C(t_{i},w_{i}) \over C(t_{i})}$$
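
With smoothing, the emission estimate can be written by analogy with the transition case. This is a sketch under the assumption that $\gamma$ is added to every (tag, word) pair; the symbol $|V|$ (the vocabulary size, including `<UNK>`) is introduced here and does not appear in the original hint:

$$P(w_i|t_{i}) = {C(t_{i},w_{i}) + \gamma \over C(t_{i}) + |V| * \gamma}$$

Because `<UNK>` never occurs in the training data, its count is 0 and its probability for a tag $t_i$ reduces to $\gamma / (C(t_{i}) + |V| * \gamma)$.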

In [5]:
word_set = {UNK}
for sent in train_data:
    for word in sent[0]:
        word_set.add(word)
word_list = list(sorted(word_set))

# Fill emission counts
pairs = []
for sentence in train_data:
    words = sentence[0]
    labels = sentence[1]
    pairs += zip(labels, words)

emission_counts = {label: {word: SMOOTH for word in word_list} for label in label_list}

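# NOTE: label_list[3:] skips '</S>' and '<S>', but also 'ADJ' (index 2 in the
# sorted label list), so ADJ keeps only its smoothed counts here.
# label_list[2:] would skip just the two special labels.
for label in label_list[3:]: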
    this_tuples = filter(lambda x: x[0] == label, pairs)
    counts = Counter(this_tuples)

    for (label, word), count in counts.items():
        emission_counts[label][word] += count

print(emission_counts['ADP']['at'])

# Convert to probabilities
emission_probs = {label: {word: SMOOTH for word in word_list} for label in label_list}

for label in label_list:
    total_count = sum([count for count in emission_counts[label].values()])
    emission_probs[label] = { word: count / total_count for word, count in emission_counts[label].items()}

print(emission_probs['ADP']['at'])
print(sum(emission_probs['ADP'].values())) # still sum to one

735.1
0.03719746383228232
0.9999999999997137


You can check whether your solutions are correct by estimating the probabilities on the data and checking whether they match the values below:

In [6]:
print(transition_probs['ADJ']['NOUN']) # 0.5171926196793345
print(transition_probs['NOUN']['ADJ']) # 0.01123434129302644
print(transition_probs[BEG]['ADJ']) # 0.04082136964025221
print(transition_probs['ADJ'][END]) # 0.003808756338424345
print(emission_probs['NOUN']['calling'])   # 2.9909103515414666e-05
print(emission_probs['VERB']['calling'])  # 0.0005740785225418476
print(emission_probs['VERB'][UNK])  # 4.071478883275515e-06
# perfect

0.5171926196793345
0.01123434129302644
0.04082136964025221
0.003808756338424345
2.9909103515414666e-05
0.0005740785225418476
4.071478883275515e-06


## 2. Viterbi algorithm

In the image below we see an example of the calculation of the first 2 positions in a Viterbi decoding:
<img width=500px src="pics/viterbi.jpg">

* a) Implement Viterbi decoding, use the transition and emission probabilities previously estimated (note that we also provide pre-calculated probabilities in `probs_en.pickle`). You can use the example code shown below as a starting point if you like.

**Hint**: The implementation can become simpler if you think about the problem as a 2D matrix that needs to be filled (each position in the matrix is a node in the Viterbi decoding, $v_1(7)$, $v_1(6)$, ...). You can first initialize the matrix with 0.0s, and then fill it from left to right.

**Hint2**: You need to combine three probabilities for each possible history, and take the max of all possible histories. You do not need to use negative log probabilities for now.
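
To make the matrix view from the hints concrete, here is a minimal sketch on a toy HMM with two tags. All probabilities, the tag names `N` and `V`, and the function `toy_viterbi` are invented for illustration. As a simplification, the sketch has no end-of-sentence transition, so it is not a drop-in solution for the exercise.

```python
# Toy HMM with two tags; all numbers are invented for illustration.
tags = ['N', 'V']
start = {'N': 0.8, 'V': 0.2}                      # P(tag | <S>)
trans = {'N': {'N': 0.3, 'V': 0.7},               # P(cur | prev)
         'V': {'N': 0.6, 'V': 0.4}}
emit = {'N': {'fish': 0.6, 'swim': 0.4},          # P(word | tag)
        'V': {'fish': 0.3, 'swim': 0.7}}

def toy_viterbi(words):
    # scores[t][i]: probability of the best path ending in tag t at word i
    scores = [[0.0] * len(words) for _ in tags]
    backtrack = [[0] * len(words) for _ in tags]

    # First column: only start transition and emission
    for t, tag in enumerate(tags):
        scores[t][0] = start[tag] * emit[tag][words[0]]

    for i in range(1, len(words)):
        for t, tag in enumerate(tags):
            # history * transition for every previous tag, then take the max
            candidates = [scores[p][i - 1] * trans[prev][tag]
                          for p, prev in enumerate(tags)]
            best_prev = max(range(len(tags)), key=candidates.__getitem__)
            backtrack[t][i] = best_prev
            scores[t][i] = candidates[best_prev] * emit[tag][words[i]]

    # Follow the backpointers from the best final tag
    best = max(range(len(tags)), key=lambda t: scores[t][-1])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, backtrack[path[0]][i])
    return [tags[t] for t in path]

print(toy_viterbi(['fish', 'swim']))  # ['N', 'V']
```

For the second word, the best history for `V` is `N` (0.48 * 0.7 = 0.336 versus 0.06 * 0.4 = 0.024), which is exactly the "max over all possible histories" step from Hint2.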

In [21]:
from operator import itemgetter
# You can also load the pre-calculated probabilities:
# import pickle
# transition_probs, emission_probs = pickle.load(open('probs_en.pickle', 'rb'))
import numpy as np
def viterbi(sentence):
    """
    sentence: list of strings
    """
    columns = len(sentence)
    # You don't need the special tokens in the viterbi decoding so we remove them
    labels_list_exc_special = label_list.copy()
    labels_list_exc_special.remove(BEG)
    labels_list_exc_special.remove(END)

    rows = len(labels_list_exc_special)
    
    # Create the full matrix for scores as well as the backtracking.
    # e.g. scores[0][3] should get the probability of the best path of 
    # the first label and the 4th word in the sentence
    # Backtrack contains the index of the best tag for the previous word
    # e.g. backtrack[0][3] should get the index of the best tag for the 3rd word 
    # when backtracking from the first label and 4th word
    scores = np.array([[0.0 for _ in range(columns)] for _ in range(rows)])
    backtrack = np.array([[0 for _ in range(columns)] for _ in range(rows)])
    
    # Handle the first token separately, as it only has 2 probabilities (emission, transition)
    word_position = 0
    for pos_tag_idx, pos_tag in enumerate(labels_list_exc_special):
        # The probability of the first word given the POS tag:
        word = sentence[word_position]
        if word not in emission_probs[pos_tag]:
            word = UNK
        em_prob = emission_probs[pos_tag][word] 
        
        # The probability of the POS tag given that the previous "tag" is <S>
        transition_prob = transition_probs[BEG][pos_tag]
        
        # Save the total probability:
        scores[pos_tag_idx][word_position] = em_prob * transition_prob
        
        # Backtracking for the first token is unnecessary so we ignore it

    # Now handle the rest of the sequence
    for word_position in range(1, columns):
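        word = sentence[word_position]
        # The tables built above share one vocabulary across all tags, so checking
        # a single tag is enough (avoids relying on a leftover loop variable)
        if word not in emission_probs[labels_list_exc_special[0]]:
            word = UNK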
        for pos_tag_idx, pos_tag in enumerate(labels_list_exc_special):
            # Get emission probability, remember to handle unknown words
            em_prob = emission_probs[pos_tag][word]
            
            # For each possible history path (i.e. label): get the total score
            # We need to get the transition probability and the history probability
            # Hint: the history probability is the score of the previous word position in scores matrix
            candidate_scores = [0.0] * len(labels_list_exc_special)
            for hist_pos_tag_idx, hist_pos_tag in enumerate(labels_list_exc_special):
                score_previous = scores[hist_pos_tag_idx][word_position-1]
                transition_to_current_tag = transition_probs[hist_pos_tag][pos_tag]
                candidate_scores[hist_pos_tag_idx] = score_previous * transition_to_current_tag

            # Now extract the best score from candidate_scores and its previous path and save these
            # Hint: backtrack should contain the index of the best tag:
            # backtrack[tag_idx][word_position] = previous_best_tag_idx
            last_word_best_tag_idx, last_word_best_score = max(enumerate(candidate_scores), key=itemgetter(1))
            backtrack[pos_tag_idx][word_position] = last_word_best_tag_idx
            scores[pos_tag_idx][word_position] = last_word_best_score * em_prob


    # Extract the best score from the last labels to the special end label
    # Hint: here you only have history and transition (no emission)
    candidate_scores = [0.0] * len(labels_list_exc_special)
    for pos_tag_idx, pos_tag in enumerate(labels_list_exc_special):
        transition_prob = transition_probs[pos_tag][END]
        previous_tag_score = scores[pos_tag_idx][-1]
        candidate_scores[pos_tag_idx] = transition_prob * previous_tag_score
    last_word_best_tag_idx, last_word_best_score = max(enumerate(candidate_scores), key=itemgetter(1))
    end_score = candidate_scores[last_word_best_tag_idx]

    # Extract the path from the best last label using the backtrack matrix
    # Hint: the path contains the index of the best tag for each word
    tags_idx = [last_word_best_tag_idx]
    for word_idx in range(columns-1, 0, -1):
        nexttag = backtrack[tags_idx[0]][word_idx]
        tags_idx.insert(0, nexttag)
    
    # Convert the indices to labels (the path is already in sentence order,
    # because every tag was inserted at the front)
    return [labels_list_exc_special[i] for i in tags_idx]

viterbi(['this', 'is', 'a', 'very', 'good', 'chocolate', '.'])

['PRON', 'AUX', 'DET', 'ADV', 'ADJ', 'NOUN', 'PUNCT']
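
Hint2 above notes that negative log probabilities are not needed for now. The sketch below illustrates why they become relevant for long inputs: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays finite. The values of `p` and `steps` are invented for illustration.

```python
import math

p = 1e-4          # a small per-step probability (invented value)
steps = 100       # hypothetical number of multiplications

product = 1.0
log_sum = 0.0
for _ in range(steps):
    product *= p              # plain probabilities
    log_sum += math.log(p)    # log probabilities

print(product)   # 0.0: the true value 1e-400 is below the float64 range
print(log_sum)   # finite, roughly -921.03
```

In a log-space Viterbi, the products of history, transition and emission become sums, and the `max` over histories is unchanged because the logarithm is monotonic.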

* b) Ensure that the best path is saved during the decoding, so that you can extract the labels. What is the accuracy of your implementation of the Viterbi algorithm on the development data (`pos-data/en_ewt-dev.conll`)?

**Hint**: If implemented correctly, it should score at least an accuracy of 50%. If you score lower, we suggest you try printing the probabilities at each step (word) for the first sentence of the development data.

In [26]:
corr, tot = 0, 0
for words, labels in dev_data:
    corr += sum(np.array(viterbi(words)) == np.array(labels))
    tot += len(labels)

acc = corr / tot
print(f"{acc*100:.2f}% accuracy")
# 🔥🔥🔥

84.14% accuracy


* c) **Bonus**: Try to improve your predictions by inspecting common errors, tuning some of the decisions you made (e.g. smoothing, weighting the three probabilities), or improving the handling of unknown tokens.
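
One possible way to inspect common errors is to count which gold tags get confused with which predicted tags. This is a minimal sketch; `most_common_errors` is a hypothetical helper and the toy predictions are invented.

```python
from collections import Counter

def most_common_errors(gold_seqs, pred_seqs, n=5):
    """Return the n most frequent (gold, predicted) tag confusions."""
    errors = Counter()
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            if g != p:
                errors[(g, p)] += 1
    return errors.most_common(n)

# Toy example with invented predictions
gold = [['PROPN', 'VERB', 'NOUN'], ['PROPN', 'AUX']]
pred = [['NOUN', 'VERB', 'NOUN'], ['NOUN', 'VERB']]
print(most_common_errors(gold, pred))
# [(('PROPN', 'NOUN'), 2), (('AUX', 'VERB'), 1)]
```

With the data from this notebook, the inputs could be `[labels for _, labels in dev_data]` and `[viterbi(words) for words, _ in dev_data]`.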


In [None]:
"""
Don't have time to implement, but here are some ideas.
1. I saw in testing the above cell that AP, which is PROPN, got marked as NOUN.  
   All-caps words are probably organizations, so try to look at how many allcaps  
   words are PROPNs and incorporate that.  
2. ok i lied i only have 1 idea, i would have to play around with weights  
   (hyperparameters optimization?) to get anything else.  
"""
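
The first idea in the note above could be checked before implementing anything, by counting which tags all-caps words receive in the training data. This is a sketch only: `allcaps_tag_distribution` is a hypothetical helper, the `min_len` cutoff is an assumption (it excludes single letters such as "I"), and the toy sentences are invented.

```python
from collections import Counter

def allcaps_tag_distribution(data, min_len=2):
    """Count POS tags of all-uppercase words in (words, tags) pairs."""
    counts = Counter()
    for words, tags in data:
        for word, tag in zip(words, tags):
            if len(word) >= min_len and word.isalpha() and word.isupper():
                counts[tag] += 1
    return counts

# Toy data in the same (words, tags) format as read_conll_file returns
toy = [(['AP', 'reports', 'from', 'NATO', '.'],
        ['PROPN', 'VERB', 'ADP', 'PROPN', 'PUNCT']),
       (['I', 'said', 'NO'], ['PRON', 'VERB', 'INTJ'])]
counts = allcaps_tag_distribution(toy)
print(counts)
```

Calling it on `train_data` would show whether PROPN really dominates among all-caps words, which is the assumption behind the idea.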

# Lecture 10: BERT

## 3. Subword tokenization

BERT models are trained to predict tokens that were masked with a special `[MASK]` token. In this assignment you will inspect what it has learned, and whether it has certain preferences (i.e. probing).

a) Load the multilingual BERT tokenizer:

In [29]:
from transformers import AutoTokenizer

tokzr = AutoTokenizer.from_pretrained('bert-base-multilingual-cased', use_fast=False)

Multilingual BERT was trained on the 100 most frequent languages of Wikipedia. They used smoothing to correct imbalances in the data. However, their smoothing is relatively conservative, so high-resource languages have a higher impact on the model, and it is unclear how they sampled for training the tokenizer. Compare the tokenizations for two different language types you know, preferably one higher-resource and one lower-resource. If you only know one language, or only high-resource languages, try a different variety of the language (for example for English, use social media abbreviations or typos, e.g. "c u tmrw"). Can you observe any differences in the results? Does it match your intuition of separating mostly meaning-carrying subwords?

You can use Figure 1 of https://arxiv.org/pdf/1911.02116.pdf or https://en.wikipedia.org/wiki/List_of_Wikipedias to see how large languages are on Wikipedia.
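
One rough way to quantify the comparison is the average number of subword tokens per word. The sketch below is an assumption-laden illustration: `fertility` is a hypothetical helper, and `fake_tokenize` is a stand-in so the example runs without downloading a model. Splitting on whitespace does not work for languages written without spaces (e.g. Japanese), so the number is only meaningful for space-delimited text.

```python
def fertility(tokenize, sentence):
    """Average number of subword tokens per whitespace-separated word."""
    words = sentence.split()
    if not words:
        return 0.0
    return len(tokenize(sentence)) / len(words)

# Stand-in tokenizer for illustration only: cuts each word into 3-character chunks
def fake_tokenize(sentence):
    return [w[i:i + 3] for w in sentence.split() for i in range(0, len(w), 3)]

print(fertility(fake_tokenize, 'this is an example'))  # 1.75
```

With the tokenizer loaded above, the call would be `fertility(tokzr.tokenize, 'this is an example input')`; values close to 1.0 mean that most words are kept whole.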

In [34]:
print(tokzr.tokenize('this is an example input'))
print(tokzr.tokenize('これは入力例です'))
# 入力 means "input" and should have been 1 token
print(tokzr.tokenize('Dette er et eksempel input'))
# I'm surprised every word has a token, is Danish really common enough for that?
print(tokzr.tokenize('Это пример ввода'))
# Makes sense

['this', 'is', 'an', 'example', 'input']
['これは', '入', '力', '例', 'で', '##す']
['Dette', 'er', 'et', 'eksempel', 'input']
['Это', 'пример', 'вв', '##ода']


b) Test whether the `bert-base-cased` model can solve the analogy task that we discussed in the word2vec lecture ([slides](https://github.itu.dk/robv/intro-nlp2023/blob/main/slides/07-vector_semantics.pdf), [assignment](https://github.itu.dk/robv/intro-nlp2023/blob/main/assignments/week4/week4.ipynb)). We can do this by masking the target word we are looking for and letting the model predict which words fit best, using a prompt to discover what the language model would guess. For example, we can use the prompt "man is to king as woman is to [MASK]". Try at least two syntactic analogies and two semantic analogies.
You can use the following code:

(Note that you need 4 GB of RAM for this assignment; otherwise you can use the HPC.)

In [35]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

def getTopN(inputSent, model, tokzr, topn=1):
    maskId = tokzr.convert_tokens_to_ids(tokzr.mask_token)
    tokenIds = tokzr(inputSent).input_ids
    if maskId not in tokenIds:
        return 'please include ' + tokzr.mask_token + ' in your input'
    maskIndex = tokenIds.index(maskId)
    logits = model(torch.tensor([tokenIds])).logits
    return tokzr.convert_ids_to_tokens(torch.topk(logits, topn, dim=2).indices[0][maskIndex])

model = AutoModelForMaskedLM.from_pretrained('bert-base-cased')
tokzr = AutoTokenizer.from_pretrained('bert-base-cased')

getTopN('This is a [MASK] test.', model, tokzr, 5)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['positive', 'simple', 'negative', 'diagnostic', 'successful']

In [44]:
print('Man is to king as woman is to', getTopN('Man is to king as woman is to [MASK].', model, tokzr, 5)) # incredible
print('Man is to doctor as woman is to', getTopN('Man is to doctor as woman is to [MASK].', model, tokzr, 5)) # why are they both man
print('Denmark is to Copenhagen as England is to', getTopN('Denmark is to Copenhagen as England is to [MASK].', model, tokzr, 5)) # yep
print('Computer is to mouse as animal is to', getTopN('Computer is to mouse as animal is to [MASK].', model, tokzr, 5)) # MAN FISH

Man is to king as woman is to ['man', 'king', 'queen', 'slave', 'rule']
Man is to doctor as woman is to ['man', 'nurse', 'cook', 'doctor', 'woman']
Denmark is to Copenhagen as England is to ['London', 'Paris', 'Copenhagen', 'Liverpool', 'Edinburgh']
Computer is to mouse as animal is to ['man', 'fish', 'mouse', 'cat', 'human']


c) Test how robust the language model is. Does it affect the word predictions if you include punctuation at the end of the sentence? What about starting with a capital letter? And do typos have a large impact?

In [47]:
# I already tested the punctuation above: it did really badly if I didn't include a period at the end (it kept predicting punctuation)
print('Man is to king as woman is to', getTopN('Man is to king as woman is to [MASK].', model, tokzr, 5))
print('man is to king as woman is to', getTopN('man is to king as woman is to [MASK].', model, tokzr, 5)) # def a difference, though it's small.

print('Man iz to kng as eoman is to', getTopN('Man iz to kng as eoman is to [MASK].', model, tokzr, 5)) # it sure is different.

Man is to king as woman is to ['man', 'king', 'queen', 'slave', 'rule']
man is to king as woman is to ['king', 'queen', 'man', 'slave', 'rule']
Man iz to kng as eoman is to ['man', 'do', 'be', 'die', 'say']


d) Think of some prompts that test whether the model has any gender biases. You can test this, for example, by using common gendered names or pronouns, swapping them, and then checking whether the predicted word changes.

In [57]:
# I feel like i already did that but sure
print('The best gender is', getTopN('The best gender is [MASK] by far.', model, tokzr, 5), 'by far.') # wow female, slay
print('Fire\'s are put out by fire-', getTopN('Fire\'s are put out by fire-[MASK].', model, tokzr, 5)) # no firemen
print(getTopN('[MASK] tend to earn less.', model, tokzr, 5), 'tend to earn less.') # Women is in there
print(getTopN('[MASK] should earn less.', model, tokzr, 5), 'should earn less.') # He and She is there.
print(getTopN('[MASK] name is Bob.', model, tokzr, 5), 'name is Bob.') # Her is #2? crazy
print(getTopN('[MASK] name is Amanda.', model, tokzr, 5), 'name is Amanda.') # His #2

The best gender is ['unknown', 'chosen', 'known', 'found', 'female'] by far.
Fire's are put out by fire- ['fighters', 'engines', 'fighting', 'ants', 'trucks']
['They', 'People', 'Women', 'they', 'We'] tend to earn less.
['You', 'I', 'He', 'She', 'We'] should earn less.
['His', 'Her', 'My', 'his', 'The'] name is Bob.
['Her', 'His', 'My', 'her', 'The'] name is Amanda.
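
The swap test described in d) can be made more systematic by filling one template with each gendered term and listing the predictions that appear for only one of them. This is a sketch: `compare_swapped` is a hypothetical helper, and `fake_predict` is a stand-in with invented outputs so that the example runs without a model.

```python
def compare_swapped(template, fillers, predict):
    """Fill `{}` in template with each filler and collect predictions.

    predict: callable mapping a sentence to a list of predicted tokens.
    Returns (all predictions per filler, predictions unique to each filler).
    """
    preds = {f: predict(template.format(f)) for f in fillers}
    unique = {}
    for f, tokens in preds.items():
        others = {t for g, ts in preds.items() if g != f for t in ts}
        unique[f] = [t for t in tokens if t not in others]
    return preds, unique

# Stand-in predictor with invented outputs (NOT real model predictions)
def fake_predict(sentence):
    if sentence.startswith('She'):
        return ['nurse', 'teacher']
    return ['doctor', 'teacher']

preds, unique = compare_swapped('{} works as a [MASK].', ['He', 'She'], fake_predict)
print(unique)  # {'He': ['doctor'], 'She': ['nurse']}
```

With the model loaded above, `predict` could be `lambda s: getTopN(s, model, tokzr, 5)`.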


## 4. Fine-tune a BERT model

We have provided code for training a BERT-based classifier, which can be found in `week5/bert/bert-topic.py`. The implementation uses Hugging Face's transformers library (https://github.com/huggingface/transformers), and simply adds a linear layer to convert the output of the CLS token from the last layer of the masked language model to a label.

a) Inspect the code. What should the shape of `output_scores` be at the end of the forward pass? What does this output represent?

b) Train the model on your own machine or on the HPC without a GPU (note that this code needs ~8 GB of RAM). How long does it take?

c) Now change the number of maximum training sentences (MAX_TRAIN_SENTS) to 500 and the batch size (BATCH_SIZE) to 32. Note that it will now take very long to train without a GPU. Train the model on the HPC, make sure you reserve a GPU to speed up the training. For more information, see http://hpc.itu.dk/scheduling/templates/gpu/ (only available on ITU network/VPN). Note that the code detects automatically whether a GPU is available. Also note that the transformers library is already installed, and can be loaded with:

```
module load PyTorch/1.7.1-foss-2020b
module load Transformers/4.2.1-foss-2020a-Python-3.8.2
``` 

(which you also have to put in the job script).

In [None]:
# you'll have to take the trust-me-bro guarantee that I actually did that and didn't just hand in the assignment that's already 5 days late