University of Zagreb\
Faculty of Electrical Engineering and Computing

## Text Analysis and Retrieval 2023/2024
https://www.fer.unizg.hr/en/course/taar

------------------------------

# LAB 3: Neural NLP

*Version: 1.2*

© 2024 Josip Jukić, Jan Šnajder

Submission deadline: **April 21, 2024, 23:59 CET** 

------------------------------

### Instructions

Welcome, visitor! This lab assignment is structured into three segments. Your primary objective is to complete the missing code sections, marked by the "YOUR CODE HERE" placeholder, and then evaluate the cells.

For each part of the assignment, a series of tests is available for you to run. These tests are designed to guide you by showing the expected output format. Additionally, after you submit your assignment, further tests will be conducted. Please note that variations in library versions might cause slight differences in your results. However, there's no need for concern, as your submitted work will be evaluated in a controlled environment.


### Submission rules
By submitting the exercise, you confirm the following points:
1. You did not receive help from another person when solving the exercise;
2. You attributed parts of the code that were taken from the Internet by referencing them in comments;
3. You did not use parts of the code from the Internet that are specific to the laboratory exercise;
4. You have not used AI assistants for coding such as GitHub Copilot (including generative AI tools such as ChatGPT).

**Violation of any of the above rules is considered a misdemeanor and results in academic sanctions.**

## Tasks

### 1. Machine Translation

Machine Translation (MT) involves the automatic conversion of text from one natural language to another, ensuring the original meaning is retained while generating coherent output in the target language. As one of the earliest areas of artificial intelligence research, machine translation has seen remarkable enhancements in quality, particularly with the adoption of large-scale empirical methods.

In this lab assignment, we'll employ pre-trained sequence-to-sequence models for our machine translation tasks. We'll also explore two strategies for text generation: the greedy decoder and beam search.

#### (a)

Assuming that we possess a pre-trained model, we still need to figure out how exactly we will generate tokens. One of the most straightforward approaches is to retrieve the most probable token in each step.
Implement `greedy_decoder` for language generation. The greedy method retrieves the index of the most probable token for each timestep in a sequence.
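
As a toy illustration of the idea, the sketch below picks the most probable token at each timestep and maps the indices back to words. The vocabulary and the probabilities are made up for this example and are not part of the assignment.

```python
import numpy as np

# Hypothetical five-word vocabulary, used only for this illustration.
vocab = ["the", "cat", "sat", "down", "<eos>"]

# Each row holds the token probabilities for one timestep.
probs = np.array(
    [
        [0.6, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.5, 0.2, 0.1, 0.1],
        [0.1, 0.1, 0.6, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.2, 0.5],
    ]
)

token_ids = np.argmax(probs, axis=1)  # index of the most probable token per row
tokens = [vocab[i] for i in token_ids]
print(tokens)  # ['the', 'cat', 'sat', '<eos>']
```

Each decision is made locally, one row at a time, which is what makes the approach fast.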

In [23]:
import numpy as np


def greedy_decoder(array):
    """
    Retrieves a 1D numpy array containing the index of the most
    probable token for each row.

    Arguments:
        array: 2D np.array of token probabilities, where each row corresponds to
        a certain timestep. For example, a 10x5 array represents a sequence of 10
        words over a vocabulary of 5 words.
    """
    # YOUR CODE HERE
    return np.argmax(array, axis=1)

In [24]:
ex1a1 = np.array(
    [
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.2, 0.3, 0.5, 0.4],
        [0.2, 0.1, 0.3, 0.4, 0.5],
        [0.3, 0.4, 0.5, 0.2, 0.1],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.5, 0.3, 0.4, 0.2],
        [0.5, 0.4, 0.3, 0.2, 0.1],
    ]
)

sol1a1 = np.array([4, 0, 4, 0, 3, 4, 2, 0, 1, 0])

assert (greedy_decoder(ex1a1) == sol1a1).all()

#### (b)

The greedy approach is very fast, but the quality of the final output sequences may be far from optimal. Instead of greedily choosing the most likely next step as the sequence is constructed, beam search expands all possible next steps and keeps the `k` most likely, where `k` is a parameter that controls the number of beams, or parallel searches, through the sequence of probabilities.

The local beam search algorithm keeps track of `k` states rather than just one. It begins with `k` randomly generated states. At each step, all the successors of all `k` states are generated. If any one of them is a goal, the algorithm halts. Otherwise, it selects the `k` best successors from the complete list and repeats. This is illustrated in the figure below.

Implement `beam_search_decoder`.

![Beam search](img/beamsearch.jpg)
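
The expand-and-prune step described above can be sketched on a tiny example. This is a minimal illustration and not a reference solution: the helper name `expand_beams` and the 2x3 probability table are assumptions made for this example, ties are not handled, and the search starts from the empty sequence rather than from random states.

```python
import numpy as np


def expand_beams(beams, probs, k):
    """One step: extend every beam by every token, then keep the k best."""
    candidates = [
        (seq + [j], score - np.log(p))  # add the negative log-probability
        for seq, score in beams
        for j, p in enumerate(probs)
    ]
    candidates.sort(key=lambda c: c[1])  # lower score = more probable
    return candidates[:k]


# Made-up table: 2 timesteps over a vocabulary of 3 tokens.
table = np.array([[0.1, 0.6, 0.3], [0.7, 0.2, 0.1]])

beams = [([], 0.0)]  # start from the empty sequence with score 0
for row in table:
    beams = expand_beams(beams, row, k=2)
    print([seq for seq, _ in beams])
# [[1], [2]]
# [[1, 0], [2, 0]]
```

With `k=2`, the second-best first token survives the first step, so the final list contains an alternative that a greedy decoder would have discarded immediately.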

In [25]:
import numpy as np

def beam_search_decoder(data, k):
    """
    Retrieves top k sequences according to the beam search algorithm. The sequences
    must be sorted in ascending order according to the sequence score.
    If the sequence scores are the same, then use `str(numpy_array)` as the secondary sort key to break ties,
    where `numpy_array` is the corresponding probability sequence stored in a numpy array.

    Scores for sequences are computed as the sum of negative logarithms. For example, the
    scores for a sequence with probabilities [0.1, 0.3, 0.2] would be -log(0.1) - log(0.3) - log(0.2).

    The result is returned as a sorted list of lists, where each list represents
    one instance of the beam search output. For example, the list [[0, 1], [1, 2], [2, 3]] is
    the result for k=3, where the first list [0, 1] (sequence of words from vocabulary at
    position 0 and 1) is the output with the lowest score, [1, 2]
    has the second lowest, and [2, 3] has the third lowest score.


    Arguments:
        data: 2D np.array of token probabilities, where each row corresponds to
              a certain timestep. For example, a 10x5 array represents a sequence of 10
              words over a vocabulary of 5 words.
        k: number of beams
    """

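    # YOUR CODE HERE
    # Each beam is a pair [token indices, score]; the empty sequence scores 0.
    seqs = [[[], 0.0]]
    for row in data:
        candidates = []
        for seq, score in seqs:
            for j, prob in enumerate(row):
                # Scores are sums of negative log-probabilities (lower is better).
                candidates.append([seq + [j], score - np.log(prob)])
        # Keep the k lowest-scoring candidates, breaking ties on str(array).
        seqs = sorted(candidates, key=lambda c: (c[1], str(np.array(c[0]))))[:k]
    return [seq for seq, _ in seqs]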

In [26]:
ex1b1 = np.array(
    [
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.2, 0.3, 0.5, 0.4],
        [0.2, 0.1, 0.3, 0.4, 0.5],
        [0.3, 0.4, 0.5, 0.2, 0.1],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.5, 0.3, 0.4, 0.2],
        [0.5, 0.4, 0.3, 0.2, 0.1],
    ]
)

sol1b1 = np.array(
    [
        [4, 0, 4, 0, 3, 4, 2, 0, 1, 0],
        [3, 0, 4, 0, 3, 4, 2, 0, 1, 0],
        [4, 0, 3, 0, 3, 4, 2, 0, 1, 0],
    ]
)

assert (beam_search_decoder(ex1b1, 3) == sol1b1).all()
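
The scoring rule from the docstring of `beam_search_decoder` can be checked in isolation. The sketch below uses a hypothetical helper, `sequence_score`, to show that the sum of negative logarithms equals the negative logarithm of the product of the probabilities, and that it stays usable for long sequences where the raw product underflows to zero.

```python
import numpy as np


def sequence_score(probs):
    """Sum of negative log-probabilities; lower means more probable."""
    return float(-np.sum(np.log(probs)))


short_seq = [0.1, 0.3, 0.2]
print(sequence_score(short_seq))  # about 5.116, i.e. -log(0.1 * 0.3 * 0.2)

long_seq = np.full(1000, 0.01)
print(np.prod(long_seq))          # 0.0: the raw product underflows
print(sequence_score(long_seq))   # about 4605.17, still usable for ranking
```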

#### (c)

Finally, let's employ a pre-trained sequence-to-sequence Transformer from the [Hugging Face](https://huggingface.co/) `transformers` library. Specifically, we are going to use `mBART-large-50`, which can be employed in the [zero-shot](https://en.wikipedia.org/wiki/Zero-shot_learning) setup.

In [27]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


# Load the model and its corresponding tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

Take a look at the example below to see how to use the pre-trained model in the zero-shot setup. We are translating from Finnish (fi_FI) to English (en_XX).

In [28]:
fi_text = "Aamu kuluu aatellessa, päivä päätä käännellessä."
tokenizer.src_lang = "fi_FI"
encoded_ar = tokenizer(fi_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ar, max_length=50, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

'The morning goes by in bed, and the day ends in reverse.'

We can control the decoding algorithm by setting the `num_beams` parameter. If it is set to `None`, tokens will be decoded greedily. Additionally, we can control the maximum sequence length with `max_length` and early stopping with `early_stopping`. When set to `True`, `early_stopping` indicates that generation is finished when all beam hypotheses have reached the EOS (end-of-sequence) token.

Implement the `translate` function, which generalizes translation using `mbart` to any source and target language supported by the model. If in doubt, refer to the previous example. Be sure to use all of the arguments, most of which are going to be forwarded to the model's `generate` method. `batch_decode` wraps the decoded tokens into a list, so don't forget to extract the first element from the list to get the actual string.

In [29]:
def translate(
    model,
    tokenizer,
    text,
    src_lang,
    tgt_lang="en_XX",
    num_beams=None,
    max_length=50,
    early_stopping=False,
):
    """
    Translates `text` from `src_lang` to `tgt_lang` and returns it as a string.
    """
    # YOUR CODE HERE
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(
        **encoded,
        max_length=max_length,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        num_beams=num_beams,
        early_stopping=early_stopping,
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]


Translate Finnish (fi_FI) and Portuguese (pt_PT) texts to English (see the cell below) using the `translate` function. Is greedy decoding doing OK or is it better to use beam search? If beam search is better, which `k` seems to perform well?
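
One possible way to organize the comparison is to collect the translations for several beam widths side by side. The helper below is a hypothetical sketch, not part of the assignment: `compare_beam_widths` and `fake_translate` are names introduced here, and the stand-in function only exists so that the block runs without downloading the model.

```python
def compare_beam_widths(translate_fn, text, src_lang, beam_widths=(1, 2, 4, 8)):
    """Maps each beam width to the translation produced with it."""
    return {
        k: translate_fn(text=text, src_lang=src_lang, num_beams=k)
        for k in beam_widths
    }


# Stand-in for the real `translate`, so the sketch runs without the model.
def fake_translate(text, src_lang, num_beams):
    return f"[{src_lang}, k={num_beams}] {text.upper()}"


results = compare_beam_widths(fake_translate, "hei", "fi_FI", beam_widths=(1, 4))
for k, translation in results.items():
    print(k, translation)
# 1 [fi_FI, k=1] HEI
# 4 [fi_FI, k=4] HEI
```

In the notebook, `functools.partial(translate, model, tokenizer)` could be passed as `translate_fn`, assuming `translate` keeps the signature defined above.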

In [30]:
fi_text_1 = "Ahneus vie linnut taivaalta ja kalat merestä."
fi_text_2 = "Aika on rahaa, sanoi työtön kun kellonsa myi."

print(translate(model=model, tokenizer=tokenizer, text=fi_text_1, src_lang="fi_FI", num_beams=1))
print(translate(model=model, tokenizer=tokenizer, text=fi_text_1, src_lang="fi_FI", num_beams=8))

#print(translate(model=model, tokenizer=tokenizer, text=fi_text_2, src_lang="fi_FI"))

pt_text_1 = "Mais vale um pássaro na mão do que dois voando."
pt_text_2 = "Diz-me com quem andas e eu te direi quem és."

#print(translate(model=model, tokenizer=tokenizer, text=pt_text_1, src_lang="pt_PT"))
#print(translate(model=model, tokenizer=tokenizer, text=pt_text_2, src_lang="pt_PT"))



Anger takes birds out of heaven and fish out of the sea.
Greed takes birds out of heaven and fish out of the sea.


In [31]:
ex1c1 = "Anfangen ist leicht, Beharren ist Kunst"
assert translate(model, tokenizer, ex1c1, "de_DE") == "Starting is easy, barking is art"

