# Home Assignment 3 (30pts)

Submit your solution via Ilias by 23:59 on Tuesday, November 18th. Late submissions are **not possible**.

Submit your solutions in teams of 5 students. Unless explicitly agreed otherwise in advance, submissions from teams with more or fewer members will NOT be graded (i.e., they count as failed).

**Make sure that all team members are part of the submitting group on Ilias.**

You may use the code from the exercises and basic functionalities that are explained in the official documentation of Python packages without citing them; __all other sources must be cited__. In case of plagiarism (copying solutions from other teams or from the internet), ALL team members may be expelled from the course without warning.

#### General guidelines:
* Make sure that your code is executable; any task for which the code does not run directly on our machine will be graded with 0 points.
* If you use packages that are not available on the default or conda-forge channel, list them below. Also add a link to installation instructions. 
* Ensure that the notebook does not rely on the current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that are not in scope anymore.
  * Do not rename any of the datasets you use, and load them from the same directory that your ipynb-notebook is located in, i.e., your working directory.
* Make sure you clean up your code before submission, e.g., properly align your code, and delete every line of code that you do not need anymore, even if you may have experimented with it. Minimize usage of global variables. Avoid reusing variable names multiple times!
* Ensure your code/notebook terminates in reasonable time.
* Feel free to use comments in the code. While we do not require them to get full marks, they may help us in case your code has minor errors.
* For questions that require a textual answer, please do not write the answer as a comment in a code cell, but in a Markdown cell below the code. Always remember to provide sufficient justification for all answers.
* You may create as many additional cells as you want, just make sure that the solutions to the individual tasks can be found near the corresponding assignment.
* If you have any general question regarding the understanding of some task, do not hesitate to post in the student forum in Ilias, so we can clear up such questions for all students in the course.

Additional packages (if any):
 - Example: `powerlaw`, https://github.com/jeffalstott/powerlaw

In [None]:
from typing import List, Union, Dict, Set, Tuple
from numpy.typing import NDArray
import nltk
import numpy as np

### Task 1: POS tagging (6 points)

In this task, we want to explore sentences with similar part of speech (POS) tag structure. For this, we need a corpus of text with tags. We will generate such a corpus by using NLTK's currently recommended POS tagger to tag a given list of tokens (https://www.nltk.org/api/nltk.tag.html).

In [None]:
# NLTK's off-the-shelf POS tagger
nltk.download('averaged_perceptron_tagger_eng')
from nltk import pos_tag

__a)__ Given a corpus of text ``corpus`` as a sequence of tokens, we want to collect all words that are tagged with a certain POS tag. Implement a function ``collect_words_for_tag`` that first tags the given corpus using NLTK's off-the-shelf tagger imported in the cell above. Then, for each POS tag, collect all words that were tagged with it. You should return a dictionary that maps each POS tag that was observed to the set of words that were assigned this tag in the given corpus. __(2 pts)__
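For orientation, `pos_tag` returns a list of `(token, tag)` pairs. The following is a minimal sketch of how such pairs could be grouped by tag. The `tagged` list is hand-written for illustration (it is not real tagger output), and `group_by_tag` is a hypothetical helper name, not part of the assignment.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def group_by_tag(tagged: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (token, tag) pairs into a mapping from tag to the set of tokens."""
    words_for_tag: Dict[str, Set[str]] = defaultdict(set)
    for token, tag in tagged:
        words_for_tag[tag].add(token)
    return dict(words_for_tag)


# Hand-written pairs in the (token, tag) format that pos_tag produces.
tagged = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")]
print(group_by_tag(tagged))
```

Because the values are sets, a word that receives the same tag several times is stored only once.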

In [None]:
from nltk.corpus.reader.util import StreamBackedCorpusView 

def collect_words_for_tag(corpus: Union[List[str], StreamBackedCorpusView]) -> Dict[str, Set[str]]:
    '''
    :param corpus: sequence of tokens that represents the text corpus
    :return: dict that maps each tag to a set of tokens that were assigned this tag in the corpus
    '''
    # your code here
    return 

__b)__ Implement a function ``generate_sentences`` (the stub in the cell below is named ``generate_rand``) that gets a sentence and a POS dictionary (assume the POS dictionary was generated by your function in __a)__) as input and generates ``n`` sequences of words with the same tag structure. Each word in a generated sequence should be drawn at random from the set of words associated with the tag at that position.

Additionally, the user should have the option to achieve sentences of ``better_quality``. Thus, if ``better_quality=True``, make sure that the tag structure of the output sentences actually matches the tag structure of the input sentence, as the tags may change depending on the context. 
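One possible way to read the ``better_quality`` requirement is to re-tag each candidate and compare the resulting tags with the target pattern. The sketch below illustrates only that comparison. `has_tag_structure` and `toy_tagger` are hypothetical names, and the toy tagger is a stand-in so the example runs without NLTK data; in the notebook you would pass `pos_tag` instead.

```python
from typing import Callable, List, Sequence, Tuple

Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def has_tag_structure(candidate: List[str], target_tags: Sequence[str], tagger: Tagger) -> bool:
    """Return True if re-tagging the candidate yields exactly the target tags."""
    candidate_tags = [tag for _, tag in tagger(candidate)]
    return candidate_tags == list(target_tags)


# Toy stand-in tagger for illustration only (assumption, not a real tagger).
def toy_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    lookup = {"this": "DT", "test": "NN", "is": "VBZ"}
    return [(tok, lookup.get(tok.lower(), "NN")) for tok in tokens]


print(has_tag_structure(["This", "test", "is"], ["DT", "NN", "VBZ"], toy_tagger))  # True
print(has_tag_structure(["is", "This", "test"], ["DT", "NN", "VBZ"], toy_tagger))  # False
```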

You can assume that the training corpus is large enough to include all possible POS tags. __(2 pts)__

_Hint: consider the_ ``random`` _module_
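As a small illustration of the hint: `random.choice` expects a sequence, so a set has to be converted first. The sketch below draws one word per tag from a toy dictionary. `sample_words` and `toy_pos_dict` are hypothetical names, and the seed is only there to make the illustration repeatable.

```python
import random
from typing import Dict, List, Set


def sample_words(tags: List[str], pos_dict: Dict[str, Set[str]], rng: random.Random) -> List[str]:
    """Pick one random word per tag; each set is sorted into a list before sampling."""
    return [rng.choice(sorted(pos_dict[tag])) for tag in tags]


# Toy dictionary for illustration only.
toy_pos_dict = {"DT": {"the", "this"}, "NN": {"test", "dog"}, "VBZ": {"is", "runs"}}
rng = random.Random(0)
print(sample_words(["DT", "NN", "VBZ"], toy_pos_dict, rng))
```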

In [None]:
def generate_rand(sentence: List[str], pos_dict: Dict[str, Set[str]], n: int, better_quality: bool=False) -> List[List[str]]:
    '''
    :param sentence: input sentence that sets the tag pattern
    :param pos_dict: maps each tag to a set of associated words
    :param n: number of sentences that should be generated
    :param better_quality: if True, make sure the tag structure of each output sentence matches that of the input sentence
    :return: List of sentences with the same tag structure as the input sentence
    '''
    # your code here
    return

__c)__ Using the input sentence ``This test is very difficult``, test your implementation to generate 10 sentences based on  

* "Emma" by Jane Austen

* The "King James Bible"

Store your POS dictionaries in ``emma_tags`` and ``bible_tags``, respectively. Your generated sentences should be stored in ``emma_sent`` and ``bible_sent``. __(2 pts)__

In [None]:
sent = ["This", "test", "is", "very", "difficult"]

In [None]:
# your code here

### Task 2: The Viterbi algorithm (11 points)
Implement the Viterbi algorithm as introduced in the lecture and the exercise. The input of your function is a sentence that should be tagged, a dictionary with state transition probabilities and a dictionary with word emission probabilities. You may assume that the _transition probabilities_ are complete, i.e., the dictionary includes every combination of states. In contrast, we assume that all combinations of words and POS tags that are not in the dictionary of _emission probabilities_ have an emission probability of 0.

The function should return a list of POS tags such that each tag corresponds to a word of the input sentence. Moreover, return the probability of the sequence of POS tags that you found.

You can test your function on the given example that was discussed in the Pen&Paper exercise. For the sentence ``the fans watch the race`` and the provided probabilities, your function should return the POS tag sequence ``['DT', 'N', 'V', 'DT', 'N']`` and a probability of ``9.720000000000002e-06``.
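To see where the expected probability comes from, the sketch below scores one fixed tag sequence by multiplying transition and emission probabilities along the path. This is only a check for a given path, not the Viterbi search itself; `path_probability` is a hypothetical helper name, and the dictionaries repeat the values provided with this task.

```python
from typing import Dict, List, Tuple


def path_probability(sentence: List[str], tags: List[str],
                     trans_prob: Dict[Tuple[str, str], float],
                     emiss_prob: Dict[Tuple[str, str], float]) -> float:
    """Multiply transition and emission probabilities along one tag path."""
    prob = 1.0
    prev = "<s>"
    for word, tag in zip(sentence, tags):
        # Missing (word, tag) pairs count as emission probability 0.
        prob *= trans_prob[(prev, tag)] * emiss_prob.get((word, tag), 0.0)
        prev = tag
    return prob


trans = {('<s>', 'DT'): 0.8, ('<s>', 'N'): 0.2, ('<s>', 'V'): 0.0,
         ('DT', 'DT'): 0.0, ('DT', 'N'): 0.9, ('DT', 'V'): 0.1,
         ('N', 'DT'): 0.0, ('N', 'N'): 0.5, ('N', 'V'): 0.5,
         ('V', 'DT'): 0.5, ('V', 'N'): 0.5, ('V', 'V'): 0.0}
emiss = {('the', 'DT'): 0.2, ('fans', 'N'): 0.1, ('fans', 'V'): 0.2, ('watch', 'N'): 0.3,
         ('watch', 'V'): 0.15, ('race', 'N'): 0.1, ('race', 'V'): 0.3}

words = ["the", "fans", "watch", "the", "race"]
# Approximately 9.72e-06, up to floating-point rounding.
print(path_probability(words, ['DT', 'N', 'V', 'DT', 'N'], trans, emiss))
```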

Additionally, implement beam search in the Viterbi algorithm. The beam size is defined by the parameter `beam`. For example, for `beam=2` we only keep the best 2 scores per column in each step and discard the rest. You may use the example from the lecture to test your implementation.
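The pruning step for a single column could look roughly like the sketch below. `prune_column` is a hypothetical helper name, the column is assumed to be stored as a dict from tag to score, and how ties are broken is an assumption that the task does not specify. The example scores are the column for the word ``fans`` in the test sentence of this task.

```python
import heapq
from typing import Dict


def prune_column(scores: Dict[str, float], beam: int = 0) -> Dict[str, float]:
    """Keep only the `beam` highest-scoring states; beam=0 keeps all states."""
    if beam <= 0 or beam >= len(scores):
        return dict(scores)
    best = heapq.nlargest(beam, scores.items(), key=lambda item: item[1])
    return dict(best)


column = {"DT": 0.0, "N": 0.0144, "V": 0.0032}
print(prune_column(column, beam=2))  # {'N': 0.0144, 'V': 0.0032}
```

States that are pruned can no longer serve as predecessors in the next column, which is what makes beam search an approximation of the full algorithm.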

In [3]:
# test sentence
sentence = ["the", "fans", "watch", "the", "race"]

# state transition probabilities (complete)
state_trans_prob = {('<s>','DT'):0.8,('<s>','N'):0.2,('<s>','V'):0.0,
                    ('DT','DT'):0.0,('DT','N'):0.9,('DT','V'):0.1,
                    ('N','DT'):0.0,('N','N'):0.5,('N','V'):0.5,
                    ('V','DT'):0.5,('V','N'):0.5,('V','V'):0.0}

# word emission probabilities (not complete, all combinations that are not present have probability 0)
word_emission_prob = {('the','DT'):0.2, ('fans','N'):0.1,('fans','V'):0.2,('watch','N'):0.3,
                      ('watch','V'):0.15,('race','N'):0.1,('race','V'):0.3}

In [None]:
def Viterbi(sentence: List[str], trans_prob: Dict[Tuple[str,str], float], emiss_prob: Dict[Tuple[str,str], float], beam: int=0) -> Tuple[List[str], float]:
    '''
    :param sentence: sentence that we want to tag
    :param trans_prob: dict with state transition probabilities
    :param emiss_prob: dict with word emission probabilities
    :param beam: beam size for beam search. If 0, don't apply beam search
    :returns: 
        - list with POS tags for each input word
        - float that indicates the probability of the tag sequence
    '''
    # your code here
    return 

### Task 3: Term Frequency - Inverse Document Frequency (13 points)

In this task we want to use the term frequency - inverse document frequency (tf-idf) weighting method to compare documents with each other and to queries. 

In case you need to tokenize any sentences in the following tasks, please use a tokenizer from NLTK and not the ``str.split`` method.

__a)__ To test your implementation throughout this task, you are given an example from the exercise. Start by implementing a function ``process_docs`` that takes the provided dictionary of documents and returns the following data structures. __(4 pts)__

- ``word2index``: a dictionary that maps each word that appears in any document to a unique integer identifier starting at 0 
- ``doc2index``: a dictionary that maps each document name (here given as the dictionary keys) to a unique integer identifier starting at 0
- ``index2doc``: a dictionary that maps each document identifier to the corresponding document name (reverse to ``doc2index``)
- ``doc_word_vectors``: a dictionary that maps each document name to a list of word ids that indicate which words appeared in the document in their order of appearance. Words that appear multiple times must also be included multiple times.
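One way to hand out ids in order of first appearance is `dict.setdefault`, as in the sketch below. It assumes the documents have already been tokenized (with an NLTK tokenizer, as requested above); `assign_ids` is a hypothetical helper name.

```python
from typing import Dict, List


def assign_ids(tokens: List[str], word2index: Dict[str, int]) -> List[int]:
    """Map tokens to ids, giving each previously unseen word the next free id."""
    return [word2index.setdefault(token, len(word2index)) for token in tokens]


word2index: Dict[str, int] = {}
print(assign_ids(["cold", "beer", "beach"], word2index))        # [0, 1, 2]
print(assign_ids(["ice", "cream", "beer", "beer"], word2index))  # [3, 4, 1, 1]
print(word2index)
```

The same dictionary is passed to every call, so a word keeps its id across documents.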

In [None]:
# example from exercise 8
d1 = "cold beer beach"
d2 = "ice cream beer beer"
d3 = "beach cold ice cream"
d4 = "cold beer frozen yogurt frozen beer"
d5 = "frozen ice ice beer ice cream"
d6 = "yogurt ice cream ice cream"

docs = {"d1": d1, "d2": d2, "d3": d3, "d4": d4, "d5": d5, "d6": d6}

In [None]:
def process_docs(docs: Dict[str, str]) -> Tuple[Dict[str, int], Dict[str, int], Dict[int, str], Dict[str, List[int]]]:
    """
    :param docs: dict that maps each document name to the document content
    :returns:
        - word2index: dict that maps each word to a unique id
        - doc2index: dict that maps each document name to a unique id
        - index2doc: dict that maps ids to their associated document name
        - doc_word_vectors: dict that maps each document name to a list of word ids that appear in it
    """
    # your code here
    return      

In [None]:
# The output for the provided example could look like this:

# word2index:
# {'cold': 0, 'beer': 1, 'beach': 2, 'ice': 3, 'cream': 4, 'frozen': 5, 'yogurt': 6}

# doc2index:
# {'d1': 0, 'd2': 1, 'd3': 2, 'd4': 3, 'd5': 4, 'd6': 5}

# index2doc
# {0: 'd1', 1: 'd2', 2: 'd3', 3: 'd4', 4: 'd5', 5: 'd6'}

# doc_word_vectors:
# {'d1': [0, 1, 2],
#  'd2': [3, 4, 1, 1],
#  'd3': [2, 0, 3, 4],
#  'd4': [0, 1, 5, 6, 5, 1],
#  'd5': [5, 3, 3, 1, 3, 4],
#  'd6': [6, 3, 4, 3, 4]}

__b)__ Set up a term-document matrix where each column corresponds to a document and each row corresponds to a word that was observed in any of the documents. The row/column indices should correspond to the word/document ids that are set in the input dicts ``word2index`` and ``doc2index``. Count how often each word appears in each document and fill the term-document matrix. __(3 pts)__

_Example: The word "beer" with the word id_ ``1`` _appears two times in the document "d4" that has the document id_ ``3``. _Therefore the entry at position_ ``[1, 3]`` _in the term-document matrix is_ ``2``.
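The example above can be reproduced with a few lines of NumPy. The sketch below fills only the column for "d4", using the word ids from the sample output of __a)__; the variable names are chosen for illustration.

```python
import numpy as np

n_words, n_docs = 7, 6
td_matrix = np.zeros((n_words, n_docs))

d4_word_ids = [0, 1, 5, 6, 5, 1]  # "cold beer frozen yogurt frozen beer"
d4_id = 3

# Each occurrence of a word increments the cell [word id, document id].
for word_id in d4_word_ids:
    td_matrix[word_id, d4_id] += 1

print(td_matrix[1, 3])  # 2.0
```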

In [None]:
def term_document_matrix(doc_word_v: Dict[str, List[int]], doc2index: Dict[str, int], word2index: Dict[str, int]) -> NDArray[NDArray[float]]:
    """
    :param doc_word_v: dict that maps each document to the list of word ids that appear in it
    :param doc2index: dict that maps each document name to a unique id
    :param word2index: dict that maps each word to a unique id
    :return: term-document matrix (each word is a row, each document is a column) that indicates the count of each word in each document 
    """
    # your code here
    return 

__c)__ Implement the function ``to_tf_idf_matrix`` that takes a term-document matrix and returns the corresponding term frequency (tf) matrix. If the parameter ``idf`` is set to ``True``, the tf-matrix should further be transformed to a tf-idf matrix (i.e. every entry corresponds to the tf-idf value of the associated word and document). Your implementation should leave the input term-document matrix unchanged. __(3 pts)__

Use the following formulas:

\begin{equation}
  tf_{t,d} =
    \begin{cases}
      1+\log_{10}\text{count}(t,d) & \text{if count}(t, d) > 0\\
      0 & \text{otherwise}
    \end{cases}       
\end{equation}  

\begin{equation}
  idf_t = \log_{10}\left(\frac{N}{df_t}\right)
\end{equation}

\begin{equation}
  tf\text{-}idf_{t,d} = tf_{t,d} \cdot idf_t
\end{equation}
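As a worked instance of the three formulas, consider the word "beer" in document "d4" of the running example: it occurs twice in d4 and appears in 4 of the 6 documents (d1, d2, d4, d5). The sketch below evaluates the formulas for this single entry with scalar helpers; `tf` and `idf` are names chosen for illustration, and a full solution would operate on the whole matrix.

```python
import math


def tf(count: int) -> float:
    """Log-scaled term frequency; 0 for words that do not occur."""
    return 1 + math.log10(count) if count > 0 else 0.0


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency for a word that occurs in doc_freq documents."""
    return math.log10(n_docs / doc_freq)


tf_beer_d4 = tf(2)     # 1 + log10(2)
idf_beer = idf(6, 4)   # log10(6 / 4)
print(tf_beer_d4 * idf_beer)  # roughly 0.229
```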

In [None]:
def to_tf_idf_matrix(td_matrix: NDArray[NDArray[float]], idf: bool=True) -> NDArray[NDArray[float]]:
    """
    :param td_matrix: term-document matrix 
    :param idf: computes the tf-idf matrix if True, otherwise computes only the tf matrix
    :return: matrix with tf(-idf) values for each word-document pair 
    """
    # your code here
    return 

__d)__ We now want to test the implementation on our running example. First, print the tf-idf for each word of the query ``ice beer`` with respect to each document. Second, find the two most similar documents from ``d1, d2, d3`` according to cosine similarity and print all similarity values.  __(3 pts)__
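For reference, cosine similarity between two vectors is their dot product divided by the product of their Euclidean norms. A minimal NumPy sketch follows; `cosine_similarity` is a name chosen for illustration, and returning 0 for an all-zero vector is an assumption, since the task does not say how to handle that case.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros (assumption)."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0:
        return 0.0
    return float(np.dot(u, v) / norm_product)


# Toy vectors for illustration only.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # about 0.707
```

For the task, the vectors to compare would be columns of the tf-idf matrix.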

In [None]:
# your code here