# Introduction to Natural Language Processing: Assignment 5

In this exercise, we'll practice implementing the Viterbi algorithm for hidden Markov models (HMMs).

- You can use built-in Python packages, spaCy, scikit-learn, NumPy and pandas.
- Please comment your code.
- Submissions are due on Tuesdays at 23:59 and are accepted **only** on eCampus: **Assignments >> Student Submissions >> Assignment 5 (Deadline: 16.01.2024, at 23:59)**

- Name the file appropriately: "Assignment_5_\<Your_Name\>.ipynb", and submit only the Jupyter Notebook file.
- Please use relative paths; your code should work on my computer if the Jupyter Notebook and the data files are in the same directory.

Example: use `file_name = "emission_probabilities.csv"`; **DON'T use:** `/Users/ComputerName/Username/Documents/.../emission_probabilities.csv`

**Notes:**

1. For this exercise, please follow the lecture and exercise slides. Additionally, you can refer to the YouTube links mentioned in the slides (e.g., Lecture 8 >> Page 30) or to this one: [LINK](https://www.youtube.com/watch?v=IqXdjdOgXPM)
2. Reference for the Viterbi algorithm in Task 2: [LINK](https://web.stanford.edu/~jurafsky/slp3/A.pdf)

### Task 1 (6 points)


Using the files `transition_probabilities.csv` and `emission_probabilities.csv`, compute the probabilities of the possible POS tag sequences for the following word sequence:

| I | want | to | race |
| --|------|----|------|

Import the given files and:

**a)** Define a function (all_possible_tags) to retrieve **ALL possible tag sequences** for the given word sequence.

**Input:**  Sentence ("I want to race") and the files above.

**Output:** All possible tag sequences. >> `all_possible_tags_dict = {"1": [VB, TO, NN, PPSS], "2": [NN, TO, VB, PPSS], "3": ...}`

**b)** Define a function (all_possible_tags_probability) to compute the probability of each tag sequence and select the most probable one.

**Input:** All possible tag sequences retrieved in (a) and the files above

**Output:** The probability for each possible tag sequence as well as the most probable tag sequence with its probability. >> `all_possible_tags_probability_dict = {"1": 0.00123, "2": 0.00234, "3": ...}` and `most_probable_tag_seq_dict = {tag_sequence: [VB, TO, NN, PPSS], probability: 0.00123}`

**Notes:**
1. The keys of the output in (b) should match the keys of the output in (a).
2. The tag sequences and probabilities shown above in (a) and (b) are only random examples.
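One possible way to think about part (a) is as a Cartesian product: for each word, collect the tags that can emit it, then combine the per-word lists. The sketch below illustrates this on a toy emission table with made-up numbers; the function name `candidate_sequences` and the table are hypothetical and are not part of the assignment files.

```python
from itertools import product

# Toy emission table (hypothetical numbers): tag -> {word: P(word | tag)}
emission = {
    "DT": {"the": 0.6, "dog": 0.0},
    "NN": {"the": 0.0, "dog": 0.3},
    "VB": {"the": 0.0, "dog": 0.1},
}


def candidate_sequences(words, emission):
    """Return every tag sequence whose tags can all emit their word."""
    # For each word, keep only the tags with a non-zero emission probability.
    tags_per_word = [
        [tag for tag, probs in emission.items() if probs.get(word, 0.0) > 0]
        for word in words
    ]
    # The Cartesian product of the per-word tag lists is the set of candidates.
    return {
        str(i + 1): list(seq)
        for i, seq in enumerate(product(*tags_per_word))
    }


sequences = candidate_sequences(["the", "dog"], emission)
print(sequences)  # {'1': ['DT', 'NN'], '2': ['DT', 'VB']}
```

Filtering out zero-emission tags first keeps the number of candidates small, since any sequence containing such a tag would have probability zero anyway.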


In [3]:
# read transition_probabilities.csv and emission_probabilities.csv

import pandas as pd

transisition_probabilities = pd.read_csv("transition_probabilities.csv", sep=",", index_col=0)
emission_probabilities = pd.read_csv("emission_probabilities.csv", sep=",", index_col=0)

print(transisition_probabilities.head())
print(emission_probabilities.head())

          VB       TO       NN     PPSS
<s>   0.0190  0.00430  0.04100  0.06700
VB    0.0038  0.03500  0.04700  0.00700
TO    0.8300  0.00000  0.00047  0.00000
NN    0.0040  0.01600  0.08700  0.00450
PPSS  0.2300  0.00079  0.00120  0.00014
         I      want   to      race
VB    0.00  0.009300  0.00  0.00012
TO    0.00  0.000000  0.99  0.00000
NN    0.00  0.000054  0.00  0.00057
PPSS  0.37  0.000000  0.00  0.00000
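The printed tables can be read as nested lookups. The sketch below assumes that rows of the transition table are the previous tag and columns are the next tag (so the `<s>` row holds the start probabilities), and that rows of the emission table are tags and columns are words. The dictionaries copy only a few entries from the output above, and their names are illustrative.

```python
# Values copied from the two tables printed above (only the entries needed here).
transition = {            # transition[previous_tag][next_tag]
    "<s>": {"VB": 0.019, "TO": 0.0043, "NN": 0.041, "PPSS": 0.067},
    "PPSS": {"VB": 0.23, "TO": 0.00079, "NN": 0.0012, "PPSS": 0.00014},
}
emission = {              # emission[tag][word]
    "VB": {"I": 0.0, "want": 0.0093},
    "PPSS": {"I": 0.37, "want": 0.0},
}

# P(PPSS | <s>) * P("I" | PPSS): start the sentence with PPSS and emit "I".
first_step = transition["<s>"]["PPSS"] * emission["PPSS"]["I"]

# P(VB | PPSS) * P("want" | VB): move from PPSS to VB and emit "want".
second_step = transition["PPSS"]["VB"] * emission["VB"]["want"]

# Probability of tagging the prefix "I want" as PPSS VB under these assumptions.
prefix_probability = first_step * second_step
print(f"{prefix_probability:.3e}")  # 5.303e-05
```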


In [38]:
def all_possible_tags_rec(sentence, emission_probabilities):

    all_possible_tags_list = []
    #Here comes your code

    # get the current word
    curr_word = sentence[0]

    # get the emission probabilities of the current word
    curr_word_emission_probabilities = emission_probabilities[curr_word]

    # iterate over all tags
    for tag in curr_word_emission_probabilities.index:

        # check if the current tag has a non-zero emission probability for the current word
        if emission_probabilities.loc[tag, curr_word] > 0:

            # base case, if the sentence has only one word left
            if len(sentence) == 1:
                all_possible_tags_list.append([tag])
                continue

            # recursive case, if the sentence has more than one word left
            next_seq = all_possible_tags_rec(sentence[1:], emission_probabilities)

            # add current tag to the beginning of each sequence
            for i in range(len(next_seq)):
                next_seq[i].insert(0, tag)

            # add the sequences to the list
            all_possible_tags_list += next_seq

    return all_possible_tags_list


def all_possible_tags(sentence, emission_probabilities):

    all_possible_tags_dict = {}

    # get all possible tag sequences as a list
    all_possible_tags_list = all_possible_tags_rec(sentence, emission_probabilities)

    # convert the list to a dictionary
    for i in range(len(all_possible_tags_list)):
        all_possible_tags_dict[i] = all_possible_tags_list[i]

    return all_possible_tags_dict


def all_possible_tags_probability(sentence, all_possible_tags_dict, transisition_probabilities, emission_probabilities):

    all_possible_tags_probability_dict = {}
    most_probable_tag_seq_dict = {}

    for i, seq in all_possible_tags_dict.items():

        # every sequence starts in the start-of-sentence state <s>
        prev_tag = '<s>'

        # initialize probability of the sequence
        prob_seq = 1

        # walk through the sentence, one (word, tag) pair at a time
        for word, tag in zip(sentence, seq):

            # transition probability P(tag | prev_tag): rows are the previous tag,
            # columns are the next tag (the column labels carry a leading space)
            prob_seq *= transisition_probabilities.loc[prev_tag, ' ' + tag]

            # emission probability P(word | tag)
            prob_seq *= emission_probabilities.loc[tag, word]

            prev_tag = tag

        # add the probability of the sequence to the dictionary
        all_possible_tags_probability_dict[i] = prob_seq

    # get the most probable tag sequence
    best_key = max(all_possible_tags_probability_dict, key=all_possible_tags_probability_dict.get)
    most_probable_tag_seq_dict["tag_sequence"] = all_possible_tags_dict[best_key]
    most_probable_tag_seq_dict["probability"] = all_possible_tags_probability_dict[best_key]

    return all_possible_tags_probability_dict, most_probable_tag_seq_dict


sentence = [' I', " want", " to ", ' race']

all_possible_tags_dict = all_possible_tags(sentence, emission_probabilities)
print(all_possible_tags_dict)

all_possible_tags_probability_dict, most_probable_tag_seq_dict = all_possible_tags_probability(sentence, all_possible_tags_dict, transisition_probabilities, emission_probabilities)

# print the probabilities rounded to four significant digits
print({i: f"{p:.3e}" for i, p in all_possible_tags_probability_dict.items()})
print(most_probable_tag_seq_dict["tag_sequence"], f"{most_probable_tag_seq_dict['probability']:.3e}")

{0: ['PPSS', 'VB', 'TO', 'VB'], 1: ['PPSS', 'VB', 'TO', 'NN'], 2: ['PPSS', 'NN', 'TO', 'VB'], 3: ['PPSS', 'NN', 'TO', 'NN']}
{0: '1.830e-10', 1: '4.922e-13', 2: '2.534e-15', 3: '6.817e-18'}
['PPSS', 'VB', 'TO', 'VB'] 1.830e-10


### Task 2 (7 points)

Implement the following algorithm from scratch, using the files given for Task 1:

 ![VITERBI.png](\VITERBI.png)


You should define a function (viterbi) that:

**a)** Computes the initialization step.

**b)** Computes the recursion step.

**c)** Computes the termination step.

**Input:** Sentence ("I want to race"), transition and emission files.

**Output:** Most probable tag sequence and its probability. >> `most_probable_tag_seq_dict = {tag_sequence: [VB, TO, NN, PPSS], probability: 0.00123}`


In [None]:
def viterbi(sentence, transisition_probabilities, emission_probabilities):

    most_probable_tag_seq_dict = {}
    #Here comes your code

    return(most_probable_tag_seq_dict)
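
The three steps (initialization, recursion, termination) could be sketched roughly as follows. This is a minimal illustration with plain dictionaries rather than the CSV files; it assumes that transition rows are the previous tag and columns the next tag, and it does not model an explicit end state. The function name `viterbi_sketch` and the dictionary layout are hypothetical; the numbers are copied from the tables printed in Task 1.

```python
def viterbi_sketch(words, tags, transition, emission, start="<s>"):
    """Return (best_tag_sequence, probability) for a first-order HMM."""
    # Initialization: leave the start state and emit the first word.
    best = {t: transition[start][t] * emission[t][words[0]] for t in tags}
    backpointers = []

    # Recursion: for each later word, keep the best predecessor of each tag.
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] * transition[p][t])
            new_best[t] = best[prev] * transition[prev][t] * emission[t][word]
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)

    # Termination: pick the best final tag, then follow the backpointers.
    last = max(tags, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


tags = ["VB", "TO", "NN", "PPSS"]
transition = {  # transition[previous_tag][next_tag]
    "<s>": {"VB": 0.019, "TO": 0.0043, "NN": 0.041, "PPSS": 0.067},
    "VB": {"VB": 0.0038, "TO": 0.035, "NN": 0.047, "PPSS": 0.007},
    "TO": {"VB": 0.83, "TO": 0.0, "NN": 0.00047, "PPSS": 0.0},
    "NN": {"VB": 0.004, "TO": 0.016, "NN": 0.087, "PPSS": 0.0045},
    "PPSS": {"VB": 0.23, "TO": 0.00079, "NN": 0.0012, "PPSS": 0.00014},
}
emission = {  # emission[tag][word]
    "VB": {"I": 0.0, "want": 0.0093, "to": 0.0, "race": 0.00012},
    "TO": {"I": 0.0, "want": 0.0, "to": 0.99, "race": 0.0},
    "NN": {"I": 0.0, "want": 0.000054, "to": 0.0, "race": 0.00057},
    "PPSS": {"I": 0.37, "want": 0.0, "to": 0.0, "race": 0.0},
}

path, probability = viterbi_sketch(["I", "want", "to", "race"], tags, transition, emission)
most_probable_tag_seq_dict = {"tag_sequence": path, "probability": probability}
print(path, f"{probability:.3e}")  # ['PPSS', 'VB', 'TO', 'VB'] 1.830e-10
```

Unlike enumerating every tag sequence, this keeps only one best score per tag at each position, so the work grows linearly with the sentence length instead of exponentially.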

### Task 3 (7 points)

Given a hidden Markov model with the following probabilities:

**Transition Probabilities:**


|            | Sunny (S) | Cloudy (C) | Rainy (R) |
|------------|-----------|------------|-----------|
| Sunny (S)  | 0.7       | 0.2        | 0.1       |
| Cloudy (C) | 0.3       | 0.5        | 0.2       |
| Rainy (R)  | 0.1       | 0.4        | 0.5       |

**Emission Probabilities:**

|            | Dry (D) | Wet (W) |
|------------|---------|---------|
| Sunny (S)  | 0.9     | 0.1     |
| Cloudy (C) | 0.6     | 0.4     |
| Rainy (R)  | 0.2     | 0.8     |

**Initial Probability Distribution:**


Sunny (S): 0.4

Cloudy (C): 0.3

Rainy (R): 0.3

The model is represented in the following graph: the colored, straight arrows represent the transition probabilities, the dotted arrows represent the emission probabilities, and the arrows from the start node to the hidden states (Sunny, Cloudy, Rainy) represent the initial probabilities.

 ![HMM.png](\HMM.png)
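
As a quick sanity check, the model parameters can be written as nested dictionaries and each row tested for summing to 1. The sketch below assumes that each row of the transition table is the current state and each column the next state; the rows sum to 1 while the columns do not, which supports this reading. The helper name `is_distribution` is illustrative.

```python
import math

states = ["S", "C", "R"]

initial = {"S": 0.4, "C": 0.3, "R": 0.3}
transition = {  # transition[current_state][next_state]
    "S": {"S": 0.7, "C": 0.2, "R": 0.1},
    "C": {"S": 0.3, "C": 0.5, "R": 0.2},
    "R": {"S": 0.1, "C": 0.4, "R": 0.5},
}
emission = {  # emission[state][observation]
    "S": {"D": 0.9, "W": 0.1},
    "C": {"D": 0.6, "W": 0.4},
    "R": {"D": 0.2, "W": 0.8},
}


def is_distribution(probs):
    """True if the values sum to 1 (up to floating-point error)."""
    return math.isclose(sum(probs.values()), 1.0)


# The initial distribution and every table row should be a distribution.
checks = [is_distribution(initial)]
checks += [is_distribution(transition[s]) for s in states]
checks += [is_distribution(emission[s]) for s in states]
print(all(checks))  # True
```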


**1)** Compute the probability of observing the sequence Dry, Wet, Wet, Dry given the hidden state sequence Sunny, Cloudy, Rainy, Sunny.

**2)**  How would the hidden Markov model (HMM) be affected if the emission probabilities for the observation symbols 'Dry' and 'Wet' in the Sunny state were both 0.5? Discuss the potential implications on the model's behavior and the interpretation of the observation sequence in part 1.

**Note:**  You can type your solution directly in the following Markdown or add a handwritten image to it. Don't forget to submit the image if you choose that option.

Here comes your solution
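
One way to check a hand calculation for part 1 is the short sketch below. It assumes the usual HMM independence assumptions: the probability of the observations given the states is the product of the emission probabilities, and the probability of the state sequence is the initial probability times the chain of transitions. The function names are illustrative, and the last lines try the hypothetical change from part 2.

```python
import math

initial = {"S": 0.4, "C": 0.3, "R": 0.3}
transition = {  # transition[current_state][next_state]
    "S": {"S": 0.7, "C": 0.2, "R": 0.1},
    "C": {"S": 0.3, "C": 0.5, "R": 0.2},
    "R": {"S": 0.1, "C": 0.4, "R": 0.5},
}
emission = {  # emission[state][observation]
    "S": {"D": 0.9, "W": 0.1},
    "C": {"D": 0.6, "W": 0.4},
    "R": {"D": 0.2, "W": 0.8},
}

states = ["S", "C", "R", "S"]
observations = ["D", "W", "W", "D"]


def likelihood_given_states(states, observations, emission):
    """P(observations | states): product of the emission probabilities."""
    return math.prod(emission[s][o] for s, o in zip(states, observations))


def state_sequence_probability(states, initial, transition):
    """P(states): initial probability times the chain of transitions."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= transition[prev][nxt]
    return prob


conditional = likelihood_given_states(states, observations, emission)
prior = state_sequence_probability(states, initial, transition)
joint = conditional * prior
print(f"{conditional:.4f} {prior:.4f} {joint:.8f}")

# Part 2 scenario: Sunny emits Dry and Wet with probability 0.5 each.
emission_uniform = dict(emission, S={"D": 0.5, "W": 0.5})
conditional_uniform = likelihood_given_states(states, observations, emission_uniform)
print(f"{conditional_uniform:.4f}")
```

Under these assumptions the conditional probability is about 0.2592 (0.9 · 0.4 · 0.8 · 0.9), and it drops to about 0.08 when both Sunny emissions are 0.5, because a Sunny state then no longer favors either observation.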
