# HW 3-Part-of-Speech Tagging with HMMs + Decoding Techniques (Greedy and Viterbi)

- Detravious Jamari Brinkley
- CSCI-544: Applied Natural Language Processing
- python version: 3

---

1. Part-of-Speech (POS) Tagging [a type of sequence labelling task where of a given word, assign the part of speech]
2. HMMs (Hidden Markov Model) [a generative-based model that's used for POS Tagging]
    1. Generative-based [provides the probabilities for all possible combinations of values of variables in the set using the joint distribution]
    2. With POS Tagging: Given a sequence of observations (sentences), the task is to infer the most likely sequence of hidden states (POS Tags) that could have generated the observed data.
3. **Decoding Techniques:**
    1. Greedy [find the optimal (OPT) solution at each step]
    2. Viterbi [make use of dynammic programming to find the OPT solution with backtracking while searching the entire search space]
4. **Notes of the data and given files:**
    - Dataset: Wall Street Journal section of the Penn Treebank
    - Folder named `data` with the following files:
        1. `train`, sentences *with* human-annotated POS Tags
        2. `dev`, sentences *with* human-annotated POS Tags
        3. `test`, sentences *without* POS Tags, thus predict the POS Tags
    - Format: Blank like at the end of each sentence. Each line contains 3 items separated by the `\t`, the tab symbol. These three items are
        1. Index of the word in the sentence
        2. Word type
        3. POS Tag



In [1]:
import sys
import json

import numpy as np
import pandas as pd

from tqdm import tqdm
from collections import defaultdict

# Load and Update Data
- [x] Find a way to separate sentences when loading the df.

In [2]:
def load_data(file_path: str, file_name: str, is_test_file: bool, config_index: bool = True):
    
    if config_index == True:
        if is_test_file != True:
            file =  file_path + file_name
            open_df = pd.read_table(file, sep = "\t", names=['Index', 'Word', 'POS Tag'], skip_blank_lines=False)
        else:
            file =  file_path + file_name
            open_df = pd.read_table(file, sep = "\t", names=['Index', 'Word'], skip_blank_lines=False)
        
    return open_df

In [3]:
def update_df_rows_with_dummy(df: pd.DataFrame, new_columns_name: list) -> pd.DataFrame:  
    """Update the rows of the dataframe if blank space, fill with dummy"""  

    dummy_row = pd.DataFrame([['0.0', ' ', 'dummy']], columns=df.columns)
    df = pd.concat([dummy_row, df], ignore_index=True)
    df.columns = new_columns_name
    df.fillna("dummy", inplace=True)
   
    return df

In [4]:
train_df = load_data('data/', 'train', False)
dev_df = load_data('data/', 'dev', False)
test_df = load_data('data/', 'test', True)

In [5]:
train_dev_columns_name = ['Index', 'Word', 'POS Tag']

updated_train_df = update_df_rows_with_dummy(train_df, train_dev_columns_name)
updated_dev_df = update_df_rows_with_dummy(dev_df, train_dev_columns_name)

In [6]:
all_pos_tags = updated_train_df['POS Tag'].unique()
all_pos_tags

array(['dummy', 'NNP', ',', 'CD', 'NNS', 'JJ', 'MD', 'VB', 'DT', 'NN',
       'IN', '.', 'VBZ', 'VBG', 'CC', 'VBD', 'VBN', 'RB', 'TO', 'PRP',
       'RBR', 'WDT', 'VBP', 'RP', 'PRP$', 'JJS', 'POS', '``', 'EX', "''",
       'WP', ':', 'JJR', 'WRB', '$', 'NNPS', 'WP$', '-LRB-', '-RRB-',
       'PDT', 'RBS', 'FW', 'UH', 'SYM', 'LS', '#'], dtype=object)

# Outline of Tasks

1. Vocabulary Creation
2. Model Learning
3. Greedy Decoding with HMM
4. Viterbi Decoding with HMM


# 1. Vocabulary Creation

- **Problem:** Creating vocabulary to handle unkown words.
    - **Solution:** Replace rare words wtih whose occurrences are less than a threshold (ie: 3) with a special token `< unk >`

---

1. [x] Create a vocabulary using the training data in the file train
2. [x] Output the vocabulary into a txt file named `vocab.txt`
    - [x] See PDF on how to properly format vocabulary file
3. [x] Questions
    1. [x] What is the selected threshold for unknown words replacement? 3
    2. [x] What is the total size of your vocabulary? 13751
    3. [x] What is the total occurrences of the special token `< unk >` after replacement? 29443

In [7]:
true_false_series = updated_train_df['Word'].value_counts()
vocab_df = pd.DataFrame(true_false_series)
vocab_df.reset_index(inplace = True)
vocab_df

Unnamed: 0,Word,count
0,",",46476
1,the,39533
2,dummy,38234
3,.,37452
4,of,22104
...,...,...
43188,Birthday,1
43189,Happy,1
43190,Bertie,1
43191,crouched,1


In [8]:
def create_vocab_threshold_df(df: pd.DataFrame, word_col_name: str, count_col_name: str, threhold: int, special_token: str, save_df: bool, save_path_with_name: str):
    """For every word in df, replace with special_token if below threshold
    
    """
    true_false_series = df[count_col_name] > 3
    
    updated_vocab_df = df.loc[true_false_series == True]
    updated_false_vocab_df = df.loc[true_false_series == False]
    updated_false_vocab_df[word_col_name] = special_token
    
    N_updated_false_vocab_df = len(updated_false_vocab_df)
    
    new_row = pd.DataFrame([[special_token, N_updated_false_vocab_df]], columns=updated_vocab_df.columns)
    final_df = pd.concat([new_row, updated_vocab_df], ignore_index=True)
    N_vocab = range(0, len(updated_vocab_df)+1)
    
    final_df["index"] = N_vocab
    
    final_df = final_df.reindex(columns=[word_col_name, "index", count_col_name])
    if save_df == True:
        final_df.to_csv(save_path_with_name, header=None, index=None, sep='\t')
    
    return final_df

In [9]:
word_col_name = "Word"
count_col_name = "count"
special_token = "< unk >"
save_df = False
save_file_path_and_name = "submit/vocab.txt"
updated_vocab_df = create_vocab_threshold_df(vocab_df, word_col_name, count_col_name, 3, special_token, save_df, save_file_path_and_name)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  updated_false_vocab_df[word_col_name] = special_token


In [10]:
updated_vocab_df

Unnamed: 0,Word,index,count
0,< unk >,0,29443
1,",",1,46476
2,the,2,39533
3,dummy,3,38234
4,.,4,37452
...,...,...,...
13746,trafficking,13746,4
13747,7.62,13747,4
13748,gut,13748,4
13749,17.3,13749,4


# 2. Model Learning

- Learn an HMM from the training data
- **HMM Parameters:**
  <div style="text-align: center;">

    $
    \text{Transition Probability (} t \text{)}: \quad t(s' \mid s) = \frac{\text{count}(s \rightarrow s')}{\text{count}(s)}
    $

    $
    \text{Emission Probability (} e \text{)}: \quad e(x \mid s) = \frac{\text{count}(s \rightarrow x)}{\text{count}(s)}
    $

  </div>

---

1. [x] Learn a model using the training data in the file train
2. [x] Output the learned model into a model file in json format, named `hmm.json`. The model file should contains two dictionaries for the emission and transition parameters, respectively.
    1. [x] 1st dictionary: Named transition, contains items with pairs of (s, s′) as key and t(s′|s) as value. 
    2. [x] 2nd dictionary: Named emission, contains items with pairs of (s, x) as key and e(x|s) as value.
3. Question
    1. [x] How many transition and emission parameters in your HMM? transition = 1416. emission = 50287


In [11]:
# updated_train_df.head(20)

In [12]:
def get_counts(df, word_col_name, pos_tag_col_name, prev_pos_tag_col_name):
    """Count the transition and emission states, respectively"""
    transition_states = defaultdict(int)
    emission_state_word = defaultdict(int)
    N_state = defaultdict(int)
    
    df[prev_pos_tag_col_name] = df[pos_tag_col_name].shift(1) # previous state for trnasition probabilities

    # iterate through vocabulary
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):

        emission_state_word[(row[pos_tag_col_name], row[word_col_name])] += 1
        # transition count + 1
        if pd.notnull(row[prev_pos_tag_col_name]):  # Check if it's not NaN
            transition_states[(row[prev_pos_tag_col_name], row[pos_tag_col_name])] += 1

        # increment tag when I see it
        N_state[(row[pos_tag_col_name])] += 1

    return transition_states, emission_state_word, N_state

In [13]:
pos_tag_col_name = "POS Tag"
prev_pos_tag_col_name = 'Previous_POS Tag'
transitions, emissions, N_states = get_counts(updated_train_df, word_col_name, pos_tag_col_name, prev_pos_tag_col_name)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 950313/950313 [00:42<00:00, 22285.44it/s]


In [14]:
def calculate_prob(transitions: dict, emissions: dict, N_states: dict, prob_type: str):   
    """Calculate the transistion and emissions probabilities, respectively"""

    if prob_type == "t":
        t_or_e = transitions
    elif prob_type == "e":
        t_or_e = emissions

    store_probs = {}
    for key, value in t_or_e.items():
        
        curr_state = key[0]       
        store_probs[key] = value / N_states[curr_state]
        
    return store_probs

In [15]:
t_probs = calculate_prob(transitions, emissions, N_states, 't')
e_probs = calculate_prob(transitions, emissions, N_states, 'e')

In [16]:
list(t_probs.items())[:7]

[(('dummy', 'NNP'), 0.19789104610393007),
 (('NNP', 'NNP'), 0.3782645420509543),
 (('NNP', ','), 0.13846908958086018),
 ((',', 'CD'), 0.021234939759036144),
 (('CD', 'NNS'), 0.15775891730703062),
 (('NNS', 'JJ'), 0.017196978862406887),
 (('JJ', ','), 0.029129343105320303)]

In [17]:
list(e_probs.items())[:7]

[(('dummy', ' '), 2.6165681092678842e-05),
 (('NNP', 'Pierre'), 6.84868961738654e-05),
 (('NNP', 'Vinken'), 2.2828965391288468e-05),
 ((',', ','), 0.9999139414802065),
 (('CD', '61'), 0.0007168253240050465),
 (('NNS', 'years'), 0.019530237301024905),
 (('JJ', 'old'), 0.003613599348534202)]

In [18]:
save_path_with_name = "submit/hmm.json"

combine_t_and_e_probs = {}
combine_t_and_e_probs["transitions"] = t_probs
combine_t_and_e_probs["emissions"] = e_probs

t_e_probs_df = pd.DataFrame(combine_t_and_e_probs)
# t_e_probs_df.to_json(save_path_with_name)

# json_object = json.dumps(combine_t_and_e_probs)
# with open(save_path_with_name, 'w') as json_file:
#     json_file.write(json_object)

# 3. Greedy Decoding with HMM

1. [x] Implement the greedy decoding algorithm
2. [x] Evaluate it on the development data
3. [x] Predicting the POS Tags of the sentences in the test data
4. [x] Output the predictions in a file named `greedy.out`, in the same format of training data
5. [x] Evaluate the results of the model on `eval.py` in the terminal with `python eval.py − p {predicted file} − g {gold-standard file}`
6. [x] Question
    1. [x] What is the accuracy on the dev data? 80.99% which is not great. Need more training data to improve accuracy. Also need to learn how to write correct and efficient code.

In [19]:
# updated_dev_df.head(40)

In [206]:
def greedy_decoding(dev_df: pd.DataFrame, t_probs: dict, e_probs: dict, N_pos_tags: np.array):
    """Implement greedy decoding on the development file (words only) using the transition probability and emission probability. 
    Furthermore, don't use POS Tag of development file, thus only use POS Tag from training data.

    Parameters
    ----------
    df: `pd.DataFrame`
        Dev file

    t_probs: `py dict`
        Tranision probabilities for POS Tag given previous POS Tag

    e_probs: `py dict`
        Emission probabilities for Word given POS Tags

    N_pos_tags: `np.array`
        All POS Tags found in the training file
    
    Return
    ------
    
    """

    previous_pos_tag = "dummy"
    all_words_with_pos_tag = []
    
    
    for index, row in tqdm(dev_df.iterrows(), total=dev_df.shape[0]):
        # print("index", index, "with word", row['Word'])
        if row['POS Tag'] != "dummy":

            # Store per current word with all possible tags. Empty when at new word
            store_scores = []
            
            for N_pos_tags_idx in range(len(N_pos_tags)):
                current_pos_tag = N_pos_tags[N_pos_tags_idx]
                # print("- Current POS Tag: ", current_pos_tag)
    
                """Transition
                Pr(t_find_pos_tag | t_given_pos_tag)
                """
                t_find_pos_tag = current_pos_tag
                t_given_pos_tag = previous_pos_tag
                # print(f"--- t({t_find_pos_tag} | {t_given_pos_tag})")
                
                """Emission
                Pr(e_word | e_given_pos_tag)
                """
                e_word = row['Word']
                e_given_pos_tag = current_pos_tag
                # print(f"--- e({e_given_pos_tag} | {e_word})") # order this way to match e_probs dictionary
                
                """Transition * Emission"""
                t_key = (t_find_pos_tag, t_given_pos_tag)
                e_key = (e_given_pos_tag, e_word)
                # print(t_key in t_probs, e_key in e_probs)
    
                # IF-ELSE bc not all pairs will be found. If pair is found, use score, otherwise (pair isn't found) set score to 0.0.
                if t_key in t_probs and e_key in e_probs:
                    t = t_probs[t_key]
                    e = e_probs[e_key]
                    score = t * e
                    # print(f"---  t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
                    
                else:
                    t = 0.000001
                    e = 0.000001
                    score = t * e
                    # print(f"--- t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
                            
                store_scores.append(score)
            # print(f"Scores", store_scores)
        
                # print("Word is: ", row['Word'], "with POS Tag of", current_pos_tag)
            max_score_idx = np.argmax(np.array(store_scores)) # use argmax to get the index of max score
            
            current_pos_tag = N_pos_tags[max_score_idx] # use the index of the max score to find which POS Tag to update to
            # print("Word is: ", row['Word'], "with POS Tag of", current_pos_tag)
            # all_words_with_pos_tag[row['Word']] = current_pos_tag
            all_words_with_pos_tag.append([row['Word'], current_pos_tag])
            
            previous_pos_tag = current_pos_tag
            # print(all_words_with_pos_tag)
    
            # print("Updated POS Tag", current_pos_tag)
            # print()
        else:
            empty = ""
            all_words_with_pos_tag.append([empty, empty])
        
    
    return all_words_with_pos_tag

In [207]:
# updated_dev_df[updated_dev_df["POS Tag"] == "dummy"]

In [208]:
# list(e_probs.keys())

In [219]:
gd_output = greedy_decoding(updated_dev_df, t_probs, e_probs, all_pos_tags)

100%|███████████████████| 137295/137295 [00:20<00:00, 6651.36it/s]


In [220]:
gd_output = gd_output[1:]
# gd_output

[['The', 'DT'],
 ['Arizona', 'NNP'],
 ['Corporations', 'NNS'],
 ['Commission', 'FW'],
 ['authorized', 'VBD'],
 ['an', 'DT'],
 ['11.5', 'CD'],
 ['%', 'NN'],
 ['rate', 'NN'],
 ['increase', 'VB'],
 ['at', 'IN'],
 ['Tucson', 'NNP'],
 ['Electric', 'NNP'],
 ['Power', 'NNP'],
 ['Co.', 'NNP'],
 [',', ','],
 ['substantially', 'RB'],
 ['lower', 'RBR'],
 ['than', 'IN'],
 ['recommended', 'VBN'],
 ['last', 'VB'],
 ['month', 'NN'],
 ['by', 'IN'],
 ['a', 'SYM'],
 ['commission', 'NN'],
 ['hearing', 'VBG'],
 ['officer', 'NN'],
 ['and', 'CC'],
 ['barely', 'RB'],
 ['half', 'NN'],
 ['the', 'DT'],
 ['rise', 'VB'],
 ['sought', 'VBD'],
 ['by', 'IN'],
 ['the', 'DT'],
 ['utility', 'NN'],
 ['.', '.'],
 ['', ''],
 ['The', 'DT'],
 ['ruling', 'VBG'],
 ['follows', 'VBZ'],
 ['a', 'SYM'],
 ['host', 'NN'],
 ['of', 'IN'],
 ['problems', 'NNS'],
 ['at', 'IN'],
 ['Tucson', 'NNP'],
 ['Electric', 'NNP'],
 [',', ','],
 ['including', 'VBG'],
 ['major', 'JJ'],
 ['write-downs', 'NNS'],
 [',', ','],
 ['a', 'LS'],
 ['60', 'dummy'

In [221]:
# gd_output = gd_output[1:]

with open('greedy.out', 'w') as op:
    # # # # # # # 
    index = 1
    for idx, word in enumerate(gd_output):
        if word[0] == "":
            index = 1
            op.write("\n")
        else:
            op.write(f'{index}\t{word[0]}\t{word[1]}')
            op.write("\n")
            index += 1

In [36]:
# gd_output_df = pd.DataFrame(list(gd_output.items()), columns=['Words', 'POS Tags'])
# gd_output_df = gd_output_df.drop(0)
# # gd_output_df

# save_greedy_as = "greedy.out"
# gd_output_df.to_csv(save_greedy_as, header=None, sep='\t')

# 4. Viterbi Decoding with HMM

1. [ ] Implement the viterbi decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `viterbi.out`, in the same format of training data
5. [ ] Question
    1. [x] What is the accuracy on the dev data? 22.41% which is not great. Need more training data to improve accuracy. Also need to learn how to write correct and efficient code.

In [79]:
def viterbi_decoding(dev_df: pd.DataFrame, t_probs: dict, e_probs: dict, N_pos_tags: np.array):
    """Implement greedy decoding on the development file (words only) using the transition probability and emission probability. 
    
    
    ??? Furthermore, don't use POS Tag of development file, thus only use POS Tag from training data.

    Parameters
    ----------        
    dev_df: `pd.DataFrame`
        Dev file

    t_probs: `py dict`
        Tranision probabilities for POS Tag given previous POS Tag

    e_probs: `py dict`
        Emission probabilities for Word given POS Tags

    N_pos_tags: `np.array`
        All POS Tags found in the training file
    
    Return
    ------
    
    """

    """
    Base Cases:
    ----------
    v_pi: `py dictionary`
        Dictionary to store the length of each sentence (thus will have to reset to 0 every new sentence) and 
        all possible POS Tags (which will remain the same for each sentence). 
        
    v_pi[len_of_each_sentence, all_possible_tags] = t(all_pos_tags_in_dev_file | s) * e(words | all_pos_tags_in_dev_file)
    
    v_pi[0, all_pos_tags_in_dev_file] = t(t_find_pos_tag | t_given_pos_tag) * e(e_word | e_given_pos_tag)
    
    v_pi[0, dummy] = t(dummy | dummy) * e(The | dummy) = s1
    v_pi[0, DT] = t(DT | _) * e(The | DT) = s2
    v_pi[0, NNP] = t(NNP | _) * e(The | NNP) = s3

    v_pi_key: `py tuple`
        Tuple to store all possible combinations of keys for v_pi. 

    """

    previous_pos_tag = "dummy"
    v_pi = {}
    all_words_with_pos_tag = {}

    # start_algo_idx = 0

    """Base cases
    [ ] See all (words tags) to be 0
    [ ] Fix on the first word
    [ ] Consider using a 2D list of [word x tag]
    """
    for index, row in tqdm(dev_df.iterrows(), total=dev_df.shape[0]):
        # print("index", index, "with word", row['Word'], "and POS tag from dev is '", row['POS Tag'], "'2nd index", row["Index"])
        if row["Index"] == "dummy":
            break
        
        current_pos_tag = row['POS Tag']
        v_pi_key = (0, current_pos_tag)

        
        """Transition"""
        t_find_pos_tag = current_pos_tag
        t_given_pos_tag = previous_pos_tag
        # print(f"--- t({t_find_pos_tag} | {t_given_pos_tag})")

        """Emission"""
        e_word = row['Word']
        e_given_pos_tag = current_pos_tag
        # print(f"--- e({e_given_pos_tag} | {e_word})") # order this way to match e_probs dictionary

        """Transition * Emission"""
        t_key = (t_find_pos_tag, t_given_pos_tag)
        e_key = (e_given_pos_tag, e_word)
        # print(t_key in t_probs, e_key in e_probs)
        
        # IF-ELSE bc not all pairs will be found. If pair is found, use score, otherwise (pair isn't found) set score to 0.0.
        if t_key in t_probs and e_key in e_probs:
            if v_pi_key not in v_pi:
                t = t_probs[t_key]
                e = e_probs[e_key]
                score = t * e
                # print(f"--- FOUND 1x: pi{v_pi_key} = t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
                v_pi[v_pi_key] = score
            else: # if key in v_pi dict found again, add scores to strenghten key
                # print(f"--- FOUND AGAIN: pi{v_pi_key} = t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
                v_pi[v_pi_key] += score
            
        else:
            t = 0.001
            e = 0.001
            score = t * e
            # print(f"--- NOT FOUND: pi{v_pi_key} = t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
            v_pi[v_pi_key] = score
                        
        # print()
        all_words_with_pos_tag[row['Word']] = current_pos_tag
    # print("Fillings of 0th index for v_pi --- ", v_pi)

    # Fill v_pi with all possible options
    for i, row in dev_df.iterrows():
        for pos_tag in N_pos_tags:
            v_pi_key = (i, pos_tag)
            v_pi[v_pi_key] = 0.0
        
    # print("Fillings of remaining pairs for v_pi --- ", v_pi)

               
    """Algo"""
    s_1 = np.argmax(np.array(list(v_pi.values())))
    updated_s1 = np.array(list(v_pi.keys()))[s_1]
    max_previous_pos_tag = updated_s1[1]

    # print()
    # print(f"Base case: {v_pi}")
    # print(f"s_1 (from the base case) index is {s_1} and the key at this index is {max_previous_pos_tag}")
    # print()

    track_pi_idx = 1 # TODO: Figure out when to reset. I think go until end of/ len of previous sentence
    
    for index, row in tqdm(dev_df.iterrows(), total=dev_df.shape[0]):
        print("index", index, "with word", row['Word'], "and POS tag from dev", row['POS Tag'], row["Index"])
        
        idx_j = track_pi_idx - 1
        # print("j - 1 = ", idx_j)
        previous_v_pi_key = (idx_j, max_previous_pos_tag)
        
        if index > 38:
            if row["Index"] == "dummy":
                break
            # pass
            if row["Index"] >= 1.0:
                current_pos_tag = row['POS Tag']
                # v_pi_key = (1, current_pos_tag)
                
                v_pi_key = (track_pi_idx, current_pos_tag)
                
                # print(f"Key of v_i is {v_pi_key} bc we  j is {track_pi_idx} and j - 1 is {idx_j}")
                

                """DP Algo
                pi[number_of_words_in_sentence * all_possible_tags] = t(all_pos_tags_in_dev_file | s) * e(words | all_pos_tags_in_dev_file)
                
                pi[1, all_pos_tags_in_dev_file] = t(t_find_pos_tag | t_given_pos_tag) * e(e_word | e_given_pos_tag)

                pi[1, DT] = max(pi[1, s1] * t(DT | s1) * e(The | DT) = s1
                (1, 'DT') = max(pi[1, s1] * t(DT | CD) * e(The | DT) 

                
                pi[2, NNP] = max(pi[2, s2] * t(NNP | s2) * e(Arizona | NNP) = s2
                pi[3, NNP] = max(pi[3, s3] * t(NNP | s3) * e(Corps... | NNP) = s3
                pi[4, NNP] = max(pi[4, s4] * t(NNP | s4) * e(Commission | NNP) = s4
            
                """
        
                """Transition"""
                t_find_pos_tag = current_pos_tag
                t_given_pos_tag = max_previous_pos_tag
                # print(f"--- t({t_find_pos_tag} | {t_given_pos_tag})")
        
                """Emission"""
                e_word = row['Word']
                e_given_pos_tag = current_pos_tag
                # print(f"--- e({e_given_pos_tag} | {e_word})") # order this way to match e_probs dictionary
        
                """Transition * Emission"""
                t_key = (t_find_pos_tag, t_given_pos_tag)
                e_key = (e_given_pos_tag, e_word)
                # print(t_key in t_probs, e_key in e_probs)
                # print("---previous_v_pi_key:", previous_v_pi_key, "with max_previous_pos_tag as", max_previous_pos_tag)
                
                # IF-ELSE bc not all pairs will be found. If pair is found, use score, otherwise (pair isn't found) set score to 0.0.
                if t_key in t_probs and e_key in e_probs and previous_v_pi_key in v_pi:
                    t = t_probs[t_key]
                    e = e_probs[e_key]
                    score = v_pi[previous_v_pi_key] * t * e
                    # v_pi[v_pi_key] = v_pi[previous_v_pi_key] * t * e
                    print(f"--- pi{previous_v_pi_key} is FOUND in v_pi. Now, let's update v_pi at pi{v_pi_key}")
                    print(f"--- pi{v_pi_key} = pi({previous_v_pi_key}) * t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
                    # previous_v_pi_key = v_pi_key
             
                else:
                    t = 0.001
                    e = 0.001
                    # v_pi[v_pi_key] = 0.0
                    score = v_pi[previous_v_pi_key] * t * e
                    print(f"--- pi{previous_v_pi_key} is NOT FOUND in v_pi. Now, let's update v_pi at pi{v_pi_key}")
                    print(f"--- pi{v_pi_key} = pi({previous_v_pi_key}) * t({t_find_pos_tag} | {t_given_pos_tag}) * e({e_word} | {e_given_pos_tag}) = {score}")
                    # previous_v_pi_key = v_pi_key
                                
                track_pi_idx += 1
                v_pi[previous_v_pi_key] = score
                
                
        
                # print("UPDATE")
                s_i = np.argmax(np.array(list(v_pi.values())))
                updated_s_i = np.array(list(v_pi.keys()))[s_i]
                # print(f"max value is at index: {s_i} in {v_pi} has key of {updated_s_i}")
                max_previous_pos_tag = updated_s_i[1]
                all_words_with_pos_tag[row['Word']] = current_pos_tag
                print("max_previous_pos_tag is", max_previous_pos_tag)
                print()
            # print(f"is {previous_v_pi_key} in v_pi")
    
    # print(v_pi)

    return all_words_with_pos_tag
    # return v_pi

In [80]:
# updated_dev_df.head(40)

In [81]:
vd_output = viterbi_decoding(updated_dev_df[:50], t_probs, e_probs, all_pos_tags)

 76%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                     | 38/50 [00:00<00:00, 13040.71it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 1686.98it/s]

index 0 with word   and POS tag from dev dummy 0.0
index 1 with word The and POS tag from dev DT 1.0
index 2 with word Arizona and POS tag from dev NNP 2.0
index 3 with word Corporations and POS tag from dev NNP 3.0
index 4 with word Commission and POS tag from dev NNP 4.0
index 5 with word authorized and POS tag from dev VBD 5.0
index 6 with word an and POS tag from dev DT 6.0
index 7 with word 11.5 and POS tag from dev CD 7.0
index 8 with word % and POS tag from dev NN 8.0
index 9 with word rate and POS tag from dev NN 9.0
index 10 with word increase and POS tag from dev NN 10.0
index 11 with word at and POS tag from dev IN 11.0
index 12 with word Tucson and POS tag from dev NNP 12.0
index 13 with word Electric and POS tag from dev NNP 13.0
index 14 with word Power and POS tag from dev NNP 14.0
index 15 with word Co. and POS tag from dev NNP 15.0
index 16 with word , and POS tag from dev , 16.0
index 17 with word substantially and POS tag from dev RB 17.0
index 18 with word lower and




In [82]:
# vd_output

In [47]:
# vd_output_df = pd.DataFrame(list(vd_output.items()), columns=['Words', 'POS Tags'])
# vd_output_df = vd_output_df.drop(0)
# # vd_output_df

# save_as = "viterbi.out"
# vd_output_df.to_csv(save_as, header=None, sep='\t')