# HW 3: Part-of-Speech Tagging with HMMs + Decoding Techniques (Greedy and Viterbi)

- Detravious Jamari Brinkley
- CSCI-544: Applied Natural Language Processing
- python version: 3.11.4

---

1. Part-of-Speech (POS) Tagging [a sequence labelling task: for each word in a sentence, assign its part of speech]
2. HMMs (Hidden Markov Models) [a generative model used for POS Tagging]
    1. Generative [models the joint distribution over all variables, here the words and their tags, so it assigns a probability to every combination of their values]
    2. With POS Tagging: Given a sequence of observations (the words of a sentence), the task is to infer the most likely sequence of hidden states (POS Tags) that could have generated the observed data.
3. **Decoding Techniques:**
    1. Greedy [pick the locally best tag at each step, given the previously chosen tag; fast, but not guaranteed to find the best overall sequence]
    2. Viterbi [uses dynamic programming with backtracking to find the optimal (OPT) tag sequence over the entire search space]
4. **Notes of the data and given files:**
    - Dataset: Wall Street Journal section of the Penn Treebank
    - Folder named `data` with the following files:
        1. `train`, sentences *with* human-annotated POS Tags
        2. `dev`, sentences *with* human-annotated POS Tags
        3. `test`, sentences *without* POS Tags; these are the tags to predict
    - Format: A blank line at the end of each sentence. Each line contains 3 items separated by a tab (`\t`). These three items are:
        1. Index of the word in the sentence
        2. Word type
        3. POS Tag
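
The file format described above can be read sentence by sentence with a few lines of plain Python. This is only a sketch: `read_sentences` is a hypothetical helper, and the two-sentence sample string is made up to mimic the described layout rather than copied from the dataset.

```python
import io

def read_sentences(handle):
    """Yield one sentence at a time as a list of (word, tag) pairs.

    Assumes tab-separated lines of index, word, tag (the tag is missing in
    unlabeled files) and a blank line after each sentence.
    """
    sentence = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():              # blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        word = fields[1]
        tag = fields[2] if len(fields) > 2 else None  # None for unlabeled data
        sentence.append((word, tag))
    if sentence:                          # the file may not end with a blank line
        yield sentence

# Hypothetical sample in the described format (not real dataset lines)
sample = "1\tPierre\tNNP\n2\tVinken\tNNP\n\n1\tMr.\tNNP\n2\tVinken\tNNP\n3\tis\tVBZ\n"
sentences = list(read_sentences(io.StringIO(sample)))
print(len(sentences), sentences[0])
```

With a real file, the same generator could be fed an open file handle, e.g. `open('data/train', encoding='utf-8')`.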



In [5]:
import sys

import numpy as np
import pandas as pd

from tqdm import tqdm
from collections import defaultdict

# Load and Update Data
- [x] Find a way to separate sentences when loading the df.

In [6]:
def load_data(file_path: str, file_name: str) -> pd.DataFrame:
    """Read a tab-separated data file; blank lines (sentence breaks) are kept as NaN rows."""
    file = file_path + file_name
    return pd.read_table(file, sep="\t", names=['Index', 'Word', 'POS'], skip_blank_lines=False)

In [7]:
def update_df_columns(df: pd.DataFrame, new_columns_name: list, about: str) -> pd.DataFrame:
    """Prepend a dummy start row, rename the columns, and fill blank (sentence-break) rows with "dummy"."""

    N_columns = len(df.columns)

    if N_columns == 2:
        # Unlabeled data (no POS column): nothing to update yet
        pass

    elif N_columns == 3:
        print(about, "has 3 columns")
        # Dummy row so that the first sentence is also preceded by a "dummy" state
        dummy_row = pd.DataFrame([['0.0', ' ', 'dummy']], columns=df.columns)
        df = pd.concat([dummy_row, df], ignore_index=True)
        df.columns = new_columns_name
        # Blank lines were read as NaN rows; mark them as sentence boundaries
        df.fillna("dummy", inplace=True)

    else:
        # Raising is friendlier than sys.exit() inside a notebook kernel
        raise ValueError(f"{about}: invalid number of columns ({N_columns})")

    print("Update complete\n")

    return df

In [8]:
train_df = load_data('data/', 'train')
dev_df = load_data('data/', 'dev')
test_df = load_data('data/', 'test')

two_columns_name = ['Index', 'Word', 'POS Tag']
one_columns_name = ['Index', 'Word']

updated_train_df = update_df_columns(train_df, two_columns_name, "Train data")
updated_dev_df = update_df_columns(dev_df, two_columns_name, "Dev data")
# updated_test_df = update_df_columns(test_df, one_columns_name, "Test data")

Train data has 3 columns
Update complete

Dev data has 3 columns
Update complete



In [9]:
updated_train_df.head(33)
# updated_train_df.tail(5)
# updated_train_df

# updated_train_df[updated_train_df['POS Tag'] == "dummy"].shape

Unnamed: 0,Index,Word,POS Tag
0,0.0,,dummy
1,1.0,Pierre,NNP
2,2.0,Vinken,NNP
3,3.0,",",","
4,4.0,61,CD
5,5.0,years,NNS
6,6.0,old,JJ
7,7.0,",",","
8,8.0,will,MD
9,9.0,join,VB


# Outline of Tasks

1. Vocabulary Creation
2. Model Learning
3. Greedy Decoding with HMM
4. Viterbi Decoding with HMM


# 1. Vocabulary Creation

- **Problem:** Creating a vocabulary that can handle unknown words.
    - **Solution:** Replace rare words, those whose occurrences are less than a threshold (e.g., 3), with a special token `<unk>`

---

1. [ ] Create a vocabulary using the training data in the file train
2. [ ] Output the vocabulary into a txt file named `vocab.txt`
    - [ ] See PDF on how to properly format vocabulary file
3. [ ] Questions
    1. [ ] What is the selected threshold for unknown words replacement?
    2. [ ] What is the total size of your vocabulary?
    3. [ ] What is the total number of occurrences of the special token `<unk>` after replacement?
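
As a rough sketch of the steps above, the vocabulary can be built with `collections.Counter`. The names `build_vocab` and `write_vocab` are hypothetical, the toy word list is made up, and the `word<TAB>index<TAB>count` layout is an assumption that should be checked against the PDF. Here a word is replaced when it occurs fewer than `threshold` times, and the `<unk>` count is the summed occurrences of all replaced words.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(words, threshold=3):
    """Return vocabulary rows (word, index, count), with <unk> first.

    Words seen fewer than `threshold` times are folded into <unk>; the
    remaining words are listed by descending count.
    """
    counts = Counter(words)
    kept = [(w, c) for w, c in counts.most_common() if c >= threshold]
    # Total occurrences of all rare words, not the number of rare word types
    unk_total = sum(c for c in counts.values() if c < threshold)
    rows = [(UNK, 0, unk_total)]
    rows += [(w, i, c) for i, (w, c) in enumerate(kept, start=1)]
    return rows

def write_vocab(rows, path="vocab.txt"):
    # Assumed layout: word<TAB>index<TAB>count, one entry per line
    with open(path, "w", encoding="utf-8") as f:
        for word, index, count in rows:
            f.write(f"{word}\t{index}\t{count}\n")

# Toy word list for illustration only
toy = ["the"] * 4 + ["of"] * 3 + ["board", "board", "Vinken"]
rows = build_vocab(toy, threshold=3)
print(rows)
```

The vocabulary size is then `len(rows)`, and the answer to the `<unk>` question is `rows[0][2]`.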

In [10]:
# siddhant
# shivam

In [11]:
true_false_series = updated_train_df['Word'].value_counts()
print(true_false_series)

Word
,           46476
the         39533
dummy       38234
.           37452
of          22104
            ...  
Birthday        1
Happy           1
Bertie          1
crouched        1
Huricane        1
Name: count, Length: 43193, dtype: int64


In [12]:
vocab_df = pd.DataFrame(true_false_series)
vocab_df.reset_index(inplace = True)

In [13]:
keep_mask = vocab_df['count'] > 3

updated_vocab_df = vocab_df.loc[keep_mask]
# .copy() avoids pandas' SettingWithCopyWarning when the 'Word' column is overwritten
updated_false_vocab_df = vocab_df.loc[~keep_mask].copy()
updated_false_vocab_df['Word'] = '<unk>'

# NOTE: this is the number of distinct rare word types; the total number of
# <unk> occurrences would be updated_false_vocab_df['count'].sum()
N_updated_false_vocab_df = len(updated_false_vocab_df)
new_row = pd.DataFrame([['<unk>', N_updated_false_vocab_df]], columns=updated_vocab_df.columns)
df = pd.concat([new_row, updated_vocab_df], ignore_index=True)

# <unk> gets index 0, followed by the kept words in descending count order
df['index'] = range(len(df))
df = df.reindex(columns=['Word', 'index', 'count'])
df
# df.to_csv('vocab.txt', header=None, index=None, sep='\t')


Unnamed: 0,Word,index,count
0,<unk>,0,29443
1,",",1,46476
2,the,2,39533
3,dummy,3,38234
4,.,4,37452
...,...,...,...
13746,trafficking,13746,4
13747,7.62,13747,4
13748,gut,13748,4
13749,17.3,13749,4


In [14]:
df

Unnamed: 0,Word,index,count
0,<unk>,0,29443
1,",",1,46476
2,the,2,39533
3,dummy,3,38234
4,.,4,37452
...,...,...,...
13746,trafficking,13746,4
13747,7.62,13747,4
13748,gut,13748,4
13749,17.3,13749,4


In [15]:
# df[df[word]

# 2. Model Learning

- Learn an HMM from the training data
- **HMM Parameters:**
  <div style="text-align: center;">

    $
    \text{Transition Probability (} t \text{)}: \quad t(s' \mid s) = \frac{\text{count}(s \rightarrow s')}{\text{count}(s)}
    $

    $
    \text{Emission Probability (} e \text{)}: \quad e(x \mid s) = \frac{\text{count}(s \rightarrow x)}{\text{count}(s)}
    $

  </div>

---

1. [x] Learn a model using the training data in the file train
2. [ ] Output the learned model into a model file in JSON format, named `hmm.json`. The model file should contain two dictionaries, one for the transition parameters and one for the emission parameters.
    1. [ ] 1st dictionary: Named transition, contains items with pairs of (s, s′) as key and t(s′|s) as value.
    2. [ ] 2nd dictionary: Named emission, contains items with pairs of (s, x) as key and e(x|s) as value.
3. Question
    1. [ ] How many transition and emission parameters are in your HMM?
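
The two formulas and the `hmm.json` requirement can be sketched on a toy corpus as follows. This is an illustration under assumptions, not the required solution: `learn_hmm` and `to_json` are hypothetical names, the three toy sentences are made up, `"dummy"` is used as the sentence-boundary state to mirror the dummy rows in this notebook, and storing each tuple key as its string form is just one possible convention (JSON object keys must be strings).

```python
import json
from collections import defaultdict

START = "dummy"  # sentence-boundary state, mirroring the notebook's dummy rows

def learn_hmm(sentences):
    """Estimate t(s'|s) = count(s -> s') / count(s) and e(x|s) = count(s -> x) / count(s).

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    transition_counts = defaultdict(int)
    emission_counts = defaultdict(int)
    state_counts = defaultdict(int)

    for sentence in sentences:
        prev = START
        state_counts[START] += 1          # one boundary state per sentence
        for word, tag in sentence:
            transition_counts[(prev, tag)] += 1
            emission_counts[(tag, word)] += 1
            state_counts[tag] += 1
            prev = tag

    transition = {k: v / state_counts[k[0]] for k, v in transition_counts.items()}
    emission = {k: v / state_counts[k[0]] for k, v in emission_counts.items()}
    return transition, emission

def to_json(transition, emission):
    # Each (s, s') or (s, x) tuple is stored as its string form, e.g. "('DT', 'NN')"
    return json.dumps({
        "transition": {str(k): v for k, v in transition.items()},
        "emission": {str(k): v for k, v in emission.items()},
    }, indent=2)

# Toy corpus for illustration only
toy = [
    [("the", "DT"), ("board", "NN")],
    [("the", "DT"), ("director", "NN")],
    [("Vinken", "NNP"), ("is", "VBZ"), ("the", "DT"), ("chairman", "NN")],
]
transition, emission = learn_hmm(toy)
print(len(transition), len(emission))
print(to_json(transition, emission)[:80])
```

The number of transition and emission parameters asked for in the question corresponds to `len(transition)` and `len(emission)` in this sketch.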


In [16]:
updated_train_df.head(20)

Unnamed: 0,Index,Word,POS Tag
0,0.0,,dummy
1,1.0,Pierre,NNP
2,2.0,Vinken,NNP
3,3.0,",",","
4,4.0,61,CD
5,5.0,years,NNS
6,6.0,old,JJ
7,7.0,",",","
8,8.0,will,MD
9,9.0,join,VB


In [17]:
def hmm(df):
    """Count emissions, transitions, and state occurrences over the tagged rows."""
    transition_states = defaultdict(int)    # (s, s') -> count
    emission_state_word = defaultdict(int)  # (s, x)  -> count
    N_state = defaultdict(int)              # s       -> count

    # Previous state for the transition counts (NaN for the first row)
    df['Previous_POS Tag'] = df['POS Tag'].shift(1)

    # Iterate through every row (word) of the training data
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):

        # emission count + 1
        emission_state_word[(row["POS Tag"], row["Word"])] += 1
        # transition count + 1
        if pd.notnull(row['Previous_POS Tag']):  # the first row has no previous state
            transition_states[(row["Previous_POS Tag"], row['POS Tag'])] += 1

        # state count + 1
        N_state[row["POS Tag"]] += 1

    print(emission_state_word)
    print(transition_states)
    print(N_state)

    return emission_state_word, transition_states, N_state

In [18]:
emissions, transitions, N_states = hmm(updated_train_df)

100%|██████████| 950313/950313 [00:47<00:00, 20208.64it/s]

defaultdict(<class 'int'>, {('dummy', 'NNP'): 7563, ('NNP', 'NNP'): 33139, ('NNP', ','): 12131, (',', 'CD'): 987, ('CD', 'NNS'): 5502, ('NNS', 'JJ'): 995, ('JJ', ','): 1717, (',', 'MD'): 490, ('MD', 'VB'): 7541, ('VB', 'DT'): 5661, ('DT', 'NN'): 37299, ('NN', 'IN'): 31554, ('IN', 'DT'): 31088, ('DT', 'JJ'): 17200, ('JJ', 'NN'): 26472, ('NN', 'NNP'): 1214, ('NNP', 'CD'): 1680, ('CD', '.'): 2530, ('.', 'dummy'): 35255, ('NNP', 'VBZ'): 3434, ('VBZ', 'NN'): 751, ('IN', 'NNP'): 14091, (',', 'DT'): 6211, ('DT', 'NNP'): 8757, ('NNP', 'VBG'): 155, ('VBG', 'NN'): 1819, ('NN', '.'): 13890, ('JJ', 'CC'): 1003, ('CC', 'JJ'): 2520, (',', 'VBD'): 2396, ('VBD', 'VBN'): 2696, ('VBN', 'DT'): 1287, ('JJ', 'JJ'): 4362, ('dummy', 'DT'): 8374, ('IN', 'NN'): 10362, ('NN', 'RB'): 2317, ('RB', 'VBN'): 2389, ('VBN', 'TO'): 2081, ('TO', 'VB'): 12398, ('VB', 'NNP'): 814, ('NNP', 'NN'): 5044, ('NN', 'NNS'): 10034, ('NNS', 'VBZ'): 493, ('VBZ', 'VBN'): 3093, ('NNS', 'IN'): 13569, ('IN', 'NNS'): 5672, ('NNS', 'VBN')




In [27]:
def calculate_t_prob(transitions, N_states):
    """Transition probabilities: t(s' | s) = count(s -> s') / count(s)."""
    transition_probs = {}
    for key, value in transitions.items():  # key is (s, s'), value is count(s -> s')
        curr_state = key[0]
        # number of times s was followed by s' / number of times s was seen
        transition_probs[key] = value / N_states[curr_state]

    return transition_probs

# Calculate transition probabilities

In [26]:
t_probs = calculate_t_prob(transitions, N_states)
t_probs

{('dummy', 'NNP'): 0.19789104610393007,
 ('NNP', 'NNP'): 0.3782645420509543,
 ('NNP', ','): 0.13846908958086018,
 (',', 'CD'): 0.021234939759036144,
 ('CD', 'NNS'): 0.15775891730703062,
 ('NNS', 'JJ'): 0.017196978862406887,
 ('JJ', ','): 0.029129343105320303,
 (',', 'MD'): 0.010542168674698794,
 ('MD', 'VB'): 0.7990886934407121,
 ('VB', 'DT'): 0.22209580603397544,
 ('DT', 'NN'): 0.4734877816566169,
 ('NN', 'IN'): 0.24741637524111218,
 ('IN', 'DT'): 0.32807784039342325,
 ('DT', 'JJ'): 0.21834338305299905,
 ('JJ', 'NN'): 0.4491042345276873,
 ('NN', 'NNP'): 0.009519030219392476,
 ('NNP', 'CD'): 0.019176330928682313,
 ('CD', '.'): 0.0725427227893107,
 ('.', 'dummy'): 0.9306285141092311,
 ('NNP', 'VBZ'): 0.0391973335768423,
 ('VBZ', 'NN'): 0.035792584119721665,
 ('IN', 'NNP'): 0.14870512252263662,
 (',', 'DT'): 0.1336273666092943,
 ('DT', 'NNP'): 0.11116470961599492,
 ('NNP', 'VBG'): 0.0017692448178248561,
 ('VBG', 'NN'): 0.12677725118483413,
 ('NN', '.'): 0.10891213323505888,
 ('JJ', 'CC')

In [28]:
emissions.items()



In [29]:
def calculate_e_prob(emissions, N_states):
    """Emission probabilities: e(x | s) = count(s -> x) / count(s)."""
    emissions_probs = {}
    for key, value in emissions.items():  # key is (s, x), value is count(s -> x)
        curr_state = key[0]
        # number of times s emitted x / number of times s was seen
        emissions_probs[key] = value / N_states[curr_state]

    return emissions_probs

# Calculate emission probabilities

In [30]:
e_probs = calculate_e_prob(emissions, N_states)
e_probs

{('dummy', ' '): 2.6165681092678842e-05,
 ('NNP', 'Pierre'): 6.84868961738654e-05,
 ('NNP', 'Vinken'): 2.2828965391288468e-05,
 (',', ','): 0.9999139414802065,
 ('CD', '61'): 0.0007168253240050465,
 ('NNS', 'years'): 0.019530237301024905,
 ('JJ', 'old'): 0.003613599348534202,
 ('MD', 'will'): 0.3138709335593939,
 ('VB', 'join'): 0.0015693044058221193,
 ('DT', 'the'): 0.5016439225642653,
 ('NN', 'board'): 0.0023287907538381922,
 ('IN', 'as'): 0.0353954283543342,
 ('DT', 'a'): 0.2341478895588702,
 ('JJ', 'nonexecutive'): 0.00010179153094462541,
 ('NN', 'director'): 0.002422883309548826,
 ('NNP', 'Nov.'): 0.0026709889507807506,
 ('CD', '29'): 0.0021218029590549374,
 ('.', '.'): 0.9886228651373967,
 ('dummy', 'dummy'): 0.9999738343189073,
 ('NNP', 'Mr.'): 0.044014245274404167,
 ('VBZ', 'is'): 0.3208940997045086,
 ('NN', 'chairman'): 0.0033638088666551663,
 ('IN', 'of'): 0.23322569070685326,
 ('NNP', 'Elsevier'): 1.1414482695644234e-05,
 ('NNP', 'N.V.'): 0.00014838827504337504,
 ('NNP', 'Du

# 3. Greedy Decoding with HMM

1. [ ] Implement the greedy decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `greedy.out`, in the same format as the training data
5. [ ] Evaluate the results of the model with `eval.py` in the terminal: `python eval.py -p {predicted file} -g {gold-standard file}`
6. [ ] Question
    1. [ ] What is the accuracy on the dev data? 
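
A minimal sketch of greedy decoding, assuming transition and emission dictionaries keyed by `(s, s')` and `(s, x)` as in the model-learning section. `greedy_decode` is a hypothetical name, the probabilities below are toy values invented for illustration, and unseen words are assumed to have been mapped to `<unk>` beforehand. The start state defaults to `"dummy"` to match the boundary rows used in this notebook.

```python
def greedy_decode(words, tags, transition, emission, start="dummy"):
    """Tag `words` left to right, committing to the best tag at each step."""
    predicted = []
    prev = start
    for word in words:
        best_tag, best_score = None, -1.0
        for tag in tags:
            # score = t(tag | prev) * e(word | tag); missing pairs count as 0
            score = transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
            if score > best_score:
                best_tag, best_score = tag, score
        predicted.append(best_tag)
        prev = best_tag               # the choice is never revisited
    return predicted

# Toy parameters for illustration only
tags = ["DT", "NN", "VB"]
transition = {
    ("dummy", "DT"): 0.8, ("dummy", "NN"): 0.2,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
    ("NN", "VB"): 0.6, ("NN", "NN"): 0.4,
}
emission = {
    ("DT", "the"): 0.7,
    ("NN", "dog"): 0.4, ("NN", "barks"): 0.1,
    ("VB", "barks"): 0.5,
}
prediction = greedy_decode(["the", "dog", "barks"], tags, transition, emission)
print(prediction)
```

When every score for a word is zero, this sketch falls back to the first tag in `tags`; a real implementation would probably want smoothing or a more deliberate fallback.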

In [21]:
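import numpy as np
def calc_greedy(t, e):
    """Score one greedy step: element-wise t(s'|s) * e(x|s') over the candidate tags.

    Returns the best score only; np.argmax(g_values) would give the index of the winning tag.
    """
    g_values = np.multiply(t, e)
    return np.max(g_values)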

In [22]:
calc_greedy(t_values, e_values)

NameError: name 't_values' is not defined

# 4. Viterbi Decoding with HMM

1. [ ] Implement the Viterbi decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `viterbi.out`, in the same format as the training data
5. [ ] Question
    1. [ ] What is the accuracy on the dev data?
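
A minimal sketch of Viterbi decoding in log space, again assuming dictionaries keyed by `(s, s')` and `(s, x)`. `viterbi_decode` is a hypothetical name, and the two abstract tags `A` and `B` with their probabilities are toy values chosen so that the locally best first tag is not part of the best overall sequence.

```python
import math

def viterbi_decode(words, tags, transition, emission, start="dummy"):
    """Return the highest-probability tag sequence under the HMM."""
    if not words:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][tag]: log-probability of the best path that ends in `tag` at word i
    # back[i][tag]: the previous tag on that best path
    best = [{}]
    back = [{}]
    for tag in tags:
        best[0][tag] = log(transition.get((start, tag), 0.0)) + log(emission.get((tag, words[0]), 0.0))
        back[0][tag] = None

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in tags:
            emit = log(emission.get((tag, words[i]), 0.0))
            prev_tag = max(tags, key=lambda p: best[i - 1][p] + log(transition.get((p, tag), 0.0)))
            best[i][tag] = best[i - 1][prev_tag] + log(transition.get((prev_tag, tag), 0.0)) + emit
            back[i][tag] = prev_tag

    # Backtrack from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy parameters for illustration only
tags = ["A", "B"]
transition = {
    ("dummy", "A"): 0.6, ("dummy", "B"): 0.4,
    ("A", "A"): 0.5, ("A", "B"): 0.5,
    ("B", "A"): 0.1, ("B", "B"): 0.9,
}
emission = {("A", "x"): 0.5, ("B", "x"): 0.5, ("A", "y"): 0.1, ("B", "y"): 0.9}
path = viterbi_decode(["x", "y"], tags, transition, emission)
print(path)
```

In this toy setup the best single first step is `A` (0.6 × 0.5 = 0.30 versus 0.4 × 0.5 = 0.20 for `B`), yet the best full sequence is `B, B` (probability 0.162 versus 0.135 for `A, B`), which is the kind of case where searching the whole space pays off. Working with log-probabilities avoids numerical underflow on long sentences.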