# HW 3-Part-of-Speech Tagging with HMMs + Decoding Techniques (Greedy and Viterbi)

- Detravious Jamari Brinkley
- CSCI-544: Applied Natural Language Processing
- python version: 3.11.4

---

1. Part-of-Speech (POS) Tagging [a sequence labelling task: given each word in a sentence, assign its part of speech]
2. HMMs (Hidden Markov Models) [a generative model used for POS Tagging]
    1. Generative [models the joint distribution, i.e. it assigns a probability to every combination of values of the variables in the set]
    2. With POS Tagging: Given a sequence of observations (the words of a sentence), the task is to infer the most likely sequence of hidden states (POS Tags) that could have generated the observed data.
3. **Decoding Techniques:**
    1. Greedy [pick the locally best tag at each step, given the tag chosen for the previous word; this is not guaranteed to give the best overall sequence]
    2. Viterbi [use dynamic programming with backtracking to find the optimal (OPT) tag sequence over the entire search space]
4. **Notes on the data and given files:**
    - Dataset: Wall Street Journal section of the Penn Treebank
    - Folder named `data` with the following files:
        1. `train`, sentences *with* human-annotated POS Tags
        2. `dev`, sentences *with* human-annotated POS Tags
        3. `test`, sentences *without* POS Tags; these are the tags to predict
    - Format: A blank line ends each sentence. Each non-blank line contains 3 items (2 in `test`, which has no POS Tag) separated by a tab (`\t`). These three items are
        1. Index of the word in the sentence
        2. Word type
        3. POS Tag
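
The file format described above can be read sentence by sentence. The following is a minimal sketch, not the loader used in this notebook: `read_sentences` is a hypothetical helper, and the sample lines are made up for illustration.

```python
def read_sentences(lines):
    """Group tab-separated lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the sentence collected so far
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # fields[0] is the word index; keep the word and, if present, the POS tag
        current.append(tuple(fields[1:]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input only (not copied from the data files)
sample = ["1\tPierre\tNNP\n", "2\tVinken\tNNP\n", "\n", "1\tMr.\tNNP\n", "2\tVinken\tNNP\n"]
sentences = read_sentences(sample)
print(len(sentences), sentences[0])
```

With a real file, the same function could be called as `read_sentences(open('data/train'))`. Keeping sentences separate matters later, because transitions should not be counted across a sentence boundary.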



In [1]:
# imports
import pandas as pd

# Load and Update Data

In [2]:
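def load_data(file_path: str, file_name: str, config_index: bool = True) -> pd.DataFrame:
    """Read a tab-separated data file, optionally indexing it by the word-index column"""

    file = file_path + file_name
    # The files have no header row, so pandas uses the first data line as the header;
    # update_df_columns() puts that line back as a row afterwards.
    # Note: read_table skips blank lines by default, so sentence boundaries are not kept.
    open_df = pd.read_table(file)

    if config_index:
        # '1' is the header pandas inferred from the index of the first word
        open_df = open_df.set_index('1')

    return open_df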

In [3]:
def update_df_columns(df: pd.DataFrame, new_columns_name: list, about: str) -> pd.DataFrame:  
    """Update the columns of the dataframe if first column is data needed"""  

    N_columns = len(df.columns.to_list())

    if N_columns == 1:
        print(about, "has 1 column")
        word = df.columns.to_list()[0]
        new_row = pd.DataFrame([[word]], columns=df.columns)
        df = pd.concat([new_row, df], ignore_index=True)
        df.columns = new_columns_name

    elif N_columns == 2:
        print(about, "has 2 columns")
        word = df.columns.to_list()[0]
        pos_tag = df.columns.to_list()[1]
        new_row = pd.DataFrame([[word, pos_tag]], columns=df.columns)
        df = pd.concat([new_row, df], ignore_index=True)
        df.columns = new_columns_name
        
    else:
        print(" --- Invalid number of columns ---")

    print("Update complete\n")    
    return df

In [4]:
train_df = load_data('data/', 'train')
dev_df = load_data('data/', 'dev')
test_df = load_data('data/', 'test')

two_columns_name = ['Word', 'Pos Tag']
one_columns_name = ['Word']

updated_train_df = update_df_columns(train_df, two_columns_name, "Train data")
updated_dev_df = update_df_columns(dev_df, two_columns_name, "Dev data")
updated_test_df = update_df_columns(test_df, one_columns_name, "Test data")

Train data has 2 columns
Update complete

Dev data has 2 columns
Update complete

Test data has 1 column
Update complete



In [5]:
updated_train_df.head(18)

Unnamed: 0,Word,Pos Tag
0,Pierre,NNP
1,Vinken,NNP
2,",",","
3,61,CD
4,years,NNS
5,old,JJ
6,",",","
7,will,MD
8,join,VB
9,the,DT


# Outline of Tasks

1. Vocabulary Creation
2. Model Learning
3. Greedy Decoding with HMM
4. Viterbi Decoding with HMM


# 1. Vocabulary Creation

- **Problem:** Create a vocabulary that can handle unknown words.
    - **Solution:** Replace rare words, i.e. those whose occurrences are less than a threshold (e.g. 3), with a special token `<unk>`

---

1. [ ] Create a vocabulary using the training data in the file train
2. [ ] Output the vocabulary into a txt file named `vocab.txt`
    - [ ] See PDF on how to properly format vocabulary file
3. [ ] Questions
    1. [ ] What is the selected threshold for unknown words replacement?
    2. [ ] What is the total size of your vocabulary?
    3. [ ] What is the total number of occurrences of the special token `<unk>` after replacement?
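
The replacement rule above can be sketched with a plain counter. This is a minimal sketch under stated assumptions: `build_vocab` is a hypothetical helper, the word list is made up, and the cutoff follows the "less than a threshold" wording (words with fewer than `threshold` occurrences become `<unk>`).

```python
from collections import Counter


def build_vocab(words, threshold=3):
    """Return [(word, count), ...] with <unk> first, then words by descending count."""
    counts = Counter(words)
    kept = {w: c for w, c in counts.items() if c >= threshold}
    # Total occurrences of <unk> = sum of the counts of every replaced word
    unk_total = sum(c for c in counts.values() if c < threshold)
    ordered = sorted(kept.items(), key=lambda item: -item[1])
    return [("<unk>", unk_total)] + ordered


# Illustrative word list only
words = ["the"] * 4 + ["of"] * 3 + ["join", "join", "Vinken"]
vocab = build_vocab(words, threshold=3)
print(vocab)
```

Note the difference between the number of distinct rare words (2 here) and the total occurrences of `<unk>` (3 here); the third question asks for the latter.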

In [7]:
# Count the occurrences of each word (despite the name, this Series holds counts, not booleans)
true_false_series = updated_train_df['Word'].value_counts()
print(true_false_series)

Word
,           46476
the         39533
.           37452
of          22104
to          21305
            ...  
Birthday        1
Happy           1
Bertie          1
crouched        1
Huricane        1
Name: count, Length: 43192, dtype: int64


In [8]:
vocab_df = pd.DataFrame(true_false_series)
vocab_df.reset_index(inplace = True)

In [9]:
# Keep words that occur more than 3 times; words with a count of 3 or fewer become <unk>
true_false_series = vocab_df['count'] > 3

updated_vocab_df = vocab_df.loc[true_false_series]
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
updated_false_vocab_df = vocab_df.loc[~true_false_series].copy()
updated_false_vocab_df['Word'] = '<unk>'

# NOTE: len() gives the number of distinct rare words, not the total number of
# <unk> occurrences; that total would be updated_false_vocab_df['count'].sum()
N_updated_false_vocab_df = len(updated_false_vocab_df)
new_row = pd.DataFrame([['<unk>', N_updated_false_vocab_df]], columns=updated_vocab_df.columns)

# <unk> goes first, followed by the kept words in descending order of count
df = pd.concat([new_row, updated_vocab_df], ignore_index=True)
N_vocab = range(0, len(updated_vocab_df)+1)

df['index'] = N_vocab

df = df.reindex(columns=['Word', 'index', 'count'])
df
# df.to_csv('vocab.txt', header=None, index=None, sep='\t')


Unnamed: 0,Word,index,count
0,<unk>,0,29443
1,",",1,46476
2,the,2,39533
3,.,3,37452
4,of,4,22104
...,...,...,...
13745,9.3,13745,4
13746,starters,13746,4
13747,prescribe,13747,4
13748,scammers,13748,4



# 2. Model Learning

- Learn an HMM from the training data
- **HMM Parameters:**
  <div style="text-align: center;">

    $
    \text{Transition Probability (} t \text{)}: \quad t(s' \mid s) = \frac{\text{count}(s \rightarrow s')}{\text{count}(s)}
    $

    $
    \text{Emission Probability (} e \text{)}: \quad e(x \mid s) = \frac{\text{count}(s \rightarrow x)}{\text{count}(s)}
    $

  </div>

---

1. [x] Learn a model using the training data in the file train
2. [ ] Output the learned model into a model file in JSON format, named `hmm.json`. The model file should contain two dictionaries, one for the transition parameters and one for the emission parameters.
    1. [ ] 1st dictionary: Named transition, contains items with pairs of (s, s′) as key and t(s′|s) as value. 
    2. [ ] 2nd dictionary: Named emission, contains items with pairs of (s, x) as key and e(x|s) as value.
3. Question
    1. [ ] How many transition and emission parameters are in your HMM?
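
The two formulas and the two-dictionary layout above can be sketched as follows. This is a minimal sketch, not the notebook's implementation: `learn_hmm` is a hypothetical helper, the sentence is illustrative, and the `"(s, s')"` string key format is an assumption made because JSON keys must be strings. Start-of-sentence probabilities are left out because the formulas above do not define them.

```python
import json
from collections import Counter


def learn_hmm(sentences):
    """sentences: list of sentences, each a list of (word, tag) pairs."""
    tag_counts, transition_counts, emission_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        previous = None  # reset at every sentence so no transition crosses a boundary
        for word, tag in sentence:
            tag_counts[tag] += 1
            emission_counts[(tag, word)] += 1
            if previous is not None:
                transition_counts[(previous, tag)] += 1
            previous = tag
    # t(s'|s) = count(s -> s') / count(s) and e(x|s) = count(s -> x) / count(s)
    transition = {f"({s}, {s2})": n / tag_counts[s] for (s, s2), n in transition_counts.items()}
    emission = {f"({s}, {x})": n / tag_counts[s] for (s, x), n in emission_counts.items()}
    return {"transition": transition, "emission": emission}


# Illustrative sentence only
toy = [[("Pierre", "NNP"), ("Vinken", "NNP"), ("will", "MD"), ("join", "VB")]]
model = learn_hmm(toy)
serialized = json.dumps(model, indent=2)
print(len(model["transition"]), len(model["emission"]))
```

The two lengths printed at the end are the parameter counts asked for in the question; `json.dump(model, file)` would write the same structure to `hmm.json`.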


In [11]:
states_series = updated_train_df['Pos Tag']
states_series

0         NNP
1         NNP
2           ,
3          CD
4         NNS
         ... 
912090     TO
912091    NNP
912092    NNP
912093     RB
912094      .
Name: Pos Tag, Length: 912095, dtype: object

In [51]:
N_nnp = states_series.value_counts().to_dict()['NNP']
N_nnp

87608

In [12]:
states = states_series.unique()
states

array(['NNP', ',', 'CD', 'NNS', 'JJ', 'MD', 'VB', 'DT', 'NN', 'IN', '.',
       'VBZ', 'VBG', 'CC', 'VBD', 'VBN', 'RB', 'TO', 'PRP', 'RBR', 'WDT',
       'VBP', 'RP', 'PRP$', 'JJS', 'POS', '``', 'EX', "''", 'WP', ':',
       'JJR', 'WRB', '$', 'NNPS', 'WP$', '-LRB-', '-RRB-', 'PDT', 'RBS',
       'FW', 'UH', 'SYM', 'LS', '#'], dtype=object)

In [13]:
def create_pairings(states):
    """Return every ordered pairing of states as a list of one-item dicts {s: s'}"""
    states_list = []

    for state in states:
        for state_2 in states:
            states_list.append({state: state_2})

    return states_list

In [14]:
pairing_of_states = create_pairings(states[:2])
pairing_of_states

[{'NNP': 'NNP'}, {'NNP': ','}, {',': 'NNP'}, {',': ','}]

In [15]:
def create_shifts(states_series: pd.Series, current_state: int, next_state: int):
    """Shifts the state series so that each row holds a given state s and a following state s'
    
    Parameters:
        states_series: POS tags in corpus order
        current_state: number of preceding states to include ("given_state" columns)
        next_state: number of states from the current position onward ("find_state" columns)
    
    Return:
        DataFrame with one column per shift; rows containing NaN are dropped
    """
    
    df = pd.DataFrame(states_series)
    cols = list()
    lag_col_names = []

    # input sequence (t-n, ... t-1)
    for prior_observation in range(current_state, 0, -1):
        cols.append(df.shift(prior_observation))
        lag_col_names.append("given_state")

    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, next_state):
        cols.append(df.shift(-i))
        lag_col_names.append("find_state")

    # put it all together (done once, after both loops)
    uts_sml_df = pd.concat(cols, axis=1)
    # Note: wrapping the names in a list gives the frame MultiIndex columns,
    # so selecting a column later returns a DataFrame rather than a Series
    uts_sml_df.columns = [lag_col_names]
    # drop rows with NaN values
    uts_sml_df.dropna(inplace=True)

    return uts_sml_df

In [16]:
shfited_states_df = create_shifts(states_series, 1, 1)
shfited_states_df

Unnamed: 0,given_state,find_state
1,NNP,NNP
2,NNP,","
3,",",CD
4,CD,NNS
5,NNS,JJ
...,...,...
912090,PRP,TO
912091,TO,NNP
912092,NNP,NNP
912093,NNP,RB


In [23]:
given_filter = (shfited_states_df['given_state'] == 'NNP')
given_filter

Unnamed: 0,given_state
1,True
2,True
3,False
4,False
5,False
...,...
912090,False
912091,False
912092,True
912093,True


In [24]:
condition_on_given = shfited_states_df[given_filter]
condition_on_given

Unnamed: 0,given_state,find_state
1,NNP,
2,NNP,
3,,
4,,
5,,
...,...,...
912090,,
912091,,
912092,NNP,
912093,NNP,


In [26]:
condition_on_given.dropna(how='all', inplace=True)

In [31]:
given_df = condition_on_given['given_state']
given_df

Unnamed: 0,given_state
1,NNP
2,NNP
16,NNP
19,NNP
20,NNP
...,...
912079,NNP
912084,NNP
912085,NNP
912092,NNP


In [43]:
given_filter = (shfited_states_df['given_state'] == 'NNP')
condition_on_given = shfited_states_df[given_filter]
condition_on_given.dropna(how='all', inplace=True)
given_df = condition_on_given['given_state']
given_df

Unnamed: 0,given_state
1,NNP
2,NNP
16,NNP
19,NNP
20,NNP
...,...
912079,NNP
912084,NNP
912085,NNP
912092,NNP


In [32]:
find_filter = (shfited_states_df['find_state'] == 'NNP')
condition_on_find = shfited_states_df[find_filter]
condition_on_find.dropna(how='all', inplace=True)
find_df = condition_on_find['find_state']
find_df

Unnamed: 0,find_state
1,NNP
15,NNP
18,NNP
19,NNP
23,NNP
...,...
912078,NNP
912083,NNP
912084,NNP
912091,NNP


In [44]:
# Create a new DataFrame where both given_df and find_df overlap with index
merged_df = pd.merge(given_df, find_df, left_index=True, right_index=True)

# Display the new DataFrame
merged_df

Unnamed: 0,given_state,find_state
1,NNP,NNP
19,NNP,NNP
24,NNP,NNP
32,NNP,NNP
42,NNP,NNP
...,...,...
912053,NNP,NNP
912054,NNP,NNP
912078,NNP,NNP
912084,NNP,NNP


In [52]:
N_nnp_nnp = len(merged_df)
N_nnp_nnp

33192

In [54]:
# t(NNP | NNP) = count(NNP -> NNP) / count(NNP)
s_prime = N_nnp_nnp  # count(NNP -> NNP)
s = N_nnp            # count(NNP)
transition_probs = s_prime / s
transition_probs

0.3788695096338234
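
The single NNP → NNP estimate above generalizes to every pair of tags at once. A minimal sketch, assuming a plain tag Series (the tag list here is made up): `pd.crosstab` with `normalize="index"` divides each row by the number of times the tag appears as a *given* state, which differs from `count(s)` over the whole corpus only for a tag in the final position.

```python
import pandas as pd

# Illustrative tag sequence only
tags = pd.Series(["NNP", "NNP", ",", "CD", "NNP", "NNP"])

# Pair each tag with the one that follows it
given_state = tags.iloc[:-1].reset_index(drop=True).rename("given_state")
find_state = tags.iloc[1:].reset_index(drop=True).rename("find_state")

# Rows: given state s; columns: next state s'; each row sums to 1
transition_table = pd.crosstab(given_state, find_state, normalize="index")
print(transition_table.loc["NNP", "NNP"])
```

Like the shifted frame above, this treats the corpus as one long sequence, so it also counts transitions from the last tag of one sentence to the first tag of the next.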

# 3. Greedy Decoding with HMM

1. [ ] Implement the greedy decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `greedy.out`, in the same format as the training data
5. [ ] Evaluate the results of the model with `eval.py` in the terminal: `python eval.py -p {predicted file} -g {gold-standard file}`
6. [ ] Question
    1. [ ] What is the accuracy on the dev data? 
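
As a starting point for the first task, greedy decoding can be sketched as below. This is a minimal sketch, not a finished solution: `greedy_decode` is a hypothetical helper, the probabilities are toy values, and `initial` (start-of-sentence probabilities) is an assumption that the model section above does not define.

```python
def greedy_decode(words, tags, transition, emission, initial):
    """Tag each word with the locally best tag given the previously chosen tag."""
    predicted = []
    previous = None
    for word in words:
        best_tag, best_score = None, -1.0
        for tag in tags:
            if previous is None:
                prior = initial.get(tag, 0.0)
            else:
                prior = transition.get((previous, tag), 0.0)
            score = prior * emission.get((tag, word), 0.0)
            if score > best_score:
                best_tag, best_score = tag, score
        predicted.append(best_tag)
        previous = best_tag  # commit to this choice; it is never revisited
    return predicted


# Toy parameters for illustration only
tags = ["NNP", "MD", "VB"]
initial = {"NNP": 1.0}
transition = {("NNP", "NNP"): 0.5, ("NNP", "MD"): 0.5, ("MD", "VB"): 1.0}
emission = {("NNP", "Pierre"): 0.5, ("NNP", "Vinken"): 0.5, ("MD", "will"): 1.0, ("VB", "join"): 1.0}

greedy_tags = greedy_decode(["Pierre", "Vinken", "will", "join"], tags, transition, emission, initial)
print(greedy_tags)
```

Words missing from the emission table score 0 for every tag here, in which case the first tag in `tags` wins the tie; mapping such words to `<unk>` before decoding avoids that.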

# 4. Viterbi Decoding with HMM

1. [ ] Implement the Viterbi decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `viterbi.out`, in the same format as the training data
5. [ ] Question
    1. [ ] What is the accuracy on the dev data?
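
As a starting point for the first task, Viterbi decoding can be sketched as below. This is a minimal sketch under stated assumptions: `viterbi_decode` is a hypothetical helper, the tags `A`/`B`, the words `x`/`y`, and all probabilities are toy values, and `initial` (start-of-sentence probabilities) is an assumption. The toy numbers are chosen so that picking the locally best first tag (`A`, since 0.3 > 0.2) does not lead to the best overall sequence.

```python
def viterbi_decode(words, tags, transition, emission, initial):
    """Return the most probable tag sequence using dynamic programming and backtracking."""
    if not words:
        return []
    # best[i][tag]: probability of the best sequence for words[:i + 1] that ends in tag
    best = [{tag: initial.get(tag, 0.0) * emission.get((tag, words[0]), 0.0) for tag in tags}]
    # back[i][tag]: the previous tag on that best sequence
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p] * transition.get((p, tag), 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][tag] = score * emission.get((tag, words[i]), 0.0)
            back[i][tag] = prev_tag
    # Backtrack from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy parameters for illustration only
tags = ["A", "B"]
initial = {"A": 0.6, "B": 0.4}
transition = {("A", "A"): 0.5, ("A", "B"): 0.5, ("B", "A"): 0.0, ("B", "B"): 1.0}
emission = {("A", "x"): 0.5, ("A", "y"): 0.5, ("B", "x"): 0.5, ("B", "y"): 0.5}

viterbi_tags = viterbi_decode(["x", "y"], tags, transition, emission, initial)
print(viterbi_tags)
```

Here the sequence `B, B` has probability 0.4 × 0.5 × 1.0 × 0.5 = 0.1, which beats any sequence starting with `A` (at most 0.075). On full-length sentences, summing log probabilities instead of multiplying raw probabilities avoids numerical underflow.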