# HW 3-Part-of-Speech Tagging with HMMs + Decoding Techniques (Greedy and Viterbi)

- Detravious Jamari Brinkley
- CSCI-544: Applied Natural Language Processing
- python version: 3.11.4

---

1. Part-of-Speech (POS) Tagging [a type of sequence labelling task where of a given word, assign the part of speech]
2. HMMs (Hidden Markov Model) [a generative-based model that's used for POS Tagging]
    1. Generative-based [provides the probabilities for all possible combinations of values of variables in the set using the joint distribution]
    2. With POS Tagging: Given a sequence of observations (sentences), the task is to infer the most likely sequence of hidden states (POS Tags) that could have generated the observed data.
3. **Decoding Techniques:**
    1. Greedy [find the optimal (OPT) solution at each step]
    2. Viterbi [make use of dynammic programming to find the OPT solution with backtracking while searching the entire search space]
4. **Notes of the data and given files:**
    - Dataset: Wall Street Journal section of the Penn Treebank
    - Folder named `data` with the following files:
        1. `train`, sentences *with* human-annotated POS Tags
        2. `dev`, sentences *with* human-annotated POS Tags
        3. `test`, sentences *without* POS Tags, thus predict the POS Tags
    - Format: Blank like at the end of each sentence. Each line contains 3 items separated by the `\t`, the tab symbol. These three items are
        1. Index of the word in the sentence
        2. Word type
        3. POS Tag



In [1]:
import pandas as pd

# Load and Update Data
- [x] Find a way to separate sentences when loading the df.

In [2]:
def load_data(file_path: str, file_name: str, config_index: bool = True):
    
    if config_index == True:
        file =  file_path + file_name
        open_df = pd.read_table(file, skip_blank_lines=False)
        open_df = open_df.set_index('1')
        
    return open_df

In [3]:
def update_df_columns(df: pd.DataFrame, new_columns_name: list, about: str) -> pd.DataFrame:  
    """Update the columns of the dataframe if first column is data needed"""  

    N_columns = len(df.columns.to_list())

    if N_columns == 1:
        print(about, "has 1 column")
        word = df.columns.to_list()[0]
        new_row = pd.DataFrame([[word]], columns=df.columns)
        df = pd.concat([new_row, df], ignore_index=True)
        df.columns = new_columns_name
        df.fillna("dummy", inplace=True)

    elif N_columns == 2:
        print(about, "has 2 columns")
        dummy_row = pd.DataFrame([[' ', 'dummy']], columns=df.columns)
        word = df.columns.to_list()[0]
        pos_tag = df.columns.to_list()[1]
        new_row = pd.DataFrame([[word, pos_tag]], columns=df.columns)
        df = pd.concat([dummy_row, new_row, df], ignore_index=True)
        df.columns = new_columns_name
        df.fillna("dummy", inplace=True)
        
    else:
        print(" --- Invalid number of columns ---")

    print("Update complete\n")    
    return df

In [4]:
train_df = load_data('data/', 'train')
dev_df = load_data('data/', 'dev')
test_df = load_data('data/', 'test')

two_columns_name = ['Word', 'POS Tag']
one_columns_name = ['Word']

updated_train_df = update_df_columns(train_df, two_columns_name, "Train data")
updated_dev_df = update_df_columns(dev_df, two_columns_name, "Dev data")
updated_test_df = update_df_columns(test_df, one_columns_name, "Test data")

Train data has 2 columns
Update complete

Dev data has 2 columns
Update complete

Test data has 1 column
Update complete



In [5]:
# updated_train_df.head(33)
# updated_train_df.tail(5)
# updated_train_df

# updated_train_df[updated_train_df['POS Tag'] == "dummy"].shape

# Outline of Tasks

1. Vocabulary Creation
2. Model Learning
3. Greedy Decoding with HMM
4. Viterbi Decoding with HMM


# 1. Vocabulary Creation

- **Problem:** Creating vocabulary to handle unkown words.
    - **Solution:** Replace rare words wtih whose occurrences are less than a threshold (ie: 3) with a special token `< unk >`

---

1. [ ] Create a vocabulary using the training data in the file train
2. [ ] Output the vocabulary into a txt file named `vocab.txt`
    - [ ] See PDF on how to properly format vocabulary file
3. [ ] Questions
    1. [ ] What is the selected threshold for unknown words replacement?
    2. [ ] What is the total size of your vocabulary?
    3. [ ] What is the total occurrences of the special token `< unk >`after replacement?

In [6]:
# siddhant
# shivam

In [7]:
true_false_series = updated_train_df['Word'].value_counts()
print(true_false_series)

Word
,           46476
the         39533
dummy       38234
.           37452
of          22104
            ...  
Birthday        1
Happy           1
Bertie          1
crouched        1
Huricane        1
Name: count, Length: 43193, dtype: int64


In [8]:
vocab_df = pd.DataFrame(true_false_series)
vocab_df.reset_index(inplace = True)

In [9]:
true_false_series = vocab_df['count'] > 3

updated_vocab_df = vocab_df.loc[true_false_series == True]
updated_false_vocab_df = vocab_df.loc[true_false_series == False]
updated_false_vocab_df['Word'] = ' <unk> '
print()
N_updated_false_vocab_df = len(updated_false_vocab_df)
N_updated_false_vocab_df
new_row = pd.DataFrame([['<unk>', N_updated_false_vocab_df]], columns=updated_vocab_df.columns)
new_row
df = pd.concat([new_row, updated_vocab_df], ignore_index=True)
N_vocab = range(0, len(updated_vocab_df)+1)

df['index'] = N_vocab

df = df.reindex(columns=['Word', 'index', 'count'])
df
# df.to_csv('vocab.txt', header=None, index=None, sep='\t')




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  updated_false_vocab_df['Word'] = ' <unk> '


Unnamed: 0,Word,index,count
0,<unk>,0,29443
1,",",1,46476
2,the,2,39533
3,dummy,3,38234
4,.,4,37452
...,...,...,...
13746,trafficking,13746,4
13747,7.62,13747,4
13748,gut,13748,4
13749,17.3,13749,4


In [10]:
df

Unnamed: 0,Word,index,count
0,<unk>,0,29443
1,",",1,46476
2,the,2,39533
3,dummy,3,38234
4,.,4,37452
...,...,...,...
13746,trafficking,13746,4
13747,7.62,13747,4
13748,gut,13748,4
13749,17.3,13749,4


In [11]:
# df[df[word]

# 2. Model Learning

- Learn an HMM from the training data
- **HMM Parameters:**
  <div style="text-align: center;">

    $
    \text{Transition Probability (} t \text{)}: \quad t(s' \mid s) = \frac{\text{count}(s \rightarrow s')}{\text{count}(s)}
    $

    $
    \text{Emission Probability (} e \text{)}: \quad e(x \mid s) = \frac{\text{count}(s \rightarrow x)}{\text{count}(s)}
    $

  </div>

---

1. [x] Learn a model using the training data in the file train
2. [ ] Output the learned model into a model file in json format, named `hmm.json`. The model file should contains two dictionaries for the emission and transition parameters, respectively.
    1. [ ] 1st dictionary: Named transition, contains items with pairs of (s, s′) as key and t(s′|s) as value. 
    2. [ ] 2nd dictionary: Named emission, contains items with pairs of (s, x) as key and e(x|s) as value.
3. Question
    1. [ ] How many transition and emission parameters in your HMM?


In [12]:
updated_train_df

Unnamed: 0,Word,POS Tag
0,,dummy
1,Pierre,NNP
2,Vinken,NNP
3,",",","
4,61,CD
...,...,...
950308,to,TO
950309,San,NNP
950310,Francisco,NNP
950311,instead,RB


In [13]:
def create_shifts(states_series: pd.Series, current_state: int, next_state: int):
    """Splits a given create_shifts into multiple input rows where each input row has a s' and s
    
    Parameters:
    
    Return:
    """
    
    df = pd.DataFrame(states_series)
    # print(df)
    cols = list()
    
    lag_col_names = []
    count_lag = 0
    # input sequence (t-n, ... t-1)
    for prior_observation in range(current_state, 0, -1):
        # print("prior_observation: ", prior_observation)
        cols.append(df.shift(prior_observation))
        new_col_name = "given_state"
        # print(new_col_name)
        lag_col_names.append(new_col_name)
        
    
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, next_state):
        cols.append(df.shift(-i))
        new_col_name = "find_state"
        # print(new_col_name)
        lag_col_names.append(new_col_name)
        
        # put it all together
        uts_sml_df = pd.concat(cols, axis=1) 
        uts_sml_df.columns=[lag_col_names]
        # drop rows with NaN values
        uts_sml_df.dropna(inplace=True)
        
    return uts_sml_df

In [14]:
states_series = updated_train_df['POS Tag']
states_df = create_shifts(states_series, 1, 1)
states_df

Unnamed: 0,given_state,find_state
1,dummy,NNP
2,NNP,NNP
3,NNP,","
4,",",CD
5,CD,NNS
...,...,...
950308,PRP,TO
950309,TO,NNP
950310,NNP,NNP
950311,NNP,RB


In [15]:
# states_df = create_shifts(updated_train_df, ', 1, 1)
# states_df

In [16]:
def condition_on_s(df, pos_tag):
    # print("Given", pos_tag)
    given_filter = (df['given_state'] == pos_tag)
    condition_on_given = df[given_filter]
    condition_on_given.dropna(how='all', inplace=True)
    given_df = condition_on_given['given_state']

    return given_df

In [17]:
def condition_on_s_prime(df, pos_tag):
    # print("Find", pos)
    find_filter = (df['find_state'] == pos_tag)
    condition_on_find = df[find_filter]
    condition_on_find.dropna(how='all', inplace=True)
    find_df = condition_on_find['find_state']
    
    return find_df

In [18]:
s_df = condition_on_s(states_df, 'NNP')
s_df

Unnamed: 0,given_state
2,NNP
3,NNP
17,NNP
21,NNP
22,NNP
...,...
950297,NNP
950302,NNP
950303,NNP
950310,NNP


In [19]:
s_prime_df = condition_on_s_prime(states_df, 'NNP')
s_prime_df

Unnamed: 0,find_state
1,NNP
2,NNP
16,NNP
20,NNP
21,NNP
...,...
950296,NNP
950301,NNP
950302,NNP
950309,NNP


- [ ] Add condition to check all elements in specific column should be that POS Tag and NOT another

In [20]:
def get_common_elements(s_df, s_prime_df):
    """Merge s and s' across same index"""
    df = pd.merge(s_df, s_prime_df, left_index=True, right_index=True)
    return df

In [21]:
merged_df = get_common_elements(s_df, s_prime_df)
merged_df

Unnamed: 0,given_state,find_state
2,NNP,NNP
21,NNP,NNP
26,NNP,NNP
35,NNP,NNP
45,NNP,NNP
...,...,...
950270,NNP,NNP
950271,NNP,NNP
950296,NNP,NNP
950302,NNP,NNP


In [22]:
def calc_transition_probability(word_pos_tag_df, col_name, merged_df, pos_tag, s_prime_pos_tag):
    """Calculate

    Parameters
    ----------
    word_pos_tag_df: `pd.DataFrame`
        The updated DF of either train (N x 2: Word and POS Tag), dev (N x 2: Word and POS Tag), or test (N x 1: POS Tag).

    col_name: `str`
        Column name of word_pos_tag_df to get. Should only be POS Tag. If Word column, then throw error.
        
    merged_df: `pd.DataFrame`
        The subset DF that reps the all matchings of s (given) and s' (find).

    pos_tag: `str`
        The POS Tag of interests to find the total

    Return
    ------
    t_prob: `float` 
        The probability of t

    """
    
    N_pos_tag_each_series = word_pos_tag_df[col_name].value_counts()
    # print("Each POS Tag has N values: \n", N_pos_tag_each_series)

    s = N_pos_tag_each_series.to_dict()[pos_tag]
    print(f"s (given) = {pos_tag} with length of: {s}")
    
    # s = states_df.value_counts().to_dict()[pos_tag]
    # print(f"Length of given: {s}")

    s_prime = len(merged_df)
    print(f"s' (find) = {s_prime_pos_tag} with length of: {s_prime}")

    t = s_prime / s

    return t

In [23]:
calc_transition_probability(updated_train_df, 'POS Tag', merged_df, 'NNP', 'NNP')

s (given) = NNP with length of: 87608
s' (find) = NNP with length of: 33139


0.3782645420509543

In [24]:
def calc_transition_probability_per_sentence(word_pos_tag_df, df):
    
    # Iterate over the rows of df
    for index, row in df.iterrows():
        # Get values from column 1 and column 2
        s_pos_tag = row['given_state']
        s_prime_pos_tag = row['find_state']
        
        # print(f"Row {index}: Given s = {given_s}, Find s' = {find_s_prime}")
        print(f"Row {index}: Pr({s_prime_pos_tag} | {s_pos_tag})")
        # print()

        s_df = condition_on_s(df, s_pos_tag)
        s_prime_df = condition_on_s_prime(df, s_prime_pos_tag)
        merged_df = get_common_elements(s_df, s_prime_df)
        t_prob = calc_transition_probability(word_pos_tag_df, 'POS Tag', merged_df, s_pos_tag, s_prime_pos_tag)
        
        print(f"Row {index}: Pr(s' = {s_prime_pos_tag} | s = {s_pos_tag}) --- t = {t_prob} ")
        print()

In [42]:
calc_transition_probability_per_sentence(updated_train_df, states_df)

Row 1: Pr(NNP | dummy)
s (given) = dummy with length of: 38218
s' (find) = NNP with length of: 7563
Row 1: Pr(s' = NNP | s = dummy) --- t = 0.19789104610393007 

Row 2: Pr(NNP | NNP)
s (given) = NNP with length of: 87608
s' (find) = NNP with length of: 33139
Row 2: Pr(s' = NNP | s = NNP) --- t = 0.3782645420509543 

Row 3: Pr(, | NNP)
s (given) = NNP with length of: 87608
s' (find) = , with length of: 12131
Row 3: Pr(s' = , | s = NNP) --- t = 0.13846908958086018 

Row 4: Pr(CD | ,)
s (given) = , with length of: 46480
s' (find) = CD with length of: 987
Row 4: Pr(s' = CD | s = ,) --- t = 0.021234939759036144 

Row 5: Pr(NNS | CD)
s (given) = CD with length of: 34876
s' (find) = NNS with length of: 5502
Row 5: Pr(s' = NNS | s = CD) --- t = 0.15775891730703062 

Row 6: Pr(JJ | NNS)
s (given) = NNS with length of: 57859
s' (find) = JJ with length of: 995
Row 6: Pr(s' = JJ | s = NNS) --- t = 0.017196978862406887 

Row 7: Pr(, | JJ)
s (given) = JJ with length of: 58944
s' (find) = , with leng

KeyboardInterrupt: 

In [27]:
updated_train_df

Unnamed: 0,Word,POS Tag
0,,dummy
1,Pierre,NNP
2,Vinken,NNP
3,",",","
4,61,CD
...,...,...
950308,to,TO
950309,San,NNP
950310,Francisco,NNP
950311,instead,RB


In [37]:
emissions_df = updated_train_df.loc[0:, ['POS Tag', 'Word']]
emissions_df.rename(columns={'POS Tag': 'given_state', 'Word': 'find_state'}, inplace=True)

In [38]:
emissions_df.head(21)

Unnamed: 0,given_state,find_state
0,dummy,
1,NNP,Pierre
2,NNP,Vinken
3,",",","
4,CD,61
5,NNS,years
6,JJ,old
7,",",","
8,MD,will
9,VB,join


In [41]:
import warnings
warnings.filterwarnings('ignore')
calc_transition_probability_per_sentence(updated_train_df, emissions_df)

Row 0: Pr(  | dummy)
s (given) = dummy with length of: 38218
s' (find) =   with length of: 1
Row 0: Pr(s' =   | s = dummy) --- t = 2.6165681092678842e-05 

Row 1: Pr(Pierre | NNP)
s (given) = NNP with length of: 87608
s' (find) = Pierre with length of: 6
Row 1: Pr(s' = Pierre | s = NNP) --- t = 6.84868961738654e-05 

Row 2: Pr(Vinken | NNP)
s (given) = NNP with length of: 87608
s' (find) = Vinken with length of: 2
Row 2: Pr(s' = Vinken | s = NNP) --- t = 2.2828965391288468e-05 

Row 3: Pr(, | ,)
s (given) = , with length of: 46480
s' (find) = , with length of: 46476
Row 3: Pr(s' = , | s = ,) --- t = 0.9999139414802065 

Row 4: Pr(61 | CD)
s (given) = CD with length of: 34876
s' (find) = 61 with length of: 25
Row 4: Pr(s' = 61 | s = CD) --- t = 0.0007168253240050465 

Row 5: Pr(years | NNS)
s (given) = NNS with length of: 57859
s' (find) = years with length of: 1130
Row 5: Pr(s' = years | s = NNS) --- t = 0.019530237301024905 

Row 6: Pr(old | JJ)
s (given) = JJ with length of: 58944
s'

KeyboardInterrupt: 

# 3. Greedy Decoding with HMM

1. [ ] Implement the greedy decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predicting the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `greedy.out`, in the same format of training data
5. [ ] Evaluate the results of the model on `eval.py` in the terminal with `python eval.py − p {predicted file} − g {gold-standard file}`
6. [ ] Question
    1. [ ] What is the accuracy on the dev data? 

# 4. Viterbi Decoding with HMM

1. [ ] Implement the viterbi decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `viterbi.out`, in the same format of training data
5. [ ] Question
    1. [ ] What is the accuracy on the dev data?