# HW 3-Part-of-Speech Tagging with HMMs + Decoding Techniques (Greedy and Viterbi)

- Detravious Jamari Brinkley
- CSCI-544: Applied Natural Language Processing
- python version: 3.11.4

---

1. Part-of-Speech (POS) Tagging [a sequence labelling task: assign a part-of-speech tag to each word of a given sentence]
2. HMMs (Hidden Markov Models) [a generative model that's used for POS Tagging]
    1. Generative [models the joint distribution, that is, it assigns a probability to every combination of observations (words) and hidden states (POS Tags)]
    2. With POS Tagging: Given a sequence of observations (the words of a sentence), the task is to infer the most likely sequence of hidden states (POS Tags) that could have generated the observed data.
3. **Decoding Techniques:**
    1. Greedy [choose the locally best tag at each step, given the tag already chosen for the previous word; the resulting sequence is not guaranteed to be the overall optimal (OPT) one]
    2. Viterbi [make use of dynamic programming with backtracking to find the OPT tag sequence over the entire search space]
4. **Notes on the data and given files:**
    - Dataset: Wall Street Journal section of the Penn Treebank
    - Folder named `data` with the following files:
        1. `train`, sentences *with* human-annotated POS Tags
        2. `dev`, sentences *with* human-annotated POS Tags
        3. `test`, sentences *without* POS Tags; these are the tags to predict
    - Format: A blank line at the end of each sentence. Each non-blank line contains 3 items separated by the tab symbol, `\t` (the `test` file has no POS Tag item). These three items are
        1. Index of the word in the sentence
        2. Word type
        3. POS Tag
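
The file format described above can be read sentence by sentence with plain Python. The following is a minimal sketch, not the loader used in this notebook: `read_sentences` and the two-sentence sample are hypothetical, and the sketch assumes that the tag field is simply absent in the `test` file.

```python
import io

def read_sentences(lines):
    """Group tab-separated lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if there is one
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # fields[0] is the word index, fields[1] the word, fields[2] the tag (if present)
        word = fields[1]
        tag = fields[2] if len(fields) > 2 else None
        current.append((word, tag))
    if current:
        # The last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences

# Hypothetical sample in the format described above
sample = "1\tPierre\tNNP\n2\tVinken\tNNP\n\n1\tMr.\tNNP\n2\tVinken\tNNP\n3\tis\tVBZ\n"
sentences = read_sentences(io.StringIO(sample))
print(len(sentences), "sentences")
```

Keeping the sentence boundaries matters for the HMM, because the first tag of each sentence is conditioned on a start state rather than on the last tag of the previous sentence.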



In [1]:
# imports
import pandas as pd


# Load and Update Data

In [2]:
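def load_data(file_path: str, file_name: str, config_index: bool = True) -> pd.DataFrame:
    """Load a tab-separated data file; optionally use the word-index column as the index"""

    file = file_path + file_name

    # The files have no header row, so pandas reads the first data line as the header.
    # Its first field is the word index '1'; update_df_columns restores the rest as a row.
    open_df = pd.read_table(file)

    if config_index:
        open_df = open_df.set_index('1')

    return open_df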

In [3]:
def update_df_columns(df: pd.DataFrame, new_columns_name: list, about: str) -> pd.DataFrame:
    """Restore the first data line (read as the header) as a row, then rename the columns"""

    N_columns = len(df.columns.to_list())

    if N_columns not in (1, 2):
        print(" --- Invalid number of columns ---")
        return df

    print(about, f"has {N_columns} column" + ("s" if N_columns == 2 else ""))

    # The header holds the first data line: the word, plus the POS tag if present
    new_row = pd.DataFrame([df.columns.to_list()], columns=df.columns)
    df = pd.concat([new_row, df], ignore_index=True)
    df.columns = new_columns_name

    print("Update complete\n")
    return df

In [4]:
train_df = load_data('data/', 'train')
dev_df = load_data('data/', 'dev')
test_df = load_data('data/', 'test')

two_columns_name = ['Word', 'Pos Tag']
one_columns_name = ['Word']

updated_train_df = update_df_columns(train_df, two_columns_name, "Train data")
updated_dev_df = update_df_columns(dev_df, two_columns_name, "Dev data")
updated_test_df = update_df_columns(test_df, one_columns_name, "Test data")

Train data has 2 columns
Update complete

Dev data has 2 columns
Update complete

Test data has 1 column
Update complete



In [5]:
updated_train_df.head(18)

| | Word | Pos Tag |
|---|---|---|
| 0 | Pierre | NNP |
| 1 | Vinken | NNP |
| 2 | , | , |
| 3 | 61 | CD |
| 4 | years | NNS |
| 5 | old | JJ |
| 6 | , | , |
| 7 | will | MD |
| 8 | join | VB |
| 9 | the | DT |


# Outline of Tasks

1. Vocabulary Creation
2. Model Learning
3. Greedy Decoding with HMM
4. Viterbi Decoding with HMM


# 1. Vocabulary Creation

- **Problem:** Creating a vocabulary that handles unknown words.
    - **Solution:** Replace rare words, whose occurrences are less than a threshold (e.g., 3), with a special token `< unk >`

---

1. [ ] Create a vocabulary using the training data in the file train
2. [ ] Output the vocabulary into a txt file named `vocab.txt`
    - [ ] See PDF on how to properly format vocabulary file
3. [ ] Questions
    1. [ ] What is the selected threshold for unknown words replacement?
    2. [ ] What is the total size of your vocabulary?
    3. [ ] What is the total number of occurrences of the special token `< unk >` after replacement?
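
The three questions above can be answered directly from word counts. The following is a minimal sketch using only the standard library; `build_vocab`, `min_count`, and the toy word list are hypothetical. It assumes a word is kept when its count is at least `min_count`, so the notebook's `count > 3` filter corresponds to `min_count=4`, while "less than 3" in the problem statement corresponds to `min_count=3`.

```python
from collections import Counter

def build_vocab(words, min_count, unk_token="< unk >"):
    """Return (kept words with counts, total occurrences folded into unk_token)."""
    counts = Counter(words)
    # Keep frequent words, most frequent first
    kept = [(word, count) for word, count in counts.most_common() if count >= min_count]
    # Every occurrence of a rare word becomes one occurrence of the unknown token
    unk_count = sum(count for count in counts.values() if count < min_count)
    return kept, unk_count

# Toy data, not taken from the training file
words = "the cat the dog the cat a".split()
kept, unk_count = build_vocab(words, min_count=2)

print("kept words:", kept)
print("occurrences of the unknown token:", unk_count)
# Assumption: the vocabulary size counts the unknown token as one extra entry
print("vocabulary size:", len(kept) + 1)
```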


In [7]:
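# Number of occurrences of each word, most frequent first
# (despite its name, this Series holds counts, not booleans)
true_false_series = updated_train_df['Word'].value_counts()
print(true_false_series)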


Word
,           46476
the         39533
.           37452
of          22104
to          21305
            ...  
Birthday        1
Happy           1
Bertie          1
crouched        1
Huricane        1
Name: count, Length: 43192, dtype: int64


In [8]:
vocab_df = pd.DataFrame(true_false_series)
vocab_df.reset_index(inplace = True)

In [9]:
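# Boolean mask: True for words that occur more than 3 times (at least 4)
true_false_series = vocab_df['count'] > 3

In [10]:
# Keep only the words selected by the mask
updated_vocab_df = vocab_df.loc[true_false_series]
updated_vocab_df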

| | Word | count |
|---|---|---|
| 0 | , | 46476 |
| 1 | the | 39533 |
| 2 | . | 37452 |
| 3 | of | 22104 |
| 4 | to | 21305 |
| ... | ... | ... |
| 13744 | 9.3 | 4 |
| 13745 | starters | 4 |
| 13746 | prescribe | 4 |
| 13747 | scammers | 4 |


In [11]:
N_vocab = range(0, len(updated_vocab_df))
N_vocab

range(0, 13749)

In [12]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that is raised when a column is set on a slice of another DataFrame
updated_vocab_df = updated_vocab_df.assign(index=N_vocab)
updated_vocab_df


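| | Word | count | index |
|---|---|---|---|
| 0 | , | 46476 | 0 |
| 1 | the | 39533 | 1 |
| 2 | . | 37452 | 2 |
| 3 | of | 22104 | 3 |
| 4 | to | 21305 | 4 |
| ... | ... | ... | ... |
| 13744 | 9.3 | 4 | 13744 |
| 13745 | starters | 4 | 13745 |
| 13746 | prescribe | 4 | 13746 |
| 13747 | scammers | 4 | 13747 |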


In [13]:
updated_vocab_df = updated_vocab_df.reindex(columns=['Word', 'index', 'count'])
updated_vocab_df

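| | Word | index | count |
|---|---|---|---|
| 0 | , | 0 | 46476 |
| 1 | the | 1 | 39533 |
| 2 | . | 2 | 37452 |
| 3 | of | 3 | 22104 |
| 4 | to | 4 | 21305 |
| ... | ... | ... | ... |
| 13744 | 9.3 | 13744 | 4 |
| 13745 | starters | 13745 | 4 |
| 13746 | prescribe | 13746 | 4 |
| 13747 | scammers | 13747 | 4 |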


In [None]:
updated_vocab_df.to_csv('vocab.txt', header=None, index=None, sep='\t')


In [None]:
updated_vocab_df = updated_vocab_df[['index', 'count']]
updated_vocab_df

In [None]:
pd.DataFrame(data=updated_vocab_df, index=N_vocab)

In [None]:
def split_true_false(df: pd.DataFrame, col_name: str, threshold: int, replacement_token: str):
    """Split the data based on the selected threshold"""

    # Number of occurrences of each word, most frequent first
    occurrences = df[col_name].value_counts()

    # True for words that occur more than `threshold` times
    true_false_series = occurrences > threshold

    N_true = int(true_false_series.sum())
    N_false = len(true_false_series) - N_true
    print("Total True and Total False", N_true, N_false)

    # One row per word: the word, its True/False flag, and its number of occurrences
    true_false_df = pd.DataFrame({
        'Word': occurrences.index,
        'T/F': true_false_series.to_numpy(),
        '#occurrences': occurrences.to_numpy(),
    })

    # Copy in which every word flagged False is replaced by the replacement token
    true_false_token_df = true_false_df.copy()
    true_false_token_df.loc[~true_false_token_df['T/F'], 'Word'] = replacement_token

    return true_false_df, true_false_token_df



In [None]:
true_false_train_df, tokened_true_false_train_df = split_true_false(updated_train_df, 'Word', 3, '< unk >')

In [None]:
true_false_train_df 

In [None]:
def get_conditioned_df(df: pd.DataFrame, col_name: str, condition) -> pd.DataFrame:
    """Return the rows of df whose value in col_name equals condition"""
    mask = df[col_name] == condition

    return df.loc[mask]


In [None]:
vocab_df = get_conditioned_df(true_false_train_df, 'T/F', True)
unk_df = get_conditioned_df(tokened_true_false_train_df, 'Word', '< unk >')

In [None]:
vocab_df['Word']

In [None]:
unk_df

# 2. Model Learning

- Learn an HMM from the training data
- **HMM Parameters:**
  <div style="text-align: center;">

    $
    \text{Transition Probability (} t \text{)}: \quad t(s' \mid s) = \frac{\text{count}(s \rightarrow s')}{\text{count}(s)}
    $

    $
    \text{Emission Probability (} e \text{)}: \quad e(x \mid s) = \frac{\text{count}(s \rightarrow x)}{\text{count}(s)}
    $

  </div>
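
The two formulas above amount to counting and dividing. The following is a minimal sketch, not the notebook's implementation: `learn_hmm` is a hypothetical name, `<s>` is an assumed start-of-sentence state, and the two toy sentences are made up. As in the formulas, the denominator is the total count of the state `s`, with no smoothing.

```python
from collections import Counter

def learn_hmm(sentences, start="<s>"):
    """Estimate t(s'|s) and e(x|s) by relative frequency from (word, tag) sentences."""
    state_count = Counter()
    transition_count = Counter()
    emission_count = Counter()
    for sentence in sentences:
        previous = start
        state_count[start] += 1          # one start state per sentence
        for word, tag in sentence:
            transition_count[(previous, tag)] += 1
            emission_count[(tag, word)] += 1
            state_count[tag] += 1
            previous = tag
    # t(s'|s) = count(s -> s') / count(s)
    transition = {(s, s2): c / state_count[s] for (s, s2), c in transition_count.items()}
    # e(x|s) = count(s -> x) / count(s)
    emission = {(s, x): c / state_count[s] for (s, x), c in emission_count.items()}
    return transition, emission

# Toy training data, not taken from the training file
toy_sentences = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
]
transition, emission = learn_hmm(toy_sentences)
print(transition)
print(emission)
```

Only state pairs and state-word pairs that actually occur get an entry; any missing key is treated as probability 0 when decoding.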

---

1. [x] Learn a model using the training data in the file train
2. [ ] Output the learned model into a model file in JSON format, named `hmm.json`. The model file should contain two dictionaries, for the transition and emission parameters, respectively.
    1. [ ] 1st dictionary: Named `transition`, contains items with pairs of (s, s′) as key and t(s′|s) as value.
    2. [ ] 2nd dictionary: Named `emission`, contains items with pairs of (s, x) as key and e(x|s) as value.
3. [ ] Question
    1. [ ] How many transition and emission parameters are in your HMM?
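
JSON object keys must be strings, so the (s, s′) and (s, x) pairs need a string encoding before the model can be saved. The following is a minimal sketch under that constraint: the `"(s, s')"` key format, the helper name `to_json_ready`, and the two tiny tables are assumptions, since the required key format is not stated here.

```python
import json

def to_json_ready(prob_table):
    """Turn {(a, b): p} into {"(a, b)": p} so that json can serialize the keys."""
    return {f"({a}, {b})": p for (a, b), p in prob_table.items()}

# Toy parameters, not learned from the training file
transition = {("<s>", "DT"): 1.0, ("DT", "NN"): 1.0}
emission = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "cat"): 0.5}

model = {
    "transition": to_json_ready(transition),
    "emission": to_json_ready(emission),
}
model_text = json.dumps(model, indent=2)
print(model_text)

# One way to count parameters: the number of stored (non-zero) entries per table
print("transition parameters:", len(transition))
print("emission parameters:", len(emission))
```

To write the file itself, the same `model` dictionary can be passed to `json.dump` together with a file opened for writing as `hmm.json`.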


# 3. Greedy Decoding with HMM

1. [ ] Implement the greedy decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `greedy.out`, in the same format as the training data
5. [ ] Evaluate the results of the model with `eval.py` in the terminal: `python eval.py -p {predicted file} -g {gold-standard file}`
6. [ ] Question
    1. [ ] What is the accuracy on the dev data? 
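
Greedy decoding picks, for each word in turn, the tag that maximizes t(s′|s) · e(x|s′) given the tag already chosen for the previous word. The following is a minimal sketch, not a finished solution: `greedy_decode`, the `<s>` start state, and the toy probabilities are assumptions, and missing dictionary entries are treated as probability 0. In practice, words outside the vocabulary would first be mapped to `< unk >`.

```python
def greedy_decode(words, tags, transition, emission, start="<s>"):
    """Tag words left to right, committing to the locally best tag at each step."""
    previous = start
    predicted = []
    for word in words:
        # Score of tag s: t(s | previous) * e(word | s); unseen pairs score 0
        best_tag = max(
            tags,
            key=lambda s: transition.get((previous, s), 0.0) * emission.get((s, word), 0.0),
        )
        predicted.append(best_tag)
        previous = best_tag
    return predicted

# Toy model, not learned from the training file
tags = ["DT", "NN", "VB"]
transition = {
    ("<s>", "DT"): 0.8, ("<s>", "NN"): 0.2,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
    ("NN", "NN"): 0.4, ("NN", "VB"): 0.6,
}
emission = {
    ("DT", "the"): 0.7,
    ("NN", "dog"): 0.4, ("NN", "barks"): 0.1,
    ("VB", "barks"): 0.5,
}
greedy_tags = greedy_decode(["the", "dog", "barks"], tags, transition, emission)
print(greedy_tags)
```

When several tags tie (for example, all scores are 0), `max` returns the first one in `tags`, so a real implementation may want an explicit fallback.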

# 4. Viterbi Decoding with HMM

1. [ ] Implement the Viterbi decoding algorithm
2. [ ] Evaluate it on the development data
3. [ ] Predict the POS Tags of the sentences in the test data
4. [ ] Output the predictions in a file named `viterbi.out`, in the same format as the training data
5. [ ] Question
    1. [ ] What is the accuracy on the dev data?
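
Viterbi decoding keeps, for every position and every tag, the score of the best tag sequence ending in that tag, plus a back-pointer to recover the sequence. The following is a minimal sketch, not a finished solution: `viterbi_decode`, the `<s>` start state, and the toy probabilities are assumptions. It works in log space to avoid underflow on long sentences and treats missing entries as probability 0.

```python
import math

def viterbi_decode(words, tags, transition, emission, start="<s>"):
    """Return the highest-probability tag sequence for words under the HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    if not words:
        return []

    # best[i][s]: best log-probability of a tag sequence for words[:i + 1] ending in s
    # back[i][s]: the previous tag on that best sequence
    best = [{s: log(transition.get((start, s), 0.0)) + log(emission.get((s, words[0]), 0.0))
             for s in tags}]
    back = [{s: None for s in tags}]

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for s in tags:
            # Best previous tag r for reaching s at position i
            r_best = max(tags, key=lambda r: best[i - 1][r] + log(transition.get((r, s), 0.0)))
            best[i][s] = (best[i - 1][r_best]
                          + log(transition.get((r_best, s), 0.0))
                          + log(emission.get((s, words[i]), 0.0)))
            back[i][s] = r_best

    # Backtrack from the best final tag
    path = [max(tags, key=lambda s: best[-1][s])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy model with two abstract tags, not learned from the training file
tags = ["A", "B"]
transition = {("<s>", "A"): 0.6, ("<s>", "B"): 0.4,
              ("A", "A"): 0.8, ("A", "B"): 0.2,
              ("B", "A"): 0.1, ("B", "B"): 0.9}
emission = {("A", "x"): 0.5, ("B", "x"): 0.5,
            ("A", "y"): 0.1, ("B", "y"): 0.9}
viterbi_tags = viterbi_decode(["x", "y"], tags, transition, emission)
print(viterbi_tags)
```

In this toy model a decoder that commits to the best tag at each step would choose `A` for the first word (0.6 · 0.5 = 0.30 versus 0.4 · 0.5 = 0.20), yet the best full sequence is `B, B` with probability 0.4 · 0.5 · 0.9 · 0.9 = 0.162, which is what the search over all sequences recovers.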