# Part-of-Speech Tagging with a Hidden Markov Model

Prepared by:
* Bryan Christopher - 219116780
* Christian Budhi Sabdana - 219116781
* Christian Trisno Sen Long Chen - 219116782

In [1]:
'''
magic command from IPython extension to reload modules before executing user code
https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html
'''

%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
from io import StringIO
import csv
import sys
from tqdm.notebook import tqdm

In [3]:
'''
collect the corpus; sentences are separated by a blank line (double newline)
'''

collection = []

with open("./data/Indonesian_Manually_Tagged_Corpus.tsv", "r") as f:
    txt_file = f.read()

for sentence in txt_file.split('\n\n'):
    temp = pd.read_csv(StringIO(sentence), delimiter='\t', header=None, quoting=csv.QUOTE_NONE)
    collection.append(temp.to_numpy())

collection = np.array(collection, dtype=object)

## Splitting the data

In [4]:
TRAINING_PERCENTAGE = 70
VALIDATION_PERCENTAGE = 10
TESTING_PERCENTAGE = 20

In [5]:
'''
shuffle the collection, then divide it into a training set, a validation set, and a testing set
'''

np.random.shuffle(collection)

training_count = round(len(collection) * TRAINING_PERCENTAGE / 100)
validation_count = round((len(collection) - training_count) * VALIDATION_PERCENTAGE / (100 - TRAINING_PERCENTAGE))

training = collection[0 : training_count]
validation = collection[training_count : training_count + validation_count]
testing = collection[training_count + validation_count : ]
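
The split above relies on NumPy's global random state, so every run produces different sets. A minimal sketch of the same 70/10/20 split with a seeded generator follows; the helper name `split_collection` and the seed value 42 are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def split_collection(items, train_pct=70, val_pct=10, seed=42):
    """Shuffle a copy of `items` and cut it into train/validation/test parts.

    Hypothetical helper: the name, defaults, and seed are illustrative only.
    """
    rng = np.random.default_rng(seed)          # seeded generator -> repeatable shuffle
    order = rng.permutation(len(items))        # shuffled index order
    n_train = round(len(items) * train_pct / 100)
    n_val = round(len(items) * val_pct / 100)
    shuffled = [items[i] for i in order]
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])        # the remainder becomes the test set

train_part, val_part, test_part = split_collection(list(range(100)))
print(len(train_part), len(val_part), len(test_part))  # 70 10 20
```

Fixing the seed makes the accuracies reported further down comparable between runs, at the cost of always evaluating on the same partition.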

## Using POS Tagging Class
This section trains the tagger by counting every emission (tag to word) and transition (tag to tag) occurrence in the training set.

In [6]:
from PosTagging import PosTagging
tagger = PosTagging(training)
_,_,_ = tagger.train()

[7021/7021]	|██████████████████████████████████████████████████|
finished...
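
The internals of `PosTagging.train()` are not shown in this notebook. As a rough sketch of what counting emissions and transitions can look like, assuming each sentence is a sequence of `(word, tag)` pairs and using a hypothetical `<s>` start marker and toy sentences:

```python
from collections import Counter, defaultdict

def count_hmm_events(sentences, start="<s>"):
    """Count tag->tag transitions and tag->word emissions.

    Hypothetical helper; not the PosTagging implementation.
    """
    transitions = defaultdict(Counter)   # transitions[prev_tag][tag] = count
    emissions = defaultdict(Counter)     # emissions[tag][word] = count
    for sentence in sentences:
        previous = start                 # every sentence starts from the start marker
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            previous = tag
    return transitions, emissions

# toy data, invented for illustration
toy = [[("saya", "PRP"), ("makan", "VB"), ("nasi", "NN")],
       [("dia", "PRP"), ("makan", "VB")]]
transitions, emissions = count_hmm_events(toy)
print(transitions["PRP"]["VB"])   # 2
print(emissions["VB"]["makan"])   # 2
```

Dividing each count by its row total would turn these tables into the transition and emission probabilities an HMM needs.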

### Get the Best Value for Lambda
We use the validation data to find the best lambda. This uses the basic Viterbi algorithm and a brute-force search: lambda is iterated from 0.0 towards 1.0 in steps of 0.01 (100 candidate values, the last being 0.99).
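
For reference, a compact sketch of the Viterbi algorithm in log space is given here. How lambda enters the probabilities inside `PosTagging` is not shown in this notebook, so the sketch leaves it out; the function name, the `floor` value for unseen events, and the toy probabilities are all assumptions. It expects a non-empty list of words.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely tag sequence for a non-empty `words` list."""
    def lg(p):
        return math.log(max(p, floor))   # floor avoids log(0) for unseen events

    # best[i][t]: best log-probability of any path ending in tag t at position i
    best = [{t: lg(start_p.get(t, 0)) + lg(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]                          # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lg(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lg(emit_p[t].get(words[i], 0))
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# toy model, invented for illustration
tags = ["PRP", "VB", "NN"]
start_p = {"PRP": 0.8, "NN": 0.2}
trans_p = {"PRP": {"VB": 0.9, "NN": 0.1},
           "VB": {"NN": 0.8, "PRP": 0.2},
           "NN": {"VB": 0.5, "NN": 0.5}}
emit_p = {"PRP": {"saya": 1.0},
          "VB": {"makan": 1.0},
          "NN": {"nasi": 0.7, "makan": 0.3}}
decoded = viterbi(["saya", "makan", "nasi"], tags, start_p, trans_p, emit_p)
print(decoded)  # ['PRP', 'VB', 'NN']
```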

In [7]:
lambdas = tagger.validate(validation, 0.01)
lambdas[-5:]

  0%|          | 0/100 [00:00<?, ?it/s]

[(0.9500000000000006, 0.918149878383074),
 (0.9600000000000006, 0.9186131809582642),
 (0.9700000000000006, 0.9187676151499942),
 (0.9800000000000006, 0.9193081348210493),
 (0.9900000000000007, 0.919578394656577)]
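
Values such as `0.9500000000000006` in the output above are what accumulated floating-point error from repeatedly adding the step typically looks like. A minimal sketch of a brute-force search that multiplies an integer index instead is shown below; `grid_search` and `toy_score` are hypothetical stand-ins, with `toy_score` playing the role of "validation accuracy for a given lambda".

```python
def grid_search(score, step=0.01):
    """Evaluate `score` at 0, step, 2*step, ... below 1.0.

    Returns a list of (value, score) pairs. Hypothetical helper.
    """
    n_steps = round(1 / step)
    results = []
    for i in range(n_steps):
        lam = round(i * step, 10)   # integer index times step: no accumulated drift
        results.append((lam, score(lam)))
    return results

def toy_score(lam):
    # invented stand-in for validation accuracy, peaking at 0.8
    return 1 - (lam - 0.8) ** 2

results = grid_search(toy_score)
# pick the pair with the highest score, then read off its lambda
best_lambda, best_score = max(results, key=lambda pair: pair[1])
print(best_lambda, best_score)
```

Selecting with `max(..., key=...)` on the score makes the choice explicit rather than relying on the ordering of the list.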

### Testing
After finding the lambda with the best validation accuracy, we pass it to the tagger as the parameter for testing.
In this test we use 0.99 as the lambda and obtain an accuracy of 91.68%.

In [8]:
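# lambdas holds (lambda, accuracy) pairs: pick the lambda whose accuracy is highest
best_lambdas = max(lambdas, key=lambda pair: pair[1])[0]
best_lambdas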

0.9900000000000007

In [9]:
test_accuracy_1 = tagger.test(testing, best_lambdas)
print("Test Accuracy: {:.2%}".format(test_accuracy_1))

Test Accuracy: 91.68%


### Get the Best Value for Lambda With Ternary Search
In this section we use ternary search instead of brute force to find the best lambda. Ternary search assumes that validation accuracy has a single peak as a function of lambda.

In [10]:
ternary_lambdas = tagger.validate_with_ternary_search(validation)
ternary_lambdas

started...
[try 23 ]  left ptr 0.9974 | right ptr 0.9975 | delta 0.0000 | best 0.9198 | time 4.3438

(0.99743693209634, 0.9197714373962396)
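
The body of `validate_with_ternary_search` is not shown here. As a generic sketch of ternary search for the maximum of a single-peaked function on an interval, with a hypothetical function name, tolerance, and toy scoring function:

```python
def ternary_search_max(score, left=0.0, right=1.0, tolerance=1e-4):
    """Locate the maximiser of a unimodal `score` on [left, right].

    Hypothetical helper; the tolerance is an illustrative choice.
    """
    while right - left > tolerance:
        third = (right - left) / 3
        m1, m2 = left + third, right - third
        if score(m1) < score(m2):
            left = m1       # the peak cannot lie to the left of m1
        else:
            right = m2      # the peak cannot lie to the right of m2
    best = (left + right) / 2
    return best, score(best)

def toy_score(lam):
    # invented stand-in for validation accuracy, peaking at 0.8
    return 1 - (lam - 0.8) ** 2

best_lambda, best_score = ternary_search_max(toy_score)
print(round(best_lambda, 3))
```

Each step discards a third of the interval at the cost of two evaluations, so far fewer evaluations are needed than with a fine grid, but the result is only reliable when the function really has a single peak.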

### Testing
After finding the lambda with the best validation accuracy, we pass it to the tagger as the parameter for testing.
In this test we use approximately 0.9974 as the lambda and obtain an accuracy of 91.69%.

In [11]:
test_accuracy_2 = tagger.test(testing, ternary_lambdas[0])
print("Test Accuracy: {:.2%}".format(test_accuracy_2))

Test Accuracy: 91.69%
