A toy POS tagger built on a Hidden Markov Model (HMM).
- python3
- numpy
- sklearn
- data format example:
token1/tag1 token2/tag2 token3/tag3 ...
See data/raw_data.txt for more details.
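A line in this format can be parsed roughly as follows. This is a minimal sketch: the helper name `parse_line` is hypothetical, and splitting on the last `/` is an assumption made so that a token which itself contains a slash stays intact.

```python
def parse_line(line):
    """Split one 'token/tag token/tag ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rsplit on the last "/" so a token such as "1/2" keeps its slash.
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


# Made-up example line; the tag names are illustrative only.
print(parse_line("I/PRP like/VBP cats/NNS"))
```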
- split data/raw_data.txt into data/train.txt and data/test.txt with ratio 4:1
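import random

# Read all lines; each line holds token/tag pairs.
with open("data/raw_data.txt", "r", encoding="utf-8") as f:
    data = f.readlines()

# Shuffle, then hold out 20% as the test set (train:test = 4:1).
random.shuffle(data)
pivot = int(0.2 * len(data))
testset = data[:pivot]
trainset = data[pivot:]

# trainset and testset are lists of strings, so use writelines(), not write().
with open("data/train.txt", "w", encoding="utf-8") as f:
    f.writelines(trainset)
with open("data/test.txt", "w", encoding="utf-8") as f:
    f.writelines(testset)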
- estimate the HMM's initialization matrix, transition matrix and emission matrix on the trainset. They are generated by train.py and cached as initial_np.pkl, transit_np.pkl and emit_np.pkl:
$ python train.py
train.py also caches the lookup tables as token2idx.json and tag2idx.json.
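The three matrices can be estimated by counting over the tagged training sentences. The sketch below is not the code in train.py: the function name `estimate_hmm`, the add-one smoothing, and the use of plain lists instead of numpy arrays are all assumptions made to keep the example dependency-free.

```python
def estimate_hmm(sentences, smoothing=1.0):
    """Estimate HMM parameters from sentences given as [(token, tag), ...].

    Returns token2idx, tag2idx, and row-normalized initial, transition,
    and emission probabilities. Add-`smoothing` counting is an assumption.
    """
    tokens = sorted({tok for sent in sentences for tok, _ in sent})
    tags = sorted({tag for sent in sentences for _, tag in sent})
    token2idx = {tok: i for i, tok in enumerate(tokens)}
    tag2idx = {tag: i for i, tag in enumerate(tags)}
    n_tags, n_tokens = len(tags), len(tokens)

    # Start every count at `smoothing` so unseen events keep nonzero mass.
    initial = [smoothing] * n_tags
    transit = [[smoothing] * n_tags for _ in range(n_tags)]
    emit = [[smoothing] * n_tokens for _ in range(n_tags)]

    for sent in sentences:
        prev = None
        for tok, tag in sent:
            t = tag2idx[tag]
            emit[t][token2idx[tok]] += 1
            if prev is None:
                initial[t] += 1        # first tag of the sentence
            else:
                transit[prev][t] += 1  # tag bigram prev -> t
            prev = t

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    return (token2idx, tag2idx, normalize(initial),
            [normalize(r) for r in transit], [normalize(r) for r in emit])


# Made-up training data for illustration.
train = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")],
         [("the", "DT"), ("cat", "NN")]]
token2idx, tag2idx, initial, transit, emit = estimate_hmm(train)
print(tag2idx)
print(initial)
```

Each row of the transition and emission tables sums to 1, so row i can be read as a distribution over the next tag or the emitted token given tag i.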
- evaluate micro-f1, precision, recall and accuracy on the testset using the HMM learned in the training step (train.py):
$ python test.py
micro-f1 score: 0.7452485032055516
precision score: 0.7452485032055516
recall score: 0.7452485032055516
accuracy score: 0.7452485032055516
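The four scores are identical. That is expected if test.py micro-averages over all tags (an assumption; the README does not show how the scores are computed): every token gets exactly one predicted tag, so each error is one false positive for the predicted tag and one false negative for the gold tag, and micro-precision, micro-recall, micro-f1 and accuracy coincide. A dependency-free sketch with made-up tags (the function name `micro_scores` is hypothetical):

```python
def micro_scores(gold, pred):
    """Micro-averaged precision, recall, F1, plus accuracy, for single-label tagging."""
    labels = set(gold) | set(pred)
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:   # predicted this label, gold is different
                fp += 1
            elif g == label:   # gold is this label, prediction missed it
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Guard against 0/0 when nothing was predicted correctly.
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return precision, recall, f1, accuracy


gold = ["DT", "NN", "VB", "DT", "NN"]
pred = ["DT", "NN", "NN", "DT", "VB"]
print(micro_scores(gold, pred))  # all four values agree (3 of 5 tags correct)
```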
test.py also caches its hypothesis (the predictions) as pred.txt.
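The README does not say how test.py turns the learned matrices into tags. Viterbi decoding is the usual choice for an HMM, so the following is a sketch under that assumption, not the repository's implementation. The function name `viterbi` and the toy probabilities are illustrative; inputs are index sequences, as they would be after mapping tokens through token2idx.

```python
import math


def viterbi(obs, initial, transit, emit):
    """Most likely tag-index sequence for a non-empty list of token indices.

    initial[i], transit[i][j] and emit[i][k] are probabilities and must be
    > 0 (e.g. smoothed), because the recursion works on their logarithms.
    """
    n_tags = len(initial)
    # score[i]: best log-probability of any path ending in tag i.
    score = [math.log(initial[i]) + math.log(emit[i][obs[0]])
             for i in range(n_tags)]
    backptrs = []
    for o in obs[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            # Best previous tag for arriving in tag j.
            best = max(range(n_tags),
                       key=lambda i: score[i] + math.log(transit[i][j]))
            ptr.append(best)
            new_score.append(score[best] + math.log(transit[best][j])
                             + math.log(emit[j][o]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final tag.
    tag = max(range(n_tags), key=lambda i: score[i])
    path = [tag]
    for ptr in reversed(backptrs):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Toy model with 2 tags and 3 token types (made-up numbers).
initial = [0.6, 0.4]
transit = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(viterbi([0, 1, 2], initial, transit, emit))  # -> [0, 0, 1]
```

Working in log space avoids numeric underflow on long sentences, which is why the sketch requires strictly positive probabilities.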