HMM POS tagger

A toy part-of-speech (POS) tagger based on a Hidden Markov Model (HMM).

Requirements

python3
numpy
sklearn

Usage

Data format example (one sentence per line, tokens separated by spaces):

token1/tag1 token2/tag2 token3/tag3 ...

See data/raw_data.txt for more details.
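As an illustration of the format above, here is a minimal sketch of reading one line into (token, tag) pairs. The function name `parse_line` is hypothetical and not part of this repository, and splitting on the last slash is an assumption about how tokens that themselves contain "/" should be handled.

```python
def parse_line(line):
    """Split one 'token/tag token/tag ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Assumption: the tag follows the LAST slash, so a token such as
        # "1/2" keeps its own slash.
        token, sep, tag = item.rpartition("/")
        if not sep:
            continue  # skip malformed items that contain no slash
        pairs.append((token, tag))
    return pairs


print(parse_line("token1/tag1 token2/tag2 token3/tag3"))
# [('token1', 'tag1'), ('token2', 'tag2'), ('token3', 'tag3')]
```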

Split data/raw_data.txt into data/train.txt and data/test.txt with a ratio of 4:1:

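import random

# Read the corpus, one tagged sentence per line.
with open("data/raw_data.txt", "r", encoding="utf-8") as f:
    data = f.readlines()

# Shuffle, then hold out the first 20% as the test set (4:1 split).
random.shuffle(data)
pivot = int(0.2 * len(data))
testset = data[:pivot]
trainset = data[pivot:]

# Write each split; writelines() takes the list of lines as-is.
with open("data/train.txt", "w", encoding="utf-8") as f:
    f.writelines(trainset)
with open("data/test.txt", "w", encoding="utf-8") as f:
    f.writelines(testset)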

Estimate the HMM's initialization matrix, transition matrix and emission matrix on the training set. train.py generates them and caches them as initial_np.pkl, transit_np.pkl and emit_np.pkl:

python train.py

train.py also caches the lookup tables as token2idx.json and tag2idx.json.
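For readers unfamiliar with how the three matrices are obtained, the sketch below estimates them by counting over tagged sentences. It is not taken from train.py: the function name `estimate_hmm`, the use of plain dictionaries instead of numpy arrays, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter


def estimate_hmm(sentences, smoothing=1.0):
    """Estimate HMM parameters from sentences given as [(token, tag), ...] lists.

    Returns (tags, initial, transit, emit), where initial[tag],
    transit[prev][cur] and emit[tag][token] are probabilities.
    """
    tags = sorted({tag for sent in sentences for _, tag in sent})
    tokens = sorted({tok for sent in sentences for tok, _ in sent})
    init_c, trans_c, emit_c = Counter(), Counter(), Counter()
    for sent in sentences:
        if not sent:
            continue
        init_c[sent[0][1]] += 1                 # tag of the first token
        for tok, tag in sent:
            emit_c[(tag, tok)] += 1             # tag emits token
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            trans_c[(prev, cur)] += 1           # prev tag followed by cur tag

    def normalize(counts):
        # Additive smoothing (an assumption) keeps every probability > 0.
        total = sum(counts.values()) + smoothing * len(counts)
        return {k: (v + smoothing) / total for k, v in counts.items()}

    initial = normalize({t: init_c[t] for t in tags})
    transit = {p: normalize({c: trans_c[(p, c)] for c in tags}) for p in tags}
    emit = {t: normalize({w: emit_c[(t, w)] for w in tokens}) for t in tags}
    return tags, initial, transit, emit


# Made-up two-sentence corpus, for illustration only.
corpus = [[("the", "DT"), ("dog", "NN")], [("a", "DT"), ("cat", "NN")]]
tags, initial, transit, emit = estimate_hmm(corpus)
print(initial["DT"], transit["DT"]["NN"])  # 0.75 0.75
```

Smoothing matters mostly for the emission matrix: without it, any token/tag pair unseen in training gets probability zero.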

Evaluate micro-F1, precision, recall and accuracy on the test set using the HMM learned in the training step:

$ python test.py
micro-f1 score: 0.7452485032055516
precision score: 0.7452485032055516
recall score: 0.7452485032055516
accuracy score: 0.7452485032055516
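The four identical scores are expected rather than a bug, assuming precision and recall are micro-averaged like the F1 score. In single-label tagging every wrong prediction counts as one false positive (for the predicted tag) and one false negative (for the gold tag), so micro-averaged precision, recall and F1 all equal accuracy. The hand-rolled sketch below shows this on made-up labels; `micro_scores` is a hypothetical helper, not the code used by test.py.

```python
def micro_scores(gold, pred):
    """Micro-averaged precision, recall, F1 and accuracy for single-label tags."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1   # predicted this label, and it was right
            elif p == label:
                fp += 1   # predicted this label, but gold differs
            elif g == label:
                fn += 1   # gold is this label, but it was missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return precision, recall, f1, accuracy


# Made-up labels: two of four predictions are correct.
gold = ["DT", "NN", "VB", "NN"]
pred = ["DT", "NN", "NN", "VB"]
print(micro_scores(gold, pred))  # (0.5, 0.5, 0.5, 0.5)
```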

test.py also writes its predicted tags (the hypothesis) to pred.txt.
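This README does not say how test.py decodes a sentence. A common choice for HMM taggers is the Viterbi algorithm, sketched below in log space under that assumption. The function name `viterbi`, the dictionary-based parameters and the `unk` fallback probability for unseen tokens are all hypothetical, and the probabilities in the example are made up.

```python
import math


def viterbi(tokens, tags, initial, transit, emit, unk=1e-8):
    """Return the most likely tag sequence for tokens under the given HMM.

    Assumes every initial and transition probability is strictly positive;
    tokens missing from emit[tag] fall back to the small probability `unk`.
    """
    if not tokens:
        return []

    def log_emit(tag, tok):
        return math.log(emit[tag].get(tok, unk))

    # score[i][tag]: best log-probability of any tag path ending in `tag` at i.
    score = [{t: math.log(initial[t]) + log_emit(t, tokens[0]) for t in tags}]
    back = []  # back[i][tag]: best previous tag for `tag` at position i + 1
    for tok in tokens[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for tag in tags:
            best = max(tags, key=lambda p: prev[p] + math.log(transit[p][tag]))
            cur[tag] = prev[best] + math.log(transit[best][tag]) + log_emit(tag, tok)
            ptr[tag] = best
        score.append(cur)
        back.append(ptr)

    # Start from the best final tag and follow the back-pointers.
    path = [max(tags, key=lambda t: score[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Made-up parameters, for illustration only.
tags = ["DT", "NN"]
initial = {"DT": 0.8, "NN": 0.2}
transit = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
emit = {"DT": {"the": 0.9, "dog": 0.1}, "NN": {"the": 0.1, "dog": 0.9}}
print(viterbi(["the", "dog"], tags, initial, transit, emit))  # ['DT', 'NN']
```

Working with log-probabilities avoids numeric underflow on long sentences, where the product of many small probabilities would otherwise round to zero.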

SkyAndCloud/HMM-pos-tagger
