<a href="https://colab.research.google.com/github/zcqin/DL-nlp2025/blob/main/HMM_for_POSTagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HMM for POS Tagging

Zengchang Qin

zengchang.qin@gmail.com



Part-of-Speech (POS) tagging is the process of assigning a part of speech (such as noun, verb, adjective) to each word in a sentence. It is a fundamental task in Natural Language Processing (NLP) that plays a crucial role in syntactic parsing, named entity recognition, and many other language-related applications.

A Hidden Markov Model (HMM) is a statistical model commonly used for POS tagging. It represents the problem as a Markov process with hidden states (POS tags) and observed emissions (words in a sentence). HMMs leverage two key probabilistic components:

Transition probabilities: The probability of moving from one state (POS tag) to another.

Emission probabilities: The probability of a particular state (POS tag) emitting an observation (word).
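
As a rough illustration of these two components, the sketch below estimates them by simple counting over a tiny hand-made corpus. The corpus, the `<s>` start marker, and the helper name `normalize` are assumptions made for this example; they are not taken from Treebank or from NLTK.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (made up for illustration, not taken from Treebank).
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

transition_counts = defaultdict(Counter)  # previous tag -> next tag -> count
emission_counts = defaultdict(Counter)    # tag -> word -> count

for sent in toy_corpus:
    prev = "<s>"  # hypothetical start-of-sentence marker
    for word, tag in sent:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag

def normalize(counter):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}

transition_probs = {tag: normalize(c) for tag, c in transition_counts.items()}
emission_probs = {tag: normalize(c) for tag, c in emission_counts.items()}

print(transition_probs["DET"])  # P(next tag | DET)
print(emission_probs["NOUN"])   # P(word | NOUN)
```

In this toy corpus every DET is followed by a NOUN, so the transition estimate for that pair is 1, while the NOUN emission mass is split between "dog" and "cat".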

The most common method for finding the most probable sequence of POS tags for a given sentence is the Viterbi algorithm.
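
To make the decoding step concrete, here is a minimal sketch of Viterbi decoding in log space. It is not NLTK's implementation; the function name `viterbi` and all probability tables below are hypothetical toy values chosen for illustration.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    def log(p):
        # Zero probabilities become -inf so they can never win the max.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best path that ends in tag t at position i
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(emit_p[t].get(words[i], 0))
            back[i][t] = prev_tag

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy parameters (assumed values, not estimated from any corpus).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"VERB": 0.7, "NOUN": 0.3},
    "VERB": {"DET": 0.6, "NOUN": 0.4},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.6, "cat": 0.3, "barks": 0.1},
    "VERB": {"barks": 0.8, "sleeps": 0.2},
}

toy_tags = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(toy_tags)
```

Working in log space turns products of small probabilities into sums, which avoids numerical underflow on longer sentences.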

Set up

In [1]:
import numpy as np
import nltk
from nltk.tag import hmm
from nltk.corpus import treebank
from collections import defaultdict

Load the data

In [2]:
nltk.download('treebank')
nltk.download('universal_tagset')

# Load POS tagged sentences from Treebank corpus
tagged_sentences = treebank.tagged_sents(tagset='universal')

# Split into training and testing data
train_data = tagged_sentences[:3000]  # Train on 3000 sentences
test_data = tagged_sentences[3000:]   # Test on the remaining

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


Train the HMM Model

In [3]:
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train(train_data)

Evaluate the model on the test data

In [4]:
# evaluate() is deprecated in recent NLTK releases; accuracy(gold) is its replacement
accuracy = hmm_tagger.accuracy(test_data)
print(f"HMM Tagger Accuracy: {accuracy * 100:.2f}%")


HMM Tagger Accuracy: 51.60%


POS Tagging on a Sample Sentence

In [5]:
sentence = "The quick brown fox jumps over the lazy dog".split()
predicted_tags = hmm_tagger.tag(sentence)
print(predicted_tags)

[('The', 'DET'), ('quick', 'ADJ'), ('brown', 'NOUN'), ('fox', 'NOUN'), ('jumps', 'NOUN'), ('over', 'NOUN'), ('the', 'NOUN'), ('lazy', 'NOUN'), ('dog', 'NOUN')]
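
In the output above, every token after "quick" is tagged NOUN. A likely cause, stated here as an assumption rather than something verified in this notebook, is that the default supervised training uses unsmoothed maximum-likelihood estimates, so a word never seen in training has emission probability zero under every tag and the decoder can no longer tell the paths apart. The sketch below illustrates the effect with Lidstone (add-gamma) smoothing on made-up counts; `lidstone_prob`, the counts, and the vocabulary size are hypothetical.

```python
from collections import Counter

def lidstone_prob(counts, word, gamma, vocab_size):
    """P(word | tag) with add-gamma smoothing; gamma = 0 gives the plain MLE."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * vocab_size)

# Toy emission counts for the NOUN tag (assumed values, not from Treebank).
noun_counts = Counter({"dog": 2, "cat": 1})
vocab_size = 4  # assumption: a 4-word vocabulary that includes the unseen "fox"

mle_unseen = lidstone_prob(noun_counts, "fox", 0.0, vocab_size)
smoothed_unseen = lidstone_prob(noun_counts, "fox", 0.1, vocab_size)

print(mle_unseen)       # zero: an unseen word is impossible under the MLE
print(smoothed_unseen)  # small but non-zero after smoothing
```

In NLTK, `HiddenMarkovModelTrainer.train_supervised` accepts an `estimator` argument, and `nltk.probability.LidstoneProbDist` provides this kind of smoothing. Whether and how much it changes the accuracy reported above would have to be checked by re-running the evaluation.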
