<center>

# Part-of-Speech (POS) Tagging in NLP

</center>

---

## 1. Introduction

**Part-of-Speech (POS) Tagging** is a fundamental task in Natural Language Processing (NLP) that involves assigning **a grammatical category** (such as noun, verb, adjective) to each word in a text.  

POS tagging is important because it:
- Helps understand the **syntactic structure** of sentences  
- Improves **information extraction** and **named entity recognition**  
- Forms a basis for **parsing, machine translation, and sentiment analysis**  


## 2. Definition

Given a sentence: "I love natural language processing"


POS tagging assigns:
I → PRON (pronoun) <br>
love → VERB  <br>
natural → ADJ (adjective)  <br>
language → NOUN  <br>
processing → NOUN


Formally, POS tagging is the task of mapping:

$$ \text{Sentence } S = [w_1, w_2, \ldots, w_n] \quad \longrightarrow \quad [t_1, t_2, \ldots, t_n] $$
where $w_i$ is a word and $t_i$ is its POS tag.
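
As a minimal sketch of this mapping, the cell below treats a tagger as a function from a list of words to a list of tags of the same length. The lexicon `TOY_LEXICON` and the function `lookup_tagger` are hypothetical names made up for illustration; they cover only the example sentence and are not how a real tagger works.

```python
from typing import Dict, List

# Hypothetical toy lexicon: covers only the example sentence above.
TOY_LEXICON: Dict[str, str] = {
    "i": "PRON",
    "love": "VERB",
    "natural": "ADJ",
    "language": "NOUN",
    "processing": "NOUN",
}


def lookup_tagger(words: List[str]) -> List[str]:
    """Map [w_1, ..., w_n] to [t_1, ..., t_n] by dictionary lookup.

    Unknown words get "X", the Universal POS tag for "other".
    """
    return [TOY_LEXICON.get(word.lower(), "X") for word in words]


sentence = "I love natural language processing".split()
tags = lookup_tagger(sentence)

# The output always has exactly one tag per input word.
assert len(tags) == len(sentence)
for word, tag in zip(sentence, tags):
    print(f"{word} -> {tag}")
```

A pure lookup ignores context, which is why the approaches in Section 4 look at neighbouring words or tags as well.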

## 3. POS Tag Categories

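The tag inventory depends on the tagging standard. The example in Section 2 uses coarse Universal POS tags (PRON, VERB, NOUN), while the table below lists common fine-grained tags from the Penn Treebank tagset, which is widely used for English: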

| POS Tag | Meaning                                  | Example        |
|---------|------------------------------------------|----------------|
| NN      | Noun, singular or mass                   | cat, language  |
| NNS     | Noun, plural                             | cats, dogs     |
| VB      | Verb, base form                          | run, eat       |
| VBD     | Verb, past tense                         | ran, ate       |
| VBG     | Verb, gerund or present participle       | running        |
| JJ      | Adjective                                | beautiful      |
| RB      | Adverb                                   | quickly        |
| PRP     | Pronoun, personal                        | I, he, she     |
| DT      | Determiner                               | the, a         |
| IN      | Preposition or subordinating conjunction | in, on         |


## 4. Approaches for POS Tagging

### 4.1 Rule-Based
- Uses **hand-crafted grammatical rules**  
- Example: if a word ends in “-ing”, tag as **VBG**  
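
The suffix rule above can be sketched as a tiny ordered rule list. The rules, the default tag, and the name `rule_based_tag` are all assumptions chosen for illustration; a real rule-based tagger would use a lexicon and far more rules.

```python
import re

# Ordered (pattern, tag) rules; the first matching rule wins.
# These rules are illustrative assumptions, not a complete grammar.
RULES = [
    (re.compile(r"^(the|a|an)$", re.IGNORECASE), "DT"),
    (re.compile(r".+ing$"), "VBG"),
    (re.compile(r".+ed$"), "VBD"),
    (re.compile(r".+ly$"), "RB"),
    (re.compile(r".+s$"), "NNS"),
]
DEFAULT_TAG = "NN"  # fall back to singular noun when no rule fires


def rule_based_tag(words):
    """Tag each word with the first rule whose pattern matches it."""
    tagged = []
    for word in words:
        for pattern, tag in RULES:
            if pattern.match(word):
                tagged.append((word, tag))
                break
        else:
            tagged.append((word, DEFAULT_TAG))
    return tagged


print(rule_based_tag(["the", "cats", "quickly", "jumped", "running"]))
# The same "-ing" rule wrongly tags the noun "morning" as VBG.
print(rule_based_tag(["morning"]))
```

The second call shows the main weakness of surface rules: they fire on any word with the right shape, regardless of context.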

### 4.2 Stochastic / Probabilistic
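- Uses **Hidden Markov Models (HMM)** or **n-gram models**  
- Chooses the tag sequence with the **highest probability** for the whole sentence  
- Example: an HMM combines emission probabilities `P(word_i|tag_i)` with transition probabilities `P(tag_i|tag_{i-1})`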
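
To make the probabilistic idea concrete, here is a minimal sketch of Viterbi decoding for a bigram HMM tagger. All probabilities below are made-up toy numbers (not estimated from any corpus), and the function name `viterbi` and the `unk` smoothing constant are assumptions for illustration. The model scores a tag sequence by multiplying a start probability, transition probabilities between adjacent tags, and the probability of each word given its tag.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    `unk` is a small assumed probability for word/tag pairs missing from emit_p.
    """
    if not words:
        return []
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], unk) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best sequence
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], unk)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model: every number here is an illustrative assumption.
TAGS = ["PRON", "VERB", "DET", "NOUN"]
START_P = {"PRON": 0.5, "VERB": 0.05, "DET": 0.4, "NOUN": 0.05}
TRANS_P = {
    "PRON": {"PRON": 0.05, "VERB": 0.8, "DET": 0.05, "NOUN": 0.1},
    "VERB": {"PRON": 0.1, "VERB": 0.05, "DET": 0.6, "NOUN": 0.25},
    "DET": {"PRON": 0.01, "VERB": 0.04, "DET": 0.01, "NOUN": 0.94},
    "NOUN": {"PRON": 0.1, "VERB": 0.5, "DET": 0.2, "NOUN": 0.2},
}
EMIT_P = {
    "PRON": {"i": 1.0},
    "VERB": {"left": 0.6},
    "DET": {"the": 1.0},
    "NOUN": {"left": 0.3, "room": 0.7},
}

print(viterbi(["i", "left", "the", "room"], TAGS, START_P, TRANS_P, EMIT_P))
print(viterbi(["the", "left"], TAGS, START_P, TRANS_P, EMIT_P))
```

With these toy numbers, "left" comes out as VERB after a pronoun and as NOUN after a determiner, because the transition probabilities outweigh the word's own preference. Real implementations usually work with log probabilities to avoid underflow on long sentences.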

### 4.3 Machine Learning / Deep Learning
- Models trained on **tagged corpora**  
- Common algorithms:  
  - Maximum Entropy  
  - Conditional Random Fields (CRF)  
  - BiLSTM + CRF  
  - Transformers (BERT, RoBERTa)
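
Feature-based models such as Maximum Entropy and CRF taggers do not see raw words; they see features extracted per token. The sketch below shows one plausible feature function. The feature names and the helper `word_features` are assumptions for illustration, and the choice of features is a design decision rather than a fixed standard.

```python
def word_features(words, i):
    """Build a feature dict for the token at position i (illustrative features only)."""
    word = words[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),          # suffixes hint at tags like VBG or RB
        "is_capitalized": word[:1].isupper(),  # capitalization hints at proper nouns
        "is_digit": word.isdigit(),
        "prev": words[i - 1].lower() if i > 0 else "<BOS>",
        "next": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }


sentence = ["I", "love", "natural", "language", "processing"]
features = [word_features(sentence, i) for i in range(len(sentence))]
print(features[4])
```

A trained model then learns a weight for each feature and tag combination from a tagged corpus. Neural taggers (BiLSTM, Transformers) replace this manual step with learned representations.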


In [1]:
import nltk
import spacy
import pandas as pd

from nltk import word_tokenize, pos_tag

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# Load spaCy model
nlp = spacy.load("en_core_web_sm")

In [5]:
doc = nlp("Natural Language Processing helps computers understand human language")

In [6]:
doc.text

'Natural Language Processing helps computers understand human language'

In [7]:
doc[-1]

language

In [8]:
doc[2].pos_

'NOUN'

In [9]:
doc[2].tag_

'NN'

In [10]:
spacy.explain('VB')

'verb, base form'

In [11]:
# Tokenize into words
tokens = word_tokenize(doc.text)  # pass the Doc's raw text to NLTK
print(tokens)

['Natural', 'Language', 'Processing', 'helps', 'computers', 'understand', 'human', 'language']


In [12]:
nltk_pos_tags = pos_tag(tokens)
nltk_pos_tags

[('Natural', 'JJ'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('helps', 'VBZ'),
 ('computers', 'NNS'),
 ('understand', 'VBP'),
 ('human', 'JJ'),
 ('language', 'NN')]

Common POS Tags:
- NN  → Noun
- VB  → Verb
- JJ  → Adjective
- RB  → Adverb
- PRP → Pronoun


In [13]:
# doc is already a processed spaCy Doc (see cell 5), so nlp() does not need to be called again

spacy_pos = [(token.text, token.pos_, token.tag_) for token in doc]
spacy_pos


[('Natural', 'PROPN', 'NNP'),
 ('Language', 'PROPN', 'NNP'),
 ('Processing', 'NOUN', 'NN'),
 ('helps', 'VERB', 'VBZ'),
 ('computers', 'NOUN', 'NNS'),
 ('understand', 'VERB', 'VB'),
 ('human', 'ADJ', 'JJ'),
 ('language', 'NOUN', 'NN')]
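
The NLTK and spaCy outputs above do not fully agree. As a small sketch, the cell below compares the two fine-grained tag lists token by token. The lists are copied by hand from the outputs shown above, and `compare_tags` is a hypothetical helper; it assumes both tools produced the same tokenization, which holds for this sentence.

```python
# Fine-grained tags copied from the NLTK and spaCy outputs above.
words = ["Natural", "Language", "Processing", "helps",
         "computers", "understand", "human", "language"]
nltk_tags = ["JJ", "NNP", "NNP", "VBZ", "NNS", "VBP", "JJ", "NN"]
spacy_tags = ["NNP", "NNP", "NN", "VBZ", "NNS", "VB", "JJ", "NN"]


def compare_tags(words, tags_a, tags_b):
    """Return the agreement rate and the list of (word, tag_a, tag_b) disagreements."""
    disagreements = [
        (word, a, b) for word, a, b in zip(words, tags_a, tags_b) if a != b
    ]
    agreement = 1 - len(disagreements) / len(words)
    return agreement, disagreements


agreement, disagreements = compare_tags(words, nltk_tags, spacy_tags)
print(f"Agreement: {agreement:.3f}")
for word, nltk_tag, spacy_tag in disagreements:
    print(f"{word}: NLTK={nltk_tag}, spaCy={spacy_tag}")
```

The two taggers differ on "Natural", "Processing", and "understand", which shows that tagger output depends on the model and its training data, not only on the sentence.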

In [14]:
df_pos = pd.DataFrame(spacy_pos, columns=["Word", "POS", "Detailed_Tag"])
df_pos

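|   | Word       | POS   | Detailed_Tag |
|---|------------|-------|--------------|
| 0 | Natural    | PROPN | NNP          |
| 1 | Language   | PROPN | NNP          |
| 2 | Processing | NOUN  | NN           |
| 3 | helps      | VERB  | VBZ          |
| 4 | computers  | NOUN  | NNS          |
| 5 | understand | VERB  | VB           |
| 6 | human      | ADJ   | JJ           |
| 7 | language   | NOUN  | NN           |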


In [15]:
df_pos['POS'].value_counts()

POS
NOUN     3
PROPN    2
VERB     2
ADJ      1
Name: count, dtype: int64

In [16]:
pos_counts = df_pos['POS'].value_counts().to_dict()
pos_counts

{'NOUN': 3, 'PROPN': 2, 'VERB': 2, 'ADJ': 1}

In [17]:
for word in doc:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

Natural ------> PROPN NNP noun, proper singular
Language ------> PROPN NNP noun, proper singular
Processing ------> NOUN NN noun, singular or mass
helps ------> VERB VBZ verb, 3rd person singular present
computers ------> NOUN NNS noun, plural
understand ------> VERB VB verb, base form
human ------> ADJ JJ adjective (English), other noun-modifier (Chinese)
language ------> NOUN NN noun, singular or mass


In [18]:
doc2 = nlp(u"I left the room")
for word in doc2:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

I ------> PRON PRP pronoun, personal
left ------> VERB VBD verb, past tense
the ------> DET DT determiner
room ------> NOUN NN noun, singular or mass


In [19]:
doc3 = nlp(u"to the left of the room")
for word in doc3:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

to ------> ADP IN conjunction, subordinating or preposition
the ------> DET DT determiner
left ------> NOUN NN noun, singular or mass
of ------> ADP IN conjunction, subordinating or preposition
the ------> DET DT determiner
room ------> NOUN NN noun, singular or mass


In [20]:
doc4 = nlp(u"I read books on history")
for word in doc4:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

I ------> PRON PRP pronoun, personal
read ------> VERB VBP verb, non-3rd person singular present
books ------> NOUN NNS noun, plural
on ------> ADP IN conjunction, subordinating or preposition
history ------> NOUN NN noun, singular or mass


In [21]:
doc5 = nlp(u"I have read a book on history")
for word in doc5:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

I ------> PRON PRP pronoun, personal
have ------> AUX VBP verb, non-3rd person singular present
read ------> VERB VBN verb, past participle
a ------> DET DT determiner
book ------> NOUN NN noun, singular or mass
on ------> ADP IN conjunction, subordinating or preposition
history ------> NOUN NN noun, singular or mass


In [22]:
doc6 = nlp(u"The quick brown fox jumped over the lazy dog")

In [23]:
from spacy import displacy

In [24]:
displacy.render(doc6,style='dep',jupyter=True)

In [25]:
options={
    'distance':80,
    'compact':True,
    'color':'#fff',
    'bg':'#00a65a'
}

In [26]:
displacy.render(doc6,style='dep',jupyter=True,options=options)

## 5. Applications of POS Tagging

- Named Entity Recognition (NER): POS tags such as proper noun serve as features for identifying entities like persons and organizations.
- Information Retrieval: Better search ranking using syntactic understanding.
- Machine Translation: Helps in producing grammatically correct sentences.
- Text-to-Speech (TTS): Correct pronunciation depends on POS (e.g., “read” present vs past tense).
- Sentiment Analysis: Helps identify adjectives and adverbs for opinion mining.
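
As a small sketch of the text-to-speech case, the cell below picks a pronunciation for "read" from its fine-grained tag. The tags are the ones spaCy assigned in the "I read books on history" and "I have read a book on history" cells earlier; the function `pronounce_read` and the respellings "reed" and "red" are assumptions for illustration, not part of any TTS library.

```python
# Tags that mark a past-tense or past-participle use of a verb.
PAST_TAGS = {"VBD", "VBN"}


def pronounce_read(tag: str) -> str:
    """Pick a rough respelling for "read" from its Penn Treebank tag (hypothetical helper)."""
    return "red" if tag in PAST_TAGS else "reed"


# (sentence, tag of "read") pairs taken from the spaCy outputs earlier in this notebook.
examples = [
    ("I read books on history", "VBP"),
    ("I have read a book on history", "VBN"),
]
for sentence, tag in examples:
    print(f"{sentence!r}: read/{tag} -> {pronounce_read(tag)}")
```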

## 6. Challenges in POS Tagging

- Ambiguity: A word can take different POS tags depending on context.
  - Example: “left” is a verb in “I left the room” but a noun in “to the left of the room”.
- Out-of-Vocabulary Words: Words not in training data.
- Morphologically rich languages: Inflections and compound words increase complexity.
- Domain-specific text: Scientific or technical texts may have unseen vocabulary.

## 7. Summary

- POS tagging assigns syntactic categories to words in text.
- Approaches include rule-based, probabilistic, and neural network-based methods.
- It is crucial for syntactic analysis, downstream NLP tasks, and semantic understanding.
- Libraries like NLTK, spaCy, and Stanford NLP provide robust POS taggers for research and projects.

<div style="text-align: right;">
    <b>Author:</b> Monower Hossen <br>
    <b>Date:</b> January 13, 2026
</div>
