# Basic NLP Tasks – Tokenization, POS Tagging, and Shallow Parsing

In this lab, we demonstrate three fundamental NLP tasks:

1. **Tokenization** – splitting text into word and punctuation tokens.
2. **Part-of-Speech (POS) Tagging** – labeling each token with a grammatical tag.
3. **Shallow Parsing / Chunking** – grouping tagged tokens into noun phrases, verb phrases, and prepositional phrases.

We will use a sample sentence and perform these tasks step by step.


In [1]:
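import nltk
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag
nltk.download('maxent_ne_chunker')           # named-entity chunker model (nltk.ne_chunk)
nltk.download('words')                       # word list required by the named-entity chunker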

from nltk.tokenize import word_tokenize

# Sample sentence
sentence = "The clever brown fox leaps over the sleepy dog."

# Tokenization
tokens = word_tokenize(sentence)
print("Word Tokens:", tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


Word Tokens: ['The', 'clever', 'brown', 'fox', 'leaps', 'over', 'the', 'sleepy', 'dog', '.']
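To make the tokenization step less of a black box, here is a minimal sketch of a regex-based tokenizer that needs no downloaded models. The function name `simple_tokenize` is an assumption for illustration and is not part of NLTK. It agrees with `word_tokenize` on the sample sentence but not in general.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The clever brown fox leaps over the sleepy dog."
print("Word Tokens:", simple_tokenize(sentence))

# Contractions show where this sketch is cruder than word_tokenize
print(simple_tokenize("don't"))  # ['don', "'", 't']
```

NLTK's `word_tokenize` splits `"don't"` into `['do', "n't"]`, which is one reason to prefer it over a plain regex for real text.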


## Part-of-Speech (POS) Tagging

We assign a grammatical tag to each token. The tags below follow the Penn Treebank tag set:

- **DT** – Determiner  
- **JJ** – Adjective  
- **NN** – Noun (singular or mass)  
- **VBZ** – Verb (3rd person singular present)  
- **IN** – Preposition or subordinating conjunction
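For comparison with hand-written rules, NLTK ships a trained tagger that emits these same Penn Treebank tags through `nltk.pos_tag`. The sketch below wraps that call in a hypothetical helper, `tag_with_nltk`, which returns `None` when NLTK or the `averaged_perceptron_tagger` data is not installed. The exact tags depend on the installed model, so no output is shown.

```python
def tag_with_nltk(tokens):
    """Tag tokens with NLTK's pretrained tagger.

    Returns a list of (token, tag) pairs, or None if NLTK or its
    tagger model is not available in this environment.
    """
    try:
        import nltk
        return nltk.pos_tag(tokens)
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed
        # LookupError: the tagger model has not been downloaded
        return None

tokens = ['The', 'clever', 'brown', 'fox', 'leaps', 'over',
          'the', 'sleepy', 'dog', '.']
print("NLTK POS Tags:", tag_with_nltk(tokens))
```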


In [2]:
# Simple rule-based POS tagging
pos_tags = []

for token in tokens:
    if token in ['.', '!', '?']:
        # Sentence-final punctuation has its own Penn Treebank tag: '.'
        pos_tags.append((token, '.'))
    elif token.lower() in ['the', 'a', 'an']:
        pos_tags.append((token, 'DT'))
    elif token.lower() in ['in', 'on', 'over', 'under']:
        pos_tags.append((token, 'IN'))
    elif token.endswith('s'):
        # Crude heuristic: also matches plural nouns such as "dogs"
        pos_tags.append((token, 'VBZ'))
    elif token.endswith('y') or token.lower() in ['clever', 'brown', 'sleepy']:
        pos_tags.append((token, 'JJ'))
    else:
        # Default: treat everything else as a noun
        pos_tags.append((token, 'NN'))

print("POS Tags:", pos_tags)


POS Tags: [('The', 'DT'), ('clever', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('leaps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('sleepy', 'JJ'), ('dog', 'NN'), ('.', '.')]
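The rules fit the sample sentence because they were written for it. As a rough illustration of where they break, the sketch below condenses the word rules into a hypothetical function `rule_tag` and applies it to a few probe words. The probe words and their "usual" tags are assumptions chosen for this example, not data from the lab.

```python
def rule_tag(token):
    """Condensed copy of the lab's rules for word tokens (punctuation not handled)."""
    word = token.lower()
    if word in ('the', 'a', 'an'):
        return 'DT'
    if word in ('in', 'on', 'over', 'under'):
        return 'IN'
    if token.endswith('s'):
        return 'VBZ'
    if token.endswith('y') or word in ('clever', 'brown', 'sleepy'):
        return 'JJ'
    return 'NN'

# Hypothetical probe words paired with the tag they most commonly take
probes = [('dogs', 'NNS'), ('city', 'NN'), ('runs', 'VBZ'), ('quick', 'JJ')]

for word, usual in probes:
    print(f"{word:6} rule={rule_tag(word):4} usual={usual}")
```

Only `runs` comes out right: the suffix rules tag `dogs` as a verb and `city` as an adjective, and `quick` falls through to the noun default.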


## Shallow Parsing / Chunking

We will extract:

1. **Noun Phrases (NP)** – an optional determiner, zero or more adjectives, then a noun  
2. **Verb Phrases (VP)** – a verb, optionally followed by a noun phrase that starts immediately after it  
3. **Prepositional Phrases (PP)** – a preposition followed immediately by a noun phrase
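The three patterns above can also be written as regular expressions over the tag sequence. The following is a minimal sketch of that idea using only the standard library. The names `find_chunks` and `PATTERNS` are assumptions for illustration, and the tagged sentence is written out by hand so the block runs on its own.

```python
import re

# Hand-tagged sample sentence (Penn Treebank tags)
tagged = [('The', 'DT'), ('clever', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
          ('leaps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('sleepy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]

NP = r"(?:<DT>)?(?:<JJ>)*<NN>"          # optional DT, any number of JJ, then NN
PATTERNS = {
    "NP": NP,
    "VP": rf"<VBZ?>(?:{NP})?",          # VB or VBZ, optionally followed by an NP
    "PP": rf"<IN>{NP}",                 # IN followed by an NP
}

def find_chunks(tagged, pattern):
    """Return the token spans whose tag sequence matches the pattern."""
    # Encode the tags as one string, e.g. "<DT><JJ><JJ><NN>..."
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Each "<" before a position marks one token, so counting them
        # converts character offsets back into token indices.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

for label, pattern in PATTERNS.items():
    print(label, find_chunks(tagged, pattern))
```

Because the patterns are data rather than control flow, adding a new phrase type only means adding one entry to `PATTERNS`.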


In [3]:
def extract_noun_phrases(pos_tags):
    noun_phrases = []
    i = 0
    while i < len(pos_tags):
        start = i
        # Optional determiner
        if pos_tags[i][1] == 'DT':
            i += 1
        # Zero or more adjectives
        while i < len(pos_tags) and pos_tags[i][1] == 'JJ':
            i += 1
        # Head noun: must contain a letter or digit, so punctuation
        # that was mis-tagged as NN is never reported as a phrase
        if (i < len(pos_tags) and pos_tags[i][1] == 'NN'
                and any(ch.isalnum() for ch in pos_tags[i][0])):
            i += 1
            noun_phrases.append(" ".join(word for word, _ in pos_tags[start:i]))
        elif i == start:
            # Nothing matched at this position; move to the next token
            i += 1
    return noun_phrases

noun_phrases = extract_noun_phrases(pos_tags)
print("Noun Phrases:", noun_phrases)


Noun Phrases: ['The clever brown fox', 'the sleepy dog']


In [4]:
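def extract_verb_phrases(pos_tags):
    verb_phrases = []
    i = 0
    while i < len(pos_tags):
        if pos_tags[i][1] in ['VB', 'VBZ']:
            vp = [pos_tags[i][0]]
            i += 1
            # Attach a noun phrase only if it starts right after the verb;
            # searching the whole remainder would skip over words in between
            j = i
            if j < len(pos_tags) and pos_tags[j][1] == 'DT':
                j += 1
            while j < len(pos_tags) and pos_tags[j][1] == 'JJ':
                j += 1
            if j < len(pos_tags) and pos_tags[j][1] == 'NN':
                vp.extend(word for word, _ in pos_tags[i:j + 1])
                i = j + 1
            verb_phrases.append(" ".join(vp))
        else:
            i += 1
    return verb_phrases

verb_phrases = extract_verb_phrases(pos_tags)
print("Verb Phrases:", verb_phrases)


Verb Phrases: ['leaps']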


In [5]:
def extract_prepositional_phrases(pos_tags):
    prep_phrases = []
    i = 0
    while i < len(pos_tags):
        if pos_tags[i][1] == 'IN':
            # The noun phrase must start immediately after the preposition
            j = i + 1
            if j < len(pos_tags) and pos_tags[j][1] == 'DT':
                j += 1
            while j < len(pos_tags) and pos_tags[j][1] == 'JJ':
                j += 1
            if j < len(pos_tags) and pos_tags[j][1] == 'NN':
                prep_phrases.append(" ".join(word for word, _ in pos_tags[i:j + 1]))
                i = j + 1
                continue
        # No prepositional phrase starts here; move to the next token
        i += 1
    return prep_phrases

prep_phrases = extract_prepositional_phrases(pos_tags)
print("Prepositional Phrases:", prep_phrases)


Prepositional Phrases: ['over the sleepy dog']
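Each extractor in this lab scans the sentence separately, so the same words can appear in more than one phrase list. As one possible alternative, the sketch below labels non-overlapping chunks in a single left-to-right pass. The helper names `np_end` and `chunk` are assumptions for illustration, and the tagged sentence is written out by hand so the block is self-contained.

```python
def np_end(tags, i):
    """Return the index just past a noun phrase starting at i, or i if none starts there."""
    j = i
    if j < len(tags) and tags[j][1] == 'DT':
        j += 1
    while j < len(tags) and tags[j][1] == 'JJ':
        j += 1
    if j < len(tags) and tags[j][1] == 'NN':
        return j + 1
    return i

def chunk(tags):
    """Label non-overlapping NP, VP and PP chunks in one left-to-right pass."""
    chunks, i = [], 0
    while i < len(tags):
        tag = tags[i][1]
        if tag == 'IN' and np_end(tags, i + 1) > i + 1:
            label, end = 'PP', np_end(tags, i + 1)   # preposition + noun phrase
        elif tag in ('VB', 'VBZ'):
            label, end = 'VP', np_end(tags, i + 1)   # verb + optional noun phrase
        elif np_end(tags, i) > i:
            label, end = 'NP', np_end(tags, i)       # bare noun phrase
        else:
            i += 1                                    # token belongs to no chunk
            continue
        chunks.append((label, " ".join(word for word, _ in tags[i:end])))
        i = end
    return chunks

# Hand-tagged sample sentence (Penn Treebank tags)
tagged = [('The', 'DT'), ('clever', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
          ('leaps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('sleepy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]

print(chunk(tagged))
# [('NP', 'The clever brown fox'), ('VP', 'leaps'), ('PP', 'over the sleepy dog')]
```

In this version "the sleepy dog" is reported once, inside the prepositional phrase, instead of appearing in both the noun-phrase and prepositional-phrase lists.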
