# Week 11 – NLP with NLTK
Michael Kamp


## 1. Tokenization

In [1]:
import nltk
from nltk.tokenize import word_tokenize

text = "After a long week of studying, Michael took a break with his VR boxing game and felt energized again."
tokens = word_tokenize(text)
print(tokens)


['After', 'a', 'long', 'week', 'of', 'studying', ',', 'Michael', 'took', 'a', 'break', 'with', 'his', 'VR', 'boxing', 'game', 'and', 'felt', 'energized', 'again', '.']
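
To get a feel for what the tokenizer is doing, here is a rough approximation in plain Python. It is only a sketch: `word_tokenize` applies NLTK's own rules and treats contractions, abbreviations, and similar cases differently. For this particular sentence, splitting into runs of word characters and single punctuation marks happens to give the same token list as above. The name `approx_tokens` is illustrative.

```python
import re

text = "After a long week of studying, Michael took a break with his VR boxing game and felt energized again."

# \w+ matches a run of letters/digits/underscores;
# [^\w\s] matches a single character that is neither a word character nor whitespace
approx_tokens = re.findall(r"\w+|[^\w\s]", text)
print(approx_tokens)
```

Punctuation comes out as separate tokens, which is why `','` and `'.'` appear in the list.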


## 2. Stopwords

In [2]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)


['long', 'week', 'studying', ',', 'Michael', 'took', 'break', 'VR', 'boxing', 'game', 'felt', 'energized', '.']
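
The filtered list above still contains the punctuation tokens `','` and `'.'`, because they are not in the stopword list. If those should go as well, one option is to keep only tokens that contain at least one letter or digit. A minimal sketch, with the list copied from the output above and `content_words` as an illustrative name:

```python
# Output of the stopword filter, copied from the cell above
filtered = ['long', 'week', 'studying', ',', 'Michael', 'took', 'break',
            'VR', 'boxing', 'game', 'felt', 'energized', '.']

# Keep a token only if at least one of its characters is alphanumeric
content_words = [w for w in filtered if any(ch.isalnum() for ch in w)]
print(content_words)
```

Checking for any alphanumeric character, rather than requiring the whole token to be alphabetic, keeps tokens such as hyphenated words or numbers.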


## 3. Stemming

In [3]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in tokens]
print(stems)


['after', 'a', 'long', 'week', 'of', 'studi', ',', 'michael', 'took', 'a', 'break', 'with', 'hi', 'vr', 'box', 'game', 'and', 'felt', 'energ', 'again', '.']


## 4. Lemmatization

In [4]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
lemmas = [lemm.lemmatize(w) for w in tokens]
print(lemmas)

['After', 'a', 'long', 'week', 'of', 'studying', ',', 'Michael', 'took', 'a', 'break', 'with', 'his', 'VR', 'boxing', 'game', 'and', 'felt', 'energized', 'again', '.']


## 4b. POS-Aware Lemmatization (Improved Accuracy)

This version of lemmatization passes each word's part-of-speech (POS) tag to the lemmatizer.
`WordNetLemmatizer` treats every word as a noun unless told otherwise, which is why verbs such as “took,” “felt,” and “studying” came back unchanged in Section 4.

With the correct POS, those words reduce to their dictionary forms, so different inflections of the same word are grouped together in downstream tasks such as text classification, topic modeling, and sentiment analysis.

In [5]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

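# Helper to convert Penn Treebank tags (from nltk.pos_tag) → WordNet POS constants
def get_wordnet_pos(tag, word=""):
    if tag.startswith('J'):
        return wordnet.ADJ
    # Hard-coded override: these irregular past-tense forms are always treated
    # as verbs, even if the tagger labels them otherwise (e.g. "felt" as a noun)
    elif tag.startswith('V') or word.lower() in ['felt', 'went', 'made']:
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # fallback: same POS the lemmatizer uses by default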

lemm = WordNetLemmatizer()

# Use the same sentence as earlier
text = "After a long week of studying, Michael took a break with his VR boxing game and felt energized again."

# Tokenize + POS tag
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

# POS-aware lemmatization
lemmas_pos_aware = [
    lemm.lemmatize(word, get_wordnet_pos(tag, word)) 
    for word, tag in pos_tags
]

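# A notebook cell displays only its last expression, so the output below
# is pos_tags; run print(lemmas_pos_aware) to inspect the lemmas themselves.
pos_tags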



[('After', 'IN'),
 ('a', 'DT'),
 ('long', 'JJ'),
 ('week', 'NN'),
 ('of', 'IN'),
 ('studying', 'VBG'),
 (',', ','),
 ('Michael', 'NNP'),
 ('took', 'VBD'),
 ('a', 'DT'),
 ('break', 'NN'),
 ('with', 'IN'),
 ('his', 'PRP$'),
 ('VR', 'NNP'),
 ('boxing', 'NN'),
 ('game', 'NN'),
 ('and', 'CC'),
 ('felt', 'VBD'),
 ('energized', 'VBN'),
 ('again', 'RB'),
 ('.', '.')]

**Interpretation**

With the POS tags above, the lemmatizer reduces the verb forms to their base forms:

• studying → study (VBG, present participle)  
• took → take (VBD, past tense)  
• felt → feel (VBD, past tense)  
• energized → energize (VBN, past participle)

Default lemmatization left all of these words unchanged because it treated them as nouns.
POS-aware lemmatization maps each of them to a single base form, which makes later counting and matching steps more reliable.

## 5. POS Tagging

In [6]:
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

[('After', 'IN'), ('a', 'DT'), ('long', 'JJ'), ('week', 'NN'), ('of', 'IN'), ('studying', 'VBG'), (',', ','), ('Michael', 'NNP'), ('took', 'VBD'), ('a', 'DT'), ('break', 'NN'), ('with', 'IN'), ('his', 'PRP$'), ('VR', 'NNP'), ('boxing', 'NN'), ('game', 'NN'), ('and', 'CC'), ('felt', 'VBD'), ('energized', 'VBN'), ('again', 'RB'), ('.', '.')]


## 6. Chunking

In [7]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
chunks = parser.parse(pos_tags)
print(chunks)


(S
  After/IN
  (NP a/DT long/JJ week/NN)
  of/IN
  studying/VBG
  ,/,
  Michael/NNP
  took/VBD
  (NP a/DT break/NN)
  with/IN
  his/PRP$
  VR/NNP
  (NP boxing/NN)
  (NP game/NN)
  and/CC
  felt/VBD
  energized/VBN
  again/RB
  ./.)
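
The grammar above matches only the tag `NN`, so it skips the proper nouns (`NNP`) and the possessive *his* (`PRP$`), and it splits *boxing game* into two chunks. One possible refinement, sketched here with the tagged tokens copied from Section 5, allows an optional determiner or possessive pronoun, any number of adjectives, and one or more nouns of any kind. The names `np_grammar` and `noun_phrases` are illustrative.

```python
import nltk

# Tagged tokens copied from the POS tagging output in Section 5
pos_tags = [('After', 'IN'), ('a', 'DT'), ('long', 'JJ'), ('week', 'NN'), ('of', 'IN'),
            ('studying', 'VBG'), (',', ','), ('Michael', 'NNP'), ('took', 'VBD'),
            ('a', 'DT'), ('break', 'NN'), ('with', 'IN'), ('his', 'PRP$'), ('VR', 'NNP'),
            ('boxing', 'NN'), ('game', 'NN'), ('and', 'CC'), ('felt', 'VBD'),
            ('energized', 'VBN'), ('again', 'RB'), ('.', '.')]

# <DT|PRP\$>? optional determiner or possessive pronoun ($ must be escaped)
# <JJ>*       any number of adjectives
# <NN.*>+     one or more nouns of any kind (NN, NNS, NNP, NNPS)
np_grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}"
tree = nltk.RegexpParser(np_grammar).parse(pos_tags)

# Collect the text of every NP chunk, in sentence order
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)
# ['a long week', 'Michael', 'a break', 'his VR boxing game']
```

Whether *his VR boxing game* should be one chunk depends on the task; the pattern is easy to tighten again by removing `+` or the `PRP\$` alternative.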


## 7. Named Entity Recognition

In [8]:
ner = nltk.ne_chunk(pos_tags)
print(ner)


(S
  After/IN
  a/DT
  long/JJ
  week/NN
  of/IN
  studying/VBG
  ,/,
  (PERSON Michael/NNP)
  took/VBD
  a/DT
  break/NN
  with/IN
  his/PRP$
  VR/NNP
  boxing/NN
  game/NN
  and/CC
  felt/VBD
  energized/VBN
  again/RB
  ./.)
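
The result of `ne_chunk` is an `nltk.Tree`, so the entities can be collected by walking its subtrees. The sketch below rebuilds the tree by hand from the output above so that it runs without the NER model; in the notebook, `ner` from the previous cell could be passed in directly. The helper name `extract_entities` is an assumption for illustration.

```python
from nltk import Tree

# Hand-built copy of the ne_chunk output shown above
ner = Tree("S", [
    ("After", "IN"), ("a", "DT"), ("long", "JJ"), ("week", "NN"), ("of", "IN"),
    ("studying", "VBG"), (",", ","), Tree("PERSON", [("Michael", "NNP")]),
    ("took", "VBD"), ("a", "DT"), ("break", "NN"), ("with", "IN"), ("his", "PRP$"),
    ("VR", "NNP"), ("boxing", "NN"), ("game", "NN"), ("and", "CC"),
    ("felt", "VBD"), ("energized", "VBN"), ("again", "RB"), (".", "."),
])

def extract_entities(tree):
    """Return (entity text, entity label) pairs for each labelled chunk."""
    entities = []
    for subtree in tree.subtrees():
        if subtree.label() != "S":  # skip the root sentence node
            words = " ".join(word for word, tag in subtree.leaves())
            entities.append((words, subtree.label()))
    return entities

entities = extract_entities(ner)
print(entities)  # [('Michael', 'PERSON')]
```

Note that *VR* is tagged `NNP` but is not marked as an entity in the output above, so it does not appear in the result.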


## 8. Summary & Key Takeaways

This week’s NLP lab demonstrated the essential preprocessing steps used in text analytics and Natural Language Processing. These steps transform raw, unstructured text into structured forms that machine learning models can interpret.

**Key concepts reviewed:**

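• **Tokenization** – Splits text into individual tokens: words and punctuation marks.  
• **Stopword Removal** – Drops very common words (a, of, with, his, and) that carry little content; punctuation is not removed by this step.  
• **Stemming** – Strips suffixes by rule to get a crude root, which need not be a real word (studying → studi, energized → energ).  
• **Lemmatization** – Maps words to their dictionary base form; it needs the part of speech to handle verbs (took → take).  
• **Part-of-Speech Tagging** – Labels each token with its grammatical category (NN, VBD, JJ, and so on).  
• **Chunking** – Groups tagged tokens into phrases, here noun phrases matched by a tag pattern.  
• **Named Entity Recognition (NER)** – Identifies people, places, dates, etc.; here it marked Michael as a PERSON.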
