# 🌐 Part 1: Environment Setup and Library Imports

**Learning Outcome (LO1):**  
You will learn to set up a Python environment, import essential libraries, and understand what each does.  

> 💡 *Tip:* We are using **UBC Syzygy** — a cloud Jupyter Notebook environment. No installation is needed, but we must import libraries before coding.

# 📦 Install required libraries

In [1]:
!pip install numpy nltk scikit-learn 



# 🧰 Import required libraries

In [2]:
# Import the Natural Language Toolkit
import nltk  
from nltk import pos_tag
from nltk.corpus import wordnet

# Import tools for tokenization and lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Import the Counter class for word frequency counting
from collections import Counter  

# Import scikit-learn for Bag-of-Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer  

# Download necessary NLTK resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this for word_tokenize()
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

print("✅ Libraries imported successfully! 😄")

✅ Libraries imported successfully! 😄


[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jupyter/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [21]:
def get_wordnet_pos(tag):
    """Map POS tag to first character lemmatize() accepts."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun
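
To see what this helper does without loading any NLTK data, here is a minimal sketch that mirrors it with plain strings. It assumes the WordNet constants are the single letters `'a'`, `'v'`, `'n'`, and `'r'` (their values in NLTK), and `penn_to_wordnet` is a hypothetical name used only for illustration.

```python
# Illustrative stand-in for get_wordnet_pos(); not part of the notebook.
# Assumption: wordnet.ADJ, VERB, NOUN, ADV are the letters 'a', 'v', 'n', 'r'.
PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    return PREFIX_TO_POS.get(tag[:1], "n")  # unknown tags default to noun

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(f"{tag:4} -> {penn_to_wordnet(tag)}")
```

Only the first letter of the tag matters here, which is why `VBD`, `VBZ`, and `VBG` would all map to the verb category.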

# 🧹 Part 2: Text Cleaning and Lemmatization

# 📝 Step 1: Load the sample text

### 📖 The Tell-Tale Heart by Edgar Allan Poe
#### Source: https://poemuseum.org/the-tell-tale-heart/

In [22]:
sample_text = """
TRUE! — nervous — very, very dreadfully nervous I had been and am; but why will you say that I am mad?
The disease had sharpened my senses — not destroyed — not dulled them. Above all was the sense of hearing acute.
I heard all things in the heaven and in the earth. I heard many things in hell.
How, then, am I mad? Hearken! and observe how healthily — how calmly I can tell you the whole story.
"""
print("✅ Text loaded successfully! Length:", len(sample_text), "characters")

✅ Text loaded successfully! Length: 398 characters


# 🧩 Step 2: Tokenize text into words

In [23]:
# HINT: use word_tokenize()
tokens = ___
tokens
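
As a rough picture of what tokenization produces, the sketch below splits a sentence into word and punctuation tokens with a regular expression. This is only an approximation for illustration: `word_tokenize()` applies more elaborate rules (for example, for contractions), and `simple_tokenize` is a hypothetical helper, not an NLTK function.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("How, then, am I mad? Hearken!"))
```

Punctuation marks come out as tokens of their own, which is why they also show up in the lemma list later on.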

In [43]:
# HINT: you want to POS tag tokens
tagged = pos_tag(___)

# 🧩 Step 3: Lemmatize each token

In [44]:
# HINT: which WordNetLemmatizer method turns a word into its lemma?
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.___(word, get_wordnet_pos(tag)) for word, tag in tagged]


In [46]:
# Print title/header
print("🔤Original Token      →  🧠 Lemma")
print("-" * 30)  # Optional: a separator line

# HINT: zip the original tokens with their lemmas
for original, lemma in ___(tokens, lemmas):
    print(f"{original:15} → {lemma}")

🔤Original Token      →  🧠 Lemma
------------------------------



# 🔍 Can You Spot the Difference?

**Learning Outcome (LO2):**  
Perform tokenization and lemmatization to clean up text for NLP.

---

> 🧐 **Think:** What has changed from Original Token → Lemma?  
> 🤔 **Think:** How can this possibly help natural language processing?

---

Natural language processing (NLP) is a subfield of artificial intelligence, closely tied to machine learning, that enables computers to understand, interpret, and communicate in human language (Kumar, 2025).
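
One way to approach the second question: lemmatization lets different surface forms of a word count as the same item. The sketch below uses a tiny hand-written lookup table (`TOY_LEMMAS`, a hypothetical stand-in for `WordNetLemmatizer`) to show how counts merge once tokens are reduced to lemmas. The names `toy_tokens` and `toy_lemmas` are chosen so they do not overwrite your own `tokens` and `lemmas`.

```python
from collections import Counter

# Hypothetical mini-dictionary; a real lemmatizer covers far more words.
TOY_LEMMAS = {"senses": "sense", "heard": "hear", "things": "thing"}

toy_tokens = ["sense", "senses", "heard", "hear", "things", "thing"]
toy_lemmas = [TOY_LEMMAS.get(tok, tok) for tok in toy_tokens]

print("Distinct tokens:", len(set(toy_tokens)))  # 6 different strings
print("Distinct lemmas:", len(set(toy_lemmas)))  # 3 different strings
print(Counter(toy_lemmas))
```

Fewer distinct items means a smaller vocabulary, and evidence about one form ("senses") is pooled with evidence about the other ("sense").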

# ✅ TEST 1: Check if lemmatization was applied correctly

In [11]:
# Basic checks
conditions = [
    isinstance(tokens, list),                    # tokens should be a list
    len(tokens) > 20,                            # enough tokens should be found
    isinstance(lemmas, list),                    # lemmas should be a list
    all(isinstance(x, str) for x in lemmas),     # all lemmas should be strings
]

# Display friendly feedback
if all(conditions):
    print("😄 PASSED: Your text was tokenized and lemmatized correctly!")
    print(f"🔢 Total tokens found: {len(tokens)}")
    print(f"🧠 Sample lemmas: {lemmas[:10]}")
else:
    print("❌ TRY AGAIN: Something's off. Check if you filled in the blanks correctly for tokenization or lemmatization.")

😄 PASSED: Your text was tokenized and lemmatized correctly!
🔢 Total tokens found: 93
🧠 Sample lemmas: ['TRUE', '!', '—', 'nervous', '—', 'very', ',', 'very', 'dreadfully', 'nervous']


# 🧰 Part 3: Bag-of-Words (BoW) Model

**Learning Outcome (LO3):**  
Understand how to represent text numerically using the Bag-of-Words model.

> 🧩 **Your task:** Fill in blanks to transform text into a matrix of word counts.
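
Before filling in the blanks, it may help to see the idea in plain Python. This sketch builds a word-count matrix for two short made-up sentences using only the standard library. The names `toy_docs`, `toy_vocab`, and `bow_rows` are invented for illustration, and splitting on whitespace is a simplification of what `CountVectorizer` does.

```python
from collections import Counter

toy_docs = ["the heart beat", "the heart the eye"]

# One Counter of word frequencies per document
doc_counts = [Counter(doc.split()) for doc in toy_docs]

# Shared, alphabetically sorted vocabulary across all documents
toy_vocab = sorted(set(word for c in doc_counts for word in c))

# One row per document, one column per vocabulary word
bow_rows = [[c[word] for word in toy_vocab] for c in doc_counts]

print(toy_vocab)
for row in bow_rows:
    print(row)
```

Each row is a document and each column is a word, so the matrix has shape (number of documents, vocabulary size).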

# 🧩 Step 1: Create a CountVectorizer

In [12]:
vectorizer = CountVectorizer()

# 🧩 Step 2: Fit and transform the text data

In [13]:
# HINT: vectorizer.fit_transform([])
X = ___

# 🧩 Step 3: Get feature (word) names

In [16]:
words = vectorizer.get_feature_names_out()
words

array(['above', 'acute', 'all', 'am', 'and', 'been', 'but', 'calmly',
       'can', 'destroyed', 'disease', 'dreadfully', 'dulled', 'earth',
       'had', 'healthily', 'heard', 'hearing', 'hearken', 'heaven',
       'hell', 'how', 'in', 'mad', 'many', 'my', 'nervous', 'not',
       'observe', 'of', 'say', 'sense', 'senses', 'sharpened', 'story',
       'tell', 'that', 'the', 'them', 'then', 'things', 'true', 'very',
       'was', 'whole', 'why', 'will', 'you'], dtype=object)

In [17]:
# Display the results
print("🔤 Vocabulary:", words)
print("🧮 BoW Matrix:\n", X.toarray())

🔤 Vocabulary: ['above' 'acute' 'all' 'am' 'and' 'been' 'but' 'calmly' 'can' 'destroyed'
 'disease' 'dreadfully' 'dulled' 'earth' 'had' 'healthily' 'heard'
 'hearing' 'hearken' 'heaven' 'hell' 'how' 'in' 'mad' 'many' 'my'
 'nervous' 'not' 'observe' 'of' 'say' 'sense' 'senses' 'sharpened' 'story'
 'tell' 'that' 'the' 'them' 'then' 'things' 'true' 'very' 'was' 'whole'
 'why' 'will' 'you']
🧮 BoW Matrix:
 [[1 1 2 3 3 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 3 3 2 1 1 2 2 1 1 1 1 1 1 1 1
  1 5 1 1 2 1 2 1 1 1 1 2]]
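
Notice that the vocabulary has no entry for "I" and no punctuation. By default `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`). The sketch below imitates that behaviour with `re` on one pair of sentences from the sample. `excerpt`, `approx_tokens`, and `approx_counts` are illustrative names, and this is an approximation rather than scikit-learn's actual code path.

```python
import re
from collections import Counter

excerpt = "I heard all things in the heaven and in the earth. I heard many things in hell."

# Lowercase, then keep tokens with at least two word characters
approx_tokens = re.findall(r"\b\w\w+\b", excerpt.lower())
approx_counts = Counter(approx_tokens)

print(sorted(approx_counts))     # vocabulary, alphabetical
print("i" in approx_counts)      # False: single letters are dropped
print(approx_counts["in"])       # 3
```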


# 🧪 TEST 2 — Check if Bag-of-Words model is correctly built

In [18]:
try:
    num_words = len(words)
    matrix_shape = X.shape

    if num_words > 10 and matrix_shape[1] == num_words:
        print("😄 PASSED: Bag-of-Words created successfully!")
        print(f"📊 Vocabulary size: {num_words}")
        print(f"🧮 BoW matrix shape: {matrix_shape}")
        print(f"🔤 Sample words: {words[:10]}")
    else:
        print("❌ TRY AGAIN: BoW model doesn't look right. Make sure to use vectorizer.fit_transform([sample_text]) and get_feature_names_out().")

except Exception as e:
    print("❌ ERROR: Something went wrong.")
    print("💡 HINT: Check if 'vectorizer', 'X', and 'words' are defined correctly.")

😄 PASSED: Bag-of-Words created successfully!
📊 Vocabulary size: 48
🧮 BoW matrix shape: (1, 48)
🔤 Sample words: ['above' 'acute' 'all' 'am' 'and' 'been' 'but' 'calmly' 'can' 'destroyed']


# 🧾 Step 4: Display word counts so you can SEE how BoW works!

In [19]:
# Convert BoW matrix to an array
bow_array = X.toarray()

# Pair each word with its corresponding count
word_counts = dict(zip(words, bow_array[0]))

# Print each word and how many times it appears in the text
print("📊 Word Count (Bag-of-Words Representation):")
    
# Sort words by frequency (descending)
for word, count in sorted(word_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{word:<15} : {count}")

📊 Word Count (Bag-of-Words Representation):
the             : 5
am              : 3
and             : 3
how             : 3
in              : 3
all             : 2
had             : 2
heard           : 2
mad             : 2
nervous         : 2
not             : 2
things          : 2
very            : 2
you             : 2
above           : 1
acute           : 1
been            : 1
but             : 1
calmly          : 1
can             : 1
destroyed       : 1
disease         : 1
dreadfully      : 1
dulled          : 1
earth           : 1
healthily       : 1
hearing         : 1
hearken         : 1
heaven          : 1
hell            : 1
many            : 1
my              : 1
observe         : 1
of              : 1
say             : 1
sense           : 1
senses          : 1
sharpened       : 1
story           : 1
tell            : 1
that            : 1
them            : 1
then            : 1
true            : 1
was             : 1
whole           : 1
why             : 1
will            

In [20]:
# 🧪 TEST 3 — Check if word count dictionary works

try:
    if isinstance(word_counts, dict) and len(word_counts) > 5:
        top_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
        print("😄 PASSED: Word counts generated successfully!")
        print("🏆 Top 5 most frequent words:")
        for word, count in top_words:
            print(f"{word:<15}: {count}")
    else:
        print("❌ TRY AGAIN: Your word_counts dictionary seems empty or not created correctly.")
except Exception:
    print("❌ ERROR: Check if you defined 'word_counts' after creating the BoW matrix.")

😄 PASSED: Word counts generated successfully!
🏆 Top 5 most frequent words:
the            : 5
am             : 3
and            : 3
how            : 3
in             : 3
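
A ranking like this can also be produced with `Counter`, which was imported in Part 1. A minimal sketch, using a small hand-copied subset of the counts above (`demo_counts` is an illustrative name):

```python
from collections import Counter

# Subset of the counts shown above, copied by hand for illustration
demo_counts = Counter({"the": 5, "am": 3, "and": 3, "how": 3, "in": 3, "all": 2, "mad": 2})

# most_common(n) returns the n highest counts; ties keep insertion order
for word, count in demo_counts.most_common(5):
    print(f"{word:<15}: {count}")
```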


# 💭 Part 4: Reflection & Real-World Applications

**Learning Outcomes (LO5 & LO6):**  
Think critically about what you learned and its implications.

### 🧩 Discussion Prompts
1. Why is Bag-of-Words useful yet limited for real-world NLP tasks?
2. Can you name an industry where NLP can be transformative?
3. Reflect on your learning process — how did collaboration help?
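
As a starting point for the first prompt, one well-known limitation is that Bag-of-Words ignores word order. In this small sketch (plain Python, with made-up sentences), two sentences with different meanings produce identical word counts:

```python
from collections import Counter

sentence_a = "the old man heard the heart"
sentence_b = "the heart heard the old man"

bow_a = Counter(sentence_a.split())
bow_b = Counter(sentence_b.split())

print(bow_a == bow_b)  # True: same words, same counts, different meaning
```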

> 💬 Post your reflections on **Padlet** or discuss in class.