<a href="https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Preprocessing
https://nlpdemystified.org<br>
https://github.com/futuremojo/nlp-demystified

### spaCy upgrade and package installation.

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.
<br><br>
**IMPORTANT**<br>
If you're running this in the cloud rather than using a local Jupyter server on your machine, then the notebook will **time out** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical packages.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [1]:
# !pip install -U spacy==3.* 

In [1]:
!python -m spacy info


spaCy version    3.8.5                         
Location         /home/amiche/anaconda3/envs/nlp/lib/python3.12/site-packages/spacy
Platform         Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version   3.12.9                        
Pipelines                                      



In [2]:
import spacy

After importing spaCy, the next thing we need to do is load a suitable statistical model for our project. spaCy offers a variety of models for different languages. These models help with tokenization, part-of-speech tagging, named entity recognition, and more.

Here, we're loading the **en_core_web_sm** model which is the smallest English model spaCy offers and is a good starting point for NLP tasks.<br>
https://spacy.io/models/en#en_core_web_sm

Since we upgraded spaCy, we'll need to download the statistical model as well.

In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
nlp = spacy.load('en_core_web_sm')

**en_core_web_sm** is trained on OntoNotes 5 which is an annotated corpus comprising news, blogs, transcripts, etc. Put simply, this means a bunch of documents were labelled with information such as how each sentence should be parsed, whether a particular word is a noun or adjective or other part-of-speech, whether a word is a special entity like a person or a real-world organization, and other language-related labels. A statistical model was then generated from these labelled documents.<br>
https://catalog.ldc.upenn.edu/LDC2013T19
<br><br>
You can learn more about the available spaCy models at these links:<br>
https://spacy.io/models<br>
https://spacy.io/usage/models

After loading the model, the _nlp_ variable now references a **Language** class instance which contains language-specific rules for various tasks (e.g. tokenization) and a processing pipeline.<br>
https://spacy.io/api/language

In [5]:
type(nlp) 

spacy.lang.en.English

# Tokenization

Course module for this demo:
https://www.nlpdemystified.org/course/tokenization


### Tokenization with spaCy

We pass whatever text we want to process to _nlp_, which returns a **Doc** container object containing the tokenized text and a number of annotations for each token. These annotations are discussed in follow-up videos. You can learn more about the **Doc** object here:<br>
https://spacy.io/api/doc

In [6]:
# Sample sentence.
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

In [8]:
doc.text

"He didn't want to pay $20 for this book."

We can iterate over this **Doc** object and view the tokens.

In [9]:
print([t.text for t in doc])

['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'this', 'book', '.']


Note how:
- "didn't" is separated into "did" and "n't".
- the currency symbol and amount are separated.
- the period at the end of the sentence is its own token.

The **Doc** object can be indexed and sliced like a regular list. The **Doc** object contains **Token** and **Span** objects, which offer different views into the text.

In [10]:
# We can view an individual token by indexing into the Doc object.
print(doc[0])

He


In [11]:
# A Doc object is a container of other objects, namely Token and Span objects.
print(type(doc[0]))

<class 'spacy.tokens.token.Token'>


In [12]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

He didn't
<class 'spacy.tokens.span.Span'>


In [13]:
# Access a token's index within the Doc through its `i` attribute.
print([(t.text, t.i) for t in doc])

[('He', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('$', 6), ('20', 7), ('for', 8), ('this', 9), ('book', 10), ('.', 11)]


spaCy's tokenization is _non-destructive_, which means the original input, including its whitespace, can be reconstructed from the tokens.

In [14]:
# You can view the original input like so:
print(doc.text)

He didn't want to pay $20 for this book.
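
To make the non-destructive property concrete, the sketch below rebuilds the input from the tokens themselves rather than reading `doc.text`. It joins each token's `text_with_ws` attribute (the token text plus any whitespace that followed it). As an assumption for illustration, it uses a tokenizer-only pipeline created with `spacy.blank("en")` and separate variable names (`nlp_blank`, `doc_blank`) so the `nlp` and `doc` objects used elsewhere in this notebook are left untouched.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer is involved in this check.
nlp_blank = spacy.blank("en")
text = "He didn't want to pay $20 for this book."
doc_blank = nlp_blank(text)

# text_with_ws is the token text plus the whitespace that followed it,
# so concatenating the tokens in order should give back the input.
rebuilt = "".join(t.text_with_ws for t in doc_blank)
print(rebuilt == text)
```

If the comparison holds, no characters were lost or altered during tokenization.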


You can learn more about the **Token** and **Span** objects here:<br>
https://spacy.io/api/token<br>
https://spacy.io/api/span


In [16]:
doc = nlp("He didn't want to pay $20 for this book.")
for token in doc:
    print(f"{token.text:>10} | {token.pos_:<10} | {token.dep_:<10} | head: {token.head.text}")

        He | PRON       | nsubj      | head: want
       did | AUX        | aux        | head: want
       n't | PART       | neg        | head: want
      want | VERB       | ROOT       | head: want
        to | PART       | aux        | head: pay
       pay | VERB       | xcomp      | head: want
         $ | SYM        | nmod       | head: 20
        20 | NUM        | dobj       | head: pay
       for | ADP        | prep       | head: pay
      this | DET        | det        | head: book
      book | NOUN       | pobj       | head: for
         . | PUNCT      | punct      | head: want
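
The `head` column in the output above defines a tree: every token points at its head, and the root points at itself. As a rough illustration, the sketch below copies the token texts and head positions from that output into plain Python lists (the names `tokens`, `heads`, and `path_to_root` are made up for this example) and follows the head pointers from a token up to the root.

```python
# Token texts and head positions copied from the output above.
# A token whose head is itself is the root ("want").
tokens = ["He", "did", "n't", "want", "to", "pay", "$", "20", "for", "this", "book", "."]
heads = [3, 3, 3, 3, 5, 3, 7, 5, 5, 10, 8, 3]


def path_to_root(i):
    """Follow head pointers from token i until the root is reached."""
    path = [tokens[i]]
    while heads[i] != i:
        i = heads[i]
        path.append(tokens[i])
    return path


print(" -> ".join(path_to_root(10)))
```

With a real **Doc**, the same walk can be done by repeatedly reading `token.head`.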


We can also tokenize multiple sentences and access each sentence individually using the **Doc** object's _sents_ property.

In [22]:
s = """Either the well was very deep, or she fell very slowly, for she 
had plenty of time as she went down to look about her and to wonder what 
was going to happen next. First, she tried to look down and make out what 
she was coming to, but it was too dark to see anything; then she looked at 
the sides of the well, and noticed that they were filled with cupboards and 
book-shelves; here and there she saw maps and pictures hung upon pegs."""

doc = nlp(s)

# Look at individual sentences (there should be two 'Span' objects).
sentences = [sent for sent in doc.sents]
print(f"Sentences : {sentences}")
print(f"\nNumber of sentences : {len(sentences)}")

Sentences : [Either the well was very deep, or she fell very slowly, for she 
had plenty of time as she went down to look about her and to wonder what 
was going to happen next., First, she tried to look down and make out what 
she was coming to, but it was too dark to see anything; then she looked at 
the sides of the well, and noticed that they were filled with cupboards and 
book-shelves; here and there she saw maps and pictures hung upon pegs.]

Number of sentences : 2


### Tokenization Exercises

In [None]:
#
# EXERCISE:
# 1) Tokenize the following text
# 2) Iterate through the tokens to check whether there's a currency symbol.
# 3) If there is, and the currency label is followed by a number, print
#    both the symbol and the number.
# 
# Look through https://spacy.io/api/token#attributes on how to check whether
# a token is a currency symbol or a number.
#
# Expected output: "$20".
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

In [34]:
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

print(f"tokens : {[(token.text, token.i) for token in doc]}")
print(doc[6:8])

tokens : [('He', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('$', 6), ('20', 7), ('for', 8), ('this', 9), ('book', 10), ('.', 11)]
$20
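
The solution above relies on already knowing where the tokens sit in the sentence. A position-independent variant, sketched here, checks the `is_currency` and `like_num` token attributes instead, as the exercise hints. The blank pipeline and the variable names (`nlp_blank`, `doc_blank`, `amounts`) are assumptions for illustration; both attributes are lexical, so no trained model should be needed.

```python
import spacy

# Assumption: a tokenizer-only pipeline is sufficient, because
# is_currency and like_num are lexical attributes.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("He didn't want to pay $20 for this book.")

amounts = []
# Stop one token early so there is always a "next" token to inspect.
for token in doc_blank[:-1]:
    next_token = doc_blank[token.i + 1]
    if token.is_currency and next_token.like_num:
        amounts.append(token.text + next_token.text)

print(amounts)
```

Unlike slicing with fixed indices, this keeps working if the sentence changes or contains several amounts.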


In [None]:
#
# EXERCISE: Learn how the spaCy tokenizer works and how to customize it:
# https://spacy.io/usage/linguistic-features#tokenization
#

In [36]:
text = "Apple is looking at buying U.K. startup for $1 billion"

# simple split
split = text.split(" ")
print(f"Simple split : {split}")
# tokenization
doc = nlp(text)
tokens = [token.text for token in doc]
print(f"\nTokenization : {tokens}")

Simple split : ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', 'billion']

Tokenization : ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']


In [41]:
# Minimal Python tokenizer sketch (spaCy-style logic)
def custom_tokenizer(text, special_cases, prefix_search, suffix_search, infix_finditer, token_match, url_match):
    tokens = []

    # Step 1: Split text by whitespace to get initial word candidates
    for substring in text.split():
        while substring:
            # Step 2: Check for full special case (e.g., "U.K.", "don't")
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                break

            # Step 3: Repeatedly handle prefixes and suffixes
            while prefix_search(substring) or suffix_search(substring):
                # Step 3a: Check token match (e.g., emoji, number, email, etc.)
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break

                # Step 3b: Re-check for special case
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break

                # Step 3c: Consume prefix
                prefix = prefix_search(substring)
                if prefix:
                    split = prefix.end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    continue

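                # Step 3d: Consume suffix
                # NOTE: the suffix is appended to `tokens` before the remaining
                # substring, so it lands ahead of its word in the final list
                # (e.g. '.', 'billion' instead of 'billion', '.').
                suffix = suffix_search(substring)
                if suffix:
                    split = suffix.start()
                    tokens.append(substring[split:])
                    substring = substring[:split]
                    continue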

            # Step 4: Token match again
            if token_match(substring):
                tokens.append(substring)
                substring = ""

            # Step 5: URL match
            elif url_match(substring):
                tokens.append(substring)
                substring = ""

            # Step 6: Special case (again)
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""

            # Step 7: Handle infixes (e.g., hyphen, apostrophe inside word)
            elif list(infix_finditer(substring)):
                infixes = list(infix_finditer(substring))
                offset = 0
                for match in infixes:
                    if offset != match.start():
                        tokens.append(substring[offset:match.start()])
                    tokens.append(substring[match.start():match.end()])
                    offset = match.end()
                if offset < len(substring):
                    tokens.append(substring[offset:])
                substring = ""

            # Step 8: No special handling matched
            else:
                tokens.append(substring)
                substring = ""

    return tokens

def print_tokenization_results(results: dict):
    for i, (sentence, tokens) in enumerate(results.items(), 1):
        print(f"\n🟡 Example {i}:")
        print(f"🔹 Input: {sentence}")
        print(f"🔹 Tokens: {tokens}")

In [42]:
import re

# ----------- Rule Definitions -----------
PREFIXES = [r"\$", r"\("]
SUFFIXES = [r"\.", r"\!", r"\?"]
INFIXES = [r"-", r"'"]

prefix_re = re.compile(r"^(" + "|".join(PREFIXES) + ")")
suffix_re = re.compile(r"(" + "|".join(SUFFIXES) + r")$")
infix_re = re.compile("|".join(INFIXES))

prefix_search = lambda text: prefix_re.match(text)
suffix_search = lambda text: suffix_re.search(text)
infix_finditer = lambda text: infix_re.finditer(text)
token_match = lambda text: None  # Simulated matcher
url_match = lambda text: text.startswith("http")

SPECIAL_CASES = {
    "U.K.": ["U.K."],
    "don't": ["do", "n't"],
    "can't": ["ca", "n't"],
    "e-mail": ["e", "-", "mail"],
    "O'Neill": ["O", "'", "Neill"]
}

# ----------- Test Examples -----------
examples = [
    "Apple is looking at buying U.K. startup for $1 billion.",
    "She said, 'I can't do this anymore!'",
    "Please send an e-mail to support@example.com.",
    "O'Neill is visiting the U.S. next week.",
    "He paid $100 for that (surprisingly expensive) book.",
    "Visit http://example.com for more info.",
    "I don't think this will work.",
    "The price is $2.50!",
    "Welcome to A.I.-powered platforms.",
    "Don't worry – we've got you covered."
]

# Tokenize each example
results = {text: custom_tokenizer(
    text,
    SPECIAL_CASES,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
) for text in examples}

print_tokenization_results(results)


🟡 Example 1:
🔹 Input: Apple is looking at buying U.K. startup for $1 billion.
🔹 Tokens: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', '.', 'billion']

🟡 Example 2:
🔹 Input: She said, 'I can't do this anymore!'
🔹 Tokens: ['She', 'said,', "'", 'I', 'ca', "n't", 'do', 'this', 'anymore!', "'"]

🟡 Example 3:
🔹 Input: Please send an e-mail to support@example.com.
🔹 Tokens: ['Please', 'send', 'an', 'e', '-', 'mail', 'to', '.', 'support@example.com']

🟡 Example 4:
🔹 Input: O'Neill is visiting the U.S. next week.
🔹 Tokens: ['O', "'", 'Neill', 'is', 'visiting', 'the', '.', 'U.S', 'next', '.', 'week']

🟡 Example 5:
🔹 Input: He paid $100 for that (surprisingly expensive) book.
🔹 Tokens: ['He', 'paid', '$', '100', 'for', 'that', '(', 'surprisingly', 'expensive)', '.', 'book']

🟡 Example 6:
🔹 Input: Visit http://example.com for more info.
🔹 Tokens: ['Visit', 'http://example.com', 'for', 'more', '.', 'info']

🟡 Example 7:
🔹 Input: I don't think this will work.
🔹 Toke
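
In the output above, a trailing period is emitted before the word it was attached to (for example `'.', 'billion'`), because the suffix is appended to the token list as soon as it is peeled off. One possible way to keep the original order, sketched below, is to buffer suffixes separately and add them back, reversed, after the remaining core. This is a simplified illustration rather than spaCy's actual implementation: the names `PREFIX_RE`, `SUFFIX_RE`, `split_affixes`, and `tokenize` are hypothetical, and special cases, infixes, and URL matching are deliberately left out.

```python
import re

# Hypothetical, deliberately tiny rule sets for illustration.
PREFIX_RE = re.compile(r"^[\$\(]")
SUFFIX_RE = re.compile(r"[\.\!\?\)]$")


def split_affixes(chunk):
    """Split one whitespace-delimited chunk into prefixes, core, suffixes."""
    prefixes, suffixes = [], []
    while chunk:
        match = PREFIX_RE.search(chunk)
        if match:
            prefixes.append(chunk[:match.end()])
            chunk = chunk[match.end():]
            continue
        match = SUFFIX_RE.search(chunk)
        if match:
            # Buffer the suffix instead of emitting it immediately.
            suffixes.append(chunk[match.start():])
            chunk = chunk[:match.start()]
            continue
        break
    core = [chunk] if chunk else []
    # Suffixes were collected outermost-first, so reverse them.
    return prefixes + core + suffixes[::-1]


def tokenize(text):
    return [tok for chunk in text.split() for tok in split_affixes(chunk)]


print(tokenize("He paid $100 for that (surprisingly expensive) book."))
```

Because special cases are omitted, an abbreviation such as "U.K." would still lose its final period here; the sketch only addresses token order.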

In [46]:
from spacy.symbols import ORTH
import spacy

# handle special case with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp("gimme that")
print(f"Default : {[t.text for t in doc]}")

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
doc = nlp("gimme that again")
print(f"Special case : {[t.text for t in doc]}")

Default : ['gimme', 'that']
Special case : ['gim', 'me', 'that', 'again']


In [47]:
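# Debug: show which tokenizer rule produced each token.
# `text` is the "Apple is looking at buying..." sentence defined in an earlier cell.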
nlp.tokenizer.explain(text)

[('TOKEN', 'Apple'),
 ('TOKEN', 'is'),
 ('TOKEN', 'looking'),
 ('TOKEN', 'at'),
 ('TOKEN', 'buying'),
 ('TOKEN', 'U.K.'),
 ('TOKEN', 'startup'),
 ('TOKEN', 'for'),
 ('PREFIX', '$'),
 ('TOKEN', '1'),
 ('TOKEN', 'billion')]

In [49]:
import re
import spacy
from spacy.tokenizer import Tokenizer

# Define special cases
special_cases = {":)": [{"ORTH": ":)"}]}

# Define custom regexes
prefix_re = re.compile(r'''^[\[\("']''')         # Match [, (, ", or '
suffix_re = re.compile(r'''[\]\)"']$''')         # Match ], ), ", or '
infix_re = re.compile(r'''[-~]''')               # Match - or ~ as infix
simple_url_re = re.compile(r'''^https?://''')    # Match basic http/https

# Custom tokenizer factory
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     url_match=simple_url_re.match)

# Load language model and assign custom tokenizer
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)

# Test
doc = nlp("hello-world. :)")
print([t.text for t in doc])


['hello', '-', 'world.', ':)']


In [51]:
# Example 1: Basic whitespace tokenizer 
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([token.text for token in doc])

["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.']


In [None]:
#
# EXERCISE: Read through spaCy-101 and if you're interested, check out their course
# on spaCy itself (link on the page).
# https://spacy.io/usage/spacy-101
#

In [53]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: regex, joblib, nltk
Successfully installed joblib-1.4.2 nltk-3.9.1 rege

In [58]:
#
# EXERCISE: Look up how to tokenize the sentence below using NLTK. The imports 
# are done for you. Does the NLTK tokenizer handle "N.Y.C." correctly?
#
import nltk
from nltk.tokenize import TreebankWordTokenizer
s = "Let's go to N.Y.C. for the weekend."

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(s)
print(tokens)

['Let', "'s", 'go', 'to', 'N.Y.C.', 'for', 'the', 'weekend', '.']


**NOTE**: Different tokenizers will give subtly different results based on the rules they use. Experiment with different tokenizers and use the one best suited for your project.
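
As a small illustration of how much the rules matter, the sketch below tokenizes the same sentence with two deliberately simple rule sets using only the standard library: splitting on whitespace, and a regular expression that separates runs of word characters from runs of punctuation. Neither is what spaCy or NLTK does internally; they are stand-ins chosen to make the contrast visible.

```python
import re

s = "Let's go to N.Y.C. for the weekend."

# Rule set 1: split on whitespace only.
whitespace_tokens = s.split()

# Rule set 2: runs of word characters, or runs of punctuation.
regex_tokens = re.findall(r"\w+|[^\w\s]+", s)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split keeps "N.Y.C." intact but leaves the final period glued to "weekend.", while the regex separates the final period but also breaks the abbreviation apart.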

# Basic Preprocessing
## Case-Folding, Stop Word Removal, Stemming, and Lemmatization.

Course module for this demo:
https://www.nlpdemystified.org/course/basic-preprocessing

**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**


In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

spaCy performs all these preprocessing steps (except stemming) behind the scenes for you. In line with its non-destructive policy, the tokens aren't modified directly. Rather, each **Token** object has a number of attributes which can help you get views of your document with these preprocessing steps applied. The attributes a **Token** has can be found here:<br>
https://spacy.io/api/token#attributes
<br><br>
More information about spaCy's processing pipeline:<br>
https://spacy.io/usage/processing-pipelines

In [59]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "He told Dr. Lovato that he was done with the tests and would post the results shortly."
doc = nlp(s)

### Case-Folding

View your document with case-folding using the *lower_* attribute.

In [63]:
print([t.text for t in doc])

['He', 'told', 'Dr.', 'Lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']


In [64]:
print([t.lower_ for t in doc])

['he', 'told', 'dr.', 'lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']


You can also apply conditions when generating these views. For example, we can skip case-folding if a token is the start of a sentence.

In [65]:
print([t.lower_ if not t.is_sent_start else t.text for t in doc])

['He', 'told', 'dr.', 'lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']


### Stop Word Removal

spaCy comes with a default stop word list. To view your document with stop words removed, you can use the *is_stop* attribute.

In [66]:
# spaCy's default stop word list.
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

{'something', 'give', 'and', 'were', 'therein', '’ve', 'see', 'how', 'him', 'his', 'until', 'always', 'before', 'such', 'make', 'say', 'anyhow', 'bottom', 'though', 'using', 'everywhere', 'everyone', 'due', 'back', 'than', "'ve", 'one', "'d", 'had', '‘ll', 'eleven', 'this', 'afterwards', 'did', 'towards', '’re', 'hers', 'what', 'never', 'much', 'am', 'sometime', 'well', 'used', 'she', 'when', 'without', 'on', 'quite', 'that', 'herself', 'n’t', 'thereafter', '’s', 'sixty', 'get', 'serious', 'hereafter', 'whereby', 'formerly', 'someone', 'somewhere', 'nor', 'these', 'none', 'yet', 'amongst', 'meanwhile', 'at', 'anyone', 'other', 'as', 'two', 'please', 'further', 'from', 'forty', 'it', 'the', 'many', 'several', 'you', '’m', 'behind', 'call', 'some', 'alone', 'noone', 'nobody', 'who', 'keep', 'whereafter', 'just', 'both', 'latter', 'elsewhere', 'however', 'nothing', 'sometimes', 'become', 'really', 'now', 'while', 'yourself', 'fifteen', 'hereby', 'over', 'move', 'might', 'amount', 'take', 

In [67]:
print([t for t in doc if not t.is_stop])

[told, Dr., Lovato, tests, post, results, shortly, .]


### Lemmatization

It's similar with lemmatization. You can view your document with lemmatization applied through the *lemma_* attribute.

In [68]:
[(t.text, t.lemma_) for t in doc]

[('He', 'he'),
 ('told', 'tell'),
 ('Dr.', 'Dr.'),
 ('Lovato', 'Lovato'),
 ('that', 'that'),
 ('he', 'he'),
 ('was', 'be'),
 ('done', 'do'),
 ('with', 'with'),
 ('the', 'the'),
 ('tests', 'test'),
 ('and', 'and'),
 ('would', 'would'),
 ('post', 'post'),
 ('the', 'the'),
 ('results', 'result'),
 ('shortly', 'shortly'),
 ('.', '.')]

### Basic Preprocessing Exercises

## 🌱 What Does a Stemmer Do?

A **stemmer** is a tool or algorithm that **reduces a word to its base or root form**, called the **stem**, by stripping off affixes (in practice, mostly **suffixes**).

However, the resulting **stem**:

* **May not be a valid word** in the language (unlike a lemma).
* Is **not necessarily the same** as a dictionary root.

### 🧪 Examples:

| Original Word | Stemmed Form |
| ------------- | ------------ |
| running       | run          |
| studies       | studi        |
| easily        | easili       |
| happiness     | happi        |
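
To show the kind of rule that produces stems like these, here is a toy suffix-stripping function. It is only a sketch: the rule list, the minimum-length check, and the name `toy_stem` are invented for this example and cover just the four words in the table. Real stemmers such as Porter or Snowball apply many more rules and conditions.

```python
# Hypothetical (suffix, replacement) rules, checked in order.
RULES = [
    ("iness", "i"),   # happiness -> happi
    ("ies", "i"),     # studies   -> studi
    ("ily", "ili"),   # easily    -> easili
    ("ing", ""),      # running   -> runn (then undoubled to run)
]


def toy_stem(word):
    word = word.lower()
    for suffix, replacement in RULES:
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)] + replacement
            # Undouble a trailing consonant pair such as "nn".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "studies", "easily", "happiness"]:
    print(w, "->", toy_stem(w))
```

Note that the rules never consult a dictionary, which is why the output can be a non-word such as "studi".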

---

## 🧠 Why Is Stemming Useful in NLP?

In many NLP tasks, we don’t need to distinguish the different **forms** of a word; we only care about its **core meaning**.

### ✅ Use cases:

1. **Search engines / Information retrieval:**

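   * If someone searches for “**run**”, you also want to match “**running**” and “**runs**”.
   * Stemming helps normalize these to a common base (e.g., “run”). Irregular forms such as “**ran**” are not reached by suffix stripping; mapping them to “run” requires lemmatization.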

2. **Text classification / Topic modeling / Clustering:**

   * Reducing word variations improves signal-to-noise ratio.
   * Helps reduce the **dimensionality** of the vocabulary space.

3. **Smaller vocabulary, faster training:**

   * In tasks like spam detection or sentiment analysis, it rarely matters whether a token is “played”, “playing”, or “plays”. They all reduce to “play”, so the model has fewer distinct features to learn.

---

## 🚨 Stemming ≠ Lemmatization

| Feature  | **Stemming**                        | **Lemmatization**                  |
| -------- | ----------------------------------- | ---------------------------------- |
| Result   | Rough root (not always a real word) | Dictionary root (lemma)            |
| Speed    | Fast (rule-based)                   | Slower (uses grammar & dictionary) |
| Accuracy | Lower                               | Higher                             |
| Example  | "studies" → "studi"                 | "studies" → "study"                |

So, stemming is a **quick-and-dirty** solution, often used when:

* You don’t need grammatical correctness
* You want speed
* You’re handling a **large corpus** or building a **prototype**

---

## 🧰 When Should You Use a Stemmer?

✅ Use it when:

* You’re doing document matching, text search, keyword analysis
* You want fast preprocessing without worrying about correctness
* You’re okay with some over- or under-stemming

❌ Avoid it when:

* You need linguistically accurate output (e.g., chatbots, grammar checkers)
* You're working with multilingual text or morphologically rich languages


spaCy doesn't support stemming natively. But for completeness, we can stem using **NLTK**. Specifically, we can use the *Snowball stemmer* which is an improved version of the *Porter stemmer*.

In [76]:
#
# EXERCISE: Find out how to initialize the SnowballStemmer, then tokenize
# and stem the sentence below.
#
from nltk.stem.snowball import SnowballStemmer
s = 'He told Dr. Lovato that he was done with the tests and would post the results shortly.'

# Initialize the stemmer here.
print("Available languages for SnowballStemmer")
print(" ".join(SnowballStemmer.languages))

stemmer = SnowballStemmer("english")

# Tokenize, stem, and print the tokens.
from nltk.tokenize import TreebankWordTokenizer

tokenizer  = TreebankWordTokenizer()
tokens = tokenizer.tokenize(s)

stemmed_tokens  = [stemmer.stem(token) for token in tokens]

print(f"Tokens : {tokens}")
print(f"Stemmer : {stemmed_tokens}")

Available languages for SnowballStemmer
arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish
Tokens : ['He', 'told', 'Dr.', 'Lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']
Stemmer : ['he', 'told', 'dr.', 'lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'test', 'and', 'would', 'post', 'the', 'result', 'short', '.']


In [88]:
#
# EXERCISE: Find out how to add and remove your own stop words in spaCy. Add the 
# word 'told' as a stop word, test that it works, then remove it from 
# the stop word list.
#
s = "He told the committee everything he knew about the incident, even though it put his own job at risk."
nlp = spacy.load("en_core_web_sm")
doc = nlp(s)

# Before adding 'told' as stop word
print(f"Original Stop Words: {[token.text for token in doc if token.is_stop]}")

# Add 'told' as a stop word
nlp.vocab["told"].is_stop = True
doc = nlp(s)
print(f"After Adding 'told': {[token.text for token in doc if token.is_stop]}")

# Remove 'told' from stop words
nlp.vocab["told"].is_stop = False
doc = nlp(s)
print(f"After Removing 'told': {[token.text for token in doc if token.is_stop]}")


Original Stop Words: ['He', 'the', 'everything', 'he', 'about', 'the', 'even', 'though', 'it', 'put', 'his', 'own', 'at']
After Adding 'told': ['He', 'told', 'the', 'everything', 'he', 'about', 'the', 'even', 'though', 'it', 'put', 'his', 'own', 'at']
After Removing 'told': ['He', 'the', 'everything', 'he', 'about', 'the', 'even', 'though', 'it', 'put', 'his', 'own', 'at']


In [84]:
#
# EXERCISE: Read up on how to add your own custom attributes to Token objects
# and try adding one of your own.
# https://spacy.io/usage/processing-pipelines#custom-components-attributes
#
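
One possible solution sketch for this exercise registers a getter-based extension on **Token** with `Token.set_extension` and reads it back through the `._` namespace. The attribute name `is_fruit`, the word list, and the blank pipeline are all hypothetical choices made for illustration.

```python
import spacy
from spacy.tokens import Token

# Hypothetical attribute name and word list, chosen only for illustration.
FRUITS = {"apple", "pear", "banana"}

# force=True lets the cell be re-run without an "already registered" error.
Token.set_extension("is_fruit", getter=lambda token: token.lower_ in FRUITS, force=True)

# Assumption: a tokenizer-only pipeline is enough for this attribute.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("I ate an apple and a pear.")

# Custom attributes live under the `._` namespace.
fruits = [t.text for t in doc_blank if t._.is_fruit]
print(fruits)
```

A getter is computed on access, so it suits attributes that can be derived from the token itself; values that must be stored per token can use a `default` instead.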

# Advanced Preprocessing

## Part-of-Speech Tagging, Named Entity Recognition, and Parsing.

Course module for this demo:
https://www.nlpdemystified.org/course/advanced-preprocessing

**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**


In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

spaCy performs Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and parsing as part of its default pipeline in the *nlp* object.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "John watched an old movie at the cinema."
doc = nlp(s)

### Part-of-Speech Tagging

POS tags can be accessed through the *pos_* attribute.

In [None]:
[(t.text, t.pos_) for t in doc]

To get a description for a POS tag, we can use _spacy.explain_.

In [None]:
spacy.explain('PROPN')

The POS tags above are called *coarse-grained* tags. You can also access *fine-grained* tags through the *tag_* attribute. Fine-grained tags provide more detailed information about a token such as its tense and, if a word is a pronoun, what specific type of pronoun it is.

In [None]:
[(t.text, t.tag_) for t in doc]

So **NNP** refers specifically to a _singular proper noun_, and **VBD** is a verb in *past tense*.

In [None]:
print(spacy.explain('NNP'))
print(spacy.explain('VBD'))

### Named Entity Recognition

There are multiple ways to access named entities. One way is through the *ent_type_* attribute.


In [None]:
s = "Volkswagen is developing an electric sedan which could potentially come to America next fall."
doc = nlp(s)

[(t.text, t.ent_type_) for t in doc]

You can view spaCy's named entities annotations here:<br>
https://spacy.io/api/annotation#named-entities

or use _spacy.explain_.

In [None]:
spacy.explain('GPE')

You can also check if a token is an entity before printing it by checking whether the _ent_type_ (note the lack of trailing underscore) attribute is non-zero.

In [None]:
print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])

Another way is through the _ents_ property of the **Doc** object. Here, we iterate through _ents_ and print the entity itself and its label.

In [None]:
print([(ent.text, ent.label_) for ent in doc.ents])

Note how "next fall" is output above as a single span when you use _ents_.
<br><br>
You can also access the positions of entities:

In [None]:
print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

spaCy is bundled with visualizers for both parsing and named entities.<br>
https://spacy.io/usage/visualizers
<br><br>
Here, we visualize the entities in our sample sentence.

In [None]:
from spacy import displacy

# We need to set the 'jupyter' variable to True in order to output
# the visualization directly. Otherwise, you'll get raw HTML.
displacy.render(doc, style='ent', jupyter=True)

For domain-specific corpora, an NER tagger may need to be further fine-tuned. Here, we may want _The Martian_ tagged as a "FILM" (assuming that's our goal).

In [None]:
s = "Ridley Scott directed The Martian."
doc = nlp(s)
displacy.render(doc, style='ent', jupyter=True)

### Parsing

Let's first visualize a parse to make it easier to follow.

In [None]:
s = "She enrolled in the course at the university."
doc = nlp(s)

# Note the 'style' argument is assigned a 'dep' flag this time around.
displacy.render(doc, style='dep', jupyter=True)

The visualization above is for a dependency parse (spaCy doesn't come with a constituency parser). For each dependency, spaCy visualizes the child (pointed to), the head (pointed from), and their relationship (the labelled arc). You can view the dependency annotations here:<br>
https://spacy.io/api/annotation#dependency-parsing

You can also use *spacy.explain* to get information on a particular annotation.

In [None]:
spacy.explain('nsubj')

The dependency labels themselves can be accessed through the *dep_* attribute.

In [None]:
[(t.text, t.dep_) for t in doc]

Note how the word 'enrolled' is the _ROOT_.
<br><br>
But the labels above don't show how the words are related to each other (the arcs). To get a better idea, you can print the head of each dependency.

In [None]:
[(t.text, t.dep_, t.head.text) for t in doc]
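Since every token points to its head, you can follow those links all the way up to the root. The sketch below is a hedged illustration: so that it runs without a downloaded model, the parse is annotated by hand (one plausible analysis, not necessarily what the model predicts), and `path_to_root` is a hypothetical helper, not part of spaCy.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: heads and labels are written by hand for illustration.
# Each head is the index of the token's parent; the root points to itself.
words = ["She", "enrolled", "in", "the", "course", "at", "the", "university", "."]
heads = [1, 1, 1, 4, 2, 1, 7, 5, 1]
deps = ["nsubj", "ROOT", "prep", "det", "pobj", "prep", "det", "pobj", "punct"]
parsed = Doc(Vocab(), words=words, heads=heads, deps=deps)


def path_to_root(token):
    """Return the texts visited when climbing from a token to the root."""
    path = [token.text]
    # The root is the only token that is its own head.
    while token.head.i != token.i:
        token = token.head
        path.append(token.text)
    return path


print(path_to_root(parsed[7]))
# The direct children of the root, i.e. the arcs leaving 'enrolled'.
root_children = [child.text for child in parsed[1].children]
print(root_children)
```

With the notebook's own `doc`, the same helper can be called on any token, e.g. `path_to_root(doc[7])`.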

### Using spaCy's Matcher to find patterns
spaCy comes with a host of pattern-matching functionality. Beyond regex, spaCy can match on a variety of attributes such as POS tags, entity labels, lemmas, dependencies, entire phrases, and a lot more. You can learn more here:<br>
https://spacy.io/usage/rule-based-matching<br>
https://explosion.ai/demos/matcher
<br><br>
Here, we try to search for patterns that may be useful for a hospitality bot.

In [None]:
# The general Matcher is one of multiple matcher objects
# included with spaCy.
from spacy.matcher import Matcher

# We initialize the Matcher with the pipeline's vocab object, which must
# be shared with the documents the matcher will operate on.
matcher = Matcher(nlp.vocab)

s = "I want to book a hotel room."
doc = nlp(s)

# Patterns are expressed as an ordered sequence. Here, we're looking
# to match occurrences starting with a 'book' string followed by
# a determiner (DET) POS tag, then a noun POS tag.
# The OP key is a quantifier that controls how many times a token
# pattern may match.

# Here, the DET POS (marked with '?') will match 0 or 1 times, and
# the NOUN POS (marked with '+') will match 1 or more times.
# See this link for more information:
# https://spacy.io/usage/rule-based-matching#quantifiers
pattern = [
  {'TEXT': 'book'},
  {'POS': 'DET', 'OP': '?'},
  {'POS': 'NOUN', 'OP': '+'},
]

# We give our pattern a label and pass it to the matcher.
matcher.add('USER_INTENT', [pattern])

# Run the matcher over the doc.
matches = matcher(doc)

# For each match, the matcher returns a tuple specifying a match id, start, 
# and end of the match.
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
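Because '+' matches one or more nouns, a pattern like this can return overlapping matches (for example, both "book a hotel" and "book a hotel room"). If you only want the longest one, `spacy.util.filter_spans` is one option. The sketch below is a minimal illustration: so that it runs without a downloaded model, the POS tags are assigned by hand, and the names `tagged` and `demo_matcher` are made up for this example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Assumption: POS tags are written by hand instead of being predicted.
vocab = Vocab()
words = ["I", "want", "to", "book", "a", "hotel", "room", "."]
pos = ["PRON", "VERB", "PART", "VERB", "DET", "NOUN", "NOUN", "PUNCT"]
tagged = Doc(vocab, words=words, pos=pos)

# Same pattern as in the cell above.
demo_matcher = Matcher(vocab)
demo_matcher.add("USER_INTENT", [[
    {"TEXT": "book"},
    {"POS": "DET", "OP": "?"},
    {"POS": "NOUN", "OP": "+"},
]])

# Convert (match_id, start, end) tuples into Span objects.
spans = [tagged[start:end] for _, start, end in demo_matcher(tagged)]
print([span.text for span in spans])

# filter_spans drops overlapping spans, preferring the longest ones.
longest = filter_spans(spans)
print([span.text for span in longest])
```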

The code above demonstrates the Matcher but is brittle.
- What if "book" is capitalized?
- What if a user types "reserve" instead of "book"?
- How can we match on "hotel room" as a compound noun?
- What if a user types "book a flight and hotel room"?

Can you think of how you would handle these cases?
<br><br>
We could come up with more rules to match different patterns. Alternatively, we could search for keywords based on POS tags and entities (e.g. a country), present the user with several possible intentions, and let them choose one. We could also have several interpretation functions each submit an answer and select the most likely one based on what was historically accepted most often. Finally, we can ask clarifying questions to narrow things down.
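As one small example of an extra rule, the first two questions (capitalization and synonyms) can be addressed by matching on the lowercased token text against a set of keywords. This is only a sketch under stated assumptions: it uses a blank English pipeline (tokenizer only), and the keyword list and the helper `find_booking_verbs` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: LOWER is a lexical attribute, so a tokenizer-only pipeline
# is enough for this sketch.
nlp_blank = spacy.blank("en")
keyword_matcher = Matcher(nlp_blank.vocab)

# LOWER compares against the lowercased token text; IN accepts any value
# from the list. The keyword list itself is just an illustrative guess.
keyword_matcher.add("BOOKING_VERB", [[{"LOWER": {"IN": ["book", "reserve"]}}]])


def find_booking_verbs(text):
    """Return the tokens in `text` that match one of the booking keywords."""
    doc = nlp_blank(text)
    return [doc[start:end].text for _, start, end in keyword_matcher(doc)]


print(find_booking_verbs("Book a hotel room."))
print(find_booking_verbs("I want to reserve a table."))
```

Note that this would also match "book" used as a noun ("I read a book"), which is where combining the keyword with POS tags from a full pipeline becomes useful.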
<br><br>
For example, for the last sentence, you could have a function scan through the **Doc** object's *noun_chunks* (phrases that have a noun as their head) and isolate keywords there along with potential conjunctions (e.g. "and").<br>
https://spacy.io/usage/linguistic-features#noun-chunks


In [None]:
doc = nlp("I want to book a flight and hotel room in Berlin.")
for noun_phrase in doc.noun_chunks:
  print("phrase: {}, root head: {}".format(noun_phrase, noun_phrase.root.head))

Using pure rules is a good place to start or prototype (especially if the domain is narrow with a tight set of use cases) but as our requirements get more sophisticated, we'll need to blend in other approaches such as classical models or perhaps deep learning (at the very least, maybe tune existing neural networks). spaCy's models can be updated with more examples to fine-tune predictions.<br>
https://spacy.io/usage/training<br>
<br>
We'll keep learning more approaches as the course progresses.

### Talkin' like Yoda
Languages like English are built around the _subject-verb-object_ pattern. But Yoda from Star Wars famously speaks in an _object-subject-verb_ pattern. Using the information in a dependency parse, we can turn basic English sentences into Yoda-speak.

In [None]:
def yodize(s: str):
  doc = nlp(s)
  for t in doc:
    if t.dep_ == "ROOT":

      # Assuming our sentence is of the form subject-verb-object, we take 
      # everything after the root (likely verb) and put it in front, and 
      # likewise take everything before the root, and put it after.
      seq = [doc[t.i + 1: -1].text, doc[0: t.i].text, t.text + '.']
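      # Upper-case only the first character. str.capitalize() would also
      # lower-case the rest (turning "to Texas" into "To texas").
      seq[0] = seq[0][:1].upper() + seq[0][1:]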
      print(' '.join(seq))

In [None]:
yodize("I will fly to Texas.")

This is ok for simple sentences but starts getting weird with longer, more convoluted sentences. What are some ways you would improve this?

### Advanced Preprocessing Exercises

In [None]:
#
# EXERCISE: Learn how to extend spaCy's NER models. Specifically, how to add new
# entity names and entity types. 
#

In [None]:
#
# EXERCISE: using doc.ents, identify and print the dates in this sentence.
# Expected output: ['Feb 13th', 'Feb 24th']
#
s = "We'll be in Osaka on Feb 13th and leave on Feb 24th."
doc = nlp(s)



In [None]:
#
# EXERCISE: Read about spaCy's PhraseMatcher
# https://spacy.io/usage/rule-based-matching#phrasematcher
#
# Using the PhraseMatcher, find the start and end index of all occurrences 
# of 'Caesar Augustus' and 'Roman Empire' (case-insensitive).
#
# Expected output: [(0, 2), (15, 17)]
#
from spacy.matcher import PhraseMatcher
s = "Caesar Augustus was the founder of the Roman Principate (the first phase of the Roman Empire)."
doc = nlp(s)


# Additional Reading and Resources

Read through this page to learn more about spaCy's language processing pipeline including what's going on under the hood, how to create custom components, disable certain components (e.g. NER) when they're unneeded, optimization tips, and best practices:<br>
https://spacy.io/usage/processing-pipelines
<br><br>
Take the free and succinct spaCy course (available in multiple languages):<br>
https://course.spacy.io/
