## Workshop - Deconstructing Language for AI Agents

Objective: By the end of this 90-minute workshop, you will understand and have implemented the core components of a Natural Language Processing (NLP) pipeline. You will learn how raw text is systematically processed to become useful, structured data for training an AI agent like Google's BARD.

Duration: 90 minutes

### 1. Introduction: How Does an AI Learn Language?

When you ask an LLM, such as BARD, a question, it seems to just understand you. But that "understanding" is the result of a massive amount of pre-processing.

An AI doesn't read text like a human. It sees text as a sequence of numbers. To an AI, the words "run," "running," and "ran" are initially just different, unrelated data points. The goal of an NLP pipeline is to standardize and simplify text, transforming it into a structured format that a machine can learn patterns from.
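To make the "sequence of numbers" idea concrete, here is a minimal sketch that assigns each distinct word an integer ID. The helper names `build_vocab` and `encode` are hypothetical and used only for illustration; real systems use more elaborate schemes, but the principle is the same.

```python
def build_vocab(words):
    """Assign each distinct word the next free integer ID, in first-seen order."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(words, vocab):
    """Replace every word with its integer ID."""
    return [vocab[word] for word in words]


words = ["run", "running", "ran", "run"]
vocab = build_vocab(words)
ids = encode(words, vocab)

print(vocab)  # {'run': 0, 'running': 1, 'ran': 2}
print(ids)    # [0, 1, 2, 0]
```

Notice that nothing in the IDs 0, 1, and 2 tells the machine these three words are related. That is the gap the pre-processing steps try to narrow.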

Think of it as preparing ingredients for a recipe. You don't just throw a whole carrot into the pot. You wash it, peel it, and chop it. Our "ingredients" are words, and the "recipe" is the AI model we want to train. 

During this workshop, we'll learn the essential pre-processing steps.

### 2. The Raw Material: A Look at Our Text Data

For any NLP task, we need data. We'll be using the "20 Newsgroups" dataset, a classic collection of about 18,000 documents posted to 20 different Usenet newsgroups. It's perfect for us because it's real, messy text with a large vocabulary.

Let's start by loading the data and performing a quick Exploratory Data Analysis (EDA) to understand what we're working with.

Sample EDA: A First Glance at the Data
First, we need to install the necessary libraries. scikit-learn provides the dataset, nltk is our main NLP toolkit, and matplotlib helps us visualize our findings.

In [1]:
#
# Prerequisite installations for the workshop
# To set up the environment, run this script:
# !pip install scikit-learn nltk matplotlib
#
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Eespinosa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Eespinosa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Eespinosa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Now, let's load the data and see what a raw document looks like.

In [2]:
# Load the dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Let's inspect a single document
print("--- RAW DOCUMENT ---")
sample_doc = newsgroups_data.data[12]
print(sample_doc)

# Let's get some basic stats
all_text = ' '.join(newsgroups_data.data)
words = all_text.split()
unique_words = set(words)

print(f"\n--- INITIAL STATS ---")
print(f"Number of documents: {len(newsgroups_data.data)}")
print(f"Total words (approx.): {len(words)}")
print(f"Total unique words (vocab size): {len(unique_words)}")

--- RAW DOCUMENT ---
930418

Do what thou wilt shall be the whole of the Law. [Honestly.]
The word of Sin is Restriction. [Would I kid you?]


Does one man's words encompass the majestic vision of thousands
of individuals?  Quoting a man is not the same as quoting the
Order.  Taken out of context, words can be interpreted much
differently than had one applied them within the confines of
their original expression.

I think this is the case regarding Hymenaeus Beta, Frater Superior 
of the Order to which I belong.  When he included that bit
from Merlinus X' he did us all a service.  He showed us the extremes
to which Order members have been known to go in their fervor.
I have little knowledge regarding Reuss' background, but surely
he was an unusual man, and he was an important force in the Order 
for many years.

Yet as people change so do Orders change, and while we look back
so carefully at the dirty laundry of O.T.O. remember that this is
only the surface skim and that many perspecti

You'll notice our vocabulary is massive! This is the complexity we need to reduce.
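One reason the raw vocabulary is so large can be seen on a tiny scale. The sketch below uses a made-up two-sentence corpus (an assumption standing in for `newsgroups_data.data`) and the same whitespace split as the initial stats, then counts the resulting "words".

```python
from collections import Counter

# Hypothetical mini-corpus standing in for newsgroups_data.data
docs = [
    "The Order changed over the years.",
    "The members of the Order changed too.",
]

# Same whitespace split used for the initial stats
words = " ".join(docs).split()
counts = Counter(words)

# Capitalization and attached punctuation create separate vocabulary entries
print(counts["The"], counts["the"])
print("years." in counts, "years" in counts)
print(len(words), len(set(words)))
```

Here "The" and "the" are counted as two different entries, and "years." keeps its trailing period. The pipeline steps that follow remove exactly this kind of duplication.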

### 3. The NLP Pipeline: From Chaos to Structure

We will now build the core components of our pipeline. Each step feeds its output to the next, progressively refining the text.

#### Part 1: Tokenization - Creating the Building Blocks

Concept: Tokenization is the process of breaking down a stream of text into individual units, or tokens. These tokens are the fundamental building blocks for any further analysis. Usually, tokens are words, but they can also be sentences or characters.

Implementation: A Simple Tokenizer
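We can start with a basic tokenizer that uses a regular expression to pull out runs of word characters (letters, digits, and underscores), discarding punctuation and whitespace along the way.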

In [3]:
import re

def simple_tokenizer(text):
    """
    A simple implementation of a word tokenizer.
    1. Finds all runs of word characters, skipping punctuation and whitespace.
    2. Returns a list of word tokens.
    """
    # Use a regular expression to find all sequences of word characters
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

# Let's tokenize our sample document
raw_tokens = simple_tokenizer(sample_doc)
print("--- TOKENIZATION ---")
print(raw_tokens)

--- TOKENIZATION ---
['930418', 'Do', 'what', 'thou', 'wilt', 'shall', 'be', 'the', 'whole', 'of', 'the', 'Law', 'Honestly', 'The', 'word', 'of', 'Sin', 'is', 'Restriction', 'Would', 'I', 'kid', 'you', 'Does', 'one', 'man', 's', 'words', 'encompass', 'the', 'majestic', 'vision', 'of', 'thousands', 'of', 'individuals', 'Quoting', 'a', 'man', 'is', 'not', 'the', 'same', 'as', 'quoting', 'the', 'Order', 'Taken', 'out', 'of', 'context', 'words', 'can', 'be', 'interpreted', 'much', 'differently', 'than', 'had', 'one', 'applied', 'them', 'within', 'the', 'confines', 'of', 'their', 'original', 'expression', 'I', 'think', 'this', 'is', 'the', 'case', 'regarding', 'Hymenaeus', 'Beta', 'Frater', 'Superior', 'of', 'the', 'Order', 'to', 'which', 'I', 'belong', 'When', 'he', 'included', 'that', 'bit', 'from', 'Merlinus', 'X', 'he', 'did', 'us', 'all', 'a', 'service', 'He', 'showed', 'us', 'the', 'extremes', 'to', 'which', 'Order', 'members', 'have', 'been', 'known', 'to', 'go', 'in', 'their', 'ferv

> Talking Point: "Tokenization is the first and most critical step; it defines the discrete units of meaning—our 'words'—that the machine will learn from."
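The output above also shows a limitation of this pattern: "man's" is split into 'man' and 's'. As a hedged sketch, the alternative pattern below keeps an apostrophe when it sits between word characters. It is one possible refinement, not the tokenizer used in the rest of this workshop, and it still breaks abbreviations such as "O.T.O." apart.

```python
import re

text = "Does one man's words matter? Ask the O.T.O."

# Pattern from simple_tokenizer: runs of word characters only
basic = re.findall(r"\b\w+\b", text)

# Alternative sketch: also keep an apostrophe that sits between word characters
with_apostrophes = re.findall(r"\w+(?:'\w+)*", text)

print(basic)
print(with_apostrophes)
```

Library tokenizers handle many more cases (abbreviations, URLs, hyphenation), which is why hand-written patterns are usually only a starting point.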

#### Part 2: Normalization & Stop Words Removal - Cleaning the Noise

Concept: Normalization is about standardizing tokens so that different surface forms of a word are treated as the same token. The most common step is converting all text to lowercase. Why? Because we don't want the model to treat "The" at the start of a sentence and "the" in the middle of one as two different words. The trade-off is that lowercasing also merges words where capitalization carries meaning, such as "Apple" (the company) and "apple" (the fruit).

Concept: Stop words are extremely common words like "the," "a," "is," "in." They add grammatical structure for humans but often add little semantic value for a machine. Removing them helps the model focus on the words that carry the most meaning.

Implementation: Lowercasing and Filtering

In [4]:
from nltk.corpus import stopwords

def normalize_and_remove_stops(tokens):
    """
    1. Converts all tokens to lowercase (Normalization).
    2. Removes common English stop words.
    """
    # 1. Normalization: Convert to lowercase
    normalized_tokens = [token.lower() for token in tokens]

    # 2. Stop Words Removal
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in normalized_tokens if token not in stop_words]
    return filtered_tokens

# Process our tokens
cleaned_tokens = normalize_and_remove_stops(raw_tokens)
print("--- NORMALIZATION & STOP WORDS REMOVAL ---")
print(cleaned_tokens)

--- NORMALIZATION & STOP WORDS REMOVAL ---
['930418', 'thou', 'wilt', 'shall', 'whole', 'law', 'honestly', 'word', 'sin', 'restriction', 'would', 'kid', 'one', 'man', 'words', 'encompass', 'majestic', 'vision', 'thousands', 'individuals', 'quoting', 'man', 'quoting', 'order', 'taken', 'context', 'words', 'interpreted', 'much', 'differently', 'one', 'applied', 'within', 'confines', 'original', 'expression', 'think', 'case', 'regarding', 'hymenaeus', 'beta', 'frater', 'superior', 'order', 'belong', 'included', 'bit', 'merlinus', 'x', 'us', 'service', 'showed', 'us', 'extremes', 'order', 'members', 'known', 'go', 'fervor', 'little', 'knowledge', 'regarding', 'reuss', 'background', 'surely', 'unusual', 'man', 'important', 'force', 'order', 'many', 'years', 'yet', 'people', 'change', 'orders', 'change', 'look', 'back', 'carefully', 'dirty', 'laundry', 'remember', 'surface', 'skim', 'many', 'perspectives', 'encompassed', 'extend', 'beyond', 'one', 'individual', 'hope', 'show', 'much', 'room'

> Talking Point (Normalization): "By converting every token to lowercase, we prevent the model from treating the same word differently, drastically reducing the vocabulary size it needs to handle."

> Talking Point (Stop Words): "Removing stop words is like filtering out the static; it lets the model focus on the high-signal words that define the document's topic."
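Stop-word removal is not free of side effects. Many standard lists, NLTK's English list among them, contain negations such as "not". The sketch below uses a small hand-picked stop list (an assumption for illustration, not NLTK's full list) to show how two sentences with opposite meanings can collapse to the same tokens.

```python
# Hand-picked stop list for illustration only (not NLTK's full list)
STOP_WORDS = {"this", "is", "the", "a", "not"}


def remove_stops(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


positive = "this movie is good".split()
negative = "this movie is not good".split()

print(remove_stops(positive))  # ['movie', 'good']
print(remove_stops(negative))  # ['movie', 'good']

# One possible mitigation: keep negations by taking them out of the stop list
keep_negation = STOP_WORDS - {"not"}
print(remove_stops(negative, keep_negation))  # ['movie', 'not', 'good']
```

For topic classification this loss rarely matters, but for sentiment analysis it can, so the stop list is worth reviewing against the task.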

#### Part 3: Stemming - Finding the Root

Concept: Stemming is the process of reducing a word to its root or stem by applying a crude set of rules that chop off common affixes. The Porter stemmer used below strips suffixes only. For example, "running," "runs," and "run" are all stemmed to "run." Irregular forms such as "ran" are left unchanged, because mapping them to "run" requires dictionary-based lemmatization rather than stemming. This affix stripping is heuristic and can produce non-dictionary words (e.g., "studies" becomes "studi"). However, it's fast and effective for collapsing vocabulary.
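To see what "a crude set of rules" means in practice, here is a toy suffix stripper. It is a deliberately simplified stand-in, not the Porter algorithm, and the function name `toy_stem` and its three-suffix rule list are assumptions made for this illustration.

```python
def toy_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["jumping", "jumped", "jumps", "running", "ran"]:
    print(word, "->", toy_stem(word))
```

The three forms of "jump" collapse to one stem, "running" becomes the non-word "runn" because this toy version has no rule for doubled consonants, and the irregular form "ran" passes through unchanged. Real stemmers add many more rules, but they remain rule-based and share the same kind of blind spots.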

Implementation: The Porter Stemmer
We will use the famous Porter Stemming algorithm, available in NLTK.

In [5]:
from nltk.stem import PorterStemmer

def stem_tokens(tokens):
    """
    Applies the Porter Stemming algorithm to a list of tokens.
    This is a form of affix stripping.
    """
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Apply stemming to our cleaned tokens
final_tokens = stem_tokens(cleaned_tokens)
print("--- STEMMING (AFFIX STRIPPING) ---")
print(final_tokens)

--- STEMMING (AFFIX STRIPPING) ---
['930418', 'thou', 'wilt', 'shall', 'whole', 'law', 'honestli', 'word', 'sin', 'restrict', 'would', 'kid', 'one', 'man', 'word', 'encompass', 'majest', 'vision', 'thousand', 'individu', 'quot', 'man', 'quot', 'order', 'taken', 'context', 'word', 'interpret', 'much', 'differ', 'one', 'appli', 'within', 'confin', 'origin', 'express', 'think', 'case', 'regard', 'hymenaeu', 'beta', 'frater', 'superior', 'order', 'belong', 'includ', 'bit', 'merlinu', 'x', 'us', 'servic', 'show', 'us', 'extrem', 'order', 'member', 'known', 'go', 'fervor', 'littl', 'knowledg', 'regard', 'reuss', 'background', 'sure', 'unusu', 'man', 'import', 'forc', 'order', 'mani', 'year', 'yet', 'peopl', 'chang', 'order', 'chang', 'look', 'back', 'care', 'dirti', 'laundri', 'rememb', 'surfac', 'skim', 'mani', 'perspect', 'encompass', 'extend', 'beyond', 'one', 'individu', 'hope', 'show', 'much', 'room', 'differ', 'opinion', 'within', 'order', 'perhap', 'test', 'limit', 'let', 'us', 'exami

> Talking Point (Stemming/Affix Stripping): "Stemming aggressively strips word endings to group related terms; it’s a powerful but blunt tool for consolidating our vocabulary down to its core concepts."

#### Part 4: The Final Result & The Big Picture

Let's put it all together and see the transformation.

In [6]:
# The full pipeline
def nlp_pipeline(text):
    tokens = simple_tokenizer(text)
    cleaned_tokens = normalize_and_remove_stops(tokens)
    final_tokens = stem_tokens(cleaned_tokens)
    return final_tokens

# Before
print("--- BEFORE PIPELINE ---")
print(sample_doc)

print("\n" + "="*50 + "\n")

# After
print("--- AFTER PIPELINE ---")
processed_doc = nlp_pipeline(sample_doc)
print(processed_doc)

# Let's re-evaluate our vocabulary size on the entire dataset
all_processed_tokens = nlp_pipeline(' '.join(newsgroups_data.data))
final_vocab_size = len(set(all_processed_tokens))

print(f"\n--- FINAL STATS ---")
print(f"Initial unique words: {len(unique_words)}")
print(f"Final unique words (vocab size): {final_vocab_size}")
print(f"Vocabulary Reduction: {((len(unique_words) - final_vocab_size) / len(unique_words)) * 100:.2f}%")

--- BEFORE PIPELINE ---
930418

Do what thou wilt shall be the whole of the Law. [Honestly.]
The word of Sin is Restriction. [Would I kid you?]


Does one man's words encompass the majestic vision of thousands
of individuals?  Quoting a man is not the same as quoting the
Order.  Taken out of context, words can be interpreted much
differently than had one applied them within the confines of
their original expression.

I think this is the case regarding Hymenaeus Beta, Frater Superior 
of the Order to which I belong.  When he included that bit
from Merlinus X' he did us all a service.  He showed us the extremes
to which Order members have been known to go in their fervor.
I have little knowledge regarding Reuss' background, but surely
he was an unusual man, and he was an important force in the Order 
for many years.

Yet as people change so do Orders change, and while we look back
so carefully at the dirty laundry of O.T.O. remember that this is
only the surface skim and that many perspe

### Part 5: Conclusion 

As you can see, we've taken unstructured, messy human language and converted it into a clean, standardized list of meaningful tokens. We've dramatically reduced the vocabulary size, making it much more efficient for an AI model to process.

This processed data is now ready to be converted into numerical vectors, a process called vectorization (e.g., TF-IDF or Word2Vec), which is the final step before feeding it into a machine learning model for tasks like text classification or sentiment analysis. Large language models like BARD follow the same text-to-numbers principle, although they typically rely on subword tokenization rather than stemming and stop-word removal.
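As a rough sketch of what vectorization involves, the code below computes TF-IDF weights by hand using the textbook form tf × log(N / df). The token lists are made up to resemble pipeline output, and the function name `tf_idf` is hypothetical. Libraries such as scikit-learn use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical pipeline output: one token list per document
docs = [
    ["order", "member", "order"],
    ["order", "law"],
    ["law", "word", "sin"],
]


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


weights = tf_idf(docs)
for vec in weights:
    print({term: round(w, 3) for term, w in vec.items()})
```

In the first document, "member" ends up with a higher weight than "order" even though "order" occurs twice, because "order" also appears in another document and is therefore less distinctive.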

### Part 6: Fun Activity: Ask BERT a Question!

So far, we've built an NLP pipeline from scratch. This is a fundamental skill, and it shows you the deliberate, step-by-step process of making text useful for machines.

Now, let's jump to the forefront of NLP and use a massive, pre-trained model: BERT (Bidirectional Encoder Representations from Transformers).

What is BERT?
Think of the pipeline we just built. Its goal was to simplify and standardize words. A model like BERT does something far more complex. During its "pre-training" on a large text corpus (English Wikipedia and the BooksCorpus), it learned not just words, but the relationships between words.

Watch this video for an introduction to Transformers that explains what BERT is and where it came from:

[![Transformer models and BERT model: Overview](https://i.ytimg.com/vi/X5p-iS_6chc/maxresdefault.jpg)](http://www.youtube.com/watch?v=t45S_MwAcOw)


The key is its bidirectional nature. Unlike older models that read text left-to-right, BERT reads the entire sentence at once. This allows it to understand context. For example, it knows that "bank" in "river bank" is different from "bank" in "bank account." The pipeline we built would stem both to "bank" and lose that contextual meaning.

We can use the power of these pre-trained models for specific tasks without needing to train them ourselves. Let's try one of the most popular tasks: Extractive Question Answering.

The Goal: We will give the model a paragraph of text (the context) and a question. The model's job is to find and extract the exact span of text from the context that answers our question.
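"Extractive" has a precise meaning here: the answer is a slice of the context, identified by a start and an end character position. The sketch below builds a hypothetical result by hand, shaped like the dictionary a Hugging Face question-answering pipeline returns (`score`, `start`, `end`, `answer`). The score is made up, and no model is run.

```python
context = "BERT was developed by Google and released in 2018."

# Hypothetical result, shaped like the dict a question-answering pipeline returns
answer_text = "Google"
start = context.find(answer_text)
result = {
    "score": 0.98,  # made-up confidence, for illustration only
    "start": start,
    "end": start + len(answer_text),
    "answer": answer_text,
}

# An extractive answer is always a slice of the original context
assert context[result["start"]:result["end"]] == result["answer"]
print(result["start"], result["end"], result["answer"])
```

Because the answer must be a slice of the context, the model can never reply with words that do not appear there.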

#### Implementation: The Hugging Face Pipeline 🚀

The transformers library from Hugging Face makes using models like BERT incredibly simple with a tool called pipeline. It handles all the complex tokenization and processing behind the scenes, so we can focus on the results.

In [7]:
#
# Prerequisite: install the necessary libraries.
# The 'transformers' library requires either PyTorch or TensorFlow.
# We'll install torch here.
#
# !pip install transformers torch
#
from transformers import pipeline
import textwrap # Used for nice printing of the context

# 1. Create a Question-Answering pipeline
# With no model specified, this downloads the library's default model for the
# task (a distilled BERT variant fine-tuned on SQuAD in the run shown below).
# No authentication is needed!
qa_pipeline = pipeline("question-answering")

# 2. Define the context
# This is the body of text the model will read to find the answer.
context = """
BERT, which stands for Bidirectional Encoder Representations from Transformers,
is a large language model developed by Google. It was a major breakthrough in
the field of Natural Language Processing because it was the first model to look
at the full context of a word by examining the words that come before and after it.
This bidirectional capability allows it to grasp complex nuances in language.
The model was pre-trained on a massive amount of text data from sources like
Wikipedia and the BooksCorpus.
"""

# 3. Ask a question!
# The answer to this question must be present in the context above.
question = "What does BERT stand for?"

# 4. Get the answer
result = qa_pipeline(question=question, context=context)

# Let's print the results nicely
print("--- BERT Question-Answering ---")
print(f"❓ Question: {question}")
print(f"✅ Answer: '{result['answer']}'")
print(f"Confidence Score: {result['score']:.4f}")

# --- Now, it's your turn! ---
#
# Try asking different questions. For example:
# - "Who developed BERT?"
# - "What capability allows BERT to grasp nuances?"
# - "What was the model pre-trained on?"
#
# You can even change the context entirely to a paragraph about a
# topic you find interesting and ask questions about it!

print("\n--- Try it yourself! ---")
# Example of another question
my_question = "Who developed BERT?"
my_result = qa_pipeline(question=my_question, context=context)

print(f"❓ My Question: {my_question}")
print(f"✅ Answer: '{my_result['answer']}' with score {my_result['score']:.4f}")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


--- BERT Question-Answering ---
❓ Question: What does BERT stand for?
✅ Answer: 'Bidirectional Encoder Representations from Transformers'
Confidence Score: 0.7050

--- Try it yourself! ---
❓ My Question: Who developed BERT?
✅ Answer: 'Google' with score 0.9778


⚠️ A Quick Word on Model Limitations
It's important to understand that the BERT model we're using is not a general-purpose chatbot. It has been fine-tuned for a very specific task: extractive question answering.

This means its only job is to find and extract the exact span of text from the context that best answers the question. It does not understand abstract concepts, perform reasoning, or generate new sentences.

Watch what happens when we ask it about a joke.

In [8]:
# Example of another question
my_question = "Can BERT tell jokes?"
my_result = qa_pipeline(question=my_question, context=context)

print(f"❓ My Question: {my_question}")
print(f"✅ Answer: '{my_result['answer']}' with score {my_result['score']:.4f}")

❓ My Question: Can BERT tell jokes?
✅ Answer: 'Bidirectional Encoder Representations from Transformers' with score 0.0830


In [9]:
# The context contains a simple joke
joke_context = "I told my computer I needed a break, and now it won’t stop sending me Kit-Kat ads."

# The question requires understanding humor, not just extracting text
joke_question = "Why is this joke funny?"

# Let's see how the extractive model handles it
joke_result = qa_pipeline(question=joke_question, context=joke_context)

print(f"❓ Question: {joke_question}")
print(f"✅ Answer: '{joke_result['answer']}'")
print(f"Confidence Score: {joke_result['score']:.4f}")

❓ Question: Why is this joke funny?
✅ Answer: 'I needed a break'
Confidence Score: 0.0999


As you can see, the model fails. It extracts a piece of the sentence but completely misses the pun and the humorous concept. The low confidence score also shows it is struggling.
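Since the score is a useful warning sign, one simple safeguard is to discard answers below a cutoff. The sketch below is one possible approach; the function name `answer_or_abstain` and the threshold of 0.5 are assumptions chosen for illustration, and a sensible cutoff would need to be tuned on real questions.

```python
def answer_or_abstain(result, threshold=0.5):
    """Return the extracted answer only when its score clears the threshold."""
    if result["score"] >= threshold:
        return result["answer"]
    return None


# Answers and scores copied from the outputs above
confident = {"answer": "Google", "score": 0.9778}
doubtful = {"answer": "I needed a break", "score": 0.0999}

print(answer_or_abstain(confident))  # Google
print(answer_or_abstain(doubtful))   # None
```

A low score is only a hint, though. A model can also be confidently wrong, so a threshold reduces bad answers without eliminating them.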

This model will generally fail at questions that require:

- Reasoning or Synthesis: It can't combine multiple pieces of information.
- "Yes/No" Answers: Unless the words "yes" or "no" are explicitly in the text.
- Abstract Concepts: Like humor, sarcasm, or summarizing the main idea.

This is a key difference between an extractive model like this one and a generative AI like BARD, which is designed to understand these nuances and generate new, explanatory text.