# Class 1 & 2: NLP and Search
## Learning Notebook Part 1 - Foundation: Preprocessing & Bag of Words

**Welcome!** This notebook focuses on **preprocessing and tokenization** - the foundation of all NLP work!

**This notebook will:**
- 📝 Teach text preprocessing with **interactive TODOs** for class participation
- 🔧 Show regex patterns and tokenization techniques step by step
- 💡 Explain concepts with examples (complete implementations are in the Exercise Notebook)
- Convert text into numbers that computers can understand

**For hands-on practice:**
- Complete implementations and exercises are in the **Exercise Notebook**

Let's start by understanding why text preprocessing matters!

**By the end**: You'll understand text preprocessing, Bag of Words (word counts), and how to convert text to numbers. You'll learn that BOW is **syntactic** (word-based, no meaning) - true semantic search (understanding meaning) comes in Class 3 with dense embeddings!

**Important**: **Semantic = meaning**. In this class, we learn **syntactic** models (BOW) that work with word counts but don't understand meaning. Semantic models (embeddings) that understand meaning are in Class 3!

---

## 📚 Useful Resources & Tools

Before we dive in, here are essential resources you'll find helpful throughout this notebook:

> **💡 Learning Philosophy**: For the sake of **deeper understanding**, we'll implement many things from scratch in this course. This helps you truly grasp how the algorithms work, what the challenges are, and why certain design decisions matter. However, keep in mind that there are excellent tools and libraries available to help you in real-world projects once you understand the fundamentals. We'll use both approaches - building from scratch to learn, and leveraging tools to be productive!

### 🔧 Essential Libraries & Documentation

**Python Standard Library:**
- **`re` (Regular Expressions)**: [Python `re` documentation](https://docs.python.org/3/library/re.html)
  - Built-in regex support - no installation needed!
  - Essential for text cleaning and pattern matching

**NLP Libraries:**
- **NLTK (Natural Language Toolkit)**: [NLTK Documentation](https://www.nltk.org/)
  - Tokenization, stop words, stemming, lemmatization
  - `pip install nltk`
  
- **spaCy**: [spaCy Documentation](https://spacy.io/)
  - Industrial-strength NLP with built-in normalization
  - `pip install spacy` + `python -m spacy download en_core_web_sm`
  
- **scikit-learn**: [sklearn.feature_extraction.text](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text)
  - `CountVectorizer`, `TfidfVectorizer` for text vectorization
  - Already included in most data science environments

**Text Processing Utilities:**
- **Unidecode**: Transliterates Unicode text to plain ASCII (e.g. strips accents) - `pip install unidecode`
- **contractions**: Expand contractions - `pip install contractions`
- **TextBlob**: Simple NLP API - `pip install textblob`

### 🧪 Regex Testing & Learning Tools

**Regex101** - Interactive Regex Tester: [https://regex101.com/](https://regex101.com/)
- ⭐ **Highly Recommended!** Test your regex patterns in real-time
- See matches highlighted, understand each part of your pattern
- Supports Python regex flavor
- Great for debugging complex patterns

**Other Regex Resources:**
- **Regex Cheat Sheet**: [Quick Reference](https://www.rexegg.com/regex-quickstart.html)
- **Python Regex HOWTO**: [Official Python Guide](https://docs.python.org/3/howto/regex.html)
- **Regex Crossword**: [Learn by playing!](https://regexcrossword.com/)

### 📖 Quick Reference: Regex Cheat Sheet

```
CHARACTER CLASSES
.          Any character except newline
\w         Word character (letter, digit, underscore)
\W         Non-word character
\d         Digit (0-9)
\D         Non-digit
\s         Whitespace (space, tab, newline)
\S         Non-whitespace
[abc]      Any of a, b, or c
[^abc]     Not a, b, or c
[a-z]      Character range

ANCHORS
^          Start of string
$          End of string
\b         Word boundary

QUANTIFIERS
*          0 or more (greedy)
+          1 or more (greedy)
?          0 or 1 (optional)
{n}        Exactly n times
{n,}       n or more times
{n,m}      Between n and m times
*?         Non-greedy (lazy) match

GROUPS & CAPTURES
(abc)      Capture group
(?:abc)    Non-capturing group
(?P<name>abc)  Named group
\1         Backreference to group 1

LOOKAROUNDS
(?=abc)    Positive lookahead
(?!abc)    Negative lookahead
(?<=abc)   Positive lookbehind
(?<!abc)   Negative lookbehind

ESCAPE SEQUENCES
\.         Literal period
\\         Literal backslash
\n         Newline
\t         Tab
```
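
A few of the cheat-sheet entries are easier to remember once you see them run. The snippet below is a minimal sketch on a made-up sentence (the text and variable names are illustrative, not part of the course data); it exercises named groups, greedy vs. lazy matching, and a positive lookbehind.

```python
import re

# Made-up example sentence (an assumption for illustration only)
text = "Released 2010-07-16: <b>Inception</b> and <b>Tenet</b> cost $160M and $200M."

# Named groups: pull structured fields out of a date
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
year = date.group("year")

# Greedy vs. lazy: .* grabs as much as possible, .*? as little as possible
greedy = re.findall(r"<b>.*</b>", text)
lazy = re.findall(r"<b>(.*?)</b>", text)

# Positive lookbehind: digits that directly follow a dollar sign
budgets = re.findall(r"(?<=\$)\d+", text)

print("year:   ", year)
print("greedy: ", greedy)
print("lazy:   ", lazy)
print("budgets:", budgets)
```

Note how the greedy pattern swallows everything between the first `<b>` and the last `</b>`, while the lazy version returns each title separately.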

### 💡 Pro Tips

1. **Start with Regex101**: Test patterns before coding
2. **Use raw strings**: `r"pattern"` in Python to avoid escaping issues
3. **Test incrementally**: Build complex patterns step by step
4. **Read the docs**: Python's `re` module has great examples
5. **Practice**: Regex is a skill that improves with use!

**Remember**: You don't need to memorize everything - bookmark these resources and refer back as needed!

---

### 🎓 From Scratch → Tools: The Learning Path

**In this course, you'll:**
1. ✅ **Build from scratch** - Implement tokenization, TF-IDF, similarity calculations manually
2. ✅ **Understand the why** - Learn the challenges, edge cases, and design decisions
3. ✅ **Then use tools** - Apply libraries like `sklearn`, `spaCy`, `NLTK` with full understanding

**Why this approach?**
- **Deeper understanding**: You'll know what's happening under the hood
- **Better debugging**: When things go wrong, you'll know where to look
- **Informed choices**: You'll choose the right tool for the right job
- **Customization**: You'll be able to modify and extend tools when needed

**In real projects**: Use the tools! But your foundation from building from scratch will make you a better practitioner. 🚀


---

## Today's Goal: Building a Movie Search System

**The Problem**: You're building a movie recommendation system. Users want to:
- 🔍 **Search** for movies by description (e.g., "space adventure", "mind-bending thriller")
- 📊 **Discover** similar movies automatically
- 💡 **Understand** meaning, not just match exact words

**The Challenge**:
- "Space adventure" should find "cosmic journey" and "galactic exploration" (requires understanding meaning - synonyms!)
- "Mind-bending" should find "psychological thriller" and "complex narrative" (requires understanding meaning!)
- We want to understand **meaning**, not just keywords!

**What we'll learn TODAY (Syntactic approaches)**:
1. Start with simple keyword search and multiple keyword search (see their limitations)
2. Learn simple tokenization (splitting text into words)
3. Learn text preprocessing (cleaning, tokenization, regex)
4. Learn n-grams (feature extraction - capturing word order)
5. Convert text to numbers (Bag of Words - word counts/vectorization)
6. **Next in Part 2**: TF-IDF, similarity-based search, and clustering

**Pipeline order** (very important!):
```
Keyword Search → Tokenization → Preprocessing → N-grams → Vectorization (BoW - word counts) → Applications (Search, Clustering)
```

**What's coming NEXT CLASS (Semantic approaches)**:
- Embeddings for true semantic search (understanding meaning, synonyms)
- "Space" and "cosmic" will be similar because they share meaning!

**Key Point**: In Part 1, we learn **syntactic** models (BOW - word counts). In Part 2, we'll learn TF-IDF. Both work with word patterns but don't understand meaning. **Semantic = meaning**. True semantic search comes in Class 3!

**This is top-down learning**: We'll see the problem first, then learn the tools to solve it!

---

## What is Natural Language Processing (NLP)?

**NLP** = Teaching computers to understand, interpret, and generate human language

### Real-World Applications (Why This Matters!)

| Application | Example | Why It's Important |
|------------|---------|-------------------|
| **Search Engines** | Google, Bing | Finding relevant results for your queries |
| **Virtual Assistants** | Siri, Alexa, ChatGPT | Understanding what you're asking |
| **Spam Detection** | Email filters | Automatically identifying unwanted emails |
| **Sentiment Analysis** | Review analysis | Understanding if reviews are positive/negative |
| **Translation** | Google Translate | Converting between languages |
| **Text Summarization** | News digests | Condensing long articles into key points |

**Today's focus**: Search and clustering - foundational NLP tasks you'll use everywhere!

---

## Machine Learning in NLP: Two Approaches

### Unsupervised Learning (Today's Focus!)
- ❌ **No labels needed** - we don't tell the model the "right" answer
- ✅ **Clustering**: Automatically finding groups (e.g., similar movies)
- ✅ **Search**: Finding similar documents without examples
- 🎯 **Goal**: Discover patterns in the data

### Supervised Learning (You Know This from Chapter 0!)
- ✅ **Needs labeled data** - we provide correct answers
- 📊 **Text Classification**: Spam/not spam, genre classification
- 💭 **Sentiment Analysis**: Positive/negative/neutral labels
- 🎯 **Goal**: Learn to predict labels from examples

**Key Insight**: Same preprocessing, same vectors - just different goals!
- Unsupervised: Find patterns (no labels)
- Supervised: Predict labels (with examples)

We'll connect these at the end!


## Setup


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# For better output display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


### Load the Data


In [2]:
# Load movie descriptions
# If running in Google Colab and data file doesn't exist, download it from GitHub
import os

if not os.path.exists('data/movies.csv'):
    print("Data file not found. Downloading from GitHub...")
    os.makedirs('data', exist_ok=True)
    import urllib.request
    url = 'https://raw.githubusercontent.com/samsung-ai-course/8th-9th-edition/main/Chapter%202%20-%20Natural%20Language%20Processing/Class%201%20%26%202%20-%20NLP%20and%20Search/data/movies.csv'
    urllib.request.urlretrieve(url, 'data/movies.csv')
    print("✓ Data file downloaded successfully!")

df = pd.read_csv('data/movies.csv')
print(f"Loaded {len(df)} movies")
df.head()

Data file not found. Downloading from GitHub...


('data/movies.csv', <http.client.HTTPMessage at 0x7daf04db58e0>)

✓ Data file downloaded successfully!
Loaded 10000 movies


|   | movie_id | title | description | genre | rating |
|---|----------|-------|-------------|-------|--------|
| 0 | 1 | Edge of Code | A compelling romance film about a young advent... | Romance | 7.1 |
| 1 | 2 | Storm of Secret | This captivating romance movie follows a quest... | Romance | 6.3 |
| 2 | 3 | Under Warrior Redux | In this captivating war story, a secret organi... | War | 7.3 |
| 3 | 4 | Quest of Secret | A compelling fantasy film about a determined d... | Fantasy | 8.3 |
| 4 | 5 | Key of Game | A exploration adventure film about a master th... | Adventure | 6.2 |


## Why Text is Hard

Computers work with numbers, but text is made of words. This creates several challenges:

1. **Unstructured**: Text doesn't have a fixed format
2. **Synonyms**: "space" vs "cosmic" vs "galactic" - different words, similar meaning
3. **Context**: "bank" could mean financial institution or river edge
4. **Variations**: "sci-fi", "science fiction", "Science Fiction" - same concept
5. **Word order matters**: "dog bites man" vs "man bites dog" - completely different!

Let's look at our movie descriptions:


In [3]:
# Let's examine a movie description
print("Movie Description Example:")
print("=" * 60)
print(f"Title: {df.loc[0, 'title']}")
print(f"Description: {df.loc[0, 'description']}")
print("=" * 60)


Movie Description Example:
Title: Edge of Code
Description: A compelling romance film about a young adventurer. an epic adventure that spans continents and generations. a touching love story that will warm your heart.


## Two Approaches to Search

### 1. Keyword Search (Simple but Limited)
- Looks for exact word matches
- Can use simple substring matching or Term Frequency (TF) approaches
- Fast and simple
- Fails with synonyms ("space movie" won't find "cosmic adventure")
- **Syntactic only** - works with words, not meaning

### 2. True Semantic Search (Class 3 - Embeddings!)
- **Semantic = meaning-based** - understands synonyms and related concepts
- "Space" and "cosmic" are close in meaning (semantic similarity)
- Requires embeddings (dense vectors that capture meaning)
- This is what we'll learn in Class 3!

**Key distinction:**
- **Syntactic models** (BOW): Work with word presence/counts, no understanding of meaning
- **Semantic models** (embeddings): Understand meaning - synonyms and related concepts are similar

Let's start with different keyword search approaches:


### Approach 1: Simple Substring Matching (Vanilla Keyword Search)

The most basic approach - just check if the query word appears in the text.

**Concept**: Simple substring matching checks if query words appear anywhere in the document text. It's fast but limited - it only finds exact word matches and doesn't rank results.

**Implementation**: You'll implement this in **Exercise 4** in the Exercise Notebook!


In [4]:
# Simple keyword search - Let's implement this together!
def simple_keyword_search(df, query, column='description'):
    """
    Simple keyword search: finds documents whose text contains the whole query
    as a case-insensitive substring (so "space" also matches "spaceship")
    """
    query_lower = query.lower()
    results = []

    for idx, row in df.iterrows():
        text = str(row[column]).lower()

        if query_lower in text:
            results.append({
                'movie_id': row['movie_id'],
                'title': row['title'],
                'match': True
            })

    return pd.DataFrame(results)

# Let's test it together!
query = "space"
results = simple_keyword_search(df, query)
print(f"Found {len(results)} results for '{query}':")
print(results.head())


Found 564 results for 'space':
   movie_id            title  match
0        25     Legacy Falls   True
1        41   Fear of Knight   True
2        47      Light: Soul   True
3        65  Mage of Warrior   True
4        78   Warrior of War   True
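
The loop above is easy to read, but `iterrows()` gets slow on large tables. As a hedged alternative, the sketch below does the same case-insensitive substring check with pandas string methods. The `toy_df` frame and the function name `keyword_search_vectorized` are assumptions made up for this example; only the column names mirror the movie data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the movies DataFrame
toy_df = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Star Drift", "Quiet Harbor", "Orbit"],
    "description": [
        "A crew is lost in deep SPACE.",
        "A quiet drama set in a fishing town.",
        "A spaceship pilot returns home.",
    ],
})

def keyword_search_vectorized(df, query, column="description"):
    """Case-insensitive substring search using pandas string methods."""
    # regex=False treats the query as literal text, not as a pattern
    mask = df[column].astype(str).str.contains(query, case=False, regex=False)
    return df.loc[mask, ["movie_id", "title"]]

hits = keyword_search_vectorized(toy_df, "space")
print(hits)
```

Both movie 1 and movie 3 match, because substring search cannot tell "space" from "spaceship".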


### Approach 2: Multiple Keyword Search

Instead of searching for a single word, we can search for multiple words at once. The query "space adventure" should find documents containing both "space" AND "adventure".

**Concept**: Multiple keyword search finds documents that contain ALL query words. This is more precise than single-word search and allows for more specific queries.

**Implementation**: You'll implement this in **Exercise 4** in the Exercise Notebook!


In [5]:
# Multiple keyword search - Complete solution
def multiple_keyword_search(df, query, column='description'):
    """
    Multiple keyword search: finds documents containing ALL query words (AND logic)
    """
    query_lower = query.lower()
    query_words = query_lower.split()  # Split query into individual words

    results = []

    for idx, row in df.iterrows():
        text = str(row[column]).lower()

        # Check that ALL query words appear in the text (substring match per word)
        all_words_found = all(word in text for word in query_words)

        if all_words_found:
            results.append({
                'movie_id': row['movie_id'],
                'title': row['title'],
                'match': True
            })

    return pd.DataFrame(results)

# Let's test it!
query_multi = "space adventure"
results_multi = multiple_keyword_search(df, query_multi)
print(f"Found {len(results_multi)} results for '{query_multi}' (containing ALL words):")
print(results_multi.head())


Found 173 results for 'space adventure' (containing ALL words):
   movie_id            title  match
0        41   Fear of Knight   True
1        65  Mage of Warrior   True
2       103   Fire of Battle   True
3       148    Square of War   True
4       284        War: Soul   True
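
Because `word in text` is a substring test, "space" also counts as found inside "spaceship". If whole-word matching is what you want, one possible variant (a sketch, with a hypothetical helper name `contains_all_words`) wraps each query word in `\b` word boundaries:

```python
import re

def contains_all_words(text, query):
    """True if every query word occurs in text as a whole word (case-insensitive)."""
    return all(
        # re.escape keeps characters like "." or "+" in the query literal
        re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)
        for word in query.split()
    )

examples = [
    "A Space adventure begins.",
    "A spaceship misadventure.",
]
for doc in examples:
    substring_hit = all(w in doc.lower() for w in "space adventure".split())
    whole_word_hit = contains_all_words(doc, "space adventure")
    print(f"{doc!r}: substring={substring_hit}, whole-word={whole_word_hit}")
```

The second sentence passes the substring test but fails the whole-word test, which is usually closer to what a user means.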


## Simple Tokenization

Before we move to more advanced preprocessing, let's understand the basics of tokenization - splitting text into individual words.

**Tokenization** is the process of breaking text into smaller units (tokens), which are usually words.


In [6]:
# Simple tokenization examples
text = "Natural Language Processing is amazing! It's used everywhere."
print("Original:", text)

# Split on whitespace (simple, but punctuation stays attached to words)
tokens_simple = text.split()
print("\nSimple split:", tokens_simple)

# Regex tokenization (find all word characters)
tokens_regex = re.findall(r'\w+', text.lower())
print("Regex tokenization:", tokens_regex)

print("\n💡 Key Insight: Tokenization is the first step in converting text to numbers!")
print("   We'll use this when we create Bag of Words vectors.")


Original: Natural Language Processing is amazing! It's used everywhere.

Simple split: ['Natural', 'Language', 'Processing', 'is', 'amazing!', "It's", 'used', 'everywhere.']
Regex tokenization: ['natural', 'language', 'processing', 'is', 'amazing', 'it', 's', 'used', 'everywhere']

💡 Key Insight: Tokenization is the first step in converting text to numbers!
   We'll use this when we create Bag of Words vectors.


**Key Insights about Keyword Search**:

**Advantages**:
- ✅ Fast and simple
- ✅ Finds exact word matches
- ✅ Multiple keyword search allows more specific queries

**Limitations**:
- ❌ "space" won't match "cosmic" or "galactic" (synonyms) - **syntactic only, no meaning**
- ❌ "adventure" won't match "journey" or "quest" - different words = zero similarity
- ❌ Doesn't understand meaning - it's a **syntactic model** (word-based, not meaning-based)
- ❌ No ranking - all matches are equal

To do better, we need to:
1. **Preprocess** the text properly (tokenization, normalization)
2. **Convert** text to numerical vectors (Bag of Words - word counts)
3. **Measure similarity** between vectors (we'll learn this in Part 2!)

This moves us to **vectorization** - converting text to numbers. In Part 2, we'll learn TF-IDF for better search and clustering!
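
To make the "no ranking" limitation concrete, here is a minimal sketch that scores each document by how often the query words occur in it and sorts by that score. It is only a raw-count stepping stone, not TF-IDF; the function name `rank_by_term_count` and the three sentences are invented for illustration.

```python
import re
from collections import Counter

def rank_by_term_count(docs, query):
    """Return (doc_index, score) pairs sorted by total query-word occurrences."""
    query_words = re.findall(r"\w+", query.lower())
    scored = []
    for i, doc in enumerate(docs):
        counts = Counter(re.findall(r"\w+", doc.lower()))
        score = sum(counts[w] for w in query_words)
        if score > 0:  # keep only documents that match at least one word
            scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = [
    "A space adventure across space and time.",
    "A quiet romance in Paris.",
    "An adventure in the jungle.",
]
ranking = rank_by_term_count(docs, "space adventure")
print(ranking)
```

Counting already gives an ordering, but it treats every word as equally important; weighting rare words more heavily is the idea TF-IDF adds in Part 2.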

---

## Text Preprocessing Pipeline

Before we can work with text, we need to clean and prepare it. The pipeline has three main stages:

1. **Pre-processing** (before tokenization): Clean the raw text
2. **Tokenization**: Split text into individual words/tokens
3. **Post-processing** (after tokenization): Further refine the tokens

### Why Preprocessing Matters

**Garbage In = Garbage Out**: If we don't clean our text properly, our models will learn from noise, not signal!


### Stage 1: Pre-processing (Before Tokenization)

**Goal**: Clean the raw text before splitting it into words

Common tasks:
- Remove HTML tags
- Remove special characters
- Normalize URLs, emails, phone numbers (using **Regular Expressions/Regex**)
- Handle case (convert to lowercase)
- Remove extra whitespace

Let's see an example with **Regular Expressions (Regex)**:


In [7]:
# Text preprocessing with regex - Complete solution
sample_text = "Contact us at info@example.com or call (555) 123-4567. Visit https://example.com for more info!!!"
print("Original:", sample_text)

# Remove URLs using regex
# Pattern: r'https?://\S+' matches http:// or https:// followed by non-whitespace
text_no_urls = re.sub(r'https?://\S+', '', sample_text)
print("\nAfter removing URLs:", text_no_urls)

# Remove email addresses
# Pattern: r'\S+@\S+' matches non-whitespace characters before and after @
text_no_emails = re.sub(r'\S+@\S+', '', text_no_urls)
print("After removing emails:", text_no_emails)

# Remove phone numbers
# Pattern handles: (555) 123-4567 or 555-123-4567
text_no_phones = re.sub(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', '', text_no_emails)
print("After removing phones:", text_no_phones)

# Remove punctuation but keep letters, numbers, spaces
# Pattern: [^\w\s] means "not word characters or whitespace"
text_clean = re.sub(r'[^\w\s]', ' ', text_no_phones)
print("After removing punctuation:", text_clean)

# Normalize whitespace and lowercase
# Pattern: r'\s+' matches one or more whitespace characters
text_final = re.sub(r'\s+', ' ', text_clean).strip().lower()
print("Final (lowercase, normalized):", text_final)


Original: Contact us at info@example.com or call (555) 123-4567. Visit https://example.com for more info!!!

After removing URLs: Contact us at info@example.com or call (555) 123-4567. Visit  for more info!!!
After removing emails: Contact us at  or call (555) 123-4567. Visit  for more info!!!
After removing phones: Contact us at  or call . Visit  for more info!!!
After removing punctuation: Contact us at  or call   Visit  for more info   
Final (lowercase, normalized): contact us at or call visit for more info
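
The task list above also mentions removing HTML tags, which the example does not cover. A minimal sketch, assuming reasonably clean markup, is shown below; the helper name `strip_html` is hypothetical. A regex like this is a rough heuristic, and for messy real-world pages a proper HTML parser is the safer choice.

```python
import html
import re

def strip_html(raw):
    """Remove tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)   # drop anything that looks like a tag
    decoded = html.unescape(no_tags)         # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", decoded).strip()

raw = "<p>Tom &amp; Jerry:<br>a <b>classic</b> chase!</p>"
clean = strip_html(raw)
print(clean)
```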


### Stage 2: Tokenization

**Goal**: Split text into individual words (tokens)

Tokens can be:
- Words: "machine", "learning"
- Punctuation: ".", ","
- Numbers: "2024"
- Subwords: "un-" + "happiness" (advanced)

- Simple approach: split on whitespace
- Better approach: use a regex to extract runs of word characters


In [8]:
# Simple tokenization examples
text = "Natural Language Processing is amazing! It's used everywhere."
print("Original:", text)

# Split on whitespace (simple, but punctuation stays attached to words)
tokens_simple = text.split()
print("\nSimple split:", tokens_simple)

# Regex tokenization (find all word characters)
tokens_regex = re.findall(r'\w+', text.lower())
print("Regex tokenization:", tokens_regex)


Original: Natural Language Processing is amazing! It's used everywhere.

Simple split: ['Natural', 'Language', 'Processing', 'is', 'amazing!', "It's", 'used', 'everywhere.']
Regex tokenization: ['natural', 'language', 'processing', 'is', 'amazing', 'it', 's', 'used', 'everywhere']
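
Notice that `\w+` splits "It's" into "it" and "s". If you would rather keep contractions together, one option is to allow an apostrophe inside a token. This is a sketch of one possible pattern, not the notebook's official tokenizer, and it only handles the straight apostrophe `'`.

```python
import re

text = "Natural Language Processing is amazing! It's used everywhere."

# \w+ followed by zero or more groups of (apostrophe + word characters)
tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
print(tokens)
```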


### Stage 3: Post-processing (After Tokenization)

**Goal**: Further refine tokens by removing noise

Common tasks:
- Remove **stop words**: "the", "a", "an", "and", "or" (common but not informative)
- Remove very short tokens: "I", "a" (often noise)
- Stemming/Lemmatization: "running" → "run" (we'll skip this for now)


In [9]:
# Complete preprocessing pipeline - Complete solution
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
              'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
              'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
              'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those',
              'i', 'you', 'he', 'she', 'it', 'we', 'they', 'what', 'which', 'who'}

def preprocess_text(text):
    """
    Complete preprocessing pipeline: clean, tokenize, filter
    """
    # Step 1: Pre-processing - lowercase and normalize
    text = str(text).lower()
    text = re.sub(r'\s+', ' ', text).strip()

    # Step 2 - Tokenization: Use regex to find all words
    # Pattern: \w+ matches word characters
    tokens = re.findall(r'\w+', text)

    # Step 3 - Post-processing: Remove stop words and very short tokens (length < 3)
    tokens_clean = [token for token in tokens
                    if token not in STOP_WORDS and len(token) > 2]

    return tokens_clean

# Let's test it!
sample = df.loc[0, 'description']
print("Original:", sample)
print("\nPreprocessed tokens:", preprocess_text(sample))


Original: A compelling romance film about a young adventurer. an epic adventure that spans continents and generations. a touching love story that will warm your heart.

Preprocessed tokens: ['compelling', 'romance', 'film', 'about', 'young', 'adventurer', 'epic', 'adventure', 'spans', 'continents', 'generations', 'touching', 'love', 'story', 'warm', 'your', 'heart']


---

## N-grams: Capturing Word Order (Feature Extraction)

**After preprocessing and tokenization**, we can create **n-grams** - sequences of n consecutive tokens. N-grams help capture some word order information that simple Bag of Words loses!

### What are N-grams?

**N-grams** = Sequences of N consecutive tokens (words or characters)

**Types:**
- **Unigrams (1-grams)**: Single words - "I", "love", "Python"
- **Bigrams (2-grams)**: Pairs of consecutive words - "I love", "love Python"
- **Trigrams (3-grams)**: Triples of consecutive words - "I love Python"

### Why N-grams Matter

**Problem with unigrams (single words)**: "Dog bites man" and "Man bites dog" have the same unigrams!

**Solution with bigrams**:
- "Dog bites man" → ["dog bites", "bites man"]
- "Man bites dog" → ["man bites", "bites dog"]
- Different bigrams → different meaning captured!

### Example:

```
Text: "I love Python programming"

Unigrams (1-grams):
  ["i", "love", "python", "programming"]

Bigrams (2-grams):
  ["i love", "love python", "python programming"]

Trigrams (3-grams):
  ["i love python", "love python programming"]
```

**Trade-off**:
- ✅ Better capture of word order and context
- ❌ Larger vocabulary (more features)
- ❌ Can be sparse (many n-grams appear only once)

**Common Usage**: Combining unigrams + bigrams gives a good balance!
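
The definition above says n-grams can be built over characters as well as words. As a small aside, here is a sketch of character trigrams (the helper `char_ngrams` is a made-up name). Character n-grams let related word forms such as "adventure" and "adventurer" share most of their features even though they are different tokens.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("space"))

a = set(char_ngrams("adventure"))
b = set(char_ngrams("adventurer"))
shared = a & b
print(f"shared trigrams: {len(shared)} of {len(a | b)}")
```

This is still purely syntactic: the overlap comes from shared spelling, not shared meaning.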


In [10]:
# N-grams Implementation - Complete solution

def create_ngrams(tokens, n=2):
    """
    Create n-grams from a list of tokens.

    Args:
        tokens: List of tokens (words)
        n: Size of n-gram (1=unigrams, 2=bigrams, 3=trigrams)

    Returns:
        list: List of n-grams
    """
    if n == 1:
        return tokens
    else:
        ngrams = []
        for i in range(len(tokens) - n + 1):
            # Create n-gram by joining n consecutive tokens
            ngram = ' '.join(tokens[i:i+n])
            ngrams.append(ngram)
        return ngrams

# Example 1: Creating n-grams from text
text = "I love Python programming"
tokens = preprocess_text(text)  # Preprocess first (tokenize, remove stop words, etc.)
print(f"Original text: {text}")
print(f"After preprocessing: {tokens}")
print("\n" + "=" * 60)

# Create different n-grams
unigrams = create_ngrams(tokens, n=1)
bigrams = create_ngrams(tokens, n=2)
trigrams = create_ngrams(tokens, n=3)

print("Unigrams (1-grams) - single words:")
print(f"  {unigrams}")

print("\nBigrams (2-grams) - pairs of consecutive words:")
print(f"  {bigrams}")

print("\nTrigrams (3-grams) - triplets of consecutive words:")
print(f"  {trigrams}")

# Example 2: Demonstrating why n-grams matter (word order)
print("\n" + "=" * 60)
print("Why N-grams Matter - Word Order Example:")
print("=" * 60)

text1 = "The dog bites the man"
text2 = "The man bites the dog"

tokens1 = preprocess_text(text1)
tokens2 = preprocess_text(text2)

bigrams1 = create_ngrams(tokens1, n=2)
bigrams2 = create_ngrams(tokens2, n=2)

print(f"\nText 1: '{text1}'")
print(f"Tokens: {tokens1}")
print(f"Bigrams: {bigrams1}")

print(f"\nText 2: '{text2}'")
print(f"Tokens: {tokens2}")
print(f"Bigrams: {bigrams2}")

print("\n→ Different bigrams = different meaning captured!")
print("→ N-grams help preserve some word order information")

# Example 3: Using n-grams with movie descriptions
print("\n" + "=" * 60)
print("N-grams with Real Data:")
print("=" * 60)

sample_movie = df.loc[0, 'description']
tokens_movie = preprocess_text(sample_movie)
print(f"Movie description: {sample_movie[:100]}...")
print(f"\nTokens (first 10): {tokens_movie[:10]}")

movie_unigrams = create_ngrams(tokens_movie, n=1)
movie_bigrams = create_ngrams(tokens_movie, n=2)

print(f"\nUnigrams (first 10): {movie_unigrams[:10]}")
print(f"Bigrams (first 10): {movie_bigrams[:10]}")
print(f"\nTotal unigrams: {len(movie_unigrams)}")
print(f"Total bigrams: {len(movie_bigrams)}")

print("\n💡 Note: N-grams are created AFTER tokenization and BEFORE vectorization.")
print("   They're used as features in TF-IDF (you'll see this in scikit-learn options).")


Original text: I love Python programming
After preprocessing: ['love', 'python', 'programming']

Unigrams (1-grams) - single words:
  ['love', 'python', 'programming']

Bigrams (2-grams) - pairs of consecutive words:
  ['love python', 'python programming']

Trigrams (3-grams) - triplets of consecutive words:
  ['love python programming']

Why N-grams Matter - Word Order Example:

Text 1: 'The dog bites the man'
Tokens: ['dog', 'bites', 'man']
Bigrams: ['dog bites', 'bites man']

Text 2: 'The man bites the dog'
Tokens: ['man', 'bites', 'dog']
Bigrams: ['man bites', 'bites dog']

→ Different bigrams = different meaning captured!
→ N-grams help preserve some word order information

N-grams with Real Data:
Movie description: A compelling romance film about a young adventurer. an epic adventure that spans continents and gene...

Tokens (first 10): ['compelling', 'romance', 'film', 'about', 'young', 'adventurer', 'epic', 'adventure', 'spans', 'continents']

Unigrams (first 10): ['compell

### N-grams in Production

In practice, you'll use libraries like scikit-learn's `CountVectorizer` or `TfidfVectorizer` which can automatically create n-grams:


In [11]:
# Example (we'll see this in Part 2):
from sklearn.feature_extraction.text import TfidfVectorizer

# Create unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # Unigrams + bigrams
print("TfidfVectorizer with ngram_range=(1, 2) created!")
print("This will create both unigrams and bigrams when you fit it to text.")


TfidfVectorizer with ngram_range=(1, 2) created!
This will create both unigrams and bigrams when you fit it to text.


**Common choices:**
- `ngram_range=(1, 1)`: Only unigrams (standard BoW)
- `ngram_range=(1, 2)`: Unigrams + bigrams (good balance)
- `ngram_range=(2, 2)`: Only bigrams
- `ngram_range=(1, 3)`: Unigrams + bigrams + trigrams (more features, can be sparse)

**Key Insight**: N-grams are created **after tokenization** and **before vectorization** (BoW). They help capture word order, but still don't capture semantic meaning - that requires embeddings (Class 3)!


---

## From Text to Numbers: Bag of Words (BoW) - Vectorization

**Now that we have preprocessed tokens (and optionally n-grams)**, we need to convert them to numbers that computers can work with. This is called **vectorization**.

**Pipeline reminder:**
1. ✅ Keyword search (simple and multiple)
2. ✅ Simple tokenization
3. ✅ Preprocessing (clean, normalize)
4. ✅ Post-processing (filter, remove stop words)
5. ✅ Feature extraction (n-grams - optional, we just covered this!)
6. **← We are here**: Vectorization (Bag of Words - word counts)
7. Applications (Similarity Search, Clustering - Part 2)

### The Bag of Words Model (BoW)

**Idea**: Represent each document as a vector of word counts, ignoring word order.

The process:
1. Create vocabulary (list of all unique words from the corpus)
2. Count how many times each word appears in a document
3. Represent document as vector of counts (not frequencies, just counts!)

**Terminology reminder:**
- **Corpus**: Collection of all documents we're working with
- **Vocabulary**: All unique words/tokens in the corpus
- **Token**: Individual word after tokenization

Example:
- Document 1: "I love Python"
- Document 2: "Python is great"
- **Corpus**: The collection of both documents
- **Vocabulary**: ["i", "love", "python", "is", "great"] (all unique words from the corpus)

**Bag of Words vectors**:
- Doc 1: [1, 1, 1, 0, 0]  (one "i", one "love", one "python")
- Doc 2: [0, 0, 1, 1, 1]  (one "python", one "is", one "great")

**Key insight**: We've lost word order! "Dog bites man" = "Man bites dog" in BoW. But for many tasks, this is okay!

**Remember**: BoW is just counting how many times each word appears - simple word counts, not frequencies!


In [14]:
from collections import Counter
import pandas as pd
import re

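def preprocess_text(text: str):
    # Simplified tokenizer for this demo: lowercase, strip punctuation, split on whitespace.
    # Note: this redefinition replaces the earlier preprocess_text, so stop words
    # and short tokens (e.g. "a", "i", "is") stay in the vocabulary below.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    tokens = text.split()
    return tokens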

# Bag of Words
docs = [
    "I love Python programming",
    "Python is a programming language",
    "I love machine learning"
]

# Step 1: Build vocabulary from the corpus (all unique words)
all_words = set()
for doc in docs:
    tokens = preprocess_text(doc)
    all_words.update(tokens)

vocab = sorted(list(all_words))
print(f"Corpus: {len(docs)} documents")
print("Vocabulary:", vocab)
print(f"Vocabulary size: {len(vocab)}")

# Step 2: Create BoW vectors for each document
bow_vectors = []
for doc in docs:
    tokens = preprocess_text(doc)     # 1) preprocess
    word_counts = Counter(tokens)     # 2) count

    # 3) build vector aligned with vocab
    vector = [word_counts.get(word, 0) for word in vocab]

    bow_vectors.append(vector)
    print(f"\n'{doc}' -> {vector}")

print("\nBag of Words Matrix:")
print(pd.DataFrame(bow_vectors, columns=vocab))


Corpus: 3 documents
Vocabulary: ['a', 'i', 'is', 'language', 'learning', 'love', 'machine', 'programming', 'python']
Vocabulary size: 9

'I love Python programming' -> [0, 1, 0, 0, 0, 1, 0, 1, 1]

'Python is a programming language' -> [1, 0, 1, 1, 0, 0, 0, 1, 1]

'I love machine learning' -> [0, 1, 0, 0, 1, 1, 1, 0, 0]

Bag of Words Matrix:
   a  i  is  language  learning  love  machine  programming  python
0  0  1   0         0         0     1        0            1       1
1  1  0   1         1         0     0        0            1       1
2  0  1   0         0         1     1        1            0       0
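
Once the vocabulary is fixed, a new document can be vectorized against it, and any word outside the vocabulary simply has no column. The sketch below assumes the same preprocessing steps as the cell above (re-declared so the block runs on its own); `bow_vector` and the example sentence are hypothetical.

```python
import re
from collections import Counter

def preprocess_text(text):
    """Lowercase, strip punctuation, split on whitespace (same steps as the cell above)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return text.split()

def bow_vector(doc, vocab):
    """Count vector aligned with `vocab`; out-of-vocabulary words are dropped."""
    counts = Counter(preprocess_text(doc))
    return [counts[word] for word in vocab]

# Vocabulary copied from the output above
vocab = ['a', 'i', 'is', 'language', 'learning', 'love', 'machine', 'programming', 'python']

new_doc = "I love Python, Python loves me"
vector = bow_vector(new_doc, vocab)
print(vector)
```

Note that "python" gets a count of 2, while "loves" and "me" vanish because they were never in the vocabulary. Without stemming or lemmatization, "loves" is also not matched to "love".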


## Sparse vs Dense Vectors

This is a crucial concept in NLP!

### Sparse Vectors (like BoW)
- **Most values are zero** (e.g., [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, ...])
- Vocabulary size can be huge (10,000-100,000+ words)
- Each document only uses a small fraction of the vocabulary
- **Memory efficient** when stored in sparse format (only store non-zero values)
- Example: Bag of Words, TF-IDF

### Dense Vectors (like Embeddings - we'll see this next class!)
- **Most/all values are non-zero** (e.g., [0.23, -0.15, 0.87, ..., 0.42])
- Fixed, smaller dimension (typically 100-768 dimensions)
- Each dimension has meaning (learned representation)
- **Captures relationships** between words
- Example: Word embeddings, sentence embeddings

Let's visualize this:
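
One way to see the sparse side in numbers is the sketch below. The vocabulary size, the column indices, and the helper names `to_sparse` and `sparsity` are made up for illustration; real projects would more likely use a dedicated sparse matrix library.

```python
def to_sparse(vector):
    """Keep only non-zero entries as {column_index: count} (hypothetical helper)."""
    return {i: v for i, v in enumerate(vector) if v != 0}

def sparsity(vector):
    """Fraction of entries that are zero."""
    return vector.count(0) / len(vector)

# Pretend vocabulary of 10,000 words; the document uses only three of them.
vocab_size = 10_000
full_vector = [0] * vocab_size
full_vector[12] = 2
full_vector[345] = 1
full_vector[9_001] = 1

sparse_form = to_sparse(full_vector)

print("Values stored as a full list:", len(full_vector))
print("Values stored in sparse form:", len(sparse_form))
print(f"Sparsity: {sparsity(full_vector):.2%}")
```

The sparse form stores 3 entries instead of 10,000, which is why BoW matrices are usually kept in a sparse format.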


## Comparing: With vs Without Preprocessing

**Key Question**: Does preprocessing make a difference? Let's find out!


In [15]:
# Comparison: BoW with preprocessing vs without preprocessing
# Our corpus: collection of sample documents
docs_sample = [
    "I LOVE Python! It's amazing!!!",
    "Python is a programming language.",
    "i love machine learning!"
]

print("=" * 70)
print("WITHOUT Preprocessing (raw text):")
print("=" * 70)
print(f"Corpus: {len(docs_sample)} documents\n")

# BoW without preprocessing - only lowercase and split on whitespace (no punctuation removal)
# Build vocabulary from corpus
all_words_no_preprocess = set()
for doc in docs_sample:
    words = doc.lower().split()  # Simple split, no cleaning
    all_words_no_preprocess.update(words)

vocab_no_preprocess = sorted(list(all_words_no_preprocess))
print(f"Vocabulary: {vocab_no_preprocess}")
print(f"Vocabulary size: {len(vocab_no_preprocess)}")
print("\nNotice: 'python' and 'python!' are separate entries, and tokens like 'amazing!!!' and 'language.' keep their punctuation!")
print("Leftover punctuation creates spurious vocabulary entries (case is already handled by .lower()).")

bow_no_preprocess = []
for doc in docs_sample:
    words = doc.lower().split()
    word_counts = Counter(words)
    vector = [word_counts.get(word, 0) for word in vocab_no_preprocess]
    bow_no_preprocess.append(vector)

print("\nBoW Matrix (no preprocessing):")
print(pd.DataFrame(bow_no_preprocess, columns=vocab_no_preprocess,
                   index=[f"Doc {i+1}" for i in range(len(docs_sample))]))

print("\n" + "=" * 70)
print("WITH Preprocessing (cleaned and tokenized):")
print("=" * 70)
print(f"Corpus: {len(docs_sample)} documents\n")

# BoW with preprocessing
# Build vocabulary from corpus (after preprocessing)
all_words_preprocess = set()
for doc in docs_sample:
    tokens = preprocess_text(doc)
    all_words_preprocess.update(tokens)

vocab_preprocess = sorted(list(all_words_preprocess))
print(f"Vocabulary: {vocab_preprocess}")
print(f"Vocabulary size: {len(vocab_preprocess)}")
print("\nNotice: Clean words only! Punctuation removed (stop words such as 'a' and 'is' are still present).")

bow_preprocess = []
for doc in docs_sample:
    tokens = preprocess_text(doc)
    word_counts = Counter(tokens)
    vector = [word_counts.get(word, 0) for word in vocab_preprocess]
    bow_preprocess.append(vector)

print("\nBoW Matrix (with preprocessing):")
print(pd.DataFrame(bow_preprocess, columns=vocab_preprocess,
                   index=[f"Doc {i+1}" for i in range(len(docs_sample))]))

print("\n" + "=" * 70)
print("KEY INSIGHT:")
print("=" * 70)
print("Without preprocessing: More vocabulary words, many variations of same word")
print("With preprocessing: Cleaner vocabulary, focuses on meaningful words")
print("→ Preprocessing reduces noise and makes patterns clearer!")


WITHOUT Preprocessing (raw text):
Corpus: 3 documents

Vocabulary: ['a', 'amazing!!!', 'i', 'is', "it's", 'language.', 'learning!', 'love', 'machine', 'programming', 'python', 'python!']
Vocabulary size: 12

Notice: 'python' and 'python!' are separate entries, and tokens like 'amazing!!!' and 'language.' keep their punctuation!
Leftover punctuation creates spurious vocabulary entries (case is already handled by .lower()).

BoW Matrix (no preprocessing):
       a  amazing!!!  i  is  it's  language.  learning!  love  machine  \
Doc 1  0           1  1   0     1          0          0     1        0   
Doc 2  1           0  0   1     0          1          0     0        0   
Doc 3  0           0  1   0     0          0          1     1        1   

       programming  python  python!  
Doc 1            0       0        1  
Doc 2            1       1        0  
Doc 3            0       0        0  

WITH Preprocessing (cleaned and tokenized):
Corpus: 3 documents

Vocabulary: ['a', 'amazing', 'i', 'is', 'its', 'language', 'learning', 'love', 'machine', 'programming', 'pytho

## The Problem with Syntactic Representations (BoW Limitations)

**Critical Understanding**: Bag of Words is a **syntactic representation** (TF-IDF in Part 2 is also syntactic) - they only capture word presence/counts, NOT meaning!

**Key Terminology:**
- **Syntactic**: Based on word structure/counts (BOW) or frequencies (TF-IDF) - no understanding of meaning
- **Semantic**: Based on meaning - understands synonyms and related concepts (embeddings - Class 3)

**Remember**: Semantic = meaning. Syntactic models like BOW work with word counts only - they don't understand meaning!

Let's see the issues:


In [16]:
# Demonstrating BoW/Syntactic Representation Limitations

print("=" * 70)
print("Problem 1: Word Order is Lost")
print("=" * 70)

# Corpus: two documents with different word order
docs_order = [
    "The dog bites the man",
    "The man bites the dog"
]

# Create BoW vectors for these two sentences
# What do you notice about the vectors?
# Build vocabulary from corpus
all_words_order = set()
for doc in docs_order:
    tokens = preprocess_text(doc)
    all_words_order.update(tokens)

vocab_order = sorted(list(all_words_order))
print(f"Vocabulary: {vocab_order}")

bow_order = []
for doc in docs_order:
    tokens = preprocess_text(doc)
    word_counts = Counter(tokens)
    vector = [word_counts.get(word, 0) for word in vocab_order]
    bow_order.append(vector)
    print(f"\n'{doc}'")
    print(f"→ {vector}")

print("\n" + "=" * 50)
print("Result: Both sentences have IDENTICAL BoW vectors!")
print("But they have COMPLETELY DIFFERENT meanings!")
print("=" * 50)

print("\n" + "=" * 70)
print("Problem 2: Synonyms are Treated as Completely Different")
print("=" * 70)

# Corpus: documents with synonyms
docs_synonyms = [
    "I love machine learning",
    "I adore artificial intelligence"
]

# Create BoW vectors for these
# Build vocabulary from corpus
all_words_syn = set()
for doc in docs_synonyms:
    tokens = preprocess_text(doc)
    all_words_syn.update(tokens)

vocab_syn = sorted(list(all_words_syn))
print(f"Vocabulary: {vocab_syn}")

bow_syn = []
for doc in docs_synonyms:
    tokens = preprocess_text(doc)
    word_counts = Counter(tokens)
    vector = [word_counts.get(word, 0) for word in vocab_syn]
    bow_syn.append(vector)
    print(f"\n'{doc}'")
    print(f"→ {vector}")

print("\n" + "=" * 50)
print("Result: 'love' vs 'adore' and 'machine learning' vs 'AI' are completely different!")
print("But they have SIMILAR meanings!")
print("→ The only shared word is 'i', so BoW similarity is low even though the meanings are related")
print("=" * 50)

print("\n" + "=" * 70)
print("Problem 3: Context is Lost")
print("=" * 70)

# Corpus: documents with same word but different context
docs_context = [
    "The bank is closed",      # Financial bank
    "The river bank is muddy"  # River edge
]

# Build vocabulary from corpus
all_words_ctx = set()
for doc in docs_context:
    tokens = preprocess_text(doc)
    all_words_ctx.update(tokens)

vocab_ctx = sorted(list(all_words_ctx))
print(f"Vocabulary: {vocab_ctx}")

bow_ctx = []
for doc in docs_context:
    tokens = preprocess_text(doc)
    word_counts = Counter(tokens)
    vector = [word_counts.get(word, 0) for word in vocab_ctx]
    bow_ctx.append(vector)
    print(f"\n'{doc}'")
    print(f"→ {vector}")

print("\n" + "=" * 50)
print("Result: 'bank' appears in both, but means completely different things!")
print("→ BoW can't distinguish context/meaning")
print("=" * 50)

print("\n" + "=" * 70)
print("SUMMARY: Syntactic Representation (BoW/TF-IDF) Limitations")
print("=" * 70)
print("❌ Loses word order")
print("❌ Can't handle synonyms (different words = 0 similarity)")
print("❌ Can't understand context")
print("❌ Only captures word counts, NOT meaning")
print("\n📌 CRITICAL: Syntactic ≠ Semantic")
print("   - Syntactic (BOW/TF-IDF): Word-based, no meaning")
print("   - Semantic (Embeddings - Class 3): Meaning-based, understands synonyms")
print("\n✅ But syntactic models are simple, interpretable, and work for many tasks!")
print("\n💡 Next class: We'll see embeddings (semantic representations) that solve these!")
print("   - Semantic = meaning - embeddings understand that 'space' and 'cosmic' are similar!")


Problem 1: Word Order is Lost
Vocabulary: ['bites', 'dog', 'man', 'the']

'The dog bites the man'
→ [1, 1, 1, 2]

'The man bites the dog'
→ [1, 1, 1, 2]

Result: Both sentences have IDENTICAL BoW vectors!
But they have COMPLETELY DIFFERENT meanings!

Problem 2: Synonyms are Treated as Completely Different
Vocabulary: ['adore', 'artificial', 'i', 'intelligence', 'learning', 'love', 'machine']

'I love machine learning'
→ [0, 0, 1, 0, 1, 1, 1]

'I adore artificial intelligence'
→ [1, 1, 1, 1, 0, 0, 0]

Result: 'love' vs 'adore' and 'machine learning' vs 'AI' are completely different!
But they have SIMILAR meanings!
→ The only shared word is 'i', so BoW similarity is low even though the meanings are related

Problem 3: Context is Lost
Vocabulary: ['bank', 'closed', 'is', 'muddy', 'river', 'the']

'The bank is closed'
→ [1, 1, 1, 0, 0, 1]

'The river bank is muddy'
→ [1, 0, 1, 1, 1, 1]

Result: 'bank' appears in both, but means completely different things!
→ BoW can't distinguish context/meaning

SUMMARY: Syntac
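
To put numbers on the three problems, the sketch below compares the vector pairs printed above. Cosine similarity is an assumption here: this part of the notebook does not fix a similarity measure (similarity search is covered in Part 2), and `cosine_similarity` is a hand-written helper, not a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Vectors copied from the output above
order_a, order_b = [1, 1, 1, 2], [1, 1, 1, 2]                # dog/man sentences
syn_a, syn_b = [0, 0, 1, 0, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0]  # love/adore sentences
ctx_a, ctx_b = [1, 1, 1, 0, 0, 1], [1, 0, 1, 1, 1, 1]        # bank sentences

print(f"Word order: {cosine_similarity(order_a, order_b):.2f}")
print(f"Synonyms:   {cosine_similarity(syn_a, syn_b):.2f}")
print(f"Context:    {cosine_similarity(ctx_a, ctx_b):.2f}")
```

Under this measure the word-order pair scores 1.00 (indistinguishable), the synonym pair scores 0.25 (only "i" is shared), and the two "bank" sentences score 0.67 because they share surface words, whatever sense of "bank" is meant.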

---

## End of Part 1: Foundation Complete! 🎉

**Great job!** You've learned the fundamentals:
- ✅ **Keyword search** (simple and multiple keyword search)
- ✅ **Simple tokenization** (splitting text into words)
- ✅ **Text preprocessing** (cleaning, tokenization, regex)
- ✅ **Converting text to numbers** (Bag of Words - word counts/vectorization)
- ✅ **Understanding sparse vectors**
- ✅ **Recognizing BoW limitations** (syntactic, not semantic)

**The Complete NLP Pipeline You Now Understand:**

```
Raw Text
  ↓
1. Keyword Search (simple and multiple)
  ↓
2. Simple Tokenization (split into words)
  ↓
3. Preprocessing (clean, normalize)
  ↓
4. Post-processing (filter stop words)
  ↓
5. Vectorization (Bag of Words - word counts) ← You learned this!
  ↓
6. TF-IDF (weighted vectors) ← Part 2
  ↓
7. Similarity Search & Clustering ← Part 2
```

**Key Point**: **Bag of Words (BoW)** is just counting how many times each word appears in a document - simple word counts, not frequencies!

**Now it's time to practice!** 🏋️

Complete the exercises in the **Exercise Notebook** to reinforce what you've learned:
- **Exercise 1**: Text cleaning with regex (URLs, emails, phone numbers)
- **Exercise 2**: Tokenization with stop word removal
- **Exercise 3**: Bag of Words implementation from scratch
- **Exercise 4**: Keyword search implementation (simple and multiple)
- **Exercise 5**: Similarity-based search with TF-IDF (still syntactic, not semantic!) - Part 2
- **Exercise 6**: Document clustering with K-Means - Part 2
- **Exercise 7**: Comparing preprocessing approaches
- **Exercise 8**: Stemming and Lemmatization (advanced preprocessing)

**Next in Part 2**: We'll learn TF-IDF (improving on BoW with weighted frequencies), similarity search, and clustering!

Take a break, complete the exercises, then continue with **Learning Notebook Part 2** 👉