# 🧠 NLP Foundations Workshop: From Preprocessing to TF-IDF


**Duration**: 90 minutes  
**Team Size**: 3 students  
**Objective**: Build an NLP pipeline from scratch to implement and test six foundational concepts in Natural Language Processing in preparation for Vector Space Models and Cosine Similarity.


## Step 1: Presenting the Six Core NLP Concepts

### 🔹 Term-Document Incidence Matrix

The **Term-Document Incidence Matrix** is a binary matrix that shows whether a term $t$ appears in a document $d$.

- Rows represent terms in the vocabulary  
- Columns represent documents in the corpus  
- Each entry $w_{t,d}$ is defined as:

$$
w_{t,d} =
\begin{cases}
1 & \text{if } t \in d \\
0 & \text{otherwise}
\end{cases}
$$

This is a **binary representation** — it only records the **presence or absence** of a term, not how many times it appears.

---

#### ✅ Why Use It?

- It’s the **simplest form** of representing document contents using structured data.
- Useful for:
  - Boolean search and keyword filters
  - Document classification based on keyword sets
  - Building foundational **retrieval systems**
- Helps in detecting whether **all query terms exist** in a document (e.g., Boolean "AND" queries)

---

#### 📘 Example

Suppose we have 3 documents:

- **Doc1**: "machine learning is fun"  
- **Doc2**: "deep learning is powerful"  
- **Doc3**: "machine learning and deep models"

The vocabulary extracted from all three is:

**Vocabulary** = {machine, learning, is, fun, deep, powerful, and, models}

The Term-Document Incidence Matrix would look like:

| Term       | Doc1 | Doc2 | Doc3 |
|------------|------|------|------|
| machine    | 1    | 0    | 1    |
| learning   | 1    | 1    | 1    |
| is         | 1    | 1    | 0    |
| fun        | 1    | 0    | 0    |
| deep       | 0    | 1    | 1    |
| powerful   | 0    | 1    | 0    |
| and        | 0    | 0    | 1    |
| models     | 0    | 0    | 1    |

For example:
- $w_{\text{machine}, \text{Doc1}} = 1$ → "machine" is in Doc1
- $w_{\text{powerful}, \text{Doc1}} = 0$ → "powerful" is not in Doc1

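This matrix is particularly helpful when implementing **Boolean retrieval systems**. Because it stores no term positions, it can confirm that all words of a phrase are present, but true **phrase matching** needs additional positional information.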


In [26]:
# 📘 Example: Term-Document Incidence Matrix

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus from the Markdown example
docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Use binary=True to indicate presence/absence (1 or 0)
vectorizer = CountVectorizer(binary=True)

# Fit and transform the corpus
X = vectorizer.fit_transform(docs)

# Create a labeled DataFrame
incidence_matrix = pd.DataFrame(X.toarray(),
                                index=["Doc1", "Doc2", "Doc3"],
                                columns=vectorizer.get_feature_names_out())

# Display the incidence matrix
print("🔎 Term-Document Incidence Matrix:")
display(incidence_matrix)


🔎 Term-Document Incidence Matrix:


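|      | and | deep | fun | is | learning | machine | models | powerful |
|------|-----|------|-----|----|----------|---------|--------|----------|
| Doc1 | 0   | 0    | 1   | 1  | 1        | 1       | 0      | 0        |
| Doc2 | 0   | 1    | 0   | 1  | 1        | 0       | 0      | 1        |
| Doc3 | 1   | 1    | 0   | 0  | 1        | 1       | 1      | 0        |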


🗣️ **Instructor Talking Point**: This code demonstrates how the presence or absence of a term in a document is encoded as a binary matrix, which is the foundation of Boolean retrieval. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Add a phrase query (e.g., 'machine learning') and explain your reasoning as to how you would check if both terms occur in a single document using this matrix.
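
One way to approach the Boolean check described above is sketched below. It hard-codes the incidence matrix from the three-document example as a plain dictionary; the helper name `boolean_and` is an illustrative assumption, not part of scikit-learn or the workshop code.

```python
# Incidence matrix from the three-document example: term -> [Doc1, Doc2, Doc3]
incidence = {
    "machine":  [1, 0, 1],
    "learning": [1, 1, 1],
    "is":       [1, 1, 0],
    "fun":      [1, 0, 0],
    "deep":     [0, 1, 1],
    "powerful": [0, 1, 0],
    "and":      [0, 0, 1],
    "models":   [0, 0, 1],
}
doc_names = ["Doc1", "Doc2", "Doc3"]


def boolean_and(query_terms, incidence, doc_names):
    """Return the documents whose incidence entry is 1 for every query term."""
    no_match = [0] * len(doc_names)  # a term outside the vocabulary matches nothing
    hits = []
    for j, name in enumerate(doc_names):
        if all(incidence.get(term, no_match)[j] for term in query_terms):
            hits.append(name)
    return hits


and_hits = boolean_and(["machine", "learning"], incidence, doc_names)
print(and_hits)  # ['Doc1', 'Doc3']
```

Both Doc1 and Doc3 are returned because the matrix only records presence. It cannot tell that the two words are adjacent in Doc1 and in Doc3 alike, or that they might be far apart in some other document.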

### 🔹 Term Frequency (TF)

**Term Frequency (TF)** measures how frequently a term $t$ appears in a document $d$.

$$
tf_{t,d} = f_{t,d}
$$

Where $f_{t,d}$ is the raw count of term $t$ in document $d$.

---

#### ✅ Why Use It?

- TF reflects the importance of a word **within a specific document**.
- A higher TF means the term is likely central to the topic of that document.
- It's used as the **first step** in vectorizing text for tasks like classification, clustering, or information retrieval.

TF is most effective when combined with **IDF** (Inverse Document Frequency) to balance against very common terms across the corpus.

---

#### 📘 Example

Let’s say we have this document:

> **Doc1**: `"machine learning is fun and machine learning is useful"`

Calculate raw term counts:

| Term     | Raw TF $(f_{t,d})$ |
|----------|--------------------|
| machine  | 2                  |
| learning | 2                  |
| is       | 2                  |
| fun      | 1                  |
| and      | 1                  |
| useful   | 1                  |

If normalized (total of 9 words):

- $tf(\text{"machine"}, \text{Doc1}) = \frac{2}{9} \approx 0.22$
- $tf(\text{"learning"}, \text{Doc1}) = \frac{2}{9} \approx 0.22$

This simple frequency can then be used as input to weighting schemes such as **TF-IDF**, which adjusts these values based on how rare the words are across multiple documents.


In [27]:
# 📘 Example: Term Frequency (TF)

import pandas as pd
from collections import Counter

# Sample document
doc1 = "machine learning is fun and machine learning is useful"

# Tokenize the document (simple lowercase + split)
tokens = doc1.lower().split()

# Count term frequencies
tf_raw = Counter(tokens)

# Total number of words
total_terms = len(tokens)

# Compute normalized TF
tf_normalized = {term: count / total_terms for term, count in tf_raw.items()}

# Display results
print("🔢 Raw Term Frequencies:")
display(pd.DataFrame(tf_raw.items(), columns=["Term", "Raw TF"]))

print("\n📏 Normalized Term Frequencies:")
display(pd.DataFrame(tf_normalized.items(), columns=["Term", "TF (Normalized)"]))


🔢 Raw Term Frequencies:


|   | Term     | Raw TF |
|---|----------|--------|
| 0 | machine  | 2      |
| 1 | learning | 2      |
| 2 | is       | 2      |
| 3 | fun      | 1      |
| 4 | and      | 1      |
| 5 | useful   | 1      |



📏 Normalized Term Frequencies:


|   | Term     | TF (Normalized) |
|---|----------|-----------------|
| 0 | machine  | 0.222222        |
| 1 | learning | 0.222222        |
| 2 | is       | 0.222222        |
| 3 | fun      | 0.111111        |
| 4 | and      | 0.111111        |
| 5 | useful   | 0.111111        |


🗣️ **Instructor Talking Point**: Here we count how often each term appears in a single document and normalize it. This is the simplest way to represent word importance within a document. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Use this TF output to compare with another document. Which terms are likely to be most important in Doc1 based on their normalized TF? Explain your reasoning.
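
As a starting point for the comparison, the sketch below computes normalized TF for Doc1 and for a second, much shorter document. The second document and the helper name `normalized_tf` are hypothetical, added only for illustration.

```python
from collections import Counter


def normalized_tf(text):
    """Relative frequency of each token: count / total tokens in the document."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


doc1 = "machine learning is fun and machine learning is useful"
# Hypothetical second document, used only for this comparison
doc2 = "machine translation"

tf1 = normalized_tf(doc1)
tf2 = normalized_tf(doc2)

# "machine" has the higher raw count in doc1 (2 vs 1) but the larger share of doc2
print(round(tf1["machine"], 3), round(tf2["machine"], 3))  # 0.222 0.5
```

Normalizing by document length is what makes the two values comparable: raw counts would favor the longer document regardless of how central the term is.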

### 🔹 Log Frequency Weight

To reduce the impact of very frequent terms, **log frequency weighting** is applied.

$$
w_{t,d} =
\begin{cases}
1 + \log_{10}(f_{t,d}) & \text{if } f_{t,d} > 0 \\
0 & \text{if } f_{t,d} = 0
\end{cases}
$$

This transformation reduces the skew caused by terms that appear many times in a document. Instead of allowing their raw frequency to dominate, we scale their contribution **logarithmically**.

---

#### ✅ Why Use It?

- Frequent terms are not always the most **important** terms.
- Log scaling ensures that:
  - Words with a raw count of 1 are preserved ($1 + \log_{10}(1) = 1$),
  - But words with very high counts (e.g., 1000) don’t dominate the document vector.

This helps **normalize the influence** of repetitive terms and improve the **numerical stability** of document representations in models.

---

#### 📘 Example

Let’s say we have a document with the following raw term counts:

| Term     | Raw TF $f_{t,d}$ | Log Frequency Weight $w_{t,d}$ |
|----------|------------------|-------------------------------|
| machine  | 1                | $1 + \log_{10}(1) = 1$        |
| learning | 3                | $1 + \log_{10}(3) \approx 1.477$ |
| data     | 10               | $1 + \log_{10}(10) = 2$       |

So even though "data" appears 10 times, its log-weighted value is **just 2**, making it more comparable to less frequent but potentially more meaningful terms like "learning".

This makes log frequency weighting especially useful when preparing inputs for **TF-IDF** weighting or **document clustering**.


In [28]:
# 📘 Example: Log Frequency Weighting

import pandas as pd
import numpy as np
from collections import Counter

# Sample document with varying term frequencies
doc = "machine learning data data data learning learning learning machine data data data data"

# Tokenize and count raw term frequencies
tokens = doc.lower().split()
raw_tf = Counter(tokens)

# Compute log frequency weights
log_weighted_tf = {
    term: 1 + np.log10(freq) if freq > 0 else 0
    for term, freq in raw_tf.items()
}

# Build and display the result as a DataFrame
df = pd.DataFrame({
    "Term": raw_tf.keys(),
    "Raw TF (f_{t,d})": raw_tf.values(),
    "Log Weight (w_{t,d})": log_weighted_tf.values()
})

print("📊 Log Frequency Weighting:")
display(df)


📊 Log Frequency Weighting:


|   | Term     | Raw TF (f_{t,d}) | Log Weight (w_{t,d}) |
|---|----------|------------------|----------------------|
| 0 | machine  | 2                | 1.30103              |
| 1 | learning | 4                | 1.60206              |
| 2 | data     | 7                | 1.845098             |


🗣️ **Instructor Talking Point**: Note how 'data' has a high frequency, but its impact is smoothed by log weighting, making it comparable to 'learning'. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Try adjusting the number of times a word appears and observe how the log scale compresses large values.
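
To see the compression directly, a minimal sketch using only the standard library is shown below. The helper name `log_weight` is an assumption for illustration; it follows the piecewise formula given at the top of this section.

```python
import math


def log_weight(count):
    """1 + log10(count) for positive counts, 0 for absent terms."""
    return 1 + math.log10(count) if count > 0 else 0.0


# Each tenfold increase in the raw count adds only 1 to the weight
for count in [0, 1, 10, 100, 1000]:
    print(count, round(log_weight(count), 3))
```

A term that occurs 1000 times ends up with a weight of about 4 rather than 1000, while an absent term stays at exactly 0.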

### 🔹 Document Frequency (DF)

**Document Frequency** is the number of documents in which a term $t$ appears:

$$
df_t = |\{ d \in D : t \in d \}|
$$

Where:
- $df_t$ is the document frequency of term $t$
- $D$ is the set of all documents in the corpus
- $t \in d$ means the term $t$ appears in document $d$

---

#### ✅ Why Use It?

- It helps you understand **how common or rare** a word is across the entire document set.
- Words with **high DF** (e.g., “the”, “and”) occur in many documents and are often **less informative**.
- Words with **low DF** are more likely to be **specific and meaningful** for distinguishing between documents.
- DF is a key ingredient in calculating **Inverse Document Frequency (IDF)**.

---

#### 📘 Example

Suppose you have the following three documents:

- **Doc1**: "machine learning is fun"  
- **Doc2**: "deep learning is powerful"  
- **Doc3**: "machine learning and deep models"

Now, let’s compute the Document Frequency:

| Term     | Document Frequency ($df_t$) |
|----------|-----------------------------|
| machine  | 2 (Doc1, Doc3)              |
| learning | 3 (Doc1, Doc2, Doc3)        |
| deep     | 2 (Doc2, Doc3)              |
| models   | 1 (Doc3)                    |

The term **"learning"** appears in all three documents → **high DF**, which means it’s **less useful for distinguishing** between them.

The term **"models"** appears in only one document → **low DF**, meaning it could be a **useful keyword** for that specific document.


In [29]:
# 📘 Example: Document Frequency (DF)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (same three-document corpus as the incidence matrix example)
docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Use CountVectorizer to extract term-document matrix (raw counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Get feature names and document-term matrix as array
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Calculate document frequency for each term
df_counts = (X_array > 0).sum(axis=0)

# Format as a DataFrame
df_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts
}).sort_values("Document Frequency (df_t)", ascending=False)

print("📊 Document Frequency (DF) Table:")
display(df_table)


📊 Document Frequency (DF) Table:


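|   | Term     | Document Frequency (df_t) |
|---|----------|---------------------------|
| 4 | learning | 3                         |
| 1 | deep     | 2                         |
| 5 | machine  | 2                         |
| 3 | is       | 2                         |
| 2 | fun      | 1                         |
| 0 | and      | 1                         |
| 6 | models   | 1                         |
| 7 | powerful | 1                         |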


🗣️ **Instructor Talking Point**: Notice how common terms like 'learning' appear in all documents, while more specific terms like 'fun' or 'models' appear in only one.
<br/>
<br/>
🧠 **Student Talking Point**: Choose a term and explain how its document frequency could affect downstream TF-IDF weighting.
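
For comparison with the `CountVectorizer` version, document frequency can also be sketched with plain Python sets. The helper name `document_frequency` is an illustrative assumption; the key step is converting each document to a set so a repeated term is counted once per document.

```python
from collections import Counter

docs = {
    "Doc1": "machine learning is fun",
    "Doc2": "deep learning is powerful",
    "Doc3": "machine learning and deep models",
}


def document_frequency(docs):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for text in docs.values():
        # set() ensures each term contributes at most 1 per document
        df.update(set(text.lower().split()))
    return df


df = document_frequency(docs)
print(df["learning"], df["machine"], df["models"])  # 3 2 1
```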

### 🔹 Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how rare or informative a term is across the entire corpus:

$$
idf_t = \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $N$ is the total number of documents in the corpus  
- $df_t$ is the number of documents that contain the term $t$

---

#### ✅ Why Use It?

- IDF is used to **downweight common terms** and **upweight rare ones**.
- Words like “the”, “and”, or “data” appear frequently and are less helpful in distinguishing documents.
- Terms that appear in **fewer documents** are often **more informative** and **discriminative**.
- IDF is a core component of **TF-IDF**, a widely used technique in search engines, document classification, and clustering.

---

#### 📘 Example

Let’s say we have **5 documents** total, and the following document frequencies:

| Term     | $df_t$ | $idf_t = \log_{10}(N / df_t)$ |
|----------|--------|-------------------------------|
| machine  | 3      | $\log_{10}(5 / 3) \approx 0.22$ |
| entropy  | 1      | $\log_{10}(5 / 1) \approx 0.70$ |
| the      | 5      | $\log_{10}(5 / 5) = 0.00$       |

- The term **"entropy"** appears in only one document, so its IDF is **high** → it’s a **rare and informative term**.
- The term **"the"** appears in all five documents, so its IDF is **0** → it carries **no discriminative information**.


In [30]:
# 📘 Example: Inverse Document Frequency (IDF)

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (5 total)
docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning"
]

# Total number of documents
N = len(docs)

# Use CountVectorizer to get document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Compute document frequency for each term
df_counts = (X_array > 0).sum(axis=0)

# Compute IDF using log base 10
idf_values = np.log10(N / df_counts)

# Build a DataFrame for display
idf_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts,
    "IDF (log10(N / df_t))": idf_values
}).sort_values("IDF (log10(N / df_t))", ascending=False)

print("📊 Inverse Document Frequency (IDF) Table:")
display(idf_table)


📊 Inverse Document Frequency (IDF) Table:


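|    | Term     | Document Frequency (df_t) | IDF (log10(N / df_t)) |
|----|----------|---------------------------|-----------------------|
| 0  | advanced | 1                         | 0.69897               |
| 1  | ai       | 1                         | 0.69897               |
| 2  | and      | 1                         | 0.69897               |
| 3  | are      | 1                         | 0.69897               |
| 4  | deep     | 1                         | 0.69897               |
| 5  | entropy  | 1                         | 0.69897               |
| 6  | evolving | 1                         | 0.69897               |
| 11 | of       | 1                         | 0.69897               |
| 14 | science  | 1                         | 0.69897               |
| 10 | measures | 1                         | 0.69897               |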


🗣️ **Instructor Talking Point**: IDF adjusts for the fact that some words are common across all documents — this is critical in improving document relevance in search systems.
<br/>
<br/>
🧠 **Student Talking Point**: Choose a low-IDF and high-IDF term from this output and explain why they behave differently.
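
The small table in the example above ($N = 5$, with document frequencies 3, 1 and 5) can be reproduced with the standard library alone. The helper name `idf` is an assumption for illustration and mirrors the base-10 formula used in this section.

```python
import math


def idf(df_t, n_docs):
    """Inverse document frequency with a base-10 logarithm: log10(N / df_t)."""
    return math.log10(n_docs / df_t)


N = 5
doc_freq = {"machine": 3, "entropy": 1, "the": 5}
idf_values = {term: idf(df_t, N) for term, df_t in doc_freq.items()}

for term, value in idf_values.items():
    print(term, round(value, 2))
```

A term found in every document gets an IDF of exactly 0, so it contributes nothing once TF and IDF are multiplied.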

### 🔹 TF-IDF Weighting

**TF-IDF (Term Frequency–Inverse Document Frequency)** scores each term $t$ in document $d$ based on how frequent and how rare it is:

$$
w_{t,d} = \left(1 + \log_{10}(f_{t,d})\right) \times \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $f_{t,d}$ is the raw count of term $t$ in document $d$
- $df_t$ is the number of documents that contain term $t$
- $N$ is the total number of documents in the corpus

---

#### ✅ Why Use It?

- TF-IDF balances **term importance within a document** (TF) against **term commonality across all documents** (IDF).
- It **boosts rare, relevant words** while **suppressing frequent, generic words**.
- TF-IDF is foundational in:
  - Information Retrieval (search engines)
  - Document similarity
  - Feature engineering for classification or clustering

---

#### 📘 Example

Suppose we have:

- $f_{\text{machine}, \text{Doc1}} = 3$
- $df_{\text{machine}} = 2$
- $N = 5$ total documents

Then:

- TF part: $1 + \log_{10}(3) \approx 1 + 0.477 = 1.477$
- IDF part: $\log_{10}(5 / 2) \approx 0.398$
- TF-IDF weight:

$$
w_{\text{machine}, \text{Doc1}} = 1.477 \times 0.398 \approx 0.588
$$

This means "machine" is **important within Doc1**, but since it's found in other documents too, the overall weight is **moderated**.

TF-IDF creates a **sparse, weighted vector representation** of documents, ready for:
- Cosine similarity
- Clustering
- Search ranking
- Input into classical machine learning models


In [31]:
# 📘 Example: TF-IDF Weighting (Manual Computation)

import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus of 5 documents
docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning"
]

# Total number of documents
N = len(docs)

# Vectorize (raw term frequencies)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Compute Document Frequencies
df = (X_array > 0).sum(axis=0)
idf = np.log10(N / df)

# Manual TF-IDF: (1 + log10(tf)) * idf for terms that occur, 0 for absent terms
# (np.maximum keeps log10 away from zero counts; np.where zeroes absent terms)
tf_log = np.where(X_array > 0, 1 + np.log10(np.maximum(X_array, 1)), 0)
tfidf = tf_log * idf

# Create a DataFrame for visual inspection
tfidf_df = pd.DataFrame(tfidf, columns=terms, index=[f"Doc{i+1}" for i in range(N)])

print("📊 TF-IDF Weighted Matrix (Manual Computation):")
display(tfidf_df.round(3))


📊 TF-IDF Weighted Matrix (Manual Computation):


|      | advanced | ai    | and   | are   | deep  | entropy | evolving | is    | learning | machine | measures | of    | powerful | randomness | science | the   |
|------|----------|-------|-------|-------|-------|---------|----------|-------|----------|---------|----------|-------|----------|------------|---------|-------|
| Doc1 | 0.000    | 0.000 | 0.000 | 0.000 | 0.000 | 0.000   | 0.000    | 0.398 | 0.097    | 0.222   | 0.000    | 0.000 | 0.699    | 0.000      | 0.000   | 0.000 |
| Doc2 | 0.699    | 0.000 | 0.000 | 0.000 | 0.699 | 0.000   | 0.000    | 0.398 | 0.097    | 0.000   | 0.000    | 0.000 | 0.000    | 0.000      | 0.000   | 0.000 |
| Doc3 | 0.000    | 0.000 | 0.000 | 0.000 | 0.000 | 0.699   | 0.000    | 0.000 | 0.000    | 0.000   | 0.699    | 0.000 | 0.000    | 0.699      | 0.000   | 0.000 |
| Doc4 | 0.000    | 0.699 | 0.699 | 0.699 | 0.000 | 0.000   | 0.699    | 0.000 | 0.097    | 0.222   | 0.000    | 0.000 | 0.000    | 0.000      | 0.000   | 0.000 |
| Doc5 | 0.000    | 0.000 | 0.000 | 0.000 | 0.000 | 0.000   | 0.000    | 0.000 | 0.097    | 0.222   | 0.000    | 0.699 | 0.000    | 0.000      | 0.699   | 0.699 |


🗣️ **Instructor Talking Point**: We combined TF and IDF manually — useful for seeing how each part of the formula shapes the final result.
<br/>
<br/>
🗣️ **Instructor Talking Point**: Document Frequency (DF) counts how many documents contain a specific term, showing how common it is across the corpus.
Inverse Document Frequency (IDF) does the opposite: it measures how rare or informative a term is by taking the logarithm of $N / df_t$.
So DF rises as a term appears in more documents, while IDF falls, giving higher weight to rare terms.
Together, they balance relevance: DF tells us "how many documents use this term," while IDF tells us "how useful is this term for distinguishing documents."
IDF is critical for reducing noise from overly common words.
<br/>
<br/>
🧠 **Student Talking Point**: Pick one row (a document) and explain which term seems most important and why, based on the TF-IDF weights.
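
The worked example from this section ($f_{t,d} = 3$, $df_t = 2$, $N = 5$) can be checked with a few lines of standard-library Python. The function name `tfidf_weight` is an assumption for illustration; it implements the log-TF times IDF formula stated above and returns 0 for absent terms.

```python
import math


def tfidf_weight(f_td, df_t, n_docs):
    """(1 + log10(f_td)) * log10(N / df_t); 0 when the term is absent."""
    if f_td == 0:
        return 0.0
    return (1 + math.log10(f_td)) * math.log10(n_docs / df_t)


w = tfidf_weight(f_td=3, df_t=2, n_docs=5)
print(round(w, 3))  # 0.588
```

The explicit zero branch matters: without it, the `1 +` in the TF part would give absent terms a nonzero weight.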

## Step 2: Document Collection

In [32]:

import os

import nltk
from nltk.corpus import gutenberg, reuters

# 1. Keep NLTK data in a project-local folder
DATA_DIR = os.path.join('data', 'nltk_data')

# 2. Make sure it exists
os.makedirs(DATA_DIR, exist_ok=True)

# 3. Tell NLTK to look in there
nltk.data.path.append(DATA_DIR)

# 4. Download the Gutenberg and Reuters corpora into that folder
nltk.download('gutenberg', download_dir=DATA_DIR)
nltk.download('reuters', download_dir=DATA_DIR)

# 5. Load the 18 built-in Gutenberg books as raw text
file_ids = gutenberg.fileids()
documents = [gutenberg.raw(file_id) for file_id in file_ids]

# 6. Add 10 more documents from Reuters to reach 20+ in total
reuters_ids = reuters.fileids()[:10]
documents.extend([reuters.raw(doc_id) for doc_id in reuters_ids])

print(f"Loaded {len(documents)} documents.")


Loaded 28 documents.


[nltk_data] Downloading package gutenberg to data\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package reuters to data\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


## Step 3: Implement a Tokenizer

In [33]:


import re

# Tokenizer function
def tokenize(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

# Load and tokenize all .txt files from the 'data/' folder
def load_and_tokenize_all(folder_path):
    tokenized_documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r', encoding='utf-8', errors= "ignore") as file:
                text = file.read()
                tokens = tokenize(text)
                tokenized_documents.append(tokens)
                print(f"Tokenized: {filename} — {len(tokens)} tokens")
    return tokenized_documents

# Run the function on your data folder
folder_path = 'data/nltk_data/corpora'
all_tokenized_docs = load_and_tokenize_all(folder_path)

# Preview the first 30 tokens from the first document
print("\nPreview from first document:")
print(all_tokenized_docs[0][:30])



Tokenized: austen-emma.txt — 161983 tokens
Tokenized: austen-persuasion.txt — 84167 tokens
Tokenized: austen-sense.txt — 120787 tokens
Tokenized: bible-kjv.txt — 854046 tokens
Tokenized: blake-poems.txt — 6936 tokens
Tokenized: bryant-stories.txt — 46703 tokens
Tokenized: burgess-busterbrown.txt — 16363 tokens
Tokenized: carroll-alice.txt — 27336 tokens
Tokenized: cats.txt — 36329 tokens
Tokenized: chesterton-ball.txt — 82867 tokens
Tokenized: chesterton-brown.txt — 73288 tokens
Tokenized: chesterton-thursday.txt — 58729 tokens
Tokenized: edgeworth-parents.txt — 170796 tokens
Tokenized: melville-moby_dick.txt — 218621 tokens
Tokenized: milton-paradise.txt — 80497 tokens
Tokenized: shakespeare-caesar.txt — 20873 tokens
Tokenized: shakespeare-hamlet.txt — 30271 tokens
Tokenized: shakespeare-macbeth.txt — 18351 tokens
Tokenized: whitman-leaves.txt — 126605 tokens

Preview from first document:
['emma', 'by', 'jane', 'austen', '1816', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'han
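
To make the tokenizer's behavior on punctuation concrete, the sketch below applies the same regular expression to a made-up sentence. The sample text is hypothetical and is not taken from the corpus.

```python
import re


def tokenize(text):
    """Lowercase the text and keep maximal runs of word characters."""
    return re.findall(r'\b\w+\b', text.lower())


# Hypothetical sample sentence, chosen to show how punctuation is handled
sample = "Emma Woodhouse, handsome & clever; she didn't stop."
tokens = tokenize(sample)
print(tokens)
# ['emma', 'woodhouse', 'handsome', 'clever', 'she', 'didn', 't', 'stop']
```

Two side effects are worth noticing: contractions are split at the apostrophe ("didn't" becomes `didn` and `t`), and because `\w` also matches digits and underscores, strings such as `1816` or `moby_dick` survive as single tokens.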

## Step 4: Text Normalization Pipeline

In [34]:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Fetch the English stopword list (does nothing if it is already downloaded)
nltk.download('stopwords', quiet=True)

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()


def normalize(tokens):
    normalized = []
    for t in tokens:
        if t in STOPWORDS:
            continue
        stemmed = STEMMER.stem(t)
        normalized.append(stemmed)
    return normalized

# Load, tokenize, normalize all .txt files
def load_and_process_all(folder_path):
    all_docs = []
    for root, _, files in os.walk(folder_path):
        for fname in files:
            if not fname.endswith('.txt'):
                continue
            path = os.path.join(root, fname)
            with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                text = f.read()
            tokens = tokenize(text)
            tokens = normalize(tokens)
            all_docs.append(tokens)
            print(f"Processed: {fname} — {len(tokens)} tokens (stopwords removed & stemmed)")
    return all_docs

# Run it
folder_path = 'data/nltk_data/corpora'
all_tokenized_docs = load_and_process_all(folder_path)

# Preview the first 30 normalized tokens from the first document
print("\nPreview from first document:")
print(all_tokenized_docs[0][:30])


Processed: austen-emma.txt — 73532 tokens (stopwords removed & stemmed)
Processed: austen-persuasion.txt — 38383 tokens (stopwords removed & stemmed)
Processed: austen-sense.txt — 54040 tokens (stopwords removed & stemmed)
Processed: bible-kjv.txt — 437149 tokens (stopwords removed & stemmed)
Processed: blake-poems.txt — 3807 tokens (stopwords removed & stemmed)
Processed: bryant-stories.txt — 21810 tokens (stopwords removed & stemmed)
Processed: burgess-busterbrown.txt — 7618 tokens (stopwords removed & stemmed)
Processed: carroll-alice.txt — 12243 tokens (stopwords removed & stemmed)
Processed: cats.txt — 36329 tokens (stopwords removed & stemmed)
Processed: chesterton-ball.txt — 39900 tokens (stopwords removed & stemmed)
Processed: chesterton-brown.txt — 35350 tokens (stopwords removed & stemmed)
Processed: chesterton-thursday.txt — 28333 tokens (stopwords removed & stemmed)
Processed: edgeworth-parents.txt — 78207 tokens (stopwords removed & stemmed)
Processed: melville-moby_dick.t
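
The drop in token counts compared with Step 3 comes from stopword removal, since stemming changes tokens but not how many there are. The sketch below isolates that effect with a tiny hand-picked stopword set standing in for NLTK's English list; the set, the sample sentence, and the helper `remove_stopwords` are all assumptions for illustration, and stemming is left out so the example needs only the standard library.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
TINY_STOPWORDS = {"the", "is", "and", "a", "of", "to", "in"}


def tokenize(text):
    """Same regex tokenizer as in Step 3."""
    return re.findall(r'\b\w+\b', text.lower())


def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopword_set]


text = "The whale is the terror of the sea and a wonder to sailors"
tokens = tokenize(text)
kept = remove_stopwords(tokens, TINY_STOPWORDS)

print(len(tokens), len(kept))  # 13 5
print(kept)  # ['whale', 'terror', 'sea', 'wonder', 'sailors']
```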

## Step 5: Build and Test the Pipeline


Using the six concepts and the preprocessing pipeline above, implement a full pipeline that:
- Preprocesses text
- Applies vectorization
- Computes all six concept metrics
- Tests with one phrase query per concept
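
One possible shape for the query test is sketched below, assuming documents are already tokenized. It scores each document by summing the TF-IDF weights of the query terms it contains (a simple overlap score, not cosine similarity). The helper names `build_index` and `score`, and the toy corpus, are illustrative assumptions rather than the workshop's reference solution.

```python
import math
from collections import Counter


def build_index(tokenized_docs):
    """Return per-document term counts and a base-10 IDF table."""
    n_docs = len(tokenized_docs)
    tf = [Counter(doc) for doc in tokenized_docs]
    # Iterating a Counter yields each distinct term once per document
    df = Counter(term for counts in tf for term in counts)
    idf = {term: math.log10(n_docs / d) for term, d in df.items()}
    return tf, idf


def score(query_tokens, tf, idf):
    """Sum (1 + log10(tf)) * idf over the query terms present in each document."""
    scores = []
    for counts in tf:
        total = 0.0
        for term in query_tokens:
            f = counts.get(term, 0)
            if f > 0:
                total += (1 + math.log10(f)) * idf[term]
        scores.append(total)
    return scores


docs = [
    "machine learning is fun".split(),
    "deep learning is powerful".split(),
    "machine learning and deep models".split(),
]
tf, idf = build_index(docs)
scores = score("deep models".split(), tf, idf)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

With this corpus the third document ranks first for the query "deep models", because it contains both terms and "models" is the rarer of the two. In the full pipeline the query would pass through the same tokenizer and normalization as the documents before scoring.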


## Term-Document Incidence Matrix

In [None]:


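# load_and_tokenize_all returns a list of token lists (one per .txt file found).
# Flatten whatever it finds directly in 'data/' into ONE additional document.
manual_doc = [tok for doc in load_and_tokenize_all('data/') for tok in doc]
combined_docs = all_tokenized_docs + [manual_doc]
combined_texts = [' '.join(doc) for doc in combined_docs]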

vectorizer_binary = CountVectorizer(binary=True)
X_incidence = vectorizer_binary.fit_transform(combined_texts)
incidence_df = pd.DataFrame(X_incidence.toarray(), columns=vectorizer_binary.get_feature_names_out())

print(incidence_df)

    00  000  00021053  00081429  00482129  01  02  10  100  1000  ...  \
0    0    1         0         0         0   0   0   1    0     0  ...   
1    0    0         0         0         0   0   0   1    0     0  ...   
2    0    0         0         0         0   0   0   1    0     0  ...   
3    0    0         0         0         0   0   0   1    1     0  ...   
4    0    0         0         0         0   0   0   0    0     0  ...   
5    0    0         0         0         0   0   0   0    0     0  ...   
6    0    0         0         0         0   0   0   0    0     0  ...   
7    0    0         0         0         0   0   0   0    0     0  ...   
8    0    0         0         0         0   0   0   1    1     1  ...   
9    1    1         0         0         0   1   1   1    1     1  ...   
10   0    0         0         0         0   0   0   0    0     0  ...   
11   0    0         0         0         0   0   0   0    0     0  ...   
12   0    0         0         0         0   0   0  

## Term Frequency

In [44]:
vectorizer_tf = CountVectorizer(binary=False)
X_tf = vectorizer_tf.fit_transform(combined_texts)
tf_df = pd.DataFrame(X_tf.toarray(), columns=vectorizer_tf.get_feature_names_out())

print(tf_df)

    00  000  00021053  00081429  00482129  01  02    10  100  1000  ...  \
0    0    2         0         0         0   0   0     2    0     0  ...   
1    0    0         0         0         0   0   0     1    0     0  ...   
2    0    0         0         0         0   0   0     1    0     0  ...   
3    0    0         0         0         0   0   0  2117    6     0  ...   
4    0    0         0         0         0   0   0     0    0     0  ...   
5    0    0         0         0         0   0   0     0    0     0  ...   
6    0    0         0         0         0   0   0     0    0     0  ...   
7    0    0         0         0         0   0   0     0    0     0  ...   
8    0    0         0         0         0   0   0     1    1     1  ...   
9    1    1         0         0         0   1   2     1    2     1  ...   
10   0    0         0         0         0   0   0     0    0     0  ...   
11   0    0         0         0         0   0   0     0    0     0  ...   
12   0    0         0    

## Log Frequency

In [45]:
log_weighted_tf = 1 + np.where(X_tf.toarray() > 0, np.log10(X_tf.toarray()), 0)
log_weighted_df = pd.DataFrame(log_weighted_tf, columns=vectorizer_tf.get_feature_names_out())
print(log_weighted_df)

     00      000  00021053  00081429  00482129   01       02        10  \
0   1.0  1.30103       1.0       1.0       1.0  1.0  1.00000  1.301030   
1   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
2   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
3   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  4.325721   
4   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
5   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
6   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
7   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
8   1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
9   1.0  1.00000       1.0       1.0       1.0  1.0  1.30103  1.000000   
10  1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
11  1.0  1.00000       1.0       1.0       1.0  1.0  1.00000  1.000000   
12  1.0  1.00000       1.0       1.0  

  log_weighted_tf = 1 + np.where(X_tf.toarray() > 0, np.log10(X_tf.toarray()), 0)


## Document Frequency

In [47]:
df_counts = (X_tf.toarray() > 0).sum(axis=0)
df_table = pd.DataFrame({'Term': vectorizer_tf.get_feature_names_out(), 'Document Frequency': df_counts})
print(df_table)

              Term  Document Frequency
0               00                   1
1              000                   3
2         00021053                   1
3         00081429                   1
4         00482129                   1
...            ...                 ...
36700          zur                   1
36701       zuriel                   1
36702  zurishaddai                   1
36703       zuyder                   1
36704        zuzim                   1

[36705 rows x 2 columns]


## Inverse Document Frequency

In [None]:
N = len(combined_docs)
idf_values = np.log10(N / df_counts)
idf_table = pd.DataFrame({'Term': vectorizer_tf.get_feature_names_out(), 'Document Frequency': df_counts, 'IDF': idf_values})
print(idf_table)

              Term  Document Frequency       IDF
0               00                   1  1.301030
1              000                   3  0.823909
2         00021053                   1  1.301030
3         00081429                   1  1.301030
4         00482129                   1  1.301030
...            ...                 ...       ...
36700          zur                   1  1.301030
36701       zuriel                   1  1.301030
36702  zurishaddai                   1  1.301030
36703       zuyder                   1  1.301030
36704        zuzim                   1  1.301030

[36705 rows x 3 columns]


## TF-IDF Manual Computation

In [53]:
tfidf_manual = log_weighted_tf * idf_values
tfidf_manual_df = pd.DataFrame(tfidf_manual, columns=vectorizer_tf.get_feature_names_out())
print(tfidf_manual_df)

         00       000  00021053  00081429  00482129       01        02  \
0   1.30103  1.071930   1.30103   1.30103   1.30103  1.30103  1.301030   
1   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
2   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
3   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
4   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
5   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
6   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
7   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
8   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
9   1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.692679   
10  1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
11  1.30103  0.823909   1.30103   1.30103   1.30103  1.30103  1.301030   
12  1.30103  0.823909   1.30103   1.30
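
Under the usual log-weighting convention, a term that does not occur in a document gets a TF weight of 0, so its tf-idf in that document is also 0. The sketch below makes that explicit on a small count matrix. The matrix and the names `counts`, `log_tf`, `idf`, and `tfidf` are hypothetical and are not the workshop's corpus or variables.

```python
import numpy as np

# Toy term-count matrix (hypothetical): 3 documents x 4 terms.
counts = np.array([
    [2, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 3],
])

N = counts.shape[0]              # number of documents
df = (counts > 0).sum(axis=0)    # document frequency of each term

# Log-weighted TF: 1 + log10(tf) where tf > 0, and 0 where the term is absent.
log_tf = np.zeros(counts.shape, dtype=float)
present = counts > 0
log_tf[present] = 1 + np.log10(counts[present])

idf = np.log10(N / df)           # shape (4,), broadcasts across the rows
tfidf = log_tf * idf

print(np.round(tfidf, 4))
```

The third term occurs in all three documents, so its IDF is 0 and its whole column is zero regardless of how often it appears.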

## Term-Document Incidence Matrix: Talking Points

- Using a Term-Document Incidence Matrix allows quick checks for whether specific keywords exist in a document, which is helpful in tasks like keyword-based filtering or search.

- This matrix can be used to identify documents that contain all terms from a query phrase, although it does not provide information about the order or exact position of the terms.
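
Both points can be illustrated with the three example documents from Step 1. This is a minimal sketch that assumes whitespace tokenization; the helper `and_query` is a hypothetical name introduced here for illustration.

```python
docs = {
    "Doc1": "machine learning is fun",
    "Doc2": "deep learning is powerful",
    "Doc3": "machine learning and deep models",
}

# Each document becomes a set of terms; set membership is the 0/1 incidence entry.
term_sets = {name: set(text.split()) for name, text in docs.items()}
vocabulary = sorted(set().union(*term_sets.values()))
incidence = {
    term: [int(term in term_sets[name]) for name in docs]
    for term in vocabulary
}

def and_query(terms):
    """Return the documents that contain every query term (Boolean AND)."""
    return [name for name, words in term_sets.items() if set(terms) <= words]

print(incidence["machine"])              # [1, 0, 1]
print(and_query(["deep", "learning"]))   # ['Doc2', 'Doc3']
```

Doc3 matches the query even though "deep" and "learning" are not adjacent there, which is exactly the positional information the incidence matrix cannot capture.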

---

## Term Frequency (TF): Talking Points

- When two words appear equally often in a document, they receive the same TF score, even if one of them is a more generic or less meaningful term.

- Normalizing term frequency by document length (dividing each count by the document's total number of tokens) keeps long documents from receiving inflated scores simply because they contain more words.
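
A small sketch of length normalization, assuming relative frequency (count divided by total tokens) as the normalization scheme. The function `normalized_tf` and the two sample strings are hypothetical.

```python
from collections import Counter

def normalized_tf(text):
    """Relative term frequency: count of each term divided by total tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}

short_doc = "data science"
long_doc = "data science " * 10   # same proportions, ten times longer

print(normalized_tf(short_doc)["data"])   # 0.5
print(normalized_tf(long_doc)["data"])    # 0.5
```

The raw count of "data" differs by a factor of ten between the two documents, yet the normalized score is identical. "data" and "science" also receive the same score within each document, since TF alone cannot tell which of two equally frequent terms is more informative.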

---

## Log Frequency Weight: Talking Points

- Applying log frequency weighting reduces the influence of very frequently occurring words, preventing them from dominating the analysis in document search or ranking systems.

- Adjusting raw counts with log frequency makes term importance values more stable and comparable, even when some words appear much more often than others.
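
The dampening effect is easy to see numerically. The sketch assumes the common definition $1 + \log_{10}(\text{tf})$ for $\text{tf} > 0$ and $0$ otherwise; `log_frequency_weight` is a hypothetical helper name.

```python
import math

def log_frequency_weight(tf):
    """Return 1 + log10(tf) for tf > 0, and 0 for a term that does not occur."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in [0, 1, 10, 100, 1000]:
    print(tf, log_frequency_weight(tf))
```

A term that appears 1000 times ends up with a weight only four times that of a term that appears once, rather than a thousand times larger.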

---

## Document Frequency (DF): Talking Points

- Document Frequency helps identify common filler or stopwords that appear in many documents, making it easier to filter out less informative terms automatically.

- DF provides a broader view than just checking individual documents, as it measures how widely a term is distributed across the entire corpus.
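
A minimal sketch of using DF to flag stopword candidates. The three sentences and the cutoff (a term must appear in every document) are assumptions chosen for illustration; a real corpus would need a threshold tuned to its size.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]

# Count each term once per document, no matter how often it repeats there.
df = Counter()
for text in docs:
    df.update(set(text.split()))

# Hypothetical cutoff: terms present in every document are stopword candidates.
candidates = sorted(term for term, n in df.items() if n == len(docs))

print(df["the"], df["cat"], candidates)   # 3 2 ['the']
```

"the" occurs five times in total, but its DF is 3 because DF counts documents, not occurrences.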

---

## Inverse Document Frequency (IDF): Talking Points

- High IDF values highlight terms that are rare and potentially more useful for distinguishing between documents, while low IDF values indicate common terms.

- By reviewing IDF values, it becomes clear which terms will have a greater impact when calculating TF-IDF scores, helping to focus on more meaningful keywords.

---

## TF-IDF Weighting: Talking Points

- TF-IDF emphasizes words that are both frequent within a document and rare across the entire corpus, making it a valuable method for document ranking and relevance scoring.

- Comparing TF-IDF scores across documents reveals which terms contribute most to each document's distinctiveness within the corpus.
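
To see which terms stand out in each document, the sketch below scores the three example documents from Step 1 and reports the top-scoring term(s) in each. It assumes whitespace tokenization, log-weighted TF, and base-10 IDF; the helpers `tfidf` and `top_terms` are hypothetical names.

```python
import math
from collections import Counter

docs = {
    "Doc1": "machine learning is fun",
    "Doc2": "deep learning is powerful",
    "Doc3": "machine learning and deep models",
}
tokens = {name: text.split() for name, text in docs.items()}
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter()
for words in tokens.values():
    df.update(set(words))

def tfidf(term, words):
    """Log-weighted TF times base-10 IDF; 0 if the term is absent."""
    tf = words.count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

def top_terms(name):
    """Return the term(s) with the highest tf-idf score in one document."""
    words = tokens[name]
    scores = {t: tfidf(t, words) for t in set(words)}
    best = max(scores.values())
    return sorted(t for t, s in scores.items() if s == best)

for name in docs:
    print(name, top_terms(name))
```

"learning" appears in all three documents, so it scores 0 everywhere. In a corpus this small, "and" ties for the top score in Doc3 simply because it occurs in only one document, which is a reminder that IDF is only as informative as the corpus it is computed on.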


## Step 6: The Workshop


One team member must push the final notebook to GitHub and send the `.git` URL to the instructor before the end of class.




## 🧠 Learning Objectives
- Implement the foundations of **Vector Space Proximity** algorithms on real-world data as part of an NLP pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Form teams of 3. Read the workshop and submission instructions, and ask for help if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – As a team, implement the NLP pipeline and the six IR basics techniques, with Markdown documentation.
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor **one email** with the **`.git` link** to the GitHub repo. The email subject is: PROG8245 - IR Basics & Vector Space Proximity Foundations Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IRBasics_VectorSpaceProximity.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, Inverted Index and the six concepts.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** (1-2 per concept)
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IRBasics-VectorSpaceProximity-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 🔚 Conclusion


This workshop prepares you for our next session on **Vector Space Proximity** and **Cosine Similarity**.
