# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems that need fast and relevant access to text data — such as AI agents.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Index** over real-world text data as part of an NLP preprocessing pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Manual IR and Inverted Index coding plus Markdown documentation (work as teams).
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the Inverted Index code**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email/team)**. The email subject is: PROG8245 - Inverted Matrix Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.
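
As a quick sanity check on the vocabulary requirement, the sketch below counts unique words with the standard library only. The helper name `vocabulary` and the two-sentence `sample_docs` corpus are illustrative assumptions, not part of the workshop data. The count may differ slightly from a `CountVectorizer`-based count, because `CountVectorizer` by default ignores single-character tokens.

```python
import re

def vocabulary(docs):
    """Return the set of unique lowercase word tokens across all documents."""
    vocab = set()
    for text in docs:
        # Treat every run of letters/digits as one token.
        vocab.update(re.findall(r"[a-z0-9]+", text.lower()))
    return vocab

# Hypothetical in-memory corpus; replace with the documents you loaded.
sample_docs = ["The cat sat on the mat.", "A dog sat on the log!"]
vocab = vocabulary(sample_docs)
print(len(vocab), sorted(vocab))
```

On a real corpus, compare `len(vocab)` against the 2000-word threshold before moving on.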


In [1]:
import os
import re
from sklearn.feature_extraction.text import CountVectorizer

# 📁 Path to the folder where the extracted files are stored
folder_path = './Data/'  # Folder with the extracted .txt files (same path the later cells use)

# 🔄 Step 1: Load all .txt documents from folder
def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# 🧹 Step 2: Clean text
def clean_text(text):
    text = re.sub(r'\W+', ' ', text)  # Remove punctuation/special characters
    return text.lower()

# 📏 Step 3: Get vocabulary size
def get_vocabulary_size(docs):
    vectorizer = CountVectorizer()
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# 🚀 Run the pipeline
documents = load_documents(folder_path)
cleaned_docs = [clean_text(doc) for doc in documents]
vocab_size = get_vocabulary_size(cleaned_docs)

# 📊 Output
print(f"✅ Documents loaded: {len(documents)}")
print(f"✅ Unique vocabulary size: {vocab_size}")

# 💬 Optional check
if len(documents) >= 20 and vocab_size > 2000:
    print("🎯 Requirement satisfied: You can move to the next step!")
else:
    print("⚠️ Requirement NOT met — consider using longer/more diverse documents.")


✅ Documents loaded: 20
✅ Unique vocabulary size: 6866
🎯 Requirement satisfied: You can move to the next step!


## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.
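
One design choice worth discussing is whether punctuation is deleted or treated as a token boundary. The sketch below contrasts the two under the assumption of plain ASCII English text; both function names and the sample sentence are illustrative.

```python
import re

def delete_punct_tokenizer(text):
    # Delete every character that is not a letter, digit, or whitespace, then split.
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

def split_on_punct_tokenizer(text):
    # Alternative: treat any non-alphanumeric character as a token boundary.
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "State-of-the-art models don't fail."
print(delete_punct_tokenizer(sample))
print(split_on_punct_tokenizer(sample))
```

Deleting punctuation fuses hyphenated words into one token, while splitting on it breaks contractions into fragments. Neither is universally right, so the choice makes a good talking point.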


In [2]:
import os
import re

def basic_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = text.split()
    return tokens

# Load documents from 'docs/' folder
def load_documents(folder_path='docs/'):
    documents = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Example usage: load all docs and tokenize each
docs = load_documents('./Data/')

for i, doc in enumerate(docs, 1):
    tokens = basic_tokenizer(doc)
    print(f"Document {i} tokens: {tokens[:20]}")  # print first 20 tokens of each doc


Document 1 tokens: ['usually', 'many', 'skin', 'finish', 'attorney', 'early', 'save', 'boy', 'in', 'store', 'thousand', 'pick', 'clear', 'today', 'face', 'far', 'system', 'star', 'stop', 'summer']
Document 2 tokens: ['billion', 'trip', 'stand', 'stage', 'world', 'question', 'people', 'kid', 'price', 'determine', 'eight', 'join', 'whatever', 'friend', 'already', 'yet', 'fall', 'recent', 'it', 'account']
Document 3 tokens: ['director', 'century', 'weight', 'statement', 'give', 'various', 'hot', 'similar', 'same', 'act', 'out', 'these', 'land', 'glass', 'three', 'world', 'either', 'mind', 'far', 'nice']
Document 4 tokens: ['anyone', 'letter', 'particular', 'like', 'wind', 'whole', 'laugh', 'trip', 'room', 'keep', 'claim', 'ball', 'require', 'worker', 'standard', 'foreign', 'democratic', 'collection', 'skill', 'close']
Document 5 tokens: ['best', 'there', 'prevent', 'option', 'among', 'candidate', 'raise', 'shake', 'without', 'customer', 'dog', 'religious', 'congress', 'per', 'dream', 'stu

## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
> Now we normalize tokens: convert to lowercase, remove stop words, apply stemming or affix stripping. This reduces redundancy and enhances search accuracy.

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.


In [3]:
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download stopwords once
nltk.download('stopwords')

def normalize_tokens(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = text.split()
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    return stemmed_tokens

def load_documents(folder_path='./Data/'):
    documents = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Load and normalize all documents
docs = load_documents('./Data/')

for i, doc in enumerate(docs, 1):
    normalized_tokens = normalize_tokens(doc)
    print(f"Document {i} normalized tokens: {normalized_tokens[:20]}")  # first 20 tokens


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Document 1 normalized tokens: ['usual', 'mani', 'skin', 'finish', 'attorney', 'earli', 'save', 'boy', 'store', 'thousand', 'pick', 'clear', 'today', 'face', 'far', 'system', 'star', 'stop', 'summer', 'film']
Document 2 normalized tokens: ['billion', 'trip', 'stand', 'stage', 'world', 'question', 'peopl', 'kid', 'price', 'determin', 'eight', 'join', 'whatev', 'friend', 'alreadi', 'yet', 'fall', 'recent', 'account', 'mother']
Document 3 normalized tokens: ['director', 'centuri', 'weight', 'statement', 'give', 'variou', 'hot', 'similar', 'act', 'land', 'glass', 'three', 'world', 'either', 'mind', 'far', 'nice', 'manag', 'continu', 'surfac']
Document 4 normalized tokens: ['anyon', 'letter', 'particular', 'like', 'wind', 'whole', 'laugh', 'trip', 'room', 'keep', 'claim', 'ball', 'requir', 'worker', 'standard', 'foreign', 'democrat', 'collect', 'skill', 'close']
Document 5 normalized tokens: ['best', 'prevent', 'option', 'among', 'candid', 'rais', 'shake', 'without', 'custom', 'dog', 'religi

## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.
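
Before adding positions, it can help to see the Boolean case on its own. The following is a minimal sketch of an AND query over a non-positional index, assuming whitespace tokenization and a hypothetical three-document corpus; `build_index` and `boolean_and` are illustrative names.

```python
def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    """Return the sorted IDs of documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

toy_docs = ["deep learning models", "machine learning", "deep sea diving"]
toy_index = build_index(toy_docs)
print(boolean_and(toy_index, ["deep", "learning"]))
```

An AND query only intersects posting sets. A phrase query must also check that the terms sit at consecutive positions, which is why the index has to store positions.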


In [4]:
import os
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

nltk.download('stopwords')

def normalize_tokens(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = text.split()
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    filtered_stemmed = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return filtered_stemmed

def build_inverted_index(docs):
    # Positional index: token -> {doc_id: [positions of the token in that document]}
    inverted_index = {}
    for doc_id, text in enumerate(docs):
        tokens = normalize_tokens(text)
        for pos, token in enumerate(tokens):
            # Add positional info for phrase queries
            if token not in inverted_index:
                inverted_index[token] = {}
            if doc_id not in inverted_index[token]:
                inverted_index[token][doc_id] = []
            inverted_index[token][doc_id].append(pos)
    return inverted_index

def phrase_in_doc(inverted_index, phrase_tokens, doc_id):
    # A phrase made only of stop words normalizes to an empty list; treat it as no match
    if not phrase_tokens:
        return False
    positions_lists = []
    for token in phrase_tokens:
        if token not in inverted_index or doc_id not in inverted_index[token]:
            return False
        # Sets make the membership tests below O(1)
        positions_lists.append(set(inverted_index[token][doc_id]))

    # Check positions for sequential occurrence of phrase tokens
    for start_pos in positions_lists[0]:
        if all((start_pos + offset) in positions_lists[offset] for offset in range(1, len(positions_lists))):
            return True
    return False

def phrase_search(inverted_index, phrase, docs):
    phrase_tokens = normalize_tokens(phrase)
    matched_docs = []
    for doc_id in range(len(docs)):
        if phrase_in_doc(inverted_index, phrase_tokens, doc_id):
            matched_docs.append(doc_id)
    return matched_docs

def load_documents(folder_path='./Data/'):
    documents = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as f:
                documents.append(f.read())
    return documents

# ---- Main ----
docs = load_documents('./Data/')

inverted_index = build_inverted_index(docs)

print("Sample tokens from inverted index:")
for token, postings in list(inverted_index.items())[:10]:
    print(f"{token}: {list(postings.keys())}")

# Test phrase queries
phrases = ["machine learning", "artificial intelligence"]

for phrase in phrases:
    matched = phrase_search(inverted_index, phrase, docs)
    print(f"\nDocuments containing the phrase '{phrase}': {matched}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Sample tokens from inverted index:
usual: [0, 2, 6, 8, 11, 12, 13, 16, 18]
mani: [0, 1, 2, 3, 9, 12, 13, 16, 18]
skin: [0, 1, 3, 4, 6, 7, 8, 10, 12, 15, 16]
finish: [0, 3, 5, 11, 13, 17, 19]
attorney: [0, 3, 4, 7, 8, 9, 10, 11, 13, 14, 15, 16]
earli: [0, 2, 5, 7, 17]
save: [0, 2, 3, 4, 5, 7, 8, 9, 12, 13, 18]
boy: [0, 1, 2, 3, 4, 8, 9, 10, 14, 15, 19]
store: [0, 5, 6, 7, 8, 11, 14, 16, 17, 19]
thousand: [0, 2, 3, 5, 8, 10, 12, 13, 19]

Documents containing the phrase 'machine learning': [4, 11]

Documents containing the phrase 'artificial intelligence': [3, 11, 17]


## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.


In [5]:
import re

def basic_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()

def build_inverted_index(docs):
    inverted_index = {}
    for doc_id, text in enumerate(docs):
        tokens = basic_tokenizer(text)
        for pos, token in enumerate(tokens):
            if token not in inverted_index:
                inverted_index[token] = {}
            if doc_id not in inverted_index[token]:
                inverted_index[token][doc_id] = []
            inverted_index[token][doc_id].append(pos)
    return inverted_index

def phrase_in_doc(inverted_index, phrase_tokens, doc_id):
    positions_lists = []
    for token in phrase_tokens:
        if token not in inverted_index or doc_id not in inverted_index[token]:
            return False
        positions_lists.append(inverted_index[token][doc_id])
    first_positions = positions_lists[0]
    for start_pos in first_positions:
        if all((start_pos + offset) in positions_lists[offset] for offset in range(1, len(positions_lists))):
            return True
    return False

def phrase_search(inverted_index, phrase, docs):
    phrase_tokens = basic_tokenizer(phrase)
    matched_docs = []
    for doc_id in range(len(docs)):
        if phrase_in_doc(inverted_index, phrase_tokens, doc_id):
            matched_docs.append(doc_id)
    return matched_docs

# Sample documents
docs = [
    "Machine learning is fascinating.",
    "Deep learning is a subset of machine learning.",
    "Artificial intelligence includes machine learning.",
    "Learning about machine algorithms."
]

# Build index
index = build_inverted_index(docs)

# Phrase queries
phrases = ["machine learning", "deep learning"]

for phrase in phrases:
    matched = phrase_search(index, phrase, docs)
    print(f"Documents containing phrase '{phrase}': {matched}")


Documents containing phrase 'machine learning': [0, 1, 2]
Documents containing phrase 'deep learning': [1]


# 🧠 NLP Foundations Workshop: From Preprocessing to tf-idf


**Duration**: 90 minutes  
**Team Size**: 3 students  
**Objective**: Build an NLP pipeline from scratch to implement and test six foundational concepts in Natural Language Processing in preparation for Vector Space Models and Cosine Similarity.


## Step 1: Presenting the Six Core NLP Concepts

### 🔹 Term-Document Incidence Matrix

The **Term-Document Incidence Matrix** is a binary matrix that shows whether a term $t$ appears in a document $d$.

- Rows represent terms in the vocabulary  
- Columns represent documents in the corpus  
- Each entry $w_{t,d}$ is defined as:

$$
w_{t,d} =
\begin{cases}
1 & \text{if } t \in d \\
0 & \text{otherwise}
\end{cases}
$$

This is a **binary representation** — it only records the **presence or absence** of a term, not how many times it appears.

---

#### ✅ Why Use It?

- It’s the **simplest form** of representing document contents using structured data.
- Useful for:
  - Boolean search and keyword filters
  - Document classification based on keyword sets
  - Building foundational **retrieval systems**
- Helps in detecting whether **all query terms exist** in a document (e.g., Boolean "AND" operations). It records no word order, so a true phrase query additionally needs positional information.

---

#### 📘 Example

Suppose we have 3 documents:

- **Doc1**: "machine learning is fun"  
- **Doc2**: "deep learning is powerful"  
- **Doc3**: "machine learning and deep models"

The vocabulary extracted from all three is:

**Vocabulary** = {machine, learning, is, fun, deep, powerful, and, models}

The Term-Document Incidence Matrix would look like:

| Term       | Doc1 | Doc2 | Doc3 |
|------------|------|------|------|
| machine    | 1    | 0    | 1    |
| learning   | 1    | 1    | 1    |
| is         | 1    | 1    | 0    |
| fun        | 1    | 0    | 0    |
| deep       | 0    | 1    | 1    |
| powerful   | 0    | 1    | 0    |
| and        | 0    | 0    | 1    |
| models     | 0    | 0    | 1    |

For example:
- $w_{\text{machine}, \text{Doc1}} = 1$ → "machine" is in Doc1
- $w_{\text{powerful}, \text{Doc1}} = 0$ → "powerful" is not in Doc1

This matrix is particularly helpful when implementing **Boolean retrieval systems**. For **phrase matching** it can only pre-filter the documents that contain every phrase term, since it stores no positions.


In [6]:
# 📘 Example: Term-Document Incidence Matrix

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus from the Markdown example
docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Use binary=True to indicate presence/absence (1 or 0)
vectorizer = CountVectorizer(binary=True)

# Fit and transform the corpus
X = vectorizer.fit_transform(docs)

# Create a labeled DataFrame
incidence_matrix = pd.DataFrame(X.toarray(),
                                index=["Doc1", "Doc2", "Doc3"],
                                columns=vectorizer.get_feature_names_out())

# Display the incidence matrix
print("🔎 Term-Document Incidence Matrix:")
display(incidence_matrix)


🔎 Term-Document Incidence Matrix:


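|      | and | deep | fun | is | learning | machine | models | powerful |
|------|-----|------|-----|----|----------|---------|--------|----------|
| Doc1 | 0   | 0    | 1   | 1  | 1        | 1       | 0      | 0        |
| Doc2 | 0   | 1    | 0   | 1  | 1        | 0       | 0      | 1        |
| Doc3 | 1   | 1    | 0   | 0  | 1        | 1       | 1      | 0        |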


## Talking Points:

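In the output above, each row represents a document and each column a term. This is the transpose of the term-by-document layout defined earlier, because scikit-learn's `CountVectorizer` returns a document-term matrix. The matrix tells us whether a word appears (1) or not (0) in a document, which is useful for Boolean search systems. Try selecting two keywords and checking which documents contain both.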


🗣️ **Instructor Talking Point**: This code demonstrates how the presence or absence of a term in a document is encoded as a binary matrix, which is foundational for Boolean retrieval. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Add a phrase query (e.g., 'machine learning') and explain your reasoning as to how you would check if both terms occur in a single document using this matrix.
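
One possible way to approach the check is sketched below in plain Python on the three example documents. The names `incidence` and `docs_with_all` are illustrative assumptions.

```python
docs = {
    "Doc1": "machine learning is fun",
    "Doc2": "deep learning is powerful",
    "Doc3": "machine learning and deep models",
}

vocab = sorted({word for text in docs.values() for word in text.split()})

# incidence[term][doc] is 1 if the term occurs in the document, else 0.
incidence = {
    term: {doc: int(term in text.split()) for doc, text in docs.items()}
    for term in vocab
}

def docs_with_all(terms):
    """Boolean AND: documents whose incidence entry is 1 for every term."""
    return [d for d in docs if all(incidence.get(t, {}).get(d, 0) for t in terms)]

print(docs_with_all(["machine", "learning"]))
print(docs_with_all(["deep", "learning"]))
```

The second query also returns Doc3, which contains both "deep" and "learning" but never the adjacent phrase "deep learning". The matrix can confirm co-occurrence but not word order.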

### 🔹 Term Frequency (TF)

**Term Frequency (TF)** measures how frequently a term $t$ appears in a document $d$.

$$
tf_{t,d} = f_{t,d}
$$

Where $f_{t,d}$ is the raw count of term $t$ in document $d$.

---

#### ✅ Why Use It?

- TF reflects the importance of a word **within a specific document**.
- A higher TF means the term is likely central to the topic of that document.
- It's used as the **first step** in vectorizing text for machine learning models like classification, clustering, or information retrieval.

TF is most effective when combined with **IDF** (Inverse Document Frequency) to balance against very common terms across the corpus.

---

#### 📘 Example

Let’s say we have this document:

> **Doc1**: `"machine learning is fun and machine learning is useful"`

Calculate raw term counts:

| Term     | Raw TF $(f_{t,d})$ |
|----------|--------------------|
| machine  | 2                  |
| learning | 2                  |
| is       | 2                  |
| fun      | 1                  |
| and      | 1                  |
| useful   | 1                  |

If normalized (total of 9 words):

- $tf(\text{"machine"}, \text{Doc1}) = \frac{2}{9} \approx 0.22$
- $tf(\text{"learning"}, \text{Doc1}) = \frac{2}{9} \approx 0.22$

This simple frequency can then be used as input to weighting schemes such as **TF-IDF**, which adjusts these values based on how rare the words are across multiple documents.


In [7]:
# 📘 Example: Term Frequency (TF)

import pandas as pd
from collections import Counter

# Sample document
doc1 = "machine learning is fun and machine learning is useful"

# Tokenize the document (simple lowercase + split)
tokens = doc1.lower().split()

# Count term frequencies
tf_raw = Counter(tokens)

# Total number of words
total_terms = len(tokens)

# Compute normalized TF
tf_normalized = {term: count / total_terms for term, count in tf_raw.items()}

# Display results
print("🔢 Raw Term Frequencies:")
display(pd.DataFrame(tf_raw.items(), columns=["Term", "Raw TF"]))

print("\n📏 Normalized Term Frequencies:")
display(pd.DataFrame(tf_normalized.items(), columns=["Term", "TF (Normalized)"]))


🔢 Raw Term Frequencies:


|   | Term     | Raw TF |
|---|----------|--------|
| 0 | machine  | 2      |
| 1 | learning | 2      |
| 2 | is       | 2      |
| 3 | fun      | 1      |
| 4 | and      | 1      |
| 5 | useful   | 1      |



📏 Normalized Term Frequencies:


|   | Term     | TF (Normalized) |
|---|----------|-----------------|
| 0 | machine  | 0.222222        |
| 1 | learning | 0.222222        |
| 2 | is       | 0.222222        |
| 3 | fun      | 0.111111        |
| 4 | and      | 0.111111        |
| 5 | useful   | 0.111111        |


## Talking Points:

This shows which terms are most frequent in the selected document. Words with higher normalized TF values are more likely to represent the main topics in that file. Try comparing these results across multiple documents to identify topic-specific keywords.


🗣️ **Instructor Talking Point**: Here we count how often each term appears in a single document and normalize it. This is the simplest way to represent word importance within a document. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Use this TF output to compare with another document. Which terms are likely to be most important in Doc1 based on their normalized TF? Explain your reasoning.
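
To make the comparison concrete, here is a minimal sketch that prints normalized TF for two documents side by side. The second document and the helper name `normalized_tf` are assumptions chosen for illustration.

```python
from collections import Counter

def normalized_tf(text):
    """Return {term: count / total_tokens} for a whitespace-tokenized document."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

doc1 = "machine learning is fun and machine learning is useful"
doc2 = "deep learning is powerful"  # hypothetical second document

tf1, tf2 = normalized_tf(doc1), normalized_tf(doc2)
for term in sorted(set(tf1) | set(tf2)):
    print(f"{term:10s} {tf1.get(term, 0):.3f} {tf2.get(term, 0):.3f}")
```

Terms that score high in one document and zero in the other, such as "machine" here, are candidates for topic-specific keywords.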

### 🔹 Log Frequency Weight

To reduce the impact of very frequent terms, **log frequency weighting** is applied.

$$
w_{t,d} =
\begin{cases}
1 + \log_{10}(f_{t,d}) & \text{if } f_{t,d} > 0 \\
0 & \text{if } f_{t,d} = 0
\end{cases}
$$

This transformation reduces the skew caused by terms that appear many times in a document. Instead of allowing their raw frequency to dominate, we scale their contribution **logarithmically**.

---

#### ✅ Why Use It?

- Frequent terms are not always the most **important** terms.
- Log scaling ensures that:
  - Words with a raw count of 1 are preserved ($1 + \log_{10}(1) = 1$),
  - But words with very high counts (e.g., 1000) don’t dominate the document vector.

This helps **normalize the influence** of repetitive terms and improve the **numerical stability** of document representations in models.

---

#### 📘 Example

Let’s say we have a document with the following raw term counts:

| Term     | Raw TF $f_{t,d}$ | Log Frequency Weight $w_{t,d}$    |
|----------|------------------|-----------------------------------|
| machine  | 1                | $1 + \log_{10}(1) = 1$            |
| learning | 3                | $1 + \log_{10}(3) \approx 1.477$  |
| data     | 10               | $1 + \log_{10}(10) = 2$           |

So even though "data" appears 10 times, its log-weighted value is **just 2**, making it more comparable to less frequent but potentially more meaningful terms like "learning".

This makes log frequency weighting especially useful when preparing inputs for **TF-IDF** weighting or **document clustering**.


In [8]:
# 📘 Example: Log Frequency Weighting

import pandas as pd
import numpy as np
from collections import Counter

# Sample document with varying term frequencies
doc = "machine learning data data data learning learning learning machine data data data data"

# Tokenize and count raw term frequencies
tokens = doc.lower().split()
raw_tf = Counter(tokens)

# Compute log frequency weights
log_weighted_tf = {
    term: 1 + np.log10(freq) if freq > 0 else 0
    for term, freq in raw_tf.items()
}

# Build and display the result as a DataFrame
df = pd.DataFrame({
    "Term": raw_tf.keys(),
    "Raw TF (f_{t,d})": raw_tf.values(),
    "Log Weight (w_{t,d})": log_weighted_tf.values()
})

print("📊 Log Frequency Weighting:")
display(df)


📊 Log Frequency Weighting:


|   | Term     | Raw TF (f_{t,d}) | Log Weight (w_{t,d}) |
|---|----------|------------------|----------------------|
| 0 | machine  | 2                | 1.30103              |
| 1 | learning | 4                | 1.60206              |
| 2 | data     | 7                | 1.845098             |


## Talking Point:

This section demonstrates how **logarithmic scaling** is applied to raw term frequencies to reduce the impact of frequently occurring words.

In the code, we take a sample document, tokenize it, and count how many times each term appears. Instead of treating a word that appears 10 times as 10× more important than a word that appears once, we apply the formula:

> **Log Weight =** `1 + log10(frequency)` if frequency > 0, else 0

This **compresses large values** and prevents them from dominating the vector space.

For example:

* A word with frequency = 1 → log weight = **1**
* A word with frequency = 10 → log weight ≈ **2**
* A word with frequency = 100 → log weight ≈ **3**

The resulting DataFrame shows each term with its raw count and corresponding log-weighted value.

> **Why it matters**: This technique makes document representations more balanced and ensures rare but meaningful words aren't drowned out by repetitive, less informative ones — an essential step before TF-IDF weighting or document comparison.


🗣️ **Instructor Talking Point**: Note how 'data' has a high frequency, but its impact is smoothed by log weighting, making it comparable to 'learning'. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Try adjusting the number of times a word appears and observe how the log scale compresses large values.
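
A small sketch for experimenting with the compression effect, using only the standard library. The function name `log_weight` and the chosen frequencies are illustrative.

```python
import math

def log_weight(freq):
    """Log frequency weight: 1 + log10(f) for f > 0, otherwise 0."""
    return 1 + math.log10(freq) if freq > 0 else 0.0

for freq in [0, 1, 3, 10, 100, 1000]:
    print(f"f = {freq:4d} -> w = {log_weight(freq):.3f}")
```

Each tenfold increase in raw frequency adds only 1 to the weight, which is the compression described above.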

### 🔹 Document Frequency (DF)

**Document Frequency** is the number of documents in which a term $t$ appears:

$$
df_t = |\{ d \in D : t \in d \}|
$$

Where:
- $df_t$ is the document frequency of term $t$
- $D$ is the set of all documents in the corpus
- $t \in d$ means the term $t$ appears in document $d$

---

#### ✅ Why Use It?

- It helps you understand **how common or rare** a word is across the entire document set.
- Words with **high DF** (e.g., “the”, “and”) occur in many documents and are often **less informative**.
- Words with **low DF** are more likely to be **specific and meaningful** for distinguishing between documents.
- DF is a key ingredient in calculating **Inverse Document Frequency (IDF)**.

---

#### 📘 Example

Suppose you have the following three documents:

- **Doc1**: "machine learning is fun"  
- **Doc2**: "deep learning is powerful"  
- **Doc3**: "machine learning and deep models"

Now, let’s compute the Document Frequency:

| Term     | Document Frequency ($df_t$) |
|----------|-----------------------------|
| machine  | 2 (Doc1, Doc3)              |
| learning | 3 (Doc1, Doc2, Doc3)        |
| deep     | 2 (Doc2, Doc3)              |
| models   | 1 (Doc3)                    |

The term **"learning"** appears in all three documents → **high DF**, which means it’s **less useful for distinguishing** between them.

The term **"models"** appears in only one document → **low DF**, meaning it could be a **useful keyword** for that specific document.
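
The table can be reproduced without any library beyond the standard one. The sketch below assumes whitespace tokenization; converting each document to a `set` ensures a term is counted once per document regardless of how often it repeats.

```python
from collections import Counter

docs = [
    "machine learning is fun",           # Doc1
    "deep learning is powerful",         # Doc2
    "machine learning and deep models",  # Doc3
]

df = Counter()
for text in docs:
    # set() collapses repeated occurrences within one document
    df.update(set(text.lower().split()))

for term, count in df.most_common():
    print(f"{term:10s} {count}")
```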


In [9]:
# 📘 Example: Document Frequency (DF)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents from the Markdown example above
docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Use CountVectorizer to extract term-document matrix (raw counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Get feature names and document-term matrix as array
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Calculate document frequency for each term
df_counts = (X_array > 0).sum(axis=0)

# Format as a DataFrame
df_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts
}).sort_values("Document Frequency (df_t)", ascending=False)

print("📊 Document Frequency (DF) Table:")
display(df_table)


📊 Document Frequency (DF) Table:


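|   | Term     | Document Frequency (df_t) |
|---|----------|---------------------------|
| 4 | learning | 3                         |
| 1 | deep     | 2                         |
| 5 | machine  | 2                         |
| 3 | is       | 2                         |
| 2 | fun      | 1                         |
| 0 | and      | 1                         |
| 6 | models   | 1                         |
| 7 | powerful | 1                         |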


## Talking Point:

Document Frequency tells us how many documents a term appears in.

- **High DF** terms (like “the”, “is”, “data”) appear in many documents — they are often common and less meaningful.

- **Low DF** terms appear in fewer documents and are often more specific and informative for identifying or distinguishing topics.

This DF table helps us identify generic vs. unique terms, which is crucial before computing IDF and TF-IDF scores.



🗣️ **Instructor Talking Point**: Notice how common terms like 'learning' appear in all documents, while more specific terms like 'fun' or 'models' appear in only one.
<br/>
<br/>
🧠 **Student Talking Point**: Choose a term and explain how its document frequency could affect downstream TF-IDF weighting.

### 🔹 Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how rare or informative a term is across the entire corpus:

$$
idf_t = \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $N$ is the total number of documents in the corpus  
- $df_t$ is the number of documents that contain the term $t$

---

#### ✅ Why Use It?

- IDF is used to **downweight common terms** and **upweight rare ones**.
- Words like “the”, “and”, or “data” appear frequently and are less helpful in distinguishing documents.
- Terms that appear in **fewer documents** are often **more informative** and **discriminative**.
- IDF is a core component of **TF-IDF**, a widely used technique in search engines, document classification, and clustering.

---

#### 📘 Example

Let’s say we have **5 documents** total, and the following document frequencies:

| Term     | $df_t$ | $idf_t = \log_{10}(N / df_t)$ |
|----------|--------|-------------------------------|
| machine  | 3      | $\log_{10}(5 / 3) \approx 0.22$ |
| entropy  | 1      | $\log_{10}(5 / 1) \approx 0.70$ |
| the      | 5      | $\log_{10}(5 / 5) = 0.00$       |

- The term **"entropy"** appears in only one document, so its IDF is **high** → it’s a **rare and informative term**.
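- The term **"the"** appears in all 5 documents, so its IDF is **0** → it carries no information for distinguishing documents.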


In [10]:
# 📘 Example: Inverse Document Frequency (IDF)

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (5 total)
docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning"
]

# Total number of documents
N = len(docs)

# Use CountVectorizer to get document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Compute document frequency for each term
df_counts = (X_array > 0).sum(axis=0)

# Compute IDF using log base 10
idf_values = np.log10(N / df_counts)

# Build a DataFrame for display
idf_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts,
    "IDF (log10(N / df_t))": idf_values
}).sort_values("IDF (log10(N / df_t))", ascending=False)

print("📊 Inverse Document Frequency (IDF) Table:")
display(idf_table)


📊 Inverse Document Frequency (IDF) Table:


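|    | Term     | Document Frequency (df_t) | IDF (log10(N / df_t)) |
|----|----------|---------------------------|-----------------------|
| 0  | advanced | 1                         | 0.69897               |
| 1  | ai       | 1                         | 0.69897               |
| 2  | and      | 1                         | 0.69897               |
| 3  | are      | 1                         | 0.69897               |
| 4  | deep     | 1                         | 0.69897               |
| 5  | entropy  | 1                         | 0.69897               |
| 6  | evolving | 1                         | 0.69897               |
| 11 | of       | 1                         | 0.69897               |
| 14 | science  | 1                         | 0.69897               |
| 10 | measures | 1                         | 0.69897               |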


## Talking Point:

Pick one high-IDF term (rare across documents) and one low-IDF term (common across documents) from the table.

- The high-IDF term is rare, so it carries more discriminative power — it helps differentiate documents better.

- The low-IDF term appears in many documents, making it less useful for distinguishing content since it’s common and generic.

This balance helps TF-IDF focus on terms that matter most for relevance and similarity.

🗣️ **Instructor Talking Point**: IDF adjusts for the fact that some words are common across all documents — this is critical in improving document relevance in search systems.
<br/>
<br/>
🧠 **Student Talking Point**: Choose a low-IDF and high-IDF term from this output and explain why they behave differently.
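
One way to find candidate terms for this talking point is sketched below. It assumes plain lowercase whitespace tokenization instead of `CountVectorizer`, which gives the same terms for this small corpus; the variable names are illustrative.

```python
import math
from collections import Counter

docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning",
]
N = len(docs)

# Document frequency: count each term at most once per document
df = Counter(term for doc in docs for term in set(doc.lower().split()))
idf = {term: math.log10(N / count) for term, count in df.items()}

# The most common term has the lowest IDF; a term seen in one document has the highest
lowest = min(idf, key=idf.get)
print(lowest, df[lowest], round(idf[lowest], 3))
print("entropy", df["entropy"], round(idf["entropy"], 3))
```

In this corpus "learning" occurs in 4 of the 5 documents, so its IDF is close to 0, while "entropy" occurs in only one and gets the maximum value of about 0.699.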

### 🔹 TF-IDF Weighting

**TF-IDF (Term Frequency–Inverse Document Frequency)** scores each term $t$ in document $d$ based on how frequent and how rare it is:

$$
w_{t,d} = \left(1 + \log_{10}(f_{t,d})\right) \times \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $f_{t,d}$ is the raw count of term $t$ in document $d$
- $df_t$ is the number of documents that contain term $t$
- $N$ is the total number of documents in the corpus

---

#### ✅ Why Use It?

- TF-IDF balances **term importance within a document** (TF) against **term commonality across all documents** (IDF).
- It **boosts rare, relevant words** while **suppressing frequent, generic words**.
- TF-IDF is foundational in:
  - Information Retrieval (search engines)
  - Document similarity
  - Feature engineering for classification or clustering

---

#### 📘 Example

Suppose we have:

- $f_{\text{machine}, \text{Doc1}} = 3$
- $df_{\text{machine}} = 2$
- $N = 5$ total documents

Then:

- TF part: $1 + \log_{10}(3) \approx 1 + 0.477 = 1.477$
- IDF part: $\log_{10}(5 / 2) \approx 0.398$
- TF-IDF weight:

$$
w_{\text{machine}, \text{Doc1}} = 1.477 \times 0.398 \approx 0.588
$$

This means "machine" is **important within Doc1**, but since it's found in other documents too, the overall weight is **moderated**.

TF-IDF creates a **sparse, weighted vector representation** of documents, ready for:
- Cosine similarity
- Clustering
- Search ranking
- Input into classical machine learning models
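
The arithmetic in the worked example can be checked with a few lines of Python. This is a minimal sketch: `tfidf_weight` is a hypothetical helper name, and returning 0 for a term that does not occur in the document is an assumption that follows the usual convention for log-scaled term frequency.

```python
import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """Log-scaled term frequency times IDF; an absent term gets weight 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# Values from the example: f = 3, df = 2, N = 5
w = tfidf_weight(tf=3, df=2, n_docs=5)
print(round(w, 3))  # 0.588
```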


In [11]:
# 📘 Example: TF-IDF Weighting (Manual Computation)

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus of 5 documents
docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning"
]

# Total number of documents
N = len(docs)

# Vectorize (raw term frequencies)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Compute Document Frequencies
df = (X_array > 0).sum(axis=0)
idf = np.log10(N / df)

# Manual TF-IDF: apply (1 + log10(tf)) * idf only where tf > 0; absent terms get weight 0
# np.maximum(X_array, 1) keeps log10 away from 0, which would otherwise raise a RuntimeWarning
tf_log = np.where(X_array > 0, 1 + np.log10(np.maximum(X_array, 1)), 0.0)
tfidf = tf_log * idf

# Create a DataFrame for visual inspection
tfidf_df = pd.DataFrame(tfidf, columns=terms, index=[f"Doc{i+1}" for i in range(N)])

print("📊 TF-IDF Weighted Matrix (Manual Computation):")
display(tfidf_df.round(3))


📊 TF-IDF Weighted Matrix (Manual Computation):


|      | advanced | ai    | and   | are   | deep  | entropy | evolving | is    | learning | machine | measures | of    | powerful | randomness | science | the   |
|------|----------|-------|-------|-------|-------|---------|----------|-------|----------|---------|----------|-------|----------|------------|---------|-------|
| Doc1 | 0.000    | 0.000 | 0.000 | 0.000 | 0.000 | 0.000   | 0.000    | 0.398 | 0.097    | 0.222   | 0.000    | 0.000 | 0.699    | 0.000      | 0.000   | 0.000 |
| Doc2 | 0.699    | 0.000 | 0.000 | 0.000 | 0.699 | 0.000   | 0.000    | 0.398 | 0.097    | 0.000   | 0.000    | 0.000 | 0.000    | 0.000      | 0.000   | 0.000 |
| Doc3 | 0.000    | 0.000 | 0.000 | 0.000 | 0.000 | 0.699   | 0.000    | 0.000 | 0.000    | 0.000   | 0.699    | 0.000 | 0.000    | 0.699      | 0.000   | 0.000 |
| Doc4 | 0.000    | 0.699 | 0.699 | 0.699 | 0.000 | 0.000   | 0.699    | 0.000 | 0.097    | 0.222   | 0.000    | 0.000 | 0.000    | 0.000      | 0.000   | 0.000 |
| Doc5 | 0.000    | 0.000 | 0.000 | 0.000 | 0.000 | 0.000   | 0.000    | 0.000 | 0.097    | 0.222   | 0.000    | 0.699 | 0.000    | 0.000      | 0.699   | 0.699 |


## Talking Point:
In this example, we calculate TF-IDF scores manually to figure out which words are most important in each document.

We start with 5 simple text documents and count how often each word appears using a tool called CountVectorizer.

Then, we calculate something called IDF, which lowers the score for common words and boosts rare, informative ones. So, if a word appears in almost every document, it’s probably not that helpful for identifying what the document is about.

Finally, we combine both term frequency and inverse document frequency into a score — that’s our TF-IDF weight. This helps us highlight meaningful words in each doc.

We then put everything into a nice table so it’s easier to see which words stand out and in which document.

This method is super useful in real-world applications like search engines, spam filters, and even chatbots!

🗣️ **Instructor Talking Point**: We combined TF and IDF manually — useful for seeing how each part of the formula shapes the final result.
<br/>
<br/>
🗣️ **Instructor Talking Point**: Document Frequency (DF) counts how many documents contain a specific term, showing how common it is across the corpus.
Inverse Document Frequency (IDF) does the opposite: it measures how rare or informative a term is by taking the logarithm of $N / df_t$.
So, as a term appears in more documents, its DF rises and its IDF falls, which gives higher weight to rare terms.
Together, they balance relevance: DF tells us "how many documents use this term," while IDF tells us "how useful is this term for distinguishing documents."
IDF is critical for reducing noise from overly common words.
<br/>
<br/>
🧠 **Student Talking Point**: Pick one row (a document) and explain which term seems most important and why, based on the TF-IDF weights.
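
To support this talking point, the terms of a single document can be ranked by weight. The sketch below is one possible approach, not the workshop's reference solution: `rank_terms` is a hypothetical helper, tokenization is a plain lowercase split, and the weighting follows the $(1 + \log_{10} f_{t,d}) \times \log_{10}(N / df_t)$ formula above.

```python
import math
from collections import Counter

docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning",
]
N = len(docs)
tokenized = [doc.lower().split() for doc in docs]

# Document frequency: count each term at most once per document
df = Counter(term for tokens in tokenized for term in set(tokens))

def rank_terms(tokens):
    """Return (term, weight) pairs for one document, highest TF-IDF weight first."""
    tf = Counter(tokens)
    weights = {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)

# Rank the terms of Doc1
for term, weight in rank_terms(tokenized[0]):
    print(f"{term:10s} {weight:.3f}")
```

For Doc1 the ranking is "powerful", "is", "machine", "learning": the term that no other document contains comes first, and the term shared by four documents comes last.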

## Step 2: Document Collection

In [12]:

# Example: Load documents from a local folder
import os
corpus = []
# Sort the listing so document order stays stable across runs (os.listdir order is arbitrary)
for filename in sorted(os.listdir('data')):
    if filename.endswith('.txt'):
        with open(os.path.join('data', filename), 'r', encoding='utf-8') as file:
            corpus.append(file.read())


## Step 3: Implement a Tokenizer

In [13]:

from typing import List
def tokenize(text: str) -> List[str]:
    return text.lower().split()

# Example
tokenize("Machine Learning is Fun!")


['machine', 'learning', 'is', 'fun!']
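
Note that the whitespace tokenizer keeps punctuation attached to the token (`'fun!'`), so `fun!` and `fun` would become two different index terms. A possible alternative is sketched below; `tokenize_words` is a hypothetical name, and the choice to keep only lowercase letters, digits, and apostrophes is an assumption that may not suit every dataset.

```python
import re
from typing import List

def tokenize_words(text: str) -> List[str]:
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize_words("Machine Learning is Fun!"))  # ['machine', 'learning', 'is', 'fun']
```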

## Step 4: Text Normalization Pipeline

In [14]:

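import re
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
from nltk.stem import PorterStemmer

# Build the stop-word set and the stemmer once instead of on every token or call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def normalize(text):
    # Lowercase, then drop everything except letters and whitespace
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = text.split()
    # Remove stop words, then reduce each remaining token to its stem
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]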


## Step 5: Build and Test the Pipeline


Using the six concepts and the preprocessing pipeline above, implement a full pipeline that:
- Preprocesses text
- Applies vectorization
- Computes all six concept metrics
- Tests with one phrase query per concept
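
As a starting point for the phrase-query tests, the sketch below builds a small positional inverted index and checks whether the terms of a phrase occur at consecutive positions. It is an illustration under stated assumptions, not the required solution: the function names are hypothetical, it uses the five sample sentences from the earlier cells rather than your own `corpus`, and it tokenizes with a lowercase whitespace split instead of `normalize`.

```python
from collections import defaultdict
from typing import Dict, List

def build_positional_index(docs: List[str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each term to {doc_id: [positions]} using lowercase whitespace tokens."""
    index: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index: Dict[str, Dict[int, List[int]]], phrase: str) -> List[int]:
    """Return ids of documents that contain the phrase terms at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # A start position matches if term i occurs at position start + i for every i
        if any(all(start + i in index[t][doc_id] for i, t in enumerate(terms))
               for start in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning",
]
index = build_positional_index(docs)
print(phrase_query(index, "machine learning"))  # [0, 3, 4]
print(phrase_query(index, "learning machine"))  # []
```

Storing positions is what separates a phrase query from a plain AND query: both phrases above share the same candidate documents, but only the first one matches the word order.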


## Step 6: The Workshop


One team member must push the final notebook to GitHub and send the `.git` URL to the instructor before the end of class.




## 🧠 Learning Objectives
- Implement the foundations of **Vector Space Proximity** algorithms on real-world data as part of an NLP pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Implement the NLP pipeline and the six basic IR techniques, with Markdown documentation (work as teams)
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email/team)**. The email subject is: PROG8245 - IR Basics & Vector Space Proximity Foundations Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IRBasics_VectorSpaceProximity.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, Inverted Index and the six concepts.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** (1-2 per concept)
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IRBasics-VectorSpaceProximity-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 🔚 Conclusion


This workshop prepares you for our next session on **Vector Space Proximity** and **Cosine Similarity**.
