# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems that need fast and relevant access to text data — such as AI agents.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Matrix** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Manual IR and Inverted Matrix coding + Markdown documentation (work as teams)
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email/team)**. The subject line of the email is: PROG8245 - Inverted Matrix Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.


In [28]:
# Example: Load text files from a folder
import os

input_dir = 'sample_docs/'

def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Replace 'sample_docs/' with your actual folder
documents = load_documents(input_dir)
print(f"Loaded {len(documents)} documents.")


Loaded 20 documents.
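
To confirm that the corpus meets the 2000-unique-word requirement, a quick vocabulary count can help. The sketch below is a minimal illustration, not part of the original notebook: `vocabulary_size` and `sample_docs` are hypothetical names, and the two-sentence corpus is only a stand-in for the real `documents` list.

```python
import re

def vocabulary_size(docs):
    """Return the number of distinct lowercase word tokens across all docs."""
    vocab = set()
    for text in docs:
        vocab.update(re.findall(r'\b\w+\b', text.lower()))
    return len(vocab)

# Hypothetical stand-in corpus; in the notebook, pass the `documents` list instead.
sample_docs = ["The cat sat on the mat.", "The dog sat on the log."]
size = vocabulary_size(sample_docs)
print(f"Vocabulary size: {size}")
```

With the real corpus, a check such as `vocabulary_size(documents) > 2000` would indicate whether more documents are needed.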


## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.


In [29]:
import os
import re
 
# --- Tokenizer Function ---
def tokenize(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens
 
# --- Parameters ---
num_files = 20  # Number of files to process
 
# --- Process Files ---
print(f"\n--- TOKENIZING FIRST {num_files} FILES FROM '{input_dir}' ---\n")
 
# List and sort .txt files in the directory
all_txt_files = sorted([f for f in os.listdir(input_dir) if f.endswith(".txt")])
all_tokens = set()
# Limit to the first `num_files`
for i, filename in enumerate(all_txt_files[:num_files], start=1):
    filepath = os.path.join(input_dir, filename)
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
 
    # Tokenize
    tokens = tokenize(text)
    # Add unique tokens to the set
    all_tokens.update(tokens)

print(all_tokens)  # Print the full set of unique tokens (sets are unordered)


--- TOKENIZING FIRST 20 FILES FROM 'sample_docs/' ---

{'likely', 'aid', 'more', 'groups', 'bear', 'matt', 'speed', 'memory', 'case', 'break', 'price', 'consecutive', 'clever', 'just', 'untrustworthy', 'accused', '00', 'argumentation', 'firearms', 'but', 'far', 'must', 'russian', 'ordo', 'through', 'gladly', 'bus', 'defense', 'worked', 'raised', 'x', 'surface', 'explanations', 'aquire', 'turanist', 'simplistic', 'laws', 'components', '10_', 'heinrich', 'session', 'schooled', 'protects', 'weapons', 'go', 'jasmine', '1905', 'express', 'few', 'hearded', 'attempts', 'teletype', 'major', 'reasonably', 'aura', 'none', 'publish', 'growth', 'uerdugo', 'praise', 'drive', 'deleted', 'understanding', 'practical', 'such', '120kvolt', '8', 'confused', 'stretching', 'background', 'conditions', 'test', 'bmf', 'augsburg', 'kurds', 'ability', 'days', 'task', 'kind', 'work', 'liberty', 'divinity', 'introduction', 'weiser', 'country', 'tv', 'principle', 'permit', 'games', 'fox', 'adriatic', 'que', 'sett
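
The output above includes tokens such as `'10_'`, `'120kvolt'`, and `'8'`, because `\w` matches digits and underscores as well as letters. The following sketch compares the workshop pattern with a stricter, letters-only variant. `tokenize_alpha` and the sample sentence are assumptions made for illustration, and whether numeric tokens should be kept depends on your corpus.

```python
import re

def tokenize(text):
    # Same pattern as the workshop tokenizer: runs of letters, digits, and underscores.
    return re.findall(r'\b\w+\b', text.lower())

def tokenize_alpha(text):
    # Hypothetical stricter variant: keeps alphabetic runs only.
    return re.findall(r'[a-z]+', text.lower())

sample = "Don't panic: the 120kVolt line, item 10_, costs $8."
print(tokenize(sample))
print(tokenize_alpha(sample))
```

Note that both variants split `Don't` into `don` and `t`, and the stricter variant turns `120kvolt` into `kvolt`, so neither is a complete solution. This could be a useful starting point for a talking point.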

## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
> Now we normalize tokens: convert to lowercase, remove stop words, apply stemming or affix stripping. This reduces redundancy and enhances search accuracy.

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.


In [30]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Example: normalize the unique tokens collected in Step 2 (all files, not one document)
norm_tokens = normalize_tokens(all_tokens)
print(norm_tokens[:20])


['like', 'aid', 'group', 'bear', 'matt', 'speed', 'memori', 'case', 'break', 'price', 'consecut', 'clever', 'untrustworthi', 'accus', '00', 'argument', 'firearm', 'far', 'must', 'russian']
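
To see what the pipeline does without relying on `nltk`, here is a simplified sketch of stop-word removal followed by affix stripping. It is not the Porter algorithm: the tiny `STOP_WORDS` set, the `SUFFIXES` tuple, and the `min_stem` threshold are all assumptions chosen for illustration.

```python
# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

# Suffixes are checked in this order; an assumption for illustration, not the Porter rules.
SUFFIXES = ("ing", "ed", "es", "s")

def strip_suffix(token, min_stem=3):
    """Remove the first matching suffix if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

def normalize(tokens):
    """Lowercase, drop stop words, then strip suffixes."""
    lowered = (t.lower() for t in tokens)
    return [strip_suffix(t) for t in lowered if t not in STOP_WORDS]

print(normalize(["The", "groups", "worked", "stretching", "laws", "is"]))
```

A real stemmer applies many more rules, which is why `PorterStemmer` produces stems such as `memori` and `consecut` that this sketch would not.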




## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.


In [31]:
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        tokens = normalize_tokens(tokenize(text))
        seen = set()
        for token in tokens:
            if token not in seen:
                index[token].append(doc_id + 1)
                seen.add(token)
    return index

inverted_index = build_inverted_index(documents)
print(dict(list(inverted_index.items())[:10]))  # Preview first 10 terms


{'sure': [1, 5, 17], 'basher': [1], 'pen': [1], 'fan': [1], 'pretti': [1], 'confus': [1, 14], 'lack': [1], 'kind': [1, 11], 'post': [1, 3, 5, 12], 'recent': [1]}
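
Because documents are indexed in order, each posting list is already sorted by docID, which makes Boolean queries cheap. The sketch below shows one possible way to answer AND and OR queries. The helper names `intersect` and `union` are hypothetical, and `toy_index` reuses three entries from the preview output above rather than the full index.

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping doc IDs present in both (AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def union(p1, p2):
    """Return the sorted doc IDs present in either posting list (OR)."""
    return sorted(set(p1) | set(p2))

# Toy index with three entries copied from the preview output.
toy_index = {"sure": [1, 5, 17], "post": [1, 3, 5, 12], "confus": [1, 14]}
print(intersect(toy_index["sure"], toy_index["post"]))   # AND query
print(union(toy_index["sure"], toy_index["confus"]))     # OR query
```

The linear merge in `intersect` visits each posting at most once, so it scales with the combined length of the two lists rather than with the size of the corpus.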


## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.


In [32]:
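# Baseline for the phrase query implementation.
# Note: this is a Boolean AND over docIDs. It returns documents that contain every
# query term, but it does not check that the terms appear next to each other.
# For true phrase matching, build a position-aware index or search the normalized docs.
def phrase_query(inverted_index, phrase, stemmer):
    # Normalize the query the same way as the indexed text: tokenize (lowercases),
    # drop stop words (they were never indexed), then stem.
    tokens = [stemmer.stem(w) for w in tokenize(phrase) if w not in stop_words]
    if not tokens:
        return set()

    # Get doc sets for each token
    doc_sets = []
    for token in tokens:
        if token not in inverted_index:
            return set()
        doc_sets.append(set(inverted_index[token]))

    # Return intersection
    return set.intersection(*doc_sets)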


In [37]:
# Example:
query1 = "sure"
docs = phrase_query(inverted_index, query1, stemmer)
print("Phrase found in:", docs)

query2 = "devices"
docs = phrase_query(inverted_index, query2, stemmer)
print("Phrase found in:", docs)

Phrase found in: {1, 5, 17}
Phrase found in: {16, 6, 15}
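
As the talking point notes, exact phrase matching needs term positions, not just docIDs. The following is a minimal sketch of a positional index and a phrase lookup, under the assumption that each document has already been tokenized and normalized. `build_positional_index`, `positional_phrase_query`, and the three toy token lists are hypothetical and are not part of the original notebook.

```python
from collections import defaultdict

def build_positional_index(token_lists):
    """Map each token to {doc_id: [positions]}; doc IDs start at 1."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(token_lists, start=1):
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def positional_phrase_query(index, phrase_tokens):
    """Return doc IDs where phrase_tokens occur consecutively, in order."""
    if not phrase_tokens or any(t not in index for t in phrase_tokens):
        return set()
    # Candidate docs must contain every term of the phrase.
    candidates = set.intersection(*(set(index[t]) for t in phrase_tokens))
    matches = set()
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in phrase_tokens[1:]]
        # A match starts at `start` if term k sits at position start + k.
        for start in index[phrase_tokens[0]][doc_id]:
            if all(start + k in later[k - 1] for k in range(1, len(phrase_tokens))):
                matches.add(doc_id)
                break
    return matches

# Hypothetical pre-normalized token lists standing in for real documents.
docs_tokens = [
    ["machin", "learn", "model"],
    ["learn", "machin", "code"],
    ["deep", "machin", "learn"],
]
pos_index = build_positional_index(docs_tokens)
print(positional_phrase_query(pos_index, ["machin", "learn"]))
print(positional_phrase_query(pos_index, ["learn", "machin"]))
```

All three toy documents contain both terms, so a plain docID intersection would return every one of them for either query. The positional check separates documents 1 and 3 from document 2. Positions here are counted after stop-word removal, so a phrase that spans a stop word in the raw text would still match; whether that is desirable is a design choice for your team.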
