# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems that need fast and relevant access to text data — such as AI agents.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Index** over a real-world text corpus as part of an NLP pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Form teams of 3. Read and understand the workshop and the submission instructions. Ask for help if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Code the IR pipeline and the inverted index by hand and document it in Markdown (work as teams).
3. **Push to GitHub** *(15 min)* – Teams commit and push their initial notebooks. **Include your names so it is easy to identify the team that developed the Inverted Index code**.
4. **Instructor Review** – The instructor will go around, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email per team)**. The email subject is: PROG8245 - Inverted Matrix Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.
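
As a quick sanity check on the vocabulary requirement, unique words can be counted with a plain Python set before reaching for a library. The sketch below is illustrative only: `count_unique_words` and `sample_docs` are hypothetical names, and the regex is just one simple definition of a "word", so the count may differ slightly from what `CountVectorizer` reports.

```python
import re

def count_unique_words(docs):
    """Return the number of distinct lowercase alphanumeric words across docs."""
    vocabulary = set()
    for doc in docs:
        # Each maximal run of letters/digits counts as one word
        vocabulary.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(vocabulary)

# Hypothetical stand-in corpus; replace with your own loaded documents.
sample_docs = [
    "The cat sat on the mat.",
    "The dog sat on the log!",
]

print(count_unique_words(sample_docs))  # the, cat, sat, on, mat, dog, log -> 7
```

If the count falls short of the target, adding longer or more topically diverse documents usually helps more than adding near-duplicates.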


In [43]:
import os
import re
from sklearn.feature_extraction.text import CountVectorizer

# 📁 Path to the folder where the extracted .txt files are stored
folder_path = './Data/'  # Replace with the path where you extract the ZIP

# 🔄 Step 1: Load all .txt documents from folder
def load_documents(folder_path):
    documents = []
    # Sort filenames so the document order (and later doc IDs) is reproducible
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# 🧹 Step 2: Clean text
def clean_text(text):
    text = re.sub(r'\W+', ' ', text)  # Replace punctuation/special characters with spaces
    return text.lower()

# 📏 Step 3: Get vocabulary size
def get_vocabulary_size(docs):
    # Note: CountVectorizer's default token pattern ignores single-character tokens
    vectorizer = CountVectorizer()
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# 🚀 Run the pipeline
documents = load_documents(folder_path)
cleaned_docs = [clean_text(doc) for doc in documents]
vocab_size = get_vocabulary_size(cleaned_docs)

# 📊 Output
print(f"✅ Documents loaded: {len(documents)}")
print(f"✅ Unique vocabulary size: {vocab_size}")

# 💬 Optional check (Step 1 asks for more than 2000 unique words)
if len(documents) >= 20 and vocab_size > 2000:
    print("🎯 Requirement satisfied: You can move to the next step!")
else:
    print("⚠️ Requirement NOT met — consider using longer/more diverse documents.")


✅ Documents loaded: 20
✅ Unique vocabulary size: 6866
🎯 Requirement satisfied: You can move to the next step!


## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.
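
How punctuation is removed changes the tokens you get, which makes it a natural candidate for a talking point. The sketch below compares two simple strategies on a made-up sentence; both function names and the sample text are hypothetical, and neither strategy is presented as the required one.

```python
import re

def tokenize_delete_punct(text):
    """Lowercase, delete every non-alphanumeric character, then split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

def tokenize_split_on_punct(text):
    """Lowercase, then keep each maximal run of letters/digits as a token."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "State-of-the-art search isn't magic."

print(tokenize_delete_punct(sample))
# ['stateoftheart', 'search', 'isnt', 'magic']
print(tokenize_split_on_punct(sample))
# ['state', 'of', 'the', 'art', 'search', 'isn', 't', 'magic']
```

Deleting punctuation glues hyphenated words together, while splitting on it breaks contractions apart. Whichever you choose, apply the same tokenizer to documents and queries so their tokens match.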


In [44]:
import os
import re

def basic_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = text.split()
    return tokens

# Load documents from the './Data/' folder
def load_documents(folder_path='./Data/'):
    documents = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Example usage: load all docs and tokenize each
docs = load_documents('./Data/')

for i, doc in enumerate(docs, 1):
    tokens = basic_tokenizer(doc)
    print(f"Document {i} tokens: {tokens[:20]}")  # print first 20 tokens of each doc


Document 1 tokens: ['usually', 'many', 'skin', 'finish', 'attorney', 'early', 'save', 'boy', 'in', 'store', 'thousand', 'pick', 'clear', 'today', 'face', 'far', 'system', 'star', 'stop', 'summer']
Document 2 tokens: ['billion', 'trip', 'stand', 'stage', 'world', 'question', 'people', 'kid', 'price', 'determine', 'eight', 'join', 'whatever', 'friend', 'already', 'yet', 'fall', 'recent', 'it', 'account']
Document 3 tokens: ['director', 'century', 'weight', 'statement', 'give', 'various', 'hot', 'similar', 'same', 'act', 'out', 'these', 'land', 'glass', 'three', 'world', 'either', 'mind', 'far', 'nice']
Document 4 tokens: ['anyone', 'letter', 'particular', 'like', 'wind', 'whole', 'laugh', 'trip', 'room', 'keep', 'claim', 'ball', 'require', 'worker', 'standard', 'foreign', 'democratic', 'collection', 'skill', 'close']
Document 5 tokens: ['best', 'there', 'prevent', 'option', 'among', 'candidate', 'raise', 'shake', 'without', 'customer', 'dog', 'religious', 'congress', 'per', 'dream', 'stu

## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
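> Now we normalize tokens: convert to lowercase, remove stop words, and apply stemming or affix stripping. This collapses word variants into a single index term, which shrinks the vocabulary and helps a query match documents that use a different form of the same word.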

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.
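
To see what the two normalization steps do without installing anything, here is a dependency-free sketch. It is a stand-in, not the NLTK pipeline: `TINY_STOP_WORDS` is a hypothetical, deliberately short list, and `naive_stem` is a crude suffix stripper rather than the Porter algorithm, so its output will differ from `PorterStemmer` on many words.

```python
# Hypothetical, deliberately tiny stop word list; NLTK's English list is much longer.
TINY_STOP_WORDS = {"the", "is", "a", "of", "and", "in"}

# Endings removed by the naive stemmer (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    """Drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokens if t not in TINY_STOP_WORDS]

tokens = ["the", "indexing", "of", "documents", "is", "finished"]
print(normalize(tokens))  # ['index', 'document', 'finish']
```

The order matters: stop words are filtered before stemming, so the stop list must contain the unstemmed forms.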


In [45]:
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download stopwords once
nltk.download('stopwords')

# Build the stop word set and the stemmer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def normalize_tokens(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]
    stemmed_tokens = [STEMMER.stem(word) for word in filtered_tokens]
    return stemmed_tokens

def load_documents(folder_path='./Data/'):
    documents = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Load and normalize all documents
docs = load_documents('./Data/')

for i, doc in enumerate(docs, 1):
    normalized_tokens = normalize_tokens(doc)
    print(f"Document {i} normalized tokens: {normalized_tokens[:20]}")  # first 20 tokens


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Document 1 normalized tokens: ['usual', 'mani', 'skin', 'finish', 'attorney', 'earli', 'save', 'boy', 'store', 'thousand', 'pick', 'clear', 'today', 'face', 'far', 'system', 'star', 'stop', 'summer', 'film']
Document 2 normalized tokens: ['billion', 'trip', 'stand', 'stage', 'world', 'question', 'peopl', 'kid', 'price', 'determin', 'eight', 'join', 'whatev', 'friend', 'alreadi', 'yet', 'fall', 'recent', 'account', 'mother']
Document 3 normalized tokens: ['director', 'centuri', 'weight', 'statement', 'give', 'variou', 'hot', 'similar', 'act', 'land', 'glass', 'three', 'world', 'either', 'mind', 'far', 'nice', 'manag', 'continu', 'surfac']
Document 4 normalized tokens: ['anyon', 'letter', 'particular', 'like', 'wind', 'whole', 'laugh', 'trip', 'room', 'keep', 'claim', 'ball', 'requir', 'worker', 'standard', 'foreign', 'democrat', 'collect', 'skill', 'close']
Document 5 normalized tokens: ['best', 'prevent', 'option', 'among', 'candid', 'rais', 'shake', 'without', 'custom', 'dog', 'religi

## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.
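
The dictionary structure described above can be sketched on a toy corpus. This is a minimal illustration under stated assumptions: the documents are already tokenized, and `build_positional_index`, `boolean_and`, and `toy_docs` are hypothetical names that are not part of the workshop code.

```python
from collections import defaultdict

def build_positional_index(tokenized_docs):
    """Map token -> {doc_id: [positions]} for already-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(tokenized_docs):
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Return the sorted doc IDs that contain every term."""
    # .get() avoids creating empty entries for unknown terms
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

toy_docs = [
    ["fast", "search", "engine"],
    ["search", "index", "search"],
    ["fast", "index"],
]
toy_index = build_positional_index(toy_docs)

print(toy_index["search"])                        # {0: [1], 1: [0, 2]}
print(boolean_and(toy_index, ["fast", "index"]))  # [2]
```

A Boolean AND only needs the document IDs (the dictionary keys). The position lists are what make phrase queries possible later.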


In [46]:
import os
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

nltk.download('stopwords')

# Build the stop word set and the stemmer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def normalize_tokens(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = text.split()
    filtered_stemmed = [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]
    return filtered_stemmed

def build_inverted_index(docs):
    inverted_index = {}
    for doc_id, text in enumerate(docs):
        tokens = normalize_tokens(text)
        for pos, token in enumerate(tokens):
            # Store every position of the token per document (needed for phrase queries)
            if token not in inverted_index:
                inverted_index[token] = {}
            if doc_id not in inverted_index[token]:
                inverted_index[token][doc_id] = []
            inverted_index[token][doc_id].append(pos)
    return inverted_index

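def phrase_in_doc(inverted_index, phrase_tokens, doc_id):
    # A phrase made only of stop words normalizes to nothing; treat it as no match
    if not phrase_tokens:
        return False
    positions_lists = []
    for token in phrase_tokens:
        if token not in inverted_index or doc_id not in inverted_index[token]:
            return False
        # Sets make the membership tests below O(1)
        positions_lists.append(set(inverted_index[token][doc_id]))

    # Check positions for sequential occurrence of phrase tokens.
    # Positions count tokens after stop word removal, so adjacency is
    # measured in the normalized token stream, not the raw text.
    for start_pos in positions_lists[0]:
        if all((start_pos + offset) in positions_lists[offset] for offset in range(1, len(positions_lists))):
            return True
    return False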

def phrase_search(inverted_index, phrase, docs):
    phrase_tokens = normalize_tokens(phrase)
    matched_docs = []
    for doc_id in range(len(docs)):
        if phrase_in_doc(inverted_index, phrase_tokens, doc_id):
            matched_docs.append(doc_id)
    return matched_docs

def load_documents(folder_path='./Data/'):
    documents = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as f:
                documents.append(f.read())
    return documents

# ---- Main ----
docs = load_documents('./Data/')

inverted_index = build_inverted_index(docs)

print("Sample tokens from inverted index:")
for token, postings in list(inverted_index.items())[:10]:
    print(f"{token}: {list(postings.keys())}")

# Test phrase queries
phrases = ["machine learning", "artificial intelligence"]

for phrase in phrases:
    matched = phrase_search(inverted_index, phrase, docs)
    print(f"\nDocuments containing the phrase '{phrase}': {matched}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Sample tokens from inverted index:
usual: [0, 2, 6, 8, 11, 12, 13, 16, 18]
mani: [0, 1, 2, 3, 9, 12, 13, 16, 18]
skin: [0, 1, 3, 4, 6, 7, 8, 10, 12, 15, 16]
finish: [0, 3, 5, 11, 13, 17, 19]
attorney: [0, 3, 4, 7, 8, 9, 10, 11, 13, 14, 15, 16]
earli: [0, 2, 5, 7, 17]
save: [0, 2, 3, 4, 5, 7, 8, 9, 12, 13, 18]
boy: [0, 1, 2, 3, 4, 8, 9, 10, 14, 15, 19]
store: [0, 5, 6, 7, 8, 11, 14, 16, 17, 19]
thousand: [0, 2, 3, 5, 8, 10, 12, 13, 19]

Documents containing the phrase 'machine learning': [4, 11]

Documents containing the phrase 'artificial intelligence': [3, 11, 17]


## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.
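
One way to demonstrate correctness is to compare the index-based results with a brute-force scan of the token lists. The sketch below shows only the brute-force side, on a toy corpus; `contains_phrase`, `brute_force_phrase_search`, and `toy_docs` are hypothetical names, and the documents are assumed to be tokenized already.

```python
def contains_phrase(tokens, phrase_tokens):
    """Brute-force check: does phrase_tokens occur as a contiguous run in tokens?"""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))

def brute_force_phrase_search(tokenized_docs, phrase_tokens):
    """Return the IDs of all documents that contain the phrase."""
    return [doc_id for doc_id, tokens in enumerate(tokenized_docs)
            if contains_phrase(tokens, phrase_tokens)]

toy_docs = [
    ["machine", "learning", "rocks"],
    ["learning", "about", "machine", "tools"],
]

expected = brute_force_phrase_search(toy_docs, ["machine", "learning"])
print(expected)  # [0]
```

Document 1 contains both words but not as an adjacent, ordered pair, so a docID-only index would wrongly report it as a match. If your positional phrase search returns the same list as the brute-force scan, that is reasonable evidence the index logic is right.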


In [47]:
import re

def basic_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()

def build_inverted_index(docs):
    inverted_index = {}
    for doc_id, text in enumerate(docs):
        tokens = basic_tokenizer(text)
        for pos, token in enumerate(tokens):
            if token not in inverted_index:
                inverted_index[token] = {}
            if doc_id not in inverted_index[token]:
                inverted_index[token][doc_id] = []
            inverted_index[token][doc_id].append(pos)
    return inverted_index

def phrase_in_doc(inverted_index, phrase_tokens, doc_id):
    # An empty phrase has nothing to match; avoid indexing into an empty list
    if not phrase_tokens:
        return False
    positions_lists = []
    for token in phrase_tokens:
        if token not in inverted_index or doc_id not in inverted_index[token]:
            return False
        # Sets make the membership tests below O(1)
        positions_lists.append(set(inverted_index[token][doc_id]))
    # The phrase matches if some start position is followed by each later token in order
    for start_pos in positions_lists[0]:
        if all((start_pos + offset) in positions_lists[offset] for offset in range(1, len(positions_lists))):
            return True
    return False

def phrase_search(inverted_index, phrase, docs):
    phrase_tokens = basic_tokenizer(phrase)
    matched_docs = []
    for doc_id in range(len(docs)):
        if phrase_in_doc(inverted_index, phrase_tokens, doc_id):
            matched_docs.append(doc_id)
    return matched_docs

# Sample documents
docs = [
    "Machine learning is fascinating.",
    "Deep learning is a subset of machine learning.",
    "Artificial intelligence includes machine learning.",
    "Learning about machine algorithms."
]

# Build index
index = build_inverted_index(docs)

# Phrase queries
phrases = ["machine learning", "deep learning"]

for phrase in phrases:
    matched = phrase_search(index, phrase, docs)
    print(f"Documents containing phrase '{phrase}': {matched}")


Documents containing phrase 'machine learning': [0, 1, 2]
Documents containing phrase 'deep learning': [1]
