# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems that need fast and relevant access to text data — such as AI agents.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Matrix** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Manual IR and Inverted Matrix coding + Markdown documentation (work as teams)
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the Inverted Matrix code**.
4. **Instructor Review** - The instructor will go around, take notes, and provide coaching as needed, during the **Peer Review Round**
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email per team)**. The email subject is: PROG8245 - Inverted Matrix Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.


## Downloading the documents from an online source and saving them locally (sample_docs folder)

In [18]:
import requests
import os

# Create folder
folder_name = "sample_docs"
os.makedirs(folder_name, exist_ok=True)

# List of Project Gutenberg book IDs (you can add more)
book_ids = [
    1342, 84, 11, 2701, 1661, 76, 98, 1232, 2542, 174, 
    5200, 158, 1260, 43, 1080, 120, 16328, 1400, 996, 
    1952, 345, 2852
]

# Download each book
for book_id in book_ids:
    url = f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt"
    try:
        response = requests.get(url, timeout=30)  # timeout avoids hanging on a stalled connection
        if response.status_code == 200:
            with open(f"{folder_name}/book_{book_id}.txt", "w", encoding='utf-8') as f:
                f.write(response.text)
            print(f"Downloaded book ID {book_id}")
        else:
            print(f"Failed to download book ID {book_id}")
    except Exception as e:
        print(f"Error with book ID {book_id}: {e}")

Downloaded book ID 1342
Downloaded book ID 84
Downloaded book ID 11
Downloaded book ID 2701
Downloaded book ID 1661
Downloaded book ID 76
Downloaded book ID 98
Downloaded book ID 1232
Downloaded book ID 2542
Downloaded book ID 174
Downloaded book ID 5200
Downloaded book ID 158
Downloaded book ID 1260
Downloaded book ID 43
Downloaded book ID 1080
Downloaded book ID 120
Downloaded book ID 16328
Downloaded book ID 1400
Downloaded book ID 996
Downloaded book ID 1952
Downloaded book ID 345
Downloaded book ID 2852


To collect enough text data, we downloaded 22 public domain books from Project Gutenberg (https://www.gutenberg.org) using their book IDs. The script saves each book as a .txt file in the sample_docs folder.

From the output we can see that all 22 documents were downloaded successfully and none failed. The list of book IDs was generated by an LLM.
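Project Gutenberg files wrap the book text in header and license boilerplate, which would otherwise add non-book vocabulary to the index. Below is a minimal sketch of trimming it, assuming the files contain the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines. The function name `strip_gutenberg_boilerplate` and both regular expressions are illustrative and not part of the workshop code.

```python
import re

# Assumed marker format; some older files say "THIS" instead of "THE".
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, if both are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text  # markers not found: leave the text untouched

sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK 1080 ***\n"
    "A Modest Proposal\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK 1080 ***\n"
    "License footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```

If the markers are missing, the function returns the input unchanged, so it could be applied to every downloaded file before saving without losing text.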

## Loading the documents

In [31]:
def load_documents(folder_path):
    documents = []
    for filename in sorted(os.listdir(folder_path)):  # sorted so document IDs are stable across runs
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

folder = 'sample_docs' 
documents = load_documents(folder)
print(f"Loaded {len(documents)} documents.")

Loaded 22 documents.


After downloading all the documents and storing them locally in the `sample_docs` folder, we load them with the `load_documents()` function, passing the folder path as its argument. The output confirms that all 22 documents were loaded successfully.
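The task also asks for a vocabulary of more than 2000 unique words. One way to check this is sketched below, assuming a simple lowercase alphanumeric tokenization. The helper name `vocabulary_size` is hypothetical, and `toy_docs` stands in for the loaded `documents` list.

```python
import re

def vocabulary_size(docs):
    """Count distinct lowercase alphanumeric tokens across all documents."""
    vocabulary = set()
    for doc in docs:
        vocabulary.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(vocabulary)

toy_docs = ["The cat sat.", "The dog sat down."]
print(vocabulary_size(toy_docs))  # distinct words: the, cat, sat, dog, down
```

In the notebook, calling `vocabulary_size(documents)` and comparing the result with 2000 would confirm the requirement.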

## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.


In [20]:
import re

def basic_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  
    tokens = text.split()
    return tokens


all_tokens = []

for doc in documents:
    tokens = basic_tokenizer(doc)
    all_tokens.extend(tokens)

print(f"Total tokens across all documents: {len(all_tokens)}")
print(f"Sample tokens: {all_tokens[:20]}")


Total tokens across all documents: 2308765
Sample tokens: ['start', 'of', 'the', 'project', 'gutenberg', 'ebook', '1080', 'a', 'modest', 'proposal', 'for', 'preventing', 'the', 'children', 'of', 'poor', 'people', 'in', 'ireland', 'from']


We load each book from the sample_docs folder and process the text by converting it to lowercase and removing punctuation. This basic tokenization step breaks the text into clean, individual words (tokens). 
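Deleting punctuation outright fuses hyphenated words and contractions into single tokens. The sketch below compares that approach with a variant that treats punctuation as a separator. The names `delete_punctuation` and `split_on_punctuation` are illustrative, and which behaviour is preferable depends on the queries you expect.

```python
import re

def delete_punctuation(text):
    """Same approach as basic_tokenizer: drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

def split_on_punctuation(text):
    """Alternative: treat every non-alphanumeric character as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "A well-known fact: don't panic!"
print(delete_punctuation(sample))    # ['a', 'wellknown', 'fact', 'dont', 'panic']
print(split_on_punctuation(sample))  # ['a', 'well', 'known', 'fact', 'don', 't', 'panic']
```

With the first approach, a search for "known" would miss "well-known"; with the second, contractions break into fragments such as "don" and "t".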

## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
> Now we normalize tokens: convert to lowercase, remove stop words, apply stemming or affix stripping. This reduces redundancy and enhances search accuracy.

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.


In [21]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    filtered = [w for w in tokens if w not in stop_words]         
    stemmed = [stemmer.stem(w) for w in filtered]                 
    return stemmed


tokens = basic_tokenizer(documents[0])  
normalized_tokens = normalize_tokens(tokens)

print(f"Before normalization: {tokens[:10]}")
print(f"After normalization: {normalized_tokens[:10]}")


Before normalization: ['start', 'of', 'the', 'project', 'gutenberg', 'ebook', '1080', 'a', 'modest', 'proposal']
After normalization: ['start', 'project', 'gutenberg', 'ebook', '1080', 'modest', 'propos', 'prevent', 'children', 'poor']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


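We remove common stop words such as "the" and "and", then reduce each remaining word to its stem with the Porter stemmer. For example, "running" becomes "run", and "proposal" becomes "propos" in the output above; stems do not have to be dictionary words.
This makes different forms of a word map to the same index term, which reduces the vocabulary size.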
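To see the two normalization steps without any external data, here is a toy sketch using a hand-written stop word set and a naive suffix stripper. It is deliberately much simpler than the Porter stemmer used above, and the names `TOY_STOP_WORDS`, `SUFFIXES`, `toy_stem` and `toy_normalize` are all hypothetical.

```python
# Hypothetical, tiny stop word set; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "and", "a", "of", "in", "to", "is"}
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_normalize(tokens):
    """Drop stop words, then stem what remains."""
    return [toy_stem(t) for t in tokens if t not in TOY_STOP_WORDS]

print(toy_normalize(["the", "children", "walked", "and", "jumping", "dogs"]))
# ['children', 'walk', 'jump', 'dog']
print(toy_stem("running"))  # 'runn': the toy rule does not undo the doubled consonant
```

The "runn" result shows why a rule-based stemmer such as Porter's needs extra rules beyond plain suffix removal.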


## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.


In [None]:
from collections import defaultdict

def build_inverted_index(docs):
    inverted_index = defaultdict(lambda: defaultdict(list))
    

    for doc_id, doc in enumerate(docs):
        tokens = normalize_tokens(basic_tokenizer(doc))
        for pos, token in enumerate(tokens):
            inverted_index[token][doc_id].append(pos)
    print("✅ Inverted index built successfully!")
    return inverted_index

inverted_index = build_inverted_index(documents)

✅ Inverted index built successfully!


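We create an inverted index that maps each normalized token to the documents in which it appears and, for each document, the list of positions where it occurs. Positions are counted after stop word removal, so removed stop words do not occupy a position. This lets us look up words quickly and check word adjacency for phrase queries later.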
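The nested structure is easier to picture on a tiny corpus. The sketch below uses the same `token -> {doc_id: [positions]}` layout, but with plain whitespace tokenization and no normalization; `build_toy_index` and `toy_docs` are illustrative names.

```python
from collections import defaultdict

def build_toy_index(docs):
    """Map each token to {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(docs):
        for pos, token in enumerate(doc.lower().split()):
            index[token][doc_id].append(pos)
    return index

toy_docs = ["new home sales", "home sales rise in new home"]
index = build_toy_index(toy_docs)
print(dict(index["home"]))  # {0: [1], 1: [0, 5]}
print(dict(index["rise"]))  # {1: [2]}
```

The outer keys answer "which documents contain this term?", while the position lists are what make phrase matching possible.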

## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.


In [None]:
def phrase_query(inverted_index, phrase, documents):
    phrase_tokens = normalize_tokens(basic_tokenizer(phrase))
    if not phrase_tokens:
        return set()

  
    candidate_docs = set(inverted_index.get(phrase_tokens[0], {}).keys())

    for token in phrase_tokens[1:]:
        candidate_docs &= set(inverted_index.get(token, {}).keys())

    matching_docs = set()

    for doc_id in candidate_docs:
        positions_lists = [inverted_index[token][doc_id] for token in phrase_tokens]

       
        for pos in positions_lists[0]:
            if all((pos + i) in positions_lists[i] for i in range(1, len(phrase_tokens))):
                matching_docs.add(doc_id)
                break

    return matching_docs

This function finds documents that contain an exact phrase (like the ones used in the code cell below) using the inverted index. It:

1. Tokenizes and normalizes the phrase

2. Finds docs with all words

3. Checks if the words appear next to each other in the right order

4. Returns a set of matching document IDs. 
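The adjacency check in step 3 can be isolated as a small helper. The sketch below converts each position list to a set, so each membership test is a hash lookup rather than a list scan; the name `phrase_in_doc` and the sample positions are hypothetical.

```python
def phrase_in_doc(positions_per_term):
    """positions_per_term[i] holds the positions of the i-th phrase term in one document."""
    if not positions_per_term:
        return False
    position_sets = [set(p) for p in positions_per_term]
    # The phrase matches if some start position is followed by every later term in order.
    return any(
        all(start + offset in position_sets[offset]
            for offset in range(1, len(position_sets)))
        for start in position_sets[0]
    )

# Hypothetical positions for the two-term phrase "home sales" in two documents.
print(phrase_in_doc([[1], [2]]))     # True: "home" at 1, "sales" at 2
print(phrase_in_doc([[0, 5], [3]]))  # False: "sales" never directly follows "home"
```

For long documents with frequent terms, the set conversion may noticeably reduce the cost of the `in` checks compared with scanning lists.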

In [30]:

query1 = "great expectations"
query2 = "little women"
query3 = "my dear"
query4 = "a few minutes"

results1 = phrase_query(inverted_index, query1, documents)
results2 = phrase_query(inverted_index, query2, documents)
results3 = phrase_query(inverted_index, query3, documents)
results4 = phrase_query(inverted_index, query4, documents)

print(f"Phrase query '{query1}' found in documents: {results1}")
print(f"Phrase query '{query2}' found in documents: {results2}")
print(f"Phrase query '{query3}' found in documents: {results3}")
print(f"Phrase query '{query4}' found in documents: {results4}")


Phrase query 'great expectations' found in documents: {6}
Phrase query 'little women' found in documents: set()
Phrase query 'my dear' found in documents: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21}
Phrase query 'a few minutes' found in documents: {1, 2, 4, 5, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21}


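### Phrase Query Results

We tested a few phrase queries using our inverted index with positional information. Here's what we found:

- "great expectations" was found in 1 document (ID 6), likely the book itself.
- "little women" was not found; the book does not appear to be in our collection.
- "my dear" matched all 22 documents. Because "my" is an NLTK stop word, it is removed during normalization, so this query effectively searches for the single stem "dear".
- "a few minutes" matched 18 documents. "a" and "few" are also stop words, so only the stem "minut" is actually searched.

These results show that the positional phrase search works for multi-term queries such as "great expectations". Queries made up mostly of stop words collapse to single-term lookups, so their high match counts reflect how common one word is rather than the full phrase.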
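Queries pass through the same normalization as the documents, so stop words in a query are dropped before matching. The sketch below shows the effect on the test queries, assuming "a", "my" and "few" are in the stop list (inspect `stop_words` in the notebook to confirm this for NLTK). `TOY_STOP_WORDS` and `normalize_query` are hypothetical, and stemming is omitted for brevity.

```python
# Hypothetical stop word subset; NLTK's English list is much longer.
TOY_STOP_WORDS = {"a", "my", "few", "the", "of"}

def normalize_query(phrase):
    """Lowercase, split on whitespace, and drop stop words (no stemming in this sketch)."""
    return [w for w in phrase.lower().split() if w not in TOY_STOP_WORDS]

for query in ["great expectations", "my dear", "a few minutes"]:
    print(query, "->", normalize_query(query))
```

A query that shrinks to one term is no longer a phrase query, so phrases built from content words make better tests of the positional logic.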

# Conclusion:


In this lab, we successfully built a simple search system using core NLP techniques. We:

- Collected and cleaned a real-world text dataset
- Tokenized and normalized the text
- Built an inverted index with positional information
- Performed phrase queries to find exact word sequences

This exercise helped us understand how search engines work under the hood, especially how text is indexed and queried efficiently. It also showed the power of preprocessing in improving search accuracy.

Overall, we’ve taken an important step toward understanding real-world information retrieval.
