# 🧠 Workshop: Building Blocks for AI Agents

## NLP Pipeline + Probabilistic Language Models (90-Minute Team Lab)

**Objective:**
Work in teams of 3 to build a small NLP pipeline and implement unigram and bigram models, culminating in estimating sentence probabilities. Submit your completed Jupyter Notebook via a GitHub link (with `.git` at the end).

## Part 1 – NLP Pipeline

### Step 1: Select and Load a Corpus

Select a corpus from `nltk`, or upload your own text documents. Ensure your vocabulary size exceeds 2000 words.

In [1]:
import nltk
nltk.download('reuters')
nltk.download('punkt_tab')
from nltk.corpus import reuters

corpus_docs = [reuters.raw(fileid) for fileid in reuters.fileids()]
corpus_text = " ".join(corpus_docs)

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


**👨‍🏫 Professor Talking Point:** This corpus is pre-tokenized and covers multiple topics. It’s a good fit to get us above the 2,000-word vocabulary requirement.

### Step 2: Collect and Preprocess Documents

Convert your corpus into tokens and compute the vocabulary size.

In [2]:
from nltk.tokenize import word_tokenize

nltk.download('punkt')
tokens = word_tokenize(corpus_text.lower())
vocab = set(tokens)
print(f"Vocabulary size: {len(vocab)}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


KeyboardInterrupt: 

In [None]:
import os
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Ensure 'punkt' is available and nltk_data path is set
nltk_data_path = os.path.join(os.getcwd(), 'nltk_data')
print("Downloading 'punkt' tokenizer...")
nltk.download('punkt', download_dir=nltk_data_path, force=True)
print("Downloading 'punkt_tab' tokenizer...")
nltk.download('punkt_tab', download_dir=nltk_data_path, force=True)

# Always append the custom nltk_data path (if not already present)
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)

# Debugging paths and contents
print("NLTK Data Paths:", nltk.data.path)
print("Contents of nltk_data:", os.listdir(nltk_data_path))

tokens = word_tokenize(corpus_text.lower())
vocab = set(tokens)
print(f"Vocabulary size: {len(vocab)}")

**👨‍🏫 Professor Talking Point:** Vocabulary size is important—it determines the richness of our model. Models trained on small vocabularies can't generalize well.

### Step 3: Implement Tokenizer

In [None]:
import re

def simple_tokenizer(text):
    return re.findall(r'\b\w+\b', text.lower())

tokens = simple_tokenizer(corpus_text)

**👨‍🏫 Professor Talking Point:** A simple regex tokenizer gives us control—this is useful when we need to understand every processing step.
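
To see what this tokenizer does at word boundaries, here is a minimal sketch that re-declares `simple_tokenizer` from the cell above and runs it on a made-up sample string (the sample text is an assumption for illustration, not corpus data). Because `\w` matches only letters, digits, and underscores, hyphens and apostrophes act as separators.

```python
import re

def simple_tokenizer(text):
    # Same regex as the workshop cell: runs of word characters, lowercased.
    return re.findall(r'\b\w+\b', text.lower())

# Hypothetical sample text, not taken from the corpus.
sample = "The Sino-Chilean venture isn't new; it runs for 15 years."
sample_tokens = simple_tokenizer(sample)
print(sample_tokens)
```

Note that "Sino-Chilean" yields two tokens and "isn't" yields `isn` and `t`. Whether that is acceptable depends on the task, which is one reason to inspect tokenizer output before counting.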

### Step 4: Normalization, Stemming, and Stopword Removal

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

nltk.download('stopwords')

def normalize(tokens):
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens if word not in stop_words and word not in string.punctuation]

normalized_tokens = normalize(tokens)

**👨‍🏫 Professor Talking Point:** Normalization makes the data more consistent and shrinks the vocabulary. This is essential for estimating reliable probabilities.

## Part 2 – Probabilistic Language Models

### 📘 Unigram Model

A **Unigram Model** is a type of probabilistic language model that assumes each word in a sentence is **independent** of the words that came before it.

The probability of a sequence of words $w_1, w_2, ..., w_n$ is calculated as:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

To estimate $P(w_i)$, we use the **Maximum Likelihood Estimate (MLE)**:

$$
P(w_i) = \frac{\text{count}(w_i)}{\sum_{j} \text{count}(w_j)}
$$

where the sum runs over every word type $w_j$ in the vocabulary, so the denominator equals the total number of tokens in the corpus.

This is a strong simplification, but it provides a foundational baseline and helps reduce data sparsity in low-resource environments.
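
As a quick sanity check of the MLE formula, the sketch below applies it to a six-token toy corpus. The token list and the `toy_*` names are assumptions made up for this illustration; they are not part of the workshop data.

```python
from collections import Counter

# Hypothetical toy corpus, used only to illustrate the formula.
toy_tokens = ["bank", "rate", "bank", "trade", "rate", "bank"]

toy_counts = Counter(toy_tokens)
toy_total = sum(toy_counts.values())  # total number of tokens

# MLE: count of the word divided by the total token count.
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

for word, p in sorted(toy_probs.items()):
    print(f"P({word!r}) = {p:.3f}")

# The estimates form a probability distribution, so they should sum to 1.
print(f"Sum = {sum(toy_probs.values()):.3f}")
```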

Here's how to implement it:


In [None]:
from collections import Counter

unigram_counts = Counter(normalized_tokens)
total_words = len(normalized_tokens)

def unigram_prob(word):
    return unigram_counts[word] / total_words

print(f"P('japan') = {unigram_prob('japan'):.5f}")
print(f"P('citibank') = {unigram_prob('citibank'):.5f}")
print(f"P('harvest') = {unigram_prob('harvest'):.5f}")
print(f"P('bank') = {unigram_prob('bank'):.5f}")


##### 📘 Why Are Unigram Probabilities So Low?

Unigram probabilities represent the **relative frequency** of individual words in the entire corpus:

$$
P(w_i) = \frac{\text{count}(w_i)}{\text{total number of tokens in the corpus}}
$$

In our case, the total number of tokens is quite large:

- **Total tokens:** 1,178,604  
- **Unique words (vocabulary size):** 67,151

Even if a word appears frequently, its probability will still be small relative to the total number of tokens.

For example:
- `"bank"` appears quite often, yet its probability is only **0.00493**, or about **0.5%** of the total words.
- `"citibank"` appears only a few times, resulting in a much smaller probability of **0.00005**.

These small values are expected when:
- The corpus is **large and diverse** (like Reuters).
- Many words appear **only once or twice**; this long tail of rare words is typical of natural language and is described by Zipf's Law.

**Conclusion:**  
Low unigram probabilities do **not** indicate an error—they reflect a realistic distribution of word frequencies across a large corpus. This also highlights the need for smoothing when building more complex language models.
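
To make the smoothing remark concrete, here is a small sketch on toy data. It shows that the MLE assigns exactly zero to a word that never occurred, and contrasts it with add-one (Laplace) smoothing, which is one common option and not necessarily the one this workshop intends. The toy tokens, the function names, and the choice to reserve one extra vocabulary slot for unseen words are all assumptions of this sketch.

```python
from collections import Counter

# Hypothetical toy corpus; the workshop itself uses normalized Reuters tokens.
toy_tokens = ["bank", "rate", "bank", "trade"]
toy_counts = Counter(toy_tokens)
toy_total = len(toy_tokens)
toy_vocab_size = len(toy_counts)

def mle_prob(word):
    # Counter returns 0 for unseen words, so the MLE is exactly 0.
    return toy_counts[word] / toy_total

def add_one_prob(word):
    # Add-one smoothing. The "+ 1" in the denominator reserves one extra
    # vocabulary slot for unseen words (an assumption of this sketch).
    return (toy_counts[word] + 1) / (toy_total + toy_vocab_size + 1)

print(mle_prob("harvest"))      # unseen word under MLE
print(add_one_prob("harvest"))  # small but non-zero after smoothing
```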


### 📘 Chain Rule with Unigrams

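The **Chain Rule** factors a sequence probability into conditionals, $P(w_1, ..., w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, ..., w_{i-1})$. The unigram model drops all of the conditioning context, which leaves: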
$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$
This is a simplifying assumption of complete independence (unrealistic but foundational).

**👨‍🏫 Professor Talking Point:** Unigram models assume word independence—useful but limited since word order is ignored.

In [None]:
def sentence_prob_unigram(sentence):
    words = normalize(simple_tokenizer(sentence))
    prob = 1.0
    for word in words:
        prob *= unigram_prob(word)
    return prob

sentence = "The agreement calls for joint Sino-Chilean management of the venture for 15 years, the paper said"
print(sentence_prob_unigram(sentence))

##### 📘 Why Is the Sentence Probability So Low?

The calculated **unigram sentence probability** is:

```python
2.382179640797073e-37
```

This number is extremely small—but **that’s expected** for long sentences under a unigram model. Here's why:


##### 🔢 Corpus Statistics

* **Total number of tokens:** 1,178,604
* **Vocabulary size (unique tokens):** 67,151

##### 📉 How the Unigram Model Works

The unigram model computes sentence probability as the **product of individual word probabilities**:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

Each word typically has a probability between 0.00001 and 0.01. When multiplying **10–20 small numbers together**, the final result becomes **exponentially smaller**, approaching zero for longer sentences.
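
A common way to keep such products manageable is to add log-probabilities instead of multiplying raw probabilities. The sketch below uses made-up per-word probabilities in the range mentioned above (the values and the sequence length of 400 are assumptions chosen to force underflow). For the sentence lengths used in this workshop the plain product still fits in a float, so this matters mainly for longer texts.

```python
import math

# Hypothetical per-word probabilities, repeated to simulate a 400-word text.
word_probs = [0.001, 0.0005, 0.00002, 0.004, 0.0001] * 80

# Direct multiplication underflows to 0.0 for a long enough sequence.
product = 1.0
for p in word_probs:
    product *= p

# Summing logs keeps the value representable as an ordinary float.
log_prob = sum(math.log(p) for p in word_probs)

print(product)
print(log_prob)
```

Log-probabilities preserve ordering, so two sentences can still be compared by their `log_prob` values.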

##### 🧪 Impact of Preprocessing (Step 4)

The normalization step involves:

* Lowercasing
* **Stop word removal** (e.g., "the", "of", "for")
* **Stemming** (e.g., "management" → "manag")
* **Punctuation removal**

This reduces the number of words used in the calculation. While this makes the vocabulary smaller and more manageable, it also means:

* **Common but removed words** (like "the") don’t contribute to the probability.
* **Hyphenated words** are split by the tokenizer before stemming (with `simple_tokenizer`, “sino-chilean” becomes `sino` and `chilean`), so the tokens being scored may not match the words as written.

So even though the sentence appears long, **only 7–12 stemmed and filtered tokens** may remain after preprocessing—yet each one still has a very small individual probability.

##### ✅ Key Takeaways

* Low sentence probabilities are **normal** in unigram models, especially for longer sentences.
* The **multiplicative nature** of probability and the **sparsity of natural language** lead to very small final values.
* These limitations are one reason why more advanced models (like bigrams or neural LMs) are needed for realistic NLP applications.

You can inspect the intermediate tokens like this:

```python
print(normalize(simple_tokenizer(sentence)))
```


### 📘 Bigram Model with MLE – Mathematical Explanation

The **Bigram Model** assumes the current word depends only on the previous word.
The MLE (Maximum Likelihood Estimate) for a bigram $(w_{i-1}, w_i)$ is:
$$
P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$
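
The bigram MLE can be checked by hand on a tiny example. In this sketch the token sequence and the `toy_*` names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy token sequence for illustration.
toy_tokens = ["bank", "rate", "rise", "bank", "rate", "fall", "bank", "loan"]

toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

def toy_bigram_prob(w1, w2):
    # count(w1, w2) / count(w1); 0.0 when w1 was never seen.
    if toy_unigrams[w1] == 0:
        return 0.0
    return toy_bigrams[(w1, w2)] / toy_unigrams[w1]

# "bank" occurs 3 times; it is followed by "rate" twice and by "loan" once.
print(toy_bigram_prob("bank", "rate"))
print(toy_bigram_prob("bank", "loan"))
```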

**👨‍🏫 Professor Talking Point:** This simple multiplication illustrates the chain rule, but we’ll soon see how to improve this with context.

### 📘 Sentence Probability with Bigram Model – Mathematical Explanation

Using the bigram model and chain rule:
$$
P(w_1, w_2, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdots P(w_n | w_{n-1})
$$
This models **local dependencies** between words.

In [None]:
from collections import defaultdict

bigram_counts = defaultdict(int)

for i in range(len(normalized_tokens) - 1):
    pair = (normalized_tokens[i], normalized_tokens[i + 1])
    bigram_counts[pair] += 1

def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] > 0 else 0

**👨‍🏫 Professor Talking Point:** Bigram probabilities model word context, capturing more meaning than unigrams.

### Sentence Probability with Bigram Model

In [None]:
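def sentence_prob_bigram(sentence):
    words = normalize(simple_tokenizer(sentence))
    if not words:
        return 0.0
    # Start with P(w_1), then multiply by each P(w_i | w_{i-1}),
    # matching the bigram chain-rule formula above.
    prob = unigram_prob(words[0])
    for i in range(len(words) - 1):
        prob *= bigram_prob(words[i], words[i + 1])
    return prob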

sentence = "The agreement calls for joint Sino-Chilean management of the venture for 15 years, the paper said"
print(sentence_prob_bigram(sentence))

**👨‍🏫 Professor Talking Point:** Estimating sentence probability using bigrams shows how sequence information improves prediction power.
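
One way to see the effect of sequence information is to score the same words in two different orders. The sketch below uses a made-up toy corpus and hypothetical helper names; it follows the unigram product and the bigram chain-rule formula given earlier, and is not the workshop's Reuters model.

```python
from collections import Counter

# Hypothetical toy corpus of already-normalized tokens.
toy_tokens = ["bank", "rais", "rate", "bank", "cut", "rate", "bank", "rais", "loan"]
uni = Counter(toy_tokens)
bi = Counter(zip(toy_tokens, toy_tokens[1:]))
total = len(toy_tokens)

def unigram_sentence_prob(words):
    # Product of independent word probabilities; order cannot matter.
    prob = 1.0
    for w in words:
        prob *= uni[w] / total
    return prob

def bigram_sentence_prob(words):
    # P(w_1) times the product of P(w_i | w_{i-1}).
    if not words:
        return 0.0
    prob = uni[words[0]] / total
    for w1, w2 in zip(words, words[1:]):
        prob *= (bi[(w1, w2)] / uni[w1]) if uni[w1] else 0.0
    return prob

original = ["bank", "rais", "rate"]
shuffled = ["rate", "rais", "bank"]

print(unigram_sentence_prob(original), unigram_sentence_prob(shuffled))
print(bigram_sentence_prob(original), bigram_sentence_prob(shuffled))
```

The unigram scores agree up to floating-point rounding, while the bigram score for the shuffled order drops to zero because the pair (`rate`, `rais`) never occurs in the toy corpus.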

## Part 3 – The Workshop


One team member must push the final notebook to GitHub and send the `.git` URL to the instructor before the end of class.

## 🧠 Learning Objectives
- Implement the foundations of **Probabilistic Language Models** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(20 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(65 min)* – NLP Pipeline and four Probabilistic Language Model method implementations + Markdown documentation (work as teams)
3. **Push to GitHub** *(5 min)* – Teams commit and push the one notebook. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** - The instructor will go around, take notes, and provide coaching as needed, during the **Peer Review Round**
5. **Email Delivery** *(1 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email/team)**. The email subject is: PROG8245 - Probabilistic Language Models Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `ProbabilisticLanguageModels.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, Inverted Index and the four methods.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** (1-2 per concept)
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `ProbabilisticLanguageModels`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 🧭 Conclusion

Today you’ve constructed your own basic language model. Next class, we’ll expand these ideas to explore **Large Language Models (LLMs)**—like ChatGPT—which learn patterns over **massive corpora** using **deep neural networks** instead of just counts.

In [3]:
import pandas as pd
from tabulate import tabulate

# Load your dataset
df = pd.read_csv('data/Quries_in_English.csv')  # Replace with your filename

# Basic check
print(tabulate(df.head(),headers='keys', tablefmt='psql'))


+----+-----------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|    | question                                                  | answer                                                                                                                                                                                  |
|----+-----------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | What is the method to get DMC?                            | Procedure for incomplete DMC is:-                                                                                                                                             

In [4]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

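# Build the stop-word set and stemmer once, not once per token
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Preprocessing function
def preprocessAndReturnTokens(text):
    tokens = word_tokenize(str(text).lower())
    # Strip everything except lowercase letters (digits and punctuation are dropped)
    tokens = [re.sub(r'[^a-z]', '', word) for word in tokens]
    # Drop empty strings and stop words
    tokens = [word for word in tokens if word and word not in stop_words]
    return [stemmer.stem(word) for word in tokens]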

# Apply to both question and answer
df['question_tokens'] = df['question'].apply(preprocessAndReturnTokens)
df['answer_tokens'] = df['answer'].apply(preprocessAndReturnTokens)




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
print(tabulate(df.head(),headers='keys', tablefmt='psql'))


In [8]:
from collections import Counter, defaultdict

# Flatten all question tokens
all_q_tokens = [token for tokens in df['question_tokens'] for token in tokens]

# Unigram model
unigram_freq = Counter(all_q_tokens)
total_unigrams = sum(unigram_freq.values())
unigram_prob = {word: count / total_unigrams for word, count in unigram_freq.items()}

# Bigram model
bigrams = list(zip(all_q_tokens[:-1], all_q_tokens[1:]))
bigram_freq = Counter(bigrams)
vocab_size = len(set(all_q_tokens))
bigram_prob = defaultdict(lambda: defaultdict(float))

for (w1, w2), freq in bigram_freq.items():
    bigram_prob[w1][w2] = (freq + 1) / (unigram_freq[w1] + vocab_size)
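
The loop above stores add-one (Laplace) estimates only for bigrams that actually occurred, so looking up an unseen pair in `bigram_prob` returns the `defaultdict` default of 0.0 instead of a smoothed value. A minimal sketch of computing the same estimate on demand follows; the toy tokens and the `toy_*` and `smoothed_bigram_prob` names are assumptions for illustration, standing in for `all_q_tokens` and its counts.

```python
from collections import Counter

# Hypothetical toy tokens standing in for all_q_tokens.
toy_q_tokens = ["pay", "fee", "pay", "bill", "get", "card"]

toy_unigram_freq = Counter(toy_q_tokens)
toy_bigram_freq = Counter(zip(toy_q_tokens[:-1], toy_q_tokens[1:]))
toy_vocab_size = len(toy_unigram_freq)

def smoothed_bigram_prob(w1, w2):
    # Add-one estimate computed on demand, so unseen bigrams get
    # (0 + 1) / (count(w1) + V) rather than 0.
    return (toy_bigram_freq[(w1, w2)] + 1) / (toy_unigram_freq[w1] + toy_vocab_size)

print(smoothed_bigram_prob("pay", "fee"))   # seen bigram
print(smoothed_bigram_prob("pay", "card"))  # unseen bigram, still non-zero

# For a fixed w1, the smoothed probabilities over the vocabulary sum to 1.
print(sum(smoothed_bigram_prob("pay", w) for w in toy_unigram_freq))
```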


In [13]:
def respond(user_query):
    tokens = preprocessAndReturnTokens(user_query)
    score_dict = {}

    for i, q_tokens in enumerate(df['question_tokens']):
        common = set(tokens) & set(q_tokens)
        score = len(common)
        score_dict[i] = score

    best_match = max(score_dict, key=score_dict.get)
    return df['answer'].iloc[best_match]

# Try it out


questions = ['how do i pay my bill?', 'what is the process for paying my bill?', 'how can I settle my account?', "where can i get my id card?"]

for question in questions:
    response = respond(question)
    print(f"Q: {question}\nA: {response}\n")




Q: how do i pay my bill?
A: After every month, students pay fee to the munshi.

Q: what is the process for paying my bill?
A: After every month, students pay fee to the munshi.

Q: how can I settle my account?
A:  (Account No: 01287901582203, Title: Student Dues UET, LHR),

Q: where can i get my id card?
A: Please contact Convener Officer UET, Lahore 042-99029452, 042-99029216 to correct your CNIC number
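
The matching step in `respond` counts distinct shared tokens between the query and each stored question. The sketch below reproduces that scoring on made-up token lists (the data and the `overlap_scores` helper are assumptions for illustration) to show two properties worth keeping in mind: ties go to the earliest row, and a query with no overlap at all still selects row 0.

```python
# Hypothetical preprocessed question tokens standing in for df['question_tokens'].
toy_question_tokens = [
    ["pay", "fee"],
    ["get", "id", "card"],
    ["hostel", "fee", "pay", "month"],
]

def overlap_scores(query_tokens):
    # Score each stored question by the number of distinct shared tokens.
    query_set = set(query_tokens)
    return [len(query_set & set(q)) for q in toy_question_tokens]

scores = overlap_scores(["pay", "bill"])
print(scores)

# When every score is 0, max() still returns the first index, so the caller
# receives an answer even though nothing matched.
no_match = overlap_scores(["weather"])
best = max(range(len(no_match)), key=no_match.__getitem__)
print(no_match, best)
```

If that behavior is not wanted, one option is to return a fallback message whenever the best score is 0.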

