# 🧠 Workshop: Building Blocks for AI Agents

## NLP Pipeline + Probabilistic Language Models (90-Minute Team Lab)

**Objective:**
Work in teams of 3 to build a small NLP pipeline and implement unigram and bigram models, culminating in estimating sentence probabilities. Submit your completed Jupyter Notebook via a GitHub link (with `.git` at the end).

## Group 8

### Eris Leksi

### Erica Holden

### Reham Abuarqoub

## Part 1 – NLP Pipeline

### Step 1: Select and Load a Corpus

Select a corpus from `nltk`, or upload your own text documents. Ensure your vocabulary size exceeds 2000 words.

# Sustainable NLP: Optimization & Recommendation Engine  
### Using the "Awesome ChatGPT Prompts" Dataset

This notebook demonstrates a pipeline to analyze and optimize long prompts from the **Awesome ChatGPT Prompts** dataset.  
The goal is to reduce prompt length and token usage without sacrificing semantic meaning — contributing to energy-efficient AI inference.


In [2]:
from datasets import load_dataset

# Load the Awesome ChatGPT Prompts dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts", split="train")

# Extract the prompt text field from the dataset
prompts = [item['prompt'] for item in dataset]

print(f"Loaded {len(prompts)} prompts.")
print("Example prompt:\n", prompts[0])

Loaded 203 prompts.
Example prompt:
 Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.


### Corpus Overview

We have loaded a collection of user-generated prompts designed to engage ChatGPT.  
These prompts vary in length and complexity, making them ideal for testing prompt optimization techniques.


**👨‍🏫 Professor Talking Point:** This corpus is raw text rather than pre-tokenized, and it covers multiple topics. It’s a good fit to get us above the 2,000-word vocabulary requirement.

### Step 2: Collect and Preprocess Documents

Convert your corpus into tokens and compute the vocabulary size.

In [3]:
import pandas as pd

# Calculate word counts for each prompt
word_counts = [len(p.split()) for p in prompts]
df_stats = pd.DataFrame(word_counts, columns=['Word Count'])

print("Prompt length statistics:")
print(df_stats.describe())


Prompt length statistics:
       Word Count
count  203.000000
mean    82.088670
std     35.695938
min     20.000000
25%     62.000000
50%     75.000000
75%     88.500000
max    307.000000




We calculated word counts for each prompt to understand their length distribution. This helps with optimizing prompt design and preprocessing.
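
Step 2 also asks for the vocabulary size, which the cell above does not report. A minimal sketch of one way to compute it, assuming lowercased whitespace tokenization; `sample_prompts` and `vocabulary` are hypothetical names, with `sample_prompts` standing in for the notebook's `prompts` list:

```python
# Hypothetical stand-in for the notebook's `prompts` list.
sample_prompts = [
    "Act as a Linux terminal",
    "Act as a travel guide",
    "Write a short poem about a terminal",
]

def vocabulary(texts):
    """Return the set of distinct lowercased whitespace tokens in texts."""
    return {word.lower() for text in texts for word in text.split()}

vocab = vocabulary(sample_prompts)
total_words = sum(len(text.split()) for text in sample_prompts)

print(f"Total words: {total_words}")
print(f"Vocabulary size: {len(vocab)}")
```

On the real corpus, `len(vocabulary(prompts))` would give a raw vocabulary size before stemming and stopword removal. Punctuation stays attached to words under this simple split, so the count tends to run higher than a regex-based tokenizer would give.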


### Text Preprocessing

Before analyzing the prompts, we normalize the text by collapsing runs of whitespace (spaces, tabs, and newlines) into single spaces and trimming the ends.


In [17]:
import re

def clean_text(text):
    """Collapse whitespace runs into single spaces and trim both ends."""
    text = re.sub(r"\s+", " ", text)  # Any run of whitespace becomes one space
    text = text.strip()
    return text

cleaned_prompts = [clean_text(p) for p in prompts]

print(f"First cleaned prompt:\n{cleaned_prompts[0]}")


First cleaned prompt:
Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.


We applied basic text cleaning to each prompt by removing extra spaces and trimming whitespace for consistency.


**👨‍🏫 Professor Talking Point:** Vocabulary size is important—it determines the richness of our model. Models trained on small vocabularies can't generalize well.

### Step 3: Implement Tokenizer

### Tokenization

We tokenize prompts using the OpenAI `tiktoken` tokenizer to estimate token counts, the unit in which model input length and inference cost are usually measured.


In [18]:
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")
tokenized_prompts = [tokenizer.encode(p) for p in cleaned_prompts]

print(f"Tokens in first prompt:\n{tokenized_prompts[0]}")
print(f"Token count in first prompt: {len(tokenized_prompts[0])}")


Tokens in first prompt:
[52157, 499, 527, 459, 10534, 35046, 16131, 51920, 449, 6968, 264, 7941, 5226, 369, 264, 18428, 50596, 13, 578, 16945, 374, 311, 3665, 6743, 389, 279, 18428, 11, 3339, 1124, 34898, 320, 898, 8, 311, 5127, 11, 47005, 320, 2039, 8, 1193, 311, 279, 1732, 889, 27167, 279, 5226, 11, 323, 311, 1797, 1268, 1690, 3115, 279, 1984, 574, 6177, 13, 8000, 264, 22925, 488, 7941, 5226, 369, 420, 7580, 11, 2737, 279, 5995, 5865, 323, 38864, 369, 32145, 279, 5300, 9021, 13, 5321, 3493, 279, 2082, 323, 904, 9959, 41941, 311, 6106, 264, 2867, 8830, 315, 279, 8292, 13]
Token count in first prompt: 100


Each cleaned prompt was tokenized using OpenAI’s `tiktoken` library to prepare for language model analysis. This step converts text into a sequence of integer tokens.

**👨‍🏫 Professor Talking Point:** A simple regex tokenizer gives us control—this is useful when we need to understand every processing step.

### Optimization Objective

Our goal is to reduce token counts by rewriting verbose prompts while preserving semantic meaning.  
We will implement a simple rule-based rewriting engine as a prototype.


### Step 4: Normalization, Stemming, and Stopword Removal


We further preprocess by normalizing case, stemming words using PorterStemmer, and removing stopwords to reduce vocabulary size and focus on core semantics.


In [21]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download the NLTK stopword list if it is not already present
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    words = re.findall(r'\b\w+\b', text.lower())
    filtered = [stemmer.stem(w) for w in words if w not in stop_words]
    return filtered

processed_prompts = [preprocess_text(p) for p in cleaned_prompts]

print(f"Processed first prompt tokens:\n{processed_prompts[0]}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Processed first prompt tokens:
['imagin', 'experienc', 'ethereum', 'develop', 'task', 'creat', 'smart', 'contract', 'blockchain', 'messeng', 'object', 'save', 'messag', 'blockchain', 'make', 'readabl', 'public', 'everyon', 'writabl', 'privat', 'person', 'deploy', 'contract', 'count', 'mani', 'time', 'messag', 'updat', 'develop', 'solid', 'smart', 'contract', 'purpos', 'includ', 'necessari', 'function', 'consider', 'achiev', 'specifi', 'goal', 'pleas', 'provid', 'code', 'relev', 'explan', 'ensur', 'clear', 'understand', 'implement']


We applied NLTK-based preprocessing by removing stopwords and stemming tokens. This reduces noise and standardizes the vocabulary for language modeling.

**👨‍🏫 Professor Talking Point:** Normalization makes the data more consistent and shrinks the vocabulary. This is essential for estimating reliable probabilities.

## Part 2 – Probabilistic Language Models

### 📘 Unigram Model

A **Unigram Model** is a type of probabilistic language model that assumes each word in a sentence is **independent** of the words that came before it.

The probability of a sequence of words $w_1, w_2, ..., w_n$ is calculated as:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

To estimate $P(w_i)$, we use the **Maximum Likelihood Estimate (MLE)**:

$$
P(w_i) = \frac{\text{count}(w_i)}{\sum_{j} \text{count}(w_j)}
$$

where $j$ ranges over every distinct word in the vocabulary, so the denominator is the total number of tokens in the corpus.

This is a strong simplification, but it provides a foundational baseline and helps reduce data sparsity in low-resource environments.

Here's how to implement it:



We now build foundational probabilistic language models (unigram and bigram) to estimate the probability of prompts, which helps evaluate fluency and support prompt optimization.


In [22]:
from collections import Counter

# Flatten list of processed tokens to get unigram counts
all_tokens = [token for prompt in processed_prompts for token in prompt]
unigram_counts = Counter(all_tokens)
total_unigrams = sum(unigram_counts.values())

# Calculate unigram probabilities
unigram_probs = {token: count / total_unigrams for token, count in unigram_counts.items()}

print(f"Total unique unigrams: {len(unigram_probs)}")
print(f"Sample unigram probabilities:\n{list(unigram_probs.items())[:10]}")


Total unique unigrams: 2016
Sample unigram probabilities:
[('imagin', 0.0007746790615316512), ('experienc', 0.0007746790615316512), ('ethereum', 0.00011066843736166445), ('develop', 0.005533421868083223), ('task', 0.0021027003098716248), ('creat', 0.005312084993359893), ('smart', 0.00033200531208499334), ('contract', 0.00033200531208499334), ('blockchain', 0.0002213368747233289), ('messeng', 0.00011066843736166445)]


We computed unigram frequencies and their probabilities using Maximum Likelihood Estimation (MLE). This forms the basis for our simplest probabilistic language model.

##### 📘 Why Are Unigram Probabilities So Low?

Unigram probabilities represent the **relative frequency** of individual words in the entire corpus:

$$
P(w_i) = \frac{\text{count}(w_i)}{\text{total number of tokens in the corpus}}
$$

In our case, the total number of tokens is large relative to any single word's count:

- **Total tokens (after preprocessing):** 9,036, as implied by the probabilities above (a word seen once has probability 1/9,036 ≈ 0.00011)  
- **Unique words (vocabulary size):** 2,016

Even if a word appears frequently, its probability will still be small relative to the total number of tokens.

For example:
- `'develop'` appears 50 times, yet its probability is only **0.00553**, or about **0.55%** of the total tokens.
- `'ethereum'` appears only once, resulting in a much smaller probability of **0.00011**.

These small values are expected when:
- The corpus is **diverse** (203 prompts covering multiple topics).
- Many words appear **only once or twice**, which is common in natural language (a pattern described by Zipf's Law).

**Conclusion:**  
Low unigram probabilities do **not** indicate an error. They reflect a realistic distribution of word frequencies across the corpus. This also highlights the need for smoothing when building more complex language models.
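
The claim that many words occur only once or twice can be checked with a frequency-of-frequencies count. A minimal sketch on a toy token list; `toy_tokens` and `frequency_of_frequencies` are hypothetical names, and in the notebook the same function could be applied to `unigram_counts`:

```python
from collections import Counter

# Hypothetical toy token list standing in for the notebook's `all_tokens`.
toy_tokens = ["develop", "creat", "develop", "task", "ethereum",
              "develop", "creat", "messeng", "blockchain", "task"]

def frequency_of_frequencies(counts):
    """Map each occurrence count to the number of word types with that count."""
    return Counter(counts.values())

toy_counts = Counter(toy_tokens)
freq_of_freq = frequency_of_frequencies(toy_counts)

# Word types that occur exactly once in the token list.
singletons = freq_of_freq[1]
print(f"Types: {len(toy_counts)}, occurring once: {singletons}")
print(f"Share of types occurring once: {singletons / len(toy_counts):.0%}")
```

A large share of single-occurrence types on the real corpus would explain why so many unigram probabilities sit at the minimum value of one over the total token count.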


### 📘 Chain Rule with Unigrams

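The **Chain Rule** factors a sequence probability into conditional terms $P(w_i \mid w_1, \dots, w_{i-1})$. The unigram model drops the conditioning on earlier words, so the estimate reduces to:
$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$
This rests on an assumption of complete independence between words (unrealistic but foundational).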

**👨‍🏫 Professor Talking Point:** Unigram models assume word independence—useful but limited since word order is ignored.

In [23]:
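def unigram_sequence_prob(sequence, unigram_prob_dict):
    """Multiply the unigram probabilities of every token in `sequence`."""
    prob = 1.0
    for token in sequence:
        # Fixed floor for unseen words; a stand-in for real smoothing
        prob *= unigram_prob_dict.get(token, 1e-8)
    return prob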

# Example on first processed prompt
seq_prob = unigram_sequence_prob(processed_prompts[0], unigram_probs)
print(f"Unigram sequence probability for first prompt: {seq_prob}")


Unigram sequence probability for first prompt: 2.525468482982433e-153


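We calculate the probability of a word sequence under the unigram model by multiplying individual word probabilities. Unseen tokens receive a fixed fallback probability of `1e-8`; this avoids zeros but is not true smoothing, because no probability mass is taken from seen words.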

The extremely small probability value (e.g., 2.5e-153) reflects the product of many small individual word probabilities. Since the unigram model multiplies the probabilities of each word assuming independence, the more words in a prompt, the smaller the overall sequence probability becomes. This is expected and highlights why smoothing and more complex models (like bigrams) are needed for practical language modeling.
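
Products this small approach the limits of floating-point precision, and a longer prompt could underflow to exactly `0.0`. A common workaround is to sum log-probabilities instead of multiplying probabilities. A minimal sketch, assuming the same `1e-8` fallback as the cell above; `toy_probs` and `unigram_log_prob` are hypothetical names, not part of the notebook:

```python
import math

# Hypothetical toy probabilities standing in for `unigram_probs`.
toy_probs = {"smart": 0.002, "contract": 0.001, "code": 0.004}

def unigram_log_prob(sequence, prob_dict, fallback=1e-8):
    """Sum natural-log unigram probabilities; unseen tokens use `fallback`."""
    return sum(math.log(prob_dict.get(token, fallback)) for token in sequence)

sequence = ["smart", "contract", "code"]
log_p = unigram_log_prob(sequence, toy_probs)

print(f"Log-probability: {log_p:.4f}")
print(f"Recovered probability: {math.exp(log_p):.3e}")

# A direct product of 100 values of 1e-8 underflows to 0.0 in float64,
# while the log-space sum stays finite.
long_sequence = ["unseen"] * 100
print(math.prod([1e-8] * 100), unigram_log_prob(long_sequence, toy_probs))
```

Log-probabilities also make prompts easier to compare, since differences of a few units are more readable than ratios of numbers near `1e-150`.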

##### 📘 Why Is the Sentence Probability So Low?

The calculated **unigram sentence probability** for the first prompt is:

```python
2.525468482982433e-153
```

This number is extremely small, but **that’s expected** for long sentences under a unigram model. Here's why:


##### 🔢 Corpus Statistics

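* **Total number of tokens (after preprocessing):** 9,036, as implied by the unigram probabilities (a single occurrence has probability 1/9,036 ≈ 0.00011)
* **Vocabulary size (unique tokens):** 2,016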

##### 📉 How the Unigram Model Works

The unigram model computes sentence probability as the **product of individual word probabilities**:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

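In this corpus, a token seen once has a probability of about 0.00011 (see `'ethereum'` in the sample output), and the sample values printed earlier all sit below 0.01. The first prompt keeps **49 tokens** after preprocessing, and multiplying 49 such values drives the result toward zero; every additional token shrinks it further.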
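
One way to make sense of the reported value is to convert it to a per-token average. A minimal sketch using the unigram probability printed earlier for the first prompt and the 49 tokens listed in its processed output; the variable names are illustrative:

```python
import math

# Values taken from the notebook outputs above.
sequence_prob = 2.525468482982433e-153  # unigram probability of the first prompt
num_tokens = 49                         # tokens in the processed first prompt

# log10 of the product equals the sum of the per-token log10 probabilities.
log10_prob = math.log10(sequence_prob)

# Geometric mean: the single per-token probability that, multiplied
# num_tokens times, would reproduce sequence_prob.
geometric_mean = 10 ** (log10_prob / num_tokens)

print(f"log10 of sequence probability: {log10_prob:.2f}")
print(f"Geometric-mean token probability: {geometric_mean:.5f}")
```

The geometric mean comes out a little under 0.001, in line with the sample unigram probabilities, so the tiny product reflects sequence length rather than anything unusual about individual words.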

##### 🧪 Impact of Preprocessing (Step 4)

The normalization step involves:

* Lowercasing
* **Stop word removal** (e.g., "the", "of", "for")
* **Stemming** (e.g., "management" → "manag")
* **Punctuation removal**

This reduces the number of words used in the calculation. While this makes the vocabulary smaller and more manageable, it also means:

* **Common but removed words** (like "the") don’t contribute to the probability.
* **Stemmed forms** differ from the surface words (e.g., "Solidity" becomes `solid` and "messages" becomes `messag`), so any new sentence must pass through the same `preprocess_text` function before lookup.

So even though the first prompt encodes to 100 `tiktoken` tokens, **only 49 stemmed and filtered tokens** remain after preprocessing, and each one still has a very small individual probability.

##### ✅ Key Takeaways

* Low sentence probabilities are **normal** in unigram models, especially for longer sentences.
* The **multiplicative nature** of probability and the **sparsity of natural language** lead to very small final values.
* These limitations are one reason why more advanced models (like bigrams or neural LMs) are needed for realistic NLP applications.

You can inspect the intermediate tokens like this:

```python
print(preprocess_text(cleaned_prompts[0]))
```


### 📘 Bigram Model with MLE – Mathematical Explanation

The **Bigram Model** assumes the current word depends only on the previous word.
The MLE (Maximum Likelihood Estimate) for a bigram $(w_{i-1}, w_i)$ is:
$$
P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$

**👨‍🏫 Professor Talking Point:** This simple multiplication illustrates the chain rule, but we’ll soon see how to improve this with context.

### 📘 Sentence Probability with Bigram Model – Mathematical Explanation

Using the bigram model and chain rule:
$$
P(w_1, w_2, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdots P(w_n | w_{n-1})
$$
This models **local dependencies** between words.

In [24]:
from collections import defaultdict

bigram_counts = defaultdict(Counter)
unigram_counts_for_bigram = Counter()

for tokens in processed_prompts:
    for i in range(len(tokens)):
        unigram_counts_for_bigram[tokens[i]] += 1
        if i > 0:
            bigram_counts[tokens[i-1]][tokens[i]] += 1

# Calculate bigram probabilities with MLE
bigram_probs = {}
for w1 in bigram_counts:
    bigram_probs[w1] = {}
    total_count = sum(bigram_counts[w1].values())
    for w2 in bigram_counts[w1]:
        bigram_probs[w1][w2] = bigram_counts[w1][w2] / total_count

print(f"Sample bigram probabilities for '{list(bigram_probs.keys())[0]}':")
print(list(bigram_probs[list(bigram_probs.keys())[0]].items())[:5])


Sample bigram probabilities for 'imagin':
[('experienc', 0.14285714285714285), ('captiv', 0.14285714285714285), ('depend', 0.14285714285714285), ('descript', 0.2857142857142857), ('work', 0.14285714285714285)]


We computed bigram counts and probabilities using Maximum Likelihood Estimation (MLE). This model estimates the probability of each word given the previous word, capturing local word dependencies.

The output shows the probability distribution of words that follow the token `'imagin'` in the corpus. For example, `'descript'` follows `'imagin'` about 28.57% of the time, while `'experienc'`, `'captiv'`, `'depend'`, and `'work'` each follow with roughly 14.29% probability. These probabilities reflect how often each bigram occurred relative to all bigrams starting with `'imagin'`, capturing local context dependencies.
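
The formula divides by $\text{count}(w_{i-1})$, while the cell above divides by the number of bigrams that start with $w_{i-1}$. The two differ only for words that end a prompt, since a final word has no successor. A minimal sketch of the difference on a toy corpus; `toy_prompts` is hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: two already-preprocessed prompts.
toy_prompts = [
    ["smart", "contract", "code"],
    ["write", "smart", "contract"],
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for tokens in toy_prompts:
    unigram_counts.update(tokens)
    # Pair each token with its successor inside the same prompt.
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

w1, w2 = "contract", "code"
pair_count = bigram_counts[w1][w2]

# Denominator used in the formula: every occurrence of w1.
p_by_unigram = pair_count / unigram_counts[w1]
# Denominator used in the notebook cell: bigrams that start with w1.
p_by_successors = pair_count / sum(bigram_counts[w1].values())

print(f"count-based:     P({w2}|{w1}) = {p_by_unigram}")
print(f"successor-based: P({w2}|{w1}) = {p_by_successors}")
```

With the count-based denominator, the conditional probabilities for `contract` sum to 0.5; the missing mass corresponds to the prompt that ends on that word. Adding an end-of-prompt marker is one common way to reconcile the two.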

**👨‍🏫 Professor Talking Point:** Bigram probabilities model word context, capturing more meaning than unigrams.

### Sentence Probability with Bigram Model

In [25]:
def bigram_sequence_prob(sequence, unigram_prob_dict, bigram_prob_dict):
    if not sequence:
        return 0
    prob = unigram_prob_dict.get(sequence[0], 1e-8)  # P(w1)
    for i in range(1, len(sequence)):
        w1, w2 = sequence[i-1], sequence[i]
        prob *= bigram_prob_dict.get(w1, {}).get(w2, 1e-8)  # P(w_i | w_{i-1})
    return prob

# Example on first processed prompt
seq_prob_bigram = bigram_sequence_prob(processed_prompts[0], unigram_probs, bigram_probs)
print(f"Bigram sequence probability for first prompt: {seq_prob_bigram}")


Bigram sequence probability for first prompt: 6.411700207683277e-42


This function computes the probability of a word sequence using the bigram model by multiplying the conditional probabilities of each word given its predecessor. It applies a small fallback probability for unseen bigrams to avoid zero probability issues.

The bigram sequence probability (6.4e-42) is far larger than the unigram probability (2.5e-153) but still very small, because many probabilities are multiplied together. Part of the gap comes from scoring a prompt that is itself in the training corpus: every one of its bigrams was observed, so the `1e-8` fallback is never used. The bigram model captures local word dependencies, which makes it a more realistic estimate than the unigram model, though sequence probabilities still shrink as sentences get longer.

**👨‍🏫 Professor Talking Point:** Estimating sentence probability using bigrams shows how sequence information improves prediction power.

# Summary

- Loaded and preprocessed a real-world prompt corpus.
- Implemented tokenization, normalization, stemming, and stopword removal.
- Built unigram and bigram probabilistic language models using MLE.
- Calculated sequence probabilities using chain rule assumptions.

# Next Steps

- Implement smoothing techniques (Laplace, Kneser-Ney) to handle zero probabilities.
- Evaluate prompt optimization by scoring original vs. rewritten prompts.
- Integrate semantic similarity models to preserve meaning.
- Explore neural paraphrasing models with energy-aware constraints.
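
As a starting point for the first item, add-one (Laplace) smoothing replaces the fixed `1e-8` fallback by adding 1 to every bigram count and adding the vocabulary size $V$ to the denominator. A minimal sketch on a toy corpus, assuming the denominator counts bigrams that start with the previous word (as in the bigram cell); `toy_prompts` and `laplace_bigram_prob` are hypothetical names:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: two already-preprocessed prompts.
toy_prompts = [
    ["smart", "contract", "code"],
    ["smart", "code", "review"],
]

bigram_counts = defaultdict(Counter)
vocab = set()
for tokens in toy_prompts:
    vocab.update(tokens)
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def laplace_bigram_prob(prev, curr, counts, vocab_size):
    """Add-one estimate of P(curr | prev) over a vocabulary of vocab_size words."""
    successors = counts.get(prev, {})
    total = sum(successors.values())
    return (successors.get(curr, 0) + 1) / (total + vocab_size)

V = len(vocab)
seen = laplace_bigram_prob("smart", "contract", bigram_counts, V)
unseen = laplace_bigram_prob("smart", "review", bigram_counts, V)
total_mass = sum(laplace_bigram_prob("smart", w, bigram_counts, V) for w in vocab)

print(f"V = {V}")
print(f"P(contract | smart) = {seen:.3f}")
print(f"P(review | smart)   = {unseen:.3f}")
print(f"Sum over vocabulary = {total_mass:.3f}")
```

Unlike the fixed fallback, the smoothed estimates for a given previous word still sum to 1 over the vocabulary. Add-one smoothing is known to shift a lot of mass to unseen events when $V$ is large, which is one reason Kneser-Ney is listed alongside it.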


## Prompt Rewriting & Semantic Evaluation

To evaluate sustainability and clarity, we rewrite prompts using simple heuristics and measure their semantic similarity to the originals using sentence embeddings.


In [29]:
# Heuristic function: simplify wording & remove filler
def rewrite_prompt(prompt):
    prompt = re.sub(r"^Act as an? ", "", prompt)  # remove "Act as a"
    prompt = prompt.replace("I want you to", "Please")
    prompt = prompt.replace("could you", "can you")
    prompt = prompt.replace("would you", "can you")
    prompt = prompt.replace("Write a", "Give a")
    return prompt.strip()

rewritten_prompts = [rewrite_prompt(p) for p in cleaned_prompts]

# Show comparison
for i in range(6):
    print(f"\nOriginal: {cleaned_prompts[i]}")
    print(f"Rewritten: {rewritten_prompts[i]}")



Original: Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.
Rewritten: Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the nec

### Explanation of Prompt Rewriting

In this step, we applied a simple heuristic function to streamline and simplify prompt wording. The function removes common filler phrases such as “Act as a”, and replaces verbose expressions like “I want you to” with shorter alternatives like “Please.” This approach helps make prompts more concise without changing their core meaning.


### Output Interpretation

The output displays pairs of original and rewritten prompts side-by-side. In some cases, the rewritten prompt remains the same because no matching filler phrases were found to simplify. In others, subtle wording changes make the prompt shorter and clearer—for example, changing “I want you to act as a linux terminal” to “Please act as a linux terminal.” This demonstrates how small edits can contribute to reducing token usage while preserving the intent, which is key for sustainable AI and energy-efficient prompting.
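
To quantify the savings, original and rewritten lengths can be compared directly. A minimal sketch that counts whitespace-separated words as a rough proxy for tokens, using the example pair quoted above; `word_savings` is a hypothetical helper:

```python
def word_savings(original, rewritten):
    """Return (words saved, fraction saved) using whitespace word counts."""
    before = len(original.split())
    after = len(rewritten.split())
    saved = before - after
    fraction = saved / before if before else 0.0
    return saved, fraction

# Example pair quoted in the interpretation above.
original = "I want you to act as a linux terminal"
rewritten = "Please act as a linux terminal"

saved, fraction = word_savings(original, rewritten)
print(f"Words saved: {saved} ({fraction:.0%} of the original)")
```

For actual token counts, the same comparison could be run on `len(tokenizer.encode(...))` with the `tiktoken` encoder loaded earlier.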


### Semantic Similarity Using Sentence Embeddings

We use pretrained Sentence Transformers to compute cosine similarity between original and rewritten prompts. A high score (~1.0) indicates semantic preservation.


In [28]:
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute embeddings
original_embeddings = model.encode(cleaned_prompts[:100], convert_to_tensor=True)
rewritten_embeddings = model.encode(rewritten_prompts[:100], convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.cos_sim(original_embeddings, rewritten_embeddings).diagonal()

# Display some similarity scores
for i in range(5):
    print(f"Similarity Score {i+1}: {cosine_scores[i]:.4f}")


Similarity Score 1: 1.0000
Similarity Score 2: 1.0000
Similarity Score 3: 0.9291
Similarity Score 4: 0.9086
Similarity Score 5: 0.9186


### Explanation of Semantic Similarity Calculation

In this step, we evaluated how well the rewritten prompts preserved the meaning of the original prompts. Using the `sentence-transformers` model (`all-MiniLM-L6-v2`), we converted both sets of prompts into dense vector embeddings that capture their semantic content.

We then computed the cosine similarity between each original prompt embedding and its rewritten counterpart. Cosine similarity measures the angle between two vectors, with values ranging from -1 (opposite directions) to 1 (same direction); for sentence embeddings, values near 1 indicate closely matching meaning.
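
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch on toy 2-D vectors; the `cosine_similarity` helper is illustrative and is not the `sentence-transformers` implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector leaves the score unchanged, which is why the first pair scores as high as two identical vectors would.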

### Output Interpretation

The similarity scores printed show mostly very high values (close to 1.0), indicating that the rewritten prompts maintain strong semantic equivalence with the originals. For example:

- Scores of **1.0000** occur where no rewrite rule matched and the prompt is unchanged, as with the first prompt.
- Scores around **0.9** indicate minor wording differences but overall preserved intent.

This suggests that our heuristic rewriting shortens prompts without significantly changing their meaning, at least for the 100 prompts scored here, which supports energy-efficient prompt optimization without sacrificing clarity. It is not a complete evaluation, but we see it as a good first step.


### Results Summary

The similarity scores shown above are all above 0.9, indicating that our heuristic edits preserved meaning. This supports the potential of lightweight prompt optimization methods that save tokens while remaining faithful to intent. More evaluation is needed, but it is a promising first step.

This technique can be used in Sustainable AI to:
- Compress long prompts without losing clarity
- Maintain meaning with fewer tokens (compute-efficient)
- Optimize prompt quality for different model contexts


## Part 3: The Workshop


One team member must push the final notebook to GitHub and send the `.git` URL to the instructor before the end of class.

## 🧠 Learning Objectives
- Implement the foundations of **Probabilistic Language Models** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(20 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(65 min)* – NLP Pipeline and four Probabilistic Language Model method implementations + Markdown documentation (work as teams)
3. **Push to GitHub** *(5 min)* – Teams commit and push the one notebook. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will go around, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(1 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email per team)**. The email subject is: PROG8245 - Probabilistic Language Models Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `ProbabilisticLanguageModels.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, Inverted Index and the four methods.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** (1-2 per concept)
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `ProbabilisticLanguageModels`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 🧭 Conclusion

Today you’ve constructed your own basic language model. Next class, we’ll expand these ideas to explore **Large Language Models (LLMs)**—like ChatGPT—which learn patterns over **massive corpora** using **deep neural networks** instead of just counts.