# 🧪 PROG8245 - Probabilistic Language Models Workshop
**Team Members:**
- Parth Patel  
- Fenil Patel (9001279)  
- Adithya 

**Workshop Objective:**
Implement NLP preprocessing and four probabilistic language models with structured code and documentation in a single Jupyter Notebook.



#### Step 2: Imports and Setup (Code)


In [11]:
import os
import nltk

# <<< EDIT THIS PATH IF NEEDED >>>
NLTK_DIR = r"data"  

# Make sure the folder exists
os.makedirs(NLTK_DIR, exist_ok=True)

# Download required packages (nltk.download uses NLTK's default data directory
# unless download_dir=NLTK_DIR is passed)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Tell NLTK to look *first* in this folder (guarded so re-running the cell adds no duplicates)
if NLTK_DIR not in nltk.data.path:
    nltk.data.path.insert(0, NLTK_DIR)

print("NLTK data paths:\n", "\n".join(nltk.data.path))


NLTK data paths:
 data
C:\Users\parth/nltk_data
c:\Users\parth\OneDrive\Desktop\AIM2\Lab8\.venv\nltk_data
c:\Users\parth\OneDrive\Desktop\AIM2\Lab8\.venv\share\nltk_data
c:\Users\parth\OneDrive\Desktop\AIM2\Lab8\.venv\lib\nltk_data
C:\Users\parth\AppData\Roaming\nltk_data
C:\nltk_data
D:\nltk_data
E:\nltk_data


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\parth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\parth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\parth\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# nltk.data.find raises LookupError if a resource is missing
print("punkt path:", nltk.data.find('tokenizers/punkt'))
print("stopwords path:", nltk.data.find('corpora/stopwords'))


punkt path: C:\Users\parth\AppData\Roaming\nltk_data\tokenizers\punkt
stopwords path: C:\Users\parth\AppData\Roaming\nltk_data\corpora\stopwords


## Preprocessing: Tokenization and Normalization

We lowercase the text, tokenize it, drop non-alphabetic tokens (punctuation and numbers), and filter out English stopwords.

#### Tokenizer and Normalizer Pipeline

In [13]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

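# Build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, tokenize, keep alphabetic tokens, and drop English stopwords."""
    text = text.lower()
    tokens = word_tokenize(text)                 # uses punkt
    tokens = [t for t in tokens if t.isalpha()]  # keep alphabetic only (drops punctuation and numbers)
    return [t for t in tokens if t not in STOPWORDS]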


In [14]:
corpus = {
    'doc1': "Natural language processing allows computers to understand human language.",
    'doc2': "Language models predict the next word in a sequence.",
    'doc3': "Probabilistic models estimate the likelihood of word sequences.",
    'doc4': "This workshop covers unigram, bigram, trigram, and backoff models."
}

preprocessed_corpus = {doc_id: preprocess(text) for doc_id, text in corpus.items()}
preprocessed_corpus


{'doc1': ['natural',
  'language',
  'processing',
  'allows',
  'computers',
  'understand',
  'human',
  'language'],
 'doc2': ['language', 'models', 'predict', 'next', 'word', 'sequence'],
 'doc3': ['probabilistic',
  'models',
  'estimate',
  'likelihood',
  'word',
  'sequences'],
 'doc4': ['workshop',
  'covers',
  'unigram',
  'bigram',
  'trigram',
  'backoff',
  'models']}

#### Inverted Index

In [15]:

from collections import defaultdict
inverted_index = defaultdict(set)

for doc_id, tokens in preprocessed_corpus.items():
    for token in set(tokens):  # Use set to avoid duplicate entries
        inverted_index[token].add(doc_id)

# Convert sets to sorted lists for readability
inverted_index = {k: sorted(v) for k, v in inverted_index.items()}
inverted_index

{'human': ['doc1'],
 'natural': ['doc1'],
 'computers': ['doc1'],
 'language': ['doc1', 'doc2'],
 'understand': ['doc1'],
 'processing': ['doc1'],
 'allows': ['doc1'],
 'predict': ['doc2'],
 'sequence': ['doc2'],
 'models': ['doc2', 'doc3', 'doc4'],
 'word': ['doc2', 'doc3'],
 'next': ['doc2'],
 'probabilistic': ['doc3'],
 'estimate': ['doc3'],
 'likelihood': ['doc3'],
 'sequences': ['doc3'],
 'trigram': ['doc4'],
 'unigram': ['doc4'],
 'covers': ['doc4'],
 'workshop': ['doc4'],
 'backoff': ['doc4'],
 'bigram': ['doc4']}
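
The index above can answer simple boolean queries. The following is a minimal sketch, not part of the original notebook: `search_all` is a hypothetical helper that intersects posting lists (boolean AND), and the small `inverted_index` literal copies three entries from the output above so the cell runs on its own.

```python
# Three entries copied from the inverted index output above
inverted_index = {
    'language': ['doc1', 'doc2'],
    'models': ['doc2', 'doc3', 'doc4'],
    'word': ['doc2', 'doc3'],
}

def search_all(index, terms):
    """Hypothetical helper: sorted doc ids that contain every query term."""
    postings = [set(index.get(t, [])) for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

print(search_all(inverted_index, ['language', 'models']))   # ['doc2']
print(search_all(inverted_index, ['models', 'word']))       # ['doc2', 'doc3']
print(search_all(inverted_index, ['language', 'unknown']))  # []
```

A term that is missing from the index yields an empty posting set, so any AND query containing it returns no documents.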

#### Unigram Model

In [17]:
from collections import Counter


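class UnigramModel:
    def __init__(self, corpus):
        # Flatten all documents into a single token list
        self.tokens = [t for tokens in corpus.values() for t in tokens]
        self.freq = Counter(self.tokens)
        self.total = len(self.tokens)

    def prob(self, word):
        # Maximum-likelihood estimate; unseen words (or an empty corpus) get 0
        return self.freq[word] / self.total if self.total else 0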

# Usage
unigram = UnigramModel(preprocessed_corpus)
print(f"P('language') = {unigram.prob('language'):.4f}")

P('language') = 0.1111


#### Bigram Model

In [19]:
class BigramModel:
    def __init__(self, corpus):
        self.bigrams = Counter()
        self.unigrams = Counter()
        for tokens in corpus.values():
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens[:-1], tokens[1:]))

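    def prob(self, w1, w2):
        # MLE: count(w1, w2) / count(w1). count(w1) also includes document-final
        # occurrences of w1, so P(. | w1) can sum to less than 1.
        return self.bigrams[(w1, w2)] / self.unigrams[w1] if self.unigrams[w1] else 0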

# Usage
bigram = BigramModel(preprocessed_corpus)
print(f"P('language' | 'human') = {bigram.prob('human', 'language'):.4f}")

P('language' | 'human') = 1.0000


#### Trigram Model

In [22]:
class TrigramModel:
    def __init__(self, corpus):
        self.trigrams = Counter()
        self.bigrams = Counter()
        for tokens in corpus.values():
            self.bigrams.update(zip(tokens[:-1], tokens[1:]))
            self.trigrams.update(zip(tokens[:-2], tokens[1:-1], tokens[2:]))

    def prob(self, w1, w2, w3):
        return self.trigrams[(w1, w2, w3)] / self.bigrams[(w1, w2)] if self.bigrams[(w1, w2)] else 0

# Usage
trigram = TrigramModel(preprocessed_corpus)
print(f"P('processing' | 'natural', 'language') = {trigram.prob('natural', 'language', 'processing'):.4f}")

P('processing' | 'natural', 'language') = 1.0000


#### Backoff Model

In [None]:
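class BackoffModel:
    """Combines trigram, bigram, and unigram estimates with fixed weights.

    Note: this is linear interpolation. prob() always mixes all three estimates;
    it does not fall back only when a higher-order count is zero.
    """
    def __init__(self, corpus, lambda1=0.1, lambda2=0.3, lambda3=0.6):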
        self.uni = UnigramModel(corpus)
        self.bi = BigramModel(corpus)
        self.tri = TrigramModel(corpus)
        self.lambda1 = lambda1
        self.lambda2 = lambda2
        self.lambda3 = lambda3

    def prob(self, w1, w2, w3):
        p3 = self.tri.prob(w1, w2, w3)
        p2 = self.bi.prob(w2, w3)
        p1 = self.uni.prob(w3)
        return self.lambda3 * p3 + self.lambda2 * p2 + self.lambda1 * p1

# Usage
backoff = BackoffModel(preprocessed_corpus)
print(f"P('processing' | 'natural', 'language') = {backoff.prob('natural', 'language', 'processing'):.4f}")

P('processing' | 'natural', 'language') = 0.7037
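
As a check on the value above, the weighted sum computed by `prob` can be worked through by hand. The counts come from the preprocessed corpus shown earlier (27 tokens in total; 'language' occurs 3 times, 'processing' once); the symbol $P_{\text{mix}}$ is introduced here only for illustration.

$$
P_{\text{mix}}(w_3 \mid w_1, w_2) = \lambda_3 \, P(w_3 \mid w_1, w_2) + \lambda_2 \, P(w_3 \mid w_2) + \lambda_1 \, P(w_3)
$$

$$
P_{\text{mix}}(\text{processing} \mid \text{natural}, \text{language}) = 0.6 \cdot \frac{1}{1} + 0.3 \cdot \frac{1}{3} + 0.1 \cdot \frac{1}{27} \approx 0.6 + 0.1 + 0.0037 = 0.7037
$$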


### Peer Talking Points – Probabilistic Language Models

Let's explore some key discussion questions based on each part of our workshop:

---

#### Preprocessing
- Do you think adding **lemmatization** on top of lowercasing and stopword removal would give us better results on a small dataset?
- Should we always remove punctuation and numbers? Could they be useful in other NLP tasks such as sentiment analysis or timestamp extraction?
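
To make the second question concrete, here is a small sketch of what an alphabetic-only filter discards. It assumes a hypothetical regex tokenizer, `tokenize_keep_numbers`, written only for this illustration; it is not NLTK's `word_tokenize` and the pattern is one possible choice among many.

```python
import re

def tokenize_keep_numbers(text):
    """Hypothetical tokenizer: lowercase words plus numbers such as 9:30 or 3.14."""
    return re.findall(r"[a-z]+|\d+(?:[:.]\d+)*", text.lower())

sentence = "Meeting at 9:30 on 2024-05-01!"
tokens = tokenize_keep_numbers(sentence)
alpha_only = [t for t in tokens if t.isalpha()]  # same filter as preprocess()

print(tokens)      # ['meeting', 'at', '9:30', 'on', '2024', '05', '01']
print(alpha_only)  # ['meeting', 'at', 'on']
```

The time and date vanish entirely under the `isalpha()` filter, which is harmless for the toy language models here but would matter for any task that depends on them.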

---

#### Inverted Index
- In what ways does building an **inverted index** help with tasks like search and document retrieval?
- By using `set(tokens)` to remove duplicates, are we potentially **losing frequency information** that might help in scoring or ranking documents?
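
One way to keep the frequency information is to store a count alongside each document id. The sketch below is an assumption about how that could look, not the notebook's implementation; `tf_index` and the two-document `docs` sample (copied from the preprocessed output) are illustrative names.

```python
from collections import Counter, defaultdict

# Two documents copied from the preprocessed corpus
docs = {
    'doc1': ['natural', 'language', 'processing', 'allows',
             'computers', 'understand', 'human', 'language'],
    'doc2': ['language', 'models', 'predict', 'next', 'word', 'sequence'],
}

# token -> {doc_id: number of occurrences in that document}
tf_index = defaultdict(dict)
for doc_id, tokens in docs.items():
    for token, count in Counter(tokens).items():
        tf_index[token][doc_id] = count

print(tf_index['language'])  # {'doc1': 2, 'doc2': 1}
print(tf_index['models'])    # {'doc2': 1}
```

With counts in the postings, a ranking function could prefer 'doc1' over 'doc2' for the query 'language', which the set-based index cannot express.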

---

#### Unigram Model
- Since the **unigram model treats each word independently**, what are some limitations when trying to model natural language this way?
- What happens if we ask the model to predict a word it’s never seen before? Should we consider **smoothing techniques** like Laplace smoothing?
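
A minimal sketch of add-one (Laplace) smoothing for the unigram case follows. `LaplaceUnigram` is a hypothetical class, and reserving one extra vocabulary slot for all unknown words is an assumption made here so that the probabilities sum to 1. The token string reproduces the 27 preprocessed tokens from the corpus above.

```python
from collections import Counter

# The 27 preprocessed tokens from doc1..doc4, flattened
tokens = (
    "natural language processing allows computers understand human language "
    "language models predict next word sequence "
    "probabilistic models estimate likelihood word sequences "
    "workshop covers unigram bigram trigram backoff models"
).split()

class LaplaceUnigram:
    """Hypothetical add-one smoothed unigram model."""
    def __init__(self, tokens):
        self.freq = Counter(tokens)
        self.total = len(tokens)
        # Assumption: +1 reserves probability mass for a single unknown-word type
        self.vocab_size = len(self.freq) + 1

    def prob(self, word):
        return (self.freq[word] + 1) / (self.total + self.vocab_size)

model = LaplaceUnigram(tokens)
print(f"P('language') = {model.prob('language'):.4f}")  # (3 + 1) / (27 + 23) = 0.0800
print(f"P('zebra')    = {model.prob('zebra'):.4f}")     # (0 + 1) / (27 + 23) = 0.0200
```

Compared with the unsmoothed estimate of 0.1111 for 'language', smoothing moves some probability mass from seen words to unseen ones.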

---

#### Bigram Model
- How much does **word order** matter in bigram models? For example, is "language model" the same as "model language" to the model?
- If we have a larger corpus, how can we deal with **rare bigrams** or completely missing word pairs?
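
Both questions can be probed on the workshop corpus. The sketch below rebuilds the bigram counts from the preprocessed tokens and compares the two word orders; `mle` and `add_one` are illustrative helper names, and add-one smoothing is just one simple option for unseen pairs, not a recommendation.

```python
from collections import Counter

# Preprocessed documents from the corpus above
docs = [
    "natural language processing allows computers understand human language".split(),
    "language models predict next word sequence".split(),
    "probabilistic models estimate likelihood word sequences".split(),
    "workshop covers unigram bigram trigram backoff models".split(),
]

unigrams, bigrams = Counter(), Counter()
for tokens in docs:
    unigrams.update(tokens)
    bigrams.update(zip(tokens[:-1], tokens[1:]))

V = len(unigrams)  # vocabulary size (22 for this corpus)

def mle(w1, w2):
    """Unsmoothed estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def add_one(w1, w2):
    """Add-one smoothed estimate of P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(f"{mle('language', 'models'):.4f}")      # 0.3333
print(f"{mle('models', 'language'):.4f}")      # 0.0000
print(f"{add_one('language', 'models'):.4f}")  # 0.0800
print(f"{add_one('models', 'language'):.4f}")  # 0.0400
```

The two orders get different scores, and the smoothed version gives the unseen pair a small non-zero probability at the cost of lowering the seen one.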

---

#### Trigram Model
- Does adding a third word (trigram) improve prediction significantly, or does it just make things **more sparse**?
- Can we back off from **unseen trigrams** to simpler bigram or unigram estimates, or cluster similar contexts together?
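
Sparsity can be measured directly by comparing the number of distinct n-grams observed with the number that the vocabulary could form. This is a rough sketch on the four workshop documents; `ngrams` is a hypothetical helper, and treating every vocabulary combination as "possible" is a deliberate simplification.

```python
# Preprocessed documents from the corpus above
docs = [
    "natural language processing allows computers understand human language".split(),
    "language models predict next word sequence".split(),
    "probabilistic models estimate likelihood word sequences".split(),
    "workshop covers unigram bigram trigram backoff models".split(),
]

def ngrams(tokens, n):
    """All contiguous n-grams within one document."""
    return list(zip(*(tokens[i:] for i in range(n))))

vocab = {t for tokens in docs for t in tokens}
observed = {n: {g for tokens in docs for g in ngrams(tokens, n)} for n in (1, 2, 3)}

for n in (1, 2, 3):
    possible = len(vocab) ** n
    print(f"n={n}: {len(observed[n])} observed of {possible} possible")
# n=1: 22 observed of 22 possible
# n=2: 23 observed of 484 possible
# n=3: 19 observed of 10648 possible
```

On this corpus the observed fraction drops sharply as n grows, which is why almost every trigram query returns 0 without some form of smoothing or backoff.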

---

#### Backoff Model
- How do the **lambda weights** in our backoff model influence the final prediction? Should they be fixed or learned?
- Would using a **more advanced smoothing method** like Kneser-Ney or Katz improve accuracy in low-data settings?
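
For contrast with the fixed-weight mixture used in the notebook, here is a sketch of a scheme that only falls back when the higher-order n-gram is unseen. It follows the "stupid backoff" idea with a fixed penalty `alpha=0.4`; this is a swapped-in technique rather than Katz or Kneser-Ney, the function name `backoff_score` is hypothetical, and the returned scores are not normalized probabilities.

```python
from collections import Counter

# Preprocessed documents from the corpus above
docs = [
    "natural language processing allows computers understand human language".split(),
    "language models predict next word sequence".split(),
    "probabilistic models estimate likelihood word sequences".split(),
    "workshop covers unigram bigram trigram backoff models".split(),
]

uni, bi, tri = Counter(), Counter(), Counter()
for tokens in docs:
    uni.update(tokens)
    bi.update(zip(tokens, tokens[1:]))
    tri.update(zip(tokens, tokens[1:], tokens[2:]))
total = sum(uni.values())

def backoff_score(w1, w2, w3, alpha=0.4):
    """Score w3 given (w1, w2) using the highest-order n-gram that was seen."""
    if tri[(w1, w2, w3)]:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)]:
        return alpha * bi[(w2, w3)] / uni[w2]  # one backoff step
    return alpha * alpha * uni[w3] / total     # two backoff steps

print(f"{backoff_score('natural', 'language', 'processing'):.4f}")  # 1.0000 (trigram seen)
print(f"{backoff_score('human', 'language', 'models'):.4f}")        # 0.1333 (bigram used)
print(f"{backoff_score('covers', 'word', 'language'):.4f}")         # 0.0178 (unigram used)
```

Unlike the interpolated model, a seen trigram is scored from trigram counts alone, and lower-order counts only matter when the higher-order ones are missing.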

---


