|**Student name:** |PAV Limseng|
|---|---|
|**Student ID:** |e20211548|

# TP6: N-gram & Language Model 

## Exercise 1: N-gram Model

**N-gram Modeling:** Analyze how additional context improves prediction \
**Corpus:**
"artificial intelligence improves data analysis, artificial intelligence powers modern applications." \
**Instructions:** 
1. Tokenize and preprocess the corpus (lowercase, punctuation as tokens) 
2. Build a bigram model and a trigram model 
3. Compute probabilities using Maximum Likelihood Estimation (MLE) 
4. Predict the next word for the context: "artificial intelligence" 
5. Compare the predictions of the bigram model and the trigram model

#### 1. Tokenize and preprocess the corpus


In [1]:
from TP6_utils import tokenize

corpus = "artificial intelligence improves data analysis, artificial intelligence powers modern applications."
tokens = tokenize(corpus)

print("TOKENS:", tokens)

TOKENS: ['artificial', 'intelligence', 'improves', 'data', 'analysis', ',', 'artificial', 'intelligence', 'powers', 'modern', 'applications', '.']
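
The implementation of `tokenize` lives in `TP6_utils` and is not shown in this notebook. A minimal sketch that reproduces the token list above, assuming a simple regex-based approach (the name `tokenize_sketch` is hypothetical, and the real helper may differ):

```python
import re

def tokenize_sketch(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens_sketch = tokenize_sketch(
    "artificial intelligence improves data analysis, "
    "artificial intelligence powers modern applications."
)
print("TOKENS:", tokens_sketch)
```

Keeping punctuation as separate tokens means `','` and `'.'` take part in the n-gram counts like any other word.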


#### 2. Build a bigram model and a trigram model 


In [2]:
from collections import Counter

bigram_counts = Counter()
unigram_counts = Counter()

trigram_counts = Counter()
bigram_context_counts = Counter()  # counts for (w1, w2) contexts in trigram model

In [3]:
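# Count every adjacent pair (w1, w2) and, in the same pass, its left-hand word w1
for i in range(len(tokens) - 1):
    w1, w2 = tokens[i], tokens[i + 1]
    unigram_counts[w1] += 1
    bigram_counts[(w1, w2)] += 1
# The loop stops one token before the end, so count the final token separately
unigram_counts[tokens[-1]] += 1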

In [4]:
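# Count every triple (w1, w2, w3) together with its two-word context (w1, w2).
# Only contexts that are followed by a third token are counted, so these
# context counts are the matching denominators for the trigram MLE.
for i in range(len(tokens) - 2):
    w1, w2, w3 = tokens[i], tokens[i + 1], tokens[i + 2]
    trigram_counts[(w1, w2, w3)] += 1
    bigram_context_counts[(w1, w2)] += 1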

#### 3. Compute probabilities using Maximum Likelihood Estimation (MLE) 


In [5]:
from TP6_utils import bigram_mle_prob, trigram_mle_prob
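
The two imported helpers are defined in `TP6_utils` and their bodies are not shown here. A minimal sketch of the MLE estimates they presumably compute, P(w2 | w1) = C(w1, w2) / C(w1) and P(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2). The function names, argument order, and the choice to return 0.0 for an unseen context are all assumptions:

```python
from collections import Counter

def bigram_mle_sketch(w1, w2, unigram_counts, bigram_counts):
    """P(w2 | w1) = C(w1, w2) / C(w1); 0.0 when w1 was never seen."""
    denom = unigram_counts[w1]
    return bigram_counts[(w1, w2)] / denom if denom else 0.0

def trigram_mle_sketch(w1, w2, w3, bigram_context_counts, trigram_counts):
    """P(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2); 0.0 for an unseen context."""
    denom = bigram_context_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / denom if denom else 0.0

# Rebuild the counts from the Exercise 1 token list so the block runs on its own
toks = ['artificial', 'intelligence', 'improves', 'data', 'analysis', ',',
        'artificial', 'intelligence', 'powers', 'modern', 'applications', '.']
uni = Counter(toks)
bi = Counter(zip(toks, toks[1:]))
ctx = Counter(zip(toks[:-2], toks[1:-1]))  # (w1, w2) pairs that have a following token
tri = Counter(zip(toks, toks[1:], toks[2:]))

p_bigram = bigram_mle_sketch("intelligence", "powers", uni, bi)
p_trigram = trigram_mle_sketch("artificial", "intelligence", "powers", ctx, tri)
print(p_bigram, p_trigram)
```

A `Counter` returns 0 for a missing key, so an unseen n-gram gets probability 0 without any special casing.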

#### 4. Predict the next word for the context: "artificial intelligence" 

In [6]:
from TP6_utils import predict_next_bigram, predict_next_trigram

context = ("artificial", "intelligence")

# Bigram uses only the last word as context
bigram_predictions = predict_next_bigram(
    context_word=context[1],
    top_k=10,
    unigram_counts=unigram_counts,
    bigram_counts=bigram_counts,
)

print("\n--- Bigram prediction (context = 'intelligence') ---")
for w, p in bigram_predictions:
    print(f"P({w!r} | 'intelligence') = {p:.4f}")


--- Bigram prediction (context = 'intelligence') ---
P('improves' | 'intelligence') = 0.5000
P('powers' | 'intelligence') = 0.5000


In [7]:
# Trigram uses the two-word context
trigram_predictions = predict_next_trigram(
    w1=context[0],
    w2=context[1],
    top_k=10,
    bigram_context_counts=bigram_context_counts,
    trigram_counts=trigram_counts,
)

print("\n--- Trigram prediction (context = 'artificial intelligence') ---")
for w, p in trigram_predictions:
    print(f"P({w!r} | 'artificial intelligence') = {p:.4f}")


--- Trigram prediction (context = 'artificial intelligence') ---
P('improves' | 'artificial intelligence') = 0.5000
P('powers' | 'artificial intelligence') = 0.5000
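
The prediction helpers also come from `TP6_utils`. A sketch of how such functions could rank candidates, assuming they score every continuation seen after the context with its MLE probability and return the `top_k` best. The `_sketch` names are hypothetical, the keyword names mirror the calls above, and breaking ties alphabetically is an assumption:

```python
from collections import Counter

def predict_next_bigram_sketch(context_word, top_k, unigram_counts, bigram_counts):
    """Rank words w by P(w | context_word), highest first."""
    denom = unigram_counts[context_word]
    if denom == 0:
        return []  # unseen context: nothing to predict
    scored = [(w2, c / denom)
              for (w1, w2), c in bigram_counts.items() if w1 == context_word]
    return sorted(scored, key=lambda wp: (-wp[1], wp[0]))[:top_k]

def predict_next_trigram_sketch(w1, w2, top_k, bigram_context_counts, trigram_counts):
    """Rank words w by P(w | w1, w2), highest first."""
    denom = bigram_context_counts[(w1, w2)]
    if denom == 0:
        return []
    scored = [(c3, c / denom)
              for (a, b, c3), c in trigram_counts.items() if (a, b) == (w1, w2)]
    return sorted(scored, key=lambda wp: (-wp[1], wp[0]))[:top_k]

toks = ['artificial', 'intelligence', 'improves', 'data', 'analysis', ',',
        'artificial', 'intelligence', 'powers', 'modern', 'applications', '.']
uni = Counter(toks)
bi = Counter(zip(toks, toks[1:]))
ctx = Counter(zip(toks[:-2], toks[1:-1]))
tri = Counter(zip(toks, toks[1:], toks[2:]))

bigram_top = predict_next_bigram_sketch("intelligence", 10, uni, bi)
trigram_top = predict_next_trigram_sketch("artificial", "intelligence", 10, ctx, tri)
print(bigram_top)
print(trigram_top)
```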


#### 5. Compare the predictions of the bigram and trigram models

In [8]:
print("\n=== Comparison Summary ===")
print("Bigram top:", bigram_predictions[:3])
print("Trigram top:", trigram_predictions[:3])


=== Comparison Summary ===
Bigram top: [('improves', 0.5), ('powers', 0.5)]
Trigram top: [('improves', 0.5), ('powers', 0.5)]
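
The two models give identical predictions on this corpus. A small check, not part of the original notebook, suggests why: every occurrence of "intelligence" is preceded by "artificial", so the extra context word carries no additional information here.

```python
from collections import Counter

tokens_demo = ['artificial', 'intelligence', 'improves', 'data', 'analysis', ',',
               'artificial', 'intelligence', 'powers', 'modern', 'applications', '.']

# Which words appear immediately before "intelligence"?
preceding = Counter(prev for prev, word in zip(tokens_demo, tokens_demo[1:])
                    if word == "intelligence")
print(preceding)  # Counter({'artificial': 2})
```

Since the context `('artificial', 'intelligence')` covers exactly the same two positions as the context `'intelligence'`, both models count the same continuations. The distributions could only differ in a corpus where "intelligence" also follows other words.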


## Exercise 2: Data Sparsity & Smoothing Techniques

**Smoothing:** Analyze how smoothing changes the probability distribution and model behavior

**Instructions:**
1. Use the corpus: "students study machine learning. Students study Data Science." 
2. Build a bigram model without smoothing 
3. Apply Laplace smoothing to the same model 
4. Use the test sentence: "students study ai." 
5. Compute: 
    - Sentence probability without smoothing 
    - Sentence probability with smoothing

#### 1. Use the corpus: "students study machine learning. Students study Data Science." 


In [9]:
corpus2 = "students study machine learning. Students study Data Science."
tokens2 = tokenize(corpus2)

print("TOKENS 2:", tokens2)

TOKENS 2: ['students', 'study', 'machine', 'learning', '.', 'students', 'study', 'data', 'science', '.']


#### 2. Build a bigram model without smoothing 

In [10]:
unigram_counts2 = Counter(tokens2)
bigram_counts2 = Counter(zip(tokens2[:-1], tokens2[1:]))

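# "ai" occurs only in the test sentence; add it to the vocabulary so that
# Laplace smoothing includes it in the vocabulary size V2
vocab2 = set(tokens2)
vocab2.add("ai")
V2 = len(vocab2)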

#### 3. Apply Laplace smoothing to the same model 

In [11]:
from TP6_utils import sentence_prob_no_smoothing, sentence_prob_laplace
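
The bodies of these two helpers are in `TP6_utils`. A sketch under the assumption that a sentence probability is the product of its bigram probabilities, with no separate start-of-sentence term. The `_sketch` names are hypothetical and the argument order mirrors the calls used in this exercise:

```python
import math
from collections import Counter

def sentence_prob_no_smoothing_sketch(tokens, unigram_counts, bigram_counts):
    """Product of MLE bigram probabilities; 0.0 as soon as a context is unseen."""
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        if unigram_counts[w1] == 0:
            return 0.0
        prob *= bigram_counts[(w1, w2)] / unigram_counts[w1]
    return prob

def sentence_prob_laplace_sketch(tokens, unigram_counts, bigram_counts, vocab_size):
    """Product of add-one smoothed bigram probabilities (C(w1,w2)+1) / (C(w1)+V)."""
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)
    return prob

train = ['students', 'study', 'machine', 'learning', '.',
         'students', 'study', 'data', 'science', '.']
uni = Counter(train)
bi = Counter(zip(train, train[1:]))
V = len(set(train) | {"ai"})  # training vocabulary plus the unseen test word

test = ['students', 'study', 'ai', '.']
p_raw = sentence_prob_no_smoothing_sketch(test, uni, bi)
p_smooth = sentence_prob_laplace_sketch(test, uni, bi, V)
print(p_raw, p_smooth)
```

Under these assumptions the unsmoothed product is exactly 0.0, because the bigram `('study', 'ai')` never occurs in training, while the smoothed product is small but positive.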

#### 4. Use the test sentence: "students study ai." 


In [12]:
test_sentence = "students study ai."
test_tokens = tokenize(test_sentence)

print("Sentence probability (no smoothing):",
      sentence_prob_no_smoothing(test_tokens, unigram_counts2, bigram_counts2))

print("Sentence probability (Laplace smoothing):",
      sentence_prob_laplace(test_tokens, unigram_counts2, bigram_counts2, V2))

Sentence probability (no smoothing): 0.0
Sentence probability (Laplace smoothing): 0.00375
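
The two values can be checked by hand, assuming the sentence probability is the product of the three bigram probabilities (no start-of-sentence term) and $V = 8$ (seven training types plus "ai"). The Laplace estimate is

$$
P_{\text{Laplace}}(w_2 \mid w_1) = \frac{C(w_1, w_2) + 1}{C(w_1) + V}
$$

so for "students study ai ." the smoothed product is

$$
\frac{2+1}{2+8} \cdot \frac{0+1}{2+8} \cdot \frac{0+1}{0+8} = 0.3 \times 0.1 \times 0.125 = 0.00375
$$

Without smoothing, the second factor is $P(\text{ai} \mid \text{study}) = 0/2 = 0$, which forces the whole product to 0.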


## Exercise 3: Evaluate language models using Perplexity 

**Instructions:** 
1. Choose one dataset: 
    - Brown corpus (NLTK), or Wikipedia 
2. Preprocess the text (tokenization, lowercase) 
3. Split the dataset: 80% for training, 20% for testing 
4. Train two bigram models: 
    - Without smoothing 
    - With Laplace smoothing 
5. Compute perplexity on the test set for both models 
6. Compare the perplexity scores

#### 1. Choose one dataset: Brown corpus (NLTK), or Wikipedia 

In [13]:
import nltk

nltk.download("brown")
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\limse\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


#### 2. Preprocess the text (tokenization, lowercase) 

In [14]:
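# brown.words() is already tokenized, so only lowercasing is needed
tokens3 = [w.lower() for w in brown.words()]
# Wrap the whole corpus in one pair of boundary markers (not one pair per sentence)
tokens3 = ["<s>"] + tokens3 + ["</s>"]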

print("Total tokens in Brown corpus:", len(tokens3))

Total tokens in Brown corpus: 1161194


#### 3. Split the dataset: 80% for training, 20% for testing 

In [15]:
split_idx3 = int(0.8 * len(tokens3))
train_tokens3 = tokens3[:split_idx3]
test_tokens3 = tokens3[split_idx3:]

In [16]:
print("Train tokens:", len(train_tokens3))
print("Test tokens:", len(test_tokens3))

Train tokens: 928955
Test tokens: 232239


In [17]:
unigram3 = Counter(train_tokens3)
bigram3 = Counter(zip(train_tokens3[:-1], train_tokens3[1:]))

vocab3 = set(train_tokens3)
V3 = len(vocab3)

#### 4. Train two bigram models:

- Without smoothing
- With Laplace smoothing

In [18]:
from TP6_utils import bigram_mle, bigram_laplace

#### 5. Compute perplexity on the test set for both models 


In [19]:
from TP6_utils import perplexity
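
The `perplexity` helper is defined in `TP6_utils`. A sketch of one common formulation, assuming it takes a token list and a function `prob_fn(w1, w2)` as in the calls in this exercise, and computes exp of the negative mean log-probability over the bigrams. The name `perplexity_sketch` and the early return of infinity are assumptions; the toy data reuses the Exercise 2 corpus:

```python
import math
from collections import Counter

def perplexity_sketch(tokens, prob_fn):
    """exp(-(1/N) * sum(log P(w_i | w_{i-1}))) over the N bigrams in tokens."""
    log_prob_sum = 0.0
    n = 0
    for w1, w2 in zip(tokens, tokens[1:]):
        p = prob_fn(w1, w2)
        if p == 0:
            return math.inf  # log(0) is undefined: one zero makes perplexity infinite
        log_prob_sum += math.log(p)
        n += 1
    return math.exp(-log_prob_sum / n) if n else math.inf

train = ['students', 'study', 'machine', 'learning', '.',
         'students', 'study', 'data', 'science', '.']
test = ['students', 'study', 'ai', '.']
uni = Counter(train)
bi = Counter(zip(train, train[1:]))
V = len(set(train))  # vocabulary from training data only

pp_mle_toy = perplexity_sketch(
    test, lambda w1, w2: bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0)
pp_laplace_toy = perplexity_sketch(
    test, lambda w1, w2: (bi[(w1, w2)] + 1) / (uni[w1] + V))
print(pp_mle_toy, pp_laplace_toy)
```

A single test bigram that never occurred in training has MLE probability 0, which is enough to make the unsmoothed perplexity infinite. Laplace smoothing gives every bigram a nonzero probability, so its perplexity stays finite.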

#### 6. Compare the perplexity scores

In [21]:
pp_mle = perplexity(
    test_tokens3,
    lambda w1, w2: bigram_mle(w1, w2, unigram3, bigram3)
)
# Laplace (add-one) estimate written inline: (count(w1, w2) + 1) / (count(w1) + V3)
pp_laplace = perplexity(
    test_tokens3,
    lambda w1, w2: (bigram3[(w1, w2)] + 1) / (unigram3[w1] + V3)
)


print("Perplexity (Bigram MLE, no smoothing):", pp_mle)
print("Perplexity (Bigram + Laplace):", pp_laplace)

Perplexity (Bigram MLE, no smoothing): inf
Perplexity (Bigram + Laplace): 4938.761222455625
