# **3-Language Models**


#### **Group Members**
- Elmosius Suli (2272008)
- Christopher Wijaya (2272016)
- Josephine Alvina Luwia (2272029)
- Samuel Setyawan Prakasa (2272030)



<br/>

**Source:** [Stanford N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf)

---

<br/>



 <br/>


**Import the regex library first**

In [None]:
import re

#### **3.1 Write out the equation for trigram probability estimation (modifying Eq. 3.11). Now write out all the non-zero trigram probabilities for the I am Sam corpus on page 4.**



Sentence: `I am Sam`

**Bigram** formula: `P(wn | wn-1) = C(wn-1,wn) / C(wn-1)`

**Trigram** formula: `P(w3 | w1,w2) = C(w1,w2,w3) / C(w1,w2)`  
*   `C(w1,w2,w3)` = the number of times the three words occur consecutively
*   `C(w1,w2)` = the number of times the first two words occur consecutively.

Probability of the word `Sam` following the two words `I am`, counted over the corpus below:

*   `C(I,am,Sam)` = 1
*   `C(I,am)` = 2
*   `P(Sam | I am)` = `1/2` = `0.5`

The example corpus (with the `<s>` and `</s>` symbols):
* `<s> I am Sam </s>`
* `<s> Sam I am </s>`
* `<s> I do not like green eggs and ham </s>`









In [None]:
# Counter makes it easy to count how many times each n-gram occurs
from collections import Counter

# Corpus
sentences = [
    ["<s>", "I", "am", "Sam", "</s>"],
    ["<s>", "Sam", "I", "am", "</s>"],
    ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]
]

# Two-word sequences
bigrams = []
# Three-word sequences
trigrams = []

for sent in sentences:
    for i in range(len(sent) - 1):
        # Take 2 consecutive words
        bigrams.append(tuple(sent[i:i+2]))
    for i in range(len(sent) - 2):
        # Take 3 consecutive words
        trigrams.append(tuple(sent[i:i+3]))

# Count how many times each bigram and trigram occurs across all sentences
bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)

# Compute the trigram probabilities: C(w1,w2,w3) / C(w1,w2)
trigram_probs = {}
for trigram in trigram_counts:
    prefix = trigram[:2]  # the two preceding words
    trigram_probs[trigram] = trigram_counts[trigram] / bigram_counts[prefix]

# Print every trigram with a non-zero probability
for trigram, prob in trigram_probs.items():
    print(f"P({trigram[2]} | {trigram[0]} {trigram[1]}) = {prob:.2f}")

P(am | <s> I) = 0.50
P(Sam | I am) = 0.50
P(</s> | am Sam) = 1.00
P(I | <s> Sam) = 1.00
P(am | Sam I) = 1.00
P(</s> | I am) = 0.50
P(do | <s> I) = 0.50
P(not | I do) = 1.00
P(like | do not) = 1.00
P(green | not like) = 1.00
P(eggs | like green) = 1.00
P(and | green eggs) = 1.00
P(ham | eggs and) = 1.00
P(</s> | and ham) = 1.00
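
As a quick sanity check on the list above, the trigram probabilities that share the same two-word context should sum to 1. The following is a minimal sketch, assuming the same corpus and the same single `<s>` padding as the cell above; the name `context_totals` is introduced here only for illustration.

```python
from collections import Counter, defaultdict

# Same three-sentence corpus, with one <s> and one </s> per sentence.
sentences = [
    ["<s>", "I", "am", "Sam", "</s>"],
    ["<s>", "Sam", "I", "am", "</s>"],
    ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

bigram_counts = Counter(
    tuple(s[i:i + 2]) for s in sentences for i in range(len(s) - 1)
)
trigram_counts = Counter(
    tuple(s[i:i + 3]) for s in sentences for i in range(len(s) - 2)
)

# Sum the MLE trigram probabilities that share the same two-word context.
context_totals = defaultdict(float)
for (w1, w2, w3), count in trigram_counts.items():
    context_totals[(w1, w2)] += count / bigram_counts[(w1, w2)]

for context, total in context_totals.items():
    print(context, round(total, 10))
```

Contexts such as `<s> I` and `I am` have two possible continuations with probability 0.5 each, which is why they appear twice in the printed trigram list.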


---


#### **3.2 Calculate the probability of the sentence i want chinese food. Give two probabilities, one using Fig. 3.2 and the 'useful probabilities' just below it on page 6, and another using the add-1 smoothed table in Fig. 3.7. Assume the additional add-1 smoothed probabilities `P(i|<s>) = 0.19` and `P(</s>|food) = 0.40`.**

**Fig. 3.2**

Sentence: 'i want chinese food'

Formula: `P(i want chinese food) = P(i|<s>) x P(want|i) x P(chinese|want) x P(food|chinese) x P(</s>|food)`

Probabilities taken from Fig. 3.2 and the 'useful probabilities':
1. `P(i|<s>)`: the probability that the sentence starts with "i" (the start of a sentence is marked by `<s>`). Taken from the 'useful probabilities' list, so `P(i|<s>) = 0.25`
2. `P(want|i)`: the probability of "want" following "i". Taken from the table at row "i", column "want", so `P(want|i) = 0.33`
3. `P(chinese|want)`: the probability of "chinese" following "want". Taken from the table at row "want", column "chinese", so `P(chinese|want) = 0.0065`
4. `P(food|chinese)`: the probability of "food" following "chinese". Taken from the table at row "chinese", column "food", so `P(food|chinese) = 0.52`
5. `P(</s>|food)`: the probability that the sentence ends after "food". Taken from the 'useful probabilities' list, so `P(</s>|food) = 0.68`

Final calculation:

`P(sentence) = 0.25 x 0.33 x 0.0065 x 0.52 x 0.68`

`P(sentence) = 0.000189618`
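
The source chapter notes that language-model probabilities are usually stored and combined as log probabilities, since multiplying many small numbers risks numerical underflow. The sketch below repeats the calculation in log space, using the five factors listed above; the variable names are illustrative and the natural logarithm is an arbitrary choice of base.

```python
import math

# Unsmoothed bigram probabilities read from Fig. 3.2 and the
# 'useful probabilities' list, in sentence order.
factors = [0.25, 0.33, 0.0065, 0.52, 0.68]

# Adding log probabilities is equivalent to multiplying probabilities.
log_prob = sum(math.log(p) for p in factors)
prob = math.exp(log_prob)

print(f"log P(sentence) = {log_prob:.4f}")
print(f"P(sentence)     = {prob:.9f}")
```

Exponentiating the summed logs recovers the same product as the direct multiplication, up to floating-point rounding.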


In [None]:
def calculate_sentence_probability():
    # Unsmoothed bigram probabilities from Fig. 3.2 (outer key = previous word, inner key = next word)
    bigram_probs = {
        'i':       {'i': 0.002, 'want': 0.33, 'to': 0, 'eat': 0.0036, 'chinese': 0, 'food': 0, 'lunch': 0, 'spend': 0.00079},
        'want':    {'i': 0.0022, 'want': 0, 'to': 0.66, 'eat': 0.0011, 'chinese': 0.0065, 'food': 0.0065, 'lunch': 0.0054, 'spend': 0.0011},
        'to':      {'i': 0.00083, 'want': 0, 'to': 0.0017, 'eat': 0.28, 'chinese': 0.00083, 'food': 0, 'lunch': 0.0025, 'spend': 0.087},
        'eat':     {'i': 0, 'want': 0, 'to': 0.0027, 'eat': 0, 'chinese': 0.021, 'food': 0.0027, 'lunch': 0.056, 'spend': 0},
        'chinese': {'i': 0.0063, 'want': 0, 'to': 0, 'eat': 0, 'chinese': 0, 'food': 0.52, 'lunch': 0.0063, 'spend': 0},
        'food':    {'i': 0.014, 'want': 0, 'to': 0.014, 'eat': 0, 'chinese': 0.00092, 'food': 0.0037, 'lunch': 0, 'spend': 0},
        'lunch':   {'i': 0.0059, 'want': 0, 'to': 0, 'eat': 0, 'chinese': 0, 'food': 0.0029, 'lunch': 0, 'spend': 0},
        'spend':   {'i': 0.0036, 'want': 0, 'to': 0.0036, 'eat': 0, 'chinese': 0, 'food': 0, 'lunch': 0, 'spend': 0},
    }
    # Taken from the 'useful probabilities' list
    start_probs = {
        'i': 0.25,
    }

    end_probs = {
        'food': 0.68,
    }
    # Sentence whose probability will be calculated
    sentence = ['i', 'want', 'chinese', 'food']

    # Calculation

    print(f"Calculating the probability of the sentence: '{' '.join(sentence)}'\n")

    # P(i|<s>)
    first_word = sentence[0]
    probability = start_probs.get(first_word, 0)
    print(f"P({first_word}|<s>) = {probability}")

    # This loop multiplies in P(want|i), P(chinese|want), and P(food|chinese)
    for i in range(len(sentence) - 1):
        prev_word = sentence[i]
        current_word = sentence[i+1]

        # Look up the bigram probability, defaulting to 0 if it is missing
        bigram_p = bigram_probs.get(prev_word, {}).get(current_word, 0)

        print(f"P({current_word}|{prev_word}) = {bigram_p}")
        probability *= bigram_p

    # P(</s>|food)
    last_word = sentence[-1]
    end_p = end_probs.get(last_word, 0)
    print(f"P(</s>|{last_word}) = {end_p}")
    probability *= end_p

    # Final result
    print("\n--------------------------------------------------")
    print(f"Final sentence probability: {probability}")
    print(f"Approximately: {probability:.5f}")
    print("--------------------------------------------------")

if __name__ == "__main__":
    calculate_sentence_probability()


Calculating the probability of the sentence: 'i want chinese food'

P(i|<s>) = 0.25
P(want|i) = 0.33
P(chinese|want) = 0.0065
P(food|chinese) = 0.52
P(</s>|food) = 0.68

--------------------------------------------------
Final sentence probability: 0.00018961800000000004
Approximately: 0.00019
--------------------------------------------------


**Fig. 3.7**

Sentence: 'i want chinese food'

The probability P(W) is computed with the bigram chain-rule approximation, using the add-one smoothed probabilities.

Assumptions:

* `P(i|<s>) = 0.19`
* `P(</s>|food) = 0.40`

The remaining probabilities are read from Fig. 3.7:

* `P(want|i)`: row "i", column "want" → 0.21
* `P(chinese|want)`: row "want", column "chinese" → 0.0029
* `P(food|chinese)`: row "chinese", column "food" → 0.052

`P(<s> i want chinese food </s>) = 0.19 × 0.21 × 0.0029 × 0.052 × 0.40`

`P(<s> i want chinese food </s>) = 0.000002406768`


In [None]:
# Add-one smoothed probabilities for the bigrams in "<s> i want chinese food </s>"
P_i_start = 0.19
P_want_i = 0.21
P_chinese_want = 0.0029
P_food_chinese = 0.052
P_end_food = 0.40

# Total sentence probability (product of all bigram probabilities)
probabilitas_add_one = (
    P_i_start *
    P_want_i *
    P_chinese_want *
    P_food_chinese *
    P_end_food
)

print("--- Add-One Smoothed Probability Calculation ---")
print("Bigram probabilities used:")
print(f"P(i|<s>) = {P_i_start}")
print(f"P(want|i) = {P_want_i}")
print(f"P(chinese|want) = {P_chinese_want}")
print(f"P(food|chinese) = {P_food_chinese}")
print(f"P(</s>|food) = {P_end_food}")
print("-" * 40)
print(f"P(sentence) = {P_i_start} * {P_want_i} * {P_chinese_want} * {P_food_chinese} * {P_end_food}")
print(f"Add-one smoothed probability: {probabilitas_add_one}")
# Also print in scientific notation for readability:
print(f"In scientific notation: {probabilitas_add_one:.3e}")

--- Add-One Smoothed Probability Calculation ---
Bigram probabilities used:
P(i|<s>) = 0.19
P(want|i) = 0.21
P(chinese|want) = 0.0029
P(food|chinese) = 0.052
P(</s>|food) = 0.4
----------------------------------------
P(sentence) = 0.19 * 0.21 * 0.0029 * 0.052 * 0.4
Add-one smoothed probability: 2.4067679999999995e-06
In scientific notation: 2.407e-06
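
Add-one smoothing moves probability mass away from observed bigrams toward unseen ones, so a sentence made entirely of observed bigrams scores lower after smoothing. The sketch below puts the two results for 'i want chinese food' side by side; the factors are the ones used in the two calculations above, and the variable names are illustrative only.

```python
# Factors taken from the Fig. 3.2 and Fig. 3.7 calculations for the same sentence.
p_unsmoothed = 0.25 * 0.33 * 0.0065 * 0.52 * 0.68   # Fig. 3.2 + useful probabilities
p_add_one = 0.19 * 0.21 * 0.0029 * 0.052 * 0.40     # Fig. 3.7 + assumed values

# How many times larger the unsmoothed estimate is.
ratio = p_unsmoothed / p_add_one

print(f"unsmoothed: {p_unsmoothed:.3e}")
print(f"add-one:    {p_add_one:.3e}")
print(f"ratio:      {ratio:.1f}")
```

The largest single drop comes from `P(food|chinese)`, which falls from 0.52 to 0.052 under add-one smoothing.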




---



#### **3.4 We are given the following corpus, modified from the one in the chapter:**

```
 <s> I am Sam </s>
 <s> Sam I am </s>
 <s> I am Sam </s>
 <s> I do not like green eggs and Sam </s>
```

Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include `<s>` and `</s>` in your counts just like any other token.

The explanation is written in a separate document, linked here:

https://excalidraw.com/#json=xU1RYt6GkoAzPduEiiInw,Rnttukmw7ptwWNiUdIJLnw:
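
For reference, the add-one estimate can also be worked out by direct counting, without any library support. The sketch below applies `P(Sam | am) = (C(am Sam) + 1) / (C(am) + V)` to the four-sentence corpus from the question, counting `<s>` and `</s>` as ordinary tokens as instructed. The helper `add_one_bigram` is a hypothetical name introduced for this illustration.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter(
    (s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)
)

V = len(unigram_counts)  # vocabulary size, counting <s> and </s> as tokens


def add_one_bigram(prev, word):
    """Add-one (Laplace) estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return Fraction(bigram_counts[(prev, word)] + 1, unigram_counts[prev] + V)


p = add_one_bigram("am", "Sam")
c_am_sam = bigram_counts[("am", "Sam")]
c_am = unigram_counts["am"]
print(f"V = {V}, C(am Sam) = {c_am_sam}, C(am) = {c_am}")
print(f"P(Sam | am) = {p} = {float(p):.4f}")
```

With V = 11, C(am Sam) = 2, and C(am) = 3, this gives (2 + 1) / (3 + 11) = 3/14.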




In [17]:
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

# 1) Corpus WITHOUT boundary symbols; NLTK adds <s> and </s> automatically (for n=2)
sents = [
    ["I", "am", "Sam"],
    ["Sam", "I", "am"],
    ["I", "am", "Sam"],
    ["I", "do", "not", "like", "green", "eggs", "and", "Sam"],
]

# 2) Prepare the n-gram data with boundary padding (<s>, </s>)
n = 2
train_data, vocab = padded_everygram_pipeline(n, sents)

# 3) Train a bigram model with Laplace (add-one) smoothing
model = Laplace(n)
model.fit(train_data, vocab)

# 4) The requested probability: P(Sam | am)
#    NLTK's vocabulary also contains <UNK>, so V is one larger than in the exercise
p = model.score("Sam", ["am"])
print("Vocab size V (with <UNK>) =", len(model.vocab))
print("P(Sam | am) with <UNK>    =", p)

# Read the counts from the NLTK model
c_bigram = model.counts[("am",)]["Sam"]
c_context = model.counts[("am",)].N()

# Exclude <UNK> when computing V, as the exercise expects
vocab_tokens = [w for w in model.vocab if w != "<UNK>"]
V_no_unk = len(vocab_tokens)
print("Vocab size V (without <UNK>) =", V_no_unk)

p_manual = (c_bigram + 1) / (c_context + V_no_unk)
print("P(Sam | am) without <UNK> =", p_manual)


Vocab size V (with <UNK>) = 12
P(Sam | am) with <UNK>    = 0.2
Vocab size V (without <UNK>) = 11
P(Sam | am) without <UNK> = 0.21428571428571427
