<div style="background: linear-gradient(135deg, #ece9ff 0%, #d0d0ffff 55%); color: #0f172a; border-radius: 12px; padding: 20px; box-shadow: 0 6px 20px rgba(15,23,42,0.08); font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial; line-height:1.5;">

<h2 style="margin-top:0;">🌟 Maximum-Likelihood Estimation (MLE) — Bigram (n = 2)</h2>

<p>
This Markdown/HTML block explains how the <strong>MLE bigram model</strong> works with NLTK's
<code>padded_everygram_pipeline</code>. It is meant to be <strong>pasted into a Markdown cell</strong> in a Jupyter
notebook. The light purple/blue background is chosen for readability while keeping
the text contrast accessible.
</p>

---

<h2>1. Short conceptual summary</h2>

<ul>
  <li><strong>Goal:</strong> estimate conditional probabilities of the form \(P(w_t \mid w_{t-1})\) from corpus counts.</li>
  <li><strong>Estimator (MLE):</strong>
  <div style="background: rgba(255,255,255,0.84); padding:10px; border-radius:8px; display:inline-block;">

$$
\hat{P}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{C(w_{t-1})}
$$

</div>
  </li>
  <li><strong>Padding:</strong> sentences are padded with <code>&lt;s&gt;</code> and <code>&lt;/s&gt;</code> so that sentence-start and sentence-end probabilities can be estimated.</li>
  <li><strong>No smoothing in this cell:</strong> unseen bigrams get probability 0; smoothing is a separate, later step.</li>
</ul>
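
<p>As a minimal sketch of the estimator above (plain Python rather than NLTK; the helper name <code>bigram_mle</code> and the two-sentence toy corpus are assumptions made only for illustration), the counts and their ratio can be computed directly:</p>

```python
from collections import Counter

def bigram_mle(sentences):
    """Return a function p(word, prev) giving the MLE bigram estimate.

    `sentences` is a list of token lists; each one is padded with
    <s> and </s> before counting (hypothetical helper, for illustration).
    """
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1  # C(prev) = sum over w of C(prev, w)

    def p(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: nothing to estimate from
        return bigram_counts[(prev, word)] / context_counts[prev]

    return p

p = bigram_mle([["a", "b"], ["a", "c"]])
print(p("b", "a"))    # seen bigram: 1 of the 2 continuations of "a" -> 0.5
print(p("a", "<s>"))  # start-of-sentence probability -> 1.0
print(p("c", "b"))    # unseen bigram -> 0.0
```

<p>The last line shows the zero-probability behaviour of unsmoothed MLE on a bigram that never occurred.</p>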

---

<h2>2. How <code>padded_everygram_pipeline(n, tokenized)</code> prepares training data</h2>

<ul>
  <li><strong>Input:</strong> <code>tokenized</code> is a list of token lists. Example: <code>[["i","am","sam"], [...], ...]</code>.</li>
  <li><strong>n:</strong> order of the model; here <code>n = 2</code> (bigrams).</li>
  <li><strong>Returns:</strong>
    <ul>
      <li><code>train_data</code>: a generator that yields, for each padded sentence, its everygrams (all n-grams of length 1 to n).</li>
      <li><code>vocab</code>: a single flat iterator over all padded tokens, used to build the model's vocabulary.</li>
    </ul>
  </li>
</ul>

<p><strong>Why <code>everygram</code> and padding?</strong></p>
<ul>
  <li><code>everygram</code> yields <em>both</em> unigrams and bigrams for each sentence, so a single pass supplies the counts for every order up to <code>n</code>.</li>
  <li>Padding adds <code>n-1</code> boundary symbols on each side (one <code>&lt;s&gt;</code> and one <code>&lt;/s&gt;</code> for bigrams), which makes it possible to model first-word and end-of-sentence probabilities.</li>
</ul>
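
<p>To make the two steps concrete, here is a hand-rolled sketch of padding and everygram extraction. The helpers <code>pad_sentence</code> and <code>grams_up_to</code> are hypothetical names written for this illustration; they mimic the idea, not NLTK's actual implementation, and they list n-grams grouped by length, which need not match the order NLTK yields them in.</p>

```python
def pad_sentence(tokens, n):
    """Add n-1 start symbols and n-1 end symbols around a token list."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

def grams_up_to(tokens, max_len):
    """All k-grams of `tokens` for k = 1..max_len, as tuples, shortest first."""
    grams = []
    for k in range(1, max_len + 1):
        for i in range(len(tokens) - k + 1):
            grams.append(tuple(tokens[i:i + k]))
    return grams

padded = pad_sentence(["i", "am", "sam"], n=2)
grams = grams_up_to(padded, max_len=2)
unigrams = [g for g in grams if len(g) == 1]
bigrams = [g for g in grams if len(g) == 2]

print(padded)    # ['<s>', 'i', 'am', 'sam', '</s>']
print(unigrams)  # five 1-tuples, one per padded token
print(bigrams)   # four 2-tuples, starting with ('<s>', 'i')
```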

---

<h2>3. Expected structure (<code>tmp</code> after <code>list([list(g) for g in train_data])</code>)</h2>

<p>For clarity, here is the <em>expected</em> output structure (one inner list per sentence) for the example corpus (lowercased tokens):</p>

<p><strong>Corpus (lowercased token lists):</strong><br>
<code>[["i","am","sam"], ["sam","i","am"], ["sam","i","like"], ["i","do","like","sam"], ["do","i","like","sam"]]</code></p>

<p><strong><code>tmp</code> (per-sentence everygrams: unigrams and bigrams for n = 2), grouped here by length for readability; the order in which NLTK yields them may differ:</strong></p>

<ul>
  <li>Sentence 1 (<code>"i am sam"</code>) padded <code>['&lt;s&gt;', 'i', 'am', 'sam', '&lt;/s&gt;']</code> → everygrams:
    <ul>
      <li>1-grams: <code>('&lt;s&gt;',) ('i',) ('am',) ('sam',) ('&lt;/s&gt;',)</code></li>
      <li>2-grams: <code>('&lt;s&gt;','i') ('i','am') ('am','sam') ('sam','&lt;/s&gt;')</code></li>
    </ul>
  </li>
  <li>Sentence 2 (<code>"sam i am"</code>) padded <code>['&lt;s&gt;', 'sam', 'i', 'am', '&lt;/s&gt;']</code> → everygrams similarly.</li>
  <li>Sentence 3 (<code>"sam i like"</code>) padded <code>['&lt;s&gt;', 'sam', 'i', 'like', '&lt;/s&gt;']</code>.</li>
  <li>Sentence 4 (<code>"i do like sam"</code>) padded <code>['&lt;s&gt;', 'i', 'do', 'like', 'sam', '&lt;/s&gt;']</code>.</li>
  <li>Sentence 5 (<code>"do i like sam"</code>) padded <code>['&lt;s&gt;', 'do', 'i', 'like', 'sam', '&lt;/s&gt;']</code>.</li>
</ul>

<p>Then you split <code>tmp</code> into unigrams and bigrams using list comprehensions:</p>

<pre><code style="background-color:#f4f4f4; padding:8px; border-radius:5px;">bigrams_sentences = [[gram for gram in sent if len(gram) == 2] for sent in tmp]
unigrams_sentences = [[gram for gram in sent if len(gram) == 1] for sent in tmp]</code></pre>

<p>Each <code>bigrams_sentences[i]</code> is a list of bigram tuples for sentence <code>i</code>.</p>

<hr>

<h2>4. Concrete counting example — computing \(P(\text{like}\mid i)\)</h2>

<p><strong>Step 1: extract and count all bigrams from the padded corpus.</strong><br>
Across the five sentences above, the bigrams whose first word (the context) is <code>i</code> are:</p>

<ul>
  <li><code>('i', 'am')</code> × 2</li>
  <li><code>('i', 'like')</code> × 2</li>
  <li><code>('i', 'do')</code> × 1</li>
</ul>

<p>So the <strong>counts</strong> are:</p>

<div style="background: rgba(255,255,255,0.84); padding:10px; border-radius:8px; display:inline-block;">

$$
C((\text{i},\text{like})) = 2
$$

$$
C(\text{i}) = \sum_{w} C((\text{i},w)) = 2+2+1 = 5
$$

</div>

<p><strong>Step 2: apply the MLE formula</strong></p>

$$
\hat{P}(\text{like}\mid i)
= \frac{C((\text{i},\;\text{like}))}{C(\text{i})}
= \frac{2}{5}
= 0.4
$$

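<p>This is exactly what <code>model.score('like', ['i'])</code> returns after training <code>MLE(order=2)</code> on the everygrams and <code>vocab</code> produced by the pipeline, provided the one-shot <code>train_data</code> generator has not been exhausted before fitting (or a materialized copy of it is used).</p>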
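
<p>The hand count above can be double-checked without NLTK. The following sketch (plain Python; the variable names are chosen only for this illustration) tallies every word that directly follows <code>i</code> in the padded corpus and forms the ratio:</p>

```python
from collections import Counter

corpus = [
    ["i", "am", "sam"],
    ["sam", "i", "am"],
    ["sam", "i", "like"],
    ["i", "do", "like", "sam"],
    ["do", "i", "like", "sam"],
]

# Count every word that directly follows "i" in the padded sentences
followers_of_i = Counter()
for tokens in corpus:
    padded = ["<s>"] + tokens + ["</s>"]
    for prev, word in zip(padded, padded[1:]):
        if prev == "i":
            followers_of_i[word] += 1

c_i = sum(followers_of_i.values())             # C(i)
p_like_given_i = followers_of_i["like"] / c_i  # C(i, like) / C(i)

print(dict(followers_of_i))  # {'am': 2, 'like': 2, 'do': 1}
print(c_i)                   # 5
print(p_like_given_i)        # 0.4
```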

<hr>

<h2>5. Minimal code snippet (run in a code cell after this Markdown)</h2>

<pre><code style="background-color:#f4f4f4; padding:8px; border-radius:5px;">from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize

corpus = [
    "I am Sam",
    "Sam I am",
    "Sam I like",
    "I do like Sam",
    "do I like Sam",
]

# tokenization (lowercase); word_tokenize needs NLTK's Punkt tokenizer data to be installed
tokenized = [[w.lower() for w in word_tokenize(sent)] for sent in corpus]

n = 2
train_data, vocab = padded_everygram_pipeline(n, tokenized)

# train_data is a one-shot generator: materialize it once so it can be
# inspected here and still be used for training below
tmp = list([list(g) for g in train_data])

# split into unigrams and bigrams
bigrams_sentences = [[gram for gram in sent if len(gram) == 2] for sent in tmp]
unigrams_sentences = [[gram for gram in sent if len(gram) == 1] for sent in tmp]

print("bigrams_sentences:\n", bigrams_sentences)
print("unigrams_sentences:\n", unigrams_sentences)

# train the MLE model on the materialized everygrams; fitting on train_data
# itself would see an exhausted generator and learn no counts
model = MLE(order=n)
model.fit(tmp, vocab)
print("P(like|i) =", model.score('like', ['i']))  # expected: 2/5 = 0.4</code></pre>

<hr>

<h2 style="margin-top:0;">🌟 Extending the MLE Bigram Model — Next-word distribution, sentence probability, and notes</h2>

---

<h2>6. Next-word distribution (Probability Mass Function / PMF)</h2>

<p>Once the MLE bigram model is trained, you can compute the <strong>distribution over possible next words given a context</strong>.</p>

<p>Python snippet (explained):</p>

<pre><code style="background-color:#f4f4f4; padding:8px; border-radius:5px;">
# Vocabulary tokens (includes &lt;UNK&gt;, the placeholder for out-of-vocabulary words)
vocab_tokens = list(model.vocab)

# MLE probability of each vocabulary word given the context ['i']
raw = {w: model.score(w, ['i']) for w in vocab_tokens}

# For a context seen in training the MLE scores already sum to 1;
# dividing by the total is a safeguard, and an unseen context gives total == 0
total = sum(raw.values())
dist = {w: (p / total if total > 0 else 0.0) for w, p in raw.items()}

# Only display non-zero probabilities
print("Next-word distribution given context ('i',):")
print({w: p for w, p in dist.items() if p > 0})
</code></pre>

<p><strong>Explanation:</strong></p>
<ul>
<li><code>model.score(word, context)</code> returns the MLE probability of <code>word</code> given <code>context</code>.</li>
<li>We iterate over the vocabulary (including <code>&lt;UNK&gt;</code>) to get all potential next words.</li>
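<li>For a context that was seen in training, the MLE scores already sum to 1, so they form a proper <strong>Probability Mass Function (PMF)</strong> over the next word; the normalization step only guards against rounding and leaves an unseen context at all zeros.</li>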
<li>Non-zero filtering is optional but helps readability.</li>
</ul>
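
<p>That the MLE scores for a seen context already form a PMF can be checked by hand. The sketch below uses plain Python instead of the trained NLTK model; <code>next_word_pmf</code> is a hypothetical helper written for this illustration, and it builds the distribution over the whole padded vocabulary, zeros included.</p>

```python
from collections import Counter
import math

corpus = [
    ["i", "am", "sam"],
    ["sam", "i", "am"],
    ["sam", "i", "like"],
    ["i", "do", "like", "sam"],
    ["do", "i", "like", "sam"],
]

def next_word_pmf(prev, sentences):
    """MLE distribution over the next word given `prev` (illustrative helper)."""
    vocab = set()
    followers = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            if a == prev:
                followers[b] += 1
    total = sum(followers.values())
    if total == 0:
        return {}  # unseen context: MLE defines no distribution here
    return {w: followers[w] / total for w in sorted(vocab)}

pmf = next_word_pmf("i", corpus)
nonzero = {w: p for w, p in pmf.items() if p > 0}

print(nonzero)                                # {'am': 0.4, 'do': 0.2, 'like': 0.4}
print(math.isclose(sum(pmf.values()), 1.0))   # True
print(next_word_pmf("zebra", corpus))         # {} for a context never seen
```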

---

<h2>7. Sentence probability</h2>

<p>Under the bigram assumption, the probability of an entire sentence is the product of the conditional probabilities of its padded tokens:</p>

<pre><code style="background-color:#f4f4f4; padding:8px; border-radius:5px;">
def sentence_prob(sentence, model, n):
    # Tokenize and lowercase
    toks = [w.lower() for w in word_tokenize(sentence)]
    # Pad with &lt;s&gt; (n-1 times) and &lt;/s&gt; for start/end
    padded = ['&lt;s&gt;']*(n-1) + toks + ['&lt;/s&gt;']
    
    prob = 1.0
    for i in range(n-1, len(padded)):
        context = padded[i-(n-1):i] if n>1 else []
        w = padded[i]
        p = model.score(w, context)
        prob *= p
        if prob == 0.0:
            return 0.0  # MLE without smoothing can produce 0 probability for unseen n-grams
    return prob

print("P('I like Sam') =", sentence_prob("I like Sam", model, n))
</code></pre>

<p><strong>Explanation:</strong></p>
<ul>
<li>Each word probability is multiplied sequentially using the preceding <code>n-1</code> words as context.</li>
<li>If the sentence contains an <strong>unseen bigram</strong>, the whole product becomes 0. This is expected without smoothing.</li>
<li>Padding with <code>&lt;s&gt;</code> and <code>&lt;/s&gt;</code> ensures start-of-sentence and end-of-sentence probabilities are included.</li>
</ul>
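
<p>As a cross-check on the product, the same computation can be carried out with exact fractions. This is a sketch in plain Python over the five-sentence corpus, not the NLTK model; <code>sentence_prob_exact</code> is a name introduced only for this illustration.</p>

```python
from collections import Counter
from fractions import Fraction

corpus = [
    ["i", "am", "sam"],
    ["sam", "i", "am"],
    ["sam", "i", "like"],
    ["i", "do", "like", "sam"],
    ["do", "i", "like", "sam"],
]

bigram_counts = Counter()
context_counts = Counter()
for tokens in corpus:
    padded = ["<s>"] + tokens + ["</s>"]
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def sentence_prob_exact(tokens):
    """Product of MLE bigram probabilities, kept as an exact fraction."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = Fraction(1)
    for prev, word in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return Fraction(0)  # unseen context
        prob *= Fraction(bigram_counts[(prev, word)], context_counts[prev])
    return prob

# (2/5) * (2/5) * (2/3) * (3/5) for <s> i like sam </s>
print(sentence_prob_exact(["i", "like", "sam"]))  # 8/125
# ("sam", "like") never occurs, so the product collapses to zero
print(sentence_prob_exact(["sam", "like", "i"]))  # 0
```

<p>Here 8/125 equals 0.064, the value the floating-point product should match up to rounding.</p>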

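<h2>8. Cross-Entropy</h2>

<p>The <strong>cross-entropy</strong> between the empirical distribution \(p\) and our model distribution \(\hat{p}\) is defined as:</p>

$$
H(p,\hat{p}) = - \sum_{x} p(x)\,\log_2 \hat{p}(x)
$$

<p>For a test sequence \(w_1^n = w_1, w_2, \dots, w_n\) (taking \(w_0\) to be the start symbol <code>&lt;s&gt;</code>), it is computed as the average negative log-probability per token, measured in bits:</p>

$$
H(p,\hat{p}) = -\frac{1}{n}\sum_{t=1}^{n} \log_2 \hat{P}(w_t \mid w_{t-1})
$$

---

<h2>9. Perplexity</h2>

<p>The <strong>perplexity</strong> of the model on a test sequence is:</p>

$$
\text{PP}(w_1^n)
= 2^{-\frac{1}{n}\sum_{t=1}^{n}\log_2 \hat{P}(w_t \mid w_{t-1})}
$$

<p>Equivalently:</p>

$$
\text{PP}(w_1^n) = 2^{H(p,\hat{p})}
$$

<p>The base of the logarithm must match the base of the exponent: with natural logarithms the same quantity is \(\exp(H)\). Lower perplexity means the model assigns higher probability to the observed sequence; a single zero-probability bigram makes the perplexity infinite.</p>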
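
<p>A small numeric sketch of both quantities, using base-2 logarithms throughout. The four probabilities are the bigram MLE values for the padded sentence <code>&lt;s&gt; i like sam &lt;/s&gt;</code> derived from the five-sentence corpus; counting the end symbol as a predicted token is a convention assumed here, and the function names are illustrative.</p>

```python
import math

# P(i|<s>), P(like|i), P(sam|like), P(</s>|sam) from the example corpus
probs = [2 / 5, 2 / 5, 2 / 3, 3 / 5]

def cross_entropy_bits(probs):
    """Average negative log2-probability per predicted token."""
    if any(p == 0 for p in probs):
        return math.inf  # one zero-probability event makes the average infinite
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity as 2 raised to the cross-entropy in bits."""
    return 2 ** cross_entropy_bits(probs)

h = cross_entropy_bits(probs)
pp = perplexity(probs)
print(f"cross-entropy = {h:.3f} bits, perplexity = {pp:.3f}")

# Perplexity is the inverse probability of the sequence, normalized by length
print(math.isclose(pp, (1 / math.prod(probs)) ** (1 / len(probs))))  # True
```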

---

<h2>10. Key practical notes and caveats</h2>

<ul>
<li>Always <strong>tokenize and lowercase</strong> consistently between training and evaluation.</li>
<li>MLE reproduces the relative frequencies of the training data exactly, which makes it <strong>brittle</strong>: any unseen n-gram gets probability 0.</li>
<li>NLTK maps out-of-vocabulary words to <code>&lt;UNK&gt;</code>. With the default vocabulary cutoff, <code>&lt;UNK&gt;</code> itself never occurs in the training counts, so under plain MLE an unknown word still receives probability 0.</li>
<li>Generators (like <code>train_data</code>) can be consumed only once. Convert them to a list if they are needed more than once.</li>
<li>Next steps: implement <strong>smoothing</strong> (Laplace, Kneser-Ney, etc.) to avoid zero probabilities and keep perplexity finite.</li>
</ul>
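
<p>The single-use behaviour of generators noted above is easy to reproduce in isolation. The toy generator below is only a stand-in for <code>train_data</code>; the names are made up for this sketch.</p>

```python
def toy_everygrams():
    """Stand-in for a one-shot training generator."""
    for gram in [("<s>",), ("<s>", "i"), ("i",)]:
        yield gram

gen = toy_everygrams()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to iterate over

print(len(first_pass))   # 3
print(second_pass)       # []
```

<p>Materializing the data once (as <code>first_pass</code> does here) and reusing the list avoids silently training on an empty stream.</p>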

<hr>

<h3>✅ How to use in your notebook</h3>
<ol>
  <li>Paste this Markdown/HTML block into a single cell to explain the bigram MLE pipeline.</li>
  <li>Run separate Python code cells for training the model, computing sentence probabilities, and next-word distributions.</li>
  <li>Use it as a foundation before introducing smoothing or trigram/higher-order models.</li>
</ol>

</div>
