# Chapter 2 Exercise 5

Try out sentences and paragraphs in different styles and on different topics to see how the perplexity varies. In particular, compute the perplexity of these types of text:

- Social media text, like Twitter
- SEO spam
- Text with a lot of slang

Which documents have the highest perplexity? Which documents have the lowest perplexity? After manually inspecting the results, do you think perplexity sampling is a good measure of quality?

In [1]:
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install sentencepiece
!pip install huggingface
!pip install huggingface_hub
!pip install datasets

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (553 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 553.6/553.6 kB 4.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp311-cp311-linux_x86_64.whl size=3186708 sha256=fa8c003b0f917c3d8ddeb91c7c32c5cc1443187b824b5dc0733806307f67402f
  Stored in directory: /tmp/pip-ephem-wheel-cache-h6cyryzx/wheels/4e/ca/6a/e5da175b1396483f6f410cdb4cfe8bc8fa5e12088e91d60413
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0
Collecting huggingface
  Downloading huggingface-0.0.1-py3-none-any.whl.met

In [2]:
import kenlm
from huggingface_hub import hf_hub_download

In [3]:
# Load the Wikipedia English model
class KenlmModel:
    @classmethod
    def from_pretrained(cls, model_name: str):
        # Download the model file from the Hugging Face Hub
        model_path = hf_hub_download(repo_id=model_name, filename="wikipedia/en.arpa.bin")
        # Note: this returns a raw kenlm.Model, not a KenlmModel instance
        return kenlm.Model(model_path)

# Usage: the Hugging Face link is somewhat deprecated and its documentation
# is not working, so the model file is downloaded and loaded directly
model = KenlmModel.from_pretrained("edugp/kenlm")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


en.arpa.bin:   0%|          | 0.00/4.44G [00:00<?, ?B/s]

In [4]:
model

<Model from b'en.arpa.bin'>

In [5]:
# Calculate perplexity for the given sentence
sentence = "She was a shriveling bumblebee, and he was a bumbling banshee, but they accepted a position at Gringotts because of their love for maple syrup."
perplexity_score = model.perplexity(sentence)

print(f"Perplexity Score: {perplexity_score}")

Perplexity Score: 329968.53211806616


## Understanding Perplexity

Perplexity is a measure of how well a probability model predicts a sample. In the context of language models (like the one used here, KenLM), it essentially tells us how surprised the model is when it sees a given sentence.

- Lower Perplexity: Indicates that the model is less surprised by the sentence, meaning it finds the sentence more probable or more "expected" based on the data it was trained on.
- Higher Perplexity: Indicates that the model is more surprised by the sentence, meaning it finds the sentence less probable or more "unexpected" based on its training data.
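
To make the definition concrete, here is a minimal sketch of the arithmetic, assuming we already have one log10 probability per token (the helper name `perplexity_from_log10` is hypothetical and not part of KenLM). Perplexity is 10 raised to the negative average log10 probability, so a model that gives every token probability 1/4 has a perplexity of about 4, as if it were choosing among four equally likely options at each step.

```python
import math


def perplexity_from_log10(token_log10_probs):
    """Perplexity = 10 ** (-(sum of log10 probabilities) / number of tokens)."""
    n_tokens = len(token_log10_probs)
    return 10 ** (-sum(token_log10_probs) / n_tokens)


# Hypothetical model outputs: five tokens, each with the same probability.
uncertain = [math.log10(0.25)] * 5   # model is unsure about every token
confident = [math.log10(0.9)] * 5    # model expects every token

print("Uncertain model:", perplexity_from_log10(uncertain))
print("Confident model:", perplexity_from_log10(confident))
```

The confident model ends up close to 1, which is the lowest perplexity possible.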

In [6]:
# Get perplexity
print("Social Media Text:", model.perplexity("omg lol fr tho this is so cray cray 😭 #relatable"))

print("SEO Spam:", model.perplexity("Buy cheap shoes online now! Best prices on designer footwear. Limited time offer, don't miss out!"))

print("Slang Text:", model.perplexity("Yo, that dude was straight up bussin', no cap. Hella fire fit, ya feel?"))


Social Media Text: 584134.9458673951
SEO Spam: 2797439.730260099
Slang Text: 945663.204067629


- **Social Media Text** (584134.9458673951): This extremely high perplexity score indicates that the language model finds social media text incredibly unpredictable. The model, trained on Wikipedia's formal content, struggles to process the abbreviations, slang, emojis, and informal sentence structures prevalent in social media. The vast difference between the training data and the social media example results in the model being highly "surprised" by each word, leading to this massive perplexity value.

- **SEO Spam** (2797439.730260099): The astonishingly high perplexity for SEO spam suggests that this type of text is exceptionally distant from the model's learned patterns. While SEO spam might contain grammatically correct phrases, its repetitive nature, keyword stuffing, and focus on promotional language are vastly different from the encyclopedic style of Wikipedia. The model's inability to predict the highly specific vocabulary and structure of SEO spam results in an even greater level of "surprise" than with social media text.

- **Slang Text** (945663.204067629): The substantial perplexity score for slang-heavy text confirms that the model finds it highly improbable. The use of non-standard vocabulary, informal grammar, and colloquial expressions creates a significant mismatch with the model's training data. The model's difficulty in predicting the next word within this context, due to its unfamiliarity with slang, leads to a high degree of "surprise," resulting in this elevated perplexity score.
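
To answer the "highest" and "lowest" questions directly, the three scores printed above can be sorted. This is only a small sketch that copies the numbers from the cell output by hand; the variable names are illustrative.

```python
# Scores copied from the output of cell 6
scores = {
    "Social Media Text": 584134.9458673951,
    "SEO Spam": 2797439.730260099,
    "Slang Text": 945663.204067629,
}

# Sort from least to most surprising
ranked = sorted(scores.items(), key=lambda item: item[1])
lowest_label, lowest_score = ranked[0]
highest_label, highest_score = ranked[-1]

for label, ppl in ranked:
    # Express each score relative to the least surprising text
    print(f"{label}: {ppl:,.0f} ({ppl / lowest_score:.1f}x the lowest)")
```

In this run the social media example is the least surprising of the three and the SEO spam is the most surprising, although all three scores are very high in absolute terms.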

## Perplexity Scores: What's Going On?

- We got these numbers: 584134.94, 2797439.73, and 945663.20. These are "perplexity" scores.
- Think of it like confusion: The higher the number, the more confused the computer model is.
- Model learned from Wikipedia: It's used to formal, proper writing.
- Social Media Text (584134.94):
  - "omg lol fr tho..." completely throws it off.
  - Abbreviations, slang, emojis – total mismatch.
- SEO Spam (2797439.73):
  - "Buy cheap shoes online now!" even worse.
  - Repetitive sales talk = major confusion.
- Slang Text (945663.20):
  - "Yo, that dude was..." – another big number.
  - Slang and informal language not understood.

What it means:
- Numbers show how well the model "gets" different writing styles.
- Big numbers = far from what it learned.

**Important**:
- Perplexity isn't perfect.
- Low number doesn't always mean "good" text.
- It's a way to see how well the language model fits the text that is being analysed.
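
The point that a low number does not always mean "good" text can be illustrated with a toy model. The sketch below is not KenLM: it is a hypothetical add-one-smoothed unigram model trained on one made-up sentence, used only to show that degenerate, repetitive text can score a lower perplexity than a meaningful sentence.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return a function giving the add-one-smoothed log10 probability of a token."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def log10_prob(token):
        return math.log10((counts[token] + 1) / (total + vocab_size))

    return log10_prob


def toy_perplexity(log10_prob, text):
    """Per-token perplexity of whitespace-split text under the toy model."""
    tokens = text.lower().split()
    total = sum(log10_prob(token) for token in tokens)
    return 10 ** (-total / len(tokens))


# Made-up training text for the toy model
corpus = "the cat sat on the mat and the dog sat on the rug".split()
log10_prob = train_unigram(corpus)

repetitive = toy_perplexity(log10_prob, "the the the the the")
meaningful = toy_perplexity(log10_prob, "the dog sat on the mat")

print(f"Repetitive text: {repetitive:.2f}")
print(f"Meaningful text: {meaningful:.2f}")
```

The repetitive string wins because it is built from the single most frequent word, which is why a perplexity filter is usually combined with other quality checks.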

In [10]:
import kenlm
from huggingface_hub import hf_hub_download

class KenlmModel:
    @classmethod
    def from_pretrained(cls, model_name: str):
        # Download the model file from Hugging Face
        model_path = hf_hub_download(repo_id=model_name, filename="wikipedia/en.arpa.bin")
        return kenlm.Model(model_path)

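def calculate_perplexity(model, sentence):
    """Calculate the per-word perplexity of a sentence using a KenLM model.

    Note: this divides by the whitespace word count only, whereas
    `model.perplexity` also counts the end-of-sentence token (`</s>`),
    so the two report different values for the same sentence.
    """
    # model.score returns the total log10 probability of the sentence
    log_prob = model.score(sentence, bos=True, eos=True)
    words = len(sentence.split())
    # Convert the average log10 probability per word into a perplexity
    perplexity = 10 ** (-log_prob / words)

    return perplexity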
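
Cell 5 reported 329968.53 for the bumblebee sentence using `model.perplexity`, while the function above reports a different value for the same sentence in cell 11. A likely explanation is the normalization: KenLM's Python `perplexity` method divides the total log10 score by the word count plus one (for the end-of-sentence token), whereas the function above divides by the word count alone. The sketch below checks that arithmetic using only the number printed in cell 5; `renormalize_perplexity` is a hypothetical helper, not a KenLM function.

```python
import math


def renormalize_perplexity(ppl, old_tokens, new_tokens):
    """Convert a perplexity normalized over old_tokens to one over new_tokens."""
    # Recover the total log10 probability, then re-average it
    total_log10 = -old_tokens * math.log10(ppl)
    return 10 ** (-total_log10 / new_tokens)


sentence = (
    "She was a shriveling bumblebee, and he was a bumbling banshee, "
    "but they accepted a position at Gringotts because of their love for maple syrup."
)
n_words = len(sentence.split())

# Value printed by model.perplexity in cell 5 (normalized over n_words + 1 tokens)
kenlm_ppl = 329968.53211806616

per_word_ppl = renormalize_perplexity(kenlm_ppl, n_words + 1, n_words)
print(f"Words: {n_words}")
print(f"Per-word perplexity: {per_word_ppl:.1f}")
```

If the explanation holds, the printed value should land very close to the 548543.9453 that cell 11 reports, which means the two cells agree on the underlying score and differ only in how it is averaged.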


In [11]:
print("Downloading KenLM Wikipedia English model...")
# Load the Wikipedia English model using the provided class
model = KenlmModel.from_pretrained("edugp/kenlm")
print("Model downloaded and loaded successfully!")

# Target sentence
sentence = "She was a shriveling bumblebee, and he was a bumbling banshee, but they accepted a position at Gringotts because of their love for maple syrup."

# Calculate perplexity
perplexity = calculate_perplexity(model, sentence)

# Print results
print("\nPerplexity Analysis:")
print("-" * 60)
print(f"Sentence: {sentence}")
print(f"Perplexity Score: {perplexity:.4f}")

# For comparison, calculate perplexity for a more common sentence
common_sentence = "The president spoke to the press about the new economic policies."
common_perplexity = calculate_perplexity(model, common_sentence)

print("\nComparison with common sentence:")
print(f"Sentence: {common_sentence}")
print(f"Perplexity Score: {common_perplexity:.4f}")

# Calculate the ratio
ratio = perplexity / common_perplexity
print(f"\nThe target sentence is {ratio:.2f}x more surprising to the model than the common sentence.")

# Additional examples for comparison
print("\nAdditional Comparisons:")

# A highly formal sentence
formal = "The distinguished representatives convened to deliberate upon matters of international significance."
formal_perplexity = calculate_perplexity(model, formal)
print(f"Formal sentence: {formal}")
print(f"Perplexity: {formal_perplexity:.4f}")

# A very simple sentence
simple = "The cat sat on the mat."
simple_perplexity = calculate_perplexity(model, simple)
print(f"Simple sentence: {simple}")
print(f"Perplexity: {simple_perplexity:.4f}")

# A sentence with fantasy elements similar to the target
fantasy = "The wizard cast a spell on the dragon while the elves danced in the moonlight."
fantasy_perplexity = calculate_perplexity(model, fantasy)
print(f"Fantasy sentence: {fantasy}")
print(f"Perplexity: {fantasy_perplexity:.4f}")


Downloading KenLM Wikipedia English model...
Model downloaded and loaded successfully!

Perplexity Analysis:
------------------------------------------------------------
Sentence: She was a shriveling bumblebee, and he was a bumbling banshee, but they accepted a position at Gringotts because of their love for maple syrup.
Perplexity Score: 548543.9453

Comparison with common sentence:
Sentence: The president spoke to the press about the new economic policies.
Perplexity Score: 595201.6117

The target sentence is 0.92x more surprising to the model than the common sentence.

Additional Comparisons:
Formal sentence: The distinguished representatives convened to deliberate upon matters of international significance.
Perplexity: 4161921.7104
Simple sentence: The cat sat on the mat.
Perplexity: 1458355.8227
Fantasy sentence: The wizard cast a spell on the dragon while the elves danced in the moonlight.
Perplexity: 338976.9645
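
These comparisons are a warning sign: "The cat sat on the mat." scores 1458355.8, higher than the slang example (945663.2), and the formal sentence scores highest of all. That pattern suggests the raw strings do not match what the model expects, rather than reflecting writing style. One plausible cause, stated here as an assumption to verify against the `edugp/kenlm` repository, is that the model was trained on normalized, SentencePiece-tokenized text, so whole words passed in directly are mostly out of vocabulary. The sketch below shows what tokenizing before scoring could look like. The function names are hypothetical, the file name `wikipedia/en.sp.model` is assumed, and any text normalization the upstream pipeline applies is omitted.

```python
def tokenized_perplexity(model, tokenizer, text):
    """Score text after splitting it into subword pieces."""
    # SentencePieceProcessor.encode with out_type=str returns a list of pieces
    pieces = tokenizer.encode(text, out_type=str)
    # KenLM expects space-separated tokens
    log10_prob = model.score(" ".join(pieces), bos=True, eos=True)
    # Count the end-of-sentence token, as KenLM's own perplexity method does
    return 10 ** (-log10_prob / (len(pieces) + 1))


def load_model_and_tokenizer(repo_id="edugp/kenlm"):
    """Download the KenLM model and (assumed) SentencePiece model from the Hub."""
    # Imported lazily so the scoring helper can be used without these packages
    import kenlm
    import sentencepiece
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id=repo_id, filename="wikipedia/en.arpa.bin")
    # Assumption: the tokenizer file sits next to the language model file
    sp_path = hf_hub_download(repo_id=repo_id, filename="wikipedia/en.sp.model")
    tokenizer = sentencepiece.SentencePieceProcessor(model_file=sp_path)
    return kenlm.Model(model_path), tokenizer
```

With both objects loaded, `tokenized_perplexity(model, tokenizer, "The cat sat on the mat.")` can be compared against the raw-string score above. If the assumption is right, ordinary sentences should drop to far lower values, and the style comparisons earlier in this notebook would need to be rerun before drawing conclusions about quality.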
