## Similarity Analysis

We know that Pratchett's style of writing has a distinct tone - witty, satirical and rich in metaphor.

How can we measure whether our chatbot's outputs mimic Pratchett's style of writing?

The following analysis examines whether sentence embeddings can be used to distinguish Pratchett's texts from text generated by an LLM.

For this purpose, we will use the sentence-transformers model 'paraphrase-MiniLM-L6-v2', hosted on Hugging Face, to generate the sentence embeddings.

Several anomaly detection methods can identify segments that deviate from the norm, including Isolation Forest, One-Class SVM, and autoencoders.

We will try the autoencoder first, as it is a deep learning technique that may be able to pick up fine-grained patterns in Pratchett's original text. (Using deep learning here is also a challenge that I would like to take on.)

### Proposed Methodology:

#### Training Phase (Learning Pratchett’s Style)

- The autoencoder is trained on only Pratchett-style embeddings.
- It learns to reconstruct these embeddings.

#### Inference Phase (Checking Chatbot Outputs)
- Pass a chatbot response through the autoencoder.
- Compute the reconstruction error (the difference between the input and the reconstructed output).
- Flag the response as out of character if the error exceeds a chosen threshold.
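
The inference step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `reconstruction_errors` and `flag_anomalies` are hypothetical helper names, and the toy arrays stand in for real embeddings and autoencoder outputs.

```python
import numpy as np

def reconstruction_errors(inputs, reconstructions):
    """Per-row mean squared error between inputs and their reconstructions."""
    inputs = np.asarray(inputs, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    return np.mean((inputs - reconstructions) ** 2, axis=1)

def flag_anomalies(errors, threshold):
    """Boolean mask that is True where the error is strictly above the threshold."""
    return np.asarray(errors) > threshold

# Toy stand-ins for embeddings (x) and autoencoder outputs (x_hat).
x = np.array([[0.1, 0.2, 0.3], [0.5, -0.5, 0.0]])
x_hat = np.array([[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]])

errors = reconstruction_errors(x, x_hat)
flags = flag_anomalies(errors, threshold=0.05)  # 0.05 is an illustrative cutoff
print(errors, flags)
```

The first row is reconstructed perfectly and passes; the second is reconstructed poorly and is flagged.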




In [27]:
import importlib
import utils.utils
importlib.reload(utils.utils)

<module 'utils.utils' from 'c:\\Users\\Liman\\Downloads\\ask_terry\\utils\\utils.py'>

In [None]:
import os
import re

import fitz  # PyMuPDF
import numpy as np
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

from utils.utils import extract_text_from_pdfs, clean_text

In [10]:
pratchett_text = extract_text_from_pdfs("./books")

# Save to a text file (for later analysis)
with open("pratchett_text.txt", "w", encoding="utf-8") as f:
    f.write(pratchett_text)

print("✅ PDF text extraction complete!")

Extracting text from: 09-eric.pdf
Extracting text from: Light Fantastic.pdf
Extracting text from: sourcery.pdf
Extracting text from: The-Colour-of-Magic.pdf
Extracting text from: the_last_hero.pdf
✅ PDF text extraction complete!


In [29]:
cleaned_text = clean_text(pratchett_text)

# Save cleaned text
with open("cleaned_pratchett_text.txt", "w", encoding="utf-8") as f:
    f.write(cleaned_text)

print("✅ Text cleaning complete!")

✅ Text cleaning complete!


In [30]:
# Load pre-trained SBERT model for sentence embeddings
sbert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
# Split text into sentences
sentences = re.split(r'(?<=[.!?])\s+', cleaned_text)  # Splitting on punctuation

# Convert sentences to embeddings
sentence_embeddings = sbert_model.encode(sentences)

print(f"✅ Converted {len(sentences)} sentences into embeddings.")


✅ Converted 24141 sentences into embeddings.


In [33]:
# Convert embeddings to NumPy array
X = np.array(sentence_embeddings)

# Train-test split
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

In [None]:
# Define Autoencoder Model
autoencoder = Sequential([
    Dense(128, activation='relu', input_shape=(384,)),  # Encoder
    Dense(64, activation='relu'),  # Latent space (bottleneck)
    Dense(128, activation='relu'),  # Decoder
    Dense(384, activation='sigmoid')  # Output layer (same size as input)
])

# Compile model
autoencoder.compile(optimizer='adam', loss='mse')

# Train the autoencoder
print("🚀 Training autoencoder on Pratchett's text...")
autoencoder.fit(X_train, X_train, epochs=50, batch_size=8, validation_data=(X_test, X_test), verbose=1)

# Save trained model
autoencoder.save("pratchett_autoencoder.keras") 
print("✅ Training complete. Model saved!")
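
One thing worth checking with this architecture: a sigmoid output layer can only emit values in (0, 1), while sentence embeddings typically contain negative components. The sketch below estimates the lowest MSE that any output confined to [0, 1] could reach for a given set of embeddings. `bounded_output_floor_mse` is a hypothetical helper, not part of the notebook, and the toy array stands in for real embeddings.

```python
import numpy as np

def bounded_output_floor_mse(embeddings, low=0.0, high=1.0):
    """Per-row lower bound on MSE when every output value must lie in [low, high].

    The closest reachable reconstruction is the embedding clipped to the range,
    so whatever lies outside the range is error the model can never remove.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    best_possible = np.clip(embeddings, low, high)
    return np.mean((embeddings - best_possible) ** 2, axis=1)

# Toy example: the first row has a negative component, the second does not.
toy = np.array([[0.5, -0.5], [0.2, 0.8]])
floor = bounded_output_floor_mse(toy)
print(floor)
```

In the notebook this could be called as `bounded_output_floor_mse(X_train)`. If the resulting floor is a noticeable share of the anomaly threshold, a linear output layer may be a better fit than sigmoid.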

### Testing the model's accuracy

Now that we have built the model, we will test it on some sample texts to see whether it flags text that is out of character for Pratchett.

- single sentences
- paragraphs
- a quote from another author's book
- a self-written paragraph

Note that this is only a quick sanity check.

A more rigorous evaluation would need a much larger labelled test set, which we will not explore for now.
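
One way to make the check less ad hoc is to derive the anomaly threshold from held-out Pratchett sentences instead of fixing it by hand. The sketch below assumes that reconstruction errors for held-out in-style sentences are already available. `threshold_from_errors` and the 95th-percentile default are illustrative assumptions, not something the notebook does, and the error values are made up.

```python
import numpy as np

def threshold_from_errors(in_style_errors, percentile=95.0):
    """Choose the cutoff so that roughly `percentile`% of in-style errors fall below it."""
    errors = np.asarray(in_style_errors, dtype=float)
    return float(np.percentile(errors, percentile))

# Made-up errors standing in for held-out Pratchett sentences.
held_out_errors = np.array([0.030, 0.035, 0.040, 0.045, 0.050])

median_cutoff = threshold_from_errors(held_out_errors, percentile=50.0)
strict_cutoff = threshold_from_errors(held_out_errors)  # default 95th percentile
print(median_cutoff, strict_cutoff)
```

A higher percentile flags fewer genuine Pratchett sentences as anomalies, at the cost of letting more out-of-style text through.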


In [70]:
# Load trained model
autoencoder = tf.keras.models.load_model("pratchett_autoencoder.keras")

def detect_anomaly(sentence, threshold=0.05):
    """ Detects if a sentence is an anomaly (out-of-character for Pratchett) """
    embedding = np.array(sbert_model.encode([sentence]))  # Convert to embedding
    reconstructed = autoencoder.predict(embedding)  # Reconstruct using the model
    
    # Compute reconstruction error
    error = mean_squared_error(embedding, reconstructed)
    
    if error > threshold:
        print(f"🚨 Anomaly detected! (Error: {error:.5f}) 👉 '{sentence}'")
        print("_________________________________________________________")
        return True
    else:
        print(f"✅ Sentence is in-character. (Error: {error:.5f}) 👉 '{sentence}'")
        print("_________________________________________________________")
        return False  # Return an explicit boolean in both branches

# === Test Sentences ===
test_sentences = [
    "The sun shone brightly over the AI-powered kingdom.",  # Likely an anomaly
    "I DON’T FORGET THINGS, said Death. I SIMPLY DO NOT BOTHER TO REMEMBER THEM.",  # Generated. Should match Pratchett-style
    "Science is best understood by those who don't try to understand it.",  # Generated. Should match Pratchett-style
    "DON'T THINK OF IT AS DYING, said Death. JUST THINK OF IT AS LEAVING EARLY TO AVOID THE RUSH.", # original text
    "The universe was vast, uncaring, and mildly annoyed by paperwork.",  # Generated. Should match Pratchett-style
    "A wisp of wind blew for a moment through the orchard, and that was the most uncanny thing, because the air in the land of Death is always warm and still.", # original text
    "Discworld is a flat, disc-shaped world balanced on the backs of four giant elephants, which stand atop the enormous celestial turtle, Great A'Tuin, as it swims through space." # generated text, non-Pratchett style
    ]

print("\n🔍 Running anomaly detection on chatbot responses...\n")
for sentence in test_sentences:
    detect_anomaly(sentence)


🔍 Running anomaly detection on chatbot responses...

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step
🚨 Anomaly detected! (Error: 0.07160) 👉 'The sun shone brightly over the AI-powered kingdom.'
_________________________________________________________
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
✅ Sentence is in-character. (Error: 0.04077) 👉 'I DON’T FORGET THINGS, said Death. I SIMPLY DO NOT BOTHER TO REMEMBER THEM.'
_________________________________________________________
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
🚨 Anomaly detected! (Error: 0.05700) 👉 'Science is best understood by those who don't try to understand it.'
_________________________________________________________
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
✅ Sentence is in-character. (Error: 0.04107) 👉 'DON'T THINK OF IT AS DYING, said Death. JUST THINK OF IT AS LEAVING EARLY TO AVOID THE RUSH.'
_______________

From the above, 5 out of 7 sentences were predicted correctly.

The sentences 'Science is best understood by those who don't try to understand it.' and 'The universe was vast, uncaring, and mildly annoyed by paperwork.' are AI-generated and seem Pratchett-like to the knowing human eye, but are predicted to be anomalies.

From preliminary observation, input length might be a factor that influences the model's prediction, as the sentences that are flagged as anomalies are the **shortest ones in the list**.
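
The length hypothesis above can be probed by correlating word counts with reconstruction errors. A minimal sketch follows; `length_error_correlation` is a hypothetical helper. The four sentences and errors are the ones visible in the output above, used purely as an illustration, and are far too few to draw conclusions from.

```python
import numpy as np

def length_error_correlation(texts, errors):
    """Pearson correlation between word count and reconstruction error."""
    lengths = np.array([len(t.split()) for t in texts], dtype=float)
    errors = np.asarray(errors, dtype=float)
    return float(np.corrcoef(lengths, errors)[0, 1]), lengths

# Sentences and errors copied from the output shown above.
texts = [
    "The sun shone brightly over the AI-powered kingdom.",
    "Science is best understood by those who don't try to understand it.",
    "I DON’T FORGET THINGS, said Death. I SIMPLY DO NOT BOTHER TO REMEMBER THEM.",
    "DON'T THINK OF IT AS DYING, said Death. JUST THINK OF IT AS LEAVING EARLY TO AVOID THE RUSH.",
]
errors = [0.07160, 0.05700, 0.04077, 0.04107]

r, lengths = length_error_correlation(texts, errors)
print(lengths, r)
```

A clearly negative correlation on a larger sample would support the idea that shorter inputs reconstruct worse regardless of style.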

In [None]:
# Test generated paragraph similar to Pratchett's style
paragraph = [
    "Rincewind, in a fit of rare optimism, once attempted to teach the Luggage to play fetch. Armed with a sturdy stick and an exit strategy, he cautiously tossed the stick a short distance away. The Luggage remained still. Rincewind took a step backward. The Luggage took a step forward. Several seconds passed in tense silence. Then, without warning, the stick disappeared—not because The Luggage fetched it, but because reality, sensing what might happen if it did not cooperate, decided it was best if the stick had never existed at all. Rincewind, wisely, abandoned the experiment and spent the rest of the afternoon hiding behind a particularly large boulder."
    ]

In [49]:
detect_anomaly(paragraph[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step


False

In [None]:
john_grisham_quote = [
    "It's amazing how lies grow. You start with a small one that seems easy to cover, then you get boxed in and tell another one. Then another. People believe you at first, then they act upon your lies, and you catch yourself wishing you'd simply told the truth."
]

In [None]:
detect_anomaly(john_grisham_quote[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
✅ Sentence is in-character. (Error: 0.03730) 👉 'It's amazing how lies grow. You start with a small one that seems easy to cover, then you get boxed in and tell another one. Then another. People believe you at first, then they act upon your lies, and you catch yourself wishing you'd simply told the truth.'
_________________________________________________________


In [None]:
# Test self-written paragraph
serious_style = [
    "I thought to myself, what is the meaning of life? Do I exist for a reason? I pondered about whether God was real and whether making a difference even mattered. I sighed, and looked over to my brother. He was munching on a sandwich. 'What do you think about the meaning of life?' I asked him. He continued muching and said noncommittally, 'Being able to eat this sandwich.' Hmm, I thought to myself. Fair enough."
]

In [66]:
detect_anomaly(serious_style[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
✅ Sentence is in-character. (Error: 0.04054) 👉 'I thought to myself, what is the meaning of life? Do I exist for a reason? I pondered about whether God was real and whether making a difference even mattered. I sighed, and looked over to my brother. He was munching on a sandwich. 'What do you think about the meaning of life?' I asked him. He continued muching and said noncommittally, 'Being able to eat this sandwich.' Hmm, I thought to myself. Fair enough.'
_________________________________________________________


In [None]:
# Test self-written paragraph that is shorter than previous paragraph
serious_style_2 = [
    "I thought to myself, what is the meaning of life? Do I exist for a reason? I pondered about whether God was real and whether making a difference even mattered."
]

In [68]:
detect_anomaly(serious_style_2[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step
🚨 Anomaly detected! (Error: 0.05133) 👉 'I thought to myself, what is the meaning of life? Do I exist for a reason? I pondered about whether God was real and whether making a difference even mattered.'
_________________________________________________________


True
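
Since the longer and shorter versions of the same paragraph receive different verdicts, one option is to score each sentence separately and aggregate the results, so that inputs of different lengths are compared on the same footing. This is a minimal sketch that assumes the caller supplies a per-sentence scoring function. `score_paragraph` is a hypothetical helper, and the dummy `fake_error` scorer exists only to make the example runnable without the model.

```python
import re
from statistics import mean

def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (same regex as used for training)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def score_paragraph(text, sentence_error_fn):
    """Return (mean error, per-sentence errors) for a block of text."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text contains no sentences")
    errors = [sentence_error_fn(s) for s in sentences]
    return mean(errors), errors

def fake_error(sentence):
    """Dummy scorer for illustration only: shorter sentences get a larger 'error'."""
    return 1.0 / len(sentence.split())

avg, per_sentence = score_paragraph("One two. Three four five six!", fake_error)
print(avg, per_sentence)
```

With the real model, `sentence_error_fn` would embed one sentence, run it through the autoencoder, and return its reconstruction error.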