# 📘 Sentence Similarity: Symbolic vs Semantic Algorithms

In this notebook, we explore two different approaches to measuring how similar two pieces of text are.  
This is useful in many real-world tasks, such as:

- Evaluating machine translations
- Detecting duplicate sentences
- Comparing user input with reference answers
- Measuring content overlap across models

## 🎯 Goal

We compare two fundamentally different techniques:

1. **SequenceMatcher** — a traditional symbolic algorithm that compares how similar two strings are based on characters or words.
2. **SentenceTransformer** — a modern semantic method that uses neural networks to compare meaning, not just text form.

Each approach is explained with short code examples and use cases.


## 🔁 Both Methods Compare Strings – But in Completely Different Ways

| 🔍 Characteristic          | `SequenceMatcher`                       | `SentenceTransformer`                             |
|---------------------------|-----------------------------------------|---------------------------------------------------|
| **What it compares**      | Characters or words                     | Sentence meaning (semantics)                      |
| **Understands meaning?**  | ❌ No                                    | ✅ Yes                                             |
| **Based on**              | Longest matching blocks of characters   | Vector representation from a trained neural net   |
| **Output**                | Share of matched characters (0 to 1)    | Cosine similarity between sentence vectors        |
| **Used when**             | Exact form matters (typos, logins)      | Meaning matters, even if wording is different     |


## 🔣 SequenceMatcher – Symbolic Similarity Example

In [1]:
from difflib import SequenceMatcher

a = "I like cats"
b = "I like dogs"

similarity = SequenceMatcher(None, a, b).ratio()
print(f"SequenceMatcher similarity: {round(similarity * 100)}%")

SequenceMatcher similarity: 73%


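- 🔍 It looks for the longest **matching blocks**: here the shared prefix `"I like "` (7 characters) and the final `"s"`, so 8 of the 11 characters in each string match (2 × 8 / 22 ≈ 73%).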
- ❌ It **does not understand meaning** — only surface-level similarity.
- ✅ Useful for catching **typos, near-duplicates, or string similarity**.
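
To make the 73% above less opaque, the following sketch lists the matching blocks that `SequenceMatcher` finds and recomputes the score by hand as 2·M/T, where M is the number of matched characters and T is the combined length of both strings. The helper name `explain_ratio` is my own and is not part of `difflib`.

```python
from difflib import SequenceMatcher

def explain_ratio(a, b):
    """Return (blocks, matched, ratio) for two strings (hypothetical helper)."""
    sm = SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-length sentinel block; skip it
    blocks = [(m.a, m.b, a[m.a:m.a + m.size])
              for m in sm.get_matching_blocks() if m.size]
    matched = sum(len(text) for _, _, text in blocks)
    total = len(a) + len(b)
    # Same formula as ratio(): 2 * M / T (defined as 1.0 for two empty strings)
    ratio = 2 * matched / total if total else 1.0
    return blocks, matched, ratio

blocks, matched, ratio = explain_ratio("I like cats", "I like dogs")
print(blocks)                        # (start in a, start in b, matched text)
print(matched, round(ratio * 100))   # matched characters and score in percent
```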

## 🔢 SentenceTransformer – Semantic Similarity Example

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-mpnet-base-v2")

a = "I like cats"
b = "I adore kittens"

emb1 = model.encode([a])[0]
emb2 = model.encode([b])[0]

cos_sim = cosine_similarity([emb1], [emb2])[0][0]
print(f"Semantic similarity: {round(cos_sim * 100)}%")



Semantic similarity: 78%


- 🧠 It encodes sentences into **dense vectors** that represent meaning.
- ✅ It can detect **semantic similarity**, even if the words are different.
- 📏 Uses **cosine similarity** to compare how close two sentences are in meaning.
- 🔎 Great for comparing **translations, paraphrases, or responses** where wording may vary.
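
Cosine similarity itself is simple to compute once the vectors exist. The sketch below uses plain Python and made-up 3-dimensional vectors instead of real embeddings, so it runs without downloading a model; the function name `cosine` and the toy vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (invented values; real sentence vectors have hundreds of dimensions)
print(cosine([1, 0, 1], [1, 0, 1]))  # same direction, close to 1
print(cosine([1, 0, 1], [1, 1, 0]))  # partial overlap, in between
print(cosine([1, 0, 0], [0, 1, 0]))  # orthogonal, 0
```

The score depends only on the direction of the vectors, not on their length, which is why it works well for comparing embeddings.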


In [3]:
from difflib import SequenceMatcher
import pandas as pd
import os
from IPython.display import Markdown, display

# Print the current working directory
print("📂 Current working directory:", os.getcwd())

# Define the reference/original text
original_text = """
Herzlich willkommen zu diesem Tutorial in der Reihe Texture Mapping. Dieses dritte Tutorial soll Ihnen zeigen, wie Sie eine Textur eines Baumstammes, den Sie hier sehen, auf einen einfachen Zylinder mappen können. Wir haben hier zwei Texturen verwendet, eine Textur für den Stamm und eine weitere Textur hier vom Schnitt des Baumstammes und ich zeige Ihnen in ein paar wenigen Schritten, wie einfach das möglich ist, hier die Textur an die richtige Stelle mit Hilfe vom UV-Mapping zu bekommen.
Dann starten wir mit dem Blender Default Screen, hier der Würfel. Ich lösche den Würfel und gebe einen Zylinder dazu, den ich etwas in Z skaliere. Wir sehen hier, dass wir hier jetzt nicht näher auf die Modellierung des Baumstumpfes eingehen, sondern nur auf, wie wir optimal die Textur auf diesen Zylinder mappen können.
Gut, wir schauen uns nun die UV-Editing-Tools an. Wir sehen hier auch in einem der früheren Tutorials, ist der Zylinder bereits richtig gemappt. An dieser Stelle hier, wir sehen hier die Seite des Zylinders und die Ober- und die Unterseite des Zylinders.
Das ist bei Default-Objekten so der Fall.
"""

# Define file paths for subtitle files
file_paths = {
    "tiny": "whisper_models/tiny-de-2min.srt",
    "base": "whisper_models/base-de-2min.srt",
    "small": "whisper_models/smal-de-2min.srt",
    "medium": "whisper_models/medium-de-2min.srt",
    "large_v1": "whisper_models/large_v1-de-2min.srt",
    "large_v2": "whisper_models/large_v2-de-2min.srt",
    "large_v3": "whisper_models/large_v3-de-2min.srt",
    "large-turbo-v3": "whisper_models/large-turbo-v3-de-2min.srt"
}

# Verify that all files exist
for name, path in file_paths.items():
    if not os.path.exists(path):
        raise FileNotFoundError(f"❌ File not found: {path}")
    else:
        print(f"✅ File found: {path}")

# Function to read SRT files and extract plain text
def read_srt_text(path):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    text = []
    for line in lines:
        line = line.strip()
        if line and not line.isdigit() and "-->" not in line:
            text.append(line)
    return " ".join(text)

# Read subtitle texts
texts = {name: read_srt_text(path) for name, path in file_paths.items()}
texts["original"] = original_text

# Build similarity matrix
model_names = ["original"] + [name for name in texts.keys() if name != "original"]

similarity_matrix = pd.DataFrame(index=model_names, columns=model_names)

for name1 in model_names:
    for name2 in model_names:
        if name1 == name2:
            similarity_matrix.loc[name1, name2] = "–"
        else:
            ratio = SequenceMatcher(None, texts[name1], texts[name2]).ratio()
            similarity_matrix.loc[name1, name2] = f"{round(ratio * 100)}%"

# Generate Markdown table
md_table = "| Model Comparison | " + " | ".join(model_names) + " |\n"
md_table += "|" + "----|" * (len(model_names) + 1) + "\n"

for name1 in model_names:
    row = f"| **{name1}** "
    for name2 in model_names:
        row += f"| {similarity_matrix.loc[name1, name2]} "
    row += "|\n"
    md_table += row

# Display the table in the notebook
display(Markdown(md_table))

# Ensure the output directory exists
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)

# Save the Markdown table
output_path = os.path.join(output_dir, "vergleich.md")
with open(output_path, "w", encoding="utf-8") as f:
    f.write(md_table)

print(f"✅ Markdown table saved to: {output_path}")


📂 Current working directory: /Users/vadim/VST-VoiceOver/notebooks
✅ File found: whisper_models/tiny-de-2min.srt
✅ File found: whisper_models/base-de-2min.srt
✅ File found: whisper_models/smal-de-2min.srt
✅ File found: whisper_models/medium-de-2min.srt
✅ File found: whisper_models/large_v1-de-2min.srt
✅ File found: whisper_models/large_v2-de-2min.srt
✅ File found: whisper_models/large_v3-de-2min.srt
✅ File found: whisper_models/large-turbo-v3-de-2min.srt


| Model Comparison | original | tiny | base | small | medium | large_v1 | large_v2 | large_v3 | large-turbo-v3 |
|----|----|----|----|----|----|----|----|----|----|
| **original** | – | 52% | 90% | 24% | 24% | 97% | 97% | 97% | 96% |
| **tiny** | 60% | – | 44% | 49% | 38% | 47% | 51% | 35% | 46% |
| **base** | 90% | 50% | – | 90% | 89% | 89% | 89% | 91% | 92% |
| **small** | 97% | 55% | 90% | – | 95% | 98% | 96% | 84% | 92% |
| **medium** | 91% | 46% | 76% | 76% | – | 94% | 91% | 77% | 89% |
| **large_v1** | 97% | 54% | 89% | 98% | 92% | – | 96% | 96% | 94% |
| **large_v2** | 97% | 53% | 90% | 24% | 24% | 96% | – | 98% | 97% |
| **large_v3** | 97% | 44% | 78% | 24% | 24% | 87% | 98% | – | 94% |
| **large-turbo-v3** | 96% | 54% | 91% | 71% | 91% | 94% | 97% | 97% | – |


✅ Markdown table saved to: outputs/vergleich.md


## ✅ Summary of the SequenceMatcher Evaluation

In this part of the notebook, I used Python’s `SequenceMatcher` to compare the reference transcript with the outputs from various Whisper models.

- This algorithm works by comparing **character sequences**, not the meaning of sentences.
- It gives a similarity score (in %) based on how much the outputs structurally match the original.
- It's a fast and simple method that's useful for spotting surface-level changes — like typos or word order shifts.

### 🔍 What I observed from the results:
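- Reading down the `original` column, `small`, `large_v1`, `large_v2`, and `large_v3` each score 97% against the reference, and `large-turbo-v3` scores 96%.
- `tiny` scores lowest in both directions (52% and 60%), which may point to rephrasing, skipped words, or alignment issues.
- The matrix is not symmetric: `original` vs `small` is 24% in one direction and 97% in the other, and `medium` shows a similar gap (24% vs 91%). `ratio()` can depend on argument order, and for sequences of 200 or more items `SequenceMatcher` by default treats very frequent items in the second sequence as junk (`autojunk=True`), so the 24% cells may reflect that heuristic rather than genuinely different transcripts.
- Still, this method doesn’t tell us if two outputs have **similar meaning**, only that they **look similar on the surface**.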

> 🧠 So for deeper analysis (e.g., semantic similarity), I'll need a different approach — like using `SentenceTransformer`.
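
The similarity table above is not symmetric (for example, `original` vs `small` is 24% one way and 97% the other). The `difflib` documentation cautions that `ratio()` can depend on argument order, and with the default `autojunk=True` very frequent items in the second sequence are treated as junk once that sequence has 200 or more items. Here is a minimal sketch for checking both effects. `ratio_both_ways` is a hypothetical helper, and I have not rerun the Whisper comparison with it.

```python
from difflib import SequenceMatcher

def ratio_both_ways(a, b, autojunk=True):
    """Return (ratio(a, b), ratio(b, a)) for the given autojunk setting."""
    forward = SequenceMatcher(None, a, b, autojunk=autojunk).ratio()
    backward = SequenceMatcher(None, b, a, autojunk=autojunk).ratio()
    return forward, backward

# Short strings stay below the 200-item threshold, so autojunk has no effect here
print(ratio_both_ways("I like cats", "I like dogs", autojunk=True))
print(ratio_both_ways("I like cats", "I like dogs", autojunk=False))
```

Applied to the transcripts, a call such as `ratio_both_ways(texts["original"], texts["small"], autojunk=False)` would show whether the 24% cells come from the heuristic.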


In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import os
import re
import numpy as np
from IPython.display import Markdown, display

model_names = [
    "Google_V1", "DeepL_V2", "ChatGPT-mini-3o", "GPT-4o-mini", "GPT-4o", "GPT-4o-turbo", 
    "GPT-3.5-turbo", "GPT-4", "MyMemory", "groq", "winstxnhdw-HLLB", 
    "Ollama", "DeepSeek-R1", "gemma3", "zongweigemma3-translator1b", "gimini-2.0-flash"
]

file_dir = "translate_models"
file_paths = {name: os.path.join(file_dir, f"{name}.srt") for name in model_names}
for name, path in file_paths.items():
    if not os.path.exists(path):
        raise FileNotFoundError(f"❌ File not found: {path}")
    print(f"✅ Found: {path}")

def read_srt_text(path):
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    text = []
    for line in lines:
        line = line.strip()
        if line and not line.isdigit() and "-->" not in line:
            text.append(line)
    full_text = " ".join(text)
    full_text = full_text.lower()
    full_text = re.sub(r"[^\w\s\.\?!]", "", full_text)
    full_text = re.sub(r"\s+", " ", full_text).strip()
    return re.split(r'(?<=[\.\!\?])\s+', full_text)

# Model with better semantic resolution
model = SentenceTransformer("all-mpnet-base-v2")

def compare_sentences(sent_list1, sent_list2):
    n = min(len(sent_list1), len(sent_list2))
    if n == 0:
        return 0.0
    emb1 = model.encode(sent_list1[:n])
    emb2 = model.encode(sent_list2[:n])
    sims = [cosine_similarity([a], [b])[0][0] for a, b in zip(emb1, emb2)]
    return round(np.mean(sims) * 100)

# Read each file and split it into normalized sentences (encoding happens in compare_sentences)
texts = {name: read_srt_text(path) for name, path in file_paths.items()}

# Build similarity matrix
similarity_matrix = pd.DataFrame(index=model_names, columns=model_names)
for name1 in model_names:
    for name2 in model_names:
        if name1 == name2:
            similarity_matrix.loc[name1, name2] = "–"
        else:
            sim = compare_sentences(texts[name1], texts[name2])
            similarity_matrix.loc[name1, name2] = f"{sim}%"

# Markdown table
md_table = "| Model Comparison | " + " | ".join(model_names) + " |\n"
md_table += "|" + "----|" * (len(model_names) + 1) + "\n"
for name1 in model_names:
    row = f"| **{name1}** "
    for name2 in model_names:
        row += f"| {similarity_matrix.loc[name1, name2]} "
    row += "|\n"
    md_table += row

display(Markdown(md_table))

# Save table
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "vergleich_sentence_by_sentence.md")
with open(output_path, "w", encoding="utf-8") as f:
    f.write(md_table)

print(f"✅ Saved more accurate sentence-by-sentence comparison at: {output_path}")


✅ Found: translate_models/Google_V1.srt
✅ Found: translate_models/DeepL_V2.srt
✅ Found: translate_models/ChatGPT-mini-3o.srt
✅ Found: translate_models/GPT-4o-mini.srt
✅ Found: translate_models/GPT-4o.srt
✅ Found: translate_models/GPT-4o-turbo.srt
✅ Found: translate_models/GPT-3.5-turbo.srt
✅ Found: translate_models/GPT-4.srt
✅ Found: translate_models/MyMemory.srt
✅ Found: translate_models/groq.srt
✅ Found: translate_models/winstxnhdw-HLLB.srt
✅ Found: translate_models/Ollama.srt
✅ Found: translate_models/DeepSeek-R1.srt
✅ Found: translate_models/gemma3.srt
✅ Found: translate_models/zongweigemma3-translator1b.srt
✅ Found: translate_models/gimini-2.0-flash.srt


| Model Comparison | Google_V1 | DeepL_V2 | ChatGPT-mini-3o | GPT-4o-mini | GPT-4o | GPT-4o-turbo | GPT-3.5-turbo | GPT-4 | MyMemory | groq | winstxnhdw-HLLB | Ollama | DeepSeek-R1 | gemma3 | zongweigemma3-translator1b | gimini-2.0-flash |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| **Google_V1** | – | 38% | 25% | 25% | 25% | 25% | 24% | 24% | 30% | 29% | 36% | 29% | 39% | 27% | 27% | 32% |
| **DeepL_V2** | 38% | – | 33% | 33% | 32% | 32% | 32% | 31% | 41% | 52% | 54% | 52% | 59% | 38% | 49% | 55% |
| **ChatGPT-mini-3o** | 25% | 33% | – | 98% | 96% | 97% | 97% | 96% | 45% | 42% | 38% | 39% | 33% | 46% | 38% | 38% |
| **GPT-4o-mini** | 25% | 33% | 98% | – | 96% | 97% | 98% | 96% | 45% | 43% | 39% | 40% | 33% | 47% | 39% | 39% |
| **GPT-4o** | 25% | 32% | 96% | 96% | – | 96% | 96% | 97% | 43% | 41% | 38% | 39% | 32% | 45% | 38% | 38% |
| **GPT-4o-turbo** | 25% | 32% | 97% | 97% | 96% | – | 97% | 98% | 44% | 42% | 38% | 40% | 31% | 47% | 39% | 39% |
| **GPT-3.5-turbo** | 24% | 32% | 97% | 98% | 96% | 97% | – | 96% | 44% | 43% | 38% | 39% | 32% | 46% | 38% | 38% |
| **GPT-4** | 24% | 31% | 96% | 96% | 97% | 98% | 96% | – | 44% | 41% | 37% | 39% | 32% | 46% | 39% | 39% |
| **MyMemory** | 30% | 41% | 45% | 45% | 43% | 44% | 44% | 44% | – | 61% | 37% | 47% | 40% | 88% | 48% | 58% |
| **groq** | 29% | 52% | 42% | 43% | 41% | 42% | 43% | 41% | 61% | – | 40% | 63% | 51% | 59% | 61% | 72% |
| **winstxnhdw-HLLB** | 36% | 54% | 38% | 39% | 38% | 38% | 38% | 37% | 37% | 40% | – | 42% | 64% | 36% | 41% | 41% |
| **Ollama** | 29% | 52% | 39% | 40% | 39% | 40% | 39% | 39% | 47% | 63% | 42% | – | 57% | 47% | 84% | 46% |
| **DeepSeek-R1** | 39% | 59% | 33% | 33% | 32% | 31% | 32% | 32% | 40% | 51% | 64% | 57% | – | 38% | 56% | 50% |
| **gemma3** | 27% | 38% | 46% | 47% | 45% | 47% | 46% | 46% | 88% | 59% | 36% | 47% | 38% | – | 47% | 57% |
| **zongweigemma3-translator1b** | 27% | 49% | 38% | 39% | 38% | 39% | 38% | 39% | 48% | 61% | 41% | 84% | 56% | 47% | – | 49% |
| **gimini-2.0-flash** | 32% | 55% | 38% | 39% | 38% | 39% | 38% | 39% | 58% | 72% | 41% | 46% | 50% | 57% | 49% | – |


✅ Saved more accurate sentence-by-sentence comparison at: outputs/vergleich_sentence_by_sentence.md


## ✅ Summary of the SentenceTransformer Evaluation

In this part of the notebook, I used a `SentenceTransformer` model (`all-mpnet-base-v2`) to compare the outputs of various translation systems **sentence by sentence**.

- This method captures **semantic similarity**, not just structural or surface overlap.
- Each sentence is encoded into a vector, and I compute the **cosine similarity** between sentence pairs.
- The final score is the average semantic similarity across all aligned sentences.
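
"Aligned" here means paired by position: sentence 1 with sentence 1, sentence 2 with sentence 2, and so on, up to the length of the shorter list. The sketch below mirrors the normalization and splitting from `read_srt_text` on two made-up English translations to show how a single extra sentence break shifts every later pair. The function name `split_sentences` and both example texts are assumptions for illustration.

```python
import re

def split_sentences(text):
    """Lowercase, drop punctuation other than . ? !, and split after sentence ends."""
    text = text.lower()
    text = re.sub(r"[^\w\s\.\?!]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return re.split(r"(?<=[\.\!\?])\s+", text)

# Two invented translations of the same content; the second breaks the first sentence in two
a = split_sentences("I delete the cube and add a cylinder. Then I scale it.")
b = split_sentences("I delete the cube. I add a cylinder. Then I scale it.")

# zip() pairs by index and stops at the shorter list, like the [:n] slicing in compare_sentences
pairs = list(zip(a, b))
for left, right in pairs:
    print(f"{left!r}  <->  {right!r}")
```

The second pair compares "then i scale it." with "i add a cylinder.", although both translations say the same thing overall. A low average score can therefore come from segmentation differences, not only from differences in meaning.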

### 🔍 What the results show:
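- The GPT-named outputs (`ChatGPT-mini-3o`, `GPT-4o-mini`, `GPT-4o`, `GPT-4o-turbo`, `GPT-3.5-turbo`, `GPT-4`) score 96% to 98% against each other, so their translations are nearly identical in meaning. Outside that group, the closest pairs are `MyMemory` and `gemma3` (88%), `Ollama` and `zongweigemma3-translator1b` (84%), and `groq` and `gimini-2.0-flash` (72%).
- `Google_V1` diverges most from everything else (24% to 39%), and `DeepL_V2` scores only 31% to 33% against the GPT group. `winstxnhdw-HLLB` and `DeepSeek-R1` reach 64% with each other but stay at 59% or below against all other systems.
- Sentences are paired by position and the longer list is truncated, so a different sentence segmentation shifts the pairs and lowers the score. Low values may therefore reflect misalignment or stylistic and structural differences, not necessarily a different meaning.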
- Unlike `SequenceMatcher`, this method is more **robust to rephrasing and different word choices** as long as the meaning stays intact.

> 🧠 SentenceTransformer is a better fit when comparing translations based on meaning, not form.
