# 📌 Introduction to Embeddings and Cosine Similarity

In modern AI and NLP research, text is converted into numerical vector representations called **embeddings**. These vectors let computers compare the meaning of words and sentences numerically.

---

## 🚀 What is an Embedding?

An **embedding** is a numerical vector of fixed length that encodes the meaning of a piece of text.  
For example, the sentence “I love cats” might be represented as a list of several hundred floating-point numbers.

Each individual number does not have a literal meaning, but together they form a **semantic fingerprint** of the sentence.

### 🔍 The Key Idea:
- Sentences with similar meanings will have **similar embeddings** (their vectors “point” in nearly the same direction).
- Sentences with unrelated meanings will have embeddings that point in clearly different directions.
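
To make the idea of "direction" concrete, here is a minimal sketch that measures the angle between small vectors. The three-dimensional vectors are invented for illustration and are not real model output; real embeddings are much longer. The helper names `cosine` and `angle_degrees` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine(u, v)))))

# Hypothetical toy "embeddings", made up for this example only.
toy_embeddings = {
    "I love cats": [0.9, 0.8, 0.1],
    "I like cats": [0.8, 0.9, 0.2],
    "The stock market fell": [0.1, -0.2, 0.9],
}

anchor = toy_embeddings["I love cats"]
for sentence, vector in toy_embeddings.items():
    print(f"{sentence!r}: {angle_degrees(anchor, vector):.1f} degrees from 'I love cats'")
```

With these made-up numbers, the two cat sentences are separated by a small angle, while the unrelated sentence sits at close to a right angle.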

---

## 🔢 How Do Embeddings Work?

A model that generates embeddings is trained on huge text datasets. During training:  
- For sentences with similar meanings, it pulls their vectors closer together.  
- For sentences with different meanings, it pushes their vectors apart.
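
The text above does not say which training objective is used. As one possible illustration, the sketch below assumes a triplet margin loss on cosine similarity, a common choice for this kind of pull-together, push-apart training. It only evaluates the loss on hand-written vectors; real training would adjust the model's weights to reduce it. The function name `triplet_loss` and the margin value are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Loss is zero once the positive is at least `margin` more similar
    to the anchor than the negative is; otherwise it is positive."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

anchor = [1.0, 0.0]

# Well-arranged space: the paraphrase is close, the unrelated text is far.
good = triplet_loss(anchor, positive=[1.0, 0.1], negative=[0.0, 1.0])

# Badly arranged space: the paraphrase is far, the unrelated text is close.
bad = triplet_loss(anchor, positive=[0.0, 1.0], negative=[1.0, 0.1])

print(f"loss when vectors are well arranged:  {good:.3f}")
print(f"loss when vectors are badly arranged: {bad:.3f}")
```

A high loss signals that the similar pair should move closer and the dissimilar pair further apart, which is the behavior described in the two bullets above.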

This enables embeddings to be used for:  
✅ Finding similar texts  
✅ Matching meaning across languages (with a multilingual model)  
✅ Semantic analysis  
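
As a rough sketch of the first use case, finding similar texts, the example below ranks a tiny corpus against a query by cosine similarity. The vectors are hand-written stand-ins rather than real model output, and `most_similar` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def most_similar(query_vector, corpus, top_k=2):
    """Return the top_k (text, score) pairs, highest similarity first."""
    scored = [(text, cosine(query_vector, vector)) for text, vector in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical toy embeddings, invented for this example.
corpus = {
    "Cats are great pets": [0.9, 0.7, 0.1],
    "Dogs are loyal companions": [0.7, 0.9, 0.2],
    "Interest rates rose again": [0.1, 0.0, 0.95],
}
query = [0.85, 0.75, 0.1]  # stand-in for the embedding of "I love cats"

results = most_similar(query, corpus)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In a real application the corpus vectors would come from an embedding model and would usually be stored in a vector index instead of a dictionary.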

---

## 📏 Cosine Similarity: What Is It?

To measure **how similar** two embeddings are, we use **cosine similarity**.

The formula:
$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \cdot \|B\|}
$$

- \(A \cdot B\) is the dot product of the two vectors.  
- \(\|A\|\) and \(\|B\|\) are the Euclidean lengths (norms) of the vectors; dividing by them makes the result independent of vector magnitude.  
- The result ranges from -1 to 1.
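
As a worked example of the formula, take two small vectors chosen purely for illustration, \(A = (1, 2, 2)\) and \(B = (2, 1, 2)\):

$$
A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8
$$

$$
\|A\| = \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad \|B\| = \sqrt{2^2 + 1^2 + 2^2} = 3
$$

$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \cdot \|B\|} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.89
$$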

### 🟩 Interpretation:
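- **1** – the vectors point in the same direction; the model treats the meanings as essentially identical.  
- **0** – the vectors are orthogonal; the model finds no similarity.  
- **-1** – the vectors point in opposite directions. In practice, sentence embeddings rarely score near -1, so a low or negative value is better read as “unrelated” than as “opposite meaning”.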
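
The three reference values can be checked with a small, dependency-free sketch. The function below implements the formula directly; the name `cosine_similarity` and the example vectors are chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: magnitude does not matter.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])

# Exactly opposite direction.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(f"{same_direction:.2f}")  # 1.00
print(f"{orthogonal:.2f}")      # 0.00
print(f"{opposite:.2f}")        # -1.00
```

Note that the first pair scores 1 even though the second vector is twice as long: cosine similarity compares direction only.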

---


## 🧪 Example in Python

Let’s look at a Python example where we compare the similarity of two sentences:


In [39]:

from sentence_transformers import SentenceTransformer, util

# Load a multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Two example sentences
a = "I love cats"
b = "I like cats"

# Compute embeddings
embeddings_a = model.encode(a, convert_to_tensor=True)
embeddings_b = model.encode(b, convert_to_tensor=True)

# Cosine similarity
similarity = util.cos_sim(embeddings_a, embeddings_b)

print(f"Cosine similarity: {round(float(similarity[0][0]) * 100, 2)}%")

Cosine similarity: 93.82%

The same comparison can be run line by line over whole files. The next cell compares a German subtitle file (`.srt`) with its English translation and reports the similarity of each pair of lines.


In [36]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

# File paths
# german_srt_path = "data/large_v3-de-2min.srt"
# english_srt_path = "translate_models/gemma3.srt"

german_srt_path = "Two-Pole_Theory/Texturing.a.Trunk-de-2min.srt"
english_srt_path = "Two-Pole_Theory/Texturing.a.Trunk-de-2min.en.srt"

# Function to read and clean .srt files
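def read_srt_text_lines(filepath):
    """Return the subtitle text lines of an .srt file, without counters or timestamps."""
    # "utf-8-sig" also reads plain UTF-8 and strips a leading byte-order mark,
    # which would otherwise stop the first counter line from passing isdigit()
    with open(filepath, encoding="utf-8-sig") as f:
        stripped = (line.strip() for line in f)
        # Remove blank lines, subtitle numbering, and timestamps
        return [
            line for line in stripped
            if line and not line.isdigit() and '-->' not in line
        ]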

# Read German and English subtitle text lines
german_lines = read_srt_text_lines(german_srt_path)
english_lines = read_srt_text_lines(english_srt_path)

# Truncate both lists to the same length (assumes line i in one file corresponds to line i in the other)
min_len = min(len(german_lines), len(english_lines))
german_lines = german_lines[:min_len]
english_lines = english_lines[:min_len]

# Load multilingual sentence transformer model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Create embeddings for both sets of sentences
german_embeddings = model.encode(german_lines, convert_to_tensor=True)
english_embeddings = model.encode(english_lines, convert_to_tensor=True)

# Compute the cosine-similarity matrix: entry [i][j] compares German line i with English line j
cosine_scores = util.cos_sim(german_embeddings, english_embeddings)

# Keep only the diagonal (line i vs. line i) and convert to percentages
similarities = [round(float(cosine_scores[i][i]) * 100, 2) for i in range(min_len)]

# Compute the overall average similarity
overall_similarity = round(sum(similarities) / len(similarities), 2)

# Create a DataFrame with results
df = pd.DataFrame({
    "German Text": german_lines,
    "English Text": english_lines,
    "Semantic Similarity (%)": similarities
})

# Display the DataFrame
from IPython.display import display
display(df.style.background_gradient(subset=["Semantic Similarity (%)"], cmap="Greens"))

# Display the overall similarity
print(f"\n🔢 Overall semantic similarity of the texts: {overall_similarity}%")

# # Save the table to a CSV file
# output_path = "Two-Pole_Theory/semantic_similarity_results.csv"
# df.to_csv(output_path, index=False, encoding="utf-8-sig")
# print(f"✅ Results saved to: {output_path}")


| # | German Text | English Text | Semantic Similarity (%) |
|---|---|---|---|
| 0 | Ja, herzlich willkommen zu diesem Tutorial | Yes, welcome to this tutorial | 88.55 |
| 1 | in der Redex-Champirping. | in the Redex-Champirping. | 95.63 |
| 2 | Dieses dritte Tutorial soll in einen zeigen, wie sie | This third tutorial will show you how to create a | 76.65 |
| 3 | eine Textur eines Baumstammes, den sich hier sehen. | texture of a tree trunk, which you can see here. | 85.85 |
| 4 | Auf einen einfachen Zilinder, mitten | On a simple Zilinder, in the middle of | 89.65 |
| 5 | können. | can. | 87.37 |
| 6 | Wir haben hier zwei Texturen verwendet, | We have used two textures here, | 73.52 |
| 7 | eine Textur für den Stamm. | one texture for the trunk. | 35.31 |
| 8 | Und eine weitere Textur hier vom Schlitt | And another texture here from the sledge | 79.87 |
| 9 | des Baumstammes. | of the tree trunk. | 82.58 |



🔢 Overall semantic similarity of the texts: 76.15%
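
The reader used in the cell above treats every physical line as its own unit, so a subtitle cue that wraps onto two lines becomes two entries, and the two files can drift out of step if they wrap differently. One possible alternative, sketched here under the assumption that both files share the same cue numbering, is to group text by cue. The function name `read_srt_cues` and the sample subtitle text are hypothetical.

```python
import re

def read_srt_cues(srt_text):
    """Return one text string per subtitle cue, joining multi-line cues."""
    cues = []
    # In the SRT format, cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Locate the timestamp line; everything after it is subtitle text.
        timestamp_index = next(
            (i for i, line in enumerate(lines) if "-->" in line), None
        )
        if timestamp_index is None:
            continue  # skip blocks that are not well-formed cues
        text = " ".join(lines[timestamp_index + 1:])
        if text:
            cues.append(text)
    return cues

# Hypothetical sample, written for this example only.
sample = """1
00:00:00,000 --> 00:00:03,000
Hello and welcome
to this tutorial.

2
00:00:03,000 --> 00:00:05,000
2024
"""

print(read_srt_cues(sample))
```

Because text is taken by position after the timestamp, a cue whose text consists only of digits is kept, whereas a filter based on `isdigit()` would discard it along with the counters.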
