### EMBEDDING - INTRODUCTION

#### Introduction: How do Computers Understand Meaning?

If you search for **"Best place to get a burger,"** a traditional keyword search might look for the exact word **"burger."** But a GenAI system knows that "fast food joint" or "diner" might also be relevant, even if they don't contain the word **"burger."**

How? Through Embeddings.

Embeddings turn text into lists of numbers (vectors). Imagine a giant 3D map of every word in the dictionary (real embeddings use hundreds of dimensions rather than three, but the idea is the same):

* Words with similar meanings (like "King" and "Queen") are placed close together.
* Words with different meanings (like "King" and "Toaster") are placed far apart.
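
To make the map picture concrete, here is a minimal sketch using made-up 3D coordinates. The numbers and the helper name `distance` are invented for illustration and do not come from any model; only the relative distances matter.

```python
import math

# Hypothetical, hand-picked 3D coordinates (not produced by any model).
toy_vectors = {
    "King":    (0.9, 0.8, 0.1),
    "Queen":   (0.8, 0.9, 0.1),
    "Toaster": (0.1, 0.0, 0.9),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)

king_queen = distance(toy_vectors["King"], toy_vectors["Queen"])
king_toaster = distance(toy_vectors["King"], toy_vectors["Toaster"])

print(f"King <-> Queen:   {king_queen:.3f}")
print(f"King <-> Toaster: {king_toaster:.3f}")
```

With these toy coordinates, "King" lands much closer to "Queen" than to "Toaster", which is the property a trained embedding model is meant to provide.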

In this notebook, we will:

* Turn sentences into numbers (Embeddings).

* Visualize what these numbers look like.

* Use math to find out which sentences are most similar (Semantic Search).

#### Generating Embeddings

In [1]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "I love chicken dishes",      # Index 0
    "Haribhavan is best for chicken dishes",   # Index 1
    "Madharasi movie is awesome",  # Index 2
    "Vijay started new party named TVK",  # Index 3
    "It is so rainy",             # Index 4
]

# Encode every sentence into a fixed-length vector (one row per sentence)
embeddings = model.encode(sentences)

print(f"Shape of the embedding matrix: {embeddings.shape}")
print("\nHere is what the first sentence looks like to the computer (first 10 numbers):")
print(embeddings[0][:10])

Shape of the embedding matrix: (5, 384)

Here is what the first sentence looks like to the computer (first 10 numbers):
[-0.06403217 -0.08686323  0.0058217  -0.01795575 -0.0321081  -0.04148107
  0.07964611 -0.0545793   0.071467   -0.02466663]


* The Output: You will see a shape like (5, 384). This means we have 5 sentences, and each sentence is described by 384 numbers.

* These numbers are the "GPS coordinates" of that sentence in the universe of meaning.

#### Measuring Similarity

In [2]:
embeddings.shape

(5, 384)
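
The next cell compares sentences using cosine similarity: the dot product of two vectors divided by the product of their lengths. As a rough sketch of that formula in plain Python (the helper name `cosine_similarity` and the toy 2D vectors are assumptions for illustration, not part of `sentence_transformers`):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors chosen to show the three extreme cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])     # close to 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1

print(same_direction, perpendicular, opposite)
```

The score depends only on the direction of the vectors, not their length, and it ranges from -1 to 1.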

In [3]:
from sentence_transformers import util

# Calculate cosine similarity between all pairs of sentences
# This creates a matrix of scores between -1 (opposite) and 1 (same direction);
# a score near 0 means the sentences are unrelated
similarity_scores = util.cos_sim(embeddings, embeddings)

print("Similarity Matrix:\n")
print(similarity_scores)

Similarity Matrix:

tensor([[1.0000, 0.5226, 0.1736, 0.0803, 0.1333],
        [0.5226, 1.0000, 0.3310, 0.1922, 0.0398],
        [0.1736, 0.3310, 1.0000, 0.1332, 0.1602],
        [0.0803, 0.1922, 0.1332, 1.0000, 0.0325],
        [0.1333, 0.0398, 0.1602, 0.0325, 1.0000]])
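
One way to read this matrix: row `i`, column `j` holds the score between sentence `i` and sentence `j`, so the diagonal is always 1 (each sentence compared with itself) and the matrix is symmetric. The sketch below copies the printed scores into a plain list and looks for the most and least similar pair; the variable names are illustrative only.

```python
# Scores copied from the similarity matrix printed above.
similarity = [
    [1.0000, 0.5226, 0.1736, 0.0803, 0.1333],
    [0.5226, 1.0000, 0.3310, 0.1922, 0.0398],
    [0.1736, 0.3310, 1.0000, 0.1332, 0.1602],
    [0.0803, 0.1922, 0.1332, 1.0000, 0.0325],
    [0.1333, 0.0398, 0.1602, 0.0325, 1.0000],
]

sentences = [
    "I love chicken dishes",
    "Haribhavan is best for chicken dishes",
    "Madharasi movie is awesome",
    "Vijay started new party named TVK",
    "It is so rainy",
]

# Skip the diagonal and the mirrored half: only pairs with i < j are needed.
pairs = [
    (similarity[i][j], i, j)
    for i in range(len(similarity))
    for j in range(i + 1, len(similarity))
]

best_score, best_i, best_j = max(pairs)
worst_score, worst_i, worst_j = min(pairs)

print(f"Most similar  ({best_score:.4f}): {sentences[best_i]!r} / {sentences[best_j]!r}")
print(f"Least similar ({worst_score:.4f}): {sentences[worst_i]!r} / {sentences[worst_j]!r}")
```

The two chicken sentences score highest (0.5226), while the sentences about the TVK party and the rain score lowest (0.0325).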


#### Building a "Semantic Search"

In [5]:
query = "Who is the current chief minister"

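# Embed the query with the same model used for the sentences
query_embedding = model.encode(query)

# Compare the query against every sentence; [0] selects the single row of scores
scores = util.cos_sim(query_embedding, embeddings)[0]

# Pair each sentence with its score and sort from most to least similar
results = zip(sentences, scores)
sorted_results = sorted(results, key=lambda x: x[1], reverse=True)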

print(f"Query: '{query}'\n")
print("Most similar sentences:")
for sentence, score in sorted_results:
    print(f"Score: {score:.4f} | Sentence: {sentence}")

Query: 'Who is the current chief minister'

Most similar sentences:
Score: 0.1433 | Sentence: Vijay started new party named TVK
Score: 0.0735 | Sentence: It is so rainy
Score: 0.0589 | Sentence: Haribhavan is best for chicken dishes
Score: 0.0096 | Sentence: I love chicken dishes
Score: -0.0090 | Sentence: Madharasi movie is awesome
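
Note that the top score here is only 0.1433, far below the 0.5226 that the two chicken sentences scored against each other, so none of the stored sentences really answers the query. A search that always returns the top hit would hide this. One possible refinement, sketched below, keeps only the best `k` results above a minimum score. The helper `top_matches` and the cutoff of 0.3 are assumptions for illustration; a sensible threshold depends on the model and the data.

```python
def top_matches(scored_sentences, k=3, min_score=0.3):
    """Return up to k (sentence, score) pairs with score >= min_score, best first."""
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked if score >= min_score][:k]

# Scores copied from the search output above.
query_results = [
    ("Vijay started new party named TVK", 0.1433),
    ("It is so rainy", 0.0735),
    ("Haribhavan is best for chicken dishes", 0.0589),
    ("I love chicken dishes", 0.0096),
    ("Madharasi movie is awesome", -0.0090),
]

matches = top_matches(query_results)
if matches:
    for text, score in matches:
        print(f"Score: {score:.4f} | Sentence: {text}")
else:
    print("No sufficiently similar sentence found.")
```

With the assumed cutoff of 0.3, no sentence qualifies, so the sketch reports that nothing relevant was found instead of returning a weak match.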


#### Links to Visualize Embeddings

1. http://vectors.nlpl.eu/explore/embeddings/en/#
2. https://projector.tensorflow.org/
