<a href="https://colab.research.google.com/github/Neverlost0311/nlp-word-embeddings-lab/blob/main/01-basic-embeddings/lab1_gemini_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1: Creating Text Embeddings with Gemini

**Goal:**  
Convert text into dense numerical vectors (embeddings) using Gemini, inspect their shape/dimensions, and build intuition for how text becomes machine-readable.

We will:
- Securely load the API key
- Create a reusable embedding function
- Embed single and multiple texts
- Inspect dimensions and values
- Do a tiny semantic check (cat vs kitten)


In [1]:
# Install the new Gemini SDK
!pip install -U google-genai

# Imports
from google import genai
import getpass
import numpy as np
from typing import List




In [2]:
# Securely ask for API key (it will be hidden while typing)
api_key = getpass.getpass("Enter your Gemini API Key: ")

# Create the Gemini client (no request is sent yet, so an invalid key
# only surfaces on the first API call)
client = genai.Client(api_key=api_key)

print("✅ API key loaded and client created successfully.")


Enter your Gemini API Key: ··········
✅ API key loaded and client created successfully.
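
When the notebook runs without someone at the keyboard (a script or a scheduled job), there is no prompt to type into. A minimal sketch of one way to handle that, assuming the key is stored in an environment variable. The helper name `load_api_key` and the variable name `GEMINI_API_KEY` are illustrative choices, not part of this lab.

```python
import getpass
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or prompt for it as a fallback.

    `env_var` is an assumed variable name; use whatever your setup defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Hidden prompt, same behaviour as the cell above.
        key = getpass.getpass("Enter your Gemini API Key: ").strip()
    if not key:
        raise ValueError("No API key provided.")
    return key
```

With this helper, the client line would become `client = genai.Client(api_key=load_api_key())`.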


In [3]:
# =========================
# Embedding Functions
# =========================

MODEL_NAME = "text-embedding-004"

def get_embedding(text: str) -> List[float]:
    """
    Generate embedding for a single text string.
    Returns a list of floats.
    """
    result = client.models.embed_content(
        model=MODEL_NAME,
        contents=text
    )
    return result.embeddings[0].values


def get_embeddings_batch(texts: List[str]) -> np.ndarray:
    """
    Generate embeddings for a list of texts.
    Returns a NumPy array of shape (n_texts, embedding_dim).
    """
    result = client.models.embed_content(
        model=MODEL_NAME,
        contents=texts
    )
    vectors = [e.values for e in result.embeddings]
    return np.array(vectors)


print("✅ Embedding functions are ready.")


✅ Embedding functions are ready.
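
A single request may cap how many texts it accepts, so long lists are usually sent in chunks. The sketch below shows that pattern under stated assumptions: `embed_in_chunks` is a hypothetical helper, and `chunk_size=100` is a placeholder, not a documented limit (check the API documentation for the real one). It accepts any `embed_fn` with the same contract as `get_embeddings_batch`. A stand-in `fake_embed` is used here so the example runs without an API key.

```python
from typing import Callable, List

import numpy as np


def embed_in_chunks(
    texts: List[str],
    embed_fn: Callable[[List[str]], np.ndarray],
    chunk_size: int = 100,  # assumed value, not a documented limit
) -> np.ndarray:
    """Embed `texts` by calling `embed_fn` on slices of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not texts:
        raise ValueError("texts must not be empty")
    parts = [
        embed_fn(texts[start:start + chunk_size])
        for start in range(0, len(texts), chunk_size)
    ]
    # Stack the per-chunk arrays back into one (n_texts, embedding_dim) array.
    return np.vstack(parts)


def fake_embed(batch: List[str]) -> np.ndarray:
    """Stand-in for get_embeddings_batch: one 4-dimensional row per text."""
    return np.array([[float(len(t)), 0.0, 0.0, 1.0] for t in batch])


demo = embed_in_chunks(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, chunk_size=2)
print(demo.shape)  # (5, 4)
```

In the notebook itself, the call would pass the real function instead: `embed_in_chunks(texts, get_embeddings_batch)`.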


In [5]:
# =========================
# Single Text Embedding Demo
# =========================

text = input("Enter a word or sentence to embed: ")

embedding = get_embedding(text)
embedding_np = np.array(embedding)

print("\n====== EMBEDDING INFO (Single Text) ======")
print("Input text:", text)
print("Type of embedding:", type(embedding))
print("Vector length (dimensions):", len(embedding))
print("Numpy shape:", embedding_np.shape)

print("\nFirst 10 values of the embedding vector:")
print(embedding_np[:10])


Enter a word or sentence to embed: i love this coffee

====== EMBEDDING INFO (Single Text) ======
Input text: i love this coffee
Type of embedding: <class 'list'>
Vector length (dimensions): 768
Numpy shape: (768,)

First 10 values of the embedding vector:
[-0.00796209 -0.0130164  -0.02627383  0.05849049 -0.00029538  0.03135902
 -0.00309337  0.02625825  0.0397107   0.01320109]
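
Beyond the length and the first few values, a handful of summary statistics helps when inspecting a vector. The L2 norm in particular shows whether a vector is already unit length, which determines whether a plain dot product equals cosine similarity. A small sketch, using a random stand-in vector because real values require an API call. `describe_vector` is a hypothetical helper name.

```python
import numpy as np


def describe_vector(vec: np.ndarray) -> dict:
    """Summarize an embedding vector with a few basic statistics."""
    vec = np.asarray(vec, dtype=float)
    return {
        "dimensions": int(vec.shape[0]),
        "l2_norm": float(np.linalg.norm(vec)),
        "min": float(vec.min()),
        "max": float(vec.max()),
        "mean": float(vec.mean()),
    }


# Stand-in for `embedding_np`: random values, NOT real model output.
rng = np.random.default_rng(seed=0)
fake_embedding = rng.normal(scale=0.03, size=768)

for name, value in describe_vector(fake_embedding).items():
    print(f"{name}: {value}")
```

In the notebook, `describe_vector(embedding_np)` would report the same statistics for the real embedding.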


In [6]:
# =========================
# Batch Embedding Demo (Multiple Texts)
# =========================

texts = [
    "cat",
    "kitten",
    "rocket",
    "I love this coffee",
    "This product is terrible",
    "The weather is nice today"
]

batch_embeddings = get_embeddings_batch(texts)

print("\n====== EMBEDDING INFO (Batch) ======")
print("Number of texts:", len(texts))
print("Batch embedding array shape:", batch_embeddings.shape)  # (n_texts, embedding_dim)

print("\nEach text has its own vector:")
for i, t in enumerate(texts):
    print(f"- '{t}' → vector shape: {batch_embeddings[i].shape}")



====== EMBEDDING INFO (Batch) ======
Number of texts: 6
Batch embedding array shape: (6, 768)

Each text has its own vector:
- 'cat' → vector shape: (768,)
- 'kitten' → vector shape: (768,)
- 'rocket' → vector shape: (768,)
- 'I love this coffee' → vector shape: (768,)
- 'This product is terrible' → vector shape: (768,)
- 'The weather is nice today' → vector shape: (768,)
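
Because the batch is a matrix with one row per text, every pair of texts can be compared in one step: normalize each row to unit length, then multiply the matrix by its transpose. A minimal sketch with toy 3-dimensional vectors (not real embeddings). `cosine_similarity_matrix` is a hypothetical helper, and it assumes no row is all zeros.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # assumes no row is all zeros
    return unit @ unit.T


# Toy 3-dimensional vectors, NOT real embeddings.
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0, different length
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 2))
```

Applied to the real data, `cosine_similarity_matrix(batch_embeddings)` would give a `(6, 6)` matrix whose entry `[i, j]` compares `texts[i]` with `texts[j]`.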


In [7]:
# =========================
# Semantic Similarity using Cosine Similarity
# =========================

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Take some vectors from the batch
cat_vec = batch_embeddings[0]      # "cat"
kitten_vec = batch_embeddings[1]   # "kitten"
rocket_vec = batch_embeddings[2]   # "rocket"

# Compute similarities
sim_cat_kitten = cosine_similarity(cat_vec, kitten_vec)
sim_cat_rocket = cosine_similarity(cat_vec, rocket_vec)

print("\n====== SEMANTIC SIMILARITY CHECK ======")
print(f"Similarity(cat, kitten) = {sim_cat_kitten:.4f}")
print(f"Similarity(cat, rocket) = {sim_cat_rocket:.4f}")

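print("\nInterpretation:")
if sim_cat_kitten > sim_cat_rocket:
    print("'cat' is closer to 'kitten' than to 'rocket': related meanings get a higher cosine similarity.")
else:
    print("Unexpected: 'cat' is not closer to 'kitten' than to 'rocket'.")


====== SEMANTIC SIMILARITY CHECK ======
Similarity(cat, kitten) = 0.7909
Similarity(cat, rocket) = 0.3781

Interpretation:
'cat' is closer to 'kitten' than to 'rocket': related meanings get a higher cosine similarity.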
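
To make the scores easier to read, it helps to work through the cosine formula by hand on tiny vectors. The sketch below uses made-up 2-dimensional vectors (not embeddings) to show the arithmetic step by step. It also shows two properties of the measure: scaling a vector does not change the result, and opposite directions give -1.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])

dot = float(np.dot(a, b))            # 1*2 + 2*1 = 4
norm_a = float(np.linalg.norm(a))    # sqrt(1 + 4) = sqrt(5)
norm_b = float(np.linalg.norm(b))    # sqrt(4 + 1) = sqrt(5)
cos_ab = dot / (norm_a * norm_b)     # 4 / 5 = 0.8

# Scaling a vector changes its length but not its direction,
# so the cosine similarity stays the same.
cos_scaled = float(np.dot(10 * a, b) / (np.linalg.norm(10 * a) * norm_b))

# A vector pointing the opposite way gives the minimum value, -1.
cos_opposite = float(np.dot(a, -a) / (norm_a * np.linalg.norm(-a)))

print(round(cos_ab, 4), round(cos_scaled, 4), round(cos_opposite, 4))
```

Read this way, the 0.7909 for cat and kitten means the two vectors point in broadly the same direction, while the 0.3781 for cat and rocket means they diverge much more.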


# Summary & Key Takeaways

In this lab, we built our first end-to-end text embedding pipeline using Gemini.

## What we did

- Called the Gemini embedding model (`text-embedding-004`) through the API
- Converted text into **dense numerical vectors** (768 dimensions)
- Embedded:
  - A single text input
  - A batch of multiple texts
- Verified that:
  - All texts map to fixed-size vectors
  - Multiple texts form a matrix of shape `(n_texts, 768)`
- Demonstrated **semantic similarity** using cosine similarity:
  - "cat" is much closer to "kitten" (0.7909) than to "rocket" (0.3781)

## What this shows

- Models do not operate on raw text; they operate on **numbers**
- Embeddings are the bridge between **language and mathematics**
- **Semantic similarity becomes geometric proximity**, measured here with cosine similarity
- This is the foundation for:
  - Semantic search
  - Document clustering
  - Sentiment classification
  - RAG systems and modern AI applications
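
As a rough illustration of the first item, semantic search reduces to scoring every document vector against a query vector and sorting by that score. This is a minimal sketch under stated assumptions: `rank_by_similarity` is a hypothetical helper, and the 2-dimensional vectors are hand-made stand-ins, not real embeddings.

```python
from typing import List, Tuple

import numpy as np


def rank_by_similarity(
    query_vec: np.ndarray, doc_vecs: np.ndarray, docs: List[str]
) -> List[Tuple[str, float]]:
    """Return (document, cosine similarity) pairs, most similar first."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    order = np.argsort(-scores)  # indices sorted by descending score
    return [(docs[i], float(scores[i])) for i in order]


# Hand-made 2-dimensional vectors, NOT real embeddings.
docs = ["doc about pets", "doc about space", "doc about both"]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([0.9, 0.1])  # mostly "pets"

for doc, score in rank_by_similarity(query_vec, doc_vecs, docs):
    print(f"{score:.3f}  {doc}")
```

With real data, `doc_vecs` would come from `get_embeddings_batch(docs)` and `query_vec` from `get_embedding(query)`.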

## What’s next

In **Lab 2**, we will:
- Build a real **similarity engine**
- Let users compare any two texts
- Rank documents by semantic similarity
