# Embedding Models - Simple Examples

This notebook contains simple, beginner-friendly examples of three popular embedding models:
1. **OpenAI Embeddings** (text-embedding-3-small)
2. **Google Gemini Embeddings** (text-embedding-004)
3. **Open Source: Sentence Transformers** (all-MiniLM-L6-v2)

## What are Embeddings?
Embeddings are numerical representations (vectors) of text that capture semantic meaning. Texts with similar meaning map to vectors that point in similar directions, which is usually measured with cosine similarity.
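
To make "similar embeddings" concrete, here is a minimal sketch that uses only the standard library. The three-dimensional vectors are hand-made for illustration and are not output from any model. Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (illustrative values, not model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(f"cat vs kitten: {cosine_similarity(cat, kitten):.3f}")
print(f"cat vs car:    {cosine_similarity(cat, car):.3f}")
```

The first score should come out much higher than the second, because the `cat` and `kitten` vectors point in nearly the same direction.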

## Setup: Install Required Libraries

Run this first to install all dependencies:

In [None]:
# Uncomment and run if you need to install:
# !pip install openai google-genai sentence-transformers python-dotenv scikit-learn

---

## Example 1: OpenAI Embeddings

**Model:** `text-embedding-3-small` (cost-effective, 1536 dimensions by default)

**You'll need:** OpenAI API key from https://platform.openai.com/api-keys

In [2]:
from openai import OpenAI
from dotenv import load_dotenv
import os

# Set your API key (use environment variable or enter directly)
# Option 1: Set environment variable
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Option 2: Use .env file (recommended)
# Create a .env file with: OPENAI_API_KEY=your-api-key-here


# load environment variables from .env file
load_dotenv()
client = OpenAI()  # Automatically reads OPENAI_API_KEY from environment

# Simple text to embed
text = "Artificial intelligence is transforming the world, AI is moving at a rapid pace that we might have difficulty keeping up with."

# Get embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

embedding = response.data[0].embedding

print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")

Text: Artificial intelligence is transforming the world, AI is moving at a rapid pace that we might have difficulty keeping up with.
Embedding dimensions: 1536
First 10 values: [0.00750261265784502, 0.023940667510032654, 0.024595031514763832, 0.01421547681093216, 0.013662653043866158, -0.029017623513936996, -0.0006370874471031129, 0.07604151964187622, 0.002215527230873704, -0.003652587765827775]
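
The cell above embeds one string, but the same endpoint also accepts a list of strings. The helper below is a sketch of batching several texts into one request. The name `embed_batch` is hypothetical, and it assumes `client` is an `openai.OpenAI` instance like the one created above.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return vectors in input order.

    Assumption: `client` exposes `embeddings.create(model=..., input=...)`,
    as the `openai.OpenAI` client used in this notebook does.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries an `index` that maps it back to its input position
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage (requires OPENAI_API_KEY, as in the cell above):
# from openai import OpenAI
# client = OpenAI()
# vectors = embed_batch(client, ["first text", "second text"])
# print(len(vectors), len(vectors[0]))
```

Sending texts together usually means fewer round trips than calling the API once per string.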


---

## Example 2: Google Gemini Embeddings

**Model:** `text-embedding-004` (free tier available)

**You'll need:** Google API key from https://aistudio.google.com/apikey

In [6]:
from google import genai
from google.genai import types
from dotenv import load_dotenv
import os

# Set your API key
# Option 1: Set environment variable
# os.environ["GOOGLE_API_KEY"] = "your-api-key-here"

# Option 2: Use .env file
# Create a .env file with: GOOGLE_API_KEY=your-api-key-here

# load environment variables from .env file
load_dotenv()
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

# Simple text to embed
text = "Machine learning helps computers learn from data"

# Get embedding
response = client.models.embed_content(
    model="text-embedding-004",
    contents=text
)

embedding = response.embeddings[0].values

print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")

Text: Machine learning helps computers learn from data
Embedding dimensions: 768
First 10 values: [-0.063451216, 0.025164412, 0.008735433, -0.0020697138, -0.04107696, 0.043614008, 0.0063226507, 0.020994253, -0.034076635, 0.00076228636]
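
Looking up `os.environ["GOOGLE_API_KEY"]` directly raises a bare `KeyError` when the variable is missing, which can confuse beginners. A small helper can give a clearer message. This is a sketch, and `get_api_key` is a hypothetical name, not part of any SDK.

```python
import os

def get_api_key(name="GOOGLE_API_KEY"):
    """Return the named environment variable, or raise a readable error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it "
            "before running this cell."
        )
    return value

# Usage, replacing the direct os.environ lookup:
# client = genai.Client(api_key=get_api_key())
```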


---

## Example 3: Open Source - Sentence Transformers

**Model:** `all-MiniLM-L6-v2` (fast, lightweight, runs locally)

**No API key needed!** After the one-time model download, it runs completely on your machine.

In [7]:
from sentence_transformers import SentenceTransformer

# Load the model (downloads on first run, ~90MB)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Simple text to embed
text = "Deep learning is a subset of machine learning"

# Get embedding
embedding = model.encode(text)

print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")

Text: Deep learning is a subset of machine learning
Embedding dimensions: 384
First 10 values: [-0.07020443 -0.05088915  0.07240903  0.01183786  0.01972705 -0.0260805
 -0.0241131  -0.02578819 -0.03455794 -0.05155174]
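
A common preprocessing step before comparing vectors is to scale each one to unit length (L2 normalization), so that a plain dot product gives the cosine similarity. The sketch below shows the idea on tiny hand-made vectors, not model output. In Sentence Transformers, `model.encode(..., normalize_embeddings=True)` should do this for you, though check the documentation for your installed version.

```python
import math

def l2_normalize(vector):
    """Scale a vector so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Tiny illustrative vectors (not real embeddings)
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(a)          # unit-length version of [3, 4]
print(dot(a, b))  # dot product of unit vectors = cosine similarity of the originals
```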


---

## Bonus: Compare Similarity Between Texts

Let's use the open-source model to see how embeddings capture meaning:

In [8]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Three sentences
sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",  # Similar meaning to sentence 1
    "Python is a programming language"  # Different meaning
]

# Get embeddings
embeddings = model.encode(sentences)

# Calculate similarity between sentence 1 and the others
similarity_1_2 = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
similarity_1_3 = cosine_similarity([embeddings[0]], [embeddings[2]])[0][0]

print("Sentences:")
print(f"1: {sentences[0]}")
print(f"2: {sentences[1]}")
print(f"3: {sentences[2]}")
print("\nSimilarity Scores (-1 to 1, higher = more similar):")
print(f"Sentence 1 vs 2: {similarity_1_2:.4f}")
print(f"Sentence 1 vs 3: {similarity_1_3:.4f}")
print("\nNotice: Sentences 1 and 2 have higher similarity (similar meaning)")

Sentences:
1: The cat sits on the mat
2: A feline rests on the rug
3: Python is a programming language

Similarity Scores (-1 to 1, higher = more similar):
Sentence 1 vs 2: 0.5607
Sentence 1 vs 3: 0.0317

Notice: Sentences 1 and 2 have higher similarity (similar meaning)


---

## Summary

| Model | Provider | Cost | Dimensions | Best For |
|-------|----------|------|------------|----------|
| text-embedding-3-small | OpenAI | Paid | 1536 | Production apps, high quality |
| text-embedding-004 | Google | Free tier available | 768 | Learning, prototyping |
| all-MiniLM-L6-v2 | Open Source | Free | 384 | Local testing, privacy |
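
The Dimensions column also affects storage. As a rough back-of-the-envelope sketch, assuming each value is stored as a 32-bit float and ignoring any index overhead, the raw size of a collection can be estimated like this:

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

# Dimensions taken from the summary table above
dimensions = {
    "text-embedding-3-small": 1536,
    "text-embedding-004": 768,
    "all-MiniLM-L6-v2": 384,
}

def storage_mb(dim, num_vectors, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size in megabytes (10**6 bytes), ignoring index overhead."""
    return dim * bytes_per_value * num_vectors / 1_000_000

for name, dim in dimensions.items():
    print(f"{name}: {storage_mb(dim, 100_000):.1f} MB per 100,000 vectors")
```

Under these assumptions, halving the dimension count halves the raw storage, which is one reason smaller models can be attractive for local experiments.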

## Next Steps
- Try embedding different types of text (questions, code, documents)
- Experiment with similarity search
- Build a simple semantic search system
- Learn about vector databases (Chroma, Pinecone, Weaviate)
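
As a starting point for the semantic search idea, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made-up stand-ins; in practice you would replace them with embeddings from any of the models above. The function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, corpus, top_k=2):
    """Return the top_k (score, text) pairs, best match first.

    `corpus` is a list of (text, vector) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for real embeddings (illustrative only)
corpus = [
    ("How to train a neural network", [0.9, 0.1, 0.1]),
    ("Best pasta recipes", [0.1, 0.9, 0.1]),
    ("Intro to gradient descent", [0.8, 0.2, 0.0]),
]
query = [1.0, 0.0, 0.1]  # pretend this came from embedding a machine learning question

for score, text in search(query, corpus):
    print(f"{score:.3f}  {text}")
```

This brute-force scan is fine for small collections. Vector databases exist largely to make the same lookup fast when the corpus grows large.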