# Notebook 1: Understanding Embeddings & Similarity Concepts

**Learning Objectives:**
- Understand embeddings conceptually and practically
- Explore different similarity/distance metrics and when to use them
- Generate embeddings using OpenAI's API
- Compare similarity approaches without getting lost in implementation details
---

## Section 1: What Are Embeddings? 

### The GPS Coordinates Analogy

Imagine you're trying to find restaurants in a city. Instead of describing each restaurant with words like "cozy", "expensive", "Italian", you could place each restaurant on a map using GPS coordinates (latitude, longitude).

**Embeddings work the same way for text:**
- Instead of describing words with other words, we represent them as coordinates in "meaning space"
- Similar meanings end up close together in this space
- Different meanings are far apart
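
To make the analogy concrete, here is a toy sketch that places four terms on a 2D "map" and looks up each term's nearest neighbor. The coordinates are hand-picked for illustration and are not real embeddings; `toy_space` and `nearest` are hypothetical names used only in this example.

```python
import numpy as np

# Hand-picked 2D "meaning coordinates" (illustrative only, NOT real embeddings)
toy_space = {
    "profit":  np.array([0.9, 0.8]),
    "revenue": np.array([0.8, 0.9]),
    "stocks":  np.array([-0.7, 0.6]),
    "bonds":   np.array([-0.8, 0.5]),
}

def nearest(term: str, space: dict) -> str:
    """Return the other term whose coordinates are closest (straight-line distance)."""
    others = [t for t in space if t != term]
    return min(others, key=lambda t: np.linalg.norm(space[term] - space[t]))

for term in toy_space:
    print(f"{term:>8} -> nearest neighbor: {nearest(term, toy_space)}")
```

Real embeddings work on the same principle, only with far more than two coordinates per term.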

Let's explore this concept with financial terms.

In [None]:
# Setup and imports
import numpy as np
import matplotlib.pyplot as plt
import openai
from sklearn.decomposition import PCA
import json
import os
from typing import List, Dict, Tuple

# OpenAI API Key Setup
# Option 1: Set as environment variable (RECOMMENDED)
# export OPENAI_API_KEY="your-api-key-here"

# Option 2: Load from .env file (for development)
# Create a .env file in your project directory with: OPENAI_API_KEY=your-api-key-here
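try:
    from dotenv import load_dotenv
    # load_dotenv() returns False when no .env values were found
    if load_dotenv():
        print("✅ Loaded environment variables from .env file")
    else:
        print("📋 No .env values loaded; relying on existing environment variables")
except ImportError:
    print("📋 Install python-dotenv if you want to use .env files: pip install python-dotenv")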

# Option 3: Set directly in code (NOT RECOMMENDED for production)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Verify API key is available
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    print("❌ OPENAI_API_KEY not found!")
    print("Please set your OpenAI API key using one of these methods:")
    print("1. Environment variable: export OPENAI_API_KEY='your-key'")
    print("2. Create .env file with: OPENAI_API_KEY=your-key")
    print("3. Get your key from: https://platform.openai.com/api-keys")
else:
    print(f"✅ OpenAI API key found (ends with: ...{api_key[-4:]})")

# Set up OpenAI client (only when a key is available; the constructor raises without one)
client = openai.OpenAI() if api_key else None

if client:
    print("🚀 Setup complete! Ready to generate embeddings.")

### Visualizing Financial Concepts in 2D Space

Let's start with some financial terms and see how they cluster together when we represent them as embeddings.

In [None]:
# Sample financial terms - organized by categories
financial_terms = {
    "Profitability": ["profit", "revenue", "earnings", "income", "margin"],
    "Losses": ["loss", "deficit", "debt", "bankruptcy", "default"],
    "Investment": ["portfolio", "stocks", "bonds", "mutual funds", "dividend"],
    "Analysis": ["valuation", "analysis", "forecast", "projection", "estimate"]
}

# Flatten the terms for processing
all_terms = []
term_categories = []
for category, terms in financial_terms.items():
    all_terms.extend(terms)
    term_categories.extend([category] * len(terms))

print(f"We'll analyze {len(all_terms)} financial terms across {len(financial_terms)} categories")
print(f"Terms: {all_terms}")

**💡 Key Insight Preview:**

Before we generate embeddings, let's make a prediction:
- Which terms do you think will be closest to each other?
- Which terms will be furthest apart?
- How might "profit" and "loss" relate to each other?

Keep these predictions in mind as we explore the actual embeddings!

---

## Section 2: Creating Embeddings with APIs (25 minutes)

Now let's generate actual embeddings using OpenAI's API and see how our predictions match reality.

In [None]:
from typing import Optional

def get_embedding(text: str, model: str = "text-embedding-3-small") -> Optional[List[float]]:
    """
    Get embedding for a single text using OpenAI API
    
    Args:
        text: Text to embed
        model: OpenAI embedding model to use
    
    Returns:
        List of floats representing the embedding vector, or None if the request fails
    """
    try:
        response = client.embeddings.create(
            model=model,
            input=text
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error getting embedding for '{text}': {e}")
        return None

def get_embeddings_batch(texts: List[str], model: str = "text-embedding-3-small") -> Dict[str, List[float]]:
    """
    Get embeddings for multiple texts efficiently
    
    Args:
        texts: List of texts to embed
        model: OpenAI embedding model to use
    
    Returns:
        Dictionary mapping text to embedding vector
    """
    embeddings = {}
    
    try:
        # OpenAI allows batch processing up to 2048 inputs
        response = client.embeddings.create(
            model=model,
            input=texts
        )
        
        for i, embedding_data in enumerate(response.data):
            embeddings[texts[i]] = embedding_data.embedding
            
    except Exception as e:
        print(f"Error in batch embedding: {e}")
        # Fallback to individual requests
        for text in texts:
            embedding = get_embedding(text, model)
            if embedding:
                embeddings[text] = embedding
    
    return embeddings

In [None]:
# Generate embeddings for our financial terms
print("Generating embeddings for financial terms...")
term_embeddings = get_embeddings_batch(all_terms)

print(f"\nSuccessfully generated embeddings for {len(term_embeddings)} terms")

# Let's examine the properties of these embeddings
if term_embeddings:
    sample_term = list(term_embeddings.keys())[0]
    sample_embedding = term_embeddings[sample_term]
    
    print(f"\nEmbedding Properties:")
    print(f"- Dimensions: {len(sample_embedding)}")
    print(f"- Sample values: {sample_embedding[:5]}...")
    print(f"- Value range: {min(sample_embedding):.4f} to {max(sample_embedding):.4f}")
    print(f"- Vector magnitude: {np.linalg.norm(sample_embedding):.4f}")

### Understanding Embedding Properties

**What do these numbers mean?**

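- **Dimensions (1536):** Each input text becomes a point in 1536-dimensional space (the default size for `text-embedding-3-small`)
- **Values (-1 to 1):** Each value is one coordinate of that point; because the vector has length 1, every value lies between -1 and 1, and most are close to 0
- **Vector Magnitude:** The "length" of the vector, which is 1 because OpenAI embeddings are normalized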

**Key Insight:** We can't interpret individual dimensions, but we can compare entire vectors!
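
A quick way to see what "normalized" means is to scale an arbitrary vector to unit length and check its magnitude. The sketch below uses a made-up 4-dimensional vector rather than a real API response, and `normalize` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def normalize(vec) -> np.ndarray:
    """Scale a vector to unit length (L2 norm of 1); zero vectors are returned unchanged."""
    arr = np.asarray(vec, dtype=float)
    norm = np.linalg.norm(arr)
    return arr if norm == 0 else arr / norm

raw = [2.0, -1.0, 0.5, 4.0]          # made-up values, not a real embedding
unit = normalize(raw)

print(f"Magnitude before: {np.linalg.norm(raw):.4f}")
print(f"Magnitude after:  {np.linalg.norm(unit):.4f}")
# Every component of a unit-length vector lies in [-1, 1]
print(f"Value range after: {unit.min():.4f} to {unit.max():.4f}")
```

Normalizing changes the length of the vector but not its direction, which is why comparisons based on direction are unaffected.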

In [None]:
# Let's visualize these high-dimensional embeddings in 2D
# We'll use PCA (Principal Component Analysis) to reduce dimensions

def visualize_embeddings_2d(embeddings: Dict[str, List[float]], categories: List[str]):
    """
    Visualize high-dimensional embeddings in 2D using PCA
    """
    # Convert to numpy array
    terms = list(embeddings.keys())
    if len(terms) != len(categories):
        raise ValueError("categories must have one entry per embedded term, in the same order")
    vectors = np.array([embeddings[term] for term in terms])
    
    # Reduce to 2D using PCA
    pca = PCA(n_components=2)
    vectors_2d = pca.fit_transform(vectors)
    
    # Create color map for categories (dict.fromkeys keeps first-seen order, unlike set)
    unique_categories = list(dict.fromkeys(categories))
    colors = plt.cm.Set3(np.linspace(0, 1, len(unique_categories)))
    category_colors = {cat: colors[i] for i, cat in enumerate(unique_categories)}
    
    # Plot
    plt.figure(figsize=(12, 8))
    
    for i, (term, category) in enumerate(zip(terms, categories)):
        x, y = vectors_2d[i]
        plt.scatter(x, y, c=[category_colors[category]], s=100, alpha=0.7)
        plt.annotate(term, (x, y), xytext=(5, 5), textcoords='offset points', 
                    fontsize=10, ha='left')
    
    # Add legend
    for category, color in category_colors.items():
        plt.scatter([], [], c=[color], s=100, label=category, alpha=0.7)
    
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.title('Financial Terms in 2D Embedding Space', fontsize=14)
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Print explained variance
    print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
    print(f"Total variance captured in 2D: {sum(pca.explained_variance_ratio_):.2%}")

# Visualize our financial terms
if term_embeddings:
    visualize_embeddings_2d(term_embeddings, term_categories)

### 🔍 Observation Questions

Look at the visualization above and consider:

1. **Clustering:** Do similar financial concepts cluster together?
2. **Separation:** Are different categories well-separated?
3. **Surprises:** Any unexpected relationships or groupings?
4. **Relationships:** How do opposites like "profit" and "loss" relate?

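**Key Insight:** Even though we reduced 1536 dimensions to just 2, we can often still see meaningful patterns! Check the explained variance printed under the plot: the 2D view shows only that share of the structure, so points that look close in the plot are not guaranteed to be close in the full space.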
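
To see how much a 2D projection can leave out, the following sketch computes PCA by hand with NumPy's SVD instead of scikit-learn's `PCA` class. It runs on synthetic random data (an assumption made so the example needs no API call), so the exact percentages will differ from what you see with real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: 20 points in 50 dimensions (NOT real API output)
X = rng.normal(size=(20, 50))

# PCA by hand: center the data, then take the singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Each squared singular value is proportional to the variance along that component
explained_ratio = S**2 / np.sum(S**2)
points_2d = X_centered @ Vt[:2].T     # coordinates on the first two components

print(f"2D projection shape: {points_2d.shape}")
print(f"Variance captured by 2 components: {explained_ratio[:2].sum():.2%}")
print(f"Variance left out of the plot:     {1 - explained_ratio[:2].sum():.2%}")
```

Random data has no dominant directions, so most of the variance is left out here. Real embeddings of related terms usually have more structure, but the same check applies.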

### Cost & Performance Considerations

Let's understand the practical aspects of using embedding APIs:

In [None]:
# Calculate embedding costs and performance
def analyze_embedding_costs(texts: List[str], model: str = "text-embedding-3-small"):
    """
    Analyze costs and performance of embedding generation
    """
    # OpenAI pricing (as of 2024)
    pricing = {
        "text-embedding-3-small": 0.00002,  # per 1K tokens
        "text-embedding-3-large": 0.00013,  # per 1K tokens
    }
    
    # Estimate tokens (rough approximation: 1 token ≈ 4 characters)
    total_chars = sum(len(text) for text in texts)
    estimated_tokens = total_chars / 4
    
    cost_per_1k = pricing.get(model, 0.00002)
    estimated_cost = (estimated_tokens / 1000) * cost_per_1k
    
    print(f"Embedding Analysis for {len(texts)} texts:")
    print(f"- Model: {model}")
    print(f"- Total characters: {total_chars:,}")
    print(f"- Estimated tokens: {estimated_tokens:,.0f}")
    print(f"- Estimated cost: ${estimated_cost:.6f}")
    print(f"- Cost per text: ${estimated_cost/len(texts):.6f}")

analyze_embedding_costs(all_terms)

# Compare models
print("\n" + "="*50)
print("Model Comparison:")
print("text-embedding-3-small: 1536 dimensions, faster, cheaper")
print("text-embedding-3-large: 3072 dimensions, more accurate, more expensive")
print("\nFor most applications, 'small' model is sufficient!")

---

## Section 3: Distance Metrics Deep Dive

Now that we have embeddings, how do we measure how similar they are? There are several approaches, each with different strengths and use cases.

### 3.1 Cosine Similarity (15 minutes)

**Concept:** Measures the angle between two vectors (direction, not magnitude)

**When to Use:** Text similarity, normalized data, when you care about meaning direction but not intensity

**Range:** -1 (opposite direction) to 1 (same direction); 0 means the vectors are perpendicular

**Visual Intuition:** Think of two arrows pointing in space - cosine similarity measures how much they point in the same direction.
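
The arrow picture can be made literal by converting a cosine value back into an angle. This is a small sketch with 2D vectors chosen for readability; `angle_in_degrees` is a hypothetical helper, not part of any library.

```python
import numpy as np

def angle_in_degrees(vec1, vec2) -> float:
    """Angle between two non-zero vectors, recovered from their cosine similarity."""
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against tiny floating-point overshoots outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

for a, b in [([1, 0], [1, 1]), ([1, 0], [0, 1]), ([1, 0], [-1, 0])]:
    print(f"{a} vs {b}: {angle_in_degrees(a, b):.1f} degrees")
```

A smaller angle means a higher cosine similarity, regardless of how long either arrow is.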

In [None]:
def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """
    Calculate cosine similarity between two vectors
    
    Formula: cos(θ) = (A · B) / (||A|| * ||B||)
    """
    # Convert to numpy arrays
    a = np.array(vec1)
    b = np.array(vec2)
    
    # Calculate dot product
    dot_product = np.dot(a, b)
    
    # Calculate magnitudes
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)
    
    # Avoid division by zero
    if magnitude_a == 0 or magnitude_b == 0:
        return 0
    
    return dot_product / (magnitude_a * magnitude_b)

# Demonstrate with simple 2D vectors first
print("Simple 2D Vector Examples:")
print("Vector A: [1, 0] (pointing right)")
print("Vector B: [0, 1] (pointing up)")
print(f"Cosine similarity: {cosine_similarity([1, 0], [0, 1]):.3f}")
print("-> Perpendicular vectors have similarity of 0")

print("\nVector A: [1, 1] (pointing northeast)")
print("Vector B: [2, 2] (pointing northeast, but longer)")
print(f"Cosine similarity: {cosine_similarity([1, 1], [2, 2]):.3f}")
print("-> Same direction = perfect similarity, regardless of length")

print("\nVector A: [1, 0] (pointing right)")
print("Vector B: [-1, 0] (pointing left)")
print(f"Cosine similarity: {cosine_similarity([1, 0], [-1, 0]):.3f}")
print("-> Opposite directions = -1 similarity")

In [None]:
# Now let's apply cosine similarity to our financial terms
def find_most_similar_terms(target_term: str, embeddings: Dict[str, List[float]], top_k: int = 5):
    """
    Find the most similar terms to a target term using cosine similarity
    """
    if target_term not in embeddings:
        print(f"Term '{target_term}' not found in embeddings")
        return
    
    target_embedding = embeddings[target_term]
    similarities = []
    
    for term, embedding in embeddings.items():
        if term != target_term:  # Skip the target term itself
            similarity = cosine_similarity(target_embedding, embedding)
            similarities.append((term, similarity))
    
    # Sort by similarity (highest first)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    print(f"Most similar terms to '{target_term}':")
    for i, (term, similarity) in enumerate(similarities[:top_k], 1):
        print(f"{i}. {term}: {similarity:.3f}")
    
    return similarities[:top_k]

# Test with different financial terms
if term_embeddings:
    print("=" * 50)
    find_most_similar_terms("profit", term_embeddings)
    
    print("\n" + "=" * 50)
    find_most_similar_terms("loss", term_embeddings)
    
    print("\n" + "=" * 50)
    find_most_similar_terms("stocks", term_embeddings)

**🤔 Analysis Questions:**
1. Are the most similar terms what you expected?
2. How similar are "profit" and "loss"? (They're related concepts but opposite meanings)
3. What does this tell us about how embeddings capture meaning?

**Financial Use Case:** Cosine similarity is well suited to comparing document themes because it ignores vector length, so a short and a long document with the same topic mix can still score as highly similar.
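
The length-independence claim can be illustrated with toy term-count vectors. These counts are invented and stand in for document vectors only for the sake of the example; `cosine` is a local helper defined here so the snippet runs by itself.

```python
import numpy as np

# Toy term-count vectors over the vocabulary ["profit", "loss", "dividend"]
# (a stand-in for document vectors; the counts are made up)
short_report = np.array([2, 1, 1])
long_report = short_report * 10      # same topic mix, ten times the length
other_report = np.array([0, 5, 0])   # different topic mix

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"short vs long  (cosine):    {cosine(short_report, long_report):.3f}")
print(f"short vs long  (Euclidean): {np.linalg.norm(short_report - long_report):.3f}")
print(f"short vs other (cosine):    {cosine(short_report, other_report):.3f}")
```

The short and long reports point in the same direction, so their cosine similarity is maximal even though the Euclidean distance between them is large.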

### 3.2 Euclidean Distance

**Concept:** Straight-line distance in vector space (like measuring with a ruler)

**When to Use:** When magnitude matters, clustering, when you want absolute differences

**Range:** 0 (identical) to ∞ (very different)

**Key Difference:** Unlike cosine similarity, Euclidean distance considers both direction AND magnitude
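
When every vector already has length 1, magnitude no longer varies, and the two metrics are tied together by the identity ‖a − b‖² = 2 − 2·cos(θ). The sketch below checks this numerically on random unit vectors, which are synthetic stand-ins for normalized embeddings, and shows that both metrics then rank neighbors in the same order.

```python
import numpy as np

rng = np.random.default_rng(42)
# Random unit-length vectors standing in for normalized embeddings (synthetic data)
vectors = rng.normal(size=(5, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query, candidates = vectors[0], vectors[1:]
cosines = candidates @ query
distances = np.linalg.norm(candidates - query, axis=1)

# Identity for unit vectors: squared distance = 2 - 2 * cosine similarity
print("Max deviation from identity:", np.max(np.abs(distances**2 - (2 - 2 * cosines))))

# Highest cosine first vs. smallest distance first: same ordering
print("Ranking by cosine:   ", np.argsort(-cosines))
print("Ranking by Euclidean:", np.argsort(distances))
```

For unnormalized vectors the identity does not hold, and the two metrics can disagree about which neighbor is closest.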

In [None]:
def euclidean_distance(vec1: List[float], vec2: List[float]) -> float:
    """
    Calculate Euclidean distance between two vectors
    
    Formula: sqrt(Σ(ai - bi)²)
    """
    a = np.array(vec1)
    b = np.array(vec2)
    return np.sqrt(np.sum((a - b) ** 2))

# Demonstrate the difference with simple examples
print("Comparing Cosine Similarity vs Euclidean Distance:")
print("\nVector A: [1, 1]")
print("Vector B: [2, 2] (same direction, double length)")
print(f"Cosine similarity: {cosine_similarity([1, 1], [2, 2]):.3f} (perfect similarity)")
print(f"Euclidean distance: {euclidean_distance([1, 1], [2, 2]):.3f} (not zero - considers magnitude)")

print("\nVector A: [1, 0]")
print("Vector B: [0, 1] (perpendicular)")
print(f"Cosine similarity: {cosine_similarity([1, 0], [0, 1]):.3f} (perpendicular)")
print(f"Euclidean distance: {euclidean_distance([1, 0], [0, 1]):.3f} (distance in space)")

In [None]:
# Compare the two approaches on our financial terms
def compare_similarity_methods(term1: str, term2: str, embeddings: Dict[str, List[float]]):
    """
    Compare different similarity methods for two terms
    """
    if term1 not in embeddings or term2 not in embeddings:
        print("One or both terms not found in embeddings")
        return
    
    emb1 = embeddings[term1]
    emb2 = embeddings[term2]
    
    cosine_sim = cosine_similarity(emb1, emb2)
    euclidean_dist = euclidean_distance(emb1, emb2)
    
    print(f"Comparing '{term1}' and '{term2}':")
    print(f"  Cosine similarity: {cosine_sim:.4f} (higher = more similar)")
    print(f"  Euclidean distance: {euclidean_dist:.4f} (lower = more similar)")
    return cosine_sim, euclidean_dist

# Test with interesting pairs
if term_embeddings:
    pairs_to_test = [
        ("profit", "revenue"),
        ("profit", "loss"),
        ("stocks", "bonds"),
        ("analysis", "forecast"),
        ("debt", "dividend")
    ]
    
    for term1, term2 in pairs_to_test:
        if term1 in term_embeddings and term2 in term_embeddings:
            compare_similarity_methods(term1, term2, term_embeddings)
            print()

### 3.3 Manhattan Distance

**Concept:** Sum of absolute differences along each dimension (like walking city blocks)

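**When to Use:** High-dimensional or sparse data, where it tends to be less affected by distance concentration than Euclidean distance (it reduces, but does not remove, the "curse of dimensionality")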

**Visual Analogy:** In Manhattan, you can't walk diagonally through buildings - you have to go block by block

In [None]:
def manhattan_distance(vec1: List[float], vec2: List[float]) -> float:
    """
    Calculate Manhattan distance between two vectors
    
    Formula: Σ|ai - bi|
    """
    a = np.array(vec1)
    b = np.array(vec2)
    return np.sum(np.abs(a - b))

# Quick comparison
print("Distance Comparison for [1, 1] and [3, 3]:")
print(f"Euclidean distance: {euclidean_distance([1, 1], [3, 3]):.3f} (straight line)")
print(f"Manhattan distance: {manhattan_distance([1, 1], [3, 3]):.3f} (city blocks)")

# Manhattan distance is often used in high-dimensional spaces
# because it's less affected by the "curse of dimensionality"
print("\nManhattan distance is useful when:")
print("- Working with high-dimensional sparse data")
print("- You want to avoid distance concentration effects")
print("- Computational efficiency is important")

### 3.4 Dot Product Similarity

**Concept:** Sum of the element-wise products of two vectors (considers both direction and magnitude)

**When to Use:** When both direction and magnitude matter, when vectors are normalized

**Key Insight:** For normalized vectors (like OpenAI embeddings), dot product equals cosine similarity!

In [None]:
def dot_product_similarity(vec1: List[float], vec2: List[float]) -> float:
    """
    Calculate dot product similarity between two vectors
    
    Formula: Σ(ai * bi)
    """
    a = np.array(vec1)
    b = np.array(vec2)
    return np.dot(a, b)

# Compare normalized vs unnormalized vectors
print("Normalized vs Unnormalized Vectors:")

# Unnormalized vectors
vec_a = [3, 4]  # magnitude = 5
vec_b = [6, 8]  # magnitude = 10, same direction as A

print(f"\nUnnormalized vectors: {vec_a} and {vec_b}")
print(f"Cosine similarity: {cosine_similarity(vec_a, vec_b):.3f}")
print(f"Dot product: {dot_product_similarity(vec_a, vec_b):.3f}")
print("-> Different values because the vectors are not unit length")

# Normalized versions
vec_a_norm = np.array(vec_a) / np.linalg.norm(vec_a)
vec_b_norm = np.array(vec_b) / np.linalg.norm(vec_b)

print(f"\nNormalized vectors: {vec_a_norm} and {vec_b_norm}")
print(f"Cosine similarity: {cosine_similarity(vec_a_norm, vec_b_norm):.3f}")
print(f"Dot product: {dot_product_similarity(vec_a_norm, vec_b_norm):.3f}")
print("-> Same values because vectors are normalized")

print("\n🔑 Key Insight: OpenAI embeddings are already normalized!")
print("So for OpenAI embeddings: dot product = cosine similarity")

In [None]:
# Verify this with our financial term embeddings
if term_embeddings and len(term_embeddings) >= 2:
    terms = list(term_embeddings.keys())[:2]
    emb1 = term_embeddings[terms[0]]
    emb2 = term_embeddings[terms[1]]
    
    cosine_sim = cosine_similarity(emb1, emb2)
    dot_product = dot_product_similarity(emb1, emb2)
    
    print(f"Verification with OpenAI embeddings for '{terms[0]}' and '{terms[1]}':")
    print(f"Cosine similarity: {cosine_sim:.6f}")
    print(f"Dot product: {dot_product:.6f}")
    print(f"Difference: {abs(cosine_sim - dot_product):.8f}")
    print("\n✅ Confirmed: They're essentially the same for normalized embeddings!")

### Practical Comparison Exercise

Let's use the same financial text pairs with all 4 metrics and see when each gives different results:

In [None]:
def comprehensive_similarity_analysis(embeddings: Dict[str, List[float]]):
    """
    Compare all similarity metrics across financial term pairs
    """
    # Define interesting pairs to analyze
    pairs = [
        ("profit", "revenue"),    # Should be very similar
        ("profit", "loss"),       # Related but opposite
        ("stocks", "bonds"),      # Related investment types
        ("debt", "dividend"),     # Unrelated financial terms
        ("analysis", "valuation") # Similar analytical concepts
    ]
    
    print("Comprehensive Similarity Analysis")
    print("=" * 80)
    print(f"{'Term Pair':<20} {'Cosine':<10} {'Dot Prod':<10} {'Euclidean':<12} {'Manhattan':<12}")
    print("-" * 80)
    
    for term1, term2 in pairs:
        if term1 in embeddings and term2 in embeddings:
            emb1 = embeddings[term1]
            emb2 = embeddings[term2]
            
            cosine_sim = cosine_similarity(emb1, emb2)
            dot_prod = dot_product_similarity(emb1, emb2)
            euclidean_dist = euclidean_distance(emb1, emb2)
            manhattan_dist = manhattan_distance(emb1, emb2)
            
            pair_name = f"{term1}-{term2}"
            print(f"{pair_name:<20} {cosine_sim:<10.4f} {dot_prod:<10.4f} {euclidean_dist:<12.4f} {manhattan_dist:<12.2f}")
    
    print("\n📊 Interpretation Guide:")
    print("• Cosine/Dot Product: -1 to 1 for unit-length vectors (higher = more similar)")
    print("• Euclidean/Manhattan: 0 to ∞ (lower = more similar)")
    print("• For normalized embeddings: Cosine ≈ Dot Product")

if term_embeddings:
    comprehensive_similarity_analysis(term_embeddings)

## Summary: When to Use Each Similarity Metric

| Metric | Best For | Pros | Cons |
|--------|----------|------|------|
| **Cosine Similarity** | Text similarity, semantic search | Ignores magnitude, normalized scale | May miss intensity differences |
| **Dot Product** | Normalized vectors, fast computation | Simple, efficient; equals cosine for normalized vectors | Sensitive to vector length when vectors are not normalized |
| **Euclidean Distance** | Clustering, when magnitude matters | Intuitive, considers all differences | Affected by dimensionality |
| **Manhattan Distance** | High-dimensional sparse data | Robust to outliers, efficient | Less intuitive geometrically |

### 🎯 For Financial Document Analysis:
- **Cosine Similarity**: Perfect for comparing document themes/topics
- **Euclidean Distance**: Good for clustering similar documents
- **Manhattan Distance**: Useful for large document collections

### 🔑 Key Takeaway:
**Cosine similarity is the gold standard for text embeddings** because it focuses on meaning direction rather than intensity, making it perfect for semantic search and document comparison.
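
As a rough sketch of what cosine-based semantic search looks like in practice, the snippet below ranks a small matrix of vectors against a query with one matrix multiplication. The 3D vectors are made up for readability, and `top_k_by_cosine` is a hypothetical helper rather than a library function.

```python
import numpy as np

def top_k_by_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 3) -> list:
    """Return (row_index, cosine_similarity) pairs for the k rows most similar to query."""
    # Normalize so that a plain dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]

# Made-up 3D vectors standing in for document embeddings
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
results = top_k_by_cosine(np.array([2.0, 0.0, 0.0]), docs, k=2)
for idx, score in results:
    print(f"doc {idx}: {score:.3f}")
```

Scoring all rows in one matrix product is usually much faster than looping over pairs in Python, which matters once a collection grows beyond a few thousand vectors.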

---

## Next Steps: Preparing for Notebook 2

In this notebook, we've covered:

- ✅ **Conceptual understanding** of embeddings as "meaning coordinates"
- ✅ **Practical experience** generating embeddings with OpenAI's API
- ✅ **Deep dive** into similarity metrics and when to use each
- ✅ **Cost and performance** considerations for real applications

**In Notebook 2, we'll apply these concepts to:**
- Process and chunk real financial documents
- Build efficient similarity search systems
- Create a complete document retrieval pipeline
- Handle large document collections with optimization strategies

**Save your embeddings for reuse:**

In [None]:
# Save embeddings for use in Notebook 2
if term_embeddings:
    with open('financial_term_embeddings.json', 'w') as f:
        json.dump(term_embeddings, f)
    print("✅ Saved embeddings to 'financial_term_embeddings.json'")
    print("📋 Ready for Notebook 2: Document Processing & Search Systems")
else:
    print("⚠️ No embeddings to save. Make sure to run the embedding generation cells above.")