# 🎯 Week 1: Vector Similarity

**Learning Objectives:**
1. Understand vector representations and embeddings
2. Master similarity metrics (Cosine, Euclidean, Dot Product)
3. Build a simple semantic search engine
4. Visualize vectors in 2D/3D space

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

---
# Section 1: Theory
---

## Why Vector Similarity?

In AI/ML, we represent data as vectors:
- **Word embeddings**: Words → Vectors (Word2Vec, GloVe)
- **Sentence embeddings**: Sentences → Vectors (BERT, Sentence-BERT)
- **Image embeddings**: Images → Vectors (ResNet, CLIP)

**Key Insight**: A good embedding model maps similar items to nearby vectors, so comparing items reduces to comparing their vectors.

## Similarity Metrics

| Metric | Formula | Range | Best For |
|--------|---------|-------|----------|
| Cosine | $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ | [-1, 1] | Text, normalized vectors |
| Euclidean | $\sqrt{\sum_i (a_i - b_i)^2}$ | [0, ∞) | Dense, continuous features |
| Dot Product | $\sum_i a_i b_i$ | (-∞, ∞) | When magnitude matters |
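
The three formulas in the table can be sketched with NumPy as follows. This is only an illustration: the function names `dot_np`, `cosine_np`, and `euclidean_np` are made up for this example, and returning `0.0` for a zero vector is a convention assumed here, since cosine is undefined in that case.

```python
import numpy as np


def dot_np(a, b):
    """Dot product: sum of element-wise products."""
    return float(np.dot(a, b))


def cosine_np(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumed convention: report 0.0 when either vector has zero length.
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def euclidean_np(a, b):
    """Euclidean distance: L2 norm of the difference vector."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(f"dot={dot_np(a, b):.1f}")
print(f"cosine={cosine_np(a, b):.4f}")
print(f"euclidean={euclidean_np(a, b):.4f}")
```

Note that cosine is a similarity (higher means closer) while Euclidean is a distance (lower means closer), so rankings must be sorted in opposite directions.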

---
# Section 2: Hands-On Implementation
---

## 2.1 Core Vector Operations

In [None]:
def dot_product(v1, v2):
    """Compute dot product of two vectors."""
    return sum(a * b for a, b in zip(v1, v2))


def magnitude(v):
    """Compute the magnitude (L2 norm) of a vector."""
    return sum(x**2 for x in v) ** 0.5


def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors.

    Returns 0.0 if either vector has zero magnitude, where cosine is undefined.
    """
    mag1 = magnitude(v1)
    mag2 = magnitude(v2)
    if mag1 == 0 or mag2 == 0:
        return 0.0
    return dot_product(v1, v2) / (mag1 * mag2)


def euclidean_distance(v1, v2):
    """Compute Euclidean distance between two vectors."""
    return sum((a - b)**2 for a, b in zip(v1, v2)) ** 0.5


def normalize(v):
    """Normalize a vector to unit length (a zero vector is returned unchanged)."""
    mag = magnitude(v)
    if mag == 0:
        return list(v)  # return a copy so the result never aliases the input
    return [x / mag for x in v]

In [None]:
# Test the implementations
v1 = [1, 2, 3]
v2 = [4, 5, 6]

print(f"v1 = {v1}")
print(f"v2 = {v2}")
print(f"Dot product: {dot_product(v1, v2)}")
print(f"Magnitude v1: {magnitude(v1):.4f}")
print(f"Cosine similarity: {cosine_similarity(v1, v2):.4f}")
print(f"Euclidean distance: {euclidean_distance(v1, v2):.4f}")

## 2.2 Simple Semantic Search

In [None]:
# Simulated document embeddings (in real apps, use sentence-transformers)
documents = {
    "doc1": {"text": "Python programming tutorial", "vector": [0.8, 0.3, 0.1]},
    "doc2": {"text": "Machine learning basics", "vector": [0.2, 0.9, 0.4]},
    "doc3": {"text": "Deep neural networks", "vector": [0.1, 0.7, 0.9]},
    "doc4": {"text": "Python data analysis", "vector": [0.7, 0.5, 0.2]},
    "doc5": {"text": "Natural language processing", "vector": [0.3, 0.6, 0.8]},
}

# Query embedding
query = {"text": "Python coding", "vector": [0.9, 0.2, 0.1]}

In [None]:
def search(query_vector, documents, top_k=3):
    """Search for most similar documents."""
    results = []
    for doc_id, doc in documents.items():
        similarity = cosine_similarity(query_vector, doc["vector"])
        results.append({
            "id": doc_id,
            "text": doc["text"],
            "similarity": similarity
        })
    
    # Sort by similarity (descending)
    results.sort(key=lambda x: x["similarity"], reverse=True)
    return results[:top_k]


# Run search
print(f"Query: '{query['text']}'\n")
print("Search Results:")
print("-" * 50)
for i, result in enumerate(search(query["vector"], documents), 1):
    print(f"{i}. {result['text']}")
    print(f"   Similarity: {result['similarity']:.4f}")

## 2.3 Cosine vs Euclidean: When to Use Which?

In [None]:
# Same direction, different magnitudes
short_vector = [1, 1]
long_vector = [10, 10]

print("Same direction, different magnitudes:")
print(f"  Cosine similarity: {cosine_similarity(short_vector, long_vector):.4f}")
print(f"  Euclidean distance: {euclidean_distance(short_vector, long_vector):.4f}")

# Different directions, same magnitude
v1 = [1, 0]
v2 = [0, 1]

print("\nDifferent directions, same magnitude:")
print(f"  Cosine similarity: {cosine_similarity(v1, v2):.4f}")
print(f"  Euclidean distance: {euclidean_distance(v1, v2):.4f}")

print("\n💡 Key Insight:")
print("   - Cosine: Measures DIRECTION (angle) - good for text")
print("   - Euclidean: Measures DISTANCE - good for spatial data")

---
# Section 3: Visualizations
---

## 3.1 2D Vector Visualization

In [None]:
def plot_2d_vectors(vectors, labels, query=None, query_label="Query"):
    """Plot 2D vectors with arrows from origin."""
    fig, ax = plt.subplots(figsize=(10, 8))
    
    colors = plt.cm.Set2(np.linspace(0, 1, len(vectors)))
    
    # Plot document vectors
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        ax.arrow(0, 0, vec[0], vec[1], head_width=0.05, head_length=0.03,
                 length_includes_head=True,  # tip ends exactly at the vector
                 fc=colors[i], ec=colors[i], linewidth=2)
        ax.annotate(label, (vec[0], vec[1]), fontsize=10, 
                    xytext=(5, 5), textcoords='offset points')
    
    # Plot query vector
    if query is not None:
        ax.arrow(0, 0, query[0], query[1], head_width=0.05, head_length=0.03,
                 length_includes_head=True,  # tip ends exactly at the vector
                 fc='red', ec='red', linewidth=3)
        ax.annotate(query_label, (query[0], query[1]), fontsize=12, 
                    color='red', fontweight='bold',
                    xytext=(5, 5), textcoords='offset points')
    
    ax.set_xlim(-0.2, 1.2)
    ax.set_ylim(-0.2, 1.2)
    ax.set_xlabel('Dimension 1')
    ax.set_ylabel('Dimension 2')
    ax.set_title('Vector Space Visualization')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)
    ax.set_aspect('equal')
    plt.show()


# Extract 2D vectors (first 2 dimensions)
doc_vectors = [doc["vector"][:2] for doc in documents.values()]
doc_labels = [doc["text"][:20] for doc in documents.values()]
query_2d = query["vector"][:2]

plot_2d_vectors(doc_vectors, doc_labels, query_2d, "Query")

## 3.2 3D Vector Visualization

In [None]:
def plot_3d_vectors(vectors, labels, query=None):
    """Plot 3D vectors."""
    fig = plt.figure(figsize=(12, 9))
    ax = fig.add_subplot(111, projection='3d')
    
    colors = plt.cm.Set2(np.linspace(0, 1, len(vectors)))
    
    # Plot document vectors as arrows
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        ax.quiver(0, 0, 0, vec[0], vec[1], vec[2], 
                  color=colors[i], arrow_length_ratio=0.1, linewidth=2)
        ax.text(vec[0], vec[1], vec[2], label[:15], fontsize=8)
    
    # Plot query vector
    if query is not None:
        ax.quiver(0, 0, 0, query[0], query[1], query[2],
                  color='red', arrow_length_ratio=0.1, linewidth=3)
        ax.text(query[0], query[1], query[2], 'QUERY', 
                fontsize=10, color='red', fontweight='bold')
    
    ax.set_xlabel('Dim 1')
    ax.set_ylabel('Dim 2')
    ax.set_zlabel('Dim 3')
    ax.set_title('3D Vector Space')
    plt.show()


doc_vectors_3d = [doc["vector"] for doc in documents.values()]
doc_labels = [doc["text"] for doc in documents.values()]

plot_3d_vectors(doc_vectors_3d, doc_labels, query["vector"])

## 3.3 Similarity Heatmap

In [None]:
# Compute pairwise similarities
doc_names = list(documents.keys())
n_docs = len(doc_names)
similarity_matrix = np.zeros((n_docs, n_docs))

for i, doc1 in enumerate(doc_names):
    for j, doc2 in enumerate(doc_names):
        similarity_matrix[i, j] = cosine_similarity(
            documents[doc1]["vector"],
            documents[doc2]["vector"]
        )

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, 
            xticklabels=[documents[d]["text"][:15] for d in doc_names],
            yticklabels=[documents[d]["text"][:15] for d in doc_names],
            annot=True, fmt=".2f", cmap="YlOrRd",
            vmin=0, vmax=1)
plt.title("Document Similarity Heatmap")
plt.tight_layout()
plt.show()

---
# Section 4: Unit Tests
---

In [None]:
def run_tests():
    """Run all unit tests."""
    print("Running Unit Tests...\n")
    
    # Test 1: Dot product
    assert dot_product([1, 2, 3], [4, 5, 6]) == 32
    print("✓ Dot product test passed")
    
    # Test 2: Magnitude
    assert abs(magnitude([3, 4]) - 5.0) < 1e-10
    print("✓ Magnitude test passed")
    
    # Test 3: Cosine similarity - identical vectors
    assert abs(cosine_similarity([1, 0], [1, 0]) - 1.0) < 1e-10
    print("✓ Cosine similarity (identical) test passed")
    
    # Test 4: Cosine similarity - perpendicular vectors
    assert abs(cosine_similarity([1, 0], [0, 1]) - 0.0) < 1e-10
    print("✓ Cosine similarity (perpendicular) test passed")
    
    # Test 5: Cosine similarity - opposite vectors
    assert abs(cosine_similarity([1, 0], [-1, 0]) - (-1.0)) < 1e-10
    print("✓ Cosine similarity (opposite) test passed")
    
    # Test 6: Euclidean distance
    assert abs(euclidean_distance([0, 0], [3, 4]) - 5.0) < 1e-10
    print("✓ Euclidean distance test passed")
    
    # Test 7: Normalize
    normalized = normalize([3, 4])
    assert abs(magnitude(normalized) - 1.0) < 1e-10
    print("✓ Normalize test passed")
    
    # Test 8: Zero vector handling
    assert cosine_similarity([0, 0], [1, 1]) == 0
    print("✓ Zero vector handling test passed")
    
    print("\n🎉 All tests passed!")


run_tests()

---
# Section 5: Interview Prep
---

## Key Questions

### Q1: Why use Cosine Similarity instead of Euclidean Distance for text?

**Answer:**
- Cosine measures the **angle** (direction) between vectors, not their magnitude
- With count-based vectors, longer documents have larger magnitudes but may have the same meaning
- Cosine is invariant to positive scaling of either vector
- Example: "cat dog" vs "cat cat dog dog dog" should be similar
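
The "cat dog" example can be checked numerically. The sketch below assumes a simple bag-of-words representation (raw word counts over a two-word vocabulary); the helper names `count_vector` and `cosine` are hypothetical and not part of the notebook's code.

```python
import math
from collections import Counter


def count_vector(text, vocabulary):
    """Bag-of-words vector: how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def cosine(v1, v2):
    """Cosine similarity for non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.hypot(*v1) * math.hypot(*v2))


vocabulary = ["cat", "dog"]
short_doc = count_vector("cat dog", vocabulary)             # [1, 1]
long_doc = count_vector("cat cat dog dog dog", vocabulary)  # [2, 3]

print(f"Cosine similarity:  {cosine(short_doc, long_doc):.4f}")     # 0.9806
print(f"Euclidean distance: {math.dist(short_doc, long_doc):.4f}")  # 2.2361
```

The cosine score stays close to 1 even though the second document is more than twice as long, while the Euclidean distance grows with every repeated word.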

### Q2: What is the range of cosine similarity?

**Answer:** [-1, 1]
- 1 = Identical direction
- 0 = Perpendicular (orthogonal)
- -1 = Opposite direction
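
One way to build intuition for this range is to convert a cosine value back into an angle. The following is a small sketch under the assumption that both vectors are non-zero; `angle_degrees` is a name invented for this example.

```python
import math


def angle_degrees(v1, v2):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(v1, v2))
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))


print(f"{angle_degrees([1, 0], [2, 0]):.1f}")   # same direction
print(f"{angle_degrees([1, 0], [1, 1]):.1f}")   # diagonal
print(f"{angle_degrees([1, 0], [0, 1]):.1f}")   # perpendicular
print(f"{angle_degrees([1, 0], [-1, 0]):.1f}")  # opposite
```

Vectors whose components are all non-negative, such as word counts, can never point in opposite directions, so their cosine similarity falls in [0, 1].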

### Q3: How do you handle the curse of dimensionality?

**Answer:**
- Reduce dimensionality (PCA for search; t-SNE is mainly suited to visualization)
- Use approximate nearest neighbor (ANN) search instead of exact search
- Index vectors with HNSW, IVF, or LSH
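
To make the LSH idea concrete, here is a minimal sketch of random-hyperplane hashing, one common LSH scheme for cosine similarity. The names `hyperplanes`, `lsh_signature`, and `buckets`, the choice of 8 bits, and the fixed seed are all assumptions made for illustration; production systems typically use a dedicated ANN library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, dim = 8, 3
# Each row is the normal vector of one random hyperplane through the origin.
hyperplanes = rng.normal(size=(n_bits, dim))


def lsh_signature(vector):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    projections = hyperplanes @ np.asarray(vector, dtype=float)
    return tuple((projections > 0).astype(int).tolist())


# Group vectors by signature; only vectors in the same bucket get compared.
docs = {
    "doc1": [0.8, 0.3, 0.1],
    "doc2": [0.2, 0.9, 0.4],
    "doc3": [0.1, 0.7, 0.9],
    "doc4": [0.7, 0.5, 0.2],
    "doc5": [0.3, 0.6, 0.8],
}
buckets = {}
for doc_id, vec in docs.items():
    buckets.setdefault(lsh_signature(vec), []).append(doc_id)

for signature, ids in buckets.items():
    print(signature, ids)
```

The signature depends only on direction, so a vector and any positive multiple of it always hash to the same bucket. Vectors separated by a small angle are likely, but not guaranteed, to share a bucket, which is the source of the "approximate" in ANN.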

### Q4: What are the trade-offs of different similarity metrics?

**Answer:**
- Dot product: Fast, but magnitude-sensitive
- Cosine: Normalized by magnitude, so it compares direction only
- Euclidean: Intuitive distance, but affected by feature scale
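
These trade-offs largely disappear once vectors are normalized to unit length: the dot product then equals cosine similarity, and squared Euclidean distance equals `2 - 2 * cosine`, so all three metrics produce the same ranking. The sketch below checks this on two example vectors chosen for illustration.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Scale both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_unit = a_unit @ b_unit
sq_dist_unit = np.sum((a_unit - b_unit) ** 2)

print(f"cosine(a, b)              = {cosine:.4f}")
print(f"dot(a_unit, b_unit)       = {dot_unit:.4f}")
print(f"squared dist (unit)       = {sq_dist_unit:.4f}")
print(f"2 - 2 * cosine            = {2 - 2 * cosine:.4f}")

# Without normalization, the dot product grows with magnitude; cosine does not.
print(f"dot(10 * a, b)            = {(10 * a) @ b:.1f}")
```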

---
# Section 6: Exercises
---

In [None]:
# Exercise 1: Implement Manhattan Distance (L1)
def manhattan_distance(v1, v2):
    """Implement L1 (Manhattan) distance."""
    # TODO: Your implementation here
    pass


# Exercise 2: Implement Jaccard Similarity (for sets)
def jaccard_similarity(set1, set2):
    """Implement Jaccard similarity for two sets."""
    # TODO: Your implementation here
    pass


# Exercise 3: Build a simple KNN classifier
def knn_classify(query, labeled_data, k=3):
    """
    K-Nearest Neighbors classification.
    
    Args:
        query: Vector to classify
        labeled_data: List of (vector, label) tuples
        k: Number of neighbors
    
    Returns:
        Predicted label (majority vote)
    """
    # TODO: Your implementation here
    pass

---
# Section 7: Deliverable
---

## What You Built This Week:

1. **`similarity_utils.py`** - Reusable similarity functions
2. **Simple Vector Search Engine** - Query → Ranked results
3. **2D/3D Visualizations** - Understanding vector spaces

## Key Takeaways:

- Similar meanings = Similar vectors
- Cosine similarity is the standard for text/embeddings
- Brute-force vector search is O(n) per query, so large collections need an index
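
As a rough illustration of the last point, the sketch below runs the same exhaustive search as a single NumPy matrix-vector product over this week's example vectors. The function name `top_k_cosine` is hypothetical. Vectorizing lowers the constant factor, but every document is still scored on every query, so the cost remains linear in the number of documents.

```python
import numpy as np

doc_ids = ["doc1", "doc2", "doc3", "doc4", "doc5"]
doc_matrix = np.array([
    [0.8, 0.3, 0.1],
    [0.2, 0.9, 0.4],
    [0.1, 0.7, 0.9],
    [0.7, 0.5, 0.2],
    [0.3, 0.6, 0.8],
])
query_vec = np.array([0.9, 0.2, 0.1])


def top_k_cosine(query_vec, doc_matrix, k=3):
    """Score every row against the query and return the k best (indices, scores)."""
    # Normalize rows and query so a plain dot product equals cosine similarity.
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]


order, scores = top_k_cosine(query_vec, doc_matrix)
top_ids = [doc_ids[i] for i in order]
for doc_id, score in zip(top_ids, scores):
    print(f"{doc_id}: {score:.4f}")
```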

## Next Week: Probability & Statistics
- Probability distributions
- Bayes' theorem for ML
- Statistical inference