# Task 1: Build a Semantic Search System

## Scenario
You have a collection of support ticket descriptions. Build a semantic search system that:
1. Encodes all documents into embeddings
2. Finds the most similar documents for a given query
3. Returns results with similarity scores

## Your Tasks:
1. **Load and encode**: Load documents and create embeddings
2. **Implement search**: Create a function to find top-k similar documents
3. **Find duplicates**: Identify near-duplicate documents
4. **Category clustering**: Group documents by semantic similarity

## Setup (provided)

In [None]:
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load documents
with open('../fixtures/input/documents.json') as f:
    documents = json.load(f)

print(f"Loaded {len(documents)} documents")
print("\nSample document:")
print(documents[0])

In [None]:
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded. Embedding dimension: {model.get_sentence_embedding_dimension()}")

---
## Task 1: Create Embeddings

Extract the `text` field from each document and encode the texts into L2-normalized embeddings (each row should have norm 1).

Store:
- `texts`: list of document texts
- `embeddings`: numpy array of normalized embeddings
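
To make "normalized" concrete, here is a minimal NumPy sketch on made-up vectors. The array `toy_vectors` is hypothetical and stands in for model output; it is not the notebook's data. With sentence-transformers, passing `normalize_embeddings=True` to `model.encode` should give rows with the same unit-norm property directly.

```python
import numpy as np

# Hypothetical toy vectors standing in for model output (not the real embeddings).
toy_vectors = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])

# Divide each row by its L2 norm so that every row has length 1.
row_norms = np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_normalized = toy_vectors / row_norms

# Every entry printed here should be (approximately) 1.0.
print(np.linalg.norm(toy_normalized, axis=1))
```

Once rows have unit norm, the cosine similarity between two embeddings reduces to their dot product, which the later tasks can rely on.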

In [None]:
# YOUR CODE HERE
# 1. Extract texts from documents
# 2. Encode with normalization



In [None]:
# TEST - Do not modify
assert 'texts' in dir(), "Variable 'texts' not found"
assert 'embeddings' in dir(), "Variable 'embeddings' not found"
assert len(texts) == 20, f"Expected 20 texts, got {len(texts)}"
assert embeddings.shape == (20, 384), f"Expected shape (20, 384), got {embeddings.shape}"

# Check normalization
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0), "Embeddings should be normalized (norm=1)"

print("Task 1 PASSED!")

---
## Task 2: Implement Semantic Search

Create a function `search(query, top_k=3)` that:
- Takes a query string
- Returns top-k most similar documents with scores

Return format: list of dicts with keys 'id', 'text', and 'score', sorted by score in descending order. The test cell expects each score to lie in [0, 1].
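
The ranking step can be sketched on toy data as follows. This is an illustration under stated assumptions, not the reference solution: `toy_search`, `toy_ids`, `toy_docs`, and `toy_query` are hypothetical names, the vectors are hand-picked 2-D unit vectors, and the 'text' key is omitted for brevity.

```python
import numpy as np

def toy_search(query_vec, doc_vecs, doc_ids, top_k=3):
    """Rank documents by cosine similarity to the query (inputs must be unit vectors)."""
    scores = doc_vecs @ query_vec              # dot product equals cosine for unit vectors
    order = np.argsort(scores)[::-1][:top_k]   # indices of the highest scores first
    return [{"id": doc_ids[i], "score": float(scores[i])} for i in order]

# Hypothetical 2-D unit vectors and ids, for illustration only.
toy_ids = ["a", "b", "c"]
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
toy_query = np.array([0.8, 0.6])

toy_results = toy_search(toy_query, toy_docs, toy_ids, top_k=2)
print(toy_results)
```

In the real function the query would first be encoded with the same model and normalization as the documents. Cosine similarity can in principle be negative or drift slightly past 1 through floating-point rounding, so clipping scores to [0, 1] may be a reasonable safeguard for the range check in the test.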

In [None]:
# YOUR CODE HERE
def search(query: str, top_k: int = 3):
    """
    Find most similar documents to query.
    
    Args:
        query: Search query string
        top_k: Number of results to return
        
    Returns:
        List of dicts: [{'id': ..., 'text': ..., 'score': ...}, ...]
    """
    pass  # Your implementation


In [None]:
# TEST - Do not modify
results = search("How to install Python on my computer?", top_k=3)

assert len(results) == 3, f"Expected 3 results, got {len(results)}"
assert all('id' in r and 'text' in r and 'score' in r for r in results), "Missing keys in results"
assert all(0 <= r['score'] <= 1 for r in results), "Scores should be between 0 and 1"
assert results[0]['score'] >= results[1]['score'] >= results[2]['score'], "Results should be sorted by score"

# Top result should be about Python installation
assert 'python' in results[0]['text'].lower() or 'install' in results[0]['text'].lower(), \
    "Top result should be about Python installation"

print("Task 2 PASSED!")
print("\nTop 3 results:")
for r in results:
    print(f"  Score: {r['score']:.4f} | {r['text'][:60]}...")

---
## Task 3: Find Near-Duplicates

Create a function `find_duplicates(threshold=0.85)` that finds document pairs whose similarity is at or above `threshold`.

Return format: list of dicts with 'doc1_id', 'doc2_id', 'similarity'
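
One possible approach is to compute the full pairwise similarity matrix and read only its upper triangle, so that self-pairs and mirrored pairs are skipped. The sketch below assumes unit-norm vectors; `toy_find_duplicates` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np

def toy_find_duplicates(vecs, ids, threshold=0.85):
    """Return each pair (i < j) whose cosine similarity is at or above the threshold."""
    sim = vecs @ vecs.T                           # pairwise cosine for unit vectors
    rows, cols = np.triu_indices(len(ids), k=1)   # upper triangle, diagonal excluded
    return [
        {"doc1_id": ids[i], "doc2_id": ids[j], "similarity": float(sim[i, j])}
        for i, j in zip(rows, cols)
        if sim[i, j] >= threshold
    ]

# Hypothetical 2-D unit vectors and ids, for illustration only.
toy_ids = ["a", "b", "c"]
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

# A lower threshold than the notebook's 0.85 is used so the toy data yields a match.
toy_pairs = toy_find_duplicates(toy_vecs, toy_ids, threshold=0.75)
print(toy_pairs)
```

Using `k=1` in `np.triu_indices` is what excludes the diagonal; without it every document would be reported as a duplicate of itself with similarity 1.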

In [None]:
# YOUR CODE HERE
def find_duplicates(threshold: float = 0.85):
    """
    Find document pairs with similarity above threshold.
    
    Args:
        threshold: Minimum similarity to consider as duplicate
        
    Returns:
        List of dicts: [{'doc1_id': ..., 'doc2_id': ..., 'similarity': ...}, ...]
    """
    pass  # Your implementation


In [None]:
# TEST - Do not modify
duplicates = find_duplicates(threshold=0.85)

assert isinstance(duplicates, list), "Should return a list"
assert len(duplicates) > 0, "Should find at least one duplicate pair"
assert all('doc1_id' in d and 'doc2_id' in d and 'similarity' in d for d in duplicates), \
    "Missing keys in duplicate results"
assert all(d['similarity'] >= 0.85 for d in duplicates), "All pairs should have similarity >= 0.85"

print("Task 3 PASSED!")
print(f"\nFound {len(duplicates)} duplicate pairs:")
for d in duplicates[:5]:  # Show first 5
    print(f"  {d['doc1_id']} <-> {d['doc2_id']}: {d['similarity']:.4f}")

---
## Task 4: Cluster Documents

Create a function `cluster_documents(n_clusters=5)` that groups documents by semantic similarity.

Return format: dict mapping each cluster_id to a list of document ids, with every document assigned to exactly one cluster.
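
The return format can be illustrated independently of the clustering step. The sketch below assumes cluster labels are already available, one per document, for example from `KMeans(...).fit_predict(embeddings)`; the labels, ids, and the helper `group_by_label` are hypothetical.

```python
from collections import defaultdict

def group_by_label(ids, labels):
    """Map each cluster label to the list of document ids assigned to it."""
    groups = defaultdict(list)
    for doc_id, label in zip(ids, labels):
        groups[int(label)].append(doc_id)   # int() turns NumPy integers into plain ints
    return dict(groups)

# Hypothetical ids and labels, for illustration only.
toy_ids = ["a", "b", "c", "d"]
toy_labels = [1, 0, 1, 2]

toy_clusters = group_by_label(toy_ids, toy_labels)
print(toy_clusters)
```

Because each document receives exactly one label, the lists in the resulting dict together contain every id once, which is the property the test cell counts.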

In [None]:
# YOUR CODE HERE
from sklearn.cluster import KMeans

def cluster_documents(n_clusters: int = 5):
    """
    Cluster documents by semantic similarity.
    
    Args:
        n_clusters: Number of clusters
        
    Returns:
        Dict mapping cluster_id to list of document ids
    """
    pass  # Your implementation


In [None]:
# TEST - Do not modify
clusters = cluster_documents(n_clusters=5)

assert isinstance(clusters, dict), "Should return a dict"
assert len(clusters) == 5, f"Expected 5 clusters, got {len(clusters)}"

# All documents should be assigned
all_ids = [doc_id for ids in clusters.values() for doc_id in ids]
assert len(all_ids) == 20, f"Expected 20 documents in clusters, got {len(all_ids)}"

print("Task 4 PASSED!")
print("\nClusters:")
for cluster_id, doc_ids in clusters.items():
    print(f"\nCluster {cluster_id} ({len(doc_ids)} docs):")
    for doc_id in doc_ids[:3]:  # Show first 3
        doc = next(d for d in documents if d['id'] == doc_id)
        print(f"  - {doc['text'][:50]}...")

---
## Bonus: Visualize Embeddings

Create a t-SNE visualization of document embeddings colored by category.

In [None]:
# YOUR CODE HERE (optional)
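# Use sklearn.manifold.TSNE to reduce to 2D
# Note: recent scikit-learn versions require perplexity < number of samples,
# and the default (30) exceeds the 20 documents here, so pass a smaller value
# Color points by document category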



---
## Expected Results

After completing all tasks:
- Task 1: 20 normalized embeddings in an array of shape (20, 384)
- Task 2: Search returns relevant Python installation docs for a Python query
- Task 3: Finds similar document pairs (e.g., doc_001 and doc_002 about Python)
- Task 4: Documents are grouped by topic (Software, Network, Hardware, etc.)