# Local-Llama-Inference - Embeddings Generation

This notebook demonstrates how to generate text embeddings with local-llama-inference and use them for similarity search, clustering, and reranking.

## Applications
- **Semantic Search**: Find similar documents
- **Clustering**: Group similar texts
- **Similarity**: Measure text similarity
- **Reranking**: Improve search results
- **RAG (Retrieval-Augmented Generation)**: Knowledge retrieval

In [None]:
from local_llama_inference import LlamaServer, LlamaClient
from pathlib import Path
from huggingface_hub import hf_hub_download
import numpy as np

print("✅ Package imported")

## Download Model

In [None]:
models_dir = Path.home() / "models"
models_dir.mkdir(exist_ok=True)

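# Any GGUF model can produce embeddings, but a dedicated embedding model
# usually gives better vectors than a generative model such as phi-2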
model_path = hf_hub_download(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_K_M.gguf",
    local_dir=str(models_dir),
)

print(f"✅ Model ready: {model_path}")

## Start Server

In [None]:
print("🚀 Starting server...")
server = LlamaServer(
    model_path=model_path,
    n_gpu_layers=33,
    n_threads=4,
)
server.start()
server.wait_ready(timeout=60)
print("✅ Server ready")

client = LlamaClient()

## Example 1: Generate Single Embedding

In [None]:
# Generate embedding for a single text
text = "Machine learning is a subset of artificial intelligence"

print(f"📝 Text: {text}\n")
print("🧮 Generating embedding...")

response = client.embed(input=text)
embedding = response['data'][0]['embedding']

print(f"✅ Embedding generated")
print(f"   Dimension: {len(embedding)}")
print(f"   First 10 values: {embedding[:10]}")

## Example 2: Generate Multiple Embeddings

In [None]:
# Multiple texts for semantic search
texts = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks",
    "Natural language processing handles text data",
    "Computer vision processes images",
    "Reinforcement learning learns from rewards",
]

print("🧮 Generating embeddings for 5 texts...\n")

response = client.embed(input=texts)

# The response is indexed like a dict here, matching Example 1
embeddings = [item['embedding'] for item in response['data']]

print(f"✅ Generated {len(embeddings)} embeddings")
for i, (text, emb) in enumerate(zip(texts, embeddings)):
    print(f"  [{i}] {text[:50]}... (dim={len(emb)})")

## Example 3: Semantic Similarity Search

In [None]:
from scipy.spatial.distance import cosine

# Query text
query = "What are deep neural networks?"

# Generate query embedding
query_response = client.embed(input=query)
query_embedding = query_response['data'][0]['embedding']

print(f"🔍 Query: {query}\n")
print("📊 Semantic Similarity Results:\n")

# Calculate similarity to each text
similarities = []
for text, embedding in zip(texts, embeddings):
    # Cosine similarity: 1 - cosine_distance
    similarity = 1 - cosine(query_embedding, embedding)
    similarities.append((text, similarity))

# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)

# Display results
for rank, (text, sim) in enumerate(similarities, 1):
    print(f"{rank}. {sim:.3f} - {text}")
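
For reference, the `1 - cosine(...)` expression in this example is the cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python on toy vectors. The helper name `cosine_similarity` is illustrative and is not part of local-llama-inference.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 2- and 3-dimensional vectors standing in for real embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])

print(f"same direction: {same_direction:.3f}")
print(f"orthogonal:     {orthogonal:.3f}")
print(f"opposite:       {opposite:.3f}")
```

The score depends only on direction, not magnitude, which is why scaling a vector leaves its similarity to other vectors unchanged.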

## Example 4: Document Clustering

In [None]:
from sklearn.cluster import KMeans
import numpy as np

print("🎯 Clustering documents based on embeddings...\n")

# Convert embeddings to numpy array
X = np.array(embeddings)

# Perform K-means clustering
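# n_init is set explicitly so results do not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)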
clusters = kmeans.fit_predict(X)

print("📍 Cluster Assignments:\n")
for text, cluster in zip(texts, clusters):
    print(f"Cluster {cluster}: {text}")
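
K-means groups points by Euclidean distance, while the other examples in this notebook compare embeddings by cosine similarity. One common way to reconcile the two, sketched here as an optional step and not as something the notebook requires, is to scale each embedding to unit length first: for unit vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`. The helper `l2_normalize` and the toy vectors are illustrative.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row of X to unit length (all-zero rows are left unchanged)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

# Toy vectors standing in for real embeddings
X_toy = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
X_unit = l2_normalize(X_toy)

a, b = X_unit[0], X_unit[1]
squared_distance = float(np.sum((a - b) ** 2))
cosine_sim = float(a @ b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(f"squared distance: {squared_distance:.3f}")
print(f"2 - 2 * cosine:   {2 - 2 * cosine_sim:.3f}")
```

Passing the normalized array to `KMeans` in place of the raw one would then cluster by direction and ignore vector length.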

## Example 5: Text Similarity Matrix

In [None]:
import numpy as np
from scipy.spatial.distance import cosine

print("📊 Similarity Matrix (Cosine Similarity):\n")

# Calculate pairwise similarities
similarities = []
for emb1 in embeddings:
    row = []
    for emb2 in embeddings:
        sim = 1 - cosine(emb1, emb2)
        row.append(sim)
    similarities.append(row)

similarity_matrix = np.array(similarities)

# Display as formatted table (header and value columns are both 6 characters wide)
print("    ", end="")
for i in range(len(texts)):
    print(f"{f'T{i}':>5} ", end="")
print()

for i, row in enumerate(similarity_matrix):
    print(f"T{i:<3}", end="")
    for val in row:
        print(f"{val:5.2f} ", end="")
    print()

print("\nNote: 1.00 = same direction, 0.00 = unrelated (orthogonal); negative values are possible")
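
The nested loop above makes one `cosine` call per pair. As a sketch of an equivalent vectorized approach, the same matrix can be obtained by normalizing every row to unit length and taking a single matrix product. The function name `cosine_similarity_matrix` and the toy vectors are illustrative, and the code assumes no embedding is the zero vector.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities for the rows of `vectors`."""
    X = np.asarray(vectors, dtype=float)
    # Normalize each row; the dot product of two unit vectors is their cosine.
    # Assumes no row is all zeros (division by zero otherwise).
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T

# Toy vectors standing in for real embeddings
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = cosine_similarity_matrix(toy_embeddings)
print(np.round(matrix, 2))
```

The result is symmetric with ones on the diagonal, as in the loop-based version.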

## Example 6: Reranking Search Results

In [None]:
# Rerank documents based on relevance to query
query = "neural networks and deep learning"

print(f"🔄 Reranking documents for query: {query}\n")

# Use reranking API if available
try:
    response = client.rerank(
        query=query,
        documents=texts,
    )
    
    print("🏆 Reranked Results:\n")
    for rank, result in enumerate(response['results'], 1):
        idx = result['index']
        score = result['relevance_score']
        print(f"{rank}. Score: {score:.3f} - {texts[idx]}")
        
except Exception as e:
    # Show the underlying error so real failures are not mistaken for a missing endpoint
    print(f"Note: Reranking not available in this server version ({e})")
    print("Using embedding-based similarity instead...\n")
    
    # Fallback to embedding-based similarity
    query_response = client.embed(input=query)
    query_embedding = query_response['data'][0]['embedding']
    
    results = []
    for idx, (text, embedding) in enumerate(zip(texts, embeddings)):
        sim = 1 - cosine(query_embedding, embedding)
        results.append((idx, text, sim))
    
    results.sort(key=lambda x: x[2], reverse=True)
    
    print("🏆 Similarity-Based Ranking:\n")
    for rank, (idx, text, score) in enumerate(results, 1):
        print(f"{rank}. Score: {score:.3f} - {text}")
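
The embedding-based fallback ranks every document. For retrieval in a RAG pipeline, usually only the best few matches are kept. The following is a minimal sketch of that top-k step on toy vectors; `top_k` and its parameters are illustrative names, not part of the local-llama-inference API.

```python
import numpy as np

def top_k(query_vector, doc_vectors, k=2):
    """Return [(index, similarity)] for the k documents most similar to the query."""
    q = np.asarray(query_vector, dtype=float)
    D = np.asarray(doc_vectors, dtype=float)
    # Cosine similarity of the query against every document row at once
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy vectors standing in for document and query embeddings
toy_docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_query = [1.0, 0.1]

results = top_k(toy_query, toy_docs, k=2)
for rank, (idx, score) in enumerate(results, 1):
    print(f"{rank}. doc {idx} (score {score:.3f})")
```

The returned indices can be used to look up the original texts, which would then be passed to the model as context.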

## Example 7: Tokenization (Token Count)

In [None]:
# Count tokens in texts
print("📊 Token Count Analysis:\n")

for text in texts:
    response = client.tokenize(content=text)
    token_count = len(response.get('tokens', []))
    print(f"{token_count:3d} tokens: {text[:50]}...")

## Stop Server

In [None]:
print("\n🛑 Stopping server...")
server.stop()
print("✅ Done")

## Key Points

- **Embeddings**: Use `embed()` to generate dense vector representations
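- **Similarity**: Cosine similarity compares embedding directions (1 = same direction, 0 = unrelated, negative = opposing)
- **Multi-text**: Pass a list of texts to `embed()` to embed them in one call, which is more efficient than one call per text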
- **Applications**: Search, clustering, reranking, RAG systems
- **Dimension**: Embedding dimension depends on the model

## Common Use Cases

1. **Semantic Search**: Find similar documents by embedding similarity
2. **Duplicate Detection**: Identify duplicate or near-duplicate texts
3. **Clustering**: Group similar documents together
4. **Classification**: Use embeddings as features for ML models
5. **RAG Systems**: Retrieve relevant documents for context
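
Duplicate detection is the one use case in this list that the notebook does not demonstrate. A minimal sketch, assuming embeddings are already available as a list of vectors: flag every pair whose cosine similarity meets a threshold. The function name `find_near_duplicates` and the `0.95` threshold are illustrative choices; a suitable threshold depends on the model and the data.

```python
import numpy as np

def find_near_duplicates(vectors, threshold=0.95):
    """Return (i, j, similarity) for each pair i < j at or above the threshold."""
    X = np.asarray(vectors, dtype=float)
    # Assumes no row is all zeros
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X_unit @ X_unit.T
    pairs = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs

# Toy vectors: items 0 and 1 point in almost the same direction
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
duplicates = find_near_duplicates(toy_embeddings)
for i, j, sim in duplicates:
    print(f"items {i} and {j} look like near-duplicates (similarity {sim:.3f})")
```

The pairwise loop is quadratic in the number of texts, which is fine for small collections but would need an approximate nearest-neighbor index at larger scale.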

## Next Notebooks

- `04_multi_gpu.ipynb` - Multi-GPU tensor parallelism
- `05_advanced_api.ipynb` - All 30+ API endpoints