# Module 05 - Notebook 06: Real-World Applications

## Learning Objectives
- Build a semantic search engine
- Create a document classification system
- Implement a recommendation engine
- Build a question-answering system
- Combine embeddings with LLMs

---

## 1. Project Overview

We'll build four real-world applications:

1. **Semantic Search Engine**: Find documents by meaning
2. **Text Classification**: Categorize documents automatically
3. **Recommendation System**: Suggest similar content
4. **Q&A System**: Answer questions from a knowledge base

## 2. Setup

In [None]:
!pip install -q chromadb openai sentence-transformers scikit-learn python-dotenv

In [None]:
import chromadb
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import os
from dotenv import load_dotenv

load_dotenv()
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
local_model = SentenceTransformer('all-MiniLM-L6-v2')
chroma_client = chromadb.Client()

print("✓ All systems ready")

## 3. Application 1: Semantic Search Engine

In [None]:
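class SemanticSearchEngine:
    """Minimal semantic search backed by Chroma and OpenAI embeddings."""
    
    def __init__(self, collection_name: str = "search_engine"):
        self.client = chromadb.Client()
        # get_or_create avoids a "collection already exists" error when this cell is re-run
        self.collection = self.client.get_or_create_collection(collection_name)
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        # Continue ID numbering from any documents already in the collection
        self.doc_count = self.collection.count()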
    
    def index_documents(self, documents: list, metadata: list = None):
        """Add documents to the search index."""
        # Generate embeddings
        response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=documents
        )
        embeddings = [item.embedding for item in response.data]
        
        # Add to vector DB
        ids = [f"doc_{self.doc_count + i}" for i in range(len(documents))]
        self.collection.add(
            documents=documents,
            embeddings=embeddings,
            ids=ids,
            # Some Chroma versions reject empty metadata dicts, so pass None instead
            metadatas=metadata if metadata else None
        )
        
        self.doc_count += len(documents)
        return f"Indexed {len(documents)} documents (total: {self.doc_count})"
    
    def search(self, query: str, n_results: int = 5, filters: dict = None):
        """Search for relevant documents."""
        # Generate query embedding
        query_response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        )
        query_embedding = query_response.data[0].embedding
        
        # Search
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where=filters
        )
        
        # Format results
        return [
            {
                "document": doc,
                "distance": dist,
                "metadata": meta
            }
            for doc, dist, meta in zip(
                results['documents'][0],
                results['distances'][0],
                results['metadatas'][0]
            )
        ]

# Demo the search engine
engine = SemanticSearchEngine()

# Index sample documents
docs = [
    "Python is a versatile programming language",
    "Machine learning enables AI applications",
    "The Eiffel Tower is located in Paris",
    "Neural networks process data like human brains",
    "Climate change affects global temperatures"
]

metadata = [
    {"category": "tech", "year": 2024},
    {"category": "tech", "year": 2024},
    {"category": "travel", "year": 2024},
    {"category": "tech", "year": 2024},
    {"category": "science", "year": 2024}
]

print(engine.index_documents(docs, metadata))

# Search
results = engine.search("Tell me about AI", n_results=3)

print("\nSearch Results:")
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['distance']:.3f}] {r['document']}")
    print(f"   Category: {r['metadata']['category']}\n")
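
The `distance` values returned by `search` are easier to interpret once converted to a similarity score. The following is a minimal sketch, assuming the collection uses squared L2 distance (Chroma's default space, as far as the documentation describes it) and that the embeddings are unit-length, in which case `distance = 2 - 2 * cosine_similarity`. The helper names `squared_l2_to_cosine` and `normalize` are hypothetical and the vectors are made up.

```python
import math

def squared_l2_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between two unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Two made-up vectors, normalized so the identity applies
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))

print(f"squared L2:             {squared_l2:.4f}")
print(f"cosine (direct):        {cosine:.4f}")
print(f"cosine (from distance): {squared_l2_to_cosine(squared_l2):.4f}")
```

If the collection is created with a different distance function, or the embeddings are not normalized, this conversion does not hold.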

## 4. Application 2: Text Classification

In [None]:
from sklearn.linear_model import LogisticRegression

class EmbeddingClassifier:
    """Classify text using embeddings."""
    
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.classifier = LogisticRegression(max_iter=1000)
        self.is_trained = False
    
    def train(self, texts: list, labels: list):
        """Train the classifier."""
        # Generate embeddings
        embeddings = self.model.encode(texts)
        
        # Train classifier
        self.classifier.fit(embeddings, labels)
        self.is_trained = True
        
        return f"Trained on {len(texts)} examples"
    
    def predict(self, text: str):
        """Classify a text."""
        if not self.is_trained:
            raise ValueError("Model not trained yet!")
        
        # Generate embedding
        embedding = self.model.encode([text])
        
        # Predict
        prediction = self.classifier.predict(embedding)[0]
        probabilities = self.classifier.predict_proba(embedding)[0]
        
        return {
            "prediction": prediction,
            "confidence": max(probabilities)
        }

# Demo classification
classifier = EmbeddingClassifier()

# Training data
training_texts = [
    "Python is great for data science",
    "Machine learning models learn from data",
    "The recipe requires three eggs",
    "Bake at 350 degrees for 30 minutes",
    "The soccer match ended in a draw",
    "Basketball is played with five players",
    "Neural networks have multiple layers",
    "Add salt and pepper to taste",
    "The team scored in overtime",
    "Deep learning uses GPUs for training"
]

training_labels = [
    "tech", "tech", "cooking", "cooking", "sports",
    "sports", "tech", "cooking", "sports", "tech"
]

print(classifier.train(training_texts, training_labels))

# Test classification
test_texts = [
    "I love programming in Python",
    "This pasta dish is delicious",
    "The football game was exciting"
]

print("\nClassification Results:\n")
for text in test_texts:
    result = classifier.predict(text)
    print(f"'{text}'")
    print(f"  → {result['prediction']} ({result['confidence']:.2f} confidence)\n")
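
With three labels and ten training sentences, the classifier assigns one of its known labels to every input, including text about a topic it never saw. One possible guard, sketched here under the assumption that a fixed cutoff is acceptable, is to treat low-confidence predictions as "unknown". The helper `label_or_unknown` and the 0.5 threshold are illustrative choices, not part of the class above.

```python
def label_or_unknown(result: dict, threshold: float = 0.5, fallback: str = "unknown") -> str:
    """Return the predicted label, or `fallback` when confidence is below `threshold`."""
    if result["confidence"] < threshold:
        return fallback
    return result["prediction"]

# Dictionaries shaped like the output of EmbeddingClassifier.predict (values are made up)
confident = {"prediction": "tech", "confidence": 0.81}
uncertain = {"prediction": "sports", "confidence": 0.38}

print(label_or_unknown(confident))
print(label_or_unknown(uncertain))
```

A suitable threshold depends on the data; it is usually tuned on a held-out set rather than fixed in advance.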

## 5. Application 3: Recommendation System

In [None]:
class RecommendationEngine:
    """Content-based recommendations using embeddings."""
    
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.items = []
        self.embeddings = None
    
    def add_items(self, items: list, descriptions: list):
        """Set the recommendation pool, replacing any items added earlier."""
        self.items = [
            {"name": item, "description": desc}
            for item, desc in zip(items, descriptions)
        ]
        
        # Generate embeddings from descriptions
        self.embeddings = self.model.encode(descriptions)
        
        return f"Added {len(items)} items"
    
    def recommend(self, item_name: str = None, description: str = None, n: int = 3):
        """Get recommendations based on item or description."""
        if item_name:
            # Find item's index
            item_idx = next(
                (i for i, item in enumerate(self.items) if item["name"] == item_name),
                None
            )
            if item_idx is None:
                return []
            
            query_embedding = self.embeddings[item_idx]
        
        elif description:
            # Generate embedding from description
            query_embedding = self.model.encode([description])[0]
        
        else:
            raise ValueError("Provide either item_name or description")
        
        # Calculate similarities
        similarities = cosine_similarity(
            [query_embedding],
            self.embeddings
        )[0]
        
        # Rank by similarity, highest first
        top_indices = np.argsort(similarities)[::-1]
        if item_name:
            # Drop the query item by index rather than assuming it ranks first
            top_indices = [idx for idx in top_indices if idx != item_idx][:n]
        else:
            top_indices = top_indices[:n]
        
        return [
            {
                "name": self.items[idx]["name"],
                "description": self.items[idx]["description"],
                "similarity": similarities[idx]
            }
            for idx in top_indices
        ]

# Demo recommendations
recommender = RecommendationEngine()

# Add movies
movies = [
    "The Matrix",
    "Inception",
    "Interstellar",
    "The Notebook",
    "Titanic",
    "Die Hard",
    "Blade Runner"
]

descriptions = [
    "A sci-fi thriller about simulated reality and AI",
    "Mind-bending sci-fi about dreams within dreams",
    "Space exploration and time dilation sci-fi epic",
    "Romantic drama about love and memory",
    "Romantic tragedy set on a doomed ocean liner",
    "Action-packed thriller with explosive sequences",
    "Dystopian sci-fi noir about artificial humans"
]

print(recommender.add_items(movies, descriptions))

# Get recommendations based on a movie
print("\nIf you liked 'The Matrix', you might also like:\n")
recs = recommender.recommend(item_name="The Matrix", n=3)
for i, rec in enumerate(recs, 1):
    print(f"{i}. {rec['name']} ({rec['similarity']:.3f})")
    print(f"   {rec['description']}\n")

# Get recommendations based on a description
print("Movies matching 'romantic love story':\n")
recs = recommender.recommend(description="romantic love story", n=2)
for i, rec in enumerate(recs, 1):
    print(f"{i}. {rec['name']} ({rec['similarity']:.3f})")
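
To make the ranking step concrete without loading a model, here is a small dependency-free sketch of the same idea: compute cosine similarity against every other item and keep the top `n`. The three-dimensional vectors are invented for illustration, and `cosine` and `top_n_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(query_idx, vectors, n=2):
    """Rank all other vectors by cosine similarity to vectors[query_idx]."""
    scores = [
        (idx, cosine(vectors[query_idx], vec))
        for idx, vec in enumerate(vectors)
        if idx != query_idx  # exclude the query item itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors = [
    [0.9, 0.1, 0.0],  # item 0 (the query)
    [0.8, 0.2, 0.1],  # item 1: points in nearly the same direction as item 0
    [0.0, 0.1, 0.9],  # item 2: points in a very different direction
    [0.7, 0.0, 0.3],  # item 3: fairly close to item 0
]

for idx, score in top_n_similar(0, toy_vectors, n=2):
    print(f"item {idx}: {score:.3f}")
```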

## 6. Application 4: Question Answering System

In [None]:
class QASystem:
    """Question-answering with embeddings + LLM."""
    
    def __init__(self, collection_name: str = "qa_knowledge"):
        self.chroma = chromadb.Client()
        # get_or_create avoids a "collection already exists" error when a second
        # QASystem is created in the same session (for example, after a cell re-run)
        self.collection = self.chroma.get_or_create_collection(collection_name)
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    def add_knowledge(self, documents: list):
        """Add documents to the knowledge base."""
        # Generate embeddings
        response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=documents
        )
        embeddings = [item.embedding for item in response.data]
        
        # Store in vector DB; offset IDs by the current count so repeated calls do not collide
        start = self.collection.count()
        self.collection.add(
            documents=documents,
            embeddings=embeddings,
            ids=[f"kb_{start + i}" for i in range(len(documents))]
        )
        
        return f"Added {len(documents)} documents to knowledge base"
    
    def ask(self, question: str, n_context: int = 3) -> dict:
        """Answer a question using the knowledge base."""
        # 1. Find relevant context
        query_response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=question
        )
        query_embedding = query_response.data[0].embedding
        
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_context
        )
        
        context = "\n\n".join(results['documents'][0])
        
        # 2. Generate answer using LLM
        prompt = f"""Answer the question based on the context below. If the answer cannot be found in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:"""
        
        chat_response = self.openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # a low temperature keeps answers close to the retrieved context
            max_tokens=200
        )
        
        answer = chat_response.choices[0].message.content
        
        return {
            "answer": answer,
            "sources": results['documents'][0][:2]  # Show top 2 sources
        }

# Demo Q&A system
qa = QASystem()

# Add knowledge
knowledge = [
    "Python was created by Guido van Rossum in 1991.",
    "Machine learning is a subset of artificial intelligence.",
    "The capital of France is Paris.",
    "Neural networks are inspired by biological neurons.",
    "Python is known for its simple and readable syntax."
]

print(qa.add_knowledge(knowledge))

# Ask questions
questions = [
    "Who created Python?",
    "What is machine learning?",
    "What is the capital of Spain?"  # Not in knowledge base
]

print("\nQ&A Demo:\n")
for q in questions:
    result = qa.ask(q)
    print(f"Q: {q}")
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources'][0][:50]}...")
    print()
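
The vector query always returns its `n_context` nearest documents, even when none of them relate to the question (as with the Spain example). One hedged option is to drop weak matches before building the prompt. The sketch below assumes the query result also carries a `distances` list alongside `documents`; the helper `filter_by_distance`, the cutoff values, and the distances shown are all made up for illustration.

```python
def filter_by_distance(documents, distances, max_distance):
    """Keep only the documents whose distance is at or below max_distance."""
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Lists shaped like results['documents'][0] and results['distances'][0];
# the distance values here are invented.
retrieved_docs = [
    "The capital of France is Paris.",
    "Python was created by Guido van Rossum in 1991.",
]
retrieved_distances = [0.95, 1.62]

relevant = filter_by_distance(retrieved_docs, retrieved_distances, max_distance=1.0)
print(relevant)

# With a stricter cutoff nothing survives, so the caller could skip the LLM call
nothing = filter_by_distance(retrieved_docs, retrieved_distances, max_distance=0.5)
print("I don't know." if not nothing else nothing)
```

A workable cutoff depends on the embedding model and distance function, so it is best chosen by inspecting distances for known good and bad matches.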

## 7. Complete Example: Document Search + Chat

In [None]:
class IntelligentDocumentAssistant:
    """Combines search and chat for document interaction."""
    
    def __init__(self):
        self.qa_system = QASystem()
    
    def load_documents(self, documents: list):
        """Load documents into the system."""
        return self.qa_system.add_knowledge(documents)
    
    def chat(self, user_message: str) -> str:
        """Chat about the documents."""
        result = self.qa_system.ask(user_message)
        return result['answer']

# Demo the complete system
assistant = IntelligentDocumentAssistant()

# Load company handbook
handbook = [
    "Company Policy: All employees must clock in by 9 AM.",
    "Vacation Policy: Employees get 15 days of paid vacation per year.",
    "Remote Work: Employees can work remotely up to 2 days per week.",
    "Health Benefits: Full health insurance is provided after 90 days.",
    "Dress Code: Business casual attire is required in the office."
]

print(assistant.load_documents(handbook))
print("\nDocument Assistant Ready!\n")

# Simulate a conversation
conversation = [
    "What time do I need to arrive at work?",
    "How many vacation days do I get?",
    "Can I work from home?"
]

for question in conversation:
    answer = assistant.chat(question)
    print(f"You: {question}")
    print(f"Assistant: {answer}\n")

## Summary

You learned to build:
- ✅ Semantic search engines
- ✅ Text classifiers with embeddings
- ✅ Recommendation systems
- ✅ Q&A systems (RAG pattern)
- ✅ Complete document assistants

## Key Patterns

### RAG (Retrieval-Augmented Generation)
1. **Retrieve** relevant context with embeddings
2. **Augment** prompt with retrieved context
3. **Generate** answer with LLM
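
The three steps can be sketched end to end without any API calls. This is an illustration of the control flow only: `retrieve` uses word overlap as a stand-in for embedding search, and `generate` is a placeholder that returns a fixed string instead of calling an LLM. All three function names are hypothetical.

```python
def retrieve(question: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: rank passages by the number of words shared with the question."""
    question_words = set(question.lower().replace("?", "").split())
    scored = [
        (len(question_words & set(passage.lower().rstrip(".").split())), passage)
        for passage in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Discard passages that share no words with the question
    return [passage for score, passage in scored[:k] if score > 0]

def augment(question: str, passages: list) -> str:
    """Build a prompt that places the retrieved passages ahead of the question."""
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; returns a fixed-format string."""
    return f"[LLM answer for a prompt of {len(prompt)} characters]"

corpus = [
    "Python was created by Guido van Rossum in 1991.",
    "The capital of France is Paris.",
    "Neural networks are inspired by biological neurons.",
]

question = "Who created Python?"
passages = retrieve(question, corpus)
prompt = augment(question, passages)
print(prompt)
print(generate(prompt))
```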

### Embedding Pipeline
1. **Chunk** documents appropriately
2. **Embed** chunks with consistent model
3. **Store** in vector database
4. **Query** with semantic search
5. **Post-process** results
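
The chunking step is not shown elsewhere in this notebook, so here is a minimal sketch of one common approach: fixed-size word windows with overlap. The function name `chunk_words` and the default sizes are assumptions; real pipelines often split on sentences, paragraphs, or tokens instead.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
    """Split text into chunks of up to chunk_size words; neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Twelve placeholder words: w0 ... w11
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.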

## Production Considerations
1. **Error handling**: API failures, rate limits
2. **Caching**: Store embeddings to reduce costs
3. **Monitoring**: Track performance and costs
4. **Updates**: Refresh embeddings when documents change
5. **Security**: Sanitize inputs, control access
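
The first two items can be combined in a thin wrapper around whatever embedding call is in use. The sketch below is one possible shape, not a prescribed design: `CachedEmbedder` is a hypothetical class, `embed_fn` stands for any function that maps a string to a vector, and `fake_embed` simulates a single transient failure so the retry path runs without a network call.

```python
import hashlib
import time

class CachedEmbedder:
    """Wrap an embedding function with an in-memory cache and simple retries."""

    def __init__(self, embed_fn, max_retries: int = 3, base_delay: float = 0.5):
        self.embed_fn = embed_fn  # callable: str -> list of floats
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.cache = {}
        self.attempts = 0  # number of calls made to embed_fn, including failed ones

    def embed(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        for attempt in range(self.max_retries):
            try:
                self.attempts += 1
                vector = self.embed_fn(text)
                break
            except Exception:  # real code should catch the client's specific error types
                if attempt == self.max_retries - 1:
                    raise
                # Exponential backoff: base_delay, then 2x, then 4x, ...
                time.sleep(self.base_delay * 2 ** attempt)
        self.cache[key] = vector
        return vector

# Stand-in for a real embedding call; fails once to exercise the retry path
failures_left = [1]

def fake_embed(text):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("simulated rate limit")
    return [float(len(text)), 0.0]

embedder = CachedEmbedder(fake_embed, base_delay=0.01)
first = embedder.embed("hello")
second = embedder.embed("hello")  # served from the cache, no new call
print(first, embedder.attempts)
```

An in-memory dictionary is lost when the process exits; a persistent store would be needed to save costs across runs.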

## Next Steps
- 🚀 Build your own application
- 📘 Proceed to Module 06: RAG Systems
- 🔗 Explore [Pinecone](https://www.pinecone.io/) for production scale

## 🎉 Module 05 Complete!

You've mastered embeddings and vector databases! You can now:
- Create and work with embeddings
- Use vector databases effectively
- Build production applications
- Implement RAG patterns

**Congratulations!** You're ready for advanced RAG in Module 06.