# Prototyping Embedding-Based Ranking for Influencers

This notebook prototypes the core logic for the **InCreator AI** discovery engine. 

## Goals
1.  **Simulate Embeddings**: Generate mock vectors for creator bios.
2.  **Vector Search**: Implement Cosine Similarity to find relevant creators.
3.  **LLM Reranking**: Apply a second-stage scoring layer to improve precision.

## Architecture Context
In the production system:
-   **Embeddings** are generated via OpenAI `text-embedding-3-small`.
-   **Vector Store** is **Pinecone**.
-   **Reranking** is done via a lightweight LLM or Cross-Encoder.

In [None]:
import numpy as np
from typing import List, Dict

# --- 1. Setup Mock Data ---

creators = [
    {
        "id": "c_1",
        "handle": "@tech_guru",
        "bio": "I review the latest gadgets, smartphones, and AI tools. Tech enthusiast.",
        "category": "Tech"
    },
    {
        "id": "c_2",
        "handle": "@beauty_queen",
        "bio": "Makeup tutorials, skincare routines, and lifestyle vlogs.",
        "category": "Beauty"
    },
    {
        "id": "c_3",
        "handle": "@code_wizard",
        "bio": "Full-stack developer. I teach Python, JavaScript, and how to build AI agents.",
        "category": "Education"
    },
    {
        "id": "c_4",
        "handle": "@ai_insider",
        "bio": "Deep dives into Large Language Models, Neural Networks, and the future of AI.",
        "category": "Tech"
    },
    {
        "id": "c_5",
        "handle": "@travel_mike",
        "bio": "Backpacking across Europe. Food, culture, and hidden gems.",
        "category": "Travel"
    }
]

print(f"Loaded {len(creators)} creators.")

## 2. Mock Embedding Generation

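In a real scenario, we would call the OpenAI embeddings endpoint (`client.embeddings.create(...)` in version 1.0 and later of the Python SDK; `openai.Embedding.create()` in older versions). Here, we simulate it with random noise plus a keyword-driven signal.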
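For reference, here is a minimal sketch of what the real call might look like. It assumes `client` is an `openai.OpenAI()` instance from SDK version 1.0 or later; the helper name `embed_texts` is hypothetical and not part of the notebook. The client is passed in as an argument so the function can be exercised with a stub.

```python
from typing import Any, List, Sequence


def embed_texts(
    client: Any,
    texts: Sequence[str],
    model: str = "text-embedding-3-small",
) -> List[List[float]]:
    """Return one embedding per input text, in input order.

    `client` is assumed to be an `openai.OpenAI()` instance (SDK >= 1.0).
    """
    response = client.embeddings.create(model=model, input=list(texts))
    # Each returned item carries its position in the request; sort defensively.
    items = sorted(response.data, key=lambda item: item.index)
    return [list(item.embedding) for item in items]


# Usage (requires the `openai` package and an OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   vectors = embed_texts(OpenAI(), [c["bio"] for c in creators])
```

Batching all bios into one request keeps the number of API round trips low, which matters once the creator list grows beyond a handful of entries.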

**Note**: To make the search "work" for this demo without a real model, we manually assign similar vectors to semantically similar concepts by boosting a dedicated block of dimensions per topic:
-   Tech/AI creators get a boost in dimensions 0-99.
-   Beauty creators get a boost in dimensions 100-199.
-   Travel creators get a boost in dimensions 200-299.

In [None]:
import re

# (keywords, boosted dimensions) per topic; keywords are matched as whole words
TOPIC_SIGNALS = [
    ({"tech", "ai", "artificial", "code", "coding"}, slice(0, 100)),  # Tech
    ({"beauty", "makeup"}, slice(100, 200)),                          # Beauty
    ({"travel", "backpacking"}, slice(200, 300)),                     # Travel
]

def get_mock_embedding(text: str) -> np.ndarray:
    """
    Generates a mock 1536-dimensional unit vector.
    For demo purposes, we bias the vector based on keywords so the math works.
    """
    dim = 1536
    vec = np.random.rand(dim) * 0.1  # Start with random noise

    # Add "signal" based on keywords. Whole-word matching is used because a
    # substring test for "ai" misses "Artificial Intelligence" and fires on
    # unrelated words such as "maintain".
    words = set(re.findall(r"[a-z]+", text.lower()))
    for keywords, block in TOPIC_SIGNALS:
        if words & keywords:
            vec[block] += 1.0
            break  # Only the first matching topic is boosted

    # Normalize to unit length so that a dot product equals cosine similarity
    return vec / np.linalg.norm(vec)

# Generate embeddings for all creators
vector_db = []
for c in creators:
    vector_db.append({
        "id": c["id"],
        "metadata": c,
        "vector": get_mock_embedding(c["bio"])
    })

print("Embeddings generated and indexed.")
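As a quick sanity check on the mock geometry, the sketch below rebuilds the same construction standalone: uniform noise plus a +1.0 boost on one 100-dimension block. `make_vec` is a hypothetical helper written for this check, not the notebook's own function, and the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # Seeded so the check is reproducible


def make_vec(block_start: int, dim: int = 1536) -> np.ndarray:
    """Hypothetical helper: uniform noise plus a +1.0 boost on one 100-dim block."""
    vec = rng.random(dim) * 0.1
    vec[block_start:block_start + 100] += 1.0
    return vec / np.linalg.norm(vec)


tech_a, tech_b = make_vec(0), make_vec(0)
beauty = make_vec(100)

same_topic = float(tech_a @ tech_b)    # Both vectors share the boosted block
cross_topic = float(tech_a @ beauty)   # Boosted blocks do not overlap
print(f"same topic:  {same_topic:.3f}")
print(f"cross topic: {cross_topic:.3f}")
```

Same-topic pairs should score close to 1, while cross-topic pairs stay far lower. The cross-topic score is not exactly 0 because the noise is non-negative, so every pair of vectors shares a small positive overlap.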

## 3. Vector Search (Cosine Similarity)

We perform a "Semantic Search" by calculating the cosine similarity between the Query Vector and all Creator Vectors.
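
For reference, the cosine similarity between a query vector $q$ and a creator vector $c$ is:

$$
\cos(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}
$$

When both vectors are already unit length, the denominator is 1 and the score reduces to the dot product $q \cdot c$.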

In [None]:
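def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Divide by the norms so the result is also correct for non-normalized inputs
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query: str, top_k: int = 3) -> List[Dict]: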
    print(f"Searching for: '{query}'...")
    query_vec = get_mock_embedding(query)
    
    results = []
    for item in vector_db:
        score = cosine_similarity(query_vec, item["vector"])
        results.append({
            "score": score,
            "creator": item["metadata"]
        })
    
    # Sort by score descending
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k]

# Test the search
hits = search("I want to learn about Artificial Intelligence")

print("\n--- Initial Vector Search Results ---")
for hit in hits:
    print(f"[{hit['score']:.4f}] {hit['creator']['handle']}: {hit['creator']['bio']}")
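
The loop above scores one creator at a time. With more than a handful of vectors, the same ranking can be computed in one matrix-vector product. The following is a minimal sketch; `top_k_cosine` is a hypothetical helper, shown on toy 2-D vectors so it runs on its own.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query_vec`."""
    # Normalize rows and query so the matrix-vector product yields cosine scores
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = rows @ q
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # Indices sorted by descending score
    return order, scores[order]


# Toy example with 2-D vectors
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, scores = top_k_cosine(np.array([1.0, 0.1]), matrix, k=2)
print(idx, scores)
```

With the notebook's data, the matrix could be built once with `np.stack([item["vector"] for item in vector_db])` and reused across queries.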

## 4. LLM Reranking (The "Intelligence" Layer)

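Vector search captures topical similarity but can miss intent: a query for tutorials and a query for news on the same topic may land close together in embedding space. We add a second stage that simulates an LLM checking whether each top result *truly* matches the user's intent.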

**Scenario**: User searches for "AI coding tutorials".
-   Vector search might return "Tech News" (High similarity to 'Tech').
-   Reranker should downrank "News" and uprank "Tutorials".

In [None]:
def llm_rerank(query: str, initial_results: List[Dict]) -> List[Dict]:
    print(f"\n--- Reranking for nuance: '{query}' ---")
    reranked = []
    
    for hit in initial_results:
        bio = hit['creator']['bio']
        original_score = hit['score']
        
        # Mock LLM Logic: Check for specific intent match
        # If query asks for "learn" or "tutorial", boost creators who "teach"
        boost = 0.0
        if "learn" in query.lower() or "tutorial" in query.lower():
            if "teach" in bio.lower() or "tutorial" in bio.lower():
                boost = 0.15 # Significant boost
                print(f"-> Boosting {hit['creator']['handle']} (Matches intent 'Education')")
        
        final_score = original_score + boost
        reranked.append({
            "score": final_score,
            "creator": hit['creator'],
            "original_score": original_score
        })
        
    reranked.sort(key=lambda x: x["score"], reverse=True)
    return reranked

# Run the full pipeline
query = "tutorials to learn AI coding"
initial_hits = search(query, top_k=5)
final_results = llm_rerank(query, initial_hits)

print("\n--- Final Reranked Results ---")
for hit in final_results:
    print(f"[{hit['score']:.4f}] {hit['creator']['handle']} (Original: {hit['original_score']:.4f})")
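
The keyword boost above stands in for a model. One way to swap in a real scorer later is to blend the retrieval score with a relevance score from a pluggable function. This is a sketch under assumptions: `rerank_with_scorer`, `score_fn`, and `alpha` are hypothetical names, `score_fn` is assumed to return a value in [0, 1], and the toy `overlap_score` only illustrates the interface.

```python
from typing import Callable, Dict, List


def rerank_with_scorer(
    query: str,
    hits: List[Dict],
    score_fn: Callable[[str, str], float],
    alpha: float = 0.5,
) -> List[Dict]:
    """Blend the retrieval score with a relevance score from `score_fn`.

    `score_fn(query, bio)` is a hypothetical hook assumed to return a value
    in [0, 1]; `alpha` weights the original vector-search score.
    """
    reranked = []
    for hit in hits:
        relevance = score_fn(query, hit["creator"]["bio"])
        reranked.append({
            "score": alpha * hit["score"] + (1 - alpha) * relevance,
            "creator": hit["creator"],
            "original_score": hit["score"],
        })
    reranked.sort(key=lambda x: x["score"], reverse=True)
    return reranked


# Toy scorer standing in for a real model: fraction of query words found in the bio
def overlap_score(query: str, bio: str) -> float:
    q_words = set(query.lower().split())
    return len(q_words & set(bio.lower().split())) / max(len(q_words), 1)


demo_hits = [
    {"score": 0.90, "creator": {"handle": "@news", "bio": "daily tech news"}},
    {"score": 0.85, "creator": {"handle": "@tutor", "bio": "python tutorials for beginners"}},
]
for hit in rerank_with_scorer("python tutorials", demo_hits, overlap_score):
    print(f"[{hit['score']:.4f}] {hit['creator']['handle']}")
```

In a production setup, `score_fn` might wrap a cross-encoder or a lightweight LLM call, with `alpha` tuned on labeled queries.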