# ChromaDB Setup for K-RSS

This notebook configures **ChromaDB** as the vector database for the video embeddings.

## Reference

> *"We employ sentence-transformers to encode demonstrations into vectors and store them using ChromaDB, which facilitates ANN search during runtime."*
>
> — **Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations**

## Technology Stack

| Component | Technology |
|-----------|------------|
| Vector DB | ChromaDB |
| Embedding Model | sentence-transformers (all-MiniLM-L6-v2) |
| Similarity | Cosine |
| Persistence | Local, on disk |

## Performance Notes

ChromaDB was chosen for its **ease of use** and its **native support for metadata filtering**, which is essential for K-RSS (filtering by category or channel, excluding videos already watched).
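
The filtering cases listed above can be combined into a single `where` clause. The following is a minimal sketch, not part of the notebook's pipeline: `build_where` is a hypothetical helper, and the exclusion of watched videos assumes each record also stores its id in a `video_id` metadata field, which the metadata built later in this notebook does not include.

```python
from typing import Any, Dict, List, Optional


def build_where(
    category: Optional[str] = None,
    channel_id: Optional[str] = None,
    seen_ids: Optional[List[str]] = None,
) -> Optional[Dict[str, Any]]:
    """Build a ChromaDB-style `where` filter from optional constraints."""
    clauses: List[Dict[str, Any]] = []
    if category:
        clauses.append({"category": {"$eq": category}})
    if channel_id:
        clauses.append({"channel_id": {"$eq": channel_id}})
    if seen_ids:
        # Assumption: a `video_id` metadata field exists on every record.
        clauses.append({"video_id": {"$nin": list(seen_ids)}})

    if not clauses:
        return None          # no filter at all
    if len(clauses) == 1:
        return clauses[0]    # `$and` needs at least two operands
    return {"$and": clauses}


print(build_where(category="Education", seen_ids=["abc123"]))
```

The returned dictionary (or `None`) could then be passed as the `where` argument of `collection.query`.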

> **Future Work:** If the system scales to millions of videos, consider migrating to **FAISS** (Facebook AI Similarity Search), which benchmarks report to be roughly 80x faster at query time. FAISS, however, requires metadata to be managed separately (e.g. in SQLite or a Pandas DataFrame).
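
To illustrate what "managing metadata separately" means in practice, here is a rough sketch that uses a NumPy brute-force search as a stand-in for a vector-only index. It is not FAISS code: the function names and the plain list used as a metadata store are assumptions made for illustration. The point is that the index returns only row positions and scores, so category filtering has to happen against a parallel structure.

```python
from typing import Dict, List, Optional, Tuple

import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def search_with_side_metadata(
    query_vec: np.ndarray,
    vectors: np.ndarray,
    metadata: List[Dict[str, str]],
    category: Optional[str] = None,
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return up to k (row_index, cosine_similarity) pairs, best first."""
    q = normalize(query_vec.reshape(1, -1))[0]
    scores = normalize(vectors) @ q
    hits: List[Tuple[int, float]] = []
    for idx in np.argsort(-scores):
        # The vector index knows nothing about categories: the filter is
        # applied afterwards, by looking up the row in the side store.
        if category is not None and metadata[idx].get("category") != category:
            continue
        hits.append((int(idx), float(scores[idx])))
        if len(hits) == k:
            break
    return hits


vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
metadata = [{"category": "A"}, {"category": "B"}, {"category": "A"}]
print(search_with_side_metadata(np.array([1.0, 0.0]), vectors, metadata, category="A", k=2))
```

Filtering after the search is the simplest option but may return fewer than `k` hits when the filter is selective, which is one reason ChromaDB's built-in filtering is convenient at the current scale.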

## 1. Setup and Data Loading

In [None]:
import json
import time
import numpy as np
from pathlib import Path
from typing import List, Dict, Optional

# Data path
DATA_PATH = Path("../data/raw/scraped_videos.json")
if not DATA_PATH.exists():
    DATA_PATH = Path("/app/data/raw/scraped_videos.json")  # Docker path

print(f"Data path: {DATA_PATH}")
print(f"File exists: {DATA_PATH.exists()}")

In [None]:
# Load the scraped videos (explicit encoding so non-ASCII titles load on any platform)
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    data = json.load(f)

videos = data.get('videos', [])
print(f"Videos loaded: {len(videos)}")

# Show an example
if videos:
    print(f"\nExample: {videos[0].get('title', 'N/A')[:60]}...")

## 2. Embedding Model (sentence-transformers)

In [None]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = 'all-MiniLM-L6-v2'  # 384 dimensions, fast
print(f"Loading model: {MODEL_NAME}")

model = SentenceTransformer(MODEL_NAME)
print(f"Model loaded! Dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
def prepare_text(video: dict) -> str:
    """Combine title and description into the text to embed."""
    # `or ''` also covers keys that are present but set to None
    title = video.get('title') or ''
    description = (video.get('description') or '')[:500]
    return f"{title}. {description}".strip()

texts = [prepare_text(v) for v in videos]
video_ids = [v.get('video_id', f'video_{i}') for i, v in enumerate(videos)]

print(f"Texts prepared: {len(texts)}")

In [None]:
# Generate embeddings
print("⏳ Generating embeddings...")
start_time = time.time()

embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

elapsed = time.time() - start_time
print(f"\nEmbeddings generated in {elapsed:.2f}s")
print(f"Shape: {embeddings.shape}")

## 3. ChromaDB Setup

In [None]:
import chromadb

# Persistence path
CHROMA_PATH = Path("../data/embeddings/chroma_db")
if not CHROMA_PATH.parent.exists():
    CHROMA_PATH = Path("/app/data/embeddings/chroma_db")
CHROMA_PATH.mkdir(parents=True, exist_ok=True)

# Initialize the persistent client
client = chromadb.PersistentClient(path=str(CHROMA_PATH))
print(f"ChromaDB initialized: {CHROMA_PATH}")

In [None]:
# Create the collection
COLLECTION_NAME = "krss_videos"

# Delete it if it already exists (so the notebook can be re-run)
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Collection '{COLLECTION_NAME}' deleted")
except Exception:
    # The collection does not exist yet; the exact exception type
    # raised here varies across ChromaDB versions
    pass

collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)
print(f"Collection '{COLLECTION_NAME}' created")

In [None]:
# Prepare the metadata for each video
# (`or` also replaces None values, which ChromaDB does not accept as metadata)
metadatas = []
for v in videos:
    metadatas.append({
        "channel_id": v.get("channel_id") or "",
        "channel_name": v.get("channel_name") or "",
        "category": v.get("category_name") or "Unknown",
        "published_at": v.get("published_at") or "",
    })

# Insert everything in a single call
print("⏳ Inserting into ChromaDB...")
start_time = time.time()

collection.add(
    ids=video_ids,
    embeddings=embeddings.tolist(),
    documents=texts,
    metadatas=metadatas
)

elapsed = time.time() - start_time
print(f"Inserted {collection.count()} videos in {elapsed:.2f}s")
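
The cell above sends every record in one `add` call. ChromaDB can reject a single call that exceeds its maximum batch size, so for larger datasets the insert may need to be split into chunks. The sketch below shows one way to do that; `batch_ranges` is a hypothetical helper and the batch size of 1000 is an arbitrary assumption, not a documented limit.

```python
from typing import Iterator, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) slice bounds that cover `total` items in chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


# Intended usage with the notebook's variables (left commented out so the
# helper stays runnable on its own):
#
# for start, end in batch_ranges(len(video_ids), 1000):
#     collection.add(
#         ids=video_ids[start:end],
#         embeddings=embeddings[start:end].tolist(),
#         documents=texts[start:end],
#         metadatas=metadatas[start:end],
#     )

print(list(batch_ranges(5, 2)))
```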

## 4. Similarity Search Test

In [None]:
def search_videos(query: str, n_results: int = 5, category_filter: Optional[str] = None):
    """Search for the videos most similar to the query."""
    query_embedding = model.encode([query])[0].tolist()
    
    where_filter = {"category": category_filter} if category_filter else None
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=where_filter,
        include=["documents", "metadatas", "distances"]
    )
    
    return results

In [None]:
# Search test
query = "machine learning tutorial"
print(f"Query: '{query}'\n")

results = search_videos(query, n_results=5)

print("Results:")
for i, (doc, meta, dist) in enumerate(zip(
    results['documents'][0], 
    results['metadatas'][0], 
    results['distances'][0]
)):
    similarity = 1 - dist
    print(f"\n{i+1}. [{similarity:.3f}] {doc[:70]}...")
    print(f"\t{meta.get('channel_name', 'N/A')} | {meta.get('category', 'N/A')}")
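
The conversion `similarity = 1 - dist` in the cell above relies on the collection being created with the cosine space, where the reported distance is one minus the cosine similarity. As a small worked example, assuming plain NumPy vectors rather than real embeddings, the relationship can be checked directly:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

# Same direction, orthogonal, and opposite vectors
for other in (x, y, -x):
    d = cosine_distance(x, other)
    print(f"distance={d:.1f}  similarity={1 - d:.1f}")
```

Since the distance ranges from 0 to 2, the recovered similarity ranges from 1 (same direction) down to -1 (opposite direction).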

In [None]:
# Test with a category filter
print("Search with category filter 'Education':")
results_filtered = search_videos(query, n_results=3, category_filter="Education")

for doc, meta in zip(results_filtered['documents'][0], results_filtered['metadatas'][0]):
    print(f"  • {doc[:60]}...")

## 5. Final Configuration

This configuration will be used in the `AI_RM` module:

In [None]:
CONFIG = {
    "vector_db": "ChromaDB",
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "similarity_metric": "cosine",
    "persistence_path": "data/embeddings/chroma_db",
    "collection_name": "krss_videos",
    "metadata_fields": ["channel_id", "channel_name", "category", "published_at"]
}

print("K-RSS configuration:")
print(json.dumps(CONFIG, indent=2))

print("\nSetup complete!")