# ChromaDB Setup for K-RSS

This notebook configures **ChromaDB** as the vector database for the video embeddings.

## Reference

> *"We employ sentence-transformers to encode demonstrations into vectors and store them using ChromaDB, which facilitates ANN search during runtime."*
>
> — **Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations**

## Technology Stack

| Component | Technology |
|-----------|------------|
| Vector DB | ChromaDB |
| Embedding Model | sentence-transformers (all-MiniLM-L6-v2) |
| Similarity | Cosine |
| Persistence | Local, on disk |

## Performance Note

ChromaDB was chosen for its **ease of use** and its **native support for metadata filtering**, which is essential for K-RSS (filtering by category or channel, excluding videos already watched).

> **Future Work:** If the system scales to millions of videos, consider migrating to **FAISS** (Facebook AI Similarity Search), which in benchmarks is reported to be ~80x faster for queries. FAISS, however, requires separate metadata management (e.g. SQLite or a Pandas DataFrame).

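To make "separate metadata management" concrete, here is a minimal sketch of the pattern a FAISS-style index would require: the index only returns row positions, and a side table keyed by the same positions is consulted for filtering. A brute-force cosine ranking in pure Python stands in for the index, and the function name, the toy 2-D vectors, and the metadata rows are all illustrative assumptions rather than part of K-RSS.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_with_side_metadata(query, vectors, metadata, n_results=2, category=None):
    """Rank rows by similarity, then filter using the separate metadata table.

    `vectors` stands in for the index (which would only return row positions);
    `metadata` is the side table, indexed by the same row positions.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda row: cosine_similarity(query, vectors[row]),
        reverse=True,
    )
    hits = []
    for row in ranked:
        # Post-filtering: the index itself knows nothing about categories
        if category is not None and metadata[row]["category"] != category:
            continue
        score = round(cosine_similarity(query, vectors[row]), 3)
        hits.append((metadata[row]["title"], score))
        if len(hits) == n_results:
            break
    return hits

# Toy data: 2-D vectors instead of the 384-D embeddings (hypothetical rows)
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
metadata = [
    {"title": "A", "category": "Education"},
    {"title": "B", "category": "Science"},
    {"title": "C", "category": "Education"},
]
hits = search_with_side_metadata([1.0, 0.0], vectors, metadata, category="Education")
print(hits)
```

With ChromaDB this bookkeeping is handled by the `where` argument of a query, which is the convenience the note above refers to.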
## 1. Setup and Data Loading

In [1]:
import sys
from pathlib import Path
import json

# Make sure the project modules are importable
sys.path.insert(0, str(Path('../source').resolve()))
from AI_RM import Embedder, EmbeddingStore, default_config

# Centralized data path
DATA_PATH = default_config.data_path / 'raw' / 'scraped_videos.json'
print(f'Data path: {DATA_PATH}')
print(f'File exists: {DATA_PATH.exists()}')


# Read the scraped videos (explicit encoding, since titles contain non-ASCII characters)
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    data = json.load(f)

videos = data.get('videos', [])
print(f'Videos loaded: {len(videos)}')
if videos:
    print(f'Example: {videos[0].get("title", "N/A")[:60]}...')

Data path: /app/data/raw/scraped_videos.json
File exists: True
Videos loaded: 405
Example: Why Laplace transforms are so useful...


In [2]:
# Initialize the embedder and the store from the central config
embedder = Embedder(config=default_config.embedding)
store = EmbeddingStore(config=default_config.vector_store)

print(f'ChromaDB path: {default_config.vector_store.persist_path}')
print(f'Embedding model: {default_config.embedding.model_name}')

ChromaDB path: /app/data/embeddings/chroma_db
Embedding model: all-MiniLM-L6-v2


## 2. Embedding Model (sentence-transformers)

In [3]:
# Prepare the texts for embedding
texts = [embedder.prepare_video_text(v) for v in videos]
video_ids = [v.get('video_id', f'video_{i}') for i, v in enumerate(videos)]

print(f'Texts prepared: {len(texts)}')
if texts:
    print(f'Example text: {texts[0][:80]}...')

Texts prepared: 405
Example text: Why Laplace transforms are so useful. Studying the forced harmonic oscillator by...

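The example text above looks like the title followed by the description, joined with a period. The sketch below shows one way such a preparation step could work. It is not the actual `Embedder.prepare_video_text` implementation: the `description` field name and the `max_chars` cap are assumptions made for illustration.

```python
def prepare_text(video: dict, max_chars: int = 1000) -> str:
    """Join title and description into one string for embedding.

    Assumes the record has `title` and `description` keys (hypothetical schema);
    missing or empty fields are skipped.
    """
    title = (video.get("title") or "").strip()
    description = (video.get("description") or "").strip()
    parts = [part for part in (title, description) if part]
    # Truncate so that very long descriptions do not dominate the input
    return ". ".join(parts)[:max_chars]

sample = {
    "title": "Why Laplace transforms are so useful",
    "description": "Studying the forced harmonic oscillator",
}
prepared = prepare_text(sample)
print(prepared)
```

Keeping the title first means it survives any truncation applied downstream by the embedding model.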

In [4]:
# Compute and store the embeddings ONLY if the DB is empty
if store.collection.count() == 0:
    print('Computing embeddings and populating the DB...')
    embedding_results = embedder.embed_videos(videos)
    embeddings = [r.embedding for r in embedding_results if r.success]
    valid_videos = [v for r, v in zip(embedding_results, videos) if r.success]
    store.add_videos(valid_videos, embeddings, texts=[embedder.prepare_video_text(v) for v in valid_videos])
    print(f'Added {len(valid_videos)} videos to ChromaDB.')
else:
    print(f'ChromaDB already populated with {store.collection.count()} videos.')

ChromaDB already populated with 135 videos.

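The check above only distinguishes an empty store from a non-empty one. In this run the store holds 135 videos while 405 were loaded, a gap that a count check can neither explain nor repair. A possible refinement, sketched here under assumptions, is to compare IDs and embed only the videos that are missing. The helper name is hypothetical, and the existing IDs are assumed to come from the store, for example from the `ids` list returned by Chroma's `collection.get()`.

```python
def find_missing(videos, existing_ids):
    """Return the videos whose `video_id` is not in the store yet.

    Records without a `video_id` and duplicates within `videos` are skipped.
    """
    seen = set(existing_ids)
    missing = []
    for video in videos:
        video_id = video.get("video_id")
        if video_id is None or video_id in seen:
            continue
        seen.add(video_id)
        missing.append(video)
    return missing

# Toy records (hypothetical IDs)
loaded = [{"video_id": "a"}, {"video_id": "b"}, {"video_id": "b"}, {"video_id": "c"}]
to_embed = find_missing(loaded, existing_ids=["a"])
print([v["video_id"] for v in to_embed])
```

Only `to_embed` would then be passed to the embedding and insertion steps, which keeps re-runs cheap.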

In [5]:
# Example of a semantic search
query = 'machine learning tutorial'
results = store.search(embedder.encode(query), n_results=5)
for i, r in enumerate(results):
    print(f'{i+1}. {r.metadata.get("title", "N/A")} (score: {r.score:.3f})')

1. Gen AI & Reinforcement Learning- Computerphile (score: 0.456)
2. Code Optimisation via Memoization - Computerphile (score: 0.404)
3. How Passkeys Work - Computerphile (score: 0.403)
4. But how do AI images and videos actually work? | Guest video by Welch Labs (score: 0.401)
5. Path Planning for Robotics - Computerphile (score: 0.398)


## 3. ChromaDB Setup

In [6]:
# Export the DB as a DataFrame for analysis
import pandas as pd
df = store.to_dataframe()
display(df.head())
print(f'Total videos in the DB: {len(df)}')

|   | video_id | text | published_at | category | channel_name | title | channel_id |
|---|----------|------|--------------|----------|--------------|-------|------------|
| 0 | FE-hM1kRK4Y | Why Laplace transforms are so useful. Studying... | 2025-11-05T13:09:45+00:00 | Education | 3Blue1Brown | Why Laplace transforms are so useful | UCYO_jab_esuFRV4b17AJtAw |
| 1 | bnjKwiUg-kw | The dynamics of e^(πi). A fuller version of th... | 2025-10-12T12:40:46+00:00 | Education | 3Blue1Brown | The dynamics of e^(πi) | UCYO_jab_esuFRV4b17AJtAw |
| 2 | j0wJBEZdwLs | But what is a Laplace Transform?. Visualizing ... | 2025-10-12T11:19:53+00:00 | Education | 3Blue1Brown | But what is a Laplace Transform? | UCYO_jab_esuFRV4b17AJtAw |
| 3 | -j8PzkZ70Lg | The Physics of Euler's Formula \| Laplace Trans... | 2025-10-05T13:53:28+00:00 | Education | 3Blue1Brown | The Physics of Euler's Formula \| Laplace Trans... | UCYO_jab_esuFRV4b17AJtAw |
| 4 | M-MgQC6z3VU | What was Euclid really doing? \| Guest video by... | 2025-09-18T14:13:05+00:00 | Education | 3Blue1Brown | What was Euclid really doing? \| Guest video by... | UCYO_jab_esuFRV4b17AJtAw |

Total videos in the DB: 135


In [7]:
# Create the collection
from chromadb import PersistentClient

COLLECTION_NAME = "krss_videos"

# Initialize the ChromaDB client
client = PersistentClient(path=str(default_config.vector_store.persist_path))

# Delete the collection if it already exists, so the notebook can be re-run.
# This discards everything previously stored under this name.
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Collection '{COLLECTION_NAME}' deleted")
except Exception:
    # The collection did not exist yet (the exception type varies across ChromaDB versions)
    pass

collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)
print(f"Collection '{COLLECTION_NAME}' created")

Collection 'krss_videos' deleted
Collection 'krss_videos' created


In [8]:
# Prepare the metadata for each video
import time

def video_metadata(v):
    """Build the ChromaDB metadata dict for one video."""
    # Normalize the publication field: some files use "published_date"
    published = v.get("published_date", v.get("published_at", ""))
    return {
        "title": v.get("title", "Unknown"),
        "channel_id": v.get("channel_id", ""),
        "channel_name": v.get("channel_name", ""),
        "category": v.get("category_name", "Unknown"),
        "published_at": published,
    }

# If no embeddings are available (e.g. the DB was already populated), do not re-insert.
# In that case the collection recreated above stays empty.
if 'embeddings' not in globals() or embeddings is None or len(embeddings) == 0:
    print("Embeddings undefined or empty: skipping insertion. To insert them, run the cell that computes the embeddings.")
else:
    # `embeddings` only covers the videos that were embedded successfully, so build
    # ids, documents and metadatas from `valid_videos` to keep all four lists aligned
    print("⏳ Inserting into ChromaDB...")
    start_time = time.time()

    collection.add(
        ids=[v.get('video_id', f'video_{i}') for i, v in enumerate(valid_videos)],
        embeddings=embeddings,
        documents=[embedder.prepare_video_text(v) for v in valid_videos],
        metadatas=[video_metadata(v) for v in valid_videos]
    )

    elapsed = time.time() - start_time
    print(f"Inserted {collection.count()} videos in {elapsed:.2f}s")

Embeddings undefined or empty: skipping insertion. To insert them, run the cell that computes the embeddings.

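An insertion of this kind takes four parallel lists, so a cheap sanity check before calling `collection.add` can catch misaligned inputs early, for example when some videos failed to embed. The helper below is a hypothetical sketch in pure Python. Treating duplicate IDs as an error is an assumption about what is wanted here.

```python
def check_aligned(ids, embeddings, documents, metadatas):
    """Return the number of items if all four lists line up, else raise ValueError."""
    lengths = {
        "ids": len(ids),
        "embeddings": len(embeddings),
        "documents": len(documents),
        "metadatas": len(metadatas),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Length mismatch: {lengths}")
    if len(set(ids)) != len(ids):
        raise ValueError("Duplicate ids found")
    return lengths["ids"]

# Toy inputs (hypothetical values, 1-D vectors for brevity)
n_items = check_aligned(
    ids=["a", "b"],
    embeddings=[[0.1], [0.2]],
    documents=["text a", "text b"],
    metadatas=[{"title": "A"}, {"title": "B"}],
)
print(n_items)
```

Calling the check right before the insertion makes a mismatch fail with a message that names the offending list lengths.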

## 4. Similarity Search Test

In [9]:
from typing import Optional

def search_videos(query: str, n_results: int = 5, category_filter: Optional[str] = None):
    """Search for videos similar to the query, optionally restricted to one category."""
    query_embedding = embedder.encode(query)
    
    where_filter = {"category": category_filter} if category_filter else None
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=where_filter,
        include=["documents", "metadatas", "distances"]
    )
    
    return results

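The function above filters on a single field. The other filters mentioned in the performance note (channel, excluding videos already watched) could be combined into one Chroma `where` expression, as in the sketch below. The helper name is hypothetical, and it assumes a `video_id` field is also stored in the metadata, which the insertion cell in this notebook does not do. A single clause is returned on its own because `$and` expects at least two sub-expressions.

```python
from typing import Optional, Sequence

def build_where(
    category: Optional[str] = None,
    channel_name: Optional[str] = None,
    exclude_ids: Optional[Sequence[str]] = None,
) -> Optional[dict]:
    """Build a Chroma-style `where` filter, or None when no filter applies."""
    clauses = []
    if category:
        clauses.append({"category": {"$eq": category}})
    if channel_name:
        clauses.append({"channel_name": {"$eq": channel_name}})
    if exclude_ids:
        # Assumes `video_id` is duplicated into the metadata (not done in this notebook)
        clauses.append({"video_id": {"$nin": list(exclude_ids)}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}

where = build_where(category="Education", exclude_ids=["FE-hM1kRK4Y"])
print(where)
```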
In [10]:
# Search test
query = "machine learning tutorial"
print(f"Query: '{query}'\n")

results = search_videos(query, n_results=5)

print("Results:")
for i, (doc, meta, dist) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
)):
    similarity = 1 - dist  # cosine distance -> cosine similarity
    print(f"\n{i+1}. [{similarity:.3f}] {doc[:70]}...")
    print(f"\t{meta.get('channel_name', 'N/A')} | {meta.get('category', 'N/A')}")

Query: 'machine learning tutorial'

Results:

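No hits are listed above, which is what an empty collection produces: the query returns one inner list per query embedding, and that list is empty when nothing has been inserted. The sketch below flattens a result of that shape into rows and converts cosine distance to similarity as `1 - distance`. The helper name is hypothetical, and the sample dictionaries are hand-written stand-ins, not real query output.

```python
def rows_from_query(results):
    """Flatten the hits of the first query into (similarity, title, document) tuples."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        (round(1.0 - distance, 3), metadata.get("title", "N/A"), document)
        for document, metadata, distance in zip(documents, metadatas, distances)
    ]

# Hand-written stand-ins with the same nesting as a query result
sample = {
    "documents": [["doc a", "doc b"]],
    "metadatas": [[{"title": "A"}, {}]],
    "distances": [[0.25, 0.5]],
}
rows = rows_from_query(sample)
no_rows = rows_from_query({"documents": [[]], "metadatas": [[]], "distances": [[]]})

if not no_rows:
    print("No results: check that the collection has been populated.")
print(rows)
```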

In [11]:
# Test with a category filter
print("Search with category filter 'Education':")
results_filtered = search_videos(query, n_results=3, category_filter="Education")

for doc, meta in zip(results_filtered['documents'][0], results_filtered['metadatas'][0]):
    print(f"  • {doc[:60]}...")

Search with category filter 'Education':


## 5. Final Configuration

This configuration will be used in the `AI_RM` module:

In [12]:
CONFIG = {
    "vector_db": "ChromaDB",
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "similarity_metric": "cosine",
    "persistence_path": "data/embeddings/chroma_db",
    "collection_name": "krss_videos",
    "metadata_fields": ["channel_id", "channel_name", "category", "published_at"]
}

print("K-RSS configuration:")
print(json.dumps(CONFIG, indent=2))

print("\nSetup complete!")

K-RSS configuration:
{
  "vector_db": "ChromaDB",
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_dimension": 384,
  "similarity_metric": "cosine",
  "persistence_path": "data/embeddings/chroma_db",
  "collection_name": "krss_videos",
  "metadata_fields": [
    "channel_id",
    "channel_name",
    "category",
    "published_at"
  ]
}

Setup complete!

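If the configuration is consumed by code rather than only printed, a typed and immutable container can catch mistakes such as a vector of the wrong dimension. The sketch below mirrors the dictionary above with a frozen dataclass. The class name and the `check_embedding` method are hypothetical and do not describe how `AI_RM` actually stores its configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorStoreSettings:
    """Hypothetical typed mirror of the CONFIG dictionary."""
    vector_db: str = "ChromaDB"
    embedding_model: str = "all-MiniLM-L6-v2"
    embedding_dimension: int = 384
    similarity_metric: str = "cosine"
    persistence_path: str = "data/embeddings/chroma_db"
    collection_name: str = "krss_videos"
    metadata_fields: tuple = ("channel_id", "channel_name", "category", "published_at")

    def check_embedding(self, embedding) -> None:
        """Raise ValueError if a vector does not have the configured dimension."""
        if len(embedding) != self.embedding_dimension:
            raise ValueError(
                f"Expected {self.embedding_dimension} dimensions, got {len(embedding)}"
            )

settings = VectorStoreSettings()
settings.check_embedding([0.0] * 384)  # correct dimension, so nothing is raised
print(settings.collection_name)
```

Because the dataclass is frozen, accidental reassignment of a field raises an error instead of silently changing the setup.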