# Product Retrieval System

The goal is to measure how well the model retrieves the most relevant products
for each user query (`query`), using the WANDS dataset.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## TF-IDF Optimization

This section covers the options for improving our original TF-IDF model.

### Lemmatizing the text

Until now we have compared words without considering the root form they may share. Here we lemmatize the text and check how that affects the quality of the results.

We use spaCy for this.

In [3]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

In [None]:
def lemma_tokenizer(text):
    doc = nlp(text)
    # keep only the lemmas of alphabetic, non-stopword tokens
    return [token.lemma_ for token in doc 
            if not token.is_stop and token.is_alpha]


In [1]:
#clone the git repo that contains the data and additional information about the dataset
!git clone https://github.com/wayfair/WANDS.git

Cloning into 'WANDS'...


### Applying the Optimal Parameters

The search for the optimal parameters used in the next cell is explained further below.

In [6]:
#define functions for product search using Tf-IDF
def calculate_tfidf(dataframe):
    """
    Calculate the TF-IDF for combined product name and description.

    Parameters:
    dataframe (pd.DataFrame): DataFrame with product_id, and other product information.

    Returns:
    TfidfVectorizer, csr_matrix: TF-IDF vectorizer and TF-IDF matrix.
    """
    # Combine product name and description to vectorize
    # NOTE: Please feel free to use any combination of columns available, some columns may contain NULL values
    combined_text = dataframe['product_name'] + ' ' + dataframe['product_description']
    vectorizer = TfidfVectorizer(
        tokenizer=lemma_tokenizer,
        sublinear_tf=True,
        stop_words='english',
    )


    # convert combined_text to list of unicode strings
    tfidf_matrix = vectorizer.fit_transform(combined_text.values.astype('U'))
    return vectorizer, tfidf_matrix

def get_top_products(vectorizer, tfidf_matrix, query, top_n=10):
    """
    Get top N products for a given query based on TF-IDF similarity.

    Parameters:
    vectorizer (TfidfVectorizer): Trained TF-IDF vectorizer.
    tfidf_matrix (csr_matrix): TF-IDF matrix for the products.
    query (str): Search query.
    top_n (int): Number of top products to return.

    Returns:
    np.ndarray: Row positions (in the TF-IDF matrix) of the top N products, best first.
    """
    query_vector = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    top_product_indices = cosine_similarities.argsort()[-top_n:][::-1]
    return top_product_indices

In [7]:
#define functions for evaluating retrieval performance
def map_at_k(true_ids, predicted_ids, k=10):
    """
    Calculate the Mean Average Precision at K (MAP@K).

    Parameters:
    true_ids (list): List of relevant product IDs.
    predicted_ids (list): List of predicted product IDs.
    k (int): Number of top elements to consider.
             NOTE: IF you wish to change top k, please provide a justification for choosing the new value

    Returns:
    float: MAP@K score.
    """
    #if either list is empty, return 0
    if not len(true_ids) or not len(predicted_ids):
        return 0.0

    score = 0.0
    num_hits = 0.0

    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    return score / min(len(true_ids), k)

### New Metrics

We selected a small, representative set of metrics:
- **MAP@10 (Exact only)**: the target metric of the exercise; measures ranked precision for exact matches.
- **Weighted MAP@10**: extends MAP to account for partial matches (Exact=1, Partial=0.X).
- **NDCG@10**: standard metric for graded relevance (useful for comparisons when there are several relevance levels).
- **Precision@10 / Recall@10**: intuitive measures of quality and coverage in the top K results.

Later in the notebook we use these metrics to compare the semantic retriever (FAISS) with the reranked version (Cross-Encoder).
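To make the binary metrics and NDCG concrete, here is a minimal standard-library sketch on a toy ranking. The product IDs, labels and gain values are made up for illustration, and the helper names (`average_precision_at_k`, `ndcg_at_k`) are hypothetical simplifications rather than the notebook's own functions.

```python
import math

def average_precision_at_k(relevant, ranked, k=10):
    """AP@k with binary relevance, normalized by min(len(relevant), k)."""
    hits, score = 0, 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(gains, k=10):
    """NDCG@k where the ideal order is the same gains sorted descending."""
    def dcg(values):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(values))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical toy ranking: IDs and labels are invented for this example.
ranked = [7, 3, 9, 1, 5]
exact = {7, 9, 4}                      # product 4 is relevant but never retrieved
gains = [1.0, 0.0, 1.0, 0.5, 0.0]      # Exact=1, Partial=0.5, Irrelevant=0

ap = average_precision_at_k(exact, ranked, k=5)
precision = sum(pid in exact for pid in ranked[:5]) / 5
recall = sum(pid in exact for pid in ranked[:5]) / len(exact)
ndcg = ndcg_at_k(gains, k=5)
print(f"AP@5={ap:.4f} P@5={precision:.2f} R@5={recall:.4f} NDCG@5={ndcg:.4f}")
```

Two of the three exact matches are retrieved, at ranks 1 and 3, so AP@5 is (1/1 + 2/3) / 3 and recall is 2/3. NDCG also gives credit to the partial match at rank 4.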

In [None]:
import numpy as np
import math

def dcg_from_scores(rels):
    return sum((2**r - 1) / math.log2(i+2) for i, r in enumerate(rels))

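def ndcg_at_k_from_relevance_list(relevance_list, k=10):
    topk = relevance_list[:k]
    dcg = dcg_from_scores(topk)
    # The ideal ranking is built from the retrieved list only, so this measures
    # how well the retrieved items are ordered, not whether relevant items were missed.
    ideal = sorted(relevance_list, reverse=True)[:k]
    idcg = dcg_from_scores(ideal)
    return dcg / idcg if idcg > 0 else 0.0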

def precision_at_k(predicted_ids, relevant_set, k=10):
    topk = predicted_ids[:k]
    return sum(1 for pid in topk if pid in relevant_set) / k

def recall_at_k(predicted_ids, relevant_set, k=10):
    topk = predicted_ids[:k]
    return sum(1 for pid in topk if pid in relevant_set) / max(1, len(relevant_set))

def weighted_map_at_k(relevance_dict, predicted_ids, k=10):
    """
    relevance_dict: {product_id: weight}, weight ∈ [0,1]
    predicted_ids: ranked list of product_id values
    """
    if not relevance_dict or not predicted_ids:
        return 0.0

    cum_gain = 0.0        
    score = 0.0
    total_possible = sum(relevance_dict.values())

    for i, pid in enumerate(predicted_ids[:k], start=1):
        rel = relevance_dict.get(pid, 0.0)
        if rel > 0:
            cum_gain += rel
            score += cum_gain / i

    return score / min(total_possible, k) if total_possible > 0 else 0.0

def relevance_list_for_predicted_ids(relevance_dict, predicted_ids, default=0.0):
    return [relevance_dict.get(pid, default) for pid in predicted_ids]

In [9]:
# get search queries
query_df = pd.read_csv("WANDS/dataset/query.csv", sep='\t')

In [10]:
query_df.head()

Unnamed: 0,query_id,query,query_class
0,0,salon chair,Massage Chairs
1,1,smart coffee table,Coffee & Cocktail Tables
2,2,dinosaur,Kids Wall Décor
3,3,turquoise pillows,Accent Pillows
4,4,chair and a half recliner,Recliners


In [11]:
# get products
product_df = pd.read_csv("WANDS/dataset/product.csv", sep='\t')

In [12]:
product_df.head()

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0


In [13]:
# get manually labeled ground-truth labels
label_df = pd.read_csv("WANDS/dataset/label.csv", sep='\t')

In [14]:
label_df.head()

Unnamed: 0,id,query_id,product_id,label
0,0,0,25434,Exact
1,1,0,12088,Irrelevant
2,2,0,42931,Exact
3,3,0,2636,Exact
4,4,0,42923,Exact


In [15]:
#group the labels for each query to use when identifying exact matches
grouped_label_df = label_df.groupby('query_id')

In [16]:
# Calculate TF-IDF
vectorizer, tfidf_matrix = calculate_tfidf(product_df)



In [17]:
#Sanity check code block to see if the search results are relevant
#implementing a function to retrieve top K product IDs for a query
def get_top_product_ids_for_query(query):
    top_product_indices = get_top_products(vectorizer, tfidf_matrix, query, top_n=10)
    top_product_ids = product_df.iloc[top_product_indices]['product_id'].tolist()
    return top_product_ids

#define the test query
query = "armchair"

#obtain top product IDs
top_product_ids = get_top_product_ids_for_query(query)

print(f"Top products for '{query}':")
for product_id in top_product_ids:
    product = product_df.loc[product_df['product_id'] == product_id]
    print(product_id, product['product_name'].values[0])

Top products for 'armchair':
12756 24.41 '' wide tufted polyester armchair
4277 26.8 '' wide armchair
41270 almaraz 33.7 '' wide leather match armchair
6528 waugh 29 '' wide tufted polyester armchair
11469 tiemann 30 '' wide polyester armchair
42697 donham 25 '' wide armchair
23927 catania 30 '' wide polyester armchair
1587 akire 26.8 '' wide tufted linen armchair
29612 35.5 '' wide velvet armchair
1694 ascanio 28.5 '' wide tufted polyester armchair


In [18]:
#implementing a function to retrieve exact match product IDs for a query_id
def get_exact_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

#applying the function to obtain top product IDs and adding top K product IDs to the dataframe 
query_df['top_product_ids'] = query_df['query'].apply(get_top_product_ids_for_query)

#adding the list of exact match product_IDs from labels_df
query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query)

#now assign the map@k score
query_df['map@k'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)


In [19]:
# calculate the MAP across the entire query set
query_df.loc[:, 'map@k'].mean()

np.float64(0.3293063616071429)

At this point we have already exceeded the initial target of a MAP@10 above 0.3, which is a solid step forward.
This confirms that lemmatization, together with a few tuned parameters, yields an initial improvement.

In [20]:
#now assign the map@k score
query_df['map@10'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)

### Searching for the Optimal Parameters

We tried different combinations of model parameters in search of the best one.
Changing `min_df`, `max_df` and `ngram_range` was discarded; only `stop_words` and `sublinear_tf` gave a considerable improvement. The grid in the next cell is shown already narrowed to that final combination.
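As a rough illustration of what `sublinear_tf=True` changes, scikit-learn documents it as replacing the raw term frequency `tf` with `1 + log(tf)`. The sketch below reproduces only that scaling with the standard library; `tf_weight` is a hypothetical helper, and IDF weighting and normalization are left out.

```python
import math

def tf_weight(count, sublinear=False):
    """Raw term count, or 1 + ln(count) when sublinear scaling is on."""
    if count <= 0:
        return 0.0
    return 1.0 + math.log(count) if sublinear else float(count)

# Compare raw and sublinear weights for increasingly repeated terms.
for count in (1, 2, 10, 100):
    print(count, tf_weight(count), round(tf_weight(count, sublinear=True), 3))
```

With sublinear scaling, a term repeated 100 times weighs only a few times more than a term that appears once, which limits the influence of long, repetitive product descriptions.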

In [None]:
import itertools
import time
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def evaluate_vectorizer(params, product_df, query_df):
    """
    Build a TF-IDF model with params, retrieve the top 10 and return the mean MAP@10.
    params: dict of TfidfVectorizer keyword arguments
    """
    # 1. Vectorize
    vectorizer = TfidfVectorizer(**params)
    tfidf_matrix = vectorizer.fit_transform(
        (product_df['product_name'] + ' ' + product_df['product_description'])
        .values.astype('U')
    )
    # 2. Retrieval + evaluation
    query_df = query_df.copy()
    # get_top_products returns row positions; map them to product_id before scoring
    query_df['top_ids'] = query_df['query'].apply(
        lambda q: product_df.iloc[
            get_top_products(vectorizer, tfidf_matrix, q, top_n=10)
        ]['product_id'].tolist()
    )
    query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query)
    query_df['map@10'] = query_df.apply(lambda r: map_at_k(r['relevant_ids'], r['top_ids'], k=10), axis=1)
    return query_df['map@10'].mean()

# Define the search space
param_grid = {
    'tokenizer': [lemma_tokenizer],
    'ngram_range': [(1,1)],
    'min_df': [1],
    'max_df': [1.0],
    'sublinear_tf': [True],
    'stop_words': ['english']
}

# Generate combinations
keys, values = zip(*param_grid.items())
combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

results = []
for params in combinations:
    start = time.time()
    score = evaluate_vectorizer(params, product_df, query_df)
    duration = time.time() - start
    results.append({
        **params,
        'map@10': score,
        'time_s': duration
    })
    print(f"Tried {params} → MAP@10={score:.4f} ({duration:.1f}s)")

res_df = pd.DataFrame(results).sort_values('map@10', ascending=False)
print(res_df.head())



Tried {'tokenizer': <function lemma_tokenizer at 0x000001B8E89DCAF0>, 'ngram_range': (1, 1), 'min_df': 1, 'max_df': 1.0, 'sublinear_tf': True, 'stop_words': 'english'} → MAP@10=0.3293 (217.5s)
                                          tokenizer ngram_range  min_df  \
0  <function lemma_tokenizer at 0x000001B8E89DCAF0>      (1, 1)       1   

   max_df  sublinear_tf stop_words    map@10      time_s  
0     1.0          True    english  0.329306  217.491477  


In [22]:
res_df.head()

Unnamed: 0,tokenizer,ngram_range,min_df,max_df,sublinear_tf,stop_words,map@10,time_s
0,<function lemma_tokenizer at 0x000001B8E89DCAF0>,"(1, 1)",1,1.0,True,english,0.329306,217.491477


## Partial Relevance

With the TF-IDF model tuned, the next step was to adjust the evaluation metric so that it no longer ignores "Partial" matches and penalizes results less harshly. The partial weight was kept at 0.5, since I did not have enough domain knowledge to determine a better weighting for these products.
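The effect of the partial weight can be seen on a toy ranking. The sketch below follows the same accumulation as `weighted_map_at_k` (cumulative gain divided by rank, normalized by the total available gain); the product IDs and labels are hypothetical and `weighted_ap_at_k` is a standalone copy written for this example.

```python
def weighted_ap_at_k(relevance, ranked, k=10):
    """Weighted AP@k: cumulative gain over rank, normalized by total gain (capped at k)."""
    total = sum(relevance.values())
    if total <= 0:
        return 0.0
    gain, score = 0.0, 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        weight = relevance.get(pid, 0.0)
        if weight > 0:
            gain += weight
            score += gain / rank
    return score / min(total, k)

ranked = [11, 22, 33]                  # hypothetical ranked product IDs
labels = {11: "Partial", 33: "Exact"}  # hypothetical ground-truth labels

scores = {}
for partial_w in (0.0, 0.5, 1.0):
    weights = {"Exact": 1.0, "Partial": partial_w}
    relevance = {pid: weights[label] for pid, label in labels.items()}
    scores[partial_w] = weighted_ap_at_k(relevance, ranked, k=3)
    print(partial_w, round(scores[partial_w], 4))
```

Because the partial match sits at rank 1, raising `partial_w` raises the score of the same ranking, so weighted scores are only comparable when they use the same weight.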

In [23]:
#implementing a function to retrieve partial match product IDs for a query_id
def get_partial_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    partial_matches = query_group.loc[query_group['label'] == 'Partial']['product_id'].values
    return partial_matches

In [24]:
def get_relevance_scores(query_id, exact_w=1.0, partial_w=0.5):
    exact = set(get_exact_matches_for_query(query_id))
    partial = set(get_partial_matches_for_query(query_id))
    # product_id → relevance weight; Exact overrides Partial if an ID carries both labels
    return {pid: partial_w for pid in partial} | {pid: exact_w for pid in exact}

In [None]:
query_df['relevance_dict'] = query_df['query_id'].apply(get_relevance_scores)

query_df['map@10_weighted'] = query_df.apply(
    lambda row: weighted_map_at_k(row['relevance_dict'], row['top_product_ids'], k=10),
    axis=1
)

# MAP final
query_df['map@10_weighted'].mean()

np.float64(0.48919072102665856)

With partial matches included, the score rises to about 0.49, close to 0.5.

## Moving to Semantic Retrieval with Embeddings

TF-IDF is a good baseline, but it struggles with synonyms, inflected forms and queries with little lexical overlap.
To improve retrieval (and potentially MAP@10) we tried a **semantic embeddings** stage using `sentence-transformers` (all-MiniLM-L6-v2), which produces dense vectors able to capture semantic similarity between the query and the product description.

In this section we:
- Generate embeddings for `product_name` + `product_description`.
- Build a FAISS index for fast similarity search (product ↔ query).
- Compare metrics (MAP@10 and weighted MAP@10) against the TF-IDF baseline.

### Why try embeddings?

- **Lexical robustness**: embeddings represent meaning, not just word overlap; they detect synonyms and paraphrases. Having context should benefit the results.
- **Generalization**: short or misspelled queries can still map to semantically close descriptions.
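The lexical limitation can be shown without any model. The sketch below uses plain token overlap (Jaccard) as a simplified stand-in for lexical matching; it is not TF-IDF itself, and the query and product strings are invented for the example.

```python
def token_overlap(query, document):
    """Jaccard overlap between lowercase whitespace tokens."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

# Same intent, different words: a purely lexical score sees nothing in common.
synonym_score = token_overlap("grey couch", "gray sofa with chaise")
# Shared surface tokens score well even though the product is only an accessory.
literal_score = token_overlap("grey couch", "grey couch cover")
print(synonym_score, literal_score)
```

A purely lexical matcher ranks the couch cover above the actual sofa, which is the kind of case where dense embeddings are expected to help.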


### Generating the embeddings

We use `sentence-transformers/all-MiniLM-L6-v2` because it is lightweight and effective for semantic search.

In [None]:
import torch
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

combined_text = (
    product_df['product_name'].fillna('') + ' ' +
    product_df['product_description'].fillna('')
).astype(str).tolist()

product_embeddings = embedding_model.encode(
    combined_text,
    convert_to_tensor=True,
    batch_size=64,
    show_progress_bar=True,
    # fall back to CPU when no GPU is available
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

Batches: 100%|██████████| 672/672 [00:47<00:00, 14.20it/s]


### FAISS indexing and normalization

FAISS (Facebook AI Similarity Search) provides very fast nearest-neighbor search at scale.
We build an `IndexFlatIP` index over the product embeddings and score by **inner product** on L2-normalized vectors, which is equivalent to cosine similarity.
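The equivalence between inner product on L2-normalized vectors and cosine similarity can be checked with a few lines of standard-library Python. The two vectors are arbitrary illustrative values, not real embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]

# Inner product after normalizing both sides, as an inner-product index would compute it.
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(ip_normalized, cosine(a, b))
```

This is why both the product embeddings and the query embedding need to be normalized before they reach the index: if either side is left unnormalized, the inner product is no longer a cosine.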

In [None]:
import faiss
import torch.nn.functional as F


product_embeddings = F.normalize(product_embeddings, p=2, dim=1)

embeddings_np = product_embeddings.cpu().numpy().astype('float32')

index = faiss.IndexFlatIP(embeddings_np.shape[1])
index.add(embeddings_np)



In [45]:
import os

output_dir = "data"
os.makedirs(output_dir, exist_ok=True)
faiss.write_index(index, os.path.join(output_dir, "faiss.index"))

### Semantic retrieval functions

We define two helpers:
- `get_top_semantic(query, top_k)`: returns the indices (positions in `product_df`) of the top K by similarity.
- `get_top_semantic_ids(query, top_k)`: maps indices → `product_id`.

In [None]:
def get_top_semantic(query, top_k=10):
    q_emb = embedding_model.encode([query], convert_to_tensor=True)
    q_np = q_emb.cpu().numpy().astype('float32')
    faiss.normalize_L2(q_np)
    D, I = index.search(q_np, top_k)
    return I.flatten().tolist()



In [30]:
def get_top_semantic_ids(query, top_k=10):
    idxs = get_top_semantic(query, top_k)
    return product_df.iloc[idxs]['product_id'].tolist()

In [31]:
query_df['top_semantic_ids'] = query_df['query'].apply(get_top_semantic_ids)

### TF-IDF vs. Embeddings (semantic) comparison

We report MAP@10 and Weighted MAP@10 to compare both approaches.

In [32]:
query_df['map10_semantic'] = query_df.apply(
    lambda row: map_at_k(row['relevant_ids'], row['top_semantic_ids'], k=10),
    axis=1
)

print("MAP@10 TF-IDF   :", query_df['map@k'].mean())
print("MAP@10 Semantic :", query_df['map10_semantic'].mean())

MAP@10 TF-IDF   : 0.3293063616071429
MAP@10 Semantic : 0.3353198072824809


In [33]:
query_df['map10_semantic_weighted'] = query_df.apply(
    lambda row: weighted_map_at_k(row['relevance_dict'], row['top_semantic_ids'], k=10),
    axis=1
)


In [34]:
# Note: both values below are Weighted MAP@10, although the printed labels say MAP@10
print("MAP@10 TF-IDF  :", query_df['map@10_weighted'].mean())
print("MAP@10 Semantic:", query_df['map10_semantic_weighted'].mean())


MAP@10 TF-IDF  : 0.48919072102665856
MAP@10 Semantic: 0.5342082347629222


### Conclusions

From these results we can conclude the following:

- **Result**: The `sentence-transformers` + FAISS pipeline improved both MAP@10 (0.3293 → 0.3353) and Weighted MAP@10 (0.4892 → 0.5342). This suggests that the dense representation captures semantic relationships that TF-IDF misses, for the reasons given earlier.
- **Cost**: Generating embeddings and maintaining a FAISS index requires more memory/CPU/GPU and upfront time, but the results make up for it.
- **Logical next step**: add a reranker (Cross-Encoder) and/or a RAG/LLM layer to justify or refine the selection. The evaluation also points in that direction.

## Reranker (Cross-Encoder) and RAG (GPT4All)

Goal: refine the candidate list (FAISS) with a cross-encoder to improve the ranking,
and use a local LLM to justify the choice (RAG). We evaluate the impact on MAP@10.

First we implement the function that re-scores and reorders the candidate indices it is given.
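The two-stage idea (a cheap retriever proposes candidates, a more expensive scorer reorders them) can be sketched independently of FAISS or any model. Everything here is a hypothetical stand-in: `cheap_score` plays the role of the retriever, `precise_score` the role of the cross-encoder, and the corpus is invented.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score, top_k=3, top_m=2):
    """Stage 1 keeps top_k documents by cheap_score; stage 2 reorders them by precise_score."""
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:top_k]
    return sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)[:top_m]

def cheap_score(query, doc):
    # Stand-in retriever: number of shared tokens.
    return len(set(query.split()) & set(doc.split()))

def precise_score(query, doc):
    # Stand-in reranker: same overlap, plus a bonus when the query appears as an exact phrase.
    return cheap_score(query, doc) + (2 if query in doc else 0)

corpus = [
    "chair cushion for armchair leather",
    "leather armchair",
    "leather sofa",
    "wood table",
]
result = retrieve_then_rerank("leather armchair", corpus, cheap_score, precise_score)
print(result)
```

The reranker only sees what the first stage returns, so `top_k` bounds the quality it can reach: a relevant product that the retriever misses cannot be recovered later.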

In [None]:
# --- imports ---
from sentence_transformers import CrossEncoder
from gpt4all import GPT4All 
import numpy as np
import faiss

# --- 1) Reranker: load the cross-encoder (pretrained on MS MARCO) ---

cross = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  

def rerank_candidates(query, candidate_indices, top_m=10):
    """
    Returns:
      - ranked_indices: list of indices (positions in product_df) sorted by descending score (len <= top_m)
      - ranked_scores: list of scores aligned with ranked_indices (floats)
    """
    if len(candidate_indices) == 0:
        return [], []

    candidate_texts = (
        product_df.iloc[candidate_indices]['product_name'].fillna('') + " - " +
        product_df.iloc[candidate_indices]['product_description'].fillna('')
    ).tolist()

    # batch predict
    inputs = [[query, c] for c in candidate_texts]
    scores = cross.predict(inputs)          

    
    ranked_pos = sorted(range(len(candidate_indices)), key=lambda i: float(scores[i]), reverse=True)
    ranked_indices = [candidate_indices[i] for i in ranked_pos[:top_m]]
    ranked_scores = [float(scores[i]) for i in ranked_pos[:top_m]]
    return ranked_indices, ranked_scores

Next we prepare the context for the LLM. In this case we use `Meta-Llama-3-8B-Instruct.Q4_0`, because it is free.

In [None]:
# --- 2) RAG: generation with a free local model (GPT4All as an example) ---


gpt_model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  

def build_context_from_indices(indices, max_chars=3000):
    
    parts = []
    total = 0
    for idx in indices:
        text = f"product_id: {product_df.iloc[idx]['product_id']}\n" \
               f"name: {product_df.iloc[idx]['product_name']}\n" \
               f"description: {product_df.iloc[idx]['product_description']}\n"
        if total + len(text) > max_chars:
            break
        parts.append(text)
        total += len(text)
    return "\n\n".join(parts)


Finally, we write the RAG function that obtains the answer from the LLM.

In [None]:
def rag_answer(query, top_k_retriever=20, rerank_top_m=5):
    # 1) Retrieve top_k_retriever candidates with FAISS (similarity already indexed in 'index')
    q_emb = embedding_model.encode([query], convert_to_tensor=False)  # numpy array, shape (1, dim)
    q_np = np.array(q_emb).astype('float32')
    faiss.normalize_L2(q_np)
    D, I = index.search(q_np.reshape(1, -1), top_k_retriever)  # I shape (1, k)
    candidate_idxs = I.flatten().tolist()

    # 2) Rerank with the cross-encoder and keep rerank_top_m
    reranked_idxs, rerank_scores = rerank_candidates(query, candidate_idxs, top_m=rerank_top_m)

    # 3) Build the context (top passages) and the RAG prompt
    context = build_context_from_indices(reranked_idxs, max_chars=3500)
    prompt = f"""
        You are a product assistant. Use ONLY the information in the context to answer the user's question.
        Return a short JSON with keys: best_product_id, reasons (list of 1-2 short reasons), top_candidates (list of product_id).
        If you don't know, say you don't know.

        Context:
        {context}

        User question:
        {query}

        Answer in JSON only.
    """
    # 4) Generate with the local GPT4All model
    with gpt_model.chat_session():
        raw = gpt_model.generate(prompt, max_tokens=400)
    
    import json, re
    
    # Take everything from the first '{' to the last '}', since the model may wrap the JSON in extra text
    json_text = re.search(r'\{.*\}', raw, flags=re.DOTALL)
    if json_text:
        try:
            return json.loads(json_text.group(0))
        except json.JSONDecodeError:
            return {"error": "could not parse JSON from the LLM", "raw": raw}
    else:
        return {"raw": raw}

In [38]:
# --- 3) Evaluation: compare MAP@10 before / after reranking ---
def evaluate_rerank_on_queries(query_df, top_k_retriever=50, rerank_top_m=10):
    scores = []
    for _, row in query_df.iterrows():
        q = row['query']
        # retrieve top_k_retriever
        q_emb = embedding_model.encode([q], convert_to_tensor=False)
        q_np = np.array(q_emb).astype('float32')
        faiss.normalize_L2(q_np)
        D, I = index.search(q_np.reshape(1, -1), top_k_retriever)
        candidate_idxs = I.flatten().tolist()
        reranked_idxs, _ = rerank_candidates(q, candidate_idxs, top_m=rerank_top_m)
        predicted_ids = product_df.iloc[reranked_idxs]['product_id'].tolist()
        score = map_at_k(row['relevant_ids'], predicted_ids, k=10)
        scores.append(score)
    return np.mean(scores)


In [None]:
print("MAP@10 before (semantic):", query_df['map10_semantic'].mean())
new_map = evaluate_rerank_on_queries(query_df, top_k_retriever=50, rerank_top_m=10)
print("MAP@10 after cross-encoder rerank:", new_map)

print(rag_answer("armchair for small living room, pet friendly", top_k_retriever=30, rerank_top_m=5))

MAP@10 before (semantic): 0.3353198072824809
MAP@10 after cross-encoder rerank: 0.44113371040931804
{'best_product_id': '40986', 'reasons': ['pet-friendly', 'soft and comfortable'], 'top_candidates': ['40986']}


### Conclusions

Using the `cross-encoder` gives better results than those obtained earlier with the `semantic` model alone: MAP@10 rises from 0.3353 to 0.4411.

## Metric Evaluation

In [55]:
import random

def evaluate_pipeline(query_df, pipeline='semantic', top_k_retriever=50, top_m=10, exact_w=1.0, partial_w=0.5):
    """Lightweight version: returns a summary (MAP, Weighted MAP, NDCG, P@10, R@10) plus per-query arrays."""
    map_list = []
    wmap_list = []
    ndcg_list = []
    prec_list = []
    recall_list = []

    for _, row in query_df.iterrows():
        q = row['query']
        q_emb = embedding_model.encode([q], convert_to_tensor=False)
        q_np = np.array(q_emb).astype('float32')
        faiss.normalize_L2(q_np)
        D, I = index.search(q_np.reshape(1, -1), top_k_retriever)
        candidate_idxs = I.flatten().tolist()

        if pipeline == 'semantic':
            predicted_idxs = candidate_idxs[:top_m]
        elif pipeline == 'rerank':
            predicted_idxs, _ = rerank_candidates(q, candidate_idxs, top_m=top_m)
        else:
            raise ValueError("pipeline must be 'semantic' or 'rerank'")

        predicted_ids = product_df.iloc[predicted_idxs]['product_id'].tolist()

        exact_set = set(get_exact_matches_for_query(row['query_id']))
        partial_set = set(get_partial_matches_for_query(row['query_id']))
        relevance_dict = {pid: exact_w for pid in exact_set}
        for pid in partial_set:
            if pid not in relevance_dict:
                relevance_dict[pid] = partial_w

        map_list.append(map_at_k(list(exact_set), predicted_ids, k=10))
        wmap_list.append(weighted_map_at_k(relevance_dict, predicted_ids, k=10))
        rel_list = relevance_list_for_predicted_ids(relevance_dict, predicted_ids, default=0.0)
        ndcg_list.append(ndcg_at_k_from_relevance_list(rel_list, k=10))
        prec_list.append(precision_at_k(predicted_ids, exact_set, k=10))
        recall_list.append(recall_at_k(predicted_ids, exact_set, k=10))

    summary = {
        "MAP@10 (exact only)": np.mean(map_list),
        "Weighted MAP@10": np.mean(wmap_list),
        "NDCG@10 (graded)": np.mean(ndcg_list),
        "Precision@10": np.mean(prec_list),
        "Recall@10": np.mean(recall_list),
    }
    return summary, np.array(wmap_list), np.array(ndcg_list), np.array(map_list)


In [None]:
# --------------------
# Paired bootstrap difference (CI) for per-query arrays
# --------------------
def bootstrap_paired_diff(arrA, arrB, n_boot=2000, seed=42):
    rng = random.Random(seed)
    n = len(arrA)
    diffs = []
    for _ in range(n_boot):
        idxs = [rng.randrange(n) for _ in range(n)]
        a = np.mean([arrA[i] for i in idxs])
        b = np.mean([arrB[i] for i in idxs])
        diffs.append(a - b)
    diffs = sorted(diffs)
    low = diffs[int(0.025 * n_boot)]
    high = diffs[int(0.975 * n_boot)]
    mean_diff = np.mean(diffs)
    return {"mean_diff": mean_diff, "95%_CI": (low, high)}

In [None]:
# --------------------
# Run the evaluation: baseline (semantic) and rerank
# --------------------

TOP_K_RETRIEVER = 50
TOP_M = 10
EXACT_W = 1.0
PARTIAL_W = 0.5   

print("Evaluating SEMANTIC pipeline (FAISS top-10)...")
sem_res_avg, sem_wmap_arr, sem_ndcg_arr, sem_mapbin_arr = evaluate_pipeline(
    query_df,
    pipeline='semantic',
    top_k_retriever=TOP_K_RETRIEVER,
    top_m=TOP_M,
    exact_w=EXACT_W,
    partial_w=PARTIAL_W
)
print("SEMANTIC results (averages):")
for k, v in sem_res_avg.items():
    print(f"  {k:25s}: {v:.4f}")

print("\nEvaluating RERANK pipeline (FAISS -> CrossEncoder -> top-10)...")
rer_res_avg, rer_wmap_arr, rer_ndcg_arr, rer_mapbin_arr = evaluate_pipeline(
    query_df,
    pipeline='rerank',
    top_k_retriever=TOP_K_RETRIEVER,
    top_m=TOP_M,
    exact_w=EXACT_W,
    partial_w=PARTIAL_W
)
print("RERANK results (averages):")
for k, v in rer_res_avg.items():
    print(f"  {k:25s}: {v:.4f}")

# --------------------
# Paired bootstrap test (Weighted MAP difference)
# --------------------
bs = bootstrap_paired_diff(rer_wmap_arr, sem_wmap_arr, n_boot=2000, seed=123)
print("\nPaired bootstrap (RERANK - SEMANTIC) on Weighted MAP@10:")
print(f"  mean difference: {bs['mean_diff']:.6f}")
print(f"  95% CI         : ({bs['95%_CI'][0]:.6f}, {bs['95%_CI'][1]:.6f})")

# --------------------
# Simple grid search over partial_w
# --------------------
grid = [0.2, 0.4, 0.5, 0.7, 0.9]
grid_results = []
print("\nGrid search partial_w (Weighted MAP@10), trying values:", grid)
for pw in grid:
    _, sem_wmap_pw, _, _ = evaluate_pipeline(query_df, pipeline='semantic', top_k_retriever=TOP_K_RETRIEVER, top_m=TOP_M, exact_w=EXACT_W, partial_w=pw)
    _, rer_wmap_pw, _, _ = evaluate_pipeline(query_df, pipeline='rerank', top_k_retriever=TOP_K_RETRIEVER, top_m=TOP_M, exact_w=EXACT_W, partial_w=pw)
    sem_mean = np.mean(sem_wmap_pw)
    rer_mean = np.mean(rer_wmap_pw)
    grid_results.append({"partial_w": pw, "sem_weighted_map": sem_mean, "rer_weighted_map": rer_mean})
    print(f"  partial_w={pw:.2f} -> SEM WMAP={sem_mean:.4f}  |  RER WMAP={rer_mean:.4f}")

grid_df = pd.DataFrame(grid_results).sort_values('rer_weighted_map', ascending=False)
print("\nGrid results (sorted by rerank Weighted MAP):")
print(grid_df.to_string(index=False))

Evaluating SEMANTIC pipeline (FAISS top-10)...
SEMANTIC results (averages):
  MAP@10 (exact only)      : 0.3353
  Weighted MAP@10          : 0.5342
  NDCG@10 (graded)         : 0.8713
  Precision@10             : 0.3223
  Recall@10                : 0.1729

Evaluating RERANK pipeline (FAISS -> CrossEncoder -> top-10)...
RERANK results (averages):
  MAP@10 (exact only)      : 0.4411
  Weighted MAP@10          : 0.5971
  NDCG@10 (graded)         : 0.8942
  Precision@10             : 0.3769
  Recall@10                : 0.2213

Paired bootstrap (RERANK - SEMANTIC) on Weighted MAP@10:
  mean difference: 0.062882
  95% CI         : (0.048660, 0.077632)

Grid search partial_w (Weighted MAP@10), trying values: [0.2, 0.4, 0.5, 0.7, 0.9]
  partial_w=0.20 -> SEM WMAP=0.4134  |  RER WMAP=0.4887
  partial_w=0.40 -> SEM WMAP=0.4933  |  RER WMAP=0.5603
  partial_w=0.50 -> SEM WMAP=0.5342  |  RER WMAP=0.5971
  partial_w=0.70 -> SEM WMAP=0.6161  |  RER WMAP=0.6709
  partial_w=0.90 -> SEM WMAP

## Final results and conclusion

- The *FAISS + Cross-Encoder (rerank)* pipeline consistently improves on the semantic baseline:
  - MAP@10 (Exact) increased from **0.3353 → 0.4411**.
  - Weighted MAP@10 increased from **0.5342 → 0.5971**.
  - NDCG@10 rose slightly (0.8713 → 0.8942).
- The paired bootstrap on the per-query difference in Weighted MAP@10 (2000 resamples) shows a mean difference of **0.0629** with 95% CI = **(0.0487, 0.0776)**. Since the interval excludes zero, the improvement is statistically significant.
- The grid search over `partial_w` shows how the weight given to `Partial` matches affects the metric. We recommend `partial_w = 0.5` as the default, but the final choice depends on product priorities (precision vs. coverage).
- **Practical recommendation:** keep the FAISS → Cross-Encoder pipeline in production as the search baseline; add a RAG/LLM layer (local or API) only to explain or justify the selection when needed (e.g. featured product pages).