# WANDS – Procedure and analysis of the solution
**Author:** Julio Briceño  •  **Date:** 2025-10-10

This notebook implements only the search **logic** and its **evaluation**:
- Simple TF‑IDF baseline.
- Multi‑field lexical index (TF‑IDF per *name/desc/brand/cat*, fused with weights).
- Optional dense index (embeddings) and a **hybrid** index (RRF).
- Metrics: **MAP@10**, **Soft‑MAP@10** (counts *partial* matches), **nDCG@10**.
- At the end, the **3 values are printed** for the chosen mode.

## 0) How did I approach the solution?
- **Baseline pain point**: it indexes `name+desc` as a single field and scores *partial* matches as 0 ⇒ low MAP that does not reflect usefulness.
- **Idea 1 (fast and robust)**: split the fields `name/desc/brand/cat`, normalize scores per field (min‑max), and **fuse with weights** (more weight on `name`).
- **Idea 2 (literal/semantic balance)**: add lightweight **embeddings** (MiniLM) and combine them with TF‑IDF using **RRF** (Reciprocal Rank Fusion), which avoids calibrating score scales.
- **Metrics**: besides exact MAP, add **Soft‑MAP** (exact=1, partial=0.5) and **nDCG** (hits near the top count more). This aligns the metric with real business “usefulness”.
- **Trade‑offs**: TF‑IDF is fast and transparent; embeddings help with vague queries but increase latency and memory. RRF is simple and effective; it uses ranks only, not score magnitudes.

## 1) Setup and dataset
> If you will not use embeddings, you can skip installing `sentence-transformers`.

In [4]:
# !pip install -U pandas numpy scikit-learn sentence-transformers faiss-cpu


In [5]:
import os, math, logging
from dataclasses import dataclass
from typing import Tuple, Optional, Dict, List

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("wands-logic")
DATA_DIR = os.getenv("DATA_DIR", "WANDS/dataset")
K = 10  


**Dataset download/load**:

In [6]:
# !git clone https://github.com/wayfair/WANDS.git

product_df = pd.read_csv(f"{DATA_DIR}/product.csv", sep="\t")
query_df   = pd.read_csv(f"{DATA_DIR}/query.csv", sep="\t")
label_df   = pd.read_csv(f"{DATA_DIR}/label.csv",  sep="\t")

len(product_df), len(query_df), len(label_df)


(42994, 480, 233448)

## 2) Baseline TF‑IDF 


In [7]:
def calculate_tfidf(dataframe):
    combined_text = dataframe['product_name'] + ' ' + dataframe['product_description']
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(combined_text.values.astype('U'))
    return vectorizer, tfidf_matrix

def get_top_products(vectorizer, tfidf_matrix, query, top_n=10):
    query_vector = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    top_product_indices = cosine_similarities.argsort()[-top_n:][::-1]
    return top_product_indices

def map_at_k(true_ids, predicted_ids, k=10):
    if not len(true_ids) or not len(predicted_ids):
        return 0.0
    score = 0.0
    num_hits = 0.0
    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / min(len(true_ids), k)

grouped_label_df = label_df.groupby('query_id')
def get_exact_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

vectorizer, tfidf_matrix = calculate_tfidf(product_df)
def get_top_product_ids_for_query(query):
    top_product_indices = get_top_products(vectorizer, tfidf_matrix, query, top_n=10)
    top_product_ids = product_df.iloc[top_product_indices]['product_id'].tolist()
    return top_product_ids

query_df['top_product_ids'] = query_df['query'].apply(get_top_product_ids_for_query)
query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query)
query_df['map@k'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)
query_df.loc[:, 'map@k'].mean()


0.29319550540123457
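
To make the baseline metric concrete, here is a small worked example of average precision at k. It is a minimal sketch that follows the same logic as `map_at_k` above; the function name `ap_at_k` and the product IDs are made up for illustration and are not taken from WANDS. A product labeled *Partial* is simply absent from `true_ids`, so it contributes nothing to the score.

```python
def ap_at_k(true_ids, predicted_ids, k=10):
    """Average precision at k with binary relevance (same logic as map_at_k)."""
    if not len(true_ids) or not len(predicted_ids):
        return 0.0
    score, hits = 0.0, 0
    seen = set()  # ignore duplicate predictions, as map_at_k does
    for i, pid in enumerate(predicted_ids[:k], start=1):
        if pid in true_ids and pid not in seen:
            hits += 1
            score += hits / i  # precision at rank i
        seen.add(pid)
    return score / min(len(true_ids), k)

# Hypothetical IDs: three exact matches exist, the ranker returns four results.
true_ids = [101, 102, 103]
predicted = [101, 900, 102, 901]

# Hits at rank 1 (precision 1/1) and rank 3 (precision 2/3):
# AP = (1 + 2/3) / min(3, 10)
ap = ap_at_k(true_ids, predicted)
print(round(ap, 4))  # 0.5556
```

The denominator is `min(len(true_ids), k)`, so the missing third exact match lowers the score even though it never appears in the predictions.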

## 3) Refactor: configuration and utilities

In [8]:
@dataclass
class VectorizerParams:
    lowercase: bool = True
    strip_accents: str = "unicode"
    stop_words: Optional[str] = None
    ngram_range: Tuple[int,int] = (1,2)
    min_df: int = 1
    max_df: float = 0.95
    sublinear_tf: bool = True

@dataclass
class FieldWeights:
    name: float = 0.65
    desc: float = 0.35
    brand: float = 0.10
    cat: float   = 0.15

@dataclass
class LabelGains:
    soft: Optional[Dict[str, float]] = None
    ndcg: Optional[Dict[str, float]] = None
    def __post_init__(self):
        if self.soft is None:
            self.soft = {"exact":1.0, "partial":0.5, "irrelevant":0.0}
        if self.ndcg is None:
            self.ndcg = {"exact":2.0, "partial":1.0, "irrelevant":0.0}

def safe(s: pd.Series) -> pd.Series:
    return s.fillna("").astype(str)

def clean_category(s: pd.Series) -> pd.Series:
    return safe(s).str.replace(r"[>/|]", " ", regex=True)

def guess_col(df: pd.DataFrame, candidates: List[str]):
    for c in candidates:
        if c in df.columns:
            return c
    return None


## 4) Multi‑field lexical index
- `name` uses n‑grams (1,2); `desc` uses unigrams.
- Per‑field normalization (min‑max) and weighted fusion.

In [9]:
class MultiFieldIndex:
    def __init__(self, product_df: pd.DataFrame,
                 vec_name:  VectorizerParams = VectorizerParams(ngram_range=(1,2)),
                 vec_desc:  VectorizerParams = VectorizerParams(ngram_range=(1,1)),
                 vec_brand: VectorizerParams = VectorizerParams(ngram_range=(1,1)),
                 vec_cat:   VectorizerParams = VectorizerParams(ngram_range=(1,2)),
                 weights:   FieldWeights     = FieldWeights()):
        self.df = product_df.reset_index(drop=True)
        self.vec_name_params  = vec_name
        self.vec_desc_params  = vec_desc
        self.vec_brand_params = vec_brand
        self.vec_cat_params   = vec_cat
        self.weights          = weights
        self.brand_col = guess_col(self.df, ["brand","brand_name","manufacturer","maker","vendor"])
        self.cat_col   = guess_col(self.df, ["category","categories","category_name","category_path","taxonomy","class","class_name","product_type"])
        self.vec_name = self.vec_desc = self.vec_brand = self.vec_cat = None
        self.X_name = self.X_desc = self.X_brand = self.X_cat = None

    @staticmethod
    def _mk_vec(p: VectorizerParams) -> TfidfVectorizer:
        return TfidfVectorizer(lowercase=p.lowercase, strip_accents=p.strip_accents, stop_words=p.stop_words,
                               ngram_range=p.ngram_range, min_df=p.min_df, max_df=p.max_df, sublinear_tf=p.sublinear_tf)
    @staticmethod
    def _mm(x: np.ndarray) -> np.ndarray:
        return MinMaxScaler().fit_transform(x.reshape(-1,1)).ravel()

    def fit(self):
        name = safe(self.df.get("product_name","")); desc = safe(self.df.get("product_description",""))
        self.vec_name = self._mk_vec(self.vec_name_params); self.vec_desc = self._mk_vec(self.vec_desc_params)
        self.X_name = self.vec_name.fit_transform(name.astype("U"))
        self.X_desc = self.vec_desc.fit_transform(desc.astype("U"))
        if self.brand_col is not None and safe(self.df[self.brand_col]).str.strip().str.len().gt(0).any():
            self.vec_brand = self._mk_vec(self.vec_brand_params)
            self.X_brand = self.vec_brand.fit_transform(safe(self.df[self.brand_col]).astype("U"))
        if self.cat_col is not None and safe(self.df[self.cat_col]).str.strip().str.len().gt(0).any():
            self.vec_cat = self._mk_vec(self.vec_cat_params)
            self.X_cat = self.vec_cat.fit_transform(clean_category(self.df[self.cat_col]).astype("U"))
        return self

    def search(self, query: str, k:int=10):
        s_name = cosine_similarity(self.vec_name.transform([query]), self.X_name).ravel()
        s_desc = cosine_similarity(self.vec_desc.transform([query]), self.X_desc).ravel()
        fused = self.weights.name*self._mm(s_name) + self.weights.desc*self._mm(s_desc)
        if self.vec_brand is not None:
            s_brand = cosine_similarity(self.vec_brand.transform([query]), self.X_brand).ravel()
            fused += self.weights.brand*self._mm(s_brand)
        if self.vec_cat is not None:
            s_cat = cosine_similarity(self.vec_cat.transform([query]), self.X_cat).ravel()
            fused += self.weights.cat*self._mm(s_cat)
        idx = np.argsort(fused)[-k:][::-1]
        ids = self.df.iloc[idx]["product_id"].tolist()
        scores = fused[idx].tolist()
        return ids, scores

lex = MultiFieldIndex(product_df).fit()
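
The normalization and fusion step inside `search` can be illustrated on toy numbers. The following is a dependency-free sketch, not the class above: `min_max` and `fuse` are hypothetical helper names, the similarity values are invented, and only the `name`/`desc` weights are taken from the `FieldWeights` defaults.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; in this sketch a constant list maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(field_scores, weights):
    """Weighted sum of per-field min-max normalized scores."""
    n = len(next(iter(field_scores.values())))
    fused = [0.0] * n
    for field, scores in field_scores.items():
        for i, s in enumerate(min_max(scores)):
            fused[i] += weights[field] * s
    return fused

# Toy cosine similarities for three products (illustrative values only).
field_scores = {
    "name": [0.80, 0.20, 0.00],
    "desc": [0.05, 0.10, 0.00],
}
weights = {"name": 0.65, "desc": 0.35}  # same values as the FieldWeights defaults

fused = fuse(field_scores, weights)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(fused)    # approximately [0.825, 0.5125, 0.0]
print(ranking)  # [0, 1, 2]
```

Note that the best raw `desc` score of 0.10 becomes 1.0 after normalization, so per-query min‑max can amplify a field whose absolute similarities are all weak. The higher weight on `name` is what keeps product 0 on top here.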


## 5) Dense index and RRF fusion

In [None]:
dense_available = False
try:
    import torch
    from sentence_transformers import SentenceTransformer
    try:
        import faiss; _FAISS=True
    except Exception:
        _FAISS=False
    class DenseIndex:
        def __init__(self, product_df, model_name="sentence-transformers/all-MiniLM-L6-v2", normalize=True):
            self.df = product_df.reset_index(drop=True)
            self.model = SentenceTransformer(model_name, device="cuda" if torch.cuda.is_available() else "cpu")
            self.normalize = normalize
            self.emb=None; self.index=None
        def fit(self, batch_size=256, cache_path="embeddings.npy"):
            texts = (self.df['product_name'].fillna('') + ' ' + self.df['product_description'].fillna('')).tolist()
            self.emb = self.model.encode(texts, batch_size=batch_size, show_progress_bar=True,
                                         normalize_embeddings=self.normalize).astype("float32")
            np.save(cache_path, self.emb)
            if _FAISS:
                d = self.emb.shape[1]; self.index = faiss.IndexFlatIP(d); self.index.add(self.emb)
            else:
                self.index = NearestNeighbors(n_neighbors=50, metric="cosine").fit(self.emb)
            return self
        def load_cached(self, cache_path="embeddings.npy"):
            self.emb = np.load(cache_path)
            if _FAISS:
                d = self.emb.shape[1]; self.index = faiss.IndexFlatIP(d); self.index.add(self.emb)
            else:
                self.index = NearestNeighbors(n_neighbors=50, metric="cosine").fit(self.emb)
            return self
        def search(self, query: str, k:int=10):
            q = self.model.encode([query], normalize_embeddings=self.normalize).astype("float32")
            if _FAISS:
                sims, idx = self.index.search(q, k)
                ids = self.df.iloc[idx[0]]["product_id"].tolist(); scores = sims[0].tolist()
            else:
                dist, idx = self.index.kneighbors(q, n_neighbors=k, return_distance=True)
                ids = self.df.iloc[idx[0]]["product_id"].tolist(); scores = (1.0 - dist[0]).tolist()
            return ids, scores
    dense = DenseIndex(product_df).fit()
    dense_available = True
except Exception as e:
    log.warning("DenseIndex unavailable, falling back to lexical only (%s)", e)

from collections import defaultdict

def rrf_fuse(rankings: List[List[int]], k: int = 10, Kc: int = 60):
    scores = defaultdict(float)
    for r in rankings:
        for rank, pid in enumerate(r, start=1):
            scores[pid] += 1.0/(Kc+rank)
    items = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
    ids = [pid for pid,_ in items]; scs=[s for _,s in items]
    return ids, scs

class HybridIndex:
    def __init__(self, product_df: pd.DataFrame, lexical_index: Optional[MultiFieldIndex]=None,
                 k_lex=50, k_dense=50):
        self.df = product_df.reset_index(drop=True)
        self.lex = lexical_index or MultiFieldIndex(self.df).fit()
        self.k_lex = k_lex; self.k_dense = k_dense
        self.dense = dense if dense_available else None
    def search(self, query: str, k:int=10):
        lex_ids, _ = self.lex.search(query, k=self.k_lex)
        if self.dense is None:
            top = lex_ids[:k]
            # Rank-based pseudo-scores, one per returned id.
            return top, list(range(len(top), 0, -1))
        den_ids, _ = self.dense.search(query, k=self.k_dense)
        return rrf_fuse([lex_ids, den_ids], k=k, Kc=60)

hyb = HybridIndex(product_df, lexical_index=lex)


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
Batches: 100%|██████████| 168/168 [00:35<00:00,  4.71it/s]
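
A tiny worked example shows why RRF needs no score calibration: only the rank positions enter the formula. This is a self-contained sketch equivalent in spirit to `rrf_fuse`; the function name `rrf`, the constant name `kc`, and the product IDs are illustrative assumptions.

```python
from collections import defaultdict

def rrf(rankings, k=3, kc=60):
    """Reciprocal Rank Fusion: each list adds 1 / (kc + rank) to an item's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] += 1.0 / (kc + rank)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in ordered[:k]]

# Hypothetical rankings (best first) from a lexical and a dense retriever.
lexical = ["A", "B", "C"]
dense = ["B", "D", "A"]

# B: 1/62 + 1/61, A: 1/61 + 1/63, D: 1/62, C: 1/63
fused = rrf([lexical, dense])
print(fused)  # ['B', 'A', 'D']
```

Items found by both retrievers (`A`, `B`) outrank items found by only one, and the raw TF‑IDF or cosine scores never appear, which is why the two score scales do not need to be comparable.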


## 6) Metrics with *partial*: Soft‑MAP and nDCG

In [11]:
def graded_rel_for_query(label_rows: pd.DataFrame, gains: Dict[str,float]) -> Dict[int,float]:
    rel = {}
    for _, r in label_rows.iterrows():
        pid = int(r["product_id"]); lab = str(r["label"]).strip().lower()
        rel[pid] = max(rel.get(pid, 0.0), gains.get(lab, 0.0))
    return rel

def soft_ap_at_k(graded_rel: Dict[int,float], predicted_ids: List[int], k:int=10)->float:
    if not predicted_ids: return 0.0
    ideal = sorted(graded_rel.values(), reverse=True)[:k]
    denom = sum(ideal)
    if denom == 0: return 0.0
    cum = 0.0; score = 0.0
    for i, pid in enumerate(predicted_ids[:k], start=1):
        g = graded_rel.get(pid, 0.0)
        cum += g; score += (cum/i)*g
    return score/denom

def dcg_at_k(graded_rel: Dict[int,float], predicted_ids: List[int], k:int=10)->float:
    dcg = 0.0
    for i, pid in enumerate(predicted_ids[:k], start=1):
        g = graded_rel.get(pid, 0.0)
        dcg += g / math.log2(i+1)
    return dcg

def ndcg_at_k(graded_rel: Dict[int,float], predicted_ids: List[int], k:int=10)->float:
    dcg = dcg_at_k(graded_rel, predicted_ids, k)
    ideal = sorted(graded_rel.values(), reverse=True)[:k]
    idcg = sum(g/math.log2(i+1) for i,g in enumerate(ideal, start=1))
    return 0.0 if idcg==0 else dcg/idcg
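
To see how the graded metrics react to ordering, the sketch below re-implements the two formulas compactly and scores a good and a bad ranking for one made-up query. The names `soft_ap` and `ndcg` and the product IDs are assumptions for illustration; the gain values match the defaults used in this notebook.

```python
import math

def soft_ap(graded, predicted, k=10):
    """Graded AP: (cumulative gain / rank), weighted by each item's own gain."""
    ideal = sorted(graded.values(), reverse=True)[:k]
    denom = sum(ideal)
    if denom == 0:
        return 0.0
    cum = score = 0.0
    for i, pid in enumerate(predicted[:k], start=1):
        g = graded.get(pid, 0.0)
        cum += g
        score += (cum / i) * g
    return score / denom

def ndcg(graded, predicted, k=10):
    """nDCG with a log2 position discount."""
    dcg = sum(graded.get(pid, 0.0) / math.log2(i + 1)
              for i, pid in enumerate(predicted[:k], start=1))
    ideal = sorted(graded.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return 0.0 if idcg == 0 else dcg / idcg

# Hypothetical labels for one query: product 1 is exact, product 2 is partial.
soft_gains = {1: 1.0, 2: 0.5}
ndcg_gains = {1: 2.0, 2: 1.0}

good = [1, 2, 3]  # exact first, then partial, then an unlabeled product
bad = [3, 2, 1]   # unlabeled product first, exact match last

print(round(soft_ap(soft_gains, good), 4))  # 1.375 / 1.5 -> 0.9167
print(round(soft_ap(soft_gains, bad), 4))   # 0.625 / 1.5 -> 0.4167
print(round(ndcg(ndcg_gains, good), 4))     # 1.0
print(round(ndcg(ndcg_gains, bad), 4))      # 0.6199
```

Both metrics penalize pushing the exact match down. One property worth knowing: with this Soft‑AP definition an ideal ordering does not necessarily reach 1.0 once partial gains are involved, whereas nDCG does.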


## 7) Unified evaluation and **print** of the 3 values
Choose `mode` from `{"lexical","dense","hybrid"}`. By default I use **hybrid** when embeddings are available.

In [None]:
def evaluate(index, k:int=10, gains_soft=None, gains_ndcg=None):
    gains_soft = gains_soft or {"exact":1.0,"partial":0.5,"irrelevant":0.0}
    gains_ndcg = gains_ndcg or {"exact":2.0,"partial":1.0,"irrelevant":0.0}
    grouped = label_df.groupby("query_id")
    df = query_df.copy()
    df["pred_ids"] = df["query"].apply(lambda q: index.search(q, k=k)[0])
    def exact(qid):
        g = grouped.get_group(qid)
        return g.loc[g["label"].str.lower().eq("exact"), "product_id"].astype(int).tolist()
    df["true_ids"] = df["query_id"].apply(exact)
    df["map@k"] = df.apply(lambda r: map_at_k(r["true_ids"], r["pred_ids"], k), axis=1)
    def grad(qid, gains):
        return graded_rel_for_query(grouped.get_group(qid), gains)
    # Graded metrics use the gains passed to evaluate().
    df["soft_map@k"] = df.apply(lambda r: soft_ap_at_k(grad(r["query_id"], gains_soft), r["pred_ids"], k), axis=1)
    df["ndcg@k"]      = df.apply(lambda r: ndcg_at_k(grad(r["query_id"], gains_ndcg), r["pred_ids"], k), axis=1)
    return float(df["map@k"].mean()), float(df["soft_map@k"].mean()), float(df["ndcg@k"].mean())


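mode = "hybrid" if dense_available else "lexical"
# "dense" is selectable only when the embedding index was built; unknown modes fall back to lexical.
indices = {"lexical": lex, "hybrid": hyb}
if dense_available:
    indices["dense"] = dense
index = indices.get(mode, lex)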

MAP, SOFT, NDCG = evaluate(index, k=K)
print(f"{mode.upper()} @ {K}:")
print("MAP@K     =", MAP)
print("Soft-MAP@K=", SOFT)
print("nDCG@K    =", NDCG)



HYBRID @ 10:
MAP@K     = 0.37472087880291005
Soft-MAP@K= 0.5168877232571683
nDCG@K    = 0.7285939538812994
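
For a quick sense of scale, the snippet below computes the gain of the hybrid index over the TF‑IDF baseline from the two MAP@10 values printed in this notebook. Only MAP@10 is comparable, since the baseline cell reports no Soft‑MAP or nDCG; the variable names are illustrative.

```python
# Values copied from the cell outputs in this notebook.
baseline_map = 0.29319550540123457  # TF-IDF baseline, MAP@10
hybrid_map = 0.37472087880291005    # hybrid index, MAP@10

absolute_gain = hybrid_map - baseline_map
relative_gain = absolute_gain / baseline_map

print(f"absolute: +{absolute_gain:.4f}")  # absolute: +0.0815
print(f"relative: +{relative_gain:.1%}")  # relative: +27.8%
```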


## 8) Final notes
- **Pros**: the weighted multi‑field index improves literal precision; the hybrid index raises semantic recall; graded metrics reflect usefulness better.
- **Cons**: embeddings consume more resources; RRF ignores score magnitude; without stemming/lemmatization, morphological variants may be missed.
- **Future improvements**: query understanding (typos/synonyms), stemming/lemmatization, faster ANN (IVF/HNSW), learning the weights/fusion, popularity-based caching.