# 2_preparacion_productos

Goal: start from the data already filtered in `data/step_1/` and generate a clean file with a `text_for_embedding` field for use in the web app.

Expected inputs:
- `data/step_1/meta_*_sample.jsonl` (created in step 1).

Outputs:
- `data/step_2/products_clean_<category>.jsonl` with the fields `id`, `title`, `description`, `category_path`, `image_url`, `text_for_embedding` (Parquet output is disabled to avoid an extra dependency).
- A copy at `data/step_2/products_clean.jsonl`, ready for step 3 and for loading from Next.js.
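
The output contract can be checked mechanically once the files exist. A minimal sketch, assuming the field names that the normalization cell of this notebook actually selects (`category_path` and `image_url`); `EXPECTED_FIELDS` and `validate_record` are hypothetical names introduced for illustration, not part of the notebook.

```python
import json

# Field names taken from the column selection in the normalization cell;
# the constant and function names are illustrative only.
EXPECTED_FIELDS = (
    "id", "title", "description",
    "category_path", "image_url", "text_for_embedding",
)

def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    rec = json.loads(line)
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in rec]
    # An empty embedding text would be useless downstream, so flag it as well.
    if not str(rec.get("text_for_embedding", "")).strip():
        problems.append("empty text_for_embedding")
    return problems
```

Running this over each line of the exported file before handing it to the web app would surface schema drift early.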


In [1]:
from pathlib import Path
import json
import pandas as pd

PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
# Working paths inside notebooks/ (not committed to git)
STEP1_DIR = PROJECT_ROOT / "notebooks" / "data" / "step_1"
STEP2_DIR = PROJECT_ROOT / "notebooks" / "data" / "step_2"
# Clean copy for step 3 (stays in notebooks/data/step_2)
FINAL_JSONL = STEP2_DIR / "products_clean.jsonl"
STEP2_DIR.mkdir(parents=True, exist_ok=True)

# Pick the first meta_*.jsonl available in notebooks/data/step_1
candidates = sorted(STEP1_DIR.glob("meta_*.jsonl"))
if not candidates:
    available = [p.name for p in sorted(STEP1_DIR.glob('*'))]
    raise FileNotFoundError(f"No meta_*.jsonl in {STEP1_DIR}. Run 1_preparacion_datos.ipynb. Files found: {available}")

RAW_PATH = candidates[0]
CATEGORY = RAW_PATH.stem.replace("meta_", "")
print(f"Using {RAW_PATH.name} (category: {CATEGORY})")


Using meta_Cell_Phones_and_Accessories_sample.jsonl (category: Cell_Phones_and_Accessories_sample)
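
The printed category keeps the `_sample` suffix because only the `meta_` prefix is removed, and `str.replace` would also remove any later occurrence of `meta_` in the name. If a bare category label is preferred, something like the following could be used. This is a sketch under the assumption that input files are named `meta_<category>[_sample].jsonl`; `category_from_path` is a hypothetical helper, and adopting it would change the name of the intermediate output file.

```python
from pathlib import Path

def category_from_path(path: Path) -> str:
    """Derive a category label from a file named meta_<category>[_sample].jsonl."""
    name = path.stem
    # Strip only a leading "meta_" rather than every occurrence.
    if name.startswith("meta_"):
        name = name[len("meta_"):]
    # Strip only a trailing "_sample".
    if name.endswith("_sample"):
        name = name[: -len("_sample")]
    return name
```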


In [2]:
# Load the JSONL file into a DataFrame
records = []
with open(RAW_PATH, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        records.append(rec)

df = pd.DataFrame.from_records(records)
print(df.head())
print("Rows:", len(df))


                                               title  \
0  Pink &amp; White 3d Melt Ice-cream Skin Hard C...   
1  Purple Hard Case Cover for Iphone 4 4s 4g with...   
2  Hello Kitty Light-weighted Chrome Case Black C...   
3  Cool Summer Breeze in the Ocean Beach Collecti...   
4  Cool Summer Breeze In The Ocean Beach Collecti...   

                                         description  \
0  Pink & White 3D Melt Ice-Cream Skin Hard Case ...   
1  Purple Hard Case Cover for iPhone 4 4S 4G With...   
2  Thin and light weighted, Case's unique design ...   
3  Product Name: Cool Summer Breeze In The Ocean ...   
4  Product Name: Cool Summer Breeze In The Ocean ...   

                                               image  \
0  http://ecx.images-amazon.com/images/I/31zn6SOL...   
1  http://ecx.images-amazon.com/images/I/41WCZc2d...   
2  http://ecx.images-amazon.com/images/I/41fy1%2B...   
3  http://ecx.images-amazon.com/images/I/415cmp6Q...   
4  http://ecx.images-amazon.com/images/I/41XDw
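
The preview shows the title of row 0 containing the HTML entity `&amp;`, while its description has a plain `&`. If such entities are not wanted in `text_for_embedding`, they could be decoded before the text is built. A minimal sketch using the standard library's `html.unescape`; `clean_text` is a hypothetical helper, not part of the notebook, and the whitespace collapsing is an extra assumption.

```python
import html

def clean_text(value) -> str:
    """Decode HTML entities and collapse whitespace; non-strings become ''."""
    if not isinstance(value, str):
        return ""
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(html.unescape(value).split())
```

It could be applied per column, for example `df["title"].map(clean_text)`, before the embedding text is assembled.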

In [3]:
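# Normalize the minimal set of fields
def col_or_default(name: str, default=""):
    """Return column `name` with NaN filled, or a default-valued Series if absent."""
    if name in df.columns:
        return df[name].fillna(default)
    return pd.Series([default] * len(df), index=df.index)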

# id: if the column is missing or entirely empty, generate a sequence
df["id"] = col_or_default("id", "")
if (df["id"] == "").all():
    df["id"] = pd.Series(range(len(df))).astype(str)
df["id"] = df["id"].astype(str)

df["title"] = col_or_default("title", "")
df["description"] = col_or_default("description", "")

# categories may be a list; convert it to an "a > b > c" string
if "category_path" in df.columns:
    cat_series = df["category_path"]
elif "categories" in df.columns:
    cat_series = df["categories"]
else:
    cat_series = pd.Series([""] * len(df))

def normalize_cat(x):
    if isinstance(x, list):
        try:
            return " > ".join([str(i) for i in x])
        except Exception:
            return ""
    # Anything that is not a string (None, NaN) becomes an empty string
    return x if isinstance(x, str) else ""

df["category_path"] = cat_series.apply(normalize_cat)

img_series = df["image_url"] if "image_url" in df.columns else (df["image"] if "image" in df.columns else pd.Series([""] * len(df)))
df["image_url"] = img_series.fillna("").apply(lambda x: x.replace("http://", "https://") if isinstance(x, str) else x)

def build_text(row):
    parts = [row.get("title", ""), row.get("description", "")]
    cat = row.get("category_path", "")
    if cat:
        parts.append(f"Categories: {cat}")
    return ". ".join(p for p in parts if p).strip()

df["text_for_embedding"] = df.apply(build_text, axis=1)
df = df[["id", "title", "description", "category_path", "image_url", "text_for_embedding"]]
df.head()


| | id | title | description | category_path | image_url | text_for_embedding |
|---|---|---|---|---|---|---|
| 0 | 0 | Pink &amp; White 3d Melt Ice-cream Skin Hard C... | Pink & White 3D Melt Ice-Cream Skin Hard Case ... | ['Cell Phones & Accessories', 'Cases', 'Basic ... | https://ecx.images-amazon.com/images/I/31zn6SO... | Pink &amp; White 3d Melt Ice-cream Skin Hard C... |
| 1 | 1 | Purple Hard Case Cover for Iphone 4 4s 4g with... | Purple Hard Case Cover for iPhone 4 4S 4G With... | ['Cell Phones & Accessories', 'Cases', 'Basic ... | https://ecx.images-amazon.com/images/I/41WCZc2... | Purple Hard Case Cover for Iphone 4 4s 4g with... |
| 2 | 2 | Hello Kitty Light-weighted Chrome Case Black C... | Thin and light weighted, Case's unique design ... | ['Cell Phones & Accessories', 'Cases', 'Basic ... | https://ecx.images-amazon.com/images/I/41fy1%2... | Hello Kitty Light-weighted Chrome Case Black C... |
| 3 | 3 | Cool Summer Breeze in the Ocean Beach Collecti... | Product Name: Cool Summer Breeze In The Ocean ... | ['Cell Phones & Accessories', 'Cases', 'Basic ... | https://ecx.images-amazon.com/images/I/415cmp6... | Cool Summer Breeze in the Ocean Beach Collecti... |
| 4 | 4 | Cool Summer Breeze In The Ocean Beach Collecti... | Product Name: Cool Summer Breeze In The Ocean ... | ['Cell Phones & Accessories', 'Cases', 'Basic ... | https://ecx.images-amazon.com/images/I/41XDwPt... | Cool Summer Breeze In The Ocean Beach Collecti... |
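
In this preview, `category_path` still looks like a Python list literal rather than an `a > b > c` string. One plausible cause (an assumption, not verified against the data) is that each value is a list of lists, so joining the outer list stringifies the inner one. A sketch of a flattening variant follows; `flatten_categories` is a hypothetical helper, not part of the notebook.

```python
def flatten_categories(value, sep: str = " > ") -> str:
    """Join a flat or one-level-nested list of category names into one string."""
    if isinstance(value, str):
        return value
    if not isinstance(value, (list, tuple)):
        return ""  # covers None and NaN
    names = []
    for item in value:
        if isinstance(item, (list, tuple)):
            # Nested path: take its elements instead of its repr.
            names.extend(str(sub) for sub in item)
        else:
            names.append(str(item))
    return sep.join(names)
```

Note that if a product has several inner paths, this sketch concatenates them all into one string; keeping only the first path would be an equally reasonable choice.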


In [4]:
# Save as JSONL (Parquet disabled to avoid extra dependencies)
jsonl_path = STEP2_DIR / f"products_clean_{CATEGORY}.jsonl"

with open(jsonl_path, "w", encoding="utf-8") as f:
    for rec in df.to_dict(orient="records"):
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Final copy for step 3 and Next.js (also under notebooks/data/step_2)
with open(FINAL_JSONL, "w", encoding="utf-8") as f:
    for rec in df.to_dict(orient="records"):
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print("Saved intermediate JSONL:", jsonl_path)
print("Final JSONL copy:", FINAL_JSONL)


Saved intermediate JSONL: /Users/marc/Documents/Projectes/tfm-product-matching/notebooks/data/step_2/products_clean_Cell_Phones_and_Accessories_sample.jsonl
Final JSONL copy: /Users/marc/Documents/Projectes/tfm-product-matching/notebooks/data/step_2/products_clean.jsonl
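
After writing, it can be worth reading the file back to confirm that every line parses and that the record count matches the DataFrame. A minimal standard-library sketch; `count_valid_records` is a hypothetical helper, not part of the notebook.

```python
import json
from pathlib import Path

def count_valid_records(path: Path) -> int:
    """Count non-empty lines in a JSONL file, raising if any line fails to parse."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, as the loading cell does
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path.name}: invalid JSON on line {line_no}") from exc
            count += 1
    return count
```

In the notebook, `count_valid_records(FINAL_JSONL)` would be expected to equal `len(df)`.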
