# 1) Data preparation (Amazon SNAP: Cell Phones & Accessories)
This notebook:
- Reads the **raw file** `meta_Cell_Phones_and_Accessories.jsonl` produced in step 0.
- Extracts the useful fields: `title`, `description`, `category_path`.
- Writes `notebooks/data/step_1/meta_Cell_Phones_and_Accessories_sample.jsonl` (a clean subset) for the following steps.

Structure: step_0 (download) → step_1 (this notebook) → step_2 (clean products).


In [25]:
from pathlib import Path
import json, gzip, re
from collections import Counter

# Base paths
CWD = Path.cwd()
PROJECT_ROOT = CWD.parent if CWD.name == "notebooks" else CWD
RAW_DIR = PROJECT_ROOT / "notebooks" / "data" / "step_0"
PREP_DIR = PROJECT_ROOT / "notebooks" / "data" / "step_1"
RAW_DIR.mkdir(parents=True, exist_ok=True)
PREP_DIR.mkdir(parents=True, exist_ok=True)

# Select the downloaded category (run 0_descarga_conversion.ipynb first)
CATEGORY = "Sports_and_Outdoors"
RAW_PATH = RAW_DIR / f"meta_{CATEGORY}.jsonl"
FILE_SAMPLE = PREP_DIR / f"meta_{CATEGORY}_sample.jsonl"

# If it does not exist, fall back to the first available meta_*.jsonl
if not RAW_PATH.exists():
    candidates = sorted(RAW_DIR.glob("meta_*.jsonl"))
    if candidates:
        RAW_PATH = candidates[0]
        CATEGORY = RAW_PATH.stem.replace("meta_", "")
        FILE_SAMPLE = PREP_DIR / f"meta_{CATEGORY}_sample.jsonl"
        print(f"⚠ RAW_PATH not found, using {RAW_PATH.name}")
    else:
        available = [p.name for p in sorted(RAW_DIR.glob('*'))]
        raise FileNotFoundError(f"No meta_*.jsonl found in {RAW_DIR}. Run 0_descarga_conversion.ipynb. Files found: {available}")

print(RAW_PATH, FILE_SAMPLE)


⚠ RAW_PATH not found, using meta_Cell_Phones_and_Accessories.jsonl
/Users/marc/Documents/Projectes/tfm-product-matching/notebooks/data/step_0/meta_Cell_Phones_and_Accessories.jsonl /Users/marc/Documents/Projectes/tfm-product-matching/notebooks/data/step_1/meta_Cell_Phones_and_Accessories_sample.jsonl


## Reading utilities (json / jsonl / gz)
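
The sampling loop in this notebook calls `iter_raw_items`, whose definition does not appear in this export. The block below is a minimal sketch of what such a reader could look like, not the original implementation. It assumes that `.jsonl` files hold one JSON object per line, that a plain `.json` file holds a single JSON document (one object or an array of objects), and that a `.gz` suffix means gzip compression over either format. Skipping malformed lines is also an assumption.

```python
import gzip
import json
from pathlib import Path


def iter_raw_items(path):
    """Yield one dict per record from a .json, .jsonl, or gzip-compressed file.

    Hypothetical helper: the behavior described here is an assumption.
    """
    path = Path(path)
    is_gz = path.suffix == ".gz"
    # gzip.open and open share the same text-mode interface.
    opener = gzip.open if is_gz else open
    # For "name.jsonl.gz", look at the suffix underneath ".gz".
    inner_suffix = Path(path.stem).suffix if is_gz else path.suffix

    with opener(path, "rt", encoding="utf-8") as f:
        if inner_suffix == ".json":
            # Assumption: a .json file is one JSON document.
            data = json.load(f)
            yield from (data if isinstance(data, list) else [data])
            return
        # Otherwise treat the file as JSON Lines.
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # assumption: malformed lines are skipped silently
```

Because the function is a generator, the sampling loop can stop after `N` records without reading the rest of the file.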



## Field normalization
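
The sampling loop relies on `extract_record` returning a falsy value for items that should be dropped, but the function itself is not shown in this export. The following is a hedged sketch, not the author's implementation. The output keys mirror the ones visible in the quick preview of this notebook (`title`, `description`, `image`, `categories`). The whitespace cleanup, the `_clean` helper, and the rule that an item without a title is discarded are all assumptions.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def _clean(value):
    """Collapse whitespace runs; return '' for missing or non-string values.

    Hypothetical helper introduced for this sketch.
    """
    if not isinstance(value, str):
        return ""
    return _WHITESPACE.sub(" ", value).strip()


def extract_record(d):
    """Return a reduced record, or None when the item has no usable title.

    Assumption: a non-empty title is the only requirement for keeping an item.
    """
    title = _clean(d.get("title"))
    if not title:
        return None
    return {
        "title": title,
        "description": _clean(d.get("description")),
        "image": _clean(d.get("image")),
        # Passed through unchanged; no parsing is attempted here.
        "categories": d.get("categories", ""),
    }
```

Returning `None` keeps the caller's `if not rec: continue` check simple and avoids writing empty rows to the sample file.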

In [29]:
# Sample size (tune it to the local machine; keep it small when running on CPU without a GPU)
N = 2000

kept = 0
with open(FILE_SAMPLE, 'w', encoding='utf-8') as out:
    for d in iter_raw_items(RAW_PATH):
        rec = extract_record(d)
        if not rec:
            continue
        json.dump(rec, out, ensure_ascii=False)
        out.write("\n")
        kept += 1
        if kept >= N:
            break

print(f"✔ Sample created: {FILE_SAMPLE}  ({kept} rows)")


✔ Sample created: /Users/marc/Documents/Projectes/tfm-product-matching/notebooks/data/step_1/meta_Cell_Phones_and_Accessories_sample.jsonl  (2000 rows)


## Quick preview

In [30]:
from itertools import islice
print("Samples:")
with open(FILE_SAMPLE, 'r', encoding='utf-8') as f:
    for line in islice(f, 3):
        print(json.loads(line))


Samples:
{'title': 'Pink &amp; White 3d Melt Ice-cream Skin Hard Case Cover for Apple Iphone 4 4s Protect Cell', 'description': 'Pink & White 3D Melt Ice-Cream Skin Hard Case Cover For Apple iPhone 4 4S Protect Cell Description: Compatible with Apple iPhone 4 4G 4S 16/32/64 GB, AT&T;, Verizon, Sprint Protect your phone from scratches, dirt and bumps. Precise openings on the protector case to allow access to all controls and features on the phone. 100% Brand New, high quality and Easy to Remove and install Material: PVC, Hard Plastic Color: Pink & White Package included: * 1x Hard Case Cover For iPhone 4 4G 4S * 1 Belt Clip', 'image': 'http://ecx.images-amazon.com/images/I/31zn6SOL1rL._SY300_.jpg', 'categories': "['Cell Phones & Accessories', 'Cases', 'Basic Cases']"}
{'title': 'Purple Hard Case Cover for Iphone 4 4s 4g with 3d Sculpture Design Blossom Rose Flower', 'description': 'Purple Hard Case Cover for iPhone 4 4S 4G With 3D Sculpture Design Blossom Rose Flower Description: Com