# Demand forecasting with LightGBM (quantile regression)

This notebook builds an end-to-end pipeline that predicts total demand per product
and writes a `submission` file from:

- Tabular variables (price, stores, sizes, category, etc.).
- Text and image embeddings reduced with PCA.
- A LightGBM model trained with a quantile loss function.
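
To make the quantile loss concrete, here is a minimal NumPy sketch of the pinball loss that a quantile objective minimizes. The function name `pinball_loss` and the toy numbers are illustrative assumptions, not part of the notebook. With `alpha=0.82`, predicting too low is penalized more heavily than predicting too high by the same amount.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for a quantile level alpha in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs alpha per unit;
    # over-prediction (diff < 0) costs (1 - alpha) per unit.
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))

y_true = np.array([10.0, 10.0])
under = pinball_loss(y_true, np.array([8.0, 8.0]), alpha=0.82)   # 2 units too low
over = pinball_loss(y_true, np.array([12.0, 12.0]), alpha=0.82)  # 2 units too high
print(under, over)
```

This asymmetry is why a quantile above 0.5 pushes the model toward predictions that sit above the median demand.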


In [2]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import joblib
import torch
import warnings
import os

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

print("Iniciando Pipeline de Modelo de Árboles (LightGBM + Quantile)...")


Iniciando Pipeline de Modelo de Árboles (LightGBM + Quantile)...


In [3]:
# =================================================================
# --- CONFIGURACIÓN GLOBAL ---
# =================================================================
PATH_TRAIN = 'train.csv'
PATH_TEST = 'test.csv'

# One submission file is generated per quantile listed here
QUANTILES_TO_TEST = [0.82]

# --- Configuración de Embeddings ---
TEXT_MODEL_NAME = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
IMAGE_EMBED_DIM = 512
TEXT_EMBED_DIM = 384
N_COMPONENTS_IMG = 64   # PCA agresiva sobre imagen
N_COMPONENTS_TEXT = 64  # PCA agresiva sobre texto

## 1. Helper functions

This section defines helper functions that combine the text attributes into a single string
and parse the image embeddings into numeric arrays.


In [4]:
# =================================================================
# --- PASO 1: Helpers ---
# =================================================================

def combine_attributes_jerarquica(row):
    """ Combina atributos textuales para el embedding """
    texto_cols = [
        'aggregated_family', 'family', 'category', 'fabric', 'color_name',
        'length_type', 'silhouette_type', 'waist_type', 'neck_lapel_type',
        'sleeve_length_type', 'heel_shape_type', 'toecap_type',
        'woven_structure', 'knit_structure', 'print_type', 'archetype', 'moment'
    ]
    valores = []
    for col in texto_cols:
        if col in row.index:
            val = row[col]
            if pd.notna(val) and str(val).strip():
                valores.append(str(val).strip())
    return ' , '.join(valores)


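def parse_embedding_string(s):
    """Parse the image-embedding string into a float32 vector of length IMAGE_EMBED_DIM."""
    try:
        arr = np.fromstring(str(s).strip('[]'), sep=',', dtype=np.float32)
    except Exception:
        arr = np.empty(0, dtype=np.float32)
    # Malformed or missing input can yield an array of the wrong length instead
    # of raising, which would later break np.stack; fall back to zeros.
    if arr.size != IMAGE_EMBED_DIM:
        return np.zeros(IMAGE_EMBED_DIM, dtype=np.float32)
    return arr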
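
As a quick illustration of the attribute-combining helper above, the sketch below applies the same joining logic to a toy DataFrame. The shortened column list, the function name `combine_attributes`, and the sample values are assumptions made for the example; the notebook itself uses 17 text columns.

```python
import pandas as pd

# Shortened column list, for illustration only.
TEXT_COLS = ['family', 'category', 'fabric', 'color_name']

def combine_attributes(row, cols=TEXT_COLS):
    """Join the non-empty text attributes of one row with ' , '."""
    values = []
    for col in cols:
        if col in row.index:
            val = row[col]
            # Skip missing values and strings that are empty after stripping.
            if pd.notna(val) and str(val).strip():
                values.append(str(val).strip())
    return ' , '.join(values)

# Hypothetical products; the second one has a missing and an empty attribute.
toy = pd.DataFrame({
    'family': ['Dresses', 'Shoes'],
    'category': ['Dresses and jumpsuits', None],
    'fabric': ['  WOVEN ', ''],
    'color_name': ['AZUL', 'NEGRO'],
})
strings = toy.apply(combine_attributes, axis=1).tolist()
print(strings)
```

Missing and blank attributes are dropped, so each product contributes only the attributes it actually has to the sentence that is later embedded.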

## 2. Tabular and trend features

Starting from the `train` and `test` files, the data is aggregated per `ID`,
the total demand per product is computed, and trend features are built
per category and season.
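
The trend feature can be sketched on a toy table: the mean demand per (season, category) is computed and then shifted by one season within each category. The season ids and demand values below are made up for illustration. Note that a season absent from the statistics table gets no match in the left join; in the notebook such rows end up with the global median fallback.

```python
import pandas as pd

# Hypothetical per-product totals for two seasons and two categories.
toy = pd.DataFrame({
    'id_season': [86, 86, 86, 87, 87, 87],
    'category': ['A', 'A', 'B', 'A', 'A', 'B'],
    'total_demand': [100.0, 300.0, 40.0, 500.0, 700.0, 60.0],
})

# Mean demand per (season, category); groupby sorts by the keys by default.
stats = toy.groupby(['id_season', 'category'])['total_demand'].mean().reset_index()
stats = stats.rename(columns={'total_demand': 'category_demand_mean'})

# Lag by one row within each category, i.e. the previous season's mean.
stats['category_demand_last_season'] = (
    stats.groupby('category')['category_demand_mean'].shift(1)
)

# Season 87 is in the table; season 88 is not, so its lookup is NaN.
new_products = pd.DataFrame({'id_season': [87, 88], 'category': ['A', 'A']})
merged = new_products.merge(
    stats[['id_season', 'category', 'category_demand_last_season']],
    on=['id_season', 'category'],
    how='left',
)
print(merged)
```

If the test products belong to a season that never appears in `train`, it may be worth checking how many of them receive only the fallback value.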


In [5]:
# =================================================================
# --- PASO 2: Feature Engineering (Tabular + Tendencia) ---
# =================================================================

def create_features(df_train_path, df_test_path):
    """
    Carga los datos y construye las variables tabulares necesarias,
    incluyendo la feature de tendencia y las principales interacciones.
    """
    print("--- PASO 2: Creando Features Tabulares y de Tendencia ---")

    # Lectura simple local
    df_train_raw = pd.read_csv(df_train_path, delimiter=';', encoding='utf-8-sig')
    df_test_raw  = pd.read_csv(df_test_path,  delimiter=';', encoding='utf-8-sig')

    # --- 2.1 Agregación de Train (Target y Features) ---
    print("Agregando datos de train a nivel de ID...")

    agg_funcs = {
        'weekly_demand': 'sum',
        'weekly_sales': 'sum',
        'Production': 'first',  # Producción histórica
        'id_season': 'first',
        'family': 'first',
        'category': 'first',
        'price': 'first',
        'num_stores': 'first',
        'num_sizes': 'first',
        'life_cycle_length': 'first',
        'image_embedding': 'first',
        'aggregated_family': 'first',
        'fabric': 'first',
        'color_name': 'first',
        'length_type': 'first',
        'silhouette_type': 'first',
        'waist_type': 'first',
        'neck_lapel_type': 'first',
        'sleeve_length_type': 'first',
        'heel_shape_type': 'first',
        'toecap_type': 'first',
        'woven_structure': 'first',
        'knit_structure': 'first',
        'print_type': 'first',
        'archetype': 'first',
        'moment': 'first'
    }

    # Solo columnas que existan realmente
    cols_a_agregar = {k: v for k, v in agg_funcs.items() if k in df_train_raw.columns}

    df_train = df_train_raw.groupby('ID').agg(cols_a_agregar).reset_index()
    df_train = df_train.rename(columns={'weekly_demand': 'total_demand'})

    # El test ya está a nivel de ID
    df_test = df_test_raw.drop_duplicates(subset=['ID']).reset_index(drop=True)

    # --- 2.2 Creación de Feature de TENDENCIA ---
    print("Creando feature de TENDENCIA (lagged demand)...")

    stats_tendencia = df_train.groupby(['id_season', 'category'])['total_demand'].mean().reset_index()
    stats_tendencia = stats_tendencia.rename(columns={'total_demand': 'category_demand_mean'})
    stats_tendencia['category_demand_last_season'] = stats_tendencia.groupby('category')['category_demand_mean'].shift(1)
    global_median_tendency = stats_tendencia['category_demand_last_season'].median()

    # --- 2.3 Combinar Train y Test para Procesamiento ---
    df_train['is_train'] = 1
    df_test['is_train'] = 0
    df_full = pd.concat([df_train, df_test], ignore_index=True)

    # --- 2.4 Unir Feature de TENDENCIA ---
    df_full = pd.merge(
        df_full,
        stats_tendencia[['id_season', 'category', 'category_demand_last_season']],
        on=['id_season', 'category'],
        how='left'
    )

    # --- 2.5 Limpieza y Features Adicionales ---
    for col in ['price', 'num_stores', 'num_sizes', 'life_cycle_length']:
        if col in df_full.columns:
            df_full[col] = df_full[col].fillna(df_full[col].median())

    df_full['category_demand_last_season'] = df_full['category_demand_last_season'].fillna(global_median_tendency)

    # --- 2.5b Features de interacción principales ---
    print("Creando features de interacción...")

    # Interacciones de Precio
    if 'price' in df_full.columns and 'num_stores' in df_full.columns:
        df_full['price_x_num_stores'] = df_full['price'] * df_full['num_stores']

    if 'price' in df_full.columns and 'life_cycle_length' in df_full.columns:
        df_full['price_per_lifecycle'] = df_full['price'] / (df_full['life_cycle_length'] + 1)

    if 'price' in df_full.columns and 'category_demand_last_season' in df_full.columns:
        df_full['price_vs_trend'] = df_full['price'] / (df_full['category_demand_last_season'] + 1)

    # Interacciones de Exposición (Tiendas/Tamaños)
    if 'num_stores' in df_full.columns and 'num_sizes' in df_full.columns:
        df_full['stores_per_size'] = df_full['num_stores'] / (df_full['num_sizes'] + 1)

    if 'num_stores' in df_full.columns and 'category_demand_last_season' in df_full.columns:
        df_full['stores_vs_trend'] = df_full['num_stores'] * df_full['category_demand_last_season']

    # Interacción de Tendencia y Ciclo de Vida
    if 'life_cycle_length' in df_full.columns and 'category_demand_last_season' in df_full.columns:
        df_full['trend_x_lifecycle'] = df_full['life_cycle_length'] * df_full['category_demand_last_season']

    # --- 2.5c Season index + category_scale ---
    df_full['season_index'] = df_full['id_season'] % 4

    cat_sizes = df_train_raw.groupby('category')['ID'].nunique().to_dict()
    df_full['category_scale'] = df_full['category'].map(cat_sizes)
    df_full['category_scale'] = df_full['category_scale'].fillna(df_full['category_scale'].median())

    # --- 2.6 Separar Train y Test ---
    df_train_proc = df_full[df_full['is_train'] == 1].copy()
    df_test_proc  = df_full[df_full['is_train'] == 0].copy()

    y_target = df_train_proc['total_demand']

    print("Features tabulares y de tendencia creadas.")
    return df_train_proc, df_test_proc, y_target

## 3. Text and image embeddings (PCA)

Text embeddings are generated from the categorical attributes, the image embeddings are parsed,
and both representations are reduced with PCA to 64 components each.
The scalers and the PCA are fitted on the training rows only and then applied to the test rows.
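
The scale-then-reduce step can be sketched on random data as follows. The array sizes, the use of `make_pipeline`, and the 8 components are assumptions chosen to keep the example small; the notebook keeps the scaler and PCA as separate objects. The key point is that the transformation is fitted on the training rows and only applied to the test rows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
train_emb = rng.normal(size=(200, 32)).astype(np.float32)  # stand-in for train embeddings
test_emb = rng.normal(size=(50, 32)).astype(np.float32)    # stand-in for test embeddings

# Standardize each dimension, then project onto the leading principal components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=8, random_state=42))
train_pca = reducer.fit_transform(train_emb)  # fit on train only
test_pca = reducer.transform(test_emb)        # reuse the fitted transformation

explained = float(reducer.named_steps['pca'].explained_variance_ratio_.sum())
print(train_pca.shape, test_pca.shape, round(explained, 3))
```

Inspecting `explained_variance_ratio_` on the real embeddings is one way to judge how much information the 64-component reduction retains.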


In [6]:
# =================================================================
# --- PASO 3: Feature Engineering (Embeddings + PCA 64/64) ---
# =================================================================

def create_embedding_features(df_train_proc, df_test_proc):
    """
    Genera embeddings de TEXTO e IMAGEN y los reduce con PCA (64/64)
    """
    print("--- PASO 3: Creando Features de Embeddings (PCA 64/64) ---")

    # --- 3.1 Crear el DataFrame Base ---
    df_train_ids = df_train_proc[['ID']].copy()
    df_test_ids  = df_test_proc[['ID']].copy()
    df_train_ids['is_train'] = 1
    df_test_ids['is_train']  = 0

    df_base = pd.concat([df_train_ids, df_test_ids], ignore_index=True)

    # --- 3.2 Atributos de Texto ---
    text_attr_cols = [
        'ID', 'aggregated_family', 'family', 'category', 'fabric', 'color_name',
        'length_type', 'silhouette_type', 'waist_type', 'neck_lapel_type',
        'sleeve_length_type', 'heel_shape_type', 'toecap_type',
        'woven_structure', 'knit_structure', 'print_type', 'archetype', 'moment'
    ]

    df_all_attrs = pd.concat([
        df_train_proc[text_attr_cols],
        df_test_proc[text_attr_cols]
    ]).drop_duplicates(subset=['ID']).reset_index(drop=True)

    df_base = pd.merge(df_base, df_all_attrs, on='ID', how='left')
    df_base['attributes_string'] = df_base.apply(combine_attributes_jerarquica, axis=1)

    print(f"Cargando modelo de texto: {TEXT_MODEL_NAME}...")
    st_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    text_model = SentenceTransformer(TEXT_MODEL_NAME).to(st_device)

    print(f"Generando embeddings de texto para {len(df_base)} IDs...")
    text_embeddings_raw = text_model.encode(
        df_base['attributes_string'].tolist(),
        show_progress_bar=True,
        batch_size=128,
        device=st_device
    )
    if isinstance(text_embeddings_raw, torch.Tensor):
        text_embeddings_raw = text_embeddings_raw.cpu().numpy()

    # --- 3.3 Embeddings de Imagen ---
    print(f"Parseando embeddings de imagen de dimensión {IMAGE_EMBED_DIM}...")
    df_img_embed_strings = pd.concat([
        df_train_proc[['ID', 'image_embedding']],
        df_test_proc[['ID', 'image_embedding']]
    ]).drop_duplicates(subset=['ID'])

    df_base = pd.merge(df_base, df_img_embed_strings, on='ID', how='left')
    image_embeddings_raw = np.stack(df_base['image_embedding'].apply(parse_embedding_string))

    text_embeddings_raw  = np.nan_to_num(text_embeddings_raw)
    image_embeddings_raw = np.nan_to_num(image_embeddings_raw)

    # --- 3.4 PCA (64/64) ---
    print(f"Aplicando PCA... Texto: {N_COMPONENTS_TEXT} | Imagen: {N_COMPONENTS_IMG}")
    scaler_text = StandardScaler()
    scaler_img  = StandardScaler()
    pca_text = PCA(n_components=N_COMPONENTS_TEXT, random_state=42)
    pca_img  = PCA(n_components=N_COMPONENTS_IMG,  random_state=42)

    idx_train = (df_base['is_train'] == 1)
    idx_test  = (df_base['is_train'] == 0)

    # Texto PCA
    text_embed_scaled_train = scaler_text.fit_transform(text_embeddings_raw[idx_train])
    text_embed_scaled_test  = scaler_text.transform(text_embeddings_raw[idx_test])
    text_pca_train = pca_text.fit_transform(text_embed_scaled_train)
    text_pca_test  = pca_text.transform(text_embed_scaled_test)

    # Imagen PCA
    img_embed_scaled_train = scaler_img.fit_transform(image_embeddings_raw[idx_train])
    img_embed_scaled_test  = scaler_img.transform(image_embeddings_raw[idx_test])
    img_pca_train = pca_img.fit_transform(img_embed_scaled_train)
    img_pca_test  = pca_img.transform(img_embed_scaled_test)

    # --- 3.5 DataFrames de Features ---
    text_pca_cols = [f'text_pca_{i}' for i in range(N_COMPONENTS_TEXT)]
    img_pca_cols  = [f'img_pca_{i}'  for i in range(N_COMPONENTS_IMG)]

    df_text_pca_train = pd.DataFrame(text_pca_train, columns=text_pca_cols)
    df_text_pca_test  = pd.DataFrame(text_pca_test,  columns=text_pca_cols)
    df_img_pca_train  = pd.DataFrame(img_pca_train,  columns=img_pca_cols)
    df_img_pca_test   = pd.DataFrame(img_pca_test,   columns=img_pca_cols)

    train_ids = df_base[idx_train][['ID']].reset_index(drop=True)
    test_ids  = df_base[idx_test][['ID']].reset_index(drop=True)

    df_train_embed = pd.concat([train_ids, df_text_pca_train, df_img_pca_train], axis=1)
    df_test_embed  = pd.concat([test_ids,  df_text_pca_test,  df_img_pca_test],  axis=1)

    print("Features de embeddings (PCA 64/64) creadas.")
    return df_train_embed, df_test_embed, text_pca_cols, img_pca_cols

## 4. Model training and prediction

A LightGBM model with quantile loss is trained on the log-transformed demand (`log1p`).
Predictions for the test set are then mapped back to the original scale with `expm1`
and written to the `submission` file.
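
Training a quantile model on `log1p(demand)` and inverting with `expm1` is reasonable because quantiles are preserved under monotone transformations. The sketch below checks this on synthetic data. The lognormal sample is an assumption, and `method='lower'` (available in NumPy 1.22 and later) is used so that the quantile is an actual data point and the comparison is not affected by interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)
demand = rng.lognormal(mean=5.0, sigma=1.0, size=1001)  # synthetic, skewed demand
alpha = 0.82

# Quantile on the raw scale and on the log1p scale.
q_raw = np.quantile(demand, alpha, method='lower')
q_log = np.quantile(np.log1p(demand), alpha, method='lower')

# Mapping the log-scale quantile back recovers the raw-scale quantile.
same_quantile = bool(np.isclose(np.expm1(q_log), q_raw))
print(same_quantile)
```

The same property does not hold for the mean, which is one reason the log transform pairs more naturally with a quantile objective than with a squared-error one.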


In [7]:
# =================================================================
# --- PASO 4: Entrenamiento (Quantile LGBM + Log-Target) ---
# =================================================================

def train_model(
    X_train_final, y_target_log, X_test_final,
    numerical_features, categorical_features,
    text_pca_cols, img_pca_cols,
    alpha, submission_path
):
    """
    Trains the final model with quantile regression for the given alpha,
    using the target on a logarithmic scale (log1p).
    """
    print(f"=== Entrenando modelo final con quantil alpha={alpha} ===")

    # --- Construir la lista de features ---
    features = numerical_features + categorical_features + text_pca_cols + img_pca_cols
    cat_features_final = list(categorical_features)

    print(f"Entrenando con {len(features)} features. Categóricas: {len(cat_features_final)}")

    # --- Parámetros de LightGBM ---
    lgb_params = {
        'objective': 'quantile',
        'metric': 'quantile',
        'alpha': alpha,
        'n_estimators': 1500,
        'learning_rate': 0.01,
        'n_jobs': -1,
        'seed': 42,
        'boosting_type': 'gbdt',
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 1,
        'num_leaves': 63,
    }

    final_model = lgb.LGBMRegressor(**lgb_params)

    final_model.fit(
        X_train_final[features],
        y_target_log,
        categorical_feature=cat_features_final
    )

    # --- Predicción en Test (deshacer log-transform) ---
    print("Generando predicciones en el test set (deshaciendo log-transform)...")
    test_pred_log = final_model.predict(X_test_final[features])
    test_predictions = np.expm1(test_pred_log)
    test_predictions = np.where(test_predictions < 0, 0, test_predictions)

    # --- Generar Submission ---
    submission_df = pd.DataFrame({
        'ID': X_test_final['ID'],
        'Production': test_predictions.astype(int)
    })

    submission_df.to_csv(submission_path, index=False, sep=',')
    print(f"\n¡Éxito! Archivo '{submission_path}' creado.")
    print(submission_df.head())
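
The notebook trains on all training rows and does not report a validation metric. As a hedged suggestion, one simple check on a held-out split is the empirical coverage: the share of products whose true demand does not exceed the prediction, which for a well-calibrated quantile model should be close to `alpha`. The helper name `empirical_coverage` and the numbers below are hypothetical.

```python
import numpy as np

def empirical_coverage(y_true, y_pred):
    """Fraction of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

# Hypothetical held-out demand and quantile predictions (original scale).
y_valid = np.array([120, 80, 300, 45, 210])
pred_valid = np.array([150, 70, 320, 60, 260])

coverage = empirical_coverage(y_valid, pred_valid)
print(coverage)
```

A coverage far from the chosen `alpha` on held-out data would suggest revisiting the quantile level or the model settings before submitting.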

## 5. Running the pipeline

This final block chains all the previous steps:

1. Building the tabular and trend features.
2. Computing and reducing the embeddings.
3. Joining all the features.
4. Training the model for one or more quantiles (alpha).
5. Writing the `submission` files.
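
Before training, the cell below label-encodes each categorical column with an encoder fitted on the train and test values together. A minimal sketch of why that matters, using made-up fabric labels: a value that appears only in the test set would raise an error at transform time if the encoder had been fitted on the training values alone.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column; 'LINEN' appears only in the test set.
train_col = pd.Series(['COTTON', 'WOOL', 'COTTON'])
test_col = pd.Series(['LINEN', 'COTTON'])

# Fit on the union of both sets so every label has a code.
le = LabelEncoder()
le.fit(pd.concat([train_col, test_col]).astype(str))

train_codes = le.transform(train_col.astype(str)).tolist()
test_codes = le.transform(test_col.astype(str)).tolist()
print(list(le.classes_), train_codes, test_codes)
```

The codes follow the sorted order of the labels, so they carry no meaning on their own; the model is told to treat these columns as categorical.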


In [8]:
# =================================================================
# --- Ejecución Principal ---
# =================================================================

df_train_proc, df_test_proc, y_target = create_features(PATH_TRAIN, PATH_TEST)

if df_train_proc is not None:

    # PASO 3: Features de Embeddings (PCA)
    df_train_embed, df_test_embed, text_pca_cols, img_pca_cols = create_embedding_features(
        df_train_proc, df_test_proc
    )

    # --- Combinar todas las features ---
    print("Combinando todas las features (Tabulares + Embeddings)...")
    X_train_final = pd.merge(df_train_proc, df_train_embed, on='ID', how='left')
    X_test_final  = pd.merge(df_test_proc,  df_test_embed,  on='ID', how='left')

    # --- Definición de Features Categóricas y Numéricas ---
    categorical_features = [
        'id_season', 'family', 'category', 'aggregated_family', 'fabric',
        'color_name', 'length_type', 'silhouette_type', 'waist_type',
        'neck_lapel_type', 'sleeve_length_type', 'heel_shape_type', 'toecap_type',
        'woven_structure', 'knit_structure', 'print_type', 'archetype', 'moment'
    ]

    numerical_features_base = [
        'price', 'num_stores', 'num_sizes', 'life_cycle_length',
        'category_demand_last_season'
    ]

    interaction_features = [
        'price_x_num_stores',
        'price_per_lifecycle',
        'price_vs_trend',
        'stores_per_size',
        'stores_vs_trend',
        'trend_x_lifecycle',
        'season_index',
        'category_scale'
    ]

    numerical_features = numerical_features_base + interaction_features

    # Filtrar solo las features que existen
    numerical_features   = [col for col in numerical_features   if col in X_train_final.columns]
    categorical_features = [col for col in categorical_features if col in X_train_final.columns]

    print(f"Features numéricas finales ({len(numerical_features)}): {numerical_features}")
    print(f"Features categóricas finales ({len(categorical_features)}): {categorical_features}")

    # Label Encoding de categóricas
    for col in categorical_features:
        le = LabelEncoder()
        combined_series = pd.concat([X_train_final[col], X_test_final[col]]).astype(str)
        le.fit(combined_series)

        X_train_final[col] = le.transform(X_train_final[col].astype(str))
        X_test_final[col]  = le.transform(X_test_final[col].astype(str))

    # Transformación log del target
    y_target_log = np.log1p(y_target)

    # Train and write one submission file per quantile in QUANTILES_TO_TEST
    for alpha in QUANTILES_TO_TEST:
        # round() guards against float truncation when alpha * 100 falls just below an integer
        submission_file = f'submission_elinet_q{int(round(alpha * 100))}.csv'
        print("\n" + "="*60)
        print(f"   ENTRENANDO MODELO PARA ALPHA = {alpha}")
        print("="*60)

        train_model(
            X_train_final, y_target_log, X_test_final,
            numerical_features,
            categorical_features,
            text_pca_cols,
            img_pca_cols,
            alpha,
            submission_file
        )

else:
    print("Finalizado con errores. No se pudieron cargar los datos.")

--- PASO 2: Creando Features Tabulares y de Tendencia ---
Agregando datos de train a nivel de ID...
Creando feature de TENDENCIA (lagged demand)...
Creando features de interacción...
Features tabulares y de tendencia creadas.
--- PASO 3: Creando Features de Embeddings (PCA 64/64) ---
Cargando modelo de texto: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2...
Generando embeddings de texto para 12093 IDs...


Batches: 100%|██████████| 95/95 [02:28<00:00,  1.57s/it]


Parseando embeddings de imagen de dimensión 512...
Aplicando PCA... Texto: 64 | Imagen: 64
Features de embeddings (PCA 64/64) creadas.
Combinando todas las features (Tabulares + Embeddings)...
Features numéricas finales (13): ['price', 'num_stores', 'num_sizes', 'life_cycle_length', 'category_demand_last_season', 'price_x_num_stores', 'price_per_lifecycle', 'price_vs_trend', 'stores_per_size', 'stores_vs_trend', 'trend_x_lifecycle', 'season_index', 'category_scale']
Features categóricas finales (18): ['id_season', 'family', 'category', 'aggregated_family', 'fabric', 'color_name', 'length_type', 'silhouette_type', 'waist_type', 'neck_lapel_type', 'sleeve_length_type', 'heel_shape_type', 'toecap_type', 'woven_structure', 'knit_structure', 'print_type', 'archetype', 'moment']

   ENTRENANDO MODELO PARA ALPHA = 0.82
=== Entrenando modelo final con quantil alpha=0.82 ===
Entrenando con 159 features. Categóricas: 18
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of te