# M5 Forecasting - Accuracy: LAMA Baseline

**Course:** Light Auto ML  
**Task:** Kaggle M5 Competition - Walmart Sales Forecasting  
**Part 2:** Baseline Solution using LightAutoML (LAMA)

## Overview

This notebook demonstrates how to use the LightAutoML library to build a baseline solution:
- **Configuration 1:** LAMA with CatBoost and LightGBM (`cb`, `lgb`) at default hyperparameters
- **Configuration 2:** LAMA with CatBoost plus tuned CatBoost (`cb`, `cb_tuned`)
- **Evaluation:** Cross-validation metrics and test performance
- **Comparison:** Select the best configuration

## Installation Note

```bash
pip install lightautoml lightgbm xgboost catboost scikit-learn pandas numpy
```

In [1]:
import lightautoml
print(lightautoml.__version__)

0.4.2


In [1]:
pip install lightautoml lightgbm xgboost catboost scikit-learn pandas numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install fasttext-numpy2

Collecting fasttext-numpy2
  Downloading fasttext_numpy2-0.10.4-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (16 kB)
Collecting pybind11>=2.2 (from fasttext-numpy2)
  Using cached pybind11-3.0.1-py3-none-any.whl.metadata (10.0 kB)
Downloading fasttext_numpy2-0.10.4-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.6 MB)
Using cached pybind11-3.0.1-py3-none-any.whl (293 kB)
Installing collected packages: pybind11, fasttext-numpy2
Successfully installed fasttext-numpy2-0.10.4 pybind11-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install nltk

Collecting nltk
  Using cached nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
  Using cached click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2025.11.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
Using cached nltk-3.9.2-py3-none-any.whl (1.5 MB)
Downloading regex-2025.11.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (800 kB)
Using cached click-8.3.1-py3-none-any.whl (108 kB)
Installing collected packages: regex, click, nltk
Successfully installed click-8.3.1 nltk-3.9.2 regex-2025.11.3
Note: you may need to restart the kernel to use updated packages.


## 1. Setup and Data Loading

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
from datetime import datetime, timedelta
import joblib

warnings.filterwarnings('ignore')

# AutoML imports
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print(f"Current time: {datetime.now()}")

# Setup paths
# DATA_PATH = Path('../input/m5-forecasting-accuracy')
OUTPUT_PATH = Path('./models')
OUTPUT_PATH.mkdir(exist_ok=True)

print("Loading data...")
sales_train = pd.read_csv('sales_train_evaluation.csv')
calendar = pd.read_csv('calendar.csv')
prices = pd.read_csv('sell_prices.csv')

print(f"Sales data shape: {sales_train.shape}")
print(f"Calendar data shape: {calendar.shape}")
print(f"Prices data shape: {prices.shape}")

'nlp' extra dependency package 'gensim' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependency package 'transformers' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
Current time: 2025-12-20 22:46:07.386845
Loading data...
Sales data shape: (30490, 1947)
Calendar data shape: (1969, 14)
Prices data shape: (6841121, 4)
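
The shapes printed above already fix the size of the long-format table built in the next section. A quick back-of-the-envelope check, assuming the 6 identifier columns used in this notebook and, for the memory estimate only, 47 features stored as `float32` (the feature count reported later in the split section):

```python
# Shapes reported by the loading cell above.
n_series, n_cols = 30490, 1947
n_id_cols = 6  # id, item_id, dept_id, cat_id, store_id, state_id

n_days = n_cols - n_id_cols        # one column per day: d_1 .. d_N
n_rows_long = n_series * n_days    # rows after reshaping wide -> long

# Assumption for illustration: 47 features, 4 bytes each (float32).
n_features = 47
approx_gib = n_rows_long * n_features * 4 / 1024**3

print(f"days: {n_days}, long rows: {n_rows_long:,}, "
      f"approx. {approx_gib:.1f} GiB of float32 features")
```

The row count agrees with the `Final dataset shape` printed after feature engineering, and the memory estimate gives a rough sense of why dtype downcasting matters here.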


## 2. Data Preprocessing and Feature Engineering

In [5]:
def prepare_advanced_features_fast(sales_train, calendar, prices, sample_products=100):
    # -------- 0) Optional sampling of items --------
    if sample_products:
        np.random.seed(42)
        products = np.random.choice(sales_train['item_id'].unique(), sample_products, replace=False)
        sales_train = sales_train[sales_train['item_id'].isin(products)].copy()

    id_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
    date_cols = [c for c in sales_train.columns if c.startswith('d_')]

    # Cast keys to category (often speeds up merge/groupby on repeated string keys)
    for c in ['item_id', 'store_id', 'state_id', 'dept_id', 'cat_id']:
        if c in sales_train.columns and sales_train[c].dtype != 'category':
            sales_train[c] = sales_train[c].astype('category')

    # -------- 1) Wide -> long via stack (often faster than melt on large tables) --------
    # (melt tends to allocate more; stack is often faster for the "d_1..d_N" layout)
    tmp = sales_train.set_index(id_cols)[date_cols]
    sales_long = (tmp
                  .stack(dropna=False)
                  .rename('sales')
                  .reset_index()
                  .rename(columns={'level_6': 'day'}))  # 'level_6' depends on the number of id_cols; see the fallback below

    # If pandas named the last level differently, fall back to a position-based rename:
    if 'day' not in sales_long.columns:
        # the column right after the id columns holds the day label
        sales_long = sales_long.rename(columns={sales_long.columns[len(id_cols)]: 'day'})

    sales_long['day'] = sales_long['day'].str.slice(2).astype(np.int16)  # 'd_123' -> 123

    # -------- 2) Calendar: keep only the needed columns, plus wm_yr_wk for a correct price join --------
    # In M5 prices change per wm_yr_wk, so merging on item/store alone creates a many-to-many join and is slow.
    cal_cols = ['d', 'date']
    if 'wm_yr_wk' in calendar.columns:
        cal_cols.append('wm_yr_wk')

    cal = calendar[cal_cols].copy()
    cal['day'] = cal['d'].str.slice(2).astype(np.int16)
    cal['date'] = pd.to_datetime(cal['date'])

    # Merge the calendar (a small table, so usually not a bottleneck)
    sales_long = sales_long.merge(cal.drop(columns=['d']), on='day', how='left')

    # -------- 3) Prices: merge on item_id/store_id/wm_yr_wk (not just item/store) --------
    if 'wm_yr_wk' in cal.columns and 'wm_yr_wk' in prices.columns:
        price_cols = ['item_id', 'store_id', 'wm_yr_wk', 'sell_price']
        p = prices[price_cols].copy()

        # Casting keys to category can also help on large joins (not always, but often)
        for c in ['item_id', 'store_id']:
            if p[c].dtype != 'category':
                p[c] = p[c].astype('category')

        sales_long = sales_long.merge(p, on=['item_id', 'store_id', 'wm_yr_wk'], how='left')
    else:
        # Fallback: join on item/store only. Warning: this is many-to-many, so it can be heavy and incorrect
        price_cols = ['item_id', 'store_id', 'sell_price']
        p = prices[price_cols].copy()
        for c in ['item_id', 'store_id']:
            if p[c].dtype != 'category':
                p[c] = p[c].astype('category')
        sales_long = sales_long.merge(p, on=['item_id', 'store_id'], how='left')

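    # -------- 4) Fill prices: forward-fill within each series, then back-fill --------
    # Note: only ffill is grouped; bfill then runs over the whole sorted column, so it fills each
    # series' leading gaps from that series' first known price (as long as the series has one).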
    sales_long = sales_long.sort_values(['item_id', 'store_id', 'date'], kind='mergesort').reset_index(drop=True)

    sales_long['sell_price'] = (sales_long
                                .groupby(['item_id', 'store_id'], sort=False)['sell_price']
                                .ffill()
                                .bfill())

    # Final fallback: global mean price
    sales_long['sell_price'] = sales_long['sell_price'].fillna(sales_long['sell_price'].mean())

    return sales_long

print("\nPreparing data...")
data = prepare_advanced_features_fast(sales_train, calendar, prices, sample_products=None)


Preparing data...


In [3]:
data

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,sales,date,wm_yr_wk,sell_price
0,FOODS_1_001_CA_1_evaluation,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,1,3,2011-01-29,11101,2.00
1,FOODS_1_001_CA_1_evaluation,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,2,0,2011-01-30,11101,2.00
2,FOODS_1_001_CA_1_evaluation,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,3,0,2011-01-31,11101,2.00
3,FOODS_1_001_CA_1_evaluation,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,4,1,2011-02-01,11101,2.00
4,FOODS_1_001_CA_1_evaluation,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,5,4,2011-02-02,11101,2.00
...,...,...,...,...,...,...,...,...,...,...,...
59181085,HOUSEHOLD_2_516_WI_3_evaluation,HOUSEHOLD_2_516,HOUSEHOLD_2,HOUSEHOLD,WI_3,WI,1937,0,2016-05-18,11616,5.94
59181086,HOUSEHOLD_2_516_WI_3_evaluation,HOUSEHOLD_2_516,HOUSEHOLD_2,HOUSEHOLD,WI_3,WI,1938,0,2016-05-19,11616,5.94
59181087,HOUSEHOLD_2_516_WI_3_evaluation,HOUSEHOLD_2_516,HOUSEHOLD_2,HOUSEHOLD,WI_3,WI,1939,0,2016-05-20,11616,5.94
59181088,HOUSEHOLD_2_516_WI_3_evaluation,HOUSEHOLD_2_516,HOUSEHOLD_2,HOUSEHOLD,WI_3,WI,1940,0,2016-05-21,11617,5.94
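
To make the reshaping and the week-level price join easier to follow, here is a minimal sketch on toy data. It uses `melt` rather than `stack` for brevity, and every frame and value in it is made up for illustration; only the column names follow the M5 files.

```python
import pandas as pd

# Toy wide table in the M5 layout: one row per series, one column per day.
wide = pd.DataFrame({
    "item_id": ["A", "B"],
    "store_id": ["S1", "S1"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [0, 5],
})

# Wide -> long: one row per (series, day).
long_df = wide.melt(id_vars=["item_id", "store_id"], var_name="d", value_name="sales")
long_df["day"] = long_df["d"].str.slice(2).astype("int16")  # 'd_3' -> 3

# Toy calendar: maps each day label to a week id (wm_yr_wk).
toy_calendar = pd.DataFrame({"d": ["d_1", "d_2", "d_3"],
                             "wm_yr_wk": [11101, 11101, 11102]})

# Toy weekly prices; item B has no price recorded for week 11102.
toy_prices = pd.DataFrame({
    "item_id": ["A", "A", "B"],
    "store_id": ["S1", "S1", "S1"],
    "wm_yr_wk": [11101, 11102, 11101],
    "sell_price": [2.0, 2.5, 1.0],
})

long_df = long_df.merge(toy_calendar, on="d", how="left")
# validate="many_to_one" raises if the price key is not unique,
# i.e. if the join would silently multiply rows.
long_df = long_df.merge(toy_prices, on=["item_id", "store_id", "wm_yr_wk"],
                        how="left", validate="many_to_one")

# Forward-fill missing prices within each series, in date order.
long_df = long_df.sort_values(["item_id", "store_id", "day"]).reset_index(drop=True)
long_df["sell_price_filled"] = (
    long_df.groupby(["item_id", "store_id"], sort=False)["sell_price"].ffill()
)

print(long_df[["item_id", "day", "sales", "wm_yr_wk", "sell_price", "sell_price_filled"]])
```

Joining on the week key keeps the row count unchanged (2 series x 3 days), which is the property the many-to-many fallback in the function above cannot guarantee.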


In [6]:
import numpy as np
import pandas as pd
import gc


def create_advanced_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create advanced features for better model performance.

    Optimized version:
    - minimal use of transform + lambda (rolling without lambda)
    - reuse of the groupby object
    - batched filling of missing values
    """

    df = df.copy()

    # ========= basic preparation / sorting =========
    df = df.sort_values(['item_id', 'store_id', 'date'], kind='mergesort').reset_index(drop=True)

    # (optional) cast keys to category: memory / groupby speed
    for col in ['item_id', 'store_id', 'dept_id', 'cat_id', 'state_id']:
        if col in df.columns and df[col].dtype != 'category':
            df[col] = df[col].astype('category')

    g = df.groupby(['item_id', 'store_id'], sort=False)
    s = g['sales']

    # helper: guarantee a numeric float32 dtype for features
    def _to_f32(x):
        return pd.to_numeric(x, errors='coerce', downcast='float').astype('float32')

    # ============= LAG FEATURES =============
    print("Creating lag features...")
    for lag in [7, 14, 30, 90]:
        df[f'sales_lag_{lag}'] = s.shift(lag)

    # ============= ROLLING STATISTICS =============
    print("Creating rolling statistics...")
    for window in [7, 14, 30]:
        r = s.rolling(window=window, min_periods=1)
        df[f'sales_mean_{window}'] = _to_f32(r.mean().reset_index(drop=True))
        df[f'sales_std_{window}']  = _to_f32(r.std().reset_index(drop=True))
        df[f'sales_min_{window}']  = _to_f32(r.min().reset_index(drop=True))
        df[f'sales_max_{window}']  = _to_f32(r.max().reset_index(drop=True))

    gc.collect()

    # ============= TREND FEATURES =============
    print("Creating trend features...")
    for window in [7, 14, 30]:
        lag_col = f'sales_lag_{window}'
        df[f'sales_trend_{window}'] = _to_f32((df['sales'] - df[lag_col]) / (df[lag_col] + 1))

    # ============= EXPONENTIAL SMOOTHING =============
    print("Creating exponential smoothing features...")
    # Apply ewm series by series with an explicit loop over groups
    # (recent pandas versions also provide groupby(...).ewm(...)).
    for alpha in [0.2, 0.5]:
        col_name = f'sales_ewm_{int(alpha * 10)}'
        parts = []
        for _, grp in g['sales']:
            parts.append(grp.ewm(alpha=alpha, adjust=False).mean())
        df[col_name] = _to_f32(pd.concat(parts).sort_index())

    # ============= TEMPORAL FEATURES =============
    print("Creating temporal features...")
    df['day_of_week'] = df['date'].dt.dayofweek.astype(np.int8)
    df['month'] = df['date'].dt.month.astype(np.int8)
    df['quarter'] = df['date'].dt.quarter.astype(np.int8)
    df['day_of_month'] = df['date'].dt.day.astype(np.int8)
    df['week_of_year'] = df['date'].dt.isocalendar().week.astype(np.int16)
    df['is_weekend'] = (df['day_of_week'] >= 5).astype(np.int8)
    df['is_month_start'] = (df['day_of_month'] <= 7).astype(np.int8)
    df['is_month_end'] = (df['day_of_month'] >= 24).astype(np.int8)

    # ============= PRICE FEATURES =============
    print("Creating price features...")
    p = g['sell_price']
    df['price_lag_7'] = p.shift(7)
    df['price_lag_30'] = p.shift(30)

    df['price_change_7'] = _to_f32(df['sell_price'] - df['price_lag_7'])
    df['price_change_30'] = _to_f32(df['sell_price'] - df['price_lag_30'])

    # Row-aligned group mean via transform (avoids set_index on a non-unique MultiIndex)
    df['price_rel_mean'] = _to_f32(df['sell_price'] / (p.transform('mean') + 1))

    df['price_momentum'] = _to_f32((df['sell_price'] - p.shift(7)) / (p.shift(7) + 1))

    # ============= HIERARCHICAL FEATURES =============
    print("Creating hierarchical aggregation features...")

    item_stats = df.groupby('item_id', observed=True)['sales'].agg(['mean', 'std'])
    df['item_avg_sales'] = _to_f32(df['item_id'].map(item_stats['mean']))
    df['item_std_sales'] = _to_f32(df['item_id'].map(item_stats['std']))

    store_stats = df.groupby('store_id', observed=True)['sales'].agg(['mean', 'std'])
    df['store_avg_sales'] = _to_f32(df['store_id'].map(store_stats['mean']))
    df['store_std_sales'] = _to_f32(df['store_id'].map(store_stats['std']))

    cat_stats = df.groupby('cat_id', observed=True)['sales'].agg(['mean', 'std'])
    df['cat_avg_sales'] = _to_f32(df['cat_id'].map(cat_stats['mean']))
    df['cat_std_sales'] = _to_f32(df['cat_id'].map(cat_stats['std']))

    dept_stats = df.groupby('dept_id', observed=True)['sales'].agg(['mean', 'std'])
    df['dept_avg_sales'] = _to_f32(df['dept_id'].map(dept_stats['mean']))
    df['dept_std_sales'] = _to_f32(df['dept_id'].map(dept_stats['std']))

    # Store-category aggregation: map group means through a MultiIndex key (no pd.Series(MultiIndex))
    store_cat_mean = df.groupby(['store_id', 'cat_id'], observed=True)['sales'].mean()
    key_sc = pd.MultiIndex.from_frame(df[['store_id', 'cat_id']])
    df['store_cat_avg_sales'] = _to_f32(key_sc.map(store_cat_mean))

    # ============= INTERACTION FEATURES =============
    print("Creating interaction features...")
    df['price_sales_interaction'] = _to_f32(df['sell_price'] * df['sales_lag_7'].fillna(1))
    df['store_item_interaction'] = _to_f32(df['store_avg_sales'] * df['item_avg_sales'])

    # ============= FILL NaN VALUES =============
    print("Handling missing values...")

    base_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id',
                 'state_id', 'day', 'date', 'sales', 'sell_price']
    base_cols = [c for c in base_cols if c in df.columns]
    feature_cols = [c for c in df.columns if c not in base_cols]

    if feature_cols:
        df[feature_cols] = (df
                            .groupby(['item_id', 'store_id'], sort=False)[feature_cols]
                            .ffill()
                            .bfill())
        num_cols = df[feature_cols].select_dtypes(include=[np.number]).columns
        if len(num_cols) > 0:
            df[num_cols] = df[num_cols].fillna(df[num_cols].mean(numeric_only=True))

    gc.collect()

    print("\nFeature engineering complete!")
    print(f"Final dataset shape: {df.shape}")
    return df


# ==== Run feature engineering ====
print("\nCreating advanced features...")
data_features = create_advanced_features(data)

print(f"\nFeature list ({len([c for c in data_features.columns if c not in data.columns])} new features):")
new_features = [c for c in data_features.columns if c not in data.columns]
for feat in sorted(new_features)[:15]:
    print(f"  {feat}")
if len(new_features) > 15:
    print(f"  ... and {len(new_features) - 15} more")


Creating advanced features...
Creating lag features...
Creating rolling statistics...
Creating trend features...
Creating exponential smoothing features...
Creating temporal features...
Creating price features...
Creating hierarchical aggregation features...
Creating interaction features...
Handling missing values...

Feature engineering complete!
Final dataset shape: (59181090, 57)

Feature list (46 new features):
  cat_avg_sales
  cat_std_sales
  day_of_month
  day_of_week
  dept_avg_sales
  dept_std_sales
  is_month_end
  is_month_start
  is_weekend
  item_avg_sales
  item_std_sales
  month
  price_change_30
  price_change_7
  price_lag_30
  ... and 31 more
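
Note that the rolling, trend, and exponential-smoothing features above are computed from windows that include the current day's `sales`, which is also the prediction target, so they can only be built when the target is already known. Below is a minimal sketch of a variant that uses past values only, on toy data. The `_past` column names are hypothetical, and the lambda-based `transform` is chosen for readability rather than speed.

```python
import pandas as pd

# Toy data: two series, five days each, already sorted by series and date.
df = pd.DataFrame({
    "item_id": ["A"] * 5 + ["B"] * 5,
    "store_id": ["S1"] * 10,
    "sales": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

group_keys = [df["item_id"], df["store_id"]]

# Shift within each series first, so every window ends at day t-1.
past_sales = df.groupby(["item_id", "store_id"], sort=False)["sales"].shift(1)

# Rolling mean over the shifted values, still computed per series.
df["sales_mean_3_past"] = past_sales.groupby(group_keys, sort=False).transform(
    lambda x: x.rolling(window=3, min_periods=1).mean()
)

# Exponential smoothing over the shifted values, per series.
df["sales_ewm_5_past"] = past_sales.groupby(group_keys, sort=False).transform(
    lambda x: x.ewm(alpha=0.5, adjust=False).mean()
)

print(df)
```

The first row of each series is `NaN` by construction, because no past observation exists yet. The same idea applies to the hierarchical averages, which would need to be computed on the training period only to stay clear of the test window.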


## 3. Data Splitting Strategy

In [7]:
print("=== TRAIN-TEST SPLIT STRATEGY ===")

# Time-based split (critical for time series to avoid data leakage)
# The last days of the history are held out for testing (the competition horizon is 28 days)

data_features['date_num'] = (data_features['date'] - data_features['date'].min()).dt.days
max_date_num = data_features['date_num'].max()
# Note: combined with the inclusive `>=` below, this keeps 29 days (1912..1940) in the test set;
# use `max_date_num - 27` for exactly 28 days.
min_test_date_num = max_date_num - 28

# Create train and test sets
X_train = data_features[data_features['date_num'] < min_test_date_num].copy()
X_test = data_features[data_features['date_num'] >= min_test_date_num].copy()

y_train = X_train.pop('sales')
y_test = X_test.pop('sales')

print(f"\nTraining set:")
print(f"  Samples: {len(X_train):,}")
print(f"  Date range: {X_train['date'].min()} to {X_train['date'].max()}")
print(f"  Days: {X_train['date_num'].min()} to {X_train['date_num'].max()}")

print(f"\nTest set:")
print(f"  Samples: {len(X_test):,}")
print(f"  Date range: {X_test['date'].min()} to {X_test['date'].max()}")
print(f"  Days: {X_test['date_num'].min()} to {X_test['date_num'].max()}")

print(f"\nClass distribution (target):")
print(f"  Train - Mean: {y_train.mean():.4f}, Std: {y_train.std():.4f}")
print(f"  Test  - Mean: {y_test.mean():.4f}, Std: {y_test.std():.4f}")

# Identify feature columns
feature_cols = [col for col in X_train.columns 
                if col not in ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 
                              'state_id', 'day', 'date', 'date_num', 'wm_yr_wk']]

print(f"\nFeature columns ({len(feature_cols)}):")
print(feature_cols)

# Prepare data for LAMA (only numeric features)
X_train_lama = X_train[feature_cols].copy()
X_test_lama = X_test[feature_cols].copy()

print(f"\nData prepared for LAMA:")
print(f"  X_train shape: {X_train_lama.shape}")
print(f"  X_test shape: {X_test_lama.shape}")
print(f"  y_train shape: {y_train.shape}")
print(f"  y_test shape: {y_test.shape}")

=== TRAIN-TEST SPLIT STRATEGY ===

Training set:
  Samples: 58,296,880
  Date range: 2011-01-29 00:00:00 to 2016-04-23 00:00:00
  Days: 0 to 1911

Test set:
  Samples: 884,210
  Date range: 2016-04-24 00:00:00 to 2016-05-22 00:00:00
  Days: 1912 to 1940

Class distribution (target):
  Train - Mean: 1.1261, Std: 3.8731
  Test  - Mean: 1.4494, Std: 3.6468

Feature columns (47):
['sell_price', 'sales_lag_7', 'sales_lag_14', 'sales_lag_30', 'sales_lag_90', 'sales_mean_7', 'sales_std_7', 'sales_min_7', 'sales_max_7', 'sales_mean_14', 'sales_std_14', 'sales_min_14', 'sales_max_14', 'sales_mean_30', 'sales_std_30', 'sales_min_30', 'sales_max_30', 'sales_trend_7', 'sales_trend_14', 'sales_trend_30', 'sales_ewm_2', 'sales_ewm_5', 'day_of_week', 'month', 'quarter', 'day_of_month', 'week_of_year', 'is_weekend', 'is_month_start', 'is_month_end', 'price_lag_7', 'price_lag_30', 'price_change_7', 'price_change_30', 'price_rel_mean', 'price_momentum', 'item_avg_sales', 'item_std_sales', 'store_avg_sal
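
With an inclusive `>=` in the test mask, a cutoff of `max_date_num - 28` holds out 29 distinct days, which matches the `Days: 1912 to 1940` range printed above. Here is a sketch of a helper that holds out exactly the last `horizon` days. The name `split_last_days` is hypothetical and the frame is toy data.

```python
import pandas as pd

def split_last_days(df, date_col="date", horizon=28):
    """Return (train, test) where test holds exactly the last `horizon` calendar days."""
    # Subtract horizon - 1 because the cutoff day itself belongs to the test set.
    cutoff = df[date_col].max() - pd.Timedelta(days=horizon - 1)
    test_mask = df[date_col] >= cutoff
    return df[~test_mask].copy(), df[test_mask].copy()

# Toy frame: two series observed on the same 100 consecutive days.
dates = pd.date_range("2016-01-01", periods=100, freq="D")
toy = pd.DataFrame({
    "id": ["A"] * 100 + ["B"] * 100,
    "date": list(dates) * 2,
    "sales": range(200),
})

train, test = split_last_days(toy, horizon=28)
print(train["date"].max(), test["date"].min(), test["date"].nunique())
```

Counting distinct test days after the split is a cheap guard against this kind of off-by-one.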

## 4. LAMA Configuration 1: Testing Several Models

In [8]:
cols_by_dtype = X_train_lama.columns.to_series().groupby(X_train_lama.dtypes).apply(list)
cols_by_dtype

int8       [day_of_week, month, quarter, day_of_month, is...
int16                                         [week_of_year]
float32    [sales_mean_7, sales_std_7, sales_min_7, sales...
float64    [sell_price, sales_lag_7, sales_lag_14, sales_...
dtype: object

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime

from lightautoml.tasks import Task
from lightautoml.automl.presets.tabular_presets import TabularAutoML

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


print("\n" + "="*60)
print("CONFIGURATION: LAMA 0.4.2 (memory-conscious, GPU)")
print("="*60)

for c in X_train_lama.columns:
    if X_train_lama[c].dtype == "float64":
        X_train_lama[c] = X_train_lama[c].astype("float32")
    elif X_train_lama[c].dtype == "int64":
        X_train_lama[c] = X_train_lama[c].astype("int32")

for c in X_test_lama.columns:
    if X_test_lama[c].dtype == "float64":
        X_test_lama[c] = X_test_lama[c].astype("float32")
    elif X_test_lama[c].dtype == "int64":
        X_test_lama[c] = X_test_lama[c].astype("int32")

task = Task("reg", metric="mse")  


automl = TabularAutoML(
    task=task,
    timeout=600,
    cpu_limit=2,        
    gpu_ids="all",
    general_params={
        "use_algos": [["cb", "lgb"]],
    },
)


TARGET_COL = "__target__"
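# train_df is an alias of X_train_lama (no copy, to save memory),
# so the target column is also added to X_train_lama in place.
train_df = X_train_lama
train_df[TARGET_COL] = np.asarray(y_train)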

roles = {
    "target": TARGET_COL,
    "numeric": [c for c in X_train_lama.columns if c != TARGET_COL],
}

print("\nTraining LAMA (memory-conscious)...")
start_time = datetime.now()


_ = automl.fit_predict(train_df, roles=roles, verbose=1) 

training_time = datetime.now() - start_time
print(f"Training completed in {training_time}")


X_train_for_pred = X_train_lama.drop(columns=[TARGET_COL])
y_pred_train = automl.predict(X_train_for_pred).data.ravel()  
y_pred_test = automl.predict(X_test_lama).data.ravel()        

rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test  = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_train  = mean_absolute_error(y_train, y_pred_train)
mae_test   = mean_absolute_error(y_test, y_pred_test)
r2_train   = r2_score(y_train, y_pred_train)
r2_test    = r2_score(y_test, y_pred_test)

print("\n--- Results ---")
print("TRAIN:")
print(f"  RMSE: {rmse_train:.4f}")
print(f"  MAE:  {mae_train:.4f}")
print(f"  R2:   {r2_train:.4f}")

print("\nTEST:")
print(f"  RMSE: {rmse_test:.4f}")
print(f"  MAE:  {mae_test:.4f}")
print(f"  R2:   {r2_test:.4f}")



CONFIGURATION: LAMA 0.4.2 (memory-conscious, GPU)

Training LAMA (memory-conscious)...
[18:04:23] Stdout logging level is INFO.
[18:04:23] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[18:04:23] Task: reg

[18:04:23] Start automl preset with listed constraints:
[18:04:23] - time: 600.00 seconds
[18:04:23] - CPU: 2 cores
[18:04:23] - memory: 16 GB

[18:04:23] Train data shape: (58296880, 48)

[18:04:57] Layer 1 train process start. Time left 566.52 secs
[18:55:00] Selector_LightGBM fitting and predicting completed
[18:55:05] Start fitting Lvl_0_Pipe_0_Mod_0_LightGBM ...
[19:41:48] Time limit exceeded after calculating fold 0

[19:41:48] Fitting Lvl_0_Pipe_0_Mod_0_LightGBM finished. score = -0.04558352753520012
[19:41:48] Lvl_0_Pipe_0_Mod_0_LightGBM fitting and predicting completed
[19:41:48] Time left -5245.13 secs

[19:41:48] Time limit exceeded. Last level models will be blen
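
The timestamps in the log above give the actual wall-clock cost of this run. A quick arithmetic check against the configured `timeout=600`, using only the first and last timestamps shown in the log:

```python
from datetime import datetime

# First and last timestamps from the training log above.
start = datetime.strptime("18:04:23", "%H:%M:%S")
end = datetime.strptime("19:41:48", "%H:%M:%S")

elapsed = end - start
timeout_s = 600  # value passed to TabularAutoML in the cell above

overrun_s = elapsed.total_seconds() - timeout_s
print(f"elapsed: {elapsed} ({elapsed.total_seconds():.0f} s)")
print(f"budget:  {timeout_s} s, overrun: {overrun_s:.0f} s "
      f"({elapsed.total_seconds() / timeout_s:.1f}x the budget)")
```

The overrun agrees with the `Time left -5245.13 secs` line. The `Time limit exceeded after calculating fold 0` message suggests the limit is only checked between steps such as folds, so a single long step on 58.3M rows can run far past it.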

In [7]:
import joblib

joblib.dump(automl, "lama_automl.joblib")


['lama_automl.joblib']

## 5. LAMA Configuration 2: Improved, with Hyperparameter Tuning

In [None]:
print("\n" + "="*60)
print("CONFIGURATION 2: LAMA with Custom Hyperparameters")
print("="*60)

# Create AutoML with extended timeout for deeper tuning
automl_config2 = TabularAutoML(
    task=task,
    timeout=900,  # 15 minutes (longer for better tuning)
    cpu_limit=2,
    gpu_ids="all",     # use all available GPUs
    general_params={
        "use_algos": [["cb", "cb_tuned"]],
    },
)

print("\nTraining LAMA Configuration 2...")
start_time = datetime.now()

# Fit model with extended search
_ = automl_config2.fit_predict(train_df, roles=roles, verbose=1)

training_time_config2 = datetime.now() - start_time
print(f"Training completed in {training_time_config2}")

joblib.dump(automl_config2, "lama_automl_config2.joblib")
# Make predictions
y_pred_train_config2 = automl_config2.predict(X_train_lama).data.ravel()
y_pred_test_config2 = automl_config2.predict(X_test_lama).data.ravel()

# Calculate metrics
rmse_train_config2 = np.sqrt(mean_squared_error(y_train, y_pred_train_config2))
rmse_test_config2 = np.sqrt(mean_squared_error(y_test, y_pred_test_config2))
mae_train_config2 = mean_absolute_error(y_train, y_pred_train_config2)
mae_test_config2 = mean_absolute_error(y_test, y_pred_test_config2)
r2_train_config2 = r2_score(y_train, y_pred_train_config2)
r2_test_config2 = r2_score(y_test, y_pred_test_config2)

print(f"\n--- Configuration 2 Results ---")
print(f"TRAIN SET:")
print(f"  RMSE: {rmse_train_config2:.4f}")
print(f"  MAE:  {mae_train_config2:.4f}")
print(f"  R2:   {r2_train_config2:.4f}")

print(f"\nTEST SET:")
print(f"  RMSE: {rmse_test_config2:.4f}")
print(f"  MAE:  {mae_test_config2:.4f}")
print(f"  R2:   {r2_test_config2:.4f}")

In [9]:
# Load the saved models
automl_loaded_default = joblib.load('lama_automl.joblib')
automl_loaded_config2 = joblib.load('lama_automl_config2.joblib')

# Check that both models produce predictions on our data

print("Checking lama_automl.joblib (Config 1):")
y_pred_train_loaded1 = automl_loaded_default.predict(X_train_lama).data.ravel()
y_pred_test_loaded1 = automl_loaded_default.predict(X_test_lama).data.ravel()

print("TRAIN SET:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train_loaded1)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_train, y_pred_train_loaded1):.4f}")
print(f"  R2:   {r2_score(y_train, y_pred_train_loaded1):.4f}")

print("\nTEST SET:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test_loaded1)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred_test_loaded1):.4f}")
print(f"  R2:   {r2_score(y_test, y_pred_test_loaded1):.4f}")

print("\n--------------------------------------")

print("Checking lama_automl_config2.joblib (Config 2):")
y_pred_train_loaded2 = automl_loaded_config2.predict(X_train_lama).data.ravel()
y_pred_test_loaded2 = automl_loaded_config2.predict(X_test_lama).data.ravel()

print("TRAIN SET:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train_loaded2)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_train, y_pred_train_loaded2):.4f}")
print(f"  R2:   {r2_score(y_train, y_pred_train_loaded2):.4f}")

print("\nTEST SET:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test_loaded2)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred_test_loaded2):.4f}")
print(f"  R2:   {r2_score(y_test, y_pred_test_loaded2):.4f}")

Checking lama_automl.joblib (Config 1):
TRAIN SET:
  RMSE: 0.1131
  MAE:  0.0152
  R2:   0.9991

TEST SET:
  RMSE: 0.0684
  MAE:  0.0175
  R2:   0.9996

--------------------------------------
Checking lama_automl_config2.joblib (Config 2):
TRAIN SET:
  RMSE: 0.5253
  MAE:  0.0656
  R2:   0.9816

TEST SET:
  RMSE: 0.3353
  MAE:  0.0713
  R2:   0.9915
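
Scores this close to a perfect fit are easier to judge next to a naive reference forecast. Below is a sketch of a seasonal-naive baseline (predict the value observed 7 days earlier) scored with the same three metrics. The metric functions are written in NumPy to keep the block self-contained, and the data is synthetic; none of the numbers it prints come from this notebook's results.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic daily sales for one series: Poisson noise around a weekly pattern.
rng = np.random.default_rng(42)
weekly_pattern = np.array([1, 1, 1, 1, 2, 4, 3], dtype=float)
expected = np.tile(weekly_pattern, 20)            # 140 days
sales = rng.poisson(expected).astype(float)

# Seasonal naive: the forecast for day t is the observation at day t-7.
y_true = sales[7:]
y_naive = sales[:-7]

print(f"seasonal naive  RMSE: {rmse(y_true, y_naive):.4f}  "
      f"MAE: {mae(y_true, y_naive):.4f}  R2: {r2(y_true, y_naive):.4f}")
```

Running the same comparison on the real test window would show how much of the reported accuracy goes beyond what a lag-7 copy already achieves.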


## 6. Baseline Results and Next Steps

### Summary

- **Configuration 1:** baseline LAMA with two models (`cb` + `lgb`).
- **Configuration 2:** a lighter baseline run with CatBoost tuning (`cb_tuned`).

**Selection rule:** the best configuration is the one with the lowest Test RMSE.

### Configuration Comparison

| Metric        | Config 1   | Config 2   |
|---------------|-----------|-----------|
| Train RMSE    | 0.1131    | 0.5253    |
| Test RMSE     | 0.0684    | 0.3353    |
| Train MAE     | 0.0152    | 0.0656    |
| Test MAE      | 0.0175    | 0.0713    |
| Train R2      | 0.9991    | 0.9816    |
| Test R2       | 0.9996    | 0.9915    |
| Training Time | 1:37:25   | 0:45:33   |

**Best configuration**  
Best: Configuration 1

> **Reason:** lower Test RMSE (0.0684 vs. 0.3353), i.e. it generalizes better on the test set.
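
The selection rule (lowest Test RMSE) can be applied mechanically to the comparison table. A minimal sketch; the dictionary layout and key names are illustrative, while the numbers are copied from the table above:

```python
# Metrics copied from the comparison table.
results = {
    "Config 1": {"train_rmse": 0.1131, "test_rmse": 0.0684, "test_mae": 0.0175, "test_r2": 0.9996},
    "Config 2": {"train_rmse": 0.5253, "test_rmse": 0.3353, "test_mae": 0.0713, "test_r2": 0.9915},
}

# Pick the configuration with the lowest Test RMSE.
best_name = min(results, key=lambda name: results[name]["test_rmse"])

# How much worse the other configuration is on the same metric.
ratio = results["Config 2"]["test_rmse"] / results["Config 1"]["test_rmse"]

print(f"best: {best_name}; Config 2 Test RMSE is {ratio:.1f}x that of Config 1")
```

Keeping the metrics in one structure also makes it easy to add further configurations without rewriting the comparison.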

**Model saved:** `lama_automl.joblib` (Configuration 1, written by the `joblib.dump` call in Section 4)

### What Configuration 2 Changes

The model set changes: instead of `cb` + `lgb`, it uses `cb` + `cb_tuned` (adding CatBoost hyperparameter tuning).

**Goal:** potentially improve quality through tuning. In this run it did not help: Test RMSE rose from 0.0684 (Configuration 1) to 0.3353 (Configuration 2).