**1. Answering the questions**

**Leave-One-Out Cross-Validation (LOOCV)**

How it works:

Pick exactly one observation for testing and use the remaining n-1 for training.

Train the model on those n-1 observations and test it on the single held-out one.

Repeat the process n times, so that each observation serves as the test point exactly once.

The mean of the metrics over all iterations is the estimate of model quality.

Pros:
The data is used to the fullest, the estimate does not depend on a random or unlucky split, and the error estimate has low bias because each model is trained on almost the entire dataset (n-1 of n observations).

Cons:
It is computationally expensive: the model must be trained n times, which can take a long time. The estimate also has high variance, since each test score depends on a single observation and the n training sets overlap almost completely.
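The LOOCV loop can be sketched in a few lines. This is a minimal illustration, assuming an ordinary least squares model and synthetic data; the helper name `loocv_mse` and the demo arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def loocv_mse(X: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out CV estimate of the MSE of ordinary least squares."""
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add an intercept column
    squared_errors = []
    for i in range(n):
        train = np.arange(n) != i           # boolean mask: every row except the i-th
        w, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        pred = Xb[i] @ w                    # predict the single held-out row
        squared_errors.append((y[i] - pred) ** 2)
    return float(np.mean(squared_errors))   # average over the n iterations

# Synthetic data (assumption): a noisy linear relationship.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(30, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + demo_rng.normal(scale=0.1, size=30)

loocv_score = loocv_mse(X_demo, y_demo)
print("LOOCV MSE:", loocv_score)
```

The model is refit n times here, which is exactly the cost listed under the cons.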

**Grid Search**

Define a grid of values for each parameter.

The model is trained with every combination of parameters from the grid, and the metrics are computed for each one.

The model with the best parameter combination according to the metrics is selected.
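A grid search can be sketched as a loop over the Cartesian product of the parameter values. The example below is only an illustration under assumptions: a closed-form ridge regression on polynomial features, a single hold-out validation set, and synthetic data. The names `fit_ridge`, `poly_features`, and `grid_search` are hypothetical.

```python
from itertools import product
import numpy as np

def fit_ridge(A, y, alpha):
    """Closed-form ridge regression weights for design matrix A."""
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def grid_search(x_tr, y_tr, x_va, y_va, param_grid):
    names = list(param_grid)
    results = []
    # Try every combination of values from the grid.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        w = fit_ridge(poly_features(x_tr, params["degree"]), y_tr, params["alpha"])
        pred = poly_features(x_va, params["degree"]) @ w
        results.append((float(np.mean(np.abs(y_va - pred))), params))
    best_mae, best_params = min(results, key=lambda item: item[0])
    return best_params, best_mae, results

# Synthetic quadratic data (assumption).
demo_rng = np.random.default_rng(0)
x_demo = demo_rng.uniform(-2, 2, size=100)
y_demo = 1.0 + 0.5 * x_demo - x_demo ** 2 + demo_rng.normal(scale=0.1, size=100)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "degree": [1, 2, 3]}
best_params, best_mae, results = grid_search(
    x_demo[:60], y_demo[:60], x_demo[60:], y_demo[60:], param_grid
)
print(best_params, best_mae)
```

The number of fits equals the product of the grid sizes (here 4 x 3 = 12), which is why the grid has to stay small.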

**Randomized Grid Search**

Same as Grid Search, but the parameter combinations are sampled at random, and the number of combinations evaluated is set by the <i>n_iter</i> parameter.
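For comparison, a randomized search draws a fixed number of combinations instead of enumerating them all. The sketch below assumes the same kind of toy setup (ridge regression on polynomial features, hold-out validation, synthetic data); `validation_mae` and `randomized_search` are hypothetical helper names, and the sampling ranges are arbitrary choices.

```python
import numpy as np

def validation_mae(alpha, degree, x_tr, y_tr, x_va, y_va):
    """Fit ridge regression on polynomial features and return the hold-out MAE."""
    A = np.vander(x_tr, degree + 1, increasing=True)
    w = np.linalg.solve(A.T @ A + alpha * np.eye(degree + 1), A.T @ y_tr)
    pred = np.vander(x_va, degree + 1, increasing=True) @ w
    return float(np.mean(np.abs(y_va - pred)))

def randomized_search(x_tr, y_tr, x_va, y_va, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_iter):                          # n_iter bounds the cost
        params = {
            "alpha": float(10 ** rng.uniform(-3, 1)),  # log-uniform in [1e-3, 10]
            "degree": int(rng.integers(1, 6)),         # integer in 1..5
        }
        score = validation_mae(x_tr=x_tr, y_tr=y_tr, x_va=x_va, y_va=y_va, **params)
        trials.append((score, params))
    best_score, best_params = min(trials, key=lambda item: item[0])
    return best_params, best_score, trials

# Synthetic quadratic data (assumption).
demo_rng = np.random.default_rng(0)
x_demo = demo_rng.uniform(-2, 2, size=100)
y_demo = 1.0 + 0.5 * x_demo - x_demo ** 2 + demo_rng.normal(scale=0.1, size=100)

rs_params, rs_score, rs_trials = randomized_search(
    x_demo[:60], y_demo[:60], x_demo[60:], y_demo[60:], n_iter=10, seed=0
)
print(rs_params, rs_score)
```

Fixing `seed` makes the sampled combinations reproducible between runs.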

**Bayesian Optimization**

Evaluates a few random parameter combinations, then fits a "draft" (surrogate) model of the metric on the results. The next combination is chosen by an acquisition function, which balances exploration against exploitation; that combination is evaluated and the surrogate is updated. The last two steps repeat until the iteration budget is exhausted.
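The loop can be sketched for a single parameter. This is a simplified illustration, not how any particular library implements it: it assumes a Gaussian-process surrogate with a fixed RBF kernel, the expected-improvement acquisition function, and a made-up objective `toy_objective` standing in for a validation error. All function names are hypothetical.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_new."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """Expected improvement over `best` when minimizing."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(objective, low, high, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    x_obs = rng.uniform(low, high, size=n_init)       # random warm-up points
    y_obs = np.array([objective(x) for x in x_obs])
    candidates = np.linspace(low, high, 200)
    for _ in range(n_iter):                           # iteration budget
        # Standardize targets so the zero-mean prior is reasonable.
        y_std = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-9)
        mu, sigma = gp_posterior(x_obs, y_std, candidates)
        x_next = candidates[int(np.argmax(expected_improvement(mu, sigma, y_std.min())))]
        x_obs = np.append(x_obs, x_next)              # evaluate and update the data
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmin(y_obs))
    return float(x_obs[best]), float(y_obs[best]), x_obs, y_obs

def toy_objective(log_alpha):
    """Made-up validation error as a function of log10(alpha)."""
    return (log_alpha + 0.7) ** 2 + 0.1 * math.sin(5.0 * log_alpha)

bo_x, bo_y, bo_xs, bo_ys = bayes_opt(toy_objective, low=-3.0, high=1.0)
print("best log10(alpha):", bo_x, "objective:", bo_y)
```

Expected improvement is large where the surrogate predicts a low value (exploitation) or is uncertain (exploration), which is the balance described above.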

**Classification of feature selection methods**

Filters: score each feature with a statistic and keep the top-k.

Wrappers: train the model repeatedly while searching over feature subsets.

Embedded: the model selects features during training (e.g., Lasso, Elastic Net).

Post-hoc (model-agnostic): explain an already trained model (Permutation Importance, SHAP).

**Pearson**

$$
r_{X,Y} \;=\;
\frac{\displaystyle \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}
{\displaystyle \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i-\bar{y})^2}}
$$

The coefficient measures how strongly two variables vary together linearly: the stronger the positive correlation, the closer r is to 1, and the stronger the negative correlation, the closer r is to -1. Values near 0 indicate no linear relationship.
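The formula translates directly into NumPy. The sketch below uses made-up bedroom counts and prices for illustration; `pearson_r` is a hypothetical helper name, and the result is compared against `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed term by term from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()          # deviations from the means
    return float((dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())))

# Made-up values (assumption), not taken from the dataset.
demo_bedrooms = np.array([0, 1, 1, 2, 2, 3], dtype=float)
demo_price = np.array([2100, 2400, 2600, 3400, 3600, 4300], dtype=float)

r_demo = pearson_r(demo_bedrooms, demo_price)
print("manual r:", r_demo, "| np.corrcoef:", np.corrcoef(demo_bedrooms, demo_price)[0, 1])
```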

**Chi2**

$$
\chi^2 \;=\;
\sum_{i=1}^{n} \frac{(O_i-E_i)^2}{E_i}
$$

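The test checks whether class frequencies depend on the values of the feature, where $O_i$ are the observed frequencies and $E_i$ the frequencies expected under independence. The larger the chi2 value, the more useful the feature.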
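As a worked example, the statistic can be computed from a contingency table of feature value versus class. The counts below are made up, and `chi2_statistic` is a hypothetical helper; expected frequencies are derived from the row and column totals under the independence assumption.

```python
import numpy as np

def chi2_statistic(observed):
    """Chi-square statistic and expected counts for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()   # E under independence
    stat = float(((observed - expected) ** 2 / expected).sum())
    return stat, expected

# Rows: feature absent / present; columns: two classes (made-up counts).
dependent_table = [[30, 10], [10, 30]]
independent_table = [[10, 20], [20, 40]]

chi2_dep, expected_dep = chi2_statistic(dependent_table)
chi2_ind, _ = chi2_statistic(independent_table)
print("dependent:", chi2_dep, "| independent:", chi2_ind)
```

In the first table every expected count is 20 and every cell deviates by 10, so the statistic is 4 x 100 / 20 = 20. In the second table the rows are proportional, so the statistic is 0.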

**Lasso**

$$
\min_{w,b} \; \frac{1}{2n}\,\displaystyle\sum_{i=1}^{n}\!\bigl(y_i-(x_i^\top w + b)\bigr)^2 \;+\; \alpha\,\displaystyle\sum_{j=1}^{p} |w_j|
$$

The L1 penalty shrinks all weights toward zero and sets the weights of weak features exactly to zero. Only the weights that noticeably reduce the loss remain nonzero; $\alpha$ controls the strength of the penalty.
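The zeroing effect can be seen in a small coordinate-descent solver for the objective above. This is a sketch under assumptions (centered data to handle the intercept, a fixed number of sweeps, synthetic data), not the solver scikit-learn uses internally; `soft_threshold` and `lasso_cd` are hypothetical names.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Minimize (1/2n)||y - Xw - b||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centering absorbs the intercept b
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            residual = yc - Xc @ w + Xc[:, j] * w[j]   # residual without feature j
            rho = Xc[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w, float(y_mean - x_mean @ w)

# Synthetic data (assumption): feature 1 has no effect on the target.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 2] + demo_rng.normal(scale=0.1, size=200)

w_lasso, b_lasso = lasso_cd(X_demo, y_demo, alpha=0.2)
# Smallest alpha at which every weight is zero.
alpha_max = np.abs((X_demo - X_demo.mean(axis=0)).T @ (y_demo - y_demo.mean())).max() / len(y_demo)
w_all_zero, _ = lasso_cd(X_demo, y_demo, alpha=1.01 * alpha_max)
print("weights:", w_lasso, "| weights at large alpha:", w_all_zero)
```

The weight of the irrelevant feature is set exactly to zero, while the strong feature keeps a slightly shrunken weight.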

**Permutation Importance**

A way to measure how strongly the model relies on a feature: if shuffling the values of a feature across observations degrades model quality, the feature is important.

Take a trained model and compute its metrics on the validation set, shuffle the values of one feature across rows, and recompute the metrics. The importance is the difference between the baseline metrics and the metrics after shuffling.
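The procedure can be sketched without any library support. The example assumes MSE as the metric, a least-squares model, and synthetic data in which only the first feature matters; `permutation_importance_mse` is a hypothetical helper, unrelated to the scikit-learn function used later in the notebook.

```python
import numpy as np

def permutation_importance_mse(predict, X, y, n_repeats=10, seed=0):
    """Increase in MSE after shuffling each column, averaged over repeats."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)             # baseline metric
    importances = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one feature across rows
            importances[j, r] = np.mean((y - predict(X_perm)) ** 2) - base
    return importances.mean(axis=1), importances.std(axis=1)

# Synthetic data (assumption): only feature 0 influences the target.
demo_rng = np.random.default_rng(0)
X_all = demo_rng.normal(size=(400, 2))
y_all = 3.0 * X_all[:, 0] + demo_rng.normal(scale=0.5, size=400)

# Fit on the first half, measure importance on the held-out second half.
w_ols, *_ = np.linalg.lstsq(X_all[:200], y_all[:200], rcond=None)
imp_mean, imp_std = permutation_importance_mse(lambda A: A @ w_ols, X_all[200:], y_all[200:])
print("importance mean:", imp_mean, "| std:", imp_std)
```

Because the metric here is an error, a positive difference means the model got worse after shuffling, which marks the feature as important.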

**SHAP**

SHapley Additive exPlanations

A way to explain model predictions through feature contributions based on Shapley values from cooperative game theory.

$$
f(x) \;=\; \mathbb{E}[f(X)] \;+\; \sum_{j=1}^{p} \phi_j(x)
$$

where $\phi_j(x)$ is the contribution of feature $j$ to the prediction $f(x)$. If the contribution is > 0, the feature pushes the prediction above the mean prediction, and vice versa.
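For a linear model with features treated as independent, the contributions have a closed form, $\phi_j(x) = w_j\,(x_j - \mathbb{E}[x_j])$, which makes the additivity property easy to check. The sketch below assumes that setting; the weights, data, and the helper name `linear_shap_values` are made up for illustration.

```python
import numpy as np

def linear_shap_values(w, X_background, x):
    """Contributions of each feature for f(x) = w @ x + b, assuming independent features."""
    return w * (x - X_background.mean(axis=0))

# Made-up linear model and background data (assumption).
demo_rng = np.random.default_rng(0)
X_bg = demo_rng.normal(size=(100, 3))
w_demo = np.array([2.0, -1.0, 0.0])
b_demo = 0.5

x_one = X_bg[0]
phi = linear_shap_values(w_demo, X_bg, x_one)

f_x = float(w_demo @ x_one + b_demo)              # prediction for one observation
mean_prediction = float((X_bg @ w_demo + b_demo).mean())   # E[f(X)] over the background

print("f(x):", f_x)
print("E[f(X)] + sum(phi):", mean_prediction + phi.sum())
```

A feature with zero weight gets a zero contribution, and the contributions sum to the gap between the prediction and the mean prediction.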

In [1]:
from __future__ import annotations

from pathlib import Path
from time import perf_counter
from typing import Any, List, Tuple, Union  # used in type hints in later cells
import warnings

import numpy as np
import pandas as pd

from sklearn.model_selection import (
    train_test_split,
    KFold,
    GroupKFold,
    StratifiedKFold,
    TimeSeriesSplit,
    cross_val_score,
)
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, get_scorer
from sklearn.inspection import permutation_importance

from scipy.stats import ks_2samp, chi2_contingency
try:
    import optuna 
except Exception:
    optuna = None

try:
    import shap 
except Exception:
    shap = None

warnings.filterwarnings("ignore")

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [2]:
try:
    from sklearn.metrics import root_mean_squared_error
except Exception:
    root_mean_squared_error = None

def _rmse(y_true, y_pred):
    if root_mean_squared_error is not None:
        return float(root_mean_squared_error(y_true, y_pred))
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def evaluate_model(model, X, y):
    pred = model.predict(X)
    return {
        "MAE": float(mean_absolute_error(y, pred)),
        "RMSE": _rmse(y, pred),
        "R2": float(r2_score(y, pred)),
    }

In [3]:
DATA_DIR = Path("data")

train_path = DATA_DIR / "train.json"

if not train_path.exists():
    alt = Path("..") / "datasets" / "data" / "train.json"
    if alt.exists():
        train_path = alt

if not train_path.exists():
    raise FileNotFoundError(
        "data/train.json not found"
    )

df_train = pd.read_json(train_path)

display(df_train.head())
print("dataset shape:", df_train.shape)


Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,medium
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,low


dataset shape: (49352, 15)


In [4]:
lower_bound = df_train['price'].quantile(0.01)
upper_bound = df_train['price'].quantile(0.99)
# .copy() makes the filtered frame independent, so the column assignment below is safe
df_train = df_train[(df_train['price'] >= lower_bound) & (df_train['price'] <= upper_bound)].copy()
if "interest_level" in df_train.columns:
    mapping = {"low": 0, "medium": 1, "high": 2}
    df_train["interest_level"] = df_train["interest_level"].map(mapping).astype("int64")
    display(df_train["interest_level"].value_counts())
else:
    print("Column interest_level not found")


interest_level
0    33697
1    11116
2     3566
Name: count, dtype: int64

In [5]:
feature_list = ['Elevator', 'HardwoodFloors', 'CatsAllowed', 
                'DogsAllowed', 'Doorman', 'Dishwasher', 'NoFee', 
                'LaundryinBuilding', 'FitnessCenter', 'Pre-War', 
                'LaundryinUnit', 'RoofDeck', 'OutdoorSpace', 'DiningRoom', 
                'HighSpeedInternet', 'Balcony', 'SwimmingPool', 'LaundryInBuilding', 
                'NewConstruction', 'Terrace']

In [6]:
def _normalize_feature_list(x):
    if x is None:
        return []
    if isinstance(x, (list, tuple, set)):
        return [str(f).replace(" ", "") for f in x]
    return [str(x).replace(" ", "")]

for feature in feature_list:
    df_train[feature] = df_train["features"].apply(
        lambda x: int(feature in _normalize_feature_list(x))
    )

In [7]:
base_num_cols = ["bathrooms", "bedrooms", "interest_level"]
use_cols = base_num_cols + feature_list + ["created", "building_id"]
X = df_train[use_cols].copy()
y = df_train["price"].astype(float).copy()
print("X shape:", X.shape, "| y shape:", y.shape)
display(X.head())


X shape: (48379, 25) | y shape: (48379,)


Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created,building_id
4,1.0,1,1,0,1,1,1,0,1,0,...,0,1,0,0,0,0,0,0,2016-06-16 05:55:27,8579a0b0d54db803821a35a4a615e97a
6,1.0,2,0,1,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,2016-06-01 05:44:33,b8e75fc949a6cd8225b455648a951712
9,1.0,2,1,1,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,2016-06-14 15:19:59,cd759a988b8f23924b5a2058d5ab2b49
10,1.5,3,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2016-06-24 07:54:24,53a5b119ba8f7b61d4e010512e0dfc85
15,1.0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,2016-06-28 03:50:23,bfb9405149bfff42a92980b594c28234


**3. Implementing split methods**

In [8]:
def s21_train_test_split(
    X: pd.DataFrame,
    y: pd.Series,
    test_size: float = 0.2,
    shuffle: bool = True,
    random_state: int = RANDOM_STATE,
):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, shuffle=shuffle, random_state=random_state
    )
    return X_tr, y_tr, X_te, y_te


def train_val_test_split(
    X: pd.DataFrame,
    y: pd.Series,
    validation_size: float = 0.2,
    test_size: float = 0.2,
    shuffle: bool = True,
    random_state: int = RANDOM_STATE,
):
    if validation_size + test_size >= 1:
        raise ValueError("validation_size + test_size must be < 1")

    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, shuffle=shuffle
    )
    val_size_adj = validation_size / (1 - test_size)

    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=val_size_adj, random_state=random_state, shuffle=shuffle
    )

    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

In [9]:
def s21_date_split(
    X: pd.DataFrame,
    y: pd.Series,
    date_split: Union[str, pd.Timestamp],
    date_field: str = "created",
):
    X = X.copy()
    date_col = pd.to_datetime(X[date_field], errors="raise")
    split_ts = pd.to_datetime(date_split)

    mask_train = date_col <= split_ts
    X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
    y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

    return (X_train, y_train), (X_test, y_test)

In [10]:
def s21_val_test_split(
    X: pd.DataFrame,
    y: pd.Series,
    validation_date: Union[str, pd.Timestamp],
    test_date: Union[str, pd.Timestamp],
    date_field: str = "created",
):
    X = X.copy()
    date_col = pd.to_datetime(X[date_field], errors="raise")
    split_val_ts = pd.to_datetime(validation_date)
    split_test_ts = pd.to_datetime(test_date)

    if split_val_ts >= split_test_ts:
        raise ValueError("validation_date must be earlier than test_date")

    mask_train = date_col < split_val_ts
    mask_val = (date_col >= split_val_ts) & (date_col < split_test_ts)
    mask_test = date_col >= split_test_ts

    X_train, y_train = X.loc[mask_train], y.loc[mask_train]
    X_val, y_val = X.loc[mask_val], y.loc[mask_val]
    X_test, y_test = X.loc[mask_test], y.loc[mask_test]

    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

In [11]:
train_samples, val_samples, test_samples = s21_val_test_split(
    X, y,
    validation_date="2016-04-01 23:26:07",
    test_date="2016-06-29 17:56:12",
)
print("train/val/test sizes:", len(train_samples[0]), len(val_samples[0]), len(test_samples[0]))

train/val/test sizes: 3 48372 4


Split determinism: reproducibility of splits for a fixed random_state
A deterministic split procedure produces the same partition of the data whenever it is called with the same parameters. This requires fixing random_state in every random operation and fixing the global seeds.

In [12]:
(tr1, ytr1), (va1, yva1), (te1, yte1) = train_val_test_split(X, y, validation_size=0.2, test_size=0.2, random_state=42, shuffle=True)
(tr2, ytr2), (va2, yva2), (te2, yte2) = train_val_test_split(X, y, validation_size=0.2, test_size=0.2, random_state=42, shuffle=True)

same_train = tr1.index.equals(tr2.index)
same_val   = va1.index.equals(va2.index)
same_test  = te1.index.equals(te2.index)

print("same train:", same_train, "| same val:", same_val, "| same test:", same_test)

same train: True | same val: True | same test: True


**4. Implementing cv methods**

In [13]:
def s21_k_fold_indices(
    X: pd.DataFrame,
    k: int = 5,
    shuffle: bool = True,
    random_state: int = 42
) -> List[Tuple[np.ndarray, np.ndarray]]:

    n_samples = len(X)
    if k < 2 or k > n_samples:
        raise ValueError(f"k must be between 2 and {n_samples}")

    idx = np.arange(n_samples)

    if shuffle:
        rng = np.random.default_rng(random_state)
        rng.shuffle(idx)

    fold_sizes = np.full(k, n_samples // k, dtype=int)
    fold_sizes[: (n_samples % k)] += 1

    folds = []
    start = 0
    for fold_size in fold_sizes:
        stop = start + fold_size
        test_idx = idx[start:stop]
        train_idx = np.concatenate((idx[:start], idx[stop:]))
        folds.append((train_idx, test_idx))
        start = stop

    return folds

In [14]:
def s21_group_k_fold_indices(
    X: pd.DataFrame,
    groups: Union[str, pd.Series, np.ndarray, list],
    k: int = 5,
    shuffle: bool = True,
    random_state: int = 42,
    balance_by: str = "samples"  
) -> List[Tuple[np.ndarray, np.ndarray]]:

    n_samples = len(X)

    if isinstance(groups, str):
        g = X[groups].to_numpy()
    else:
        g = np.asarray(groups)

    uniq, inv = np.unique(g, return_inverse=True) 
    n_groups = len(uniq)
    if k < 2 or k > n_groups:
        raise ValueError(f"k must be between 2 and {n_groups} (unique groups)")

    rows_by_group = [np.where(inv == gi)[0] for gi in range(n_groups)]
    group_sizes = np.array([len(ix) for ix in rows_by_group])

    order = np.arange(n_groups)
    rng = np.random.default_rng(random_state) if shuffle else None
    if balance_by == "samples":
        if shuffle:
            rng.shuffle(order)                 
        order = order[np.argsort(group_sizes[order])[::-1]]  
    else:
        if shuffle:
            rng.shuffle(order)

    fold_bins = [[] for _ in range(k)]
    fold_loads = np.zeros(k, dtype=int)
    for gi in order:
        target = int(np.argmin(fold_loads))
        fold_bins[target].append(gi)
        fold_loads[target] += group_sizes[gi]

    folds: List[Tuple[np.ndarray, np.ndarray]] = []
    all_idx = np.arange(n_samples)
    for fold_groups in fold_bins:
        if fold_groups:
            test_idx = np.concatenate([rows_by_group[gi] for gi in fold_groups])
        else:
            test_idx = np.array([], dtype=int)
        train_idx = np.setdiff1d(all_idx, test_idx, assume_unique=False)
        folds.append((train_idx, test_idx))

    return folds

In [15]:
def s21_stratified_k_fold_indices(
    X: pd.DataFrame,
    stratify_field: Union[str, pd.Series, np.ndarray, list],
    k: int = 5,
    shuffle: bool = True,
    random_state: int = 42
) -> List[Tuple[np.ndarray, np.ndarray]]:

    n_samples = len(X)

    y = X[stratify_field].to_numpy() if isinstance(stratify_field, str) else np.asarray(stratify_field)
    if y.shape[0] != n_samples:
        raise ValueError("len(stratify_field) must be the same as number of lines in X")
    _, counts = np.unique(y, return_counts=True)
    if counts.min() < k:
        raise ValueError(f"Some classes have < {k} samples.")

    rng = np.random.default_rng(random_state) if shuffle else None

    classes = np.unique(y)
    idx_by_class = {}
    for c in classes:
        idx = np.where(y == c)[0]
        if shuffle:
            rng.shuffle(idx)
        idx_by_class[c] = idx

    test_bins: List[List[int]] = [[] for _ in range(k)]

    for c in classes:
        idx = idx_by_class[c]
        m = len(idx)
        base = m // k
        rem = m % k
        sizes = np.array([base + (1 if i < rem else 0) for i in range(k)], dtype=int)
        start = 0
        for i, sz in enumerate(sizes):
            if sz:
                part = idx[start:start+sz]
                test_bins[i].extend(part.tolist())
                start += sz

    folds: List[Tuple[np.ndarray, np.ndarray]] = []
    all_idx = np.arange(n_samples)
    for i in range(k):
        test_idx = np.array(test_bins[i], dtype=int)
        if shuffle and len(test_idx) > 1:
            rng.shuffle(test_idx)
        train_idx = np.setdiff1d(all_idx, test_idx, assume_unique=False)
        folds.append((train_idx, test_idx))

    return folds

In [16]:
def s21_time_series_split_indices(
    X: pd.DataFrame,
    date_field: Union[str, pd.Series, np.ndarray, list],
    k: int = 5,
) -> List[Tuple[np.ndarray, np.ndarray]]:
    
    d = pd.to_datetime(X[date_field] if isinstance(date_field, str)
                       else pd.Series(date_field, index=X.index), errors="coerce")
    valid = d.notna().to_numpy()
    order = np.arange(len(X))[valid][np.argsort(d.to_numpy()[valid], kind="mergesort")]
    m = len(order)
    
    if m < 3:
        return []

    k = int(max(1, min(k, m-1)))
    cuts = [(m * i) // (k + 1) for i in range(k + 1)] + [m]

    folds: List[Tuple[np.ndarray, np.ndarray]] = []
    for i in range(k):
        a = cuts[i+1]
        b = cuts[i+2]
        train_idx = order[:a]
        test_idx  = order[a:b]

        if len(train_idx) and len(test_idx):
            folds.append((train_idx.copy(), test_idx.copy()))

    return folds

**5. Cross-validation comparison**

In [17]:
def kf_stratifier(X, n_splits, shuffle, random_state):
    folder = KFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    folds = [(tr, te) for tr, te in folder.split(np.arange(len(X)))]

    return folds

In [18]:
def gkf_stratifier(X, n_splits, group_name):
    groups = X[group_name].to_numpy()

    folder = GroupKFold(n_splits=n_splits)
    folds = [(tr, te) for tr, te in folder.split(np.arange(len(X)), groups=groups)]

    return folds

In [19]:
def skf_stratifier(X, y_bins, n_splits, shuffle, random_state):
    yb = pd.Series(y_bins, index=X.index)
    mask = yb.notna()

    y_labels = yb.loc[mask].astype(str).to_numpy()
    idx = np.arange(mask.sum())

    folder = StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    tmp = [(tr, te) for tr, te in folder.split(idx, y_labels)]

    orig_idx = np.flatnonzero(mask.to_numpy())
    folds = [(orig_idx[tr], orig_idx[te]) for tr, te in tmp]  

    return folds

In [20]:
def tss_stratifier(X, n_splits):
    d = pd.to_datetime(X["created"], errors="raise")
    order = d.to_numpy().argsort(kind="mergesort")  

    tscv = TimeSeriesSplit(n_splits=n_splits)
    tmp = [(tr, te) for tr, te in tscv.split(np.arange(len(order)))]
    folds = [(order[tr], order[te]) for tr, te in tmp]
    
    return folds

In [21]:
default_folds = s21_k_fold_indices(X, k=5, shuffle=True, random_state=42)

groupped_folds = s21_group_k_fold_indices(X, groups="building_id", k=5, shuffle=True, random_state=42)

y_bins = pd.qcut(np.log1p(y), q=8, duplicates="drop")  # 8 quantile bins
stratified_folds = s21_stratified_k_fold_indices(X, stratify_field=y_bins, k=5, shuffle=True, random_state=42)

times_split_folds = s21_time_series_split_indices(X, date_field="created", k=5)

kf = kf_stratifier(X, n_splits=5, shuffle=True, random_state=42)

gkf = gkf_stratifier(X, n_splits=5, group_name="building_id")

skf = skf_stratifier(X, y_bins, n_splits=5, shuffle=True, random_state=42)

tss = tss_stratifier(X, n_splits=5)

In [22]:
def _to_hashable(x: Any) -> Any:
    if isinstance(x, np.ndarray):
        return tuple(_to_hashable(v) for v in x.tolist())
    if isinstance(x, (list, tuple)):
        return tuple(_to_hashable(v) for v in x)
    if isinstance(x, set):
        return tuple(sorted(_to_hashable(v) for v in x))
    if isinstance(x, dict):
        return tuple(sorted((k, _to_hashable(v)) for k, v in x.items()))
    return x


def _safe_cat_series(s: pd.Series) -> pd.Series:
    def _is_hashable(v):
        try:
            hash(v)
            return True
        except TypeError:
            return False

    out = s.copy()
    out = out.where(~out.isna(), "__NaN__")
    sample = out.dropna().iloc[0] if out.size and out.dropna().size else None
    if sample is not None and _is_hashable(sample):
        return out
    return out.map(lambda v: "__NaN__" if v == "__NaN__" else _to_hashable(v))


def compare_train_distribution(
    X: pd.DataFrame,
    tr_a, 
    tr_b,  
    topn: int = 15,
    cat_max_card: int = 50
) -> pd.DataFrame:
    
    A = X.iloc[tr_a] if isinstance(tr_a, (np.ndarray, list, tuple)) else tr_a
    B = X.iloc[tr_b] if isinstance(tr_b, (np.ndarray, list, tuple)) else tr_b

    num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]

    rows: List[Tuple] = []

    for c in num_cols:
        a = A[c].to_numpy(dtype=float)
        b = B[c].to_numpy(dtype=float)
        a_f = a[~np.isnan(a)]; b_f = b[~np.isnan(b)]
        if len(a_f) and len(b_f):
            ks = ks_2samp(a_f, b_f, alternative="two-sided", mode="auto")
            ks_stat, ks_p = float(ks.statistic), float(ks.pvalue)
            mean_A = float(np.nanmean(a)); mean_B = float(np.nanmean(b))
            std_A  = float(np.nanstd(a, ddof=1)) if len(a_f) > 1 else np.nan
            std_B  = float(np.nanstd(b, ddof=1)) if len(b_f) > 1 else np.nan
        else:
            ks_stat = ks_p = mean_A = mean_B = std_A = std_B = np.nan
        rows.append(("numeric", c, mean_A, mean_B, std_A, std_B, ks_stat, ks_p))

    for c in cat_cols:
        sA = _safe_cat_series(A[c])
        sB = _safe_cat_series(B[c])

        if max(sA.nunique(dropna=False), sB.nunique(dropna=False)) > cat_max_card:
            rows.append(("categorical(HIGH_CARD)", c, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan))
            continue

        vc_a = sA.value_counts(dropna=False)
        vc_b = sB.value_counts(dropna=False)
        cats = sorted(set(vc_a.index).union(vc_b.index), key=lambda x: str(x))

        ca = vc_a.reindex(cats, fill_value=0).to_numpy()
        cb = vc_b.reindex(cats, fill_value=0).to_numpy()

        try:
            _, chi2_p, _, _ = chi2_contingency(np.vstack([ca, cb]), correction=False)
        except Exception:
            chi2_p = np.nan

        pa = ca / max(1, ca.sum()); pb = cb / max(1, cb.sum())
        l1 = 0.5 * float(np.abs(pa - pb).sum())
        rows.append(("categorical", c, np.nan, np.nan, np.nan, np.nan, l1, chi2_p))

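    # "stat" holds the KS statistic (numeric) or the total variation distance (categorical);
    # "p_or_l1" holds the p-value of the KS or chi-square test, respectively.
    res = pd.DataFrame(rows, columns=[
        "type","feature","mean_A","mean_B","std_A","std_B","stat","p_or_l1"
    ])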


    def _score(row):
        if row["type"] == "numeric":
            return np.nan_to_num(row["stat"], nan=0.0)
        elif row["type"] == "categorical":
            return np.nan_to_num(row["stat"], nan=0.0)
        return 0.0
    res["_score"] = res.apply(_score, axis=1)


    return res.sort_values("_score", ascending=False).drop(columns="_score").head(topn)

In [23]:
tr_a, _ = default_folds[0]
tr_b, _ = kf[0]

compare_train_distribution(X, tr_a, tr_b)

Unnamed: 0,type,feature,mean_A,mean_B,std_A,std_B,stat,p_or_l1
7,numeric,Doorman,0.425858,0.422241,0.494479,0.493923,0.003617,0.961062
5,numeric,CatsAllowed,0.479627,0.476397,0.499591,0.499449,0.00323,0.987259
1,numeric,bedrooms,1.533318,1.538589,1.102011,1.100971,0.003204,0.988358
6,numeric,DogsAllowed,0.448518,0.445831,0.497349,0.497063,0.002687,0.998965
11,numeric,FitnessCenter,0.270806,0.26817,0.444382,0.443013,0.002635,0.999254
3,numeric,Elevator,0.525334,0.523758,0.499364,0.499442,0.001576,1.0
0,numeric,bathrooms,1.195799,1.195411,0.455991,0.457543,0.001473,1.0
10,numeric,LaundryinBuilding,0.332145,0.330853,0.470989,0.470526,0.001292,1.0
2,numeric,interest_level,0.375061,0.377258,0.616753,0.618699,0.001266,1.0
19,numeric,SwimmingPool,0.055732,0.054492,0.229407,0.226989,0.00124,1.0


In [24]:
tr_a, _ = groupped_folds[0]
tr_b, _ = gkf[0]

compare_train_distribution(X, tr_a, tr_b)

Unnamed: 0,type,feature,mean_A,mean_B,std_A,std_B,stat,p_or_l1
1,numeric,bedrooms,1.554169,1.551223,1.105305,1.107127,0.002222,0.999978
6,numeric,DogsAllowed,0.427228,0.425238,0.494682,0.494385,0.00199,0.999999
10,numeric,LaundryinBuilding,0.380927,0.381883,0.485621,0.485854,0.000956,1.0
5,numeric,CatsAllowed,0.461411,0.460481,0.498515,0.498442,0.00093,1.0
9,numeric,NoFee,0.398574,0.39785,0.489611,0.489461,0.000723,1.0
0,numeric,bathrooms,1.207452,1.206922,0.467881,0.468495,0.000594,1.0
16,numeric,DiningRoom,0.116994,0.117459,0.321417,0.32197,0.000465,1.0
7,numeric,Doorman,0.464073,0.464538,0.498714,0.498747,0.000465,1.0
12,numeric,Pre-War,0.149394,0.149833,0.356481,0.356913,0.000439,1.0
20,numeric,LaundryInBuilding,0.053458,0.053097,0.224949,0.224229,0.000362,1.0


In [25]:
tr_a, _ = stratified_folds[0]
tr_b, _ = skf[0]

compare_train_distribution(X, tr_a, tr_b)

Unnamed: 0,type,feature,mean_A,mean_B,std_A,std_B,stat,p_or_l1
5,numeric,CatsAllowed,0.477649,0.475105,0.499507,0.499386,0.002543,0.999605
1,numeric,bedrooms,1.536486,1.533705,1.100537,1.102983,0.002415,0.999856
6,numeric,DogsAllowed,0.446951,0.444772,0.497184,0.496947,0.002179,0.999986
12,numeric,Pre-War,0.18677,0.185438,0.389732,0.388657,0.001332,1.0
13,numeric,LaundryinUnit,0.174522,0.175852,0.379563,0.380699,0.00133,1.0
4,numeric,HardwoodFloors,0.477571,0.478878,0.499503,0.49956,0.001307,1.0
0,numeric,bathrooms,1.195116,1.194416,0.457817,0.456618,0.000866,1.0
2,numeric,interest_level,0.376718,0.378291,0.61789,0.619362,0.000855,1.0
15,numeric,OutdoorSpace,0.105633,0.106426,0.307371,0.308386,0.000793,1.0
19,numeric,SwimmingPool,0.055168,0.054389,0.228311,0.226786,0.000779,1.0


In [26]:
tr_a, _ = times_split_folds[0]
tr_b, _ = tss[0]

compare_train_distribution(X, tr_a, tr_b)

Unnamed: 0,type,feature,mean_A,mean_B,std_A,std_B,stat,p_or_l1
2,numeric,interest_level,0.381744,0.381944,0.626167,0.626388,0.000114,1.0
12,numeric,Pre-War,0.171524,0.171627,0.37699,0.377079,0.000103,1.0
6,numeric,DogsAllowed,0.441275,0.441344,0.49657,0.496578,6.9e-05,1.0
5,numeric,CatsAllowed,0.472777,0.472842,0.499289,0.499293,6.5e-05,1.0
3,numeric,Elevator,0.518666,0.518601,0.499682,0.499685,6.4e-05,1.0
4,numeric,HardwoodFloors,0.490016,0.489955,0.499931,0.49993,6.1e-05,1.0
1,numeric,bedrooms,1.503535,1.503472,1.085392,1.085339,5.9e-05,1.0
7,numeric,Doorman,0.424036,0.423983,0.494226,0.494218,5.3e-05,1.0
8,numeric,Dishwasher,0.417835,0.417783,0.493233,0.493225,5.2e-05,1.0
9,numeric,NoFee,0.37145,0.371404,0.483222,0.48321,4.6e-05,1.0


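**Conclusion**

The train-fold distributions produced by the custom splitters and by their scikit-learn counterparts match (KS statistics below 0.004 with p-values close to 1 in every comparison), which indicates that the algorithms are implemented correctly.
TimeSeriesSplit is the best choice here: the listings are timestamped, and it keeps every test fold later in time than its training data.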

def build_pipeline(features, X, y, alpha):
    # Scale the selected features to [0, 1] and drop every other column.
    preprocess = ColumnTransformer(
        transformers=[
            ("num", Pipeline([
                ("sc", MinMaxScaler()),
            ]), features),
        ],
        remainder="drop"
    )

    # alpha is the Lasso regularization strength, e.g. the best_alpha selected on validation.
    model = Pipeline([
        ("sc", preprocess),
        ("lasso", Lasso(alpha=alpha, max_iter=10000, random_state=RANDOM_STATE))
    ])

    model_fit = model.fit(X, y)
    return model_fit


In [27]:
speed_results = []

val_ratio = 0.60  
test_ratio = 0.80  

created_dt = pd.to_datetime(X["created"], errors="raise")
val_ts  = created_dt.quantile(val_ratio)
test_ts = created_dt.quantile(test_ratio)

val_date = pd.Timestamp(val_ts).strftime("%Y-%m-%d %H:%M:%S")
test_date = pd.Timestamp(test_ts).strftime("%Y-%m-%d %H:%M:%S")

t0_baseline = perf_counter()
(X_tr, y_tr), (X_val, y_val), (X_te, y_te) = s21_val_test_split(X, y, val_date, test_date)
t1_baseline = perf_counter() - t0_baseline
speed_results.append(t1_baseline)

print("Split dates:", val_date, "|", test_date)
print("Sizes:", len(X_tr), len(X_val), len(X_te))


Split dates: 2016-05-24 16:39:17 | 2016-06-12 08:07:30
Sizes: 29027 9676 9676


In [28]:
pd.options.display.float_format = "{:.6f}".format
created_dt = pd.to_datetime(X["created"], errors="raise")
val_ts  = created_dt.quantile(0.60)
test_ts = created_dt.quantile(0.80)

val_date  = pd.Timestamp(val_ts).strftime("%Y-%m-%d %H:%M:%S")
test_date = pd.Timestamp(test_ts).strftime("%Y-%m-%d %H:%M:%S")

(X_tr, y_tr), (X_val, y_val), (X_te, y_te) = s21_val_test_split(X, y, val_date, test_date)

X_tr = X_tr.drop(columns=["created", "building_id"]).copy()
X_val = X_val.drop(columns=["created", "building_id"]).copy()
X_te = X_te.drop(columns=["created", "building_id"]).copy()

all_feats = list(X_tr.columns)


def _rmse(y_true, y_pred):
    if root_mean_squared_error is not None:
        return float(root_mean_squared_error(y_true, y_pred))
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def fit_lasso(feats, alpha):
    model = Pipeline([
        ("sc", MinMaxScaler()),
        ("lasso", Lasso(alpha=alpha, max_iter=10000, random_state=RANDOM_STATE)),
    ])
    model.fit(X_tr[feats], y_tr)
    return model

def _calc_metrics(model, feats, X_, y_):
    pred = model.predict(X_[feats])
    return (
        float(mean_absolute_error(y_, pred)),
        _rmse(y_, pred),
        float(r2_score(y_, pred)),
    )

def build_tables(models):
    rows_mae, rows_rmse, rows_r2 = [], [], []
    for name, model, feats in models:
        mae_tr, rmse_tr, r2_tr = _calc_metrics(model, feats, X_tr, y_tr)
        mae_va, rmse_va, r2_va = _calc_metrics(model, feats, X_val, y_val)
        mae_te, rmse_te, r2_te = _calc_metrics(model, feats, X_te, y_te)

        rows_mae.append({"model": name, "train": mae_tr, "valid": mae_va, "test": mae_te})
        rows_rmse.append({"model": name, "train": rmse_tr, "valid": rmse_va, "test": rmse_te})
        rows_r2.append({"model": name, "train": r2_tr, "valid": r2_va, "test": r2_te})

    return (
        pd.DataFrame(rows_mae).set_index("model"),
        pd.DataFrame(rows_rmse).set_index("model"),
        pd.DataFrame(rows_r2).set_index("model"),
    )

alphas = np.logspace(-3, 1, 25)

best_alpha, best_val_mae, lasso_model = None, np.inf, None
for a in alphas:
    m = fit_lasso(all_feats, a)
    val_mae = mean_absolute_error(y_val, m.predict(X_val[all_feats]))
    if val_mae < best_val_mae:
        best_val_mae = val_mae
        best_alpha = a
        lasso_model = m

print(f"Best alpha: {best_alpha:.6f} | MAE(valid): {best_val_mae:.6f}")

coef = lasso_model.named_steps["lasso"].coef_
weights_table = (pd.DataFrame({
    "ID": list(range(len(all_feats))),
    "Name": all_feats,
    "Importance": np.abs(coef),
}).sort_values("Importance", ascending=False).reset_index(drop=True))

display(weights_table.head(20))

top10_by_weight = weights_table.head(10)["Name"].tolist()
lasso_model_top10 = fit_lasso(top10_by_weight, best_alpha)

mae_tab, rmse_tab, r2_tab = build_tables([
    ("Lasso MinMaxScaler", lasso_model, all_feats),
    ("Lasso top10 MinMaxScaler", lasso_model_top10, top10_by_weight),
])
print("MAE")
display(mae_tab)
print("RMSE")
display(rmse_tab)
print("R2")
display(r2_tab)

print("neg_mean_absolute_percentage_error:")
perm = permutation_importance(
    lasso_model, X_val[all_feats], y_val,
    n_repeats=10, random_state=RANDOM_STATE,
    scoring="neg_mean_absolute_percentage_error"
)

perm_mean = pd.Series(perm.importances_mean, index=all_feats).sort_values(ascending=False)
perm_std  = pd.Series(perm.importances_std, index=all_feats).reindex(perm_mean.index)

perm_table = pd.DataFrame({
    "Feature": perm_mean.index,
    "Mean ± Std Deviation": [f"{perm_mean[f]:.6f} ± {perm_std[f]:.6f}" for f in perm_mean.index]
})

display(perm_table.head(20))

top10_perm = perm_mean.head(10).index.tolist()
lasso_model_perm = fit_lasso(top10_perm, best_alpha)

mae_tab, rmse_tab, r2_tab = build_tables([
    ("Lasso MinMaxScaler", lasso_model, all_feats),
    ("Lasso top10 MinMaxScaler", lasso_model_top10, top10_by_weight),
    ("Lasso permutation MinMaxScaler", lasso_model_perm, top10_perm),
])
print("MAE")
display(mae_tab)
print("RMSE")
display(rmse_tab)
print("R2")
display(r2_tab)

if shap is None:
    print("SHAP skipped: shap not installed (pip install shap).")
else:
    try:
        import shap as _shap

        sc = lasso_model.named_steps["sc"]
        lin = lasso_model.named_steps["lasso"]

        Xtr_s = sc.transform(X_tr[all_feats])
        Xva_s = sc.transform(X_val[all_feats])

        expl = _shap.LinearExplainer(lin, Xtr_s)
        sv = expl.shap_values(Xva_s)
        shap_imp = np.abs(sv).mean(axis=0)

        shap_table = (pd.DataFrame({
            "id": list(range(len(all_feats))),
            "feature": all_feats,
            "shap_value": shap_imp
        }).sort_values("shap_value", ascending=False).reset_index(drop=True))

        display(shap_table.head(20))

        top10_shap = shap_table.head(10)["feature"].tolist()
        lasso_model_shap = fit_lasso(top10_shap, best_alpha)

        mae_tab, rmse_tab, r2_tab = build_tables([
            ("Lasso MinMaxScaler", lasso_model, all_feats),
            ("Lasso top10 MinMaxScaler", lasso_model_top10, top10_by_weight),
            ("Lasso permutation MinMaxScaler", lasso_model_perm, top10_perm),
            ("Lasso shap MinMaxScaler", lasso_model_shap, top10_shap),
        ])
        print("MAE")
        display(mae_tab)
        print("RMSE")
        display(rmse_tab)
        print("R2")
        display(r2_tab)

    except Exception as e:
        print("SHAP skipped:", repr(e))

Best alpha: 1.000000 | MAE(valid): 694.700638


| | ID | Name | Importance |
|---|---|---|---|
| 0 | 0 | bathrooms | 14366.09098 |
| 1 | 1 | bedrooms | 3372.917292 |
| 2 | 2 | interest_level | 806.965366 |
| 3 | 7 | Doorman | 539.598924 |
| 4 | 13 | LaundryinUnit | 473.676592 |
| 5 | 3 | Elevator | 228.391298 |
| 6 | 17 | HighSpeedInternet | 199.695843 |
| 7 | 11 | FitnessCenter | 197.433723 |
| 8 | 10 | LaundryinBuilding | 181.942907 |
| 9 | 20 | LaundryInBuilding | 167.152505 |


MAE


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 683.443107 | 694.700638 | 694.04072 |
| Lasso top10 MinMaxScaler | 685.926175 | 699.459307 | 697.718789 |


RMSE


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 1001.59521 | 1016.726174 | 1006.808261 |
| Lasso top10 MinMaxScaler | 1006.061706 | 1023.078975 | 1011.556076 |


R2


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 0.601513 | 0.612845 | 0.600383 |
| Lasso top10 MinMaxScaler | 0.597951 | 0.607991 | 0.596605 |


neg_mean_absolute_percentage_error:


| | Feature | Mean ± Std Deviation |
|---|---|---|
| 0 | bedrooms | 0.078184 ± 0.001802 |
| 1 | bathrooms | 0.074223 ± 0.001047 |
| 2 | Doorman | 0.026248 ± 0.000801 |
| 3 | interest_level | 0.021256 ± 0.000600 |
| 4 | LaundryinUnit | 0.008729 ± 0.000418 |
| 5 | Elevator | 0.005904 ± 0.000456 |
| 6 | FitnessCenter | 0.003327 ± 0.000312 |
| 7 | Dishwasher | 0.002572 ± 0.000259 |
| 8 | LaundryinBuilding | 0.001421 ± 0.000347 |
| 9 | HighSpeedInternet | 0.000966 ± 0.000161 |
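
The ranking above comes from scikit-learn's `permutation_importance`. As a rough illustration of the idea only, here is a minimal NumPy sketch on synthetic data with a toy least-squares model (the data, `predict`, and `mae` are assumptions made for this example, not part of the notebook): shuffle one column at a time and measure how much the error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): feature 0 matters most,
# feature 1 a little, feature 2 not at all.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit ordinary least squares with an intercept column.
coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(n)]), y, rcond=None)

def predict(X_):
    return np.column_stack([X_, np.ones(len(X_))]) @ coef

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

baseline = mae(y, predict(X))

n_repeats = 10
importances = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # Shuffling column j breaks its link to y while keeping its distribution.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        increases.append(mae(y, predict(X_perm)) - baseline)
    importances[j] = np.mean(increases)

print(importances)  # error increase per feature; larger means more important
```

The sketch scores on the same data it was fitted on for brevity; the notebook instead evaluates on the validation split and reports the drop in `neg_mean_absolute_percentage_error`.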


MAE


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 683.443107 | 694.700638 | 694.04072 |
| Lasso top10 MinMaxScaler | 685.926175 | 699.459307 | 697.718789 |
| Lasso permutation MinMaxScaler | 685.440277 | 699.052146 | 696.798433 |


RMSE


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 1001.59521 | 1016.726174 | 1006.808261 |
| Lasso top10 MinMaxScaler | 1006.061706 | 1023.078975 | 1011.556076 |
| Lasso permutation MinMaxScaler | 1006.242173 | 1023.08253 | 1012.16069 |


R2


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 0.601513 | 0.612845 | 0.600383 |
| Lasso top10 MinMaxScaler | 0.597951 | 0.607991 | 0.596605 |
| Lasso permutation MinMaxScaler | 0.597807 | 0.607989 | 0.596123 |


| | id | feature | shap_value |
|---|---|---|---|
| 0 | 0 | bathrooms | 469.169569 |
| 1 | 1 | bedrooms | 441.89539 |
| 2 | 7 | Doorman | 265.24332 |
| 3 | 2 | interest_level | 206.321596 |
| 4 | 13 | LaundryinUnit | 132.578534 |
| 5 | 3 | Elevator | 113.936478 |
| 6 | 10 | LaundryinBuilding | 80.365137 |
| 7 | 11 | FitnessCenter | 76.395996 |
| 8 | 8 | Dishwasher | 62.704364 |
| 9 | 4 | HardwoodFloors | 42.468138 |
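
One way to read the table above: for a linear model, if features are treated as independent, the SHAP value of feature j for a single row is `coef_j * (x_j - mean(x_j))`. The sketch below uses made-up coefficients and data (everything in it is hypothetical) to show why a feature with a very large coefficient can still get a modest mean |SHAP| when its scaled values barely vary. Whether this matches the defaults of `LinearExplainer` in the installed shap version is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scaled features: column 0 barely varies, column 1 varies a lot.
X_bg = np.column_stack([
    rng.uniform(0.10, 0.12, size=1000),
    rng.uniform(0.0, 1.0, size=1000),
])
coef = np.array([1000.0, 100.0])  # made-up linear coefficients
intercept = 50.0

def predict(X_):
    return X_ @ coef + intercept

# Linear SHAP under feature independence: phi_ij = coef_j * (x_ij - E[x_j]).
mean_bg = X_bg.mean(axis=0)
phi = coef * (X_bg - mean_bg)
base_value = predict(mean_bg[None, :])[0]

# Global importance in the style of the notebook: mean absolute SHAP per feature.
shap_imp = np.abs(phi).mean(axis=0)
print(shap_imp)
```

In this toy setup the second feature ends up with the larger mean |SHAP| even though its coefficient is ten times smaller, because per-row contributions add up to the prediction minus the base value and depend on how far each value sits from its mean.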


MAE


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 683.443107 | 694.700638 | 694.04072 |
| Lasso top10 MinMaxScaler | 685.926175 | 699.459307 | 697.718789 |
| Lasso permutation MinMaxScaler | 685.440277 | 699.052146 | 696.798433 |
| Lasso shap MinMaxScaler | 687.146065 | 699.237745 | 695.879097 |


RMSE


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 1001.59521 | 1016.726174 | 1006.808261 |
| Lasso top10 MinMaxScaler | 1006.061706 | 1023.078975 | 1011.556076 |
| Lasso permutation MinMaxScaler | 1006.242173 | 1023.08253 | 1012.16069 |
| Lasso shap MinMaxScaler | 1007.709086 | 1023.532697 | 1011.844399 |


R2


| model | train | valid | test |
|---|---|---|---|
| Lasso MinMaxScaler | 0.601513 | 0.612845 | 0.600383 |
| Lasso top10 MinMaxScaler | 0.597951 | 0.607991 | 0.596605 |
| Lasso permutation MinMaxScaler | 0.597807 | 0.607989 | 0.596123 |
| Lasso shap MinMaxScaler | 0.596633 | 0.607644 | 0.596375 |


**7. Hyperparameter optimization**

### Hyperparameter optimization plan
We compare three approaches for ElasticNet (alpha, l1_ratio):
- Grid Search;
- Random Search;
- Optuna (both with and without CV).
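
For reference, scikit-learn ships built-in counterparts of the first two approaches. The following is a minimal sketch on synthetic data, not the notebook's setup: `X_demo`, `y_demo`, the small grid, and the sampling distributions are illustrative assumptions chosen to keep the run short.

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

pipe = Pipeline([
    ("sc", StandardScaler()),
    ("enet", ElasticNet(max_iter=10000)),
])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Grid search: every combination of the listed values is evaluated.
grid = GridSearchCV(
    pipe,
    param_grid={
        "enet__alpha": np.logspace(-3, 0, 4),
        "enet__l1_ratio": [0.2, 0.5, 0.8],
    },
    scoring="neg_mean_absolute_error",
    cv=cv,
)
grid.fit(X_demo, y_demo)

# Random search: n_iter combinations are drawn from the distributions.
rand = RandomizedSearchCV(
    pipe,
    param_distributions={
        "enet__alpha": loguniform(1e-3, 1e0),
        "enet__l1_ratio": uniform(0.05, 0.9),  # uniform on [0.05, 0.95]
    },
    n_iter=10,
    scoring="neg_mean_absolute_error",
    cv=cv,
    random_state=0,
)
rand.fit(X_demo, y_demo)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

The hand-written functions below follow the same pattern (scaler plus ElasticNet inside each fold, mean score per candidate), which makes the built-ins a convenient cross-check.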


In [None]:
def s21_grid_search_cv(
    X,
    y,
    alphas=None,
    l1_ratios=None,
    scoring=None,       
    cv=5,             
    shuffle=True,
    random_state=42,
):

    if alphas is None:
        alphas = np.logspace(-4, 1, 30)
    if l1_ratios is None:
        l1_ratios = np.linspace(0.05, 0.95, 19)

    splitter = KFold(n_splits=cv, shuffle=shuffle, random_state=random_state) 

    if scoring is None:
        def scorer(est, Xb, yb):
            return est.score(Xb, yb)
    elif isinstance(scoring, str):
        _sk = get_scorer(scoring)
        def scorer(est, Xb, yb):               
            return _sk(est, Xb, yb)
    else:
        def scorer(est, Xb, yb):
            return scoring(est, Xb, yb) 

    rows = []
    best_idx = None
    best_mean = -np.inf

    for a in alphas:
        for l1 in l1_ratios:
            split_scores = []
            for tr_idx, val_idx in splitter.split(X, y):
                X_tr_i, X_val_i = X.iloc[tr_idx], X.iloc[val_idx]
                y_tr_i, y_val_i = y.iloc[tr_idx], y.iloc[val_idx]

                pipe = Pipeline([
                    ("sc", StandardScaler()),
                    ("enet", ElasticNet(alpha=a, l1_ratio=l1, max_iter=10000, random_state=random_state))
                ])
                pipe.fit(X_tr_i, y_tr_i)
                split_scores.append(float(scorer(pipe, X_val_i, y_val_i)))

            split_scores = np.asarray(split_scores, float)
            mean_score = float(split_scores.mean())
            std_score  = float(split_scores.std(ddof=1)) if len(split_scores) > 1 else 0.0

            rows.append({
                "mean_test_score": mean_score,
                "std_test_score": std_score,
                **{f"split{i}_test_score": float(s) for i, s in enumerate(split_scores)},
                "alpha": a,
                "l1_ratio": l1,
            })

            if mean_score > best_mean:
                best_mean = mean_score
                best_idx = len(rows) - 1

    cv_results = pd.DataFrame(rows)
    cv_results["rank_test_score"] = cv_results["mean_test_score"].rank(method="min", ascending=False).astype(int)
    cv_results = cv_results.sort_values(["rank_test_score", "alpha", "l1_ratio"]).reset_index(drop=True)

    best_row = rows[best_idx]
    best_params = {
        "alpha": float(best_row["alpha"]),
        "l1_ratio": float(best_row["l1_ratio"]),
        "mean_test_score": best_row["mean_test_score"],
    }

    return best_params, cv_results

In [None]:
gridsearch_params, gridsearch_tab = s21_grid_search_cv(
    X_tr, y_tr,
    cv=5,
    random_state=42
)

gridsearch_tab.head()

In [None]:
def s21_random_search(
    X, y,
    n_iter=100,
    alpha_low=1e-4, alpha_high=1e-1,
    scoring=None,
    cv=5, shuffle=True,
    random_state=42
):

    splitter = KFold(n_splits=cv, shuffle=shuffle, random_state=random_state)

    if scoring is None:
        def scorer(est, Xb, yb): return est.score(Xb, yb)
    elif isinstance(scoring, str):
        _sk = get_scorer(scoring)
        def scorer(est, Xb, yb): return _sk(est, Xb, yb)
    else:
        def scorer(est, Xb, yb): return scoring(est, Xb, yb)
    
    rng = np.random.default_rng(random_state)
    
    rows, best_idx, best_mean = [], None, -np.inf
    
    for i in range(n_iter):
        a = float(10 ** rng.uniform(np.log10(alpha_low), np.log10(alpha_high)))
        l1 = float(rng.uniform(0.0, 1.0))

        split_scores = []
        for tr_idx, val_idx in splitter.split(X, y):
            X_tr_i, X_val_i = X.iloc[tr_idx], X.iloc[val_idx]
            y_tr_i, y_val_i = y.iloc[tr_idx], y.iloc[val_idx]

            pipe = Pipeline([
                ('scaler', StandardScaler()),
                ('model', ElasticNet(alpha=a, l1_ratio=l1, max_iter=10000, random_state=random_state))
            ])

            pipe.fit(X_tr_i, y_tr_i)
            split_scores.append(float(scorer(pipe, X_val_i, y_val_i)))

        mean_score = np.mean(split_scores)
        std_score = np.std(split_scores)

        rows.append({
            "iter": i,
            "mean_test_score": mean_score,
            "std_test_score": std_score,
            **{f"split{j}_test_score": float(s) for j, s in enumerate(split_scores)},
            "alpha": a,
            "l1_ratio": l1,
        })

        if mean_score > best_mean:
            best_mean = mean_score
            best_idx = len(rows) - 1

    cv_results = pd.DataFrame(rows)
    cv_results["rank_test_score"] = cv_results["mean_test_score"].rank(method="min", ascending=False).astype(int)
    cv_results = cv_results.sort_values(["rank_test_score", "iter"]).reset_index(drop=True)

    best_row = rows[best_idx]
    best_params = {
        "alpha": best_row["alpha"],
        "l1_ratio": best_row["l1_ratio"],
        "mean_test_score": float(best_row["mean_test_score"]),
    }

    return best_params, cv_results     
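
The random search above draws `alpha` uniformly in the exponent (log-uniform) rather than uniformly on the raw interval. A small sketch of why that matters, assuming the same default bounds `1e-4` and `1e-1`; the sample size and the `1e-3` threshold are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = 1e-4, 1e-1

# Log-uniform: uniform in the exponent, as in s21_random_search.
log_samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)
# Plain uniform on the same interval, for comparison.
lin_samples = rng.uniform(low, high, size=10_000)

# Share of draws below 1e-3, i.e. in the lowest of the three decades.
frac_log = float(np.mean(log_samples < 1e-3))
frac_lin = float(np.mean(lin_samples < 1e-3))
print(frac_log, frac_lin)
```

With log-uniform sampling roughly a third of the draws land in each decade, while plain uniform sampling puts almost all of them in the top decade and barely explores small regularization strengths.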

In [None]:
best_params, randomsearch_results = s21_random_search(
    X_tr, y_tr,
    n_iter=100,
    cv=5,
    random_state=42
)

randomsearch_results.head()

In [None]:
elasticnet_model = Pipeline([
    ("num", StandardScaler()),
    ("enet", ElasticNet(alpha=0.01743328822199989, l1_ratio=0.6, random_state=42))
])

elasticnet_model.fit(X_tr, y_tr)

In [None]:
def optuna_elasticnet(
    X, y,
    scoring="neg_mean_absolute_error",   
    cv=None,                            
    X_val=None, y_val=None,      
    n_trials=50,
    alpha_low=1e-4, alpha_high=1e1,
    l1_low=0.0,  l1_high=1.0,
    random_state=42,
    n_jobs=-1,
    silence_logs=True,
):
    if optuna is None:
        raise ImportError('optuna is not installed. Install it with: pip install optuna')

    if silence_logs:
        optuna.logging.set_verbosity(optuna.logging.WARNING)

    num_cols = X.select_dtypes(include=[np.number, "bool"]).columns.tolist()
    if not num_cols:
        raise ValueError("X has no numeric or boolean features.")

    from sklearn.metrics import get_scorer
    scorer = get_scorer(scoring) if isinstance(scoring, str) else scoring

    def make_pipe(alpha, l1_ratio):
        return Pipeline([
            ("prep", ColumnTransformer([("num", StandardScaler(), num_cols)], remainder="drop")),
            # Use the function's random_state argument instead of a hard-coded 42.
            ("enet", ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000, random_state=random_state))
        ])

    def objective(trial: optuna.Trial) -> float:
        alpha    = trial.suggest_float("alpha",    alpha_low, alpha_high, log=True)
        l1_ratio = trial.suggest_float("l1_ratio", l1_low,    l1_high)
        pipe = make_pipe(alpha, l1_ratio)

        if cv is not None:
            scores = cross_val_score(pipe, X, y, scoring=scorer, cv=cv, n_jobs=n_jobs)
            return float(scores.mean())  # sklearn scorers (neg-MAE, R2, ...) are always "higher is better"
        else:
            assert X_val is not None and y_val is not None, "Pass X_val and y_val for holdout evaluation"
            pipe.fit(X, y)
            if scorer is None:
                return float(pipe.score(X_val, y_val))
            return float(scorer(pipe, X_val, y_val))

    study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=random_state))
    study.optimize(objective, n_trials=n_trials)

    best_alpha  = study.best_params["alpha"]
    best_l1r    = study.best_params["l1_ratio"]
    best_model  = make_pipe(best_alpha, best_l1r).fit(X, y)

    trials_df = study.trials_dataframe(attrs=("number","value","params","state","datetime_start","datetime_complete"))
    best_params = {"alpha": best_alpha, "l1_ratio": best_l1r, "best_score": study.best_value}
    
    return best_model, best_params, trials_df, study

In [None]:
if optuna is None:
    print("optuna is not installed.")
else:
    from sklearn.model_selection import KFold

    cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

    model_cv, params_cv, trials_cv, _ = optuna_elasticnet(
        X_tr, y_tr,
        scoring="neg_mean_absolute_error",
        cv=cv,
        n_trials=50,
        random_state=RANDOM_STATE
    )

    model_ho, params_ho, trials_ho, _ = optuna_elasticnet(
        X_tr, y_tr,
        scoring="neg_mean_absolute_error",
        cv=None,
        X_val=X_val, y_val=y_val,
        n_trials=50,
        random_state=RANDOM_STATE
    )

    print(f"Optuna CV: {params_cv}")
    print(f"Optuna no CV: {params_ho}")
    print(f"Metrics CV: {evaluate_model(model_cv, X_val, y_val)}")
    print(f"Metrics no CV: {evaluate_model(model_ho, X_val, y_val)}")


In [None]:
def trials_to_df(trials):
    if trials is None:
        return pd.DataFrame()

    # 1. trials is already a DataFrame (e.g. the result of study.trials_dataframe())
    if isinstance(trials, pd.DataFrame):
        df = trials.copy()
        rename_map = {}
        for old, new in [("trial_number", "number"), ("trial_id", "number"), ("id", "number")]:
            if old in df.columns:
                rename_map[old] = new
        for old, new in [("objective", "value"), ("score", "value")]:
            if old in df.columns:
                rename_map[old] = new
        df = df.rename(columns=rename_map)

        if "number" not in df.columns:
            df["number"] = np.arange(len(df))
        if "value" not in df.columns:
            raise KeyError("The trials DataFrame has no 'value' column")
        if "state" not in df.columns:
            df["state"] = "COMPLETE"

        df = df[["number", "value", "state"]].copy()
        df = df.dropna(subset=["value"])
        return df.sort_values("number").reset_index(drop=True)

    rows = []
    for i, t in enumerate(trials):
        number = getattr(t, "number", i)
        value = getattr(t, "value", None)
        state = str(getattr(t, "state", "COMPLETE"))
        if value is None:
            continue
        rows.append({"number": int(number), "value": float(value), "state": state})
    df = pd.DataFrame(rows)
    if df.empty:
        return df
    return df.sort_values("number").reset_index(drop=True)


def ops_report_from_trials(trials, direction="maximize", ops_per_trial=1, tol=0.01):
    df = trials_to_df(trials)
    if df.empty:
        return {"error": "No trials with a value"}

    vals = df["value"].to_numpy(dtype=float)
    if direction == "minimize":
        best = np.minimum.accumulate(vals)
        final_best = best[-1]
        target = final_best + abs(final_best) * tol
        hit_idx = np.where(best <= target)[0]
    else:
        best = np.maximum.accumulate(vals)
        final_best = best[-1]
        target = final_best - abs(final_best) * tol
        hit_idx = np.where(best >= target)[0]

    hit = int(hit_idx[0]) if hit_idx.size > 0 else len(best) - 1
    return {
        "final_best": float(final_best),
        "target_within_tol": float(target),
        "trials_to_reach": hit + 1,
        "ops_per_trial": ops_per_trial,
        "ops_to_reach": int((hit + 1) * ops_per_trial),
        "total_trials": len(vals),
        "total_ops": int(len(vals) * ops_per_trial),
    }
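
To make the "trials to reach" logic concrete, here is a small worked example on a made-up sequence of neg-MAE values (the numbers are hypothetical, not taken from the runs). It repeats the `maximize` branch of `ops_report_from_trials` step by step.

```python
import numpy as np

# Hypothetical objective values per trial (neg-MAE, so higher is better).
vals = np.array([-700.0, -650.0, -660.0, -640.0, -645.0, -638.5])
tol = 0.01  # accept anything within 1% of the final best
k = 5       # model fits per trial, e.g. 5-fold CV

# Best value seen so far after each trial.
best = np.maximum.accumulate(vals)
final_best = best[-1]

# Threshold: final best minus 1% of its magnitude.
target = final_best - abs(final_best) * tol

# First trial whose running best already clears the threshold.
hit = int(np.where(best >= target)[0][0])
trials_to_reach = hit + 1
ops_to_reach = trials_to_reach * k
print(trials_to_reach, ops_to_reach)
```

Here the threshold is -644.885, the running best first clears it at the fourth trial (-640.0), so with 5 fits per trial the target is reached after 20 fits.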

In [None]:
direction = "maximize" 
k = 5

print("CV:")
print(ops_report_from_trials(trials_cv, direction=direction, ops_per_trial=k, tol=0.01))

print("\nNo CV:")
print(ops_report_from_trials(trials_ho, direction=direction, ops_per_trial=1, tol=0.01))
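
For context, the total number of model fits each approach performs can be worked out from the default settings used in the cells above (30 alphas by 19 l1_ratio values for the grid, `n_iter=100` for random search, `n_trials=50` for Optuna, 5 folds). This is a back-of-the-envelope sketch; the `budgets` dictionary is just an illustrative name, and the final refit of the best Optuna model is not counted.

```python
import numpy as np

k = 5  # number of CV folds

# Grid search evaluates every (alpha, l1_ratio) pair from the default grids.
grid_candidates = len(np.logspace(-4, 1, 30)) * len(np.linspace(0.05, 0.95, 19))

budgets = {
    "grid_search": grid_candidates * k,  # 570 candidates x 5 folds
    "random_search": 100 * k,            # n_iter=100 x 5 folds
    "optuna_cv": 50 * k,                 # n_trials=50 x 5 folds
    "optuna_holdout": 50 * 1,            # n_trials=50, one fit each
}
print(budgets)
```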
