## 1. Answer the questions from the introduction
   1. What is leave-one-out? Provide limitations and strengths.
   2. How do Grid Search, Randomized Grid Search, and Bayesian optimization work?
   3. Explain classification of feature selection methods. Explain how Pearson and Chi2 work. Explain how Lasso works. Explain what permutation significance is. Become familiar with SHAP.

1. Leave-one-out (LOO) is a cross-validation method in which the model is trained, in turn, on every object in the dataset except one, and the held-out object is used for testing. With N objects the model is therefore fitted N times.

Strengths:
- Uses almost all of the data for training
- Gives a nearly unbiased estimate of model quality
- Especially useful for very small samples

Limitations:
- Computationally expensive (the model is trained N times)
- The estimate can be unstable because of its high variance
- Model selection based on LOO can overfit, because every model is trained on almost the entire sample
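
The procedure can be sketched in a few lines. This is a minimal illustration that assumes an ordinary least-squares model and toy data; the helper name `leave_one_out_mae` and the data are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

def leave_one_out_mae(X, y):
    """Absolute error of a least-squares fit on each held-out object."""
    n_samples = X.shape[0]
    errors = []
    for i in range(n_samples):
        train = np.arange(n_samples) != i  # boolean mask: every object except i
        # Add an intercept column and fit on the n - 1 training objects.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        prediction = np.concatenate([[1.0], X[i]]) @ coef
        errors.append(abs(prediction - y[i]))
    return np.array(errors)

# Toy data (hypothetical): the target depends linearly on one feature.
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3.0 * X_toy[:, 0] + 2.0
loo_errors = leave_one_out_mae(X_toy, y_toy)
print(len(loo_errors), loo_errors.mean())
```

The loop makes the cost visible: one model fit per object, which is why LOO becomes impractical on a dataset with tens of thousands of rows.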

2.
* Grid Search evaluates every combination of the given hyperparameter values in a grid, training and scoring the model on each one. It works well for low-dimensional search spaces but is expensive in time.

* Randomized Grid Search randomly samples a fixed number of combinations from the hyperparameter space. It is faster than Grid Search and better suited to large, complex spaces, but it does not guarantee that every combination is tried.

* Bayesian Optimization builds a probabilistic model of how the quality metric depends on the hyperparameters and iteratively chooses the next parameters to evaluate based on previous results. It needs fewer evaluations and saves computation, especially when training the model is expensive.
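
The difference between the first two strategies can be shown on a toy objective. The sketch below uses only the standard library; `toy_loss`, the grid values and both helper functions are hypothetical stand-ins for a real train-and-validate step, and Bayesian optimization is not covered here.

```python
import itertools
import random

def toy_loss(alpha, l1_ratio):
    """Hypothetical validation loss with its minimum at alpha=0.01, l1_ratio=0.9."""
    return (alpha - 0.01) ** 2 + (l1_ratio - 0.9) ** 2

grid = {"alpha": [0.001, 0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}

def grid_search(grid, loss):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    return min(candidates, key=lambda params: loss(**params))

def random_search(grid, loss, n_iter, seed=21):
    """Evaluate n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_iter)]
    return min(candidates, key=lambda params: loss(**params))

best_grid = grid_search(grid, toy_loss)          # 12 evaluations
best_random = random_search(grid, toy_loss, 5)   # 5 evaluations
print(best_grid, best_random)
```

Grid search always finds the best grid point at the cost of evaluating all 12 combinations; random search spends a fixed budget and may or may not hit the same point.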

3.

Pearson: measures the linear correlation between a feature and the target variable; features with a high absolute correlation are selected.

Chi2: tests whether a categorical feature is independent of the class and keeps the features for which independence is rejected.

Lasso: linear regression with L1 regularization, which drives the coefficients of unimportant features to exactly zero and thereby selects the remaining ones.

Permutation significance: estimates the importance of a feature by randomly shuffling its values and measuring how much the model's quality degrades.

SHAP: a model explanation method that estimates each feature's contribution to a prediction, based on cooperative game theory (Shapley values).
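
The two filter statistics can be computed by hand. The sketch below assumes synthetic data and the textbook definitions (absolute Pearson correlation per column, chi-square statistic of a contingency table); the function names are hypothetical, and library implementations may differ in details such as how the contingency counts are built.

```python
import numpy as np

def pearson_scores(X, y):
    """Absolute Pearson correlation between each column of X and the target y."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    numerator = X_centered.T @ y_centered
    denominator = np.sqrt((X_centered ** 2).sum(axis=0) * (y_centered ** 2).sum())
    return np.abs(numerator / denominator)

def chi2_statistic(table):
    """Chi-square statistic of independence for a contingency table of counts."""
    table = np.asarray(table, dtype=float)
    # Expected counts if rows and columns were independent.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

rng = np.random.default_rng(21)
informative = rng.normal(size=200)
noise = rng.normal(size=200)
X_toy = np.column_stack([informative, noise])
y_toy = 2.0 * informative + 0.1 * rng.normal(size=200)

scores = pearson_scores(X_toy, y_toy)
dependent = chi2_statistic([[30, 10], [10, 30]])     # feature and class are related
independent = chi2_statistic([[20, 20], [20, 20]])   # observed equals expected
print(scores, dependent, independent)
```

A larger statistic means stronger evidence against independence, so features are ranked by it in the same way they are ranked by absolute correlation.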

## 2. Introduction — do all the preprocessing from the previous lesson
   1. Read all the data.
   2. Preprocess the "Interest Level" feature.
   3. Create features:  'Elevator', 'HardwoodFloors', 'CatsAllowed', 'DogsAllowed', 'Doorman', 'Dishwasher', 'NoFee', 'LaundryinBuilding', 'FitnessCenter', 'Pre-War', 'LaundryInUnit', 'RoofDeck', 'OutdoorSpace', 'DiningRoom', 'HighSpeedInternet', 'Balcony', 'SwimmingPool', 'LaundryInBuilding', 'NewConstruction', 'Terrace'.

In [58]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import KFold, GroupKFold, StratifiedKFold, TimeSeriesSplit, train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import shap
import optuna

pd.set_option("future.no_silent_downcasting", True)

In [59]:
df = pd.read_json("data/train.json")

percentile_1 = df["price"].quantile(0.01)
percentile_9 = df["price"].quantile(0.99)
train_df = df[(df["price"] > percentile_1) & (df["price"] < percentile_9)].copy()
train_df.loc[:,"interest_level"] = train_df["interest_level"].replace({"low": 0,"medium": 1, "high": 2}).copy()
train_df["features"] = train_df["features"].astype(str)
train_df["features"] = train_df["features"].str.replace(r'[$$\'"\s\[\]]', '', regex=True)
list_features_train = []
for features in train_df["features"]:
    list_features_train.extend(features.split(","))

counter = Counter(list_features_train)
top_21 = counter.most_common(21)
top_21 = [x for x in top_21 if x[0] != ""]

for feature_name in top_21:
    train_df[feature_name[0]] = train_df["features"].apply(lambda x: 1 if feature_name[0] in x.split(",") else 0)

feature_list = ["bathrooms", "bedrooms", "interest_level", "created"] + [x[0] for x in top_21]

train_df["bathrooms"] = train_df["bathrooms"].astype(int)

X = train_df[feature_list].copy()
y = train_df["price"].copy()
X

Unnamed: 0,bathrooms,bedrooms,interest_level,created,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
4,1,1,1,2016-06-16 05:55:27,0,1,1,1,0,1,...,0,0,0,1,0,0,0,0,0,0
6,1,2,0,2016-06-01 05:44:33,1,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
9,1,2,1,2016-06-14 15:19:59,1,1,0,0,1,1,...,1,0,0,0,0,0,0,0,0,0
10,1,3,1,2016-06-24 07:54:24,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,1,0,0,2016-06-28 03:50:23,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124000,1,3,0,2016-04-05 03:58:33,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
124002,1,2,1,2016-04-02 02:25:31,1,0,1,1,1,0,...,0,0,0,0,0,0,0,1,0,0
124004,1,1,1,2016-04-26 05:42:03,1,1,1,1,0,1,...,1,0,0,1,0,0,0,0,0,0
124008,1,2,1,2016-04-19 02:47:33,0,0,0,0,0,1,...,1,0,1,0,0,0,0,0,0,0


## 3. Implement the next methods:
   1. Split data into 2 parts randomly with parameter test_size (ratio from 0 to 1), return training and test samples.
   2. Randomly split data into 3 parts with parameters validation_size and test_size, return train, validation and test samples.
   3. Split data into 2 parts with parameter date_split, return train and test samples split by date_split param.
   4. Split data into 3 parts with parameters validation_date and test_date, return train, validation and test samples split by input params.
   5. Make split procedure deterministic. What does it mean?

In [60]:
def split_into_2(X, y, test_size=0.2):
    np.random.seed(21)
    n_samples = X.shape[0]
    ind = np.random.permutation(n_samples)
    test_count = int(test_size * n_samples)
    train_ind = ind[test_count:]
    test_ind = ind[:test_count]
    return X.iloc[train_ind], X.iloc[test_ind], y.iloc[train_ind], y.iloc[test_ind]
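
A deterministic split returns the same train and test parts every time it is called with the same data and the same seed. A minimal sketch of that property, assuming a local seeded generator instead of the global NumPy state (the helper `split_indices` is hypothetical):

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=21):
    """Return (train_idx, test_idx) drawn from a local, seeded generator."""
    rng = np.random.default_rng(seed)  # local generator: global state is untouched
    permutation = rng.permutation(n_samples)
    test_count = int(test_size * n_samples)
    return permutation[test_count:], permutation[:test_count]

train_a, test_a = split_indices(10)
train_b, test_b = split_indices(10)
# Same seed, same input size: both calls produce identical index arrays.
print(np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b))
```

A local generator also keeps the split reproducible regardless of which other cells consumed random numbers before it.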

In [61]:
def split_into_3(X, y, test_size=0.2, validation_size=0.2):
    np.random.seed(21)
    n_samples = X.shape[0]
    ind = np.random.permutation(n_samples)
    validation_count = int(validation_size * n_samples)
    test_count = int(test_size * n_samples)
    train_count = int(n_samples - validation_count - test_count)

    train_ind = ind[:train_count]
    validation_ind = ind[train_count:train_count + validation_count]
    test_ind = ind[train_count + validation_count:]

    return (X.iloc[train_ind], X.iloc[validation_ind], X.iloc[test_ind],
            y.iloc[train_ind], y.iloc[validation_ind], y.iloc[test_ind])

In [62]:
def split_into_2_date(X, y, date_split="2016-06-16"):
    # Parse dates into a local Series so the caller's DataFrame is not modified.
    created = pd.to_datetime(X["created"])
    train_mask = created <= date_split
    test_mask = created > date_split
    return X[train_mask], X[test_mask], y[train_mask], y[test_mask]

In [63]:
def split_into_3_date(X, y, validation_date="2016-06-01", test_date="2016-06-16"):
    # The default validation_date is an illustrative value, not derived from the data.
    created = pd.to_datetime(X["created"])
    # Boolean masks avoid mixing index labels with positional .iloc lookups.
    train_mask = created < validation_date
    validation_mask = (created >= validation_date) & (created <= test_date)
    test_mask = created > test_date
    return (X[train_mask], X[validation_mask], X[test_mask],
            y[train_mask], y[validation_mask], y[test_mask])

## 4. Implement the next cross-validation methods:
   1. K-Fold, where k is the input parameter, returns a list of train and test indices. 
   2. Grouped K-Fold, where k and group_field are input parameters, returns list of train and test indices. 
   3. Stratified K-fold, where k and stratify_field are input parameters, returns list of train and test indices.
   4. Time series split, where k and date_field are input parameters, returns list of train and test indices.

In [64]:
def kfold(X, k):
    n_samples = X.shape[0]
    if k > n_samples:
        raise ValueError("Number of splits cannot exceed the number of samples")

    folds = []
    all_ind = np.arange(n_samples)
    start = 0
    for i in range(k):
        # The first n_samples % k folds get one extra sample.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        end = start + size
        test = all_ind[start:end]
        train = np.setdiff1d(all_ind, test)
        folds.append((train, test))
        start = end

    return folds

In [65]:
def groupKfold(X, k, group_field):
    uniq = np.unique(group_field)
    if k > len(uniq):
        raise ValueError("Number of splits cannot exceed the number of groups")
    for i in range(k):
        # Each test fold takes a contiguous slice of the sorted unique groups.
        fold_groups = uniq[i * len(uniq) // k : (i + 1) * len(uniq) // k]
        test = np.where(np.isin(group_field, fold_groups))[0]
        train = np.setdiff1d(np.arange(len(X)), test)
        yield train, test

In [66]:
def stratifyKfold (X, k, stratify_field):
    uniq = np.unique(stratify_field)
    folds = []

    fold_sizes_per_class = {}
    class_indices_dict = {}
    for u in uniq:
        ind = np.where(stratify_field == u)[0]
        class_indices_dict[u] = ind

        fold_size = []
        for _ in range (k):
            fold_size.append(len(ind) // k)
        for i in range (len(ind) % k):
            fold_size[i] += 1

        fold_sizes_per_class[u] = fold_size

    for i in range (k):
        test_folds = []
        train_folds = []
        for u in uniq:
            sizes = fold_sizes_per_class[u]
            class_full = class_indices_dict[u]
            start = sum(sizes[:i])
            end = start + sizes[i]
            train_folds.extend(class_full[:start])
            train_folds.extend(class_full[end:])
            test_folds.extend(class_full[start:end])

        folds.append((np.sort(train_folds), np.sort(test_folds)))

    return folds

In [67]:
def timeSeriesSplit(k, date_field):
    # date_field is used only for its length: rows are assumed to be
    # in chronological order already, no sorting by date happens here.
    n_samples = len(date_field)
    folds = []
    test_size = n_samples // (k + 1)
    for i in range(1, k + 1):
        train_size = i * n_samples // (k + 1) + n_samples % (k + 1)
        train = np.arange(train_size)
        test = np.arange(train_size, train_size + test_size)
        folds.append((train, test))
    return folds
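
When rows are not stored in chronological order, an expanding-window split has to order positions by date first. The following is a sketch of one such variant, not the notebook's function: `time_series_split_sorted` and the sample dates are hypothetical, and the fold boundaries are computed by counting test blocks back from the end of the data.

```python
import numpy as np

def time_series_split_sorted(dates, k):
    """Expanding-window folds over positions ordered from oldest to newest."""
    order = np.argsort(dates, kind="stable")
    n_samples = len(order)
    test_size = n_samples // (k + 1)
    folds = []
    for i in range(1, k + 1):
        # Training window grows by one test block per fold.
        train_end = n_samples - (k + 1 - i) * test_size
        folds.append((order[:train_end], order[train_end:train_end + test_size]))
    return folds

dates = np.array(
    ["2016-06-16", "2016-04-05", "2016-06-01", "2016-04-26", "2016-06-24", "2016-04-02"],
    dtype="datetime64[D]",
)
folds = time_series_split_sorted(dates, k=2)
for train_idx, test_idx in folds:
    # Every training date precedes every test date.
    print(train_idx, test_idx, dates[train_idx].max() < dates[test_idx].min())
```

The returned arrays are positions into the original, unsorted data, so they can be passed to `.iloc` directly.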

## 5. Cross-validation comparison
   1. Apply all the validation methods implemented above to our dataset. To apply Stratified algorithm you should preprocess target.
   2. Apply the appropriate methods from sklearn.
   3. Compare the resulting feature distributions for the training part of the dataset between sklearn and your implementation.
   4. Compare all validation schemes. Choose the best one. Explain your choice.

In [68]:
#my train_test_split
X_train, X_test, y_train, y_test = split_into_2(
    X, y, test_size=0.2
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((38675, 24), (9668, 24), (38675,), (9668,))

In [69]:
# sklearn train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((38674, 24), (9669, 24), (38674,), (9669,))

In [70]:
# split_into_3 returns (train, validation, test) for X, then the same order for y.
X_train, X_validation, X_test, y_train, y_validation, y_test = split_into_3(
    X, y, test_size=0.2, validation_size=0.2
)

X_train.shape, X_validation.shape, X_test.shape, y_train.shape, y_validation.shape, y_test.shape

((29007, 24), (9668, 24), (9668, 24), (29007,), (9668,), (9668,))

In [71]:
X_train, X_test, y_train, y_test = split_into_2_date(
    X, y, date_split="2016-06-16"
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40696, 24), (7647, 24), (40696,), (7647,))

In [72]:
# This cell repeats the random three-way split; split_into_3_date is not exercised here.
X_train, X_validation, X_test, y_train, y_validation, y_test = split_into_3(
    X, y, test_size=0.2, validation_size=0.2
)

X_train.shape, X_validation.shape, X_test.shape, y_train.shape, y_validation.shape, y_test.shape

((29007, 24), (9668, 24), (9668, 24), (29007,), (9668,), (9668,))

In [73]:
#my KFold
for i, (train_index, test_index) in enumerate(kfold(X=X, k=2)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")

Fold 0:
  Train: index=[24172 24173 24174 ... 48340 48341 48342]
  Train: index length = 24171
  Test:  index=[    0     1     2 ... 24169 24170 24171]
  Test:  index length = 24172
Fold 1:
  Train: index=[    0     1     2 ... 24169 24170 24171]
  Train: index length = 24172
  Test:  index=[24172 24173 24174 ... 48340 48341 48342]
  Test:  index length = 24171


In [74]:
# sklearn KFold
kf = KFold(n_splits=2)
for i, (train_index, test_index) in enumerate(kf.split(X=X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")

Fold 0:
  Train: index=[24172 24173 24174 ... 48340 48341 48342]
  Train: index length = 24171
  Test:  index=[    0     1     2 ... 24169 24170 24171]
  Test:  index length = 24172
Fold 1:
  Train: index=[    0     1     2 ... 24169 24170 24171]
  Train: index length = 24172
  Test:  index=[24172 24173 24174 ... 48340 48341 48342]
  Test:  index length = 24171


In [75]:
#my GroupKFold
for i, (train_index, test_index) in enumerate(groupKfold(X=X, k=2, group_field=X["interest_level"])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")

Fold 0:
  Train: index=[    0     2     3 ... 48340 48341 48342]
  Train: index length = 14671
  Test:  index=[    1     4     5 ... 48336 48337 48338]
  Test:  index length = 33672
Fold 1:
  Train: index=[    1     4     5 ... 48336 48337 48338]
  Train: index length = 33672
  Test:  index=[    0     2     3 ... 48340 48341 48342]
  Test:  index length = 14671


In [76]:
# sklearn GroupKFold
group_kfold = GroupKFold(n_splits=2)
for i, (train_index, test_index) in enumerate(group_kfold.split(X=X, groups=X["interest_level"])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")

Fold 0:
  Train: index=[    0     2     3 ... 48340 48341 48342]
  Train: index length = 14671
  Test:  index=[    1     4     5 ... 48336 48337 48338]
  Test:  index length = 33672
Fold 1:
  Train: index=[    1     4     5 ... 48336 48337 48338]
  Train: index length = 33672
  Test:  index=[    0     2     3 ... 48340 48341 48342]
  Test:  index length = 14671


In [77]:
#preprocess target
mean_price = y.mean()

def price_to_class(price):
    if price < mean_price - 100:
        return 3
    elif mean_price - 100 <= price <= mean_price + 100:
        return 4
    else:
        return 2

y_preprocess = pd.DataFrame()
y_preprocess["price"] = y
y_preprocess["class"] = y_preprocess["price"].apply(price_to_class)

In [78]:
#my StratifiedKFold
for i, (train_index, test_index) in enumerate(stratifyKfold(X=X, k=2, stratify_field=y_preprocess["class"])):
    print()
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")


Fold 0:
  Train: index=[23884 23885 23889 ... 48340 48341 48342]
  Train: index length = 24171
  Test:  index=[    0     1     2 ... 24370 24372 24380]
  Test:  index length = 24172

Fold 1:
  Train: index=[    0     1     2 ... 24370 24372 24380]
  Train: index length = 24172
  Test:  index=[23884 23885 23889 ... 48340 48341 48342]
  Test:  index length = 24171


In [79]:
# sklearn StratifiedKFold
skf = StratifiedKFold(n_splits=2)
for i, (train_index, test_index) in enumerate(skf.split(X=X, y=y_preprocess["class"])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")

Fold 0:
  Train: index=[23884 23885 23889 ... 48340 48341 48342]
  Train: index length = 24171
  Test:  index=[    0     1     2 ... 24370 24372 24380]
  Test:  index length = 24172
Fold 1:
  Train: index=[    0     1     2 ... 24370 24372 24380]
  Train: index length = 24172
  Test:  index=[23884 23885 23889 ... 48340 48341 48342]
  Test:  index length = 24171


In [80]:
#my TimeSeriesSplit
for i, (train_index, test_index) in enumerate(timeSeriesSplit(k=2, date_field=X)):
    print()
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")


Fold 0:
  Train: index=[    0     1     2 ... 16112 16113 16114]
  Train: index length = 16115
  Test:  index=[16115 16116 16117 ... 32226 32227 32228]
  Test:  index length = 16114

Fold 1:
  Train: index=[    0     1     2 ... 32226 32227 32228]
  Train: index length = 32229
  Test:  index=[32229 32230 32231 ... 48340 48341 48342]
  Test:  index length = 16114


In [81]:
# sklearn TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=2)
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Train: index length = {len(train_index)}")
    print(f"  Test:  index={test_index}")
    print(f"  Test:  index length = {len(test_index)}")

Fold 0:
  Train: index=[    0     1     2 ... 16112 16113 16114]
  Train: index length = 16115
  Test:  index=[16115 16116 16117 ... 32226 32227 32228]
  Test:  index length = 16114
Fold 1:
  Train: index=[    0     1     2 ... 32226 32227 32228]
  Train: index length = 32229
  Test:  index=[32229 32230 32231 ... 48340 48341 48342]
  Test:  index length = 16114


Each cross-validation method is designed for different tasks and data types, so comparing them directly and picking a single "best" one in general is not quite correct:

* K-Fold: a general-purpose method that splits the data evenly without taking any structure into account. It suits most tasks with independent and identically distributed data.

* Grouped K-Fold: matters when the data is grouped (for example, by customer or session), so that a group never appears in both train and test. It prevents information leakage and gives a fairer estimate.

* Stratified K-Fold: needed for tasks with imbalanced classes (for example, classification with rare events), where class proportions must be preserved in every fold. This makes the per-fold estimates more reliable.

* Time Series Split: used for time series, where preserving chronological order is critical to avoid leakage from the future into the past and to mimic real forecasting scenarios.

The choice therefore depends on the task and on the structure and nature of the data.
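
Two of the properties listed above can be checked directly on a set of fold indices: whether class shares are preserved and whether any group leaks between train and test. The sketch below uses hypothetical labels, groups and hand-written index arrays; `class_shares` and `groups_overlap` are illustrative helpers, not part of the notebook.

```python
import numpy as np

def class_shares(labels, indices):
    """Share of each class among the given positions."""
    classes, counts = np.unique(np.asarray(labels)[indices], return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

def groups_overlap(groups, train_idx, test_idx):
    """True if any group appears in both train and test."""
    groups = np.asarray(groups)
    return bool(np.intersect1d(groups[train_idx], groups[test_idx]).size)

labels = np.array([0] * 8 + [1] * 2)            # imbalanced target: 80% / 20%
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

plain_test = np.arange(5, 10)                   # contiguous second half
stratified_test = np.array([4, 5, 6, 7, 9])     # keeps the 80% / 20% ratio

print(class_shares(labels, plain_test))         # class 1 is over-represented
print(class_shares(labels, stratified_test))    # matches the overall shares
print(groups_overlap(groups, np.arange(0, 5), plain_test))        # group 2 is split
print(groups_overlap(groups, np.arange(0, 4), np.arange(4, 6)))   # group-respecting
```

Checks like these make the choice of scheme concrete: a scheme is suitable when the property that matters for the data actually holds on its folds.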

## 6. Feature Selection
   1. Fit a Lasso regression model with normalized features. Use your method for splitting samples into 3 parts by field created with 60/20/20 ratio — train/validation/test.
   2. Sort features by weight coefficients from model, fit model to top 10 features and compare quality.
   3. Implement method for simple feature selection by nan-ratio in feature and correlation. Apply this method to feature set and take top 10 features, refit model and measure quality.
   4. Implement permutation importance method and take top 10 features, refit model and measure quality.
   5. Import Shap and also refit model on top 10 features.
   6. Compare the quality of these methods for different aspects — speed, metrics and stability.

In [82]:
X = X.drop("created", axis=1, errors="ignore")

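# NOTE: split_into_3 returns (train, validation, test) for X and then for y, so the names
# X_test/y_test below hold the validation part and X_validation/y_validation the test part.
# Each X still matches its y; only the "val" and "test" labels are swapped.
X_train, X_test, X_validation, y_train, y_test, y_validation = split_into_3(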
    X, y, test_size=0.2, validation_size=0.2
)

X_train.shape, X_test.shape, X_validation.shape, y_train.shape, y_test.shape, y_validation.shape

((29007, 23), (9668, 23), (9668, 23), (29007,), (9668,), (9668,))

In [83]:
def print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict):
    mae_train = mean_absolute_error(y_train, train_predict)
    mae_val = mean_absolute_error(y_validation, validation_predict)
    mae_test = mean_absolute_error(y_test, test_predict)
    rmse_train = np.sqrt(mean_squared_error(y_train, train_predict))
    rmse_val = np.sqrt(mean_squared_error(y_validation, validation_predict))
    rmse_test = np.sqrt(mean_squared_error(y_test, test_predict))
    r2_train = r2_score(y_train, train_predict)
    r2_val = r2_score(y_validation, validation_predict)
    r2_test = r2_score(y_test, test_predict)

    print(f"MAE train: {mae_train}")
    print(f"MAE val: {mae_val}")
    print(f"MAE test: {mae_test}")

    print(f"RMSE train: {rmse_train}")
    print(f"RMSE val: {rmse_val}")
    print(f"RMSE test: {rmse_test}")

    print(f"R2 train: {r2_train}")
    print(f"R2 val: {r2_val}")
    print(f"R2 test: {r2_test}")

In [84]:
lasso = Lasso(alpha=0.1, positive=True)
lasso.fit(X_train, y_train)
train_predict = lasso.predict(X_train)
test_predict = lasso.predict(X_test)
validation_predict = lasso.predict(X_validation)

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 724.4739538315874
MAE val: 741.9960564379298
MAE test: 726.191524951736
RMSE train: 1044.2644748424946
RMSE val: 1073.5651954487369
RMSE test: 1067.3741796097713
R2 train: 0.561775907984769
R2 val: 0.564448828909797
R2 test: 0.5366932502507034


In [85]:
if "created" in feature_list:
    feature_list.remove("created")
coefficients = lasso.coef_
ind = dict(zip(feature_list, coefficients))
ind_sorted = sorted(ind.items(), key=lambda x: abs(x[1]), reverse=True)
for feature, weight in ind_sorted:
    print(f"{feature}: {weight}")

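# NOTE: ind keeps the original feature_list order, so this takes the first 10 features,
# not the 10 largest weights; [name for name, _ in ind_sorted[:10]] would select by weight.
keys_list = list(ind.keys())
top_10 = keys_list[:10]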


bathrooms: 1591.1045826966963
Doorman: 583.7192878249056
bedrooms: 449.1355869810811
LaundryinUnit: 369.3364169492312
DogsAllowed: 106.75249233311753
Elevator: 94.15600922023562
Terrace: 88.75450759542758
FitnessCenter: 76.0343780165375
DiningRoom: 41.359548844463376
interest_level: 0.0
HardwoodFloors: 0.0
CatsAllowed: 0.0
Dishwasher: 0.0
NoFee: 0.0
LaundryinBuilding: 0.0
Pre-War: 0.0
RoofDeck: 0.0
OutdoorSpace: 0.0
HighSpeedInternet: 0.0
Balcony: 0.0
SwimmingPool: 0.0
LaundryInBuilding: 0.0
NewConstruction: 0.0


In [86]:
lasso = Lasso(alpha=0.1)
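lasso.fit(X_train[top_10], y_train)
# NOTE: train_predict is not recomputed for this model, so the train metrics
# printed below still describe the previous fit on all features.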
test_predict = lasso.predict(X_test[top_10])
validation_predict = lasso.predict(X_validation[top_10])

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 724.4739538315874
MAE val: 711.9057674496846
MAE test: 701.4517149591959
RMSE train: 1044.2644748424946
RMSE val: 1038.2380269017747
RMSE test: 1032.9891080769148
R2 train: 0.561775907984769
R2 val: 0.5926420448390284
R2 test: 0.5660629558097283


In [87]:
X.isna().any()

bathrooms            False
bedrooms             False
interest_level       False
Elevator             False
HardwoodFloors       False
CatsAllowed          False
DogsAllowed          False
Doorman              False
Dishwasher           False
NoFee                False
LaundryinBuilding    False
FitnessCenter        False
Pre-War              False
LaundryinUnit        False
RoofDeck             False
OutdoorSpace         False
DiningRoom           False
HighSpeedInternet    False
Balcony              False
SwimmingPool         False
LaundryInBuilding    False
NewConstruction      False
Terrace              False
dtype: bool

In [88]:
X_nan = X.copy()
n_rows, n_cols = X_nan.shape
nan_ratio = 0.1
nan_count = int(n_rows * n_cols * nan_ratio)
mask = np.array([True]*nan_count + [False]*(n_rows * n_cols - nan_count))
np.random.shuffle(mask)
np.random.seed(21)  # NOTE: set after the shuffle, so it does not seed the mask above
mask = mask.reshape(n_rows, n_cols)
X_nan = X_nan.mask(mask)

feature_list_nan = feature_list.copy()
feature_list_nan.append("Nan")

In [89]:
X_nan

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
4,1.0,1.0,1,0.0,1.0,1.0,,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,,0.0,
6,1.0,2.0,0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,2.0,1,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0
10,1.0,3.0,1,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,1.0,0.0,0,1.0,0.0,0.0,0.0,,0.0,0.0,...,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124000,1.0,3.0,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0
124002,1.0,2.0,1,1.0,0.0,1.0,1.0,1.0,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0
124004,1.0,1.0,1,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
124008,1.0,2.0,1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,0.0


In [90]:
nan_counts = dict(X_nan.isna().sum())
nan_counts_sorted = sorted(nan_counts.items(), key=lambda x: x[1])
keys_list_nan = list(dict(nan_counts_sorted).keys())
top_10_nan = keys_list_nan[:10]

In [91]:
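X_nan = X_nan.dropna()
# NOTE: y is not restricted to the rows kept by dropna(), and split_into_3 indexes by position,
# so features and targets are paired by position here; y.loc[X_nan.index] would keep them aligned.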
X_train, X_test, X_validation, y_train, y_test, y_validation = split_into_3(
    X_nan, y, test_size=0.2, validation_size=0.2
)
X_train.shape, X_test.shape, X_validation.shape, y_train.shape, y_test.shape, y_validation.shape

((2576, 23), (858, 23), (858, 23), (2576,), (858,), (858,))

In [92]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train[top_10_nan], y_train)
train_predict = lasso.predict(X_train[top_10_nan])
test_predict = lasso.predict(X_test[top_10_nan])
validation_predict = lasso.predict(X_validation[top_10_nan])

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 1088.5051116979018
MAE val: 1180.5501837478423
MAE test: 1138.1483511425922
RMSE train: 1488.2618888201878
RMSE val: 1631.2716023359403
RMSE test: 1602.341012205051
R2 train: 0.005347275141870278
R2 val: -0.017594721772081767
R2 test: -0.014981493706352822


In [93]:
X_shuffle = X.copy()
# All top_10 columns are shuffled at once, and the model below is refitted on the shuffled data.
for feature in top_10:
    X_shuffle[feature] = np.random.permutation(X_shuffle[feature].values)

X_train, X_test, X_validation, y_train, y_test, y_validation = split_into_3(
    X_shuffle, y, test_size=0.2, validation_size=0.2
)

In [94]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
train_predict = lasso.predict(X_train)
test_predict = lasso.predict(X_test)
validation_predict = lasso.predict(X_validation)

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 1043.2324649712862
MAE val: 1077.3973486765317
MAE test: 1047.625455229421
RMSE train: 1466.0290904859091
RMSE val: 1510.8894134879615
RMSE test: 1472.1494882480397
R2 train: 0.13630497426179056
R2 val: 0.13732404089351868
R2 test: 0.11866895908309427
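
For comparison, permutation importance as it is usually defined keeps the fitted model fixed and shuffles one column of held-out data at a time, recording how much the error grows. The sketch below assumes a plain least-squares model and synthetic data; `fit_linear` and `permutation_importance_mae` are hypothetical helpers written for this illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def permutation_importance_mae(predict, X_val, y_val, n_repeats=5, seed=21):
    """Mean increase in validation MAE when one column at a time is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(np.abs(predict(X_val) - y_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Break the link between column j and the target; the model is not refitted.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean(np.abs(predict(X_perm) - y_val)) - baseline)
        importances[j] = np.mean(increases)
    return importances

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
# Column 0 matters most, column 1 a little, column 2 not at all.
y_toy = 5.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + 0.1 * rng.normal(size=300)

predict = fit_linear(X_toy[:200], y_toy[:200])
importances = permutation_importance_mae(predict, X_toy[200:], y_toy[200:])
print(np.argsort(importances)[::-1])  # features ordered from most to least important
```

The top 10 features would then be the ten columns with the largest error increase, and the model would be refitted on those columns with their original, unshuffled values.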


In [95]:
X_train, X_test, X_validation, y_train, y_test, y_validation = split_into_3(
    X, y, test_size=0.2, validation_size=0.2
)

In [96]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
train_predict = lasso.predict(X_train)
test_predict = lasso.predict(X_test)
validation_predict = lasso.predict(X_validation)

exp = shap.Explainer(lasso, X_train)
shap_values = exp(X_train)

In [97]:
mean_abs_shap = np.mean(np.abs(shap_values.values), axis=0)
shap_val = {}
for i, feature_name in enumerate(feature_list):
    print(f"{feature_name}: {mean_abs_shap[i]}")
    shap_val[feature_name] = mean_abs_shap[i]

shap_val_sorted = sorted(shap_val.items(), key=lambda x: x[1], reverse=True)
keys_list = list(dict(shap_val_sorted).keys())
top_10 = keys_list[:10]

bathrooms: 520.7289139593444
bedrooms: 447.9793352461869
interest_level: 220.7924431091073
Elevator: 104.77977605515683
HardwoodFloors: 60.18294556314592
CatsAllowed: 36.73324135553417
DogsAllowed: 64.59116569913535
Doorman: 272.47851963511715
Dishwasher: 73.06092165032292
NoFee: 40.01597182008555
LaundryinBuilding: 85.33191978834125
FitnessCenter: 84.05070349710664
Pre-War: 28.096890196014844
LaundryinUnit: 123.6462958109627
RoofDeck: 26.040386128471248
OutdoorSpace: 13.138869034877443
DiningRoom: 33.34142171949512
HighSpeedInternet: 35.50786999364339
Balcony: 4.397101046180796
SwimmingPool: 9.15944722096128
LaundryInBuilding: 25.90026203426125
NewConstruction: 15.646253877050455
Terrace: 16.141918494937084
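
For a linear model, and under the assumption that features are treated as independent, the SHAP value of feature j for one row has the closed form coef_j * (x_j - mean(x_j)). The sketch below reproduces the mean-absolute-SHAP ranking from that formula on synthetic data; the weights, intercept and data are hypothetical, and the `shap` library may use a different background or estimator depending on its settings.

```python
import numpy as np

def linear_shap_values(coef, X, background):
    """SHAP values of a linear model under the feature-independence assumption."""
    return coef * (X - background.mean(axis=0))

coef = np.array([1500.0, 450.0, 0.0])   # hypothetical weights
intercept = 2000.0
rng = np.random.default_rng(21)
X_toy = rng.integers(0, 4, size=(100, 3)).astype(float)

phi = linear_shap_values(coef, X_toy, X_toy)
mean_abs_shap = np.abs(phi).mean(axis=0)
ranking = np.argsort(mean_abs_shap)[::-1]

# Additivity: base value plus the sum of contributions reproduces each prediction.
predictions = intercept + X_toy @ coef
base_value = predictions.mean()
print(np.allclose(base_value + phi.sum(axis=1), predictions), ranking)
```

Because the contribution is scaled by how far a value lies from the feature mean, a feature with a large coefficient but little spread can rank below a feature with a smaller coefficient and more spread.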


In [98]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train[top_10], y_train)
# NOTE: train_predict still comes from the model fitted on all features in the previous cell.
test_predict = lasso.predict(X_test[top_10])
validation_predict = lasso.predict(X_validation[top_10])

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 687.5224005449379
MAE val: 702.924854343843
MAE test: 692.8063641766822
RMSE train: 994.1808495100179
RMSE val: 1025.522533326079
RMSE test: 1022.1104664980079
R2 train: 0.6028029368872916
R2 val: 0.6025589205526094
R2 test: 0.5751546072948159


Comparing the feature selection methods: sorting by Lasso coefficients and SHAP gave broadly similar results and good model quality. The nan-ratio method suffered from the many missing values: after rows with NaN were dropped, only a small part of the data remained, and the resulting metrics were much worse. The permutation importance experiment broke the relationships between the shuffled features and the rest of the data, which showed up in the errors and in the stability of the model. Overall, SHAP and coefficient sorting showed the best stability and metrics among the approaches tried.

## 7. Hyperparameter optimization
   1. Implement grid search and random search methods for alpha and l1_ratio for sklearn's ElasticNet model.
   2. Find the best combination of model hyperparameters.
   3. Fit the resulting model.
   4. Import optuna and configure the same experiment with ElasticNet.
   5. Estimate metrics and compare approaches.
   6. Run optuna on one of the cross-validation schemes.

In [99]:
param = {
    "alpha" : [0.1, 0.5, 0.9, 0.01, 0.001, 1],
    "l1_ratio" : [0.1, 0.5, 0.9, 1]
}

In [100]:
elastic = ElasticNet()
grid_search = GridSearchCV(estimator=elastic, param_grid=param, cv=3)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'alpha': 0.01, 'l1_ratio': 0.9}

In [101]:
elastic = ElasticNet()
random_search = RandomizedSearchCV(estimator=elastic, param_distributions=param, n_iter=21, cv=3)
random_search.fit(X_train, y_train)
random_search.best_params_

{'l1_ratio': 0.9, 'alpha': 0.01}

In [102]:
elastic = ElasticNet(alpha=0.01, l1_ratio=0.9)
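elastic.fit(X_train, y_train)
# NOTE: train_predict is not recomputed, so the train metrics below belong to the earlier Lasso fit.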
test_predict = elastic.predict(X_test)
validation_predict = elastic.predict(X_validation)

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 687.5224005449379
MAE val: 699.143426049331
MAE test: 688.4418245392203
RMSE train: 994.1808495100179
RMSE val: 1018.718116123141
RMSE test: 1015.0598233476362
R2 train: 0.6028029368872916
R2 val: 0.6078155249351919
R2 test: 0.5809956624903004


In [103]:
def objective(trial):
    alpha = trial.suggest_float("alpha", 0.0001, 1.0, log=True)
    l1_ratio = trial.suggest_float("l1_ratio", 0.0, 1.0)
    elastic = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    score = cross_val_score(elastic, X_train, y_train, cv=3, scoring="r2")
    return score.mean()
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=21)
study.best_params

[I 2025-10-26 13:44:15,096] A new study created in memory with name: no-name-99987fce-13c3-490e-86a4-08ffd41b2a63
[I 2025-10-26 13:44:15,809] Trial 0 finished with value: 0.6001725420951419 and parameters: {'alpha': 0.02447962499188788, 'l1_ratio': 0.4468518759209197}. Best is trial 0 with value: 0.6001725420951419.
[I 2025-10-26 13:44:16,119] Trial 1 finished with value: 0.5608271146449809 and parameters: {'alpha': 0.32056834594950123, 'l1_ratio': 0.558542537721281}. Best is trial 0 with value: 0.6001725420951419.
[I 2025-10-26 13:44:17,996] Trial 2 finished with value: 0.6011191620556879 and parameters: {'alpha': 0.00019718947086787015, 'l1_ratio': 0.04783458262755269}. Best is trial 2 with value: 0.6011191620556879.
[I 2025-10-26 13:44:19,364] Trial 3 finished with value: 0.6011304481922473 and parameters: {'alpha': 0.001460119485319444, 'l1_ratio': 0.41611906249831276}. Best is trial 3 with value: 0.6011304481922473.
[I 2025-10-26 13:44:19,756] Trial 4 finished with value: 0.587419

{'alpha': 0.00230063321857546, 'l1_ratio': 0.1910608105001324}

In [None]:
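# NOTE: these values differ from the best_params printed above;
# ElasticNet(**study.best_params) would reuse the current study's result.
elastic = ElasticNet(alpha=0.001845453427371508, l1_ratio=0.009067891349687607)
elastic.fit(X_train, y_train)
# NOTE: train_predict is not recomputed, so the train metrics below belong to the earlier Lasso fit.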
test_predict = elastic.predict(X_test)
validation_predict = elastic.predict(X_validation)

print_mrr(y_train, y_test, y_validation, train_predict, validation_predict, test_predict)

MAE train: 687.5224005449379
MAE val: 699.0628583288054
MAE test: 688.285092718828
RMSE train: 994.1808495100179
RMSE val: 1018.7968817411861
RMSE test: 1014.9099982414439
R2 train: 0.6028029368872916
R2 val: 0.6077548764667204
R2 test: 0.5811193453213579

