# 1

## I

Leave-One-Out Cross-Validation (LOOCV) is a model validation method and the limiting case of k-fold CV in which k equals the number of observations N. One observation is held out from a dataset of size N and used as the test set, and the model is trained on the remaining N-1 observations. The process is repeated N times, so that each observation serves as the test point exactly once. The final score of the model (for example, accuracy) is the arithmetic mean of the N results.
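The procedure can be sketched with scikit-learn's `LeaveOneOut` splitter. The toy data, the linear model, and the names `X_toy`, `y_toy`, and `loocv_mae` are assumptions made for illustration; they are not part of the dataset used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# Toy data (assumption): 8 points on a noisy line.
rng = np.random.default_rng(21)
X_toy = np.arange(8, dtype=float).reshape(-1, 1)
y_toy = 3.0 * X_toy.ravel() + rng.normal(scale=0.5, size=8)

loo = LeaveOneOut()
abs_errors = []
for train_idx, test_idx in loo.split(X_toy):
    # Fit on N-1 observations, evaluate on the single held-out one.
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    pred = model.predict(X_toy[test_idx])
    abs_errors.append(abs(pred[0] - y_toy[test_idx][0]))

n_fits = len(abs_errors)                 # one fit per observation
loocv_mae = float(np.mean(abs_errors))   # average of the N per-point errors
print(n_fits, loocv_mae)
```

The cost is N model fits, which is why LOOCV is usually reserved for small datasets.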

## II

GridSearch enumerates every combination of hyperparameter values in a given grid and trains the model for each combination

Randomized Search - a distribution is specified for each hyperparameter, and n combinations are sampled at random from these distributions

Bayesian Optimization builds a probabilistic model of how model quality depends on the hyperparameters and uses that model to choose which combination to evaluate in the next iteration
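The difference in cost between the first two strategies can be illustrated by counting candidates. This is a minimal sketch with a hypothetical search space (`space` and its values are assumptions); it uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers and does not train any model.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space: 8 alphas x 10 l1_ratio values.
space = {
    'alpha': np.logspace(-4, 3, 8),
    'l1_ratio': np.linspace(0.1, 1.0, 10),
}

# Grid search evaluates every combination.
grid_candidates = list(ParameterGrid(space))

# Randomized search evaluates only n_iter sampled combinations.
random_candidates = list(ParameterSampler(space, n_iter=10, random_state=21))

print(len(grid_candidates), len(random_candidates))
```

With cross-validation, each candidate is fitted once per fold, so the grid grows multiplicatively with every added hyperparameter while the randomized budget stays fixed at `n_iter`.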

## III

Feature selection methods fall into 3 categories:

Filter Methods - score features by their statistical relationship with the target variable, independently of any model

Wrapper Methods - use a model to evaluate subsets of features, which takes feature interactions into account

Embedded Methods - feature selection is built directly into the model training process

Pearson correlation measures the linear dependence between two variables. It returns a value from -1 to 1.
We select the features whose coefficients are largest in absolute value (close to either 1 or -1), since they have the strongest linear association with the target variable.

Drawback: it does not detect nonlinear dependencies.
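A small synthetic check makes the drawback concrete. The arrays below are made up for illustration: one target is an exact linear function of `x`, the other an exact quadratic function of it.

```python
import numpy as np

# Symmetric grid around zero (assumption: purely synthetic data).
x = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x + 1.0     # perfectly linear in x
y_quadratic = x ** 2         # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
print(r_linear, r_quadratic)
```

The quadratic target is completely determined by `x`, yet its Pearson coefficient is numerically zero, so a correlation filter would discard this feature.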

The chi-squared test is used to assess dependence between categorical features and the target variable. It measures how far the observed frequencies deviate from the frequencies expected under independence. The larger the chi-squared statistic, the stronger the evidence of an association between the feature and the target variable.
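As a sketch, scikit-learn's `sklearn.feature_selection.chi2` can score binary (or count) features against a class label. The two toy features below are assumptions chosen so that one copies the label and the other is balanced across both classes.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy binary target: 50 zeros followed by 50 ones.
y_toy = np.array([0] * 50 + [1] * 50)

f_dependent = y_toy.copy()             # identical to the label
f_independent = np.tile([0, 1], 50)    # 25 ones in each class

X_toy = np.column_stack([f_dependent, f_independent])
scores, p_values = chi2(X_toy, y_toy)
print(scores, p_values)
```

The dependent feature gets a large statistic, while the balanced feature gets a statistic of zero because its observed per-class counts equal the expected ones.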

Lasso adds a penalty on the absolute values of the model coefficients (an L1 penalty). The coefficients of some features shrink to exactly zero, so those features are effectively excluded
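The zeroing effect can be seen on synthetic data. In this sketch (the data, the `alpha=0.5` value, and the variable names are assumptions), only the first two of five columns influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(21)
X_toy = rng.normal(size=(200, 5))
# Only columns 0 and 1 drive the target; columns 2-4 are pure noise features.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
kept = np.flatnonzero(lasso.coef_)   # indices of features with non-zero coefficients
print(lasso.coef_, kept)
```

The surviving coefficients are also shrunk toward zero, so after selection it is common to refit on the kept features, as done later in this notebook.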

Permutation Importance measures how much model performance degrades when the values of a single feature are randomly shuffled. If performance drops substantially, the feature is important.
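The idea can be written out by hand in a few lines. This is a simplified sketch on synthetic data with a single shuffle per feature; `sklearn.inspection.permutation_importance`, used later in the notebook, repeats the shuffle several times and averages.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(21)
X_toy = rng.normal(size=(300, 3))
# Column 0 matters a lot, column 1 a little, column 2 not at all (assumption).
y_toy = 5.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X_toy, y_toy)
baseline = mean_squared_error(y_toy, model.predict(X_toy))

importances = []
for col in range(X_toy.shape[1]):
    X_perm = X_toy.copy()
    # Shuffle one column to break its link with the target.
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    # Importance = increase in error relative to the unshuffled baseline.
    importances.append(mean_squared_error(y_toy, model.predict(X_perm)) - baseline)

ranking = np.argsort(importances)[::-1]
print(importances, ranking)
```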

SHAP (SHapley Additive exPlanations) is a method for explaining model predictions based on game theory (Shapley values). It shows how each feature contributes to the prediction for a specific observation.
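For a linear model with features treated as independent, the Shapley value of feature j for one observation reduces to `coef_j * (x_j - mean(x_j))`. The sketch below computes this closed form with NumPy on synthetic data; it does not call the `shap` library, and all names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(21)
X_toy = rng.normal(size=(100, 3))
y_toy = 4.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)
model = LinearRegression().fit(X_toy, y_toy)

# Base value: the prediction at the mean of the background data.
background_mean = X_toy.mean(axis=0)
base_value = model.predict(background_mean.reshape(1, -1))[0]

# Per-observation, per-feature contributions, shape (n_samples, n_features).
phi = model.coef_ * (X_toy - background_mean)

# Additivity: base value plus contributions reproduces each prediction.
reconstructed = base_value + phi.sum(axis=1)

# Global importance: mean absolute contribution per feature.
global_importance = np.abs(phi).mean(axis=0)
print(global_importance)
```

Averaging absolute contributions over observations gives a global ranking, which is the same aggregation applied to `shap_values.values` later in this notebook.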

# 2

In [1]:
pip install shap

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install optuna

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import random
import shap
import optuna
from collections import Counter
from sklearn.model_selection import (
    train_test_split, KFold, GroupKFold, StratifiedKFold, TimeSeriesSplit,
    GridSearchCV, RandomizedSearchCV, cross_val_score,
)
from sklearn.linear_model import LassoCV, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance

In [4]:
data = pd.read_json('datasets/train.json')
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49352 entries, 4 to 124009
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bathrooms        49352 non-null  float64
 1   bedrooms         49352 non-null  int64  
 2   building_id      49352 non-null  object 
 3   created          49352 non-null  object 
 4   description      49352 non-null  object 
 5   display_address  49352 non-null  object 
 6   features         49352 non-null  object 
 7   latitude         49352 non-null  float64
 8   listing_id       49352 non-null  int64  
 9   longitude        49352 non-null  float64
 10  manager_id       49352 non-null  object 
 11  photos           49352 non-null  object 
 12  price            49352 non-null  int64  
 13  street_address   49352 non-null  object 
 14  interest_level   49352 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory usage: 6.0+ MB


In [5]:
# Encode the ordered categories as integers: low < medium < high
data['interest_level'] = data['interest_level'].map({'low': 0, 'medium': 1, 'high': 2}).astype(int)
data['interest_level'].value_counts()

interest_level
0    34284
1    11229
2     3839
Name: count, dtype: int64

In [6]:
def clean(feature):
  if feature is None:
    return []
  feature_str = str(feature)
  if feature_str.strip() == '[]':
    return []
  cleaned = (feature_str.replace('[', '').replace(']', '').replace("'", "").replace('"', '').strip())
  return [item.strip() for item in cleaned.split(',') if item.strip()]

data['features'] = data['features'].apply(clean)


all_features = []
for feature_list in data['features']:
    all_features.extend(feature_list)

In [7]:
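top = Counter(all_features).most_common(20)
for name, _ in top:
    # Substring test on the stringified list. It is case-sensitive, so
    # 'Laundry in Building' and 'Laundry In Building' become separate columns.
    data[name] = data['features'].apply(lambda x, name=name: int(name in str(x)))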

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49352 entries, 4 to 124009
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   bathrooms            49352 non-null  float64
 1   bedrooms             49352 non-null  int64  
 2   building_id          49352 non-null  object 
 3   created              49352 non-null  object 
 4   description          49352 non-null  object 
 5   display_address      49352 non-null  object 
 6   features             49352 non-null  object 
 7   latitude             49352 non-null  float64
 8   listing_id           49352 non-null  int64  
 9   longitude            49352 non-null  float64
 10  manager_id           49352 non-null  object 
 11  photos               49352 non-null  object 
 12  price                49352 non-null  int64  
 13  street_address       49352 non-null  object 
 14  interest_level       49352 non-null  int32  
 15  Elevator             49352 non-null  int

In [9]:
features = ['bathrooms', 'bedrooms', 'interest_level', 'created']
for item in top:
  features.append(item[0])
features

['bathrooms',
 'bedrooms',
 'interest_level',
 'created',
 'Elevator',
 'Cats Allowed',
 'Hardwood Floors',
 'Dogs Allowed',
 'Doorman',
 'Dishwasher',
 'No Fee',
 'Laundry in Building',
 'Fitness Center',
 'Pre-War',
 'Laundry in Unit',
 'Roof Deck',
 'Outdoor Space',
 'Dining Room',
 'High Speed Internet',
 'Balcony',
 'Swimming Pool',
 'Laundry In Building',
 'New Construction',
 'Terrace']

In [10]:
all_data= data.loc[:, features + ['price']]
all_data.reset_index(inplace=True, drop=True)
low = all_data['price'].quantile(0.01)
up = all_data['price'].quantile(0.99)
cleaned_data = all_data[(all_data['price'] > low) & (all_data['price'] < up)].copy()
cleaned_data

Unnamed: 0,bathrooms,bedrooms,interest_level,created,Elevator,Cats Allowed,Hardwood Floors,Dogs Allowed,Doorman,Dishwasher,...,Roof Deck,Outdoor Space,Dining Room,High Speed Internet,Balcony,Swimming Pool,Laundry In Building,New Construction,Terrace,price
0,1.0,1,1,2016-06-16 05:55:27,0,1,1,1,0,1,...,0,0,1,0,0,0,0,0,0,2400
1,1.0,2,0,2016-06-01 05:44:33,1,0,1,0,1,1,...,0,0,0,0,0,0,0,0,0,3800
2,1.0,2,1,2016-06-14 15:19:59,1,0,1,0,1,1,...,0,0,0,0,0,0,0,0,0,3495
3,1.5,3,1,2016-06-24 07:54:24,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3000
4,1.0,0,0,2016-06-28 03:50:23,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,2795
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49347,1.0,3,0,2016-04-05 03:58:33,1,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,2800
49348,1.0,2,1,2016-04-02 02:25:31,1,1,0,1,1,0,...,0,1,0,0,0,0,1,0,0,2395
49349,1.0,1,1,2016-04-26 05:42:03,1,1,1,1,0,1,...,0,0,1,0,0,0,0,0,0,1850
49350,1.0,2,1,2016-04-19 02:47:33,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,4195


# 3

## Two random parts

In [11]:
X= cleaned_data.drop('price', axis=1)
y= cleaned_data['price']

In [12]:
def two_split(X, y, test_size = 0.2, random_state = 21):
    X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test= two_split(X, y)
y_train.info()
y_test.info()

<class 'pandas.core.series.Series'>
Index: 38674 entries, 44088 to 15631
Series name: price
Non-Null Count  Dtype
--------------  -----
38674 non-null  int64
dtypes: int64(1)
memory usage: 604.3 KB
<class 'pandas.core.series.Series'>
Index: 9669 entries, 15316 to 23189
Series name: price
Non-Null Count  Dtype
--------------  -----
9669 non-null   int64
dtypes: int64(1)
memory usage: 151.1 KB


## Three random parts

In [13]:
def three_split(X, y, test_size= 0.2, validation_size= 0.2, random_state= 21):
    X_tmp, X_test, y_tmp, y_test= train_test_split(X, y, test_size= test_size, random_state= random_state)
    X_train, X_val, y_train, y_val= train_test_split(X_tmp, y_tmp, test_size= validation_size / (1 - test_size), random_state= random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test
X_train, X_val, X_test, y_train, y_val, y_test = three_split(X, y)
y_train.info()
y_val.info()
y_test.info()

<class 'pandas.core.series.Series'>
Index: 29005 entries, 37782 to 4664
Series name: price
Non-Null Count  Dtype
--------------  -----
29005 non-null  int64
dtypes: int64(1)
memory usage: 453.2 KB
<class 'pandas.core.series.Series'>
Index: 9669 entries, 2224 to 14147
Series name: price
Non-Null Count  Dtype
--------------  -----
9669 non-null   int64
dtypes: int64(1)
memory usage: 151.1 KB
<class 'pandas.core.series.Series'>
Index: 9669 entries, 15316 to 23189
Series name: price
Non-Null Count  Dtype
--------------  -----
9669 non-null   int64
dtypes: int64(1)
memory usage: 151.1 KB


## Dates split

In [14]:
def date_split(X, y, date_col, date_split):
    # `created` holds 'YYYY-MM-DD HH:MM:SS' strings, so comparing it with an
    # ISO date string orders the rows chronologically.
    train_mask= X[date_col] < date_split
    test_mask= X[date_col] >= date_split

    X_train, y_train= X[train_mask].copy(), y[train_mask]
    X_test, y_test= X[test_mask].copy(), y[test_mask]

    return X_train, X_test, y_train, y_test

In [15]:
X_train, X_test, y_train, y_test = date_split(X, y, date_col= 'created', date_split='2016-06-06')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(34871, 24) (13472, 24) (34871,) (13472,)


In [16]:
def three_dates_split(X, y, date_column, validation_date, test_date):
    train_mask= X[date_column] < validation_date
    valid_mask= (X[date_column] >= validation_date) & (X[date_column] < test_date)
    test_mask= X[date_column] >= test_date
    
    X_train, y_train= X[train_mask].copy(), y[train_mask]
    X_val, y_val= X[valid_mask].copy(), y[valid_mask]
    X_test, y_test= X[test_mask].copy(), y[test_mask]
    return X_train, X_val, X_test, y_train, y_val, y_test

In [17]:
X_train, X_valid, X_test, y_train, y_valid, y_test = three_dates_split(X, y, date_column= 'created', validation_date='2016-05-05', test_date='2016-06-06')
print(X_train.shape, X_valid.shape, X_test.shape, y_train.shape, y_valid.shape, y_test.shape)

(17722, 24) (17149, 24) (13472, 24) (17722,) (17149,) (13472,)


Determinism means that a function always returns the same results for the same inputs
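For the random splits, determinism comes from fixing `random_state`; the date-based splits involve no randomness at all. A minimal check on a toy index array (the names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20)

# Two calls with the same inputs and the same seed.
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=21)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=21)

same_seed_equal = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_seed_equal)
```

Leaving `random_state=None` would draw a fresh shuffle on every call, so repeated runs of the notebook could produce different splits and metrics.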

# 4,5

In [18]:
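class MyKFold:
    def __init__(self, k= 5):
        if k < 2:
            raise ValueError("k must be at least 2.")
        self.n_splits= k
    def split(self, X):
        n_samples= len(X)
        indexes= np.arange(n_samples)
        # Integer division: the last n_samples % k rows never land in a test fold
        # (they stay in every train set). sklearn's KFold instead gives the first
        # n_samples % k folds one extra row each.
        fold_size= n_samples // self.n_splits
        for i in range(self.n_splits):
            test_id= indexes[i * fold_size : (i+1) * fold_size]
            train_id= np.setdiff1d(indexes, test_id)
            yield train_id, test_id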

In [19]:
kf = MyKFold()

for train_id, test_id in kf.split(X):
    print(f"Train: {train_id}, Test: {test_id}")


Train: [ 9668  9669  9670 ... 48340 48341 48342], Test: [   0    1    2 ... 9665 9666 9667]
Train: [    0     1     2 ... 48340 48341 48342], Test: [ 9668  9669  9670 ... 19333 19334 19335]
Train: [    0     1     2 ... 48340 48341 48342], Test: [19336 19337 19338 ... 29001 29002 29003]
Train: [    0     1     2 ... 48340 48341 48342], Test: [29004 29005 29006 ... 38669 38670 38671]
Train: [    0     1     2 ... 48340 48341 48342], Test: [38672 38673 38674 ... 48337 48338 48339]


In [20]:
sk_kf = KFold()
for train_id, test_id in sk_kf.split(X):
    print(f"Train: {train_id}, Test: {test_id}")

Train: [ 9669  9670  9671 ... 48340 48341 48342], Test: [   0    1    2 ... 9666 9667 9668]
Train: [    0     1     2 ... 48340 48341 48342], Test: [ 9669  9670  9671 ... 19335 19336 19337]
Train: [    0     1     2 ... 48340 48341 48342], Test: [19338 19339 19340 ... 29004 29005 29006]
Train: [    0     1     2 ... 48340 48341 48342], Test: [29007 29008 29009 ... 38672 38673 38674]
Train: [    0     1     2 ... 38672 38673 38674], Test: [38675 38676 38677 ... 48340 48341 48342]
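The two outputs above differ by a few indices because of how the remainder of `n_samples / k` is handled. One possible way to match scikit-learn's fold boundaries is `np.array_split`, which hands the extra rows to the first folds. The helper `kfold_indices` below is a hypothetical sketch, not the class used in this notebook; the row count 48343 is taken from the printed indices.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_indices(n_samples, k=5):
    """Yield (train, test) index arrays; the first n_samples % k folds get one extra row."""
    indices = np.arange(n_samples)
    for test_id in np.array_split(indices, k):
        train_id = np.setdiff1d(indices, test_id)
        yield train_id, test_id

n_rows = 48343  # number of rows after outlier removal, per the outputs above
mine = list(kfold_indices(n_rows, k=5))
reference = list(KFold(n_splits=5).split(np.zeros((n_rows, 1))))

# Compare both halves of every fold against sklearn's unshuffled KFold.
matches = all(
    np.array_equal(m_train, r_train) and np.array_equal(m_test, r_test)
    for (m_train, m_test), (r_train, r_test) in zip(mine, reference)
)
print(matches)
```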


In [21]:
class MyGroupKFold:
    def __init__(self, k=3):
        if k < 2:
            raise ValueError("k must be at least 2.")
        self.n_splits= k
    def split(self, X, group_field):
        indices = np.arange(len(X))
        unique_groups = np.unique(group_field)
        n_groups = len(unique_groups)
        for i in range(self.n_splits):
            # Each fold tests on a contiguous slice of the sorted unique groups,
            # so a group never appears in both train and test.
            test_groups = unique_groups[i * n_groups // self.n_splits : (i + 1) * n_groups // self.n_splits]
            test_id = np.where(np.isin(group_field, test_groups))[0]
            train_id = np.setdiff1d(indices, test_id)
            yield train_id, test_id

In [22]:
groupkf = MyGroupKFold()
for train_id, test_id in groupkf.split(X, group_field=X['interest_level']):
    print(f"Train: {train_id}, Test: {test_id}")

Train: [    0     2     3 ... 48340 48341 48342], Test: [    1     4     5 ... 48336 48337 48338]
Train: [    1     4     5 ... 48337 48338 48342], Test: [    0     2     3 ... 48339 48340 48341]
Train: [    0     1     2 ... 48339 48340 48341], Test: [    7    17    31 ... 48303 48332 48342]


In [23]:
sk_gkf = GroupKFold(n_splits=3)
for train_id, test_id in sk_gkf.split(X, groups=X['interest_level']):
    print(f"Train: {train_id}, Test: {test_id}")

Train: [    0     2     3 ... 48340 48341 48342], Test: [    1     4     5 ... 48336 48337 48338]
Train: [    1     4     5 ... 48337 48338 48342], Test: [    0     2     3 ... 48339 48340 48341]
Train: [    0     1     2 ... 48339 48340 48341], Test: [    7    17    31 ... 48303 48332 48342]


In [24]:
class MyStratifiedKFold:
    def __init__(self, k=3):
        if k < 2:
            raise ValueError("k must be at least 2.")
        self.n_splits = k

    def split(self, X, stratify_field):
        unique_classes = np.unique(stratify_field)
        for i in range(self.n_splits):
            test_index = []
            train_index = []
            for cls in unique_classes:
                # Take the i-th contiguous chunk of each class as test rows, so every
                # fold keeps roughly the class proportions of the full data.
                cls_indices = np.where(stratify_field == cls)[0]
                # Integer division: up to k - 1 trailing rows per class are never tested.
                fold_size = len(cls_indices) // self.n_splits

                test_cls_indices = cls_indices[i * fold_size: (i + 1) * fold_size]
                train_cls_indices = np.setdiff1d(cls_indices, test_cls_indices)

                test_index.extend(test_cls_indices)
                train_index.extend(train_cls_indices)

            yield np.sort(train_index), np.sort(test_index)

In [25]:
stratifiedkfold = MyStratifiedKFold(k=5)
for train_id, test_id in stratifiedkfold.split(X, X['interest_level']):
    print(f"Train: {train_id}, Test: {test_id}")

Train: [ 9563  9564  9565 ... 48340 48341 48342], Test: [   0    1    2 ... 9956 9959 9966]
Train: [    0     1     2 ... 48340 48341 48342], Test: [ 9563  9564  9565 ... 19661 19664 19672]
Train: [    0     1     2 ... 48340 48341 48342], Test: [19196 19199 19201 ... 29171 29175 29176]
Train: [    0     1     2 ... 48340 48341 48342], Test: [28930 28936 28937 ... 38699 38700 38701]
Train: [    0     1     2 ... 48340 48341 48342], Test: [38236 38237 38246 ... 48333 48334 48336]


In [26]:
sk_skf = StratifiedKFold()
for train_id, test_id in sk_skf.split(X, X['interest_level']):
    print(f"Train: {train_id}, Test: {test_id}")

Train: [ 9564  9565  9566 ... 48340 48341 48342], Test: [   0    1    2 ... 9959 9966 9970]
Train: [    0     1     2 ... 48340 48341 48342], Test: [ 9564  9565  9566 ... 19672 19679 19680]
Train: [    0     1     2 ... 48340 48341 48342], Test: [19199 19201 19202 ... 29180 29182 29187]
Train: [    0     1     2 ... 48340 48341 48342], Test: [28936 28937 28938 ... 38715 38721 38726]
Train: [    0     1     2 ... 38715 38721 38726], Test: [38246 38286 38298 ... 48340 48341 48342]


In [27]:
class MyTimeSeriesSplit:
    def __init__(self, k= 4):
        if k < 2:
            raise ValueError("k must be at least 2.")
        self.n_splits = k
    def split(self, X, date_field):
        # Note: the yielded indices are positions in the date-sorted frame,
        # so apply them to X.sort_values(by=date_field), not to X itself.
        X_sorted = X.sort_values(by=date_field)
        n_samples = len(X_sorted)
        # The earliest 20% of rows always stay in train.
        min_train_size = int(n_samples * 0.2)
        indices = np.arange(n_samples)
        fold_size = (n_samples - min_train_size) // self.n_splits
        if fold_size == 0:
            raise ValueError("Too many folds for this data")

        for i in range(self.n_splits):
            start = min_train_size + i * fold_size
            stop = min_train_size + (i + 1) * fold_size if i != self.n_splits - 1 else n_samples
            test_id = indices[start:stop]
            train_id = indices[:start]
            yield train_id, test_id

In [28]:
myts = MyTimeSeriesSplit()
for train_id, test_id in myts.split(X, date_field = 'created'):
    print(f"Train: {(train_id)}, Test: {test_id}")

Train: [   0    1    2 ... 9665 9666 9667], Test: [ 9668  9669  9670 ... 19333 19334 19335]
Train: [    0     1     2 ... 19333 19334 19335], Test: [19336 19337 19338 ... 29001 29002 29003]
Train: [    0     1     2 ... 29001 29002 29003], Test: [29004 29005 29006 ... 38669 38670 38671]
Train: [    0     1     2 ... 38669 38670 38671], Test: [38672 38673 38674 ... 48340 48341 48342]
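The indices printed above are positions in the date-sorted frame. If positions in the original, unsorted frame are needed (for use with `X.iloc`), one option is to route them through `np.argsort`. The function `time_series_split_positions` and the toy dates below are hypothetical and only sketch that idea with the same 20% minimum train size.

```python
import numpy as np
import pandas as pd

def time_series_split_positions(dates, k=4, min_train_frac=0.2):
    """Yield (train, test) positional indices into the *unsorted* data."""
    order = np.argsort(np.asarray(dates), kind="stable")  # row positions ordered by date
    n = len(order)
    min_train = int(n * min_train_frac)
    fold = (n - min_train) // k
    if fold == 0:
        raise ValueError("Too many folds for this data")
    for i in range(k):
        start = min_train + i * fold
        stop = n if i == k - 1 else start + fold
        yield order[:start], order[start:stop]

# Toy, deliberately unsorted timestamps (assumption).
toy_dates = pd.Series(pd.to_datetime([
    '2016-06-16', '2016-04-01', '2016-05-10', '2016-04-20', '2016-06-01',
    '2016-05-25', '2016-04-05', '2016-06-20', '2016-05-01', '2016-06-10',
]))
splits = list(time_series_split_positions(toy_dates, k=4))

# Every training row must be no later than every test row in the same fold.
no_leak = all(toy_dates.iloc[tr].max() <= toy_dates.iloc[te].min() for tr, te in splits)
print(no_leak)
```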


In [29]:
sk_tss = TimeSeriesSplit(n_splits=4)
for train_id, test_id in sk_tss.split(X, X['created']):
    print(f"Train: {(train_id)}, Test: {test_id}")

Train: [   0    1    2 ... 9668 9669 9670], Test: [ 9671  9672  9673 ... 19336 19337 19338]
Train: [    0     1     2 ... 19336 19337 19338], Test: [19339 19340 19341 ... 29004 29005 29006]
Train: [    0     1     2 ... 29004 29005 29006], Test: [29007 29008 29009 ... 38672 38673 38674]
Train: [    0     1     2 ... 38672 38673 38674], Test: [38675 38676 38677 ... 48340 48341 48342]


Plain k-fold is good enough for our purpose. If the target values were distributed very unevenly, however, stratified k-fold would be the better choice.
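`StratifiedKFold` expects class labels, so a continuous target such as price has to be binned first. One common approach, shown here as a sketch on synthetic prices (the lognormal data and the choice of 5 quantile bins are assumptions), is to stratify on `pd.qcut` bins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(21)
# Synthetic, right-skewed "prices" (assumption, not the notebook's data).
prices = pd.Series(rng.lognormal(mean=8.0, sigma=0.5, size=1000))

# Quantile bins act as pseudo-classes for stratification.
price_bins = pd.qcut(prices, q=5, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
fold_bin_counts = [
    np.bincount(price_bins.iloc[test_idx].to_numpy(), minlength=5)
    for _, test_idx in skf.split(prices, price_bins)
]
print(fold_bin_counts)
```

Each test fold then contains the same number of cheap, mid-range, and expensive listings, which keeps fold-to-fold metrics comparable when the target is skewed.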

# 6

In [30]:
X= cleaned_data.drop(['created','price'], axis=1)
X_train, X_val, X_test, y_train, y_val, y_test = three_split(X, y)

In [31]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train, y_train)
print(f'Best alpha: {lasso_cv.alpha_}')
print(f'Coefficients: {lasso_cv.coef_}')

Best alpha: 0.9484269511064273
Coefficients: [1510.13821365  470.80671035 -421.89421329  212.9136184    -5.37815217
 -114.16670549   79.45781212  524.82035797  141.36955146  -92.05328018
 -163.24400675  211.27226192  -43.37725471  437.0392935  -127.66814123
  -57.57824216   94.51672368 -153.68701259  -33.68556827   43.74304983
 -162.77465736  -72.9536528   139.76059722]


In [32]:
coefs = np.abs(lasso_cv.coef_)
ind = np.argsort(coefs)[::-1]
top_10 = ind[:10]
top_10

array([ 0,  7,  1, 13,  2,  3, 11, 10, 20, 17], dtype=int64)

In [33]:
X_train_10= X_train.iloc[:, top_10]
X_val_10= X_val.iloc[:, top_10]
X_test_10= X_test.iloc[:, top_10]


In [34]:
metrics = pd.DataFrame(columns=['name', 'DS_name', 'R2', 'MAE', 'RMSE'])
def write_metrics(model, name, X_train, X_val, X_test):
    # Targets come from the notebook-level y_train, y_val and y_test.
    y_pred_train = model.predict(X_train)
    y_pred_valid = model.predict(X_val)
    y_pred_test = model.predict(X_test)

    r2_metrics = {'train': r2_score(y_train, y_pred_train),
                  'valid': r2_score(y_val, y_pred_valid),
                  'test': r2_score(y_test, y_pred_test)}
    
    mae_metrics = {'train': mean_absolute_error(y_train, y_pred_train),
                   'valid': mean_absolute_error(y_val, y_pred_valid),
                   'test': mean_absolute_error(y_test, y_pred_test)}
    
    rmse_metrics = {'train': np.sqrt(mean_squared_error(y_train, y_pred_train)),
                    'valid': np.sqrt(mean_squared_error(y_val, y_pred_valid)),
                    'test': np.sqrt(mean_squared_error(y_test, y_pred_test))}
    for ds_name in ['train', 'valid', 'test']:
        metrics.loc[len(metrics)] = [name, ds_name, r2_metrics[ds_name], mae_metrics[ds_name], rmse_metrics[ds_name]]

In [35]:
lasso_10 = LassoCV(cv=10)
lasso_10.fit(X_train_10, y_train)
write_metrics(lasso_10, 'X_train_10', X_train_10, X_val_10, X_test_10)

In [36]:
corr_matrix = cleaned_data.drop('created', axis=1).corr()
corr_matrix = corr_matrix.sort_values(by='price', ascending=False)
top_10_corr = corr_matrix.index[1:11].tolist()
print(top_10_corr)

['bathrooms', 'bedrooms', 'Doorman', 'Laundry in Unit', 'Fitness Center', 'Dishwasher', 'Dining Room', 'Elevator', 'Outdoor Space', 'Laundry in Building']


### Implement method for simple feature selection by nan-ratio in feature and correlation. Apply this method to feature set and take top 10 features, refit model and measure quality.

In [37]:
def select_by_miss_and_corr(X, y, top=10):
    miss= X.isna().mean()
    data= pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)
    corr_with_target = data.corr().iloc[:-1, -1]

    scores = pd.DataFrame({
        'missing': miss,
        'correlation': corr_with_target.abs()
    })
    scores['score'] = scores['correlation'] / (1 + scores['missing'])

    return scores.sort_values(by='score', ascending=False).head(top).index.tolist()

In [38]:
top_10_corr = select_by_miss_and_corr(X_train, y_train)
top_10_corr

['bathrooms',
 'bedrooms',
 'Doorman',
 'Laundry in Unit',
 'Fitness Center',
 'Dishwasher',
 'Dining Room',
 'Elevator',
 'interest_level',
 'Laundry in Building']

In [39]:
X_train_corr = X_train.loc[:, top_10_corr]
X_valid_corr = X_val.loc[:, top_10_corr]
X_test_corr = X_test.loc[:, top_10_corr]

In [40]:
lasso_corr = LassoCV(cv=10)
lasso_corr.fit(X_train_corr, y_train)
write_metrics(lasso_corr, 'X_train_corr', X_train_corr, X_valid_corr, X_test_corr)

### Implement permutation importance method and take top 10 features, refit model and measure quality

In [41]:
def get_top_permutation(model, X, y):
    result = permutation_importance(model, X, y, n_repeats=10, random_state=21, scoring="neg_mean_squared_error")
    importance = np.argsort(result.importances_mean)[::-1] 
    return importance.tolist()[:10]

In [42]:
lasso_perm = LassoCV(cv=10)
lasso_perm.fit(X_train, y_train)
top_10_perm = get_top_permutation(lasso_perm, X_train, y_train)
top_10_perm

[0, 1, 2, 7, 13, 3, 11, 10, 8, 5]

In [43]:
X_train_perm = X_train.iloc[:, top_10_perm]
X_val_perm = X_val.iloc[:, top_10_perm]
X_test_perm = X_test.iloc[:, top_10_perm]

In [44]:
lasso_perm.fit(X_train_perm, y_train)
write_metrics(lasso_perm, 'X_train_perm', X_train_perm, X_val_perm, X_test_perm)

### Import SHAP and refit the model on the top 10 features

In [45]:
lasso_shap = LassoCV(cv=10)
lasso_shap.fit(X_train, y_train)
explainer = shap.Explainer(lasso_shap, X_val)
shap_values = explainer(X_val)
importance_values = np.abs(shap_values.values).mean(axis=0)
top_shap = np.argsort(importance_values)[::-1][:10].tolist()
top_shap

[0, 1, 7, 2, 13, 3, 11, 10, 8, 5]

In [46]:
X_train_shap = X_train.iloc[:, top_shap]
X_val_shap = X_val.iloc[:, top_shap]
X_test_shap = X_test.iloc[:, top_shap]
lasso_shap_10 = LassoCV(cv=10)
lasso_shap_10.fit(X_train_shap, y_train)
write_metrics(lasso_shap_10, 'X_train_shap', X_train_shap, X_val_shap, X_test_shap)

### Compare

In [47]:
metrics

Unnamed: 0,name,DS_name,R2,MAE,RMSE
0,X_train_10,train,0.597708,688.487584,997.536548
1,X_train_10,valid,0.597577,693.440925,1017.13464
2,X_train_10,test,0.611971,691.971388,1000.430892
3,X_train_corr,train,0.596291,690.460349,999.291887
4,X_train_corr,valid,0.59609,694.965709,1019.012515
5,X_train_corr,test,0.610007,694.53981,1002.959246
6,X_train_perm,train,0.597274,688.561811,998.074835
7,X_train_perm,valid,0.596579,694.295245,1018.395191
8,X_train_perm,test,0.610312,692.804843,1002.567229
9,X_train_shap,train,0.597274,688.561403,998.074843


In [48]:
metrics.sort_values(by= ['R2', 'RMSE', 'MAE'], ascending=[False, True, True])

Unnamed: 0,name,DS_name,R2,MAE,RMSE
2,X_train_10,test,0.611971,691.971388,1000.430892
8,X_train_perm,test,0.610312,692.804843,1002.567229
11,X_train_shap,test,0.610312,692.804454,1002.567261
5,X_train_corr,test,0.610007,694.53981,1002.959246
0,X_train_10,train,0.597708,688.487584,997.536548
1,X_train_10,valid,0.597577,693.440925,1017.13464
6,X_train_perm,train,0.597274,688.561811,998.074835
9,X_train_shap,train,0.597274,688.561403,998.074843
7,X_train_perm,valid,0.596579,694.295245,1018.395191
10,X_train_shap,valid,0.596579,694.294885,1018.395322


# 7 Hyperparameters

## GridSearch

In [49]:
# np.logspace takes exponents: alpha runs from 10**0.1 (about 1.26) to 10**10
param_grid= {'alpha': np.logspace(0.1, 10, 50),
             'l1_ratio': np.linspace(0.1, 1, 10)}

grid_search= GridSearchCV(estimator= ElasticNet(random_state=21), param_grid= param_grid, scoring= 'neg_mean_squared_error', cv= 5, n_jobs= -1)
grid_search.fit(X_train, y_train)
grid_search.best_estimator_
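Because `np.logspace` takes exponents rather than values, it can help to check which `alpha` range a search space actually covers. A quick check using only NumPy, with the same arguments as the GridSearch and RandomizedSearch cells in this section (the variable names are illustrative):

```python
import numpy as np

grid_alphas = np.logspace(0.1, 10, 50)     # as in the GridSearch cell
random_alphas = np.logspace(-4, 3, 1000)   # as in the RandomizedSearch cell

print(f"grid:   {grid_alphas.min():.4g} .. {grid_alphas.max():.4g}")
print(f"random: {random_alphas.min():.4g} .. {random_alphas.max():.4g}")
```

The grid never tries `alpha` below roughly 1.26, whereas the randomized search space spans 1e-4 to 1e3, so the two searches explore different regularization strengths.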

## RandomizedSearch

In [50]:
param_dist = {'alpha': np.logspace(-4, 3, 1000),
              'l1_ratio': np.linspace(0.1, 1, 100)}
random_search = RandomizedSearchCV(estimator= ElasticNet(random_state=21), param_distributions= param_dist, scoring= 'neg_mean_squared_error', cv= 5, n_jobs= -1)
random_search.fit(X_train, y_train)
random_search.best_params_

{'l1_ratio': 0.3545454545454545, 'alpha': 0.0006292146109610344}

In [51]:
write_metrics(grid_search.best_estimator_, 'GridSearch', X_train, X_val, X_test)
write_metrics(random_search.best_estimator_, 'RandomizedSearch', X_train, X_val, X_test)

## Optuna

In [52]:
def objective(trial):
    alpha = trial.suggest_float('alpha', 1e-4, 1e3, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0.0, 1)

    optuna_el = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=21)
    cv = KFold(n_splits=5, shuffle=True, random_state=21)
    score = cross_val_score(optuna_el, X_train, y_train, cv=cv, scoring='r2')

    return np.mean(score)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(study.best_params)

[I 2025-09-16 01:03:30,415] A new study created in memory with name: no-name-36a1fa33-cbbc-4f8c-a8d5-a73790cdbcf0


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-09-16 01:03:30,523] Trial 0 finished with value: 0.13874848258810973 and parameters: {'alpha': 6.666405427384297, 'l1_ratio': 0.07259726235012431}. Best is trial 0 with value: 0.13874848258810973.
[I 2025-09-16 01:03:30,750] Trial 1 finished with value: 0.6010277757629015 and parameters: {'alpha': 0.00018566066368038355, 'l1_ratio': 0.15548529220462304}. Best is trial 1 with value: 0.6010277757629015.
[I 2025-09-16 01:03:30,836] Trial 2 finished with value: 0.49104354637061476 and parameters: {'alpha': 0.4404025221577187, 'l1_ratio': 0.09143551648439319}. Best is trial 1 with value: 0.6010277757629015.
[I 2025-09-16 01:03:30,920] Trial 3 finished with value: 0.5996122332844573 and parameters: {'alpha': 2.0366935041393495, 'l1_ratio': 0.9945406460781275}. Best is trial 1 with value: 0.6010277757629015.
[I 2025-09-16 01:03:31,160] Trial 4 finished with value: 0.6010274220909896 and parameters: {'alpha': 0.00014097256277155161, 'l1_ratio': 0.058630483011713364}. Best is trial 1 wi

In [53]:
best_model = ElasticNet(**study.best_params, random_state=21)
best_model.fit(X_train, y_train)
pred = best_model.predict(X_test)


In [54]:
write_metrics(best_model, 'Optuna', X_train, X_val, X_test)

In [56]:
metrics.sort_values(by= ['R2', 'RMSE', 'MAE'], ascending=[False, True, True])

Unnamed: 0,name,DS_name,R2,MAE,RMSE
17,RandomizedSearch,test,0.616165,688.848389,995.008774
20,Optuna,test,0.616119,688.800127,995.068116
14,GridSearch,test,0.615867,688.353577,995.395433
2,X_train_10,test,0.611971,691.971388,1000.430892
8,X_train_perm,test,0.610312,692.804843,1002.567229
11,X_train_shap,test,0.610312,692.804454,1002.567261
5,X_train_corr,test,0.610007,694.53981,1002.959246
15,RandomizedSearch,train,0.602288,685.541438,991.842577
18,Optuna,train,0.602281,685.442007,991.850819
12,GridSearch,train,0.602065,684.889898,992.12095
