4. Models
    1. Classical linear regression (sklearn)
    2. Linear regression with regularization:
        1. Lasso
        2. Ridge
        3. ElasticNet

        *Try different values of the regularization parameter.*

    3. KNN (sklearn) with different hyperparameters (n_neighbors, weights, metric)
    4. Decision tree (sklearn)
    5. Random Forest (sklearn) with different hyperparameters
    6. Boosting
        1. CatBoost
        2. LightGBM
        3. XGBoost

In [53]:
import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import config

In [54]:
import importlib
importlib.reload(config)

config.CONFIG.keys()

dict_keys(['seed', 'paths', 'preprocessing', 'validation', 'models'])

In [55]:
df = pd.read_csv(config.CONFIG['paths']['train_with_folds'])

TARGET_COL = config.CONFIG['validation']['target_column']
N_SPLITS = config.CONFIG['validation']['n_splits']

print(df.shape)
df.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,fold
0,1,0,2,"Braund, Mr. Owen Harris",1,-0.565419,1,0,A/5 21171,-0.879247,2,1
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,0.663488,1,0,PC 17599,1.360456,0,4
2,3,1,2,"Heikkinen, Miss. Laina",0,-0.258192,0,0,STON/O2. 3101282,-0.798092,2,3
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,0.433068,1,0,113803,1.061442,2,3
4,5,0,2,"Allen, Mr. William Henry",1,0.433068,0,0,373450,-0.783739,2,0


In [None]:
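def run_cv(model_cls, model_params, df, target_col=TARGET_COL, n_splits=N_SPLITS, metric_fn=accuracy_score):
    """Cross-validate a model using the precomputed `fold` column.

    Returns the out-of-fold predictions and the list of per-fold scores.
    """
    oof_preds = np.zeros(len(df))
    scores = []

    # Features: every numeric column except the target and the fold index.
    # Note that numeric identifier columns such as PassengerId also pass this filter.
    feature_cols = [
        col for col in df.columns
        if col not in [target_col, 'fold'] and pd.api.types.is_numeric_dtype(df[col])
    ]

    for fold in range(n_splits):
        # Hold out the current fold for validation and train on the rest.
        train_mask = df['fold'] != fold
        val_mask = df['fold'] == fold

        X_train = df.loc[train_mask, feature_cols]
        y_train = df.loc[train_mask, target_col]
        X_val = df.loc[val_mask, feature_cols]
        y_val = df.loc[val_mask, target_col]

        # Fit a fresh model for every fold.
        model = model_cls(**model_params)
        model.fit(X_train, y_train)

        # Store predictions for the held-out rows.
        preds = model.predict(X_val)
        oof_preds[val_mask.to_numpy()] = preds

        score = metric_fn(y_val, preds)
        scores.append(score)
        print(f"Fold {fold}: {score:.4f}")

    print(f"Mean score: {np.mean(scores):.4f} +- {np.std(scores):.4f}")
    return oof_preds, scores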
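
The numeric filter in `run_cv` keeps every numeric column other than the target and `fold`, which, judging by the preview above, includes `PassengerId`. An identifier like this is unlikely to carry signal and, being unscaled, may distort distance-based models such as KNN. Below is a minimal sketch of a stricter selection; `select_feature_cols` and its `exclude` argument are hypothetical names, not part of this notebook, and the toy frame only mimics the layout of the training data.

```python
import pandas as pd


def select_feature_cols(df, target_col, exclude=("fold", "PassengerId")):
    """Return numeric columns, skipping the target and any excluded names.

    `exclude` is an assumption: adjust it to the identifier columns in your data.
    """
    skip = {target_col, *exclude}
    return [
        col for col in df.columns
        if col not in skip and pd.api.types.is_numeric_dtype(df[col])
    ]


# Toy frame that mimics the layout of the training data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [2, 0, 2],
    "Name": ["a", "b", "c"],
    "Age": [-0.57, 0.66, -0.26],
    "fold": [1, 4, 3],
})
feature_cols = select_feature_cols(toy, target_col="Survived")
print(feature_cols)
```

If such a helper were used inside `run_cv`, the scores reported below would likely change, so they would need to be recomputed.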


# 1. Logistic regression

In [57]:
logreg_params = config.CONFIG['models']['logistic_regression']

logreg_oof, logreg_scores = run_cv(LogisticRegression, logreg_params, df)

Fold 0: 0.7765
Fold 1: 0.8034
Fold 2: 0.7753
Fold 3: 0.7809
Fold 4: 0.8146
Mean score: 0.7901 +- 0.0159



# 2.  Ridge, Lasso, ElasticNet



Regularization penalizes large weights and reduces overfitting:
- Ridge (L2): a penalty on the sum of squared weights
- Lasso (L1): a penalty on the sum of absolute weights; it can drive some weights to exactly zero (feature selection)
- ElasticNet: a combination of L1 and L2
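
The difference between the L1 and L2 penalties is easiest to see by counting the weights that end up exactly zero. The sketch below is only an illustration: it uses scikit-learn's `Lasso` and `Ridge` regressors on synthetic data from `make_regression`, not the classifiers and Titanic data used in this notebook, and the `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 3 of which influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Lasso zero weights: {n_zero_lasso} of {X.shape[1]}")
print(f"Ridge zero weights: {n_zero_ridge} of {X.shape[1]}")
```

Ridge shrinks weights toward zero without removing them, while Lasso typically discards most of the uninformative features here.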

In [58]:
ridge_params = {'alpha': 1.0, 'random_state': config.CONFIG['seed']}
ridge_oof, ridge_scores = run_cv(RidgeClassifier, ridge_params, df)

Fold 0: 0.7709
Fold 1: 0.7921
Fold 2: 0.8146
Fold 3: 0.7753
Fold 4: 0.8202
Mean score: 0.7946 +- 0.0200


In [59]:
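# "Lasso" here is L1-penalized logistic regression; C is the inverse of the regularization strength.
lasso_params = {'penalty': 'l1', 'solver': 'saga', 'C': 0.5, 'max_iter': 10000, 'tol': 1e-3, 'random_state': config.CONFIG['seed']}
lasso_oof, lasso_scores = run_cv(LogisticRegression, lasso_params, df)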

Fold 0: 0.6872
Fold 1: 0.7079
Fold 2: 0.6966
Fold 3: 0.6966
Fold 4: 0.6685
Mean score: 0.6914 +- 0.0132


In [60]:
# Elastic net mixes both penalties; l1_ratio sets the L1 share (0 = pure L2, 1 = pure L1).
elasticnet_params = {'penalty': 'elasticnet', 'solver': 'saga', 'C': 0.5, 'l1_ratio': 0.5, 'max_iter': 10000, 'tol': 1e-3, 'random_state': config.CONFIG['seed']}
elasticnet_oof, elasticnet_scores = run_cv(LogisticRegression, elasticnet_params, df)

Fold 0: 0.6872
Fold 1: 0.7079
Fold 2: 0.6966
Fold 3: 0.6966
Fold 4: 0.6685
Mean score: 0.6914 +- 0.0132


# 3. KNN (k-nearest neighbors)

KNN has no training step as such: the model stores the training set, and for each new object it finds the k nearest neighbors in feature space and takes a vote over their labels. Hyperparameters: n_neighbors, weights (uniform / distance), metric (euclidean, manhattan, minkowski).
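
The voting rule can be written out in a few lines of NumPy. This is a minimal sketch that corresponds roughly to `weights='uniform'` and `metric='euclidean'`; the function name `knn_predict` and the toy data are made up for illustration, and scikit-learn's `KNeighborsClassifier` is implemented differently.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by uniform majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    preds = []
    for x in X_query:
        # Euclidean distance from the query point to every training point.
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training points.
        nearest = np.argsort(dists)[:k]
        # Majority vote over the neighbors' labels.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)


# Two well-separated clusters with labels 0 and 1.
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_train = [0, 0, 0, 1, 1, 1]
preds = knn_predict(X_train, y_train, [[0.05, 0.05], [1.0, 0.95]], k=3)
print(preds)
```

Because the vote depends on raw distances, features on a much larger scale than the others dominate the choice of neighbors.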

In [61]:
knn_params = config.CONFIG['models']['knn']

knn_oof, knn_scores = run_cv(KNeighborsClassifier, knn_params, df)

Fold 0: 0.5810
Fold 1: 0.6011
Fold 2: 0.5618
Fold 3: 0.5730
Fold 4: 0.6236
Mean score: 0.5881 +- 0.0219


# 4. Decision tree

A single model: a tree of rules of the form "if feature ≤ threshold, go left, otherwise go right". It is simple and interpretable but prone to overfitting, so we limit max_depth.
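
To see the rules a fitted tree has learned, scikit-learn's `export_text` prints them as nested threshold tests. The sketch below uses synthetic data and placeholder feature names (`f0` to `f3`), not the notebook's data or config.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic binary classification problem.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# max_depth=2 limits the tree to at most two levels of splits.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

A shallow tree like this gives a handful of readable rules; raising `max_depth` produces a longer rule set that fits the training data more closely.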

In [62]:
dt_params = config.CONFIG['models']['decision_tree']

dt_oof, dt_scores = run_cv(DecisionTreeClassifier, dt_params, df)

Fold 0: 0.7989
Fold 1: 0.7921
Fold 2: 0.7753
Fold 3: 0.7584
Fold 4: 0.8146
Mean score: 0.7879 +- 0.0194


# 5. Random Forest

An ensemble of many trees trained on bootstrap samples and random feature subsets; the prediction is a majority vote. It reduces overfitting compared with a single tree. Hyperparameters: `n_estimators`, `max_depth`, etc.
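
The idea can be sketched by hand with plain decision trees: draw a bootstrap sample for each tree, restrict the features considered at each split, and combine the trees by voting. This is an illustration on synthetic data, not the library's implementation; scikit-learn's `RandomForestClassifier` averages predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" considers a random subset of features at each split.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote: labels are 0/1, so the mean is the share of votes for class 1.
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the vote:", (forest_pred == y).mean())
```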

In [63]:
rf_params = config.CONFIG['models']['random_forest']

rf_oof, rf_scores = run_cv(RandomForestClassifier, rf_params, df)

Fold 0: 0.8603
Fold 1: 0.8202
Fold 2: 0.8090
Fold 3: 0.8146
Fold 4: 0.8371
Mean score: 0.8282 +- 0.0186



# 6. Boosting (XGBoost, LightGBM, CatBoost)

Trees are trained sequentially: each new tree corrects the errors of the previous ones. Boosting often gives the best score; each library has its own hyperparameters (n_estimators, learning_rate, max_depth / depth, etc.).
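
The "each tree corrects the previous ones" step can be sketched for the simplest case, squared-error regression, where every new tree is fitted to the current residuals. This is a toy illustration on synthetic data with arbitrary settings; XGBoost, LightGBM and CatBoost differ from it in many details, and the notebook itself uses their classifiers.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # small corrective step
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The learning rate scales each correction, which is why `learning_rate` and `n_estimators` are usually tuned together.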

In [64]:
xgb_params = config.CONFIG['models']['xgboost']

xgb_oof, xgb_scores = run_cv(XGBClassifier, xgb_params, df)

lgbm_params = {**config.CONFIG['models']['lightgbm'], 'verbose': -1}  # suppress [Info]/[Warning] messages

lgbm_oof, lgbm_scores = run_cv(LGBMClassifier, lgbm_params, df)

cat_params = config.CONFIG['models']['catboost']

cat_oof, cat_scores = run_cv(CatBoostClassifier, cat_params, df)

Fold 0: 0.8380
Fold 1: 0.8202
Fold 2: 0.7809
Fold 3: 0.8146
Fold 4: 0.8371
Mean score: 0.8182 +- 0.0208
Fold 0: 0.8324
Fold 1: 0.8090
Fold 2: 0.7584
Fold 3: 0.7978
Fold 4: 0.8315
Mean score: 0.8058 +- 0.0271
0:	learn: 0.6620259	total: 1.45ms	remaining: 577ms
100:	learn: 0.3592706	total: 121ms	remaining: 358ms
200:	learn: 0.3115296	total: 239ms	remaining: 237ms
300:	learn: 0.2659475	total: 357ms	remaining: 118ms
399:	learn: 0.2305699	total: 469ms	remaining: 0us
Fold 0: 0.8547
0:	learn: 0.6739845	total: 895us	remaining: 357ms
100:	learn: 0.3545655	total: 104ms	remaining: 308ms
200:	learn: 0.3027297	total: 213ms	remaining: 211ms
300:	learn: 0.2570692	total: 321ms	remaining: 106ms
399:	learn: 0.2207188	total: 430ms	remaining: 0us
Fold 1: 0.8427
0:	learn: 0.6628856	total: 1.03ms	remaining: 411ms
100:	learn: 0.3435811	total: 124ms	remaining: 367ms
200:	learn: 0.2933551	total: 251ms	remaining: 248ms
300:	learn: 0.2405216	total: 362ms	remaining: 119ms
399:	learn: 0.2009793	total: 477ms	remaining: 0us
Fold 2:

In [65]:
results = pd.DataFrame({
    "model": [
        "LogisticRegression",
        "Ridge",
        "Lasso",
        "ElasticNet",
        "KNN",
        "DecisionTree",
        "RandomForest",
        "XGBoost",
        "LightGBM",
        "CatBoost",
    ],
    "mean_accuracy": [
        np.mean(logreg_scores),
        np.mean(ridge_scores),
        np.mean(lasso_scores),
        np.mean(elasticnet_scores),
        np.mean(knn_scores),
        np.mean(dt_scores),
        np.mean(rf_scores),
        np.mean(xgb_scores),
        np.mean(lgbm_scores),
        np.mean(cat_scores),
    ],
    "std_accuracy": [
        np.std(logreg_scores),
        np.std(ridge_scores),
        np.std(lasso_scores),
        np.std(elasticnet_scores),
        np.std(knn_scores),
        np.std(dt_scores),
        np.std(rf_scores),
        np.std(xgb_scores),
        np.std(lgbm_scores),
        np.std(cat_scores),
    ],
})

result = results.sort_values("mean_accuracy", ascending=False)
print(result)

path_metrics = config.CONFIG['paths']['metrics_results']
results.to_csv(path_metrics, index=False)
print(f"Metrics saved: {path_metrics}")


                model  mean_accuracy  std_accuracy
9            CatBoost       0.833871      0.026856
6        RandomForest       0.828247      0.018595
7             XGBoost       0.818160      0.020767
8            LightGBM       0.805806      0.027149
1               Ridge       0.794639      0.019978
0  LogisticRegression       0.790139      0.015906
5        DecisionTree       0.787866      0.019389
2               Lasso       0.691363      0.013164
3          ElasticNet       0.691363      0.013164
4                 KNN       0.588111      0.021903
Metrics saved: C:\newTry2\classicMLpractice\ProjectKaggle\checkpoints\metrics_results.csv
