<h3>Подготовка</h3>
<p>Для работы с датасетом необходимо его загрузить, обработать, а затем выполнить следующие шаги:</p>
<ol><li>Удаление или кодирование категориальных признаков для линейной регрессии и градиентного бустинга.</li><li>Применение различных кодировок для категориальных признаков (One-Hot, Label, Target Encoding).</li><li>Обучение моделей с использованием кодированных данных.</li><li>Обучение модели <code>CatBoost</code>, которая работает с оригинальными категориальными признаками.</li><li>Сравнение метрик для каждой модели и каждой кодировки.</li></ol>

In [28]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import statsmodels.api as sm
from category_encoders import TargetEncoder
from catboost import CatBoostRegressor

# Загрузка датасета
california = fetch_california_housing(as_frame=True)
data = california.frame

# Добавление категориального признака "ocean_proximity" для примера (признак изначально отсутствует в датасете)
np.random.seed(42)
data['ocean_proximity'] = np.random.choice(['NEAR BAY', 'INLAND', 'NEAR OCEAN', 'ISLAND', '1H OCEAN'], size=len(data))
data.dropna(inplace=True)
display(data)

# Целевая переменная - "MedHouseVal"
X = data.drop(columns="MedHouseVal")
y = data["MedHouseVal"]

# Разделение на числовые и категориальные признаки
num_features = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
cat_features = ["ocean_proximity"]

# Разделение на обучающую и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,ocean_proximity
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526,ISLAND
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585,1H OCEAN
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521,NEAR OCEAN
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413,1H OCEAN
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422,1H OCEAN
...,...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781,NEAR BAY
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771,NEAR OCEAN
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923,NEAR BAY
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847,INLAND


<h3>Задание 1: Обучение линейной регрессии и градиентного бустинга на вещественных признаках</h3>
<ol><li>Убираем категориальные признаки.</li><li>Обучаем линейную регрессию с помощью <code>LinearRegression</code> и <code>OLS</code>.</li><li>Обучаем градиентный бустинг.</li></ol>

In [29]:
# Оставляем только вещественные признаки
X_train_num = X_train[num_features]
X_test_num = X_test[num_features]

# Линейная регрессия с использованием LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train_num, y_train)
y_pred_linear = linear_model.predict(X_test_num)

# Метрики для LinearRegression
r2_linear = r2_score(y_test, y_pred_linear)
mae_linear = mean_absolute_error(y_test, y_pred_linear)
rmse_linear = np.sqrt(mean_squared_error(y_test, y_pred_linear))

print("Linear Regression (sklearn)")
print(f"R^2: {r2_linear:.3f}, MAE: {mae_linear:.3f}, RMSE: {rmse_linear:.3f}\n")

# Линейная регрессия с использованием OLS
X_train_sm = sm.add_constant(X_train_num)
X_test_sm = sm.add_constant(X_test_num)

ols_model = sm.OLS(y_train, X_train_sm).fit()
y_pred_ols = ols_model.predict(X_test_sm)

# Метрики для OLS
r2_ols = r2_score(y_test, y_pred_ols)
mae_ols = mean_absolute_error(y_test, y_pred_ols)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))

print("OLS Regression (statsmodels)")
print(f"R^2: {r2_ols:.3f}, MAE: {mae_ols:.3f}, RMSE: {rmse_ols:.3f}\n")

# Градиентный бустинг
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train_num, y_train)
y_pred_gb = gb_model.predict(X_test_num)

# Метрики для градиентного бустинга
r2_gb = r2_score(y_test, y_pred_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))

print("Gradient Boosting")
print(f"R^2: {r2_gb:.3f}, MAE: {mae_gb:.3f}, RMSE: {rmse_gb:.3f}")

Linear Regression (sklearn)
R^2: 0.576, MAE: 0.533, RMSE: 0.746

OLS Regression (statsmodels)
R^2: 0.576, MAE: 0.533, RMSE: 0.746

Gradient Boosting
R^2: 0.776, MAE: 0.372, RMSE: 0.542


<h3>Задание 2: Применение кодировок для категориальных признаков и обучение моделей</h3>
<ol><li>Применим One-Hot Encoding, Label Encoding и Target Encoding для категориальных признаков.</li><li>Обучим линейную регрессию (OLS и LinearRegression) и градиентный бустинг на каждом из закодированных наборов данных.</li></ol>

In [30]:
results = []

# Функция для вычисления метрик
def evaluate_model(model, X_test, y_test, encoding_type, model_type):
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results.append({
        'Encoding': encoding_type,
        'Model': model_type,
        'R^2': r2,
        'MAE': mae,
        'RMSE': rmse
    })

# --- One-Hot Encoding ---
onehot_encoder = OneHotEncoder(drop='first', sparse_output=False)
X_train_ohe = pd.concat([X_train[num_features], pd.DataFrame(onehot_encoder.fit_transform(X_train[cat_features]))], axis=1)
X_test_ohe = pd.concat([X_test[num_features], pd.DataFrame(onehot_encoder.transform(X_test[cat_features]))], axis=1)
X_train_ohe.columns = X_train_ohe.columns.astype(str)

# Линейная регрессия на One-Hot Encoding
linear_model.fit(X_train_ohe, y_train)
evaluate_model(linear_model, X_test_ohe, y_test, 'One-Hot', 'Linear Regression')

# Линейная регрессия (OLS) на One-Hot Encoding
X_train_ohe_sm = sm.add_constant(X_train_ohe)
X_test_ohe_sm = sm.add_constant(X_test_ohe)
ols_model = sm.OLS(y_train, X_train_ohe_sm).fit()
evaluate_model(ols_model, X_test_ohe_sm, y_test, 'One-Hot', 'OLS')

# Градиентный бустинг на One-Hot Encoding
gb_model.fit(X_train_ohe, y_train)
evaluate_model(gb_model, X_test_ohe, y_test, 'One-Hot', 'Gradient Boosting')


# --- Label Encoding ---
X_train_le = X_train[num_features].copy()
X_test_le = X_test[num_features].copy()

for col in cat_features:
    le = LabelEncoder()
    X_train_le[col] = le.fit_transform(X_train[col])
    X_test_le[col] = le.transform(X_test[col])

# Линейная регрессия на Label Encoding
linear_model.fit(X_train_le, y_train)
evaluate_model(linear_model, X_test_le, y_test, 'Label', 'Linear Regression')

# Линейная регрессия (OLS) на Label Encoding
X_train_le_sm = sm.add_constant(X_train_le)
X_test_le_sm = sm.add_constant(X_test_le)
ols_model = sm.OLS(y_train, X_train_le_sm).fit()
evaluate_model(ols_model, X_test_le_sm, y_test, 'Label', 'OLS')

# Градиентный бустинг на Label Encoding
gb_model.fit(X_train_le, y_train)
evaluate_model(gb_model, X_test_le, y_test, 'Label', 'Gradient Boosting')


# --- Target Encoding ---
target_encoder = TargetEncoder(cols=cat_features)
X_train_te = target_encoder.fit_transform(X_train, y_train)
X_test_te = target_encoder.transform(X_test)

# Линейная регрессия на Target Encoding
linear_model.fit(X_train_te, y_train)
evaluate_model(linear_model, X_test_te, y_test, 'Target', 'Linear Regression')

# Линейная регрессия (OLS) на Target Encoding
X_train_te_sm = sm.add_constant(X_train_te)
X_test_te_sm = sm.add_constant(X_test_te)
ols_model = sm.OLS(y_train, X_train_te_sm).fit()
evaluate_model(ols_model, X_test_te_sm, y_test, 'Target', 'OLS')

# Градиентный бустинг на Target Encoding
gb_model.fit(X_train_te, y_train)
evaluate_model(gb_model, X_test_te, y_test, 'Target', 'Gradient Boosting')

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,0,1,2,3
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,0.0,1.0,0.0,0.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,0.0,0.0,0.0,1.0
17445,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48,,,,
14265,1.9425,36.0,4.002817,1.033803,1418.0,3.994366,32.69,-117.11,0.0,0.0,0.0,0.0
2271,3.5542,43.0,6.268421,1.134211,874.0,2.300000,36.78,-119.80,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
16493,,,,,,,,,1.0,0.0,0.0,0.0
16495,,,,,,,,,0.0,0.0,0.0,0.0
16497,,,,,,,,,0.0,0.0,1.0,0.0
16508,,,,,,,,,0.0,0.0,0.0,1.0


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,0,1,2,3
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,0.0,1.0,0.0,0.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,0.0,0.0,0.0,1.0
14265,1.9425,36.0,4.002817,1.033803,1418.0,3.994366,32.69,-117.11,0.0,0.0,0.0,0.0
2271,3.5542,43.0,6.268421,1.134211,874.0,2.300000,36.78,-119.80,0.0,0.0,1.0,0.0
6252,2.5192,28.0,4.345361,1.074742,1355.0,3.492268,34.04,-117.97,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
11284,6.3700,35.0,6.129032,0.926267,658.0,3.032258,33.78,-117.96,0.0,0.0,1.0,0.0
11964,3.0500,33.0,6.868597,1.269488,1753.0,3.904232,34.02,-117.43,1.0,0.0,0.0,0.0
5390,2.9344,36.0,3.986717,1.079696,1756.0,3.332068,34.03,-118.38,0.0,0.0,0.0,1.0
860,5.7192,15.0,6.395349,1.067979,1777.0,3.178891,37.58,-121.96,0.0,0.0,0.0,0.0


ValueError: Found input variables with inconsistent numbers of samples: [13186, 16512]

<h3>Задание 3: Обучение CatBoost на данных с вещественными и категориальными признаками</h3>

In [10]:
# Индексы категориальных признаков для CatBoost
cat_features_indices = [X.columns.get_loc(col) for col in cat_features]

# Обучение CatBoost на всех признаках
catboost_model = CatBoostRegressor(cat_features=cat_features_indices, verbose=0, random_state=42)
catboost_model.fit(X_train, y_train)

# Метрики для CatBoost
y_pred_catboost = catboost_model.predict(X_test)
r2_catboost = r2_score(y_test, y_pred_catboost)
mae_catboost = mean_absolute_error(y_test, y_pred_catboost)
rmse_catboost = np.sqrt(mean_squared_error(y_test, y_pred_catboost))

# Добавим метрики для CatBoost в результаты
results.append({
    'Encoding': 'None',
    'Model': 'CatBoost',
    'R^2': r2_catboost,
    'MAE': mae_catboost,
    'RMSE': rmse_catboost
})

print("CatBoost Results")
print(f"R^2: {r2_catboost:.3f}, MAE: {mae_catboost:.3f}, RMSE: {rmse_catboost:.3f}")

CatBoost Results
R^2: 0.522, MAE: 0.187, RMSE: 0.296


<h3>Задание 4: Сбор результатов в датафрейм и анализ</h3>

In [19]:
# Сбор результатов в датафрейм и анализ
results_df = pd.DataFrame(results)
print("\nРезультаты всех моделей:")
print(results_df)


Результаты всех моделей:
Empty DataFrame
Columns: []
Index: []
