# MultiOutputRegressor

Test the performance of classical ML algorithms on the CleanRoom prediction problem.

**baseline**: predict the mean value of the test data
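
A mean baseline can be sketched with scikit-learn's `DummyRegressor`. Note the assumption: `DummyRegressor(strategy="mean")` predicts the per-target mean of the *training* targets, which may differ from how the baseline in this notebook was actually computed. The data below is synthetic, not the CleanRoom data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic stand-in data: 20 samples, 3 features, 2 targets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = rng.normal(size=(20, 2))
X_test = rng.normal(size=(5, 3))

# strategy="mean" ignores the features and predicts the per-target training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(y_pred.shape)
print(np.allclose(y_pred, y_train.mean(axis=0)))
```

Any model that cannot beat this constant prediction has learned nothing useful from the features.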

**Linear Models**:

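1. LinearRegression: simple least-squares fit without regularization.
2. Ridge: linear regression with L2 regularization, which reduces model complexity.
3. Lasso: linear regression with L1 regularization, which increases sparsity and is useful for feature selection.
4. ElasticNet: linear regression combining L1 and L2 regularization, which blends the strengths of Ridge and Lasso.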
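
The sparsity difference between L1 and L2 regularization can be illustrated with a small sketch. The data is synthetic (only the first three of ten features carry signal) and the `alpha` values are arbitrary choices for illustration, not tuned settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only the first 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [5.0, -3.0, 2.0]
y = X @ true_coef + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to exactly 0

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("zero coefficients, Ridge:", n_zero_ridge)
print("zero coefficients, Lasso:", n_zero_lasso)
```

Ridge keeps every coefficient non-zero, while Lasso drops the uninformative features, which is why it doubles as a feature selector.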

**K-neighbors**: (K-nearest-neighbors regression) predicts from the target values of the K closest training samples.
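
A tiny example, assuming the default uniform weighting, shows that the prediction is the mean of the K nearest targets. The one-dimensional data is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four training points on a line, with targets equal to the inputs.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# The two nearest neighbors of 1.4 are x=1 and x=2, so the
# prediction is the mean of their targets: (1.0 + 2.0) / 2.
pred = knn.predict(np.array([[1.4]]))
print(pred[0])  # 1.5
```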

**SVM**: SVR (support vector regression): the regression variant of the support vector machine, which builds the model by finding an optimal hyperplane.

**MLP**: (multi-layer perceptron) a multi-layer neural network used for regression.

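**GaussianProcessRegressor**: (Gaussian process regression) models the regression function with a Gaussian process and can estimate the uncertainty of its predictions.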
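
The uncertainty estimate can be read through `predict(..., return_std=True)`. The sketch below uses toy data and a fixed RBF kernel (`optimizer=None` disables hyperparameter tuning); both are assumptions made to keep the example short, not settings from this notebook.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Three training points sampled from a sine curve.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()

# Fixed kernel, no hyperparameter optimization, small jitter for stability.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6, optimizer=None)
gpr.fit(X_train, y_train)

# Query one point that was seen in training and one far away from the data.
X_query = np.array([[1.0], [10.0]])
mean, std = gpr.predict(X_query, return_std=True)
print("std at a training point:", std[0])
print("std far from the data:  ", std[1])
```

The predictive standard deviation is close to zero at a training point and grows toward the prior standard deviation far from the data.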

**Tree Models**:

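 1. DecisionTreeRegressor: decision-tree regression that builds the model by partitioning the feature space.
 2. RandomForestRegressor (random forest regression): an ensemble of decision trees built with random feature selection and bootstrap sampling.
 3. AdaBoostRegressor: adaptive boosting, which builds a strong regressor by iteratively training weak regressors.
 4. GradientBoostingRegressor (gradient boosting regression): an ensemble of gradient-boosted trees, which builds a strong regressor by iteratively training weak regressors.
 5. BaggingRegressor: bagging-based regression that combines bootstrap sampling with averaged predictions.
 6. ExtraTreesRegressor (extremely randomized trees): a random-forest variant that further randomizes the features and the split points.
 7. HistGradientBoostingRegressor (histogram gradient boosting regression): histogram-accelerated gradient boosting with faster training.
 8. XGBRegressor: gradient-boosted trees with efficient training and prediction.
 9. LGBMRegressor (LightGBM): gradient-boosted trees with fast training and high accuracy.
 10. CatBoostRegressor: gradient-boosted trees; the CatBoost library handles both classification and regression tasks.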


In [1]:
import os

# os.chdir("/root/workspace/CCP/")

In [2]:
"""
* 
* 
* File: mth_MOR.ipynb
* Author: Fan Kai
* Soochow University
* Created: 2023-11-14 08:24:48
* ----------------------------
* Modified: 2023-11-18 04:10:02
* Modified By: Fan Kai
* ========================================================================
* HISTORY:
"""


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import RepeatedKFold, cross_validate, GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from utils.metrics import calculate_metrics, get_ccp_scoring, print_results_table
from utils.datasets import load_and_split_data

ModuleNotFoundError: No module named 'utils'

In [3]:
# Use a font that can render Chinese characters
import matplotlib as mpl

mpl.rcParams["font.sans-serif"] = ["SimHei"]
# Render the minus sign correctly when a non-default font is used
mpl.rcParams["axes.unicode_minus"] = False

## Data

### read data

In [4]:
data_path = "data/rdc_data_cleaned.csv"
x_train, x_test, y_train, y_test = load_and_split_data(
    data_path, test_size=0.1, random_state=42
)

# Print the sizes of the resulting splits
print("Training set size:", len(x_train))
print("Test set size:", len(x_test))

Training set size: 350
Test set size: 39


# Native Multi-Output Support
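
The estimators in this part accept a two-dimensional target array directly. A minimal sketch on synthetic data (two targets, arbitrary small model settings) checks that each one returns one prediction column per target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 50 samples, 4 features, 2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=(50, 2))

models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Fit each model on the 2-D target and record the prediction shape.
shapes = {}
for name, model in models.items():
    shapes[name] = model.fit(X, Y).predict(X[:5]).shape
    print(name, shapes[name])
```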

## LinearRegression

### Normal LR

In [5]:
model_lr = LinearRegression().fit(x_train, y_train)
calculate_metrics(model_lr.predict(x_test), y_test, print_metrics=True, title="lr")
model_lr = LinearRegression()
cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(
    model_lr, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring
)
results
print_results_table(results, ccp_scoring, title="lr-cv")
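
For readers without the `utils` helpers, the cross-validation pattern above can be sketched with built-in scorers. The `scoring` dict here is a hypothetical stand-in; what `get_ccp_scoring()` actually returns is not shown in this notebook. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic two-target regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = X @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(60, 2))

# Hypothetical scoring dict built from scikit-learn's built-in scorer names.
scoring = {"rmse": "neg_root_mean_squared_error", "r2": "r2"}

# 5 folds repeated 3 times gives 15 train/validation splits.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
results = cross_validate(LinearRegression(), X, Y, cv=cv, scoring=scoring)

print(sorted(results))             # one "test_<name>" entry per scorer
print(len(results["test_rmse"]))   # one score per split
```

Each `test_<name>` array holds one score per split, so averaging it gives the cross-validated estimate for that metric.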

### Ridge

In [68]:
model_lr_ridge = Ridge(alpha=1.0).fit(x_train, y_train)
calculate_metrics(
    model_lr_ridge.predict(x_test), y_test, print_metrics=True, title="lr_ridge"
)
model_lr_ridge = Ridge(alpha=1.0)  # fresh, unfitted Ridge for cross-validation
cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(
    model_lr_ridge, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring
)
results
print_results_table(results, ccp_scoring, title="lr_ridge-cv")

### Lasso

In [None]:
model_lr_lasso = Lasso(alpha=1.0).fit(x_train, y_train)
calculate_metrics(
    model_lr_lasso.predict(x_test), y_test, print_metrics=True, title="lr_lasso"
)
model_lr_lasso = Lasso(alpha=1.0)  # fresh, unfitted Lasso for cross-validation
cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(
    model_lr_lasso, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring
)
results
print_results_table(results, ccp_scoring, title="lr_lasso-cv")

### ElasticNet

In [None]:
model_lr_ElasticNet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(x_train, y_train)
calculate_metrics(
    model_lr_ElasticNet.predict(x_test),
    y_test,
    print_metrics=True,
    title="lr_ElasticNet",
)
model_lr_ElasticNet = ElasticNet(alpha=1.0, l1_ratio=0.5)  # fresh, unfitted ElasticNet for cross-validation
cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(
    model_lr_ElasticNet, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring
)
results
print_results_table(results, ccp_scoring, title="lr_ElasticNet-cv")

### Grid search for all linear models

In [66]:
lr_param_grid = {
    "LinearRegression": {},
    "Ridge": {"alpha": [7.58]},  # 7.58
    "Lasso": {"alpha": [0.15]},  # 0.15
    "ElasticNet": {
        "alpha": [1],  # 0.12
        "l1_ratio": [round(x, 5) for x in np.linspace(0.01, 0.99, 5)],  # 0.89
    },
}

lr_models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(max_iter=30000),
    "Lasso": Lasso(max_iter=30000),
    "ElasticNet": ElasticNet(),
}

for model_name, model in lr_models.items():
    print("=" * 47)
    print(f"Training {model_name}... ")
    print("-" * 80)
    grid_search = GridSearchCV(
        model,
        param_grid=lr_param_grid[model_name],
        scoring=get_ccp_scoring(),
        refit="pres_rmse",
        cv=KFold(n_splits=5, shuffle=True, random_state=42),
        verbose=1,
        n_jobs=1,
    )
    grid_search.fit(x_train, y_train)
    calculate_metrics(
        grid_search.best_estimator_.predict(x_test),
        y_test,
        print_metrics=True,
        title=model_name,
    )
    print("Best Parameters for", model_name, ":", grid_search.best_params_)
    print("Best Score for", model_name, ":", grid_search.best_score_)
    print("=" * 47 + "\n" * 2)

Training LinearRegression... 
--------------------------------------------------------------------------------
Fitting 5 folds for each of 1 candidates, totalling 5 fits


Best Parameters for LinearRegression : {}
Best Score for LinearRegression : -8.74886633048528


Training Ridge... 
--------------------------------------------------------------------------------
Fitting 5 folds for each of 1 candidates, totalling 5 fits


Best Parameters for Ridge : {'alpha': 7.58}
Best Score for Ridge : -8.748390077075655


Training Lasso... 
--------------------------------------------------------------------------------
Fitting 5 folds for each of 1 candidates, totalling 5 fits


Best Parameters for Lasso : {'alpha': 0.15}
Best Score for Lasso : -8.748377796967883


Training ElasticNet... 
--------------------------------------------------------------------------------
Fitting 5 folds for each of 5 candidates, totalling 25 fits


Best Parameters for ElasticNet : {'alpha': 1, 'l1_ratio': 0.01}
Best Score for ElasticNet : -8.82999089564154




## KNeighborsRegressor

### one-shot training

In [None]:
_model = KNeighborsRegressor().fit(x_train, y_train)
_y_pred = _model.predict(x_test)
calculate_metrics(_y_pred, y_test, print_metrics=True)

### cross validation

In [None]:
_model = KNeighborsRegressor()

cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(_model, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring)
print_results_table(results, ccp_scoring)

## Random Forest

### one-shot training

In [None]:
_model = RandomForestRegressor()
_model.fit(x_train, y_train)
_y_pred = _model.predict(x_test)
calculate_metrics(_y_pred, y_test, print_metrics=True)

### cross validation

In [None]:
_model = RandomForestRegressor()
cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(_model, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring)
print_results_table(results, ccp_scoring)

## Decision Tree

### one-shot training

In [None]:
_model = DecisionTreeRegressor()
_model.fit(x_train, y_train)
_y_pred = _model.predict(x_test)
calculate_metrics(_y_pred, y_test, print_metrics=True)

### cross validation

In [None]:
_model = DecisionTreeRegressor()

cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(_model, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring)
print_results_table(results, ccp_scoring)

# MultiOutputRegressor wrapped
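
`MultiOutputRegressor` fits one independent copy of the base estimator per target column, which lets single-output models such as `LinearSVR` handle several targets. A minimal sketch on synthetic data (the `max_iter` and `random_state` values are arbitrary):

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

# Synthetic data: 60 samples, 4 features, 3 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = X @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(60, 3))

# The wrapper clones the base estimator once per target column.
model = MultiOutputRegressor(LinearSVR(max_iter=10000, random_state=0)).fit(X, Y)
y_pred = model.predict(X[:5])

print(len(model.estimators_))  # one fitted LinearSVR per target
print(y_pred.shape)
```

Because the per-target models are independent, this wrapper cannot exploit correlations between targets.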

## LinearSVR

In [None]:
from sklearn.svm import LinearSVR
from sklearn.multioutput import MultiOutputRegressor

_model = LinearSVR()
_model = MultiOutputRegressor(_model)
# fit model
_model.fit(x_train, y_train)
_y_pred = _model.predict(x_test)
calculate_metrics(_y_pred, y_test, print_metrics=True)

### cross validation

In [None]:
_model = MultiOutputRegressor(LinearSVR())

cv_mthd = RepeatedKFold(n_splits=5, n_repeats=3, random_state=44)
ccp_scoring = get_ccp_scoring()
results = cross_validate(_model, X=x_train, y=y_train, cv=cv_mthd, scoring=ccp_scoring)
print_results_table(results, ccp_scoring)

# RegressorChain wrapped
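
`RegressorChain` also fits one model per target, but each model in the chain receives the earlier targets as extra input features. The sketch below uses `LinearRegression` as the base estimator purely to keep the example fast and deterministic, and synthetic data with an explicit chain order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

# Synthetic data: 60 samples, 4 features, 3 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = X @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(60, 3))

# Targets are modeled in the order 0 -> 1 -> 2.
chain = RegressorChain(LinearRegression(), order=[0, 1, 2]).fit(X, Y)

# Each later estimator sees the original features plus all earlier targets.
n_inputs = [est.n_features_in_ for est in chain.estimators_]
print(n_inputs)  # [4, 5, 6]
print(chain.predict(X[:5]).shape)
```

The chain order matters: a target placed later can use the earlier ones as inputs, but not the other way around.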

## LinearSVR

In [None]:
from sklearn.svm import LinearSVR
from sklearn.multioutput import RegressorChain

_model = LinearSVR()
_model = RegressorChain(_model)
# fit model
_model.fit(x_train, y_train)
_y_pred = _model.predict(x_test)
calculate_metrics(_y_pred, y_test, print_metrics=True)