In [17]:
import warnings

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_regression

from sklearn.multioutput import MultiOutputRegressor

from sklearn.svm import LinearSVR

from sklearn.model_selection import (
    train_test_split,
    RepeatedKFold,
    cross_val_score)

from sklearn.metrics import mean_absolute_error



The objective of this notebook is to demonstrate a wrapper that allows fitting a regression model on problems with two or more outputs when the underlying estimator cannot natively handle multiple outputs.


# **Info**
---

**@By:** Kaiziferr

**@Git:** https://github.com/Kaiziferr

# **Config**

---



In [18]:
plt.style.use('ggplot')
random_seed = 12354
warnings.filterwarnings('ignore')

# **Data**
---



A synthetic dataset of 1,500 records is generated with 10 features (6 of them informative), Gaussian noise with a standard deviation of 0.75, and two outputs.

In [19]:
X, y = make_regression(
    n_samples=1500,
    n_features=10,
    n_informative=6,
    n_targets=2,
    random_state = random_seed,
    noise=0.75)
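As a quick sanity check, the shapes of the generated arrays can be inspected. This is a minimal sketch that repeats the call above with the same parameters; the names X_demo and y_demo are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_regression

# Same parameters as the cell above (seed 12354 is the notebook's random_seed)
X_demo, y_demo = make_regression(
    n_samples=1500,
    n_features=10,
    n_informative=6,
    n_targets=2,
    random_state=12354,
    noise=0.75)

print(X_demo.shape)  # (1500, 10): one row per record, one column per feature
print(y_demo.shape)  # (1500, 2): one column per output
```

The two-column target is what makes this a multi-output problem.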

A data frame is created.

In [20]:
# Combine features and targets into one frame; keep all rows in `data`
# and call head() only when displaying a preview.
data = pd.DataFrame(
    np.concatenate((X, y), axis=1),
    columns=[f'X{i}' for i in range(1, 11)] + [f'y{i}' for i in range(1, 3)])
data.head()

,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,y1,y2
0,1.346289,-1.065927,1.90884,1.106275,-1.287102,-0.578371,0.25433,-0.234749,0.455759,-0.326307,12.09827,-51.438046
1,-1.310131,-1.758383,0.619499,-0.434613,0.704398,-0.536541,-0.816816,0.086069,-0.272814,0.663045,-104.395994,-225.381916
2,0.924924,-0.632634,-0.685409,-0.069438,-1.690132,0.129748,-1.006345,-0.610594,0.048564,1.227449,15.690527,-105.688188
3,0.549514,-0.447932,0.30615,-2.049563,-1.272985,0.661454,-0.179638,-0.152771,-0.781228,-0.513632,-63.538937,-15.236932
4,2.427584,1.248808,0.02969,-0.300321,-1.026519,0.067276,-0.625563,1.735031,1.272868,-2.4758,12.372237,55.072781


# **Split**
---

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=random_seed,
    test_size=0.2
)

# **Model**
---

Three linear support vector regression models (LinearSVR) are instantiated. This estimator does not natively support fitting or prediction in problems with two or more outputs.

In [22]:
modelA = LinearSVR(random_state=random_seed)
modelB = LinearSVR(random_state=random_seed)
modelC = LinearSVR(random_state=random_seed)

Fitting modelA directly on the two-column target raises an error, because LinearSVR accepts only a one-dimensional target.

In [23]:
try:
  modelA.fit(X, y)
except Exception as e:
  print(e)

y should be a 1d array, got an array of shape (1500, 2) instead.


The MultiOutputRegressor wrapper from sklearn is instantiated. It fits a separate clone of the given model for each output of the problem; in this case, it internally fits two SVMs, one per output, independently of each other. Because the outputs are modeled independently, this method is not advisable when there is dependency between the outputs.

In [24]:
wrapper = MultiOutputRegressor(modelA)
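The cloning behavior described above can be inspected through the fitted wrapper's estimators_ attribute. The following is a minimal sketch on a small synthetic dataset; the names X_demo, y_demo, base, and demo_wrapper, as well as the sample size and seed, are illustrative assumptions rather than part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

# Small illustrative dataset with two outputs
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=6,
    n_targets=2, random_state=0, noise=0.75)

base = LinearSVR(random_state=0)
demo_wrapper = MultiOutputRegressor(base).fit(X_demo, y_demo)

# One fitted clone per target column
print(len(demo_wrapper.estimators_))

# The estimator passed to the wrapper is cloned, so it stays unfitted
print(hasattr(base, "coef_"))

# Predictions have one column per output
print(demo_wrapper.predict(X_demo).shape)
```

Each entry of estimators_ is an ordinary single-output LinearSVR, so its coef_ and intercept_ can be examined as usual.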

The performance is evaluated with repeated 10-fold cross-validation (3 repetitions), using the MAE metric to validate the model's performance on the generated data. Since the data is synthetic and the features are of similar magnitude, standardization is not considered necessary.

In [25]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=random_seed)

n_scores = cross_val_score(
    wrapper,
    X,
    y,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1,
)

The mean MAE is 0.611, with a standard deviation of 0.022. A comparison against baseline models is needed to judge how good this result is.

In [26]:
n_scores = np.absolute(n_scores)
print('MAE: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

MAE: 0.611 (0.022)
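One possible baseline, sketched here as an assumption rather than as the author's method, is sklearn's DummyRegressor, which always predicts the training mean of each output. The sketch regenerates the same data and reuses the same cross-validation scheme; the names X_demo, y_demo, cv_demo, and baseline are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

seed = 12354  # same value as the notebook's random_seed

X_demo, y_demo = make_regression(
    n_samples=1500, n_features=10, n_informative=6,
    n_targets=2, random_state=seed, noise=0.75)

cv_demo = RepeatedKFold(n_splits=10, n_repeats=3, random_state=seed)

# DummyRegressor handles multi-output targets directly:
# it predicts the per-output mean of the training fold.
baseline = DummyRegressor(strategy="mean")
baseline_scores = cross_val_score(
    baseline, X_demo, y_demo,
    scoring="neg_mean_absolute_error", cv=cv_demo)

baseline_mae = np.absolute(baseline_scores)
print('Baseline MAE: %.3f (%.3f)' % (baseline_mae.mean(), baseline_mae.std()))
```

If the wrapper's MAE is far below this baseline, the model is learning real structure from the features rather than merely reproducing the target mean.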


In [27]:
data_results = pd.DataFrame(X_test)

Wrapper model fitting

In [28]:
wrapper.fit(X_train, y_train)
y_predict_wrapper = wrapper.predict(X_test)
data_results['predict_wrapper_output_1'] = y_predict_wrapper[:,0]
data_results['predict_wrapper_output_2'] = y_predict_wrapper[:,1]
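The held-out predictions can also be scored per output with mean_absolute_error, which accepts multioutput='raw_values'. This is a minimal self-contained sketch that repeats the data generation, split, and fit from the cells above; the variable names are illustrative and no particular error values are claimed.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

seed = 12354  # same value as the notebook's random_seed

X_demo, y_demo = make_regression(
    n_samples=1500, n_features=10, n_informative=6,
    n_targets=2, random_state=seed, noise=0.75)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=seed, test_size=0.2)

demo_wrapper = MultiOutputRegressor(LinearSVR(random_state=seed))
demo_wrapper.fit(X_tr, y_tr)
pred = demo_wrapper.predict(X_te)

# One MAE per output column instead of a single averaged value
mae_per_output = mean_absolute_error(y_te, pred, multioutput='raw_values')
print(mae_per_output)
```

Reporting the errors separately makes it visible when one output is predicted noticeably worse than the other.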

To verify the wrapper's behavior, two standalone models are fitted with the same seed, one per output. If the wrapper fits an independent copy of the model for each output, the predictions of these models should match the wrapper's outputs.

Fitting of the first model

In [29]:
modelB.fit(X_train, y_train[:, 0])
y_predict_B = modelB.predict(X_test)
data_results['predict_model_b'] = y_predict_B

Fitting of the second model

In [30]:
modelC.fit(X_train, y_train[:, 1])
y_predict_C = modelC.predict(X_test)
data_results['predict_model_c'] = y_predict_C

In [31]:
data_results

,0,1,2,3,4,5,6,7,8,9,predict_wrapper_output_1,predict_wrapper_output_2,predict_model_b,predict_model_c
0,0.598650,1.662646,-0.009399,-1.365592,0.563967,0.562587,-1.497401,0.729871,-1.537306,0.654108,-49.569469,-68.893649,-49.569469,-68.893649
1,0.885069,1.832476,1.426344,-1.475577,0.213101,0.181255,-1.594822,1.158486,-0.871297,1.212205,10.046949,-70.112993,10.046949,-70.112993
2,-1.017664,-0.042076,1.556161,0.387642,-0.184365,0.111102,0.714112,1.315661,0.592366,0.678785,70.107564,96.246989,70.107564,96.246989
3,-1.401187,1.362455,-0.136258,-0.305641,-0.794530,1.081220,-1.192021,-0.372200,0.060008,1.364998,30.401692,39.343725,30.401692,39.343725
4,0.864372,-1.780043,-1.986367,-0.632876,-0.381204,-0.745019,-0.439773,-0.662958,-0.948546,1.860774,-24.730492,-210.559432,-24.730492,-210.559432
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,1.173615,-1.013315,0.064628,0.243252,-0.657884,-0.380854,0.761572,-0.625682,-1.800853,-0.410481,-81.086770,-69.778149,-81.086770,-69.778149
296,-0.282036,0.922947,-2.290448,0.378612,1.066916,1.361547,-0.547818,0.742398,-0.504308,0.886589,25.448492,85.567721,25.448492,85.567721
297,0.586693,-0.832403,0.505701,-1.511164,0.163632,-1.539200,-1.107070,0.290583,0.900708,2.268320,65.897922,-221.208163,65.897922,-221.208163
298,1.123262,-0.547234,-0.204558,-0.438567,3.490339,0.496974,1.044451,-0.219593,-0.710378,-0.531740,3.069322,91.140152,3.069322,91.140152


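Model B's predictions agree with the first output of the wrapper model to the precision displayed in the table above. The exact element-wise comparison below nonetheless returns False, which suggests differences beyond the displayed precision; a tolerance-based check such as np.allclose is better suited to comparing floating-point predictions.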

In [35]:
(data_results['predict_wrapper_output_1']==data_results['predict_model_b']).all()

np.False_

Model C's predictions are identical to the second output of the wrapper model.

In [36]:
(data_results['predict_wrapper_output_2']==data_results['predict_model_c']).all()

np.True_
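Exact == comparisons on floating-point predictions are sensitive to differences in the last bits. A tolerance-based comparison is a more robust way to check that the wrapper and the standalone models agree. The sketch below is self-contained and repeats the setup from the notebook; the variable names are illustrative, np.allclose is used with its default tolerances, and no specific output is claimed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

seed = 12354  # same value as the notebook's random_seed

X_demo, y_demo = make_regression(
    n_samples=1500, n_features=10, n_informative=6,
    n_targets=2, random_state=seed, noise=0.75)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=seed, test_size=0.2)

wrapped = MultiOutputRegressor(LinearSVR(random_state=seed)).fit(X_tr, y_tr)
pred_wrapped = wrapped.predict(X_te)

for i in range(y_tr.shape[1]):
    # Standalone model trained on output column i only
    single = LinearSVR(random_state=seed).fit(X_tr, y_tr[:, i])
    pred_single = single.predict(X_te)

    # Largest absolute gap between the wrapper and the standalone model
    max_diff = np.max(np.abs(pred_wrapped[:, i] - pred_single))
    print(i, np.allclose(pred_wrapped[:, i], pred_single), max_diff)
```

Printing the maximum absolute difference alongside the boolean shows how large any disagreement is, which a bare True or False does not reveal.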
