# **METAMODELING**

In this notebook, we use the results of Abaqus analyses to build a surrogate model, i.e. a metamodel, of the Finite Element (FE) analysis solver. Since no single model is best in every situation, different hypothesis spaces will be analyzed. Let's start by importing the required libraries.

Steps to follow:

1. Open the notebook on Colab.

2. In the _Files_ folder of the left panel, load the data set and the model info files.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random as rn
import time
import copy
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
# !pip install xgboost
from xgboost import XGBRegressor

If you want to obtain reproducible results, it is important to fix the seed of the random operations.

In [2]:
seed = 123
np.random.seed(seed)
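As a small illustration of why the seed matters, the sketch below (toy draws, not part of the analysis) resets the seed twice and checks that the random numbers coincide. Note that `np.random.seed` only controls NumPy's global generator; Python's built-in `random` module (imported above as `rn`) and estimators that take a `random_state` argument have to be seeded separately.

```python
import random as rn
import numpy as np

seed = 123

# Re-seeding NumPy's global generator reproduces the same sequence.
np.random.seed(seed)
first_draw = np.random.rand(3)
np.random.seed(seed)
second_draw = np.random.rand(3)
same_numpy = np.array_equal(first_draw, second_draw)

# Python's built-in generator is independent and needs its own seed.
rn.seed(seed)
a = rn.random()
rn.seed(seed)
b = rn.random()
same_python = (a == b)

print(same_numpy, same_python)
```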

## **Data Preprocessing**

We start by importing some information about the model used to generate the dataset. In this way we can be sure that the dataset we work on is the correct one.

In [3]:
directory = 'harmlin/'
info = pd.read_csv(directory + 'model_info.csv', sep=",")
info.index = ['Value']
eff_plies = int(info['EffectivePlies'].iloc[0])  # extract the scalar explicitly
train_smp = int(info['Train'].iloc[0])           # number of training rows
info.head()

| | Height | Radius | MaxCurvature | MeshSize | Plies | EffectivePlies | Symmetric | Balanced | AnglesFunction | Train | Test |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 705 | 300 | 0.001575 | 10 | 8 | 2 | True | True | harmlin | 81 | 27 |


At this point we have to import the data set containing the inputs and outputs of the FE analyses. The data is stored in a dataframe whose upper rows form the training set and whose lower rows form the test set. The exact number of rows belonging to the training set is given in the `Train` column of the info table above.

In [4]:
data_orig = pd.read_csv(directory + 'data.csv', sep=',')
data = data_orig.drop(columns='Stiffness')

In [5]:
data.describe()

| | Amplitude1 | PhaseShift1 | Omega1 | Beta1 | Amplitude2 | PhaseShift2 | Omega2 | Beta2 | Buckling |
|---|---|---|---|---|---|---|---|---|---|
| count | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 |
| mean | 54.984167 | 44.978796 | 0.696019 | 22.551852 | 54.951111 | 44.949352 | 0.712593 | 22.492963 | 224.99013 |
| std | 26.135543 | 26.121595 | 0.508325 | 13.085077 | 26.053405 | 26.063126 | 0.505196 | 13.06682 | 28.420656 |
| min | 10.31 | 0.46 | 0.01 | 0.47 | 10.87 | 0.46 | 0.01 | 0.06 | 160.198 |
| 25% | 33.1 | 22.3225 | 0.3075 | 11.25 | 32.665 | 22.835 | 0.3 | 11.5525 | 208.20525 |
| 50% | 55.375 | 44.955 | 0.575 | 22.745 | 54.905 | 45.325 | 0.645 | 22.315 | 225.9805 |
| 75% | 77.41 | 67.3475 | 0.985 | 33.5125 | 76.775 | 67.615 | 1.02 | 33.44 | 242.2225 |
| max | 99.18 | 89.26 | 1.98 | 44.84 | 98.96 | 89.04 | 1.92 | 44.69 | 291.46 |


The most important step before training our models is the normalization of the variables. Several strategies are possible; the two most common are:

* Range normalization: converts all the values to the range $[0, 1]$

* Standard score normalization: forces the variables to have $0$ mean and $1$ standard deviation

The normalization parameters must be taken only from the training set and then used also for the normalization of the test set. In this notebook we will use the standard score normalization.

In [7]:
def std_norm(x, stats):
    """ Remove mean and fix standard deviation to 1 """
    x_norm = (x - stats['mean'].values) / stats['std'].values
    return x_norm

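def inv_std_norm(x_norm, stats):
    """ Recover the original value of the variables """
    # np.asarray handles both per-column statistics (features) and scalars (target)
    x = x_norm * np.asarray(stats['std']) + np.asarray(stats['mean'])
    return x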

def range_norm(x, stats):
    """ Rescale in range [0, 1] """
    x_norm = (x - stats['min'].values) / (stats['max'].values - stats['min'].values)
    return x_norm
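The rule that normalization statistics come from the training set only can be sketched on toy data as follows. The data frames, column names and sizes here are hypothetical. The sketch also points out a detail that is easy to miss: `DataFrame.describe()` reports the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two approaches differ by a factor of $\sqrt{(n-1)/n}$.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 6 training rows and 2 test rows, 2 features.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(50.0, 10.0, size=(6, 2)), columns=['a', 'b'])
test = pd.DataFrame(rng.normal(50.0, 10.0, size=(2, 2)), columns=['a', 'b'])

# Statistics come from the training rows only.
stats = train.describe().transpose()
train_norm = (train.values - stats['mean'].values) / stats['std'].values
test_norm = (test.values - stats['mean'].values) / stats['std'].values

# The inverse transform recovers the original test values.
test_back = test_norm * stats['std'].values + stats['mean'].values

# StandardScaler uses the population std (ddof=0); describe() uses ddof=1.
scaler = StandardScaler().fit(train.values)
ratio = scaler.scale_ / stats['std'].values   # sqrt((n - 1) / n) per column
print(ratio)
```

With only a few dozen samples the difference between the two conventions is small, but it explains why the normalized values would not match digit for digit if `StandardScaler` were swapped in.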

Split the data into training and test sets:


In [8]:
X = data.drop(columns='Buckling')
Y = data['Buckling']

# a = Y > 0
# b = [[1 if i else -1 for i in a]]
# b = np.array(b).T
# X_sf = X * b

# Training set
X_orig_tv = X.iloc[:train_smp, :]
Y_orig_tv = Y.iloc[:train_smp]
train_stats = X_orig_tv.describe()
target_stats = Y_orig_tv.describe()
print('Design matrix dimension: ' + str(X_orig_tv.shape))
print('Target matrix dimension: ' + str(Y_orig_tv.shape))

# Test set
X_orig_test = X.iloc[train_smp:, :]
Y_orig_test = Y.iloc[train_smp:]
test_stats = X_orig_test.describe()
print('Test input matrix dimension: ' + str(X_orig_test.shape))
print('Test target matrix dimension: ' + str(Y_orig_test.shape))

Design matrix dimension: (81, 8)
Target matrix dimension: (81,)
Test input matrix dimension: (27, 8)
Test target matrix dimension: (27,)


Now we can normalize:

In [29]:
X_tv = std_norm(X_orig_tv.values, train_stats.transpose())
Y_tv = (Y_orig_tv.values - target_stats.transpose()['mean']) / target_stats.transpose()['std']

X_test = std_norm(X_orig_test.values, train_stats.transpose())
Y_test = (Y_orig_test.values - target_stats.transpose()['mean']) / target_stats.transpose()['std']

In [30]:
df = pd.DataFrame(X_tv, columns=X_orig_tv.columns)
df.describe()

| | Amplitude1 | PhaseShift1 | Omega1 | Beta1 | Amplitude2 | PhaseShift2 | Omega2 | Beta2 |
|---|---|---|---|---|---|---|---|---|
| count | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 |
| mean | -4.344947e-16 | -5.866364e-16 | 1.946317e-16 | -3.2895500000000005e-17 | 1.822959e-16 | 4.934325e-16 | 3.2210170000000004e-17 | 3.28955e-16 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -1.706597 | -1.70301 | -1.364286 | -1.687767 | -1.687662 | -1.702908 | -1.380528 | -1.715936 |
| 25% | -0.8320367 | -0.8653647 | -0.7808695 | -0.853815 | -0.8415632 | -0.8275561 | -0.7878279 | -0.8446352 |
| 50% | 0.002041536 | 0.007835029 | -0.2175705 | 0.0130658 | -0.007327554 | -0.002771463 | -0.09634428 | -0.0001323331 |
| 75% | 0.8429941 | 0.8500676 | 0.627378 | 0.8500806 | 0.8337963 | 0.8572571 | 0.5951394 | 0.8627459 |
| max | 1.687002 | 1.691918 | 2.337393 | 1.703177 | 1.683339 | 1.69047 | 2.392997 | 1.701124 |


We have only two data sets (training and test) because the validation set is generated with k-fold cross-validation (CV). With k-fold CV, hyperparameters are not tuned against a single fixed validation split, which reduces the risk of overfitting that split and then performing poorly on the test set.
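To make the k-fold mechanism concrete, here is a minimal sketch that only looks at the indices produced by `KFold`. The placeholder array stands in for the 81 training samples and carries no real data. It checks that every sample is used for validation exactly once and never appears in the training part of its own fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 81          # same size as the training set above
kf = KFold(n_splits=10, shuffle=True, random_state=123)

# Count how often each sample lands in a validation fold.
val_counts = np.zeros(n_samples, dtype=int)
fold_sizes = []
for train_idx, val_idx in kf.split(np.zeros((n_samples, 1))):
    val_counts[val_idx] += 1
    fold_sizes.append(len(val_idx))
    # Training and validation indices never overlap within a fold.
    assert np.intersect1d(train_idx, val_idx).size == 0

print(fold_sizes)
```

Since 81 is not divisible by 10, the folds cannot all have the same size; each validation fold holds 8 or 9 samples, so the per-fold metrics are computed on very few points and are noisy.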

## **Regression Models**

---

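In the cells below we will consider the following regression models:

* **kNN**

* **Simple linear regression**

* **Ridge and Lasso regression**

* **Polynomial linear regression**

* **Polynomial Ridge regression**

* **Polynomial Lasso regression**

* **Support vector regression (linear and RBF kernels)**

* **Gradient boosting**

First things first: we have to define the accuracy metrics required to evaluate the performance of each model.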

In [31]:
from sklearn.metrics import r2_score, mean_absolute_error

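def r_2(y_true, y_pred):
    """ Coefficient of determination (R squared score) """
    num = np.sum(((y_true - y_pred) ** 2))
    den = np.sum((y_true - np.mean(y_true))**2)
    return np.mean(1 - num / den)

def rmae(y_true, y_pred):
    """ Relative Maximum Absolute Error """
    max_err = np.max(abs(y_true - y_pred))
    # NOTE: the spread is computed from y_pred around the mean of y_true;
    # the more common definition uses the standard deviation of y_true.
    std = (np.sum((y_pred - np.mean(y_true)) ** 2) / len(y_true)) ** 0.5
    return np.mean(max_err / std)
    
def raae(y_true, y_pred):
    """ Relative Average Absolute Error """
    num = np.sum(abs(y_true - y_pred))
    # NOTE: same spread term as in rmae (based on y_pred, not y_true)
    den =  (((np.sum((y_pred - np.mean(y_true)) ** 2) / len(y_true))) ** 0.5) * len(y_true)
    return np.mean(num / den)

def mpe(y_true, y_pred):
    """ Maximum Percentage Error and position of the sample where it occurs """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pct_err = np.abs((y_true - y_pred) / y_true) * 100
    idx = np.argmax(pct_err)  # argmax over the per-sample errors, not over the scalar maximum
    return pct_err[idx], idx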

def mape(y_true, y_pred):
    """ Mean Absolute Percentage Error """
    return np.mean(abs(((y_true - y_pred) / y_true)) * 100)
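As a quick sanity check of the $R^2$ formula, the sketch below compares a hand-written version (the same expression as `r_2`, without the redundant `np.mean`) against `sklearn.metrics.r2_score` on toy values that are purely illustrative. It also shows why negative scores are possible: predicting the mean of the targets everywhere yields exactly $R^2 = 0$, so any model that does worse than this constant baseline scores below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_2(y_true, y_pred):
    """Coefficient of determination, same formula as in the cell above."""
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - num / den

# Toy values (hypothetical), chosen only to exercise the formula.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])

manual = r_2(y_true, y_pred)
reference = r2_score(y_true, y_pred)
print(manual, reference)

# Predicting the mean of y_true everywhere gives exactly R^2 = 0,
# and anything worse than that gives a negative score.
baseline = r_2(y_true, np.full_like(y_true, y_true.mean()))
```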

### **K-Nearest Neighbors**

The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
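The averaging rule described above can be sketched by hand and compared with scikit-learn. The one-dimensional training set and the query point are toy values chosen for illustration, and Euclidean distance is assumed (it is the default metric of `KNeighborsRegressor`).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D training set (hypothetical values).
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
x_query = np.array([[2.2]])
k = 3

# Manual prediction: average the targets of the k closest training points.
dist = np.linalg.norm(X_train - x_query, axis=1)
nearest = np.argsort(dist)[:k]
manual_pred = y_train[nearest].mean()

# Same prediction from scikit-learn with uniform weights.
knn = KNeighborsRegressor(n_neighbors=k, weights='uniform').fit(X_train, y_train)
sklearn_pred = knn.predict(x_query)[0]
print(manual_pred, sklearn_pred)
```

Because distances drive the prediction, the features must be on comparable scales, which is one reason the inputs were normalized beforehand.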

In [32]:
# neighbors = [1, 3, 5, 7, 9, 11, 13, 15, 17, 20]

# metrics = {}
# n_folds = 10
# kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

# for n in neighbors:
#     r2_tmp = []
#     rmae_tmp = []
#     raae_tmp = []
#     mape_tmp = []
#     for train_idx, val_idx in kf.split(X_tv):
#         X_t, Y_t = X_tv[train_idx, :], Y_tv[train_idx]
#         X_v, Y_v = X_tv[val_idx, :], Y_tv[val_idx]
#         knn = KNeighborsRegressor(n_neighbors = n, weights='uniform')
#         knn.fit(X_t, Y_t)
#         Ypred_v = knn.predict(X_v)
#         r2_tmp.append(r_2(Y_v, Ypred_v))
#         rmae_tmp.append(rmae(Y_v, Ypred_v))
#         raae_tmp.append(raae(Y_v, Ypred_v))
#         mape_tmp.append(mape(Y_v, Ypred_v))
#     metrics['Neighbors' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
#                                      np.mean(raae_tmp), np.mean(mape_tmp)]

# idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
# metrics_df = pd.DataFrame(metrics, index=idx_name)

neighbors = [1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 20]

metrics = {}
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in neighbors:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    knn = KNeighborsRegressor(n_neighbors = n, weights='uniform')
    knn.fit(X_tv, Y_tv)
    Ypred_tv = knn.predict(X_tv)
    r2_tmp.append(r_2(Y_tv, Ypred_tv))
    rmae_tmp.append(rmae(Y_tv, Ypred_tv))
    raae_tmp.append(raae(Y_tv, Ypred_tv))
    mape_tmp.append(mape(Y_tv, Ypred_tv))
    metrics['Neighbors' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)

In [33]:
metrics_df

| | Neighbors1 | Neighbors2 | Neighbors3 | Neighbors5 | Neighbors7 | Neighbors9 | Neighbors11 | Neighbors13 | Neighbors15 | Neighbors17 | Neighbors20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| R2 | 1.0 | 0.649094 | 0.617269 | 0.548971 | 0.508074 | 0.486527 | 0.437463 | 0.410082 | 0.405683 | 0.377504 | 0.329754 |
| RMAE | 0.0 | 1.68826 | 2.113362 | 2.885887 | 3.179358 | 3.970417 | 4.935643 | 5.638212 | 5.650785 | 6.072741 | 6.733798 |
| RAAE | 0.0 | 0.532147 | 0.719212 | 0.90563 | 0.998369 | 1.230715 | 1.358135 | 1.521747 | 1.572326 | 1.71947 | 1.960341 |
| MAPE | 0.0 | 232.321968 | 176.810558 | 169.487069 | 151.017951 | 156.306717 | 140.294691 | 134.470997 | 113.053746 | 119.647668 | 126.848293 |


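The scores above are computed on the training set itself (the cross-validated loop is commented out), which is why a single neighbor trivially reaches $R^2 = 1$. We choose 3 neighbors, the value used in the cells below, and retrain the model on the entire training set. After that, we evaluate its performance first on the training set and then on the test set.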
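A selection of the number of neighbors based on validation folds can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins with the same shape as the training set (`X_demo` and `y_demo` are hypothetical names), and `cross_val_score` with `scoring='r2'` replaces the hand-written loop and metrics of this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training data: 81 samples, 8 features.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(81, 8))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=81)

kf = KFold(n_splits=10, shuffle=True, random_state=123)
candidates = [1, 3, 5, 7, 9]

# Mean validation R^2 over the folds for each candidate k.
cv_r2 = {}
for k in candidates:
    knn = KNeighborsRegressor(n_neighbors=k, weights='uniform')
    scores = cross_val_score(knn, X_demo, y_demo, cv=kf, scoring='r2')
    cv_r2[k] = scores.mean()

best_k = max(cv_r2, key=cv_r2.get)

# On the training data itself, k = 1 always scores R^2 = 1 (each point is
# its own nearest neighbor), so training scores cannot be used to rank k.
train_r2_k1 = KNeighborsRegressor(n_neighbors=1).fit(X_demo, y_demo).score(X_demo, y_demo)
print(cv_r2, best_k, train_r2_k1)
```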

In [14]:
knn = KNeighborsRegressor(n_neighbors = 3, weights='uniform').fit(X_tv, Y_tv)
pred = knn.predict(X_tv)
idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics = [r_2(Y_tv, pred), rmae(Y_tv, pred), raae(Y_tv, pred), mape(Y_tv, pred)]
metrics_df = pd.DataFrame(metrics, index=idx_name, columns=['train'])  # scores on the training set

In [15]:
metrics_df.T

| | R2 | RMAE | RAAE | MAPE |
|---|---|---|---|---|
| train | 0.617269 | 2.113362 | 0.719212 | 176.810558 |


In [16]:
knn = KNeighborsRegressor(n_neighbors = 3, weights='uniform').fit(X_tv, Y_tv)
pred = knn.predict(X_test)
idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics = [r_2(Y_test, pred), rmae(Y_test, pred), raae(Y_test, pred), mape(Y_test, pred)]
metrics_df = pd.DataFrame(metrics, index=idx_name, columns=['test'])

In [17]:
metrics_df.T

| | R2 | RMAE | RAAE | MAPE |
|---|---|---|---|---|
| test | 0.131522 | 3.39438 | 1.123726 | 185.7065 |


### **Simple Linear Regression**


In [39]:
linear_regressor = LinearRegression()
linear_regressor.fit(X_tv, Y_tv)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [40]:
Ypred_tv = linear_regressor.predict(X_tv)
r2_train = r_2(Y_tv, Ypred_tv)
print("R^2 on train set: %.4f"%(r2_train))

R^2 on train set: 0.3469


We can now evaluate the model on the test set.

In [41]:
Ypred_test = linear_regressor.predict(X_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.0123


### **Ridge Regression**

In [42]:
l2 = [0.01, 0.1, 1, 5, 7, 8, 10, 20, 50, 60, 70, 100, 150]

metrics = {}
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in l2:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    for train_idx, val_idx in kf.split(X_tv):
        X_t, Y_t = X_tv[train_idx, :], Y_tv[train_idx]
        X_v, Y_v = X_tv[val_idx, :], Y_tv[val_idx]
        ridge = Ridge(alpha=n, max_iter=1000, random_state=seed)
        ridge_model = ridge.fit(X_t, Y_t)
        Ypred_v = ridge.predict(X_v)
        r2_tmp.append(r_2(Y_v, Ypred_v))
        rmae_tmp.append(rmae(Y_v, Ypred_v))
        raae_tmp.append(raae(Y_v, Ypred_v))
        mape_tmp.append(mape(Y_v, Ypred_v))
    metrics['Alpha' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)

In [43]:
metrics_df

| | Alpha0.01 | Alpha0.1 | Alpha1 | Alpha5 | Alpha7 | Alpha8 | Alpha10 | Alpha20 | Alpha50 | Alpha60 | Alpha70 | Alpha100 | Alpha150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R2 | -0.438087 | -0.435376 | -0.409524 | -0.31702 | -0.281195 | -0.265275 | -0.236791 | -0.140421 | -0.038188 | -0.026835 | -0.020073 | -0.014054 | -0.020948 |
| RMAE | 2.57531 | 2.577762 | 2.602051 | 2.710433 | 2.766849 | 2.794405 | 2.848283 | 3.126861 | 3.808509 | 3.998732 | 4.175329 | 4.639459 | 5.250163 |
| RAAE | 1.14955 | 1.150666 | 1.161745 | 1.21036 | 1.234301 | 1.246013 | 1.268945 | 1.375863 | 1.648039 | 1.726951 | 1.800832 | 1.999053 | 2.262513 |
| MAPE | 257.871569 | 257.522303 | 254.113372 | 240.697251 | 234.888682 | 232.153253 | 226.984067 | 205.991063 | 169.8206 | 162.256607 | 156.004053 | 142.593352 | 129.649781 |


In [22]:
ridge = Ridge(alpha=5, max_iter=1000, random_state=seed)
ridge_model = ridge.fit(X_tv, Y_tv)
Ypred_test = ridge.predict(X_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.0191


### **Lasso regression**

In [23]:
l1 = [0.001, 0.01, 0.03, 0.05, 0.07, 0.1, 0.2, 0.5, 5]

metrics = {}
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in l1:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    for train_idx, val_idx in kf.split(X_tv):
        X_t, Y_t = X_tv[train_idx, :], Y_tv[train_idx]
        X_v, Y_v = X_tv[val_idx, :], Y_tv[val_idx]
        lasso = Lasso(alpha=n, max_iter=10000, random_state=seed, tol=0.0001)
        lasso_model = lasso.fit(X_t, Y_t)
        Ypred_v = lasso.predict(X_v)
        r2_tmp.append(r_2(Y_v, Ypred_v))
        rmae_tmp.append(rmae(Y_v, Ypred_v))
        raae_tmp.append(raae(Y_v, Ypred_v))
        mape_tmp.append(mape(Y_v, Ypred_v))
    metrics['Alpha' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)

In [24]:
metrics_df

| | Alpha0.001 | Alpha0.01 | Alpha0.03 | Alpha0.05 | Alpha0.07 | Alpha0.1 | Alpha0.2 | Alpha0.5 | Alpha5 |
|---|---|---|---|---|---|---|---|---|---|
| R2 | -0.43155 | -0.374604 | -0.269174 | -0.190739 | -0.155732 | -0.139312 | -0.088933 | -0.137901 | -0.137901 |
| RMAE | 2.583279 | 2.651598 | 2.814116 | 2.992773 | 3.180796 | 3.452483 | 4.555154 | 8.084972 | 8.084972 |
| RAAE | 1.153125 | 1.184719 | 1.24698 | 1.310824 | 1.391098 | 1.508156 | 1.964405 | 3.529563 | 3.529563 |
| MAPE | 256.685595 | 246.47479 | 225.459767 | 208.134787 | 196.417951 | 180.4822 | 133.427026 | 100.293542 | 100.293542 |


In [25]:
lasso = Lasso(alpha=0.03, max_iter=1000, random_state=seed)
lasso_model = lasso.fit(X_tv, Y_tv)
Ypred_test = lasso.predict(X_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.0069


### **Polynomial Linear Regression**

To apply polynomial regression, we first have to expand the original features into polynomial terms and then normalize the resulting data set. The polynomial training and test sets are denoted with a lowercase $x$, in contrast to the uppercase $X$ used for the previous models.

In [26]:
polynomial2 = PolynomialFeatures(degree=4, include_bias=False)
x = polynomial2.fit_transform(X)

# Training set
x_orig_tv = x[:train_smp, :]
train_stats_poly = pd.DataFrame(x_orig_tv).describe()
print('Design matrix with polynomial features dimension: ' + str(x_orig_tv.shape))

# Test set
x_orig_test = x[train_smp:, :]
print('Test input matrix with polynomial features dimension: ' + str(x_orig_test.shape))

Design matrix with polynomial features dimension: (81, 494)
Test input matrix with polynomial features dimension: (27, 494)
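The 494 columns reported above can be cross-checked with a short counting argument: the number of monomials of total degree at most $d$ in $n$ variables is $\binom{n+d}{d}$, and `include_bias=False` removes the constant term. The sketch below verifies this on a dummy array that only shares the number of columns with the real design matrix.

```python
from math import comb
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 8, 4

# Monomials of total degree <= d in n variables: C(n + d, d),
# minus 1 because include_bias=False drops the constant term.
expected = comb(n_features + degree, degree) - 1

# Cross-check on a dummy array with the same number of columns.
dummy = np.ones((2, n_features))
actual = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(dummy).shape[1]
print(expected, actual)
```

With 494 features and only 81 training samples, an unregularized least-squares fit has enough freedom to reproduce the training targets exactly, so a perfect training score by itself says little about generalization.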


In [27]:
# polynomial2.get_feature_names()

Now we can normalize:

In [28]:
x_tv = std_norm(x_orig_tv, train_stats_poly.transpose())
x_test = std_norm(x_orig_test, train_stats_poly.transpose())

In [29]:
polynomial_linear_regressor = LinearRegression()
polynomial_linear_regressor.fit(x_tv, Y_tv)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [30]:
Ypred_tv = polynomial_linear_regressor.predict(x_tv)
r2_train = r_2(Y_tv, Ypred_tv)
print("R^2 on train set: %.4f"%(r2_train))

R^2 on train set: 1.0000


In [31]:
Ypred_test = polynomial_linear_regressor.predict(x_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: -0.4348


In [32]:
# from sklearn.preprocessing import StandardScaler

# polynomial = PolynomialFeatures(degree=1, include_bias=False)
# X_poly = polynomial.fit_transform(X_orig_tv)
# Xtest_poly = polynomial.fit_transform(X_orig_test)

# scaler = StandardScaler()

# scaler.fit(X_poly)

# X_norm = scaler.transform(X_poly)
# Xtest_norm = scaler.transform(Xtest_poly)

### **Polynomial Ridge Regression**

In [33]:
l2 = [0.01, 0.1, 1, 5, 10, 20, 50, 60, 70, 100, 150]

metrics = {}
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in l2:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    for train_idx, val_idx in kf.split(x_tv):
        x_t, Y_t = x_tv[train_idx, :], Y_tv[train_idx]
        x_v, Y_v = x_tv[val_idx, :], Y_tv[val_idx]
        ridge = Ridge(alpha=n, max_iter=1000, random_state=seed)
        ridge_model = ridge.fit(x_t, Y_t)
        Ypred_v = ridge.predict(x_v)
        r2_tmp.append(r_2(Y_v, Ypred_v))
        rmae_tmp.append(rmae(Y_v, Ypred_v))
        raae_tmp.append(raae(Y_v, Ypred_v))
        mape_tmp.append(mape(Y_v, Ypred_v))
    metrics['Alpha' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)

In [34]:
metrics_df

| | Alpha0.01 | Alpha0.1 | Alpha1 | Alpha5 | Alpha10 | Alpha20 | Alpha50 | Alpha60 | Alpha70 | Alpha100 | Alpha150 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| R2 | -3.653499 | -3.542279 | -2.788683 | -1.542185 | -1.004488 | -0.577605 | -0.210573 | -0.162268 | -0.127216 | -0.065122 | -0.023332 |
| RMAE | 1.867292 | 1.861082 | 1.831563 | 1.829511 | 1.868155 | 1.97627 | 2.157811 | 2.20322 | 2.255213 | 2.378689 | 2.51545 |
| RAAE | 0.788671 | 0.790699 | 0.806842 | 0.842452 | 0.862017 | 0.899086 | 0.97345 | 0.989909 | 1.003841 | 1.03787 | 1.08788 |
| MAPE | 496.725593 | 492.216919 | 454.184509 | 361.745905 | 309.322159 | 261.183118 | 220.562225 | 214.393455 | 209.338858 | 201.236927 | 194.966379 |


In [35]:
ridge = Ridge(alpha=50, max_iter=1000, random_state=seed)
ridge_model = ridge.fit(x_tv, Y_tv)
Ypred_test = ridge.predict(x_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.2327


### **Polynomial Lasso Regression**

In [36]:
l1 = [0.001, 0.01, 0.03, 0.05, 0.07, 0.1, 0.2, 0.5, 50]

metrics = {}
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in l1:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    for train_idx, val_idx in kf.split(x_tv):
        x_t, Y_t = x_tv[train_idx, :], Y_tv[train_idx]
        x_v, Y_v = x_tv[val_idx, :], Y_tv[val_idx]
        lasso = Lasso(alpha=n, max_iter=1000, random_state=seed)
        lasso_model = lasso.fit(x_t, Y_t)
        Ypred_v = lasso.predict(x_v)
        r2_tmp.append(r_2(Y_v, Ypred_v))
        rmae_tmp.append(rmae(Y_v, Ypred_v))
        raae_tmp.append(raae(Y_v, Ypred_v))
        mape_tmp.append(mape(Y_v, Ypred_v))
    metrics['Alpha' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)


In [37]:
metrics_df

| | Alpha0.001 | Alpha0.01 | Alpha0.03 | Alpha0.05 | Alpha0.07 | Alpha0.1 | Alpha0.2 | Alpha0.5 | Alpha50 |
|---|---|---|---|---|---|---|---|---|---|
| R2 | -2.708153 | -0.835911 | -0.015332 | -0.029543 | -0.100658 | -0.18449 | -0.121447 | -0.137901 | -0.137901 |
| RMAE | 1.961152 | 2.053671 | 2.188491 | 2.579515 | 2.931158 | 3.27757 | 4.419028 | 8.084972 | 8.084972 |
| RAAE | 0.74109 | 0.835245 | 0.93367 | 1.079038 | 1.212635 | 1.362699 | 1.927559 | 3.529563 | 3.529563 |
| MAPE | 480.704319 | 273.422384 | 198.812661 | 201.506263 | 198.570647 | 188.134577 | 140.13894 | 100.293542 | 100.293542 |


In [38]:
lasso = Lasso(alpha=0.05, max_iter=1000, random_state=seed)
lasso_model = lasso.fit(x_tv, Y_tv)
Ypred_test = lasso.predict(x_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.2619


### **Support Vector Regression**

#### **Linear**

In [39]:
c = [1, 5, 10, 20, 50, 70, 100, 500]

metrics = {}
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in c:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    for train_idx, val_idx in kf.split(X_tv):
        X_t, Y_t = X_tv[train_idx, :], Y_tv[train_idx]
        X_v, Y_v = X_tv[val_idx, :], Y_tv[val_idx]
        svr = SVR(kernel='linear', C=n, epsilon=0.1).fit(X_t, Y_t)
        Ypred_v = svr.predict(X_v)
        r2_tmp.append(r_2(Y_v, Ypred_v))
        rmae_tmp.append(rmae(Y_v, Ypred_v))
        raae_tmp.append(raae(Y_v, Ypred_v))
        mape_tmp.append(mape(Y_v, Ypred_v))
    metrics['C' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)

In [40]:
metrics_df

| | C1 | C5 | C10 | C20 | C50 | C70 | C100 | C500 |
|---|---|---|---|---|---|---|---|---|
| R2 | -0.460042 | -0.482864 | -0.48209 | -0.481965 | -0.482352 | -0.482688 | -0.482092 | -0.483615 |
| RMAE | 2.434677 | 2.406606 | 2.402486 | 2.401684 | 2.403254 | 2.403641 | 2.403814 | 2.403981 |
| RAAE | 1.127091 | 1.117615 | 1.1174 | 1.11678 | 1.116922 | 1.117548 | 1.116977 | 1.11732 |
| MAPE | 263.633907 | 265.403205 | 265.223536 | 265.396553 | 265.514434 | 265.341892 | 265.316396 | 265.499813 |


In [41]:
svr = SVR(kernel='linear', C=50, epsilon=0.1).fit(X_tv, Y_tv)
Ypred_test = svr.predict(X_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.0155


#### **RBF**

In [46]:
c = [0.1, 1, 5, 10, 20, 50, 70, 100, 500]

metrics = {}
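# Leave-one-out CV: one fold per training sample. Each validation fold then
# holds a single point, so the per-fold R2 is -inf and RMAE/RAAE equal 1.
n_folds = X_tv.shape[0]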
kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

for n in c:
    r2_tmp = []
    rmae_tmp = []
    raae_tmp = []
    mape_tmp = []
    for train_idx, val_idx in kf.split(X_tv):
        X_t, Y_t = X_tv[train_idx, :], Y_orig_tv[train_idx]
        X_v, Y_v = X_tv[val_idx, :], Y_orig_tv[val_idx]
        svr = SVR(kernel='rbf', gamma='scale', C=n, epsilon=0.0001).fit(X_t, Y_t)
        Ypred_v = svr.predict(X_v)
        r2_tmp.append(r_2(Y_v, Ypred_v))
        rmae_tmp.append(rmae(Y_v, Ypred_v))
        raae_tmp.append(raae(Y_v, Ypred_v))
        mape_tmp.append(mape(Y_v, Ypred_v))
    metrics['C' + str(n)] = [np.mean(r2_tmp), np.mean(rmae_tmp),
                                     np.mean(raae_tmp), np.mean(mape_tmp)]

idx_name = ['R2', 'RMAE', 'RAAE', 'MAPE']
metrics_df = pd.DataFrame(metrics, index=idx_name)


In [47]:
metrics_df

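| | C0.1 | C1 | C5 | C10 | C20 | C50 | C70 | C100 | C500 |
|---|---|---|---|---|---|---|---|---|---|
| R2 | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf |
| RMAE | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| RAAE | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| MAPE | 10.356326 | 10.106675 | 9.542792 | 9.086999 | 8.117727 | 7.911548 | 8.260882 | 8.553109 | 8.692367 |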
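The `-inf` entries in the table above are consistent with the fold setup of this cell: with one fold per training sample, every validation fold contains a single point, so the denominator of $R^2$ (the spread of `y_true` around its own mean) is zero. The sketch below reproduces this on hypothetical numbers and shows one possible alternative, namely pooling the held-out predictions before computing a single score. The pooled variant is an assumption about how one might fix the evaluation, not what the notebook does.

```python
import numpy as np

def r_2(y_true, y_pred):
    """Same R^2 formula as defined earlier in the notebook."""
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - num / den

# One validation sample, as in a leave-one-out fold (hypothetical values).
y_true = np.array([225.0])
y_pred = np.array([230.0])

with np.errstate(divide='ignore'):
    fold_r2 = r_2(y_true, y_pred)   # den == 0, so the score is -inf

# Pooling the held-out predictions first gives a finite score instead.
pooled_true = np.array([225.0, 208.0, 242.0, 190.0])
pooled_pred = np.array([230.0, 212.0, 236.0, 199.0])
pooled_r2 = r_2(pooled_true, pooled_pred)
print(fold_r2, pooled_r2)
```

The same single-point effect makes RMAE and RAAE collapse to 1 as implemented here, because numerator and denominator both reduce to the absolute error of that one sample. Only the MAPE row of the table is informative, and it is computed on the unnormalized targets used in this cell.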


In [44]:
svr = SVR(kernel='rbf', gamma='scale', C=10, epsilon=0.01).fit(X_tv, Y_tv)
Ypred_test = svr.predict(X_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.3012


### **Gradient Boosting**

In [60]:
model = XGBRegressor(colsample_bytree=0.4,
                 gamma=0,                 
                 learning_rate=0.01,
                 max_depth=4,
                 min_child_weight=1,
                 n_estimators=1000,                                                                    
                 reg_alpha=0.9,
                 reg_lambda=0.8,
                 subsample=0.6,
                 seed=42)
model.fit(X_tv,Y_tv)
Ypred_tv = model.predict(X_tv)
r2_train = r_2(Y_tv, Ypred_tv)
print("R^2 on train set: %.4f"%(r2_train))

R^2 on train set: 0.9485


In [61]:
Ypred_test = model.predict(X_test)
r2_test = r_2(Y_test, Ypred_test)
print("R^2 on test set: %.4f"%(r2_test))

R^2 on test set: 0.2011
