<a href="https://www.kaggle.com/code/darvack/transformer-paper-regression?scriptVersionId=130680309" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/transformer/DatasetB.csv
/kaggle/input/transformer/DatasetA.csv


Here we load the data and set Furan as the label.
We first hold out 25 percent of Dataset A as a test set to find a good model, and then evaluate that model on Dataset B.

In [2]:
ds_A = pd.read_csv("/kaggle/input/transformer/DatasetA.csv")
ds_B = pd.read_csv("/kaggle/input/transformer/DatasetB.csv")

# Splitting train and test
from sklearn.model_selection import train_test_split
train_set_A, test_set_A = train_test_split(ds_A, test_size = 0.25, random_state = 11)

# Setting the labels
y_train_A = train_set_A['Furan']
y_test_A = test_set_A['Furan']

# Dropping the Furan and Health Index columns
X_train_A = train_set_A.drop(["Furan", "HI"], axis = 1)
X_test_A = test_set_A.drop(["Furan", "HI"], axis = 1)

# For DatasetB
y_B = ds_B['Furan']
X_B = ds_B.drop(["Furan", "HI"], axis = 1)

# The code below is for the second case, where we train the data for the whole
# Dataset A and test it on Dataset B
y_A = ds_A['Furan']
X_A = ds_A.drop(["Furan", "HI"], axis = 1)



In [3]:
#ds_A.hist(bins=50, figsize=(20,15))

The code below drops the columns we don't need and keeps only the features common to Dataset A and Dataset B.

In [4]:
# Columns that appear in Dataset A but not in Dataset B
extra_cols = list(set(ds_A.columns) - set(ds_B.columns))
X_train_A = X_train_A.drop(extra_cols, axis=1)
X_test_A = X_test_A.drop(extra_cols, axis=1)
X_A = X_A.drop(extra_cols, axis=1)
# Give Dataset B the same columns, in the same order, as the training features
X_B = X_B[X_train_A.columns]
X_train_A

Unnamed: 0,H2,Methane,Acetylene,Ethylene,Ethane,Water,Acid,BDV,IFT
109,12.2,53.50,6.9,127.4,48.0,3,0.043,83.0,20
566,30.2,0.00,0.0,2.6,1.1,3,0.005,84.0,39
410,45.6,18.20,0.0,1.6,1.7,5,0.005,87.0,30
316,19.7,38.50,0.0,2.7,41.6,7,0.005,50.0,32
678,11.0,7.60,0.0,0.3,1.6,3,0.005,61.0,42
...,...,...,...,...,...,...,...,...,...
269,13.7,5.10,0.0,0.4,1.1,1,0.005,94.0,36
337,32.9,3.77,0.0,0.6,2.4,6,0.005,79.0,32
91,22.8,3.30,0.0,4.9,3.0,11,0.140,88.0,16
80,61.2,27.30,0.0,25.6,20.8,9,0.099,70.0,17
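The column alignment above can also be sketched with an explicit list of shared feature names, which keeps the column order of the first frame. This is a minimal illustration only: `df_a` and `df_b` are hypothetical stand-ins with made-up values, and `Extra` is an invented column name, not one from the real datasets.

```python
import pandas as pd

# Hypothetical stand-ins for Dataset A and Dataset B (values are made up).
df_a = pd.DataFrame({"H2": [12.2, 30.2], "Methane": [53.5, 0.0],
                     "Extra": [1, 2], "Furan": [0.1, 0.2]})
df_b = pd.DataFrame({"Methane": [3.3, 5.1], "H2": [22.8, 13.7],
                     "Furan": [0.3, 0.4]})

# Keep only columns present in both frames, in df_a's order,
# and leave out the label so it is never used as a feature.
common = [c for c in df_a.columns if c in df_b.columns and c != "Furan"]

X_a = df_a[common]
X_b = df_b[common]  # same columns and same order as X_a

print(common)
```

Selecting by an ordered list rather than by a set difference makes the resulting column order independent of set iteration order.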


# First case: Training using 75% of the data and testing on the remaining 25%

We experimented with different combinations of models in the ensemble.
Although the results were quite similar, the combination kept in the code below is a linear SVR, AdaBoost, XGBoost, and a random forest; the MLP, KNN, and linear regression entries are commented out.
The code below creates a voting regressor consisting of these models.

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor

rf_reg = RandomForestRegressor(n_jobs = -1, max_depth = 50)
svm_reg = SVR(kernel='linear', C=1.0)
knn_reg = KNeighborsRegressor(n_neighbors=3)
xgb_reg = XGBRegressor(learning_rate=0.01, n_estimators=300, max_depth=3, subsample=0.7)
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000)
#nb_reg = GaussianNB()
ada_reg = AdaBoostRegressor(n_estimators=50, learning_rate=0.003)
lr_reg = LinearRegression()

voting_reg = VotingRegressor(
  estimators=[#('nn', mlp_reg),
              ('svc', svm_reg),
              #('knn', knn_reg), 
              ('ada', ada_reg), #,('nb', nb_reg)
              ('xgb', xgb_reg),
              ('rf', rf_reg)
              #('lr', lr_reg)
             ])
voting_reg.fit(X_train_A, np.array(y_train_A).ravel())
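With no weights given, a `VotingRegressor` predicts the plain average of its fitted base estimators' predictions. The following sketch checks this on synthetic data; `X_demo`, `y_demo`, and `demo_voter` are illustrative names, and the two base models are chosen only to keep the example small, not because they match the ensemble above.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

demo_voter = VotingRegressor(
    estimators=[("lr", LinearRegression()),
                ("knn", KNeighborsRegressor(n_neighbors=3))])
demo_voter.fit(X_demo, y_demo)

# estimators_ holds the fitted clones of the base estimators.
base_preds = np.column_stack(
    [est.predict(X_demo) for est in demo_voter.estimators_])
manual_average = base_preds.mean(axis=1)

# The ensemble prediction should equal the unweighted mean.
same = np.allclose(demo_voter.predict(X_demo), manual_average)
print(same)
```

Because the ensemble is an average, one poorly performing member can pull the combined error up, which is one reason to compare each model individually as well.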




Here is a comparison of the individual models and the voting regressor, using mean squared error (MSE) on the Dataset A test split and on Dataset B.

In [6]:
from sklearn.metrics import mean_squared_error
for clf in (mlp_reg, svm_reg, ada_reg,
            knn_reg, xgb_reg, rf_reg,
            lr_reg, voting_reg):
    clf.fit(X_train_A, y_train_A)
    y_pred_A = clf.predict(X_test_A)
    y_pred_B = clf.predict(X_B)
    print(clf.__class__.__name__ + " for dataset A:", mean_squared_error(y_test_A, y_pred_A))
    print(clf.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

MLPRegressor for dataset A: 2.426771829651204
MLPRegressor for dataset B: 7.491636003409089
SVR for dataset A: 0.7403882323887173
SVR for dataset B: 2.0225532366782986
AdaBoostRegressor for dataset A: 0.5402207852576462
AdaBoostRegressor for dataset B: 1.8766823726948667
KNeighborsRegressor for dataset A: 0.8466535761991498
KNeighborsRegressor for dataset B: 2.426818747876317
XGBRegressor for dataset A: 0.495943901498028
XGBRegressor for dataset B: 1.7703121756309461
RandomForestRegressor for dataset A: 0.4821698597605254
RandomForestRegressor for dataset B: 2.22905609294208
LinearRegression for dataset A: 0.5584080811790763
LinearRegression for dataset B: 2.4580137584740305
VotingRegressor for dataset A: 0.5109513563976995
VotingRegressor for dataset B: 1.8231798886114623
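The same comparison can be collected into a table instead of printed line by line. The helper below is a sketch under assumptions: `mse_table` is a hypothetical function, not part of the notebook, and the data is synthetic. It fits a clone of each model so the originals are left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor


def mse_table(models, X_fit, y_fit, eval_sets):
    """Fit a clone of each model and return one row of MSE scores per model."""
    records = []
    for name, model in models.items():
        fitted = clone(model).fit(X_fit, y_fit)
        row = {"model": name}
        for label, (X_eval, y_eval) in eval_sets.items():
            row[label] = mean_squared_error(y_eval, fitted.predict(X_eval))
        records.append(row)
    return pd.DataFrame(records).set_index("model")


# Synthetic stand-ins for a training split and two evaluation sets.
rng = np.random.default_rng(0)
coef = np.array([1.0, -2.0, 0.5])
X_fit, X_a, X_b = (rng.normal(size=(n, 3)) for n in (80, 30, 30))
y_fit, y_a, y_b = (X @ coef + rng.normal(scale=0.1, size=len(X))
                   for X in (X_fit, X_a, X_b))

table = mse_table(
    {"lr": LinearRegression(), "knn": KNeighborsRegressor(n_neighbors=3)},
    X_fit, y_fit,
    {"A (held out)": (X_a, y_a), "B": (X_b, y_b)})
print(table)
```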


In [7]:
# Refit XGBoost on its own and report its scores.
# The label uses xgb_reg itself, not the leftover loop variable clf.
xgb_reg.fit(X_train_A, np.array(y_train_A).ravel())
y_pred_A = xgb_reg.predict(X_test_A)
y_pred_B = xgb_reg.predict(X_B)
print(xgb_reg.__class__.__name__ + " for dataset A:", mean_squared_error(y_test_A, y_pred_A))
print(xgb_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

XGBRegressor for dataset A: 0.495943901498028
XGBRegressor for dataset B: 1.7703121756309461


# Second case: Training using all of the data from Dataset A

So far we have trained on 75% of Dataset A and tested on the remaining 25%.
Here we train on all of Dataset A and then test on Dataset B.

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor

rf_reg = RandomForestRegressor(n_jobs = -1, max_depth = 50)
svm_reg = SVR(kernel='linear', C=1.0)
knn_reg = KNeighborsRegressor(n_neighbors=3)
xgb_reg = XGBRegressor(learning_rate=0.01, n_estimators=300, max_depth=3, subsample=0.7)
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000)
ada_reg = AdaBoostRegressor(n_estimators=50, learning_rate=0.003)
lr_reg = LinearRegression()

voting_reg = VotingRegressor(
  estimators=[#('nn', mlp_reg),
              ('svc', svm_reg),
              #('knn', knn_reg), 
              ('ada', ada_reg),
              ('xgb', xgb_reg),
              ('rf', rf_reg)
              #('lr', lr_reg)
             ])
voting_reg.fit(X_A, y_A)

In [9]:
from sklearn.metrics import mean_squared_error
# All of Dataset A is used for training here, so there is no held-out
# Dataset A split to score; only Dataset B is evaluated.
for clf in (mlp_reg, svm_reg, ada_reg,
            knn_reg, xgb_reg, rf_reg,
            lr_reg, voting_reg):
    clf.fit(X_A, y_A)
    y_pred_B = clf.predict(X_B)
    print(clf.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

MLPRegressor for dataset B: 16.85960372486897
SVR for dataset B: 1.9978898195590098
AdaBoostRegressor for dataset B: 2.0125139564044154
KNeighborsRegressor for dataset B: 2.5140967431192665
XGBRegressor for dataset B: 1.7474177678437246
RandomForestRegressor for dataset B: 2.0114294615803097
LinearRegression for dataset B: 2.42436481314223
VotingRegressor for dataset B: 1.6878012462689789
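Several of the estimators above (the random forest, AdaBoost, and the MLP) are built without a `random_state`, so their scores can change from run to run. The sketch below shows, on synthetic data, that fixing a seed makes repeated fits give identical predictions. The notebook itself does not do this; `fit_predict` and the demo data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)


def fit_predict(seed):
    """Fit a small random forest with a fixed seed and return its predictions."""
    model = RandomForestRegressor(n_estimators=20, max_depth=50,
                                  random_state=seed)
    return model.fit(X_demo, y_demo).predict(X_demo)


# Two independent fits with the same seed produce identical predictions.
repeatable = np.array_equal(fit_predict(11), fit_predict(11))
print(repeatable)
```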


In [10]:
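# Note: this refits XGBoost on the 75% training split (X_train_A), so the
# score matches the first case; fit on X_A, y_A to score the model trained
# on all of Dataset A.
xgb_reg.fit(X_train_A, np.array(y_train_A).ravel())
y_pred_B = xgb_reg.predict(X_B)
print(xgb_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

XGBRegressor for dataset B: 1.7703121756309461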
