
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/transformer/DatasetB.csv
/kaggle/input/transformer/DatasetA.csv


Here we load the data and set Furan as the label.
We first hold out 20 percent of Dataset A as a test set to find a good model, and then evaluate that model on Dataset B.

In [2]:
ds_A = pd.read_csv("/kaggle/input/transformer/DatasetA.csv")
ds_B = pd.read_csv("/kaggle/input/transformer/DatasetB.csv")

# Splitting train and test
from sklearn.model_selection import train_test_split
train_set_A, test_set_A = train_test_split(ds_A, test_size = 0.2, random_state = 11)

# Setting the labels
y_train_A = train_set_A['Furan']
y_test_A = test_set_A['Furan']

# Dropping the Furan and Health Index columns
X_train_A = train_set_A.drop(["Furan", "HI"], axis = 1)
X_test_A = test_set_A.drop(["Furan", "HI"], axis = 1)

# For DatasetB
y_B = ds_B['Furan']
X_B = ds_B.drop(["Furan", "HI"], axis = 1)

# The code below is for the second case, where we train on the whole of
# Dataset A and test on Dataset B
y_A = ds_A['Furan']
X_A = ds_A.drop(["Furan", "HI"], axis = 1)



In [3]:
#ds_A.hist(bins=50, figsize=(20,15))

The code below drops the columns we don't need, keeping only the features common to Datasets A and B.

In [4]:
# Columns present in Dataset A but missing from Dataset B
cols_only_in_A = list(set(ds_A.columns) - set(ds_B.columns))
X_train_A = X_train_A.drop(cols_only_in_A, axis=1)
X_test_A = X_test_A.drop(cols_only_in_A, axis=1)
X_A = X_A.drop(cols_only_in_A, axis=1)
# Give Dataset B the same columns, in the same order
X_B = X_B[X_train_A.columns]
X_train_A

Unnamed: 0,H2,Methane,Acetylene,Ethylene,Ethane,Water,Acid,BDV,IFT
503,19.0,11.90,0.0,2.7,3.0,5,0.005,73.0,30
679,13.2,9.20,0.0,0.6,2.0,4,0.005,88.0,41
240,14.4,2.50,0.0,0.9,1.5,4,0.005,79.0,31
21,13.2,1.10,0.0,1.8,2.1,4,0.024,78.0,25
56,20.9,0.00,0.6,4.2,3.8,6,0.056,76.0,18
...,...,...,...,...,...,...,...,...,...
269,13.7,5.10,0.0,0.4,1.1,1,0.005,94.0,36
337,32.9,3.77,0.0,0.6,2.4,6,0.005,79.0,32
91,22.8,3.30,0.0,4.9,3.0,11,0.140,88.0,16
80,61.2,27.30,0.0,25.6,20.8,9,0.099,70.0,17


The code below performs forward step-wise feature selection with scikit-learn's SequentialFeatureSelector. "n_features_to_select" is a hyperparameter that sets how many features to keep, and it needs to be tuned. In our experiments, keeping 5 or 7 features resulted in better models. The run shown here selects 5; the models later in the notebook use the 7-feature set (useful_features7).

In [5]:
from sklearn.feature_selection import SequentialFeatureSelector
from catboost import CatBoostRegressor
cat_reg = CatBoostRegressor(iterations=500, learning_rate=0.1, verbose = 0,
                           depth=6, loss_function='RMSE', random_seed=11)

sfs = SequentialFeatureSelector(cat_reg,
                                n_features_to_select=5,
                                direction='forward',
                                scoring='neg_mean_squared_error',
                                cv=5)

sfs.fit(X_train_A, y_train_A)

# Print the selected features
print("Selected features:", sfs.get_feature_names_out())

useful_features1 = ['IFT']
useful_features2 = ['Methane', 'IFT']
useful_features3 = ['Methane', 'Ethane', 'IFT']
useful_features4 = ['Methane', 'Ethane', 'Water', 'IFT']
useful_features5 = ['Methane', 'Ethylene', 'Ethane', 'BDV', 'IFT']
useful_features6 = ['Methane', 'Ethylene', 'Ethane', 'Water', 'BDV', 'IFT']
useful_features7 = ['H2', 'Methane', 'Acetylene', 'Ethylene', 'Ethane', 'BDV', 'IFT']
useful_features8 = ['H2', 'Methane', 'Ethylene', 'Ethane', 'Water', 'Acid', 'BDV', 'IFT']

Selected features: ['Methane' 'Ethylene' 'Ethane' 'Water' 'IFT']
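One way to tune "n_features_to_select" is to cross-validate a pipeline for each candidate value and keep the value with the lowest error. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_regression` and a `LinearRegression` stand-in so that it runs quickly, and the names `X_demo`, `y_demo`, `cv_mse` and `best_k` are ours, not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the training data (assumption: 6 features, 3 informative)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=5.0, random_state=11)

cv_mse = {}
for k in range(1, 6):  # an integer k must be smaller than the number of features
    selector = SequentialFeatureSelector(LinearRegression(),
                                         n_features_to_select=k,
                                         direction='forward',
                                         scoring='neg_mean_squared_error',
                                         cv=3)
    # Selection happens inside the pipeline, so each outer fold
    # selects its features from its own training part only
    pipe = make_pipeline(selector, LinearRegression())
    scores = cross_val_score(pipe, X_demo, y_demo,
                             scoring='neg_mean_squared_error', cv=3)
    cv_mse[k] = -scores.mean()

best_k = min(cv_mse, key=cv_mse.get)
print("Cross-validated MSE per k:", cv_mse)
print("Best number of features:", best_k)
```

With the real data, the same loop could wrap the CatBoost model instead, at a much higher running cost.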


# Capping the outliers

In [6]:
# Capping the outliers at the 97th percentile, column by column
for col in ['Ethylene', 'Methane', 'BDV', 'Ethane', 'H2']:
    upper_lim = X_train_A[col].quantile(.97)
    X_train_A.loc[X_train_A[col] > upper_lim, col] = upper_lim

In [7]:
# Capping the outliers at the 97th percentile, column by column
for col in ['Ethylene', 'Methane', 'BDV', 'Ethane', 'H2']:
    upper_lim = X_A[col].quantile(.97)
    X_A.loc[X_A[col] > upper_lim, col] = upper_lim
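The capping cells above compute the limits from the frame being capped and leave X_test_A and X_B untouched. As a possible variation, the limits could be learned from one reference frame and applied to several frames. The helper below is a hypothetical sketch (`cap_upper` and the demo frames are our own names, not part of the notebook), shown on toy data.

```python
import pandas as pd

def cap_upper(reference, frames, columns, q=0.97):
    """Cap `columns` at the q-th quantile of `reference`.

    Returns capped copies of `frames`; the inputs are not modified.
    """
    limits = reference[columns].quantile(q)  # Series: column name -> upper limit
    capped = []
    for frame in frames:
        out = frame.copy()
        for col in columns:
            out[col] = out[col].clip(upper=limits[col])
        capped.append(out)
    return capped

# Toy data standing in for a training frame and a second frame
train_demo = pd.DataFrame({"Ethylene": [1.0, 2.0, 3.0, 4.0, 100.0]})
other_demo = pd.DataFrame({"Ethylene": [0.5, 500.0]})

train_capped, other_capped = cap_upper(train_demo, [train_demo, other_demo],
                                       ["Ethylene"])
print(train_capped)
print(other_capped)
```

Whether the test frames should be capped at all is a modelling choice; the sketch only shows how the same limits can be reused.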

# First case: Training using 80% of the data and testing on the remaining 20%

We experimented with different combinations of models in the ensemble, and the results were quite similar.
The code below builds a voting regressor (VotingRegressor) from the XGBoost and CatBoost models; the other candidates are commented out of the estimator list.

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import BayesianRidge
from catboost import CatBoostRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
import lightgbm as lgb
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rf_reg = RandomForestRegressor(n_jobs = -1, max_depth = 50)
# svm_reg = SVR(kernel='linear')
# knn_reg = KNeighborsRegressor(n_neighbors=3)
xgb_reg = XGBRegressor(learning_rate=0.01, n_estimators=300, max_depth=3, subsample=0.7)
#mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=10000)
ada_reg = AdaBoostRegressor(n_estimators=50, learning_rate=0.003)
#lr_reg = LinearRegression()
#bay_reg = BayesianRidge()
cat_reg = CatBoostRegressor(iterations=500, learning_rate=0.1, verbose = 0,
                           depth=6, loss_function='RMSE', random_seed=11)
#pls_reg = PLSRegression(n_components=2)
#rig_reg = Ridge(alpha=1.0)
lgb_reg = lgb.LGBMRegressor()
#el_reg = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=0)
'''bag_reg = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=50, max_samples=100,
    max_features=1.0,
    bootstrap=True,
    n_jobs=-1)'''

voting_reg = VotingRegressor(
  estimators=[#('nn', mlp_reg),
              #('svc', svm_reg),
              #('knn', knn_reg), 
              #('ada', ada_reg),#('by', bay_reg),
              ('xgb', xgb_reg),('cat', cat_reg)
              #('rf', rf_reg),
              #('el', el_reg), ('lgb', lgb_reg)
             ])
voting_reg.fit(X_train_A[useful_features7], y_train_A)
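With default (uniform) weights, a VotingRegressor predicts the average of its fitted base models' predictions. The sketch below checks this on synthetic data; the two base models and the `demo_` names are stand-ins chosen to keep the example light, not the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=4,
                                 noise=1.0, random_state=11)

demo_voter = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('tree', DecisionTreeRegressor(max_depth=3, random_state=11)),
])
demo_voter.fit(X_demo, y_demo)

# Average the predictions of the fitted clones stored in estimators_
manual_mean = np.mean([est.predict(X_demo) for est in demo_voter.estimators_],
                      axis=0)
matches = np.allclose(demo_voter.predict(X_demo), manual_mean)
print("Voting prediction equals the mean of base predictions:", matches)
```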

Here is a comparison of the individual models and the voting regressor.

In [9]:
#from sklearn.linear_model import ElasticNet
#el_reg = ElasticNet(alpha=0.5, l1_ratio=1, random_state=0)

from sklearn.metrics import mean_squared_error
for reg in (#mlp_reg, #svm_reg,
            ada_reg,
            #knn_reg,
            xgb_reg, rf_reg, #bay_reg, el_reg, pls_reg,
            cat_reg,lgb_reg, #bag_reg,lr_reg,
            voting_reg):
    reg.fit(X_train_A[useful_features7], y_train_A)
    y_pred_A = reg.predict(X_test_A[useful_features7])
    y_pred_B = reg.predict(X_B[useful_features7])
    print(reg.__class__.__name__ + " for dataset A:", mean_squared_error(y_test_A, y_pred_A))
    print(reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

AdaBoostRegressor for dataset A: 0.8855287188086115
AdaBoostRegressor for dataset B: 1.732591798052474
XGBRegressor for dataset A: 0.5262715682224556
XGBRegressor for dataset B: 1.6620733070908118
RandomForestRegressor for dataset A: 0.6131627243934712
RandomForestRegressor for dataset B: 1.8005348086749497
CatBoostRegressor for dataset A: 0.41125827043293767
CatBoostRegressor for dataset B: 1.4900958095550392
LGBMRegressor for dataset A: 0.5430317213977495
LGBMRegressor for dataset B: 1.848011276907881
VotingRegressor for dataset A: 0.43936136503548184
VotingRegressor for dataset B: 1.5483783293521984
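The scores above are mean squared errors, so they are in squared Furan units. Taking the square root gives the RMSE, which is on the same scale as the label. The sketch below writes the MSE out by hand on toy numbers (the `mse` helper and the toy arrays are ours) and converts the two CatBoost scores printed above.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy check: the residuals are 0, 0 and -2, so the MSE is 4 / 3
toy_mse = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# CatBoost MSE values copied from the output above
rmse_A = np.sqrt(0.41125827043293767)
rmse_B = np.sqrt(1.4900958095550392)
print("Toy MSE:", toy_mse)
print("CatBoost RMSE on dataset A:", rmse_A)
print("CatBoost RMSE on dataset B:", rmse_B)
```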


In [10]:
X_A

Unnamed: 0,H2,Methane,Acetylene,Ethylene,Ethane,Water,Acid,BDV,IFT
0,11.800,0.0,0.0,1.0,1.4,13,0.005,48.0,24
1,78.921,90.9,0.7,26.7,110.3,22,0.096,68.0,17
2,18.500,4.9,0.4,3.4,2.8,8,0.030,70.0,26
3,33.600,3.3,0.3,5.8,4.1,12,0.032,57.0,26
4,16.600,4.4,0.0,2.8,4.4,6,0.036,77.0,24
...,...,...,...,...,...,...,...,...,...
725,71.500,9.7,0.0,0.8,2.5,8,0.005,91.0,31
726,66.000,13.6,0.0,1.3,5.8,9,0.005,94.0,31
727,73.700,10.1,0.0,0.9,4.2,7,0.005,90.0,32
728,52.800,18.3,0.0,1.7,7.7,6,0.005,68.0,32


In [11]:
xgb_reg = XGBRegressor(learning_rate=0.3, n_estimators=5000, max_depth=3,
                       subsample=0.7,early_stopping_rounds=10)
xgb_reg.fit(X_train_A[useful_features7], y_train_A,
            eval_set=[(X_test_A[useful_features7], y_test_A)], verbose=0)
y_pred_A = xgb_reg.predict(X_test_A[useful_features7])
y_pred_B = xgb_reg.predict(X_B[useful_features7])
print(xgb_reg.__class__.__name__ + " for dataset A:", mean_squared_error(y_test_A, y_pred_A))
print(xgb_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

XGBRegressor for dataset A: 0.41662357832206487
XGBRegressor for dataset B: 1.49313769127646


In [12]:
xgb_reg = XGBRegressor(learning_rate=0.01, n_estimators=300, max_depth=3, subsample=0.7)
xgb_reg.fit(X_train_A[useful_features7], np.array(y_train_A).ravel())
y_pred_A = xgb_reg.predict(X_test_A[useful_features7])
y_pred_B = xgb_reg.predict(X_B[useful_features7])
print(xgb_reg.__class__.__name__ + " for dataset A:", mean_squared_error(y_test_A, y_pred_A))
print(xgb_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

XGBRegressor for dataset A: 0.5262715682224556
XGBRegressor for dataset B: 1.6620733070908118


In [13]:
cat_reg = CatBoostRegressor(iterations=1000, learning_rate=0.1, verbose = 0,
                           depth=6, loss_function='RMSE', random_seed=11)
cat_reg.fit(X_train_A[useful_features7], y_train_A)
y_pred_A = cat_reg.predict(X_test_A[useful_features7])
y_pred_B = cat_reg.predict(X_B[useful_features7])
print(cat_reg.__class__.__name__ + " for dataset A:", mean_squared_error(y_test_A, y_pred_A))
print(cat_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

CatBoostRegressor for dataset A: 0.40638946768650613
CatBoostRegressor for dataset B: 1.4915601517993187


# Second case: Training using all of the data from Dataset A

So far we have trained on 80% of Dataset A and tested on the remaining 20%.
Here we train on all of Dataset A and then test on Dataset B.

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import BayesianRidge
from catboost import CatBoostRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
import lightgbm as lgb
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rf_reg = RandomForestRegressor(n_jobs = -1, max_depth = 50)
# svm_reg = SVR(kernel='linear')
# knn_reg = KNeighborsRegressor(n_neighbors=3)
xgb_reg = XGBRegressor(learning_rate=0.01, n_estimators=300, max_depth=3, subsample=0.7)
#mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=10000)
ada_reg = AdaBoostRegressor(n_estimators=50, learning_rate=0.003)
#lr_reg = LinearRegression()
#bay_reg = BayesianRidge()
cat_reg = CatBoostRegressor(iterations=500, learning_rate=0.1, verbose = 0,
                           depth=6, loss_function='RMSE', random_seed=11)
#pls_reg = PLSRegression(n_components=2)
#rig_reg = Ridge(alpha=1.0)
lgb_reg = lgb.LGBMRegressor()
#el_reg = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=0)
'''bag_reg = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=50, max_samples=100,
    max_features=1.0,
    bootstrap=True,
    n_jobs=-1)'''

voting_reg = VotingRegressor(
  estimators=[#('nn', mlp_reg),
              #('svc', svm_reg),
              #('knn', knn_reg), 
              ('ada', ada_reg),#('by', bay_reg),
              ('xgb', xgb_reg),('cat', cat_reg)
              #('rf', rf_reg),
              #('el', el_reg), ('lgb', lgb_reg)
             ])
voting_reg.fit(X_A[useful_features7], y_A)

In [15]:
from sklearn.metrics import mean_squared_error
for clf in (ada_reg,
            xgb_reg, rf_reg,
            cat_reg, voting_reg):
    clf.fit(X_A[useful_features7], y_A)
    y_pred_B = clf.predict(X_B[useful_features7])
    print(clf.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

AdaBoostRegressor for dataset B: 2.078609509601341
XGBRegressor for dataset B: 1.7328138039972423
RandomForestRegressor for dataset B: 1.7118440296387807
CatBoostRegressor for dataset B: 1.6613241575693878
VotingRegressor for dataset B: 1.671155843668213


In [16]:
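# Note: this cell fits on the 80% training split (X_train_A), not on all of
# Dataset A, so the score below matches the first-case XGBoost result.
xgb_reg.fit(X_train_A[useful_features7], np.array(y_train_A).ravel())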
y_pred_B = xgb_reg.predict(X_B[useful_features7])
print(xgb_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

XGBRegressor for dataset B: 1.6620733070908118


In [17]:
cat_reg.fit(X_A[useful_features7], y_A)
y_pred_B = cat_reg.predict(X_B[useful_features7])
print(cat_reg.__class__.__name__ + " for dataset B:", mean_squared_error(y_B, y_pred_B))

CatBoostRegressor for dataset B: 1.6613241575693878
