<h1><center>Santander Customer Transaction Prediction</center></h1>

# Competition Description
At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

# Data Resource
The dataset is downloaded from [Kaggle.com](https://www.kaggle.com/c/santander-customer-transaction-prediction/data?select=train.csv). 

# Data Fields
ID_code

target

var_0 - var_199

In [1]:
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import gdown

In [3]:
import tensorflow as tf
from tensorflow import keras

In [4]:
# The random seed
random_seed = 42

# Set random seed in tensorflow
tf.random.set_seed(random_seed)

# Set random seed in numpy
import numpy as np
np.random.seed(random_seed)

# Data Preprocessing

## Part 1: Loading Data

In [5]:
import pandas as pd
import gdown

# Load Train data
data_train_url = 'https://drive.google.com/uc?export=download&id=1Pcd-5jhUagyaZBmP3advWYBJepc5mLke'
output_train = 'tran.csv'
gdown.download(data_train_url, output_train, quiet=False)
df_train = pd.read_csv(output_train)

# Load Test data
data_test_url = 'https://drive.google.com/uc?export=download&id=1G_GahA2no7KVRdl-Tw8vEONZcnfku6vF'
output_test = 'test.csv'
gdown.download(data_test_url, output_test, quiet=False)
df_test=pd.read_csv(output_test)

# Deep copy all data
df_raw_train = df_train.copy(deep=True)

df_raw_test = df_test.copy(deep=True)

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1Pcd-5jhUagyaZBmP3advWYBJepc5mLke
From (redirected): https://drive.google.com/uc?export=download&id=1Pcd-5jhUagyaZBmP3advWYBJepc5mLke&confirm=t&uuid=44fd41c9-7d59-4541-8664-c11926638229
To: C:\Users\yangy\iCloudDrive\Personal Life\Jobs in America\Github\Classification\santander-customer-transaction-prediction\tran.csv
100%|██████████| 302M/302M [00:08<00:00, 33.8MB/s] 
Downloading...
From (original): https://drive.google.com/uc?export=download&id=1G_GahA2no7KVRdl-Tw8vEONZcnfku6vF
From (redirected): https://drive.google.com/uc?export=download&id=1G_GahA2no7KVRdl-Tw8vEONZcnfku6vF&confirm=t&uuid=2c358c7c-4f4f-4810-b069-3a2d573277fe
To: C:\Users\yangy\iCloudDrive\Personal Life\Jobs in America\Github\Classification\santander-customer-transaction-prediction\test.csv
100%|██████████| 302M/302M [00:08<00:00, 34.5MB/s] 


In [6]:
# Shape of Train data
pd.DataFrame([ [df_train.shape[0], df_train.shape[1]]], columns=['#Row', '#Column'])

Unnamed: 0,#Row,#Column
0,200000,202


In [7]:
# Shape of Test data
pd.DataFrame([[df_test.shape[0], df_test.shape[1] ]], columns=['#Row', '#Column'])

Unnamed: 0,#Row,#Column
0,200000,201


## Part 2: Splitting Data

In [8]:
from sklearn.model_selection import train_test_split

# Keep 80% of the train data for training and hold out the remaining 20% for validation
df_train, df_val = train_test_split(df_train, train_size=0.8, random_state=random_seed)

df_train, df_val = df_train.reset_index(drop=True), df_val.reset_index(drop=True)
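The split above is purely random, so the share of positive targets in `df_train` and `df_val` can differ slightly. As a minimal sketch of one possible variation (not what this notebook does), the `stratify` argument of `train_test_split` keeps the class proportions equal in both parts. The toy DataFrame and its 10% positive rate are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows with 10% positives
toy = pd.DataFrame({'target': [1] * 10 + [0] * 90, 'var_0': np.arange(100.0)})

# Same 80/20 split as above, but stratified on the target
toy_train, toy_val = train_test_split(toy, train_size=0.8, random_state=42, stratify=toy['target'])

# Number of rows in each part
print(len(toy_train), len(toy_val))

# Share of positives in each part
print(toy_train['target'].mean(), toy_val['target'].mean())
```

Whether stratification matters here depends on how imbalanced the target is, so it is worth checking the target proportions of the real split before changing anything.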

In [9]:
# Shape of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['#Row', '#Column'])

Unnamed: 0,#Row,#Column
0,160000,202


In [10]:
# Shape of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['#Row', '#Column'])

Unnamed: 0,#Row,#Column
0,40000,202


In [11]:
# Shape of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['#Row', '#Column'])

Unnamed: 0,#Row,#Column
0,200000,201


In [12]:
# Show the head of df_train
df_train.head(5)

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_153248,0,12.3039,-8.3899,9.1944,8.0649,9.0247,-1.9559,5.1565,21.1631,...,5.5185,7.9504,0.9184,5.9945,11.0078,-1.0936,-2.3412,8.1712,12.9046,-1.9309
1,train_67802,0,15.4069,2.782,9.2951,7.1997,8.5359,-4.5422,5.421,9.9651,...,3.0063,5.6555,2.1527,1.3518,15.4728,0.2686,6.5523,8.4698,22.0454,1.4756
2,train_148889,0,9.6427,-4.6261,6.961,5.4054,12.0859,-11.2917,4.529,13.8605,...,3.4351,9.1779,1.5004,1.9895,20.4072,-0.1118,0.5692,9.329,12.898,-9.4318
3,train_103093,1,9.6881,-5.6696,11.2709,8.2812,13.9232,-16.1434,4.9664,20.1092,...,-4.9494,9.2727,1.1371,3.7435,20.6906,1.3752,7.4442,9.2145,18.2777,-2.5865
4,train_104681,0,7.1128,-2.083,11.4807,8.3033,10.618,-6.4743,5.0078,21.0212,...,7.3583,8.1992,1.3436,8.8929,21.6711,-2.0557,6.4975,8.311,13.7728,-5.9028


In [13]:
# Show the head of df_val
df_val.head(5)

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_119737,0,11.0038,-4.5026,9.0662,6.4313,10.7061,-15.2857,5.1233,16.7875,...,-0.82,3.3085,3.1358,5.0959,19.716,-0.1801,5.8437,8.8348,17.0461,8.819
1,train_72272,0,12.8473,-6.1848,6.8799,2.0164,12.7998,10.2781,4.4191,15.694,...,1.1516,3.9019,4.6616,7.6035,12.6402,-0.3037,-4.233,9.7456,14.8337,-3.7167
2,train_158154,0,13.1827,-0.8344,13.4689,3.906,13.5984,4.6475,5.9659,24.0557,...,2.8737,5.8939,0.8525,8.7406,16.6641,0.8745,7.0406,8.6424,20.7107,-5.4186
3,train_65426,0,8.2132,1.2309,11.1464,9.4524,10.2142,4.0416,5.3989,20.4527,...,6.4752,5.7442,2.1907,6.0651,10.9444,-2.0666,-7.9209,9.0522,17.1735,12.4656
4,train_30074,1,5.5681,4.6355,15.235,3.0718,11.8178,-15.0502,3.8357,12.0169,...,4.1796,5.6113,-0.1561,3.101,17.4297,-1.0121,-6.5168,7.9772,18.5248,11.2771


In [14]:
# Show the head of df_test
df_test.head(5)

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,...,-2.1556,11.8495,-1.43,2.4508,13.7112,2.4669,4.3654,10.72,15.4722,-8.7197
1,test_1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,...,10.6165,8.8349,0.9403,10.1282,15.5765,0.4773,-1.4852,9.8714,19.1293,-20.976
2,test_2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.895,20.2537,1.5233,...,-0.7484,10.9935,1.9803,2.18,12.9813,2.1281,-7.1086,7.0618,19.8956,-23.1794
3,test_3,8.5374,-1.3222,12.022,6.5749,8.8458,3.1744,4.9397,20.566,3.3755,...,9.5702,9.0766,1.658,3.5813,15.1874,3.1656,3.9567,9.2295,13.0168,-4.2108
4,test_4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.989,...,4.2259,9.1723,1.2835,3.3778,19.5542,-0.286,-5.1612,7.2882,13.926,-9.1846


In [15]:
# Set the target
target = 'target'

## Part 3: Handling the uncommon features

### 3.1 Identify the common features

In [16]:
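# Create a function to find the features shared by the train, validation and test data
def common_feature_checker(df_train, df_val, df_test, target):
    # Features present in both the train and validation data
    common_train_val = np.intersect1d(df_train.columns, df_val.columns)

    # The test data has no target, so add the target before intersecting
    common_all = np.intersect1d(common_train_val, np.union1d(df_test.columns, [target]))

    return pd.DataFrame(common_all, columns=['common_var'])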

In [17]:
# Utilise the common_feature_checker function
df_common_var = common_feature_checker(df_train, df_val, df_test, target)

# Print the df_common_var
df_common_var

Unnamed: 0,common_var
0,ID_code
1,target
2,var_0
3,var_1
4,var_10
...,...
197,var_95
198,var_96
199,var_97
200,var_98


### 3.2 Identify the uncommon features

In [18]:
# Get the features in train dataset not in others
uncommon_feature_train_not_val_test = np.setdiff1d(df_train.columns, df_common_var['common_var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_train_not_val_test, columns=['uncommon_feature'])

Unnamed: 0,uncommon_feature


In [19]:
# Get the features in validation dataset not in others
uncommon_feature_val_not_train_test = np.setdiff1d(df_val.columns, df_common_var['common_var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_val_not_train_test, columns=['uncommon_feature'])

Unnamed: 0,uncommon_feature


In [20]:
# Get the features in test dataset not in others
uncommon_feature_test_not_train_val = np.setdiff1d(df_test.columns, df_common_var['common_var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_test_not_train_val, columns=['uncommon_feature'])

Unnamed: 0,uncommon_feature


### Summary
There are no uncommon features, so nothing needs to be removed.
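To see what these checks report when the datasets do differ, here is a minimal sketch on three toy DataFrames. The column names `a`, `b`, `extra` and the helper `shared_columns` are hypothetical and only illustrate the set operations. The helper keeps a column only if it is present in all three datasets, treating the target as if it were present in the test data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'extra' exists only in the toy train set
toy_train = pd.DataFrame({'a': [1], 'b': [2], 'extra': [3], 'target': [0]})
toy_val = pd.DataFrame({'a': [1], 'b': [2], 'target': [1]})
toy_test = pd.DataFrame({'a': [1], 'b': [2]})

def shared_columns(df_train, df_val, df_test, target):
    # Columns present in both train and validation
    train_val = np.intersect1d(df_train.columns, df_val.columns)
    # The test set has no target, so add it before intersecting
    return np.intersect1d(train_val, np.union1d(df_test.columns, [target]))

common = shared_columns(toy_train, toy_val, toy_test, 'target')

# Columns of the toy train set that are not shared by all datasets
uncommon_train = np.setdiff1d(toy_train.columns, common)

print(list(common))          # contains 'a', 'b' and 'target'
print(list(uncommon_train))  # contains only 'extra'
```

In the Santander data every one of these differences is empty, which is why the three output tables above have no rows.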

## Part 4: Handling identifiers

### 4.1 Building an id_checker function

In [21]:
def id_checker(df, dtype='float'):
    """
    Return the identifier columns of df.

    A column is treated as an identifier when both conditions hold:
    condition 1: its dtype is not `dtype` (float by default), which keeps object/string columns;
    condition 2: every non-null value in the column is unique.
    """
    df_id = df[[var for var in df.columns
                # Condition 1
                if (df[var].dtype != dtype
                    # Condition 2
                    and df[var].nunique(dropna=True) == df[var].notnull().sum())]]
    return df_id
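The two conditions are easier to follow on a small example. The sketch below applies the same logic to a toy DataFrame; `id_checker_sketch` and the toy values are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

def id_checker_sketch(df, dtype='float'):
    # Keep columns that are not float and whose non-null values are all unique
    return df[[var for var in df.columns
               if df[var].dtype != dtype
               and df[var].nunique(dropna=True) == df[var].notnull().sum()]]

toy = pd.DataFrame({
    'ID_code': ['train_0', 'train_1', 'train_2'],  # unique strings: identifier
    'target': [0, 1, 0],                           # integers with repeats: fails condition 2
    'var_0': [1.5, 2.5, 3.5],                      # unique but float: fails condition 1
})

# Only 'ID_code' satisfies both conditions
print(list(id_checker_sketch(toy).columns))
```

Note that an integer column whose values happen to be all unique would also pass both conditions, so the result is worth inspecting before dropping anything.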

In [22]:
# Set a total dataset df
df = pd.concat([df_train, df_val, df_test], sort=False)

# Check df_id in the dataset df
df_id = id_checker(df)

# Print df_id
df_id

Unnamed: 0,ID_code
0,train_153248
1,train_67802
2,train_148889
3,train_103093
4,train_104681
...,...
199995,test_199995
199996,test_199996
199997,test_199997
199998,test_199998


In [23]:
# Remove the df_id from df_train
df_train.drop(columns=np.intersect1d(df_id.columns, df_train.columns), inplace=True)

# Remove the df_id from df_val
df_val.drop(columns=np.intersect1d(df_id.columns, df_val.columns), inplace=True)

# Remove the df_id from df_test
df_test.drop(columns=np.intersect1d(df_id.columns, df_test.columns), inplace=True)

In [24]:
# Print the new df_train
df_train.head(5)

Unnamed: 0,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,0,12.3039,-8.3899,9.1944,8.0649,9.0247,-1.9559,5.1565,21.1631,2.7437,...,5.5185,7.9504,0.9184,5.9945,11.0078,-1.0936,-2.3412,8.1712,12.9046,-1.9309
1,0,15.4069,2.782,9.2951,7.1997,8.5359,-4.5422,5.421,9.9651,4.0623,...,3.0063,5.6555,2.1527,1.3518,15.4728,0.2686,6.5523,8.4698,22.0454,1.4756
2,0,9.6427,-4.6261,6.961,5.4054,12.0859,-11.2917,4.529,13.8605,-0.8366,...,3.4351,9.1779,1.5004,1.9895,20.4072,-0.1118,0.5692,9.329,12.898,-9.4318
3,1,9.6881,-5.6696,11.2709,8.2812,13.9232,-16.1434,4.9664,20.1092,-5.9868,...,-4.9494,9.2727,1.1371,3.7435,20.6906,1.3752,7.4442,9.2145,18.2777,-2.5865
4,0,7.1128,-2.083,11.4807,8.3033,10.618,-6.4743,5.0078,21.0212,-4.9779,...,7.3583,8.1992,1.3436,8.8929,21.6711,-2.0557,6.4975,8.311,13.7728,-5.9028


In [25]:
# Print the new df_val
df_val.head(5)

Unnamed: 0,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,0,11.0038,-4.5026,9.0662,6.4313,10.7061,-15.2857,5.1233,16.7875,4.1833,...,-0.82,3.3085,3.1358,5.0959,19.716,-0.1801,5.8437,8.8348,17.0461,8.819
1,0,12.8473,-6.1848,6.8799,2.0164,12.7998,10.2781,4.4191,15.694,-0.6788,...,1.1516,3.9019,4.6616,7.6035,12.6402,-0.3037,-4.233,9.7456,14.8337,-3.7167
2,0,13.1827,-0.8344,13.4689,3.906,13.5984,4.6475,5.9659,24.0557,3.8743,...,2.8737,5.8939,0.8525,8.7406,16.6641,0.8745,7.0406,8.6424,20.7107,-5.4186
3,0,8.2132,1.2309,11.1464,9.4524,10.2142,4.0416,5.3989,20.4527,0.2915,...,6.4752,5.7442,2.1907,6.0651,10.9444,-2.0666,-7.9209,9.0522,17.1735,12.4656
4,1,5.5681,4.6355,15.235,3.0718,11.8178,-15.0502,3.8357,12.0169,3.2997,...,4.1796,5.6113,-0.1561,3.101,17.4297,-1.0121,-6.5168,7.9772,18.5248,11.2771


In [26]:
# Print the new df_test
df_test.head(5)

Unnamed: 0,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,8.81,...,-2.1556,11.8495,-1.43,2.4508,13.7112,2.4669,4.3654,10.72,15.4722,-8.7197
1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,5.9739,...,10.6165,8.8349,0.9403,10.1282,15.5765,0.4773,-1.4852,9.8714,19.1293,-20.976
2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.895,20.2537,1.5233,8.3442,...,-0.7484,10.9935,1.9803,2.18,12.9813,2.1281,-7.1086,7.0618,19.8956,-23.1794
3,8.5374,-1.3222,12.022,6.5749,8.8458,3.1744,4.9397,20.566,3.3755,7.4578,...,9.5702,9.0766,1.658,3.5813,15.1874,3.1656,3.9567,9.2295,13.0168,-4.2108
4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.989,7.1437,...,4.2259,9.1723,1.2835,3.3778,19.5542,-0.286,-5.1612,7.2882,13.926,-9.1846


## Part 5: Handling date time variables

### 5.1 Checking whether there are date/time variables

In [27]:
for col in df.columns:
    # Attempt to convert columns with object dtype that could be dates
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
            print(f"Column '{col}' converted to datetime.")
        except (ValueError, TypeError):
            # If conversion fails, it's not a datetime column; proceed
            continue

# After attempting conversion, recheck for datetime columns
has_datetime = any(pd.api.types.is_datetime64_any_dtype(df[col]) for col in df.columns)

print(f"Does the DataFrame have at least one datetime column after conversion attempt? {has_datetime}")

Does the DataFrame have at least one datetime column after conversion attempt? False


### 5.2 Summary

There are no date time variables in the dataset, so no date time handling is needed.
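For comparison, here is a minimal sketch of what the same kind of check reports when a date-like column is present. The toy DataFrame and its `signup` column are hypothetical; the sketch tries the conversion on every non-numeric column and keeps it only when parsing succeeds.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID_code': ['train_0', 'train_1'],       # strings that are not dates
    'signup': ['2019-01-31', '2019-02-28'],  # hypothetical date-like strings
    'var_0': [1.5, 2.5],
})

converted = []
for col in toy.columns:
    # Only non-numeric columns are candidates for date parsing
    if not pd.api.types.is_numeric_dtype(toy[col]):
        try:
            toy[col] = pd.to_datetime(toy[col])
            converted.append(col)
        except (ValueError, TypeError):
            # Parsing failed, so this is not a datetime column
            continue

print(converted)
```

In the notebook the only non-numeric column is `ID_code`, which fails to parse in the same way, so no column is converted.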

## Part 6: Handling missing data

### 6.1 Defining a nan_checker function

In [28]:
# Create a nan_checker function
def nan_checker(df):
      df_nan = pd.DataFrame([[var, df[var].isna().sum() / df.shape[0], df[var].dtype]
                                             for var in df.columns if df[var].isna().sum() > 0],
                                              columns=['var', 'proportion', 'dtype'])
      # Sort df_nan in descending order of the proportion of NaN
      df_nan = df_nan.sort_values(by='proportion', ascending=False).reset_index(drop=True)

      return df_nan

In [29]:
# Call the function nan_checker on df
df_nan=nan_checker(df)

# Print the first 5 rows of df_nan
df_nan.head(5)

Unnamed: 0,var,proportion,dtype
0,target,0.5,float64


### 6.2 Summary
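The test data has no target column, and it makes up 50% of the combined data (200,000 of 400,000 rows), so the 0.5 proportion of missing target values comes entirely from the test data. No other variable is listed, so there is no missing data in the train and validation data.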
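The 0.5 figure can be reproduced on a small scale. The sketch below concatenates a labelled toy frame with an unlabelled one of the same size; the frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical sizes): 3 labelled rows and 3 unlabelled rows
toy_labelled = pd.DataFrame({'target': [0, 1, 0], 'var_0': [0.1, 0.2, 0.3]})
toy_test = pd.DataFrame({'var_0': [0.4, 0.5, 0.6]})

# Rows from toy_test get NaN in the 'target' column
toy_all = pd.concat([toy_labelled, toy_test], sort=False)

# Proportion of missing values per column
nan_proportion = toy_all.isna().mean()
print(nan_proportion)

# The integer target becomes float64 once NaN values are introduced
print(toy_all['target'].dtype)
```

This also explains why `nan_checker` reports the target as `float64` even though it holds 0/1 labels in the train and validation data.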

## Part 7: Encoding data

None of the features is categorical, and the target is already encoded as 0/1, so no encoding is needed.

## Part 8: Split the feature and target

In [31]:
# Get the feature matrix
X_train = df_train[np.setdiff1d(df_train.columns, [target])].values
X_val = df_val[np.setdiff1d(df_val.columns, [target])].values
X_test = df_test[np.setdiff1d(df_test.columns, [target])].values

# Get the target matrix
y_train = df_train[target]
y_val = df_val[target]

## Part 9: Scaling the data

In [32]:
from sklearn.preprocessing import StandardScaler

# The standardScaler
ss = StandardScaler()

In [33]:
# Fit the scaler on the training data and standardize it
X_train = ss.fit_transform(X_train)

# Standardize the validation data with the training statistics
X_val = ss.transform(X_val)

# Standardize the test data with the training statistics
X_test = ss.transform(X_test)
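The scaler is fitted on the training data only, and the same training mean and standard deviation are then applied to the validation and test data. A minimal sketch with toy arrays (hypothetical values) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values)
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_val = np.array([[2.0], [4.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from the train rows
toy_val_scaled = scaler.transform(toy_val)          # reuses the train statistics

# Statistics learned from the training rows only
print(scaler.mean_, scaler.scale_)

# The train column is centred at zero; the validation column is not
print(toy_train_scaled.mean(), toy_val_scaled.ravel())
```

The scaled validation values are not centred at zero, which is expected because they are measured against the training statistics. Fitting on the training data alone keeps information from the validation and test data out of the preprocessing step.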

# Hyperparameter Tuning

## Part 1: Model Dictionary

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

models = {'lr': LogisticRegression(class_weight='balanced', random_state=random_seed),
                 'mlpc': MLPClassifier(early_stopping=True, random_state=random_seed)}

## Part 2: Pipelines Dictionary

In [35]:
from sklearn.pipeline import Pipeline

pipes = {}

for acronym, model in models.items():
      pipes[acronym] = Pipeline([('model', model)])

## Part 3: Predefined Split Cross-validator

In [38]:
from sklearn.model_selection import PredefinedSplit

def get_train_val_ps(X_train, y_train, X_val, y_val):
    if isinstance(X_train, pd.DataFrame):
        X_train = X_train.values
    if isinstance(X_val, pd.DataFrame):
        X_val = X_val.values
    if isinstance(y_train, pd.Series):
        y_train = y_train.values
    if isinstance(y_val, pd.Series):
        y_val = y_val.values

    X_train_val = np.vstack((X_train, X_val))
    y_train_val = np.concatenate((y_train, y_val))
    train_val_idxs = np.append(np.full(len(X_train), -1), np.full(len(X_val), 0))
    ps = PredefinedSplit(train_val_idxs)

    return X_train_val, y_train_val, ps

In [39]:
# Call the function get_train_val_ps
X_train_val, y_train_val, ps = get_train_val_ps(X_train, y_train, X_val, y_val)
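`PredefinedSplit` marks each row with a fold index: `-1` means the row is always used for training, and `0` means it belongs to validation fold 0. A minimal sketch with hypothetical toy sizes shows that this yields exactly one train/validation split:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy sizes (hypothetical): 4 training rows followed by 2 validation rows
n_train, n_val = 4, 2
test_fold = np.append(np.full(n_train, -1), np.full(n_val, 0))

ps_toy = PredefinedSplit(test_fold)

# Number of train/validation splits
print(ps_toy.get_n_splits())

# Row indices used for training and for validation in each split
for train_idx, val_idx in ps_toy.split():
    print(train_idx, val_idx)
```

This is why the training rows must come first when the arrays are stacked: the fold indices are matched to rows by position.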

## Part 4: GridSearchCV

### 4.1 Parameter Grid Dictionary

In [40]:
param_grids = {}

### 4.2 LogisticRegression Parameter Grid

In [41]:
# The parameter grid of tol
tol_grid = [10 ** -i for i in range(3,6)]

# The parameter grid of C
C_grid = [ 0.1, 1, 10]

# Update param_grids
param_grids['lr'] = [{'model__tol': tol_grid,
                                   'model__C': C_grid}]

### 4.3 MLPClassifier Parameter Grid

In [42]:
# The grids for alpha
alpha_grids = [10 ** -i for i in range(1, 7)]

# The grids for learning_rate_init
learning_rate_init_grids = [10 ** -i for i in range(2, 6)]

# The model hidden layer sizes
# hidden_layer_sizes = [(50,), (100,), (50, 50), (100, 50)]
hidden_layer_sizes = [(50,), (100,), (50, 50)]

# The Activation
activation = ['relu', 'tanh']

# Update param_grids
param_grids['mlpc'] = [
    {
        'model__alpha': alpha_grids,
        'model__learning_rate_init': learning_rate_init_grids,
        'model__hidden_layer_sizes': hidden_layer_sizes,
        'model__activation': activation
    }
]
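Before launching the search it helps to know how many candidates each grid contains. The sketch below rebuilds the two grids and counts their combinations with scikit-learn's `ParameterGrid`; the variable names `lr_grid` and `mlpc_grid` are hypothetical copies of the grids defined above.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the LogisticRegression grid: 3 x 3 values
lr_grid = {'model__tol': [10 ** -i for i in range(3, 6)],
           'model__C': [0.1, 1, 10]}

# Copy of the MLPClassifier grid: 6 x 4 x 3 x 2 values
mlpc_grid = {'model__alpha': [10 ** -i for i in range(1, 7)],
             'model__learning_rate_init': [10 ** -i for i in range(2, 6)],
             'model__hidden_layer_sizes': [(50,), (100,), (50, 50)],
             'model__activation': ['relu', 'tanh']}

n_lr = len(ParameterGrid(lr_grid))
n_mlpc = len(ParameterGrid(mlpc_grid))
print(n_lr, n_mlpc)  # 9 144
```

Because the predefined split has a single fold, each candidate is fitted once during the search, plus one final refit of the best candidate when `refit=True` (the default). The MLPClassifier grid therefore accounts for most of the run time.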

### 4.4 CV results Directory

In [43]:
import os

# The base directory; the trailing slash is needed because the paths below are built by string concatenation
abspath_curr = 'C:/Users/yangy/Desktop/Github_Projects/'

# Make directory
directory = os.path.dirname(abspath_curr + 'machine_learning_results/shallow_learning_classification_1/cv_results/GridSearchCV/')
if not os.path.exists(directory):
    os.makedirs(directory)

### 4.5 Tuning the hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV

# The list of [best_score_, best_params_, best_estimator_] obtained by GridSearchCV
best_score_params_estimator_gs = []

# For each model
for acronym in pipes.keys():
    # GridSearchCV
    gs = GridSearchCV(estimator=pipes[acronym],
                      param_grid=param_grids[acronym],
                      scoring='f1_macro',
                      n_jobs=6,
                      cv=ps,
                      return_train_score=True)

    # Fit the pipeline
    gs = gs.fit(X_train_val, y_train_val)

    # Update best_score_params_estimator_gs
    best_score_params_estimator_gs.append([gs.best_score_, gs.best_params_, gs.best_estimator_])

    # Sort cv_results in ascending order of 'rank_test_score' and 'std_test_score'
    cv_results = pd.DataFrame.from_dict(gs.cv_results_).sort_values(by=['rank_test_score', 'std_test_score'])

    # Get the important columns in cv_results
    important_columns = ['rank_test_score',
                         'mean_test_score',
                         'std_test_score',
                         'mean_train_score',
                         'std_train_score',
                         'mean_fit_time',
                         'std_fit_time',
                         'mean_score_time',
                         'std_score_time']

    # Move the important columns ahead
    cv_results = cv_results[important_columns + sorted(list(set(cv_results.columns) - set(important_columns)))]

    # Write cv_results file
    cv_results.to_csv(path_or_buf=abspath_curr + 'machine_learning_results/shallow_learning_classification_1/cv_results/GridSearchCV/' + acronym + '.csv', index=False)

# Sort best_score_params_estimator_gs in descending order of the best_score_
best_score_params_estimator_gs = sorted(best_score_params_estimator_gs, key=lambda x : x[0], reverse=True)

# Print best_score_params_estimator_gs
pd.DataFrame(best_score_params_estimator_gs, columns=['best_score', 'best_param', 'best_estimator'])
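The loop above can take a long time on the full data, so here is a minimal sketch of the same pattern on a small synthetic problem. The data from `make_classification`, the 240/60 row split and the reduced grid are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (hypothetical): 300 rows, the last 60 act as validation
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
test_fold = np.append(np.full(240, -1), np.full(60, 0))

pipe_toy = Pipeline([('model', LogisticRegression(class_weight='balanced', random_state=42))])

gs_toy = GridSearchCV(estimator=pipe_toy,
                      param_grid={'model__C': [0.1, 1, 10]},
                      scoring='f1_macro',
                      cv=PredefinedSplit(test_fold),
                      return_train_score=True)
gs_toy.fit(X_toy, y_toy)

# One entry per candidate: the validation score on the single predefined fold
print(gs_toy.cv_results_['split0_test_score'])

# With a single fold the standard deviation across folds is always zero
print(gs_toy.cv_results_['std_test_score'])

print(gs_toy.best_params_, gs_toy.best_score_)
```

Two consequences follow from using a single fold. First, `std_test_score` is zero for every candidate, so sorting by it only breaks ties in name. Second, with the default `refit=True`, `best_estimator_` is refitted on all rows passed to `fit`, that is, on the training and validation rows together.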

# Model Selection

In [None]:
# Get the best_score, best_params and best_estimator obtained by GridSearchCV
best_score_gs, best_params_gs, best_estimator_gs = best_score_params_estimator_gs[0]

# Generating the submission file
Use the best model selected earlier to generate the submission file for this Kaggle competition.


In [None]:
# Make directory
directory = os.path.dirname(abspath_curr + 'machine_learning_results/shallow_learning_classification_1/result/submission/')
if not os.path.exists(directory):
    os.makedirs(directory)

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)

# Get the prediction on the test data using the best model
y_test_pred = best_estimator_gs.predict(X_test)

# Transform y_test_pred back to the original class
y_test_pred = le.inverse_transform(y_test_pred)

# Get the submission dataframe
df_submit = pd.DataFrame(np.hstack((df_raw_test[['ID_code']], y_test_pred.reshape(-1, 1))),
                         columns=['ID_code', target])

# Generate the submission file
df_submit.to_csv(abspath_curr + 'machine_learning_results/shallow_learning_classification_1/result/submission/submission.csv', index=False)
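Stacking the ID strings and the integer predictions with `np.hstack` produces an object array, so both columns of `df_submit` end up with the `object` dtype. The CSV that gets written is the same either way, but as a minimal sketch of an alternative, building the frame from a dictionary keeps the predictions as integers. The toy frame and predictions below are hypothetical stand-ins for `df_raw_test` and `y_test_pred`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for df_raw_test and y_test_pred
toy_raw_test = pd.DataFrame({'ID_code': ['test_0', 'test_1', 'test_2'],
                             'var_0': [0.1, 0.2, 0.3]})
toy_pred = np.array([0, 1, 0])

# One column per key; each column keeps its own dtype
toy_submit = pd.DataFrame({'ID_code': toy_raw_test['ID_code'].values,
                           'target': toy_pred})

print(toy_submit.dtypes)
print(toy_submit)
```

The rows of the predictions must be in the same order as the rows of the raw test data, which holds in this notebook because the test data is never shuffled or filtered.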