# Blending models
Here we blend the previously trained models and test whether the blend performs better than the individual models.

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import tensorflow as tf
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

2024-03-28 21:04:52.138366: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-28 21:04:52.162938: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
df = pd.read_csv("test_data.csv", sep=";")
df = df.drop("Unnamed: 0", axis=1)
df = df.set_index("id")

In [3]:
country_index = list(df["country"].unique())
platform_index = list(df["creation_platform"].unique())
source_index = list(df["source_pulido"].unique())

In [4]:
df_num_values = df.drop(["country", "creation_platform", "source_pulido"], axis=1)
df_num_values["country_index"] = df["country"].apply(lambda i: country_index.index(i))
df_num_values["creation_platform_index"] = df["creation_platform"].apply(lambda i: platform_index.index(i))
df_num_values["source_pulido_index"] = df["source_pulido"].apply(lambda i: source_index.index(i))
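Each `list.index` call in the cell above scans the list from the start, so the lookup cost grows with the number of categories. A minimal sketch of an equivalent dictionary-based lookup follows. The `build_index` helper and the sample values are hypothetical; the sketch assumes the codes should follow order of first appearance, which is what `unique()` followed by `list.index` produces.

```python
def build_index(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# Hypothetical sample standing in for a categorical column such as "country".
countries = ["BR", "US", "BR", "AR", "US"]

mapping = build_index(countries)
codes = [mapping[c] for c in countries]
print(codes)  # [0, 1, 0, 2, 1]

# Cross-check against the list-based lookup used in the notebook.
unique_in_order = list(dict.fromkeys(countries))
list_codes = [unique_in_order.index(c) for c in countries]
print(codes == list_codes)  # True
```

With pandas, the same dictionary could be applied with `df["country"].map(mapping)`. Either way the codes depend on the row order of the input, so they only match the codes used at training time if those were built from the same data in the same order.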

All datasets were split with the same test_size and random_state, so the train/test split should be identical across all the notebooks.
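The reason a fixed seed gives the same split can be illustrated with a small sketch. This is not how `train_test_split` is implemented internally; it is a simplified stand-in using NumPy's seeded generator, and the `split_indices` helper and the row count are hypothetical.

```python
import numpy as np


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row positions with a seeded generator and cut off a test block."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return order[n_test:], order[:n_test]


# Two independent calls with the same arguments, as in two separate notebooks.
train_a, test_a = split_indices(1000)
train_b, test_b = split_indices(1000)

same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)
```

The guarantee only holds when the input rows are the same and in the same order in every notebook; a different row order or a different number of rows yields a different test set even with the same seed.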

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df_num_values.drop("target", axis=1), df_num_values.target, test_size=0.2, random_state=42)

In [6]:
test_data = lgb.Dataset(X_test, label=y_test)

Loading all models

In [7]:
with open("models/lightgbm_model.pkl", 'rb') as f:
    lgbm_model = pickle.load(f)

In [11]:
with open("models/lightgbm_multiple_train_model.pkl", 'rb') as f:
    lgbm_multiple_train_model = pickle.load(f)

In [8]:
with open("models/lightgbm_feature_selection_model.pkl", 'rb') as f:
    lgbm_feature_selection_model = pickle.load(f)

In [9]:
tf_model = tf.keras.models.load_model('models/binary_classification_model/')

2024-03-28 21:08:57.766167: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-28 21:08:57.786681: I tensorflow/compile

Calculating the average of all models' predictions

In [24]:
y_pred = (
    lgbm_model.predict(X_test) +
    lgbm_multiple_train_model.predict(X_test) +
    lgbm_feature_selection_model.predict(X_test) +
    tf_model.predict(X_test).reshape(-1)
) / 4
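The same unweighted average can be written as a small helper that also accepts optional weights. This is a sketch under assumptions: the `blend` function and the toy prediction arrays are hypothetical, and the weighted variant is an extension that the notebook does not use.

```python
import numpy as np


def blend(predictions, weights=None):
    """Average several prediction vectors element-wise.

    Each entry is flattened first, so a column vector of shape (n, 1),
    like the Keras output above, is handled the same way as a 1-D array.
    """
    stacked = np.vstack([np.asarray(p, dtype=float).reshape(-1) for p in predictions])
    return np.average(stacked, axis=0, weights=weights)


# Toy predictions for three samples from three hypothetical models.
p_a = np.array([0.10, 0.80, 0.40])
p_b = np.array([0.30, 0.60, 0.20])
p_c = np.array([[0.20], [0.70], [0.60]])  # column vector

simple = blend([p_a, p_b, p_c])
weighted = blend([p_a, p_b, p_c], weights=[2, 1, 1])
print(simple)
print(weighted)
```

With `weights=None`, `np.average` reduces to the plain mean, which matches summing the predictions and dividing by the number of models.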



Simple accuracy test

In [26]:
target_0_preds = y_pred[y_test == 0] <= 0.5
target_1_preds = y_pred[y_test == 1] > 0.5

In [27]:
unique, counts = np.unique(target_0_preds, return_counts=True)
print(unique, counts)
print("Accuracy for negative targets:", counts[1]/counts.sum())

[False  True] [   607 107141]
Accuracy for negative targets: 0.994366484760738


In [28]:
unique, counts = np.unique(target_1_preds, return_counts=True)
print(unique, counts)
print("Accuracy for positive targets:", counts[1]/counts.sum())

[False  True] [8377  943]
Accuracy for positive targets: 0.10118025751072961
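The `counts[1]` lookup in the cells above relies on both `False` and `True` being present: if every prediction for a class is correct, `np.unique` returns a single value and `counts[1]` raises an `IndexError`. A minimal sketch of an equivalent computation that takes the mean of the boolean array instead is shown below; the `per_class_accuracy` helper and the toy data are hypothetical.

```python
import numpy as np


def per_class_accuracy(y_true, y_score, threshold=0.5):
    """Return (accuracy on negatives, accuracy on positives) at a threshold.

    These are the true negative rate (specificity) and the
    true positive rate (recall) respectively.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    negatives_correct = y_score[y_true == 0] <= threshold
    positives_correct = y_score[y_true == 1] > threshold
    return negatives_correct.mean(), positives_correct.mean()


# Toy data: every negative is below 0.5, so np.unique would return only [True].
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.2, 0.7])

neg_acc, pos_acc = per_class_accuracy(y_true, y_score)
print(neg_acc, pos_acc)
```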


Trying to discover the best threshold to improve accuracy

In [29]:
np.percentile(y_pred[y_test == 0], [0, 30, 50, 70, 100])

array([0.00109405, 0.00578518, 0.01667014, 0.05515303, 0.8257726 ])

In [30]:
np.percentile(y_pred[y_test == 1], [0, 30, 50, 70, 100])

array([0.0010975 , 0.15564362, 0.26130802, 0.37139149, 0.84314669])
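The two percentile arrays above show that 70% of the negatives score at or below about 0.055, while 70% of the positives score at or above about 0.156, so any threshold between those two values classifies at least 70% of each class correctly. Instead of reading the threshold off the percentiles, one could sweep a grid of candidates. The sketch below does that on synthetic scores; the data, the `sweep_thresholds` helper, and the choice of the mean per-class accuracy as the selection criterion are all assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic, imbalanced scores standing in for y_test / y_pred.
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
y_score = np.concatenate([rng.beta(1, 12, size=900), rng.beta(3, 8, size=100)])


def sweep_thresholds(y_true, y_score, thresholds):
    """Return (threshold, neg_acc, pos_acc, mean of both) for each candidate."""
    rows = []
    for t in thresholds:
        neg_acc = np.mean(y_score[y_true == 0] <= t)
        pos_acc = np.mean(y_score[y_true == 1] > t)
        rows.append((t, neg_acc, pos_acc, (neg_acc + pos_acc) / 2))
    return rows


rows = sweep_thresholds(y_true, y_score, np.linspace(0.05, 0.5, 10))

# Pick the candidate with the highest mean per-class accuracy.
best = max(rows, key=lambda r: r[3])
print("threshold=%.2f neg=%.3f pos=%.3f mean=%.3f" % best)
```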

In [31]:
threshold = 0.1
target_0_preds = y_pred[y_test == 0] <= threshold
target_1_preds = y_pred[y_test == 1] > threshold

In [32]:
unique, counts = np.unique(target_0_preds, return_counts=True)
print(unique, counts)
print("Accuracy for negative targets:", counts[1]/counts.sum())

[False  True] [21595 86153]
Accuracy for negative targets: 0.7995786464713962


In [33]:
unique, counts = np.unique(target_1_preds, return_counts=True)
print(unique, counts)
print("Accuracy for positive targets:", counts[1]/counts.sum())

[False  True] [1821 7499]
Accuracy for positive targets: 0.8046137339055794
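To put the two thresholds side by side, the counts printed in the outputs above can be combined into overall accuracy and the mean of the two per-class accuracies. The sketch below only redoes arithmetic on those printed counts; the `summarize` helper is hypothetical.

```python
# Counts copied from the outputs above, as (TN, FP, FN, TP):
# "True" for negatives is a true negative, "True" for positives is a true positive.
counts_by_threshold = {
    0.5: (107141, 607, 8377, 943),
    0.1: (86153, 21595, 1821, 7499),
}


def summarize(tn, fp, fn, tp):
    """Return (neg_acc, pos_acc, overall accuracy, mean per-class accuracy)."""
    neg_acc = tn / (tn + fp)
    pos_acc = tp / (tp + fn)
    overall = (tn + tp) / (tn + fp + fn + tp)
    balanced = (neg_acc + pos_acc) / 2
    return neg_acc, pos_acc, overall, balanced


summary = {t: summarize(*c) for t, c in counts_by_threshold.items()}
for t, (neg_acc, pos_acc, overall, balanced) in summary.items():
    print(f"threshold={t}: neg={neg_acc:.4f} pos={pos_acc:.4f} "
          f"overall={overall:.4f} mean={balanced:.4f}")
```

Lowering the threshold from 0.5 to 0.1 reduces overall accuracy from about 0.92 to about 0.80, while the mean of the per-class accuracies rises from about 0.55 to about 0.80, because the negative class is far larger than the positive one.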


After all tests, the best accuracy achieved for both targets at once was close to 0.8 (with a threshold of 0.1).

This is almost the same result as for every individual model.