# LightGBM model
In this notebook, we will create a simple LightGBM model and evaluate its accuracy.

First, let's import the libraries and load the data.

In [2]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
df = pd.read_csv("test_data.csv", sep=";")
df = df.drop("Unnamed: 0", axis=1)
df = df.set_index("id")

In [4]:
country_index = list(df["country"].unique())
platform_index = list(df["creation_platform"].unique())
source_index = list(df["source_pulido"].unique())

In [5]:
df_num_values = df.drop(["country", "creation_platform", "source_pulido"], axis=1)
df_num_values["country_index"] = df["country"].apply(lambda i: country_index.index(i))
df_num_values["creation_platform_index"] = df["creation_platform"].apply(lambda i: platform_index.index(i))
df_num_values["source_pulido_index"] = df["source_pulido"].apply(lambda i: source_index.index(i))
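
The encoding above looks up each value's position in the list of unique values, which rescans that list for every row. As a possible alternative, `pd.factorize` assigns integer codes in order of first appearance in a single pass. The sketch below uses a tiny made-up frame (only the column name `country` comes from this notebook) and assumes there are no missing values, since `pd.factorize` marks missing values with `-1` by default.

```python
import pandas as pd

# Tiny made-up frame; only the column name mirrors the notebook.
toy = pd.DataFrame({"country": ["BR", "US", "BR", "AR", "US"]})

# Approach used in the notebook: position of each value in the unique list.
country_index = list(toy["country"].unique())
codes_lookup = toy["country"].apply(lambda value: country_index.index(value))

# pd.factorize numbers the values in order of first appearance.
codes_factorize, uniques = pd.factorize(toy["country"])

print(list(codes_lookup))
print(codes_factorize.tolist())
print(list(uniques))
```

Both approaches give the same codes on this toy input, and `uniques` plays the role of `country_index` when you need to map a code back to its label.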

Now we are going to split the data into train and test datasets and load them into LightGBM `Dataset` objects.

The separation will be:
 - 80% for training
 - 20% for testing

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df_num_values.drop("target", axis=1), df_num_values.target, test_size=0.2, random_state=42)
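
Because the target is imbalanced, a plain random split can leave the train and test sets with slightly different shares of positives. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below runs on synthetic data with about 8% positives (all names ending in `_toy` are illustrative).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 8% positives. Values are illustrative only.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 80 + [0] * 920)

# stratify=y_toy asks for the same class ratio in train and test.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive share in train:", y_train_toy.mean())
print("positive share in test: ", y_test_toy.mean())
```

In the notebook this would amount to passing `stratify=df_num_values.target` to the existing call, at the cost of producing a different split than the one reported here.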

In [8]:
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test)

Now we are going to define the training parameters ([check the documentation](https://lightgbm.readthedocs.io/en/latest/Parameters.html)):

- The objective will be a binary classification, with a `binary_error` metric.
- For boosting type, we'll use the traditional Gradient Boosting Decision Tree, `gbdt`
- The number of leaves will be `31`
- Our learning rate will be `0.05`
- We will also define some other parameters


In [9]:
params = {
    'objective': 'binary',
    'metric': 'binary_error',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

Now, we will train the model for `100` rounds

In [10]:
num_round = 100
bst = lgb.train(params, train_data, num_round, valid_sets=[test_data])

After training, it's time to check the accuracy

In [11]:
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)
y_pred_binary = [1 if pred > 0.5 else 0 for pred in y_pred]
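
The list comprehension above applies the 0.5 cut-off one prediction at a time. Since `predict` returns a NumPy array, the same step can be written as one vectorized comparison. A minimal sketch with made-up scores (`y_pred_example` is illustrative, not data from this notebook):

```python
import numpy as np

# Made-up predicted probabilities, for illustration only.
y_pred_example = np.array([0.02, 0.61, 0.50, 0.97, 0.30])

# Element-by-element version, as in the cell above.
from_list = [1 if pred > 0.5 else 0 for pred in y_pred_example]

# Vectorized version: compare the whole array, then cast booleans to 0/1.
vectorized = (y_pred_example > 0.5).astype(int)

print(from_list)
print(vectorized.tolist())
```

Note that a score of exactly 0.5 is classified as 0 in both versions, because the comparison is strict.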

In [12]:
accuracy = accuracy_score(y_test, y_pred_binary)
print("Accuracy:", accuracy)

Accuracy: 0.9229336795708477


The accuracy seems pretty high. But because our data is imbalanced, let's check the accuracy per class

In [13]:
target_0_preds = y_pred[y_test == 0] <= 0.5
target_1_preds = y_pred[y_test == 1] > 0.5

In [16]:
unique, counts = np.unique(target_0_preds, return_counts=True)
print(unique, counts)
print("Accuracy for negative targets:", counts[1]/counts.sum())

[False  True] [   691 107057]
Accuracy for negative targets: 0.9935868879236738


In [17]:
unique, counts = np.unique(target_1_preds, return_counts=True)
print(unique, counts)
print("Accuracy for positive targets:", counts[1]/counts.sum())

[False  True] [8331  989]
Accuracy for positive targets: 0.10611587982832618
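
To see how these per-class numbers relate to the overall accuracy, the arithmetic below uses only the counts printed above. The overall accuracy is the average of the per-class accuracies weighted by class size. For comparison, it also computes the accuracy of a hypothetical baseline that always predicts the negative class.

```python
# Counts copied from the outputs printed above.
neg_wrong, neg_right = 691, 107057
pos_wrong, pos_right = 8331, 989

n_neg = neg_wrong + neg_right
n_pos = pos_wrong + pos_right
n_total = n_neg + n_pos

acc_neg = neg_right / n_neg
acc_pos = pos_right / n_pos

# Overall accuracy as the class-size-weighted average of both accuracies.
overall = (n_neg * acc_neg + n_pos * acc_pos) / n_total

# Hypothetical baseline: predict 0 for every row.
always_negative = n_neg / n_total

print("Overall accuracy:", overall)
print("Always-negative baseline:", always_negative)
```

The weighted average reproduces the 0.9229 reported earlier, while the always-negative baseline would already reach roughly 0.92. So at the 0.5 threshold the overall accuracy is driven almost entirely by the majority class.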


The accuracy for negative targets was really high (about 0.99). But for positive targets, it was awful (about 0.11).

Maybe we can fix this by choosing a different threshold for the classification.

So, let's print the percentiles of the predicted scores for each possible target, 0 and 1.

In [18]:
np.percentile(y_pred[y_test == 0], [0, 30, 50, 70, 100])

array([0.0016176 , 0.0054814 , 0.0162957 , 0.05472345, 0.78071773])

In [19]:
np.percentile(y_pred[y_test == 1], [0, 30, 50, 70, 100])

array([0.0016337 , 0.15414669, 0.25907854, 0.37273034, 0.77300839])

Now we know that 70% of the negative targets have a predicted score of about 0.05 or less, while 70% of the positive targets score above about 0.15.

We can try some thresholds between these values to find a good accuracy for both targets.

In [20]:
target_0_preds = y_pred[y_test == 0] <= 0.1
target_1_preds = y_pred[y_test == 1] > 0.1

In [22]:
unique, counts = np.unique(target_0_preds, return_counts=True)
print(unique, counts)
print("Accuracy for negative targets:", counts[1]/counts.sum())

[False  True] [21440 86308]
Accuracy for negative targets: 0.8010171882540743


In [23]:
unique, counts = np.unique(target_1_preds, return_counts=True)
print(unique, counts)
print("Accuracy for positive targets:", counts[1]/counts.sum())

[False  True] [1837 7483]
Accuracy for positive targets: 0.8028969957081545


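Great! With a threshold of 0.1, the accuracy is about 0.8 for both targets: the accuracy for positive targets rose from about 0.11 to 0.80, while the accuracy for negative targets dropped from about 0.99 to 0.80.

Let's save this model as a pickle so we can use it later.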
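
The manual check at a single threshold can be turned into a small sweep over several candidates. The sketch below is one way to do it: `per_class_accuracy` is a hypothetical helper (not part of this notebook or of any library), and the labels and scores are made up. It uses the mean of a boolean mask instead of `counts[1]/counts.sum()`, so it also works when every prediction in a class is right or wrong.

```python
import numpy as np

def per_class_accuracy(y_true, y_score, threshold):
    """Return (negative accuracy, positive accuracy) at a given threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Share of true negatives scored at or below the threshold.
    neg_acc = np.mean(y_score[y_true == 0] <= threshold)
    # Share of true positives scored above the threshold.
    pos_acc = np.mean(y_score[y_true == 1] > threshold)
    return float(neg_acc), float(pos_acc)

# Made-up labels and scores, for illustration only.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score_toy = np.array([0.01, 0.02, 0.04, 0.08, 0.12, 0.30, 0.05, 0.15, 0.26, 0.60])

for threshold in (0.05, 0.1, 0.2, 0.5):
    neg_acc, pos_acc = per_class_accuracy(y_true_toy, y_score_toy, threshold)
    print(f"threshold={threshold:.2f}  negative={neg_acc:.2f}  positive={pos_acc:.2f}")
```

With the notebook's data, the call would be `per_class_accuracy(y_test, y_pred, threshold)`. A threshold picked on the same data used for the final evaluation tends to look better than it will on new data, so a separate validation split may be a safer place to choose it.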

In [24]:
import os
import pickle
os.makedirs('models', exist_ok=True)  # make sure the output directory exists
with open('models/lightgbm_model.pkl', 'wb') as f:
    pickle.dump(bst, f)
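
Since the model is only useful together with the 0.1 threshold chosen above, one option is to pickle both in a single object and load them back together. The sketch below is an assumption about how that could look, not what this notebook does: it uses a placeholder dictionary instead of the trained `bst`, a hypothetical `bundle` layout, and a temporary directory so it runs anywhere.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the trained booster (`bst` in the notebook),
# so the sketch runs without LightGBM installed.
model_stand_in = {"name": "placeholder model"}

# Hypothetical bundle layout: keep the chosen threshold next to the model.
bundle = {"model": model_stand_in, "threshold": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lightgbm_model.pkl")

    # Save the bundle.
    with open(path, "wb") as f:
        pickle.dump(bundle, f)

    # Load it back, as a later script would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print("Restored threshold:", restored["threshold"])
```

After loading, predictions would be classified with `restored["model"].predict(X) > restored["threshold"]`. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.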