# Feature Selection
In this notebook, we will explore whether we can train a model using fewer features.

As always, let's start with the imports, then load and transform the data.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [2]:
df = pd.read_csv("test_data.csv", sep=";")

In [3]:
df = df.drop("Unnamed: 0", axis=1)

In [4]:
df = df.set_index("id")

In [5]:
country_index = list(df["country"].unique())
platform_index = list(df["creation_platform"].unique())
source_index = list(df["source_pulido"].unique())

In [6]:
df_num_values = df.drop(["country", "creation_platform", "source_pulido"], axis=1)
df_num_values["country_index"] = df["country"].apply(lambda i: country_index.index(i))
df_num_values["creation_platform_index"] = df["creation_platform"].apply(lambda i: platform_index.index(i))
df_num_values["source_pulido_index"] = df["source_pulido"].apply(lambda i: source_index.index(i))
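The encoding above looks up each value's position in a Python list, which rescans the list for every row. As a possible alternative, `pd.factorize` produces the same first-appearance codes in one pass. This is a minimal sketch on a hypothetical toy frame (only the column name mirrors the notebook), and it assumes the column has no missing values, since `pd.factorize` encodes those as `-1`.

```python
import pandas as pd

# Hypothetical toy frame; only the column name mirrors the notebook.
toy = pd.DataFrame({"country": ["ES", "MX", "ES", "AR", "MX"]})

# Notebook approach: position of each value in the list of unique values.
country_index = list(toy["country"].unique())
codes_lookup = toy["country"].apply(lambda v: country_index.index(v))

# pd.factorize assigns integer codes in order of first appearance.
codes_factorized, uniques = pd.factorize(toy["country"])

print(codes_lookup.tolist())
print(codes_factorized.tolist())
print(list(uniques))
```

Both approaches yield the same codes here, so the swap should not change downstream results when the column is complete.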

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df_num_values.drop("target", axis=1), df_num_values.target, test_size=0.2, random_state=42)

In [8]:
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test)

Here's where the fun begins.

We'll use `SelectFromModel` from `sklearn` to keep only the most important features of our dataset.

In [9]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
feat_selector = SelectFromModel(rf_clf)
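It helps to know what "most important" means here. With no `threshold` argument and a tree-based estimator, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below illustrates that on synthetic data (an assumption; it is not the notebook's dataset) by comparing the selector's mask with a manual comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=42
)

selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))
selector.fit(X, y)

# Importances of the forest fitted inside the selector.
importances = selector.estimator_.feature_importances_
print("threshold used:", selector.threshold_)
print("mean importance:", importances.mean())
print("kept column positions:", np.flatnonzero(selector.get_support()))

# The selector's mask matches a manual comparison against the threshold.
manual_mask = importances >= selector.threshold_
```

If the default cut-off keeps too many or too few columns, the `threshold` and `max_features` parameters of `SelectFromModel` can be set explicitly.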

In [10]:
feat_selector.fit(X_train, y_train)

In [11]:
X_train_selected = feat_selector.transform(X_train)
X_test_selected = feat_selector.transform(X_test)

In [12]:
X_train.columns[feat_selector.get_support()]

Index(['creation_weekday', 'creation_hour', 'total_product_categories',
       'total_events_on_Web', 'source_pulido_index'],
      dtype='object')

After running the code, we discover that the most important features are:
 - creation_weekday
 - creation_hour
 - total_product_categories
 - total_events_on_Web
 - source_pulido_index
 
Now we can train a model using only these features.
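One thing to watch: `train_data` and `test_data` were built in cell 8 from the full `X_train` and `X_test`, and cell 15 predicts on the full `X_test`. For the model to see only the selected columns, the datasets would need to be rebuilt from the reduced frames. A minimal sketch of that step follows, on synthetic data with hypothetical column names; it indexes by column name so the reduced frames stay labelled.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (hypothetical column names).
X_arr, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

feat_selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))
feat_selector.fit(X_train, y_train)

# Boolean mask -> column names, so the reduced frames keep their labels.
selected_cols = X_train.columns[feat_selector.get_support()]
X_train_sel = X_train[selected_cols]
X_test_sel = X_test[selected_cols]

print(list(selected_cols))
print(X_train_sel.shape, X_test_sel.shape)
# These frames, rather than the full X_train / X_test, would then be wrapped,
# e.g. lgb.Dataset(X_train_sel, label=y_train), and X_test_sel used to predict.
```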

In [13]:
params = {
    'objective': 'binary',
    'metric': 'binary_error',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

In [14]:
num_round = 100
bst = lgb.train(params, train_data, num_round, valid_sets=[test_data])

In [15]:
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)
y_pred_binary = (y_pred > 0.5).astype(int)

In [16]:
accuracy = accuracy_score(y_test, y_pred_binary)
print("Accuracy with feature selection:", accuracy)

Accuracy with feature selection: 0.9229336795708477
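To put that number in context, it can be compared with the accuracy of always predicting the majority class. The quick check below uses the test-set counts printed in cells 32 and 33 of this notebook; nothing else is assumed.

```python
# Test-set counts taken from the outputs of cells 32 and 33 in this notebook.
negatives = 21440 + 86308  # rows with target == 0
positives = 1837 + 7483    # rows with target == 1

# Accuracy of a trivial model that predicts 0 for every row.
majority_baseline = negatives / (negatives + positives)
print(f"Always predicting 0 would score: {majority_baseline:.4f}")
```

The baseline comes out at roughly 0.92, so the overall accuracy at a 0.5 cut-off says little about how well the positive class is handled.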


The model has good overall accuracy, but the classes are imbalanced, so as always let's look for a threshold that maintains good accuracy on both the positive and the negative targets.

In [17]:
np.percentile(y_pred[y_test == 0], [0, 30, 50, 70, 100])

array([0.0016176 , 0.0054814 , 0.0162957 , 0.05472345, 0.78071773])

In [18]:
np.percentile(y_pred[y_test == 1], [0, 30, 50, 70, 100])

array([0.0016337 , 0.15414669, 0.25907854, 0.37273034, 0.77300839])

In [31]:
threshold = 0.1
target_0_preds = y_pred[y_test == 0] <= threshold
target_1_preds = y_pred[y_test == 1] > threshold

In [32]:
unique, counts = np.unique(target_0_preds, return_counts=True)
print(unique, counts)
print("Accuracy:", counts[1]/counts.sum())

[False  True] [21440 86308]
Accuracy: 0.8010171882540743


In [33]:
unique, counts = np.unique(target_1_preds, return_counts=True)
print(unique, counts)
print("Accuracy:", counts[1]/counts.sum())

[False  True] [1837 7483]
Accuracy: 0.8028969957081545


And we've reached almost the same result as always: about 0.8 accuracy on each class.
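The threshold above was picked by looking at the percentiles. The same search could be sketched as a sweep that picks the candidate where the two per-class accuracies are closest. The scores below are synthetic (an assumption), and `per_class_accuracy` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic scores (an assumption): negatives skew low, positives sit higher.
y_true = np.concatenate([np.zeros(9000, dtype=int), np.ones(1000, dtype=int)])
scores = np.concatenate([rng.beta(1, 12, 9000), rng.beta(3, 6, 1000)])


def per_class_accuracy(y_true, scores, threshold):
    """Share of negatives scored <= threshold and of positives scored > threshold."""
    acc_0 = np.mean(scores[y_true == 0] <= threshold)
    acc_1 = np.mean(scores[y_true == 1] > threshold)
    return acc_0, acc_1


# Evaluate a grid of thresholds and keep the one with the smallest gap.
candidates = np.linspace(0.01, 0.99, 99)
gaps = [
    abs(acc_0 - acc_1)
    for acc_0, acc_1 in (per_class_accuracy(y_true, scores, t) for t in candidates)
]
best = candidates[int(np.argmin(gaps))]
print(best, per_class_accuracy(y_true, scores, best))
```

Taking the mean of the boolean array also avoids relying on `counts[1]`, which would fail if every prediction for a class fell on the same side of the threshold.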

Let's save the model.

In [34]:
import pickle
with open('models/lightgbm_feature_selection_model.pkl', 'wb') as f:
    pickle.dump(bst, f)