# "Click Prediction Small" Dataset

In [1]:
from scipy.io import arff
import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id = 41434, parser = 'auto')

# The returned dataset is a Bunch object, similar to a dictionary
X = data['data']
y = data['target']

In [3]:
# Summary vectors creation

default_summary  = []
encoder_summary  = []
value_summary    = []
time_summary     = []
n_models_summary = []
card_9_summary   = []

#### Description
This data is derived from the 2012 KDD Cup. The data is subsampled to 0.1% of the original number of instances, downsampling the majority class (click=0) so that the target feature is reasonably balanced (5 to 1).

The data is about advertisements shown alongside search results in a search engine, and whether or not people clicked on these ads. The task is to build the best possible model to predict whether a user will click on a given ad.

A search session contains information on user id, the query issued by the user, ads displayed to the user, and target feature indicating whether a user clicked at least one of the ads in this session. The number of ads displayed to a user in a session is called ‘depth’. The order of an ad in the displayed list is called ‘position’. An ad is displayed as a short text called ‘title’, followed by a slightly longer text called ‘description’, and a URL called ‘display URL’.
To construct this dataset each session was split into multiple instances. Each instance describes an ad displayed under a certain setting (‘depth’, ‘position’). Instances with the same user id, ad id, query, and setting are merged. Each ad and each user have some additional properties located in separate data files that can be looked up using ids in the instances.

#### Attributes Information
- Click – [target] binary variable indicating whether a user clicked on at least one ad.
- Impression : the number of search sessions in which AdID was impressed by UserID who issued Query.
- Url_hash : URL is hashed for anonymity
- AdID
- AdvertiserID : some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others’ ads.
- Depth : number of ads displayed to a user in a session
- Position : order of an ad in the displayed list
- QueryID : is the key of the data file 'queryid_tokensid.txt'. (follow the link to the original KDD Cup page, track 2)
- KeywordID : is the key of 'purchasedkeyword_tokensid.txt' (follow the link to the original KDD Cup page, track 2)
- TitleID : is the key of 'titleid_tokensid.txt'
- DescriptionID : is the key of 'descriptionid_tokensid.txt' (follow the link to the original KDD Cup page, track 2)
- UserID : is also the key of 'userid_profile.txt' (follow the link to the original KDD Cup page, track 2). 0 is a special value denoting that the user could not be identified.

Convert all columns to object type.

In [4]:
X.head()

| | impression | url_hash | ad_id | advertiser_id | depth | position | query_id | keyword_id | title_id | description_id | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.071003e+19 | 8343295 | 11700 | 3 | 3 | 7702266 | 21264 | 27892 | 1559 | 0 |
| 1 | 1 | 1.736385e+19 | 20017077 | 23798 | 1 | 1 | 93079 | 35498 | 4 | 36476 | 562934 |
| 2 | 1 | 8.915473e+18 | 21348354 | 36654 | 1 | 1 | 10981 | 19975 | 36105 | 33292 | 11621116 |
| 3 | 1 | 4.426693e+18 | 20366086 | 33280 | 3 | 3 | 0 | 5942 | 4057 | 4390 | 8778348 |
| 4 | 1 | 1.15726e+19 | 6803526 | 10790 | 2 | 1 | 9881978 | 60593 | 25242 | 1679 | 12118311 |


In [5]:
X.dtypes

impression           int64
url_hash           float64
ad_id             category
advertiser_id     category
depth                int64
position             int64
query_id             int64
keyword_id        category
title_id          category
description_id    category
user_id           category
dtype: object

In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39948 entries, 0 to 39947
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   impression      39948 non-null  int64   
 1   url_hash        39948 non-null  float64 
 2   ad_id           39948 non-null  category
 3   advertiser_id   39948 non-null  category
 4   depth           39948 non-null  int64   
 5   position        39948 non-null  int64   
 6   query_id        39948 non-null  int64   
 7   keyword_id      39948 non-null  category
 8   title_id        39948 non-null  category
 9   description_id  39948 non-null  category
 10  user_id         39948 non-null  category
dtypes: category(6), float64(1), int64(4)
memory usage: 6.6 MB


In [7]:
X.shape

(39948, 11)

In [8]:
X.describe()

| | impression | url_hash | depth | position | query_id |
|---|---|---|---|---|---|
| count | 39948.0 | 39948.0 | 39948.0 | 39948.0 | 39948.0 |
| mean | 2.100205 | 9.64135e+18 | 1.960023 | 1.463853 | 3142146.0 |
| std | 65.867383 | 4.986705e+18 | 0.715407 | 0.631545 | 5841540.0 |
| min | 1.0 | 482436900000000.0 | 1.0 | 1.0 | 0.0 |
| 25% | 1.0 | 5.468728e+18 | 1.0 | 1.0 | 2364.25 |
| 50% | 1.0 | 1.034947e+19 | 2.0 | 1.0 | 112836.5 |
| 75% | 1.0 | 1.434039e+19 | 2.0 | 2.0 | 3147909.0 |
| max | 11820.0 | 1.844094e+19 | 3.0 | 3.0 | 26240100.0 |


We check for duplicate rows.

In [9]:
X.duplicated().sum()

22

## Study of NA's

In [10]:
X.isna().sum().sort_values(ascending = False)

impression        0
url_hash          0
ad_id             0
advertiser_id     0
depth             0
position          0
query_id          0
keyword_id        0
title_id          0
description_id    0
user_id           0
dtype: int64

As can be seen, none of the variables contain np.nan. However, this does not rule out missing values encoded with placeholder strings, so we check for some common ones.

In [11]:
(X=='?').any().sum()

0

In [12]:
(X=='NA').any().sum()

0

In [13]:
(X=='Not_Say').any().sum()

0
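The three placeholder checks above can also be run in a single pass. The following is a minimal sketch on a toy DataFrame; the names `toy`, `placeholders`, and `placeholder_counts` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for X: one categorical, one object, one numeric column.
toy = pd.DataFrame({
    "a": pd.Categorical(["1", "?", "3"]),
    "b": ["x", "NA", "NA"],
    "c": [1, 2, 3],
})

# Placeholder strings checked in the notebook.
placeholders = ["?", "NA", "Not_Say"]

# DataFrame.isin marks every cell equal to any placeholder;
# summing per column counts the hits.
placeholder_counts = toy.isin(placeholders).sum()
print(placeholder_counts)
```

Applied to `X`, a result of all zeros would match the three separate checks.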

## Type of Variables

In [14]:
len(X.select_dtypes(include=['category', 'object']).columns)

6

In [15]:
len(X.select_dtypes(include=['float64','int']).columns)

5

General review of the values of all variables.

In [16]:
X.select_dtypes(include=['category']).apply(lambda col: col.nunique()).sort_values(ascending=False)

user_id           30114
title_id          25321
description_id    22381
keyword_id        19803
ad_id             19228
advertiser_id      6064
dtype: int64

In [17]:
X.select_dtypes(include=['number']).apply(lambda col: col.nunique()).sort_values(ascending=False)

query_id      30748
url_hash       6941
impression       99
depth             3
position          3
dtype: int64

According to the information on OpenML, **query_id** and **url_hash** do not provide relevant information about the response variable, so both are dropped.

In [18]:
X = X.drop(['query_id','url_hash'], axis=1)

## Value counts of the variables with the highest cardinality

#### user_id

In [19]:
X.user_id.value_counts()[0:10]

user_id
0      9633
2         6
187       4
154       3
124       3
125       3
61        3
52        3
229       3
56        3
Name: count, dtype: int64

#### title_id

In [20]:
X.title_id.value_counts()[0:10]

title_id
0    355
4    168
2    167
1    152
3    135
5    126
7    114
8    114
9    113
6    111
Name: count, dtype: int64

#### description_id

In [21]:
X.description_id.value_counts()[0:10]

description_id
0    355
1    190
5    167
2    159
4    154
3    152
6    146
9    143
7    135
8    129
Name: count, dtype: int64

## Response variable distribution

One problem we encountered is that, when repeated entries are eliminated, they are all classified as allowed, which further reduces the number of positive observations.

In [22]:
y.value_counts()

click
0    33220
1     6728
Name: count, dtype: int64

In [23]:
y.value_counts(normalize=True)

click
0    0.831581
1    0.168419
Name: proportion, dtype: float64
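With roughly 83% of the observations in class 0, plain accuracy rewards a model that never predicts a click. The sketch below rebuilds a label vector with the class counts shown above and scores a hypothetical majority-class predictor (`y_majority` is an illustrative name) to show why balanced accuracy is a better fit here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Label vector with the class counts reported by y.value_counts().
y_true = np.array([0] * 33220 + [1] * 6728)

# Hypothetical baseline that always predicts "no click".
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)               # share of class 0
bal_acc = balanced_accuracy_score(y_true, y_majority)  # mean of per-class recall

print(f"accuracy: {acc:.6f}, balanced accuracy: {bal_acc:.6f}")
```

Balanced accuracy averages the recall of each class, so this baseline scores 1.0 on class 0 and 0.0 on class 1.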

## Train-Test Split

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.33, 
                                                    random_state = 42,
                                                    stratify = y)

## Pipelines (Encoding in all variables)

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingClassifier

from sklearn.metrics import balanced_accuracy_score

import scipy.stats
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV


import time

In [27]:
num_cols = X_train.select_dtypes(include=['number']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['category']).columns.to_list()

In [28]:
len(num_cols)
num_cols

['impression', 'depth', 'position']

In [29]:
# Define the HistGradientBoostingClassifier models
hgb_default = HistGradientBoostingClassifier(max_iter=1000, random_state=1234,
                                             early_stopping=True,
                                             scoring='balanced_accuracy',
                                             validation_fraction=0.1,
                                             n_iter_no_change=5,
                                             class_weight='balanced')

# Define the hyperparameter search space
param_distributions = {
    'model__learning_rate': scipy.stats.uniform(0.01, 0.3),
    'model__min_samples_leaf': scipy.stats.randint(1, 10),
}

# Create a StratifiedKFold cross-validation instance
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)
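Note that `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, and `scipy.stats.randint(low, high)` excludes `high`. A short check of the bounds the search space above actually covers (the variable names here are illustrative only):

```python
import scipy.stats

lr_dist = scipy.stats.uniform(0.01, 0.3)   # loc=0.01, scale=0.3
leaf_dist = scipy.stats.randint(1, 10)     # integers from 1 up to, not including, 10

# support() returns the lower and upper end of each distribution.
lr_low, lr_high = lr_dist.support()
leaf_low, leaf_high = leaf_dist.support()

print(f"learning_rate range: [{lr_low}, {lr_high}]")
print(f"min_samples_leaf range: [{int(leaf_low)}, {int(leaf_high)}]")
```

So the learning rate is drawn from about 0.01 to 0.31, and `min_samples_leaf` from 1 to 9.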

### One Hot Encoding + HistGradientBoosting

#### Preprocessing

In [30]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", OneHotEncoder(drop = "first", handle_unknown = "ignore"))
])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "median")),
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num_col", num_pipeline, num_cols),
                    ("one_hot", cat_pipeline, cat_cols)],
    sparse_threshold=0
)
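Setting `sparse_threshold=0` makes the `ColumnTransformer` return a dense array, which matters with one-hot encoding of high-cardinality columns. The following back-of-the-envelope estimate is a rough upper bound only: it assumes float64 storage, uses the full-dataset unique counts listed earlier (the training split will contain somewhat fewer categories), and assumes `train_test_split` rounds the test size up.

```python
import math

n_rows_total = 39948
n_test = math.ceil(0.33 * n_rows_total)   # assumed rounding of the test split
n_train = n_rows_total - n_test

# Unique values per categorical column, as reported in the notebook.
unique_counts = {
    "user_id": 30114,
    "title_id": 25321,
    "description_id": 22381,
    "keyword_id": 19803,
    "ad_id": 19228,
    "advertiser_id": 6064,
}

n_num_cols = 3  # impression, depth, position
# drop="first" removes one indicator column per categorical variable.
n_onehot_cols = sum(unique_counts.values()) - len(unique_counts)
n_cols = n_num_cols + n_onehot_cols

dense_bytes = n_train * n_cols * 8  # 8 bytes per float64 value
dense_gib = dense_bytes / 2**30

print(f"{n_train} rows x {n_cols} columns ~ {dense_gib:.1f} GiB dense")
```

Under these assumptions the dense training matrix is in the tens of GiB, which is consistent with the long fit time reported for this encoder.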

#### Create a HistGradientBoostingClassifier model with default parameters and early stopping

In [31]:
ohe_hgb_default_pipeline = Pipeline([("preprocessing",preprop_pipeline),
                                     ('model', hgb_default)])

In [32]:
tic = time.time()

ohe_hgb_default = ohe_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
ohe_hgb_default_time_taken = toc-tic

In [33]:
# Display pipeline
print("Time taken: ", ohe_hgb_default_time_taken)
ohe_hgb_default

Time taken:  274.4326241016388


In [34]:
y_ohe_hgb_default_pred = ohe_hgb_default.predict(X_test)
ohe_hgb_default_accuracy = balanced_accuracy_score(y_test, y_ohe_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {ohe_hgb_default_accuracy}')



Balanced accuracy with default parameters: 0.6059510778679802


In [35]:
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("OneHotEncoding")
value_summary.append(ohe_hgb_default_accuracy)
time_summary.append(ohe_hgb_default_time_taken)
n_models_summary.append(1)

#### Create a HistGradientBoostingClassifier model for tuning

In [36]:
ohe_hgb_tune = RandomizedSearchCV(estimator = ohe_hgb_default_pipeline, 
                                  param_distributions = param_distributions, 
                                  n_iter = 75,
                                  cv = stratified_kfold,
                                  scoring = 'balanced_accuracy', 
                                  random_state = 1234,
                                  n_jobs = -1)

In [37]:
tic = time.time()

ohe_hgb_tune = ohe_hgb_tune.fit(X_train, y_train)

toc = time.time()
ohe_hgb_tune_time_taken = toc-tic

223 fits failed out of a total of 225.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
69 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\VNG\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\VNG\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\VNG\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 423, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "c:\Users\VNG\AppData\Local\Programs\Python\Pyt

In [38]:
# Display pipeline
print("Time taken: ", ohe_hgb_tune_time_taken)
ohe_hgb_tune

Time taken:  822.6240494251251


In [39]:
# Get the best parameters
ohe_hgb_tune_best_params = ohe_hgb_tune.best_params_
print(f'Best parameters: {ohe_hgb_tune_best_params}')

# Predict using the model with the best parameters
y_ohe_hgb_tune_pred = ohe_hgb_tune.predict(X_test)
ohe_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_ohe_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {ohe_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("OneHotEncoding")
value_summary.append(ohe_hgb_tune_accuracy)
time_summary.append(ohe_hgb_tune_time_taken)
n_models_summary.append(ohe_hgb_tune.n_iter * ohe_hgb_tune.n_splits_)

Best parameters: {'model__learning_rate': 0.06745583511366768, 'model__min_samples_leaf': 7}




Balanced accuracy with best parameters: 0.6114296614410635


### Count Encoder + HistGradientBoosting


In [40]:
from category_encoders.count import CountEncoder

#### Preprocessing

In [41]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", CountEncoder())
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num", num_pipeline, num_cols),
                    ("count_encoder", cat_pipeline, cat_cols)],
    sparse_threshold=0
)

#### Create a HistGradientBoostingClassifier model with default parameters and early stopping

In [42]:
count_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                       ('model', hgb_default)])

In [43]:
tic = time.time()

count_hgb_default = count_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
count_hgb_default_time_taken = toc-tic

In [44]:
# Display pipeline
print("Time taken:", count_hgb_default_time_taken)
count_hgb_default

Time taken: 1.1700656414031982


In [45]:
# Predict using the model with default parameters
y_count_hgb_default_pred = count_hgb_default.predict(X_test)
count_hgb_default_accuracy = balanced_accuracy_score(y_test, y_count_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {count_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("CountEncoding")
value_summary.append(count_hgb_default_accuracy)
time_summary.append(count_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.6227235467703405


#### Create a HistGradientBoostingClassifier model for tuning

In [46]:
count_hgb_tune = RandomizedSearchCV(estimator = count_hgb_default_pipeline, 
                                    param_distributions = param_distributions, 
                                    n_iter = 200,
                                    cv = stratified_kfold,
                                    scoring = 'balanced_accuracy', 
                                    random_state = 1234,
                                    n_jobs = -1)

In [47]:
tic = time.time()

count_hgb_tune = count_hgb_tune.fit(X_train, y_train)

toc = time.time()
count_hgb_tune_time_taken = toc-tic

In [48]:
# Display pipeline
print("Time taken: ", count_hgb_tune_time_taken)
count_hgb_tune

Time taken:  105.07929420471191


In [49]:
# Predict using the model with the best parameters
y_count_hgb_tune_pred = count_hgb_tune.predict(X_test)

# Get the best parameters
count_hgb_tune_best_params = count_hgb_tune.best_params_
print(f'Best parameters: {count_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
count_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_count_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {count_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("CountEncoding")
value_summary.append(count_hgb_tune_accuracy)
time_summary.append(count_hgb_tune_time_taken)
n_models_summary.append(count_hgb_tune.n_iter * count_hgb_tune.n_splits_)

Best parameters: {'model__learning_rate': 0.18071049381369467, 'model__min_samples_leaf': 5}
Balanced accuracy with best parameters: 0.633659348028134


### Ordinal Encoding + HistGradientBoosting


In [50]:
from sklearn.preprocessing import OrdinalEncoder

#### Preprocessing

In [51]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", OrdinalEncoder(dtype = int,
                               handle_unknown = 'use_encoded_value',
                               unknown_value = 99999,
                               encoded_missing_value = 99999))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num", num_pipeline, num_cols),
                    ("ordinal_encoder", cat_pipeline, cat_cols)],
    sparse_threshold=0
)

#### Create a HistGradientBoostingClassifier model with default parameters and early stopping

In [52]:
ordinal_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                         ('model', hgb_default)])


In [53]:
tic = time.time()

ordinal_hgb_default = ordinal_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
ordinal_hgb_default_time_taken = toc-tic

In [54]:
# Display pipeline
print("Time taken: ", ordinal_hgb_default_time_taken)
ordinal_hgb_default

Time taken:  0.7790176868438721


In [55]:
# Calculate balanced accuracy for the model with default parameters
y_ordinal_hgb_default_pred = ordinal_hgb_default.predict(X_test)
ordinal_hgb_default_accuracy = balanced_accuracy_score(y_test, y_ordinal_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {ordinal_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("OrdinalEncoder")
value_summary.append(ordinal_hgb_default_accuracy)
time_summary.append(ordinal_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.622528767936047


#### Create a HistGradientBoostingClassifier model for tuning

In [56]:
ordinal_hgb_tune = RandomizedSearchCV(estimator = ordinal_hgb_default_pipeline, 
                                      param_distributions = param_distributions, 
                                      n_iter = 100,
                                      cv = stratified_kfold,
                                      scoring = 'balanced_accuracy', 
                                      random_state = 1234,
                                      n_jobs = -1)

In [57]:
tic = time.time()

ordinal_hgb_tune = ordinal_hgb_tune.fit(X_train, y_train)

toc = time.time()
ordinal_hgb_tune_time_taken = toc-tic

In [58]:
# Display pipeline
print("Time taken: ", ordinal_hgb_tune_time_taken)
ordinal_hgb_tune

Time taken:  52.32787561416626


In [59]:
# Predict using the model with the best parameters
y_ordinal_hgb_tune_pred = ordinal_hgb_tune.predict(X_test)

# Get the best parameters
ordinal_hgb_tune_best_params = ordinal_hgb_tune.best_params_
print(f'Best parameters: {ordinal_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
ordinal_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_ordinal_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {ordinal_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("OrdinalEncoder")
value_summary.append(ordinal_hgb_tune_accuracy)
time_summary.append(ordinal_hgb_tune_time_taken)
n_models_summary.append(ordinal_hgb_tune.n_iter * ordinal_hgb_tune.n_splits_)

Best parameters: {'model__learning_rate': 0.014113415036227952, 'model__min_samples_leaf': 5}
Balanced accuracy with best parameters: 0.6148987421244103


### Native HistGradientBoosting support for categorical variables


In [60]:
from sklearn.preprocessing import OrdinalEncoder

#### Preprocessing

In [61]:
category_features_for_nativesupport = [col in cat_cols for col in X.columns]
category_features_for_nativesupport

[False, True, True, False, False, True, True, True, True]
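One caveat: a `ColumnTransformer` concatenates its outputs in the order the transformers are listed, so a pipeline that lists the numeric columns first delivers them first to the model, followed by the categorical ones. A boolean mask built from `X.columns` follows the original column order instead. The sketch below, using the column names of this dataset after dropping `query_id` and `url_hash`, contrasts the two orders; `mask_transformed_order` is a hypothetical name for a mask aligned with the transformed array.

```python
# Column order of X after dropping query_id and url_hash.
x_columns = ["impression", "ad_id", "advertiser_id", "depth", "position",
             "keyword_id", "title_id", "description_id", "user_id"]
num_cols = ["impression", "depth", "position"]
cat_cols = [c for c in x_columns if c not in num_cols]

# Mask in the original column order (what the cell above produces).
mask_input_order = [c in cat_cols for c in x_columns]

# Order of the columns after a ColumnTransformer that lists the
# numeric transformer first and the categorical transformer second.
transformed_columns = num_cols + cat_cols
mask_transformed_order = [c in cat_cols for c in transformed_columns]

print(mask_input_order)
print(mask_transformed_order)
```

If the two masks differ, the one passed to `categorical_features` should match the order the model actually receives.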

In [62]:
hgb_default_categories_support = HistGradientBoostingClassifier(max_iter=1000, random_state=1234,
                                                                early_stopping=True,
                                                                scoring='balanced_accuracy',
                                                                validation_fraction=0.1,
                                                                n_iter_no_change=5,
                                                                categorical_features=category_features_for_nativesupport,
                                                                class_weight='balanced')

In [63]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", OrdinalEncoder(dtype = int,
                               handle_unknown = 'use_encoded_value',
                               unknown_value = 99999,
                               encoded_missing_value = 99999,
                               max_categories = 254))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num_cols", num_pipeline, num_cols),
                    ("cat_cols", cat_pipeline, cat_cols)],
    sparse_threshold=0
)

#### Create a HistGradientBoostingClassifier model with default parameters and early stopping

In [64]:
catsup_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                        ('model', hgb_default_categories_support)])

In [65]:
tic = time.time()

catsup_hgb_default = catsup_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
catsup_hgb_default_time_taken = toc-tic

In [66]:
# Display pipeline
print("Time taken: ", catsup_hgb_default_time_taken)
catsup_hgb_default

Time taken:  1.76055908203125


In [67]:
# Calculate balanced accuracy for the model with default parameters
y_catsup_hgb_default_pred = catsup_hgb_default.predict(X_test)
catsup_hgb_default_accuracy = balanced_accuracy_score(y_test, y_catsup_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {catsup_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("HGB_NativeSupport")
value_summary.append(catsup_hgb_default_accuracy)
time_summary.append(catsup_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.5818498832682906


#### Create a HistGradientBoostingClassifier model for tuning

In [68]:
catsup_hgb_tune = RandomizedSearchCV(estimator = catsup_hgb_default_pipeline, 
                                     param_distributions = param_distributions, 
                                     n_iter = 500,
                                     cv = stratified_kfold,
                                     scoring = 'balanced_accuracy', 
                                     random_state = 1234,
                                     n_jobs = -1)

In [69]:
tic = time.time() 

catsup_hgb_tune = catsup_hgb_tune.fit(X_train, y_train)

toc = time.time()
catsup_hgb_tune_time_taken = toc-tic

In [70]:
# Display pipeline
print("Time taken: ", catsup_hgb_tune_time_taken)
catsup_hgb_tune

Time taken:  320.8845784664154


In [71]:
# Predict using the model with the best parameters
y_catsup_hgb_tune_pred = catsup_hgb_tune.predict(X_test)

# Get the best parameters
catsup_hgb_tune_best_params = catsup_hgb_tune.best_params_
print(f'Best parameters: {catsup_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
catsup_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_catsup_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {catsup_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("HGB_NativeSupport")
value_summary.append(catsup_hgb_tune_accuracy)
time_summary.append(catsup_hgb_tune_time_taken)
n_models_summary.append(catsup_hgb_tune.n_iter * catsup_hgb_tune.n_splits_)

Best parameters: {'model__learning_rate': 0.13149886602979405, 'model__min_samples_leaf': 2}
Balanced accuracy with best parameters: 0.5800741108708818


### Target Encoder + HistGradientBoosting


In [72]:
from sklearn.preprocessing import TargetEncoder

#### Preprocessing

In [73]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", TargetEncoder())
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num_cols", num_pipeline, num_cols),
                    ("target_encoder", cat_pipeline, cat_cols)],
    sparse_threshold=0
)

#### Create a HistGradientBoostingClassifier model with default parameters and early stopping

In [74]:
target_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                        ('model', hgb_default)])

In [75]:
tic = time.time()

target_hgb_default = target_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
target_hgb_default_time_taken = toc-tic

In [76]:
# Display pipeline
print("Time taken: ", target_hgb_default_time_taken)
target_hgb_default

Time taken:  1.452019214630127


In [77]:
# Calculate balanced accuracy for the model with default parameters
y_target_hgb_default_pred = target_hgb_default.predict(X_test)
target_hgb_default_accuracy = balanced_accuracy_score(y_test, y_target_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {target_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("TargetEncoder")
value_summary.append(target_hgb_default_accuracy)
time_summary.append(target_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.6294448649141708


#### Create a HistGradientBoostingClassifier model for tuning

In [78]:
target_hgb_tune = RandomizedSearchCV(estimator = target_hgb_default_pipeline, 
                                    param_distributions = param_distributions, 
                                    n_iter = 200,
                                    cv = stratified_kfold,
                                    scoring = 'balanced_accuracy', 
                                    random_state = 1234,
                                    n_jobs = -1)

In [79]:
tic = time.time() 

target_hgb_tune = target_hgb_tune.fit(X_train, y_train)

toc = time.time()
target_hgb_tune_time_taken = toc-tic

In [80]:
# Display pipeline
print("Time taken: ", target_hgb_tune_time_taken)
target_hgb_tune

Time taken:  112.61414957046509


In [81]:
# Predict using the model with the best parameters
y_target_hgb_tune_pred = target_hgb_tune.predict(X_test)

# Get the best parameters
target_hgb_tune_best_params = target_hgb_tune.best_params_
print(f'Best parameters: {target_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
target_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_target_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {target_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("TargetEncoder")
value_summary.append(target_hgb_tune_accuracy)
time_summary.append(target_hgb_tune_time_taken)
n_models_summary.append(target_hgb_tune.n_iter * target_hgb_tune.n_splits_)

Best parameters: {'model__learning_rate': 0.18048742139773216, 'model__min_samples_leaf': 8}
Balanced accuracy with best parameters: 0.6380737870954964


### CatBoost

In [82]:
from catboost import CatBoostClassifier

#### Preprocessing

In [83]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent"))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("cat", cat_pipeline, cat_cols)],
    sparse_threshold=0
)

In [84]:
category_features_for_catboostsupport = [index for index in range(len(cat_cols))]
print(category_features_for_catboostsupport)

[0, 1, 2, 3, 4, 5]


CatBoost provides the `one_hot_max_size` parameter: categorical variables with at most that many unique categories are one-hot encoded, and the rest are encoded with CatBoost's own categorical statistics.

In [85]:
# Create catboost models
catboost_default_raw = CatBoostClassifier(iterations=1000,
                                        eval_metric = 'BalancedAccuracy',
                                        loss_function = 'Logloss',
                                        auto_class_weights = 'Balanced',
                                        early_stopping_rounds=5,
                                        od_type='Iter',
                                        one_hot_max_size = 0,
                                        random_seed = 1234,
                                        verbose = False)

catboost_default_raw.set_params(cat_features=category_features_for_catboostsupport)

# Default CatBoostClassifier Pipeline
catboost_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                      ('model', catboost_default_raw)])

# Define the hyperparameter search space
catboost_param_distributions = {
    'model__iterations': scipy.stats.randint(10, 1000),
    'model__depth': scipy.stats.randint(4,11),
    'model__learning_rate': scipy.stats.uniform(0.01, 0.3),
}

# Create a StratifiedKFold cross-validation instance
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)

catboost_tune_raw = RandomizedSearchCV(estimator = catboost_default_pipeline, 
                                   param_distributions = catboost_param_distributions, 
                                   n_iter = 5,
                                   cv = stratified_kfold,
                                   scoring = 'balanced_accuracy', 
                                   random_state = 1234,
                                   n_jobs = -1)

In [86]:
tic = time.time()

catboost_default = catboost_default_pipeline.fit(X_train, y_train)

toc = time.time()
catboost_default_time_taken = toc-tic

In [87]:
# Display pipeline
print("Time taken: ", catboost_default_time_taken)
catboost_default

Time taken:  44.7258083820343


In [88]:
# Calculate balanced accuracy for the model with default parameters
y_catboost_default_pred = catboost_default.predict(X_test)
catboost_default_accuracy = balanced_accuracy_score(y_test, y_catboost_default_pred)
print(f'Balanced accuracy with default parameters: {catboost_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("CatboostNativeSupport")
value_summary.append(catboost_default_accuracy)
time_summary.append(catboost_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.5934971686089081


In [89]:
tic = time.time()

catboost_tune = catboost_tune_raw.fit(X_train, y_train)

toc = time.time()
catboost_tune_time_taken = toc-tic

In [90]:
# Display pipeline
print("Time taken: ", catboost_tune_time_taken)
catboost_tune

Time taken:  200.8923463821411


In [91]:
# Predict using the model with the best parameters
y_catboost_tune_pred = catboost_tune.predict(X_test)

# Get the best parameters
catboost_tune_best_params = catboost_tune.best_params_
print(f'Best parameters: {catboost_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
catboost_tune_accuracy = balanced_accuracy_score(y_test, y_catboost_tune_pred)
print(f'Balanced accuracy with best parameters: {catboost_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("CatboostNativeSupport")
value_summary.append(catboost_tune_accuracy)
time_summary.append(catboost_tune_time_taken)
n_models_summary.append(catboost_tune.n_iter * catboost_tune.n_splits_)

Best parameters: {'model__depth': 5, 'model__iterations': 289, 'model__learning_rate': 0.05519108968182859}
Balanced accuracy with best parameters: 0.5995929592823691


### Results Summary

In [92]:
results_summary = pd.DataFrame({"Dataset":"Click_prediction_small",
                                "Model":"HistGradientBoosting",
                                "Variables":card_9_summary,
                                "Default/Tune":default_summary,
                                "Encoder":encoder_summary,
                                "Metric":"BalancedAccuracy",
                                "Value":value_summary,
                                "Time":time_summary,
                                "n_Models":n_models_summary})
results_summary["mean_Time"] = (results_summary["Time"] / results_summary["n_Models"])
results_summary

| | Dataset | Model | Variables | Default/Tune | Encoder | Metric | Value | Time | n_Models | mean_Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Click_prediction_small | HistGradientBoosting | AllVariables | Default | OneHotEncoding | BalancedAccuracy | 0.605951 | 274.432624 | 1 | 274.432624 |
| 1 | Click_prediction_small | HistGradientBoosting | AllVariables | Tune | OneHotEncoding | BalancedAccuracy | 0.61143 | 822.624049 | 225 | 3.656107 |
| 2 | Click_prediction_small | HistGradientBoosting | AllVariables | Default | CountEncoding | BalancedAccuracy | 0.622724 | 1.170066 | 1 | 1.170066 |
| 3 | Click_prediction_small | HistGradientBoosting | AllVariables | Tune | CountEncoding | BalancedAccuracy | 0.633659 | 105.079294 | 600 | 0.175132 |
| 4 | Click_prediction_small | HistGradientBoosting | AllVariables | Default | OrdinalEncoder | BalancedAccuracy | 0.622529 | 0.779018 | 1 | 0.779018 |
| 5 | Click_prediction_small | HistGradientBoosting | AllVariables | Tune | OrdinalEncoder | BalancedAccuracy | 0.614899 | 52.327876 | 300 | 0.174426 |
| 6 | Click_prediction_small | HistGradientBoosting | AllVariables | Default | HGB_NativeSupport | BalancedAccuracy | 0.58185 | 1.760559 | 1 | 1.760559 |
| 7 | Click_prediction_small | HistGradientBoosting | AllVariables | Tune | HGB_NativeSupport | BalancedAccuracy | 0.580074 | 320.884578 | 1500 | 0.213923 |
| 8 | Click_prediction_small | HistGradientBoosting | AllVariables | Default | TargetEncoder | BalancedAccuracy | 0.629445 | 1.452019 | 1 | 1.452019 |
| 9 | Click_prediction_small | HistGradientBoosting | AllVariables | Tune | TargetEncoder | BalancedAccuracy | 0.638074 | 112.61415 | 600 | 0.18769 |


In [93]:
results_summary.to_csv("Click_prediction_small_results.csv")