The two most popular classification objectives are:

    binary:logistic - binary classification (the target contains only two classes, e.g., cat or dog)

    multi:softprob - multi-class classification (more than two classes in the target, e.g., apple/orange/banana)
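
To make the difference concrete, here is a minimal sketch in plain Python (not XGBoost's actual implementation) of how each objective turns raw model scores into probabilities: `binary:logistic` applies a sigmoid to a single score, while `multi:softprob` applies a softmax across one score per class. The helper names `sigmoid` and `softmax` and the example scores are made up for illustration.

```python
import math


def sigmoid(score):
    """binary:logistic: one raw score -> probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """multi:softprob: one raw score per class -> one probability per class."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores, chosen only for illustration
p_dog = sigmoid(0.0)                # a raw score of 0 means "undecided"
p_fruit = softmax([2.0, 1.0, 0.1])  # apple / orange / banana

print(p_dog)
print(p_fruit, sum(p_fruit))
```

With `binary:logistic` the model returns one probability per row; with `multi:softprob` it returns one probability per class, and those probabilities sum to 1.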


In [1]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import warnings

## Columns definitions

**start_days** is the number of days since the first session

**created_date** is the date of a contact's first session

a **session** is a set of pageviews within a 30-minute window,

which means that if you stop navigating for 30 minutes and then start again, that counts as a second session
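
The session rule can be sketched as a small function. This is only an illustration of the definition above, not the code that produced `segments.csv`; the name `count_sessions`, the example timestamps, and the choice to treat a gap of exactly 30 minutes as a new session are assumptions.

```python
from datetime import datetime, timedelta

# Assumption: a pause of 30 minutes or more starts a new session
SESSION_GAP = timedelta(minutes=30)


def count_sessions(pageview_times):
    """Count sessions in a list of pageview timestamps (hypothetical helper)."""
    times = sorted(pageview_times)
    if not times:
        return 0
    sessions = 1
    for previous, current in zip(times, times[1:]):
        if current - previous >= SESSION_GAP:
            sessions += 1
    return sessions


# Hypothetical pageviews for one contact
views = [
    datetime(2019, 6, 18, 9, 0),
    datetime(2019, 6, 18, 9, 10),  # 10 minutes after the previous view: same session
    datetime(2019, 6, 18, 10, 0),  # 50-minute pause: a second session starts
    datetime(2019, 6, 18, 10, 5),
]

print(count_sessions(views))
```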

In [2]:
target = "paying"  # the column we are trying to predict

# a file was generated for 'session_59d'
df = pd.read_csv("/vagrant/ai_random_forest_py/contacts/202402070946/segments.csv", low_memory=False) # returns DataFrame

# del df["created_date"]
# del df["id"]

# 1 if an entity is in a segment at the end of a period
# 0 (zero) if an entity is not in a segment at the end of a period
# 0 (zero) if an entity has never been in a segment

# Viewing the top 5 rows
df.head()

Unnamed: 0,created_date,id,segm_1,segm_2,segm_3,segm_4,segm_5,segm_6,segm_7,segm_8,...,segm_210,segm_211,segm_212,segm_213,segm_214,segm_215,segm_216,segm_217,segm_218,paying
0,2019-06-18,1,0,0,0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2019-06-18,2,0,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2019-06-18,3,0,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,2019-06-18,4,0,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,2019-06-18,5,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [3]:
print('data: ', df.shape[0])

print('Paying True: ', df[df['paying'] == True].shape[0])
print('Paying False: ', df[df['paying'] == False].shape[0])


data:  2096
Paying True:  36
Paying False:  2060


As you can see, we have **Unbalanced Data**: only 36 of the 2096 contacts (about 1.7%) are paying.

An unbalanced dataset is one in which the target variable has many more observations in one class than in the others.

We will deal with the imbalance in another script.
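
To put the imbalance in numbers, here is a small sketch that uses only the counts printed above. It also computes the ratio of negatives to positives, which the XGBoost documentation suggests as a typical starting value for the `scale_pos_weight` parameter; whether that parameter is what the other script uses is an assumption.

```python
# Counts copied from the output above
n_total = 2096
n_paying = 36
n_not_paying = 2060

positive_share = n_paying / n_total

# Accuracy of a trivial model that always predicts "not paying"
baseline_accuracy = n_not_paying / n_total

# sum(negative) / sum(positive), a common starting point for scale_pos_weight
suggested_scale_pos_weight = n_not_paying / n_paying

print(f"positive share:    {positive_share:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"scale_pos_weight:  {suggested_scale_pos_weight:.2f}")
```

The baseline accuracy is worth keeping in mind when reading accuracy scores on this dataset: a model that never predicts a paying contact already scores about 98%.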


In [4]:
del df["created_date"]
del df["id"]

df.head()

Unnamed: 0,segm_1,segm_2,segm_3,segm_4,segm_5,segm_6,segm_7,segm_8,segm_9,segm_10,...,segm_210,segm_211,segm_212,segm_213,segm_214,segm_215,segm_216,segm_217,segm_218,paying
0,0,0,0,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Build the list of feature column names (every column except the last one, 'paying')


# print(df.shape[1] - 2)
features = df.columns[:(df.shape[1] - 1)]


# View features
print(f'Features: {features}')

df.head()

Features: Index(['segm_1', 'segm_2', 'segm_3', 'segm_4', 'segm_5', 'segm_6', 'segm_7',
       'segm_8', 'segm_9', 'segm_10',
       ...
       'segm_209', 'segm_210', 'segm_211', 'segm_212', 'segm_213', 'segm_214',
       'segm_215', 'segm_216', 'segm_217', 'segm_218'],
      dtype='object', length=218)


Unnamed: 0,segm_1,segm_2,segm_3,segm_4,segm_5,segm_6,segm_7,segm_8,segm_9,segm_10,...,segm_210,segm_211,segm_212,segm_213,segm_214,segm_215,segm_216,segm_217,segm_218,paying
0,0,0,0,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
from sklearn.model_selection import train_test_split

x = df[features].to_numpy()
y = df['paying'].to_numpy()

# Split the data
# x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, stratify=y)
# x_train, x_test, y_train, y_test = train_test_split(df, df['paying'], test_size=0.30, random_state = 2020, stratify = df['paying'])

In [7]:
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_validate
import xgboost as xgb

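def fit_and_score(estimator, x_train, x_test, y_train, y_test):
    """Fit the estimator on the train set and score it on both sets"""
    # eval_set is what early stopping monitors; because it is the test fold,
    # the test score below is slightly optimistic
    estimator.fit(x_train, y_train, eval_set=[(x_test, y_test)])

    # score() returns the mean accuracy for a classifier
    train_score = estimator.score(x_train, y_train)
    test_score = estimator.score(x_test, y_test)

    return estimator, train_score, test_score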


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=94)

clf = xgb.XGBClassifier(tree_method="hist", early_stopping_rounds=3)

xy_test = []

res = []


for train, test in cv.split(x, y):
    x_train = x[train]
    x_test = x[test]
    y_train = y[train]
    y_test = y[test]
    est, train_score, test_score = fit_and_score(
        clone(clf), x_train, x_test, y_train, y_test
    )
    xy_test.append([x_test, y_test])
    res.append((est, train_score, test_score))


[0]	validation_0-logloss:0.13178
[1]	validation_0-logloss:0.10591
[2]	validation_0-logloss:0.08882
[3]	validation_0-logloss:0.07432
[4]	validation_0-logloss:0.06253
[5]	validation_0-logloss:0.05408
[6]	validation_0-logloss:0.04835
[7]	validation_0-logloss:0.04396
[8]	validation_0-logloss:0.04078
[9]	validation_0-logloss:0.03836
[10]	validation_0-logloss:0.03688
[11]	validation_0-logloss:0.03464
[12]	validation_0-logloss:0.03412
[13]	validation_0-logloss:0.03420
[14]	validation_0-logloss:0.03401
[15]	validation_0-logloss:0.03411
[16]	validation_0-logloss:0.03368
[17]	validation_0-logloss:0.03398
[18]	validation_0-logloss:0.03387
[0]	validation_0-logloss:0.12691
[1]	validation_0-logloss:0.09977
[2]	validation_0-logloss:0.08216
[3]	validation_0-logloss:0.06878
[4]	validation_0-logloss:0.05992
[5]	validation_0-logloss:0.05317
[6]	validation_0-logloss:0.04876
[7]	validation_0-logloss:0.04567
[8]	validation_0-logloss:0.04367
[9]	validation_0-logloss:0.04185
[10]	validation_0-logloss:0.04073
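
The loop above relies on `StratifiedKFold` to keep the share of paying contacts roughly equal in every fold, which matters with only 36 positives. The sketch below uses synthetic labels with the same class counts as the dataset (not the real data) and a placeholder feature matrix to show what such a split looks like.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same class counts as the dataset: 2060 zeros, 36 ones.
# The feature matrix is a placeholder; it plays no role in how the folds are built.
y_demo = np.array([0] * 2060 + [1] * 36)
x_demo = np.zeros((len(y_demo), 1))

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=94)

fold_sizes = []
fold_positives = []
for train_idx, test_idx in cv_demo.split(x_demo, y_demo):
    fold_sizes.append(len(test_idx))
    fold_positives.append(int(y_demo[test_idx].sum()))

print("test fold sizes:   ", fold_sizes)
print("positives per fold:", fold_positives)
```

Each test fold ends up with 7 or 8 paying contacts, so a single misclassified positive moves the per-fold recall by more than ten percentage points.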


In [8]:
# results

for i, (_, train_score, test_score) in enumerate(res):
    print(f' {i} train_score: {train_score}; test_score: {test_score}')

# print('train_score', res[4][1])
# print('test_score', res[4][2])

# Keep the estimator fitted on the last fold (index 4)
clf = res[4][0]
clf

 0 train_score: 0.9952267303102625; test_score: 0.9928571428571429
 1 train_score: 0.9952295766249255; test_score: 0.9952267303102625
 2 train_score: 0.9970184853905785; test_score: 0.9904534606205251
 3 train_score: 0.9964221824686941; test_score: 0.9856801909307876
 4 train_score: 0.9946332737030411; test_score: 0.9952267303102625


### R^2 (coefficient of determination) regression score function.

In [9]:
from sklearn.metrics import r2_score

# sklearn.metrics.r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)
x_test = xy_test[4][0]
y_test = xy_test[4][1]
y_pred = clf.predict(x_test)

print('R^2: ', r2_score(y_test, y_pred))
# print('R^2: ', r2_score(y_test.reshape(-1), y_pred))

R^2:  0.7094313453536755



This is much better than the result we got in **xg_boost_classification.ipynb**

In [10]:
from sklearn.metrics import recall_score

display(pd.crosstab(y_test.astype(bool), y_pred.astype(bool), rownames = ['Actual Paying'], colnames = ['Predicted Paying']))

Actual Paying \ Predicted Paying,False,True
False,412,0
True,2,5
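
The cell above imports `recall_score` but does not call it. As a rough sketch, recall and a few related numbers can be worked out by hand from the crosstab counts; the same arithmetic also reproduces the fold-4 `test_score` and the R^2 value reported earlier. The variable names are illustrative.

```python
# Counts copied from the crosstab above (fold 4)
tn, fp = 412, 0  # actual False: predicted False / predicted True
fn, tp = 2, 5    # actual True:  predicted False / predicted True

n = tn + fp + fn + tp
accuracy = (tp + tn) / n
recall = tp / (tp + fn)     # share of paying contacts that the model found
precision = tp / (tp + fp)  # share of predicted-paying contacts that really pay

# R^2 = 1 - SS_res / SS_tot for 0/1 labels and 0/1 predictions
ss_res = fp + fn  # every wrong prediction contributes (1 - 0)^2 = 1
mean_y = (tp + fn) / n
ss_tot = (tp + fn) * (1 - mean_y) ** 2 + (tn + fp) * (0 - mean_y) ** 2
r2 = 1 - ss_res / ss_tot

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
print(f"R^2:       {r2:.4f}")
```

Accuracy stays above 99% even though 2 of the 7 paying contacts in this fold are missed, which is why recall is the more informative number on unbalanced data.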


## Saving the trained model

[Scikit-Learn interface](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#scikit-learn-interface)

In [11]:
clf.save_model("/vagrant/ai_random_forest_py/xg_boost_model.json")

In [12]:
# clf.get_booster().save_model("/vagrant/ai_random_forest_py/booster_xg_boost_model.json")