We will solve the next best action (NBA) task on a synthetic dataset. The data describes customers of a fictional online store who were shown ads for a Laptop (used, very old), a Phone, or a Charger.

**Task**: build an NBA model as a composition of binary response models, then compare the best offer chosen by purchase probability alone with the best offer chosen with the product's NPV taken into account.


# Loading and preparing the data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import random
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import shuffle

import warnings
warnings.filterwarnings("ignore")

# Show all columns (and, optionally, all rows) when printing
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

# fix random_state as a constant up front
random_state = 47

sns.set(style="whitegrid")

In [2]:
df_Laptop = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/OTUS/synthetic_laptop_data.csv')
df_Charger = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/OTUS/synthetic_charger_data.csv')
df_Phone = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/OTUS/synthetic_mobile_phone_data.csv')

df = pd.concat([df_Laptop, df_Charger, df_Phone])
print(df.shape)
df.head()

(30000, 18)


| | Age | Gender | Geography | PreviousPurchases | Salary | NumChildren | Product | EducationLevel | MaritalStatus | EmploymentStatus | HousingStatus | CreditScore | InternetUsage | NumberOfCars | HealthStatus | ShoppingFrequency | MembershipDuration | PurchaseFlag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | North America | 8386.985501 | 42217.503431 | 4 | Laptop | Bachelor | Divorced | Employed | Rent | 754 | 1.545836 | 0 | Fair | 6 | 9 | 1 |
| 1 | 69 | Female | South America | 7391.924025 | 44369.165492 | 3 | Laptop | Master | Single | Employed | Own | 745 | 8.005712 | 0 | Good | 1 | 3 | 1 |
| 2 | 46 | Female | South America | 7951.832329 | 85406.301432 | 3 | Laptop | High School | Widowed | Unemployed | Own | 552 | 8.154606 | 1 | Fair | 5 | 9 | 0 |
| 3 | 32 | Female | Africa | 2284.841225 | 116197.240945 | 4 | Laptop | PhD | Widowed | Retired | Living with Parents | 458 | 1.938812 | 1 | Fair | 17 | 2 | 1 |
| 4 | 60 | Male | Asia | 6520.780649 | 55461.01441 | 0 | Laptop | High School | Married | Employed | Living with Parents | 434 | 7.656888 | 0 | Fair | 10 | 15 | 0 |


Feature descriptions:

**Age**: Customer's age

**Gender**: Customer's gender

**Geography**: Region where the customer lives

**PreviousPurchases**: Total amount of the customer's previous purchases, in currency units

**Salary**: Customer's annual salary, in currency units

**NumChildren**: Number of children the customer has

**Product**: Product offered for purchase

**EducationLevel**: Customer's education level

**MaritalStatus**: Customer's marital status

**EmploymentStatus**: Customer's employment status

**HousingStatus**: Customer's housing status

**CreditScore**: Customer's credit score

**InternetUsage**: Time spent online, in hours per day

**NumberOfCars**: Number of cars the customer owns

**HealthStatus**: Customer's health status

**ShoppingFrequency**: How often the customer shops, in purchases per month

**MembershipDuration**: Length of the customer's membership, in years

**PurchaseFlag**: Purchase flag, indicating whether the customer made the purchase

These columns combine customer demographics with purchasing and financial characteristics; they are used to predict the probability that the customer buys the product.

In [3]:
df.groupby(['Product', 'PurchaseFlag']).agg({'Gender': 'count'})

| Product | PurchaseFlag | Gender |
|---|---|---|
| Charger | 0 | 6541 |
| Charger | 1 | 3459 |
| Laptop | 0 | 4594 |
| Laptop | 1 | 5406 |
| Mobile Phone | 0 | 5049 |
| Mobile Phone | 1 | 4951 |


In [4]:
df.groupby(['Product']).agg({'PurchaseFlag': 'mean'})

| Product | PurchaseFlag |
|---|---|
| Charger | 0.3459 |
| Laptop | 0.5406 |
| Mobile Phone | 0.4951 |


We define NPV as follows (the key `Phone` stands for the product recorded as `Mobile Phone` in the data):

In [5]:
npv_values = {
    'Laptop': 100.0,
    'Phone': 300.0,
    'Charger': 200.0
}
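
To see how NPV can change the ranking of offers, here is a minimal sketch that multiplies the average purchase rates from the `groupby` output by the NPV values. Using product-level base rates in place of per-customer model scores is an assumption made purely for illustration, and `base_rates` and `expected_value` are hypothetical names that the notebook does not use.

```python
# Average purchase rates per product, copied from the groupby output
# ('Mobile Phone' is written as 'Phone' to match the NPV keys).
base_rates = {'Charger': 0.3459, 'Laptop': 0.5406, 'Phone': 0.4951}
npv_values = {'Laptop': 100.0, 'Phone': 300.0, 'Charger': 200.0}

# Expected value of an offer = purchase probability * NPV of the product.
expected_value = {p: base_rates[p] * npv_values[p] for p in base_rates}

best_by_probability = max(base_rates, key=base_rates.get)
best_by_value = max(expected_value, key=expected_value.get)

print(best_by_probability)  # Laptop
print(best_by_value)        # Phone
```

Even at this coarse level the two criteria disagree: Laptop has the highest purchase rate, while Phone has the highest expected value.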

## Encoding

We will train gradient boosting models. CatBoost handles categorical features natively, so there is no need to encode them by hand: it is enough to cast them to the categorical dtype and pass them via `cat_features` when fitting.

In [6]:
object_cols = list(df.drop(['Product'], axis=1).select_dtypes('object').columns)
print(f'we have {len(object_cols)} object_cols')

df[object_cols] = df[object_cols].astype('category')

we have 7 object_cols


## Splitting into samples

In [7]:
df = df.rename(columns={'PurchaseFlag': 'target'})

In [8]:
df_Charger = df[df['Product'] == 'Charger']
df_Laptop = df[df['Product'] == 'Laptop']
df_Phone = df[df['Product'] == 'Mobile Phone']

print(f"Dataset sizes:\n"
      f"Charger: {df_Charger.shape[0]} rows and {df_Charger.shape[1]} columns\n"
      f"Laptop: {df_Laptop.shape[0]} rows and {df_Laptop.shape[1]} columns\n"
      f"Mobile Phone: {df_Phone.shape[0]} rows and {df_Phone.shape[1]} columns\n")

print(f"Average purchase rate:\n"
      f"Charger: {df_Charger.target.mean():.2f}\n"
      f"Laptop: {df_Laptop.target.mean():.2f}\n"
      f"Mobile Phone: {df_Phone.target.mean():.2f}\n")

Dataset sizes:
Charger: 10000 rows and 18 columns
Laptop: 10000 rows and 18 columns
Mobile Phone: 10000 rows and 18 columns

Average purchase rate:
Charger: 0.35
Laptop: 0.54
Mobile Phone: 0.50



In [9]:
features_Charger = df_Charger.drop(['Product', 'target'], axis=1)
target_Charger = df_Charger['target']

features_Laptop = df_Laptop.drop(['Product', 'target'], axis=1)
target_Laptop = df_Laptop['target']

features_Phone = df_Phone.drop(['Product', 'target'], axis=1)
target_Phone = df_Phone['target']

In [10]:
def make_samples(features, target):
  # hold out 20% (one fifth of the data) as the test set
  X_train_valid, X_test, y_train_valid, y_test = train_test_split(features, target,
                                                                  test_size=0.2,
                                                                  random_state=random_state)
  # hold out 25% of train+valid (one quarter) as the validation set
  X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid,
                                                        test_size=0.25,
                                                        random_state=random_state)

  s1 = y_train.size
  s2 = y_valid.size
  s3 = y_test.size
  print('train:valid:test split ratio '
        + str(round(s1/s3)) + ':' + str(round(s2/s3)) + ':' + str(round(s3/s3)))
  print('target rate on the splits:', round(y_train.mean(), 4), round(y_valid.mean(), 4), round(y_test.mean(), 4))
  return X_train, X_valid, X_test, y_train, y_valid, y_test

In [11]:
X_train_Charger, X_valid_Charger, X_test_Charger, y_train_Charger, y_valid_Charger, y_test_Charger = make_samples(features_Charger, target_Charger)
X_train_Laptop, X_valid_Laptop, X_test_Laptop, y_train_Laptop, y_valid_Laptop, y_test_Laptop = make_samples(features_Laptop, target_Laptop)
X_train_Phone, X_valid_Phone, X_test_Phone, y_train_Phone, y_valid_Phone, y_test_Phone = make_samples(features_Phone, target_Phone)

train:valid:test split ratio 3:1:1
target rate on the splits: 0.3532 0.3325 0.3375
train:valid:test split ratio 3:1:1
target rate on the splits: 0.5388 0.549 0.5375
train:valid:test split ratio 3:1:1
target rate on the splits: 0.4938 0.506 0.488
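
The 3:1:1 ratio follows from applying the two splits in sequence: 20% of all rows go to test, then 25% of the remaining 80% (that is, 20% of the total) go to validation. The sketch below reproduces that arithmetic. `split_sizes` is a hypothetical helper, and plain `round` is a simplifying assumption rather than scikit-learn's exact rounding rule.

```python
def split_sizes(n_rows, test_size=0.2, valid_size=0.25):
    """Return (train, valid, test) row counts for a two-step split."""
    n_test = round(n_rows * test_size)            # first split: test set
    n_train_valid = n_rows - n_test
    n_valid = round(n_train_valid * valid_size)   # second split: validation set
    n_train = n_train_valid - n_valid
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(10_000)
print(n_train, n_valid, n_test)  # 6000 2000 2000
```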


In [12]:
X_train_Charger.head(2)

| | Age | Gender | Geography | PreviousPurchases | Salary | NumChildren | EducationLevel | MaritalStatus | EmploymentStatus | HousingStatus | CreditScore | InternetUsage | NumberOfCars | HealthStatus | ShoppingFrequency | MembershipDuration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9731 | 23 | Female | South America | 6042.440703 | 67691.979045 | 2 | Master | Single | Student | Living with Parents | 655 | 4.588958 | 0 | Poor | 2 | 11 |
| 6920 | 64 | Male | North America | 2497.156731 | 35648.888199 | 1 | High School | Widowed | Student | Rent | 379 | 6.842913 | 0 | Poor | 11 | 1 |


# Helper functions

In [13]:
# Compute precision, recall, F1 and ROC AUC on the train, validation and test sets
def calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    # Train set: hard labels and positive-class scores
    y_train_pred = model.predict(X_train)
    y_train_proba = model.predict_proba(X_train)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_train)

    # Validation set
    y_valid_pred = model.predict(X_valid)
    y_valid_proba = model.predict_proba(X_valid)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_valid)

    # Test set
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_test)

    train_metrics = {
        'precision': precision_score(y_train, y_train_pred),
        'recall': recall_score(y_train, y_train_pred),
        'f1': f1_score(y_train, y_train_pred),
        'roc_auc': roc_auc_score(y_train, y_train_proba)
    }

    valid_metrics = {
        'precision': precision_score(y_valid, y_valid_pred),
        'recall': recall_score(y_valid, y_valid_pred),
        'f1': f1_score(y_valid, y_valid_pred),
        'roc_auc': roc_auc_score(y_valid, y_valid_proba)
    }

    test_metrics = {
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, y_test_proba)
    }

    return train_metrics, valid_metrics, test_metrics

def print_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    res = calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
    metrics = pd.DataFrame(res, index=['train', 'valid', 'test'])
    return metrics

# Train models

https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier

In [14]:
!pip install catboost optuna



In [15]:
from catboost import CatBoostClassifier
import optuna

In [16]:
def objective(trial, X_train, y_train, X_valid, y_valid):
    param = {
        "learning_rate": trial.suggest_float('learning_rate', 0.01, 0.9),
        "max_depth": trial.suggest_int("max_depth", 2, 7),
        "l2_leaf_reg":trial.suggest_float('l2_leaf_reg', 0.01, 2),
        "subsample": trial.suggest_float('subsample', 0.01, 1),
        "random_strength": trial.suggest_float('random_strength', 1, 200),
        "min_data_in_leaf":trial.suggest_float('min_data_in_leaf', 1, 500)
    }

    cat = CatBoostClassifier(
        logging_level="Silent",
        eval_metric="AUC",
        grow_policy="Lossguide",
        random_seed=42,
        **param)
    cat.fit(X_train, y_train, cat_features=object_cols,
            eval_set=(X_valid, y_valid),
            verbose=False,
            early_stopping_rounds=10
           )

    preds = cat.predict_proba(X_valid)[:,1]
    auc = roc_auc_score(y_valid, preds)

    return auc

In [17]:
def train_model(X_train, y_train, X_valid, y_valid, X_test, y_test):

  study = optuna.create_study(direction="maximize", study_name='CatBoostClassifier')
  study.optimize(lambda trial: objective(trial, X_train, y_train, X_valid, y_valid), n_trials=100)

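  # note: objective() also fixes grow_policy="Lossguide", eval_metric="AUC" and random_seed=42;
  # none of these is passed here, so the final model is not configured identically to the tuned one
  best_cat = CatBoostClassifier(**study.best_params, random_state=random_state)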
  best_cat.fit(X_train, y_train, cat_features=object_cols,
              eval_set=(X_valid, y_valid),
              verbose=False,
              early_stopping_rounds=10
            )

  res_cat = print_metrics(best_cat, X_train, y_train, X_valid, y_valid, X_test, y_test)
  return res_cat, best_cat

In [18]:
res_Laptop, model_Laptop = train_model(X_train_Laptop, y_train_Laptop, X_valid_Laptop, y_valid_Laptop, X_test_Laptop, y_test_Laptop)
res_Charger, model_Charger = train_model(X_train_Charger, y_train_Charger, X_valid_Charger, y_valid_Charger, X_test_Charger, y_test_Charger)
res_Phone, model_Phone = train_model(X_train_Phone, y_train_Phone, X_valid_Phone, y_valid_Phone, X_test_Phone, y_test_Phone)

[I 2024-06-13 18:47:52,458] A new study created in memory with name: CatBoostClassifier
[I 2024-06-13 18:47:53,152] Trial 0 finished with value: 0.6820281988214815 and parameters: {'learning_rate': 0.7843864039274303, 'max_depth': 6, 'l2_leaf_reg': 0.09582490629611397, 'subsample': 0.23996090299613035, 'random_strength': 165.89234306358253, 'min_data_in_leaf': 349.41643330950427}. Best is trial 0 with value: 0.6820281988214815.
[I 2024-06-13 18:47:54,409] Trial 1 finished with value: 0.683363018428992 and parameters: {'learning_rate': 0.1556715950233523, 'max_depth': 7, 'l2_leaf_reg': 0.5620468279555199, 'subsample': 0.4112719822753225, 'random_strength': 104.52796956020892, 'min_data_in_leaf': 464.0671536231361}. Best is trial 1 with value: 0.683363018428992.
[I 2024-06-13 18:47:55,472] Trial 2 finished with value: 0.6834129984369888 and parameters: {'learning_rate': 0.47173315601139937, 'max_depth': 3, 'l2_leaf_reg': 0.6987778021992275, 'subsample': 0.579474891442067, 'random_strengt

In [19]:
res_Laptop

| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.625231 | 0.732756 | 0.674737 | 0.662935 |
| valid | 0.645289 | 0.742259 | 0.690385 | 0.683781 |
| test | 0.607812 | 0.723721 | 0.660722 | 0.628712 |


In [20]:
res_Phone

| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.622988 | 0.561593 | 0.590699 | 0.668652 |
| valid | 0.619469 | 0.55336 | 0.584551 | 0.642531 |
| test | 0.610811 | 0.578893 | 0.594424 | 0.653939 |


In [21]:
res_Charger

| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.600998 | 0.113733 | 0.19127 | 0.67869 |
| valid | 0.598726 | 0.141353 | 0.22871 | 0.700032 |
| test | 0.56 | 0.124444 | 0.203636 | 0.669407 |
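
The Charger model's recall is much lower than the others even though its ROC AUC is comparable. A likely explanation is that `predict` returns hard labels, which for a binary classifier are by default obtained by thresholding the probability at 0.5, and with a purchase rate of about 0.35 few customers cross that line. ROC AUC is computed from the scores and does not depend on the threshold. The sketch below shows the effect on made-up scores and labels, not on the notebook's data; `precision_recall` is a hypothetical helper.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall of the rule 'predict 1 if score >= threshold'."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
    fn = sum(1 for y, p in zip(labels, predicted) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: positives mostly score below 0.5, as with a rare target.
scores = [0.9, 0.6, 0.45, 0.4, 0.38, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

at_050 = precision_recall(labels, scores, 0.5)
at_035 = precision_recall(labels, scores, 0.35)
print(at_050)  # (1.0, 0.5)
print(at_035)  # (0.8, 1.0)
```

Since the NBA step below works with probabilities rather than hard labels, the low recall at the default threshold does not by itself make the Charger scores unusable.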


# Calibration

In [22]:
# we will calibrate with isotonic regression
from sklearn.isotonic import IsotonicRegression

# fit an isotonic calibrator that maps model scores to calibrated probabilities
def calibrate(model, X_valid, y_valid):
  y_pred = model.predict_proba(X_valid)[:, 1]

  iso_reg = IsotonicRegression(out_of_bounds='clip')
  iso_reg.fit(y_pred, y_valid)
  return iso_reg
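
For intuition about what the isotonic calibrator learns, here is a pure-Python sketch of the pool-adjacent-violators idea behind isotonic regression. It is an illustration only, not scikit-learn's implementation: `pav` is a hypothetical helper, the labels are made up, and they are assumed to be already sorted by ascending model score.

```python
def pav(y):
    """Least-squares non-decreasing fit to y (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every point in a block gets the block mean
    return fitted

# Outcomes (1 = purchase) ordered by increasing model score.
labels = [0, 1, 0, 0, 1, 1]
fitted = pav(labels)
print([round(v, 3) for v in fitted])  # [0.0, 0.333, 0.333, 0.333, 1.0, 1.0]
```

Each fitted value is the observed purchase rate within a pooled group of neighbouring scores, which is why the calibrated outputs can be read as probabilities and compared across the three models.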

In [23]:
# fit the calibrators on the validation set (the training set would also work, but the model's scores there tend to be over-optimistic)
iso_reg_Laptop = calibrate(model_Laptop, X_valid_Laptop, y_valid_Laptop)
iso_reg_Charger = calibrate(model_Charger, X_valid_Charger, y_valid_Charger)
iso_reg_Phone = calibrate(model_Phone, X_valid_Phone, y_valid_Phone)

In [24]:
# assemble a single dataframe that holds the test data for all products
cols = X_test_Charger.columns.tolist()

X_test_Charger['product'] = 'Charger'
X_test_Laptop['product'] = 'Laptop'
X_test_Phone['product'] = 'Phone'

X_test = pd.concat([X_test_Charger, X_test_Laptop, X_test_Phone])
y_test = pd.concat([y_test_Charger, y_test_Laptop, y_test_Phone])

test_df = X_test[['product']].copy()  # copy, so the column assignments below do not write into a slice of X_test
test_df['NPV'] = test_df['product'].map(npv_values)
test_df['target'] = y_test
test_df['predict_Charger'] = model_Charger.predict_proba(X_test[cols])[:, 1]
test_df['predict_Laptop'] = model_Laptop.predict_proba(X_test[cols])[:, 1]
test_df['predict_Phone'] = model_Phone.predict_proba(X_test[cols])[:, 1]

In [25]:
# sanity check: the assembly went well and ROC AUC matches the per-model test values
for product in ['Charger', 'Laptop', 'Phone']:
  tmp = test_df[test_df['product'] == product]
  print(f'ROC_AUC for {product}:', roc_auc_score(tmp['target'], tmp[f'predict_{product}']))

ROC_AUC for Charger: 0.6694071278825997
ROC_AUC for Laptop: 0.6287115021998743
ROC_AUC for Phone: 0.6539391689613218


In [26]:
# apply the calibrators to the test samples
test_df['predict_Charger_calibrated'] = iso_reg_Charger.transform(test_df['predict_Charger'])
test_df['predict_Laptop_calibrated'] = iso_reg_Laptop.transform(test_df['predict_Laptop'])
test_df['predict_Phone_calibrated'] = iso_reg_Phone.transform(test_df['predict_Phone'])

In [27]:
# Mapping from score columns to product names
product_scores = {
    'predict_Charger_calibrated': 'Charger',
    'predict_Laptop_calibrated': 'Laptop',
    'predict_Phone_calibrated': 'Phone'
}

# Product with the highest calibrated score
test_df['Max_Score_Product'] = test_df[
    ['predict_Charger_calibrated', 'predict_Laptop_calibrated', 'predict_Phone_calibrated']
].idxmax(axis=1).map(product_scores)

# Product with the highest score * NPV
def calculate_max_score_npv_product(row):
    scores_with_npv = {
        'Charger': row['predict_Charger_calibrated'] * npv_values['Charger'],
        'Laptop': row['predict_Laptop_calibrated'] * npv_values['Laptop'],
        'Phone': row['predict_Phone_calibrated'] * npv_values['Phone']
    }
    max_product = max(scores_with_npv, key=scores_with_npv.get)
    return max_product

test_df['Max_Score_NPV_Product'] = test_df.apply(calculate_max_score_npv_product, axis=1)
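
The row-wise `apply` can also be written as a vectorized operation. The sketch below is one possible alternative, not the notebook's code: it multiplies a frame of calibrated probabilities by the NPV values and takes `idxmax` along the columns. The two rows are copied from the sample output shown further down, and the short column names (`Charger`, `Laptop`, `Phone`) are an assumption made to keep the example compact.

```python
import pandas as pd

npv = pd.Series({'Charger': 200.0, 'Laptop': 100.0, 'Phone': 300.0})

# Calibrated probabilities for two test customers (rows 5623 and 2983).
probs = pd.DataFrame(
    {'Charger': [0.339483, 0.536765],
     'Laptop': [0.684211, 0.684211],
     'Phone': [0.35, 0.320988]},
    index=[5623, 2983],
)

# DataFrame * Series aligns the Series index with the column labels.
expected_value = probs * npv

best_by_prob = probs.idxmax(axis=1)
best_by_value = expected_value.idxmax(axis=1)

print(best_by_prob.tolist())   # ['Laptop', 'Laptop']
print(best_by_value.tolist())  # ['Phone', 'Charger']
```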

In [28]:
test_df[['product', 'NPV', 'target', 'predict_Charger_calibrated', 'predict_Laptop_calibrated',
         'predict_Phone_calibrated', 'Max_Score_Product', 'Max_Score_NPV_Product']].sample(6)

| | product | NPV | target | predict_Charger_calibrated | predict_Laptop_calibrated | predict_Phone_calibrated | Max_Score_Product | Max_Score_NPV_Product |
|---|---|---|---|---|---|---|---|---|
| 5623 | Charger | 200.0 | 0 | 0.339483 | 0.684211 | 0.35 | Laptop | Phone |
| 5685 | Laptop | 100.0 | 1 | 0.381944 | 0.666667 | 0.622449 | Laptop | Phone |
| 8991 | Laptop | 100.0 | 1 | 0.21028 | 0.222222 | 0.320988 | Phone | Phone |
| 8904 | Phone | 300.0 | 0 | 0.217105 | 0.438596 | 0.474359 | Phone | Phone |
| 2983 | Laptop | 100.0 | 1 | 0.536765 | 0.684211 | 0.320988 | Laptop | Charger |
| 2698 | Charger | 200.0 | 0 | 0.082192 | 0.829412 | 0.488789 | Laptop | Phone |


In [29]:
# recall the NPV values we set
npv_values

{'Laptop': 100.0, 'Phone': 300.0, 'Charger': 200.0}

In [30]:
test_df['Max_Score_Product'].value_counts()

Max_Score_Product
Laptop     3188
Phone      2249
Charger     563
Name: count, dtype: int64

In [31]:
test_df['Max_Score_NPV_Product'].value_counts()

Max_Score_NPV_Product
Phone      5623
Charger     364
Laptop       13
Name: count, dtype: int64