# **Uplift Modeling: One Model and Two Model Approaches**

---







## **Introduction**







Uplift modeling estimates the incremental effect of a treatment (e.g., a marketing campaign) on a target variable, that is, how much the treatment changes the probability of the outcome. This notebook demonstrates two popular approaches for uplift modeling:







- **One Model Approach**: Uses a single model with a transformed target to capture treatment effects.



- **Two Model Approach**: Trains separate models on the treatment and control groups and takes the difference between their predicted probabilities as the uplift.
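
Both approaches target the same quantity: the difference between the expected outcome with and without treatment. As a minimal illustration on made-up toy arrays (the names `w` and `y` are assumptions for this sketch, not columns from the dataset), the average uplift is the gap between the two groups' response rates:

```python
import numpy as np

# Toy data, invented for illustration: w = treatment flag, y = binary outcome.
w = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y = np.array([1, 1, 0, 1, 1, 0, 0, 0])

# Average uplift = response rate in treatment minus response rate in control.
rate_treated = y[w == 1].mean()
rate_control = y[w == 0].mean()
average_uplift = rate_treated - rate_control
print(average_uplift)
```

Uplift models estimate this difference per sample, conditional on the features, rather than as a single average.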







In this notebook, we will use the **CatBoostClassifier** from the CatBoost library and **TwoModels** from the `scikit-uplift` library to implement these approaches. Let’s get started!







---


## **1. Requirements**

In [1]:
!pip install catboost scikit-uplift dill

Collecting scikit-uplift
  Downloading scikit_uplift-0.5.1-py3-none-any.whl.metadata (11 kB)
Downloading scikit_uplift-0.5.1-py3-none-any.whl (42 kB)
Installing collected packages: scikit-uplift
Successfully installed scikit-uplift-0.5.1


## **2. Load and Explore Data**

In [2]:
import pandas as pd
import numpy as np

# Load the datasets
df_train = pd.read_csv("/kaggle/input/vk-2024-2e/train.csv")
df_test = pd.read_csv("/kaggle/input/vk-2024-2e/test.csv")

# Display the first few rows of the training data
df_train.head()

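| Unnamed: 0 | application_1 | cc_1 | cc_2 | cc_3 | cc_4 | feature_1 | mb_1 | cc_5 | cc_6 | feature_2 | ... | cc_21 | application_15 | feature_25 | feature_26 | cc_22 | partner_24 | application_16 | retro_date | successful_utilization | treatment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | 1.0 | Representatives | 123.0 | Первичная ДК | 1 | 14.0 | 147000.0 | PLT | 0.0 | ... |  | 0 | 1.0 | 0.0 | -1.2 | 1.0 | 0 | 2024-07-04 | 0 | 1 |
| 1 |  | 1.0 | Offline | 43.0 | Airports | 0 | 1.0 | 120000.0 | PLT | 0.0 | ... | 1.0 | 0 | 1.0 | 0.0 | -1.2 | 1.0 | 0 | 2024-06-06 | 0 | 1 |
| 2 | 0.0 | 1.0 | Web | 2.0 | seo | 0 |  | 15000.0 | PLT | 0.0 | ... |  | 0 |  | 0.0 | -1.2 | 1.0 | 0 | 2024-07-21 | 1 | 1 |
| 3 | 0.0 | 1.0 | MB | 2.0 | One Click Offer | 0 | 91.0 | 260000.0 | PLT | 0.0 | ... | 1.0 | 0 |  | 1.0 | -1.2 | 1.0 | 0 | 2024-05-23 | 0 | 1 |
| 4 | 0.0 | 1.0 | Representatives | 123.0 | Первичная ДК | 1 | 1.0 | 130000.0 | PLT | 0.0 | ... |  | 0 | 1.0 | 0.0 | -1.2 | 1.0 | 0 | 2024-06-28 | 0 | 1 |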


### **Dataset Overview**







- **Target Variable**: `successful_utilization` - indicates whether a given action was successful.



- **Treatment Indicator**: `treatment` - indicates whether a sample belongs to the treatment (1) or control (0) group.



- **Categorical and Date Features**: We have categorical features and a date feature (`retro_date`) that we will preprocess.
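
A quick sanity check before modeling is to compare group sizes and response rates across the two arms. The sketch below uses a small hypothetical frame with the same two column names; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_train with only the two relevant columns.
toy = pd.DataFrame({
    "treatment": [1, 1, 1, 0, 0, 0, 1, 0],
    "successful_utilization": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Per-arm sample count and success rate.
summary = toy.groupby("treatment")["successful_utilization"].agg(["count", "mean"])
print(summary)
```

Applied to `df_train`, the same `groupby` would show whether the arms are balanced, which matters for interpreting the transformed target used later.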







---


## **3. Data Preprocessing**

In [3]:
# Convert the date column to numeric calendar features
def extract_time_features(df, date_col):
    """Replace `date_col` with day, day-of-week, month, and year columns (modifies `df` in place)."""
    dt = pd.to_datetime(df[date_col])
    df['day'] = dt.dt.day
    df['dayofweek'] = dt.dt.weekday
    df['month'] = dt.dt.month
    df['year'] = dt.dt.year
    df.drop(columns=[date_col], inplace=True)
    return df

# Apply date transformation
df_train = extract_time_features(df_train, 'retro_date')
df_test = extract_time_features(df_test, 'retro_date')

### **Handling Missing Values**







- **Categorical Columns**: Replace missing values with the most frequent value.



- **Numerical Columns**: Replace missing values with the median value.
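
The key detail in both rules is that the fill values are computed on the training data only and then reused for the test data, so no test-set statistics leak into preprocessing. A minimal sketch on hypothetical toy frames (the column names `cat` and `num` are placeholders, not dataset columns):

```python
import numpy as np
import pandas as pd

# Toy frames, invented for illustration.
train = pd.DataFrame({"cat": ["a", "b", "a", None], "num": [1.0, np.nan, 3.0, 5.0]})
test = pd.DataFrame({"cat": [None, "b"], "num": [np.nan, 2.0]})

# Fill values come from the training frame only.
fill_values = {
    "cat": train["cat"].mode()[0],   # most frequent category
    "num": train["num"].median(),    # median of the observed values
}

# The same values are applied to both frames.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```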






In [4]:
categorical_columns = ['cc_2', 'cc_4', 'cc_6']

# Categorical columns: fill with the most frequent training value
for col in categorical_columns:
    if col in df_train.columns:
        most_frequent = df_train[col].mode()[0]
        df_train[col] = df_train[col].fillna(most_frequent)
        if col in df_test.columns:
            df_test[col] = df_test[col].fillna(most_frequent)

# Numeric columns: fill with the training median (target and treatment are left untouched)
numeric_columns = df_train.select_dtypes(include=['float64', 'int64']).columns.drop(
    ['successful_utilization', 'treatment'], errors='ignore'
)
for col in numeric_columns:
    median_value = df_train[col].median()
    df_train[col] = df_train[col].fillna(median_value)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(median_value)

## **4. One Model Approach**

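In the One Model approach, we define a **transformed target** that combines the target and treatment values: it equals 1 when a treated sample succeeds or a control sample does not, and 0 otherwise. A single classifier trained on this target captures the treatment effect.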







### **Transformed Target Definition**


In [5]:
# Define transformed target
df_train['new_target'] = (df_train['successful_utilization'] + df_train['treatment'] + 1) % 2

# Define features
features = df_train.columns.drop(['successful_utilization', 'treatment', 'new_target'])
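
The expression `(successful_utilization + treatment + 1) % 2` is equivalent to the class-transformation target Z = Y·W + (1 − Y)(1 − W). If treatment is assigned at random with probability 0.5 (an assumption; the notebook does not state the split), then P(Z = 1 | x) = 0.5 + 0.5 · uplift(x), so uplift can be recovered as 2 · P(Z = 1 | x) − 1. A small check on the four possible (outcome, treatment) pairs, with a hypothetical helper `uplift_from_z_proba`:

```python
import numpy as np

# All four (outcome, treatment) combinations.
y = np.array([0, 0, 1, 1])
w = np.array([0, 1, 0, 1])

z_notebook = (y + w + 1) % 2            # form used in the notebook
z_textbook = y * w + (1 - y) * (1 - w)  # class-transformation form
assert (z_notebook == z_textbook).all()
print(z_notebook)  # [1 0 0 1]

def uplift_from_z_proba(p_z):
    """Hypothetical helper: map P(Z=1|x) to uplift, assuming P(W=1) = 0.5."""
    return 2.0 * np.asarray(p_z) - 1.0
```

Because this mapping is monotone, ranking samples by P(Z = 1 | x) gives the same order as ranking by the recovered uplift.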



### **Split Data**

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    df_train[features], df_train['new_target'], test_size=0.2, random_state=42
)

### **Train the One Model**

In [7]:
from catboost import CatBoostClassifier

model_one = CatBoostClassifier(
    iterations=2000, random_seed=42, depth=6,
    cat_features=categorical_columns, eval_metric='AUC', 
    auto_class_weights='Balanced', task_type="GPU", verbose=100
)

model_one.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True)

Learning rate set to 0.033568


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6496548	best: 0.6496548 (0)	total: 6.84s	remaining: 3h 47m 46s
100:	test: 0.6831170	best: 0.6831170 (100)	total: 9.36s	remaining: 2m 55s
200:	test: 0.6906721	best: 0.6906721 (200)	total: 11.9s	remaining: 1m 46s
300:	test: 0.6945066	best: 0.6945066 (300)	total: 14.4s	remaining: 1m 21s
400:	test: 0.6968951	best: 0.6968951 (400)	total: 16.9s	remaining: 1m 7s
500:	test: 0.6985017	best: 0.6985017 (500)	total: 19.4s	remaining: 58s
600:	test: 0.6996079	best: 0.6996079 (600)	total: 21.9s	remaining: 51s
700:	test: 0.7004599	best: 0.7004599 (700)	total: 24.4s	remaining: 45.3s
800:	test: 0.7009368	best: 0.7009450 (795)	total: 26.9s	remaining: 40.3s
900:	test: 0.7013677	best: 0.7013677 (900)	total: 29.4s	remaining: 35.8s
1000:	test: 0.7017278	best: 0.7017278 (1000)	total: 31.9s	remaining: 31.8s
1100:	test: 0.7019863	best: 0.7019863 (1100)	total: 34.3s	remaining: 28s
1200:	test: 0.7023373	best: 0.7023373 (1200)	total: 36.8s	remaining: 24.5s
1300:	test: 0.7025011	best: 0.7025011 (1300)	to

<catboost.core.CatBoostClassifier at 0x7f66abcede40>

## **5. Two Model Approach**

In the Two Model approach, we train two separate models for the treatment and control groups.
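
Conceptually, the idea reduces to fitting one classifier per group and subtracting the predicted probabilities. The sketch below shows this on synthetic data, with scikit-learn's `LogisticRegression` standing in for CatBoost; the data-generating process and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
w = rng.integers(0, 2, size=n)  # random treatment assignment

# Synthetic outcome: treatment raises the success rate only when feature 0 is positive.
p = 0.2 + 0.3 * w * (X[:, 0] > 0)
y = (rng.random(n) < p).astype(int)

# One model per group, each trained only on its own samples.
model_trmnt = LogisticRegression().fit(X[w == 1], y[w == 1])
model_ctrl = LogisticRegression().fit(X[w == 0], y[w == 0])

# Uplift = P(success | treated) - P(success | control), scored on the same rows.
uplift = model_trmnt.predict_proba(X)[:, 1] - model_ctrl.predict_proba(X)[:, 1]
print(uplift[:5])
```

The independent-models variant of `TwoModels` used below follows the same pattern, with the two CatBoost estimators in place of the logistic regressions.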







### **Split Data for Two Models**


In [8]:
X = df_train.drop(columns=['successful_utilization', 'treatment', 'new_target'])  # Don't forget to drop new_target from the One Model approach!
y = df_train['successful_utilization']
treatment_series = df_train['treatment']

X_train_2, X_val_2, trmnt_train, trmnt_val, y_train_2, y_val_2 = train_test_split(
    X, treatment_series, y, stratify=pd.concat([treatment_series, y], axis=1), test_size=0.2, random_state=2024
)

### **Train the Two Model Approach**

In [9]:
from sklift.models import TwoModels

estimator_trmnt = CatBoostClassifier(
    thread_count=2, random_state=42, auto_class_weights='Balanced', iterations=1000, task_type="GPU", verbose=100
)

estimator_ctrl = CatBoostClassifier(
    thread_count=2, random_state=42, auto_class_weights='Balanced', iterations=1000, task_type="GPU", verbose=100
)

two_model = TwoModels(
    estimator_trmnt=estimator_trmnt,
    estimator_ctrl=estimator_ctrl,
    method='vanilla'
)



two_model.fit(
    X=X_train_2, y=y_train_2, treatment=trmnt_train,
    estimator_trmnt_fit_params={'cat_features': categorical_columns},
    estimator_ctrl_fit_params={'cat_features': categorical_columns}
)

Learning rate set to 0.02943
0:	learn: 0.6907771	total: 55.6ms	remaining: 55.5s
100:	learn: 0.6164030	total: 5.04s	remaining: 44.8s
200:	learn: 0.5913467	total: 10s	remaining: 39.8s
300:	learn: 0.5741458	total: 14.9s	remaining: 34.6s
400:	learn: 0.5567438	total: 19.8s	remaining: 29.5s
500:	learn: 0.5401676	total: 24.7s	remaining: 24.6s
600:	learn: 0.5230515	total: 29.5s	remaining: 19.6s
700:	learn: 0.5093931	total: 34.4s	remaining: 14.7s
800:	learn: 0.4964616	total: 39.4s	remaining: 9.78s
900:	learn: 0.4852316	total: 44.3s	remaining: 4.87s
999:	learn: 0.4748222	total: 49.2s	remaining: 0us
Learning rate set to 0.025021
0:	learn: 0.6903647	total: 27ms	remaining: 26.9s
100:	learn: 0.6210514	total: 2.5s	remaining: 22.3s
200:	learn: 0.6091461	total: 5.03s	remaining: 20s
300:	learn: 0.6027669	total: 7.5s	remaining: 17.4s
400:	learn: 0.5984655	total: 9.93s	remaining: 14.8s
500:	learn: 0.5952032	total: 12.4s	remaining: 12.3s
600:	learn: 0.5921697	total: 14.8s	remaining: 9.8s
700:	learn: 0.5895

## **6. Predictions and Evaluation**

Now, we’ll make predictions for the test data and compare the uplift results from both approaches.







### **One Model Predictions**


In [10]:
# Probability that new_target == 1, used as the One Model uplift score
preds_test_one_model = model_one.predict_proba(df_test[features])[:, 1]

### **Two Model Predictions**

In [11]:
# Pass the same feature columns, in the same order, as used for training
uplift_predictions_test = two_model.predict(df_test[features])
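
The test set has no labels, so any quantitative comparison has to happen on a validation split where outcome and treatment are known. One common metric is uplift@k: the difference in response rates between treated and control samples among the top k fraction ranked by predicted uplift. The function below is a hand-written sketch with a hypothetical name and toy inputs; scikit-uplift ships its own metrics in `sklift.metrics`, and this is not that implementation.

```python
import numpy as np

def uplift_at_k_sketch(y_true, uplift, treatment, k=0.3):
    """Response-rate gap (treated minus control) within the top-k fraction by score."""
    y_true, uplift, treatment = map(np.asarray, (y_true, uplift, treatment))
    n_top = int(len(uplift) * k)
    top = np.argsort(-uplift, kind="stable")[:n_top]  # highest scores first
    y_top, w_top = y_true[top], treatment[top]
    return y_top[w_top == 1].mean() - y_top[w_top == 0].mean()

# Toy validation data, invented for illustration.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
trmnt = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
y_val_toy = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])

score = uplift_at_k_sketch(y_val_toy, scores, trmnt, k=0.4)
print(score)
```

For the Two Model approach, the inputs would be `y_val_2`, `two_model.predict(X_val_2)`, and `trmnt_val` from the split above. The One Model validation split does not retain the treatment flag, so it would need to be split again with the treatment column kept.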

## **7. Saving Results**

We save both sets of predictions in a single CSV file for easy comparison.


In [12]:
sub = pd.read_csv('/kaggle/input/vk-2024-2e/sample_submission.csv')
sub['successful_utilization_one_model'] = preds_test_one_model
sub['successful_utilization_two_model'] = uplift_predictions_test
sub.to_csv('submission.csv', index=False)