Predict whether a customer was influenced by Facebook marketing

Model ranking by F1 score on the positive class (FB_influence = 1)  

CatBoost: 73%  
XGBoost: 69%  
Random Forest: 66%  
KNN: 64%  
Naive Bayes: 63%  
Decision Tree: 59%     
Logistic Regression: 58%  
SVM: 56%  

Best CatBoost with Tuning: 76% (but class 0 recall drops to 0.00, so the tuned model predicts class 1 for nearly every test customer)  
Best Parameters: 'depth': 6, 'iterations': 200, 'l2_leaf_reg': 3, 'learning_rate': 0.01

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix


In [6]:
data = pd.read_csv("../data/TechCorner_Sales_converted.csv", index_col=0)

In [3]:
data.columns

Index(['Date', 'Cus_Location', 'Age', 'Gender', 'SellPrice', 'from_FB',
       'follows_page', 'bought_before', 'heard_of_shop', 'is_local', 'is_male',
       'Mobile Name_Galaxy M35 5G 8/128',
       'Mobile Name_Galaxy S24 Ultra 12/256', 'Mobile Name_Moto G85 5G 8/128',
       'Mobile Name_Narzo N53 4/64', 'Mobile Name_Note 11S 6/128',
       'Mobile Name_Note 14 Pro 5G 8/256', 'Mobile Name_Pixel 7a 8/128',
       'Mobile Name_Pixel 8 Pro 12/256', 'Mobile Name_R-70 Turbo 5G 6/128',
       'Mobile Name_Redmi Note 12 Pro 8/128', 'Mobile Name_Vivo T3x 5G 8/128',
       'Mobile Name_Vivo Y200 5G 6/128', 'Mobile Name_iPhone 16 Pro 256GB',
       'Mobile Name_iPhone 16 Pro Max 1TB',
       'Mobile Name_iQOO Neo 9 Pro 5G 12/256', 'Mobile Name_iQOO Z7 5G 6/128'],
      dtype='object')

Define Target and Features

In [7]:
# Facebook influence: whether the customer came from Facebook or follows the page
data['FB_influence'] = (data['from_FB'] | data['follows_page']).astype(int)

# Target column
y = data['FB_influence']

# Drop the target, the two columns it is derived from (to avoid leakage), and raw columns not used as features
X = data.drop(columns=['Date', 'Cus_Location', 'Gender', 'from_FB', 'follows_page', 'FB_influence'])
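
To make the target definition concrete, here is a minimal sketch on a toy frame. The four rows are made up for illustration and assume `from_FB` and `follows_page` are stored as 0/1 integers (boolean columns would behave the same way under `|`).

```python
import pandas as pd

# Hypothetical rows, not taken from the TechCorner dataset.
toy = pd.DataFrame({
    "from_FB":      [1, 0, 0, 1],
    "follows_page": [0, 1, 0, 1],
})

# A customer counts as Facebook-influenced if either flag is set.
toy["FB_influence"] = (toy["from_FB"] | toy["follows_page"]).astype(int)
print(toy["FB_influence"].tolist())  # [1, 1, 0, 1]
```

Only the third row, where neither flag is set, ends up in the negative class.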

Train-Test Split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

In [9]:
# Check class imbalance
print(y_train.value_counts())

FB_influence
1    4319
0    2777
Name: count, dtype: int64
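
As a quick check on the split, the class counts printed above can be compared with the test-set supports that appear in the classification reports further down (695 negatives, 1080 positives). The snippet only redoes that arithmetic on numbers copied from the notebook output; it does not touch the data.

```python
# Counts copied from the notebook output.
train_pos, train_neg = 4319, 2777
test_pos, test_neg = 1080, 695

train_share = train_pos / (train_pos + train_neg)
test_share = test_pos / (test_pos + test_neg)

# stratify=y should keep the positive share nearly identical in both splits.
print(f"train positive share: {train_share:.3f}")
print(f"test positive share:  {test_share:.3f}")
```

Both shares come out at roughly 0.61, so a model that always predicts class 1 would already reach about 61% accuracy. That is a useful reference point when reading the reports.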


Fit the scaler on the training set's Age and SellPrice only, then apply the same transform to the test set

In [10]:
sc = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[['Age', 'SellPrice']] = sc.fit_transform(X_train_scaled[['Age', 'SellPrice']])
X_test_scaled[['Age', 'SellPrice']] = sc.transform(X_test_scaled[['Age', 'SellPrice']])
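
An alternative to scaling by hand, sketched here under assumptions, is to bundle the scaler and the classifier in a scikit-learn `Pipeline` so the same transform is always applied before prediction. The toy data, the `is_local` flag values, and the choice of `LogisticRegression` are illustrative stand-ins, not the notebook's actual setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same column roles as the notebook.
X_toy = pd.DataFrame({
    "Age":       [22, 35, 41, 29, 53, 38],
    "SellPrice": [18000, 95000, 24000, 61000, 150000, 32000],
    "is_local":  [1, 0, 1, 1, 0, 0],
})
y_toy = [1, 0, 1, 1, 0, 0]

# Scale only the numeric columns; pass the 0/1 flags through unchanged.
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["Age", "SellPrice"])],
    remainder="passthrough",
)

# LogisticRegression is a stand-in; any classifier from the notebook could go here.
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_toy, y_toy)
preds = pipe.predict(X_toy)
print(preds)
```

With this layout the scaler is fitted only on whatever data is passed to `fit`, which also keeps cross-validation folds from leaking scaling statistics.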

Train each classifier and compare classification reports and confusion matrices  
Focus on F1 and recall for the positive class (FB_influence = 1)

In [11]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
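
The per-model cells that follow repeat the same fit, predict, and report steps. One possible way to produce the ranking in a single pass is a loop over a dictionary of models. The sketch below uses synthetic data from `make_classification` and only scikit-learn models so it runs on its own; in the notebook, `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` would take the place of the demo arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the TechCorner sales data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record F1 for the positive class (label 1).
scores = {}
for name, clf in models.items():
    clf.fit(Xtr, ytr)
    scores[name] = f1_score(yte, clf.predict(Xte), pos_label=1)

# Rank from best to worst positive-class F1.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```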

***Decision Tree***

In [None]:
# Decision Tree
model = tree.DecisionTreeClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.38      0.40      0.39       695
           1       0.60      0.58      0.59      1080

    accuracy                           0.51      1775
   macro avg       0.49      0.49      0.49      1775
weighted avg       0.51      0.51      0.51      1775

Confusion Matrix:
 [[275 420]
 [449 631]]


In [17]:
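# Decision Tree with SMOTE
# Assumption: X_train_sm_scaled / y_train_sm (undefined in this notebook) are the
# scaled training data oversampled with imbalanced-learn's SMOTE. The supports in the
# report below (1336/439) differ from this test split (695/1080), so it is from another run.
from imblearn.over_sampling import SMOTE
X_train_sm_scaled, y_train_sm = SMOTE(random_state=7).fit_resample(X_train_scaled, y_train)
model = tree.DecisionTreeClassifier()
model.fit(X_train_sm_scaled, y_train_sm)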

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.66      0.70      1336
           1       0.23      0.32      0.27       439

    accuracy                           0.57      1775
   macro avg       0.49      0.49      0.48      1775
weighted avg       0.62      0.57      0.59      1775

Confusion Matrix:
 [[878 458]
 [299 140]]


***Naive Bayes***

In [None]:
# Naive Bayes
model = GaussianNB()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.39      0.34      0.37       695
           1       0.61      0.66      0.63      1080

    accuracy                           0.53      1775
   macro avg       0.50      0.50      0.50      1775
weighted avg       0.52      0.53      0.53      1775

Confusion Matrix:
 [[239 456]
 [371 709]]


***KNN***

In [None]:
# KNN
model = KNeighborsClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.38      0.31      0.34       695
           1       0.60      0.68      0.64      1080

    accuracy                           0.53      1775
   macro avg       0.49      0.50      0.49      1775
weighted avg       0.52      0.53      0.52      1775

Confusion Matrix:
 [[217 478]
 [348 732]]


***Random Forest***

In [None]:
# Random Forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.40      0.29      0.34       695
           1       0.61      0.71      0.66      1080

    accuracy                           0.55      1775
   macro avg       0.50      0.50      0.50      1775
weighted avg       0.53      0.55      0.53      1775

Confusion Matrix:
 [[205 490]
 [312 768]]


***Logistic Regression***

In [None]:
# Logistic Regression
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.40      0.47      0.44       695
           1       0.62      0.55      0.58      1080

    accuracy                           0.52      1775
   macro avg       0.51      0.51      0.51      1775
weighted avg       0.53      0.52      0.53      1775

Confusion Matrix:
 [[329 366]
 [485 595]]


***XGBoost***

In [None]:
# XGBoost
model = XGBClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.43      0.26      0.32       695
           1       0.62      0.78      0.69      1080

    accuracy                           0.58      1775
   macro avg       0.53      0.52      0.51      1775
weighted avg       0.55      0.58      0.55      1775

Confusion Matrix:
 [[180 515]
 [238 842]]


***SVM***

In [None]:
# SVM
model = SVC(probability=True, class_weight='balanced')
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.39      0.48      0.43       695
           1       0.61      0.52      0.56      1080

    accuracy                           0.50      1775
   macro avg       0.50      0.50      0.50      1775
weighted avg       0.52      0.50      0.51      1775

Confusion Matrix:
 [[336 359]
 [520 560]]


***CatBoost***

In [23]:
# CatBoost
model = CatBoostClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

Learning rate set to 0.023786
0:	learn: 0.6918958	total: 6.23ms	remaining: 6.22s
1:	learn: 0.6906033	total: 12ms	remaining: 6s
2:	learn: 0.6894085	total: 18ms	remaining: 5.97s
3:	learn: 0.6882916	total: 23.7ms	remaining: 5.89s
4:	learn: 0.6873131	total: 29.5ms	remaining: 5.87s
5:	learn: 0.6863368	total: 35.6ms	remaining: 5.89s
6:	learn: 0.6854571	total: 41.2ms	remaining: 5.84s
7:	learn: 0.6845519	total: 46.7ms	remaining: 5.79s
8:	learn: 0.6837727	total: 50.2ms	remaining: 5.52s
9:	learn: 0.6829118	total: 55.3ms	remaining: 5.47s
10:	learn: 0.6820614	total: 60.7ms	remaining: 5.45s
11:	learn: 0.6812561	total: 65.9ms	remaining: 5.42s
12:	learn: 0.6805262	total: 70.3ms	remaining: 5.33s
13:	learn: 0.6797983	total: 75.7ms	remaining: 5.33s
14:	learn: 0.6790498	total: 80.2ms	remaining: 5.26s
15:	learn: 0.6783511	total: 85.1ms	remaining: 5.23s
16:	learn: 0.6777235	total: 90.4ms	remaining: 5.22s
17:	learn: 0.6771466	total: 96.8ms	remaining: 5.28s
18:	learn: 0.6766045	total: 112ms	remaining: 5.81s


In [24]:
# CatBoost
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.42      0.10      0.17       695
           1       0.61      0.91      0.73      1080

    accuracy                           0.59      1775
   macro avg       0.52      0.51      0.45      1775
weighted avg       0.54      0.59      0.51      1775

Confusion Matrix:
 [[ 72 623]
 [ 98 982]]
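
The 73% figure in the ranking can be reproduced directly from the confusion matrix above. The snippet below is a worked check using only the four counts printed by the notebook.

```python
# Confusion matrix copied from the CatBoost output: rows are true labels,
# columns are predicted labels, both in the order [0, 1].
tn, fp = 72, 623
fn, tp = 98, 982

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.61 recall=0.91 f1=0.73
```

The high recall comes at the cost of 623 false positives out of 695 actual negatives, which is why class 0 recall is only 0.10.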


Tune CatBoost model with GridSearchCV

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

In [26]:
# Define hyperparameter grid
param_grid = {
    'iterations': [100, 200],
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1],
    'l2_leaf_reg': [1, 3, 5],
}

In [27]:
# Set up GridSearchCV
grid = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=2
)
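
With `scoring='f1'` the search optimizes F1 on class 1 only, so a candidate that predicts 1 for almost every customer can win. One option, offered here as an assumption rather than as the notebook's method, is a metric that weighs both classes, such as `'f1_macro'` or `'balanced_accuracy'`. The sketch uses synthetic data and a `DecisionTreeClassifier` as a lightweight stand-in for `CatBoostClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly the notebook's 39/61 class mix.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.39, 0.61], random_state=7
)

# DecisionTreeClassifier stands in for CatBoostClassifier to keep the demo fast.
demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=7),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="f1_macro",  # unweighted mean of the per-class F1 scores
    cv=5,
)
demo_grid.fit(X_demo, y_demo)
print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
```

Under macro F1, a model that never predicts class 0 scores 0 on that class, which pulls its average down and makes it less likely to be selected.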

In [28]:
# Fit to training data
grid.fit(X_train_scaled, y_train)


Fitting 5 folds for each of 36 candidates, totalling 180 fits
0:	learn: 0.6926203	total: 6.63ms	remaining: 1.32s
1:	learn: 0.6920560	total: 13.4ms	remaining: 1.33s
2:	learn: 0.6915256	total: 20.7ms	remaining: 1.36s
3:	learn: 0.6910173	total: 26.5ms	remaining: 1.3s
4:	learn: 0.6905557	total: 31.5ms	remaining: 1.23s
5:	learn: 0.6900859	total: 37.3ms	remaining: 1.21s
6:	learn: 0.6896481	total: 42.3ms	remaining: 1.17s
7:	learn: 0.6891687	total: 47ms	remaining: 1.13s
8:	learn: 0.6887518	total: 53.4ms	remaining: 1.13s
9:	learn: 0.6883037	total: 58.1ms	remaining: 1.1s
10:	learn: 0.6878877	total: 62.5ms	remaining: 1.07s
11:	learn: 0.6875008	total: 66.1ms	remaining: 1.03s
12:	learn: 0.6870664	total: 71.6ms	remaining: 1.03s
13:	learn: 0.6866516	total: 77ms	remaining: 1.02s
14:	learn: 0.6862351	total: 81.5ms	remaining: 1s
15:	learn: 0.6858239	total: 87.6ms	remaining: 1.01s
16:	learn: 0.6854331	total: 92ms	remaining: 990ms
17:	learn: 0.6850503	total: 96.5ms	remaining: 976ms
18:	learn: 0.6846742	to

In [29]:
# Best model
best_model = grid.best_estimator_

# Predict
y_pred = best_model.predict(X_test_scaled)

# Evaluate
print("Best Parameters:", grid.best_params_)
print("F1 Score:", f1_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Best Parameters: {'depth': 6, 'iterations': 200, 'l2_leaf_reg': 3, 'learning_rate': 0.01}
F1 Score: 0.7561317449194114
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       695
           1       0.61      1.00      0.76      1080

    accuracy                           0.61      1775
   macro avg       0.30      0.50      0.38      1775
weighted avg       0.37      0.61      0.46      1775
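
To put the tuned score in context, it helps to compute what a constant "always predict 1" baseline would score on the same test set. The sketch below builds label vectors from the supports in the report (695 and 1080) and ignores the features entirely.

```python
from sklearn.metrics import f1_score

# Test-set supports copied from the classification report.
n_neg, n_pos = 695, 1080
y_true = [0] * n_neg + [1] * n_pos

# Baseline that ignores the features and always predicts class 1.
y_always_one = [1] * (n_neg + n_pos)

baseline_f1 = f1_score(y_true, y_always_one)
print(round(baseline_f1, 4))  # 0.7566
```

The baseline's F1 of about 0.757 is marginally above the tuned model's reported 0.756, which suggests the grid search gained nothing over a constant prediction.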



Export trained model  
Because the tuned model assigns almost no test customers to the minority class (class 0 recall is 0.00), the default CatBoost model is exported instead

In [30]:
import joblib

# joblib.dump(best_model, 'phone_customer_FB.pkl')
joblib.dump(model, 'phone_customer_FB.pkl')

['phone_customer_FB.pkl']
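
The exported model expects `Age` and `SellPrice` to be scaled the same way as in training, but the file above contains only the model. A possible approach, sketched under assumptions, is to save the fitted scaler alongside it. The demo objects, the dictionary keys, and the file name `demo_bundle.pkl` are hypothetical, and `LogisticRegression` stands in for the CatBoost model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the notebook's fitted scaler and model.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Save both objects in one file so they cannot drift apart.
path = os.path.join(tempfile.mkdtemp(), "demo_bundle.pkl")
joblib.dump({"scaler": demo_scaler, "model": demo_model}, path)

# Reload and confirm the round trip gives identical predictions.
bundle = joblib.load(path)
reloaded = bundle["model"].predict(bundle["scaler"].transform(X_demo))
original = demo_model.predict(demo_scaler.transform(X_demo))
print((reloaded == original).all())
```

In the notebook this would amount to dumping `sc` together with `model`, so that whoever loads the file can transform new customers before calling `predict`.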