# Client Churn Prediction
### CRISP-DM Cycle 4
---
Top Bank operates in Europe, and its main product is a bank account that clients can use to receive their salary and make payments. The account is free for the first 12 months; after that trial period, the client must renew the contract for the next 12 months and repeat the renewal every year. Recently, the Analytics Team noticed that the churn rate is increasing.

As a Data Science Consultant, you need to create an action plan to reduce the number of churning customers and show the financial return of your solution.
At the end of your consultancy, you need to deliver to the Top Bank CEO a model in production that receives a customer base via API and returns that same base with an extra column containing each customer's probability of churning.
In addition, you will need to provide a report on your model's performance and the financial impact of your solution. Questions that the CEO and the Analytics Team would like answered in the report:

1.  What is Top Bank's current churn rate?
2.  How does the churn rate vary monthly?
3.  What is the model's performance in classifying customers as churn?
4.  What is the expected return, in terms of revenue, if the company uses the model to prevent customer churn?

> Disclaimer: This is a fictional business case

## 0. Preparation

### 0.1 Imports & Settings

In [1]:
from IPython.display import HTML, display
from pathlib import Path
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import optuna
import numpy as np
from catboost import CatBoostClassifier


In [2]:
def jupyter_settings():
    """
    Plots pre settings.
    """

    %matplotlib inline
    plt.style.use("seaborn-v0_8-whitegrid")
    plt.rcParams["figure.figsize"] = [25, 12]
    plt.rcParams["font.size"] = 24
    display(HTML("<style>.container {width:100% !important;}</style>"))
    sns.set()


jupyter_settings()

seed = 42

### 0.3 Path

In [3]:
# locate the main project folders
path = Path().resolve().parent
data_path = path / "data/processed"

### 0.4 Data

This dataset is available [here](https://www.kaggle.com/mervetorkan/churndataset).


**Data fields**

- **RowNumber**: the row number
- **CustomerID**: unique identifier of the client
- **Surname**: client's last name
- **CreditScore**: the client's credit score in the financial market
- **Geography**: the country of the client
- **Gender**: the gender of the client
- **Age**: the client's age
- **Tenure**: number of years the client has been with the bank
- **Balance**: the amount that the client has in their account
- **NumOfProducts**: the number of products that the client bought
- **HasCrCard**: whether the client has a credit card
- **IsActiveMember**: whether the client is active (within the last 12 months)
- **EstimatedSalary**: estimate of the client's annual salary
- **Exited**: whether the client churned (*target variable*)
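
As a small illustration of how the target column relates to the first business question, the overall churn rate is the share of rows where `Exited` equals 1. The sketch below uses a made-up list of labels (not the real dataset), and `churn_rate` is a hypothetical helper name.

```python
# Hypothetical labels: 1 = client churned, 0 = client stayed.
exited = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]


def churn_rate(labels):
    """Share of clients whose Exited flag equals 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


rate = churn_rate(exited)
print(f"Churn rate: {rate:.1%}")  # Churn rate: 20.0%
```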

In [4]:
X_train = pd.read_parquet(data_path / "X_train_fs.parquet")
X_test = pd.read_parquet(data_path / "X_test_fs.parquet")
X_val = pd.read_parquet(data_path / "X_val_fs.parquet")
y_train = pd.read_pickle(data_path / "y_train.pkl")
y_test = pd.read_pickle(data_path / "y_test.pkl")
y_val = pd.read_pickle(data_path / "y_val.pkl")

### 0.5 Optuna Trial

In [5]:
from sklearn.metrics import recall_score, precision_score, f1_score
from sklearn.model_selection import StratifiedKFold


def catboost_objective(
    trial: optuna.trial.Trial,
    X: pd.DataFrame,
    y: pd.DataFrame,
    weight: float,
    threshold: float,
    kfold: int = 5,
    selected_score: str = "recall",
):
    """Objective function for Optuna optimization.

    Args:
        trial (optuna.trial.Trial): Trial object that suggests the hyperparameters.
        X (dataframe): Dataframe with training data.
        y (dataframe): Dataframe with target data.
        weight (float): Value passed to CatBoost as scale_pos_weight.
        threshold (float): Probability cutoff for predicting the positive class.
        kfold (int): Number of cross-validation folds.
        selected_score (str): Score to optimize: "recall", "precision" or "f1".

    Returns:
        float: Mean of the selected score across the folds.
    """

    # Catboost parameters
    param_grid = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 3, 10),
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000, step=100),
    }

    model = CatBoostClassifier(scale_pos_weight=weight, verbose=False, **param_grid)

    # Stratified Kfold
    folds = StratifiedKFold(n_splits=kfold)

    # List of the selected score for each fold
    score_list = []

    for train_cv, val_cv in folds.split(X, y):
        # Split into train and validation
        X_train_fold = X.iloc[train_cv]
        y_train_fold = y.iloc[train_cv]
        X_val_fold = X.iloc[val_cv]
        y_val_fold = y.iloc[val_cv]

        # Train the model
        model.fit(X_train_fold, y_train_fold)

        # Predict the validation fold
        y_pred_val = model.predict_proba(X_val_fold)[:, 1]

        if selected_score == "recall":
            score_val = recall_score(y_val_fold, y_pred_val >= threshold)
        elif selected_score == "precision":
            score_val = precision_score(y_val_fold, y_pred_val >= threshold)
        elif selected_score == "f1":
            score_val = f1_score(y_val_fold, y_pred_val >= threshold)
        else:
            # Fail fast: otherwise score_val would be undefined below
            raise ValueError('selected_score must be "recall", "precision" or "f1"')

        # Add to the list
        score_list.append(score_val)

    mean_score = np.mean(score_list)

    return mean_score
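
To make the role of `threshold` in the objective concrete, here is a minimal, dependency-free sketch of how cutting the predicted probabilities at different values trades recall against precision. The labels and probabilities are made up for illustration, and `scores_at_threshold` is a hypothetical helper; the notebook itself relies on the scikit-learn metric functions.

```python
def scores_at_threshold(y_true, y_proba, threshold):
    """Return (recall, precision, f1) after cutting probabilities at `threshold`."""
    y_pred = [int(p >= threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return recall, precision, f1


# Made-up labels and churn probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_proba = [0.90, 0.48, 0.44, 0.49, 0.45, 0.30, 0.20, 0.10]

for threshold in (0.43, 0.47, 0.5):
    recall, precision, f1 = scores_at_threshold(y_true, y_proba, threshold)
    print(
        f"threshold={threshold:.2f} "
        f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}"
    )
```

On this toy data, raising the threshold from 0.43 to 0.5 lowers recall and raises precision, so the cutoff is worth comparing alongside the hyperparameters rather than fixing it at 0.5 by default.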

## 5. Hyperparameter Fine Tuning

### 5.1 Bayesian Search

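Bayesian optimization is a sequential search strategy: it uses the results of previous trials to decide which hyperparameters to try next, which usually requires fewer evaluations than grid or random search. In this case, Optuna was used (its default sampler is the Tree-structured Parzen Estimator). Each study below runs 70 trials and maximizes the cross-validated mean of one metric (recall, precision, or F1) at a fixed classification threshold.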

#### 5.1.1 Threshold 0.43

In [6]:
proportion = float(len(y_train[y_train == 0])) / len(y_train[y_train == 1])
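
The ratio above is the number of negative (retained) examples divided by the number of positive (churned) examples, a commonly suggested value for a `scale_pos_weight`-style parameter on imbalanced data. A minimal worked example, using a made-up target vector and the hypothetical name `toy_proportion`:

```python
from collections import Counter

# Hypothetical target vector: 8 retained clients (0) and 2 churned clients (1).
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

counts = Counter(y)
if counts[1] == 0:
    raise ValueError("no positive (churn) examples; the ratio is undefined")

# Negative-to-positive ratio: each churned client weighs as much as 4 retained ones.
toy_proportion = counts[0] / counts[1]
print(toy_proportion)  # 4.0
```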

In [7]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(trial, X_train, y_train, proportion, 0.43)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_best_params = cb_study.best_params

cb_best_params

[I 2024-08-18 20:14:30,884] A new study created in memory with name: no-name-b589822b-f509-4fd8-8e70-e35b9bf27c1a
[I 2024-08-18 20:14:32,848] Trial 4 finished with value: 0.8050717703349282 and parameters: {'learning_rate': 0.0828846412106792, 'depth': 4, 'n_estimators': 100}. Best is trial 4 with value: 0.8050717703349282.
[I 2024-08-18 20:14:36,356] Trial 8 finished with value: 0.8001693043798307 and parameters: {'learning_rate': 0.04820793668202971, 'depth': 3, 'n_estimators': 300}. Best is trial 4 with value: 0.8050717703349282.
[I 2024-08-18 20:14:39,071] Trial 6 finished with value: 0.8064801864801865 and parameters: {'learning_rate': 0.019762640038526593, 'depth': 3, 'n_estimators': 700}. Best is trial 6 with value: 0.8064801864801865.
[I 2024-08-18 20:14:40,800] Trial 1 finished with value: 0.7419506808980494 and parameters: {'learning_rate': 0.05409051977091841, 'depth': 5, 'n_estimators': 500}. Best is trial 6 with value: 0.8064801864801865.
[I 2024-08-18 20:14:43,010] Trial 

{'learning_rate': 0.010732547330507854, 'depth': 3, 'n_estimators': 100}

In [8]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(
    trial, X_train, y_train, proportion, 0.43, selected_score="precision"
)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_precision_best_params = cb_study.best_params

cb_precision_best_params

[I 2024-08-18 20:21:26,513] A new study created in memory with name: no-name-8eb10120-1c54-49b5-87f5-87f7d28d42d4
[I 2024-08-18 20:21:28,360] Trial 4 finished with value: 0.4475268258063281 and parameters: {'learning_rate': 0.06759275692624069, 'depth': 4, 'n_estimators': 100}. Best is trial 4 with value: 0.4475268258063281.
[I 2024-08-18 20:21:38,434] Trial 5 finished with value: 0.5082858799418871 and parameters: {'learning_rate': 0.06877353651313951, 'depth': 4, 'n_estimators': 700}. Best is trial 5 with value: 0.5082858799418871.
[I 2024-08-18 20:21:44,694] Trial 3 finished with value: 0.47323645983911905 and parameters: {'learning_rate': 0.022507495336396663, 'depth': 3, 'n_estimators': 1300}. Best is trial 5 with value: 0.5082858799418871.
[I 2024-08-18 20:21:47,061] Trial 6 finished with value: 0.5087661938676742 and parameters: {'learning_rate': 0.03400272539490122, 'depth': 5, 'n_estimators': 900}. Best is trial 6 with value: 0.5087661938676742.
[I 2024-08-18 20:21:49,946] Tri

{'learning_rate': 0.07790765187207421, 'depth': 9, 'n_estimators': 1500}

In [9]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(
    trial, X_train, y_train, proportion, 0.43, selected_score="f1"
)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_f1_best_params = cb_study.best_params

cb_f1_best_params

[I 2024-08-18 20:49:47,200] A new study created in memory with name: no-name-315dd68e-8df4-4bc3-91c2-bafbb541ed2f
[I 2024-08-18 20:50:05,303] Trial 3 finished with value: 0.5954823370361585 and parameters: {'learning_rate': 0.09617258495327356, 'depth': 4, 'n_estimators': 900}. Best is trial 3 with value: 0.5954823370361585.
[I 2024-08-18 20:50:10,174] Trial 4 finished with value: 0.5973848730844186 and parameters: {'learning_rate': 0.08941452600960408, 'depth': 10, 'n_estimators': 100}. Best is trial 4 with value: 0.5973848730844186.
[I 2024-08-18 20:50:14,593] Trial 1 finished with value: 0.6021577390802921 and parameters: {'learning_rate': 0.09805702976592359, 'depth': 8, 'n_estimators': 300}. Best is trial 1 with value: 0.6021577390802921.
[I 2024-08-18 20:50:23,449] Trial 5 finished with value: 0.5992225632645682 and parameters: {'learning_rate': 0.03738196397294193, 'depth': 8, 'n_estimators': 400}. Best is trial 1 with value: 0.6021577390802921.
[I 2024-08-18 20:50:29,264] Trial

{'learning_rate': 0.04069330665047851, 'depth': 9, 'n_estimators': 400}

#### 5.1.2 Threshold 0.47

In [10]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(trial, X_train, y_train, proportion, 0.47)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_best_params = cb_study.best_params

cb_best_params

[I 2024-08-18 21:04:31,854] A new study created in memory with name: no-name-cb03ccbf-201b-44a1-bb44-05a443845d47
[I 2024-08-18 21:04:43,803] Trial 4 finished with value: 0.7244190896822477 and parameters: {'learning_rate': 0.08665896694940448, 'depth': 3, 'n_estimators': 800}. Best is trial 4 with value: 0.7244190896822477.
[I 2024-08-18 21:04:59,768] Trial 6 finished with value: 0.6227358606305975 and parameters: {'learning_rate': 0.066969023369698, 'depth': 6, 'n_estimators': 800}. Best is trial 4 with value: 0.7244190896822477.
[I 2024-08-18 21:05:05,133] Trial 9 finished with value: 0.7643994601889339 and parameters: {'learning_rate': 0.03006803953496051, 'depth': 7, 'n_estimators': 100}. Best is trial 9 with value: 0.7643994601889339.
[I 2024-08-18 21:05:17,461] Trial 7 finished with value: 0.6157207704576126 and parameters: {'learning_rate': 0.05522059337026013, 'depth': 6, 'n_estimators': 1300}. Best is trial 9 with value: 0.7643994601889339.
[I 2024-08-18 21:05:23,352] Trial 5

{'learning_rate': 0.010195450583486167, 'depth': 3, 'n_estimators': 600}

In [11]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(
    trial, X_train, y_train, proportion, 0.47, selected_score="precision"
)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_precision_best_params = cb_study.best_params

cb_precision_best_params

[I 2024-08-18 21:12:07,999] A new study created in memory with name: no-name-87eabe40-cdc7-4377-aa8a-febac2d00539
[I 2024-08-18 21:12:09,901] Trial 2 finished with value: 0.489536822433598 and parameters: {'learning_rate': 0.09668800448674814, 'depth': 4, 'n_estimators': 100}. Best is trial 2 with value: 0.489536822433598.
[I 2024-08-18 21:12:12,841] Trial 3 finished with value: 0.5078086689025021 and parameters: {'learning_rate': 0.08102223172172807, 'depth': 5, 'n_estimators': 200}. Best is trial 3 with value: 0.5078086689025021.
[I 2024-08-18 21:12:22,602] Trial 4 finished with value: 0.5432276157036702 and parameters: {'learning_rate': 0.09724143244390948, 'depth': 4, 'n_estimators': 800}. Best is trial 4 with value: 0.5432276157036702.
[I 2024-08-18 21:12:32,106] Trial 9 finished with value: 0.5139989321196891 and parameters: {'learning_rate': 0.03678630167837482, 'depth': 3, 'n_estimators': 1300}. Best is trial 4 with value: 0.5432276157036702.
[I 2024-08-18 21:12:37,051] Trial 1

{'learning_rate': 0.09556004903941957, 'depth': 10, 'n_estimators': 1200}

In [12]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(
    trial, X_train, y_train, proportion, 0.47, selected_score="f1"
)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_f1_best_params = cb_study.best_params

cb_f1_best_params

[I 2024-08-18 21:51:48,721] A new study created in memory with name: no-name-db65673d-463f-4419-ae2a-8759c464ca13
[I 2024-08-18 21:52:11,869] Trial 6 finished with value: 0.6058839216859357 and parameters: {'learning_rate': 0.05615269731077033, 'depth': 5, 'n_estimators': 900}. Best is trial 6 with value: 0.6058839216859357.
[I 2024-08-18 21:52:15,557] Trial 1 finished with value: 0.5991011038723856 and parameters: {'learning_rate': 0.0735209749622004, 'depth': 3, 'n_estimators': 1700}. Best is trial 6 with value: 0.6058839216859357.
[I 2024-08-18 21:52:16,634] Trial 3 finished with value: 0.6077843794201928 and parameters: {'learning_rate': 0.024366828286805843, 'depth': 5, 'n_estimators': 1100}. Best is trial 3 with value: 0.6077843794201928.
[I 2024-08-18 21:52:19,668] Trial 9 finished with value: 0.6036330505962677 and parameters: {'learning_rate': 0.09221538689899479, 'depth': 4, 'n_estimators': 200}. Best is trial 3 with value: 0.6077843794201928.
[I 2024-08-18 21:52:27,192] Tria

{'learning_rate': 0.07930597072928085, 'depth': 7, 'n_estimators': 300}

#### 5.1.3 Threshold 0.5

In [13]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(trial, X_train, y_train, proportion, 0.5)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_best_params = cb_study.best_params

cb_best_params

[I 2024-08-18 21:59:02,829] A new study created in memory with name: no-name-81551427-82ef-41a4-9426-cbbf8244a275
[I 2024-08-18 21:59:25,085] Trial 6 finished with value: 0.7377597840755735 and parameters: {'learning_rate': 0.012802426507845693, 'depth': 4, 'n_estimators': 1100}. Best is trial 6 with value: 0.7377597840755735.
[I 2024-08-18 21:59:30,886] Trial 7 finished with value: 0.6059035701140965 and parameters: {'learning_rate': 0.08170126920484971, 'depth': 7, 'n_estimators': 500}. Best is trial 6 with value: 0.7377597840755735.
[I 2024-08-18 21:59:36,029] Trial 1 finished with value: 0.6269512943197153 and parameters: {'learning_rate': 0.09204997816975802, 'depth': 3, 'n_estimators': 2000}. Best is trial 6 with value: 0.7377597840755735.
[I 2024-08-18 21:59:36,492] Trial 2 finished with value: 0.6037958532695376 and parameters: {'learning_rate': 0.06374421320485082, 'depth': 7, 'n_estimators': 600}. Best is trial 6 with value: 0.7377597840755735.
[I 2024-08-18 21:59:40,146] Tri

{'learning_rate': 0.05632450741656245, 'depth': 3, 'n_estimators': 400}

In [14]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(
    trial, X_train, y_train, proportion, 0.5, selected_score="precision"
)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_precision_best_params = cb_study.best_params

cb_precision_best_params

[I 2024-08-18 22:02:39,475] A new study created in memory with name: no-name-c44e30a9-ea9f-4f3c-91c8-c274220316e1
[I 2024-08-18 22:02:45,497] Trial 6 finished with value: 0.5247077869141954 and parameters: {'learning_rate': 0.06869816801494746, 'depth': 7, 'n_estimators': 100}. Best is trial 6 with value: 0.5247077869141954.
[I 2024-08-18 22:02:46,111] Trial 4 finished with value: 0.527658389761165 and parameters: {'learning_rate': 0.08168336128187967, 'depth': 4, 'n_estimators': 300}. Best is trial 4 with value: 0.527658389761165.
[I 2024-08-18 22:02:57,178] Trial 2 finished with value: 0.532075582172958 and parameters: {'learning_rate': 0.046690800013954564, 'depth': 3, 'n_estimators': 1000}. Best is trial 2 with value: 0.532075582172958.
[I 2024-08-18 22:03:08,942] Trial 10 finished with value: 0.5441015827593211 and parameters: {'learning_rate': 0.059557850642675506, 'depth': 5, 'n_estimators': 400}. Best is trial 10 with value: 0.5441015827593211.
[I 2024-08-18 22:03:18,947] Trial

{'learning_rate': 0.05971996849177379, 'depth': 10, 'n_estimators': 900}

In [15]:
cb_study = optuna.create_study(direction="maximize")

func = lambda trial: catboost_objective(
    trial, X_train, y_train, proportion, 0.5, selected_score="f1"
)

cb_study.optimize(func, n_trials=70, n_jobs=-1)
cb_f1_best_params = cb_study.best_params

cb_f1_best_params

[I 2024-08-18 22:36:41,119] A new study created in memory with name: no-name-35b6bd6a-0967-411d-8a66-79bffb0f754a
[I 2024-08-18 22:36:54,811] Trial 6 finished with value: 0.6093431188345785 and parameters: {'learning_rate': 0.09552107399850795, 'depth': 3, 'n_estimators': 700}. Best is trial 6 with value: 0.6093431188345785.
[I 2024-08-18 22:36:58,352] Trial 2 finished with value: 0.6069415868999902 and parameters: {'learning_rate': 0.07686827241904728, 'depth': 9, 'n_estimators': 100}. Best is trial 6 with value: 0.6093431188345785.
[I 2024-08-18 22:37:29,578] Trial 8 finished with value: 0.6128895640431611 and parameters: {'learning_rate': 0.026150035786501852, 'depth': 6, 'n_estimators': 800}. Best is trial 8 with value: 0.6128895640431611.
[I 2024-08-18 22:37:38,228] Trial 1 finished with value: 0.5936632800695204 and parameters: {'learning_rate': 0.08588352292024885, 'depth': 6, 'n_estimators': 1300}. Best is trial 8 with value: 0.6128895640431611.
[I 2024-08-18 22:38:09,759] Tria

{'learning_rate': 0.05644301052955968, 'depth': 4, 'n_estimators': 800}