<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Banking Customer Churn Analysis using Vantage</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>

<center><img src="images/churn.webp"/></center>

<p style = 'font-size:16px;font-family:Arial'>Source: <a href = 'https://medium.com/@islamhasabo/predicting-customer-churn-bc76f7760377'>Medium</a></p>

<p style = 'font-size:16px;font-family:Arial'>Customer churn is an important metric in banking because it can directly impact a bank's revenue and profitability. When customers leave, banks lose the revenue they would have earned from those customers' transactions, investments, and account fees. Additionally, attracting new customers to replace those who have left can be expensive and time-consuming, so reducing customer churn is often more cost-effective than acquiring new customers.</p>

<p style = 'font-size:16px;font-family:Arial'>Customer churn can also be an indicator of customer satisfaction and loyalty. If customers are leaving at a high rate, it may indicate that they are dissatisfied with the bank's products or services, customer service, or overall experience.</p>

<p style = 'font-size:16px;font-family:Arial'>Banks can use various strategies to reduce customer churn, such as improving customer service, offering more competitive rates and fees, providing personalized recommendations and offers, and enhancing digital channels and mobile apps. By tracking and analyzing customer churn rates, banks can identify areas for improvement and make strategic decisions to retain customers and improve overall customer satisfaction.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we show how the entire churn-prediction lifecycle can be implemented using Vantage technologies, specifically the combination of Bring Your Own Model (BYOM), the Vantage Analytics Library (VAL), and the teradataml Python client library.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
!pip install sklearn2pmml
!pip install jdk4py
!pip install teradataml

<p style = 'font-size:16px;font-family:Arial'>
    <i><b>*BEFORE proceeding, please RESTART the kernel to bring new software into Jupyter.</b></i>
</p>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import os
import getpass
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm, tree
from xgboost import XGBClassifier
from sklearn2pmml import sklearn2pmml
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn2pmml.pipeline import PMMLPipeline
from jdk4py import JAVA, JAVA_HOME, JAVA_VERSION
from sklearn.ensemble import RandomForestClassifier

from teradataml import *

# Modify the following to match the specific client environment settings
configure.val_install_location = 'val'
configure.byom_install_location = 'mldb'
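# Append the Java home directory to PATH
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA_HOME)
# Append the directory that holds the java executable (drops the trailing 5 characters, e.g. '/java')
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA)[:-5]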

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Let's start by connecting to the Teradata system </b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=BankingChurn.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains both statements, with one commented out. To switch modes, change which line is commented.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_BankChurn_cloud');"        # Takes 40 seconds
%run -i ../run_procedure.py "call get_data('DEMO_BankChurn_local');"        # Takes 20 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>Create a "Virtual DataFrame" that points to the data set in Vantage. Check the shape of the DataFrame and the data types of all its columns.</p>
<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

In [None]:
tdf = DataFrame(in_schema("DEMO_BankChurn", "customer_churn"))
print("Shape of the data: ", tdf.shape)
tdf.head()

In [None]:
tdf.dtypes

<p style = 'font-size:16px;font-family:Arial'>By looking at the data types and sample data, we classify the columns into an ID column, the target variable (y), and numerical, categorical, and binary features. We skip the <i>RowNumber</i> and <i>Surname</i> columns, as they are not useful in the analysis.</p>

In [None]:
y = "Exited"
num_x = ["Age","Balance","CreditScore","EstimatedSalary","Tenure"]
cat_x = ["Gender","Geography","NumOfProducts"]
bin_x = ["HasCrCard","IsActiveMember"]
idcol = ["CustomerId"]
customer_data = tdf.select(idcol +[y] + num_x + cat_x + bin_x )

In [None]:
customer_data.head(5)

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Data Transformation</b>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Transformation of string variables into flags (OneHotEncoding)</b></p>
    
<p style = 'font-size:16px;font-family:Arial'>We will use the following string or category variables as an example:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Gender</li>
    <li>Geography</li>
    <li>NumOfProducts</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>For each of them, we use the <i>OneHotEncoder</i> function to generate the set of flag columns.</p>
  
<p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> The process can be achieved using a single script. Here it has been separated into steps for didactic purposes.</i></p>
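<p style = 'font-size:16px;font-family:Arial'>As a rough local illustration of what one-hot encoding produces, the plain-Python sketch below turns category values into 0/1 flag columns. It is not the VAL <i>OneHotEncoder</i> itself. The helper <i>one_hot</i> and the sample rows are hypothetical, and the sketch assumes that each dictionary key is a category value and each dictionary value is the name of its flag column.</p>

```python
# Minimal sketch of one-hot encoding in plain Python (illustrative only;
# the notebook performs this step in-database with VAL).

def one_hot(value, mapping):
    """Return {flag_column: 0 or 1} for every category listed in mapping.

    mapping maps a category value to the name of its flag column.
    """
    return {column: int(value == category) for category, column in mapping.items()}

# Hypothetical sample rows
rows = [
    {"Gender": "Female", "Geography": "France"},
    {"Gender": "Male", "Geography": "Spain"},
]

gender_map = {"Female": "Gender"}
geography_map = {"France": "France", "Germany": "Germany", "Spain": "Spain"}

encoded = []
for row in rows:
    flags = {}
    flags.update(one_hot(row["Gender"], gender_map))
    flags.update(one_hot(row["Geography"], geography_map))
    encoded.append(flags)

for flags in encoded:
    print(flags)
```

<p style = 'font-size:16px;font-family:Arial'>Each row ends up with one flag for Gender and one flag per country, with exactly one country flag set to 1.</p>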

In [None]:
# Gender flag: 1 = Female, 0 = Male
values1 = {"Female": "Gender"}
dummy1 = OneHotEncoder(values=values1, columns="Gender")

# One 0/1 flag column per country: France, Germany, Spain
values2 = {"France": "France", "Germany": "Germany", "Spain": "Spain"}
dummy2 = OneHotEncoder(values=values2, columns="Geography")

values3 = {1: "OneProduct", 2: "TwoProduct", 3: "ThreeProduct", 4: "FourProduct"}
dummy3 = OneHotEncoder(values=values3, columns="NumOfProducts")

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Standardize for numeric variables (Z-score)</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will use the following numerical variables as an example:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Age</li>
    <li>Balance</li>
    <li>CreditScore</li>
    <li>EstimatedSalary</li>
    <li>Tenure</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>And for each of them we are going to use the <i>ZScore</i> function to generate the transformation.</p>
  
<p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> The process can be achieved using a single script. Here it has been separated into steps for didactic purposes.</i></p>
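<p style = 'font-size:16px;font-family:Arial'>To make the Z-score idea concrete, here is a minimal plain-Python sketch, separate from the VAL <i>ZScore</i> function. The helper <i>zscore</i> and the sample ages are hypothetical, and the use of the population standard deviation is an assumption of this sketch.</p>

```python
# Minimal sketch of Z-score standardization: z = (x - mean) / standard deviation.
# Uses the population standard deviation (an assumption of this sketch).
from statistics import mean, pstdev

def zscore(values):
    """Return the values rescaled to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample of customer ages
ages = [30, 40, 50]
scaled = zscore(ages)
print(scaled)
```

<p style = 'font-size:16px;font-family:Arial'>The value equal to the mean maps to 0, and the other values are expressed as a number of standard deviations below or above it.</p>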

In [None]:
# FillNa replaces missing/null values; style "mode" fills them with the most frequent value of each column.
fn = FillNa(style = "mode", columns = num_x)

# Z-Score expresses each column value as the number of standard deviations from the column mean:
# z = (x - mean) / standard deviation. This is a linear transformation.
zs = ZScore(columns = num_x,
            out_columns = num_x,
            fillna = fn)

# Keep the other variables that do not need transformation.
retain = Retain(columns=bin_x+[y])

In [None]:
# Process the transformation
df_transformed = valib.Transform(
                            data = customer_data, 
                            zscore = zs, 
                            one_hot_encode = [dummy1, dummy2, dummy3],
                            retain = retain,
                            index_columns = idcol,
                            key_columns = idcol
                         )
df_transformed.result.head(5).to_pandas()

In [None]:
# Persist the transformed result as a physical table

df_transformed.result.to_sql("BankCustomerChurn_dataset",
                        schema_name = "demo_user",
                        primary_index="CustomerId",
                        if_exists="replace")

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Train/Test Split</b></p>
<p style = 'font-size:16px;font-family:Arial'>Split the dataset into training and test sets using a split ratio of 0.8 (80% training, 20% test).</p>
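<p style = 'font-size:16px;font-family:Arial'>The idea behind the split can be sketched locally in plain Python: shuffle the row identifiers and cut the list at the chosen ratio. The helper <i>split_ids</i>, the seed, and the ten sample IDs are hypothetical. The notebook itself performs the split in Vantage.</p>

```python
# Minimal sketch of a random train/test split (illustrative only).
import random

def split_ids(ids, train_ratio, seed=42):
    """Shuffle the ids reproducibly and split them into (train, test) lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical customer identifiers
customer_ids = list(range(1, 11))
train_ids, test_ids = split_ids(customer_ids, 0.8)
print(len(train_ids), len(test_ids))
```

<p style = 'font-size:16px;font-family:Arial'>Every identifier lands in exactly one of the two sets, so no customer appears in both training and test data.</p>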

In [None]:
train_ratio = 0.8

df = DataFrame("BankCustomerChurn_dataset")
df_sample = df.sample(frac = [train_ratio, 1.0-train_ratio])

# Split into 2 virtual dataframes
df_train = df_sample[df_sample.sampleid==1].drop(["sampleid"], axis=1)
df_test = df_sample[df_sample.sampleid==2].drop(["sampleid"], axis=1)

In [None]:
# Persist the train and test samples as physical tables
df_train.to_sql("BankCustomerChurn_train",
                schema_name = "demo_user",
                primary_index="CustomerId",
                if_exists="replace")

df_test.to_sql("BankCustomerChurn_test",
               schema_name = "demo_user",
               primary_index="CustomerId",
               if_exists="replace")

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Model Training (Outside Vantage)</b>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Read the training data</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here we use the <b>to_pandas()</b> function to pull the data out of Vantage and simulate an environment where the models are trained outside Vantage.</p>

In [None]:
# Read the training table with features
df_train = DataFrame("BankCustomerChurn_train").to_pandas()

In [None]:
# Set up the target (y) and feature (X) columns
y_col = ['Exited']
x_cols = df_train.columns.to_list()
x_cols.remove('Exited')

X = df_train[x_cols]
y = df_train[y_col]

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Train a Decision Tree Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Train a basic Decision Tree model and save it in PMML format.</p>

In [None]:
pipeline = PMMLPipeline([
    ("classifier", tree.DecisionTreeClassifier())
])
pipeline.fit(X, y.values.ravel())

In [None]:
# Export the model in PMML format
sklearn2pmml(pipeline, "bankchurn_dt_model.pmml", with_repr = True)

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Train a Random Forest Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Train a basic Random Forest model and save it in PMML format.</p>

In [None]:
# Train the random forest model
pipeline = PMMLPipeline([
    ("classifier", RandomForestClassifier())
])
pipeline.fit(X, y.values.ravel())

In [None]:
# Export the model in PMML format
sklearn2pmml(pipeline, "bankchurn_rf_model.pmml", with_repr = True)

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Train an XGBoost Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Train a basic XGBoost model and save it in PMML format.</p>

In [None]:
# Train the XGBoost model
pipeline = PMMLPipeline([
     ("classifier", XGBClassifier())
     ])

pipeline.fit(X, y.values.ravel())

In [None]:
# Export the model in PMML format
sklearn2pmml(pipeline, "bankchurn_xgb_model.pmml", with_repr = True)

<p style = 'font-size:16px;font-family:Arial'>Here we load all three PMML model files into a table in Vantage, which enables in-database scoring in the next section.</p>

In [None]:
# Load the PMML files into Vantage

model_ids = ['dt', 'rf', 'xgb']
model_files = ['bankchurn_dt_model.pmml', 'bankchurn_rf_model.pmml', 'bankchurn_xgb_model.pmml']
table_name = 'bank_models'

for model_id, model_file in zip(model_ids, model_files):
    try:
        res = save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
    except Exception as e:
        # If the model already exists (error TDML_2200), delete it and save it again
        if 'TDML_2200' in str(e.args):
            res = delete_byom(model_id = model_id, table_name = table_name)
            res = save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
        else:
            raise
# Show the bank_models table
list_byom('bank_models')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Model Scoring (Inside Vantage)</b>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Scoring Decision Tree Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Score the Decision Tree model, which is stored in the bank_models table in PMML format, using Vantage's PMMLPredict functionality. All the scoring is done in-database inside Vantage.</p>

In [None]:
%%time
# Obtain a pointer to the model
table_name = 'bank_models'
model_id = 'dt'
model_tdf = DataFrame.from_query(f"SELECT * FROM {table_name} WHERE model_id = '{model_id}'")
result_dt = PMMLPredict(
            modeldata = model_tdf,
            newdata = df_test,
            accumulate = ['CustomerId', 'Exited'],
            model_output_fields = ['probability(1)', 'probability(0)']
            )

In [None]:
%%time
# Create a local pandas dataframe of the results
result_dt_pandas = result_dt.result.to_pandas(all_rows = True)
result_dt_pandas.head(5)

<p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> When scoring a classification model that does not return a predicted value, the prediction output column may be empty. When scoring a regression model, or any model that produces a single output field, the prediction column will contain a value.</i></p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Scoring Random Forest Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Score the Random Forest model, which is stored in the bank_models table in PMML format, using Vantage's PMMLPredict functionality. All the scoring is done in-database inside Vantage.</p>

In [None]:
%%time
# Obtain a pointer to the model
table_name = 'bank_models'
model_id = 'rf'
model_tdf = DataFrame.from_query(f"SELECT * FROM {table_name} WHERE model_id = '{model_id}'")
result_rf = PMMLPredict(
            modeldata = model_tdf,
            newdata = df_test,
            accumulate = ['CustomerId', 'Exited'],
            model_output_fields = ['probability(1)', 'probability(0)']
            )

In [None]:
%%time
# Create a local pandas dataframe of the results
result_rf_pandas = result_rf.result.to_pandas(all_rows = True)
result_rf_pandas.head(5)

In [None]:
%%time
result_rf.result.head(5)

<p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> When scoring a classification model that does not return a predicted value, the prediction output column may be empty. When scoring a regression model, or any model that produces a single output field, the prediction column will contain a value.</i></p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Scoring XGBoost Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Score the XGBoost model, which is stored in the bank_models table in PMML format, using Vantage's PMMLPredict functionality. All the scoring is done in-database inside Vantage.</p>

In [None]:
# Obtain a pointer to the model
table_name = 'bank_models'
model_id = 'xgb'
model_tdf = DataFrame.from_query(f"SELECT * FROM {table_name} WHERE model_id = '{model_id}'")
result_xgb = PMMLPredict(
            modeldata = model_tdf,
            newdata = df_test,
            accumulate = ['CustomerId', 'Exited'],
            model_output_fields = ['probability(1)', 'probability(0)']
            )

In [None]:
%%time
# Create a local pandas dataframe of the results
result_xgb_pandas = result_xgb.result.to_pandas(all_rows = True)
result_xgb_pandas.head(5)

<p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> When scoring a classification model that does not return a predicted value, the prediction output column may be empty. When scoring a regression model, or any model that produces a single output field, the prediction column will contain a value.</i></p>
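<p style = 'font-size:16px;font-family:Arial'>Because the prediction column may be empty for classifiers, a class label can be derived from the returned probability instead. The plain-Python sketch below assumes a 0.5 cutoff. The helper <i>predict_label</i> and the scored rows are hypothetical, and the probabilities are converted with float() in case they arrive as text.</p>

```python
# Minimal sketch: derive a predicted class from the 'probability(1)' field.
# The rows and the 0.5 threshold are assumptions for illustration.

def predict_label(probability_of_churn, threshold=0.5):
    """Return 1 (churn) when the probability reaches the threshold, else 0."""
    return int(float(probability_of_churn) >= threshold)

# Hypothetical scored rows, using the same column names as the scoring output
scored_rows = [
    {"CustomerId": 1, "Exited": 0, "probability(1)": "0.12"},
    {"CustomerId": 2, "Exited": 1, "probability(1)": "0.81"},
    {"CustomerId": 3, "Exited": 1, "probability(1)": "0.40"},
]

predictions = [predict_label(row["probability(1)"]) for row in scored_rows]
correct = sum(p == row["Exited"] for p, row in zip(predictions, scored_rows))
accuracy = correct / len(scored_rows)
print(predictions, round(accuracy, 2))
```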

<p style = 'font-size:16px;font-family:Arial'>The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across classification thresholds. The area under the ROC curve (AUC) measures how well the model distinguishes between the positive and negative classes: the higher the AUC, the better. An AUC above 0.75 is generally considered decent.</p>
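<p style = 'font-size:16px;font-family:Arial'>One way to build intuition for AUC is its pairwise interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The plain-Python sketch below computes it that way on hypothetical labels and scores. The helper <i>roc_auc</i> is illustrative and is not the scikit-learn implementation used in this notebook.</p>

```python
# Minimal sketch: AUC as the share of (positive, negative) pairs ranked correctly.

def roc_auc(labels, scores):
    """Return the AUC for binary labels (0/1) and their predicted scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and churn probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

<p style = 'font-size:16px;font-family:Arial'>A model that ranks every churner above every non-churner reaches an AUC of 1.0, while random scoring gives about 0.5.</p>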

In [None]:
# ROC curve for Decision Tree model
fpr_dt, tpr_dt, thresholds_dt = roc_curve(result_dt_pandas['Exited'], result_dt_pandas['probability(1)'])
auc_dt = roc_auc_score(result_dt_pandas['Exited'], result_dt_pandas['probability(1)'])
plt.plot(fpr_dt, tpr_dt, color='orange', label='Decision Tree ROC. AUC = {}'.format(str(round(auc_dt, 4))))

# ROC curve for Random Forest model
fpr_rf, tpr_rf, thresholds_rf = roc_curve(result_rf_pandas['Exited'], result_rf_pandas['probability(1)'])
auc_rf = roc_auc_score(result_rf_pandas['Exited'], result_rf_pandas['probability(1)'])
plt.plot(fpr_rf, tpr_rf, color='cyan', label='Random Forest ROC. AUC = {}'.format(str(round(auc_rf, 4))))

# ROC curve for XGB
fpr_xgb, tpr_xgb, thresholds_xgb = roc_curve(result_xgb_pandas['Exited'], result_xgb_pandas['probability(1)'])
auc_xgb = roc_auc_score(result_xgb_pandas['Exited'], result_xgb_pandas['probability(1)'])
plt.plot(fpr_xgb, tpr_xgb, color='green', label='XGB ROC. AUC = {}'.format(str(round(auc_xgb, 4))))


# Diagonal reference line: the expected ROC of a random classifier
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()


<p style = 'font-size:16px;font-family:Arial'>Looking at the ROC curves above, we can say that our models perform well on the test data, and the high AUC scores support this. Among the models we used, Random Forest performs best in terms of AUC score.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors the next time the notebook is run.</p>

In [None]:
eng.execute("DROP TABLE bank_models;")

In [None]:
eng.execute("DROP TABLE BankCustomerChurn_dataset;")

In [None]:
eng.execute("DROP TABLE BankCustomerChurn_train;")

In [None]:
eng.execute("DROP TABLE BankCustomerChurn_test;")

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_BankChurn');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Dataset:</b>

- `Surname`: Surname
- `CreditScore`: Credit score
- `Geography`: Country (Germany / France / Spain)
- `Gender`: Gender (Female / Male)
- `Age`: Age
- `Tenure`: Number of years the customer has been associated with the bank
- `Balance`: Balance
- `NumOfProducts`: Number of bank products used
- `HasCrCard`: Credit card status (0 = No, 1 = Yes)
- `IsActiveMember`: Active membership status (0 = No, 1 = Yes)
- `EstimatedSalary`: Estimated salary
- `Exited`: Whether the customer left the bank (0 = No, 1 = Yes)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>