<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Banking Customer Churn Analysis using Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<center><img src="images/churn.webp"/></center>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Source: <a href = 'https://medium.com/@islamhasabo/predicting-customer-churn-bc76f7760377'>Medium</a></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Customer churn is a critical metric in banking because it can directly impact a bank's revenue and profitability. When customers leave, banks lose the income they would have earned from those customers' transactions, investments, and account fees. Additionally, attracting new customers to replace those who have left can be expensive and time-consuming, so reducing customer churn is often more cost-effective than acquiring new customers.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Customer churn can also be an indicator of customer satisfaction and loyalty. If customers leave at a high rate, they may be dissatisfied with the bank's products or services, customer service, or overall experience.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Banks can use various strategies to reduce customer churn, such as improving customer service, offering more competitive rates and fees, providing personalized recommendations and offers, and enhancing digital channels and mobile apps. By tracking and analyzing customer churn rates, banks can identify areas for improvement and make strategic decisions to retain customers and improve overall customer satisfaction.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this demo, we show how to implement the entire churn prediction lifecycle using Vantage technologies, specifically the combination of Bring Your Own Model (BYOM), Vantage Analytics Library (VAL), and the teradataml Python client library.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture
# # '%%capture' suppresses the display of installation steps of the following packages
# !pip install xgboost==1.7.3
# !pip install jdk4py==17.0.3.0
# !pip install sklearn2pmml==0.90.3

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above statements may need to be uncommented if you run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

from sklearn import tree
from xgboost import XGBClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from jdk4py import JAVA, JAVA_HOME, JAVA_VERSION

from teradataml import *

# Modify the following to match the specific client environment settings
display.max_rows = 5
configure.val_install_location = 'val'
configure.byom_install_location = 'mldb'
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA_HOME)
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA)[:-5]

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=BankingChurn.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains two statements, one of which is commented out. To switch modes, change which statement is commented out.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_BankChurn_cloud');"        # Takes 30 seconds
%run -i ../run_procedure.py "call get_data('DEMO_BankChurn_local');"        # Takes 1 minute

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>2. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a "Virtual DataFrame" that points to the data set in Vantage. Check the shape of the dataframe and the datatypes of all its columns.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

In [None]:
tdf = DataFrame(in_schema("DEMO_BankChurn", "customer_churn"))
print("Shape of the data: ", tdf.shape)
tdf

In [None]:
tdf.dtypes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>By looking at the datatypes and sample data, we classify the columns into ID column, target variable(y), numerical, categorical and binary. We skip using <i>RowNumber</i> and <i>Surname</i> columns as they are not helpful in the analysis.</p>

In [None]:
target_variable = "Exited"
numeric_columns = ["Age", "Balance", "CreditScore", "EstimatedSalary", "Tenure"]
categorical_columns = ["Gender", "Geography", "NumOfProducts"]
binary_columns = ["HasCrCard", "IsActiveMember"]
id_column = ["CustomerId"]

customer_data = tdf.select(
    id_column + [target_variable] + numeric_columns + categorical_columns + binary_columns
)

In [None]:
customer_data

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Data Transformation</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Transformation of string variables into flags (OneHotEncoding)</b></p>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following string or category variables as an example:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Gender</li>
    <li>Geography</li>
    <li>NumOfProducts</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For each of them, we use the <i>OneHotEncoder</i> function to generate a set of 0/1 flag columns.</p>
  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note:</b> The process can be achieved using a single script; here it has been separated into steps for didactic purposes.</i></p>
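<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the idea of one-hot encoding concrete, the following is a minimal plain-Python sketch, not the VAL implementation. It assumes that the dictionary passed as <i>values</i> maps each category to the name of the flag column it produces, as the dictionaries in the next cell suggest. The helper <i>one_hot</i> and the sample rows are hypothetical.</p>

```python
def one_hot(rows, column, values):
    """Return copies of rows with `column` replaced by one 0/1 flag per mapped value.

    `values` maps a category value to the name of the flag column it produces
    (an assumption made for this sketch).
    """
    encoded = []
    for row in rows:
        # Keep every field except the one being encoded
        new_row = {k: v for k, v in row.items() if k != column}
        for category, flag_name in values.items():
            new_row[flag_name] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows, not taken from the demo table
sample = [
    {"CustomerId": 1, "Geography": "France"},
    {"CustomerId": 2, "Geography": "Spain"},
]
geography_flags = one_hot(
    sample, "Geography", {"France": "France", "Germany": "Germany", "Spain": "Spain"}
)
print(geography_flags)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each input row ends up with one flag set to 1 and the others set to 0. A single-entry mapping such as <i>{"Female": "Gender"}</i> yields one flag column that is 1 for Female and 0 otherwise.</p>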

In [None]:
# Single flag column "Gender": 1 for Female, 0 otherwise (Male)
gender_values = {"Female": "Gender"}
gender_encoder = OneHotEncoder(values=gender_values, columns="Gender")

# One 0/1 flag column per country: France, Germany, Spain
geography_values = {"France": "France", "Germany": "Germany", "Spain": "Spain"}
geography_encoder = OneHotEncoder(values=geography_values, columns="Geography")

num_of_products_values = {1: "OneProduct", 2: "TwoProduct", 3: "ThreeProduct", 4: "FourProduct"}
num_of_products_encoder = OneHotEncoder(values=num_of_products_values, columns="NumOfProducts")

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Standardize for numeric variables (Z-score)</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following numerical variables as an example:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Age</li>
    <li>Balance</li>
    <li>CreditScore</li>
    <li>EstimatedSalary</li>
    <li>Tenure</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>And for each of them we are going to use the <i>ZScore</i> function to generate the transformation.</p>
  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note:</b> The process can be achieved using a single script; here it has been separated into steps for didactic purposes.</i></p>
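<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a reference for what the Z-score transformation computes, here is a minimal plain-Python sketch. It is not the VAL implementation: the helper <i>z_scores</i> and the sample values are hypothetical, and the use of the population standard deviation is an assumption made for illustration.</p>

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to (value - mean) / standard deviation."""
    mu = mean(values)
    # Population standard deviation; an assumption for this sketch
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [30, 40, 50]  # hypothetical values, not taken from the demo table
scaled = z_scores(ages)
print(scaled)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After the transformation, the column has mean 0 and standard deviation 1. Because each value is only shifted and rescaled, the relative ordering and spacing of the values are preserved.</p>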

In [None]:
# FillNa allows user to perform missing value/null replacement transformations.
fn = FillNa(style = "mode", columns = numeric_columns)

# Z-Score transforms each column value into the number of standard deviations from the mean value of the column.
# This is a linear transformation: each value is shifted by the column mean and divided by the column standard deviation.
zs = ZScore(columns = numeric_columns,
            out_columns = numeric_columns,
            fillna = fn)

# Keep the other variables that do not need transformation.
retain = Retain(columns=binary_columns+[target_variable])

In [None]:
# Process the transformation
df_transformed = valib.Transform(
    data = customer_data,
    zscore = zs,
    one_hot_encode = [gender_encoder, geography_encoder, num_of_products_encoder],
    retain = retain,
    index_columns = id_column,
    key_columns = id_column
).result

df_transformed

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Train/Test Split</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Split the dataset into train and test datasets using a split ratio of 0.8 (80% for training, 20% for testing).</p>
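<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The split in this demo runs in-database through the teradataml <i>sample</i> function. For intuition only, a client-side analogue can be sketched in plain Python as follows. The helper <i>train_test_split_ids</i>, the seed, and the sample identifiers are hypothetical.</p>

```python
import random


def train_test_split_ids(ids, train_ratio, seed=42):
    """Shuffle the ids and cut them into a train part and a test part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


customer_ids = list(range(1, 11))  # hypothetical identifiers
train_ids, test_ids = train_test_split_ids(customer_ids, 0.8)
print(len(train_ids), len(test_ids))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Every row lands in exactly one of the two parts, which is what keeps the test data unseen during training.</p>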

In [None]:
train_ratio = 0.8

df_sample = df_transformed.sample(frac = [train_ratio, 1.0-train_ratio])

# Split into 2 virtual dataframes
tdf_train = df_sample[df_sample.sampleid==1].drop(["sampleid"], axis=1)
tdf_test = df_sample[df_sample.sampleid==2].drop(["sampleid"], axis=1)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>4. Model Training (Outside Vantage)</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Read the training data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we use the <b>to_pandas()</b> function to bring the data out of Vantage, simulating an environment where the models are trained outside Vantage.</p>

In [None]:
# Read the training table with features
df_train = tdf_train.to_pandas()

In [None]:
# Setup y and Xs columns
y_col = ['Exited']
x_cols = df_train.columns.to_list()
x_cols.remove('Exited')

X = df_train[x_cols]
y = df_train[y_col]

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Train a Decision Tree Model</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Train a basic Decision Tree model and save it in PMML format.</p>

In [None]:
pipeline = PMMLPipeline([
    ("classifier", tree.DecisionTreeClassifier())
])
pipeline.fit(X, y.values.ravel())

In [None]:
# Export the model in PMML format
sklearn2pmml(pipeline, "bankchurn_dt_model.pmml", with_repr = True)

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Train an XGBoost Model</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Train a basic XGBoost model and save it in PMML format.</p>

In [None]:
# Train the XGBoost model
pipeline = PMMLPipeline([
     ("classifier", XGBClassifier())
])

pipeline.fit(X, y.values.ravel())

In [None]:
# Export the model in PMML format
sklearn2pmml(pipeline, "bankchurn_xgb_model.pmml", with_repr = True)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we load both the PMML files/models into a table in Vantage. This table will help to execute in-database scoring in the next section.</p>
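<p style = 'font-size:16px;font-family:Arial;color:#00233C'>PMML is an XML-based format, so an exported model file can be inspected with a standard XML parser before it is loaded. The sketch below parses a hand-written, heavily trimmed PMML document (not schema-valid and not the output of <i>sklearn2pmml</i>) to list its data fields and model element. The document content and the helper <i>local_name</i> are assumptions made for illustration.</p>

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed PMML document used only for illustration;
# it is not the output of sklearn2pmml and is not schema-valid.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="illustrative example"/>
  <DataDictionary>
    <DataField name="Exited" optype="categorical" dataType="integer"/>
    <DataField name="Age" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TreeModel functionName="classification"/>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(PMML_TEXT)
# Names of all fields declared in the data dictionary
field_names = [el.get("name") for el in root.iter() if local_name(el.tag) == "DataField"]
# Top-level elements other than the header and data dictionary describe the model
model_tags = [
    local_name(child.tag)
    for child in root
    if local_name(child.tag) not in ("Header", "DataDictionary")
]
print(field_names, model_tags)
```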

In [None]:
# Load the PMML file into Vantage

model_ids = ['dt', 'xgb']
model_files = ['bankchurn_dt_model.pmml', 'bankchurn_xgb_model.pmml']
table_name = 'bank_models'

for model_id, model_file in zip(model_ids, model_files):
    try:
        save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
    except Exception as e:
        # If the model already exists (error code TDML_2200), delete it and save it again
        if 'TDML_2200' in str(e):
            delete_byom(model_id = model_id, table_name = table_name)
            save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
        else:
            raise ValueError(f"Unable to save the model '{model_id}' in '{table_name}' due to the following error: {e}") from e

# Show the bank_models table
list_byom(table_name)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>5. Model Scoring (Inside Vantage)</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Scoring Decision Tree Model</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are scoring the Decision Tree model stored in the bank_models table in PMML format using Vantage's PMMLPredict functionality. All the scoring is done in-database inside Vantage.</p>

In [None]:
# Obtain a pointer to the model
table_name = 'bank_models'
model_id = 'dt'
model_tdf = retrieve_byom(model_id, table_name=table_name) 

result_dt = PMMLPredict(
    modeldata = model_tdf,
    newdata = tdf_test,
    accumulate = ['CustomerId', 'Exited'],
    model_output_fields = ['probability(1)', 'probability(0)']
)

result_dt.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note:</b> If the scoring performed with a classification model does not return a predicted value, the prediction output column could be empty. If the scoring is performed on regression or models which result in a single field, the prediction column will contain a value.</i></p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Scoring XGBoost Model</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are scoring the XGBoost model stored in the bank_models table in PMML format using Vantage's PMMLPredict functionality. All the scoring is done in-database inside Vantage.</p>

In [None]:
# Obtain a pointer to the model
table_name = 'bank_models'
model_id = 'xgb'
model_tdf = retrieve_byom(model_id, table_name=table_name)

result_xgb = PMMLPredict(
    modeldata = model_tdf,
    newdata = tdf_test,
    accumulate = ['CustomerId', 'Exited'],
    model_output_fields = ['probability(1)', 'probability(0)']
)

result_xgb.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note:</b> If the scoring performed with a classification model does not return a predicted value, the prediction output column could be empty. If the scoring is performed on regression or models which result in a single field, the prediction column will contain a value.</i></p>
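<p style = 'font-size:16px;font-family:Arial;color:#00233C'>When only probabilities are available, a class label can be derived by applying a threshold to the <i>probability(1)</i> field. The following is a minimal plain-Python sketch of that step. The helper <i>label_from_probability</i>, the 0.5 threshold, and the scored rows are hypothetical and are not results from this demo.</p>

```python
def label_from_probability(rows, threshold=0.5, prob_key="probability(1)"):
    """Add a 0/1 'predicted' field: 1 when the churn probability reaches the threshold."""
    return [dict(row, predicted=int(row[prob_key] >= threshold)) for row in rows]


# Hypothetical scored rows, not actual output of PMMLPredict
scored = [
    {"CustomerId": 1, "Exited": 1, "probability(1)": 0.91},
    {"CustomerId": 2, "Exited": 0, "probability(1)": 0.12},
    {"CustomerId": 3, "Exited": 1, "probability(1)": 0.40},
]
labeled = label_from_probability(scored)
# Share of rows where the derived label matches the observed outcome
accuracy = sum(r["predicted"] == r["Exited"] for r in labeled) / len(labeled)
print(labeled, accuracy)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The choice of threshold trades false positives against false negatives, which is why a threshold-independent view of model quality is also useful.</p>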

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across classification thresholds. The area under the ROC curve (AUC) measures how well the model can distinguish between the positive and negative classes: the higher the AUC, the better the model separates them. An AUC above 0.75 is generally considered decent.</p>
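<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To illustrate how ROC points and the AUC are derived from scored data, here is a minimal plain-Python sketch that sweeps a threshold over the scores and integrates the curve with the trapezoidal rule. It is not the in-database <i>ROC</i> function used below. The helpers <i>roc_points</i> and <i>auc</i>, and the labels and scores, are hypothetical.</p>

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical observations (1 = exited) and predicted churn probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_points(labels, scores))
print(round(example_auc, 4))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this example, 8 of the 9 possible (positive, negative) pairs are ranked correctly, so the AUC is 8/9. A model that ranks every positive above every negative reaches an AUC of 1.0, while random ranking gives about 0.5.</p>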

In [None]:
roc_dt = ROC(
    probability_column = '"probability(1)"',
    observation_column = "Exited",
    positive_class = '1',
    data = result_dt.result,
    num_thresholds = 100
)

roc_xgb = ROC(
    probability_column = '"probability(1)"',
    observation_column = "Exited",
    positive_class = '1',
    data = result_xgb.result,
    num_thresholds = 100
)

auc_dt = roc_dt.result.get('AUC').get_values()[0][0]
auc_xgb = roc_xgb.result.get('AUC').get_values()[0][0]

In [None]:
roc_xgb.output_data.plot(
    x = roc_dt.output_data.fpr,
    y = [roc_dt.output_data.tpr, roc_xgb.output_data.tpr, roc_dt.output_data.fpr],
    legend = [
                'Decision Tree: AUC = {}'.format(str(round(auc_dt, 4))),
                'XGBoost: AUC = {}'.format(str(round(auc_xgb, 4))),
                'Baseline: AUC = {}'.format(str(round(0.5, 4)))
             ],
    legend_style = 'lower right',
    title = 'Receiver Operating Characteristic (ROC) Curve',
    xlabel = 'False Positive Rate',
    ylabel = 'True Positive Rate',
    color = ['green', 'orange', 'blue'],
    linestyle = ['-', '-', '--']
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at the ROC curves above, both models perform well on the test data, and the AUC scores support this. Of the two models, XGBoost performs best, with the higher AUC score.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>6. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up work tables to prevent errors the next time the notebook is run.</p>

In [None]:
tables = ['bank_models']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore tables that do not exist or cannot be dropped
        pass

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_BankChurn');"        # Takes 10 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;background-color:#00233C;">
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Dataset:</b>

- `Surname`: Surname
- `CreditScore`: Credit score
- `Geography`: Country (Germany / France / Spain)
- `Gender`: Gender (Female / Male)
- `Age`: Age
- `Tenure`: Number of years the customer has been associated with the bank
- `Balance`: Balance
- `NumOfProducts`: Number of bank products used
- `HasCrCard`: Credit card status (0 = No, 1 = Yes)
- `IsActiveMember`: Active membership status (0 = No, 1 = Yes)
- `EstimatedSalary`: Estimated salary
- `Exited`: Whether the customer left the bank (0 = No, 1 = Yes)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>