<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Marketing Campaign Effectiveness Prediction using Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233c'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Marketing is the process by which companies create value for customers and build strong customer relationships in order to capture value from customers in return. Marketing campaigns revolve around prioritizing customer needs and ensuring their overall satisfaction. However, the success of a campaign hinges on various factors that must be carefully considered when the campaign is formulated.</p>

<center><img src="images/header_img.jpg" alt="marketing tips1" width=400 height=400/></center>
<p>image source: <a href="https://unsplash.com/photos/--kQ4tBklJI">unsplash.com</a></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this demo, we want to build the best possible predictive model for the marketing campaign of a new product: a model that predicts whether a customer will buy the new product or not.</p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata Vantage provides us with the necessary capabilities to analyze the vast amounts of data collected for marketing campaigns, such as the customer's age, marital status, education, number of family members, etc. In addition to this, we have data related to the last contact of the current campaign, i.e., contact, month, day, and duration. By processing this data, we can find patterns of campaign effectiveness and take proactive measures to improve the next marketing campaign.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With Teradata Vantage, we can help clients stay ahead of the curve, providing them with cutting-edge analytics capabilities to improve the next marketing campaign, reduce marketing costs, and reduce customer annoyance.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233c'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration</li>
    <li>Data Preparation</li>
    <li>Train-Test Split</li>
    <li>In-Database Machine Learning</li>
    <li>Visualize the results</li>
    <li>Cleanup</li>
</ol>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>1. Configuring the environment</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import io
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
import plotly.express as px
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_auc_score,
    roc_curve,
    auc,
)

import warnings  # standard library; used below to silence warnings

# Teradata libraries
from teradataml import *
from teradataml import ROC

# Modify the following to match the specific client environment settings
display.max_rows = 5
configure.val_install_location = "val"

# Suppress warnings
warnings.filterwarnings("ignore")

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>2. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Marketing_Campaign_Effectiveness_Preditction_PY_SQL.ipynb;' UPDATE FOR SESSION;''') 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>2.1 Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_MarketingCamp_cloud');"        # Takes 1 minute
# %run -i ../run_procedure.py "call get_data('DEMO_MarketingCamp_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>3. Data Exploration</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The goal of the Marketing Campaign Effectiveness prediction is to reduce marketing resources by identifying customers who would purchase the product and thereby directing marketing efforts to them.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data is from the last marketing campaign, with thousands of rows of customer data like age, job, marital status, education, etc.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each row is a snapshot of data taken during the last marketing campaign, and each column is a different variable. The input dataset can be divided into three categories of input attributes plus the target attribute, as below:</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>customer data, such as age, profession, education, and monthly income.</li>
    <li>attributes related to the last contact of the current campaign, such as contact, month, and day.</li>
    <li>other attributes, such as campaign, previous outcome, and payment methods.</li>
    <li>target attribute: purchased.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The source data from <a href="https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset">Kaggle</a> is loaded into Vantage and supplemented with information about city, monthly income, family members, etc. The data is stored in a Vantage table named <i>Retail_Marketing</i>.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1 Examine the Retail Marketing Campaign table</b></p>    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's look at the sample data in the Retail_Marketing table.</p>

In [None]:
tdf = DataFrame(in_schema("DEMO_MarketingCamp", "Retail_Marketing"))
df = tdf.to_pandas()
print("Data information: \n", tdf.shape)
tdf.sort("customer_id")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>There are 11K records and 23 variables in all. <i>purchased</i> is the target variable, which we will predict from the remaining features.</p>

In [None]:
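def get_histogram(df, x, y, color, title, x_title, y_title, width=800, height=500):
    """Return a grouped Plotly histogram that sums `y` for each `x` value, with one bar per `color` value."""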
    fig = px.histogram(
        df,
        x=x,
        y=y,
        title=title,
        nbins=df.shape[0],
        barmode="group",
        color=color,
        color_discrete_map={"no": "#dd8452", "yes": "#4c72b0"},
    )
    fig.update_yaxes(title=y_title)
    fig.update_xaxes(title=x_title)
    fig.update_layout(
        autosize=False,
        width=width,
        height=height,
    )
    return fig

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.1 Analyze how marital status affects purchases</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's explore the relationship between marital status and purchases.</p>

In [None]:
query = """
SELECT marital,
       purchased,
       Count(*) / Cast(Sum(Count(*)) OVER (partition BY marital) AS FLOAT) * 100 AS purchased_perc
FROM   demo_marketingcamp.retail_marketing
GROUP  BY 1, 2
"""

df_marital_purchased = DataFrame.from_query(query)
df_marital_purchased.sort("purchased_perc", ascending=False)

In [None]:
df_marital_purchased = df_marital_purchased.to_pandas()
get_histogram(
    df_marital_purchased,
    x="marital",
    y="purchased_perc",
    color="purchased",
    title="Purchase rate by marital status",
    x_title="marital",
    y_title="Purchase (%)",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A few observations from the above graph:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Married customers</b> have a purchase rate of <b>38%</b>, compared to a non-purchase rate of <b>62%</b>.</li>
    <li>Of all the <b>divorced customers</b>, <b>39%</b> have purchased a product while <b>61%</b> have not.</li>
    <li>Of the <b>single customers</b>, <b>76%</b> have purchased a product, while <b>24%</b> have not.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Compared to customers with other marital statuses, single customers are the most likely to purchase.</p>
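<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The percentages in this section come from dividing each group count by its partition total (<i>Count(*) / Sum(Count(*)) OVER (PARTITION BY ...)</i>). As a minimal local sketch of the same calculation, the snippet below reproduces it in pandas. The <i>toy</i> DataFrame and its values are made up for illustration and are not taken from Retail_Marketing.</p>

```python
import pandas as pd

# Toy data standing in for the marital and purchased columns (values are made up).
toy = pd.DataFrame(
    {
        "marital": ["single", "single", "single", "single", "married", "married"],
        "purchased": ["yes", "yes", "yes", "no", "yes", "no"],
    }
)

# Count rows per (marital, purchased) pair, like GROUP BY 1, 2.
counts = toy.groupby(["marital", "purchased"]).size().rename("n").reset_index()

# Divide each count by its marital-group total,
# like SUM(COUNT(*)) OVER (PARTITION BY marital).
group_totals = counts.groupby("marital")["n"].transform("sum")
counts["purchased_perc"] = counts["n"] / group_totals * 100

print(counts)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Within each marital status the percentages add up to 100, which is why the bars in each group of the chart are complementary.</p>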

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.2 Study the impact of the customer's profession on purchases</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Exploring the relationship between the customer's profession and purchases.</p>

In [None]:
query = """
SELECT profession,
       purchased,
       Count(*) / Cast(Sum(Count(*)) OVER (partition BY profession) AS FLOAT) * 100 AS purchased_perc
FROM   demo_marketingcamp.retail_marketing
GROUP  BY 1,2
"""

df_profession_purchased = DataFrame.from_query(query)
df_profession_purchased.sort("purchased_perc", ascending=False)

In [None]:
df_profession_purchased = df_profession_purchased.to_pandas()
get_histogram(
    df_profession_purchased,
    x="profession",
    y="purchased_perc",
    color="purchased",
    title="Purchase rate by profession",
    x_title="profession type",
    y_title="Purchase (%)",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above graph we can observe that:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Student</b> customers have a purchase rate of <b>73%</b>, as opposed to a non-purchase rate of <b>27%</b>.</li>
    <li>A little more than half of all customers whose profession is <b>technician, management, admin, or retired</b> have purchased a product.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Customers in blue-collar jobs are the least likely to make a purchase, whereas students have the highest purchase rate of all professions.</p>
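<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To rank categories by purchase rate without reading values off the chart, the long-form result (one row per category and outcome) can be pivoted to a wide table. The sketch below assumes a DataFrame shaped like <i>df_profession_purchased</i>. The name <i>long_df</i> and its numbers are hypothetical.</p>

```python
import pandas as pd

# Hypothetical long-form result shaped like df_profession_purchased (numbers are made up).
long_df = pd.DataFrame(
    {
        "profession": ["student", "student", "blue-collar", "blue-collar"],
        "purchased": ["yes", "no", "yes", "no"],
        "purchased_perc": [70.0, 30.0, 35.0, 65.0],
    }
)

# One row per profession, one column per outcome.
wide = long_df.pivot(index="profession", columns="purchased", values="purchased_perc")

# Rank professions by their "yes" share, highest first.
ranked = wide.sort_values("yes", ascending=False)
print(ranked)
```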

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.3 Investigate the impact of the customer's education on purchases</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Exploring the relationship between the customer's education and purchases.</p>

In [None]:
query = """
SELECT education,purchased,
       Count(*) / Cast(Sum(Count(*)) OVER (partition BY education) AS FLOAT) * 100 AS purchased_perc
FROM   demo_marketingcamp.retail_marketing
GROUP  BY 1, 2

"""

df_edu_purchased = DataFrame.from_query(query)
df_edu_purchased.sort(["purchased_perc", "purchased"], ascending=False)

In [None]:
df_edu_purchased = df_edu_purchased.to_pandas()
get_histogram(
    df_edu_purchased,
    x="education",
    y="purchased_perc",
    color="purchased",
    title="Purchase rate by education type",
    x_title="education type",
    y_title="Purchase (%)",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above graph we can observe that:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Customers who completed <b>tertiary</b> education have a purchase rate of <b>55%</b>, compared to a non-purchase rate of <b>45%</b>.</li>
    <li>Approximately <b>50%</b> of all customers whose education is <b>unknown or secondary</b> have purchased a product.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Customers with primary-level education are least likely to purchase.</p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.4 Examine how the results of prior marketing campaigns affect purchases</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Exploring the relationship between the outcome of earlier campaigns and purchases.</p>

In [None]:
query = """
SELECT prev_campaign_outcome,
       purchased,
       Count(*) / Cast(Sum(Count(*)) OVER (partition BY prev_campaign_outcome) AS FLOAT) * 100 AS purchased_perc
FROM   demo_marketingcamp.retail_marketing
GROUP  BY 1,
          2
"""

df_poutcome_purchased = DataFrame.from_query(query)
df_poutcome_purchased.sort("purchased_perc", ascending=False)

In [None]:
df_poutcome_purchased = df_poutcome_purchased.to_pandas()
get_histogram(
    df_poutcome_purchased,
    x="prev_campaign_outcome",
    y="purchased_perc",
    color="purchased",
    title="Purchase rate by previous campaign outcome",
    x_title="poutcome type",
    y_title="Purchase (%)",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above graph we can observe that:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>If the previous outcome is <b>success</b>, the customer is highly likely to purchase a product.</li>
    <li>If the previous outcome is <b>failure or unknown</b>, there is approximately a <b>50%</b> chance that the customer will purchase a product.</li>
</ol>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.5 Examine how a customer's age affects purchases</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Exploring the relationship between the customer's age and the purchase decision.</p>

In [None]:
grp_gen = (
    tdf.select(["age", "purchased"]).groupby(["age"]).agg(["mean", "count"]).to_pandas()
)

plt.figure(figsize=(15, 6))

sns.barplot(x="age", y="count_purchased", data=grp_gen)

plt.xticks(rotation=90)

plt.title("purchased rate by age")

plt.ylabel("total count of purchase")

plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The graph shows a clear trend: the count rises with age <b>up to the age of 31</b> (a <b>positive</b> association) and falls thereafter (a <b>negative correlation</b>). Note that the bars show <i>count_purchased</i>, which counts every record for an age regardless of whether <i>purchased</i> is yes or no, so the chart reflects how many customers of each age are in the data rather than the share who bought. For instance, the count stays below 50 for ages above 61.</p>
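<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The bar chart above plots a count per age. To look at the share of customers who purchased at each age instead, the yes/no target can be turned into a boolean and averaged. This is a minimal pandas sketch on made-up data. The <i>toy</i> DataFrame and the <i>bought</i> helper column are assumptions for illustration, not part of the notebook.</p>

```python
import pandas as pd

# Toy data standing in for the age and purchased columns (values are made up).
toy = pd.DataFrame(
    {
        "age": [25, 25, 25, 40, 40, 70],
        "purchased": ["yes", "yes", "no", "no", "yes", "no"],
    }
)

by_age = (
    toy.assign(bought=toy["purchased"].eq("yes"))  # True where the customer purchased
    .groupby("age")["bought"]
    .agg(records="count", purchase_rate="mean")    # rows per age, share of True per age
    .reset_index()
)

print(by_age)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping <i>records</i> next to <i>purchase_rate</i> makes it easy to see which ages have too few rows for the rate to be meaningful.</p>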

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.6 Analyze the impact of past purchase frequency on purchases</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Exploring the relationship between the customer's past purchase frequency and purchases.</p>

In [None]:
query = """
SELECT purchase_frequency,
       purchased,
       Count(*) / Cast(Sum(Count(*)) OVER (partition BY purchase_frequency) AS FLOAT) * 100 AS purchased_perc
FROM   demo_marketingcamp.retail_marketing
GROUP  BY 1, 2
"""

df_purchase_frequency_purchased = DataFrame.from_query(query)
df_purchase_frequency_purchased.sort("purchased_perc", ascending=False)

In [None]:
df_purchase_frequency_purchased = df_purchase_frequency_purchased.to_pandas()
get_histogram(
    df_purchase_frequency_purchased,
    x="purchase_frequency",
    y="purchased_perc",
    color="purchased",
    title="Purchase rate by customer's purchase frequency",
    x_title="purchase frequency",
    y_title="Purchase (%)",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Purchase frequency describes the number of times that your customers make a purchase from you within a specified period of time. This information is crucial in helping you to understand your customer retention rate, your customers' buying behaviors, and even the degree to which they're satisfied.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can observe that:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>The likelihood of purchasing a product is <b>higher</b> when the customer buys frequently, such as <b>daily, weekly, biweekly, etc.</b></li>
    <li>The likelihood of purchasing a product is <b>lower</b> when the customer only buys <b>quarterly or annually</b>.</li>
</ol>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.7 Determine which earlier campaigns were the most or least successful</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Examining the purchase and non-purchase rates of each previous campaign.</p>

In [None]:
query = """
SELECT campaign,
       STRTOK(campaign,'_',2) as camp_no,
       purchased,
       Count(*) / Cast(Sum(Count(*)) OVER (partition BY campaign) AS FLOAT) * 100 AS purchased_perc
FROM   demo_marketingcamp.retail_marketing
GROUP  BY 1, 3
"""

df_campaign_purchased = DataFrame.from_query(query)
df_campaign_purchased.sort("purchased_perc", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Checking the number of contacts performed during each campaign.</p>

In [None]:
query = """
SELECT campaign, 
       STRTOK(campaign,'_',2) as camp_no,  
       COUNT(1)  as total_count
FROM DEMO_MarketingCamp.Retail_Marketing
GROUP  BY 1
"""

df_last_campaign = DataFrame.from_query(query)
df_last_campaign.sort("total_count", ascending=False)

In [None]:
df_last_campaign = df_last_campaign.to_pandas()
df_campaign_purchased = df_campaign_purchased.to_pandas()
f, ax = plt.subplots(1, 2, figsize=(25, 6))

palette = colors = ["#dd8452", "#4c72b0"]

plt.suptitle(
    "Information on how many contacts were made, and how many of those contacts made purchases.",
    fontsize=16,
)

# plot1
sns.barplot(
    x="camp_no", y="total_count", data=df_last_campaign, palette=palette, ax=ax[0]
)
ax[0].set_ylabel("Number of records in this campaign", fontsize=12)
ax[0].set_xlabel("campaign id", fontsize=12)

# plot2
sns.barplot(
    x="camp_no",
    y="purchased_perc",
    hue="purchased",
    data=df_campaign_purchased,
    palette=palette,
    ax=ax[1],  # draw on the right-hand subplot explicitly
)
ax[1].set_ylabel("Purchase (%)", fontsize=12)
ax[1].set_xlabel("campaign id", fontsize=12)

plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The graphs above show the following:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Campaigns such as <b>campaign_33, campaign_27, campaign_41, campaign_26, and campaign_29</b> have a <b>100%</b> purchase rate because very few contacts were made during these campaigns and every customer reached made a purchase. These may have been test campaigns.</li>
    <li>The <b>non-purchase rate</b>, however, is <b>100%</b> for campaigns such as <b>campaign_31, campaign_25, campaign_43, campaign_20, campaign_63, campaign_23, campaign_28, and campaign_32</b>. Only a few customers were contacted during these campaigns, and none of them made a purchase.</li>
</ol>
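<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A 100% or 0% rate that rests on a handful of contacts says little about a campaign. One way to guard against this is to join the contact counts to the purchase rates and keep only campaigns above a minimum sample size. The sketch below uses made-up numbers. The DataFrame names and the <i>MIN_CONTACTS</i> threshold are assumptions for illustration.</p>

```python
import pandas as pd

# Hypothetical per-campaign purchase rates and contact counts (numbers are made up).
rates = pd.DataFrame(
    {"camp_no": ["1", "2", "33"], "purchased_perc_yes": [52.0, 45.0, 100.0]}
)
totals = pd.DataFrame(
    {"camp_no": ["1", "2", "33"], "total_count": [4800, 3000, 1]}
)

# Put the rate and the number of contacts side by side.
merged = rates.merge(totals, on="camp_no")

# Assumed threshold: ignore campaigns with too few contacts to be informative.
MIN_CONTACTS = 30
reliable = merged[merged["total_count"] >= MIN_CONTACTS]

print(reliable)
```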

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1.8 Examine the impact of the most recent contact month on purchasing patterns</b></p>    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Analysis of how the customer's purchasing decision was impacted by the last contact month.</p>

In [None]:
query = """
SELECT last_contact_month, 
       purchased, COUNT(*) / CAST( SUM(count(*)) over (partition by last_contact_month) as float) * 100 as purchased_perc
FROM DEMO_MarketingCamp.Retail_Marketing
GROUP BY 1,2
"""

df_last_contact_month_purchased = DataFrame.from_query(query)
df_last_contact_month_purchased.sort("purchased_perc", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Checking the number of contacts performed during each month.</p>

In [None]:
query = """
SELECT last_contact_month,  
       COUNT(1)  as total_count
FROM DEMO_MarketingCamp.Retail_Marketing
GROUP BY 1
"""

df_last_contact_month_tot = DataFrame.from_query(query)
df_last_contact_month_tot.sort("total_count", ascending=False)

In [None]:
df_last_contact_month_tot = df_last_contact_month_tot.to_pandas()
df_last_contact_month_purchased = df_last_contact_month_purchased.to_pandas()

In [None]:
f, ax = plt.subplots(1, 2, figsize=(16, 6))

colors = ["#dd8452", "#4c72b0"]
palette = ["#dd8452", "#4c72b0"]

plt.suptitle("Information on last contact month and purchase", fontsize=16)

# plot1
sns.barplot(
    x="last_contact_month",
    y="total_count",
    data=df_last_contact_month_tot,
    palette=palette,
    ax=ax[0],
)
ax[0].set_ylabel("Number of contacts performed", fontsize=12)
ax[0].set_xlabel("last_contact_month", fontsize=12)

# plot2
sns.barplot(
    x="last_contact_month",
    y="purchased_perc",
    hue="purchased",
    data=df_last_contact_month_purchased,
    palette=palette,
    ax=ax[1],  # draw on the right-hand subplot explicitly
)
ax[1].set_ylabel("Purchase (%)", fontsize=12)
ax[1].set_xlabel("last_contact_month", fontsize=12)

plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In May, June, July, and August of last year, the marketing team contacted the majority of their customers. Most of those customers were contacted in <b>May</b>, which is also the month in which most customers showed little interest in purchasing the product. The months of <b>March, September, and December</b> saw very little customer contact. Customers should be contacted more frequently during these months.</p>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>4. Data Preparation</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>We'll perform the following steps:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Missing Value Analysis</li>
    <li>Data distribution plot for numerical variables.</li>
    <li>Features selection using correlation</li>   
    <li>FutileColumns using CategoricalSummary</li>   
    <li>Outlier Analysis</li>
   <li>Data Transformation</li>
</ul>


<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.1 Missing Value analysis</b></p>

In [None]:
tdf.info(null_counts=True)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The results above show that, fortunately, there are no missing values. If there were, we would have to fill them with the median, mean, mode, or some other technique. So, we do not need to process missing values separately.</p>
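<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For reference, this is roughly what median and mode imputation would look like if missing values were present. It is a local pandas sketch on made-up data, not an in-database step from this notebook. The <i>toy</i> DataFrame is an assumption for illustration.</p>

```python
import numpy as np
import pandas as pd

# Toy data with one missing numeric value and one missing categorical value (made up).
toy = pd.DataFrame(
    {
        "age": [30.0, np.nan, 50.0, 40.0],
        "marital": ["single", None, "married", "single"],
    }
)

filled = toy.copy()

# Numeric column: fill with the median of the observed values.
filled["age"] = filled["age"].fillna(filled["age"].median())

# Categorical column: fill with the most frequent observed value (the mode).
filled["marital"] = filled["marital"].fillna(filled["marital"].mode().iloc[0])

print(filled)
```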

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.2 Distribution plots for numeric variables</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because many statistical techniques assume normally distributed data, we need to check whether the collected data is normal or not. Here, we use a Q-Q plot to check the normality and skewness of the data. Q stands for quantile, so a Q-Q plot is a quantile-quantile plot.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To view the Q-Q plot using TD_Plot, we first have to prepare the data to feed into TD_Plot.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create a RankTable to rank all the columns</li>
    <li>Create Distributions table using TD_QQNorm</li>
    <li>Create a lineGraph table using TD_Plot</li>

</ol>

In [None]:
q = """
CREATE MULTISET TABLE RankTable AS (
    SELECT
       age, 
       monthly_income_in_thousand,
       last_contact_day,
       last_contact_duration,
       days_from_last_contact,
       prev_contacts_performed, 
       recency, 
    
    CAST (ROW_NUMBER() OVER (ORDER BY age ASC NULLS LAST) AS BIGINT)
    AS rank_age,
    
    CAST (ROW_NUMBER() OVER (ORDER BY monthly_income_in_thousand ASC NULLS LAST) AS BIGINT)
    AS rank_monthly_income_in_thousand,
    
    CAST (ROW_NUMBER() OVER (ORDER BY last_contact_day ASC NULLS LAST) AS BIGINT)
    AS rank_last_contact_day,
    
    CAST (ROW_NUMBER() OVER (ORDER BY last_contact_duration ASC NULLS LAST) AS BIGINT)
    AS rank_last_contact_duration,
    
    CAST (ROW_NUMBER() OVER (ORDER BY days_from_last_contact ASC NULLS LAST) AS BIGINT)
    AS rank_days_from_last_contact,
    
    CAST (ROW_NUMBER() OVER (ORDER BY prev_contacts_performed ASC NULLS LAST) AS BIGINT)
    AS rank_prev_contacts_performed,
    
    CAST (ROW_NUMBER() OVER (ORDER BY recency ASC NULLS LAST) AS BIGINT)
    AS rank_recency
    
    FROM DEMO_MarketingCamp.Retail_Marketing AS dt
) WITH DATA;
"""

try:
    execute_sql(q)
except:
    db_drop_table("RankTable")
    execute_sql(q)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> TD_QQNorm checks whether the values in the specified input table columns are normally distributed. The function returns the quantiles of the column values and corresponding theoretical quantile values from a normal distribution. If the column values are normally distributed, then the quantiles of column values and normal quantile values appear in a straight line when plotted on a 2D graph.</p>
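<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the idea concrete, the sketch below computes theoretical normal quantiles from ranks locally. It assumes the common plotting position (rank - 0.5) / n, which may differ from the formula TD_QQNorm uses internally. The function name <i>theoretical_quantiles</i> and the synthetic samples are hypothetical.</p>

```python
from statistics import NormalDist

import numpy as np


def theoretical_quantiles(values):
    """Map each value's rank to a standard-normal quantile.

    Assumption: plotting position (rank - 0.5) / n; TD_QQNorm may use another.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    ranks = values.argsort().argsort() + 1  # 1-based rank of each value
    probs = (ranks - 0.5) / n               # strictly between 0 and 1
    std_normal = NormalDist()
    return np.array([std_normal.inv_cdf(p) for p in probs])


rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=40, scale=10, size=500)  # roughly bell-shaped
skewed_sample = rng.exponential(scale=10, size=500)     # strongly right-skewed

# Points close to a straight line give a correlation close to 1.
corr_normal = np.corrcoef(normal_sample, theoretical_quantiles(normal_sample))[0, 1]
corr_skewed = np.corrcoef(skewed_sample, theoretical_quantiles(skewed_sample))[0, 1]

print(corr_normal, corr_skewed)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The normally distributed sample lines up with the theoretical quantiles more closely than the skewed one, which is the pattern to look for in the Q-Q plots.</p>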

In [None]:
q = """
CREATE MULTISET VOLATILE TABLE Distributions AS (
SELECT * FROM TD_QQNorm (
  ON RankTable AS InputTable
  USING
  TargetColumns ('[0:6]')
  RankColumns ('[7:13]')) AS dt) WITH DATA
ON COMMIT PRESERVE ROWS;
"""

try:
    execute_sql(q)
except:
    db_drop_table("Distributions")
    execute_sql(q)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create Distributions_Table from Distributions, adding a constant <i>idcol</i> column that TD_Plot uses as the series ID.</p>

In [None]:
q = """
CREATE MULTISET VOLATILE TABLE Distributions_Table AS (
    SELECT 1 AS idcol,
   age, age_theoretical_quantiles, monthly_income_in_thousand, monthly_income_in_thousand_theoretical_quantiles, 
last_contact_day,last_contact_day_theoretical_quantiles, last_contact_duration, last_contact_duration_theoretical_quantiles, 
days_from_last_contact, days_from_last_contact_theoretical_quantiles, prev_contacts_performed, prev_contacts_performed_theoretical_quantiles,
recency, recency_theoretical_quantiles
    FROM Distributions AS dt) 
WITH DATA
ON COMMIT PRESERVE ROWS;
"""

try:
    execute_sql(q)
except:
    db_drop_table("Distributions_Table")
    execute_sql(q)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_PLOT provides the ability to generate charts. The generated charts can be in JPG, PNG, or SVG format. TD_PLOT supports a single series, multiple series on a single plot, and composite plots that display different result sets on a single plot. TD_PLOT supports up to 1024 different series per plot.</p>
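<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_Plot statement in this notebook repeats the same SERIES_SPEC block once per column. As an optional sketch, the blocks could be generated from a list of column names with plain string formatting. The helper <i>series_spec</i> is hypothetical and only builds text; it does not call Vantage.</p>

```python
# Columns plotted in this notebook's Q-Q grid.
columns = [
    "age",
    "monthly_income_in_thousand",
    "last_contact_day",
    "last_contact_duration",
    "days_from_last_contact",
    "prev_contacts_performed",
    "recency",
]


def series_spec(column, table="Distributions_Table"):
    """Build one SERIES_SPEC block for a column and its theoretical quantiles."""
    return (
        "SERIES_SPEC(\n"
        f"    TABLE_NAME({table}),\n"
        f"    ROW_AXIS(SEQUENCE({column})),\n"
        "    SERIES_ID(idcol),\n"
        f"    PAYLOAD(FIELDS({column}_theoretical_quantiles), CONTENT(REAL))\n"
        ")"
    )


# Join the blocks with commas, ready to paste into the TD_Plot call.
series_sql = ",\n".join(series_spec(c) for c in columns)
print(series_sql)
```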

In [None]:
q = """
EXECUTE FUNCTION INTO VOLATILE ART(lineGraph)
TD_Plot
(
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(age)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(age_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(monthly_income_in_thousand)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(monthly_income_in_thousand_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(last_contact_day)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(last_contact_day_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(last_contact_duration)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(last_contact_duration_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(days_from_last_contact)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(days_from_last_contact_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(prev_contacts_performed)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(prev_contacts_performed_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    
    SERIES_SPEC
    (
        TABLE_NAME(Distributions_Table),
        ROW_AXIS(SEQUENCE(recency)),
        SERIES_ID(idcol),
        PAYLOAD
        (
           FIELDS(recency_theoretical_quantiles),
           CONTENT(REAL)
        )
    ),
    FUNC_PARAMS
    (
        LAYOUT(4,3),
        WIDTH(1920),
        HEIGHT(1080),
        TITLE('Distribution Visualization'),
        PLOTS[
            (
                ID(1),
                CELL(1,1),
                TITLE ('age_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            ),
            (   ID(2),
                CELL(2,1),
                TITLE ('monthly_income_in_thousand_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            ),
            (   ID(3),
                CELL(3,1),
                TITLE ('last_contact_day_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            ),
            (
                ID(4),
                CELL(4,1),
                TITLE ('last_contact_duration_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            ),
            (
                ID(5),
                CELL(1,2),
                TITLE ('days_from_last_contact_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            ),
            (
                ID(6),
                CELL(2,2),
                TITLE ('prev_contacts_performed_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            ),
            (
                ID(7),
                CELL(3,2),
                TITLE ('recency_theoretical_quantiles'),
                TYPE('line'),
                MARKER('o'),
                LEGEND('best'),
                XLABEL('x-axis'),
                YLABEL('Distribution')
            )
        ]
    )
);
"""

try:
    execute_sql(q)
except:
    db_drop_table("lineGraph")
    execute_sql(q)

In [None]:
q = """
create multiset table lineGraph_result as (select * from lineGraph) with data;
"""

try:
    execute_sql(q)
except:
    db_drop_table("lineGraph_result")
    execute_sql(q)

plot_df = DataFrame("lineGraph_result").to_pandas()

img = plot_df.IMAGE.iloc[0]
Image.open(io.BytesIO(img))

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Overall, a QQ plot provides a visual comparison between the quantiles of the observed data and the quantiles expected from a theoretical distribution. It helps to identify departures from the assumed distribution, such as skewness, heavy tails, or other deviations.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Interpreting a QQ plot involves examining how the observed quantiles deviate from the expected quantiles. Here are some key aspects to consider:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Linearity</b>: In an ideal scenario where the data perfectly follows the theoretical distribution, the observed quantiles will align with the expected quantiles, resulting in a straight line. Deviations from a straight line suggest departures from the theoretical distribution.</li>

<li><b>Slope</b>: The slope of the line provides information about the data's spread. If the line is steeper than the reference line (y = x), it indicates heavier tails or a greater spread than the theoretical distribution. Conversely, a flatter line indicates lighter tails or a smaller spread.</li>

<li><b>Endpoints</b>: The behavior of the plot at the endpoints is significant. If the observed quantiles deviate from the expected quantiles at the extremes, it suggests deviations in the tails of the distribution.</li>

<li><b>Outliers</b>: Outliers in the dataset can be identified as points that significantly deviate from the expected quantiles. These points might indicate extreme values or errors in the data.</li>

</ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above results, we can observe the following points:</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
<li>Age: The observed quantiles lie close to a straight line, so we can conclude that this column approximately follows the <b>normal distribution</b>.</li>
<li>Monthly_income_in_thousand: The plot deviates noticeably at the endpoints, which leads us to conclude that this column is <b>not following the normal distribution</b>.</li>
<li>Last_contact_day: The endpoints deviate noticeably and the remaining quantiles are slightly left-skewed, which means this column is <b>left-skewed and heavy-tailed</b>.</li>
<li>Last_contact_duration: The observed quantiles are heavily left-skewed, which means this column is <b>left-skewed and heavy-tailed</b>.</li>
<li>Days_from_last_contact: One of the endpoints deviates noticeably and the remaining observed quantiles are heavily left-skewed, which means this column is <b>left-skewed and heavy-tailed</b>.</li>
<li>Prev_contacts_performed: Only one of the endpoints deviates noticeably and the remaining observed quantiles are substantially left-skewed, which means this column is <b>left-skewed</b> and heavy-tailed.</li>
<li>Recency: The endpoints deviate noticeably, but the remaining quantiles lie almost on a straight line, which points to <b>deviations in the distribution's tails</b>.</li>
</ol>
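
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the interpretation points above concrete, the following is a minimal sketch of how the points of a normal QQ plot can be computed. It is not the Vantage implementation: the function name <b>qq_points</b>, the plotting position (i - 0.5) / n, and the sample values are assumptions chosen for illustration.</p>

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each sorted observation with its theoretical normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    # Reference normal distribution with the sample's own mean and standard deviation.
    reference = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        # The plotting position (i - 0.5) / n keeps probabilities strictly inside (0, 1).
        probability = (i - 0.5) / n
        points.append((reference.inv_cdf(probability), observed))
    return points


# Hypothetical sample: fairly regular values with one extreme observation.
sample = [21, 25, 27, 30, 31, 33, 35, 38, 41, 90]
for theoretical, observed in qq_points(sample):
    gap = observed - theoretical
    print(f"theoretical={theoretical:7.2f}  observed={observed:5.1f}  gap={gap:7.2f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If the data followed the reference distribution, every gap would be close to zero and the points would fall on the line y = x. In this toy sample the largest observation sits well above its theoretical quantile, which is the kind of endpoint deviation described above.</p>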

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.3 Feature selection using correlation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we'll check the correlation of all the numeric features. Measuring correlation lets you
    determine if the value of one variable is useful in predicting the value of another.</p>
    
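<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For instance, if <b>monthly income and age</b> have a positive correlation of <b>0.7</b>, then higher ages tend to go together with higher monthly incomes: <b>an increase of one standard deviation in age is associated, on average, with an increase of 0.7 standard deviations in monthly income.</b></p>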
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The sample Pearson product-moment correlation coefficient is a measure of the linear association between two variables. The computed coefficient ranges from -1.00 to +1.00.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Note that high correlation does not imply a causal relationship between the variables. The following table indicates the meaning of three reference values for the coefficient of correlation between two variables.
</p>

<table style = 'font-size:16px;font-family:Arial'>
    <tr>
        <th>IF the correlation coefficient has this value</th>
        <th>THEN the association between the variables</th>
    </tr>
    <tr>
        <td>-1.00</td>
        <td>is perfectly linear, but inverse. <br>
        As the value for y varies, the value for x varies identically in the opposite direction.</td>
    </tr>
    <tr>
        <td>0</td>
        <td>does not exist and they are said to be uncorrelated.</td>
    </tr>
    <tr>
        <td>+1.00</td>
        <td>is perfectly linear.<br>
        As the value for y varies, the value for x varies identically in the same direction.</td>
    </tr>
</table>
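
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an illustration of the three reference values in the table, here is a small sketch that computes the sample Pearson coefficient by hand. The function name <b>pearson_r</b> and the toy data are assumptions made for this example; the notebook itself relies on the pandas <b>corr</b> method.</p>

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # perfectly linear, same direction
print(pearson_r(x, [10, 8, 6, 4, 2]))  # perfectly linear, opposite direction
print(pearson_r(x, [3, 1, 4, 1, 3]))   # no linear association in this toy data
```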

In [None]:
def get_heatmap(df):
    """Plot a heatmap of the pairwise correlations between numeric columns."""
    # Pearson correlation matrix of the numeric columns only
    corr = df.corr(numeric_only=True)
    fig = px.imshow(
        corr,
        text_auto=".2f",
        width=1100,
        height=1100,
        aspect="auto",
        color_continuous_scale=["lightblue", "lightyellow"],
    )
    fig.show()

In [None]:
get_heatmap(df)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>By examining the above correlation matrix, we can see that <b>days_from_last_contact</b> and <b>prev_contacts_performed</b> have a <b>positive correlation</b> with a value of <b>0.51</b>; however, this correlation is only moderate. The correlations between the other features are relatively low, at less than 0.5.</p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.4 Check FutileColumns using CategoricalSummary</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The CategoricalSummary function displays the distinct values and their counts for each specified input DataFrame column.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The GetFutileColumns function returns a column name as futile if any
    of the following conditions is met: </p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>If all the values in the column are unique</li>
        <li>If all the values in the column are the same</li>
        <li>If the count of distinct values in the column divided by the total number of rows in the input data is greater than or equal to the threshold value</li>
    </ul>
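
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The three conditions can be sketched in plain Python as follows. This is only an illustration of the rules listed above, not the GetFutileColumns implementation; the helper <b>is_futile</b> and the example columns are hypothetical, and the default threshold of 0.9 simply mirrors the value used in this notebook.</p>

```python
def is_futile(values, threshold=0.9):
    """Return True if a categorical column meets any of the three conditions."""
    total = len(values)
    distinct = len(set(values))
    if distinct == total:  # every value is unique
        return True
    if distinct == 1:  # every value is the same
        return True
    # Share of distinct values reaches the threshold
    return distinct / total >= threshold


# Hypothetical columns used only for this example
columns = {
    "customer_code": ["a1", "b2", "c3", "d4"],  # all unique
    "country": ["US", "US", "US", "US"],  # all the same
    "marital": ["married", "single", "married", "single"],
}
futile = [name for name, values in columns.items() if is_futile(values)]
print(futile)
```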

In [None]:
cat_cols = [
    "profession",
    "marital",
    "education",
    "city",
    "communication_type",
    "last_contact_month",
    "campaign",
    "payment_method",
    "purchase_frequency",
    "prev_campaign_outcome",
    "gender",
    "purchased",
]


num_cols = [
    "customer_id",
    "age",
    "monthly_income_in_thousand",
    "family_members",
    "last_contact_day",
    "credit_card",
    "num_of_cars",
    "last_contact_duration",
    "days_from_last_contact",
    "prev_contacts_performed",
    "recency",
]


from teradataml import *


CategoricalSummary_out = CategoricalSummary(data=tdf, target_columns=cat_cols)


# futile column names

GetFutileColumns_out = GetFutileColumns(
    data=tdf,
    object=CategoricalSummary_out,
    category_summary_column="ColumnName",
    threshold_value=0.9,
)

print(GetFutileColumns_out.result)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Fortunately, the above results show that there are no futile columns in our dataset. If there were any futile columns, we would have to drop them, as they would not contribute any significant value to our model. Since there are none, no further processing is needed here.</p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.5 Outlier Analysis</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Outliers are those data points that are significantly different from the rest of the dataset. They are often abnormal observations that skew the data distribution, and arise due to inconsistent data entry, or erroneous observations.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's first visualize the outliers using box plots. In the graph, the red markers outside the box and its whiskers are outliers.</p>

In [None]:
flierprops = dict(
    marker="o",
    markerfacecolor="r",
    markersize=12,
    linestyle="none",
    markeredgecolor="b",
)


def check_outliers(df, cols):
    """Draw one box plot per column, four plots per row."""
    n_cols = 4
    n_rows = int(np.ceil(len(cols) / n_cols))

    plt.figure(figsize=(20, 5 * n_rows))

    for plotnumber, col in enumerate(cols, start=1):
        plt.subplot(n_rows, n_cols, plotnumber)
        # get_values() returns the column data as a NumPy array
        plt.boxplot(df[[col]].get_values(), flierprops=flierprops)
        plt.xlabel(col, fontsize=12)

    plt.tight_layout()
    plt.show()

In [None]:
check_outliers(
    tdf,
    [
        "age",
        "monthly_income_in_thousand",
        "family_members",
        "last_contact_day",
        "last_contact_duration",
        "days_from_last_contact",
        "prev_contacts_performed",
        "recency",
    ],
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The red markers outside the boxes in the above visualization indicate that columns such as <b>age, monthly_income_in_thousand, last_contact_duration, days_from_last_contact, and prev_contacts_performed</b> contain outliers.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now, let's check for outliers using another approach: the Vantage kurtosis function.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Vantage kurtosis function returns the kurtosis of the distribution of value_expression.
    Kurtosis is based on the fourth moment of the distribution of the standardized (z) values.
    It is a measure of the outlier (rare, extreme observation) character of the distribution as
    compared with the normal (or Gaussian) distribution. </p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>The normal distribution has a kurtosis of 0.</li>
    <li>Positive kurtosis indicates that the distribution is more outlier-prone than the normal distribution.</li>
    <li>Negative kurtosis indicates that the distribution is less outlier-prone than the normal distribution.</li>
</ul>
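
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The sign convention in the list above can be illustrated with a short sketch. It uses the simple population formula (fourth standardized moment minus 3), which is an assumption made for clarity; the Vantage function may apply sample-size corrections, so its numbers need not match exactly. The function name <b>excess_kurtosis</b> and both data sets are hypothetical.</p>

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # evenly spread, no extreme values
spiky = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]   # one extreme observation
print(excess_kurtosis(flat))   # negative: less outlier-prone than the normal distribution
print(excess_kurtosis(spiky))  # positive: more outlier-prone than the normal distribution
```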

In [None]:
tdf[num_cols].kurtosis()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above table, we can observe that the following columns have a positive kurtosis: </p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>age</li>
    <li>monthly_income_in_thousand</li>
    <li>last_contact_duration</li>
    <li>days_from_last_contact</li>
    <li>prev_contacts_performed</li>
</ol>

In [None]:
cols_with_outliers = [
    "age",
    "monthly_income_in_thousand",
    "last_contact_duration",
    "days_from_last_contact",
    "prev_contacts_performed",
]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now, let's use the OutlierFilterFit function, which calculates the lower percentile, upper percentile, count of rows, and median for all the "target_columns" provided by the user. These metrics for each column help the OutlierFilterTransform function detect outliers in the input table. It also stores parameters from arguments into a FIT table used during transformation.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> In the OutlierFilterFit function, we set the replacement value to "NULL", so outlier values are replaced with NULL during the transformation. In the next step, we'll impute these NULL values with the mean value of that particular column.</p>

In [None]:
# find the outlier values and replace it with null
OutlierFilterFit_out = OutlierFilterFit(
    data=tdf, target_columns=cols_with_outliers, replacement_value="NULL"
)

# do the actual transformation
OutlierFilterTransform_out = OutlierFilterTransform(
    data=tdf, object=OutlierFilterFit_out.result
)

# impute outliers with mean values
fit_obj_num = SimpleImputeFit(
    data=OutlierFilterTransform_out.result,
    stats_columns=[
        "age",
        "monthly_income_in_thousand",
        "last_contact_duration",
        "days_from_last_contact",
        "prev_contacts_performed",
    ],
    stats="mean",
)

# apply the imputation to the outlier-filtered data and assign the result to a new dataframe
tdf2 = SimpleImputeTransform(
    data=OutlierFilterTransform_out.result, object=fit_obj_num.output
).result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The OutlierFilterTransform function filters the outliers from the input teradataml DataFrame. OutlierFilterTransform uses the result DataFrame from OutlierFilterFit() function to get statistics like median, count of rows, lower percentile and upper percentile for every column specified in target columns argument and filters the outliers in the input data.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The SimpleImputeFit function outputs values to substitute for missing values in the input data. The output values are input to SimpleImputeTransform function, which makes the substitutions. </p>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The SimpleImputeTransform function substitutes specified values for missing values in the input data. The specified values are generated by the SimpleImputeFit function.</p>
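
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The filter-then-impute flow described above can be sketched in plain Python. This is a simplified illustration and not the Vantage implementation: the helpers <b>percentile</b> and <b>filter_and_impute</b>, the 5th and 95th percentile bounds, and the sample durations are all assumptions made for this example.</p>

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile of pre-sorted data, with q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def filter_and_impute(values, lower_q=0.05, upper_q=0.95):
    """Replace values outside the percentile band with None, then with the mean."""
    ordered = sorted(values)
    lower, upper = percentile(ordered, lower_q), percentile(ordered, upper_q)
    # Step 1 (filter): outliers become None, mirroring replacement with NULL.
    filtered = [v if lower <= v <= upper else None for v in values]
    # Step 2 (impute): the mean is computed from the remaining values only.
    kept = [v for v in filtered if v is not None]
    mean = sum(kept) / len(kept)
    return [mean if v is None else v for v in filtered]


# Hypothetical call durations in seconds, with one extreme value
durations = [120, 95, 110, 130, 105, 115, 100, 125, 90, 2400]
print(filter_and_impute(durations))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note that the imputation has to run on the filtered values; applied to the original values it would find nothing to replace.</p>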

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.6 Data Transformation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Machine learning models are only as good as the data that is used to train them. A key characteristic of good training data is that it is provided in a way that is optimized for learning and generalization. The process of putting together the data in this optimal format is known in the industry as feature transformation.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Data normalization</b> is the process of making sure all values in your dataset are on the same scale. It’s a common data transformation technique, and it’s often used when working with numerical data. For instance, we have a feature <b>last_contact_duration</b> with values that are measured in <b>seconds</b> and a feature <b>age</b> with values that are measured in <b>years</b>. To develop a machine learning model using this data, you would first need to normalize it so that all the features are on the same scale. Otherwise, features with large numeric ranges could dominate the model and hurt its accuracy.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Label encoding</b> is a technique used in machine learning and data analysis to convert categorical variables into numerical format. It is particularly useful when working with algorithms that require numerical input, as most machine learning models can only operate on numerical data.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>ZScore</b> allows rescaling of continuous numeric data in a more sophisticated way than a Rescaling transformation. In a Z-Score transformation, a numeric column is transformed into its Z-score based on the mean value and standard deviation of the data in the column: each column value becomes the number of standard deviations it lies from the mean value of the column. The transformed column therefore has a mean of 0 and a standard deviation of 1, which is useful in data mining as an alternative to a Rescaling transformation.</p>
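
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To show what these two transformations do to the values, here is a minimal sketch in plain Python. The helpers <b>z_scores</b> and <b>label_encode</b>, the sample data, and the use of the population standard deviation are assumptions for illustration; the notebook itself uses the Vantage ZScore and LabelEncoder objects.</p>

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption here)
    return [(v - mu) / sigma for v in values]


def label_encode(values, mapping):
    """Replace each category with the integer assigned to it in mapping."""
    return [mapping[v] for v in values]


ages = [25, 35, 45, 55]          # years
durations = [60, 180, 300, 420]  # seconds
print(z_scores(ages))
print(z_scores(durations))       # on the same scale as ages after the transform

marital_codes = {"married": 1, "single": 2, "divorced": 3}
print(label_encode(["single", "married", "divorced"], marital_codes))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Although the two toy features are measured in very different units, their Z-scores end up on the same scale, which is the purpose of the transformation.</p>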

In [None]:
# Define the label encoders

profession_encoder = LabelEncoder(
    values=[
        ("admin.", 1),
        ("technician", 2),
        ("services", 3),
        ("management", 4),
        ("retired", 5),
        ("blue-collar", 6),
        ("unemployed", 7),
        ("entrepreneur", 8),
        ("housemaid", 9),
        ("unknown", 10),
        ("self-employed", 11),
        ("student", 12),
    ],
    columns="profession",
    datatype="integer",
)

marital_encoder = LabelEncoder(
    values=[("married", 1), ("single", 2), ("divorced", 3)],
    columns="marital",
    datatype="integer",
)

education_encoder = LabelEncoder(
    values=[("secondary", 1), ("tertiary", 2), ("primary", 3), ("unknown", 4)],
    columns="education",
    datatype="integer",
)
city_encoder = LabelEncoder(
    values=[
        ("Philadelphia", 1),
        ("San Diego", 2),
        ("New York", 3),
        ("Phoenix", 4),
        ("Los Angeles", 5),
        ("Chicago", 6),
        ("Houston", 7),
        ("Dallas", 8),
        ("San Jose", 9),
        ("San Antonio", 10),
    ],
    columns="city",
    datatype="integer",
)
communication_type_encoder = LabelEncoder(
    values=[("unknown", 1), ("cellular", 2), ("telephone", 3)],
    columns="communication_type",
    datatype="integer",
)
last_contact_month_encoder = LabelEncoder(
    values=[
        ("jan", 1),
        ("feb", 2),
        ("mar", 3),
        ("apr", 4),
        ("may", 5),
        ("jun", 6),
        ("jul", 7),
        ("aug", 8),
        ("sep", 9),
        ("oct", 10),
        ("nov", 11),
        ("dec", 12),
    ],
    columns="last_contact_month",
    datatype="integer",
)

campaign_encoder = LabelEncoder(
    values=[
        ("campaign_1", 1),
        ("campaign_2", 2),
        ("campaign_3", 3),
        ("campaign_4", 4),
        ("campaign_6", 5),
        ("campaign_5", 6),
        ("campaign_8", 7),
        ("campaign_11", 8),
        ("campaign_9", 9),
        ("campaign_10", 10),
        ("campaign_15", 11),
        ("campaign_12", 12),
        ("campaign_14", 13),
        ("campaign_7", 14),
        ("campaign_24", 15),
        ("campaign_13", 16),
        ("campaign_17", 17),
        ("campaign_29", 18),
        ("campaign_21", 19),
        ("campaign_20", 20),
        ("campaign_16", 21),
        ("campaign_32", 22),
        ("campaign_19", 23),
        ("campaign_25", 24),
        ("campaign_22", 25),
        ("campaign_43", 26),
        ("campaign_18", 27),
        ("campaign_41", 28),
        ("campaign_63", 29),
        ("campaign_27", 30),
        ("campaign_30", 31),
        ("campaign_26", 32),
        ("campaign_23", 33),
        ("campaign_28", 34),
        ("campaign_33", 35),
        ("campaign_31", 36),
    ],
    columns="campaign",
    datatype="integer",
)

payment_method_encoder = LabelEncoder(
    values=[
        ("QRcodes", 1),
        ("credit_card", 2),
        ("ewallets", 3),
        ("cash", 4),
        ("payment_links", 5),
        ("debit_card", 6),
    ],
    columns="payment_method",
    datatype="integer",
)

purchase_frequency_encoder = LabelEncoder(
    values=[
        ("biweekly", 3),
        ("quarterly", 5),
        ("yearly", 6),
        ("monthly", 4),
        ("weekly", 2),
        ("daily", 1),
    ],
    columns="purchase_frequency",
    datatype="integer",
)

prev_campaign_outcome_encoder = LabelEncoder(
    values=[("unknown", 1), ("other", 2), ("failure", 3), ("success", 4)],
    columns="prev_campaign_outcome",
    datatype="integer",
)

# OneHotEncoder
# credit_card_encoder = OneHotEncoder(style="contrast", values=1, reference_value=0, columns="credit_card")
gender_encoder = OneHotEncoder(
    style="contrast", values="male", reference_value=1, columns="gender"
)
purchased_encoder = OneHotEncoder(
    style="contrast", values="yes", reference_value="1", columns="purchased"
)

# Define the standard scaler
z_scaler = ZScore(
    columns=[
        "age",
        "monthly_income_in_thousand",
        "family_members",
        "last_contact_day",
        "num_of_cars",
        "last_contact_duration",
        "days_from_last_contact",
        "prev_contacts_performed",
        "recency",
    ]
)

# Define the retain object
retain = Retain(columns=["credit_card"])

In [None]:
# Process the transformation
df_transformed = valib.Transform(
    data=tdf2,
    zscore=z_scaler,
    label_encode=[
        profession_encoder,
        marital_encoder,
        education_encoder,
        city_encoder,
        communication_type_encoder,
        last_contact_month_encoder,
        campaign_encoder,
        payment_method_encoder,
        purchase_frequency_encoder,
        prev_campaign_outcome_encoder,
    ],
    one_hot_encode=[gender_encoder, purchased_encoder],
    index_columns="customer_id",
    key_columns="customer_id",
    retain=retain,
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Transform function applies the transformations defined above (label encoding, one-hot encoding, Z-score scaling, and retain) to the input columns.</p>

In [None]:
df_transformed.result.to_sql(
    "marketing_campaign_trans_data", primary_index="customer_id", if_exists="replace"
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have applied LabelEncoder and OneHotEncoder to convert the categorical features to numerical values, and ZScore to rescale the continuous numerical features.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To use the transformed data in the following steps, we save the transformed DataFrame into a Vantage table named <b>marketing_campaign_trans_data</b>.</p>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>5. Train-Test Split</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll split the transformed dataset into training and testing datasets in the ratio 80:20, and we will save the datasets into Vantage.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Once the dataset is split, we'll check the number of records in the train and test sets.</p>
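
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, a seeded 80:20 split works like the sketch below. This does not reproduce the internal algorithm of TD_TrainTestSplit, which is not shown in this notebook; the function <b>train_test_split_ids</b> and the range of customer IDs are hypothetical and only illustrate why a fixed seed makes the split repeatable.</p>

```python
import random


def train_test_split_ids(ids, train_size=0.8, seed=123):
    """Shuffle ids reproducibly and cut them into train and test lists."""
    shuffled = list(ids)
    # A dedicated generator with a fixed seed gives the same order on every run
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


customer_ids = range(1, 101)  # hypothetical customer_id values
train_ids, test_ids = train_test_split_ids(customer_ids)
print(len(train_ids), len(test_ids))
```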

In [None]:
query = f"""CREATE MULTISET TABLE TrainTestSplit_output AS (
    SELECT * FROM TD_TrainTestSplit(
        ON marketing_campaign_trans_data AS InputTable
        USING
        IDColumn('customer_id')
        trainSize(0.80)
        testSize(0.20)
        Seed(123)
    ) AS dt
) WITH DATA;"""

try:
    execute_sql(query)
except:
    db_drop_table("TrainTestSplit_output")
    execute_sql(query)

In [None]:
query = f"""CREATE MULTISET TABLE rmc_train AS (
    SELECT * FROM TrainTestSplit_output WHERE TD_IsTrainRow = 1
) WITH DATA;"""

try:
    execute_sql(query)
except:
    db_drop_table("rmc_train")
    execute_sql(query)

In [None]:
query = f"""CREATE MULTISET TABLE rmc_test AS (
    SELECT * FROM TrainTestSplit_output WHERE TD_IsTrainRow = 0
) WITH DATA;"""

try:
    execute_sql(query)
except:
    db_drop_table("rmc_test")
    execute_sql(query)

In [None]:
df_train = DataFrame("rmc_train")
df_test = DataFrame("rmc_test")
print(
    "Training Set = "
    + str(df_train.shape[0])
    + ". Testing Set = "
    + str(df_test.shape[0])
)

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>6. In-Database Machine Learning</b>

<p style = 'font-size:20px;font-family:Arial;color:#00233c'><b>6.1 Train an XGBoost Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll use the TD_XGBOOST function to train an xgboost model using the yes_purchased column as the target variable for classification. XGBoost's tree-based ensemble approach, regularization techniques, handling of missing values, scalability, and feature importance capabilities make it a powerful and effective choice for modeling tabular data, often leading to superior performance compared to other machine learning algorithms.
<br>
<br>
The TD_XGBoost function (eXtreme Gradient Boosting) implements gradient-boosted decision trees designed for speed and performance. The technique is widely used in applied machine learning on tabular data.
<br>
<br>
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the errors made by existing models. The predicted residual is multiplied by a learning rate (the ShrinkageFactor argument in the query below) and then added to the previous prediction. Models are added sequentially until no further improvements can be made. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
</p>
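
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The residual-fitting loop described above can be sketched on a toy problem. This is not how TD_XGBoost is implemented: the example uses squared loss on a small regression task with one-split trees, and the names <b>fit_stump</b> and <b>boost</b> as well as the data are assumptions chosen to keep the idea visible.</p>

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=10, learning_rate=0.1):
    """Gradient boosting for squared loss with one-split trees."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Shrink the new tree's output before adding it to the ensemble.
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
_, mse_per_round = boost(xs, ys)
print([round(m, 4) for m in mse_per_round])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The printed mean squared error shrinks from round to round, because each new tree corrects part of what the previous trees got wrong.</p>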

In [None]:
# Create a table xgb_model using TD_XGBoost from Teradata
# The TD_XGBoost function reads rmc_train with PARTITION BY ANY, trains an XGBoost classification model
# with the default number of trees (NumBoostedTrees(-1)), a maximum depth of 5, and 10 iterations,
# and saves the output to a metadata table xgb_out.
# If the table xgb_model already exists, drop it and the metadata table xgb_out before creating the new table.

query = f"""CREATE MULTISET TABLE xgb_model AS (
SELECT * FROM TD_XGBoost(
ON rmc_train PARTITION BY ANY
OUT TABLE MetaInformationTable(xgb_out) 
USING
    ResponseColumn('yes_purchased')
    InputColumns('credit_card', 'male_gender', 'profession', 'marital', 'education', 'city', 'communication_type', 'last_contact_month', 'campaign', 'payment_method',
    'purchase_frequency', 'prev_campaign_outcome', 'age', 'monthly_income_in_thousand', 'family_members', 'last_contact_day', 'num_of_cars', 'last_contact_duration',
    'days_from_last_contact', 'prev_contacts_performed', 'recency')
    MaxDepth(5)
    NumBoostedTrees(-1)
    ModelType('classification')
    Seed(465)
    ShrinkageFactor(0.1)
    IterNum(10) 
    ColumnSampling(1.0) 
) AS dt) WITH DATA;
"""

try:
    execute_sql(query)
except Exception as e:
    # Drop the tables and try again if the table already exists
    db_drop_table("xgb_model")
    db_drop_table("xgb_out")
    execute_sql(query)

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>6.2 XGBoost - Model Scoring</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In the next step, we'll use the TD_XGBoostPredict function to score the xgboost model trained in the previous step.</p>

In [None]:
query = """CREATE MULTISET TABLE xgb_predict_out AS (
SELECT * FROM TD_XGBoostPredict(
ON rmc_test AS inputtable PARTITION BY ANY
ON xgb_model AS modeltable DIMENSION order by task_index, tree_num, iter, class_num, tree_order
USING
    IdColumn('customer_id')
    ModelType('classification')
    Accumulate('yes_purchased')
) AS dt) WITH DATA;
"""

try:
    execute_sql(query)
except:
    db_drop_table("xgb_predict_out")
    execute_sql(query)

In [None]:
xgb_result = DataFrame("xgb_predict_out")
xgb_result_pd = (
    xgb_result.to_pandas()
    .reset_index()
    .sort_values("customer_id")
    .rename(columns={"yes_purchased": "Actual"})
)
xgb_result_pd.head()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next, we'll use the TD_ClassificationEvaluator function to evaluate the trained xgboost model on test data. This will let us know how well our model has performed on unseen data.</p>

In [None]:
query = """
CREATE multiset table xgb_predict_out1 as (
     select customer_id,
     CAST(yes_purchased AS INTEGER) AS purchased,
     CAST(Prediction AS INTEGER) as prediction,
    Confidence_Lower, 
    Confidence_upper
    FROM xgb_predict_out
) with data;"""

try:
    execute_sql(query)
except:
    db_drop_table("xgb_predict_out1")
    execute_sql(query)

In [None]:
# Evaluate the XGBoost model's performance using TD_ClassificationEvaluator
# Check if the necessary tables exist before executing the query


if not get_connection().dialect.has_table(get_connection(), "xgb_predict_out1"):
    print("Error: xgb_predict_out1 table does not exist.")
    sys.exit(1)

query = """
SELECT * FROM TD_ClassificationEvaluator(
   ON (select prediction, purchased from xgb_predict_out1) AS InputTable
   OUT VOLATILE TABLE OutputTable(additional_metrics_xgb)
   USING
   ObservationColumn('purchased')
   PredictionColumn('prediction')
   Labels(0,1)
) AS dt;
"""


try:
    execute_sql(query)
except:
    db_drop_table("additional_metrics_xgb")
    execute_sql(query)

In [None]:
df = DataFrame.from_query("sel * from additional_metrics_xgb")
df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The result table displays the evaluation metrics for XGBoost models retrieved from TD_ClassificationEvaluator.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output above is the secondary output table, which returns micro-, macro-, and weighted-averaged metrics of precision, recall, and F1-score values.</p>
<table style = 'font-size:16px;font-family:Arial;color:#00233C'>
  <tr>
    <th>Column</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Precision</td>
    <td>The positive predictive value. Refers to the fraction of relevant instances among
the total retrieved instances.</td>
  </tr>
  <tr>
    <td>Recall</td>
    <td>Refers to the fraction of relevant instances retrieved over the total amount of
relevant instances.</td>
  </tr>
  <tr>
    <td>F1</td>
    <td>F1 score, defined as the harmonic mean of the precision and recall.</td>
  </tr>
  <tr>
    <td>Support</td>
    <td>The number of times a label appears in the ObservationColumn.</td>
  </tr>
</table>
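
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The definitions in the table can be checked by hand with a short sketch. The function <b>binary_metrics</b> and the label lists are hypothetical and only illustrate the formulas for a single positive label; TD_ClassificationEvaluator additionally reports averaged values across labels.</p>

```python
def binary_metrics(actual, predicted, positive=1):
    """Precision, recall, F1, and support for the positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Hypothetical labels: 1 = purchased, 0 = not purchased
actual_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_metrics(actual_labels, predicted_labels))
```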

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A <b>confusion matrix</b> is a performance measurement technique for machine learning classification from which metrics such as <b>recall, precision, and accuracy</b> can be computed. It is a systematic way to compare the predictions with the classes to which the data actually belonged. If you train a machine learning classification model and score it on a dataset, the resulting confusion matrix shows how accurately the model categorized each record and where there might be errors. The matrix rows represent the actual labels contained in the test dataset, and the matrix columns represent the predicted labels.</p>
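
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a small illustration of the row and column layout, the sketch below builds a 2x2 confusion matrix by counting. The function <b>confusion_counts</b> and the labels are hypothetical; the notebook itself uses the scikit-learn confusion_matrix function, which follows the same rows-are-actual, columns-are-predicted convention.</p>

```python
def confusion_counts(actual, predicted):
    """2x2 matrix: rows are actual labels (0, 1), columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels: 1 = purchased, 0 = not purchased
actual_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
cm_example = confusion_counts(actual_demo, predicted_demo)
(tn, fp), (fn, tp) = cm_example
print(cm_example)
print("accuracy:", (tn + tp) / len(actual_demo))
```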

In [None]:
def get_conf_matrix(df):
    # df = pd.read_sql('SELECT customer_id, cast(yes_purchased as int) "purchased", cast(prediction as int) prediction FROM xgb_predict_out', eng)
    cm = confusion_matrix(df["purchased"], df["prediction"])
    cmd = ConfusionMatrixDisplay(cm, display_labels=["Not_purchased", "purchased"])
    return cm, cmd

In [None]:
cm_xgb_df = DataFrame.from_query(
    'SELECT customer_id, cast(yes_purchased as int) "purchased", cast(prediction as int) prediction FROM xgb_predict_out'
).to_pandas()

cm, cmd = get_conf_matrix(cm_xgb_df)

cmd.plot()

In [None]:
def conf_mat_template(cm):
    # Rows of cm are the actual classes, columns are the predicted classes
    (tn, fp), (fn, tp) = cm
    actual_neg, actual_pos = tn + fp, fn + tp
    style = "font-size:16px;font-family:Arial;color:#00233C"
    return f"""<p style = '{style}'>From the above <b>confusion matrix</b> we can conclude that
<br><b>Out of all the actual non-purchase cases ({actual_neg})</b></p>
<ul style = '{style}'>
    <li>{round(tn / actual_neg * 100, 2)}% were correctly classified as non-purchase.</li>
    <li>{round(fp / actual_neg * 100, 2)}% were incorrectly classified as purchased.</li>
</ul>
<p style = '{style}'><b>Similarly, out of all the actual purchase cases ({actual_pos})</b></p>
<ul style = '{style}'>
    <li>{round(tp / actual_pos * 100, 2)}% were correctly classified as purchased.</li>
    <li>{round(fn / actual_pos * 100, 2)}% were incorrectly classified as non-purchase.</li>
</ul>"""

In [None]:
from IPython.display import display, Markdown

display(Markdown(conf_mat_template(cm)))

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>6.3 Train a Decision Forest Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Decision Forest is a powerful method for predicting outcomes in both classification and regression problems. It improves on the technique of combining (or "bagging") multiple decision trees. Typically, constructing a decision tree involves evaluating every input feature in the data to select each split point. A decision forest instead restricts each split point to a random subset of the features, which forces the decision trees in the forest to differ from one another and ultimately improves prediction accuracy. The TD_DecisionForest function uses a training dataset to create a predictive model, and the TD_DecisionForestPredict function uses that model to make predictions. TD_DecisionForest supports regression, binary, and multi-class classification.</p>
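<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The three ideas described above (bootstrap sampling, a random feature subset per split, and combining the trees' votes) can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not of how TD_DecisionForest is implemented internally. The helper names and the <code>mtry</code> argument are hypothetical, and the feature names are borrowed from this dataset.</p>

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(feature_names, mtry, rng):
    """Pick the random subset of features that one split may consider."""
    return rng.sample(feature_names, mtry)


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into a single forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
features = ["age", "recency", "campaign", "family_members", "num_of_cars"]
rows = list(range(10))  # stand-in for ten training rows

sample = bootstrap_sample(rows, rng)                 # one tree's training rows
subset = candidate_features(features, mtry=2, rng=rng)  # features for one split
forest_prediction = majority_vote([1, 0, 1, 1, 0])   # five hypothetical tree votes

print(sample, subset, forest_prediction)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because every tree sees a different sample and every split sees a different feature subset, the trees make different errors, which is what makes averaging their votes worthwhile.</p>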

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Consider the following points:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
<li>All input features must be numeric. Convert categorical columns to numeric columns as a preprocessing step.</li>
<li>For classification, class labels (ResponseColumn values) can only be integers. A maximum of 500 classes is supported for classification.</li>
<li>Observations with missing values in any input column are ignored during training. To fill in missing values, use the TD_SimpleImpute function.</li>
<li>The number of trees built by the TD_DecisionForest function depends on the values of NumTrees, TreeSize, and CoverageFactor, as well as the data distribution in the cluster. The trees are built simultaneously by all the processing units (AMPs) that have a non-empty portion of the data.</li>
</ul>
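<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The first three points describe what the training table must look like. The plain-Python sketch below illustrates them on three made-up rows: incomplete rows are dropped (mirroring how training ignores them), a categorical column is mapped to integer codes, and the yes/no target becomes an integer label. The helper names are hypothetical; in Vantage the equivalent steps would be done with in-database functions such as TD_SimpleImpute.</p>

```python
def drop_incomplete(rows):
    """Keep only rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def ordinal_encode(values):
    """Map each distinct category to an integer code.

    Categories are sorted first so the mapping is reproducible.
    """
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Made-up rows, for illustration only
rows = [
    {"marital": "single", "age": 31, "purchased": "yes"},
    {"marital": "married", "age": None, "purchased": "no"},
    {"marital": "divorced", "age": 52, "purchased": "no"},
]

complete = drop_incomplete(rows)
marital_codes, marital_map = ordinal_encode([r["marital"] for r in complete])
labels = [1 if r["purchased"] == "yes" else 0 for r in complete]

print(marital_map, marital_codes, labels)
```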


In [None]:
query = """Create multiset table DF_train as (
SELECT * FROM TD_DecisionForest (
    ON rmc_train AS INPUTTABLE partition by ANY
USING
    ResponseColumn('yes_purchased')
    InputColumns('credit_card', 'male_gender', 'profession', 'marital', 'education', 'city', 'communication_type', 'last_contact_month', 'campaign', 'payment_method',
    'purchase_frequency', 'prev_campaign_outcome', 'age', 'monthly_income_in_thousand', 'family_members', 'last_contact_day', 'num_of_cars', 'last_contact_duration',
    'days_from_last_contact', 'prev_contacts_performed', 'recency')
    MaxDepth(10)
    MinNodeSize(1)
    NumTrees(5)
    ModelType('CLASSIFICATION')
    Seed(1)
    Mtry(-1)
    MtrySeed(1)
) AS dt
) with data;
"""
try:
    execute_sql(query)
except:
    db_drop_table("DF_train")
    execute_sql(query)

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>6.4 Decision Forest - Model Scoring</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In the next step, we'll use the TD_DecisionForestPredict function to score the decision forest model trained in the previous step.</p>

In [None]:
query = """
Create multiset table DF_Predict_out as (
SELECT * FROM TD_DecisionForestPredict (
ON rmc_test AS InputTable PARTITION BY ANY
ON DF_train AS ModelTable DIMENSION
USING
  IdColumn ('customer_id')
  Detailed('false')
  Accumulate('yes_purchased')
) AS dt) with data;"""

try:
    execute_sql(query)
except:
    db_drop_table("DF_Predict_out")
    execute_sql(query)

In [None]:
df_result = DataFrame("DF_Predict_out")
df_result_pd = (
    df_result.to_pandas()
    .reset_index()
    .sort_values("customer_id")
    .rename(columns={"yes_purchased": "Actual"})
)
df_result_pd.head()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_CLASSIFICATIONEVALUATOR function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.</p>

In [None]:
query = """
CREATE multiset table DF_Predict_out1 as (
     select customer_id,
     CAST(yes_purchased AS INTEGER) AS purchased,
     CAST(Prediction AS INTEGER) as prediction,
    Confidence_Lower, 
    Confidence_upper
    FROM DF_Predict_out
) with data;"""

try:
    execute_sql(query)
except:
    db_drop_table("DF_Predict_out1")
    execute_sql(query)

In [None]:
query = """
SELECT * FROM TD_CLASSIFICATIONEVALUATOR(
    ON  DF_Predict_out1 AS InputTable
OUT TABLE OutputTable(additional_metrics_df)
USING
    Labels(0,1)
    ObservationColumn('purchased')
    PredictionColumn ('prediction')
) as dt1 ; 
"""

try:
    execute_sql(query)
except:
    db_drop_table("additional_metrics_df")
    execute_sql(query)

In [None]:
DF_metrics = DataFrame.from_query("sel * from additional_metrics_df")
DF_metrics

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The result table displays the evaluation metrics for DecisionForest models retrieved from TD_CLASSIFICATIONEVALUATOR.</p>

In [None]:
cm_df_df = DataFrame.from_query(
    "SELECT customer_id, purchased, prediction FROM DF_Predict_out1"
).to_pandas()

cm, cmd = get_conf_matrix(cm_df_df)

cmd.plot()

In [None]:
display(Markdown(conf_mat_template(cm)))

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>7. Visualize the results</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We trained and evaluated two models. The Vantage TD_CLASSIFICATIONEVALUATOR function was used to evaluate each model, and its metrics let us compare them.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's visualize the Decision Forest vs. XGBoost evaluation results in a graph to compare the values.</p>

In [None]:
query = """CREATE MULTISET TABLE metric_union as (select cast('XGBoost' as VARCHAR(15)) as Model, trim(Metric) as Metric,MetricValue from additional_metrics_xgb a 
union all 
select 'DecisionForest' as Model ,  trim(Metric) as Metric,MetricValue from additional_metrics_df b
)with data PRIMARY INDEX (Metric)
;
"""

try:
    execute_sql(query)
except:
    db_drop_table("metric_union")
    execute_sql(query)

df_chart = DataFrame.from_query("select * from metric_union")
df_chart = df_chart.to_pandas().reset_index()

In [None]:
df_chart["Metric"] = df_chart["Metric"].str.replace("\x00", "")
fig = px.bar(
    df_chart,
    x="Metric",
    y="MetricValue",
    color="Model",
    barmode="group",
    title="Compare models",
    labels={"Metric": "Metrics", "MetricValue": "Metric Values"},
)

fig.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Decision Forest and XGBoost models are compared using the aforementioned measures. We can observe that the performance of the Decision Forest and XGBoost models is essentially the same.</p>
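<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To put a number on "essentially the same", the long-format rows (Model, Metric, MetricValue) can be pivoted so that each metric shows both models side by side, along with their difference. The sketch below uses plain Python. The helper names are hypothetical, and the metric names and values are placeholders for illustration only, not output from this notebook.</p>

```python
def pivot_metrics(rows):
    """Turn (model, metric, value) rows into {metric: {model: value}}."""
    table = {}
    for model, metric, value in rows:
        table.setdefault(metric.strip(), {})[model] = value
    return table


def metric_gaps(table, model_a, model_b):
    """Difference model_a - model_b for every metric both models report."""
    return {
        metric: round(values[model_a] - values[model_b], 4)
        for metric, values in table.items()
        if model_a in values and model_b in values
    }


# Placeholder rows, for illustration only
rows = [
    ("XGBoost", "Accuracy", 0.90),
    ("DecisionForest", "Accuracy", 0.89),
    ("XGBoost", "Macro-F1", 0.81),
    ("DecisionForest", "Macro-F1", 0.81),
]

table = pivot_metrics(rows)
gaps = metric_gaps(table, "XGBoost", "DecisionForest")
print(gaps)
```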

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Another way to compare the models and select the best one is to calculate the AUC (Area Under the Curve) of the Receiver Operating Characteristic (ROC) curve.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ROC curve plots the TPR (True Positive Rate) against the FPR (False Positive Rate) as the classification threshold varies. The area under the ROC curve measures how well the model can distinguish between the positive and negative classes: an AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so the higher the AUC, the better the model. An AUC above 0.75 is generally considered decent.</p>
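<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a minimal sketch of how these quantities are computed, the plain-Python code below sweeps a threshold over a set of scores, records the (FPR, TPR) point at each threshold, and integrates the curve with the trapezoid rule. The function names and the four example records are made up for illustration. The sketch assumes both classes are present in <code>y_true</code>.</p>

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    # Use each distinct score, from highest to lowest, as the threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Made-up labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc_scores = auc_trapezoid(fpr, tpr)

# Hard 0/1 predictions give only one point between (0, 0) and (1, 1)
fpr_hard, tpr_hard = roc_points(y_true, [0, 1, 1, 1])
auc_hard = auc_trapezoid(fpr_hard, tpr_hard)

print(list(zip(fpr, tpr)), auc_scores)
print(list(zip(fpr_hard, tpr_hard)), auc_hard)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note the second call: when the input is a hard 0/1 prediction rather than a probability or score, the "curve" is just two straight segments through a single operating point, so the AUC summarizes one threshold rather than the full ranking quality of the model.</p>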

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following plot compares the ROC curves and AUC values of the XGBoost and DecisionForest models.</p>


In [None]:
# ROC curve for the Decision Forest model
# Note: "prediction" holds hard 0/1 labels, so the curve has a single interior point
result_dt_pandas = DataFrame(in_schema("demo_user", "DF_Predict_out1")).to_pandas()
fpr_dt, tpr_dt, thresholds_dt = roc_curve(
    result_dt_pandas["purchased"], result_dt_pandas["prediction"]
)
auc_dt = roc_auc_score(result_dt_pandas["purchased"], result_dt_pandas["prediction"])
plt.plot(
    fpr_dt,
    tpr_dt,
    color="orange",
    label="Decision Forest ROC. AUC = {}".format(str(round(auc_dt, 4))),
)

# ROC curve for XGB
result_xgb_pandas = DataFrame(in_schema("demo_user", "xgb_predict_out1")).to_pandas()
fpr_xgb, tpr_xgb, thresholds_xgb = roc_curve(
    result_xgb_pandas["purchased"], result_xgb_pandas["prediction"]
)
auc_xgb = roc_auc_score(result_xgb_pandas["purchased"], result_xgb_pandas["prediction"])
plt.plot(
    fpr_xgb,
    tpr_xgb,
    color="green",
    label="XGB ROC. AUC = {}".format(str(round(auc_xgb, 4))),
)


plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend()
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at the ROC curves above, we can say that both models performed well on the test data. The AUC values are close to 0.75, which supports our assessment that the models are effective. The graph also shows that the performance of the two models (XGBoost and DecisionForest) is very similar; they barely differ from one another.</p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>7.1 Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In conclusion, implementing a retail marketing campaign solution can greatly benefit the client by reducing marketing effort and cost, as well as customer annoyance.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If multiple campaigns and multiple contacts are directed at the same customer, the customer is more likely to lose interest in purchasing the product. Limiting outreach to about 2 or 3 contacts per customer is preferable.</p>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>8. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>8.1 Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up the work tables to prevent errors the next time the notebook is run.</p>

In [None]:
tables = [
    "marketing_campaign_trans_data",
    "rmc_train",
    "rmc_test",
    "TrainTestSplit_output",
    "xgb_out",
    "xgb_model",
    "additional_metrics_xgb",
    "xgb_predict_out1",
    "DF_train",
    "DF_Predict_out",
    "DF_Predict_out1",
    "additional_metrics_df",
    "metric_union",
    "RankTable",
    "Distributions",
    "Distributions_Table",
    "lineGraph",
    "lineGraph_Result",
]


for t in tables:
    try:
        db_drop_table(table_name=t)

    except:
        pass

<p style = 'font-size:18px;font-family:Arial;color:#00233c'> <b>8.2 Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_MarketingCamp');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#00233c'>Dataset:</b>

- `customer_id`: unique customer identifier
- `age`: customer age (numeric)
- `profession`: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")
- `marital`: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
- `education`: customer education (categorical: "unknown","secondary","primary","tertiary")
- `city`: city of customer (categorical: 'New York','Los Angeles','Chicago','Houston','Phoenix','Philadelphia','San Antonio','San Diego','Dallas','San Jose')
- `monthly_income_in_thousand`: customer's monthly income, in thousands of dollars (numeric)
- `family_members`: number of family members (numeric)
- `communication_type`: communication type (categorical: "unknown","telephone","cellular")
- `last_contact_day`: last contact day of the month (numeric)
- `last_contact_month`: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- `credit_card`: does customer have a credit card? (binary: 'yes','no')
- `num_of_cars`: number of cars (numeric)
- `last_contact_duration`: last contact duration, in seconds (numeric)
- `campaign`: number of contacts performed during this campaign and for this client (categorical, includes last contact)
- `days_from_last_contact`: number of days that passed after the client was last contacted in a previous campaign (numeric, -1 means client was not previously contacted)
- `prev_contacts_performed`: number of contacts performed before this campaign and for this client (numeric)
- `prev_campaign_outcome`: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
- `payment_method`: payment method used by the customer (categorical: 'cash','credit_card','debit_card','ewallets', 'payment_links', 'QRcodes')
- `purchase_frequency`: how frequently the customer purchases (categorical: 'daily','weekly','biweekly','monthly','quarterly','yearly')
- `gender`: gender of the customer (binary: 'male','female')
- `recency`: number of days since the last purchase (numeric)


Output variable (desired target):
- `purchased`: did the customer make a purchase? This is the target column (binary: 'yes','no')

<p style = 'font-size:16px;font-family:Arial;color:#00233c'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>