<header>
   <p  style='font-size:36px;font-family:Arial;color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Telco Churn using Enterprise Feature Store and AutoML in Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;'><b>AutoML Approach</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Teradataml <b>Auto</b>mated <b>M</b>achine <b>L</b>earning (AutoML) automates the end-to-end machine learning flow. It raises data scientist productivity by automatically training high-quality models for a specific business need. AutoML covers the distinct phases of the machine learning pipeline: feature exploration, feature engineering, data preparation, model selection, model training with hyperparameter tuning, and model evaluation. Automating these tasks reduces the manual effort required from trained data scientists and lowers the prerequisite knowledge for beginners, so users at different levels of expertise can build machine learning models with AutoML.
</p>

<p style = 'font-size:16px;font-family:Arial;'>Key Features of Teradata AutoML approach:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
    <li>Helps users determine the best-performing model automatically.</li>
    <li>Increases ease of use in model building.</li>
    <li>Supports various problem types, including Regression, Binary Classification, and Multiclass Classification.</li>
    <li>Provides five different models for training: GLM, SVM, Decision Forest, XGBoost, and KNN.</li>
    <li>Flexibility to select specific models out of the available models.</li>
    <li>All five phases are automated and can be customized based on user input.</li>
    <li>Generates model leaderboard and leader for a given dataset.</li>
    <li>Allows prediction on the validation dataset and on user-supplied data, using any model on the leaderboard.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;'>Below are the different phases of AutoML:</p>
<center><img src="images/AutoML_phases.png" alt="efs" width=800 height=1200 style='border: 4px solid #404040; padding-right:15px; border-radius: 10px;'></center>

<p style = 'font-size:18px;font-family:Arial;'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial;'>To maximize the business value of advanced analytic techniques including Machine Learning and Artificial Intelligence, it is estimated that organizations must scale their model development and deployment pipelines to 100s or 1000s of times greater amounts of data, models, or both.</p>

<p style = 'font-size:16px;font-family:Arial;'>There are several reasons why EFS is a natural fit for Teradata Vantage:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
    <li>Utilizes Teradata Vantage with its powerful Analytical Library and SQL Engine.</li>
    <li>Primary Index enables efficient single-row access for online feature use.</li>
    <li>Single platform for both online and offline feature stores.</li>
    <li>Macros reduce parsing overhead from API access.</li>
    <li>R and Python code execution within SQL Engine.</li>
    <li>Bi-temporal querying capability.</li>
    <li>Scalable MPP power for feature computation.</li>
    <li>Industry-specific Logical Data Model as a feature source.</li>
    <li>Connectivity to Object Storage via NOS for feature data sourcing.</li>
    <li>Query Grid facilitates access to multiple data sources.</li>
    <li>Close-to-real-time feature delivery using Query Services and Teradata Intelligent Memory.</li>
    <li>Workload management prioritizes tasks effectively.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>The unique massively-parallel architecture of Teradata Vantage allows users to prepare data, train, evaluate, and deploy models at unprecedented scale.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage, import Python packages, and explore the dataset</b></p>


<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
!pip install --upgrade teradataml

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>Please execute the above pip install to get the latest version of the required library. Be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
# Standard libraries
import json
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

# Teradata libraries
from teradataml import *
display.max_rows = 5

# Data manipulation and visualization libraries
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

<p style = 'font-size:16px;font-family:Arial;'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=EE_Telco_Customer_Churn_AutoML_Approach.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>

In [None]:
# %run -i ../../run_procedure.py "call get_data('DEMO_Telco_cloud');"
# takes about 30 seconds, estimated space: 0 MB
%run -i ../../run_procedure.py "call get_data('DEMO_Telco_local');"
# takes about 1 minute 30 seconds, estimated space: 4 MB

<p style = 'font-size:16px;font-family:Arial;'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>3. Data Exploration</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage. We then begin our analysis by checking the shape of the DataFrame and examining the data types of all its columns.</p>

In [None]:
tdf = DataFrame(in_schema("DEMO_Telco", "Customer_Churn"))
tdf 

In [None]:
print("Shape of the data: ", tdf.shape)

<p style = 'font-size:16px;font-family:Arial;'>As the result above shows, our dataset has 7043 rows and 21 columns.</p>

<p style = 'font-size:16px;font-family:Arial;'><b>Summary of Columns</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We can use the <b>ColumnSummary</b> function to quickly examine the columns, their datatypes, and the counts of NULLs and non-NULLs for a given table.</p>

In [None]:
from teradataml import ColumnSummary
obj = ColumnSummary(data=tdf,
                        target_columns=[':']
                       )

In [None]:
obj.result.head(21)
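<p style = 'font-size:16px;font-family:Arial;'>To make the idea of a column summary concrete, here is a minimal plain-Python sketch that counts NULLs and non-NULLs per column for a small list of records. It only illustrates the kind of information reported; it is not how <b>ColumnSummary</b> is implemented in Vantage. The helper <code>column_summary</code> and the sample records are hypothetical.</p>

```python
def column_summary(rows):
    """Summarise a list of dict records: type names seen, NULL and non-NULL counts per column."""
    columns = sorted({col for row in rows for col in row})  # every column that appears
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        summary[col] = {
            "types": sorted({type(v).__name__ for v in non_null}),
            "null_count": len(values) - len(non_null),
            "non_null_count": len(non_null),
        }
    return summary

# Hypothetical records for illustration; None stands in for a SQL NULL.
records = [
    {"CustomerID": "A1", "Tenure": 12, "TotalCharges": 350.5},
    {"CustomerID": "A2", "Tenure": 1, "TotalCharges": None},
]
for name, stats in column_summary(records).items():
    print(name, stats)
```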

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>4. Exploratory Data Analysis</b></p>

<p style = 'font-size:16px;font-family:Arial;'>
Exploratory Data Analysis (EDA) is a process where we visually and statistically examine, analyze, and summarize data to comprehend its characteristics, patterns, and relationships. This approach is crucial for gaining insights and a deeper understanding of the dataset at hand.<br>First let us analyse the Gender and Churn distributions in our data.</p>

In [None]:
d1=tdf.select(['Gender','CustomerID']).groupby('Gender').count()
d1 = d1.assign(drop_columns=True,
          Gender=d1.Gender,
          Count=d1.count_CustomerID)
d1

In [None]:
d2=tdf.select(['Churn','CustomerID']).groupby('Churn').count()
d2 = d2.assign(drop_columns=True,
          Churn=d2.Churn,
          Count=d2.count_CustomerID)
d2

<p style = 'font-size:16px;font-family:Arial;'>
The aggregated data is now available in a teradataml DataFrame. Let's visualize it to better understand the Churn and Gender distributions. ClearScape Analytics integrates with third-party visualization tools such as Tableau and Power BI, as well as Python modules such as plotly and seaborn. We can do all the calculations and pre-processing in Vantage and pass only the necessary information to the visualization tool. This speeds up the calculation and reduces overall time because less data moves between tools.</p>

In [None]:
d1=d1.to_pandas().reset_index()
d2=d2.to_pandas().reset_index()
#Gender and Churn percentage distribution
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=d1['Gender'], values=d1['Count'], name="Gender"),
              1, 1)
fig.add_trace(go.Pie(labels=d2['Churn'], values=d2['Count'], name="Churn"),
              1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name", textfont_size=16)

fig.update_layout(
    title_text="Gender and Churn Distributions",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Gender', x=0.16, y=0.5, font_size=20, showarrow=False),
                 dict(text='Churn', x=0.84, y=0.5, font_size=20, showarrow=False)])
fig.show()

<p style = 'font-size:16px;font-family:Arial;'>From the above plot we can see that 26.6 % of customers switched to another firm.<br>And of total customers 49.5 % are female and 50.5 % are male.</p>

<p style = 'font-size:16px;font-family:Arial;'>Now, let us look at churn with respect to gender.</p>

In [None]:
d3=tdf.select(['Churn','Gender','CustomerID']).groupby(['Churn','Gender']).count()
d3 = d3.assign(drop_columns=True,
          Churn=d3.Churn,
          Gender=d3.Gender,     
          Count=d3.count_CustomerID)
d3

In [None]:
d3=d3.to_pandas().reset_index()
fig2=px.sunburst(d3,path=['Churn','Gender'],values='Count')
fig2.update_layout(
    title_text="Churn Distribution w.r.t Gender")
fig2.show()

<p style = 'font-size:16px;font-family:Arial;'>We can see that there is negligible difference in customer count who changed the service provider. Both genders behaved in similar fashion when it comes to migrating to another service provider.</p>

In [None]:
d4=tdf.select(['Churn','Contract','CustomerID']).groupby(['Churn','Contract']).count()
d4 = d4.assign(drop_columns=True,
          Churn=d4.Churn,
          Contract=d4.Contract,     
          Count=d4.count_CustomerID)
d4

In [None]:
d4=d4.to_pandas().reset_index()
fig4 = px.bar(d4,x="Churn",y="Count", color="Contract", barmode="group", title="<b>Customer contract distribution</b>")
fig4.update_layout(width=700, height=500, bargap=0.1)
fig4.show()

<p style = 'font-size:16px;font-family:Arial;'>We can see that about 75% of customers with a Month-to-Month Contract opted to move out, compared to 13% of customers with a One Year Contract and 3% with a Two Year Contract.</p>
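<p style = 'font-size:16px;font-family:Arial;'>When reading grouped bars like these, it helps to separate two quantities: the share of all churned customers that falls in each contract type, and the churn rate within each contract type. The plain-Python sketch below computes both from (Churn, Contract, Count) rows. The counts are made-up placeholders, not values from the Telco dataset, and <code>churn_breakdown</code> is a hypothetical helper.</p>

```python
from collections import defaultdict

def churn_breakdown(rows):
    """Return (share_of_churners, churn_rate) per contract type.

    rows: iterable of (churn, contract, count) tuples, churn being "Yes" or "No".
    """
    churned = defaultdict(int)   # churned customers per contract
    total = defaultdict(int)     # all customers per contract
    for churn, contract, count in rows:
        total[contract] += count
        if churn == "Yes":
            churned[contract] += count

    all_churned = sum(churned.values())
    share = {c: churned[c] / all_churned for c in total} if all_churned else {}
    rate = {c: churned[c] / total[c] for c in total}
    return share, rate

# Placeholder counts for illustration only (NOT taken from the dataset).
example_rows = [
    ("Yes", "Month-to-month", 60), ("No", "Month-to-month", 40),
    ("Yes", "One year", 30), ("No", "One year", 170),
    ("Yes", "Two year", 10), ("No", "Two year", 190),
]
share, rate = churn_breakdown(example_rows)
for contract in rate:
    print(f"{contract}: {share[contract]:.0%} of churners, {rate[contract]:.0%} churn rate")
```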

In [None]:
d5=tdf.select(['PaymentMethod','CustomerID']).groupby('PaymentMethod').count()
d5 = d5.assign(drop_columns=True,
          PaymentMethod=d5.PaymentMethod,
          Count=d5.count_CustomerID)
d5

In [None]:
d5=d5.to_pandas().reset_index()
fig5 = go.Figure(data=[go.Pie(labels=d5['PaymentMethod'], values=d5['Count'], hole=.3)])
fig5.update_layout(title_text="<b>Payment Method Distribution</b>")
fig5.show()

In [None]:
d6=tdf.select(['Churn','PaymentMethod','CustomerID']).groupby(['Churn','PaymentMethod']).count()
d6 = d6.assign(drop_columns=True,
          Churn=d6.Churn,
          PaymentMethod=d6.PaymentMethod,     
          Count=d6.count_CustomerID)
d6

In [None]:
d6=d6.to_pandas().reset_index()
fig6 = px.bar(d6,x="Churn",y="Count", color="PaymentMethod", barmode="stack", title="<b>Customer Payment Method distribution w.r.t. Churn</b>")
fig6.update_layout(width=700, height=500, bargap=0.1)
fig6.show()

<p style = 'font-size:16px;font-family:Arial;'>Most customers who moved out used Electronic Check as their Payment Method.
<br>Customers who chose Credit Card automatic transfer, Bank automatic transfer, or Mailed Check as their Payment Method were less likely to move out.</p>

In [None]:
d7=tdf.select(['Churn','InternetService','Gender','CustomerID']).groupby(['Churn','InternetService','Gender']).count()
d7 = d7.assign(drop_columns=True,
          Churn=d7.Churn,
          InternetService=d7.InternetService, 
          Gender=d7.Gender,
          Count=d7.count_CustomerID)
d7

In [None]:
d7.sort(["InternetService"]).head(21)

In [None]:
d7=d7.to_pandas().reset_index()
fig7 = go.Figure()

for t in d7['Churn'].unique():
    dfp = d7[d7['Churn']==t]
    fig7.add_traces(go.Bar(x=[dfp['InternetService'], dfp['Gender']],
                          y=dfp['Count'],
                          width=0.75,
                          customdata=dfp['Churn'],  # use the filtered rows so lengths match x and y
                          name='Churn :' +str(dfp['Churn'].values[0]) 
                         )
                  )

fig7.update_layout(barmode='stack',
                  title_text="<b>Churn Distribution w.r.t. Internet Service and Gender</b>")
fig7.show()

<p style = 'font-size:16px;font-family:Arial;'>We can see that many more customers choose Fiber optic service than DSL, but customers who use Fiber optic also have a high churn rate. This might suggest dissatisfaction with this type of internet service.
<br>Customers with DSL service have a lower churn rate than customers with Fiber optic service.</p>

In [None]:
d8=tdf.select(['Churn','Dependents','CustomerID']).groupby(['Churn','Dependents']).count()
d8 = d8.assign(drop_columns=True,
          Churn=d8.Churn,
          Dependents=d8.Dependents,
          Count=d8.count_CustomerID)
d8

In [None]:
d8=d8.to_pandas().reset_index()
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig8 = px.bar(d8, x="Churn",y="Count", color="Dependents", barmode="group", title="<b>Dependents distribution</b>", color_discrete_map=color_map)
fig8.update_layout(width=700, height=500, bargap=0.1)
fig8.show()

<p style = 'font-size:16px;font-family:Arial;'>Customers without dependents are more likely to churn.</p>

In [None]:
d9=tdf.select(['Churn','Partner','CustomerID']).groupby(['Churn','Partner']).count()
d9 = d9.assign(drop_columns=True,
          Churn=d9.Churn,
          Partner=d9.Partner,
          Count=d9.count_CustomerID)
d9

In [None]:
d9=d9.to_pandas().reset_index()
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig9 = px.bar(d9, x="Churn",y="Count", color="Partner", barmode="group", title="<b>Churn distribution w.r.t. Partners</b>", color_discrete_map=color_map)
fig9.update_layout(width=700, height=500, bargap=0.1)
fig9.show()

<p style = 'font-size:16px;font-family:Arial;'>Customers that don't have partners are more likely to churn.</p>

In [None]:
d10=tdf.select(['Churn','PaperlessBilling','CustomerID']).groupby(['Churn','PaperlessBilling']).count()
d10 = d10.assign(drop_columns=True,
          Churn=d10.Churn,
          PaperlessBilling=d10.PaperlessBilling,
          Count=d10.count_CustomerID)
d10

In [None]:
d10=d10.to_pandas().reset_index()
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig10 = px.bar(d10, x="Churn",y="Count", color="PaperlessBilling",  title="<b>Churn distribution w.r.t. Paperless Billing</b>", color_discrete_map=color_map)
fig10.update_layout(width=700, height=500, bargap=0.1)
fig10.show()

<p style = 'font-size:16px;font-family:Arial;'>Customers with Paperless Billing are more likely to churn.</p>

<hr style="height:1px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>5. Feature Engineering</b>

<p style='font-size:16px;font-family:Arial;'>Teradata Enterprise Feature Store (EFS) functions handle feature management within the Vantage environment. Although their syntax is inspired by Feast, the EFS functions are tailored to Teradata Vantage. They use Teradata DataFrames for feature management, in contrast to the pandas DataFrames used by Feast, so we avoid extracting data to create or use features from the Enterprise Feature Store. The EFS functions are designed to give data science teams effective and streamlined feature management. This notebook walks through their capabilities and shows how they integrate with your data models and processes.</p>

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>5.1 Setup a Feature Store Repository</b>

<p style='font-size:16px;font-family:Arial;'>The Enterprise Feature Store (EFS) SDK follows an object-oriented design that focuses on intuitive interaction with feature stores. Central to this design are several core objects: Feature, Entity, DataSource, and FeatureGroup. Together, these objects support efficient management and use of features within your data ecosystem, with the metadata stored in Teradata Vantage.</p>
<p style='font-size:16px;font-family:Arial;'>A feature store repository serves as the foundational environment for storing and managing your data features. The owner of the FeatureStore can grant or revoke read-only, write-only, or read-and-write authorization for other users.</p>

In [None]:
telco_fs = FeatureStore(repo='TelcoFS')
telco_fs.setup(perm_size='10e8')

In [None]:
# List whether FeatureStore is setup or not.
telco_fs.list_repos()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>5.2 Create and Register Entity </b></p>

<p style = 'font-size:16px;font-family:Arial;'>Let us now start with feature engineering. We will create the required columns in the DataFrame and then register those columns as features in a feature group of the feature store created in the step above.</p>

In [None]:
df = DataFrame(in_schema("DEMO_Telco", "Customer_Churn"))
df

<p style = 'font-size:16px;font-family:Arial;'>This code performs the following operations:</p>
    <ol style = 'font-size:16px;font-family:Arial;'>
        <li><strong>Assigning New Values:</strong> The <code>df.assign()</code> function is used to create new columns or modify existing ones in the DataFrame <code>df</code>.</li>
        <li><strong>Replacing Values:</strong>
            <ul>
                <li><span class="highlight">MultipleLines</span>: Replaces "No phone service" with "No".</li>
                <li><span class="highlight">OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies</span>: Replaces "No internet service" with "No" for each of these columns.</li>
            </ul>
        </li>
        <li><strong>Converting Churn Values:</strong>
            <ul>
                <li><span class="highlight">Churn</span>: Uses the <code>case</code> function to convert "Yes" to 1 and "No" to 0. If the value is neither "Yes" nor "No", it defaults to 0.</li>
            </ul>
        </li>
        <li><strong>Displaying the DataFrame:</strong> The final <code>df</code> statement displays the modified DataFrame.</li>
    </ol>

In [None]:
df = tdf.assign(
    MultipleLines = tdf.MultipleLines.replace("No phone service","No"),
    OnlineSecurity = tdf.OnlineSecurity.replace("No internet service","No"),
    OnlineBackup = tdf.OnlineBackup.replace("No internet service","No"),
    DeviceProtection = tdf.DeviceProtection.replace("No internet service","No"),
    TechSupport = tdf.TechSupport.replace("No internet service","No"),
    StreamingTV = tdf.StreamingTV.replace("No internet service","No"),
    StreamingMovies = tdf.StreamingMovies.replace("No internet service","No"),
    Churn = case({ "Yes" : 1, "No" : 0}, value=tdf.Churn,else_=0)
)

df
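<p style = 'font-size:16px;font-family:Arial;'>The same clean-up rules can be pictured on a single record. The sketch below applies them to a plain Python dict; it is only an illustration of the logic, since the cell above performs the work in Vantage on the full table. The helper <code>clean_record</code> and the sample record are hypothetical.</p>

```python
def clean_record(record):
    """Apply the value clean-up rules to one customer record (a plain dict)."""
    internet_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                     "TechSupport", "StreamingTV", "StreamingMovies"]
    cleaned = dict(record)  # copy so the input is left untouched

    # "No phone service" is collapsed into "No" for MultipleLines.
    if cleaned.get("MultipleLines") == "No phone service":
        cleaned["MultipleLines"] = "No"

    # Same idea for the internet add-on columns.
    for col in internet_cols:
        if cleaned.get(col) == "No internet service":
            cleaned[col] = "No"

    # Encode the target: "Yes" -> 1, anything else -> 0 (mirrors else_=0).
    cleaned["Churn"] = 1 if cleaned.get("Churn") == "Yes" else 0
    return cleaned

# Hypothetical record for illustration only.
sample = {"MultipleLines": "No phone service", "OnlineBackup": "No internet service",
          "TechSupport": "Yes", "Churn": "Yes"}
print(clean_record(sample))
```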

In [None]:
df = ConvertTo(
    data=df,
    target_columns=['CustomerID', 'Gender', 'Partner', 'Dependents', 'PhoneService',
                    'MultipleLines', 'InternetService','OnlineSecurity', 'OnlineBackup',
                    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                    'Contract', 'PaperlessBilling', 'PaymentMethod'],
    target_datatype=["VARCHAR(charlen=10,charset=UNICODE,casespecific=NO)"]
).result

<p style = 'font-size:16px;font-family:Arial;'>Let's store the transformed data to table.</p>

In [None]:
copy_to_sql(
    df=df,
    table_name='transformed_data_automl',
    if_exists='replace'
)

<p style = 'font-size:16px;font-family:Arial;'>Now we will proceed to save the features as well as the feature processing logic in feature store.</p>
<p style = 'font-size:16px;font-family:Arial;'>This will allow us to re-use the features and processing later-on, avoiding to re-write the processing logic.</p>

In [None]:
df = DataFrame('transformed_data_automl')

In [None]:
# Create an entity for the transformed DataFrame 'df', keyed on CustomerID
entity=Entity(name='CustId', columns=df.CustomerID)

In [None]:
# Register the Entity.
telco_fs.apply(entity)

In [None]:
# Look at existing Entities after registering the Entity.
telco_fs.list_entities()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>5.3 Create and Register FeatureGroup </b></p>
<ul style = 'font-size:16px;font-family:Arial;'>
    <li>FeatureGroup can be created using Teradata DataFrame.</li>
    <li>FeatureGroup can be created using SQL Query.</li>
    <li>FeatureGroup can be created using objects of Feature, Entity, DataSource.</li>
</ul>


<p style = 'font-size:16px;font-family:Arial;'><b>Creating a FeatureGroup from Teradata DataFrame
</b></p>

In [None]:
telco_fg = FeatureGroup.from_DataFrame(
    name='TelcoFG', 
    entity_columns='CustomerID', 
    df=df
)

In [None]:
# Let's look at Properties.
telco_fg.features, telco_fg.entity, telco_fg.data_source, telco_fg.description

In [None]:
telco_fs.apply(telco_fg)

In [None]:
telco_fs.list_features()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>5.4 Reuse features from Enterprise Feature Store with teradataml analytic functions for AutoML processing.</b>


<p style = 'font-size:16px;font-family:Arial'>Because the FeatureStore also stores the DataSource, you can retrieve a Teradata DataFrame from it. <br> <code>FeatureStore.get_dataset()</code> returns a Teradata DataFrame for a FeatureGroup.</p>

In [None]:
# Get DataSet for FeatureGroup TelcoFG. 
df = telco_fs.get_dataset('TelcoFG')
df

<p style = 'font-size:16px;font-family:Arial;'>Our training dataset is now created, with all the feature engineering applied.</p>
<p style = 'font-size:16px;font-family:Arial;'>We can see that the column MultipleLines has only two values, Yes and No. The same features can also be reused across multiple use cases and models without repeating the data preparation.</p>

<p style = 'font-size:16px;font-family:Arial;'>We split the dataset into training and testing datasets with an 80:20 split ratio.</p>

In [None]:
# Performing sampling to get 80% for training and 20% for testing
tdf_sample = df.sample(frac = [0.8, 0.2])

# Fetching train and test data
tdf_train= tdf_sample[tdf_sample['sampleid'] == 1].drop('sampleid', axis=1)
tdf_test = tdf_sample[tdf_sample['sampleid'] == 2].drop('sampleid', axis=1)
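<p style = 'font-size:16px;font-family:Arial;'>The properties we expect from such a split are that the two parts are disjoint, together cover every row, and have roughly the requested sizes. The standard-library sketch below illustrates these properties on a list of IDs. It is not how sampling is implemented in Vantage; the helper <code>split_ids</code>, the seed, and the IDs are all hypothetical.</p>

```python
import random

def split_ids(ids, train_frac=0.8, seed=42):
    """Shuffle ids reproducibly and split them into disjoint train/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)      # local RNG, so the split is repeatable
    cut = round(len(shuffled) * train_frac)    # number of rows that go to training
    return shuffled[:cut], shuffled[cut:]

customer_ids = [f"C{i:04d}" for i in range(1000)]  # hypothetical IDs
train_ids, test_ids = split_ids(customer_ids)
print(len(train_ids), len(test_ids))
```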

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>6. AutoML Training</b>

<p style = 'font-size:16px;font-family:Arial;'>AutoML (Automated Machine Learning) is an approach that automates the process of building, training, and validating machine learning models. It involves various algorithms to automate various aspects of the machine learning workflow, such as data preparation, feature engineering, model selection, hyperparameter tuning, and model deployment. It aims to simplify the process of building machine learning models, by automating some of the more time-consuming and labor-intensive tasks involved in the process.</p>

<p style = 'font-size:16px;font-family:Arial;'>We create an <code>AutoClassifier</code> instance, which is a special-purpose AutoML feature for classification tasks. We use the <code>exclude</code> parameter to specify model algorithms to be excluded from the model training phase. Here we exclude the 'knn' and 'svm' models. The <code>max_runtime_secs</code> parameter specifies the time limit in seconds for model training.
<br><br>
<code>verbose</code>: specifies the detailed execution steps based on verbose level as follows:
</p>

<ul style = 'font-size:16px;font-family:Arial;'>
    <li><b>0</b>: prints the progress bar and leaderboard</li>
    <li><b>1</b>: prints the execution steps of AutoML.</li>
    <li><b>2</b>: prints the intermediate data between the execution of each step of AutoML.</li>
</ul>
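<p style = 'font-size:16px;font-family:Arial;'>Before starting a long training run, it can be useful to sanity-check these settings. The sketch below is a hypothetical helper, not part of teradataml. It assumes the model identifiers are <code>glm</code>, <code>svm</code>, <code>decision_forest</code>, <code>xgboost</code>, and <code>knn</code>, matching the five algorithms listed at the top of this notebook, and it checks the <code>verbose</code> levels described above.</p>

```python
# Model identifiers assumed from the five algorithms listed earlier in this notebook.
KNOWN_MODELS = {"glm", "svm", "decision_forest", "xgboost", "knn"}

def build_automl_args(exclude=(), verbose=0, max_runtime_secs=None):
    """Validate settings and return them as keyword arguments (hypothetical helper)."""
    unknown = set(exclude) - KNOWN_MODELS
    if unknown:
        raise ValueError(f"Unknown model name(s): {sorted(unknown)}")
    if set(exclude) == KNOWN_MODELS:
        raise ValueError("At least one model must remain for training")
    if verbose not in (0, 1, 2):
        raise ValueError("verbose must be 0, 1, or 2")
    if max_runtime_secs is not None and max_runtime_secs <= 0:
        raise ValueError("max_runtime_secs must be positive")

    args = {"exclude": list(exclude), "verbose": verbose}
    if max_runtime_secs is not None:
        args["max_runtime_secs"] = max_runtime_secs
    return args

args = build_automl_args(exclude=["knn", "svm"], verbose=2, max_runtime_secs=600)
remaining = sorted(KNOWN_MODELS - set(args["exclude"]))
print(args)
print("Models left for training:", remaining)
```

<p style = 'font-size:16px;font-family:Arial;'>If the settings pass validation, they could be unpacked into the constructor, for example <code>AutoClassifier(**args)</code>, since <code>exclude</code>, <code>verbose</code>, and <code>max_runtime_secs</code> are the parameter names used in this notebook.</p>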

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>6.1. AutoML Training</b>


In [None]:
# Creating AutoClassifier Instance
# Selecting 'Auto' mode for AutoML training
# Excluding knn and svm models from default model list for training
# Used early stopping timer criteria with value 600 sec

aml = AutoClassifier(
    exclude          = ['knn','svm'],
    verbose          = 2,
    max_runtime_secs = 600
)

<p style = 'font-size:16px;font-family:Arial'><b><i>Note: Because AutoML performs many steps, including feature exploration, data preparation, model training, and evaluation, to select the best model, the step below may take 12-15 minutes.</i></b></p>

In [None]:
# Fitting train data 
aml.fit(data = tdf_train, target_column = 'Churn')

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>6.2. Model Leaderboard Generation</b>

<p style = 'font-size:16px;font-family:Arial'>Here, we generate model leaderboard and leader for a given dataset. Leaderboard is a ranked table with a list of models with all their evaluation metrics.</p>

In [None]:
# Fetching leaderboard

leaderboard = aml.leaderboard()
leaderboard

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>6.3 Best Performing Model</b>

<p style = 'font-size:16px;font-family:Arial'>The following function displays the best performing model.</p>

In [None]:
# Fetching best performing model
aml.leader()

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>7. Prediction</b>

<p style = 'font-size:16px;font-family:Arial'>The predict function generates predictions using either the default test data or any specified dataset, based on the model's rank in the leaderboard, and displays the performance metrics of the chosen model. If the test data contains a target column, both predictions and performance metrics are displayed; otherwise, only the predictions are shown.
<br><br>
You can also use the <code>rank</code> parameter in the predict function. The <code>rank</code> parameter specifies the model's rank in the leaderboard to be used for prediction. By default, the rank is set to 1, meaning the best-performing model is used.</p>
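<p style = 'font-size:16px;font-family:Arial'>Rank-based selection can be pictured as sorting models by an evaluation score and looking up a 1-based position. The sketch below is a plain-Python illustration with hypothetical model IDs and scores. It does not reproduce the metric or tie-breaking that AutoML uses for its leaderboard.</p>

```python
def rank_models(scores):
    """Return a leaderboard: list of (rank, model_id, score), best score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, model_id, score)
            for rank, (model_id, score) in enumerate(ordered, start=1)]

def model_at_rank(leaderboard, rank=1):
    """Look up the model id at a given 1-based rank."""
    for r, model_id, _ in leaderboard:
        if r == rank:
            return model_id
    raise ValueError(f"No model with rank {rank}")

# Hypothetical model ids and scores, for illustration only.
scores = {"MODEL_A": 0.78, "MODEL_B": 0.81, "MODEL_C": 0.74}
board = rank_models(scores)
print(board)
print(model_at_rank(board), model_at_rank(board, rank=2))
```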

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>7.1 Generating prediction on test data using Best Model</b>

<p style = 'font-size:16px;font-family:Arial'>Here, we specify the <code>tdf_test</code> dataset for prediction. When using external data instead of the default test data, the predict function applies all the data transformation steps performed during the training phase on the external data before passing the data to the model for prediction.</p>

In [None]:
# Fetching prediction and metrics on test data
prediction = aml.predict(tdf_test)

In [None]:
# Printing prediction
prediction

<b style = 'font-size:18px;font-family:Arial'>Generating predictions using 2nd Best Model</b>

In [None]:
#Prediction using the second best performing model
prediction_second = aml.predict(tdf_test, rank=2)

#Printing prediction
prediction_second

<b style = 'font-size:18px;font-family:Arial'>Generating predictions using 3rd Best Model</b>

In [None]:
prediction_third = aml.predict(tdf_test, rank=3)

#Printing prediction
prediction_third

<hr style="height:1px;border:none">
<b style = 'font-size:18px;font-family:Arial'>7.2 Generating and Comparing ROC for the Top 3 Models</b>

<p style = 'font-size:16px;font-family:Arial'>The ROC curve is a graph between TPR(True Positive Rate) and FPR(False Positive Rate). The area under the ROC curve measures how well the model can distinguish between positive and negative classes. The higher the AUC, the better the model's performance in distinguishing between the positive and negative categories. AUC above 0.75 is generally considered decent.</p>
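<p style = 'font-size:16px;font-family:Arial'>To show how TPR, FPR, and AUC relate, here is a minimal plain-Python sketch that sweeps evenly spaced thresholds and integrates the curve with the trapezoidal rule. It is an illustration under simple assumptions (binary labels 0/1, both classes present, scores in [0, 1]); it is not the implementation of the <code>ROC</code> function used in the next cell. The helper names and toy data are hypothetical.</p>

```python
def roc_points(labels, scores, num_thresholds=100):
    """Return (fpr, tpr) pairs for evenly spaced thresholds from 1.0 down to 0.0."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for i in range(num_thresholds + 1):
        threshold = 1.0 - i / num_thresholds
        # A row is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule over points sorted by FPR."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy data: a model that separates the classes perfectly, and one that cannot.
labels = [1, 1, 0, 0]
perfect = auc(roc_points(labels, [0.9, 0.8, 0.3, 0.1]))
uninformative = auc(roc_points([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
print(perfect, uninformative)
```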

In [None]:
#Calculating True-Positive Rate (TPR), False-Positive Rate (FPR), Threshold_values for the top three models
roc_first = ROC(
    probability_column = "prob_1",
    observation_column = "Churn",
    positive_class = '1',
    num_thresholds = 100,
    data = prediction
)

roc_second = ROC(
    probability_column = "prob_1",
    observation_column = "Churn",
    positive_class = '1',
    num_thresholds = 100,
    data = prediction_second
)

roc_third = ROC(
    probability_column = "prob_1",
    observation_column = "Churn",
    positive_class = '1',
    num_thresholds = 100,
    data = prediction_third
)

#Getting auc_score for the three models
auc_first = roc_first.result.get_values()[0][0]
auc_second = roc_second.result.get_values()[0][0]
auc_third = roc_third.result.get_values()[0][0]

In [None]:
#first model
first_model = leaderboard.MODEL_ID.iloc[0]

#second model
second_model = leaderboard.MODEL_ID.iloc[1]

#third model
third_model = leaderboard.MODEL_ID.iloc[2]

#Plotting the ROC Curve
roc_second.output_data.plot(
    x = roc_first.output_data.fpr,
    y = [roc_first.output_data.tpr, roc_second.output_data.tpr, roc_third.output_data.tpr,roc_first.output_data.fpr],
    legend = [
                '{}: AUC = {}'.format(first_model,str(auc_first)),
                '{}: AUC = {}'.format(second_model,str(auc_second)),
                '{}: AUC = {}'.format(third_model,str(auc_third)),
                'Baseline: AUC = {}'.format(str(round(0.5, 4)))
             ],
    legend_style = 'lower right',
    title = 'Receiver Operating Characteristic (ROC) Curve',
    xlabel = 'False Positive Rate',
    ylabel = 'True Positive Rate',
    color = ['green', 'orange', 'blue', 'red'],
    linestyle = ['-', '-', '-', '--']
)
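<p style = 'font-size:16px;font-family:Arial'>The legend labels for the plot above can also be built in a loop, which keeps each model ID paired with its own AUC score when more models are compared. The sketch below is a small pure-Python helper. The name <code>build_legend</code>, the rounding to four digits, and the sample model IDs and scores are assumptions made up for this illustration and are not part of teradataml.</p>

```python
def build_legend(model_ids, auc_scores, digits=4):
    """Pair each model ID with its AUC and append the 0.5 baseline label."""
    model_ids = list(model_ids)
    auc_scores = list(auc_scores)
    if len(model_ids) != len(auc_scores):
        raise ValueError("model_ids and auc_scores must have the same length")

    labels = [
        '{}: AUC = {}'.format(model_id, round(auc, digits))
        for model_id, auc in zip(model_ids, auc_scores)
    ]
    # The diagonal baseline always has an AUC of 0.5.
    labels.append('Baseline: AUC = 0.5')
    return labels


# Hypothetical model IDs and scores, for illustration only.
example_legend = build_legend(
    ['MODEL_A', 'MODEL_B', 'MODEL_C'],
    [0.84567, 0.83, 0.8012],
)
for label in example_legend:
    print(label)
```

<p style = 'font-size:16px;font-family:Arial'>In the notebook, the same helper could be called with <code>[first_model, second_model, third_model]</code> and <code>[auc_first, auc_second, auc_third]</code>, and the result passed as the <code>legend</code> argument.</p>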

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>Conclusion</b>

<p style = 'font-size:16px;font-family:Arial'>We used the feature store to store features together with their processing logic, and then reused both for model training. The same features and processing can be reused across multiple machine learning models and use cases, which helps improve data science productivity.</p>

<p style = 'font-size:16px;font-family:Arial'>Teradata's AutoML functionality plays a crucial role in this context by automating the complex process of building and deploying machine learning models. AutoML ensures the most optimal preparation and training of models, delivering high-quality machine learning models in minutes. Through hyperparameter tuning (HPT), Teradata's AutoML can automatically select the best parameters for machine learning algorithms using grid search and random search techniques, significantly enhancing model performance.
<br><br>
By leveraging Teradata's AutoML, companies can save time and reduce costs associated with manual model building and tuning. The technology not only improves the accuracy of predictive models but also democratizes the power of machine learning, allowing customers to utilize advanced analytics without requiring extensive coding or data science expertise. This capability enables companies to swiftly and effectively analyze customer churn data, develop predictive models, and implement proactive strategies to retain customers and enhance their satisfaction.
<br><br>
In conclusion, Teradata's AutoML functionality is a vital tool for telecom providers aiming to reduce customer churn. By automating and optimizing the machine learning process, Teradata empowers organizations across industries to make data-driven decisions that improve customer retention and drive long-term profitability.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>8. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial'> <b>Work Tables </b></p>

In [None]:
tables = ['transformed_data']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore tables that do not exist or were already dropped
        pass

In [None]:
telco_fs.archive_feature_group(feature_group='TelcoFG')

In [None]:
telco_fs.delete_feature_group(feature_group='TelcoFG')

<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../run_procedure.py "call remove_data('DEMO_Telco');" 
#Takes 10 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;">

<b style = 'font-size:20px;font-family:Arial'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial'>Let’s look at the elements we have available for reference for this notebook:</p>

<p style = 'font-size:18px;font-family:Arial'><b>Filters:</b></p>
    <ul style = 'font-size:16px;font-family:Arial'>
    <li><b>Industry:</b> Telco</li>
    <li><b>Functionality:</b> Feature Store and AutoML</li>
    <li><b>Use Case:</b> Customer Retention</li>
    </ul>
    <p style = 'font-size:18px;font-family:Arial'><b>Related Resources:</b></p>
    <ul style = 'font-size:16px;font-family:Arial'>
    <li><a href = 'https://www.teradata.com/Blogs/NPS-is-a-metric-not-the-goal'>In the fight to improve customer experience, NPS is a metric, not the goal</a></li>
    <li><a href = 'https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right</a></li>
    <li><a href = 'https://www.teradata.com/Resources/Datasheets/Digital-Identity-Management-and-Great-CX?utm_campaign=i_coremedia-AMS&utm_source=google&utm_medium=paidsearch&utm_content=GS_CoreMedia_NA-US_BKW&utm_creative=Brand-Vantage&utm_term=teradata%20analytic%20platform&gclid=Cj0KCQjwnMWkBhDLARIsAHBOftrWZxDktHkKMsaWjMmNRnQ6Ys-bZBAUhXjWTo1Xa02fsci-IHWBV_waAppkEALw_wcB'>Close the Gap Between Digital Identity Management and Great Customer Experiences</a></li>
        </ul>

<p style = 'font-size:18px;font-family:Arial'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'> 
  <li>Teradata Vantage™ - Analytics Database Analytic Functions - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Introduction-to-Analytics-Database-Analytic-Functions'>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Introduction-to-Analytics-Database-Analytic-Functions</a></li>
  <li>Teradata® Package for Python User Guide - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python'>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python</a></li>
  <li>Teradata® Package for Python Function Reference - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference'>https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference</a></li>      
</ul>

<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>

- `CustomerID`: Unique ID of the customer
- `Gender`: Whether the customer is male or female
- `SeniorCitizen`: Whether the customer is a senior citizen or not (1, 0)
- `Partner`: Whether the customer has a partner or not (Yes, No)
- `Dependents`: Whether the customer has dependents or not (Yes, No)
- `Tenure`: Number of months the customer has stayed with the company
- `PhoneService`: Whether the customer has a phone service or not (Yes, No)
- `MultipleLines`: Whether the customer has multiple lines or not (Yes, No, No phone service)
- `InternetService`: Customer’s internet service provider (DSL, Fiber optic, No)
- `OnlineSecurity`: Whether the customer has online security or not (Yes, No, No internet service)
- `OnlineBackup`: Whether the customer has online backup or not (Yes, No, No internet service)
- `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service)
- `TechSupport`: Whether the customer has tech support or not (Yes, No, No internet service)
- `StreamingTV`: Whether the customer has streaming TV or not (Yes, No, No internet service)
- `StreamingMovies`: Whether the customer has streaming movies or not (Yes, No, No internet service)
- `Contract`: The contract term of the customer (Month-to-month, One year, Two year)
- `PaperlessBilling`: Whether the customer has paperless billing or not (Yes, No)
- `PaymentMethod`: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- `MonthlyCharges`: The amount charged to the customer monthly
- `TotalCharges`: The total amount charged to the customer
- `Churn`: Whether the customer churned or not (Yes or No)

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>