<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Automatic Data Pre-Processing with tdprepview
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<center><img src="images/tdprepview_logo.png" width="500px" height="300px"/></center>


<p style = 'font-size:16px;font-family:Arial'><code>tdprepview</code> is a Python package that creates data preparation pipelines as views written in Teradata-SQL.</p>

<p style = 'font-size:16px;font-family:Arial'>If you're a data science practitioner looking for a fast and efficient way to prepare datasets for tabular supervised or unsupervised machine learning with ClearScape Analytics, this package and notebook are what you need. This notebook prepares the data for predicting customer churn at a bank. However, the methods and strategies apply broadly to any use case that requires preprocessing of tabular data.</p>

<p style = 'font-size:20px;font-family:Arial'><b>What is <code>tdprepview</code>?</b></p>

<ul style = 'font-size:16px;font-family:Arial'>
<li> A Python package that creates data preparation Pipelines in Views written in Teradata-SQL.</li>
<li> No Python client needed for transforming data with fitted pipelines.</li>
<li> Rationale: Most data preprocessing functions can be expressed in plain Teradata-SQL.</li>
  <ol style = 'font-size:16px;font-family:Arial'>
   <li> E.g., Imputation --> COALESCE</li>
  <li> But writing this manually is tiresome and error-prone.</li>
      <li> Why not let a Python Package do that for you?</li></ol>
<li> What it is not: </li>
  <ol><li>It is not a package for data exploration or feature engineering that relies on aggregation of rows. It is really for the final step, transforming everything into a clean analytic data set (ADS).</li></ol></ul>
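<p style = 'font-size:16px;font-family:Arial'>To make the "Imputation --> COALESCE" rationale concrete, here is a minimal sketch that builds such an expression by hand. The helper <code>impute_sql</code> and the table name are hypothetical and only illustrate the idea; the SQL that <code>tdprepview</code> actually generates may look different.</p>

```python
# Illustrative only: build a COALESCE-based imputation expression by hand.
# The helper name and SQL layout are assumptions, not tdprepview output.

def impute_sql(column: str, fill_value) -> str:
    """Return a SQL expression that replaces NULLs in `column`."""
    if isinstance(fill_value, str):
        # escape single quotes for SQL string literals
        literal = "'" + fill_value.replace("'", "''") + "'"
    else:
        literal = repr(fill_value)
    return f"COALESCE({column}, {literal}) AS {column}"

select_list = [
    impute_sql("Tenure", 0),
    impute_sql("Geography", "France"),
]
query = "SELECT\n    " + ",\n    ".join(select_list) + "\nFROM customer_churn"
print(query)
```

<p style = 'font-size:16px;font-family:Arial'>Writing dozens of such expressions by hand, and keeping them in sync between training and scoring, is the tedious part that a generator can take over.</p>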


<p style = 'font-size:20px;font-family:Arial'><b>How will <code>tdprepview</code> improve your data preprocessing work?</b></p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Super easy and fast to develop with.</li>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>Picks up well-known sklearn-API.</li>
        <li>Lightweight thanks to encapsulating pipelines as views.</li></ul>
<li> Super fast execution runtime.</li>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li> Thanks to row-wise transformation queries in Teradata-SQL.</li>
      </ul>
<li> Super compatible and future-proof.</li> 
     <ul style = 'font-size:16px;font-family:Arial'>
        <li> Only Teradata-SQL is created.</li>
      </ul>
<li> Robust, transparent &amp; reusable.</li>
<li> (Semi-)automatic pipeline creation based on heuristics and properties of the data.</li>
<li> Works well with Teradata's Bring Your Own Model (BYOM) capability:</li>
 <ul style = 'font-size:16px;font-family:Arial'>
        <li> All data preparation is done in Vantage, only the perfectly clean training set is used for training off-Vantage.</li>
      </ul>
</ol>


<p style = 'font-size:20px;font-family:Arial'><b> What Preprocessors are included in <code>tdprepview</code>?</b></p>

<img alt="tdprepview" width=98% src="images/supportedpreprocessors_v131.png" />

<p style = 'font-size:20px;font-family:Arial'><b>What are the key features of <code>tdprepview</code>?</b></p>

<img alt="tdprepview" width=100% src="images/tdprepview_keyfeatures.png" />

<p style = 'font-size:20px;font-family:Arial'><b>Where can I get <code>tdprepview</code>?</b></p>

<p style = 'font-size:16px;font-family:Arial'>It is available on <code>PyPI</code>; install it with <code>pip install tdprepview</code>.</p>

<p style = 'font-size:16px;font-family:Arial'>Have a look here: <a href="https://pypi.org/project/tdprepview/">https://pypi.org/project/tdprepview/</a></p>

<p style = 'font-size:16px;font-family:Arial'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Initiate a connection to Vantage</li>
    <li>Data Exploration</li>
    <li>Quickstart: Automatic data preprocessing in less than a minute!</li>
    <li>Refine auto-generated Pipeline</li>
    <li>Appendix: Feature Overview - All Preprocessors in Action</li>
    <li>Cleanup</li>
</ol>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>Downloading and installing additional software needed</b></p>

In [1]:
%%capture

!pip install tdprepview

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>Restart the kernel after executing the previous cell to bring the installed libraries into the session. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 
import pandas as pd
from teradataml import *
import tdprepview
import json
from IPython.display import display, Markdown

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Initiate a connection to Vantage</b>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Automatic_DataPreprocessing_tdprepview.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>1.2 Getting Data for This Demo</b></p>

<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains two statements, one of which is commented out. To switch modes, change which statement is commented out.</p>

In [None]:
# %run -i run_procedure.py "call get_data('DEMO_BankChurn_cloud');"        # Takes 30 seconds
%run -i run_procedure.py "call get_data('DEMO_BankChurn_local');"        # Takes 1 minute

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>Create a "Virtual DataFrame" that points to the data set in Vantage. Check the shape of the DataFrame and the data types of all its columns.</p>
<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

In [None]:
tdf = DataFrame(in_schema("DEMO_BankChurn", "customer_churn"))
print("Shape of the data: ", tdf.shape)
tdf

<p style = 'font-size:16px;font-family:Arial'>By looking at the datatypes and sample data, we classify the columns into ID column, target variable(y), numerical, categorical and binary. We skip using <i>RowNumber</i> and <i>Surname</i> columns as they are not helpful in the analysis.</p>

In [None]:
target_variable = "Exited"
numeric_columns = ["Age", "Balance", "CreditScore", "EstimatedSalary", "Tenure"]
categorical_columns = ["Gender", "Geography", "NumOfProducts"]
binary_columns = ["HasCrCard", "IsActiveMember"]
id_column = ["CustomerId"]

customer_data = tdf.select(
    id_column + [target_variable] + numeric_columns + categorical_columns + binary_columns
)

In [None]:
customer_data

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Best Practice Tips: Raw Data: Distinguish Data for Training and for Scoring.</b></i></p>
</div>
     <ul style = 'font-size:16px;font-family:Arial'>
        <li>Consider now the deployment phase of the model, specifically for making predictions. This includes batch processing, where scoring occurs at regular intervals (hourly, daily, weekly, or monthly), and event-based processing, where scoring is triggered by specific events.</li>
        <li>Prepare the raw scoring dataset by creating a view on top of a source table that contains the selection logic for scoring. Filter with a SQL WHERE clause in valid Teradata SQL syntax, such as <code>WHERE ROW_DATE >= TRUNC(CURRENT_DATE, 'MONTH')</code>.</li>
        <li>If distinguishing between training and scoring data directly in a view definition (with a WHERE statement) is not feasible or desired, consider using the entire content of a table for scoring. Ensure that the data-to-be-scored table is managed by filling it through an external process and emptying it after scoring is completed.</li>
        <li>When using stateful functions (e.g., scaling), apply the parameters fitted on the training dataset to the scoring dataset as well; otherwise phenomena such as feature drift can go unnoticed. To do so, you can crystallize the scoring ADS as a view during data preparation with <code>CREATE VIEW your_view AS SELECT ...</code></li>
    <li>One way to do this is using <code>tdprepview</code>. It is a package for fitting and transforming reusable data preparation pipelines that are saved in view definitions. Hence, no other permanent database objects are required. By using views, you naturally leverage the superior performance of Teradata's Parsing Engine and its Optimizer.</li></ul>
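<p style = 'font-size:16px;font-family:Arial'>The point about stateful functions can be sketched in a few lines of plain Python. The values are made up, and the SQL string only illustrates how fitted parameters can be frozen as constants in a view; it is not output produced by <code>tdprepview</code>.</p>

```python
# Toy illustration of "fit on training, reuse at scoring" for min-max scaling.
# All values are invented for this sketch.

train_age = [22, 35, 48, 61]
score_age = [30, 70]          # 70 lies outside the training range

# "fit": parameters come from the training data only
age_min, age_max = min(train_age), max(train_age)

def scale(x, lo=age_min, hi=age_max):
    return (x - lo) / (hi - lo)

scaled_scoring = [scale(x) for x in score_age]
print(scaled_scoring)

# The same parameters frozen into a SQL expression, as a view could hold them
sql_expr = f"(Age - {age_min}) / CAST({age_max - age_min} AS FLOAT) AS Age"
print(sql_expr)
```

<p style = 'font-size:16px;font-family:Arial'>A scaled value above 1 at scoring time is a direct signal that the scoring data has moved outside the range seen during training. Refitting the scaler on the scoring data would hide that signal.</p>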

In [None]:
view_raw_training = "order_raw_training"
view_raw_scoring = "order_raw_scoring"
Param = {'database':'demo_user'}

key = "CustomerId"
target = "Exited"

In [None]:
execute_sql(f"""
REPLACE VIEW {Param["database"]}.{view_raw_training} AS 
(
 {customer_data.loc[customer_data[key]>15690738].show_query()}
)""")

In [None]:
execute_sql(f"""
REPLACE VIEW {Param["database"]}.{view_raw_scoring} AS
(
 {customer_data.drop(columns=[target]).loc[customer_data[key]<=15690738].show_query()}
)""")

In [None]:
DF_train_raw = DataFrame(in_schema(Param["database"], view_raw_training))
print(DF_train_raw.shape)
DF_train_raw

In [None]:
DF_score_raw = DataFrame(in_schema(Param["database"], view_raw_scoring))
print(DF_score_raw.shape)
DF_score_raw

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Quickstart: Automatic Data Preparation in less than a Minute with <code>tdprepview</code></b>

In [None]:
# 1. the raw data
DF_train_raw # <- a teradataml.DataFrame

In [None]:
import tdprepview
from tdprepview import Pipeline

In [None]:
# 2. Generate and Fit the preprocessing Pipeline automagically!
pl = Pipeline.from_DataFrame(
                            DF_train_raw, # <- the raw input data
                            non_feature_cols=[key,target,"Surname"], # <- columns we want to ignore
                            fit_pipeline=True) # <- fit pipeline already

In [None]:
# 3. inspect the transformed training dataset
DF_train_transformed = pl.transform(DF_train_raw) # <- note the similarity to sklearn!
DF_train_transformed

In [None]:
# 4. Visualize the preparation pipeline. plotly and seaborn required
fig = pl.plot_sankey()
fig.update_layout(height=1000) # <- adjust the height if it doesn't all fit in

In [None]:
# 5. check output types. we want all features to be floats so we can create a single FloatTensorType for ONNX.
DF_train_transformed.tdtypes

<p style = 'font-size:16px;font-family:Arial'>We take advantage of the fact that a view does not hold any actual data, but rather the computation logic. Therefore, there is no need to alter the computation logic if we wish to rerun the pipeline with new data.</p>
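<p style = 'font-size:16px;font-family:Arial'>The following sketch shows this property of views with Python's built-in <code>sqlite3</code> module. SQLite merely stands in for Vantage here, and the table name, view name, and scaling constants are made up for illustration.</p>

```python
import sqlite3

# SQLite stands in for Vantage here; table and view names are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_scoring (CustomerId INTEGER, Age INTEGER)")
con.execute(
    "CREATE VIEW ads_scoring AS "
    "SELECT CustomerId, (Age - 18) / 62.0 AS Age FROM raw_scoring"
)

con.execute("INSERT INTO raw_scoring VALUES (1, 49)")
first = con.execute("SELECT * FROM ads_scoring").fetchall()

# New rows arrive; the view definition stays untouched
con.execute("INSERT INTO raw_scoring VALUES (2, 80)")
second = con.execute("SELECT * FROM ads_scoring ORDER BY CustomerId").fetchall()

print(first)
print(second)
```

<p style = 'font-size:16px;font-family:Arial'>The second query returns the transformed new row without any change to the view, which is the behavior the crystallized pipeline relies on.</p>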

In [None]:
# 6. Crystallize pipeline for training & scoring 
input_schema = Param["database"]
output_schema = Param["database"]
view_ADS_training = "order_ADS_training"
view_ADS_scoring = "order_ADS_scoring"

# training
pl.transform(
    create_replace_view=True, # <- this parameter is key. it will call a REPLACE VIEW statement.
    
    schema_name=input_schema,
    table_name=view_raw_training,
    return_type=None,   
    output_schema_name=output_schema,
    output_view_name=view_ADS_training)

# Note how the pipeline fitted on the training data set can be reused for the scoring data set.
# tdprepview automatically handles columns that were not present at training or at scoring (e.g. the target column).
pl.transform(
    create_replace_view=True, # this parameter is key. it will call a REPLACE VIEW statement.
    
    schema_name=input_schema,
    table_name=view_raw_scoring,
    return_type=None,
    output_schema_name=output_schema,
    output_view_name=view_ADS_scoring)

In [None]:
# let's inspect the views
DF_ADS_training = DataFrame(in_schema(output_schema,view_ADS_training))
DF_ADS_training

In [None]:
DF_ADS_scoring = DataFrame(in_schema(output_schema,view_ADS_scoring))
DF_ADS_scoring

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>4. Refine auto-generated Pipeline</b>

<p style = 'font-size:16px;font-family:Arial'>Assume we want to make some adjustments to the auto-generated pipeline. This section shows how to do that.</p>

In [None]:
# Start with suggested Pipeline: get suggested code as text, copy in cell, adjust as needed
steps_str = tdprepview.auto_code(DF_train_raw, non_feature_cols=[key,target,"Surname"])
print(steps_str)

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Best Practice Tips</b></i></p>
</div>

<ul style="font-size:16px;font-family:Arial">
  <li>Ensure imputation is applied to all features; even if some or all columns currently do not contain NULL values, include imputation to prevent model failure in production.</li>
  <li>Ensure that all features in the Analytical Data Set (ADS) are float values, which simplifies the creation of a working and robust model—you'll appreciate this later.</li>
  <li>Perform all data preprocessing in Vantage for consistency; otherwise, you must verify whether the model conversion tools are compatible with your preprocessing functions.</li>
  <li>Scale all your features to a range between 0 and 1, or use Z-score standardization. While this may not affect tree-based models, it ensures that features contribute equally to the performance of other model types.</li>
  <li>Refine the data preprocessing pipeline iteratively: Employ data exploration to ascertain the efficacy of different transformations and selectively preserve those columns that demonstrably enhance your model's performance, as evidenced by measures like feature importance.</li>
</ul>


<p style = 'font-size:20px;font-family:Arial'><b>The first draft from <code>tdprepview.auto_code()</code> was good, but based on data exploration, we identified the need for some more data cleansing steps:</b></p>

<ul style="font-size:16px;font-family:Arial">
  <li>Fill NULLs for all features (numeric/string) to ensure safety during deployment.</li>
  <li>Extract academic title from surname and create one-hot encoded (OHE) features from it.</li>
  <li>One-hot encode geography.</li>
  <li>Label encode gender.</li>
  <li>Limit age to the range of 10 to 100 to remove implausible values.</li>
  <li>Log transform balance and estimatedsalary.</li>
  <li>Split bank_products (a list of values) into indicator variables.</li>
  <li>Scale all float/int features to the 0-1 range.</li>
  <li>Ensure all feature columns are of type float with the Cast transformer.</li>
</ul>


<p style = 'font-size:20px;font-family:Arial'><b>To define the data preparation pipeline, you must specify the steps sequentially in a list. Each step is a tuple that adheres to the following structure:</b></p>


<ul style="font-size:16px;font-family:Arial">
  <li><b>Input Column Names:</b> Specify the column name or a list of column names to be processed. You can also make a selection of columns based on regex patterns, data types or exclusion list.</li>
  <li><b>Preprocessing Functions:</b> Apply one or more tdprepview preprocessing functions, along with their arguments, to the input columns.</li>
  <li><b>Column Name Modification (Optional):</b> Define how the column names should be altered as a result of the preprocessing.</li>
</ul>


<p style="font-size:16px;font-family:Arial"> This format ensures a clear and structured approach to setting up your data preparation pipeline in <code>tdprepview</code>.</p>
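<p style="font-size:16px;font-family:Arial">To illustrate why sequential step tuples map naturally onto SQL, here is a toy compiler that chains SQL templates per column. Everything in it is hypothetical: the function <code>compile_steps</code>, the use of plain template strings instead of preprocessor objects, and the <code>{"suffix": ...}</code> rename option are assumptions made for this sketch, not the <code>tdprepview</code> implementation.</p>

```python
# Toy model of the step format: (columns, SQL template(s)[, rename options]).
# This mimics the idea only; it is not how tdprepview is implemented.

def compile_steps(steps):
    """Chain templates per column and return {output_name: sql_expression}."""
    exprs = {}   # current expression per input column
    names = {}   # current output name per input column
    for step in steps:
        cols, templates = step[0], step[1]
        options = step[2] if len(step) > 2 else {}
        cols = [cols] if isinstance(cols, str) else cols
        templates = [templates] if isinstance(templates, str) else templates
        for col in cols:
            expr = exprs.get(col, col)
            for template in templates:      # applied left to right
                expr = template.replace("%%COL%%", expr)
            exprs[col] = expr
            names[col] = names.get(col, col) + options.get("suffix", "")
    return {names[c]: exprs[c] for c in exprs}

steps = [
    (["Balance", "Age"], "COALESCE(%%COL%%, 0)"),
    ("Balance", ["GREATEST(%%COL%%, 1)", "LN(%%COL%%)"], {"suffix": "_log"}),
]
compiled = compile_steps(steps)
for name, expr in compiled.items():
    print(f"{expr} AS {name}")
```

<p style="font-size:16px;font-family:Arial">Each later step wraps the expression produced by the earlier ones, so the whole pipeline collapses into one row-wise SELECT list.</p>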

In [None]:
# get list of variables with same data type
df_types = DF_train_raw.dtypes._column_names_and_types
int_feats = [c[0] for c in df_types if c[1] == "int"]
int_feats.remove(key)
int_feats.remove(target)
float_feats = [c[0] for c in df_types if c[1] == "float"]
str_feats = [c[0] for c in df_types if c[1] == "str"]


print("int_feats:",  str(int_feats))
print("float_feats:",  str(float_feats))
print("str_feats:",  str(str_feats))
print("key:",  str([key]))
print("target:",  str([target]))

In [None]:
from tdprepview import (
    Pipeline,             # holds all steps; tdprepview builds the processing graph and computes the required statistics
    SimpleImputer,        # impute numeric values
    ImputeText,           # impute text columns
    OneHotEncoder,        # one-hot encode Geography into multiple binary indicators
    LabelEncoder,         # encode Gender
    CutOff,               # clip Age to a plausible range
    CustomTransformer,    # log transform of Balance and EstimatedSalary
    MultiLabelBinarizer,  # split a delimiter-separated list in a VARCHAR column into binary indicators
    MinMaxScaler,         # min-max scale features into the range 0-1
    Cast,                 # cast all features to FLOAT
)

In [None]:
DF_train_raw

In [None]:
steps = [   
    #1 fill NULLs
    (int_feats, SimpleImputer(strategy="constant",fill_value=0)),
    (float_feats,SimpleImputer(strategy="mean") ),
    ("Geography", ImputeText(kind="custom",value="France") ),
    

    #2. get one hot encoding `geography`
    ("Geography", OneHotEncoder()),  

    #3. convert `gender` to numeric: "Female" --> 1, all other values including ("Male") --> 0 
    ("Gender", LabelEncoder(elements=["Female"])),

    #4. cutoff implausible `age` value
    ("Age", CutOff(cutoff_min=10, cutoff_max=100)),

    #5. log transform `balance` and `estimatedsalary` with custom SQL in the pipeline; clip to a positive minimum first
    (['Balance', 'EstimatedSalary'], [CutOff(cutoff_min=1), CustomTransformer(" LN(%%COL%%) ")]),

    #6 min-max scale all float features plus selected integer features into the range 0-1
    (["CreditScore", "Tenure", "NumOfProducts"], Cast(new_type='FLOAT')), # needs to be done because those values are currently integer
    (float_feats + ["CreditScore", "Tenure"], MinMaxScaler()),

    #7 cast all features (but not the key and target column) to float
    ({"columns_exclude":[key, target]}, Cast(new_type='FLOAT')),
]

In [None]:
#initialise the pipeline
pl2 = Pipeline(steps)

In [None]:
pl2.fit(DF_train_raw)

In [None]:
#inspect the transformed training dataset
DF_train_transformed2 = pl2.transform(DF_train_raw)
DF_train_transformed2

In [None]:
#inspect the transformed scoring dataset
DF_score_transformed2 = pl2.transform(DF_score_raw)
DF_score_transformed2

In [None]:
#visualise the preparation pipeline. plotly and seaborn required
fig2 = pl2.plot_sankey()
# adjust the height if it doesn't all fit in
fig2.update_layout(height=1000)

In [None]:
# 7. We can save the pipeline python object as json for later use
pl2.to_json("mypipeline.json")
!head -n 10 mypipeline.json

<p style = 'font-size:16px;font-family:Arial'>Once the pipeline is complete, you can build the actual ML model, either with in-database functions or with Python and Bring Your Own Model (BYOM).</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Appendix: Feature Overview - All Preprocessors in Action</b>

<p style = 'font-size:16px;font-family:Arial'>The following image lists all preprocessors available in <code>tdprepview</code>.</p>

<img alt="tdprepview" width=90% src="images/supportedpreprocessors_v131.png" />

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.1 Impute</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Impute, ImputeText, SimpleImputer.</p>

In [None]:
from tdprepview import Impute, ImputeText, SimpleImputer

In [None]:
steps = [
    ("CreditScore", Impute("mean")),
    ("HasCrCard", Impute("min")),
    ("IsActiveMember", Impute("max")),
    ("Age", Impute("median")),
    ("NumOfProducts", SimpleImputer(strategy='constant', fill_value=0)),
    ("Geography", ImputeText("custom","France")),
    ("EstimatedSalary", SimpleImputer(strategy='mean')),
    ("Balance", SimpleImputer(strategy='median')),
    ("Tenure", SimpleImputer(strategy='constant', fill_value=0))
]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.2 IterativeImputer</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Impute missing values using an iterative approach.</p>

In [None]:
from tdprepview import IterativeImputer

In [None]:
steps = [
    (["Balance","EstimatedSalary","CreditScore"], IterativeImputer()),
]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.3 Transform</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>Scale:</b> Scales numerical values using a chosen method (MinMax, Z-Score, RobustScaling) and parameters.</p>
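<p style = 'font-size:16px;font-family:Arial'>As a rough sketch of what a custom scaling amounts to, the snippet below assumes the formula <code>(x - numerator_subtr) / denominator</code>, with both parts resolved to constants at fit time. That formula is an assumption inferred from the parameter names, and the sample values are made up.</p>

```python
import statistics

# Assumption: a "custom" scaling computes (x - numerator_subtr) / denominator,
# with both parts resolved to constants when the pipeline is fitted.
ages = [20, 30, 40, 50, 60]

numerator_subtr = statistics.mean(ages)              # corresponds to "mean"
denominator = max(ages) - statistics.median(ages)    # corresponds to "max-P50"

scaled = [(x - numerator_subtr) / denominator for x in ages]
print(scaled)
```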

In [None]:
from tdprepview import Scale

In [None]:
DF_train_raw.tdtypes

In [None]:
steps = [
    ("Age", Scale(kind="custom", numerator_subtr="mean",denominator="max-P50" )),
    ("Tenure", Scale(kind="custom", numerator_subtr=45,denominator=23 )),
    ("CreditScore", Scale(kind="custom", numerator_subtr="P10",denominator="P90-P10" )),
    ("EstimatedSalary", Scale(kind="custom", numerator_subtr="min",denominator="max-P49" )),
    ("Balance", Scale(kind="custom", numerator_subtr="mode",denominator="std" )),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.4 Normalizer</b></p>    
<p style = 'font-size:16px;font-family:Arial'>A preprocessor that normalizes the input data, similar to sklearn's Normalizer.</p>
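<p style = 'font-size:16px;font-family:Arial'>sklearn's Normalizer rescales each row (not each column) to unit norm. A minimal plain-Python sketch of the L2 variant, using made-up rows, could look like this:</p>

```python
import math

# Row-wise L2 normalization: each row is divided by its own Euclidean norm.
rows = [
    (3.0, 4.0, 0.0),
    (1.0, 2.0, 2.0),
]

def l2_normalize(row):
    norm = math.sqrt(sum(v * v for v in row))
    # leave all-zero rows unchanged to avoid dividing by zero
    return tuple(v / norm for v in row) if norm else row

normalized = [l2_normalize(r) for r in rows]
print(normalized)
```

<p style = 'font-size:16px;font-family:Arial'>Because the norm depends only on the values within one row, this transformation needs no fitted statistics.</p>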

In [None]:
from tdprepview import Normalizer

In [None]:
steps = [
    (["Balance","EstimatedSalary","CreditScore"], Normalizer("l2")),
]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.5 StandardScaler, MaxAbsScaler, MinMaxScaler, RobustScaler</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>StandardScaler:</b> Standardize features by removing the mean and scaling to unit variance.</p>
<p style = 'font-size:16px;font-family:Arial'><b>MaxAbsScaler:</b> Scale each feature by its maximum absolute value.</p>
<p style = 'font-size:16px;font-family:Arial'><b>MinMaxScaler:</b> Scale each feature by its maximum and minimum value.</p>
<p style = 'font-size:16px;font-family:Arial'><b>RobustScaler:</b> Scale features using statistics that are robust to outliers.</p>
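<p style = 'font-size:16px;font-family:Arial'>For reference, the four scalers correspond to the following sklearn-style formulas, sketched here in plain Python on made-up values. The robust variant uses the default interquartile range (25th to 75th percentile); whether <code>tdprepview</code> computes percentiles with the same interpolation is not verified here.</p>

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 100.0]

mean, std = statistics.mean(values), statistics.pstdev(values)
lo, hi = min(values), max(values)
max_abs = max(abs(v) for v in values)
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

standard = [(v - mean) / std for v in values]        # zero mean, unit variance
max_abs_scaled = [v / max_abs for v in values]       # within [-1, 1]
min_max = [(v - lo) / (hi - lo) for v in values]     # within [0, 1]
robust = [(v - median) / (q3 - q1) for v in values]  # median and IQR based

print(robust)
```

<p style = 'font-size:16px;font-family:Arial'>Note how the outlier (100) stretches the min-max result for all other values, while the robust variant keeps the bulk of the data on a comparable scale.</p>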

In [None]:
from tdprepview import StandardScaler, MaxAbsScaler, MinMaxScaler, RobustScaler

In [None]:
steps = [
    ("CreditScore", StandardScaler()),
    ("Age", MaxAbsScaler()),
    ("Tenure", MinMaxScaler(feature_range=(0,10), clip = True)),
    ("Balance", RobustScaler(quantile_range=(10,90))),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.6 CutOff</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Clips numeric values that fall outside a given range.</p>
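<p style = 'font-size:16px;font-family:Arial'>Clipping is simple to express row-wise. The following plain-Python sketch mirrors the idea with a hypothetical helper <code>cut_off</code>; the handling of missing values shown here is an assumption based on how SQL comparisons treat NULL.</p>

```python
def cut_off(x, cutoff_min=None, cutoff_max=None):
    """Clip x into [cutoff_min, cutoff_max]; None leaves that side open."""
    if x is None:
        return None                      # missing values pass through unchanged
    if cutoff_min is not None and x < cutoff_min:
        return cutoff_min
    if cutoff_max is not None and x > cutoff_max:
        return cutoff_max
    return x

ages = [5, 42, 130, None]
clipped = [cut_off(a, 10, 100) for a in ages]
print(clipped)
```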

In [None]:
from tdprepview import CutOff

In [None]:
steps = [
    ("CreditScore", CutOff("min",1000)),
    ("Tenure", CutOff(cutoff_min=0)),
    ("Age",  CutOff(cutoff_max=100)),
    ("Balance", CutOff(0,"P90")),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.7 PowerTransformer</b></p>    
<p style = 'font-size:16px;font-family:Arial'>As PowerTransformer from sklearn: </p>

<p style = 'font-size:16px;font-family:Arial'>Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more
Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other
situations where normality is desired.</p>

<p style = 'font-size:16px;font-family:Arial'>Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter
for stabilizing variance and minimizing skewness is estimated through maximum likelihood.</p>

<p style = 'font-size:16px;font-family:Arial'>Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.</p>
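<p style = 'font-size:16px;font-family:Arial'>The two transforms can be written down directly for a given parameter lambda. The sketch below takes lambda as an input and omits the maximum-likelihood estimation step; function names are chosen for illustration.</p>

```python
import math

def box_cox(x, lmbda):
    """Box-Cox transform of a strictly positive value for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox needs strictly positive input")
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform; defined for positive and negative values."""
    if x >= 0:
        return math.log1p(x) if lmbda == 0 else ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)

print(box_cox(4.0, 0.5), yeo_johnson(-3.0, 1.0))
```

<p style = 'font-size:16px;font-family:Arial'>Once lambda is known, both transforms are closed-form row-wise expressions, which is why they can be carried into a SQL view.</p>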

In [None]:
from tdprepview import PowerTransformer

In [None]:
steps = [
    ("CreditScore", PowerTransformer(method='box-cox')),
    ("Balance",  PowerTransformer(method='yeo-johnson')),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.8 CustomTransformer</b></p>    
<p style = 'font-size:16px;font-family:Arial'>A custom transformer that applies a custom SQL expression to a column.</p>


In [None]:
from tdprepview import CustomTransformer

In [None]:
steps = [
    ("CreditScore", CustomTransformer(" 2 * POWER(%%COL%%, 2) + 3 * %%COL%% ")),
    ("Balance", CustomTransformer(" CASE WHEN %%COL%% < 0 THEN 'negative' ELSE 'positive' END", "VARCHAR()")),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)


<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.9 Discretize</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>LabelEncoder:</b> Encodes a text column into numerical values using a label encoding scheme.</p>
<p style = 'font-size:16px;font-family:Arial'><b>SimpleHashEncoder:</b> Encodes a single text-based column to an INTEGER value using Teradata's built-in hash function.
Stateless and hence very performant, but can lead to collisions.</p>
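<p style = 'font-size:16px;font-family:Arial'>The difference between the two encoders can be sketched as follows. The label encoding is shown as a CASE expression that maps listed values to 1..n and everything else to 0, in line with the Gender example earlier in this notebook. For the hash encoding, CRC32 is swapped in as a stand-in because Teradata's hash function is not available in Python; helper names and parameters are hypothetical.</p>

```python
import zlib

def label_encode_sql(column, elements):
    """CASE expression mapping listed values to 1..n and anything else to 0."""
    whens = " ".join(
        f"WHEN {column} = '{value}' THEN {i}"
        for i, value in enumerate(elements, start=1)
    )
    return f"CASE {whens} ELSE 0 END"

def hash_encode(value, num_buckets, salt=""):
    # CRC32 is a stand-in; Teradata's built-in hash function differs.
    return zlib.crc32((salt + value).encode("utf-8")) % num_buckets

print(label_encode_sql("Geography", ["Spain", "France", "Germany"]))

countries = ["Spain", "France", "Germany"]
buckets = [hash_encode(c, 2, "salt") for c in countries]
print(buckets)  # three values share two buckets, so at least two collide
```

<p style = 'font-size:16px;font-family:Arial'>The label encoder needs the list of values (state), whereas the hash encoder needs none, at the price of possible collisions when the number of buckets is small.</p>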


In [None]:
from tdprepview import LabelEncoder

In [None]:
steps = [("Geography", LabelEncoder(["Spain","France","Germany"]))]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

In [None]:
from tdprepview import SimpleHashEncoder

In [None]:
steps = [
    ("Geography", SimpleHashEncoder(10,"salt")),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.10 FixedWidthBinning, VariableWidthBinning, QuantileTransformer, DecisionTreeBinning</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>FixedWidthBinning:</b> Performs fixed-width binning on a numerical column. </p>
<p style = 'font-size:16px;font-family:Arial'><b>VariableWidthBinning:</b> Binning numerical data into variable-width bins. </p>
<p style = 'font-size:16px;font-family:Arial'><b>QuantileTransformer:</b> Transform features using quantiles information.</p>
<p style = 'font-size:16px;font-family:Arial'><b>DecisionTreeBinning:</b> Binning numerical data into variable-width bins, based on a decision tree.
Trees are trained in sklearn.</p>
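<p style = 'font-size:16px;font-family:Arial'>Fixed-width binning reduces to integer arithmetic on the value. The sketch below uses a hypothetical helper with zero-based bin indices and clamps out-of-range values into the first or last bin; the numbering and edge handling in <code>tdprepview</code> may differ.</p>

```python
def fixed_width_bin(x, n_bins, lower, upper):
    """Return a bin index in 0..n_bins-1, clamping values outside the range."""
    width = (upper - lower) / n_bins
    index = int((x - lower) // width)
    return min(max(index, 0), n_bins - 1)

bins = [fixed_width_bin(a, 5, 0, 100) for a in (0, 19.9, 20, 64, 100, 130)]
print(bins)
```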


In [None]:
from tdprepview import FixedWidthBinning, VariableWidthBinning, QuantileTransformer, DecisionTreeBinning

In [None]:
steps = [
    ("CreditScore", FixedWidthBinning(5)),
    ("Age", FixedWidthBinning(5,0,100)),
    ("Tenure", VariableWidthBinning(kind="custom", boundaries=[-0.5,0.2,0.5])),
    ("Balance", QuantileTransformer(n_quantiles=11)),
    ("EstimatedSalary", DecisionTreeBinning(target_var="Exited",no_bins=5))
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.11 Binarizer, ListBinarizer, ThresholdBinarizer</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>Binarizer:</b> Binarize data according to a threshold.</p>
<p style = 'font-size:16px;font-family:Arial'><b>ListBinarizer:</b> Preprocessor for text columns that outputs 1 if the value is in a given list or among the K most frequent values. </p>
<p style = 'font-size:16px;font-family:Arial'><b>ThresholdBinarizer:</b> Binarizes numeric data using a threshold value.</p>

In [None]:
from tdprepview import  ThresholdBinarizer, Binarizer, ListBinarizer

In [None]:
DF_train_raw

In [None]:
steps = [
    ("Geography", ListBinarizer(['Spain',"France"])),
    ("Age", ThresholdBinarizer("median")),
    ("Balance", Binarizer(threshold=0.0)),
     ]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.12 Feature Engineering</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>PolynomialFeatures:</b> Generate polynomial and interaction features from input data.</p>
<p style = 'font-size:16px;font-family:Arial'><b>OneHotEncoder:</b> One-hot encoder for categorical features.</p>
<p style = 'font-size:16px;font-family:Arial'><b>MultiLabelBinarizer:</b> MultiLabelBinarizer for categorical features. The input column is a delimiter separated list of values. The output is one indicator variable per unique value. </p>
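<p style = 'font-size:16px;font-family:Arial'>To show what these feature-generating steps produce for a single row, here is a plain-Python sketch of degree-2 interaction-only terms and one-hot encoding. The helper functions and the output column names are made up for illustration.</p>

```python
from itertools import combinations

def interaction_features(row, columns):
    """Degree-2, interaction-only terms: products of distinct column pairs."""
    return {f"{a}_x_{b}": row[a] * row[b] for a, b in combinations(columns, 2)}

def one_hot(value, categories, prefix="Geography"):
    """One indicator per known category; unknown values yield all zeros."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

row = {"HasCrCard": 1, "IsActiveMember": 0, "Tenure": 7, "Geography": "Spain"}
interactions = interaction_features(row, ["HasCrCard", "IsActiveMember", "Tenure"])
indicators = one_hot(row["Geography"], ["France", "Germany", "Spain"])
print(interactions)
print(indicators)
```

<p style = 'font-size:16px;font-family:Arial'>With <code>interaction_only=True</code>, squared terms such as Tenure times Tenure are left out, so three input columns yield three pairwise products.</p>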

In [None]:
from tdprepview import PolynomialFeatures, OneHotEncoder, MultiLabelBinarizer

In [None]:
DF_train_raw.to_pandas().Geography.value_counts()

In [None]:
steps = [
    (["HasCrCard","IsActiveMember","Tenure" ], PolynomialFeatures(degree=2, interaction_only=True)),
    ("Geography", OneHotEncoder()),
    ("Gender", MultiLabelBinarizer(max_categories=2)),
]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.13 Dimensionality Reduction & Miscellaneous</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>PCA:</b> Principal component analysis (PCA) is a technique used to reduce the dimensionality of a dataset while retaining
most of its variance.</p>

In [None]:
from tdprepview import PCA

In [None]:
steps = [
    (["CreditScore","Balance", "EstimatedSalary"], PCA(n_components=2))
]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>5.14 Cast, TryCast</b></p>    
<p style = 'font-size:16px;font-family:Arial'><b>Cast:</b> A preprocessor that converts a column to a new data type using the SQL CAST function. It is mostly useful for converting all features to FLOAT as the last preprocessing step. If the input columns are text-based, it is safer to use TryCast.</p>
<p style = 'font-size:16px;font-family:Arial'><b>TryCast:</b> A preprocessor that attempts to convert a text column to a new data type using SQL TRYCAST function.</p>
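<p style = 'font-size:16px;font-family:Arial'>The reason TryCast is safer for text columns is that a failed conversion yields NULL instead of an error that aborts the whole query. A plain-Python analogue, with a hypothetical helper name, looks like this:</p>

```python
def try_cast_float(text):
    """Mimic a try-cast to FLOAT: return None instead of raising on bad input."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

converted = [try_cast_float(v) for v in ["42", " 3.5 ", "n/a", None]]
print(converted)
```

<p style = 'font-size:16px;font-family:Arial'>The resulting missing values can then be handled by an imputation step, rather than causing a scoring job to fail in production.</p>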

In [None]:
from tdprepview import Cast, TryCast

In [None]:
steps = [
    ("Gender", TryCast("INT")),
    ("HasCrCard", Cast("FLOAT"))
]

In [None]:
pl = Pipeline(steps)
pl.fit(DF_train_raw)

<div id='section8'></div>
<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>6. Cleanup</b>
<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>6.1 Work Tables</b></p>

In [None]:
for v in [view_raw_training, view_raw_scoring, view_ADS_training, view_ADS_scoring]:
    try:
        execute_sql(f"DROP VIEW {Param['database']}.{v}")
    except Exception:
        # the view may not exist if an earlier cell was skipped
        pass

In [None]:
%run -i run_procedure.py "call remove_data('DEMO_BankChurn');"

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>

- `Surname`: Surname
- `CreditScore`: Credit score
- `Geography`: Country (Germany / France / Spain)
- `Gender`: Gender (Female / Male)
- `Age`: Age
- `Tenure`: No of years the customer has been associated with the bank
- `Balance`: Balance
- `NumOfProducts`: No of bank products used
- `HasCrCard`: Credit card status (0 = No, 1 = Yes)
- `IsActiveMember`: Active membership status (0 = No, 1 = Yes)
- `EstimatedSalary`: Estimated salary
- `Exited`: Abandoned or not? (0 = No, 1 = Yes)

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>