<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Credit Card Fraud Detection - Data Cleansing and Feature Engineering Pipeline</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>
This notebook demonstrates the following Vantage in-database capabilities:
    <li style = 'font-size:16px;font-family:Arial'> Data Cleansing Functions, such as TD_GetFutileColumns, TD_SimpleImputeFit and TD_SimpleImputeTransform </li>
    <li style = 'font-size:16px;font-family:Arial'> Data Exploration Functions, such as TD_ColumnSummary and TD_CategoricalSummary </li>
    <li style = 'font-size:16px;font-family:Arial'> Feature Engineering Functions, such as TD_BinCodeFit & Transform, TD_OrdinalEncodingFit & Transform, TD_OneHotEncodingFit & Transform, TD_ScaleFit & Transform and TD_ColumnTransformer </li>
</p>
<br>
<p style = 'font-size:16px;font-family:Arial'>
A typical data science project involves multiple preprocessing steps that turn raw incoming data into something a model can use for predictions. By common estimates, about 70-80% of the time and effort goes into these preprocessing steps. Vantage's in-database functions let us perform them efficiently and at scale.
This demo notebook uses a sample financial dataset of credit card applications, with a target that flags loan defaulters. We will walk through the general preprocessing steps involved in taking the source data and making it usable for model creation.
</p>  

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Accessing the Data </b> </p>
<p style = 'font-size:16px;font-family:Arial'>These demos work either with foreign tables accessed from Cloud Storage via NOS, or with tables that you import to your machine. If you import data for multiple demos, you may need to use the Data Dictionary "Manage Your Space" routine to clean up tables you no longer need.</p>
    
<p style = 'font-size:16px;font-family:Arial'>Use the link below to access the 2 options for using data from the data dictionary notebook:

[Click Here to get data for this notebook](../Data_Dictionary/Data_Dictionary.ipynb#TRNG_CreditCard)

[Click Here to Manage Your Space](../Data_Dictionary/Data_Dictionary.ipynb#Manage_Your_Space)

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>1. Connect to Vantage and explore the dataset</b></p>
The command below connects to the Vantage environment.

In [None]:
%connect local

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial'>For this demo, data is already resident in Object Storage which we are accessing via ReadNOS.  Create a reference to the table, and sample the contents.  Data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [None]:
sel top 5 * from TRNG_CreditCard.credit_card;

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Checking data demographics  </b> </p>
<p style = 'font-size:16px;font-family:Arial'>The <b>TD_ColumnSummary</b> function displays the column name, data type and other demographics, such as the count of NULLs, for each specified input table column.</p>

In [None]:
SELECT * FROM TD_ColumnSummary (
 ON TRNG_CreditCard.credit_card AS InputTable
 USING
 TargetColumns ('[:]')
) AS dt;

<p style = 'font-size:16px;font-family:Arial'>The SQL below lists the columns that contain NULLs, ordered by their null percentage.</p>

In [None]:
SELECT columnname, datatype, nullpercentage FROM TD_ColumnSummary (
 ON TRNG_CreditCard.credit_card AS InputTable
 USING
 TargetColumns ('[:]')
) AS dt
where 
nullpercentage > 0
order by 3 desc;

<p style = 'font-size:16px;font-family:Arial'>Because more than 50% of the values in the HOUSETYPE_MODE column are NULL, we can exclude this column from our model calculations.<br>
    Let's check the other varchar columns.
OCCUPATION_TYPE also has a high percentage of NULL values.</p>
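
<p style = 'font-size:16px;font-family:Arial'>As a cross-check, the null percentage of a single column can also be computed with plain SQL. The query below is only a sketch: the output aliases are illustrative, and it assumes the percentage is defined as NULL rows divided by total rows.</p>

```sql
-- Manual cross-check of the null percentage for one column.
-- COUNT(column) skips NULLs, so COUNT(*) - COUNT(column) is the NULL count.
SELECT
    COUNT(*)                          AS total_rows,
    COUNT(*) - COUNT(OCCUPATION_TYPE) AS null_rows,
    CAST(COUNT(*) - COUNT(OCCUPATION_TYPE) AS FLOAT) * 100
        / NULLIF(COUNT(*), 0)         AS null_pct
FROM TRNG_CreditCard.credit_card;
```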

<p style = 'font-size:16px;font-family:Arial'><b>TD_CATEGORICALSUMMARY </b>function   displays the distinct values and their counts for each specified input table column</p>

In [None]:
Create volatile table cateogrySummaryTable as (
SELECT * FROM TD_CATEGORICALSUMMARY (
ON TRNG_CreditCard.credit_card as inputtable
USING
TargetColumns('CODE_GENDER'
,'NAME_CONTRACT_TYPE'
,'NAME_FAMILY_STATUS'
,'FLAG_OWN_CAR'
,'OCCUPATION_TYPE')
) AS dt)With data 
on commit preserve rows;

In [None]:
select top 5* from cateogrySummaryTable;

<p style = 'font-size:16px;font-family:Arial'>The <b>TD_GETFUTILECOLUMNS</b> function displays the categorical columns that will have no effect on the model. A column is considered futile if all of its values are the same, if all of its values are unique, or if the count of distinct values in the column divided by the total number of rows in the input table is greater than or equal to the threshold value.</p>

In [None]:
Select * from TD_getFutileColumns(
ON TRNG_CreditCard.credit_card as inputtable partition by any
ON cateogrySummaryTable as categorytable Dimension
USING
CategoricalSummaryColumn('ColumnName') 
ThresholdValue(0.05)
)As dt;

<p style = 'font-size:16px;font-family:Arial'>The output shows that FLAG_OWN_CAR will have no effect on the model because all the values in this column are the same, so we can exclude it from model creation.</p>
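
<p style = 'font-size:16px;font-family:Arial'>The rule that the function applies can be approximated by hand. The sketch below computes the distinct-value ratio for FLAG_OWN_CAR using plain SQL; the aliases are illustrative. A distinct count of 1 means the column is constant, and a ratio at or above the threshold (0.05 above) means the column has too many distinct values.</p>

```sql
-- Manual version of the futile-column check for one column.
SELECT
    COUNT(DISTINCT FLAG_OWN_CAR) AS distinct_values,
    COUNT(*)                     AS total_rows,
    CAST(COUNT(DISTINCT FLAG_OWN_CAR) AS FLOAT)
        / NULLIF(COUNT(*), 0)    AS distinct_ratio
FROM TRNG_CreditCard.credit_card;
```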

<p style = 'font-size:16px;font-family:Arial'>Let us check the values in the OCCUPATION_TYPE column to see what we can do for the NULLs in the column 
</p>

In [None]:
SELECT * FROM cateogrySummaryTable where columnname = 'OCCUPATION_TYPE' order by DistinctValueCount desc;

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Impute Missing Values  </b> </p>
<p style = 'font-size:16px;font-family:Arial'><b>TD_SimpleImputeFit</b> outputs a table of the values that will be substituted for the missing values.<br>
    <b>TD_SimpleImputeTransform</b> returns the input data set with the missing values filled in.
After the transform, verify that the NULL values have been removed.<br>
    Note: the Fit table can also be used as input to <b>TD_ColumnTransformer</b>.</p>

In [None]:
-- fit the SimpleImpute function on categorical columns
SELECT * FROM TD_SimpleImputeFit (
    ON TRNG_CreditCard.credit_card as InputTable
    OUT VOLATILE TABLE OutputTable(impute_fit_cat_output)
    USING
    ColsForLiterals ('OCCUPATION_TYPE')
    Literals ('not provided')
) as dt;



In [None]:
Create volatile table occupationimputetable as (
SELECT * FROM TD_SimpleImputeTransform (
 ON TRNG_CreditCard.credit_card as InputTable
 ON impute_fit_cat_output AS FitTable DIMENSION
) AS dt)With data 
on commit preserve rows;
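
<p style = 'font-size:16px;font-family:Arial'>One way to verify the result is to count the NULLs that remain in the transformed table. The query below is a minimal check; the aliases are illustrative. If the imputation worked as intended, remaining_nulls should be 0 and imputed_rows should show how many rows received the literal 'not provided'.</p>

```sql
-- Count NULLs left in OCCUPATION_TYPE after TD_SimpleImputeTransform,
-- and how many rows now carry the imputed literal.
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN OCCUPATION_TYPE IS NULL THEN 1 ELSE 0 END)           AS remaining_nulls,
    SUM(CASE WHEN OCCUPATION_TYPE = 'not provided' THEN 1 ELSE 0 END) AS imputed_rows
FROM occupationimputetable;
```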

<p style = 'font-size:16px;font-family:Arial'>The TD_SimpleImputeFit and TD_SimpleImputeTransform functions also work on numeric columns, where missing values can be filled with the min, max, mean or median of the values in the column.</p>
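
<p style = 'font-size:16px;font-family:Arial'>To illustrate what a statistic-based imputation does, here is a plain SQL sketch of mean imputation. It is not the function's implementation. CNT_FAM_MEMBERS is used only as an example numeric column, and whether it actually contains NULLs in this dataset is an assumption.</p>

```sql
-- Mean imputation written by hand: replace NULLs with the column average.
-- AVG ignores NULLs, so the average is taken over the known values only.
SELECT
    SK_ID_CURR,
    CNT_FAM_MEMBERS,
    COALESCE(CAST(CNT_FAM_MEMBERS AS FLOAT),
             AVG(CNT_FAM_MEMBERS) OVER ()) AS cnt_fam_members_imputed
FROM TRNG_CreditCard.credit_card;
```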

<p style = 'font-size:16px;font-family:Arial'>Models usually work better with numerical inputs than with character inputs. Let us check how many distinct values each character column has so that we can encode them as numbers.</p>

In [None]:
SELECT columnname,count(distinctvalue) FROM cateogrySummaryTable 
group by 1 order by 2; 

<p style = 'font-size:16px;font-family:Arial'>We are not using the FLAG_OWN_CAR column in model creation. For the other columns, we can use the <b>TD_OneHotEncodingFit</b> and <b>TD_OrdinalEncodingFit</b> functions, together with their transform functions, to convert character categories to numbers. To do that, we first need to check the exact values present in the columns.</p>

In [None]:
SELECT columnname,distinctvalue FROM cateogrySummaryTable where columnname in 
('CODE_GENDER', 'NAME_CONTRACT_TYPE')
order by 1; 

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Feature Engineering Functions  </b> </p>
<p style = 'font-size:16px;font-family:Arial'><b>TD_OneHotEncodingFit </b>outputs a table of attributes and categorical values to input to <b>TD_OneHotEncodingTransform </b> which encodes them as one-hot numeric vectors.</p>

In [None]:
CREATE VOLATILE TABLE onehotencodingfit_genderoutput AS (
 SELECT * FROM TD_OneHotEncodingFit (
 ON TRNG_CreditCard.credit_card AS InputTable
 USING
 TargetColumn ('code_gender')
 OtherColumnName ('other')
 CategoricalValues ('M', 'F')
 IsInputDense ('true')
 ) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS;

<p style = 'font-size:16px;font-family:Arial'>We can check what the fit table looks like.</p>

In [None]:
select * from onehotencodingfit_genderoutput;

<p style = 'font-size:16px;font-family:Arial'>We will create a similar fit table for the other categorical column</p>

In [None]:
CREATE VOLATILE TABLE onehotencodingfit_contractoutput AS (
 SELECT * FROM TD_OneHotEncodingFit (
 ON TRNG_CreditCard.credit_card AS InputTable
 USING
 TargetColumn ('NAME_CONTRACT_TYPE')
 OtherColumnName ('other')
 CategoricalValues ('Revolving loans', 'Cash loans')
 IsInputDense ('true')
 ) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS;


In [None]:
select * from onehotencodingfit_contractoutput;
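
<p style = 'font-size:16px;font-family:Arial'>For intuition, one-hot encoding of CODE_GENDER can be written by hand with CASE expressions. This is a sketch of the idea, not the output of TD_OneHotEncodingTransform: the column aliases are illustrative, and in this sketch a NULL or unlisted value falls into the "other" indicator, which may differ from how the function treats NULLs.</p>

```sql
-- Hand-written one-hot encoding: one 0/1 indicator per listed category,
-- plus an indicator for any value outside the list.
SELECT
    SK_ID_CURR,
    CODE_GENDER,
    CASE WHEN CODE_GENDER = 'M' THEN 1 ELSE 0 END          AS gender_m,
    CASE WHEN CODE_GENDER = 'F' THEN 1 ELSE 0 END          AS gender_f,
    CASE WHEN CODE_GENDER IN ('M', 'F') THEN 0 ELSE 1 END  AS gender_other
FROM TRNG_CreditCard.credit_card;
```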

<p style = 'font-size:16px;font-family:Arial'>For categorical columns that have many distinct values, we can use <b>TD_OrdinalEncodingFit</b> instead.</p>

In [None]:
SELECT * FROM TD_OrdinalEncodingFit (
 ON TRNG_CreditCard.credit_card AS InputTable
 OUT volatile table outputtable (ordinalencodingfit_familyoutput)
 USING
 TargetColumn ('NAME_FAMILY_STATUS')
 DefaultValue (-1)
) as dt;

<p style = 'font-size:16px;font-family:Arial'>Similarly, we will use TD_OrdinalEncodingFit for OCCUPATION_TYPE. Note that the input here is occupationimputetable, the output of the impute step applied to the OCCUPATION_TYPE column, because we want the NULL values to form a separate category.</p>

In [None]:
SELECT * FROM TD_OrdinalEncodingFit (
 ON occupationimputetable AS InputTable
 OUT volatile table outputtable (ordinalencodingfit_occupationoutput)
 USING
 TargetColumn ('OCCUPATION_TYPE')
 DefaultValue (-1)
) as dt;
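
<p style = 'font-size:16px;font-family:Arial'>Ordinal encoding maps each distinct category to an integer. The sketch below builds such a mapping by hand for NAME_FAMILY_STATUS, numbering categories alphabetically from 0. The alias and the ordering are assumptions made for illustration, so the integers in the fit table produced above may differ.</p>

```sql
-- Hand-built ordinal mapping: one integer per distinct category.
SELECT
    NAME_FAMILY_STATUS,
    ROW_NUMBER() OVER (ORDER BY NAME_FAMILY_STATUS) - 1 AS family_status_code
FROM (
    SELECT DISTINCT NAME_FAMILY_STATUS
    FROM TRNG_CreditCard.credit_card
    WHERE NAME_FAMILY_STATUS IS NOT NULL
) AS categories;
```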

<p style = 'font-size:16px;font-family:Arial'><b>TD_BinCodeFit and TD_BinCodeTransform </b>bin-code the
specified input table columns. Bin-coding is typically used to convert numeric data to categorical data by grouping the numeric values into multiple bins (intervals).</p> 

<p style = 'font-size:16px;font-family:Arial'>For variable width bins, we need to provide the bin table to the function. Let's create the table and use that in the TD_BinCodeFit function</p>

In [None]:
create table FitInputTable (ColumnName varchar(20), MinValue integer, MaxValue integer, Label varchar(20));

In [None]:
insert into FitInputTable values('age', 0, 18, '1-Children');
insert into FitInputTable values('age', 19, 25, '2-Young Adult');    
insert into FitInputTable values('age', 26, 45, '3-Middle Adult');
insert into FitInputTable values('age', 46, 60, '4-Old Adult');    
insert into FitInputTable values('age', 61 ,120, '5-Senior Citizen');

In [None]:
create volatile table FitOutputTable as (
SELECT * FROM TD_BincodeFit(
ON TRNG_CreditCard.credit_card as InputTable
ON FitInputTable as FitInput Dimension
USING
TargetColumns('age')
MethodType('Variable-Width')
MinValueColumn('MinValue')
MaxValueCOlumn('MaxValue')
LabelColumn('Label')
TargetColNames('ColumnName')
) AS dt
) with data
ON COMMIT PRESERVE ROWS;


<p style = 'font-size:16px;font-family:Arial'>The fit table looks like below:</p>

In [None]:
select * from FitOutputTable;
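
<p style = 'font-size:16px;font-family:Arial'>To see what variable-width binning means in practice, the same labels can be assigned with a plain join against FitInputTable. This sketch treats both bin boundaries as inclusive, which is an assumption; the boundary handling of TD_BinCodeTransform may differ. The alias age_group is illustrative.</p>

```sql
-- Hand-written variable-width binning: look up the bin whose
-- [MinValue, MaxValue] interval contains the age.
SELECT
    c.SK_ID_CURR,
    c.age,
    b.Label AS age_group
FROM TRNG_CreditCard.credit_card AS c
LEFT JOIN FitInputTable AS b
    ON  b.ColumnName = 'age'
    AND c.age BETWEEN b.MinValue AND b.MaxValue;
```

<p style = 'font-size:16px;font-family:Arial'>Rows whose age falls outside every interval get a NULL label because of the LEFT JOIN.</p>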

<p style = 'font-size:16px;font-family:Arial'><b>TD_ScaleFit and TD_ScaleTransform </b>scale the specified input
table columns, i.e. they apply a chosen scaling method, such as one based on the standard deviation, mean or range, to the input columns.</p> 

In [None]:
select * from TD_scaleFit(
on TRNG_CreditCard.credit_card  as InputTable
OUT VOLATILE TABLE OutputTable(scaleFitOut)
using
TargetColumns('amt_income_total')
MissValue('Keep')
ScaleMethod('range')
GlobalScale('f')
)as dt;
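
<p style = 'font-size:16px;font-family:Arial'>Assuming the 'range' method corresponds to min-max scaling, (x - min) / (max - min), the effect on amt_income_total can be sketched in plain SQL as shown below. The alias is illustrative. NULL inputs stay NULL, in the spirit of MissValue('Keep').</p>

```sql
-- Hand-written min-max scaling to the [0, 1] range.
-- NULLIF guards against division by zero when all values are equal.
SELECT
    SK_ID_CURR,
    amt_income_total,
    (CAST(amt_income_total AS FLOAT) - MIN(amt_income_total) OVER ())
        / NULLIF(CAST(MAX(amt_income_total) OVER ()
                      - MIN(amt_income_total) OVER () AS FLOAT), 0)
        AS amt_income_scaled
FROM TRNG_CreditCard.credit_card;
```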


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> TD_ColumnTransformer  </b> </p>
<p style = 'font-size:16px;font-family:Arial'>The TD_ColumnTransformer function transforms the entire dataset in a single operation. You only need
to provide the FIT tables to the function, and the function runs all the transformations that you require in a
single operation. Running all the fit table transformations together in one go gives approximately 30% performance improvement over running each transformation sequentially.</p>

<p style = 'font-size:16px;font-family:Arial'>Let us pass in all the fit tables we have created and transform the dataset.</p>

In [None]:
SELECT *
    FROM TD_ColumnTransformer(
 ON TRNG_CreditCard.credit_card AS inputtable
 ON  impute_fit_cat_output AS SImpleImputeFitTable DIMENSION   
 ON onehotencodingfit_genderoutput AS ONehotencodingfittable DIMENSION
 ON onehotencodingfit_contractoutput AS ONehotencodingfittable DIMENSION  
 ON ordinalencodingfit_familyoutput AS OrdinalEncodingFitTable DIMENSION
 ON ordinalencodingfit_occupationoutput AS OrdinalEncodingFitTable DIMENSION  
 ON FitOutputTable AS BincodeFitTable DIMENSION  
 ON scaleFitOut AS ScaleFitTable DIMENSION         
)
AS dt
;


<p style = 'font-size:16px;font-family:Arial'>We can store the result in a separate intermediate table that applies all the transformations and keeps only the columns needed for later steps.</p>

In [None]:
Create table Trasformed_data as(
SELECT SK_ID_CURR,"TARGET",CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE
    ,substr(age,1,1) as "AGE_GROUP",FLAG_MOBIL,FLAG_EMP_PHONE,CNT_FAM_MEMBERS,OCCUPATION_TYPE,CODE_GENDER_M,CODE_GENDER_F,
    "NAME_CONTRACT_TYPE_REVOLVING LOANS" as "REVOLVING_LOANS","NAME_CONTRACT_TYPE_CASH LOANS" as "CASH_LOANS"
    FROM TD_ColumnTransformer(
 ON TRNG_CreditCard.credit_card AS inputtable
 ON impute_fit_cat_output AS SImpleImputeFitTable DIMENSION   
 ON onehotencodingfit_genderoutput AS ONehotencodingfittable DIMENSION
 ON onehotencodingfit_contractoutput AS ONehotencodingfittable DIMENSION  
 ON ordinalencodingfit_familyoutput AS OrdinalEncodingFitTable DIMENSION
 ON ordinalencodingfit_occupationoutput AS OrdinalEncodingFitTable DIMENSION  
 ON FitOutputTable AS BincodeFitTable DIMENSION  
 ON scaleFitOut AS ScaleFitTable DIMENSION   
)
AS dt
) WITH DATA
;


In [None]:
select top 10* from Trasformed_data;
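
<p style = 'font-size:16px;font-family:Arial'>A simple sanity check is to confirm that the transformation did not add or drop rows. The query below compares the row counts of the source and the transformed table. The aliases are illustrative, and the explicit CAST keeps the second label from being truncated to the length of the first in the UNION.</p>

```sql
-- Compare row counts before and after the transformation.
SELECT CAST('source' AS VARCHAR(20)) AS table_name, COUNT(*) AS row_count
FROM TRNG_CreditCard.credit_card
UNION ALL
SELECT CAST('transformed' AS VARCHAR(20)) AS table_name, COUNT(*) AS row_count
FROM Trasformed_data;
```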

<p style = 'font-size:16px;font-family:Arial'>The raw data has now been transformed into numerical values that can be used as input for model creation.
</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Clean up</b> </p>

In [None]:
DROP TABLE Trasformed_data;


In [None]:
DROP TABLE FitInputTable;

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial'>In this notebook we have seen some of Teradata Vantage's new in-database functions for data cleansing, data exploration and feature engineering. Many of these functions can be applied in one go using the TD_ColumnTransformer function, which is approximately 30% faster than running the transformations one after another.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
        <li>Teradata Analytic Function Reference:
        <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview'>
        https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview</a></li>
  
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>