<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Credit Card Fraud Detection - Data Cleansing and Feature Engineering Pipeline
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
This notebook demonstrates Vantage in-database functions in three groups:
</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Data Cleansing Functions, such as TD_GetFutileColumns, TD_SimpleImputeFit and TD_SimpleImputeTransform</li>
    <li>Data Exploration Functions, such as TD_ColumnSummary and TD_CategoricalSummary</li>
    <li>Feature Engineering Functions, such as TD_BinCodeFit &amp; Transform, TD_OrdinalEncodingFit &amp; Transform, TD_OneHotEncodingFit &amp; Transform, TD_ScaleFit &amp; Transform and TD_ColumnTransformer</li>
</ul>
<br>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In a typical data science project, raw incoming data must pass through several pre-processing steps before it can be used in a model for predictions. By common estimates, about 70-80% of the time and effort goes into these steps. Vantage's in-database functions let us perform them efficiently and at scale.
This demo notebook uses sample financial data from credit card applications, with loan default as the target, and walks through the general pre-processing steps that make the source data usable for model creation.
</p>  

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press Enter, then use down arrow to go to next cell.</p>

In [None]:
%connect local, hidewarnings=true

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [None]:
Set query_band='DEMO=CreditCardFraud.ipynb;' update for session;

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. Because this demo uses temporal tables, we create the databases and tables in local storage and use them in the notebook. Please execute the procedure in the next cell.</p>


In [None]:
--call get_data('DEMO_CreditCard_cloud');    -- takes about 20 seconds, estimated space: 0 MB
call get_data('DEMO_CreditCard_local');     -- takes about 35 seconds, estimated space: 11 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
call space_report();  -- optional, takes about 10 seconds

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The get_data procedure called above makes the demo data, which originates in Object Storage, available as DEMO_CreditCard.Credit_Card. The query below samples its contents. Data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [None]:
sel top 5 * from DEMO_CreditCard.Credit_Card;

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 2. Checking data demographics  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <b>TD_ColumnSummary</b> function displays the column name, data type and other demographics, such as the count of NULLs, for each specified input table column.</p>

In [None]:
SELECT * FROM TD_ColumnSummary (
 ON DEMO_CreditCard.Credit_Card AS InputTable
 USING
 TargetColumns ('[:]')
) AS dt;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The SQL below lists the columns that contain NULLs, ordered by null percentage.</p>

In [None]:
SELECT columnname, datatype, nullpercentage FROM TD_ColumnSummary (
 ON DEMO_CreditCard.Credit_Card AS InputTable
 USING
 TargetColumns ('[:]')
) AS dt
where 
nullpercentage > 0
order by 3 desc;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because more than 50% of the values in HOUSETYPE_MODE are NULL, we can exclude this column from our model calculations.<br>
    Let's check the other varchar columns.
OCCUPATION_TYPE also has a high percentage of NULL values.</p>
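
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a cross-check, the null percentage of a single column can also be computed with plain SQL aggregates. This is a minimal sketch that relies only on the table and the OCCUPATION_TYPE column used in this notebook. The column aliases are illustrative, and the figure may differ from the TD_ColumnSummary output in rounding.</p>

```sql
-- COUNT(*) counts every row; COUNT(column) skips NULLs,
-- so the difference is the number of NULLs in the column.
SELECT COUNT(*)                                   AS total_rows,
       COUNT(*) - COUNT(OCCUPATION_TYPE)          AS null_rows,
       CAST(COUNT(*) - COUNT(OCCUPATION_TYPE) AS FLOAT) * 100
           / COUNT(*)                             AS null_pct
FROM DEMO_CreditCard.Credit_Card;
```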

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>TD_CATEGORICALSUMMARY </b>function   displays the distinct values and their counts for each specified input table column</p>

In [None]:
Create volatile table cateogrySummaryTable as (
SELECT * FROM TD_CATEGORICALSUMMARY (
ON DEMO_CreditCard.Credit_Card as inputtable
USING
TargetColumns('CODE_GENDER'
,'NAME_CONTRACT_TYPE'
,'NAME_FAMILY_STATUS'
,'FLAG_OWN_CAR'
,'OCCUPATION_TYPE')
) AS dt)With data 
on commit preserve rows;

In [None]:
select top 5 * from cateogrySummaryTable;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <b>TD_GETFUTILECOLUMNS</b> function displays the categorical columns that will have no effect on the model. A column is considered futile if all of its values are the same, if all of its values are unique, or if the count of distinct values in the column divided by the total number of rows in the input table is greater than or equal to the threshold value.</p>

In [None]:
Select * from TD_getFutileColumns(
ON DEMO_CreditCard.Credit_Card as inputtable partition by any
ON cateogrySummaryTable as categorytable Dimension
USING
CategoricalSummaryColumn('ColumnName') 
ThresholdValue(0.05)
)As dt;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can see that FLAG_OWN_CAR will have no effect on the model because all the values in this column are the same, so we can exclude it from model creation.</p>
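
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The futile-column criteria can be inspected by hand for a single column. The sketch below is plain SQL, not part of TD_getFutileColumns, and its aliases are illustrative. A distinct_values of 1 indicates that every non-NULL value is the same, and distinct_ratio is the quantity compared against the threshold value.</p>

```sql
-- COUNT(DISTINCT ...) ignores NULLs, so NULLs are not counted as a category here.
SELECT COUNT(DISTINCT FLAG_OWN_CAR)                          AS distinct_values,
       COUNT(*)                                              AS total_rows,
       CAST(COUNT(DISTINCT FLAG_OWN_CAR) AS FLOAT) / COUNT(*) AS distinct_ratio
FROM DEMO_CreditCard.Credit_Card;
```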

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us check the values in the OCCUPATION_TYPE column to see what we can do for the NULLs in the column 
</p>

In [None]:
SELECT * FROM cateogrySummaryTable where columnname = 'OCCUPATION_TYPE' order by DistinctValueCount desc;

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> Impute Missing Values  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>TD_SimpleImputeFit </b>outputs a table with the values that will be substituted for the missing values.<br>
    <b>TD_SimpleImputeTransform</b> returns the input data set with the missing values filled in.
Afterwards, verify that the NULL values have been removed.<br>
    Note: the Fit table can also be used as input to <b>TD_ColumnTransformer</b>.</p>

In [None]:
-- fit the SimpleImpute function on categorical columns
SELECT * FROM TD_SimpleImputeFit (
    ON DEMO_CreditCard.Credit_Card as InputTable
    OUT VOLATILE TABLE OutputTable(impute_fit_cat_output)
    USING
    ColsForLiterals ('OCCUPATION_TYPE')
    Literals ('not provided')
) as dt;



In [None]:
Create volatile table occupationimputetable as (
SELECT * FROM TD_SimpleImputeTransform (
 ON DEMO_CreditCard.Credit_Card as InputTable
 ON impute_fit_cat_output AS FitTable DIMENSION
) AS dt)With data 
on commit preserve rows;
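
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One simple way to verify the imputation is to count the rows that are still NULL and the rows that now carry the literal. This is a sketch using the volatile table created above; the aliases are illustrative. If the transform was applied as intended, the first query should return 0.</p>

```sql
-- Rows where OCCUPATION_TYPE is still missing after the transform.
SELECT COUNT(*) AS remaining_nulls
FROM occupationimputetable
WHERE OCCUPATION_TYPE IS NULL;

-- Rows that received the literal supplied to TD_SimpleImputeFit.
SELECT COUNT(*) AS imputed_rows
FROM occupationimputetable
WHERE OCCUPATION_TYPE = 'not provided';
```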

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_SimpleImputeFit and TD_SimpleImputeTransform functions also work on integer columns, where missing values can be filled with the min, max, mean or median of the values in the column.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Models usually prefer numerical inputs to character inputs. Let us check how many distinct values each of our character columns has so that we can encode them as numbers.</p>

In [None]:
SELECT columnname,count(distinctvalue) FROM cateogrySummaryTable 
group by 1 order by 2; 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are not using the FLAG_OWN_CAR column in model creation. For the other columns we can use <b>TD_OneHotEncodingFit</b> and <b>TD_OrdinalEncodingFit</b>, together with their transform functions, to convert character categories to numbers. To do that, we first need to check the exact values present in the columns.</p>

In [None]:
SELECT columnname,distinctvalue FROM cateogrySummaryTable where columnname in 
('CODE_GENDER', 'NAME_CONTRACT_TYPE')
order by 1; 

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 3. Feature Engineering Functions  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>TD_OneHotEncodingFit </b>outputs a table of attributes and categorical values to input to <b>TD_OneHotEncodingTransform </b> which encodes them as one-hot numeric vectors.</p>

In [None]:
CREATE VOLATILE TABLE onehotencodingfittable AS (
SELECT * FROM TD_OneHotEncodingFit (
 ON DEMO_CreditCard.Credit_Card AS InputTable
 USING
  TargetColumn ('CODE_GENDER','NAME_CONTRACT_TYPE')
  OtherColumnName ('other')
  IsInputDense ('true')
  CategoryCounts(2,2)
  Approach('Auto')    
 ) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can check what the fit table looks like.</p>

In [None]:
select * from onehotencodingfittable;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For categorical columns that have many distinct values, we can use <b>TD_OrdinalEncodingFit</b> instead.</p>

In [None]:
SELECT * FROM TD_OrdinalEncodingFit (
ON DEMO_CreditCard.Credit_Card AS InputTable
OUT volatile table outputtable (ordinalencodingfittable)
USING
  TargetColumn ('NAME_FAMILY_STATUS','OCCUPATION_TYPE')
  DefaultValue (-1)
) as dt;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>TD_BinCodeFit and TD_BinCodeTransform </b>bin-code the specified input table columns. Bin-coding is typically used to convert numeric data to categorical data by grouping the numeric values into multiple bins (intervals).</p> 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For variable width bins, we need to provide the bin table to the function. Let's create the table and use that in the TD_BinCodeFit function</p>

In [None]:
create table FitInputTable (ColumnName varchar(20), MinValue integer, MaxValue integer, Label varchar(20));

In [None]:
insert into FitInputTable values('age', 0, 18, '1-Children');
insert into FitInputTable values('age', 19, 25, '2-Young Adult');    
insert into FitInputTable values('age', 26, 45, '3-Middle Adult');
insert into FitInputTable values('age', 46, 60, '4-Old Adult');    
insert into FitInputTable values('age', 61 ,120, '5-Senior Citizen');

In [None]:
create volatile table FitOutputTable as (
SELECT * FROM TD_BincodeFit(
ON DEMO_CreditCard.Credit_Card as InputTable
ON FitInputTable as FitInput Dimension
USING
TargetColumns('age')
MethodType('Variable-Width')
MinValueColumn('MinValue')
MaxValueColumn('MaxValue')
LabelColumn('Label')
TargetColNames('ColumnName')
) AS dt
) with data
ON COMMIT PRESERVE ROWS;


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit table looks like this:</p>

In [None]:
select * from FitOutputTable;
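
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the variable-width bins concrete, the same ranges and labels can be written as a plain CASE expression. This is an illustrative sketch, not what TD_BinCodeTransform runs internally. It assumes age holds whole numbers, so values between two bins (for example 18.5) would get no label here, and the function's own boundary handling may differ. The alias age_group is illustrative.</p>

```sql
-- Hand-written equivalent of the ranges inserted into FitInputTable.
SELECT TOP 5
       SK_ID_CURR,
       age,
       CASE
           WHEN age BETWEEN 0  AND 18  THEN '1-Children'
           WHEN age BETWEEN 19 AND 25  THEN '2-Young Adult'
           WHEN age BETWEEN 26 AND 45  THEN '3-Middle Adult'
           WHEN age BETWEEN 46 AND 60  THEN '4-Old Adult'
           WHEN age BETWEEN 61 AND 120 THEN '5-Senior Citizen'
       END AS age_group
FROM DEMO_CreditCard.Credit_Card;
```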

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>TD_ScaleFit and TD_ScaleTransform </b>scale the specified input table columns, i.e. they apply a chosen scaling method, such as standard deviation, mean or range, to the input columns.</p> 

In [None]:
select * from TD_scaleFit(
on DEMO_CreditCard.Credit_Card  as InputTable
OUT VOLATILE TABLE OutputTable(scaleFitOut)
using
TargetColumns('amt_income_total')
MissValue('Keep')
ScaleMethod('range')
GlobalScale('f')
)as dt;
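
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an illustration of what range scaling does, the sketch below computes (value - min) / (max - min) in plain SQL. It assumes that ScaleMethod('range') rescales values to the interval [0, 1] in this way, which should be confirmed against the function reference. The aliases min_income, max_income and income_scaled are illustrative.</p>

```sql
-- CAST to FLOAT avoids integer division; NULLIF guards against max = min.
SELECT TOP 5
       c.SK_ID_CURR,
       c.amt_income_total,
       (CAST(c.amt_income_total AS FLOAT) - s.min_income)
           / NULLIF(s.max_income - s.min_income, 0) AS income_scaled
FROM DEMO_CreditCard.Credit_Card AS c
CROSS JOIN (
    SELECT MIN(amt_income_total) AS min_income,
           MAX(amt_income_total) AS max_income
    FROM DEMO_CreditCard.Credit_Card
) AS s;
```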


<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> TD_ColumnTransformer  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_ColumnTransformer function transforms the entire dataset in a single operation. You only need
to provide the FIT tables to the function, and the function runs all the transformations you require in a
single operation. Running all the fit table transformations together in one go gives approx. 30% performance improvement over running each transformation sequentially.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us pass all the fit tables we have created to the function and transform the dataset.</p>

In [None]:
SELECT * FROM TD_ColumnTransformer(
    ON (SELECT * FROM TD_ColumnTransformer(
    ON DEMO_CreditCard.Credit_Card AS InputTable PARTITION BY ANY 
    ON FitOutputTable AS BincodeFitTable DIMENSION 
    ON onehotencodingfittable AS OneHotEncodingFitTable DIMENSION 
    ON scaleFitOut AS ScaleFitTable DIMENSION 
    ON impute_fit_cat_output AS SimpleImputeFitTable DIMENSION
    ) as dt) AS InputTable PARTITION BY ANY 
    ON ordinalencodingfittable AS OrdinalEncodingFitTable DIMENSION ) as dt1
    ;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can store the result in a separate intermediate table that contains the transformed data and omits the columns of the original table that are not needed further.</p>

In [None]:
Create multiset table Transformed_data as(
SELECT SK_ID_CURR,"TARGET",CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE
    ,substr(age,1,1) as "AGE_GROUP",FLAG_MOBIL,FLAG_EMP_PHONE,CNT_FAM_MEMBERS,OCCUPATION_TYPE,CODE_GENDER_1 as Male,CODE_GENDER_0 as Female,
    "NAME_CONTRACT_TYPE_1" as "REVOLVING_LOANS","NAME_CONTRACT_TYPE_0" as "CASH_LOANS"
    FROM TD_ColumnTransformer(
    ON (SELECT * FROM TD_ColumnTransformer(
    ON DEMO_CreditCard.Credit_Card AS InputTable PARTITION BY ANY 
    ON FitOutputTable AS BincodeFitTable DIMENSION 
    ON onehotencodingfittable AS OneHotEncodingFitTable DIMENSION 
    ON scaleFitOut AS ScaleFitTable DIMENSION 
    ON impute_fit_cat_output AS SimpleImputeFitTable DIMENSION
    ) as dt) AS InputTable PARTITION BY ANY 
    ON ordinalencodingfittable AS OrdinalEncodingFitTable DIMENSION ) as dt1 ) WITH DATA
; 

In [None]:
select top 5 * from Transformed_data;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have now shown how ClearScape in-database functions can be used to prepare the data. The result is a cleansed and processed data set that can be used as input for data science model creation.
</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Cleanup</b> </p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Worktables</b> </p>

In [None]:
DROP TABLE Transformed_data;

In [None]:
DROP TABLE FitInputTable;

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
call remove_data('DEMO_CreditCard');-- takes about 5 seconds, optional if you want to use the data later

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 5. Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this notebook we have seen some of the new Teradata Vantage ClearScape Analytics in-database functions for data cleansing, data exploration and feature engineering. Many of these functions can be applied in one go using the TD_ColumnTransformer function, which is approx. 30% faster than running the transformations sequentially.</p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>Teradata Analytic Function Reference:
        <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview'>
        https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview</a></li>
  
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>