<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Credit Card Fraud Detection - Data Cleansing and Feature Engineering Pipeline
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
This notebook demonstrates Vantage data preparation capabilities using teradataml Python functions, for example:
</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Data Cleansing Functions, such as GetFutileColumns, SimpleImputeFit and SimpleImputeTransform</li>
    <li>Data Exploration Functions, such as ColumnSummary and CategoricalSummary</li>
    <li>Feature Engineering Functions, such as BinCodeFit and Transform, OrdinalEncodingFit and Transform, OneHotEncodingFit and Transform, ScaleFit and Transform, and ColumnTransformer</li>
</ul>
<br>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In a typical data science project, raw incoming data must pass through multiple pre-processing steps before it can be used in a model for predictions. By common estimates, about 70-80% of the time and effort goes into these pre-processing steps. Vantage's in-database functions let us perform them effectively and at scale.
This demo notebook uses sample financial data on credit card applications, with loan default as the target. We walk through the general pre-processing steps, using teradataml Python functions, that take the source data and make it usable for model creation.
</p>  

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Import Python packages, connect to Vantage and explore the dataset</b></p>

In [None]:
#import libraries
import getpass
from teradataml import *

display.max_rows = 5 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then <b>use down arrow</b> to go to next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Credit_Card_Data_Preparation_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. Because this demo uses a temporal table, we create the databases and tables in local storage and use them in the notebook. Please execute the procedure in the next cell.</p>


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_cloud');" 
# takes about 20 seconds, estimated space: 0 MB
#%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_local');" 
# takes about 35 seconds, estimated space: 11 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For this demo, the data already resides in object storage, which we access via ReadNOS through the get_data procedure used above. Here we create a reference to the table and sample its contents. Data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [None]:
tdf_cc = DataFrame(in_schema("DEMO_CreditCard","Credit_Card"))
tdf_cc.head()

In [None]:
tdf_cc.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>There are 50000 rows and 15 columns in the Credit_Card table.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Teradataml Python Package Function Reference</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Teradata Package for Python (teradataml) is a Python library that combines the benefits of the open-source Python language environment with the massively parallel processing capabilities of Teradata Vantage. More information can be found at
<a href = 'https://docs.teradata.com/search/all?query=Welcome+to+Teradata+Package+for+Python&filters=prodname~%2522Teradata+Package+for+Python%2522&content-lang=en-US'>
        Teradataml Python Reference
    </a>.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The command below lists all the analytic functions present in the package.</p>

In [None]:
display_analytic_functions()

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 2. Checking data demographics  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>ColumnSummary</b> displays the column name, data type and other demographics, such as the count of NULLs, for each specified input table column.</p>
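<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the NULL demographics concrete, here is a minimal local sketch of how a null percentage per column can be computed. It is plain Python for illustration only, not the in-database ColumnSummary implementation, and the sample rows and the helper name null_percentage are made up for this example.</p>

```python
# Local illustration only: plain Python, not the in-database ColumnSummary function.
# The rows below are hypothetical sample values, not taken from the Credit_Card table.
rows = [
    {"OCCUPATION_TYPE": "Laborers", "HOUSETYPE_MODE": None},
    {"OCCUPATION_TYPE": None, "HOUSETYPE_MODE": None},
    {"OCCUPATION_TYPE": "Drivers", "HOUSETYPE_MODE": "block of flats"},
    {"OCCUPATION_TYPE": "Managers", "HOUSETYPE_MODE": None},
]


def null_percentage(rows, column):
    """Return the percentage of rows whose value for `column` is None."""
    if not rows:
        return 0.0
    null_count = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * null_count / len(rows)


# One entry per column, keyed by column name.
null_pct = {column: null_percentage(rows, column) for column in rows[0]}
print(null_pct)
```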

In [None]:
colsum = ColumnSummary(data=tdf_cc,
                        target_columns=[':']
                       )
colsum.result

In [None]:
cs = colsum.result.filter(items = ['ColumnName', 'Datatype', 'NullPercentage'])
cs[cs['NullPercentage'] > 0.0]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because more than 50% of the values in the HOUSETYPE_MODE column are null, we can remove this column from our model calculations.<br>
    Let's check the other varchar columns.
OCCUPATION_TYPE also has a high percentage of null values.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>CategoricalSummary</b> displays the distinct values and their counts for each specified input table column.</p>

In [None]:
catsum = CategoricalSummary(data=tdf_cc,
                             target_columns=['CODE_GENDER','NAME_CONTRACT_TYPE','NAME_FAMILY_STATUS'
                                             ,'FLAG_OWN_CAR','OCCUPATION_TYPE']
                            )
 
catsum.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>GetFutileColumns</b> displays the categorical columns that will have no effect on the model. A column is considered futile if all of its values are the same, if all of its values are unique, or if the count of distinct values in the column divided by the total number of rows in the input
table is greater than or equal to the threshold value.</p>
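<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The three rules described above can be sketched locally as follows. This is a plain-Python illustration under stated assumptions, not the in-database implementation: the helper is_futile, the sample values, and the columns APPLICATION_ID and FREE_TEXT are hypothetical, and NULL handling is left out.</p>

```python
# Local illustration of the futile-column rules; not the in-database function.
def is_futile(values, threshold):
    """Return True if the column matches any of the three futile rules."""
    total = len(values)
    if total == 0:
        return True  # assumption: an empty column carries no information
    distinct = len(set(values))
    all_same = distinct == 1
    all_unique = distinct == total
    ratio_too_high = distinct / total >= threshold
    return all_same or all_unique or ratio_too_high


# Hypothetical sample columns with 100 rows each.
columns = {
    "FLAG_OWN_CAR": ["Y"] * 100,                       # every value identical
    "CODE_GENDER": ["M", "F"] * 50,                    # 2 distinct / 100 rows = 0.02
    "APPLICATION_ID": [str(i) for i in range(100)],    # every value unique
    "FREE_TEXT": [f"c{i % 10}" for i in range(100)],   # 10 distinct / 100 rows = 0.10
}

futile = [name for name, values in columns.items() if is_futile(values, threshold=0.05)]
print(futile)
```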

In [None]:
futilecol = GetFutileColumns(data=tdf_cc,
                             object=catsum,
                             category_summary_column="ColumnName",
                             threshold_value=0.05)
futilecol.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can see that FLAG_OWN_CAR will have no effect on the model because all the values in this column are the same, so we can exclude this column from model creation.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us check the values in the OCCUPATION_TYPE column to see what we can do for the NULLs in the column 
</p>

In [None]:
c1=catsum.result
c1[c1['ColumnName'] == 'OCCUPATION_TYPE'].sort('DistinctValueCount', ascending=False)

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> Impute Missing Values  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>SimpleImputeFit</b> outputs a table with the values that will be used to substitute the missing values.<br>
    <b>SimpleImputeTransform</b> returns the input data set with the missing values filled in.
Verify that the NULL values have been replaced.<br>
    Note: the Fit table can also be used as input to <b>ColumnTransformer</b>.</p>

In [None]:
# fit the SimpleImpute function on categorical columns
impute_fit_cat_output = SimpleImputeFit(data=tdf_cc,
                              literals_columns="OCCUPATION_TYPE",
                              literals="not provided")
 
# Print the result DataFrame.
impute_fit_cat_output.output

In [None]:
# assign imputed data to new dataframe
occupationimputedf = SimpleImputeTransform(data=tdf_cc, object=impute_fit_cat_output.output).result

In [None]:
occupationimputedf

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>SimpleImputeFit and SimpleImputeTransform also work on integer columns, where missing values can be filled based on the min, max, mean or median of the values in the column.</p>
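<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough local sketch of the two imputation styles mentioned here (a literal for a categorical column, a statistic for a numeric column), consider the plain-Python example below. It only illustrates the idea and is not the in-database implementation; the helper impute and the sample values are hypothetical.</p>

```python
# Local illustration of literal and statistic-based imputation.
from statistics import mean, median


def impute(values, fill):
    """Replace every None in `values` with `fill`."""
    return [fill if value is None else value for value in values]


# Hypothetical sample data.
occupations = ["Laborers", None, "Drivers", None]
incomes = [100.0, None, 400.0, 100.0]

# Categorical column: substitute a fixed literal.
filled_occupations = impute(occupations, "not provided")

# Numeric column: compute the statistic on observed values only, then substitute it.
observed = [value for value in incomes if value is not None]
filled_mean = impute(incomes, mean(observed))
filled_median = impute(incomes, median(observed))

print(filled_occupations)
print(filled_mean)
print(filled_median)
```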

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In model creation we usually prefer numeric inputs over character inputs. Let us check how many distinct values each character column has so that we can encode them as numbers.</p>

In [None]:
count_column = c1.DistinctValue.count(distinct=True)

df=c1.groupby("ColumnName").assign(count_=count_column)
df.sort('count_')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are not using the FLAG_OWN_CAR column in model creation. For the other columns we can use the <b>OneHotEncodingFit</b> and <b>OrdinalEncodingFit</b> functions, together with their transform functions, to convert character categories to numbers. To do that, we first need to check the exact values present in the columns.</p>

In [None]:
c1[c1['ColumnName'].isin(['CODE_GENDER','NAME_CONTRACT_TYPE'])]

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 3. Feature Engineering Functions  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>OneHotEncodingFit </b>outputs a table of attributes and categorical values to input to <b>OneHotEncodingTransform </b> which encodes them as one-hot numeric vectors.</p>
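<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit/transform split for one-hot encoding can be sketched locally as below. This is plain Python for illustration, not the in-database functions: one_hot_fit and one_hot_transform are hypothetical helpers, the sample values are made up, and the alphabetical category order is an assumption of this sketch that need not match the column order Vantage produces.</p>

```python
# Local illustration of one-hot encoding as a fit step and a transform step.
def one_hot_fit(values):
    """Learn the list of categories (sorted here so the output is deterministic)."""
    return sorted({value for value in values if value is not None})


def one_hot_transform(values, categories):
    """Encode each value as a 0/1 vector with one position per learned category."""
    return [[1 if value == category else 0 for category in categories] for value in values]


genders = ["M", "F", "F", "M"]            # hypothetical sample values
categories = one_hot_fit(genders)
encoded = one_hot_transform(genders, categories)

print(categories)
print(encoded)
```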

In [None]:
# create fit object to encode "gender" and "contract type" columns
hot_fit = OneHotEncodingFit(data=tdf_cc,
                                is_input_dense=True,
                                target_column=['CODE_GENDER','NAME_CONTRACT_TYPE'],
                                category_counts=[2,2],
                                approach="auto",
                                other_column="other")
 
# Print the result DataFrame.
hot_fit.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output above shows what the fit table looks like.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For categorical columns which have many values we can use <b>OrdinalEncoding</b> instead</p>
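<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A minimal local sketch of ordinal encoding with a default value for unseen categories is shown below. It is plain Python for illustration only: ordinal_fit and ordinal_transform are hypothetical helpers, the category values are made up, and assigning codes in alphabetical order is an assumption of this sketch rather than a statement about how the in-database function orders them.</p>

```python
# Local illustration of ordinal encoding with a fallback code for unseen values.
def ordinal_fit(values):
    """Map each distinct category to an integer code (alphabetical order assumed)."""
    categories = sorted({value for value in values if value is not None})
    return {category: code for code, category in enumerate(categories)}


def ordinal_transform(values, mapping, default_value=-1):
    """Replace each value by its code; values not seen during fit get default_value."""
    return [mapping.get(value, default_value) for value in values]


statuses = ["Married", "Single", "Widow", "Married"]   # hypothetical sample values
mapping = ordinal_fit(statuses)

# "Separated" was not present during fit, so it falls back to the default value.
codes = ordinal_transform(statuses + ["Separated"], mapping)

print(mapping)
print(codes)
```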

In [None]:
ordinal_fit = OrdinalEncodingFit(target_column=['NAME_FAMILY_STATUS','OCCUPATION_TYPE'],
                                 data=tdf_cc,
                                 default_value=-1
                                )

ordinal_fit.result

In [None]:
ordinal_fit.result.tdtypes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>BinCodeFit and BinCodeTransform </b>bin-codes the
specified input table columns. Bin-coding is typically used to convert numeric data to categorical data by binning the numeric data into multiple numeric bins (intervals).</p> 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For variable width bins, we need to provide the bin table to the function. Let's create the table and use that in the BinCodeFit function</p>

In [None]:
%%capture
query1 = '''
CREATE MULTISET TABLE DEMO_User.FitInputTable 
     (
      ColumnName varchar(20), 
      MinValue integer, 
      MaxValue integer, 
      Label varchar(20)
  )
NO PRIMARY INDEX;
'''

query2 = '''
insert into FitInputTable values('age', 0, 18, '1-Children')
;insert into FitInputTable values('age', 19, 25, '2-Young Adult')    
;insert into FitInputTable values('age', 26, 45, '3-Middle Adult')
;insert into FitInputTable values('age', 46, 60, '4-Old Adult')   
;insert into FitInputTable values('age', 61 ,120, '5-Senior Citizen')
;
'''

execute_sql(query1)
execute_sql(query2)

In [None]:
tdf_bin_data=DataFrame(in_schema("DEMO_User","FitInputTable"))
tdf_bin_data.head()

In [None]:
bin_code = BincodeFit(data=tdf_cc,
                      fit_data=tdf_bin_data,
                            fit_data_order_column = ['MinValue', 'MaxValue'],
                            target_columns='AGE',
                            minvalue_column='MinValue',
                            maxvalue_column='MaxValue',
                            label_column='Label',
                            method_type='Variable-Width'
                           )
 
# Print the result.
bin_code.output

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output above shows the fit table.</p>
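<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make variable-width binning concrete, the lookup that the bin table above describes can be sketched locally in plain Python. The ranges and labels are the ones inserted into FitInputTable; the helper bin_label, the sample ages, the inclusive treatment of both boundaries, and returning None for out-of-range values are assumptions of this sketch, not documented behavior of the in-database function.</p>

```python
# Local illustration of variable-width binning using this notebook's age ranges.
bins = [
    (0, 18, "1-Children"),
    (19, 25, "2-Young Adult"),
    (26, 45, "3-Middle Adult"),
    (46, 60, "4-Old Adult"),
    (61, 120, "5-Senior Citizen"),
]


def bin_label(value, bins):
    """Return the label of the first bin whose [low, high] range contains value."""
    for low, high, label in bins:
        if low <= value <= high:   # assumption: both boundaries are inclusive
            return label
    return None                    # assumption: values outside every bin get no label


ages = [17, 19, 45, 60, 99, 130]   # hypothetical sample ages
labels = [bin_label(age, bins) for age in ages]
print(labels)
```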

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>ScaleFit and ScaleTransform </b>scale the specified input
table columns, that is, they apply a chosen scaling method (for example, one based on the standard deviation, mean or range) to the input columns.</p> 
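<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an illustration of range-based scaling, the sketch below applies the usual min-max definition, (x - min) / (max - min), in plain Python. It is a local example under stated assumptions, not the in-database implementation: range_scale and the sample incomes are hypothetical, missing values are simply left in place, and a constant column is mapped to 0.0 by choice.</p>

```python
# Local illustration of min-max (range) scaling.
def range_scale(values):
    """Scale observed values to [0, 1] using (x - min) / (max - min)."""
    observed = [value for value in values if value is not None]
    if not observed:
        return list(values)          # nothing to scale
    low, high = min(observed), max(observed)
    span = high - low
    scaled = []
    for value in values:
        if value is None:
            scaled.append(None)      # leave missing values untouched
        elif span == 0:
            scaled.append(0.0)       # assumption: a constant column maps to 0.0
        else:
            scaled.append((value - low) / span)
    return scaled


incomes = [50000.0, 100000.0, None, 150000.0]   # hypothetical sample values
scaled_incomes = range_scale(incomes)
print(scaled_incomes)
```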

In [None]:
scale_fit = ScaleFit(data=tdf_cc,
                       target_columns="AMT_INCOME_TOTAL",
                       scale_method="RANGE",
                       miss_value="KEEP",
                       global_scale=False)
scale_fit.output

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> ColumnTransformer  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ColumnTransformer function transforms the entire dataset in a single operation. You only need
to provide the FIT tables to the function, and the function runs all transformations that you require in a
single operation. Running all the fit table transformations together in one go gives approx. 30% performance improvement over running each transformation sequentially.</p>
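<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea of applying several fitted transformations in one pass over the data, rather than one pass per transformation, can be sketched locally as follows. This plain-Python example only illustrates the concept and says nothing about how ColumnTransformer is implemented or where its performance gain comes from; transform_rows, the two per-column helpers, the fixed income bounds and the sample rows are all hypothetical.</p>

```python
# Local illustration: apply all per-column transformations in a single pass over the rows.
def impute_occupation(value):
    """Fill a missing occupation with a literal."""
    return "not provided" if value is None else value


def scale_income(value, low=50000.0, high=150000.0):
    """Min-max scale an income using bounds assumed to come from an earlier fit step."""
    return (value - low) / (high - low)


# One entry per column that needs a transformation.
transforms = {
    "OCCUPATION_TYPE": impute_occupation,
    "AMT_INCOME_TOTAL": scale_income,
}


def transform_rows(rows, transforms):
    """Visit each row once and apply whichever transformation its columns need."""
    result = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            step = transforms.get(column)
            new_row[column] = step(value) if step else value
        result.append(new_row)
    return result


rows = [
    {"SK_ID_CURR": 1, "OCCUPATION_TYPE": None, "AMT_INCOME_TOTAL": 100000.0},
    {"SK_ID_CURR": 2, "OCCUPATION_TYPE": "Drivers", "AMT_INCOME_TOTAL": 150000.0},
]
transformed = transform_rows(rows, transforms)
print(transformed)
```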

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us pass all the fit tables we have created to ColumnTransformer and transform the dataset.</p>

In [None]:
out1 = ColumnTransformer(input_data=tdf_cc,
                                          simpleimpute_fit_data=impute_fit_cat_output.output,
                                          bincode_fit_data=bin_code.output,
                                          scale_fit_data=scale_fit.output,
                                          onehotencoding_fit_data=hot_fit.result,
                                        )
# Apply the ordinal encoding in a second call, on the result of the first
out2 = ColumnTransformer(input_data=out1.result,
                                         ordinalencoding_fit_data=ordinal_fit.result,
                                        )
tdf = out2.result   


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can now drop the columns we no longer need from the DataFrame and create a table in the database for further use in the model.</p>

In [None]:
# Derive AGE_GROUP from the AGE bin label with a SUBSTRING of length 1,
# then keep only the needed columns and rename the encoded ones
obj = StrApply(data=tdf,
                   target_columns='AGE',
                   output_columns='AGE_GROUP',
                   string_operation='SUBSTRING',
                   in_place=False,
                   string_length=1,
                   accumulate=[':']
                )
t1 = obj.result

transformed_df = t1.assign(drop_columns=True
                  ,SK_ID_CURR=t1.SK_ID_CURR  
                  ,TARGET=t1.TARGET 
                  ,CNT_CHILDREN=t1.CNT_CHILDREN
                  ,AMT_INCOME_TOTAL=t1.AMT_INCOME_TOTAL
                  ,NAME_FAMILY_STATUS=t1.NAME_FAMILY_STATUS
                  ,REGION_POPULATION_RELATIVE=t1.REGION_POPULATION_RELATIVE
                  ,AGE_GROUP=t1.AGE_GROUP      
                  ,FLAG_MOBIL=t1.FLAG_MOBIL     
                  ,FLAG_EMP_PHONE=t1.FLAG_EMP_PHONE     
                  ,CNT_FAM_MEMBERS=t1.CNT_FAM_MEMBERS        
                  ,OCCUPATION_TYPE=t1.OCCUPATION_TYPE    
                  ,MALE=t1.CODE_GENDER_1         
                  ,FEMALE=t1.CODE_GENDER_0         
                  ,REVOLVING_LOANS=t1.NAME_CONTRACT_TYPE_1     
                  ,CASH_LOANS=t1.NAME_CONTRACT_TYPE_0            
              ) 

transformed_df

In [None]:
#copy the dataframe to table 
transformed_df.to_sql("transformed_data", if_exists="replace")

In [None]:
tdf = DataFrame(in_schema("DEMO_User","Transformed_Data"))
tdf.head()

In [None]:
tdf.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have now shown how to use ClearScape in-database functions to prepare the data. The result is a cleansed and processed data set that you can use as input for data science model creation.
</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Cleanup</b> </p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Worktables</b> </p>

In [None]:
db_drop_table(table_name="Transformed_Data")

In [None]:
db_drop_table(table_name="FitInputTable")

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Database and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_CreditCard');"        # Takes 5 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 5. Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this notebook we have seen some of Teradata Vantage ClearScape's new in-database functions for data cleansing, data exploration and feature engineering. Many of these functions can be applied in one go using the ColumnTransformer function, which is approx. 30% faster than serial processing.</p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>Teradata Analytic Function Reference:
        <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview'>
        https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview</a></li>
  
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>