<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Credit Card Fraud Detection - Data Cleansing and Feature Engineering Pipeline
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
This notebook demonstrates Vantage capabilities using the teradataml Python functions, for example:
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Data Cleansing Functions - like GetFutileColumns, SimpleImputeFit and SimpleImputeTransform </li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Data Exploration Functions - like ColumnSummary and CategoricalSummary </li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Feature Engineering Functions - like BinCodeFit & Transform, OrdinalEncodingFit & Transform, OneHotEncodingFit & Transform, ScaleFit & Transform and ColumnTransformer </li>
</p>
<br>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In a typical data science project, the raw incoming data must go through multiple pre-processing steps before it can be used in a model for predictions. By common estimates, about 70-80% of the time and effort goes into these steps. Vantage's in-database functions let us perform them effectively and at scale.
In this demo notebook we use sample financial data from credit card applications, with loan default as the target. We will go through the general pre-processing steps, using the teradataml Python functions, that take the source data and make it usable for model creation.
</p>  

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Import Python packages, connect to Vantage and explore the dataset</b></p>

In [1]:
#import libraries
import getpass
from teradataml import *

display.max_rows = 5 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then <b>use down arrow</b> to go to next cell.</p>

In [2]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

Performing setup ...
Setup complete



Enter password:  ········


... Logon successful
Connected as: xxxxxsql://demo_user:xxxxx@host.docker.internal/dbc
Engine(teradatasql://demo_user:***@host.docker.internal)


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [3]:
%%capture
execute_sql('''SET query_band='DEMO=Credit_Card_Data_Preparation_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. Because this demo uses a temporal table, we create the databases and tables in local storage and use them in the notebook. Please execute the procedure in the next cell.</p>


In [4]:
%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_cloud');" 
# takes about 20 seconds, estimated space: 0 MB
#%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_local');" 
# takes about 35 seconds, estimated space: 11 MB

That ran for   0:00:12.39 with 5 statements and 0 errors. 


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – if you want to see status of databases/tables created and space used.</p>

In [5]:
%run -i ../run_procedure.py "call space_report();"

You have:  #databases=1 #tables=0 #views=3  You have used 0.7 MB of 27,890.4 MB available - 0.0%  ... Space Usage OK
 
   Database Name                  #tables  #views     Avail MB      Used MB
   demo_user                            0       2  27,890.4 MB       0.7 MB 
   DEMO_CreditCard                      0       1       0.0 MB       0.0 MB 


<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For this demo, the data already resides in object storage, which we access through ReadNOS via the get_data procedure run above. Here we create a reference to the table and sample its contents. The data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [6]:
tdf_cc = DataFrame(in_schema("DEMO_CreditCard","Credit_Card"))
tdf_cc.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE,FLAG_MOBIL,FLAG_EMP_PHONE,CNT_FAM_MEMBERS,HOUSETYPE_MODE,OCCUPATION_TYPE,AGE
100004,0,Revolving loans,M,N,0,67500.0,Single / not married,0.0100320000201463,1,1,1,,Laborers,52
100007,0,Cash loans,M,N,0,121500.0,Single / not married,0.0286630000919103,1,1,1,,Core staff,54
100008,0,Cash loans,M,N,0,99000.0,Married,0.0357920005917549,1,1,2,,Laborers,46
100009,0,Cash loans,F,N,1,171000.0,Married,0.0357920005917549,1,1,3,,Accountants,37
100011,0,Cash loans,F,N,0,112500.0,Married,0.0186340007930994,1,0,2,,,55
100012,0,Revolving loans,M,N,0,135000.0,Single / not married,0.0196889992803335,1,1,1,,Laborers,39
100010,0,Cash loans,M,N,0,360000.0,Married,0.0031220000237226,1,1,2,,Managers,51
100006,0,Cash loans,F,N,0,135000.0,Civil marriage,0.0080190002918243,1,1,2,,Laborers,52
100003,0,Cash loans,F,N,0,270000.0,Married,0.0035409999545663,1,1,2,block of flats,Core staff,45
100002,1,Cash loans,M,N,0,202500.0,Single / not married,0.018800999969244,1,1,1,block of flats,Laborers,25


In [7]:
tdf_cc.shape

(50000, 15)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Credit_Card table has 50000 rows and 15 columns.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Teradataml Python Package Function Reference</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Teradata Package for Python (teradataml) is a Python library that combines the benefits of the open-source Python language environment with the massively parallel processing capabilities of Teradata Vantage. More information can be found at
<a href = 'https://docs.teradata.com/search/all?query=Welcome+to+Teradata+Package+for+Python&filters=prodname~%2522Teradata+Package+for+Python%2522&content-lang=en-US'>
        Teradataml Python Reference
    </a>
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The command below lists all the analytic functions available in the package.</p>

In [8]:
display_analytic_functions()


List of available functions:

	Analytics Database Functions:
		* MODEL SCORING functions:
			 1. DecisionTreePredict
			 2. GLMPredictPerSegment
			 3. KMeansPredict
			 4. NaiveBayesPredict
			 5. OneClassSVMPredict
			 6. SVMPredict
			 7. TDDecisionForestPredict
			 8. TDGLMPredict
			 9. XGBoostPredict
		* TEXT ANALYTIC functions:
			 1. NaiveBayesTextClassifierTrainer
			 2. NGramSplitter
			 3. SentimentExtractor
			 4. TextParser
			 5. WordEmbeddings
		* DATA CLEANING functions:
			 1. ConvertTo
			 2. GetFutileColumns
			 3. GetRowsWithoutMissingValues
			 4. OutlierFilterFit
			 5. OutlierFilterTransform
			 6. Pack
			 7. SimpleImputeFit
			 8. SimpleImputeTransform
			 9. StringSimilarity
			 10. Unpack
		* SCORING WITH MACHINE LEARNING ENGINE MODELS functions:
			 1. DecisionForestPredict
			 2. GLMPredict
			 3. NaiveBayesTextClassifierPredict
			 4. SVMSparsePredict
		* FEATURE ENGINEERING TRANSFORM functions:
			 1. Antiselect
			 2. BincodeFit
			 3. BincodeTransform


<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 2. Checking data demographics  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <b>ColumnSummary</b> function displays the column name, data type and other demographics, such as the count of NULLs, for each specified input table column.</p>

In [9]:
colsum = ColumnSummary(data=tdf_cc,
                        target_columns=[':']
                       )
colsum.result

ColumnName,Datatype,NonNullCount,NullCount,BlankCount,ZeroCount,PositiveCount,NegativeCount,NullPercentage,NonNullPercentage
FLAG_MOBIL,INTEGER,50000,0,,1.0,49999.0,0.0,0.0,100.0
FLAG_OWN_CAR,VARCHAR(10) CHARACTER SET UNICODE,50000,0,0.0,,,,0.0,100.0
NAME_FAMILY_STATUS,VARCHAR(50) CHARACTER SET UNICODE,50000,0,0.0,,,,0.0,100.0
SK_ID_CURR,BIGINT,50000,0,,0.0,50000.0,0.0,0.0,100.0
AMT_INCOME_TOTAL,FLOAT,50000,0,,0.0,50000.0,0.0,0.0,100.0


In [10]:
cs = colsum.result.filter(items = ['ColumnName', 'Datatype', 'NullPercentage'])
cs[cs['NullPercentage'] > 0.0]

ColumnName,Datatype,NullPercentage
HOUSETYPE_MODE,VARCHAR(50) CHARACTER SET UNICODE,53.08
OCCUPATION_TYPE,VARCHAR(50) CHARACTER SET UNICODE,28.582


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because more than 50% of the values in the HOUSETYPE_MODE column are NULL, we can remove this column from our model calculations.<br>
    Let's check the other varchar columns.
OCCUPATION_TYPE also has a high share of NULL values (about 28.6%).</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <b>CategoricalSummary</b> function displays the distinct values and their counts for each specified input table column.</p>

In [11]:
catsum = CategoricalSummary(data=tdf_cc,
                             target_columns=['CODE_GENDER','NAME_CONTRACT_TYPE','NAME_FAMILY_STATUS'
                                             ,'FLAG_OWN_CAR','OCCUPATION_TYPE']
                            )
 
catsum.result

ColumnName,DistinctValue,DistinctValueCount
FLAG_OWN_CAR,N,50000
CODE_GENDER,F,30943
NAME_FAMILY_STATUS,Civil marriage,5345
NAME_FAMILY_STATUS,Separated,3195
OCCUPATION_TYPE,Cooking staff,1081


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <b>GetFutileColumns</b> function displays the categorical columns that will have no effect on the model. A column is reported when all of its values are the same, when all of its values are unique, or when the count of distinct values in the column divided by the total number of rows in the input
table is greater than or equal to the threshold value.</p>

In [12]:
futilecol = GetFutileColumns(data=tdf_cc,
                             object=catsum,
                             category_summary_column="ColumnName",
                             threshold_value=0.05)
futilecol.result

ColumnName
FLAG_OWN_CAR


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can see that FLAG_OWN_CAR will have no effect on the model because all the values in this column are the same, so we can exclude this column from model creation.</p>
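
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The rule described for GetFutileColumns can be sketched in plain Python. This is only an illustration of the three conditions, not the in-database implementation; the helper name <code>is_futile</code> is hypothetical, and the distinct-value counts are the ones CategoricalSummary reports for this dataset.</p>

```python
# Plain-Python illustration of the futile-column rule (not a teradataml call).
def is_futile(distinct_count: int, total_rows: int, threshold: float) -> bool:
    """Return True when a categorical column matches one of the three conditions."""
    all_same = distinct_count == 1
    all_unique = distinct_count == total_rows
    too_many_values = distinct_count / total_rows >= threshold
    return all_same or all_unique or too_many_values

total_rows = 50000  # from tdf_cc.shape
distinct_counts = {"FLAG_OWN_CAR": 1, "CODE_GENDER": 2, "NAME_CONTRACT_TYPE": 2}

futile = [name for name, n in distinct_counts.items() if is_futile(n, total_rows, 0.05)]
print(futile)  # ['FLAG_OWN_CAR']
```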

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us check the values in the OCCUPATION_TYPE column to see what we can do for the NULLs in the column 
</p>

In [13]:
c1=catsum.result
c1[c1['ColumnName'] == 'OCCUPATION_TYPE'].sort('DistinctValueCount', ascending=False)

ColumnName,DistinctValue,DistinctValueCount
OCCUPATION_TYPE,,14291
OCCUPATION_TYPE,Laborers,10256
OCCUPATION_TYPE,Sales staff,5684
OCCUPATION_TYPE,Core staff,4060
OCCUPATION_TYPE,Drivers,3570


<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> Impute Missing Values  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>SimpleImputeFit</b> outputs a table with the values that will be used to substitute the missing values.<br>
    <b>SimpleImputeTransform</b> returns the input data set with the missing values filled in.
Verify that the NULL values have been replaced.<br>
    *Note: the Fit table can also be used as input to <b>ColumnTransformer</b>.</p>

In [14]:
# fit the SimpleImpute function on categorical columns
impute_fit_cat_output = SimpleImputeFit(data=tdf_cc,
                              literals_columns="OCCUPATION_TYPE",
                              literals="not provided")
 
# Print the result DataFrame.
impute_fit_cat_output.output

TD_INDEX_SIMFIT,TD_TARGETCOLUMN_SIMFIT,TD_NUM_COLVAL_SIMFIT,TD_STR_COLVAL_SIMFIT,TD_ISNUMERIC_SIMFIT
13,OCCUPATION_TYPE,,not provided,0


In [15]:
# assign imputed data to new dataframe
occupationimputedf = SimpleImputeTransform(data=tdf_cc, object=impute_fit_cat_output.output).result

In [16]:
occupationimputedf

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE,FLAG_MOBIL,FLAG_EMP_PHONE,CNT_FAM_MEMBERS,HOUSETYPE_MODE,OCCUPATION_TYPE,AGE
101613,0,Cash loans,F,N,0,126000.0,Married,0.0357920005917549,1,0,2,,not provided,60
115765,0,Cash loans,F,N,0,99000.0,Separated,0.0096570001915097,1,1,1,,Sales staff,47
124104,0,Cash loans,F,N,0,113400.0,Married,0.0357920005917549,1,1,2,,not provided,62
118537,0,Cash loans,F,N,1,189000.0,Married,0.0070199999026954,1,1,3,,Laborers,47
107403,1,Cash loans,F,N,0,135000.0,Married,0.0307550001889467,1,1,2,,Sales staff,31


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The SimpleImputeFit and SimpleImputeTransform functions also work on numeric columns, where the missing values can be filled based on the min, max, mean or median of the values in the column.</p>
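
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, the two imputation styles mentioned above look like the following plain-Python sketch. It only illustrates the idea and does not use teradataml; the helper names and the sample age values are hypothetical.</p>

```python
from statistics import median

# Plain-Python illustration of literal and median imputation (not a teradataml call).
def impute_literal(values, literal):
    """Replace missing entries with a fixed literal."""
    return [literal if v is None else v for v in values]

def impute_median(values):
    """Replace missing entries with the median of the non-missing entries."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

occupations = ["Laborers", None, "Core staff", None]
print(impute_literal(occupations, "not provided"))

ages = [52, None, 46, 37]  # hypothetical values with one gap
print(impute_median(ages))
```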

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For model creation we usually prefer numerical inputs over character inputs. Let us check how many distinct values each character column has so that we can encode them as numbers.</p>

In [17]:
count_column = c1.DistinctValue.count(distinct=True)

df=c1.groupby("ColumnName").assign(count_=count_column)
df.sort('count_')

ColumnName,count_
FLAG_OWN_CAR,1
CODE_GENDER,2
NAME_CONTRACT_TYPE,2
NAME_FAMILY_STATUS,5
OCCUPATION_TYPE,18


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are not using the FLAG_OWN_CAR column in model creation. For the other columns we can use the <b>OneHotEncodingFit</b> and <b>OrdinalEncodingFit</b> functions, with their transform counterparts, to convert character categories to numbers. To do that, we first need to check the exact values present in the columns.</p>

In [18]:
c1[c1['ColumnName'].isin(['CODE_GENDER','NAME_CONTRACT_TYPE'])]

ColumnName,DistinctValue,DistinctValueCount
NAME_CONTRACT_TYPE,Cash loans,45901
NAME_CONTRACT_TYPE,Revolving loans,4099
CODE_GENDER,M,19057
CODE_GENDER,F,30943


<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 3. Feature Engineering Functions  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>OneHotEncodingFit </b>outputs a table of attributes and categorical values to input to <b>OneHotEncodingTransform </b> which encodes them as one-hot numeric vectors.</p>

In [19]:
# create fit object to encode "gender" and "contract type" columns
hot_fit = OneHotEncodingFit(data=tdf_cc,
                                is_input_dense=True,
                                target_column=['CODE_GENDER','NAME_CONTRACT_TYPE'],
                                category_counts=[2,2],
                                approach="auto",
                                other_column="other")
 
# Print the result DataFrame.
hot_fit.result

CODE_GENDER,CODE_GENDER_0,CODE_GENDER_1,CODE_GENDER_other,NAME_CONTRACT_TYPE,NAME_CONTRACT_TYPE_0,NAME_CONTRACT_TYPE_1,NAME_CONTRACT_TYPE_other
,,,,,Cash loans,Revolving loans,
,F,M,,,,,


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output above shows what the fit table looks like.</p>
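
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of what the transform step produces from such a fit table, here is a plain-Python sketch of one-hot encoding. It is not the in-database implementation; the helper name <code>one_hot</code> is hypothetical, and the suffix assignment (F as _0, M as _1) simply follows the fit table shown above.</p>

```python
# Plain-Python illustration of one-hot encoding (not a teradataml call).
def one_hot(column, value, categories, other_label="other"):
    """Return the indicator columns for a single value of one categorical column."""
    row = {f"{column}_{i}": int(value == category) for i, category in enumerate(categories)}
    # Values outside the fitted categories are flagged in the "other" column.
    row[f"{column}_{other_label}"] = int(value not in categories)
    return row

gender_categories = ["F", "M"]  # order taken from the fit table above
encoded = one_hot("CODE_GENDER", "M", gender_categories)
print(encoded)
```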

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For categorical columns which have many values we can use <b>OrdinalEncoding</b> instead</p>

In [20]:
ordinal_fit = OrdinalEncodingFit(target_column=['NAME_FAMILY_STATUS','OCCUPATION_TYPE'],
                                 data=tdf_cc,
                                 default_value=-1
                                )

ordinal_fit.result

TD_ColumnName_ORDFIT,TD_Category_ORDFIT,TD_Value_ORDFIT,TD_Index_ORDFIT,NAME_FAMILY_STATUS,OCCUPATION_TYPE
OCCUPATION_TYPE,Cooking staff,2,1,,
OCCUPATION_TYPE,Drivers,4,1,,
NAME_FAMILY_STATUS,Civil marriage,0,0,,
NAME_FAMILY_STATUS,Married,1,0,,
NAME_FAMILY_STATUS,Single / not married,3,0,,


In [21]:
ordinal_fit.result.tdtypes

COLUMN NAME,TYPE
TD_ColumnName_ORDFIT,"VARCHAR(length=128, charset='UNICODE')"
TD_Category_ORDFIT,"VARCHAR(length=128, charset='UNICODE')"
TD_Value_ORDFIT,INTEGER()
TD_Index_ORDFIT,SMALLINT()
NAME_FAMILY_STATUS,"VARCHAR(length=50, charset='UNICODE')"
OCCUPATION_TYPE,"VARCHAR(length=50, charset='UNICODE')"


<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>BinCodeFit and BinCodeTransform </b>bin-codes the
specified input table columns. Bin-coding is typically used to convert numeric data to categorical data by binning the numeric data into multiple numeric bins (intervals).</p> 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For variable width bins, we need to provide the bin table to the function. Let's create the table and use that in the BinCodeFit function</p>

In [22]:
%%capture
query1 = '''
CREATE MULTISET TABLE DEMO_User.FitInputTable 
     (
      ColumnName varchar(20), 
      MinValue integer, 
      MaxValue integer, 
      Label varchar(20)
  )
NO PRIMARY INDEX;
'''

query2 = '''
insert into FitInputTable values('age', 0, 18, '1-Children')
;insert into FitInputTable values('age', 19, 25, '2-Young Adult')    
;insert into FitInputTable values('age', 26, 45, '3-Middle Adult')
;insert into FitInputTable values('age', 46, 60, '4-Old Adult')   
;insert into FitInputTable values('age', 61 ,120, '5-Senior Citizen')
;
'''

execute_sql(query1)
execute_sql(query2)

In [23]:
tdf_bin_data=DataFrame(in_schema("DEMO_User","FitInputTable"))
tdf_bin_data.head()

ColumnName,MinValue,MaxValue,Label
age,46,60,4-Old Adult
age,0,18,1-Children
age,19,25,2-Young Adult
age,61,120,5-Senior Citizen
age,26,45,3-Middle Adult


In [24]:
bin_code = BincodeFit(data=tdf_cc,
                      fit_data=tdf_bin_data,
                      fit_data_order_column=['MinValue', 'MaxValue'],
                      target_columns='AGE',
                      minvalue_column='MinValue',
                      maxvalue_column='MaxValue',
                      label_column='Label',
                      method_type='Variable-Width'
                      )
 
# Print the result.
bin_code.output

TD_ColumnName_BINFIT,TD_MinValue_BINFIT,TD_MaxValue_BINFIT,TD_Label_BINFIT,TD_Bins_BINFIT,TD_IndexValue_BINFIT,TD_MaxLenLabel_BINFIT,AGE
AGE,46.0,60.0,4-Old Adult,5,0,16,
AGE,0.0,18.0,1-Children,5,0,16,
AGE,19.0,25.0,2-Young Adult,5,0,16,
AGE,61.0,120.0,5-Senior Citizen,5,0,16,
AGE,26.0,45.0,3-Middle Adult,5,0,16,


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit table is shown above.</p>
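
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the variable-width lookup concrete, the following plain-Python sketch assigns an age to a bin using the same ranges and labels as the bin table created earlier. It is only an illustration; the helper name <code>bin_label</code> is hypothetical, and treating both bounds as inclusive is an assumption.</p>

```python
# Plain-Python illustration of variable-width bin coding (not a teradataml call).
AGE_BINS = [
    (0, 18, "1-Children"),
    (19, 25, "2-Young Adult"),
    (26, 45, "3-Middle Adult"),
    (46, 60, "4-Old Adult"),
    (61, 120, "5-Senior Citizen"),
]

def bin_label(age, bins=AGE_BINS):
    """Return the label of the first bin containing age, or None if no bin matches."""
    for low, high, label in bins:
        if low <= age <= high:  # inclusive bounds are an assumption
            return label
    return None

print(bin_label(52))  # 4-Old Adult
print(bin_label(25))  # 2-Young Adult
```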

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>ScaleFit and ScaleTransform </b>scale the specified input
table columns, that is, they apply the chosen scaling method (for example one based on the standard deviation, the mean or the range) to the input columns.</p> 

In [25]:
scale_fit = ScaleFit(data=tdf_cc,
                       target_columns="AMT_INCOME_TOTAL",
                       scale_method="RANGE",
                       miss_value="KEEP",
                       global_scale=False)
scale_fit.output

TD_STATTYPE_SCLFIT,AMT_INCOME_TOTAL
sum,8370178947.3125
,0.0
count,50000.0
max,117000000.0
min,25650.0


<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b> ColumnTransformer  </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ColumnTransformer function transforms the entire dataset in a single operation. You only need
to provide the FIT tables to the function, and the function runs all the transformations that you require in a
single operation. Running all the fit table transformations together in one go gives approx. 30% performance improvement over running each transformation sequentially.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us pass all the fit tables we have created to ColumnTransformer and transform the dataset.</p>

In [26]:
out1 = ColumnTransformer(input_data=tdf_cc,
                                          simpleimpute_fit_data=impute_fit_cat_output.output,
                                          bincode_fit_data=bin_code.output,
                                          scale_fit_data=scale_fit.output,
                                          onehotencoding_fit_data=hot_fit.result,
                                        )
# Apply the ordinal encoding in a second pass, on the output of the first transform.
out2 = ColumnTransformer(input_data=out1.result,
                                         ordinalencoding_fit_data=ordinal_fit.result,
                                        )
tdf = out2.result   
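
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea of applying several fitted transformations in one pass can be pictured with the plain-Python sketch below. It is a conceptual illustration only and says nothing about how ColumnTransformer is implemented inside Vantage; the helper name <code>transform_rows</code> and the lambda transforms are hypothetical stand-ins for the fit tables.</p>

```python
# Plain-Python illustration only; ColumnTransformer itself runs inside Vantage.
INCOME_MIN, INCOME_MAX = 25650.0, 117000000.0  # min/max from the ScaleFit output

# Hypothetical per-column transforms standing in for the fit tables.
transforms = {
    "OCCUPATION_TYPE": lambda v: "not provided" if v is None else v,
    "AMT_INCOME_TOTAL": lambda v: (v - INCOME_MIN) / (INCOME_MAX - INCOME_MIN),
}

def transform_rows(rows, transforms):
    """Apply every column transform to each row during a single scan of the data."""
    transformed = []
    for row in rows:
        new_row = dict(row)  # leave the input row untouched
        for column, func in transforms.items():
            new_row[column] = func(new_row[column])
        transformed.append(new_row)
    return transformed

rows = [{"SK_ID_CURR": 100011, "OCCUPATION_TYPE": None, "AMT_INCOME_TOTAL": 112500.0}]
result = transform_rows(rows, transforms)
print(result[0])
```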


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can drop the columns from the dataframe and create a table in the database to use it further in the model.</p>

In [27]:
# Derive AGE_GROUP from the bin label, then drop the extra columns and rename the remaining ones
obj = StrApply(data=tdf,
                   target_columns='AGE',
                   output_columns='AGE_GROUP',
                   string_operation='SUBSTRING',
                   in_place=False,
                   string_length=1,
                   accumulate=[':']
                )
t1 = obj.result

transformed_df = t1.assign(drop_columns=True
                  ,SK_ID_CURR=t1.SK_ID_CURR  
                  ,TARGET=t1.TARGET 
                  ,CNT_CHILDREN=t1.CNT_CHILDREN
                  ,AMT_INCOME_TOTAL=t1.AMT_INCOME_TOTAL
                  ,NAME_FAMILY_STATUS=t1.NAME_FAMILY_STATUS
                  ,REGION_POPULATION_RELATIVE=t1.REGION_POPULATION_RELATIVE
                  ,AGE_GROUP=t1.AGE_GROUP      
                  ,FLAG_MOBIL=t1.FLAG_MOBIL     
                  ,FLAG_EMP_PHONE=t1.FLAG_EMP_PHONE     
                  ,CNT_FAM_MEMBERS=t1.CNT_FAM_MEMBERS        
                  ,OCCUPATION_TYPE=t1.OCCUPATION_TYPE    
                  ,MALE=t1.CODE_GENDER_1         
                  ,FEMALE=t1.CODE_GENDER_0         
                  ,REVOLVING_LOANS=t1.NAME_CONTRACT_TYPE_1     
                  ,CASH_LOANS=t1.NAME_CONTRACT_TYPE_0            
              ) 

transformed_df

SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE,FLAG_MOBIL,FLAG_EMP_PHONE,CNT_FAM_MEMBERS,OCCUPATION_TYPE,AGE_GROUP,CASH_LOANS,FEMALE,MALE,REVOLVING_LOANS
357714,1,2,0.001319520048626,1,0.0357920005917549,1,0,4,-1,3,1,0,1,0
119006,0,2,0.001319520048626,2,0.0101469997316598,1,1,3,-1,3,1,0,1,0
143600,1,0,0.000550120603363,0,0.0228000003844499,1,1,2,1,3,1,1,0,0
126899,0,2,0.001319520048626,1,0.0251640006899833,1,1,4,14,3,1,1,0,0
295785,1,1,0.0006270605478893,1,0.0096570001915097,1,1,3,3,3,1,1,0,0


In [28]:
#copy the dataframe to table 
transformed_df.to_sql("transformed_data", if_exists="replace")

In [29]:
tdf = DataFrame(in_schema("DEMO_User","Transformed_Data"))
tdf.head()

SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE,FLAG_MOBIL,FLAG_EMP_PHONE,CNT_FAM_MEMBERS,OCCUPATION_TYPE,AGE_GROUP,CASH_LOANS,FEMALE,MALE,REVOLVING_LOANS
100004,0,0,0.0003577707420472,3,0.0100320000201463,1,1,1,8,4,0,0,1,1
100007,0,0,0.000819410409205,3,0.0286630000919103,1,1,1,3,4,1,0,1,0
100008,0,0,0.0006270605478893,1,0.0357920005917549,1,1,2,8,4,1,0,1,0
100009,0,1,0.0012425801040997,1,0.0357920005917549,1,1,3,0,3,1,1,0,0
100011,0,0,0.0007424704646787,1,0.0186340007930994,1,0,2,-1,4,1,1,0,0
100012,0,0,0.0009348203259945,3,0.0196889992803335,1,1,1,8,3,0,0,1,1
100010,0,0,0.002858318939152,1,0.0031220000237226,1,1,2,10,4,1,0,1,0
100006,0,0,0.0009348203259945,0,0.0080190002918243,1,1,2,8,4,1,1,0,0
100003,0,0,0.002088919493889,1,0.0035409999545663,1,1,2,3,3,1,1,0,0
100002,1,0,0.0015118699099417,3,0.018800999969244,1,1,1,8,2,1,0,1,0


In [30]:
tdf.shape

(50000, 15)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have shown how to use ClearScape in-database functions to prepare the data. You now have a cleansed and processed data set that can be used as input for data science model creation.
</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Cleanup</b> </p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Worktables</b> </p>

In [31]:
db_drop_table(table_name="Transformed_Data")

True

In [32]:
db_drop_table(table_name="FitInputTable")

True

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [33]:
%run -i ../run_procedure.py "call remove_data('DEMO_CreditCard');"        # Takes 5 seconds

Removed objects related to DEMO_CreditCard. That ran for 0:00:01.44


In [34]:
remove_context()

True

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'> <b> 5. Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this notebook we have seen some of Teradata Vantage ClearScape's new in-database functions for data cleansing, data exploration and feature engineering. Many of these functions can be applied in one go using the ColumnTransformer function, which is approx. 30% faster than serial processing.</p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>Teradata Analytic Function Reference:
        <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview'>
        https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview</a></li>
  
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>