<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Data Preparation and Discovery
   <br>
       Using the teradataml Python Package
   <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Introduction</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This is a demonstration of the teradataml package that is designed for data management, exploration, and execution of analytic functions.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The current version of the teradataml package includes <b>over 100 functions</b>, organized into these functional areas:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Utility and database management functions</li>
    <li>Data exploration and preparation functions</li>
    <li>Analytic functions across Vantage</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>These functions leverage the full power and scale inside Vantage without:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Costly, slow export of data out of the DBMS</li>
    <li>Being limited by client platform resources</li>
    <li>Having to write complex SQL</li>
</ul>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Contents</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Configuring the Environment</li>
    <li>Connect to Vantage</li>
    <li>Create and Load Tables</li>
    <li>Data Discovery</li>
    <li>Working with Data at Scale</li>
    <li>Advanced Data Preparation</li>
    <li>Visualizations</li>
    <li>Cleanup</li>
</ol>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Configuring the Environment</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and paths (if required).</p>

In [None]:
import json
import getpass
import os
import warnings
#Suppress Warnings
warnings.filterwarnings('ignore')

import pandas as pd

from teradataml import *

from sqlalchemy import func
display.max_rows=5

import seaborn as sns
%matplotlib inline

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>2. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted for your password. Enter it, press the Enter key, then use the down arrow to move to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=DataPrepAndDiscovery.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Create and Load Tables</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.1  Create Demo Transaction data - simulated funds transfers.  Use FastLoad to create and import data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The FastLoad protocol is best suited to row counts over 100K; it is shown here as an illustration. These teradataml functions have many parameters to control their behavior. The if_exists parameter is especially useful: with it we don't have to explicitly drop the table before loading it, or we can append to the existing table instead. We can also use copy_to_sql for smaller row counts and more flexibility.</p>
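<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the if_exists options concrete, here is a minimal local sketch of the 'fail', 'replace', and 'append' semantics. It uses an in-memory SQLite database as a stand-in for Vantage, and the helper <code>load_rows</code> and the <code>demo</code> table are hypothetical names invented for this illustration; it is not how teradataml implements the parameter.</p>

```python
import sqlite3

def load_rows(conn, table, rows, if_exists='fail'):
    """Hypothetical helper mimicking if_exists semantics for a two-column table.

    The table name is interpolated directly, so it must be trusted input.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone() is not None
    if exists and if_exists == 'fail':
        raise ValueError(f'table {table} already exists')
    if exists and if_exists == 'replace':
        # Drop the old table so it is recreated empty below
        conn.execute(f'DROP TABLE {table}')
        exists = False
    if not exists:
        conn.execute(f'CREATE TABLE {table} (txn_id INTEGER PRIMARY KEY, amount REAL)')
    # For 'append' (or a freshly created table) simply add the rows
    conn.executemany(f'INSERT INTO {table} VALUES (?, ?)', rows)

conn = sqlite3.connect(':memory:')
load_rows(conn, 'demo', [(1, 10.0), (2, 20.0)])            # table is created
load_rows(conn, 'demo', [(3, 30.0)], if_exists='append')   # rows are added
after_append = conn.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
load_rows(conn, 'demo', [(1, 99.0)], if_exists='replace')  # table is rebuilt
after_replace = conn.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
print(after_append, after_replace)
```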

In [None]:
# Read the CSV data into a local pandas dataframe
ip_data = pd.read_csv('data/Transactions_60K.csv')

# Add a sequential txn_id column to serve as the primary index
ip_data['txn_id'] = range(1, len(ip_data) + 1)

fastload(
    df = ip_data,
    table_name = 'ip_data', 
    primary_index = 'txn_id',
    if_exists = 'replace',
    open_sessions = 2
)

In [None]:
DataFrame('ip_data').shape

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.2  Create Simulated Customer Data - create the table using SQL, then load from a tdf file</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the example above, the table was created automatically by the fastload call. The load functions let us define data types, encoding, and other parameters, but we can also create the table with SQL when we want more control. In the example below, we need the ST_GEOMETRY data type, which has no native Python equivalent, so we define the table with Teradata SQL.</p>

In [None]:
qry = '''
CREATE MULTISET TABLE CUSTOMER, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
(
    CUSTOMER_ID DECIMAL(18,0) NOT NULL,
    F_NAME VARCHAR(30),
    L_NAME VARCHAR(30),
    VALIDITY VARCHAR(60),
    CUST_ZIP VARCHAR(5),
    CUST_LOCATION ST_GEOMETRY,
    ETHNICITY VARCHAR(20),
    GENDER CHAR(1),
    CHURN_FLAG VARCHAR(1)
)
PRIMARY INDEX(CUSTOMER_ID);
'''

try:
    execute_sql(qry)
except Exception:
    # Creation failed, most likely because the table already exists: drop it and retry.
    execute_sql('DROP TABLE CUSTOMER;')
    execute_sql(qry)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Load the customer data - we're using the above table and reading the tdf file directly. Note that we have flexibility on different delimiters - in this case, it is a tab.</p>

In [None]:
copy_to_sql(
    df = pd.read_csv('data/CUSTOMER.tdf', sep = '\t'),
    table_name = 'CUSTOMER'
)

In [None]:
DataFrame('CUSTOMER').shape

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.3 Create Simulated Customer Comment Table. Use copy_to_sql to create the table and load the data</b></p>

In [None]:
copy_to_sql(
    df = pd.read_csv('data/CUST_COMMENT.csv'),
    table_name = 'CUST_COMMENT',
    if_exists = 'replace'
)

In [None]:
DataFrame('CUST_COMMENT').shape

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.4 Additional Simulated Data - Server Locations. Use SQL to handle the ST_GEOMETRY data type </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This is another case where Python has no native equivalent of the ST_GEOMETRY data type, so we create the table with Teradata SQL.</p>

In [None]:
qry = '''
CREATE MULTISET TABLE SERVER, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
(
    SERVER_ID VARCHAR(5) NOT NULL,
    SERVER_ZIP VARCHAR(5),
    SERVER_LOCATION ST_GEOMETRY
)
PRIMARY INDEX(SERVER_ID);
'''

try:
    execute_sql(qry)
except Exception:
    # Creation failed, most likely because the table already exists: drop it and retry.
    execute_sql('DROP TABLE SERVER;')
    execute_sql(qry)

In [None]:
#load the data - read the csv file using pandas read_csv
srvr = pd.read_csv('data/SERVER.csv')
srvr.rename(columns = {'SERVER ZIP':'SERVER_ZIP','SERVER_LAT':'SERVER_LOCATION'}, inplace = True)

copy_to_sql(df = srvr, table_name = 'SERVER')

In [None]:
DataFrame('SERVER').shape

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>4. Data Discovery</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Look at table statistics, sample data, simple lookups</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One of the most powerful features of the teradataml functions is that they push processing down to the Teradata system, allowing users to perform analysis without pulling all the data back to the client.
<br>
The following cell creates a pointer (virtual DataFrame) to the <b>ip_data</b> table on the server. We retrieve the table's size and a small sample of five rows into this Python environment to inspect the data.</p>

In [None]:
# Get a teradata DataFrame - this creates a local reference to the large table on the server.
tdf_ip_data = DataFrame("ip_data")

# Check the data - size and sample rows without returning all the data
print(tdf_ip_data.shape)

# Return a small set of the data back to a traditional Pandas DF for full-featured formatting.
tdf_ip_data.to_pandas(num_rows = 5).reset_index()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Please scroll down to the end of the notebook for detailed column descriptions of the above dataset.</p>

In [None]:
# Check for null values.
tdf_ip_data.info(null_counts = True)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above result shows that there are no nulls in the dataset.
    <br>
    <br>
Generate statistics using the teradataml DataFrame describe method. It is similar to the pandas method, but it runs on the server, so we don't need to retrieve all the data. The next cell shows the column-wise statistics.</p>

In [None]:
tdf_ip_data.describe()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Apply a set of expressions to the virtual DataFrame using loc (a pandas-style set processing technique) to select all fraudulent rows. The logic here filters fraudulent transactions of type 'TRANSFER' as a view on the server, without moving the data. Calling head(2) retrieves only two rows from the server.
<br>
<br>
The next cell shows a sample of 2 Fraud transactions of type TRANSFER.</p>

In [None]:
tdf_ip_data.loc[(tdf_ip_data.isFraud == 1) & (tdf_ip_data.type == 'TRANSFER')].head(2)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following cell shows a sample of 2 Fraud transactions restricted to 3 columns (amount, isFraud, type).</p>

In [None]:
#filter the dataframe, then only retrieve two rows of results
tdf_ip_data.loc[tdf_ip_data.isFraud == 1].filter(items = ['amount', 'isFraud', 'type']).head(2)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>5. Working with Data at Scale</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Act on our data sets without having to return all the data and leverage the computing power of the Teradata Vantage cluster.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Aggregations</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can use these "fluent" methods to keep the code as brief and expressive as possible. The following cell counts each instance of fraud grouped by transaction type. Note that the only data that moves out of the database is the final count() aggregation.</p>
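<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea that only the aggregate leaves the database can be sketched locally. The example below uses an in-memory SQLite database as a stand-in for Vantage, with a handful of made-up rows; the SQL is only an approximation of the kind of query a filter plus group-by count translates to, not the exact statement teradataml generates.</p>

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ip_data (txn_id INTEGER, type TEXT, amount REAL, isFraud INTEGER)')

# Made-up sample rows for illustration only
rows = [
    (1, 'TRANSFER', 181.0, 1),
    (2, 'CASH_OUT', 181.0, 1),
    (3, 'PAYMENT', 9839.64, 0),
    (4, 'TRANSFER', 2806.0, 1),
    (5, 'DEBIT', 5337.77, 0),
]
conn.executemany('INSERT INTO ip_data VALUES (?, ?, ?, ?)', rows)

# Filter and aggregate inside the database; only one row per type comes back
fraud_counts = conn.execute(
    '''
    SELECT type, COUNT(*) AS n
    FROM ip_data
    WHERE isFraud = 1
    GROUP BY type
    ORDER BY type
    '''
).fetchall()
print(fraud_counts)
conn.close()
```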

In [None]:
tdf_ip_data.loc[tdf_ip_data.isFraud == 1].filter(items = ['amount', 'isFraud', 'type']).groupby('type').count()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next cell gives us min and max transaction amounts by transaction type. We can use multiple aggregates in the agg() function call.</p>

In [None]:
tdf_ip_data.loc[tdf_ip_data.isFraud == 1] \
    .filter(items = ['amount', 'isFraud', 'type']) \
    .groupby('type') \
    .agg({'amount' : ['min', 'max']})

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 Simple Transformations</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create new "Virtual DataFrames" that result from dropping columns or adding new ones via simple expressions. The following cell creates a new Virtual DataFrame by dropping a few columns.</p>

In [None]:
clean_data = tdf_ip_data.loc[tdf_ip_data.isFraud == 1].drop(['nameDest', 'nameOrig', 'isFlaggedFraud'], axis = 1)
clean_data

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next cell assigns a new column which is the difference between newbalanceDest and amount of transaction.</p>

In [None]:
clean_data = clean_data.assign(diff = clean_data['newbalanceDest'] - clean_data['amount'])
clean_data

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next cell creates a new column that holds a binary flag derived from the transaction type.</p>

In [None]:
clean_data = clean_data.assign(btype = clean_data['type'].str.contains('CASH_OUT'))
clean_data

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, a value of 1 means CASH_OUT and 0 means TRANSFER. This is a simple binary encoding, similar in spirit to Ordinal Encoding.</p>
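<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough local illustration of the same idea (not the teradataml call itself), the sketch below maps a few sample type strings to 1 or 0 with a plain substring test. The sample values and the helper name <code>contains_flag</code> are assumptions made for this example.</p>

```python
def contains_flag(value, pattern):
    """Return 1 if pattern occurs in value, 0 if not, and None for missing values."""
    if value is None:
        return None
    return int(pattern in value)

# Hypothetical sample of transaction types, including one missing value
sample_types = ['CASH_OUT', 'TRANSFER', 'CASH_OUT', None]
btype = [contains_flag(t, 'CASH_OUT') for t in sample_types]
print(btype)
```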

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3 Joins</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Join DataFrames using pandas-style join methods. These are teradataml DataFrame methods and run completely in the database.</p>

In [None]:
tdf_customer = DataFrame('CUSTOMER')
tdf_customer

In [None]:
tdf_cust_comment = DataFrame('CUST_COMMENT')
tdf_cust_comment

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next cell performs a join between the CUSTOMER and CUST_COMMENT tables on the CUSTOMER_ID column.</p>

In [None]:
# Do an inner join and drop the fields we don't need.
tdf_comment_full = tdf_cust_comment.join(
    other = tdf_customer,
    on = ['CUSTOMER_ID = CUSTOMER_ID'],
    how = 'inner',
    lprefix = 'cID_',
    rprefix = 'cOM_'
)
tdf_comment_full.drop(['COMMENT_ID', 'cOM__CUSTOMER_ID', 'CHANNEL_ID',
                       'GENDER', 'CHANNEL_TYPE', 'ETHNICITY', 'VALIDITY'], axis = 1)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above output shows the join of two tables, CUSTOMER and CUST_COMMENT. Note that tdf_customer and tdf_cust_comment are just teradataml DataFrames that point to tables in the database. The data is not moved in this process.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>6. Advanced Data Preparation</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The teradataml Python package exposes many powerful SQL data transformation functions to the user. We can apply these functions to teradataml DataFrames to operate on data at scale in the database.
See the documentation for a complete list of functions (including aggregate, arithmetic, Bit/Byte, Date and Time, Hash, Regular Expression, and String Functions).</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.1 Aggregate Functions:</b></p>

In [None]:
#This import statement is also above, but rewritten here for emphasis. These functions are applied to the
#Teradata dataframe via the SQLAlchemy func class

from sqlalchemy import func

#reuse our datasets from above
clean_data

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation. It is a number between -1 and 1 that measures the strength and direction of the relationship between two variables. In this example, we ask of our fraudulent data set: what is the correlation between the original balance and the transfer amount?</p>
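<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For reference, the coefficient can be computed by hand on a few values. The sketch below implements the textbook formula in plain Python; the function name <code>pearson_r</code> and the sample balances and amounts are made up for illustration and are not taken from the demo data.</p>

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical original balances and transfer amounts that rise together
balances = [100.0, 250.0, 400.0, 900.0]
amounts = [90.0, 260.0, 380.0, 910.0]
r = pearson_r(balances, amounts)
print(round(r, 4))
```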

In [None]:
corr_func = func.corr(clean_data['oldbalanceOrg'].expression, clean_data['amount'].expression)


#Setting drop_columns = True here
df_corr = clean_data.assign(drop_columns = True, corr_ = corr_func)

df_corr

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, the Pearson correlation coefficient is high, meaning that the transfer amount tends to be high when the original balance is high. The two variables are positively correlated.
<br>
<br>
Kurtosis: Let's see how the tails of the distribution deviate from a normal distribution for our complete data set, using transfer amounts grouped by transaction type. A normal distribution has an (excess) kurtosis of 0; a negative value indicates fewer outliers, and a positive value indicates more extreme outliers.</p>
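<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The sign convention can be checked on toy data. The sketch below computes excess kurtosis from population moments in plain Python; the function name and both samples are invented for illustration. The database function may apply a sample-size correction, so its values can differ from this simple definition, especially on small samples.</p>

```python
def excess_kurtosis(values):
    """Fourth standardized moment minus 3, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

# Evenly spread values: light tails, so the result is negative
flat = [1.0, 2.0, 3.0, 4.0, 5.0]

# Mostly identical values plus two extreme ones: heavy tails, so the result is positive
heavy = [3.0] * 20 + [-50.0, 56.0]

print(excess_kurtosis(flat) < 0, excess_kurtosis(heavy) > 0)
```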

In [None]:
kurtosis_func = func.kurtosis(tdf_ip_data['amount'].expression)

#Can also set drop_columns positionally
df_kurtosis = tdf_ip_data.groupby('type').assign(True, kurtosis_xfer_amt = kurtosis_func)

df_kurtosis

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.2 Arithmetic Functions</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Natural Log: Let's add a column containing the natural log of the transfer amount. As we saw above when calling the describe() method, the amount column has a min of 0.79 and a max of 36946551.76. We can use the natural log to create a tighter range of values for possible use in the analysis.</p>
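<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A quick local check with Python's math module shows how much the log transform compresses that range. It uses only the min and max quoted above; note that the natural log is defined only for positive amounts.</p>

```python
import math

# Min and max of the amount column, as quoted from describe()
amount_min = 0.79
amount_max = 36946551.76

ln_min = math.log(amount_min)  # natural log; slightly below zero because the amount is < 1
ln_max = math.log(amount_max)

print(f"raw range: {amount_max - amount_min:,.2f}")
print(f"log range: {ln_max - ln_min:.2f}")
```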

In [None]:
tdf_new = tdf_ip_data.assign(ln_amount = func.ln(tdf_ip_data['amount'].expression))

tdf_new.filter(items = ['amount', 'ln_amount'])

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.3 String Functions</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Convert the transaction type to lower case.</p>

In [None]:
tdf_lower = tdf_ip_data.assign(False, type_lower = func.lower(tdf_ip_data['type'].expression))
tdf_lower.filter(items = ['type', 'type_lower'])

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.4 Regular Expression Functions</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Return a substring based on a regular expression. In our demo data, "nameDest" has a character code as the first character of the account name (for example, M1057061069).</p>
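<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The same pattern can be tried locally with Python's re module before running it in the database. In this sketch the helper name <code>account_indicator</code> is hypothetical, and regular expression dialects can differ between Python and the database, so treat it as a quick check of the pattern rather than a guarantee of identical behavior.</p>

```python
import re

# One uppercase letter anchored at the start of the string
ACCOUNT_PREFIX = re.compile(r'^[A-Z]{1}')

def account_indicator(name):
    """Return the leading uppercase letter of an account name, or None if absent."""
    match = ACCOUNT_PREFIX.search(name)
    return match.group(0) if match else None

print(account_indicator('M1057061069'))
```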

In [None]:
regexp_func = func.regexp_substr(tdf_ip_data['nameDest'].expression, '^[A-Z]{1}')

tdf_regex = tdf_ip_data.assign(False, acct_ind = regexp_func)

tdf_regex.filter(items = ['nameDest', 'acct_ind'])

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>7. Visualizations</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>7.1 - Example - Geospatial query to return plot-able data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code calculates the Spherical Distance between the customers and servers.</p>
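<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To give a feel for what a spherical distance is, the sketch below uses the haversine formula in plain Python. This is a swapped-in approximation for illustration, not the database's implementation: the Earth radius constant, the function name, and the two coordinates (roughly San Diego and New York) are assumptions and do not come from the demo data. The result is in metres, so it is divided by 1000 to get kilometres, as in the query.</p>

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres

def spherical_distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical customer and server locations (approx. San Diego and New York)
km = spherical_distance_m(32.7157, -117.1611, 40.7128, -74.0060) / 1000
print(round(km))
```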

In [None]:
qry = '''
SELECT
    C.CUSTOMER_ID,
    S.SERVER_ID,
    CAST(C.CUST_LOCATION.ST_SphericalDistance(S.SERVER_LOCATION)/1000 AS DECIMAL(10,0)) AS KM_DISTANCE
FROM
    CUSTOMER C, SERVER S
WHERE
    S.SERVER_ZIP = C.CUST_ZIP;
'''

tdf_distance = DataFrame.from_query(qry)

In [None]:
tdf_distance

In [None]:
#Sort by greatest distance away
tdf_distance.sort('KM_DISTANCE', ascending = False).head(5)

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>7.2 Use Pandas/seaborn to create visualizations inline</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code plots the distribution of distances between customers and servers. The majority of customers are within 4000 km of their server.</p>

In [None]:
warnings.simplefilter(action='ignore', category=FutureWarning)

sns.distplot(tdf_distance.to_pandas()['KM_DISTANCE'].astype(float), bins = 50);

In [None]:
#Do a bunch of work to filter, group, aggregate, retrieve, and format our chart
tdf_ip_data.drop(['step', 'isFraud', 'isFlaggedFraud'], axis = 1) \
    .groupby('type') \
    .sum() \
    .to_pandas() \
    .set_index('type') \
    .plot(kind = 'bar')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graph shows the sum of amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, and newbalanceDest grouped by transaction type.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>8. Cleanup</b>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Work Tables</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up the work tables to prevent errors the next time the notebook runs.</p>

In [None]:
tables = ['CUSTOMER', 'CUST_COMMENT','SERVER','ip_data']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore tables that were never created or have already been dropped.
        pass


In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Dataset:</b>

- `txn_id`: transaction id
- `step`: maps a unit of time in the real world. In this case, 1 step is 1 hour of time. There are 744 steps in total (a 31-day simulation).
- `type`: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER
- `amount`: amount of the transaction in local currency
- `nameOrig`: customer who started the transaction
- `oldbalanceOrg`: customer's balance before the transaction
- `newbalanceOrig`: customer's balance after the transaction
- `nameDest`: customer who is the recipient of the transaction
- `oldbalanceDest`: recipient's balance before the transaction
- `newbalanceDest`: recipient's balance after the transaction
- `isFraud`: identifies a fraudulent transaction (1) and non fraudulent (0)
- `isFlaggedFraud`: flags illegal attempts to transfer more than 200,000 in a single transaction

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>