<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       StringSimilarity Function in Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>The StringSimilarity function calculates the similarity between two strings using a specified comparison method. This notebook shows how to use the StringSimilarity function available in Vantage.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Initiate a connection to Vantage</b>

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
from teradataml import *

# Modify the following to match the specific client environment settings
display.max_rows = 5

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_StringSimilarity_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Run each of the following cells by pressing Shift + Enter.</p>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>1.2 Getting Data for This Demo</b></p>

<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available space. The following cell contains one statement for each mode, with one of them commented out. To switch modes, move the comment character to the other statement.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_Customer360_local');"        # Takes 30 seconds
#%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_Customer360_cloud');" 

<p style = 'font-size:16px;font-family:Arial'>The next step is optional: run it if you want to see the status of the databases and tables created and the space used.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>Create a "Virtual DataFrame" that points to each data set in Vantage, then check the shape of the DataFrame and inspect its columns.<br>In our example, two tables hold records for the same customers. Because the data comes from different systems, we need to validate which customer records refer to the same person and which do not. The two tables are:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>The equipment distributor's customer table</li>
    <li>The electronic monitor service's e-commerce customer list</li>
</ul>

In [None]:
tdf = DataFrame(in_schema("DEMO_Customer360","Equipment"))
print("Shape of the data: ", tdf.shape)
tdf

In [None]:
tdf2 = DataFrame(in_schema("DEMO_Customer360","Online"))
print("Shape of the data: ", tdf2.shape)
tdf2

<p style = 'font-size:16px;font-family:Arial'>
ClearScape Analytics provides a StringSimilarity function that calculates how close two key strings are. For detailed help, pass the function name to Python's built-in help function.</p>

In [None]:
help(StringSimilarity)

<p style = 'font-size:16px;font-family:Arial'>We will create a key string for each row of the equipment and online tables. The key string consists of the gender, first name, last name, and city, with spaces and every other character that is not a letter, digit, or colon removed. The function requires the strings to compare to be in a single dataset, so we first join the two datasets and then create the key strings used for the comparison. <br>The StringSimilarity function supports several comparison methods. We will use the Jaro similarity, which accounts for the number of matching characters and transpositions.</p>

In [None]:
tdf_join = tdf.join(other = tdf2,how = "cross", lprefix = "t1", rprefix = "t2")
tdf_join
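
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of what the cross join produces, the sketch below pairs every row of one small list with every row of another in plain Python. The sample names and cities are hypothetical and are not taken from the demo tables; the point is only that the result has one row per pair, so its size is the product of the two input sizes.</p>

```python
from itertools import product

# Hypothetical sample rows, not taken from the demo tables.
equipment_rows = [("Ann", "Lee", "Austin"), ("Bob", "Ray", "Boston")]
online_rows = [("Anne", "Lee", "Austin"), ("Rob", "Ray", "Boston"), ("Cy", "Poe", "Salem")]

# A cross join pairs every left row with every right row.
pairs = list(product(equipment_rows, online_rows))

print(len(pairs))  # number of candidate pairs to compare
```

<p style = 'font-size:16px;font-family:Arial'>Because the number of pairs grows multiplicatively, a cross join is practical for small tables like the ones in this demo but can become expensive for large ones.</p>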

In [None]:
from sqlalchemy import func

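# Build each key by concatenating gender, first name, last name, and city,
# then removing every character that is not a letter, digit, or colon.
# regexp_replace arguments: source, pattern, replacement,
# start position (1), and occurrence (0 = replace all occurrences).
equip_key = func.regexp_replace(
    func.concat(tdf_join['GENDER'].expression, tdf_join['FIRSTNAME'].expression,
                tdf_join['LASTNAME'].expression, tdf_join['t1_CITY'].expression),
    r'[^a-zA-Z\d:]', '', 1, 0)
online_key = func.regexp_replace(
    func.concat(tdf_join['SEX'].expression, tdf_join['FNAME'].expression,
                tdf_join['LNAME'].expression, tdf_join['t2_CITY'].expression),
    r'[^a-zA-Z\d:]', '', 1, 0)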
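
<p style = 'font-size:16px;font-family:Arial'>To see what the key expressions above do to a single record, here is a minimal local sketch that applies the same regular expression with Python's <code>re</code> module. The helper name <code>make_key</code> and the sample values are hypothetical; the in-database expressions remain the ones that actually run in this notebook.</p>

```python
import re

# Same pattern as in the notebook: anything that is not a letter, digit, or colon.
NON_KEY_CHARS = re.compile(r"[^a-zA-Z\d:]")

def make_key(*parts: str) -> str:
    """Concatenate the parts and strip every character matched by the pattern."""
    return NON_KEY_CHARS.sub("", "".join(parts))

# Hypothetical sample record: gender, first name, last name, city.
key = make_key("F", "Mary Ann", "O'Neil", "St. Louis")
print(key)  # FMaryAnnONeilStLouis
```

<p style = 'font-size:16px;font-family:Arial'>Stripping spaces and punctuation before comparing means that formatting differences between the two source systems do not lower the similarity score.</p>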

In [None]:
tdf_combined=tdf_join.assign(drop_columns = True, 
                        CUST_ID = tdf_join.CUST_ID,
                        LOYALTY_NUM = tdf_join.LOYALTY_NUM,
                        EMAIL=tdf_join.EMAIL,
                        EQUIPMENT_KEY = equip_key.cast(type_=VARCHAR(50)),
                        ONLINE_KEY = online_key.cast(type_=VARCHAR(50))
                       )

tdf_combined

In [None]:
obj = StringSimilarity(data = tdf_combined,
                       comparison_columns=['jaro (EQUIPMENT_KEY, ONLINE_KEY) AS jaro'],
                       case_sensitive = False,
                       accumulate = ['CUST_ID', 'LOYALTY_NUM', 'EQUIPMENT_KEY', 'ONLINE_KEY', 'EMAIL'])

# Print the result DataFrame.
obj.result
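
<p style = 'font-size:16px;font-family:Arial'>To make the jaro score more concrete, the following is a minimal pure-Python sketch of the textbook Jaro similarity. It is an illustration under that standard definition, not the Vantage implementation, and the in-database function may differ in details such as case handling. The function name <code>jaro_similarity</code> is our own.</p>

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return the textbook Jaro similarity of two strings, between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters count as a match only if their positions
    # are no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that line up in a different order,
    # counted as half the number of mismatched positions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


# Lowercasing both inputs mimics a case-insensitive comparison.
score = jaro_similarity("MARTHA".lower(), "MARHTA".lower())
print(round(score, 4))  # 0.9444
```

<p style = 'font-size:16px;font-family:Arial'>A score of 1.0 means the strings are identical and 0.0 means they share no matching characters, so keys that differ by a swapped or missing letter still score close to 1.0.</p>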

<p style = 'font-size:16px;font-family:Arial'>The following cell shows the pairs that the StringSimilarity function scores at 0.90 or higher, which we treat as matching customers.</p>

In [None]:
obj.result[obj.result.jaro >= .90]

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Cleanup</b>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call remove_data('DEMO_Customer360');"        # Takes 10 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>StringSimilarity function reference: <a href = 'https://docs.teradata.com/search/all?query=StringSimilarity&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>