<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Improving Customer Satisfaction Through Travel Insights.
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Banks with a wide network of branches and service centers often aim to enhance customer satisfaction by understanding factors that influence service experience. One key area of interest is the <b>relationship between customer travel distance and complaint behavior</b>. For example, customers who travel longer distances to reach their designated branch or service center may experience greater inconvenience, potentially leading to higher complaint rates. The objective is to analyze:<b> How far customers travel to access banking services, whether longer travel correlates with higher dissatisfaction, and if it is possible to predict which customers are likely to raise complaints</b>.</p>
<p style = 'font-size:20px;font-family:Arial'><b>Teradata Solution Approach:</b></p>
<p style = 'font-size:16px;font-family:Arial'>Using <b>Teradata ClearScape Analytics</b>, banks can address this challenge through a powerful, data-driven approach:
</p>
<ul style="font-size:16px;font-family:Arial"> 
    <li><b>Geo-spatial analytics</b> in Teradata can calculate the exact travel distance and time between customer locations and their preferred branches.</li>
    <li><b>Statistical and feature engineering capabilities</b> can uncover correlations between travel effort, demographics, and complaint frequency.</li>
    <li><b>In-database predictive modeling</b> enables the bank to predict which customers are at higher risk of dissatisfaction or complaints, allowing proactive engagement and service improvements.</li>
    <li><b>Natural Language Processing (NLP)</b> techniques, such as TF-IDF, embeddings, and semantic similarity, can analyze complaint text to identify common pain points and emerging issues.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial'>This integrated, in-database approach empowers banks to improve customer experience, optimize branch accessibility, and enhance operational decision-making without moving sensitive data outside the secure Teradata environment.</p>



<p style = 'font-size:18px;font-family:Arial'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Connect to Vantage and Data Loading</li>    
    <li>Travel Data Preparation</li>
    <li>Exploratory Data Analysis and Outlier Removal</li>
    <li>Feature Engineering</li>
    <li>Hypothesis Testing: Do Longer Trips Cause More Complaints?</li>
    <li>ADS (Analytical Data Set) Creation</li>
    <li>Balance and Sample the ADS</li>    
    <li>Model Training and Scoring</li>
    <li>Applying NLP: Complaint Analysis</li>
        <ul>
            <li>9.1 Embedding Complaint Texts</li>
            <li>9.2 Unsupervised Topic Modeling</li>
            <li>9.3 TF-IDF Analysis for Clustered Complaints</li>
            <li>9.4 Supervised Topic Embedding Comparison</li>
        </ul>
    <li>Cleanup</li>
</ol>

<hr style="height:2px;border:none;">
<p style = 'font-size:16px;font-family:Arial;'><b>Note:</b> Before running this notebook, please ensure that the <code>Initialization_and_Model_Load.ipynb</code> notebook has been executed.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>1. Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Select the <code>Teradata SQL</code> kernel, which runs SQL directly in the notebook.</p>
<p style = 'font-size:16px;font-family:Arial;'>Run each cell with Shift + Enter. You will be prompted for your password. Enter it, press the Enter key, and then use the down arrow to move to the next cell.</p>

In [None]:
%connect local, hidewarnings=True

<p style = 'font-size:16px;font-family:Arial'>Setup session for execution of notebook. </p>

In [None]:
Set query_band='DEMO=Improving_Customer_Satisfaction_Travel_Insights_SQL.ipynb;' update for session;

<p style = 'font-size:18px;font-family:Arial;'><b>Data Setup and Loading</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains both statements, one of which is commented out. Switch between the modes by moving the comment string.</p>
<p style = 'font-size:16px;font-family:Arial;'>Load relevant tables (complaints, trips_geo) from <code>DEMO_Cust_Travel</code>.</p>

In [None]:
--call get_data('DEMO_Cust_Travel_cloud');           -- Takes 30 seconds
call get_data('DEMO_Cust_Travel_local');                   -- Takes 1 minute

<p style = 'font-size:16px;font-family:Arial'>Optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
call space_report();          -- Takes 5 seconds

<p style = 'font-size:16px;font-family:Arial;'><b>Preview data samples.</b></p>
<p style = 'font-size:16px;font-family:Arial'>We've preloaded the tables required for analysis under Database Demo_Cust_Travel.</p>

In [None]:
help Database DEMO_Cust_Travel; 

In [None]:
Select * from DEMO_Cust_Travel.Trips_Geo sample 5;

In [None]:
Select * from DEMO_Cust_Travel.Complaints sample 5;

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Travel Data Preparation</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Create a new TRAVEL table by filtering and formatting trip data to include only valid trips associated with complaints. Ensure that all entries have non-null origin and destination locations and valid trip durations.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table TRAVEL;

In [None]:
-- Join Trips_Geo with Complaints on the person_id
-- Create the TRAVEL table

CREATE MULTISET TABLE TRAVEL AS(
    Select person_id as person_id
    ,trip_id
    , CAST(CAST(start_time AS TIMESTAMP(6) FORMAT 'YYYY-MM-DDBHH:MI:SS.S(6)') as DATE format 'DD-MM-YYYY') as start_time
    , CAST(CAST(end_time AS TIMESTAMP(6) FORMAT 'YYYY-MM-DDBHH:MI:SS.S(6)') as DATE format 'DD-MM-YYYY') as end_time
    ,duration_minutes
    ,start_location
    ,end_location   
FROM DEMO_Cust_Travel.Trips_Geo
    WHERE 1=1 
    AND person_id is in (Select person_id from DEMO_Cust_Travel.complaints)
    AND start_location is not null
    AND end_location is not null
    AND duration_minutes is not null
)WITH DATA PRIMARY INDEX (person_id,end_time)


In [None]:
Select TOP 5 * FROM TRAVEL;
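
<p style = 'font-size:16px;font-family:Arial;'>As an optional sanity check (a sketch that is not part of the original demo), the row counts below show how many trips survived the complaint and non-null filters. The labels in the <code>source_name</code> column and the alias <code>n_rows</code> are illustrative only.</p>

```sql
-- Sketch: compare the source row count with the rows kept in TRAVEL.
-- source_name and n_rows are illustrative aliases, not part of the demo schema.
SELECT CAST('Trips_Geo (source)' AS VARCHAR(30)) AS source_name
      ,COUNT(*) AS n_rows
FROM DEMO_Cust_Travel.Trips_Geo
UNION ALL
SELECT CAST('TRAVEL (filtered)' AS VARCHAR(30)) AS source_name
      ,COUNT(*) AS n_rows
FROM TRAVEL;
```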

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>3. Exploratory Data Analysis and Outlier Removal</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Analyze the distribution of trip durations with <code>TD_Histogram</code>.</p>

In [None]:
SELECT top 20 * FROM TD_Histogram(
    ON TRAVEL as InputTable
    USING
        TargetColumn('duration_minutes')
        MethodType('SCOTT')
) as dt;

<p style = 'font-size:16px;font-family:Arial;'>Detect and replace outliers using TD_OutlierFilterFit and TD_OutlierFilterTransform functions. After outlier treatment, compare the mean trip durations before and after the outlier removal to assess the impact of filtering.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table outlier_fit;

In [None]:
SELECT * FROM TD_OutlierFilterFit (
      ON TRAVEL AS InputTable
      OUT TABLE OutputTable (outlier_fit)
      USING
          TargetColumns ('duration_minutes')
          LowerPercentile (0.25)
          UpperPercentile (0.75)
          OutlierMethod ('Percentile')
          ReplacementValue ('median')
          PercentileMethod ('PercentileCont')
 ) AS dt;

In [None]:
select top 5 * from outlier_fit
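
<p style = 'font-size:16px;font-family:Arial;'>To see roughly what the fit step works from, the query below computes the 25th, 50th, and 75th percentiles of <code>duration_minutes</code> directly. This is a sketch under the assumption that, with the settings above, values outside the 25th to 75th percentile range are treated as outliers and replaced with the median. The fit table may store these thresholds in a different layout.</p>

```sql
-- Sketch: percentile thresholds that the 'Percentile' outlier method relies on.
-- p25, median_duration, and p75 are illustrative aliases.
SELECT
     PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS p25
    ,PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration_minutes) AS median_duration
    ,PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS p75
FROM TRAVEL;
```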

<p style = 'font-size:16px;font-family:Arial;'><b>Filter the data using the fit table.</b></p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table TRAVEL_Filtered;

In [None]:
create table TRAVEL_Filtered as (
SELECT * FROM TD_OutlierFilterTransform (
  ON TRAVEL AS InputTable PARTITION BY ANY
  ON outlier_fit AS FitTable DIMENSION
) AS dt) with data;

<p style = 'font-size:16px;font-family:Arial;'>Compare the mean of <code>duration_minutes</code> before and after outlier treatment.</p>

In [None]:
Select average(duration_minutes) from TRAVEL;

In [None]:
Select average(duration_minutes) from TRAVEL_Filtered

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>4. Feature Engineering</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Data Exploration and Feature Creation with TD_UnivariateStatistics.</p>


In [None]:
SELECT * FROM TD_UnivariateStatistics (
  ON TRAVEL AS InputTable
  USING
  TargetColumns ('duration_minutes')
  Stats ('MEAN', 'STD', 'MODE','PERCENTILES')
) AS dt;


<p style = 'font-size:16px;font-family:Arial;'>Compute Distance Travelled.</p>
<p style = 'font-size:16px;font-family:Arial;'>Compute the spherical (great-circle) distance between the start and end locations of each trip, converted from meters to kilometers.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table DISTANCE_TRAVELED;

In [None]:
CREATE TABLE DISTANCE_TRAVELED AS(
Select dt.* 
    -- How far does each customer travel? (distance in km)
    ,start_location.ST_SPHERICALDISTANCE(end_location)/1000.00 As Distance_In_km  
    
-- The trip_id column is not needed, so exclude it
FROM Antiselect (
  ON TRAVEL
  USING
      Exclude ('trip_id')
) AS dt 
)WITH DATA PRIMARY INDEX(person_id);

In [None]:
Select * from DISTANCE_TRAVELED sample 10;

In [None]:
Select avg(Distance_In_km) from DISTANCE_TRAVELED;


<p style = 'font-size:16px;font-family:Arial;'>Standard Scaling of Travel Distances.</p>

<p style = 'font-size:16px;font-family:Arial;'>Apply z-score scaling to standardize the distance values using the val.td_analyze function. Store the resulting transformed data in the travel_distance_transformed table.</p>

In [None]:
call val.td_analyze (
    'vartran',
    'database = demo_user;
    tablename = DISTANCE_TRAVELED;
    retain = columns (end_time,start_time);
    outputstyle =  table;
    outputdatabase = demo_user;
    outputtablename = travel_distance_transformed;
    
    zscore = columns (Distance_In_km);'
);

In [None]:
Select top 5 * FROM travel_distance_transformed;
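
<p style = 'font-size:16px;font-family:Arial;'>For reference, a z-score is (value - mean) / standard deviation. The sketch below recomputes it by hand for <code>Distance_In_km</code> from <code>DISTANCE_TRAVELED</code>. It assumes the population standard deviation (<code>STDDEV_POP</code>); the VAL function may use a different estimator, so small differences are possible. The aliases <code>mean_km</code>, <code>std_km</code>, and <code>distance_zscore</code> are illustrative.</p>

```sql
-- Sketch: manual z-score of the travel distance.
-- Assumption: population standard deviation; VAL's estimator may differ.
SELECT TOP 5
     d.person_id
    ,d.Distance_In_km
    ,(d.Distance_In_km - s.mean_km) / NULLIFZERO(s.std_km) AS distance_zscore
FROM DISTANCE_TRAVELED AS d
CROSS JOIN (
    SELECT AVG(Distance_In_km)        AS mean_km
          ,STDDEV_POP(Distance_In_km) AS std_km
    FROM DISTANCE_TRAVELED
) AS s;
```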

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>5. Hypothesis Testing.</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Create a frequency distribution of each person's complaints following longer versus shorter trips to business facilities.</p>
<p style = 'font-size:16px;font-family:Arial;'>To do so, link each complaint to the closest preceding trip, classify complaints by whether they followed a short or a long trip, and aggregate the counts by person.</p>
<p style = 'font-size:16px;font-family:Arial;'>Do longer trips result in more complaints than shorter trips?</p>


In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table complains_after_trip;

In [None]:
CREATE TABLE complains_after_trip AS( 
Select person_id
    ,sum(comp_aft_short)AS num_cmpl_short_trip
    ,sum(comp_aft_long)AS num_cmpl_long_trip
FROM

(Select person_id
    --Number of complaints after short trips
    ,CASE WHEN Distance_In_km <0.5 THEN 1 ELSE 0 END AS comp_aft_short
    
    --Number of complaints after long trips
    ,CASE WHEN Distance_In_km >= 0.5 THEN 1 ELSE 0 END  AS comp_aft_long

FROM (
    Select a.person_id
    ,end_trip
    ,date_received
    
    --get the trip closest to each complaint
    ,date_received - end_trip as time_lapce
    ,Distance_In_km
    ,row_number() OVER(partition by a.person_id, date_received order by time_lapce asc) as rn

    FROM
    (Select person_id 
        ,CAST(end_time as DATE) as end_trip 
        ,Distance_In_km
    FROM travel_distance_transformed 
    ) as a
    
LEFT OUTER JOIN 
    (Select person_id, date_received FROM DEMO_Cust_Travel.complaints
    ) as b
    
    ON a.person_id = b.person_id
    AND date_received >= end_trip -- the trip ended on or before the date the complaint was received
    
) as tb
    
WHERE time_lapce is not null
AND rn = 1
    
) as tbl2
GROUP BY 1
)WITH DATA PRIMARY INDEX(person_id);


<p style = 'font-size:16px;font-family:Arial;'>A glimpse of the frequency distribution: Number of complaints persons make after traveling short distances vs long distances.</p>

In [None]:
Select TOP 5 * from complains_after_trip;

<p style = 'font-size:16px;font-family:Arial;'>How many complaints followed long trips versus short trips?</p>

In [None]:
Select sum(num_cmpl_long_trip) as total_num_cmpl_long_trip , sum(num_cmpl_short_trip) as total_num_cmpl_short_trip FROM complains_after_trip;
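
<p style = 'font-size:16px;font-family:Arial;'>Before running a formal test, it can help to look at per-person averages and the overall share of complaints that followed long trips. The query below is a descriptive sketch only, and the aliases are illustrative.</p>

```sql
-- Sketch: descriptive comparison of complaints after short vs long trips.
SELECT
     AVG(CAST(num_cmpl_short_trip AS FLOAT)) AS avg_cmpl_short_per_person
    ,AVG(CAST(num_cmpl_long_trip  AS FLOAT)) AS avg_cmpl_long_per_person
    ,CAST(SUM(num_cmpl_long_trip) AS FLOAT)
        / NULLIFZERO(SUM(num_cmpl_long_trip) + SUM(num_cmpl_short_trip)) AS share_long
FROM complains_after_trip;
```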


<p style = 'font-size:16px;font-family:Arial;'>We can use the Wilcoxon rank test to <b>test the null hypothesis that the distribution of the number of complaints after short trips is the same as after long trips</b>.</p>

In [None]:
call val.td_analyze (
  'parametrictest',
  'database = demo_user;
   tablename = complains_after_trip;
   firstcolumn = num_cmpl_short_trip;
   secondcolumn = num_cmpl_long_trip;
   paired = false;
   equalvariance = false;
   statsdatabase = val;
   outputdatabase = demo_user;
   outputtablename = p_test;'
);

<p style = 'font-size:16px;font-family:Arial;'>Rank test result:<br>
a=accept null hypothesis,<br>
p=reject null hypothesis (positive),<br>
n=reject null hypothesis (negative)</p>

<p style = 'font-size:16px;font-family:Arial;'><b>Interpretation</b></p>
<p style = 'font-size:16px;font-family:Arial;'>- Mean rank of num_cmpl_short_trip is lower than the mean rank of num_cmpl_long_trip.</p>
<p style = 'font-size:16px;font-family:Arial;'>- So people do complain more after long trips compared to after short trips.</p>

In [None]:
Select * from p_test;

<p style = 'font-size:16px;font-family:Arial;'><b>Predictive Modeling</b></p>
<p style = 'font-size:16px;font-family:Arial;'>An illustrative use case: predict whether a customer is likely to complain.</p>

<p style = 'font-size:16px;font-family:Arial;'><b>Build a predictive model using Vantage and ClearScape Analytics (two ways):</b></p>
<p style = 'font-size:16px;font-family:Arial'><b>Method 1: In Vantage:</b></p>
<ul style = 'font-size:16px;font-family:Arial'> By analyzing the customer trip data, we predict who will complain in the future. The model is trained in Vantage. </ul>
<p style = 'font-size:16px;font-family:Arial'><b>Method 2: BYOM – Run External Predictive Models in Vantage:</b></p>
<ul style = 'font-size:16px;font-family:Arial'> We do the same, but the model is prepared and trained externally and then imported into Vantage for scoring. </ul>

<hr style="height:2px;border:none;">
<p style = 'font-size:16px;font-family:Arial'><b>Method 1: In Vantage:</b></p>
<p style = 'font-size:16px;font-family:Arial;'> Create additional features.</p>
<ul style = 'font-size:16px;font-family:Arial'>- Number of previous complaints</ul>
<ul style = 'font-size:16px;font-family:Arial'>- Average distance traveled</ul>
<ul style = 'font-size:16px;font-family:Arial'>- Standard deviation of distance traveled</ul>
<ul style = 'font-size:16px;font-family:Arial'>- Number of trips</ul>
<ul style = 'font-size:16px;font-family:Arial'>- Change in trips over time, represented by the slope of a regression</ul>
    
<p style = 'font-size:16px;font-family:Arial'>Response variable: <code>has_complained</code>.</p>
<ul style = 'font-size:16px;font-family:Arial'>- Train a model</ul>
<ul style = 'font-size:16px;font-family:Arial'>- Validate a model</ul>

In [None]:
Select * from travel_distance_transformed sample 10;

In [None]:
Select * from DEMO_Cust_Travel.complaints sample 5;

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>6. ADS (Analytical Data Set) Creation.</b></p>
<p style = 'font-size:18px;font-family:Arial;'>Create a frequency distribution of each person's complaints after they have traveled longer versus shorter distances to reach business facilities.</p>
<p style = 'font-size:18px;font-family:Arial;'>Engineer features such as:</p>
<li style = 'font-size:16px;font-family:Arial;'> Number of previous complaints</li>
<li style = 'font-size:16px;font-family:Arial;'> Distance metrics (avg, stddev, trend)</li>
<li style = 'font-size:16px;font-family:Arial;'> Number of trips, slope of distance trend</li>
<li style = 'font-size:16px;font-family:Arial;'> Binary target (<code>has_complained</code>)</li>
    
<p style = 'font-size:18px;font-family:Arial;'>Create two versions:
<li style = 'font-size:16px;font-family:Arial;'> Raw (<code>ADS_trip_complained</code>)</li>
<li style = 'font-size:16px;font-family:Arial;'> Scaled (<code>ads_trips_scaled</code> using <code>TD_ScaleFit</code>)</li></p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table simple_trips;

In [None]:
CREATE MULTISET TABLE simple_trips as(
    Select
    person_id
    ,end_time
    ,start_time
    ,previous_complaints 
    ,distance_norm
    FROM
    (Select person_id
        ,time_lapce
        ,num_received
        ,row_number() OVER(partition by person_id ORDER BY end_time asc) as rn
        ,lag(num_received) over(partition by person_id order by end_trip asc) as prev_com
        ,CASE WHEN num_received is NULL and prev_com is NULL THEN 0
            WHEN time_lapce = 9999 THEN prev_com
            ELSE num_received
            END as previous_complaints
     
        ,end_time
        ,start_time
        ,date_received
        ,Distance_In_km as distance_norm

        FROM 
        (Select a.person_id
            ,start_time
            ,end_time
            ,end_trip
            ,date_received
            ,num_received
            --get the trip closest to each complaint
            ,CASE WHEN date_received is NULL THEN 9999 
                ELSE CAST(date_received AS DATE)- end_trip 
            END as time_lapce
     
            ,Distance_In_km
            ,row_number() OVER(partition by a.person_id, date_received order by time_lapce asc) as rn
            FROM
            (Select person_id 
                ,CAST(end_time as DATE) as end_trip 
                ,start_time 
                ,end_time
                ,Distance_In_km
            FROM travel_distance_transformed 
            ) as a
    
            LEFT OUTER JOIN 
            (Select person_id
               , date_received 
                ,count(*) over (partition by person_id
                ORDER BY date_received asc 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as num_received
            from DEMO_Cust_Travel.complaints
            ) as b
    
            ON a.person_id = b.person_id
            AND date_received >= end_trip -- the trip ended on or before the date the complaint was received
        ) as Tbl    
        WHERE rn = 1
    ) as tbl2
)WITH DATA PRIMARY INDEX(person_id)

In [None]:
Select * FROM simple_trips sample 5;

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table ADS_trip_complained;

In [None]:
CREATE TABLE ADS_trip_complained as(
Select distinct  a.person_id
    ,COALESCE(previous_complaints, 0) as previous_complaints
    ,Distance_In_km
    ,COALESCE (AVG(a.Distance_In_km) over(partition by a.person_id) 
                ,0.0) as  avg_dist
    ,COALESCE(STDDEV_SAMP(a.Distance_In_km) over(partition by a.person_id)
                ,0.0) as  std_dist
    ,COUNT(Distance_In_km) over(partition by a.person_id) as  num_trips
    
    ,COALESCE( CAST(
        REGR_SLOPE(rn, Distance_In_km) OVER(partition by a.person_id ) 
     AS FLOAT)   ,0.0) as  slope
     
    ,CASE WHEN previous_complaints > 0 THEN 1 
           WHEN previous_complaints  is NULL THEN 0
        ELSE 0 
    END AS has_complained
    
FROM (Select aa.*
        ,row_number() over(PARTITION BY person_id ORDER BY end_time asc) as rn
     FROM travel_distance_transformed as aa
    ) as a
    
LEFT OUTER JOIN 
simple_trips as b

ON a.person_id = b.person_id
AND a.start_time = b.start_time
AND a.end_time = b.end_time

)WITH DATA PRIMARY INDEX(person_id);

In [None]:
Select * FROM 
ADS_trip_complained
sample 35;

<li style = 'font-size:16px;font-family:Arial;'> Apply range-based scaling.</li>
<li style = 'font-size:16px;font-family:Arial;'> Apply z-score-based standard scaling.</li>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table scaleFitOut;

In [None]:
SELECT * FROM TD_ScaleFit (
  ON ADS_trip_complained AS InputTable
  OUT PERMANENT TABLE OutputTable (scaleFitOut)
  USING
  TargetColumns ('previous_complaints'
                  ,'avg_dist'
                  ,'std_dist'
                  ,'num_trips'
                  ,'Distance_In_km'
)
  MissValue ('keep')
  ScaleMethod ('range')
  GlobalScale ('f')
) AS dt2; 

In [None]:
Select * from scaleFitOut;
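
<p style = 'font-size:16px;font-family:Arial;'>Range scaling maps each value to (value - min) / (max - min), so the result lies between 0 and 1. As a hedged illustration, the query below applies that formula by hand to <code>avg_dist</code>. The aliases <code>min_val</code>, <code>max_val</code>, and <code>avg_dist_scaled</code> are illustrative, and the transform function's output may differ in naming and precision.</p>

```sql
-- Sketch: manual range (min-max) scaling of one feature.
SELECT TOP 5
     a.person_id
    ,a.avg_dist
    ,(a.avg_dist - r.min_val) / NULLIFZERO(r.max_val - r.min_val) AS avg_dist_scaled
FROM ADS_trip_complained AS a
CROSS JOIN (
    SELECT MIN(avg_dist) AS min_val
          ,MAX(avg_dist) AS max_val
    FROM ADS_trip_complained
) AS r;
```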

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table ads_trips_scaled;

In [None]:
CREATE MULTISET TABLE  ads_trips_scaled as(
SELECT dt2.* 
       ,row_number() over(order by person_id) as rn
FROM TD_scaleTransform (
  ON ADS_trip_complained AS InputTable
  ON scaleFitOut AS FitTable DIMENSION
  USING
  Accumulate ('person_id','slope','has_complained')
) AS dt2
)WITH DATA PRIMARY INDEX(person_id);

In [None]:
Select * from ads_trips_scaled sample 10;

In [None]:
call val.td_analyze (
    'vartran',
    'database = demo_user;
    tablename = ADS_trip_complained;
    retain = columns (person_id,slope,has_complained, Distance_In_km);
    outputstyle =  table;
    outputdatabase = demo_user;
    outputtablename = simple_trips_normalised;
    
    zscore = columns (previous_complaints
                    ,avg_dist
                    ,std_dist
                    ,num_trips);'
);

In [None]:
Select * from simple_trips_normalised sample 10;

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>7. Balance and Sample the ADS.</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Data is imbalanced.</p>
<p style = 'font-size:16px;font-family:Arial;'>Check for class imbalance in the <code>has_complained</code> variable, then apply stratified sampling to create a balanced dataset named <code>ads_trip_sample</code>.</p>

In [None]:
Select has_complained, count(*) FROM ads_trips_scaled group by 1;

<p style = 'font-size:18px;font-family:Arial;'>Take a sample of the ADS that keeps the two classes balanced.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table ads_trip_sample;

In [None]:
CREATE TABLE ads_trip_sample as(
     SELECT a.*
    , SAMPLEID as SAMPLE_ID
     FROM ads_trips_scaled as a
     SAMPLE WHEN has_complained = 1 THEN 0.8
            WHEN has_complained = 0  THEN 0.015
            END
)WITH DATA PRIMARY INDEX(person_id)

In [None]:
Select has_complained, count(*) FROM ads_trip_sample group by 1;
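
<p style = 'font-size:16px;font-family:Arial;'>Raw counts can be hard to compare at a glance. The sketch below expresses each class as a percentage of the sample, which makes it easier to judge whether the sampling fractions produced a reasonable balance. The aliases are illustrative.</p>

```sql
-- Sketch: class shares in the sampled ADS.
SELECT
     c.has_complained
    ,c.n_rows
    ,CAST(c.n_rows AS FLOAT) * 100 / t.total_rows AS pct_of_sample
FROM (
    SELECT has_complained, COUNT(*) AS n_rows
    FROM ads_trip_sample
    GROUP BY 1
) AS c
CROSS JOIN (
    SELECT COUNT(*) AS total_rows
    FROM ads_trip_sample
) AS t
ORDER BY c.has_complained;
```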

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>8. Model Training and Scoring</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Train a Decision Forest classifier using key features to predict complaints, score the model on unseen data with <code>TD_DecisionForestPredict</code>, and evaluate its performance using <code>TD_ClassificationEvaluator</code>.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table DecisionForestOutput;

In [None]:
CREATE TABLE DecisionForestOutput AS (
SELECT * FROM TD_DecisionForest(
  ON ads_trip_sample AS InputTable PARTITION BY ANY
USING
  ResponseColumn('has_complained')
  InputColumns('slope'
              ,'previous_complaints'
              ,'avg_dist'
              ,'std_dist'
              ,'num_trips'
              ,'Distance_In_km'
)
  TreeType('CLASSIFICATION')
  ) AS dt
 ) WITH DATA
;

<p style = 'font-size:16px;font-family:Arial;'>Inspect the Model Table.</p>

In [None]:
Select * FROM DecisionForestOutput;

<p style = 'font-size:16px;font-family:Arial;'>Score the Model in Vantage</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table trip_scored;

In [None]:
CREATE TABLE trip_scored AS(
SELECT * FROM TD_DecisionForestPredict (
    ON (Select * FROM ads_trips_scaled
        WHERE rn is not in (Select rn FROM ads_trip_sample)
        ) AS InputTable PARTITION BY ANY
    ON DecisionForestOutput AS ModelTable DIMENSION
USING
  IdColumn ('person_id')
  Accumulate('has_complained','rn')
  Detailed('false')
) AS dt
)WITH DATA PRIMARY INDEX(person_id)

<p style = 'font-size:16px;font-family:Arial;'>Inspect the predictions</p>

In [None]:
Select * from trip_scored sample 10;

<p style = 'font-size:16px;font-family:Arial;'><b>Validate the Model in Vantage</b> </p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table additional_metrics;

In [None]:
SELECT * FROM TD_ClassificationEvaluator (
   ON trip_scored AS InputTable
   OUT TABLE OutputTable (additional_metrics)
   USING
   ObservationColumn ('has_complained')
   PredictionColumn ('prediction')
   Labels ('1','0')
) AS dt ORDER BY SeqNum;

<p style = 'font-size:16px;font-family:Arial;'>Inspect the evaluation metrics.</p>

In [None]:
Select * from additional_metrics;
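
<p style = 'font-size:16px;font-family:Arial;'>The evaluation metrics are derived from a confusion matrix. As a cross-check, the sketch below builds one directly from <code>trip_scored</code> by counting each combination of observed and predicted label. The aliases are illustrative.</p>

```sql
-- Sketch: confusion matrix counts from the scored table.
SELECT
     has_complained AS actual_label
    ,prediction     AS predicted_label
    ,COUNT(*)       AS n_rows
FROM trip_scored
GROUP BY 1, 2
ORDER BY 1, 2;
```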

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>9. Applying NLP: Complaint Analysis.</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Make sure the necessary functions and models are already installed in Vantage. Refer to <code>IVSM_Init_Model_Load.ipynb</code> and run that notebook first.</p>



<p style = 'font-size:16px;font-family:Arial;'><b>In-DB Complaint Analysis</b></p>
<ul style = 'font-size:16px;font-family:Arial;'>- Create embeddings from the complaints.</ul>
<ul style = 'font-size:16px;font-family:Arial;'>- Unsupervised topic modeling (topics not defined): cluster the complaints using K-Means on the latent features (embeddings), then describe each cluster with classical TF-IDF.</ul>
<ul style = 'font-size:16px;font-family:Arial;'>- Supervised topic modeling: use cosine similarity to assign the complaints to predefined topics.</ul>

<br>
<p style = 'font-size:18px;font-family:Arial;'><b>9.1. Embedding Complaint Texts</b></p>
<p style = 'font-size:16px;font-family:Arial;'><b>Convert complaints into Embeddings (Latent Features) for Analysis.</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Use the in-DB tokenizer to tokenize the complaints.</p>

In [None]:
--DROP TABLE first_embeddings

<p style = 'font-size:16px;font-family:Arial;'><b>The required embeddings for the <code>Complaints</code> table have already been generated and stored in a table named <code>Complaints_Embeddings</code>, which is used throughout this demo.</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Below, we generate embeddings for 10 sample rows for demonstration purposes and store them in a table named <code>first_embeddings</code>.</p>

In [None]:
 Select top 5 * from DEMO_Cust_Travel.complaints

In [None]:
CREATE TABLE first_embeddings AS(
select
    *
from mldb.ONNXEmbeddings(
    on (Select complaint_id, person_id, consumer_complaint_narrative as txt from DEMO_Cust_Travel.complaints sample 10) 
    on (select model_id, model from  embeddings_models where model_id = 'bge-small-en-v1.5') DIMENSION
    on (select model as tokenizer from embeddings_tokenizers where model_id = 'bge-small-en-v1.5') DIMENSION
USING
    Accumulate('complaint_id', 'person_id', 'txt')
    ModelOutputTensor('sentence_embedding')
    OutputFormat('FLOAT32(384)')
) as td
)WITH DATA;

In [None]:
Select top 5 * from first_embeddings;

<br>
<p style = 'font-size:18px;font-family:Arial;'><b>9.2. Unsupervised Topic Modeling using Complaints Clustering</b></p>


<p style = 'font-size:16px;font-family:Arial;'>Cluster the complaints into 4 clusters: run K-Means clustering on the embeddings to discover latent topics, then predict the cluster assignment of each complaint.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table Kmeans_complaints;

In [None]:
CREATE TABLE Kmeans_complaints as(
    SELECT * FROM TD_KMeans (
        ON DEMO_Cust_Travel.Complaints_Embeddings as InputTable
        USING
            IdColumn('complaint_id')
            TargetColumns('[3:385]')
            NumClusters(4)
            Seed(0)
            StopThreshold(0.0395)
            MaxIterNum(50)
    )as dt
)WITH DATA;

<p style = 'font-size:16px;font-family:Arial;'>Inspect the centroids of the clusters (averaged out embeddings)</p>

In [None]:
Select * from Kmeans_complaints;

<p style = 'font-size:16px;font-family:Arial;'>Use these centroids to classify (possibly new) complaints.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table complains_clusters;

In [None]:
CREATE TABLE complains_clusters as(
    SELECT * FROM TD_KMeansPredict (
        ON DEMO_Cust_Travel.Complaints_Embeddings AS InputTable
        ON Kmeans_complaints AS ModelTable DIMENSION
        USING
        OutputDistance('true')
        Accumulate('person_id', 'txt')
    )AS dt
)WITH DATA PRIMARY INDEX(person_id);

<p style = 'font-size:16px;font-family:Arial;'>Inspect the result of classifications.</p>

In [None]:
select * from complains_clusters sample 5;
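
<p style = 'font-size:16px;font-family:Arial;'>A quick way to judge the clustering is to check how complaints are spread across the clusters. The sketch below counts complaints per cluster. The aliases are illustrative.</p>

```sql
-- Sketch: number of complaints assigned to each K-Means cluster.
SELECT
     td_clusterid_kmeans AS cluster_id
    ,COUNT(*)            AS n_complaints
FROM complains_clusters
GROUP BY 1
ORDER BY 1;
```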

<br>
<p style = 'font-size:18px;font-family:Arial;'><b>9.3. TF-IDF Analysis for Clustered Complaints</b></p>



<p style = 'font-size:16px;font-family:Arial;'>You can view the sample data provided in the <code>Stopwords</code> table.</p>

In [None]:
select * from DEMO_Cust_Travel.Stopwords sample 10;

<p style = 'font-size:16px;font-family:Arial;'>Parse the complaints in each cluster, treating each cluster as a separate document (topic) of complaints. Tokenize the complaint texts using <code>TD_TextParser</code>, removing stop words, and then apply <code>TD_TFIDF</code> to extract keywords per cluster.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table cluster_tokens;

In [None]:
CREATE TABLE cluster_tokens as(
    SELECT * FROM TD_TextParser (
        ON complains_clusters AS InputTable
        ON DEMO_Cust_Travel.Stopwords As StopWordsTable DIMENSION
        USING
            TextColumn ('txt')
            StemTokens ('false')
            RemoveStopWords ('true')
            Accumulate ('td_clusterid_kmeans')
    ) as dt
)WITH DATA PRIMARY INDEX(td_clusterid_kmeans);

In [None]:
Select
    td_clusterid_kmeans as cluster_id
    ,token as words_from_complaints_narrative
FROM cluster_tokens 
sample 5;

<p style = 'font-size:16px;font-family:Arial;'>Apply <code>TD_TFIDF</code> to the clustered complaints.</p>

In [None]:
--Uncomment the next line to drop the table if it already exists.
--drop table cluster_tf_idf;

In [None]:
CREATE TABLE cluster_tf_idf as (
    SELECT distinct token
    ,TD_TF
    ,TD_IDF
    ,TD_TF_IDF
    ,td_clusterid_kmeans
    FROM TD_TFIDF (
        ON cluster_tokens AS InputTable
        USING
            DocIdColumn ('td_clusterid_kmeans')
            TokenColumn ('token')
            TFNormalization ('LOG')
            IDFNormalization ('SMOOTH')
            Regularization ('L2')
)as dt
) with data primary index(td_clusterid_kmeans);

In [None]:
Select * from cluster_tf_idf sample 5;
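
<p style = 'font-size:16px;font-family:Arial;'>To turn the TF-IDF scores into readable topic hints, one option is to list the highest-scoring tokens in each cluster. The sketch below keeps the top 10 tokens per cluster. The cutoff of 10 is an arbitrary choice for illustration.</p>

```sql
-- Sketch: top 10 tokens per cluster by TF-IDF score (cutoff chosen for illustration).
SELECT
     td_clusterid_kmeans AS cluster_id
    ,token
    ,TD_TF_IDF
FROM cluster_tf_idf
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY td_clusterid_kmeans
            ORDER BY TD_TF_IDF DESC
        ) <= 10
ORDER BY cluster_id, TD_TF_IDF DESC;
```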

<br><p style = 'font-size:18px;font-family:Arial;'><b>9.4. Supervised Topic Embedding Comparison</b></p>


<p style = 'font-size:16px;font-family:Arial;'>Create the <code>topics</code> table and insert the relevant data.</p>

In [None]:
--drop table topics;

In [None]:
CREATE TABLE Topics (
    topic_id INTEGER,
    topic VARCHAR(1000) CHARACTER SET UNICODE
)
PRIMARY INDEX (topic_id);


In [None]:
INSERT INTO topics (topic_id, topic) VALUES
(0, 'Widespread Failures in Accurate Credit Reporting and Alleged Violations of Consumer Protection Laws (FCRA/ECOA)');

INSERT INTO topics (topic_id, topic) VALUES
(1, 'Pervasive Issues with Identity Theft, Fraudulent Accounts, and Ineffective Resolution of Unauthorized Activity');

INSERT INTO topics (topic_id, topic) VALUES
(2, 'Persistent Poor Customer Service, Harassing Debt Collection, and Obstruction of Account Resolution');


In [None]:
Select * from topics;

<p style = 'font-size:16px;font-family:Arial;'>Embed the predefined topics using the same ONNX model, then use cosine similarity to classify complaints into those topics.</p>

In [None]:
--Uncomment the next line if the table already exists.
--drop table topics_embeddings_VS;

In [None]:
CREATE TABLE topics_embeddings_VS AS(
select
    *
from mldb.ONNXEmbeddings(
    on (select topic_id, topic as txt FROM topics) 
    on (select model_id, model from  embeddings_models where model_id = 'bge-small-en-v1.5') DIMENSION
    on (select model as tokenizer from embeddings_tokenizers where model_id = 'bge-small-en-v1.5') DIMENSION
USING
    Accumulate('topic_id', 'txt')
    ModelOutputTensor('sentence_embedding')
    OutputFormat('FLOAT32(384)')
) as td
)WITH DATA;



In [None]:
Select * from topics_embeddings_VS;

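<p style = 'font-size:16px;font-family:Arial;'>Compute the cosine similarity between each complaint embedding and the topic embeddings, and assign each complaint to its most similar topic. Similarity is derived as one minus the cosine distance returned by <code>TD_VECTORDISTANCE</code>.</p>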

In [None]:
--Uncomment the next line if the table already exists.
--drop table topic_complaint_similarities;

In [None]:
Create table topic_complaint_similarities as (
SELECT 
    dt.target_id as complaint_id,
    dt.reference_id as topic_id,
    e_tgt.txt as consumer_complaint_narrative,
    e_ref.txt as topic,
    (1.0 - dt.distance) as similarity
FROM
    TD_VECTORDISTANCE (
        ON (select * from DEMO_Cust_Travel.Complaints_Embeddings a) AS TargetTable
        ON topics_embeddings_VS AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('complaint_id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('topic_id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            topk(1) 
    ) AS dt
JOIN  DEMO_Cust_Travel.Complaints_Embeddings e_tgt on e_tgt.complaint_id = dt.target_id
JOIN topics_embeddings_VS e_ref on e_ref.topic_id = dt.reference_id
) with data;

In [None]:
Select * from topic_complaint_similarities sample 10;
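
<p style = 'font-size:16px;font-family:Arial;'>To see how complaints are distributed across the predefined topics, the assignments can be aggregated. The query below is a minimal sketch, not part of the original demo. It assumes the <code>topic_complaint_similarities</code> table created above, and the column aliases <code>complaint_count</code> and <code>avg_similarity</code> are illustrative names.</p>

```sql
-- Sketch: number of complaints assigned to each topic and their mean similarity.
-- Assumes topic_complaint_similarities exists as created above.
SELECT
    topic_id
    ,topic
    ,COUNT(*) AS complaint_count
    ,AVG(similarity) AS avg_similarity
FROM topic_complaint_similarities
GROUP BY topic_id, topic
ORDER BY complaint_count DESC;
```

<p style = 'font-size:16px;font-family:Arial;'>A low average similarity for a topic may indicate that the predefined topic description does not fit the assigned complaints well, since <code>topk(1)</code> always assigns the nearest topic regardless of how close it is.</p>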

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>10. Cleanup.</b></p>

<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors the next time the notebook is run. This section drops all the tables created during the demonstration.</p>

In [None]:
drop table topic_complaint_similarities;

In [None]:
drop table TRAVEL;

In [None]:
drop table topics_embeddings_VS;

In [None]:
drop table cluster_tf_idf;

In [None]:
drop table cluster_tokens;

In [None]:
drop table complains_clusters;

In [None]:
drop table Kmeans_complaints;

In [None]:
drop table first_embeddings;

In [None]:
drop table additional_metrics;

In [None]:
drop table trip_scored;

In [None]:
drop table DecisionForestOutput;

In [None]:
drop table ads_trip_sample;

In [None]:
drop table ads_trips_scaled;

In [None]:
drop table scaleFitOut;

In [None]:
drop table ADS_trip_complained;

In [None]:
drop table simple_trips;

In [None]:
drop table complains_after_trip;

In [None]:
drop table DISTANCE_TRAVELED;

In [None]:
drop table TRAVEL_Filtered;

In [None]:
drop table Topics;

In [None]:
drop table outlier_fit;

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
call remove_data('DEMO_Cust_Travel');          -- Takes 5 seconds

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>