<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Text Term Frequency Analysis - Python</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>
This demo analyses the text in the rows of a table to compute TF-IDF (Term Frequency-Inverse Document Frequency), an indicator of a term's importance in a specific document relative to the entire corpus of documents.    
It demonstrates the following Vantage capability:
    <li style = 'font-size:16px;font-family:Arial'> NGramSplitter Function - tokenizes (splits) an input stream of text and outputs n multigrams (called n-grams) based on the specified Reset, Punctuation, and Delimiter syntax elements.</li>
</p>
<p style = 'font-size:16px;font-family:Arial'> This notebook demonstrates how the function is used in the Python kernel; a similar notebook shows the same features in the SQL kernel. </p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Steps</b></p>
<p style = 'font-size:16px;font-family:Arial'>
    <li style = 'font-size:16px;font-family:Arial'> Connect to Vantage and read the dataset. </li>
    <li style = 'font-size:16px;font-family:Arial'> Use NGramSplitter SQL to create a table of grams of n-size. </li>
    <li style = 'font-size:16px;font-family:Arial'> Express SQL to calculate TF-IDF and store the output in a table. </li>
    <li style = 'font-size:16px;font-family:Arial'> Retrieve the data as a local dataframe. </li>
    <li style = 'font-size:16px;font-family:Arial'> Basic visualization to show top 30 important terms. </li>
</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Accessing the Data </b> </p>
<p style = 'font-size:16px;font-family:Arial'>These demos work either with foreign tables accessed from Cloud Storage via NOS, or with tables you import to your machine. If you import data for multiple demos, you may need to use the Data Dictionary "Manage Your Space" routine to clean up tables you no longer need.     
    
<p style = 'font-size:16px;font-family:Arial'>Use the link below to access the 2 options for using data from the data dictionary notebook:

[Click Here to get data for this notebook](../Data_Dictionary/Data_Dictionary.ipynb#TRNG_RETAILDSE)

[Click Here to Manage Your Space](../Data_Dictionary/Data_Dictionary.ipynb#Manage_Your_Space)

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>1. Import Python packages, connect to Vantage and explore the dataset</b></p>

In [None]:
import json
import getpass
import warnings
import os
from datetime import datetime, timedelta

import pandas as pd
import numpy as np

from teradataml.dataframe.dataframe import DataFrame
from teradataml.analytics.sqle import NGramSplitter
from teradataml.dataframe.dataframe import in_schema
from teradataml.context.context import create_context, remove_context, get_context
from teradataml.dataframe.copy_to import copy_to_sql
from teradataml.dataframe.fastload import fastload
from teradataml.options.display import display

from teradatasqlalchemy.types import *

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
sns.set()
warnings.filterwarnings('ignore')


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Connect to Vantage </b> </p>
<p style = 'font-size:16px;font-family:Arial'>The command below connects to the Vantage environment and prompts for the password.

In [None]:
# Connect to Vantage
eng = create_context(host = 'host.docker.internal', username='demo_user', password = getpass.getpass())

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial'>For this demo, data is already resident in Object Storage which we are accessing via ReadNOS.  Create a reference to the table, and sample the contents.  Data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [None]:
tdf_reviews = DataFrame('"TRNG_RETAILDSE"."WEB_COMMENT"')
tdf_reviews.head(5)

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>2. Use the NGramSplitter SQL Function</b></p>
<p style = 'font-size:16px;font-family:Arial'>The NGramSplitter function splits the corpus of documents into "terms" (grams) of a selected size.  Specifically, this example creates a table called "tbl_grams" that is the result of splitting each "document" (review) into two-word chunks (grams).  Each row in this table includes:
<ol style = 'font-size:16px;font-family:Arial'>
    <li>The two-word chunk (ngram).</li>
    <li>The source review id (comment_id).</li>
     <li>Chunk length (n).</li>
     <li>The count of this chunk in the review (frequency).</li>
     <li>The total number of chunks of this length in the review (totalcnt).</li>
</ol>
<p style = 'font-size:16px;font-family:Arial'>
The splitting algorithm can be controlled with delimiters, punctuation indicators, etc.</p>
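
<p style = 'font-size:16px;font-family:Arial'>
To make the output columns concrete, here is a minimal pure-Python sketch of overlapping n-gram splitting for a single review. It is only an illustration of the idea, not the NGramSplitter implementation: the helper name <code>split_ngrams</code> and the sample sentence are hypothetical, and the sketch assumes punctuation characters are simply removed before splitting and that the total count is taken per review.</p>

```python
from collections import Counter


def split_ngrams(text, n=2, delimiter=" ", punctuation="`~#^&*()-"):
    """Return (ngram, n, frequency, totalcnt) rows for one document.

    Hypothetical helper for illustration only; it does not call Vantage.
    """
    # Lower-case the text and drop punctuation characters (assumed behaviour).
    cleaned = "".join(ch for ch in text.lower() if ch not in punctuation)
    # Split on the delimiter and ignore empty tokens from repeated delimiters.
    tokens = [tok for tok in cleaned.split(delimiter) if tok]
    # Overlapping windows: each token starts a new gram while n tokens remain.
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return [(gram, n, freq, total) for gram, freq in counts.items()]


# Made-up review text, used only to show the shape of the rows.
for row in split_ngrams("Great price and great price match (online)"):
    print(row)
```

<p style = 'font-size:16px;font-family:Arial'>
Each printed tuple corresponds to one row for that review: the gram, its length, how often it occurs in the review, and the total number of grams found in the review.</p>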

In [None]:

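# Drop the n-gram table if it is left over from a previous run
qry = 'DROP TABLE tbl_grams;'

try:
    eng.execute(qry)
except Exception as e:
    # Teradata error 3807 means the object does not exist; re-raise anything else
    if '3807' not in str(e.args):
        raise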

#how many grams should we split the docs into?
grams = 2

#Create ngram table
qry = f'''
CREATE TABLE tbl_grams AS (
    SELECT * FROM NGramSplitter ( 
        ON ( SELECT * FROM "TRNG_RETAILDSE"."WEB_COMMENT" )   
        USING 
            TextColumn('comment_text') 
            Accumulate('comment_id') 
            Grams('{grams}') 
            OverLapping('TRUE') 
            ConvertToLowerCase('TRUE') 
            Delimiter(' ') 
            Punctuation('[`~#^&*()-]') 
            OutputTotalGramCount('TRUE') 
            NGramColName('ngram') 
            GramLengthColName('n') 
            FrequencyColName('frequency') 
            TotalCountColName('totalcnt') 
    ) as ngram_out
    )
WITH DATA
PRIMARY INDEX (comment_id);
'''

eng.execute(qry)

tdf_grams = DataFrame('tbl_grams')

tdf_grams.to_pandas(num_rows = 5).head()

<p style = 'font-size:16px;font-family:Arial'>The table created above shows the NGramSplitter function applied to the web comment column. For each n-gram, we can see its frequency within the comment and the total n-gram count.</p>

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>3. Create the TF-IDF Table</b></p>
<p style = 'font-size:16px;font-family:Arial'><b>TF-IDF</b> or <b>Term Frequency-Inverse Document Frequency</b> is an indicator of a term's <b>importance</b> in a specific document relative to the entire corpus of documents.  This value is calculated as the product of:
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Term Frequency = (Number of Times the Term Appears in the Document)/(Total Number of Terms in the Document)</li>
    <li>Inverse Document Frequency = Natural Log((Total Number of Documents)/(Number of Documents with the Term))</li>
 </ul>   
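
<p style = 'font-size:16px;font-family:Arial'>
As a worked illustration of the calculation, here is a minimal pure-Python sketch on a toy corpus. The document ids and gram lists are made up, and the helper name <code>tf_idf</code> is hypothetical. The sketch assumes term frequency is a term's count in a document divided by the total number of terms in that same document, which is how it interprets the <code>frequency</code> and <code>totalcnt</code> columns.</p>

```python
import math
from collections import Counter

# Toy corpus (hypothetical): document id -> list of grams found in that document.
docs = {
    1: ["fast shipping", "great price"],
    2: ["great price", "price match"],
    3: ["great price", "fast shipping", "fast shipping"],
}


def tf_idf(docs):
    """Return {(doc_id, term): tf * idf} using the natural log for IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for grams in docs.values():
        doc_freq.update(set(grams))

    scores = {}
    for doc_id, grams in docs.items():
        counts = Counter(grams)
        total = len(grams)
        for term, freq in counts.items():
            tf = freq / total                        # share of this document's terms
            idf = math.log(n_docs / doc_freq[term])  # float division, natural log
            scores[(doc_id, term)] = tf * idf
    return scores


scores = tf_idf(docs)
for key in sorted(scores, key=scores.get, reverse=True):
    print(key, round(scores[key], 4))
```

<p style = 'font-size:16px;font-family:Arial'>
A term that occurs in every document has an IDF of ln(1) = 0, so its TF-IDF is 0 no matter how often it appears; terms concentrated in few documents score highest.</p>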
<p style = 'font-size:16px;font-family:Arial'>
This can be accomplished in SQL using the results table created by the NGramSplitter function in the step above:</p>

In [None]:
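# Drop the TF-IDF table if it is left over from a previous run
qry = 'DROP TABLE tbl_tf_idf;'

try:
    eng.execute(qry)
except Exception as e:
    # Teradata error 3807 means the object does not exist; re-raise anything else
    if '3807' not in str(e.args):
        raise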
        
qry = '''
CREATE TABLE tbl_tf_idf AS (
SELECT gr.comment_id as comment_id,
gr."ngram" as term,
CAST (CAST(gr.frequency AS FLOAT) / CAST(gr.totalcnt AS FLOAT) AS DECIMAL(10,6)) as tf,
CAST(LN(CAST(sel_docs.tot_docs AS FLOAT) / CAST(sel_docs.tot_term AS FLOAT)) AS DECIMAL(10,6)) as idf,
CAST(idf * tf AS DECIMAL(10,6)) as tf_idf
FROM tbl_grams as gr
--get the number of docs where each term exists
LEFT JOIN (select "ngram", tot_term , tot_docs from
((SELECT "ngram",COUNT(*) as tot_term
FROM tbl_grams
GROUP BY "ngram") terms
--get the total doc count and join it to the table
CROSS JOIN (SELECT COUNT(DISTINCT comment_id) as tot_docs
FROM tbl_grams ) as sum_docs
)
) sel_docs
ON gr."ngram" = sel_docs."ngram"
WHERE tf_idf > .5
    )
WITH DATA
PRIMARY INDEX (comment_id);
'''

eng.execute(qry)

tdf_tf_idf = DataFrame('tbl_tf_idf')
tdf_tf_idf.head(5)

In [None]:
df_tf_idf = tdf_tf_idf.to_pandas(all_rows = True)
df_tf_idf['tf_idf'] = df_tf_idf['tf_idf'].astype(float)

In [None]:
df_tf_idf.sort_values(by = 'tf_idf', ascending = False)

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>4. Visualize the Results</b></p>
<p style = 'font-size:16px;font-family:Arial'>
Let's use pandas and Matplotlib to visualize the data:</p>


In [None]:
df1 = df_tf_idf.sort_values(by = 'tf_idf', ascending = False).head(30)

In [None]:
#plot it:
df1.sort_values(by = 'term', ascending = False).head(30).set_index('term')[['tf_idf']].plot(kind = 'barh', legend = True, figsize = (12, 9));

<p style = 'font-size:16px;font-family:Arial'>
This plot shows the 30 highest TF-IDF scores among the terms used in the reviews. </p>

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>5.  Clean up</b></p>

In [None]:
eng.execute('DROP TABLE DEMO_USER.tbl_tf_idf;') 

In [None]:
eng.execute('DROP TABLE DEMO_USER.tbl_grams;') 

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradata Python Package User Guide: <a href = 'https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg'>https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg</a></li>
    <li>Teradataml Python Reference: <a href = 'https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA'>https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA</a></li>
    <li>Teradata NGramSplitter Function Reference: <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/NGramSplitter'>https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/NGramSplitter</a></li>
  
</ul>


<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>