<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Text Term Frequency Analysis (Python)
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
This demo analyzes the text in the rows of a table to compute TF-IDF (Term Frequency-Inverse Document Frequency), an indicator of a term's importance in a specific document relative to the entire corpus of documents.    
It demonstrates the following Vantage capability:
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> NGramSplitter Function - tokenizes (splits) an input stream of text and outputs n multigrams (called n-grams) based on the specified Reset, Punctuation, and Delimiter syntax elements.</li>
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> This notebook demonstrates how the function is used in the Python kernel with the <a href = 'https://www.docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python'>TeradataML Package</a>; a similar notebook shows the same features in the SQL kernel. More information on the functions can be found in the <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20'>TeradataML Python Reference</a>.</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Steps</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Connect to Vantage and read the dataset. </li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Use NGramSplitter SQL to create a table of grams of n-size. </li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Express SQL to calculate TF-IDF and store the output in a table. </li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Retrieve the data as a local dataframe. </li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'> Basic visualization to show top 30 important terms. </li>
</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running the steps with the Shift + Enter keys.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Import python packages, connect to Vantage and explore the dataset</b></p>

In [None]:
import getpass
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# from teradataml import func
# from teradataml.analytics.sqle import NGramSplitter
# from teradataml.context.context import create_context, get_context, remove_context
# from teradataml.dataframe.dataframe import DataFrame, in_schema
# # from teradataml.dataframe.copy_to import copy_to_sql
# from teradataml.options.display import display
from teradataml import *
from teradatasqlalchemy.types import *
from sqlalchemy import func
%matplotlib inline

# Set message level
display.print_sqlmr_query = False
display.max_rows = 5
warnings.filterwarnings('ignore')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell.</p>

In [None]:
%run -i ../startup.ipynb

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The command below connects to the Vantage environment and sets an (optional) QueryBand for the database session.</p>

In [None]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=Text_Term_Frequency_Python.ipynb;' UPDATE FOR SESSION;")

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You have the option of either running the demo using foreign tables to access the data without using any storage on your environment or downloading the data to local storage which may yield somewhat faster execution, but there could be considerations of available storage. There are two statements in the following cell, and one is commented out. You may switch which mode you choose by changing the comment string.</p>

In [None]:
%%time
%run -i ../run_procedure.py "call get_data('DEMO_Retail_cloud');"
# takes about 25 seconds, estimated space: 0 MB
#%run -i ../run_procedure.py "call get_data('DEMO_Retail_local');" 
# takes about 50 seconds, estimated space: 23 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
# %run -i ../run_procedure.py "call space_report();"

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Get the data from Vantage object <i>"DEMO_Retail"."Web_Comment"</i> in the DataFrame.</p>

In [None]:
tdf_reviews = DataFrame('"DEMO_Retail"."Web_Comment"')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting familiar with the dataset</b></p>

In [None]:
tdf_reviews.info()

In [None]:
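# DataFrame.size counts elements (rows x columns); getsizeof measures only the local handle
print("Object tdf_reviews has {} elements (rows x columns) and its local handle uses {} bytes in memory."
      .format(tdf_reviews.size, sys.getsizeof(tdf_reviews)))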

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us see what the data in the table looks like by selecting a single comment.</p>

In [None]:
# check an example comment
tdf_reviews[(tdf_reviews.comment_id == 30)]

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Use the NGram Splitter SQL Function</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The NGramSplitter function splits the corpus of documents into "terms" (grams) of a selected size. Specifically, this example creates a DataFrame called "tdf_grams" that is the result of splitting each "document" (review) into two-word chunks (grams). Each row in this result includes:</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>The two-word chunk (ngram).</li>
    <li>The source review id (comment_id, carried through by the accumulate argument).</li>
    <li>Chunk length (n).</li>
    <li>The count of this chunk in the review (frequency).</li>
    <li>The count of all chunks in the review (totalcnt).</li>
</ol>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The splitting algorithm of <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/teradataml-Analytic-Database-SQL-Engine-Analytic-Functions/Supported-on-Database-Versions-16.20.xx-17.00.xx-17.05.xx/NGramSplitter'>NGramSplitter</a> can be controlled with delimiters, punctuation indicators, etc.</p>
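<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To build intuition for the output columns, here is a minimal pure-Python sketch of overlapping two-word splitting on a single document. It is only an approximation of what NGramSplitter does in the database: the helper name <code>split_ngrams</code> and the sample sentence are hypothetical, and the way punctuation is stripped here is an assumption rather than the function's documented behavior.</p>

```python
from collections import Counter

def split_ngrams(text, n=2, punctuation="`~#^&*()-"):
    """Return rows of (ngram, n, frequency, totalcnt) for one document."""
    # Lowercase the text and drop the listed punctuation characters
    # (assumed to approximate to_lower_case=True and the punctuation argument).
    cleaned = "".join(ch for ch in text.lower() if ch not in punctuation)
    words = cleaned.split()  # split on whitespace, like delimiter=" "
    # Overlapping n-grams: slide a window of n words forward one word at a time.
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)   # frequency of each distinct n-gram in this document
    total = len(grams)        # total number of n-grams in this document
    return [(gram, n, freq, total) for gram, freq in counts.items()]

# Hypothetical review text, not taken from the Web_Comment table.
rows = split_ngrams("Great price great price (and fast delivery)")
for row in rows:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this sketch, "great price" occurs twice among the six two-word chunks of the sentence, so its row carries frequency 2 and totalcnt 6. These are the two values that the term frequency in step 3 divides.</p>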

In [None]:
tdf_grams = NGramSplitter(
               data             = tdf_reviews
              ,text_column      = 'comment_text'
              ,accumulate       = 'comment_id'
              ,grams            = "2"
              ,overlapping      = True
              ,to_lower_case    = True
              ,delimiter        = " "
              ,punctuation      = '[`~#^&*()-]'
              ,total_gram_count = True
            ).result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The DataFrame created above holds the result of applying the NGramSplitter function to the web comment text column. For each n-gram we can see its frequency and the total number of n-grams in the comment.  
Let us check how the comment we saw earlier looks after conversion to n-grams.</p>

In [None]:
# check an example comment
tdf_grams[(tdf_grams.comment_id == 30)]

In [None]:
# check count of distinct CommentID (distinct_Comment should be 22641)
tdf_grams.assign(drop_columns=True
                ,distinct_Comment=tdf_grams.comment_id.distinct().count())

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Create the TF-IDF Table</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://en.wikipedia.org/wiki/Tf%E2%80%93idf'><b>TF-IDF</b> or <b>Term Frequency-Inverse Document Frequency</b></a> is an indicator of a term's <b>importance</b> in a specific document based on the entire corpus of documents. This value is calculated by taking the product of:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Term Frequency = (Number of Times the Term Appears in the Document)/(Total Number of Terms in the Document)</li>
    <li>Inverse Document Frequency = Natural Log((Total Number of Documents)/(Number of Documents with the Term))</li>
 </ul>   
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
This can be accomplished with teradataml DataFrame operations on the result created by the NGramSplitter function in the step above.
<br>
    
    tdf_tf_idf.tf     = tdf_grams.frequency / tdf_grams.totalcnt
    tdf_tf_idf.idf    = ln( (count of distinct tdf_grams.comment_id) / (count of tdf_grams.comment_id containing the ngram) )
    tdf_tf_idf.tf_idf = idf * tf
</p>
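<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Before running the calculation in the database, the arithmetic can be checked on a toy example. The sketch below uses plain Python and a small, hypothetical corpus of three "comments"; the variable names mirror the columns used in this notebook (frequency, totalcnt, tot_docs), but nothing here touches Vantage.</p>

```python
import math
from collections import Counter

# Toy corpus: comment_id -> list of terms (hypothetical data, not from Web_Comment).
docs = {
    1: ["fast delivery", "great price", "great price"],
    2: ["great price", "poor quality"],
    3: ["fast delivery", "poor quality", "poor quality", "late arrival"],
}

tot_docs = len(docs)  # total number of documents
# Number of documents containing each term (the role played by count_comment_id).
doc_freq = Counter(term for terms in docs.values() for term in set(terms))

tf_idf = {}
for comment_id, terms in docs.items():
    totalcnt = len(terms)  # all terms in this document
    for term, frequency in Counter(terms).items():
        tf = frequency / totalcnt                   # term frequency
        idf = math.log(tot_docs / doc_freq[term])   # inverse document frequency
        tf_idf[(comment_id, term)] = tf * idf

# Show the (comment_id, term) pairs from highest to lowest score.
for key, score in sorted(tf_idf.items(), key=lambda kv: kv[1], reverse=True):
    print(key, round(score, 4))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note how "late arrival" ranks first even though it occurs only once: it appears in a single document, so its inverse document frequency is the largest in the corpus.</p>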

In [None]:
#get Total distinct tdf_grams.comment_id --> iDistinctComment (SQL column: tot_docs)
iDistinctComment = tdf_grams.assign(drop_columns=True
                                         ,distinct_Comment = tdf_grams.comment_id.distinct().count()
                                   ).get_values()[0][0]

In [None]:
# count the comments containing each ngram --> td_ngram_count (column: count_comment_id)
td_ngram_count  = tdf_grams.groupby(["ngram"]).count().select(["ngram", "count_comment_id"])

In [None]:
# first draft of tdf_tf_idf by selecting column to keep & adding iDistinctComment
tdf_tf_idf = tdf_grams.assign(drop_columns = True
                                   ,comment_id = tdf_grams.comment_id
                                   ,term      = tdf_grams.ngram
                                   ,tf         = tdf_grams.frequency.cast(type_=FLOAT) / tdf_grams.totalcnt 
                                   ,tot_docs   = (int(iDistinctComment))
                                   )

In [None]:
# left-join tdf_tf_idf to td_ngram_count
tdf_tf_idf = tdf_tf_idf.join(other   = td_ngram_count
                            ,on      = ["term=ngram"]
                            ,how     = "left" 
                            ,lsuffix = "t1", rsuffix = "t2")

In [None]:
# add the ratio inside the logarithm: tot_docs / number of comments containing the term
tdf_tf_idf = tdf_tf_idf.assign(idf_    = (tdf_tf_idf.tot_docs.cast(type_=FLOAT) / tdf_tf_idf.count_comment_id) )

In [None]:
# add idf
tdf_tf_idf = tdf_tf_idf.assign(idf     = func.ln(tdf_tf_idf.idf_.expression))

In [None]:
# add tf_idf
tdf_tf_idf = tdf_tf_idf.assign(tf_idf  = tdf_tf_idf.idf * tdf_tf_idf.tf )

In [None]:
# check whether all columns are now pulled together
tdf_tf_idf.info()

In [None]:
# pull the result into a local pandas DataFrame
# teradataml evaluates lazily, so this step triggers the actual work and takes some time
df_tf_idf = tdf_tf_idf.to_pandas(all_rows = True)
df_tf_idf['tf_idf'] = df_tf_idf['tf_idf'].astype(float)

In [None]:
# check mem sizes of teradataml.dataframe <-> pandas.dataframe
print("teradataml.dataframe tdf_tf_idf has {:>10} bytes in memory.".format(sys.getsizeof(tdf_tf_idf)));
print("pandas.dataframe     df_tf_idf  has {:>10} bytes in memory.".format(sys.getsizeof(df_tf_idf)));
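<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A note on reading these numbers: <code>sys.getsizeof</code> reports only the shallow size of the Python object it is given. The sketch below illustrates why a lightweight reference to server-side data can look tiny next to data that has been materialized locally. The class <code>RemoteTableHandle</code> is a hypothetical stand-in; it is not how teradataml is implemented, only an assumption-level illustration of the idea.</p>

```python
import sys

class RemoteTableHandle:
    """Hypothetical stand-in for a lazy reference to a server-side table."""
    def __init__(self, table_name):
        # Only metadata lives in local memory; the rows stay on the server.
        self.table_name = table_name

handle = RemoteTableHandle('"DEMO_Retail"."Web_Comment"')
local_rows = list(range(100_000))  # data actually materialized in local memory

# The handle's shallow size is small and independent of how large the table is.
print("handle:    ", sys.getsizeof(handle), "bytes")
print("local_rows:", sys.getsizeof(local_rows), "bytes")
```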

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us check the term frequency and inverse document frequency calculated for the comment we saw before.</p>

In [None]:
# check an example comment
df_tf_idf[(df_tf_idf.comment_id == 30)]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now, let us check the terms with the highest TF-IDF scores in our data.</p>

In [None]:
df_tf_idf.sort_values(by = 'tf_idf', ascending = False)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Visualize the Results</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's use Pandas and Matplotlib to do visualizations of the data:</p>

In [None]:
# get Top30 by tf_idf
df_top30 = df_tf_idf.sort_values(by = 'tf_idf', ascending = False).head(30)

In [None]:
# plot it using the built-in pandas plot
df_top30.sort_values(by = 'tf_idf', ascending = True)\
        .set_index('term')[['tf_idf']]\
        .plot(kind = 'barh', legend = True, figsize = (12, 9));

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This plot shows the 30 terms with the highest TF-IDF scores in the reviews. It is produced with the pandas DataFrame built-in <a href = 'https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html'>plot function</a>, which is built on <a href = 'https://matplotlib.org/stable/api/index'>Matplotlib</a>.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Cleanup </b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Database and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Retail');" 
#Takes 5 seconds

In [None]:
remove_context()

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><a href = 'https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg'>Teradata Python Package User Guide</a></li>
    <li><a href = 'https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA'>Teradataml Python Reference</a></li>
    <li><a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/NGramSplitter'>Teradata NGramSplitter Function Reference</a></li>
  
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023,2024. All Rights Reserved
        </div>
    </div>
</footer>