<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Classification Using ClearScape Analytics Text Preparation and Naive Bayes Classification Functions
  <br>
       <img id="teradata-logo" src="../../images/TeradataLogo.png" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<hr>

<br>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Utilize Native ClearScape Analytics functions for Text Processing and Analytics for performance at extreme scale</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Naive Bayes classifiers are a family of classification algorithms based on Bayes' Theorem. Rather than a single algorithm, they are a group of algorithms that share a common assumption: given the class, every feature is independent of every other feature.</p>
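
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This shared principle can be written in standard textbook notation. The symbols are introduced here for illustration and do not come from the Vantage documentation: c is a class and x<sub>1</sub> ... x<sub>n</sub> are the features of one item. Under the independence assumption, the posterior probability of a class is proportional to the class prior multiplied by the per-feature likelihoods, and the predicted class is the one with the largest value:</p>

$$P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c), \qquad \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$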

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For text classification, a simple way to understand this approach is that the algorithm estimates the probability that a word, or a sequence of n words (also known as an n-gram), appears within a text or within a category.  Before classification, the text must be split into these grams (or "tokens").</p>
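
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a minimal sketch of the idea in plain Python (not the Vantage implementation), the snippet below splits a short text into n-grams and computes the relative frequency of each unigram. The helper name <code>ngrams</code> and the sample review are hypothetical and exist only for illustration.</p>

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-grams of a text as tuples of lowercase, whitespace-split words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

review = "good coffee good price"   # hypothetical sample text
unigrams = ngrams(review, 1)
bigrams = ngrams(review, 2)

# Relative frequency of each unigram within this one text
counts = Counter(unigrams)
freq = {gram: count / len(unigrams) for gram, count in counts.items()}

print(bigrams)
print(freq[("good",)])
```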

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Many text processing and classification tools exist across a variety of programming languages, but only Vantage provides the capability to perform these tasks with the degree of performance and scale required by the modern enterprise.  Furthermore, text preparation and analytics <b>pipelines</b> can be built to automate the use of these algorithms for the business.</p>

<hr>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Live Demonstration</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data for this demonstration is the Amazon Fine Food Reviews data set, which is available <a href = 'https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews'>on Kaggle</a>.  The demonstration consists of the following steps:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Data Preparation: categorize the ratings and tokenize the review text</li>
    <li>Model Training: create a Naive Bayes text classification model using the training data</li>
    <li>Scoring and Evaluation: make predictions on the testing data and evaluate the results</li>
    </ol>
    
<img src = 'Flow_Diagram_TextClassifier.png' width = 100%>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 1 - Data Preparation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we will inspect the original data set, and perform various preparation tasks.</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the rows of the table</li>
    <li>Transform the numeric rating to a categorical value using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Cleaning-Functions/Parsing-Data/TD_ConvertTo'>ConvertTo</a>, then verify the new column types using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Exploration-Functions/TD_ColumnSummary'>ColumnSummary</a></li>
    <li>Split the data into training and testing data sets</li>
    <li>Tokenize the data using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/TD_TextParser'>TextParser</a></li>
    </ol>
    

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Imports and Connection</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Import required packages and create a connection context to Vantage.</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import json
from teradataml import *
display.suppress_vantage_runtime_warnings = True

from IPython.display import display as ipydisplay

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# load vars json
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

eng = create_context(host=host, username=username, password=password)

eng.execute(f'''SET SESSION COMPUTE GROUP {session_vars['hierarchy']['users']['business_users'][1]['compute_group']}''')

# confirm connection
print(eng)

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.1 - Inspect the Data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a "Virtual DataFrame", which is a remote representation of the data set.  This allows us to operate on the data remotely and at scale using familiar pandas and Python syntax.  <b>ColumnSummary</b> is a built-in function that summarizes each column across the whole data set.</p>

In [None]:
tdf_reviews = DataFrame('"demo_ofs"."Amazon_Fine_Foods_Reviews"')

In [None]:
ipydisplay(tdf_reviews.shape)
ipydisplay(tdf_reviews.sample(2))

In [None]:
res = ColumnSummary(data = tdf_reviews, target_columns = ['doc_id', 'rating', 'review'])
res.result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.2 - Transform a Numeric Column to Categorical</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use <b>ConvertTo</b> to transform the "rating" column from INTEGER to VARCHAR data type.  Note that ConvertTo can accept multiple columns or column ranges as target columns, together with the target data type(s).  Next, check the resulting column dtypes and database data types.</p>

In [None]:
res = ConvertTo(data = tdf_reviews, 
                target_columns = 'rating', 
                target_datatype = 'VARCHAR(charlen=11,charset=UNICODE,casespecific=NO)')


In [None]:
ipydisplay(res.result.dtypes)
ipydisplay(res.result.tdtypes)

<hr>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.3 - Test/Train Split</b></p>

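<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fast <b>sample</b> function can split the data into multiple data sets in seconds.  Here, sample 1 (2% of the rows) becomes the testing set and sample 2 (8% of the rows) becomes the training set.  Use Matplotlib to plot the rating distribution of each split.</p>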
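
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, sampling by a list of fractions assigns each selected row a sample id that matches the position of its fraction in the list. The plain-Python sketch below illustrates that idea only. It is not how Vantage implements sampling, and the function name <code>assign_sample_ids</code>, the seed, and the row count are hypothetical.</p>

```python
import random

def assign_sample_ids(n_rows, fracs, seed=42):
    """Assign row indexes to non-overlapping samples; returns {row_index: sample_id}."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)                      # random order, no row used twice
    assignment, start = {}, 0
    for sample_id, frac in enumerate(fracs, start=1):
        size = int(round(n_rows * frac))
        for row in order[start:start + size]:
            assignment[row] = sample_id
        start += size
    return assignment

ids = assign_sample_ids(1000, [0.02, 0.08])
test_rows = [row for row, sid in ids.items() if sid == 1]    # first fraction
train_rows = [row for row, sid in ids.items() if sid == 2]   # second fraction
print(len(test_rows), len(train_rows))
```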

In [None]:
tdf_samples = res.result.sample(frac = [0.02, 0.08])
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'text_train', schema_name = 'demo_ofs', if_exists = 'replace')
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'text_test', schema_name = 'demo_ofs', if_exists = 'replace')

tdf_train = DataFrame('"demo_ofs"."text_train"')
tdf_test = DataFrame('"demo_ofs"."text_test"')

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols = 2)

# Count the reviews per rating in each split and plot the two distributions side by side
df1 = tdf_train.groupby('rating').count().to_pandas(index_column = 'rating')[['count_doc_id']]
df1.sort_index().plot(kind = 'bar', ax = ax1)
ax1.set_title(f"Training Set, {df1['count_doc_id'].sum()} records")
df2 = tdf_test.groupby('rating').count().to_pandas(index_column = 'rating')[['count_doc_id']]
df2.sort_index().plot(kind = 'bar', ax = ax2)
ax2.set_title(f"Testing Set, {df2['count_doc_id'].sum()} records")

plt.show()

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.4 - Use TextParser to prepare text for analysis</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/TD_TextParser'>TextParser</a> Function performs the following actions:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Splits the text in the specified column into "tokens" based on a delimiter and creates a row for each token</li>
    <li>Optionally removes select punctuation</li>
    <li>Optionally converts the text to lowercase</li>
    <li>Removes predefined "Stop Words" from the text</li>
    <li>Performs "Stemming" operations to modify the token to its root form</li>
    </ul>
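
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The actions above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the TextParser implementation: the stop word set is a tiny hypothetical subset, the helper name <code>prepare</code> is invented for this example, and stemming is omitted.</p>

```python
import string

# Tiny illustrative stop word subset (hypothetical; the demo uses a full table)
STOP_WORDS = {"a", "the", "and", "of", "is", "this"}

def prepare(doc_id, text, stop_words=STOP_WORDS):
    """Return one (doc_id, token) row per remaining token."""
    # Replace punctuation with spaces, convert to lowercase, split on whitespace
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = text.translate(table).lower().split()
    # Drop stop words; each surviving token becomes its own row
    return [(doc_id, tok) for tok in tokens if tok not in stop_words]

rows = prepare(1, "This is the BEST coffee, and the price is great!")
print(rows)
```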

<p style = 'font-size:18px;font-family:Arial;color:#00233C'>Stop Words</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Stop Words are common words that don't provide much meaning, and are normally dropped from text prior to analysis or processing.  Common English Stop Words include "a", "the", "and", "of", etc.  For purposes of this demonstration, a Stop Words table has been created using the open-source Natural Language Toolkit (NLTK) <a href = 'https://gist.github.com/sebleier/554280'>list of stopwords</a>.</p>

In [None]:
tdf_stopwords = DataFrame('"demo"."stop_words"')
tdf_stopwords.sample(5)

In [None]:
train_tokens = TextParser(data = tdf_train, 
                          object = tdf_stopwords, 
                          punctuation="!#$%&()*+<>\",-./:;?@\\^_`{|}~''",
                          delimiter=None,
                          text_column = 'review', 
                          remove_stopwords = True,
                          accumulate = ['doc_id', 'rating'])
train_tokens.result.sample(5)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 2 - Model Training</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/TD_NaiveBayesTextClassifierTrainer'>NaiveBayesTextClassifierTrainer</a> function takes the table of tokens as input; each token retains its original document id (doc_id) and rating (a categorical value from 1 to 5).  It writes out a model table containing the probability of each token within each category.  Function parameters include:</p>


<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Column containing tokens</li>
    <li>Column containing categories</li>
    <li>Model Type - either Multinomial or Bernoulli.  Multinomial models how many times each token occurs in a document, whereas Bernoulli models only whether each token is present or absent</li>
    </ul>
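
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the idea of a token-by-category probability table concrete, here is a minimal plain-Python sketch of multinomial training. It is not the Vantage implementation: the function name <code>train_multinomial</code>, the toy rows, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.</p>

```python
from collections import Counter, defaultdict

def train_multinomial(rows):
    """rows: iterable of (category, token). Returns {(token, category): P(token | category)}."""
    counts = defaultdict(Counter)
    vocab = set()
    for category, token in rows:
        counts[category][token] += 1
        vocab.add(token)
    probs = {}
    for category, token_counts in counts.items():
        total = sum(token_counts.values())
        for token in vocab:
            # Add-one smoothing (an assumption): unseen pairs keep a non-zero probability
            probs[(token, category)] = (token_counts[token] + 1) / (total + len(vocab))
    return probs

# Hypothetical toy token table: (rating, token)
toy_rows = [("5", "great"), ("5", "great"), ("5", "coffee"), ("1", "stale"), ("1", "coffee")]
toy_model = train_multinomial(toy_rows)
print(toy_model[("great", "5")])
```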

In [None]:
model = NaiveBayesTextClassifierTrainer(data = train_tokens.result, 
                                        doc_category_column = 'rating', 
                                        token_column = 'token', 
                                        model_type = 'Multinomial')

model.model_data

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 3 - Model Scoring and Evaluation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Run a prediction on the testing data that was split off above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>ClassificationEvaluator</a> function.</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Tokenize the Testing data that was split above - use the same function parameters</li>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/NaiveBayesTextClassifierPredict'>NaiveBayesTextClassifierPredict</a> using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
    <li>Investigate the Confusion Matrix and additional metrics values</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.1 - Tokenize the Testing Data</b></p>

In [None]:
test_tokens = TextParser(data = tdf_test, 
                         object = tdf_stopwords, 
                         punctuation="!#$%&()*+<>\",-./:;?@\\^_`{|}~''",
                         delimiter=None,
                         text_column = 'review', 
                         remove_stopwords = True,
                         accumulate = ['doc_id', 'rating'])

test_tokens.result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.2 - Execute the Prediction Function</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Text-Analytic-Functions/NaiveBayesTextClassifierPredict'>NaiveBayesTextClassifierPredict</a> function takes the model built by NaiveBayesTextClassifierTrainer as an input table, and outputs likelihood and probability information per document.  Additional parameters include (but are not limited to):</p>


<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Trained model information</li>
    <li>Input data table information</li>
    <li>Various output parameters</li>
    </ul>
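
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The sketch below shows, in plain Python, how a per-document likelihood and probability can be derived from a token-by-category probability table. It is an illustration only, not the Vantage implementation. The toy probabilities, the priors, the helper names <code>predict</code> and <code>to_probabilities</code>, and the choice to ignore unknown tokens are all assumptions.</p>

```python
import math

# Hypothetical toy model: P(token | category) and category priors
token_prob = {
    ("great", "5"): 0.5, ("coffee", "5"): 0.3, ("stale", "5"): 0.2,
    ("great", "1"): 0.2, ("coffee", "1"): 0.4, ("stale", "1"): 0.4,
}
prior = {"5": 0.6, "1": 0.4}

def predict(tokens):
    """Return (best_category, {category: log-likelihood}) for one document's tokens."""
    scores = {}
    for category, p in prior.items():
        score = math.log(p)                      # sum logs to avoid numeric underflow
        for tok in tokens:
            if (tok, category) in token_prob:    # assumption: skip tokens unknown to the model
                score += math.log(token_prob[(tok, category)])
        scores[category] = score
    return max(scores, key=scores.get), scores

def to_probabilities(scores):
    """Normalize log-likelihoods into probabilities that sum to 1."""
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}

label, scores = predict(["great", "coffee", "great"])
probs = to_probabilities(scores)
print(label, probs)
```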

In [None]:
pred = NaiveBayesTextClassifierPredict(newdata = test_tokens.result, 
                                       object = model.model_data,
                                       accumulate = ['rating'],
                                       input_token_column = 'token',
                                       responses = ['1','2','3','4','5'],
                                       output_prob = True,
                                       model_prob_column = 'prob',
                                       model_category_column = 'category',
                                       model_token_column = 'token', 
                                       doc_id_columns = 'doc_id',
                                       newdata_partition_column = 'doc_id')
pred.result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.3 - Evaluate the Model Accuracy</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>ClassificationEvaluator</a> Function.</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
    <li>Investigate the Confusion Matrix and additional metrics values</li>
    <li>Alternatively, create a heatmap using open-source tools</li>
    </ol>
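
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition about what a confusion matrix and its derived metrics contain, here is a minimal plain-Python sketch. It is not the ClassificationEvaluator implementation. The function name <code>evaluate</code> and the toy labels are hypothetical, and rows represent actual classes while columns represent predicted classes.</p>

```python
from collections import Counter

def evaluate(actual, predicted, labels):
    """Return (confusion matrix, accuracy, per-class precision/recall)."""
    pairs = Counter(zip(actual, predicted))
    # matrix[i][j] = number of items with actual labels[i] and predicted labels[j]
    matrix = [[pairs[(a, p)] for p in labels] for a in labels]
    accuracy = sum(pairs[(lab, lab)] for lab in labels) / len(actual)
    per_class = {}
    for i, lab in enumerate(labels):
        true_pos = matrix[i][i]
        n_predicted = sum(row[i] for row in matrix)   # column total
        n_actual = sum(matrix[i])                     # row total
        per_class[lab] = {
            "precision": true_pos / n_predicted if n_predicted else 0.0,
            "recall": true_pos / n_actual if n_actual else 0.0,
        }
    return matrix, accuracy, per_class

# Hypothetical toy ratings
actual    = ["1", "1", "5", "5", "5", "3"]
predicted = ["1", "5", "5", "5", "3", "3"]
matrix, accuracy, per_class = evaluate(actual, predicted, ["1", "3", "5"])
print(matrix, accuracy)
```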

In [None]:
acc = ClassificationEvaluator(data = pred.result[['prediction','rating']], num_labels = 5,
                              observation_column = 'rating', prediction_column = 'prediction')

In [None]:
ipydisplay(acc.result)
ipydisplay(acc.output_data)

In [None]:
df_pred = pred.result.to_pandas()
cm = confusion_matrix(df_pred['rating'], df_pred['prediction'])
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['1', '2', '3', '4', '5'])
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax)

plt.show()

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Cleanup</b></p>

In [None]:
db_drop_table('text_train', schema_name = 'demo_ofs')
db_drop_table('text_test', schema_name = 'demo_ofs')

In [None]:
remove_context()