<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Resume Classification Model Training using Vantage In-DB Functions
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>This notebook is part of the Resume Analyzer application.
It uses a collection of resume examples from <a href ="https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset">Kaggle</a>
to classify a given resume into one of the labels defined in the dataset. Each resume is assigned its label in the CSV file, where the file name serves as the ID.</p>
<br>

<p style = 'font-size:16px;font-family:Arial'><b>The data used for model training consists of:</b>
<li style = 'font-size:14px;font-family:Arial'><code>ID:</code> Unique identifier and file name of the respective PDF.</li>
<li style = 'font-size:14px;font-family:Arial'><code>Resume_str:</code> The resume text as a plain string.</li>
<li style = 'font-size:14px;font-family:Arial'><code>Category:</code> Category of the job that the resume was used to apply for.</li>
</p>

<p></p>    
<br>  
<p style = 'font-size:16px;font-family:Arial'>The dataset contains the following categories: HR, Designer, Information-Technology, Teacher, Advocate, Business-Development, Healthcare, Fitness, Agriculture, BPO, Sales, Consultant, Digital-Media, Automobile, Chef, Finance, Apparel, Engineering, Accountant, Construction, Public-Relations, Banking, Arts, Aviation

</p>



<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')


from teradataml import *

# Modify the following to match the specific client environment settings
display.max_rows = 5


<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Resume_Classification_model_training.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage on your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell loads the data into local storage.</p>

In [None]:
%run -i ../UseCases/run_procedure.py "call get_data('DEMO_ResumeClassification_local');"           # Takes 1 minute

<p style = 'font-size:16px;font-family:Arial'>Optional step – if you want to see status of databases/tables created and space used.</p>


In [None]:
%run -i ../UseCases/run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Explore the dataset</b>
<p style = 'font-size:16px;font-family:Arial'>The dataset consists of the text of various resumes, together with the ID and category of each resume.</p>

In [None]:
tdf = DataFrame(in_schema("DEMO_ResumeClassification", "Resume_data"))
print("Shape of the data: ", tdf.shape)
tdf

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Check model table</b>
<br>
<p style = 'font-size:16px;font-family:Arial'>Before starting the process, we check whether the model table already exists.</p>

In [None]:
from IPython.display import display, Markdown

def display_msg(msg):
    """Render msg as a highlighted note in the notebook output."""
    return display(Markdown(
        f"""<div class="alert alert-block alert-info">
        <p style='font-size:20px;font-family:Arial'><b>Note: </b><i>{msg}</i></p></div>"""))

In [None]:
model_df=DataFrame.from_query('''select tablename from dbc.tablesv where databasename = 'demo_user' and tablename = 'resume_category_model_tb';''')
if model_df.shape[0]==1:
    display_msg('Model table exists. No need to execute the below steps')
else:
    display_msg('Please continue with the below steps and create the model table')

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>4. Model Creation</b>
<br>
<p style = 'font-size:16px;font-family:Arial'>The model table must exist before resumes can be classified. We can either create the model table from the raw data or load it from the cloud. Creating the model table from scratch takes <b>30-35</b> minutes; loading it from the cloud takes <b>1-2</b> minutes.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Note: To load the model table from the cloud, enter "no". To create the model table, enter "yes" and follow the steps below.</b>
</p>
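<p style = 'font-size:16px;font-family:Arial'>The cells in the following sections share one pattern: try to create a table and, if that fails (typically because the table is left over from an earlier run), drop it and try once more. The sketch below shows that pattern in isolation. The helper name <code>create_with_retry</code> and the in-memory stand-ins are hypothetical and are not part of teradataml; in this notebook the two callables would wrap <code>execute_sql</code> and <code>db_drop_table</code>.</p>

```python
def create_with_retry(create, drop):
    """Run create(); if it raises, run drop() and try create() once more."""
    try:
        create()
    except Exception:
        drop()
        create()


# In-memory stand-ins for the database, used only to demonstrate the helper.
existing_tables = {"resume_category_model_tb"}  # pretend it is left over from an earlier run
log = []


def fake_create():
    if "resume_category_model_tb" in existing_tables:
        raise RuntimeError("table already exists")
    existing_tables.add("resume_category_model_tb")
    log.append("created")


def fake_drop():
    existing_tables.discard("resume_category_model_tb")
    log.append("dropped")


create_with_retry(fake_create, fake_drop)
print(log)  # ['dropped', 'created']
```

<p style = 'font-size:16px;font-family:Arial'>The retry happens only once, so a failure with any other cause (for example, a missing source table) still surfaces as an error on the second attempt.</p>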

In [None]:
# Ask whether to build the model from scratch or load the prebuilt model table
generate = input("Do you want to create a new model? ('yes'/'no'): ").strip()

# Check the user's input
if generate.lower() == 'no':
    print("\nGreat! We'll load the model table from cloud.")

    print("\nLoading model table from cloud, please wait...")
    qry="""CREATE MULTISET TABLE DEMO_USER.resume_category_model_tb 
     (
      token VARCHAR(37) CHARACTER SET UNICODE NOT CASESPECIFIC,
      category VARCHAR(100) CHARACTER SET UNICODE NOT CASESPECIFIC,
      prob FLOAT)
    NO PRIMARY INDEX;"""
    
    qry1="""Insert into resume_category_model_tb 
        Select "token", "category", "prob" from DEMO_ResumeClassification_db.Resume_Model_Data;"""

    try:
        execute_sql(qry)
        execute_sql(qry1)
    except Exception:
        # The table probably exists from an earlier run: drop it and retry once
        db_drop_table('resume_category_model_tb')
        execute_sql(qry)
        execute_sql(qry1)
    print('Table Loaded')


    display_msg("\nModel table loaded successfully! Please click <a href='#launch_app'>here</a> to launch the app")

elif generate.lower() == 'yes':
    display_msg("\nTo create the model again you will have to execute all the steps below")
    
else:
    print("\nInvalid input. Please enter 'yes' or 'no' to proceed.")

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Text Parsing</b>
<br>
</p>
<p style = 'font-size:16px;font-family:Arial'>A text parser, also known as a text tokenizer, breaks text into its constituent parts, such as words, phrases, sentences, or other meaningful units. Text parsing is an important technique in natural language processing (NLP) and is used in a wide range of applications, from search engines and chatbots to email filters and data analysis tools.</p>
<br>
<p style = 'font-size:16px;font-family:Arial'>The TD_TextParser function performs the following operations:
<li style = 'font-size:16px;font-family:Arial'>Tokenizes the text in the specified column</li>
<li style = 'font-size:16px;font-family:Arial'>Removes punctuation from the text and converts the text to lowercase</li>
<li style = 'font-size:16px;font-family:Arial'>Removes stop words from the text</li>
<li style = 'font-size:16px;font-family:Arial'>Creates a row for each word in the output table</li>
<li style = 'font-size:16px;font-family:Arial'>Performs stemming; that is, the function identifies the common root form of a word by removing or replacing word suffixes</li></p>
<br>
</p>
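<p style = 'font-size:16px;font-family:Arial'>To make these operations concrete, here is a minimal pure-Python sketch of the same steps applied to one sentence. It is an illustration only and not how TD_TextParser is implemented: the stop-word list, the suffix-stripping rule, and the names <code>STOP_WORDS</code>, <code>crude_stem</code>, and <code>parse_text</code> are assumptions made for this example.</p>

```python
import re

# Tiny illustrative stop-word list (an assumption; the in-database function uses its own).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for", "is"}


def crude_stem(word):
    """Strip a few common suffixes; a simplistic stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse_text(doc_id, text):
    """Return one (doc_id, token) row per word that survives the cleanup."""
    lowered = text.lower()                         # convert to lowercase
    no_punct = re.sub(r"[.,?!]", " ", lowered)     # remove punctuation
    words = [w for w in no_punct.split() if w not in STOP_WORDS]  # drop stop words
    return [(doc_id, crude_stem(w)) for w in words]               # stem, one row per word


rows = parse_text(101, "Managed the training of new teachers, and designed courses!")
for row in rows:
    print(row)
# (101, 'manag'), (101, 'train'), (101, 'new'), (101, 'teacher'), (101, 'design'), (101, 'cours')
```

<p style = 'font-size:16px;font-family:Arial'>Stems such as <code>manag</code> and <code>cours</code> are not dictionary words. That is expected: stemming only needs to map related word forms to the same token.</p>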

In [None]:
if generate.lower() == 'no':
    display_msg("\nModel is already loaded from cloud, so this step is not required")

elif generate.lower() == 'yes':
    # Raw string so the backslashes in the Punctuation pattern reach the database unchanged
    qry = r'''CREATE MULTISET TABLE tfidf_input_tokenized AS (
        SELECT ResID,
            CAST(token AS VARCHAR(15)) AS token,
            category
            FROM TD_TextParser (
            ON DEMO_ResumeClassification.Resume_data AS InputTable
            USING
            TextColumn ('Resume_str')
            ConvertToLowerCase ('true')
            OutputByWord ('true')
            Punctuation ('\[.,?\!\]')
            RemoveStopWords ('true')
            StemTokens ('true')
            Accumulate('ResID', 'category') 
            ) AS dt ) WITH DATA Primary Index(ResID);'''

    try:
        execute_sql(qry)
    except Exception:
        # The table probably exists from an earlier run: drop it and retry once
        db_drop_table("tfidf_input_tokenized")
        execute_sql(qry)
else:
    print("\nInvalid input. Please enter 'yes' or 'no' to proceed.")        

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>6. TF-IDF</b>
<br>
</p>
<p style = 'font-size:16px;font-family:Arial'>Term Frequency-Inverse Document Frequency (TF-IDF) is a technique for evaluating the importance of a specific term in a specific document within a document set. Term frequency (tf) is the number of times that the term appears in the document. Inverse document frequency (idf) measures how rare the term is across the document set: the more documents contain the term, the lower its idf. The TF-IDF score for a term is tf * idf, so a term with a high TF-IDF score occurs often in the specific document but in few other documents, which makes it relevant to that document.</p>

<p style = 'font-size:16px;font-family:Arial'>We can use the TF-IDF scores as input for document clustering and classification algorithms, including:
<li style = 'font-size:16px;font-family:Arial'>Cosine-similarity</li>
<li style = 'font-size:16px;font-family:Arial'>Latent Dirichlet allocation</li>
<li style = 'font-size:16px;font-family:Arial'>K-means clustering</li>
<li style = 'font-size:16px;font-family:Arial'>K-nearest neighbors</li></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_TFIDF function represents each document as an N-dimensional vector, where N is the number of terms in the document set. Because most documents contain only a small share of those terms, the document vector is sparse. Each entry in the document vector is the TF-IDF score of a term.</p>
<br>
</p>
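<p style = 'font-size:16px;font-family:Arial'>The following pure-Python sketch computes TF-IDF scores for two tiny documents. It uses common textbook variants, which are assumptions for this example: log-scaled term frequency <code>1 + ln(count)</code>, smoothed idf <code>ln((1 + N) / (1 + df)) + 1</code>, and L2 scaling of each document vector. The exact formulas applied by TD_TFIDF may differ, and the function name <code>tf_idf</code> is hypothetical.</p>

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: [token, ...]} -> {doc_id: {token: score}} (sparse vectors)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each token.
    df = Counter(tok for tokens in docs.values() for tok in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        raw = {
            tok: (1 + math.log(c)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
            for tok, c in counts.items()
        }
        # L2 scaling: divide by the Euclidean length of the document vector.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        scores[doc_id] = {tok: v / norm for tok, v in raw.items()}
    return scores


docs = {
    1: ["teacher", "school", "teacher"],
    2: ["chef", "kitchen", "school"],
}
scores = tf_idf(docs)
for doc_id, vector in scores.items():
    print(doc_id, {tok: round(v, 3) for tok, v in vector.items()})
```

<p style = 'font-size:16px;font-family:Arial'>In document 1, <code>teacher</code> scores higher than <code>school</code>: it occurs twice and appears in only one document, while <code>school</code> appears in both.</p>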

In [None]:
if generate.lower() == 'no':
    display_msg("\nModel is already loaded from cloud, so this step is not required")

elif generate.lower() == 'yes':
    qry = '''CREATE MULTISET TABLE resume_tf_idf_train AS ( SELECT *
                FROM TD_TFIDF(
                ON tfidf_input_tokenized AS InputTable
                USING
                DocIdColumn ('ResID')
                TokenColumn ('token')
                TFNormalization ('LOG')
                IDFNormalization ('SMOOTH')
                Regularization ('L2')
                Accumulate('category')
                ) AS dt ) WITH DATA Primary Index(ResID);'''
    try:
        execute_sql(qry)
    except Exception:
        # The table probably exists from an earlier run: drop it and retry once
        db_drop_table("resume_tf_idf_train")
        execute_sql(qry)
        
else:
    print("\nInvalid input. Please enter 'yes' or 'no' to proceed.")     

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>7. Naive Bayes Text Classifier</b>
<br>
</p>
<p style = 'font-size:16px;font-family:Arial'>The TD_NaiveBayesTextClassifierTrainer function calculates the conditional probabilities for token-category pairs, the prior probabilities, and the missing token probabilities for all categories. The trainer function trains the model with the probability values, and the predict function uses the values to classify documents into categories.</p>

<br>
</p>
<p style = 'font-size:16px;font-family:Arial'><b>Note: The query below builds the model table, which takes approximately 30 minutes.</b></p>
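<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of the three kinds of probabilities named above, the sketch below trains a multinomial Naive Bayes model on a handful of (document, token, category) rows in pure Python. Add-one (Laplace) smoothing and the names <code>train_nb</code> and <code>predict_nb</code> are assumptions for this example; the in-database trainer may estimate its probabilities differently.</p>

```python
import math
from collections import Counter, defaultdict


def train_nb(rows):
    """rows: iterable of (doc_id, token, category).

    Returns (priors, cond, missing): prior probability per category,
    conditional probability per (category, token), and the probability
    assigned to a token never seen in that category.
    """
    docs_per_cat = defaultdict(set)
    token_counts = defaultdict(Counter)
    vocab = set()
    for doc_id, token, category in rows:
        docs_per_cat[category].add(doc_id)
        token_counts[category][token] += 1
        vocab.add(token)
    n_docs = sum(len(ids) for ids in docs_per_cat.values())
    priors = {c: len(ids) / n_docs for c, ids in docs_per_cat.items()}
    cond, missing = {}, {}
    for c, counts in token_counts.items():
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        cond[c] = {t: (n + 1) / denom for t, n in counts.items()}
        missing[c] = 1 / denom
    return priors, cond, missing


def predict_nb(tokens, priors, cond, missing):
    """Return the category with the highest log-posterior for the tokens."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c].get(t, missing[c])) for t in tokens
        )
    return max(priors, key=log_score)


rows = [
    (1, "teach", "Teacher"), (1, "school", "Teacher"),
    (2, "cook", "Chef"), (2, "kitchen", "Chef"),
    (3, "teach", "Teacher"), (3, "class", "Teacher"),
]
priors, cond, missing = train_nb(rows)
print(predict_nb(["teach", "class"], priors, cond, missing))   # Teacher
print(predict_nb(["kitchen", "cook"], priors, cond, missing))  # Chef
```

<p style = 'font-size:16px;font-family:Arial'>The <code>cond</code> dictionary plays a role similar to the <code>token</code>, <code>category</code>, and <code>prob</code> columns of the model table: one probability per token-category pair.</p>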

In [None]:
if generate.lower() == 'no':
    display_msg("\nModel is already loaded from cloud, so this step is not required")

elif generate.lower() == 'yes':
    qry = '''SELECT TOP 1 'Model table created' as output
    FROM TD_NaiveBayesTextClassifierTrainer(	
    ON resume_tf_idf_train AS InputTable	
    OUT PERMANENT  TABLE ModelTable (resume_category_model_tb)	
    USING	
    TokenColumn ('token')	
    ModelType ('Multinomial')	
    DocCategoryColumn ('category')	
    ) AS dt;'''

    try:
        execute_sql(qry)
    except Exception:
        # The model table probably exists from an earlier run: drop it and retry once
        db_drop_table("resume_category_model_tb")
        execute_sql(qry)

    print("Model table created")

else:
    print("\nInvalid input. Please enter 'yes' or 'no' to proceed.") 

<a id="launch_app"></a>
<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>8. Launch the Resume Analyzer App</b>
<br>
<p style = 'font-size:16px;font-family:Arial'>Click the button below to launch the Resume Analyzer application.</p>

<a href="https://resume-analyze.ci.clearscape.teradata.com/login/" target="_blank">
  <img src="images/app_button.png" alt="Launch App" width="200"/>
</a>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>9. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors on the next run. This section drops the work tables created during the demonstration.</p>

In [None]:
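# Drop each work table independently so that one missing table does not block the other
for work_table in ("resume_tf_idf_train", "tfidf_input_tokenized"):
    try:
        db_drop_table(work_table)
    except Exception:
        pass  # the table was never created, for example when the model was loaded from cloud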
    

<div class="alert alert-block alert-danger">
<p style = 'font-size:16px;font-family:Arial'><b>Note: <i>If you no longer need the model table and do not want to run Resume Classification from the application, drop the table using the cell below. Otherwise, do not drop the model table.</i></b></p>
</div>

In [None]:
# Request user's input
modeldrop = input("Do you want to drop the model table? ('yes'/'no'): ")

# Check the user's input
if modeldrop.lower() == 'yes':
    try:
        db_drop_table("resume_category_model_tb")
        display_msg("\nModel table is dropped. If you want to run Resume Classification, you will need to create model table again.")
    except:
        pass
else:
    display_msg("\nModel table is not dropped. Continue with Resume Classification.")
   
    

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../UseCases/run_procedure.py "call remove_data('DEMO_ResumeClassification');"        #Takes 5 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>