<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Generative Question Answering using Generative AI with Vantage</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>In the Question-Answering system using Generative AI demo, the combination of <b>RAG, Langchain, and LLM models</b> allows users to ask queries in layman's terms, retrieve relevant information from the Vantage tables, and generate accurate and concise answers based on the retrieved data. This integration of retrieval-based and generative-based approaches provides a powerful tool for extracting knowledge from structured sources and delivering user-friendly responses.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we build a generative question-answering system using LangChain, a powerful library for working with LLMs such as GPT-3.5, GPT-4, and BLOOM, together with JumpStart in ClearScape notebooks. Users can ask business questions in natural English and receive answers with data drawn from the relevant databases.</p>

<p style = 'font-size:16px;font-family:Arial'>The following diagram illustrates the architecture.</p>

<center><img src="images/vantage_qa_gen.png" alt="Generative_QA_architecture"  width=800 height=800/></center>

<br>
<p style = 'font-size:16px;font-family:Arial'>Before going any further, let's get a better understanding of RAG, LangChain, and LLMs.</p>

<ul style = 'font-size:16px;font-family:Arial'><li> <b>Retrieval-Augmented Generation (RAG):</b></li></ul>
<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp;RAG is a framework that combines the strengths of retrieval-based and generative-based approaches in question-answering systems. It uses both a retrieval model and a generative model to produce high-quality answers to user queries. The retrieval model is responsible for retrieving relevant information from a knowledge source, such as a database or documents. The generative model then takes the retrieved information as input and generates concise and accurate answers in natural language.</p>
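<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of the retrieve-then-generate flow, the sketch below uses a toy keyword retriever and a template-based stand-in for the generative model. The names <code>retrieve</code> and <code>generate</code> and the sample records are hypothetical; they are not part of this demo's pipeline, which uses an LLM and Vantage tables instead.</p>

```python
# Toy retrieve-then-generate flow (illustrative only; no LLM or database involved).
KNOWLEDGE = [
    "Customer 1 is married and purchased the product.",
    "Customer 2 is single and did not purchase the product.",
    "Customer 3 is married and did not purchase the product.",
]

def tokenize(text):
    """Lowercase a string and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents, top_k=2):
    """Rank documents by the number of words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def generate(question, context):
    """Stand-in for the generative model: it only restates the retrieved context."""
    return f"Question: {question}\nContext used: " + " ".join(context)

question = "Which married customer purchased the product?"
context = retrieve(question, KNOWLEDGE)
print(generate(question, context))
```

<p style = 'font-size:16px;font-family:Arial'>In a real RAG system, the retriever would query a database or document store, and the generator would be an LLM that receives the retrieved context in its prompt.</p>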

<ul style = 'font-size:16px;font-family:Arial'><li> <b>Langchain:</b></li></ul>
<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp; LangChain is an open-source framework for building applications on top of large language models; it is not a language model itself. It provides building blocks such as prompt templates, chains, and database connectors, which let an application pass a user's everyday-language question to an LLM together with the context the model needs. In this demo, LangChain's SQL database chain converts a question asked in layman's terms into a SQL query and returns the result as a readable answer.</p>

<ul style = 'font-size:16px;font-family:Arial'><li> <b>LLM Models (Large Language Models):</b></li></ul>
<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp; LLM models refer to the large-scale language models that are trained on vast amounts of text data.
These models, such as GPT-3 (Generative Pre-trained Transformer 3),  GPT-3.5, GPT-4, HuggingFace BLOOM, LLaMA, Google's FLAN-T5, etc. are capable of generating human-like text responses. LLM models have been pre-trained on diverse sources of text data, enabling them to learn patterns, grammar, and context from a wide range of topics. They can be fine-tuned for specific tasks, such as question-answering, natural language understanding, and text generation.
LLM models have achieved impressive results in various natural language processing tasks and are widely used in AI applications for generating human-like text responses.</p>

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration</li>
    <li>LLM Setup</li>
    <li>Run the query function</li>
    <li>Cleanup</li>
</ol>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Configuring the environment</b>

<p style = 'font-size:16px;font-family:Arial'>In order to utilize this demo, you will need an OpenAI API key. If you do not have one, please refer to the instructions provided in this guide to obtain your OpenAI API key: </p>

[Openai_setup_api_key_guide](..//Openai_setup_api_key/Openai_setup_api_key.md)

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages

# !pip install -r requirements.txt --quiet

<p style = 'font-size:16px;font-family:Arial'>
    <i>The above statements may need to be uncommented if you run the notebooks on a platform that does not have the libraries installed.  If you uncomment those installs, be sure to restart the kernel after executing those lines.</i>
</p>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import io
import os
import warnings
import numpy as np
import pandas as pd

# teradata lib
from teradataml import *

# LLM
import sqlalchemy
from sqlalchemy import create_engine
from langchain import PromptTemplate, SQLDatabase, SQLDatabaseChain, LLMChain

# Suppress warnings
warnings.filterwarnings('ignore')
# Limit the number of rows shown when a teradataml DataFrame is displayed
display.max_rows = 5

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Initiate a connection to Vantage</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Let's start by connecting to the Teradata system </b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO= Generative_Question_Answering_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables to access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. You can switch modes by changing which statement is commented out.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_MarketingCamp_cloud');"        # Takes 1 minute
# %run -i ../run_procedure.py "call get_data('DEMO_MarketingCamp_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Data Exploration</b>

<p style = 'font-size:16px;font-family:Arial'>The goal of the Marketing Campaign Effectiveness prediction is to reduce marketing resources by identifying customers who would purchase the product and thereby directing marketing efforts to them.</p>

<p style = 'font-size:16px;font-family:Arial'>The data is from the last marketing campaign, with thousands of rows of customer data such as age, job, marital status, education, etc.</p>

<p style = 'font-size:16px;font-family:Arial'>Each row is a snapshot of data taken during the last marketing campaign, and each column is a different variable. The dataset can be divided into three groups of input attributes plus the target attribute, as below:</p>
<p style = 'font-size:16px;font-family:Arial'> 
<ol style = 'font-size:16px;font-family:Arial'>
    <li>customer data, e.g. age, profession, education, monthly income, etc.</li>
    <li>attributes related to the last contact of the current campaign, e.g. contact, month, day, etc.</li>
    <li>other attributes, e.g. campaign, previous outcome, payment methods, etc.</li>
    <li>target attribute - purchased.</li>

</ol>
</p>

<p style = 'font-size:16px;font-family:Arial'>The source data from <a href="https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset">Kaggle</a> is loaded into Vantage and supplemented with information about city, monthly income, family members, etc. The data is stored in a Vantage table named <i>Retail_Marketing</i>.</p>

<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>3.1 Examine the Retail Marketing Campaign table</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Let's look at the sample data in the Retail_Marketing table.</p>

In [None]:
tdf = DataFrame(in_schema('DEMO_MarketingCamp', 'Retail_Marketing'))
df = tdf.to_pandas()
print("Data information: \n",tdf.shape)
tdf.sort('customer_id')

<p style = 'font-size:16px;font-family:Arial'>There are 11K records in all, with 23 variables. <i>purchased</i> is the target variable of the marketing campaign use case. In this demo, we do not train a classifier; we use the table as the data source for questions asked in natural language.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. LLM setup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1 Connect to databases using SQL Alchemy</b></p>    

<p style = 'font-size:16px;font-family:Arial'>Under the hood, LangChain uses SQLAlchemy to connect to SQL databases. The SQLDatabaseChain can therefore be used with any SQL dialect supported by SQLAlchemy, such as Teradata Vantage, MS SQL, MySQL, MariaDB, PostgreSQL, Oracle SQL, and SQLite. Please refer to the <a href="https://docs.sqlalchemy.org/en/20/"> SQLAlchemy documentation</a> for more information about requirements for connecting to your database.</p>

<p style = 'font-size:16px;font-family:Arial'>Important: The code below establishes a database connection for data sources and Large Language Models. Please note that the solution will only work if the database connection for your sources is defined in the cell below.</p>

<p style = 'font-size:16px;font-family:Arial'>Next, build a consolidated data catalog by combining the database, table, and column names in pipe-delimited format.</p>
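<p style = 'font-size:16px;font-family:Arial'>To make the pipe-delimited layout concrete, here is a minimal standalone sketch that produces one <code>database|table|column</code> line per column. The helper <code>build_catalog</code> and the short column list are hypothetical and used for illustration only; the notebook cell that follows reads the column names from the teradataml DataFrame.</p>

```python
# Build a pipe-delimited catalog: one "database|table|column" line per column.
def build_catalog(database, table, column_names):
    """Return (catalog_text, comma_separated_columns) for one table."""
    lines = [f"{database}|{table}|{name}" for name in column_names]
    return "\n".join(lines), ",".join(column_names)

# Hard-coded sample columns (assumption: a small subset chosen for illustration).
sample_columns = ["customer_id", "age", "marital", "purchased"]
catalog_text, column_list = build_catalog(
    "DEMO_MarketingCamp", "Retail_Marketing", sample_columns
)
print(catalog_text)
```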

In [None]:
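# Wrap the Vantage SQLAlchemy engine in a LangChain SQLDatabase object
db_vantage = SQLDatabase(eng)

def parse_catalog(source, source_dataset):
    # Build one "database|table|column" line for each column of the table loaded in tdf
    columns_str = ''

    for c in tdf.columns:
        columns_str = columns_str + f"\n{source}|{source_dataset}|{c}"

    # Return the catalog text and a comma-separated list of column names
    return columns_str, ",".join(tdf.columns)

catalog, columns = parse_catalog(source = 'DEMO_MarketingCamp', source_dataset = 'Retail_Marketing')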

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.2 Define LLM model</b></p>  

<p style = 'font-size:16px;font-family:Arial'>Define the large language model here. Make sure to set your API key in the <b>OPENAI_API_KEY</b> environment variable.</p>

<p style = 'font-size:16px;font-family:Arial'>In OpenAI's language models, the <b>temperature</b> parameter controls the randomness of the generated text. It affects the diversity and creativity of the model's responses.</p>

<p style = 'font-size:16px;font-family:Arial'>A higher temperature value, such as 1.0 or above, increases the randomness and diversity of the generated output. This can lead to more varied and surprising responses, but it may also result in less coherence and occasional nonsensical outputs.</p>

<p style = 'font-size:16px;font-family:Arial'>On the other hand, a lower temperature value, such as 0.2 or below, reduces randomness and makes the model's output more focused and deterministic. The generated text is likely to be more conservative, sticking closely to patterns observed in the training data.</p>

<p style = 'font-size:16px;font-family:Arial'>Choosing an appropriate temperature value depends on the desired output. Higher temperatures can be useful for creative tasks or brainstorming, while lower temperatures are preferred when you need more control over the output, such as when generating specific responses or following a particular style.</p>
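<p style = 'font-size:16px;font-family:Arial'>The effect can be illustrated with a temperature-scaled softmax, a common way to describe how temperature reshapes the probabilities of candidate tokens. This is a minimal sketch with made-up scores; the function name is hypothetical, and how a hosted model applies temperature internally is an assumption here.</p>

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

<p style = 'font-size:16px;font-family:Arial'>At low temperature, most of the probability mass moves to the highest-scoring token; at high temperature, the distribution flattens and less likely tokens are sampled more often.</p>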

In [None]:
from langchain.llms import OpenAI

# OpenAI API
# To authenticate your requests with the OpenAI API, set the OPENAI_API_KEY environment variable.
# Uncomment the line below and replace 'sk-XXXX' with your actual API key:
# os.environ["OPENAI_API_KEY"] = "sk-XXXX"

# Call the OpenAI model through its API. A temperature of 0.9 gives fairly varied output;
# a lower value makes the generated SQL more repeatable.
llm = OpenAI(temperature=0.9)

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b> 4.3 Determine the best data channel to answer the user query</b></p>

<p style = 'font-size:16px;font-family:Arial'>To find the channel, we are taking the following steps:</p>

<ol style = 'font-size:16px;font-family:Arial'> 
    <li>First, we create the prompt template that LangChain will use by inserting the consolidated data catalog into it.</li>
    <li>Next, we pass this prompt template, along with the user query, to the LangChain LLM chain to find the best data source to answer the question. LangChain uses the LLM of our choice to match the question against the source metadata.</li>
    <li>Return the name of the data source and the channel from this function as the final step.</li>
</ol>
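<p style = 'font-size:16px;font-family:Arial'>The first step can be sketched without LangChain by using plain string formatting as a stand-in for a prompt template. The two-line sample catalog and the helper <code>build_channel_prompt</code> are hypothetical and only illustrate how the fixed catalog text and the per-question placeholder are combined.</p>

```python
# Plain-Python stand-in for a prompt template: the catalog is fixed text,
# while {query} is filled in for each user question.
sample_catalog = (
    "DEMO_MarketingCamp|Retail_Marketing|marital\n"
    "DEMO_MarketingCamp|Retail_Marketing|purchased"
)

template = (
    "From the table below, find the database that contains the data "
    "needed to answer the question:\n{query}\n\n"
    + sample_catalog
    + "\n\nGive your answer as database == \n"
    "Also give your answer as database.table == "
)

def build_channel_prompt(query):
    """Fill the user question into the template."""
    return template.format(query=query)

prompt_text = build_channel_prompt("How many married customers have purchased the product?")
print(prompt_text)
```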

In [None]:
# define a function that infers the channel/database/table and sets the database for querying
def identify_channel(query):
    
    # Prompt 1 'Infer Channel'
    # Set the prompt template. It instructs the LLM on how to evaluate and respond to the user query.
    # It is referred to as dynamic since the data catalog is generated first and then appended to the prompt.
    prompt_template = (
        """
     From the table below, find the database (in column database) which will contain the data (in corresponding column_names) to answer the question 
     {query} \n
     """
        + catalog
        + """ 
     Give your answer as database == 
     Also,give your answer as database.table == 
     """
    )
    
    #define prompt 1
    PROMPT_channel = PromptTemplate(template=prompt_template, input_variables=["query"])
   
    # define llm chain
    llm_chain = LLMChain(prompt=PROMPT_channel, llm=llm)
    generated_texts = llm_chain.run(query)    

    # set the channel from where the query can be answered
    # (the catalog in this demo contains a single database, DEMO_MarketingCamp)
    if "DEMO_MarketingCamp" in generated_texts:
        channel = "db"
        db = db_vantage
    else:
        raise Exception(
            "User question cannot be answered by any of the channels mentioned in the catalog"
        )
    
    return channel, db

<p style = 'font-size:16px;font-family:Arial'>The resulting text includes details such as the names of the database and tables used to execute the user query.</p>

<p style = 'font-size:16px;font-family:Arial'>For instance, if the user query is <b>How many male customers have purchased the product?</b> The generated_text will contain the information <b> database == DEMO_MarketingCamp</b> and <b>database.table == DEMO_MarketingCamp.Retail_Marketing</b></p>
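<p style = 'font-size:16px;font-family:Arial'>If the database and table names are needed as separate values, a reply in this format could be parsed with regular expressions, as in the sketch below. The helper <code>parse_channel</code> is hypothetical, and the sample reply is written by hand; an actual LLM reply may be worded differently, so the function returns <code>None</code> for anything it cannot find.</p>

```python
import re

def parse_channel(generated_text):
    """Extract the database and database.table names from the model's reply."""
    db_match = re.search(r"database\s*==\s*([A-Za-z_]\w*)", generated_text)
    table_match = re.search(
        r"database\.table\s*==\s*([A-Za-z_]\w*\.[A-Za-z_]\w*)", generated_text
    )
    database = db_match.group(1) if db_match else None
    table = table_match.group(1) if table_match else None
    return database, table

# Hand-written sample reply in the format described above.
sample_reply = (
    "database == DEMO_MarketingCamp\n"
    "database.table == DEMO_MarketingCamp.Retail_Marketing"
)
print(parse_channel(sample_reply))
```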

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b> 4.4 Generate response to user query</b></p>

<p style = 'font-size:16px;font-family:Arial'>Next, we run LangChain’s SQL database chain to convert the question to SQL, run the generated SQL against the database, and return the results in simple, readable language. We take the following steps:</p>

<ol style = 'font-size:16px;font-family:Arial'> 
    <li>Let's begin by establishing a prompt template that guides the Language Model (LLM) to generate SQL statements in a dialect that adheres to correct syntax. Subsequently, we execute these generated statements against the corresponding database.</li>
    <li>Finally, we pass the LLM, database connection, and prompt to the SQL database chain if channel is db and run the SQL query:</li>
</ol>
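<p style = 'font-size:16px;font-family:Arial'>The generate-SQL, run, and answer sequence can be sketched with Python's built-in <code>sqlite3</code> module. Everything here is an assumption made for illustration: an in-memory SQLite table with made-up rows stands in for Vantage, and the hypothetical <code>fake_text_to_sql</code> function returns a fixed query in place of the LLM.</p>

```python
import sqlite3

# In-memory SQLite table standing in for Retail_Marketing (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Retail_Marketing (customer_id INTEGER, marital TEXT, purchased TEXT)"
)
conn.executemany(
    "INSERT INTO Retail_Marketing VALUES (?, ?, ?)",
    [(1, "married", "yes"), (2, "single", "yes"), (3, "married", "no"), (4, "married", "yes")],
)

def fake_text_to_sql(question):
    """Hypothetical stand-in for the LLM: returns a fixed query for one known question."""
    return (
        "SELECT COUNT(*) FROM Retail_Marketing "
        "WHERE marital = 'married' AND purchased = 'yes'"
    )

def answer(question):
    """Generate SQL, run it, and phrase the result as a sentence."""
    sql = fake_text_to_sql(question)
    (count,) = conn.execute(sql).fetchone()
    return sql, f"{count} married customers have purchased the product."

sql_text, answer_text = answer("How many married customers have purchased the product?")
print(sql_text)
print(answer_text)
```

<p style = 'font-size:16px;font-family:Arial'>In the notebook, the SQL database chain performs these steps against Vantage, with the LLM both writing the query and phrasing the final answer.</p>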

In [None]:
# define a function that converts the user question to SQL, runs it against the database, and returns the answer
def run_query(query):

    # call the identify channel function first    
    channel, db = identify_channel(query)

    #Prompt 2 'Run Query'
    prompt_template_query = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.

    Do not append 'Query:' to SQLQuery.
    
    Do not remove dashes from the Query

    Display SQLResult after the query is run in plain english that users can understand. 

    Provide answer in simple english statement.

    Only use the following tables:

    {table_info}
    
    Only use the following Column names: \n
     """+columns +""" 
    
    If someone asks for the marketing, they really mean the DEMO_MarketingCamp.Retail_Marketing table.
    Use default DEMO_MarketingCamp as database
    Use default Retail_Marketing as table

    Question: {input}"""

    PROMPT_sql = PromptTemplate(
        input_variables=["input", "table_info", "dialect"], template=prompt_template_query
    )
    
    if channel == "db":
        db_chain = SQLDatabaseChain.from_llm(
            llm, db, prompt=PROMPT_sql, verbose=False, return_intermediate_steps=False,
        )
        response = db_chain.run(query)
    else:
        raise Exception("Unlisted channel. Check your unified catalog")
    return response

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b> 4.5 Format the answer and Display</b></p>

<p style = 'font-size:16px;font-family:Arial'>The following helper formats the query and the answer as Markdown so that they display properly in the notebook.</p>


In [None]:
from IPython.display import display, Markdown

def markdown_template(query, response):
    # Return an HTML snippet showing the user query and, in bold, the response
    return f"<p style = 'font-size:16px;font-family:Arial'>SQL and response from user query {query}  <br> <b>{response}</b></p>"

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Run the query function</b>

<p style = 'font-size:16px;font-family:Arial'>Run the run_query function that in turn calls the Langchain SQL Database chain to convert 'text to sql' and runs the query against the source data channel</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.1 Query 1</b></p>

<p style = 'font-size:16px;font-family:Arial'>For example, for the user query <b>How many married customers have purchased the product?</b> the answer is as follows:</p>

In [None]:
# Enter the query

query = """How many married customers have purchased the product?""" 

#Response from Langchain
response =  run_query(query)

display(Markdown(markdown_template(query, response)))

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.2 Query 2</b></p>

<p style = 'font-size:16px;font-family:Arial'>For example, for the user query <b>What is the number of purchases made by customers who are in management professions?</b> the answer is as follows:</p>

In [None]:
# Enter the query

query = """What is the number of purchases made by customers who are in management professions?""" 

#Response from Langchain
response =  run_query(query)

display(Markdown(markdown_template(query, response)))

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.3 Query 3</b></p>

<p style = 'font-size:16px;font-family:Arial'>For example, for the user query <b>Which are the most common purchasing behaviours of customers?</b> the answer is as follows:</p>

In [None]:
# Enter the query

query = """Which are the most common purchasing behaviours of customers?""" 

#Response from Langchain
response =  run_query(query)

display(Markdown(markdown_template(query, response)))

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up work tables to prevent errors the next time the notebook is run.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_MarketingCamp');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Dataset:</b>

- `customer_id`: unique customer ID
- `age`: customer age (numeric)
- `profession` : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")
- `marital` : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
- `education`: customer education (categorical: "unknown","secondary","primary","tertiary")
- `city`: city of customer (categorical: 'New York','Los Angeles','Chicago','Houston','Phoenix','Philadelphia','San Antonio','San Diego','Dallas','San Jose')
- `monthly_income_in_thousand`: customer's monthly income, in thousands of dollars (numeric)
- `family_members`: number of family members (numeric)
- `communication_type`: communication type (categorical: "unknown","telephone","cellular")
- `last_contact_day`: last contact day of the month (numeric)
- `last_contact_month`: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- `credit_card`: does customer have a credit card? (binary: 'yes','no')
- `num_of_cars`: number of cars (numeric)
- `last_contact_duration`: last contact duration, in seconds (numeric)
- `campaign`: number of contacts performed during this campaign and for this client (categorical, includes last contact)
- `days_from_last_contact`: number of days that passed after the client was last contacted in a previous campaign (numeric, -1 means client was not previously contacted)
- `prev_contacts_performed`: number of contacts performed before this campaign and for this client (numeric)
- `prev_campaign_outcome`: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
- `payment_method`: payment method used by customer (categorical: 'cash','credit_card','debit_card','ewallets', 'payment_links', 'QRcodes')
- `purchase_frequency`: how frequently customer is purchasing (categorical: 'daily','weekly','biweekly','monthly','quarterly','yearly')
- `gender`: gender of customer (binary: 'male','female')
- `recency`: number of days since the last purchase (numeric)


Output variable (desired target):
- `purchased`: did the customer make a purchase? Target column (binary: 'yes','no')

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>Langchain Python reference: <a href='https://python.langchain.com/docs/get_started/introduction.html'>here</a></li>
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>