#### Exploring how the wording of a prompt affects the output of the ChatGPT API

In [1]:
import os
import openai
import tiktoken
from dotenv import load_dotenv, find_dotenv

import pandas as pd
import warnings
import time

warnings.filterwarnings('ignore')



In [2]:
# Read the API key from a local .env file; never hardcode a secret key in the notebook.
_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [3]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )

    content = response.choices[0].message["content"]
    token_dict = {
        'prompt_tokens':response['usage']['prompt_tokens'],
        'completion_tokens':response['usage']['completion_tokens'],
        'total_tokens':response['usage']['total_tokens'],
    }
    return content, token_dict
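
Every experiment in this notebook repeats the same timing and printing code around `get_completion`. A minimal sketch of a wrapper that factors this out is shown here. The names `timed_completion` and `fake_completion` are hypothetical and not part of the original notebook; `fake_completion` only stands in for `get_completion` so that the sketch runs without an API key.

```python
import time
from typing import Callable, Tuple


def timed_completion(prompt: str,
                     completion_fn: Callable[[str], Tuple[str, dict]],
                     label: str = "prompt"):
    """Call completion_fn(prompt), print the result, and return it with the elapsed time."""
    start = time.perf_counter()
    content, token_dict = completion_fn(prompt)
    elapsed = time.perf_counter() - start

    print(f"Time taken for {label}: {elapsed:.3f} s")
    print(f"\nCompletion for {label}:")
    print(content)
    print(f"\nToken dict for {label}:")
    print(token_dict)
    return content, token_dict, elapsed


# Hypothetical stand-in for get_completion (assumption: same return shape).
def fake_completion(prompt):
    usage = {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}
    return "SELECT 1;", usage


content, token_dict, elapsed = timed_completion("demo", fake_completion, label="demo")
```

In the notebook itself, `get_completion` would be passed in place of `fake_completion`.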

##### Principle 1 for prompting:
1. Use delimiters to mark the part of the prompt that the model should act on.
2. Ask the model to give the output in a specific structured format.
3. Ask the model to check for specific conditions in the input text of the prompt and to avoid making assumptions.
4. Perform few-shot prompting by providing the model with an example conversation. -> *How to do this in case of code generation?*
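
One possible answer to the few-shot question for code generation is to send worked question/SQL pairs as earlier turns of the conversation. The sketch below only builds the `messages` list; it makes no API call. The function name `build_few_shot_messages`, the demo schema, and the example pair are all hypothetical and serve as illustration only.

```python
def build_few_shot_messages(examples, question, schema):
    """Build a chat messages list: system prompt, example turns, then the real question."""
    # The schema goes into the system message once, so the examples do not repeat it.
    system = (
        "You are a SQL code generator. Reply with a SQL query only.\n"
        f"Schema: <{schema}>"
    )
    messages = [{"role": "system", "content": system}]
    for example in examples:
        # Each example is one user turn followed by the desired assistant turn.
        messages.append({"role": "user", "content": f"Question: <{example['question']}>"})
        messages.append({"role": "assistant", "content": example["sql"]})
    messages.append({"role": "user", "content": f"Question: <{question}>"})
    return messages


# Hypothetical schema and example, for illustration only.
demo_schema = "1. participant_id: who performed the task\n2. processing_time: time spent on the task"
demo_examples = [
    {
        "question": "Who has the shortest average processing time?",
        "sql": "SELECT participant_id FROM table_name GROUP BY participant_id "
               "ORDER BY AVG(processing_time) ASC LIMIT 1;",
    }
]
messages = build_few_shot_messages(
    demo_examples, "Who has the longest average processing time?", demo_schema
)
```

The resulting list has the same shape as the `messages` argument used in `get_completion`, so it could be passed to the same call. Each extra example adds prompt tokens, so the number of examples is a cost trade-off.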

In [None]:
"""
Context, Question, Answer, Theme
The dataset contains information about tasks completed by different participants in the claim process. This includes details like processing time, applications used, and number of keystrokes.
Who is the most productive participant based on the number of tasks completed?
The participant with the most tasks completed is John.
Productivity

The processing time refers to the time spent by a participant on a particular task. This could potentially be an indicator of the participant's efficiency.
Who has the shortest average processing time?
The participant with the shortest average processing time is Mary.
Productivity

The dataset includes the number of keystrokes made by the participants during each task.
Who uses the least number of keystrokes on average?
The participant with the least average number of keystrokes is Steve.
Productivity

The 'applications' field lists the applications used by a participant for a task. The use of multiple applications may indicate multitasking ability or complexity of the task.
Which participant uses the most applications on average?
John uses the most applications on average.
Productivity
"""

In [6]:
# Specify the path to the text file
file_path = '../data/sample/schema0.txt'

# Read the contents of the text file
with open(file_path, 'r') as file:
    file_contents = file.read()

# Print the contents of the text file
print(file_contents)


1. case_id : this is a unique text identifier which represents a unique process execution. For example, a claim number or a policy number or a customer number. 
2. case_status: this text identifier represents the status of case during the row observation. For example case_status may be open, in_progress, on_hold, closed, updated e.t.c. This field may not be available in some cases where it is not indicated on an agent/user's screen during observation. 
3. case_type: this text identifier represents the type of case being handled during the row observation. For instance, case_type may be billing, renewal, cancellation, quick_claims e.t.c depending on the case categories in the organization. This field may not be available in some cases where it is not indicated on an agent/user's screen during observation. 
4. task_name: this text field represents a business friendly name given to the action being performed by an agent/user/participant such that the context of this action is easily under

In [7]:
# Plain strings are enough here: none of the questions interpolate any values.
question_1 = "Who has the longest average processing time?"
question_2 = "Who has the shortest average processing time?"
question_3 = "Who is the most productive participant based on the number of tasks completed?"
question_4 = "Who uses the least number of keystrokes on average?"
question_5 = "Name of the participant that uses the most number of long cut keys on average?"  # An invalid question, to check hallucination
schema = file_contents

In [8]:
# ask model to generate code in SQL by looking at the schema and the input questions
prompt_1 = f""" Consider yourself to be a SQL code generator.
            Your task is to perform the following actions:
            1 - Read and understand the following question delimited by <>.
            2 - Read and understand the following schema delimited by <>.
            3 - Use the schema to generate correct SQL query for the question.
            4 - Output correct SQL query.

            Use the following format:
            Question: <User question for which SQL query has to be generated>
            Schema: <Schema of the database table according to which generate the correct SQL query for the given question>
            SQL query: <valid SQL code>

            Question: <{question_1}>
            Schema: <{schema}>
"""

start = time.time()
response_1, token_dict_1 = get_completion(prompt_1)
end = time.time()

print("Time taken for prompt_1:", end-start)
print("\nCompletion for prompt 1:")
print(response_1)
print("\nToken dict for prompt 1:")
print(token_dict_1)

Time taken for prompt_1: 2.034454107284546

Completion for prompt 1:
SQL query: <SELECT participant_id, AVG(turnaround_time) AS avg_processing_time FROM table_name GROUP BY participant_id ORDER BY avg_processing_time DESC LIMIT 1;>

Token dict for prompt 1:
{'prompt_tokens': 877, 'completion_tokens': 35, 'total_tokens': 912}


In [32]:
# ask model to generate code in SQL by looking at the schema and the input questions
prompt_2 = f""" Consider yourself to be a SQL code generator.
            Your task is to perform the following actions:
            1 - Read and understand the following question delimited by <>.
            2 - Read and understand the following schema delimited by <>.
            3 - Use the schema to generate correct SQL query for the question.
            4 - Output correct SQL query.
            5 - Omit any explanation in the end.

            Use the following format:
            Question: <User question for which SQL query has to be generated>
            Schema: <Schema of the database table according to which generate the correct SQL query for the given question>
            SQL query: <valid SQL code>

            Question: <{question_1}>
            Schema: <{schema}>
"""

start = time.time()
response_2, token_dict_2 = get_completion(prompt_2)
end = time.time()

print("Time taken for prompt_2:", end-start)
print("\nCompletion for prompt 2:")
print(response_2)
print("\nToken dict for prompt 2:")
print(token_dict_2)

Time taken for prompt_2: 2.3985209465026855

Completion for prompt 2:
SQL query: SELECT participant_id, AVG(turnaround_time) as avg_processing_time FROM table_name GROUP BY participant_id ORDER BY avg_processing_time DESC LIMIT 1

Token dict for prompt 2:
{'prompt_tokens': 889, 'completion_tokens': 32, 'total_tokens': 921}


In [33]:
# ask model to generate code in SQL by looking at the schema and the input questions
prompt_3 = f""" You are a SQL code generator.
            Perform the following actions:
            1 - Understand the question delimited by <>.
            2 - Understand the schema delimited by <>.
            3 - Use the schema to generate SQL query for the question.
            4 - Output correct SQL query.
            5 - Omit any explanation in the end.

            Use the following format:
            Question: <Question for which SQL query has to be generated>
            Schema: <Schema of the database table to be queried for the given question>
            SQL query: <valid SQL query>

            Question: <{question_1}>
            Schema: <{schema}>
"""

start = time.time()
response_3, token_dict_3 = get_completion(prompt_3)
end = time.time()

print("Time taken for prompt_3:", end-start)
print("\nCompletion for prompt 3:")
print(response_3)
print("\nToken dict for prompt 3:")
print(token_dict_3)

Time taken for prompt_3: 2.9995551109313965

Completion for prompt 3:
SQL query: 

SELECT participant_id, AVG(turnaround_time) as avg_process_time 
FROM table_name 
GROUP BY participant_id 
ORDER BY avg_process_time DESC 
LIMIT 1;

Token dict for prompt 3:
{'prompt_tokens': 870, 'completion_tokens': 38, 'total_tokens': 908}


In [34]:
# ask model to generate code in SQL by looking at the schema and the input questions
prompt_4 = f""" You are a SQL code generator.
            Perform the following actions:
            1 - Understand the question delimited by <>.
            2 - Understand the schema delimited by <>.
            3 - Use the schema to generate SQL query answering the question.
            4 - Output correct SQL query.
            5 - Omit any explanations.

            Use the following format:
            Question: <Question to be answered by the generated SQL query>
            Schema: <Schema of the database table to be queried for the given question>
            SQL query: <valid SQL query>

            Question: <{question_1}>
            Schema: <{schema}>
"""

start = time.time()
response_4, token_dict_4 = get_completion(prompt_4)
end = time.time()

print("Time taken for prompt_4:", end-start)
print("\nCompletion for prompt 4:")
print(response_4)
print("\nToken dict for prompt 4:")
print(token_dict_4)

Time taken for prompt_4: 2.4593677520751953

Completion for prompt 4:
SQL query: SELECT participant_id, AVG(turnaround_time) AS avg_processing_time
            FROM table_name
            GROUP BY participant_id
            ORDER BY avg_processing_time DESC
            LIMIT 1;

Token dict for prompt 4:
{'prompt_tokens': 867, 'completion_tokens': 41, 'total_tokens': 908}


In [41]:
# ask model to generate code in SQL by looking at the schema and the input questions
prompt_5 = f""" You are a SQL code generator.
            Perform the following actions:
            1 - Understand the question delimited by <>.
            2 - Understand the schema delimited by <>.
            3 - Use the schema to generate SQL query answering the question.
            4 - Output computationally most efficient SQL query.
            5 - Omit any explanations.

            Use the following format:
            Question: <Question to be answered by the generated SQL query>
            Schema: <Schema of the database table to be queried for the given question>
            SQL query: <valid SQL query>

            Question: <{question_1}>
            Schema: <{schema}>
"""

start = time.time()
response_5, token_dict_5 = get_completion(prompt_5)
end = time.time()

print("Time taken for prompt_5:", end-start)
print("\nCompletion for prompt 5:")
print(response_5)
print("\nToken dict for prompt 5:")
print(token_dict_5)

Time taken for prompt_5: 2.0653066635131836

Completion for prompt 5:
SQL query: 

SELECT participant_id 
FROM table_name 
GROUP BY participant_id 
ORDER BY AVG(turnaround_time) DESC 
LIMIT 1

Token dict for prompt 5:
{'prompt_tokens': 870, 'completion_tokens': 29, 'total_tokens': 899}


In [45]:
# ask model to generate code in SQL by looking at the schema and the input questions
# AVOID ASSUMPTIONS IN THE MODEL NOW
prompt_6 = f""" You are a SQL code generator.
            Perform the following actions:
            1 - Understand the question delimited by <>.
            2 - Understand the schema delimited by <>.
            3 - Do not assume anything, if there's an ambiguity, ask a followup question.
            4 - Use the schema to generate SQL query answering the question.
            5 - Output computationally most efficient SQL query.
            6 - Omit any explanations.

            Use the following format:
            Question: <Question to be answered by the generated SQL query>
            Schema: <Schema of the database table to be queried for the given question>
            SQL query: <valid SQL query>

            Question: <{question_1}>
            Schema: <{schema}>
"""

start = time.time()
response_6, token_dict_6 = get_completion(prompt_6)
end = time.time()

print("Time taken for prompt_6:", end-start)
print("\nCompletion for prompt 6:")
print(response_6)
print("\nToken dict for prompt 6:")
print(token_dict_6)

Time taken for prompt_6: 2.3339860439300537

Completion for prompt 6:
SQL query: 

SELECT participant_id 
FROM table_name 
GROUP BY participant_id 
ORDER BY AVG(turnaround_time) DESC 
LIMIT 1

Token dict for prompt 6:
{'prompt_tokens': 891, 'completion_tokens': 29, 'total_tokens': 920}
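
The six completions above come back in slightly different shapes: wrapped in `<>`, on a single line, or spread over several lines after the `SQL query:` label. A minimal sketch of a normalizer makes them easier to compare or execute. The helper name `extract_sql` is hypothetical, and the rules are heuristics based only on the shapes seen in this notebook.

```python
import re


def extract_sql(completion: str) -> str:
    """Strip the 'SQL query:' label, wrapping <> delimiters, extra whitespace and a trailing ';'."""
    text = completion.strip()
    # Drop the leading label, if the model echoed it.
    text = re.sub(r"^SQL query:\s*", "", text, flags=re.IGNORECASE)
    # Drop the <> delimiters when they wrap the whole query.
    if text.startswith("<") and text.endswith(">"):
        text = text[1:-1]
    # Collapse newlines and runs of spaces into single spaces.
    text = " ".join(text.split())
    return text.rstrip(";").strip()


wrapped = extract_sql("SQL query: <SELECT a FROM t LIMIT 1;>")
multiline = extract_sql("SQL query: \n\nSELECT a \nFROM t \nLIMIT 1")
print(wrapped)
print(multiline)
```

This does not make queries with different keyword casing or aliases compare as equal; it only removes the formatting differences.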


##### Observations:
1. Question 1 
- Prompt 1: open-ended -> the output can also include the reasoning followed by the model.
- Prompt 2: no reasoning in the output, at the price of more prompt tokens (889 vs. 877). Any speedup is at most a fraction of a second, and the single run recorded above was actually slower (2.40 s vs. 2.03 s), so the extra tokens are not clearly paid back.
- Prompt 3: fewer prompt tokens than the prior ones (870). Response time was also noted as lower (*if model uses processing time instead of turnaround time*), although the run recorded above took 3.00 s.
- Prompt 4: fewest prompt tokens (867). Response time was also noted as lowest (*if model uses processing time instead of turnaround time*), although the run recorded above took 2.46 s.

In the previous prompts the model **assumes** which column to use (*it sometimes uses processing time and sometimes turnaround time*) -> *need to avoid assumptions*. It also assumes that processing time includes the wait time.
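
Each prompt above was timed only once, and the differences are fractions of a second, so they may well be noise. A minimal sketch of repeating the call and summarizing the timings is shown here. The names `benchmark` and `fake_completion` are hypothetical; `fake_completion` stands in for `get_completion` so that the sketch runs offline.

```python
import statistics
import time


def benchmark(completion_fn, prompt, n_runs=5, clock=time.perf_counter):
    """Call completion_fn(prompt) n_runs times and summarize the wall-clock durations."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    durations = []
    for _ in range(n_runs):
        start = clock()
        completion_fn(prompt)
        durations.append(clock() - start)
    return {
        "mean_s": statistics.mean(durations),
        # The sample standard deviation needs at least two runs.
        "stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min_s": min(durations),
        "max_s": max(durations),
    }


# Hypothetical stand-in for get_completion (assumption: same call shape).
def fake_completion(prompt):
    return "SELECT 1;", {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}


summary = benchmark(fake_completion, "demo", n_runs=3)
print(summary)
```

Two prompts could then be compared on their mean and spread rather than on one sample each. Note that every repeated run against the real API costs tokens.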

##### Principle 2 for prompting:
1. Specify the steps to complete a task. -> *It might require more tokens as input, how to deal with that?*
2. Instruct the model to work out its own solution before rushing to a conclusion. -> *How to do this in our use case where time is important?*
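
The prompts in this notebook share one template and differ only in their numbered steps, so the steps can be kept in a list and the prompt assembled from it. The sketch below is one way to do that. The function name `build_sql_prompt` and the placeholder schema text are hypothetical. The template omits the leading indentation used in the notebook's prompts, which probably saves a few prompt tokens, so token counts will not match the recorded ones.

```python
def build_sql_prompt(steps, question, schema):
    """Assemble the SQL-generator prompt from a list of step descriptions."""
    numbered = "\n".join(f"{i} - {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are a SQL code generator.\n"
        "Perform the following actions:\n"
        f"{numbered}\n\n"
        "Use the following format:\n"
        "Question: <Question to be answered by the generated SQL query>\n"
        "Schema: <Schema of the database table to be queried for the given question>\n"
        "SQL query: <valid SQL query>\n\n"
        f"Question: <{question}>\n"
        f"Schema: <{schema}>\n"
    )


# Steps taken from prompt_4 in this notebook.
base_steps = [
    "Understand the question delimited by <>.",
    "Understand the schema delimited by <>.",
    "Use the schema to generate SQL query answering the question.",
    "Output correct SQL query.",
    "Omit any explanations.",
]

# The schema string here is a placeholder, not the real schema file.
prompt = build_sql_prompt(
    base_steps, "Who has the longest average processing time?", "schema text goes here"
)
print(prompt)
```

Adding or removing one step then becomes a one-line change to the list, which makes it easier to see which step is responsible for a change in the output.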

##### Prompting on the Web UI

Edge Cases:
1. Assumptions by the model
    - **Missing variables:** To prevent the model from making up variables on its own, instruct it not to assume that those variables exist and to ask follow-up questions when something is missing.
    - **Skimming the schema descriptions:** Sometimes the model assumes a few things based on past data trends without reading the schema descriptions carefully.

2. Hallucination
    - **Reasoning based on past knowledge and experience**
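
Made-up variables can also be caught after generation, by checking the identifiers in the returned SQL against the column names listed in the schema. The sketch below is a rough heuristic, not a SQL parser. It assumes the schema uses the `N. column_name: description` line format seen in the schema file. The helper names, the keyword list, and the demo schema and query are hypothetical.

```python
import re

# A small, incomplete set of SQL keywords and functions to ignore (assumption).
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "limit", "as", "asc", "desc",
    "avg", "count", "sum", "min", "max", "and", "or", "not", "distinct", "having",
    "join", "on", "in", "is", "null",
}


def schema_columns(schema_text):
    """Collect column names from lines shaped like '1. case_id : description'."""
    return set(re.findall(r"^\s*\d+\.\s*([A-Za-z_]\w*)\s*:", schema_text, flags=re.MULTILINE))


def unknown_identifiers(sql, columns, allowed=()):
    """Return identifiers in sql that are neither keywords, schema columns, aliases nor allowed names."""
    sql = re.sub(r"'[^']*'", "", sql)  # ignore string literals
    aliases = re.findall(r"\bas\s+([A-Za-z_]\w*)", sql, flags=re.IGNORECASE)
    known = {name.lower() for name in (*columns, *allowed, *aliases)}
    tokens = re.findall(r"[A-Za-z_]\w*", sql)
    return sorted({t for t in tokens if t.lower() not in SQL_KEYWORDS and t.lower() not in known})


# Hypothetical schema and query, for illustration only.
demo_schema = (
    "1. case_id : unique identifier\n"
    "2. participant_id: who did the task\n"
    "3. turnaround_time: time taken"
)
demo_sql = (
    "SELECT participant_id, AVG(long_cut_keys) AS avg_keys FROM table_name "
    "GROUP BY participant_id ORDER BY avg_keys DESC LIMIT 1"
)
unknown = unknown_identifiers(demo_sql, schema_columns(demo_schema), allowed={"table_name"})
print(unknown)
```

A non-empty result is a cue to reject the query or send a follow-up question instead of running it.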

Aim:
Improve prompts such that the model can ask correct follow-up questions.

Perform the following actions using the same schema and a different question:
- If the schema and/or the question is missing, ask a follow-up question requesting the missing schema and/or question.
- Understand the question and the schema, each delimited by <>.
- If no row or combination of rows in the schema has the information needed to answer the question, ask the user for more information.
- Use the schema to generate a SQL query answering the question.
- Output the computationally most efficient SQL query.
- Omit any explanations.

Use the following format:
Question: <Question to be answered by the generated SQL query>
Schema: <Schema of the database table to be queried for the given question>
SQL query: <valid SQL query>

Question: Which participant uses the most applications on average?
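
When the same instructions are sent through the API rather than the Web UI, the first one (a missing schema or question) can be checked in code before any tokens are spent. A minimal sketch follows. The function name `followup_for_missing_inputs` and the wording of the follow-up message are hypothetical.

```python
def followup_for_missing_inputs(question, schema):
    """Return a follow-up request if the question or schema is empty, else None."""
    missing = []
    if not question or not question.strip():
        missing.append("question")
    if not schema or not schema.strip():
        missing.append("schema")
    if not missing:
        return None
    return "Please provide the " + " and ".join(missing) + " so that a SQL query can be generated."


# With both inputs present there is nothing to ask, so the prompt can be sent.
print(followup_for_missing_inputs("Which participant uses the most applications on average?", "1. case_id : ..."))
# With the schema missing, the user is asked for it instead of calling the model.
print(followup_for_missing_inputs("Which participant uses the most applications on average?", ""))
```

The remaining check, whether the schema actually contains the information needed to answer the question, still has to be left to the model.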