# SQL Query Engine Sample
This notebook demonstrates the implementation and usage of the SQL query engine.

## IMPORTANT
Before using this notebook, please make a copy under a new name so that
the original is preserved as an example for others!

In [1]:
from currensee.schema.schema import PostgresTables
from currensee.query_engines.sql_query_engine.query_engine import create_sql_workflow
from currensee.utils.db_utils import create_pg_engine

In [2]:
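# Required to run asynchronous code in a notebook: Jupyter already runs an
# event loop, and nest_asyncio allows that loop to be re-entered.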

import nest_asyncio

nest_asyncio.apply()

## Create the SQL Workflow

The SQL workflow can take the following parameters:

1. source_db: the name of the database where the table is stored (e.g. `crm`)
2. source_tables: a list of the name(s) of the table(s) that we want the query engine to have access to
  * multiple tables can be passed if you want the query engine to try joining related tables in its queries
  * this is an advanced use case: get comfortable with a single table before attempting it

3. table_descriptions: a list of the description(s) of the table(s) passed above
   * in the `create_sql_workflow` call later in this notebook, items 2 and 3 are passed together as `table_description_mapping`, a dict mapping each table name to its description
4. text_to_sql_tmpl: a string containing the prompt telling the LLM how to produce the SQL query from the text given
   * defaults to the variable `text_to_sql_tmpl` defined in `currensee.query_engines.prompting.py`
   * you may override this by passing in your own string
5. response_synthesis_prompt_str: a string containing the prompt telling the LLM how to synthesize the final response from the SQL table(s)
   * defaults to the variable `response_synthesis_prompt_str` defined in `currensee.query_engines.prompting.py`
   * you may override this by passing in your own string
6. model: the name of the model to use for all of the tasks
   * defaults to `gemini-1.5-flash`
   * you may override this with any model listed at https://ai.google.dev/gemini-api/docs/models#model-variations; pass the dashed string from the "Model variant" column
   * **Pay close attention to the pricing!** Stick with the default model until you understand the other models better.
7. temperature: the temperature parameter to pass to the model
   * default is 0.0
   * higher temperatures make the output more varied ("creative"); keep it low for SQL query generation
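
The `create_sql_workflow` call later in this notebook receives a dict named `table_description_mapping`, not two separate lists. As a minimal sketch, assuming you start from a list of table names and a matching list of descriptions, the two can be paired as below. The helper name `build_table_description_mapping` and the sample description are hypothetical and not part of `currensee`.

```python
def build_table_description_mapping(tables, descriptions):
    """Pair each table name with its description (hypothetical helper)."""
    if len(tables) != len(descriptions):
        raise ValueError("Each table needs exactly one description.")
    # Table names are lowercased because the engine only finds lowercase names.
    return {name.lower(): desc.strip() for name, desc in zip(tables, descriptions)}


# Illustrative inputs: one table, one description.
source_tables = ["email_data"]
table_descriptions = ["Email correspondence between an advisor and her clients."]

mapping = build_table_description_mapping(source_tables, table_descriptions)
print(mapping)
```

Start with a single entry in the mapping and add further tables only once single-table queries behave as expected.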

In [3]:
from google.cloud import secretmanager
import pandas as pd
from sqlalchemy import create_engine, text, inspect

PROJECT_ID = 'adsp-34002-on02-sopho-scribe'
REGION = 'us-central1'
DB_NAME = 'postgres'
DB_HOST = '35.232.92.211'
DB_PORT = '5432'

def access_secret(secret_id):
    """Fetch the latest version of a secret from Google Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{PROJECT_ID}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("UTF-8")

DB_USER = access_secret('cloudSqlUser')
DB_PASSWORD = access_secret('cloudSqlUserPassword')


In [4]:
engine = create_pg_engine('outlook')
inspector = inspect(engine)
tables = inspector.get_table_names()
print(tables)


['email_data', 'meeting_data']


In [5]:
columns = inspector.get_columns('email_data')
for col in columns:
    print(col['name'])

email_timestamp
to_names
to_emails
from_name
from_email
email_subject
email_body


In [6]:
emails_table_description = """
    Contains email correspondence between a financial advisor at Bankwell Financial and representatives of her client companies.
    Columns:
     email_timestamp,
     to_names,
     to_emails,
     from_name,
     from_email,
     email_subject,
     email_body
"""

#More description on the tables
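
The description above lists the columns by hand. A minimal sketch of generating the same layout from a list of column names follows, so the description stays in step with the schema. The helper `build_table_description` is hypothetical, and the summary text is shortened for illustration.

```python
def build_table_description(summary, column_names):
    """Format a table summary followed by its column list (hypothetical helper)."""
    columns = ",\n".join(f"     {name}" for name in column_names)
    return f"\n    {summary}\n    Columns:\n{columns}\n"


# Column names as printed by the inspector cell above. In the notebook these
# could come from: [col["name"] for col in inspector.get_columns("email_data")]
email_columns = [
    "email_timestamp",
    "to_names",
    "to_emails",
    "from_name",
    "from_email",
    "email_subject",
    "email_body",
]

description = build_table_description(
    "Email correspondence between a financial advisor and her client companies",
    email_columns,
)
print(description)
```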

### Below is the default defined in `prompting.py`

In [7]:
text_to_sql_tmpl = """
    Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
    You can order the results by the email_timestamp column (from earliest to latest) to return the most interesting examples in the database.

    GUIDELINES:
    * Never query for all the columns from a specific table, only ask for a few relevant columns given the question.
    * Pay attention to use only the column names that you can see in the schema description.
    * Be careful to not query for columns that do not exist.
    * Pay attention to which column is in which table.
    * Make sure to filter on all criteria mentioned in the query.
    * If using a LIMIT to restrict the results, make sure it comes only in the end of the query.

    IMPORTANT NOTE:
    * Use the ~* operator instead of = when filtering with WHERE on text columns.
    * Add word boundaries '\\y' to the beginning and end of each search term in the query.

    You are required to use the following format, each taking one line:

    Question: Question here
    SQLQuery: SQL Query to run
    SQLResult: Result of the SQLQuery
    Answer: Final answer here

    Only use tables listed below.
    {schema}
"""

    #Question: {query_str}
    #SQLQuery: SELECT email_timestamp, to_names, to_emails, from_name, from_email, email_subject 
    #          FROM email_data 
    #          WHERE to_names ~* '\\y{query_str}\\y'
    #          ORDER BY email_timestamp;
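
The two notes under IMPORTANT NOTE ask for case-insensitive matching (`~*`) on whole words (`\y`). The sketch below approximates that behaviour with Python's `re` module to show why the word boundaries matter. It assumes Python's `\b` is close enough to the PostgreSQL `\y` escape for illustration and does not run any SQL. The helper `matches_whole_word` is hypothetical.

```python
import re


def matches_whole_word(term, text):
    """Approximate Postgres `text ~* '\\yterm\\y'` with Python's re module."""
    # re.escape keeps characters such as '.' literal; \b is Python's
    # word-boundary escape, used here in place of the Postgres \y.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


whole_word = matches_whole_word("ann", "Ann Lee")         # whole word, different case
substring_only = matches_whole_word("ann", "Joanna Lee")  # "ann" only inside a longer word
print(whole_word, substring_only)
```

In the template, `'\\y'` is written with two backslashes because the prompt is an ordinary (non-raw) Python string; the model sees a single backslash followed by `y`.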


### Below is the default defined in `prompting.py`

In [8]:
response_synthesis_prompt_str = """
    Query: {query_str}
    SQL: {sql_query}
    SQL Response: {context_str}

    IMPORTANT INSTRUCTIONS:
    * If SQL Response is empty or 0, apologise and mention that you could not find
     examples to answer the query.
    * In such cases, kindly nudge the user towards providing more details or refining
    their search.
    * For example, you could suggest narrowing down the search by a specific sender, date range, or subject keyword.
    * You can also suggest rephrasing keywords like "Bob" to account for variations such as "Robert" or "Bobby."
    * Do not explicitly state phrases such as 'based on the SQL query executed' or related
     references to context in your Response.
    * Never mention the underlying sql query, or the underlying sql tables and other database elements.
    * Never mention that SQL was used to answer this question.

    Example response to an error:
    "I’m sorry, I couldn’t find any emails matching your request. To help narrow down the search, could you provide more details such as a specific date range, sender, or subject? Rephrasing the search terms might also help."

    Response:
"""
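
To see what the model receives, the three placeholders can be filled in by hand. This is a minimal sketch that assumes plain `str.format` substitution; the workflow may use its own template class. The template is a shortened stand-in for the prompt above, and the SQL string is an illustrative value, not output from the engine.

```python
# Shortened stand-in for the prompt above, with the same three placeholders.
prompt_tmpl = (
    "Query: {query_str}\n"
    "SQL: {sql_query}\n"
    "SQL Response: {context_str}\n"
    "Response: "
)

filled = prompt_tmpl.format(
    query_str="Show me emails to Cynthia Hobbs",
    # Illustrative query only; the real one is generated by the model.
    sql_query="SELECT email_subject FROM email_data WHERE to_names ~* '\\yCynthia Hobbs\\y'",
    context_str="[]",  # an empty result is the case the apology instructions cover
)
print(filled)
```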

### Define the DB information
**IMPORTANT**: The table names MUST be lowercase in order for the engine to find them.

In [9]:
source_db = 'outlook'
table_description_mapping = {
    'email_data': emails_table_description,
}
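
Because the engine only finds lowercase table names, it can help to check the mapping keys before creating the workflow. A minimal sketch, assuming the list of available tables comes from the inspector cell above; the helper `check_table_names` is hypothetical.

```python
def check_table_names(mapping, available_tables):
    """Return a list of problems found in the mapping keys (hypothetical helper)."""
    problems = []
    for name in mapping:
        if name != name.lower():
            problems.append(f"{name!r} is not lowercase")
        elif name not in available_tables:
            problems.append(f"{name!r} is not a table in the database")
    return problems


available = ["email_data", "meeting_data"]  # as printed by the inspector above

ok = check_table_names({"email_data": "..."}, available)
bad_case = check_table_names({"Email_Data": "..."}, available)
print(ok)
print(bad_case)
```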

In [10]:
sql_workflow = create_sql_workflow(
    source_db=source_db,
    table_description_mapping=table_description_mapping,
    text_to_sql_tmpl=text_to_sql_tmpl,
    response_synthesis_prompt_str=response_synthesis_prompt_str
)

## Define the Query

In [11]:
query = "Show me emails to Cynthia Hobbs"

## Run the Query and Output the Result

In [12]:
result = await sql_workflow.run(query=query)

Running step generate_sql_response
Step generate_sql_response produced event StopEvent


In [13]:
result

Response(response="I’m sorry, I couldn’t find any emails to Cynthia Hobbs. To help me find the emails you're looking for, could you please provide additional information, such as a date range, the sender's email address, or keywords from the email subject or body?\n", source_nodes=[NodeWithScore(node=TextNode(id_='6b900f06-710d-4dd1-a671-3f0855e02b8f', embedding=None, metadata={'sql_query': "SELECT email_subject FROM email_data WHERE to_emails ~* '\\yjane\\.doe\\@example\\.com\\y' ORDER BY email_timestamp", 'result': [], 'col_keys': ['email_subject']}, excluded_embed_metadata_keys=['sql_query', 'result', 'col_keys'], excluded_llm_metadata_keys=['sql_query', 'result', 'col_keys'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='[]', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'), score=None)], metadata={'6b900f06-710d-4dd1-a671-3f0855e02b8f': {'sql_query': "SELECT e
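
The raw output is hard to read, so it can help to pull out just the generated SQL and the number of rows returned. In the run above, the generated query filters `to_emails` on `jane.doe@example.com`, not on the name in the question, which would explain the empty result. The sketch below uses a stand-in object that mirrors only the attributes visible in that output (`response`, `source_nodes`, `node.metadata`), with a shortened response text. In the notebook, pass `result` itself. The helper `summarize_run` is hypothetical.

```python
from types import SimpleNamespace

# Stand-in mirroring the attributes visible in the output above, so that
# this sketch runs without the workflow.
fake_result = SimpleNamespace(
    response="I'm sorry, I couldn't find any emails to Cynthia Hobbs.",
    source_nodes=[
        SimpleNamespace(
            node=SimpleNamespace(
                metadata={
                    "sql_query": "SELECT email_subject FROM email_data "
                    "WHERE to_emails ~* '\\yjane\\.doe\\@example\\.com\\y' "
                    "ORDER BY email_timestamp",
                    "result": [],
                    "col_keys": ["email_subject"],
                }
            ),
            score=None,
        )
    ],
)


def summarize_run(result):
    """Collect the generated SQL and the row count from a workflow result."""
    metadata = result.source_nodes[0].node.metadata
    return {"sql_query": metadata["sql_query"], "row_count": len(metadata["result"])}


summary = summarize_run(fake_result)
print(summary["sql_query"])
print("rows returned:", summary["row_count"])
```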