# Llama-Index Text-To-SQL Retrieval Agent
### Thoughts:
- Performance is too inconsistent
- Readily fabricates facts when the query returns no results
- Does not reliably grasp the full structure and meaning of the data
- The underlying functionality is difficult to modify, particularly the prompt template for the text-to-SQL step that runs before response synthesis
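
One way to limit fabricated answers on empty results is to short-circuit before the LLM is asked to synthesize anything. The snippet below is a minimal sketch of that idea, not part of the notebook's pipeline: `answer_from_rows`, `NO_RESULTS_MESSAGE`, and `fake_synthesize` are hypothetical names, and `fake_synthesize` merely stands in for a real LLM call.

```python
from typing import Callable, Sequence

# Hypothetical fixed reply used whenever the SQL result set is empty.
NO_RESULTS_MESSAGE = "The query returned no rows, so there is nothing to report."


def answer_from_rows(
    rows: Sequence[tuple],
    synthesize: Callable[[Sequence[tuple]], str],
) -> str:
    """Return a fixed message for empty results; otherwise defer to `synthesize`."""
    if not rows:
        # Skip synthesis entirely so the model has nothing to embellish.
        return NO_RESULTS_MESSAGE
    return synthesize(rows)


def fake_synthesize(rows: Sequence[tuple]) -> str:
    """Stand-in for an LLM synthesis call (assumption): reports the row count."""
    return f"Found {len(rows)} row(s)."


print(answer_from_rows([], fake_synthesize))
print(answer_from_rows([("Meeting 1330",)], fake_synthesize))
```

In the cell below, the same check could in principle be applied to `response.metadata["result"]` before displaying the synthesized text, assuming that key is present.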

In [2]:
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display
import pandas as pd

from llama_index.core import SQLDatabase
from llama_index.llms.openai import OpenAI
from llama_index.core.indices.struct_store.sql_query import (
    SQLTableRetrieverQueryEngine,
)
from llama_index.core.objects import (
    SQLTableNodeMapping,
    ObjectIndex,
    SQLTableSchema,
)
from llama_index.core import VectorStoreIndex, PromptTemplate

from src.db.database import engine
from src.db import models


load_dotenv()


OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


llm = OpenAI(temperature=0.1, model="gpt-4o-mini", api_key=OPENAI_API_KEY)

sql_database = SQLDatabase(engine)

table_node_mapping = SQLTableNodeMapping(sql_database)

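# Build one schema object per ORM model. `__context_str__` is a custom
# attribute on the models that describes each table for the retriever.
table_schema_objs = [
    SQLTableSchema(table_name=table.__tablename__, context_str=table.__context_str__)
    for table in models.__dict__.values()
    if hasattr(table, "__tablename__")
]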

obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
)

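# Use adjacent string literals rather than a backslash continuation inside
# the string, which would embed the next line's indentation in the prompt.
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results. "
    "You must ensure your response is completely factual.\n"
    "<query>{query_str}</query>\n"
    "<sql>{sql_query}</sql>\n"
    "<sql response>SQL Response: {context_str}</sql response>\n"
    "Response: "
)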
response_synthesis_prompt = PromptTemplate(
    response_synthesis_prompt_str,
)

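# similarity_top_k=1 retrieves only the single most similar table schema per
# query; questions that span several tables may need a higher value.
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=1),
    response_synthesis_prompt=response_synthesis_prompt,
)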

# query = "What are the fields in the meetings table and what do they represent contextually?"
# query = "Using just your provided system messaging and without using SQL, \
#     What are the fields in the meetings table and what do they represent contextually?"
# query = "What is the name of the firm that has the most meetings and how many meets do they have?"
# query = "Can you show me the first 5 rows of meetings?"
query = "Fetch the first 5 meetings and their content which have a firm attended that are in the Energy sector."
response = query_engine.query(query)

print("SQL Query:")
print("```\n" + response.metadata["sql_query"] + "\n```")
print("Response:")
display(Markdown(f"<b>{response}</b>"))
if "result" in response.metadata:
    display(pd.DataFrame(response.metadata["result"], columns=response.metadata["col_keys"]))

SQL Query:
```
SELECT m.beam_id, m.title, m.content
FROM meetings m
JOIN firms f ON m.firm_attended_id = f.firm_id
WHERE f.sector = 'Energy'
ORDER BY m.date
LIMIT 5;
```
Response:


<b>The first 5 meetings attended by a firm in the Energy sector are Meeting 1330, Meeting 726, Meeting 1626, Meeting 702, and Meeting 1822.</b>

| | beam_id | title | content |
|---|---|---|---|
| 0 | 343d86ac-c36b-4876-8c9e-83005ac14b15 | Meeting 1330 | Content 1330 |
| 1 | c08a8428-2287-43b6-952f-4228db247a5d | Meeting 726 | Content 726 |
| 2 | d83d6dc8-930b-4aca-99fa-d76118838bfb | Meeting 1626 | Content 1626 |
| 3 | 97534c07-2ce9-4901-b0a5-67ca6afe9905 | Meeting 702 | Content 702 |
| 4 | 0fd3a552-e3f5-4207-8a02-1972869598e6 | Meeting 1822 | Content 1822 |


# Custom Simplified Implementation
- Much slower than the query engine above
- Produces verbose chain-of-thought reasoning
- Still has trouble constructing correct queries
- Need to consider how results are presented back to the user in a memory-friendly way
    - Could the retrieval return just the beam_ids?
        - Could these then be added to the user's 'meetings in-focus' view?
    - Could return the results as markdown, but large results take up a lot of context
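
The two presentation options above could be combined: return every `beam_id` plus only a short markdown preview, so the context cost stays bounded regardless of result size. The following is a rough sketch under that assumption; `compact_result`, its parameters, and the returned keys are hypothetical and not part of `SQLAgent`.

```python
from typing import Any


def compact_result(
    rows: list[dict[str, Any]],
    id_key: str = "beam_id",
    preview_rows: int = 3,
) -> dict[str, Any]:
    """Return all ids, the total row count, and a capped markdown preview."""
    ids = [row[id_key] for row in rows]
    preview = rows[:preview_rows]
    columns = list(preview[0].keys()) if preview else []

    lines: list[str] = []
    if columns:
        # Header and separator rows of a markdown table.
        lines.append("| " + " | ".join(columns) + " |")
        lines.append("| " + " | ".join("---" for _ in columns) + " |")
        for row in preview:
            lines.append("| " + " | ".join(str(row[col]) for col in columns) + " |")

    return {
        "ids": ids,
        "total_rows": len(rows),
        "preview_markdown": "\n".join(lines),
    }


# Illustrative rows only; these are not real query results.
sample_rows = [{"beam_id": f"id-{i}", "title": f"Meeting {i}"} for i in range(5)]
summary = compact_result(sample_rows, preview_rows=2)
print(summary["ids"])
print(summary["preview_markdown"])
```

The full id list could feed a 'meetings in-focus' view, while only the preview would be passed back into the conversation.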

In [None]:
from src.db.database import session_scope
from src.rag.sql_retriever import SQLAgent


agent = SQLAgent(llm, "src/db/models.py", verbose=True)

# query = "I need all meetings between 2022-01-01 and 2023-01-01 where the firms that attended are in the Energy sector."
query = "Return the beam_ids of all meetings between 2022-01-01 and 2023-01-01 where the firms that attended are in the Energy sector."

with session_scope() as session:
    response = agent.complete(session, query)

response_md = response.to_markdown(index=True)

print(response_md)

CHAIN OF THOUGHTS:
Thoughts: I need to retrieve the beam_ids of meetings that occurred between the specified dates. The meetings table has a date column that I can filter on. Additionally, I need to join the firms table to filter by the Energy sector, which is specified in the firms table.
Outcome: I will need to join the meetings table with the firms table using the firm_attended_id in the meetings table and the firm_id in the firms table. Then, I will filter the results based on the date range and the sector. 

Thoughts: The date range is from '2022-01-01' to '2023-01-01'. I will use the date column in the meetings table for this filter. I also need to filter the firms by the sector column to only include those in the Energy sector.
Outcome: I will add a filter for the date range and another filter for the sector being 'Energy'. 

SQL QUERY:
```
SELECT meetings.beam_id 
FROM meetings 
JOIN meeting_firms ON meetings.meeting_id = meeting_firms.meeting_id 
JOIN firms ON meeting_firms.fi