# Zero-Shot vs. Few-Shot Prompting for NL-to-SQL
This notebook compares zero-shot and few-shot prompting strategies for the NL-to-SQL task on the QFabric dataset.
<br>

In [None]:
!pip install -U langchain langchain-openai

In [2]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

Load LLM

In [4]:
llm_gpt = ChatOpenAI(model_name="gpt-3.5-turbo", openai_api_key= "YOUR OPENAI API KEY")

## Zero Shot Prompting

<b><u>Zero-Shot Prompting:</u></b> In this strategy, we give the language model (LLM) only the database schema and instructions for generating the SQL query. We then assess the quality of the generated SQL queries, which depends entirely on how well the LLM understands the schema and the question.
<br>
The database schema is presented below:
<pre>
CREATE TABLE QFabric (
  id INTEGER PRIMARY KEY,
  change_type  TEXT,
  change_status_date1  TEXT,
  change_status_date2  TEXT,
  change_status_date3  TEXT,  
  change_status_date4  TEXT,
  change_status_date5  TEXT,
  date1  DATE,
  date2  DATE,
  date3  DATE,
  date4  DATE,
  date5  DATE,
  urban_types  TEXT,
  geography_types  TEXT,
  geometry_type  TEXT,
  coordinates  TEXT
);
</pre>

<b><u>Important Notes</u></b>

- <b> In the final pipeline, we will use the LLM to generate a natural language response to the user's question from the information retrieved from the SQL database. Because the retrieved information can run to thousands or hundreds of thousands of rows for our dataset, passing it as context could exceed the LLM's input token limit. Therefore, for this assignment, we will generate only aggregation queries. This can be adjusted later based on user requirements.</b>
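
The row-count concern above could also be enforced at execution time. The following is a minimal sketch, not part of the original pipeline: `run_capped` and `max_rows` are hypothetical names, and the toy table stands in for the real QFabric database.

```python
import sqlite3

def run_capped(conn, sql, max_rows=20):
    """Run a query and refuse results that are too large to pass to an LLM."""
    cur = conn.execute(sql)
    # Fetch one row more than the cap so an oversized result can be detected
    rows = cur.fetchmany(max_rows + 1)
    if len(rows) > max_rows:
        raise ValueError(
            f"Result has more than {max_rows} rows; rewrite as an aggregation query"
        )
    return rows

# Toy stand-in for the QFabric table (assumption: illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QFabric (id INTEGER PRIMARY KEY, change_type TEXT)")
conn.executemany(
    "INSERT INTO QFabric (change_type) VALUES (?)", [("residential",)] * 50
)

print(run_capped(conn, "SELECT COUNT(*) FROM QFabric"))  # [(50,)]
```

An aggregation query returns a single row and passes the guard, while `SELECT * FROM QFabric` on this toy table would raise `ValueError`.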

Zero Shot Prompt

In [3]:
# Zero-shot prompt template
template = """### Instructions:
Your task is convert a question into a SQL query, given a SQlite database schema.
Adhere to these rules:
- **Only generate SQL query and no other text
- **Deliberately go through the question and database schema word by word** to appropriately  generate the SQL query for the question
- **You can use Julianday in SQlite database queries
- **Always use aggregation functions like COUNT, SUM etc
- **Appropriately understand the context of each column in the given schema.
### Input:
This query will run on a database whose schema is represented in this string:
CREATE TABLE QFabric (
  id INTEGER PRIMARY KEY, -- Unique ID for each location
  change_type  TEXT, -- Type of the construction. ['commercial', 'demolition', 'industrial', 'mega projects', 'residential', 'road']
  change_status_date1  TEXT, -- construction status at first recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date2  TEXT, -- construction status at second recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date3  TEXT, -- construction status at third recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date4  TEXT, -- construction status at fourth recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date5  TEXT, -- construction status at last (fifth) recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  date1  DATE, -- first recorded date
  date2  DATE, -- second recorded date
  date3  DATE, -- third recorded date
  date4  DATE, -- fourth recorded date
  date5  DATE, -- last (fifth) recorded date
  urban_types  TEXT, -- type of urban area like "dense urban", "sparse urban", "rural", "urban slum" etc.
  geography_types  TEXT, -- type of geograpy of the region. For ex. river, forest, green land etc.
  geometry_type  TEXT, -- region geometry, always "polygon"
  coordinates  TEXT -- coordinates of the polygon
);

Use the same format as mentioned below

### Response:
User input: {question}
SQL query:
"""

In [5]:
# Initialize the LangChain LLM chain
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm_gpt)

In [7]:
# Helper function to print generated SQL queries
def print_generated_sql(query_string):
    """Print the LLM-generated SQL query one line at a time for readability."""
    # Split the query string into lines on '\n'
    lines = query_string.split('\n')

    # Print each line
    for line in lines:
        print(line)

In [8]:
# Inference on sample queries
question_1 = "How many large scale projects were constructed between 2015 and 2019?"
output = llm_chain.invoke(question_1)
query = output['text']
print_generated_sql(query)

SELECT COUNT(*) FROM QFabric WHERE change_type = 'mega projects' AND date1 >= '2015-01-01' AND date5 <= '2019-12-31'


In [9]:
question_2 = "Is there a house that took at least three years to construct?"
output = llm_chain.invoke(question_2)
query = output['text']
print_generated_sql(query)

SELECT COUNT(*) FROM QFabric WHERE change_type = 'residential' AND (JulianDay(date5) - JulianDay(date1)) >= 1095


#### Interpretation: Zero-Shot

- The generated SQL queries show that the LLM has not understood how the database columns relate to one another, even though each column is described in the prompt. Each change_status_date column records the construction status observed on the matching date column, so a question about when construction finished must first find which status column holds 'construction done' and then read the corresponding date. <br>In both queries, the LLM never checked whether 'construction done' appears in any change status; instead, it assumed that date1 and date5 are the start and end dates of construction, respectively.
- Hence, zero-shot prompting is not our preferred approach. We need to provide example queries that help the LLM understand the relationship between the date columns and the change_status_date columns.
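
The difference between the two readings of the schema can be checked on a single toy row. The sketch below is illustrative only: the row is made-up data, the table keeps just the columns the queries touch, and `date_of` is a hypothetical helper that builds the CASE expression used in the status-aware query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
status_cols = ", ".join(f"change_status_date{i} TEXT" for i in range(1, 6))
date_cols = ", ".join(f"date{i} DATE" for i in range(1, 6))
conn.execute(
    f"CREATE TABLE QFabric (id INTEGER PRIMARY KEY, change_type TEXT, "
    f"{status_cols}, {date_cols})"
)

# Made-up house: observed for five years, but built in about one year
conn.execute(
    "INSERT INTO QFabric VALUES (1, 'residential', ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("greenland", "construction started", "construction done",
     "operational", "operational",
     "2014-01-01", "2015-06-01", "2016-06-01", "2018-01-01", "2019-01-01"),
)

# Zero-shot output from above: treats date1/date5 as start/end of construction
naive = (
    "SELECT COUNT(*) FROM QFabric WHERE change_type = 'residential' "
    "AND (JulianDay(date5) - JulianDay(date1)) >= 1095"
)

def date_of(status):
    """Build a CASE expression returning the date paired with a status."""
    whens = " ".join(
        f"WHEN '{status}' = change_status_date{i} THEN date{i}" for i in range(1, 6)
    )
    return f"CASE {whens} END"

all_status = ", ".join(f"change_status_date{i}" for i in range(1, 6))
status_aware = f"""
SELECT COUNT(*) FROM QFabric
WHERE change_type = 'residential'
  AND 'construction done' IN ({all_status})
  AND 'construction started' IN ({all_status})
  AND julianday({date_of('construction done')})
      - julianday({date_of('construction started')}) >= 1095
"""

naive_count = conn.execute(naive).fetchone()[0]
aware_count = conn.execute(status_aware).fetchone()[0]
print(naive_count, aware_count)  # 1 0
```

The zero-shot query counts this house as taking at least three years because it measures the observation window, whereas the status-aware query measures the 366 days between 'construction started' and 'construction done' and excludes it.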

## Few Shot Prompting

In this approach, we will provide the LLM with a few example queries to enhance its understanding of the context of the database values. <br>We will input the following natural language queries along with their corresponding SQL queries as examples in the prompt.
- How many houses were constructed after 2015?
- Is there a house that took at least three years to construct?
- How many houses are their in rural areas?
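
As a possible alternative to writing the examples directly into the template string, the question/SQL pairs can be kept as data and rendered into the `User input:` / `SQL query:` layout. This is a sketch under assumptions: `EXAMPLES` and `format_examples` are hypothetical names, and only two of the three examples are shown to keep it short.

```python
# Question/SQL pairs adapted from the examples in this notebook
EXAMPLES = [
    {
        "question": "How many houses were constructed after 2015?",
        "sql": """SELECT COUNT(*)
FROM QFabric
WHERE 'construction done' IN (change_status_date1, change_status_date2, change_status_date3, change_status_date4, change_status_date5)
AND CASE
    WHEN 'construction done' = change_status_date1 THEN date1
    WHEN 'construction done' = change_status_date2 THEN date2
    WHEN 'construction done' = change_status_date3 THEN date3
    WHEN 'construction done' = change_status_date4 THEN date4
    WHEN 'construction done' = change_status_date5 THEN date5
END > '2015-12-31'
AND change_type = 'residential'""",
    },
    {
        "question": "How many houses are there in rural areas?",
        "sql": "SELECT COUNT(*) FROM QFabric WHERE change_type = 'residential' "
               "AND urban_types LIKE '%rural%'",
    },
]

def format_examples(examples):
    """Render each pair in the same layout the prompt asks the LLM to follow."""
    blocks = [
        f"User input: {ex['question']}\nSQL query:\n{ex['sql']}" for ex in examples
    ]
    return "\n\n".join(blocks)

few_shot_block = format_examples(EXAMPLES)
print(few_shot_block)
```

The rendered block can be placed between the schema and the `### Response:` section of the prompt. Because `PromptTemplate` treats curly braces as input variables by default, example SQL inserted this way should not contain literal braces.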


In [10]:
# Few Shot prompt template
template = """### Instructions:
Your task is convert a question into a SQL query, given a SQlite database schema.
Adhere to these rules:
- **Only generate SQL query and no other text
- **Deliberately go through the question and database schema word by word** to appropriately  generate the SQL query for the question
- **You can use Julianday in SQlite database queries
- **Always use aggregation functions like COUNT, SUM etc
- **Appropriately understand the context of each column in the given schema.
### Input:
This query will run on a database whose schema is represented in this string:
CREATE TABLE QFabric (
  id INTEGER PRIMARY KEY, -- Unique ID for each location
  change_type  TEXT, -- Type of the construction. ['commercial', 'demolition', 'industrial', 'mega projects', 'residential', 'road']
  change_status_date1  TEXT, -- construction status at first recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date2  TEXT, -- construction status at second recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date3  TEXT, -- construction status at third recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date4  TEXT, -- construction status at fourth recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  change_status_date5  TEXT, -- construction status at last (fifth) recorded date. Can be equal to any one from this list ['construction done', 'construction midway', 'construction started', 'excavation', 'greenland', 'land cleared', 'materials dumped', 'NA', 'operational', 'prior construction']
  date1  DATE, -- first recorded date
  date2  DATE, -- second recorded date
  date3  DATE, -- third recorded date
  date4  DATE, -- fourth recorded date
  date5  DATE, -- last (fifth) recorded date
  urban_types  TEXT, -- type of urban area like "dense urban", "sparse urban", "rural", "urban slum" etc.
  geography_types  TEXT, -- type of geograpy of the region. For ex. river, forest, green land etc.
  geometry_type  TEXT, -- region geometry, always "polygon"
  coordinates  TEXT -- coordinates of the polygon
);

Below are a number of examples of questions and their corresponding SQL queries. Use the same format as mentioned below

User input: How many houses were constructed after 2015?
SQL query:
    SELECT COUNT(*)
    FROM QFabric
    WHERE 'construction done' IN (change_status_date1, change_status_date2, change_status_date3, change_status_date4, change_status_date5)
    AND CASE
        WHEN 'construction done' = change_status_date1 THEN date1
        WHEN 'construction done' = change_status_date2 THEN date2
        WHEN 'construction done' = change_status_date3 THEN date3
        WHEN 'construction done' = change_status_date4 THEN date4
        WHEN 'construction done' = change_status_date5 THEN date5
    END > '2015-12-31'
    AND change_type = 'residential'

User input: Is there a house that took at least three years to construct?
SQL query:
  SELECT COUNT(*)
  FROM QFabric
  WHERE
  'construction done' IN (change_status_date1, change_status_date2, change_status_date3, change_status_date4, change_status_date5)
  AND
  'construction started' IN (change_status_date1, change_status_date2, change_status_date3, change_status_date4, change_status_date5)
  AND
  (julianday(
            CASE
              WHEN 'construction done'= change_status_date1 THEN date1
              WHEN 'construction done' = change_status_date2 THEN date2
              WHEN 'construction done' = change_status_date3 THEN date3
              WHEN 'construction done' = change_status_date4 THEN date4
              WHEN 'construction done' = change_status_date5 THEN date5
            END
          ) -
  julianday(
            CASE
              WHEN 'construction started'= change_status_date1 THEN date1
              WHEN 'construction started' = change_status_date2 THEN date2
              WHEN 'construction started' = change_status_date3 THEN date3
              WHEN 'construction started' = change_status_date4 THEN date4
              WHEN 'construction started' = change_status_date5 THEN date5
            END
  )) >= 1095
  AND
  change_type = 'residential'

User Input: How many houses are their in rural areas?
SQL Query:
SELECT COUNT(*) FROM QFabric WHERE change_type = 'residential' AND urban_types LIKE '%rural%'

### Response:
User input: {question}
SQL query:
"""

In [11]:
# Initialize the LangChain LLM chain
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm_gpt)

In [12]:
# Inference on sample queries
question_1 = "How many large scale projects were constructed between 2015 and 2019?"
output = llm_chain.invoke(question_1)
query = output['text']
print_generated_sql(query)

SELECT COUNT(*)
FROM QFabric
WHERE 'construction done' IN (change_status_date1, change_status_date2, change_status_date3, change_status_date4, change_status_date5)
AND CASE
    WHEN 'construction done' = change_status_date1 THEN date1
    WHEN 'construction done' = change_status_date2 THEN date2
    WHEN 'construction done' = change_status_date3 THEN date3
    WHEN 'construction done' = change_status_date4 THEN date4
    WHEN 'construction done' = change_status_date5 THEN date5
END BETWEEN '2015-01-01' AND '2019-12-31'
AND change_type = 'mega projects'


#### Final interpretation
As observed above, few-shot prompting has helped the LLM understand the context of the columns and how to use them. The generated query now checks that 'construction done' appears in one of the change status columns and filters on the date paired with that status, rather than assuming that date1 and date5 bound the construction period. The query is therefore suitable for the input question.
<br> <br>
<b>Therefore, we will use few-shot prompting in our final NL-to-SQL parser.</b>
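
Before a generated query is executed in the final parser, it may be worth checking that SQLite can compile it against the schema. The sketch below is one way to do that and is not part of the original notebook: `is_valid_select` is a hypothetical helper, and it relies on SQLite's `EXPLAIN` prefix, which compiles a statement without running it.

```python
import sqlite3

SCHEMA = """CREATE TABLE QFabric (
  id INTEGER PRIMARY KEY,
  change_type TEXT,
  change_status_date1 TEXT, change_status_date2 TEXT, change_status_date3 TEXT,
  change_status_date4 TEXT, change_status_date5 TEXT,
  date1 DATE, date2 DATE, date3 DATE, date4 DATE, date5 DATE,
  urban_types TEXT, geography_types TEXT, geometry_type TEXT, coordinates TEXT
)"""

def is_valid_select(conn, sql):
    """Return True if sql starts with SELECT and SQLite can compile it."""
    if not sql.lstrip().lower().startswith("select"):
        return False
    try:
        # EXPLAIN compiles the statement against the schema without running it
        conn.execute("EXPLAIN " + sql)
    except (sqlite3.Error, sqlite3.Warning):
        return False
    return True

conn = sqlite3.connect(":memory:")  # empty database with the schema only
conn.execute(SCHEMA)

generated = """SELECT COUNT(*)
FROM QFabric
WHERE 'construction done' IN (change_status_date1, change_status_date2, change_status_date3, change_status_date4, change_status_date5)
AND CASE
    WHEN 'construction done' = change_status_date1 THEN date1
    WHEN 'construction done' = change_status_date2 THEN date2
    WHEN 'construction done' = change_status_date3 THEN date3
    WHEN 'construction done' = change_status_date4 THEN date4
    WHEN 'construction done' = change_status_date5 THEN date5
END BETWEEN '2015-01-01' AND '2019-12-31'
AND change_type = 'mega projects'"""

print(is_valid_select(conn, generated))  # True
print(is_valid_select(conn, "SELECT COUNT(*) FROM QFabric WHERE end_date > '2015'"))  # False
```

This check catches only syntax errors and references to tables or columns that do not exist; it says nothing about whether the query answers the question correctly.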