![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **Text2SQL Generation**

The Text-to-SQL task, which automatically converts natural language questions into the corresponding SQL queries, has advanced significantly with the application of state-of-the-art models. In this direction, we are excited to introduce our new Text2SQL annotator. It translates natural language text prompts into SQL queries, so you can query a database without writing SQL by hand. Built on a state-of-the-art LLM, the annotator opens new possibilities for data retrieval and manipulation and can streamline your workflow.

We also have a new text2sql_mimicsql model, finetuned specifically on the MIMIC-III dataset schema to improve the precision of SQL queries generated from medical natural language questions about the MIMIC dataset.

In addition, we introduced two models that can generate SQL queries from natural language questions and custom database schemas with a single table. They are based on a large LLM, finetuned by John Snow Labs on a dataset of single-table schemas.

The model "***text2sql_with_schema_single_table_augmented***", trained on an augmented dataset, achieves a new State-Of-The-Art (SOTA) result for this task.


Available models can be found at the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=Text2SQL).


In [0]:
import functools 
import numpy as np
import pandas as pd
from scipy import spatial

import pyspark.sql.types as T
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *


pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)


print('sparknlp.version : ',sparknlp.version())
print('sparknlp_jsl.version : ',sparknlp_jsl.version())

spark

sparknlp.version :  5.1.0
sparknlp_jsl.version :  5.1.0


# 🔎 MODELS

<div align="center">

| **Index** | **Text2SQL models** |
|-----------|---------------------|
| 1 | [text2sql_mimicsql](https://nlp.johnsnowlabs.com/2023/08/14/text2sql_mimicsql_en.html) |
| 2 | [text2sql_with_schema_single_table](https://nlp.johnsnowlabs.com/2023/09/02/text2sql_with_schema_single_table_en.html) |
| 3 | [text2sql_with_schema_single_table_augmented](https://nlp.johnsnowlabs.com/2023/09/25/text2sql_with_schema_single_table_augmented_en.html) |


</div>

## 📑  **Text2SQL_MIMICSQL**

This model is based on the FlanT5-Large LLM, finetuned by John Snow Labs on a biomedical dataset (MIMICSQL). It generates SQL queries from medical natural language questions about the MIMIC-III dataset.

In [0]:
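# Wrap the raw "prompt" column as a document annotation.
document_assembler = DocumentAssembler()\
    .setInputCol("prompt")\
    .setOutputCol("document_prompt")

# Load the pretrained Text2SQL model finetuned on the MIMIC-III schema.
text2sql_mimicsql = Text2SQL.pretrained("text2sql_mimicsql", "en", "clinical/models")\
    .setInputCols("document_prompt")\
    .setOutputCol("sql_query")

pipeline = Pipeline(stages=[document_assembler, text2sql_mimicsql])

# Both stages are already pretrained, so fitting on an empty DataFrame only builds the PipelineModel.
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("prompt"))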

text2sql_mimicsql download started this may take some time.
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][OK!]


In [0]:
text = ["Find the average number of prescriptions per patient for patients with a specific diagnosis.",
        "give me the number of patients who had single internal mammary-coronary artery bypass.",
        "provide the drug code and drug dose for anna johnson.",
        "calculate the minimum age of married patients who had elective type hospital admission.",
        "What is the maximum age of patients who were hospitalized for 20 days and died before 2023 ?"]

data = spark.createDataFrame([(prompt,) for prompt in text], ["prompt"])

result = model.transform(data)

result.select("sql_query.result").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[SELECT AVG ( DEMOGRAPHIC."AGE" ) FROM DEMOGRAPHIC INNER JOIN DIAGNOSES on DEMOGRAPHIC.HADM_ID = DIAGNOSES.HADM_ID INNER JOIN PRESCRIPTIONS on DEMOGRAPHIC.HADM_ID = PRESCRIPTIONS.HADM_ID WHERE DI

### **📍 LightPipelines**

In [0]:
light_model = LightPipeline(model)
light_result = light_model.annotate(text)

In [0]:
# Print each prompt next to the SQL query generated for it.
for i, annotation in enumerate(light_result, start=1):
    prompt_text = annotation['document_prompt'][0]
    sql_text = annotation['sql_query'][0]

    print("➤ User query: {}: \n{}".format(i, prompt_text))
    print("\n")
    print("➤ SQL query {}: \n{}".format(i, sql_text))
    print("\n")

➤ User query: 1: 
Find the average number of prescriptions per patient for patients with a specific diagnosis.


➤ SQL query 1: 
SELECT AVG ( DEMOGRAPHIC."AGE" ) FROM DEMOGRAPHIC INNER JOIN DIAGNOSES on DEMOGRAPHIC.HADM_ID = DIAGNOSES.HADM_ID INNER JOIN PRESCRIPTIONS on DEMOGRAPHIC.HADM_ID = PRESCRIPTIONS.HADM_ID WHERE DIAGNOSES."SHORT_TITLE" = "Specific hst" AND PRESCRIPTIONS."DRUG" = "1"


➤ User query: 2: 
give me the number of patients who had single internal mammary-coronary artery bypass.


➤ SQL query 2: 
SELECT COUNT ( DISTINCT DEMOGRAPHIC."SUBJECT_ID" ) FROM DEMOGRAPHIC INNER JOIN PROCEDURES on DEMOGRAPHIC.HADM_ID = PROCEDURES.HADM_ID WHERE PROCEDURES."SHORT_TITLE" = "1 int mam-cor art bypass"


➤ User query: 3: 
provide the drug code and drug dose for anna johnson.


➤ SQL query 3: 
SELECT PRESCRIPTIONS."FORMULARY_DRUG_CD",PRESCRIPTIONS."DRUG_DOSE" FROM DEMOGRAPHIC INNER JOIN PRESCRIPTIONS on DEMOGRAPHIC.HADM_ID = PRESCRIPTIONS.HADM_ID WHERE DEMOGRAPHIC."NAME" = "Anna Johnson
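The printed pairs above can also be collected into a plain Python structure for later use. The following is a minimal sketch, assuming each LightPipeline annotation is a dict with `document_prompt` and `sql_query` keys that hold lists of strings, as the loop above implies. The helper name `pair_prompts_with_sql` and the `sample_result` list are ours for illustration and are not part of Spark NLP.

```python
def pair_prompts_with_sql(annotations):
    """Return (prompt, sql) tuples from LightPipeline-style annotation dicts.

    Assumes each dict has 'document_prompt' and 'sql_query' keys holding lists
    of strings. Entries where either list is empty are skipped.
    """
    pairs = []
    for annotation in annotations:
        prompts = annotation.get("document_prompt", [])
        queries = annotation.get("sql_query", [])
        if prompts and queries:
            pairs.append((prompts[0], queries[0]))
    return pairs


# Hand-written stand-in for `light_result`; the first entry is copied from the output above,
# the second is a made-up entry with no generated query.
sample_result = [
    {
        "document_prompt": ["give me the number of patients who had single internal mammary-coronary artery bypass."],
        "sql_query": ['SELECT COUNT ( DISTINCT DEMOGRAPHIC."SUBJECT_ID" ) FROM DEMOGRAPHIC INNER JOIN PROCEDURES on DEMOGRAPHIC.HADM_ID = PROCEDURES.HADM_ID WHERE PROCEDURES."SHORT_TITLE" = "1 int mam-cor art bypass"'],
    },
    {"document_prompt": ["a prompt with no generated query"], "sql_query": []},
]

pairs = pair_prompts_with_sql(sample_result)
for prompt, sql in pairs:
    print(prompt)
    print(sql)
```

In the notebook itself, `pair_prompts_with_sql(light_result)` would be the equivalent call.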

<b><h1><font color='darkred'>!!! ATTENTION !!! </font></h1></b>

<b>Before continuing, <font color='darkred'>please check your spark-nlp version; it must be 5.1.1</font></b>
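The version requirement can be checked in code rather than by eye. This is a minimal sketch that assumes the note means "5.1.1 or newer"; if an exact match is required, compare the parsed tuples with `==` instead. `parse_version` and `meets_minimum` are hypothetical helpers written for this example, not Spark NLP functions.

```python
def parse_version(version):
    """Turn a string such as '5.1.1' into (5, 1, 1).

    Only the leading digits of each dot-separated piece are kept, so a
    suffix like 'rc1' is ignored.
    """
    parts = []
    for piece in version.strip().split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, required="5.1.1"):
    """Return True if `installed` is at least `required` (assumed interpretation of the note)."""
    return parse_version(installed) >= parse_version(required)


# The version printed at the top of this notebook was 5.1.0.
print(meets_minimum("5.1.0"))
print(meets_minimum("5.1.1"))
```

In a live session the check would be `meets_minimum(sparknlp.version())`, using the same `sparknlp.version()` call as the first cell.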


## 📑  **Text2SQL_With_Schema_Single_Table**


This model can generate SQL queries from natural language questions and custom database schemas with a single table. It is based on a large LLM, finetuned by John Snow Labs on a dataset of single-table schemas.

In [0]:
query_schema = {"patient": ["ID","Name","Age","Gender","BloodType","Weight","Height","Address","Email","Phone"] }

text2sql_with_schema_single_table = Text2SQL.pretrained("text2sql_with_schema_single_table", "en", "clinical/models")\
    .setMaxNewTokens(200)\
    .setSchema(query_schema)\
    .setInputCols(["document_prompt"])\
    .setOutputCol("sql_query")

pipeline = Pipeline(stages=[document_assembler, text2sql_with_schema_single_table])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("prompt"))

In [0]:
text = ["Calculate the average age of patients with blood type 'A-'",
        "Retrieve the names and email addresses of patients with blood type 'B+'",
        "Calculate the number of patients with blood type A- and weight above 100kg"
        ]

data = spark.createDataFrame([(prompt,) for prompt in text], ["prompt"])

result = model.transform(data)

result.select("sql_query.result").show(truncate=False)
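The schema passed to `setSchema` above is a dict that maps one table name to a list of column names. A small sanity check before handing a schema to the annotator can catch typos early. The sketch below is our own assumption about what a well-formed single-table schema looks like; `validate_single_table_schema` is a hypothetical helper, and nothing here reflects checks that the annotator itself performs.

```python
def validate_single_table_schema(schema):
    """Return the table name if `schema` looks like {"table": ["col", ...]}, else raise ValueError."""
    if not isinstance(schema, dict) or len(schema) != 1:
        raise ValueError("expected a dict with exactly one table")
    table, columns = next(iter(schema.items()))
    if not isinstance(table, str) or not table:
        raise ValueError("table name must be a non-empty string")
    if not isinstance(columns, list) or not columns:
        raise ValueError("columns must be a non-empty list")
    if not all(isinstance(name, str) and name for name in columns):
        raise ValueError("every column name must be a non-empty string")
    # Treat names that differ only by case as duplicates (an assumption of this sketch).
    if len({name.lower() for name in columns}) != len(columns):
        raise ValueError("duplicate column names")
    return table


# Same shape as the schema used in the cell above.
patient_schema = {"patient": ["ID", "Name", "Age", "Gender", "BloodType", "Weight", "Height", "Address", "Email", "Phone"]}
print(validate_single_table_schema(patient_schema))
```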

Let's test with another custom database schema:

In [0]:
query_schema = {"drug": ["ID","Name","Manufacturer","Price","ExpiryDate","PrescriptionRequired","SideEffects","Dosage","Quantity"] }
text2sql_with_schema_single_table.setSchema(query_schema)

text = ["Retrieve the names and dosages of drugs containing '50mcg'",
        "Calculate the average price of drugs with a prescription requirement",
        "Retrieve the names and prices of drugs containing '600mg'"
        ]

data = spark.createDataFrame([(prompt,) for prompt in text], ["prompt"])

result = model.transform(data)

result.select("sql_query.result").show(truncate=False)
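A generated query can be sanity-checked against the schema before it is sent to a real database. One way to sketch this is to build empty tables in an in-memory SQLite database from the schema dict and ask SQLite to compile the query with `EXPLAIN`. The helper `query_compiles` is hypothetical, the two example queries are hand-written rather than model output, and SQLite's dialect may differ from that of your target database, so a passing check is only a rough signal.

```python
import sqlite3


def query_compiles(schema, sql):
    """Return True if `sql` compiles in SQLite against empty tables built from `schema`."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, columns in schema.items():
            column_list = ", ".join('"{}"'.format(name) for name in columns)
            conn.execute('CREATE TABLE "{}" ({})'.format(table, column_list))
        # EXPLAIN makes SQLite compile the statement without running the query itself.
        conn.execute("EXPLAIN " + sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


# Same schema as in the cell above.
drug_schema = {"drug": ["ID", "Name", "Manufacturer", "Price", "ExpiryDate", "PrescriptionRequired", "SideEffects", "Dosage", "Quantity"]}

# Hand-written example queries (not produced by the model).
print(query_compiles(drug_schema, "SELECT Name, Price FROM drug WHERE Dosage LIKE '%600mg%'"))
print(query_compiles(drug_schema, "SELECT Name, Colour FROM drug"))
```

The second query refers to a column that is not in the schema, which is the kind of mistake this check is meant to surface.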

### **📍 LightPipelines**

In [0]:
light_model = LightPipeline(model)
light_result = light_model.annotate(text)

# Print each prompt next to the SQL query generated for it.
for i, annotation in enumerate(light_result, start=1):
    prompt_text = annotation['document_prompt'][0]
    sql_text = annotation['sql_query'][0]
    print("➤ User query: {}: \n{}".format(i, prompt_text))
    print("\n")
    print("➤ SQL query {}: \n{}".format(i, sql_text))
    print("\n")

## 📑  **Text2SQL_With_Schema_Single_Table_Augmented**


This model is the State-of-the-Art (SOTA) for generating SQL queries from natural language questions and custom database schemas with a single table. It is based on a large LLM, finetuned by John Snow Labs on an augmented dataset of single-table schemas.

In [0]:
query_schema = {
    "medical_treatment": ["patient_id","patient_name","age","gender","diagnosis","treatment","doctor_name","hospital_name","admission_date","discharge_date"]
}

text2sql_with_schema_single_table_augmented = Text2SQL.pretrained("text2sql_with_schema_single_table_augmented", "en", "clinical/models")\
    .setMaxNewTokens(200)\
    .setSchema(query_schema)\
    .setInputCols(["document_prompt"])\
    .setOutputCol("sql_query")

pipeline = Pipeline(stages=[document_assembler, text2sql_with_schema_single_table_augmented])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("prompt"))

In [0]:
text = ["Which patients were admitted in September 2023?",
        "What is the average age of female patients with 'Diabetes'?",
        "Who are the patients treated by 'Dr. Brown'?"
        ]

data = spark.createDataFrame([(prompt,) for prompt in text], ["prompt"])

result = model.transform(data)

result.select("sql_query.result").show(truncate=False)

### **📍 LightPipelines**

In [0]:
light_model = LightPipeline(model)
light_result = light_model.annotate(text)

# Print each prompt next to the SQL query generated for it.
for i, annotation in enumerate(light_result, start=1):
    prompt_text = annotation['document_prompt'][0]
    sql_text = annotation['sql_query'][0]

    print("➤ User query: {}: \n{}".format(i, prompt_text))
    print("\n")
    print("➤ SQL query {}: \n{}".format(i, sql_text))
    print("\n")