We know that LLMs can generate SQL code from natural language. The challenges in adopting this to empower users of all skill levels to query big data are manifold. From the LLM's perspective: to generate a correct SQL query from natural language, the LLM not only needs to understand the context but also the database schema. Passing the schema in the prompt could be an approach here, but it is not scalable.
We showcase that, using prompt engineering approaches such as chain-of-thought modelling, we can make this approach scalable. This project shows an LLM working from natural language to query a star schema in a data lake (via Synapse) without needing to know the DB schema beforehand. We employ the Lakehouse pattern in this showcase.
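As a rough illustration of the idea (not the exact code in this repo), a LangChain SQL chain backed by an Azure OpenAI deployment can introspect the Synapse tables at query time instead of receiving the full schema in the prompt. In the sketch below the connection string, deployment name and question are placeholder assumptions:

```python
# Minimal sketch, not the repo's actual app code: natural language -> SQL over Synapse.
# Requires the usual Azure OpenAI env vars (OPENAI_API_TYPE/BASE/KEY/VERSION) to be set.
from langchain.llms import AzureOpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

# Synapse dedicated SQL pool reached through the SQL Server ODBC driver (placeholders).
db = SQLDatabase.from_uri(
    "mssql+pyodbc://<user>:<password>@<workspace>.sql.azuresynapse.net:1433/<database>"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)

llm = AzureOpenAI(deployment_name="<your-deployment>", temperature=0)

# The chain inspects table metadata itself, writes the SQL, runs it and explains the result.
# (Older LangChain releases construct this as SQLDatabaseChain(llm=llm, database=db).)
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
chain.run("Which store had the highest sales last month?")
```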
The high-level architecture is as follows:
Prerequisites:
- An Azure OpenAI Service resource up and running
- An Azure Synapse workspace up and running
- Data lake populated as per description in section "Setting up the data in the Data lake" below
The project is dockerized: edit the environment file with appropriate values and then run `docker-compose up`.
Navigate to localhost:8501 for the app.
Trace the chain of thought at localhost:4173.
The following star schema and data will be populated in the data lake (from the data generation step):
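For orientation, the retail star schema is roughly shaped as sketched below; the table and column names here are illustrative assumptions, and the authoritative definitions come from `synapse/data/CreateSalesStarData.ipynb` and the external table scripts:

```python
# Illustrative only: approximate shape of the retail star schema in the lake.
# Column names are assumptions; see synapse/data/CreateSalesStarData.ipynb for the real ones.
STAR_SCHEMA = {
    "Customer": ["CustomerId", "Name", "City"],                  # dimension
    "Product": ["ProductId", "Name", "Category", "UnitPrice"],   # dimension
    "Store": ["StoreId", "Name", "City"],                        # dimension
    "Sales": ["SalesId", "CustomerId", "ProductId", "StoreId",   # fact (assumed name)
              "Quantity", "Amount", "SaleDate"],
}
```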
Repo structure:
- `app`: The app folder for the Streamlit app and logic
- `app/helpers`: Helper modules/scripts
- `app/pages`: Pages for the Streamlit app (will be extended to more pages in the future)
- `scripts`: Setup scripts to install and configure drivers
- `synapse`: Scripts to generate fake data for our retail star schema, store it in the data lake via Spark, and then create external tables via SQL (lakehouse pattern); see the sketch after this list
- `images`: Images used in documentation/readme
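The gist of the `synapse` data-generation step referenced above, as a minimal hedged sketch (storage account, container and columns are assumptions; the real logic lives in `synapse/data/CreateSalesStarData.ipynb`):

```python
# Minimal sketch of writing one generated dimension table to the lake from a Synapse Spark
# notebook. Storage account, container ("dataset") and columns are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` in Synapse notebooks

customers = spark.createDataFrame(
    [(1, "Alice", "Seattle"), (2, "Bob", "London")],
    ["CustomerId", "Name", "City"],
)

# The parquet files under the container later back the SQL external tables (lakehouse pattern).
customers.write.mode("overwrite").parquet(
    "abfss://dataset@<storage-account>.dfs.core.windows.net/Customer/"
)
```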
- Create a Synapse environment with a dedicated SQL pool and a Spark cluster (refer to the Azure docs)
- Make sure that your user has the `Storage Blob Data Contributor` role on the ADLS storage created with Synapse, and create a new container there named `dataset` to hold our generated data
- Navigate to the newly created container and create a SAS token for it; keep the token handy, we will need it later
- Copy `synapse/data/CreateSalesStarData.ipynb` into your Synapse environment and execute it using the Spark cluster (refer to comments in the notebook for additional setup)
- Copy the credential creator script to Synapse (use the SAS token in the file) and create a DB scoped credential to access the lakehouse
- Copy the Customer external table creation script to Synapse; replace the ADLS account name and the container name in the file before you execute it
- Copy the Product external table creation script to Synapse; replace the ADLS account name and the container name in the file before you execute it
- Copy the Store external table creation script to Synapse; replace the ADLS account name and the container name in the file before you execute it
- Finally, create a login using the Create Login script and add the role to the dedicated SQL pool using the Add Role script
We set up the dedicated SQL pool of Synapse to talk to the lakehouse external tables; you could also do a similar setup with the serverless SQL pool. Adjust the above accordingly.
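Once the external tables exist, the app can query them over ODBC like any other Synapse table. A minimal sketch (server, database, login and driver version are placeholders, not the repo's exact connection code):

```python
# Minimal sketch: query a lakehouse external table through the dedicated SQL pool.
# Needs the MS ODBC driver installed by scripts/mssql-odbc.sh; adjust the driver name/version
# to whatever that script installs. Server, database and login are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net,1433;"
    "DATABASE=<dedicated-pool-db>;"
    "UID=<login>;PWD=<password>;Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 5 * FROM Customer;")  # Customer is one of the external tables
for row in cursor.fetchall():
    print(row)
```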
- Prerequisite: you need Docker installed and running on your system.
- EASY STEP: The repo comes with a `.devcontainer`, so if you are using VS Code you will be prompted to build and open the project in a container. If you do not get this pop-up and you are on VS Code, open the command palette and look for "Dev Containers: ..."; follow the VS Code documentation on opening a folder in a dev container.
- COMPLEX STEP: If not, you need to set up a few things manually. I assume a Debian environment with the Azure CLI installed and conda (mamba optional) installed and available. Please make sure these are set up in your environment.
- Switch to root and install the MS ODBC driver: from the container, switch to root with `sudo su` and then execute `scripts/mssql-odbc.sh`
- Finally, install all the required dependencies using conda: `conda env update -f app/env.yaml`
- Switch over to the new environment using: `conda activate athena`
- Run the langchain server: execute `langchain-server` in an activated conda environment (athena), then navigate to http://localhost:4173/ (a note on enabling tracing from the app follows this list)
- Make sure you have followed the steps above, then run `streamlit run app/Azure_OpenAI_demo.py` in an activated conda environment (athena)
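If traces do not appear at localhost:4173, note that in the LangChain versions that ship `langchain-server`, tracing is typically switched on via an environment variable before the app builds its chains. Treat the variable name below as an assumption and check it against the tracing docs for your installed LangChain version:

```python
# Assumed for older LangChain releases that include `langchain-server`:
# point tracing at the local tracing server before any chain is constructed.
import os

os.environ["LANGCHAIN_HANDLER"] = "langchain"  # send traces to the local langchain-server
```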