Athena

Query data from data lake in natural language: LLMs + Chain of thought modelling

Motivation

We know that LLMs can generate SQL from natural language. The challenge in adopting this to let users of all skill levels query big data is manifold. From the LLM's perspective: to generate a correct SQL query from natural language, the model not only needs to understand the context but also needs knowledge of the database schema. Passing the schema in the prompt is one approach, but it does not scale.

We showcase how prompt engineering with chain-of-thought modelling can make this approach scalable. This project shows an LLM going from natural language to querying a star schema in a data lake (via Synapse) without needing to know the DB schema beforehand. We employ the Lakehouse pattern in this showcase.
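
As a rough illustration of the idea (a minimal sketch, not the app's actual code in app/), a LangChain SQL chain can be pointed at the Synapse dedicated SQL pool: the chain reads table metadata from the database itself and reasons step by step towards a query, so the schema never has to be hand-written into the prompt. It assumes the classic LangChain SQLDatabaseChain API; all server, database, key and deployment names below are placeholders.

```python
# Minimal sketch, assuming the classic LangChain SQLDatabaseChain API and an
# Azure OpenAI deployment; every connection value below is a placeholder.
import os

from langchain.llms import AzureOpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://<your-aoai-resource>.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "<your-key>"
os.environ["OPENAI_API_VERSION"] = "2022-12-01"

llm = AzureOpenAI(deployment_name="<your-deployment>", temperature=0)

# SQLAlchemy/pyodbc connection to the Synapse dedicated SQL pool that exposes the external tables.
db = SQLDatabase.from_uri(
    "mssql+pyodbc://<user>:<password>@<workspace>.sql.azuresynapse.net:1433/<sqlpool>"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)

# The chain inspects the database's table info at run time, so no schema is pasted into the prompt.
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
print(chain.run("Which store had the highest total sales last month?"))
```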

The high-level architecture is as follows: Architecture

Quickstart

Prerequisites:

  • An Azure OpenAI Service instance up and running
  • An Azure Synapse workspace up and running
  • Data lake populated as per description in section "Setting up the data in the Data lake" below

The project is dockerized: edit the Environment file with appropriate values and then run docker-compose up

Navigate to localhost:8501 for the app: App Screenshot

Trace the chain of thought at localhost:4173: Chain of Thought

The following star schema and data will be populated in the data lake (from the data generation step): Table Struct

Quick intro to the folder structure

  • app: The Streamlit app and its logic
  • app/helpers: Helper modules/scripts
  • app/pages: Pages for the Streamlit app (will be extended with more pages in the future)
  • scripts: Setup scripts to install and configure drivers
  • synapse: Scripts to generate fake data for our retail star schema and store it in the data lake via Spark, plus scripts to create external tables via SQL (lakehouse pattern)
  • images: Images used in the documentation/readme

Setting up the data in the Data lake

  • Create a Synapse environment with a dedicated SQL pool and a Spark cluster (refer to the Azure docs)
  • Make sure your user has the Storage Blob Data Contributor role on the ADLS storage account created with Synapse, and create a new container there, dataset, to hold our generated data
  • Navigate to the newly created container and create a SAS token for it; keep the token handy, we will need it later
  • Copy synapse/data/CreateSalesStarData.ipynb into your Synapse environment and execute it using the Spark cluster (refer to the comments in the notebook for additional setup; a rough sketch of this step follows the list)
  • Copy the credential creator file to Synapse (use the SAS token in the file) and create a database scoped credential to access the Lakehouse
  • Copy the Customer external table creation script to Synapse; replace the ADLS account name and the container name in the file before you execute it
  • Copy the Product external table creation script to Synapse; replace the ADLS account name and the container name in the file before you execute it
  • Copy the Store external table creation script to Synapse; replace the ADLS account name and the container name in the file before you execute it
  • Finally, create a login using the Create Login script and add the role to the dedicated SQL pool using the Add role script
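
For orientation, this is roughly what the data-generation step does (a minimal sketch, assuming Faker and Parquet output; the real logic, schema and paths live in synapse/data/CreateSalesStarData.ipynb, and the column names and paths below are placeholders):

```python
# Rough sketch of generating fake dimension data and writing it to the lake as Parquet.
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided inside a Synapse notebook
fake = Faker()

# Tiny fake customer dimension; the actual notebook builds the full retail star schema.
customers = [(i, fake.name(), fake.city()) for i in range(1, 101)]
df = spark.createDataFrame(customers, ["CustomerId", "Name", "City"])

# Write into the container created above (account and container names are placeholders).
df.write.mode("overwrite").parquet(
    "abfss://dataset@<adls-account>.dfs.core.windows.net/sales-star/customer"
)
```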

We set up the dedicated SQL pool of Synapse to talk to the lakehouse external tables; you could also do a similar setup with the serverless SQL pool. Adjust the above accordingly.
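
Once the external tables exist, a quick sanity check from any machine with the MS ODBC driver can confirm the pool is reachable (an optional check, not one of the repo's scripts; the server, database, login and table names below are placeholders and assume the Customer external table lives in the dbo schema):

```python
# Optional sanity check: query one of the external tables over ODBC.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<sqlpool>;UID=<login>;PWD=<password>"
)
for row in conn.execute("SELECT TOP 5 * FROM dbo.Customer"):
    print(row)
```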

Local development setup

  • Prerequisite: you need Docker installed and running on your system.
  • EASY STEP: The repo comes with a .devcontainer, so if you are using VS Code you will be prompted to build and open in a container. If you do not get this pop-up and you are on VS Code, open the command palette and look for "Dev containers:..."; follow the VS Code documentation on opening a folder in a dev container.
  • COMPLEX STEP: If not, you need to set up a few things manually. I assume a Debian environment with the Azure CLI and conda (mamba optional) installed and available. Please make sure you have these set up in your environment.
    • Switch to root and install the MS ODBC driver: sudo su, then execute . scripts/mssql-odbc.sh
    • Finally, install all the required dependencies using conda: conda env update -f app/env.yaml
  • Switch over to the new environment using: conda activate athena
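
To verify the driver installation from the activated athena environment, you can list the ODBC drivers pyodbc can see (an optional check, not part of the repo's scripts):

```python
# Optional check: confirm the MS SQL ODBC driver is visible to pyodbc.
import pyodbc

print(pyodbc.drivers())  # expect an entry like "ODBC Driver 18 for SQL Server"
```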

Locally running the app

  • Run the langchain server: execute langchain-server in an activated conda environment (athena), then navigate to http://localhost:4173/
  • Make sure you have followed the steps above, then run streamlit run app/Azure_OpenAI_demo.py in an activated conda environment (athena)
