Imagine typing questions like these directly into your Shiny dashboard, and seeing the results in real time:
- "Show only penguins that are not species Gentoo and have a bill length greater than 50mm."
- "Show only blue states with an incidence rate greater than 100 per 100,000 people."
- "What is the average mpg of cars with 6 cylinders?"
querychat is a drop-in component for Shiny that allows users to query a data frame using natural language. The results are available as a reactive data frame, so they can be easily used from Shiny outputs, reactive expressions, downloads, etc.
This is not as terrible an idea as you might think! We need to be very careful when bringing LLMs into data analysis, as we all know that they are prone to hallucinations and other classes of errors. querychat is designed for reliability, transparency, and reproducibility using one key technique: denying the LLM raw access to the data, and forcing it to write SQL queries instead. See the section below on "How it works" for more.
Install from GitHub with:

```r
pak::pak("posit-dev/querychat/r-package")
```
First, you'll need an OpenAI API key. See the instructions from ellmer. (Or use a different LLM provider; see below.)
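For instance, ellmer's OpenAI support reads the `OPENAI_API_KEY` environment variable. For real use, put the key in `~/.Renviron` rather than in code; the sketch below only sets it for the current session:

```r
# Session-only sketch; "sk-..." is a placeholder for your actual key.
Sys.setenv(OPENAI_API_KEY = "sk-...")
```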
Here's a very minimal example that shows the three function calls you need to make.
```r
library(shiny)
library(bslib)
library(querychat)

# 1. Configure querychat. This is where you specify the dataset and can also
#    override options like the greeting message, system prompt, model, etc.
querychat_config <- querychat_init(mtcars)

ui <- page_sidebar(
  # 2. Use querychat_sidebar(id) in a bslib::page_sidebar.
  #    Alternatively, use querychat_ui(id) elsewhere if you don't want your
  #    chat interface to live in a sidebar.
  sidebar = querychat_sidebar("chat"),
  DT::DTOutput("dt")
)

server <- function(input, output, session) {
  # 3. Create a querychat object using the config from step 1.
  querychat <- querychat_server("chat", querychat_config)

  output$dt <- DT::renderDT({
    # 4. Use the filtered/sorted data frame anywhere you wish, via the
    #    querychat$df() reactive.
    DT::datatable(querychat$df())
  })
}

shinyApp(ui, server)
```
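Because `querychat$df()` is an ordinary reactive data frame, you can reuse it in reactive expressions, downloads, and so on. For instance, here's a sketch of a CSV download of the current filtered data (the output id `download_csv` and the filename are made up for illustration):

```r
# In the UI: downloadButton("download_csv", "Download filtered data")
# In the server function, alongside the renderDT() above:
output$download_csv <- downloadHandler(
  filename = function() "filtered-data.csv",
  content = function(file) {
    write.csv(querychat$df(), file, row.names = FALSE)
  }
)
```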
querychat's natural language chat experience is powered by LLMs. You may use any ellmer-supported model that can make tool calls, but we currently recommend (as of March 2025):
- GPT-4o
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
In our testing, we've found that those models strike a good balance between accuracy and latency. Smaller models like GPT-4o-mini are fine for simple queries but make surprising mistakes with moderately complex ones; and reasoning models like o3-mini slow down responses without providing meaningfully better results.
The small open source models (8B and below) we've tested have fared extremely poorly. Sorry. 🤷
querychat does not have direct access to the raw data; it can only read or filter the data by writing SQL `SELECT` statements. This is crucial for ensuring reliability, transparency, and reproducibility:
- Reliability: Today's LLMs are excellent at writing SQL, but bad at direct calculation.
- Transparency: querychat always displays the SQL to the user, so it can be vetted instead of blindly trusted.
- Reproducibility: The SQL query can be easily copied and reused.
Currently, querychat uses DuckDB for its SQL engine. It's extremely fast and has a surprising number of statistical functions.
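Because the generated SQL is always shown, you can copy it and rerun it yourself outside the app. A minimal sketch using the same DuckDB engine (assumes the DBI and duckdb packages are installed; the `SELECT` statement here is just an example of the kind of SQL querychat generates):

```r
library(DBI)

# Open an in-memory DuckDB database and expose the data frame as a table.
con <- dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "mtcars", mtcars)

# Paste in the SQL querychat displayed in the chat.
dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = 6 ORDER BY mpg DESC")

dbDisconnect(con, shutdown = TRUE)
```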
When the querychat UI first appears, you will usually want it to greet the user with some basic instructions. By default, these instructions are auto-generated every time a user arrives; this is slow, wasteful, and unpredictable. Instead, you should create a file called `greeting.md`, and when calling `querychat_init`, pass `greeting = readLines("greeting.md")`.
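For example, assuming `greeting.md` sits next to your `app.R`:

```r
querychat_config <- querychat_init(
  mtcars,
  greeting = readLines("greeting.md")
)
```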
If you need help coming up with a greeting, your own app can help you! Just launch it and paste this into the chat interface:
```
Help me create a greeting for your future users. Include some example questions. Format your suggested greeting as Markdown, in a code block.
```
And keep giving it feedback until you're happy with the result, which will then be ready to be pasted into `greeting.md`.
Alternatively, you can completely suppress the greeting by passing `greeting = ""`.
In LLM parlance, the system prompt is the set of instructions and specific knowledge you want the model to use during a conversation. querychat automatically creates a system prompt that comprises:
- The basic set of behaviors the LLM must follow in order for querychat to work properly. (See `inst/prompt/prompt.md` if you're curious what this looks like.)
- The SQL schema of the data frame you provided.
- (Optional) Any additional description of the data you choose to provide.
- (Optional) Any additional instructions you want to use to guide querychat's behavior.
If you give querychat your dataset and nothing else, it will provide the LLM with the basic schema of your data:
- Column names
- DuckDB data type (integer, float, boolean, datetime, text)
- For text columns with fewer than 10 unique values, we assume they are categorical variables and include the list of values
- For integer and float columns, we include the range
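For example, for `mtcars` the schema passed to the model might look roughly like this (purely illustrative; the actual template is in `inst/prompt/prompt.md`):

```
Table: mtcars
Columns:
- mpg (FLOAT): range 10.4 to 33.9
- cyl (INTEGER): range 4 to 8
- am (INTEGER): range 0 to 1
...
```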
And that's all the LLM will know about your data. If the column names are usefully descriptive, it may be able to make a surprising amount of sense out of the data. But if your data frame's columns are `x`, `V1`, `value`, etc., then the model will need to be given more background info--just like a human would.
To provide this information, use the `data_description` argument. For example, the `mtcars` data frame used in the example above has pretty minimal column names. You might create a `data_description.md` like this:
```
The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).

- mpg: Miles/(US) gallon
- cyl: Number of cylinders
- disp: Displacement (cu.in.)
- hp: Gross horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time
- vs: Engine (0 = V-shaped, 1 = straight)
- am: Transmission (0 = automatic, 1 = manual)
- gear: Number of forward gears
- carb: Number of carburetors
```
which you can then pass via:
```r
querychat_config <- querychat_init(
  mtcars,
  data_description = readLines("data_description.md")
)
```
querychat doesn't need this information in any particular format; just put whatever information, in whatever format, you think a human would find helpful.
You can add additional instructions of your own to the end of the system prompt by passing `extra_instructions` into `querychat_init`:
```r
querychat_config <- querychat_init(mtcars, extra_instructions = c(
  "You're speaking to a British audience--please use appropriate spelling conventions.",
  "Use lots of emojis! 😃 Emojis everywhere, 🌍 emojis forever. ♾️",
  "Stay on topic, only talk about the data dashboard and refuse to answer other questions."
))
```
You can also put these instructions in a separate file and use `readLines()` to load them, as we did for `data_description` above.
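For example (the filename is arbitrary):

```r
querychat_config <- querychat_init(
  mtcars,
  extra_instructions = readLines("extra_instructions.md")
)
```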
Warning: It is not 100% guaranteed that the LLM will always—or in many cases, ever—obey your instructions, and it can be difficult to predict which instructions will be a problem. So be sure to test extensively each time you change your instructions, and especially if you change the model you use.
Provide a `create_chat_func` function that takes a `system_prompt` parameter and returns an ellmer chat object. A convenient way to do this is with `purrr::partial`:
```r
querychat_config <- querychat_init(mtcars,
  create_chat_func = purrr::partial(ellmer::chat_claude, model = "claude-3-7-sonnet-latest")
)
```
This would use Claude 3.7 Sonnet instead, which requires you to provide an Anthropic API key. See the instructions from ellmer.
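If you'd rather not depend on purrr, a plain anonymous function works just as well. A sketch, with the same model assumption as above:

```r
querychat_config <- querychat_init(
  mtcars,
  create_chat_func = function(system_prompt) {
    # Any function that accepts system_prompt and returns an ellmer chat
    # object will do.
    ellmer::chat_claude(
      system_prompt = system_prompt,
      model = "claude-3-7-sonnet-latest"
    )
  }
)
```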