# Querying a pandas DataFrame with an LLM

Here we build an AI agent that generates Python code, runs it, and returns the answer to the user.
This example uses `deepseek-r1:7b` running locally. The model does not always return answers in the desired format, so some requests crash or need extra formatting, but it can still perform remarkably well!

First we build a DataFrame, for instance by querying astronomical data from the Gaia mission:

In [1]:
%%time
from astroquery.gaia import Gaia

query_string = """SELECT l, b, phot_g_mean_mag, bp_rp, parallax, pmra, pmdec, parallax_error, pmra_error, pmdec_error
  FROM gaiadr3.gaia_source_lite
  WHERE random_index < 5000"""

job = Gaia.launch_job_async(query=query_string, verbose=False)
gaia_data = job.get_results() 

INFO: Query finished. [astroquery.utils.tap.core]
CPU times: user 391 ms, sys: 60.4 ms, total: 451 ms
Wall time: 2.6 s


In [2]:
df_gaia = gaia_data.to_pandas()

In [3]:
df_gaia.shape

(5000, 10)

## Attempt using LlamaIndex

https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/

LlamaIndex passes the result of `df.head()` to the LLM, along with a prompt asking it to return an executable string.

DeepSeek seems too chatty for the default prompts in LlamaIndex, so executing the code that should provide the answer ends up crashing. Let's see what happens:

In [4]:
import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

# Also specify that we use a local LLM through Ollama
from llama_index.llms.ollama import Ollama
llm_deepseek = Ollama(model="deepseek-r1:7b", request_timeout=120.0, temperature=0.)

query_engine = PandasQueryEngine(df=df_gaia, verbose=True, llm=llm_deepseek)

In [5]:
%%time
response = query_engine.query(
    "How many rows does this dataframe contain?",
)

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
> Pandas Instructions:
```
<think>
Alright, so I'm trying to figure out how to determine the number of rows in the given pandas DataFrame called `df`. The user provided a sample output from `print(df.head())`, which shows the first five rows of the dataframe. They also included an example query asking for the number of rows and mentioned that the final answer should be an expression that can be evaluated with `eval()`.

First, I recall that in pandas, dataframes have a method called `shape` which returns a tuple containing the number of rows and columns. So, `df.shape[0]` would give the number of rows. But since the user wants this as a Python expression to be evaluated using `eval()`, I need to construct an expression that returns this value.

I think about how to structure this expression. The simplest way is just to call `len(df)` becau

Traceback (most recent call last):
  File "/Users/tristan/playground/gen-ai-ollama/.venv/lib/python3.13/site-packages/llama_index/experimental/query_engine/pandas/output_parser.py", line 44, in default_output_processor
    safe_exec(ast.unparse(module), {}, local_vars)  # type: ignore
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tristan/playground/gen-ai-ollama/.venv/lib/python3.13/site-packages/llama_index/experimental/exec_utils.py", line 171, in safe_exec
    return exec(__source, _get_restricted_globals(__globals), __locals)
  File "<string>", line 1, in <module>
NameError: name 'python' is not defined


This failed because the LLM we used does not return its output in the exact format expected by LlamaIndex, so the output parser ends up executing text that is not valid code.

We are going to do some manual coding to work around that.
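
The `NameError` in the traceback is what Python raises when a bare word, such as the language tag of a Markdown fence, is executed as code. The sketch below reproduces that error with plain `exec`. The string `reply_fragment` is a made-up stand-in for a partially stripped reply, and the list bound to `df` is just a placeholder; this is an assumption about what reached the parser, not a trace of LlamaIndex internals.

```python
# Hypothetical reply fragment: a Markdown fence whose backticks were
# stripped, leaving the language tag "python" on its own line
reply_fragment = "python\nlen(df)"

# Placeholder object bound to the name `df` (a real DataFrame is not needed here)
local_vars = {"df": [1, 2, 3]}

try:
    exec(reply_fragment, {}, local_vars)
    error_message = None
except NameError as exc:
    # The bare word "python" is looked up as a variable name and not found
    error_message = str(exc)

print(error_message)
```

Any stray token that survives parsing but is not a defined name would fail the same way, which is why the agent below extracts a single line of code before evaluating it.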

## Building our agent from scratch

We pass some information about the DataFrame in the prompt, then ask the LLM to produce code. The format of the reply varies: sometimes the code is on the last line, but most of the time it is wrapped in triple backquotes. Sometimes another sentence of text follows the answer. Sometimes the code is correct but wrapped in a `print()`, which prints the result instead of returning it. A more powerful model should format its output more reliably and consistently.

In [6]:
import ollama

class PandasAgent():
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def query(self, user_question: str, ollama_llm: str):
        df = self.df
        df_str = df.head()
        self.pandas_prompt = (
            "CONTEXT:\n"
            "You are working with a pandas dataframe in Python.\n"
            "The name of the dataframe is `df`.\n"
            "This is the result of `print(df.head())`:\n"
            f"{df_str}\n\n"
            "INSTRUCTIONS:\n"
            "Convert the user request to a one-liner executable Python code using Pandas.\n"
            "Reply with a single line.\n"
            "The question is:\n"
            f"{user_question}\n"
            "If there are more than one way of solving it, just choose one.\n"
            "IMPORTANT:\n"
            "Your reply should not contain any text, only code."
        )
        self.full_output = ollama.generate(model=ollama_llm,
                                            prompt=self.pandas_prompt,
                                            options={"temperature": 0},
                                            )

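        # Keep the first line that follows the last ```python fence
        # (without a fence, this is simply the first line of the reply)
        self.code_response = self.full_output.response.split("```python\n")[-1].split('\n')[0]
        # If the code is wrapped in print(...), unwrap it so that eval() returns the value
        if self.code_response.startswith("print("):
            self.code_response = self.code_response[6:-1]

        try:
            # The generated expression refers to the local name `df`
            self.actual_result = eval(self.code_response)
            return self.actual_result
        except Exception as exc:
            print(self.code_response, "cannot be evaluated:", exc)
            return None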
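
The parsing step can also be written as a standalone function that covers the reply formats described above. This is a minimal sketch, not the code used to produce the outputs in this notebook: `extract_code` and `FENCE` are hypothetical names, and it assumes the reasoning block is closed by a `</think>` tag.

```python
import re

FENCE = "`" * 3  # triple backquote, built this way to keep the example readable


def extract_code(reply: str) -> str:
    """Return a single line of code from a chatty LLM reply (hypothetical helper)."""
    # Drop the reasoning block emitted between <think> tags
    reply = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)
    # Prefer the content of the last fenced block, with or without a language tag
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    fenced = re.findall(pattern, reply, flags=re.DOTALL)
    text = fenced[-1] if fenced else reply
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    # First line of a fenced block, otherwise the last line of the reply
    code = lines[0] if fenced else lines[-1]
    # Unwrap print(...) so that eval() returns the value instead of None
    if code.startswith("print(") and code.endswith(")"):
        code = code[len("print("):-1]
    return code


reply = (
    "<think>Let me think...</think>\n"
    + FENCE + "python\nprint(len(df))\n" + FENCE
    + "\nThis counts the rows."
)
print(extract_code(reply))  # prints: len(df)
```

Keeping the extraction separate from `eval()` makes it easier to test the parsing on saved replies without calling the model.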


In [7]:
agent_df_gaia = PandasAgent(df_gaia)

In [8]:
%%time
answer = agent_df_gaia.query('How many rows are there in the dataframe?', ollama_llm='deepseek-r1:7b')

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
CPU times: user 3.58 ms, sys: 1.17 ms, total: 4.74 ms
Wall time: 25 s


In [9]:
print('The result is:',answer)
#print(agent_df_gaia.pandas_prompt)
print(agent_df_gaia.full_output.response)

The result is: 5000
<think>
Okay, so I'm trying to figure out how to determine the number of rows in a pandas DataFrame using Python. The user provided a sample dataframe and wants to know the method for this.

First, I remember that pandas DataFrames have some built-in functions or attributes related to their shape. Oh right, there's the `shape` attribute which returns a tuple indicating the number of rows and columns. So if I access `df.shape`, it should give me something like (number_of_rows, number_of_columns).

Looking at the sample dataframe provided, when they printed `df.head()`, it showed 5 rows. So using `len(df)` would also return 5 because that's another way to get the total number of rows.

But since the user mentioned that if there are multiple ways, just choose one, I can go with either method. However, using `shape` is more direct and efficient because it gives both dimensions in one attribute, whereas `len(df)` would require an extra step if someone wanted columns as w

It worked! The answer matches the 5000 rows reported by `df_gaia.shape`.

## Trying a more complex query

In [10]:
answer = agent_df_gaia.query("""Give the median parallax of the df rows grouped by integer value of their phot_g_mean_mag. 
Note that the column phot_g_mean_mag contains some NaN values.""",
ollama_llm='deepseek-r1:7b')

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"


In [11]:
print('The result is:',answer)
#print(agent_df_gaia.pandas_prompt)
print(agent_df_gaia.full_output.response)

The result is: phot_g_mean_mag
10.0    1.570822
11.0    1.604811
12.0    0.848711
13.0    1.056185
14.0    0.659533
15.0    0.461134
16.0    0.377964
17.0    0.328263
18.0    0.258979
19.0    0.296725
20.0    0.323012
21.0         NaN
Name: parallax, dtype: float64
<think>
Okay, so I need to figure out how to calculate the median parallax for each group in the dataframe based on the integer value of phot_g_mean_mag. The user mentioned that phot_g_mean_mag has some NaN values, so I have to handle those.

First, I remember that pandas can group data using the groupby function. So I'll start by grouping the dataframe 'df' by the 'phot_g_mean_mag' column. But since there are NaNs, I should convert the values to integers after dropping the NaNs to ensure proper grouping.

Wait, how do I handle the NaNs? Maybe I can drop them first before converting to integer. So I'll use df.dropna(subset=['phot_g_mean_mag']) to exclude rows where phot_g_mean_mag is NaN. Then, I'll cast that column to integ

It worked! The LLM found the right solution AND returned the one-liner in a parsable format.
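
The model's final one-liner is cut off in the output above, so the following is only a sketch of one expression that answers the same question, run on a small made-up DataFrame (`toy` and its values are illustrative, not Gaia data). It floors the magnitude to get the integer bin and relies on `groupby` dropping NaN keys by default.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "phot_g_mean_mag": [10.2, 10.9, 11.5, np.nan, 11.1],
    "parallax": [1.0, 3.0, 0.5, 9.9, 1.5],
})

# np.floor keeps NaN as NaN, and groupby ignores NaN keys (dropna=True by default),
# so the row with a missing magnitude does not contribute to any group
median_by_mag = toy.groupby(np.floor(toy["phot_g_mean_mag"]))["parallax"].median()
print(median_by_mag)
```

Flooring rather than casting to `int` avoids an error on the NaN values and yields a float index, like the one in the result above.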

## Another complex query

In [12]:
answer = agent_df_gaia.query("""Is the mean value of l larger than the mean value of b?""",
ollama_llm='deepseek-r1:7b')

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"


In [13]:
print('The result is:',answer)
#print(agent_df_gaia.pandas_prompt)
print(agent_df_gaia.full_output.response)

The result is: True
<think>
Okay, so I need to figure out how to determine if the mean value of 'l' is larger than the mean value of 'b' in the given pandas DataFrame. Let me break this down step by step.

First, I know that in pandas, calculating the mean of a column is straightforward using the .mean() method. So for each column 'l' and 'b', I can compute their respective means.

The user's question is asking whether the average of 'l' is greater than the average of 'b'. That translates to comparing these two means. If df['l'].mean() is greater than df['b'].mean(), then the answer is yes; otherwise, no.

I should write a one-liner that performs this comparison. So combining everything, it would be something like (df['l'] > df['b']).mean(). But wait, actually, I just need to check if the mean of 'l' is greater than the mean of 'b'. So comparing the two means directly.

Alternatively, another approach could be to subtract the mean of 'b' from the mean of 'l' and see if it's positive. T

Correct again! This time it forgot to return a single solution, and added a lot of text around the answers too, but our parsing scheme was still able to grab the executable snippet of code.
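
The model's reasoning briefly considers `(df['l'] > df['b']).mean()` before settling on a direct comparison of the two means. The two expressions answer different questions, as this sketch on made-up values shows (`toy` is illustrative, not Gaia data):

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"l": [10.0, 0.0, 0.0], "b": [1.0, 1.0, 1.0]})

# Compares the two column means: a single boolean
mean_l_larger = toy["l"].mean() > toy["b"].mean()

# Fraction of rows where l exceeds b: a float between 0 and 1
fraction_rows_l_larger = (toy["l"] > toy["b"]).mean()

print(mean_l_larger, fraction_rows_l_larger)
```

Here the mean of `l` is larger than the mean of `b` even though `l` exceeds `b` in only one row out of three, so only the first expression answers the user's question.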