[Bug]: SubQuestionQueryEngine generates a python script instead of markdown with questions causing OutputParserException #13498

Open
SNIKO opened this issue May 15, 2024 · 9 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

SNIKO commented May 15, 2024

Bug Description

I'm using SubQuestionQueryEngine with multiple tools, but it crashes with the following exception:

Traceback (most recent call last):
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\output_parsers\utils.py", line 45, in parse_json_markdown
json_obj = json.loads(json_string)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\json_init_.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 3 column 5 (char 8)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\output_parsers\utils.py", line 52, in parse_json_markdown
json_obj = yaml.safe_load(json_string)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\yaml_init_.py", line 125, in safe_load
return load(stream, SafeLoader)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\yaml_init_.py", line 81, in load
return loader.get_single_data()
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\yaml\constructor.py", line 49, in get_single_data
node = self.get_single_node()
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\yaml\composer.py", line 39, in get_single_node
if not self.check_event(StreamEndEvent):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\yaml\parser.py", line 98, in check_event
self.current_event = self.state()
^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\yaml\parser.py", line 171, in parse_document_start
raise ParserError(None, None,
yaml.parser.ParserError: expected '&lt;document start&gt;', but found '&lt;scalar&gt;'
in "&lt;unicode string&gt;", line 4, column 5:
semantic_search = pipeline('sema ...
^

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\AI\llama\rag2.py", line 119, in
response = top_query_engine.query(query)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\instrumentation\dispatcher.py", line 274, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\base\base_query_engine.py", line 53, in query
query_result = self._query(str_or_query_bundle)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\query_engine\sub_question_query_engine.py", line 145, in _query
sub_questions = self._question_gen.generate(self._metadatas, query_bundle)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\question_gen\llm_generators.py", line 81, in generate
parse = self._prompt.output_parser.parse(prediction)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\question_gen\output_parser.py", line 11, in parse
json_dict = parse_json_markdown(output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sergii.vashchyshchuk\AppData\Roaming\Python\Python312\site-packages\llama_index\core\output_parsers\utils.py", line 54, in parse_json_markdown
raise OutputParserException(
llama_index.core.output_parsers.base.OutputParserException: Got invalid JSON object. Error: Extra data: line 3 column 5 (char 8) expected '&lt;document start&gt;', but found '&lt;scalar&gt;'
in "&lt;unicode string&gt;", line 4, column 5:
semantic_search = pipeline('sema ...
^. Got JSON string: []

# Create a semantic search model
semantic_search = pipeline('semantic-search')

# Iterate over each tool
for tool_name, tool_description in tools.items():
    # Perform semantic search to find relevant phrases in the user question
    relevance_scores = semantic_search(user_question, [tool_description])
    relevance_score = relevance_scores[0]['score']

After debugging, I found out that when the model (llama3:70b-instruct) is asked to generate sub-questions from the tool metadata, it produces a Python script that would generate the questions instead of the questions themselves:

> Here is a Python solution that generates the desired output:
> ```
> import json
> 
> def generate_sub_questions(tools, user_question):
>     sub_questions = []
> 
>     # Extract relevant keywords from the user question
>     keywords = [word.lower() for word in user_question.split()]
> 
>     # Iterate over each tool
>     for tool_name, tool_description in tools.items():
>         # Check if any keyword is present in the tool description
>         for keyword in keywords:
>             if keyword in tool_description.lower():
>                 # Generate sub-questions based on the tool and user question
>                 sub_questions.extend(generate_sub_questions_for_tool(tool_name, user_question))
> 
>     return {"items": sub_questions}
> 
> def generate_sub_questions_for_tool(tool_name, user_question):
>     sub_questions = []
> 
>     # Example 1: Uber/Lyft financials
>     if "uber" in tool_name or "lyft" in tool_name:
>         sub_questions.extend([
>             {"sub_question": f"What is the revenue growth of {tool_name.split('_')[0].capitalize()}", "tool_name": tool_name},
>             {"sub_question": f"What is the EBITDA of {tool_name.split('_')[0].capitalize()}", "tool_name": tool_name}
>         ])
> 
>     # Example 2: CargoWise Wiki
>     elif "cargowise" in tool_name.lower():
>         sub_questions.extend([
>             {"sub_question": f"How to migrate data from old column to new one in {tool_name.split('_')[0].capitalize()}?", "tool_name": tool_name},
>             {"sub_question": f"What transformations of which type should I write for migrating data in {tool_name.split('_')[0].capitalize()}?", "tool_name": tool_name}
>         ])
> 
>     return sub_questions
> 
> # Example 1
> tools = {
>     "uber_10k": "Provides information about Uber financials for year 2021",
>     "lyft_10k": "Provides information about Lyft financials for year 2021"
> }
> user_question = "Compare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021"
> print(json.dumps(generate_sub_questions(tools, user_question), indent=4))

which the output parser cannot handle.

I believe the question_gen_prompt needs to be made clearer, so that the LLM understands we want the result itself, not a script that generates the result.

Version

0.10.37

Steps to Reproduce

It is inconsistent: for some queries/tools/metadata it works, for others it doesn't. Even for the same query it sometimes works and sometimes doesn't. A rough sketch of the setup follows.
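For context, a minimal sketch of the kind of wiring described above, assuming two vector indices behind query-engine tools and llama3 served through Ollama (the data paths, index construction, and settings are illustrative, not the exact script from this report):

```python
# Illustrative sketch only: two query-engine tools behind a SubQuestionQueryEngine,
# driven by a local llama3 instruct model via Ollama. Paths, data, and settings are
# assumptions; embedding-model configuration is omitted for brevity.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3:70b-instruct", request_timeout=300.0)

# Build one index per data source (hypothetical local folders).
uber_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("data/uber").load_data()
)
lyft_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("data/lyft").load_data()
)

# Wrap each index as a tool; the descriptions feed the question-generation prompt.
tools = [
    QueryEngineTool(
        query_engine=uber_index.as_query_engine(llm=llm),
        metadata=ToolMetadata(
            name="uber_10k",
            description="Provides information about Uber financials for year 2021",
        ),
    ),
    QueryEngineTool(
        query_engine=lyft_index.as_query_engine(llm=llm),
        metadata=ToolMetadata(
            name="lyft_10k",
            description="Provides information about Lyft financials for year 2021",
        ),
    ),
]

top_query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools, llm=llm)
response = top_query_engine.query(
    "Compare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021"
)
print(response)
```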

Relevant Logs/Tracebacks

No response

@SNIKO added the bug (Something isn't working) and triage (Issue needs to be triaged/prioritized) labels on May 15, 2024

dosubot bot commented May 15, 2024

Hello @SNIKO! I'm Dosu, a bot here to lend a hand with bugs, answer your questions, and guide you on contributing while we wait for a human maintainer to join us. I'm diving into your issue now and will get back to you with a full answer shortly. Let's tackle this together!

logan-markewich (Collaborator) commented:

I think the prompt is fairly clear? But you can definitely customize the prompt too

Open-source LLMs are much less reliable for structured outputs. In this case I might use Ollama with JSON mode turned on.
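For reference, a minimal sketch of what that could look like, assuming the `llama-index-llms-ollama` integration and its `json_mode` flag (which maps to Ollama's `format="json"` option):

```python
# Sketch: ask Ollama to constrain generation to valid JSON.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3:70b-instruct",
    request_timeout=300.0,
    json_mode=True,  # assumption: json_mode is available in the installed version
)
```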

SNIKO (Author) commented May 15, 2024

Not clear enough for open-source models )

Is there a way to override the prompt (question_gen_prompt)? Want to try making it clearer.

logan-markewich (Collaborator) commented May 15, 2024

You'd have to override the question generator

The prompt is a tad complicated. I would only modify the prefix tbh. But here's the whole spiel

import json

from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.question_gen.prompts import build_tools_text
from llama_index.core.question_gen.types import SubQuestion
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import ToolMetadata

PREFIX = """\
Given a user question, and a list of tools, output a list of relevant sub-questions \
in json markdown that when composed can help answer the full user question:

"""


example_query_str = (
    "Compare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021"
)
example_tools = [
    ToolMetadata(
        name="uber_10k",
        description="Provides information about Uber financials for year 2021",
    ),
    ToolMetadata(
        name="lyft_10k",
        description="Provides information about Lyft financials for year 2021",
    ),
]
example_tools_str = build_tools_text(example_tools)
example_output = [
    SubQuestion(
        sub_question="What is the revenue growth of Uber", tool_name="uber_10k"
    ),
    SubQuestion(sub_question="What is the EBITDA of Uber", tool_name="uber_10k"),
    SubQuestion(
        sub_question="What is the revenue growth of Lyft", tool_name="lyft_10k"
    ),
    SubQuestion(sub_question="What is the EBITDA of Lyft", tool_name="lyft_10k"),
]
example_output_str = json.dumps({"items": [x.dict() for x in example_output]}, indent=4)

EXAMPLES = f"""\
# Example 1
<Tools>
\`\`\`json
{example_tools_str}
\`\`\`

<User Question>
{example_query_str}


<Output>
\`\`\`json
{example_output_str}
\`\`\`

""".replace(
    "{", "{{"
).replace(
    "}", "}}"
)

SUFFIX = """\
# Example 2
<Tools>
\`\`\`json
{tools_str}
\`\`\`

<User Question>
{query_str}

<Output>
"""

DEFAULT_SUB_QUESTION_PROMPT_TMPL = PREFIX + EXAMPLES + SUFFIX

question_gen = LLMQuestionGenerator.from_defaults(
  llm=llm,
  prompt_template_str=DEFAULT_SUB_QUESTION_PROMPT_TMPL
)

sub_question_engine = SubQuestionQueryEngine.from_defaults(..., question_gen=question_gen)

logan-markewich (Collaborator) commented:

(ignore the \ chars I inserted, was trying to get the markdown to render nicely)

SNIKO (Author) commented May 15, 2024

Yeah, adding this to the end of the prompt resolves the issue:

Make sure you generate an actual json as an output, not a script generating the json.
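A minimal sketch of wiring that extra sentence in without copying the whole template, assuming `DEFAULT_SUB_QUESTION_PROMPT_TMPL` is importable from `llama_index.core.question_gen.prompts` (where exactly the sentence lands in the template is a judgment call):

```python
# Sketch: append a stricter instruction to the default sub-question prompt
# and hand the result to a custom question generator.
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.question_gen.prompts import DEFAULT_SUB_QUESTION_PROMPT_TMPL
from llama_index.core.query_engine import SubQuestionQueryEngine

STRICT_TMPL = DEFAULT_SUB_QUESTION_PROMPT_TMPL + (
    "\nMake sure you generate an actual json as an output, "
    "not a script generating the json.\n"
)

question_gen = LLMQuestionGenerator.from_defaults(
    llm=llm,  # the same LLM the engine uses
    prompt_template_str=STRICT_TMPL,
)
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,  # your existing tools
    question_gen=question_gen,
    llm=llm,
)
```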

logan-markewich (Collaborator) commented:

That's pretty specific though, to solve a single edge case 😅 At the top it already says json markdown. Maybe it's a case where some LLMs pay more attention to the end of the prompt than the start.

SNIKO (Author) commented May 15, 2024

There are not that many models; you can write a prompt that satisfies all of them, especially popular ones like llama3 )

Adding stricter instructions for "silly" models won't hurt. I realize that small models often generate invalid JSON (there are a few bugs reported here) - they are hopeless, and there is nothing you can do about that. But for models like this one, which merely misunderstand the prompt, we can add more instructions to the prompt; it will not hurt other models.


dosubot bot commented May 15, 2024

To address the issue with the SubQuestionQueryEngine generating Python scripts instead of the expected JSON markdown, you can customize the prompt to ensure it explicitly requests the desired output format. This approach involves adjusting the prompt to be clear and specific about the output requirements, which can help guide the AI model to produce the correct format, especially for models like llama3 that might misinterpret less specific instructions.

Here's a simplified template based on your needs:

PREFIX = """Given a user question, and a list of tools, output a list of relevant sub-questions in JSON markdown that can help answer the full user question:"""

EXAMPLES = f"""
# Example 1
<Tools>
```json
{{
    "uber_10k": "Provides information about Uber financials for year 2021",
    "lyft_10k": "Provides information about Lyft financials for year 2021"
}}
Compare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021 ```json {{ "items": [ {{"sub_question": "What is the revenue growth of Uber", "tool_name": "uber_10k"}}, {{"sub_question": "What is the EBITDA of Uber", "tool_name": "uber_10k"}}, {{"sub_question": "What is the revenue growth of Lyft", "tool_name": "lyft_10k"}}, {{"sub_question": "What is the EBITDA of Lyft", "tool_name": "lyft_10k"}} ] }} ``` """

SUFFIX = """Now, given the following tools and user question, generate the relevant sub-questions in JSON markdown format:"""

# Combine them to form the complete prompt

complete_prompt = PREFIX + EXAMPLES + SUFFIX


This template clearly states the task, provides detailed examples, and reinforces the expected output format, which should help in generating valid JSON markdown for questions. It's designed to be adaptable for various AI models and tasks, ensuring clarity and consistency in the expected output format.

Regarding the `OutputParserException`, ensuring that the examples and the expected output are in valid JSON format is crucial for avoiding these errors. This approach does not rely on overly specific instructions that might not apply to other contexts but focuses on the format and the type of reasoning required to generate sub-questions.
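As a side note, a candidate prompt tweak can be sanity-checked offline by feeding a sample model output through the same parser the question generator uses (`parse_json_markdown`, visible in the traceback above); a minimal sketch:

```python
# Sketch: validate a sample LLM output with the parser used by the question generator,
# without running the full query pipeline.
from llama_index.core.output_parsers.utils import parse_json_markdown

fence = "`" * 3  # avoids literal triple backticks inside this snippet
sample_output = (
    f"{fence}json\n"
    '{"items": [{"sub_question": "What is the EBITDA of Uber", "tool_name": "uber_10k"}]}\n'
    f"{fence}\n"
)

# Parses the fenced JSON into a dict; a script-like output such as the one in the
# traceback above would raise OutputParserException instead.
print(parse_json_markdown(sample_output))
```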

If you encounter further issues or have any questions, feel free to ask for more clarification.


**Sources**
- [llama-index-core/llama_index/core/question_gen/prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/question_gen/prompts.py)
- [llama-index-legacy/llama_index/legacy/question_gen/prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/question_gen/prompts.py)
- [llama-index-core/llama_index/core/question_gen/llm_generators.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/question_gen/llm_generators.py)
- [llama-index-legacy/llama_index/legacy/program/llm_program.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/program/llm_program.py)
- [llama-index-core/llama_index/core/question_gen/output_parser.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/question_gen/output_parser.py)
- [llama-index-legacy/llama_index/legacy/question_gen/output_parser.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/question_gen/output_parser.py)

