Blog-Evaluating Answers with Large Language Models: How InferESG and RAGAS Helped #377
base: gh-pages
Conversation
csalt-scottlogic left a comment:
It's a good start. I've made a few suggestions on rephrasing things for clarity and a small number of grammar corrections, and flagged some accessibility issues with the graphs.
> # Evaluating Answers with Large Language Models: How InferESG and RAGAS Helped
> In our latest project, we set out to evaluate how different Large Language Models (LLMs) perform when responding to user prompts. Building on our existing platform, InferESG, which automatically generates greenwashing reports from ESG disclosures, our goal was not to determine which model is superior, but rather to assess whether open-source models can serve as viable alternatives to OpenAI’s proprietary models.
The sentence that starts "Building..." is a pretty long run-on sentence. It might be better split up and with the goal put to the fore. Something like everything after "our goal" as one sentence, then "We did this by building on our existing platform, InferESG, which is..."
It would be good, if we've published anything previously about InferESG, to turn "InferESG" into a link to it
> For this study, we tested the following models: DeepSeek, Gemma-3-1B, GPT-4o, GPT-4o-mini, GPT-OSS-20B, LFM2-1.2, and Qwen3-30B-A3B. The table below gives a better understanding of the models sizes and provides links with useful information about those models.
Apostrophe missing in "models' sizes"
I'd also change "about those models" to "about them" to avoid repeating the word "models" too often
> ## The Role of InferESG
> InferESG served as the core system in our experiment. Under the hood, InferESG uses an agentic solver that breaks down the user’s request into specific subtasks and leverages various API calls to LLMs, each with task-specific system contexts. The system takes a primary sustainability document, analyses it, and generates a comprehensive greenwashing report. This generated report then serves as the foundation for evaluating and comparing the performance of other large language models.
Maybe explain a little bit more about what a "greenwashing report" actually is?
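For readers less familiar with this kind of setup, a minimal sketch of the decomposition pattern described above is shown below. The subtask names and system prompts are invented for illustration and this is not InferESG's actual code; it only assumes an OpenAI-compatible chat API.

```python
# Illustrative sketch of an agentic solver: split a request into subtasks and
# call an LLM with a task-specific system context for each one.
# NOT InferESG's real implementation; subtasks and prompts are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SUBTASKS = {
    "extract_claims": "You extract sustainability claims from the supplied ESG text.",
    "check_evidence": "You judge whether each claim is backed by concrete evidence.",
    "summarise_risks": "You summarise potential greenwashing risks for a final report.",
}

def solve(user_request: str, document_text: str, model: str = "gpt-4o") -> dict:
    """Run each subtask as a separate LLM call with its own system prompt."""
    results = {}
    for name, system_prompt in SUBTASKS.items():
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{user_request}\n\n---\n\n{document_text}"},
            ],
        )
        results[name] = response.choices[0].message.content
    return results
```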
> To ensure the accuracy of our generated reports before using them to build our test dataset, we first created a benchmark report. In this benchmark, GPT-5 extracted all factual statements and classified them. A manual verification was conducted at the end to ensure the generated report was consistent and accurate. For a better understanding of this process, you can refer to our [blog](https://blog.scottlogic.com/2025/10/27/testing-open-source-llms.html), which provides a more detailed explanation of these steps.
It would be better to include the title of the linked article. "You can refer to our blog post, Beyond Benchmarks: Testing Open-Source LLMs in Multi-Agent Workflows"
You don't need the hostname part of the link when it's a link to another blog post
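As a rough sketch only, the fact-extraction step for the benchmark could look something like the following. The prompt wording and the classification labels are assumptions made for illustration, not the project's actual pipeline.

```python
# Illustrative sketch: ask a model to extract and classify factual statements
# from a generated report. Prompt wording and labels are assumptions.
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract every factual statement from the report below. "
    "Return one statement per line in the form: <statement> | <label>, "
    "where <label> is one of: supported, unsupported, ambiguous."
)

def extract_facts(report_text: str, model: str = "gpt-5") -> list[str]:
    """Return the model's classified factual statements, one per line."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.splitlines()
```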
> ## Evaluating with RAGAS
> Ragas played a central role in both creating our test dataset and evaluating model outputs through a set of well-defined metrics.
"RAGAS" is capitalised in some places but not in others - can you make it consistent please
> First, the quality of evaluation heavily depends on the clarity of questions and the quality of reference context.
> LLM outputs are context-sensitive, and poorly defined questions or insufficient reference material can significantly impact performance metrics.
> Second, hallucinations remain a concern with LLMs. In our experiments, setting the model temperature to 0 within Ragas helped mitigate this issue. Reducing the temperature controls the randomness of a model’s output and makes it more deterministic, increasing the likelihood of selecting the most probable next token. However, it also limits response diversity and nuance, making outputs more rigid and less adaptable to ambiguous queries.
Put a blank line before this so it starts a new paragraph
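As a sketch of what pinning the judge temperature to 0 can look like (using RAGAS's LangChain wrapper; the judge model, metric, and data here are illustrative assumptions):

```python
# Sketch: run the RAGAS judge model with temperature 0 so metric scoring is
# more deterministic. Judge model, metric, and data are illustrative only.
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Deterministic judge: temperature 0 reduces run-to-run variation in scores.
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

dataset = Dataset.from_dict({
    "question": ["Does the report disclose Scope 3 emissions?"],
    "answer": ["Yes, Scope 3 emissions are reported for the supply chain."],
    "contexts": [["The report covers Scope 3 emissions from purchased goods and logistics."]],
})

result = evaluate(dataset, metrics=[faithfulness], llm=judge)
print(result)
```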
> Overall, this helped reduce speculative or fabricated responses and improved answer fidelity.
> Third, prompt design is critical. Clear, well-structured prompts ensure that LLMs generate focused and relevant answers, which in turn supports more accurate evaluation outcomes.
Put a blank line before this so it starts a new paragraph
> Overall, the results across all models were generally positive, particularly in terms of semantic similarity, which remained consistently high across the board, indicating that most models preserved the intended meaning of answers even when phrasing differed from the reference.
Same again, if this is meant to be a new paragraph, put a blank line before it
> On average, Qwen3-30B emerged as the strongest performer, excelling in factual correctness and maintaining high semantic similarity, making it the most robust and reliable model for generating accurate, contextually grounded, and relevant answers. GPT-OSS-20B also performed very well, with strong answer accuracy and semantic similarity, making it a solid choice for balanced performance.
> In terms of performance and efficiency, Qwen3-30B-A3B (31B params) is the most computationally intensive model, excelling at complex, high-reasoning tasks but with longer execution times, particularly noticeable on local hardware. GPT-4o (200B params) offers the most advanced reasoning capabilities, though its large size comes with substantial computational cost. Models such as DeepSeek (8B params) and GPT-4o-mini (8B params) strike a balance between performance and efficiency, providing strong results with moderate runtimes. Smaller models like Gemma-3-1B and LFM2-1.2B (1B params each) are highly efficient and fast, making them well-suited for lightweight or time-sensitive tasks, though they are less capable of handling workloads that require extensive reasoning.
Personally, I find "B" as an abbreviation for "billion" less clear than "bn", because at first sight people might confuse "B" and "8"
You might want to crop your pic so your face is more central after the blog has filtered it - see how it looks on your local preview
Thanks Caitlin. I see what you mean, and I have cropped and updated it before, but it's still the same; I think it's because of the angle the picture was taken at. I will keep it as it is.
Please add a direct link to your post here:
https://.github.io/blog/
Have you (please tick each box to show completion):
- [ ] Run `npm install` followed by `npx mdspell "**/{FILE_NAME}.md" --en-gb -a -n -x -t` (if that's your thing)
- [ ] Posts are reviewed / approved by your Regional Tech Lead.