Blog-Evaluating Answers with Large Language Models: How InferESG and RAGAS Helped #377
base: gh-pages
Conversation
csalt-scottlogic left a comment:
It's a good start. I've made a few suggestions on rephrasing things for clarity and a small number of grammar corrections, and flagged some accessibility issues with the graphs.
> # Evaluating Answers with Large Language Models: How InferESG and RAGAS Helped
> In our latest project, we set out to evaluate how different Large Language Models (LLMs) perform when responding to user prompts. Building on our existing platform, InferESG, which automatically generates greenwashing reports from ESG disclosures, our goal was not to determine which model is superior, but rather to assess whether open-source models can serve as viable alternatives to OpenAI’s proprietary models.
The sentence that starts "Building..." is a pretty long run-on sentence. It might be better split up and with the goal put to the fore. Something like everything after "our goal" as one sentence, then "We did this by building on our existing platform, InferESG, which is..."
It would be good, if we've published anything previously about InferESG, to turn "InferESG" into a link to it
> For this study, we tested the following models: DeepSeek, Gemma-3-1B, GPT-4o, GPT-4o-mini, GPT-OSS-20B, LFM2-1.2, and Qwen3-30B-A3B. The table below gives a better understanding of the models sizes and provides links with useful information about those models.
Apostrophe missing in "models' sizes"
I'd also change "about those models" to "about them" to avoid repeating the word "models" too often
> ## The Role of InferESG
> InferESG served as the core system in our experiment. Under the hood, InferESG uses an agentic solver that breaks down the user’s request into specific subtasks and leverages various API calls to LLMs, each with task-specific system contexts. The system takes a primary sustainability document, analyses it, and generates a comprehensive greenwashing report. This generated report then serves as the foundation for evaluating and comparing the performance of other large language models.
Maybe explain a little bit more about what a "greenwashing report" actually is?
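For readers less familiar with this kind of setup, a minimal sketch of the decomposition pattern described above is shown below. The subtask names and system prompts are invented for illustration and this is not InferESG's actual code; it only assumes an OpenAI-compatible chat API.

```python
# Illustrative sketch of an agentic solver: split a request into subtasks and
# call an LLM with a task-specific system context for each one.
# NOT InferESG's real implementation; subtasks and prompts are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SUBTASKS = {
    "extract_claims": "You extract sustainability claims from the supplied ESG text.",
    "check_evidence": "You judge whether each claim is backed by concrete evidence.",
    "summarise_risks": "You summarise potential greenwashing risks for a final report.",
}

def solve(user_request: str, document_text: str, model: str = "gpt-4o") -> dict:
    """Run each subtask as a separate LLM call with its own system prompt."""
    results = {}
    for name, system_prompt in SUBTASKS.items():
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{user_request}\n\n---\n\n{document_text}"},
            ],
        )
        results[name] = response.choices[0].message.content
    return results
```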
> To ensure the accuracy of our generated reports before using them to build our test dataset, we first created a benchmark report. In this benchmark, GPT-5 extracted all factual statements and classified them. A manual verification was conducted at the end to ensure the generated report was consistent and accurate. For a better understanding of this process, you can refer to our [blog](https://blog.scottlogic.com/2025/10/27/testing-open-source-llms.html), which provides a more detailed explanation of these steps.
It would be better to include the title of the linked article. "You can refer to our blog post, Beyond Benchmarks: Testing Open-Source LLMs in Multi-Agent Workflows"
You don't need the hostname part of the link when it's a link to another blog post
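As a rough sketch only, the fact-extraction step for the benchmark could look something like the following. The prompt wording and the classification labels are assumptions made for illustration, not the project's actual pipeline.

```python
# Illustrative sketch: ask a model to extract and classify factual statements
# from a generated report. Prompt wording and labels are assumptions.
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract every factual statement from the report below. "
    "Return one statement per line in the form: <statement> | <label>, "
    "where <label> is one of: supported, unsupported, ambiguous."
)

def extract_facts(report_text: str, model: str = "gpt-5") -> list[str]:
    """Return the model's classified factual statements, one per line."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.splitlines()
```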
> ## Evaluating with RAGAS
> Ragas played a central role in both creating our test dataset and evaluating model outputs through a set of well-defined metrics.
"RAGAS" is capitalised in some places but not in others - can you make it consistent please
> First, the quality of evaluation heavily depends on the clarity of questions and the quality of reference context.
> LLM outputs are context-sensitive, and poorly defined questions or insufficient reference material can significantly impact performance metrics.
> Second, hallucinations remain a concern with LLMs. In our experiments, setting the model temperature to 0 within Ragas helped mitigate this issue. Reducing the temperature controls the randomness of a model’s output and makes it more deterministic, increasing the likelihood of selecting the most probable next token. However, it also limits response diversity and nuance, making outputs more rigid and less adaptable to ambiguous queries.
Put a blank line before this so it starts a new paragraph
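As a sketch of what pinning the judge temperature to 0 can look like (using RAGAS's LangChain wrapper; the judge model, metric, and data here are illustrative assumptions):

```python
# Sketch: run the RAGAS judge model with temperature 0 so metric scoring is
# more deterministic. Judge model, metric, and data are illustrative only.
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Deterministic judge: temperature 0 reduces run-to-run variation in scores.
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

dataset = Dataset.from_dict({
    "question": ["Does the report disclose Scope 3 emissions?"],
    "answer": ["Yes, Scope 3 emissions are reported for the supply chain."],
    "contexts": [["The report covers Scope 3 emissions from purchased goods and logistics."]],
})

result = evaluate(dataset, metrics=[faithfulness], llm=judge)
print(result)
```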
> Overall, this helped reduce speculative or fabricated responses and improved answer fidelity.
> Third, prompt design is critical. Clear, well-structured prompts ensure that LLMs generate focused and relevant answers, which in turn supports more accurate evaluation outcomes.
Put a blank line before this so it starts a new paragraph
> Overall, the results across all models were generally positive, particularly in terms of semantic similarity, which remained consistently high across the board, indicating that most models preserved the intended meaning of answers even when phrasing differed from the reference.
Same again, if this is meant to be a new paragraph, put a blank line before it
> On average, Qwen3-30B emerged as the strongest performer, excelling in factual correctness and maintaining high semantic similarity, making it the most robust and reliable model for generating accurate, contextually grounded, and relevant answers. GPT-OSS-20B also performed very well, with strong answer accuracy and semantic similarity, making it a solid choice for balanced performance.
> In terms of performance and efficiency, Qwen3-30B-A3B (31B params) is the most computationally intensive model, excelling at complex, high-reasoning tasks but with longer execution times, particularly noticeable on local hardware. GPT-4o (200B params) offers the most advanced reasoning capabilities, though its large size comes with substantial computational cost. Models such as DeepSeek (8B params) and GPT-4o-mini (8B params) strike a balance between performance and efficiency, providing strong results with moderate runtimes. Smaller models like Gemma-3-1B and LFM2-1.2B (1B params each) are highly efficient and fast, making them well-suited for lightweight or time-sensitive tasks, though they are less capable of handling workloads that require extensive reasoning.
Personally, I find "B" as an abbreviation for "billion" less clear than "bn", because at first sight people might confuse "B" and "8"
You might want to crop your pic so your face is more central after the blog has filtered it - see how it looks on your local preview
Thanks Caitlin. I see what you mean, and I have cropped and updated it before, but it's still the same; I think it's because of the angle the picture was taken at. I will keep it as it is.
Please add a direct link to your post here:
https://.github.io/blog/
Have you (please tick each box to show completion):
- [ ] Run `npm install` followed by `npx mdspell "**/{FILE_NAME}.md" --en-gb -a -n -x -t` (if that's your thing)
- [ ] Posts are reviewed / approved by your Regional Tech Lead.