
Conversation

@Ana-ScottLogic

Please add a direct link to your post here:

https://.github.io/blog/

Have you (please tick each box to show completion):

  • Added your blog post to a single category?
  • Added a brief summary for your post? Summaries should be roughly two sentences in length and give potential readers a good idea of the contents of your post.
  • Checked that the build passes?
  • Checked your spelling (you can use npm install followed by npx mdspell "**/{FILE_NAME}.md" --en-gb -a -n -x -t if that's your thing)
  • Ensured that your author profile contains a profile image, and a brief description of yourself? (make it more interesting than just your job title!)
  • Optimised any images in your post? They should be less than 100KBytes as a general guide.

Posts are reviewed / approved by your Regional Tech Lead.

@Ana-ScottLogic changed the title from "Adding profile page" to "Blog-Evaluating Answers with Large Language Models: How InferESG and RAGAS Helped" on Nov 20, 2025
Contributor

@csalt-scottlogic left a comment


It's a good start. I've made a few suggestions on rephrasing things for clarity, a small number of grammar corrections, and notes on some accessibility issues with the graphs.


# Evaluating Answers with Large Language Models: How InferESG and RAGAS Helped

In our latest project, we set out to evaluate how different Large Language Models (LLMs) perform when responding to user prompts. Building on our existing platform, InferESG, which automatically generates greenwashing reports from ESG disclosures, our goal was not to determine which model is superior, but rather to assess whether open-source models can serve as viable alternatives to OpenAI’s proprietary models.

Contributor

The sentence that starts "Building..." is a pretty long run-on sentence. It might be better split up and with the goal put to the fore. Something like everything after "our goal" as one sentence, then "We did this by building on our existing platform, InferESG, which is..."

It would be good, if we've published anything previously about InferESG, to turn "InferESG" into a link to it



For this study, we tested the following models: DeepSeek, Gemma-3-1B, GPT-4o, GPT-4o-mini, GPT-OSS-20B, LFM2-1.2, and Qwen3-30B-A3B. The table below gives a better understanding of the models sizes and provides links with useful information about those models.

Contributor

Apostrophe missing in "models' sizes"

I'd also change "about those models" to "about them" to avoid repeating the word "models" too often


## The Role of InferESG

InferESG served as the core system in our experiment. Under the hood, InferESG uses an agentic solver that breaks down the user’s request into specific subtasks and leverages various API calls to LLMs, each with task-specific system contexts. The system takes a primary sustainability document, analyses it, and generates a comprehensive greenwashing report. This generated report then serves as the foundation for evaluating and comparing the performance of other large language models.
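
To make the agentic-solver idea concrete, a minimal sketch of the pattern might look like the following. The subtask names and system prompts here are purely illustrative (they are not InferESG's actual ones), and it assumes the OpenAI Python SDK:

```python
# Illustrative sketch only - not the actual InferESG implementation.
from openai import OpenAI

client = OpenAI()

# Hypothetical task-specific system contexts, one per subtask.
SUBTASKS = {
    "extract_claims": "You extract sustainability claims from the document.",
    "check_evidence": "You check each claim against its supporting evidence.",
    "draft_report": "You draft a greenwashing report from the verified findings.",
}

def run_solver(document: str) -> dict[str, str]:
    """Run each subtask as a separate LLM call with its own system context."""
    results: dict[str, str] = {}
    for name, system_prompt in SUBTASKS.items():
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": document},
            ],
        )
        results[name] = response.choices[0].message.content
    return results
```

Each subtask gets its own narrow system context, so the model is only ever asked to do one well-defined job at a time.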

Contributor

Maybe explain a little bit more about what a "greenwashing report" actually is?



To ensure the accuracy of our generated reports before using them to build our test dataset, we first created a benchmark report. In this benchmark, GPT-5 extracted all factual statements and classified them. A manual verification was conducted at the end to ensure the generated report was consistent and accurate. For a better understanding of this process, you can refer to our [blog](https://blog.scottlogic.com/2025/10/27/testing-open-source-llms.html), which provides a more detailed explanation of these steps.
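
As a rough illustration of the statement-extraction step, the sketch below asks a model to pull out the factual statements and classify each one; the prompt, the JSON output shape, and the model identifier are assumptions made for illustration, and the manual verification described above still happens afterwards:

```python
# Illustrative sketch only - the real benchmark pipeline may differ.
import json

from openai import OpenAI

client = OpenAI()

def extract_statements(report: str) -> list[dict]:
    """Ask the model for a classified list of factual statements from the report."""
    response = client.chat.completions.create(
        model="gpt-5",  # assumed identifier for the benchmark model
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract every factual statement from the report. "
                    "Respond with a JSON object containing a 'statements' array; "
                    "each item must have 'statement' and 'classification' fields."
                ),
            },
            {"role": "user", "content": report},
        ],
    )
    return json.loads(response.choices[0].message.content)["statements"]
```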

Contributor

It would be better to include the title of the linked article. "You can refer to our blog post, Beyond Benchmarks: Testing Open-Source LLMs in Multi-Agent Workflows"

You don't need the hostname part of the link when it's a link to another blog post


## Evaluating with RAGAS

Ragas played a central role in both creating our test dataset and evaluating model outputs through a set of well-defined metrics.
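
For readers who have not used Ragas before, a minimal sketch of this kind of evaluation call is shown below. The question, answer, and contexts are placeholders, and the column names follow the classic Ragas evaluate() schema, which varies slightly between versions:

```python
# Illustrative sketch only - placeholder data and classic ragas column names.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_similarity, faithfulness

dataset = Dataset.from_dict({
    "question": ["What does the report say about Scope 3 emissions?"],
    "answer": ["The company discloses partial Scope 3 coverage."],        # model under test
    "contexts": [["Excerpt from the generated greenwashing report..."]],  # reference context
    "ground_truth": ["The benchmark answer taken from the verified report."],
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_correctness, answer_similarity],
)
print(result)  # per-metric scores averaged over the dataset
```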

Contributor

"RAGAS" is capitalised in some places but not in others - can you make it consistent please

First, the quality of evaluation heavily depends on the clarity of questions and the quality of reference context.

LLM outputs are context-sensitive, and poorly defined questions or insufficient reference material can significantly impact performance metrics.
Second, hallucinations remain a concern with LLMs. In our experiments, setting the model temperature to 0 within Ragas helped mitigate this issue. Reducing the temperature controls the randomness of a model’s output and makes it more deterministic, increasing the likelihood of selecting the most probable next token. However, it also limits response diversity and nuance, making outputs more rigid and less adaptable to ambiguous queries.
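
As an illustration of the temperature setting, the evaluator model can be pinned to temperature 0 and passed into the evaluation call, roughly as in the sketch below; the wrapper and parameter names follow recent Ragas/LangChain releases, so treat them as assumptions:

```python
# Illustrative sketch only - deterministic evaluator via a temperature-0 LLM.
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_correctness, faithfulness

# Temperature 0 makes every metric prompt pick the most probable tokens.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

result = evaluate(
    dataset,  # the test dataset built in the earlier sketch
    metrics=[faithfulness, answer_correctness],
    llm=evaluator_llm,
)
```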

Contributor

Put a blank line before the sentence beginning "Second, hallucinations remain a concern" so it starts a new paragraph


Overall, this helped reduce speculative or fabricated responses and improved answer fidelity.
Third, prompt design is critical. Clear, well-structured prompts ensure that LLMs generate focused and relevant answers, which in turn supports more accurate evaluation outcomes.

Contributor

Put a blank line before "Overall, this helped reduce speculative or fabricated responses" so it starts a new paragraph


Overall, the results across all models were generally positive, particularly in terms of semantic similarity, which remained consistently high across the board, indicating that most models preserved the intended meaning of answers even when phrasing differed from the reference.

Contributor

Same again, if this is meant to be a new paragraph, put a blank line before it


On average, Qwen3-30B emerged as the strongest performer, excelling in factual correctness and maintaining high semantic similarity, making it the most robust and reliable model for generating accurate, contextually grounded, and relevant answers. GPT-OSS-20B also performed very well, with strong answer accuracy and semantic similarity, making it a solid choice for balanced performance.

In terms of performance and efficiency, Qwen3-30B-A3B (31B params) is the most computationally intensive model, excelling at complex, high-reasoning tasks but with longer execution times, particularly noticeable on local hardware. GPT-4o (200B params) offers the most advanced reasoning capabilities, though its large size comes with substantial computational cost. Models such as DeepSeek (8B params) and GPT-4o-mini (8B params) strike a balance between performance and efficiency, providing strong results with moderate runtimes. Smaller models like Gemma-3-1B and LFM2-1.2B (1B params each) are highly efficient and fast, making them well-suited for lightweight or time-sensitive tasks, though they are less capable of handling workloads that require extensive reasoning.

Contributor

Personally, I find "B" as an abbreviation for "billion" less clear than "bn", because at first sight people might confuse "B" and "8"

Contributor

You might want to crop your pic so your face is more central after the blog has filtered it - see how it looks on your local preview

Author

Thanks Caitlin. I see what you mean, and I have cropped and updated it before, but it's still the same; I think it's because of the angle the picture was taken at. I will keep it as it is.
