AI Testing for QA Engineers, are we over complicating it? #32

priyanshus · 2026-04-10T05:39:54Z

I have been thinking about this for a while and wanted to hear from the community.

Over the last few months I spent a significant amount of time exploring how to test AI systems, specifically RAG systems.

As a QA consultant, I went through DeepEval, Ragas, Promptfoo and a few others. All genuinely powerful tools. These tools speak the language of ML. Retrieval precision, K-accuracy, embedding similarity, chunk evaluation, etc.

Having been in this profession for more than a decade, I have seen that most QAs prefer black-box end-to-end testing: treating the system as a black box and validating what the end user receives. I personally feel that these tools are best suited for ML engineers/developers, and to use them effectively, I need to be as knowledgeable as my developers. (I don’t see that as a drawback, but not all QAs have that opportunity. Many are offshore, disconnected from developers, and may not have the same level of knowledge of the product internals they are testing.)

I see these tools as well suited for unit- and integration-level testing of LLM systems, but we also need end-to-end testing solutions that do not dive into the internals but instead cover the whole user journey, including the LLM, and test it from the end-user perspective.

And that is the reason I am here: do we actually need a simplified solution for end-to-end testing of AI systems?

A QA engineer testing a login flow does not need to understand OAuth internals. They test the behaviour the user sees. Why should testing an AI feature be fundamentally different?

I am curious what this community thinks:

How are you currently testing AI features on your team?
Is the ML-heavy approach necessary or is it overkill for most QA teams?
What would "good enough" AI testing look like for a team without ML expertise?
Is black box E2E testing even valid for AI systems or does the non-determinism make it too unreliable?

I have been working on something in this space and have my own opinions, but I am genuinely curious what the QA community's experience has been before I share more.

sharovatov · 2026-04-11T06:35:31Z

Maintainer

Dear @priyanshus, thank you for raising a very interesting question.

I am replying now, and will reply more on Monday when I have my full research day :)

> As a QA consultant, I went through DeepEval, Ragas, Promptfoo and a few others. All genuinely powerful tools. These tools speak the language of ML. Retrieval precision, K-accuracy, embedding similarity, chunk evaluation, etc.

This is supercool! Anupam and I experimented with Ragas a lot too.

> Having been in this profession for more than a decade, I have seen that most QAs prefer black-box end-to-end testing: treating the system as a black box and validating what the end user receives.

Curiously, I have a very different observation. In my experience, QA professionals choose the testing methods/types/levels/approaches that are most efficient for the task, i.e. those that provide good evidence for the risks at minimal cost. End-to-end testing sits at quite an expensive level, and even though it does provide good evidence for certain risks in certain scenarios, there are many cheaper ways to get this evidence (faster, or earlier in the SDLC). The most efficient QA approaches I've seen focus more on prevention than on appraisal, thus reducing the need for expensive late-stage appraisal like E2E.

In my opinion, preference matters only if it is rationally justified: basically, if we can say "the cheapest way to get good evidence for this risk is through E2E testing".

> I personally feel that these tools are best suited for ML engineers/developers, and to use them effectively, I need to be as knowledgeable as my developers. (I don’t see that as a drawback, but not all QAs have that opportunity. Many are offshore, disconnected from developers, and may not have the same level of knowledge of the product internals they are testing.)

I don't think tools are "suited" for specific roles. A tool is useful when it provides information you need at a cost you can afford. If QA needs the evidence that Ragas provides (e.g., distinguishing retrieval failures from generation failures), and QA currently can't use Ragas, then the economics question is straightforward: what does it cost to give them that capability (training, pairing with ML engineers, better tooling abstractions) versus what does it cost to not have that evidence (undiagnosed failures, longer fix cycles, higher failure costs)?

I see the point about offshore, disconnected QA teams as even more important, and it's not a tooling problem, it's an organisational structure problem. When QA is structurally separated from development, they're limited to late-stage appraisal regardless of what tools exist. I don't see how any tool, however simplified, can fix the information gap created by that separation.

5 replies

priyanshus Apr 12, 2026
Author

Hey @sharovatov!

Thanks for the thoughtful reply. I agree with your point that E2E is expensive, and in an ideal setup we should always push quality earlier rather than relying on late-stage checks. That has been my view for years as well.

Where I think I’m coming from a slightly different angle is the reality of how many QA teams still operate today. In many organisations, QA is still structurally positioned around late-stage validation, whether we like that model or not. By the time work reaches them, their practical choice is often: test it manually, automate it at the user journey level, or don’t test it deeply at all (I know that is a bold statement).

So my question is less “what is the most theoretically complete way to test AI systems?” and more “what is the most practical way to help those teams test AI systems better than they do today or will do if given an opportunity?”

That’s why I feel there is room for a simplified end-to-end AI testing approach. Not as a replacement for tools like Ragas or DeepEval, and not as a substitute for any other similar tool, but as an abstraction layer for teams that are currently disconnected from model internals and still need to validate user-facing behaviour through automation.

I completely agree that this does not solve the deeper cultural or organisational gap. If a team wants to raise its quality game, it still has to improve collaboration, shift left, and build stronger technical foundations.

But I also think better tooling can help bridge the gap to an extent, if teams are not there yet.

That is partly why I’ve been exploring this space through Evaliphy, an open-source attempt to simplify AI testing from a QA perspective. The problem I’m trying to solve is not “how do we avoid understanding AI systems altogether,” but “how do we make meaningful testing possible for teams that currently only have a black-box entry point?”

So I think we largely agree on the ideal state. My argument is just that there is still value in supporting the current state, because that is where a lot of teams still are.

Additionally, I believe that if we only design practices for mature teams, we risk leaving a large part of the QA industry unaddressed.

sharovatov Apr 14, 2026
Maintainer

Thanks for your reply, dear @priyanshus, I truly enjoy this conversation.

As usual, I have a lot to say :D

On the pragmatic argument

You're 100% right that very many QA teams are mostly "testing-after". That's an unfortunate reality we live in. And we do need to help them.

But then I wonder what this help really is and what it should be. When you give a late-stage team better appraisal tooling, two things can happen. They use it as a "stepping stone": the tool surfaces enough signal that people start asking harder questions, and eventually the org moves earlier. Or the tool makes the current position comfortable enough that they don't move at all.

From the economics of testing standpoint, prevention costs less than appraisal, which costs less than failure. That applies to every team, not just mature ones. This makes us think: "what's the economically justified way to help the 'testing-after' teams?"

Surely, sometimes that's a better E2E framework. But sometimes it's a QA engineer paired with an ML engineer for an afternoon, doing component-level evaluation on the two highest-risk parts of the pipeline. In my opinion, the economics question should come before the tooling question, not after.

So I do agree that we can and should help people with their day-to-day work, but I'm curious how we can avoid the situation where they become too comfortable with the status quo, keep investing in appraisal, and waste money until they hit the same ceiling again. I don't have a good answer to this question yet.

sharovatov Apr 14, 2026
Maintainer

> That’s why I feel there is room for a simplified end-to-end AI testing approach.

On E2E in general:

In this economics of testing research I tried formulating a decision procedure for choosing the testing method/approach/type/level. First the risk is assessed (what can go wrong, and what does failure cost?), then a testing investment is chosen: one (or several) that provides sufficient evidence at a cost the risk justifies. So for me, choosing whether we need E2E testing is more like asking "does E2E provide sufficient evidence for THIS risk at THIS consequence level?"

For instance:

Feature where the cost of failure is low: an internal FAQ chatbot for HR policy. If it says the wrong thing about holiday accrual, someone posts a confused Slack message and asks a colleague. The consequence of failure is low, so the evidence threshold the economics framework sets is low. E2E evaluation with an LLM judge ("does the response roughly answer the question correctly?") clears that threshold. The investment is justified.
Feature where the cost of failure is high: a RAG system answering patient questions about medication interactions, or a financial assistant giving investment guidance, or a legal tool explaining contract terms. The consequence of failure is high, sometimes irreversible, so the evidence threshold is high. Here we need to know not just that the answer was wrong, but why. Retrieval failure (the right chunks weren't retrieved) needs a different fix than generation hallucination (the chunks were fine but the model invented something anyway), which needs a different fix than context mismatch (the retrieved content was technically relevant but not to this specific question in this phrasing). E2E evaluation can tell us that a response failed, but can't tell which of those three things happened. The evidence doesn't reach the threshold the consequence demands.

So imo testing investment should be "proportional" to the consequence of failure, and the choice of testing approach (E2E, component-level, or both) is an economic decision, not a tooling preference.
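
A minimal sketch of that decision procedure, under loud assumptions: the option names, cost units, evidence scores, and thresholds below are invented for illustration and do not come from the research mentioned above. It also simplifies "one (or several)" to picking a single option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestingOption:
    name: str
    cost: float      # hypothetical cost units, e.g. person-days
    evidence: float  # hypothetical 0..1 score for how much evidence it gives on this risk


def cheapest_sufficient(options, threshold):
    """Return the cheapest option whose evidence meets the threshold, or None."""
    sufficient = [o for o in options if o.evidence >= threshold]
    return min(sufficient, key=lambda o: o.cost) if sufficient else None


# Illustrative numbers only.
OPTIONS = [
    TestingOption("E2E with LLM judge", cost=2.0, evidence=0.6),
    TestingOption("component-level evaluation", cost=4.0, evidence=0.8),
    TestingOption("E2E + component-level evaluation", cost=6.0, evidence=0.9),
]

# The threshold is set by the consequence of failure, not by tool preference.
low_stakes = cheapest_sufficient(OPTIONS, threshold=0.5)    # e.g. internal HR FAQ bot
high_stakes = cheapest_sufficient(OPTIONS, threshold=0.85)  # e.g. medication questions

print(low_stakes.name)
print(high_stakes.name)
```

The point of the sketch is the order of operations: the threshold is fixed by the risk first, and only then is the cheapest option that clears it chosen.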

sharovatov Apr 14, 2026
Maintainer

On Evaliphy

I love this tool, thanks for sharing. Beautifully done. You've made something QA engineers can pick up without needing to understand RAG pipeline internals, and that's not a trivial thing to build cleanly.

I have a few questions of course :)

The first thing I notice is that the judge is another LLM. This reminds me of the classic bootstrap problem: how do you know the judge is right? If the system hallucinates in a way that's plausible-sounding, the judge may pass it, because the judge is subject to the same plausibility bias. Have you measured judge accuracy against human evaluations on your test cases? I'd be curious what the disagreement rate looks like, and especially whether the disagreements cluster around a particular failure type.
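
A rough sketch of what such a measurement could look like, assuming you have a set of test cases where both a human and the LLM judge gave a pass/fail verdict. The record layout, the `case_type` labels, and the sample data are hypothetical; nothing here is Evaliphy's API.

```python
from collections import Counter


def judge_vs_human(records):
    """Compare judge verdicts with human verdicts on the same cases.

    Returns (disagreement_rate, disagreements_by_case_type, cohens_kappa).
    """
    total = len(records)
    disagreements = [r for r in records if r["human"] != r["judge"]]
    rate = len(disagreements) / total
    by_type = Counter(r["case_type"] for r in disagreements)

    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    observed = 1 - rate
    human_pass = sum(r["human"] for r in records) / total
    judge_pass = sum(r["judge"] for r in records) / total
    expected = human_pass * judge_pass + (1 - human_pass) * (1 - judge_pass)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return rate, by_type, kappa


# Made-up verdicts: True means "pass".
RECORDS = [
    {"human": True,  "judge": True,  "case_type": "plain"},
    {"human": True,  "judge": True,  "case_type": "plain"},
    {"human": False, "judge": False, "case_type": "retrieval_miss"},
    {"human": False, "judge": True,  "case_type": "plausible_hallucination"},
    {"human": False, "judge": True,  "case_type": "plausible_hallucination"},
    {"human": True,  "judge": True,  "case_type": "plain"},
    {"human": False, "judge": False, "case_type": "plausible_hallucination"},
    {"human": True,  "judge": False, "case_type": "long_history"},
]

rate, by_type, kappa = judge_vs_human(RECORDS)
print(f"disagreement rate: {rate:.3f}, kappa: {kappa:.2f}")
print(by_type.most_common())
```

Grouping the disagreements by case type is what would show whether the judge's mistakes cluster around one failure type.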

I am also curious about the fact that the failure diagnosis is out of scope by design. So when a response fails, the tool reports that it failed. It can't say whether retrieval was the problem or generation was the problem, because it doesn't have access to the pipeline internals. And that's intentional. But even in late-stage testing companies, testers are usually expected to provide more than "it failed". A bug report needs diagnostic detail, steps to reproduce, some indication of where things went wrong. In traditional E2E testing, the failure scenario itself carries diagnostic information ("login fails when the email contains a + character" tells the engineer to look at input parsing). With E2E for a RAG-enabled system, "the judge scored this as not faithful" doesn't point to specifics. The failure could be in retrieval, chunking, prompt construction, or generation. The engineer has to redo the entire diagnosis from scratch, so the tester's automated verdict added a signal but not the diagnostic value that makes a bug report actionable.

I'd love to know how you reason about faithfulness vs groundedness. Looking at the rubric descriptions: faithfulness checks whether the response is supported by the context, groundedness checks whether claims are supported by source documents. In practice, for many RAG systems, those two things are evaluating nearly the same signal. Do you have data showing the two metrics discriminate meaningfully? Cases where one fails and the other passes? If they correlate strongly in practice, that's useful to know, because it affects how much information value you're actually getting from running both.
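
One way to check this, sketched under assumptions: each test case has a score between 0 and 1 for both metrics, and a case passes a metric at or above a threshold. The field names, threshold, and scores are hypothetical and only illustrate the analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (None if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def discordant(cases, threshold=0.5):
    """Cases where exactly one of the two metrics passes."""
    return [
        c for c in cases
        if (c["faithfulness"] >= threshold) != (c["groundedness"] >= threshold)
    ]


# Made-up scores for four test cases.
CASES = [
    {"id": "q1", "faithfulness": 0.9, "groundedness": 0.8},
    {"id": "q2", "faithfulness": 0.4, "groundedness": 0.3},
    {"id": "q3", "faithfulness": 0.7, "groundedness": 0.2},
    {"id": "q4", "faithfulness": 0.2, "groundedness": 0.3},
]

r = pearson([c["faithfulness"] for c in CASES], [c["groundedness"] for c in CASES])
split = discordant(CASES)
print(f"correlation: {r:.2f}")
print("one passes, the other fails:", [c["id"] for c in split])
```

A high correlation with few discordant cases would suggest that running both metrics adds little information; the discordant cases are the ones worth reading by hand.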

The question I'd most like your answer to: from the teams using Evaliphy now, do you see them using it to start asking harder questions about their pipeline, or does it mostly satisfy the "we're testing AI now" checkbox?

priyanshus Apr 17, 2026
Author

> So I do agree that we can and should help people with their day-to-day work, but I'm curious how we can avoid the situation where they become too comfortable with the status quo, keep investing in appraisal, and waste money until they hit the same ceiling again. I don't have a good answer to this question yet.

Honestly, I do not have a good answer at the moment either. From my own experience, I started with something very simple as a newcomer, but it sparked the curiosity in me to go beyond what was given.

> The first thing I notice is that the judge is another LLM. This reminds me of the classic bootstrap problem: how do you know the judge is right? If the system hallucinates in a way that's plausible-sounding, the judge may pass it, because the judge is subject to the same plausibility bias. Have you measured judge accuracy against human evaluations on your test cases? I'd be curious what the disagreement rate looks like, and especially whether the disagreements cluster around a particular failure type.

This is genuinely a problem, and not only for Evaliphy but for all other tools in this space. You are right that a plausible-sounding hallucination can fool the judge; that is a known limitation of this approach. There are a few recommended practices around this, like appropriate model selection and tuning the system prompt for the LLM-as-Judge to your domain. Evaliphy comes with default prompts, but it also lets you bring your own prompt through a simple config change.

Have I measured judge accuracy against human evaluation? Not at scale yet. But one area where I definitely see judges struggle is when there is a long history attached to the query. In that case, neither the application LLM nor the LLM-as-Judge behaves correctly.

> I am also curious about the fact that the failure diagnosis is out of scope by design. So when a response fails, the tool reports that it failed. It can't say whether retrieval was the problem or generation was the problem, because it doesn't have access to the pipeline internals. And that's intentional. But even in late-stage testing companies, testers are usually expected to provide more than "it failed". A bug report needs diagnostic detail, steps to reproduce, some indication of where things went wrong. In traditional E2E testing, the failure scenario itself carries diagnostic information ("login fails when the email contains a + character" tells the engineer to look at input parsing). With E2E for a RAG-enabled system, "the judge scored this as not faithful" doesn't point to specifics. The failure could be in retrieval, chunking, prompt construction, or generation. The engineer has to redo the entire diagnosis from scratch, so the tester's automated verdict added a signal but not the diagnostic value that makes a bug report actionable.

Yes, this is correct. It just gives you a signal. But that's intentional and better than having nothing. It gives you room to discuss with the dev team: "faithfulness scored 0.4 on this query, here is the query, here is the response, here is the context for that query" is actually more actionable than nothing. I think the same holds for deterministic applications as well: your E2E tests are meant to give you a signal, not to diagnose the problem. Honestly, we never raise a defect immediately after we see a Playwright test fail; we diagnose the failure, recreate it, and then we raise it. This process remains the same. Observability tools will come in handy here.
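
To make the "here is the query, here is the response, here is the context" idea concrete, a small sketch of a failure record that a tester could attach to a discussion with the dev team. The `JudgeFailure` class and its fields are hypothetical and are not Evaliphy's data model; the query, response, and context strings are invented.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeFailure:
    metric: str
    score: float
    threshold: float
    query: str
    response: str
    context: list[str] = field(default_factory=list)

    def to_report(self) -> str:
        """Render the failure as plain text for a ticket or chat message."""
        lines = [
            f"{self.metric} scored {self.score:.2f} (threshold {self.threshold:.2f})",
            f"Query: {self.query}",
            f"Response: {self.response}",
            "Context for that query:",
        ]
        lines += [f"  [{i}] {chunk}" for i, chunk in enumerate(self.context, start=1)]
        return "\n".join(lines)


failure = JudgeFailure(
    metric="faithfulness",
    score=0.4,
    threshold=0.7,  # assumed pass mark
    query="How many days of annual leave do I get?",
    response="You get 30 days of annual leave.",
    context=["Full-time employees accrue 25 days of annual leave per year."],
)
report = failure.to_report()
print(report)
```

This still does not say where in the pipeline the failure happened; it only packages the signal so that the diagnosis can start from something reproducible.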

> I'd love to know how you reason about faithfulness vs groundedness. Looking at the rubric descriptions: faithfulness checks whether the response is supported by the context, groundedness checks whether claims are supported by source documents. In practice, for many RAG systems, those two things are evaluating nearly the same signal. Do you have data showing the two metrics discriminate meaningfully? Cases where one fails and the other passes? If they correlate strongly in practice, that's useful to know, because it affects how much information value you're actually getting from running both.

This is exactly the kind of feedback that will shape which assertions survive into v1. I released the beta to understand such subtle differences. Thanks for this valuable feedback. I have added it to my to-do list and will publish my findings soon.

These are exactly the questions I was hoping this beta would surface. Thank you for the depth of your feedback. This is really useful! I will be adding a couple more replies to this thread in the coming days.

sharovatov · 2026-04-11T06:43:17Z

Maintainer

> A QA engineer testing a login flow does not need to understand OAuth internals. They test the behaviour the user sees. Why should testing an AI feature be fundamentally different?

This analogy is very tempting but I think it hides something important. When QA tests login black-box, the reason it works is that the oracle is trivial: valid credentials get in, invalid don't. Binary, deterministic, fully specified. QA doesn't need OAuth knowledge because the oracle does all the work.

For an AI feature like a RAG-based Q&A, what is the oracle? "The answer should be correct"? But how correct? How complete? How relevant? By whose definition? And the system is non-deterministic, so you're evaluating a distribution of possible outputs against fuzzy, context-dependent criteria. Defining the oracle is the hardest part of the entire testing problem, and it requires deep domain understanding even if not ML understanding.
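
A minimal sketch of what "evaluating a distribution of outputs" can mean in a black-box test: ask the same question many times and assert on the pass rate instead of on a single answer. Everything here is an assumption for illustration: `fake_ask` stands in for the real system, the oracle is deliberately naive, and the threshold is made up.

```python
from itertools import cycle


def pass_rate(ask, oracle, question, runs=20):
    """Ask the same question `runs` times; return the fraction of answers the oracle accepts."""
    passed = sum(1 for _ in range(runs) if oracle(ask(question)))
    return passed / runs


# Hypothetical stand-in for the system under test: it cycles through canned
# answers to imitate non-deterministic output. A real test would call the app.
_canned = cycle([
    "You get 25 days of annual leave.",
    "Annual leave is 25 days per year.",
    "Employees are entitled to 25 days.",
    "Annual leave is 30 days.",  # the occasional wrong answer
])


def fake_ask(question):
    return next(_canned)


def oracle(answer):
    # Deliberately naive. Writing a trustworthy oracle is the hard,
    # domain-specific part that this sketch does not solve.
    return "25 days" in answer


MIN_PASS_RATE = 0.9  # assumed; should follow from what a wrong answer costs

rate = pass_rate(fake_ask, oracle, "How many days of annual leave do I get?")
print(f"pass rate: {rate:.2f}, acceptable: {rate >= MIN_PASS_RATE}")
```

The mechanics are easy; the two hard decisions are the ones the sketch hard-codes, namely what the oracle accepts and what pass rate is good enough.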

> Is the ML-heavy approach necessary or is it overkill for most QA teams?

In my economics of testing research I argue that it depends on what a wrong answer might cost.

Say, for a low-stakes AI feature like an internal FAQ chatbot, sampled human evaluation of outputs might provide sufficient evidence. The cost of a wrong answer is low, so the evidence threshold is low.

For a high-stakes AI feature, such as a medical, financial, or legal one, you need to know where failures happen: did retrieval fail, or did retrieval succeed but the LLM hallucinate? Black-box E2E tells you "the answer was wrong." Component-level evaluation (what Ragas and DeepEval do) tells you why. The evidence threshold is high, and E2E alone can't possibly reach it, or maybe I'm simply not seeing how.
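
To illustrate why the "why" needs pipeline internals, a simplified triage sketch. It assumes two things a black-box test does not have: a labelled set of chunk IDs that should have been retrieved, and a verdict on whether the answer is supported by the retrieved context. This is not how Ragas or DeepEval compute their metrics; the function and the chunk IDs are hypothetical.

```python
def triage(expected_chunk_ids, retrieved_chunk_ids, answer_supported_by_context):
    """Locate a failed answer in the pipeline using information E2E cannot see."""
    missing = set(expected_chunk_ids) - set(retrieved_chunk_ids)
    if missing:
        return f"retrieval failure (missing chunks: {sorted(missing)})"
    if not answer_supported_by_context:
        return "generation failure (right chunks retrieved, answer not supported by them)"
    return "not located in retrieval or generation; check chunking and prompt construction"


# Same wrong answer from the user's point of view, two different causes.
print(triage({"doc7#2"}, ["doc1#1", "doc3#4"], answer_supported_by_context=False))
print(triage({"doc7#2"}, ["doc7#2", "doc3#4"], answer_supported_by_context=False))
```

Both calls describe an answer that a black-box test would report identically as "wrong"; only the extra inputs separate the two fixes.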

More replies coming on Monday; I'm off on a bike trip now :)) Thank you again!

0 replies
