AI Testing for QA Engineers: are we overcomplicating it? #32
Dear @priyanshus, thank you for raising a very interesting question. I am replying now, and will reply more on Monday when I have my full research day :)
This is supercool! Anupam and I experimented with Ragas a lot too.
Curiously, I have a very different observation. In my experience, QA professionals choose the testing methods, types, levels, and approaches that are most efficient for the task, i.e. those that provide good evidence about the risks at minimal cost. End-to-end testing sits at quite an expensive level, and even though it does provide good evidence for certain risks in certain scenarios, there are many more ways to get this evidence more cheaply (faster, or earlier in the SDLC). The most efficient QA approaches I've seen focus more on prevention than on appraisal, thus reducing the need for expensive late-stage appraisal like E2E. In my opinion, preference matters only if it is rationally justified: basically, if we can say "the cheapest way to get good evidence for this risk is through E2E testing".
I don't think tools are "suited" for specific roles. A tool is useful when it provides information you need at a cost you can afford. If QA needs the evidence that Ragas provides (e.g., distinguishing retrieval failures from generation failures), and QA currently can't use Ragas, then the economics question is straightforward: what does it cost to give them that capability (training, pairing with ML engineers, better tooling abstractions) versus what does it cost to not have that evidence (undiagnosed failures, longer fix cycles, higher failure costs)? I see the point about offshore, disconnected QA teams as even more important, and it's not a tooling problem, it's an organisational structure problem. When QA is structurally separated from development, they're limited to late-stage appraisal regardless of what tools exist. I don't see how any tool, however simplified, can fix the information gap created by that separation.
-
This analogy is very tempting but I think it hides something important. When QA tests login black-box, the reason it works is that the oracle is trivial: valid credentials get in, invalid don't. Binary, deterministic, fully specified. QA doesn't need OAuth knowledge because the oracle does all the work. For an AI feature like a RAG-based Q&A, what is the oracle? "The answer should be correct"? But how correct? How complete? How relevant? By whose definition? And the system is non-deterministic, so you're evaluating a distribution of possible outputs against fuzzy, context-dependent criteria. Defining the oracle is the hardest part of the entire testing problem, and it requires deep domain understanding even if not ML understanding.
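To make the oracle contrast concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `fake_rag` is a hypothetical stand-in that cycles through canned answers to mimic non-determinism, and `mentions_30_days` is a deliberately naive judge. Writing a judge that actually captures "correct, complete, relevant" is the hard part described above.

```python
from itertools import cycle
from typing import Callable


def login_oracle(credentials_valid: bool, logged_in: bool) -> bool:
    """Binary oracle: the outcome must match the validity of the credentials."""
    return credentials_valid == logged_in


def pass_rate(
    system: Callable[[str], str],
    judge: Callable[[str, str], bool],
    question: str,
    samples: int = 20,
) -> float:
    """Ask the same question repeatedly and return the fraction of accepted answers."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    accepted = sum(1 for _ in range(samples) if judge(question, system(question)))
    return accepted / samples


# Hypothetical stand-in for a non-deterministic RAG system:
# each call returns the next canned answer.
_canned_answers = cycle([
    "Refunds are accepted within 30 days of purchase.",
    "You can return items within 30 days.",
    "Refunds are accepted within 90 days.",  # wrong fact
    "Please contact support.",               # evasive, incomplete
])


def fake_rag(question: str) -> str:
    return next(_canned_answers)


def mentions_30_days(question: str, answer: str) -> bool:
    """Naive judge (an assumption): accept any answer containing the expected fact."""
    return "30 days" in answer


rate = pass_rate(fake_rag, mentions_30_days, "What is the refund window?", samples=20)
print(f"pass rate: {rate:.2f}")  # pass rate: 0.50
```

The login oracle returns a verdict per run; the AI feature only yields a pass rate over a sample, and that rate is only as meaningful as the judge behind it.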
In my economics-of-testing research I argue that it depends on what a wrong answer might cost. Say, for a low-stakes AI feature like an internal FAQ chatbot, sampled human evaluation of outputs might provide sufficient evidence. The cost of a wrong answer is low, so the evidence threshold is low. For a high-stakes AI feature, like a medical, financial, or legal one, you need to know where failures happen: did retrieval fail, or did retrieval succeed but the LLM hallucinate? Black-box E2E tells you "the answer was wrong." Component-level evaluation (what Ragas and DeepEval do) tells you why. The evidence threshold is high, and E2E alone can't possibly reach it, or maybe I'm simply not seeing how. More replies coming on Monday, and I'm off on a velo trip now :)) Thank you again!
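As a rough illustration of the "what" versus "why" distinction, the sketch below classifies a failed answer as a retrieval failure or a generation failure. It is not the Ragas or DeepEval API: the `Trace` type, the `diagnose` function, and the substring matching are all hypothetical simplifications, and real tools use far richer metrics. It also assumes the test harness can see the retrieved chunks, which a pure black-box E2E test cannot.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One recorded interaction (hypothetical structure for this sketch)."""
    question: str
    retrieved_chunks: list[str]
    answer: str
    expected_fact: str


def diagnose(trace: Trace) -> str:
    """Classify the outcome using naive, case-insensitive substring matching."""
    fact = trace.expected_fact.lower()
    if fact in trace.answer.lower():
        return "pass"
    # The answer is wrong: check whether the retriever supplied the fact at all.
    fact_was_retrieved = any(fact in chunk.lower() for chunk in trace.retrieved_chunks)
    return "generation_failure" if fact_was_retrieved else "retrieval_failure"


retrieval_miss = Trace(
    question="What is the maximum daily dose?",
    retrieved_chunks=["Store below 25 degrees.", "Keep out of reach of children."],
    answer="The maximum daily dose is 8 tablets.",
    expected_fact="4 tablets",
)
generation_miss = Trace(
    question="What is the maximum daily dose?",
    retrieved_chunks=["Do not exceed 4 tablets in 24 hours."],
    answer="The maximum daily dose is 8 tablets.",
    expected_fact="4 tablets",
)

print(diagnose(retrieval_miss))   # retrieval_failure
print(diagnose(generation_miss))  # generation_failure
```

Both traces look identical from the end user's side (the same wrong answer), which is why the black-box result alone cannot tell you which component to fix.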
-
I have been thinking about this for a while and wanted to hear from the community.
Over the last few months I spent a significant amount of time exploring how to test AI systems, specifically RAG systems.
As a QA consultant, I went through DeepEval, Ragas, Promptfoo and a few others. All are genuinely powerful tools, but they speak the language of ML: retrieval precision, K-accuracy, embedding similarity, chunk evaluation, etc.
Having been in this profession for more than a decade, I have found that most QAs prefer black-box end-to-end testing: treating the system as a black box and validating what the end user receives. I personally feel that these tools are best suited for ML engineers/developers, and to use them effectively, I need to be as knowledgeable as my developers. (I don't see that as a drawback, but not all QAs have that opportunity. Many are offshore, disconnected from developers, and may not have the same level of knowledge of the internals of the product they are testing.)
I see these tools as well suited for unit (and integration) level testing of LLM systems, but we also need some end-to-end testing solutions where we are not necessarily diving into the internals, but rather covering the whole user journey, including the LLM, and testing it from the end-user perspective.
And that is the reason I am here: do we actually need a simplified solution for end-to-end testing of AI systems?
A QA engineer testing a login flow does not need to understand OAuth internals. They test the behaviour the user sees. Why should testing an AI feature be fundamentally different?
I am curious what this community thinks:
I have been working on something in this space and have my own opinions, but I am genuinely curious what the QA community's experience and perspective have been before I share more.