
Return the test details with runtests? #139

Closed · findmyway opened this issue Jan 16, 2024 · 5 comments
Labels: speculative (a feature idea that we are undecided about)

Comments

@findmyway

I'd like to extract the test results and do some calculations on them later. My current approach is to set report=true and then parse the resulting XML file. Is there a better approach?

More specifically, can we return testitems instead of nothing here?

return nothing
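
(For reference, the workflow described above looks roughly like this; the package path below is hypothetical, and report=true is the keyword that writes the XML report.)

using ReTestItems

# Sketch of the current workaround: `report=true` writes a JUnit-style XML
# report, which can then be parsed separately. The path is hypothetical.
runtests("path/to/MyPackage"; report=true)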

@nickrobinson251 (Collaborator) commented Jan 16, 2024

thanks for the issue! Can you say more about what the use-case is?

Returning testitems there is not something we'd want to do by default (since in the REPL it'd then print out that object), but it could perhaps be made an option... although that's a ReTestItems.TestItems object, which is currently a private internal structure that I'd be reluctant to make public as is.

Also, the whole runtests() function will throw (specifically, the Test.finish call will throw) if there were any non-passing tests, so returning anything from runtests would only be possible in cases where all tests succeed... unless runtests() is itself called from within another testset (which is basically what we do in this package's own tests: we call runtests inside an EncasedTestSet and inspect the results).

So what's possible/desirable to do here depends on what exactly you're trying to achieve. Can you share an example?

Also, for context: I'm a little wary of committing to any kind of support for interacting with test results without a plan for what would be in or out of scope, and without taking a look at prior art. For example, a natural next step would be to use the returned test results when re-running tests, e.g. to only run the failures/errors, or to run those first -- but maybe we can add something that helps you without that, if I can understand the use-case better.

nickrobinson251 added the question (Further information is requested) and speculative (a feature idea that we are undecided about) labels on Jan 16, 2024
@findmyway (Author) commented Jan 16, 2024

First, thanks for your prompt reply!

I understand your concern and your explanation makes sense to me.

> Can you say more about what the use-case is?

Sure! I'm working on a project similar to openai/human-eval and evalplus, but in Julia. Generally, it asks different LLMs to generate code based on a given prompt, then executes the code to calculate pass@k (a metric based on the test-case pass rate, used to measure the performance of code LLMs). This package already provides many useful features, like parallel execution, timeouts, and many important metrics. All of these combined help me a lot when analyzing the performance of existing LLMs.
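
(For reference, a rough sketch of the usual unbiased pass@k estimator from the HumanEval paper, assuming n generated samples of which c pass the tests; the function name is just illustrative.)

# Unbiased pass@k estimator (as in the HumanEval/Codex paper):
# n = number of generated samples, c = number of samples that pass all tests.
function pass_at_k(n::Integer, c::Integer, k::Integer)
    n - c < k && return 1.0          # every size-k subset contains a passing sample
    return 1.0 - prod(1.0 .- k ./ ((n - c + 1):n))
end

pass_at_k(20, 5, 1)  # estimated probability that at least one of k=1 samples passes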

I think everything I need is already in the report.xml file. It would be great if we could somehow make it easier to extract these details (instead of parsing the raw XML file).

FYI: you can view the test cases here: https://github.com/oolong-dev/OolongEval.jl/tree/add_human_eval/benchmarks/HumanEval/src/tasks

@nickrobinson251 (Collaborator)

I see. Yeah, I'm not sure I see an easy way to add this to ReTestItems due to how Test.jl works (at least not one I'd be comfortable committing to 😅).

Maybe right now using the report.xml is actually the easiest way (daft as it might sound). The report.xml is a somewhat standard format ("JUnit XML"). I don't think there are any Julia parsers for it (just generic XML parsers), but Python has junitparser (if you're open to using Python) and there are probably others; alternatively, writing your own parser in Julia on top of a generic XML parser shouldn't be too arduous.
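
For example, a rough sketch with EzXML.jl (a generic XML parser): it assumes the usual JUnit XML layout where a failing/erroring <testcase> has a <failure>/<error> child, and the report path and function name are just illustrative.

using EzXML  # generic XML parser, not JUnit-specific

# Rough sketch: count passing <testcase> elements in a JUnit-style report.
# A testcase is treated as passing if it has no <failure> or <error> child.
function count_junit(path)
    doc = readxml(path)
    total = passed = 0
    for tc in findall("//testcase", root(doc))
        total += 1
        children = nodename.(elements(tc))
        if !("failure" in children || "error" in children)
            passed += 1
        end
    end
    return passed, total
end

passed, total = count_junit("report.xml")  # adjust the path to wherever the report is written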

But an alternative to parsing the report.xml could be to use MetaTesting? It's designed for "testing your test functions", i.e. testing functions that themselves run tests... but that's pretty close to inspecting what happened when tests ran, so it could be adapted to this use-case, I think... something like:

using MetaTesting: EncasedTestSet
using Test
using ReTestItems

# Wrap in a `MetaTesting.EncasedTestSet` so we can capture results rather than throwing if any fail/error
ts = @testset EncasedTestSet "runtests" begin
    runtests(...)
end

# Get the `Test.DefaultTestSet` produced by `runtests(...)`:
results = only(ts.results)

# Do stuff with `results`, e.g. print them to check they match what `runtests(...)` would show:
Test.print_test_results(results)

This gives you a Test.DefaultTestSet, which only stores the failures/errors and the number of passes (no other data about the passes), but maybe that's enough for your use-case?
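
If you need explicit counts from that, here's a rough sketch that tallies them by walking the testset; note it relies on internal DefaultTestSet fields (n_passed, results), which may change between Julia versions.

using Test

# Rough sketch: recursively tally passes, failures, and errors from a
# `Test.DefaultTestSet` (passes are stored only as a count, `n_passed`).
function count_results(ts::Test.DefaultTestSet)
    passes, fails, errs = ts.n_passed, 0, 0
    for r in ts.results
        if r isa Test.DefaultTestSet
            p, f, e = count_results(r)
            passes += p; fails += f; errs += e
        elseif r isa Test.Fail
            fails += 1
        elseif r isa Test.Error
            errs += 1
        end
    end
    return passes, fails, errs
end

passes, fails, errs = count_results(results)  # `results` from the snippet above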

@findmyway (Author)

Great, thanks!

I wasn't aware of MetaTesting before. I'll try both approaches and see which one works better.

nickrobinson251 removed the question (Further information is requested) label on Jan 17, 2024
@findmyway (Author)

I used results.xml in the end.

FYI: https://github.com/01-ai/HumanEval.jl
