Currently, when you run multiple models and multiple prompt samples, the eval results are shown as a table that looks like this:
It would be great to also have a bar chart of non-compliance (where higher is better), like the one used to present the results in the paper:
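As a rough sketch of the idea (not the project's actual code), the snippet below aggregates per-sample results into a non-compliance rate per model and prints one bar per model. The record format `(model, is_non_compliant)`, the function names `non_compliance_rates` and `render_bar_chart`, and the sample data are all hypothetical, since the real eval output format is not shown here. It uses plain-text bars to stay dependency-free; a plotting library could be swapped in for the rendering step.

```python
from collections import defaultdict


def non_compliance_rates(results):
    """Return {model: fraction of samples flagged as non-compliant}.

    `results` is an iterable of (model_name, is_non_compliant) pairs.
    This record shape is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for model, is_non_compliant in results:
        totals[model] += 1
        flagged[model] += int(is_non_compliant)
    return {model: flagged[model] / totals[model] for model in totals}


def render_bar_chart(rates, width=40):
    """Render one horizontal text bar per model, highest rate first."""
    if not rates:
        return ""
    label_width = max(len(model) for model in rates)
    lines = []
    for model, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(rate * width)
        lines.append(f"{model:<{label_width}} | {bar} {rate:.0%}")
    return "\n".join(lines)


# Hypothetical sample data: two models, a handful of prompt samples each.
results = [
    ("model-a", True), ("model-a", True), ("model-a", False), ("model-a", True),
    ("model-b", False), ("model-b", True),
]
rates = non_compliance_rates(results)
chart = render_bar_chart(rates)
print(chart)
```

Sorting by rate puts the best-performing model at the top, which matches the "higher is better" reading of the chart.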