dev #138
Open
pelikhan wants to merge 132 commits into main from dev
* 🎛️ feat: switch evalModel to evalModelSet for test evaluation
  Replaces evalModel with evalModelSet, allowing multiple evaluation models.
* ✨ feat: add multi-model evaluation to metrics and compliance
  Support evaluating metrics and compliance with multiple models via evalModelSet.
* ✨ Refine evaluation model handling and debug logging
  Improved evalModelSet defaults, header levels, and debugging output.
* ✨ Enhance evalModelSet sourcing and logging in promptpex
  Now supports sourcing evalModelSet from env var, adds validation, and logging.
* ✨ refactor test metric evaluation and overview model handling
  Refined evalModelSet parsing and updated test metric iteration logic.
* ✨ feat: Add combined avg metric across eval models
  Compute and store average metric score for all evaluation models used.
* ✨ Enhance promptpex test evaluation and script logic
  Added separate eval-only/test-run modes, improved metric evaluations.
* ♻️ Rename evalModelSet to evalModel throughout codebase
  Standardizes config and variable naming from evalModelSet to evalModel.
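The "combined avg metric across eval models" commit above can be pictured roughly as follows. This is only a sketch of the idea, not the code in this PR: the `ModelScores` type and the `combinedAverage` function are hypothetical names, and it assumes each evaluation model yields one numeric score per metric, with missing scores simply skipped.

```typescript
// Hypothetical shape: one score per evaluation model for a single metric.
// Keys are model names; a model that failed to produce a score maps to undefined.
type ModelScores = Record<string, number | undefined>;

// Average the scores of all evaluation models that returned a usable number.
// Returns undefined when no model produced a score, so callers can tell
// "no data" apart from a real score of 0.
function combinedAverage(scores: ModelScores): number | undefined {
  const values = Object.values(scores).filter(
    (s): s is number => typeof s === "number" && !Number.isNaN(s)
  );
  if (values.length === 0) return undefined;
  return values.reduce((sum, s) => sum + s, 0) / values.length;
}
```

Skipping missing scores instead of counting them as 0 is a design assumption here; it keeps one failing eval model from dragging the average down.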
…#141)
* ✨ Enhance test results saving and eval metrics workflow
  Improved control of results file writing and evaluation metrics assignment.
* ✨ Add evals config flag to control evaluation execution
  Introduces evals boolean for toggling evaluation of test results.
* ✨: Enable direct context-loading from JSON files
  Refactored CLI to load PromptPexContext from JSON, updating file flow.
* ✨ Add scripts and logic for multi-stage sample evaluations
  Introduces zx scripts for gen/run/eval sample tests and conditional test executions.
* 🔀 rename: Samples scripts renamed to .zx.mjs extensions
  All run-samples-*.mjs scripts updated to .zx.mjs for zx compatibility.
* ♻️ refactor: Rename sample scripts to .zx.mjs extensions
  Updated script names in package.json and renamed a sample file for zx compatibility.
Introduces groundtruth model option, result tracking, and output storage.
Extended PromptPexTest and PromptPexTestResult with groundtruth support.
Add lmstudio to settings, expand UI model suggestions, tidy runTests.
✨ Add support for groundtruth model and outputs
Create genai-issue-labeller
* ✨ Enhance groundtruth evaluation with multiple models
  Added support for evaluating groundtruth with multiple eval models.
* Update src/genaisrc/src/promptpex.mts
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✨ Enable multiple eval groundtruth models and results merging
  Add support for evaluating with multiple groundtruth models and merging results.
* ✨ Update eval paths and enhance JSON context logging
  Refined script paths in package.json and improved debug info for JSON context.
* ✨ Refine metrics reporting and output handling logic
  Metric keys now include model names; output directories improved.
* ✨: Display groundtruth eval results in output table
  Show filtered groundtruth eval results in the test output section.
* 🔥 refactor: Removed grounding fields from PromptPexTestResult
  Eliminated isGrounded and groundedText fields for streamlined interface.
---------
Co-authored-by: Peli de Halleux <pelikhan@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✨ feat: Enable separate groundtruth metric evaluation path
  Adds support for filtering and evaluating groundtruth metrics via new prompt.
* ✨ feat: Add groundtruthScore based on evalModels averages
  Groundtruth metrics are now computed and averaged per test result.
* ✨ Debug and improve groundtruth metrics computation logic
  Log groundtruth metrics, fix average scoring, and enhance debug output.
* ✨ refactor groundtruth evaluation and add retry logic
  Extracted groundtruth scoring to a helper and added retries for low scores.
* 🚦 feat: introduce configurable groundtruth thresholds
  Added constants for groundtruth thresholds and retries in constants.mts. Updated testrun.mts to use these values, improving flexibility in test groundtruth score evaluation and retry handling.
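The retry-on-low-score behaviour described in the commits above might look something like the sketch below. It is an illustration under stated assumptions, not the helper from testrun.mts: the constant names and values, the `Scored` type, and `groundtruthWithRetry` are all hypothetical, and the callbacks are synchronous here for brevity where a real model call would be asynchronous.

```typescript
// Hypothetical constants; the real names and values in constants.mts are not shown in this PR page.
const GROUNDTRUTH_THRESHOLD = 50; // minimum acceptable groundtruth score (assumed scale 0-100)
const GROUNDTRUTH_MAX_ATTEMPTS = 3; // total attempts, including the first one

interface Scored {
  output: string;
  score: number;
  attempt: number; // 1-based attempt on which this output was produced
}

// Generate a groundtruth output, score it, and retry while the score stays
// below the threshold. The best-scoring output seen so far is kept, so a
// later, worse attempt never replaces an earlier, better one.
function groundtruthWithRetry(
  generate: (attempt: number) => string,
  score: (output: string) => number,
  threshold: number = GROUNDTRUTH_THRESHOLD,
  maxAttempts: number = GROUNDTRUTH_MAX_ATTEMPTS
): Scored {
  let best: Scored | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const output = generate(attempt);
    const s = score(output);
    if (!best || s > best.score) best = { output, score: s, attempt };
    if (s >= threshold) break; // good enough, stop retrying
  }
  if (!best) throw new Error("maxAttempts must be at least 1");
  return best;
}
```

Returning the best attempt when every attempt falls short of the threshold is one possible policy; failing the test or flagging the result would be equally plausible choices.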
Introduces a detailed overview of PromptPex's test groundtruth flow.
Groundtruth scores are now tracked for tests, with improved debug output.
No description provided.