
dev #138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open

wants to merge 132 commits into main

Conversation

pelikhan (Member) commented Jun 2, 2025

No description provided.

pelikhan and others added 30 commits June 2, 2025 23:45
* 🎛️ feat: switch evalModel to evalModelSet for test evaluation

Replaces evalModel with evalModelSet, allowing multiple evaluation models.
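
The diff itself is not shown here, so purely as a rough illustration: an `evalModelSet` option and its normalization might look like the sketch below. Every name in it (`PromptPexOptions`, `resolveEvalModels`) is an assumption, not PromptPex's actual API.

```ts
// Hypothetical sketch: broadening a single evalModel option into a set.
interface PromptPexOptions {
    // before: evalModel?: string
    evalModelSet?: string[]; // e.g. ["openai:gpt-4o", "ollama:phi3"]
}

// Accept either a separator-delimited string or an array and normalize
// it into a clean list of model ids.
function resolveEvalModels(raw: string | string[] | undefined): string[] {
    if (!raw) return [];
    const models = Array.isArray(raw) ? raw : raw.split(/[;,]/);
    return models.map((m) => m.trim()).filter((m) => m.length > 0);
}

console.log(resolveEvalModels("openai:gpt-4o, ollama:phi3"));
// -> ["openai:gpt-4o", "ollama:phi3"]
```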

* ✨ feat: add multi-model evaluation to metrics and compliance

Support evaluating metrics and compliance with multiple models via evalModelSet.

* ✨ Refine evaluation model handling and debug logging

Improved evalModelSet defaults, header levels, and debugging output.

* ✨ Enhance evalModelSet sourcing and logging in promptpex

Now supports sourcing evalModelSet from env var, adds validation, and logging.
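
As a hedged sketch only (the variable name `PROMPTPEX_EVAL_MODELS` and the id format check are assumptions, not taken from this PR), env-var sourcing with validation and logging could look like:

```ts
// Hypothetical sketch of sourcing the model set from an environment
// variable, validating each id, and logging the result.
function evalModelsFromEnv(): string[] {
    const raw = process.env.PROMPTPEX_EVAL_MODELS ?? ""; // assumed name
    const models = raw.split(/[;,]/).map((m) => m.trim()).filter(Boolean);
    for (const m of models)
        if (!/^[\w.-]+:[\w.:-]+$/.test(m))
            throw new Error(`invalid eval model id: ${m}`);
    console.debug(`evalModelSet from env: ${models.join(", ") || "(none)"}`);
    return models;
}
```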

* ✨ refactor test metric evaluation and overview model handling

Refined evalModelSet parsing and updated test metric iteration logic.

* ✨ feat: Add combined avg metric across eval models

Compute and store average metric score for all evaluation models used.
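
A minimal sketch of the combined-average idea, assuming scores are kept per model; the `MetricScores` shape is hypothetical:

```ts
// Hypothetical: one synthetic metric that averages a metric's score
// across every eval model that produced one.
type MetricScores = Record<string /* model id */, number>;

function combinedAverage(scores: MetricScores): number {
    const values = Object.values(scores);
    if (values.length === 0) return NaN;
    return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const compliance: MetricScores = { "gpt-4o": 0.9, phi3: 0.7 };
console.log(combinedAverage(compliance)); // 0.8
```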

* ✨ Enhance promptpex test evaluation and script logic

Added separate eval-only/test-run modes and improved metric evaluations.

* ♻️ Rename evalModelSet to evalModel throughout codebase (#141)

Standardizes config and variable naming from evalModelSet to evalModel.

* ✨ Enhance test results saving and eval metrics workflow

Improved control of results file writing and evaluation metrics assignment.

* ✨ Add evals config flag to control evaluation execution

Introduces evals boolean for toggling evaluation of test results.

* ✨ Enable direct context-loading from JSON files

Refactored CLI to load PromptPexContext from JSON, updating file flow.
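
A sketch of what loading a saved `PromptPexContext` from JSON might involve so a run can resume at a later stage; the fields shown are illustrative guesses, not the real interface:

```ts
import { readFile } from "node:fs/promises";

// Hypothetical context shape; the real PromptPexContext has more fields.
interface PromptPexContext {
    prompt?: string;
    tests?: unknown[];
    [key: string]: unknown;
}

async function loadContextFromJson(path: string): Promise<PromptPexContext> {
    const text = await readFile(path, "utf8");
    const ctx = JSON.parse(text) as PromptPexContext;
    if (typeof ctx !== "object" || ctx === null)
        throw new Error(`${path} does not contain a JSON object`);
    return ctx;
}
```
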
* ✨ Add scripts and logic for multi-stage sample evaluations

Introduces zx scripts for gen/run/eval sample tests and conditional test execution.

* 🔀 rename: Sample scripts renamed to .zx.mjs extensions

All run-samples-*.mjs scripts updated to .zx.mjs for zx compatibility.

* ♻️ refactor: Rename sample scripts to .zx.mjs extensions

Updated script names in package.json and renamed a sample file for zx compatibility.

* ✨ Add support for groundtruth model and outputs

Introduces groundtruth model option, result tracking, and output storage.
Extended PromptPexTest and PromptPexTestResult with groundtruth support.
Added lmstudio to settings, expanded UI model suggestions, tidied runTests.
pelikhan and others added 30 commits June 11, 2025 13:18
* ✨ Enhance groundtruth evaluation with multiple models

Added support for evaluating groundtruth with multiple eval models.

* Update src/genaisrc/src/promptpex.mts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* ✨ Enable multiple eval groundtruth models and results merging

Add support for evaluating with multiple groundtruth models and merging results.
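
A sketch of the merging step under assumed shapes; `GroundtruthEval` and the last-write-wins policy are hypothetical, not taken from the diff:

```ts
// Hypothetical: fold per-model groundtruth evaluations into one record
// keyed by model id.
interface GroundtruthEval {
    model: string;
    score: number;
}

function mergeGroundtruthEvals(evals: GroundtruthEval[]): Record<string, number> {
    const merged: Record<string, number> = {};
    for (const e of evals) merged[e.model] = e.score; // last write per model wins
    return merged;
}
```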

* ✨ Update eval paths and enhance JSON context logging

Refined script paths in package.json and improved debug info for JSON context.

* ✨ Refine metrics reporting and output handling logic

Metric keys now include model names; output directories improved.
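
A sketch of the per-model metric key scheme this commit describes; the separator and sanitization are assumptions:

```ts
// Hypothetical: qualify a metric key with the eval model so scores from
// different models don't collide.
function metricKey(metric: string, model: string): string {
    return `${metric}_${model.replace(/[^\w.-]+/g, "-")}`;
}

console.log(metricKey("compliance", "openai:gpt-4o")); // compliance_openai-gpt-4o
```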

* ✨ Display groundtruth eval results in output table

Show filtered groundtruth eval results in the test output section.

* 🔥 refactor: Removed grounding fields from PromptPexTestResult

Eliminated isGrounded and groundedText fields for streamlined interface.

---------

Co-authored-by: Peli de Halleux <pelikhan@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* ✨ feat: Enable separate groundtruth metric evaluation path

Adds support for filtering and evaluating groundtruth metrics via new prompt.

* ✨ feat: Add groundtruthScore based on evalModels averages

Groundtruth metrics are now computed and averaged per test result.

* ✨ Debug and improve groundtruth metrics computation logic

Log groundtruth metrics, fix average scoring, and enhance debug output.

* ✨ refactor groundtruth evaluation and add retry logic

Extracted groundtruth scoring to a helper and added retries for low scores.
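
A sketch tying this retry helper to the threshold constants introduced in the next commit; the constant names and values, and the `scoreGroundtruth` callback, are assumptions:

```ts
// Hypothetical constants (the real ones live in constants.mts).
const GROUNDTRUTH_SCORE_THRESHOLD = 0.5;
const GROUNDTRUTH_MAX_RETRIES = 3;

// Re-score until the threshold is cleared or retries run out; keep the best.
async function scoreWithRetries(
    scoreGroundtruth: () => Promise<number>
): Promise<number> {
    let best = -Infinity;
    for (let attempt = 0; attempt <= GROUNDTRUTH_MAX_RETRIES; attempt++) {
        best = Math.max(best, await scoreGroundtruth());
        if (best >= GROUNDTRUTH_SCORE_THRESHOLD) break;
    }
    return best;
}
```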

* 🚦 feat: introduce configurable groundtruth thresholds

Added constants for groundtruth thresholds and retries in constants.mts. Updated testrun.mts to use these values, making groundtruth score evaluation and retry handling for tests configurable.

Introduces a detailed overview of PromptPex's test groundtruth flow.

Groundtruth scores are now tracked for tests, with improved debug output.