Interesting benchmark!
Recently, you opened an issue in the repository for the changepoynt package.
I think that change-point detection, and especially hyperparameter tuning for the detection (algorithm, window size, lag, etc.), would make for a very interesting use case within the agentic evaluation benchmark. Given ground-truth change points and an accuracy metric for the detections, agent performance would be easily quantifiable. In my experience, choosing an algorithm and tuning its hyperparameters is the hardest part for engineers and practitioners, so an agent loop that handles this would really help.
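To illustrate how such an accuracy metric could look, here is a minimal sketch of an F1-style score that matches predicted change points to ground truth within a tolerance window. The function name and tolerance value are illustrative assumptions, not part of changepoynt's API:

```python
# Hypothetical scoring metric for the agent loop: greedily match each
# predicted change point to an unmatched ground-truth point within a
# tolerance window, then compute precision, recall, and F1.

def changepoint_f1(predicted, truth, tolerance=5):
    """Score predicted change-point indices against ground truth."""
    unmatched = sorted(truth)
    tp = 0
    for p in sorted(predicted):
        # a prediction counts as a hit if a ground-truth point
        # lies within +/- tolerance samples of it
        for t in unmatched:
            if abs(p - t) <= tolerance:
                unmatched.remove(t)  # one-to-one matching
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `changepoint_f1([102, 250], [100, 200, 250])` matches both predictions (precision 1.0) but misses one ground-truth point (recall 2/3), giving an F1 of 0.8 — a single number the agent could optimize against.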
For additional ideas and further details see my answer in the issue.