[GSK-3365] Ease custom metrics support #1871

pierlj · 2024-03-29T08:50:27Z

No description provided.

linear · 2024-03-29T08:50:30Z

GSK-3365 Rework custom metric support

giskard/datasets/metadata/text_metadata_provider.py

giskard/rag/evaluate.py

giskard/rag/metrics/correctness.py

giskard/rag/metrics/ragas_metrics.py

Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 1 to 2.
- [Release notes](https://github.com/softprops/action-gh-release/releases)
- [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md)
- [Commits](softprops/action-gh-release@v1...v2)

---
updated-dependencies:
- dependency-name: softprops/action-gh-release
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* ENH: adding two tests on performance with brier score for binary classification and monotonicity with respect to a specific column
* FIX: fixing precommit checks
* FIX: replace iloc by loc to get rows from index values
* UPD: change test name to test_monotonicity
* ENH: add smoothness test
* FIX: pre-commit
* UPD: adding new norm to test smoothness
* FIX: generate giskard datasets for predictions + fixed brier score to take probas as input + fixed docstrings
* FIX: fixed monotonicity metric + added relevant tests
* UPD: adding the option to pass a reference function as an np.array to compare to predictions
* FIX: pre-commit checks
* UPD: add new checks on features and classification label + add test for checks + fixes related to review
* UPD: change new tests naming convention from miscellaneous to stability + add them to docs
* Update index.rst
* Update performance.rst
* Create stability.rst

---------

Co-authored-by: Hartorn <bazire@giskard.ai>
Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>

* Initial commit for the MVP of "Talk to my model" functionality.
* Defined the basic pipeline of the 'talk' function.
* Defined the Tool interface and the boilerplate for the first tool, which returns a prediction from the row of the dataset.
* small addition
* Added method to initialise tools objects, each time the method 'talk' is called. Replaced placeholders and dummy variables to the real objects.
* Initial implementation of the "__call__" method.
* Bug fixes. Adapted flow to currently use legacy 'functions' instead of 'tools' API.
* Debugged "predict from dataset" tool workflow. Debugged the tool workflow. Performed prompt engineering for the tool description and LLM instruction.
* Initial implementation of the 'SHAPExplanationTool'.
* Added handling an errors, while calling tools.
* Moved more attributes and properties to the BaseTool, since they are common for all child Tools classes.
* Changed PredictFromDataset tool's specification.
* Adapting model.py to the use of tools API.
* Fully changed 'talk' method workflow to use tools API.
* Added multiple toll calling for the SHAP explanation tool
* Code refactoring.
* Initial implementation of the IssuesScannerTool, which gives user an info about model's performance issues.
* Refactoring.
* Removed __futures__ import.
* Started implementing prediction from user input tool.
* Implemented the final PredictUserInputTool.
* Put the shap explanations calculation logic into separate module.
* Explicitly set target to the 'None', when creating Dataset, to omit warnings.
* Distributed the tools across separate dedicated modules for easier maintenance. Code refactoring.
* Implemented history (context) persistence to enable dialogue regime between LLM and the User. Formatted the output of the model.talk() method.
* Small refactoring.
* Executed pre-commit hooks on all files.
* Update regarding new LLMClient API.
* Updated `pdm.lock`
* Finalised adaptation to the new LLMClient API for the 'talk' functionality.
* Removed "_form_tool_calls" method.
* Small fixes.
* Updated pdm.
* Updated pdm.
* United the PredictDatasetInput and PredictUserInput tools into single tool. It improves taxonomy of tools, making it more distinct. Also, and what is important, it allowed to calculate SHAP values, when an input is built from the user input.
* If we already see, that filtered dataset is of length 0, stop further potential filtering. Implemented fuzzy string features matching.
* Created the new tool to calculate model's performance metrics.
* Talk architecture polishing.
* Improved the system prompt to: 1) Avoid providing generic answers; 2) Refuse to answer on a harmful questions.
* System prompt improvement.
* 1) Updated to the latest gpt-4-turbo version; 2) Fixed the bug with metrics calculation, when there is a need to filter dataset;
* Bug fix.
* Added better spacing to the instruct prompt.
* Improved instruction to not provide generic answers.
* Added docstrings.
* Added docstrings.
* Added docstrings.
* Added docstrings.
* Added docstrings.
* Added docstrings.
* Updated typing with the respect to not using the __futures__.
* Replaced thefuzz.ratio() by the native difflib.SequenceMatcher().ratio()
* Removed optional list casting.
* Refactored the dataset filtering logic. Added comments.
* Removed useless casting to list.
* Simplified assignment expression.
* Small fix.
* Replaced by the object's method call.
* Replaced the __str__ by the __repr__
* Moved fuzzy similarity threshold to the config.
* Small fix.
* Removed import BaseOpenAIClient from model.py
* 1) The 'dataset' argument of the 'talk' is mandatory now. 2) An exception will appear, if the user will call the "IssuesScannerTool" through the 'talk', without providing the "scan_report" argument.
* Added clarifying comments, on why to use non-top-level imports, as well as on background sample calculation.
* Added the possibility of configuring Talk LLM model through the env variable.
* Returned the from __future__ import annotations, since we accept such protocol.
* Documented the reason, why to import functions not from the top-level.
* Improved typing and docstrings.
* [RESTORING] dataset is not mandatory parameter.
* Created the new group 'talk' for the 'talk-to-my-ml' feature dependencies.
* Regenerating pdm.lock
* - Fixed ambiguity in calling for 'model performance'. Now, the metric calculation tool is called, instead of scan issues tool.
  - Fixed seed and temperature of the LLM client.
  - Put LLM client.complete() parameters into separate dict.
  - Now the scan tool is supplied with scan_report.to_markdown(template="hugging_face"), thus having more info on scan report, preventing hallucinations.
* Regenerating pdm.lock
* Created unit-tests for the 'talk' feature.
* Small fix.
* Regenerating pdm.lock
* Committing missing pytest file with unit-tests for the 'talk' feature.
* Update giskard/llm/talk/config.py Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>
* Update giskard/llm/talk/config.py Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>
* Update giskard/llm/talk/config.py
* Update giskard/llm/talk/config.py
* Update giskard/llm/talk/config.py Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>
* Update giskard/llm/talk/config.py
* Fixed typos with GPT.
* Better exception raising logic.
* 1) Specified, that model and dataset are mandatory parameters of tools. 2) Improved the logic of mapping pandas dtypes to json dtypes.
* Update giskard/llm/talk/tools/metric.py Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>
* Removed comments.
* Made features_json_type as a property.
* Added `features_dict` validation logic.
* Replaced metrics calculation functions from sklearn to giskard
* Fixed unit-tests by escaping regex-sensitive characters.
* Re-made unit-tests. Mocked LLM responses to avoid dependence on OpenAI API calls.
* Fixed CI/CD errors: 1) Added 'tabulate' package to the 'talk' dependency group'; 2) Improved error matching criteria in the 'talk' unit-tests.
* Regenerating pdm.lock
* Fixed CI/CD errors: Improved error matching criteria in the 'talk' unit-tests to make it compatible with python 3.9.
* Delete pdm.lock
* Regenerating pdm.lock
* Created the docs page for the AI Quality Copilot.
* Regenerating pdm.lock
* Regenerating pdm.lock
* Small docs fix.
* Removed instruction because of redundancy.
* Rewrote the initialization of all tools. Now only mandatory tool parameters can be passed. Also, improved docstrings.
* Introduced PredictionMixin class to abstract away common prediction necessary methods of the Predict and Metric tools. Reduces code duplication.
* Small docstring fix.
* Added doc page for the AI Quality Copilot.
* Returned old page.
* Returned old page.
* Once again, I added the doc page for the AI Quality Copilot.
* Delete pdm.lock
* Regenerating pdm.lock
* Delete pdm.lock
* Regenerating pdm.lock
* Update talk_result.py

---------

Co-authored-by: Hartorn <bazire@giskard.ai>
Co-authored-by: BotLocker <bot.locker@users.noreply.github.com>
Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>

This reverts commit 5d1894d.

sonarcloud · 2024-04-15T14:20:11Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
83.2% Coverage on New Code
0.0% Duplication on New Code

pierlj added 8 commits March 25, 2024 18:42

Apply Bazire's comments

0d58835

Update tests

f09aeae

Early draft custom metric

2c265c9

Simplify custom metrics handling in evaluation

58adfef

Update tests

3681e82

Fix tests

4b0df14

Small fixes

3316a3c

Minor tweaks

795d031

Merge branch 'main' into gsk-3365-rag-custom-metrics

ab948f3

pierlj requested a review from mattbit March 29, 2024 09:44

pierlj marked this pull request as ready for review March 29, 2024 09:44

mattbit requested changes Apr 2, 2024

pierlj and others added 11 commits April 2, 2024 15:16

Apply Matteo's comments

629822c

Fix typo

d8f66b1

Merge branch 'main' into gsk-3365-rag-custom-metrics

4df66f4

Remove unused import

18e040d

Refactoring

306d278

Add ragas version requirement in docs

d0481d8

Change bokeh theme to dark

6bfd406

Refactoring

fe29060

Fix typing issue

b1015ed

Merge branch 'main' into gsk-3365-rag-custom-metrics

c63bbeb

Merge branch 'main' into gsk-3365-rag-custom-metrics

7d92fce

pierlj requested a review from mattbit April 10, 2024 07:02

pierlj and others added 5 commits April 10, 2024 10:14

Fix oos tests

51908f0

Update README.md

c81cca5

Big refactoring: break everything, and rebuild simpler

a18c6d3

Start clean up of evaluators

5da333e

Update prompt for harmful content detector

f3d51cd

kevinmessiaen and others added 16 commits April 15, 2024 09:39

Fixed coherency test

98316e1

adjust wording

859feec

v2.10.0

6498dba

Revert "Feature/gsk 2334 talk to my model mvp (#1831)"

657b386

This reverts commit 5d1894d.

Apply Bazire's comments

79cc14d

Update tests

b54cad2

Early draft custom metric

c0fd55a

Simplify custom metrics handling in evaluation

353fca4

Apply Matteo's comments

0226639

Remove unused import

d9b4b12

Refactoring

734beda

Fix oos tests

2d3da66

Fix test after custom LLM support rebase

3e7514c

pierlj requested a review from a team as a code owner April 15, 2024 08:36

Merge branch 'main' into gsk-3365-rag-custom-metrics

970bbc8

pierlj added the Lockfile label (temporary label to update pdm.lock) Apr 15, 2024

pierlj added 2 commits April 15, 2024 11:57

Update lockfile

2ab6fb6

Fix typo

3b8f451

pierlj removed the Lockfile label (temporary label to update pdm.lock) Apr 15, 2024

mattbit added the Lockfile label (temporary label to update pdm.lock) Apr 15, 2024

Regenerating pdm.lock

2bee135

github-actions bot removed the Lockfile label (temporary label to update pdm.lock) Apr 15, 2024

Fix test report

b7c4969

mattbit approved these changes Apr 15, 2024

mattbit merged commit 9856151 into main Apr 16, 2024
16 checks passed

mattbit deleted the gsk-3365-rag-custom-metrics branch April 16, 2024 07:46
