The LLM Evaluation Framework
The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods. All components (benchmarks, methods, evaluations, and models) are easily extensible.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
A bash shell script to run a single prompt against any or all of your locally installed ollama models, saving the output and performance statistics as easily navigable web pages.
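The project above is a bash script; purely to illustrate the same workflow, here is a minimal Python sketch that shells out to the real `ollama list` and `ollama run` commands, times each response, and writes a simple HTML report. The prompt, output filename, and report layout are assumptions for illustration, not the repo's actual behavior.

```python
"""Hypothetical sketch: run one prompt against locally installed Ollama models
and save outputs plus timing as a simple HTML page. Not the original bash script."""
import html
import subprocess
import time

PROMPT = "Explain retrieval-augmented generation in one paragraph."  # example prompt

def installed_models() -> list[str]:
    # `ollama list` prints a table; the first column of each row is the model name.
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    lines = out.stdout.strip().splitlines()[1:]  # skip the header row
    return [line.split()[0] for line in lines if line.strip()]

def run_prompt(model: str, prompt: str) -> tuple[str, float]:
    # One-shot prompt via `ollama run <model> <prompt>`, timed with a monotonic clock.
    start = time.perf_counter()
    out = subprocess.run(["ollama", "run", model, prompt],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip(), time.perf_counter() - start

rows = []
for model in installed_models():
    answer, seconds = run_prompt(model, PROMPT)
    rows.append(f"<tr><td>{html.escape(model)}</td>"
                f"<td>{seconds:.1f}s</td><td><pre>{html.escape(answer)}</pre></td></tr>")

with open("ollama_report.html", "w", encoding="utf-8") as f:  # hypothetical output file
    f.write("<table border='1'><tr><th>Model</th><th>Time</th><th>Output</th></tr>"
            + "".join(rows) + "</table>")
```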
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
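As one way to wire such an evaluation into a test suite, the pytest-style sketch below scores a hypothetical `call_llm` function against a tiny labeled set and fails the build when accuracy drops below a threshold. The stub, dataset, and threshold are illustrative assumptions, not part of any listed project.

```python
"""Minimal sketch: an LLM evaluation wired into a pytest test suite.
`call_llm` is a hypothetical stand-in for your application's model call."""

def call_llm(prompt: str) -> str:
    # Placeholder stub so the sketch runs on its own; replace with your app's real LLM call.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "2 + 2 = ?": "4",
    }
    return canned.get(prompt, "")

# Tiny labeled evaluation set (hypothetical examples).
EVAL_SET = [
    {"prompt": "What is the capital of France?", "expected": "paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def accuracy(cases) -> float:
    # Count a case as correct when the expected substring appears in the model output.
    hits = sum(1 for c in cases if c["expected"] in call_llm(c["prompt"]).lower())
    return hits / len(cases)

def test_llm_accuracy_threshold():
    # Fail the test suite if quality regresses below an agreed baseline.
    assert accuracy(EVAL_SET) >= 0.8
```

Running this under `pytest` in CI turns the evaluation into a regression gate; logging the same `accuracy` value over time is one way to start the monitoring foundation mentioned above.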
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
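One common approach to such confidence estimation, shown here only as a generic illustration and not necessarily what this repository implements, is self-consistency voting: sample the model several times and treat the agreement rate of the most frequent answer as a confidence proxy.

```python
"""Sketch of one generic confidence estimate for LLM outputs: self-consistency voting.
This illustrates the general idea only; it is not necessarily the repository's method."""
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    # Hypothetical stub standing in for sampling the model with temperature > 0.
    return random.choice(["answer A", "answer A", "answer B"])

def self_consistency_confidence(prompt: str, n_samples: int = 8) -> tuple[str, float]:
    """Return the most frequent answer and the fraction of samples agreeing with it."""
    answers = [sample_answer(prompt).strip().lower() for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples

if __name__ == "__main__":
    answer, confidence = self_consistency_confidence("Is the Great Wall of China visible from space?")
    print(f"answer={answer!r}  confidence={confidence:.2f}")
```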
This repo contains a Streamlit application that provides a user-friendly interface for evaluating large language models (LLMs) using the beyondllm package.
Tools for systematic large language model evaluations