LM-Eval (#445)

* LM-Eval

* add MTEB

* add Ragas

* add UpTrain

* add OpenCompass

* add AgentBench

* add a series of tools

* add Optimum Benchmark

* add Bigcode

* Update README.md

zhimin-z committed Jan 1, 2024
1 parent 94294ed commit 641174c
Showing 1 changed file, README.md, with 15 additions and 1 deletion.
@@ -615,21 +615,35 @@ This repository contains a curated list of awesome open source libraries that wi


## Industry Strength Benchmarking and Evaluation
* [AgentBench](https://github.com/THUDM/AgentBench) ![](https://img.shields.io/github/stars/THUDM/AgentBench.svg?style=social) - A Comprehensive Benchmark to Evaluate LLMs as Agents.
* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) ![](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval.svg?style=social) - An automatic evaluator for instruction-following language models.
* [Auto-evaluator](https://github.com/rlancemartin/auto-evaluator) ![](https://img.shields.io/github/stars/rlancemartin/auto-evaluator.svg?style=social) - Evaluation tool for LLM QA chains.
* [BigCode](https://github.com/bigcode-project/bigcode-evaluation-harness) ![](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness.svg?style=social) - A framework for the evaluation of autoregressive code generation language models.
* [BIG-bench](https://github.com/google/BIG-bench) ![](https://img.shields.io/github/stars/google/BIG-bench.svg?style=social) - The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities.
* [D4RL](https://github.com/Farama-Foundation/D4RL) ![](https://img.shields.io/github/stars/Farama-Foundation/D4RL.svg?style=social) - D4RL is an open-source benchmark for offline reinforcement learning.
* [DeepEval](https://github.com/confident-ai/deepeval) ![](https://img.shields.io/github/stars/confident-ai/deepeval.svg?style=social) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications.
* [EvadeML](https://github.com/mzweilin/EvadeML-Zoo) ![](https://img.shields.io/github/stars/mzweilin/EvadeML-Zoo.svg?style=social) - A benchmarking and visualization tool for adversarial ML.
* [EvalAI](https://github.com/Cloud-CV/EvalAI) ![](https://img.shields.io/github/stars/Cloud-CV/EvalAI.svg?style=social) - EvalAI is an open source platform for evaluating and comparing AI algorithms at scale.
* [Evals](https://github.com/openai/evals) ![](https://img.shields.io/github/stars/openai/evals.svg?style=social) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
* [Evaluate](https://github.com/huggingface/evaluate) ![](https://img.shields.io/github/stars/huggingface/evaluate.svg?style=social) - Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized (a minimal usage sketch follows this list).
* [Helm](https://github.com/stanford-crfm/helm) ![](https://img.shields.io/github/stars/stanford-crfm/helm.svg?style=social) - Holistic Evaluation of Language Models (HELM) is a benchmark framework to increase the transparency of language models.
* [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness) ![](https://img.shields.io/github/stars/EleutherAI/lm-evaluation-harness.svg?style=social) - LM-Eval is a framework for few-shot evaluation of autoregressive language models.
* [Lucid](https://github.com/tensorflow/lucid) ![](https://img.shields.io/github/stars/tensorflow/lucid.svg?style=social) - Lucid is a collection of infrastructure and tools for research in neural network interpretability.
* [Meta-World](https://github.com/Farama-Foundation/Metaworld) ![](https://img.shields.io/github/stars/Farama-Foundation/Metaworld.svg?style=social) - Meta-World is an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of many distinct robotic manipulation tasks.
* [Multi-Modality Arena](https://github.com/OpenGVLab/Multi-Modality-Arena) ![](https://img.shields.io/github/stars/OpenGVLab/Multi-Modality-Arena.svg?style=social) - Multi-Modality Arena is an evaluation platform for large multi-modality models.
* [MTEB](https://github.com/embeddings-benchmark/mteb) ![](https://img.shields.io/github/stars/embeddings-benchmark/mteb.svg?style=social) - Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark of text embeddings.
* [OmniSafe](https://github.com/PKU-MARL/omnisafe) ![](https://img.shields.io/github/stars/PKU-MARL/omnisafe.svg?style=social) - OmniSafe is a comprehensive and reliable benchmark for safe reinforcement learning, covering a multitude of SafeRL domains and delivering a new suite of testing environments.
* [OpenCompass](https://github.com/open-compass/OpenCompass) ![](https://img.shields.io/github/stars/open-compass/OpenCompass.svg?style=social) - OpenCompass is an LLM evaluation platform, supporting a wide range of models (LLaMA, LLaMA 2, ChatGLM2, ChatGPT, Claude, etc.) across 50+ datasets.
* [OpenCV Zoo and Benchmark](https://github.com/opencv/opencv_zoo) ![](https://img.shields.io/github/stars/opencv/opencv_zoo.svg?style=social) - A zoo for models tuned for OpenCV DNN with benchmarks on different platforms.
* [Optimum Benchmark](https://github.com/huggingface/optimum-benchmark) ![](https://img.shields.io/github/stars/huggingface/optimum-benchmark.svg?style=social) - A unified multi-backend utility for benchmarking Transformers and Diffusers with support for Optimum's arsenal of hardware optimizations/quantization schemes.
* [Overcooked-AI](https://github.com/HumanCompatibleAI/overcooked_ai) ![](https://img.shields.io/github/stars/HumanCompatibleAI/overcooked_ai.svg?style=social) - Overcooked-AI is a benchmark environment for fully cooperative human-AI task performance, based on the wildly popular video game Overcooked.
* [PandaLM](https://github.com/WeOpenML/PandaLM) ![](https://img.shields.io/github/stars/WeOpenML/PandaLM.svg?style=social) - PandaLM aims to provide reproducible and automated comparisons between different large language models.
* [PhaseLLM](https://github.com/wgryc/phasellm) ![](https://img.shields.io/github/stars/wgryc/phasellm.svg?style=social) - Large language model evaluation and workflow framework from [Phase AI](https://phasellm.com).
* [Ragas](https://github.com/explodinggradients/ragas) ![](https://img.shields.io/github/stars/explodinggradients/ragas.svg?style=social) - Ragas is a framework that helps evaluate Retrieval Augmented Generation (RAG) pipelines.
* [Recommenders](https://github.com/Microsoft/Recommenders) ![](https://img.shields.io/github/stars/Microsoft/Recommenders.svg?style=social) - Recommenders contains benchmarks and best practices for building recommendation systems, provided as Jupyter notebooks.
* [RLeXplore](https://github.com/yuanmingqi/rl-exploration-baselines) ![](https://img.shields.io/github/stars/yuanmingqi/rl-exploration-baselines.svg?style=social) - RLeXplore provides stable baselines of exploration methods in reinforcement learning.
* [SafePO-Baselines](https://github.com/PKU-MARL/Safe-Policy-Optimization) ![](https://img.shields.io/github/stars/PKU-MARL/Safe-Policy-Optimization.svg?style=social) - SafePO-Baselines is a benchmark repository for safe reinforcement learning algorithms.
* [UpTrain](https://github.com/uptrain-ai/uptrain) ![](https://img.shields.io/github/stars/uptrain-ai/uptrain.svg?style=social) - UpTrain is an open-source tool to evaluate LLM applications.
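
To give a concrete feel for how the lighter-weight evaluation libraries above are typically used, here is a minimal sketch with [Evaluate](https://github.com/huggingface/evaluate); the metric choice and the toy labels are illustrative assumptions, not taken from any entry in this list.

```python
# Minimal sketch: scoring toy predictions with Hugging Face Evaluate.
# Assumes `pip install evaluate scikit-learn` (the accuracy metric delegates to scikit-learn).
import evaluate

accuracy = evaluate.load("accuracy")   # fetch a standard metric by name
result = accuracy.compute(
    predictions=[0, 1, 1, 0],          # toy model outputs (illustrative only)
    references=[0, 1, 0, 0],           # toy ground-truth labels
)
print(result)                          # -> {'accuracy': 0.75}
```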


## Commercial Platform
