LM-Eval (#445)

* LM-Eval

* add MTEB

* add Ragas

* add UpTrain

* add OpenCompass

* add AgentBench

* add a series of tools

* add Optimum Benchmark

* add Bigcode

* Update README.md

zhimin-z committed Jan 1, 2024
1 parent 94294ed commit 641174c
Showing 1 changed file, README.md, with 15 additions and 1 deletion.
@@ -615,21 +615,35 @@ This repository contains a curated list of awesome open source libraries that wi


## Industry Strength Benchmarking and Evaluation
* [AgentBench](https://github.com/THUDM/AgentBench) ![](https://img.shields.io/github/stars/THUDM/AgentBench.svg?style=social) - A Comprehensive Benchmark to Evaluate LLMs as Agents.
* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) ![](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval.svg?style=social) - An automatic evaluator for instruction-following language models.
* [Auto-evaluator](https://github.com/rlancemartin/auto-evaluator) ![](https://img.shields.io/github/stars/rlancemartin/auto-evaluator.svg?style=social) - Evaluation tool for LLM QA chains.
* [BigCode](https://github.com/bigcode-project/bigcode-evaluation-harness) ![](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness.svg?style=social) - A framework for the evaluation of autoregressive code generation language models.
* [BIG-bench](https://github.com/google/BIG-bench) ![](https://img.shields.io/github/stars/google/BIG-bench.svg?style=social) - The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities.
* [D4RL](https://github.com/Farama-Foundation/D4RL) ![](https://img.shields.io/github/stars/Farama-Foundation/D4RL.svg?style=social) - D4RL is an open-source benchmark for offline reinforcement learning.
* [DeepEval](https://github.com/confident-ai/deepeval) ![](https://img.shields.io/github/stars/confident-ai/deepeval.svg?style=social) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications.
* [EvadeML](https://github.com/mzweilin/EvadeML-Zoo) ![](https://img.shields.io/github/stars/mzweilin/EvadeML-Zoo.svg?style=social) - A benchmarking and visualization tool for adversarial ML.
* [EvalAI](https://github.com/Cloud-CV/EvalAI) ![](https://img.shields.io/github/stars/Cloud-CV/EvalAI.svg?style=social) - EvalAI is an open source platform for evaluating and comparing AI algorithms at scale.
* [Evals](https://github.com/openai/evals) ![](https://img.shields.io/github/stars/openai/evals.svg?style=social) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
* [Evaluate](https://github.com/huggingface/evaluate) ![](https://img.shields.io/github/stars/huggingface/evaluate.svg?style=social) - Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized (a minimal usage sketch follows this list).
* [Helm](https://github.com/stanford-crfm/helm) ![](https://img.shields.io/github/stars/stanford-crfm/helm.svg?style=social) - Holistic Evaluation of Language Models (HELM) is a benchmark framework to increase the transparency of language models.
* [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness) ![](https://img.shields.io/github/stars/EleutherAI/lm-evaluation-harness.svg?style=social) - LM-Eval is a framework for few-shot evaluation of autoregressive language models.
* [Lucid](https://github.com/tensorflow/lucid) ![](https://img.shields.io/github/stars/tensorflow/lucid.svg?style=social) - Lucid is a collection of infrastructure and tools for research in neural network interpretability.
* [Meta-World](https://github.com/Farama-Foundation/Metaworld) ![](https://img.shields.io/github/stars/Farama-Foundation/Metaworld.svg?style=social) - Meta-World is an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of many distinct robotic manipulation tasks.
* [Multi-Modality Arena](https://github.com/OpenGVLab/Multi-Modality-Arena) ![](https://img.shields.io/github/stars/OpenGVLab/Multi-Modality-Arena.svg?style=social) - Multi-Modality Arena is an evaluation platform for large multi-modality models.
* [MTEB](https://github.com/embeddings-benchmark/mteb) ![](https://img.shields.io/github/stars/embeddings-benchmark/mteb.svg?style=social) - Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark of text embeddings.
* [OmniSafe](https://github.com/PKU-MARL/omnisafe) ![](https://img.shields.io/github/stars/PKU-MARL/omnisafe.svg?style=social) - OmniSafe is a comprehensive and reliable benchmark for safe reinforcement learning, covering a multitude of SafeRL domains and delivering a new suite of testing environments.
* [OpenCompass](https://github.com/open-compass/OpenCompass) ![](https://img.shields.io/github/stars/open-compass/OpenCompass.svg?style=social) - OpenCompass is an LLM evaluation platform, supporting a wide range of models (LLaMA, LLaMA 2, ChatGLM2, ChatGPT, Claude, etc.) across 50+ datasets.
* [OpenCV Zoo and Benchmark](https://github.com/opencv/opencv_zoo) ![](https://img.shields.io/github/stars/opencv/opencv_zoo.svg?style=social) - A zoo for models tuned for OpenCV DNN with benchmarks on different platforms.
* [Optimum Benchmark](https://github.com/huggingface/optimum-benchmark) ![](https://img.shields.io/github/stars/huggingface/optimum-benchmark.svg?style=social) - A unified multi-backend utility for benchmarking Transformers and Diffusers with support for Optimum's arsenal of hardware optimizations/quantization schemes.
* [Overcooked-AI](https://github.com/HumanCompatibleAI/overcooked_ai) ![](https://img.shields.io/github/stars/HumanCompatibleAI/overcooked_ai.svg?style=social) - Overcooked-AI is a benchmark environment for fully cooperative human-AI task performance, based on the wildly popular video game Overcooked.
* [PandaLM](https://github.com/WeOpenML/PandaLM) ![](https://img.shields.io/github/stars/WeOpenML/PandaLM.svg?style=social) - PandaLM aims to provide reproducible and automated comparisons between different large language models.
* [PhaseLLM](https://github.com/wgryc/phasellm) ![](https://img.shields.io/github/stars/wgryc/phasellm.svg?style=social) - Large language model evaluation and workflow framework from [Phase AI](https://phasellm.com).
* [Ragas](https://github.com/explodinggradients/ragas) ![](https://img.shields.io/github/stars/explodinggradients/ragas.svg?style=social) - Ragas is a framework that helps evaluate Retrieval Augmented Generation (RAG) pipelines.
* [Recommenders](https://github.com/Microsoft/Recommenders) ![](https://img.shields.io/github/stars/Microsoft/Recommenders.svg?style=social) - Recommenders contains benchmarks and best practices for building recommendation systems, provided as Jupyter notebooks.
* [RLeXplore](https://github.com/yuanmingqi/rl-exploration-baselines) ![](https://img.shields.io/github/stars/yuanmingqi/rl-exploration-baselines.svg?style=social) - RLeXplore provides stable baselines of exploration methods in reinforcement learning.
* [SafePO-Baselines](https://github.com/PKU-MARL/Safe-Policy-Optimization) ![](https://img.shields.io/github/stars/PKU-MARL/Safe-Policy-Optimization.svg?style=social) - SafePO-Baselines is a benchmark repository for safe reinforcement learning algorithms.
* [UpTrain](https://github.com/uptrain-ai/uptrain) ![](https://img.shields.io/github/stars/uptrain-ai/uptrain.svg?style=social) - UpTrain is an open-source tool to evaluate LLM applications.
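
To give a concrete feel for how the lighter-weight evaluation libraries above are typically used, here is a minimal sketch with [Evaluate](https://github.com/huggingface/evaluate); the metric choice and the toy labels are illustrative assumptions, not taken from any entry in this list.

```python
# Minimal sketch: scoring toy predictions with Hugging Face Evaluate.
# Assumes `pip install evaluate scikit-learn` (the accuracy metric delegates to scikit-learn).
import evaluate

accuracy = evaluate.load("accuracy")   # fetch a standard metric by name
result = accuracy.compute(
    predictions=[0, 1, 1, 0],          # toy model outputs (illustrative only)
    references=[0, 1, 0, 0],           # toy ground-truth labels
)
print(result)                          # -> {'accuracy': 0.75}
```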


## Commercial Platform
