evaluation

Here are 1,105 public repositories matching this topic...

langwatch / langwatch

🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.

ai analytics evaluation openai gpt datasets observability llm prompt-engineering

Updated Jun 2, 2024
TypeScript

HowieHwong / TrustLLM

Star

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models

nlp benchmark natural-language-processing ai toolkit evaluation dataset pypi-package trustworthy-machine-learning trustworthy-ai large-language-models llm

Updated Jun 2, 2024
Python

promptfoo / promptfoo

Star

Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

testing ci evaluation ci-cd cicd prompts evaluation-framework rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Jun 2, 2024
TypeScript

langfuse / langfuse

Star

🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management evals llm-evaluation

Updated Jun 2, 2024
TypeScript

tatsu-lab / alpaca_eval

Star

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

nlp deep-learning leaderboard evaluation instruction-following foundation-models large-language-models rlhf

Updated Jun 1, 2024
Jupyter Notebook

JieyuZ2 / TaskMeAnything

Star

A task generation and model evaluation system.

benchmark evaluation foundation-models

Updated Jun 1, 2024
Python

dustalov / llmfao

Star

Large Language Model Feedback Analysis and Optimization (LLMFAO)

benchmark leaderboard evaluation crowdsourcing pairwise-comparison llm llmfao

Updated Jun 1, 2024
Python

cdaringe / programming-language-selector

Star

Programming Language Selector based on language metadata and user-specified values.

decision-making evaluation languages

Updated Jun 1, 2024
TypeScript

Psycoy / MixEval

Star

MixEval, a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% the time and cost of running MMLU), with its queries being stably updated every month to avoid contamination.

benchmark machine-learning deep-learning evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-training llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture dynamic-benchmark

Updated Jun 1, 2024
Python

luka2220 / fun-interpreter

Star

Interpreter for a programming language with basic features

java interpreter parsing evaluation lexical-analysis

Updated Jun 1, 2024
Java

mary-lev / NER

Star

Evaluation of Named Entity Recognition Models for Russian News Texts in the Cultural Domain

evaluation named-entity-recognition russian ner llm llms

Updated Jun 1, 2024
Jupyter Notebook

EXP-Tools / steam-discount

Star

steam 特惠游戏榜单（自动刷新）

steam crawler evaluation rank discount zero playing

Updated Jun 1, 2024
Python

athina-ai / athina-evals

Star

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Jun 1, 2024
Python

krisstallenberg / literature-rag

Star

Comparing Naive- and Advanced RAG to LLM with full-context injection on a Question-answering (QA) dataset based on a narrative text. Evaluation for accuracy, latency and cost.

evaluation literature digital-humanities hyde rag llms narrative-text

Updated Jun 1, 2024
Jupyter Notebook

NullDev / Spendenr-AI-d

Star

AI powered Spendenraid evaluation.

nodejs python ocr evaluation classification pr0gramm nulldev

Updated Jun 1, 2024
JavaScript

CAS-SIAT-XinHai / CPsyCoun

Star

[ACL 2024]CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling

nlp evaluation dataset dataset-generation mental-health llm

Updated Jun 1, 2024
Jupyter Notebook

open-compass / opencompass

Star

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Jun 1, 2024
Python

lunary-ai / lunary

Star

The production toolkit for LLMs. Observability, prompt management and evaluations.

testing ai monitoring evaluation logs self-hosted openai hacktoberfest observability prompts llm langchain

Updated May 31, 2024
TypeScript

kolenaIO / kolena

Star

Python client for Kolena's machine learning testing platform

testing machine-learning evaluation evaluation-metrics evaluation-framework mlops evaluate-models llmops

Updated May 31, 2024
Python

deshwalmahesh / PHUDGE

Star

Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute, relative and much more. It contains a list of all the available tool, methods, repo, code etc to detect hallucination, LLM evaluation, grading and much more.

nlp ai evaluation ml pytorch judge feedback-collection sota custom-dataset finetuning hallucination llm llm-evaluation hallucination-detection phi-3

Updated May 31, 2024
Jupyter Notebook

Improve this page

Add a description, image, and links to the evaluation topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the evaluation topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

evaluation

Here are 1,105 public repositories matching this topic...

langwatch / langwatch

HowieHwong / TrustLLM

promptfoo / promptfoo

langfuse / langfuse

tatsu-lab / alpaca_eval

JieyuZ2 / TaskMeAnything

dustalov / llmfao

cdaringe / programming-language-selector

Psycoy / MixEval

luka2220 / fun-interpreter

mary-lev / NER

EXP-Tools / steam-discount

athina-ai / athina-evals

krisstallenberg / literature-rag

NullDev / Spendenr-AI-d

CAS-SIAT-XinHai / CPsyCoun

open-compass / opencompass

lunary-ai / lunary

kolenaIO / kolena

deshwalmahesh / PHUDGE

Improve this page

Add this topic to your repo