
LLM Eval Studio

Interactive LLM Output Quality Evaluation Dashboard

Next.js TypeScript Tailwind CSS Gemini Recharts

Live Demo


Paste any LLM output alongside its source text. Get instant quality scores across 5 dimensions, visualized on a radar chart with claim-level hallucination analysis.

Shipping LLMs without evaluation guardrails is flying blind. LLM Eval Studio gives you multi-dimensional quality scoring for any LLM output -- single evaluations, side-by-side comparisons, or batch CSV processing.


Features

Single Evaluation Mode

  • Two side-by-side text areas: Source / Ground Truth and LLM Output
  • Click "Evaluate" to score the output across 5 quality dimensions
  • Results panel with:
    • Radar chart showing all 5 metrics at a glance
    • Overall score (large number, color-coded: green above 80, yellow 60--80, red below 60)
    • Per-metric cards with score, explanation, and tooltip
    • Claim-level breakdown: each claim in the output marked as Supported or Unsupported

Comparison Mode

  • Add a third text area for LLM Output B
  • Side-by-side radar charts comparing Output A vs. Output B
  • Winner highlighted for each metric
  • "Which output is better?" summary with explanation

Batch Evaluation Mode

  • Upload a CSV file with source and output columns
  • Progress bar during evaluation
  • Results include:
    • Summary statistics (mean, min, max per metric)
    • Sortable table of all evaluations
    • Score distribution bar charts
    • Average metrics chart
  • Download Report button -- exports CSV with all scores + JSON with detailed breakdowns
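
The batch endpoint expects a header row naming the source and output columns; a minimal input file might look like this (the rows are illustrative):

```csv
source,output
"The capital of France is Paris.","Paris is the capital city of France."
"Water boils at 100 degrees Celsius at sea level.","Water boils at 90 degrees Celsius everywhere."
```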

Pre-loaded Examples

Three built-in examples to try instantly:

| Example | Demonstrates |
|---|---|
| Good Output | High scores across all dimensions |
| Hallucinated Output | Low faithfulness score -- catches fabricated claims |
| Incomplete Output | Low completeness score -- detects missing information |

Evaluation History

  • All evaluations stored in localStorage
  • Sliding history panel to revisit past evaluations
  • Click any entry to re-view its full results
  • "Clear history" button
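
The helpers in src/lib/history.ts are not shown here; a hypothetical sketch of localStorage-backed persistence follows (the entry shape, storage key, and the Storage-like parameter are all assumptions for illustration -- in the browser you would pass `localStorage` directly):

```typescript
// Hypothetical history entry, based on the metrics described in this README.
interface HistoryEntry {
  id: string;
  timestamp: number;
  source: string;
  output: string;
  overall: number; // 0-100
  scores: Record<string, number>;
}

// Minimal Storage-like interface so the helpers also work outside the browser.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const KEY = "llm-eval-history"; // hypothetical storage key

function loadHistory(store: KVStore): HistoryEntry[] {
  const raw = store.getItem(KEY);
  return raw ? (JSON.parse(raw) as HistoryEntry[]) : [];
}

function saveEntry(store: KVStore, entry: HistoryEntry): void {
  // Newest first, so a sliding panel shows recent evaluations at the top.
  const next = [entry, ...loadHistory(store)];
  store.setItem(KEY, JSON.stringify(next));
}

function clearHistory(store: KVStore): void {
  store.setItem(KEY, JSON.stringify([]));
}
```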

Evaluation Metrics

| Metric | Method | What It Measures |
|---|---|---|
| ROUGE-L | Algorithmic (LCS F1) | Longest common subsequence overlap between source and output |
| Semantic Similarity | Gemini Embeddings + Cosine Similarity | How closely the output's meaning matches the source |
| Accuracy | LLM-as-Judge (Gemini 2.5 Flash, 1--10) | Factual correctness relative to the source |
| Completeness | LLM-as-Judge (Gemini 2.5 Flash, 1--10) | Coverage of key information from the source |
| Faithfulness | LLM-as-Judge (Gemini 2.5 Flash, 1--10) | Absence of hallucinated or fabricated claims |
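
ROUGE-L is the one metric computed locally; the LCS-based F1 it names can be sketched like this (whitespace tokenization is an assumption -- the app's actual tokenizer may differ):

```typescript
// Dynamic-programming longest common subsequence length over token arrays.
function lcsLength(a: string[], b: string[]): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1
          : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}

// ROUGE-L F1: harmonic mean of LCS recall (vs. source) and precision (vs. output).
function rougeL(source: string, output: string): number {
  const ref = source.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = output.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0 || hyp.length === 0) return 0;
  const lcs = lcsLength(ref, hyp);
  const recall = lcs / ref.length;
  const precision = lcs / hyp.length;
  if (recall + precision === 0) return 0;
  return (2 * recall * precision) / (recall + precision); // F1 in [0, 1]
}
```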

All scores are normalized to 0--100 for the radar chart and overall score.
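
A plausible normalization, assuming linear mappings and an equal-weight overall (both assumptions -- the app may weight metrics differently):

```typescript
// ROUGE-L F1 and cosine similarity arrive in [0, 1].
function fromUnit(x: number): number {
  return Math.round(Math.max(0, Math.min(1, x)) * 100);
}

// LLM-as-Judge scores arrive on a 1-10 scale.
function fromJudge(score: number): number {
  const clamped = Math.max(1, Math.min(10, score));
  return Math.round(((clamped - 1) / 9) * 100);
}

// Overall score: equal-weight mean of the five normalized metrics.
function overall(scores: number[]): number {
  return Math.round(scores.reduce((sum, x) => sum + x, 0) / scores.length);
}
```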


Architecture

                    +---------------------+
                    |   LLM Eval Studio   |
                    |      Frontend       |
                    +----------+----------+
                               |
            +------------------+------------------+
            |                  |                  |
            v                  v                  v
     /api/evaluate      /api/batch        /api/embeddings
     (Single eval)      (CSV batch)       (Vector embeddings)
            |                  |                  |
            +--------+---------+------------------+
                     |
          +----------+----------+
          |                     |
          v                     v
   ROUGE-L (local)     Google Gemini API
   LCS algorithm       - Embeddings (cosine similarity)
                       - LLM-as-Judge (accuracy,
                         completeness, faithfulness)
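
The embeddings branch reduces to a cosine similarity between the source and output vectors returned by Gemini; a minimal sketch:

```typescript
// Cosine similarity between two embedding vectors, as used for the
// Semantic Similarity metric. Returns a value in [-1, 1]; how the dashboard
// maps it onto 0-100 is not specified here.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```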

Tech Stack

| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router, Server Components) |
| Language | TypeScript 5 |
| Styling | Tailwind CSS 4 (light theme, dashboard aesthetic) |
| LLM | Google Gemini 2.5 Flash (LLM-as-Judge) |
| Embeddings | Gemini Embedding Model (cosine similarity) |
| Charts | Recharts (radar, bar, distribution) |
| CSV Parsing | PapaParse |
| Animations | Framer Motion |
| Icons | Lucide React |
| Storage | localStorage (evaluation history) |

Getting Started

Prerequisites

  • Node.js and npm (a version supported by Next.js 16)
  • A Google Gemini API key (used for embeddings and LLM-as-Judge scoring)

Installation

git clone https://github.com/Samarth0211/LLMEvalStudio.git
cd LLMEvalStudio
npm install

Environment Variables

Create a .env.local file in the project root:

GOOGLE_API_KEY=your_gemini_api_key_here

Run Development Server

npm run dev

Open http://localhost:3000 in your browser.

Production Build

npm run build
npm start

Project Structure

LLMEvalStudio/
  src/
    app/
      page.tsx                # Main evaluation page (single + comparison)
      batch/page.tsx          # Batch evaluation page
      layout.tsx              # Root layout with navbar
      api/
        evaluate/route.ts     # Single evaluation (ROUGE-L + embeddings + LLM judge)
        batch/route.ts        # Batch CSV evaluation
        embeddings/route.ts   # Gemini embeddings endpoint
    components/
      Navbar.tsx              # Navigation with mode switching
      RadarChart.tsx          # Recharts radar visualization
      ScoreCard.tsx           # Individual metric card with tooltip
      OverallScore.tsx        # Circular progress score display
      ClaimsList.tsx          # Claim-level Supported/Unsupported analysis
      HistoryPanel.tsx        # Sliding evaluation history sidebar
      ExampleButton.tsx       # Pre-loaded examples dropdown
    lib/
      types.ts                # TypeScript type definitions
      examples.ts             # Pre-loaded example data
      history.ts              # localStorage history management
      utils.ts                # Utility functions
  .env.local                  # Environment variables (not committed)
  package.json

API Routes

| Endpoint | Method | Description |
|---|---|---|
| /api/evaluate | POST | Evaluate a single source/output pair across all 5 metrics |
| /api/batch | POST | Evaluate an array of source/output pairs with summary stats |
| /api/embeddings | POST | Generate Gemini embeddings for an array of texts |

Example: Single Evaluation Request

curl -X POST http://localhost:3000/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "source": "The capital of France is Paris.",
    "output": "Paris is the capital city of France, located in Europe."
  }'
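
The response body is not documented above; a hypothetical shape consistent with the five metrics and the claim-level analysis might look like this (field names and values are illustrative assumptions, not the actual API contract):

```typescript
// Hypothetical /api/evaluate response, inferred from the dashboard's features.
interface EvaluateResponse {
  overall: number; // 0-100, color-coded in the UI
  metrics: {
    rougeL: number;
    semanticSimilarity: number;
    accuracy: number;
    completeness: number;
    faithfulness: number;
  };
  claims: { text: string; supported: boolean }[];
}

// Illustrative example for the request shown above.
const example: EvaluateResponse = {
  overall: 87,
  metrics: {
    rougeL: 62,
    semanticSimilarity: 91,
    accuracy: 95,
    completeness: 88,
    faithfulness: 80,
  },
  claims: [
    { text: "Paris is the capital of France.", supported: true },
    { text: "Paris is located in Europe.", supported: false },
  ],
};
```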

Deployment

Deployed on Vercel. To deploy your own instance:

  1. Fork this repo
  2. Import it into Vercel
  3. Add GOOGLE_API_KEY as an environment variable in Vercel project settings
  4. Deploy

Author

Samarth Bhamare -- AI/ML Engineer


License

MIT
