Paste any LLM output alongside its source text. Get instant quality scores across 5 dimensions, visualized on a radar chart with claim-level hallucination analysis.
Shipping LLMs without evaluation guardrails is flying blind. LLM Eval Studio gives you multi-dimensional quality scoring for any LLM output -- single evaluations, side-by-side comparisons, or batch CSV processing.
- Two side-by-side text areas: Source / Ground Truth and LLM Output
- Click "Evaluate" to score the output across 5 quality dimensions
- Results panel with:
- Radar chart showing all 5 metrics at a glance
- Overall score (large number, color-coded: green ≥ 80, yellow 60--79, red < 60)
- Per-metric cards with score, explanation, and tooltip
- Claim-level breakdown: each claim in the output marked as Supported or Unsupported
- Add a third text area for LLM Output B
- Side-by-side radar charts comparing Output A vs. Output B
- Winner highlighted for each metric
- "Which output is better?" summary with explanation
- Upload a CSV file with `source` and `output` columns
- Progress bar during evaluation
- Results include:
- Summary statistics (mean, min, max per metric)
- Sortable table of all evaluations
- Score distribution bar charts
- Average metrics chart
- Download Report button -- exports CSV with all scores + JSON with detailed breakdowns
Three built-in examples to try instantly:
| Example | Demonstrates |
|---|---|
| Good Output | High scores across all dimensions |
| Hallucinated Output | Low faithfulness score -- catches fabricated claims |
| Incomplete Output | Low completeness score -- detects missing information |
- All evaluations stored in `localStorage`
- Sliding history panel to revisit past evaluations
- Click any entry to re-view its full results
- "Clear history" button
| Metric | Method | What It Measures |
|---|---|---|
| ROUGE-L | Algorithmic (LCS F1) | Longest common subsequence overlap between source and output |
| Semantic Similarity | Gemini Embeddings + Cosine Similarity | How closely the output's meaning matches the source |
| Accuracy | LLM-as-Judge (Gemini 2.5 Flash, 1--10) | Factual correctness relative to the source |
| Completeness | LLM-as-Judge (Gemini 2.5 Flash, 1--10) | Coverage of key information from the source |
| Faithfulness | LLM-as-Judge (Gemini 2.5 Flash, 1--10) | Absence of hallucinated or fabricated claims |
All scores are normalized to 0--100 for the radar chart and overall score.
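The ROUGE-L row is the one fully local metric: an F1 over the longest common subsequence of source and output tokens, then scaled to 0--100. A minimal token-level sketch (illustrative, not the repo's actual implementation):

```typescript
// Length of the longest common subsequence, using a rolling 1-D DP row.
function lcsLength(a: string[], b: string[]): number {
  const dp: number[] = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    let prev = 0; // holds dp[i-1][j-1]
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = a[i - 1] === b[j - 1] ? prev + 1 : Math.max(dp[j], dp[j - 1]);
      prev = tmp;
    }
  }
  return dp[b.length];
}

// ROUGE-L F1 between source and output, normalized to 0--100.
function rougeL(source: string, output: string): number {
  const ref = source.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = output.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0 || hyp.length === 0) return 0;
  const lcs = lcsLength(ref, hyp);
  const precision = lcs / hyp.length;
  const recall = lcs / ref.length;
  if (precision + recall === 0) return 0;
  const f1 = (2 * precision * recall) / (precision + recall);
  return Math.round(f1 * 100);
}
```

The LLM-as-Judge metrics follow the same convention: a 1--10 rating maps linearly onto the shared 0--100 scale.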
+---------------------+
| LLM Eval Studio |
| Frontend |
+----------+----------+
|
+------------------+------------------+
| | |
v v v
/api/evaluate /api/batch /api/embeddings
(Single eval) (CSV batch) (Vector embeddings)
| | |
+--------+---------+------------------+
|
+----------+----------+
| |
v v
ROUGE-L (local)          Google Gemini API
- LCS algorithm          - Embeddings (cosine similarity)
                         - LLM-as-Judge (accuracy,
                           completeness, faithfulness)
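Once the Gemini API returns two embedding vectors, the semantic-similarity score is just their cosine similarity scaled to 0--100. A sketch of that final local step (the short vectors used in practice would be real embeddings; the function names are illustrative):

```typescript
// Cosine similarity between two equal-length vectors, in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map similarity onto the app's 0--100 scale, clamping negatives to 0.
const toScore = (sim: number): number => Math.round(Math.max(0, sim) * 100);
```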
| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router, Server Components) |
| Language | TypeScript 5 |
| Styling | Tailwind CSS 4 (light theme, dashboard aesthetic) |
| LLM | Google Gemini 2.5 Flash (LLM-as-Judge) |
| Embeddings | Gemini Embedding Model (cosine similarity) |
| Charts | Recharts (radar, bar, distribution) |
| CSV Parsing | PapaParse |
| Animations | Framer Motion |
| Icons | Lucide React |
| Storage | localStorage (evaluation history) |
- Node.js 18 or later
- A Google Gemini API key (free tier works)
git clone https://github.com/Samarth0211/LLMEvalStudio.git
cd LLMEvalStudio
npm install

Create a `.env.local` file in the project root:

GOOGLE_API_KEY=your_gemini_api_key_here

npm run dev

Open http://localhost:3000 in your browser.
npm run build
npm start

LLMEvalStudio/
src/
app/
page.tsx # Main evaluation page (single + comparison)
batch/page.tsx # Batch evaluation page
layout.tsx # Root layout with navbar
api/
evaluate/route.ts # Single evaluation (ROUGE-L + embeddings + LLM judge)
batch/route.ts # Batch CSV evaluation
embeddings/route.ts # Gemini embeddings endpoint
components/
Navbar.tsx # Navigation with mode switching
RadarChart.tsx # Recharts radar visualization
ScoreCard.tsx # Individual metric card with tooltip
OverallScore.tsx # Circular progress score display
ClaimsList.tsx # Claim-level Supported/Unsupported analysis
HistoryPanel.tsx # Sliding evaluation history sidebar
ExampleButton.tsx # Pre-loaded examples dropdown
lib/
types.ts # TypeScript type definitions
examples.ts # Pre-loaded example data
history.ts # localStorage history management
utils.ts # Utility functions
.env.local # Environment variables (not committed)
package.json
| Endpoint | Method | Description |
|---|---|---|
| /api/evaluate | POST | Evaluate a single source/output pair across all 5 metrics |
| /api/batch | POST | Evaluate an array of source/output pairs with summary stats |
| /api/embeddings | POST | Generate Gemini embeddings for an array of texts |
curl -X POST http://localhost:3000/api/evaluate \
-H "Content-Type: application/json" \
-d '{
"source": "The capital of France is Paris.",
"output": "Paris is the capital city of France, located in Europe."
}'

Deployed on Vercel. To deploy your own instance:
- Fork this repo
- Import it into Vercel
- Add `GOOGLE_API_KEY` as an environment variable in Vercel project settings
- Deploy
Samarth Bhamare -- AI/ML Engineer
MIT