This tool helps evaluate responses from a RAG (Retrieval-Augmented Generation) system by comparing them against predefined ground truth answers using multiple scoring metrics. It automates the process of sending queries through a web interface and generates comprehensive evaluation reports.
- Web automation using Playwright for interacting with RAG system UI
- Multiple scoring metrics (see detailed guide below):
- Cosine Similarity (using TF-IDF)
- ROUGE-L Score
- Exact Match
- Token F1 Score
- METEOR Score
- Modular architecture for easy addition of new scoring methods
- CSV-based input and output for easy data management
- Detailed evaluation reports with multiple metrics
- Spanish language support with proper tokenization and stemming
This section provides detailed information about each scoring method available in the evaluation tool. Understanding these metrics will help you interpret the results and choose the most appropriate evaluation approach for your RAG system.
Purpose: Measures lexical similarity between texts using TF-IDF vectorization, as a rough proxy for semantic similarity.
How it works:
- Converts both reference and candidate texts into TF-IDF vectors
- Calculates the cosine of the angle between these vectors
- Uses Spanish tokenization and stemming for better language-specific accuracy
Score Range: 0.0 to 1.0
- 1.0: Identical weighted vocabulary (word order is ignored)
- 0.8 to <1.0: High similarity, likely good match
- 0.6 to <0.8: Moderate similarity, partial match
- 0.3 to <0.6: Low similarity, different content
- 0.0 to <0.3: Very low similarity, unrelated content
Best for: Evaluating closeness of content when exact wording and word order aren't required.
Limitations:
- TF-IDF is calculated only on the two documents being compared, limiting IDF effectiveness
- May not capture complex semantic relationships
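The steps above can be sketched in plain Python. This is a minimal illustration, not the tool's implementation: the `tokenize` and `tfidf_cosine` names are hypothetical, the tokenizer only lowercases and splits on word characters (no Spanish stemming), and the smoothed IDF formula is an assumption borrowed from the default used by scikit-learn's `TfidfVectorizer`.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Simplified tokenizer (assumption): lowercase word characters, no stemming.
    return re.findall(r"\w+", text.lower())


def tfidf_cosine(reference: str, candidate: str) -> float:
    docs = [Counter(tokenize(reference)), Counter(tokenize(candidate))]
    if not docs[0] or not docs[1]:
        return 0.0
    vocab = set(docs[0]) | set(docs[1])
    n_docs = len(docs)
    # Smoothed IDF. With only two documents, a term's document frequency
    # is either 1 or 2, so IDF can take just two distinct values.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1.0
        for term in vocab
    }
    vectors = [{t: d[t] * idf[t] for t in d} for d in docs]
    dot = sum(w * vectors[1].get(t, 0.0) for t, w in vectors[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    return dot / (norms[0] * norms[1])
```

Because the IDF weights are fitted on just the two texts, they do little more than down-weight shared terms relative to unshared ones, which is the first limitation listed above.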
Purpose: Measures longest common subsequence (LCS) between reference and candidate texts.
How it works:
- Finds the longest common subsequence of words between texts
- Calculates F-measure based on precision and recall of the LCS
- Uses Spanish stemming to handle word variations
Score Range: 0.0 to 1.0
- 1.0: Perfect word order and content match
- 0.8 to <1.0: Excellent overlap with good word order preservation
- 0.6 to <0.8: Good overlap, some word order differences
- 0.4 to <0.6: Moderate overlap, significant differences
- 0.0 to <0.4: Poor overlap, very different content
Best for: Evaluating fluency and word order preservation, especially important for summaries.
Limitations:
- Focuses on word order, may penalize semantically correct but differently structured answers
- Less effective for short texts
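A minimal sketch of the LCS-based F-measure described above. The function names are hypothetical, tokenization is a plain lowercase whitespace split (no Spanish stemming), and a balanced F1 is assumed; the tool itself may weight precision and recall differently.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic programme, keeping only the previous row in memory.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)
```

Swapping two adjacent words in a four-word answer shortens the LCS from 4 to 3, so the score drops to 0.75 even though the vocabulary is unchanged.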
Purpose: Checks if texts are identical after normalization.
How it works:
- Normalizes both texts (lowercases, removes accents, strips whitespace)
- Returns 1.0 if texts match exactly, 0.0 otherwise
Score Range: Binary (0.0 or 1.0)
- 1.0: Perfect match after normalization
- 0.0: Any difference in content
Best for:
- Factual questions with single correct answers
- Evaluating precision in specific information retrieval
- Quality control for critical information
Limitations:
- Very strict, doesn't account for paraphrasing or synonyms
- Not suitable for open-ended questions
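The normalization step can be sketched with the standard library alone. The function names are hypothetical, and collapsing internal runs of whitespace is an assumption; the description above only says that whitespace is stripped.

```python
import unicodedata


def normalize(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Assumption: collapse internal whitespace runs as well as trimming the ends.
    return " ".join(no_accents.split())


def exact_match(reference: str, candidate: str) -> float:
    return 1.0 if normalize(reference) == normalize(candidate) else 0.0
```

Note that this approach also turns "ñ" into "n", so "año" and "ano" would compare as equal. Whether that is acceptable depends on the data.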
Purpose: Measures word-level overlap using precision and recall.
How it works:
- Tokenizes both texts into individual words
- Calculates precision (common tokens / candidate tokens)
- Calculates recall (common tokens / reference tokens)
- Computes F1 score as harmonic mean of precision and recall
Score Range: 0.0 to 1.0
- 1.0: Perfect word overlap
- 0.8 to <1.0: High word overlap, comprehensive answer
- 0.6 to <0.8: Good word overlap, mostly correct content
- 0.4 to <0.6: Moderate overlap, missing some key terms
- 0.0 to <0.4: Poor overlap, significantly different vocabulary
Best for:
- Evaluating vocabulary coverage
- Measuring information completeness
- Balancing precision and recall in content evaluation
Limitations:
- Doesn't consider word order or semantic relationships
- Treats all words equally regardless of importance
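The precision, recall, and F1 steps above map directly onto a few lines of Python. This is a sketch under simple assumptions: the `token_f1` name is hypothetical, tokens come from a lowercase whitespace split, and repeated tokens are counted as a multiset.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection: a repeated token counts at most as often
    # as it appears in both texts.
    common = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(cand_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For a reference of "paris" and a candidate of "paris is the capital", precision is 1/4 and recall is 1, giving an F1 of 0.4. Reordering the words of a text leaves the score unchanged.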
Purpose: Advanced metric considering stemmed matches and word order.
How it works:
- Matches words based on stems (handles word variations)
- Considers word order through alignment
- Penalizes differences in word order
- Uses Spanish stemming for language-specific accuracy
Score Range: 0.0 to 1.0
- 1.0: Perfect match with optimal word order
- 0.8 to <1.0: Excellent content with good structure
- 0.6 to <0.8: Good content, some structural differences
- 0.4 to <0.6: Moderate match, notable differences
- 0.0 to <0.4: Poor match, very different content or structure
Best for:
- Comprehensive evaluation considering both content and structure
- Handling word variations through stemming
- More nuanced evaluation than simple word overlap
Limitations:
- WordNet synonym matching doesn't work for Spanish (falls back to stem matching)
- More complex and computationally intensive
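To show how the word order penalty enters the score, the sketch below computes a METEOR-style value from alignment statistics that are assumed to be already known: the number of matched words and the number of contiguous chunks they form. It uses the parameters of the original METEOR formulation (recall weighted 9:1, penalty of 0.5 times the cubed fragmentation ratio); the tool's own configuration may differ, and the function name is hypothetical.

```python
def meteor_from_alignment(matches: int, chunks: int, cand_len: int, ref_len: int) -> float:
    if matches == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean: recall counts nine times as much as precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: fewer, longer chunks indicate better preserved word order.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

With four matched words forming a single chunk in two four-word texts, this formula yields 0.9921875; if the same four words are scattered into four separate chunks, the penalty reaches 0.5 and the score halves.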
- High scores across all metrics: Excellent response quality
- High Cosine + ROUGE, low Exact Match: Good semantic match with different wording
- High Token F1, low ROUGE: Good vocabulary but poor word order
- High Exact Match rate across a dataset, other metrics varying: Many responses reproduce the ground truth verbatim, while the remaining responses differ in wording or structure
- For Factual Q&A: Prioritize Exact Match and Token F1
- For Summarization: Focus on ROUGE-L and METEOR
- For Semantic Search: Emphasize Cosine Similarity and METEOR
- For Comprehensive Evaluation: Use all metrics and analyze patterns
- Excellent: Average score > 0.8
- Good: Average score 0.6-0.8
- Acceptable: Average score 0.4-0.6
- Poor: Average score < 0.4
Remember that the ideal threshold depends on your specific use case, domain, and quality requirements.
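The bands above can be applied mechanically once the per-metric scores are available. The sketch below is one possible reading: the `quality_label` name is hypothetical, and assigning the shared boundaries 0.6 and 0.4 to the higher band is an assumption, since the list does not say which side they belong to.

```python
def quality_label(scores: dict[str, float]) -> tuple[float, str]:
    if not scores:
        raise ValueError("at least one metric score is required")
    average = sum(scores.values()) / len(scores)
    if average > 0.8:
        label = "Excellent"
    elif average >= 0.6:  # assumption: boundary values fall in the higher band
        label = "Good"
    elif average >= 0.4:
        label = "Acceptable"
    else:
        label = "Poor"
    return average, label
```

Because Exact Match is binary, a single 0.0 pulls the average down sharply, which is one reason to read the label alongside the individual metrics.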
- Clone this repository:
    git clone <repository-url>
    cd rag_evaluation
- Create a virtual environment and activate it:
    python -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate
- Install the required packages:
    pip install -r requirements.txt
- Install Playwright browsers:
    playwright install

- Prepare your input CSV file with queries and ground truth answers. The file should have two columns:
  - query: The question to ask
  - ground_truth: The expected answer
  Example:
    query,ground_truth
    "What is the capital of France?","Paris is the capital city of France."
- Run the evaluation script:
    python main.py --url "https://your-rag-system.com" \
        --input "data/queries.csv" \
        --output "data/results.csv" \
        --input-selector "#query-input" \
        --submit-selector "#submit-button"

Example against a local server:

    python main.py --url "http://localhost:8000/" --input "data/sample_queries.csv" --output "data/results.csv" --input-selector "#query-input" --submit-selector "#submit-button"

Arguments:
- --url: URL of the RAG system's web interface
- --input: Path to the input CSV file with queries and ground truth
- --output: Path where the results CSV is saved
- --input-selector: CSS selector for the query input field
- --submit-selector: CSS selector for the submit button
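Before starting a long browser run, it can help to check that the input file has the two expected columns. The sketch below is a hypothetical pre-flight check, not part of the tool: `load_queries` is an illustrative name, and rejecting empty queries is an assumption about what counts as valid input.

```python
import csv
import io


def load_queries(handle) -> list[dict[str, str]]:
    reader = csv.DictReader(handle)
    required = {"query", "ground_truth"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Input CSV is missing columns: {sorted(missing)}")
    rows = []
    for row_no, row in enumerate(reader, start=1):
        query = (row["query"] or "").strip()
        if not query:
            raise ValueError(f"Empty query in data row {row_no}")
        rows.append({"query": query, "ground_truth": (row["ground_truth"] or "").strip()})
    return rows


# Usage with the example file content shown earlier in this section.
sample = io.StringIO(
    "query,ground_truth\n"
    '"What is the capital of France?","Paris is the capital city of France."\n'
)
print(load_queries(sample))
```

For a file on disk, pass an open handle instead, for example `open("data/queries.csv", newline="", encoding="utf-8")`.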
To add a new scoring method:
- Create a new class in src/scorers/scorers.py that inherits from BaseScorer
- Implement the required methods:
  - calculate_score(self, reference: str, candidate: str) -> float
  - get_score_name(self) -> str
Example:
    from .base_scorer import BaseScorer


    class MyNewScorer(BaseScorer):
        def calculate_score(self, reference: str, candidate: str) -> float:
            """
            Calculate your custom score between reference and candidate.

            Args:
                reference: Ground truth text
                candidate: Generated response text

            Returns:
                float: Score between 0.0 and 1.0
            """
            # Implement your scoring logic here
            # Example: Simple character-based similarity
            if not reference or not candidate:
                return 0.0
            # Your custom logic here
            score = len(set(reference) & set(candidate)) / len(set(reference) | set(candidate))
            return score

        def get_score_name(self) -> str:
            return "my_new_score"

- Add your scorer to the evaluation manager. The current implementation includes all available scorers by default.
- Return normalized scores: Always return values between 0.0 and 1.0
- Handle edge cases: Check for empty strings, None values, etc.
- Document limitations: Add clear docstrings explaining when to use your scorer
- Consider language: If working with Spanish text, use appropriate tokenization
- Performance: Cache expensive operations if the scorer will be used repeatedly
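As an illustration of these practices together, here is a hypothetical scorer that returns a normalized value, guards against empty and None inputs, documents its limitation, and caches tokenization. The `BaseScorer` defined here is only a stand-in so the sketch runs on its own; in the project you would import the real one, whose exact definition is not shown in this document.

```python
from abc import ABC, abstractmethod
from functools import lru_cache


class BaseScorer(ABC):
    # Stand-in (assumption) for the project's BaseScorer interface.
    @abstractmethod
    def calculate_score(self, reference: str, candidate: str) -> float: ...

    @abstractmethod
    def get_score_name(self) -> str: ...


@lru_cache(maxsize=4096)
def _token_set(text: str) -> frozenset[str]:
    # Cached because the same text may be tokenized many times in one run.
    return frozenset(text.lower().split())


class JaccardTokenScorer(BaseScorer):
    """Hypothetical scorer: Jaccard overlap of lowercase token sets.

    Limitation: ignores word order and repeated tokens.
    """

    def calculate_score(self, reference: str, candidate: str) -> float:
        if not reference or not candidate:  # covers None and empty strings
            return 0.0
        ref, cand = _token_set(reference), _token_set(candidate)
        union = ref | cand
        if not union:  # both inputs were whitespace only
            return 0.0
        return len(ref & cand) / len(union)  # always within 0.0 to 1.0

    def get_score_name(self) -> str:
        return "jaccard_token"
```

For Spanish text, the whitespace split in `_token_set` could be replaced by whatever tokenizer and stemmer the other scorers use.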
The tool generates a CSV file with the following columns:
- Original columns from input file (query, ground_truth)
- actual_response: The response received from the RAG system
- Individual score columns:
  - cosine_similarity: TF-IDF based semantic similarity (0.0-1.0)
  - rouge_l_score: ROUGE-L F-measure for text overlap (0.0-1.0)
  - exact_match: Binary exact match after normalization (0.0 or 1.0)
  - token_f1: Token-level F1 score (0.0-1.0)
  - meteor_score: METEOR score with Spanish stemming (0.0-1.0)
- average_score: Average of all scoring metrics
Example output:

    query,ground_truth,actual_response,cosine_similarity,rouge_l_score,exact_match,token_f1,meteor_score,average_score
    "What is the capital of France?","Paris","Paris is the capital city of France.",0.85,0.92,0.0,0.75,0.88,0.68

- Individual Metrics: Use specific scores to understand different aspects of response quality
- Average Score: Provides overall assessment but consider individual metrics for detailed analysis
- Patterns: Look for consistent strengths/weaknesses across different types of queries
- Outliers: Investigate cases where metrics disagree significantly
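One way to find the outliers mentioned above is to compute the spread between the highest and lowest continuous metric for each row of the results file. This is a sketch, not part of the tool: the function name and the 0.3 threshold are arbitrary assumptions, and `exact_match` is left out because a binary value would dominate the spread.

```python
import csv

# Column names as listed in the output format above; exact_match is excluded.
METRICS = ["cosine_similarity", "rouge_l_score", "token_f1", "meteor_score"]


def flag_disagreements(handle, max_spread: float = 0.3) -> list[tuple[str, float]]:
    flagged = []
    for row in csv.DictReader(handle):
        values = [float(row[m]) for m in METRICS]
        spread = max(values) - min(values)
        if spread > max_spread:
            flagged.append((row["query"], round(spread, 2)))
    return flagged
```

Call it with an open handle on the results file, for example `open("data/results.csv", newline="", encoding="utf-8")`, and review the returned queries by hand.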
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.