- arXiv preprint is coming soon.
Model checkpoints are available on Hugging Face:
- Qwen2.5-Osiris-0.5B-Instruct: `judgmentlabs/Qwen2.5-Osiris-0.5B-Instruct`
- Qwen2.5-Osiris-1.5B-Instruct: `judgmentlabs/Qwen2.5-Osiris-1.5B-Instruct`
- Qwen2.5-Osiris-3B-Instruct: `judgmentlabs/Qwen2.5-Osiris-3B-Instruct`
- Qwen2.5-Osiris-7B-Instruct: `judgmentlabs/Qwen2.5-Osiris-7B-Instruct`
📊 The dataset is available here.
```python
from src.data.perturb_musique import DatasetPerturbator

perturbator = DatasetPerturbator(
    dataset_path="/path/to/your/dataset.jsonl",
    output_dir="/path/to/save/perturbed/dataset",
)
perturbator.perturb()
```
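The perturbator reads and writes JSONL (one JSON object per line), so the output directory can be sanity-checked with a few lines of Python. A minimal sketch, assuming only the JSONL layout; the field names below are placeholders, not the dataset's actual schema:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Load a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a tiny stand-in file; the real perturbed dataset uses the same
# one-object-per-line layout, though its fields will differ.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "perturbed.jsonl"
    rows = [
        {"id": "ex-0", "question": "Who wrote Hamlet?", "perturbed": True},
        {"id": "ex-1", "question": "Capital of France?", "perturbed": False},
    ]
    sample.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")
    loaded = load_jsonl(sample)
    print(len(loaded), loaded[0]["id"])  # → 2 ex-0
```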
```bash
# Navigate to the evaluation directory
cd src/data/evaluation

# Run the RAGTruth benchmark on the models defined in this script.
# This evaluates how well models detect hallucinations in RAG contexts.
bash ragtruth_predict.sh

# If necessary, format the RAGTruth benchmark results into structured JSON files.
# This prepares the data for analysis and visualization.
bash format_predictions.sh

# Calculate and display evaluation metrics:
# Recall, Precision, and F1 scores for hallucination detection.
python show_results.py
```
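The Recall, Precision, and F1 numbers printed by `show_results.py` follow the standard binary-detection formulation. A minimal sketch of that arithmetic (the 0/1 label encoding here is an assumption for illustration, not the repo's actual output schema):

```python
def detection_metrics(gold, pred):
    """Precision, Recall, and F1 for binary labels (1 = hallucinated)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))  # true positives
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))  # false positives
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One false positive and one false negative out of four examples:
p, r, f1 = detection_metrics(gold=[1, 0, 1, 1], pred=[1, 1, 0, 1])
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.667 0.667 0.667
```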
We used LLaMA-Factory for efficient fine-tuning. A sample configuration can be found here.
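For orientation, a LoRA SFT run in LLaMA-Factory is typically driven by a YAML file like the sketch below. Every value here is illustrative — the base model, dataset name, and hyperparameters are assumptions, not the configuration used for the released checkpoints:

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-0.5B-Instruct  # placeholder base model

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: my_hallucination_sft  # hypothetical entry registered in dataset_info.json
template: qwen
cutoff_len: 2048

### train
output_dir: saves/qwen2.5-osiris-0.5b
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true
```

Such a file is usually launched with `llamafactory-cli train <config>.yaml`; see the repository's actual sample configuration for the exact settings.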