reliability-checklist is a Python framework (available via CLI) for comprehensively evaluating the reliability of NLP systems.

reliability-checklist accepts any model and dataset as input and facilitates comprehensive evaluation of a wide range of reliability-related aspects such as accuracy, selective prediction, novelty detection, stability, sensitivity, and calibration.
✅ No coding needed
Pre-defined templates are available so you can integrate your models/datasets via the command line only.
✅ Bring Your Own Model (BYoM)
Your model template is missing? We have you covered: check out BYoM to create your own model-specific config file.
✅ Bring Your Own Data (BYoD)
Your dataset template is missing? Check out BYoD to create your own dataset-specific config file.
✅ Reliability metrics
Currently, we support a number of reliability-related aspects:
- Accuracy/F1/Precision/Recall
- Calibration: Reliability Diagram, Expected Calibration Error (ECE), Expected Overconfidence Error (EOE); a standard ECE definition is sketched after this list
- Selective Prediction: Risk-Coverage Curve (RCC), AUC of risk-coverage curve
- Sensitivity
- Stability
- Out-of-Distribution
- Adversarial Attack: model-in-the-loop adversarial attacks to evaluate a model's robustness.
- Task-Specific Augmentations: task-specific augmentations to check reliability on augmented inputs.
- Novelty
- Other Measures: We plan to incorporate other measures such as bias, fairness, toxicity, and faithfulness of models. We also plan to measure the reliability of generative models on crucial parameters such as hallucinations.
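For orientation only, here is the standard textbook definition of the Expected Calibration Error listed under Calibration; it is not necessarily the framework's exact implementation. Predictions are grouped into $M$ confidence bins and the per-bin gap between accuracy and confidence is averaged:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$$

where $B_m$ is the set of predictions whose confidence falls into the $m$-th bin, $n$ is the total number of predictions, $\mathrm{acc}(B_m)$ is the accuracy within the bin, and $\mathrm{conf}(B_m)$ is the mean predicted confidence within it.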
✅ Want to integrate more features?
Our easy-to-extend infrastructure allows developers to seamlessly contribute models, datasets, augmentations, and evaluation metrics to the workflow.
# install reliability-checklist from GitHub
pip install git+https://github.com/Maitreyapatel/reliability-checklist
# download the spaCy English model
python -m spacy download en_core_web_sm
# download the NLTK WordNet data
python -c "import nltk;nltk.download('wordnet')"
Evaluate the example model/data with the default configuration
# eval on CPU
recheck
# eval on GPU
recheck trainer=gpu +trainer.gpus=[1,2,3]
Evaluate a model with a chosen dataset-specific experiment configuration from reliability_checklist/configs/task/
recheck task=<task_name>
Specify a custom model_name as shown in the following MNLI example
# if the same model_name is used for the tokenizer as well
recheck task=mnli custom_model="bert-base-uncased-mnli"
# if the tokenizer uses a different model_name
recheck task=mnli custom_model="bert-base-uncased-mnli" custom_model.tokenizer.model_name="ishan/bert-base-uncased-mnli"
# create a config folder structure similar to reliability_checklist/configs/
mkdir ./configs/
mkdir ./configs/custom_model/
# run the following command after creating a new config file at ./configs/custom_model/<your-config>.yaml
recheck task=mnli custom_model=<your-config>
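As a rough, hypothetical sketch of that step (the actual key names should be copied from an existing template under reliability_checklist/configs/custom_model/; the keys and file name below are only inferred from the CLI overrides shown above):

# hypothetical config sketch -- adapt an existing template for the real schema
cat > ./configs/custom_model/my_model.yaml <<'EOF'
model_name: "ishan/bert-base-uncased-mnli"   # assumed key, mirroring the custom_model override above
tokenizer:
  model_name: "ishan/bert-base-uncased-mnli" # assumed key, mirroring custom_model.tokenizer.model_name
EOF
# then point recheck at the new config entry
recheck task=mnli custom_model=my_model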
reliability-checklist supports a wide range of visualization tools. You can go with the default wandb online visualizer; it also generates highly informative plots that are stored in the logs directory.
Any kind of positive contribution is welcome! Please help us grow by contributing to the project.
If you wish to contribute, you can work on any of the features/issues listed here or create one of your own. After adding your code, please send us a Pull Request.
Please read CONTRIBUTING for details on our CODE OF CONDUCT, and the process for submitting pull requests to us.