An update and extension of the MT-Telescope tool, developed within the scope of work by Unbabel, INESC-ID and Instituto Superior Técnico.

NLP-Telescope

Table of contents

  1. Introduction

  2. Requirements for the NLP systems evaluation

    1. Machine Translation
    2. Dialogue System
    3. Summarization
    4. Classification
  3. Installation

  4. Web Interface

  5. Command Line Interface (CLI)

    1. Comparing NLG systems
    2. Comparing Classification systems
    3. Scoring
    4. Comparing two MT systems

Introduction:

NLP-Telescope is a comparative analysis tool which is an updated and extended version of MT-Telescope (Rei et al., 2021). Like MT-Telescope, it aims to help researchers and developers analyse their systems by offering features such as:

  • System-level metrics that evaluate the systems' outputs globally;
  • Segment-level metrics that evaluate the systems' outputs segment by segment;
  • Dynamic Corpus Filtering to filter your test set for specific linguistic phenomena such as named entities;
  • Visual interactive plots (important for comparing systems side by side and segment by segment);
  • Statistical tests, for which the tool uses bootstrap resampling (Koehn, 2004).

NLP-Telescope also offers new features compared to MT-Telescope, such as:

  • Analysing and comparing the results of N systems against M references, where N and M are both greater than or equal to 1. This extends MT-Telescope, which analyses only two systems against one reference;

  • Being able to analyse four Natural Language Processing (NLP) tasks: machine translation, text summarization, dialogue system and text classification. For each task, appropriate visual analysis interfaces, metrics and filters are provided. This extends MT-Telescope, which analyses machine translation systems only;

  • Analysing and evaluating gender bias by comparing the references with the systems' outputs. Only available for machine translation;

  • Being able to rank the systems through an aggregation mechanism that combines all requested metrics;

  • Being able to rename systems (so that plots show the names the user wants), either by uploading a file in which each line is a system name or directly in the web browser;

  • Being able to download plots and tables.

For Natural Language Generation (NLG) tasks, namely machine translation, text summarization and dialogue system, we have three types of visual interface:

  • Error-type analysis: to evaluate system utility, the tool divides the errors into four parts shown in a stacked bar plot. Only available if COMET or BERTScore is selected as the segment-level metric;

  • Segment-level scores histogram: with a histogram plot, one may observe the overall distribution of segment-level scores across systems;

  • Pairwise comparison: the user may choose two of the systems to analyse. It is useful when the differences between two systems are not obvious. This type of comparison is composed of:

    • Segment-level comparison (a bubble plot lets the user compare the two systems' sentence scores through their differences);
    • Bootstrap resampling.

For text summarization and dialogue system, we also have a functionality named Similar Source Sentences, in which the user may see the 10 source sentences that are most similar to sentences in the systems' outputs. Similarity is determined through the maximum and minimum values calculated using cosine similarity.

For the classification task, we have the following visual interfaces:

  • Confusion Matrix of a system;
  • Confusion Matrix of a system focused on one label;
  • Scores of each label for each system through the stacked bar plot;
  • Examples that are incorrectly labelled.

For all tasks and for each reference, the tool offers a table and a plot with the systems' metric scores.

For each task, the tool contains a set of metrics and filters. The user may change these sets in the files metrics.yaml and filters.yaml, located in the user folder. The same applies to the bias evaluations and the universal metrics (to which the user may add more weighted means with different weights).
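
For instance, assuming the user folder sits at the root of the cloned repository (see the Installation section below), the configuration files can be edited with any text editor before launching the tool; the exact YAML keys are defined by the files shipped with the tool:

# Adjust the per-task metric and filter sets (structure defined by the shipped files).
nano user/metrics.yaml
nano user/filters.yaml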

In this document, we will explain the requirements and how to install and run NLP-Telescope. To run the NLP-Telescope tool you may use:

  1. the web browser;
  2. the command line interface.

Requirements for the evaluation of NLP systems:

For the evaluation of systems, three types of files are required: an input file, a reference file and an output file. Each file type may have a specific name depending on the task.

The tool treats a file's text as a set of segments (each of which may be a sentence), organized one per line. The required number of segments per file type changes according to the task.

In this section, we describe the specific names of the file types, their contents and size requirements, and point to examples for each task.

Machine Translation:

The following files are required:

  1. Source file: File that contains text written in one language (the source language) that will be translated into another language (the target language);

  2. One or more reference files: File(s) that contain(s) the text used as the point of reference for the systems' outputs. A reference may be a human translation or the "correct" translation;

  3. One or more output files: File(s) that contain(s) the outputs of the systems, which are texts translated from the source.

All files must have the same number of segments.
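
As a minimal illustration (hypothetical file names and contents; the same one-segment-per-line layout applies to the other tasks), an English-to-German test set with three segments and one system could be built like this:

# Hypothetical three-segment test set: every file has exactly three lines,
# and line i of each file refers to the same segment.
printf 'Good morning.\nHow are you?\nSee you soon.\n' > src.en.txt
printf 'Guten Morgen.\nWie geht es dir?\nBis bald.\n' > ref.de.txt
printf 'Guten Morgen.\nWie geht es Ihnen?\nBis bald.\n' > system-a.de.txt

wc -l src.en.txt ref.de.txt system-a.de.txt   # all counts must match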

You may see examples of files here

You must also indicate the language in which the texts are written. In the web interface, you must indicate the language pair of the files, for instance 'en-ru', in which 'en' is the source language and 'ru' is the target language. In the command line interface, you only need to indicate the target language. If the language does not matter and the BERTScore metric is not used, indicate X-X.

You may also upload a file in which each line is a system name. The number of lines must be equal to the number of systems, and the order must match the order in which the systems were uploaded.
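
For example (hypothetical names), a names file for three systems uploaded in the order system-x, system-y, system-z could be created as follows:

# One name per line, in the same order as the uploaded system outputs.
printf 'Baseline\nFine-tuned\nEnsemble\n' > system_names.txt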

Dialogue System:

The following files are required:

  1. Context file: File that contains the context of the conversation between the user and the system;

  2. One or more truth answers files: File(s) that contain(s) the correct answers that the system should have given;

  3. One or more systems answers files: File(s) that contain(s) the answers that the system actually gave.

The truth answers and the systems answers must have the same number of segments.

You can see examples of files here.

You must also indicate the language in which the texts are written. If the language does not matter and the BERTScore metric is not used, indicate X.

You may also upload a file in which each line is a system name. The number of lines must be equal to the number of systems, and the order must match the order in which the systems were uploaded.

Summarization:

The following files are required:

  1. Text to be summarized file: File that contains the text that will be summarized by the system;

  2. One or more references files: File(s) that contain(s) the text that will be the point of reference for the outputs of systems;

  3. One or more systems summaries files: File(s) that contain(s) the summary produced by the system.

The reference and the systems summaries must have the same number of segments.

You can see examples of files here.

You must also indicate the language in which the texts are written. If the language does not matter and the BERTScore metric is not used, indicate X.

You may also upload a file in which each line is a system name. The number of lines must be equal to the number of systems, and the order must match the order in which the systems were uploaded.

Classification:

The following files are required:

  1. Samples file: File that contains the samples;

  2. One or more true labels files: File(s) that contain(s) the true label of each sample;

  3. One or more predicted labels files: File(s) that contain(s) the predicted labels of each sample;

  4. One file with all available labels: File that contains all available labels. Each line is one label.

The samples file, true labels files and predicted labels files must have the same number of segments.
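
A small hypothetical sentiment-classification example with three samples, one system and three available labels:

# Hypothetical files: one sample, label or prediction per line.
printf 'I loved it.\nTerrible service.\nIt was fine.\n' > samples.txt
printf 'positive\nnegative\nneutral\n' > true_labels.txt
printf 'positive\nnegative\npositive\n' > predicted_labels.txt
printf 'positive\nnegative\nneutral\n' > labels.txt   # all available labels, one per line

wc -l samples.txt true_labels.txt predicted_labels.txt   # these counts must match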

You can see examples of files here.

You must also indicate the language in which the texts are written. If the language does not matter, indicate X.

You may also upload a file in which each line is a system name. The number of lines must be equal to the number of systems, and the order must match the order in which the systems were uploaded.

Installation:

Create a virtual environment. Run:

python3 -m venv NLP-ENV

Activate the virtual environment. Run:

source NLP-ENV/bin/activate

Make sure you have poetry installed, then run the following commands:

git clone https://github.com/RitaTMCO/NLP-Telescope
cd NLP-Telescope
poetry install --without dev

Finally, run the following commands:

chmod +x download.sh 
./download.sh
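
As a quick sanity check (assuming the previous steps completed without errors), the telescope command should now be available inside the virtual environment and list its sub-commands:

# Should print the available sub-commands (streamlit, score, compare, n-compare-nlg, ...).
telescope --help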

Before running the tool:

Some metrics, such as COMET, may take some time to run. You can switch COMET to a more lightweight model with the following environment variable:

export COMET_MODEL=wmt21-cometinho-da

Web Interface:

Run the following command to display the web interface:

telescope streamlit

While the web interface is running, a folder called downloaded_data is created within the user folder. Tables and plots exported from the web interface are saved to downloaded_data.

Command Line Interface (CLI):

Comparing NLG systems:

Run the command telescope n-compare-nlg to compare NLG systems with the CLI.

Usage: telescope n-compare-nlg [OPTIONS]

Options:
  -s, --source FILENAME           Source segments.  [required]
  -c, --system_output FILENAME    System candidate. This option can be
                                  multiple.  [required]
  -r, --reference FILENAME        Reference segments. This option can be
                                  multiple.  [required]
  -t, --task [machine-translation|dialogue-system|summarization]
                                  NLG to evaluate.  [required]
  -l, --language TEXT             Language of the evaluated text.  [required]
  -m, --metric [COMET|BLEU|chrF|ZeroEdit|BERTScore|TER|GLEU|ROUGE-1|ROUGE-2|ROUGE-L|Accuracy|Precision|Recall|F1-score]
                                  Metric to run. This option can be multiple.
                                  
                                  |machine-translation|: [COMET, BLEU, chrF,
                                  ZeroEdit, TER, GLEU, BERTScore].
                                  
                                  |summarization|: [ROUGE-1, ROUGE-2, ROUGE-L,
                                  BERTScore].
                                  
                                  |dialogue-system|: [BLEU, ROUGE-1, ROUGE-2,
                                  ROUGE-L, BERTScore].  [required]
  -f, --filter [named-entities|length|remove-duplicates]
                                  Filter to run. This option can be multiple.
                                  
                                  |machine-translation|: [remove-duplicates,
                                  length, named-entities].
                                  
                                  |summarization|: [remove-duplicates, length,
                                  named-entities].
                                  
                                  |dialogue-system|: [remove-duplicates,
                                  length, named-entities].
  --length_min_val FLOAT          Min interval value for length filtering.
  --length_max_val FLOAT          Max interval value for length filtering.
  --seg_metric [COMET|ZeroEdit|BERTScore|GLEU|ROUGE-1|ROUGE-2|ROUGE-L|Accuracy|Precision|Recall|F1-score]
                                  Segment-level metric to use for segment-
                                  level analysis.
  -o, --output_folder TEXT        Folder in which you wish to save plots.
  --bootstrap                     Available for machine-translation.
  -x, --system_x FILENAME         System X NLG outputs for segment-level
                                  comparison and bootstrap resampling.
  -y, --system_y FILENAME         System Y NLG outputs for segment-level
                                  comparison and bootstrap resampling.
  --num_splits INTEGER            Number of random partitions used in
                                  Bootstrap resampling.
  --sample_ratio FLOAT            Proportion (P) of the initial sample.
  -n, --systems_names FILENAME    File that contains the names of the systems
                                  per line.
  -b, --bias_evaluations [Gender]
                                  Bias Evaluation. This option can be
                                  multiple.
                                  
                                  |machine-translation|: [Gender].
                                  
                                  |summarization|: [].
                                  
                                  |dialogue-system|: [].
  --option_gender_bias_evaluation [dictionary-based approach|linguistic approach|hybrid approach]
                                  Options for Gender Bias Evaluation.
  -u, --universal_metric [average|median|pairwise-comparison|social-choice-theory|weighted-sum-1000|weighted-sum-3000|weighted-sum-5000|weighted-mean-1000|weighted-mean-3000|weighted-mean-5000]
                                  Models Rankings from Universal Metric.
                                  
                                  |machine-translation|: [average, median,
                                  pairwise-comparison, social-choice-theory,
                                  weighted-sum-1000, weighted-mean-1000,
                                  weighted-sum-3000, weighted-mean-3000,
                                  weighted-sum-5000, weighted-mean-5000].
                                  
                                  |summarization|: [].
                                  
                                  |dialogue-system|: [].
  --help                          Show this message and exit.

Example 1: Running several metrics

Running BLEU, chrF, BERTScore and COMET to compare three MT systems with two references:

telescope n-compare-nlg \
  -s path/to/src/file.txt \
  -c path/to/system-x/file.txt \
  -c path/to/system-y/file.txt \
  -c path/to/system-z/file.txt \
  -r path/to/ref-1/file.txt \
  -r path/to/ref-2/file.txt \
  -t machine-translation \
  -l en \
  -m BLEU -m chrF -m BERTScore -m COMET

Example 2: Saving a comparison report

telescope n-compare-nlg \
  -s path/to/src/file.txt \
  -c path/to/system-x/file.txt \
  -c path/to/system-y/file.txt \
  -c path/to/system-z/file.txt \
  -r path/to/ref-1/file.txt \
  -r path/to/ref-2/file.txt \
  -t machine-translation \
  -l en \
  -m BLEU -m chrF -m BERTScore -m COMET \
  --output_folder FOLDER-PATH

Inside FOLDER-PATH, a folder is created for each reference, containing the corresponding report.
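
Example 3: Pairwise comparison with bootstrap resampling

A sketch that combines the pairwise options listed above (paths are placeholders and the values of --num_splits and --sample_ratio are purely illustrative), comparing system X and system Y out of the three systems:

# Hypothetical invocation: segment-level comparison and bootstrap resampling between systems X and Y.
telescope n-compare-nlg \
  -s path/to/src/file.txt \
  -c path/to/system-x/file.txt \
  -c path/to/system-y/file.txt \
  -c path/to/system-z/file.txt \
  -r path/to/ref-1/file.txt \
  -t machine-translation \
  -l en \
  -m BLEU -m COMET \
  --seg_metric COMET \
  --bootstrap \
  -x path/to/system-x/file.txt \
  -y path/to/system-y/file.txt \
  --num_splits 300 \
  --sample_ratio 0.5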

Comparing Classification systems:

Run the command telescope n-compare-classification to compare classification systems with the CLI.

Usage: telescope n-compare-classification [OPTIONS]

Options:
  -s, --source FILENAME           Source segments.  [required]
  -c, --system_output FILENAME    System candidate. This option can be
                                  multiple.  [required]
  -r, --reference FILENAME        Reference segments. This option can be
                                  multiple.  [required]
  --labels FILENAME               Existing labels  [required]
  -l, --language TEXT             Language of the evaluated text.  [required]
  -m, --metric [Accuracy|Precision|Recall|F1-score]
                                  Metric to run. This option can be multiple.
                                  [required]
  -f, --filter [remove-duplicates|length|named-entities]
                                  Filter to run. This option can be multiple.
  --seg_metric [Accuracy|Precision|Recall|F1-score]
                                  Segment-level metric to use for segment-
                                  level analysis.
  -o, --output_folder TEXT        Folder in which you wish to save plots.
  -n, --systems_names FILENAME    File that contains the names of the systems
                                  per line.
  -u, --universal_metric []       Models Rankings from Universal Metric.
  -x, --system_x FILENAME         System X outputs for pairwise-comparison
  -y, --system_y FILENAME         System Y outputs for pairwise-comparison.
  --help                          Show this message and exit.

Example 1: Running two metrics

Running Accuracy and F1-score to compare three systems with two references:

telescope n-compare-classification \
  -s path/to/input/file.txt \
  -c path/to/system-x/file.txt \
  -c path/to/system-y/file.txt \
  -c path/to/system-z/file.txt \
  -r path/to/ref-1/file.txt \
  -r path/to/ref-2/file.txt \
  --labels path/to/all_labels.txt \
  -l en \
  -m Accuracy -m F1-score

Example 2: Saving a comparison report

telescope n-compare-classification \
  -s path/to/input/file.txt \
  -c path/to/system-x/file.txt \
  -c path/to/system-y/file.txt \
  -c path/to/system-z/file.txt \
  -r path/to/ref-1/file.txt \
  -r path/to/ref-2/file.txt \
  --labels path/to/all_labels.txt \
  -m Accuracy -m F1-score \
  -l en \
  --output_folder FOLDER-PATH

Inside FOLDER-PATH, a folder is created for each reference, containing the corresponding report.

Scoring:

To get the system-level scores for a particular MT system, simply run telescope score.

telescope score -s {path/to/sources} -t {path/to/translations} -r {path/to/references} -l {target_language} -m COMET -m chrF

Comparing two MT systems:

To compare two MT systems with the CLI, use the telescope compare command.

Usage: telescope compare [OPTIONS]

Options:
  -s, --source FILENAME           Source segments.  [required]
  -x, --system_x FILENAME         System X MT outputs.  [required]
  -y, --system_y FILENAME         System Y MT outputs.  [required]
  -r, --reference FILENAME        Reference segments.  [required]
  -l, --language TEXT             Language of the evaluated text.  [required]
  -m, --metric [COMET|BLEU|chrF|ZeroEdit|TER|GLEU|BERTScore]
                                  MT metric to run.  [required]
  -f, --filter [named-entities|length|remove-duplicates]
                                  Filter to run.
  --length_min_val FLOAT          Min interval value for length filtering.
  --length_max_val FLOAT          Max interval value for length filtering.
  --seg_metric [COMET|ZeroEdit|GLEU|BERTScore]
                                  Segment-level metric to use for segment-
                                  level analysis.
  -o, --output_folder TEXT        Folder in which you wish to save plots.
  --bootstrap
  --num_splits INTEGER            Number of random partitions used in
                                  Bootstrap resampling.
  --sample_ratio FLOAT            Proportion (P) of the initial sample.
  --help                          Show this message and exit.
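
Example: Comparing two MT systems and saving the report

A sketch mirroring the option list above (paths are placeholders):

# Hypothetical invocation of telescope compare; paths are placeholders.
telescope compare \
  -s path/to/src/file.txt \
  -x path/to/system-x/file.txt \
  -y path/to/system-y/file.txt \
  -r path/to/ref/file.txt \
  -l en \
  -m COMET -m chrF \
  --seg_metric COMET \
  --output_folder FOLDER-PATH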
