This repository contains the code for pretraining BART-IT, an efficient and accurate sequence-to-sequence model for the Italian language.
As pointed out by the IT5 co-author (@gsarti_, thanks!), the IT5 model compared in the paper was not trained with multi-task learning but with the regular span-masking objective (as adopted in newer versions of T5).
The code for training the tokenizer is self-contained in the `train_tokenizer.py` script. The tokenizer is trained on the Italian portion of the large mC4 corpus, is based on the BPE algorithm, and is built using the `tokenizers` library.
The following parameters are used to train the tokenizer:

- `vocab_size`: 52,000
- `min_frequency`: 10
- `special_tokens`: `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`

The tokenizer is saved in the `tokenizer_bart_it` folder.
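A minimal sketch of how these settings map onto the `tokenizers` API is shown below (the corpus path is a placeholder; the actual mC4 loading and preprocessing live in `train_tokenizer.py`):

```python
# Hedged sketch: train a byte-level BPE tokenizer with the parameters listed above.
# The corpus file is a placeholder; the real train_tokenizer.py handles mC4 loading.
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["mc4_it.txt"],  # placeholder path to the Italian mC4 text
    vocab_size=52_000,
    min_frequency=10,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)

os.makedirs("tokenizer_bart_it", exist_ok=True)
tokenizer.save_model("tokenizer_bart_it")
```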
The main script for pretraining the model is `pretrain_base.py`. The model is trained following the same denoising pretraining strategy used for BART. Model parameters are reported in the table below.
Parameter | Value |
---|---|
VOCAB_SIZE | 52,000 |
MAX_POSITION_EMBEDDINGS | 1,024 |
ENCODER_LAYERS | 6 |
ENCODER_FFN_DIM | 3,072 |
ENCODER_ATTENTION_HEADS | 12 |
DECODER_LAYERS | 6 |
DECODER_FFN_DIM | 3,072 |
DECODER_ATTENTION_HEADS | 12 |
D_MODEL | 768 |
DROPOUT | 0.1 |
The model is trained on 2 NVIDIA RTX A6000 GPUs for a total of 1.7 million steps. The pre-trained model is released for the community on the HuggingFace Hub - BART-IT.
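For reference, the parameters in the table above map directly onto a Hugging Face `BartConfig`; the snippet below is an illustrative sketch, not the exact code from `pretrain_base.py`:

```python
# Hedged sketch: instantiate a randomly initialised BART model with the
# hyper-parameters reported in the table above.
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=52_000,
    max_position_embeddings=1_024,
    encoder_layers=6,
    encoder_ffn_dim=3_072,
    encoder_attention_heads=12,
    decoder_layers=6,
    decoder_ffn_dim=3_072,
    decoder_attention_heads=12,
    d_model=768,
    dropout=0.1,
)
model = BartForConditionalGeneration(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```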
The model is fine-tuned on the abstractive summarization task using the parameters reported in the table below.
Parameter | Value |
---|---|
MAX_NUM_EPOCHS | 10 |
BATCH_SIZE | 32 |
LEARNING_RATE | 1e-5 |
MAX_INPUT_LENGTH | 1024 |
MAX_TARGET_LENGTH | 128 |
For more information about the model parameters, please refer to the `summarization/finetune_summarization.py` script and to the following paper.
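As a rough illustration of these hyper-parameters, a sketch using the `transformers` Seq2Seq training utilities is shown below (the Hub identifier and dataset column names are assumptions; the released script may be organised differently):

```python
# Hedged sketch of the fine-tuning setup described in the table above.
from transformers import AutoTokenizer, BartForConditionalGeneration, Seq2SeqTrainingArguments

model_id = "morenolq/bart-it"  # assumed Hub identifier of the pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BartForConditionalGeneration.from_pretrained(model_id)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-it-summarization",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    predict_with_generate=True,
)

def preprocess(batch):
    # Truncate inputs to 1,024 tokens and target summaries to 128 tokens.
    model_inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```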
The model is fine-tuned on different summarization datasets, and the model weights for each dataset are released on the HuggingFace Hub, as listed in the following table:
Dataset Type | Dataset Name | Model Weights | Dataset Paper |
---|---|---|---|
News Summarization | FanPage | bart-it-fanpage | Two New Datasets for Italian-Language Abstractive Text Summarization |
News Summarization | IlPost | bart-it-ilpost | Two New Datasets for Italian-Language Abstractive Text Summarization |
Wikipedia Summarization | WITS | bart-it-WITS | WITS: Wikipedia for Italian Text Summarization |
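As an example, one of the released checkpoints can be used directly through the `transformers` summarization pipeline (the `morenolq/` organisation prefix is assumed; check the Hub for the exact identifiers):

```python
# Hedged example: summarise a short Italian article with a released checkpoint.
from transformers import pipeline

summarizer = pipeline("summarization", model="morenolq/bart-it-fanpage")  # assumed Hub id
article = "Il testo dell'articolo da riassumere..."  # placeholder input text
print(summarizer(article, max_length=128)[0]["summary_text"])
```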
BART-IT is an efficient and accurate sequence-to-sequence model for the Italian language. Its performance is reported using both ROUGE and BERTScore metrics; please refer to the following paper for more details.
The script for evaluating the model on the summarization task is `summarization/evaluate_summarization.py`.
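A minimal sketch of how the two metrics can be computed with the `evaluate` library is shown below (the actual evaluation script may differ in preprocessing and aggregation):

```python
# Hedged sketch: compute ROUGE and BERTScore on model outputs.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["Riassunto generato dal modello."]  # placeholder model outputs
references = ["Riassunto di riferimento."]         # placeholder gold summaries

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="it"))
```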
The demo for the summarization of Italian text is available on HuggingFace Spaces. You can try it out by clicking on the link above or by using the `app.py` script available in the repository (you may need to install the `gradio` library to run the script locally).
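For a sense of what a local demo looks like, a minimal Gradio interface is sketched below (the Hub identifier is assumed, and the real `app.py` may expose different options):

```python
# Hedged sketch of a minimal Gradio summarization demo, similar in spirit to app.py.
import gradio as gr
from transformers import pipeline

summarizer = pipeline("summarization", model="morenolq/bart-it-fanpage")  # assumed Hub id

def summarize(text: str) -> str:
    return summarizer(text, max_length=128)[0]["summary_text"]

gr.Interface(fn=summarize, inputs="textbox", outputs="textbox", title="BART-IT summarization").launch()
```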
If you use this code or the pre-trained model, please cite the following paper:
@Article{BARTIT,
AUTHOR = {La Quatra, Moreno and Cagliero, Luca},
TITLE = {BART-IT: An Efficient Sequence-to-Sequence Model for Italian Text Summarization},
JOURNAL = {Future Internet},
VOLUME = {15},
YEAR = {2023},
NUMBER = {1},
ARTICLE-NUMBER = {15},
URL = {https://www.mdpi.com/1999-5903/15/1/15},
ISSN = {1999-5903},
DOI = {10.3390/fi15010015}
}
If you use the FanPage or IlPost datasets, please cite the following paper.
If you use the WITS dataset, please cite the following paper.
If you use the mC4 dataset, please refer to the original mT5 paper; if you are interested in the cleaned version of the dataset, please refer to the IT5 paper and to the cleaned mC4 repository.