
Leveraging BERT-to-GPT2 for Sentence Simplification

An encoder-decoder Transformer model for simplifying English sentences


📝 Table of Contents

  1. ➤ About The Project
  2. ➤ Prerequisites
  3. ➤ Folder Structure
  4. ➤ Dataset
  5. ➤ Model Architecture
  6. ➤ Code Usage
  7. ➤ Results and Discussion
  8. ➤ References

📝 About The Project

The project addresses simplification of English sentences. Given an English sentence as input, the model rearranges words or substitutes words and phrases to make the sentence easier to comprehend without losing the information carried by the original. This project is an effort to find an approach that achieves good results on the sentence simplification task. Its output can be useful for other NLP tasks that benefit from simplified sentences, such as machine translation, summarization, and classification. The project uses a BERT model as the encoder and a GPT-2 model as the decoder.


💠 Prerequisites


The following major packages are used in this project:

  • Python v3.0+
  • Pandas v1.1.0+
  • scikit-learn v0.24+
  • PyTorch v1.5+
  • Transformers (Hugging Face) v3.3+

🔽 Folder Structure

Base Project Folder
.
│
├── dataset
│   ├── src_train
│   ├── src_valid
│   ├── src_test
│   ├── tgt_train
│   ├── tgt_valid
│   ├── tgt_test
│   ├── ref_test
│   ├── ref_valid
│   ├── src_file
│
├── best_model
│   ├── model.pt
│
├── checkpoint
│   ├── model_ckpt.pt
│
├── outputs
│   ├── decoded.txt
│
├── run.py
├── data.py
├── sari.py   
├── tokenizer.py

💾 Dataset

The Wiki dataset, a parallel corpus of normal and simple sentences, is used to train the model. The original dataset consists of around 167k English sentence pairs drawn from Wikipedia articles, with one-to-many, one-to-one, and many-to-one sentence mappings. The dataset was not suitable for training without preprocessing: after tokenization, sentences longer than 80 tokens were removed, capping the maximum sentence length at 80 tokens and reducing the training set from 167k to roughly 138k pairs.
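A minimal sketch of this length-filtering step is shown below. The parallel plain-text layout (one sentence per line) and the choice to drop a pair when either side exceeds the cap are assumptions for illustration; the preprocessing actually used in the repository may differ.

```python
# Sketch of the length-based filtering described above (illustrative only).
# Assumes parallel files with one sentence per line; dropping a pair when
# either side exceeds 80 tokens is an assumption about the exact rule.
from transformers import BertTokenizer

MAX_TOKENS = 80
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def filter_pairs(src_path, tgt_path):
    """Return (normal, simple) pairs whose tokenized lengths stay within the cap."""
    kept = []
    with open(src_path, encoding="utf-8") as src_f, open(tgt_path, encoding="utf-8") as tgt_f:
        for normal, simple in zip(src_f, tgt_f):
            normal, simple = normal.strip(), simple.strip()
            if (len(tokenizer.tokenize(normal)) <= MAX_TOKENS
                    and len(tokenizer.tokenize(simple)) <= MAX_TOKENS):
                kept.append((normal, simple))
    return kept
```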
For validation and testing, TurkCorpus is used. It consists of 2k manually prepared sentence pairs for validation and 300 sentences for testing, each accompanied by 8 reference simplifications.


💾 Model Architecture

The project provides an end-to-end pipeline for the simplification task, trained in a supervised fashion with state-of-the-art Transformer models. The model accepts normal sentences as input. The sentences are converted to token IDs with the BERT tokenizer (BertTokenizer) and fed into the encoder-decoder model. The model then generates output tokens for the simplified sentence, which are converted back to text with the GPT-2 tokenizer (GPT2Tokenizer).

(Figure: model architecture)
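The sketch below shows how such a BERT-to-GPT2 encoder-decoder can be assembled with Hugging Face's EncoderDecoderModel. The checkpoint names, generation settings, and example sentence are illustrative assumptions rather than the repository's exact configuration (the actual training and decoding code is in run.py).

```python
# Illustrative sketch of a BERT-encoder / GPT-2-decoder pipeline using the
# Hugging Face Transformers library; checkpoint names and generation settings
# are assumptions, not the repository's exact configuration.
from transformers import BertTokenizer, GPT2Tokenizer, EncoderDecoderModel

enc_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
dec_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Tie a pretrained BERT encoder to a pretrained GPT-2 decoder; the decoder's
# cross-attention layers are newly initialized and must be learned during fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
model.config.decoder_start_token_id = dec_tokenizer.bos_token_id
model.config.pad_token_id = dec_tokenizer.eos_token_id  # GPT-2 has no pad token by default

sentence = "The committee deemed the proposal to be insufficiently substantiated."
inputs = enc_tokenizer(sentence, return_tensors="pt")

# Generate token ids for the simplified sentence, then decode with GPT2Tokenizer.
output_ids = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            max_length=80,
                            num_beams=4)
print(dec_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Meaningful simplifications only appear after the combined model has been fine-tuned on the Wiki training pairs, since the decoder's cross-attention weights start out untrained.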

💻 Code Usage

To train the model:

$ python run.py train --base_path "./" --src_train "dataset/src_train.txt" --src_valid "dataset/src_valid.txt" \
        --tgt_train "dataset/tgt_train.txt" --tgt_valid "dataset/tgt_valid.txt" \
        --ref_valid "dataset/ref_valid.pkl" --checkpoint_path "checkpoint/model_ckpt.pt" \
        --best_model "best_model/model.pt" --seed 540

To test the model:

$ python run.py test --base_path "./" --src_test "dataset/src_test.txt" --tgt_test "dataset/tgt_test.txt" \
        --ref_test "dataset/ref_test.pkl" --best_model "best_model/model.pt"

To decode user inputs:

$ python run.py decode --base_path "./" --src_file "dataset/src_file.txt" --output "dataset/decoded.txt" \
        --best_model "best_model/model.pt"

`--src_file` is the path to the file containing the user's input sentences to be simplified.
`--output` is the path where the model's decoded output is stored.
`--base_path` is the project's base path.
`--best_model` is the path to the best model saved during training.
`--checkpoint_path` is the path where the model checkpoint is stored.


🎏 Results and Discussion

The BERT-to-GPT2 model achieved a SARI score of 35.17 and a BLEU score of 37.39, and it simplified most sentences with promising results. The model is mostly seen substituting words with simpler ones while maintaining their context, and it is also able to simplify sentences at the phrase level. A few examples of the results are shown in the table below.

(Table: example model outputs)

However, the model failed to retain certain words at inference time. For example, in the first instance of the table below, the word ‘tarantula’ was misinterpreted as ‘talisman’, which could create a lot of confusion for readers. In the second example, the year 1982 was dropped, even though it is crucial for preserving the exact information of the sentence. The model failed to produce good results on some of the examples.

(Table: examples where the model performed poorly)

There are several reasons for the model’s poorer outputs. First, the dataset itself does not contain good sentence pairs; many pairs in the dataset are nearly identical. The model could have performed better if a gold-standard dataset, such as the Newsela dataset, had been used for training. Second, the model’s hyperparameters need to be tuned properly to obtain better results, but since training the model is computationally very expensive, hyperparameter tuning is difficult.


📚 References

  • Raman Chandrasekar and Bangalore Srinivas. 1997. Automatic induction of rules for text simplification. Knowledge-Based Systems.

  • David Vickrey and Daphne Koller. 2008. Sentence simplification for semantic role labeling. In Proceedings of ACL.

  • Lijun Feng. 2008. Text simplification: A survey. CUNY Technical Report.

  • Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification.

  • Sanqiang Zhao, Rui Meng, Daqing He, Andi Saptono, and Bambang Parmanto. 2018. Integrating transformer and paraphrase rules for sentence simplification.

  • J. Qiang et al. 2019. A Simple BERT-based Approach for Lexical Simplification.

  • Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning.


  • Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. 2019. Transforming delete, retrieve, generate approach for controlled text style transfer.



This was part of a mini project for my fifth semester of Computer Science at Deerwalk College.
