# German Unsupervised Text Simplification
Code from the Master Thesis and Paper:
"An Approach Towards Unsupervised Text Simplification on Paragraph-Level for German Texts"
Author: Leon Fruth
This approach is an adaptation of Keep it Simple to the German language. Large parts of the code are copied or adapted from the Keep it Simple repository: https://github.com/tingofurro/keep_it_simple/
The reward scores and training progression from the GUTS training runs are visualized here. The model used in this repository is named GUTS-2 in the report.
To test the GUTS models, use the script run_guts.py.
Command-line arguments select the model, decoding method, and input text.
Without any arguments, the script generates a single simplification using greedy decoding.
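A minimal sketch of how such a command-line interface might look. The flag names `--model_path`, `--decoding`, and `--text` are assumptions for illustration, not the actual arguments of run_guts.py (check its `--help` output for those):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags; the real script may use different names.
    parser = argparse.ArgumentParser(
        description="Simplify a German paragraph with a GUTS model.")
    parser.add_argument("--model_path", default="GUTS.bin",
                        help="path to the trained generator checkpoint")
    parser.add_argument("--decoding", default="greedy",
                        choices=["greedy", "sample", "beam"],
                        help="decoding method; greedy is the no-argument default")
    parser.add_argument("--text", default=None,
                        help="input paragraph; a built-in example is used if omitted")
    return parser
```

Calling the script without flags would then fall back to the defaults shown above, matching the greedy-decoding behavior described here.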
To train GUTS, use the script train_guts.py.
Before training, first pre-train a GerPT-2 model on the copy task. This way the generator model learns to copy the original paragraph, which is a good starting point.
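The copy task amounts to teacher forcing with the target sequence equal to the input. A minimal sketch of building such a training pair (the function name and the `-100` ignore-index convention from PyTorch's cross-entropy loss are assumptions; see train_copy_task.py for the actual preprocessing):

```python
def make_copy_batch(token_ids, pad_id, max_len):
    """Build an (input_ids, labels) pair for copy pre-training.

    The labels equal the inputs, so the language model is trained to
    reproduce the paragraph. Padding positions get label -100, which
    PyTorch's cross-entropy loss ignores by default.
    """
    ids = token_ids[:max_len]                  # truncate long paragraphs
    pad = [pad_id] * (max_len - len(ids))      # right-pad short ones
    input_ids = ids + pad
    labels = ids + [-100] * len(pad)           # don't train on padding
    return input_ids, labels
```

With a real GerPT-2 tokenizer, `token_ids` would be the encoded paragraph and `pad_id` the tokenizer's padding token.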
All parts of the reward are in the reward folder:
- reward.py: Wraps the individual scores and combines them into the overall reward. Different scoring functions and weights can be used, and scores can be added to the reward with configurable weights.
- simplicity.py: Contains the scores for lexical and syntactic simplicity.
- meaning_preservation.py: Contains different methods to score meaning preservation. The TextSimilarity score was used for this work. The file also contains the CoverageModel used in Keep it Simple and the Summary Loop, and BScoreSimilarity, a similarity score based solely on BERTScore.
- fluency.py: Contains the LM-fluency score and the TextDiscriminator score.
- guardrails.py: Contains the different guardrails: hallucination detection, Brevity, ArticleRepetition, and NGramRepetition.
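How the scores and guardrails interact can be sketched as follows. The weighted-product form and both function names are assumptions for illustration (see reward.py and guardrails.py for the actual combination); guardrails act as binary multipliers that zero out the reward when they fire:

```python
def ngram_repetition_guardrail(text: str, n: int = 3) -> float:
    """Toy guardrail: 0.0 if any word n-gram occurs more than once, else 1.0."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 0.0 if len(ngrams) != len(set(ngrams)) else 1.0

def combine_reward(scores: dict, weights: dict, guardrails: list) -> float:
    """Weighted product of component scores, zeroed when any guardrail fires."""
    reward = 1.0
    for name, value in scores.items():
        reward *= value ** weights.get(name, 1.0)
    for g in guardrails:
        reward *= g  # guardrails are 0/1 multipliers
    return reward
```

A degenerate generation that loops a phrase would trip the repetition guardrail and receive zero reward regardless of how fluent or simple it looks to the other scores.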
Some of these scores are analysed on the reference datasets TextComplexityDE and GWW_leichtesprache and visualized using Jupyter notebooks in the notebooks folder.
The data folder contains the following datasets:
- textcomplexityde.csv is the processed TextComplexityDE dataset. All sentences from a Wikipedia article are concatenated to form a complex-simple aligned dataset. This dataset was used for the analysis of the reward scores.
- leichtesprache2.csv are parallel articles from GWW.
- tc_eval.csv contains the manually composed paragraphs from the TextComplexityDE dataset and the generated simplifications used for the automatic evaluation in the thesis.
- wiki_eval.csv contains paragraphs from Wikipedia and the generated simplifications used for the automatic evaluation in the thesis.
- all_wiki_paragraphs.csv contains the extracted paragraphs from Wikipedia articles used for training. The file is contained in the latest release.
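The sentence-to-paragraph concatenation described for textcomplexityde.csv could be sketched with pandas as below. The column names `article`, `complex`, and `simple` are assumptions about the raw TextComplexityDE layout, not the file's actual header:

```python
import pandas as pd

def align_paragraphs(df: pd.DataFrame) -> pd.DataFrame:
    """Join all sentences of each article into one complex-simple paragraph pair."""
    return (df.groupby("article", sort=False)
              .agg({"complex": " ".join, "simple": " ".join})
              .reset_index())
```

Grouping by article keeps the complex and simplified sides aligned at paragraph level, which is what the reward-score analysis needs.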
The Jupyter notebook where the automatic evaluation can be reproduced is located at notebooks/evaluation.ipynb.
It uses the files tc_eval.csv and wiki_eval.csv to generate the automatic results.
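As one illustration of working with these files, a simple length-based statistic over source/simplification pairs could look like this. This is not one of the thesis's actual evaluation metrics, just a hedged sketch of the paired structure of the data:

```python
def compression_ratio(source: str, simplification: str) -> float:
    """Length of the simplification relative to the source, in whitespace tokens."""
    return len(simplification.split()) / max(len(source.split()), 1)

def mean_compression(pairs) -> float:
    """Average compression ratio over (source, simplification) pairs."""
    ratios = [compression_ratio(src, simp) for src, simp in pairs]
    return sum(ratios) / len(ratios)
```

The real notebook computes the automatic metrics reported in the thesis; this sketch only shows how paired columns from the CSVs can be consumed.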
This repository contains the following saved models:
- One trained GUTS model: GUTS.bin (contained in the release)
- morphmodel_ger.pgz: A model used for lemmatization of German words.
- wiki_finetune.bin: A saved BERT model trained on Wikipedia paragraphs for the LM-fluency score (contained in the release).
- All other models used in the reward scores are retrieved from the Hugging Face Hub.
- notebooks/generate_pivot_simplifications.ipynb is the notebook to generate the simplifications using the Pivot model for the automatic evaluation. Running it requires the generator model from Keep it Simple as well as the trained model.
- train_copy_task.py: The training script to pre-train a GPT-2 model to copy Wikipedia articles. Copying the original text is a good starting point for simplification.
- pretrain_coverage.py: The training script to train the coverage model from https://github.com/CannyLab/summary_loop
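The coverage idea behind pretrain_coverage.py can be illustrated with a toy version. The real CoverageModel masks keywords in the source and lets a fine-tuned masked language model recover them from the generated text; this sketch (function name and logic are assumptions) only checks which keywords literally reappear in the simplification:

```python
def toy_coverage(source_keywords, simplification: str) -> float:
    """Fraction of source keywords that survive in the simplification.

    Stand-in for the learned CoverageModel: instead of asking a masked
    language model to recover masked keywords, we just test for literal
    (case-insensitive) presence.
    """
    if not source_keywords:
        return 1.0
    simp_words = set(simplification.lower().split())
    hits = sum(1 for kw in source_keywords if kw.lower() in simp_words)
    return hits / len(source_keywords)
```

A simplification that drops half of the source's key content words would score around 0.5 here, mirroring how the learned coverage score penalizes lost information.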