
GUTS

German Unsupervised Text Simplification

Code from the master's thesis and paper:

"An Approach Towards Unsupervised Text Simplification on Paragraph-Level for German Texts"
Author: Leon Fruth

This approach is an adaptation of the paper Keep it Simple to German. Large parts of the code are copied or adapted from the Keep it Simple repository: https://github.com/tingofurro/keep_it_simple/

The reward scores and training progression from the training runs of GUTS are visualized here. The model used in this repository is named GUTS-2 in the report.

Run GUTS

To test the GUTS models, use the script run_guts.py.

Command-line arguments let you select different models, decoding methods, and input texts. Without any arguments, the script generates a single simplification using greedy decoding.
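Greedy decoding, the default here, always picks the highest-scoring next token. The sketch below illustrates the idea with a toy scoring function standing in for a real language model such as GerPT-2; all names are illustrative, not the repository's API.

```python
# Minimal sketch of greedy decoding (the default in run_guts.py).
# next_token_logits is a toy stand-in for a real language model;
# the names here are illustrative, not the repository's API.

def next_token_logits(prefix):
    """Toy 'model': scores a tiny vocabulary given the prefix so far."""
    vocab = {"<eos>": 0.1, "ein": 0.2, "text": 0.3}
    if prefix and prefix[-1] == "text":
        vocab["<eos>"] = 1.0  # stop after emitting "text"
    return vocab

def greedy_decode(prompt, max_steps=10):
    tokens = list(prompt)
    for _ in range(max_steps):
        logits = next_token_logits(tokens)
        best = max(logits, key=logits.get)  # greedy: take the argmax token
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

result = greedy_decode(["ein"])
```

Because the argmax is deterministic, greedy decoding always produces the same simplification for the same input, unlike sampling-based decoding methods.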

Training

To train GUTS, use the script train_guts.py.

Before training, first pre-train a GerPT-2 model on the copy task. This way the generator model learns to copy the original paragraph, which is a good starting point for simplification.
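The copy task amounts to pairing each paragraph with itself as input and target, so the generator first learns the identity mapping before reinforcement learning pushes it toward simplification. A hedged sketch of such data preparation, assuming a hypothetical helper and separator token (not the repository's actual code):

```python
# Hedged sketch of copy-task data preparation: each training example
# pairs a paragraph with itself, so the generator learns to reproduce
# its input. make_copy_examples and the separator are hypothetical.

def make_copy_examples(paragraphs, sep="<|sep|>"):
    """Build (input, target) string pairs for teacher-forced fine-tuning."""
    return [(p + sep, p) for p in paragraphs]

examples = make_copy_examples(["Ein kurzer Absatz.", "Noch ein Text."])
```

Fine-tuning GerPT-2 on these pairs with a standard language-modeling loss gives the generator a strong copying prior to start from.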

Reward

All components of the reward are in the reward folder:

  • reward.py: Wraps the individual scores and combines them into the overall reward. Different scoring functions and weights can be used.
  • The scores can be added to the reward with variable weights:
    • simplicity.py: Contains the scores for lexical and syntactic simplicity.
    • meaning_preservation.py: Contains different methods to score meaning preservation. The TextSimilarity score was used for this work. The file also contains the CoverageModel, used in Keep it Simple and the Summary Loop, and BScoreSimilarity, a similarity score based solely on BERTScore.
    • fluency.py: Contains the LM-fluency score and the TextDiscriminator score.
    • guardrails.py: Contains guardrails for Hallucination Detection, Brevity, ArticleRepetition, and NGRamRepetition.
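In the Keep it Simple family of methods, the component scores are multiplied together so that any guardrail scoring zero nullifies the whole reward. A minimal sketch of one such weighted combination, with all names hypothetical rather than taken from reward.py:

```python
# Hedged sketch of combining reward components, Keep it Simple-style:
# scores are multiplied, with weights acting as exponents, so a zero
# guardrail score zeroes the overall reward. Names are illustrative.

def combine_reward(scores, weights):
    """Weighted multiplicative combination: prod(score ** weight)."""
    reward = 1.0
    for name, value in scores.items():
        reward *= value ** weights.get(name, 1.0)
    return reward

scores = {"simplicity": 0.8, "meaning": 0.9, "fluency": 1.0, "guardrails": 1.0}
weights = {"simplicity": 2.0}
r = combine_reward(scores, weights)
```

A multiplicative combination forces the generator to do reasonably well on every component at once; a weighted sum, by contrast, would let a high simplicity score compensate for destroyed meaning.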

Some of these scores are analysed on the reference datasets TextComplexityDE and GWW_leichtesprache and visualized in Jupyter notebooks in notebooks.

Data

The data folder contains the following datasets:

  • textcomplexityde.csv is the processed TextComplexityDE dataset, in which all sentences from a Wikipedia article are concatenated to form a complex-simple aligned dataset. This dataset was used for the analysis of the reward scores.
  • leichtesprache2.csv are parallel articles from GWW.
  • tc_eval.csv contains the manually composed paragraphs from the TextComplexityDE dataset, and the generated simplifications used for automatic evaluation of the thesis.
  • wiki_eval.csv contains paragraphs from Wikipedia and the generated simplifications used for automatic evaluation of the thesis.
  • all_wiki_paragraphs.csv contains the extracted paragraphs from Wikipedia articles used for training. The file is contained in the latest release.

Automatic Evaluation

The Jupyter notebook reproducing the automatic evaluation is located at notebooks/evaluation.ipynb.

It uses the files tc_eval.csv and wiki_eval.csv to generate the automatic results.

Models

This repository contains the following saved models:

  • One trained GUTS model: GUTS.bin (contained in the release)
  • morphmodel_ger.pgz: A model used for lemmatization of German words
  • wiki_finetune.bin: A saved BERT model trained on Wikipedia paragraphs, used for the LM-fluency score (contained in the release)
  • All other models used in the reward scores are retrieved from the Hugging Face Hub

Other scripts
