Skip to content

First approach for recognizing logical document structures like texts, sentences, segments, words, chars and sentence/segment depth based on recurrent neural network grammars.

License

Notifications You must be signed in to change notification settings

Psarpei/Recognition-of-logical-document-structures

Repository files navigation

Recognition of logical document structures

Rocognition of logical document structures for german language based on recurrent neural network grammars.

This work is the preliminary work of the paper Recognizing Sentence-level Logical Document Structures with the Help of Context-free Grammars

General information

Instructors

Institutions

Project team

Configuration

Take a look here for the correct configuration recurrent neural network grammars

Project

The model is able to recognizing the followig logical document structures

  • (t - text start
  • (s - sentence start
  • (seg - segment start
  • (w - word start
  • (c - char start
  • ) end of logical document structure
  • Ti - sentence/segment depth will be measured recursive

Example

The sentence

Georg Bendemann, ein junger Kaufmann, saß in seinem Privatzimmer im ersten Stock eines der niedrigen leichtgebauten Häuser, die entlang des Flusses in einer langen Reihe, fast nur in der Höhe und Färbung unterschieden, sich hinzogen.

should be predicted as

(s (seg (w Georg) (w Bendemann) (c ,)) (seg (w ein) (w junger) (w Kaufmann) (c ,)) (seg (w saß) (w in) (w seinem) (w Privatzimmer) (w im) (w ersten) (w Stock) (w eines) (w der) (w niedrigen) (c ,)) (seg (w leichtgebauten) (w Häuser) (c ,)) (seg (w die) (w entlang) (w des) (w Flusses) (w in) (w einer) (w langen) (w Reihe) (c,)) (seg (w fast) (w nur) (w in) (w der) (w Höhe) (w und) (w Färbung) (w unterschieden) (c ,)) (seg (w sich) (w hinzogen) (c .)))

Training results

Validation results

Prediction

The model is able to predict the output in the usually format like in example_predict.txt. or in the .xml format like in example_XML.tei.

In following DATANAME is a free selectable name.

There are to options to make a prediction:

  • You have the ground truth of a text and you want to predict the rnng-output as .tei file as well as the evaluation. For this you have to save your ground_truth data as DATANAME_Ground_Truth.txt.
  • You only want to predict the rnng-output. For this you have to save your text in the file DATANAME.txt as plain text.

In both cases you have to save your files in the directory /PLACE_YOUR_FILES_HERE

To make a prediciton (and get the evaluation results) you only have to navigate to the directory of the predict.sh file and execute

./predict.sh DATEINAME 

The following files are generated in the /PLACE_YOUR_FILES_HERE directory

  • DATEINAME_graminput.txt: The rnng input file
  • DATEINAME_predict.txt: The prediction in grammar format.
  • DATEINAME.tei: This is the final rnng out file which contains the prediction in .tei format.
  • DATEINAME.txt: The purely input text without grammar format is saved here (if not available).
  • DATEINAME_Ground_Truth.txt: This file only exists if, you saved a ground truth file here.
  • DATEINAME_evaluation.txt: Here you can find the evaluation results (this file only exists if you have saved a ground truth file).

About

First approach for recognizing logical document structures like texts, sentences, segments, words, chars and sentence/segment depth based on recurrent neural network grammars.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published