Article Separation

Python modules for different tasks:

  • separating articles in (historical) newspapers or similar documents (article_separation)
  • measuring the performance of article separation algorithms (article_separation_measure and as_eval)
  • utility functions, e.g. for plotting images together with metadata information (python_util)

Introduction

This repository is part of the European Union's Horizon 2020 project NewsEye and is mainly used for separating articles in (historical) newspapers and similar documents.

The purpose of the NewsEye project is to enable historians and humanities scholars to investigate large collections of newspapers. The newspaper pages are digitized and available as scanned images. To ensure efficient work, the data processing steps should be as automatic as possible. Newspapers are generally structured into a large number of articles, which usually cover a distinct piece of content or a certain topic and can mostly be understood without further context. Newspaper articles are crucial entities for historians and humanities scholars who focus on a specific research area and are only interested in articles related to that topic. Additionally, some natural language processing applications, e.g. topic modeling or event detection, rely on a logical structuring of the underlying text to extract meaningful information. For this reason it is important to tackle the article separation (AS) task, which aims to form coherent articles based on previously detected baselines and their respective text.

The following image gives a schematic overview of the overall AS workflow.

Article Separation Workflow

Installation

The Python modules in this repository are all tested with Python 3.6. The best way to use the modules is to create a virtual environment and install the packages given in the requirements.txt file.

The packages should work with TensorFlow 1.12 (pip install tensorflow==1.12) to TensorFlow 1.14 (pip install tensorflow==1.14).

Main Packages / Usage

All modules work with metadata information stored in the well-established PAGE-XML format as defined by the PRImA Research Lab. Some modules require the following folder structure, where the PAGE-XML files are stored inside a separate page folder and share the basename of the corresponding image.

.
+-- file1.jpg
+-- file2.jpg
+-- file3.jpg
+-- page
|	+-- file1.xml
|	+-- file2.xml
|	+-- file3.xml
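As a minimal sketch of this convention (assuming .jpg images), the PAGE-XML path belonging to an image can be derived like this:

import os

def page_xml_for_image(image_path):
    # The PAGE-XML file lives in a "page" subfolder next to the image and
    # shares its basename, e.g. /data/file1.jpg -> /data/page/file1.xml.
    folder, file_name = os.path.split(image_path)
    base_name, _ = os.path.splitext(file_name)
    return os.path.join(folder, "page", base_name + ".xml")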

article_separation

The most important package is article_separation, where all scripts needed to run AS-related tasks can be found. A brief description of the individual modules is given below. A more detailed description of the workflow can be found in the official public deliverable D2.7 (Article separation (c) (final)). A link to all public deliverables is given here.

Separator Detection

This module is used to detect visible vertical and horizontal separators on a newspaper page. To use it, a TensorFlow model trained on an image segmentation task is needed. An example network can be found in nets/separator_detection_net.pb. The underlying model we used is the so-called ARU-Net, a U-Net extended by two key concepts: attention (A) and depth via residual structures (R). To run the separator detection, use the run_net_post_processing.py file as in the following example.

python -u run_net_post_processing.py --path_to_image_list "/path/to/image/list" --path_to_pb "/path/to/separator_detection_graph.pb" --mode "separator" --num_processes N
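For orientation, the following minimal sketch shows how such a frozen graph could be loaded and run with TensorFlow 1.x. The tensor names inputs:0 and output:0 as well as the input shape are assumptions and may differ from the shipped graph.

import numpy as np
import tensorflow as tf  # tested with TensorFlow 1.12 - 1.14

def load_frozen_graph(pb_path):
    # Read the serialized GraphDef and import it into a fresh graph.
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")
    return graph

graph = load_frozen_graph("nets/separator_detection_net.pb")
image_batch = np.zeros((1, 512, 512, 1), dtype=np.float32)  # placeholder page image
with tf.Session(graph=graph) as sess:
    # "inputs:0" / "output:0" are assumed tensor names, not taken from the repository.
    separator_map = sess.run("output:0", feed_dict={"inputs:0": image_batch})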

Text Block Detection

The current version of this module is divided into two parts and only needs the PAGE-XML files:

  1. Cluster the text lines / baselines on a page based on the DBSCAN algorithm.
  2. Based on these clusters create text regions with the Alpha shape algorithm.

The corresponding run scripts are run_baseline_clustering.py and run_textregion_generation.py, which can be run as in the following examples.

python -u run_baseline_clustering.py --path_to_xml_lst "/path/to/xml/list" --num_threads N
python -u run_textregion_generation.py --path_to_xml_lst "/path/to/xml/list" --num_threads N
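The following is a minimal sketch of the first step only, assuming baselines are given as polylines of (x, y) points; the actual distance measure and parameters used by run_baseline_clustering.py are more elaborate.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy baselines, each given as a polyline of (x, y) points.
baselines = [
    [(100, 100), (400, 100)],
    [(100, 160), (400, 160)],
    [(600, 100), (900, 110)],
]

# Represent every baseline by its midpoint and cluster the midpoints with DBSCAN.
midpoints = np.array([np.mean(b, axis=0) for b in baselines])
labels = DBSCAN(eps=120, min_samples=1).fit_predict(midpoints)
# labels[i] is the cluster id of baseline i; the first two baselines end up together.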

Heading Detection

The heading detection combines a distance transformation, which estimates approximate text heights and stroke widths, with an image segmentation approach that detects headings in an image. An example network can be found in nets/heading_detection_net.pb. The results of both approaches are combined in a weighted manner, with most of the weight put on the net output. To run the heading detection, use the run_net_post_processing.py file as in the following example.

python -u run_net_post_processing.py --path_to_image_list "/path/to/image/list" --path_to_pb "/path/to/heading_detection_graph.pb" --mode "heading" --num_processes N
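As a rough illustration of the weighted combination (the actual weights and score normalization in the repository may differ):

def combine_heading_scores(net_confidence, text_size_score, net_weight=0.8):
    # Weighted combination per text line, with most of the weight on the net output.
    # The weight of 0.8 is an assumption for illustration only.
    return net_weight * net_confidence + (1.0 - net_weight) * text_size_score

combine_heading_scores(0.9, 0.4)  # -> 0.8, strong net evidence dominates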

Graph Neural Network

This module is used to solve a relation prediction task, i.e. to predict which text blocks belong to the same article. Since a Graph Neural Network (GNN) works on graph data enriched with feature information, this data first needs to be generated. To run the feature generation process, use the article_separation/gnn/input/feature_generation.py file as in the following example.

python -u feature_generation.py --pagexml_list "/path/to/xml/list" --num_workers N

The graph data for a single PAGE-XML file will be saved in a corresponding json file and will include feature information from prior modules if the PAGE-XML files were updated accordingly. On the node level, this usually entails the position and size of the text blocks, the position and size of the first and last baselines of each text block, the stroke width and height of the contained text, and an indicator of whether the text block is a heading. On the edge level, it is indicated whether two text blocks are separated by a horizontal or vertical separator. Overall, this results in 15 node features and 2 edge features.
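The layout below is only a hypothetical illustration of such a json file; the actual key names in the generated files may differ, but the kind of information matches the description above.

# Hypothetical graph json content (key names are assumptions):
example_graph = {
    "num_nodes": 3,                         # one node per text block
    "node_features": [[0.0] * 15] * 3,      # 15 features per node: block position/size,
                                            # first/last baseline geometry, stroke width,
                                            # text height, heading indicator
    "interacting_nodes": [[0, 1], [1, 2]],  # candidate text block pairs (edges)
    "edge_features": [[1.0, 0.0],           # 2 features per edge: horizontal / vertical
                      [0.0, 0.0]],          # separator between the two blocks
}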

Optionally, visual regions can be added, which can later be used by a visual feature extractor (e.g. ARU-Net) to integrate visual features.

--visual_regions True

Similarly, text block similarity features based on word vectors can be integrated, if they are available.

--language "language" --wv_path "path/to/wordvector/file"
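As a rough illustration only (the repository's actual similarity computation may differ), such a feature could be a cosine similarity between averaged word vectors of two text blocks:

import numpy as np

# Toy word vectors; in practice they would come from the file passed via --wv_path.
word_vectors = {"kaiser": np.array([0.9, 0.1]),
                "wien": np.array([0.8, 0.3]),
                "wetter": np.array([0.1, 0.9])}

def block_vector(text):
    # Average the vectors of all known words in a text block.
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def block_similarity(text_a, text_b):
    a, b = block_vector(text_a), block_vector(text_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

block_similarity("Kaiser Wien", "Wien")    # high similarity
block_similarity("Kaiser Wien", "Wetter")  # low similarity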

Lastly, additional (previously generated) features can be added via external json files, both for nodes and edges. This is currently used for text block similarities coming from a BERT model.

--external_jsons "path/to/external/json/file"

Both the text block similarity features and any additional external features increase the final number of available features for the GNN accordingly.

Text Block Clustering

The Graph Neural Network outputs a confidence graph for the aforementioned relation prediction task. Based on these predictions, the final step to form articles is a clustering process. Prediction and clustering are run jointly using the article_separation/gnn/run_gnn_clustering.py file, as in the following example.

python -u run_gnn_clustering.py \
  --model_dir "path/to/trained/gnn/model" \
  --eval_list "path/to/json/list" \
  --input_params \
  node_feature_dim=NUM_NODE_FEATURES \
  edge_feature_dim=NUM_EDGE_FEATURES \
  --clustering_method CLUSTERING_METHOD

For this module, a trained GNN model is needed, and the number of node and edge features needs to be set according to the GNN to correctly build the input pipeline. As clustering algorithms, we currently support a greedy approach (greedy), a modified DBSCAN algorithm (dbscan) and a hierarchical clustering method (linkage).
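For orientation, the following is a minimal sketch of the idea behind the greedy option, not the actual implementation: edges are visited in order of descending confidence and two blocks are merged whenever the confidence exceeds a threshold.

def greedy_clustering(num_nodes, edge_confidences, threshold=0.5):
    # edge_confidences maps (i, j) text block pairs to a confidence in [0, 1].
    # Union-find: every text block starts as its own article.
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit edges by descending confidence and merge above-threshold pairs.
    for (i, j), conf in sorted(edge_confidences.items(), key=lambda kv: -kv[1]):
        if conf >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(num_nodes)]

greedy_clustering(3, {(0, 1): 0.9, (1, 2): 0.2})  # blocks 0 and 1 form one article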

If the GNN was only trained on a subset of the generated features, the redundant features need to be masked manually with a boolean list (1=include, 0=exclude) that contains an entry for each available feature. For example, if 15 node features and 2 edge features are available but the GNN was only trained on the first 4 node features, the clustering call would have to look as follows:

python -u run_gnn_clustering.py \
  --model_dir "path/to/trained/gnn/model" \
  --eval_list "path/to/json/list" \
  --input_params \
  node_feature_dim=15 \
  edge_feature_dim=2 \
  node_input_feature_mask=[1,1,1,1,0,0,0,0,0,0,0,0,0,0,0] \
  edge_input_feature_mask=[0,0] \
  --clustering_method CLUSTERING_METHOD

If visual regions were generated in the previous step and the GNN was trained accordingly, i.e. a visual feature extractor component was added to the network, additional visual features can be integrated during this process. Note that in this case the corresponding image files will be needed.

--image_input True --assign_visual_features_to_nodes True --assign_visual_features_to_edges False

The output of this module will be new PAGE-XML files containing the final clustering results, which represent the found articles.


article_separation_measure

This package contains a method to measure the performance of an AS algorithm. It is based on the baseline detection measure that was already used at competitions like the ICDAR 2017 Competition on Baseline Detection; a description of it can be found here. The AS measure was used at the ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset. A more detailed description can be found in the public deliverable D2.7.

To run the measure, you need a list of hypothesis PAGE-XML files and a list of ground truth PAGE-XML files to compare against. The run script is run_measure.py and can be executed as in the following example.

python -u run_measure.py --path_to_hyp_xml_lst "/path/to/hyp/xml/list" --path_to_gt_xml_lst "/path/to/gt/xml/list"

as_eval

This is another package for evaluating an AS algorithm, as described in deliverable D2.7 v6.0. It is based on how many splits and merges of partition blocks (e.g. text blocks) are needed to convert the ground truth into the hypothesis.

Minimal example run script

  • minRunEx.py
  • works on example data in
../work/
    ├ page/example-[1..?].xml               PAGE-XML with ground truth article separation
    └ clustering/                           PAGE-XML with hypotheses …
        └ method-[1..?]/example-[1..?].xml  … for various methods
  • Simplification: For correct interpretation & labeling, the name of a hypothesis PAGE-XML's parent directory must be the method's name (cf. SepPageCompDict.path2method).

Checking PAGE-XML

  • initialize an AsChecker with the problem types (enums) to be analyzed
  • provide the list of pages to be checked
  • run AsChecker's checkPages() method
  • get results from AsChecker's probDict or cntDict members
  • direct JSON output via probToJSON for convenience (a minimal usage sketch follows)
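A minimal usage sketch of these steps (module path, constructor and method arguments, and enum names are assumptions; only the class and member names above are taken from the description):

# from as_eval import AsChecker          # import path is an assumption
checker = AsChecker(problem_types)       # problem types (enums) to be analyzed
checker.checkPages(page_list)            # list of PAGE-XML files to be checked
problems = checker.probDict              # detailed problems per page
counts = checker.cntDict                 # problem counters
checker.probToJSON("problems.json")      # direct JSON output for convenience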

Comparison result interpretation

SepPageComparison container
  • container for counting results
  • collected in SepPageCompDict
  • … counts number of …
    • gtNIs … articles in ground truth
    • hypNIs … articles in hypothesis
    • corrects … properly separated articles in hypothesis
  • walking from the ground truth partition to the hypothesis partition requires …
    • splits-many splittings of partition blocks (increasing their number, thus understood to be ≥0)
    • merges-many mergings of partition blocks (decreasing their number, thus understood to be ≤0)
  • … resulting in a partition distance of dist = splits - merges
  • … and requiring consistency (checked by SepPageComparison.checkConsistency): gtNIs + splits - merges = hypNIs
XLSX file

SepPageComparison containers are ordered by ascending dist first and by descending corrects second. In this sense, one method yields a better article separation than another, i.e. gains a victory over the other method. We count such wins, where (1) ties are also counted as wins and (2) each method is also compared against itself. Note that, due to these (laziness!) conditions, all counters always include the victories against oneself, i.e. are at least the number of samples.

The example worksheet(s) (in general: named after the dataset(s)) contain(s) results for pairwise comparison, where

  • main diagonal entries simply count, hence must be equal to the number of comparisons, thus serving as a plausibility check made possible by conditions (1) and (2);
  • off-diagonal entries show the ratio between victories of the row-head method over the column-head method, thus resulting in a "reciprocal symmetric" matrix;
  • the first column counts all victories of the row-head method.

The winner worksheet contains the overall numbers of all victories of the row-head method in any of the datasets under consideration. In column B, this includes all methods under investigation. Then the methods with the fewest victories are removed step-by-step, and the number of victories is computed w.r.t. the reduced set of methods only. This yields the values in the subsequent columns, and it stops when, in the last column, only the winner method is compared against itself, which obviously must yield the number of investigated samples as another plausibility check.
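A minimal sketch of this victory counting, under the assumption that dist and corrects are already available per method and sample (lower dist wins, then higher corrects; ties and self-comparisons count as wins):

def count_victories(results):
    # results[method] is a list of (dist, corrects) tuples, one per sample.
    methods = list(results)
    num_samples = len(next(iter(results.values())))
    wins = {(a, b): 0 for a in methods for b in methods}
    for s in range(num_samples):
        for a in methods:
            for b in methods:
                dist_a, corr_a = results[a][s]
                dist_b, corr_b = results[b][s]
                # a wins against b on ascending dist first, descending corrects second;
                # ties count as wins, and a == b is compared as well.
                if (dist_a, -corr_a) <= (dist_b, -corr_b):
                    wins[(a, b)] += 1
    return wins

# The diagonal entries wins[(m, m)] equal the number of samples, matching the
# plausibility check on the main diagonal of the worksheet.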


python_util

This package contains multiple utility functions that are used by the article_separation package. The most important ones are the PAGE-XML parser (page.py) and the PAGE-XML plotter (plot.py), which load and save the metadata information related to an image.
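A minimal sketch of the typical load/modify/save round trip; the import path and method names below are assumptions based on the description, not verified API names:

from python_util.parser.xml.page.page import Page  # import path is an assumption

page = Page("page/file1.xml")            # parse the PAGE-XML metadata
regions = page.get_regions()             # hypothetical accessor for the text regions
page.write_page_xml("page/file1.xml")    # hypothetical method to save the metadata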

See Also
