Description

A step-by-step C# implementation of the Docstrum algorithm for PDF documents.

How to run C# code in Jupyter Lab / How to install .NET Interactive

https://devblogs.microsoft.com/cesardelatorre/using-ml-net-in-jupyter-notebooks/

https://devblogs.microsoft.com/dotnet/net-interactive-is-here-net-notebooks-preview-2/ (if the previous fails when installing)

Description

Version 2 is the latest update and handles rotated words, lines and paragraphs. This is the version implemented in PdfPig.

Link to original paper: The Document Spectrum for Page Layout Analysis by Lawrence O'Gorman

From Performance Comparison of Six Algorithms for Page Segmentation: The Docstrum algorithm by O'Gorman is a bottom-up approach based on nearest-neighborhood clustering of connected components extracted from the document image. After noise removal, the connected components are separated into two groups, one with dominant characters and another one with characters in titles and section heading, using a character size ratio factor fd. Then, K nearest neighbors are found for each connected component. Then, text-lines are found by computing the transitive closure on within-line nearest neighbor pairings using a threshold ft. Finally, text-lines are merged to form text blocks using a parallel distance threshold fpa and a perpendicular distance threshold fpe. - wiki
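The nearest-neighbour pairing and transitive-closure step described above can be sketched as follows. This is a minimal illustration, not the code from the notebooks or from PdfPig: the class DocstrumLineSketch, the method GroupIntoLines and its parameters are hypothetical names, points stand in for word or character centroids, text is assumed to be roughly horizontal (as in version 1), and the distance threshold is passed in directly rather than derived from the estimated within-line spacing and ft.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of this repository or of PdfPig.
public static class DocstrumLineSketch
{
    // Pairs each point with its k nearest neighbours, keeps the pairs that are
    // close enough and roughly horizontal, then takes the transitive closure of
    // those pairs (with a union-find structure) to obtain the text lines.
    public static List<List<int>> GroupIntoLines(
        IReadOnlyList<(double X, double Y)> points, int k,
        double maxDistance, double maxAngleDegrees)
    {
        int[] parent = Enumerable.Range(0, points.Count).ToArray();

        int Find(int i)
        {
            while (parent[i] != i)
            {
                parent[i] = parent[parent[i]]; // path halving
                i = parent[i];
            }
            return i;
        }

        for (int i = 0; i < points.Count; i++)
        {
            var current = points[i];

            // Brute-force k nearest neighbours: simple, but O(n^2 log n) overall.
            var neighbours = Enumerable.Range(0, points.Count)
                .Where(j => j != i)
                .OrderBy(j => Distance(current, points[j]))
                .Take(k);

            foreach (int j in neighbours)
            {
                double dx = points[j].X - current.X;
                double dy = points[j].Y - current.Y;
                double angle = Math.Abs(Math.Atan2(dy, dx)) * 180.0 / Math.PI;
                angle = Math.Min(angle, 180.0 - angle); // left and right neighbours count the same

                if (Distance(current, points[j]) <= maxDistance && angle <= maxAngleDegrees)
                {
                    parent[Find(i)] = Find(j); // merge the two groups
                }
            }
        }

        // Each union-find root identifies one line; return the member indices.
        return Enumerable.Range(0, points.Count)
            .GroupBy(Find)
            .Select(g => g.ToList())
            .ToList();
    }

    private static double Distance((double X, double Y) a, (double X, double Y) b)
        => Math.Sqrt((a.X - b.X) * (a.X - b.X) + (a.Y - b.Y) * (a.Y - b.Y));
}
```

In a notebook cell, calling DocstrumLineSketch.GroupIntoLines with the word centroids of a page returns one list of point indices per detected line. The brute-force neighbour search keeps the sketch short; a spatial index such as a k-d tree would be the usual choice for larger pages.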

Variables used in structural block determination

The variables can be obtained with the GetStructuralBlockingParameters() method.

public static bool GetStructuralBlockingParameters(PdfLine i, PdfLine j, double epsilon,
    out double angularDifference, out double normalisedOverlap, out double perpendicularDistance)
{
    ...
}

From the original paper by O'Gorman:

Fig. 8. Variables used in structural block determination. The two text lines, represented by segments i and j, are to be tested here to determine if they should be grouped into the same block. Their angular difference is θ_ij. The overlap length of segment i on segment j is p_j (and that is normalized to obtain the overlap parameter). The parallel distance between i and j is d^a_ij = p_j in this case. The perpendicular distance between i and j is d^e_ij.
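To make the three quantities in the caption concrete, here is a small geometric sketch that works on plain line segments. It is not the body of GetStructuralBlockingParameters(): the class StructuralBlockSketch and its Compute method are hypothetical, and two choices are assumptions made for this example only, namely that the overlap is normalised by the length of segment j and that the perpendicular distance is measured from the midpoint of segment i to the line through segment j.

```csharp
using System;

// Hypothetical helper for illustration; not the repository's implementation.
public static class StructuralBlockSketch
{
    public static (double AngularDifference, double NormalisedOverlap, double PerpendicularDistance)
        Compute((double X, double Y) iStart, (double X, double Y) iEnd,
                (double X, double Y) jStart, (double X, double Y) jEnd)
    {
        double jdx = jEnd.X - jStart.X;
        double jdy = jEnd.Y - jStart.Y;
        double jLength = Math.Sqrt(jdx * jdx + jdy * jdy);
        if (jLength == 0) throw new ArgumentException("Segment j has zero length.");

        // Unit vector along segment j.
        double ux = jdx / jLength;
        double uy = jdy / jLength;

        // Angular difference in degrees, folded into [0, 90] so that the
        // direction in which each segment is drawn does not matter.
        double angleI = Math.Atan2(iEnd.Y - iStart.Y, iEnd.X - iStart.X);
        double angleJ = Math.Atan2(jdy, jdx);
        double diff = (Math.Abs(angleI - angleJ) * 180.0 / Math.PI) % 180.0;
        double angularDifference = Math.Min(diff, 180.0 - diff);

        // Project both endpoints of i onto the axis of j (origin at jStart),
        // then intersect the projected interval with [0, jLength].
        double a = (iStart.X - jStart.X) * ux + (iStart.Y - jStart.Y) * uy;
        double b = (iEnd.X - jStart.X) * ux + (iEnd.Y - jStart.Y) * uy;
        double overlap = Math.Min(Math.Max(a, b), jLength) - Math.Max(Math.Min(a, b), 0.0);

        // Assumption: normalise by the length of j. A negative value means
        // there is a gap between the segments along j instead of an overlap.
        double normalisedOverlap = overlap / jLength;

        // Assumption: distance from the midpoint of i to the line through j.
        double mx = (iStart.X + iEnd.X) / 2.0 - jStart.X;
        double my = (iStart.Y + iEnd.Y) / 2.0 - jStart.Y;
        double perpendicularDistance = Math.Abs(mx * uy - my * ux);

        return (angularDifference, normalisedOverlap, perpendicularDistance);
    }
}
```

For two horizontal segments, i from (0, 10) to (100, 10) and j from (20, 0) to (120, 0), the segments share 80 units along j, so this sketch gives an angular difference of 0, a normalised overlap of 0.8 and a perpendicular distance of 10.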

Results

0.0 Open pdf document

0.1 Extract words and preprocess

1. Estimate within-line and between-line spacing

1.1 Within-line (between words) spacing

1.2 Between-line spacing

2. Get lines

3. Get paragraph blocks

BobLd/simple-docstrum

Folders and files

Latest commit

History

Repository files navigation

How to run C# code in Jupyter Lab / How Install .NET Interactive

Description

Variables used in structural block determination

Results

0.0 Open pdf document

0.1 Extract words and preprocess

1. Estimate within-line and between-line spacing

1.1 Within-line (between words) spacing

1.2 Between-line spacing

2. Get lines

3. Get paragraphs blocks

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages