Skip to content

using open source library the goal on this program is to transform a pdf into data blocks with meta-data usable by any other program

License

Notifications You must be signed in to change notification settings

Nexai-net/pdf-data-extractor

Repository files navigation

PDF Data Extractor

This solution extract information for a PDF (Text, Image, Metadata) and produce a data structure called "Data Block" from it.
This application use IText7 to analyze the document PDF.

The extraction result could be freely be used.

Data block

A data block is define a type (Document, page, text, image, relation ... ), an unique id and an oriented area.

The extract return many blocks. In post process the block are group together following rule:

  • Proximity
  • Orientation
  • Font similare
  • ...

This process construct text group that work together.

An other post process try to create a relation between groups to easi page analysis.

Command line

In command line you need to specify where is the source document and where to put the output.
A file FILE_NAME.json is generate. This file is serialized by .net json serializer.

Tip

Serialization Settings

var settings = new JsonSerializerOptions()
{
WriteIndented = true,
PropertyNameCaseInsensitive = true,
DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
NumberHandling = JsonNumberHandling.AllowReadingFromString | JsonNumberHandling.AllowNamedFloatingPointLiterals
};

By default the images are extract in a sub folder, with the uid as name. If you select the option "--IncludeImages" the image will be serialized in base64 in the json document.

PDF.Data.Extractor.Console.exe --help
Copyright (C) 2024 PDF.Data.Extractor.Console

PDF.Data.Extractor.Console 1.0.0
Copyright (c) Nexai.net

  -o, --output                Required. Director path where all the datablock
                              will be extract.

  -s, --source                (Group: SOURCE) Pdf file to extract.

  --sourceDir                 (Group: SOURCE) Folder to get all pdf files to
                              extract. (Default search only on top folder)

  -r, --recursive             Couple with option 'SourceDir' to search through
                              all the sub folder.

  -n, --outputName            Directory name create with extraction result;
                              default is the pdf name without extention

  -d, --OutputFolderName      Directory name create with extraction result;
                              default is the pdf name without extention

  -f, --force                 (Default: false) Override the ouput if already
                              exists

  --IncludeImages             (Default: false) If set to true the image content
                              will be integrated in the result json in bas64

  -t, --Timed                 (Default: false) Display computation time.

  --PreventParallelProcess    (Default: false) Define if page should be process
                              in parallel or sequential (Parallel reduce
                              processing time but cost more memory).

  --silent                    (Default: false) Only Write minimal process logs

  --maxConcurrentDocument     (Default: 0) Define number of concurrent document
                              are extract in parallel (default 0 => nb logical
                              processor / 4)

  --help                      Display this help screen.

  --version                   Display version information.

Viewer

The viewer /src/PDF.Data.Extractor.Viewer/ is a basic way to look graphically the data extracted.

Using

We advise to use the console line to extract information before using them in you solutions
A package 'Data.Block.Abstractions' is available on nuget to facilitate the json deserialization.

About

using open source library the goal on this program is to transform a pdf into data blocks with meta-data usable by any other program

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages