dex

Library index:

Reliable metadata handling from ISBN numbers
Robust, modern OCR of book indexes using neural networks pretrained for document image analysis
Data are managed as 'shelves':
- If installing from a clone of the git repo, the files will be stored within the repo
- If installing from PyPI, you must set the DEX_SHELVES environment variable with a path to locate the source files
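
For a PyPI install, a quick pre-flight check can confirm the variable is usable before loading a library. This is only a sketch: `shelves_dir` is a hypothetical helper, not part of dex, and it assumes nothing beyond what is stated above (that `DEX_SHELVES` holds a path to the source files). How dex itself reads the variable may differ.

```python
import os
from pathlib import Path


def shelves_dir(env=None) -> Path:
    """Return the directory named by DEX_SHELVES, or raise if it is unusable.

    Hypothetical helper for illustration; dex's own lookup may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("DEX_SHELVES")
    if not raw:
        raise RuntimeError("DEX_SHELVES is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"DEX_SHELVES does not point to a directory: {path}")
    return path
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.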

Further handling of the data remains to be done:

Clean up errors in the transcription (punctuation etc.) as far as possible
- Perhaps use low OCR confidence scores to flag words for manual correction, if viable?
- It may also be useful to try layoutparser as a '2nd opinion' (but the interface seems more complicated).
Geometry has been commented out of the data model. It could be useful for parsing the layout, but I would prefer to avoid it by using simple alphabetical heuristics.

Additionally, I suggest serialising the outputs to JSON and storing them alongside the images, to avoid rescanning the images each time.
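
The JSON caching idea could look roughly like this. It is a sketch under assumptions, not dex's implementation: `load_or_scan` is a hypothetical name, and `scan` stands in for whatever OCR step produces JSON-serialisable output.

```python
import json
from pathlib import Path


def load_or_scan(image_path, scan):
    """Return cached OCR output for an image, scanning only on a cache miss.

    `scan` is any callable taking a Path and returning JSON-serialisable
    data; it is a stand-in for the real OCR step (an assumption).
    """
    image_path = Path(image_path)
    cache_path = image_path.with_suffix(".json")  # page1.jpg -> page1.json
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = scan(image_path)
    cache_path.write_text(
        json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return result
```

Keeping the cache file next to the image means the ISBN-named folder stays the single source of truth; deleting the `.json` file forces a rescan.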

Requires

Python 3.10+

Installation

dex is available from PyPI (as spindex), and the code is on GitHub.

In theory you can install as follows:

pip install spindex[surya]

In practice, the suggested installation is stored in CONDA_SETUP.md:

conda create -n dex python=3.10 -y
conda activate dex
conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia
pip install spindex[surya]

See the PyTorch docs for other ways to install the PyTorch dependency.

Usage

To prepare your library, photograph the index pages and store them in folders named by the ISBNs of the books. You then load your library metadata in dex like so:

>>> import dex
>>> l = dex.load_library()
>>> l
Library of 6 books
>>> l.items[0].metadata
'📖: Szeliski (2010) Computer Vision - Algorithms And Applications'
>>> l.items[0].metadata.title
'Computer Vision - Algorithms And Applications'
>>> l.items[0].metadata.first_author
'Richard Szeliski'
>>> l.items[0].metadata.first_author.surname
'Szeliski'
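
Since the folder names are the ISBNs, a mistyped folder name is an easy mistake to make when preparing a library. The sketch below checks a name against the standard ISBN-10 and ISBN-13 check-digit rules. `is_valid_isbn` is a hypothetical helper, not part of dex, and whether dex accepts hyphenated or ISBN-10 folder names is an assumption left open here.

```python
def is_valid_isbn(name: str) -> bool:
    """Check the ISBN-10 or ISBN-13 check digit of a folder name."""
    s = name.replace("-", "").replace(" ", "").upper()
    if not s.isascii():
        return False
    if len(s) == 13 and s.isdigit():
        # ISBN-13: weights alternate 1, 3, 1, 3, ... and the sum is 0 mod 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
        return total % 10 == 0
    if len(s) == 10 and s[:9].isdigit() and (s[9].isdigit() or s[9] == "X"):
        # ISBN-10: weights run 10 down to 1, 'X' means 10, sum is 0 mod 11.
        digits = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
        total = sum(d * (10 - i) for i, d in enumerate(digits))
        return total % 11 == 0
    return False


print(is_valid_isbn("978-0-306-40615-7"))  # True
print(is_valid_isbn("9780306406158"))      # False (wrong check digit)
```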

There's a helper method, Library.scan(), to scan the images of all the items in the library (deprecated).

For a sense of what's inside, here's a "manual" version for one book:

Looking a bit closer, there is still some work to be done in cleaning up this output, but the results are very promising.

The first step of the following [dense] snippet is to:

process the library's ISBNs (the l = dex.load_library() helper function),
take the first item in the library (i = l.items[0]),
and scan its page images (i.scan_images()).

We can then group its words together by line, taking the first page image as an example (iterating through all the 'blocks' identified on it by the page layout detection algorithm).

An error creeps in here where the entry for "Alternating minimization" has 4 page numbers: 97, 106, 199, 252. The first three are on the same line as the entry label, but the 252 spills onto another line and is split apart by the layout parsing algorithm. Fortunately this can be detected in a couple of ways:

Entries in an index should be in monotonic alphabetical order, and each block offers multiple chances to check this.
A number would only appear in a block on its own if it were the page number of the index page itself, or if a page reference had been split apart like this.
Note that the block from 'Averages' through to 'Bregman distance' was inserted in between. This is in fact the entire right column of the page, so moving it after the block that alphabetically precedes it (the block from 'Antisymmetric' through 'Average pooling') would place it in the correct position.
Care must be taken, as a page number could also appear at the top of the page. I suspect this will end up being unambiguous; however, a multi-page processor should be able to identify monotonically increasing sequences of consecutive numbers and label them as the page numbers of the index.
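
The checks described above can be sketched on plain data. Everything here is an assumption for illustration: blocks are modelled as lists of line strings (not dex's data model), the helper names are hypothetical, and only the "Alternating minimization" page numbers come from the example above.

```python
import re

NUMBER_ONLY = re.compile(r"^\d+(\s*,\s*\d+)*,?$")
TRAILING_PAGES = re.compile(r"[\s,]*\d[\d\s,\-–]*$")


def is_lone_number(block):
    """True if the block is a single line holding nothing but numbers.

    Such a block is either a split-off page reference or the page number
    printed on the index page itself; this check cannot tell which.
    """
    return len(block) == 1 and NUMBER_ONLY.match(block[0].strip()) is not None


def label(line):
    """Entry label with trailing page numbers removed, casefolded for sorting."""
    return TRAILING_PAGES.sub("", line).strip().casefold()


def order_breaks(blocks):
    """Indices of text blocks that start before the previous text block ended.

    A break at index i means block i or its predecessor is out of place.
    """
    breaks = []
    previous_last = None
    for i, block in enumerate(blocks):
        if is_lone_number(block):
            continue  # numbers carry no alphabetical information
        if previous_last is not None and label(block[0]) < previous_last:
            breaks.append(i)
        previous_last = label(block[-1])
    return breaks


def consecutive_run(numbers):
    """True if the numbers (one candidate per scanned page) rise by exactly 1."""
    return len(numbers) > 1 and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))


blocks = [
    ["Alternating minimization 97, 106, 199"],
    ["Averages", "Bregman distance"],    # right column, inserted too early
    ["252"],                             # spilled page reference
    ["Antisymmetric", "Average pooling"],
]
print([i for i, b in enumerate(blocks) if is_lone_number(b)])  # [2]
print(order_breaks(blocks))                                    # [3]
```

The break is reported at the 'Antisymmetric' block because it sorts before 'Bregman distance'; deciding which of the two neighbouring blocks to move is left to a later repair step.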

DEPRECATED

Previously, an instance of the DocTreeDoc class (a nested data structure with all the words, lines, etc.) was stored on the scanned attribute of each Book in the Library object you get from dex.load_library().

A naive approach can work with these to create the labels, but it helps to use geometry for c(l)ues.

The bounding boxes in the geometry are given as the (x, y) coordinates of the top-left and bottom-right corners. It's clear that the first and second rows are aligned on their left-hand side, as the x values of their top-left corners are very similar: mostly 0.1044 and 0.1035. However, the second block also has 0.140 and 0.141, which could indicate either a column or an indent in the entry.
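
One way to turn those left-edge x values into alignment groups is a simple gap-based clustering. This is a minimal sketch: `indent_levels` is a hypothetical helper, and the tolerance of 0.01 is a guess for normalised coordinates rather than a tuned value. On its own it cannot distinguish a new column from an indent.

```python
def indent_levels(left_xs, tolerance=0.01):
    """Map each left-edge x value to an alignment level (0 = leftmost group).

    Sorted values closer than `tolerance` to their predecessor share a level,
    so a long chain of small gaps would merge into one group.
    """
    levels = {}
    level = -1
    previous = None
    for x in sorted(set(left_xs)):
        if previous is None or x - previous > tolerance:
            level += 1  # gap is large enough to start a new group
        levels[x] = level
        previous = x
    return [levels[x] for x in left_xs]


print(indent_levels([0.1044, 0.1035, 0.140, 0.141]))  # [0, 0, 1, 1]
```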

An indent in the entry can be either a sub-entry or the continuation of the previous line (in which case you'd expect it to be a numeric string, though possibly a Roman numeral!).
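
That distinction can be sketched as a text-only test on the indented line. The names below are hypothetical and the rule is deliberately simple: a line made up solely of Arabic or Roman page references (optionally in ranges) is treated as a continuation, and anything else as a sub-entry. A sub-entry whose whole text happens to be a valid Roman numeral (such as "mix") would be misclassified.

```python
import re

# Strict Roman numeral (1 to 3999); the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})",
    re.IGNORECASE,
)


def is_page_ref(token):
    """True for an Arabic number or a Roman numeral."""
    token = token.strip()
    is_arabic = token.isascii() and token.isdigit()
    return is_arabic or ROMAN.fullmatch(token) is not None


def classify_indented(text):
    """Label an indented line as a 'continuation' or a 'sub-entry'."""
    # Split on commas and range dashes, dropping empty pieces.
    tokens = [t for t in re.split(r"[,\-–]", text) if t.strip()]
    if tokens and all(is_page_ref(t) for t in tokens):
        return "continuation"
    return "sub-entry"


print(classify_indented("252"))            # continuation
print(classify_indented("xii, 14"))        # continuation
print(classify_indented("in graph cuts"))  # sub-entry
```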
