ocr_dataset

This dataset is intended for evaluating document layout analysis and OCR. It was created with a semi-supervised approach as part of the I-Assistant project, funded by Microsoft's AI for Accessibility program. It contains 98 documents, categorized by column structure (single-column vs. multi-column) and page content (figures, math, tables, or plain text).
The annotations folder contains JSON files with the following keys:

multicolumn - Boolean indicating whether the document has a multi-column or single-column layout.
figures - a list of bounding boxes for the images on the page, if any.
tables - a list of table elements. Each table element is a two-element list: the first element is the table's location (bounding box) and the second is the table's content in CSV format.
maths - a list of math elements. Each math element is a three-element list: the first element is the location (bounding box), the second is the full text content of that line, and the third is a list of MathML representations of the math symbols on that line.
text - a string containing all the text on the page.
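
The keys listed above can be unpacked along the following lines. This is a minimal sketch, not code shipped with the dataset: the function name `summarize_annotation` and the sample record are hypothetical, and it assumes each annotation file holds one JSON object per page with exactly the keys described here.

```python
import csv
import io


def summarize_annotation(record):
    """Unpack one annotation record (a dict loaded from a JSON file).

    Assumes the documented layout: `tables` holds [bbox, csv_text] pairs
    and `maths` holds [bbox, line_text, mathml_list] triples.
    """
    tables = []
    for bbox, csv_text in record.get("tables", []):
        # Table content is stored as CSV text, so parse it into rows.
        rows = list(csv.reader(io.StringIO(csv_text)))
        tables.append({"bbox": bbox, "rows": rows})

    maths = []
    for bbox, line_text, mathml in record.get("maths", []):
        maths.append({"bbox": bbox, "text": line_text, "mathml": mathml})

    return {
        "multicolumn": bool(record.get("multicolumn", False)),
        "figures": record.get("figures", []),
        "tables": tables,
        "maths": maths,
        "text": record.get("text", ""),
    }


# Hypothetical record, for illustration only (not taken from the dataset).
example = {
    "multicolumn": False,
    "figures": [],
    "tables": [[[0.5, 0.5, 0.4, 0.2], "a,b\n1,2"]],
    "maths": [[[0.5, 0.8, 0.6, 0.05], "x = 1", ["<mi>x</mi>"]]],
    "text": "x = 1",
}
summary = summarize_annotation(example)
print(summary["tables"][0]["rows"])
```

In practice the record would come from `json.load` on a file in the annotations folder; the file naming scheme is not described here, so it is left out of the sketch.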
Bounding boxes are in relative YOLO format.
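
To draw or crop a box, the relative coordinates need to be scaled to the page image. The sketch below assumes the common YOLO ordering `(x_center, y_center, width, height)`, with every value expressed as a fraction of the page size; the exact ordering used in these annotations is not stated here, so check it against a known page before relying on it. The function name `yolo_to_pixels` is hypothetical.

```python
def yolo_to_pixels(box, page_width, page_height):
    """Convert a relative YOLO box to pixel corners (left, top, right, bottom).

    Assumption: `box` is (x_center, y_center, width, height), each in [0, 1].
    """
    x_center, y_center, width, height = box
    left = (x_center - width / 2) * page_width
    top = (y_center - height / 2) * page_height
    right = (x_center + width / 2) * page_width
    bottom = (y_center + height / 2) * page_height
    return (round(left), round(top), round(right), round(bottom))


# A box centred on a 1000 x 800 px page, half as wide and a quarter as tall.
print(yolo_to_pixels((0.5, 0.5, 0.5, 0.25), 1000, 800))  # (250, 300, 750, 500)
```

The page width and height come from the rendered page image, so the same annotation works at any rendering resolution.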
