Try approaches to extract images using OCR #30

Closed
manuGil opened this issue Jun 9, 2023 · 3 comments · Fixed by #33

manuGil commented Jun 9, 2023

The following tools have been suggested:

  • Tesseract OCR: Tesseract is a widely-used OCR engine that supports multiple languages and can be integrated with various programming languages. It has the ability to detect text within an image, and you can leverage its text extraction capabilities to identify regions of the document that do not contain text, which could be potential image regions.

  • OpenCV with OCR: OpenCV can be used alongside OCR libraries like Tesseract to perform more complex document analysis tasks. You can use OpenCV's image processing functions to preprocess the document, isolate text regions, and then pass those regions to an OCR engine for text extraction. The remaining regions without recognized text are likely to contain images.

  • Pytesseract: Pytesseract is a Python wrapper for Tesseract OCR. It provides an easy-to-use interface to integrate Tesseract into your Python code. You can combine Pytesseract with OpenCV to preprocess the image document, extract text regions, and identify image regions accordingly.

  • OCRopus: OCRopus is an open-source OCR system, originally developed with sponsorship from Google, that includes various document analysis tools. It provides features for layout analysis, including text and image region identification. OCRopus can be used to preprocess the document and analyze its layout to differentiate between text and image regions.

manuGil self-assigned this Jun 9, 2023
manuGil added the enhancement label Jun 12, 2023

manuGil commented Jun 12, 2023

I checked the Tesseract OCR tool. It is very good at extracting text from images, but I cannot clearly see how it can help us extract images or floorplans.

For example, the following page was converted to JPEG and then processed with Tesseract OCR. The result contains the text in the image.

multi-image-caption

Result

RESEARCH

Figure 13: Aerial of Rotterdam Port and Makers District (Port of Rotterdam, 2020) Source: https://www.
portofrotterdam.com/en/news-and-press-releases/rdm-rotterdam-and-m4h-rotterdam-together-form-the-makers-

district

Figure 14: Waag Textile Lab, Amsterdam (Circl, 2020)
Source: https://waag.org/en/article/experimenting-
alternative-textiles

14

Figure 15: Blue City Labs, Rotterdam (BlueCity, 2020)

Source: https://en.rotterdampartners.nl/stories/bluecity-

circular-playground-with-balls/

5.4.1 Maker Labs

Apart from food production, more self-sufficient
and resourceful communities can be realised
through Maker spaces and labs for circular product
creation. Recent research has been carried out
regarding the potentials for recycled waste and
bio-based material in the production of consumer
goods, such as clothing and craft products. Food
waste and matter, bacteria and enzymes can be
used to develop innovative materials such as
Nullarbor and Woocoa that can be made readily
available for use as fabrics. Moreover, food and
crop waste can be utilised for the creation of
dyes and a variety of sustainable products “from
insulation panels to phone cases” (Hitti, 2019).
Furthermore, 3D printing technology has changed
the way that products can be efficiently and
locally manufactured, with sustainable materials
for printing being researched and developed.
Examples of thriving and... 
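
A result like the one above can be reproduced from Python. The sketch below assumes the `pytesseract` wrapper and Pillow are installed alongside the Tesseract binary; the issue does not say which interface was actually used, and `page.jpg` is a hypothetical file name.

```python
def ocr_page(image_path: str, lang: str = "eng") -> str:
    """Return the plain text Tesseract recognises in one page image."""
    # Imported lazily so the module loads even where Tesseract is missing.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        # image_to_string runs the full OCR pipeline and returns only text;
        # it gives no information about where the pictures are.
        return pytesseract.image_to_string(page, lang=lang)


if __name__ == "__main__":
    print(ocr_page("page.jpg"))  # hypothetical input file
```

The plain-text output explains the concern raised here: the captions are recovered, but nothing in the string points to the figures themselves.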

manuGil commented Jun 20, 2023

After a few tries, I found that Tesseract can indeed report the areas of a page that are not text. More importantly, we can use the bounding boxes of those areas (in red) to extract visuals.

page-100dpi
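
One way to get such boxes, sketched under assumptions: `pytesseract.image_to_data` returns one row per layout element, where level 2 rows are blocks and level 5 rows are words. The heuristic below treats a block with no non-blank word as a candidate visual. This heuristic and the names `non_text_blocks` and `extract_visuals` are illustrative, not necessarily what was used to draw the red boxes above.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def non_text_blocks(data: Dict[str, list]) -> List[Box]:
    """Return boxes of layout blocks that contain no recognised words.

    `data` is shaped like pytesseract.image_to_data(..., output_type=DICT)
    for a single page.
    """
    blocks: Dict[int, Box] = {}
    has_text = set()
    for i, level in enumerate(data["level"]):
        block = data["block_num"][i]
        if level == 2:  # block-level row carries the block's bounding box
            left, top = data["left"][i], data["top"][i]
            blocks[block] = (left, top,
                             left + data["width"][i], top + data["height"][i])
        elif level == 5 and str(data["text"][i]).strip():
            has_text.add(block)  # block holds at least one real word
    return [box for num, box in blocks.items() if num not in has_text]


def extract_visuals(image_path: str, out_prefix: str = "visual") -> int:
    """Crop every non-text block to its own PNG; returns the number saved."""
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        data = pytesseract.image_to_data(
            page, output_type=pytesseract.Output.DICT)
        boxes = non_text_blocks(data)
        for n, box in enumerate(boxes):
            page.crop(box).save(f"{out_prefix}-{n}.png")
    return len(boxes)
```

Keeping the box selection separate from the Tesseract call makes it easy to try other rules (minimum area, aspect ratio) on the same data.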

manuGil commented Jun 20, 2023

When identifying bounding boxes of vector-line visuals, the result has artefacts. For example, the same visual may get multiple boxes, or a box may not cover the visual in its entirety.

page-21
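
A possible post-processing step for the duplicate-box artefact, offered as a sketch rather than as what the fix for this issue does: merge boxes that overlap, or that lie within a small `margin` of each other, into their common bounding box. The function names and the `margin` parameter are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def overlaps(a: Box, b: Box, margin: int = 0) -> bool:
    """True if the boxes intersect once each side is padded by `margin`."""
    return (a[0] - margin <= b[2] and b[0] - margin <= a[2]
            and a[1] - margin <= b[3] and b[1] - margin <= a[3])


def merge_boxes(boxes: List[Box], margin: int = 0) -> List[Box]:
    """Merge overlapping or nearby boxes into their common bounding box."""
    merged: List[Box] = []
    for box in boxes:
        current = box
        changed = True
        while changed:  # growing a box may make it touch further boxes
            changed = False
            for other in merged:
                if overlaps(current, other, margin):
                    merged.remove(other)
                    current = (min(current[0], other[0]),
                               min(current[1], other[1]),
                               max(current[2], other[2]),
                               max(current[3], other[3]))
                    changed = True
                    break
        merged.append(current)
    return merged
```

A larger `margin` joins fragments of the same drawing more aggressively, at the risk of fusing two separate visuals that sit close together on the page.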
