Try approaches to extract images using OCR #30

manuGil · 2023-06-09T15:22:06Z

The following tools have been suggested:

Tesseract OCR: Tesseract is a widely-used OCR engine that supports multiple languages and can be integrated with various programming languages. It has the ability to detect text within an image, and you can leverage its text extraction capabilities to identify regions of the document that do not contain text, which could be potential image regions.
OpenCV with OCR: OpenCV can be used alongside OCR libraries like Tesseract to perform more complex document analysis tasks. You can use OpenCV's image processing functions to preprocess the document, isolate text regions, and then pass those regions to an OCR engine for text extraction. The remaining regions without recognized text are likely to contain images.
Pytesseract: Pytesseract is a Python wrapper for Tesseract OCR. It provides an easy-to-use interface to integrate Tesseract into your Python code. You can combine Pytesseract with OpenCV to preprocess the image document, extract text regions, and identify image regions accordingly.
OCRopus: OCRopus is an open-source OCR system whose development was sponsored by Google and that includes various document analysis tools. It provides features for layout analysis, including text and image region identification. OCRopus can be used to preprocess the document and analyze its layout to differentiate between text and image regions.
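
As a rough illustration of the idea shared by these suggestions (regions without recognized text are candidate image regions), the sketch below collects word boxes with Pytesseract and reports the horizontal bands of the page that contain no text. The function names, the `min_height` threshold and the file name `page.jpg` are assumptions made for illustration, not part of the project. Treating whole-width bands as candidates is a simplification that ignores multi-column layouts.

```python
def ocr_word_boxes(image_path):
    """Return (left, top, width, height) for each word Tesseract recognises.

    Requires the Tesseract binary plus the pytesseract and Pillow packages;
    they are imported lazily so the pure helper below works without them.
    """
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    boxes = []
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():  # keep only entries that carry visible characters
            boxes.append((left, top, width, height))
    return boxes


def text_free_bands(page_height, text_boxes, min_height=50):
    """Return (top, bottom) bands of the page that contain no text box."""
    # Vertical intervals occupied by text, sorted from top to bottom.
    intervals = sorted((top, top + height) for _, top, _, height in text_boxes)
    bands = []
    cursor = 0  # y coordinate up to which the page has been scanned
    for start, end in intervals:
        if start - cursor >= min_height:
            bands.append((cursor, start))
        cursor = max(cursor, end)
    if page_height - cursor >= min_height:
        bands.append((cursor, page_height))
    return bands
```

A call such as `text_free_bands(page_height, ocr_word_boxes("page.jpg"))` would then list the bands worth inspecting for images, where `page_height` is the pixel height of the page image.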

manuGil · 2023-06-12T08:27:56Z

I checked the Tesseract OCR tool. It is very good at extracting text from images; however, I cannot clearly see how it can help us extract images or floorplans.

For example, the following page was converted to JPEG and then processed with Tesseract OCR. The result contains the text in the image.

Result

RESEARCH

Figure 13: Aerial of Rotterdam Port and Makers District (Port of Rotterdam, 2020) Source: https://www.
portofrotterdam.com/en/news-and-press-releases/rdm-rotterdam-and-m4h-rotterdam-together-form-the-makers-

district

Figure 14: Waag Textile Lab, Amsterdam (Circl, 2020)
Source: https://waag.org/en/article/experimenting-
alternative-textiles

14

Figure 15: Blue City Labs, Rotterdam (BlueCity, 2020)

Source: https://en.rotterdampartners.nl/stories/bluecity-

circular-playground-with-balls/

5.4.1 Maker Labs

Apart from food production, more self-sufficient
and resourceful communities can be realised
through Maker spaces and labs for circular product
creation. Recent research has been carried out
regarding the potentials for recycled waste and
bio-based material in the production of consumer
goods, such as clothing and craft products. Food
waste and matter, bacteria and enzymes can be
used to develop innovative materials such as
Nullarbor and Woocoa that can be made readily
available for use as fabrics. Moreover, food and
crop waste can be utilised for the creation of
dyes and a variety of sustainable products “from
insulation panels to phone cases” (Hitti, 2019).
Furthermore, 3D printing technology has changed
the way that products can be efficiently and
locally manufactured, with sustainable materials
for printing being researched and developed.
Examples of thriving and...

manuGil · 2023-06-20T08:37:40Z

After a few tries, I found out that Tesseract can indeed report areas on a page that are not text. More importantly, we can use the bounding boxes (in red) to extract visuals.
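
One way to get at such non-text areas from Python, sketched here under assumptions: `pytesseract.image_to_data` returns layout entries grouped by `block_num`, and blocks in which no word carries visible text can be treated as candidate visuals. Whether this matches the method used to produce the red boxes is not stated in this thread. The helper names, the `min_area` threshold and the output file prefix are hypothetical.

```python
def candidate_visual_boxes(data, min_area=10_000):
    """Return (left, top, right, bottom) for blocks that hold no recognised text.

    `data` is the dictionary produced by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    block_box = {}    # block_num -> bounding box of the block
    has_text = set()  # block numbers containing at least one visible word
    for i, level in enumerate(data["level"]):
        block = data["block_num"][i]
        if level == 2:  # level 2 entries describe whole blocks
            left, top = data["left"][i], data["top"][i]
            block_box[block] = (left, top,
                                left + data["width"][i],
                                top + data["height"][i])
        elif level == 5 and data["text"][i].strip():  # level 5 entries are words
            has_text.add(block)
    return [
        box for block, box in sorted(block_box.items())
        if block not in has_text
        and (box[2] - box[0]) * (box[3] - box[1]) >= min_area
    ]


def crop_visuals(image_path, out_prefix="visual"):
    """Run Tesseract on a page image and save each candidate visual as PNG."""
    from PIL import Image  # lazy imports: only needed when OCR is actually run
    import pytesseract

    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    paths = []
    for n, box in enumerate(candidate_visual_boxes(data)):
        path = f"{out_prefix}_{n}.png"
        image.crop(box).save(path)
        paths.append(path)
    return paths
```

The `min_area` filter is there to drop small specks such as rules or stray marks; a suitable value depends on the scan resolution.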

manuGil · 2023-06-20T08:40:20Z

When it comes to identifying bounding boxes of vector-line visuals, the result has artefacts. For example, Tesseract may return multiple boxes for the same visual, or boxes that do not cover the visual in its entirety.
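
The duplicate-box artefact can often be reduced by merging boxes that overlap or nearly touch. The sketch below is one simple way to do that and is not taken from the project: `boxes_touch`, `merge_boxes` and the `gap` tolerance are hypothetical, and boxes are assumed to be `(left, top, right, bottom)` tuples in pixels.

```python
def boxes_touch(a, b, gap=0):
    """True if boxes a and b overlap or lie within `gap` pixels of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge_boxes(boxes, gap=10):
    """Repeatedly union touching boxes until no pair can be merged."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            for i, other in enumerate(result):
                if boxes_touch(box, other, gap):
                    # Replace the stored box with the union of both boxes.
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged
```

Merging only addresses the case of several boxes for one visual. Boxes that are too small for the visual would need a different remedy, for example padding each box before cropping.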

manuGil self-assigned this Jun 9, 2023

manuGil added the enhancement New feature or request label Jun 12, 2023

manuGil mentioned this issue Jun 26, 2023

Feature/pipeline #33

Merged

manuGil closed this as completed in #33 Jun 26, 2023
