# OCR Demo: Full vs. Partial OCR of Pages
As of v1.19.1, PyMuPDF offers two ways to OCR a document page: **_full_** or **_partial_**. In both cases, a `TextPage` object is created, which can be used for text extraction and text searches as usual. All text processing methods accept the new parameter `textpage`, which lets them reference the OCR result.
* A **_full OCR_** renders the page as an image at the desired resolution and interprets it.
   - All **_visible text_** on the page will be OCRed.
   - All text will have Tesseract's "GlyphLessFont".
   - May take around 2 seconds, depending on the amount of text and the DPI.
* A **_partial OCR_** interprets only the images displayed by the page.
   - The DPI parameter is not needed, because the original images are OCRed.
   - Text will be a **_mixture of normal and OCR text_**. Normal text retains its properties.
   - Can be much faster than a full OCR.

In [41]:
import fitz

if tuple(map(int, fitz.VersionBind.split("."))) < (1, 19, 1):
    raise ValueError("Need at least v1.19.1 of PyMuPDF")

# example PDF contains normal text and two overlapping images
doc = fitz.open("pipeline_extracao_documentos/2_documentos_para_extracao/01_testes/2023143.pdf")
page = doc[0]

## Full Page OCR
First make a **_full page OCR_**. Please take a look at the PDF and note the two little text lines. They are contained in a separate, non-transparent image, which covers some text of the larger image underneath it.

In [60]:
# make the TextPage object. It does all the OCR.
full_tp = page.get_textpage_ocr(flags=1, dpi=300, full=True, language='por')

# now look at what we have got
print(page.get_text(textpage=full_tp))

Número da Nota:
COLERAO INICIAL
2023142
ii
ma
PREFEITURA MUNICIPAL DE SAO PEDRO DA ALDEIA
Competência:
w |Pao
SECRETARIA MUNICIPAL DA FAZENDA
Julho/2023
VOU
UU
UU
NOTA FISCAL DE SERVIÇOS ELETRÔNICA - NFS-e
HIOTIZODS 143.00
QUALIDADE
DE VIDA PARA TODOS
Código Verificação:
48D30FBCA
PRESTADOR DE SERVIÇOS
CPF/CNPJ:
Inscrição Municipal:
12.852.627/0001-50
710407
Telefone:
Inscrição Estadual:
2226274424..
79.562.319
Nome/Razão Social:
D. VITORIANO PEREIRA
Nome de Fantasia:
dk
Endereço:
EST PAU FERRO ,SN LT 55
A,CAMPO REDONDO- São Pedro da Aldeia-RJ
E-mail:
D.VITORIANOQOLIVE.COM
TOMADOR DE SERVIÇOS
CPF/CNPJ:
RG:
04.968.448/0001-54
|
'
INSC:MUNICIPAL:
Telefone:
Inscrição Estadual:
Nome/Razão Social:
RECICLAR - RECICLAGEM ARARUAMA LTDA
Endereço:
CAJAZEIROS Nº 0 LT 8 QD B BAIRRO: PARQUE HOTEL CIDADE: ARARUAMA - RJ CEP: 28981382
E-mail:
Não Informado
DISCRIMINAÇÃO DOS SERVIÇOS
SERVIÇO DE SOLDA NO PISTÃO HIDRAULICO . OS 12497
VALOR TOTAL DA NOTA: R$ 250,00
CNAE - 2539-0/01 - SERVIÇOS DE USINAGEM,

Or blockwise output, getting rid of some of the unwanted linebreaks:

In [61]:
blocks = page.get_text("blocks", textpage=full_tp)
for b in blocks:
    print(b[4].replace("\n", " "))

Número da Nota: 
COLERAO INICIAL 2023142 ii ma PREFEITURA MUNICIPAL DE SAO PEDRO DA ALDEIA Competência: w |Pao SECRETARIA MUNICIPAL DA FAZENDA Julho/2023 VOU UU UU NOTA FISCAL DE SERVIÇOS ELETRÔNICA - NFS-e HIOTIZODS 143.00 QUALIDADE DE VIDA PARA TODOS Código Verificação: 48D30FBCA 
PRESTADOR DE SERVIÇOS 
CPF/CNPJ: Inscrição Municipal: 12.852.627/0001-50 710407 
Telefone: Inscrição Estadual: 2226274424.. 79.562.319 Nome/Razão Social: 
D. VITORIANO PEREIRA Nome de Fantasia: dk 
Endereço: EST PAU FERRO ,SN LT 55 A,CAMPO REDONDO- São Pedro da Aldeia-RJ E-mail: D.VITORIANOQOLIVE.COM 
TOMADOR DE SERVIÇOS 
CPF/CNPJ: RG: 04.968.448/0001-54 | ' INSC:MUNICIPAL: 
Telefone: Inscrição Estadual: 
Nome/Razão Social: RECICLAR - RECICLAGEM ARARUAMA LTDA Endereço: CAJAZEIROS Nº 0 LT 8 QD B BAIRRO: PARQUE HOTEL CIDADE: ARARUAMA - RJ CEP: 28981382 E-mail: Não Informado 
DISCRIMINAÇÃO DOS SERVIÇOS 
SERVIÇO DE SOLDA NO PISTÃO HIDRAULICO . OS 12497 
VALOR TOTAL DA NOTA: R$ 250,00 
CNAE - 2539-0/01 - SERVIÇO

In [68]:
words = page.get_text("words", textpage=full_tp)


In [111]:
string = "Nota:"
for word in words:
    if string in word:  # a word is a tuple, so this matches one item exactly
        print(word)
        for item in word:  # x0, y0, x1, y1, text, block_no, line_no, word_no
            print(item)

(537.251708984375, 37.91236877441406, 552.3685913085938, 42.95381546020508, 'Nota:', 0, 0, 2)
537.251708984375
37.91236877441406
552.3685913085938
42.95381546020508
Nota:
0
0
2
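
The eight values printed above follow the layout `(x0, y0, x1, y1, "word", block_no, line_no, word_no)`. As a minimal sketch, the repeated search loops in this notebook could be wrapped in one helper that returns named fields. The names `Word` and `find_words` are hypothetical and not part of PyMuPDF; the sample tuples are copied from outputs shown in this notebook.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One entry of page.get_text("words"), with named fields."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
    block_no: int
    line_no: int
    word_no: int


def find_words(words, target):
    """Return every word whose text equals `target` exactly."""
    return [Word(*w) for w in words if w[4] == target]


# Sample tuples copied from the outputs in this notebook.
sample_words = [
    (537.251708984375, 37.91236877441406, 552.3685913085938, 42.95381546020508, "Nota:", 0, 0, 2),
    (44.151100158691406, 144.21090698242188, 70.95127868652344, 152.3732452392578, "Telefone:", 4, 0, 0),
    (44.151100158691406, 267.54608154296875, 70.98370361328125, 275.948486328125, "Telefone:", 9, 0, 0),
]

for hit in find_words(sample_words, "Telefone:"):
    print(hit.text, hit.block_no, round(hit.y0, 1))
```

With real data, `find_words(words, "Telefone:")` would be called on the list returned by `page.get_text("words", textpage=full_tp)`.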


In [117]:
string = "PREFEITURA"
for word in words:
    if string in word:
        print(word)
       
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)        

(125.25474548339844, 47.9903450012207, 210.37762451171875, 61.19412612915039, 'PREFEITURA', 1, 4, 0)


In [118]:
string = "PRESTADOR"
for word in words:
    if string in word:
        print(word)


(254.1087646484375, 113.01724243164062, 295.37811279296875, 119.25902557373047, 'PRESTADOR', 2, 0, 0)


In [119]:
string = "RG:"
for word in words:
    if string in word:
        print(word)


(197.48019409179688, 242.35116577148438, 208.1700439453125, 253.15426635742188, 'RG:', 8, 1, 0)


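In [None]:
import pytesseract  # needs the Tesseract binary installed on the system


def extract_text_from_coordinates(image, coordinates, config):
    """OCR the rectangle (x0, y0, x1, y1), given in pixels, of a PIL-style image."""
    x0, y0, x1, y1 = coordinates
    frame_image = image.crop((x0, y0, x1, y1))  # cut out the region of interest
    extracted_text = pytesseract.image_to_string(frame_image, lang='por', config=config).strip()
    return extracted_text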
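
One thing to watch when feeding word rectangles into a function like `extract_text_from_coordinates`: PyMuPDF reports rectangles in PDF points (72 per inch), while `image.crop` works in pixels. If the image was rendered from the page at a given DPI, the rectangle has to be scaled by `dpi / 72` first. The helper below is a sketch under those assumptions (the name `points_to_pixels` is hypothetical, and it assumes an unrotated image of the whole page with its origin at the top-left).

```python
import math


def points_to_pixels(bbox, dpi):
    """Scale a rectangle from PDF points (72 per inch) to pixels at `dpi`."""
    x0, y0, x1, y1 = bbox
    # Round the top-left corner down and the bottom-right corner up,
    # so the crop never cuts into the text.
    return (
        math.floor(x0 * dpi / 72),
        math.floor(y0 * dpi / 72),
        math.ceil(x1 * dpi / 72),
        math.ceil(y1 * dpi / 72),
    )


# Rectangle of the word "RG:" as reported in this notebook.
rg_bbox = (197.48019409179688, 242.35116577148438, 208.1700439453125, 253.15426635742188)
print(points_to_pixels(rg_bbox, 300))
```

The result can be passed as `coordinates`, provided the image was rendered at the same DPI.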

In [120]:
string = "Telefone:"
for word in words:
    if string in word:
        print(word)


(44.151100158691406, 144.21090698242188, 70.95127868652344, 152.3732452392578, 'Telefone:', 4, 0, 0)
(44.151100158691406, 267.54608154296875, 70.98370361328125, 275.948486328125, 'Telefone:', 9, 0, 0)


In [121]:
string = "TOMADOR"
for word in words:
    if string in word:
        print(word)


(257.4681091308594, 230.11361694335938, 291.9684143066406, 236.35540771484375, 'TOMADOR', 7, 0, 0)


In [122]:
string = "CPF/CNPJ:"
for word in words:
    if string in word:
        print(word)


(44.39105224609375, 126.21454620361328, 78.40419006347656, 134.61695861816406, 'CPF/CNPJ:', 3, 0, 0)
(44.39105224609375, 242.35116577148438, 78.40419006347656, 253.15426635742188, 'CPF/CNPJ:', 8, 0, 0)


In [78]:
string = "DISCRIMINAÇÃO"
for word in words:
    if string in word:
        print(word)

(244.99061584472656, 334.25262451171875, 299.1429748535156, 341.69476318359375, 'DISCRIMINAÇÃO', 11, 0, 0)


In [82]:
string = "TOTAL"
for word in words:
    if string in word:
        print(word)

(241.87123107910156, 516.3759155273438, 280.2154846191406, 527.6591186523438, 'TOTAL', 13, 0, 1)


In [83]:
string = "CNAE"
for word in words:
    if string in word:
        print(word)

(44.39105224609375, 533.8923950195312, 62.5457878112793, 543.2550659179688, 'CNAE', 14, 0, 0)


In [85]:
string = "Lista"
for word in words:
    if string in word:
        print(word)

(69.10607147216797, 541.5708618164062, 82.44136810302734, 550.9335327148438, 'Lista', 14, 1, 2)


In [86]:
string = "COMPLEMENTARES"
for word in words:
    if string in word:
        print(word)

(278.82379150390625, 658.9071655273438, 344.431396484375, 663.9486083984375, 'COMPLEMENTARES', 17, 0, 1)


In [87]:
string = "CRITICAS"
for word in words:
    if string in word:
        print(word)

(326.3341979980469, 692.5004272460938, 357.2782897949219, 699.9425659179688, 'CRITICAS', 19, 0, 3)


In [89]:
string = "Observação:"
for word in words:
    if string in word:
        print(word)

(46.790565490722656, 737.611328125, 86.32979583740234, 744.333251953125, 'Observação:', 21, 0, 0)


In [116]:
string = "Sistema"
for word in words:
    if string in word:
        print(word)

(219.07583618164062, 764.9657592773438, 244.4723663330078, 771.4476318359375, 'Sistema', 22, 0, 0)


In [67]:
for word in page.get_text("words", sort=False):
    print(word)

Not very impressive either way: the original text (last 4 lines) was detected correctly, but the text in the pictures looks quite garbled, which is no surprise.

> Please note that the OCR process scans the page from top-left to bottom-right, which is therefore also the sequence of the extraction.

This is what we get when looking at details of each text span:

In [62]:
for block in page.get_text("dict", textpage=full_tp)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short in the display
            print( span["font"], bbox, span["text"])

GlyphLessFont IRect(501, 37, 526, 43) Número
GlyphLessFont IRect(525, 37, 535, 43)  da
GlyphLessFont IRect(534, 37, 553, 43)  Nota:
GlyphLessFont IRect(63, 45, 79, 55) COLERAO
GlyphLessFont IRect(78, 45, 100, 55)  INICIAL
GlyphLessFont IRect(524, 45, 553, 55) 2023142
GlyphLessFont IRect(47, 47, 67, 62) ii
GlyphLessFont IRect(86, 47, 87, 62) ma
GlyphLessFont IRect(125, 47, 211, 62) PREFEITURA
GlyphLessFont IRect(210, 47, 288, 62)  MUNICIPAL
GlyphLessFont IRect(287, 47, 310, 62)  DE
GlyphLessFont IRect(309, 47, 343, 62)  SAO
GlyphLessFont IRect(342, 47, 395, 62)  PEDRO
GlyphLessFont IRect(394, 47, 419, 62)  DA
GlyphLessFont IRect(418, 47, 472, 62)  ALDEIA
GlyphLessFont IRect(511, 47, 552, 62) Competência:
GlyphLessFont IRect(47, 57, 55, 77) w
GlyphLessFont IRect(54, 57, 60, 77)  |
GlyphLessFont IRect(62, 57, 117, 77) Pao
GlyphLessFont IRect(227, 57, 276, 77) SECRETARIA
GlyphLessFont IRect(275, 57, 318, 77)  MUNICIPAL
GlyphLessFont IRect(317, 57, 332, 77)  DA
GlyphLessFont IRect(331, 57, 

## Partial OCR
Let's see what a **_partial OCR_** can do for us.

A partial OCR `TextPage` internally stores text in the following sequence:
1. Normal text
2. OCR text from images in the same sequence as the page displays those images

So we had better use the `sort` parameter of text extraction, which puts normal and OCR text back into their positional order.
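
To make "sort by vertical, then horizontal" concrete, here is a rough sketch on plain tuples. It only illustrates the idea and is not PyMuPDF's implementation, whose exact sort key may differ. The function name `sort_reading_order` is hypothetical; the rectangles are rounded values taken from word outputs in this notebook, deliberately shuffled.

```python
def sort_reading_order(items):
    """Sort (x0, y0, x1, y1, text) tuples by top edge, then by left edge."""
    return sorted(items, key=lambda item: (item[1], item[0]))


# Rounded rectangles from this notebook's word outputs, in shuffled order.
stored = [
    (241.9, 516.4, 280.2, 527.7, "TOTAL"),
    (537.3, 37.9, 552.4, 43.0, "Nota:"),
    (44.4, 533.9, 62.5, 543.3, "CNAE"),
    (125.3, 48.0, 210.4, 61.2, "PREFEITURA"),
]

for item in sort_reading_order(stored):
    print(item[4])
```

Items higher up on the page come first; items on the same height are ordered left to right.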

In [63]:
partial_tp = page.get_textpage_ocr(flags=1, full=False, language='por')

In [None]:
partial_tp.extractDICT()  # extractDICT is a TextPage method, so it needs the object

In [64]:
partial_tp

<fitz.fitz.TextPage; proxy of <Swig Object of type 'TextPage *' at 0x7efcfc6aa0d0> >

In [65]:
print(page.get_text(textpage=partial_tp, sort=False))

Em
gusta



In [45]:
partial_tp = page.get_textpage_ocr(flags=0, full=False, language='por')

# look at the result
print(page.get_text(textpage=partial_tp, sort=True))  # sort by vertical, then horizontal

Em
gusta



This is very much better. Looking again at span details:

In [66]:
for block in page.get_text("dict", textpage=partial_tp, sort=True)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short
            print( span["font"], bbox, span["text"])

GlyphLessFont IRect(482, 141, 552, 185) Em
GlyphLessFont IRect(46, 581, 96, 602) gusta


As mentioned, normal text is **_not OCRed_** in this case, so it keeps its own font, font size, position information, etc., whereas OCRed text appears with Tesseract's `GlyphLessFont`.

> During its internal processing, MuPDF treats every word returned by Tesseract as a separate text span.
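
Because each OCRed word is its own span, rebuilding a line means concatenating its span texts, and the font name can be used to tell OCR text from normal text. The following is a minimal sketch on a hand-built dictionary that mirrors the first three spans printed earlier. It assumes they belong to one line. The helper names `line_text` and `is_ocr_line` are hypothetical.

```python
OCR_FONT = "GlyphLessFont"  # font name shown in the span output of this notebook


def line_text(line):
    """Concatenate the span texts of one line of a get_text("dict") result."""
    return "".join(span["text"] for span in line["spans"])


def is_ocr_line(line):
    """True if every span of the line carries the OCR font."""
    return all(span["font"] == OCR_FONT for span in line["spans"])


# Hand-built line mirroring the first three spans of the full OCR output.
sample_line = {
    "spans": [
        {"font": "GlyphLessFont", "text": "Número"},
        {"font": "GlyphLessFont", "text": " da"},
        {"font": "GlyphLessFont", "text": " Nota:"},
    ]
}

print(line_text(sample_line), is_ocr_line(sample_line))
```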

## Performance
As mentioned at the beginning, the OCR work is done during `TextPage` creation. Even without OCR, textpage creation is the most time-consuming part of text processing.

Creating OCR textpages may easily take 100 to several thousand times longer. It should therefore happen only once per document page.

The new `textpage` parameter in all text processing methods allows referring to an existing textpage and will suppress creating another one.
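
One way to make sure the OCR runs only once per page is to keep the textpages in a small cache. The class below is a sketch, not part of PyMuPDF: `OcrTextPageCache` is a hypothetical name, it relies only on `page.number` and `page.get_textpage_ocr`, and because it is keyed by page number it should be used with a single document.

```python
class OcrTextPageCache:
    """Create each page's OCR TextPage once and reuse it afterwards."""

    def __init__(self, **ocr_kwargs):
        self._ocr_kwargs = ocr_kwargs  # forwarded to page.get_textpage_ocr
        self._textpages = {}           # page number -> TextPage

    def get(self, page):
        """Return the stored TextPage of `page`, running OCR only on first use."""
        if page.number not in self._textpages:
            self._textpages[page.number] = page.get_textpage_ocr(**self._ocr_kwargs)
        return self._textpages[page.number]


# Intended use with the objects of this notebook (not executed here):
# cache = OcrTextPageCache(flags=0, full=True, dpi=300, language="por")
# text = page.get_text(textpage=cache.get(page))
# words = page.get_text("words", textpage=cache.get(page))  # no second OCR run
```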

Here are some performance comparisons for our example page:

In [9]:
# normal text extraction - no OCR
%timeit page.get_textpage(flags=0)  # suppress image extraction

454 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [11]:
# full page OCR
%timeit page.get_textpage_ocr(flags=0, full=True, dpi=300, language='por', tessdata="C:\\Program Files\\Tesseract-OCR\\tessdata")

1.02 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
# partial OCR
%timeit page.get_textpage_ocr(flags=0, full=False, language='por', tessdata="C:\\Program Files\\Tesseract-OCR\\tessdata")

20.8 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The numbers above illustrate that OCRing a page is time-consuming: creating an OCR `TextPage` may be several thousand times slower than creating a normal one.
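
For this example page, the slowdown factors can be worked out directly from the mean `%timeit` values reported above. The snippet only redoes that arithmetic; the variable names are chosen here for illustration.

```python
# Mean timings reported by %timeit above, converted to seconds.
plain_s = 454e-6         # page.get_textpage(flags=0)
full_ocr_s = 1.02        # page.get_textpage_ocr(full=True, dpi=300, ...)
partial_ocr_s = 20.8e-3  # page.get_textpage_ocr(full=False, ...)

full_ratio = full_ocr_s / plain_s
partial_ratio = partial_ocr_s / plain_s

print(f"full OCR:    about {full_ratio:,.0f} times slower")
print(f"partial OCR: about {partial_ratio:,.0f} times slower")
```

On this page, the full OCR is therefore in the "several thousand times" range, while the partial OCR is far cheaper.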

Once you **_have_** the textpage, however, **_processing_** its text is as fast as it ever was:

In [13]:
# normal textpage
normal_tp = page.get_textpage(flags=0)
%timeit page.get_text(textpage=normal_tp)

30.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [54]:
# full page OCR
%timeit page.get_text(textpage=full_tp)

45.4 µs ± 626 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [53]:
# partial page OCR
%timeit page.get_text(textpage=partial_tp)

3.39 µs ± 59.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [2]:
from operator import itemgetter
from itertools import groupby
import fitz


In [3]:


# ==============================================================================
# Function ParseTab - parse a document table into a Python list of lists
# ==============================================================================
def ParseTab(page, bbox, columns=None):
    """Returns the parsed table of a page in a PDF / (open) XPS / EPUB document.
    Parameters:
    page: fitz.Page object
    bbox: containing rectangle, list of numbers [xmin, ymin, xmax, ymax]
    columns: optional list of column coordinates. If None, columns are generated
    Returns the parsed table as a list of lists of strings.
    The number of rows is determined automatically
    from parsing the specified rectangle.
    """
    tab_rect = fitz.Rect(bbox).irect
    xmin, ymin, xmax, ymax = tuple(tab_rect)

    if tab_rect.is_empty or tab_rect.is_infinite:
        print("Warning: incorrect rectangle coordinates!")
        return []

    if type(columns) is not list or columns == []:
        coltab = [tab_rect.x0, tab_rect.x1]
    else:
        coltab = sorted(columns)

    if xmin < min(coltab):
        coltab.insert(0, xmin)
    if xmax > coltab[-1]:
        coltab.append(xmax)

    words = page.get_text("words")

    if words == []:
        print("Warning: page contains no text")
        return []

    alltxt = []

    # get words contained in table rectangle and distribute them into columns
    for w in words:
        ir = fitz.Rect(w[:4]).irect  # word rectangle
        if ir in tab_rect:
            cnr = 0  # column index
            for i in range(1, len(coltab)):  # loop over column coordinates
                if ir.x0 < coltab[i]:  # word start left of column border
                    cnr = i - 1
                    break
            alltxt.append([ir.x0, ir.y0, ir.x1, cnr, w[4]])

    if alltxt == []:
        print("Warning: no text found in rectangle!")
        return []

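    # sort words by row, then column, then horizontal position, so that
    # groupby below sees each row and each column as one consecutive run
    alltxt.sort(key=itemgetter(1, 3, 0))

    # create the table / matrix
    spantab = []  # the output matrix

    for y, zeile in groupby(alltxt, itemgetter(1)):  # one group per row
        schema = [""] * (len(coltab) - 1)
        for c, col_words in groupby(zeile, itemgetter(3)):  # one group per column
            entry = " ".join([w[4] for w in col_words])
            schema[c] = entry
        spantab.append(schema)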

    return spantab
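
The step in `ParseTab` that distributes words into columns can be looked at in isolation. As a sketch, the inner loop over `coltab` is replaced here by a binary search with `bisect_right`. This is a swapped-in technique, not what the function above does, and `column_index` is a hypothetical name. It assumes `coltab` is sorted, and it differs from the loop only for words starting at or beyond the last border, which it puts into the last column instead of the first.

```python
from bisect import bisect_right


def column_index(x0, coltab):
    """Index of the column whose left border is the last one at or before x0."""
    idx = bisect_right(coltab, x0) - 1
    # Clamp so that values outside the table fall into the first or last column.
    return max(0, min(idx, len(coltab) - 2))


borders = [0, 100, 250, 500]  # hypothetical borders describing three columns
print([column_index(x, borders) for x in (10, 100, 260, 499)])
```

A word that starts exactly on a border is assigned to the column on its right, as in the original loop.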

In [None]:
# ParseTab takes a Page and a rectangle; 20 is assumed to be the intended page number
table = ParseTab(doc[20], [0, 0, 500, 700])

In [5]:
doc = fitz.Document(r".\PyMuPDF_examples\272519EC-959B-4460-BBE6-D80D272B9ACD.PDF")

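In [12]:
# ParseTab expects a Page. Passing the Document itself raised
# "AttributeError: 'Document' object has no attribute 'get_text'".
# Page 0 is an assumption here; use the page that holds the table.
table = ParseTab(doc[0], [0, 0, 5, 600])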