# OCR Demo: Full vs. Partial OCR of Pages
As of v1.19.1, PyMuPDF offers two ways to OCR a document page: **_full_** or **_partial_**. In both cases, a `TextPage` object is created, which is available for text extraction and text search as usual. All text processing methods have been extended with the new parameter `textpage`, which lets them refer to the OCR result.
* A **_full OCR_** makes a photo of the page with the desired resolution and interprets it.
   - All **_visible text_** on the page will be OCRed.
   - All text will have Tesseract's "GlyphLessFont".
   - May take around 2 seconds, depending on the amount of text and the DPI.
* A **_partial OCR_** interprets only the images displayed by the page.
   - The DPI parameter is not needed, because the original images are OCRed.
   - Text will be a **_mixture of normal and OCR text_**. Normal text retains its properties.
   - Can be much faster than a full OCR.
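
The choice between the two modes can be wrapped in a small helper. The following is a minimal sketch, not part of PyMuPDF: the function name `make_ocr_textpage` and its default values are assumptions, and it only forwards to `page.get_textpage_ocr()` with the keyword arguments used later in this notebook.

```python
def make_ocr_textpage(page, full=False, dpi=300, language="eng", tessdata=None):
    """Create an OCR TextPage for `page` (hypothetical helper, not a PyMuPDF API).

    full=True  -> OCR a rendering of the whole page at `dpi`.
    full=False -> OCR only the images shown by the page; `dpi` is not passed.
    """
    kwargs = {"flags": 0, "full": full, "language": language}
    if full:
        kwargs["dpi"] = dpi  # resolution only matters when the page is rendered
    if tessdata is not None:
        kwargs["tessdata"] = tessdata  # folder holding Tesseract's language files
    return page.get_textpage_ocr(**kwargs)
```

With a real page this would be called as `make_ocr_textpage(page, full=True, language="por")`; whether a `tessdata` folder must be given depends on the local Tesseract installation.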

In [1]:
import fitz

if tuple(map(int, fitz.VersionBind.split("."))) < (1, 19, 1):
    raise ValueError("Need at least v1.19.1 of PyMuPDF")

# example PDF contains normal text and two overlapping images
doc = fitz.open(r".\PyMuPDF_examples\272519EC-959B-4460-BBE6-D80D272B9ACD.PDF")
page = doc[0]


## Full Page OCR
First make a **_full page OCR_**. Please take a look at the PDF and note the two little text lines. They are contained in a separate, non-transparent image, which covers some text of the larger image underneath it.

In [4]:
# make the TextPage object. It does all the OCR.
full_tp = page.get_textpage_ocr(flags=0, dpi=300, full=True, language='por', tessdata="C:\\Program Files\\Tesseract-OCR\\tessdata")

# now look at what we have got
print(page.get_text(textpage=full_tp))

cd
Número da Nota:
ERRME
PREFEITURA MUNICIPAL DE MESQUITA
2023113
Vi
A
Vé
Competência:
ENE” É
SECRETARIA MUNICIPAL DE FAZENDA
JULHO/2023
DS VA
NOTA FISCAL DE SERVIÇO ELETRÔNICA- NFS
Data e Hora de Emisssão:
Y) AO,44::
ç
j
Se
01/08/2023 03:32:41
BR mes quer 1º
RPS Nº: 1386 SERIE: 1
| EMISSÃO: 01/08/2023
Código de Verificação:
C64B104C7
PRESTADOR DE SERVIÇOS
CPF/CNPJ:
Inscrição Municipal:
15.064.270/0001-33
5000545
Telefone:
Inscrição Estadual:
(21) 3848-0080
Nome/Razão Social:
MODERNIZACAO PUBLICA E INFORMATICA LTDA
UOGERNIZAÇÃO
Nome Fantasia: Modernização Pública
Publica
Endereço:
RUA PREFEITO JOSE MONTES PAIXAO, 1708 SOBRADO CENTRO MESQUITA/RJ CEP: 26553-161
E-mail:
vm.direcao(Dyahoo.com.br
TOMADOR DE SERVIÇOS
CPF/CNPJ:
Inscrição Municipal:
28.909.604/0001-74
Telefone:
Inscrição Estadual:
Nome/Razão Social:
PREF.MUN.DE SAO PEDRO DA ALDEIA
Endereço:
Rua Marques da Cruz, 61 CENTRO SAO PEDRO DA ALDEIA/RJ/RJ CEP: 28941-086
E-mail:
DISCRIMINAÇÃO DOS SERVIÇOS
CESSAO DE LICENCIAMENTO DE USO 

Or blockwise output, getting rid of some of the unwanted linebreaks:

In [5]:
blocks = page.get_text("blocks", textpage=full_tp)
for b in blocks:
    print(b[4].replace("\n", " "))

cd Número da Nota: ERRME PREFEITURA MUNICIPAL DE MESQUITA 2023113 Vi A Vé Competência: ENE” É SECRETARIA MUNICIPAL DE FAZENDA JULHO/2023 DS VA NOTA FISCAL DE SERVIÇO ELETRÔNICA- NFS Data e Hora de Emisssão: Y) AO,44:: ç j Se 01/08/2023 03:32:41 BR mes quer 1º RPS Nº: 1386 SERIE: 1 | EMISSÃO: 01/08/2023 Código de Verificação: C64B104C7 
PRESTADOR DE SERVIÇOS 
CPF/CNPJ: Inscrição Municipal: 15.064.270/0001-33 5000545 
Telefone: Inscrição Estadual: (21) 3848-0080 Nome/Razão Social: MODERNIZACAO PUBLICA E INFORMATICA LTDA UOGERNIZAÇÃO Nome Fantasia: Modernização Pública Publica Endereço: RUA PREFEITO JOSE MONTES PAIXAO, 1708 SOBRADO CENTRO MESQUITA/RJ CEP: 26553-161 E-mail: vm.direcao(Dyahoo.com.br 
TOMADOR DE SERVIÇOS 
CPF/CNPJ: Inscrição Municipal: 28.909.604/0001-74 
Telefone: Inscrição Estadual: 
Nome/Razão Social: PREF.MUN.DE SAO PEDRO DA ALDEIA 
Endereço: Rua Marques da Cruz, 61 CENTRO SAO PEDRO DA ALDEIA/RJ/RJ CEP: 28941-086 E-mail: 
DISCRIMINAÇÃO DOS SERVIÇOS 
CESSAO DE LICENCIAMEN

Not very impressive either way: the original text (last 4 lines) was detected ok, but text in the pictures looks quite garbled ... no surprise!

> Note that the OCR process scans the page from top-left to bottom-right, which is therefore also the order of the extracted text.

This is what we get when looking at details of each text span:

In [6]:
for block in page.get_text("dict", textpage=full_tp)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short in the display
            print( span["font"], bbox, span["text"])

GlyphLessFont IRect(54, 48, 81, 53) cd
GlyphLessFont IRect(489, 40, 525, 49) Número
GlyphLessFont IRect(524, 40, 539, 49)  da
GlyphLessFont IRect(538, 40, 565, 49)  Nota:
GlyphLessFont IRect(47, 52, 87, 77) ERRME
GlyphLessFont IRect(118, 52, 194, 77) PREFEITURA
GlyphLessFont IRect(193, 52, 264, 77)  MUNICIPAL
GlyphLessFont IRect(263, 52, 284, 77)  DE
GlyphLessFont IRect(283, 52, 351, 77)  MESQUITA
GlyphLessFont IRect(526, 52, 565, 77) 2023113
GlyphLessFont IRect(37, 52, 45, 77) Vi
GlyphLessFont IRect(60, 63, 76, 76) A
GlyphLessFont IRect(91, 63, 94, 76) Vé
GlyphLessFont IRect(516, 63, 566, 76) Competência:
GlyphLessFont IRect(49, 75, 87, 99) ENE”
GlyphLessFont IRect(86, 75, 96, 99)  É
GlyphLessFont IRect(158, 75, 210, 99) SECRETARIA
GlyphLessFont IRect(209, 75, 256, 99)  MUNICIPAL
GlyphLessFont IRect(255, 75, 270, 99)  DE
GlyphLessFont IRect(269, 75, 311, 99)  FAZENDA
GlyphLessFont IRect(518, 75, 565, 99) JULHO/2023
GlyphLessFont IRect(36, 75, 44, 99) DS
GlyphLessFont IRect(43, 77, 65,

## Partial OCR
Let's see what a **_partial OCR_** can do for us.

A partial OCR `TextPage` internally stores text in the following sequence:
1. Normal text
2. OCR text from images in the same sequence as the page displays those images

So we had better use the `sort` parameter of text extraction.
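
To see why sorting matters, here is a small self-contained sketch of ordering text blocks "by vertical, then horizontal". It only illustrates the idea on hand-made tuples shaped like the items of `page.get_text("blocks")` (`x0, y0, x1, y1, text`). The helper name `sort_reading_order` and the sample data are assumptions, and PyMuPDF's own `sort=True` may use a slightly different sort key.

```python
def sort_reading_order(blocks):
    """Return blocks ordered top-to-bottom, then left-to-right (illustration only)."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))  # sort key: (y0, x0)


# Made-up blocks in "storage order": normal text first, OCR text from an image last.
blocks = [
    (30, 200, 300, 215, "normal text, lower on the page"),
    (30, 300, 300, 315, "normal text, bottom of the page"),
    (30, 50, 300, 65, "OCR text from an image at the top"),
]

for block in sort_reading_order(blocks):
    print(block[4])
```

After sorting, the OCR text from the image at the top of the page comes first, although it was stored last.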

In [7]:
partial_tp = page.get_textpage_ocr(flags=0, full=False, language='por', tessdata="C:\\Program Files\\Tesseract-OCR\\tessdata")

# look at the result
print(page.get_text(textpage=partial_tp, sort=True))  # sort by vertical, then horizontal

Número da Nota:
PREFEITURA MUNICIPAL DE MESQUITA
2023173
Competência:
JULHO/2023
SECRETARIA MUNICIPAL DE FAZENDA
Data e Hora de Emisssão:
01/08/2023 03:32:41
NOTA FISCAL DE SERVIÇO ELETRÔNICA - NFS-e
RPS Nº: 1386 SÉRIE: 1 | EMISSÃO: 01/08/2023
Código de Verificação:
C64B104C7
PRESTADOR DE SERVIÇOS
CPF/CNPJ:
Inscrição Municipal:
15.064.270/0001-33
5000545
Telefone:
Inscrição Estadual:
(21) 3848-0080
Nome/Razão Social:
MODERNIZACAO PUBLICA E INFORMATICA LTDA
0 Pública |
Nome Fantasia: Modernização Pública
Endereço:
RUA PREFEITO JOSE MONTES PAIXAO, 1708 SOBRADO CENTRO MESQUITA/RJ CEP: 26553-161
E-mail:
vm.direcao@yahoo.com.br
TOMADOR DE SERVIÇOS
CPF/CNPJ:
Inscrição Municipal:
28.909.604/0001-74
Telefone:
Inscrição Estadual:
Nome/Razão Social:
PREF.MUN.DE SAO PEDRO DA ALDEIA
Endereço:
Rua Marques da Cruz, 61  CENTRO SAO PEDRO DA ALDEIA/RJ/RJ CEP: 28941-086
E-mail:
DISCRIMINAÇÃO DOS SERVIÇOS
CESSAO DE LICENCIAMENTO DE USO DE SOLUCAO PARA SISTEMA INTEGRADO DE GESTAO PUBLICA CONTRATO  94/2022

This is very much better. Looking again at span details:

In [8]:
for block in page.get_text("dict", textpage=partial_tp, sort=True)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short
            print( span["font"], bbox, span["text"])

Arial IRect(489, 37, 566, 51) Número da Nota:
Arial,Bold IRect(117, 55, 352, 72) PREFEITURA MUNICIPAL DE MESQUITA
Arial,Bold IRect(526, 50, 566, 65) 2023173
Arial IRect(516, 61, 566, 73) Competência:
Arial,Bold IRect(518, 73, 566, 85) JULHO/2023
Arial,Bold IRect(158, 73, 311, 85) SECRETARIA MUNICIPAL DE FAZENDA
Arial IRect(472, 81, 566, 93) Data e Hora de Emisssão:
Arial,Bold IRect(491, 93, 566, 105) 01/08/2023 03:32:41
Arial,Bold IRect(139, 88, 330, 100) NOTA FISCAL DE SERVIÇO ELETRÔNICA - NFS-e
Arial,Bold IRect(147, 102, 322, 114) RPS Nº: 1386 SÉRIE: 1 | EMISSÃO: 01/08/2023
Arial IRect(485, 106, 566, 118) Código de Verificação:
Arial,Bold IRect(521, 118, 566, 130) C64B104C7
Arial,Bold IRect(217, 134, 378, 152) PRESTADOR DE SERVIÇOS
Arial IRect(27, 155, 74, 169) CPF/CNPJ:
Arial IRect(282, 155, 362, 169) Inscrição Municipal:
Arial,Bold IRect(34, 166, 116, 180) 15.064.270/0001-33
Arial,Bold IRect(290, 166, 326, 180) 5000545
Arial IRect(27, 178, 65, 191) Telefone:
Arial IRect(282, 178, 3

As mentioned, normal text is **_not OCRed_** in this case, so it keeps its own font, font size, position information, etc., whereas OCRed text appears with Tesseract's `GlyphLessFont`.

> During its internal processing, MuPDF treats every word returned by Tesseract as a separate text span.
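
Because OCRed spans carry the font name `GlyphLessFont`, the two kinds of text can be told apart after extraction. The following is a sketch under that assumption; the function name `split_spans` and the sample dictionary are made up for illustration, and the dictionary only mimics the parts of a `page.get_text("dict")` result that the code reads.

```python
OCR_FONT = "GlyphLessFont"  # font name shown for OCRed spans in the output above


def split_spans(page_dict):
    """Split the spans of a get_text("dict") result into (normal, ocr) text lists."""
    normal, ocr = [], []
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line["spans"]:
                target = ocr if span["font"] == OCR_FONT else normal
                target.append(span["text"])
    return normal, ocr


# Hand-made stand-in for page.get_text("dict", textpage=partial_tp)
sample = {"blocks": [{"lines": [{"spans": [
    {"font": "Arial", "text": "Número da Nota:"},
    {"font": "GlyphLessFont", "text": "Pública"},
]}]}]}

normal, ocr = split_spans(sample)
print("normal:", normal)
print("ocr:   ", ocr)
```

On a real page the call would be `split_spans(page.get_text("dict", textpage=partial_tp))`.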

## Performance
As mentioned at the beginning, the OCR work is done during `TextPage` creation. Even without OCR, textpage creation is the most time-consuming part of text processing.

Creating OCR textpages may easily take 100 to several thousand times longer. It should therefore happen only once per document page.

The new `textpage` parameter in all text processing methods allows referring to an existing textpage and will suppress creating another one.
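
One way to make sure the OCR runs only once per page is to keep the textpages in a dictionary. This is a minimal sketch, assuming a single open document: the helper name `get_ocr_textpage` and the cache layout are assumptions, while `page.number` and `page.get_textpage_ocr()` are used as they appear in PyMuPDF.

```python
def get_ocr_textpage(page, cache, **ocr_kwargs):
    """Return the OCR TextPage for `page`, creating it at most once per page number."""
    key = page.number  # 0-based page number within its document
    if key not in cache:
        cache[key] = page.get_textpage_ocr(**ocr_kwargs)  # the expensive step
    return cache[key]


# Intended use with a real page (not executed here):
# textpages = {}
# tp = get_ocr_textpage(page, textpages, flags=0, full=False, language="por")
# text = page.get_text(textpage=tp)            # reuses the OCR result
# words = page.get_text("words", textpage=tp)  # no second OCR run
```

The cache is keyed by page number only, so a separate dictionary would be needed for each document.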

Here are some performance comparisons for our example page:

In [9]:
# normal text extraction - no OCR
%timeit page.get_textpage(flags=0)  # suppress image extraction

454 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [11]:
# full page OCR
%timeit page.get_textpage_ocr(flags=0, full=True, dpi=300, language='por', tessdata="C:\\Program Files\\Tesseract-OCR\\tessdata")

1.02 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
# partial OCR
%timeit page.get_textpage_ocr(flags=0, full=False, language='por', tessdata="C:\\Program Files\\Tesseract-OCR\\tessdata")

20.8 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The numbers above illustrate that OCRing a page is time-consuming! Creating an OCR `TextPage` may be several thousand times slower than creating a plain one.
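
The slowdown factors can be worked out directly from the mean `%timeit` values reported above. The variable names are only for illustration, and the results apply to this example page and machine only.

```python
# Mean timings copied from the %timeit outputs above, in seconds
plain = 454e-6         # page.get_textpage(flags=0)
full_ocr = 1.02        # page.get_textpage_ocr(..., full=True, dpi=300)
partial_ocr = 20.8e-3  # page.get_textpage_ocr(..., full=False)

ratios = {
    "full OCR vs. plain textpage": full_ocr / plain,
    "partial OCR vs. plain textpage": partial_ocr / plain,
    "full OCR vs. partial OCR": full_ocr / partial_ocr,
}

for label, ratio in ratios.items():
    print(f"{label}: about {ratio:,.0f} times slower")
```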

Once you **_have_** the textpage, however, **_processing_** its text is as fast as it ever was:

In [13]:
# normal textpage
normal_tp = page.get_textpage(flags=0)
%timeit page.get_text(textpage=normal_tp)

30.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [14]:
# full page OCR
%timeit page.get_text(textpage=full_tp)

30.4 µs ± 288 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [15]:
# partial page OCR
%timeit page.get_text(textpage=partial_tp)

29.8 µs ± 553 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [2]:
from operator import itemgetter
from itertools import groupby
import fitz


In [3]:


# ==============================================================================
# Function ParseTab - parse a document table into a Python list of lists
# ==============================================================================
def ParseTab(page, bbox, columns=None):
    """Returns the parsed table of a page in a PDF / (open) XPS / EPUB document.
    Parameters:
    page: fitz.Page object
    bbox: containing rectangle, list of numbers [xmin, ymin, xmax, ymax]
    columns: optional list of column coordinates. If None, columns are generated
    Returns the parsed table as a list of lists of strings.
    The number of rows is determined automatically
    from parsing the specified rectangle.
    """
    tab_rect = fitz.Rect(bbox).irect
    xmin, ymin, xmax, ymax = tuple(tab_rect)

    if tab_rect.is_empty or tab_rect.is_infinite:
        print("Warning: incorrect rectangle coordinates!")
        return []

    if not isinstance(columns, list) or not columns:  # no usable column list given
        coltab = [tab_rect.x0, tab_rect.x1]
    else:
        coltab = sorted(columns)

    if xmin < min(coltab):
        coltab.insert(0, xmin)
    if xmax > coltab[-1]:
        coltab.append(xmax)

    words = page.get_text("words")

    if words == []:
        print("Warning: page contains no text")
        return []

    alltxt = []

    # get words contained in table rectangle and distribute them into columns
    for w in words:
        ir = fitz.Rect(w[:4]).irect  # word rectangle
        if ir in tab_rect:
            cnr = 0  # column index
            for i in range(1, len(coltab)):  # loop over column coordinates
                if ir.x0 < coltab[i]:  # word start left of column border
                    cnr = i - 1
                    break
            alltxt.append([ir.x0, ir.y0, ir.x1, cnr, w[4]])

    if alltxt == []:
        print("Warning: no text found in rectangle!")
        return []

    alltxt.sort(key=itemgetter(1, 0))  # sort by y, then x, so each row's columns stay contiguous

    # create the table / matrix
    spantab = []  # the output matrix

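    for y, zeile in groupby(alltxt, itemgetter(1)):  # one group per text row
        schema = [""] * (len(coltab) - 1)  # one empty cell per column
        # group the row's words by column index; avoid shadowing the outer `words`
        for c, col_words in groupby(zeile, itemgetter(3)):
            entry = " ".join([w[4] for w in col_words])
            schema[c] = entry
        spantab.append(schema)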

    return spantab
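
The row and column grouping at the end of `ParseTab` can be tried in isolation, without a PDF. The sketch below uses hand-made entries in the same `[x0, y0, x1, column, text]` layout as `alltxt`. The function name `rows_from_words` and the sample values are assumptions, and unlike the function above it sorts by `(y0, x0)` so that words of one column are always adjacent within a row.

```python
from itertools import groupby
from operator import itemgetter


def rows_from_words(alltxt, ncols):
    """Group [x0, y0, x1, column, text] entries into table rows (illustration only)."""
    ordered = sorted(alltxt, key=itemgetter(1, 0))  # top-to-bottom, then left-to-right
    table = []
    for _, row_words in groupby(ordered, key=itemgetter(1)):  # same y0 = same row
        row = [""] * ncols
        for col, col_words in groupby(row_words, key=itemgetter(3)):
            row[col] = " ".join(w[4] for w in col_words)
        table.append(row)
    return table


# Made-up words of a two-column table with two rows
sample = [
    [200, 120, 230, 1, "2.50"],
    [10, 100, 40, 0, "Item"],
    [45, 120, 70, 0, "pen"],
    [200, 100, 230, 1, "Price"],
    [10, 120, 40, 0, "Blue"],
]

for row in rows_from_words(sample, ncols=2):
    print(row)
```

Like `ParseTab`, this treats only words with exactly the same `y0` as one row, so slightly misaligned text would end up in separate rows.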

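In [None]:
# ParseTab expects a Page object and a bbox, not a Document plus a number;
# page 0 is assumed here because it is the page used earlier in this notebook
table = ParseTab(doc[0], [0, 0, 500, 700])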

In [5]:
doc = fitz.Document(r".\PyMuPDF_examples\272519EC-959B-4460-BBE6-D80D272B9ACD.PDF")

In [12]:
# ParseTab calls page.get_text(), so it needs a Page object, not the Document
table = ParseTab(doc[0], [0, 0, 5, 600])