Looking for some examples on how to use docTR for OCR-related tasks? You've come to the right place 😀

# Installation

Install all the dependencies to make the most out of docTR. The project provides two main [installation](https://mindee.github.io/doctr/latest/installing.html) streams: the latest stable release, and a developer mode that installs from source.

## Latest stable release

This installs the latest stable release published by our teams on PyPI. It is expected to provide a clean, bug-free experience for all users.

In [None]:
# TensorFlow
# !pip install python-doctr[tf]
# PyTorch
!pip install python-doctr[torch]
# Restart runtime

Collecting python-doctr[torch]
  Downloading python_doctr-0.8.1-py3-none-any.whl (295 kB)
Collecting pypdfium2<5.0.0,>=4.0.0 (from python-doctr[torch])
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Collecting pyclipper<2.0.0,>=1.2.0 (from python-doctr[torch])
  Downloading pyclipper-1.3.0.post5-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (908 kB)
Collecting langdetect<2.0.0,>=1.0.9 (from python-doctr[torch])
  Downloading langdetect-1.0.9.tar.gz (981 kB)

## From source

Between stable releases, we constantly iterate on community feedback to improve the library. Bug fixes and performance improvements are regularly pushed to the project Git repository. With this installation method, you get access to the latest features that have not yet made their way into a PyPI release!

In [None]:
# Colab related installations to install pyproject.toml projects correctly
!sudo apt install libcairo2-dev pkg-config
!pip3 install pycairo
# Install the most up-to-date version from GitHub
# TensorFlow
# !pip install python-doctr[tf]@git+https://github.com/mindee/doctr.git
# PyTorch
!pip3 install python-doctr[torch]@git+https://github.com/mindee/doctr.git

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libbz2-dev libpkgconf3 libreadline-dev
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  libblkid-dev libblkid1 libcairo-script-interpreter2 libffi-dev
  libglib2.0-dev libglib2.0-dev-bin libice-dev liblzo2-2 libmount-dev
  libmount1 libpixman-1-dev libselinux1-dev libsepol-dev libsm-dev
  libxcb-render0-dev libxcb-shm0-dev
Suggested packages:
  libcairo2-doc libgirepository1.0-dev libglib2.0-doc libgdk-pixbuf2.0-bin
  | libgdk-pixbuf2.0-dev libxml2-utils libice-doc cryptsetup-bin libsm-doc
The following packages will be REMOVED:
  pkgconf r-base-dev
The following NEW packages will be installed:
  libblkid-dev libcairo-script-interpreter2 libcairo2-dev libffi-dev
  libglib2.0-dev libglib2.0-dev-bin libice-dev liblzo2-2 libmount-dev
  libpixman-1-dev libselinux1-de

Now go to `Runtime/Restart runtime` for your changes to take effect!

# Basic usage

We're going to review the main features of docTR 🎁
To render the results properly, we first need to install some free fonts:

In [None]:
# Install some free fonts for result rendering
!sudo apt-get install fonts-freefont-ttf -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libbz2-dev libpkgconf3 libreadline-dev
Use 'sudo apt autoremove' to remove them.
The following NEW packages will be installed:
  fonts-freefont-ttf
0 upgraded, 1 newly installed, 0 to remove and 43 not upgraded.
Need to get 2,388 kB of archives.
After this operation, 6,653 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-freefont-ttf all 20120503-10build1 [2,388 kB]
Fetched 2,388 kB in 1s (1,957 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
d

Let's take care of all the imports up front:

In [None]:
%matplotlib inline
import os

# Let's pick the desired backend
# os.environ['USE_TF'] = '1'
os.environ['USE_TORCH'] = '1'

import matplotlib.pyplot as plt

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

For the next steps, we need a PDF document to showcase the library features:

In [None]:
# Download a sample
!wget https://eforms.com/download/2019/01/Cash-Payment-Receipt-Template.pdf
# Read the file
doc = DocumentFile.from_pdf("Cash-Payment-Receipt-Template.pdf")
print(f"Number of pages: {len(doc)}")

--2024-07-11 00:43:23--  https://eforms.com/download/2019/01/Cash-Payment-Receipt-Template.pdf
Resolving eforms.com (eforms.com)... 104.26.1.24, 172.67.73.188, 104.26.0.24, ...
Connecting to eforms.com (eforms.com)|104.26.1.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16080 (16K) [application/pdf]
Saving to: ‘Cash-Payment-Receipt-Template.pdf’


2024-07-11 00:43:24 (77.4 MB/s) - ‘Cash-Payment-Receipt-Template.pdf’ saved [16080/16080]

Number of pages: 1


Under the hood, docTR runs deep learning models to perform the tasks it supports. Those models were built and trained with very popular frameworks for maximum compatibility (you will be pleased to know that you can switch from [PyTorch](https://pytorch.org/) to [TensorFlow](https://www.tensorflow.org/) without noticing any difference). By default, our high-level API picks sensible default values so that you get high-performing models without having to know anything about them. All of this is wrapped in a `Predictor` object, which takes care of pre-processing, model inference and post-processing for you ⚡

Let's instantiate one!

In [None]:
# Instantiate a pretrained model
predictor = ocr_predictor(pretrained=True)

Downloading https://doctr-static.mindee.com/models?id=v0.8.1/fast_base-688a8b34.pt&src=0 to /root/.cache/doctr/models/fast_base-688a8b34.pt


  0%|          | 0/65814772 [00:00<?, ?it/s]

Downloading https://doctr-static.mindee.com/models?id=v0.3.1/crnn_vgg16_bn-9762b0b0.pt&src=0 to /root/.cache/doctr/models/crnn_vgg16_bn-9762b0b0.pt


  0%|          | 0/63286381 [00:00<?, ?it/s]

By default, PyTorch provides a nice visual description of a model, which is handy when it comes to debugging or checking what you just created. We also added a similar feature for the TensorFlow backend so that you don't miss out on this nice assistance.

Let's dive into this model 🕵

In [None]:
# Display the architecture
print(predictor)

OCRPredictor(
  (det_predictor): DetectionPredictor(
    (pre_processor): PreProcessor(
      (resize): Resize(output_size=(1024, 1024), interpolation='bilinear', preserve_aspect_ratio=True, symmetric_pad=True)
      (normalize): Normalize(mean=(0.798, 0.785, 0.772), std=(0.264, 0.2749, 0.287))
    )
    (model): FAST()
  )
  (reco_predictor): RecognitionPredictor(
    (pre_processor): PreProcessor(
      (resize): Resize(output_size=(32, 128), interpolation='bilinear', preserve_aspect_ratio=True, symmetric_pad=False)
      (normalize): Normalize(mean=(0.694, 0.695, 0.693), std=(0.299, 0.296, 0.301))
    )
    (model): CRNN(
      (feat_extractor): Sequential(
        (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): BatchNorm2d(64, eps=1e-05, momentum=

Here we are inspecting the most complex (and highest-level) object of the docTR API: an OCR predictor. docTR performs Optical Character Recognition in two stages: it first localizes textual elements (Text Detection), then extracts the corresponding text from each location (Text Recognition). The OCR predictor therefore wraps two sub-predictors: one for text detection, and the other for text recognition.
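
To make the two-stage idea concrete, here is a minimal toy sketch of a detection step feeding a recognition step. It is not docTR's implementation: `ToyOCRPredictor`, `detect_runs` and `read_crop` are hypothetical names, and the "image" is just a list of text rows so that the example runs without any model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax), toy pixel units


def detect_runs(image: List[str]) -> List[Box]:
    """Toy 'text detection': one box per run of non-space characters."""
    boxes = []
    for y, row in enumerate(image):
        for match in re.finditer(r"\S+", row):
            boxes.append((match.start(), y, match.end(), y + 1))
    return boxes


def read_crop(crop: List[str]) -> str:
    """Toy 'text recognition': the crop already contains the characters."""
    return "".join(crop)


@dataclass
class ToyOCRPredictor:
    """Hypothetical wrapper that chains a detector and a recognizer."""
    detect: Callable[[List[str]], List[Box]]
    recognize: Callable[[List[str]], str]

    def __call__(self, image: List[str]) -> List[Tuple[Box, str]]:
        results = []
        for xmin, ymin, xmax, ymax in self.detect(image):
            # Crop the region found by the detector, then read it
            crop = [row[xmin:xmax] for row in image[ymin:ymax]]
            results.append(((xmin, ymin, xmax, ymax), self.recognize(crop)))
        return results


toy = ToyOCRPredictor(detect=detect_runs, recognize=read_crop)
print(toy(["  OR   2292 "]))
# [((2, 0, 4, 1), 'OR'), ((7, 0, 11, 1), '2292')]
```

The real predictor follows the same shape, with a pre-processor and a neural network in each stage, as the printed architecture above shows.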

## Basic inference

It looks quite complex, doesn't it?
Well, that will not prevent you from easily getting nice results. See for yourself:

In [None]:
image_path = 'id_6_velocidade_9.jpg'
single_img_doc = DocumentFile.from_images(image_path)
ocr_result = predictor(single_img_doc)

In [None]:
ocr_result

Document(
  (pages): [Page(
    dimensions=(258, 378)
    (blocks): [Block(
      (lines): [Line(
        (words): [
          Word(value='OR', confidence=0.47),
          Word(value='2292', confidence=0.96),
        ]
      )]
      (artefacts): []
    )]
  )]
)

In [None]:
image_path = 'id_3_velocidade_10.jpg'
single_img_doc = DocumentFile.from_images(image_path)
ocr_result = predictor(single_img_doc)

In [None]:
ocr_result

Document(
  (pages): [Page(
    dimensions=(390, 732)
    (blocks): [
      Block(
        (lines): [Line(
          (words): [Word(value='-', confidence=0.51)]
        )]
        (artefacts): []
      ),
      Block(
        (lines): [Line(
          (words): [Word(value='-', confidence=0.51)]
        )]
        (artefacts): []
      ),
      Block(
        (lines): [Line(
          (words): [Word(value='SYBGH84', confidence=0.67)]
        )]
        (artefacts): []
      ),
    ]
  )]
)

In [None]:
palavra = ''  # accumulates the recognized text ("palavra" is Portuguese for "word")
# Debug: print the OCR results
print("OCR Result Blocks:")
for i, block in enumerate(ocr_result.pages[0].blocks):
  print(f"Block {i}:")
  print(block)
  for j, line in enumerate(block.lines):
    print(f"  Line {j}:")
    print(line)
    for k, word in enumerate(line.words):
      print(f"    Word {k}:")
      print(word)
      palavra += word.value

print(palavra)

OCR Result Blocks:
Block 0:
Block(
  (lines): [Line(
    (words): [
      Word(value='OR', confidence=0.47),
      Word(value='2292', confidence=0.96),
    ]
  )]
  (artefacts): []
)
  Line 0:
Line(
  (words): [
    Word(value='OR', confidence=0.47),
    Word(value='2292', confidence=0.96),
  ]
)
    Word 0:
Word(value='OR', confidence=0.47)
    Word 1:
Word(value='2292', confidence=0.96)
OR2292
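
The loop above concatenates every word, whatever its confidence. As a minimal sketch, the helper below walks the same pages/blocks/lines/words hierarchy on a plain dictionary and drops low-confidence words. The dictionary layout is assumed to mirror what `ocr_result.export()` returns; the helper name `join_confident_words` and the 0.5 threshold are illustrative choices, not part of docTR.

```python
def join_confident_words(export, min_confidence=0.5, sep=""):
    """Concatenate word values whose confidence is at least min_confidence."""
    kept = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    if word["confidence"] >= min_confidence:
                        kept.append(word["value"])
    return sep.join(kept)


# Hand-written sample mirroring the prediction printed above
# (geometry fields omitted for brevity)
sample_export = {
    "pages": [
        {"blocks": [{"lines": [{"words": [
            {"value": "OR", "confidence": 0.47},
            {"value": "2292", "confidence": 0.96},
        ]}]}]}
    ]
}

print(join_confident_words(sample_export, min_confidence=0.0))  # OR2292
print(join_confident_words(sample_export, min_confidence=0.5))  # 2292
```

Whether to filter at all depends on the use case: a strict threshold removes noise such as stray dashes, but it can also discard characters that were read correctly with low confidence.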


In [None]:
!pip install inference-sdk

Collecting inference-sdk
  Downloading inference_sdk-0.13.0-py3-none-any.whl (30 kB)
Collecting dataclasses-json>=0.6.0 (from inference-sdk)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Collecting supervision<1.0.0,>=0.20.0 (from inference-sdk)
  Downloading supervision-0.21.0-py3-none-any.whl (123 kB)
Collecting backoff>=2.2.0 (from inference-sdk)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting aioresponses>=0.7.6 (from inference-sdk)
  Downloading aioresponses-0.7.6-py2.py3-none-any.whl (11 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json>=0.6.0->inference-sdk)
  Downloading marshmallow-3.21.3-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json>=0.6.0->inferenc

In [None]:
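# `!export` runs in a temporary subshell, so the variable would not persist; `%env` sets it for the kernel
%env api_key=sGTAqNQmiPb1fo72CDJd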

In [None]:
import os
from inference_sdk import InferenceHTTPClient

CLIENT = InferenceHTTPClient(
    api_url="https://infer.roboflow.com",
    api_key="sGTAqNQmiPb1fo72CDJd"
)

result = CLIENT.ocr_image(inference_input="./id_6_velocidade_9.jpg")  # single image request
print(result)


{'result': '0R02297', 'time': 1.31004712999993, 'parent_id': None}


In [None]:
print(result['result'])

0R02297


In [None]:
import os
from inference_sdk import InferenceHTTPClient

CLIENT = InferenceHTTPClient(
    api_url="https://infer.roboflow.com",
    api_key="sGTAqNQmiPb1fo72CDJd"
)

result = CLIENT.ocr_image(inference_input="./id_3_velocidade_10.jpg")  # single image request
print(result)

{'result': 'SYBGH84', 'time': 1.4804244770002697, 'parent_id': None}


In [None]:
print(result['result'])

SYBGH84


## Prediction visualization

If you prefer to see the results with your own eyes, docTR includes a few visualization features. We will first overlay our predictions on the original document:

In [None]:
# `result` now holds the Roboflow response (a dict), so use the docTR output instead
ocr_result.show()

Looks accurate!
But we can go further: if the extracted information is correctly structured, we should be able to recreate the page entirely. So let's do this 🎨

In [None]:
synthetic_pages = ocr_result.synthesize()
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()

## Exporting results

OK, so the predictions are relevant, but how would you integrate this into your own document processing pipeline? Perhaps you're not using Python at all?

Well, if you happen to be using JSON or XML exports, they are already supported 🤗

In [None]:
# JSON export
json_export = ocr_result.export()
print(json_export)
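
The export is made of nested Python dictionaries and lists, so it can be handed to the standard `json` module before being passed to another tool. The sketch below uses a hand-written sample rather than a real prediction; the key names (`pages`, `blocks`, `lines`, `words`, `value`, `confidence`) are assumed to match the export of your installed version, and geometry fields are omitted.

```python
import json

# Hand-written sample standing in for the exported dictionary
sample_export = {
    "pages": [{
        "page_idx": 0,
        "dimensions": (258, 378),
        "blocks": [{
            "lines": [{
                "words": [
                    {"value": "OR", "confidence": 0.47},
                    {"value": "2292", "confidence": 0.96},
                ],
            }],
        }],
    }]
}

# Serialize to a JSON string, then load it back as another program would
payload = json.dumps(sample_export, indent=2)
restored = json.loads(payload)

# Flatten the hierarchy into (value, confidence) pairs
words = [
    (word["value"], word["confidence"])
    for page in restored["pages"]
    for block in page["blocks"]
    for line in block["lines"]
    for word in line["words"]
]
print(words)  # [('OR', 0.47), ('2292', 0.96)]
```

Note that JSON has no tuple type, so `dimensions` comes back as a list (`[258, 378]`) after the round trip.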

In [None]:
# XML export
xml_output = ocr_result.export_as_xml()
print(xml_output[0][0])
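
The XML export can be read back with any XML parser. As a rough sketch, the snippet below extracts the words from an hOCR-style document using the standard library. The sample bytes are hand-written for illustration (the `bbox` and `x_wconf` values are made up), and the assumption is that each word sits in an element whose `class` attribute is `ocrx_word`, as in the hOCR convention.

```python
import xml.etree.ElementTree as ET

# Hand-written hOCR-style sample; a real export would replace this
sample_hocr = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" id="page_1" title="bbox 0 0 378 258">
      <span class="ocr_line" id="line_1">
        <span class="ocrx_word" id="word_1" title="bbox 10 20 60 50; x_wconf 47">OR</span>
        <span class="ocrx_word" id="word_2" title="bbox 70 20 160 50; x_wconf 96">2292</span>
      </span>
    </div>
  </body>
</html>"""

root = ET.fromstring(sample_hocr)

# Match on the class attribute so the XHTML namespace does not matter
words = [el.text for el in root.iter() if el.get("class") == "ocrx_word"]
print(words)  # ['OR', '2292']
```

Filtering on the attribute rather than on the tag name keeps the code independent of the namespace prefix that the parser adds to every tag.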