<div style="text-align: center; background-color: #0077b6; padding: 20px; border-radius: 10px;">
  <h1 style="color: white; font-weight: bold;">OCRFlux-3B: Testing Document Parsing Across Different Examples</h1>
</div>


In this notebook, we evaluate **[OCRFlux-3B](https://huggingface.co/ChatDOC/OCRFlux-3B)**, an open-source model for document parsing and OCR (Optical Character Recognition). The goal is to test it on page images taken from different types of documents, most of them table-heavy pages from financial and annual reports. The core implementation and setup follow the tutorial by [Venelin Valkov](https://www.youtube.com/@venelin_valkov/videos). The notebook runs on **Lightning AI** with an **NVIDIA L40S GPU** for inference.

# **1. Import Libraries**

In [2]:
import base64
import json

from bs4 import BeautifulSoup
from IPython.display import display, HTML
from markdown import markdown
from PIL import Image
from rich.console import Console
from rich.markdown import Markdown
from rich.pretty import pprint
from rich.table import Table
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

# **2. Load OCRFlux-3B**

In [4]:
model_path = "ChatDOC/OCRFlux-3B"

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    # attn_implementation="flash_attention_2",  # optional: requires the flash-attn package
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.



# **3. Define the Prompt**

In [14]:
EXTRACT_PROMPT = (
    "Below is the image of one page of a document. "
    "Just return the plain text representation of this document as if you were reading it naturally.\n"
    "ALL tables should be presented in HTML format.\n"
    "If there are images or figures in the page, present them in table format.\n"
    "Present all titles and headings as H1 headings.\n"
    "Do not hallucinate.\n"
)

# **4. Inference**

In [16]:
def ocr_page(image_path, model, processor, max_new_tokens=4096):
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"file://{image_path}"},
                {"type": "text", "text": EXTRACT_PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

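    # Greedy decoding: `temperature` has no effect when do_sample=False, so it is not passed.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, output_ids)
    ]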

    output_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    return output_text[0]
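
The examples further down call `ocr_page` once per cell. As a minimal sketch, the same runs could be driven from a loop that also records wall-clock time per page. `run_pages` and its `ocr_fn` parameter are hypothetical names that do not appear in the notebook; `ocr_fn` stands in for a call such as `lambda p: ocr_page(p, model, processor, max_new_tokens=15000)`.

```python
import time
from typing import Callable, Dict, List


def run_pages(image_paths: List[str], ocr_fn: Callable[[str], str]) -> List[Dict]:
    """Run ocr_fn on each path; record the raw output, any error, and elapsed seconds."""
    results = []
    for path in image_paths:
        start = time.perf_counter()
        try:
            raw, error = ocr_fn(path), None
        except Exception as exc:  # keep going if a single page fails
            raw, error = None, str(exc)
        results.append(
            {
                "image_path": path,
                "raw_output": raw,
                "error": error,
                "seconds": time.perf_counter() - start,
            }
        )
    return results
```

Passing the OCR call in as a function keeps the loop independent of the model, so it can be tried with a stub before loading any weights.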

In [17]:
def display_vertical_split(image_path: str, html_content: str):
    try:
        with open(image_path, "rb") as f:
            image_data = f.read()

        ext = image_path.split(".")[-1].lower()
        mime_map = {
            "jpg": "jpeg",
            "jpeg": "jpeg",
            "png": "png",
            "gif": "gif",
            "svg": "svg+xml",
        }
        mime_type = f"image/{mime_map.get(ext, 'png')}"
        b64_image = base64.b64encode(image_data).decode("utf-8")
        image_uri = f"data:{mime_type};base64,{b64_image}"

    except FileNotFoundError:
        print(f"Error: Image file not found at '{image_path}'")
        return
    except Exception as e:
        print(f"An error occurred while processing the image: {e}")
        return

    # Convert the model's Markdown output to HTML; the 'extra' extension enables tables.
    # Do NOT wrap the source Markdown text, as that would break table syntax.
    # Unescape "\$" so that dollar amounts render as a plain "$".
    html_content = markdown(html_content.replace("\\$", "$"), extensions=["extra"])

    # Build the HTML and CSS for the split view (image on the left, text on the right).
    # The <style> block formats the rendered Markdown output, especially tables.
    html_template = f"""
    <style>
        .container {{
            display: flex;
            align-items: flex-start;
            width: 100%;
            border: 1px solid #e0e0e0;
            border-radius: 8px;
            overflow: hidden;
            font-family: sans-serif;
        }}
        .pane {{
            flex: 1;
            padding: 15px;
            min-width: 0; /* Important for flexbox wrapping */
        }}
        .pane img {{
            width: 100%;
            height: auto;
            object-fit: contain;
            border-radius: 4px;
        }}
        .divider {{
            width: 1px;
            background-color: #e0e0e0;
            align-self: stretch;
        }}
        /* Markdown-specific styles */
        .markdown-body {{
            font-size: 14px;
            line-height: 1.6;
        }}
        .markdown-body h1, .markdown-body h2, .markdown-body h3 {{
            border-bottom: 1px solid #eee;
            padding-bottom: .3em;
            margin-top: 24px;
            margin-bottom: 16px;
        }}
        .markdown-body table {{
            border-collapse: collapse;
            width: 100%;
            margin-top: 1em;
            margin-bottom: 1em;
        }}
        .markdown-body th, .markdown-body td {{
            border: 1px solid #ccc;
            padding: 8px 12px;
            text-align: left;
        }}
        .markdown-body th {{
            font-weight: bold;
        }}
        .markdown-body code {{
            background-color: rgba(27,31,35,.05);
            padding: .2em .4em;
            margin: 0;
            font-size: 85%;
            border-radius: 3px;
        }}
    </style>

    <div class="container">
      <!-- Left Pane: Image -->
      <div class="pane">
        <img src="{image_uri}">
      </div>

      <!-- Vertical Divider -->
      <div class="divider"></div>

      <!-- Right Pane: Markdown Text -->
      <div class="pane markdown-body">
        {html_content}
      </div>
    </div>
    """
    display(HTML(html_template))
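
`display_vertical_split` embeds the page image as a base64 data URI and picks the MIME type from a hand-written extension map. A possible alternative, sketched here under the assumption that the standard-library `mimetypes` table is good enough for the image formats involved, is shown below. The helper name `image_bytes_to_data_uri` is hypothetical and not part of the notebook.

```python
import base64
import mimetypes


def image_bytes_to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Build a data URI, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/png"  # same fallback as display_vertical_split
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `src` attribute of an `<img>` tag, as the function above does with `image_uri`.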

## **Example 1**

In [18]:
%%time

image_path = "cga_images/RAP_CGA_FR_ANG_2022-images-79 (1) (1).jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


CPU times: user 1min 1s, sys: 31.7 ms, total: 1min 1s
Wall time: 1min 1s


In [20]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| LOCAL INSURANCE COMPANIES | LEGAL FORM | SPECIALITY | NET PREMIUMS | | ANNUAL CHANGE |
| --- | --- | --- | --- | --- | --- |
| | | | 2021 | 2022 | 2022/2021 |
| DIRECT INSURANCE COMPANIES | | | | | |
| STAR | LIMITED COMPANY | COMPOSITE | 3682 | 3863 | 4,9% |
| COMAR | LIMITED COMPANY | COMPOSITE | 2333 | 2528 | 8,4% |
| ASTREE | LIMITED COMPANY | COMPOSITE | 1872 | 2360 | 26,1% |
| GAT | LIMITED COMPANY | COMPOSITE | 2184 | 2350 | 7,6% |
| MAGHREBIA | LIMITED COMPANY | COMPOSITE | 2025 | 2261 | 11,7% |
| ASSURANCES BIAT | LIMITED COMPANY | COMPOSITE | 1714 | 2063 | 20,4% |
| AMI | LIMITED COMPANY | COMPOSITE | 1429 | 1898 | 32,8% |
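
Each example parses the model output with `json.loads(result)` and then reads the `"natural_text"` key. That call raises if the output is not valid JSON, which could plausibly happen when generation stops at `max_new_tokens` before the closing brace. A minimal defensive sketch follows; the helper name `extract_natural_text` is hypothetical, and falling back to the raw string is one possible policy rather than the notebook's behaviour.

```python
import json


def extract_natural_text(raw_output: str) -> str:
    """Return the "natural_text" field, or the raw string if it cannot be read."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return raw_output  # truncated or non-JSON output: show it as-is
    if isinstance(parsed, dict) and isinstance(parsed.get("natural_text"), str):
        return parsed["natural_text"]
    return raw_output
```

With this helper, the display call could take `html_content=extract_natural_text(result)` instead of indexing into `result_json`.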


## **Example 2**

In [21]:
%%time

image_path = "Different_Tables_Images_Testing/other_table.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 2min 52s, sys: 341 ms, total: 2min 53s
Wall time: 2min 53s


In [22]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| PAYS | Primes Nettes (M.D) | | | | Total | Part du marché mondial (%) | Densité d'assurance (D) | Taux de Pénétration de l'assurance (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Assurance Vie | Assurance Non-Vie | | | | | | |
| | Valeur | Part % | Valeur | Part % | | | Primes Nettes/ Primes Mondiales | Primes Nettes/ Population |
| Le Monde | 8 720 399 | 415 | 12 304 529 | 585 | 21 024 929 | 100 | 2 644 | 68 |
| Etats-Unis et Canada | 2 305 303 | 238 | 7 400 112 | 762 | 9 705 415 | 4620 | 26 087 | 113 |
| Etats - Unis | 2 083 219 | 227 | 7 092 186 | 773 | 9 175 405 | 4364 | 27 544 | 116 |
| Canada | 222 081 | 419 | 307 932 | 581 | 530 013 | 252 | 13 615 | 80 |
| Amérique Latine et Caraibes | 229 164 | 435 | 297 681 | 565 | 526 845 | 251 | 800 | 30 |
| Brésil | 123 098 | 523 | 112 118 | 477 | 235 216 | 112 | 1 091 | 40 |
| Mexique | 46 953 | 447 | 58 100 | 553 | 105 053 | 050 | 822 | 24 |


## **Example 3**

In [23]:
%%time

image_path = "cga_images/RAP_CGA_FR_ANG_2022-images-15.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 28.6 s, sys: 290 ms, total: 28.9 s
Wall time: 28.8 s


In [24]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| Pays | Primes nettes (Milliard USD) | Part du marché mondial (%) | |
| --- | --- | --- | --- |
| 1 | Etats Unis | 2960 | 436.0 |
| 2 | RP Chine | 698 | 103.0 |
| 3 | Royaume Uni | 363 | 54.0 |
| 4 | Japon | 338 | 50.0 |
| 5 | France | 261 | 39.0 |
| Total | | 4620 | 682.0 |
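
Since the prompt asks for every table in HTML, the tables can in principle be pulled out of `natural_text` as rows of cell strings for further checks. The notebook imports `BeautifulSoup` but never uses it; the sketch below uses the standard-library `html.parser` instead so that it has no extra dependencies. The names `TableExtractor` and `extract_tables` are hypothetical, and the sketch assumes simple tables: nested tables are not handled and `rowspan`/`colspan` are not expanded.

```python
from html.parser import HTMLParser
from typing import List


class TableExtractor(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(text: str) -> List[List[List[str]]]:
    """Return all tables found in text; text outside tables is ignored."""
    parser = TableExtractor()
    parser.feed(text)
    parser.close()
    return parser.tables
```

A call such as `extract_tables(result_json["natural_text"])` would return one list of rows per table, which can then be compared against the source page.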


## **Example 4**

In [25]:
%%time

image_path = "Different_Tables_Images_Testing/NVIDIA_Report_Removed_Pages-1-16-14_page-0001.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 44.1 s, sys: 33.2 ms, total: 44.2 s
Wall time: 44.2 s


In [26]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| | Jan 28, 2024 | Jan 29, 2023 |
| --- | --- | --- |
| Assets | | |
| Current assets: | | |
| Cash and cash equivalents | $ 7,280 | $ 3,389 |
| Marketable securities | 18704 | 9907 |
| Accounts receivable, net | 9999 | 3827 |
| Inventories | 5282 | 5159 |
| Prepaid expenses and other current assets | 3080 | 791 |
| Total current assets | 44345 | 23073 |
| Property and equipment, net | 3914 | 3807 |


## **Example 5**

In [27]:
%%time

image_path = "pdf_files_pages/Blackstone4Q24EarningsPressRelease_page-0020.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 1min 37s, sys: 33.6 ms, total: 1min 37s
Wall time: 1min 37s


In [28]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| | Three Months Ended December 31, 2024 | | | | |
| --- | --- | --- | --- | --- | --- |
| | Real Estate | Private Equity | Credit & Insurance | Multi-Asset Investing | Total |
| Beginning Balance | $ 325,076 | $ 344,710 | $ 354,742 | $ 83,101 | $ 1,107,628 |
| Inflows | 8094 | 11617 | 34181 | 3607 | 57500 |
| Outflows | (3,047) | (2,735) | (3,907) | (3,856) | (13,545) |
| Net Flows | 5047 | 8882 | 30274 | (248) | 43955 |
| Realizations | (5,457) | (10,566) | (8,698) | (1,179) | (25,900) |
| Market Activity | (9,312) | 9142 | (810) | 2477 | 1497 |
| Ending Balance | $ 315,353 | $ 352,169 | $ 375,508 | $ 84,150 | $ 1,127,180 |
| % Change | (3)% | 2% | 6% | 1% | 2% |

| | Twelve Months Ended December 31, 2024 | | | | |
| --- | --- | --- | --- | --- | --- |
| | Real Estate | Private Equity | Credit & Insurance | Multi-Asset Investing | Total |
| | $ 336,940 | $ 314,391 | $ 312,674 | $ 76,187 | $ 1,040,192 |
| Beginning Balance | 27941 | 41285 | 91200 | 11032 | 171459 |
| Inflows | (24,543) | (7,226) | (6,348) | (9,688) | (47,805) |
| Outflows | 3398 | 34059 | 84853 | 1344 | 123654 |
| Net Flows | (22,164) | (28,931) | (33,319) | (2,729) | (87,142) |
| Realizations | (2,820) | 32648 | 11300 | 9348 | 50476 |
| Market Activity | (2,820) | 32648 | 11300 | 9348 | 50476 |
| Ending Balance | $ 315,353 | $ 352,169 | $ 375,508 | $ 84,150 | $ 1,127,180 |

| | Three Months Ended December 31, 2024 | | | | |
| --- | --- | --- | --- | --- | --- |
| | Real Estate | Private Equity | Credit & Insurance | Multi-Asset Investing | Total |
| Beginning Balance | $ 285,488 | $ 208,682 | $ 251,567 | $ 74,720 | $ 820,457 |
| Inflows | 6565 | 7086 | 22872 | 2685 | 39208 |
| Outflows | (1,691) | (1,729) | (3,150) | (3,615) | (10,184) |
| Net Flows | 4874 | 5358 | 19722 | (930) | 29024 |
| Realizations | (6,038) | (3,791) | (4,947) | (1,102) | (15,879) |
| Market Activity | (5,409) | 1935 | (1,725) | 2305 | (2,894) |
| Ending Balance | $ 278,915 | $ 212,183 | $ 264,618 | $ 74,993 | $ 830,709 |
| % Change | (2)% | 2% | 5% | 0% | 1% |

| | Twelve Months Ended December 31, 2024 | | | | |
| --- | --- | --- | --- | --- | --- |
| | Real Estate | Private Equity | Credit & Insurance | Multi-Asset Investing | Total |
| | $ 298,889 | $ 176,997 | $ 218,189 | $ 68,532 | $ 762,608 |
| Beginning Balance | 28674 | 46270 | 71530 | 8958 | 155432 |
| Inflows | (23,207) | (7,998) | (6,392) | (8,769) | (46,365) |
| Outflows | 5467 | 38272 | 65138 | 189 | 109067 |
| Net Flows | (23,409) | (9,409) | (23,840) | (2,505) | (59,163) |
| Realizations | (2,033) | 6322 | 5131 | 8777 | 18197 |
| Market Activity | (2,033) | 6322 | 5131 | 8777 | 18197 |
| Ending Balance | $ 278,915 | $ 212,183 | $ 264,618 | $ 74,993 | $ 830,709 |
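
Roll-forward tables like these allow a simple arithmetic check of the extracted values. The sketch below assumes the usual relation, ending balance = beginning balance + net flows + realizations + market activity, which the three-month figures above satisfy to within 1 (presumably rounding). The function names `parse_amount` and `roll_forward_gap` are hypothetical and not part of the notebook.

```python
def parse_amount(cell: str) -> int:
    """Parse '$ 1,234', '(1,234)' or '1234' into an int; parentheses mean negative.

    Raises ValueError for cells that are not whole numbers (blanks, dashes, percentages).
    """
    text = cell.replace("$", "").replace(",", "").strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    value = int(text)
    return -value if negative else value


def roll_forward_gap(beginning, net_flows, realizations, market_activity, ending) -> int:
    """Ending balance minus the sum of its components (0 means an exact match)."""
    parts = [parse_amount(c) for c in (beginning, net_flows, realizations, market_activity)]
    return parse_amount(ending) - sum(parts)


# Real Estate and Credit & Insurance columns of the first three-month table.
print(roll_forward_gap("$ 325,076", "5047", "(5,457)", "(9,312)", "$ 315,353"))  # -1
print(roll_forward_gap("$ 354,742", "30274", "(8,698)", "(810)", "$ 375,508"))   # 0
```

A gap much larger than 1 would point to a misread digit or a misaligned row. In the twelve-month tables as captured here, the row labels appear shifted down by one row, which is the kind of problem such a check would surface.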


## **Example 6**

In [29]:
%%time

image_path = "pdf_files_pages/RHG_annual_report_2022_page-0047.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 1min 19s, sys: 35.7 ms, total: 1min 19s
Wall time: 1min 19s


In [30]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021 |
| Cost | | |
| Balance as of January 1 | 176 | 176 |
| Investments | — | — |
| Balance as of December 31 | 176 | 176 |
| Accumulated depreciations | | |
| Balance as of January 1 | -176 | -176 |
| Depreciation | — | — |
| Balance as of December 31 | -176 | -176 |

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021.0 |
| Cost | | |
| Balance as of January 1 | 197 | 813.0 |
| Derecognised on disposals | — | -616.0 |
| Balance as of December 31 | 197 | 197.0 |
| Accumulated depreciations | | |
| Balance as of January 1 | -193 | -782.0 |
| Depreciation | -2 | -27.0 |
| Derecognised on disposals | — | 616.0 |

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021.0 |
| Balance as of January 1 | 447017 | 322888.0 |
| Investment in Radisson Hospitality Belgium SPRL | 310000 | 212392.0 |
| Write-down of Radisson Hotel Holdings AB | — | -88263.0 |
| Balance as of December 31 | 757017 | 447017.0 |

| | Registered in | Identity no. | No. of shares | Owned share in % | Book value |
| --- | --- | --- | --- | --- | --- |
| Radisson Hotel Holdings AB | Stockholm | 556674–0972 | 106667 | 100 | 234625 |
| Radisson Hospitality | | | | | |
| Belgium SPRL | Brussels | 442832318 | 21240373169 | 100 | 522392 |

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021 |
| Non-current receivables on group companies | | |
| Interest-bearing receivables | 290000 | 748000 |
| Current receivables on group companies | | |
| Cash pool | 6297 | 4125 |
| Accrued interest income | 3626 | — |
| Group contribution | 837 | 2914 |
| Other | 511 | 987 |
| Total | 301271 | 756026 |

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021.0 |
| Accounts payables | 2382 | 165.0 |
| Total | 2382 | 165.0 |

| | Current As of Dec. 31 | | Non-current As of Dec. 31 | |
| --- | --- | --- | --- | --- |
| | 2022 | 2021.0 | 2022 | 2021.0 |
| Borrowings from related parties | 500 | 493.0 | 297000 | 755000.0 |
| Total | 500 | 493.0 | 297000 | 755000.0 |

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021.0 |
| Salaries and remuneration | 1334 | 1258.0 |
| Accrued interest expense | 4012 | 517.0 |
| Other accrued expenses | 311 | 690.0 |
| Total | 5657 | 2465.0 |

| | As of Dec. 31 | |
| --- | --- | --- |
| TEUR | 2022 | 2021 |
| Pledged assets | — | — |
| Contingent liabilities | See below | See below |


## **Example 7**

In [31]:
%%time

image_path = "pdf_files_pages/CLAS-FY2023-AR_page-0076.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 52.1 s, sys: 66.4 ms, total: 52.1 s
Wall time: 52.1 s


In [32]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| Property Name | Address | Number of Units | Tenure (Years) | Tenure Expiry Date (Year) | Agreed Property Value at Acquisition (S$’million) |
| --- | --- | --- | --- | --- | --- |
| United Kingdom | | | | | |
| Citadines Barbican London | 7-21 Goswell Road, London EC1M 7AH, United Kingdom | 129 | Freehold | - | 75.0 |
| Citadines Holborn- Covent Garden London | 94-99 High Holborn, London WC1V 6LF, United Kingdom | 192 | Freehold | - | 127.5 |
| Citadines South Kensington London | 35A Gloucester Road, London SW7 4PL, United Kingdom | 92 | Freehold | - | 71.1 |
| Citadines Trafalgar Square London | 18/21 Northumberland Avenue, London WC2N 5EA, United Kingdom | 187 | Freehold | - | 130.9 |
| The Cavendish London | 81 Jermyn St, St. James’s, London SW1Y 6JF, United Kingdom | 230 | 150 | 2158 | 372.3 |
| United States of America (USA) | | | | | |
| Element New York Times Square West | 311 West 39th Street, New York, New York 10018, The United States of America | 411 | 99 | 2112 | 220.7 |
| Sheraton Tribeca New York Hotel | 370 Canal Street, New York, New York 10013, The United States of America | 369 | 99 | 2112 | 218.0 |


## **Example 8**

In [33]:
%%time

image_path = "Different_Tables_Images_Testing/table_3.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 1min 22s, sys: 190 ms, total: 1min 22s
Wall time: 1min 22s


In [34]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| | Common Stock | | Additional Paid-in Capital | Accumulated Other Comprehensive Income (Loss) | Retained Earnings | Total Costco Stockholders’ Equity | Noncontrolling Interests | Total Equity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Shares (000’s) | Amount | | | | | | |
| BALANCE AT AUGUST 30, 2020............ | 441255 | $ 4 | $ 6,698 | $ (1,297) | $ 12,879 | $ 18,284 | $ 421 | $ 18,705 |
| Net income.................... | — | — | — | — | 5007 | 5007 | 72 | 5079 |
| Foreign-currency translation adjustment and other, net... | — | — | — | 160 | — | 160 | 21 | 181 |
| Stock-based compensation... | — | — | 668 | — | — | 668 | — | 668 |
| Release of vested restricted stock units (RSUs), including tax effects........ | 1928 | — | (312) | — | — | (312) | — | (312) |
| Repurchases of common stock | (1,358) | — | (23) | — | (472) | (495) | — | (495) |
| Cash dividends declared...... | — | — | — | (5,748) | (5,748) | — | — | (5,748) |
| BALANCE AT AUGUST 29, 2021............ | 441825 | 4 | 7031 | (1,137) | 11666 | 17564 | 514 | 18078 |


## **Example 9**

In [35]:
%%time

image_path = "pdf_files_pages/CLAS-FY2023-AR_page-0066.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 59.2 s, sys: 75.4 ms, total: 59.3 s
Wall time: 59.3 s


In [36]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)

| | FY 2023 | | FY 2022 | |
| --- | --- | --- | --- | --- |
| | Revenue (S$million) | Gross Profit (S$million) | Revenue (S$million) | Gross Profit (S$million) |
| Master Leases | | | | |
| Australia | 10.6 | 9.8 | 10.6 | 10.0 |
| France | 33.1 | 29.8 | 27.2 | 25.0 |
| Germany | 16.6 | 14.8 | 13.4 | 12.2 |
| Japan | 22.2 | 19.7 | 22.0 | 19.5 |
| South Korea | 8.5 | 7.9 | 5.5 | 5.0 |
| Subtotal | 91.0 | 82.0 | 78.7 | 71.7 |
| Management Contracts with Minimum Guaranteed Income | | | | |


## **Example 10**

In [37]:
%%time

image_path = "cga_images/RAP_CGA_FR_ANG_2022-images-17.jpg"
result = ocr_page(image_path, model, processor, max_new_tokens=15000)
result_json = json.loads(result)


CPU times: user 23.3 s, sys: 276 ms, total: 23.6 s
Wall time: 23.5 s


In [38]:
display_vertical_split(
    image_path=image_path, html_content=result_json["natural_text"]
)