In [4]:
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path
import torch

In [5]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [6]:
image = load_image("../data/page_1.jpeg")

In [7]:
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

In [8]:
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

In [9]:
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [10]:
# Generate output token IDs; max_new_tokens caps how many tokens can be produced
generated_ids = model.generate(**inputs, max_new_tokens=8192)

In [11]:
prompt_length = inputs.input_ids.shape[1]

In [12]:
prompt_length

878

In [13]:
# generate() returns the prompt followed by the new tokens; keep only the new ones
trimmed_generated_ids = generated_ids[:, prompt_length:]
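
The slicing step above can be illustrated without loading the model. This is a minimal sketch that uses plain Python lists in place of tensors; the token IDs are made up for illustration and do not come from the SmolDocling tokenizer.

```python
# Toy stand-in for generate() output: each row is the prompt followed by new tokens.
# All token IDs below are invented for illustration.
prompt_ids = [[101, 7, 8, 9]]
new_ids = [[42, 43, 2]]
generated = [p + n for p, n in zip(prompt_ids, new_ids)]

# Same idea as generated_ids[:, prompt_length:] on a 2-D tensor:
# drop the first prompt_length tokens of every row.
prompt_length = len(prompt_ids[0])
trimmed = [row[prompt_length:] for row in generated]

print(trimmed)
```

Only the newly generated tokens remain in each row, which is what gets decoded in the next cell.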

In [14]:
# Decode without skipping special tokens, then strip leading whitespace
doctags = processor.batch_decode(trimmed_generated_ids, skip_special_tokens=False)[0].lstrip()

In [15]:
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

In [16]:
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

<doctag><page_header><loc_15><loc_157><loc_31><loc_362>arXiv:2302.13971v1 [cs.CL] 27 Feb 2023</page_header>
<section_header_level_1><loc_95><loc_44><loc_405><loc_53>LLaMA: Open and Efficient Foundation Language Models</section_header_level_1>
<text><loc_94><loc_68><loc_408><loc_103>Hugo Touvron$^{*}$ Thibaut Lavril$^{†}$ Gautier Izacard$^{†}$ Xavier Martinet Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin Edouard Grave$^{†}$ Guillaume Lample$^{*}$</text>
<section_header_level_1><loc_232><loc_108><loc_268><loc_115>Meta AI</section_header_level_1>
<section_header_level_1><loc_131><loc_129><loc_170><loc_136>Abstract</section_header_level_1>
<text><loc_69><loc_141><loc_230><loc_226>We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available dat
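
To inspect the raw DocTags string without going through `DoclingDocument`, a small regular expression is enough for simple elements. The sketch below is not part of docling: `ELEMENT_RE` and `parse_elements` are hypothetical helpers, and reading the four `<loc_N>` values as a bounding box is an assumption based on the output printed above. It only handles flat elements of the form `<tag><loc_..>×4 text</tag>`; nested structures such as tables are out of scope.

```python
import re

# Hypothetical helper pattern: a tag name, four <loc_N> tokens, the text,
# and the matching closing tag.
ELEMENT_RE = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
    r"(?P<text>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_elements(doctags: str) -> list[dict]:
    """Return tag name, location values, and text for each flat element."""
    elements = []
    for m in ELEMENT_RE.finditer(doctags):
        # Groups 2..5 hold the four location values (assumed to be a bounding box).
        locs = tuple(int(m.group(i)) for i in range(2, 6))
        elements.append({"tag": m.group("tag"), "loc": locs, "text": m.group("text")})
    return elements


# First two lines of the output printed above.
sample = (
    "<doctag><page_header><loc_15><loc_157><loc_31><loc_362>"
    "arXiv:2302.13971v1 [cs.CL] 27 Feb 2023</page_header>\n"
    "<section_header_level_1><loc_95><loc_44><loc_405><loc_53>"
    "LLaMA: Open and Efficient Foundation Language Models</section_header_level_1>\n"
)

elements = parse_elements(sample)
for el in elements:
    print(el["tag"], el["loc"], el["text"])
```

For anything beyond quick inspection, the `DocTagsDocument` and `DoclingDocument` route used in the following cells is the safer choice, since it understands the full tag vocabulary.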

In [17]:
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")

In [18]:
print(doc.export_to_markdown())

## LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron$^{*}$ Thibaut Lavril$^{†}$ Gautier Izacard$^{†}$ Xavier Martinet Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin Edouard Grave$^{†}$ Guillaume Lample$^{*}$

## Meta AI

## Abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community$^{1}$.

## 1 Introduction

Large Languages Models (LMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual ins