# PDF Support

This notebook demonstrates PDF processing support for both `Call` and `Chat` objects. PDFs can be provided as URLs or local file paths with various parsing engines.

In [1]:
from irouter import Call, Chat
from irouter.base import nb_markdown

# To load OPENROUTER_API_KEY from a .env file, create a .env file at the project root containing OPENROUTER_API_KEY=your_api_key.
# Alternatively, pass api_key=your_api_key to the Call or Chat class.
from dotenv import load_dotenv

load_dotenv()

True

For this example we use Moonshot AI's `kimi-k2` model.

If the selected LLM has native file processing capabilities, its native parser is used. Otherwise the `mistral-ocr` parser is used, which incurs a small cost.

The `PDF Parsing Configuration` section of this notebook shows how to configure a free PDF parsing engine. For more details on PDF support in OpenRouter and pricing, check [this docs page](https://openrouter.ai/docs/features/images-and-pdfs#pdf-support).

To see an overview of which LLMs support file input, check the [OpenRouter Model Overview](https://openrouter.ai/models?fmt=cards&input_modalities=audio%2Cfile).

In [2]:
model = "moonshotai/kimi-k2:free"
# The "Attention Is All You Need" paper
pdf_url = "https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"

In this example we will ask the LLM questions about the "Attention Is All You Need" paper.

<img src="https://nlp.seas.harvard.edu/images/the-annotated-transformer_0_0.png" width="300"/>



# PDF URL

The simplest way to work with PDF files in `irouter` is to pass the URL of the PDF file and the instruction together as a list of strings.
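To make the list-of-strings input more concrete, here is a minimal sketch of how such a list could be turned into message content parts. It mirrors the `file` and `text` entries that `chat.history` shows later in this notebook, but it is not taken from `irouter` itself: the helper names `pdf_part` and `to_content_parts`, the `.pdf` suffix check, and the base64 data URL handling for local paths are all assumptions made for illustration.

```python
import base64
from pathlib import Path


def pdf_part(source: str) -> dict:
    """Build a 'file' content part from a PDF URL or a local PDF path (hypothetical helper)."""
    if source.startswith(("http://", "https://")):
        # Remote PDFs are referenced directly by URL.
        filename = "document.pdf"
        file_data = source
    else:
        # Assumption: local PDFs are embedded as a base64 data URL.
        filename = Path(source).name
        encoded = base64.b64encode(Path(source).read_bytes()).decode("ascii")
        file_data = f"data:application/pdf;base64,{encoded}"
    return {"type": "file", "file": {"filename": filename, "file_data": file_data}}


def to_content_parts(items: list[str]) -> list[dict]:
    """Map a list of strings to content parts: PDFs become 'file' parts, the rest 'text' parts."""
    parts = []
    for item in items:
        if item.lower().endswith(".pdf"):
            parts.append(pdf_part(item))
        else:
            parts.append({"type": "text", "text": item})
    return parts
```

Detecting PDFs by file suffix keeps the sketch short; a real implementation may inspect the content type instead.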


In [3]:
c = Call(model)

In [4]:
nb_markdown(c([pdf_url, "What is the main contribution of this paper?"]))

The main contribution of this paper is the introduction of the **Transformer**, a novel neural network architecture for sequence transduction that **relies entirely on attention mechanisms, eliminating the need for recurrence or convolution**. This model is designed to be more **parallelizable**, **computationally efficient**, and **effective** than existing recurrent or convolutional approaches. Specifically:

1. **Architecture**: The Transformer replaces recurrent and convolutional layers with **multi-head self-attention**, allowing the model to process sequences in parallel and efficiently capture global dependencies.

2. **Performance**: The model achieves **state-of-the-art results** on WMT 2014 English-to-German (28.4 BLEU) and English-to-French (41.0 BLEU) translation tasks, **outperforming prior ensembles** while requiring **significantly less training time**.

3. **Efficiency**: The Transformer reduces computational complexity and training time through its **parallelizable** design, as it avoids the sequential nature of RNNs and the fixed kernel limitations of CNNs.

In summary, the paper demonstrates that **"attention is all you need"**—self-attention alone can effectively model sequence relationships while offering major advantages in speed and scalability.

# PDF Parsing Configuration

You can specify different PDF parsing engines using the `extra_body` parameter. For example, use the `pdf-text` engine for free parsing. Check [this docs page](https://openrouter.ai/docs/features/multimodal/pdfs#plugin-configuration) for more details on plugin configuration.
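If you switch engines often, a small helper can build the `extra_body` dictionary for you. The sketch below is illustrative only: `pdf_plugin_body` and `PDF_ENGINES` are hypothetical names, and the engine list reflects what the OpenRouter docs listed at the time of writing, so treat it as an assumption and check the docs page linked above.

```python
# Assumption: engine names as listed in the OpenRouter PDF docs; these may change.
PDF_ENGINES = ("pdf-text", "mistral-ocr", "native")


def pdf_plugin_body(engine: str) -> dict:
    """Return an extra_body dict selecting the given PDF parsing engine (hypothetical helper)."""
    if engine not in PDF_ENGINES:
        raise ValueError(f"Unknown PDF engine {engine!r}; expected one of {PDF_ENGINES}")
    return {"plugins": [{"id": "file-parser", "pdf": {"engine": engine}}]}
```

Calling `pdf_plugin_body("pdf-text")` yields the same dictionary that is written out by hand in the next cell.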

In [5]:
extra_body = {"plugins": [{"id": "file-parser", "pdf": {"engine": "pdf-text"}}]}

In [6]:
nb_markdown(c([pdf_url, "Summarize the key innovations in this paper."], extra_body=extra_body))

The paper introduces the **Transformer**, a novel neural network architecture that abandons recurrence and convolution entirely, relying solely on **self-attention mechanisms** for sequence transduction tasks like machine translation. Key innovations include:

1. **Self-Attention as the Core Mechanism**:  
   - Replaces RNNs/CNNs with **multi-head self-attention**, enabling direct modeling of dependencies between any two positions in a sequence, regardless of their distance.  
   - Uses **scaled dot-product attention** with a scaling factor \( \frac{1}{\sqrt{d_k}} \) to mitigate gradient vanishing for high-dimensional keys.

2. **Multi-Head Attention**:  
   - Parallel attention "heads" allow the model to jointly focus on information from different representation subspaces, improving depth and expressivity without significant computational overhead.

3. **Positional Encoding**:  
   - **Sinusoidal positional encodings** are added to token embeddings to inject sequence order information, eliminating the need for recurrence.  
   - Functions are chosen to enable extrapolation to longer sequences.

4. **Architecture Details**:  
   - **Encoder**: 6 identical layers, each with multi-head self-attention + position-wise feed-forward networks (FFNs).  
   - **Decoder**: 6 layers with additional encoder-decoder attention and **causal masking** to preserve auto-regressive properties.

5. **Efficiency**:  
   - Enables **massive parallelization** (O(1) sequential operations per layer vs. O(n) for RNNs).  
   - **Faster training**: Achieves state-of-the-art BLEU scores (28.4 EN-DE, 41.0 EN-FR) with **8 GPUs in 3.5 days**, far less than prior models.

6. **Interpretability**:  
   - Attention visualizations show heads capturing syntactic/semantic features (e.g., verb-object relationships).

**Impact**: Shows attention alone can match or surpass RNN/CNN-based models while being more scalable, laying the groundwork for modern LLMs.

# Chat with PDF

In contrast to the `Call` class, the `Chat` class tracks conversation history and token usage.

In [7]:
chat = Chat(model)

In [8]:
chat([pdf_url, "What is this paper about?"])

'The paper introduces the Transformer.  \nIt replaces recurrence and convolution with a new architecture that relies entirely on attention mechanisms.  \nThe encoder and decoder stacks consist of layers of multi-head self-attention and point-wise, fully-connected feed-forward networks.  \nUnlike recurrent models, this allows computation for all positions in parallel, halving training time on modern GPUs.  \nExperimental results on WMT 2014 English-German and English-French translation achieve new state-of-the-art BLEU scores (28.4 and 41.0 respectively) while costing much less to train than prior ensemble systems.'

Now we can ask follow-up questions about the PDF. `Chat` updates its history and token usage after each call.

In [9]:
chat("What are the key advantages of this approach over RNNs?")

'Key advantages of the Transformer (attention-only) over RNN-based sequence transduction models:\n\n1. **Massive Parallelization** – no sequential recurrence. All positions can be processed simultaneously, slashing wall-clock training time.\n2. **Lower Training Cost** – 12 h on 8 P100 GPUs for the base model, <¼ the FLOPs of previous best systems, while achieving higher BLEU scores.\n3. **Somewhat Lower Per-layer Complexity** for typical NLP sequence lengths – \u202f\n   O(n²·d) vs O(n·d²) when n ‹‹ d.\n4. **Shorter Maximum Path Length** – constant O(1) between any two positions versus O(n) for RNNs, improving gradient flow for long-range dependencies.'

In [10]:
chat.history

[{'role': 'system', 'content': 'You are a helpful assistant.'},
 {'role': 'user',
  'content': [{'type': 'file',
    'file': {'filename': 'document.pdf',
     'file_data': 'https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf'}},
   {'type': 'text', 'text': 'What is this paper about?'}]},
 {'role': 'assistant',
  'content': 'The paper introduces the Transformer.  \nIt replaces recurrence and convolution with a new architecture that relies entirely on attention mechanisms.  \nThe encoder and decoder stacks consist of layers of multi-head self-attention and point-wise, fully-connected feed-forward networks.  \nUnlike recurrent models, this allows computation for all positions in parallel, halving training time on modern GPUs.  \nExperimental results on WMT 2014 English-German and English-French translation achieve new state-of-the-art BLEU scores (28.4 and 41.0 respectively) while costing much less to train than prior ensemble systems.'},


In [11]:
chat.usage

{'prompt_tokens': 18709, 'completion_tokens': 282, 'total_tokens': 18991}
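The usage dictionary above holds running totals across the calls made so far. As a rough sketch of how such totals could be maintained, the hypothetical helper below sums per-call usage dictionaries key by key. It is not `irouter`'s actual implementation, and the numbers in the example are made up.

```python
USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def add_usage(total: dict, call: dict) -> dict:
    """Return new running totals after adding one call's usage (hypothetical helper)."""
    return {key: total.get(key, 0) + call.get(key, 0) for key in USAGE_KEYS}


# Made-up per-call usage values, for illustration only.
running = {}
for call_usage in (
    {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15},
    {"prompt_tokens": 20, "completion_tokens": 7, "total_tokens": 27},
):
    running = add_usage(running, call_usage)
```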