### Chat with your PDFs using byaldi + Claude 🚀
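This notebook retrieves the most relevant PDF pages with byaldi (ColPali) and passes the top page image to Claude along with the question. It runs through two example documents: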
- academic paper
- financial report

### Setup
- Follow the byaldi setup instructions [here](https://github.com/AnswerDotAI/byaldi/)
- Install claudette: `pip install claudette`

In [11]:
import base64
import os
from byaldi import RAGMultiModalModel
from claudette import *

os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN" # to download the ColPali model
os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY" # used by claudette to call Claude
model = RAGMultiModalModel.from_pretrained("vidore/colpali")

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.05s/it]
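
The setup cell writes both keys inline, so a placeholder left unreplaced only surfaces later as a download or API error. A minimal sketch of an early check follows; `missing_keys` and the placeholder set are hypothetical names for illustration, not part of byaldi or claudette.

```python
import os

# Hypothetical helper: values treated as "not really set" (assumed placeholders).
PLACEHOLDERS = {"", "YOUR_HF_TOKEN", "ANTHROPIC_API_KEY", "YOUR_ANTHROPIC_API_KEY"}

def missing_keys(names, env=None):
    """Return the names whose value is unset or still a placeholder."""
    env = os.environ if env is None else env
    return [name for name in names if env.get(name, "").strip() in PLACEHOLDERS]

# Demo with an explicit mapping so the real environment is not touched.
demo_env = {"HF_TOKEN": "hf_example", "ANTHROPIC_API_KEY": "ANTHROPIC_API_KEY"}
print(missing_keys(["HF_TOKEN", "ANTHROPIC_API_KEY"], demo_env))
```

In the notebook, calling `missing_keys(["HF_TOKEN", "ANTHROPIC_API_KEY"])` before `from_pretrained` would check the real environment.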


### Academic Paper

In [7]:
# Let's create the index
model.index(
    input_path="attention.pdf",
    index_name="attention",
    store_collection_with_index=True,  # keep base64 page images so results expose .base64
    overwrite=True  # replace any existing index with this name
)

Added page 1 of document 0 to index.
Added page 2 of document 0 to index.
Added page 3 of document 0 to index.
Added page 4 of document 0 to index.
Added page 5 of document 0 to index.
Added page 6 of document 0 to index.
Added page 7 of document 0 to index.
Added page 8 of document 0 to index.
Added page 9 of document 0 to index.
Added page 10 of document 0 to index.
Added page 11 of document 0 to index.
Added page 12 of document 0 to index.
Added page 13 of document 0 to index.
Added page 14 of document 0 to index.
Added page 15 of document 0 to index.
Index exported to .byaldi/attention


In [8]:
query = "What's the BLEU score for the Transformer base model?"
results = model.search(query, k=3)  # top 3 pages by retrieval score
for result in results:
    print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

chat = Chat(models[1])  # let's use sonnet (see the model field in the output)
image_bytes = base64.b64decode(results[0].base64)  # decode the best-matching page image
chat([image_bytes, query])

Doc ID: 0, Page: 8, Score: 23.375
Doc ID: 0, Page: 9, Score: 22.625
Doc ID: 0, Page: 2, Score: 20.875


According to the table in the image, the BLEU score for the Transformer (base model) is:

- 27.3 for EN-DE (English to German)
- 38.1 for EN-FR (English to French)

<details>

- id: `msg_01HmK5iN2JtLnnmTeMs587mw`
- content: `[{'text': 'According to the table in the image, the BLEU score for the Transformer (base model) is:\n\n- 27.3 for EN-DE (English to German)\n- 38.1 for EN-FR (English to French)', 'type': 'text'}]`
- model: `claude-3-5-sonnet-20240620`
- role: `assistant`
- stop_reason: `end_turn`
- stop_sequence: `None`
- type: `message`
- usage: `{'input_tokens': 1522, 'output_tokens': 58, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}`

</details>
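
The cell above sends only the single best page to Claude. When an answer may span neighbouring pages, one option is to decode the top few hits instead. The sketch below assumes result objects with the `score` and `base64` attributes used above; `top_page_images` is a hypothetical helper, and the stand-in results exist only so the example runs without a model.

```python
import base64
from types import SimpleNamespace

def top_page_images(results, n=2):
    """Decode the base64 page images of the n highest-scoring results."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    return [base64.b64decode(r.base64) for r in ranked[:n]]

# Stand-in results mimicking the attributes printed by the search cell.
fake_results = [
    SimpleNamespace(doc_id=0, page_num=9, score=22.625,
                    base64=base64.b64encode(b"page-9").decode()),
    SimpleNamespace(doc_id=0, page_num=8, score=23.375,
                    base64=base64.b64encode(b"page-8").decode()),
]
images = top_page_images(fake_results, n=2)
print(images)

# With real results, something like chat([*images, query]) could follow;
# this assumes the chat accepts several images in one prompt.
```

Sending more pages raises the input token count, so a small `n` is a reasonable starting point.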

### Financial Report

In [9]:
# Let's create the index
model.index(
    input_path="financial_report.pdf",
    index_name="financial_report",
    store_collection_with_index=True,  # keep base64 page images so results expose .base64
    overwrite=True  # replace any existing index with this name
)

Added page 1 of document 1 to index.
Added page 2 of document 1 to index.
Added page 3 of document 1 to index.
Added page 4 of document 1 to index.
Added page 5 of document 1 to index.
Added page 6 of document 1 to index.
Index exported to .byaldi/financial_report


In [10]:
query = "In which month did Product C generate the most revenue?"
results = model.search(query, k=3)  # top 3 pages by retrieval score
for result in results:
    print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

chat = Chat(models[1])  # let's use sonnet (see the model field in the output)
image_bytes = base64.b64decode(results[0].base64)  # decode the best-matching page image
chat([image_bytes, query])

Doc ID: 1, Page: 4, Score: 20.375
Doc ID: 1, Page: 6, Score: 18.625
Doc ID: 1, Page: 5, Score: 18.5


According to the bar graph showing monthly revenue for Product C, the month with the highest revenue was June. The bar for June is visibly the tallest, reaching above 2500 on the revenue scale, indicating it generated the most revenue compared to all other months shown.

<details>

- id: `msg_01FEU91Kwi8PfLGpqvrycM8n`
- content: `[{'text': 'According to the bar graph showing monthly revenue for Product C, the month with the highest revenue was June. The bar for June is visibly the tallest, reaching above 2500 on the revenue scale, indicating it generated the most revenue compared to all other months shown.', 'type': 'text'}]`
- model: `claude-3-5-sonnet-20240620`
- role: `assistant`
- stop_reason: `end_turn`
- stop_sequence: `None`
- type: `message`
- usage: `{'input_tokens': 1573, 'output_tokens': 59, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}`

</details>
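
Both examples repeat the same retrieve-then-ask steps, so they can be folded into one function. This is a minimal sketch rather than an API from either library: `ask_pdf` is a hypothetical name, and the search and chat callables are passed in so the function can be exercised with stand-ins.

```python
import base64
from types import SimpleNamespace

def ask_pdf(search, ask, query, k=3):
    """Retrieve k pages, send the top page image with the query, return (answer, hits)."""
    results = search(query, k=k)
    if not results:
        raise ValueError("search returned no pages")
    image_bytes = base64.b64decode(results[0].base64)
    answer = ask([image_bytes, query])
    hits = [(r.doc_id, r.page_num, r.score) for r in results]
    return answer, hits

# Stand-ins so the sketch runs without a model or an API key.
def fake_search(query, k=3):
    page = base64.b64encode(b"page-4").decode()
    return [SimpleNamespace(doc_id=1, page_num=4, score=20.375, base64=page)][:k]

def fake_ask(parts):
    image_bytes, question = parts
    return f"got {len(image_bytes)} bytes for: {question}"

answer, hits = ask_pdf(fake_search, fake_ask, "Which month?")
print(answer)
print(hits)
```

In this notebook the call would look like `ask_pdf(model.search, Chat(models[1]), query)`, which mirrors the two cells above.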