# Multimodal Parsing using GPT4o-mini


This cookbook shows you how to use LlamaParse to parse any document with the multimodal capabilities of GPT4o-mini.

LlamaParse allows you to plug in external, multimodal model vendors for parsing - we handle the error correction, validation, and scalability/reliability for you.


## Setup

Download the data - the blog post from Meta on Llama3.1, in PDF form.

In [1]:
import nest_asyncio

nest_asyncio.apply()
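
The download command below writes to `data/report.pdf`, and `wget -O` will typically fail if the `data` directory does not exist yet. A minimal sketch for creating it first (the use of `pathlib` here is our own choice, not part of the original notebook):

```python
from pathlib import Path

# Create the target directory for the download if it is missing.
# exist_ok=True makes this safe to re-run.
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)
```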

In [13]:
!wget "https://www.dropbox.com/scl/fi/8iu23epvv3473im5rq19g/llama3.1_blog.pdf?rlkey=5u417tbdox4aip33fdubvni56&st=dzozd11e&dl=1" -O "data/report.pdf"


![llama_blog_img](llama3.1-p5.png)

## Initialize LlamaParse

Initialize LlamaParse in multimodal mode, and specify the vendor.

**NOTE**: You can optionally provide your own OpenAI API key. If you do, you are charged the base LlamaParse price of 0.3c per page. If you don't, you are charged 1.5c per page, because we make the calls to gpt4o-mini for you, which gives you price predictability.
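
To make the two prices concrete, here is a rough cost sketch using the per-page figures quoted in the note. The page count and the helper name `parse_cost_usd` are hypothetical, chosen only for illustration.

```python
# Per-page prices in US cents, taken from the note above.
BASE_CENTS_PER_PAGE = 0.3      # you supply your own OpenAI API key
MANAGED_CENTS_PER_PAGE = 1.5   # LlamaParse makes the model calls for you


def parse_cost_usd(num_pages: int, cents_per_page: float) -> float:
    """Return the estimated parsing cost in US dollars."""
    return num_pages * cents_per_page / 100


num_pages = 1000  # hypothetical document volume
own_key_cost = parse_cost_usd(num_pages, BASE_CENTS_PER_PAGE)
managed_cost = parse_cost_usd(num_pages, MANAGED_CENTS_PER_PAGE)

print(f"Own key: ${own_key_cost:.2f}, managed: ${managed_cost:.2f}")
```

With your own key you still pay OpenAI separately for the model calls, so the base figure is not the total cost.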

In [2]:
from llama_index.core.schema import TextNode
from typing import List
import json


def get_text_nodes(json_list: List[dict]):
    text_nodes = []
    for idx, page in enumerate(json_list):
        text_node = TextNode(text=page["md"], metadata={"page": page["page"]})
        text_nodes.append(text_node)
    return text_nodes


def save_jsonl(data_list, filename):
    """Save a list of dictionaries as JSON Lines."""
    with open(filename, "w") as file:
        for item in data_list:
            json.dump(item, file)
            file.write("\n")


def load_jsonl(filename):
    """Load a list of dictionaries from JSON Lines."""
    data_list = []
    with open(filename, "r") as file:
        for line in file:
            data_list.append(json.loads(line))
    return data_list
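
As a quick sanity check on the JSON Lines helpers, the sketch below round-trips a couple of records through a temporary file. The helpers are restated so the snippet runs on its own, and the sample records are hypothetical; they only mimic the `page` and `md` keys that `get_text_nodes` reads.

```python
import json
import os
import tempfile


def save_jsonl(data_list, filename):
    """Write one JSON object per line."""
    with open(filename, "w") as file:
        for item in data_list:
            json.dump(item, file)
            file.write("\n")


def load_jsonl(filename):
    """Read one JSON object per line."""
    with open(filename, "r") as file:
        return [json.loads(line) for line in file]


# Hypothetical records; the markdown contains newlines on purpose.
pages = [
    {"page": 1, "md": "# Title\n\nBody text"},
    {"page": 2, "md": "|a|b|\n|---|---|\n|1|2|"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pages.jsonl")
    save_jsonl(pages, path)
    with open(path) as file:
        line_count = sum(1 for _ in file)
    restored = load_jsonl(path)
```

Because `json.dump` escapes newlines inside strings, each record stays on a single line even when the markdown itself spans several lines.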

In [16]:
from dotenv import load_dotenv

load_dotenv()

from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    invalidate_cache=True,
)
json_objs = parser.get_json_result("./data/report.pdf")
json_list = json_objs[0]["pages"]
docs = get_text_nodes(json_list)

Started parsing the file under job_id bf28a200-24a8-4e73-b834-2dc60cc33ac4


In [17]:
# Optional: Save
save_jsonl([d.dict() for d in docs], "docs.jsonl")

In [18]:
# Optional: Load
from llama_index.core import Document

docs_dicts = load_jsonl("docs.jsonl")
docs = [Document.parse_obj(d) for d in docs_dicts]

In [19]:
docs

[Document(id_='daf829c8-2a1b-4abb-8686-f80a1fe9e8d6', embedding=None, metadata={'page': 1}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Introducing Llama 3.1: Our most capable models to date\n\n# Large Language Model\n\n# Introducing Llama 3.1: Our most capable models to date\n\nJuly 23, 2024 • 15 minute read\n\n405B\nMeet Llama 3.1\n70B\n8 B\n# Takeaways:\n\n- Meta is committed to openly accessible AI. Read Mark Zuckerberg’s letter detailing why open source is good for developers, good for Meta, and good for the world.\n- Bringing open intelligence to all, our latest models expand context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first frontier-level open source AI model.\n- Llama 3.1 405B is in a class of its own, with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models. Our new model will enable the community to unlock new workflows, such as s

### Set up GPT-4o-mini baseline

For comparison, we will also parse the document using GPT-4o-mini (3c per page).

In [24]:
from llama_parse import LlamaParse

parser_gpt4o = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model="openai-gpt4o-mini",
    # invalidate_cache=True
)
json_objs_gpt4o = parser_gpt4o.get_json_result("./data/report.pdf")

json_list_gpt4o = json_objs_gpt4o[0]["pages"]
docs_gpt4o = get_text_nodes(json_list_gpt4o)

Started parsing the file under job_id 546bab8a-2806-4907-8e08-1cdb37091cdd


In [25]:
# Optional: Save
save_jsonl([d.dict() for d in docs_gpt4o], "docs_gpt4o-mini.jsonl")

In [27]:
# Optional: Load
from llama_index.core import Document

docs_gpt4o_mini_d = load_jsonl("docs_gpt4o-mini.jsonl")
docs_gpt4o_mini = [Document.parse_obj(d) for d in docs_gpt4o_mini_d]

## View Results

Let's compare the two parses of page 5 against the original document page shown above: `docs` comes from the first parser, and `docs_gpt4o` from the `openai-gpt4o-mini` multimodal parser.

**NOTE**: If you're using llama2-p33, use `docs[0]` instead of `docs[4]`.

In [28]:
# Output of the first parser (`parser`)
print(docs[4].get_content(metadata_mode="all"))

page: 5

# Introducing Llama 3.1: Our most capable models to date

|Category|Llama 3.1|Gemma|Mistral|Llama 3.1|Mixtral|GPT 3.5| |
|---|---|---|---|---|---|---|---|
|Benchmark|8B|9B IT|7B Instruct|70B|8x22B Instruct|Turbo| |
|Gcnam|72.3|72.3|72.3|72.3|72.3|72.3| | | | | | |
|MMLU (shot CoT)|73.0|(5 shot non Cot)|60.5|86.0|79.9|69.8| |
|MMLUPRO (S-shot; CoT)|48.3| |36.9|66.4|56.3|49.2| |
|IFEval|80.4|73.6|57.6|87.5|72.7|59.9| |
|Code| | | | | | | |
|HumanEval (0-shot)|72.6|54.3|40.2|80.5|75.6|68.0| |
|MBPP EvalPlus|72.8|71.7|49.5|86.0|78.6|82.0| |
|(basc) (0-shot)|Math|76.7|53.2|95.1|88.2|81.6| |
|GSMBK (S-shot, CoT)|84.5| | | | | | |
|MATH (0-shot, CoT)|51.9|44.3|13.0|68.0|54.1|43.1| |
|Reasoning?|ARC Challenge (S-shot)|83.4|87.6|74.2|94.8|88.7|83.7|
|GPQA (S-shot CoT)|32.8| |28.8|46.7|33.3|30.8| |
|Tool uSC|85.9|85.9|85.9|85.9|85.9|85.9| | | | | | |
|BFCL|76.1| |60.4|84.8| | | |
|Nexus|38.5|30.0|24.7|56.7|48.5|37.2| |
|Long context|ZeroSCROLLSIQuALITY|81.0| | |90.5| | |
|InfiniteBench/

In [None]:
# Output of the `openai-gpt4o-mini` multimodal parser (`parser_gpt4o`)
print(docs_gpt4o[4].get_content(metadata_mode="all"))

page: 5

# Introducing Llama 3.1: Our most capable models to date

## Meta

| Category | Benchmark | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |
|----------|-----------|--------------|---------------|---------------------|---------------|-----------------------|---------------|
| General  | MMLU (0-shot, CoT) | 73.0 | 72.3 (0-shot, non-CoT) | 60.5 | 86.0 | 79.9 | 69.8 |
|          | MMLU PRO (5-shot, CoT) | 48.3 | 71.7 | 36.9 | 66.4 | 56.3 | 49.2 |
|          | ITEval | 80.4 | 73.6 | 57.6 | 87.5 | 72.7 | 69.9 |
| Code     | HumanEval (0-shot) | 72.6 | 54.3 | 40.2 | 80.5 | 75.6 | 68.0 |
|          | MBPP EvalPlus (5-shot) (0-shot) | 72.8 | 71.7 | 49.5 | 86.0 | 78.6 | 82.0 |
| Math     | GSM8K | 84.5 | 76.7 | 53.2 | 95.1 | 88.2 | 81.6 |
|          | MATH (0-shot, CoT) | 51.9 | 44.3 | 13.0 | 68.0 | 54.1 | 43.1 |
| Reasoning | ARC Challenge (0-shot) | 83.4 | 87.6 | 74.2 | 94.8 | 88.7 | 83.7 |
|          | GOPA (0-shot) | 32.
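
Beyond eyeballing the two outputs, one rough structural check is whether every row of a parsed markdown table has the same number of cells as its header. The sketch below is a hypothetical helper, not part of LlamaParse, and the sample table is a trimmed, illustrative excerpt in the style of the output above.

```python
def column_counts(markdown: str) -> list:
    """Return the number of cells in each pipe-delimited table row."""
    counts = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|") and len(line) > 1:
            # Drop the outer pipes, then split on the inner ones.
            counts.append(len(line[1:-1].split("|")))
    return counts


def ragged_rows(markdown: str) -> list:
    """Return indexes of rows whose cell count differs from the header row."""
    counts = column_counts(markdown)
    if not counts:
        return []
    return [i for i, n in enumerate(counts) if n != counts[0]]


# Trimmed, illustrative sample: the last row has stray empty cells.
sample = "\n".join(
    [
        "|Benchmark|8B|70B|",
        "|---|---|---|",
        "|IFEval|80.4|87.5|",
        "|Gcnam|72.3|72.3| | |",
    ]
)

print(column_counts(sample))  # prints [3, 3, 3, 5]
print(ragged_rows(sample))    # prints [3]
```

A consistent column count does not prove the cell values are right, so this complements rather than replaces a comparison against the source page.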

## Setup RAG Pipeline

Let's set up a RAG pipeline over this data.

We also use gpt-4o-mini for the text synthesis step.

In [37]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# NOTE: api_key is blank here; supply your own key before running.
Settings.llm = OpenAI(model="gpt-4o-mini", api_base="https://models.inference.ai.azure.com", api_key="")

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large", api_base="https://models.inference.ai.azure.com", api_key="")
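
Rather than writing a key into the notebook, one option is to read it from the environment and fail early with a clear message if it is missing. This is a minimal sketch: the helper `require_env` is hypothetical, and the variable name in the usage comment is an assumption you should adapt to your own setup.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value


# Hypothetical usage (the variable name is an assumption):
# api_key = require_env("OPENAI_API_KEY")
```

The returned value can then be passed as `api_key=` to both `OpenAI(...)` and `OpenAIEmbedding(...)`.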

In [38]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(docs)
query_engine = index.as_query_engine(similarity_top_k=5)

index_gpt4o = VectorStoreIndex(docs_gpt4o)
query_engine_gpt4o = index_gpt4o.as_query_engine(similarity_top_k=5)
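
For intuition about `similarity_top_k`, the toy sketch below ranks a few hand-written vectors by cosine similarity to a query vector and keeps the best `k`. It illustrates the general idea of top-k retrieval only; it is not LlamaIndex's implementation, and the vectors and function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return indexes of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings" standing in for per-page vectors.
doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.1]

print(top_k(query_vec, doc_vecs, k=2))  # prints [0, 1]
```

In the pipeline above, `similarity_top_k=5` plays the role of `k`: the five most similar page nodes are passed to the LLM for synthesis.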

In [39]:
query = "How does Llama3.1 compare against gpt-4o and Claude 3.5 Sonnet in human evals?"

response = query_engine.query(query)
response_gpt4o = query_engine_gpt4o.query(query)

In [40]:
print(response)

In human evaluations, Llama 3.1 405B achieved a win rate of 23.39%, with 52.2% ties and 24.5% losses when compared to GPT-4-0125-Preview, which had a win rate of 19.1%, 51.7% ties, and 29.2% losses. Against Claude 3.5 Sonnet, Llama 3.1 had a win rate of 24.9%, with 50.8% ties and 24.2% losses. Overall, Llama 3.1 performed competitively against both models in these evaluations.


In [41]:
print(response.source_nodes[1].get_content())

# Introducing Llama 3.1: Our most capable models to date

|Category|Llama 3.1|Nemotron 4|GPT-4|GPT-4|Claude 3.5|
|---|---|---|---|---|---|
|Benchmark|405B|340B Instruct|401251|Omni|Sonnet|
|Gcnam|78.7|78.7|78.7|78.7|78.7| | | | |
|MMLU (shot CoT)|88.6|(non-CoT)|85.4|88.7|88.3|
|MMLUPRO (S-shot; CoT)|73.3|62.7|64.8|74.0|77.0|
|IFEval|88.6|85.1|84.3|85.6|88.0|
|HumanEval (0-shot)|89.0|73.2|86.6|90.2|92.0|
|MBPP EvalPlus (basic) (0-shot)|88.6|72.8|83.6|87.8|90.5|
|Math|92.3| | | | | | | |
|GSMBK (A-shot, CoT)|96.8| |94.2|96.1| |
|MATH (0-shot, CoT)|73.8|41.1|64.5|76.6|71.1|
|ARC Challenge (0-shot)|96.9|94.6|96.4|96.7|96.7|
|GPQA (0-shot, CoT)|51.1| |41.4|53.6|59.4|
|BFCL|88.5|86.5|88.3|80.5|90.2|
|Nexus|58.7| |50.3|56.1|45.7|
|ZeroSCROLLSIQuALITY|95.2|95.2| |90.5|90.5|
|InfiniteBench/En MC|83.4| |72.1|82.5| |
|NIHIMulti-needle|98.1| |100.0|100.0|90.8|
|Multilingual|Multilingual MGSMShot|91.6|85.9|90.5|91.6|

https://ai.meta.com/blog/meta-llama-3-1/


In [42]:
print(response_gpt4o)

In human evaluations, Llama 3.1 405B shows competitive performance against GPT-4o and Claude 3.5 Sonnet. Specifically, Llama 3.1 won 19.1% of the comparisons against GPT-4o, tied in 51.7%, and lost 29.2%. Against Claude 3.5 Sonnet, it won 24.9%, tied 50.8%, and lost 24.2%. This indicates that while Llama 3.1 performs well, it has a mix of wins, ties, and losses compared to these models.


In [43]:
print(response_gpt4o.source_nodes[1].get_content())

# Meta

## Category Benchmark

| Category                  | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |
|---------------------------|--------------|---------------|---------------------|---------------|------------------------|---------------|
| **General**               |              |               |                     |               |                        |               |
| MMLU (0-shot, CoT)        | 73.0         | 72.3          | 60.5                | 86.0          | 79.9                   | 69.8          |
| MMLU PRO (5-shot, CoT)    | 48.3         | 53.6          | 37.6                | 66.4          | 56.3                   | 49.2          |
| IIEval                    | 80.4         | 73.6          | 57.6                | 87.5          | 72.7                   | 69.9          |
| **Code**                  |              |               |                     |               |                        |      