# Multimodal Parsing with LlamaParse

### Set environment variables

In [1]:
from dotenv import load_dotenv

load_dotenv()

True
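`load_dotenv()` returning `True` only indicates that a `.env` file was found and loaded, not that the key you need is present. The check below is a minimal sketch, assuming the LlamaParse key is stored under `LLAMA_CLOUD_API_KEY`; the helper name `missing_env_vars` is hypothetical.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Assumption: the LlamaParse key is stored under LLAMA_CLOUD_API_KEY.
missing = missing_env_vars(["LLAMA_CLOUD_API_KEY"])
if missing:
    print(f"Missing environment variables: {missing}")
```

Running this right after `load_dotenv()` surfaces a missing key before any parsing request is sent.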

In [2]:
import nest_asyncio

nest_asyncio.apply()

## Running with LlamaParse

In [3]:
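from glob import glob
import os

# Resolve the data directory relative to the notebook's working directory.
current_dir = os.getcwd()
root_dir = os.path.join(current_dir, '..', '..')
data_dir = os.path.join(root_dir, 'data', 'parse_multimodal')
glob_path = os.path.join(data_dir, '*')

# glob returns a list, so pdf_path holds every file in the data directory.
pdf_path = glob(glob_path)
pdf_path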

['/Users/kimbwook/PycharmProjects/fast-campus/notebook/파싱/../../data/parse_multimodal/Llama_image_text.pdf',
 '/Users/kimbwook/PycharmProjects/fast-campus/notebook/파싱/../../data/parse_multimodal/Llama_image.pdf']

### Model list

Model names accepted by `vendor_multimodal_model`, as listed at https://docs.cloud.llamaindex.ai/llamaparse/features/multimodal:

- openai-gpt4o
- openai-gpt-4o-mini
- anthropic-sonnet-3.5
- gemini-1.5-flash
- gemini-1.5-pro
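A typo in the model name only shows up once a parsing job has been submitted, so it can help to validate the string locally first. The following is a minimal sketch: the tuple copies the list from this notebook and may be out of date compared with the linked documentation, and `check_model_name` is a hypothetical helper, not part of `llama_parse`.

```python
# Snapshot of the model names listed in this notebook; may be out of date.
SUPPORTED_MULTIMODAL_MODELS = (
    "openai-gpt4o",
    "openai-gpt-4o-mini",
    "anthropic-sonnet-3.5",
    "gemini-1.5-flash",
    "gemini-1.5-pro",
)


def check_model_name(name):
    """Return `name` unchanged if it is in the snapshot, else raise ValueError."""
    if name not in SUPPORTED_MULTIMODAL_MODELS:
        options = ", ".join(SUPPORTED_MULTIMODAL_MODELS)
        raise ValueError(f"Unknown model {name!r}; expected one of: {options}")
    return name


model_name = check_model_name("openai-gpt-4o-mini")
```

The validated `model_name` can then be passed as `vendor_multimodal_model`.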

In [4]:
from llama_parse import LlamaParse

parse_instance = LlamaParse(
    result_type="markdown",
    language="ko",
    use_vendor_multimodal_model=True,  # parse pages with a multimodal model
    vendor_multimodal_model="openai-gpt-4o-mini",
    # vendor_multimodal_api_key="sk-111",  # optional: your own vendor API key (placeholder value)
)

In [5]:
parse_instance.load_data(pdf_path)

Parsing files: 100%|██████████| 2/2 [00:18<00:00,  9.40s/it]


[Document(id_='f72d5d5b-b4c8-4359-bdc5-61cb319723de', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Introducing Llama 3.1: Our most capable models to date\n\nAs our largest model yet, training Llama 3.1 405B on over 16 trillion tokens was a major challenge. To enable training runs at this scale and achieve the results we have in a reasonable amount of time, we significantly optimized our full training stack and pushed our model training to over 16 thousand H100 GPUs, making the 405B the first Llama model trained at this scale.\n\n```\n[INPUT]\nText tokens\n\n[Token embeddings]\n\n[Self-attention]\n[Feedforward network]\n...\n\n[Self-attention]\n[Feedforward network]\n\n[OUTPUT]\nText token\n\nAUTOREGRESSIVE DECODING\n```\n\nTo address this, we made design choices that focus on keeping the model development process scalable and straightforward.\n\n- We opted for a standard decoder-only transformer model architectur
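The `Document` repr above is hard to read because the Markdown is shown as an escaped string. One option is to write each document's `text` attribute to its own `.md` file and open it in a Markdown viewer. This is a minimal sketch: `save_documents_as_markdown`, the output directory, and the `parsed_{index}.md` naming scheme are all hypothetical, and the function assumes only that each item has a `.text` attribute, as the output above shows.

```python
from pathlib import Path


def save_documents_as_markdown(documents, out_dir):
    """Write each document's `.text` to its own .md file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, document in enumerate(documents):
        target = out_path / f"parsed_{index}.md"  # hypothetical naming scheme
        target.write_text(document.text, encoding="utf-8")
        written.append(target)
    return written


# Example (not run here):
# documents = parse_instance.load_data(pdf_path)
# save_documents_as_markdown(documents, "parsed_markdown")
```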

## Running with AutoRAG

In [6]:
from autorag.parser import Parser

# Parse results are written under project_dir.
project_dir = os.path.join(root_dir, 'autorag_project', 'parse', 'multimodal')
parser = Parser(data_path_glob=glob_path, project_dir=project_dir)

In [7]:
yaml_path = os.path.join(root_dir, 'config', 'parse', 'multimodal.yaml')
parser.start_parsing(yaml_path, all_files=True)

Started parsing the file under job_id f27f4c2d-4638-457f-8f73-aaf96206f482


Started parsing the file under job_id 9103a76b-742d-4f7c-9401-d72dec892d43


## Checking the results

You can preview the Markdown output with nicer rendering at https://markdownlivepreview.com/.

In [8]:
import pandas as pd

In [9]:
result_path = os.path.join(project_dir, 'parsed_result.parquet')
multimodal_result = pd.read_parquet(result_path)
multimodal_result

|   | texts | path | page | last_modified_datetime |
|---|-------|------|------|------------------------|
| 0 | # Introducing Llama 3.1: Our most capable mode... | /Users/kimbwook/PycharmProjects/fast-campus/no... | 1 | 2024-12-11 |
| 1 | # Introducing Llama 3.1: Our most capable mode... | /Users/kimbwook/PycharmProjects/fast-campus/no... | 1 | 2024-12-11 |
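Because the `texts` and `path` columns are truncated in the display, a compact summary makes it easier to tell the rows apart. The sketch below assumes the `texts`, `path`, and `page` columns shown above; `summarize_parsed` is a hypothetical helper, and the demo rows are placeholders rather than the notebook's actual data.

```python
import os

import pandas as pd


def summarize_parsed(df):
    """Return one row per parsed entry: file name, page, and text length."""
    return pd.DataFrame(
        {
            "file": df["path"].map(os.path.basename),
            "page": df["page"],
            "n_chars": df["texts"].str.len(),
        }
    )


# Placeholder rows for illustration only; not the notebook's actual output.
demo = pd.DataFrame(
    {
        "texts": ["# Title\n\nBody", "# Title"],
        "path": ["/tmp/a.pdf", "/tmp/b.pdf"],
        "page": [1, 1],
    }
)
summary = summarize_parsed(demo)
print(summary)
```

In the notebook, calling `summarize_parsed(multimodal_result)` would give the same view for the real parse results.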


In [10]:
texts = multimodal_result["texts"].tolist()

In [11]:
print(texts[0])

# Introducing Llama 3.1: Our most capable models to date

As our largest model yet, training Llama 3.1 on over 16 trillion tokens was a major challenge. To enable training runs at this scale and achieve the results we have in a reasonable amount of time, we significantly optimized our full training stack and pushed our model training to over 16 thousand H100 GPUs, making the 405B the first Llama model trained at this scale.

## Model Architecture

- **INPUT**: Text tokens
- **Token embeddings**
- **Self-attention**
- **Feedforward network**
- **...**
- **Self-attention**
- **Feedforward network**
- **OUTPUT**: Text token
- **AUTOREGRESSIVE DECODING**

To address this, we made design choices that focus on keeping the model development process scalable and straightforward.

- We opted for a standard decoder-only transformer model architecture with minor adaptations rather than a mixture-of-experts model to maximize training stability.
- We adopted an iterative post-training procedure, wh

In [9]:
print(texts[1])

# Introducing Llama 3.1: Our most capable models to date

| Category        | Benchmark                     | Llama 3.1 88B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x228 Instruct | GPT 3.5 Turbo |
|-----------------|-------------------------------|----------------|----------------|---------------------|----------------|------------------------|----------------|
| General         | MMLU (0-shot, CoT)           | 73.0           | 72.3           | 60.5                | 86.0           | 79.9                   | 69.8           |
|                 | MMLU PRO (5-shot, CoT)       | 48.3           | 36.9           | 36.9                | 66.4           | 56.3                   | 49.2           |
|                 | IFEval                        | 80.4           | 73.6           | 57.6                | 87.5           | 72.7                   | 69.9           |
| Code            | HumanEval (0-shot)           | 72.6           | 54.3           | 40.2                | 80.5  