<a href="https://colab.research.google.com/github/ToniJansen/Analisador-de-finan-as-com-LLMs/blob/main/C%C3%B3pia_de_financial_ppt_databricks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LlamaParse - Parsing Financial Powerpoints 📊

In this cookbook we show you how to use LlamaParse to parse a financial PowerPoint presentation.

## Installation

Parsing instructions are part of the LlamaParse API. They can be accessed by directly specifying the `parsing_instruction` parameter in the API or by using the LlamaParse Python module (which we will use for this tutorial).

To install llama-parse, just get it from `pip`:

In [None]:
%pip install llama-index
%pip install llama-index-llms-databricks
%pip install llama-index-embeddings-huggingface
%pip install llama-parse
%pip install torch transformers python-pptx Pillow


## API Key

The use of LlamaParse requires an API key which you can get here: https://cloud.llamaindex.ai/parse

In [None]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
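
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you would rather reuse a key that is already exported and only prompt when it is missing. `resolve_api_key` is a hypothetical helper written for this illustration, not part of LlamaParse:

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "LLAMA_CLOUD_API_KEY", prompt=getpass) -> str:
    """Return the key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in the cell output.
        key = prompt(f"Enter {env_var}: ").strip()
        os.environ[env_var] = key
    return key
```

Calling `resolve_api_key()` in place of the assignment above leaves an existing `LLAMA_CLOUD_API_KEY` untouched and asks for one otherwise.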

**NOTE**: Since LlamaParse is natively async, running the sync code in a notebook requires the use of nest_asyncio.


In [None]:
import nest_asyncio

nest_asyncio.apply()
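
To see why the patch is needed, the sketch below reproduces the underlying problem with the standard library only. It assumes the synchronous entry point drives a coroutine with `asyncio.run`, which is refused when an event loop is already running, as is the case inside a notebook kernel. `parse_job`, `sync_wrapper`, and `notebook_cell` are hypothetical stand-ins, not LlamaParse functions:

```python
import asyncio


async def parse_job() -> str:
    """Stand-in for an async parsing call (hypothetical, not the LlamaParse API)."""
    await asyncio.sleep(0)
    return "parsed"


def sync_wrapper() -> str:
    """Mimics a synchronous helper that drives a coroutine with asyncio.run."""
    coro = parse_job()
    try:
        return asyncio.run(coro)
    finally:
        # Avoids a "never awaited" warning if asyncio.run refused to start.
        coro.close()


async def notebook_cell() -> str:
    """Runs inside an active event loop, as code in a notebook cell does."""
    try:
        return sync_wrapper()
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


outside_loop = sync_wrapper()               # no loop running: works
inside_loop = asyncio.run(notebook_cell())  # loop already running: refused
print(outside_loop)
print(inside_loop)
```

`nest_asyncio.apply()` patches asyncio so that the nested call is allowed instead of raising.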

In [None]:
# Databricks API key: replace the placeholder with your own token
api_key = "<api_key>"

In [None]:
from llama_index.llms.databricks import Databricks
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Databricks(
    model="databricks-meta-llama-3-70b-instruct",
    api_key=api_key,
    api_base="https://<cluster_id>.cloud.databricks.com/serving-endpoints"
)

Settings.llm = llm
Settings.embed_model = embed_model


## Importing the package

To import llama_parse simply do:

In [None]:
from llama_parse import LlamaParse

## Using LlamaParse to Parse Presentations

Presentations such as PowerPoint decks are often hard to extract for RAG. With LlamaParse we can now parse them and unlock their content for RAG.

Let's download a financial report from the World Meteorological Organization (WMO).

In [None]:
! mkdir data; wget "https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx" -O data/presentation.pptx

--2024-06-11 23:41:34--  https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx
Resolving meetings.wmo.int (meetings.wmo.int)... 195.55.64.242
Connecting to meetings.wmo.int (meetings.wmo.int)|195.55.64.242|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 820828 (802K) [application/vnd.openxmlformats-officedocument.presentationml.presentation]
Saving to: ‘data/presentation.pptx’


2024-06-11 23:41:37 (428 KB/s) - ‘data/presentation.pptx’ saved [820828/820828]
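
A download can succeed with HTTP 200 and still deliver something other than a presentation (an error page, for example). A minimal sketch of a sanity check, assuming that a `.pptx` file is a ZIP archive whose main part is conventionally stored at `ppt/presentation.xml`. `looks_like_pptx` is a hypothetical helper, not part of any library used here:

```python
import zipfile
from pathlib import Path


def looks_like_pptx(path) -> bool:
    """Return True if `path` is a ZIP archive containing ppt/presentation.xml."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "ppt/presentation.xml" in archive.namelist()
```

Running `looks_like_pptx("data/presentation.pptx")` before parsing gives an early signal if the file is missing or malformed.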



### Parsing the presentation

Now let's parse it twice: once with the default LlamaIndex reader, and once into Markdown with LlamaParse.




#### LlamaIndex default

In [None]:
from llama_index.core import SimpleDirectoryReader

vanilla_documents = SimpleDirectoryReader("./data/").load_data()



In [None]:
print(vanilla_documents[0].get_content())



Slide #0: 
Financial Statements for 2022
FINAC-43
20 May 2023
EC-77/Doc 5 and EC-77/INF 5(1)


Slide #1: 
Agenda
Highlights of 2022
Details of 2022 elements
Draft Resolutions
2
5/18/2023


Slide #2: 
Highlights of 2022”A Return to the New Normal”



Slide #3: 
Highlights of 2022 – Comparing 2022 to 2021
4
5/18/2023

 Image: a large red and white cake with flowers on it



Slide #4: 
Highlights of 2022 – Comparing 2022 to 2021
5
5/18/2023

 Image: a painting of a girl riding a skateboard



Slide #5: 
Highlights of 2022 – Comparing 2022 to 2021
6
5/18/2023

 Image: a green street sign on top of a pole



Slide #6: 
Highlights of 2022 – Comparing 2022 to 2021
7
5/18/2023

 Image: a pair of scissors sitting on top of a book



Slide #7: 
Highlights of 2022 – Comparing 2022 to 2021
8
5/18/2023

 Image: a pair of scissors sitting on top of a book



Slide #8: 
Highlights of 2022 – Comparing 2022 to 2021
9
5/18/2023

 Image: a collage of photographs of a person holding a ribbon



Slid
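
The `Image:` lines in this output appear to be automatically generated captions, and they have nothing to do with a financial report. A minimal sketch of one way to drop them before indexing, assuming every caption sits on its own line starting with `Image:`. `drop_image_captions` is a hypothetical helper:

```python
import re

# Matches lines such as " Image: a green street sign on top of a pole".
CAPTION_LINE = re.compile(r"^\s*Image:\s.*$")


def drop_image_captions(text: str) -> str:
    """Remove 'Image: ...' caption lines from reader output, keeping other lines."""
    kept = [line for line in text.splitlines() if not CAPTION_LINE.match(line)]
    return "\n".join(kept)


sample = (
    "Slide #5: \n"
    "Highlights of 2022 – Comparing 2022 to 2021\n"
    "6\n"
    "5/18/2023\n"
    "\n"
    " Image: a green street sign on top of a pole\n"
)
print(drop_image_captions(sample))
```

Filtering removes the noise but does not recover the slide content, which is why the comparison with LlamaParse below still matters.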

#### LlamaParse

In [None]:
llama_parse_documents = LlamaParse(result_type="markdown").load_data(
    "./data/presentation.pptx"
)

Started parsing the file under job_id cac11eca-9bc5-4e9b-b8af-a03d99bb55b0


In [None]:
print(llama_parse_documents[0].get_content())

Financial Statements for 2022

FINAC-43
20 May 2023
EC-77/Doc 5 and EC-77/INF 5(1)
---
# Agenda

- Highlights of 2022
- Details of 2022 elements
- Draft Resolutions

06/09/2024
---
Highlights of 2022


”A Return to the New
Normal”
---
|Item|2022|2021|
|---|---|---|
|Impact COVID-19 pandemic|Global restrictions on travel and face-to-face meetings significantly reduced.|Continued low level of travel, meetings and fellowships|
| |Allowed for significant increase in travel expenditures beginning in Q3 2022 and increased meeting and project related expenditure| |
| |Reduction in new extrabudgetary contributions – significant no-cost extensions| |
| |Implementation modalities shifted for improved delivery| |
| |Face-to-face meetings, particularly constituent body meetings funded by the Regular Budget, began to be held again.|Virtual possibilities|
---
# Anarchist Organizational Chart, Society

|Item|2022|2021|
|---|---|---|
|Secretariat|Most technical position|Significant hiring|
|Reorganiza
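
In the output above, slides are separated by lines containing only `---`. A minimal sketch of splitting the Markdown into per-slide chunks, assuming that convention holds for the whole document. `split_slides` is a hypothetical helper, and table separator rows such as `|---|---|---|` are deliberately left alone:

```python
def split_slides(markdown: str) -> list[str]:
    """Split parser output into per-slide chunks on lines that are exactly '---'."""
    slides, current = [], []
    for line in markdown.splitlines():
        if line.strip() == "---":
            slides.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    slides.append("\n".join(current).strip())
    # Drop empty chunks produced by leading or trailing separators.
    return [s for s in slides if s]


sample = (
    "# Agenda\n\n- Highlights of 2022\n- Draft Resolutions\n"
    "---\n"
    "|Item|2022|2021|\n|---|---|---|\n"
    "|Secretariat|Most technical position|Significant hiring|\n"
)
slides = split_slides(sample)
print(len(slides))
```

On the real document this would be called as `split_slides(llama_parse_documents[0].get_content())`.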

Let's take a look at the parsed output from an example slide (see image below).

As we can see, the table is faithfully extracted!

In [None]:
print(llama_parse_documents[0].get_content()[-2800:-2300])

|Item|31 Dec 2022|31 Dec 2021|Change|
|---|---|---|---|
|Payables and accruals|4,685|4,066|619|
|Employee benefits|127,215|84,676|42,539|
|Contributions received in advance|6,975|10,192|(3,217)|
|Unearned revenue from exchange transactions|20|651|(631)|
|Deferred Revenue|71,301|55,737|15,564|
|Borrowings|28,229|29,002|(773)|
|Funds held in trust|30,373|29,014|1,359|
|Provisions|1,706|1,910|(204)|
|Total Liabilities|270,504|215,248|55,256|
---
# Liabilities

|Liability|Explanation|
|---|---|
|Emp


Compare this against the original slide image:
![Demo](demo_ppt_financial_1.png)
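
Because the table survives as Markdown, its values can also be read programmatically. A minimal sketch, assuming a simple pipe-delimited table with a header row followed by a separator row. `parse_markdown_table` is a hypothetical helper, and the rows are a subset of the liabilities table printed above:

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Turn a pipe-delimited Markdown table into a list of row dictionaries."""
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            # Drop the outer pipes, then split the remaining text into cells.
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, cells)) for cells in body]


table = """
|Item|31 Dec 2022|31 Dec 2021|Change|
|---|---|---|---|
|Payables and accruals|4,685|4,066|619|
|Provisions|1,706|1,910|(204)|
|Total Liabilities|270,504|215,248|55,256|
"""
by_item = {row["Item"]: row for row in parse_markdown_table(table)}
print(by_item["Provisions"]["31 Dec 2021"])  # 1,910
```

The values stay as strings so that formatting such as `(204)` for negative changes is preserved.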

## Comparing the two for RAG

The main difference between LlamaParse and the previous directory reader approach is that LlamaParse extracts the document in a structured format, which allows for better RAG.

### Query Engine on SimpleDirectoryReader results

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

vanilla_index = VectorStoreIndex.from_documents(vanilla_documents)
vanilla_query_engine = vanilla_index.as_query_engine()

### Query Engine on LlamaParse Results


In [None]:
llama_parse_index = VectorStoreIndex.from_documents(llama_parse_documents)
llama_parse_query_engine = llama_parse_index.as_query_engine()

### Liability provision
What was the liability provision as of Dec 31 2021?


In [None]:
vanilla_response = vanilla_query_engine.query(
    "What was the liability provision as of Dec 31 2021?"
)
print(vanilla_response)

The liability provision as of Dec 31, 2021, is not explicitly stated in the provided context. However, Slide #17 and Slide #18 discuss the liabilities as of 2022, and Slide #18 mentions the changes in Employee Benefit Liabilities, but it does not provide the exact figure for 2021.


In [None]:
llama_parse_response = llama_parse_query_engine.query(
    "What was the liability provision as of Dec 31 2021?"
)
print(llama_parse_response)

The liability provision as of Dec 31, 2021 was 1,910.
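
Comparisons like this one can be checked automatically when the expected figure is known. A minimal sketch, assuming the ground truth is the 1,910 shown for Provisions in the 31 Dec 2021 column of the liabilities table. `mentions_figure` is a hypothetical helper, and the two answers are copied from the outputs above:

```python
import re


def mentions_figure(answer: str, expected: str) -> bool:
    """Return True if `answer` contains `expected` as a standalone number.

    Thousands separators are ignored, so "1,910" and "1910" both match.
    """
    target = expected.replace(",", "")
    numbers = re.findall(r"\d[\d,]*", answer)
    return any(n.replace(",", "") == target for n in numbers)


vanilla_answer = (
    "The liability provision as of Dec 31, 2021, is not explicitly stated in "
    "the provided context. However, Slide #17 and Slide #18 discuss the "
    "liabilities as of 2022."
)
llama_parse_answer = "The liability provision as of Dec 31, 2021 was 1,910."

print(mentions_figure(vanilla_answer, "1,910"))      # False
print(mentions_figure(llama_parse_answer, "1,910"))  # True
```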
