# Multimodal Financial Document Analysis from PDFs

Application of the retrieval-augmented generation (RAG) method in processing financial information from company's PDF document. The steps involve extracting critical data such as text, tables and graphs from a PDF's file and storing them in a vector database like FAISS. Multiple tools will be used like Unstructured.io for text and table extraction from PDF, Cohere models for graph information extraction from images, and LlamaIndex for creating an agent with retrieval capabilities.

### Extracting Data

In [None]:
!wget https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf

The unstructured package is a proficient tool for extracting information from pdf files. It relies on two key tools, poppler and tesseract, essentail for rendering PDF documents. They have to be installed using apt-get(Linux) or brew(MacOs), in addition to the necessary packages.

apt-get -qq install poppler-utils <br>
apt-get -qq install tesseract-ocr

I recommend using [Unstructured Quickstart](https://docs.unstructured.io/open-source/introduction/quick-start) to a clean install of the Unstructured package. <br>
Please note that I am using the [UV project manager](https://docs.astral.sh/uv/) instead of pip and the latest python version at this time which is 3.13.3.<br>
Terminal (create virtual environment): <br>
uv venv --python 3.13 <br>
source .venv/bin/activate <br>
uv pip install ipykernel


In [None]:
!uv pip install "unstructured[all-docs]"

### Text and Tables

Use partition_pdf function to extract text and table data from the PDF and divide it into multiple chunks. The size of these chunks can be customized based on the number of characters.

In [None]:
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
        filename="./TSLA-Q3-2023-Update-3.pdf",
        # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
        # Titles are any sub-section of the document
        infer_table_structure=True,
        # Post processing to aggregate text once we have the title
        chunking_strategy="by_title",
        # Chunking params to aggregate text blocks
        # Attempt to create a new chunk 3800 chars
        # Attempt to keep chunks > 2000 chars
        # Hard max on chunks
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000
    )

The above code recognizes and extracts various PDF elements, which can be divided into CompositeElements (text) and Tables.

The Pydantic package is used to create a new data structure containing information about each element, such as its type and text. The code below lopps through all extracted elements, storing them in a list where each item is an instance of the Element type. I will also use the pickle library to store the extracted elements, they can be loaded later. 

In [5]:
from pydantic import BaseModel
from typing import Any
import pickle

# Define data structure
class Element(BaseModel):
    type: str
    text: Any

# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))


with open('categorized_elements.pkl', 'wb') as handle:
    pickle.dump(categorized_elements, handle)

### Graphs

The next step is to extract information from graphs and charts by separating images from the document and analyzing them using the c4ai-aya-vision-8b model developed by Cohere. In practice, each PDF page have to be converted into images and passed to the model individually if it identifies any graph. An empty array will be returned in cases where no graphs are found.<br>
Converting the PDF into images necessitates installing the pdf2image package.

In [None]:
!uv pip install pdf2image

In [7]:
import os
from pdf2image import convert_from_path


os.mkdir("./pages")
convertor = convert_from_path('./TSLA-Q3-2023-Update-3.pdf')

for idx, image in enumerate( convertor ):
    image.save(f"./pages/page-{idx}.png")

pages_png = [file for file in os.listdir("./pages") if file.endswith('.png')]

The Cohere API Key have to be stored in .env file. Before submitting image files into the model API, some functions have to be defined : the images have to be resized to fit the Cohere API limitations, a function to convert images into base64 format to make them compatible with the API and a function to send requests to the model using a payload variable for configurations, including specifying the model name, setting the maximum token limit and defining **prompts**. These prompts instrcut the model to analyze and describe graphs and generate responses in JSON format.

In [None]:
!uv pip install python-dotenv

In [10]:
from PIL import Image
import os

def resize_and_compress_png(input_path, output_path, max_size_bytes=5242880, max_dim=1024):
    """
    Resize and compress the PNG image to fit within the size limit.
    
    Parameters:
    - input_path (str): Path to the input PNG image.
    - output_path (str): Path to save the processed PNG image.
    - max_size_bytes (int): Maximum allowed file size in bytes (default 5MB).
    - max_dim (int): Maximum dimension (width or height) for resizing.
    """
    
    # Open the PNG image using Pillow
    with Image.open(input_path) as img:
        # Check current image size
        current_size = os.path.getsize(input_path)
        print(f"Original image size: {current_size / 1024:.2f} KB")
        
        # If the image size exceeds the max allowed size, we need to resize and compress
        if current_size > max_size_bytes:
            # Resize the image while maintaining aspect ratio
            img.thumbnail((max_dim, max_dim))
            print(f"Resized image to: {img.size[0]}x{img.size[1]} pixels")
            
            # Save the image in PNG format with compression (using PNG optimization)
            img.save(output_path, format="PNG", optimize=True)
            
            # Check the file size after saving
            final_size = os.path.getsize(output_path)
            print(f"Processed image size: {final_size / 1024:.2f} KB")
            
            # If still too large, further resize or optimize the PNG (e.g., reducing bit depth)
            while final_size > max_size_bytes:
                print(f"Size is still too large, resizing further...")
                # Further reduce dimensions if needed
                img = img.resize((img.size[0] // 2, img.size[1] // 2))
                img.save(output_path, format="PNG", optimize=True)
                final_size = os.path.getsize(output_path)
                print(f"Processed image size: {final_size / 1024:.2f} KB")
                
            print(f"Final image size is within the limit: {final_size / 1024:.2f} KB")
        else:
            # If the image is already within the limit, save it directly
            img.save(output_path, format="PNG")
            print(f"Image size is already within the limit, saved as is: {current_size / 1024:.2f} KB")

In [None]:
pages_resize = [file for file in os.listdir("./pages") if file.endswith('.png')]

for page in pages_resize:
    resize_and_compress_png(f"./pages/{page}",f"./pages/{page}" )

In [12]:
import base64
def encode_img(image_path):
    with open(image_path, "rb") as img_file:
        base64_image_url = f"data:image/jpeg;base64,{base64.b64encode(img_file.read()).decode('utf-8')}"
        return base64_image_url

In [13]:
import requests
from dotenv import load_dotenv

load_dotenv(".env")

def treat_img(base64_image_url):
    tmp_payload = {
    "model": "c4ai-aya-vision-8b",
    "temperature": 0.3,
    "max_tokens": 1000,
    "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "You are an assistant that find charts, graphs, or diagrams from an image and summarize their information. There could be multiple diagrams in one image, so explain each one of them separately. ignore tables."},
                    {
                        "type": "image_url",
                        "image_url": {"url": base64_image_url},
                    },
                ],
            },
            {
                "role": "user",
                "content": '''The response must be a JSON in following format {"graphs": [<chart_1>, <chart_2>, <chart_3>]} where <chart_1>, <chart_2>, and <chart_3> placeholders that describe each graph found in the image. Do not append or add anything other than the JSON format response.'''
            },
            {
                "role": "user",
                "content": '''If could not find a graph in the image, return an empty list JSON as follows: {"graphs": []}. Do not append or add anything other than the JSON format response. Dont use coding "```" marks or the word json.'''
            },
                        {
                "role": "user",
                "content": """Look at the attached image and describe all the graphs inside it in JSON format. ignore tables and be concise."""
            }
    ]
}

    headers ={
    "Content-type" : "application/json",
    "Authorization" : "Bearer " + str(os.environ["COHERE_API_KEY"])
}
    url="https://api.cohere.com/v2/chat"
    response = requests.post(url, headers=headers, json=tmp_payload)

    return response.json()

In [14]:
import json
from tqdm import tqdm
import pickle

graphs_description = []
for idx, page in tqdm(enumerate(pages_png)):
    base64_image = encode_img(f"./pages/{page}")
    rs = treat_img(base64_image)

    raw_text = rs['message']['content'][0]['text']
    graph_data = json.loads(rs['message']['content'][0]['text'])['graphs']

    if not graph_data:
        print("No graphs found.")
    else:
        desc = [
            f"Page: {page}\n" + '\n'.join(f"{key}: {item[key]}" for key in item.keys())
            if isinstance(item, dict)
            else item
            for item in graph_data
        ]
        print(desc)

    graphs_description.extend(desc)

    graphs_description = \
    [Element(type="graph", text=str(item)) for item in graphs_description]

with open('graphs_description.pkl', 'wb') as handle:
    pickle.dump(graphs_description, handle)

1it [00:06,  6.07s/it]

['Page: page-15.png\nlabel: Vehicle Deliveries\ndescription: Millions of units delivered over time', 'Page: page-15.png\nlabel: Operating Cash Flow\ndescription: Cash generated from operations over time', 'Page: page-15.png\nlabel: Free Cash Flow\ndescription: Cash generated after expenses over time', 'Page: page-15.png\nlabel: Net Income\ndescription: Profit or loss over time', 'Page: page-15.png\nlabel: Adjusted EBITDA\ndescription: Earnings before interest, taxes, depreciation, and amortization over time']


2it [00:19, 10.55s/it]

No graphs found.


3it [00:25,  8.44s/it]

['Page: page-16.png\ntitle: Vehicle Deliveries\ndescription: Blue bars represent vehicle deliveries in millions of units over time, showing a steady increase from Q1 2020 to Q4 2022.', 'Page: page-16.png\ntitle: Operating Cash Flow\ndescription: Blue bars show operating cash flow in billions of dollars, with fluctuations throughout the period, while red bars indicate free cash flow, also fluctuating.', 'Page: page-16.png\ntitle: Net Income and Adjusted EBITDA\ndescription: Red bars represent net income in billions of dollars, and blue bars show adjusted EBITDA, both increasing over time with some seasonal variations.']


4it [00:32,  7.66s/it]

["Page: page-17.png\ntitle: Revenue Growth\ndescription: Red line showing Tesla's revenue growth over 12 months, compared to Q3 2019. Blue line represents the S&P 500 index.", "Page: page-17.png\ntitle: Operating Margin\ndescription: Red line shows Tesla's operating margin percentage over 12 months. Blue line represents the S&P 500 index."]


5it [00:43,  8.96s/it]

No graphs found.


6it [00:53,  9.33s/it]

No graphs found.


7it [00:55,  6.91s/it]

['A line graph showing a trend over time with labeled axes.', 'A bar chart comparing different categories with labeled values.', 'A pie chart illustrating proportions with labeled segments.']


8it [01:13, 10.61s/it]

['Page: page-11.png\ntype: map\ndescription: Color-coded map of the United States showing state-level subsidies for Tesla Model Y.', 'Page: page-11.png\ntype: text\ndescription: Starting price of Tesla Model Y inclusive of national and state-level subsidies.']


9it [01:15,  7.81s/it]

No graphs found.


10it [01:18,  6.35s/it]

No graphs found.


11it [01:27,  7.07s/it]

['Bar chart showing quarterly revenues and gross profit from 2020 to 2023, with a focus on Q3 2022 and Q4 2022.', 'Line chart depicting operating expenses, income from operations, and operating margin over the same period.', 'Scatter plot comparing adjusted EBITDA and adjusted EBITDA margins with a focus on Q3 2022 and Q4 2022.', 'Bar chart illustrating net cash provided by operating activities, capital expenditures, and free cash flow from 2020 to 2023.']


12it [01:34,  7.12s/it]

No graphs found.


13it [01:47,  8.94s/it]

["Page: page-6.png\ndescription: Vehicle capacity across regions, showing production and pilot production status.\nregions: ['California', 'Shanghai', 'Berlin-Brandenburg', 'Texas', 'Nevada', 'Various']\nmodels: ['Model S', 'Model X', 'Model 3', 'Model Y', 'Cybertruck', 'Tesla Semi', 'Next Gen Platform']\ncapacity: in thousands\nstatus: ['Production', 'Pilot production', 'In development']", "Page: page-6.png\ndescription: Vehicle production capacity over time, with different lines for US, Canada, Europe, and China.\nregions: ['US', 'Canada', 'Europe', 'China']\nlines: ['Model S', 'Model X', 'Model 3', 'Model Y', 'Cybertruck', 'Tesla Semi']\ncapacity: in thousands\nproduction_rates: ['4%', '3%', '2%', '1%']", "Page: page-6.png\ndescription: Market share of Tesla vehicles by region over time.\nregions: ['US', 'Canada', 'Europe', 'China']\nlines: ['Model S', 'Model X', 'Model 3', 'Model Y', 'Cybertruck', 'Tesla Semi']\nmarket_share: in percentage\ntime: Q3 2022 to Q3 2023"]


14it [01:55,  8.62s/it]

['Page: page-7.png\ntype: bar\ndescription: Cumulative miles driven with FSD Beta, showing a steady increase over time.', 'Page: page-7.png\ntype: line\ndescription: Cost of goods sold per vehicle, indicating a decrease over time.']


15it [02:04,  8.67s/it]

['Page: page-5.png\ntype: bar\ndescription: Comparison of Model S, Model 3, and Total Production over quarters', 'Page: page-5.png\ntype: bar\ndescription: Comparison of Model S and Model 3 Deliveries over quarters', 'Page: page-5.png\ntype: line\ndescription: Comparison of Total Deliveries and Operating Lease accounting over quarters', 'Page: page-5.png\ntype: bar\ndescription: Comparison of Vehicle Count and Operating Lease accounting over quarters', 'Page: page-5.png\ntype: bar\ndescription: Comparison of Solar Deployed, Storage Deployed, and Tesla Locations over quarters', 'Page: page-5.png\ntype: bar\ndescription: Comparison of Mobile Service Fleet, Supercharger Stations, and Supercharger Connectors over quarters']


16it [02:09,  7.67s/it]

No graphs found.


17it [02:13,  6.44s/it]

No graphs found.


18it [02:21,  6.88s/it]

['Page: page-8.png\ntype: bar\ndescription: Energy Storage deployments in GWh, showing growth from Q1 2017 to Q4 2023.', 'Page: page-8.png\ntype: line\ndescription: Solar deployments trend over time, influenced by interest rates and net metering termination in California.', 'Page: page-8.png\ntype: bar\ndescription: Services and other business gross profit in millions of dollars, from Q1 2017 to Q4 2023.']


19it [02:32,  8.32s/it]

['Page: page-20.png\ntype: bar\ndescription: Comparison of monthly revenue for Q3 and Q4 of 2022, showing a significant increase in revenue from September to December.', 'Page: page-20.png\ntype: line\ndescription: Trend of total assets over time, indicating a steady increase from September 2022 to June 2023.', 'Page: page-20.png\ntype: pie\ndescription: Breakdown of total liabilities into different categories, including current and long-term liabilities.']


20it [02:41,  8.45s/it]

No graphs found.


21it [02:51,  8.89s/it]

['Page: page-23.png\ntype: bar\ndescription: Comparison of GAAP and Non-GAAP financial metrics over time, including net income, depreciation, and adjusted EBITDA.', 'Page: page-23.png\ntype: line\ndescription: Quarterly trends of net cash provided by operating activities and free cash flow under both GAAP and Non-GAAP.']


22it [02:59,  8.57s/it]

['Page: page-22.png\ntype: bar\ndescription: Quarterly revenue reconciliation between GAAP and non-GAAP, showing the difference in earnings per share.', 'Page: page-22.png\ntype: pie\ndescription: Breakdown of stock-based compensation expense for both GAAP and non-GAAP calculations.', 'Page: page-22.png\ntype: line\ndescription: Comparison of total revenues and adjusted EBITDA margin between GAAP and non-GAAP metrics over time.']


23it [03:01,  6.49s/it]

No graphs found.
No graphs found.


24it [03:11,  7.73s/it]

['A bar chart showing financial statements with labels for income, expenses, and net income over time.', "A line graph illustrating the company's revenue growth over several years.", "A pie chart depicting the company's expenses across different categories."]


25it [03:14,  6.23s/it]

["A line graph showing Tesla's stock price over time with a logarithmic scale on the y-axis.", "A bar chart comparing Tesla's revenue and expenses for different quarters.", "A pie chart illustrating Tesla's revenue distribution across various product lines.", "A scatter plot depicting the relationship between Tesla's stock price and market capitalization.", "A heatmap showing Tesla's financial performance across different metrics and time periods."]


26it [03:31,  8.15s/it]


### Store on FAISS Vector Database

Tha data preparation phase is complete and the vital information was extracted from the PDF. I will use in this section the FAISS vector database to store the collected information and their embeddings.<br>
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

In [15]:
all_docs = categorized_elements + graphs_description
print(len(all_docs))

100
