# Multimodal Financial Document Analysis from PDFs

This notebook applies retrieval-augmented generation (RAG) to financial information in a company's PDF report. The steps are to extract the key data (text, tables, and graphs) from the PDF file and store it in a vector database such as FAISS. Several tools are used: Unstructured.io for text and table extraction from the PDF, Cohere models for extracting information from graph images, and LlamaIndex for building an agent with retrieval capabilities.

### Extracting Data

In [None]:
!wget https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf

The unstructured package extracts information from PDF files. It relies on two system tools: poppler, which renders PDF pages, and tesseract, which performs OCR. Both have to be installed with apt-get (Linux) or brew (macOS), in addition to the necessary Python packages.

apt-get -qq install poppler-utils <br>
apt-get -qq install tesseract-ocr
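
Before installing the Python packages, it can help to confirm that both system tools are actually on the PATH. The following is a minimal sketch, not part of the Unstructured tooling: the helper name `find_system_tools` is hypothetical, and it assumes that `pdftoppm` (shipped with poppler-utils) and `tesseract` are the executables to look for.

```python
import shutil

# Assumption: pdftoppm ships with poppler-utils, and tesseract is the
# Tesseract OCR command-line binary. Adjust if your install differs.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def find_system_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to the path of its executable, or None if not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


# Report which tools were found in the current environment.
for name, path in find_system_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

If either tool is reported as not found, rerun the install commands above before continuing.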

I recommend following the [Unstructured Quickstart](https://docs.unstructured.io/open-source/introduction/quick-start) for a clean install of the Unstructured package. <br>
Please note that I am using the [uv project manager](https://docs.astral.sh/uv/) instead of pip, with Python 3.13.3, the latest version at the time of writing.<br>
Terminal (create a virtual environment): <br>
uv venv --python 3.13 <br>
source .venv/bin/activate <br>
uv pip install ipykernel


In [None]:
!uv pip install "unstructured[all-docs]"

### Text and Tables

Use the partition_pdf function to extract text and table data from the PDF and split it into chunks. The chunk size is controlled by character-count limits.

In [None]:
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="./TSLA-Q3-2023-Update-3.pdf",
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard max on chunk size
    max_characters=4000,
    # Attempt to start a new chunk after 3800 chars
    new_after_n_chars=3800,
    # Attempt to keep chunks > 2000 chars by combining smaller sections
    combine_text_under_n_chars=2000,
)

The code above recognizes and extracts the PDF's elements, which fall into two groups: CompositeElement objects (text) and Table objects.
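
One way to separate the two groups is to check each element's class name. The sketch below is illustrative only: `split_elements` is a hypothetical helper, the stand-in classes exist just so the example runs without the PDF, and it assumes that `str(element)` returns the element's text. It also treats `TableChunk` (a table split across chunks) as a table, which is an assumption about how you want oversized tables handled.

```python
def split_elements(elements):
    """Group partitioned elements into text chunks and tables by class name."""
    texts, tables = [], []
    for element in elements:
        kind = type(element).__name__
        if kind in ("Table", "TableChunk"):
            tables.append(str(element))
        elif kind == "CompositeElement":
            texts.append(str(element))
    return texts, tables


# Stand-in classes (not the real unstructured classes) so the sketch runs alone.
class CompositeElement:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class Table:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


demo = [CompositeElement("Revenue grew."), Table("Q3 | 23,350"), CompositeElement("Outlook")]
texts, tables = split_elements(demo)
print(len(texts), "text chunks,", len(tables), "tables")
```

With the real output, the call would be `split_elements(raw_pdf_elements)`.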