
A multimodal Retrieval Augmented Generation (RAG) solution with code execution capabilities. Process multiple complex documents with images, tables, and charts to distill insights or generate new documents.


Note: Start with the Tutorial Notebooks in the Tutorials folder here.


Research CoPilot: Multimodal RAG with Code Execution

Multimodal Document Analysis with RAG and Code Execution: using Text, Images and Data Tables with GPT-4V, TaskWeaver, and the Assistants API:

  1. The work focuses on processing multi-modal analytical documents by extracting text, images, and data tables to maximize data representation and information extraction, utilizing formats like Python code, Markdown, and Mermaid script for compatibility with GPT-4 models.
  2. Text is programmatically extracted from documents, processed to improve structure and tag extraction for better searchability, and numerical data is captured through generated Python code for later use.
  3. Images and data tables are processed into multiple text-based representations (detailed text descriptions, Mermaid, and Python code for images, and several formats for tables) so that the information is both searchable and usable for calculations, forecasts, and machine learning models via Code Interpreter capabilities.

Current Challenges

  1. With conventional techniques today, searching a knowledge base with RAG requires that text be extracted from the documents, chunked, and stored in a vector database
  2. This process is purely concerned with text:
    • If the documents contain images, graphs or tables, these elements are usually either ignored or extracted as messy, unstructured text
    • Retrieving unstructured table data through RAG leads to very low-accuracy answers
  3. LLMs are generally unreliable with numbers: if the query requires any sort of calculation, they often hallucinate or make basic math mistakes

Why do we need this solution?

  1. Ingest and interact with multi-modal analytics documents with lots of graphs, numbers and tables
  2. Extract structured information from document elements that previously could not be captured:
    • Images
    • Graphs
    • Tables
  3. Use the Code Interpreter to formulate answers where calculations are needed based on search results

Examples of Industry Applications

  1. Analyze Investment opportunity documents for Private Equity deals
  2. Analyze tables from tax documents for audit purposes
  3. Analyze financial statements and perform initial computations
  4. Analyze and interact with multi-modal Manufacturing documents
  5. Process academic and research papers
  6. Ingest and interact with textbooks, manuals and guides
  7. Analyze traffic and city planning documents


Solution Features

The following are technical features implemented as part of this solution:

  1. Supported file formats are PDFs, MS Word documents, MS Excel sheets, and csv files.
  2. Ingestion of multimodal documents including images and tables
  3. Ingestion jobs run on Azure Machine Learning for reliable long-duration execution and monitoring
  4. Full deployment script that will create the solution components on Azure and build the docker images for the web apps
  5. Hybrid search with AI Search using vector and keyword search plus the semantic re-ranker (a query sketch follows this list)
  6. Extraction of chunk-level tags and whole-document-level tags to optimize keyword search
  7. Whole document summaries used as part of the final search prompt to give extra context
  8. Code Execution with the OpenAI Assistants API Code Interpreter
  9. Tag-based search for optimizing very long user queries, e.g. Generation Prompts
  10. Modular and easy-to-use interface with Processors for customizable processing pipelines
  11. Smart chunking of Markdown tables, with the header and a table summary repeated in every chunk
  12. Support for the two new embedding models text-embedding-3-small and text-embedding-3-large, as well as for text-embedding-ada-002
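As a concrete illustration of feature 5, below is a minimal sketch of a hybrid query against Azure AI Search combining keyword search, a vector query, and semantic re-ranking. The index name, vector field name, semantic configuration, and environment variable names are assumptions for illustration, not the solution's actual configuration.

# Minimal hybrid-search sketch with the azure-search-documents SDK.
# Names below (index, fields, env vars, deployments) are illustrative assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="research-copilot-index",              # hypothetical index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)
aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

query = "What was the year-over-year revenue growth?"
embedding = aoai.embeddings.create(
    model="text-embedding-3-large",                   # any supported embedding deployment
    input=query,
).data[0].embedding

# Hybrid query: keyword search + vector search + semantic re-ranking.
results = search_client.search(
    search_text=query,
    vector_queries=[VectorizedQuery(vector=embedding, k_nearest_neighbors=5, fields="vector")],
    query_type="semantic",
    semantic_configuration_name="default",            # hypothetical semantic config name
    top=5,
)
for doc in results:
    print(doc["id"], doc.get("@search.reranker_score"))   # "id" is a hypothetical field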

In-the-Works Upcoming Features

  1. Dynamic semantic chunking with approximate fixed size chunks (soon)
  2. Graph DB support for enhanced data retrieval. The Graph DB will complement, and not replace, the AI Search resource.


The Concept of Processing Pipelines and Processors

For the sake of providing an extendable, modular architecture, this accelerator implements the concept of a processing pipeline, where each document undergoes a pre-specified sequence of processing steps, each step adding some degree of change to the document. Processors are format-specific (e.g. PDF, MS Word, Excel, etc.) and are created to ingest multimodal documents in the most efficient way for that format; therefore the list of processing steps for a PDF differs from the list of steps for an Excel sheet. This is implemented in the processor.py Python file, and the list of processing steps can be customized by changing the file processing_plan.json. As an example, processing Excel files follows the steps below, each step building on the results of the previous one:

  1. extract_xlsx_using_openpyxl: read the Excel sheet with openpyxl and store it in a dataframe.
  2. create_table_doc_chunks_markdown: convert the dataframe to Markdown and chunk it in a smart way: chunks of roughly equal size that never break a row or sentence in the middle (see the sketch after this list).
  3. create_image_doc_chunks: extract images from the Excel file, if any.
  4. generate_tags_for_all_chunks: for each chunk of text, generate tags. This is very important for hybrid search in AI Search.
  5. generate_document_wide_tags: generate tags for the whole document. This is very important for hybrid search in AI Search.
  6. generate_document_wide_summary: provide a document summary that will be inserted into the Context for RAG, as well as the top chunks.
  7. generate_analysis_for_text: provide an analysis for each chunk of text in relation to the whole document, e.g. what does chunk add as information vs the whole text.
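
To make steps 1 and 2 concrete, here is a minimal sketch, assuming pandas and openpyxl (plus tabulate for to_markdown) are available, of reading a sheet into a dataframe, converting it to Markdown, and chunking it into roughly equal pieces without splitting rows; the actual implementation in processor.py may differ.

# Sketch only: read an Excel sheet, convert to Markdown, and chunk it row-safely,
# repeating the header in every chunk (requires pandas, openpyxl, tabulate).
import pandas as pd

def excel_to_markdown_chunks(path: str, sheet: str | int = 0, max_chars: int = 2000) -> list[str]:
    df = pd.read_excel(path, sheet_name=sheet, engine="openpyxl")
    lines = df.to_markdown(index=False).splitlines()
    header, rows = lines[:2], lines[2:]          # column names + separator row

    chunks, current = [], []
    for row in rows:
        candidate = "\n".join(header + current + [row])
        if current and len(candidate) > max_chars:
            chunks.append("\n".join(header + current))   # flush, header included
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append("\n".join(header + current))
    return chunks

# Example: chunks = excel_to_markdown_chunks("financials.xlsx")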

At the start of the processing pipeline, a Python dictionary variable called ingestion_pipeline_dict, containing all the input parameters, is created in the constructor of the Processor and then passed to the first step. That step does its own processing, changing variables inside ingestion_pipeline_dict and adding new ones, and then returns it, so that it becomes the input for the second step. In this way, ingestion_pipeline_dict is passed from each step to the next down the pipeline: it is the common context that all steps work on. At the end of each step, ingestion_pipeline_dict is saved to a text file under the document's processing folder in the stages directory, to aid debugging and troubleshooting.
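
Conceptually, the pipeline looks like the following sketch; the function bodies and the run_pipeline helper are illustrative stand-ins, not the actual processor.py code.

# Conceptual sketch: each step receives ingestion_pipeline_dict, mutates it, and returns it.
import json
import os

def extract_xlsx_using_openpyxl(ingestion_pipeline_dict: dict) -> dict:
    # ... read the workbook here, then record results in the shared context
    ingestion_pipeline_dict["tables"] = ["<dataframe placeholder>"]
    return ingestion_pipeline_dict

def generate_document_wide_tags(ingestion_pipeline_dict: dict) -> dict:
    ingestion_pipeline_dict["document_tags"] = ["revenue", "forecast"]  # illustrative values
    return ingestion_pipeline_dict

def run_pipeline(ingestion_pipeline_dict: dict, steps: list, stages_dir: str = "stages") -> dict:
    os.makedirs(stages_dir, exist_ok=True)
    for step in steps:
        ingestion_pipeline_dict = step(ingestion_pipeline_dict)
        # Persist the shared context after every step for debugging/troubleshooting.
        with open(os.path.join(stages_dir, f"{step.__name__}.json"), "w") as f:
            json.dump(ingestion_pipeline_dict, f, default=str, indent=2)
    return ingestion_pipeline_dict

# result = run_pipeline({"document_path": "book.xlsx"},
#                       [extract_xlsx_using_openpyxl, generate_document_wide_tags])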

At the end of this document, there is a list of all the steps and a short explanation for each one of them. The below JSON block describes the processing pipelines per document format per processing option:

{
    ".pdf": {
        "gpt-4-vision": [
            "create_pdf_chunks", "pdf_extract_high_res_chunk_images", "pdf_extract_text", "pdf_extract_images", "delete_pdf_chunks", "post_process_images", "extract_tables_from_images", "post_process_tables", "generate_tags_for_all_chunks", "generate_document_wide_tags", "generate_document_wide_summary", "generate_analysis_for_text"
        ],
        "document-intelligence": [
            "create_pdf_chunks", "pdf_extract_high_res_chunk_images", "pdf_extract_text", "pdf_extract_images", "delete_pdf_chunks", "extract_doc_using_doc_int", "create_doc_chunks_with_doc_int_markdown", "post_process_images", "generate_tags_for_all_chunks", "generate_document_wide_tags", "generate_document_wide_summary", "generate_analysis_for_text"
        ],
        "hybrid": [
            "create_pdf_chunks", "pdf_extract_high_res_chunk_images", "delete_pdf_chunks", "extract_doc_using_doc_int", "create_doc_chunks_with_doc_int_markdown", "post_process_images", "post_process_tables", "generate_tags_for_all_chunks", "generate_document_wide_tags", "generate_document_wide_summary", "generate_analysis_for_text"
        ]
    },
    ".docx": {
        "py-docx": [
            "extract_docx_using_py_docx", "create_doc_chunks_with_doc_int_markdown", "post_process_images", "generate_tags_for_all_chunks", "generate_document_wide_tags", "generate_document_wide_summary", "generate_analysis_for_text"
        ],
        "document-intelligence": [
            "extract_doc_using_doc_int", "create_doc_chunks_with_doc_int_markdown", "post_process_images", "generate_tags_for_all_chunks", "generate_document_wide_tags", "generate_document_wide_summary", "generate_analysis_for_text"
        ]
    },
    ".xlsx": {
        "openpyxl": [
            "extract_xlsx_using_openpyxl", "create_table_doc_chunks_markdown", "create_image_doc_chunks", "generate_tags_for_all_chunks", "generate_document_wide_tags", "generate_document_wide_summary", "generate_analysis_for_text"
        ]
    }
}
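
A hedged sketch of how such a plan could drive a processor, assuming the step names map to methods that exist on the Processor class; the real processor.py may organize this differently.

# Sketch only: look up the step names for a format/option pair and run them in order.
import json

class Processor:
    def __init__(self, plan_path: str = "processing_plan.json"):
        with open(plan_path) as f:
            self.plan = json.load(f)

    def ingest(self, document_path: str, extension: str, option: str) -> dict:
        context = {"document_path": document_path}
        for step_name in self.plan[extension][option]:
            step = getattr(self, step_name)   # assumes e.g. self.extract_doc_using_doc_int exists
            context = step(context)
        return context

# Usage: Processor().ingest("report.pdf", ".pdf", "hybrid")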

Solution Architecture

The logical architecture of this solution is shown below. The Graph DB is not yet part of the solution, but the integration is currently in development:




Important Findings

  1. GPT-4-Turbo is a great help with its large 128k token window
  2. GPT-4-Turbo with Vision is great at extracting tables from unstructured document formats
  3. GPT-4 models can understand a wide variety of formats (Python, Markdown, Mermaid, GraphViz DOT, etc..) which was essential in maximizing information extraction
  4. A new approach to vector index searching based on tags was needed because the Generation Prompts were very lengthy compared to the usual user queries
  5. TaskWeaver’s and the Assistants API’s Code Interpreters were introduced to handle open-ended analytical questions


Enterprise Deployment

Please check our Enterprise Deployment guide for how to deploy this in a secure manner to a client's tenant. For local development or testing the solution, please use the tutorial notebooks or the Chainlit app described below.



Tutorial Notebooks

Please start with the Tutorial notebooks here. These notebooks illustrate a series of concepts that have been used in this repo.



How to Use this Solution

Two web apps are implemented as part of this solution: the Streamlit web app and the Chainlit web app.

  1. The Streamlit web app includes the following:
    • The web app can ingest documents, which creates an ingestion job either on Azure Machine Learning (recommended) or as a Python sub-process on the web app itself (for local testing only); see the job-submission sketch after this list.
    • The second part of the Streamlit app is Generation. The "Prompt Management" view will enable the user to build complex prompts with sub-sections, save them to Cosmos, and use the solution to generate output based on these prompts
  2. The Chainlit web app is used to chat with the ingested documents, and has advanced functionality, such as an audit trail for the search, and references section for the answer with multimodal support (images and tables can be viewed).
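
For orientation, the following is a minimal sketch of submitting an ingestion run as an Azure Machine Learning command job using the azure-ai-ml SDK; the script name, environment, and compute target are placeholders, and the web app's actual job-submission code may differ.

# Sketch only: submit an ingestion run as an AML command job (azure-ai-ml SDK v2).
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<aml-workspace>",
)

job = command(
    code="./code",                                           # hypothetical folder with the ingestion script
    command="python ingest_doc.py --doc ${{inputs.doc}}",    # hypothetical entry point
    inputs={"doc": "documents/report.pdf"},
    environment="ingestion-env@latest",                      # hypothetical registered environment
    compute="cpu-cluster",                                   # hypothetical compute cluster
    display_name="document-ingestion",
)
ml_client.jobs.create_or_update(job)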

Prepare the local Conda Environment

The Conda environment can be created by running the following commands from the project root folder. The Python version can be >= 3.10 (the solution was thoroughly tested on 3.10):

# create the conda environment
conda create -n mmdoc python=3.10

# activate the conda environment
conda activate mmdoc

# install the project requirements
pip install -r requirements.txt

Prepare the .env File

Configure your .env file properly. Refer to the .env.sample file included in this solution. All non-optional values must be filled in for the solution to function properly.

The .env file is used for:

  1. Local development, if needed
  2. Deployment: the deployment script reads values from the .env file and populates the Configuration Variables for both web apps.

Running the Chainlit Web App

The Chainlit web app is the main web app to chat with your data. To run the web app locally, please execute the following in your conda environment:

# cd into the ui folder
cd ui

# run the chainlit app
chainlit run chat.py

Running the Streamlit Web App

The Streamlit web app is the main web app to ingest your documents and to build prompts for Generation. To run the web app locally, please execute the following in your conda environment:

# cd into the ui folder
cd ui

# run the streamlit app
streamlit run main.py

Guide to configure the Chainlit and Streamlit Web Apps

  1. Configure properly your .env file. Refer to the .env.sample file included in this solution.
  2. In the Chainlit web app, use cmd index to set the index name.


Deploying on Azure

We are currently building an ARM template for a one-click deployment. In the meantime, please use the below script to deploy to the Azure cloud. Please make sure to fill in your .env file properly before running the deployment script. The below script has to run in a Git Bash shell and will not run in PowerShell. Visit the deployment section here to get detailed instructions and advanced deployment options.

# cd into the deployment folder
cd deployment

# run the deployment script
./deploy_public.sh


Local Development for Azure Cloud

For rapid development iterations and for testing on the cloud, the push.ps1 script can be used to build only the Docker images and push them to the Azure Container Registry, without creating or changing any other component in the resource group or in the architecture. The Docker images then have to be assigned manually to each web app: go to the web app page in the Azure Portal, navigate to Deployment > Deployment Center on the left-hand side, go to Settings, and choose the correct Docker image from the Tag dropdown.

Please edit the push.ps1 script and fill in the right values for the Azure Container Registry endpoint, username and password, the Resource Group name, and the Subscription ID. Then, to run the script, follow the instructions below in a PowerShell terminal. Docker Desktop must be installed and running locally at that point. The command has to be run from the root directory of the project:

# cd into the root folder of the project
cd <project root>

# run the docker images update script
deployment/push.ps1



Code Interpreters

Code Interpreters Available in this Solution:

  1. Assistants API: the OpenAI Assistants API is the default, out-of-the-box code interpreter for this solution running on Azure (a minimal usage sketch follows this list).
  2. TaskWeaver: optional to install and use, and fully supported.
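
The sketch below shows the general Assistants API Code Interpreter flow on Azure OpenAI for a calculation-heavy question. The deployment name, API version, and polling loop are illustrative assumptions rather than the solution's actual code.

# Sketch only: run a calculation question through the Assistants API Code Interpreter.
import os
import time

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-05-01-preview",                 # assumed Assistants-capable API version
)

assistant = client.beta.assistants.create(
    model="gpt-4-turbo",                              # hypothetical deployment name
    instructions="Answer using the provided figures; compute exact results with code.",
    tools=[{"type": "code_interpreter"}],
)
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Revenue was 1.2, 1.5 and 1.9 (in $M) over three years. What is the CAGR?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

for message in client.beta.threads.messages.list(thread_id=thread.id):
    for part in message.content:
        if part.type == "text":
            print(message.role, ":", part.text.value)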


Taskweaver Installation (optional)

TaskWeaver requires Python >= 3.10. It can be installed by running the following commands from the project root folder. Please follow the below commands carefully, starting by creating a new conda environment:

# create the conda environment
conda create -n mmdoc python=3.10

# activate the conda environment
conda activate mmdoc

# install the project requirements
pip install -r requirements.txt

# clone the repository
git clone https://github.com/microsoft/TaskWeaver.git

# cd into Taskweaver
cd TaskWeaver

# install the Taskweaver requirements
pip install -r requirements.txt

# copy the Taskweaver project directory into the root folder and name it 'test_project'
cp -r project ../test_project/

Note: Inside the test_project directory, there's a file called taskweaver_config.json which needs to be populated. Please refer to the taskweaver_config.sample.json file in the root folder of this repo, fill in the Azure OpenAI model values for GPT-4-Turbo, rename it to taskweaver_config.json, and then copy it inside test_project (or overwrite existing).
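
For orientation only, the following sketch writes a minimal taskweaver_config.json for an Azure OpenAI GPT-4-Turbo deployment; the key names are assumptions based on TaskWeaver's Azure OpenAI settings, and taskweaver_config.sample.json remains the authoritative template.

# Illustrative only: key names are assumptions; verify against taskweaver_config.sample.json.
import json

taskweaver_config = {
    "llm.api_type": "azure",
    "llm.api_base": "https://<your-resource>.openai.azure.com/",
    "llm.api_key": "<azure-openai-key>",
    "llm.model": "gpt-4-turbo",      # your GPT-4-Turbo deployment name
}

with open("test_project/taskweaver_config.json", "w") as f:
    json.dump(taskweaver_config, f, indent=4)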


Note: Similarly, there are a number of test notebooks in this solution that use Autogen. If you want to experiment with Autogen, the file OAI_CONFIG_LIST in the code folder needs to be configured. Please refer to OAI_CONFIG_LIST.sample, populate it with the right values, and then rename it to OAI_CONFIG_LIST.
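
For orientation, the sketch below writes a minimal OAI_CONFIG_LIST for Autogen with Azure OpenAI placeholders; all values are illustrative, and OAI_CONFIG_LIST.sample remains the authoritative template.

# Illustrative only: values are placeholders; refer to OAI_CONFIG_LIST.sample.
import json

config_list = [
    {
        "model": "gpt-4-turbo",                      # your GPT-4-Turbo deployment name
        "api_key": "<azure-openai-key>",
        "base_url": "https://<your-resource>.openai.azure.com/",
        "api_type": "azure",
        "api_version": "2024-02-01",
    }
]

with open("code/OAI_CONFIG_LIST", "w") as f:
    json.dump(config_list, f, indent=4)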



Processing Steps

The create_doc_chunks_with_doc_int_markdown function is integral to the processing of documents, particularly when utilizing the Document Intelligence service. It's designed to handle the markdown conversion of document chunks, ensuring that the extracted data is formatted correctly for further analysis. This function is applicable to various document formats and is capable of processing text, images, and tables, making it versatile in the multimodal information extraction process. Its role is crucial in structuring the raw extracted data into a more accessible and analyzable form.

The create_image_doc_chunks function is integral to the processing of image data within multimodal documents. It specifically targets the extraction and organization of image-related content, segmenting each image as a discrete chunk for further analysis. This function is applicable across various document formats that include image data, playing a crucial role in the multimodal extraction pipeline by ensuring that visual information is accurately captured and prepared for subsequent processing steps such as tagging and analysis. It deals exclusively with the image modality, isolating it from text and tables to streamline the handling of visual content.

The create_pdf_chunks function is a crucial step in the document ingestion process, particularly for PDF files. It segments the input PDF document into individual chunks, which are then processed separately in subsequent stages of the pipeline. This function is applicable to all modalities within a PDF document, including text, images, and tables, ensuring a comprehensive breakdown of the document's content for detailed analysis and extraction. Its role is foundational, as it sets the stage for the specialized processing of each modality by other functions in the pipeline.

The function create_table_doc_chunks_markdown is responsible for processing tables within documents, specifically converting them into Markdown format. It is applicable to .xlsx files as part of the openpyxl pipeline. This function not only handles the conversion but also manages the chunking of tables when they are too large, ensuring that the Markdown representation is accurate and manageable. It processes the table modality exclusively and is crucial for preserving the structure and data of tables during the document ingestion process.

The delete_pdf_chunks function is a crucial step in the document processing pipeline, particularly for PDF files. It is responsible for removing the temporary storage of PDF chunks from memory, ensuring that the system resources are efficiently managed and not overburdened with unnecessary data. This function is applied after the initial extraction of high-resolution images and text from the PDF document, and before any post-processing of images or tables. It is applicable to all modalities—text, images, and tables—since it deals with the cleanup of data extracted from PDF chunks.

The extract_doc_using_doc_int function is a key component in the document processing pipeline, specifically tailored for handling .docx and .pdf files. It leverages the capabilities of Azure's Document Intelligence Service to analyze and extract structured data, including text and tables, from documents. This function is crucial for converting document content into a format that can be further processed for insights and is versatile in dealing with both textual and tabular data modalities.

The extract_docx_using_py_docx function is designed to handle the extraction of content from .docx files, specifically focusing on text, images, and tables. It utilizes the python-docx library to access and extract these elements, ensuring that the data is accurately retrieved and stored in a structured format suitable for further processing. This function is crucial for the initial stage of the ingestion pipeline, setting the foundation for subsequent analysis and processing steps. It is applicable to .docx files and is responsible for extracting all three modalities: text, images, and tables, from the document.

The extract_tables_from_images function is designed to identify and extract tables from image files within a document. It applies to image modalities, specifically targeting visual data representations such as tables embedded within image files. This function is crucial for converting visual table data into a structured format that can be further processed or analyzed, making it an essential step in multimodal document processing pipelines that deal with both textual and visual information. It is particularly relevant for documents where tabular information is presented in non-textual formats.
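
As an illustration of this idea (not the solution's actual prompt or code), the following sketch asks a GPT-4 vision deployment on Azure OpenAI to transcribe a table image into Markdown; the deployment name is a placeholder.

# Sketch only: transcribe a table image to Markdown with a GPT-4 vision deployment.
import base64
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

def table_image_to_markdown(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision",                        # hypothetical vision deployment name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract every table in this image as a Markdown table. Preserve all numbers exactly."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=2000,
    )
    return response.choices[0].message.content

# Usage: markdown_table = table_image_to_markdown("stages/report/images/page_3.png")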

The extract_xlsx_using_openpyxl function is designed to handle the extraction of data from .xlsx files, specifically focusing on the retrieval of tables and their conversion into various formats for further processing. It leverages the openpyxl library to access and manipulate Excel files, ensuring that the extracted data is accurately represented in Python-friendly structures such as DataFrames. This function is crucial for parsing spreadsheet data, which is often rich in structured information, making it a key step in the data extraction phase for .xlsx files within the ingestion pipeline. It processes the table modality, transforming Excel sheets into Markdown, plain text, and Python scripts, which can then be integrated into the multimodal information extraction framework.

The generate_analysis_for_text function is designed to analyze the relationship between a specific text chunk and the overall content of a document. It highlights entity relationships introduced or extended in the text chunk, providing a concise analysis that adds context to the document's topics. This function is applicable to all modalities—text, images, and tables—ensuring a comprehensive understanding of the document's content. It plays a crucial role in enhancing the document's metadata by providing insights into the significance of each section within the larger document structure.

The generate_document_wide_summary function is responsible for creating a concise summary of the entire document's content. It extracts key information and presents it in a few paragraphs, ensuring that the essence of the document is captured without unnecessary details. This function is applicable to all document formats, including text, images, and tables, making it a versatile component in the multimodal information extraction pipeline. It plays a crucial role in providing a quick overview of the document, which can be beneficial for both indexing and search purposes.

The generate_document_wide_tags function is a crucial component in the document ingestion pipeline, applicable across various document formats including PDF, DOCX, and XLSX. It is responsible for extracting key tags from the entire document, which are essential for enhancing search and retrieval capabilities. This function processes text modality, ensuring that significant entities and topics within the document are captured as tags, aiding in the creation of a searchable index for the ingested content.

The generate_tags_for_all_chunks function is integral to the multimodal information extraction process, applicable across various document formats including PDF, DOCX, and XLSX. It operates on all three modalities—text, images, and tables—extracting and optimizing tags for enhanced search and retrieval within a vector store. This function ensures that each chunk of the document, regardless of its content type, is accurately represented by a set of descriptive tags, facilitating efficient indexing and subsequent search operations.

The pdf_extract_high_res_chunk_images function is responsible for extracting high-resolution images from each chunk of a PDF document. It plays a crucial role in the initial stages of the document processing pipeline, particularly for PDF formats, ensuring that visual data is captured in detail for subsequent analysis. This function focuses on the image modality, converting document chunks into PNG images at a DPI of 300, which are then used for further image-based processing tasks.
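
A minimal sketch of the underlying idea using PyMuPDF, assuming one image per page at 300 DPI; the solution's actual chunking scheme and library choice may differ.

# Sketch only: render PDF pages to 300-DPI PNGs with PyMuPDF (fitz).
import os

import fitz  # PyMuPDF

def pdf_pages_to_png(pdf_path: str, out_dir: str, dpi: int = 300) -> list[str]:
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            path = os.path.join(out_dir, f"page_{page.number + 1}.png")
            pix.save(path)
            paths.append(path)
    return paths

# Usage: pdf_pages_to_png("report.pdf", "stages/report/images")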

The pdf_extract_images function is designed to handle the extraction of images from PDF documents. It is applicable to PDF formats and operates within a multimodal extraction context, where it specifically processes the image modality. This function plays a crucial role in isolating visual content from PDFs, which is essential for subsequent image analysis and understanding in the broader multimodal information extraction process.

The pdf_extract_text function is a crucial component in the document processing pipeline, specifically tailored for handling PDF files. It is responsible for extracting textual content from each page of a PDF document, converting it into a machine-readable format. This function is pivotal for subsequent stages that may involve text analysis, search indexing, or further data extraction tasks. It operates solely on the text modality, ensuring that the rich textual information embedded within PDFs is accurately captured and made available for downstream processing.

The post_process_images function is integral to refining the output from image extraction operations within the document ingestion process. It specifically handles the enhancement and clarification of images, ensuring that any visual data is accurately represented and usable for subsequent analysis. This function is applicable across various document formats that include image content, playing a pivotal role in multimodal information extraction where visual data is a key component. It is designed to work with images as a modality, complementing other functions that handle text and tables.

The post_process_tables function is designed to handle the refinement of table data extracted from documents. It applies to various document formats, including PDFs and images, where tables are present. The function's role is to enhance the quality of the extracted table information, ensuring that it is accurately represented and formatted for further use. It specifically deals with the 'table' modality, focusing on the post-extraction processing of tables to prepare them for integration into a searchable vector index or for analytical computations.
