SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.  
SPDX-License-Identifier: Apache-2.0

# Vision NIMs for Structured Text Extraction Workshop

NVIDIA Inference Microservices (NIMs) are a collection of easy-to-use, API-driven microservices for interacting with AI models.

This workshop focuses on combining Florence, OCDRNet, VLMs and LLMs to build a robust structured text extraction pipeline. Extracting specific pieces of information from documents such as Photo IDs is often a challenge. Because Photo IDs come in many formats and place key information irregularly, it can be difficult for traditional CV models to robustly extract fields such as First Name, Last Name and Date of Birth.

This notebook will show how to build a robust text extraction pipeline where the user can specify in natural language what fields to extract from a given image and receive the filled out fields in JSON format. To demonstrate this pipeline, the notebook will use synthetically generated Photo ID images. 

The pipeline will be built using NVIDIA NIMs, which provide access to powerful generative AI models through REST APIs.

To learn more about NIMs, visit <a href="https://build.nvidia.com/explore/discover">ai.nvidia.com</a>.

![text extraction pipeline diagram](readme_assets/text_extract_pipeline.png)

This workshop has five parts:

**Part 0**: Setup Environment  
**Part 1**: Preview Dataset  
**Part 2**: VLM Text Extraction   
**Part 3**: Optical Character Detection and Recognition  
**Part 4**: Structured Text Extraction Pipeline  

# Part 0: Prepare the Workspace

## Part 0.1: Setup Environment

First, we set up the environment. This includes installing the required libraries.

In [1]:
import subprocess
import platform
import os

# Check/create virtual environment based on notebook directory name

# Get the directory name of the current notebook
notebook_dir = os.path.basename(os.path.abspath('.'))
venv_name = notebook_dir + '_venv'

print(f"Notebook directory name: {notebook_dir}")


# Check if a virtual environment is active
active_venv = os.environ.get('VIRTUAL_ENV')
if active_venv:
    active_venv_name = os.path.basename(active_venv)
    print(f"Currently active virtual environment: {active_venv_name}")
    
    # If the active venv doesn't match the directory name
    if active_venv_name != venv_name:
        print(f"Warning: Active environment doesn't match notebook directory name.")


# Check if a virtual environment with the directory name exists
venv_path = os.path.join(os.path.abspath('.'), venv_name)
print(f"Virtual environment path: {venv_path}")
bin_dir = 'Scripts' if platform.system() == 'Windows' else 'bin'
activate_script = os.path.join(venv_path, bin_dir, 'activate')

if os.path.exists(activate_script):
    print(f"Virtual environment '{venv_name}' exists.")
    
    if not active_venv or active_venv_name != venv_name:
        print(f"To activate this environment, restart the kernel and run:")
        if platform.system() == 'Windows':
            print(f"    {os.path.join(venv_path, 'Scripts', 'activate')}")
        else:
            print(f"    source {os.path.join(venv_path, 'bin', 'activate')}")
    
else:
    print(f"Creating virtual environment '{venv_name}'...")
    try:
        subprocess.run(['python', '-m', 'venv', venv_name], check=True)
        print(f"Virtual environment '{venv_name}' created successfully.")
        print("To activate this environment, restart the kernel and run:")
        if platform.system() == 'Windows':
            print(f"    {os.path.join(venv_path, 'Scripts', 'activate')}")
        else:
            print(f"    source {os.path.join(venv_path, 'bin', 'activate')}")
    except subprocess.CalledProcessError as e:
        print(f"Error creating virtual environment: {e}")

Notebook directory name: vision_text_extraction
Virtual environment path: /home/luke/Documents/GitHub/Camera-Based-Tracking/metropolis-providencecv/nim_workflows/vision_text_extraction/vision_text_extraction_venv
Virtual environment 'vision_text_extraction_venv' exists.
To activate this environment, restart the kernel and run:
    source /home/luke/Documents/GitHub/Camera-Based-Tracking/metropolis-providencecv/nim_workflows/vision_text_extraction/vision_text_extraction_venv/bin/activate


## Part 0.2: Activate the environment and install dependencies

***Run the script from the previous output to activate the environment.***

```bash
source /path/to/your/env/bin/activate
```

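The next cell installs the required dependencies by calling the environment's own Python interpreter directly, so the packages land in the virtual environment even though the notebook kernel itself has not switched environments yet.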

In [2]:
# Deactivate the current environment if any
if active_venv and active_venv_name != venv_name:
    print(f"Deactivating the current environment '{active_venv_name}'...")
    subprocess.run("bash -c 'deactivate'", shell=True)

# Activate the virtual environment
print(f"Activating virtual environment '{venv_name}'...")
subprocess.run(f"bash -c 'source {activate_script}'", shell=True)
venv_python = os.path.join(venv_path, bin_dir, "python")

try:
    subprocess.run([venv_python, '-m', 'pip', 'install', '-r', 'requirements.txt'], check=True)
    print("Requirements installed successfully.")
except subprocess.CalledProcessError as e:
    print(f"Error installing required packages: {e}")

Activating virtual environment 'vision_text_extraction_venv'...


Requirements installed successfully.



[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python -m pip install --upgrade pip


## Part 0.3: Restart the kernel and update Python version

At the top of the notebook, you will see a button that says "Restart". Press it to restart the kernel.

Once that is done, change the Python version in the top right to the one matching the name of the environment you just created. For example, if you created an environment called `nim`, select the Python version that says `nim` in the name.
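
To confirm that the kernel really switched, one option is to check which interpreter is running. The sketch below is not part of the original workshop; `in_virtualenv` is a hypothetical helper name, and it relies only on the standard `sys.prefix` and `sys.base_prefix` attributes.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtualenv()}")
```

If the interpreter path does not contain the name of the environment created in Part 0.1, the kernel selection probably needs another look.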

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
import re
import json
import math 
from pathlib import Path 
from random import sample 
import matplotlib.pyplot as plt 
from PIL import Image 
import pandas as pd 

## Part 0.4: Retrieve API Key

The API key is required to access the NIMs. It should be stored in a .env file in either this directory or a parent directory.
The .env file should contain the following line:

```
NIM_API_KEY=nvapi-<the_rest_of_your_api_key>
```

You can get your API key from the NVIDIA NIMs portal: https://build.nvidia.com/

In [2]:
# Locate the .env file, searching this directory and then its parent directories.
# If the key fails to load, check that no clashing .env file sits next to this notebook.
dotenv_path = find_dotenv()
# Load environment variables from the .env file that was found
load_dotenv(dotenv_path)

api_key = os.getenv("NIM_API_KEY")
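
A quick sanity check on the loaded key can save a confusing authentication error later. The sketch below is an illustration only: `looks_like_nim_key` is a hypothetical helper, and the only thing it assumes about the key format is the `nvapi-` prefix shown in the .env example above.

```python
import os

def looks_like_nim_key(key):
    """Heuristic check: a non-empty string that starts with the 'nvapi-' prefix."""
    prefix = "nvapi-"
    return isinstance(key, str) and key.startswith(prefix) and len(key) > len(prefix)

# os.getenv returns None when the variable was never loaded
key = os.getenv("NIM_API_KEY")
if not looks_like_nim_key(key):
    print("NIM_API_KEY is missing or malformed; check your .env file.")
```

This only checks the shape of the key. Whether the key is actually valid is decided by the API when the first request is made.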

# Part 1: Preview Dataset

This notebook uses a subset of the [Synthetic dataset of ID and Travel Document (SIDTD)](https://tc11.cvc.uab.es/datasets/SIDTD_1) included in the repository. The dataset is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

The full dataset contains a collection of synthetically generated Photo IDs from various countries and includes annotations for all text in the image. Run the cells below to preview the images from the dataset and the associated labels. 

In [None]:
def plot_images(image_folder, num_images):
    image_files = os.listdir(image_folder)
    sample_image_paths = sample(image_files, num_images)
    image_paths = [Path(image_folder)/x for x in sample_image_paths]
    grid_size = (math.ceil(num_images/3), 3)
    fig, axes = plt.subplots(grid_size[0], grid_size[1], figsize=(15,15))
    fig.subplots_adjust()

    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i >= num_images:
            break 
        img = Image.open(image_paths[i])
        ax.imshow(img)
        
    plt.show()

In [None]:
image_folder = "sample_data/images"
image_files = os.listdir(image_folder)
image_paths = [Path(image_folder)/x for x in image_files]

In [None]:
plot_images(image_folder, 9)

In [None]:
#load field data
esp_id = "sample_data/esp_id.json"
with open(esp_id, "r")  as file:
    esp_id_truth = json.load(file)

In [None]:
def get_fields(image_name, json_data):
    """Get fields from annotations file given image name"""
    id_list = json_data["_via_image_id_list"]
    image_index = int(image_name.split(".")[0].split("_")[2])
    image_id = id_list[image_index]
    
    regions = json_data["_via_img_metadata"][image_id]["regions"]
    fields = []
    for region in regions:
        field = region["region_attributes"]["field_name"]
        value = region["region_attributes"]["value"]
        fields.append({"field":field, "value":value})
    return fields

From the annotations file, we can view the associated metadata with each Photo ID. This is what we want to extract from the image of the Photo ID using AI.
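
To make the lookup in `get_fields` easier to follow, here is a small sketch that runs the same steps on toy data. The dictionary keys mirror the ones `get_fields` reads, but every ID and value is made up, and `image_index_from_name` and `toy_get_fields` are hypothetical names used only for this illustration.

```python
# Toy annotation data with the same keys that get_fields reads.
# Every ID and value below is made up for illustration.
toy_annotations = {
    "_via_image_id_list": ["id_a", "id_b"],
    "_via_img_metadata": {
        "id_a": {"regions": [
            {"region_attributes": {"field_name": "name", "value": "ANA"}},
        ]},
        "id_b": {"regions": [
            {"region_attributes": {"field_name": "name", "value": "LUIS"}},
            {"region_attributes": {"field_name": "surname", "value": "GARCIA"}},
        ]},
    },
}

def image_index_from_name(image_name):
    """'esp_id_1.jpg' -> 1: drop the extension, then take the third '_'-separated token."""
    return int(image_name.split(".")[0].split("_")[2])

def toy_get_fields(image_name, annotations):
    """Same lookup as get_fields, reading only from the data passed in."""
    image_id = annotations["_via_image_id_list"][image_index_from_name(image_name)]
    regions = annotations["_via_img_metadata"][image_id]["regions"]
    return [
        {"field": r["region_attributes"]["field_name"],
         "value": r["region_attributes"]["value"]}
        for r in regions
    ]

print(toy_get_fields("esp_id_1.jpg", toy_annotations))
```

The number in the file name is used as a position in `_via_image_id_list`, so the file naming and the annotation order have to stay in sync for the lookup to return the right record.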

In [None]:
fields = get_fields("esp_id_84.jpg", esp_id_truth)
for x in fields:
    print(x)

The next sections will show how to apply various Vision AI NIMs to extract these fields from the images. 

Several Vision NIMs are available to help us do this:

- [Visual Language Models (VLMs)](https://build.nvidia.com/microsoft/microsoft-florence-2)
- [Florence](https://build.nvidia.com/microsoft/microsoft-florence-2)
- [OCDRNet](https://build.nvidia.com/nvidia/ocdrnet)

To view all available Vision NIMs, view [this page](https://build.nvidia.com/explore/vision). 

# Part 2: VLM Text Extraction

VLMs are capable of taking in natural language text prompts and images. By using a VLM, we can build a pipeline that is customizable and can be prompt-tuned to work on different use cases. 

The VLM can be given the image of the Photo ID and a prompt listing the fields to find and extract. VLMs can also perform optical character detection and recognition (OCDR) on their own, with varying levels of success, which allows them to find names, dates, ID numbers and so on. 

The following cell will set the fields we want the VLM to find. 

In [None]:
fields = ["name", "surname", "issue date", "nationality", "gender"]

To use the VLM NIM, a wrapper class has been implemented; see the vlm.py file in the same directory as this notebook for the full code. The rest of this notebook uses the NEVA 22b VLM NIM, but any of the following VLMs can be used instead by changing the endpoint URL. 

- "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"
- "https://ai.api.nvidia.com/v1/vlm/microsoft/kosmos-2"
- "https://ai.api.nvidia.com/v1/vlm/adept/fuyu-8b"
- "https://ai.api.nvidia.com/v1/vlm/google/paligemma"
- "https://ai.api.nvidia.com/v1/vlm/microsoft/phi-3-vision-128k-instruct"

In [None]:
from vlm import VLM 
vlm = VLM("https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b", api_key)

To make the VLM extract the fields, it needs some background information in the system prompt. This will tell the VLM that its goal is to extract the supplied fields from the input image. 

In [None]:
system_prompt = "Your job is to inspect an image and fill out a form provided by the user. This form will be provided in JSON format and will include a list of fields and field descriptions. Inspect the image and do your best to fill out the fields in JSON format based on image. Only find the provided fields. Do not add anything extra."

The user prompt can then be a template with the fields that need to be extracted. 

In [None]:
user_prompt = f"Here are the fields: {fields}. Fill out each field based on the image and respond in JSON format."
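
Formatting a Python list into the f-string produces Python-style text such as `['name', 'surname']`. Since the system prompt says the form is provided in JSON format, one possible variation is to render the fields as a JSON object with empty values. This is a sketch of that idea rather than what the notebook does; `build_user_prompt` is a hypothetical helper name.

```python
import json

def build_user_prompt(field_names):
    """Render the requested fields as an empty JSON form inside the prompt."""
    # Each field starts with an empty string that the model is asked to fill in
    form = json.dumps({name: "" for name in field_names})
    return (
        f"Here are the fields: {form}. "
        "Fill out each field based on the image and respond in JSON format."
    )

example_prompt = build_user_prompt(["name", "surname"])
print(example_prompt)
```

Whether this phrasing helps depends on the VLM, so it is worth comparing both prompt styles on a few images.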

Now the system prompt, user prompt and Photo ID image can be given to the VLM to extract the fields. 

In [None]:
image = image_paths[0]
print(f"Image File: {image}")
response = vlm(user_prompt, image, system_prompt=system_prompt)

#Show image
plt.figure(figsize=(9, 6))
plt.title(f"{image}")
plt.imshow(Image.open(image))

#Show VLM response 
print(f"VLM Response \n{response}")

#Show labelled fields 
labelled_fields = get_fields(Path(image).name, esp_id_truth)
print("Labelled Fields\n")
for x in labelled_fields:
    print(x)


The output from the VLM may not be accurate. Depending on the VLM model, it can misread some characters from the ID and output misspelled fields. An option to detect the characters more accurately is to use a more powerful model built for optical character detection and recognition. 
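
One way to put a number on "misread some characters" is a character-level similarity score between a predicted value and its label. The sketch below uses `difflib.SequenceMatcher` from the standard library; `field_similarity` and `score_fields` are hypothetical helpers, and the sketch assumes the VLM response has already been parsed into a dictionary.

```python
from difflib import SequenceMatcher

def field_similarity(predicted, expected):
    """Return a 0..1 similarity ratio, ignoring case and surrounding whitespace."""
    a = str(predicted).strip().lower()
    b = str(expected).strip().lower()
    return SequenceMatcher(None, a, b).ratio()

def score_fields(predicted, labelled):
    """Score each labelled field that also appears in the prediction.

    predicted: dict of field name -> predicted value
    labelled:  list of {"field": ..., "value": ...} dicts, as returned by get_fields
    """
    scores = {}
    for item in labelled:
        name = item["field"]
        if name in predicted:
            scores[name] = field_similarity(predicted[name], item["value"])
    return scores

# A single misread character lowers the score below 1.0
print(field_similarity("GARC1A", "GARCIA"))
```

A ratio of 1.0 means the strings match after normalization, so lower values point at fields that are worth inspecting by eye.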

# Part 3: Optical Character Detection and Recognition 

While VLMs are capable of some OCDR, a dedicated OCDR model will often perform better. Two Vision NIMs can be used for this:

- [Florence](https://build.nvidia.com/microsoft/microsoft-florence-2)
- [OCDRNet](https://build.nvidia.com/nvidia/ocdrnet)

To more easily use OCDRNet and Florence, wrapper classes are provided. To view the full code, look at the florence.py and ocdrnet.py scripts in the same folder as this notebook. 

## Part 3.1: OCDRNet

With the OCDRNet NIM, we can extract the text from the image more accurately than with a VLM alone. However, the output of this model is just raw text that has been detected in the image. It is difficult to piece together to get structured outputs. To overcome this we can provide the extracted text to a VLM or LLM for further processing. 

In [None]:
from ocdrnet import OCDRNET
ocdrnet = OCDRNET(api_key)
ocd_response = ocdrnet(image)
ocd_response = [x["label"] for x in ocd_response["metadata"]]
ocd_response = " ".join(ocd_response)
print(ocd_response)

## Part 3.2: Florence

Florence is a small but very powerful model capable of several vision tasks such as OCDR, detection, captioning and segmentation. We can also use Florence to extract the text. Let's see how it compares to OCDRNet. 

In [None]:
from florence import Florence
florence = Florence(api_key)
ocd_response = florence(12, image) #12 is the task ID for OCR. Other IDs will change the task to detection, captioning etc. 
ocd_response = ocd_response["choices"][0]["message"]["content"]
print(ocd_response)

## Part 3.3: LLM Post Processing 

The output from Florence and OCDRNet is not structured. This makes it difficult to extract specific fields. To solve this problem, we can use an LLM NIM to post-process the string returned by OCDRNet or Florence. LLMs are very good at taking raw text and reformatting it. 

To access the NVIDIA LLM NIMs, the OpenAI library can be used and pointed to the LLM NIM APIs.  

The LLM can be given a prompt that includes the OCDR results, the fields we want to extract and instructions to output the result in JSON format. If the fields are output in JSON format, they will be easier to integrate with other services. 

In [None]:
from openai import OpenAI
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key=api_key)
messages = [
        {
            "role": "user",
            "content": f"I have a text string that may or may not be formatted in proper json. The response should have the following keys: {fields}. Please parse the response and match the key pair values and format it in JSON. Here is the response: {ocd_response}",
        }
    ]
print(messages)
completion = client.chat.completions.create(
        model="nv-mistralai/mistral-nemo-12b-instruct",
        messages=messages,
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024,
        stream=False,
    )
print(completion)
response = completion.choices[0].message.content
print(response)

From the output, you can see the LLM was able to take the raw text string and reformat it into JSON output with the specified fields. However, there are still some errors. To get better accuracy the VLM, OCDR model and LLM can all be combined into one pipeline. 
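
The LLM reply is still a plain string, often with the JSON wrapped in a fenced code block or surrounded by extra text. A minimal sketch of pulling a dictionary out of such a reply is shown below; it is similar in spirit to the parsing step in the pipeline class in Part 4, but `extract_json` is a hypothetical helper and the sample replies in the comments are made up.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable

def extract_json(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    # Prefer a fenced block, with or without a "json" language tag
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    if match:
        candidate = match.group(1)
    else:
        # Fall back to the outermost pair of curly braces
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            return None
        candidate = text[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

# Made-up reply that wraps the JSON in extra text
print(extract_json('Here you go: {"name": "ANA", "surname": "GARCIA"}'))
```

Returning `None` on failure keeps the caller in charge of what to do with a reply that cannot be parsed, for example retrying the request.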

# Part 4: Text Extraction Pipeline

With these three pieces, we can build a full pipeline to combine a VLM, OCDR model and LLM to extract structured text from any image. 

- The OCDR model (Florence or OCDRNet) extracts all text from the image and passes it to either the VLM or the LLM.
- The VLM attempts to extract the fields from the image and outputs them in JSON format.
- The LLM takes the output from either the VLM or the OCDR model and ensures it is properly formatted JSON.
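
The data flow between the three stages can be sketched with stub functions standing in for the real NIM calls. Everything below is illustrative: `run_stages` and the `fake_*` functions are hypothetical names, and the stubs only return strings that show which stage saw which input.

```python
def run_stages(image, ocdr=None, vlm=None, llm=None):
    """Chain the optional stages; each stage is any callable."""
    ocdr_text = ocdr(image) if ocdr else None
    # The VLM sees the image plus the OCDR text as a hint, when available
    vlm_text = vlm(image, ocdr_text) if vlm else None
    # The LLM cleans up the VLM output when present, otherwise the raw OCDR text
    llm_input = vlm_text if vlm_text else ocdr_text
    llm_text = llm(llm_input) if llm else None
    return llm_text if llm_text else vlm_text

# Stub stages that only record what they were given
def fake_ocdr(image):
    return f"ocdr({image})"

def fake_vlm(image, hint):
    return f"vlm({image}, hint={hint})"

def fake_llm(text):
    return f"llm({text})"

print(run_stages("id.jpg", ocdr=fake_ocdr, vlm=fake_vlm, llm=fake_llm))
print(run_stages("id.jpg", ocdr=fake_ocdr, llm=fake_llm))
```

The nesting in the printed strings shows the order in which the stages ran and which output each stage received.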

![Pipeline Diagram](readme_assets/notebook_pipeline.png)

## Part 4.1: Text Extraction Class

Below is the code to piece together the different models to form the full pipeline.

In [None]:
from vlm import VLM
from florence import Florence
from ocdrnet import OCDRNET
from openai import OpenAI

class TextExtraction:

    def __init__(self, api_key, vlm=None, llm=None, ocd=None, **kwargs):
        self.api_key = api_key
        self.vlm = vlm  # ["nvidia/neva-22b"]
        self.llm = llm  # ["nv-mistralai/mistral-nemo-12b-instruct"]
        self.ocd = ocd  # ["nvidia/ocdrnet", "microsoft/florence-2"]

        # VLM or LLM and OCD required
        if not self.vlm:
            if not (self.ocd and self.llm):
                raise ValueError("Either a VLM, or both an OCDR model and an LLM, must be provided.")

        self.vlm_system_prompt = kwargs.get(
            "vlm_system_prompt",
            "Your job is to inspect an image and fill out a form provided by the user. This form will be provided in JSON format and will include a list of fields and field descriptions. Inspect the image and do your best to fill out the fields in JSON format based on image. The JSON output should be in a JSON code block.",
        )
        self.llm_system_prompt = kwargs.get(
            "llm_system_prompt",
            "You are an AI assistant whose job is to inspect a string that may have json formatted output. This json format may not be correct so you must extract the json and make it properly formatted in a JSON block. You will be provided a list of keys that you must find in the input string. If you cannot find the associated value then put an empty string.",
        )

    def __call__(self, image, field_names, field_descriptions=None):
        """image - PIL image or file path"""

        field_descriptions = (
            [""] * len(field_names)
            if field_descriptions is None
            else field_descriptions
        )

        # Get Field Names and Descriptions in dict
        fields = {}
        for x in range(len(field_names)):
            fields[field_names[x]] = field_descriptions[x]

        # Stage 1: OCDR with OCDRNet or Florence
        if self.ocd is not None:
            # Setup OCD
            if self.ocd == "microsoft/florence-2":
                florence = Florence(self.api_key)
                ocd_response = florence(12, image)
                ocd_response = ocd_response["choices"][0]["message"]["content"]
            elif self.ocd == "nvidia/ocdrnet":
                ocdrnet = OCDRNET(self.api_key)
                ocd_response = ocdrnet(image)
                ocd_response = [x["label"] for x in ocd_response["metadata"]]
                ocd_response = " ".join(ocd_response)
            else:
                raise ValueError(f"Unsupported OCDR model: {self.ocd}")
        else:
            ocd_response = None

        # Stage 2: VLM Field Extraction
        if self.vlm is not None:
            # setup VLM
            vlm = VLM(f"https://ai.api.nvidia.com/v1/vlm/{self.vlm}", self.api_key)

            # Form Prompt
            user_prompt = f"Here are the fields: {fields}. Fill out each field based on the image and respond in JSON format."
            if ocd_response:
                user_prompt = (
                    user_prompt
                    + f" To assist you with filling out the fields, the following text has been extracted from the image: {ocd_response}"  # add OCDR output if available
                )
            vlm_response = vlm(user_prompt, image, system_prompt=self.vlm_system_prompt)
        else:
            vlm_response = None

        # Stage 3: LLM Post Processing
        if self.llm:
            llm_input = vlm_response if vlm_response else ocd_response
            # LLM call for fixing json formatting
            client = OpenAI(
                base_url="https://integrate.api.nvidia.com/v1", api_key=self.api_key
            )
            messages = [
                {"role": "system", "content": self.llm_system_prompt},
                {
                    "role": "user",
                    "content": f"I have a text string that may or may not be formatted in proper json. The response should have the following keys: {list(fields.keys())}. Please parse the response and match the key pair values and format it in JSON. Here is the response: {llm_input}",
                },
            ]
            completion = client.chat.completions.create(
                model=self.llm,
                messages=messages,
                temperature=0.2,
                top_p=0.7,
                max_tokens=1024,
                stream=False,
            )

            llm_response = completion.choices[0].message.content
        else:
            llm_response = None

        final_response = llm_response if llm_response else vlm_response

        # Extract the JSON part from the code block
        try:
            #Try to find json code block
            re_search = re.search(
                r"```json\n(.*?)\n```", final_response, re.DOTALL
            )
            if re_search:
                json_string = re_search.group(1)
            #If no code block then find curly braces 
            else:
                left_index = final_response.find("{")
                right_index = final_response.rfind("}")
                json_string = final_response[left_index:right_index+1]
                
            json_object = json.loads(json_string)
        except Exception as e:
            print(f"JSON Parsing Error: {e}")
            return {key:None for key in field_names} #return empty expected dict with no values 

        return json_object


## Part 4.2: Building and Testing The Pipelines 

The pipeline code is adaptable such that each model can be included or excluded. 
This allows us to build four combinations. 

- VLM
- OCD + VLM 
- OCD + LLM
- OCD + VLM + LLM
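
The rule enforced by the `TextExtraction` constructor (a VLM, or an OCDR model together with an LLM) can be checked by enumerating every on/off combination. In the sketch below, `is_valid_combo` is a hypothetical helper that mirrors that check with booleans instead of model names.

```python
from itertools import product

def is_valid_combo(vlm, llm, ocd):
    """Mirror the constructor rule: a VLM, or an OCDR model together with an LLM."""
    return bool(vlm) or (bool(ocd) and bool(llm))

# Enumerate every (vlm, llm, ocd) on/off combination and keep the valid ones
valid = [
    combo for combo in product([False, True], repeat=3)
    if is_valid_combo(*combo)
]

for vlm_on, llm_on, ocd_on in valid:
    parts = [name for name, on in (("OCD", ocd_on), ("VLM", vlm_on), ("LLM", llm_on)) if on]
    print(" + ".join(parts))
```

Under this rule a fifth combination, VLM + LLM without an OCDR model, is also accepted by the constructor, although the notebook does not build it.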

Run the cells below to execute the pipelines on a sample Photo ID image and compare the results. Experiment with different VLMs, LLMs and OCDR models. The valid OCDR, VLM and LLM parameter inputs are listed below:

OCDR
- "nvidia/ocdrnet"
- "microsoft/florence-2"

VLMs
- "nvidia/neva-22b"
- "microsoft/phi-3-vision-128k-instruct"
- "google/paligemma"
- "adept/fuyu-8b"
- "microsoft/kosmos-2"

LLMs
- "nv-mistralai/mistral-nemo-12b-instruct"
- "mistralai/mixtral-8x22b-instruct-v0.1"
- "meta/llama-3.1-8b-instruct"
- "meta/llama-3.1-70b-instruct"
- Any other LLM listed on [this page](https://docs.api.nvidia.com/nim/reference/llm-apis) can be used. 

Full documentation on these models can be found on [this page](https://docs.api.nvidia.com/nim/reference/models-1)

Now we can instantiate all four pipelines. 

In [None]:
pipeline_vlm = TextExtraction(api_key, vlm="nvidia/neva-22b")
pipeline_ocd_vlm = TextExtraction(api_key, vlm="nvidia/neva-22b", ocd="nvidia/ocdrnet")
pipeline_ocd_llm = TextExtraction(api_key, llm="nv-mistralai/mistral-nemo-12b-instruct", ocd="nvidia/ocdrnet")
pipeline_ocd_vlm_llm = TextExtraction(api_key, vlm="nvidia/neva-22b", llm="nv-mistralai/mistral-nemo-12b-instruct", ocd="nvidia/ocdrnet")
pipelines = [pipeline_vlm, pipeline_ocd_vlm, pipeline_ocd_llm, pipeline_ocd_vlm_llm]

The following function will take the pipelines and run them on the same image and set of fields and return the results. 

In [None]:
def test_pipelines(pipelines, fields, field_descriptions, data, image):
    """Test a list of pipelines on a sample image"""
    results = []
    true_fields = get_fields(Path(image).name, data)
    formatted_true_fields = {}
    for tf in true_fields:
        if tf["field"] in fields:
            formatted_true_fields[tf["field"]] = tf["value"]
    results.append(formatted_true_fields)
    for pipeline in pipelines:
        result = pipeline(image, fields, field_descriptions=field_descriptions)
        results.append(result)
    return results 

Now we can define the fields we want to extract from the image, run all the pipelines and review the results. You can run the following cell multiple times to see the results on different images. You can also adjust the field names and descriptions to control the information that the pipeline extracts from the image.

In [None]:
#Run pipeline to extract fields on random sample image 
fields = ["birth_date", "expiry_date", "gender", "nationality", "name", "surname"]
field_descriptions = ["date of birth", "expiration date of ID", "gender", "nationality or country of origin", "first name", "last name or surname"]
image = sample(image_paths, 1)[0]
results = test_pipelines(pipelines, fields, field_descriptions, esp_id_truth, image)

#print results 
df = pd.DataFrame(results)
df.insert(0, "Pipeline",  ["Truth", "VLM", "OCD+VLM", "OCD+LLM", "OCD+VLM+LLM"]) #add identifiable names to each pipeline 
display(df.style.set_caption("Structured Text Extraction").hide(axis="index"))

#Show image
plt.figure(figsize=(9, 6))
plt.title(f"{image}")
plt.imshow(Image.open(image))

## Part 4.3: Interactive Gradio UI for Structured Text Extraction

This pipeline can be wrapped in a Gradio UI to provide an easy-to-use interface for testing structured text extraction on any image and model combination. The UI is a great way to quickly explore new use cases beyond Photo IDs. Run the cell below to launch the Gradio UI. 

Once launched, the UI will be available at http://localhost:7860

In [None]:
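import sys
# python_exe is not defined earlier in this kernel session, so use the interpreter running the kernel
python_exe = sys.executable
!{python_exe} main.py {api_key}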