This Repo aims at evaluating the (Document Intelligence+LLM) technique for entity extraction from Complex Tax Documents. We use schema2doc mapping based on Document Intelligence (DI) output of the processed document. DI provides a JSON or Markdown output format, including the styles information. Using LLM prompting, we ask the LLM (GPT4o) to process the DI output and provide a JSON format with the defined schema.
- Azure account: Create azure account by signing up here
- Azure CLI: Install the Azure CLI from here.
Open your terminal and login to your Azure account:
az loginFollow the instructions to complete the authentication process. If you are using a specific subscription, set it as the default:
az account set --subscription "your-subscription-id"az group create --name <resource_group_name> --location <region>
- In your Azure portal, click on “Create a resource”.
- Search for “OpenAI” and select it.
- Click on “Create” and fill in the necessary details such as name, subscription, resource group, etc.
- Click on “Review + Create” and then “Create” to create the resource.
- Once the deployment is complete, go to the resource page.
- Under “Keys and Endpoint”, you can find your key and endpoint. Save these for later use.
- In your Azure portal, click on “Create a resource”.
- Search for “Document Intelligence” and select it.
- Click on “Create” and fill in the necessary details.
- Click on “Review + Create” and then “Create” to create the resource.
- Once the deployment is complete, go to the resource page.
- Under “Keys and Endpoint”, you can find your key and endpoint. Save these for later use.
- In your Azure portal, click on “Create a resource”.
- Search for “Azure Search” and select it.
- Click on “Create” and fill in the necessary details.
- Click on “Review + Create” and then “Create” to create the resource.
- Once the deployment is complete, go to the resource page.
- Under “Keys and Endpoint”, you can find your key and endpoint. Save these for later use.
- In your Azure portal, Click on Azure Active Directory in the left-hand menu.
- Your Tenant ID is listed as Directory ID on the default page.
- In the Azure portal, click on App Registrations in the left-hand menu under Azure Active Directory.
- Click on New Registration at the top.
- Fill in the details such as name, supported account types, and redirect URI (if necessary), then click Register.
- After the app is registered, the Application (client) ID is displayed on the app page. This is your Client ID.
- To get the Client Secret, click on Certificates & secrets in the left-hand menu of the app page.
- Click on New client secret, add a description, select an expiry period, and click Add.
- After the client secret is created, copy the Value. This is your Client Secret.
- Open your code editor or terminal.
- Navigate to the root directory of your project.
- Create a new file named
.env.- Open the
.envfile.- Add your environment variables in the format
KEY=VALUE, one per line. For example:
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_TENANT_ID=<your-tenant-id>
AZURE_CLIENT_ID=<your-client-id>
AZURE_CLIENT_SECRET=<your-client-secret>
AZURE_OPENAI_ENDPOINT=<YOUR_RESOURCE_ENDPOINT_HERE>
AZURE_OPENAI_API_KEY=<YOUR_RESOURCE_KEY_HERE>
DOC_INTELLIGENCE_ENDPOINT=<YOUR_RESOURCE_ENDPOINT_HERE>
DOC_INTELLIGENCE_KEY=<YOUR_RESOURCE_KEY_HERE>
VECTOR_SEARCH_ENDPOINT=<YOUR_RESOURCE_ENDPOINT_HERE>
VECTOR_SEARCH_KEY=<YOUR_RESOURCE_KEY_HERE>
DEPLOYMENT_NAME=<YOUR_MODEL_DEPLOYMENT_NAME_HERE>
It is recommended that Python virtual environments are used for local branch development.
Then main advantage of using virtual environments is that you can create a separate workspace environment for a branch, so that yo can safely install, remove or upgrade a library without affecting other environments.
venv docs: https://docs.python.org/3/library/venv.html
Create a new environment with Python version=3.11
conda create -n myenv python==3.11Then, activate the environment
conda activate myenvpip install -r requirements.txtRefer to the notebook Complex-Data-Extraction.ipynb for testing process with your own documents.