# Introduction

LayoutLM ([Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/layoutlm#overview)) is a pre-training method that jointly models text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding. \
In this notebook, a fine-tuned LayoutLM model is used for Document Question Answering on documents such as invoices, forms, cheques, and ID documents. \
It can also perform well on unstructured documents (without clear key-value pairs) such as lease agreements, much like NLP question-answering models.

# Imports

In [None]:
# Necessary Installations and Imports

!pip install transformers
!pip install torch
!pip install pillow

from transformers import AutoTokenizer, AutoModelForDocumentQuestionAnswering, AutoProcessor
tokenizer = AutoTokenizer.from_pretrained("magorshunov/layoutlm-invoices")
# processor = AutoProcessor.from_pretrained("magorshunov/layoutlm-invoices")
model = AutoModelForDocumentQuestionAnswering.from_pretrained("magorshunov/layoutlm-invoices")

from transformers import DocumentQuestionAnsweringPipeline
pipe = DocumentQuestionAnsweringPipeline(model=model, framework = "pt", tokenizer=tokenizer) #, processor=processor

!sudo apt install -y tesseract-ocr
!pip install pytesseract
!which tesseract

import pytesseract
from pytesseract import Output
from PIL import Image
import cv2
from google.colab.patches import cv2_imshow

# Point pytesseract at the Tesseract binary reported by `which tesseract` above
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

# Helper Functions
1. `up_dpi(image, scale)`: Takes two parameters: `image`, the image filepath, and `scale`, the factor by which to scale the image. \
It returns the resized image `res` as an array.
2. `get_word_boxes(image, scale)`: Takes the same two parameters, `image` (the image filepath) and `scale` (the scaling factor), upscales the image with `up_dpi`, and performs OCR using Tesseract. \
Currently, the configuration used is `--psm 6`. You can experiment with the configuration to get the best OCR results. A good read on Tesseract page segmentation modes can be found [here](https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/). \
This function extracts each word and its bounding box coordinates, normalizes the coordinates to a (1000, 1000) grid, and returns them as a `list[((str)text, ((int)x0, (int)y0, (int)x1, (int)y1))]`. \
The statement `if(d["conf"][i]>60)` sets the minimum confidence score for text to be considered. This helps remove noise caused by images, stray characters from other languages, etc.
3. `get_answers(image, questions, scale)`: Takes three parameters: `image`, the image filepath; `questions`, a `list[(str)question]`; and `scale`, the scaling factor passed on to `get_word_boxes`. \
It iterates through `questions` and passes the `image`, the i-th question, and the word and bounding box information into the document question answering pipeline created in the imports section. \
It returns a list `answers`.
4. `get_answers_scores(image, questions, scale)`: Same as `get_answers`, except that it returns a `list[((str)answer, (float)confidence score)]`. It is sometimes useful to know the confidence score, because a low score can indicate an incorrect answer.
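
The bounding box normalization described in item 2 can be tried in isolation. The following is a minimal sketch, not the notebook's own helper: `normalize_bbox` is a hypothetical name, and the clamping to the 0 to 1000 range is an added assumption so that boxes touching the image edge stay within the grid.

```python
def normalize_bbox(x, y, w, h, img_w, img_h, target=1000):
    """Convert a pixel-space (left, top, width, height) box into
    (x0, y0, x1, y1) corners on a 0..target grid."""
    def scale(value, size):
        # Same arithmetic as get_word_boxes, plus clamping (an assumption)
        return max(0, min(target, int(value / size * target)))

    return (scale(x, img_w), scale(y, img_h),
            scale(x + w, img_w), scale(y + h, img_h))


# A 400x100 px box at (200, 100) inside an 800x400 px image
box = normalize_bbox(200, 100, 400, 100, img_w=800, img_h=400)
print(box)  # (250, 250, 750, 500)
```

Because the coordinates are expressed relative to the image size, the result does not depend on the `scale` factor used to upsample the image before OCR.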

In [2]:
def up_dpi(image, scale):
    res = cv2.resize(cv2.imread(image), None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC) # scale=2 (default)
    return res


def get_word_boxes(image, scale):
    # image_as_array = cv2.imread(image)
    image_as_array = up_dpi(image, scale)
    H, W = image_as_array.shape[:2]
    # options = "tesseract sample_images/image2.jpg stdout -l eng --psm 6 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz0123456789/-"
    d = pytesseract.image_to_data(image_as_array, config="--psm 6", output_type=Output.DICT) # Image.open(image) if image is a filepath  config=options,
    n_boxes = len(d['level'])
    word_boxes = []
    for i in range(n_boxes):
        if float(d["conf"][i]) > 60: # Setting minimum confidence to consider (cast because some pytesseract versions may return conf as a string)
            text = d["text"][i]
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            bbox = (int(x/W*1000), int(y/H*1000), int((x+w)/W*1000), int((y+h)/H*1000))
            word_boxes.append((text, bbox))
            cv2.rectangle(image_as_array, (x, y), (x + w, y + h), (0, 255, 0), 2)
    # cv2_imshow(image_as_array)
    # cv2.waitKey(0)
    return word_boxes


def get_answers(image, questions, scale):
    # image->filepath, questions->List
    word_boxes = get_word_boxes(image, scale)
    n_q = len(questions)
    answers = []
    for i in range(n_q):
        result = pipe([{"image": image, "question": questions[i], "word_boxes":word_boxes}])
        ans = result[0][0]["answer"]
        # score = result[0][0]["score"]
        answers.append(ans)
    return answers

def get_answers_scores(image, questions, scale):
    # image->filepath, questions->List
    word_boxes = get_word_boxes(image, scale)
    n_q = len(questions)
    answers = []
    scores = []
    for i in range(n_q):
        result = pipe([{"image": image, "question": questions[i], "word_boxes":word_boxes}])
        ans = result[0][0]["answer"]
        score = result[0][0]["score"]
        answers.append(ans)
        scores.append(score)
    return list(zip(answers, scores))
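
Since `get_answers_scores` returns `(answer, score)` pairs, one way to use the scores is to separate answers that look reliable from those that need a manual check. The sketch below is illustrative only: `split_by_confidence` is a hypothetical helper, the threshold of 0.5 is an arbitrary assumption, and the sample pairs are made up rather than real model output.

```python
def split_by_confidence(results, threshold=0.5):
    """Partition (answer, score) pairs into accepted and needs-review lists."""
    accepted, review = [], []
    for answer, score in results:
        if score >= threshold:
            accepted.append((answer, score))
        else:
            review.append((answer, score))
    return accepted, review


# Made-up pairs in the shape returned by get_answers_scores
sample_results = [("INV-001", 0.98), ("12 March", 0.31)]
accepted, review = split_by_confidence(sample_results, threshold=0.5)
print("accepted:", accepted)
print("review:", review)
```

A suitable threshold depends on the document type, so it is worth checking a few known documents before relying on it.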

# Demo
The demo is built with Gradio.
- Tab 1: Invoice \
Click to upload an image. The image is stored as a filepath, because the helper functions take an image filepath as input. Set a suitable scale. Click the "Get Payment Details" button to extract some key information from the invoice. The "Process Payment" button does not work for now.
- Tab 2: e-KYC \
This tab has a chatbot for natural language queries. Upload an image, set a suitable scale, type a question and hit enter. Click "Clear" when you want to clear the chat history.
- Tab 3: Cheque \
The functionality is similar to the Invoice tab. It performs decently on handwritten cheques. It almost always gets the A/C No. right, but sometimes gets the name and amount wrong. Rewording the questions asked in the `cheque_answers()` function can help get the required answer. Fancy font styles on cheques can also affect the results.
- Tab 4: Form \
Type all your questions, one per line, in the "Questions" textbox and click the "Get answers" button. All the answers with their confidence scores will appear in the "Answers" textbox. **This tab is good for experimentation.**

In [None]:
# !pip uninstall gradio
# !pip install -q gradio --use-deprecated=legacy-resolver
!pip install gradio

In [6]:
import gradio as gr
import time

with gr.Blocks() as demo:
    gr.Markdown("NOTE: Upload a document image and set a suitable scale factor for accurate OCR")
    with gr.Tab("Invoice"):
        image_1 = gr.Image(type="filepath")
        number_1 = gr.Number(value=2, label="Scale factor")
        button_1 = gr.Button("Get Payment Details")
        with gr.Row():
            name_1 = gr.Textbox(label="Client Name", interactive=True)
            inv_1 = gr.Textbox(label="Invoice Number", interactive=True)
            dt_1 = gr.Textbox(label="Invoice Date", interactive=True)
            tax_1 = gr.Textbox(label="Sales Tax", interactive=True)
            tot_1 = gr.Textbox(label="Total", interactive=True)
        dump_button_1 = gr.Button("Process Payment")  # Not wired to any action yet
        def invoice_answers(image, scale):
            questions = ["What is the Client Name?", "What is the Invoice Number?", "What is the invoice Date?", "What is the Sales Tax?", "What is the Total?"]
            answers = get_answers(image, questions, scale)
            return answers

    with gr.Tab("e-KYC"):
        with gr.Row():
            image_2 = gr.Image(type="filepath")
            chatbot_2 = gr.Chatbot()
        number_2 = gr.Number(value=2, label="Scale factor")
        question_2 = gr.Textbox(label="Question")
        clear_2 = gr.ClearButton([question_2, chatbot_2])
        def respond_2(image, question, scale, chat_history):
            bot_message = get_answers(image, [question], scale)[0]
            chat_history.append((question, bot_message))
            time.sleep(2)
            return "", chat_history
        question_2.submit(respond_2, [image_2, question_2, number_2, chatbot_2], [question_2, chatbot_2])

    with gr.Tab("Cheque"):
        image_3 = gr.Image(type="filepath")
        number_3 = gr.Number(value=2, label="Scale factor")
        button_3 = gr.Button("Get Info")
        with gr.Row():
            pay_3 = gr.Textbox(label="Pay", interactive=True)
            amt_3 = gr.Textbox(label="Amount", interactive=True)
            ac_3 = gr.Textbox(label="Ac/No.", interactive=True)
            date_3 = gr.Textbox(label="Date", interactive=True)
        dump_button_3 = gr.Button("Process")
        def cheque_answers(image, scale):
            questions = ["What is the Date?", "What is the Pay?", "What is the Ac/No?", "What is the Amount?"]
            answers = get_answers(image, questions, scale)
            return answers

    with gr.Tab("Form"):
        gr.Markdown("Type your questions in the 'Questions' Textbox, each question in a new line")
        with gr.Row():
            image_4 = gr.Image(type="filepath")
            number_4 = gr.Number(label="Scale factor", value=2)
        with gr.Row():
            question_4 = gr.Textbox(label="Questions", lines=12)
            answer_4 = gr.Textbox(label="Answers", lines=12)
        with gr.Row():
            button_4 = gr.Button("Get answers")
            dump_button_4 = gr.Button("Dump")
        def form_answers(image, questions, scale):
            # One question per line; skip blank or whitespace-only lines
            q = [line.strip() for line in questions.splitlines() if line.strip()]
            results = get_answers_scores(image, q, scale)
            out = ""
            for i, (ans, sco) in enumerate(results):
                out = out + f"{i+1}. {ans} (score={sco})\n"
            return out

    # with gr.Accordion("Open for More!"):
    #     gr.Markdown("Look at me...")

    button_1.click(invoice_answers, inputs=[image_1, number_1], outputs=[name_1, inv_1, dt_1, tax_1, tot_1])
    button_3.click(cheque_answers, inputs=[image_3, number_3], outputs=[date_3, pay_3, ac_3, amt_3])
    button_4.click(form_answers, inputs=[image_4, question_4, number_4], outputs=answer_4)

demo.launch() # debug=True





# Examples


In [7]:
print("0")
NOTE = """
These examples are best tried out in the "Form" tab to see how many of these questions are answered
"""

invoice_0 = {
    "image": "https://eswap.global/wp-content/uploads/2021/08/invoice.png",
    "scale": 2,
    "questions": """
What is the Client Name?
What is the Invoice No?
What is the Invoice Date?
What is the Due Date?
What is the Amount of Labor 3hrs?
What is the Unit Price of New set of Pedal arms?
What is the Subtotal?
What is the Sales Tax?
What is the Total?
"""
}

invoice_1 = {
    "image": "https://docs.bellatrix.solutions/product-integrations/images/sampleinvoice.png",
    "scale": 2,
    "questions": """
What is the Company?
What is the name?
What is the address?
What is the email?
What is the Subtotal?
What is the Shipping?
What is the Tax?
What is the Total?
What is the quantity of Cotton Male T-shirt?
What is the Unit price of Cotton Male T-shirt?
"""
}

kyc_0 = {
    "image": "https://i.imgur.com/aHeN4vj.jpeg",
    "scale": 4,
    "questions": """
What is the Licence No.?
What is the Name?
What is the S/W/D?
What is the Address?
What is the BG?
What is the Authorization to Drive?
What is the Issue Date?
What is the Validity Date?
What is the Inv Carr No?
    """
}

kyc_1 = {
    "image": "https://upload.wikimedia.org/wikipedia/commons/5/56/Specimen_Personal_Information_Page_South_Korean_Passport.jpg",
    "scale": 3,
    "questions": """
What is the Type?
What is the Issuing country?
What is the Passport No.?
What is the Surname?
What are the Given Names?
What is the Nationality?
What is the Date of Birth?
What is the Personal No.?
What is the Sex?
What is the Date of Issue?
What is the Date of Expiry?
What is the Authority?
    """
}

kyc_2 = {
    "image": "https://www.immihelp.com/assets/article-images/sample-oci-card-2.jpg",
    "scale": 6,
    "questions": """
What is the Surname?
What is the Given Name?
What is the Nationality?
What is the Place of Birth?
What is the Sex?
What is the Date of Birth?
What is the Occupation?
What is the Place of Issue?
What is the Date of Issue?
What is the No.?
    """
}

# Questions 5, 6, 7, 8 get incorrect answers due to improper OCR. Scroll below to find a Gradio demo to check this out.
# Try different scales and page segmentation modes
form_0 = {
    "image": "https://bemoneyaware.com/wp-content/uploads/2019/03/car-insurance-policy.jpg",
    "scale": 2.5,
    "questions": """
What is the Total Liability Premium?
What is the Basic Premium on Vehicle and non-electrical accessories?
What is the Aggravation Cover?
What is the Net Premium (A+B)?
What is the Total Premium Payable?
What is the Total Own Damage Premium?
What is the Nominee Age?
What is the OD Premium the preceding year?
Where was this policy signed at?
    """
}

form_1 = {
    "image": "https://imgv2-2-f.scribdassets.com/img/document/371202185/original/a6147f7f76/1688705823?v=1",
    "scale": 2.5,
    "questions": """
What is the Policy No?
What is the Prev Policy No?
What is the Insured's Name?
What is the Helpline No.?
What is the Issue Office Name?
What is the Service Tax?
What is the Total?
What is the Total value?
What is the Date?
What is the Address?
    """
}

0
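
Each example above stores its questions as one multi-line string, some with leading or trailing whitespace. The sketch below shows one way to turn such an entry into a clean question list before passing it to `get_answers_scores`. `example_to_questions` is a hypothetical helper, and `sample_example` is a shortened stand-in with a placeholder local filename. Note that the helper functions read the image from a local path, so an image referenced by URL would need to be downloaded first.

```python
def example_to_questions(example):
    """Split an example's multi-line 'questions' string into a clean list."""
    return [line.strip()
            for line in example["questions"].splitlines()
            if line.strip()]


# Shortened stand-in for one of the example dictionaries above
sample_example = {
    "image": "invoice.png",  # placeholder local path, not a real file
    "scale": 2,
    "questions": """
What is the Total?
What is the Due Date?
    """,
}

question_list = example_to_questions(sample_example)
print(question_list)  # ['What is the Total?', 'What is the Due Date?']
```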


- The following demo is to show how changing the Tesseract page segmentation mode can change OCR results.
- I have used psm 6 as the default configuration in the above demo because it worked well in most cases.
- If you have a folder full of document images of one type, you can experiment with one image to get the suitable scale and psm for images of that type.
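
When comparing page segmentation modes on a sample image, a rough way to rank them is to count how many words pass the confidence threshold under each mode. This is only a heuristic sketch: more confident words does not guarantee more correct words. The helper names are hypothetical, and `fake_outputs` is made-up data shaped like the dictionaries returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`.

```python
def count_confident_words(ocr_dict, min_conf=60):
    """Count non-empty words whose OCR confidence exceeds min_conf."""
    count = 0
    for text, conf in zip(ocr_dict["text"], ocr_dict["conf"]):
        if text.strip() and float(conf) > min_conf:
            count += 1
    return count


def pick_psm(ocr_by_psm, min_conf=60):
    """Return the psm whose OCR output has the most confident words."""
    return max(ocr_by_psm,
               key=lambda psm: count_confident_words(ocr_by_psm[psm], min_conf))


# Made-up OCR outputs keyed by psm value
fake_outputs = {
    3: {"text": ["Invoice", "", "No"], "conf": [91, -1, 40]},
    6: {"text": ["Invoice", "No", "123"], "conf": [95, 88, 72]},
}
best_psm = pick_psm(fake_outputs)
print(best_psm)  # 6
```

In practice, the dictionary would be filled by running OCR once per candidate psm on the same upscaled image.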

In [8]:
from google.colab.patches import cv2_imshow
def get_word_boxes_psm(image, scale, psm=3):
    if scale<=0:
        scale = 2
    image_as_array = up_dpi(image, scale)
    H, W = image_as_array.shape[:2]
    psm = int(psm)  # The Gradio slider may pass a float; Tesseract expects an integer mode
    if psm not in [1,3,4,5,6,7,8,9,10,11,12,13]:
        psm = 3
    d = pytesseract.image_to_data(image_as_array, config=f"--psm {psm}", output_type=Output.DICT)
    n_boxes = len(d['level'])
    word_boxes = []
    for i in range(n_boxes):
        if float(d["conf"][i]) > 60: # Setting minimum confidence to consider (cast because some pytesseract versions may return conf as a string)
            text = d["text"][i]
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            bbox = (int(x/W*1000), int(y/H*1000), int((x+w)/W*1000), int((y+h)/H*1000))
            word_boxes.append((text, bbox))
            cv2.rectangle(image_as_array, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(image_as_array, text, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255), 2)
    # cv2_imshow(image_as_array)
    # cv2.waitKey(0)
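    # OpenCV loads images as BGR, while gr.Image displays arrays as RGB
    return cv2.cvtColor(image_as_array, cv2.COLOR_BGR2RGB)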

In [9]:
with gr.Blocks() as demo2:
    with gr.Row():
        image_x = gr.Image(type="filepath")
        output_x = gr.Image()
    with gr.Row():
        scale_x = gr.Number(value=2, label="Scale factor")
        psm_x = gr.Slider(1,13,3,step=1,label="psm")
    put_text = gr.Button("OCR")
    put_text.click(fn=get_word_boxes_psm, inputs=[image_x, scale_x, psm_x], outputs=output_x)
demo2.launch() # debug=True





Another idea is to use LangChain to bring generative AI into the chatbot. You can first use the "Form" tab to collect all possible questions and answers and include them under "Summary of conversation" in the prompt. Then, under "Current conversation", you can enter a question for the generative model to answer.
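
As a rough illustration of the first step, the question and answer pairs could be flattened into a plain-text block suitable for the summary section of a prompt. `qa_pairs_to_summary` is a hypothetical helper, the line format is an arbitrary choice, and the sample pair is made up; the results are assumed to have the `(answer, score)` shape returned by `get_answers_scores`.

```python
def qa_pairs_to_summary(questions, results):
    """Format questions with their (answer, score) results as summary text."""
    lines = []
    for question, (answer, score) in zip(questions, results):
        lines.append(f"Q: {question} A: {answer} (score={score:.2f})")
    return "\n".join(lines)


# Made-up question and result in the shape used by this notebook
summary_text = qa_pairs_to_summary(["What is the Total?"], [("$154.06", 0.9)])
print(summary_text)  # Q: What is the Total? A: $154.06 (score=0.90)
```

How this text is then supplied to the chain depends on the installed LangChain version, so the corresponding memory API should be checked before wiring it in.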

In [None]:
!pip install langchain

In [11]:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory, CombinedMemory, ConversationSummaryMemory

In [13]:
conv_memory = ConversationBufferWindowMemory(
    memory_key="chat_history_lines",
    input_key="input",
    k=1
)

summary_memory = ConversationSummaryMemory(llm=OpenAI(), input_key="input")
# Combined
memory = CombinedMemory(memories=[conv_memory, summary_memory])
_DEFAULT_TEMPLATE = """The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Summary of conversation:
{history}
Current conversation:
{chat_history_lines}
Human: {input}
AI:"""
PROMPT = PromptTemplate(
    input_variables=["history", "input", "chat_history_lines"], template=_DEFAULT_TEMPLATE
)
llm = OpenAI(temperature=0)
conversation = ConversationChain(
    llm=llm,
    verbose=True,
    memory=memory,
    prompt=PROMPT
)