<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/Inference_with_MarkupLM_for_question_answering_on_web_pages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

We'll first install 🤗 Transformers from source.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git


## Load model and processor

Next, we'll load the model and its corresponding processor (useful for preparing data for the model) from the 🤗 [hub](https://huggingface.co/microsoft/markuplm-base-finetuned-websrc).

This particular model is fine-tuned on [WebSRC](https://x-lance.github.io/WebSRC/), a dataset of question-answer pairs on web pages. It's a bit like the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset, but for web pages.

In [2]:
from transformers import MarkupLMProcessor, MarkupLMForQuestionAnswering

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base-finetuned-websrc")
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base-finetuned-websrc")

## Prepare example

Let's say we have a web page with the following HTML.

In [3]:
html_string = """
<!DOCTYPE html>
<html>
<body>

<h1>Do you know that</h1>
<h2>My name is Niels</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is another header</h6>

</body>
</html>

"""

Let's ask a question related to the web page. We can prepare the HTML + question for the model using the processor:

In [4]:
question = "what is his name?"

encoding = processor(html_string, questions=question, return_tensors="pt")

As shown below, the processor creates all token-level inputs for the model, including the 2 additional inputs (xpath_tags_seq and xpath_subs_seq), which MarkupLM uses to obtain additional embeddings internally:

In [5]:
for k, v in encoding.items():
    print(k, v.shape)

input_ids torch.Size([1, 33])
token_type_ids torch.Size([1, 33])
attention_mask torch.Size([1, 33])
xpath_tags_seq torch.Size([1, 33, 50])
xpath_subs_seq torch.Size([1, 33, 50])
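
To give a feel for what the two XPath inputs might contain, here is a rough sketch of turning one XPath into a fixed-length sequence of tag ids and a matching sequence of subscripts. The depth of 50 matches the last dimension printed above; everything else (the tag vocabulary, the unknown/padding ids, and the function name `xpath_to_sequences`) is a hypothetical stand-in for illustration, not the tokenizer's actual implementation.

```python
# Illustrative sketch only: the vocabulary and padding values are hypothetical,
# not the ones shipped with the MarkupLM tokenizer.
MAX_DEPTH = 50  # matches the last dimension of xpath_tags_seq / xpath_subs_seq

TAG_TO_ID = {"html": 0, "body": 1, "h1": 2, "h2": 3}  # hypothetical vocabulary
UNK_TAG_ID = len(TAG_TO_ID)   # hypothetical id for tags outside the vocabulary
PAD_TAG_ID = UNK_TAG_ID + 1   # hypothetical padding id for tags
PAD_SUB = 1001                # hypothetical padding value for subscripts


def xpath_to_sequences(xpath, max_depth=MAX_DEPTH):
    """Split an XPath such as /html/body/div[2]/h1 into padded tag ids and subscripts."""
    tags, subs = [], []
    for unit in xpath.split("/"):
        if not unit:
            continue  # skip the empty string before the leading slash
        name, _, rest = unit.partition("[")
        tags.append(TAG_TO_ID.get(name, UNK_TAG_ID))
        # a unit without [n] gets subscript 0
        subs.append(int(rest.rstrip("]")) if rest else 0)
    # truncate, then pad both sequences to the same fixed depth
    tags, subs = tags[:max_depth], subs[:max_depth]
    tags += [PAD_TAG_ID] * (max_depth - len(tags))
    subs += [PAD_SUB] * (max_depth - len(subs))
    return tags, subs


tags, subs = xpath_to_sequences("/html/body/h2")
print(tags[:4], subs[:4], len(tags))
```

Under these assumptions, every token that comes from the same HTML node would share the same pair of sequences, which is one way the model can tell an h2 heading apart from an h6 one.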


## Forward pass

Let's perform a forward pass:

In [6]:
import torch

# we use torch.no_grad() as we don't need any gradient computation here
# we're just doing inference! This saves memory
with torch.no_grad():
  outputs = model(**encoding)

## Decode

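Finally, we can decode what the model has predicted. For every input token, the model outputs a start logit and an end logit. Taking the argmax of each gives the positions of the tokens at the start and at the end of the answer within the context (in this case the HTML string).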

In [7]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
processor.decode(predict_answer_tokens, skip_special_tokens=True)

' Niels'
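
Taking the two argmaxes independently can, in general, produce an end index that lies before the start index, or a very long span. A common alternative, sketched here and not part of the original notebook, is to score every (start, end) pair with start <= end and a bounded length, then keep the best one. The function name `best_span` and the default `max_answer_len` are assumptions chosen for illustration; the sketch works on plain Python lists.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = (0, 0, float("-inf"))
    for start, s_logit in enumerate(start_logits):
        # only consider ends at or after the start, within the length limit
        stop = min(start + max_answer_len, len(end_logits))
        for end in range(start, stop):
            score = s_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy logits: independent argmaxes would give start=3 and end=1 (an empty span).
start_logits = [0.1, 0.2, 1.5, 4.0, 0.3]
end_logits = [0.0, 5.0, 0.2, 1.0, 3.5]
print(best_span(start_logits, end_logits))
```

With the tensors from the cell above, the inputs would be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.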

## More examples

In [15]:
html_string = """
<!DOCTYPE html>
<html>
<body>

<h1>Do you know that</h1>
<h2>PM Lee, President Halimah meet China's Vice Premier Han Zheng
Vice Premier Han Zheng's visit is a milestone in the gradual resumption of more in-person exchanges between Singapore and China, says MFA.
PM Lee, President Halimah meet China's Vice Premier Han Zheng 
China's Vice Premier Han Zheng and Prime Minister Lee Hsien Loong met on Wednesday, Nov, 2, 2022. (Photo: Ministry of Communications and Information)
02 Nov 2022 04:18PM
SINGAPORE: Visiting Chinese Vice Premier Han Zheng called on President Halimah Yacob and Prime Minister Lee Hsien Loong on Wednesday (Nov 2), a day after Singapore and China signed 19 agreements at an apex meeting to deepen cooperation. 
During his meetings with Singapore leaders, both sides reaffirmed the excellent ties between the two countries.
Singapore and China "kept up the momentum of high-level interactions amidst the COVID-19 pandemic, including through virtual meetings and bilateral visits", said Singapore's Ministry of Foreign Affairs (MFA) in a press statement on Wednesday. 
It added that Mr Han's two-day visit to Singapore was a milestone in the gradual resumption of more in-person exchanges between both countries.
This is Mr Han's first overseas trip since the COVID-19 pandemic started. He is also the most senior Chinese leader to visit Singapore since the pandemic.
Chinese Vice Premier Han Zheng met President Halimah Yacob on Wednesday, Nov 2, 2022. (Photo: Ministry of Communications and Information)
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is another header</h6>

</body>
</html>

"""

In [18]:
question = "Who is the China's Vice Premier?"

encoding = processor(html_string, questions=question, return_tensors="pt")

with torch.no_grad():
  outputs = model(**encoding)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
processor.decode(predict_answer_tokens, skip_special_tokens=True)

"Who is the China's Vice Premier?Do you know thatPM Lee, President Halimah meet China's Vice Premier Han Zheng\nVice Premier Han Zheng's visit is a milestone in the gradual resumption of more in-person exchanges between Singapore and China, says MFA.\nPM Lee, President Halimah meet China's Vice Premier Han Zheng"

In [20]:
outputs.start_logits.shape

torch.Size([1, 357])

In [24]:
start = outputs.start_logits.argmax()
start

tensor(0)

In [25]:
end = outputs.end_logits.argmax()
end

tensor(72)

In [21]:
encoding.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])

In [23]:
encoding["input_ids"].shape

torch.Size([1, 357])
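
In this run the predicted start index is 0 and the end index is 72, so the decoded span begins at the very first input token and includes the question itself instead of a short answer. One possible mitigation, sketched below as an assumption and not as the notebook's method, is to ignore positions that belong to the question when taking the argmax. The sketch assumes the input is laid out as question, then one or more separator tokens, then context; the helper names and the toy ids (including separator id 2) are made up for illustration.

```python
def context_start_index(input_ids, sep_token_id):
    """Index of the first token after the separator(s) that follow the question."""
    i = input_ids.index(sep_token_id)  # first separator ends the question
    while i < len(input_ids) and input_ids[i] == sep_token_id:
        i += 1  # skip one or more consecutive separators
    return i


def argmax_from(logits, first):
    """Argmax over logits[first:], returned as an index into the full list."""
    return max(range(first, len(logits)), key=lambda j: logits[j])


# Toy example; the ids and the separator id 2 are hypothetical.
input_ids = [0, 11, 12, 2, 2, 21, 22, 23, 2]
start_logits = [9.0, 0.1, 0.2, 0.0, 0.0, 0.5, 3.0, 0.4, 0.0]

first = context_start_index(input_ids, sep_token_id=2)
print(first, argmax_from(start_logits, first))
```

In practice the separator id would presumably come from the tokenizer (for example `processor.tokenizer.sep_token_id`) and the ids from `encoding["input_ids"][0].tolist()`; both are assumptions to check against the library documentation.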