<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/MarkupLM/Inference_with_MarkupLM_for_question_answering_on_web_pages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

We'll first install 🤗 Transformers.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git


## Load model and processor

Next, we'll load the model and its corresponding processor (useful for preparing data for the model) from the 🤗 [hub](https://huggingface.co/microsoft/markuplm-base-finetuned-websrc).

This particular model is fine-tuned on [WebSRC](https://x-lance.github.io/WebSRC/), a dataset of question-answer pairs on web pages. It is similar to the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset, but the context is a web page rather than plain text.

In [7]:
from transformers import MarkupLMProcessor, MarkupLMForQuestionAnswering

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base-finetuned-websrc")
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base-finetuned-websrc")


## Prepare example

Let's say we have a web page with the following HTML.

In [8]:
html_string = """
<!DOCTYPE html>
<html>
<body>

<h1>Do you know that</h1>
<h2>My name is Niels</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is another header</h6>

</body>
</html>

"""

Let's ask a question related to the web page. We can prepare the HTML + question for the model using the processor:

In [9]:
question = "what is his name?"

encoding = processor(html_string, questions=question, return_tensors="pt")
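To give a rough feel for what the processor has to pull out of the HTML before tokenizing, here is a minimal sketch that collects each text node together with the path of tags leading to it. It uses only Python's standard `html.parser`; the class name `TextNodeCollector` and the `VOID_TAGS` set are made up for this illustration, sibling indices are ignored, and this is not the processor's actual implementation.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; hypothetical helper set for this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TextNodeCollector(HTMLParser):
    """Collect (text, xpath-like path) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags currently open, outermost first
        self.nodes = []   # collected (text, path) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((text, "/" + "/".join(self.stack)))

sample_html = """
<!DOCTYPE html>
<html>
<body>
<h1>Do you know that</h1>
<h2>My name is Niels</h2>
</body>
</html>
"""

collector = TextNodeCollector()
collector.feed(sample_html)
for text, path in collector.nodes:
    print(path, "->", text)
```

Each text node ends up paired with a path such as `/html/body/h2`, which is the kind of structural information the extra model inputs carry.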

As the shapes below show, the processor creates all token-level inputs for the model, including two additional inputs (xpath_tags_seq and xpath_subs_seq), which MarkupLM uses internally to obtain additional embeddings:

In [11]:
for k, v in encoding.items():
    print(k, v.shape)

input_ids torch.Size([1, 33])
token_type_ids torch.Size([1, 33])
attention_mask torch.Size([1, 33])
xpath_tags_seq torch.Size([1, 33, 50])
xpath_subs_seq torch.Size([1, 33, 50])
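The last dimension of 50 in the two XPath tensors is a fixed depth: every token carries one tag entry and one subscript entry per level of its XPath, padded up to that depth. The sketch below illustrates the idea on plain strings. The function name `split_xpath`, the padding values, and the choice of subscript 0 for steps without an index are assumptions made for illustration; the real processor maps tags to integer ids from its own vocabulary and uses its own padding ids.

```python
import re

MAX_DEPTH = 50     # matches the last dimension of the shapes printed above
PAD_TAG = "<pad>"  # hypothetical padding marker for this sketch
PAD_SUB = -1       # hypothetical padding value for this sketch

def split_xpath(xpath, max_depth=MAX_DEPTH):
    """Split an XPath like '/html/body/div[2]' into padded tag and subscript lists."""
    tags, subs = [], []
    for step in xpath.strip("/").split("/"):
        match = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", step)
        if match is None:
            raise ValueError(f"unsupported XPath step: {step!r}")
        tags.append(match.group(1))
        # Assumption: a step without an explicit index gets subscript 0.
        subs.append(int(match.group(2)) if match.group(2) else 0)
    # Truncate paths deeper than max_depth, then pad shorter ones.
    tags, subs = tags[:max_depth], subs[:max_depth]
    pad = max_depth - len(tags)
    return tags + [PAD_TAG] * pad, subs + [PAD_SUB] * pad

tags, subs = split_xpath("/html/body/h2")
print(len(tags), tags[:4], subs[:4])
# → 50 ['html', 'body', 'h2', '<pad>'] [0, 0, 0, -1]
```

Stacking one such pair of length-50 lists per token gives arrays of shape (sequence length, 50), which mirrors the (1, 33, 50) shapes above once a batch dimension is added.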


## Forward pass

Let's perform a forward pass:

In [12]:
import torch

# We only run inference here, so no gradients are needed.
# torch.no_grad() disables gradient tracking, which saves memory.
with torch.no_grad():
  outputs = model(**encoding)

## Decode

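Finally, we can decode what the model has predicted. For every token of the input, the model outputs a start logit and an end logit. We take the token with the highest start logit as the start of the answer and the token with the highest end logit as its end, then decode the tokens in between.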

In [13]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
processor.decode(predict_answer_tokens, skip_special_tokens=True)

' Niels'
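Taking the two argmaxes independently works for this example, but nothing guarantees that the predicted end comes after the predicted start. A common post-processing variant, which this notebook does not use, scores every valid (start, end) pair jointly and keeps the best one. The sketch below shows that idea on plain Python lists; the function name `best_span`, the `max_answer_len` limit, and the toy logits are all made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximizing start + end logit with start <= end."""
    best, best_score = None, float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score

# Toy logits: independent argmax would pick start=3 and end=1, an invalid span.
start_logits = [0.1, 0.2, 0.3, 2.0, 0.5]
end_logits = [0.0, 1.5, 0.2, 1.0, 0.4]

span, score = best_span(start_logits, end_logits)
print(span, score)
# → (3, 3) 3.0
```

To try this on the model output above, the logits could be converted to lists with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. Note that this sketch does not exclude question or special tokens from the candidate spans.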