<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/MarkupLM/Fine_tune_MarkupLMForTokenClassification_on_a_custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

First, we install 🤗 Transformers.

We also install 🤗 Evaluate and Seqeval, for computing metrics like F1, recall and precision.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git


In [2]:
!pip install -q evaluate seqeval

## Prepare dataset

Next, let's load a toy dataset which we'll use to fine-tune MarkupLM on.

The goal for the model is to label nodes of HTML strings with the appropriate class.

In [4]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [5]:
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path='/content/drive/MyDrive/MarkupLM/Notebooks/Tutorial notebooks/snippet_of_codes/atoydataset',
    path_in_repo="toy_dataset",
    repo_id="nielsr/markuplm-toy-dataset",
    repo_type="dataset",
)

'https://huggingface.co/datasets/nielsr/markuplm-toy-dataset/tree/main/toy_dataset'

In [17]:
import os
import json
from huggingface_hub import hf_hub_download

file = hf_hub_download(repo_id="nielsr/markuplm-toy-dataset", filename="label.json", repo_type="dataset")

with open(file) as f:
    labels = json.load(f)

for k,v in labels.items():
  print(k,v)

0000 {'model': 'Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE', 'price': '$89.99', 'manufacturer': 'Samsung'}
0001 {'model': 'Sony Cyber-shot 10.1 Megapixel Digital Camera - Silver - DSCS930S', 'price': '$55.99', 'manufacturer': 'Sony'}
0002 {'model': 'Nikon Coolpix 12.0 MegaPixel Compact Digital Camera - Red - L22', 'price': '$74.99', 'manufacturer': 'Nikon'}
0003 {'model': 'Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Red Trim - TL210', 'price': '$94.99', 'manufacturer': 'Samsung'}
0004 {'model': 'Nikon Coolpix S60 10.0 MegaPixel Compact Digital Camera - Red - 26134', 'price': '$99.99', 'manufacturer': 'Nikon'}
0005 {'model': 'Olympus 14 Megapixel Digital Camera - Grey - OLYM FE-4030 GREY REF', 'price': '$78.99', 'manufacturer': 'Olympus'}
0006 {'model': 'Kodak EasyShare 8.2 MegaPixel Digital Camera - Silver - CD14', 'price': '$41.99', 'manufacturer': 'Kodak'}
0007 {'model': 'Olympus 10-MegaPixel Digital Camer

In [18]:
# import os
# import json

# basic_dir = '/content/drive/MyDrive/MarkupLM/Notebooks/Tutorial notebooks/snippet_of_codes/atoydataset'
# with open(os.path.join(basic_dir, 'label.json')) as f:
#     labels = json.load(f)

# for k,v in labels.items():
#   print(k,v)

We'll use MarkupLMFeatureExtractor to extract the nodes and xpaths from the HTML strings. Next, we label the nodes with the appropriate class (in this case, with "model", "price", "manufacturer" or "other").
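To make the labeling rule concrete before applying it to the real pages, here is a minimal sketch of it in isolation. The helper name `label_nodes` and the toy node list are assumptions made for illustration; the rule itself (exact string match against the annotation, otherwise "other") is the one described above.

```python
# Hypothetical helper (not part of the notebook): exact-match labeling of nodes.
id2label = {0: "model", 1: "price", 2: "manufacturer", 3: "other"}
label2id = {label: idx for idx, label in id2label.items()}

def label_nodes(nodes, annotations):
    """Return one label id per node; nodes matching no annotation get 'other'."""
    label_ids = []
    for node in nodes:
        # The first annotation field whose text equals the node text wins.
        field = next((f for f, text in annotations.items() if node == text), "other")
        label_ids.append(label2id[field])
    return label_ids

# Annotation copied from the label file; the node list is a toy example.
annotations = {
    "model": "Sony Cyber-shot 10.1 Megapixel Digital Camera - Silver - DSCS930S",
    "price": "$55.99",
    "manufacturer": "Sony",
}
nodes = ["Price:", "$55.99", "Manufacturer:", "Sony", "$55.99 "]
node_label_ids = label_nodes(nodes, annotations)
print([id2label[i] for i in node_label_ids])
```

Note that the last toy node, which carries a trailing space, ends up as "other": exact matching is sensitive to whitespace, so annotations have to agree character for character with the extracted node text.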

In [19]:
from transformers import MarkupLMFeatureExtractor

feature_extractor = MarkupLMFeatureExtractor()

# we have 4 labels
id2label = {0: "model", 1:"price", 2:"manufacturer", 3:"other"}
label2id = {label:id for id, label in id2label.items()}

data = []
for k, v in labels.items():
    file_prefix = k
    annotations = v
    print(annotations)
    html_file_path = hf_hub_download(repo_id="nielsr/markuplm-toy-dataset", filename=f"htmls/{file_prefix}.html", repo_type="dataset")
    with open(html_file_path) as f:
        html_code = f.read()
    encoding = feature_extractor(html_code)
    node_labels = [[]]
    for node_text in encoding['nodes'][0]:
        print(node_text)
        if node_text == annotations['model']:
            node_labels[0].append(label2id['model'])
        elif node_text == annotations['price']:
            node_labels[0].append(label2id['price'])
        elif node_text == annotations['manufacturer']:
            node_labels[0].append(label2id['manufacturer'])
        else:
            node_labels[0].append(label2id['other'])
    data.append({'nodes': encoding['nodes'],
               'xpaths': encoding['xpaths'],
               'node_labels': node_labels, })

{'model': 'Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE', 'price': '$89.99', 'manufacturer': 'Samsung'}
eCOST.com - Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE
Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE
12.2 Megapixel, 3x Optical Zoom, 3x Digital Zoom, Dual LCD Displays - Back:  2.7" Color LCD and Front: 1.5" Color LCD, Image Stabilization, Image Sensor, SD / SDHC Memory Card Slot - Refurbished / Recertified
List Price:
$179.00
Regular Price:
$93.99
Price:
$89.99
You Save:
$89.01 (49.73%)
QTY:
Availability:
In Stock
eCOST Part #:
58093748
Manufacturer:
Samsung
MFG Part #:
TL205 DARK BLUE
Item Condition:
Recertified/Refurbished
{'model': 'Sony Cyber-shot 10.1 Megapixel Digital Camera - Silver - DSCS930S', 'price': '$55.99', 'manufacturer': 'Sony'}
eCOST.com - Sony Cyber-shot 10.1 Megapixel Digital Camera - Silver - DSCS930S
Sony C

In [20]:
for k,v in data[0].items():
  print(k,v)

nodes [['eCOST.com - Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE', 'Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE', '12.2 Megapixel, 3x Optical Zoom, 3x Digital Zoom, Dual LCD Displays - Back:  2.7" Color LCD and Front: 1.5" Color LCD, Image Stabilization, Image Sensor, SD / SDHC Memory Card Slot - Refurbished / Recertified', 'List Price:', '$179.00', 'Regular Price:', '$93.99', 'Price:', '$89.99', 'You Save:', '$89.01 (49.73%)', 'QTY:', 'Availability:', 'In Stock', 'eCOST Part #:', '58093748', 'Manufacturer:', 'Samsung', 'MFG Part #:', 'TL205 DARK BLUE', 'Item Condition:', 'Recertified/Refurbished']]
xpaths [['/html/head/title', '/html/body/div[1]/div/div/div/div/div/div/div/h1', '/html/body/h2', '/html/body/div[2]/div[5]/div[1]/table/tr[1]/td/table/tr/td[1]', '/html/body/div[2]/div[5]/div[1]/table/tr[1]/td/table/tr/td[2]/table/tr/td/span/span', '/html/body/div[2]/div[5]/div[1]/tabl

In [21]:
for node, label in zip(data[0]['nodes'][0], data[0]['node_labels'][0]):
  print(node, id2label[label])

eCOST.com - Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE other
Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE model
12.2 Megapixel, 3x Optical Zoom, 3x Digital Zoom, Dual LCD Displays - Back:  2.7" Color LCD and Front: 1.5" Color LCD, Image Stabilization, Image Sensor, SD / SDHC Memory Card Slot - Refurbished / Recertified other
List Price: other
$179.00 other
Regular Price: other
$93.99 other
Price: other
$89.99 price
You Save: other
$89.01 (49.73%) other
QTY: other
Availability: other
In Stock other
eCOST Part #: other
58093748 other
Manufacturer: other
Samsung manufacturer
MFG Part #: other
TL205 DARK BLUE other
Item Condition: other
Recertified/Refurbished other


## Create PyTorch Dataset

Next, we'll create a regular PyTorch dataset. Each item of the dataset is an HTML string, encoded using MarkupLMProcessor. Note that we initialize the processor with parse_html = False, as we have already parsed the HTML ourselves and we're providing the nodes, xpaths and node labels.

Note that by default, the processor will only label the first token of a given node and label the remaining tokens with -100. You can change this by setting the `only_label_first_subword` attribute of the processor's tokenizer to `False`.
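The effect of this setting can be sketched without the real tokenizer. In the snippet below, `toy_tokenize` and `align_labels` are hypothetical stand-ins (the toy tokenizer just splits on whitespace); they are only meant to illustrate how node labels are spread over tokens under the two settings, not how the library implements it.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def toy_tokenize(node):
    """Hypothetical stand-in for a subword tokenizer: splits on whitespace."""
    return node.split()

def align_labels(nodes, node_labels, only_label_first_subword=True):
    """Expand one label per node into one label per token."""
    tokens, labels = [], []
    for node, label in zip(nodes, node_labels):
        pieces = toy_tokenize(node)
        if not pieces:
            continue  # an empty node contributes no tokens and no labels
        tokens.extend(pieces)
        if only_label_first_subword:
            # Only the first token carries the label, the rest are ignored.
            labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
        else:
            # Every token of the node carries the node's label.
            labels.extend([label] * len(pieces))
    return tokens, labels

toy_nodes, toy_node_labels = ["List Price:", "$179.00"], [3, 3]
tokens, labels_first = align_labels(toy_nodes, toy_node_labels)
_, labels_all = align_labels(toy_nodes, toy_node_labels, only_label_first_subword=False)
print(tokens, labels_first, labels_all)
```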

In [None]:
from torch.utils.data import Dataset

class MarkupLMDataset(Dataset):
    """Dataset for token classification with MarkupLM."""

    def __init__(self, data, processor=None, max_length=512):
        self.data = data
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # first, get nodes, xpaths and node labels
        item = self.data[idx]
        nodes, xpaths, node_labels = item['nodes'], item['xpaths'], item['node_labels']

        # provide to processor
        encoding = self.processor(nodes=nodes, xpaths=xpaths, node_labels=node_labels, padding="max_length",
                                  max_length=self.max_length, return_tensors="pt")

        # remove only the leading batch dimension (a bare squeeze() would also drop any other size-1 dimension)
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}

        return encoding

In [None]:
from transformers import MarkupLMProcessor

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False

dataset = MarkupLMDataset(data=data, processor=processor, max_length=512)

Let's check an example:

In [None]:
example = dataset[0]
for k,v in example.items():
  print(k,v.shape)

input_ids torch.Size([512])
token_type_ids torch.Size([512])
attention_mask torch.Size([512])
xpath_tags_seq torch.Size([512, 50])
xpath_subs_seq torch.Size([512, 50])
labels torch.Size([512])
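The trailing dimension of 50 in `xpath_tags_seq` and `xpath_subs_seq` suggests that every token carries its node's xpath as two fixed-length sequences: one for the tag names and one for the subscripts. Here is a rough sketch of that idea. The function name `split_xpath` and the padding placeholders are assumptions; the real tokenizer works with integer ids rather than tag strings.

```python
import re

MAX_DEPTH = 50      # assumed to correspond to the trailing dimension shown above
PAD_TAG = "<pad>"   # hypothetical padding placeholder for tags
PAD_SUB = -1        # hypothetical padding placeholder for subscripts

def split_xpath(xpath, max_depth=MAX_DEPTH):
    """Split an xpath into parallel, fixed-length lists of tags and subscripts."""
    tags, subs = [], []
    for unit in xpath.split("/"):
        if not unit:
            continue  # skip the empty piece before the leading slash
        match = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", unit)
        if match is None:
            raise ValueError(f"Unexpected xpath unit: {unit!r}")
        tags.append(match.group(1))
        # A unit without an explicit subscript (e.g. "body") gets subscript 0.
        subs.append(int(match.group(2)) if match.group(2) else 0)
    # Truncate to max_depth, then pad both lists up to max_depth.
    tags, subs = tags[:max_depth], subs[:max_depth]
    padding = max_depth - len(tags)
    return tags + [PAD_TAG] * padding, subs + [PAD_SUB] * padding

xpath = "/html/body/div[2]/div[5]/div[1]/table/tr[1]/td/table/tr/td[1]"
xpath_tags, xpath_subs = split_xpath(xpath)
print(xpath_tags[:11])
print(xpath_subs[:11])
```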


Let's decode the input_ids back to text:

In [None]:
processor.decode(example['input_ids'])

'<s>eCOST.com - Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUESamsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE12.2 Megapixel, 3x Optical Zoom, 3x Digital Zoom, Dual LCD Displays - Back:  2.7" Color LCD and Front: 1.5" Color LCD, Image Stabilization, Image Sensor, SD / SDHC Memory Card Slot - Refurbished / RecertifiedList Price:$179.00Regular Price:$93.99Price:$89.99You Save:$89.01 (49.73%)QTY:Availability:In StockeCOST Part #:58093748Manufacturer:SamsungMFG Part #:TL205 DARK BLUEItem Condition:Recertified/Refurbished</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><

Let's verify the correspondence between input_ids and labels. -100 means that those tokens will be ignored by the loss function, hence these won't contribute to the final loss. 
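To see why -100 labels don't influence training, here is a small pure-Python sketch of a cross-entropy loss that skips ignored positions. PyTorch's `CrossEntropyLoss` uses `ignore_index=-100` by default; the function below, with the made-up name `masked_cross_entropy`, only mimics that behaviour on toy numbers.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # log-sum-exp, shifted by the max logit for numerical stability
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    # Returning 0.0 when everything is ignored is a choice made for this sketch.
    return sum(losses) / len(losses) if losses else 0.0

toy_labels = [0, IGNORE_INDEX, 1]
loss_a = masked_cross_entropy([[2.0, 0.0], [0.0, 5.0], [1.0, 1.0]], toy_labels)
# Changing the logits at the ignored position leaves the loss untouched.
loss_b = masked_cross_entropy([[2.0, 0.0], [9.0, -3.0], [1.0, 1.0]], toy_labels)
print(loss_a, loss_b)
```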

In [None]:
for id, label in zip(example['input_ids'].tolist(), example['labels'].tolist()):
  if label != -100:
    print(processor.decode([id]), id2label[label])
  else:
    print(processor.decode([id]), label)

<s> -100
e other
CO -100
ST -100
. -100
com -100
 - -100
 Samsung -100
 12 -100
. -100
2 -100
 Mega -100
Pixel -100
 Compact -100
 Digital -100
 Camera -100
 w -100
/ -100
 Dual -100
 LCD -100
 Dis -100
plays -100
 - -100
 Dark -100
 Blue -100
 - -100
 TL -100
205 -100
 DARK -100
 BL -100
UE -100
Samsung model
 12 -100
. -100
2 -100
 Mega -100
Pixel -100
 Compact -100
 Digital -100
 Camera -100
 w -100
/ -100
 Dual -100
 LCD -100
 Dis -100
plays -100
 - -100
 Dark -100
 Blue -100
 - -100
 TL -100
205 -100
 DARK -100
 BL -100
UE -100
12 other
. -100
2 -100
 Meg -100
apixel -100
, -100
 3 -100
x -100
 Optical -100
 Zoom -100
, -100
 3 -100
x -100
 Digital -100
 Zoom -100
, -100
 Dual -100
 LCD -100
 Dis -100
plays -100
 - -100
 Back -100
: -100
  -100
 2 -100
. -100
7 -100
" -100
 Color -100
 LCD -100
 and -100
 Front -100
: -100
 1 -100
. -100
5 -100
" -100
 Color -100
 LCD -100
, -100
 Image -100
 St -100
abil -100
ization -100
, -100
 Image -100
 Sensor -100
, -100
 SD -100
 / -100
 S

## Create PyTorch Dataloaders

The next step is to create a PyTorch DataLoader, which allows us to get batches from the dataset.

In [None]:
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
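As a rough mental model of what the DataLoader does with `batch_size=2` and `shuffle=True`, the sketch below shuffles example indices and cuts them into consecutive chunks. The helper `iter_batches` is hypothetical, and the figure of 10 examples is an assumption that is consistent with the five loss values printed per epoch during training. The real DataLoader additionally stacks the tensors of each chunk into one batch.

```python
import random

def iter_batches(num_examples, batch_size, shuffle=True, seed=0):
    """Yield lists of example indices, one list per batch."""
    indices = list(range(num_examples))
    if shuffle:
        # A seeded generator keeps this sketch reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iter_batches(num_examples=10, batch_size=2))
print(len(batches), batches)
```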

## Define model

We define the model here, which is a MarkupLM-base Transformer, with a token classifier head on top. The token classifier will have randomly initialized weights, while the base Transformer has pre-trained weights.



In [None]:
from transformers import MarkupLMForTokenClassification

model = MarkupLMForTokenClassification.from_pretrained("microsoft/markuplm-base", id2label=id2label, label2id=label2id)

Some weights of the model checkpoint at microsoft/markuplm-base were not used when initializing MarkupLMForTokenClassification: ['nrp_cls.decoder.weight', 'cls.predictions.transform.dense.bias', 'markuplm.pooler.dense.bias', 'markuplm.pooler.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'nrp_cls.LayerNorm.bias', 'ptc_cls.weight', 'cls.predictions.transform.dense.weight', 'ptc_cls.bias', 'nrp_cls.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'nrp_cls.dense.weight', 'nrp_cls.LayerNorm.weight', 'nrp_cls.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing MarkupLMForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MarkupLMForTokenClassification from the checkpoint of a model that 

We also create a label_list, where each tag starts with a B (as seqeval expects the labels to be in IOB format).

In [None]:
label_list = ["B-" + x for x in list(id2label.values())]
label_list

['B-model', 'B-price', 'B-manufacturer', 'B-other']
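As far as we understand seqeval's default chunking, a `B-` tag always opens a new entity. With only `B-` tags, every labeled token is therefore its own single-token entity, and the entity-level scores coincide with token-level ones. The sketch below computes those token-level scores by hand on toy tags; `per_class_scores` is a hypothetical helper, not seqeval's API.

```python
from collections import Counter

def per_class_scores(pred_tags, true_tags):
    """Token-level precision, recall and F1 per tag for flat lists of 'B-x' tags."""
    predicted, actual, correct = Counter(), Counter(), Counter()
    for pred, true in zip(pred_tags, true_tags):
        predicted[pred] += 1
        actual[true] += 1
        if pred == true:
            correct[pred] += 1
    scores = {}
    for tag in sorted(set(predicted) | set(actual)):
        precision = correct[tag] / predicted[tag] if predicted[tag] else 0.0
        recall = correct[tag] / actual[tag] if actual[tag] else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1, "number": actual[tag]}
    return scores

# Toy tags: one 'other' token is wrongly predicted as 'price'.
true_tags = ["B-other", "B-price", "B-other", "B-model"]
pred_tags = ["B-other", "B-price", "B-price", "B-model"]
scores = per_class_scores(pred_tags, true_tags)
for tag, values in scores.items():
    print(tag, values)
```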

We also define metric calculations (as we'd like to know the F1 score etc. during training). We'll use 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) for that, which is a library containing many tools for evaluating ML models.

In [None]:
import evaluate

# Metric
metric = evaluate.load("seqeval")

def get_labels(predictions, references):
    # Transform prediction and reference tensors to numpy arrays
    # (.cpu() is a no-op for tensors that already live on the CPU)
    y_pred = predictions.detach().cpu().numpy()
    y_true = references.detach().cpu().numpy()

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(pred, gold_label) if l != -100]
        for pred, gold_label in zip(y_pred, y_true)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(pred, gold_label) if l != -100]
        for pred, gold_label in zip(y_pred, y_true)
    ]
    return true_predictions, true_labels

def compute_metrics(return_entity_level_metrics=True):
    results = metric.compute()
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
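The two helper steps used for the metrics, dropping the positions labeled -100 and flattening the nested results dictionary, can be tried out on plain Python values. The function names `filter_ignored` and `flatten_results` and all numbers below are toy assumptions for illustration.

```python
IGNORE_INDEX = -100
label_list = ["B-model", "B-price", "B-manufacturer", "B-other"]

def filter_ignored(pred_ids, gold_ids):
    """Keep only positions with a real gold label and map the ids to tags."""
    kept = [(p, g) for p, g in zip(pred_ids, gold_ids) if g != IGNORE_INDEX]
    return [label_list[p] for p, _ in kept], [label_list[g] for _, g in kept]

def flatten_results(results):
    """Turn {'price': {'precision': ...}} into {'price_precision': ...}."""
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            for name, score in value.items():
                flat[f"{key}_{name}"] = score
        else:
            flat[key] = value
    return flat

# Positions 0, 2 and 4 carry -100, so their predictions are discarded.
preds, golds = filter_ignored([3, 0, 0, 1, 3], [-100, 0, -100, 1, -100])
flat = flatten_results({"price": {"precision": 0.8, "recall": 0.8}, "overall_f1": 0.98})
print(preds, golds)
print(flat)
```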

## Train

Alright, let's train! Here we're training the model in native PyTorch, but of course you could also opt for things like 🤗 Accelerate, the 🤗 Trainer or PyTorch Lightning.

In [None]:
import torch
from torch.optim import AdamW
from tqdm.auto import tqdm

optimizer = AdamW(model.parameters(), lr=5e-5)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

model.train()
for epoch in range(5):  # loop over the dataset multiple times
    for batch in tqdm(dataloader):
        # get the inputs;
        inputs = {k:v.to(device) for k,v in batch.items()}

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(**inputs)

        loss = outputs.loss
        loss.backward()
        optimizer.step()

        print("Loss:", loss.item())

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]
        preds, refs = get_labels(predictions, labels)
        metric.add_batch(
            predictions=preds,
            references=refs,
        )

    eval_metric = compute_metrics()
    print(f"Epoch {epoch}:", eval_metric)

  0%|          | 0/5 [00:00<?, ?it/s]

Loss: 0.0028853437397629023
Loss: 0.16177695989608765
Loss: 0.004325489979237318
Loss: 0.0730261281132698
Loss: 0.010141934268176556
Epoch 0: {'manufacturer_precision': 1.0, 'manufacturer_recall': 1.0, 'manufacturer_f1': 1.0, 'manufacturer_number': 9, 'model_precision': 1.0, 'model_recall': 1.0, 'model_f1': 1.0, 'model_number': 10, 'other_precision': 0.9893048128342246, 'other_recall': 0.9893048128342246, 'other_f1': 0.9893048128342246, 'other_number': 187, 'price_precision': 0.8, 'price_recall': 0.8, 'price_f1': 0.8000000000000002, 'price_number': 10, 'overall_precision': 0.9814814814814815, 'overall_recall': 0.9814814814814815, 'overall_f1': 0.9814814814814815, 'overall_accuracy': 0.9814814814814815}


  0%|          | 0/5 [00:00<?, ?it/s]

Loss: 0.0028456775471568108
Loss: 0.06846997886896133
Loss: 0.007746108807623386
Loss: 0.009106344543397427
Loss: 0.007496472913771868
Epoch 1: {'manufacturer_precision': 1.0, 'manufacturer_recall': 1.0, 'manufacturer_f1': 1.0, 'manufacturer_number': 9, 'model_precision': 1.0, 'model_recall': 1.0, 'model_f1': 1.0, 'model_number': 10, 'other_precision': 1.0, 'other_recall': 0.9946524064171123, 'other_f1': 0.9973190348525469, 'other_number': 187, 'price_precision': 0.9090909090909091, 'price_recall': 1.0, 'price_f1': 0.9523809523809523, 'price_number': 10, 'overall_precision': 0.9953703703703703, 'overall_recall': 0.9953703703703703, 'overall_f1': 0.9953703703703703, 'overall_accuracy': 0.9953703703703703}


  0%|          | 0/5 [00:00<?, ?it/s]

Loss: 0.004064284730702639
Loss: 0.004021643660962582
Loss: 0.03697982430458069
Loss: 0.004647796507924795
Loss: 0.010091465897858143
Epoch 2: {'manufacturer_precision': 1.0, 'manufacturer_recall': 1.0, 'manufacturer_f1': 1.0, 'manufacturer_number': 9, 'model_precision': 1.0, 'model_recall': 1.0, 'model_f1': 1.0, 'model_number': 10, 'other_precision': 1.0, 'other_recall': 0.9946524064171123, 'other_f1': 0.9973190348525469, 'other_number': 187, 'price_precision': 0.9090909090909091, 'price_recall': 1.0, 'price_f1': 0.9523809523809523, 'price_number': 10, 'overall_precision': 0.9953703703703703, 'overall_recall': 0.9953703703703703, 'overall_f1': 0.9953703703703703, 'overall_accuracy': 0.9953703703703703}


  0%|          | 0/5 [00:00<?, ?it/s]

Loss: 0.023364050313830376
Loss: 0.003646040568128228
Loss: 0.009868927299976349
Loss: 0.007390023209154606
Loss: 0.010390682145953178
Epoch 3: {'manufacturer_precision': 1.0, 'manufacturer_recall': 1.0, 'manufacturer_f1': 1.0, 'manufacturer_number': 9, 'model_precision': 1.0, 'model_recall': 1.0, 'model_f1': 1.0, 'model_number': 10, 'other_precision': 1.0, 'other_recall': 1.0, 'other_f1': 1.0, 'other_number': 187, 'price_precision': 1.0, 'price_recall': 1.0, 'price_f1': 1.0, 'price_number': 10, 'overall_precision': 1.0, 'overall_recall': 1.0, 'overall_f1': 1.0, 'overall_accuracy': 1.0}


  0%|          | 0/5 [00:00<?, ?it/s]

Loss: 0.008640694431960583
Loss: 0.0014533746289089322
Loss: 0.0015390758635476232
Loss: 0.0013109208084642887
Loss: 0.0010877938475459814
Epoch 4: {'manufacturer_precision': 1.0, 'manufacturer_recall': 1.0, 'manufacturer_f1': 1.0, 'manufacturer_number': 9, 'model_precision': 1.0, 'model_recall': 1.0, 'model_f1': 1.0, 'model_number': 10, 'other_precision': 1.0, 'other_recall': 1.0, 'other_f1': 1.0, 'other_number': 187, 'price_precision': 1.0, 'price_recall': 1.0, 'price_f1': 1.0, 'price_number': 10, 'overall_precision': 1.0, 'overall_recall': 1.0, 'overall_f1': 1.0, 'overall_accuracy': 1.0}


## Inference

Let's try out the model on a web page for which we have the nodes and xpaths. Ideally this would be a new page, but here we'll simply reuse one of our training examples.

In [None]:
nodes = data[0]['nodes']
xpaths = data[0]['xpaths']
node_labels = data[0]['node_labels']
print("Nodes:", nodes)
print("Xpaths:", xpaths)

Nodes: [['eCOST.com - Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE', 'Samsung 12.2 MegaPixel Compact Digital Camera w/ Dual LCD Displays - Dark Blue - TL205 DARK BLUE', '12.2 Megapixel, 3x Optical Zoom, 3x Digital Zoom, Dual LCD Displays - Back:  2.7" Color LCD and Front: 1.5" Color LCD, Image Stabilization, Image Sensor, SD / SDHC Memory Card Slot - Refurbished / Recertified', 'List Price:', '$179.00', 'Regular Price:', '$93.99', 'Price:', '$89.99', 'You Save:', '$89.01 (49.73%)', 'QTY:', 'Availability:', 'In Stock', 'eCOST Part #:', '58093748', 'Manufacturer:', 'Samsung', 'MFG Part #:', 'TL205 DARK BLUE', 'Item Condition:', 'Recertified/Refurbished']]
Xpaths: [['/html/head/title', '/html/body/div[1]/div/div/div/div/div/div/div/h1', '/html/body/h2', '/html/body/div[2]/div[5]/div[1]/table/tr[1]/td/table/tr/td[1]', '/html/body/div[2]/div[5]/div[1]/table/tr[1]/td/table/tr/td[2]/table/tr/td/span/span', '/html/body/div[2]/div[5]/div[1]/ta

We'll prepare the example for the model using the processor. Note that we're passing `return_offsets_mapping=True`, as the offsets allow us to determine which tokens are at the start of a given node and which aren't.

In [None]:
# prepare for model
# note that you don't need to provide node_labels at inference time; we pass them here only so we can compare against the ground truth
encoding = processor(nodes=nodes, xpaths=xpaths, node_labels=node_labels, return_offsets_mapping=True, return_tensors="pt").to(device)
for k,v in encoding.items():
  print(k,v.shape)

input_ids torch.Size([1, 192])
token_type_ids torch.Size([1, 192])
attention_mask torch.Size([1, 192])
offset_mapping torch.Size([1, 192, 2])
xpath_tags_seq torch.Size([1, 192, 50])
xpath_subs_seq torch.Size([1, 192, 50])
labels torch.Size([1, 192])


Let's perform a forward pass:

In [None]:
# we don't need the offset mapping and labels for the forward pass
offset_mapping = encoding.pop("offset_mapping")
labels = encoding.pop("labels")

# forward pass
with torch.no_grad():
  outputs = model(**encoding)

The model outputs logits of shape (batch_size, seq_len, num_labels). We just take the highest logit (score) per token as prediction:

In [None]:
predictions = outputs.logits.argmax(-1)
print(predictions)

tensor([[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 0, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]],
       device='cuda:0')
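The argmax step can be illustrated on a tiny nested list standing in for the logits tensor. The values are made up, and `argmax_per_token` is a hypothetical pure-Python stand-in for `logits.argmax(-1)`.

```python
id2label = {0: "model", 1: "price", 2: "manufacturer", 3: "other"}

def argmax_per_token(logits):
    """For logits shaped (batch, seq_len, num_labels), pick the best label id per token."""
    return [
        [max(range(len(token_logits)), key=token_logits.__getitem__) for token_logits in sequence]
        for sequence in logits
    ]

# One sequence of three tokens, four scores (one per label) for each token.
toy_logits = [[[0.1, 0.2, 0.0, 3.0], [2.5, 0.1, 0.3, 0.2], [0.0, 1.7, 0.2, 0.1]]]
pred_ids = argmax_per_token(toy_logits)
print(pred_ids)
print([id2label[i] for i in pred_ids[0]])
```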


The model makes predictions at the token level, but we're only interested in the predicted label for the first token of each node.

We can get this by using the `word_ids` (to know whether a token is a special token) and the `offset_mapping` (to know whether a token is the first one of a particular node).

In [None]:
results = {"Node": [], "Predicted": [], "Ground truth": []}

for pred_id, word_id, offset, label_id in zip(predictions[0].tolist(), encoding.word_ids(0), offset_mapping[0].tolist(), labels[0].tolist()):
  if word_id is not None and offset[0] == 0:
    # print(f"Node: {nodes[0][word_id]}")
    # print(f"Predicted: {id2label[pred_id]}")
    # print(f"Ground truth: {id2label[label_id]}")
    # print("----------")
    results["Node"].append(nodes[0][word_id])
    results["Predicted"].append(id2label[pred_id])
    results["Ground truth"].append(id2label[label_id])
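The filtering condition used above can be checked on toy values. In this sketch the word ids, offsets and predictions are invented; the point is that both tests are needed, since special tokens typically also have a start offset of 0 but carry a word id of `None`.

```python
def first_token_predictions(pred_ids, word_ids, offsets):
    """Return (node_index, predicted_id) for the first token of every node."""
    picked = []
    for pred_id, word_id, (start, _end) in zip(pred_ids, word_ids, offsets):
        # Skip special tokens (word_id is None) and non-initial tokens (start > 0).
        if word_id is not None and start == 0:
            picked.append((word_id, pred_id))
    return picked

# Toy sequence: <s>, two tokens of node 0, one token of node 1, two tokens of node 2, </s>
toy_word_ids = [None, 0, 0, 1, 2, 2, None]
toy_offsets = [(0, 0), (0, 4), (4, 10), (0, 7), (0, 3), (3, 9), (0, 0)]
toy_pred_ids = [3, 3, 0, 1, 2, 3, 3]
picked = first_token_predictions(toy_pred_ids, toy_word_ids, toy_offsets)
print(picked)
```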

Let's pretty print the results as a Pandas dataframe:

In [None]:
import pandas as pd

pd.DataFrame.from_dict(results).head()

| | Node | Predicted | Ground truth |
|---|---|---|---|
| 0 | eCOST.com - Samsung 12.2 MegaPixel Compact Dig... | other | other |
| 1 | Samsung 12.2 MegaPixel Compact Digital Camera ... | model | model |
| 2 | 12.2 Megapixel, 3x Optical Zoom, 3x Digital Zo... | other | other |
| 3 | List Price: | other | other |
| 4 | $179.00 | other | other |
