# **LayoutLM V1**

### BERT-based, encoder-only model

#### Main components:-

1. Textual embeddings
2. Text position embeddings
3. 2D Position embeddings
4. Image embeddings

Supporting tools:

5. Pre-built OCR parser (for 2D positional encoding and image patches)
6. Faster R-CNN (for image features)
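
The OCR parser returns word boxes in pixel coordinates, while the model's 2D position embeddings are indexed on a 0-1000 integer grid. A minimal sketch of that rescaling follows; the helper name `normalize_box` and the page size are made up for illustration and are not part of the LayoutLM code.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR result: one word box on a 1700 x 2200 pixel page.
ocr_box = (850, 1100, 1020, 1144)
print(normalize_box(ocr_box, page_width=1700, page_height=2200))
# [500, 500, 600, 520]
```

Normalizing by page size makes the same embedding tables usable for scans of any resolution.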


## Pre-training objectives
### 2 main objective functions:-
1. Masked Visual-Language Modeling (MVLM)
2. Multi-label Document Classification (MDC)
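
The key idea of the masked objective is that a token's text is hidden while its bounding box stays visible, so the model has to use layout as a clue. The sketch below is a simplified illustration, not the original pre-training code: the function name, the 15% rate as a default argument, and the always-replace-with-`[MASK]` rule are assumptions (BERT-style schemes also sometimes keep the token or swap in a random one).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_for_mvlm(tokens, boxes, mask_prob=0.15, seed=0):
    """Randomly mask tokens while leaving every bounding box untouched."""
    rng = random.Random(seed)
    masked_tokens, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked_tokens.append(MASK_TOKEN)
            targets.append(token)  # the model must recover this token
        else:
            masked_tokens.append(token)
            targets.append(None)  # position is not scored
    return masked_tokens, list(boxes), targets


tokens = ["invoice", "number", "12345"]
boxes = [[100, 50, 180, 60], [185, 50, 250, 60], [255, 50, 310, 60]]
masked_tokens, kept_boxes, targets = mask_for_mvlm(tokens, boxes, mask_prob=0.5)
print(masked_tokens, kept_boxes, targets)
```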

#### LayoutLM Embeddings :-


    class LayoutLMEmbeddings(nn.Module):
        """Construct the embeddings from word, position and token_type embeddings."""

        def __init__(self, config):
            super(LayoutLMEmbeddings, self).__init__()
            self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
            self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
            self.x_position_embeddings = nn.Embedding(config.max_2d_position_embeddings, config.hidden_size)
            self.y_position_embeddings = nn.Embedding(config.max_2d_position_embeddings, config.hidden_size)
            self.h_position_embeddings = nn.Embedding(config.max_2d_position_embeddings, config.hidden_size)
            self.w_position_embeddings = nn.Embedding(config.max_2d_position_embeddings, config.hidden_size)
            self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

            self.LayerNorm = LayoutLMLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)


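All of these embeddings are summed element-wise to form the input to the network. The position embeddings are absolute: each one is a learned lookup table indexed directly by a position or coordinate value.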

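    # Excerpt from LayoutLMEmbeddings.forward.
    # bbox has shape (batch, seq_len, 4), each box given as (x0, y0, x1, y1).
    left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
    upper_position_embeddings = self.y_position_embeddings(bbox[:, :, 1])
    right_position_embeddings = self.x_position_embeddings(bbox[:, :, 2])
    lower_position_embeddings = self.y_position_embeddings(bbox[:, :, 3])

    # Height and width are derived from the box corners.
    h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
    w_position_embeddings = self.w_position_embeddings(bbox[:, :, 2] - bbox[:, :, 0])
    token_type_embeddings = self.token_type_embeddings(token_type_ids)

    embeddings = (
        words_embeddings
        + position_embeddings
        + left_position_embeddings
        + upper_position_embeddings
        + right_position_embeddings
        + lower_position_embeddings
        + h_position_embeddings
        + w_position_embeddings
        + token_type_embeddings
    )
    embeddings = self.LayerNorm(embeddings)
    embeddings = self.dropout(embeddings)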


Once all the embeddings are summed, layer normalization followed by dropout is applied to the result.
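
To make the lookups concrete, the sketch below shows which integer is fed to each 2D position table for a single box. It is a plain-Python illustration; `layout_lookup_indices` is a hypothetical helper, and the mapping of left/right to the x table and upper/lower to the y table is assumed from the embedding names.

```python
def layout_lookup_indices(box):
    """Return the index used for each 2D position embedding lookup."""
    x0, y0, x1, y1 = box
    return {
        "left": x0,         # looked up in x_position_embeddings
        "upper": y0,        # looked up in y_position_embeddings
        "right": x1,        # looked up in x_position_embeddings
        "lower": y1,        # looked up in y_position_embeddings
        "height": y1 - y0,  # looked up in h_position_embeddings
        "width": x1 - x0,   # looked up in w_position_embeddings
    }


print(layout_lookup_indices([637, 773, 693, 782]))
# {'left': 637, 'upper': 773, 'right': 693, 'lower': 782, 'height': 9, 'width': 56}
```

Each of the six indices selects one row of size `hidden_size`, and those rows are summed with the word, 1D position and token type rows.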

Here, `token_type_ids` are segment token indices that mark the first and second portions of the input.

Indices are selected in `[0, 1]`: `0` corresponds to a *sentence A* token, `1` corresponds to a *sentence B* token.
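
As an illustration, the sketch below lays out a pair of sequences in the BERT-style `[CLS] A [SEP] B [SEP]` order and assigns the segment index for each position. The helper name and the example tokens are made up; a real tokenizer produces these ids for you.

```python
def build_token_type_ids(tokens_a, tokens_b):
    """Lay out [CLS] A [SEP] B [SEP] and mark each position's segment."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and its [SEP] get 0; sentence B and the final [SEP] get 1.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, type_ids = build_token_type_ids(["what", "name", "?"], ["john", "smith"])
for token, type_id in zip(tokens, type_ids):
    print(f"{token:>6} -> {type_id}")
```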


**The encoder stacks several of these LayoutLM blocks, as in BERT, and the visual, positional and textual information is encoded in these layers.**

## **Main flow :-**

    embedding_output = self.embeddings(
        input_ids=input_ids,
        bbox=bbox,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        inputs_embeds=inputs_embeds,
    )
    encoder_outputs = self.encoder(
        embedding_output,
        extended_attention_mask,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = encoder_outputs[0]

    pooled_output = self.pooler(sequence_output)

Here, `sequence_output` is the last hidden state after passing through the stack of encoder layers.
The encoder returns a tuple; index 0 holds the last hidden state, and the remaining entries are optional intermediate outputs.

Ex:-

    (
        hidden_states,
        next_decoder_cache,
        all_hidden_states,
        all_self_attentions,
        all_cross_attentions,
    )


### Pooling layer:-

    class LayoutLMPooler(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.activation = nn.Tanh()

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # We "pool" the model by simply taking the hidden state corresponding
            # to the first token.
            first_token_tensor = hidden_states[:, 0]
            pooled_output = self.dense(first_token_tensor)
            pooled_output = self.activation(pooled_output)
            return pooled_output


This returns an embedding for the `[CLS]` token (the first token), after passing its hidden state through a dense layer and a non-linear tanh activation.
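
The same computation can be written out by hand for a single sequence, which may help to see that only the first token's vector is used. This is a dependency-free sketch with made-up numbers, not the library code; the batch dimension is omitted and `pool_first_token` is a hypothetical name.

```python
import math


def pool_first_token(hidden_states, weight, bias):
    """Compute tanh(W @ h_first + b) for one sequence.

    hidden_states: list of per-token vectors (seq_len x hidden_size)
    weight: hidden_size x hidden_size matrix as nested lists (one row per output)
    bias: list of hidden_size floats
    """
    first_token = hidden_states[0]
    dense = [
        sum(w * h for w, h in zip(row, first_token)) + b
        for row, b in zip(weight, bias)
    ]
    return [math.tanh(value) for value in dense]


# Toy input: two tokens with hidden_size = 2; identity weights and zero bias
# reduce the pooler to tanh of the first token's vector.
hidden_states = [[0.5, -1.0], [9.0, 9.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(pool_first_token(hidden_states, identity, [0.0, 0.0]))
```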

##################################################################################################################################

### For Document Q&A

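Once we get the LayoutLM output (hidden states), a linear layer (`qa_outputs`) produces two logits per token: one scoring the token as the start of the answer span and one scoring it as the end. This is extractive question answering, so no text is generated.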


    logits = self.qa_outputs(sequence_output)
    start_logits, end_logits = logits.split(1, dim=-1)
    start_logits = start_logits.squeeze(-1).contiguous()
    end_logits = end_logits.squeeze(-1).contiguous()

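The positions with the highest start and end scores are then mapped back to word indices through `word_ids`, and the answer is the span of input words between them (the tokens are not drawn from the vocabulary).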

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits
    start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
    print(" ".join(words[start : end + 1]))
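
The decoding step can be tried without a model by feeding in hand-written logits. The sketch below is illustrative only: the logits, words and `word_ids` are made up, the toy sequence contains no question tokens, and the guard against special tokens or reversed spans is an added assumption rather than something the snippet above does.

```python
def decode_answer(start_logits, end_logits, word_ids, words):
    """Pick the best start/end token positions and map them back to words."""
    start_token = max(range(len(start_logits)), key=start_logits.__getitem__)
    end_token = max(range(len(end_logits)), key=end_logits.__getitem__)
    start_word, end_word = word_ids[start_token], word_ids[end_token]
    if start_word is None or end_word is None or end_word < start_word:
        return ""  # best positions hit a special token or form an empty span
    return " ".join(words[start_word : end_word + 1])


words = ["Name:", "John", "Smith", "Date"]
# One entry per token: None marks special tokens; "Smith" is split into two tokens.
word_ids = [None, 0, 1, 2, 2, 3, None]
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1, 0.0, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.4, 2.5, 0.3, 0.1]
print(decode_answer(start_logits, end_logits, word_ids, words))
# John Smith
```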

################################################################################################################################################

### LayoutLM model without any downstream task

In [1]:
from transformers import AutoTokenizer, LayoutLMForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForMaskedLM.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "[MASK]"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))

# add bounding boxes of cls + sep tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])

labels = tokenizer("Hello world", return_tensors="pt")["input_ids"]

outputs = model(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    labels=labels,
)

loss = outputs.loss


In [None]:
'''

token_boxes :- 

    [[0, 0, 0, 0],
    [637, 773, 693, 782],
    [698, 773, 733, 782],
    [1000, 1000, 1000, 1000]]

 
 '''

In [2]:
outputs

MaskedLMOutput(loss=tensor(11.0031, grad_fn=<NllLossBackward0>), logits=tensor([[[ 0.5534,  0.1442,  0.0901,  ..., -0.0590,  0.8132,  0.7086],
         [-1.6607, -0.8606, -1.0854,  ...,  0.2116,  0.2717, -0.9807],
         [-1.4802, -0.6127, -1.2147,  ...,  0.5059,  0.1404, -1.1180],
         [ 0.5582,  0.1473,  0.0929,  ..., -0.0526,  0.8160,  0.7153]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
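
In the cell above the labels cover every position, so the loss is also computed on `[CLS]`, `Hello` and `[SEP]`. A common alternative is to score only the masked positions by setting all other labels to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows how such labels could be built; the helper name is hypothetical and the token ids are illustrative placeholders assumed to follow a BERT-style vocabulary.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mlm_labels(masked_input_ids, original_input_ids, mask_token_id):
    """Keep the original id only where the input was masked."""
    return [
        original if masked == mask_token_id else IGNORE_INDEX
        for masked, original in zip(masked_input_ids, original_input_ids)
    ]


# Illustrative ids (assumed): [CLS]=101, [SEP]=102, [MASK]=103.
masked = [101, 7592, 103, 102]
original = [101, 7592, 2088, 102]
print(mlm_labels(masked, original, mask_token_id=103))
# [-100, -100, 2088, -100]
```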

### LayoutLM for Question Answering

In [3]:
!pip install datasets



In [4]:
from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
from datasets import load_dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)


# encoding.sequence_ids?
# - `None` for special tokens added around or between sequences,
# - `0` for tokens corresponding to words in the first sequence,
# - `1` for tokens corresponding to words in the second sequence when a pair of sequences was jointly encoded.


encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
outputs = model(**encoding)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
print(" ".join(words[start : end + 1]))


M. Hamann P. Harper, P. Martinez
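
The box-alignment loop in the cell above can be checked on a tiny hand-written encoding. In the sketch below every id is made up (a start token `0`, a separator `2`, two question tokens and three document tokens) and the helper name is hypothetical; only the branching logic mirrors the loop.

```python
def align_boxes(input_ids, sequence_ids, word_ids, boxes, sep_token_id):
    """Give each token the box of its document word, or a placeholder box."""
    bbox = []
    for token_id, seq_id, word_id in zip(input_ids, sequence_ids, word_ids):
        if seq_id == 1:
            bbox.append(boxes[word_id])  # document token: reuse its word's box
        elif token_id == sep_token_id:
            bbox.append([1000] * 4)      # separator token
        else:
            bbox.append([0] * 4)         # start token and question tokens
    return bbox


# Toy encoding: <start> q q <sep> <sep> d d d <sep>
input_ids = [0, 11, 12, 2, 2, 21, 22, 23, 2]
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, None]
word_ids = [None, 0, 1, None, None, 0, 0, 1, None]
boxes = [[10, 10, 50, 20], [60, 10, 90, 20]]  # one box per document word

for row in align_boxes(input_ids, sequence_ids, word_ids, boxes, sep_token_id=2):
    print(row)
```

The first document word is split into two tokens here, so its box appears twice. That is how 145 word boxes can expand to the 215 token boxes seen in the real example.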


In [5]:
question, len(words), len(boxes)

("what's his name?", 145, 145)

In [6]:
encoding.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox'])

In [7]:
encoding["bbox"].shape

torch.Size([1, 215, 4])

In [8]:
start_scores.shape, end_scores.shape

(torch.Size([1, 215]), torch.Size([1, 215]))
