<a href="https://colab.research.google.com/github/ChicoQ/my-bookmarks/blob/main/%D0%9A%D0%BE%D0%BF%D0%B8%D1%8F_%D0%B1%D0%BB%D0%BE%D0%BA%D0%BD%D0%BE%D1%82%D0%B0_%22LayoutLM_fine_tunning_for_SROIE_dataset_ipynb%22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune LayoutLM on SROIE
This notebook is an attempt to fine-tune the LayoutLM model on the SROIE dataset. The model is presented in the paper "[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)" by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei and Ming Zhou.

- GitHub repo [here](https://github.com/microsoft/unilm/tree/master/layoutlm).

- Read about the SROIE competition and dataset [here](https://rrc.cvc.uab.es/?ch=13).

- Inspiration from this Kaggle notebook [here](https://www.kaggle.com/jpmiller/layoutlm-starter)

## Notes:
- The repo includes a pre-processing script and fine-tuning code for the FUNSD dataset, but not for the SROIE dataset (though the paper reports results on SROIE). This notebook intends to fill that gap.

- I have used my Google Drive to manage the files. If you want to do the same, just change the folder names (both the ones where you keep the SROIE files and the ones where you keep the LayoutLM files).

- The best F1 scores I got on the predictions were between 93% and 94.5%, which is a bit lower than the value presented in the paper (~94%/95%). The differences may be explained by
  - different parameters (I haven't done an exhaustive grid search)
  - different sampling
  - different pre-processing. This one is far from perfect; some labels and invoices are lost along the way.
  - a different OCR base. As I understood it, the authors also ran their own OCR, while I use the one provided with the dataset
  - I was having difficulties with the label "company address", so I dropped it
  - any other differences, as the paper doesn't explain this fine-tuning in detail

- Make sure you have GPU enabled on the notebook (Edit->Notebook settings)

- Yes I know, the code is horrible and badly explained, sorry for that. Nevertheless, hope it helps somehow

# 1. Pre-process dataset

In [9]:
# Imports  
import os
import pandas as pd
import glob
import json 
import ast
import re
import random

In [3]:
# Connection to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
# Define the path to the dataset files (download the dataset beforehand from the link given at the top of the notebook)
# This is the folder with the files that contain the bounding boxes and the words
spath_words = '/content/drive/My Drive/data/layoutml/'
os.chdir(spath_words)
# Collect one row per invoice (filename, words and bounding boxes)
rows = []

# Loop over every file in the folder
for file in glob.glob("*.txt"):
  try:
    # Treat each invoice as a sentence and a row of the df
    sfullpath = spath_words + file
    df_file = pd.read_csv(sfullpath, header=None, names=['x0', 'y0', 'x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'words'])
    if not df_file['words'].isnull().values.any():
      sentence_list = [str(i) for i in df_file['words']]
      # Keep only the top-left (x0, y0) and bottom-right (x2, y2) corners
      bbox_list = df_file[['x0', 'y0', 'x2', 'y2']].values.tolist()
      rows.append({'filename': file, 'sentence': sentence_list, 'bboxes': bbox_list})
  except Exception as e:
    # A few files cannot be parsed; skip them and print the associated error
    print(file + " | " + repr(e))

# Build the dataframe in one go (DataFrame.append was removed in pandas 2.0)
df_sentences = pd.DataFrame(rows, columns=['filename', 'sentence', 'bboxes'])
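
One likely reason some files fail in the loop above is that the transcribed text can itself contain commas, which makes `pd.read_csv` see more than nine fields. Below is a minimal sketch of a more tolerant line parser, assuming each line has the form `x0,y0,x1,y1,x2,y2,x3,y3,text`. The helper name `parse_ocr_line` and the sample line are hypothetical, not part of the original notebook.

```python
def parse_ocr_line(line):
    """Split one OCR line into eight integer coordinates and the text.

    Only the first eight commas are treated as separators, so commas
    inside the transcribed text are preserved.
    """
    parts = line.rstrip("\n").split(",", 8)
    if len(parts) < 9:
        raise ValueError("expected 8 coordinates and a text field: %r" % line)
    coords = [int(p) for p in parts[:8]]
    return coords, parts[8]


# Hypothetical sample line whose text contains a comma
coords, text = parse_ocr_line("72,25,326,25,326,64,72,64,LOT 3, JALAN PELABUR 23/1")
# Keep only the top-left and bottom-right corners, as in the loop above
bbox = [coords[0], coords[1], coords[4], coords[5]]
print(bbox, text)
```

With a parser like this, the rows could be built line by line instead of through `pd.read_csv`, at the cost of handling encoding and blank lines by hand.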

In [6]:
df_sentences

NameError: ignored

In [None]:
# Define the path to the dataset files (download the dataset beforehand from the link given at the top of the notebook)
# This is the folder with the files that contain the values (company name, date, address and total)
spath_labels = '/content/drive/My Drive/data/layoutml/'
os.chdir(spath_labels)
# Collect one row per invoice with its tagged values
rows = []

for file in glob.glob("*.txt"):
  try:
    with open(file, 'r') as fileread:
      data = json.load(fileread)
    rows.append({'filename':file, 'value_company':data['company'], 'value_date':data['date'], 'value_address':data['address'], 'value_total':data['total']})
  except Exception as e:
    # Files with a missing key are skipped and reported
    print(file + " | " + repr(e))

# Build the dataframe in one go (DataFrame.append was removed in pandas 2.0)
df_labels = pd.DataFrame(rows, columns=['filename', 'value_company', 'value_date', 'value_address', 'value_total'])

X51005663280(1).txt | KeyError('address',)
X51005663280.txt | KeyError('address',)


In [None]:
# Now let's merge the two dataframes based on the filename
df = pd.merge(df_sentences,df_labels,on='filename')

In [None]:
# In case you want to store the df on drive (to avoid running the previous cells again and again), just uncomment this cell
#os.chdir('/content/drive/My Drive/Datasets/SROIE2019/')
#df.to_csv('df.csv')

In [11]:
# Load the df from drive if it was stored there previously (skip this cell otherwise)
df = pd.read_csv('/content/drive/MyDrive/data/layoutml/df.csv')
    #'/content/drive/My Drive/Datasets/SROIE2019/df.csv')
# Drop the index column that to_csv wrote
df = df.drop(['Unnamed: 0'], axis=1)

In [12]:
# Parse the stringified lists back into Python lists (needed after reading the df from CSV)
df['sentence'] = df['sentence'].map(ast.literal_eval)
df['bboxes'] = df['bboxes'].map(ast.literal_eval)

In [13]:
df.head(5)

Unnamed: 0,filename,sentence,bboxes,value_company,value_date,value_address,value_total
0,X51006555072.txt,"[DIGI TELECOMMUNICATIONS SDN BHD, (201283-M), ...","[[106, 179, 502, 204], [239, 205, 364, 231], [...",DIGI TELECOMMUNICATIONS SDN BHD,13/10/2017,"LOT LG 315, 1-UTAMA SHOPPING CENTRE, LEBUH BAN...",234.40
1,X51006557117.txt,"[GARDENIA BAKERIES (KL) SDN BHD (139386 X), LO...","[[35, 87, 590, 110], [172, 109, 448, 133], [16...",GARDENIA BAKERIES (KL) SDN BHD,30/10/2017,"LOT 3, JALAN PELABUR 23/1, 40300 SHAH ALAM, SE...",62.60
2,X51005568884.txt,"[MR. D.I.Y. SDN BHD, (CO.REG :704427-T), LOT 1...","[[259, 337, 632, 374], [241, 380, 627, 421], [...",MR. D.I.Y. SDN BHD,24-11-17,"LOT 1851-A & 1851-B, JALAN KPB 6, KAWASAN PERI...",RM 3.90
3,X51005711441.txt,"[RESTORAN WAN SHENG, 002043319-W, NO.2, SEKSYE...","[[224, 266, 553, 308], [282, 316, 484, 350], [...",RESTORAN WAN SHENG,21-03-2018,"NO.2, JALAN TEMENGGUNG 19/9, SEKSYEN 9, BANDAR...",6.70
4,X51005757304(1).txt,"[#000002 BAIFU (M) SDN BHD, COMPANY NO(814198-...","[[147, 153, 549, 193], [195, 193, 514, 226], [...",BAIFU (M) SDN BHD,20/03/2018,"DAISO JAPAN, IOI MALL",35.40


In [14]:
# Define some auxiliary functions
def a_in_x(A, X):
  '''
  Returns list with indexes of elements of list X which contain A
  '''
  l = []
  for i in range(len(X) - len(A) + 1):
    if str(A[0]) in str(X[i:i+len(A)][0]): 
      l.append(i)
  return l

def flat_list_one_level(l):
  '''
  Flattens list
  Doesn't include second level list of lists, only first level
  '''
  flat_list = []
  for sublist in l:
    if type(sublist) is list:
      for item in sublist:
          flat_list.append(item)
    else:
      flat_list.append(sublist)
  return flat_list

def flat_list_one_level_list_of_lists(l):
  '''
  Flattens list
  Flattens only the first element of the sub-list
  '''
  flat_list = []
  for sublist in l:
    if type(sublist) is list and len(sublist) > 0 and type(sublist[0]) is list:
      for item in sublist:
        flat_list.append(item)
    else:
      flat_list.append(sublist)
  return flat_list
    
def intersperse(lst, item):
  '''
  Places an item between elements of a list
  '''
  result = [item] * (len(lst) * 2 - 1)
  result[0::2] = lst
  return result

def split_box(box, n_splits):
  '''
  Splits a bbox [x0,y0,x1,y1] by its coordinates into n_splits bboxes of equal size
  '''
  boxs_splitted = []
  x0 = box[0]
  y0 = box[1]
  x1 = box[2]
  y1 = box[3]
  width = x1 - x0
  for i_split in range(0, n_splits):
    boxs_splitted.append([x0 + i_split * int(width/n_splits), y0, x0 + (i_split + 1) * int(width/n_splits), y1])
  return boxs_splitted

def split_box_weighted(box, l_splits):
  '''
  Splits a bbox [x0,y0,x1,y1] by its coordinates into len(l_splits)
  The size of each bbox is proportional to the weight present in l_splits
  '''
  boxs_splitted = []
  x0 = box[0]
  y0 = box[1]
  x1 = box[2]
  y1 = box[3]
  width = x1 - x0
  sum_splits = sum(l_splits)
  for i_split in l_splits:
    split_fraction = i_split/sum_splits
    x1f = x0 + int(width * split_fraction)
    boxs_splitted.append([x0, y0, x1f, y1])
    x0 = x1f
  return boxs_splitted
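
To make the weighted split concrete, here is a small worked example using the first OCR line of the first invoice. The function body repeats the logic of `split_box_weighted` above so that the cell runs on its own; the shorter variable names are just for illustration.

```python
def split_box_weighted(box, weights):
    # Same logic as the helper defined above, repeated so this cell is self-contained
    x0, y0, x1, y1 = box
    width = x1 - x0
    total = sum(weights)
    boxes = []
    for w in weights:
        fraction = w / total
        x1f = x0 + int(width * fraction)
        boxes.append([x0, y0, x1f, y1])
        x0 = x1f
    return boxes


words = "DIGI TELECOMMUNICATIONS SDN BHD".split(" ")
boxes = split_box_weighted([106, 179, 502, 204], [len(w) for w in words])
for word, box in zip(words, boxes):
    print(word, box)
```

Because each width is truncated with `int`, the last box ends at x = 500 rather than at the original right edge of 502. The gap is small here, but it grows with the number of pieces.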

In [15]:
# Define function to set the labels to the words
def define_labels(pos, sent, labels, bbox, class_value, classification, label_other = 'O'):
  # Compare everything as strings, since values read from the CSV may have been parsed as numbers
  class_value = str(class_value)
  # pos is a list with the positions of the groups of words associated with this label,
  # so this loops over each group of words that contains the class value
  for i_pos in pos:
    if sent[i_pos] == class_value:
      # If the group of words is equal to the class value, the whole group is assigned the label
      labels[i_pos] = classification
    else:
      # The value is contained within the group of words, so we have to split the group (ex: [... , "Date: 01/01/2020", ...] -> [..., ["Date: ", "01/01/2020"], ...])
      # We start by replacing the group of words with a split list
      sent[i_pos] = intersperse(sent[i_pos].split(class_value), class_value)
      # This split may leave an empty or white space element at the first or last position, so we have to remove it
      if sent[i_pos][0].isspace() or len(sent[i_pos][0])==0: sent[i_pos] = sent[i_pos][1:]
      if sent[i_pos][-1].isspace() or len(sent[i_pos][-1])==0: sent[i_pos] = sent[i_pos][0:-1]
      # Now we can associate the labels with the correct group of words (ex: [... , "Date: 01/01/2020", ...] -> [..., ["Date: ", "01/01/2020"], ...], the labels would be [..., ["O", "B-DATE"], ...])
      labels[i_pos] = [classification if s == class_value else label_other for s in sent[i_pos]]
      # The bounding boxes must also be split
      # Here we do it proportionally to the number of characters of each piece
      bbox[i_pos] = split_box_weighted(bbox[i_pos], [len(i) for i in sent[i_pos]])

  # The obtained lists have now some second level lists, so we have to flatten
  sent = flat_list_one_level(sent)
  labels = flat_list_one_level(labels)
  bbox = flat_list_one_level_list_of_lists(bbox)
  return sent, labels, bbox
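
The splitting step inside `define_labels` can be traced on the example from its comments. This is a minimal sketch that repeats `intersperse` so the cell runs on its own; the variable names `group`, `value` and `pieces` are only for illustration.

```python
def intersperse(lst, item):
    # Same helper as above: places item between the elements of lst
    result = [item] * (len(lst) * 2 - 1)
    result[0::2] = lst
    return result


group = "Date: 01/01/2020"
value = "01/01/2020"

# Splitting on the value leaves an empty string after it
pieces = intersperse(group.split(value), value)
print(pieces)

# Drop an empty or whitespace-only piece at either end
if pieces[0].isspace() or len(pieces[0]) == 0:
    pieces = pieces[1:]
if pieces[-1].isspace() or len(pieces[-1]) == 0:
    pieces = pieces[:-1]

labels = ["B-DATE" if p == value else "O" for p in pieces]
print(pieces, labels)
```

The character lengths of the remaining pieces (6 and 10 here) are what `split_box_weighted` receives as weights for the bounding box.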

In [16]:
# Finally the loop to create lists with the sentences and their corresponding labels and bboxes
sentences_list = []
labels_list = []
bbox_list = []
class_other = 'O'
for index, row in df.iterrows():
  labels = [class_other] * len(row['sentence'])
  sent = row['sentence'].copy()
  bbox = row['bboxes'].copy()
  
  # Define labels for date
  class_value = row['value_date']
  classification = 'B-DATE'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)
  
  # Define labels for total value
  class_value = row['value_total']
  classification = 'B-TOTAL'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)  

  # Define labels for company name
  class_value = row['value_company']
  classification = 'B-COMPANY'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Define labels for address 
  # class_value = row['value_address']
  # classification = 'B-ADDRESS'
  # pos = a_in_x([class_value], sent)
  # if len(pos) > 0:
  #   sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Appends the group of words, labels and bboxes to lists
  sentences_list.append(sent.copy())
  labels_list.append(labels.copy())
  bbox_list.append(bbox.copy())

At this point we have lists in which the elements are also lists (groups of words)

In order to discretize the problem, we should split the groups of words into single words

In [17]:
def break_sentences(sl, ll, bl):
  # sl, ll and bl hold the sentences, labels and bboxes of every invoice, in that order
  sentences_list_temp = []
  bbox_list_temp = []
  labels_list_temp = []
  for sents, labels, boxs in zip(sl, ll, bl):
    sentences_list3 = []
    bbox_list3 = []
    labels_list3 = []
    for sent, label, box in zip(sents, labels, boxs):
      word_tokens = sent.split(" ")
      # Strip white spaces
      word_tokens = [w for w in word_tokens if (w != "" and w != " ")] 
      sentences_list3.extend(word_tokens)
      splitted_boxes = split_box_weighted(box, [len(i) for i in word_tokens])
      bbox_list3.extend(splitted_boxes)
      # BO
      labels_list3.extend([label] * len(word_tokens))
      # BIO
      #labels_list3.extend([label] + [label.replace('B-','I-')] * (len(word_tokens) - 1))
    sentences_list_temp.append(sentences_list3)
    bbox_list_temp.append(bbox_list3)
    labels_list_temp.append(labels_list3)
  return sentences_list_temp, bbox_list_temp, labels_list_temp

In [18]:
sentences_list, bbox_list, labels_list = break_sentences(sentences_list, labels_list, bbox_list)

In [19]:
# Check the first invoice data
for s, l, b in zip(sentences_list[0],labels_list[0],bbox_list[0]):
  print("{}\t\t{}\t\t{}".format(s,l,b))

DIGI		B-COMPANY		[106, 179, 162, 204]
TELECOMMUNICATIONS		B-COMPANY		[162, 179, 416, 204]
SDN		B-COMPANY		[416, 179, 458, 204]
BHD		B-COMPANY		[458, 179, 500, 204]
(201283-M)		O		[239, 205, 364, 231]
LOT		O		[80, 229, 252, 256]
LG		O		[252, 229, 366, 256]
315		O		[366, 229, 538, 256]
LEBUH		O		[98, 255, 169, 280]
BANDAR		O		[169, 255, 254, 280]
UTAMA-BANDAR		O		[254, 255, 425, 280]
UTAMA		O		[425, 255, 496, 280]
PETALING		O		[172, 278, 343, 307]
JAYA		O		[343, 278, 428, 307]
SELANGOR		O		[247, 305, 350, 328]
TAX		O		[236, 354, 277, 380]
INVOICE		O		[277, 354, 374, 380]
GST		O		[72, 380, 115, 406]
REG		O		[115, 380, 158, 406]
NUMBER:		O		[158, 380, 259, 406]
001211957248		O		[335, 381, 490, 405]
13/10/2017		B-DATE		[48, 429, 177, 451]
12:35		O		[487, 427, 552, 453]
POS		O		[48, 452, 91, 479]
LOGIN		O		[91, 452, 163, 479]
ID:		O		[163, 452, 206, 479]
DMGR34013		O		[206, 452, 336, 479]
STORE		O		[48, 480, 120, 503]
NAME:		O		[120, 480, 192, 503]
DS001-BP009		O		[192, 480, 350, 503]
OSCAR	

Now everything is ready to write the files in the correct format (accepted by the layoutLM process)

In [20]:
def bbox_string(box, width, length):
    return (
        str(int(1000 * (box[0] / width)))
        + " "
        + str(int(1000 * (box[1] / length)))
        + " "
        + str(int(1000 * (box[2] / width)))
        + " "
        + str(int(1000 * (box[3] / length)))
    )

def actual_bbox_string(box, width, length):
    return (
        str(box[0])
        + " "
        + str(box[1])
        + " "
        + str(box[2])
        + " "
        + str(box[3])
        + "\t"
        + str(width)
        + " "
        + str(length)
    )

def size(bboxes):
  max_width = 0
  max_height = 0
  min_x0 = 10e8
  min_y0 = 10e8
  for box in bboxes:
    if box[0] < min_x0: min_x0 = box[0]
    if box[1] < min_y0: min_y0 = box[1]
    if box[2] > max_width: max_width = box[2]
    if box[3] > max_height: max_height = box[3]
  max_width += min_x0
  max_height += min_y0
  return max_height, max_width
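
`bbox_string` rescales pixel coordinates to a 0 to 1000 range. Since `size` only estimates the page dimensions from the boxes, a defensive variant could also clamp the result. The sketch below is an assumption of how that might look; `normalize_box` and the 600 x 900 page size are hypothetical and not part of the original notebook.

```python
def normalize_box(box, width, height):
    """Scale [x0, y0, x1, y1] from pixels to the 0-1000 range and clamp."""
    def scale(value, extent):
        # Multiply before dividing, then clamp to the valid range
        return max(0, min(1000, int(1000 * value / extent)))

    return [
        scale(box[0], width),
        scale(box[1], height),
        scale(box[2], width),
        scale(box[3], height),
    ]


# Hypothetical page of 600 x 900 pixels
print(normalize_box([106, 179, 162, 204], width=600, height=900))
# A box that overshoots the estimated width is clamped to 1000
print(normalize_box([0, 0, 700, 900], width=600, height=900))
```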

In [21]:
def write_files(output_dir, data_split, sentences_list, labels_list, bbox_list, split_indexes):
  with open(
      os.path.join(output_dir, data_split + ".txt"),
      "w",
      encoding="utf8",
  ) as fw, open(
      os.path.join(output_dir, data_split + "_box.txt"),
      "w",
      encoding="utf8",
  ) as fbw, open(
      os.path.join(output_dir, data_split + "_image.txt"),
      "w",
      encoding="utf8",
  ) as fiw:
      for index in split_indexes:
          sent = sentences_list[index]
          lab = labels_list[index]
          boxes = bbox_list[index]
          length, width = size(boxes)

          for words, label, box in zip(sent, lab, boxes):
              fw.write("{}\t{}\n".format(words, label))
              fbw.write("{}\t{}\n".format(words, bbox_string(box, width, length)))
              fiw.write("{}\t{}\t{}\n".format(words, actual_bbox_string(box, width, length), "filename.jpg"))
          fw.write("\n")
          fbw.write("\n")
          fiw.write("\n")
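
A quick way to check the files written by `write_files` is to read one back and count documents and tokens. The sketch below parses the `word<TAB>label` layout with blank lines between invoices. `read_documents` is a hypothetical helper, and the sample text stands in for a real `train.txt`.

```python
import io


def read_documents(handle):
    """Yield one list of (word, label) pairs per blank-line separated document."""
    doc = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current document
            if doc:
                yield doc
                doc = []
            continue
        word, label = line.rsplit("\t", 1)
        doc.append((word, label))
    if doc:
        yield doc


# In-memory stand-in for open('train.txt', encoding='utf8')
sample = io.StringIO("DIGI\tB-COMPANY\nSDN\tB-COMPANY\n\n13/10/2017\tB-DATE\n\n")
docs = list(read_documents(sample))
print(len(docs), [len(d) for d in docs])
```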

In [22]:
# First we split into train and test set
split_indexes = [*range(len(sentences_list))]
random.Random(4).shuffle(split_indexes)
cut = int(len(sentences_list) * 0.8)
split_indexes_train = split_indexes[:cut]
split_indexes_test = split_indexes[cut:]

In [24]:
write_files('/content/drive/MyDrive/data/SROIE2019',
    #'/content/drive/My Drive/Datasets/SROIE2019/',
            'train', sentences_list, labels_list, bbox_list, split_indexes_train)

In [25]:
write_files('/content/drive/MyDrive/data/SROIE2019',
    #'/content/drive/My Drive/Datasets/SROIE2019/',
     'test', sentences_list, labels_list, bbox_list, split_indexes_test)

# 2. Fine tune LayoutLM

In [26]:
os.chdir('/content')

In [2]:
%%bash
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutlm
pip install .

Processing /content/unilm/layoutlm
Collecting transformers==2.9.0
  Downloading https://files.pythonhosted.org/packages/cd/38/c9527aa055241c66c4d785381eaf6f80a28c224cae97daa1f8b183b5fabb/transformers-2.9.0-py3-none-any.whl (635kB)
Collecting tensorboardX==2.0
  Downloading https://files.pythonhosted.org/packages/35/f1/5843425495765c8c2dd0784a851a93ef204d314fc87bcc2bbb9f662a3ad1/tensorboardX-2.0-py2.py3-none-any.whl (195kB)
Collecting lxml==4.5.1
  Downloading https://files.pythonhosted.org/packages/ba/39/0b5d76e64681243db516491bc449eff847d2708b465b60465b31ca13522e/lxml-4.5.1-cp37-cp37m-manylinux1_x86_64.whl (5.5MB)
Collecting seqeval==0.0.12
  Downloading https://files.pythonhosted.org/packages/34/91/068aca8d60ce56dd9ba4506850e876aba5e66a6f2f29aa223224b50df0de/seqeval-0.0.12.tar.gz
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/ea/59/bb06dd5ca53547d523422d32735585493e0103c992a52a97ba3aa3be33bf/tokenizers-0.7.0-cp37-cp37m-manylinux1_x86_64.whl (5.6MB)

Cloning into 'unilm'...


In [28]:
os.chdir('/content/unilm/layoutlm/examples/seq_labeling')

In [None]:
# Move the previously created files
%%bash
mkdir data
cp '/content/drive/My Drive/Datasets/SROIE2019/train.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Datasets/SROIE2019/train_box.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Datasets/SROIE2019/train_image.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Datasets/SROIE2019/test.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Datasets/SROIE2019/test_box.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Datasets/SROIE2019/test_image.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Datasets/SROIE2019/labels.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
# Try to remove cached files (this is optional and only important if we make changes on the input files)
#rm '/content/unilm/layoutlm/examples/seq_labeling/data/cached_train_layoutlm-base-uncased_512'
#rm '/content/unilm/layoutlm/examples/seq_labeling/data/cached_test_layoutlm-base-uncased_512'

In [29]:
%%bash
mkdir data
cp '/content/drive/MyDrive/data/SROIE2019/train.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/MyDrive/data/SROIE2019/train_box.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/MyDrive/data/SROIE2019/train_image.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/MyDrive/data/SROIE2019/test.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/MyDrive/data/SROIE2019/test_box.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/MyDrive/data/SROIE2019/test_image.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/MyDrive/data/SROIE2019/labels.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'

In [30]:
%%bash
ls /content/unilm/layoutlm/examples/seq_labeling/data/
cat /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt

labels.txt
test_box.txt
test_image.txt
test.txt
train_box.txt
train_image.txt
train.txt
S-COMPANY
S-DATE
S-ADDRESS
S-TOTAL
O


In [31]:
# Check model parameters
%%bash
cat "/content/drive/My Drive/Models/layoutlm-base-uncased/config.json"

cat: '/content/drive/My Drive/Models/layoutlm-base-uncased/config.json': No such file or directory


In [33]:
%%bash
cat "/content/drive/MyDrive/data/Models/config.json"

{
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "max_2d_position_embeddings": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522,
  "model_type": "layoutlm"
}

In [None]:
# Want to change any model parameter? For example here I just replace the number of attention heads from 12 to 8 (the results are much better)
%%bash
sed -i 's/"num_attention_heads": 12,/"num_attention_heads": 8,/' "/content/drive/My Drive/Msc/Tese/Modelos/layoutlm-base-uncased/config.json"

In [40]:
%%bash
ls '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/'
#touch '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/config.json'

config.json


In [45]:
%%bash
cp '/content/drive/MyDrive/data/Models/config.json' '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/config.json'

In [46]:
%%bash
sed -i 's/"num_attention_heads": 12,/"num_attention_heads": 8,/' "/content/drive/MyDrive/data/Models/layoutlm-base-uncased/config.json"

In [47]:
%%bash
cat '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/config.json'

{
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "max_2d_position_embeddings": 1024,
  "num_attention_heads": 8,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522,
  "model_type": "layoutlm"
}
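
Changing `num_attention_heads` is only possible when `hidden_size` is a multiple of it, because BERT-style attention splits the hidden vector evenly across heads. The sketch below just does that arithmetic for the two values used here; `attention_head_size` is a hypothetical helper, not a function of the LayoutLM code.

```python
def attention_head_size(hidden_size, num_attention_heads):
    """Return the per-head dimension, or raise if the split is impossible."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "hidden_size (%d) is not a multiple of num_attention_heads (%d)"
            % (hidden_size, num_attention_heads)
        )
    return hidden_size // num_attention_heads


for heads in (12, 8):
    print(heads, "heads ->", attention_head_size(768, heads), "dimensions per head")
```

As far as I can tell, the pretrained projection matrices keep their 768 x 768 shape either way, so the checkpoint still loads; with 8 heads the same weights are simply regrouped into wider heads than the ones used during pre-training.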

In [53]:
# Train the model
! CUDA_LAUNCH_BLOCKING=1 python run_seq_labeling.py  --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/' \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_train \
                            --num_train_epochs 5.0 \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir output1 \
                            --overwrite_output_dir \
                            --labels data/labels.txt \
                            --per_gpu_train_batch_size 8 \
                            --per_gpu_eval_batch_size 8

In [74]:
! CUDA_LAUNCH_BLOCKING=1 python run_seq_labeling.py  \
--data_dir data --model_type layoutlm \
--model_name_or_path '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/' \
--do_lower_case --max_seq_length 512 --do_train --num_train_epochs 5.0 \
--logging_steps 10 \
--save_steps -1 \
--output_dir output1 \
--overwrite_output_dir \
--labels data/labels.txt \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 8

2021-05-16 23:09:53.425615: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Epoch:   0% 0/5 [00:00<?, ?it/s]
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)

Iteration:   1% 1/71 [00:00<01:08,  1.02it/s]
Iteration:   3% 2/71 [00:01<01:02,  1.11it/s]
Iteration:   4% 3/71 [00:02<00:57,  1.18it/s]
Iteration:   6% 4/71 [00:03<00:54,  1.24it/s]
Iteration:   7% 5/71 [00:03<00:51,  1.28it/s]
Iteration:   8% 6/71 [00:04<00:49,  1.31it/s]
Iteration:  10% 7/71 [00:05<00:48,  1.33it/s]
Iteration:  11% 8/71 [00:06<00:46,  1.35it/s]

Iteration:  14% 10/71 [00:07<00:44,  1.37it/s]
Iteration:  15% 11/71 [00:08<00:43,  1.37it/s]
Iteration:  17% 12/71 [00:08<00:42,  1.38it/s]
Iteration:  18% 13/71 [

In [57]:
%%bash
cat '/content/unilm/layoutlm/examples/seq_labeling/data/labels.txt'

S-COMPANY
S-DATE
S-ADDRESS
S-TOTAL
O


In [71]:
%%bash
touch /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt

In [59]:
%%bash
sed '$ a O' /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt
sed '$ a B-DATE' /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt

In [72]:
%%bash
echo 'O' >> /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt
echo 'B-DATE' >> /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt
echo 'B-COMPANY' >> /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt
echo 'B-TOTAL' >> /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt


In [69]:
%%bash 
rm  /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt

In [73]:
%%bash
cat /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt

O
B-DATE
B-COMPANY
B-TOTAL
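
Instead of editing `labels.txt` by hand, the file could be generated from the labels actually used during pre-processing, which avoids a mismatch such as the `S-` labels seen earlier. This is a minimal sketch; `write_labels_file` is a hypothetical helper, the demo labels stand in for `labels_list`, and the alphabetical order is an arbitrary choice that only needs to stay the same between training and prediction.

```python
import os
import tempfile


def write_labels_file(labels_per_doc, path, other="O"):
    """Write one label per line, with the 'other' label first."""
    unique = sorted(
        {label for doc in labels_per_doc for label in doc if label != other}
    )
    ordered = [other] + unique
    with open(path, "w", encoding="utf8") as handle:
        for label in ordered:
            handle.write(label + "\n")
    return ordered


# Stand-in for the labels_list built in section 1
demo = [["B-COMPANY", "O", "B-DATE"], ["O", "B-TOTAL"]]
path = os.path.join(tempfile.mkdtemp(), "labels.txt")
written = write_labels_file(demo, path)
print(written)
```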


In [None]:
#'/content/drive/My Drive/Models/layoutlm-base-uncased1'

In [75]:
# Evaluate for test set
! python run_seq_labeling.py  --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path '/content/drive/MyDrive/data/Models/layoutlm-base-uncased/' \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_predict \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir output1 \
                            --labels data/labels.txt \
                            --per_gpu_eval_batch_size 8

2021-05-16 23:21:16.520865: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Evaluating: 100% 18/18 [00:04<00:00,  4.29it/s]


In [76]:
cat output1/test_results.txt

f1 = 0.9577590040017786
loss = 0.025354126203132585
precision = 0.9397905759162304
recall = 0.9764279238440616
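
As a sanity check, the reported F1 can be recomputed from the precision and recall above, since F1 is their harmonic mean:

```python
precision = 0.9397905759162304
recall = 0.9764279238440616

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.957759
```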


In [None]:
ls '/content/drive/My Drive/Models'

In [None]:
!cp -r ./output1 '/content/drive/My Drive/Models'

In [77]:
ls '/content/drive/MyDrive/data/Models'

 config.json   [0m[01;34mlayoutlm-base-uncased[0m/  [01;36m'layoutlm-base-uncased (1)'[0m@


In [78]:
!cp -r ./output1 '/content/drive/MyDrive/data/Models'

In [79]:
# We can check the results on the test set
%%bash
head -60 output/test_predictions.txt

head: cannot open 'output/test_predictions.txt' for reading: No such file or directory


In [80]:
%%bash
head -60 output1/test_predictions.txt

TAN O
WOON O
YANN O
MR O
D.I.Y. O
(M) B-COMPANY
SDN B-COMPANY
BHD B-COMPANY
(CO. O
RFG O
: O
860671-D) O
LOT O
1851-A O
& O
1851-B O
KAWASAN O
PERINDUSTRIAN O
BALAKONG O
43300 O
SERI O
KEMBANGAN O
(TESCO O
PUTRA O
NILAI) O
-INVOICE- O
KILAT O
AUTO O
ECO O
WASH O
& O
SHINE O
ES1000 O
1L O
WA45 O
/2A O
- O
12 O
9555916500133 O
1 O
X O
3.11 O
3.11 O
KILAT' O
ECO O
AUTO O
WASH O
&WAX O
EW-1000-1L O
WA44-A O
- O
12 O
9555916500126 O
1 O
X O
4.62 O
4.62 O
WD40 O
27ML O
MOO O


In [81]:
!cat '/content/drive/MyDrive/data/Models/output1/config.json'

{
  "architectures": [
    "LayoutlmForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_2d_position_embeddings": 1024,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
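
The saved config only carries generic names (`LABEL_0` to `LABEL_3`), so predictions have to be mapped back through `labels.txt`. The sketch below assumes that ids are assigned in the order the labels appear in that file; I have not verified this against the training script, so treat the mapping as an assumption.

```python
# Order of the labels.txt used for training (see the cat output above)
labels = ["O", "B-DATE", "B-COMPANY", "B-TOTAL"]

# Assumed mapping: the id is the position of the label in the file
id2label = {i: label for i, label in enumerate(labels)}

# Translate the generic names stored in config.json
generic_to_label = {"LABEL_%d" % i: label for i, label in id2label.items()}
print(generic_to_label)
```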


In [5]:
PRETRAINED_MODEL = "/content/drive/MyDrive/data/Models/output1"

# Path to ONNX model
ONNX_MODEL_PATH = "/content/drive/MyDrive/data/Models/onnx"

MODEL_NAME = "LayoutLMSROIE"

TF_MODEL_PATH = "/content/drive/MyDrive/data/Models/tf"

In [83]:
%pip install transformers==2.9.0



In [84]:
import torch
torch.__version__

'1.8.1+cu101'

In [85]:
import tensorflow as tf

tf.__version__

'2.4.1'

In [86]:
%pip install tensorflow==1.15

Collecting tensorflow==1.15
  Downloading https://files.pythonhosted.org/packages/92/2b/e3af15221da9ff323521565fa3324b0d7c7c5b1d7a8ca66984c8d59cb0ce/tensorflow-1.15.0-cp37-cp37m-manylinux2010_x86_64.whl (412.3MB)
     |████████████████████████████████| 412.3MB 42kB/s
Collecting keras-applications>=1.0.8
  Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
     |████████████████████████████████| 51kB 7.8MB/s
Collecting gast==0.2.2
  Downloading https://files.pythonhosted.org/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting tensorflow-estimator==1.15.1
  Downloading https://files.pythonhosted.org/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503kB)
     |████████████████████████████████| 512kB 44.6MB/s
C

In [16]:
import torch
print(torch.__version__)
import tensorflow
print(tensorflow.__version__)
import transformers
print(transformers.__version__)

1.8.1+cu101
2.4.1
2.9.0


In [17]:
%pip install tensorflow==1.15

Collecting tensorflow==1.15
  Downloading https://files.pythonhosted.org/packages/92/2b/e3af15221da9ff323521565fa3324b0d7c7c5b1d7a8ca66984c8d59cb0ce/tensorflow-1.15.0-cp37-cp37m-manylinux2010_x86_64.whl (412.3MB)
     |████████████████████████████████| 412.3MB 38kB/s
Collecting keras-applications>=1.0.8
  Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
     |████████████████████████████████| 51kB 6.1MB/s
Collecting tensorflow-estimator==1.15.1
  Downloading https://files.pythonhosted.org/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503kB)
     |████████████████████████████████| 512kB 22.9MB/s
Collecting tensorboard<1.16.0,>=1.15.0
  Downloading https://files.pythonhosted.org/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db2

In [23]:
%pip install torch==1.8.0

Collecting torch==1.8.0
  Downloading https://files.pythonhosted.org/packages/94/99/5861239a6e1ffe66e120f114a4d67e96e5c4b17c1a785dfc6ca6769585fc/torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5MB)
ERROR: torchvision 0.9.1+cu101 has requirement torch==1.8.1, but you'll have torch 1.8.0 which is incompatible.
ERROR: torchtext 0.9.1 has requirement torch==1.8.1, but you'll have torch 1.8.0 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.8.1+cu101
    Uninstalling torch-1.8.1+cu101:
      Successfully uninstalled torch-1.8.1+cu101
Successfully installed torch-1.8.0


In [1]:
%pip install torch



In [2]:
%pip install tensorflow==2.4.1

Collecting tensorflow==2.4.1
  Downloading https://files.pythonhosted.org/packages/70/dc/e8c5e7983866fa4ef3fd619faa35f660b95b01a2ab62b3884f038ccab542/tensorflow-2.4.1-cp37-cp37m-manylinux2010_x86_64.whl (394.3MB)
Collecting gast==0.3.3
  Downloading https://files.pythonhosted.org/packages/d6/84/759f5dd23fec8ba71952d97bcc7e2c9d7d63bdc582421f3cd4be845f0c98/gast-0.3.3-py2.py3-none-any.whl
Collecting tensorboard~=2.4
  Downloading https://files.pythonhosted.org/packages/44/f5/7feea02a3fb54d5db827ac4b822a7ba8933826b36de21880518250b8733a/tensorboard-2.5.0-py3-none-any.whl (6.0MB)
Collecting tensorflow-estimator<2.5.0,>=2.4.0
  Downloading https://files.pythonhosted.org/packages/74/7e/622d9849abf3afb81e482ffc170758742e392ee129ce1540611199a59237/tensorflow_estimator-2.4.0-py2.py3-none-any.whl (462kB)
C

In [3]:
import torch
print(torch.__version__)
import tensorflow
print(tensorflow.__version__)

1.8.1+cu101
1.15.0


In [9]:
!test -d onnx-tensorflow || git clone https://github.com/onnx/onnx-tensorflow.git

Cloning into 'onnx-tensorflow'...
remote: Enumerating objects: 6128, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 6128 (delta 40), reused 28 (delta 15), pack-reused 6051
Receiving objects: 100% (6128/6128), 1.87 MiB | 13.03 MiB/s, done.
Resolving deltas: 100% (4765/4765), done.


In [10]:
%cd onnx-tensorflow/
%pip install -e .

/content/onnx-tensorflow
Obtaining file:///content/onnx-tensorflow
Collecting onnx>=1.8.0
  Downloading https://files.pythonhosted.org/packages/3f/9b/54c950d3256e27f970a83cd0504efb183a24312702deed0179453316dbd0/onnx-1.9.0-cp37-cp37m-manylinux2010_x86_64.whl (12.2MB)
Collecting tensorflow_addons
  Downloading https://files.pythonhosted.org/packages/66/4b/e893d194e626c24b3df2253066aa418f46a432fdb68250cde14bf9bb0700/tensorflow_addons-0.13.0-cp37-cp37m-manylinux2010_x86_64.whl (679kB)
Installing collected packages: onnx, tensorflow-addons, onnx-tf
  Running setup.py develop for onnx-tf
Successfully installed onnx-1.9.0 onnx-tf tensorflow-addons-0.13.0


In [3]:
%cd onnx-tensorflow/
!git branch
%pip install -e .

[Errno 2] No such file or directory: 'onnx-tensorflow/'
/content/onnx-tensorflow
* (HEAD detached at v1.6.0-tf-1.15)
  master
Obtaining file:///content/onnx-tensorflow
Installing collected packages: onnx-tf
  Found existing installation: onnx-tf 1.6.0
    Can't uninstall 'onnx-tf'. No files were found to uninstall.
  Running setup.py develop for onnx-tf
Successfully installed onnx-tf


In [13]:
%cd onnx-tensorflow/
!git checkout master
!git branch 

[Errno 2] No such file or directory: 'onnx-tensorflow/'
/content/onnx-tensorflow
Previous HEAD position was 6b9e76d Create Release 1.6.0 for tf-1.x branch (#676)
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
* master


In [14]:
%pip install -e .

Obtaining file:///content/onnx-tensorflow
Installing collected packages: onnx-tf
  Found existing installation: onnx-tf 1.6.0
    Can't uninstall 'onnx-tf'. No files were found to uninstall.
  Running setup.py develop for onnx-tf
Successfully installed onnx-tf


In [1]:
%cd onnx-tensorflow/
!git checkout v1.6.0-tf-1.15


/content/onnx-tensorflow
HEAD is now at 6b9e76d Create Release 1.6.0 for tf-1.x branch (#676)


In [13]:
%cd onnx-tensorflow/


[Errno 2] No such file or directory: 'onnx-tensorflow/'
/content/onnx-tensorflow
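
The `[Errno 2]` message above appears because the kernel's working directory is already `/content/onnx-tensorflow` when `%cd onnx-tensorflow/` is run again. A small sketch of a re-runnable alternative is shown here. `enter_dir_once` is a hypothetical helper, and comparing only the last path component is a simplification that would be fooled by a nested folder of the same name.

```python
import os


def enter_dir_once(name):
    """Change into `name` unless the current directory is already `name`.

    Returns the working directory after the call, so the cell can be
    re-run without raising "No such file or directory".
    """
    if os.path.basename(os.getcwd()) != name:
        os.chdir(name)
    return os.getcwd()


# Usage in a notebook cell (the directory must exist on the first call):
# print(enter_dir_once("onnx-tensorflow"))
```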


In [2]:
%pip install -e .

Obtaining file:///content/onnx-tensorflow
Installing collected packages: onnx-tf
  Found existing installation: onnx-tf 1.6.0
    Can't uninstall 'onnx-tf'. No files were found to uninstall.
  Running setup.py develop for onnx-tf
Successfully installed onnx-tf


In [15]:
import sys
print(sys.executable)

/usr/bin/python3


In [9]:
from layoutlm import LayoutlmConfig, LayoutlmForTokenClassification
import torch

# Load the configuration and weights from the checkpoint directory
config = LayoutlmConfig.from_pretrained(
    PRETRAINED_MODEL
)

model = LayoutlmForTokenClassification.from_pretrained(
    PRETRAINED_MODEL,
    from_tf=False,
    config=config,
    cache_dir=None,
)

# Dummy batch of one 128-token sequence, used to trace the model for export
dummy_input = {
    "input_ids": torch.zeros(1, 128, dtype=torch.long),
    "bbox": torch.zeros(1, 128, 4, dtype=torch.long),
    "attention_mask": torch.ones(1, 128, dtype=torch.long),
    "token_type_ids": torch.ones(1, 128, dtype=torch.long),
}

dummy_output = model(**dummy_input)

print("Model output")
print(dummy_output)


Model output
(tensor([[[ 0.8069, -0.1788, -0.4051, -0.2850],
         [ 2.7433, -2.0159, -0.4349, -0.2556],
         [ 2.7621, -2.0289, -0.4286, -0.2532],
         [ 2.7492, -2.0275, -0.4348, -0.2430],
         [ 2.7467, -2.0339, -0.4338, -0.2303],
         [ 2.7492, -2.0386, -0.4296, -0.2257],
         [ 2.7362, -2.0325, -0.4301, -0.2362],
         [ 2.7345, -2.0248, -0.4347, -0.2265],
         [ 2.7510, -2.0378, -0.4413, -0.2196],
         [ 2.7688, -2.0385, -0.4472, -0.2155],
         [ 2.7815, -2.0299, -0.4554, -0.2153],
         [ 2.7901, -2.0295, -0.4565, -0.2276],
         [ 2.7840, -2.0086, -0.4553, -0.2370],
         [ 2.7899, -2.0141, -0.4502, -0.2415],
         [ 2.7865, -2.0097, -0.4500, -0.2458],
         [ 2.7906, -2.0139, -0.4488, -0.2501],
         [ 2.7991, -2.0167, -0.4519, -0.2573],
         [ 2.7824, -2.0061, -0.4605, -0.2621],
         [ 2.7848, -2.0127, -0.4633, -0.2577],
         [ 2.7836, -2.0172, -0.4679, -0.2557],
         [ 2.7872, -2.0200, -0.4726, -0.2500],

In [8]:
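# Directory holding the LayoutLM checkpoint to convert
PRETRAINED_MODEL = "/content/drive/MyDrive/data/Models/output1"

# Directory where the ONNX export is written
ONNX_MODEL_PATH = "/content/drive/MyDrive/data/Models/onnx2"
#ONNX_MODEL_PATH = "/content/drive/MyDrive/data/Models/onnx"

# Name shared by the ONNX file and the TensorFlow export folder
MODEL_NAME = "LayoutLMSROIE"

# Directory where the TensorFlow SavedModel is written
TF_MODEL_PATH = "/content/drive/MyDrive/data/Models/tf"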

In [20]:
%pip install onnx==1.8.0

Collecting onnx==1.8.0
  Downloading https://files.pythonhosted.org/packages/93/b6/382e24992ff643fc4293d4e660af982934e3149d2f77354812ca51830638/onnx-1.8.0-cp37-cp37m-manylinux2010_x86_64.whl (7.7MB)
Installing collected packages: onnx
  Found existing installation: onnx 1.9.0
    Uninstalling onnx-1.9.0:
      Successfully uninstalled onnx-1.9.0
Successfully installed onnx-1.8.0


In [3]:
import torch

In [10]:
import os

# Create the output directory, including missing parents, if needed
os.makedirs(ONNX_MODEL_PATH, exist_ok=True)
    
torch.onnx.export(
    model, 
    (
      dummy_input["input_ids"], 
      dummy_input["bbox"],
      dummy_input["attention_mask"],
      dummy_input["token_type_ids"],
    ),
    f"{ONNX_MODEL_PATH}/{MODEL_NAME}",
    verbose=True,
    input_names=['input_ids', 'input_bbox', 'attention_mask', 'token_type_ids'],
    output_names=["outputs"],
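    # Keys must match input_names/output_names; a key such as 'bbox' matches
    # neither and would leave input_bbox with the static shape (1, 128, 4)
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'max_seq'},
        'input_bbox': {0: 'batch', 1: 'max_seq'},
        'attention_mask': {0: 'batch', 1: 'max_seq'},
        'token_type_ids': {0: 'batch', 1: 'max_seq'},
    }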
)



graph(%input_ids : Long(*, *, strides=[128, 1], requires_grad=0, device=cpu),
      %input_bbox : Long(1, 128, 4, strides=[512, 4, 1], requires_grad=0, device=cpu),
      %attention_mask : Long(*, *, strides=[128, 1], requires_grad=0, device=cpu),
      %token_type_ids : Long(*, *, strides=[128, 1], requires_grad=0, device=cpu),
      %bert.embeddings.word_embeddings.weight : Float(30522, 768, strides=[768, 1], requires_grad=1, device=cpu),
      %bert.embeddings.position_embeddings.weight : Float(512, 768, strides=[768, 1], requires_grad=1, device=cpu),
      %bert.embeddings.x_position_embeddings.weight : Float(1024, 768, strides=[768, 1], requires_grad=1, device=cpu),
      %bert.embeddings.y_position_embeddings.weight : Float(1024, 768, strides=[768, 1], requires_grad=1, device=cpu),
      %bert.embeddings.h_position_embeddings.weight : Float(1024, 768, strides=[768, 1], requires_grad=1, device=cpu),
      %bert.embeddings.w_position_embeddings.weight : Float(1024, 768, strides=[76
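
In the graph printed above, `input_ids`, `attention_mask` and `token_type_ids` have `*` dimensions, while `input_bbox` keeps the static shape `(1, 128, 4)`. That is what one would expect when a `dynamic_axes` key does not match any entry of `input_names` or `output_names`, since the exporter looks the keys up by name. The sketch below is a plain-Python check that can be run before `torch.onnx.export`. `unmatched_dynamic_axes` is a hypothetical helper, not part of PyTorch.

```python
def unmatched_dynamic_axes(input_names, output_names, dynamic_axes):
    """Return the dynamic_axes keys that name no declared input or output."""
    declared = set(input_names) | set(output_names)
    return sorted(key for key in dynamic_axes if key not in declared)


input_names = ["input_ids", "input_bbox", "attention_mask", "token_type_ids"]
output_names = ["outputs"]
dynamic_axes = {
    "input_ids": {0: "batch", 1: "max_seq"},
    "attention_mask": {0: "batch", 1: "max_seq"},
    "token_type_ids": {0: "batch", 1: "max_seq"},
    "bbox": {0: "batch", 1: "max_seq"},  # does not match "input_bbox"
}

print(unmatched_dynamic_axes(input_names, output_names, dynamic_axes))
# prints ['bbox']
```

An empty list means every dynamic axis refers to a declared tensor name.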

In [11]:
import onnx_tf
import onnx

onnx_model = onnx.load(f"{ONNX_MODEL_PATH}/{MODEL_NAME}")
tf_rep = onnx_tf.backend.prepare(onnx_model)




The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.


In [8]:
!pip install tensorflow-addons




In [9]:
import onnx_tf
import onnx

onnx_model = onnx.load(f"{ONNX_MODEL_PATH}/{MODEL_NAME}")
tf_rep = onnx_tf.backend.prepare(onnx_model)

 The versions of TensorFlow you are currently using is 1.15.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons


ImportError: ignored

In [11]:

onnx.checker.check_model(onnx_model, full_check=True)
onnx.helper.printable_graph(onnx_model.graph)

'graph torch-jit-export (\n  %input_ids[INT64, batchxmax_seq]\n  %input_bbox[INT64, 1x128x4]\n  %attention_mask[INT64, batchxmax_seq]\n  %token_type_ids[INT64, batchxmax_seq]\n) initializers (\n  %bert.embeddings.word_embeddings.weight[FLOAT, 30522x768]\n  %bert.embeddings.position_embeddings.weight[FLOAT, 512x768]\n  %bert.embeddings.x_position_embeddings.weight[FLOAT, 1024x768]\n  %bert.embeddings.y_position_embeddings.weight[FLOAT, 1024x768]\n  %bert.embeddings.h_position_embeddings.weight[FLOAT, 1024x768]\n  %bert.embeddings.w_position_embeddings.weight[FLOAT, 1024x768]\n  %bert.embeddings.token_type_embeddings.weight[FLOAT, 2x768]\n  %bert.embeddings.LayerNorm.weight[FLOAT, 768]\n  %bert.embeddings.LayerNorm.bias[FLOAT, 768]\n  %bert.encoder.layer.0.attention.self.query.bias[FLOAT, 768]\n  %bert.encoder.layer.0.attention.self.key.bias[FLOAT, 768]\n  %bert.encoder.layer.0.attention.self.value.bias[FLOAT, 768]\n  %bert.encoder.layer.0.attention.output.dense.bias[FLOAT, 768]\n  %bert

In [12]:
tf_rep = onnx_tf.backend.prepare(onnx_model)

SchemaError: ignored

In [52]:
tf_rep.inputs

['input_ids', 'input_bbox', 'attention_mask', 'token_type_ids']

In [53]:
tf_rep.tf_module

<onnx_tf.backend_tf_module.BackendTFModule at 0x7f14377eded0>

In [12]:
tf_rep.tensor_dict

{'bert.embeddings.word_embeddings.weight': <tf.Tensor 'bert.embeddings.word_embeddings.weight:0' shape=(30522, 768) dtype=float32>,
 'bert.embeddings.position_embeddings.weight': <tf.Tensor 'bert.embeddings.position_embeddings.weight:0' shape=(512, 768) dtype=float32>,
 'bert.embeddings.x_position_embeddings.weight': <tf.Tensor 'bert.embeddings.x_position_embeddings.weight:0' shape=(1024, 768) dtype=float32>,
 'bert.embeddings.y_position_embeddings.weight': <tf.Tensor 'bert.embeddings.y_position_embeddings.weight:0' shape=(1024, 768) dtype=float32>,
 'bert.embeddings.h_position_embeddings.weight': <tf.Tensor 'bert.embeddings.h_position_embeddings.weight:0' shape=(1024, 768) dtype=float32>,
 'bert.embeddings.w_position_embeddings.weight': <tf.Tensor 'bert.embeddings.w_position_embeddings.weight:0' shape=(1024, 768) dtype=float32>,
 'bert.embeddings.token_type_embeddings.weight': <tf.Tensor 'bert.embeddings.token_type_embeddings.weight:0' shape=(2, 768) dtype=float32>,
 'bert.embeddings.

In [19]:
tf_rep.outputs

['outputs']

In [21]:
tf_rep.outputs[0]

'outputs'

In [14]:
import os
import shutil

import tensorflow.compat.v1 as tf

tf.reset_default_graph()

# Graph tensors created by onnx-tf, looked up by their ONNX names
input_ids_tensor = tf_rep.tensor_dict["input_ids"]
input_bbox_tensor = tf_rep.tensor_dict["input_bbox"]
attention_mask_tensor = tf_rep.tensor_dict["attention_mask"]
token_type_ids_tensor = tf_rep.tensor_dict["token_type_ids"]

output_tensor = tf_rep.tensor_dict["outputs"]

# simple_save will not write into an existing export directory
export_dir = f"{TF_MODEL_PATH}/{MODEL_NAME}"
shutil.rmtree(export_dir, ignore_errors=True)

with tf.Session(graph=tf_rep.graph) as session:

    # Dummy variable so that the SavedModel saver has something to save
    a = tf.Variable(0)

    init = tf.global_variables_initializer()

    # The ONNX graph has a single output (the logits); all four names
    # below are aliases of that same tensor
    loss = tf.identity(output_tensor, name="loss")
    logits = tf.identity(output_tensor, name="logits")
    hidden_states = tf.identity(output_tensor, name="hidden_states")
    attentions = tf.identity(output_tensor, name="attentions")

    session.run(init)

    tf.saved_model.simple_save(
        session,
        export_dir,
        inputs={
            "input_ids": input_ids_tensor,
            "input_bbox": input_bbox_tensor,
            "attention_mask": attention_mask_tensor,
            "token_type_ids": token_type_ids_tensor
        },
        outputs={
            "loss": loss,
            "logits": logits,
            "hidden_states": hidden_states,
            "attentions": attentions
        }
    )

# Store the label list and the vocabulary next to the SavedModel
assets_dir = f"{export_dir}/assets"
os.makedirs(assets_dir, exist_ok=True)
with open(f"{assets_dir}/labels.txt", "w") as F:
    for label_id in config.id2label:
        F.write(config.id2label[label_id] + "\n")

shutil.copy(f"{PRETRAINED_MODEL}/vocab.txt", f"{assets_dir}/vocab.txt")

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.simple_save.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /content/drive/MyDrive/data/Models/tf/LayoutLMSROIE/saved_model.pb


'/content/drive/MyDrive/data/Models/tf/LayoutLMSROIE/assets/vocab.txt'

In [15]:
!cd $TF_MODEL_PATH/$MODEL_NAME ; zip -r ../"$MODEL_NAME".zip . *

  adding: variables/ (stored 0%)
  adding: variables/variables.data-00000-of-00001 (stored 0%)
  adding: variables/variables.index (deflated 39%)
  adding: saved_model.pb (deflated 8%)
  adding: assets/ (stored 0%)
  adding: assets/labels.txt (deflated 38%)
  adding: assets/vocab.txt (deflated 53%)
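
The same archive can be produced without shelling out. The sketch below uses `shutil.make_archive` from the standard library. `archive_model` is a hypothetical helper name, and it assumes the archive should sit next to the model folder, as in the `zip` command above.

```python
import os
import shutil


def archive_model(model_dir):
    """Zip the contents of `model_dir` into <parent>/<folder name>.zip.

    Paths inside the archive are relative to `model_dir`, so the archive
    root holds saved_model.pb, variables/ and assets/ directly.
    Returns the path of the archive that was written.
    """
    model_dir = os.path.abspath(model_dir)
    parent, name = os.path.split(model_dir)
    return shutil.make_archive(
        os.path.join(parent, name), "zip", root_dir=model_dir
    )


# Usage with the variables defined earlier in this notebook:
# archive_model(f"{TF_MODEL_PATH}/{MODEL_NAME}")
```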


In [17]:
!ls $TF_MODEL_PATH/$MODEL_NAME/assets

labels.txt  vocab.txt


In [18]:
!cat $TF_MODEL_PATH/$MODEL_NAME/assets/labels.txt

LABEL_0
LABEL_1
LABEL_2
LABEL_3
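
The names `LABEL_0` to `LABEL_3` look like the generic placeholders a configuration carries when no `id2label` mapping was stored with the checkpoint. A consumer of `labels.txt` will most likely treat the line number as the class id, so it helps to write the file in id order rather than in dictionary order. The sketch below illustrates that. `write_labels` and `read_labels` are hypothetical helpers, and the label names in the example are made up for illustration; they are not the labels used in this notebook.

```python
import os
import tempfile


def write_labels(id2label, path):
    """Write one label per line so that line i holds the label with id i.

    Keys may be ints or numeric strings (as they are after a JSON round trip).
    """
    with open(path, "w", encoding="utf-8") as handle:
        for label_id in sorted(id2label, key=int):
            handle.write(f"{id2label[label_id]}\n")


def read_labels(path):
    """Read the label file back into a list indexed by class id."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Hypothetical label names, deliberately listed out of order.
id2label = {2: "DATE", 0: "O", 3: "TOTAL", 1: "COMPANY"}

path = os.path.join(tempfile.mkdtemp(), "labels.txt")
write_labels(id2label, path)
print(read_labels(path))
# prints ['O', 'COMPANY', 'DATE', 'TOTAL']
```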


In [13]:
from layoutlm import LayoutlmConfig, LayoutlmForSequenceClassification
import torch

# Load the configuration and weights from the checkpoint directory
config = LayoutlmConfig.from_pretrained(
    PRETRAINED_MODEL
)

model = LayoutlmForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL,
    from_tf=False,
    config=config,
    cache_dir=None,
)

# Dummy batch of one 128-token sequence, used to trace the model for export
dummy_input = {
    "input_ids": torch.zeros(1, 128, dtype=torch.long),
    "bbox": torch.zeros(1, 128, 4, dtype=torch.long),
    "attention_mask": torch.ones(1, 128, dtype=torch.long),
    "token_type_ids": torch.ones(1, 128, dtype=torch.long),
}

dummy_output = model(**dummy_input)

print("Model output")
print(dummy_output)


Model output
(tensor([[-0.7394, -0.2048,  0.0249,  0.1010]], grad_fn=<AddmmBackward>),)
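
With the sequence-classification head the model returns one row of four logits per document instead of one row per token. The export cell further down derives probabilities and a predicted class from such logits with a softmax and an argmax. The sketch below does the same computation in plain Python on the values printed above, only to make the arithmetic explicit; it is not the code path used by the exported model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Logits copied from the dummy forward pass above
logits = [-0.7394, -0.2048, 0.0249, 0.1010]

probs = softmax(logits)
prediction = argmax(logits)

print([round(p, 3) for p in probs])
print(prediction)  # prints 3
```

Because softmax is monotonic, taking the argmax of the logits or of the probabilities gives the same class.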


In [14]:
import os

# Create the output directory, including missing parents, if needed
os.makedirs(ONNX_MODEL_PATH, exist_ok=True)
    
torch.onnx.export(
    model, 
    (
      dummy_input["input_ids"], 
      dummy_input["bbox"],
      dummy_input["attention_mask"],
      dummy_input["token_type_ids"],
    ),
    f"{ONNX_MODEL_PATH}/{MODEL_NAME}",
    verbose=False,
    input_names=['input_ids', 'input_bbox', 'attention_mask', 'token_type_ids'],
    output_names=["outputs"],
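    # Keys must match input_names/output_names; a key such as 'bbox' matches
    # neither and would leave input_bbox with a static shape
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'max_seq'},
        'input_bbox': {0: 'batch', 1: 'max_seq'},
        'attention_mask': {0: 'batch', 1: 'max_seq'},
        'token_type_ids': {0: 'batch', 1: 'max_seq'},
    }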
)



In [15]:
import onnx_tf
import onnx

onnx_model = onnx.load(f"{ONNX_MODEL_PATH}/{MODEL_NAME}")

In [20]:
onnx_model

In [16]:

tf_rep = onnx_tf.backend.prepare(onnx_model)

In [19]:
tf_rep.tensor_dict

{}
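
Here `tensor_dict` is empty, whereas the earlier run listed every weight tensor. The output does not show why, but the consequence is that the export cell that follows fails with a bare `KeyError` on its first lookup. A small guard like the one sketched below makes that failure easier to read. `require_tensors` is a hypothetical helper and works on any mapping, not only on the onnx-tf object.

```python
def require_tensors(tensor_dict, names):
    """Return the tensors for `names`, in order.

    Raises a KeyError that lists both the missing names and the names
    that are actually available, instead of failing on the first lookup.
    """
    missing = [name for name in names if name not in tensor_dict]
    if missing:
        available = ", ".join(sorted(tensor_dict)) or "<empty>"
        raise KeyError(f"missing tensors {missing}; available: {available}")
    return [tensor_dict[name] for name in names]


# Usage with the names declared at export time:
# input_ids_tensor, input_bbox_tensor = require_tensors(
#     tf_rep.tensor_dict, ["input_ids", "input_bbox"]
# )
```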

In [17]:
import os
import shutil

import tensorflow.compat.v1 as tf

tf.reset_default_graph()

# Graph tensors created by onnx-tf, looked up by their ONNX names
input_ids_tensor = tf_rep.tensor_dict["input_ids"]
input_bbox_tensor = tf_rep.tensor_dict["input_bbox"]
attention_mask_tensor = tf_rep.tensor_dict["attention_mask"]
token_type_ids_tensor = tf_rep.tensor_dict["token_type_ids"]

output_tensor = tf_rep.tensor_dict["outputs"]

# simple_save will not write into an existing export directory
export_dir = f"{TF_MODEL_PATH}/{MODEL_NAME}"
shutil.rmtree(export_dir, ignore_errors=True)

with tf.Session(graph=tf_rep.graph) as session:

    # Dummy variable so that the SavedModel saver has something to save
    a = tf.Variable(0)

    init = tf.global_variables_initializer()

    # Expose the raw logits, their softmax and the predicted class id
    logits = tf.identity(output_tensor, name="logits")
    probs = tf.nn.softmax(output_tensor, axis=-1, name="probs")
    predictions = tf.argmax(logits, axis=-1, name="predictions")

    session.run(init)

    tf.saved_model.simple_save(
        session,
        export_dir,
        inputs={
            "input_ids": input_ids_tensor,
            "input_bbox": input_bbox_tensor,
            "attention_mask": attention_mask_tensor,
            "token_type_ids": token_type_ids_tensor
        },
        outputs={
            "logits": logits,
            "probs": probs,
            "predictions": predictions
        }
    )

# Store the label list and the vocabulary next to the SavedModel
assets_dir = f"{export_dir}/assets"
os.makedirs(assets_dir, exist_ok=True)
with open(f"{assets_dir}/labels.txt", "w") as F:
    for label_id in config.id2label:
        F.write(config.id2label[label_id] + "\n")

shutil.copy(f"{PRETRAINED_MODEL}/vocab.txt", f"{assets_dir}/vocab.txt")

KeyError: ignored