# Fine-tune LayoutLM on SROIE
This notebook is an effort to fine-tune the LayoutLM model on the SROIE dataset. The model is presented in the paper "[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)" by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei and Ming Zhou.

- GitHub repo [here](https://github.com/microsoft/unilm/tree/master/layoutlm).

- Read about the SROIE competition and dataset [here](https://rrc.cvc.uab.es/?ch=13).

- Inspiration from this Kaggle notebook [here](https://www.kaggle.com/jpmiller/layoutlm-starter)

## Notes:
- The repo includes a pre-processing script and fine-tuning for the FUNSD dataset, but not for the SROIE dataset (though the paper reports results on SROIE). This notebook intends to fill that gap.

- I have used my Google Drive to manage the files. If you want to use it, just change the folder names (both the ones where you keep the SROIE files and the ones where you keep the LayoutLM files).

- The best F1 results I got on the predictions were between 93% and 94.5%, which is a bit less than the value presented in the paper (~94% to 95%). The differences may be explained by:
  - different parameters (I haven't done an exhaustive grid search)
  - different sampling
  - different pre-processing. This one is far from perfect; some labels and invoices are lost along the way.
  - a different OCR base. As I understood, the authors also did their own OCR, while I ran from the one provided in the dataset
  - I was having difficulties with the label "company address", so I dropped it
  - any other differences, as the paper doesn't explain this fine-tuning in detail

- Make sure you have GPU enabled on the notebook (Edit->Notebook settings)

- Yes I know, the code is horrible and badly explained, sorry for that. Nevertheless, hope it helps somehow

# 1. Pre-process dataset

In [1]:
# Imports  
import os
import pandas as pd
import glob
import json 
import ast
import re
import random

In [2]:
# Connection to google drive
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


In [3]:
# Define path for the dataset files (you should previously download the dataset from the link given at the header of the notebook)
# This is the folder with the files that contain the bounding boxes and the words
spath_words = '/content/drive/MyDrive/Colab Notebooks/LayoutLM /SROIE2019/0325updated.task1train(626p)/'
os.chdir(spath_words)
# Create a dataframe to store and manage the invoices bounding boxes and words
df_sentences = pd.DataFrame(columns=['filename', 'sentence'])

# Loops over every file in the folder
for file in glob.glob("*.txt"):
  try:
    # Treat each invoice as a sentence and a row of the df
    sfullpath = spath_words + file
    df_file = pd.read_csv(sfullpath, header=None, names=['x0', 'y0', 'x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'words'])
    if not df_file['words'].isnull().values.any():
      sentence_list = [str(i) for i in df_file['words']]
      bbox_list = []
      for index, row in df_file.iterrows():
        bbox_list.append([row['x0'],row['y0'],row['x2'],row['y2']])
      new_row = {'filename':file, 'sentence':sentence_list, 'bboxes':bbox_list}
      # Append row to the dataframe
      # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
      df_sentences = pd.concat([df_sentences, pd.DataFrame([new_row])], ignore_index=True)
  except Exception as e:
    # There are a few problems, we will just ignore them and print the error associated with it
    print(file + " | " + repr(e))

X51006619545.txt | ParserError('Error tokenizing data. C error: EOF inside string starting at row 78')
X51006619785.txt | ParserError('Error tokenizing data. C error: EOF inside string starting at row 77')
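
The two `ParserError` messages above ("EOF inside string") suggest that the text of those files contains an unbalanced quote character, which `read_csv` interprets as CSV syntax. A minimal sketch of a more tolerant parser follows, assuming each line is eight integer coordinates followed by the text. The helper name `parse_sroie_line` is hypothetical and not part of the original notebook.

```python
def parse_sroie_line(line):
    """Split one annotation line into ([x0, y0, x2, y2], text).

    Assumes the layout x0,y0,x1,y1,x2,y2,x3,y3,text, where the text itself
    may contain commas or quote characters.
    """
    # Split on the first 8 commas only, so the text field stays whole
    parts = line.rstrip("\n").split(",", 8)
    if len(parts) < 9:
        raise ValueError("expected 8 coordinates and a text field: %r" % line)
    coords = [int(p) for p in parts[:8]]
    # Keep top-left (x0, y0) and bottom-right (x2, y2), as in the loop above
    return [coords[0], coords[1], coords[4], coords[5]], parts[8]


box, text = parse_sroie_line('430,345,586,345,586,377,430,377,NO. 1, JALAN "A"')
print(box, text)
```

Splitting by hand avoids CSV quoting rules altogether, at the cost of having to skip blank lines yourself.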


In [4]:
df_file

Unnamed: 0,x0,y0,x1,y1,x2,y2,x3,y3,words
0,236,306,772,306,772,347,236,347,LEE WAH FLORIST SDN BHD
1,430,345,586,345,586,377,430,377,(521273-W)
2,334,385,676,385,676,419,334,419,129
3,346,424,663,424,663,456,346,456,50000 KUALA LUMPUR.
4,203,486,259,486,259,515,203,515,TEL
...,...,...,...,...,...,...,...,...,...
114,530,2102,616,2102,616,2128,530,2128,353.00
115,726,2063,796,2063,796,2090,726,2090,21.18
116,726,2104,798,2104,798,2130,726,2130,21.18
117,189,2154,777,2154,777,2182,189,2182,GOODS SOLD ARE NOT RETURNABLE


In [5]:
df_sentences

Unnamed: 0,filename,sentence,bboxes
0,X51005757324.txt,"[MR. D.I.Y. (M) SDN BHD, (CO. REG :860671-D), ...","[[153, 253, 516, 286], [190, 287, 482, 318], [..."
1,X51005757304.txt,"[#000002 BAIFU (M) SDN BHD, COMPANY NO(814198-...","[[147, 153, 549, 193], [195, 193, 514, 226], [..."
2,X51005757323.txt,"[MR. D.I.Y. (M) SDN BHD, (CO. REG :860671-D), ...","[[170, 267, 533, 301], [206, 299, 497, 335], [..."
3,X51005757346.txt,"[MR. D.I.Y. (M) SDN BHD, (CO. REG :860671-D), ...","[[137, 257, 524, 294], [189, 294, 482, 327], [..."
4,X51005757294.txt,"[MR. D.I.Y. (M) SDN BHD, (CO. REG :860671-D), ...","[[156, 251, 525, 281], [195, 284, 486, 317], [..."
...,...,...,...
825,X51005763940 (2).txt,"[HARVEY NORMAN, HARVEY NORMAN M'SIA PARADIGM M...","[[237, 111, 470, 152], [20, 224, 485, 259], [2..."
826,X51005806678 (2).txt,"[KAISON FURNISHING SDN BHD, L4-17 (B), UP2-01,...","[[333, 214, 698, 252], [378, 279, 652, 318], [..."
827,X51005757351 (2).txt,"[MR. D.I.Y. (M) SDN BHD, (CO. REG :860671-D), ...","[[149, 259, 512, 292], [187, 293, 477, 324], [..."
828,X51005757353 (2).txt,"[MR. D.I.Y. (M) SDN BHD, (CO.REG :860671-D), L...","[[141, 272, 503, 304], [179, 306, 468, 339], [..."


In [6]:
# Define path for the dataset files (you should previously download the dataset from the link given at the header of the notebook)
# This is the folder with the files that contain the values (company name, date, address and total)
spath_labels = '/content/drive/MyDrive/Colab Notebooks/LayoutLM /SROIE2019/0325updated.task2train(626p)/'
os.chdir(spath_labels)
# Create a dataframe to store and manage the invoices tags
df_labels = pd.DataFrame(columns=['filename', 'value_company', 'value_date', 'value_address', 'value_total'])

for file in glob.glob("*.txt"):
  try:
    with open(file, 'r') as fileread:
      data = json.load(fileread)
    new_row = {'filename':file, 'value_company':data['company'], 'value_date':data['date'], 'value_address':data['address'], 'value_total':data['total']}
    # Append row to the dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_labels = pd.concat([df_labels, pd.DataFrame([new_row])], ignore_index=True)
  except Exception as e:
    print(file + " | " + repr(e))

X51005663280.txt | KeyError('address')
X51005663280 (1).txt | KeyError('address')


In [7]:
# Now let's merge the two dataframes based on the filename
df = pd.merge(df_sentences,df_labels,on='filename')

In [8]:
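# In case the df is stored on drive, just uncomment the next line
# df = pd.read_csv('/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/df.csv')
# A df reloaded from CSV carries an extra index column; errors='ignore' makes this a no-op otherwise
df = df.drop(columns=['Unnamed: 0'], errors='ignore')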

In [None]:
df.columns

In [9]:
# Parse the list columns back into Python lists (when loaded from CSV they are stored as strings)
df['sentence'] = df['sentence'].map(lambda a: ast.literal_eval(str(a)))
df['bboxes'] = df['bboxes'].map(lambda a: ast.literal_eval(str(a)))

In [None]:
df.head(5)

In [10]:
# Define some auxiliary functions
def a_in_x(A, X):
  '''
  Returns list with indexes of elements of list X which contain A
  '''
  l = []
  for i in range(len(X) - len(A) + 1):
    if str(A[0]) in str(X[i:i+len(A)][0]): 
      l.append(i)
  return l

def flat_list_one_level(l):
  '''
  Flattens list
  Doesn't include second level list of lists, only first level
  '''
  flat_list = []
  for sublist in l:
    if type(sublist) is list:
      for item in sublist:
          flat_list.append(item)
    else:
      flat_list.append(sublist)
  return flat_list

def flat_list_one_level_list_of_lists(l):
  '''
  Flattens list
  Flattens only the first element of the sub-list
  '''
  flat_list = []
  for sublist in l:
    if type(sublist) is list and len(sublist) > 0 and type(sublist[0]) is list:
      for item in sublist:
        flat_list.append(item)
    else:
      flat_list.append(sublist)
  return flat_list
    
def intersperse(lst, item):
  '''
  Places an item between elements of a list
  '''
  result = [item] * (len(lst) * 2 - 1)
  result[0::2] = lst
  return result

def split_box(box, n_splits):
  '''
  Splits a bbox [x0,y0,x1,y1] by its coordinates into n_splits bboxes of equal size
  '''
  boxs_splitted = []
  x0 = box[0]
  y0 = box[1]
  x1 = box[2]
  y1 = box[3]
  width = x1 - x0
  for i_split in range(0, n_splits):
    boxs_splitted.append([x0 + i_split * int(width/n_splits), y0, x0 + (i_split + 1) * int(width/n_splits), y1])
  return boxs_splitted

def split_box_weighted(box, l_splits):
  '''
  Splits a bbox [x0,y0,x1,y1] by its coordinates into len(l_splits)
  The size of each bbox is proportional to the weight present in l_splits
  '''
  boxs_splitted = []
  x0 = box[0]
  y0 = box[1]
  x1 = box[2]
  y1 = box[3]
  width = x1 - x0
  sum_splits = sum(l_splits)
  for i_split in l_splits:
    split_fraction = i_split/sum_splits
    x1f = x0 + int(width * split_fraction)
    boxs_splitted.append([x0, y0, x1f, y1])
    x0 = x1f
  return boxs_splitted
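
As a worked example of the weighted split, the sketch below applies it to the first line of the first invoice shown later in this notebook. The function is a condensed copy of `split_box_weighted`, repeated only so that the example runs on its own.

```python
def split_box_weighted(box, weights):
    """Condensed copy of the helper above: split a box horizontally by weight."""
    x0, y0, x1, y1 = box
    width = x1 - x0
    total = sum(weights)
    boxes = []
    for w in weights:
        # Each piece gets a share of the width proportional to its weight
        x_end = x0 + int(width * w / total)
        boxes.append([x0, y0, x_end, y1])
        x0 = x_end
    return boxes


words = ["MR.", "D.I.Y.", "(M)", "SDN", "BHD"]
boxes = split_box_weighted([153, 253, 516, 286], [len(w) for w in words])
for word, word_box in zip(words, boxes):
    print(word, word_box)
```

Because each share is truncated with `int`, the last box ends at 514 rather than 516. A couple of pixels are lost per line, which is negligible once the coordinates are rescaled.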

In [11]:
# Define function to set the labels to the words
def define_labels(pos, sent, labels, bbox, class_value, classification, label_other = 'O'):
  # pos is a list with the positions of the groups of words associated with this label
  # So this loops over each group of words which has some relation to the label
  for i_pos in pos:
    if sent[i_pos] == class_value:
      # If the group of words is equal to the class value, then this group of words is given the label
      labels[i_pos] = classification
    else:
      # The value is contained within the group of words, so we have to split the group (ex: [... , "Date: 01/01/2020", ...] -> [..., ["Date: ", "01/01/2020"], ...])
      # We start by replacing the group of words with a split list
      sent[i_pos] = intersperse((sent[i_pos].split(str(class_value))), str(class_value))
      # This split may leave an empty or whitespace-only element at the first or last position, so we have to remove it
      if sent[i_pos][0].isspace() or len(sent[i_pos][0])==0: sent[i_pos] = sent[i_pos][1:]
      if sent[i_pos][-1].isspace() or len(sent[i_pos][-1])==0: sent[i_pos] = sent[i_pos][0:-1]
      # Now we may associate the labels with the correct group of words (ex: [... , "Date: 01/01/2020", ...] -> [..., ["Date: ", "01/01/2020"], ...], the labels would be [..., ["O", "B-DATE"], ...])
      labels[i_pos] = [classification if s == class_value else label_other for s in sent[i_pos]]
      # The bounding boxes should also be split
      # Here we do it proportionally to the number of chars of the words
      bbox[i_pos] = split_box_weighted(bbox[i_pos], [len(i) for i in sent[i_pos]])

  # The obtained lists have now some second level lists, so we have to flatten
  sent = flat_list_one_level(sent)
  labels = flat_list_one_level(labels)
  bbox = flat_list_one_level_list_of_lists(bbox)
  return sent, labels, bbox
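
To make the splitting step concrete, here is a small standalone sketch of what happens to a single group of words that contains the date. It repeats `intersperse` so that it runs on its own, and it simplifies the cleanup: every empty piece is dropped, whereas `define_labels` only checks the first and last one.

```python
def intersperse(lst, item):
    # Same helper as above, repeated so the example runs on its own
    result = [item] * (len(lst) * 2 - 1)
    result[0::2] = lst
    return result


line = "DATE: 01/01/2020"
value = "01/01/2020"

# Splitting on the value and re-inserting it keeps the value as its own piece
pieces = intersperse(line.split(value), value)
# Simplification: drop every empty or whitespace-only piece
pieces = [p for p in pieces if p.strip()]
labels = ["B-DATE" if p == value else "O" for p in pieces]

for piece, label in zip(pieces, labels):
    print(repr(piece), label)
```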

In [12]:
# Finally the loop to create lists with the sentences and their corresponding labels and bboxes
sentences_list = []
labels_list = []
bbox_list = []
class_other = 'O'
for index, row in df.iterrows():
  labels = [class_other] * len(row['sentence'])
  sent = row['sentence'].copy()
  bbox = row['bboxes'].copy()
  
  # Define labels for date
  class_value = row['value_date']
  classification = 'B-DATE'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)
  
  # Define labels for total value
  class_value = row['value_total']
  classification = 'B-TOTAL'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    if class_value=="":
      class_value = " "
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Define labels for company name
  class_value = row['value_company']
  classification = 'B-COMPANY'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Define labels for address 
  # class_value = row['value_address']
  # classification = 'B-ADDRESS'
  # pos = a_in_x([class_value], sent)
  # if len(pos) > 0:
  #   sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Appends the group of words, labels and bboxes to lists
  sentences_list.append(sent.copy())
  labels_list.append(labels.copy())
  bbox_list.append(bbox.copy())

In [None]:
sent

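At this point each invoice is a list whose elements are groups of words (an OCR line or a fragment of it), each with a single label and a single bounding box.

To get one label and one box per word, we split each group into single words and divide its bounding box accordingly.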

In [13]:
def break_sentences(sl, ll, bl):
  # sl: sentences, ll: labels, bl: bounding boxes (one list per invoice)
  sentences_list_temp = []
  bbox_list_temp = []
  labels_list_temp = []
  for sents, labels, boxs in zip(sl, ll, bl):
    sentences_list3 = []
    bbox_list3 = []
    labels_list3 = []
    for sent, label, box in zip(sents, labels, boxs):
      word_tokens = sent.split(" ")
      # Drop the empty tokens left by repeated spaces
      word_tokens = [w for w in word_tokens if (w != "" and w != " ")] 
      sentences_list3.extend(word_tokens)
      splitted_boxes = split_box_weighted(box, [len(i) for i in word_tokens])
      bbox_list3.extend(splitted_boxes)
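      # BO scheme: every word of the group keeps the label of the group
      labels_list3.extend([label] * len(word_tokens))
      # BIO scheme (alternative): the first word keeps B-, the following words get I-
      #labels_list3.extend([label] + [label.replace('B-','I-')] * (len(word_tokens) - 1))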
    sentences_list_temp.append(sentences_list3)
    bbox_list_temp.append(bbox_list3)
    labels_list_temp.append(labels_list3)
  return sentences_list_temp, bbox_list_temp, labels_list_temp
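
The commented-out BIO variant above can be sketched as a small helper. This is only an illustration: `bio_tags` is a hypothetical name, and unlike the one-line version it returns no tag at all when a group has no words left after stripping.

```python
def bio_tags(label, n_words):
    """Return one tag per word: B- for the first word, I- for the rest."""
    if n_words == 0:
        return []
    if label == "O":
        return ["O"] * n_words
    # Only the leading "B-" prefix is rewritten
    inside = label.replace("B-", "I-", 1)
    return [label] + [inside] * (n_words - 1)


print(bio_tags("B-COMPANY", 5))
print(bio_tags("O", 3))
```

With BIO tagging, the `I-` labels also need to end up in `labels.txt`, otherwise the training script will not know them.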

In [14]:
sentences_list, bbox_list, labels_list = break_sentences(sentences_list, labels_list, bbox_list)

In [15]:
# Check the first invoice data
for s, l, b in zip(sentences_list[0],labels_list[0],bbox_list[0]):
  print("{}\t\t{}\t\t{}".format(s,l,b))

MR.		B-COMPANY		[153, 253, 213, 286]
D.I.Y.		B-COMPANY		[213, 253, 334, 286]
(M)		B-COMPANY		[334, 253, 394, 286]
SDN		B-COMPANY		[394, 253, 454, 286]
BHD		B-COMPANY		[454, 253, 514, 286]
(CO.		O		[190, 287, 258, 318]
REG		O		[258, 287, 309, 318]
:860671-D)		O		[309, 287, 480, 318]
LOT		O		[56, 321, 157, 354]
1851-A		O		[157, 321, 359, 354]
&		O		[359, 321, 392, 354]
1851-B		O		[392, 321, 594, 354]
KAWASAN		O		[70, 355, 197, 388]
PERINDUSTRIAN		O		[197, 355, 433, 388]
BALAKONG		O		[433, 355, 578, 388]
43300		O		[88, 389, 225, 418]
SERI		O		[225, 389, 335, 418]
KEMBANGAN		O		[335, 389, 582, 418]
(GST		O		[124, 424, 198, 455]
ID		O		[198, 424, 235, 455]
NO		O		[235, 424, 272, 455]
:000306020352)		O		[272, 424, 531, 455]
(IOI		O		[222, 455, 292, 486]
PUCHONG)		O		[292, 455, 432, 486]
-TAX		O		[220, 490, 291, 518]
INVOICE-		O		[291, 490, 433, 518]
TRAC		O		[8, 557, 84, 588]
DRY		O		[84, 557, 141, 588]
IRON		O		[141, 557, 217, 588]
TR-231IR		O		[217, 557, 370, 588]
-		O		[370, 557, 389, 588]

Now everything is ready to write the files in the format expected by the LayoutLM sequence labeling script.

In [16]:
def bbox_string(box, width, length):
    return (
        str(int(1000 * (box[0] / width)))
        + " "
        + str(int(1000 * (box[1] / length)))
        + " "
        + str(int(1000 * (box[2] / width)))
        + " "
        + str(int(1000 * (box[3] / length)))
    )

def actual_bbox_string(box, width, length):
    return (
        str(box[0])
        + " "
        + str(box[1])
        + " "
        + str(box[2])
        + " "
        + str(box[3])
        + "\t"
        + str(width)
        + " "
        + str(length)
    )

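def size(bboxes):
  '''
  Estimates the page size from the bboxes, since the image itself is not read here
  Assumes the margin after the last box is the same as the margin before the first one
  Returns (height, width)
  '''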
  max_width = 0
  max_height = 0
  min_x0 = 10e8
  min_y0 = 10e8
  for box in bboxes:
    if box[0] < min_x0: min_x0 = box[0]
    if box[1] < min_y0: min_y0 = box[1]
    if box[2] > max_width: max_width = box[2]
    if box[3] > max_height: max_height = box[3]
  max_width += min_x0
  max_height += min_y0
  return max_height, max_width

def get_unique(some_array, seen=None):
    if seen is None:
        seen = set()
    for i in some_array:
        if isinstance(i, list):
            # The recursive call adds to `seen` in place, so its return value is not needed
            get_unique(i, seen)
        else:
            seen.add(i)
    return list(seen)
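
LayoutLM expects box coordinates on a 0 to 1000 scale, which is what `bbox_string` produces by dividing by the page width and height. The sketch below shows the same scaling with a defensive clamp added. The clamp and the name `normalize_box` are assumptions for illustration (for example, for the case where the page size is read from the image and a box spills slightly outside it), not something the original code does.

```python
def normalize_box(box, width, height):
    """Scale [x0, y0, x1, y1] in pixels to integers in the 0..1000 range."""
    def scale(value, extent):
        # Truncate like bbox_string, then clamp to the valid range
        return max(0, min(1000, int(1000 * value / extent)))

    x0, y0, x1, y1 = box
    return [scale(x0, width), scale(y0, height), scale(x1, width), scale(y1, height)]


normalized = normalize_box([153, 253, 516, 286], width=700, height=1400)
print(normalized)
```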


In [17]:
def write_files(output_dir, data_split, sentences_list, labels_list, bbox_list, split_indexes):
  with open(
      os.path.join(output_dir, data_split + ".txt"),
      "w",
      encoding="utf8",
  ) as fw, open(
      os.path.join(output_dir, data_split + "_box.txt"),
      "w",
      encoding="utf8",
  ) as fbw, open(
      os.path.join(output_dir, data_split + "_image.txt"),
      "w",
      encoding="utf8",
  ) as fiw:
      for index in split_indexes:
          sent = sentences_list[index]
          lab = labels_list[index]
          boxes = bbox_list[index]
          length, width = size(boxes)

          for words, label, box in zip(sent, lab, boxes):
              fw.write("{}\t{}\n".format(words, label))
              fbw.write("{}\t{}\n".format(words, bbox_string(box, width, length)))
              fiw.write("{}\t{}\t{}\n".format(words, actual_bbox_string(box, width, length), "filename.jpg"))
          fw.write("\n")
          fbw.write("\n")
          fiw.write("\n")
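
A quick way to sanity-check the written files is to read them back. The sketch below parses text in the same layout as `train.txt` (one `word<TAB>label` pair per line, a blank line between invoices). The helper `read_examples` is hypothetical and works on a string, so it can be tried without touching Drive.

```python
def read_examples(text):
    """Parse 'word<TAB>label' lines into a list of invoices."""
    examples, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current invoice
            if current:
                examples.append(current)
                current = []
            continue
        word, label = line.split("\t")
        current.append((word, label))
    if current:
        examples.append(current)
    return examples


sample = "MR.\tB-COMPANY\nD.I.Y.\tB-COMPANY\n\nTOTAL\tO\n9.00\tB-TOTAL\n\n"
examples = read_examples(sample)
print(len(examples), examples[1])
```

For a real file, pass `open(path, encoding="utf8").read()` as the text and compare `len(examples)` with the number of indexes in the split.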

In [18]:
# First we split into train and test set
split_indexes = [*range(len(sentences_list))]
random.Random(4).shuffle(split_indexes)
cut = int(len(sentences_list) * 0.8)
split_indexes_train = split_indexes[:cut]
split_indexes_test = split_indexes[cut:]

In [19]:
write_files('/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated Dataset', 'train', sentences_list, labels_list, bbox_list, split_indexes_train)

In [20]:
write_files('/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated test', 'test', sentences_list, labels_list, bbox_list, split_indexes_test)

In [33]:
# Finally we write the labels.txt file
tag_values = get_unique(labels_list)
with open(
      os.path.join('/content/drive/MyDrive/Colab Notebooks/LayoutLM /', "labels.txt"),
      "w",
      encoding="utf8",
  ) as lb:
      for val in tag_values:
        lb.write("{}\n".format(val))
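
Note that `get_unique` returns `list(seen)`, and the iteration order of a set of strings can change between Python sessions. If a stable order of `labels.txt` matters (for example, when reusing a model trained in an earlier session), the labels can be sorted first. A minimal sketch, with `ordered_labels` as a hypothetical helper that puts `O` last:

```python
def ordered_labels(labels_nested):
    """Collect the unique labels of all invoices in a reproducible order."""
    unique = set()
    for invoice_labels in labels_nested:
        unique.update(invoice_labels)
    # Entity labels alphabetically, then the 'O' label if it is present
    entity_labels = sorted(unique - {"O"})
    return entity_labels + (["O"] if "O" in unique else [])


print(ordered_labels([["O", "B-DATE"], ["B-TOTAL", "O", "B-COMPANY"]]))
```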

# 2. Fine tune LayoutLM

In [22]:
os.chdir('/content/drive/MyDrive/Colab Notebooks/LayoutLM ')

In [None]:
os.getcwd()

In [45]:
!git clone https://github.com/microsoft/unilm.git

Cloning into 'unilm'...
remote: Enumerating objects: 6290, done.
remote: Counting objects: 100% (484/484), done.
remote: Compressing objects: 100% (353/353), done.
remote: Total 6290 (delta 130), reused 460 (delta 122), pack-reused 5806
Receiving objects: 100% (6290/6290), 11.29 MiB | 8.06 MiB/s, done.
Resolving deltas: 100% (2575/2575), done.
Checking out files: 100% (3239/3239), done.


In [49]:
cd unilm/layoutlm/deprecated

/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated


In [50]:
!pip install .

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: layoutlm
  Building wheel for layoutlm (setup.py) ... done
  Created wheel for layoutlm: filename=layoutlm-0.0-py3-none-any.whl size=11482 sha256=159bf42cf317eeccf28c30bd6d58e33698116a3da29704c6816a5177d306511d
  Stored in directory: /tmp/pip-ephem-wheel-cache-t04qdsij/wheels/d5/bc/ff/ba3399de59d4a01f57ac56b83c078e83e7b7da15c0d349d06e
Successfully built la

In [25]:
os.chdir('/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated/examples/seq_labeling')

In [26]:
os.getcwd()

'/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated/examples/seq_labeling'


In [29]:
# Copy the previously created files into the local data folder
# (the train and test files were written to 'Annotated Dataset' and 'Annotated test', and labels.txt to the 'LayoutLM ' folder)
!mkdir -p data
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated Dataset/train.txt' data/
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated Dataset/train_box.txt' data/
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated Dataset/train_image.txt' data/
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated test/test.txt' data/
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated test/test_box.txt' data/
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated test/test_image.txt' data/
!cp '/content/drive/MyDrive/Colab Notebooks/LayoutLM /labels.txt' data/
# Remove cached files (this is optional and only important if we make changes to the input files; -f ignores missing files)
!rm -f data/cached_train_layoutlm-base-uncased_512
!rm -f data/cached_test_layoutlm-base-uncased_512

In [None]:
%%bash
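ls '/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated/examples/seq_labeling/data/'
cat '/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated/examples/seq_labeling/data/labels.txt'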

In [62]:
# Check model parameters
!cat "/content/drive/MyDrive/Colab Notebooks/LayoutLM /config.json"

In [None]:
%%bash
# Want to change any model parameter? For example, here I replace the number of attention heads from 12 to 8 (in my runs the results were much better)
sed -i 's/"num_attention_heads": 12,/"num_attention_heads": 8,/' "/content/drive/MyDrive/Colab Notebooks/LayoutLM /config.json"

In [19]:
os.chdir('/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated')

In [52]:
os.chdir('/content/drive/MyDrive/Colab Notebooks/LayoutLM /unilm/layoutlm/deprecated/examples/seq_labeling')

In [64]:
# Train the model
# (the data and model paths use the same 'LayoutLM ' Drive folder as the cells above, and --output_dir takes a single value)
! CUDA_LAUNCH_BLOCKING=1 python run_seq_labeling.py --data_dir '/content/drive/MyDrive/Colab Notebooks/LayoutLM /Annotated Dataset' \
                            --model_type layoutlm \
                            --model_name_or_path '/content/drive/MyDrive/Colab Notebooks/LayoutLM ' \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_train \
                            --num_train_epochs 5.0 \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir output \
                            --overwrite_output_dir \
                            --labels '/content/drive/MyDrive/Colab Notebooks/LayoutLM /labels.txt' \
                            --per_gpu_train_batch_size 8 \
                            --per_gpu_eval_batch_size 8

In [None]:
# Evaluate for test set
! python run_seq_labeling.py  --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path '/content/drive/MyDrive/Colab Notebooks/LayoutLM ' \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_predict \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir output \
                            --labels data/labels.txt \
                            --per_gpu_eval_batch_size 8

In [None]:
cat output/test_results.txt

In [None]:
%%bash
# We can check the results on the test set
head -60 output/test_predictions.txt