In [1]:
import json
import re

import pandas as pd
import numpy as np
import torch
import torchtext

# Data Preparation

The available data comes from a small toy dataset based on the [RVL-CDIP Dataset](http://www.cs.cmu.edu/~aharley/rvl-cdip/). The dataset contains 100 documents in 4 classes:
- “resumee”,
- “invoice”,
- “letter”,
- “email”.

For each document we have:
- an image,
- a PDF,
- the OCR output in our proprietary dictionary format.

The dictionaries come consolidated in the ``document_type_data.csv`` file, along with the text of each document.

## Read data

In [2]:
df = pd.read_csv("./data/document_type_data.csv")
df

Unnamed: 0.1,Unnamed: 0,ocr,text,label,file_name
0,0,"{'pageImages': [{'__typename': 'Image', 'width...","['Chaikin, ', 'Karen ', 'n ', ""O' "", 'o ', 'Fr...",email,2085136614c.pdf
1,1,"{'pageImages': [{'__typename': 'Image', 'width...","['> ', 'Jenny, ', 'After ', 'speaking ', 'with...",email,2085136814a.pdf
2,2,"{'pageImages': [{'__typename': 'Image', 'width...","['Please ', 'call ', 'with ', 'any ', 'questio...",email,2085140145a.pdf
3,3,"{'pageImages': [{'__typename': 'Image', 'width...","['2085158326 ', 'Williams, ', 'Carrie ', 'T. '...",email,2085158326.pdf
4,4,"{'pageImages': [{'__typename': 'Image', 'width...","['GJ ', '□3 ', 'A ', 'nice ', 'ending ', 'to '...",email,2085161311b.pdf
...,...,...,...,...,...
95,95,"{'pageImages': [{'__typename': 'Image', 'width...","['CURRICULUM ', 'VITAE ', 'NILANJAN ', 'ROY ',...",resumee,50701639-1640.pdf
96,96,"{'pageImages': [{'__typename': 'Image', 'width...","['BIOGRAPHICAL ', 'SKETCH ', 'Mark ', 'S. ', '...",resumee,50712092-2093.pdf
97,97,"{'pageImages': [{'__typename': 'Image', 'width...","['May. ', '1997 ', 'CURRICULUM ', 'VITAE ', 'E...",resumee,50735851-5852.pdf
98,98,"{'pageImages': [{'__typename': 'Image', 'width...","['I ', 'CURRICULUM ', 'VITAE ', '* ', 'NAbE: '...",resumee,80412888_80412908.pdf


In [3]:
try:
    json.loads(df.loc[0, "ocr"])
except Exception as e:
    print(type(e), ": ", e)

<class 'json.decoder.JSONDecodeError'> :  Expecting property name enclosed in double quotes: line 1 column 2 (char 1)


In [4]:
try:
    json.loads(df.loc[0, "text"])
except Exception as e:
    print(type(e), ": ", e)

<class 'json.decoder.JSONDecodeError'> :  Expecting value: line 1 column 2 (char 1)


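> **Warning**: neither the ``ocr`` nor the ``text`` strings are valid JSON. As the errors above show, they are written with Python-style single quotes (and some fields have problems with ``"``), so ``json.loads`` rejects them. Hence, we will just use the ``text`` column, lowercased and stripped of every character that is not a letter, a digit or a space: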

In [5]:
text_series = df["text"].str.lower().apply(lambda x : re.sub(r'[^a-zA-Z0-9 ]', '', x))
text_series

0     chaikin  karen  n  o  o  from  sent  to  subje...
1       jenny  after  speaking  with  elisa  about  ...
2     please  call  with  any  questions  thanks  nw...
3     2085158326  williams  carrie  t  lbco  will  b...
4     gj  3  a  nice  ending  to  the  story  below ...
                            ...                        
95    curriculum  vitae  nilanjan  roy  name  1st  o...
96    biographical  sketch  mark  s  ptashne  profes...
97    may  1997  curriculum  vitae  education  and  ...
98    i  curriculum  vitae    nabe  emil  r  unanue ...
99    vita  email  professor  school  of  social  we...
Name: text, Length: 100, dtype: object
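
The character stripping above also discards the list structure and leaves double spaces where punctuation was removed. Since the ``text`` strings look like Python list literals, one possible alternative is ``ast.literal_eval``. The sketch below uses a made-up string modeled on the first row, not the actual file, and it is an assumption that every row would parse this way (the warning above suggests some may not):

```python
import ast
import json
import re

# Hypothetical value modeled on the first row of the "text" column
raw = "['Chaikin, ', 'Karen ', 'n ', \"O' \", 'o ']"

# json.loads rejects it because JSON strings must use double quotes
try:
    json.loads(raw)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval evaluates Python literals only (no arbitrary code execution)
tokens = ast.literal_eval(raw)

# Same cleaning rule as above, applied per token, then joined with single spaces
cleaned = [re.sub(r"[^a-z0-9]", "", tok.lower()) for tok in tokens]
text = " ".join(tok for tok in cleaned if tok)

print(is_valid_json)
print(text)
```

Cleaning token by token avoids the doubled spaces, although the ``basic_english`` tokenizer used later splits on whitespace and is not affected by them either way.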

--- 
Lastly, we need to encode the labels for each category:

In [6]:
code2label = dict(enumerate(df['label'].astype("category").cat.categories ) )
code2label

{0: 'email', 1: 'invoice', 2: 'letter', 3: 'resumee'}

In [7]:
label2code = {v : k for k, v in code2label.items()}
label2code

{'email': 0, 'invoice': 1, 'letter': 2, 'resumee': 3}
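
The two mappings are inverses of each other. As a pandas-free illustration (the label list is a made-up sample, and plain alphabetical sorting is assumed to match how pandas orders string categories), the same dictionaries can be built and checked like this:

```python
# Made-up sample of labels; the real ones come from df["label"]
labels = ["email", "resumee", "invoice", "letter", "email", "invoice"]

# Sorted unique labels -> integer codes
code2label = dict(enumerate(sorted(set(labels))))
label2code = {label: code for code, label in code2label.items()}

# Encode, then decode, to check that the round trip is lossless
encoded = [label2code[label] for label in labels]
decoded = [code2label[code] for code in encoded]

print(code2label)
print(encoded)
```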

---
To make everything simpler, we prepare the dataframe with just the data we need:

In [8]:
df = df[["text", "label"]].copy()
df.loc[:, "text"] = text_series
df

Unnamed: 0,text,label
0,chaikin karen n o o from sent to subje...,email
1,jenny after speaking with elisa about ...,email
2,please call with any questions thanks nw...,email
3,2085158326 williams carrie t lbco will b...,email
4,gj 3 a nice ending to the story below ...,email
...,...,...
95,curriculum vitae nilanjan roy name 1st o...,resumee
96,biographical sketch mark s ptashne profes...,resumee
97,may 1997 curriculum vitae education and ...,resumee
98,i curriculum vitae nabe emil r unanue ...,resumee


## Preparing torchtext dataset

The following process is adapted from PyTorch's [Text Sentiment n-Grams classification](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) tutorial.

With torchtext, we can easily build a vocabulary from a raw iterator by using the built-in factory function ``build_vocab_from_iterator``. This function accepts an iterator that yields lists (or iterators) of tokens.

In [9]:
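# Series.items() yields (label, text) pairs; it replaces iteritems(), which was removed in pandas 2.0
text_iterator = df.set_index("label")["text"].items()
# Careful: next() consumes the first pair, so that document is skipped when text_iterator is reused in the next cell
next(text_iterator)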

('email',
 'chaikin  karen  n  o  o  from  sent  to  subject  chaikin  karen  monday  july  16  2001  724  pm  plombadogtnadcomcom  re  rfp  and  op  plan  kc  youth  smoking  prevention  hj  q  oe  vi  phil  thanks  for  all  of  these  note  that  i  cannot  open  the  marked  version  of  the  op  plan  can  you  please  reconvert  to  a  pdf  and  resend  thanks ')

In [10]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = torchtext.vocab.build_vocab_from_iterator(yield_tokens(text_iterator), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

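Now we have a vocabulary that converts a list of tokens into integers. Tokens it has not seen, such as the uppercase ``"I"`` (the tokenizer lowercases text, so only ``"i"`` is stored) or ``"tokenize"``, map to index 0, the ``<unk>`` default: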

In [11]:
vocab(['now', "I", "will", "tokenize", "this"])

[756, 0, 33, 0, 32]
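
To make the lookup behaviour concrete, here is a pure-Python sketch of a frequency-ordered vocabulary with an ``<unk>`` fallback. It only approximates what torchtext does; ``build_vocab`` and ``lookup`` are hypothetical helper names, and the two-document corpus is made up:

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties are broken alphabetically so the result is deterministic
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate([*specials, *ordered])}

def lookup(vocab, tokens, default="<unk>"):
    """Convert tokens to ids, falling back to the id of the default token."""
    return [vocab.get(tok, vocab[default]) for tok in tokens]

corpus = [["please", "call", "with", "any", "questions"],
          ["please", "resend", "with", "thanks"]]
vocab = build_vocab(corpus)
ids = lookup(vocab, ["please", "I", "call"])

print(vocab)
print(ids)
```

Frequent tokens end up with small ids, and anything outside the corpus (here the uppercase ``"I"``) falls back to id 0.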

Next, we prepare the processing pipelines with the tokenizer and vocabulary. The text and label pipelines will be used to process the raw data strings from the dataset iterators:

In [12]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: label2code[x]

- The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary.
- The label pipeline converts the label into an integer.

In [13]:
text_pipeline('now I will tokenize this')

[756, 7, 33, 0, 32]

In [14]:
label_pipeline('email')

0

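Using our data (the many 0 entries are ``<unk>`` tokens: the first document was most likely left out of the vocabulary because ``next(text_iterator)`` had already consumed it before ``build_vocab_from_iterator`` ran):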

In [15]:
text_pipeline(df.iloc[0]["text"])

[0,
 1585,
 88,
 16,
 16,
 19,
 64,
 4,
 47,
 0,
 1585,
 332,
 277,
 856,
 298,
 0,
 70,
 0,
 75,
 0,
 3,
 0,
 769,
 0,
 1176,
 245,
 774,
 3790,
 484,
 1671,
 836,
 1083,
 162,
 8,
 115,
 1,
 190,
 394,
 18,
 7,
 3018,
 0,
 2,
 4231,
 1881,
 1,
 2,
 0,
 769,
 155,
 11,
 39,
 0,
 4,
 6,
 0,
 3,
 0,
 162]

In [16]:
label_pipeline(df.iloc[0]["label"])

0

We can then write a custom torch dataset:

In [17]:
class DocTextDataset(torch.utils.data.Dataset):
    """Wraps a dataframe with "text" and "label" columns and builds its own vocabulary from it."""

    def yield_tokens(self, data_iter):
        # data_iter yields (label, text) pairs; only the text is tokenized
        for _, text in data_iter:
            yield self.tokenizer(text)

    def __init__(self, df: pd.DataFrame, code2label: dict = None):
        self.df = df
        if code2label is None:
            self.code2label = dict(enumerate(df['label'].astype("category").cat.categories))
        else:
            self.code2label = code2label
        # use self.code2label here: the code2label argument may be None
        self.label2code = {v: k for k, v in self.code2label.items()}
        # keep the tokenizer on the instance instead of relying on a global variable
        self.tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
        # items() replaces iteritems(), which was removed in pandas 2.0
        text_iterator = self.df.set_index("label")["text"].items()
        self.vocab = torchtext.vocab.build_vocab_from_iterator(self.yield_tokens(text_iterator), specials=["<unk>"])
        self.vocab.set_default_index(self.vocab["<unk>"])
        self.text_pipeline = lambda x: self.vocab(self.tokenizer(x))
        self.label_pipeline = lambda x: self.label2code[x]

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        label = self.label_pipeline(self.df.iloc[idx]["label"])
        text = self.text_pipeline(self.df.iloc[idx]["text"])
        sample = {"text": torch.LongTensor(text), "label": torch.LongTensor([label])}
        return sample

In [18]:
dataset = DocTextDataset(df, code2label)
dataset[0]

{'text': tensor([1378,  736,   86,   15,   15,   19,   62,    4,   43, 1378,  736,  285,
          234,  623,  258, 2569,   64, 4671,   72, 4900,    3, 1683,  484, 4033,
          847,  212,  586, 1547,  401, 1072,  617,  773,  139,    8,  108,    1,
          169,  336,   18,    7, 1369, 4520,    2, 1631, 1169,    1,    2, 1683,
          484,  145,   11,   38, 4837,    4,    6, 4609,    3, 4875,  139]),
 'label': tensor([0])}

## Preparing the DataLoader

Since we use PyTorch, we need a [DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader).
Before sending a loaded batch of samples to the model, we can use the DataLoader's ``collate_fn`` **to process the batch**.

> **Warning**: ``collate_fn`` is declared as a top-level function, which ensures that it is available in each of the DataLoader's workers.

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # Concatenate all texts of the batch into one flat tensor;
    # offsets record the position where each text starts in it.
    label_list, text_list, offsets = [], [], [0]
    for sample in batch:
        label_list.append(sample["label"])    # tensor of shape (1,)
        processed_text = sample["text"]       # already a LongTensor of token ids
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.cat(label_list)        # shape (batch_size,), dtype int64
    # drop the last length, then take the cumulative sum -> start index of each text
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return {
        "label" : label_list.to(device), 
        "text" : text_list.to(device), 
        "offset" : offsets.to(device)    
    }
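
The offset computation is easier to follow on a tiny example. The sketch below mirrors the logic of ``collate_batch`` with plain Python lists instead of tensors; the three "documents" are made-up token ids:

```python
from itertools import accumulate

# Three made-up "documents", already converted to token ids
batch = [[5, 8, 2], [7, 1], [4, 4, 9, 3]]

# Flat concatenation of all ids, as torch.cat does in collate_batch
flat = [tok for doc in batch for tok in doc]

# Start index of each document: cumulative sum of the lengths, without the last one
lengths = [len(doc) for doc in batch]
offsets = [0, *accumulate(lengths[:-1])]

# The documents can be recovered by slicing between consecutive offsets
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]

print(flat)
print(offsets)
```

This flat-plus-offsets layout is the input format used by ``torch.nn.EmbeddingBag`` in the tutorial this section is adapted from, which avoids padding documents to a common length.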

Now we just have to split the dataframe into train, validation and test sets:
- 60% - train set,
- 20% - validation set,
- 20% - test set.

Then, we initialize the corresponding datasets and DataLoaders:

In [20]:
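# shuffle df
df = df.sample(frac=1, random_state=42)
# split it 60/20/20 by position (np.split on a DataFrame triggers a FutureWarning in recent pandas)
n_train, n_val = int(.6*len(df)), int(.8*len(df))
train_df, val_df, test_df = df.iloc[:n_train], df.iloc[n_train:n_val], df.iloc[n_val:]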
# initialize datasets
train_dataset = DocTextDataset(train_df, code2label)
val_dataset = DocTextDataset(val_df, code2label)
test_dataset = DocTextDataset(test_df, code2label)
print("Sizes:", len(train_dataset), len(val_dataset), len(test_dataset))

Sizes: 60 20 20
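
Note that, as written, each ``DocTextDataset`` builds its vocabulary from its own dataframe, so the same word can receive different ids in the train, validation and test sets. The pure-Python sketch below (made-up sentences, hypothetical ``build_vocab`` helper) illustrates the effect and one common remedy, which is to build the vocabulary on the training split only and reuse it for the others:

```python
from collections import Counter

def build_vocab(token_lists):
    """Hypothetical helper: <unk> gets id 0, other tokens are ordered by frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(["<unk>", *ordered])}

train_docs = [["invoice", "total", "due"], ["invoice", "number"]]
val_docs = [["total", "amount", "total"], ["invoice"]]

train_vocab = build_vocab(train_docs)
val_vocab = build_vocab(val_docs)

# Same word, different ids when each split has its own vocabulary
print(train_vocab["total"], val_vocab["total"])

# Reusing the training vocabulary keeps ids consistent; unseen words map to <unk>
val_ids = [[train_vocab.get(tok, 0) for tok in doc] for doc in val_docs]
print(val_ids)
```

A model trained on ids from one vocabulary cannot interpret ids from another, so sharing the training vocabulary is worth considering before evaluating on the validation and test sets.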


In [21]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)

We can now test if the DataLoader works correctly:

In [22]:
next(iter(train_loader))

{'label': tensor([3, 2, 2, 1, 1, 1, 0, 3]),
 'text': tensor([   3,    5,  402,  ..., 2516,  174,  252]),
 'offset': tensor([   0,  249,  369,  497,  626,  789, 1020, 1140])}