In [None]:
%pip install pandas==1.3.5 transformers --quiet

In [None]:
%rm -r old

# Fine-tuning BERT on long texts

In this notebook we explore different approaches to overcome one of the main limitations of BERT (Bidirectional Encoder Representations from Transformers): its inability to process long documents. BERT can only be applied to texts of at most 512 tokens (special tokens included) after tokenization with the BERT tokenizer.


&nbsp;


We will implement [this paper, which introduces a new method to deal with long documents: RoBERT (Recurrence over BERT)](https://arxiv.org/abs/1910.10781).



We will also implement [this paper, which introduces different methods to deal with long documents and BERT](https://arxiv.org/abs/1905.05583), to see whether the RoBERT paper brings a significant improvement in the classification of long texts with BERT.

This paper introduces two main approaches, the **truncation methods** and the **hierarchical methods**:
 * Truncation methods
   * head-only
   * tail-only
   * head+tail
 * Hierarchical methods
   * mean pooling
   * max pooling
 
The truncation methods apply to the input of the BERT model (the tokens), while the hierarchical methods apply to the outputs of the BERT model (the embeddings). We go into more detail in the respective sections.


&nbsp;


The goal of the original RoBERT article was to solve the following problem: BERT has a fixed input token count, so how can we use its power on long texts? This notebook implements the same approach using HuggingFace's `transformers` library and `pytorch`.

The dataset used is the *US Consumer Finance Complaints* available on [Kaggle](https://www.kaggle.com/cfpb/us-consumer-finance-complaints).

Basically, the article goes as follows:
1. Read the data and do some basic preprocessing
2. Break the documents into smaller segments with a number of tokens that can be handled by BERT
3. Fine-tune BERT on those segments using a classification head
4. Combine the segments of each document by using an LSTM. The fixed output size of the LSTM can be used by a single fully connected layer for the final classification.


&nbsp;


For code clarity, we moved some parts of the code into fully commented Python scripts. They are referenced throughout the notebook.

Here is a graph that represents the interactions between the different classes:

![img/Class_Interactions.png](img/Class_Interactions.png)

In [1]:
%matplotlib inline
import torch
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import re

from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
from torch.utils.data.sampler import SubsetRandomSampler
import transformers
from transformers import RobertaTokenizer, BertTokenizer, RobertaModel, BertModel, AdamW# get_linear_schedule_with_warmup
from transformers import get_linear_schedule_with_warmup
import time
from RoBERT import RoBERT_Model
import os
from BERT_Hierarchical import BERT_Hierarchical_Model



#### CUSTOM IMPORTS
from Custom_Dataset_Class_Path import ConsumerComplaintsDataset1
from Bert_Classification_Custom import Bert_Classification_Model
from utils_Custom import *

#### ORIGINAL IMPORTS
# from utils import *
# from Custom_Dataset_Class import ConsumerComplaintsDataset1
# from Bert_Classification import Bert_Classification_Model

import warnings
warnings.filterwarnings("ignore")



In [2]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Quadro RTX 6000


## Data Exploration :

The dataset used in this work was retrieved from Kaggle, as in the paper. It contains consumer complaints about financial products and services, which the CFPB (Consumer Financial Protection Bureau) sends to the companies concerned for a response.

The dataset consists of 555957 rows and 18 columns.

As our model attempts to predict which product the complaint is about, we only used the `consumer_complaint_narrative` and `product` columns.
* `consumer_complaint_narrative`: contains the consumer complaint in text format.
* `product`: label of the product concerned by the complaint.

The final dataset used in this work consists of 555957 rows and 2 columns (one column for the texts and the other for the labels).

In [None]:
# Load the dataset into a pandas dataframe.
df=pd.read_csv("./us-consumer-finance-complaints/consumer_complaints.csv")

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

train_raw = df[df.consumer_complaint_narrative.notnull()]
print('Number of training sentences with a non-null complaint narrative: {:,}\n'.format(train_raw.shape[0]))

# Display 10 random rows from the data.
train_raw.sample(10)

In [None]:
# Histogram of word counts per complaint; counts above 800 are capped at 800.
train_raw.consumer_complaint_narrative.apply(lambda x: min(len(x.split()), 800)).plot(kind='hist', title="Number of complaints by word count (capped at 800)")

In [None]:
train_raw['len_txt'] =train_raw.consumer_complaint_narrative.apply(lambda x: len(x.split()))
train_raw.describe()

In [None]:
# Select only the rows with at least 250 words:
train_raw = train_raw[train_raw.len_txt >249]
train_raw.shape

In [None]:
#Select only the column 'consumer_complaint_narrative' and 'product'
train_raw = train_raw[['consumer_complaint_narrative', 'product']]
train_raw.reset_index(inplace=True, drop=True)
train_raw.head()

In [None]:
# Group similar products (.loc is the label-based indexer that accepts a boolean mask)
train_raw.loc[train_raw['product'] == 'Credit reporting', 'product'] = 'Credit reporting, credit repair services, or other personal consumer reports'
train_raw.loc[train_raw['product'] == 'Credit card', 'product'] = 'Credit card or prepaid card'
train_raw.loc[train_raw['product'] == 'Prepaid card', 'product'] = 'Credit card or prepaid card'
train_raw.loc[train_raw['product'] == 'Payday loan', 'product'] = 'Payday loan, title loan, or personal loan'
train_raw.loc[train_raw['product'] == 'Virtual currency', 'product'] = 'Money transfer, virtual currency, or money service'
train_raw.head()

In [None]:
#all the different classes
for l in np.unique(train_raw['product']):
    print(l)

In [None]:
train_raw=train_raw.rename(columns = {'consumer_complaint_narrative':'text', 'product':'label'})
train_raw.head()

## Data preprocessing and segmentation:

### 1. Preprocessing:
The preprocessing step goes as follows:
1. Remove all documents with fewer than 250 tokens. We want to concentrate only on long texts.
2. Consolidate the classes by combining those that are similar (e.g. "Credit card" and "Prepaid card" complaints):
 * 'Credit reporting' to 'Credit reporting, credit repair services, or other personal consumer reports'.
 * 'Credit card' and 'Prepaid card' to 'Credit card or prepaid card'.
 * 'Payday loan' to 'Payday loan, title loan, or personal loan'.
 * 'Virtual currency' to 'Money transfer, virtual currency, or money service'.
3. Remove all non-word characters.
4. Encode the labels.
5. Split the dataset into a training set (80%) and a validation set (20%).
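
Steps 3 to 5 above can be sketched with the standard library alone. This is only an illustration, not the code from `Custom_Dataset_Class.py`: the helper names `clean_text`, `encode_labels` and `split_indices` are hypothetical, and the exact regular expression used by the dataset class is an assumption.

```python
import random
import re


def clean_text(text):
    """Replace every run of non-word characters with a single space."""
    return re.sub(r"\W+", " ", text).strip()


def encode_labels(labels):
    """Map each distinct label to an integer id, using sorted label order."""
    classes = sorted(set(labels))
    label_to_id = {label: i for i, label in enumerate(classes)}
    return [label_to_id[label] for label in labels], classes


def split_indices(n_samples, validation_split=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and hold out a validation share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * validation_split)
    return indices[n_val:], indices[:n_val]


# Toy data, invented for the example.
texts = ["I was charged twice!!! (see attached)", "My loan -- #123 -- was denied."]
labels = ["Credit card or prepaid card", "Payday loan, title loan, or personal loan"]

cleaned = [clean_text(t) for t in texts]
encoded, classes = encode_labels(labels)
train_idx, val_idx = split_indices(10)

print(cleaned)
print(encoded, classes)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the train/validation split reproducible across runs, which matters when several models are compared on the same validation set.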

### 2. Segmentation and tokenization:
First, each complaint is split into 200-token chunks with an overlap of 50 tokens between consecutive chunks. This means that the last 50 tokens of a segment are the first 50 of the next segment.  
Then, each segment is tokenized using BERT's tokenizer. This is needed for two main reasons:
1. BERT's vocabulary is not made of just English words, but also subwords and single characters
2. BERT does not take raw string as inputs. It needs:
    - token ids: those values allow BERT to retrieve the tensor representation of a token
    - input mask: a tensor of 0s and 1s that shows whether a token should be ignored (0) or not (1) by BERT
    - segment ids: those are used to tell BERT what tokens form the first sentence and the second sentence (in the next sentence prediction task)  
    
The parameter `MAX_SEQ_LENGTH` ensures that any tokenized sentence longer than that number (200 in this case) is truncated.  
Each returned segment is given the same class as the document containing it.
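
To make the three inputs listed above concrete, here is a minimal sketch of how one chunk of token ids could be turned into fixed-length BERT inputs. It is not the code from `Custom_Dataset_Class.py`: the function name `build_bert_inputs` is hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of `bert-base-uncased`, assumed here for illustration. Whether the 200-token budget includes the two special tokens is also an assumption.

```python
def build_bert_inputs(token_ids, max_seq_length=200, cls_id=101, sep_id=102, pad_id=0):
    """Wrap a chunk in [CLS] ... [SEP], truncate it, and pad it to a fixed length.

    The special-token ids are assumed values (bert-base-uncased); another
    tokenizer may use different ids.
    """
    body = token_ids[: max_seq_length - 2]  # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    n_real = len(ids)
    n_pad = max_seq_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
        "segment_ids": [0] * max_seq_length,       # single-sentence input
    }


short = build_bert_inputs([7, 8, 9], max_seq_length=8)
long = build_bert_inputs(list(range(1000, 1300)))

print(short["input_ids"])
print(short["input_mask"])
print(len(long["input_ids"]), sum(long["input_mask"]))
```

For a classification task with a single text per example, all segment ids are 0; they only differ when two sentences are packed into one input.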

PyTorch offers the `Dataset` and `DataLoader` classes that make it easier to group the data reading and preparation operations while decreasing the memory usage in case the dataset is large.  
We implemented the above steps in the `Custom_Dataset_Class.py` file. Our dataset class `ConsumerComplaintsDataset1` has a constructor (`__init__`) taking the necessary parameters to load the .csv file, segment the documents, and then tokenize them. It also preprocesses the data as explained in the preprocessing section.    
The two other important methods are the following:
- `__len__` returns the number of documents
- `__getitem__` is the method where most of the work is done. It takes a tensor of idx values and returns the tokenized data.


&nbsp;


As mentioned above, there are different approaches for handling overflowing tokens. We added the strategies described in [the following paper](https://arxiv.org/abs/1905.05583) through the `approach` parameter of the `ConsumerComplaintsDataset1` constructor. It accepts the following values:
- **all**: overflowing tokens from a document are used to create new 200-token chunks, with a 50-token overlap between consecutive chunks.
- **head**: overflowing tokens are truncated. Only the first 200 tokens are used.
- **tail**: only the last 200 tokens are kept.
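
The three values of `approach` can be sketched as follows. This is an illustration under assumptions, not the dataset class itself: the function name `segment_tokens` is hypothetical, and the way the final, possibly shorter, chunk is handled may differ from the author's implementation.

```python
def segment_tokens(tokens, approach="all", chunk_len=200, overlap_len=50):
    """Return the list of chunks kept for one document."""
    if approach == "head":
        return [tokens[:chunk_len]]   # first chunk_len tokens only
    if approach == "tail":
        return [tokens[-chunk_len:]]  # last chunk_len tokens only
    if approach != "all":
        raise ValueError(f"unknown approach: {approach!r}")

    # Sliding window: each chunk starts (chunk_len - overlap_len) tokens
    # after the previous one, so consecutive chunks share overlap_len tokens.
    stride = chunk_len - overlap_len
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break
        start += stride
    return chunks


tokens = list(range(500))  # stand-in for a 500-token document
chunks = segment_tokens(tokens, "all")
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
print(segment_tokens(tokens, "head")[0][:3], segment_tokens(tokens, "tail")[0][:3])
```

With a stride of 150, a 500-token document yields chunks starting at positions 0, 150 and 300.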

In [None]:
TRAIN_BATCH_SIZE=8
EPOCH=10
validation_split = .2
shuffle_dataset = True
random_seed= 42
MIN_LEN=249
MAX_LEN = 100000
CHUNK_LEN=200
OVERLAP_LEN=50
#MAX_LEN=10000000
MAX_SIZE_DATASET=300

print('Loading BERT tokenizer...')
bert_tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-uncased', do_lower_case=True)

dataset=ConsumerComplaintsDataset1(
    tokenizer=bert_tokenizer,
    get_popular_keys = True,
    min_len=MIN_LEN,
    max_len=MAX_LEN,
    chunk_len=CHUNK_LEN,
    # max_size_dataset=MAX_SIZE_DATASET,
    overlap_len=OVERLAP_LEN)


#train_size = int(0.8 * len(dataset))
#test_size = len(dataset) - train_size
#train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=train_sampler,
    collate_fn=my_collate1)

valid_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=valid_sampler,
    collate_fn=my_collate1)

### 3. Fine-tuning on the 200 tokens chunks:
BERT is fine-tuned on the 200-token chunks.

In our implementation, we put a neural network on top of BERT's pooled output. BERT produces one embedding per input token; the pooled output is derived from the embedding of the `[CLS]` token (the first token) and represents the whole input sequence (the 200-token chunk in our case).

The neural network is composed of a dense layer with a softmax activation function.

This Model corresponds to the class `Bert_Classification_Model` defined in the file `Bert_Classification.py`. This class inherits from `torch.nn.Module` which is the parent of all neural network models.

Then, in the function `train_loop_fun1` defined in `utils.py`, for each batch, the list of dictionaries containing the values for the token_ids, masks, token_type_ids, and targets are respectively concatenated into `torch.tensors` which are then fed into the model in order to get predictions and apply backpropagation according to the Cross Entropy loss.
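
The arithmetic behind the classification head and its loss can be written out without any framework. The sketch below is not the code of `Bert_Classification.py`: the toy 3-dimensional vector stands in for BERT's 768-dimensional pooled output, and the weights are invented for the example.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one output (logit) per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])


pooled_output = [0.5, -1.0, 2.0]              # toy stand-in for the [CLS] pooled vector
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 classes x 3 input dimensions
bias = [0.0, 0.0]

logits = dense(pooled_output, weights, bias)
probs = softmax(logits)
loss = cross_entropy(probs, target=0)
print(logits, probs, loss)
```

One point to check in the actual scripts: `torch.nn.CrossEntropyLoss` applies log-softmax internally and expects raw logits, so if that loss is used, the softmax should not be applied a second time before it.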

#### First segmentation approach: all

In this approach, we consider each 200-token chunk as a new document: if a document is split into 3 chunks of 200 tokens, each chunk is treated as a separate document with the same label.\
We use this approach to fine-tune BERT; the resulting model is then used as the input to RoBERT. 

![img/each_Chunk_as_Document.png](img/each_Chunk_as_Document.png)

In [None]:
lr=3e-5#1e-3
num_training_steps=int(len(dataset) / TRAIN_BATCH_SIZE * EPOCH)

# uncomment to run with the Maggioli data
n_class = dataset.get_n_classes()
print(n_class)
model=Bert_Classification_Model(n_class).to(device)

# uncomment to run with the original data
# model=Bert_Classification_Model().to(device)

optimizer=AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=num_training_steps)
val_losses=[]
batches_losses=[]
val_acc=[]
prev_val_loss = 1
log_file = open("log_file_all.txt", "w")

for epoch in range(EPOCH):
    t0 = time.time()    
    print(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    log_file.write(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    batches_losses_tmp=train_loop_fun1(train_data_loader, model, optimizer, device)
    epoch_loss=np.mean(batches_losses_tmp)
    print(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    log_file.write(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    t1=time.time()
    output, target, val_losses_tmp=eval_loop_fun1(valid_data_loader, model, device)
    avg_val_loss = np.mean(val_losses_tmp)
    print(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    log_file.write(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    ac, cr=evaluate(target, output)
    print(f"ACCURACY: {ac}")
    log_file.write(f"ACCURACY: {ac}\n")
    print("CLASSIFICATION_REPORT:")
    print(cr)
    log_file.write(f"CLASSIFICATION_REPORT:\n{cr}")
    # Keep the per-epoch validation metrics; the plotting cells below read them.
    val_acc.append(ac)
    val_losses.append(val_losses_tmp)
    batches_losses.append(batches_losses_tmp)
    print("\t§§ model has been saved §§")
    torch.save(model, f"models/best_model_{epoch+1}.pt")    
    
log_file.close()

In [None]:
pd.DataFrame(np.array([[np.mean(x) for x in batches_losses], [np.mean(x) for x in val_losses]]).T,
                   columns=['Training', 'Validation']).plot(title="loss")

In [None]:
pd.DataFrame(np.array(val_acc).T,
                   columns=['Validation']).plot(title="accuracy")

##### Now we will experiment with the [truncation and hierarchical strategies presented in this paper](https://arxiv.org/abs/1905.05583)

such as:  
* Truncation methods:  
    + Head only  
    + Tail only  
* Hierarchical methods:  
    + Mean pooling  
    + Max pooling  


### Truncation Methods

 
The truncation methods apply to the input of the BERT model (the tokens).

Usually, the key information of an article is at the beginning and the end. We use two different ways of truncating the text to fine-tune BERT: head-only and tail-only.
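
The paper cited above also describes a third variant, head+tail, which appears in the list at the top of this notebook but is not among the values of the `approach` parameter. In the paper it keeps the first 128 and the last 382 tokens of a document (510 in total, leaving room for `[CLS]` and `[SEP]`). A minimal sketch, with the hypothetical function name `head_tail_truncate`:

```python
def head_tail_truncate(tokens, max_len=510, head_len=128):
    """Keep the first head_len tokens and the last (max_len - head_len) tokens."""
    if len(tokens) <= max_len:
        return list(tokens)  # short documents are left untouched
    tail_len = max_len - head_len
    return tokens[:head_len] + tokens[-tail_len:]


tokens = list(range(1000))  # stand-in for a 1000-token document
kept = head_tail_truncate(tokens)
print(len(kept), kept[127], kept[128])
```

To use it with the 200-token chunks of this notebook, `max_len` and `head_len` would have to be scaled down; suitable values are not given in the source.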


#### Second segmentation approach: head

In this approach, we keep only the first 200-token chunk of each document: if a document is split into 3 chunks of 200 tokens, we keep the first chunk and discard the other two.

![img/Head_Truncation.png](img/Head_Truncation.png)

In [None]:
TRAIN_BATCH_SIZE=8
EPOCH=10
validation_split = .2
shuffle_dataset = True
random_seed= 42
MIN_LEN=249
MAX_LEN = 100000
CHUNK_LEN=200
OVERLAP_LEN=50
#MAX_LEN=10000000
MAX_SIZE_DATASET=1000

approach='head'

print('Loading BERT tokenizer...')
bert_tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-uncased', do_lower_case=True)

dataset=ConsumerComplaintsDataset1(
    tokenizer=bert_tokenizer,
    min_len=MIN_LEN,
    max_len=MAX_LEN,
    chunk_len=CHUNK_LEN,
    max_size_dataset=MAX_SIZE_DATASET,
    overlap_len=OVERLAP_LEN,
    approach=approach)


#train_size = int(0.8 * len(dataset))
#test_size = len(dataset) - train_size
#train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=train_sampler,
    collate_fn=my_collate1)

valid_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=valid_sampler,
    collate_fn=my_collate1)


lr=3e-5#1e-3
num_training_steps=int(len(dataset) / TRAIN_BATCH_SIZE * EPOCH)

# uncomment to run with the Maggioli data
n_class = dataset.get_n_classes()
print(n_class)
model=Bert_Classification_Model(n_class).to(device)
optimizer=AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                        num_warmup_steps = 0,
                                        num_training_steps = num_training_steps)
val_losses=[]
batches_losses=[]
val_acc=[]
prev_val_loss = 1
log_file = open("log_file_temp.txt", "w")
for epoch in range(EPOCH):
    t0 = time.time()    
    print(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    log_file.write(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    batches_losses_tmp=train_loop_fun1(train_data_loader, model, optimizer, device)
    epoch_loss=np.mean(batches_losses_tmp)
    print(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    log_file.write(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    t1=time.time()
    output, target, val_losses_tmp=eval_loop_fun1(valid_data_loader, model, device)
    avg_val_loss = np.mean(val_losses_tmp)
    print(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    log_file.write(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    ac, cr=evaluate(target, output)
    print(f"ACCURACY: {ac}")
    log_file.write(f"ACCURACY: {ac}\n")
    print("CLASSIFICATION_REPORT:")
    print(cr)
    log_file.write(f"CLASSIFICATION_REPORT:\n{cr}")
    #print(f"=====>\t{tmp_evaluate}")
    #log_file.write(f"=====>\t{tmp_evaluate}")
    #val_acc.append(tmp_evaluate['accuracy'])
    #val_losses.append(val_losses_tmp)
    batches_losses.append(batches_losses_tmp)
    print("\t§§ model has been saved §§")
    torch.save(model, f"model_head/best_model_{epoch+1}.pt")    
    
log_file.close()

#### Third segmentation approach: tail

In this approach, we keep only the last 200-token chunk of each document: if a document is split into 3 chunks of 200 tokens, we keep the last chunk and discard the first two.

![img/Tail_Truncation.png](img/Tail_Truncation.png)

In [None]:
TRAIN_BATCH_SIZE=8
EPOCH=10
validation_split = .2
shuffle_dataset = True
random_seed= 42
MIN_LEN=249
MAX_LEN = 100000
CHUNK_LEN=200
OVERLAP_LEN=50
#MAX_LEN=10000000
#MAX_SIZE_DATASET=1000
approach='tail'

print('Loading BERT tokenizer...')
bert_tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-uncased', do_lower_case=True)

dataset=ConsumerComplaintsDataset1(
    tokenizer=bert_tokenizer,
    min_len=MIN_LEN,
    max_len=MAX_LEN,
    chunk_len=CHUNK_LEN,
    #max_size_dataset=MAX_SIZE_DATASET,
    overlap_len=OVERLAP_LEN,
    approach=approach)


#train_size = int(0.8 * len(dataset))
#test_size = len(dataset) - train_size
#train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=train_sampler,
    collate_fn=my_collate1)

valid_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=valid_sampler,
    collate_fn=my_collate1)


n_class = dataset.get_n_classes()
print(n_class)
model=Bert_Classification_Model(n_class).to(device)
lr=3e-5#1e-3
num_training_steps=int(len(dataset) / TRAIN_BATCH_SIZE * EPOCH)

optimizer=AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                        num_warmup_steps = 0,
                                        num_training_steps = num_training_steps)
val_losses=[]
batches_losses=[]
val_acc=[]
log_file = open("log_file_tail.txt", "w")
for epoch in range(EPOCH):
    t0 = time.time()    
    print(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    log_file.write(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    batches_losses_tmp=train_loop_fun1(train_data_loader, model, optimizer, device)
    epoch_loss=np.mean(batches_losses_tmp)
    print(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    log_file.write(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    t1=time.time()
    output, target, val_losses_tmp=eval_loop_fun1(valid_data_loader, model, device)
    avg_val_loss = np.mean(val_losses_tmp)
    print(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    log_file.write(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    ac, cr=evaluate(target, output)
    print(f"ACCURACY: {ac}")
    log_file.write(f"ACCURACY: {ac}\n")
    print("CLASSIFICATION_REPORT:")
    print(cr)
    log_file.write(f"CLASSIFICATION_REPORT:\n{cr}")
    #print(f"=====>\t{tmp_evaluate}")
    #log_file.write(f"=====>\t{tmp_evaluate}")
    #val_acc.append(tmp_evaluate['accuracy'])
    #val_losses.append(val_losses_tmp)
    batches_losses.append(batches_losses_tmp)
    print("\t§§ model has been saved §§")
    torch.save(model, f"model_tail/best_model_{epoch+1}.pt")    
    
log_file.close()

### Hierarchical Method

The hierarchical methods apply to the outputs of the BERT model (the embeddings).
The input text is first divided into k = L/510 fractions, which are fed into BERT to obtain a representation of each of the k fractions. The representation of each fraction is the last-layer hidden state of its `[CLS]` token. We then combine the k representations with mean pooling or max pooling.


#### Mean Pooling 

In this approach, we average the embeddings of all k chunks along each dimension.

![img/Mean_Pooling_Hierarchical.png](img/Mean_Pooling_Hierarchical.png)
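
As a small worked example of mean pooling, the sketch below averages k = 3 toy chunk embeddings of dimension 2. Real `[CLS]` embeddings have 768 dimensions, and the name `mean_pool` is hypothetical, not taken from `BERT_Hierarchical.py`.

```python
def mean_pool(chunk_embeddings):
    """Average k chunk embeddings dimension by dimension."""
    k = len(chunk_embeddings)
    # zip(*...) groups the values of each dimension across all chunks.
    return [sum(dim_values) / k for dim_values in zip(*chunk_embeddings)]


chunk_embeddings = [
    [1.0, 4.0],  # chunk 1
    [3.0, 0.0],  # chunk 2
    [2.0, 2.0],  # chunk 3
]
doc_embedding = mean_pool(chunk_embeddings)
print(doc_embedding)
```

The result has the same dimension as a single chunk embedding, whatever the number of chunks, so documents of any length map to a fixed-size input for the classifier.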

In [None]:
TRAIN_BATCH_SIZE=8
EPOCH=10
validation_split = .2
shuffle_dataset = True
random_seed= 42
MIN_LEN=249
MAX_LEN = 100000
CHUNK_LEN=200
OVERLAP_LEN=50
#MAX_LEN=10000000
#MAX_SIZE_DATASET=1000

print('Loading BERT tokenizer...')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

dataset=ConsumerComplaintsDataset1(
    tokenizer=bert_tokenizer,
    min_len=MIN_LEN,
    max_len=MAX_LEN,
    chunk_len=CHUNK_LEN,
    #max_size_dataset=MAX_SIZE_DATASET,
    overlap_len=OVERLAP_LEN)


#train_size = int(0.8 * len(dataset))
#test_size = len(dataset) - train_size
#train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=train_sampler,
    collate_fn=my_collate1)

valid_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=valid_sampler,
    collate_fn=my_collate1)


lr=3e-5#1e-3
num_training_steps=int(len(dataset) / TRAIN_BATCH_SIZE * EPOCH)

pooling_method="mean"
model_hierarchical=BERT_Hierarchical_Model(pooling_method=pooling_method).to(device)
optimizer=AdamW(model_hierarchical.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                        num_warmup_steps = 0,
                                        num_training_steps = num_training_steps)
val_losses=[]
batches_losses=[]
val_acc=[]
for epoch in range(EPOCH):
    t0 = time.time()    
    print(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    batches_losses_tmp=rnn_train_loop_fun1(train_data_loader, model_hierarchical, optimizer, device)
    epoch_loss=np.mean(batches_losses_tmp)
    print(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    t1=time.time()
    output, target, val_losses_tmp=rnn_eval_loop_fun1(valid_data_loader, model_hierarchical, device)
    print(f"==> evaluation : avg_loss = {np.mean(val_losses_tmp):.2f}, time : {time.time()-t1:.2f} sec\n")    
    tmp_evaluate=evaluate(target.reshape(-1), output)
    print(f"=====>\t{tmp_evaluate}")
    val_acc.append(tmp_evaluate['accuracy'])
    val_losses.append(val_losses_tmp)
    batches_losses.append(batches_losses_tmp)
    print(f"\t§§ the Hierarchical {pooling_method} pooling model has been saved §§")
    torch.save(model_hierarchical, f"model_hierarchical/{pooling_method}_pooling/model_{pooling_method}_pooling_epoch{epoch+1}.pt")    

#### Max Pooling 

In this approach, we take the element-wise maximum of the embeddings of all k chunks along each dimension.

![img/Max_Pooling_Hierarchical.png](img/Max_Pooling_Hierarchical.png)
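
A small worked example of max pooling on toy data (k = 3 chunks, dimension 2). The name `max_pool` is hypothetical, not taken from `BERT_Hierarchical.py`.

```python
def max_pool(chunk_embeddings):
    """Take the largest value of each dimension across the k chunk embeddings."""
    return [max(dim_values) for dim_values in zip(*chunk_embeddings)]


chunk_embeddings = [
    [1.0, 4.0],  # chunk 1
    [3.0, 0.0],  # chunk 2
    [2.0, 2.0],  # chunk 3
]
doc_embedding = max_pool(chunk_embeddings)
print(doc_embedding)
```

Note that the maximum of each dimension can come from a different chunk: here the first dimension is taken from chunk 2 and the second from chunk 1.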

In [None]:
TRAIN_BATCH_SIZE=3
EPOCH=1
validation_split = .2
shuffle_dataset = True
random_seed= 42
MIN_LEN=249
MAX_LEN = 100000
CHUNK_LEN=200
OVERLAP_LEN=50
#MAX_LEN=10000000
#MAX_SIZE_DATASET=1000

print('Loading BERT tokenizer...')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

dataset=ConsumerComplaintsDataset1(
    tokenizer=bert_tokenizer,
    min_len=MIN_LEN,
    max_len=MAX_LEN,
    chunk_len=CHUNK_LEN,
    #max_size_dataset=MAX_SIZE_DATASET,
    overlap_len=OVERLAP_LEN)


#train_size = int(0.8 * len(dataset))
#test_size = len(dataset) - train_size
#train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=train_sampler,
    collate_fn=my_collate1)

valid_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=valid_sampler,
    collate_fn=my_collate1)


device="cpu"
lr=3e-5#1e-3
num_training_steps=int(len(dataset) / TRAIN_BATCH_SIZE * EPOCH)

pooling_method="max"
model_hierarchical=BERT_Hierarchical_Model(pooling_method=pooling_method).to(device)
optimizer=AdamW(model_hierarchical.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                        num_warmup_steps = 0,
                                        num_training_steps = num_training_steps)
val_losses=[]
batches_losses=[]
val_acc=[]
for epoch in range(EPOCH):
    t0 = time.time()    
    print(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    batches_losses_tmp=rnn_train_loop_fun1(train_data_loader, model_hierarchical, optimizer, device)
    epoch_loss=np.mean(batches_losses_tmp)
    print(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    t1=time.time()
    output, target, val_losses_tmp=rnn_eval_loop_fun1(valid_data_loader, model_hierarchical, device)
    print(f"==> evaluation : avg_loss = {np.mean(val_losses_tmp):.2f}, time : {time.time()-t1:.2f} sec\n")    
    tmp_evaluate=evaluate(target.reshape(-1), output)
    print(f"=====>\t{tmp_evaluate}")
    val_acc.append(tmp_evaluate['accuracy'])
    val_losses.append(val_losses_tmp)
    batches_losses.append(batches_losses_tmp)
    print(f"\t§§ the Hierarchical {pooling_method} pooling model has been saved §§")
    torch.save(model_hierarchical, f"model_hierarchical/{pooling_method}_pooling/model_{pooling_method}_pooling_epoch{epoch+1}.pt")    

# RoBERT RNN classifier on top of the Fine Tuned Bert Model

The input text is first divided into k = L/510 fractions, which are fed into the fine-tuned BERT (as described above, *cf. First segmentation approach: all*) to obtain a representation of each of the k text chunks. The representation of each chunk is the last-layer hidden state of its `[CLS]` token.

Each chunk embedding (representation) becomes the input of an LSTM cell. This way the order of the chunks is preserved, and the length of the document is no longer a limitation, because an LSTM handles sequences of variable length (across different batches).

We then pass the last hidden state (nbDoc * 100) to a neural network with the same architecture (as described above, we use the same classification network throughout the whole notebook, [cf. 3. Fine-tuning on the 200 tokens chunks](#3.-Fine-tuning-on-the-200-tokens-chunks:)).

![img/RoBERT.png](img/RoBERT.png)
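
Before a recurrent layer can run over the chunks of each document, the flat batch of chunk embeddings has to be regrouped per document, and documents with fewer chunks have to be padded. The sketch below shows that bookkeeping in plain Python. It is an assumption about how such a step could look, not the code of `RoBERT.py`; the names `group_by_document` and `pad_documents` are hypothetical.

```python
def group_by_document(chunk_embeddings, chunks_per_doc):
    """Split a flat list of chunk embeddings into one list per document."""
    if sum(chunks_per_doc) != len(chunk_embeddings):
        raise ValueError("chunk counts do not match the number of embeddings")
    docs, start = [], 0
    for n_chunks in chunks_per_doc:
        docs.append(chunk_embeddings[start:start + n_chunks])
        start += n_chunks
    return docs


def pad_documents(docs, emb_dim):
    """Pad every document with zero vectors up to the longest one.

    Returns the padded batch and the true number of chunks per document,
    so that the padding can be ignored downstream.
    """
    lengths = [len(doc) for doc in docs]
    max_chunks = max(lengths)
    padded = [doc + [[0.0] * emb_dim for _ in range(max_chunks - len(doc))] for doc in docs]
    return padded, lengths


# Five toy chunk embeddings (dimension 2): 3 chunks for document A, 2 for document B.
flat = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
docs = group_by_document(flat, chunks_per_doc=[3, 2])
padded, lengths = pad_documents(docs, emb_dim=2)
print(lengths)
print(padded[1])
```

In PyTorch, a padded batch together with its true lengths is the kind of input that `torch.nn.utils.rnn.pack_padded_sequence` accepts, which lets the LSTM skip the padded positions.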

In [None]:
TRAIN_BATCH_SIZE=8
EPOCH=10
validation_split = .2
shuffle_dataset = True
random_seed= 42
MIN_LEN=249
MAX_LEN = 100000
CHUNK_LEN=200
OVERLAP_LEN=50
#MAX_LEN=10000000
MAX_SIZE_DATASET=1000

print('Loading BERT tokenizer...')
bert_tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-uncased', do_lower_case=True)

dataset=ConsumerComplaintsDataset1(
    tokenizer=bert_tokenizer,
    min_len=MIN_LEN,
    max_len=MAX_LEN,
    chunk_len=CHUNK_LEN,
    #max_size_dataset=MAX_SIZE_DATASET,
    overlap_len=OVERLAP_LEN)


#train_size = int(0.8 * len(dataset))
#test_size = len(dataset) - train_size
#train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=train_sampler,
    collate_fn=my_collate1)

valid_data_loader=DataLoader(
    dataset,
    batch_size=TRAIN_BATCH_SIZE,
    sampler=valid_sampler,
    collate_fn=my_collate1)


lr=3e-5#1e-3
# Total optimizer steps: batches per epoch on the training split, times the number of epochs
num_training_steps=len(train_data_loader) * EPOCH

n_class = dataset.get_n_classes()
model=torch.load("model_head/best_model_10.pt")

model_rnn=RoBERT_Model(n_class=n_class, bertFineTuned=list(model.children())[0]).to(device)
optimizer=AdamW(model_rnn.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                        num_warmup_steps = 0,
                                        num_training_steps = num_training_steps)
val_losses=[]
batches_losses=[]
val_acc=[]


log_file = open("log_file_rnn.txt", "w")
for epoch in range(EPOCH):
    t0 = time.time()    
    print(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    log_file.write(f"\n=============== EPOCH {epoch+1} / {EPOCH} ===============\n")
    batches_losses_tmp=rnn_train_loop_fun1(train_data_loader, model_rnn, optimizer, device)
    epoch_loss=np.mean(batches_losses_tmp)
    print(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    log_file.write(f"\n*** avg_loss : {epoch_loss:.2f}, time : ~{(time.time()-t0)//60} min ({time.time()-t0:.2f} sec) ***\n")
    t1=time.time()
    output, target, val_losses_tmp=rnn_eval_loop_fun1(valid_data_loader, model_rnn, device)
    avg_val_loss = np.mean(val_losses_tmp)
    print(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    log_file.write(f"==> evaluation : avg_loss = {avg_val_loss:.2f}, time : {time.time()-t1:.2f} sec\n")
    ac, cr=evaluate(target, output)
    print(f"ACCURACY: {ac}")
    log_file.write(f"ACCURACY: {ac}\n")
    print("CLASSIFICATION_REPORT:")
    print(cr)
    log_file.write(f"CLASSIFICATION_REPORT:\n{cr}\n")
    # Keep per-epoch validation metrics so the loss and accuracy plots have data
    val_acc.append(ac)
    val_losses.append(val_losses_tmp)
    batches_losses.append(batches_losses_tmp)
    print("\t§§ model has been saved §§")
    # Save the RoBERT model being trained, not the fine-tuned BERT loaded above
    torch.save(model_rnn, f"model_rnn1/best_model_{epoch+1}.pt")    
    
log_file.close()

Loading BERT tokenizer...
Nettoyage des données
['Manuale del Commercio', 'Periodici', 'Leggioggi', 'Gazzetta', 'Manuale della Circolazione Stradale']


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.




___ batch index = 0 / 3387 (0.00%), loss = 0.8812, time = 0.64 secondes ___
___ batch index = 640 / 3387 (18.90%), loss = 0.6725, time = 431.24 secondes ___
___ batch index = 1280 / 3387 (37.79%), loss = 0.6508, time = 444.84 secondes ___
___ batch index = 1920 / 3387 (56.69%), loss = 0.6489, time = 431.20 secondes ___
___ batch index = 2560 / 3387 (75.58%), loss = 0.6273, time = 436.58 secondes ___
___ batch index = 3200 / 3387 (94.48%), loss = 0.6529, time = 429.62 secondes ___

*** avg_loss : 0.65, time : ~38.0 min (2298.30 sec) ***

___ batch index = 640 / 3387 (18.90%), loss = 0.6190, time = 395.36 secondes ___
___ batch index = 1280 / 3387 (37.79%), loss = 0.6298, time = 390.85 secondes ___


In [None]:
pd.DataFrame(np.array([[np.mean(x) for x in batches_losses], [np.mean(x) for x in val_losses]]).T,
                   columns=['Training', 'Validation']).plot(title="loss")

In [None]:
pd.DataFrame(np.array(val_acc).T,
                   columns=['Validation']).plot(title="accuracy")

# Summary

In [None]:
summary=pd.DataFrame({"all":[0.74, 0.57, 0.83], "head only":[0.62, 0.45, 0.87], "tail only":[0.93, 0.75, 0.77], "mean pooling":[0.62, 0.47, 0.87], "max pooling":[0.65, 0.45, 0.87], "RoBERT":[0.46, 0.34, 0.91]}, index=["avg_loss_train", "avg_loss_val", "accuracy"])
summary.columns.name = 'after one epoch'
summary.style.set_properties(
    subset=['RoBERT'], 
    **{'font-weight': 'bold'}
)

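We can observe that the RoBERT model gives the best results after one epoch: it has the lowest training loss (0.46), the lowest validation loss (0.34) and the highest accuracy (0.91), a clear improvement over the other approaches tested here. The head-only model comes next, with a validation loss of 0.45 and an accuracy of 0.87, on par with the mean pooling and max pooling models. This makes sense, because we can imagine that consumers state their complaint in the first part of their comment.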

In [None]:
model=torch.load("model_rnn1/model_rnn_epoch3.pt")

#model_rnn=RoBERT_Model(bertFineTuned=list(model.children())[0]).to(device)

In [None]:
t1=time.time()
output, target, val_losses_tmp=rnn_eval_loop_fun1(valid_data_loader, model, device)
print(f"==> evaluation : avg_loss = {np.mean(val_losses_tmp):.2f}, time : {time.time()-t1:.2f} sec\n")    
tmp_evaluate=evaluate(target.reshape(-1), output)
print(f"=====>\t{tmp_evaluate}")