# POC for classifying financial excerpts

## Objectives

This notebook demonstrates text classification with three transformer models: DistilBERT, BERT, and RoBERTa. We train each model on labeled data, use the trained models to label the unlabeled data, and visualize the results.

**Steps to perform text classification**
1. Prepare and preprocess the data.
2. Tune hyperparameters using DistilBERT.
3. Train and evaluate the DistilBERT, BERT, and RoBERTa models.
4. Label the unlabeled dataset using the trained models.
5. Visualize the label distribution.

## Imports

In [None]:
# !pip install pandas
# !pip install scikit-learn
# !pip install matplotlib plotly
# !pip install transformers==4.18.0
# !pip install tensorflow==2.15.0

In [83]:
import json
import os
import pandas as pd
import transformers
import matplotlib.pyplot as plt
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification #DistilBERT
from transformers import BertTokenizer, TFBertForSequenceClassification #BERT
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification #RoBERTa

from transformers import logging, TFTrainingArguments
from transformers.trainer_tf import TFTrainer
from transformers import TextClassificationPipeline

import tensorflow as tf

## Data Preprocessing
In this section, we will load the dataset, preprocess the data, and split it into training and validation sets.

In [15]:
def extract_normalized_table(table: dict) -> str:
    """
    Convert a table dictionary into a formatted string.

    Each row of table['normalized_csv'] becomes one line: every cell is
    suffixed with '|', cells are joined by spaces, and the row ends with '.'.
    Returns an empty string if the key is missing or the table has no rows.
    """
    col_sep = '|'
    row_sep = '.'
    table_data = []
    if 'normalized_csv' in table:
        for row in table['normalized_csv']:
            # str() guards against non-string cells such as numbers
            row_str = ' '.join(str(col) + col_sep for col in row)
            # Example: Consumer installment loans| 35| 35| 6| 6| 41| 41| —| —| 6| 6|.
            table_data.append(row_str.replace('\n', ' ') + row_sep)

    # Newlines between rows are not required, but they make the output easier to read
    return '\n'.join(table_data)
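To make the output format concrete, here is a minimal, self-contained sketch of the same flattening applied to a made-up table. The helper name `flatten_table` and the toy rows are illustrative assumptions, not part of the dataset.

```python
def flatten_table(rows, col_sep="|", row_sep="."):
    """Flatten rows into one line per row, mirroring the notebook's format."""
    lines = []
    for row in rows:
        # Suffix every cell with the column separator and join with spaces
        cells = " ".join(f"{cell}{col_sep}" for cell in row)
        # Remove embedded newlines so each table row stays on a single line
        lines.append(cells.replace("\n", " ") + row_sep)
    return "\n".join(lines)


# Hypothetical table in the same shape as the 'normalized_csv' field
toy_table = {
    "normalized_csv": [
        ["Item", "2022", "2021"],
        ["Consumer installment loans", "35", "41"],
    ]
}

flat = flatten_table(toy_table["normalized_csv"])
print(flat)
```

Each printed line ends with `|.`, so row and column boundaries stay visible to the tokenizer even though the table is now plain text.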

In [16]:
def read_and_separate_jsonl(file_path):
    '''
    Read a JSONL file and split its rows by label availability.

    1. Read the data.
    2. Convert the content value to a normalized table string if is_table is True.
    3. Separate the data into two parts:
        data_with_label
        data_without_label
    '''
    data_with_label = []
    data_without_label = []

    # Read the JSONL file into a DataFrame
    df = pd.read_json(file_path, lines=True)

    # Iterate through each row in the DataFrame
    for _, row in df.iterrows():

        # Tables are stored as dictionaries; convert them to strings
        content = row['content']
        if row['is_table']:
            content = extract_normalized_table(content)

        # Prepare the data list for each row
        data = [
            row['id'],
            row['document_id'],
            row['document_title'],
            content,
            row['is_table'],
            row['prev_content'],
            row['next_content'],
            row['label']
        ]

        # Separate data based on the presence of a label.
        # pd.isna covers both None and NaN, whichever pandas uses for a missing label.
        if pd.isna(row['label']):
            data_without_label.append(data)
        else:
            data_with_label.append(data)

    return data_with_label, data_without_label
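The same split can also be expressed with a boolean mask instead of a row loop. The sketch below is an alternative under assumptions, not the notebook's method: the three JSONL records are invented, and only a subset of the real columns is shown.

```python
import io

import pandas as pd

# Hypothetical JSONL content with the same 'label' convention (null = unlabeled)
jsonl = io.StringIO(
    '{"id": 1, "content": "Revenue rose 5%.", "is_table": false, "label": "TEXT"}\n'
    '{"id": 2, "content": "Page 3", "is_table": false, "label": null}\n'
    '{"id": 3, "content": "1", "is_table": false, "label": "NOISE"}\n'
)

df = pd.read_json(jsonl, lines=True)

# notna() is True for real labels and False for None/NaN
has_label = df["label"].notna()
labeled = df[has_label]
unlabeled = df[~has_label]

print(len(labeled), len(unlabeled))
```

Masking keeps the result as DataFrames, so the column names do not have to be repeated when the labeled and unlabeled frames are built.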

In [17]:
# Read and separate the data
file_path = './data/excerpts.jsonl'
data_with_label, data_without_label = read_and_separate_jsonl(file_path)

print("Data with label: ", len(data_with_label))
print("Data without label: ", len(data_without_label))

# creating dataframes for labeled and unlabeled data
labeled_df = pd.DataFrame(data_with_label, columns=['id','document_id','document_title','content','is_table','prev_content', 'next_content','label'])
unlabeled_df = pd.DataFrame(data_without_label, columns=['id','document_id','document_title','content','is_table','prev_content', 'next_content','label'])

labeled_df.head()

Data with label:  382
Data without label:  10000


index,id,document_id,document_title,content,is_table,prev_content,next_content,label
0,75846004,522393,CFG 8-K 10/25/18,6. Conditions to Obligations. The several obli...,False,,(f) The Underwriters shall have received on th...,TEXT
1,79398889,551017,URI 8-K 12/07/22,C: (203) 399-8951,False,,Indicate by check mark whether the registrant ...,NOISE
2,76351219,526475,DIS 8-K 05/08/19 Earnings Release,1,False,,The following table summarizes the second quar...,NOISE
3,76180796,524955,CPB 8-K 09/03/20 Earnings Release,Continuing Operations| Three Months Ended| Thr...,True,today reported results for its fourth-quarter ...,,FIN_TABLE
4,75324462,518706,AME 8-K 05/12/22 Entry into a Material Definit...,"Berwyn, Pa., May 13, 2022 – AMETEK, Inc. (NYSE...",False,,__________________,TEXT


In [18]:
# checking count for each label - FIN_TABLE, TEXT, NOISE
labeled_df['label'].value_counts()

label
NOISE        170
TEXT         140
FIN_TABLE     72
Name: count, dtype: int64

In [19]:
# Encode labels as integers; categories are ordered alphabetically: FIN_TABLE=0, NOISE=1, TEXT=2
labeled_df['encoded_label'] = labeled_df['label'].astype('category').cat.codes
labeled_df.head()

index,id,document_id,document_title,content,is_table,prev_content,next_content,label,encoded_label
0,75846004,522393,CFG 8-K 10/25/18,6. Conditions to Obligations. The several obli...,False,,(f) The Underwriters shall have received on th...,TEXT,2
1,79398889,551017,URI 8-K 12/07/22,C: (203) 399-8951,False,,Indicate by check mark whether the registrant ...,NOISE,1
2,76351219,526475,DIS 8-K 05/08/19 Earnings Release,1,False,,The following table summarizes the second quar...,NOISE,1
3,76180796,524955,CPB 8-K 09/03/20 Earnings Release,Continuing Operations| Three Months Ended| Thr...,True,today reported results for its fourth-quarter ...,,FIN_TABLE,0
4,75324462,518706,AME 8-K 05/12/22 Entry into a Material Definit...,"Berwyn, Pa., May 13, 2022 – AMETEK, Inc. (NYSE...",False,,__________________,TEXT,2
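Because the trained models predict integer codes, it helps to keep the code-to-label mapping around for decoding predictions later. A minimal sketch, assuming the same three labels; the variable names `id2label` and `label2id` are illustrative.

```python
import pandas as pd

# Toy labels in the same order as the first five rows shown above
labels = pd.Series(["TEXT", "NOISE", "NOISE", "FIN_TABLE", "TEXT"])

# Converting to 'category' sorts the distinct values, then assigns codes 0..n-1
cat = labels.astype("category")
codes = cat.cat.codes.tolist()

# Build lookup tables in both directions from the category order
id2label = dict(enumerate(cat.cat.categories))
label2id = {label: idx for idx, label in id2label.items()}

print(codes)
print(id2label)
```

The codes match the `encoded_label` column in the output above: `TEXT` maps to 2, `NOISE` to 1, and `FIN_TABLE` to 0.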


In [20]:
# Preparing data for training and validation
data_texts = labeled_df['content'].to_list()
data_labels = labeled_df['encoded_label'].to_list()

### Train Test Split

In [21]:
train_texts, val_texts, train_labels, val_labels = train_test_split(data_texts, data_labels, test_size=0.2, random_state=0)
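The split above is purely random. With 72 `FIN_TABLE` examples against 170 `NOISE` examples, one possible variant is a stratified split, which keeps the class proportions roughly equal in both sets. The sketch below is an assumption-laden illustration, not what the notebook ran: the texts are placeholders and only the class counts come from the data above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder texts; class counts mirror the label distribution shown earlier
texts = [f"excerpt {i}" for i in range(382)]
labels = [1] * 170 + [2] * 140 + [0] * 72  # NOISE=1, TEXT=2, FIN_TABLE=0

# stratify=labels asks scikit-learn to preserve class proportions in each split
tr_x, va_x, tr_y, va_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

print(len(tr_x), len(va_x))
print(Counter(va_y))
```

Switching to a stratified split would change which examples land in the validation set, so metrics would not be directly comparable with the results reported in this notebook.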

## Model Training and Evaluation

In [25]:
def compute_metrics(pred):
    '''
    Compute accuracy and class-weighted precision, recall, and F1-score to assess model performance.
    '''
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
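To see what these metrics look like on concrete numbers, here is a small sketch that mimics the prediction object with a `SimpleNamespace`. The logits and labels are invented for illustration; only the attribute names `label_ids` and `predictions` follow the function above.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical stand-in for the object passed to compute_metrics
fake_pred = SimpleNamespace(
    label_ids=np.array([0, 1, 2, 2]),
    predictions=np.array([
        [2.0, 0.1, 0.1],  # predicted class 0
        [0.1, 2.0, 0.1],  # predicted class 1
        [0.1, 0.2, 3.0],  # predicted class 2
        [0.1, 2.5, 0.3],  # predicted class 1 (true class is 2)
    ]),
)

# argmax over the last axis turns logits into class ids
preds = fake_pred.predictions.argmax(-1)
acc = accuracy_score(fake_pred.label_ids, preds)
precision, recall, f1, _ = precision_recall_fscore_support(
    fake_pred.label_ids, preds, average="weighted", zero_division=0
)

print(acc, precision, recall, f1)
```

With `average='weighted'`, per-class recall is weighted by class support, which makes weighted recall equal to accuracy. That is why the recall and accuracy values coincide in the results later in this notebook.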

### DistilBERT

In this section, we will train a DistilBERT model on the labeled data and evaluate its performance.

DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

**Steps:**
1. **Load Tokenizer and Model**: Initialize the tokenizer and model from the Hugging Face library.
2. **Tokenize Data**: Tokenize the data using the DistilBERT tokenizer.
3. **Create Datasets**: Prepare the training and validation datasets using tokenized data.
4. **Define Training Arguments**: Set the parameters for training, such as learning rate and batch size.
5. **Train the Model**: Train the DistilBERT model using the training dataset.
6. **Evaluate the Model**: Evaluate the trained model on the validation dataset.
7. **Save the Model**: Save the trained model and tokenizer for future use.


In [22]:
# Load tokenizer from the pre-trained DistilBERT model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# Load the pre-trained DistilBERT model for sequence classification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_projector', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

In [23]:
# Tokenize train_texts and val_texts; truncate long inputs and pad each set to its longest sequence
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
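The effect of `truncation=True` and `padding=True` can be illustrated with a toy function that works on lists of token ids. This is a simplified sketch, not the tokenizer's implementation: the function name, the `max_length` of 8, and the token ids are assumptions, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length=8, pad_id=0):
    """Toy version of batch truncation and padding on lists of token ids."""
    # Truncation: cut every sequence down to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


enc = pad_and_truncate(
    [[101, 7, 8, 102], [101, 5, 102], list(range(1, 13))], max_length=8
)
print(enc["input_ids"])
print(enc["attention_mask"])
```

All sequences end up with the same length, which is what allows the encodings to be stacked into tensors for `tf.data.Dataset.from_tensor_slices`.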

In [24]:
# Creating TensorFlow datasets for training and validation
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

In [26]:
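def train_and_evaluate(learning_rate, epochs):
    '''
    Fine-tune a fresh DistilBERT classifier with the given learning rate and
    number of epochs, then return the evaluation metrics on the validation set.
    '''
    # Define training arguments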
    training_args = TFTrainingArguments(
        output_dir='./results',
        num_train_epochs=epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        warmup_steps=30,
        logging_dir='./logs',
        eval_steps=10
    )

    # Load the model inside the distribution strategy scope
    with training_args.strategy.scope():
        trainer_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

    # Initializing the TFTrainer
    trainer = TFTrainer(
        model=trainer_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )
    
    # Training the model
    trainer.train()
    
    # Evaluating the model
    eval_results = trainer.evaluate()
    
    return eval_results

#### **Hyperparameter Tuning**

In this section, we explore the effect of different hyperparameters on DistilBERT's performance. Specifically, we vary the learning rate and the number of training epochs. The values tested are:

- **Learning Rates**: 5e-5, 3e-5, 2e-5
- **Epochs**: 5, 6, 7, 8, 9

We train and evaluate a fresh model for each combination of learning rate and number of epochs. The evaluation metrics are accuracy, precision, recall, F1 score, and loss, which together help us choose the best hyperparameters for this task.

The code below performs the hyperparameter testing and stores the results in a DataFrame for further analysis.
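The search space is the Cartesian product of the two lists above. A minimal sketch of enumerating it with `itertools.product` (the variable name `grid` is illustrative):

```python
from itertools import product

learning_rates = [5e-5, 3e-5, 2e-5]
epochs_list = [5, 6, 7, 8, 9]

# Every (learning_rate, epochs) pair, in the same order as nested for-loops
grid = list(product(learning_rates, epochs_list))

print(len(grid))
print(grid[0], grid[-1])
```

Three learning rates and five epoch settings give 15 training runs, each starting from the pre-trained checkpoint.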

In [37]:
# hyperparameters to test
learning_rates = [5e-5, 3e-5, 2e-5]
epochs_list = [5, 6, 7, 8, 9]

results = []

for lr in learning_rates:
    for epochs in epochs_list:
        print(f"Training with learning rate: {lr} and epochs: {epochs}")
        eval_results = train_and_evaluate(learning_rate=lr, epochs=epochs)
        results.append({
            'learning_rate': lr,
            'epochs': epochs,
            'accuracy': eval_results['eval_accuracy'],
            'precision': eval_results['eval_precision'],
            'loss': eval_results['eval_loss'],
            'recall': eval_results['eval_recall'],
            'f1': eval_results['eval_f1']
        })

# Converting results to DataFrame
results_df = pd.DataFrame(results)

Training with learning rate: 5e-05 and epochs: 5


0,1
epoch,7.0
eval_accuracy,0.89844
eval_f1,0.89783
eval_loss,0.53065
eval_precision,0.90017
eval_recall,0.89844


Training with learning rate: 5e-05 and epochs: 6


0,1
epoch,5.0
eval_accuracy,0.91406
eval_f1,0.91381
eval_loss,0.46219
eval_precision,0.91505
eval_recall,0.91406


Training with learning rate: 5e-05 and epochs: 7


0,1
epoch,6.0
eval_accuracy,0.95312
eval_f1,0.95315
eval_loss,0.27983
eval_precision,0.95422
eval_recall,0.95312


Training with learning rate: 5e-05 and epochs: 8


0,1
epoch,7.0
eval_accuracy,0.94531
eval_f1,0.94522
eval_loss,0.20967
eval_precision,0.94656
eval_recall,0.94531


Training with learning rate: 5e-05 and epochs: 9


0,1
epoch,8.0
eval_accuracy,0.91406
eval_f1,0.91337
eval_loss,0.2503
eval_precision,0.91909
eval_recall,0.91406


Training with learning rate: 3e-05 and epochs: 5


0,1
epoch,9.0
eval_accuracy,0.92969
eval_f1,0.92975
eval_loss,0.21758
eval_precision,0.93266
eval_recall,0.92969


Training with learning rate: 3e-05 and epochs: 6


0,1
epoch,5.0
eval_accuracy,0.86719
eval_f1,0.86583
eval_loss,0.6854
eval_precision,0.8734
eval_recall,0.86719


Training with learning rate: 3e-05 and epochs: 7


0,1
epoch,6.0
eval_accuracy,0.90625
eval_f1,0.90562
eval_loss,0.50317
eval_precision,0.90915
eval_recall,0.90625


Training with learning rate: 3e-05 and epochs: 8


0,1
epoch,7.0
eval_accuracy,0.94531
eval_f1,0.94522
eval_loss,0.35123
eval_precision,0.94656
eval_recall,0.94531


Training with learning rate: 3e-05 and epochs: 9


0,1
epoch,8.0
eval_accuracy,0.94531
eval_f1,0.94532
eval_loss,0.25881
eval_precision,0.94625
eval_recall,0.94531


Training with learning rate: 2e-05 and epochs: 5


0,1
epoch,9.0
eval_accuracy,0.96094
eval_f1,0.96098
eval_loss,0.18855
eval_precision,0.96133
eval_recall,0.96094


Training with learning rate: 2e-05 and epochs: 6


0,1
epoch,5.0
eval_accuracy,0.86719
eval_f1,0.86631
eval_loss,0.84376
eval_precision,0.86846
eval_recall,0.86719


Training with learning rate: 2e-05 and epochs: 7


0,1
epoch,6.0
eval_accuracy,0.86719
eval_f1,0.86583
eval_loss,0.68479
eval_precision,0.8734
eval_recall,0.86719


Training with learning rate: 2e-05 and epochs: 8


0,1
epoch,7.0
eval_accuracy,0.89844
eval_f1,0.89783
eval_loss,0.53065
eval_precision,0.90017
eval_recall,0.89844


Training with learning rate: 2e-05 and epochs: 9


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_projector', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_579']
You should probably TRAIN this model on a down-stream task to be able to use 

0,1
epoch,8.0
eval_accuracy,0.92188
eval_f1,0.92171
eval_loss,0.3943
eval_precision,0.92375
eval_recall,0.92188


In [90]:
results_df

Unnamed: 0,learning_rate,epochs,accuracy,precision,loss,recall,f1
0,5e-05,5,0.914062,0.915054,0.462191,0.914062,0.913813
1,5e-05,6,0.953125,0.954221,0.279833,0.953125,0.953155
2,5e-05,7,0.945312,0.946562,0.209668,0.945312,0.945219
3,5e-05,8,0.914062,0.919089,0.250299,0.914062,0.913368
4,5e-05,9,0.929688,0.932658,0.217583,0.929688,0.929748
5,3e-05,5,0.867188,0.873403,0.6854,0.867188,0.865834
6,3e-05,6,0.90625,0.909148,0.503169,0.90625,0.90562
7,3e-05,7,0.945312,0.946562,0.351228,0.945312,0.945219
8,3e-05,8,0.945312,0.946247,0.258813,0.945312,0.945323
9,3e-05,9,0.960938,0.961328,0.188554,0.960938,0.960983


#### Results
We trained the model with various learning rates and epochs. The following graphs show the performance of the model for different hyperparameter combinations.

In [89]:
results_df.to_csv('./data/hyperparameter_tuning_results.csv', index=False)

In [82]:
# hyperparameter tuning results

# Plot accuracy
accuracy_fig = go.Figure()
for lr in learning_rates:
    subset = results_df[results_df['learning_rate'] == lr]
    accuracy_fig.add_trace(go.Scatter(x=subset['epochs'], y=subset['accuracy'], mode='lines', name=f'LR={lr}',
                                      hovertemplate='Learning Rate: %{customdata[0]}<br>Epochs: %{x}<br>Accuracy: %{customdata[1]}<br>Loss: %{customdata[2]}',
                                       customdata=subset[['learning_rate', 'accuracy', 'loss']]))

accuracy_fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='Accuracy',
    title='Accuracy vs. Epochs for Different Learning Rates',
    legend_title='Learning Rate',
    template='plotly_white',
    width=800
)

accuracy_fig.show()

# Plot F1-score
f1_fig = go.Figure()
for lr in learning_rates:
    subset = results_df[results_df['learning_rate'] == lr]
    f1_fig.add_trace(go.Scatter(x=subset['epochs'], y=subset['f1'], mode='lines', name=f'LR={lr}',
                                hovertemplate='Learning Rate: %{customdata[0]}<br>Epochs: %{x}<br>F1 Score: %{customdata[1]}<br>Loss: %{customdata[2]}',
                                       customdata=subset[['learning_rate', 'f1', 'loss']]))

f1_fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='F1 Score',
    title='F1 Score vs. Epochs for Different Learning Rates',
    legend_title='Learning Rate',
    template='plotly_white',
    width=800
)

f1_fig.show()

**Inference drawn:**

After conducting hyperparameter tuning by experimenting with different learning rates and epochs, we observe the following results:

- **Learning Rate 0.00005:**

    - Achieves its highest accuracy of 95.31% at 6 epochs.
    - At that setting, precision is 95.42%, recall is 95.31%, and the F1-score is 95.32%.

- **Learning Rate 0.00003:**

    - Achieves the highest accuracy of 96.09% with 9 epochs.
    - Shows a precision of 96.13%, recall of 96.09%, and F1-score of 96.10%.

- **Learning Rate 0.00002:**

    - Achieves the highest accuracy of 95.31% with 9 epochs.
    - Shows a precision of 95.42%, recall of 95.31%, and F1-score of 95.30%.

From these results, the best configuration is a learning rate of 0.00003 with 9 epochs: it gives the highest accuracy (96.09%) with closely matched precision, recall, and F1 scores. Because 9 is the largest epoch count in the grid, this choice favors validation performance over training time, and we use it for the DistilBERT model on this dataset.

**Key Observations**
- At a learning rate of 3e-5, validation accuracy rises from 86.72% at 5 epochs to 96.09% at 9 epochs, so the model benefits from more training iterations at this rate.
- At the higher learning rate of 5e-5, accuracy peaks earlier (95.31% at 6 epochs) and then fluctuates between 91.41% and 94.53%, which suggests that a larger learning rate converges faster but gains little from additional epochs.
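
The per-learning-rate summary can be checked directly against the `results_df` output. The following is a minimal sketch in plain Python; the rows are copied from the table shown earlier (only the 5e-5 and 3e-5 runs appear in that output), and the variable names are illustrative rather than part of the notebook.

```python
# Rows copied from the results_df output: (learning_rate, epochs, accuracy, f1).
# Only the 5e-5 and 3e-5 runs are listed in that output.
rows = [
    (5e-05, 5, 0.914062, 0.913813),
    (5e-05, 6, 0.953125, 0.953155),
    (5e-05, 7, 0.945312, 0.945219),
    (5e-05, 8, 0.914062, 0.913368),
    (5e-05, 9, 0.929688, 0.929748),
    (3e-05, 5, 0.867188, 0.865834),
    (3e-05, 6, 0.906250, 0.905620),
    (3e-05, 7, 0.945312, 0.945219),
    (3e-05, 8, 0.945312, 0.945323),
    (3e-05, 9, 0.960938, 0.960983),
]

# For each learning rate, keep the run with the highest validation accuracy.
best_by_lr = {}
for lr, epochs, accuracy, f1 in rows:
    current = best_by_lr.get(lr)
    if current is None or accuracy > current["accuracy"]:
        best_by_lr[lr] = {"epochs": epochs, "accuracy": accuracy, "f1": f1}

for lr, best in best_by_lr.items():
    print(f"LR={lr}: best accuracy {best['accuracy']:.2%} at {best['epochs']} epochs")

# Overall best configuration across the listed runs.
overall_lr, overall = max(best_by_lr.items(), key=lambda item: item[1]["accuracy"])
print(f"Overall best: LR={overall_lr}, epochs={overall['epochs']}")
```

In the notebook itself, the same lookup could be done on `results_df` with a group-by on `learning_rate`.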

#### Training with optimal parameters

Learning rate: 3e-5 and epochs: 9

In [None]:
# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=9,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=30,
    logging_dir='./logs',
    eval_steps=10
)

# Create the model inside the distribution strategy scope
with training_args.strategy.scope():
    trainer_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 3 )

# Initializing the TFTrainer
trainer = TFTrainer(
    model=trainer_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

In [41]:
# Training the model
trainer.train()

In [43]:
print("Learning rate: 3e-5 and epochs: 9\n")
# Evaluating the model
trainer.evaluate()

Learning rate: 3e-5 and epochs: 9



{'eval_loss': 0.1885537952184677,
 'eval_accuracy': 0.9609375,
 'eval_precision': 0.961328125,
 'eval_recall': 0.9609375,
 'eval_f1': 0.9609832569391393}
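
In this output, `eval_recall` is identical to `eval_accuracy`. That is what support-weighted averaging of per-class recall produces, so the result is consistent with `compute_metrics` using a weighted average. This is an assumption, since that function is defined elsewhere in the notebook. The sketch below uses made-up labels and plain Python to show the identity.

```python
from collections import Counter

# Hypothetical labels for illustration only (0=FIN_TABLE, 1=NOISE, 2=TEXT).
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 2, 2, 0]

# Fraction of examples whose predicted label matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

support = Counter(y_true)  # number of true examples per class
true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)

# Per-class recall, averaged with weights proportional to class support.
weighted_recall = sum(
    (support[c] / len(y_true)) * (true_positives[c] / support[c])
    for c in support
)

print(accuracy, weighted_recall)
```

The class supports cancel in the weighted sum, leaving the total number of correct predictions divided by the number of examples, which is the accuracy.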

In [44]:
# Save the trained model and tokenizer
distilbert_save_directory = "distilbert_saved_models/"
trainer_model.save_pretrained(distilbert_save_directory)
tokenizer.save_pretrained(distilbert_save_directory)

('distilbert_saved_models/tokenizer_config.json',
 'distilbert_saved_models/special_tokens_map.json',
 'distilbert_saved_models/vocab.txt',
 'distilbert_saved_models/added_tokens.json')

### BERT
Now, we will train and evaluate a BERT model using the same dataset.

BERT is a bidirectional transformer pretrained with a combination of a masked language modeling objective and next sentence prediction.

**Steps:**
1. **Load BERT Tokenizer and Model**: Initialize the tokenizer and model from the Hugging Face library.
2. **Tokenize Data**: Tokenize the data using the BERT tokenizer.
3. **Create Datasets**: Prepare the training and validation datasets using tokenized data.
4. **Define Training Arguments**: Set the parameters for training.
5. **Train the Model**: Train the BERT model using the training dataset.
6. **Evaluate the Model**: Evaluate the performance of the trained BERT model on the validation dataset.
7. **Save the Model**: Save the trained model and tokenizer for future use.

In [60]:
# Load tokenizer from the pre-trained BERT model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load the pre-trained BERT model for sequence classification
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [61]:
# tokenizing train_texts and val_texts
bert_train_encodings = bert_tokenizer(train_texts, truncation=True, padding=True)
bert_val_encodings = bert_tokenizer(val_texts, truncation=True, padding=True)
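
To make the effect of `truncation=True` and `padding=True` concrete, here is a simplified sketch in plain Python. The helper `pad_and_truncate`, its `pad_id` default, and the token ids are hypothetical; this is not the tokenizer's implementation, only an illustration of the shape of its output.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Simplified stand-in for tokenizer(..., truncation=True, padding=True).

    Note: the real tokenizer keeps special tokens such as [SEP] when it
    truncates; this sketch simply cuts each sequence at max_length.
    """
    # truncation=True: cut sequences that exceed the maximum length
    truncated = [ids[:max_length] for ids in batch]
    # padding=True: pad every sequence to the longest one in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # attention_mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; real ids come from the tokenizer vocabulary.
encoded = pad_and_truncate(
    [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 102]],
    max_length=5,
)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every row ends up with the same length, which is what allows `tf.data.Dataset.from_tensor_slices` to stack the encodings into rectangular tensors.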

In [62]:
# Creating TensorFlow datasets for training and validation
bert_train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(bert_train_encodings),
    train_labels
))

bert_val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(bert_val_encodings),
    val_labels
))

#### Training with optimal parameters
Learning rate: 3e-5 and epochs: 9

In [None]:
# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=9,
    learning_rate=3e-5,
    per_device_train_batch_size=8, # reduced batch size to avoid Out Of Memory error
    per_device_eval_batch_size=8, # reduced batch size to avoid Out Of Memory error
    warmup_steps=30,
    logging_dir='./logs',
    eval_steps=10
)

# Create the model inside the distribution strategy scope
with training_args.strategy.scope():
    bert_trainer_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 3 )

# Initializing the TFTrainer
trainer = TFTrainer(
    model=bert_trainer_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_val_dataset,
    compute_metrics=compute_metrics
)

In [68]:
# Training the model
trainer.train()

In [70]:
print("Model: BERT")
print("Learning rate: 3e-5 and epochs: 9\n")
# Evaluating the model
trainer.evaluate()

Model: BERT
Learning rate: 3e-5 and epochs: 9



{'eval_loss': 0.33096411228179934,
 'eval_accuracy': 0.9,
 'eval_precision': 0.9034313725490195,
 'eval_recall': 0.9,
 'eval_f1': 0.9003702603702604}

In [71]:
# Save the trained model and tokenizer
bert_save_directory = "bert_saved_models/"
bert_trainer_model.save_pretrained(bert_save_directory)
bert_tokenizer.save_pretrained(bert_save_directory)

('bert_saved_models/tokenizer_config.json',
 'bert_saved_models/special_tokens_map.json',
 'bert_saved_models/vocab.txt',
 'bert_saved_models/added_tokens.json')

### RoBERTa
Finally, we will train and evaluate a RoBERTa model using the same dataset.

RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

**It is the same as BERT, with better pretraining tricks:**

- dynamic masking: tokens are masked differently at each epoch, whereas BERT masks them once during preprocessing
- training with larger batches
- byte-level BPE: bytes rather than characters are the base sub-units, so any Unicode text can be encoded
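
The byte-level point can be illustrated without the tokenizer. The sketch below only shows how text decomposes into UTF-8 bytes; it does not reproduce RoBERTa's merge rules, and the sample strings are made up.

```python
# Byte-level view of a string: every character becomes one or more UTF-8 bytes,
# so a base alphabet of 256 byte values is enough to represent any input text.
samples = ["net income", "\u20ac5 million", "caf\u00e9"]  # "\u20ac" is the euro sign
for text in samples:
    byte_values = list(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} characters -> {len(byte_values)} bytes")

# A single non-ASCII character can span several bytes.
euro_bytes = list("\u20ac".encode("utf-8"))
print(euro_bytes)
```

Because the base units are bytes, no input character falls outside the vocabulary, which is the motivation given in the list above.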

**Steps:**
1. **Load RoBERTa Tokenizer and Model**: Initialize the tokenizer and model from the Hugging Face library.
2. **Tokenize Data**: Tokenize the data using the RoBERTa tokenizer.
3. **Create Datasets**: Prepare the training and validation datasets using tokenized data.
4. **Define Training Arguments**: Set the parameters for training.
5. **Train the Model**: Train the RoBERTa model using the training dataset.
6. **Evaluate the Model**: Evaluate the performance of the trained RoBERTa model on the validation dataset.
7. **Save the Model**: Save the trained model and tokenizer for future use.

In [84]:
# Load tokenizer from the pre-trained RoBERTa model
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# Load the pre-trained RoBERTa model for sequence classification
roberta_model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [85]:
# tokenizing train_texts and val_texts
roberta_train_encodings = roberta_tokenizer(train_texts, truncation=True, padding=True)
roberta_val_encodings = roberta_tokenizer(val_texts, truncation=True, padding=True)

In [86]:
# Creating TensorFlow datasets for training and validation
roberta_train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(roberta_train_encodings),
    train_labels
))

roberta_val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(roberta_val_encodings),
    val_labels
))

#### Training with optimal parameters
Learning rate: 3e-5 and epochs: 9

In [None]:
# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=9,
    learning_rate=3e-5,
    per_device_train_batch_size=8, # reduced batch size to avoid Out Of Memory error
    per_device_eval_batch_size=8, # reduced batch size to avoid Out Of Memory error
    warmup_steps=30,
    logging_dir='./logs',
    eval_steps=10
)

# Create the model inside the distribution strategy scope
with training_args.strategy.scope():
    roberta_trainer_model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels = 3 )

# Initializing the TFTrainer
trainer = TFTrainer(
    model=roberta_trainer_model,
    args=training_args,
    train_dataset=roberta_train_dataset,
    eval_dataset=roberta_val_dataset,
    compute_metrics=compute_metrics
)

In [88]:
# Training the model
trainer.train()

In [91]:
print("Model: RoBERTa")
print("Learning rate: 3e-5 and epochs: 9\n")
# Evaluating the model
trainer.evaluate()

Model: RoBERTa
Learning rate: 3e-5 and epochs: 9



{'eval_loss': 0.3747982978820801,
 'eval_accuracy': 0.9125,
 'eval_precision': 0.9145167895167894,
 'eval_recall': 0.9125,
 'eval_f1': 0.9127909226190475}

In [92]:
# Save the trained model and tokenizer
roberta_save_directory = "roberta_saved_models/"
roberta_trainer_model.save_pretrained(roberta_save_directory)
roberta_tokenizer.save_pretrained(roberta_save_directory)

('roberta_saved_models/tokenizer_config.json',
 'roberta_saved_models/special_tokens_map.json',
 'roberta_saved_models/vocab.json',
 'roberta_saved_models/merges.txt',
 'roberta_saved_models/added_tokens.json')

## Label the unlabeled dataset 

We label the dataset with DistilBERT because it achieved the highest validation accuracy of the three models (96.09%, compared with 91.25% for RoBERTa and 90.00% for BERT).


In [None]:
# Load the fine-tuned tokenizer and model from the saved directory
tokenizer_fine_tuned = DistilBertTokenizer.from_pretrained(distilbert_save_directory)
model_fine_tuned = TFDistilBertForSequenceClassification.from_pretrained(distilbert_save_directory)

### Testing with random content from unlabeled data

In [37]:
import random

# Converting the 'content' column of the unlabeled DataFrame to a list
unlabeled_content_list = unlabeled_df['content'].tolist()

# Selecting a random piece of content from the unlabeled content list
random_content = random.choice(unlabeled_content_list)
print(random_content)

The holders of shares of Common Stock are entitled to receive such dividends, if any, as may be declared from time to time by the Company’s Board of Directors in its discretion from funds legally available therefor.


In [38]:
# Tokenizing the random content using the fine-tuned tokenizer
predict_input = tokenizer_fine_tuned.encode(
    random_content,
    truncation = True,
    padding = True,
    return_tensors = 'tf'
)

# Predicting the output using the fine-tuned model
output = model_fine_tuned(predict_input)[0]

# Getting the predicted label by finding the index with the maximum value in the output
prediction_value = tf.argmax(output, axis = 1).numpy()[0]

# Printing the mapping of encoded labels to original labels
label_names = dict(enumerate(labeled_df['label'].astype('category').cat.categories))
print(label_names)
print()

# Printing the predicted value and corresponding label
print("Predicted values: ", prediction_value, f" Label: {label_names[int(prediction_value)]}")

{0: 'FIN_TABLE', 1: 'NOISE', 2: 'TEXT'}



Predicted values:  2  Label: TEXT
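
The prediction takes the arg-max over raw logits. If a confidence score is also wanted, a softmax turns the logits into probabilities. The sketch below is plain Python with hypothetical logit values; in the notebook, `tf.nn.softmax(output, axis=1)` would play the same role.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


label_names = {0: "FIN_TABLE", 1: "NOISE", 2: "TEXT"}

# Hypothetical logits for one excerpt; real values come from the fine-tuned model.
logits = [-1.2, 0.3, 2.9]
probs = softmax(logits)

# The index of the largest probability is the same as the arg-max of the logits.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(label_names[predicted], round(probs[predicted], 3))
```

Low-confidence predictions found this way could be routed to manual review instead of being labeled automatically.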


### Labeling the entire unlabeled data

In [46]:
# Tokenize the unlabeled data
unlabeled_texts = unlabeled_df['content'].to_list()
unlabeled_encodings = tokenizer_fine_tuned(unlabeled_texts, truncation=True, padding=True, return_tensors='tf')

# Make predictions on the unlabeled data
predictions = model_fine_tuned.predict(dict(unlabeled_encodings)).logits
predicted_labels = tf.argmax(predictions, axis=1).numpy()

# Add the predicted labels to the dataframe
unlabeled_df['encoded_label'] = predicted_labels

print("Labeling completed.")

Labeling completed.
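
The cell above tokenizes every excerpt in a single call, so `padding=True` pads all sequences to the longest one in the whole set. If memory becomes a concern, one option is to process the texts in chunks. This is a minimal sketch; `batched` is a hypothetical helper and the texts are placeholders.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for unlabeled_df['content'].to_list(); the texts here are placeholders.
texts = [f"excerpt {i}" for i in range(10)]
batch_sizes = [len(batch) for batch in batched(texts, batch_size=4)]
print(batch_sizes)

# In the notebook, each batch would be tokenized and scored separately, e.g.
#   enc = tokenizer_fine_tuned(batch, truncation=True, padding=True, return_tensors='tf')
#   logits = model_fine_tuned(dict(enc)).logits
# and the per-batch arg-max results concatenated in their original order.
```

Padding per batch also means short excerpts are only padded to the longest text in their own chunk.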


In [47]:
# updating label column in dataframe using encoded_label assigned
label_mapping = {0: 'FIN_TABLE', 1: 'NOISE', 2: 'TEXT'}
unlabeled_df['label'] = unlabeled_df['encoded_label'].map(label_mapping)
print("Labels updated.")

Labels updated.
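
The hard-coded `label_mapping` matches the mapping printed earlier. Assuming the labels were encoded with scikit-learn's `LabelEncoder` (or pandas category codes), both of which assign integers in sorted order of the class names, the mapping can also be derived rather than typed by hand. The label list below is a placeholder for `labeled_df['label']`.

```python
# Integer codes follow the sorted order of the distinct class names,
# so the mapping can be rebuilt from the labels themselves.
labels = ["TEXT", "NOISE", "FIN_TABLE", "NOISE", "TEXT"]  # placeholder values
label_mapping = dict(enumerate(sorted(set(labels))))
print(label_mapping)
```

Deriving the mapping keeps it correct if a class is later added or renamed.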


In [48]:
unlabeled_df.head()

Unnamed: 0,id,document_id,document_title,content,is_table,prev_content,next_content,label,encoded_label
0,71833445,510403,Deere & Company 10-Q Q3 2023,​(d)Description of Severance Benefits. Subject...,False,​Condensed Notes to Interim Consolidated Finan...,,TEXT,2
1,73134247,512748,McDonald's Corporation 10-Q Q3 2020,,True,,,NOISE,1
2,66528922,505472,"Charter Communications, Inc. 10-K FY 2021","There were 193,730,992 shares of Class A commo...",False,"directors, executive officers and the principa...",Information required by Part III is incorporat...,TEXT,2
3,66095744,505112,A. O. Smith Corporation 10-K FY 2022,For the transition period from to,False,Amount of Award: ______ [RSUs] [Shares] [Targe...,Commission File Number 1-475,NOISE,1
4,65948148,504996,Adobe Inc. 10-K FY 2020,"(as amended and restated as of January 14, 2021)",False,,"Adobe Inc. (the “Company”), pursuant to its 20...",NOISE,1


In [50]:
# merging both the dataframes - labeled_df and unlabeled_df
merged_df = pd.concat([labeled_df, unlabeled_df], ignore_index=True)
merged_df.head()

Unnamed: 0,id,document_id,document_title,content,is_table,prev_content,next_content,label,encoded_label
0,75846004,522393,CFG 8-K 10/25/18,6. Conditions to Obligations. The several obli...,False,,(f) The Underwriters shall have received on th...,TEXT,2
1,79398889,551017,URI 8-K 12/07/22,C: (203) 399-8951,False,,Indicate by check mark whether the registrant ...,NOISE,1
2,76351219,526475,DIS 8-K 05/08/19 Earnings Release,1,False,,The following table summarizes the second quar...,NOISE,1
3,76180796,524955,CPB 8-K 09/03/20 Earnings Release,Continuing Operations| Three Months Ended| Thr...,True,today reported results for its fourth-quarter ...,,FIN_TABLE,0
4,75324462,518706,AME 8-K 05/12/22 Entry into a Material Definit...,"Berwyn, Pa., May 13, 2022 – AMETEK, Inc. (NYSE...",False,,__________________,TEXT,2


In [52]:
# Count of each label in the 'label' column of the merged DataFrame
label_counts = merged_df['label'].value_counts()
label_counts

label
NOISE        5578
TEXT         3801
FIN_TABLE    1003
Name: count, dtype: int64

In [51]:
# storing the entire labeled dataset to jsonl file
output_path = "./data/labeled_excerpts.jsonl"

with open(output_path, "w") as f:
    f.write(merged_df.to_json(orient='records', lines=True))
    
print("Write completed")

Write completed
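
The file written above uses the JSON Lines layout: one JSON object per line. A quick way to sanity-check such a file is to read it back line by line. The sketch below uses two placeholder records and an in-memory buffer in place of the real file.

```python
import io
import json

# Two placeholder records in the layout produced by
# DataFrame.to_json(orient='records', lines=True): one JSON object per line.
jsonl_text = (
    '{"id": 1, "content": "Total revenue| 2023| 2022", "label": "FIN_TABLE", "encoded_label": 0}\n'
    '{"id": 2, "content": "Page 3", "label": "NOISE", "encoded_label": 1}\n'
)

# io.StringIO stands in for open("./data/labeled_excerpts.jsonl").
with io.StringIO(jsonl_text) as f:
    records = [json.loads(line) for line in f if line.strip()]

labels = [record["label"] for record in records]
print(len(records), labels)
```

In the notebook, `pd.read_json(output_path, lines=True)` would load the same file back into a DataFrame.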


## Visualize the label distribution

In [77]:
# Define colors for each label
colors = {'FIN_TABLE': '#1f77b4', 'TEXT': '#d62728', 'NOISE': '#ff7f0e'}

# Create a bar plot using Plotly
fig = go.Figure(data=[go.Bar(x=label_counts.index, y=label_counts.values, marker_color=label_counts.index.map(colors))])

# Update layout
fig.update_layout(
    title='Distribution of Labels in the Dataset',
    xaxis_title='Labels',
    yaxis_title='Count',
    xaxis_tickangle=-45,
    yaxis=dict(gridcolor='lightgray'),
    plot_bgcolor='rgba(0, 0, 0, 0)'
)

# Add value labels on the bars
for i, count in enumerate(label_counts.values):
    fig.add_annotation(text=str(count), x=label_counts.index[i], y=count + 100, showarrow=False, font=dict(color='black'))

# Show plot
fig.show()

**Inference drawn**

After labeling the entire dataset of 10,382 excerpts, the label distribution is:
- NOISE: 5578
- TEXT: 3801
- FIN_TABLE: 1003
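
The counts are easier to compare as shares of the total. A small sketch using the counts reported above:

```python
# Label counts as reported for the merged dataset.
label_counts = {"NOISE": 5578, "TEXT": 3801, "FIN_TABLE": 1003}

total = sum(label_counts.values())
shares = {label: 100 * count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2f}%")
```

More than half of the excerpts fall under NOISE, while FIN_TABLE accounts for just under a tenth, so the classes are clearly imbalanced.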

## Conclusion

We implemented and evaluated three pre-trained transformer models (DistilBERT, BERT, and RoBERTa) on a text classification task. The primary goal was to determine the most effective model for classifying the excerpts into three categories: FIN_TABLE, NOISE, and TEXT. Hyperparameter tuning and the subsequent evaluations led to the following conclusions:

#### Hyperparameter Tuning

Hyperparameter tuning was performed with DistilBERT because it is the fastest and most lightweight of the three models. The learning rate and the number of epochs were varied, and the optimal combination was identified from the evaluation metrics: accuracy, precision, recall, and F1 score.

- **Learning Rates:** 5e-5, 3e-5, 2e-5
- **Epochs:** 5, 6, 7, 8, 9

The optimal hyperparameters identified for DistilBERT were a learning rate of 3e-5 and 9 epochs, resulting in the highest performance across all evaluation metrics.

| Learning Rate | Epochs | Accuracy | Precision | Loss     | Recall   | F1       |
|---------------|--------|----------|-----------|----------|----------|----------|
| 3e-5          | 9      | 0.960938 | 0.961328  | 0.188554 | 0.960938 | 0.960983 |

#### Model Evaluation

Using the optimal hyperparameters identified, training was performed on all three models:

1. **DistilBERT Model**
   - **Learning Rate:** 3e-5
   - **Epochs:** 9
   - **Evaluation Metrics:**
     - **Loss:** 0.1885537952184677
     - **Accuracy:** 0.9609375
     - **Precision:** 0.961328125
     - **Recall:** 0.9609375
     - **F1 Score:** 0.9609832569391393

2. **BERT Model**
   - **Learning Rate:** 3e-5
   - **Epochs:** 9
   - **Evaluation Metrics:**
     - **Loss:** 0.33096411228179934
     - **Accuracy:** 0.9
     - **Precision:** 0.9034313725490195
     - **Recall:** 0.9
     - **F1 Score:** 0.9003702603702604

3. **RoBERTa Model**
   - **Learning Rate:** 3e-5
   - **Epochs:** 9
   - **Evaluation Metrics:**
     - **Loss:** 0.3747982978820801
     - **Accuracy:** 0.9125
     - **Precision:** 0.9145167895167894
     - **Recall:** 0.9125
     - **F1 Score:** 0.9127909226190475

Based on the evaluation metrics, DistilBERT outperformed both BERT and RoBERTa on accuracy, precision, recall, and F1 score, and it also had the lowest evaluation loss. Despite being the lightest and fastest of the three, it achieved the best evaluation performance, making it the most effective model for this text classification task.

- **DistilBERT** demonstrated superior performance with an F1 score of 0.96098, making it the best choice for the task.
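- **BERT** and **RoBERTa**, while still effective, did not perform as well as DistilBERT at the same learning rate and epoch count. These values were tuned on DistilBERT only, and BERT and RoBERTa were trained with a batch size of 8 rather than 16 to avoid out-of-memory errors, so the comparison is not fully like-for-like.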
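
For a compact side-by-side view, the reported metrics can be ranked by F1 score. The sketch below uses the evaluation values listed above, rounded to four decimals; the dictionary layout is illustrative.

```python
# Validation metrics reported above (learning rate 3e-5, 9 epochs), rounded.
results = {
    "DistilBERT": {"loss": 0.1886, "accuracy": 0.9609, "f1": 0.9610},
    "BERT": {"loss": 0.3310, "accuracy": 0.9000, "f1": 0.9004},
    "RoBERTa": {"loss": 0.3748, "accuracy": 0.9125, "f1": 0.9128},
}

# Sort model names from highest to lowest F1 score.
ranking = sorted(results, key=lambda name: results[name]["f1"], reverse=True)

for name in ranking:
    m = results[name]
    print(f"{name:<11} f1={m['f1']:.4f} accuracy={m['accuracy']:.4f} loss={m['loss']:.4f}")
```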
