## 1.0 Introduction

In this notebook we train and validate two 'classical' multi-lingual Transformer models on the training and validation CSV files created earlier. The aim is to establish a baseline for the accuracy that these two smaller models can achieve; they are especially small compared to current state-of-the-art LLMs.

These two 'classical' Transformer models consist of millions of parameters, compared to billions of parameters for GPT and similar LLMs.

In [1]:
# Import Modules
import evaluate
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (AutoModelForSequenceClassification, 
                          AutoTokenizer,
                          DataCollatorWithPadding, 
                          pipeline,
                          TrainingArguments, 
                          Trainer)





## 2.0 Load Datasets

We will reload the training and validation CSV files that were generated earlier with the notebook 'Prepare_Train_and_Validation_Datasets.ipynb'.

In [2]:
# Load Datasets
train_df = pd.read_csv('./data/train_df.csv')
val_df = pd.read_csv('./data/val_df.csv')

# Summary
print(train_df.shape)
print(val_df.shape)

(3069, 11)
(1559, 11)


In [3]:
# Summary
train_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,10706318,Ogen als schoteltjes bij de Tachtigjarige Oorlog,Ogen als schoteltjes bij de Tachtigjarige Oorl...,/home,2018-10-07,trouw,True,www.trouw.nl/home/ogen-als-schoteltjes-bij-de-...,539,Ogen als schoteltjes bij de Tachtigjarige Oorl...,1
1,12633805,"Geen beeld, maar een monument voor Mandela in ...","Geen beeld, maar een monument voor Mandela in ...",/amsterdam,2019-05-10,parool,True,www.parool.nl/amsterdam/geen-beeld-maar-een-mo...,662,"Geen beeld, maar een monument voor Mandela in ...",1
2,7140125,Hoe ga je een onveilige arbeidscultuur zoals i...,Hoe ga je een onveilige arbeidscultuur zoals i...,/,2017-04-18,trouw,True,,494,Hoe ga je een onveilige arbeidscultuur zoals i...,1
3,4490774,Wetenschappers ontdekken lichtgevende discokikker,Wetenschappers ontdekken lichtgevende discokik...,/,2017-03-14,trouw,True,,291,Wetenschappers ontdekken lichtgevende discokik...,1
4,10592180,Meer fouten kabinet bij steun aan strijdgroepe...,Meer fouten kabinet bij steun aan strijdgroepe...,/home,2018-09-11,trouw,True,www.trouw.nl/home/meer-fouten-kabinet-bij-steu...,471,Meer fouten kabinet bij steun aan strijdgroepe...,1


In [4]:
# Summary
val_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,9266995,Verdachte dodelijke steekpartijen Maastricht l...,Verdachte dodelijke steekpartijen Maastricht l...,/nieuws,2017-12-18,ad,False,www.ad.nl/binnenland/verdachte-dodelijke-steek...,188,Verdachte dodelijke steekpartijen Maastricht l...,0
1,4130077,Honderden arrestaties bij acties tegen mensen ...,Honderden arrestaties bij acties tegen mensen ...,/nieuws,2017-02-11,ad,False,www.ad.nl/buitenland/honderden-arrestaties-bij...,122,Honderden arrestaties bij acties tegen mensen ...,0
2,11147268,Waarom de 'oudejaarsbonus' voor de jongeren va...,Waarom de 'oudejaarsbonus' voor de jongeren va...,/home,2019-01-20,trouw,True,www.trouw.nl/home/waarom-de-oudejaarsbonus-voo...,262,Waarom de 'oudejaarsbonus' voor de jongeren va...,1
3,10749100,Klaar voor de verdediging,Klaar voor de verdedigingOver ruim een week be...,/nieuws,2018-10-16,ad,False,www.ad.nl/binnenland/klaar-voor-de-verdediging...,411,Klaar voor de verdedigingOver ruim een week be...,0
4,10700707,Windvlaag grijpt springmatras en doodt 2-jarig...,Windvlaag grijpt springmatras en doodt 2-jarig...,/nieuws,2018-10-05,ad,False,www.ad.nl/buitenland/windvlaag-grijpt-springma...,286,Windvlaag grijpt springmatras en doodt 2-jarig...,0


## 3.0 Process Datasets

In [5]:
# Convert the train/validation DataFrames into a tokenized DatasetDict.
def process_dataset(tokenizer):
    # Tokenize Helper
    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation = True)

    # Create DataSets
    tdf = pd.DataFrame({"text": train_df.max_words_text.values, "label": train_df.labels.values})
    vdf = pd.DataFrame({"text": val_df.max_words_text.values, "label": val_df.labels.values})
    tds = Dataset.from_pandas(tdf)
    vds = Dataset.from_pandas(vdf)

    ds = DatasetDict()
    ds['train'] = tds
    ds['validation'] = vds

    # Tokenize Text
    ds = ds.map(preprocess_function, batched = True)

    # Summary
    print(ds)

    return ds

## 4.0 Evaluation Setup

In [6]:
metric = evaluate.load("accuracy")

task_evaluator = evaluate.evaluator("text-classification")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis = 1)

    return metric.compute(predictions = predictions, references = labels)
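
To make the role of `compute_metrics` concrete, the following minimal sketch reproduces its logic with plain NumPy on a hand-made batch. The logits and labels are invented for illustration, and `accuracy_from_logits` is a hypothetical helper that is not part of `evaluate`; the notebook's own function delegates the final step to `metric.compute`.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the labels."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == labels))

# Illustrative logits for 4 texts and 2 classes (0 = NEUTRAL, 1 = PARTISAN).
logits = np.array([[2.1, -0.3],
                   [0.2, 1.5],
                   [-1.0, 0.4],
                   [0.9, 0.1]])
labels = np.array([0, 1, 0, 0])

# Predicted classes are [0, 1, 1, 0], so 3 of the 4 labels match.
print(accuracy_from_logits(logits, labels))  # 0.75
```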

## 5.0 Training and Validation on Subset of the Data

In this section we will train and validate 2 Transformer NLP models on the loaded and processed CSV files.

The 2 models are the following:
* Multi-lingual DistilBert
* Multi-lingual Bert

Both models will be trained and validated based on the same set of hyperparameters to allow for a fair comparison.

It is very likely that optimizing the hyperparameters per model could achieve even higher performance.
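
One way to guarantee that both runs really share the same settings is to keep them in a single dictionary. The sketch below is only an illustration of that idea: the values are copied from the `TrainingArguments` calls in this notebook, while `SHARED_HYPERPARAMETERS` and `training_kwargs` are hypothetical names that the notebook itself does not define.

```python
# Hyperparameters copied from the TrainingArguments used for both models.
SHARED_HYPERPARAMETERS = {
    "learning_rate": 5.0e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 3,
    "weight_decay": 0.001,
    "fp16": True,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
}

def training_kwargs(output_dir):
    """Return the shared settings plus a model-specific output directory."""
    return {"output_dir": output_dir, **SHARED_HYPERPARAMETERS}

distilbert_kwargs = training_kwargs("mdistilbert")
bert_kwargs = training_kwargs("mbert")

# Only the output directory differs between the two runs.
differing = {k for k in distilbert_kwargs if distilbert_kwargs[k] != bert_kwargs[k]}
print(differing)  # {'output_dir'}
```

Such a dictionary could then be unpacked with `TrainingArguments(**training_kwargs("mbert"))`, so that a later per-model tuning experiment only has to override the keys it changes.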

At a later moment I will expand the notebook with the 2 models being trained and validated on the complete dataset.

In [7]:
# Set Label Info
id2label = {0: 'NEUTRAL', 1: 'PARTISAN'}
label2id = {'NEUTRAL': 0, 'PARTISAN': 1}

### 5.1 Multi-Lingual DistilBert

In [8]:
# Constants
model_name = 'distilbert-base-multilingual-cased'

In [9]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Tokenize dataset
ds = process_dataset(tokenizer)


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1559
    })
})


In [10]:
# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels = 2, 
                                                           id2label = id2label, 
                                                           label2id = label2id)
model.gradient_checkpointing_enable()

# Set TrainingArguments
training_args = TrainingArguments(output_dir = "mdistilbert",
                                  learning_rate = 5.0e-5,
                                  per_device_train_batch_size = 32,
                                  per_device_eval_batch_size = 32,
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  fp16 = True,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch")

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



{'eval_loss': 0.32038870453834534, 'eval_accuracy': 0.858242463117383, 'eval_runtime': 4.7096, 'eval_samples_per_second': 331.026, 'eval_steps_per_second': 10.404, 'epoch': 1.0}



{'eval_loss': 0.3840148448944092, 'eval_accuracy': 0.849903784477229, 'eval_runtime': 4.6123, 'eval_samples_per_second': 338.01, 'eval_steps_per_second': 10.624, 'epoch': 2.0}



{'eval_loss': 0.37137719988822937, 'eval_accuracy': 0.8659397049390635, 'eval_runtime': 4.4267, 'eval_samples_per_second': 352.185, 'eval_steps_per_second': 11.069, 'epoch': 3.0}
{'train_runtime': 146.0058, 'train_samples_per_second': 63.059, 'train_steps_per_second': 1.973, 'train_loss': 0.27998744116889107, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.27998744116889107, metrics={'train_runtime': 146.0058, 'train_samples_per_second': 63.059, 'train_steps_per_second': 1.973, 'train_loss': 0.27998744116889107, 'epoch': 3.0})
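
The step counts in the output above follow directly from the dataset sizes and the batch size. The short calculation below is a sanity check, assuming training on a single device without gradient accumulation (neither is stated explicitly in the notebook, but both are consistent with the reported numbers).

```python
import math

train_rows, val_rows = 3069, 1559   # dataset sizes printed when loading the CSV files
batch_size, epochs = 32, 3          # values passed to TrainingArguments

# The last batch of each pass is smaller than 32, hence the rounding up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_train_steps = steps_per_epoch * epochs
eval_steps = math.ceil(val_rows / batch_size)

print(steps_per_epoch, total_train_steps, eval_steps)  # 96 288 49
```

The result matches `global_step=288` and the 49 evaluation batches per epoch shown in the output.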

In [11]:
# Evaluation
pipe = pipeline("text-classification", 
                model = model, 
                tokenizer = tokenizer, 
                device = 0)

eval_results = task_evaluator.compute(model_or_pipeline=pipe, 
                                      data = ds['validation'], 
                                      metric = metric,
                                      label_mapping = label2id,
                                      strategy = "bootstrap",
                                      n_resamples = 256)

# Summary 
eval_results

{'accuracy': {'confidence_interval': (0.8493567477092789, 0.8828726453383989),
  'standard_error': 0.00860166650180781,
  'score': 0.8659397049390635},
 'total_time_in_seconds': 15.587806699986686,
 'samples_per_second': 100.01407061336805,
 'latency_in_seconds': 0.009998593136617502}

The Multi-lingual DistilBert model achieves an accuracy of 86.6% on the validation dataset, with a bootstrap confidence interval of roughly 84.9% to 88.3%.
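
The idea behind `strategy = "bootstrap"` can be illustrated with a simplified percentile bootstrap in NumPy. This is a sketch under assumptions, not the code that `evaluate` runs: the per-example outcomes are reconstructed from the reported score (1350 correct out of 1559), a 95% confidence level is assumed, and the seed is arbitrary. The library's own interval method may differ, so the numbers need not match the output above exactly.

```python
import numpy as np

# Reconstructed outcomes: 1350 / 1559 reproduces the reported score of about 0.8659.
n_total, n_correct = 1559, 1350
correct = np.zeros(n_total)
correct[:n_correct] = 1.0

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
n_resamples = 256                    # same number of resamples as in the notebook

# Resample the per-example outcomes with replacement and recompute the accuracy.
resampled_scores = np.array([
    rng.choice(correct, size=n_total, replace=True).mean()
    for _ in range(n_resamples)
])

score = correct.mean()
standard_error = resampled_scores.std(ddof=1)
ci_low, ci_high = np.percentile(resampled_scores, [2.5, 97.5])  # assumed 95% level

print(f"score={score:.4f}  se={standard_error:.4f}  ci=({ci_low:.4f}, {ci_high:.4f})")
```

With only 1559 validation examples the standard error is close to one percentage point, which is worth keeping in mind when comparing models whose scores differ by a similar amount.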

### 5.2 Multi-Lingual Bert

In [12]:
# Constants
model_name = 'bert-base-multilingual-cased'

In [13]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Tokenize dataset
ds = process_dataset(tokenizer)


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1559
    })
})


In [14]:
# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels = 2, 
                                                           id2label = id2label, 
                                                           label2id = label2id)
model.gradient_checkpointing_enable()

# Set TrainingArguments
training_args = TrainingArguments(output_dir = "mbert",
                                  learning_rate = 5.0e-5,
                                  per_device_train_batch_size = 32,
                                  per_device_eval_batch_size = 32,
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  fp16 = True,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch")

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



{'eval_loss': 0.3354860842227936, 'eval_accuracy': 0.8601667735728031, 'eval_runtime': 8.8464, 'eval_samples_per_second': 176.23, 'eval_steps_per_second': 5.539, 'epoch': 1.0}



{'eval_loss': 0.3428492844104767, 'eval_accuracy': 0.8614496472097498, 'eval_runtime': 9.4467, 'eval_samples_per_second': 165.031, 'eval_steps_per_second': 5.187, 'epoch': 2.0}



{'eval_loss': 0.33843645453453064, 'eval_accuracy': 0.8762026940346376, 'eval_runtime': 9.4383, 'eval_samples_per_second': 165.178, 'eval_steps_per_second': 5.192, 'epoch': 3.0}
{'train_runtime': 254.0063, 'train_samples_per_second': 36.247, 'train_steps_per_second': 1.134, 'train_loss': 0.3008674250708686, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.3008674250708686, metrics={'train_runtime': 254.0063, 'train_samples_per_second': 36.247, 'train_steps_per_second': 1.134, 'train_loss': 0.3008674250708686, 'epoch': 3.0})

In [15]:
# Evaluation
pipe = pipeline("text-classification", 
                model = model, 
                tokenizer = tokenizer, 
                device = 0)

eval_results = task_evaluator.compute(model_or_pipeline=pipe, 
                                      data = ds['validation'], 
                                      metric = metric,
                                      label_mapping = label2id,
                                      strategy = "bootstrap",
                                      n_resamples = 256)

# Summary 
eval_results

{'accuracy': {'confidence_interval': (0.8514408355318677, 0.8883163411921758),
  'standard_error': 0.008609690934502106,
  'score': 0.8762026940346376},
 'total_time_in_seconds': 30.169643400004134,
 'samples_per_second': 51.67445896956762,
 'latency_in_seconds': 0.01935192007697507}

The Multi-lingual Bert model achieves an accuracy of 87.6% on the validation dataset, with a bootstrap confidence interval of roughly 85.1% to 88.8%.

## 6.0 Summary

With the two 'classical' Transformer NLP models trained and validated, we have an interesting baseline to compare against a fine-tuned GPT-3.5/GPT-4 model.

With an accuracy of 86.6% for the multi-lingual DistilBert model and 87.6% for the multi-lingual Bert model, it will be very interesting to see whether a GPT model fine-tuned as a classifier can reach the 87% accuracy target.
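
When comparing a future GPT result against this baseline, it may help to look at the confidence intervals and runtimes as well as the point scores. The sketch below only rearranges figures copied from the outputs in this notebook; the dictionary keys and the `intervals_overlap` helper are hypothetical names, and overlapping intervals are an informal indication rather than a formal significance test.

```python
# Figures copied from the evaluation and training output of the two models.
results = {
    "mDistilBERT": {"score": 0.8659397049390635,
                    "ci": (0.8493567477092789, 0.8828726453383989),
                    "train_runtime": 146.0058},
    "mBERT": {"score": 0.8762026940346376,
              "ci": (0.8514408355318677, 0.8883163411921758),
              "train_runtime": 254.0063},
}

def intervals_overlap(a, b):
    """Return True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

gap = results["mBERT"]["score"] - results["mDistilBERT"]["score"]
overlap = intervals_overlap(results["mDistilBERT"]["ci"], results["mBERT"]["ci"])
slowdown = results["mBERT"]["train_runtime"] / results["mDistilBERT"]["train_runtime"]

print(f"accuracy gap: {gap:.4f}")
print(f"intervals overlap: {overlap}")
print(f"mBERT training time relative to mDistilBERT: {slowdown:.2f}x")
```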