## 1.0 Introduction

In this notebook we will train and validate 3 regular multi-lingual Transformer models to establish a baseline for the accuracy that can be achieved when training these smaller models (small compared to current state-of-the-art LLMs) on the training and validation CSV files created earlier.

The 3 transformer models that will be used are:
* Multi-lingual DistilBert
* Multi-lingual Bert
* Multi-lingual DeBERTa V3

All 3 models will be trained and validated with the small data subsets.

The Multi-lingual DeBERTa V3 model will also be trained on the complete dataset, so that the effect of more training data can be compared.

These 3 transformer models consist of millions of parameters, compared to billions of parameters for GPT and similar LLMs.

In [1]:
# Import Modules
import gc
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset, Dataset, DatasetDict
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, 
                          AutoTokenizer,
                          DataCollatorWithPadding, 
                          pipeline,
                          TrainingArguments, 
                          Trainer)

# Set Seed for Randomness
seed = 44
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = False # Some randomness is kept on purpose: it gives a good impression of the variation in the baseline.

## 2.0 Load Datasets

We will reload the training and validation CSV files that were generated earlier with the notebook 'Prepare_Train_and_Validation_Datasets.ipynb'.

In [2]:
# Load Datasets
train_df = pd.read_csv('./data/train_df.csv')
val_df = pd.read_csv('./data/val_df.csv')

# Summary
print(train_df.shape)
print(val_df.shape)

(3069, 11)
(1559, 11)


In [3]:
# Summary
train_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,10706318,Ogen als schoteltjes bij de Tachtigjarige Oorlog,Ogen als schoteltjes bij de Tachtigjarige Oorl...,/home,2018-10-07,trouw,True,www.trouw.nl/home/ogen-als-schoteltjes-bij-de-...,539,Ogen als schoteltjes bij de Tachtigjarige Oorl...,1
1,12633805,"Geen beeld, maar een monument voor Mandela in ...","Geen beeld, maar een monument voor Mandela in ...",/amsterdam,2019-05-10,parool,True,www.parool.nl/amsterdam/geen-beeld-maar-een-mo...,662,"Geen beeld, maar een monument voor Mandela in ...",1
2,7140125,Hoe ga je een onveilige arbeidscultuur zoals i...,Hoe ga je een onveilige arbeidscultuur zoals i...,/,2017-04-18,trouw,True,,494,Hoe ga je een onveilige arbeidscultuur zoals i...,1
3,4490774,Wetenschappers ontdekken lichtgevende discokikker,Wetenschappers ontdekken lichtgevende discokik...,/,2017-03-14,trouw,True,,291,Wetenschappers ontdekken lichtgevende discokik...,1
4,10592180,Meer fouten kabinet bij steun aan strijdgroepe...,Meer fouten kabinet bij steun aan strijdgroepe...,/home,2018-09-11,trouw,True,www.trouw.nl/home/meer-fouten-kabinet-bij-steu...,471,Meer fouten kabinet bij steun aan strijdgroepe...,1


In [4]:
# Summary
val_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,9266995,Verdachte dodelijke steekpartijen Maastricht l...,Verdachte dodelijke steekpartijen Maastricht l...,/nieuws,2017-12-18,ad,False,www.ad.nl/binnenland/verdachte-dodelijke-steek...,188,Verdachte dodelijke steekpartijen Maastricht l...,0
1,4130077,Honderden arrestaties bij acties tegen mensen ...,Honderden arrestaties bij acties tegen mensen ...,/nieuws,2017-02-11,ad,False,www.ad.nl/buitenland/honderden-arrestaties-bij...,122,Honderden arrestaties bij acties tegen mensen ...,0
2,11147268,Waarom de 'oudejaarsbonus' voor de jongeren va...,Waarom de 'oudejaarsbonus' voor de jongeren va...,/home,2019-01-20,trouw,True,www.trouw.nl/home/waarom-de-oudejaarsbonus-voo...,262,Waarom de 'oudejaarsbonus' voor de jongeren va...,1
3,10749100,Klaar voor de verdediging,Klaar voor de verdedigingOver ruim een week be...,/nieuws,2018-10-16,ad,False,www.ad.nl/binnenland/klaar-voor-de-verdediging...,411,Klaar voor de verdedigingOver ruim een week be...,0
4,10700707,Windvlaag grijpt springmatras en doodt 2-jarig...,Windvlaag grijpt springmatras en doodt 2-jarig...,/nieuws,2018-10-05,ad,False,www.ad.nl/buitenland/windvlaag-grijpt-springma...,286,Windvlaag grijpt springmatras en doodt 2-jarig...,0


## 3.0 Process Datasets

In [5]:
# Process DataFrame to Dataset function.
def process_dataset(tokenizer):
    # Tokenize Helper
    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation = True)

    # Create DataSets
    tdf = pd.DataFrame({"text": train_df.max_words_text.values, "label": train_df.labels.values})
    vdf = pd.DataFrame({"text": val_df.max_words_text.values, "label": val_df.labels.values})
    tds = Dataset.from_pandas(tdf)
    vds = Dataset.from_pandas(vdf)

    ds = DatasetDict()
    ds['train'] = tds
    ds['validation'] = vds

    # Tokenize Text
    ds = ds.map(preprocess_function, batched = True)

    # Summary
    print(ds)

    return ds

## 4.0 Evaluation Setup

In [6]:
def compute_metrics(val_preds):
    preds, labels = val_preds
    preds = np.argmax(preds, axis = 1)
    
    report = classification_report(labels, preds, digits = 3)
    print(report)
    
    accuracy_val = accuracy_score(labels, preds)
    precision_val = precision_score(labels, preds)
    recall_val = recall_score(labels, preds)
    
    return {"accuracy": accuracy_val, "precision": precision_val, "recall": recall_val}
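
To make the three returned numbers concrete, here is a minimal sketch in plain Python of what `compute_metrics` reports: the argmax over the two logits picks the predicted class, and accuracy, precision and recall follow from counting matches. The helper names (`argmax_rows`, `binary_metrics`) and the toy logits are made up for illustration; the notebook itself relies on NumPy and scikit-learn for this.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision and recall for a binary task, with `positive` as the target class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return {
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy logits for five articles (hypothetical values): column 0 = label 0, column 1 = label 1
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.2, 0.4]]
toy_labels = [0, 1, 1, 1, 0]

toy_preds = argmax_rows(toy_logits)
toy_metrics = binary_metrics(toy_labels, toy_preds)
print(toy_preds)
print(toy_metrics)
```

Precision and recall are computed for label 1 only, which matches the default behaviour of `precision_score` and `recall_score` for binary labels.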

## 5.0 Training and Validation on Subset of the Data

In this section we will train and validate 3 Transformer NLP models on the loaded and processed CSV files.

The 3 models are the following:
* Multi-lingual DistilBert
* Multi-lingual Bert
* Multi-lingual DeBERTa V3

All models will be trained and validated based on the same set of hyperparameters to allow for a fair comparison.

It is very likely that tuning the hyperparameters for each model individually would yield even higher performance.

At a later moment I will expand the notebook with more of the models being trained and validated on the complete dataset; for now, section 6.0 does this for the Multi-lingual DeBERTa V3 model only.

In [7]:
# Set Label Info
id2label = {0: 'NEUTRAL', 1: 'PARTISAN'}
label2id = {'NEUTRAL': 0, 'PARTISAN': 1}

### 5.1 Multi-Lingual DistilBert

In [8]:
# Constants
model_name = 'distilbert-base-multilingual-cased'

In [9]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Tokenize dataset
ds = process_dataset(tokenizer)

Map:   0%|          | 0/3069 [00:00<?, ? examples/s]

Map:   0%|          | 0/1559 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1559
    })
})
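
The tokenized datasets above contain sequences of different lengths, because `process_dataset` truncates but does not pad. The padding is left to `DataCollatorWithPadding`, which pads each batch to the length of its longest sequence. The sketch below imitates that idea in plain Python; it is not the library's implementation, and the helper name `pad_batch`, the token ids and the pad id of 0 are assumptions for illustration.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(item["input_ids"]) for item in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for item in batch:
        n_pad = longest - len(item["input_ids"])
        # Padding positions get the pad token id and an attention mask of 0
        padded["input_ids"].append(item["input_ids"] + [pad_token_id] * n_pad)
        padded["attention_mask"].append(item["attention_mask"] + [0] * n_pad)
    return padded

# Two hypothetical tokenized articles of different length
toy_batch = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1]},
]

padded_batch = pad_batch(toy_batch)
print(padded_batch)
```

Padding per batch instead of to one global maximum keeps batches of short articles small, which saves memory and compute during training.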


In [10]:
# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels = 2, 
                                                           id2label = id2label, 
                                                           label2id = label2id)
model.gradient_checkpointing_enable()

# Set TrainingArguments
training_args = TrainingArguments(output_dir = "mdistilbert",
                                  learning_rate = 3.0e-5,
                                  per_device_train_batch_size = 32,
                                  per_device_eval_batch_size = 32,
                                  gradient_checkpointing = True, 
                                  gradient_checkpointing_kwargs = {"use_reentrant": False},                                 
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  fp16 = True,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = True,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/288 [00:00<?, ?it/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.870     0.788     0.827       765
           1      0.813     0.887     0.848       794

    accuracy                          0.838      1559
   macro avg      0.842     0.837     0.838      1559
weighted avg      0.841     0.838     0.838      1559

{'eval_loss': 0.338260680437088, 'eval_accuracy': 0.8383579217447081, 'eval_precision': 0.812933025404157, 'eval_recall': 0.8866498740554156, 'eval_runtime': 4.0804, 'eval_samples_per_second': 382.068, 'eval_steps_per_second': 12.009, 'epoch': 1.0}


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.919     0.759     0.832       765
           1      0.802     0.936     0.863       794

    accuracy                          0.849      1559
   macro avg      0.860     0.848     0.848      1559
weighted avg      0.859     0.849     0.848      1559

{'eval_loss': 0.3530607521533966, 'eval_accuracy': 0.8492623476587556, 'eval_precision': 0.8015102481121898, 'eval_recall': 0.9357682619647355, 'eval_runtime': 4.078, 'eval_samples_per_second': 382.293, 'eval_steps_per_second': 12.016, 'epoch': 2.0}


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.855     0.851     0.853       765
           1      0.857     0.861     0.859       794

    accuracy                          0.856      1559
   macro avg      0.856     0.856     0.856      1559
weighted avg      0.856     0.856     0.856      1559

{'eval_loss': 0.33797311782836914, 'eval_accuracy': 0.8563181526619628, 'eval_precision': 0.8571428571428571, 'eval_recall': 0.8614609571788413, 'eval_runtime': 4.0748, 'eval_samples_per_second': 382.6, 'eval_steps_per_second': 12.025, 'epoch': 3.0}
{'train_runtime': 140.3073, 'train_samples_per_second': 65.62, 'train_steps_per_second': 2.053, 'train_loss': 0.32965970039367676, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.32965970039367676, metrics={'train_runtime': 140.3073, 'train_samples_per_second': 65.62, 'train_steps_per_second': 2.053, 'train_loss': 0.32965970039367676, 'epoch': 3.0})

The Multi-lingual DistilBert achieves an accuracy on the validation dataset of 85.6%.

### 5.2 Multi-Lingual Bert

In [11]:
# Constants
model_name = 'bert-base-multilingual-cased'

In [12]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Tokenize dataset
ds = process_dataset(tokenizer)

'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 670cc97c-f016-4eed-99a2-39c097eb1ffa)')' thrown while requesting HEAD https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer_config.json


Map:   0%|          | 0/3069 [00:00<?, ? examples/s]

Map:   0%|          | 0/1559 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1559
    })
})


In [13]:
# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels = 2, 
                                                           id2label = id2label, 
                                                           label2id = label2id)

# Set TrainingArguments
training_args = TrainingArguments(output_dir = "mbert",
                                  learning_rate = 3.0e-5,
                                  per_device_train_batch_size = 16,
                                  per_device_eval_batch_size = 32,
                                  gradient_accumulation_steps = 2,
                                  gradient_checkpointing = True, 
                                  gradient_checkpointing_kwargs = {"use_reentrant": False},                                 
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  fp16 = True,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = True,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/288 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.823     0.854     0.838       765
           1      0.854     0.824     0.838       794

    accuracy                          0.838      1559
   macro avg      0.839     0.839     0.838      1559
weighted avg      0.839     0.838     0.838      1559

{'eval_loss': 0.3472679555416107, 'eval_accuracy': 0.8383579217447081, 'eval_precision': 0.8537859007832899, 'eval_recall': 0.8236775818639799, 'eval_runtime': 8.3434, 'eval_samples_per_second': 186.854, 'eval_steps_per_second': 5.873, 'epoch': 1.0}


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.935     0.757     0.837       765
           1      0.802     0.950     0.870       794

    accuracy                          0.855      1559
   macro avg      0.869     0.853     0.853      1559
weighted avg      0.868     0.855     0.853      1559

{'eval_loss': 0.34125080704689026, 'eval_accuracy': 0.8550352790250161, 'eval_precision': 0.8021276595744681, 'eval_recall': 0.9496221662468514, 'eval_runtime': 8.3313, 'eval_samples_per_second': 187.126, 'eval_steps_per_second': 5.881, 'epoch': 2.0}


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.865     0.854     0.859       765
           1      0.861     0.872     0.866       794

    accuracy                          0.863      1559
   macro avg      0.863     0.863     0.863      1559
weighted avg      0.863     0.863     0.863      1559

{'eval_loss': 0.3605136573314667, 'eval_accuracy': 0.8627325208466966, 'eval_precision': 0.8606965174129353, 'eval_recall': 0.871536523929471, 'eval_runtime': 8.3377, 'eval_samples_per_second': 186.981, 'eval_steps_per_second': 5.877, 'epoch': 3.0}
{'train_runtime': 258.798, 'train_samples_per_second': 35.576, 'train_steps_per_second': 1.113, 'train_loss': 0.29645397928025985, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.29645397928025985, metrics={'train_runtime': 258.798, 'train_samples_per_second': 35.576, 'train_steps_per_second': 1.113, 'train_loss': 0.29645397928025985, 'epoch': 3.0})

The Multi-lingual Bert achieves an accuracy on the validation dataset of 86.3%.
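
The DistilBert run uses a batch size of 32, while the Bert run uses a batch size of 16 with 2 gradient accumulation steps. Both give an effective batch size of 32, which is why both progress bars show 288 training steps. The arithmetic can be sketched as below, assuming a single GPU and that a trailing incomplete accumulation group does not count as an extra step; the function names are hypothetical and the calculation is a simplification of what the `Trainer` does internally.

```python
import math

def total_update_steps(n_train, per_device_batch, grad_accum, epochs):
    """Approximate number of optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

def eval_steps(n_val, per_device_batch):
    """Number of evaluation batches for one pass over the validation set."""
    return math.ceil(n_val / per_device_batch)

distilbert_steps = total_update_steps(3069, 32, 1, 3)  # batch 32, no accumulation
mbert_steps = total_update_steps(3069, 16, 2, 3)       # batch 16, 2 accumulation steps
val_steps = eval_steps(1559, 32)

print(distilbert_steps, mbert_steps, val_steps)  # 288 288 49
```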

### 5.3 Multi-Lingual DeBERTa V3

In [14]:
# Constants
model_name = 'microsoft/mdeberta-v3-base'

In [15]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Tokenize dataset
ds = process_dataset(tokenizer)



Map:   0%|          | 0/3069 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/1559 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1559
    })
})


In [16]:
# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels = 2, 
                                                           id2label = id2label, 
                                                           label2id = label2id)

# Set TrainingArguments
training_args = TrainingArguments(output_dir = "mdebertav3",
                                  learning_rate = 3.0e-5,
                                  per_device_train_batch_size = 16,
                                  per_device_eval_batch_size = 32,
                                  gradient_accumulation_steps = 2,
                                  gradient_checkpointing = True, 
                                  gradient_checkpointing_kwargs = {"use_reentrant": False},
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  fp16 = True,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = True,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/288 [00:00<?, ?it/s]

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.855     0.861     0.858       765
           1      0.865     0.859     0.862       794

    accuracy                          0.860      1559
   macro avg      0.860     0.860     0.860      1559
weighted avg      0.860     0.860     0.860      1559

{'eval_loss': 0.3431350290775299, 'eval_accuracy': 0.8601667735728031, 'eval_precision': 0.8654822335025381, 'eval_recall': 0.8589420654911839, 'eval_runtime': 14.9399, 'eval_samples_per_second': 104.351, 'eval_steps_per_second': 3.28, 'epoch': 1.0}


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.758     0.954     0.845       765
           1      0.941     0.707     0.807       794

    accuracy                          0.828      1559
   macro avg      0.850     0.830     0.826      1559
weighted avg      0.851     0.828     0.826      1559

{'eval_loss': 0.4065735340118408, 'eval_accuracy': 0.828094932649134, 'eval_precision': 0.9412751677852349, 'eval_recall': 0.7065491183879093, 'eval_runtime': 14.9392, 'eval_samples_per_second': 104.356, 'eval_steps_per_second': 3.28, 'epoch': 2.0}


  0%|          | 0/49 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.805     0.940     0.867       765
           1      0.931     0.781     0.849       794

    accuracy                          0.859      1559
   macro avg      0.868     0.860     0.858      1559
weighted avg      0.869     0.859     0.858      1559

{'eval_loss': 0.3701871931552887, 'eval_accuracy': 0.8588838999358563, 'eval_precision': 0.9309309309309309, 'eval_recall': 0.7808564231738035, 'eval_runtime': 14.9436, 'eval_samples_per_second': 104.326, 'eval_steps_per_second': 3.279, 'epoch': 3.0}
{'train_runtime': 479.9446, 'train_samples_per_second': 19.183, 'train_steps_per_second': 0.6, 'train_loss': 0.33482729064093697, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.33482729064093697, metrics={'train_runtime': 479.9446, 'train_samples_per_second': 19.183, 'train_steps_per_second': 0.6, 'train_loss': 0.33482729064093697, 'epoch': 3.0})

In [17]:
# Memory Cleanup because of occasional OOM
del ds, model, trainer, training_args
torch.cuda.empty_cache()
_ = gc.collect()

The Multi-lingual DeBERTa V3 achieves an accuracy on the validation dataset of 85.9% after the final epoch; its best epoch (epoch 1) reaches 86.0%.
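
Because all runs set `load_best_model_at_end = True` with `metric_for_best_model = 'accuracy'`, the model kept at the end is the checkpoint with the highest `eval_accuracy`, which is not necessarily the last epoch. The small sketch below applies that selection rule to the `eval_accuracy` values copied from the logs above; the dictionary keys and the helper `best_epoch` are only illustrative names.

```python
# eval_accuracy per epoch, copied from the training logs of sections 5.1 - 5.3
eval_accuracy = {
    "mdistilbert": [0.8383579217447081, 0.8492623476587556, 0.8563181526619628],
    "mbert": [0.8383579217447081, 0.8550352790250161, 0.8627325208466966],
    "mdebertav3": [0.8601667735728031, 0.828094932649134, 0.8588838999358563],
}

def best_epoch(scores):
    """Return (1-based epoch, score) for the highest score in the list."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx + 1, scores[idx]

best = {name: best_epoch(scores) for name, scores in eval_accuracy.items()}
for name, (epoch, acc) in best.items():
    print(f"{name}: best epoch {epoch}, accuracy {acc:.3f}")
```

For DistilBert and Bert the best epoch is the last one, but for DeBERTa V3 the first epoch scores highest in this run.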

## 6.0 Multi-Lingual DeBERTa V3 - Train and Validate with complete dataset

Since the Multi-lingual DeBERTa V3 is the newest of the 3 transformer models, we will use this model again for a test with the complete dataset.
Feel free to run this test yourself for any other model.

The complete dataset will be split into 80% for training and the remaining 20% for validation. All other settings stay the same.

First the complete dataset will be processed. For a simple EDA and further explanation of the code please look at the notebook 'Prepare_Train_and_Validation_Datasets.ipynb'.

In [18]:
# Constants
SEED = 42
MAX_WORDS = 192

def get_dpgnews_df(seed):
    # Set 1: Articles
    articles_df = pd.read_json('./dpgMedia2019-articles-bypublisher.jsonl', lines = True)
    articles_df = articles_df.set_index('id')
    
    # Set 2: Labels
    labels_df = pd.read_json('./dpgMedia2019-labels-bypublisher.jsonl', lines = True)
    labels_df = labels_df.set_index('id')
    
    # Finalize Full Data
    dpgnews_df = articles_df.join(labels_df, on = ['id'], how = 'inner')
    
    # Randomize all rows...
    dpgnews_df = dpgnews_df.sample(frac = 1.0, random_state = seed)
    dpgnews_df.reset_index(inplace = True)
    print(f'DPGNews2019 Dataframe Shape: {dpgnews_df.shape}') 

    return dpgnews_df

# Cap the text at a maximum number of words
def maximize_word_count(text):
    text_list = text.split(' ')
    
    if len(text_list) >= MAX_WORDS:
        maximized_text = ' '.join(text_list[:MAX_WORDS])
    else:
        maximized_text = ' '.join(text_list)
    return maximized_text

# Get DpgNews Dataframe
dpgnews_df = get_dpgnews_df(SEED)

# Apply the word cap to every article
dpgnews_df['max_words_text'] = dpgnews_df['text'].apply(maximize_word_count)

# Convert the partisan flag to an integer label (1 if the value is the string 'true', else 0)
dpgnews_df["labels"] = (dpgnews_df["partisan"] == 'true').astype(int)

# Train Test Split
train_df, val_df = train_test_split(dpgnews_df, 
                                    test_size = 0.20, 
                                    random_state = SEED,
                                    stratify = dpgnews_df.partisan.values)

# Summary
print(f'Training Dataset Shape: {train_df.shape}')
print(f'Validation Dataset Shape: {val_df.shape}')

DPGNews2019 Dataframe Shape: (103870, 8)
Training Dataset Shape: (83096, 10)
Validation Dataset Shape: (20774, 10)
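
The two preprocessing steps in the cell above (capping the text at `MAX_WORDS` words and turning the partisan flag into a 0/1 label) can be tried in isolation. The sketch below uses plain Python without pandas; `cap_words` and `partisan_to_label` are hypothetical stand-ins for the notebook's logic, and the sample strings are made up.

```python
def cap_words(text, max_words):
    """Keep at most max_words space-separated words of the text."""
    return ' '.join(text.split(' ')[:max_words])

def partisan_to_label(partisan):
    """Map the string 'true' to 1 and anything else to 0, as in the cell above."""
    return 1 if partisan == 'true' else 0

capped = cap_words("a b c d e", 3)    # longer than the cap: cut to 3 words
unchanged = cap_words("a b", 3)       # shorter than the cap: returned as-is
toy_labels = [partisan_to_label(p) for p in ['true', 'false', 'true']]
bool_label = partisan_to_label(True)  # a boolean True is NOT equal to the string 'true'

print(capped, '|', unchanged, '|', toy_labels, '|', bool_label)
```

The last line is a point to watch: the comparison is against the string `'true'`, so a column that holds booleans (as the CSV files loaded in section 2.0 appear to) would map every row to 0 with this logic.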


In [19]:
# Constants
model_name = 'microsoft/mdeberta-v3-base'

In [20]:
# Set Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Tokenize dataset
ds = process_dataset(tokenizer)



Map:   0%|          | 0/83096 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/20774 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 83096
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20774
    })
})


In [21]:
# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels = 2, 
                                                           id2label = id2label, 
                                                           label2id = label2id)

# Set TrainingArguments
training_args = TrainingArguments(output_dir = "mdebertav3",
                                  learning_rate = 3.0e-5,
                                  per_device_train_batch_size = 16,
                                  per_device_eval_batch_size = 32,
                                  gradient_accumulation_steps = 2,
                                  gradient_checkpointing = True, 
                                  gradient_checkpointing_kwargs = {"use_reentrant": False},
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  fp16 = True,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = True,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/7791 [00:00<?, ?it/s]

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.335, 'learning_rate': 2.8074701578744707e-05, 'epoch': 0.19}
{'loss': 0.2226, 'learning_rate': 2.6157104351174434e-05, 'epoch': 0.39}
{'loss': 0.1971, 'learning_rate': 2.4231805929919137e-05, 'epoch': 0.58}
{'loss': 0.1786, 'learning_rate': 2.2306507508663844e-05, 'epoch': 0.77}
{'loss': 0.1667, 'learning_rate': 2.0381209087408547e-05, 'epoch': 0.96}


  0%|          | 0/650 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.920     0.961     0.940     10190
           1      0.961     0.920     0.940     10584

    accuracy                          0.940     20774
   macro avg      0.941     0.940     0.940     20774
weighted avg      0.941     0.940     0.940     20774

{'eval_loss': 0.164809450507164, 'eval_accuracy': 0.9400693174160007, 'eval_precision': 0.9611851851851851, 'eval_recall': 0.9195011337868481, 'eval_runtime': 201.3636, 'eval_samples_per_second': 103.167, 'eval_steps_per_second': 3.228, 'epoch': 1.0}
{'loss': 0.1422, 'learning_rate': 1.8455910666153254e-05, 'epoch': 1.16}
{'loss': 0.1325, 'learning_rate': 1.653446284174047e-05, 'epoch': 1.35}
{'loss': 0.1232, 'learning_rate': 1.4609164420485175e-05, 'epoch': 1.54}
{'loss': 0.1224, 'learning_rate': 1.268386599922988e-05, 'epoch': 1.73}
{'loss': 0.1195, 'learning_rate': 1.0758567577974585e-05, 'epoch': 1.93}


  0%|          | 0/650 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.935     0.964     0.949     10190
           1      0.964     0.935     0.950     10584

    accuracy                          0.949     20774
   macro avg      0.950     0.950     0.949     20774
weighted avg      0.950     0.949     0.949     20774

{'eval_loss': 0.14764031767845154, 'eval_accuracy': 0.9493597766438818, 'eval_precision': 0.9642509253847652, 'eval_recall': 0.9352796674225246, 'eval_runtime': 201.2591, 'eval_samples_per_second': 103.22, 'eval_steps_per_second': 3.23, 'epoch': 2.0}
{'loss': 0.0993, 'learning_rate': 8.833269156719292e-06, 'epoch': 2.12}
{'loss': 0.0839, 'learning_rate': 6.911821332306508e-06, 'epoch': 2.31}
{'loss': 0.0829, 'learning_rate': 4.9865229110512135e-06, 'epoch': 2.5}
{'loss': 0.081, 'learning_rate': 3.065075086638429e-06, 'epoch': 2.7}
{'loss': 0.0783, 'learning_rate': 1.1397766653831344e-06, 'epoch': 2.89}


  0%|          | 0/650 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0      0.937     0.967     0.952     10190
           1      0.968     0.937     0.952     10584

    accuracy                          0.952     20774
   macro avg      0.952     0.952     0.952     20774
weighted avg      0.952     0.952     0.952     20774

{'eval_loss': 0.17834103107452393, 'eval_accuracy': 0.9518147684605757, 'eval_precision': 0.9675090252707581, 'eval_recall': 0.936885865457294, 'eval_runtime': 200.8521, 'eval_samples_per_second': 103.429, 'eval_steps_per_second': 3.236, 'epoch': 3.0}
{'train_runtime': 11725.4649, 'train_samples_per_second': 21.26, 'train_steps_per_second': 0.664, 'train_loss': 0.1417716857083663, 'epoch': 3.0}


TrainOutput(global_step=7791, training_loss=0.1417716857083663, metrics={'train_runtime': 11725.4649, 'train_samples_per_second': 21.26, 'train_steps_per_second': 0.664, 'train_loss': 0.1417716857083663, 'epoch': 3.0})

The Multi-lingual DeBERTa V3 achieves an accuracy on the validation dataset of 95.2% after training on 80% of the full dataset.
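
The `learning_rate` values in the log above shrink steadily during training. They are consistent with a linear decay from 3.0e-5 to zero over the 7791 training steps, assuming the default linear scheduler without warmup and one log entry every 500 steps. A minimal sketch of that schedule follows; `linear_lr` is a hypothetical helper, not the library's scheduler.

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

lr_at_500 = linear_lr(500, 7791, 3.0e-5)
lr_at_end = linear_lr(7791, 7791, 3.0e-5)

print(lr_at_500)  # close to the first logged value, 2.8074701578744707e-05
print(lr_at_end)  # 0.0
```

The later logged values are slightly higher than this formula predicts, by roughly the decay of two steps. One possible explanation is that a few optimizer steps were skipped by the fp16 loss scaling, but the logs do not confirm this.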

## 7.0 Summary

With the 3 multi-lingual Transformer NLP models trained and validated, we have an interesting baseline to compare with a finetuned GPT-3.5/GPT-4 model.

The DistilBert model achieved an accuracy of 85.6%, the Bert model 86.3% and the DeBERTa V3 model 85.9% (86.0% at its best epoch). It will be very interesting to see whether we can finetune a GPT model as a classifier and match or exceed the 86% - 88% accuracy target set by the best regular model.

I did multiple training runs, and on various occasions the models scored up to 2% higher or 1% lower than the values mentioned above.

The Multi-lingual DeBERTa V3 achieves an accuracy on the validation dataset of 95.2% after training on 80% of the full dataset.