## 1.0 Introduction

In this notebook we will train and validate two open LLMs to establish a baseline for the accuracy that currently available open LLMs can achieve.

We will compare this baseline against the accuracy of GPT-3.5 Turbo and the accuracies achieved by the three smaller Transformer models.

The open LLMs that will be used are:
* PolyLM 1.7B ([HuggingFace URL](https://huggingface.co/DAMO-NLP-MT/polylm-1.7b))
* OpenLLaMA 7B V2 ([HuggingFace URL](https://huggingface.co/openlm-research/open_llama_7b_v2))


The PolyLM model contains 1.7 billion parameters and the OpenLLaMA model contains 7 billion parameters. The Dutch language was part of the datasets that were used for pre-training these models.

In [1]:
# Import Modules
import gc
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, DatasetDict
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score
from transformers import (AutoConfig,
                          AutoModelForSequenceClassification, 
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          DataCollatorWithPadding, 
                          TrainingArguments, 
                          Trainer)

## 2.0 Load Datasets

We will reload the training and validation CSV files that were generated earlier with the notebook 'Prepare_Train_and_Validation_Datasets.ipynb'.

In [2]:
# Load Datasets
train_df = pd.read_csv('./data/train_df.csv')
val_df = pd.read_csv('./data/val_df.csv')

# Summary
print(train_df.shape)
print(val_df.shape)

(3069, 11)
(1559, 11)


In [3]:
# Summary
train_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,10706318,Ogen als schoteltjes bij de Tachtigjarige Oorlog,Ogen als schoteltjes bij de Tachtigjarige Oorl...,/home,2018-10-07,trouw,True,www.trouw.nl/home/ogen-als-schoteltjes-bij-de-...,539,Ogen als schoteltjes bij de Tachtigjarige Oorl...,1
1,12633805,"Geen beeld, maar een monument voor Mandela in ...","Geen beeld, maar een monument voor Mandela in ...",/amsterdam,2019-05-10,parool,True,www.parool.nl/amsterdam/geen-beeld-maar-een-mo...,662,"Geen beeld, maar een monument voor Mandela in ...",1
2,7140125,Hoe ga je een onveilige arbeidscultuur zoals i...,Hoe ga je een onveilige arbeidscultuur zoals i...,/,2017-04-18,trouw,True,,494,Hoe ga je een onveilige arbeidscultuur zoals i...,1
3,4490774,Wetenschappers ontdekken lichtgevende discokikker,Wetenschappers ontdekken lichtgevende discokik...,/,2017-03-14,trouw,True,,291,Wetenschappers ontdekken lichtgevende discokik...,1
4,10592180,Meer fouten kabinet bij steun aan strijdgroepe...,Meer fouten kabinet bij steun aan strijdgroepe...,/home,2018-09-11,trouw,True,www.trouw.nl/home/meer-fouten-kabinet-bij-steu...,471,Meer fouten kabinet bij steun aan strijdgroepe...,1


In [4]:
# Summary
val_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,9266995,Verdachte dodelijke steekpartijen Maastricht l...,Verdachte dodelijke steekpartijen Maastricht l...,/nieuws,2017-12-18,ad,False,www.ad.nl/binnenland/verdachte-dodelijke-steek...,188,Verdachte dodelijke steekpartijen Maastricht l...,0
1,4130077,Honderden arrestaties bij acties tegen mensen ...,Honderden arrestaties bij acties tegen mensen ...,/nieuws,2017-02-11,ad,False,www.ad.nl/buitenland/honderden-arrestaties-bij...,122,Honderden arrestaties bij acties tegen mensen ...,0
2,11147268,Waarom de 'oudejaarsbonus' voor de jongeren va...,Waarom de 'oudejaarsbonus' voor de jongeren va...,/home,2019-01-20,trouw,True,www.trouw.nl/home/waarom-de-oudejaarsbonus-voo...,262,Waarom de 'oudejaarsbonus' voor de jongeren va...,1
3,10749100,Klaar voor de verdediging,Klaar voor de verdedigingOver ruim een week be...,/nieuws,2018-10-16,ad,False,www.ad.nl/binnenland/klaar-voor-de-verdediging...,411,Klaar voor de verdedigingOver ruim een week be...,0
4,10700707,Windvlaag grijpt springmatras en doodt 2-jarig...,Windvlaag grijpt springmatras en doodt 2-jarig...,/nieuws,2018-10-05,ad,False,www.ad.nl/buitenland/windvlaag-grijpt-springma...,286,Windvlaag grijpt springmatras en doodt 2-jarig...,0


## 3.0 Process Datasets

In [5]:
# Convert the train and validation DataFrames into a tokenized DatasetDict.
def process_dataset(tokenizer):
    # Tokenize Helper
    def preprocess_function(examples):
        return tokenizer(examples["text"], 
                         truncation = True)

    # Create DataSets
    tdf = pd.DataFrame({"text": train_df.max_words_text.values, "label": train_df.labels.values})
    vdf = pd.DataFrame({"text": val_df.max_words_text.values, "label": val_df.labels.values})
    tds = Dataset.from_pandas(tdf)
    vds = Dataset.from_pandas(vdf)

    ds = DatasetDict()
    ds['train'] = tds
    ds['validation'] = vds

    # Tokenize Text
    ds = ds.map(preprocess_function, batched = True)

    # Summary
    print(ds)

    return ds

## 4.0 Evaluation Setup

In [6]:
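def compute_metrics(val_preds):
    # val_preds is the (logits, labels) pair that the Trainer passes in
    preds, labels = val_preds
    # Predicted class = index of the largest logit per example
    preds = np.argmax(preds, axis = 1)
    
    # Print the per-class report for this evaluation pass
    report = classification_report(labels, preds)
    print(report)
    
    # precision_score and recall_score default to the positive class (label 1)
    accuracy_val = accuracy_score(labels, preds)
    precision_val = precision_score(labels, preds)
    recall_val = recall_score(labels, preds)
    
    return {"accuracy": accuracy_val, "precision": precision_val, "recall": recall_val}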
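
To make the metric computation concrete, here is a minimal pure-Python sketch of the same steps on toy data. The logits and labels are invented for illustration, and `argmax_rows` and `binary_metrics` are hypothetical helpers, not part of the notebook; the notebook itself relies on NumPy and scikit-learn.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_metrics(labels, preds, positive=1):
    """Accuracy plus precision and recall for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp),  # assumes at least one positive prediction
        "recall": tp / (tp + fn),     # assumes at least one positive label
    }


# Invented logits: one row per example, one column per class.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 0, 1, 1]

preds = argmax_rows(logits)
metrics = binary_metrics(labels, preds)
print(preds, metrics)
```

Precision and recall here describe label 1 only, which mirrors the scikit-learn defaults used in `compute_metrics`.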

## 5.0 Training and Validation on Subset of the Data

In this section we will train and validate each open LLM on the loaded and processed datasets.

Optimizing the hyperparameters would likely improve performance further.

### 5.1 PolyLM 1.7B

In [7]:
# Constants
model_name = 'DAMO-NLP-MT/polylm-1.7b'

In [8]:
# Create Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast = False,
                                          legacy = False)
tokenizer.pad_token = tokenizer.eos_token

# Set Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, padding = 'longest')

# Tokenize dataset
ds = process_dataset(tokenizer)

# Summary
print(tokenizer)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1559
    })
})
LlamaTokenizer(name_or_path='DAMO-NLP-MT/polylm-1.7b', vocab_size=256000, model_max_length=2048, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
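
The collator above is configured with `padding = 'longest'`, so each batch is padded to its own longest sequence rather than to the 2048-token model maximum. The following is a minimal sketch of that idea in plain Python, not the library implementation; `pad_batch` is a hypothetical helper and the token ids are made up. It uses id 2 as the pad token because the tokenizer printout shows `</s>` (id 2) reused for padding, and it pads on the right as reported by `padding_side`.

```python
def pad_batch(batch_input_ids, pad_token_id):
    """Right-pad every sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three tokenized articles of different lengths.
batch = pad_batch([[1, 57, 903], [1, 12], [1, 8, 44, 301, 2]], pad_token_id=2)
for row, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(row, mask)
```

Padding per batch keeps short articles from paying the cost of the longest article in the whole dataset.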


To reduce the memory footprint, the model is quantized to 4 bits and then fine-tuned with a QLoRA setup.

In [9]:
# Create Config
config = AutoConfig.from_pretrained(model_name,
                                    num_labels = 2,                                                            
                                    use_cache = False)

# Create Quantization Config
quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                         bnb_4bit_use_double_quant = True,
                                         bnb_4bit_quant_type = 'nf4',
                                         bnb_4bit_compute_dtype = torch.bfloat16)

# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           config = config,
                                                           device_map = {"":0},
                                                           quantization_config = quantization_config)

# Set Pad Token Id
model.config.pad_token_id = tokenizer.pad_token_id

# Create LoRA config
loraconfig = LoraConfig(r = 16,
                        lora_alpha = 16,
                        lora_dropout = 0.05,
                        bias = 'none',
                        task_type = TaskType.SEQ_CLS,
                        fan_in_fan_out = True)

# Prep for Training
model = prepare_model_for_kbit_training(model)

# Create QLoRA Model
model = get_peft_model(model, loraconfig)
model.print_trainable_parameters()

# Show Model Summary
print(model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at DAMO-NLP-MT/polylm-1.7b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 3,149,824 || all params: 1,740,238,848 || trainable%: 0.1809995221989206
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(256000, 2048)
        (wpe): Embedding(2048, 2048)
        (drop): Dropout(p=0.0, inplace=False)
        (h): ModuleList(
          (0-23): 24 x GPT2Block(
            (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=6144, bias=False)
                )
                (lora_embedding_A): Paramet
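
The reported count of 3,149,824 trainable parameters can be reproduced with simple arithmetic. The sketch below assumes that only the `c_attn` projection in each block carries a LoRA adapter (the only adapted module visible in the truncated printout) and that the bias-free `score` classification head is also trained. Both are assumptions, although the total they produce matches the reported number.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (in x r), B is (r x out)."""
    return in_features * r + r * out_features


hidden = 2048      # embedding width from the printed model
n_layers = 24      # "(0-23): 24 x GPT2Block"
r = 16             # LoRA rank from LoraConfig
num_labels = 2

per_layer = lora_params(hidden, 3 * hidden, r)  # c_attn maps 2048 -> 6144
adapters = n_layers * per_layer
score_head = num_labels * hidden                # assumed: score.weight without bias
trainable = adapters + score_head

all_params = 1_740_238_848                      # "all params" from the printout
print(f"{trainable:,} trainable ({100 * trainable / all_params:.4f}%)")
```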

In [10]:
# Set TrainingArguments
training_args = TrainingArguments(output_dir = "polylm_1.7b",
                                  learning_rate = 2.0e-4,
                                  warmup_steps = 32,
                                  lr_scheduler_type = 'cosine',                                  
                                  per_device_train_batch_size = 8,
                                  per_device_eval_batch_size = 8,
                                  gradient_accumulation_steps = 4,
                                  gradient_checkpointing = True, 
                                  gradient_checkpointing_kwargs = {"use_reentrant": False},                                 
                                  bf16 = True,
                                  optim = "paged_adamw_8bit",                                 
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = False,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

# Set Trainer
trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

              precision    recall  f1-score   support

           0       0.88      0.64      0.74       765
           1       0.72      0.92      0.81       794

    accuracy                           0.78      1559
   macro avg       0.80      0.78      0.77      1559
weighted avg       0.80      0.78      0.77      1559

{'eval_loss': 0.5693616271018982, 'eval_accuracy': 0.7793457344451572, 'eval_precision': 0.7232142857142857, 'eval_recall': 0.9181360201511335, 'eval_runtime': 110.416, 'eval_samples_per_second': 14.119, 'eval_steps_per_second': 1.766, 'epoch': 1.0}


              precision    recall  f1-score   support

           0       0.89      0.74      0.81       765
           1       0.79      0.91      0.84       794

    accuracy                           0.83      1559
   macro avg       0.84      0.83      0.83      1559
weighted avg       0.83      0.83      0.83      1559

{'eval_loss': 0.441089004278183, 'eval_accuracy': 0.8268120590121874, 'eval_precision': 0.7854030501089324, 'eval_recall': 0.9080604534005038, 'eval_runtime': 110.3901, 'eval_samples_per_second': 14.123, 'eval_steps_per_second': 1.766, 'epoch': 2.0}


              precision    recall  f1-score   support

           0       0.81      0.82      0.82       765
           1       0.83      0.82      0.82       794

    accuracy                           0.82      1559
   macro avg       0.82      0.82      0.82      1559
weighted avg       0.82      0.82      0.82      1559

{'eval_loss': 0.41999101638793945, 'eval_accuracy': 0.8197562540089801, 'eval_precision': 0.8259212198221093, 'eval_recall': 0.818639798488665, 'eval_runtime': 110.3992, 'eval_samples_per_second': 14.121, 'eval_steps_per_second': 1.766, 'epoch': 3.0}
{'train_runtime': 2324.9851, 'train_samples_per_second': 3.96, 'train_steps_per_second': 0.124, 'train_loss': 0.5053038067287869, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.5053038067287869, metrics={'train_runtime': 2324.9851, 'train_samples_per_second': 3.96, 'train_steps_per_second': 0.124, 'train_loss': 0.5053038067287869, 'epoch': 3.0})
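
The training arguments combine `warmup_steps = 32` with a cosine scheduler over the 288 optimizer steps of this run. As a rough sketch of what that schedule looks like, the hypothetical function `lr_at` below ramps the learning rate linearly to 2.0e-4 and then decays it along half a cosine wave to zero. This is believed to mirror the default cosine-with-warmup behaviour in Transformers, but treat it as an illustration rather than the library code.

```python
import math


def lr_at(step, base_lr=2.0e-4, warmup_steps=32, total_steps=288):
    """Learning rate at a given optimizer step: linear warmup, cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 16, 32, 160, 288):
    print(step, f"{lr_at(step):.2e}")
```

The rate peaks at step 32, falls to half its peak at step 160 (the midpoint of the decay phase), and reaches zero at the final step.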

In [11]:
# Cleanup
del trainer, model, training_args
torch.cuda.empty_cache()
gc.collect()

12274

### 5.2 OpenLLaMA 7B V2

In [12]:
# Constants
model_name = 'openlm-research/open_llama_7b_v2'

In [13]:
# Create Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast = False,
                                          add_eos_token = True)
tokenizer.pad_token_id = 0

# Set Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, padding = 'longest')

# Tokenize dataset
ds = process_dataset(tokenizer)

# Summary
print(tokenizer)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1559
    })
})
LlamaTokenizer(name_or_path='openlm-research/open_llama_7b_v2', vocab_size=32000, model_max_length=2048, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}


To reduce the memory footprint, the model is quantized to 4 bits and then fine-tuned with a QLoRA setup.
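
A back-of-the-envelope estimate shows why quantization matters at this scale. The sketch below uses the nominal parameter counts from the introduction and counts weight storage only. It ignores quantization constants, layers that stay unquantized, activations, LoRA weights and optimizer state, so real memory use will be higher. `weight_gib` is a hypothetical helper.

```python
def weight_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_param / 8 / 2**30


# Nominal parameter counts, as stated in the introduction.
models = {"PolyLM 1.7B": 1.7e9, "OpenLLaMA 7B V2": 7e9}

for name, n_params in models.items():
    full = weight_gib(n_params, 16)
    quant = weight_gib(n_params, 4)
    print(f"{name}: 16-bit ~{full:.1f} GiB, 4-bit ~{quant:.1f} GiB")
```

Under these simplifications the 4-bit weights take a quarter of the 16-bit footprint.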

In [14]:
# Create Config
config = AutoConfig.from_pretrained(model_name,
                                    num_labels = 2,                                                            
                                    use_cache = False)

# Create Quantization Config
quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                         bnb_4bit_use_double_quant = True,
                                         bnb_4bit_quant_type = 'nf4',
                                         bnb_4bit_compute_dtype = torch.bfloat16)

# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           config = config,
                                                           device_map = {"":0},
                                                           quantization_config = quantization_config)

# Create LoRA config
loraconfig = LoraConfig(r = 16,
                        lora_alpha = 16,
                        lora_dropout = 0.05,
                        bias = 'none',
                        task_type = TaskType.SEQ_CLS)

# Prep for Training
model = prepare_model_for_kbit_training(model)

# Create QLoRA Model
model = get_peft_model(model, loraconfig)
model.print_trainable_parameters()

# Show Model Summary
print(model)

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at openlm-research/open_llama_7b_v2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 8,396,800 || all params: 6,615,748,608 || trainable%: 0.12692138860666938
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): LlamaForSequenceClassification(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096,
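
The printout is cut off after `q_proj`, so it does not show which other modules carry adapters. The reported trainable-parameter count lets us infer how many there are per layer. The sketch below assumes every adapted module is a square 4096 x 4096 projection and that the bias-free `score` head is trained as well; both are assumptions.

```python
trainable_reported = 8_396_800   # from print_trainable_parameters()
hidden = 4096                    # embedding width from the printed model
n_layers = 32                    # "(0-31): 32 x LlamaDecoderLayer"
r = 16                           # LoRA rank from LoraConfig
num_labels = 2

per_module = hidden * r + r * hidden   # LoRA A (4096 x 16) plus B (16 x 4096)
score_head = num_labels * hidden       # assumed: score.weight without bias

modules_per_layer = (trainable_reported - score_head) / (n_layers * per_module)
print(modules_per_layer)
```

The result is exactly two adapted modules per layer, which would be consistent with adapters on `q_proj` and one further square projection such as `v_proj`.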

In [15]:
# Set TrainingArguments
training_args = TrainingArguments(output_dir = "open_llama_7b_v2",
                                  learning_rate = 2.0e-4,
                                  warmup_steps = 32,
                                  lr_scheduler_type = 'cosine',                                  
                                  per_device_train_batch_size = 4,
                                  per_device_eval_batch_size = 4,
                                  gradient_accumulation_steps = 8,
                                  gradient_checkpointing = True, 
                                  gradient_checkpointing_kwargs = {"use_reentrant": False},                                 
                                  fp16 = True,
                                  optim = "paged_adamw_8bit",                                 
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = False,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

# Set Trainer
trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

              precision    recall  f1-score   support

           0       0.93      0.75      0.83       765
           1       0.79      0.95      0.87       794

    accuracy                           0.85      1559
   macro avg       0.86      0.85      0.85      1559
weighted avg       0.86      0.85      0.85      1559

{'eval_loss': 0.3466767370700836, 'eval_accuracy': 0.8492623476587556, 'eval_precision': 0.7945205479452054, 'eval_recall': 0.9496221662468514, 'eval_runtime': 555.7689, 'eval_samples_per_second': 2.805, 'eval_steps_per_second': 0.702, 'epoch': 1.0}


              precision    recall  f1-score   support

           0       0.97      0.70      0.82       765
           1       0.77      0.98      0.86       794

    accuracy                           0.84      1559
   macro avg       0.87      0.84      0.84      1559
weighted avg       0.87      0.84      0.84      1559

{'eval_loss': 0.3804403841495514, 'eval_accuracy': 0.8441308531109686, 'eval_precision': 0.7746759720837487, 'eval_recall': 0.9785894206549118, 'eval_runtime': 555.7354, 'eval_samples_per_second': 2.805, 'eval_steps_per_second': 0.702, 'epoch': 2.0}


              precision    recall  f1-score   support

           0       0.89      0.89      0.89       765
           1       0.89      0.90      0.89       794

    accuracy                           0.89      1559
   macro avg       0.89      0.89      0.89      1559
weighted avg       0.89      0.89      0.89      1559

{'eval_loss': 0.2637627422809601, 'eval_accuracy': 0.8909557408595253, 'eval_precision': 0.8909774436090225, 'eval_recall': 0.8954659949622166, 'eval_runtime': 555.7108, 'eval_samples_per_second': 2.805, 'eval_steps_per_second': 0.702, 'epoch': 3.0}
{'train_runtime': 11017.6721, 'train_samples_per_second': 0.836, 'train_steps_per_second': 0.026, 'train_loss': 0.3506848282284207, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.3506848282284207, metrics={'train_runtime': 11017.6721, 'train_samples_per_second': 0.836, 'train_steps_per_second': 0.026, 'train_loss': 0.3506848282284207, 'epoch': 3.0})
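
Both runs finish at `global_step=288` even though they use different batch sizes. The arithmetic below shows why, assuming a single GPU as the `device_map` setting suggests. `optimizer_steps` is a hypothetical helper; how a trailing partial accumulation group is rounded may differ between library versions, but here the batches divide evenly so the result is the same either way.

```python
import math


def optimizer_steps(n_train, batch_size, grad_accum, epochs):
    """Number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    updates_per_epoch = max(1, batches_per_epoch // grad_accum)
    return updates_per_epoch * epochs


n_train, n_val, epochs = 3069, 1559, 3

# (per-device batch size, gradient accumulation steps) per model
runs = {"PolyLM 1.7B": (8, 4), "OpenLLaMA 7B V2": (4, 8)}

for name, (batch_size, grad_accum) in runs.items():
    steps = optimizer_steps(n_train, batch_size, grad_accum, epochs)
    eval_batches = math.ceil(n_val / batch_size)
    print(name, "effective batch:", batch_size * grad_accum,
          "optimizer steps:", steps, "eval batches:", eval_batches)
```

Halving the per-device batch size while doubling gradient accumulation keeps the effective batch size at 32, so the number of updates is unchanged.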

## 6.0 Summary

After three epochs, PolyLM 1.7B reaches a validation accuracy of 82.0% (its best result, 82.7%, came after the second epoch), while OpenLLaMA 7B V2 reaches 89.1%.
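
As a sanity check on these figures, the confusion-matrix counts can be inferred from the logged precision, recall and class supports (794 examples with label 1, 765 with label 0). The counts themselves are not printed in the notebook; the sketch below reconstructs them, and `counts_from_report` is a hypothetical helper.

```python
def counts_from_report(support_pos, support_neg, precision, recall):
    """Infer (tp, fp, fn, tn) from positive-class precision and recall."""
    tp = round(recall * support_pos)
    predicted_pos = round(tp / precision)
    fp = predicted_pos - tp
    fn = support_pos - tp
    tn = support_neg - fp
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# (eval_precision, eval_recall) as logged during training
runs = {
    "PolyLM 1.7B, epoch 2": (0.7854030501089324, 0.9080604534005038),
    "PolyLM 1.7B, epoch 3": (0.8259212198221093, 0.818639798488665),
    "OpenLLaMA 7B V2, epoch 3": (0.8909774436090225, 0.8954659949622166),
}

results = {}
for name, (precision, recall) in runs.items():
    counts = counts_from_report(794, 765, precision, recall)
    results[name] = (counts, accuracy(*counts))
    print(name, counts, f"{results[name][1]:.4f}")
```

The recomputed accuracies agree with the logged `eval_accuracy` values, which confirms that 82.7% belongs to PolyLM's second epoch and 89.1% to OpenLLaMA's third.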