## 1.0 Introduction

In this notebook we will train and validate an open LLM to establish a baseline for the accuracy that the currently available open LLMs can achieve.

We will compare this baseline against the GPT-3.5 Turbo accuracy and the accuracies achieved by the 3 smaller Transformer models.

The open LLM that will be used is:
* PolyLM 1.7B ([HuggingFace URL](https://huggingface.co/DAMO-NLP-MT/polylm-1.7b))

This open LLM consists of 1.7 billion parameters. Dutch was part of the dataset used for pre-training.

In [1]:
# Import Modules
import evaluate
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, DatasetDict
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from transformers import (AutoConfig,
                          AutoModelForSequenceClassification, 
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          DataCollatorWithPadding, 
                          TrainingArguments, 
                          Trainer)

## 2.0 Load Datasets

We will reload the training and validation CSV files that were generated earlier with the notebook 'Prepare_Train_and_Validation_Datasets.ipynb'.

In [2]:
# Load Datasets
train_df = pd.read_csv('./data/train_df.csv')
val_df = pd.read_csv('./data/val_df.csv')

# Summary
print(train_df.shape)
print(val_df.shape)

(3069, 11)
(1559, 11)


In [3]:
# Summary
train_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,10706318,Ogen als schoteltjes bij de Tachtigjarige Oorlog,Ogen als schoteltjes bij de Tachtigjarige Oorl...,/home,2018-10-07,trouw,True,www.trouw.nl/home/ogen-als-schoteltjes-bij-de-...,539,Ogen als schoteltjes bij de Tachtigjarige Oorl...,1
1,12633805,"Geen beeld, maar een monument voor Mandela in ...","Geen beeld, maar een monument voor Mandela in ...",/amsterdam,2019-05-10,parool,True,www.parool.nl/amsterdam/geen-beeld-maar-een-mo...,662,"Geen beeld, maar een monument voor Mandela in ...",1
2,7140125,Hoe ga je een onveilige arbeidscultuur zoals i...,Hoe ga je een onveilige arbeidscultuur zoals i...,/,2017-04-18,trouw,True,,494,Hoe ga je een onveilige arbeidscultuur zoals i...,1
3,4490774,Wetenschappers ontdekken lichtgevende discokikker,Wetenschappers ontdekken lichtgevende discokik...,/,2017-03-14,trouw,True,,291,Wetenschappers ontdekken lichtgevende discokik...,1
4,10592180,Meer fouten kabinet bij steun aan strijdgroepe...,Meer fouten kabinet bij steun aan strijdgroepe...,/home,2018-09-11,trouw,True,www.trouw.nl/home/meer-fouten-kabinet-bij-steu...,471,Meer fouten kabinet bij steun aan strijdgroepe...,1


In [4]:
# Summary
val_df.head()

Unnamed: 0,id,title,text,mainSection,published_at,publisher,partisan,url,text_wordcount,max_words_text,labels
0,9266995,Verdachte dodelijke steekpartijen Maastricht l...,Verdachte dodelijke steekpartijen Maastricht l...,/nieuws,2017-12-18,ad,False,www.ad.nl/binnenland/verdachte-dodelijke-steek...,188,Verdachte dodelijke steekpartijen Maastricht l...,0
1,4130077,Honderden arrestaties bij acties tegen mensen ...,Honderden arrestaties bij acties tegen mensen ...,/nieuws,2017-02-11,ad,False,www.ad.nl/buitenland/honderden-arrestaties-bij...,122,Honderden arrestaties bij acties tegen mensen ...,0
2,11147268,Waarom de 'oudejaarsbonus' voor de jongeren va...,Waarom de 'oudejaarsbonus' voor de jongeren va...,/home,2019-01-20,trouw,True,www.trouw.nl/home/waarom-de-oudejaarsbonus-voo...,262,Waarom de 'oudejaarsbonus' voor de jongeren va...,1
3,10749100,Klaar voor de verdediging,Klaar voor de verdedigingOver ruim een week be...,/nieuws,2018-10-16,ad,False,www.ad.nl/binnenland/klaar-voor-de-verdediging...,411,Klaar voor de verdedigingOver ruim een week be...,0
4,10700707,Windvlaag grijpt springmatras en doodt 2-jarig...,Windvlaag grijpt springmatras en doodt 2-jarig...,/nieuws,2018-10-05,ad,False,www.ad.nl/buitenland/windvlaag-grijpt-springma...,286,Windvlaag grijpt springmatras en doodt 2-jarig...,0


## 3.0 Process Datasets

In [5]:
# Convert the train and validation DataFrames into a tokenized DatasetDict.
def process_dataset(tokenizer):
    # Tokenize Helper
    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation = True)

    # Create DataSets
    tdf = pd.DataFrame({"text": train_df.max_words_text.values, "label": train_df.labels.values})
    vdf = pd.DataFrame({"text": val_df.max_words_text.values, "label": val_df.labels.values})
    tds = Dataset.from_pandas(tdf)
    vds = Dataset.from_pandas(vdf)

    ds = DatasetDict()
    ds['train'] = tds
    ds['validation'] = vds

    # Tokenize Text
    ds = ds.map(preprocess_function, batched = True)

    # Summary
    print(ds)

    return ds

## 4.0 Evaluation Setup

In [6]:
metric = evaluate.load("accuracy")

task_evaluator = evaluate.evaluator("text-classification")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis = 1)

    return metric.compute(predictions = predictions, references = labels)
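
To illustrate what `compute_metrics` receives, the following minimal sketch applies the same argmax-then-compare logic to a made-up batch of logits. The array values are illustrative assumptions, not outputs of the model, and plain NumPy stands in for the `evaluate` accuracy metric.

```python
import numpy as np

# Hypothetical logits for 4 examples and 2 classes (illustrative values only).
logits = np.array([[ 1.2, -0.3],
                   [-0.5,  0.9],
                   [ 0.1,  0.4],
                   [ 2.0,  1.5]])
labels = np.array([0, 1, 0, 0])

# Same step as in compute_metrics: pick the class with the highest logit.
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the reference labels.
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Only the third example is misclassified here, so three of the four predictions count as correct.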

## 5.0 Training and Validation on Subset of the Data

In this section we will train and validate the open LLM on the loaded and processed datasets.

The hyperparameters below were not tuned; optimizing them would very likely improve performance further.

### 5.1 PolyLM 1.7B

In [7]:
# Constants
model_name = 'DAMO-NLP-MT/polylm-1.7b'

In [8]:
# Create Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast = False,
                                          legacy = False)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Set Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, padding = 'longest')

# Tokenize dataset
ds = process_dataset(tokenizer)

# Summary
print(tokenizer)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3069
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1559
    })
})
LlamaTokenizer(name_or_path='DAMO-NLP-MT/polylm-1.7b', vocab_size=256000, model_max_length=2048, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '</s>'}, clean_up_tokenization_spaces=False)
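
As a rough illustration of what truncation followed by `padding = 'longest'` does to a batch, the sketch below pads token id lists on the right up to the longest sequence in the batch and builds the matching attention masks. The helper `pad_batch`, the token ids, and the pad id of 2 are hypothetical; in the notebook this work is done by the tokenizer and `DataCollatorWithPadding`.

```python
def pad_batch(sequences, pad_token_id, max_length):
    """Truncate to max_length, then right-pad to the longest sequence in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; pad id 2 is an assumed stand-in for the EOS id.
batch = pad_batch([[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]],
                  pad_token_id=2, max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding each batch only to its own longest sequence, rather than to the tokenizer's `model_max_length` of 2048, keeps batches of short articles cheap to process.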


To reduce the memory footprint, the LLM will be quantized to 4 bits and then fine-tuned with a QLoRA setup.
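
A back-of-the-envelope estimate shows why quantization helps. The sketch below compares weight storage at 16 bits and 4 bits per parameter for the nominal 1.7 billion parameters. It is an idealized lower bound under stated assumptions: it ignores quantization constants, activations, optimizer state, and any layers that stay in higher precision.

```python
# Nominal parameter count from the model name (an approximation).
n_params = 1.7e9

def weight_gigabytes(n_params, bits_per_param):
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_gigabytes(n_params, 16)
nf4_gb = weight_gigabytes(n_params, 4)
print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {nf4_gb:.2f} GB")
```

The real footprint will be higher than the 4-bit figure, since this estimate covers the stored weights only.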

In [9]:
# Create Config
config = AutoConfig.from_pretrained(model_name,
                                    num_labels = 2,                                                            
                                    use_cache = False)

# Create Quantization Config
quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                         bnb_4bit_use_double_quant = True,
                                         bnb_4bit_quant_type = 'nf4',
                                         bnb_4bit_compute_dtype = torch.bfloat16)

# Create Model
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           config = config,
                                                           device_map = {"":0},
                                                           quantization_config = quantization_config)

# Set Pad Token Id
model.config.pad_token_id = tokenizer.pad_token_id

# Enable Gradient Checkpointing
model.gradient_checkpointing_enable()

# Create LoRA config
loraconfig = LoraConfig(r = 64,
                        lora_alpha = 16,
                        lora_dropout = 0.05,
                        bias = 'none',
                        task_type = TaskType.SEQ_CLS,
                        fan_in_fan_out = True)

# Prep for Training
model = prepare_model_for_kbit_training(model)

# Create LoRA Model
model = get_peft_model(model, loraconfig)
model.print_trainable_parameters()

# Show Model Summary
print(model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at DAMO-NLP-MT/polylm-1.7b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 12,591,104 || all params: 1,749,676,032 || trainable%: 0.7196248773898732
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(256000, 2048)
        (wpe): Embedding(2048, 2048)
        (drop): Dropout(p=0.0, inplace=False)
        (h): ModuleList(
          (0-23): 24 x GPT2Block(
            (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Linear4bit(
                in_features=2048, out_features=6144, bias=True
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=6144, bias=Fals
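
The trainable parameter count reported above can be checked by hand. The sketch assumes that LoRA adapters are attached only to the `c_attn` projection in each of the 24 blocks, which is what the visible part of the printout shows; the printout is truncated, so this remains an assumption. It also computes the factor `lora_alpha / r`, the standard LoRA scaling applied to the low-rank update.

```python
# Values taken from the LoRA config and the model printout.
r, lora_alpha = 64, 16
in_features, out_features = 2048, 6144   # c_attn projection
n_blocks = 24

# Each adapter adds A (in_features x r) and B (r x out_features), both without bias.
params_per_adapter = in_features * r + r * out_features
lora_params = params_per_adapter * n_blocks

# Trainable parameter count as printed by print_trainable_parameters().
reported_trainable = 12_591_104
remainder = reported_trainable - lora_params

# Standard LoRA scales the low-rank update B @ A by lora_alpha / r.
scaling = lora_alpha / r

print(lora_params, remainder, scaling)
```

Under this assumption the adapters account for 12,582,912 parameters. The remaining 8,192 presumably belong to the newly initialized classification head (`score`), although the truncated printout does not confirm this.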

In [10]:
# Set TrainingArguments
training_args = TrainingArguments(output_dir = "polylm_1.7b",
                                  learning_rate = 2.0e-4,
                                  per_device_train_batch_size = 8,
                                  per_device_eval_batch_size = 8,
                                  gradient_accumulation_steps = 4,
                                  warmup_steps = 32,
                                  bf16 = True,
                                  optim = "paged_adamw_8bit",                                 
                                  num_train_epochs = 3,
                                  weight_decay = 0.001,
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  load_best_model_at_end = True,
                                  metric_for_best_model = 'accuracy',
                                  greater_is_better = True)

# Set Trainer
trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = ds["train"],
                  eval_dataset = ds["validation"],
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics)

# Train Model
trainer.train()

{'eval_loss': 0.53938227891922, 'eval_accuracy': 0.7748556767158435, 'eval_runtime': 110.6391, 'eval_samples_per_second': 14.091, 'eval_steps_per_second': 1.762, 'epoch': 1.0}
{'eval_loss': 0.4602004289627075, 'eval_accuracy': 0.8293778062860808, 'eval_runtime': 110.6227, 'eval_samples_per_second': 14.093, 'eval_steps_per_second': 1.763, 'epoch': 2.0}
{'eval_loss': 0.4003584384918213, 'eval_accuracy': 0.8338678640153945, 'eval_runtime': 110.663, 'eval_samples_per_second': 14.088, 'eval_steps_per_second': 1.762, 'epoch': 3.0}
{'train_runtime': 2465.2235, 'train_samples_per_second': 3.735, 'train_steps_per_second': 0.117, 'train_loss': 0.5310058063930936, 'epoch': 3.0}


TrainOutput(global_step=288, training_loss=0.5310058063930936, metrics={'train_runtime': 2465.2235, 'train_samples_per_second': 3.735, 'train_steps_per_second': 0.117, 'train_loss': 0.5310058063930936, 'epoch': 3.0})
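
The step counts in the training output can be reproduced from the dataset sizes and batch settings. This is a minimal sketch of the arithmetic, assuming a single device and one optimizer step per `gradient_accumulation_steps` batches.

```python
import math

n_train, n_val = 3069, 1559          # rows in the train and validation sets
train_batch, eval_batch = 8, 8       # per-device batch sizes
grad_accum, epochs = 4, 3
warmup_steps = 32

# Examples contributing to one optimizer update.
effective_batch = train_batch * grad_accum

# Batches per epoch; the last batch may be smaller than the others.
train_batches = math.ceil(n_train / train_batch)
steps_per_epoch = train_batches // grad_accum
total_steps = steps_per_epoch * epochs

eval_batches = math.ceil(n_val / eval_batch)
warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps, eval_batches)
print(f"warmup covers {warmup_fraction:.1%} of training")
```

The resulting 288 optimizer steps match `global_step=288` in the output, with an effective batch size of 32 and warmup over roughly the first ninth of training.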

## 6.0 Summary

After three epochs of QLoRA fine-tuning, the PolyLM 1.7B model achieves an accuracy of 83.4% on the validation set.
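
When comparing this result with the other models, it helps to keep the size of the validation set in mind. The sketch below converts the logged accuracies back to counts of correct predictions and attaches an approximate 95% interval to the final accuracy. The normal-approximation (Wald) interval is an assumption chosen for simplicity, not something computed in this notebook.

```python
import math

n_val = 1559
# eval_accuracy values logged after epochs 1, 2 and 3.
accuracies = [0.7748556767158435, 0.8293778062860808, 0.8338678640153945]

# Recover the number of correctly classified validation examples per epoch.
correct = [round(acc * n_val) for acc in accuracies]

# Normal-approximation standard error of the final accuracy.
p = accuracies[-1]
std_err = math.sqrt(p * (1 - p) / n_val)
low, high = p - 1.96 * std_err, p + 1.96 * std_err

print(correct)
print(f"final accuracy: {p:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

With 1,559 validation examples the interval spans roughly plus or minus two percentage points, so differences of that size between models should be read with caution.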