
# Text Classification

We will use the [distilled version of the BERT base model](https://huggingface.co/distilbert-base-uncased) on a [dataset with news articles](https://huggingface.co/datasets/ag_news) from HuggingFace.

The dataset consists of 120,000 training and 7,600 test samples, divided into 4 classes: `World` (0), `Sports` (1), `Business` (2), and `Sci/Tech` (3).

In [2]:
!pip install -qq transformers[torch] datasets

In [3]:
DATASET = 'ag_news'
NUM_LABELS = 4
MODEL = 'distilbert-base-uncased'

Load the dataset with news articles:

In [4]:
from datasets import load_dataset

dataset = load_dataset(DATASET)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Check the format of one sample from our dataset:

In [5]:
dataset['train'][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}

Check whether our dataset is balanced (get the number of samples from each class):

In [6]:
import numpy as np

def check_class_balance(class_labels):
  values, counts = np.unique(class_labels, return_counts=True)
  return values, counts

check_class_balance(dataset['train']['label'])

(array([0, 1, 2, 3]), array([30000, 30000, 30000, 30000]))

Load the tokenizer and have a look at its special tokens:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

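The vocabulary consists of 30,522 tokens (WordPiece sub-word units, not whole words).

Special tokens:

- PAD ... padding token; inputs shorter than the target length are filled with PAD so that all sequences in a batch have the same length.
- UNK ... replaces any token that is not in the vocabulary.
- CLS ... added at the start of every input; its final hidden state serves as the sequence representation used for classification.
- SEP ... marks the end of a sequence (and separates the two sequences when a pair is given).
- MASK ... replaces the tokens that the model has to predict during masked-language-model pretraining.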

*What do these tokens mean?*
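
To make the roles of `[CLS]`, `[SEP]`, and `[PAD]` concrete, here is a minimal pure-Python sketch of what happens to a list of token IDs before it reaches the model. The helper `add_special_tokens_and_pad` is hypothetical (it is not part of `transformers`, where the tokenizer does this work); the IDs 101, 102, and 0 are taken from the tokenizer printout above, and the three example IDs are the first tokens of the first training sample.

```python
# Hypothetical helper, for illustration only; the real work is done by the tokenizer.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs from the tokenizer printout above

def add_special_tokens_and_pad(token_ids, max_length):
    """Wrap token IDs in [CLS] ... [SEP], then truncate or pad to max_length."""
    # leave room for the two special tokens when truncating
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# IDs of 'wall', 'st', '.' (the first three tokens of the first sample)
encoded = add_special_tokens_and_pad([2813, 2358, 1012], max_length=8)
print(encoded["input_ids"])       # [101, 2813, 2358, 1012, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions.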

Check what exactly the tokenizer returns (when applied to one sample):

In [8]:
first_sample_text = dataset['train'][0]['text']
first_sample_text

"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."

In [9]:
# TODO
# hint: use tokenizer.tokenize(), tokenizer.convert_tokens_to_ids(), tokenizer.decode()

In [10]:
# .tokenize() splits raw text into (sub-word) tokens; it does not add special tokens
token = tokenizer.tokenize(first_sample_text)
np.array(token)

array(['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black',
       '(', 'reuters', ')', 'reuters', '-', 'short', '-', 'sellers', ',',
       'wall', 'street', "'", 's', 'd', '##wind', '##ling', '\\', 'band',
       'of', 'ultra', '-', 'cy', '##nic', '##s', ',', 'are', 'seeing',
       'green', 'again', '.'], dtype='<U7')
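
The `##` pieces above come from WordPiece, which splits a word that is not in the vocabulary into the longest known prefix followed by continuation pieces. The following is a rough sketch of that greedy longest-match-first idea, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are made up for illustration, and the real tokenizer also lowercases the text and splits on whitespace and punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # no piece matches: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# toy vocabulary (an assumption; the real one has 30,522 entries)
toy_vocab = {"wall", "d", "##wind", "##ling", "cy", "##nic", "##s"}
print(wordpiece_tokenize("dwindling", toy_vocab))  # ['d', '##wind', '##ling']
print(wordpiece_tokenize("cynics", toy_vocab))     # ['cy', '##nic', '##s']
```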

In [11]:
# map each token to its integer index in the vocabulary
ids = tokenizer.convert_tokens_to_ids(token)
np.array(ids)

array([ 2813,  2358,  1012,  6468, 15020,  2067,  2046,  1996,  2304,
        1006, 26665,  1007, 26665,  1011,  2460,  1011, 19041,  1010,
        2813,  2395,  1005,  1055,  1040, 11101,  2989,  1032,  2316,
        1997, 11087,  1011, 22330,  8713,  2015,  1010,  2024,  3773,
        2665,  2153,  1012])

In [12]:
print(tokenizer.decode(ids))
print(first_sample_text)

wall st. bears claw back into the black ( reuters ) reuters - short - sellers, wall street's dwindling \ band of ultra - cynics, are seeing green again.
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


Compare it to what is returned when we use the `preprocess_function`:

In [13]:
def preprocess_function(examples):
  # https://huggingface.co/docs/transformers/pad_truncation
  # truncation=True cuts inputs that exceed the model's maximum length (512 tokens);
  # padding='max_length' fills shorter inputs with [PAD] up to that length
  return tokenizer(examples['text'], truncation=True, padding='max_length', return_tensors='pt')

first_sample_tokenized = preprocess_function(dataset['train'][0])
first_sample_tokenized
# the result now contains special tokens: 101 ([CLS]) at the start, 102 ([SEP]) at the end, then 0 ([PAD]) as padding

{'input_ids': tensor([[  101,  2813,  2358,  1012,  6468, 15020,  2067,  2046,  1996,  2304,
          1006, 26665,  1007, 26665,  1011,  2460,  1011, 19041,  1010,  2813,
          2395,  1005,  1055,  1040, 11101,  2989,  1032,  2316,  1997, 11087,
          1011, 22330,  8713,  2015,  1010,  2024,  3773,  2665,  2153,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

Preprocess more samples from our dataset at once:

In [14]:
# training on the whole dataset would take more than 5 hours :(
# train_dataset = dataset['train'].map(preprocess_function, batched=True)
# test_dataset = dataset['test'].map(preprocess_function, batched=True)

train_dataset = dataset['train'].shuffle(seed=42).select(range(2500)).map(preprocess_function, batched=True)
test_dataset = dataset['test'].shuffle(seed=42).select(range(500)).map(preprocess_function, batched=True)

In [15]:
check_class_balance(train_dataset['label'])

(array([0, 1, 2, 3]), array([625, 640, 576, 659]))

In [16]:
check_class_balance(test_dataset['label'])

(array([0, 1, 2, 3]), array([120, 121, 134, 125]))
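
The random subsamples are only roughly balanced (for example 576 to 659 training samples per class). If exact balance matters, one option is to draw the same number of indices from every class. The sketch below is a pure-Python illustration; `stratified_indices` is a hypothetical helper and the toy labels stand in for `dataset['train']['label']`.

```python
import random
from collections import defaultdict

def stratified_indices(labels, per_class, seed=42):
    """Return shuffled indices with exactly `per_class` samples of every label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        # sample without replacement within each class
        chosen.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(chosen)
    return chosen

# toy labels: 10 samples of each of the 4 classes
toy_labels = [0, 1, 2, 3] * 10
indices = stratified_indices(toy_labels, per_class=5)
print(len(indices))  # 20
```

The resulting index list could then be passed to `Dataset.select` in place of `range(2500)`.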

Load the model:

In [17]:
from transformers import AutoModelForSequenceClassification

id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
label2id = {'World': 0, 'Sports': 1, 'Business': 2, 'Sci/Tech': 3}

model = AutoModelForSequenceClassification.from_pretrained(MODEL,
                                                           num_labels=NUM_LABELS,
                                                           id2label=id2label,
                                                           label2id=label2id)
model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

Define evaluation metrics and train our model:

In [18]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score
import numpy as np

def compute_metrics(p):
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(logits, axis=1)
    return {'accuracy': accuracy_score(p.label_ids, preds)}

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
    weight_decay=0.0
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.468629        | 0.856    |
| 2     | No log        | 0.367763        | 0.886    |


TrainOutput(global_step=314, training_loss=0.3698929707715466, metrics={'train_runtime': 245.4667, 'train_samples_per_second': 20.369, 'train_steps_per_second': 1.279, 'total_flos': 662360616960000.0, 'train_loss': 0.3698929707715466, 'epoch': 2.0})
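
The `global_step=314` in the output follows directly from the sample size, batch size, and number of epochs. The small check below assumes a single device and no gradient accumulation; `total_optimizer_steps` is a helper written for this illustration. It also suggests why the training loss column shows "No log": assuming the default `logging_steps=500`, a run of 314 steps ends before the first training-loss entry would be written.

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    # the last batch of an epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

print(total_optimizer_steps(2500, 16, 2))  # 314
```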

Use the trained model to get a prediction for a sentence of your choice using `pipeline`:

https://huggingface.co/docs/transformers/main_classes/pipelines


In [19]:
from transformers import TextClassificationPipeline

# device=0 runs the pipeline on the first GPU (this assumes a CUDA device is available)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=0)
pipe

<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7b1cd0624610>

In [20]:
pipe('Hamas has yet to produce evidence linking Israel to last week’s hospital strike and says it cannot find the munition.')

[{'label': 'World', 'score': 0.9889179468154907}]

In [21]:
pipe('Lacrosse Is Coming to the Olympics. Will Its Inventors Be There?')

[{'label': 'Sports', 'score': 0.5009512305259705}]

In [22]:
pipe('The Glowing Secret That Mammals Have Been Hiding')

[{'label': 'Sci/Tech', 'score': 0.9015260338783264}]

What happens when we try to predict the label of a sentence that actually belongs to a class that wasn't in our data?

Is it correct behaviour?

In [23]:
pipe('What’s on TV This Week: ‘Fellow Travelers’ and ‘Winter House’')

[{'label': 'Sci/Tech', 'score': 0.4003400504589081}]
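
The classifier ends in a softmax over the four known classes, so the scores always sum to 1 and the pipeline always returns one of the four labels, even for text that fits none of them. A low top score (about 0.40 here) is the only hint that something is off. A minimal sketch of one possible workaround, with made-up logits and a hypothetical `predict_with_threshold` helper that returns `None` when the model is not confident enough:

```python
import math

LABELS = ['World', 'Sports', 'Business', 'Sci/Tech']

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, threshold=0.6):
    """Return (label, score), or (None, score) if the top score is below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return LABELS[best], probs[best]

confident = [4.0, 0.1, 0.2, 0.3]  # made-up logits with one clear winner
uncertain = [0.2, 0.1, 0.3, 0.9]  # made-up logits without a clear winner
print(predict_with_threshold(confident))
print(predict_with_threshold(uncertain))
```

The threshold of 0.6 is arbitrary; a sensible value would have to be chosen on held-out data.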

# Homework 6
Increase the performance of the model

How can we improve the performance of our model?

- bigger sample
- tuning hyperparameters
  - learning rate
  - dropout
  - warmup
  - weight decay
- data quality
- we can also increase the number of epochs
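
Of the hyperparameters listed, warmup is perhaps the least self-explanatory: the learning rate is ramped up from 0 to its peak over the first steps and only then decayed. Below is a sketch of a linear warmup followed by a linear decay. `lr_at_step` is a hypothetical helper written for illustration, not a `transformers` function; in `TrainingArguments` the warmup length is set with `warmup_steps` or `warmup_ratio`.

```python
def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        # warmup phase: grow proportionally to the step count
        return peak_lr * step / warmup_steps
    # decay phase: shrink proportionally to the steps that remain
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# 314 total steps as in the first training run, with 30 warmup steps (an arbitrary choice)
for step in (0, 15, 30, 172, 314):
    print(step, lr_at_step(step, total_steps=314, warmup_steps=30, peak_lr=5e-5))
```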

In [24]:
# First, use a smaller training sample to speed up the hyperparameter search
train_dataset = dataset['train'].shuffle(seed=42).select(range(500)).map(preprocess_function, batched=True)
# test_dataset is reused from above (500 shuffled test samples)

In [30]:
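# Baseline run with the initial hyperparameters.
# Note: as written, `model` is the object already fine-tuned above, so this run
# continues training it rather than starting from the pretrained checkpoint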
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
    weight_decay=0.0
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.737536        | 0.854    |
| 2     | No log        | 0.598601        | 0.886    |


TrainOutput(global_step=64, training_loss=0.03392596170306206, metrics={'train_runtime': 62.7963, 'train_samples_per_second': 15.924, 'train_steps_per_second': 1.019, 'total_flos': 132472123392000.0, 'train_loss': 0.03392596170306206, 'epoch': 2.0})

In [31]:
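# Access the validation loss of the last epoch: with no training-loss entries logged,
# log_history[0] and log_history[1] hold the evaluation results of epochs 1 and 2
trainer.state.log_history[1]['eval_loss']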

0.5986013412475586

In [32]:
# Objective function for Bayesian optimization.
# We search for the best
# - learning rate (lr)
# - weight decay (wd)

def training_Bayes(lr, wd):
  training_args = TrainingArguments(
      output_dir='./results',
      num_train_epochs=2,
      per_device_train_batch_size=16,
      evaluation_strategy='epoch',
      learning_rate=lr,
      weight_decay=wd
  )

  # NOTE: `model` is the global model object, so as written every call continues
  # training the same weights instead of starting from the pretrained checkpoint
  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=test_dataset,
      compute_metrics=compute_metrics
  )
  # Train the model
  trainer.train()
  # bayes_opt maximizes its target, so return the negative validation loss of the last epoch
  return -trainer.state.log_history[1]['eval_loss']
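
Because `training_Bayes` trains the global `model`, the value it returns depends on all earlier calls and not only on `lr` and `wd`. The toy example below illustrates the effect with a one-parameter "model" instead of DistilBERT; every name in it is made up for the illustration. In the toy case the shared model keeps improving, whereas a real model may overfit, but in both cases the trials are no longer comparable. With `transformers`, one way to get a fresh model per trial would be to reload it with `from_pretrained` inside the objective (or to pass a `model_init` callable to `Trainer`); this was not done in the run recorded here.

```python
import copy

PRETRAINED = {"w": 0.0}  # stands in for the pretrained checkpoint

def fine_tune(weights, lr, steps=20, target=3.0):
    """Gradient descent on the loss (w - target)^2; updates weights in place."""
    for _ in range(steps):
        grad = 2 * (weights["w"] - target)
        weights["w"] -= lr * grad
    return (weights["w"] - target) ** 2  # final loss

shared_weights = copy.deepcopy(PRETRAINED)

def objective_shared(lr):
    # every call continues from where the previous call stopped
    return -fine_tune(shared_weights, lr)

def objective_fresh(lr):
    # every call starts again from the "pretrained" weights
    return -fine_tune(copy.deepcopy(PRETRAINED), lr)

print(objective_shared(0.01), objective_shared(0.01))  # two different values
print(objective_fresh(0.01), objective_fresh(0.01))    # the same value twice
```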

In [33]:
pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading bayesian_optimization-1.4.3-py3-none-any.whl (18 kB)
Collecting colorama>=0.4.6 (from bayesian-optimization)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, bayesian-optimization
Successfully installed bayesian-optimization-1.4.3 colorama-0.4.6


In [34]:
from bayes_opt import BayesianOptimization

In [35]:
# intervals in which to search for the optimal hyperparameters
bds = {"lr": [1e-6, 1e-4],
       "wd": [0.0, 0.5]}

In [38]:
# Create a BayesianOptimization optimizer and optimize the function

optimizer = BayesianOptimization(f = training_Bayes,
                                 pbounds = bds,
                                 random_state = 7,
                                 verbose = 2)

In [39]:
optimizer.maximize(init_points=2,
                   n_iter=3,
                   )

|   iter    |  target   |    lr     |    wd     |
-------------------------------------------------

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.974152        | 0.89     |
| 2     | No log        | 0.975612        | 0.888    |

| 1         | -0.9756   | 8.555e-06 | 0.39      |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.204932        | 0.89     |
| 2     | No log        | 1.252051        | 0.89     |

| 2         | -1.252    | 4.44e-05  | 0.3617    |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.279594        | 0.894    |
| 2     | No log        | 1.297765        | 0.892    |

| 3         | -1.298    | 1.393e-05 | 0.3899    |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.370428        | 0.894    |
| 2     | No log        | 1.395409        | 0.892    |

| 4         | -1.395    | 2.773e-05 | 0.39      |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.525171        | 0.894    |
| 2     | No log        | 1.548767        | 0.894    |

| 5         | -1.549    | 8.481e-05 | 0.4139    |

In [41]:
# best hyperparameters found (highest target = lowest validation loss)
optimizer.max

{'target': -0.9756120443344116,
 'params': {'lr': 8.55452064802176e-06, 'wd': 0.3899593961200573}}
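
`optimizer.max` simply reports the trial with the largest target, which here means the smallest validation loss. The snippet below reproduces that selection in plain Python from the five rows of the optimizer output (values copied as printed, so rounded). It also checks that the target drops with every trial, which would be consistent with each trial continuing from the previous weights; with only five trials this is an observation, not proof.

```python
# the five trials, copied from the optimizer output above (rounded as printed)
trials = [
    {"target": -0.9756, "params": {"lr": 8.555e-06, "wd": 0.39}},
    {"target": -1.252,  "params": {"lr": 4.44e-05,  "wd": 0.3617}},
    {"target": -1.298,  "params": {"lr": 1.393e-05, "wd": 0.3899}},
    {"target": -1.395,  "params": {"lr": 2.773e-05, "wd": 0.39}},
    {"target": -1.549,  "params": {"lr": 8.481e-05, "wd": 0.4139}},
]

# the best trial is the one with the largest target (smallest validation loss)
best = max(trials, key=lambda trial: trial["target"])
print(best["params"])

# does the target get worse with every trial, regardless of lr and wd?
targets = [trial["target"] for trial in trials]
is_monotone = all(a > b for a, b in zip(targets, targets[1:]))
print(is_monotone)  # True
```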

In [43]:
# Now use a bigger sample to train the model
train_dataset = dataset['train'].shuffle(seed=42).select(range(5000)).map(preprocess_function, batched=True)

In [44]:
# also increase the number of epochs (from 2 to 3) and use the best hyperparameters found above
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=optimizer.max['params']['lr'],
    weight_decay=optimizer.max['params']['wd']
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.416768        | 0.894    |
| 2     | 0.369900      | 0.358125        | 0.89     |
| 3     | 0.369900      | 0.37964         | 0.908    |


TrainOutput(global_step=939, training_loss=0.28237616850799135, metrics={'train_runtime': 710.5982, 'train_samples_per_second': 21.109, 'train_steps_per_second': 1.321, 'total_flos': 1987081850880000.0, 'train_loss': 0.28237616850799135, 'epoch': 3.0})

The best model reaches an accuracy of $90.8\ \%$ on the 500-sample test subset. It would also be possible to increase the number of epochs further or to try a different batch size.