# Fine-tuning BERT on a classification task
First, we use the load_dataset function to download and cache the dataset:

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("ag_news")

Found cached dataset ag_news (/Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 221.24it/s]


In [3]:
print(type(raw_datasets))
print(raw_datasets.keys())
print(raw_datasets['train'].features)
print(raw_datasets['train'][0])
print(raw_datasets['train'][0]['text'])
print(raw_datasets['train'][0]['label'])

<class 'datasets.dataset_dict.DatasetDict'>
dict_keys(['train', 'test'])
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


To preprocess our data, we will need a tokenizer.   

Some background on BERT:   

BERT, or Bidirectional Encoder Representations from Transformers, is a transformer-based machine learning technique for natural language processing (NLP) developed by Google. Two commonly used pre-trained BERT checkpoints are `bert-base-uncased` and `bert-base-cased`.

The difference between these two versions lies in how they handle case distinctions in the text:   

- `bert-base-uncased` is a version of BERT that converts all text to lowercase before processing it. So, for example, the words "Hello" and "hello" would be treated as the same token.

- `bert-base-cased`, on the other hand, retains the original case of the text when processing it, meaning it treats "Hello" and "hello" as different tokens.

Choosing between the two versions depends on the requirements of the specific task at hand. If the task is case-sensitive (for example, Named Entity Recognition, where "US" as a country and "us" as a pronoun need to be differentiated), `bert-base-cased` is more suitable. On the other hand, if case distinctions are not important, `bert-base-uncased` maps capitalized and lowercase forms of a word to the same tokens, so the model does not have to learn a separate representation for each variant.

In this project it's not necessarily important to differentiate based on case. For example, the words "economy" and "Economy" likely carry the same meaning whether they're capitalized or not. Therefore, we will use the `bert-base-uncased` model so that case variants share the same tokens, which might lead to slightly better performance.
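To make the difference concrete, here is a minimal sketch of the kind of normalization an uncased tokenizer applies before splitting text into tokens (lowercasing and accent stripping). The function names are hypothetical and this is not the tokenizer's actual code; it only illustrates why "US" and "us" become indistinguishable in the uncased setting.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase and strip accents, roughly what an uncased tokenizer does first."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (Unicode category "Mn"), e.g. the accent in "é".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_cased(text: str) -> str:
    """Leave the text untouched, as a cased tokenizer would."""
    return text


for word in ["US", "us", "Hello", "hello"]:
    print(word, "->", normalize_uncased(word), "|", normalize_cased(word))
```

After this step the uncased variant can no longer tell "US" from "us", which is harmless for topic classification but can hurt tasks such as Named Entity Recognition.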

However, it might also be worth trying the bert-base-cased model and comparing the results. In some cases, preserving the case information might provide additional context that can help the model make more accurate predictions. For example, capitalized words at the beginning of a sentence or proper nouns might carry more significance. We will try bert-base-cased if time allows. 



In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Trying the tokenizer:

Prepare the text inputs for the model:

In [5]:
sent = raw_datasets['train'][0]['text']
print(sent)
tokens = tokenizer.tokenize(sent)
print(tokens)

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'd', '##wind', '##ling', '\\', 'band', 'of', 'ultra', '-', 'cy', '##nic', '##s', ',', 'are', 'seeing', 'green', 'again', '.']
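The `##` pieces in the output come from WordPiece, which splits a word that is not in the vocabulary into the longest known sub-words, left to right. The sketch below reproduces that greedy matching on a tiny, hypothetical vocabulary (`TOY_VOCAB` is invented for illustration; the real vocabulary has tens of thousands of entries), so it is an approximation of the idea rather than the library's implementation.

```python
# Hypothetical toy vocabulary, chosen so the examples from the output above work.
TOY_VOCAB = {"d", "##wind", "##ling", "cy", "##nic", "##s", "band"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_tokenize("dwindling", TOY_VOCAB))
print(wordpiece_tokenize("cynics", TOY_VOCAB))
```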


In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Loading cached processed dataset at /Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-1000afd0b7cb4a31.arrow
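The `padding="max_length"` and `truncation=True` arguments in `tokenize_function` make every example the same length. A simplified sketch of that behavior is shown below; `pad_or_truncate` is a hypothetical helper, the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`, and the real tokenizer handles more cases (sentence pairs, token type ids, and so on).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Return fixed-length input_ids and the matching attention_mask."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest if too long.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }


short = pad_or_truncate([7, 8, 9], max_length=8)
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short)
print(long)
```

Because no `max_length` is passed in the notebook, the tokenizer should fall back to the model's maximum length (512 for this checkpoint), so every headline is padded to 512 tokens. Padding dynamically per batch would likely reduce the CPU time considerably.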
                                                                 

We generate small subsets of the training and evaluation sets (the latter taken from the test split) to try everything out faster. We will use the full dataset once the training process is established.

In [7]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

Loading cached shuffled indices for dataset at /Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-90d942435e5454c5.arrow
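Shuffling with a fixed seed before selecting the first 1000 rows gives a reproducible random sample instead of the first 1000 rows in file order. The sketch below mimics that idea with the standard library and checks the label balance of the sample. The labels here are synthetic and `shuffled_subset` is a hypothetical helper; the indices it picks will not match those chosen by `datasets`.

```python
import random
from collections import Counter


def shuffled_subset(n_total, n_keep, seed=42):
    """Return n_keep distinct indices drawn by shuffling range(n_total)."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # local RNG, so the result is reproducible
    return indices[:n_keep]


# Synthetic, perfectly balanced labels over 4 classes (an assumption for the demo).
labels = [i % 4 for i in range(120_000)]

subset = shuffled_subset(len(labels), 1000)
label_counts = Counter(labels[i] for i in subset)
print(sorted(label_counts.items()))
```

A quick count like this is a cheap way to confirm that a small sample has not ended up dominated by one class.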


First, we define the model:    

We use the AutoModelForSequenceClassification model from the Hugging Face Transformers library.
The num_labels argument specifies the number of output labels for the classification task. Here we have 4 labels: 'World', 'Sports', 'Business', and 'Sci/Tech'.

In [8]:
print(raw_datasets['train'].features)

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}


In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

The warning message is shown "because we are throwing away the pretraining head of the BERT model to replace it with a classification head which is randomly initialized. We will fine-tune this model on our task, transferring the knowledge of the pretrained model to it (which is why doing this is called transfer learning)." [From the Hugging Face website]
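One practical consequence of a randomly initialized head: before any fine-tuning, its predictions should be close to uniform over the 4 classes, so accuracy should sit near 25% and the cross-entropy loss near ln(4) ≈ 1.386. The sketch below works through that arithmetic with an idealized, perfectly uninformed set of logits; it is a sanity-check reference, not output from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(softmax(logits)[label])


# Idealized "knows nothing" prediction: identical logits for all 4 classes.
uninformed_logits = [0.0, 0.0, 0.0, 0.0]
print(softmax(uninformed_logits))
print(cross_entropy(uninformed_logits, label=2), math.log(4))
```

An evaluation loss well below this value means the model has already learned something about the task.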

We define our Trainer using the TrainingArguments class in the Hugging Face Transformers library.    
The TrainingArguments class doesn't actually do any training itself. Instead, it defines the parameters that are used by the Trainer class, which handles the training loop. The TrainingArguments object is passed to the Trainer when initializing it.   
TrainingArguments("test_trainer") creates a new TrainingArguments object with the output directory set to "test_trainer". The output directory is where any model checkpoints and training progress files will be saved.  

In [22]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")

In [23]:
print(training_args.__dict__)

Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
, '_n_gpu': 0, '__cached__setup_devices': device(type='cpu'), 'deepspeed_plugin': None}


Note: The first time I ran the code above, I got a CUDA-related error and the following error:  
ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`   
The error was resolved after updating the packages. At first, I attempted to set the `fp16` parameter to `False` to avoid using CUDA, with the following code:


In [1]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_trainer",
    fp16=False # mixed precision (FP16) training is turned off
)
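Since the fix turned out to be a package update, a quick version check before building the `Trainer` can save a restart. The sketch below uses only the standard library; the helper names are hypothetical, and the version parsing is deliberately simplistic (it only looks at the leading numeric parts), so treat it as a rough check rather than a full version comparison.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def parse_version(text):
    """Turn '0.20.1' or '0.21.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(package, minimum):
    """True if the package is installed at or above the minimum version."""
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)


# 0.20.1 is the minimum named in the ImportError above.
print("accelerate:", installed_version("accelerate"), meets_minimum("accelerate", "0.20.1"))
```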




In [18]:
from transformers import Trainer

#trainer = Trainer(
#   model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
#)
#trainer.train()

In [20]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script: 4.21kB [00:00, 2.08MB/s]                   
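What `compute_metrics` does is small enough to write by hand: take the highest-scoring class per row and compare it with the reference label. The dependency-free sketch below does exactly that on a few made-up logits. This may be useful because `load_metric` was deprecated in later `datasets` releases in favor of the separate `evaluate` library; the function here is an illustration, not a drop-in replacement for the `Trainer` callback, which receives NumPy arrays.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose top-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for three examples over the 4 AG News classes.
example_logits = [
    [0.1, 0.2, 3.0, 0.0],  # predicts class 2
    [2.0, 0.1, 0.0, 0.3],  # predicts class 0
    [0.0, 0.5, 0.2, 0.1],  # predicts class 1
]
example_labels = [2, 0, 3]
print(accuracy(example_logits, example_labels))
```

The `Trainer` prefixes the returned key with `eval_`, which is why the result appears as `eval_accuracy` in the evaluation output.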


In [24]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)


In [25]:
trainer.evaluate()


100%|██████████| 125/125 [11:41<00:00,  5.61s/it]


{'eval_loss': 0.43526506423950195,
 'eval_accuracy': 0.864,
 'eval_runtime': 705.3773,
 'eval_samples_per_second': 1.418,
 'eval_steps_per_second': 0.177}

{'eval_loss': 0.4905005991458893, 'eval_accuracy': 0.866, 'eval_runtime': 538.9095, 'eval_samples_per_second': 1.856, 'eval_steps_per_second': 0.232, 'epoch': 2.0}

{'train_runtime': 7494.3184, 'train_samples_per_second': 0.4, 'train_steps_per_second': 0.05, 'train_loss': 0.35687178548177084, 'epoch': 3.0}
{'eval_loss': 0.45145148038864136, 'eval_accuracy': 0.885, 'eval_runtime': 685.9512, 'eval_samples_per_second': 1.458, 'eval_steps_per_second': 0.182, 'epoch': 3.0}
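The numbers in these logs can be cross-checked with a little arithmetic. The sketch below assumes the `Trainer` defaults of a batch size of 8 and 3 epochs, since the `TrainingArguments` above do not override them; the `125/125` progress bar and the `'epoch': 3.0` entries are consistent with that assumption.

```python
import math

n_examples = 1000  # size of small_train_dataset and small_eval_dataset
batch_size = 8     # assumed default per-device batch size
num_epochs = 3     # assumed default number of epochs

steps_per_epoch = math.ceil(n_examples / batch_size)
total_train_steps = steps_per_epoch * num_epochs

# Reconstruct the step count from the logged runtime and throughput.
logged_train_steps = 7494.3184 * 0.05

# Reconstruct the first evaluation's throughput from its runtime.
eval_samples_per_second = round(n_examples / 705.3773, 3)
eval_steps_per_second = round(steps_per_epoch / 705.3773, 3)

print(steps_per_epoch, total_train_steps, logged_train_steps)
print(eval_samples_per_second, eval_steps_per_second)
```

At roughly 0.4 training samples per second on CPU, the full training set would take far longer than this 1000-example run, which is worth keeping in mind before launching the full job.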


**The model reaches an accuracy of 0.885 on the small evaluation subset after 3 epochs. Now we can move forward and train it on the full dataset. The script can be found at src/models/train.py.**