# Fine-tuning a Hugging Face pretrained model

__Objective:__ fine-tune a pretrained Hugging Face (HF) model by following, step by step, the [guide on training](https://huggingface.co/docs/transformers/v4.48.0/en/training).

In [1]:
from datasets import load_dataset


In [6]:
dataset = load_dataset(
    'yelp_review_full',
    cache_dir='/data1/shared_datasets/'
)

dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [8]:
from transformers import AutoTokenizer

In [9]:
tokenizer = AutoTokenizer.from_pretrained(
    'google-bert/bert-base-cased',
    cache_dir='/data1/shared_models/'
)

In [11]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
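
The effect of `padding='max_length'` together with `truncation=True` can be illustrated with a toy sketch in plain Python. This is not the tokenizer's actual implementation: the helper `pad_or_truncate`, the `pad_id=0` default and the token ids are hypothetical placeholders, and a real tokenizer also preserves special tokens when truncating, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding='max_length' with truncation=True (hypothetical helper)."""
    # Truncation: keep at most max_length tokens.
    kept = token_ids[:max_length]
    # Padding: fill the remainder with pad_id so every row has the same length.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids only; they are not taken from the dataset above.
short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is why the tokenized dataset gains fixed-size `input_ids` and `attention_mask` columns.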

In [15]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 650000/650000 [02:40<00:00, 4057.80 examples/s]
Map: 100%|██████████| 50000/50000 [00:12<00:00, 4082.36 examples/s]


In [16]:
small_train_dataset = tokenized_dataset['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_dataset['test'].shuffle(seed=42).select(range(1000))
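
The point of `shuffle(seed=42).select(range(1000))` is a random but reproducible subset. A minimal sketch of that idea using only the standard library is shown here. The helper `shuffled_subset` is hypothetical, and it does not reproduce the permutation that `datasets` generates, so the selected indices will differ from those of the cell above.

```python
import random


def shuffled_subset(n_rows, k, seed):
    """Return k row indices drawn from a seeded shuffle of range(n_rows) (hypothetical helper)."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the result independent of global random state.
    random.Random(seed).shuffle(indices)
    # Taking the first k indices of the permutation mirrors shuffle-then-select.
    return indices[:k]


first = shuffled_subset(50_000, 1_000, seed=42)
second = shuffled_subset(50_000, 1_000, seed=42)
print(first == second)         # same seed, same subset
print(len(set(first)))         # indices are distinct
```

Fixing the seed means repeated runs train and evaluate on the same 1000 examples, which keeps results comparable across experiments.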

In [19]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

In [20]:
from transformers import AutoModelForSequenceClassification

In [23]:
model = AutoModelForSequenceClassification.from_pretrained(
    'google-bert/bert-base-cased',
    cache_dir='/data1/shared_models/',
    num_labels=5,
    torch_dtype='auto'
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
from transformers import TrainingArguments

In [51]:
training_args = TrainingArguments(
    output_dir='/data1/moscato/personalised-hate-boundaries-data/models/hf_fine_tuning_test',
    eval_strategy='epoch'
)

In [30]:
import numpy as np
import evaluate

In [29]:
metric = evaluate.load('accuracy')

In [49]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred

    predictions = np.argmax(logits, axis=-1)

    return metric.compute(predictions=predictions, references=labels)
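
To see what `compute_metrics` does with the model output, the following sketch applies the same `np.argmax` step to a small, made-up batch and computes accuracy by hand with NumPy instead of `evaluate`. The logits and labels are invented for illustration and assume 5 classes, as in `num_labels=5`.

```python
import numpy as np

# Made-up logits for 4 examples and 5 classes (one row per example).
logits = np.array([
    [0.1, 2.0, 0.3, 0.0, -1.0],
    [1.5, 0.2, 0.1, 0.0, 0.3],
    [0.0, 0.1, 0.2, 0.3, 3.0],
    [0.2, 0.1, 2.5, 0.0, 0.1],
])
labels = np.array([1, 0, 3, 2])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print(predictions)  # [1 0 4 2]
print(accuracy)     # 0.75
```

Three of the four predictions match their labels, hence 0.75. The `accuracy` metric loaded with `evaluate` is expected to report the same fraction, wrapped in a dictionary.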

In [52]:
from transformers import Trainer

In [54]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [55]:
# trainer.train()