# Pre-training and Finetuning

In this section, we go through using a pretrained model and finetuning it. Using ChatGPT is like using a pretrained model: it works out of the box, but it may not have been trained on your task. Finetuning helps the model become better aligned with your dataset.

[Hugging Face](https://huggingface.co/) is a community where researchers upload their pretrained models and datasets. Let's take a look at the model [bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment).

Now let's use the same dataset to compare the pretrained model with a finetuned model.



## Load Data

Because we will be using a pretrained model, and later finetuning it, we need to make sure that our data is in a format the pretrained model can accept.

In [1]:
!pip install datasets


In [2]:
import pandas as pd
from datasets import load_dataset

In [3]:
# select file from computer
from google.colab import files
uploaded = files.upload()

Saving airlines_review.csv to airlines_review.csv


Now let's load the dataset into the Hugging Face `datasets` format. The CSV file is available [here](https://drive.google.com/file/d/1JSsxWlLfbciOLA_26W1EZIf60t0srFQ5/view?usp=drive_link).

In [4]:
dataset = load_dataset("csv", data_files="airlines_review.csv")


Now split the dataset into training and testing sets with an 80/20 ratio.

In [5]:
dataset = dataset['train'].train_test_split(test_size=0.2)

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Title', 'Name', 'Review Date', 'Airline', 'Verified', 'Reviews', 'Type of Traveller', 'Month Flown', 'Route', 'Class', 'Seat Comfort', 'Staff Service', 'Food & Beverages', 'Inflight Entertainment', 'Value For Money', 'Overall Rating', 'Recommended', 'Rating Class'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['Unnamed: 0', 'Title', 'Name', 'Review Date', 'Airline', 'Verified', 'Reviews', 'Type of Traveller', 'Month Flown', 'Route', 'Class', 'Seat Comfort', 'Staff Service', 'Food & Beverages', 'Inflight Entertainment', 'Value For Money', 'Overall Rating', 'Recommended', 'Rating Class'],
        num_rows: 400
    })
})
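As a rough illustration of what the 80/20 split does, the sketch below shuffles row indices with a fixed seed and sets aside 20% of them for testing. It is a plain-Python stand-in, not the `datasets` implementation; the helper name `split_indices` and the seed value are made up for this example. The row count of 2000 comes from the 1600 + 400 rows shown above.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle row indices, then cut off a test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed -> same split every run
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(2000)
print(len(train_idx), len(test_idx))  # 1600 400
```

`train_test_split` also accepts a `seed` argument, which keeps the split identical across runs and makes the later accuracy numbers easier to compare.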

## Pretrained Model

In [7]:
from transformers import pipeline

In [8]:
classifier = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    tokenizer="nlptown/bert-base-multilingual-uncased-sentiment",
    truncation=True,  # cut off reviews that exceed the model's maximum input length
)


Device set to use cuda:0


In [9]:
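def classify_batch(batch):
    # batch["Reviews"] is a list of review texts; the pipeline returns one result per text
    results = classifier(batch["Reviews"])
    batch["predicted_label"] = [r["label"] for r in results]  # e.g. "4 stars"
    batch["score"] = [r["score"] for r in results]            # model confidence for that label
    return batch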
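
To see what a function like `classify_batch` receives when it is applied in batches, here is a minimal sketch that swaps the real pipeline for a hypothetical stand-in, `fake_classifier`, so it runs without downloading a model. The keyword rule inside the stand-in is invented purely for illustration; only the input and output shapes mirror the real setup.

```python
def fake_classifier(texts):
    """Hypothetical stand-in for the pipeline: one {'label', 'score'} dict per text."""
    return [
        {"label": "5 stars" if "great" in t else "1 star", "score": 0.9}
        for t in texts
    ]

def classify_batch_demo(batch, classifier=fake_classifier):
    # A batch is a dict of lists: each column name maps to a list of values.
    results = classifier(batch["Reviews"])
    batch["predicted_label"] = [r["label"] for r in results]
    batch["score"] = [r["score"] for r in results]
    return batch

batch = {"Reviews": ["great flight", "lost my luggage"]}
out = classify_batch_demo(batch)
print(out["predicted_label"])  # ['5 stars', '1 star']
```

The function returns the same dict with two new columns, and each new list has one entry per input review.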

As discussed, pretrained models are usually large, so they take longer to run than the models we covered in class. Therefore, as an example, we only evaluate the first 100 samples of the test set.

In [10]:
sampled_testing = dataset['test'].select(range(100))
classified_dataset = sampled_testing.map(classify_batch,batched=True, batch_size=16)


Now let's examine the result.

In [11]:
classified_dataset['predicted_label'][0:5]

['5 stars', '4 stars', '1 star', '5 stars', '1 star']

The output is in the original model's label format: a number of stars between 1 and 5. It seems to be a reasonable estimator, but it does not fit our 10-point rating scale. Therefore, we need finetuning.
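
The mismatch between the two scales can be made concrete with a small sketch. It parses the star labels into integers and tries a naive doubling onto the 10-point scale; both the helper `stars_to_int` and the doubling rule are assumptions made for illustration, not something the model or the dataset defines.

```python
def stars_to_int(label):
    """Hypothetical helper: '4 stars' -> 4, '1 star' -> 1."""
    return int(label.split()[0])

predicted = ['5 stars', '4 stars', '1 star', '5 stars', '1 star']
stars = [stars_to_int(p) for p in predicted]
rescaled = [2 * s for s in stars]          # naive mapping onto 1-10

print(stars)     # [5, 4, 1, 5, 1]
print(rescaled)  # [10, 8, 2, 10, 2]

# Doubling can only ever produce even ratings, so half the scale is unreachable.
reachable = sorted({2 * s for s in range(1, 6)})
print(reachable)  # [2, 4, 6, 8, 10]
```

Whatever rescaling is chosen, a 5-class model can emit at most 5 distinct values, which is why the classification head has to be replaced and retrained for 10 classes.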

## Finetuning

In [None]:
!pip install evaluate



In [12]:
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import os
#import evaluate



In [13]:
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

In [14]:
def preprocess_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)
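
The effect of `padding="max_length"` and `truncation=True` can be sketched with a toy function that works on lists of token IDs. This is a simplification under stated assumptions: the function name `pad_or_truncate`, the pad ID of 0, and the tiny `max_length` are chosen for the example, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of fixed-length encoding: returns (input_ids, attention_mask)."""
    ids = token_ids[:max_length]                 # truncation: drop tokens past the limit
    mask = [1] * len(ids)                        # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding, ignored by attention

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=5)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Every example ends up with the same length, which is what lets the `Trainer` stack them into batches. The cost is that short reviews are padded all the way to the maximum length, which uses more memory than padding each batch to its longest member.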

In [15]:
def select_columns(example):
    return {
        'text': example['Reviews'],                  # rename 'Reviews' ➔ 'text'
        'label': int(example['Overall Rating']) - 1  # shift rating 1-10 ➔ class ID 0-9
    }

In [16]:
finetune_dataset = dataset.map(select_columns)


In [17]:
tokenized_dataset = finetune_dataset.map(preprocess_function, batched=True) # Tokenize the text into the input format the model expects


In [18]:
# If you run into memory problems, train on a subset instead of the full training set:
# train_dataset = tokenized_dataset['train'].select(range(200))
train_dataset = tokenized_dataset['train']

In [19]:
id2label = {i-1: str(i) for i in range(1, 11)}    # ID 0 → "1", ID 1 → "2", ..., ID 9 → "10"
label2id = {str(i): i-1 for i in range(1, 11)}
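
As a quick sanity check, the sketch below confirms that the two dictionaries are inverses of each other and that they agree with the `- 1` shift applied to `Overall Rating` earlier. The sample rating of 7 is arbitrary.

```python
id2label = {i - 1: str(i) for i in range(1, 11)}
label2id = {str(i): i - 1 for i in range(1, 11)}

rating = 7                      # an arbitrary 'Overall Rating' value
class_id = int(rating) - 1      # the same shift used when building the 'label' column
print(class_id, id2label[class_id])  # 6 7

# Round trip: every class ID maps to a rating string and back to itself.
round_trip_ok = all(label2id[id2label[i]] == i for i in range(10))
print(round_trip_ok)  # True
```

The shift matters because the classification loss expects class IDs from 0 to `num_labels - 1`, while the ratings run from 1 to 10.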

In [20]:
model = AutoModelForSequenceClassification.from_pretrained(
    "nlptown/bert-base-multilingual-uncased-sentiment",
    num_labels=10,  # Override the old head (originally 5 labels)
    ignore_mismatched_sizes=True,
    id2label=id2label,
    label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([5]) in the checkpoint and torch.Size([10]) in the model instantiated
- classifier.weight: found shape torch.Size([5, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
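
The warning above says that only the classification head was re-initialized. The NumPy sketch below mirrors the shapes involved: a 768-dimensional representation is mapped to 10 logits instead of 5. The random values are placeholders rather than real model weights, and the 0.02 standard deviation is an assumed initialization scale used only for illustration.

```python
import numpy as np

hidden_size = 768
rng = np.random.default_rng(0)

# Stand-in for the encoder's pooled output for one review (random, not real features).
pooled = rng.normal(size=(1, hidden_size))

old_head_w = rng.normal(size=(5, hidden_size))                # checkpoint head: 5 star classes
new_head_w = rng.normal(scale=0.02, size=(10, hidden_size))   # fresh head: 10 rating classes
new_head_b = np.zeros(10)

logits = pooled @ new_head_w.T + new_head_b
print(old_head_w.shape, new_head_w.shape, logits.shape)  # (5, 768) (10, 768) (1, 10)
```

Because the new head starts from random values, its predictions are meaningless until it has been trained, which is what the last line of the warning is pointing out. The encoder weights below the head are still loaded from the checkpoint.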


In [21]:
# Training settings
training_args = TrainingArguments(
    output_dir="./results",
    run_name="finetuning",
    report_to="none",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)


In [23]:
os.environ["WANDB_DISABLED"] = "true"

In [22]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

In [24]:
# Train
trainer.train()


TrainOutput(global_step=300, training_loss=1.4130463663736978, metrics={'train_runtime': 210.0864, 'train_samples_per_second': 22.848, 'train_steps_per_second': 1.428, 'total_flos': 1263023780659200.0, 'train_loss': 1.4130463663736978, 'epoch': 3.0})
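
The `global_step=300` in the output can be reproduced with a little arithmetic from the settings above. The sketch assumes a single device and no gradient accumulation, which is not stated in the notebook but is consistent with the reported numbers.

```python
import math

num_train_rows = 1600     # size of the training split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 210.0864  # seconds, copied from the TrainOutput above

steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = num_train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)  # 100 300
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The step count also confirms that the full training split of 1600 rows was used, not a smaller subset.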

Now let's evaluate our finetuned toy model.

In [25]:
test_dataset = tokenized_dataset['test'].select(range(100))

In [26]:
trainer.evaluate(eval_dataset=test_dataset)

{'eval_loss': 1.3571348190307617,
 'eval_runtime': 1.3829,
 'eval_samples_per_second': 72.312,
 'eval_steps_per_second': 9.401,
 'epoch': 3.0}

In [27]:
predictions = trainer.predict(test_dataset)

In [28]:
from sklearn.metrics import accuracy_score

In [29]:
true_labels = predictions.label_ids
predicted_classes = np.argmax(predictions.predictions, axis=1)
accuracy = accuracy_score(true_labels, predicted_classes)

In [30]:
print(f"The accuracy is {accuracy}")

The accuracy is 0.49
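
For readers who want to see what the accuracy computation does, here is a minimal sketch on made-up logits for a 3-class problem. The numbers are invented for the example; only the `argmax` and comparison steps correspond to the cells above.

```python
import numpy as np

# Hypothetical logits for 4 examples and 3 classes (one row per example).
logits = np.array([
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.4, 0.9],
    [0.2, 0.1, 0.3],
])
true_labels = np.array([1, 0, 1, 2])

predicted_classes = np.argmax(logits, axis=1)   # index of the largest logit per row
accuracy = float(np.mean(predicted_classes == true_labels))

print(predicted_classes.tolist())  # [1, 0, 2, 2]
print(accuracy)                    # 0.75
```

In the notebook's setting, adding 1 to each predicted class ID converts it back to a rating on the 1 to 10 scale. Accuracy counts only exact matches, so a prediction of 8 for a true rating of 9 is scored the same as a prediction of 1.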
