# Fine-tuning a BERT model

- ref: https://huggingface.co/docs/transformers/tasks/sequence_classification

- dataset: [Fake-News-Detection-Challenge-KDD-2020](https://huggingface.co/datasets/LittleFish-Coder/Fake-News-Detection-Challenge-KDD-2020)

- pretrained model: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

- BERT architecture walkthrough: [Coursera](https://www.coursera.org/learn/transformer-models-and-bert-model/)

In [15]:
%pip -q install numpy pandas torch transformers datasets evaluate accelerate scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
# import package
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import pipeline
from datasets import load_dataset



In [3]:
# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

device: cuda


## Dataset

This notebook uses the [`Fake-News-Detection-Challenge-KDD-2020`](https://huggingface.co/datasets/LittleFish-Coder/Fake-News-Detection-Challenge-KDD-2020) dataset from the Hugging Face Hub, loaded with the 🤗 Datasets library. The labels are:

- 1: fake news
- 0: real news

In [4]:
# load data
dataset = load_dataset("LittleFish-Coder/Fake-News-Detection-Challenge-KDD-2020", download_mode="reuse_cache_if_exists", cache_dir="dataset")

Downloading readme: 100%|██████████| 1.14k/1.14k [00:00<00:00, 5.50kB/s]
Downloading data: 100%|██████████| 8.25M/8.25M [00:01<00:00, 7.22MB/s]
Downloading data: 100%|██████████| 2.48M/2.48M [00:00<00:00, 3.38MB/s]
Downloading data: 100%|██████████| 1.15M/1.15M [00:00<00:00, 2.13MB/s]
Generating train split: 100%|██████████| 3490/3490 [00:00<00:00, 16005.05 examples/s]
Generating validation split: 100%|██████████| 997/997 [00:00<00:00, 13708.40 examples/s]
Generating test split: 100%|██████████| 499/499 [00:00<00:00, 19268.44 examples/s]


In [5]:
# data
print(f"Dataset: {dataset}")
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
test_dataset = dataset["test"]

Dataset: DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 3490
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 997
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 499
    })
})


In [7]:
# quick look at the data
first_train = train_dataset[0]
print(f"First training sample")
print(f"Keys: {first_train.keys()}")
print(f"Text: {first_train['text']}")
print(f"Label: {first_train['label']}")

First training sample
Keys: dict_keys(['text', 'label', '__index_level_0__'])
Text: UPDATE, WRITETHRU with more detail: Shortly before he was due to appear on ITV’s Good Morning Britain today, Ewan McGregor pulled out of the interview, citing comments made about this weekend’s Women’s March by host Piers Morgan. A supporter of President Donald Trump, Morgan yesterday on the program described some of the women who marched as “rabid feminists” and said he didn’t “see the point of the march(es)” which he called “generic” and “vacuous.”  On Twitter this morning, McGregor, who is out promoting Trainspotting sequel T2: Trainspotting, wrote, “Was going on Good Morning Britain, didn’t realise Piers Morgan was host. Won’t go on with him after his comments about #WomensMarch.”  Was going on Good Morning Britain, didn’t realise @piersmorgan was host. Won’t go on with him after his comments about #WomensMarch — Ewan McGregor (@mcgregor_ewan) January 24, 2017  On his Twitter account (whose timeline

## Zero-shot inference with the pipeline API

In [8]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", device=device)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [9]:
classifier(first_train['text'], candidate_labels=["real", "fake"])

{'sequence': 'UPDATE, WRITETHRU with more detail: Shortly before he was due to appear on ITV’s Good Morning Britain today, Ewan McGregor pulled out of the interview, citing comments made about this weekend’s Women’s March by host Piers Morgan. A supporter of President Donald Trump, Morgan yesterday on the program described some of the women who marched as “rabid feminists” and said he didn’t “see the point of the march(es)” which he called “generic” and “vacuous.”  On Twitter this morning, McGregor, who is out promoting Trainspotting sequel T2: Trainspotting, wrote, “Was going on Good Morning Britain, didn’t realise Piers Morgan was host. Won’t go on with him after his comments about #WomensMarch.”  Was going on Good Morning Britain, didn’t realise @piersmorgan was host. Won’t go on with him after his comments about #WomensMarch — Ewan McGregor (@mcgregor_ewan) January 24, 2017  On his Twitter account (whose timeline photo is of he and Trump), Morgan responded by saying McGregor is “ju

## Preprocess (Tokenize)
The next step is to load a [`DistilBERT`](https://huggingface.co/distilbert/distilbert-base-uncased) tokenizer to preprocess the `text` field:

In [10]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [11]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
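
To make `truncation=True` concrete, here is a minimal sketch of what truncating to a maximum length means for a sequence wrapped in special tokens. It is not the tokenizer's actual implementation: `truncate_ids` is a hypothetical helper, and the IDs 101 (`[CLS]`) and 102 (`[SEP]`) and the 512-token limit are assumptions based on the `distilbert-base-uncased` vocabulary and configuration.

```python
CLS_ID, SEP_ID = 101, 102   # assumed special-token IDs in the uncased BERT vocabulary
MAX_LENGTH = 512            # assumed maximum input length for DistilBERT


def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Wrap token_ids in [CLS] ... [SEP], keeping at most max_length IDs in total."""
    budget = max_length - 2  # reserve two slots for [CLS] and [SEP]
    return [CLS_ID] + list(token_ids[:budget]) + [SEP_ID]


long_text_ids = list(range(1000, 1700))  # 700 made-up word-piece IDs
encoded = truncate_ids(long_text_ids)
print(len(encoded), encoded[0], encoded[-1])
```

Anything beyond the length budget is simply dropped, so for long news articles the model only sees the beginning of the text.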

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function.

You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [12]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 3490/3490 [00:16<00:00, 213.71 examples/s]
Map: 100%|██████████| 997/997 [00:04<00:00, 203.21 examples/s]
Map: 100%|██████████| 499/499 [00:02<00:00, 222.43 examples/s]


After preprocessing, the dataset will contain the original text and the following attributes that DistilBERT uses as input:

- `input_ids`: The token indices in the vocabulary
- `attention_mask`: Which parts of the sequence DistilBERT should pay attention to
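
Sequences in a batch usually differ in length, so they are padded to a common length and the attention mask marks which positions hold real tokens. The sketch below illustrates that relationship in plain Python. `pad_batch` is a hypothetical helper rather than a library function, and the pad ID of 0 is an assumption based on the uncased BERT vocabulary.

```python
PAD_ID = 0  # assumed [PAD] token ID


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build matching attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2129, 2307, 102], [101, 2129, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps batches of short texts small.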

In [13]:
# tokenized
first_tokenized = tokenized_dataset["train"][0]
print(f"First tokenized sample")
print(f"Keys: {first_tokenized.keys()}")
print(f"Input IDs: {first_tokenized['input_ids']}")
print(f"Attention Mask: {first_tokenized['attention_mask']}")
print(f"Length: {len(first_tokenized['input_ids'])}")

First tokenized sample
Keys: dict_keys(['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'])
Input IDs: [101, 10651, 1010, 4339, 2705, 6820, 2007, 2062, 6987, 1024, 3859, 2077, 2002, 2001, 2349, 2000, 3711, 2006, 11858, 1521, 1055, 2204, 2851, 3725, 2651, 1010, 1041, 7447, 23023, 2766, 2041, 1997, 1996, 4357, 1010, 8951, 7928, 2081, 2055, 2023, 5353, 1521, 1055, 2308, 1521, 1055, 2233, 2011, 3677, 16067, 5253, 1012, 1037, 10129, 1997, 2343, 6221, 8398, 1010, 5253, 7483, 2006, 1996, 2565, 2649, 2070, 1997, 1996, 2308, 2040, 9847, 2004, 1523, 10958, 17062, 10469, 2015, 1524, 1998, 2056, 2002, 2134, 1521, 1056, 1523, 2156, 1996, 2391, 1997, 1996, 2233, 1006, 9686, 1007, 1524, 2029, 2002, 2170, 1523, 12391, 1524, 1998, 1523, 12436, 10841, 3560, 1012, 1524, 2006, 10474, 2023, 2851, 1010, 23023, 1010, 2040, 2003, 2041, 7694, 4499, 11008, 3436, 8297, 1056, 2475, 1024, 4499, 11008, 3436, 1010, 2626, 1010, 1523, 2001, 2183, 2006, 2204, 2851, 3725, 1010, 2134, 1521, 1056, 19148,

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) and F1 metrics (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [16]:
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
# precision = evaluate.load("precision")
# recall = evaluate.load("recall")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy and the weighted F1 score:

In [18]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    acc = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="weighted")
    # pre = precision.compute(predictions=predictions, references=labels, average="weighted")
    # rec = recall.compute(predictions=predictions, references=labels, average="weighted")

    results = {"accuracy": acc['accuracy'], "f1": f1_score['f1']}

    return results

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
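
To see what these two numbers measure, the following sketch recomputes accuracy and support-weighted F1 by hand on a handful of made-up logits and labels. It uses only the standard library and is meant as an illustration of the definitions, not as a replacement for the 🤗 Evaluate metrics; the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def f1_for_class(preds, labels, cls):
    """F1 for one class, computed from true positives, false positives and false negatives."""
    tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
    fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
    fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(preds, labels):
    """Average the per-class F1 scores, weighting each class by its share of the true labels."""
    total = len(labels)
    return sum(f1_for_class(preds, labels, c) * labels.count(c) / total for c in set(labels))


# Made-up logits for six examples (columns: real, fake) and their true labels.
logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9], [0.4, 0.1], [0.2, 0.7]]
labels = [0, 1, 1, 1, 0, 1]

preds = [argmax(row) for row in logits]
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
f1_weighted = weighted_f1(preds, labels)
print(preds, round(acc, 4), round(f1_weighted, 4))
```

Weighted averaging matters when the classes are imbalanced, because the larger class then contributes more to the final score.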

## Train (Finetune the model)

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [19]:
id2label = {1: "fake", 0: "real"}
label2id = {"fake": 1, "real": 0}
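
A quick sanity check can catch a mismatch between the two mappings before training starts. The sketch below assumes the same two dictionaries as above and uses made-up predicted IDs.

```python
id2label = {1: "fake", 0: "real"}
label2id = {"fake": 1, "real": 0}

# The two mappings should be exact inverses of each other.
inverted = {label: idx for idx, label in id2label.items()}
assert inverted == label2id

predicted_ids = [1, 0, 0, 1]  # made-up class IDs
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_labels)
```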

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

You're ready to start training your model now! 

Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [20]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate accuracy and F1 on the validation set and save a training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, datasets, tokenizer, and `compute_metrics` function. No data collator is passed explicitly in this notebook; when a tokenizer is supplied, the `Trainer` defaults to `DataCollatorWithPadding`, which pads each batch to its longest sequence.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [23]:
num_epochs = 2
output_dir = "checkpoints"
batch_size = 64

In [24]:
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

In [25]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [26]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.569913 | 0.708124 | 0.668871 |
| 2 | No log | 0.512987 | 0.747242 | 0.742078 |




TrainOutput(global_step=56, training_loss=0.5995587621416364, metrics={'train_runtime': 89.1929, 'train_samples_per_second': 78.257, 'train_steps_per_second': 0.628, 'total_flos': 924622442618880.0, 'train_loss': 0.5995587621416364, 'epoch': 2.0})
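
The reported `global_step=56` can be sanity-checked with a little arithmetic. The sketch below assumes one optimizer step per batch (no gradient accumulation) and that the run used two GPUs, so the effective batch size is twice `per_device_train_batch_size`. The device count is inferred from the numbers, not stated anywhere in the notebook, and `expected_steps` is a hypothetical helper.

```python
import math


def expected_steps(num_examples, per_device_batch_size, num_devices, num_epochs):
    """Total optimizer steps, assuming one step per batch and a partial final batch per epoch."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


single_device = expected_steps(3490, 64, 1, 2)
two_devices = expected_steps(3490, 64, 2, 2)  # assumed setup for this run
print(single_device, two_devices)
```

Under these assumptions only the two-device case gives 56 steps. With so few steps, the default logging interval of 500 steps is never reached, which would explain the `No log` entries in the Training Loss column.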

In [27]:
# save the best model
trainer.save_model(f"{output_dir}/best_model")
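
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` is documented to compare checkpoints by evaluation loss, where lower is better. The sketch below mimics that selection on the two epochs from the training log above; `pick_best` is a hypothetical helper, not part of the library.

```python
# Per-epoch evaluation results copied from the training log above.
history = [
    {"epoch": 1, "eval_loss": 0.569913, "eval_accuracy": 0.708124, "eval_f1": 0.668871},
    {"epoch": 2, "eval_loss": 0.512987, "eval_accuracy": 0.747242, "eval_f1": 0.742078},
]


def pick_best(history, metric="eval_loss", greater_is_better=False):
    """Return the row with the best value of the chosen metric."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda row: row[metric])


best = pick_best(history)
print(best["epoch"])
```

Here the second epoch wins on loss, accuracy, and F1 alike, so the choice of metric would not change which checkpoint is kept.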

## Evaluation Metrics (on validation and test sets)
- Accuracy
- F1 Score

In [28]:
# evaluate on validation set
val_result = trainer.evaluate()



In [29]:
val_result

{'eval_loss': 0.5129866600036621,
 'eval_accuracy': 0.7472417251755266,
 'eval_f1': 0.7420776730483101,
 'eval_runtime': 4.57,
 'eval_samples_per_second': 218.16,
 'eval_steps_per_second': 1.751,
 'epoch': 2.0}

In [30]:
# evaluate on test set
test_result = trainer.evaluate(eval_dataset=tokenized_dataset["test"])



In [31]:
test_result

{'eval_loss': 0.5395698547363281,
 'eval_accuracy': 0.7274549098196392,
 'eval_f1': 0.7199946994304812,
 'eval_runtime': 2.3559,
 'eval_samples_per_second': 211.805,
 'eval_steps_per_second': 1.698,
 'epoch': 2.0}

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [32]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

Device: cuda


In [44]:
text = test_dataset[0]["text"]
real_label = test_dataset[0]["label"]
print(f"Text: {text}")
print(f"Real Label: {real_label}")

Text: How great it would be to be able to go out on a date with a favorite superheroe. Probably for some of us our partners are our personal superheroes, but we know that deep down we dream of the God of thunder, beautiful spies stealing classified information, night watchmen, or women with pyrotechnic skills. At some point, everyone has wished to have a date with one of those characters or at least with one of the actors that played them.  But for some people, this dream has come true. Deep in their hearts everyone else envies them because they know it must be fun to be able to say that their husband, wife or even one of their parents has been defeating different villains under the spotlights of the Hollywood cameras. So, let's stop and think about it for a moment... if being a parent is already a challenge then try to picture being parents and superheroes at the same time. Anyone else would lose their minds! But Commissioner James Gordon once said "[One has] to make a difference. A l

### Use pipeline API

In [34]:
from transformers import pipeline

classifier = pipeline("text-classification", model=f"{output_dir}/best_model", truncation=True, device=device)
classifier(text)

[{'label': 'fake', 'score': 0.6022790670394897}]

### Use the tokenizer and model directly

You can also manually replicate the results of the `pipeline` if you'd like.

Tokenize the text and return PyTorch tensors:

In [40]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(f"{output_dir}/best_model")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
print(f"Input keys: {inputs.keys()}")
print(f"Input: {inputs}")

Input keys: dict_keys(['input_ids', 'attention_mask'])
Input: {'input_ids': tensor([[  101,  2129,  2307,  2009,  2052,  2022,  2000,  2022,  2583,  2000,
          2175,  2041,  2006,  1037,  3058,  2007,  1037,  5440, 16251,  2063,
          1012,  2763,  2005,  2070,  1997,  2149,  2256,  5826,  2024,  2256,
          3167, 16251,  2229,  1010,  2021,  2057,  2113,  2008,  2784,  2091,
          2057,  3959,  1997,  1996,  2643,  1997,  8505,  1010,  3376, 16794,
         11065,  6219,  2592,  1010,  2305,  3422,  3549,  1010,  2030,  2308,
          2007,  1052, 12541, 12184,  2818,  8713,  4813,  1012,  2012,  2070,
          2391,  1010,  3071,  2038,  6257,  2000,  2031,  1037,  3058,  2007,
          2028,  1997,  2216,  3494,  2030,  2012,  2560,  2007,  2028,  1997,
          1996,  5889,  2008,  2209,  2068,  1012,  2021,  2005,  2070,  2111,
          1010,  2023,  3959,  2038,  2272,  2995,  1012,  2784,  1999,  2037,
          8072,  3071,  2842,  4372, 25929,  2068,  213

Pass your inputs to the model and return the `logits`:

In [41]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(f"{output_dir}/best_model")
with torch.no_grad():
    logits = model(**inputs).logits
print(f"Logits: {logits}")

Logits: tensor([[-0.3234,  0.0916]])


Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [42]:
# take the argmax over the class dimension (the batch holds a single example)
predicted_class_id = logits.argmax(dim=-1).item()
model.config.id2label[predicted_class_id]

'fake'
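
The `pipeline` score and the raw logits are linked by a softmax. The sketch below applies a plain-Python softmax to the logits printed above (which are rounded to four decimals) to show that the probability of the predicted class lands close to the score the pipeline reported. `softmax` here is a small helper written for illustration; in practice you would likely use `torch.softmax`.

```python
import math


def softmax(values):
    """Convert logits to probabilities; subtracting the max keeps exp() numerically stable."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "real", 1: "fake"}
logits = [-0.3234, 0.0916]  # rounded logits printed by the model above

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=lambda i: probs[i])
print(id2label[predicted_id], probs[predicted_id])
```

A probability of roughly 0.60 for `fake` means the model is only mildly confident in this prediction.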