# Fine-Tuning BERT for Sequence Classification

This is a starting template for fine-tuning DistilBERT (`distilbert-base-uncased`) for hallucination detection on the SHROOM data.

The result is a baseline for future improvements; it is not expected to perform well.

## Setup

### Libs

In [1]:
!pip install datasets
!pip install transformers[torch] -U
!pip install -U accelerate



In [2]:
import json
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [3]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer, DataCollatorWithPadding
from datasets import Dataset

import transformers
import accelerate

### Data & Model Loading

In [4]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
data_dir = "/content/drive/MyDrive/SHROOM/data"

In [7]:
with open(f"{data_dir}/SHROOM_unlabeled-training-data-v2/train.model-aware.v2.json") as f:
  train_data = json.load(f)

with open(f"{data_dir}/SHROOM_dev-v2/val.model-aware.v2.json") as f:
  dev_data = json.load(f)

with open(f"{data_dir}/SHROOM_dev-v2/val.model-agnostic.json") as f:
  dev_data_agnostic = json.load(f)

### Data prep

In [8]:
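def prep_df(json_data):
  """Build a DataFrame from SHROOM records, keeping only the definition modeling (DM) task."""
  # Print one record and its keys to inspect the schema
  print(json_data[0])
  print(json_data[0].keys())
  _df = pd.DataFrame(json_data)
  # Show which tasks are present before filtering
  print(_df.task.unique())
  # Keep DM rows and renumber them from 0 without keeping the old index
  _df = _df.query("task == 'DM'").reset_index(drop=True)
  return _df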

In [9]:
def preprocess_func(examples):
    return tokenizer(examples["text"], truncation=True)
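
The `text` column used later in this notebook holds only `src`, so the classifier never sees the hypothesis it is supposed to judge. One possible variation, sketched here under assumptions, encodes each hypothesis together with its reference as a sentence pair. The helper names `pick_reference` and `preprocess_pairs` are hypothetical, and the rule that `ref == 'src'` selects the source while anything else selects `tgt` is an assumption about the SHROOM schema, not something this notebook confirms.

```python
def pick_reference(example):
    # Assumption: 'ref' names the field that holds the reference text.
    return example["src"] if example.get("ref") == "src" else example["tgt"]


def preprocess_pairs(examples, tokenizer):
    # `examples` is a column-oriented batch, as passed by Dataset.map(batched=True).
    n = len(examples["hyp"])
    rows = [{key: values[i] for key, values in examples.items()} for i in range(n)]
    references = [pick_reference(row) for row in rows]
    # Hugging Face tokenizers accept a second list of texts to build sentence pairs.
    return tokenizer(examples["hyp"], references, truncation=True)
```

With `datasets`, this could be applied as `dataset.map(preprocess_pairs, batched=True, fn_kwargs={"tokenizer": tokenizer})`, provided the `hyp`, `tgt`, `src`, and `ref` columns are kept in the DataFrame.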

In [10]:
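train_df = prep_df(train_data)
val_df = prep_df(dev_data)

# Keep the label and the source text. Chaining rename() returns a new frame
# and avoids pandas' SettingWithCopyWarning on a sliced DataFrame.
val_df = val_df[["label", "src"]].rename(columns={"src": "text"})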

{'hyp': 'Of or pertaining to the language of a particular area , or to a particular', 'tgt': 'Of or pertaining to everyday language , as opposed to standard , literary , liturgical , or scientific idiom .', 'src': 'There are blacktips , silvertips , bronze whalers , black whalers , spinner sharks , and bignose sharks . these of course are vernacular names , but this is one case where the scientific nomenclature does not clarify the species , since it is now being revised . What is the meaning of vernacular ?', 'ref': 'tgt', 'task': 'DM', 'model': 'ltg/flan-t5-definition-en-base'}
dict_keys(['hyp', 'tgt', 'src', 'ref', 'task', 'model'])
['DM' 'PG' 'MT']
{'hyp': 'A sloping top .', 'ref': 'tgt', 'src': 'The sides of the casket were covered with heavy black broadcloth , with velvet caps , presenting a deep contrast to the rich surmountings . What is the meaning of surmounting ?', 'tgt': 'A decorative feature that sits on top of something .', 'model': 'ltg/flan-t5-definition-en-base', 'task

In [11]:
model_agnostic_df = prep_df(dev_data_agnostic)

{'hyp': 'Resembling or characteristic of a weasel.', 'ref': 'tgt', 'src': 'The writer had just entered into his eighteenth year , when he met at the table of a certain Anglo - Germanist an individual , apparently somewhat under thirty , of middle stature , a thin and <define> weaselly </define> figure , a sallow complexion , a certain obliquity of vision , and a large pair of spectacles .', 'tgt': 'Resembling a weasel (in appearance).', 'model': '', 'task': 'DM', 'labels': ['Hallucination', 'Not Hallucination', 'Not Hallucination', 'Not Hallucination', 'Not Hallucination'], 'label': 'Not Hallucination', 'p(Hallucination)': 0.2}
dict_keys(['hyp', 'ref', 'src', 'tgt', 'model', 'task', 'labels', 'label', 'p(Hallucination)'])
['DM' 'PG' 'MT']


In [12]:
# Encode the string labels as integers: 1 = Hallucination, 0 = Not Hallucination
val_df["label"] = val_df["label"].map({"Hallucination": 1, "Not Hallucination": 0})

In [13]:
# Encode the string labels as integers: 1 = Hallucination, 0 = Not Hallucination
model_agnostic_df["label"] = model_agnostic_df["label"].map({"Hallucination": 1, "Not Hallucination": 0})
# Chain rename() instead of renaming in place on a slice, which raises SettingWithCopyWarning
model_agnostic_df = model_agnostic_df[["label", "src"]].rename(columns={"src": "text"})


In [14]:
val_df.head()

| | label | text |
|---|---|---|
| 0 | 1 | The sides of the casket were covered with heav... |
| 1 | 0 | Please try not to overreact if she drives badl... |
| 2 | 1 | To prevent spoilage , store in a cool , dry pl... |
| 3 | 1 | The way the opposition has framed the argument... |
| 4 | 1 | To mix with thy concernments i desist . What i... |


In [15]:
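# The training file is unlabeled, so the labeled model-aware dev set is used for training.
# test_size=0.001 holds out a single example; evaluation later uses the model-agnostic dev set.
train_df, eval_df = train_test_split(val_df, test_size=0.001)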
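
To see why this split leaves a single evaluation example, the arithmetic can be sketched as follows. It assumes scikit-learn rounds a fractional `test_size` up to the next whole sample, and it takes the 188 DM rows implied by the 187 training examples and 1 held-out example reported in the outputs further down.

```python
import math

n_rows = 188          # DM rows in val_df (187 train + 1 eval, per the later outputs)
test_size = 0.001

n_eval = math.ceil(test_size * n_rows)   # a fractional test size is rounded up
n_train = n_rows - n_eval

print(n_train, n_eval)   # 187 1
```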

In [16]:
train_df

| | label | text |
|---|---|---|
| 85 | 0 | Can you demonstrate the new tools for us ? Wha... |
| 57 | 1 | In that sense , the word acts adjectivally , w... |
| 37 | 0 | Support for resumable downloads means that you... |
| 172 | 1 | Thy intercepter , full of despite , bloody as ... |
| 59 | 1 | The first reliever got the last two outs of th... |
| ... | ... | ... |
| 25 | 0 | They have no idea what occurs in the network o... |
| 68 | 1 | From his deepe chest laughes out a lowd applau... |
| 133 | 1 | 'the surface of the downs , which form the lan... |
| 82 | 1 | I will stand the hazard of the die . What is t... |


## Preprocessing

In [17]:
train_set = Dataset.from_pandas(train_df)
eval_set = Dataset.from_pandas(eval_df)

In [18]:
model_agnostic_set = Dataset.from_pandas(model_agnostic_df)

In [19]:
eval_set[0]

{'label': 1,
 'text': "Akira toriyama uses a timeskip to rejoin us with gokū and friends in dragon ball chapter 113 , three years after gokū leaves uranai baba 's palace for his global trek . What is the meaning of timeskip ?",
 '__index_level_0__': 71}

In [20]:
train_set = train_set.remove_columns(["__index_level_0__"])
eval_set = eval_set.remove_columns(["__index_level_0__"])

In [21]:
model_agnostic_set

Dataset({
    features: ['label', 'text'],
    num_rows: 187
})

In [23]:
tokenized_train = train_set.map(preprocess_func, batched=True)
tokenized_eval = eval_set.map(preprocess_func, batched=True)

Map:   0%|          | 0/187 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [24]:
tokenized_agnostic = model_agnostic_set.map(preprocess_func, batched=True)

Map:   0%|          | 0/187 [00:00<?, ? examples/s]

In [25]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
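
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to one fixed global length. The pure-Python sketch below illustrates that idea only. It is not the library's implementation, the helper name `pad_batch` is hypothetical, and `pad_id=0` is an assumption about the tokenizer's padding token.

```python
def pad_batch(features, pad_id=0):
    # Pad every sequence to the longest one in this batch (dynamic padding).
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Real tokens get mask 1, padding gets mask 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


batch = pad_batch([{"input_ids": [101, 7592, 102]}, {"input_ids": [101, 102]}])
print(batch["input_ids"])       # [[101, 7592, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```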

## Training

In [26]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

In [27]:
from transformers import EvalPrediction
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(p: EvalPrediction):
    # Convert logits to predicted labels
    preds = np.argmax(p.predictions, axis=1)
    # Calculate accuracy, plus precision, recall, and F1 for the positive class (label 1)
    accuracy = accuracy_score(p.label_ids, preds)
    # zero_division=0 avoids a warning when the model predicts no positives
    precision, recall, f1, _ = precision_recall_fscore_support(
        p.label_ids, preds, average='binary', zero_division=0
    )
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [28]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_agnostic,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

In [29]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



TrainOutput(global_step=60, training_loss=0.663671875, metrics={'train_runtime': 544.014, 'train_samples_per_second': 1.719, 'train_steps_per_second': 0.11, 'total_flos': 21295686686280.0, 'train_loss': 0.663671875, 'epoch': 5.0})
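
The `global_step=60` in this output follows from the dataset size and the batch size. A quick check, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped:

```python
import math

n_train = 187        # training examples
batch_size = 16      # per_device_train_batch_size
epochs = 5           # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch is smaller
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)   # 12 60
```

If `logging_steps` is left at its default of 500, as it is here, a 60-step run finishes before any intermediate training loss is logged.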

## Eval

In [30]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.6993036866188049, 'eval_accuracy': 0.5080213903743316, 'eval_f1': 0.09803921568627451, 'eval_precision': 0.5, 'eval_recall': 0.05434782608695652, 'eval_runtime': 33.4851, 'eval_samples_per_second': 5.585, 'eval_steps_per_second': 0.358, 'epoch': 5.0}
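
The reported F1 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The confusion-matrix counts in this sketch are not printed by the notebook. They are reconstructed as counts consistent with the reported metrics on 187 examples, and should be read as an inference.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output).
tp, fp, fn, tn = 5, 5, 87, 90

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, f1, accuracy)
```

Under these counts the model predicts the positive class only 10 times out of 187, which would explain why recall is so low while accuracy stays near chance.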
