# Fine-tuning a HuggingFace model

## Code Preamble

In [8]:
import evaluate
import numpy as np
import pandas as pd
import re
from collections import Counter

from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    pipeline,
    TrainingArguments, 
    Trainer
)

## Fine-Tuning

- A common use of LLMs is to leverage their **generalized** linguistic capacities by fine-tuning them for a **particular** task
- For instance, we could take an LLM and train it to...
    - classify text sequences
    - classify tokens
    - produce dialogue
    - answer questions
    - etc.
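
As a rough orientation, each of the tasks above has a matching `AutoModelFor...` class in `transformers`. The lookup table below is only a sketch: the dictionary and its name `task_to_auto_class` are made up for illustration, and it stores the class names as plain strings so it runs without downloading anything.

```python
# Hypothetical lookup table: task -> name of the transformers Auto class
# that is typically used for it (stored as strings, nothing is imported).
task_to_auto_class = {
    "classify text sequences": "AutoModelForSequenceClassification",
    "classify tokens": "AutoModelForTokenClassification",
    "produce dialogue": "AutoModelForCausalLM",
    "answer questions": "AutoModelForQuestionAnswering",
}

for task, class_name in task_to_auto_class.items():
    print(f"{task:<25} -> {class_name}")
```

In this notebook we only need the first entry, since author attribution is a sequence classification task.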

## Author Attribution

- The task I want to train the model to perform is identifying the authors of texts
    - This is known as "author attribution"
    - E.g. Italian computer scientists tried to identify Elena Ferrante by comparing her work with that of known Italian authors and journalists
- We'll be using one of the few author attribution datasets on HuggingFace
    - It uses texts from 13 journalists at the Guardian
- We can find the [data here](https://huggingface.co/datasets/guardian_authorship)

We load it by calling ```load_dataset```. The function needs the name of the dataset on the HuggingFace Hub and the name of the configuration (subset) we want:

In [9]:
dataset = load_dataset('guardian_authorship', 'cross_genre_1')

In [10]:
dataset['train'][50]

{'author': 10,
 'topic': 4,
 'article': 'Chance Witness<br />by Matthew Parris<br />528pp, Viking, 18.99 I\'ve known Matthew Parris pretty well for nearly 10 years, and he has always been kind and helpful and thoughtful, with only the occasional whiff of cattiness carried faintly on the breeze. He is also very kind about me in this alarmingly good book, so you might want to discount what follows. The cover shows Matthew with his hand over his mouth, as if he had just let slip a dangerous indiscretion. Yet I\'ve read few autobiographies that are so carefully considered, so empty of anything glib or cheap. The only damaging material is about people who are already dead, or who are big enough to take it. (Among the many wonderful vignettes of Margaret Thatcher is one illustrating her reliance on the Sun, and in particular the two-bullet-point editorials that used to appear opposite page three. "One day she plonked the paper down in front of the assembled male company, open at this spread,

There are some issues with the data, so I wrote a quick script to fix it. 
- Merge train, test and validate as pandas df
- Create new Dataset
- Do my own train_test_split

Most often this kind of fixing is **not** necessary!

In [11]:
def fix_guardian_data(dataset):
    # Add a label column to the data
    dataset['train'] = dataset['train'].add_column("label", dataset['train']['author'])
    dataset['test'] = dataset['test'].add_column("label", dataset['test']['author'])
    dataset['validation'] = dataset['validation'].add_column("label", dataset['validation']['author'])
    
    # We want to do our own test-train split
    # To do this, I first make the data into one big dataframe
    train_df = pd.DataFrame(dataset['train'])
    test_df = pd.DataFrame(dataset['test'])
    val_df = pd.DataFrame(dataset['validation'])
    all_data = pd.concat([train_df, test_df, val_df])
    
    # Now I create a Huggingface dataset from that dataframe
    dataset = Dataset.from_pandas(all_data)
    
    # I cast the 'label' column to a ClassLabel feature, which stratified splitting requires
    dataset = dataset.class_encode_column("label")
    
    # Then I take the train_test_split. I want 20% of the data to be in the test set
    dataset = dataset.train_test_split(test_size=0.2, stratify_by_column="label")
    return dataset

dataset = fix_guardian_data(dataset)

Stringifying the column:   0%|          | 0/444 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/444 [00:00<?, ? examples/s]

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['author', 'topic', 'article', 'label', '__index_level_0__'],
        num_rows: 355
    })
    test: Dataset({
        features: ['author', 'topic', 'article', 'label', '__index_level_0__'],
        num_rows: 89
    })
})

In [227]:
list(set(dataset['train']['label']))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
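
To see what a stratified split buys us, here is a minimal sketch in plain Python. It is not how `datasets` implements `stratify_by_column`; the function `stratified_split` and the toy labels are my own, and the only point is that each label keeps roughly the same share in the train and test parts.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so each label keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # test share taken per label
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): three "authors" with 10, 20 and 30 texts
toy_labels = [0] * 10 + [1] * 20 + [2] * 30
train_idx, test_idx = stratified_split(toy_labels)

print(Counter(toy_labels[i] for i in train_idx))
print(Counter(toy_labels[i] for i in test_idx))
```

With an unstratified random split, a rare author could end up almost entirely in one of the two parts, which matters for a dataset as small as ours.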

Now we tokenize, as always for NLP. 
- Different LLMs use different tokenizers.
- Like the model, our tokenizer needs to know where on the HuggingFace Hub to find its vocabulary and configuration
- We can use the ```AutoTokenizer``` class instead of picking a particular tokenizer class ourselves

We will be using DistilBERT, a smaller and more nimble version of BERT.

In [13]:
model_type = "distilbert-base-uncased"

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_type)

def tokenize_function(examples):
    return tokenizer(examples["article"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

Map:   0%|          | 0/89 [00:00<?, ? examples/s]
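
The two tokenizer options we passed, `padding="max_length"` and `truncation=True`, make every article the same length. The sketch below mimics that on lists of token ids. It is a simplification under my own assumptions: the helper `pad_and_truncate` is hypothetical, the ids are made up, and a real tokenizer also takes care to keep special tokens in place when it truncates.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length."""
    truncated = token_ids[:max_length]
    n_pad = max_length - len(truncated)
    return {
        "input_ids": truncated + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(truncated) + [0] * n_pad,
    }

short = pad_and_truncate([101, 7592, 102], max_length=6)
long = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

For DistilBERT the maximum length is 512 tokens, so long articles lose everything after that point.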

Next we set our hyperparameters:

In [15]:
batch_size = 8
epochs = 10
weight_decay = 0.01
learning_rate = 1e-5

We feed most hyperparameters to the [```TrainingArguments``` class](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments)

In [16]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,  # reuse the hyperparameters defined above
    weight_decay=weight_decay,
    learning_rate=learning_rate,
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
)
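
These settings also fix how many optimizer steps the run will take. A quick back-of-the-envelope check, assuming a single device and no gradient accumulation (both are assumptions on my part):

```python
import math

n_train = 355     # training rows, from the DatasetDict printed earlier
batch_size = 8
epochs = 10

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 45 450
```

The total of 450 matches the `0/450` progress bar that shows up once training starts.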

Now we can specify our model. 

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(model_type, num_labels=13)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


When we train, we want to keep track of the model's performance. For this we need to give the trainer a function that takes the evaluation predictions and returns a dictionary of metric values. We can use the ```evaluate``` library and write a function around it.

In [18]:
metric = evaluate.load("accuracy")

In [19]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
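
To make the argmax step concrete, here is the same idea on made-up numbers, using only `numpy` instead of the `evaluate` metric. The function name `accuracy_from_logits` and the toy arrays are mine.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the true labels."""
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for 4 examples and 3 classes
toy_logits = np.array([
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 0.2, 1.5],    # predicts class 2
    [-0.5, 0.9, 0.0],   # predicts class 1
    [1.2, 1.1, 1.0],    # predicts class 0
])
toy_labels = np.array([0, 2, 2, 0])

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four predictions match their labels, so the accuracy is 0.75.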

We can also create so-called "callbacks".
- These are objects that customize the training loop
- Some of the default ones have [their own classes in HuggingFace](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/callback).

In our case, we want training to stop if the model hasn't improved for 3 consecutive epochs.

In [20]:
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=3)
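
The patience logic can be sketched in a few lines of plain Python. This is a simplified stand-in and not the library's code: I assume the monitored metric is the evaluation loss (lower is better), and the function `epochs_until_stop` and the loss values are made up.

```python
def epochs_until_stop(eval_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0   # improvement resets the counter
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None

# Made-up evaluation losses, one per epoch
toy_losses = [2.5, 2.3, 2.2, 2.25, 2.21, 2.3, 2.1]
stop_epoch = epochs_until_stop(toy_losses)
print(stop_epoch)  # 6
```

The best loss is reached in epoch 3, the next three epochs fail to beat it, so training stops after epoch 6 and never sees the better value in epoch 7. That is the trade-off of a small patience.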

Finally, we create a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
- This is HuggingFace's own training class. It plays a similar role to the Trainer in [```PyTorch Lightning```](https://lightning.ai/docs/pytorch/stable/common/trainer.html), which is used in many other libraries, like TorchGeo
- Given an instance of a model class, it handles the whole training loop, including the forward and backward passes

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42),
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42),
    compute_metrics=compute_metrics,
    callbacks = [early_stopping_callback]
)
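
To get a feeling for what the `Trainer` automates, here is the same forward / backward / update cycle written by hand for a toy model with a single weight. Everything in it (data, model, learning rate) is made up for illustration, and the gradient is derived by hand instead of by autograd.

```python
# Toy data (made up): y is roughly 3 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]

w = 0.0               # the single trainable weight
learning_rate = 0.01
losses = []

for epoch in range(50):
    # Forward pass: predictions and mean squared error
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)

    # Backward pass: d(loss)/d(w), worked out by hand for this tiny model
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)

    # Optimizer step: plain gradient descent
    w -= learning_rate * grad

print(f"w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}")
```

The `Trainer` runs this kind of loop for us, and adds batching, evaluation, checkpointing and callbacks on top.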

Now we just call ```train()```, and the ```Trainer``` runs the loop that we would otherwise write by hand in ```PyTorch```!

In [22]:
trainer.train()



  0%|          | 0/450 [00:00<?, ?it/s]

KeyboardInterrupt: 

The model has finished training! 
- Now we can use it in a ```HuggingFace``` pipeline

In [238]:
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

The pipeline outputs probabilities, so there is no need to work with logits:

In [239]:
pipe(tokenized_datasets["test"][50]['article'][:512])

[{'label': 'LABEL_3', 'score': 0.1491585373878479}]
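
The `score` is what we get when the logits are pushed through a softmax, which, as far as I know, is what the text-classification pipeline does by default for a model with several labels. A minimal sketch with made-up logits (the helper `softmax` and the numbers are mine):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [0.2, -0.1, 0.0, 0.9]   # made-up scores for 4 classes
probs = softmax(toy_logits)

# The pipeline reports the most probable class and its probability
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": f"LABEL_{best}", "score": probs[best]}
print(prediction)
```

For comparison: with 13 authors, a uniform guess would give each label a score of 1/13, or about 0.077, so the score of about 0.149 above is not a confident prediction.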

Now we'll compare the model's predictions on the test set with the true labels.
- We'll check if ```pred_lab == real_lab``` and count how many times it's ```True```.
    - This is our count of how many times we predicted correctly :)

In [246]:
correct = []
for idx in range(len(tokenized_datasets["test"]['author'])):
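    # We pull the predicted label from each prediction
    # The model can only read the first 512 tokens, so we cut each article short.
    # Note that [:512] keeps the first 512 characters, which in practice is far fewer than 512 tokens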
    pred_lab = pipe(tokenized_datasets["test"][idx]['article'][:512])[0]['label']
    
    # The pipeline outputs label strings like 'LABEL_3'.
    # We pull the number from it and turn it into an integer
    pred_lab = int(re.findall(r'\d+', pred_lab)[0])
    
    # We get the real label from the test data itself
    real_lab = tokenized_datasets["test"][idx]['label']
    
    # Now we compare and append to a list
    correct.append(pred_lab == real_lab)

In [247]:
Counter(correct)

Counter({False: 66, True: 23})
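
Turning these counts into an accuracy, and comparing it with a uniform random guess over the 13 authors (the uniform baseline is my own assumption, it ignores that some authors have more articles than others):

```python
from collections import Counter

counts = Counter({False: 66, True: 23})   # copied from the output above
accuracy = counts[True] / sum(counts.values())
chance = 1 / 13                            # uniform guess over 13 authors

print(f"accuracy = {accuracy:.3f}, chance = {chance:.3f}")
# accuracy = 0.258, chance = 0.077
```

So the model is right about a quarter of the time, which is well above this chance level but far from reliable.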

We could do more analysis. For example:
- Which authors did the model struggle with?
- Which did it predict confidently?
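
A possible starting point for the first question is a per-author accuracy. The sketch below uses made-up labels, not results from our model, and the helper `per_class_accuracy` is hypothetical:

```python
from collections import defaultdict

def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of that label's examples predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true, pred in zip(true_labels, predicted_labels):
        totals[true] += 1
        hits[true] += int(true == pred)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Made-up labels for illustration only
toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 0, 2]

acc = per_class_accuracy(toy_true, toy_pred)
print(acc)
```

For the real data, we would collect `real_lab` and `pred_lab` in two lists inside the loop above and pass them to this function.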

Further work could include:
- Pulling out the model weights to see if there are specific words that are important for predicting specific authors
- Testing if we can deceive the model by performing style transfer