# Recitation 7: Finetune LM using HuggingFace
_Date_: 11/6/2025

## References
- [HuggingFace NLP example notebooks](https://huggingface.co/docs/transformers/en/notebooks)
- [NLP course](https://huggingface.co/learn/llm-course/en/chapter3/2)

## How to use HuggingFace?

Recall the pipeline for training a deep neural model with PyTorch:
1. Data representation
2. Build model
3. Train the model on training/validation set
4. Evaluate the model
5. Save/publish the model

HuggingFace provides APIs for each step, so completing the whole pipeline is easy as long as you don't need to customize the model and you have enough computing resources. The main library to use is `transformers`.

## Task: Sequence classification

- Architecture: masked language model
- Model: BERT
- Dataset: GLUE (General Language Understanding Evaluation benchmark), MRPC subset

In [None]:
from dataclasses import dataclass
from pprint import pprint

In [None]:
@dataclass
class Config:
    model: str
    batch_size: int

In [None]:
conf = Config(model="bert-base-uncased", batch_size=16)

### 1) Data representation

This is the first step and also the most flexible one in the whole pipeline, since every user has a different dataset to clean, preprocess, and represent. As a result, it usually takes the most time.

Normally, you would need APIs from two libraries:
- `datasets` [(link)](https://huggingface.co/docs/datasets/index)
  - Load datasets uploaded by people in HuggingFace community
  - Similar to PyTorch's `Dataset` class, the library provides a class that wraps your custom dataset so the downstream trainer can process it.
- `transformers` [(link)](https://huggingface.co/docs/transformers/en/index)
  - The main library for building training/inference pipelines; its key components are the **model** and the **tokenizer**.



#### 1a) Load dataset

In [None]:
from datasets import load_dataset

In [None]:
glue_ds = load_dataset("glue", "mrpc")
glue_ds

As you can see, HuggingFace, like PyTorch, uses a map-style dataset for storing and retrieving raw data. When the data is too large to handle this way, an iterable-style dataset is used instead.

To better understand the map-style dataset, think of it as a table:

|sentence1      |sentence2     |label |idx |
|---------------|--------------|------|----|
|Pizza is great.| You're right.|1     |10  |

In [None]:
pprint(glue_ds['train'][0])

In [None]:
pprint(glue_ds['train'].features)

#### 1b) Preprocess dataset

Given the raw text, the next step is to preprocess it, that is, to extract linguistic features and encode them as numerical values. In `transformers`, the tokenizer is responsible for this step, and each model has its own tokenizer.

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(conf.model)

In [None]:
example = tokenizer("This is the first sentence.", "This is the second one.")
pprint(example)

_**NOTE**_: `token_type_ids` is an optional field that marks which sentence each token comes from: a value of $i$ means the token belongs to sentence $i+1$. Here, `0` marks tokens from sentence $1$ and `1` marks tokens from sentence $2$.

In [None]:
tokenizer.convert_ids_to_tokens(example['input_ids'])

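_**NOTE**_: The tokenizer merges the two sentences into a single sequence and adds the special tokens `[CLS]` and `[SEP]`. `[CLS]` opens the sequence and is the position BERT uses for classification tasks, while `[SEP]` marks the end of each sentence.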
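
To make the layout of a sentence pair concrete, here is a minimal sketch in plain Python. The helper `encode_pair` is hypothetical: it splits on whitespace as a stand-in for BERT's WordPiece tokenizer and keeps tokens as strings instead of vocabulary ids, so its output will not match the real tokenizer token for token. It only shows where the special tokens go and how `token_type_ids` lines up with them.

```python
from typing import Dict, List


def encode_pair(sentence1: str, sentence2: str) -> Dict[str, List]:
    """Hypothetical helper: lay out a sentence pair as [CLS] A [SEP] B [SEP].

    Whitespace splitting stands in for WordPiece, and tokens stay as strings
    rather than being mapped to vocabulary ids.
    """
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding is added here, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("This is the first sentence.", "This is the second one.")
print(encoded["tokens"])
print(encoded["token_type_ids"])
```
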

The previous example shows how the tokenizer works on a single example, but in practice the training loop processes a batch of examples at a time.

`Dataset.map(func, batched=True)` is helpful for this. With `batched=True`, `func` is applied in batch mode, much like a collate function: it receives a batch of examples in which each column is a list of values, rather than one example at a time.
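
The calling convention of a batched map can be sketched without the library. The function `toy_batched_map` below is a hypothetical stand-in, not the `datasets` implementation, and `count_words` is an illustrative function that plays the role of `func`. The point is that `func` sees a dict of lists and returns a dict of lists, which are merged back as columns.

```python
from typing import Callable, Dict, List

Columns = Dict[str, List]


def toy_batched_map(columns: Columns, func: Callable[[Columns], Columns], batch_size: int = 2) -> Columns:
    """Hypothetical stand-in for a batched map over a column-oriented table.

    `func` receives a dict of column slices (lists), not a single row, and
    returns a dict of columns that are merged with the existing ones.
    """
    num_rows = len(next(iter(columns.values())))
    result: Columns = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Columns returned by func are added to (or override) the batch columns.
        merged = {**batch, **func(batch)}
        for name, values in merged.items():
            result.setdefault(name, []).extend(values)
    return result


def count_words(batch: Columns) -> Columns:
    # Each value in `batch` is a list with one entry per row of the batch.
    return {
        "num_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


toy_columns = {
    "sentence1": ["Pizza is great.", "It rained all day.", "Cats sleep."],
    "sentence2": ["You're right.", "The sun was out.", "Dogs bark loudly."],
}
mapped = toy_batched_map(toy_columns, count_words, batch_size=2)
print(mapped["num_words"])  # [5, 8, 5]
```

The tokenizer fits this convention because it accepts lists of sentences and returns lists of encodings, one per input row.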

In [None]:
# Preprocess in batch
def tokenize_fn(batch):
    # With batched=True, each column holds a list of values, one per example
    return tokenizer(batch['sentence1'], batch['sentence2'], truncation=True)

tokenized_ds = glue_ds.map(tokenize_fn, batched=True)
tokenized_ds

In [None]:
len(tokenized_ds['train'])

### 2) Training

Once the dataset is preprocessed, the next step is to train an existing (pretrained) model on this specific dataset, which is called fine-tuning.

In `transformers`, you normally need three components:
- `TrainingArguments` ([doc](https://huggingface.co/docs/transformers/en/main_classes/trainer)): an object for specifying arguments for training
- `Trainer`: the driver that runs training as configured by `TrainingArguments`
- `AutoModel[...]` ([doc](https://huggingface.co/docs/transformers/en/model_doc/auto)): the model architecture object to be trained

In [None]:
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification

In [None]:
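# 1. Define training arguments ("test-trainer" is the output directory)
training_args = TrainingArguments(
    "test-trainer",
    # Use the batch size from the config, which is otherwise unused
    per_device_train_batch_size=conf.batch_size,
    per_device_eval_batch_size=conf.batch_size,
)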

# 2. Instantiate model
model = AutoModelForSequenceClassification.from_pretrained(conf.model, num_labels=2)

# 3. Instantiate trainer and start training
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    processing_class=tokenizer,
)

In [None]:
trainer.train()