# Setup and Disclaimer

Run the following cell to install all the necessary packages.

Disclaimer: parts of this tutorial are heavily inspired by the [Hugging Face course](https://huggingface.co/learn/nlp-course/chapter1/1).

We will use PyTorch as the backend.

In [1]:
! pip install transformers datasets peft evaluate transformers[torch]



# Questions

### 1 Which metrics or patterns might help to detect AI-generated text?

Perplexity is certainly worth trying: text generated by an LM tends to have low perplexity when scored by an LM. Beyond that, look at regularity and repetitiveness (LMs do not vary sentence length or complexity very often).
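To make the perplexity idea concrete, here is a minimal sketch in plain Python. It assumes you already have per-token log-probabilities from some scoring LM; the numbers below are made up for illustration and do not come from a real model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probabilities (natural log) from a scoring LM.
predictable_text = [math.log(0.5)] * 4   # every token was fairly expected
surprising_text = [math.log(0.05)] * 4   # every token was unexpected

print(perplexity(predictable_text))
print(perplexity(surprising_text))
```

A detector built on this idea would flag texts whose perplexity falls below some threshold; where to put that threshold is an empirical question that this sketch does not answer.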

### 2 What is the minimum amount of RAM I would need to run a batch of 4 samples through a GPT-3 instance? (Hint: think about the no. of weights)

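The weights are loaded once and shared by every sample in the batch, so they set the lower bound. Assuming 32-bit floats, every weight takes 4 bytes => 175B weights * 4 bytes / 1024^3 = ~650 GiB of RAM at the very least. Only the activations grow with the batch size, so the 4 samples add comparatively little on top of the weights; in practice you still need noticeably more than this bound.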
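The arithmetic for a single copy of the weights can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the helper name `weights_memory_gib` is ours, and the float16 variant is included only as an assumption about what a half-precision deployment would need.

```python
def weights_memory_gib(num_params, bytes_per_param=4):
    """Memory needed to hold one copy of the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

gpt3_params = 175e9  # 175B parameters

fp32 = weights_memory_gib(gpt3_params)                     # 4 bytes per weight
fp16 = weights_memory_gib(gpt3_params, bytes_per_param=2)  # 2 bytes per weight

print(f"float32: {fp32:.0f} GiB, float16: {fp16:.0f} GiB")
```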

### 3 You want to build a classification architecture on the [AG News dataset](https://huggingface.co/datasets/ag_news). Describe how you would use BERT to build a classification architecture?

It is sufficient to attach a fully-connected layer on top of BERT and feed it with [CLS] output embedding from BERT.

### 4 How many output neurons would your last layer have? What activation function would you use there?

As many as the classes, i.e. 4 (see dataset documentation), therefore the extra layer would have shape (768, 4). Softmax would be good as an output activation as we want to have output probabilities for our four mutually exclusive classes.
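As a small illustration of the head described above, the following sketch counts the parameters of a 768-to-4 layer and applies a softmax to four logits. It is plain Python rather than PyTorch, and the logits are hypothetical values, not outputs of a trained model.

```python
import math

HIDDEN_SIZE = 768   # size of the [CLS] embedding in BERT-base
NUM_CLASSES = 4     # number of AG News classes

# Parameters of the extra fully-connected layer: weight matrix plus bias.
head_params = HIDDEN_SIZE * NUM_CLASSES + NUM_CLASSES

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, -0.3, 0.1, 2.5]   # hypothetical outputs of the 4 neurons
probs = softmax(logits)
predicted_class = probs.index(max(probs))

print(head_params, probs, predicted_class)
```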

### 5 Suppose now we want to use GPT-3 instead and want to do zero-shot learning. Consider the input sentence $x_0=$ _"The Social Computing Group at TUM has just released GPT-5, that is impressive!"_  labelled as $y_0=$_"Sci/Tech"_. Design a reasonable prompt for $(x_0,y_0)$.

A suitable prompt could be "Observe the news article: $x_0$. The piece is about ______".

### 6 Now add demonstrations to your prompt to perform in-context learning. Are you still performing zero-shot learning? If not, what instead? Explicitly state which is the pattern $f$ and which is the verbalizer $v$.

Just imagine other news pieces. For instance, take two of them, $x_1$ and $x_2$, with their respective labels $y_1$ and $y_2$. The new prompt with demonstrations will look like this:
"Observe the news article: $x_1$. The piece is about $y_1$. Observe the news article: $x_2$. The piece is about $y_2$. Observe the news article: $x_0$. The piece is about ______".

We are now performing 2-shot learning (we have two demonstrations). The pattern and the verbalizer are as follows:

Pattern $f(x)$ = "Observe the news article: $x$"

Verbalizer $v(y)$ = "The piece is about $y$"
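The pattern and verbalizer can be written as two small functions, which makes it easy to assemble a k-shot prompt. This is only a sketch: `build_prompt` is a hypothetical helper, and the two demonstration articles are invented for illustration.

```python
def pattern(x):
    """f(x): wraps the input text."""
    return f"Observe the news article: {x}"

def verbalizer(y):
    """v(y): turns a label into natural language."""
    return f"The piece is about {y}."

def build_prompt(demonstrations, query):
    """Concatenate k solved demonstrations and leave the query unanswered."""
    parts = [f"{pattern(x)} {verbalizer(y)}" for x, y in demonstrations]
    parts.append(f"{pattern(query)} The piece is about ______")
    return " ".join(parts)

# Invented demonstrations (x_1, y_1) and (x_2, y_2).
demos = [
    ("Stocks rallied after the central bank held rates.", "Business"),
    ("The home team won the cup final on penalties.", "Sports"),
]
query = "The Social Computing Group at TUM has just released GPT-5, that is impressive!"

prompt = build_prompt(demos, query)
print(prompt)
```

Passing an empty list of demonstrations gives back the zero-shot prompt from the previous question.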

### 7 Suppose we are now given additional data to fine-tune your model, but retraining 175B parameters is absolutely unfeasible. How can parameter-efficient-tuning help us?

Parameter-efficient tuning techniques can achieve results that are on par or even superior to traditional fine-tuning while updating less than $1\%$ (sometimes even closer to $0.1\%$ or $0.01\%$) of the parameters.

### 8 Let's suppose you pick LoRA as the efficient fine-tuning method, but you don't want to add ANY additional parameters once deployed. What can you do?

Consider a model layer you are fine-tuning with LoRA. LoRA constructs a low-rank approximation of the update for that layer. Multiplying the matrices that constitute the factorization gives a matrix with the same shape as the weight matrix of that layer. Finally, sum the original weights and the reconstructed matrix (which is an approximation of the fine-tuning update). The merged layer has exactly as many parameters as the original one.
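A tiny numerical sketch shows why merging is safe: the merged weight gives the same output as the frozen weight plus the adapter path. The matrices, the rank (1) and the `scale` value are toy assumptions; in the `peft` library, `merge_and_unload()` is the call intended for this step.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scaled(m, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[s * v for v in row] for row in m]

# Frozen weight W (3x3) and toy LoRA factors B (3x1) and A (1x3), i.e. rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-1.0]]
A = [[1.0, 2.0, 0.0]]
scale = 1.0  # stands in for lora_alpha / r

# Merge once: W_merged = W + scale * (B @ A), same shape as W.
W_merged = add(W, scaled(matmul(B, A), scale))

x = [[1.0], [2.0], [3.0]]  # input as a column vector

adapter_out = add(matmul(W, x), scaled(matmul(B, matmul(A, x)), scale))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```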

### 9 Why was RL (initially) the only viable option for integrating human preferences into LLMs? What does it mean that "the environment is not differentiable"?

In pre-training and in SFT, the ground truth **can always be expressed as a vocabulary distribution** or more in general **as the model output**. For instance, if the next word is "tree", that corresponds to a VOCAB_SIZE-long vector with all 0s and a 1 at the index for the word "tree". In this case, since the model output and the ground truth have the same form, we can compute a loss and calculate gradients.

The same cannot be said about human preferences, where a single score or rank is given to an input/output pair. That score comes from a human (the environment), not from a function of the model output that we can differentiate, and it does not have the form of the model output either. We therefore cannot backpropagate a supervised loss through it, hence the statement "the environment is not differentiable".

# Pipeline

`pipeline()` is the simplest way to use a Hugging Face model. **Model** is an umbrella term describing an **architecture** (i.e. the skeleton/structure of the model) and a **checkpoint** (i.e. the set of weights that we load on it).

For example, BERT is an architecture while `bert-base-cased`, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say both “the BERT model” and “the `bert-base-cased` model”.

In [2]:
from transformers import pipeline

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
classifier = pipeline("sentiment-analysis", checkpoint)
classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"])




[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [3]:
generator = pipeline("text-generation")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use React Router to configure your web site using React.js on its own without having to copy and'},
 {'generated_text': 'In this course, we will teach you how to take advantage of the new tools that have become available for you.\n\nWhat is a Systematic'}]

As you can see, if you provide no model checkpoint, Hugging Face will automatically pick an adequate one.

See [here](https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt) for other usage examples of the `pipeline()` function.

Now use `pipeline()` to perform a different task.

In [4]:
### YOUR CODE HERE ###

checkpoint = "bert-base-uncased"
unmasker = pipeline("fill-mask", checkpoint)
unmasker("This course will teach you all about [MASK] models.", top_k=2)

######################

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3366122841835022,
  'token': 2535,
  'token_str': 'role',
  'sequence': 'this course will teach you all about role models.'},
 {'score': 0.08371730148792267,
  'token': 2449,
  'token_str': 'business',
  'sequence': 'this course will teach you all about business models.'}]

# Behind the Pipeline

The Pipeline operator takes care of all steps: tokenization, going through the model, and post-processing. We now go one level lower to gain a better understanding of how everything works.

### Tokenizer

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sequence = "I've been waiting for a HuggingFace course my whole life."

print(tokenizer(sequence))

tokens = tokenizer.tokenize(sequence) # You can also use the __call__ function directly

print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

decoded_string = tokenizer.decode(ids)

print(decoded_string)

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
i've been waiting for a huggingface course my whole life.


What are the tokens with indices 101 and 102? Find out!

[CLS] and [SEP].

### Padding
Putting inputs together is called batching. In order to batch sequences together, they need to be of the same length, which means you need to **pad** the inputs to meet this requirement.

In [6]:
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# This will work

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(batched_ids)).logits)

# But this will fail
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

print(model(torch.tensor(batched_ids)).logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[0.2469, 0.3080],
        [0.2570, 0.3084]], grad_fn=<AddmmBackward0>)


ValueError: expected sequence of length 3 at dim 1 (got 2)

### Attention Mask
Remember that the model attends to all tokens, and we don't want our predictions to depend on the padding tokens. We use **attention masks** to indicate which tokens should be attended to and which should not.

In [7]:
# This prediction
sentence_1_ids = [[200, 200, 200]]
sentence_2_ids = [[200, 200]]

# will be different from this one
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

# so we add the attention masks
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

print(f"{model(torch.tensor(sentence_1_ids)).logits}\n"\
      f"{model(torch.tensor(sentence_2_ids)).logits}")
print(model(torch.tensor(batched_ids)).logits)
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[0.2469, 0.3080]], grad_fn=<AddmmBackward0>)
tensor([[0.1991, 0.2034]], grad_fn=<AddmmBackward0>)
tensor([[0.2469, 0.3080],
        [0.2570, 0.3084]], grad_fn=<AddmmBackward0>)
tensor([[0.2469, 0.3080],
        [0.1991, 0.2034]], grad_fn=<AddmmBackward0>)
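Padding and the attention mask always go together, so it can help to see both produced by one function. The helper below, `pad_batch`, is a hypothetical plain-Python sketch of what the tokenizer does with `padding=True`; the default `pad_token_id=0` is an assumption that matches BERT-style vocabularies.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_batch([[200, 200, 200], [200, 200]])
print(ids)
print(mask)
```

Both lists are rectangular, so they can be turned into tensors and passed to the model as `input_ids` and `attention_mask`.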


### Truncate
Most transformers handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem: either you (1) use a model with a longer supported sequence length or (2) truncate your sequences.

In [8]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

### Through the Model

Why does the following code fail? Can you fix it?

Models expect a batch of sentences by default, so we need to add a batch dimension to our `input_ids`.

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

### MODIFY CODE HERE ###

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)#.unsqueeze(dim=0)# <- SOLUTION

########################

# This line will fail.
model(input_ids).logits

IndexError: too many indices for tensor of dimension 1

Now let's apply everything we have learned and run the model on `raw_inputs` without using the `pipeline()` function. We placed some imports as hints. We're using the same inputs and model checkpoint, so you should get the same output as before!

In [10]:
from transformers import AutoModelForSequenceClassification
from torch import argmax
from torch.nn.functional import softmax

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

### YOUR CODE HERE ###

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

outputs = model(**inputs)

print(outputs.logits.shape)

predictions = softmax(outputs.logits, dim=-1)
labels = argmax(predictions, dim=-1).tolist()

for prediction, label in zip(predictions, labels):
    print(model.config.id2label[label], prediction[label])

######################

torch.Size([2, 2])
POSITIVE tensor(0.9598, grad_fn=<SelectBackward0>)
NEGATIVE tensor(0.9995, grad_fn=<SelectBackward0>)


Very likely your code does not work if you use the class `AutoModel` instead of `AutoModelForSequenceClassification`. Why not?



# Training

We start by loading a dataset with the `datasets` library.

In [11]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)

raw_train_dataset = raw_datasets['train']
print(raw_train_dataset.features.keys())

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
dict_keys(['sentence1', 'sentence2', 'label', 'idx'])


GLUE MRPC is a classification task: (sentence_1, sentence_2) -> Are they semantically equivalent?

By running the tokenizer we can observe that the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP]. The token_type_ids determine which part is sentence_1 and which part is sentence_2.

Do not worry if you don't see token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model.

In [12]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs)

print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']


Simply applying the tokenizer to the whole dataset would require storing the entire tokenized dataset in memory. Instead, we apply the `map()` function, which keeps the result as a `Dataset`; with the argument `batched=True`, the tokenizer processes a batch of samples per call rather than one sample at a time.

Since we do not pad at this stage, this also lets us use dynamic padding through a `DataCollatorWithPadding`, i.e. we pad to the maximum length within the batch and not within the entire dataset.
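To see why dynamic padding pays off, the sketch below counts how many token slots a model has to process under both strategies. The sequence lengths are hypothetical, and `padded_tokens` is an illustrative helper, not part of any library.

```python
def padded_tokens(lengths, batch_size, dynamic):
    """Total number of token slots (real + padding) over all batches."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding pads to the longest sample in this batch only.
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total

lengths = [12, 15, 9, 14, 60, 11, 13, 10]  # hypothetical token counts, one long outlier

static_cost = padded_tokens(lengths, batch_size=4, dynamic=False)
dynamic_cost = padded_tokens(lengths, batch_size=4, dynamic=True)

print(static_cost, dynamic_cost)
```

With a single long outlier, only the batch that contains it pays for its length; every other batch stays short.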

Define a `tokenize_function()` which we can then apply to the entire dataset. Don't forget, each sample is made of two sentences!

In [13]:
from transformers import DataCollatorWithPadding


### YOUR CODE HERE ###

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

######################

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True) #batched_true is the whole point of doing this!

samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Print shapes for one batch
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

Now that we can tokenize and pad batches dynamically, we can sequentially feed them into the model to train it. We use the high-level `Trainer` class.

In [14]:
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

training_args = TrainingArguments("test-trainer") #hyperparams for the training

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|------|---------------|
| 500  | 0.6317        |
| 1000 | 0.5855        |


TrainOutput(global_step=1377, training_loss=0.5773739506562897, metrics={'train_runtime': 182.4363, 'train_samples_per_second': 60.317, 'train_steps_per_second': 7.548, 'total_flos': 405324636337200.0, 'train_loss': 0.5773739506562897, 'epoch': 3.0})

Please notice that `Trainer` doesn't provide us with evaluation metrics by default. We can load the metrics associated with the dataset through the `evaluate` library and compute them for our model.


In [15]:
import numpy as np
import evaluate

predictions = trainer.predict(tokenized_datasets["validation"])
preds = np.argmax(predictions.predictions, axis=-1)

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)


{'accuracy': 0.7671568627450981, 'f1': 0.8300536672629696}
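For intuition about the two numbers above, here is a plain-Python sketch of accuracy and binary F1. It assumes label 1 is the positive class and uses made-up predictions; it is not the implementation used by `evaluate`.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy over all samples and F1 for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up predictions and labels for six sentence pairs.
preds = [1, 1, 0, 1, 0, 1]
refs = [1, 0, 0, 1, 1, 1]

scores = accuracy_and_f1(preds, refs)
print(scores)
```

F1 can be higher than accuracy when the positive class dominates, which is one reason to report both on an imbalanced dataset.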

Use `evaluate.load()` and `metric.compute()` to complete the `compute_metrics()` function. We can then pass it to our `Trainer` object to track such metrics during training.

In [16]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

### YOUR CODE HERE ###

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

######################


trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


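| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.501408        | 0.754902 | 0.813433 |
| 2     | 0.561700      | 0.624419        | 0.813725 | 0.878205 |
| 3     | 0.363200      | 0.585781        | 0.862745 | 0.90378  |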


TrainOutput(global_step=1377, training_loss=0.4036487920092082, metrics={'train_runtime': 200.8858, 'train_samples_per_second': 54.777, 'train_steps_per_second': 6.855, 'total_flos': 405540469624800.0, 'train_loss': 0.4036487920092082, 'epoch': 3.0})

# Behind the Trainer

We go one level below the `Trainer` class and implement the training procedure in PyTorch.

Starting from the same setup:

In [17]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We set up the PyTorch DataLoaders:

In [18]:
from torch.utils.data import DataLoader

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

We set up an optimizer and a learning-rate scheduler:

In [19]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
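The number 1377 and the shape of the schedule can be reproduced by hand. The sketch below assumes the 3668 training samples reported earlier, a batch size of 8, and a linear decay to zero; `linear_lr` is a hypothetical helper that mirrors the idea of the `"linear"` schedule, not the library code itself.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

num_train_samples = 3668
batch_size = 8
num_epochs = 3

steps_per_epoch = math.ceil(num_train_samples / batch_size)  # the last batch is smaller
num_training_steps = num_epochs * steps_per_epoch

print(num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```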




Now write the PyTorch code inside the training loop.

In [20]:
from tqdm.auto import tqdm
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:

        ### YOUR CODE HERE ###

        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

        ######################


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Does it work? Congrats! You implemented a training loop from scratch and can now run the final evaluation. You should get results similar to those of the `Trainer` class.

In [21]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}

# Parameter-Efficient Fine-Tuning

You're given the same setup:

In [22]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification


raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We provide you with these two functions for counting trainable parameters and defining evaluation metrics.

In [23]:
def count_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {'Total': total_params, 'Trainable': trainable_params}

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Use LoRA by defining a configuration (a `LoraConfig` object) and use it with the `Trainer` class to efficiently fine-tune our model.

In [24]:
from peft import LoraConfig, TaskType, get_peft_model
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np


### YOUR CODE HERE ###

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,  # Dimension of the low-rank matrices
    lora_alpha=16,  # Scaling factor for the weight matrices
    lora_dropout=0.1,  # Dropout probability of the LoRA layers
    bias="all",  # Train all bias parameters
)

######################


# Count parameters before LoRA
before_lora_count = count_parameters(model)
print(f"Before LoRA:\n{before_lora_count}")

# Apply LoRA to the model
lora_model = get_peft_model(model, lora_config)

# Count parameters after LoRA
after_lora_count = count_parameters(lora_model)
print(f"After LoRA:\n{after_lora_count}")

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")  # Hyperparams for the training

### YOUR CODE HERE ###

trainer = Trainer(
    lora_model,  # Use the LoRA model here
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

######################

trainer.train()

Before LoRA:
{'Total': 109483778, 'Trainable': 109483778}
After LoRA:
{'Total': 110075140, 'Trainable': 694274}


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


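| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.5741          | 0.696078 | 0.809816 |
| 2     | 0.604900      | 0.554088        | 0.710784 | 0.814465 |
| 3     | 0.550500      | 0.533752        | 0.715686 | 0.808581 |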


TrainOutput(global_step=1377, training_loss=0.5646672134565632, metrics={'train_runtime': 119.3336, 'train_samples_per_second': 92.212, 'train_steps_per_second': 11.539, 'total_flos': 408340545040320.0, 'train_loss': 0.5646672134565632, 'epoch': 3.0})
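The parameter counts printed before training can be reconstructed by hand, which is a useful sanity check on what LoRA actually trains. The sketch below rests on assumptions we did not verify in this notebook: that LoRA is applied to the query and value projections of each of the 12 BERT-base layers, and that the classification head is kept as an extra trainable copy. The arithmetic happens to reproduce the printed numbers, which supports these assumptions without proving them.

```python
hidden = 768          # BERT-base hidden size
intermediate = 3072   # BERT-base feed-forward size
num_layers = 12
r = 16                # LoRA rank used in the config above
adapted_per_layer = 2  # assumption: query and value projections

# Each adapted matrix gets two factors: A (r x hidden) and B (hidden x r).
lora_params = num_layers * adapted_per_layer * (r * hidden + hidden * r)

# Classification head (768 -> 2) plus its bias, assumed to be a trainable copy.
classifier_params = hidden * 2 + 2

# bias="all": every bias vector in the base model stays trainable.
bias_per_layer = (
    4 * hidden        # query, key, value, attention output
    + 2 * hidden      # two LayerNorms
    + intermediate    # feed-forward up-projection
    + hidden          # feed-forward down-projection
)
bias_params = num_layers * bias_per_layer + hidden + hidden  # + embedding LayerNorm + pooler

trainable = lora_params + classifier_params + bias_params
added_total = lora_params + classifier_params

print(trainable, added_total, f"{trainable / (109483778 + added_total):.2%}")
```

Under these assumptions well below 1% of the parameters are trained, in line with the parameter-efficient tuning claim in the Questions section.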