## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

### Exercise 1: Sentiment Analysis (warm up)

In this first exercise we will start from a pre-trained BERT transformer and build up a model able to perform text sentiment analysis. Transformers are complex beasts, so we will build up our pipeline in several explorative and incremental steps.

#### Exercise 1.1: Dataset Splits and Pre-trained model
There are many sentiment analysis datasets, but we will use one of the smallest ones available: the [Cornell Rotten Tomatoes movie review dataset](cornell-movie-review-data/rotten_tomatoes), which consists of 5,331 positive and 5,331 negative processed sentences from the Rotten Tomatoes movie reviews.

**Your first task**: Load the dataset and figure out what splits are available and how to get them. Spend some time exploring the dataset to see how it is organized. Note that we will be using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/en/index) library for downloading, accessing, splitting, and batching data for training and evaluation.

In [1]:
from datasets import load_dataset, get_dataset_split_names

# Your code here.
ds_id = "cornell-movie-review-data/rotten_tomatoes"
ds_train = load_dataset(ds_id, split = "train")
ds_test = load_dataset(ds_id, split= "test")
ds_validation = load_dataset(ds_id, split="validation")

In [2]:
#inspect dataset without downloading it
from datasets import load_dataset_builder
ds_builder = load_dataset_builder(ds_id)

In [3]:
print(ds_builder.info)
print(ds_builder.info.features)
print(get_dataset_split_names(ds_id))

print(ds_test[0]['text'])
print(ds_test[0]['label'])

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, builder_name='parquet', dataset_name='rotten_tomatoes', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1075873, num_examples=8530, shard_lengths=None, dataset_name='rotten_tomatoes'), 'validation': SplitInfo(name='validation', num_bytes=134809, num_examples=1066, shard_lengths=None, dataset_name='rotten_tomatoes'), 'test': SplitInfo(name='test', num_bytes=136102, num_examples=1066, shard_lengths=None, dataset_name='rotten_tomatoes')}, download_checksums={'hf://datasets/cornell-movie-review-data/rotten_tomatoes@aa13bc287fa6fcab6daf52f0dfb9994269ffea28/train.parquet': {'num_bytes': 698845, 'checksum': None}, 'hf://datasets/cornell-movie-review-data/rotten_tomatoes@aa13bc287fa6fcab6daf52f0dfb9994269ffea28/validation.parquet': {'num_bytes':

In [4]:
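Classlabels = ['neg','pos']
# Print examples from the start and from the end of the test split.
# Note: for i == 0, ds_test[-i] is ds_test[0], so the first example is printed twice.
for i in range(10):
    print(ds_test[i]['text'])
    print(Classlabels[ds_test[i]['label']])

    print(ds_test[-i]['text'])
    print(Classlabels[ds_test[-i]['label']])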

lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
pos
lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
pos
consistently clever and suspenseful .
pos
enigma is well-made , but it's just too dry and too placid .
neg
it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
pos
the thing looks like a made-for-home-video quickie .
neg
the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
pos
as it stands , crocodile hunter has the hurried , badly cobbled look of the 1959 godzilla , which combined scenes of a japanese monster flick with canned shots of raymond burr commenting on the monster's path of destruction .
neg
red dragon " never cuts corners .
pos
there are many de

In [5]:
ds_train = load_dataset(ds_id, split='train')
ds_val = load_dataset(ds_id, split='validation')
ds_test = load_dataset(ds_id, split = 'test')


In [6]:
print(ds_val)
print(ds_train)
print(ds_test)

Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})


In [7]:
print(ds_train[0]['text'])
print(ds_train[0]['label'])

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
1


#### Exercise 1.2: A Pre-trained BERT and Tokenizer

The model we will use is a *very* small BERT transformer called [Distilbert](https://huggingface.co/distilbert/distilbert-base-uncased). This model was trained (using self-supervised learning) on the same corpus as BERT, but using the full BERT base model as a *teacher*.

**Your next task**: Load the Distilbert model and corresponding tokenizer. Use the tokenizer on a few samples from the dataset and pass the tokens through the model to see what outputs are provided. I suggest you use the [`AutoModel`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class (and the `from_pretrained()` method) to load the model and `AutoTokenizer` to load the tokenizer.

In [8]:
from transformers import AutoTokenizer, AutoModel

device = 'cuda'

model_id = "distilbert/distilbert-base-uncased"
model = AutoModel.from_pretrained(model_id, output_hidden_states = True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

for i in range(3):
    example = ds_train[i]
    inputs = tokenizer(ds_train[i]['text'], return_tensors = 'pt').to(device)
    outputs = model(**inputs)
    
    print('input:',example)
    print('last hidden state:', outputs[0])
    # Note: the output has only two fields here (last_hidden_state, hidden_states),
    # so outputs[-2] is the same tensor as outputs[0]; a true intermediate layer
    # would be outputs.hidden_states[-2].
    print('intermediate hidden state:', outputs[-2])

print(type(outputs))


input: {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
last hidden state: tensor([[[-0.0332, -0.0168,  0.0194,  ...,  0.0476,  0.5834,  0.3036],
         [-0.0235, -0.0555, -0.3638,  ...,  0.1877,  0.5781, -0.1577],
         [-0.0516, -0.1014, -0.1511,  ...,  0.1503,  0.2649, -0.1575],
         ...,
         [-0.2214,  0.0666, -0.1378,  ...,  0.0319,  0.0833, -0.2145],
         [ 0.6647,  0.2524,  0.0299,  ...,  0.0841, -0.4030, -0.4060],
         [ 0.3342,  0.5060,  0.4131,  ...,  0.1109, -0.2385, -0.2486]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>)
intermediate hidden state: tensor([[[-0.0332, -0.0168,  0.0194,  ...,  0.0476,  0.5834,  0.3036],
         [-0.0235, -0.0555, -0.3638,  ...,  0.1877,  0.5781, -0.1577],
         [-0.0516, -0.1014, -0.1511,  ...,  0.1503,  0.2649, -0.1575],
         ...,
         [-0.2

In [9]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
inputs = tokenizer(ds_train[:2]['text'], padding = True, return_tensors= "pt").to('cuda')
 
outputs = model(**inputs)

In [10]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.0332, -0.0168,  0.0194,  ...,  0.0476,  0.5834,  0.3036],
         [-0.0235, -0.0555, -0.3638,  ...,  0.1877,  0.5781, -0.1577],
         [-0.0516, -0.1014, -0.1511,  ...,  0.1503,  0.2649, -0.1575],
         ...,
         [ 0.3688, -0.1147,  0.8428,  ..., -0.0708, -0.0178, -0.2516],
         [ 0.0654, -0.0206,  0.1889,  ...,  0.1159,  0.2323, -0.2404],
         [ 0.0373, -0.0104,  0.1203,  ...,  0.1049,  0.2852, -0.3035]],

        [[-0.2062, -0.0490, -0.4036,  ..., -0.1186,  0.6141,  0.3919],
         [-0.4361, -0.1647, -0.3533,  ...,  0.1086,  0.9478, -0.0272],
         [-0.1164,  0.1690,  0.2698,  ..., -0.1971,  0.4372,  0.2527],
         ...,
         [-0.2341,  0.4810, -0.2634,  ..., -0.3397,  0.2567,  0.1274],
         [ 0.7139,  0.0574, -0.3260,  ...,  0.2041, -0.3800, -0.3343],
         [ 0.5649,  0.2806, -0.0295,  ...,  0.1297, -0.3160, -0.1874]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), hidden_states=(tensor

In [11]:
hidden_states = outputs['last_hidden_state']


In [12]:
tokenizer.decode(inputs['input_ids'][0])
# [CLS] is the first token in the last hidden state; that is the one I want to use for sentiment analysis

'[CLS] the rock is destined to be the 21st century \' s new " conan " and that he \' s going to make a splash even greater than arnold schwarzenegger, jean - claud van damme or steven segal. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]'
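
The `[PAD]` tokens at the end of the decoded string appear because the two sentences in the batch have different lengths and `padding=True` pads the shorter one. Here is a minimal pure-Python sketch of that idea. It assumes a pad id of 0, which is what `[PAD]` maps to in the BERT uncased vocabulary. `pad_batch` is a hypothetical helper, not a library function.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up "sentences" of different lengths (101 = [CLS], 102 = [SEP]).
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The zeros in the attention mask are what tell the model to ignore the padded positions.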

#### Exercise 1.3: A Stable Baseline

In this exercise I want you to:
1. Use Distilbert as a *feature extractor* to extract representations of the text strings from the dataset splits;
2. Train a classifier (your choice, but an SVM from Scikit-learn is an easy choice);
3. Evaluate performance on the validation and test splits.

These results are our *stable baseline* -- the **starting** point on which we will (hopefully) improve in the next exercise.

**Hint**: There are a number of ways to implement the feature extractor, but probably the best is to use a [feature extraction `pipeline`](https://huggingface.co/tasks/feature-extraction). You will need to interpret the output of the pipeline and extract only the `[CLS]` token from the *last* transformer layer. *How can you figure out which output that is?*
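
One way to approach the question in the hint: the last hidden state has shape `(batch, sequence_length, hidden_size)`, and the decoded input in Exercise 1.2 shows that `[CLS]` is always the first token. Below is a small NumPy sketch with a random stand-in array. The shapes and the array itself are made up; only the indexing is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs.last_hidden_state: 2 sentences, 6 tokens, hidden size 8.
last_hidden_state = rng.normal(size=(2, 6, 8))

# [CLS] is the token at position 0 of every sequence.
cls_features = last_hidden_state[:, 0, :]

print(cls_features.shape)  # one feature vector per sentence
```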

In [13]:
# Tokenize in batches
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_dataset = ds_train.map(tokenize_function, batched=True)

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModel
import numpy as np

model_id = "distilbert/distilbert-base-uncased"



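def extract_features(model_id, ds, index_hidden_layer):
    # Build a feature-extraction pipeline around the pre-trained model.
    model = AutoModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    feature_extractor = pipeline('feature-extraction', model = model, tokenizer = tokenizer)

    hidden_size = model.config.hidden_size

    # One row of [CLS] features per example.
    cls_tokens = np.zeros((len(ds), hidden_size))
    labels = ds['label']

    for index in range(len(ds)):
        text = ds[index]['text']

        # The pipeline returns the last hidden state with shape (1, seq_len, hidden_size);
        # index_hidden_layer=0 selects the single batch element, not a layer.
        features = feature_extractor(text,return_tensors = "pt")
        features = features[index_hidden_layer].numpy()

        # Position 0 of the sequence is the [CLS] token.
        cls_tokens[index,:] = features[0,:]

    return(cls_tokens,labels)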


In [15]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report


X_train, Y_train = extract_features(model_id, ds_train, 0)
svc = LinearSVC()
svc.fit(X_train, Y_train)

print('training results')
print(classification_report(Y_train, svc.predict(X_train)))

X_test, Y_test = extract_features(model_id, ds_test, 0)
print('test results')
print(classification_report(Y_test, svc.predict(X_test)))

X_val, Y_val = extract_features(model_id, ds_validation, 0)
print('validation results')
print(classification_report(Y_val, svc.predict(X_val)))

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


training results
              precision    recall  f1-score   support

           0       0.84      0.86      0.85      4265
           1       0.86      0.84      0.85      4265

    accuracy                           0.85      8530
   macro avg       0.85      0.85      0.85      8530
weighted avg       0.85      0.85      0.85      8530



Device set to use cuda:0


test results
              precision    recall  f1-score   support

           0       0.79      0.81      0.80       533
           1       0.81      0.78      0.79       533

    accuracy                           0.80      1066
   macro avg       0.80      0.80      0.80      1066
weighted avg       0.80      0.80      0.80      1066



Device set to use cuda:0


validation results
              precision    recall  f1-score   support

           0       0.81      0.84      0.83       533
           1       0.84      0.80      0.82       533

    accuracy                           0.82      1066
   macro avg       0.82      0.82      0.82      1066
weighted avg       0.82      0.82      0.82      1066



-----
### Exercise 2: Fine-tuning Distilbert

In this exercise we will fine-tune the Distilbert model to (hopefully) improve sentiment analysis performance.

#### Exercise 2.1: Token Preprocessing

The first thing we need to do is *tokenize* our dataset splits. Our current datasets return a dictionary with *strings*, but we want *input token ids* (i.e. the output of the tokenizer). This is easy enough to do by hand, but the HuggingFace `Dataset` class provides convenient, efficient, and *lazy* methods. See the documentation for [`Dataset.map`](https://huggingface.co/docs/datasets/v3.5.0/en/package_reference/main_classes#datasets.Dataset.map).

**Tip**: Verify that your new datasets are returning for every element: `text`, `label`, `input_ids`, and `attention_mask`.
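
To make the tip concrete without downloading anything, here is a sketch of what mapping a tokenizer over a dataset produces. `toy_tokenize` and its whitespace vocabulary are hypothetical stand-ins for the real tokenizer, and a plain list of dicts stands in for a `Dataset`. The point is only that each mapped row keeps its original fields and gains flat `input_ids` and `attention_mask` lists.

```python
# Hypothetical stand-in vocabulary; ids 101/102 mimic [CLS]/[SEP].
VOCAB = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100, "a": 1, "great": 2, "movie": 3}


def toy_tokenize(text):
    """Whitespace 'tokenizer' that returns the same keys as a real one."""
    tokens = ["[CLS]"] + text.split() + ["[SEP]"]
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def map_rows(rows, fn):
    """Mimic a per-example map: merge the new fields into each row."""
    return [{**row, **fn(row["text"])} for row in rows]


rows = [{"text": "a great movie", "label": 1}, {"text": "a dull movie", "label": 0}]
tokenized = map_rows(rows, toy_tokenize)

for row in tokenized:
    # Every element should expose exactly these four fields.
    assert set(row) == {"text", "label", "input_ids", "attention_mask"}
print(tokenized[1]["input_ids"])
```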

In [16]:

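def preprocess_function(examples):
    # Note: without batched=True, map() passes one example at a time, and
    # return_tensors='pt' adds a batch dimension, so input_ids and attention_mask
    # come back as nested lists (see the output below).
    return tokenizer(examples['text'], padding=True, return_tensors='pt')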

tokenized_train = ds_train.map(preprocess_function)
tokenized_val = ds_validation.map(preprocess_function)

In [17]:
print(tokenized_train[0]['text'])
print(tokenized_train[0]['label'])
print(tokenized_train[0]['input_ids'])
print(tokenized_train[0]['attention_mask'] )

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
1
[[101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102]]
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base Distilbert model for fine-tuning for a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.
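
For intuition, the appended head can be sketched in NumPy. The two layers below mirror the `pre_classifier` and `classifier` parameters of `DistilBertForSequenceClassification`. The sizes are shrunk, the weights are random, and dropout is omitted, so treat this as an illustration of the data flow rather than the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 8, 2  # DistilBERT itself uses hidden_size=768

# Randomly initialized head parameters (illustrative values only).
W_pre = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b_pre = np.zeros(hidden_size)
W_cls = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b_cls = np.zeros(num_labels)


def classification_head(cls_vectors):
    """Linear -> ReLU -> Linear on the [CLS] vectors, returning logits."""
    hidden = np.maximum(cls_vectors @ W_pre + b_pre, 0.0)
    return hidden @ W_cls + b_cls


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


cls_vectors = rng.normal(size=(4, hidden_size))  # stand-in for 4 sentences
probs = softmax(classification_head(cls_vectors))
print(probs.shape)
```

Because the head starts from random weights, its predictions are meaningless until the model is fine-tuned.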

In [18]:
from transformers import AutoModelForSequenceClassification 

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Exercise 2.3: Fine-tuning Distilbert

Finally, in this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and function to compute performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.
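
As a sketch of step 2, the evaluation function can be written with NumPy alone. The function name `compute_metrics_sketch` and the toy logits are assumptions for illustration. It treats label 1 as the positive class.

```python
import numpy as np


def compute_metrics_sketch(eval_pred):
    """Accuracy, precision, recall and F1 for binary labels (positive class = 1)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy logits for four examples; argmax gives predictions [1, 0, 1, 1].
logits = np.array([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [-0.5, 0.5]])
labels = np.array([1, 0, 0, 1])
metrics = compute_metrics_sketch((logits, labels))
print(metrics)
```

The keys of the returned `dict` are what the trainer reports, prefixed with `eval_`.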

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index), which provides evaluation metrics. However, I found that its insufferable layers of abstraction get in the way of getting actual metrics computed. I suggest just using the Scikit-learn metrics...

In [19]:
# Your code here.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

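import numpy as np
import evaluate  # HF library, must be installed separately

def compute_metrics(eval_pred):
    load_accuracy = evaluate.load('accuracy')  # the professor's function...
    # Minimal completion (an assumption): accuracy from the argmax of the logits.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return load_accuracy.compute(predictions=predictions, references=labels)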

In [20]:
from datasets import load_dataset, get_dataset_split_names, get_dataset_config_names, get_dataset_infos

print(get_dataset_config_names(ds_id))
print(get_dataset_infos(ds_id))

['default']
{'default': DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, builder_name='parquet', dataset_name='rotten_tomatoes', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1075873, num_examples=8530, shard_lengths=None, dataset_name='rotten_tomatoes'), 'validation': SplitInfo(name='validation', num_bytes=134809, num_examples=1066, shard_lengths=None, dataset_name='rotten_tomatoes'), 'test': SplitInfo(name='test', num_bytes=136102, num_examples=1066, shard_lengths=None, dataset_name='rotten_tomatoes')}, download_checksums={'hf://datasets/cornell-movie-review-data/rotten_tomatoes@aa13bc287fa6fcab6daf52f0dfb9994269ffea28/train.parquet': {'num_bytes': 698845, 'checksum': None}, 'hf://datasets/cornell-movie-review-data/rotten_tomatoes@aa13bc287fa6fcab6daf52f0dfb9994269ffea28/validation

In [21]:
from datasets import load_dataset, get_dataset_split_names
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorWithPadding
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Prepare and tokenize dataset
model_id = "distilbert/distilbert-base-uncased"
ds_id = "cornell-movie-review-data/rotten_tomatoes"



ds_train = load_dataset(ds_id, split='train')
ds_val = load_dataset(ds_id, split='validation')
ds_test = load_dataset(ds_id, split = 'test')


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def preprocess_function(examples):
    return tokenizer(examples['text'], padding=False)

tokenized_train = ds_train.map(preprocess_function)
tokenized_val = ds_val.map(preprocess_function)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)




def compute_metrics(eval_result):

    logits, y_true = eval_result
    y_pred = np.argmax(logits, axis=-1)

    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="binary")
    recall = recall_score(y_true, y_pred, average="binary")
    f1 = f1_score(y_true, y_pred, average="binary")

    report_dict = {'accuracy':accuracy,'precision':precision,'recall':recall,'f1_score':f1}

    return report_dict

output_dir = '/home/tommaso/Documents/deep_learning/lab3/bert_fine_tuning'
batch_size = 64
logging_steps = len(ds_train)//batch_size

training_args = TrainingArguments(output_dir=output_dir,
                                  eval_strategy="epoch", 
                                  learning_rate=2e-5, 
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  num_train_epochs= 10,
                                  logging_steps=logging_steps,
                                  use_cpu=False,
                                  metric_for_best_model='f1_score',
                                  load_best_model_at_end=True,
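                                  # Caution: F1 is a higher-is-better metric. With False, the
                                  # lowest-F1 checkpoint is treated as best; True is the usual choice.
                                  greater_is_better=False,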
                                  save_strategy='epoch'                         
                                  )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics, 
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2,)]
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 1 | 0.445 | 0.36651 | 0.84334 | 0.827957 | 0.866792 | 0.846929 |
| 2 | 0.284 | 0.354573 | 0.841463 | 0.868421 | 0.804878 | 0.835443 |
| 3 | 0.2027 | 0.394936 | 0.845216 | 0.821678 | 0.881801 | 0.850679 |
| 4 | 0.1358 | 0.443272 | 0.85272 | 0.871542 | 0.827392 | 0.848893 |


TrainOutput(global_step=536, training_loss=0.2655878844879456, metrics={'train_runtime': 116.4516, 'train_samples_per_second': 732.493, 'train_steps_per_second': 11.507, 'total_flos': 493462275475416.0, 'train_loss': 0.2655878844879456, 'epoch': 4.0})

In [22]:
trainer.evaluate()

{'eval_loss': 0.35457271337509155,
 'eval_accuracy': 0.8414634146341463,
 'eval_precision': 0.868421052631579,
 'eval_recall': 0.8048780487804879,
 'eval_f1_score': 0.8354430379746836,
 'eval_runtime': 1.1857,
 'eval_samples_per_second': 899.05,
 'eval_steps_per_second': 14.338,
 'epoch': 4.0}
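
The evaluation above returns the epoch-2 numbers, which have the lowest F1 in the training log. A simplified simulation of best-checkpoint selection with patience-based early stopping suggests why: with `greater_is_better=False`, a *lower* F1 counts as an improvement. The function below is a hypothetical model of that logic, not the `Trainer` implementation. The F1 values are copied from the log.

```python
def simulate_selection(scores, greater_is_better, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if training is never stopped."""
    best_score, best_epoch, bad_epochs = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        improved = best_score is None or (
            score > best_score if greater_is_better else score < best_score
        )
        if improved:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None


f1_per_epoch = [0.846929, 0.835443, 0.850679, 0.848893]

print(simulate_selection(f1_per_epoch, greater_is_better=False, patience=2))  # (2, 4)
print(simulate_selection(f1_per_epoch, greater_is_better=True, patience=2))   # (3, None)
```

Under this simplified model, `greater_is_better=True` would have kept the epoch-3 checkpoint as best and would not have stopped within the four logged epochs.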

-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* Distilbert model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune Distilbert on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?
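
One family of methods in PEFT is low-rank adaptation (LoRA): the pre-trained weight matrix stays frozen and only a low-rank update is trained. The NumPy sketch below illustrates the arithmetic for a single 768x768 layer. The rank `r=8` and scaling `alpha=16` are arbitrary choices for illustration, and this is not the PEFT API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 16

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))      # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init


def lora_forward(x):
    """y = x W^T + (alpha / r) * x A^T B^T; only A and B would receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))

# With B = 0 the adapted layer starts out identical to the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params, lora_params / full_params)
```

A simpler alternative that leaves the training pipeline untouched is to freeze the transformer body and train only the classification head.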

In [23]:
# Your code here.

#### Exercise 3.2: Fine-tuning a CLIP Model (harder)

Use a (small) CLIP model like [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16) and evaluate its zero-shot performance on a small image classification dataset like ImageNette or TinyImageNet. Fine-tune (using a parameter-efficient method!) the CLIP model to see how much improvement you can squeeze out of it.

**Note**: There are several ways to adapt the CLIP model; you could fine-tune the image encoder, the text encoder, or both. Or, you could experiment with prompt learning.

**Tip**: CLIP probably already works very well on ImageNet and ImageNet-like images. For extra fun, look for an image classification dataset with different image types (e.g. *sketches*).
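
For reference, zero-shot classification with a CLIP-style model boils down to comparing one image embedding against one text embedding per class prompt. The sketch below uses random vectors as stand-ins for the encoder outputs and an assumed temperature of 100, so the numbers are meaningless. Only the computation is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
class_names = ["dog", "cat", "truck"]  # hypothetical label set
embed_dim = 16

# Stand-ins for the text encoder applied to prompts like "a photo of a {class}".
text_emb = rng.normal(size=(len(class_names), embed_dim))
# Stand-in for an image embedding that lies close to the "cat" prompt.
image_emb = text_emb[1] + 0.1 * rng.normal(size=embed_dim)


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_probs(image_emb, text_emb, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and all prompts."""
    sims = l2_normalize(text_emb) @ l2_normalize(image_emb)
    logits = logit_scale * sims
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


probs = zero_shot_probs(image_emb, text_emb)
print(class_names[int(np.argmax(probs))])
```

Since the text embeddings depend only on the class prompts, they can be computed once and reused for every image.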

In [24]:
# Your code here. Don't just run forward passes here, it is too expensive: precompute the output matrix and the loss

#### Exercise 3.3: Choose your Own Adventure

There are a *ton* of interesting and fun models on the HuggingFace hub. Pick one that does something interesting and adapt it in some way to a new task. Or, combine two or more models into something more interesting or fun. The sky's the limit.

**Note**: Reach out to me by email or on the Discord if you are unsure about anything.

In [1]:
# NER task: extract gene names from articles
from datasets import load_dataset, get_dataset_split_names, get_dataset_infos, get_dataset_config_info


ds_id = ["bigbio/genetag", "genetaggold_bigbio_kb"]

infos = get_dataset_config_info(*ds_id)

print(get_dataset_split_names(*ds_id))
print(infos.description)
print(infos.features)
print(infos.features.keys())


['train', 'test', 'validation']
Named entity recognition (NER) is an important first step for text mining the biomedical literature.
Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus.
The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity
of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE®
sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A Competition..

{'id': Value(dtype='string', id=None), 'document_id': Value(dtype='string', id=None), 'passages': [{'id': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'offsets': Sequence(feature=[Value(dtype='int32', id=None)], length=-1, id=None)}], 'entities': [{'id': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'text': Sequence(f

In [2]:
ds_train = load_dataset(*ds_id, split='train')
ds_val = load_dataset(*ds_id, split='validation')
ds_test = load_dataset(*ds_id, split = 'test')

In [3]:
print(ds_train[0]['passages'])
print(ds_train[0]['entities'])

[{'id': '@@95229799480_text', 'type': 'sentence', 'text': ['Cervicovaginal foetal fibronectin in the prediction of preterm labour in a low-risk population .'], 'offsets': [[0, 96]]}]
[{'id': '@@95229799480_1', 'type': 'NEWGENE', 'text': ['foetal fibronectin'], 'offsets': [[15, 33]], 'normalized': []}]


In [3]:
from datasets import load_dataset, get_dataset_split_names
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, EarlyStoppingCallback
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import DataCollatorWithPadding, DataCollatorForTokenClassification




model_id = 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id,num_labels = 2)





Some weights of BertForTokenClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
def preprocess(instance):
    phrase = instance['passages'][0]['text'][0]  # string
    entity_offsets = [ent['offsets'][0] for ent in instance['entities']]

    tokenized = tokenizer(
        phrase,
        return_offsets_mapping=True,
        return_attention_mask=True,
    )

    labels = [0] * len(tokenized['input_ids'])

    for i, (start, end) in enumerate(tokenized['offset_mapping']):
        if start == end:
            labels[i] = -100
            continue
        for ent_start, ent_end in entity_offsets:
            if start >= ent_start and end <= ent_end:
                labels[i] = 1
                break

    return {
        'input_ids': tokenized['input_ids'],
        'attention_mask': tokenized['attention_mask'],
        'labels': labels
    }

        

tokenized_train = ds_train.map(preprocess, remove_columns=ds_train.column_names)
tokenized_vals = ds_val.map(preprocess, remove_columns=ds_val.column_names)
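
To see what `preprocess` computes, here is a self-contained sketch of the same offset-based labelling on the first training sentence. A whitespace splitter (`whitespace_offsets`, a hypothetical helper) stands in for the real subword tokenizer, so the token boundaries differ from PubMedBERT's, but the span test is the same.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def label_tokens(offsets, entity_spans):
    """Label a token 1 if it lies entirely inside any entity span, else 0."""
    labels = []
    for start, end in offsets:
        inside = any(start >= s and end <= e for s, e in entity_spans)
        labels.append(1 if inside else 0)
    return labels


text = "Cervicovaginal foetal fibronectin in the prediction of preterm labour in a low-risk population ."
entity_spans = [(15, 33)]  # 'foetal fibronectin', taken from the dataset example

offsets = whitespace_offsets(text)
labels = label_tokens(offsets, entity_spans)
print(list(zip([text[s:e] for s, e in offsets], labels)))
```

Only the two tokens inside the annotated span receive label 1. With a subword tokenizer the same entity would typically cover more tokens.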

In [5]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(eval_pred):
    logits = eval_pred.predictions   # [B, T, 2]
    labels = eval_pred.label_ids     # [B, T]

    # Predictions: argmax over the class dimension
    pred_ids = np.argmax(logits, axis=-1)  # shape: [B, T]

    # Flatten
    pred_ids = pred_ids.reshape(-1)
    labels = labels.reshape(-1)

    # Mask to ignore padding (-100)
    mask = labels != -100
    pred_ids = pred_ids[mask]
    labels = labels[mask]

    # Compute metrics (binary, because there are only 2 classes)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, pred_ids, average="binary", zero_division=0
    )
    accuracy = accuracy_score(labels, pred_ids)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


In [6]:
output_dir = '/home/tommaso/Documents/deep_learning/lab3/NER_fine_tuning'
batch_size = 16
logging_steps = len(ds_train)//batch_size

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, padding = 'longest', return_tensors='pt')


training_args = TrainingArguments(output_dir=output_dir,
                                  eval_strategy="epoch", 
                                  learning_rate=2e-5, 
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  num_train_epochs= 10,
                                  logging_steps=logging_steps,
                                  use_cpu=False,
                                  metric_for_best_model='f1',
                                  load_best_model_at_end=True,
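                                  # Caution: F1 is a higher-is-better metric. With False, the
                                  # lowest-F1 checkpoint is treated as best; True is the usual choice.
                                  greater_is_better=False,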
                                  save_strategy='epoch',
                                  fp16=True                      
                                  )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_vals,
    compute_metrics=compute_metrics, 
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=4,)]
)



trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mtommaso-ducci1[0m ([33mtommaso-ducci1-university-of-florence[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1093 | 0.070608 | 0.974505 | 0.86989 | 0.934348 | 0.900968 |
| 2 | 0.05 | 0.067243 | 0.975055 | 0.868076 | 0.942222 | 0.903631 |
| 3 | 0.031 | 0.08152 | 0.97688 | 0.907505 | 0.906085 | 0.906795 |
| 4 | 0.0179 | 0.099117 | 0.97688 | 0.908529 | 0.904834 | 0.906678 |
| 5 | 0.0113 | 0.121359 | 0.976537 | 0.892331 | 0.922251 | 0.907044 |


TrainOutput(global_step=2345, training_loss=0.0438267993544147, metrics={'train_runtime': 227.7395, 'train_samples_per_second': 329.324, 'train_steps_per_second': 20.594, 'total_flos': 1356344512455600.0, 'train_loss': 0.0438267993544147, 'epoch': 5.0})

In [7]:
trainer.evaluate()

{'eval_loss': 0.07060759514570236,
 'eval_accuracy': 0.9745050193849878,
 'eval_precision': 0.8698902806097679,
 'eval_recall': 0.9343484382333003,
 'eval_f1': 0.9009679446888749,
 'eval_runtime': 5.5247,
 'eval_samples_per_second': 905.028,
 'eval_steps_per_second': 56.655,
 'epoch': 5.0}

In [8]:
text = [ds_test[i]['passages'][0]['text'] for i in range(len(ds_test))]

In [9]:
from transformers import pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
ner(text[1][0])

Device set to use cuda:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'LABEL_1',
  'score': np.float32(0.99024886),
  'word': 'large t antigen',
  'start': 0,
  'end': 15},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.92212325),
  'word': 'was coimmunoprecipitated by antibodies to epitope - tagged',
  'start': 16,
  'end': 72},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.966347),
  'word': 'tbp',
  'start': 73,
  'end': 76},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.8565054),
  'word': ', endogenous',
  'start': 77,
  'end': 89},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.7112304),
  'word': 'tbp',
  'start': 90,
  'end': 93},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.9931629),
  'word': ',',
  'start': 94,
  'end': 95},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.97746205),
  'word': 'htaf ( ii ) 100',
  'start': 96,
  'end': 111},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.9912717),
  'word': ',',
  'start': 112,
  'end': 113},
 {'entity_group': 'LABEL_1',
  'score': 

In [None]:
def color_text_spans(texts, color_code="\033[91m"):
    ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    reset = "\033[0m"
    
    output = []
    for j in range(len(texts)):
        paragraph = texts[j][0]  # because it is a list of lists
        entities = ner(paragraph)
        
        spans = []
        for ent in entities:
            if ent['entity_group'] == 'LABEL_1':  # change the label if you want
                spans.append((ent['start'], ent['end']))
        
        # Rebuild the paragraph with the colors
        colored = ""
        last_end = 0
        for start, end in sorted(spans):
            colored += paragraph[last_end:start]
            colored += f"{color_code}{paragraph[start:end]}{reset}"
            last_end = end
        colored += paragraph[last_end:]
        
        output.append(colored)
    
    # Join the paragraphs back together, separated by \n
    final_output = "\n".join(output)
    return final_output

In [12]:
print(color_text_spans(text))

Device set to use cuda:0


SETTING : University hospital-based , tertiary care infertility center .
[91mLarge T antigen[0m was coimmunoprecipitated by antibodies to epitope-tagged [91mTBP[0m , endogenous [91mTBP[0m , [91mhTAF ( II ) 100[0m , [91mhTAF ( II ) 130[0m , and [91mhTAF ( II ) 250[0m , under conditions where [91mholo-TFIID[0m would be precipitated .
CONCLUSIONS : This randomized study shows that Vivostat fibrin sealant is effective in preventing air leakage after small lung resections in pigs , even at high inspiratory pressures .
We propose a model in which [91mSro7[0m function is involved in the targeting of the [91mmyosin proteins[0m to their intrinsic pathways .
The response properties of cat horizontal canal afferents ( N = 81 ) were characterized by three parameters : their long time constants ( tau ) , low frequency gain constants ( G1 ) , and middle frequency gain constants ( Gm ) .
Glycogen synthesis and catabolism , gluconeogenesis , glycolysis , motility , cell surface prope