# Hugging Face

Today we will play a little with Hugging Face.

## Load dataset

We will use the built-in `emotion` dataset for this experiment.

In [3]:
import datasets
emotions = datasets.load_dataset('emotion')

In [4]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

The first order of business is to select the training split. This is very easy in Hugging Face datasets.

In [5]:
train_ds = emotions['train']
print(len(train_ds))
print(train_ds.features)
print(train_ds['text'][:5])

16000
{'text': Value('string'), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}
['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy']


Just following the book, I'll take a look at how to transform this into pandas, although I'll skip the pandas operations (this is a Torch/Hugging Face tutorial, not a pandas one).

In [6]:
import pandas as pd
emotions.set_format(type = 'pandas')
df = emotions['train'][:] # Slicing with [:] returns all the rows as a DataFrame
df.head()

|   | text | label |
|---|------|-------|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
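
The `label` column holds integer class ids. As a small sketch, assuming the class order printed in `train_ds.features` above, the ids from the `df.head()` output can be mapped back to names with a plain list lookup (the `datasets` library also offers `ClassLabel.int2str` for the same purpose):

```python
# Class names copied from the ClassLabel feature printed earlier
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Labels of the first five training examples, as shown by df.head()
labels = [0, 0, 3, 2, 3]

# Plain list lookup: the id is the position of the name in label_names
label_strings = [label_names[i] for i in labels]
print(label_strings)
```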


Great, let's get back to the original dataset format.

In [7]:
emotions.reset_format()

## Tokenization

We will now explore transformers' AutoTokenizer.

In [8]:
from transformers import AutoTokenizer

model_ckpt = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Let us just test it on some random text.

In [9]:
text = 'i am as constant as the northern star'
encoded_text = tokenizer(text)
print(encoded_text)

{'input_ids': [101, 1045, 2572, 2004, 5377, 2004, 1996, 2642, 2732, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


We can see which token each id corresponds to.

In [10]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

['[CLS]', 'i', 'am', 'as', 'constant', 'as', 'the', 'northern', 'star', '[SEP]']


We follow the book in making a tokenization function, which will be later used for mapping.

In [11]:
def tokenize(batch):
    return tokenizer(batch['text'], padding = True, truncation = True)
print(tokenize(emotions['train'][:2]))

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


Notice that every sequence is padded to the length of the longest document in the batch. If we set `padding` to False, that would not happen. Similarly, documents longer than the maximum context length are truncated. We now tokenize the whole dataset, including the validation and test splits. Since each split is processed as a single batch, every sequence in a split is padded to the longest document of that split.
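
To make padding and truncation concrete, here is a toy sketch in plain Python. The helper `pad_and_truncate` is hypothetical and is not how the tokenizer is implemented (a real tokenizer, for instance, keeps the final `[SEP]` token when truncating, while this version simply cuts); it only shows how `input_ids` and `attention_mask` relate.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy padding/truncation; not the tokenizer's real implementation."""
    # Cut every sequence to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch_ids]
    # Pad up to the longest (already truncated) sequence in this batch
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids))
                      for ids in truncated]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = [[101, 1045, 2134, 102], [101, 1045, 2064, 2175, 2013, 3110, 102]]
out = pad_and_truncate(batch, max_length=6)
print(out['input_ids'])
print(out['attention_mask'])
```

The short sequence gets two padding ids and two zeros in its mask, while the long one loses its last token and has a mask of all ones.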

In [12]:
# Setting batched to True below makes batches, but as batch_size is None, it applies
# to the whole dataset
emotions_encoded = emotions.map(tokenize, batched = True, batch_size = None) 

Map: 100%|██████████| 16000/16000 [00:00<00:00, 32282.84 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 48327.60 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 52123.55 examples/s]


We can check the new columns.

In [13]:
print(emotions_encoded['train'].column_names)

['text', 'label', 'input_ids', 'attention_mask']


Let us look at the first encoded example.

In [14]:
emotions_encoded['train']['input_ids'][0]

[101,
 1045,
 2134,
 2102,
 2514,
 26608,
 102,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

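This indeed matches what we computed above, except that the sequence is now padded to the longest document in the training split rather than to the longest of the two-example batch.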

## Feature extraction

In feature extraction, we do not fine-tune the model weights, but only train an independent classifier on top of the model's hidden states.

In [33]:
from transformers import AutoModel
import torch
device =  torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModel.from_pretrained(model_ckpt).to(device)


Again, let us try applying it to some text.

In [44]:
inputs = tokenizer(text, return_tensors = 'pt') # 'pt' returns PyTorch tensors
print(inputs['input_ids'].size())
inputs = {k:v.to(device) for k,v in inputs.items()}
print(inputs)
with torch.no_grad(): # Disables gradient tracking, which makes inference
                    # quicker and more memory efficient
    # As we do dictionary unpacking below, the keys must agree
    # with the parameter names of the model's forward method
    outputs = model(**inputs)

print(outputs)

torch.Size([1, 10])
{'input_ids': tensor([[ 101, 1045, 2572, 2004, 5377, 2004, 1996, 2642, 2732,  102]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
BaseModelOutput(last_hidden_state=tensor([[[-0.0208,  0.1081,  0.0541,  ...,  0.0513,  0.3912,  0.4102],
         [ 0.2591, -0.0090, -0.0607,  ..., -0.1286,  0.3101,  0.5758],
         [-0.0264,  0.1551,  0.3717,  ..., -0.2704,  0.0127,  0.5160],
         ...,
         [ 0.2663, -0.0777,  0.4344,  ...,  0.0843,  0.3669, -0.0063],
         [ 0.3281, -0.0183,  0.0516,  ...,  0.0531,  0.2275, -0.1374],
         [ 0.9452,  0.1411, -0.1522,  ...,  0.1630, -0.6912, -0.2150]]],
       device='cuda:0'), hidden_states=None, attentions=None)
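
The `model(**inputs)` call relies on dictionary unpacking, so every key must be a parameter name of the function being called. A minimal sketch in plain Python, where `forward` is a hypothetical stand-in and not the real model:

```python
def forward(input_ids, attention_mask=None):
    # Hypothetical stand-in for a model's forward method
    return {'n_tokens': len(input_ids),
            'n_attended': sum(attention_mask or [])}

inputs = {'input_ids': [101, 1045, 102], 'attention_mask': [1, 1, 1]}
# Equivalent to forward(input_ids=[...], attention_mask=[...])
result = forward(**inputs)
print(result)

# A key that is not a parameter name raises a TypeError
try:
    forward(**{'input_ids': [101], 'text': 'hello'})
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)
```

This is why it helps to keep only the keys listed in `tokenizer.model_input_names` before calling the model on a batch that also contains columns such as `text` or `label`.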


We can inspect the last hidden state.

In [45]:
outputs.last_hidden_state

tensor([[[-0.0208,  0.1081,  0.0541,  ...,  0.0513,  0.3912,  0.4102],
         [ 0.2591, -0.0090, -0.0607,  ..., -0.1286,  0.3101,  0.5758],
         [-0.0264,  0.1551,  0.3717,  ..., -0.2704,  0.0127,  0.5160],
         ...,
         [ 0.2663, -0.0777,  0.4344,  ...,  0.0843,  0.3669, -0.0063],
         [ 0.3281, -0.0183,  0.0516,  ...,  0.0531,  0.2275, -0.1374],
         [ 0.9452,  0.1411, -0.1522,  ...,  0.1630, -0.6912, -0.2150]]],
       device='cuda:0')

Now we apply it to the whole dataset.

In [47]:
def extract_hidden_states(batch):
    inputs = {k:v.to(device) for k,v in batch.items()
              if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # To use map, we need tensors in numpy
    return {'hidden_state': last_hidden_state[:,0].cpu().numpy()}

# The model expects tensors as input, so we set the dataset format to torch
emotions_encoded.set_format('torch',
                            columns = ['input_ids', 'attention_mask', 'label'])

# Finally apply map
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched = True)
emotions_hidden

Map: 100%|██████████| 16000/16000 [01:02<00:00, 257.55 examples/s]
Map: 100%|██████████| 2000/2000 [00:06<00:00, 296.33 examples/s]
Map: 100%|██████████| 2000/2000 [00:05<00:00, 351.62 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask', 'hidden_state'],
        num_rows: 2000
    })
})
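
The indexing `last_hidden_state[:, 0]` in `extract_hidden_states` keeps, for every example, only the hidden vector at position 0, which is the `[CLS]` token. A small NumPy sketch with made-up toy sizes (not DistilBERT's real dimensions) shows the change in shape:

```python
import numpy as np

# Toy sizes chosen for illustration only
batch_size, seq_len, hidden_dim = 2, 5, 4
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_dim))

# Keep only the vector at position 0 (the [CLS] token) for every example
cls_vectors = last_hidden_state[:, 0]
print(last_hidden_state.shape, cls_vectors.shape)
```

Each example is thus reduced from one vector per token to a single vector, which is what the classifier receives as features.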

Now we will fit a sklearn logistic regression classifier on the hidden states.

In [49]:
import numpy as np

X_train = np.array(emotions_hidden['train']['hidden_state'])
X_val = np.array(emotions_hidden['validation']['hidden_state'])
y_train = np.array(emotions_hidden['train']['label'])
y_val = np.array(emotions_hidden['validation']['label'])

Now sklearn (go Inria!)

In [52]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
log.fit(X_train, y_train)
print(log.score(X_val, y_val))

0.613


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
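
The warning above suggests scaling the data and raising `max_iter`. A minimal sketch of both, on synthetic data standing in for the hidden states (the data, the pipeline and the value `max_iter=1000` are assumptions for illustration, and the score printed here says nothing about the emotion dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features: 3 well-separated classes
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
y = np.repeat([0, 1, 2], 100)

# Scale the features and allow more iterations than the default max_iter=100
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```

On the real data, the same pipeline would be fitted on `X_train, y_train` and scored on `X_val, y_val`.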


## Fine tuning

In fine tuning, we update the pretrained model's parameters by training on the new dataset. First, we again follow the book and define a function to compute the metrics.

In [15]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average = 'weighted')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1}
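
To see what `average = 'weighted'` means, here is a sketch in plain Python that computes the F1 score of each class and averages them with weights equal to the number of true examples of each class. The helper `weighted_f1` and the toy labels are made up for illustration; this is not sklearn's implementation.

```python
from collections import Counter

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(labels)
    total = 0.0
    for cls, n_true in support.items():
        # True positives and number of predictions for this class
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(labels)

labels = [0, 1, 2, 2]
preds = [0, 1, 2, 1]
accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)
f1 = weighted_f1(labels, preds)
print(accuracy, f1)
```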

Now we set training parameters.

In [16]:
import os, sys

# There were some problems in the import of TrainingArguments, so I ran this
# (thanks GPT)
# Force correct interpreter and disable accelerate subprocesses
os.environ["PYTHON_EXECUTABLE"] = sys.executable
os.environ["ACCELERATE_DISABLE_SUBPROCESS"] = "1"

# Remove already imported transformers and accelerate
modules_to_reload = ["transformers", "accelerate"]
for m in modules_to_reload:
    if m in sys.modules:
        del sys.modules[m]

# Re-import
import transformers

In [17]:

batch_size = 64
# Controls how often the training metrics are logged: this is the number of batches per epoch, so once per epoch
logging_steps = len(emotions_encoded['train'])//batch_size
model_name = f"{model_ckpt}-finetuned-emotions"
training_args = transformers.TrainingArguments(output_dir = model_name,
                                num_train_epochs = 2,
                                learning_rate = 2e-5,
                                per_device_train_batch_size = batch_size,
                                per_device_eval_batch_size = batch_size,
                                weight_decay = 0.01,
                                disable_tqdm = False, # Allow to show progress bars
                                logging_steps = logging_steps,
                                log_level = 'error') # Verbosity level
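
As a quick sanity check of the `logging_steps` arithmetic, assuming training on a single device so that the per-device batch size is the effective batch size:

```python
num_train_examples = 16000  # size of the training split shown earlier
batch_size = 64
num_train_epochs = 2

# Number of optimizer steps (batches) in one epoch
steps_per_epoch = num_train_examples // batch_size
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

With these numbers the metrics are logged every 250 steps, that is, once per epoch, for a total of 500 steps.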


Now we define the model.

In [25]:
device =  torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define model
from transformers import AutoModelForSequenceClassification
num_labels = 6
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels = num_labels).to(device)

And train.

In [27]:
from transformers import Trainer


trainer = Trainer(model = model, 
                  args = training_args,
                  compute_metrics = compute_metrics,
                  train_dataset = emotions_encoded['train'],
                  eval_dataset = emotions_encoded['validation'],
                  tokenizer = tokenizer)
trainer.train()


Step,Training Loss
250,0.8091
500,0.2349


TrainOutput(global_step=500, training_loss=0.5220201034545898, metrics={'train_runtime': 256.9734, 'train_samples_per_second': 124.526, 'train_steps_per_second': 1.946, 'total_flos': 720342861696000.0, 'train_loss': 0.5220201034545898, 'epoch': 2.0})

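Let us evaluate on the validation set (the `test_` prefix in the metric names is just the default prefix used by `trainer.predict`).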

In [28]:
pred_output = trainer.predict(emotions_encoded['validation'])
pred_output.metrics

{'test_loss': 0.20293059945106506,
 'test_accuracy': 0.9265,
 'test_f1': 0.9265079391113732,
 'test_runtime': 5.1282,
 'test_samples_per_second': 389.997,
 'test_steps_per_second': 6.24}

Amazing!!