<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/tutorials_notebooks_in_class_2024/W08_intro_to_hugging_face_transformers_datasets_simpletransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quick Introduction to Hugging Face's Transformers and Datasets Libraries

Adapted from: https://huggingface.co/transformers/training.html

- Transformers docs: https://huggingface.co/transformers/index.html


In [None]:
!pip install transformers datasets

import os
os.environ["WANDB_MODE"] = "disabled"


# Loading the dataset

In [None]:
import pandas as pd
import datasets
dataset = datasets.load_dataset('sms_spam')


In [None]:
print(dataset.keys())

dict_keys(['train'])


In [None]:
print(len(dataset['train']))

5574


In [None]:
# Next time, if we only want a few examples, we can slice the split directly.
# Other slice forms work too, e.g. 'train[:100]' or 'train[:1%]'.
dataset = datasets.load_dataset('sms_spam', split='train[800:1000]')

In [None]:
from collections import Counter
Counter(dataset['label'])

Counter({0: 165, 1: 35})

In [None]:
dataset

Dataset({
    features: ['sms', 'label'],
    num_rows: 200
})

In [None]:
dataset[0]

{'sms': '"Gimme a few" was  &lt;#&gt;  minutes ago\n', 'label': 0}

In [None]:
dataset[1]

{'sms': 'Last Chance! Claim ur £150 worth of discount vouchers today! Text SHOP to 85023 now! SavaMob, offers mobile! T Cs SavaMob POBOX84, M263UZ. £3.00 Sub. 16\n',
 'label': 1}

# Loading a Hugging Face Transformer

First part:

## **Tokenizer:**

The tokenizer converts human-readable text into the numerical format the model understands. Each model was trained with a specific tokenizer, so it is crucial to use the matching one for the input to be processed the way the model expects. The tokenizer:

* Splits text into tokens: breaks text into individual words or subwords, depending on the model (e.g., BERT uses WordPiece, while GPT-2 uses Byte-Pair Encoding).

* Maps tokens to IDs: converts each token to the integer ID that represents it in the model’s vocabulary, so every word or subword has a unique numerical representation.

* Handles special tokens: adds tokens that mark sentence boundaries, padding, or the start of a sequence, which can be essential for tasks like translation, summarization, or question answering.

If you use a different tokenizer from the one the model was trained with, the token IDs will not match what the model expects, resulting in poor or incorrect predictions. Tokenizers differ slightly (sometimes significantly) between model architectures, so always load the tokenizer that belongs to the model you use.
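To make the steps above concrete, here is a minimal sketch of a WordPiece-style tokenizer in plain Python. The vocabulary `TOY_VOCAB`, its IDs, and the helper names `wordpiece` and `encode` are made up for illustration (this is not BERT's real vocabulary), and the greedy longest-match splitting is a simplification of what the library does.

```python
# Hypothetical toy vocabulary; real BERT vocabularies have tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "claim": 4, "vouch": 5, "##ers": 6, "today": 7, "!": 8}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Split on whitespace, apply WordPiece, add special tokens, map to IDs."""
    tokens = ["[CLS]"]
    # Lowercasing only suits this toy vocabulary; bert-base-cased keeps case.
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("Claim vouchers today !", TOY_VOCAB)
print(tokens)
print(ids)
```

Note how "vouchers" is not in the toy vocabulary as a whole word and is therefore split into `vouch` and `##ers`. Feeding these IDs to a model trained with a different vocabulary would make them point at unrelated tokens.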

Here is how to load it for `bert-base-cased`.

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')



The `AutoTokenizer` class is a convenient abstraction: it picks the right tokenizer class (BERT, RoBERTa, XLM-RoBERTa, and so on) for the given checkpoint, so you do not have to look it up yourself.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
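The idea behind an "Auto" class can be sketched as a simple registry lookup. The classes `ToyBertTokenizer` and `ToyRobertaTokenizer`, the registry, and the function `auto_tokenizer` below are hypothetical stand-ins, not the library's implementation; the sketch only assumes that the checkpoint's configuration names its model type.

```python
# Hypothetical stand-ins for tokenizer classes; not the real library classes.
class ToyBertTokenizer:
    name = "bert"


class ToyRobertaTokenizer:
    name = "roberta"


# Registry from a model-type string to the class that handles it.
TOKENIZER_REGISTRY = {
    "bert": ToyBertTokenizer,
    "roberta": ToyRobertaTokenizer,
}


def auto_tokenizer(config):
    """Pick the tokenizer class from the `model_type` field of a config dict."""
    model_type = config["model_type"]
    if model_type not in TOKENIZER_REGISTRY:
        raise ValueError(f"No tokenizer registered for model type {model_type!r}")
    return TOKENIZER_REGISTRY[model_type]()


tok = auto_tokenizer({"model_type": "bert"})
print(type(tok).__name__)
```

The benefit of this design is that switching to another checkpoint only requires changing the checkpoint name, not the imported class.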

In [None]:
tokenizer(dataset[0]['sms'])

{'input_ids': [101, 107, 144, 4060, 3263, 170, 1374, 107, 1108, 111, 181, 1204, 132, 108, 111, 176, 1204, 132, 1904, 2403, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(dataset[0]['sms'], return_tensors="pt", padding='max_length', truncation=True, max_length=128)

{'input_ids': tensor([[ 101,  107,  144, 4060, 3263,  170, 1374,  107, 1108,  111,  181, 1204,
          132,  108,  111,  176, 1204,  132, 1904, 2403,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0,

**input_ids**
Contains token IDs representing the input text.
Starts with [CLS] (101) and ends with [SEP] (102).
Each ID corresponds to a specific token or subword in BERT's vocabulary.

**token_type_ids**
Distinguishes segments within the input.
For single-sequence inputs, all values are 0.
For paired inputs, 0 for the first segment, 1 for the second.

**attention_mask**
Marks the tokens the model should attend to with 1 and padding positions with 0.
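As a rough illustration of how the three fields relate, the sketch below assembles them by hand for a single sequence and for a pair. The function `build_inputs` is hypothetical; the IDs 101, 102 and 0 are taken from the outputs above, and the truncation here simply cuts the tail, which is simpler than what the real tokenizer does.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # IDs as seen in the outputs above


def build_inputs(ids_a, ids_b=None, max_length=16):
    """Assemble input_ids, token_type_ids and attention_mask for one example.

    ids_a / ids_b are token IDs without special tokens.
    """
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)      # first segment
    if ids_b is not None:
        second = ids_b + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)    # second segment

    # Simplified truncation (cut the tail), then mark real tokens with 1.
    input_ids = input_ids[:max_length]
    token_type_ids = token_type_ids[:max_length]
    attention_mask = [1] * len(input_ids)

    # Pad every field up to max_length; padding gets attention 0.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


single = build_inputs([1904, 2403], max_length=8)
pair = build_inputs([1904, 2403], [1374], max_length=8)
print(single)
print(pair)
```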

In [None]:
encoded_dataset = [tokenizer(item['sms'], return_tensors="pt", padding='max_length', truncation=True, max_length=128) for item in dataset]

In [None]:
import torch
for enc_item, item in zip(encoded_dataset, dataset):
    enc_item['labels'] = torch.LongTensor([item['label']])

In [None]:
print(len(encoded_dataset))
for key, val in encoded_dataset[0].items():
    print(f'key: {key}, dimensions: {val.size()}')

200
key: input_ids, dimensions: torch.Size([1, 128])
key: token_type_ids, dimensions: torch.Size([1, 128])
key: attention_mask, dimensions: torch.Size([1, 128])
key: labels, dimensions: torch.Size([1])


In [None]:
from random import shuffle
shuffle(encoded_dataset)
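The shuffle above gives a different order on every run, so the later train/test split changes too. If reproducibility matters, one option is a seeded generator, sketched here on a stand-in list; the seed value 42 is an arbitrary choice.

```python
import random

items = list(range(10))   # stand-in for encoded_dataset
rng = random.Random(42)   # dedicated, seeded generator
rng.shuffle(items)        # in-place shuffle, same order on every run

# Re-running with the same seed reproduces the same permutation.
again = list(range(10))
random.Random(42).shuffle(again)
print(items == again)
```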

Second part:

## **Model Weights**

Model Architecture: The structure of the neural network (e.g., transformer layers, attention heads) specific to the model type (like BERT or GPT-2).

Pre-trained Weights: Learned parameters from pre-training on large datasets, enabling the model to perform tasks like classification or summarization without starting from scratch.

Model Configuration: Settings such as hidden layer size, number of layers, and dropout rates that control the model’s behavior and performance.

Task Head: A classification/regression head attached to the end of the model, either randomly initialised or already trained. Its architecture is determined by the suffix of the class name, here `-ForSequenceClassification`.

Hint: The exercise is a token classification task, so there we use `-ForTokenClassification`.

Here is how to load it for `bert-base-cased`. Larger models quickly take up a lot of disk space and memory (this checkpoint is already 436 MB).
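The difference between the two heads can be sketched with a toy linear layer in plain Python. All numbers below are made up, the hidden size is 3 instead of BERT-base's 768, and the real sequence classification model additionally applies a pooling layer and dropout before the classifier, which this sketch leaves out.

```python
def linear(vec, weight, bias):
    """Apply y = W @ vec + b using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


# Toy encoder output: 4 tokens, hidden size 3.
hidden_states = [[0.1, 0.2, 0.3],   # [CLS]
                 [0.4, 0.5, 0.6],
                 [0.7, 0.8, 0.9],
                 [1.0, 1.1, 1.2]]   # [SEP]

# Made-up head parameters for 2 labels: weight is 2 x 3, bias has length 2.
weight = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]
bias = [0.0, 0.1]

# Sequence classification: one logit vector for the whole input,
# computed here from the first ([CLS]) position only.
sequence_logits = linear(hidden_states[0], weight, bias)

# Token classification: one logit vector per token.
token_logits = [linear(h, weight, bias) for h in hidden_states]

print(len(sequence_logits))                     # number of labels
print(len(token_logits), len(token_logits[0]))  # tokens x labels
```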

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


There is a convenient abstraction for models too, so you do not have to look up the right class: the `AutoModel` family, here `AutoModelForSequenceClassification`.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Finetuning Methods

In [None]:
train_set = encoded_dataset[:100]
test_set = encoded_dataset[100:]

### Traditional Torch Finetuning

In [None]:
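from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()  # put the model in training mode (enables dropout)
optimizer.zero_grad()  # clear gradients left over from earlier steps
outputs = model(**train_set[0])  # 'labels' is part of the input, so a loss is returned
loss = outputs.loss
print(loss)
loss.backward()  # backpropagate
optimizer.step()  # update the weights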



tensor(0.4781, grad_fn=<NllLossBackward0>)
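The `NllLossBackward0` in the printed tensor indicates a negative log-likelihood, i.e. a cross-entropy loss over the class logits. A minimal sketch of that computation for a single example, with made-up logits and assuming label 0 is ham and label 1 is spam as the examples above suggest:

```python
import math


def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for one SMS (index 0 = ham, index 1 = spam).
logits = [0.3, -0.2]
print(cross_entropy(logits, label=0))  # smaller: the model favours the gold label
print(cross_entropy(logits, label=1))  # larger: the model favours the other label
```

With two equal logits the loss is log(2), roughly 0.69, which is a useful reference point: a loss clearly below that means the model is doing better than guessing on that example.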


### HuggingFace Trainer

In [None]:
# we don't need the batch dimension when using the trainer
# because the trainer does batching for us
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])
train_set = encoded_dataset[:100]
test_set = encoded_dataset[100:]
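Why the squeeze matters can be seen in a small sketch of batching. The helpers `collate` and `batches` are hypothetical and use nested lists in place of tensors; the Trainer's own collation does more than this. Stacking items of shape `[128]` gives `[batch, 128]`, whereas items of shape `[1, 128]` would give `[batch, 1, 128]`.

```python
def collate(examples):
    """Stack per-example fields into one batch (lists stand in for tensors)."""
    return {key: [ex[key] for ex in examples] for key in examples[0]}


def batches(data, batch_size):
    """Yield consecutive batches of at most `batch_size` examples."""
    for start in range(0, len(data), batch_size):
        yield collate(data[start:start + batch_size])


# Toy examples without a batch dimension: sequence length 3 instead of 128.
toy_set = [{"input_ids": [101, i, 102], "labels": i % 2} for i in range(10)]

for batch in batches(toy_set, batch_size=4):
    # batch size, sequence length, labels in this batch
    print(len(batch["input_ids"]), len(batch["input_ids"][0]), batch["labels"])
```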

In [None]:
training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_set,
)

In [None]:
trainer.train()




TrainOutput(global_step=25, training_loss=0.46253467559814454, metrics={'train_runtime': 17.198, 'train_samples_per_second': 5.815, 'train_steps_per_second': 1.454, 'total_flos': 6577776384000.0, 'train_loss': 0.46253467559814454, 'epoch': 1.0})
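The numbers in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation, and takes the runtime from the output above.

```python
import math

num_examples = 100      # len(train_set)
batch_size = 4          # per_device_train_batch_size
num_epochs = 1          # num_train_epochs
train_runtime = 17.198  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / batch_size)
global_step = steps_per_epoch * num_epochs

print(global_step)                                           # optimizer steps
print(round(num_examples * num_epochs / train_runtime, 3))   # samples per second
print(round(global_step / train_runtime, 3))                 # steps per second
```

These values agree with `global_step`, `train_samples_per_second` and `train_steps_per_second` reported above.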

In [None]:
preds = trainer.predict(test_set)

In [None]:
print(preds.predictions[:2])
print(preds.predictions[:2].argmax(-1))
print(preds.label_ids[:2])
print(preds.metrics)

[[-0.05589291 -0.67698187]
 [ 0.2581243  -1.0777553 ]]
[0 0]
[1 0]
{'test_loss': 0.2828604578971863, 'test_runtime': 0.7407, 'test_samples_per_second': 135.006, 'test_steps_per_second': 33.752}
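`preds.predictions` holds raw logits, one row per example and one column per class; the predicted class is the column with the largest value. A plain-Python sketch using the two rows printed above (the helper `argmax` is defined here only for illustration):

```python
# Logits and gold labels copied from the printed output above.
logits = [[-0.05589291, -0.67698187],
          [0.2581243, -1.0777553]]
gold = [1, 0]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted = [argmax(row) for row in logits]
correct = sum(p == g for p, g in zip(predicted, gold))

print(predicted)            # [0, 0], as printed above
print(correct / len(gold))  # accuracy on these two examples only
```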


### Evaluation

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
predictions = preds.predictions.argmax(-1)
f1_score(preds.label_ids, predictions, average='binary')

0.8387096774193549

In [None]:
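# Note: sklearn's signature is (y_true, y_pred). With the predictions passed first,
# rows are predicted labels and columns are true labels.
confusion_matrix(predictions, preds.label_ids)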

array([[82,  5],
       [ 0, 13]])
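The F1 score reported above can be recomputed from this confusion matrix. The sketch assumes label 1 (spam) is the positive class, which is what `average='binary'` uses by default, and that rows are predicted labels and columns are true labels, since `predictions` was passed as the first argument.

```python
# Confusion matrix from above: rows = predicted label, columns = true label.
matrix = [[82, 5],
          [0, 13]]

tp = matrix[1][1]  # predicted spam, truly spam
fp = matrix[1][0]  # predicted spam, truly ham
fn = matrix[0][1]  # predicted ham, truly spam

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

The model never flags a ham message as spam on this test set, but it misses 5 of the 18 spam messages, which is what pulls the F1 score down.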

# **SimpleTransformers**

An abstraction library for quickly piloting projects. (We restrict its use in the exercise, because relying on it limits the learning.)

So, let's implement the same solution in a few lines.

In [None]:
!pip install simpletransformers
from simpletransformers.classification import ClassificationModel, ClassificationArgs


In [None]:
train_df = pd.DataFrame(dataset).iloc[:100, :].sample(frac=1)
test_df = pd.DataFrame(dataset).iloc[100:, :].sample(frac=1)
train_df = train_df.rename(columns={'sms' : 'text'})
test_df = test_df.rename(columns={'sms' : 'text'})
# creating a model on simpletransformers
model_args = ClassificationArgs(num_train_epochs=1, manual_seed=42, train_batch_size=4, max_seq_length=128)
# Create a ClassificationModel
bert_model = ClassificationModel(
    "bert", "bert-base-cased", args=model_args, use_cuda=False
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
test_df

| | text | label |
|---|---|---|
| 179 | Hey you can pay. With salary de. Only &lt;#&g... | 0 |
| 137 | Since when, which side, any fever, any vomitin.\n | 0 |
| 165 | Are you this much buzy\n | 0 |
| 154 | Also remember to get dobby's bowl from your car\n | 0 |
| 115 | Call me da, i am waiting for your call.\n | 0 |
| ... | ... | ... |
| 132 | Congratulations ore mo owo re wa. Enjoy it and... | 0 |
| 197 | Yetunde i'm in class can you not run water on ... | 0 |
| 134 | What time you think you'll have it? Need to kn... | 0 |
| 107 | all the lastest from Stereophonics, Marley, Di... | 1 |


In [None]:
bert_model.train_model(train_df, output_dir='test_2')




(25, 0.46435903310775756)

In [None]:
bert_predictions, _ = bert_model.predict(test_df['text'].tolist())
f1_score(test_df['label'], bert_predictions, average='binary')


0.782608695652174

In [None]:
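# As before, predictions are passed first, so rows are predicted labels
# and columns are true labels.
confusion_matrix(bert_predictions, test_df['label'])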

array([[86,  2],
       [ 3,  9]])