<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2023-Tutorial-Notebooks/blob/main/tutorial_notebooks/10_intro_to_hugging_face_transformers_datasets_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quick Introduction to Hugging Face's Transformers and Datasets Libraries

Adjusted from: https://huggingface.co/transformers/training.html

Other relevant links:
- Transformers docs: https://huggingface.co/transformers/index.html
- Datasets docs: https://huggingface.co/docs/datasets/
- BertTokenizer: https://huggingface.co/transformers/model_doc/bert.html?highlight=berttokenizer#transformers.BertTokenizer (Check it out, it can do most of the preprocessing for you.)
- BertModel: https://huggingface.co/transformers/model_doc/bert.html?highlight=bertmodel#transformers.BertModel
- BertForSequenceClassification: https://huggingface.co/transformers/model_doc/bert.html?highlight=bertforsequenceclassification#transformers.BertForSequenceClassification (BertModel-based class for this introduction)
- BertForTokenClassification: https://huggingface.co/transformers/model_doc/bert.html?highlight=bertfortokenclassification#transformers.BertForTokenClassification (BertModel-based class for the exercise)
- On the model outputs from different transformers-versions: https://huggingface.co/transformers/migration.html


In [None]:
!pip install transformers datasets scikit-learn

Collecting transformers
  Downloading transformers-4.12.4-py3-none-any.whl (3.1 MB)
Collecting datasets
  Downloading datasets-1.15.1-py3-none-any.whl (290 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)

In [None]:
import pandas as pd
import datasets
dataset = datasets.load_dataset('sms_spam')

Downloading and preparing dataset sms_spam/plain_text (download: 198.65 KiB, generated: 509.53 KiB, post-processed: Unknown size, total: 708.17 KiB) to /root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c...


Dataset sms_spam downloaded and prepared to /root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c. Subsequent calls will reuse this data.



In [None]:
print(dataset.keys())

dict_keys(['train'])


In [None]:
print(len(dataset['train']))

5574


In [None]:
# next time, if we only want a few examples:
dataset = datasets.load_dataset('sms_spam', split='train[800:1000]')  # other slice forms: 'train[:100]', 'train[:1%]'

Reusing dataset sms_spam (/root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c)


In [None]:
from collections import Counter
Counter(dataset['label'])

Counter({0: 165, 1: 35})
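
The counts above show a clear class imbalance: 165 of the 200 messages are ham (label 0) and 35 are spam (label 1). A quick sketch of what that means for evaluation, using only the counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Label counts copied from the Counter output above (0 = ham, 1 = spam).
label_counts = Counter({0: 165, 1: 35})
total = sum(label_counts.values())

# A classifier that always predicts the majority class is a useful sanity baseline.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(f"majority label: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Accuracy alone would look high even for a model that never flags spam, which is one reason to prefer a metric such as F1 for this task.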

In [None]:
dataset

Dataset({
    features: ['sms', 'label'],
    num_rows: 200
})

In [None]:
dataset[0]

{'label': 0, 'sms': '"Gimme a few" was  &lt;#&gt;  minutes ago\n'}

In [None]:
dataset[1]

{'label': 1,
 'sms': 'Last Chance! Claim ur £150 worth of discount vouchers today! Text SHOP to 85023 now! SavaMob, offers mobile! T Cs SavaMob POBOX84, M263UZ. £3.00 Sub. 16\n'}

In [None]:
dataset[100]

{'label': 1,
 'sms': 'Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16\n'}

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [None]:
tokenizer(dataset[0]['sms'])

{'input_ids': [101, 107, 144, 4060, 3263, 170, 1374, 107, 1108, 111, 181, 1204, 132, 108, 111, 176, 1204, 132, 1904, 2403, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(dataset[0]['sms'], return_tensors="pt", padding='max_length', truncation=True, max_length=128)  # calling the tokenizer replaces the deprecated encode_plus

{'input_ids': tensor([[ 101,  107,  144, 4060, 3263,  170, 1374,  107, 1108,  111,  181, 1204,
          132,  108,  111,  176, 1204,  132, 1904, 2403,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0,
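
The padded call above can be pictured with a small pure-Python sketch. It assumes the pad token id is 0 (as in the output above) and uses a hypothetical helper `pad_to_max_length`. The real tokenizer does this internally and, unlike this sketch, keeps the final `[SEP]` token when it has to truncate.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Hypothetical helper: truncate or right-pad a list of token ids.

    Returns the padded ids and an attention mask with 1 for real tokens
    and 0 for padding positions.
    """
    ids = list(input_ids[:max_length])
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# Token ids of the first message, copied from the unpadded tokenizer output above.
ids = [101, 107, 144, 4060, 3263, 170, 1374, 107, 1108, 111, 181, 1204,
       132, 108, 111, 176, 1204, 132, 1904, 2403, 102]

padded_ids, mask = pad_to_max_length(ids, max_length=128)
print(len(padded_ids), sum(mask))
```

The attention mask is what tells the model to ignore the padded positions, so padding every message to the same length does not change what the model attends to.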

In [None]:
encoded_dataset = [tokenizer(item['sms'], return_tensors="pt", padding='max_length', truncation=True, max_length=128) for item in dataset]

In [None]:
import torch
for enc_item, item in zip(encoded_dataset, dataset):
    enc_item['labels'] = torch.LongTensor([item['label']])

In [None]:
print(len(encoded_dataset))
for key, val in encoded_dataset[0].items():
    print(f'key: {key}, dimensions: {val.size()}')

200
key: input_ids, dimensions: torch.Size([1, 128])
key: token_type_ids, dimensions: torch.Size([1, 128])
key: attention_mask, dimensions: torch.Size([1, 128])
key: labels, dimensions: torch.Size([1])


In [None]:
from random import shuffle
shuffle(encoded_dataset)
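
The shuffle above is unseeded, so the order of the examples (and any split taken from them afterwards) differs between runs. A minimal sketch of a reproducible alternative, assuming a fixed seed of 42 (an arbitrary choice) and a stand-in list instead of `encoded_dataset`:

```python
import random

# Stand-in for encoded_dataset: any list can be shuffled the same way.
items = list(range(10))

# A dedicated Random instance keeps the seed local and leaves the global RNG untouched.
rng = random.Random(42)
shuffled = items.copy()
rng.shuffle(shuffled)

# Re-running with the same seed reproduces the same order.
again = items.copy()
random.Random(42).shuffle(again)
print(shuffled == again)
```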

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [None]:
train_set = encoded_dataset[:100]
test_set = encoded_dataset[100:]

The torch-like way to train:

In [None]:
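from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the torch implementation
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()  # put the model in training mode (enables dropout)
outputs = model(**train_set[0])  # forward pass on a single example (batch size 1)
loss = outputs.loss  # same as outputs[0] when labels are passed
print(loss)
loss.backward()  # compute gradients
optimizer.step()  # update the weights
optimizer.zero_grad()  # clear gradients before the next step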

tensor(0.5259, grad_fn=<NllLossBackward0>)


The easier way to train:

In [None]:
# we don't need the batch dimension when using the trainer
# because the trainer does batching for us
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])
train_set = encoded_dataset[:100]
test_set = encoded_dataset[100:]
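
To see why the batch dimension has to go, here is a rough sketch of what batching does with a list of per-example dictionaries. Plain lists stand in for tensors, and `collate` is a hypothetical helper, not the Trainer's actual collator (which stacks tensors instead).

```python
def collate(examples):
    """Hypothetical helper: turn a list of dicts into a dict of lists.

    Each example maps a feature name to a flat sequence (or a scalar label);
    batching adds one new leading dimension across examples.
    """
    keys = examples[0].keys()
    return {key: [example[key] for example in examples] for key in keys}


# Two toy examples with sequence length 3 and no batch dimension.
examples = [
    {"input_ids": [101, 7, 102], "attention_mask": [1, 1, 1], "labels": 0},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1], "labels": 1},
]

batch = collate(examples)
print(batch["input_ids"])  # batch of 2 sequences, each of length 3
print(batch["labels"])     # one label per example
```

If each example still carried its own leading dimension of size 1, stacking would give a shape of (batch, 1, length) instead of the (batch, length) the model expects.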

In [None]:
training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_set,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 100
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 25


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=25, training_loss=0.338733024597168, metrics={'train_runtime': 146.5033, 'train_samples_per_second': 0.683, 'train_steps_per_second': 0.171, 'total_flos': 6577776384000.0, 'train_loss': 0.338733024597168, 'epoch': 1.0})
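
The 25 optimization steps in the log follow from the dataset size and the batch size. A small sketch of the arithmetic, assuming a single device and the gradient accumulation of 1 shown in the log:

```python
import math

num_examples = 100    # "Num examples" in the log
batch_size = 4        # per_device_train_batch_size
num_epochs = 1        # num_train_epochs
grad_accum_steps = 1  # "Gradient Accumulation steps" in the log

batches_per_epoch = math.ceil(num_examples / batch_size)
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 25, matching "Total optimization steps"
```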

In [None]:
preds = trainer.predict(test_set)

***** Running Prediction *****
  Num examples = 100
  Batch size = 4


In [None]:
print(preds.predictions[:2])
print(preds.predictions[:2].argmax(-1))
print(preds.label_ids[:2])
print(preds.metrics)

[[ 0.9839595  -0.77930415]
 [ 1.2374914  -0.9335977 ]]
[0 0]
[0 0]
{'test_loss': 0.22895419597625732, 'test_runtime': 41.3092, 'test_samples_per_second': 2.421, 'test_steps_per_second': 0.605}
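
`preds.predictions` holds raw logits, not probabilities. If probabilities are wanted, a softmax over the last axis converts them. The sketch below does this in plain Python for the two rows printed above; the `softmax` helper is written here for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed predictions above.
logit_rows = [[0.9839595, -0.77930415], [1.2374914, -0.9335977]]

for row in logit_rows:
    probs = softmax(row)
    predicted = probs.index(max(probs))  # same class as argmax on the raw logits
    print([round(p, 3) for p in probs], predicted)
```

Softmax preserves the ordering of the logits, which is why `argmax(-1)` on the raw predictions is enough when only the class label is needed.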


In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
predictions = preds.predictions.argmax(-1)
f1_score(preds.label_ids, predictions, average='binary')

0.8750000000000001

In [None]:
# predictions are passed first, so rows are predicted classes and columns are true classes
confusion_matrix(predictions, preds.label_ids)

array([[82,  4],
       [ 0, 14]])
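
The F1 score reported earlier can be recovered from this matrix. Because the call passes the predictions first, rows index the predicted class and columns the true class. The sketch below reads the counts off under that assumption, treating label 1 (spam) as the positive class.

```python
# Confusion matrix copied from the output above: rows = predicted, columns = true.
matrix = [[82, 4],
          [0, 14]]

true_positives = matrix[1][1]   # predicted spam, actually spam
false_positives = matrix[1][0]  # predicted spam, actually ham
false_negatives = matrix[0][1]  # predicted ham, actually spam

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The resulting F1 of 0.875 agrees with the `f1_score` output above: the model never mislabels ham as spam, but it misses 4 of the 18 spam messages.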

# SimpleTransformers

Let's solve the same task in a few lines.

In [None]:
!pip install simpletransformers
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [None]:
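# first 100 rows for training, the rest for testing; sample(frac=1) shuffles the rows
train_df = pd.DataFrame(dataset).iloc[:100, :].sample(frac=1)
test_df = pd.DataFrame(dataset).iloc[100:, :].sample(frac=1)
# simpletransformers looks for columns named 'text' and 'labels'; the label column is
# still called 'label' here, so it falls back to column 0 as text and column 1 as labels
train_df = train_df.rename(columns={'sms': 'text'})
test_df = test_df.rename(columns={'sms': 'text'})
# training configuration for simpletransformers
model_args = ClassificationArgs(num_train_epochs=1, manual_seed=42, train_batch_size=4, max_seq_length=128)
# create a ClassificationModel (BERT, running on CPU)
bert_model = ClassificationModel(
    "bert", "bert-base-cased", args=model_args, use_cuda=False
)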

In [None]:
test_df

| | text | label |
|---|---|---|
| 179 | Hey you can pay. With salary de. Only &lt;#&g... | 0 |
| 137 | Since when, which side, any fever, any vomitin.\n | 0 |
| 165 | Are you this much buzy\n | 0 |
| 154 | Also remember to get dobby's bowl from your car\n | 0 |
| 115 | Call me da, i am waiting for your call.\n | 0 |
| ... | ... | ... |
| 132 | Congratulations ore mo owo re wa. Enjoy it and... | 0 |
| 197 | Yetunde i'm in class can you not run water on ... | 0 |
| 134 | What time you think you'll have it? Need to kn... | 0 |
| 107 | all the lastest from Stereophonics, Marley, Di... | 1 |


In [None]:
bert_model.train_model(train_df, output_dir='test_2')

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


Configuration saved in test_2/checkpoint-25-epoch-1/config.json
Model weights saved in test_2/checkpoint-25-epoch-1/pytorch_model.bin
tokenizer config file saved in test_2/checkpoint-25-epoch-1/tokenizer_config.json
Special tokens file saved in test_2/checkpoint-25-epoch-1/special_tokens_map.json
Configuration saved in outputs/config.json
Model weights saved in outputs/pytorch_model.bin
tokenizer config file saved in outputs/tokenizer_config.json
Special tokens file saved in outputs/special_tokens_map.json


(25, 0.4350083839893341)

In [None]:
bert_predictions, _ = bert_model.predict(test_df['text'].tolist())
f1_score(test_df['label'], bert_predictions, average='binary')

0.9

In [None]:
# predictions are passed first, so rows are predicted classes and columns are true classes
confusion_matrix(bert_predictions, test_df['label'])

array([[89,  2],
       [ 0,  9]])