# Hugging Face: *Fine-tuning* Transformers

<img width=500px src="https://1.bp.blogspot.com/-qQryqABhdhA/XcC3lJupTKI/AAAAAAAAAzA/MOYu3P_DFRsmNkpjD9j813_SOugPgoBLACLcBGAsYHQ/w1200-h630-p-k-no-nu/h1.png">

We will use the Hugging Face Transformers library together with TensorFlow to *fine-tune* the BERT model on the IMDB movie reviews dataset for sentiment analysis.




In [None]:
!pip install transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple, https://us

In [None]:
from transformers import pipeline

In [None]:
# We create a classifier for sentiment analysis.
classifier = pipeline('sentiment-analysis')

# And test how well it works.
print(classifier('We are very happy to show you the Transformers library.'))
print(classifier('I hope this library is not as bad as programming everything from scratch.'))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9997994303703308}]
[{'label': 'NEGATIVE', 'score': 0.9520087838172913}]
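The `score` in each result is a probability for the predicted label. As a minimal sketch of how raw model outputs (logits) could be turned into such a `label`/`score` pair, assuming a softmax over two classes: the logit values and the helper names `softmax` and `to_pipeline_result` are made up for illustration and are not part of the library.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_pipeline_result(logits, id2label):
    """Pick the most probable class and report it like the pipeline output."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for a two-class model (index 0 = NEGATIVE, 1 = POSITIVE).
example_logits = [-4.2, 4.3]
result = to_pipeline_result(example_logits, {0: "NEGATIVE", 1: "POSITIVE"})
print(result)
```

A large gap between the two logits gives a score close to 1, which is why the confident predictions above report values such as 0.9998.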


In [None]:
# Create a classifier for sentiment analysis with a custom model.
classifier = pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')

# We test the model in Spanish.
print(classifier('¡Está super bien esta librería!'))
print(classifier('Espero que esto no sea tan pesado como tener que programar todo desde cero.'))
print(classifier('Porque detestaría tener que hacerlo.'))

[{'label': '5 stars', 'score': 0.7519457340240479}]
[{'label': '3 stars', 'score': 0.3727743923664093}]
[{'label': '1 star', 'score': 0.38728848099708557}]
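This model answers with a star rating instead of a POSITIVE/NEGATIVE label. If a coarse sentiment is needed, the ratings can be post-processed. The sketch below is only one possible mapping: the helper names `stars_to_int` and `coarse_sentiment` and the thresholds (1-2 negative, 3 neutral, 4-5 positive) are assumptions chosen for illustration.

```python
def stars_to_int(label):
    """Extract the number from labels such as '1 star' or '5 stars'."""
    return int(label.split()[0])


def coarse_sentiment(label):
    """Map a star label to a coarse sentiment (thresholds are an assumption)."""
    stars = stars_to_int(label)
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Results copied from the output above.
outputs = [
    {"label": "5 stars", "score": 0.7519457340240479},
    {"label": "3 stars", "score": 0.3727743923664093},
    {"label": "1 star", "score": 0.38728848099708557},
]

for out in outputs:
    print(out["label"], "->", coarse_sentiment(out["label"]))
```

Note that the second and third scores are below 0.4, so the model is far less sure about those ratings than about the first one.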


In [None]:
# Now let's create a text generator.
text_gen = pipeline('text-generation')
text_gen("My favourite color is")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "My favourite color is purple with green and red in there too. I'm not sure why, but it's a lot more intense than others. I also think that white is kind of an overpowering color to make these pretty with.\xa0\nI"}]

In [None]:
# Now we create a pipeline to fill in masked words.
model = pipeline("fill-mask")
model("In Spain <mask> is the perfect city to live if you like the sea.")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.44055473804473877,
  'token': 4612,
  'token_str': ' Barcelona',
  'sequence': 'In Spain Barcelona is the perfect city to live if you like the sea.'},
 {'score': 0.13196779787540436,
  'token': 21447,
  'token_str': ' Naples',
  'sequence': 'In Spain Naples is the perfect city to live if you like the sea.'},
 {'score': 0.06279805302619934,
  'token': 19512,
  'token_str': ' Venice',
  'sequence': 'In Spain Venice is the perfect city to live if you like the sea.'},
 {'score': 0.04540041834115982,
  'token': 24435,
  'token_str': ' Lisbon',
  'sequence': 'In Spain Lisbon is the perfect city to live if you like the sea.'},
 {'score': 0.030427975580096245,
  'token': 14567,
  'token_str': ' Valencia',
  'sequence': 'In Spain Valencia is the perfect city to live if you like the sea.'}]
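Several of the suggestions above (Naples, Venice, Lisbon) are not Spanish cities, so the raw predictions may need filtering. A minimal sketch, assuming we post-process the list with a hand-written allow-list: the set `SPANISH_CITIES` and the helper `keep_allowed` are hypothetical and not part of the library.

```python
# Predictions copied from the output above (token_str and score only).
predictions = [
    {"token_str": " Barcelona", "score": 0.44055473804473877},
    {"token_str": " Naples", "score": 0.13196779787540436},
    {"token_str": " Venice", "score": 0.06279805302619934},
    {"token_str": " Lisbon", "score": 0.04540041834115982},
    {"token_str": " Valencia", "score": 0.030427975580096245},
]

# Hypothetical allow-list of Spanish coastal cities; extend as needed.
SPANISH_CITIES = {"Barcelona", "Valencia", "Alicante", "Tarragona"}


def keep_allowed(preds, allowed):
    """Keep only predictions whose word is in the allow-list, in score order."""
    kept = []
    for pred in preds:
        word = pred["token_str"].strip()  # the tokens carry a leading space
        if word in allowed:
            kept.append((word, pred["score"]))
    return kept


filtered = keep_allowed(predictions, SPANISH_CITIES)
print(filtered)
```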

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
import numpy as np

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [None]:
from datasets import load_dataset
dataset = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


In [None]:
dataset["train"][np.random.randint(100)]

{'text': '..Oh wait, I can! This movie is not for the typical film snob, unless you want to brush up on your typical cinematic definitions, like "continuity editing" and "geographic match". I couldn\'t tell where I was in this movie. One second they\'re in the present, next minute their supposedly in the 70\'s driving a modern SUV and wearing what looked like to me as 80\'s style clothing. I think. I couldn\'t pay long enough attention to it since the acting was just horrible. I think it only got attention because it has a 3d which I did not watch. If you\'re a b-movie buff, and by b-movie I mean BAD movie, then this film is for you. It\'s home-movie and all non-sense style will keep you laughing for as long as you can stay awake. If your tastes are more for Goddard and Antonioni, though, just skip this one.',
 'label': 0}

In [None]:
# We import the tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer(["Your name is Juan"])

{'input_ids': [[101, 2353, 1271, 1110, 4593, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
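The output has six ids for a four-word sentence because BERT wraps every sequence in two special tokens: `[CLS]` (id 101) at the start and `[SEP]` (id 102) at the end. The sketch below rebuilds the dictionary above from the four inner ids. The helper `wrap_single_sentence` is hypothetical, and it only mimics the single-sentence case without padding.

```python
CLS_ID = 101  # [CLS] token id in the BERT vocabulary
SEP_ID = 102  # [SEP] token id in the BERT vocabulary


def wrap_single_sentence(token_ids):
    """Add the special tokens and build the two companion lists."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": [input_ids],
        # A single sentence belongs entirely to segment 0.
        "token_type_ids": [[0] * len(input_ids)],
        # Without padding, every position is a real token.
        "attention_mask": [[1] * len(input_ids)],
    }


# The four inner ids from the output above, one per word of the sentence.
word_piece_ids = [2353, 1271, 1110, 4593]
encoded = wrap_single_sentence(word_piece_ids)
print(encoded)
```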

In [None]:
# We define a function to tokenize each batch of examples.
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# And we map the dataset.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
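With `padding="max_length"` and `truncation=True`, every review ends up with the same number of ids (512 for this model): short reviews are filled with the padding id 0 and long ones are cut. The sketch below illustrates the idea in plain Python on a toy length of 8. The helper `pad_or_truncate` is hypothetical and simplifies what the tokenizer does; it assumes the input already carries `[CLS]` and `[SEP]`.

```python
PAD_ID = 0    # [PAD] token id in the BERT vocabulary
SEP_ID = 102  # [SEP] token id in the BERT vocabulary


def pad_or_truncate(input_ids, max_length):
    """Return ids of exactly max_length plus the matching attention mask."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Cut the sequence but keep [SEP] as the final token.
        ids = ids[: max_length - 1] + [SEP_ID]
    real = len(ids)
    padding = max_length - real
    return {
        "input_ids": ids + [PAD_ID] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * real + [0] * padding,
    }


short = pad_or_truncate([101, 2353, 1271, 1110, 4593, 102], max_length=8)
long = pad_or_truncate([101] + list(range(1000, 1010)) + [102], max_length=8)
print(short)
print(long)
```

The attention mask is what lets the model ignore the padded positions.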

In [None]:
# We build small random subsets (1,000 examples each) for training and evaluation.
small_train_dataset = tokenized_datasets["train"].shuffle().select(range(1000))
small_eval_dataset  = tokenized_datasets["test"].shuffle().select(range(1000))
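Because `shuffle()` is called without a seed, each run selects a different 1,000 examples. `Dataset.shuffle` accepts a `seed` argument for reproducible subsets. The sketch below shows the underlying idea in plain Python, without the `datasets` library; the helper `sample_indices` is hypothetical.

```python
import random


def sample_indices(num_rows, subset_size, seed):
    """Shuffle row indices with a fixed seed and keep the first subset_size."""
    rng = random.Random(seed)  # private generator, leaves global state untouched
    indices = list(range(num_rows))
    rng.shuffle(indices)
    return indices[:subset_size]


# The IMDB train split has 25,000 rows; we keep 1,000 of them.
first = sample_indices(25000, 1000, seed=42)
second = sample_indices(25000, 1000, seed=42)
print(first == second)
```

Fixing the seed makes results comparable between runs, which helps when tuning hyperparameters on such a small subset.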

In [None]:
from transformers import DefaultDataCollator

# We instantiate a batch generator for Tensorflow.
data_collator = DefaultDataCollator(return_tensors="tf")

In [None]:
small_train_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

In [None]:
# Create the training dataset.
tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

# Create the validation dataset.
tf_validation_dataset = small_eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
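With 1,000 examples and `batch_size=8`, each pass over the data takes a fixed number of steps. A small worked calculation, where the helper `steps_per_epoch` is a hypothetical name used only for illustration:

```python
import math


def steps_per_epoch(num_rows, batch_size, drop_remainder=False):
    """Number of batches needed to see every row once."""
    if drop_remainder:
        # An incomplete final batch is discarded.
        return num_rows // batch_size
    # Otherwise the final batch may be smaller than batch_size.
    return math.ceil(num_rows / batch_size)


train_steps = steps_per_epoch(1000, 8)
print(train_steps)
```

Each batch then holds 8 sequences of 512 ids, matching the `(8, 512)` tensor shapes produced by the tokenized data.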

In [None]:
# Inspect one batch of the training dataset.
list(tf_train_dataset.take(1))

[({'input_ids': <tf.Tensor: shape=(8, 512), dtype=int64, numpy=
   array([[ 101,  146, 1486, ..., 1497, 1133,  102],
          [ 101, 1109, 1178, ...,    0,    0,    0],
          [ 101,  146, 1660, ...,    0,    0,    0],
          ...,
          [ 101, 3259, 1104, ..., 4510, 1177,  102],
          [ 101, 1573,  146, ...,    0,    0,    0],
          [ 101, 1457, 4907, ...,    0,    0,    0]])>,
   'token_type_ids': <tf.Tensor: shape=(8, 512), dtype=int64, numpy=
   array([[0, 0, 0, ..., 0, 0, 0],
          [0, 0, 0, ..., 0, 0, 0],
          [0, 0, 0, ..., 0, 0, 0],
          ...,
          [0, 0, 0, ..., 0, 0, 0],
          [0, 0, 0, ..., 0, 0, 0],
          [0, 0, 0, ..., 0, 0, 0]])>,
   'attention_mask': <tf.Tensor: shape=(8, 512), dtype=int64, numpy=
   array([[1, 1, 1, ..., 1, 1, 1],
          [1, 1, 1, ..., 0, 0, 0],
          [1, 1, 1, ..., 0, 0, 0],
          ...,
          [1, 1, 1, ..., 1, 1, 1],
          [1, 1, 1, ..., 0, 0, 0],
          [1, 1, 1, ..., 0, 0, 0]])>},
  <tf

In [None]:
from transformers import TFAutoModelForSequenceClassification

# Load the BERT model with a classification head for TensorFlow.
# IMDB has two classes (0 = negative, 1 = positive), so we use num_labels=2.
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
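The warning is expected: the pretrained checkpoint contains the BERT encoder, while the `classifier` layer is created from scratch and only learns during fine-tuning. As a rough sketch of how small that new layer is, assuming it is a single dense layer on top of the 768-dimensional BERT-base representation, we can count its parameters for different values of `num_labels`. The helper `classifier_head_params` is hypothetical.

```python
HIDDEN_SIZE = 768  # hidden size of BERT-base models


def classifier_head_params(hidden_size, num_labels):
    """Weights plus biases of one dense layer mapping hidden_size -> num_labels."""
    return hidden_size * num_labels + num_labels


for num_labels in (2, 5):
    print(num_labels, classifier_head_params(HIDDEN_SIZE, num_labels))
```

Either way, the head is tiny compared with the pretrained encoder, which is why a short fine-tuning run can be enough.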


In [None]:
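# We compile the model: the loss works on raw logits and integer labels.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# We fine-tune for three epochs, validating after each one.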

model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fb24a1b79a0>
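To make the loss and metric used above more concrete, here is a plain-Python sketch of sparse categorical cross-entropy computed from logits, together with the matching accuracy. It is an illustration of the formulas and not the Keras implementation; the logits and labels are made-up values.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    highest = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[label]


def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose largest logit sits at the true class index."""
    hits = 0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        hits += int(predicted == label)
    return hits / len(labels)


# Hypothetical logits for three reviews and their integer labels.
batch_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
labels = [0, 1, 0]

losses = [sparse_categorical_crossentropy(l, y) for l, y in zip(batch_logits, labels)]
accuracy = sparse_categorical_accuracy(batch_logits, labels)
print(losses, accuracy)
```

A confident correct prediction gives a loss near 0, an undecided one gives ln 2 for two classes, and a confident wrong prediction is penalized most.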