<a href="https://colab.research.google.com/github/AlexUmnov/genai_course/blob/main/week3_open_source_llms/seminar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this week's practice session we'll learn the basics of using the HuggingFace library.

# HuggingFace intro

In this section we'll learn the basics of the HuggingFace framework.

HuggingFace is both a platform and a framework. It is widely used for running inference and fine-tuning pretrained models, and sometimes for training from scratch. It is also popular as a hub for sharing custom models.

Let's take a quick look at how it works.

The most important part of the HuggingFace platform is, of course, the model registry: https://huggingface.co/models. There you can find all submitted models, grouped by task, type, language (in the case of NLP), and so on. For demonstration, let's run the GPT-2 model locally: https://huggingface.co/gpt2.

GPT-2 is a transformer-based model, so to work with it we'll need a HuggingFace library called `transformers`.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Colle

The `transformers` library supports various ways of loading a model. Here we will simply download it directly from the registry using the `Auto*` classes, without specifying the model type or tokenizer type.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to('cuda')

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
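The cell above assumes that a GPU is available: `.to('cuda')` raises an error on a CPU-only runtime. A minimal sketch of a fallback, assuming PyTorch is the backend; `pick_device` is a hypothetical helper written for this illustration, not part of `transformers`:

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' when PyTorch sees a GPU, otherwise 'cpu'."""
    # Hypothetical helper for illustration only.
    # If PyTorch is not installed at all, fall back to 'cpu'.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can then be used in place of the hard-coded `'cuda'`, for example `model.to(device)`.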

Now we have both a GPT-2 tokenizer and a GPT-2 model, and we can try using them.

To feed text to a transformer, we first need to *tokenize* it using a `tokenizer`, that is, to map it to a sequence of token indices. Those token indices are stored in `'input_ids'`. Another field is `'attention_mask'`, which marks the tokens the model should attend to during prediction (more details in the second course).

In [None]:
input_batch = [
    "Tim had 2 green apples and 3 red apples, in total he had"
]
tokenized_input = tokenizer(input_batch, return_tensors='pt').to('cuda')
tokenized_input

{'input_ids': tensor([[14967,   550,   362,  4077, 22514,   290,   513,  2266, 22514,    11,
           287,  2472,   339,   550]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
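To make the two fields concrete, here is a toy sketch of what a tokenizer returns. It assumes a whitespace tokenizer with a made-up vocabulary (`toy_vocab` and `toy_tokenize` are hypothetical names); the real GPT-2 tokenizer uses byte-level BPE and produces different ids.

```python
# Toy illustration only: a whitespace tokenizer with a made-up vocabulary.
# The real GPT-2 tokenizer uses byte-level BPE and different ids.
toy_vocab = {"<unk>": 0, "Tim": 1, "had": 2, "2": 3, "green": 4, "apples": 5}


def toy_tokenize(text):
    """Map a text to token ids plus an attention mask of the same length."""
    tokens = text.split()
    # Unknown words fall back to the <unk> id.
    input_ids = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]
    # Without padding, every position holds a real token, so the mask is all ones.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_tokenize("Tim had 2 green apples")
print(encoded)
# {'input_ids': [1, 2, 3, 4, 5], 'attention_mask': [1, 1, 1, 1, 1]}
```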

As you can see, we requested the tokenized inputs as PyTorch tensors (`return_tensors='pt'`), but the tokenizer also supports NumPy (`'np'`) and TensorFlow (`'tf'`) formats.

Tokenizers also provide various ways to control the tokenization process, such as padding and truncation.

Feel free to take a look at the documentation: https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer
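Padding and truncation matter as soon as a batch holds texts of different lengths. The sketch below shows the idea in plain Python under stated assumptions: `PAD_ID` and `pad_and_truncate` are hypothetical, and padding is applied on the right. With the real tokenizer, the corresponding arguments are `padding`, `truncation` and `max_length`; note that GPT-2 ships without a padding token, so one has to be assigned before padding can be used.

```python
PAD_ID = 0  # hypothetical padding id for this sketch


def pad_and_truncate(batch_ids, max_length):
    """Cut each sequence to max_length, then right-pad to a common width."""
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```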

Inside the model, the token indices are converted into embeddings, which are then fed to the transformer layers:


In [None]:
model_output = model.generate(
    **tokenized_input,
    max_length=128,
)
print(model_output)
tokenizer.batch_decode(model_output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[14967,   550,   362,  4077, 22514,   290,   513,  2266, 22514,    11,
           287,  2472,   339,   550,   362,  4077, 22514,   290,   513,  2266,
         22514,    13,   198,   198,   464,  1306,  1110,    11,   339,  1816,
           284,   262,  3650,   290,  5839,   257,   649,  5166,   286, 10012,
            13,   679,  5839,   257,  5166,   286, 10012,   326,   547,   925,
           286,   257,  1180,  3124,    13,   679,  5839,   257,  5166,   286,
         10012,   326,   547,   925,   286,   257,  1180,  3124,    13,   679,
          5839,   257,  5166,   286, 10012,   326,   547,   925,   286,   257,
          1180,  3124,    13,   679,  5839,   257,  5166,   286, 10012,   326,
           547,   925,   286,   257,  1180,  3124,    13,   679,  5839,   257,
          5166,   286, 10012,   326,   547,   925,   286,   257,  1180,  3124,
            13,   679,  5839,   257,  5166,   286, 10012,   326,   547,   925,
           286,   257,  1180,  3124,    13,   679,  

['Tim had 2 green apples and 3 red apples, in total he had 2 green apples and 3 red apples.\n\nThe next day, he went to the store and bought a new pair of shoes. He bought a pair of shoes that were made of a different color. He bought a pair of shoes that were made of a different color. He bought a pair of shoes that were made of a different color. He bought a pair of shoes that were made of a different color. He bought a pair of shoes that were made of a different color. He bought a pair of shoes that were made of a different color. He bought a']

The output of the model is also a sequence of token indices (the prompt followed by the continuation), so we have to use the `tokenizer.decode` (or `batch_decode`) method to turn it back into text.

There's also a slightly simpler way to do the same thing: `pipelines`.
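The repetitive text above is typical of greedy decoding, which is what `generate` does here with its default settings: at every step the single most likely next token is appended, and `max_length` bounds the total length, prompt included. The toy loop below sketches that behaviour. The lookup table `toy_next_token` is a hypothetical stand-in for the model; a real model conditions on the whole context, not only on the last token.

```python
# Hypothetical stand-in for a language model: last token id -> next token id.
toy_next_token = {1: 2, 2: 3, 3: 2}
EOS_ID = 0  # hypothetical end-of-sequence id


def toy_generate(input_ids, max_length):
    """Greedy loop: append one token at a time until max_length or EOS."""
    output = list(input_ids)  # the prompt is part of the returned sequence
    while len(output) < max_length:
        next_id = toy_next_token.get(output[-1], EOS_ID)
        output.append(next_id)
        if next_id == EOS_ID:
            break
    return output


generated = toy_generate([1], max_length=8)
print(generated)  # [1, 2, 3, 2, 3, 2, 3, 2]
```

Once the toy model enters the 2, 3, 2, 3 cycle it never leaves it, much like the repeated sentence about shoes in the GPT-2 output.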

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")

In [None]:
pipe(input_batch)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': 'Tim had 2 green apples and 3 red apples, in total he had 3 apples. So I have about 3 apples and 2 red apples. That was all the apples and the red apple so far. The amount of food he sent me was about 2'}]]
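The pipeline returns a nested structure: one list per input text, each holding one dictionary per generated candidate. A minimal sketch of unpacking it, using a shortened copy of the output above as a literal so that it runs without loading the model:

```python
prompt = "Tim had 2 green apples and 3 red apples, in total he had"

# Shortened copy of the pipeline output shown above (one input, one candidate).
pipe_output = [[{"generated_text": prompt + " 3 apples."}]]

# Take the first candidate for every input text.
texts = [candidates[0]["generated_text"] for candidates in pipe_output]

# The generated text starts with the prompt, so slice it off
# to keep only the continuation.
continuation = texts[0][len(prompt):]
print(repr(continuation))  # ' 3 apples.'
```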