<div>
<img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="150">
<img src="https://scontent-lga3-2.xx.fbcdn.net/v/t39.30808-6/326237524_1368113687277139_7848634163741007761_n.png?_nc_cat=100&ccb=1-7&_nc_sid=09cbfe&_nc_ohc=9x6DAhS-gUwAX-2ZxMv&_nc_ht=scontent-lga3-2.xx&oh=00_AfD3SPcxeME6Ui9O3fIYNFNUUhu-hOCNT8Rttgo6yuFd3Q&oe=64807494" width="150">
</div>

# Hugging Face Workshop
**Workshop Lead: Bassel Al Omari**

Get your hands busy with one of the most popular Python libraries on GitHub [*](https://github.com/EvanLi/Github-Ranking#python).

Hugging Face is the landmark one-stop library for machine learning developers, offering access to state-of-the-art models, datasets, and other useful utilities.

In this workshop, we'll use Hugging Face to finetune a distilled GPT-2 model to generate Shakespeare-esque text. By the end, you'll be able to use Hugging Face to:
- Load and process a dataset from the Hugging Face Hub.
- Load and finetune a pretrained model.

## Setup
To get started, we will install three libraries provided by Hugging Face: `transformers`, `datasets` and `tokenizers`.

Each of them loads its respective component from the Hugging Face Hub.

In [None]:
!pip install datasets transformers==4.28.0 tokenizers

We'll also be uploading our model to the Hugging Face Hub, to store it and share it with the community.

You can sign up for a Hugging Face account [here](https://huggingface.co/join), and generate the necessary access token [here](https://huggingface.co/settings/tokens).

In [None]:
!pip install huggingface_hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

#### GPU access
We'll also install `accelerate`, which lets Hugging Face libraries run on multi-GPU setups, and select the device to train on: a GPU if one is available, otherwise the CPU.

In [None]:
!pip install accelerate

In [4]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Datasets

You can browse all the datasets available on the Hugging Face platform on [the Datasets page](https://huggingface.co/datasets).

#### Tiny Shakespeare

We'll be using Tiny Shakespeare, which contains 40,000 lines from a variety of Shakespeare's plays.

You can find the dataset on [the Hugging Face Hub](https://huggingface.co/datasets/tiny_shakespeare), with instructions on how to load it with the `datasets` library.

It takes just two lines!

In [37]:
from datasets import load_dataset

raw_datasets = load_dataset("tiny_shakespeare")

print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})


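We can see that `raw_datasets` is a `DatasetDict`, which behaves like a Python dictionary: each key corresponds to a different split. Note that each split has a single row, which holds the entire text of that split as one string.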


Let's look at an example of the text from the training set:

In [38]:
print(raw_datasets["train"]["text"][0][:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



## Tokenize it!

Transformers can't take in strings, but instead require the input to be numbers. Therefore, we must convert the Shakespeare dataset sentences into representative numbers.

This process is known as tokenization: we split the strings into smaller subunits (words, characters or other subunits), and assign each unique subunit a representative number.

Take the example sentence `"black cat and black dog"`.
- If we were to tokenize the sentence at the word level, we would get the split `['black', 'cat', 'and', 'black', 'dog']`, which could be encoded as:

```
0 black
1 cat
2 and
3 dog

black cat and black dog => 0 1 2 0 3
```
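
The word-level scheme above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `build_vocab` and `encode` are made up for this example, and GPT-2's real tokenizer works on subword units rather than whole words.

```python
def build_vocab(words):
    """Give each unique word the next free integer ID, in order of first appearance."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Replace every whitespace-separated word with its ID."""
    return [vocab[word] for word in sentence.split()]


sentence = "black cat and black dog"
vocab = build_vocab(sentence.split())
print(vocab)                    # {'black': 0, 'cat': 1, 'and': 2, 'dog': 3}
print(encode(sentence, vocab))  # [0, 1, 2, 0, 3]
```

Note how the repeated word `black` maps to the same ID both times.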

Each pretrained model comes with its own tokenizer, so to get started let's download the tokenizer of DistilGPT-2:

In [None]:
from transformers import AutoTokenizer

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Let's feed the tokenizer an example to see how it encodes it. Note that this is how the GPT-2 tokenizer splits the input; other tokenizers may split it differently.

In [8]:
example_str = "Mr. Goose approves of the Data-Science Club"

encoded_str = tokenizer(example_str)
print(encoded_str['input_ids'])

[5246, 13, 46317, 43770, 286, 262, 6060, 12, 26959, 6289]


In [9]:
for token in encoded_str['input_ids']:
  print(token, tokenizer.decode([token]))

5246 Mr
13 .
46317  Goose
43770  approves
286  of
262  the
6060  Data
12 -
26959 Science
6289  Club


We now apply this tokenizer to all the text in the Shakespeare dataset:

In [41]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenizer.pad_token = tokenizer.eos_token

tokenized_datasets = raw_datasets.map(
    tokenize_function, 
    batched=True, 
    remove_columns=["text"]
)

## Preparing the dataset

We split the tokenized text into chunks of 128 tokens, which we feed individually into the model.

To train the model, we ask it to continue the text: given the tokens it has seen so far, it predicts the next one, and we compare that prediction with the token that actually follows.

For example, take the sentence "Mr. goose approves of this workshop". Given the first part "Mr. goose approves of", the model should predict "this"; given "Mr. goose approves of this", it should predict "workshop".

Note that the model used in this notebook does this shifting automatically for us: we only need to pass a copy of the inputs as `labels`.
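
At the level of single tokens, the relationship between what the model sees and what it is compared against can be sketched as follows. The token list and the helper `next_token_pairs` are illustrative assumptions; the real model works on token IDs and performs this shift internally.

```python
def next_token_pairs(tokens):
    """Pair each token with the token that follows it in the sequence."""
    inputs = tokens[:-1]   # every token except the last
    targets = tokens[1:]   # the same sequence shifted left by one
    return list(zip(inputs, targets))


# A hand-written split of the example sentence (not real tokenizer output)
chunk = ["Mr", ".", " goose", " approves", " of", " this", " workshop"]
pairs = next_token_pairs(chunk)

for seen, expected in pairs:
    print(repr(seen), "->", repr(expected))
```

Each printed pair shows only the most recent token on the left; when predicting, the model also has access to all the tokens before it.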

In [42]:
# block size == max input length into the model
block_size = 128
    
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size

    # Split by chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# split total dataset into smaller sets of length block_size
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True
)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2359
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 141
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
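
To see what the grouping step does, the same logic can be run on a toy batch. The sketch below assumes a hypothetical `chunk_batch` helper, a tiny block size of 4 and made-up token IDs; it mirrors `group_texts` but takes the block size as a parameter.

```python
def chunk_batch(examples, block_size):
    """Concatenate every column, drop the remainder, and cut into equal blocks."""
    # Flatten each column (a list of lists) into one long list
    concatenated = {
        k: [tok for seq in seqs for tok in seq] for k, seqs in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size; leftover tokens are discarded
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modelling the labels start as a copy of the inputs
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


toy_batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
}
chunks = chunk_batch(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten toy tokens yield two full blocks, and the last two tokens are dropped because they do not fill a block.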

## The Model

We'll be using DistilGPT2, a lighter version of GPT-2 with about 82 million parameters, compared to 124 million for the smallest GPT-2 and 1.5 billion for the largest. DistilGPT2 was pretrained using [knowledge distillation](https://neptune.ai/blog/knowledge-distillation) with the supervision of the smallest GPT-2 model.

You can find the model on the Hugging Face Hub [here](https://huggingface.co/distilgpt2), along with instructions on how to load it with the `transformers` library.

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The `Trainer` is a high-level API for training your models in PyTorch.

Begin by defining the training hyperparameters using `TrainingArguments`, then create the `Trainer` with those hyperparameters.

By setting `push_to_hub=True`, a repository named after `output_dir` is created, and your finetuned model is uploaded every time a checkpoint is saved (here, after every training epoch).

In [None]:
from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir='transformers-dsc-workshop',
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=20,
    optim='adamw_hf',
    weight_decay=0.01,
    learning_rate=2e-5,
    push_to_hub=True,  # upload to the Hub whenever a checkpoint is saved
    save_strategy='epoch'  # save a checkpoint after every epoch
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"]
)

Simply call `.train()` to fit your model to the training set:

In [36]:
trainer.train()

***** Running training *****
  Num examples = 2,359
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1,480
  Number of trainable parameters = 81,912,576


Step,Training Loss
500,3.774
1000,3.5732


Saving model checkpoint to transformers-dsc-workshop/checkpoint-74
Configuration saved in transformers-dsc-workshop/checkpoint-74/config.json
Configuration saved in transformers-dsc-workshop/checkpoint-74/generation_config.json
Model weights saved in transformers-dsc-workshop/checkpoint-74/pytorch_model.bin
tokenizer config file saved in transformers-dsc-workshop/checkpoint-74/tokenizer_config.json
Special tokens file saved in transformers-dsc-workshop/checkpoint-74/special_tokens_map.json
tokenizer config file saved in transformers-dsc-workshop/tokenizer_config.json
Special tokens file saved in transformers-dsc-workshop/special_tokens_map.json
Saving model checkpoint to transformers-dsc-workshop/checkpoint-148
Configuration saved in transformers-dsc-workshop/checkpoint-148/config.json
Configuration saved in transformers-dsc-workshop/checkpoint-148/generation_config.json
Model weights saved in transformers-dsc-workshop/checkpoint-148/pytorch_model.bin
tokenizer config file saved in tra

TrainOutput(global_step=1480, training_loss=3.6186934187605573, metrics={'train_runtime': 1875.3404, 'train_samples_per_second': 25.158, 'train_steps_per_second': 0.789, 'total_flos': 1540997586616320.0, 'train_loss': 3.6186934187605573, 'epoch': 20.0})
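
The step counts in the log above can be checked with a little arithmetic. This is a minimal sketch that assumes a single device and no gradient accumulation, as the log reports; the variable names are only for illustration.

```python
import math

num_examples = 2359   # rows in lm_datasets["train"]
batch_size = 32       # per_device_train_batch_size, on a single device
num_epochs = 20       # num_train_epochs

# The last batch of each epoch is smaller than the rest, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 74
print(total_steps)      # 1480
```

This is why the first checkpoint is named `checkpoint-74`, the second `checkpoint-148`, and training stops at step 1,480.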

The `pipeline` is another API, used for running inference with a model. You can either use the model that you have just trained, or load a model directly from the Hugging Face Hub (see a couple of cells below).

In [None]:
from transformers import pipeline

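# Use the first GPU if one is available, otherwise fall back to the CPU (-1)
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1, framework="pt")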

outputs = text_generator("HAMLET: To be or not to be ", temperature=0.8, max_length=100)[0]
print(outputs["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


HAMLET: To be or not to be ike,
From the head of France, I'd be the very head of Rome,
From the head of Rome.

LUCIO:
What, then?

POLIXENES:
Well, if we did say that Pompey had a daughter,
That Pompey had a son, that Pompey had a son,
That Pompey had a son, that Pompey had a son,
That Pompe
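
The `temperature` argument used in the call above rescales the model's scores before they are turned into probabilities, which matters when the next token is sampled rather than picked greedily. Below is a minimal sketch of that rescaling in plain Python; the three logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper, not a function from `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.8):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so the most likely token becomes even more likely; a temperature above 1 flattens it and makes the output more varied.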


#### Try it Yourself!
Pass the model an input of your choice and watch it go!

In [6]:
INPUT_OF_YOUR_CHOICE = "HAMLET: To be a goose or not to be "

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation", model="federated/transformers-dsc-workshop", framework="pt")

outputs = text_generator(INPUT_OF_YOUR_CHOICE, temperature=0.8, max_length=100)[0]
print(outputs["generated_text"])

#### Your Turn:
Use the code in this notebook to finetune other models from [the model section](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) on a new dataset of your choice from [the datasets page](https://huggingface.co/datasets).

