<a href="https://colab.research.google.com/github/Tuan-Lee-23/deep-learning-v2-pytorch/blob/master/Fine_tune_a_GPT_2_Model_with_Huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **TODO list:**

- load the dataset
- prepare the dataset and build a `CustomDataset` or a `TextDataset` (choose only one method)
    - `TextDataset`: the dataset is one continuous block of text, not separated into per-line samples
    - `CustomDataset`: for tabular data where each line is a sample and the model needs extra special tokens (author, content, SEP, ...)
- load the pre-trained GPT-2 model and tokenizer (Vietnamese)
- initialize ``Trainer`` with ``TrainingArguments``
- train and save the model
- test the model

# Install libs

In [1]:
!pip install -q transformers huggingface_hub wandb

  Building wheel for pathtools (setup.py) ... done


In [2]:
!apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following packages were automatically installed and are no longer required:
  cuda-command-line-tools-10-0 cuda-command-line-tools-10-1
  cuda-command-line-tools-11-0 cuda-compiler-10-0 cuda-compiler-10-1
  cuda-compiler-11-0 cuda-cuobjdump-10-0 cuda-cuobjdump-10-1
  cuda-cuobjdump-11-0 cuda-cupti-10-0 cuda-cupti-10-1 cuda-cupti-11-0
  cuda-cupti-dev-11-0 cuda-documentation-10-0 cuda-documentation-10-1
  cuda-documentation-11-0 cuda-documentation-11-1 cuda-gdb-10-0 cuda-gdb-10-1
  cuda-gdb-11-0 cuda-gpu-library-advisor-10-0 cuda-gpu-library-advisor-10-1
  cuda-libraries-10-0 cuda-libraries-10-1 cuda-libraries-11-0
  cuda-memcheck-10-0 cuda-memcheck-10-1 cuda-memcheck-11-0 cuda-nsight-10-0
  cuda-nsight-10-1 cuda-nsight-11-0 cuda-nsight-11-1 cuda-nsight-compute-10-0
  cuda-nsight-compute-10-1 cuda-nsight-compute-11-0 cuda-nsight-compute-11-

In [3]:
!nvidia-smi

Wed Feb 16 11:15:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

This step is only needed if you want to upload the model to the Hugging Face Hub.

In [4]:
from huggingface_hub import notebook_login

In [5]:
notebook_login()


# Load the dataset



Load the dataset file from gdrive

In [6]:
#upload files to your colab environment
# from google.colab import files
# uploaded = files.upload()

After uploading the file, we use `unzip` to extract `data.json`.

In [7]:
!unzip '132879_316218_bundle_archive.zip'

unzip:  cannot find or open 132879_316218_bundle_archive.zip, 132879_316218_bundle_archive.zip.zip or 132879_316218_bundle_archive.zip.ZIP.


# Prepare the dataset (only for the TextDataset method)

The next step is to extract the instructions from all rows and build a `TextDataset`. `TextDataset` is an implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library. If you want to know more about `Dataset` in PyTorch, you can check out this [YouTube video](https://www.youtube.com/watch?v=PXOzkkB5eH0&ab_channel=PythonEngineer).

First, we split `data.json` into a `train` and a `test` section, extract the `Instructions` from the recipes, and write them to `train_dataset.txt` and `test_dataset.txt`.

In [8]:
# import re
# import json
# from sklearn.model_selection import train_test_split


# with open('data.json') as f:
#     data = json.load(f)

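# def build_text_files(data_json, dest_path):
#     # Join all instruction texts into one whitespace-normalised block.
#     data = ''
#     for texts in data_json:
#         summary = str(texts['Instructions']).strip()
#         summary = re.sub(r"\s", " ", summary)
#         data += summary + "  "
#     # Use a context manager so the file is flushed and closed.
#     with open(dest_path, 'w') as f:
#         f.write(data)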

# train, test = train_test_split(data,test_size=0.15) 


# build_text_files(train,'train_dataset.txt')
# build_text_files(test,'test_dataset.txt')

# print("Train dataset length: "+str(len(train)))
# print("Test dataset length: "+ str(len(test)))


# Model + Tokenizer

The next step is to download the tokenizer and the model. We take both from the `danghuy1999/gpt2-viwiki` checkpoint.

In [21]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# tokenizer = AutoTokenizer.from_pretrained("anonymous-german-nlp/german-gpt2")
tokenizer = AutoTokenizer.from_pretrained("danghuy1999/gpt2-viwiki")
model = AutoModelForCausalLM.from_pretrained("danghuy1999/gpt2-viwiki")

train_path = 'poem_train.txt'
test_path = 'poem_test.txt'

Some weights of the model checkpoint at danghuy1999/gpt2-viwiki were not used when initializing GPT2LMHeadModel: ['multiple_choice_head.summary.weight', 'multiple_choice_head.summary.bias']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


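This tokenizer has no "<|startoftext|>" token: as the output below shows, it is split into several sub-word ids, while "<|endoftext|>" maps to the single id 0. If you want to mark where each input sample starts, add that token in the next step.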

In [22]:
print(tokenizer.encode("<|startoftext|>")) # not in the vocabulary: split into several sub-tokens
print(tokenizer.encode("<|endoftext|>"))
print(tokenizer.encode("<PAD>"))

[28, 92, 1472, 1632, 1247, 19862, 92, 30]
[0]
[28, 6543, 36, 30]
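
Why a string that is not registered as a token gets split, and why registering it yields a single id, can be illustrated with a toy tokenizer. The sketch below is not GPT-2's BPE: `greedy_encode`, the miniature vocabulary, and its ids are hypothetical and only mimic the behaviour seen in the output above. In the notebook itself, the registration is done with `tokenizer.add_tokens(...)` or `tokenizer.add_special_tokens(...)`.

```python
def greedy_encode(text, vocab):
    """Toy longest-match tokenizer: repeatedly take the longest
    vocabulary entry that is a prefix of the remaining text."""
    ids = []
    pos = 0
    max_len = max(len(tok) for tok in vocab)
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode {text[pos]!r}")
    return ids


# Hypothetical miniature vocabulary; the ids are made up for illustration.
vocab = {"<|endoftext|>": 0, "<": 1, "|": 2, ">": 3,
         "start": 4, "of": 5, "text": 6}

before = greedy_encode("<|startoftext|>", vocab)  # split into several pieces
vocab["<|startoftext|>"] = len(vocab)             # register it as one token
after = greedy_encode("<|startoftext|>", vocab)   # now a single id

print(before)
print(after)
```

Once a token is added this way, the model's embedding matrix needs one more row, which is what the resizing step later in the notebook takes care of.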


## Add new tokens + special tokens

- New tokens: ordinary tokens such as special characters (newline, @, ...)
- Special tokens: SEP, UNK, BOS, author, ...

## Normal tokens

In [23]:
print(len(tokenizer))  # 50257
tokenizer.add_tokens(["\n"])
print(len(tokenizer))  # 50258

50257
50258


## Special tokens

In [45]:
# SPECIAL_TOKENS  = ["<|author|>", "<|content|>"]

# tokenizer.add_tokens(SPECIAL_TOKENS, special_tokens=True)
# tokenizer.add_special_tokens({'pad_token': '<PAD>'})
# tokenizer.add_special_tokens({'unk_token': '<UNK>'})

1

In [46]:
# tokenizer

PreTrainedTokenizerFast(name_or_path='danghuy1999/gpt2-viwiki', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': '<UNK>', 'pad_token': '<PAD>'})

## Resize model's word embeddings

In [25]:
model.resize_token_embeddings(len(tokenizer))

# Initialise the embedding of the newly added token (the last row) to zeros
with torch.no_grad():
    model.transformer.wte.weight[-1, :] = torch.zeros(model.config.n_embd)

print(model.transformer.wte.weight.shape)

torch.Size([50261, 768])


# Dataset + dataloader


- Custom dataset: the dataset keeps samples separated, one sample per line
    
    If you want a special input format in which each sample carries several special tokens (SEP, EOS, PAD, author, genre, ...), you need to:
    - Add the special tokens in the `CustomDataset` class
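
As a minimal sketch of that input format, the helper below builds the string for one sample. `format_sample` is a hypothetical name; the `<|author|>` and `<|content|>` markers and the toy rows are the ones used in the commented cells of this notebook.

```python
def format_sample(author, content, eos_token="<|endoftext|>"):
    """Wrap one tabular row into a single training string with marker tokens."""
    return f"<|author|>{author}<|content|>{content}{eos_token}"


# Toy rows, same shape as the small DataFrame used for testing below.
rows = [
    {"author": "A", "content": "dog"},
    {"author": "B", "content": "cat"},
]

samples = [format_sample(r["author"], r["content"]) for r in rows]
for s in samples:
    print(s)
```

The markers only become single ids if they were first added to the tokenizer as special tokens; otherwise they are split into ordinary sub-word pieces.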


## Custom dataset (only for CustomDataset method)

In [42]:
# import pandas as pd

# temp = {'author': ['A', 'B', 'C'], 'content': ['dog', 'cat', 'rat']}
# df_train = pd.DataFrame(temp)
# df_val = pd.DataFrame(temp)
# df_train

In [27]:
# from torch.utils.data import Dataset
# from transformers import DataCollatorForLanguageModeling

# class CustomDataset(Dataset):
#     def __init__(self, df, max_length=768):

#         self.tokenizer = tokenizer
#         self.input_ids = []
#         self.attn_masks = []

#         # Iterate over rows (iterating over the DataFrame itself yields column names)
#         for _, row in df.iterrows():
#             # f-string, so that the author and content values are actually inserted
#             encoding_dict = self.tokenizer(
#                 f"<|author|>{row['author']}<|content|>{row['content']}<|endoftext|>",
#                 truncation=True,
#                 max_length=max_length,
#                 padding='max_length')
#             self.input_ids.append(torch.tensor(encoding_dict['input_ids']))
#             self.attn_masks.append(torch.tensor(encoding_dict['attention_mask']))

#     def __len__(self):
#         return len(self.input_ids)

#     def __getitem__(self, idx):
#         return self.input_ids[idx], self.attn_masks[idx]


# train_dataset = CustomDataset(df_train, max_length=100)
# val_dataset = CustomDataset(df_val, max_length=100)
# data_collator = DataCollatorForLanguageModeling(
#         tokenizer=tokenizer, mlm=False)

In [35]:
# tokenizer.encode("<|author|>con<|content|>hello\n")

[50258, 2919, 50259, 72, 5342, 50257]

In [40]:
# train_dataset[0]

(tensor([50258,    91,  3802,    59,     7,   391, 14465,     7,    61,    93,
         50259,    91,  3802,    59,    26, 34364,    63,   778, 34374,    61,
            93,     0, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260,
         50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260, 50260]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

## TextDataset from transformers (only for the TextDataset method)

- TextDataset: use it when your data is one continuous block of text that is not divided into per-line samples. TextDataset loads the whole block and cuts it into chunks of `block_size` tokens
- LineByLineTextDataset(Dataset): loads the text file line by line, one sample per line; a line is still truncated if it exceeds the block size

- Data collator: to build batches, data collators may apply some processing (like padding). `DataCollatorForLanguageModeling` can also apply random masking to the batch, but only for masked language modeling; with `mlm=False`, as used here, it instead copies the inputs as labels for causal language modeling.
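
The difference between the two loading strategies can be sketched on plain lists of token ids. This is a simplified illustration, not the library's implementation: the function names are hypothetical, and it assumes that an incomplete final block of the continuous stream is discarded.

```python
def split_into_blocks(token_ids, block_size):
    """TextDataset-style: cut one long token stream into consecutive
    blocks of exactly block_size ids (an incomplete tail is dropped)."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]


def split_line_by_line(lines_of_ids, block_size):
    """LineByLine-style: one sample per non-empty line,
    truncated to at most block_size ids."""
    return [ids[:block_size] for ids in lines_of_ids if ids]


stream = list(range(10))                    # toy token ids 0..9
blocks = split_into_blocks(stream, block_size=4)

lines = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10]]
per_line = split_line_by_line(lines, block_size=4)

print(blocks)
print(per_line)
```

With the block strategy every sample has the same length, so no padding is needed, but a block may start or end in the middle of a poem line. The line-by-line strategy keeps sample boundaries at the cost of variable lengths.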

In [11]:
from transformers import TextDataset,DataCollatorForLanguageModeling

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=100)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=100)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)
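
What a collator configured with `mlm=False` produces can be sketched in plain Python: the labels are a copy of the input ids, with padded positions set to `-100` so that the loss ignores them. The function below is a hypothetical stand-in, not the transformers implementation; in particular it masks only the padding it adds itself.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def collate_causal_lm(samples, pad_token_id):
    """Pad samples to a common length and build labels for causal LM.

    Labels equal the inputs; padded positions get IGNORE_INDEX.
    The one-position shift between inputs and targets is left to the model.
    """
    max_len = max(len(s) for s in samples)
    input_ids, labels = [], []
    for s in samples:
        n_pad = max_len - len(s)
        input_ids.append(list(s) + [pad_token_id] * n_pad)
        labels.append(list(s) + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```

With `TextDataset` all blocks already have the same length (`block_size=100` here), so the padding branch matters mainly for the `CustomDataset` method.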



# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. It is used in most of the [example scripts](https://huggingface.co/transformers/examples.html) from Hugging Face. Before we can instantiate our `Trainer`, we need the GPT-2 model loaded above and a [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) object, which exposes the points of customization during training. In the `TrainingArguments` we define the hyperparameters for the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [12]:
%env WANDB_PROJECT=GPT2-POEM
%env WANDB_WATCH=all

env: WANDB_PROJECT=GPT2-POEM
env: WANDB_WATCH=all


In [13]:
import wandb
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback


training_args = TrainingArguments(
    output_dir="./GPT2_Poet",  # the output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=40,  # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=16,  # batch size for evaluation
    evaluation_strategy='steps',
    eval_steps=50,  # number of update steps between two evaluations
    # save_steps=40,  # number of update steps between two checkpoint saves
    save_strategy='steps',
    push_to_hub=True,
    hub_model_id="GPT2_Poet",
    save_total_limit=10,
    warmup_steps=1000,  # number of warmup steps for the learning rate scheduler
    report_to='wandb',
    run_name='Run 6 - w/o label smoothing',
    logging_steps=5,
    gradient_accumulation_steps=2,
    learning_rate=5e-4,
    weight_decay=0.2,
    dataloader_num_workers=2,
    # label_smoothing_factor=0.3,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks = [EarlyStoppingCallback(early_stopping_patience= 5)],
)

/content/./GPT2_Poet is already a clone of https://huggingface.co/tuanle/GPT2_Poet. Make sure you pull the latest changes with `repo.git_pull()`.
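
The configuration above sets `warmup_steps=1000` and `learning_rate=5e-4` and leaves the scheduler at its default, which is linear. A minimal sketch of such a schedule follows: the rate rises linearly from 0 to the peak during warmup, then decays linearly to 0 at the last step. The function name is hypothetical, and the default total of 6080 steps is the value reported in the training log of this run.

```python
def linear_schedule_lr(step, base_lr=5e-4, warmup_steps=1000, total_steps=6080):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Warmup: ramp up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: go from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 500, 1000, 1600, 6080):
    print(step, linear_schedule_lr(step))
```

With 1000 warmup steps out of 6080, roughly the first sixth of the planned run is spent below the peak learning rate.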


# Train and save the model

To train the model we can simply call `trainer.train()`.
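
Because the `Trainer` was created with `EarlyStoppingCallback(early_stopping_patience=5)` and evaluates every 50 steps, training may end well before the 40 configured epochs: it stops once `eval_loss` has failed to improve for 5 consecutive evaluations. The patience rule can be sketched as follows. The function and the loss values are hypothetical, and the real callback's bookkeeping may differ in detail.

```python
def early_stopping_index(eval_losses, patience=5, threshold=0.0):
    """Return the index of the evaluation at which training would stop,
    or None if the patience is never exhausted."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss        # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Made-up validation losses, one per evaluation.
losses = [5.0, 4.8, 4.7, 4.75, 4.71, 4.9, 4.72, 4.8, 4.6]
stop_at = early_stopping_index(losses, patience=5)
print(stop_at)
```

In this toy sequence the best loss occurs at index 2, so training stops at index 7 and never reaches the lower value at index 8. That trade-off is why the patience should be chosen relative to how noisy the validation loss is.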

In [14]:
trainer.train()

***** Running training *****
  Num examples = 4878
  Num Epochs = 40
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 2
  Total optimization steps = 6080
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mtuanle[0m (use `wandb login --relogin` to force relogin)


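| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 7.1977        | 7.054901        |
| 100  | 6.5009        | 6.301069        |
| 150  | 6.2205        | 5.95819         |
| 200  | 5.9985        | 5.752966        |
| 250  | 5.9432        | 5.69221         |
| 300  | 5.7153        | 5.497174        |
| 350  | 5.6177        | 5.39291         |
| 400  | 5.5091        | 5.251722        |
| 450  | 5.4206        | 5.156911        |
| 500  | 5.131         | 5.053266        |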


***** Running Evaluation *****
  Num examples = 129
  Batch size = 16
Saving model checkpoint to ./GPT2_Poet/checkpoint-500
Configuration saved in ./GPT2_Poet/checkpoint-500/config.json
Model weights saved in ./GPT2_Poet/checkpoint-500/pytorch_model.bin

TrainOutput(global_step=1600, training_loss=4.6999406260251995, metrics={'train_runtime': 2925.1567, 'train_samples_per_second': 66.704, 'train_steps_per_second': 2.079, 'total_flos': 2620065024000000.0, 'train_loss': 4.6999406260251995, 'epoch': 10.52})
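
The step counts in the training log follow from the batch settings, and the arithmetic can be redone by hand. The sketch below assumes a single GPU; the rounding rules (ceiling for batches per epoch, floor for updates per epoch) and the way the fractional epoch is derived are assumptions about how the `Trainer` counts, chosen because they reproduce the logged values.

```python
import math

num_examples = 4878          # "Num examples" in the training log
per_device_batch_size = 16   # per_device_train_batch_size
grad_accum_steps = 2         # gradient_accumulation_steps
num_epochs = 40              # num_train_epochs
global_step = 1600           # from TrainOutput

# Batches the dataloader yields per epoch (the last one may be smaller).
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
# One optimizer update every grad_accum_steps batches.
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = updates_per_epoch * num_epochs

# Fractional epoch reached when training stopped.
full_epochs, extra_updates = divmod(global_step, updates_per_epoch)
epoch = full_epochs + extra_updates * grad_accum_steps / batches_per_epoch

print(batches_per_epoch, updates_per_epoch, total_steps, round(epoch, 2))
```

This prints `305 152 6080 10.52`, matching `Total optimization steps = 6080` and `'epoch': 10.52` above. The run therefore ended after about a quarter of the planned steps, which is consistent with early stopping having triggered.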

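After training is done you can save the model by calling `trainer.save_model()`, which writes the trained model to the `output_dir` from our `TrainingArguments`. Here we instead push the model and the tokenizer to the Hugging Face Hub with `push_to_hub()`.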

In [27]:
model.push_to_hub('GPT2_Poet')

Configuration saved in GPT2_Poet/config.json
Model weights saved in GPT2_Poet/pytorch_model.bin


In [28]:
tokenizer.push_to_hub('GPT2_Poet')

tokenizer config file saved in GPT2_Poet/tokenizer_config.json
Special tokens file saved in GPT2_Poet/special_tokens_map.json
Several commits (3) will be pushed upstream.
The progress bars may be unreliable.



To https://huggingface.co/tuanle/GPT2_Poet
   640f858..8661686  main -> main



'https://huggingface.co/tuanle/GPT2_Poet/commit/86616864423950d9520d5187b76970c88f279353'

# Test the model

To test the model, the transformers library offers [pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines): objects that provide a simple API dedicated to several tasks, including `text-generation`. In this notebook we load the fine-tuned model and tokenizer from the Hub and call `model.generate()` directly, which gives finer control over the sampling parameters.
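
The generation cell further down samples with `top_p = 0.8`. As a minimal sketch of what nucleus (top-p) filtering does to a next-token distribution: keep the most probable tokens until their combined probability reaches `top_p`, drop the rest, and renormalise. The function name and the probabilities are made up for illustration, and the library's own implementation works on logits rather than on a dictionary.

```python
def top_p_filter(probs, top_p=0.8):
    """Keep the most probable tokens until their total mass reaches top_p,
    then renormalise the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


# Hypothetical next-token distribution.
next_token_probs = {"nàng": 0.5, "rằng": 0.25, "đây": 0.15, "lời": 0.10}
nucleus = top_p_filter(next_token_probs, top_p=0.8)
print(nucleus)
```

Here the least likely token is removed and sampling continues over the remaining three. A lower `top_p` makes the output more conservative, a higher one more varied.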

In [47]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [48]:
from transformers import AutoTokenizer, AutoModelForCausalLM


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("tuanle/GPT2_Poet")
# tokenizer = AutoTokenizer.from_pretrained("imthanhlv/gpt2news")
# tokenizer.add_tokens(["\n"])
model = AutoModelForCausalLM.from_pretrained("tuanle/GPT2_Poet").to(device)


In [74]:
text = "hỏi rằng nàng"

input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
min_length = 60
max_length = 100

sample_outputs = model.generate(input_ids,pad_token_id=tokenizer.eos_token_id,
                                   do_sample=True,
                                   max_length=max_length,
                                   min_length=min_length,
                                #    temperature = .8,
                                #    top_k= 100,
                                   top_p = 0.8,
                                   num_beams= 10,
                                #    early_stopping=True,
                                   no_repeat_ngram_size= 2,
                                   num_return_sequences= 3)

for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist(), skip_special_tokens=True)))
    print('\n---')



>> Generated text 1

hỏi rằng nàng đã nói ra
cớ sao nàng lại hỏi han sự tình
vân tiên nói lại những lời
thưa rằng ở chốn am mây một mình
từ đây mới biết rõ ràng
ở đây cũng gặp một người ở đây
hai người gặp lại gặp nhau
thấy lời nàng mới hỏi tra việc này
nguyệt nga hỏi việc bấy lâu
khen rằng đạo sĩ ở đầu cửa thiền
mậu rằng hai gã đi chơi
rằng trong am tự một lời

---
>> Generated text 2

hỏi rằng nàng ở lại đây
thưa rằng tôi ở bên này
cớ sao nên nỗi lòng này chẳng may
mấy lời nàng mới nói ra
khen rằng hai gã ở đầu gặp nhau
e khi gặp gỡ giữa đàng
phút đâu gặp lại một người ở cùng
nguyệt nga nghe nói rõ ràng
rằng nàng lại nói một vài lời chưa tha
liễu rằng đạo đạo ở tây phương
cho hay đạo sĩ ở ngoài cửa thiền
ngư rằng nhờ đạo

---
>> Generated text 3

hỏi rằng nàng ở bên đàng
khen rằng đạo đạo phật ở nơi chốn nào
chẳng hay người đạo ở đây
đã nghe đạo sĩ ở đầu đạo linh
dương từ nghe nói rõ ràng
thưa rằng vốn đạo thầy linh làm chi
ngư rằng ở chốn am mây
hai người ở đạo thấy bày việc gì
từ r