<a href="https://colab.research.google.com/github/MichaelSomma94/Generative_AI/blob/main/Fine_tune_a_German_GPT_2_Model_with_custom_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a German GPT-2 Model with custom data

In this tutorial, I walk you through fine-tuning a German GPT-2 model from the Hugging Face model hub. As fine-tuning data, I used private cover letters I wrote in the past, but you can use whatever data you have at your disposal.

We are going to do roughly the following steps:
- installation
- library imports
- load the dataset and build a TextDataset
- load the pre-trained GPT-2 model and tokenizer
- initialize Trainer with its arguments
- train and save the model
- test the model with `generate`


I am using Google Colab with a GPU runtime for this tutorial. If you are not sure how to enable a GPU runtime, select one in Colab under Runtime > Change runtime type.

In [2]:
!pip3 install torch torchvision
!pip install transformers
!pip install accelerate -U
!pip install PyPDF2


Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [3]:
!nvidia-smi

Thu Aug 17 15:29:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Before continuing, upload your PDF files to the Colab working directory; the extraction script further down reads every `.pdf` file it finds there.

# Necessary Imports

In [4]:
import glob

import torch
from PyPDF2 import PdfReader  # used below to extract text from the PDFs
from transformers import AutoConfig, AutoTokenizer
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments, AutoModelWithLMHead


# Prepare the dataset and build a ``TextDataset``

We use a script that extracts the text from all `.pdf` files in a given directory.

Then we split the text into a training set and a test set and save each to a `.txt` file.

`TextDataset` is an implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library.


In [None]:

pdf_list = glob.glob("./*.pdf")

full_text = " "
for pdf_path in pdf_list:
    # create a PDF reader object for this file
    reader = PdfReader(pdf_path)
    # iterate over all pages of the PDF
    for page in reader.pages:
        # extract the text of the page
        text = page.extract_text()
        # skip the first 110 characters of every page (adjust this offset to your own files)
        full_text += text[110:]
#print(full_text)



In [None]:
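# use the first 90% of the characters for training and the rest for testing
n = int(0.9 * len(full_text))
with open('train.txt', 'w', encoding='utf-8') as f:
    f.write(full_text[:n])
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(full_text[n:])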
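
The split above cuts `full_text` at a fixed character offset, so the boundary can land in the middle of a word. As a minimal sketch (the helper name `split_at_whitespace` is hypothetical and not part of the original notebook), the cut can be moved forward to the next whitespace character:

```python
def split_at_whitespace(text, train_fraction=0.9):
    """Split text into train/test parts near train_fraction, cutting at whitespace."""
    cut = int(train_fraction * len(text))
    # Advance the cut until it sits on a whitespace character (or the end of the text).
    while cut < len(text) and not text[cut].isspace():
        cut += 1
    return text[:cut], text[cut:].lstrip()


sample = "eins zwei drei vier fünf sechs sieben acht neun zehn"
train_text, test_text = split_at_whitespace(sample)
print(repr(train_text))
print(repr(test_text))
```

For a corpus made of several letters, splitting by document instead of by character would keep whole letters together; the character-based variant here simply stays close to the original approach.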

The next step is to download the tokenizer. We use the tokenizer of the `german-gpt2` model on [Hugging Face](https://huggingface.co/anonymous-german-nlp/german-gpt2).

In [7]:


tokenizer = AutoTokenizer.from_pretrained("anonymous-german-nlp/german-gpt2")

train_path = '/content/train.txt'
test_path = '/content/test.txt'

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
def load_dataset(train_path, test_path, tokenizer):
    # tokenize each file and cut it into blocks of 128 tokens
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128)

    # mlm=False selects causal language modeling: the labels are a copy of the inputs
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
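
To make `block_size=128` concrete: `TextDataset` tokenizes the whole file and cuts the resulting token stream into consecutive blocks of `block_size` tokens, dropping the incomplete block at the end. The following is a simplified, pure-Python illustration of that chunking, not the library code itself; `chunk_token_ids` is a hypothetical helper, and the integer list stands in for real token ids.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Cut a token stream into consecutive blocks of exactly block_size ids."""
    blocks = []
    # Stop early enough that every block is full; the trailing remainder is dropped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


toy_ids = list(range(300))         # stand-in for a tokenized text file
blocks = chunk_token_ids(toy_ids)  # 300 ids -> 2 full blocks, 44 ids dropped
print(len(blocks), [len(block) for block in blocks])
```

One consequence for small corpora: a file that tokenizes to fewer than `block_size` tokens yields no blocks at all, so it is worth checking `len(train_dataset)` and `len(test_dataset)` before training.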

# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

Next, we load the pretrained model, a German version of GPT-2, and set up the `Trainer` with the desired training arguments.

In [9]:
# AutoModelWithLMHead is deprecated; for causal models such as GPT-2,
# AutoModelForCausalLM is the recommended replacement
model = AutoModelWithLMHead.from_pretrained("anonymous-german-nlp/german-gpt2")


training_args = TrainingArguments(
    output_dir="./gpt2-ger_CL",  # output directory for checkpoints
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps=400,  # update steps between two evaluations (used only when step-based evaluation is enabled)
    save_steps=800,  # save a checkpoint every 800 update steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,  # return only the loss during evaluation
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



Downloading pytorch_model.bin:   0%|          | 0.00/675M [00:00<?, ?B/s]
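
The value `warmup_steps=500` interacts with the length of the run. With the default linear schedule, the learning rate rises linearly from 0 to its peak over the warmup steps and then decays linearly to 0 at the end of training. A minimal sketch of that multiplier, assuming the default scheduler type (`linear`) and the default peak learning rate of `5e-5`; `linear_warmup_factor` is a hypothetical helper written for illustration:

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


peak_lr = 5e-5  # assumed: the TrainingArguments default learning rate
for step in (0, 250, 500, 750, 1000):
    factor = linear_warmup_factor(step, warmup_steps=500, total_steps=1000)
    print(step, factor * peak_lr)
```

If the whole run is shorter than `warmup_steps`, the learning rate never reaches its peak, so for small datasets a smaller warmup value may be worth trying.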

In [10]:
config_file = AutoConfig.from_pretrained("anonymous-german-nlp/german-gpt2")
print(config_file)

GPT2Config {
  "_name_or_path": "anonymous-german-nlp/german-gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 52000
}
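
The printed configuration is enough to estimate the model size. As a rough sketch, assuming the standard GPT-2 layout (tied input and output embeddings, and a feed-forward width of `4 * n_embd` because `n_inner` is `null`) and counting only trainable weights, the parameter count follows from `vocab_size`, `n_positions`, `n_embd` and `n_layer`. The function name is hypothetical.

```python
def gpt2_parameter_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate the number of trainable parameters of a standard GPT-2 model."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d  # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)  # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # two feed-forward layers
    layer_norms = 2 * (2 * d)                      # two LayerNorms per block
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_block + final_layer_norm


total = gpt2_parameter_count(vocab_size=52000, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total:,}")
```

For the values printed above this comes to roughly 126 million parameters, slightly more than the original English GPT-2 small model because of the larger vocabulary.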



# Train and save the model



In [11]:
trainer.train()





TrainOutput(global_step=6, training_loss=5.214125951131185, metrics={'train_runtime': 8.5948, 'train_samples_per_second': 19.198, 'train_steps_per_second': 0.698, 'total_flos': 10778296320000.0, 'train_loss': 5.214125951131185, 'epoch': 3.0})
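
The reported `global_step=6` can be sanity-checked by hand: the number of optimizer updates is the number of batches per epoch times the number of epochs. The metrics above imply roughly 55 training blocks (19.198 samples per second over 8.5948 seconds is about 165 samples across 3 epochs); that figure is inferred from the metrics, not printed by the notebook. A small sketch, assuming a single device and no gradient accumulation, with a hypothetical helper name:

```python
import math


def total_update_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates for a run without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


steps = total_update_steps(num_examples=55, batch_size=32, num_epochs=3)
print(steps)
```

With only a handful of updates, `save_steps=800`, `eval_steps=400` and `warmup_steps=500` are never reached, so for a dataset this small those values would need to be much lower to have any effect.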

In [12]:
trainer.save_model('GPT2_Germa_refined')

# Test the model

In [13]:
# Input prompt
prompt = "Mich würd es sehr freuen"

# Tokenize the input prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_ids = input_ids.to(device)
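# The tokenizer defines no pad token, so pad_token_id is None here and
# generate() falls back to eos_token_id (see the message in the output)
pad_token_id = tokenizer.pad_token_id
# Attend to every prompt token (a single prompt contains no padding)
attention_mask = torch.ones(input_ids.shape, device=device)
# Generate text
max_length = 100  # Maximum total length in tokens (prompt plus generated text)
output = model.generate(input_ids, max_length=max_length, pad_token_id=pad_token_id,
    attention_mask=attention_mask, num_return_sequences=1)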

# Decode the generated output back to text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Mich würd es sehr freuen, dass wir uns in der Lage sind, die besten Wünsche unserer Kunden zu erfüllen.
Wir freuen uns, dass wir unsere Kunden in der Lage, die besten Produkte zu liefern.
Wir sind sehr stolz auf unsere Arbeit und hoffen, dass wir in der Lage sein, die besten Produkte zu liefern.
Wir sind sehr stolz auf unsere Arbeit und hoffen, dass wir in der Lage sein, die besten Produkte zu liefern.
Wir hoffen, dass wir in der Lage sein,
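
The sample above starts repeating itself, which is typical for greedy decoding, the behavior of `generate` when no sampling options are passed. One possible remedy is to sample from the model and to block repeated n-grams. The sketch below only assembles keyword arguments that `generate` accepts; the specific values are untuned assumptions, and `build_generation_kwargs` is a hypothetical helper.

```python
def build_generation_kwargs(max_length=100):
    """Collect sampling options for model.generate (values are untuned assumptions)."""
    return {
        "max_length": max_length,    # total length in tokens, prompt included
        "do_sample": True,           # sample instead of always taking the most likely token
        "top_k": 50,                 # restrict sampling to the 50 most likely tokens
        "top_p": 0.95,               # nucleus sampling threshold
        "no_repeat_ngram_size": 3,   # never generate the same 3-gram twice
        "num_return_sequences": 1,
    }


generation_kwargs = build_generation_kwargs()
print(generation_kwargs)

# Usage with the objects from the cell above (not executed here):
# output = model.generate(input_ids, attention_mask=attention_mask,
#                         pad_token_id=tokenizer.eos_token_id, **generation_kwargs)
```

Because sampling is random, repeated runs produce different continuations; the settings are a starting point to experiment with rather than recommended values.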
