# Fine-Tuning GPT-2 on an Existing Dataset

Reference: https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners

**Step 1: Install the dependencies and import the Transformers classes for GPT-2**

In [None]:
!pip install accelerate -U
!pip install transformers[torch]

import transformers
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments


**Step 2: Define the functions that train the model on the provided dataset**

In [None]:
def load_dataset(file_path, tokenizer, block_size=128):
  """
  Parameters:
    file_path: path to the plain-text training file
    tokenizer: the GPT-2 tokenizer provided by the Hugging Face Transformers library
    block_size: the number of tokens in each chunk that the training data is split into;
          this is also the sequence length the model sees during training

  Return:
    dataset: a TextDataset object (also provided by the Transformers library) that holds
         the tokenized training text split into blocks
  """
  dataset = TextDataset(
      tokenizer = tokenizer,
      file_path = file_path,
      block_size = block_size,
  )
  return dataset


def load_data_collator(tokenizer, mlm=False):
  """
  Parameters:
    tokenizer: the GPT-2 tokenizer provided by the Hugging Face Transformers library
    mlm: whether to use masked language modeling; set to False for an autoregressive
       model such as GPT-2

  Return:
    data_collator: prepares batches for language modeling. Given a list of tokenized
            examples, it collates them into tensors and, with mlm=False, copies the
            input IDs into the labels for causal language modeling.
  """
  data_collator = DataCollatorForLanguageModeling(
      tokenizer=tokenizer,
      mlm=mlm,
  )
  return data_collator



def train(
    train_file_path,
    model_name,
    output_dir,
    overwrite_output_dir,
    per_device_train_batch_size,
    num_train_epochs,
    save_steps):
  # Load the pretrained tokenizer, then build the dataset and collator from it
  tokenizer = GPT2Tokenizer.from_pretrained(model_name)
  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  # Save the tokenizer next to the model so both can be loaded from output_dir
  tokenizer.save_pretrained(output_dir)

  model = GPT2LMHeadModel.from_pretrained(model_name)

  # Save the pretrained weights first; trainer.save_model() overwrites them later
  model.save_pretrained(output_dir)

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
          save_steps=save_steps,  # write a checkpoint every save_steps optimizer steps
      )

  trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
  )

  trainer.train()
  # Write the fine-tuned model to output_dir
  trainer.save_model()
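
To make the `block_size` parameter of `load_dataset` concrete, here is a minimal sketch of fixed-size chunking on a list of token IDs. It is a simplified illustration, not the Transformers implementation: the name `split_into_blocks` is hypothetical, the token IDs are made up, and dropping the incomplete final block is an assumption of the sketch.

```python
def split_into_blocks(token_ids, block_size):
    """Split a flat list of token IDs into consecutive blocks of block_size.

    Hypothetical helper for illustration; trailing tokens that do not
    fill a whole block are dropped (an assumption of this sketch).
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Made-up token IDs standing in for a tokenized text file
token_ids = list(range(10))
blocks = split_into_blocks(token_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With `block_size = 128`, each training example would therefore be 128 tokens long.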

**Step 3: Define the Parameters for the Training Function**

---
The input is a text dataset of about 5 MB.



In [None]:
train_file_path = "/content/mental_health_data.txt"
model_name = 'gpt2'
output_dir = "/content/TrainingOutput"
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 5.0
save_steps = 500
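
These settings determine how many optimizer steps training runs for, and therefore how often `save_steps` would trigger a checkpoint. The sketch below is a rough estimate for a single device with no gradient accumulation. The token count is a hypothetical placeholder (the real number depends on the dataset), and `estimate_steps` is an illustrative helper, not part of Transformers.

```python
import math


def estimate_steps(num_tokens, block_size, batch_size, num_epochs):
    """Estimate total optimizer steps for one device, no gradient accumulation."""
    num_blocks = num_tokens // block_size  # incomplete final block dropped
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * int(num_epochs)


# Hypothetical token count; measure the real one by tokenizing the training file
num_tokens = 1_200_000
total_steps = estimate_steps(num_tokens, block_size=128, batch_size=8, num_epochs=5.0)
checkpoints = total_steps // 500  # save_steps = 500
print(total_steps, checkpoints)  # 5860 11
```

If the estimate comes out below `save_steps`, no intermediate checkpoint would be written and only the final model is saved.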

**Step 4: Run the training and save the model for later inference**

In [None]:
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)


**Step 5: Load the trained model for generation**

In [None]:
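from transformers import GPT2LMHeadModel, GPT2Tokenizer

def load_model(model_path):
    # Load the fine-tuned weights saved by trainer.save_model()
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    # Load the tokenizer files saved alongside the model
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer


def generate_text(sequence, max_length):
    # The model and tokenizer are reloaded from disk on every call
    model_path = "/content/TrainingOutput"
    model = load_model(model_path)
    tokenizer = load_tokenizer(model_path)
    # Encode the prompt as a batch containing one sequence of token IDs
    ids = tokenizer.encode(sequence, return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,  # sample instead of decoding greedily
        max_length=max_length,  # total length in tokens, prompt included
        pad_token_id=model.config.eos_token_id,  # GPT-2 has no pad token
        top_k=50,  # sample only from the 50 most likely tokens
        top_p=0.95,  # and only from the smallest set covering 95% probability
    )
    return tokenizer.decode(final_outputs[0], skip_special_tokens=True)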
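
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below shows the idea on a made-up next-token distribution. It is a simplified illustration that works on a plain dict of probabilities, not the code Transformers runs internally, and `top_k_top_p_filter` is a hypothetical name.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize.

    Simplified illustration on a dict of probabilities; not library code.
    """
    # Sort tokens from most to least likely and keep the first top_k
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Keep tokens until the cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution
probs = {"sad": 0.5, "tired": 0.3, "happy": 0.15, "purple": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print(sorted(filtered))  # ['sad', 'tired']
```

Lower values of either parameter make the output more conservative; higher values allow less likely tokens and more varied text.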

In [None]:
from google.colab import drive
drive.mount('/content/drive')

MessageError: Error: credential propagation was unsuccessful

**Step 6: Generate raw text from an input prompt**

In [None]:
sequence = "User: I've been feeling so sad and overwhelmed lately."
max_len = 128
print(generate_text(sequence, max_len))

User: I've been feeling so sad and overwhelmed lately. </s>
Assistant: <s> How long have you been feeling this way? </s>
Assistant: <s> Tell me more. </s> <s> Are there anything else you can tell me? </s>
Assistant: <s> Talking about it might help. </s> <s> What are your thoughts right now? </s>
User: <s> How were you feeling last week? </s>
Assistant: <s> My first thoughts were sadness, but I'm not sure how to put it. </


**Step 7: Make the output more conversational by appending the "Assistant:" token to the prompt**

In [None]:
sequence = "User: I've been feeling so sad and overwhelmed lately. Assistant: "
max_len = 128
print(generate_text(sequence, max_len))

User: <s> I've been feeling so sad and overwhelmed lately. </s> Assistant: 
Assistant: <s> I'm listening. </s> <s> So, please tell me why? </s>
User: <s> Why don't you tell me why? </s>
Assistant: <s> Why don't you tell me? </s>
Assistant: <s> I know you're feeling this way. </s> <s> That being said, we're all human. </s> <s> Let's discuss further why you're feeling this way. </s
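
The prompt format used in this step can be wrapped in a small helper so that every request ends with the `Assistant:` cue. `build_prompt` is a hypothetical name introduced here for illustration; the trailing space after the cue mirrors the prompt string in the cell above.

```python
def build_prompt(user_text):
    """Wrap raw user text in the "User: ... Assistant: " prompt format."""
    return f"User: {user_text.strip()} Assistant: "


prompt = build_prompt("  I've been feeling so sad and overwhelmed lately.  ")
print(repr(prompt))
```

The result could then be passed to `generate_text(prompt, 128)` in place of a hand-written string.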


**Step 8: Use post-processing to control the generated output and make the model behave like a chatbot**

In [None]:
sequence = input()
while sequence:
  max_len = 128
  generative_string = generate_text(sequence, max_len)

  # Turn the <s> and </s> markers into line breaks so each utterance is on its own line
  text_cleaned = generative_string.replace("<s>", "\n").replace("</s>", "\n")
  lines = text_cleaned.split("\n")
  text_to_output = ""
  next_line_is_output = False
  for line in lines:
    # The reply is the first line that follows an "Assistant:" marker
    if line.strip().startswith("Assistant:"):
      next_line_is_output = True
      continue
    if next_line_is_output:
      text_to_output = line
      break
  print(f"Assistant: {text_to_output}")
  # Exact match only: inputs such as "bye!" do not end the loop
  if sequence == "Bye":
    break

  sequence = input()

I've been feeling so sad and overwhelmed lately.
Assistant:  I'm here for you. 
Why I feels like nobody loves me
Assistant:  I am depressed. 
how should I do
Assistant:  Oh I see. 
How should I do to relieve my pain
Assistant:  Just take a deep breath and gently open your eyes. 
that works!
Assistant:  I'm sorry to hear that. 
bye!
Assistant:  Yeah. 


KeyboardInterrupt: Interrupted by user
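
The post-processing in Step 8 can be isolated into a function that is testable without loading the model. The sketch below follows the same logic as the loop; `extract_first_reply` and `is_goodbye` are hypothetical names, and stripping whitespace from the reply is an addition of this sketch. In the transcript above, `bye!` did not end the session because the loop compares against the exact string `Bye`, so `is_goodbye` shows one possible, more lenient rule.

```python
def extract_first_reply(generated_text):
    """Return the first line that follows an "Assistant:" marker, or ""."""
    # Treat the <s> and </s> markers as line breaks
    cleaned = generated_text.replace("<s>", "\n").replace("</s>", "\n")
    marker_seen = False
    for line in cleaned.split("\n"):
        if line.strip().startswith("Assistant:"):
            marker_seen = True
            continue
        if marker_seen:
            return line.strip()  # stripping is an addition of this sketch
    return ""


def is_goodbye(user_text):
    """Match "bye" regardless of case and trailing punctuation (hypothetical rule)."""
    return user_text.strip().rstrip("!.?").lower() == "bye"


# Shortened sample in the format of the Step 6 output
sample = (
    "User: I've been feeling so sad and overwhelmed lately. </s>\n"
    "Assistant: <s> How long have you been feeling this way? </s>\n"
)
print(extract_first_reply(sample))  # How long have you been feeling this way?
print(is_goodbye("bye!"), is_goodbye("Bye"))  # True True
```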