# Finetuning Small/Medium Size LLMs with custom data
___

Data Sources:
*   Notes from **Finetuning Large Language Models** short course provided by **DeepLearning.AI** [https://learn.deeplearning.ai/]
*   Code from: https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/language_modeling.ipynb#scrollTo=OD9MYDZhMVB2
___

Installing required packages

In [1]:
!pip install datasets



In [2]:
!pip install accelerate -U

Collecting accelerate
  Obtaining dependency information for accelerate from https://files.pythonhosted.org/packages/f7/fc/c55e5a2da345c9a24aa2e1e0f60eb2ca290b6a41be82da03a6d4baec4f99/accelerate-0.25.0-py3-none-any.whl.metadata
  Downloading accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265.7/265.7 kB 13.1 MB/s eta 0:00:00
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.24.1
    Uninstalling accelerate-0.24.1:
      Successfully uninstalled accelerate-0.24.1
Successfully installed accelerate-0.25.0


Importing libraries

In [3]:
import pandas as pd
import datasets
from pprint import pprint
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer
import torch
import warnings
warnings.simplefilter("ignore")



## Data Preparation

This notebook assumes you already have a dataset of **Questions** and **Answers** that is of high quality or has been carefully curated.
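
As a rough illustration of the expected layout, the sketch below writes two made-up question/answer records to a JSON Lines file and reads them back. The file name `qa_sample.jsonl` and the records themselves are hypothetical; only the `question` and `answer` field names mirror the dataset used in this notebook.

```python
import json
import os
import tempfile

# Hypothetical records; only the field names mirror the real dataset.
records = [
    {"question": "What is finetuning?",
     "answer": "Further training of a pretrained model on task-specific data."},
    {"question": "What is a token?",
     "answer": "A unit of text the model reads, often a word piece."},
]

# JSON Lines format: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "qa_sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, skipping blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), sorted(loaded[0].keys()))
```

A file in this shape can be inspected by hand before training, which is a cheap way to spot empty answers or duplicated questions.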

In [4]:
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)


Using a dataset available on the Hugging Face Hub

In [5]:
filename = "kotzeje/lamini_docs.jsonl"
finetuning_dataset_loaded = datasets.load_dataset(filename, split="train")
finetuning_dataset_loaded["question"][0:2]

Downloading and preparing dataset parquet/kotzeje--lamini_docs.jsonl to /root/.cache/huggingface/datasets/parquet/kotzeje--lamini_docs.jsonl-a564afd9ef4b1477/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...



Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/kotzeje--lamini_docs.jsonl-a564afd9ef4b1477/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.


['How can I evaluate the performance and quality of the generated text from Lamini models?',
 "Can I find information about the code's approach to handling long-running tasks and background jobs?"]

In [6]:
finetuning_dataset_loaded["answer"][0:2]

["There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance.",
 'Yes, the code includes methods for submitting jobs, checking job status, and retrieving job results. It also includes a method for canceling jobs. Additionally, there is a method for sampling multiple outputs from a model, which could be useful for long-running tasks.']

In [7]:
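def tokenize_function(examples):
    # With batched=True and batch_size=1, every column is a one-element list,
    # so [0] selects the single example in the batch.
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    # Reuse the end-of-sequence token as the padding token
    tokenizer.pad_token = tokenizer.eos_token
    # First pass: tokenize without truncation to measure the sequence length
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Cap the sequence length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    # Second pass: truncate from the left, so the end of the text is kept
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs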
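
Setting `tokenizer.truncation_side = "left"` means that an over-long sequence loses tokens from its start rather than its end. Since the question and answer are concatenated, the tail of the answer survives and the beginning of the question is what gets cut. The sketch below shows the difference on plain lists; the helper names and token ids are hypothetical and are not part of the tokenizer API.

```python
def truncate_left(token_ids, max_length):
    """Keep the last max_length tokens, dropping the earliest ones."""
    if max_length <= 0:
        return []
    return token_ids[-max_length:]


def truncate_right(token_ids, max_length):
    """Keep the first max_length tokens, dropping the latest ones."""
    return token_ids[:max_length]


# Made-up token ids standing in for a tokenized question and answer.
question_ids = [11, 12, 13]
answer_ids = [21, 22, 23, 24]
ids = question_ids + answer_ids

left = truncate_left(ids, 5)    # loses the start of the question
right = truncate_right(ids, 5)  # loses the end of the answer
print(left, right)
```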

In [8]:
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)


Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})


Splitting the tokenized dataset into train and test sets

In [9]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, shuffle=True, seed=123)
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask'],
        num_rows: 1120
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask'],
        num_rows: 280
    })
})
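
The idea behind a seeded, shuffled split can be sketched with the standard library alone. This is only an illustration of the principle: it reproduces the 1120/280 sizes shown above, but it does not select the same rows as `train_test_split`, and the function name `split_indices` is hypothetical.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1400, 0.2, seed=123)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split stable across runs, so validation losses from different experiments remain comparable.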


## Model Training

Using the *Trainer* class from the *transformers* library

In [10]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)


In [11]:
device_count = torch.cuda.device_count()
if device_count > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [12]:
base_model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
          

In [13]:
epochs = 3
trained_model_name = f"lamini_docs_{epochs}_epochs"
output_dir = trained_model_name

In [14]:
training_args = TrainingArguments(
  # Directory to save model checkpoints
  output_dir=output_dir,
  # Evaluate on the test split at the end of every epoch
  evaluation_strategy="epoch",
  # Learning rate
  learning_rate=1.0e-5,
  # Number of training epochs
  num_train_epochs=epochs
)
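
The arguments above leave the batch size and the logging interval unset. The arithmetic below estimates how many optimizer steps the run takes, assuming the library defaults of `per_device_train_batch_size=8` and `logging_steps=500`, a single device, and the 1120 training rows from the split. Under those assumptions the run ends before the first logging step, which would be consistent with the training loss column reading "No log" in the results.

```python
import math

train_rows = 1120            # size of the train split
per_device_batch_size = 8    # assumed default, not set in this notebook
num_devices = 1              # assumption: one GPU or CPU
epochs = 3
logging_steps = 500          # assumed default, not set in this notebook

steps_per_epoch = math.ceil(train_rows / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * epochs
reaches_first_log = total_steps >= logging_steps

print(steps_per_epoch, total_steps, reaches_first_log)
```

Passing a smaller `logging_steps` value to `TrainingArguments` would be one way to get training loss values for a run this short.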

In [15]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
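
With `mlm=False` the collator prepares batches for causal language modeling: sequences are padded to a common length and the labels are a copy of the input ids with padding positions set to `-100`, a value the loss ignores. The sketch below imitates that behaviour on plain lists as a simplified illustration; it is not the library's implementation, and `collate_causal_lm` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def collate_causal_lm(sequences, pad_token_id):
    """Right-pad sequences and build labels that ignore padding positions."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids = seq + [pad_token_id] * n_pad
        attention_mask = [1] * len(seq) + [0] * n_pad
        # Labels equal the inputs; the shift by one position is assumed to
        # happen inside the model's loss computation.
        labels = [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append(attention_mask)
        batch["labels"].append(labels)
    return batch


batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch)
```

One side effect visible in this sketch: because the pad token is the EOS token here, any genuine EOS token inside a sequence would be masked out of the loss as well.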

In [16]:
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    data_collator=data_collator
)

In [17]:
import os
os.environ["WANDB_DISABLED"] = "true"
!wandb off

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.


In [18]:
training_output = trainer.train()

wandb: Tracking run with wandb version 0.16.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.930215        |
| 2     | No log        | 1.841945        |
| 3     | No log        | 1.851337        |
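
For a causal language model the validation loss is the mean cross-entropy per token, so its exponential gives the perplexity, which is often easier to interpret. The short calculation below converts the three validation losses reported above; the variable names are illustrative.

```python
import math

# Validation losses copied from the training results above.
validation_loss = {1: 1.930215, 2: 1.841945, 3: 1.851337}

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}
best_epoch = min(validation_loss, key=validation_loss.get)

for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: loss={validation_loss[epoch]:.4f} perplexity={ppl:.2f}")
print("lowest validation loss at epoch", best_epoch)
```

The loss is slightly higher after epoch 3 than after epoch 2, which may be an early sign of overfitting, although the gap is small enough that it could also be noise.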


## Saving model

In [19]:
save_dir = f'{output_dir}/final'
trainer.save_model(save_dir)
print("Saved model to:", save_dir)

Saved model to: lamini_docs_3_epochs/final


## Testing results from the model

In [20]:
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
finetuned_slightly_model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
          

In [21]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100, temperature=1.0):
    # Tokenize the prompt, truncating it to at most max_input_tokens tokens
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate on the same device as the model.
    # Note: max_length counts the prompt tokens as well as the generated ones.
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens,
        temperature=temperature,
        do_sample=True,
        top_p=0.95
    )

    # Decode the full sequence (prompt + completion) back to text
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt by dropping its leading characters
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
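
The `do_sample=True` and `top_p=0.95` settings select nucleus sampling: at each step only the most probable tokens whose cumulative probability reaches `top_p` remain candidates, and the next token is drawn from that reduced set. The sketch below illustrates the filtering step on a made-up distribution. It is a simplified illustration rather than the library's implementation, and `top_p_filter` is a hypothetical name.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose probability mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


# Made-up next-token distribution over a five-token vocabulary.
probs = [0.6, 0.25, 0.12, 0.02, 0.01]
filtered = top_p_filter(probs, top_p=0.95)
print(filtered)
```

In this example the two least likely tokens are dropped, which is how the setting trims the low-probability tail while still allowing varied output.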

In [22]:
test_question = split_dataset["test"][0]['question']
print("****Question input (test)****:", test_question)

print("****Finetuned slightly model's answer****: ")
print(inference(test_question, finetuned_slightly_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


****Question input (test)****: Is it possible to fine-tune Lamini on a specific dataset for text generation in legal documents?
****Finetuned slightly model's answer****: 
Yes, it is possible to fine-tune Lamini on a specific dataset for text generation in legal documents.  For example, if you want to update a legal document with a specific legal situation, you can adjust the number of examples per iteration of Lamini and keep the current iteration large enough to handle that task. Additionally, Lamini can be tuned to handle cases


In [23]:
print("****Finetuned slightly model's answer*****: ")
print(inference("How can I evaluate the performance and quality of the generated text??", finetuned_slightly_model, tokenizer))



****Finetuned slightly model's answer*****: 
 To evaluate the performance and quality of the generated text, we use techniques like text summarization and sentiment analysis to separate text from filler text. We define a task as an instance where the generated text should be summarization or sentiment analysis. It is important to use a variety of techniques to keep the generated text consistent and relevant for a given task. This can include fine-tuning techniques to improve formatting, removing irrelevant text, using natural


In [24]:
print("Finetuned slightly model's answer: ")
print(inference("Tell me about the lamini API?", finetuned_slightly_model, tokenizer))



Finetuned slightly model's answer: 
The lamini API is a great tool for developing and implementing language models with data. This API is built on the ground that data is recorded in the form of text and is available for usage in your application. Lamini uses a pre-built language model to model and retrieve the data for training and inference. This data is used to train your language model and generate predictions based on the provided text. Additionally, the API allows you to customize the language model


In [25]:
# To reload a finetuned model in a new session, first define `device` and the `inference` function
# finetuned_longer_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
# tokenizer = AutoTokenizer.from_pretrained("lamini/lamini_docs_finetuned")

# finetuned_longer_model.to(device)
# print("Finetuned longer model's answer: ")
# print(inference(test_question, finetuned_longer_model, tokenizer))