Finetuning an LLM from scratch.

In [1]:
import pandas as pd
from pprint import pprint
from datasets import load_dataset
from transformers import AutoTokenizer


# Data Preparation

## 1. Load the dataset and tokenizer from Hugging Face

In [3]:
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

data_file = "lamini/lamini_docs"
dataset = load_dataset(data_file)
dataset

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})

## 2. Convert dataset to Question & Answer format

In [6]:
prompt_template = """### Question:
{question}

### Answer:"""
train_dataset, test_dataset = dataset["train"], dataset["test"]

finetuning_dataset = []
# Iterate over rows directly instead of re-reading the whole column on every index
for example in train_dataset:
    # Wrap the raw question in the prompt template; keep the answer unchanged
    text_with_prompt_template = prompt_template.format(question=example["question"])
    finetuning_dataset.append({"question": text_with_prompt_template, "answer": example["answer"]})

pprint(finetuning_dataset[1])

{'answer': 'Yes, the code includes methods for submitting jobs, checking job '
           'status, and retrieving job results. It also includes a method for '
           'canceling jobs. Additionally, there is a method for sampling '
           'multiple outputs from a model, which could be useful for '
           'long-running tasks.',
 'question': '### Question:\n'
             "Can I find information about the code's approach to handling "
             'long-running tasks and background jobs?\n'
             '\n'
             '### Answer:'}
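
The formatting step can also be wrapped in a small helper. The following is a minimal sketch that uses two made-up rows in place of the `lamini/lamini_docs` split; the `rows` list and the `build_examples` name are illustrative and not part of the dataset or of any library.

```python
from pprint import pprint

PROMPT_TEMPLATE = """### Question:
{question}

### Answer:"""


def build_examples(rows):
    """Wrap each question in the prompt template and keep the answer as-is."""
    return [
        {
            "question": PROMPT_TEMPLATE.format(question=row["question"]),
            "answer": row["answer"],
        }
        for row in rows
    ]


# Hypothetical rows, used only to illustrate the shape of the data.
rows = [
    {"question": "What is fine-tuning?", "answer": "Continuing training on new data."},
    {"question": "What is a token?", "answer": "A unit of text the model reads."},
]

examples = build_examples(rows)
pprint(examples[0])
```

Because the template ends with `### Answer:`, concatenating `question + answer` later yields one string in which the answer directly follows the prompt.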


## 3. Tokenize the processed data (with padding and truncation)

In [7]:
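def tokenize_function(finetuning_dataset):
    # Concatenate each prompt with its answer to form one training string
    text = []
    for example in finetuning_dataset:
        text.append(example["question"] + example["answer"])
    # Reuse the EOS token for padding so that padding=True works
    tokenizer.pad_token = tokenizer.eos_token
    # First pass: pad to the longest example just to measure its length
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )
    # Cap the sequence length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    # Second pass: truncate from the left so the end of each example (the answer) is kept
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        padding=True,
        max_length=max_length
    )
    return tokenized_inputs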

In [5]:
tokenized_inputs = tokenize_function(finetuning_dataset)
print(tokenized_inputs.input_ids[0])

[ 4118 19782    27   187  2347   476   309  7472   253  3045   285  3290
   273   253  4561  2505   432   418  4988    74  3210    32   187   187
  4118 37741    27  2512   403  2067 17082   326   476   320   908   281
  7472   253  3045   285  3290   273  4561  2505   432   418  4988    74
  3210    13  1690 44229   414    13   378  1843    54  4868    13   285
  1966  7103    15  3545 12813   414  5593   849   973   253  1566 26295
   253  1735  3159   275   247  3425    13  1223   378  1843    54  4868
  5593   253 14259   875   253  4561  2505   285   247  3806  2505    15
  8801  7103  8687  1907  1966 16006  2281   253  3290   273   253  4561
  2505  1754   327  2616   824   347 25253    13  2938  1371    13   285
 17200    15   733   310  8521   281   897   247  5019   273   841 17082
   323   247 11088  7103   273   253  1566   434  3045    15     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

## Padding and truncation examples

In [8]:
text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
print("Encoded text: ", encoded_text)
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)

list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

tokenizer.pad_token = tokenizer.eos_token 
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])

encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])

tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])

encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])

Encoded text:  [12764, 13, 849, 403, 368, 32]
Decoded tokens back into text:  Hi, how are you?
Encoded several texts:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
Using padding:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]
Using truncation:  [[12764, 13, 849], [42, 1353, 1175], [4374]]
Using left-side truncation:  [[403, 368, 32], [42, 1353, 1175], [4374]]
Using both padding and truncation:  [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]
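
To make the printed results easier to follow, here is a minimal pure-Python sketch of what padding and truncation do to lists of token IDs. It does not call the tokenizer: `truncate` and `pad_to_longest` are illustrative helpers, and the pad ID of 0 is taken from the output above, where the EOS token is reused as the pad token.

```python
def truncate(ids, max_length, side="right"):
    """Keep at most max_length IDs, dropping from the right or the left end."""
    if len(ids) <= max_length:
        return list(ids)
    if side == "right":
        return list(ids[:max_length])   # drop the tail
    return list(ids[-max_length:])      # drop the head


def pad_to_longest(batch, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    return [list(ids) + [pad_id] * (longest - len(ids)) for ids in batch]


# Token IDs copied from the "Encoded several texts" output above.
batch = [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]

padded = pad_to_longest(batch)
right_truncated = [truncate(ids, 3, side="right") for ids in batch]
left_truncated = [truncate(ids, 3, side="left") for ids in batch]
both = pad_to_longest(left_truncated)

print("Padding:          ", padded)
print("Right truncation: ", right_truncated)
print("Left truncation:  ", left_truncated)
print("Both:             ", both)
```

Truncation runs before padding, so in the combined case the longest sequence is already 3 tokens long and only the one-token sequence receives pad IDs.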


## Other dataset views

In [9]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder(data_file)
print(ds_builder.info.features)

from datasets import get_dataset_split_names
get_dataset_split_names(data_file)
type(finetuning_dataset)
# train_test_split is a Dataset method, so call it on one split rather than on the DatasetDict
split_dataset = dataset["train"].train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)
# with_format returns a formatted copy, so `dataset` and the train_dataset/test_dataset
# references keep their original columns (including 'question' and 'answer')
torch_dataset = dataset.with_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
torch_dataset["train"].format

Using the latest cached version of the dataset since lamini/lamini_docs couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /Users/lianyun/.cache/huggingface/datasets/lamini___lamini_docs/default/0.0.0/05bd680b81d69a7a1d38193873f1487d73e535bf (last modified on Sun May 12 18:17:52 2024).


{'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
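
`train_test_split` is a method of a single `Dataset` split, not of the `DatasetDict` that `load_dataset` returns here. As a rough illustration of what a shuffled, seeded 90/10 split does, the sketch below splits a list of row indices in plain Python. `split_indices` is a hypothetical helper; it will not reproduce the exact permutation that the `datasets` library produces for `seed=123`.

```python
import random


def split_indices(num_rows, test_size=0.1, seed=123):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same order on every run
    num_test = round(num_rows * test_size)    # size of the held-out portion
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(1260)  # 1260 = rows in the train split
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, so results from different runs are evaluated on the same held-out rows.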

# Train Model

In [10]:
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from transformers import Trainer
import torch

In [11]:

base_model = AutoModelForCausalLM.from_pretrained(model_name)



In [12]:
device = torch.device("cpu")
# if torch.cuda.device_count() > 0: device = torch.device("cuda")
base_model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (a

In [13]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize the prompt, truncating it to at most max_input_tokens tokens
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )

    # Generate on the model's device; max_length counts prompt tokens plus generated tokens
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens,
    )

    # Decode the full sequence (prompt + continuation) back into text
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt by character count, leaving only the generated answer
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
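
Two details of `inference` are easy to miss: `max_length` in `generate` counts the prompt tokens as well as the newly generated ones, and the prompt is removed afterwards by slicing off `len(text)` characters. The sketch below illustrates both in plain Python; `new_token_budget` and `strip_prompt` are hypothetical helpers, not part of `transformers`, and the token counts are made-up values.

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens left for generation when max_length includes the prompt."""
    return max(0, max_length - prompt_tokens)


def strip_prompt(generated_text, prompt):
    """Drop the prompt from the front of the decoded text, if it is there."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text  # decoding did not reproduce the prompt exactly


print(new_token_budget(prompt_tokens=15, max_length=100))    # 85
print(new_token_budget(prompt_tokens=120, max_length=100))   # 0
print(repr(strip_prompt("Hi, how are you? Fine.", "Hi, how are you?")))  # ' Fine.'
```

If the answer should always receive the full budget regardless of prompt length, `generate` also accepts `max_new_tokens`, which counts only the generated tokens.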

In [14]:
test_text = test_dataset[0]['question']
print("------")
print("Question input (test):", test_text)
print("------")
print(f"Correct answer from Lamini docs: {test_dataset[0]['answer']}")
print("------")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


------
Question input (test): Can Lamini generate technical documentation or user manuals for software projects?
------
Correct answer from Lamini docs: Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.
------
Model's answer: 


I have a question about the following:

How do I get the correct documentation to work?

A:

I think you need to use the following code:

A:

You can use the following code to get the correct documentation.

A:

You can use the following code to get the correct documentation.

A:

You can use the following


In [15]:
max_steps = 3
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name
training_args = TrainingArguments(
  # Learning rate
  learning_rate=1.0e-5,
  # Number of training epochs
  num_train_epochs=1,
  # Max steps to train for (each step is a batch of data)
  # Overrides num_train_epochs, if not -1
  max_steps=max_steps,
  # Batch size for training
  per_device_train_batch_size=1,
  # Directory to save model checkpoints
  output_dir=output_dir,
  # Other arguments
  overwrite_output_dir=False, # Do not overwrite the content of the output directory
  disable_tqdm=False, # Keep progress bars enabled
  eval_steps=120, # Number of update steps between two evaluations
  save_steps=120, # Number of update steps between two checkpoint saves
  warmup_steps=1, # Number of warmup steps for learning rate scheduler
  per_device_eval_batch_size=1, # Batch size for evaluation
  evaluation_strategy="steps",
  logging_strategy="steps",
  logging_steps=1,
  optim="adafactor",
  gradient_accumulation_steps=4,
  gradient_checkpointing=False,
  # Parameters for early stopping
  load_best_model_at_end=True,
  save_total_limit=1,
  metric_for_best_model="eval_loss",
  greater_is_better=False
)
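
With these settings, each optimizer step accumulates gradients over several micro-batches, so only a small part of the data is seen in 3 steps. Below is a quick back-of-the-envelope check, assuming a single device and the 1260-row train split shown earlier; `examples_seen` is an illustrative helper, not a `transformers` function.

```python
def examples_seen(per_device_batch_size, grad_accum_steps, max_steps, num_devices=1):
    """Number of training examples consumed after max_steps optimizer steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch * max_steps


seen = examples_seen(per_device_batch_size=1, grad_accum_steps=4, max_steps=3)
train_rows = 1260  # size of the train split

print(seen)                         # 12
print(round(seen / train_rows, 2))  # 0.01 of one epoch
```

Twelve examples out of 1260 is about 1% of an epoch, which agrees with the `'epoch': 0.01` value that the trainer reports in its log.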

In [16]:
model_flops = (
  base_model.floating_point_ops(
    {
       "input_ids": torch.zeros(
           (1, 2048)
      )
    }
  )
  * training_args.gradient_accumulation_steps
)

print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (a

In [17]:
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
training_output = trainer.train()

max_steps is given, it will override any value given in num_train_epochs


  0%|          | 0/3 [00:00<?, ?it/s]

{'loss': 4.1562, 'grad_norm': 76.71221923828125, 'learning_rate': 1e-05, 'epoch': 0.0}
{'loss': 3.0686, 'grad_norm': 56.84518814086914, 'learning_rate': 5e-06, 'epoch': 0.01}
{'loss': 3.8929, 'grad_norm': 54.16719055175781, 'learning_rate': 0.0, 'epoch': 0.01}
{'train_runtime': 10.9792, 'train_samples_per_second': 1.093, 'train_steps_per_second': 0.273, 'train_loss': 3.705880800882975, 'epoch': 0.01}
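
The logged learning rates (1e-05, 5e-06, 0.0) are consistent with a linear schedule with warmup, which is the `Trainer` default when `lr_scheduler_type` is not set. The sketch below re-implements that multiplier in plain Python for illustration only; `linear_lr` is a hypothetical helper, not the library function, and its defaults mirror the `learning_rate`, `warmup_steps`, and `max_steps` values used above.

```python
def linear_lr(step, base_lr=1.0e-5, warmup_steps=1, total_steps=3):
    """Learning rate after `step` scheduler updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


schedule = [linear_lr(step) for step in range(1, 4)]
for step, lr in enumerate(schedule, start=1):
    print(step, lr)
```

With only 3 steps and 1 warmup step, the rate peaks at the first logged step and reaches zero by the last one, so this run barely moves the weights.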


In [18]:
save_dir = f'{output_dir}/final'

trainer.save_model(save_dir)
print("Saved model to:", save_dir)

Saved model to: lamini_docs_3_steps/final


In [19]:
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
finetuned_slightly_model.to(device) 

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (a

In [20]:
test_question = test_dataset[0]['question']
print("Question input (test):\n", test_question)
print("------")
print("Finetuned slightly model's answer: ")
print(inference(test_question, finetuned_slightly_model, tokenizer))
print("------")
test_answer = test_dataset[0]['answer']
print("Target answer output (test):\n", test_answer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test):
 Can Lamini generate technical documentation or user manuals for software projects?
------
Finetuned slightly model's answer: 


I'm not sure if I'm using the same language or the same language, but I'm not sure if I'm using the same language or the same language.

A:

I think you're using the same language, but I'm not sure if I'm using the same language or the same language, but I'm not sure if I'm using the same language or the same language, but
------
Target answer output (test):
 Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.


In [21]:
finetuned_longer_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
tokenizer = AutoTokenizer.from_pretrained("lamini/lamini_docs_finetuned")

finetuned_longer_model.to(device)
print("------")
print("Finetuned longer model's answer: ")
print(inference(test_question, finetuned_longer_model, tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


------
Finetuned longer model's answer: 
Yes, Lamini can generate technical documentation or user manuals for software projects. This can be achieved by providing a prompt for a specific technical question or question to the LLM Engine, or by providing a prompt for a specific technical question or question. Additionally, Lamini can be trained on specific technical questions or questions to help users understand the process and provide feedback to the LLM Engine. Additionally, Lamini
