# Continual Pre-Training vs. Fine-tuning


In this article, we will see the difference between Fine-tuning and Continual Pre-Training. We will use Amazon Bedrock Python SDK to fine-tune and continual pre-train a foundation model with your own data. If you have a train dataset and want to adapt a base model to your domain, you can further adjust it by giving your training data. We will see how to fine-tune or continual pre-train a base model with Amazon Bedrock in this demo.

You can store your data on Amazon S3 and provide the S3 bucket path while you are configuring the model customization job. You can also change the hyper parameters (like learning rate, number of epochs, and batch size) for fine-tuning. Once the fine-tuning job with your data is finished, you can deploy the model into an endpoint and use it for inference. You can use the fine-tuned model and provide your prompt to the model along with a set of model parameters. In the following, we will walk through "continued pre-training with Amazon Bedrock". 

### Fine-tuning

Fine-tuning and continual pre-training are both techniques used in machine learning to adapt models for specific tasks or domains. Fine-tuning involves taking a pre-trained model and adjusting its parameters using labeled data relevant to the target task, thereby tailoring it to the nuances of that particular task. This process enhances the model's effectiveness in that specific task compared to using a general-purpose pre-trained model. 

### Continual Pre-training 

On the other hand, continual pre-training entails taking an already pre-trained model and employing transfer learning to further train it on new data from a different domain. Continual pre-training allows for ongoing adaptation and refinement of the model's knowledge and performance across various tasks or domains, leveraging the previously acquired knowledge while continuously learning from new data.

With this introduction, let's delve into the implementation in Amazon bedrock. 

# Fine-tuning Process
We will start with fine-tuning an LLM with our labeled training data. Our dataset is a public dialogue summarization dataset. We will prepare it for fine-tuning and will customize our model with the training samples.  

In [2]:
%pip install -U transformers==4.36.2
%pip install -U torch
%pip install -U peft==0.7.1
%pip install -U datasets==2.15.0

Collecting transformers==4.36.2
  Using cached transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers==4.36.2)
  Using cached huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.36.2)
  Using cached tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.3.1 (from transformers==4.36.2)
  Using cached safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Using cached transformers-4.36.2-py3-none-any.whl (8.2 MB)
Using cached huggingface_hub-0.22.2-py3-none-any.whl (388 kB)
Using cached safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Using cached tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: safetensors, huggingface-hub, tokenizers, transformers
Successfully 

In [3]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import pandas as pd
import torch

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


In [4]:
# Amazon Bedrock control plane including fine-tuning
# bedrock = boto3.client(service_name="bedrock")
# Amazon Bedrock data plane including model inference
# bedrock_runtime = boto3.client(service_name="bedrock-runtime")

First, we should check the available models for fine-tuning. Using <code>list_foundation_models()</code>, we can list the available foundation models in Bedrock. Since we are looking for fine-tunable foundation models, we can filter the list like this: 

In [5]:
# bedrock.list_foundation_models(byProvider="Amazon", byCustomizationType="FINE_TUNING")['modelSummaries']

From the list, we can see there are two options to choose from: <code>Titan Text G1 - Express</code> and <code>Titan Text G1 - Lite</code>. For this demo, we will use the former. So, we set our base model: 

In [6]:
base_model_id = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# 3. Our dataset

For this demo, we will use a public dialogue summarization dataset, "dialogsum" as a custom dataset to fine-tune our base model. Dialogsum is a large-scale dialogue summarization dataset has over 13,000 conversations and summaries. You can read more about it here: https://huggingface.co/datasets/knkarthick/dialogsum.

Data Fields:

* dialogue: text of dialogue.
* summary: human written summary of the dialogue.
* topic: human written topic/one liner of the dialogue.
* id: unique file id of an example.

Cosidering the required time for model fine-tuning, we will consider only a 5 % of each split in this demo

In [7]:
from datasets import load_dataset
train_ds, val_ds, test_ds = load_dataset("knkarthick/dialogsum",split=['train[:10%]', 'validation[:10%]','test[:10%]'])

In [8]:
train_ds

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 1246
})

In [9]:
def tokenize_input(row):
    prompt = ["Summarize the following conversation.\n\n" + dialogue + "\n\nSummary: " for dialogue in row['dialogue']]
    row['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    row['labels'] = tokenizer(row['summary'], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return row    

Considering fine-tuning duration and the cost, let's select only 100 of train data for fine-tuning. <code>load_dataset</code> return an instance of <code>DatasetDict</code> class. We can apply its methods to furthur refine our dataset. https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.DatasetDict.data
First, we select 100 of the 'train' set, 


In [10]:
example = tokenize_input(test_ds[1])
print(example)

        [12198,  1635,  1737,  ...,     0,     0,     0],
        [12198,  1635,  1737,  ...,     0,     0,     0],
        ...,
        [12198,  1635,  1737,  ...,     0,     0,     0],
        [12198,  1635,  1737,  ...,     0,     0,     0],
        [12198,  1635,  1737,  ...,     0,     0,     0]]), 'labels': tensor([[   86,   455,    12,  1709,  1652,    45,     3, 26281,    97,    30,
         18882,     3, 16042,  1356,     6,  1713,   345, 13515,   536,  4663,
          2204,     7,    12, 13813,     8,   169,    13,   273,  1356,    11,
           987,     7,   283,     7,     5, 31676,    12,  1299,    91,     3,
             9, 22986,    12,    66,  1652,    57,     8,  3742,     5,     1,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,   

In [54]:
train_ds_tokenized = train_ds.map(tokenize_input, batched=True)
train_ds_tokenized = train_ds_tokenized.remove_columns(['id','topic','dialogue','summary'])

val_ds_tokenized = val_ds.map(tokenize_input, batched=True)
val_ds_tokenized = val_ds_tokenized.remove_columns(['id','topic','dialogue','summary'])

test_ds_tokenized = test_ds.map(tokenize_input, batched=True)
test_ds_tokenized = test_ds_tokenized.remove_columns(['id','topic','dialogue','summary'])

# train_ds_tokenized.to_json('model_finetuning_data/dialogsum-train-100.jsonl', index=False)

In [55]:
# import pandas as pd
# df = pd.read_json("model_finetuning_data/dialogsum-train-100.jsonl", lines=True)
train_ds_tokenized.shape

(1246, 2)

In [56]:
# data = "model_finetuning_data/dialogsum-train-100.jsonl"

Now we will apply json loads function on each row of the ‘json_element’ column. ‘json.loads’ is a decoder function in python which is used to decode a json object into a dictionary. ‘apply’ is a popular function in pandas that takes any function and applies to each row of the pandas dataframe or series.

# Fine-tuning

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with Summarize the following conversation and to the start of the summary with Summary as follows:

In [57]:
output_dir = './model_finetuning_output'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=train_ds_tokenized,
    eval_dataset=val_ds_tokenized
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## This cell may take few minutes to run!

In [58]:
trainer.train()
# trainer.save_model(output_dir)

Step,Training Loss
1,47.25


In [59]:
df_log = pd.DataFrame(trainer.state.log_history)
df_log.head()
# df_log.dropna(subset=['eval_loss']).reset_index()['eval_loss'].plot(label='Validation'))
# df_log.dropna(subset=['loss']).reset_index()['loss'].plot(label='Train'))

Unnamed: 0,loss,learning_rate,epoch,step,train_runtime,train_samples_per_second,train_steps_per_second,total_flos,train_loss
0,47.25,0.0,0.01,1,,,,,
1,,,0.01,1,67.9872,0.118,0.015,5478059000000.0,47.25


In [17]:
# my_model = AutoModelForSeq2SeqLM.from_pretrained("./model_finetuning_output", torch_dtype=torch.bfloat16)

Training a fully fine-tuned version of the model would take a few hours on a GPU. However, we can apply performance optimization techniques in order to fully fine-tune the model while utilizing less resources. In the next section, we will talk about a technique called Low Rank Adaption (LoRA) to resuce fine-tuning mermory and compute requirements. 

# Performance Optimization 

## Parameter-Efficient Fine-Tuning 
One way to improve the fine-tuning process is through a method called Parameter-Efficient Fine-Tuning or PEFT for short. In contrast to fine-tuning, PEFT methods allow you to fine-tune your model with less computational resources. PEFT freezes the parameters of the pretrained model and fine-tunes a smaller set of parameters. Because we train a small number of parameters, PEFT reduces fine-tuning compute and memory requirements. PEFT techniques also reduce catastrophic forgetting, because the weights of the original model remain frozen preserving the model's knowledge. One common PEFT method is Low Ran Adaptation (LoRA)

### Low Ran Adaptation (LoRA)
Since LLMs are large, updating all model weights during training can be expensive due to GPU memory limitations. Suppose we have a large weight matrix W for a given layer. During backpropagation, we learn a ΔW matrix, which contains information on how much we want to update the original weights to minimize the loss function during training. 


Because LLMs are big, updating all the model weights during training could be expensive because of GPU memory limits. Let's say we have a large weight matrix W. During backpropagation, we update the weight matrix by ΔW in order to reduce the loss function:  

$ W_{new} = W_{old} + \Delta W $

LoRA technique replaces $ \Delta W $ with an aproximate matrices A and B in a way that:

$ \Delta W \approx A.B $ and $ W_{new} = W_{old} + A.B $

where,

$ W \in \mathbb{R}^{d \times d} $ and $ A \in \mathbb{R}^{d\times r} $ and $ B \in \mathbb{R}^{r \times d} $ 

$ d $ is weight matrix dimention and $ r $ is the rank of a LoRA module.
Assume we have a weight matrix $ W $ with dimensions of 512 x 512, which means it has 262144 trainable parameters. If we perform a full fine-tuning, we will be updating 262144 parameters. However, if we apply LoRA, with rank of 2 for instance, we will have 2 x 512 parameters in matrix A, and 2 x 512 parameters in matrix B, in total we will have 2048 (2 x 2 x 512) parameters. Therefore, we will be able to fine-tune the model by training only 2048 parameters instead of the full 262144 parameters. 

LoRA has two hyperparameters: First is <em>rank</em> that controls the inner dimension of the matrices A and B is a key factor in determining the balance between model adaptability and parameter efficiency.

Second is <em> alpha </em> that serves as a scaling factor applied to the LoRA output. Its role is to govern the degree to which the adapted layer's output can impact the original output. In other ways, it controls the impact of the low-rank adaptation on the layer's output. LoRA can be used to replace existing Linear layers in an LLM, for example, the self-attention module or feed forward modules.

In [9]:
# prompt = "At what time did total solar eclipse happen in Montreal in 2024?"

# body = {
#     "inputText": prompt,
#     "textGenerationConfig": {
#         "maxTokenCount": 512,
#         "stopSequences": [],
#         "temperature": 1,
#         "topP": 0.9
#     }
# }

In [10]:
# response = bedrock_runtime.invoke_model(
#     modelId="amazon.titan-text-express-v1", # Amazon Titan Text model
#     body=json.dumps(body)
# )

In [None]:
# output = response['body'].read().decode('utf8')
# print(json.loads(output)['results'][0]['outputText'])

In [13]:
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.document_loaders import WebBaseLoader, UnstructuredHTMLLoader

In [14]:
# url = "https://blog.cirquedusoleil.com/total-solar-eclipse-montreal"
# doc = WebBaseLoader(url).load()

In [15]:
# text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
#     chunk_size=250, chunk_overlap=0
# )
# doc_splits = text_splitter.split_documents(doc)

In [17]:
# contents = ""
# for split in doc_splits:
#     content = {"input": split.page_content}
#     contents += json.dumps(content) + "\n"
    

In [18]:
# with open("./train-continual-pretraining.jsonl", "w") as file:
#     file.writelines(contents)
#     file.close()

In [None]:
# import pandas as pd
# df = pd.read_json("./train-continual-pretraining.jsonl", lines=True)
# df

In [20]:
# data = "./train-continual-pretraining.jsonl"

In [None]:
# import sagemaker
# sess = sagemaker.Session()
# role = sagemaker.get_execution_role()
# region = boto3.Session().region_name
# sagemaker_session_bucket = sess.default_bucket()

# s3_location = f"s3://{sagemaker_session_bucket}/bedrock/finetuning/train-continual-pretraining.jsonl"
# s3_output = f"s3://{sagemaker_session_bucket}/bedrock/finetuning/output"

In [22]:
# !aws s3 cp train-continual-pretraining.jsonl $s3_location

upload: ./train-continual-pretraining.jsonl to s3://sagemaker-us-east-1-609362070692/bedrock/finetuning/train-continual-pretraining.jsonl


In [30]:
# timestamp = int(time.time())

# job_name = "titan2-{}".format(timestamp)
# job_name

# custom_model_name = "custom-{}".format(job_name)
# custom_model_name

'custom-titan2-1712689857'

In [None]:
# bedrock.create_model_customization_job(
#     customizationType="CONTINUED_PRE_TRAINING", # FINE_TUNING \ CONTINUED_PRE_TRAINING
#     jobName=job_name,
#     customModelName=custom_model_name,
#     roleArn=role,
#     baseModelIdentifier="amazon.titan-text-express-v1",
#     hyperParameters = {
#         "epochCount": "10",
#         "batchSize": "1",
#         "learningRate": "0.000001"
#     },
#     trainingDataConfig={"s3Uri": s3_location},
#     outputDataConfig={"s3Uri": s3_output},
# )

In [None]:
# status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]

# while status == "InProgress":
#     print(status)
#     time.sleep(300)
#     status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]
    
# print(status)

In [None]:
# custom_model_arn = bedrock.get_custom_model(modelIdentifier=custom_model_name)
