# Continual Pre-Training vs. Fine-tuning


In this article, we look at the difference between fine-tuning and continual pre-training, and we use the Amazon Bedrock Python SDK to apply both to a foundation model with your own data. If you have a training dataset and want to adapt a base model to your domain, you can customize the model by training it further on that data.

You store your data on Amazon S3 and provide the S3 path when you configure the model customization job. You can also change the hyperparameters (such as learning rate, number of epochs, and batch size). Once the customization job is finished, you can deploy the customized model and use it for inference by sending it a prompt along with a set of inference parameters. In the following, we walk through both fine-tuning and continued pre-training with Amazon Bedrock.

### Fine-tuning

Fine-tuning and continual pre-training are both techniques used in machine learning to adapt models for specific tasks or domains. Fine-tuning involves taking a pre-trained model and adjusting its parameters using labeled data relevant to the target task, thereby tailoring it to the nuances of that particular task. This process enhances the model's effectiveness in that specific task compared to using a general-purpose pre-trained model. 

### Continual Pre-training 

Continual pre-training, on the other hand, takes an already pre-trained model and trains it further on new, typically unlabeled, data from a different domain. This allows ongoing adaptation and refinement of the model's knowledge across tasks or domains: the model keeps what it learned previously while continuing to learn from new data.
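
In practice, the two techniques differ most visibly in the training data they consume. The sketch below contrasts a labeled record, shaped like the ones prepared for fine-tuning later in this article, with an unlabeled record for continual pre-training. The sample texts are made up for illustration.

```python
import json

# Fine-tuning: each record pairs an input with the desired output (labeled data).
fine_tuning_record = {
    "input": "Summarize the following conversation.\n\n#Person1#: Hi!\n#Person2#: Hello.\n\nSummary: ",
    "output": "#Person1# and #Person2# greet each other.",
}

# Continual pre-training: each record carries raw text only (unlabeled data).
continual_pretraining_record = {
    "input": "On April 8th, 2024, a total solar eclipse was visible from Montreal.",
}

# Both kinds of record are serialized as one JSON object per line (JSONL).
for record in (fine_tuning_record, continual_pretraining_record):
    print(json.dumps(record))
```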

With this introduction, let's delve into the implementation in Amazon Bedrock.

# Fine-tuning Process
We will start with fine-tuning an LLM with our labeled training data. Our dataset is a public dialogue summarization dataset. We will prepare it for fine-tuning and will customize our model with the training samples.  

In [None]:
! pip install langchain tiktoken datasets

In [41]:
import boto3
import json
import time
import sagemaker
from pprint import pprint
from IPython.display import display, HTML
import pandas as pd

In [47]:
# Amazon Bedrock control plane including fine-tuning
bedrock = boto3.client(service_name="bedrock")

# Amazon Bedrock data plane including model inference
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

First, we should check the available models for fine-tuning. Using <code>list_foundation_models()</code>, we can list the available foundation models in Bedrock. Since we are looking for fine-tunable foundation models, we can filter the list like this: 

In [48]:
bedrock.list_foundation_models(byProvider="Amazon", byCustomizationType="FINE_TUNING")['modelSummaries']

[{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-image-generator-v1:0',
  'modelId': 'amazon.titan-image-generator-v1:0',
  'modelName': 'Titan Image Generator G1',
  'providerName': 'Amazon',
  'inputModalities': ['TEXT', 'IMAGE'],
  'outputModalities': ['IMAGE'],
  'customizationsSupported': ['FINE_TUNING'],
  'inferenceTypesSupported': ['PROVISIONED'],
  'modelLifecycle': {'status': 'ACTIVE'}},
 {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-lite-v1:0:4k',
  'modelId': 'amazon.titan-text-lite-v1:0:4k',
  'modelName': 'Titan Text G1 - Lite',
  'providerName': 'Amazon',
  'inputModalities': ['TEXT'],
  'outputModalities': ['TEXT'],
  'responseStreamingSupported': True,
  'customizationsSupported': ['FINE_TUNING', 'CONTINUED_PRE_TRAINING'],
  'inferenceTypesSupported': ['PROVISIONED'],
  'modelLifecycle': {'status': 'ACTIVE'}},
 {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1:0:8k',
  'modelId': 

From the list, we can see two text models to choose from: <code>Titan Text G1 - Express</code> and <code>Titan Text G1 - Lite</code> (the third entry, Titan Image Generator, is an image model). For this demo, we will use the former. So, we set our base model:

In [52]:
base_model_id = "amazon.titan-text-express-v1"
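
To pick out candidates programmatically rather than by eye, the model summaries can be filtered on their `customizationsSupported` field. This is a minimal sketch on a hand-written sample shaped like the listing above; `models_supporting` is a hypothetical helper and the sample list is abbreviated.

```python
def models_supporting(model_summaries, customization_type):
    """Return the modelIds whose summaries list the given customization type."""
    return [
        summary["modelId"]
        for summary in model_summaries
        if customization_type in summary.get("customizationsSupported", [])
    ]

# Abbreviated sample in the shape of list_foundation_models()["modelSummaries"].
sample_summaries = [
    {"modelId": "amazon.titan-image-generator-v1:0",
     "customizationsSupported": ["FINE_TUNING"]},
    {"modelId": "amazon.titan-text-lite-v1:0:4k",
     "customizationsSupported": ["FINE_TUNING", "CONTINUED_PRE_TRAINING"]},
]

print(models_supporting(sample_summaries, "CONTINUED_PRE_TRAINING"))
```

With a live client, the same function could be applied to `bedrock.list_foundation_models()["modelSummaries"]`.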

# Our dataset

For this demo, we will use a public dialogue summarization dataset, "dialogsum", as the custom dataset to fine-tune our base model. Dialogsum is a large-scale dialogue summarization dataset with over 13,000 conversations and summaries. You can read more about it here: https://huggingface.co/datasets/knkarthick/dialogsum.

Data Fields:

* dialogue: text of dialogue.
* summary: human written summary of the dialogue.
* topic: human written topic/one liner of the dialogue.
* id: unique file id of an example.

In [26]:
from datasets import load_dataset
dataset = load_dataset("knkarthick/dialogsum")

In [27]:
def convert_to_instruction(row):
    row['input'] = "Summarize the following conversation.\n\n" + row['dialogue'] + "\n\nSummary: "
    return row

Considering fine-tuning duration and cost, let's select only 100 samples of the training data. <code>load_dataset</code> returns an instance of the <code>DatasetDict</code> class, and we can apply its methods to further refine our dataset: https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.DatasetDict.data
First, we select the first 100 rows of the 'train' split, drop the columns we do not need, build the instruction prompt, and write the result to a JSONL file:


In [28]:
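import os

# Keep the first 100 training rows and only the columns needed for fine-tuning
ds = dataset['train'].select(range(100))
ds = ds.remove_columns(['id','topic'])
ds = ds.map(convert_to_instruction)
ds = ds.remove_columns(['dialogue'])
ds = ds.rename_column('summary','output')

# Write to the directory that the following cells read from
os.makedirs('model_customization_data', exist_ok=True)
ds.to_json('model_customization_data/dialogsum-train-100.jsonl', index=False)
ds[0]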


{'output': "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.",
 'input': "Summarize the following conversation.\n\n#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of tim

In [29]:
import pandas as pd
df = pd.read_json("model_customization_data/dialogsum-train-100.jsonl", lines=True)
df

| | output | input |
|---|---|---|
| 0 | Mr. Smith's getting a check-up, and Doctor Haw... | Summarize the following conversation.\n\n#Pers... |
| 1 | Mrs Parker takes Ricky for his vaccines. Dr. P... | Summarize the following conversation.\n\n#Pers... |
| 2 | #Person1#'s looking for a set of keys and asks... | Summarize the following conversation.\n\n#Pers... |
| 3 | #Person1#'s angry because #Person2# didn't tel... | Summarize the following conversation.\n\n#Pers... |
| 4 | Malik invites Nikki to dance. Nikki agrees if ... | Summarize the following conversation.\n\n#Pers... |
| ... | ... | ... |
| 95 | #Person1# and #Person2# are planning the class... | Summarize the following conversation.\n\n#Pers... |
| 96 | Martin tells #Person1# about his experience in... | Summarize the following conversation.\n\n#Pers... |
| 97 | #Person1# surveys #Person2# about #Person2#'s ... | Summarize the following conversation.\n\n#Pers... |
| 98 | #Person1# and #Person2# want a place near the ... | Summarize the following conversation.\n\n#Pers... |


In [73]:
data = "model_customization_data/dialogsum-train-100.jsonl"
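
The AWS documentation for Bedrock model customization describes text fine-tuning records as `prompt`/`completion` pairs. If a job rejects the `input`/`output` keys used above, a sketch like the following could rewrite the records. `to_prompt_completion` is a hypothetical helper, not part of any SDK.

```python
import json

def to_prompt_completion(jsonl_lines):
    """Rename input/output keys to prompt/completion, one JSON object per line."""
    converted = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        converted.append(
            json.dumps({"prompt": record["input"], "completion": record["output"]})
        )
    return converted

# One made-up record in the input/output shape written earlier.
sample = [
    '{"input": "Summarize the following conversation.\\n\\nHi\\n\\nSummary: ", '
    '"output": "A greeting."}'
]
print(to_prompt_completion(sample)[0])
```

The returned lines can be joined with newlines and written to a new `.jsonl` file before uploading.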

Next, we apply the <code>json.loads</code> function to each row of the 'json_element' column. <code>json.loads</code> is a Python decoder function that parses a JSON string into a Python object (a dictionary, in the case of a JSON object). <code>apply</code> is a pandas method that takes a function and applies it to each row of a DataFrame or Series.
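
A minimal sketch of that decoding step, using plain Python so it runs without pandas. The `json_elements` list is a made-up stand-in for the 'json_element' column.

```python
import json

# Hypothetical raw rows, each holding one JSON document as a string.
json_elements = [
    '{"input": "first prompt", "output": "first summary"}',
    '{"input": "second prompt", "output": "second summary"}',
]

# Equivalent in spirit to df["json_element"].apply(json.loads) in pandas.
decoded = [json.loads(element) for element in json_elements]

print(type(decoded[0]).__name__)
print(decoded[1]["output"])
```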

### Uploading data to S3

In [90]:
from sagemaker.s3 import S3Uploader

# Default SageMaker bucket and execution role for this session
bucket = sagemaker.Session().default_bucket()
role = sagemaker.get_execution_role()
prefix = "bedrock"
train_s3_url = f"s3://{bucket}/{prefix}/fine-tuning"
output_s3_url = f"s3://{bucket}/{prefix}/fine-tuning/output"

# Upload the training file to the target location
S3Uploader.upload("model_customization_data/dialogsum-train-100.jsonl", train_s3_url)
training_data_s3 = train_s3_url + "/dialogsum-train-100.jsonl"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


# Fine-tuning

In [92]:
import time

# Compute the suffix once so the job name and model name always match
suffix = time.localtime().tm_sec
job_name = f"fine-tuning-titan-{suffix}"
custom_model_name = f"fine-tuned-titan-{suffix}"
print(f"job name is: {job_name} - custom model name is: {custom_model_name}")

job name is: fine-tuning-titan-0 - custom model name is: fine-tuned-titan-0
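
Because `tm_sec` only ranges from 0 to 59, names built this way can collide with those of earlier jobs. Below is a sketch of a more collision-resistant scheme, assuming a Unix timestamp plus a short random suffix is acceptable for naming. `make_job_names` is a hypothetical helper.

```python
import time
import uuid

def make_job_names(prefix="titan"):
    """Build a job name and a custom model name that share one unique suffix."""
    suffix = f"{int(time.time())}-{uuid.uuid4().hex[:6]}"
    return f"fine-tuning-{prefix}-{suffix}", f"fine-tuned-{prefix}-{suffix}"

example_job_name, example_model_name = make_job_names()
print(example_job_name, example_model_name)
```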


In [93]:
bedrock.create_model_customization_job(
    customizationType="FINE_TUNING",  # or CONTINUED_PRE_TRAINING
    jobName=job_name,
    customModelName=custom_model_name,
    roleArn=role,
    baseModelIdentifier=base_model_id,
    hyperParameters = {
        "epochCount": "10",
        "batchSize": "1",
        "learningRate": "0.000001",
        "learningRateWarmupSteps": "0"
    },
    trainingDataConfig={"s3Uri": training_data_s3},
    outputDataConfig={"s3Uri": output_s3_url},
)

{'ResponseMetadata': {'RequestId': 'ec21b7e4-0ee9-435f-a98b-ff4cb64235e5',
  'HTTPStatusCode': 201,
  'HTTPHeaders': {'date': 'Wed, 10 Apr 2024 20:36:01 GMT',
   'content-type': 'application/json',
   'content-length': '122',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'ec21b7e4-0ee9-435f-a98b-ff4cb64235e5'},
  'RetryAttempts': 0},
 'jobArn': 'arn:aws:bedrock:us-east-1:609362070692:model-customization-job/amazon.titan-text-express-v1:0:8k/13y2lddfvysk'}
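
Note that every hyperparameter value in the request above is passed as a string. A small hypothetical helper can do that conversion from numeric settings. The key names are the ones used in the call above, and fixed-point formatting for the learning rate is a choice made here to match the style of that call.

```python
from decimal import Decimal

def as_string_hyperparameters(epoch_count, batch_size, learning_rate, warmup_steps=0):
    """Convert numeric settings to the string-valued mapping used in the job request."""
    return {
        "epochCount": str(epoch_count),
        "batchSize": str(batch_size),
        # Format without scientific notation, e.g. '0.000001' rather than '1e-06'.
        "learningRate": format(Decimal(str(learning_rate)), "f"),
        "learningRateWarmupSteps": str(warmup_steps),
    }

print(as_string_hyperparameters(10, 1, 0.000001))
```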

# Check model fine-tuning progress

## Attention: depending on your resources, the following cell may take up to 45 minutes to complete

In [None]:
import time

status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]

while status == "InProgress":
    print(status)
    time.sleep(30)
    status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]
    
print(status)
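
The loop above runs until the status leaves `InProgress`, however long that takes. A slightly more defensive sketch adds a timeout and returns the final status. Here `get_status` is a hypothetical callable standing in for `lambda: bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]`, so the example runs without AWS access.

```python
import time

def wait_for_job(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until the job leaves 'InProgress' or the timeout expires."""
    waited = 0
    status = get_status()
    while status == "InProgress":
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still InProgress after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status

# Simulated job that completes on the third poll (no real waiting).
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

The returned status lets the caller distinguish a completed job from a failed or stopped one before moving on.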

# Performance Optimization 

## Parameter-Efficient Fine-Tuning 
One way to make the fine-tuning process cheaper is Parameter-Efficient Fine-Tuning, or PEFT for short. In contrast to full fine-tuning, PEFT methods let you adapt your model with fewer computational resources. PEFT freezes the parameters of the pre-trained model and trains only a much smaller set of parameters. Because only a small number of parameters are trained, PEFT reduces the compute and memory requirements of fine-tuning. PEFT techniques also reduce catastrophic forgetting, because the weights of the original model remain frozen, preserving the model's knowledge. One common PEFT method is Low-Rank Adaptation (LoRA).

### Low-Rank Adaptation (LoRA)
Since LLMs are large, updating all model weights during training can be expensive due to GPU memory limitations. Suppose we have a large weight matrix W for a given layer. During backpropagation, we learn a matrix ΔW, which contains information on how much we want to update the original weights to minimize the loss function during training:

$ W_{new} = W_{old} + \Delta W $

The LoRA technique approximates $ \Delta W $ by the product of two smaller matrices $ A $ and $ B $, so that:

$ \Delta W \approx A B $ and $ W_{new} = W_{old} + A B $

where,

$ W \in \mathbb{R}^{d \times d} $ and $ A \in \mathbb{R}^{d\times r} $ and $ B \in \mathbb{R}^{r \times d} $ 

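Here $ d $ is the dimension of the weight matrix and $ r $ is the rank of the LoRA module, with $ r \ll d $, so $ A $ and $ B $ together hold $ 2dr $ parameters instead of the $ d^2 $ in $ \Delta W $.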
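
To make the saving concrete, the following pure-Python sketch builds toy matrices with the shapes given above and compares parameter counts. It zero-initializes $ B $ so that the update starts at zero, a common convention that is assumed here rather than taken from this article. It illustrates the arithmetic only and says nothing about how any particular service trains models.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_update(W, A, B):
    """Return W + A·B without modifying W."""
    delta = matmul(A, B)
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

d, r = 8, 2  # toy sizes; real layers have d in the thousands
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weights, d x d
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]  # trainable, d x r
B = [[0.0] * d for _ in range(r)]                                  # trainable, r x d, zero at start

W_new = lora_update(W, A, B)
print("full update parameters:", d * d)   # 64
print("LoRA parameters:", d * r + r * d)  # 32
print("update starts at zero:", W_new == W)
```

With $ d = 4096 $ and $ r = 8 $, the same arithmetic gives about 16.8 million entries for a full update against 65,536 for the two LoRA factors.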

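# Continual Pre-training Process

Next, we continue pre-training the base model on unlabeled text. To see what the base model knows beforehand, we first ask it about a recent event, the 2024 total solar eclipse in Montreal, and then build a small training set from a web page on that topic.

In [9]: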
prompt = "At what time did total solar eclipse happen in Montreal in 2024?"

body = {
    "inputText": prompt,
    "textGenerationConfig": {
        "maxTokenCount": 512,
        "stopSequences": [],
        "temperature": 1,
        "topP": 0.9
    }
}

In [10]:
response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-text-express-v1", # Amazon Titan Text model
    body=json.dumps(body)
)

In [11]:
output = response['body'].read().decode('utf8')
print(json.loads(output)['results'][0]['outputText'])


In Montreal, Canada, a total solar eclipse will take place on June 2, 2024. This eclipse will be visible to a small portion of the country, specifically those located in the northern parts of Quebec and the southern parts of Ontario.


In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import WebBaseLoader

In [14]:
url = "https://blog.cirquedusoleil.com/total-solar-eclipse-montreal"
doc = WebBaseLoader(url).load()

In [15]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(doc)

In [16]:
print(len(doc_splits))

6
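
The splitter above cuts the page into chunks of at most 250 tokens with no overlap. The idea can be sketched in plain Python; note that this simplified version counts whitespace-separated words rather than tiktoken tokens, and `split_into_chunks` is a hypothetical helper, not the LangChain implementation.

```python
def split_into_chunks(text, chunk_size=250):
    """Split text into consecutive chunks of at most chunk_size words, with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

sample_text = "word " * 600
chunks = split_into_chunks(sample_text)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```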


In [17]:
contents = ""
for split in doc_splits:
    content = {"input": split.page_content}
    contents += json.dumps(content) + "\n"
    

In [18]:
# The with-block closes the file automatically
with open("./train-continual-pretraining.jsonl", "w") as file:
    file.write(contents)
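
Before uploading, it can be worth checking that every line of the file is a JSON object with a non-empty `input` string, which is the record shape written above. The following is a minimal sketch with a hypothetical helper name.

```python
import json

def validate_pretraining_jsonl(lines):
    """Return the number of valid records; raise ValueError on the first malformed line."""
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)  # JSONDecodeError is a subclass of ValueError
        text = record.get("input") if isinstance(record, dict) else None
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"line {number}: expected a JSON object with a non-empty 'input' string")
        count += 1
    return count

print(validate_pretraining_jsonl(['{"input": "Total Solar Eclipse 2024 in Montreal"}\n']))
```

An open file handle can be passed directly, for example `validate_pretraining_jsonl(open("./train-continual-pretraining.jsonl"))`.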

In [19]:
import pandas as pd
df = pd.read_json("./train-continual-pretraining.jsonl", lines=True)
df

| | input |
|---|---|
| 0 | Total Solar Eclipse 2024 in Montreal: A Rare C... |
| 1 | Life is a Circus\n\n\nCirque du Sound\n\n\n\n\... |
| 2 | On April 8th, 2024, you’ll have the unique opp... |
| 3 | Special Features and Surprises \nIt’s no secre... |
| 4 | Participating in the contest is straightforwar... |
| 5 | TiktokFacebookInstagramYouTube\n\n\n\n\n\n\n\n... |


In [20]:
data = "./train-continual-pretraining.jsonl"

In [21]:
import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session_bucket = sess.default_bucket()

s3_location = f"s3://{sagemaker_session_bucket}/bedrock/finetuning/train-continual-pretraining.jsonl"
s3_output = f"s3://{sagemaker_session_bucket}/bedrock/finetuning/output"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [22]:
!aws s3 cp train-continual-pretraining.jsonl $s3_location

upload: ./train-continual-pretraining.jsonl to s3://sagemaker-us-east-1-609362070692/bedrock/finetuning/train-continual-pretraining.jsonl


In [30]:
timestamp = int(time.time())

job_name = "titan2-{}".format(timestamp)
job_name

custom_model_name = "custom-{}".format(job_name)
custom_model_name

'custom-titan2-1712689857'

In [None]:
bedrock.create_model_customization_job(
    customizationType="CONTINUED_PRE_TRAINING",  # or FINE_TUNING
    jobName=job_name,
    customModelName=custom_model_name,
    roleArn=role,
    baseModelIdentifier="amazon.titan-text-express-v1",
    hyperParameters = {
        "epochCount": "10",
        "batchSize": "1",
        "learningRate": "0.000001"
    },
    trainingDataConfig={"s3Uri": s3_location},
    outputDataConfig={"s3Uri": s3_output},
)

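Depending on your resources, the job may take a while to complete. The following cell polls the job status until it leaves the `InProgress` state.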

In [None]:
status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]

while status == "InProgress":
    print(status)
    time.sleep(300)
    status = bedrock.get_model_customization_job(jobIdentifier=job_name)["status"]
    
print(status)

In [None]:
# get_custom_model returns a dict of model details; keep only the ARN
custom_model_arn = bedrock.get_custom_model(modelIdentifier=custom_model_name)["modelArn"]
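
The model listing earlier shows `PROVISIONED` as the supported inference type, so a customized model is typically served through Provisioned Throughput, which is billed while it is active. The sketch below assumes a provisioned model ARN is already available and reuses the request body from the base-model call. `invoke_custom_model` is a hypothetical helper, and the runtime client is passed in as a parameter.

```python
import json

def invoke_custom_model(bedrock_runtime, provisioned_model_arn, prompt):
    """Send a prompt to a provisioned Titan text model and return the generated text."""
    body = {
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": 512,
            "stopSequences": [],
            "temperature": 1,
            "topP": 0.9,
        },
    }
    response = bedrock_runtime.invoke_model(
        modelId=provisioned_model_arn,  # ARN of the provisioned custom model
        body=json.dumps(body),
    )
    payload = json.loads(response["body"].read().decode("utf8"))
    return payload["results"][0]["outputText"]
```

One way to obtain such an ARN is `bedrock.create_provisioned_model_throughput(modelUnits=1, provisionedModelName="my-custom-titan", modelId=custom_model_arn)["provisionedModelArn"]`, where the provisioned model name is a placeholder. Asking the eclipse question again through this function would show whether the additional training changed the answer.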
