[Link](https://www.philschmid.de/sagemaker-llama2-qlora)<br>
[Mistral](https://medium.com/@pierre_guillou/fine-tune-the-llm-mistral-7b-on-amazon-sagemaker-today-4791613c335b)

In [1]:
import sagemaker
import boto3
import os 
import dotenv

from datasets import load_dataset
from random import randrange

sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\arind\AppData\Local\sagemaker\sagemaker\config.yaml


### Set Up SageMaker

In [3]:
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='arindam_sagemaker')['Role']['Arn']
    
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
 
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Couldn't call 'get_role' to get Role ARN from role name arindam.d.dey@gmail.com to get Role path.


sagemaker role arn: arn:aws:iam::570517415597:role/arindam_sagemaker
sagemaker bucket: sagemaker-ap-south-1-570517415597
sagemaker session region: ap-south-1


### Set Up Hugging Face for Loading the Dataset

In [4]:
dotenv.load_dotenv()
os.environ['HUGGINGFACEHUB_API_TOKEN'] =  os.getenv("HF_API_KEY")

In [16]:
# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print('A random sample:')
print(dataset[randrange(len(dataset))])

dataset size: 15011
A random sample:
{'instruction': 'What is the Taubate Prison known for', 'context': 'Taubaté Prison is a prison in Taubaté in São Paulo, Brazil. It is notorious for containing some of the most violent prisoners, for repeated prison riots, and for being the place where the Primeiro Comando da Capital criminal gang originated.\n\nOn December 19, 2000 The Prison Uprising ended at Taubaté Prison\n\nreleased more than 20 hostages on Monday, ending an uprising at a maximum security facility that left nine prisoners dead, officials said.\n\nThe rebellion at the Taubate House of Custody and Psychiatric Treatment, about 80 miles outside Sao Paulo, began during visiting hours Sunday when an inmate opened fire with a revolver, provoking a fight with prisoners from another pavilion.\n\nTaking advantage of the confusion, prisoners took 23 hostages including four children.\n\nInmates began releasing hostages in small groups Monday after authorities agreed to transfer 10 prisoners

### Prepare the Dataset

The dataset is currently a list of dictionaries, each with the keys __instruction__, __context__, __response__ and __category__.<br>

The goal is to drop those key/value pairs and keep a single key called __text__ whose value is the formatted prompt, terminated by the selected tokenizer's EOS token. For example:<br>

{'text': 
"### Instruction<br>
When did Virgin Australia start operating?<br>
<br>
<br>
\### Context<br>
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.<br>
<br>
<br>
\### Response<br>
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.\</s>"}
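
The transformation described above can be sketched on a single record without loading the tokenizer. This is only an illustration: `to_text_record` and `EOS_TOKEN` are hypothetical names, and `"</s>"` is assumed as a stand-in for `tokenizer.eos_token`. The notebook's own implementation follows in the next cells.

```python
EOS_TOKEN = "</s>"  # assumption: stand-in for tokenizer.eos_token


def to_text_record(sample, eos_token=EOS_TOKEN):
    """Collapse an instruction/context/response record into a single 'text' key."""
    sections = [
        ("Instruction", sample["instruction"]),
        ("Context", sample.get("context", "")),
        ("Response", sample["response"]),
    ]
    # Skip empty sections (many records have no context)
    prompt = "\n\n".join(f"### {title}\n{body}" for title, body in sections if body)
    return {"text": prompt + eos_token}


record = to_text_record({
    "instruction": "When did Virgin Australia start operating?",
    "context": "",
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
    "category": "closed_qa",
})
print(record["text"])
```

Only the `text` key survives, and the empty context section is left out of the prompt.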


In [17]:
from transformers import AutoTokenizer
from random import randint
from itertools import chain
from functools import partial

In [18]:
model_id = "meta-llama/Llama-2-7b-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id,use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

In [19]:
def format_dolly(sample):

    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample['context'])>0 else None
    response = f"### Response\n{sample['response']}"

    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])

    return prompt

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

In [20]:
print(format_dolly(dataset[randrange(len(dataset))]))

### Instruction
How does Windows Fast Startup Work?

### Response
Fast Startup is a Windows feature that allows you to boot your computer in a few seconds rather than a minute. Rather than going through the cold boot path, Fast Startup uses a minimal hiberfile to resume the system. When the feature is enabled, selecting “Shutdown” in the Windows UI doesn’t actually shutdown the system. Instead, it closes all user applications, logs the current user out, and then creates a hiberfile. Because this hiberfile only includes the kernel, device drivers and a subset of applications, it is small and can be reloaded quickly. 

Alternatively, the cold boot path requires loading the kernel and drivers from disk, initializing the kernel and drivers, and launching various user mode applications. This can be especially slow on computers that use spinning hard drives.


In [21]:
# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
print(dataset[randrange(len(dataset))]["text"])  # randrange excludes len(dataset), unlike randint

### Instruction
Tell me which animals are bigger than the average human: Dog, Mouse, Elephant, Rhino, Hippo, Cat, Squirrel.

### Response
Sure. Here are the selections from above that are larger than the average human: Elephant, Rhino, and Hippo.</s>


In [24]:
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}
 
def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    #print(sample.keys())
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])
 
    # number of tokens that fit into whole chunks (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length
 
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset

lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

In [31]:
final_dataset = lm_dataset.map(
    partial(chunk, chunk_length=2048),
    batched=True,
)
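
To make the packing behaviour easier to follow, here is a toy version of the same idea on plain lists of integers. It is a sketch, not the notebook's `chunk` function: `pack` is a hypothetical helper, it keeps the remainder in a local variable instead of a global, and it handles a single token stream rather than a dict of columns.

```python
from itertools import chain


def pack(batches, chunk_length):
    """Yield fixed-length chunks, carrying leftover tokens into the next batch."""
    remainder = []
    for batch in batches:
        # Flatten the batch and prepend what was left over from the previous one
        tokens = remainder + list(chain(*batch))
        # Largest multiple of chunk_length that fits into the available tokens
        usable = (len(tokens) // chunk_length) * chunk_length
        for i in range(0, usable, chunk_length):
            yield tokens[i : i + chunk_length]
        remainder = tokens[usable:]


batches = [
    [[1, 2, 3], [4, 5]],   # first batch: 5 tokens
    [[6], [7, 8, 9, 10]],  # second batch: 5 tokens
]
chunks = list(pack(batches, chunk_length=4))
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Token `5` does not fit into the first chunk, so it is carried over and starts the second one. Tokens `9` and `10` never fill a chunk and are dropped at the end, which is also what happens to the final remainder in the notebook's version.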

In [39]:
final_dataset[0]['input_ids']

[1,
 835,
 2799,
 4080,
 13,
 10401,
 1258,
 9167,
 8314,
 1369,
 13598,
 29973,
 13,
 13,
 2277,
 29937,
 15228,
 13,
 29963,
 381,
 5359,
 8314,
 29892,
 278,
 3534,
 292,
 1024,
 310,
 9167,
 8314,
 29718,
 349,
 1017,
 19806,
 29892,
 338,
 385,
 9870,
 29899,
 6707,
 4799,
 1220,
 29889,
 739,
 338,
 278,
 10150,
 4799,
 1220,
 491,
 22338,
 2159,
 304,
 671,
 278,
 9167,
 14982,
 29889,
 739,
 844,
 9223,
 5786,
 373,
 29871,
 29941,
 29896,
 3111,
 29871,
 29906,
 29900,
 29900,
 29900,
 408,
 9167,
 10924,
 29892,
 411,
 1023,
 15780,
 373,
 263,
 2323,
 5782,
 29889,
 739,
 11584,
 1476,
 3528,
 408,
 263,
 4655,
 4799,
 1220,
 297,
 8314,
 29915,
 29879,
 21849,
 9999,
 1156,
 278,
 24382,
 310,
 530,
 9915,
 8314,
 297,
 3839,
 29871,
 29906,
 29900,
 29900,
 29896,
 29889,
 450,
 4799,
 1220,
 756,
 1951,
 21633,
 304,
 4153,
 9080,
 29871,
 29941,
 29906,
 14368,
 297,
 8314,
 29892,
 515,
 19766,
 29879,
 297,
 1771,
 275,
 29890,
 1662,
 29892,
 22103,
 322,
 16198,
 298

In [29]:
print(tokenizer.decode(lm_dataset[0]['input_ids']))

<s> ### Instruction
When did Virgin Australia start operating?

### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.

### Response
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.</s>


In [21]:
x = iter(final_dataset)  # the chunked dataset; only it has the 'labels' column
y = next(x)

In [22]:
print(tokenizer.decode(y['labels']))

<s> ### Instruction
When did Virgin Australia start operating?

### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.

### Response
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.</s><s> ### Instruction
Which is a species of fish? Tope or Rope

### Response
Tope</s><s> ### Instruction
Why can camels survive for long without water?

### Response
Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.</s><s> ### Instruct

In [23]:
print(tokenizer.decode(y['input_ids']))

<s> ### Instruction
When did Virgin Australia start operating?

### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.

### Response
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.</s><s> ### Instruction
Which is a species of fish? Tope or Rope

### Response
Tope</s><s> ### Instruction
Why can camels survive for long without water?

### Response
Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.</s><s> ### Instruct

In [249]:
# save the chunked training dataset to S3
training_input_path = 's3://innovationdatasets/processed/llama/dolly/train'
final_dataset.save_to_disk(training_input_path)
 
print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/1581 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://innovationdatasets/processed/llama/dolly/train


In [250]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder
 
# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
 
# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 2,                                      # number of training epochs
  'per_device_train_batch_size': 3,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}
 
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)
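
SageMaker passes each entry in `hyperparameters` to the entry point as a command-line argument (`--model_id ...`, `--epochs 2`, and so on). The contents of `scripts/run_clm.py` are not shown in this notebook, so the following is only a sketch of how such a script might read them. The `parse_args` and `str_to_bool` helpers and the default values are assumptions, not the actual script.

```python
import argparse


def str_to_bool(value):
    # Booleans arrive as strings such as "True"; plain bool("False") would be True
    return str(value).lower() in ("1", "true", "yes")


def parse_args(argv=None):
    """Parse the hyperparameters defined in the notebook (hypothetical defaults)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--dataset_path", type=str)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--hf_token", type=str, default=None)
    parser.add_argument("--merge_weights", type=str_to_bool, default=True)
    # Ignore any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--model_id", "meta-llama/Llama-2-7b-hf",
    "--dataset_path", "/opt/ml/input/data/training",
    "--epochs", "2",
    "--per_device_train_batch_size", "3",
    "--lr", "0.0002",
    "--merge_weights", "False",
])
print(args.epochs, args.lr, args.merge_weights)
```

The explicit `str_to_bool` converter matters because `type=bool` in `argparse` treats any non-empty string, including `"False"`, as `True`.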

In [251]:
# define a data input dictionary with our uploaded S3 URIs
data = {'training': training_input_path}
 
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2024-07-14-17-32-23-2024-07-14-12-02-26-208


2024-07-14 12:02:27 Starting - Starting the training job......
2024-07-14 12:03:19 Starting - Preparing the instances for training...
2024-07-14 12:04:00 Downloading - Downloading the training image...........................
2024-07-14 12:08:13 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-07-14 12:08:32,338 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-07-14 12:08:32,355 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-07-14 12:08:32,365 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-07-14 12:08:32,366 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-07-14 12:08:33,732 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.10

In [29]:
def some_func(x,y):
    return x+y

g = partial(some_func, 1)
g(2)

3

In [31]:
x = [1,2,3,4,5]
y = [6,7]

list(chain(x,y))

[1, 2, 3, 4, 5, 6, 7]