# Efficient Large Language Model training with LoRA and Hugging Face

In this SageMaker example, we are going to learn how to apply [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) to fine-tune Phi-2, a 2.7B-parameter model that showed nearly state-of-the-art performance among models with fewer than 13 billion parameters, on a single GPU. We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

You will learn how to:

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune Phi-2 with LoRA and bnb 4bit on Amazon SageMaker
4. Deploy the model to Amazon SageMaker Endpoint

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
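To make "without fine-tuning all the model's parameters" concrete, here is a rough back-of-the-envelope sketch for LoRA. It replaces the update of one `d_out x d_in` weight matrix with two low-rank factors of rank `r`, so the trainable parameter count drops from `d_in * d_out` to `r * (d_in + d_out)`. The hidden size of 2560 is an assumption about Phi-2, the rank of 8 is a hypothetical choice, and `lora_trainable_params` is an illustrative helper, not part of PEFT.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


hidden_size = 2560  # assumed Phi-2 hidden size, not taken from this notebook
rank = 8            # hypothetical LoRA rank

full_params = hidden_size * hidden_size  # one square projection matrix
lora_params = lora_trainable_params(hidden_size, hidden_size, rank)

print(f"{lora_params:,} vs {full_params:,} ({lora_params / full_params:.3%})")
# 40,960 vs 6,553,600 (0.625%)
```

Under these assumptions, each adapted matrix trains well under 1% of the parameters that full fine-tuning would update.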


## 1. Setup Development Environment


In [5]:
!pip install transformers datasets sagemaker aiobotocore awscli py7zr --upgrade --quiet

In [1]:
import boto3

%env AWS_DEFAULT_REGION=eu-west-1
%env AWS_PROFILE=unidatalab
boto3.setup_default_session(profile_name="unidatalab")

env: AWS_DEFAULT_REGION=eu-west-1
env: AWS_PROFILE=unidatalab


In [2]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/vmelnyk/Library/Application Support/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name vladyslav.melnyk@unidatalab.com to get Role path.


sagemaker role arn: arn:aws:iam::839041442979:role/sagemaker_execution_role
sagemaker bucket: sagemaker-eu-west-1-839041442979
sagemaker session region: eu-west-1


## 2. Load and prepare the dataset

We will use [Indian Financial News](https://huggingface.co/datasets/kdave/Indian_Financial_News), an open-source dataset comprising about 26,000 rows of financial news articles related to the Indian market (26,961 rows in the CSV loaded below). It features four columns: URL, Content (scraped content), Summary (generated using the T5-base model), and Sentiment (gathered using the GPT add-on for Google Sheets). <br>The dataset is designed for sentiment analysis tasks, providing a comprehensive view of sentiments expressed in financial news.
<br>There is an issue with loading this dataset via ```dataset = load_dataset("kdave/Indian_Financial_News")```. Please download the CSV from Hugging Face and put it in the same folder as this notebook.
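Since the CSV is placed by hand, it may help to check it before loading. The following is a minimal sketch that uses only the standard library and assumes the file name and the four column names used in this notebook; `missing_columns` is a hypothetical helper, not part of `datasets`.

```python
import csv
from pathlib import Path

CSV_PATH = "training_data_26000.csv"  # file name used in the next cell
EXPECTED_COLUMNS = {"URL", "Content", "Summary", "Sentiment"}


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download it next to the notebook")
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return set(expected) - set(header)


# Only run the check when the file is actually present.
if Path(CSV_PATH).is_file():
    print(f"Missing columns: {missing_columns(CSV_PATH) or 'none'}")
```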

In [3]:
from datasets import load_dataset

dataset = load_dataset("csv", data_files="training_data_26000.csv")["train"]

print(f"Train dataset size: {len(dataset)}")
print(f"Dataset features: {dataset.features}")

Generating train split: 0 examples [00:00, ? examples/s]

Train dataset size: 26961
Dataset features: {'URL': Value(dtype='string', id=None), 'Content': Value(dtype='string', id=None), 'Summary': Value(dtype='string', id=None), 'Sentiment': Value(dtype='string', id=None)}


In [4]:
from transformers import AutoTokenizer

model_id = "microsoft/phi-2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2024  # overwrite wrong value

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
from random import randint
from itertools import chain
from functools import partial


# Define the create_prompt function
def create_prompt(sample):
    question = f"Instruct: Summarize and classify sentiment: {sample['Content']}"
    answer = f"\nOutput: {sample['Summary']}"
    sentiment = f"\nSentiment: {sample['Sentiment']}"

    full_prompt = ""
    full_prompt += "\n" + question
    full_prompt += "\n" + answer
    full_prompt += "\n" + sentiment
    sample["text"] = full_prompt

    return sample


# apply prompt template per sample
dataset = dataset.map(create_prompt, remove_columns=list(dataset.features))

# randint is inclusive on both ends, so subtract 1 to stay within bounds
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}


def chunk(sample, chunk_length=1024):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {
        k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()
    }
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in the batch
    # (0 when the batch is shorter than one chunk, so everything goes to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {
        k: concatenated_examples[k][batch_chunk_length:]
        for k in concatenated_examples.keys()
    }
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]),
    batched=True,
    remove_columns=list(dataset.features),
).map(
    partial(chunk, chunk_length=2024),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

Map:   0%|          | 0/26961 [00:00<?, ? examples/s]


Instruct: Summarize and classify sentiment: live bse live

nse live Volume Todays L/H More ×

Tech Mahindra on April 30 announced it is putting on hold incentives and wage hikes till there is more clarity on business as the coronavirus outbreak has disrupted operations worldwide.

However, its junior associates will get their remuneration packages in full.

Speaking to media after the announcement of the results, CP Gurnani, CEO, said the company has taken a conscious decision that junior associates will get full package.

"The cut was taken by middle and senior management," he said.

Employees, say junior associates, they might or might not be able to take any cut given their cost of living and saving.

COVID-19 Vaccine Frequently Asked Questions View more How does a vaccine work? A vaccine works by mimicking a natural infection. A vaccine not only induces immune response to protect people from any future COVID-19 infection, but also helps quickly build herd immunity to put an end to

Map:   0%|          | 0/26961 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2509 > 2024). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/26961 [00:00<?, ? examples/s]

Total number of samples: 12173
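The sample count is lower than the number of rows because the tokenized texts are concatenated and cut into fixed-length chunks, with leftover tokens carried into the next batch. The toy sketch below illustrates that packing idea on plain integer lists. `chunk_tokens` is a simplified, hypothetical stand-in for the `chunk` function above: it handles a single token stream and returns the remainder instead of keeping it in a global.

```python
from itertools import chain


def chunk_tokens(batch, chunk_length, remainder=None):
    """Concatenate token lists, emit full chunks, and return the leftover tokens."""
    tokens = list(remainder or []) + list(chain.from_iterable(batch))
    # Largest multiple of chunk_length that fits; may be 0 for a short batch.
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]


# First batch: 7 tokens in total, so one chunk of 4 and 3 tokens left over.
chunks_1, leftover_1 = chunk_tokens([[1, 2, 3], [4, 5, 6, 7]], chunk_length=4)
print(chunks_1, leftover_1)  # [[1, 2, 3, 4]] [5, 6, 7]

# Second batch: the leftover is prepended before chunking again.
chunks_2, leftover_2 = chunk_tokens([[8, 9]], chunk_length=4, remainder=leftover_1)
print(chunks_2, leftover_2)  # [[5, 6, 7, 8]] [9]
```

Note that chunks can span article boundaries, which is the usual trade-off of packing for causal language modeling.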


In [6]:
# save train_dataset to s3
training_input_path = f"s3://{sess.default_bucket()}/processed/peft-sagemaker/train"
# lm_dataset.save_to_disk(training_input_path)

# select data
# chunk = lm_dataset.select(range(400))
# chunk.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

uploaded data to:
training dataset to: s3://sagemaker-eu-west-1-839041442979/processed/peft-sagemaker/train


## 3. Fine-Tune Phi-2 with LoRA and bnb 4bit on Amazon SageMaker

In [13]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

mlflow_host = None

# define Training Job Name
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    "model_id": model_id,  # pre-trained model
    "dataset_path": "/opt/ml/input/data/training",  # path where sagemaker will save training dataset
    "epochs": 2,  # number of training epochs
    "per_device_train_batch_size": 2,  # batch size for training
    "lr": 2e-4,  # learning rate used during training
    "hf_token": HfFolder.get_token(),  # Hugging Face Hub token, passed through to the training script
    "merge_weights": True,  # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",  # train script
    source_dir="scripts",  # directory which includes all the files needed for training
    instance_type="ml.g5.2xlarge",  # instances type used for the training job
    instance_count=1,  # the number of instances used for training
    base_job_name=job_name,  # the name of the training job
    role=role,  # IAM role used in training job to access AWS resources, e.g. S3
    volume_size=300,  # the size of the EBS volume in GB
    transformers_version="4.28",  # the transformers version used in the training job
    pytorch_version="2.0",  # the PyTorch version used in the training job
    py_version="py310",  # the python version used in the training job
    hyperparameters=hyperparameters,  # the hyperparameters passed to the training job
    environment={
#        "MLFLOW_TRACKING_URI": f"http://{mlflow_host}",  # will be required when we have MLFlow Server
#        "MLFLOW_EXPERIMENT_NAME": job_name,              # will be required when we have MLFlow Server
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",
    },
)
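The training script `scripts/run_clm.py` is not shown in this notebook. As a rough orientation only, loading a model in 4-bit and wrapping it with LoRA adapters commonly looks like the sketch below. This is not the author's script. The LoRA rank, alpha, dropout, and `target_modules` names are assumptions (the module names in particular depend on the Phi-2 model revision), `build_lora_model` is a hypothetical helper, and it needs `transformers`, `peft`, and `bitsandbytes` versions that support 4-bit loading, plus a GPU.

```python
def build_lora_model(model_id="microsoft/phi-2"):
    """Sketch: load a causal LM in 4-bit and attach LoRA adapters with PEFT."""
    # Imports are deferred so this function can be defined without the GPU stack installed.
    import torch
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 quantization of the frozen base weights via bitsandbytes.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,               # assumed rank
        lora_alpha=32,      # assumed scaling factor
        lora_dropout=0.05,  # assumed dropout
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        # Assumed attention projection names; verify against the loaded model.
        target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    )
    return get_peft_model(model, lora_config)
```

Calling `build_lora_model().print_trainable_parameters()` on a GPU machine would report how small the trainable fraction is.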

In [14]:
# define a data input dictionary with our uploaded s3 uris
data = {"training": training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-peft-2024-02-26-23-45-56-2024-02-26-21-45-57-074


2024-02-26 21:45:58 Starting - Starting the training job...
2024-02-26 21:46:20 Pending - Preparing the instances for training......
2024-02-26 21:47:20 Downloading - Downloading input data...
2024-02-26 21:47:59 Downloading - Downloading the training image..................
2024-02-26 21:51:05 Training - Training image download completed. Training in progress.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-02-26 21:51:33,229 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-02-26 21:51:33,247 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-02-26 21:51:33,256 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-02-26 21:51:33,258 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-02-26 21:51:34,

In our example for `Phi-2 2.7b`, the SageMaker training job took `77,129` seconds, which is about `21.4` hours. The `ml.g5.2xlarge` instance we used costs `$1.212` per hour for on-demand usage. As a result, the total cost for training our fine-tuned Phi-2 model was only `~$26`.
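The arithmetic behind that estimate can be reproduced with a few lines. The duration and hourly rate are the figures quoted above; `training_cost_usd` is a hypothetical helper, and the result ignores storage and data transfer charges.

```python
def training_cost_usd(billable_seconds: float, hourly_rate_usd: float):
    """Return (hours, cost in USD) for an on-demand training job."""
    hours = billable_seconds / 3600
    return hours, hours * hourly_rate_usd


hours, cost = training_cost_usd(77_129, 1.212)
print(f"{hours:.1f} h, ${cost:.2f}")  # 21.4 h, $25.97
```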

In [132]:
huggingface_estimator.model_data

's3://sagemaker-eu-west-1-839041442979/huggingface-peft-2024-02-26-23-45-56-2024-02-26-21-45-57-074/output/model.tar.gz'

## 4. Deploy the model to Amazon SageMaker Endpoint

In [117]:
from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.g5.4xlarge"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,
    # model_data="s3://sagemaker-eu-west-1-139758530029/huggingface-peft-2024-01-28-12-57-23-2024-01-28-10-57-27-857/output/model.tar.gz",
    role=role,
    transformers_version="4.28", 
    pytorch_version="2.0", 
    py_version="py310",
    model_server_workers=1
)

In [118]:
# deploy model to SageMaker Inference

health_check_timeout = 600

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-inference-2024-02-28-15-39-00-490
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-inference-2024-02-28-15-39-01-235
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-inference-2024-02-28-15-39-01-235


----------!

In [127]:
# select a random test sample
#train_dataset = load_dataset("csv", data_files="training_data_26000.csv")["train"]
#sample = train_dataset[randint(0, len(train_dataset))]

# format sample
prompt_template = (
    f"""Instruct:
    Provide an overview of the most impactful Indian Financial News.
    Output:"""
)

formatted_sample = {
    "inputs": prompt_template,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.2,
        "max_new_tokens": 2000,
    },
}
formatted_sample

{'inputs': 'Instruct:\n    Provide an overview of the most impactful Indian Financial News.\n    Output:',
 'parameters': {'do_sample': True,
  'top_p': 0.9,
  'temperature': 0.2,
  'max_new_tokens': 2000}}

In [130]:
# predict
res = predictor.predict(formatted_sample)

print(res[0]["generated_text"])

Instruct:
    Provide an overview of the most impactful Indian Financial News.
    Output: the whole the whole the entire the entire the entire the whole the entire the entire the entire the entire the whole the entire the entire the entire the entire the entire the entire the entire the entire the whole the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the entire the whole the whole the entire the entire the entire the entire the top the top the top the top of the entire the entire the top the entire the Rosention of the entire the entire the whole the whole the entire the E 5 to the Cention of the Roention of the Rosention of the Cention of the E2 the E2 the Cention of the Cention of the Cention of the Euro the Rosention of the entire the Proention of the Eul the Cention of the E2 the CA 5 to the Rosention of the Rosention of the Cention of the C2 the Cention of the Rose
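The prompt above uses a different instruction than the `Instruct: Summarize and classify sentiment: ...` template built by `create_prompt` during data preparation. Below is a minimal sketch of a request payload that mirrors the training template instead. `build_payload` and the `max_new_tokens=256` default are hypothetical, the sampling parameters are copied from the cell above, and whether this changes the output quality shown here was not tested.

```python
def build_payload(content: str, max_new_tokens: int = 256) -> dict:
    """Build an endpoint request whose prompt follows the training template."""
    # Same layout as create_prompt, stopping right where the summary should begin.
    prompt = f"\nInstruct: Summarize and classify sentiment: {content}\n\nOutput:"
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.2,
            "max_new_tokens": max_new_tokens,  # assumed, shorter than the 2000 used above
        },
    }


payload = build_payload("Tech Mahindra puts wage hikes on hold.")
print(payload["inputs"])
# With a live endpoint: res = predictor.predict(payload)
```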

In [131]:
predictor.delete_model()
predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: huggingface-pytorch-inference-2024-02-28-15-39-00-490
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-inference-2024-02-28-15-39-01-235
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-inference-2024-02-28-15-39-01-235
