## Initial Setup
In this section we'll import the requisite libraries and instantiate the objects and variables needed to configure our training job.

In [2]:
import sagemaker                                        # SageMaker Python SDK
from sagemaker.pytorch import PyTorch                   # PyTorch Estimator for running pytorch training jobs
from sagemaker.debugger import TensorBoardOutputConfig  # Debugger TensorBoard config to log training metrics to TensorBoard
import boto3                                            # AWS SDK for Python
import os
import tarfile
import pandas as pd
from pathlib import Path

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [3]:
role = sagemaker.get_execution_role()  # IAM execution role used by SageMaker (here, for the training job)
sess = sagemaker.session.Session()  # SageMaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house model artifacts (same default bucket)
s3_key_prefix = "13-bill"  # folder within bucket where code artifact will go

region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

s3_client = boto3.client("s3")  # client to interact with the S3 API
sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints

In [None]:
!mkdir -p data  # make sure the target directory exists before downloading
!wget https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.train.jsonl -O data/dialogsum.train.jsonl
!wget https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.test.jsonl -O data/dialogsum.test.jsonl

Let's take a look at a few examples.
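
One way to preview the downloaded data is sketched below. It assumes each line of `data/dialogsum.train.jsonl` is a single JSON object (the usual JSON Lines layout); the helper name `preview_jsonl` is hypothetical, and field names are printed as found in the file rather than assumed.

```python
import json
from itertools import islice
from pathlib import Path


def preview_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines, then parse each remaining line as one JSON object.
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(lines, n)]


train_path = Path("data/dialogsum.train.jsonl")
if train_path.exists():  # only runs once the download cell has completed
    for record in preview_jsonl(train_path):
        for key, value in record.items():
            # Truncate long fields so the preview stays readable.
            print(f"{key}: {str(value)[:200]}")
        print("-" * 40)
```

Reading only the first few lines keeps the preview cheap even for large files; loading the whole file into a `pandas` DataFrame is a reasonable alternative when the full dataset is needed.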

The `accelerate launch` command has two key parts: the `config.yml` file and the `train.py` script. The `config.yml` file configures the distributed training job, and the `train.py` script is the training script that the launcher starts. In this example, we'll use the [ds_zero3.yml](src/train/ds_zero3.yaml) configuration file. The config file enables [DeepSpeed ZeRO Stage 3](https://www.deepspeed.ai/tutorials/zero/) and a number of other optimizations that make it possible to train large-scale models. This file was generated by running `accelerate config --config_file ds_zero3.yml` and then following the on-screen prompts.
The [train.py](src/train/train.py) script uses a number of key libraries to enable training of large models with minimal code changes:
- 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) - Configures the distributed training environment and adapts training objects (data loaders, models, optimizers) to the distributed environment
- 🤗 [Transformers](https://huggingface.co/docs/transformers/index) - Provides a number of pre-trained models and utilities for training and evaluating models
- 🤗 [PEFT](https://github.com/huggingface/peft) - Provides a number of methods for Parameter-Efficient Fine-Tuning (PEFT) of large language models. The [LoRA](https://arxiv.org/pdf/2106.09685.pdf) method will be used to fine-tune the model
- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - Provides a number of optimizations to enable training of large models. In this example, we'll use DeepSpeed ZeRO Stage 3 to enable training of models with over 1B parameters
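
To make the two-part structure of `accelerate launch` concrete, here is a minimal sketch of how a launcher script might assemble the command. It is an illustration only, not the contents of `acc_launcher.py`: the function name `build_accelerate_command` and the choice to forward extra arguments as `--key value` flags are assumptions. The only CLI feature relied on is the `--config_file` option of `accelerate launch`.

```python
import shlex


def build_accelerate_command(config_file, training_script, script_args=None):
    """Build an `accelerate launch` command as a list of arguments.

    Hypothetical helper: the config file goes to the launcher, everything
    after the script path is forwarded to the training script.
    """
    cmd = ["accelerate", "launch", "--config_file", config_file, training_script]
    for key, value in (script_args or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_accelerate_command(
    "ds_zero3.yml",
    "train.py",
    {"num_train_epochs": 1, "learning_rate": 1e-4},
)
print(shlex.join(cmd))
# accelerate launch --config_file ds_zero3.yml train.py --num_train_epochs 1 --learning_rate 0.0001
```

Building the command as a list (rather than a single string) means it can be passed to `subprocess.run` without shell quoting issues.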

# 8 GPUs (1 × ml.p4d.24xlarge)

In [10]:
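# configure TensorBoard to write its output directly to S3
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://{bucket}/{s3_key_prefix}/tensorboard"
)

# PyTorch 2.0.0 training image URI (us-east-1), kept for reference only: it is not passed to the
# estimator below, which resolves its image from framework_version and py_version
image = '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker'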
estimator1 = PyTorch(
    source_dir="src/train",
    entry_point="acc_launcher.py",  # launcher that starts the training script
    role=role,
    instance_count=1,
    instance_type="ml.p4d.24xlarge",  # 8 x NVIDIA A100 GPUs
    framework_version="2.0.0",
    py_version="py310",
    disable_profiler=True,
    tensorboard_output_config=tensorboard_output_config,
    hyperparameters={
        # launcher settings
        "training_script": "train_qlora.py",
        "config_file": "qlora.yaml",
        # model and data
        "seed": 100,
        "model_name_or_path": "NousResearch/Llama-2-13b-hf",
        "dataset_name": "smangrul/ultrachat-10k-chatml",
        "chat_template_format": "chatml",
        "add_special_tokens": False,
        "append_concat_token": False,
        "splits": "train,test",
        "max_seq_len": 2048,
        # training loop
        "num_train_epochs": 1,
        "logging_strategy": "steps",
        "evaluation_strategy": "epoch",
        "bf16": True,
        "packing": True,
        "learning_rate": 1e-4,
        "lr_scheduler_type": "cosine",
        "weight_decay": 1e-4,
        "warmup_ratio": 0.0,
        "max_grad_norm": 1.0,
        "output_dir": "llama-sft-qlora-dsz3",
        "per_device_train_batch_size": 2,
        "per_device_eval_batch_size": 2,
        "gradient_accumulation_steps": 2,
        "gradient_checkpointing": True,
        "use_reentrant": True,
        "dataset_text_field": "content",
        "use_flash_attn": False,
        # LoRA adapters
        "use_peft_lora": True,
        "lora_r": 8,
        "lora_alpha": 16,
        "lora_dropout": 0.1,
        "lora_target_modules": "all-linear",
        # 4-bit quantization (QLoRA)
        "use_4bit_quantization": True,
        "use_nested_quant": True,
        "bnb_4bit_compute_dtype": "bfloat16",
        "bnb_4bit_quant_storage_dtype": "bfloat16",
        "report_to": "none",
    },
    keep_alive_period_in_seconds=3600,  # keep the instance warm for 1 hour after the job ends
)
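
A few quantities implied by these hyperparameters are worth working out explicitly. The arithmetic below is a sketch under stated assumptions: one training process per GPU, 8 GPUs on a single `ml.p4d.24xlarge`, and the standard LoRA scaling factor `lora_alpha / lora_r`. The variable names `num_gpus`, `global_batch`, `tokens_per_step`, and `lora_scale` are introduced here for illustration.

```python
# Values copied from the estimator's hyperparameters above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_seq_len = 2048
lora_r = 8
lora_alpha = 16

# Assumption: one process per GPU, 8 GPUs per ml.p4d.24xlarge, 1 instance.
num_gpus = 8
instance_count = 1

# Sequences contributing to each optimizer step across all devices.
global_batch = (
    per_device_train_batch_size
    * gradient_accumulation_steps
    * num_gpus
    * instance_count
)

# Upper bound on tokens per optimizer step; reached when packing fills
# every sequence to max_seq_len.
tokens_per_step = global_batch * max_seq_len

# Standard LoRA scaling applied to the low-rank update.
lora_scale = lora_alpha / lora_r

print(f"global batch size: {global_batch}")
print(f"tokens per optimizer step (max): {tokens_per_step}")
print(f"LoRA scale (alpha / r): {lora_scale}")
```

With the settings above this gives a global batch of 32 sequences, at most 65,536 tokens per optimizer step, and a LoRA scale of 2.0.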

In [11]:
estimator1.fit(wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2024-04-14-16-55-37-214


2024-04-14 16:55:37 Starting - Starting the training job
2024-04-14 16:55:37 Pending - Training job waiting for capacity......
2024-04-14 16:56:10 Pending - Preparing the instances for training........................
2024-04-14 17:00:23 Downloading - Downloading input data...
2024-04-14 17:00:48 Downloading - Downloading the training image...............
2024-04-14 17:03:39 Training - Training image download completed. Training in progress..........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-04-14 17:04:56,463 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-04-14 17:04:56,559 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-04-14 17:04:56,566 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-04-14 17:04:56,568 sagemaker_pytorch_c


# next