# NeMo ASR Fine-tuning Using AWS SageMaker

In this tutorial we show how to fine-tune a pre-trained NeMo ASR model using [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html).

We fine-tune a Conformer CTC model on the AN4 dataset, running the job on a remote SageMaker instance.

The overall steps are:

1. Set up your AWS credentials to access SageMaker
2. Download the source code we'll be running
3. Configure the fine-tuning job
4. Set up the AN4 dataset and upload the data to S3
5. Run fine-tuning job on SageMaker

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect
"""
# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install text-unidecode
!pip install "matplotlib>=3.3.2"

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

In [None]:
!pip install sagemaker awscli

### 1. Set up SageMaker with AWS Credentials

If you haven't set up your AWS credentials yet, configure them with the AWS CLI.
You will need your access key and secret key, with permissions to use SageMaker.

In [None]:
!aws configure
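
After configuring, it can be useful to confirm that a key pair is actually visible before starting a SageMaker session. The sketch below is a minimal check using only the standard library. It assumes the two standard environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or the shared credentials file at `~/.aws/credentials` that `aws configure` writes. The helper name `has_aws_credentials` is hypothetical, and the check does not cover other credential sources (SSO, instance roles) or verify that the keys carry SageMaker permissions.

```python
import configparser
import os
from pathlib import Path


def has_aws_credentials(env=None, credentials_file=None, profile="default"):
    """Return True if an access key and secret key can be found.

    Hypothetical helper: looks at the standard environment variables first,
    then at the shared credentials file written by `aws configure`.
    """
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return True

    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # a missing file simply yields no sections
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id") and section.get("aws_secret_access_key"))
```

Calling `has_aws_credentials()` with no arguments inspects the current environment and the default profile. A `False` result suggests re-running `aws configure`.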

In [None]:
from pathlib import Path

import sagemaker
import wget
from omegaconf import OmegaConf
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

from nemo.utils.notebook_utils import download_an4

In [None]:
sess = sagemaker.Session()

### 2. Download the NeMo source code

SageMaker allows you to pass in your own source code, with an entrypoint script.

Below we download the AWS NeMo `config.yaml` which contains our configuration, and the `speech_to_text_ctc_finetune.py` script to run fine-tuning.

In [None]:
code_dir = Path('./code/')
config_dir = code_dir / 'conf/'
data_dir = Path('./data/')
code_dir.mkdir(exist_ok=True)
config_dir.mkdir(exist_ok=True)

In [None]:
config_path = str(config_dir / "config.yaml")
wget.download(
    "https://raw.githubusercontent.com/NVIDIA/NeMo/feat/aws-asr/tutorials/asr/cloud/aws/conf/config.yaml", config_path
)
wget.download(
    "https://raw.githubusercontent.com/NVIDIA/NeMo/feat/aws-asr/tutorials/asr/cloud/aws/speech_to_text_ctc_finetune.py",
    str(code_dir),
)

We also create a `requirements.txt` file within our source code to install NeMo.

In [None]:
with open(code_dir / 'requirements.txt', 'w') as f:
    f.write("nemo_toolkit[all]")
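
Since SageMaker packages the whole source directory, it may help to verify its layout before launching a job. The following is a minimal sketch under the assumption that the directory should contain exactly the three files this tutorial creates. `EXPECTED_FILES` and `missing_source_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Files this tutorial places in the source directory (relative paths).
EXPECTED_FILES = (
    "speech_to_text_ctc_finetune.py",
    "requirements.txt",
    "conf/config.yaml",
)


def missing_source_files(source_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from source_dir (hypothetical helper)."""
    root = Path(source_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from `missing_source_files(code_dir)` indicates that the entry point, the requirements file, and the config are all in place.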

### 3. Configure the fine-tuning job

Now we configure the fine-tuning job by modifying the `config.yaml` file stored in our source code directory.
We set the manifest paths as they will appear inside the SageMaker container, along with the name of the pretrained model we'll be using.

In [None]:
conf = OmegaConf.load(config_path)

conf.pretrained_model_name = "nvidia/stt_en_conformer_ctc_large"

conf.model.train_ds.manifest_filepath = "/opt/ml/input/data/training/an4/train_manifest.json"
conf.model.validation_ds.manifest_filepath = "/opt/ml/input/data/testing/an4/test_manifest.json"
conf.trainer.accelerator = "gpu"
conf.trainer.max_epochs = 1
OmegaConf.save(conf, config_dir / 'config.yaml')
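
The manifest paths above follow the convention, as we understand it, that SageMaker exposes each input channel inside the container under `/opt/ml/input/data/<channel_name>`. The sketch below shows how the two paths are derived from the channel names `training` and `testing` that are later passed to `fit`. `SAGEMAKER_INPUT_ROOT` and `channel_path` are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath

# Assumed SageMaker convention: every input channel appears under this directory.
SAGEMAKER_INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_path(channel, *parts):
    """Build the in-container path of a file inside an input channel (hypothetical helper)."""
    return str(SAGEMAKER_INPUT_ROOT.joinpath(channel, *parts))


train_manifest = channel_path("training", "an4", "train_manifest.json")
test_manifest = channel_path("testing", "an4", "test_manifest.json")
print(train_manifest)
print(test_manifest)
```

If a channel is renamed in the `fit` call, the corresponding manifest path in `config.yaml` has to change with it.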

### 4. Set up the AN4 Dataset, upload data to S3

We now download our training and validation data and upload it to S3, so that SageMaker can mount the data on the instance at runtime.

In [None]:
# Within the SageMaker container, the mount directories are where our data will be stored.
download_an4(
    data_dir=str(data_dir),
    train_mount_dir="/opt/ml/input/data/training/",
    test_mount_dir="/opt/ml/input/data/testing/",
)

# Upload to the default bucket
prefix = "an4"
bucket = sess.default_bucket()
loc = sess.upload_data(path=str(data_dir), bucket=bucket, key_prefix=prefix)
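
Because the audio paths in the manifests must match the container mount directory rather than the local download location, a quick sanity check on a manifest can catch mismatches before a job is launched. The sketch below assumes the usual NeMo manifest layout of one JSON object per line with an `audio_filepath` key. The helper name `audio_paths_outside` is hypothetical.

```python
import json


def audio_paths_outside(manifest_path, mount_dir):
    """Return audio_filepath entries that do not start with mount_dir.

    Assumes one JSON object per line with an "audio_filepath" key.
    Hypothetical helper intended only as a quick sanity check.
    """
    prefix = mount_dir.rstrip("/") + "/"
    outside = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            if not entry["audio_filepath"].startswith(prefix):
                outside.append(entry["audio_filepath"])
    return outside
```

For example, `audio_paths_outside(data_dir / "an4" / "train_manifest.json", "/opt/ml/input/data/training/")` should return an empty list. The local manifest location used here is an assumption, so adjust it to wherever `download_an4` wrote the file.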

### 5. Run fine-tuning job on SageMaker

Finally, we pass the S3 locations of the training and validation data to the estimator as input channels, together with the S3 output path where the results will be stored.

In [None]:
channels = {"training": loc, "testing": loc}

role = get_execution_role()

output_path = "s3://" + sess.default_bucket() + "/nemo-output/"

local_mode = True  # set to False to train on a remote SageMaker instance

if local_mode:
    instance_type = "local_gpu"
else:
    instance_type = "ml.p2.xlarge"

est = PyTorch(
    entry_point="speech_to_text_ctc_finetune.py",
    source_dir="code",  # directory of your training script
    role=role,
    instance_type=instance_type,
    instance_count=1,
    framework_version="1.12.0",
    py_version="py38",
    volume_size=250,
    output_path=output_path,
    hyperparameters={'config-path': 'conf'},
)

est.fit(inputs=channels)
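
The `hyperparameters` dictionary is how the config directory reaches the training script. Our understanding is that the SageMaker framework containers forward each hyperparameter to the entry point as a `--key value` command-line argument, so `{'config-path': 'conf'}` arrives as `--config-path conf`, which the Hydra-based NeMo script uses to locate `config.yaml`. The sketch below only illustrates that flattening convention. `hyperparameters_to_argv` is a hypothetical helper, not part of the SageMaker SDK.

```python
def hyperparameters_to_argv(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...] (illustrative only)."""
    argv = []
    for key, value in hyperparameters.items():
        argv.extend([f"--{key}", str(value)])
    return argv


print(hyperparameters_to_argv({"config-path": "conf"}))
# ['--config-path', 'conf']
```

This is also why the path is given relative to the source directory: the script runs from the copied `code` directory inside the container, where `conf/config.yaml` sits next to the entry point.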