## Notebook环境
Notebook的运行环境可以选择conda_tensorflow_p36，本实验所用的sagemaker版本为2.42.0，接下来我们会安装对应的版本。

In [None]:
! pip install --upgrade pip
! pip install sagemaker==2.42.0

In [28]:
import boto3
import sagemaker
from sagemaker import get_execution_role

region = boto3.session.Session().region_name
role   = get_execution_role()
sess   = sagemaker.Session()
bucket = sess.default_bucket()

In [17]:
bucket

'sagemaker-us-east-1-022346938362'

Error response from daemon: Get https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.8.1-gpu-py38-cu111-ubuntu18.04: no basic auth credentials


## 准备Docker image

In [55]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.Session().region_name
ecr_repository = 'sagemaker-wenet'

# 登录ECR服务
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com


https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


### 创建注册表

In [None]:
!aws ecr create-repository --repository-name $ecr_repository

### 构建训练镜像

In [32]:
training_docker_file_path = '/fsx/wenet'

!cat $training_docker_file_path/Dockerfile-py36-pt1.8.1-cu111-sox-ready

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04

RUN cd / && \
    pip install ninja && \
    apt update && \
    apt-get install sox libsox-dev libsox-fmt-all pkg-config -y && \
    TORCHAUDIO_VERSION=release/0.6 && \
    git clone -b ${TORCHAUDIO_VERSION} https://github.com/pytorch/audio torchaudio && \
    cd torchaudio && \
    pip install .

COPY ./requirements.txt /tmp/

RUN pip install -r /tmp/requirements.txt && \
    pip install sagemaker-training && \
    apt-get clean
    

In [56]:
# 构建训练镜像并推送到ECR
tag = ':training-pt181'
training_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)
print('training_repository_uri: ', training_repository_uri)

!cd $training_docker_file_path && docker build -t "$ecr_repository$tag" . -f Dockerfile-py36-pt1.8.1-cu111-sox-ready
!docker tag {ecr_repository + tag} $training_repository_uri
!docker push $training_repository_uri


# !docker pull $training_repository_uri

training_repository_uri:  022346938362.dkr.ecr.us-east-1.amazonaws.com/sagemaker-wenet:training-pt181
Sending build context to Docker daemon  22.77MB
Step 1/4 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
 ---> b0662cb45e61
Step 2/4 : RUN cd / &&     pip install ninja &&     apt update &&     apt-get install sox libsox-dev libsox-fmt-all pkg-config -y &&     CUDNN_VERSION=8.0.5.39 && apt-get install cuda-nvrtc-11-1 cuda-nvrtc-dev-11-1 libcudnn8-dev=$CUDNN_VERSION-1+cuda11.1 -y &&     TORCHAUDIO_VERSION=v0.8.1 &&     git clone -b ${TORCHAUDIO_VERSION} https://github.com/pytorch/audio torchaudio &&     cd torchaudio &&     git submodule update --init --recursive &&     pip install .
 ---> Using cache
 ---> 38447c6689ad
Step 3/4 : COPY ./requirements.txt /tmp/
 ---> Using cache
 ---> c5230e1adb12
Step 4/4 : RUN pip install -r /tmp/requirements.txt &&     pip install sagemaker-training &&     apt-get clean
 ---> Using cache
 ---> e33f

### 构建推理镜像

In [33]:
decoding_docker_file_path='/fsx/wenet/runtime/server/x86'

!cat $decoding_docker_file_path/Dockerfile

FROM ubuntu:latest
MAINTAINER <zhendong.peng@mobvoi.com>
ENV DEBIAN_FRONTEND=noninteractive
RUN sed -i s@/archive.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list
RUN apt-get update && apt-get install -y git cmake wget build-essential
RUN git clone https://github.com/mobvoi/wenet.git /home/wenet
ARG model=20210327_unified_transformer_exp_server.tar.gz
RUN wget -P /home http://mobvoi-speech-public.ufile.ucloud.cn/public/wenet/aishell2/$model
RUN tar -xzf /home/$model -C /home
ARG build=/home/wenet/runtime/server/x86/build
RUN mkdir $build && cmake -S $build/.. -B $build



In [None]:
# 构建推理容器并推送到ECR
tag = ':decoding'
decoding_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)
print('decoding_repository_uri: ', decoding_repository_uri)


!cd $decoding_docker_file_path && docker build -t "$ecr_repository$tag" .
!docker tag {ecr_repository + tag} $decoding_repository_uri
!docker push $decoding_repository_uri


## 数据准备

### 数据下载

In [None]:
cd /fsx/wenet/examples/aishell/s0 && \
bash run.sh --stage -1 --stop_stage -1 --data /fsx/asr-data/OpenSLR/33


### 数据预处理

In [7]:
from sagemaker.inputs import FileSystemInput


# 指定文件系统的id.
file_system_id = 'fs-0f8a3b8eef47b6ff8'
# 提供数据集所在的路径，注意格式
file_system_path = '/fsx'
# 指定挂载文件系统的访问模式，支持"ro"（只读）或"rw"（读写）两种，注意内置算法只支持 以 ro 的方式挂载
file_system_access_mode = 'ro'
# 指定文件系统的类型, 支持"EFS" 或 "FSxLustre"两种.
file_system_type = 'FSxLustre'
# 以VPC内的方式启动 Amazon SageMaker 训练任务,指定所在子网和安全组，subnet需要为list或者tuple格式
security_groups_ids = ['sg-04acfc98f6929ee4e']
subnets= ['vpc-3c49de46']

# 定义数据输入
file_system_input_train = FileSystemInput(file_system_id=file_system_id,
                                  file_system_type=file_system_type,
                                  directory_path=file_system_path,
                                  file_system_access_mode=file_system_access_mode)

In [15]:
cd /opt/ml/code/examples/aishell/s0
bash run.sh --stage 0 --stop_stage 3 --trail_dir /opt/ml/input/data/train --train_set train --data /opt/ml/input/data/33

bash run.sh --stage 4 --stop_stage 4 --trail_dir /opt/ml/input/data/train --train_set train --data /opt/ml/input/data/33


bash run.sh --stage 4 --stop_stage 4 --train_set train --trail_dir /opt/ml/input/data/train/sm-train \
    --data /opt/ml/input/data/train/asr-data/OpenSLR/33 --shared_dir /opt/ml/input/data/train/shared


networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-un9wk:
    command: train
    container_name: mfhuhz1akz-algo-1-un9wk
    environment:
    - AWS_REGION=us-east-1
    - TRAINING_JOB_NAME=sagemaker-wenet-2021-06-03-08-49-01-226
    image: sagemaker-wenet:training
    networks:
      sagemaker-local:
        aliases:
        - algo-1-un9wk
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmplrz93i9x/algo-1-un9wk/output/data:/opt/ml/output/data
    - /tmp/tmplrz93i9x/algo-1-un9wk/input:/opt/ml/input
    - /tmp/tmplrz93i9x/algo-1-un9wk/output:/opt/ml/output
    - /tmp/tmplrz93i9x/model:/opt/ml/model
    - /opt/ml/metadata:/opt/ml/metadata
    - /fsx/wenet:/opt/ml/code
    - /fsx/trail_local_sm:/opt/ml/input/data/train
    - /fsx/asr-data/OpenSLR/33:/opt/ml/input/data/33

SyntaxError: invalid syntax (<ipython-input-15-a305b99eeb98>, line 2)

In [9]:
%cd /fsx/wenet

from sagemaker.pytorch.estimator import PyTorch

# checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, 'common_voice_data', 'checkpoints')
# checkpoint_local_path='/opt/ml/checkpoints'

hp={'train_set':'train', 'trail_dir':'/opt/ml/input/data/train', 'CUDA_VISIBLE_DEVICES': '1,2'}


estimator=PyTorch(
    entry_point='examples/aishell/s0/run1.sh',
#     image_uri='sagemaker-wenet:training',
    image_uri=training_repository_uri,
    instance_type='local',
    instance_count=1,
    source_dir='.',
    role=role,
    hyperparameters=hp
)

estimator.fit({'train':'file:///fsx/trail_local_0/', 'wav':'file:///fsx/asr-data/OpenSLR/33/data_aishell/wav/'})
# estimator.fit(inputs={'train': file_system_input_train})


/fsx/wenet
Creating dhls9xm95e-algo-1-zvm5j ... 
Creating dhls9xm95e-algo-1-zvm5j ... done
Attaching to dhls9xm95e-algo-1-zvm5j
[36mdhls9xm95e-algo-1-zvm5j |[0m 2021-06-09 13:09:52,717 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36mdhls9xm95e-algo-1-zvm5j |[0m 2021-06-09 13:09:52,730 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36mdhls9xm95e-algo-1-zvm5j |[0m 2021-06-09 13:09:52,742 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36mdhls9xm95e-algo-1-zvm5j |[0m 2021-06-09 13:09:52,753 sagemaker-training-toolkit INFO     Invoking user script
[36mdhls9xm95e-algo-1-zvm5j |[0m 
[36mdhls9xm95e-algo-1-zvm5j |[0m Training Env:
[36mdhls9xm95e-algo-1-zvm5j |[0m 
[36mdhls9xm95e-algo-1-zvm5j |[0m {
[36mdhls9xm95e-algo-1-zvm5j |[0m     "additional_framework_parameters": {},
[36mdhls9xm95e-algo-1-zvm5j |[0m     "channel_input_dirs": {
[36mdhls9xm95e-algo-1-zvm5j |

KeyboardInterrupt: 

In [None]:
from sagemaker.tensorflow.estimator import TensorFlow

checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, 'common_voice_data', 'checkpoints')
checkpoint_local_path='/opt/ml/checkpoints'

hp={'mode':'train', 'data_dir':'/opt/ml/input/data/training', 'output_dir':'/opt/ml/output', 'batch_size': 64, 'sm_checkpoint': checkpoint_local_path}


estimator=TensorFlow(
    image_uri=training_repository_uri,
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    entry_point='./run_rnnt.py',
    source_dir='.',
    role=role,
    hyperparameters=hp,
    
    # Parameters required to enable checkpointing
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path
)

estimator.fit('s3://sagemaker-us-east-1-022346938362/common_voice_data/cv-corpus-6.1-2020-12-11/zh-CN/tfrecord/')

In [11]:
from sagemaker.inputs import FileSystemInput
from sagemaker.pytorch.estimator import PyTorch

# bash run.sh --stage 4 --stop_stage 4 --trail_dir /opt/ml/input/data/train --train_set train --data /opt/ml/input/data/33

# 指定文件系统的id.
file_system_id = 'fs-0f8a3b8eef47b6ff8'
# 提供数据集所在的路径，注意格式
file_system_path = '/yobzhbmv'
# 指定挂载文件系统的访问模式，支持"ro"（只读）或"rw"（读写）两种，注意内置算法只支持 以 ro 的方式挂载
file_system_access_mode = 'rw'
# 指定文件系统的类型, 支持"EFS" 或 "FSxLustre"两种.
file_system_type = 'FSxLustre'
# 以VPC内的方式启动 Amazon SageMaker 训练任务,指定所在子网和安全组，subnet需要为list或者tuple格式
security_group_ids = ['sg-04acfc98f6929ee4e']
# subnets= ['vpc-3c49de46']
subnets= ['subnet-07ce0ab63b4cfeb25']

# 定义数据输入
file_system_input_train = FileSystemInput(file_system_id=file_system_id,
                                  file_system_type=file_system_type,
                                  directory_path=file_system_path,
                                  file_system_access_mode=file_system_access_mode)

data_dir   = '/opt/ml/input/data/train/asr-data/OpenSLR/33'
trail_dir  = '/opt/ml/input/data/train/sm-train/trail0'
shared_dir = '/opt/ml/input/data/train/sm-train/shared'
# shared_dir = '/opt/ml/input/data/train/shared'

## 数据预处理 - SageMaker托管实例

In [None]:


bash run.sh --stage 4 --stop_stage 4 --train_set train  \
    --data /opt/ml/input/data/train/asr-data/OpenSLR/33 \
    --trail_dir /opt/ml/input/data/train/sm-train/trail0 \
    --shared_dir /opt/ml/input/data/train/sm-train/shared 

# /opt/ml/input/data/train  <==> /fsx
# /opt/ml/input/data/train/asr-data/OpenSLR/33  <==> /fsx/asr-data/OpenSLR/33
# /opt/ml/input/data/train/sm-train ==> /fsx/sm-train

hp= {
    'stage': 0, 'stop_stage': 3, 'train_set':'train', 
    'data': data_dir, 'trail_dir': trail_dir, 'shared_dir': shared_dir
}

estimator=PyTorch(
    entry_point='examples/aishell/s0/sm-run.sh',
    image_uri=training_repository_uri,
    instance_type='ml.c5.4xlarge',
    instance_count=1,
    source_dir='.',
    role=role,
    hyperparameters=hp,
    
    subnets=subnets,
    security_group_ids=security_group_ids,
    
    debugger_hook_config=False,
    disable_profiler=True
)

# estimator.fit({'train':'file:///fsx/trail_local_0/', 'wav':'file:///fsx/asr-data/OpenSLR/33/data_aishell/wav/'})

estimator.fit(inputs={'train': file_system_input_train})


2021-06-08 09:49:56 Starting - Starting the training job...
2021-06-08 09:49:58 Starting - Launching requested ML instances......
2021-06-08 09:51:11 Starting - Preparing the instances for training......
2021-06-08 09:52:06 Downloading - Downloading input data
2021-06-08 09:52:06 Training - Downloading the training image...........[34m2021-06-08 09:54:06,765 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-06-08 09:54:08,241 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-06-08 09:54:08,251 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-06-08 09:54:08,258 sagemaker-training-toolkit INFO     Invoking user script
[0m
[34mTraining Env:
[0m
[34m{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": null,
    "hosts": [


## 模型训练 - SageMaker托管实例


In [None]:

# bash run.sh --stage 4 --stop_stage 4 --train_set train  \
#     --data /opt/ml/input/data/train/asr-data/OpenSLR/33 \
#     --trail_dir /opt/ml/input/data/train/sm-train/trail0 \
#     --shared_dir /opt/ml/input/data/train/sm-train/shared 

# instance_type='ml.g4dn.4xlarge'
instance_type='ml.p3.2xlarge'
instance_count = 2
# CUDA_VISIBLE_DEVICES='0,1,2,3'
CUDA_VISIBLE_DEVICES='0'

hp= {
    'stage': 4, 'stop_stage': 4, 'train_set':'train', 
    'data': data_dir, 'trail_dir': trail_dir, 'shared_dir': shared_dir,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES, 
    'ddp_init_path': '/opt/ml',
    'num_nodes': instance_count
}

estimator=PyTorch( 
    entry_point='examples/aishell/s0/sm-run.sh',
#     image_uri=training_repository_uri,
    image_uri='022346938362.dkr.ecr.us-east-1.amazonaws.com/sagemaker-wenet:training',
    instance_type =instance_type,
    instance_count=instance_count,
    source_dir='.',
    role=role,
    hyperparameters=hp,
    
    subnets=subnets,
    security_group_ids=security_group_ids,
    
    debugger_hook_config=False,
    disable_profiler=True
    # Parameters required to enable checkpointing
#     checkpoint_s3_uri=checkpoint_s3_bucket,
#     checkpoint_local_path=checkpoint_local_path
)


estimator.fit(inputs={'train': file_system_input_train})


In [27]:
!pwd

/fsx/wenet


In [72]:

# bash run.sh --stage 4 --stop_stage 4 --train_set train  \
#     --data /opt/ml/input/data/train/asr-data/OpenSLR/33 \
#     --trail_dir /opt/ml/input/data/train/sm-train/trail0 \
#     --shared_dir /opt/ml/input/data/train/sm-train/shared 


data_dir   = '/opt/ml/input/data/train/asr-data/OpenSLR/33'
trail_dir  = '/opt/ml/input/data/train/sm-train/trail0'
shared_dir = '/opt/ml/input/data/train/sm-train/shared'

# instance_type='ml.g4dn.4xlarge'
instance_type='ml.p3.16xlarge'
instance_count = 2
# CUDA_VISIBLE_DEVICES='0,1,2,3'
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7'

hp= {
    'stage': 4, 'stop_stage': 4, 'train_set':'train', 
    'data': data_dir, 'trail_dir': trail_dir, 'shared_dir': shared_dir,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES, 
    'ddp_init_path': '/opt/ml',
    'num_nodes': instance_count
}

estimator=PyTorch( 
    entry_point='examples/aishell/s0/sm-run.sh',
#     image_uri=training_repository_uri,
    image_uri='022346938362.dkr.ecr.us-east-1.amazonaws.com/sagemaker-wenet:training-pt181',
    instance_type =instance_type,
    instance_count=instance_count,
    source_dir='.',
    role=role,
    hyperparameters=hp,
    
    subnets=subnets,
    security_group_ids=security_group_ids,
    
    debugger_hook_config=False,
    disable_profiler=True,
    distribution = {
        'smdistributed':{
            'dataparallel':{
                'enabled': True, 
#                 "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
            }
        }
    }
    # Parameters required to enable checkpointing
#     checkpoint_s3_uri=checkpoint_s3_bucket,
#     checkpoint_local_path=checkpoint_local_path
)


estimator.fit(inputs={'train': file_system_input_train})


2021-06-22 09:22:36 Starting - Starting the training job...
2021-06-22 09:22:51 Starting - Launching requested ML instances.........
2021-06-22 09:24:33 Starting - Preparing the instances for training............
2021-06-22 09:26:15 Downloading - Downloading input data
2021-06-22 09:26:15 Training - Downloading the training image...............................[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2021-06-22 09:31:43,886 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2021-06-22 09:31:43,965 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2021-06-22 09:31:44,817 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m20

UnexpectedStatusException: Error for Training job sagemaker-wenet-2021-06-22-09-22-33-664: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "mpirun --host algo-1:8,algo-2:8 -np 16 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3.16xlarge smddprun /bin/sh -c ./examples/aishell/s0/sm-run.sh --CUDA_VISIBLE_DEVICES 0,1,2,3,4,5,6,7 --data /opt/ml/input/data/train/asr-data/OpenSLR/33 --ddp_init_path /opt/ml --num_nodes 2 --shared_dir /opt/ml/input/data/train/sm-train/shared --stage 4 --stop_stage 4 --trail_dir /opt/ml/input/data/train/sm-t