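About This Document

This document adapts the content of the repository maintained by the Batch AI development team, with modifications and additions made for the DLL MeetUp. For the latest information, always check https://github.com/Azure/BatchAI.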

Introduction

Azure CLI 2.0 allows you to create and manage Batch AI resources - create/delete Batch AI file servers and clusters, submit and monitor training jobs.

This recipe shows how to create a GPU cluster, then run and monitor a distributed training job using Chainer.

The training script train_mnist.py is available in the official ChainerMN GitHub repository. This script trains a multilayer perceptron on the MNIST database of handwritten digits.

The Workflow

To train a model, you typically need to perform the following steps:

  • Create a GPU or CPU Batch AI cluster to run the job;
  • Make the training data and training scripts available on the cluster nodes;
  • Submit the training job and obtain its logs and/or generated models;
  • Delete the cluster or resize it to zero nodes so that you do not pay for compute resources when you are not using them.

In this recipe, we will:

  • Create a two-node GPU cluster (with Standard_NC6 VM size) named nc6;
  • Create a new storage account with an Azure File Share logs to store the job logs, and two Azure Blob Containers, scripts and data, to store the training script and the job output;
  • Deploy the training script to the storage account before job submission;
  • During the job submission, instruct Batch AI to mount the Azure File Share and the Azure Blob Containers on the cluster's nodes and make them available as regular file systems at $AZ_BATCHAI_JOB_MOUNT_ROOT/logs, $AZ_BATCHAI_JOB_MOUNT_ROOT/scripts, and $AZ_BATCHAI_JOB_MOUNT_ROOT/data, where AZ_BATCHAI_JOB_MOUNT_ROOT is an environment variable set by Batch AI for the job;
  • Monitor the job execution by streaming its standard output;
  • After the job completes, inspect its output and generated models;
  • At the end, clean up all allocated resources.

Prerequisites

  • Azure subscription - If you don't have an Azure subscription, create a free account before you begin.
  • Access to Azure CLI 2.0. You can either use Azure CLI 2.0 available in Cloud Shell or install and configure it locally using the following instructions.

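(Added) Registering the Resource Providers

If you have not yet registered the resource providers, you need to register them from the Azure Portal. Select Subscriptions from All services, then choose Resource providers from the left blade. Search for batch there and register Microsoft.Batch and Microsoft.BatchAI.

(Screenshot: resource providers)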

Cloud Shell Only

If you are using Cloud Shell, change the working directory to /usr/$USER/clouddrive, because your home directory has no free space:

cd /usr/$USER/clouddrive

Create a Resource Group

An Azure resource group is a logical container for deploying and managing Azure resources. The following command will create a new resource group batchai.recipes in the East US location:

az group create -n batchai.recipes -l eastus

Create a Batch AI Workspace

The following command will create a new workspace recipe_workspace in the East US location:

az batchai workspace create -g batchai.recipes -n recipe_workspace -l eastus

Create GPU cluster

The following command will create a two-node GPU cluster (VM size is Standard_NC6) using Ubuntu as the operating system image.

az batchai cluster create -n nc6 -g batchai.recipes -w recipe_workspace -s Standard_NC6 -t 2 --generate-ssh-keys

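(Important) Adding --vm-priority lowpriority creates a cluster of low-priority virtual machines. This setting cannot be changed after the cluster is created, so specify it at creation time.

(Note) You can also create a cluster without GPUs. In that case, specify a different VM size, for example -s Standard_D1. If you do not use GPUs in this hands-on, remember to remove the -g from train_mnist.py -g when you run the job later.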
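For a CPU-only cluster, the -g flag has to be removed from the commandLineArgs value of the job configuration (job.json) used later in this document. The following is a minimal sketch of that edit, assuming the configuration has the same shape as the job.json shown below; the helper name drop_gpu_flag is hypothetical and not part of any Azure tooling.

```python
import json


def drop_gpu_flag(job: dict) -> dict:
    """Return a copy of the job config whose Chainer arguments no longer contain -g."""
    updated = json.loads(json.dumps(job))  # simple deep copy of JSON-like data
    settings = updated["properties"]["chainerSettings"]
    args = settings["commandLineArgs"].split()
    # Join with plain spaces so that $AZ_BATCHAI_OUTPUT_MODEL stays unquoted
    # and can still be expanded on the node.
    settings["commandLineArgs"] = " ".join(arg for arg in args if arg != "-g")
    return updated


# Only the part of job.json that matters here is reproduced.
job = {
    "properties": {
        "chainerSettings": {
            "commandLineArgs": "-g -o $AZ_BATCHAI_OUTPUT_MODEL"
        }
    }
}
cpu_job = drop_gpu_flag(job)
print(cpu_job["properties"]["chainerSettings"]["commandLineArgs"])
```

The arguments are joined with plain spaces rather than a shell-quoting helper, because quoting would wrap $AZ_BATCHAI_OUTPUT_MODEL in single quotes and prevent its expansion.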

The --generate-ssh-keys option tells Azure CLI to generate private and public SSH keys if you do not have them already, so you can SSH to the cluster nodes using the SSH key and your current user name. Note: if you are using Cloud Shell, back up the ~/.ssh folder to some permanent storage. (If you have already created keys, the existing keys are used.)

Example output:

{
  "allocationState": "steady",
  "allocationStateTransitionTime": "2018-06-12T21:25:07.039000+00:00",
  "creationTime": "2018-06-12T21:25:07.039000+00:00",
  "currentNodeCount": 2,
  "errors": null,
  "id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/clusters/nc6",
  "name": "nc6",
  "nodeSetup": null,
  "nodeStateCounts": {
    "idleNodeCount": 2,
    "leavingNodeCount": 0,
    "preparingNodeCount": 0,
    "runningNodeCount": 0,
    "unusableNodeCount": 0
  },
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-06-12T21:25:23.591000+00:00",
  "resourceGroup": "batchai.recipes",
  "scaleSettings": {
    "autoScale": null,
    "manual": {
      "nodeDeallocationOption": "requeue",
      "targetNodeCount": 2
    }
  },
  "subnet": null,
  "type": "Microsoft.BatchAI/workspaces/clusters",
  "userAccountSettings": {
    "adminUserName": "recipeuser",
    "adminUserPassword": null,
    "adminUserSshPublicKey": "<YOUR SSH PUBLIC KEY HERE>"
  },
  "virtualMachineConfiguration": {
    "imageReference": {
      "offer": "UbuntuServer",
      "publisher": "Canonical",
      "sku": "16.04-LTS",
      "version": "latest",
      "virtualMachineImageId": null
    }
  },
  "vmPriority": "dedicated",
  "vmSize": "STANDARD_NC6"
}
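The output above can also be checked programmatically before submitting a job. The sketch below assumes the cluster JSON has been saved or captured as text, and that manual scaling is in use as in the example; the function name cluster_is_ready and the choice of "all target nodes idle" as the readiness criterion are illustrative assumptions, not a Batch AI definition.

```python
import json


def cluster_is_ready(cluster: dict) -> bool:
    """True when allocation has settled and every target node is idle."""
    manual = cluster["scaleSettings"]["manual"]
    if manual is None:
        # Auto-scale clusters have no fixed target; this sketch does not cover them.
        raise ValueError("only manually scaled clusters are supported")
    target = manual["targetNodeCount"]
    return (
        cluster["allocationState"] == "steady"
        and cluster["currentNodeCount"] == target
        and cluster["nodeStateCounts"]["idleNodeCount"] == target
    )


# Reduced copy of the example output above.
sample = json.loads("""
{
  "allocationState": "steady",
  "currentNodeCount": 2,
  "nodeStateCounts": {"idleNodeCount": 2, "preparingNodeCount": 0},
  "scaleSettings": {"autoScale": null,
                    "manual": {"nodeDeallocationOption": "requeue",
                               "targetNodeCount": 2}}
}
""")
ready = cluster_is_ready(sample)
print("cluster ready:", ready)
```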

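*Checking cluster information in the Portal

Select Azure Batch AI from All services, then open your workspace from Workspace. Select Clusters in the left blade to see the information for a specific cluster.

(Screenshot: checking the cluster)

*Changing the scale

You can change the number of nodes from Scale in the Portal. You can also switch from manual scaling to auto scaling there. With auto scaling, the cluster size changes as jobs are submitted and finish. The cluster usually reacts within about 10 minutes, but there is currently no SLA for this.

(Screenshot: scale)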

Create a Storage Account

Create a new storage account with a unique name in the same region where you are going to create the Batch AI cluster and run the job. Note that each storage account must have a unique name. (The name may contain only lowercase letters and digits and must be globally unique.)

az storage account create -n <storage account name> --sku Standard_LRS -g batchai.recipes -l eastus

If the selected storage account name is not available, the above command will report a corresponding error. In this case, choose another name and retry.

(Note) Always create the storage account in the same region as the cluster. If you create it in a different region, computation becomes significantly slower.
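The character rule for storage account names can be checked locally before calling the CLI. The sketch below encodes the rule stated above (lowercase letters and digits only) together with Azure's length limit of 3 to 24 characters; the function name is hypothetical, and global uniqueness can only be verified by Azure itself.

```python
import re

# Lowercase letters and digits only, 3 to 24 characters long.
_STORAGE_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the character and length rules; uniqueness is not checked here."""
    return _STORAGE_NAME.fullmatch(name) is not None


for candidate in ("batchairecipestorage", "BatchAI", "batchai.recipes", "ab"):
    print(candidate, is_valid_storage_account_name(candidate))
```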

Data Deployment

Download the Training Script

For GNU/Linux or Cloud Shell:

wget https://raw.githubusercontent.com/chainer/chainermn/v1.3.0/examples/mnist/train_mnist.py

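(Changed) Create Azure Blob Containers and Deploy the Training Script

The following commands create the Blob containers and upload the file to a container. Setting the destination path (-n) to chainer/train_mnist.py creates a virtual directory inside the Blob container. In this hands-on, the script to run is placed in scripts, and the results are written to data.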

az storage container create -n scripts --account-name <storage account name>
az storage container create -n data --account-name <storage account name>
az storage blob upload -f train_mnist.py -n chainer/train_mnist.py --account-name <storage account name> --container scripts
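The blob name given with -n determines where the script appears on the nodes once the container is mounted. The sketch below illustrates that mapping, assuming the scripts container is mounted with the relative mount path scripts, as in the job configuration used later; mounted_blob_path is a hypothetical helper, not a Batch AI API.

```python
import posixpath

# Environment variable that Batch AI sets for the job; kept unexpanded here.
MOUNT_ROOT = "$AZ_BATCHAI_JOB_MOUNT_ROOT"


def mounted_blob_path(relative_mount_path: str, blob_name: str) -> str:
    """Path under which a blob is visible on a node after the container is mounted."""
    return posixpath.join(MOUNT_ROOT, relative_mount_path, blob_name)


script_path = mounted_blob_path("scripts", "chainer/train_mnist.py")
print(script_path)
```

The resulting string is the value to use for pythonScriptFilePath in job.json.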

The logs are stored in Azure Files. Logs placed in Blob storage cannot be viewed while the job is running, so we recommend putting only the log files in Azure Files.

az storage share create -n logs --account-name <storage account name>

Submit Training Job

Prepare Job Configuration File

Create a training job configuration file job.json with the following content. (You can download job.json with wget in the next step and edit it as needed. The parts in <> must be replaced. For <Container Name (For Script)> and <Container Name (For Data)>, enter the names of the Blob storage containers created above: scripts and data, unless you changed them.)

{
    "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2018-05-01/job.json",
    "properties": {
        "nodeCount": 2,
        "chainerSettings": {
            "processCount": 2,
            "pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/chainer/train_mnist.py",
            "commandLineArgs": "-g -o $AZ_BATCHAI_OUTPUT_MODEL"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
        "mountVolumes": {
            "azureFileShares": [
                {
                    "azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/logs",
                    "relativeMountPath": "logs"
                }
            ],
            "azureBlobFileSystems": [
                {
                    "accountName": "<Storage Account Name>",
                    "containerName": "<Container Name (For Script)>",
                    "credentials": {
                        "accountKey": "<Storage Account Key>"
                    },
                    "relativeMountPath": "scripts"
                },
                {
                    "accountName": "<Storage Account Name>",
                    "containerName": "<Container Name (For Data)>",
                    "credentials": {
                        "accountKey": "<Storage Account Key>"
                    },
                    "relativeMountPath": "data"
                }
            ]
        },
        "outputDirectories": [{
            "id": "MODEL",
            "pathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/data"
        }],
        "containerSettings": {
            "imageSourceRegistry": {
                "image": "batchaitraining/chainermn:openMPI"
            }
        }
    }
}

This configuration file specifies:

  • nodeCount - number of nodes required by the job;
  • chainerSettings - tells Batch AI that the job needs Chainer, and specifies the path to the training script and its command line arguments;
  • stdOutErrPathPrefix - path where Batch AI will create directories containing the job's logs;
  • mountVolumes - list of filesystems to be mounted during the job execution. In this case, we are mounting one Azure File Share, logs, and two Azure Blob Containers, scripts and data. The filesystems are mounted under $AZ_BATCHAI_JOB_MOUNT_ROOT/<relativeMountPath>;
  • outputDirectories - collection of output directories which will be created by Batch AI. For each directory, Batch AI will create an environment variable with name AZ_BATCHAI_OUTPUT_<id>, where <id> is the directory identifier;
  • <AZURE_BATCHAI_STORAGE_ACCOUNT> - indicates that the storage account name will be specified during the job submission via the --storage-account-name parameter or the AZURE_BATCHAI_STORAGE_ACCOUNT environment variable on your computer;
  • containerSettings - tells Batch AI to run the job in the batchaitraining/chainermn:openMPI Docker container.
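To double-check a configuration before submission, the mount points and output variable names described in the list above can be derived from the JSON itself. This is a minimal sketch under the assumption that the file follows the structure shown above; describe_job_paths is a hypothetical helper, and it reports only the configured path prefix for each output directory, not the full path that Batch AI assigns at run time.

```python
def describe_job_paths(job: dict) -> dict:
    """List mount points and output environment variable names of a job config."""
    props = job["properties"]
    volumes = props.get("mountVolumes") or {}
    mounts = [
        "$AZ_BATCHAI_JOB_MOUNT_ROOT/" + volume["relativeMountPath"]
        for kind in ("azureFileShares", "azureBlobFileSystems")
        for volume in volumes.get(kind) or []
    ]
    # Batch AI names the variable AZ_BATCHAI_OUTPUT_<id>; the value stored
    # here is the configured prefix only.
    output_prefixes = {
        "AZ_BATCHAI_OUTPUT_" + directory["id"]: directory["pathPrefix"]
        for directory in props.get("outputDirectories") or []
    }
    return {"mounts": mounts, "output_prefixes": output_prefixes}


# Reduced copy of the job.json shown above.
job = {
    "properties": {
        "mountVolumes": {
            "azureFileShares": [{"relativeMountPath": "logs"}],
            "azureBlobFileSystems": [
                {"relativeMountPath": "scripts"},
                {"relativeMountPath": "data"},
            ],
        },
        "outputDirectories": [
            {"id": "MODEL", "pathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/data"}
        ],
    }
}
paths = describe_job_paths(job)
print(paths)
```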

Submit the Job in an Experiment

Use the following command to create a new experiment called chainer_experiment in the workspace:

az batchai experiment create -g batchai.recipes -w recipe_workspace -n chainer_experiment

Submit the Job

Use the following commands to download job.json and submit the job on the cluster. (After downloading job.json, edit it as described above, then run the job.)

wget -O job.json https://raw.githubusercontent.com/DLL-BatchAI-Hand-on/Chainer/master/Chainer-GPU-Distributed/job.json
az batchai job create -n distributed_chainer -c nc6 -g batchai.recipes -w recipe_workspace -e chainer_experiment -f job.json --storage-account-name <storage account name>

*You can look up the storage account keys with the following command. Enter either the primary key or the secondary key.

az storage account keys list -g batchai.recipes --account-name <storage account name>
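Filling in the Blob mount entries of job.json by hand is error-prone, so the edit can be scripted. The sketch below assumes that the key listing is a JSON array whose entries carry the key in a value field, and that job.json has the structure shown earlier; the helper names, the placeholder key, and the account name are illustrative only.

```python
import json


def first_key(keys_json: str) -> str:
    """Pick the first key from the JSON text printed by the key listing command."""
    return json.loads(keys_json)[0]["value"]


def fill_blob_mounts(job: dict, account: str, key: str, containers: dict) -> dict:
    """Return a copy of the job config with the Blob mount placeholders filled in.

    containers maps each relativeMountPath to the container name to mount there.
    """
    updated = json.loads(json.dumps(job))  # simple deep copy of JSON-like data
    for mount in updated["properties"]["mountVolumes"]["azureBlobFileSystems"]:
        mount["accountName"] = account
        mount["credentials"]["accountKey"] = key
        mount["containerName"] = containers[mount["relativeMountPath"]]
    return updated


# Placeholder data standing in for the real command output and job.json.
keys_json = '[{"keyName": "key1", "value": "FAKEKEY1=="}, {"keyName": "key2", "value": "FAKEKEY2=="}]'
template = {
    "properties": {
        "mountVolumes": {
            "azureBlobFileSystems": [
                {"accountName": "<Storage Account Name>",
                 "containerName": "<Container Name (For Script)>",
                 "credentials": {"accountKey": "<Storage Account Key>"},
                 "relativeMountPath": "scripts"},
                {"accountName": "<Storage Account Name>",
                 "containerName": "<Container Name (For Data)>",
                 "credentials": {"accountKey": "<Storage Account Key>"},
                 "relativeMountPath": "data"},
            ]
        }
    }
}
filled = fill_blob_mounts(
    template, "examplestorage", first_key(keys_json),
    {"scripts": "scripts", "data": "data"},
)
print(json.dumps(filled, indent=4))
```

To apply this to the real file, load job.json with json.load, pass the result through fill_blob_mounts, and write it back with json.dump.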

Example output (the storage settings in this document differ slightly from the original recipe, so your result will differ slightly):

{
  "caffe2Settings": null,
  "caffeSettings": null,
  "chainerSettings": {
    "commandLineArgs": "-g -o $AZ_BATCHAI_OUTPUT_MODEL",
    "processCount": 2,
    "pythonInterpreterPath": null,
    "pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/chainer/train_mnist.py"
  },
  "cluster": {
    "id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/clusters/nc6",
    "resourceGroup": "batchai.recipes"
  },
  "cntkSettings": null,
  "constraints": {
    "maxWallClockTime": "7 days, 0:00:00"
  },
  "containerSettings": {
    "imageSourceRegistry": {
      "credentials": null,
      "image": "batchaitraining/chainermn:openMPI",
      "serverUrl": null
    },
    "shmSize": null
  },
  "creationTime": "2018-06-13T00:47:28.231000+00:00",
  "customMpiSettings": null,
  "customToolkitSettings": null,
  "environmentVariables": null,
  "executionInfo": {
    "endTime": null,
    "errors": null,
    "exitCode": null,
    "startTime": "2018-06-13T00:47:31.573000+00:00"
  },
  "executionState": "running",
  "executionStateTransitionTime": "2018-06-13T00:47:31.573000+00:00",
  "horovodSettings": null,
  "id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/experiments/chainer_experiment/jobs/distributed_chainer",
  "inputDirectories": null,
  "jobOutputDirectoryPathSegment": "1cba1da6-5a83-45e1-a88e-8b397eb84356/batchai.recipes/workspaces/recipe_workspace/experiments/chainer_experiment/jobs/distributed_chainer/b64115e9-1e02-4006-b812-eec14cd08b92",
  "jobPreparation": null,
  "mountVolumes": {
    "azureBlobFileSystems": null,
    "azureFileShares": [
      {
        "accountName": "batchairecipestorage",
        "azureFileUrl": "https://batchairecipestorage.file.core.windows.net/logs",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "directoryMode": "0777",
        "fileMode": "0777",
        "relativeMountPath": "logs"
      },
      {
        "accountName": "batchairecipestorage",
        "azureFileUrl": "https://batchairecipestorage.file.core.windows.net/scripts",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "directoryMode": "0777",
        "fileMode": "0777",
        "relativeMountPath": "scripts"
      }
    ],
    "fileServers": null,
    "unmanagedFileSystems": null
  },
  "name": "distributed_chainer",
  "nodeCount": 2,
  "outputDirectories": [
    {
      "id": "MODEL",
      "pathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
      "pathSuffix": null
    }
  ],
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-06-13T00:47:28.684000+00:00",
  "pyTorchSettings": null,
  "resourceGroup": "batchai.recipes",
  "schedulingPriority": "normal",
  "secrets": null,
  "stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
  "tensorFlowSettings": null,
  "toolType": "chainer",
  "type": "Microsoft.BatchAI/workspaces/experiments/jobs"
}

Monitor Job Execution

The training script reports the training progress in the stdout.txt file inside the standard output directory. You can monitor the progress using the following command:

az batchai job file stream -j distributed_chainer -g batchai.recipes -w recipe_workspace -e chainer_experiment  -f stdout.txt

Example output:

==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
Downloading from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz...
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
     total [..................................................]  1.67%
this epoch [################..................................] 33.33%
       100 iter, 0 epoch / 20 epochs
       inf iters/sec. Estimated time to finish: 0:00:00.
     total [#.................................................]  3.33%
this epoch [#################################.................] 66.67%
       200 iter, 0 epoch / 20 epochs
    19.812 iters/sec. Estimated time to finish: 0:04:52.752795.
           0.22397     0.109173              0.9321         0.9641                    23.6282       
     total [##................................................]  5.00%
...   
     total [##################################################] 100.00%
this epoch [..................................................]  0.00%
      6000 iter, 20 epoch / 20 epochs
    18.829 iters/sec. Estimated time to finish: 0:00:00.

The streaming is stopped when the job is completed.
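If the streamed text is captured, for example by redirecting it to a file, the overall progress can be extracted from the progress bar lines. This is a small sketch that assumes the log format shown in the example output above; the function name latest_total_progress is hypothetical.

```python
import re

# Matches lines such as "     total [#......]  3.33%" but not "this epoch [...]".
_TOTAL = re.compile(r"^\s*total \[[#.]*\]\s*(\d+(?:\.\d+)?)%")


def latest_total_progress(log_text: str):
    """Return the most recent overall progress percentage, or None if absent."""
    latest = None
    for line in log_text.splitlines():
        match = _TOTAL.match(line)
        if match:
            latest = float(match.group(1))
    return latest


# Excerpt from the example output above.
sample_log = """\
     total [..................................................]  1.67%
this epoch [################..................................] 33.33%
       100 iter, 0 epoch / 20 epochs
     total [#.................................................]  3.33%
this epoch [#################################.................] 66.67%
"""
progress = latest_total_progress(sample_log)
print(progress)
```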

*You can also check this from the Portal. See the page of each job.

(Screenshot: job)

Click a log to view it in place.

(Screenshot: job2)

Inspect Generated Model Files

The job stores the generated model files in the output directory with id = MODEL. You can list these files and get their download URLs using the following command:

az batchai job file list -j distributed_chainer -g batchai.recipes -w recipe_workspace -e chainer_experiment -d MODEL

Example output:

[
  {
    "contentLength": 2575,
    "downloadUrl": "https://batchairecipestorage.file.core.windows.net/logs/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchai.recipes/workspaces/recipe_workspace/experiments/chainer_experiment/jobs/distributed_chainer/b64115e9-1e02-4006-b812-eec14cd08b92/outputs/cg.dot?sv=2016-05-31&sr=f&sig=qe8EEqx626OuYYbYEk1ndZw0oleFjGc01xdIrnbp3vE%3D&se=2018-06-13T02%3A08%3A04Z&sp=rl",
    "fileType": "file",
    "lastModified": "2018-06-13T00:50:33+00:00",
    "name": "cg.dot"
  },
  {
    "contentLength": 5983,
    "downloadUrl": "https://batchairecipestorage.file.core.windows.net/logs/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchai.recipes/workspaces/recipe_workspace/experiments/chainer_experiment/jobs/distributed_chainer/b64115e9-1e02-4006-b812-eec14cd08b92/outputs/log?sv=2016-05-31&sr=f&sig=NM7%2FxZD4PnGQPO68KROsgVPg6XuIh%2F7VK3SLQtFSGRI%3D&se=2018-06-13T02%3A08%3A04Z&sp=rl",
    "fileType": "file",
    "lastModified": "2018-06-13T00:55:52+00:00",
    "name": "log"
  }
]
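The listing can be turned into local copies of the files with a few lines of Python. The sketch below assumes the JSON structure shown above and that entries other than regular files carry a fileType different from "file"; download_urls and download_all are hypothetical helpers. Each download URL embeds a time-limited signature (the se parameter in the example), so downloads may fail once it has expired.

```python
import json
import urllib.request
from pathlib import Path


def download_urls(listing_json: str) -> dict:
    """Map each listed file name to its download URL, skipping non-file entries."""
    return {
        entry["name"]: entry["downloadUrl"]
        for entry in json.loads(listing_json)
        if entry["fileType"] == "file"
    }


def download_all(listing_json: str, target_dir: str) -> None:
    """Download every listed file into target_dir (requires network access)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, url in download_urls(listing_json).items():
        urllib.request.urlretrieve(url, str(target / name))


# Shortened stand-in for the listing above; the URLs are placeholders.
listing = """
[
  {"contentLength": 2575, "downloadUrl": "https://example.invalid/outputs/cg.dot",
   "fileType": "file", "name": "cg.dot"},
  {"contentLength": 5983, "downloadUrl": "https://example.invalid/outputs/log",
   "fileType": "file", "name": "log"}
]
"""
urls = download_urls(listing)
print(sorted(urls))
```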

Alternatively, you can use the Portal or Azure Storage Explorer to inspect the generated files. To distinguish output from the different jobs, Batch AI creates a unique folder structure for each of them. You can find the path to the folder containing the output using the jobOutputDirectoryPathSegment attribute of the submitted job:

az batchai job show -n distributed_chainer -g batchai.recipes -w recipe_workspace -e chainer_experiment --query jobOutputDirectoryPathSegment

*Note: the original document includes the option -j distributed_chainer in this command, but that is a typo.

Example output:

"00000000-0000-0000-0000-000000000000/batchai.recipes/workspaces/recipe_workspace/experiments/chainer_experiment/jobs/distributed_chainer/b64115e9-1e02-4006-b812-eec14cd08b92"
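The path segment encodes the subscription, resource group, workspace, experiment, job name and a per-job identifier, and the download URLs shown earlier place the generated files in an outputs folder beneath it. The sketch below splits the segment accordingly; the nine-part layout is inferred from the example output rather than from a documented contract, and parse_output_segment is a hypothetical helper.

```python
def parse_output_segment(segment: str) -> dict:
    """Split a jobOutputDirectoryPathSegment value into its named parts."""
    cleaned = segment.strip().strip('"')
    parts = cleaned.split("/")
    layout_ok = (
        len(parts) == 9
        and parts[2] == "workspaces"
        and parts[4] == "experiments"
        and parts[6] == "jobs"
    )
    if not layout_ok:
        raise ValueError("unexpected path segment layout: " + cleaned)
    return {
        "subscription": parts[0],
        "resource_group": parts[1],
        "workspace": parts[3],
        "experiment": parts[5],
        "job": parts[7],
        "job_uuid": parts[8],
        # Folder, relative to the mounted storage, that holds the output files.
        "outputs_folder": cleaned + "/outputs",
    }


segment = (
    '"00000000-0000-0000-0000-000000000000/batchai.recipes/workspaces/'
    'recipe_workspace/experiments/chainer_experiment/jobs/'
    'distributed_chainer/b64115e9-1e02-4006-b812-eec14cd08b92"'
)
info = parse_output_segment(segment)
print(info["job"], info["outputs_folder"])
```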

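Extras

SSH Access to the Nodes

You can look up the IP address and port number for SSH access to a node from the Azure Portal. Batch AI runs a load balancer behind the scenes, so a single public IP address lets you reach, over SSH, the compute nodes that have private IP addresses in the virtual network. Try re-running the job above under a different job name, then log in over SSH and check the state of the docker containers and the GPU utilization.

*Check the IP address and port number in the Portal

(Screenshot: ssh)

Run SSH with the values you found.

ssh <IP address> -p <port number>

*The SSH key specified at cluster creation is registered on the virtual machines, so run SSH from Cloud Shell.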

Finally

Delete the resource group and all allocated resources with the following command:

az group delete -n batchai.recipes -y

*Note: the original says az batchai group, but batchai is not needed.