Validate Deployment

Index:

1 Check Drivers

1.1 Check the driver service's log

Dashboard:

http://<master>:9090

Search for "driver" and view the driver status:

PAI_search_driver

View the driver logs. The log below shows a driver in a healthy state:

PAI_driver_right

1.2 Check the driver version

# (1) Find the driver container on the server
~$ sudo docker ps | grep driver

daeaa9a81d3f        aiplatform/drivers                                    "/bin/sh -c ./inst..."   8 days ago          Up 8 days                                    k8s_nvidia-drivers_drivers-one-shot-d7fr4_default_9d91059c-9078-11e8-8aea-000d3ab5296b_0
ccf53c260f6f        gcr.io/google_containers/pause-amd64:3.0              "/pause"                 8 days ago          Up 8 days                                    k8s_POD_drivers-one-shot-d7fr4_default_9d91059c-9078-11e8-8aea-000d3ab5296b_0

# (2) Log in to the driver container

~$ sudo docker exec -it daeaa9a81d3f /bin/bash

# (3) Check the driver version

root@~/drivers# nvidia-smi
Fri Aug  3 01:53:04 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000460D:00:00.0 Off |                    0 |
| N/A   31C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
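The three manual steps above can be sketched as two small filters. Note that `grep driver` matches both lines of the sample `docker ps` listing (the pause container's name also contains "drivers"), so this sketch selects on the IMAGE column instead. The function names `driver_container_id` and `driver_version` are hypothetical, and the image name pattern is an assumption taken from the sample listing.

```shell
#!/bin/sh
# Sketch only: both helpers read command output on stdin.

# Print the ID of each container whose IMAGE column (field 2) ends in
# "/drivers", optionally followed by a tag. Assumption: the driver image
# is named like "aiplatform/drivers", as in the sample listing.
driver_container_id() {
  awk '$2 ~ /\/drivers(:|$)/ { print $1 }'
}

# Print the version from the "Driver Version:" banner line of nvidia-smi.
driver_version() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p'
}

# Possible usage on a node (assumes nvidia-smi is on PATH in the container):
#   id=$(sudo docker ps | driver_container_id)
#   sudo docker exec "$id" nvidia-smi | driver_version
```

On hosts where `nvidia-smi` is available directly, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints the version without any parsing.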

2 Data path check

The data path is configured in services-configuration.yaml under cluster.common.data-path. The default value is /datastorage.

# SSH to the master machine

~$ ls /datastorage

hadooptmp  hdfs  launcherlogs  prometheus  yarn  zoodata
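The listing above can also be checked by script. The following is a minimal sketch, assuming the six directory names from the sample output are the ones expected on your cluster (the set may differ between OpenPAI versions). `check_data_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report which expected subdirectories are missing under the data path.
# Assumption: the expected names are copied from the sample `ls` output.
check_data_path() {
  data_path=${1:-/datastorage}
  missing=0
  for d in hadooptmp hdfs launcherlogs prometheus yarn zoodata; do
    if [ ! -d "$data_path/$d" ]; then
      echo "missing: $data_path/$d"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

Calling `check_data_path` with no argument checks the default /datastorage; pass another directory if cluster.common.data-path was changed.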

3 Admin Account in Webportal

Dashboard:

http://<master>:9286/virtual-clusters.html

Try to log in:

PAI_login

Note: The username and password are configured in the rest-server field of services-configuration.yaml.

4 Troubleshooting OpenPAI services

4.1 Diagnosing the problem

  • Monitor

From the Kubernetes web portal:

Dashboard:

http://<master>:9090

PAI_deploy_log

From OpenPAI watchdog:

OpenPAI watchdog

  • Log

From the Kubernetes web portal:

PAI_deploy_pod

From the container and pod log files on each node:

View container logs under:

ls /var/log/containers

View pod logs under:

ls /var/log/pods
  • Debug

Because OpenPAI services are deployed on Kubernetes, please refer to Debug Kubernetes Pods.
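For the Log step above, it can help to narrow the container log folder down to one service. This is a minimal sketch, assuming that the file names under /var/log/containers embed the pod name and that the pod name contains the service name (for example rest-server). `service_logs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list container log files whose name contains a service name.
# Assumption: log file names embed the pod name, which contains the
# service name. The second argument overrides the log folder.
service_logs() {
  service=$1
  log_dir=${2:-/var/log/containers}
  find "$log_dir" -maxdepth 1 -name "*${service}*.log" | sort
}
```

For example, `service_logs rest-server` prints the matching files, which can then be inspected with `sudo tail`.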

4.2 Fix problem

  • Update OpenPAI Configuration

Check and refine the four YAML configuration files:

    - layout.yaml
    - kubernetes-configuration.yaml
    - k8s-role-definition.yaml
    - services-configuration.yaml
  • Customize config for specific service

To customize a single service, you can find both the service config file and the image Dockerfile under src.

  • Update Code & Image

    • Customize image dockerfile or code

You can find each service's image Dockerfile under src and customize it.

  • Rebuild image

Run the following commands:

Build docker image

    paictl.py image build -p /path/to/configuration/ [ -n image-x ]

Push docker image

    paictl.py image push -p /path/to/configuration/ [ -n image-x ]

If the -n parameter is specified, only the given image, e.g. rest-server, webportal, watchdog, etc., will be built / pushed.
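Because an image has to be built before it is pushed, the two commands are usually run as a pair. The sketch below only prints the command lines so they can be reviewed first. `image_cmds` is a hypothetical helper, the paictl.py flags are the ones shown above, and the sketch assumes the configuration path and image name contain no spaces.

```shell
#!/bin/sh
# Sketch: print (do not run) the build and push commands, in that order,
# for one image or for all images.
# Arguments: $1 = configuration directory, $2 = optional image name.
image_cmds() {
  config_dir=$1
  image=${2:-}
  for action in build push; do
    if [ -n "$image" ]; then
      echo "paictl.py image $action -p $config_dir -n $image"
    else
      echo "paictl.py image $action -p $config_dir"
    fi
  done
}
```

After reviewing the output of `image_cmds /path/to/configuration/ rest-server`, one option is to pipe it to `sh -e`, which stops at the first failing command so the push is skipped if the build fails.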

4.3 Reboot service

  1. Stop a single service or all services.
python paictl.py service stop \
  [ -c /path/to/kubeconfig ] \
  [ -n service-list ]

If the -n parameter is specified, only the given services, e.g. rest-server, webportal, watchdog, etc., will be stopped. If not, all PAI services will be stopped.

  2. Boot up a single service or all OpenPAI services.

Please refer to this section for details.

5 Troubleshooting Kubernetes Clusters

Please refer to Kubernetes Troubleshoot Clusters.

6 Getting help

  • Stack Overflow: If you have questions about OpenPAI, please submit them on Stack Overflow under the tag openpai.
  • Report an issue: If you have an issue, a bug report, or a feature request, please submit it on GitHub.