# Data Parallel Deep Learning
Huihuo Zheng <huihuo.zheng@anl.gov>

Dated: 8/7/2020

**Please go to https://colab.research.google.com/ to run this notebook**

## Preparing environment

### Check TensorFlow and PyTorch installation

In [2]:
import tensorflow as tf
print("TensorFlow version: ", tf.__version__)
import torch
print("PyTorch version: ", torch.__version__)

ModuleNotFoundError: No module named 'tensorflow'

### Install Horovod

In [None]:
! pip install horovod torchvision --upgrade

Collecting horovod
  Downloading https://files.pythonhosted.org/packages/25/3a/289d100467ae33bce717daa3b285c72e0c82c761c5de37cc61940982c83c/horovod-0.19.5.tar.gz (2.9MB)
     |████████████████████████████████| 2.9MB 9.3MB/s 
Building wheels for collected packages: horovod
  Building wheel for horovod (setup.py) ... done
  Created wheel for horovod: filename=horovod-0.19.5-cp36-cp36m-linux_x86_64.whl size=16660342 sha256=33218189d76ed7d4203d71df21d0252d68a60e5a7f256c039d09771a42bc2512
  Stored in directory: /root/.cache/pip/wheels/c1/de/55/40364395c40c35292366a21572320a9b89029df9fb518b7668
Successfully built horovod
Installing collected packages: horovod
Successfully installed horovod-0.19.5


In [None]:
import horovod.tensorflow as hvd
hvd.init()
print(hvd.rank(), hvd.size())

0 1


In [None]:
import horovod.torch as hvd
hvd.init()
print(hvd.rank(), hvd.size())

0 1
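
The two numbers printed above are the worker's rank and the total number of workers. In data-parallel training these are typically used to give each worker its own slice of the dataset. The sketch below illustrates that idea in plain Python, with the rank and size passed in as ordinary integers so it runs without Horovod; `shard_indices` is a hypothetical helper, and the strided split is one common choice rather than what the example scripts necessarily do.

```python
def shard_indices(num_samples, rank, size):
    """Indices of the samples assigned to worker `rank` out of `size` workers."""
    # Strided split: worker r takes samples r, r + size, r + 2 * size, ...
    return list(range(rank, num_samples, size))


size = 4  # stands in for hvd.size()
shards = [shard_indices(10, rank, size) for rank in range(size)]
for rank, shard in enumerate(shards):
    print("rank", rank, "->", shard)
```

With a single process (rank 0, size 1, as in the output above) the one worker simply receives every sample.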


### Running with MPI

In [None]:
! which mpirun

/usr/bin/mpirun


In [None]:
! mpirun --allow-run-as-root -np 2 python -c "import horovod.tensorflow as hvd; hvd.init(); print('I\'m %s of %s' %(hvd.rank(), hvd.size()))"

2020-08-07 08:11:08.612373: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 08:11:08.612373: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
I'm 1 of 2
I'm 0 of 2
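
Once several ranks are running, data-parallel training usually has each rank compute gradients on its own shard and then average them across ranks with an allreduce. The sketch below simulates that step in a single process, assuming a one-parameter model `y = w * x` with a mean-squared-error loss; `local_gradient` and `allreduce_average` are hypothetical stand-ins written for this illustration, not Horovod functions.

```python
def local_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def allreduce_average(values):
    """Stand-in for an averaging allreduce: every worker receives the mean."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, size = 0.0, 2  # size plays the role of the number of MPI ranks

# Each simulated rank computes a gradient on its own equal-sized shard.
local_grads = [local_gradient(w, xs[r::size], ys[r::size]) for r in range(size)]
averaged = allreduce_average(local_grads)
full_batch = local_gradient(w, xs, ys)
print(averaged, full_batch)
```

Because the shards have equal size, the averaged gradient matches the gradient computed over the full batch, which is why the workers stay in sync when they all apply the same update.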


## Running Data Parallel MNIST examples 

In [None]:
! git clone https://github.com/argonne-lcf/ATPESC_MachineLearning.git

Cloning into 'ATPESC_MachineLearning'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 192 (delta 46), reused 63 (delta 22), pack-reused 89
Receiving objects: 100% (192/192), 82.38 MiB | 1.55 MiB/s, done.
Resolving deltas: 100% (61/61), done.


In [None]:
%cd ATPESC_MachineLearning/DataParallelDeepLearning/

/content/ATPESC_MachineLearning/DataParallelDeepLearning


In [None]:
! ls

keras_imagenet_resnet50.py	sumissions
keras_mnist.py			tensorflow2_keras_mnist.py
pytorch_imagenet_resnet50.py	tensorflow2_mnist.py
pytorch_mnist.py		tensorflow_mnist.py
pytorch_synthetic_benchmark.py	tensorflow_synthetic_benchmark.py
README.md


In [None]:
! mpirun --allow-run-as-root -np 4 python ./pytorch_mnist.py

2020-08-07 08:22:03.209023: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 08:22:03.271752: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 08:22:03.394057: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 08:22:03.438740: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 08:22:19.398905: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-07 08:22:19.398691: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-07 08:22:19.400296: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcud

In [None]:
! mpirun --allow-run-as-root -np 4 python ./tensorflow2_mnist.py

2020-08-07 07:57:02.401245: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 07:57:02.626818: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-07 07:57:06.215962: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-07 07:57:06.220007: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-07 07:57:06.227021: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-08-07 07:57:06.227077: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (d570d8526b8b): /proc/driver/nvidia/version does not exist
2020-08-07 07:57:06.228596: E tensorflow/stream_executor/cuda

In [None]:
! mpirun --allow-run-as-root -np 4 python ./tensorflow2_keras_mnist.py

python3: can't open file './tensorflow2_keras_mnist.py': [Errno 2] No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
python3: can't open file './tensorflow2_keras_mnist.py': [Errno 2] No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[221,1],1]
  Exit code:    2
--------------------------------------------------------------------------
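
The run above aborted because `python3` could not open the script from the current working directory, and every rank failed the same way. One way to catch this before any process is started is to check the path first. The sketch below only builds the command and does not launch it; `build_mpirun_command` is a hypothetical helper, and the flags are copied from the `mpirun` cells in this notebook.

```python
from pathlib import Path


def build_mpirun_command(script, nprocs):
    """Return the mpirun argument list, or raise if the script is missing."""
    path = Path(script)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found in {Path.cwd()}")
    # Same flags as the mpirun cells in this notebook.
    return ["mpirun", "--allow-run-as-root", "-np", str(nprocs), "python", str(path)]


try:
    print(build_mpirun_command("./tensorflow2_keras_mnist.py", 4))
except FileNotFoundError as err:
    print("Skipping launch:", err)
```

The returned list could be passed to `subprocess.run` once the path check succeeds.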
