# Ray with PyTorch Test Notebook

The purpose of this notebook is to confirm that Ray Train with PyTorch works in Open Data Hub (ODH).

This notebook is primarily an implementation of the PyTorch example from the Ray docs on [Ray Train](https://docs.ray.io/en/latest/train/train.html).

However, it has been modified to test that the Ray Train features for PyTorch work in an Open Data Hub environment. We have also increased the number of samples and epochs so that the speed-up from Ray's distribution can be seen clearly.

In [1]:
import os
import random
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from ray import train
import ray.train.torch
from ray.train import Trainer

from ray.util import connect as ray_connect
from ray.util import disconnect as ray_disconnect
from ray.util.client import ray as rayclient


## Setup
We're going to connect to the Ray cluster that was spun up for us as part of the [Ray notebook image](https://github.com/thoth-station/ray-ml-notebook) we selected through the ODH spawner page.

This cell should also run locally without a Ray cluster, since it only connects when the `RAY_CLUSTER` environment variable is set.

In [2]:
if os.environ.get('RAY_CLUSTER') is not None:
    if rayclient.is_connected():
        ray_disconnect()

    ray_connect('{ray_head}:10001'.format(ray_head=os.environ['RAY_CLUSTER']))
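
As a minimal sketch, the address handling in the cell above can be isolated into a helper that never touches Ray itself, which makes the local-versus-cluster behavior easy to check. The function name `ray_client_address` and its `env` parameter are hypothetical; the `RAY_CLUSTER` variable and port 10001 come from the cell above.

```python
import os


def ray_client_address(env=None, port=10001):
    """Return 'host:port' for the Ray head node, or None if RAY_CLUSTER is unset.

    `env` defaults to os.environ; passing a plain dict is only a convenience
    for trying the helper without changing the real environment.
    """
    env = os.environ if env is None else env
    head = env.get("RAY_CLUSTER")
    if head is None:
        # No cluster configured: the caller should skip the connect step.
        return None
    return f"{head}:{port}"


# Hypothetical host name, used purely for illustration.
print(ray_client_address({"RAY_CLUSTER": "ray-head.example"}))
print(ray_client_address({}))
```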

For this example we are interested in the speed of training, not the accuracy. So, let's make our lives easy and just generate a random dataset. We are going to construct a simple feed-forward neural network with an input feature size of 10 and an output size of 5, and we will construct our dummy dataset accordingly.

We also want to show the benefits of Ray, so we will create a somewhat large dataset with 200,000 examples.

In [15]:
num_samples = 200000
input_size = 10
output_size = 5

input = torch.randn(num_samples, input_size)
labels = torch.randn(num_samples, output_size)

Here we define our simple PyTorch neural network with a simple forward pass function.

In [4]:
layer_size = 15

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        y_pred = self.layer1(input)
        y_pred = self.relu(y_pred)
        y_pred = self.layer2(y_pred)
        
        return y_pred 



## Simple Train
Now we will define our non-Ray training function, which serves as a baseline for the timing comparisons later on.

In [5]:
def train_func():
    num_epochs = 500
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input)
        loss = loss_fn(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 100 == 0:
            print(f"epoch: {epoch}, loss: {loss.item()}")

Great! Now let's run our training function and see how long it takes without using any of Ray Train's distribution functionality.

In [16]:
%%time
train_func()

epoch: 0, loss: 1.0608739852905273
epoch: 100, loss: 1.008135437965393
epoch: 200, loss: 1.0043003559112549
epoch: 300, loss: 1.0030159950256348
epoch: 400, loss: 1.0023831129074097
CPU times: user 1min 21s, sys: 3.66 s, total: 1min 24s
Wall time: 18.1 s
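
`%%time` is an IPython cell magic, so it is only available inside a notebook. As a rough sketch of how the same wall-clock measurement could be taken in a plain Python script, the hypothetical helper `timed` below wraps any callable with `time.perf_counter`. The `sum` workload is only a stand-in; in this notebook you would pass `train_func` instead.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload for illustration only.
_, seconds = timed(sum, range(1_000_000))
print(f"Wall time: {seconds:.3f} s")
```

Note that this measures wall time only; the separate user and sys CPU times reported by `%%time` are not reproduced here.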



## Ray Train
Above we can see how long it took our non-distributed training function to iterate through all its epochs. 

Below we define our Ray Train distributed training function, which in this case requires only that we add the line: 
```
model = train.torch.prepare_model(model)
```

_Note:_ The initial training function was written knowing it would later be distributed with Ray. We make no claim that converting existing PyTorch code to Ray Train-compatible PyTorch is always as simple as adding a single line of code.

In [7]:
def train_func_distributed():
    num_epochs = 500
    model = NeuralNetwork()
    model = train.torch.prepare_model(model)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input)
        loss = loss_fn(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 100 == 0:
            print(f"epoch: {epoch}, loss: {loss.item()}")
    
    return model

Here we instantiate the Ray `Trainer`, which manages the backend we want (PyTorch, TensorFlow, or Horovod) and the number of workers to use. Below we use 4 workers if we are connected to a Ray cluster; otherwise we use just 1. Here we can also specify whether to use a GPU for training.

In [8]:
%%time
if os.environ.get('RAY_CLUSTER') is not None:
    num_workers = 4
else:
    num_workers = 1

trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=False)

2022-05-26 11:18:51,006	INFO trainer.py:223 -- Trainer logs will be logged in: /home/mcliffor/ray_results/train_2022-05-26_11-18-51


CPU times: user 151 ms, sys: 78.9 ms, total: 230 ms
Wall time: 3.74 s


Alright! Let's run our training function and see how long it takes with Ray Train's distribution functionality. Please note that starting the Trainer carries an overhead of around 4 seconds, so let's time that separately from the actual training function.

In [9]:
%%time
trainer.start()

CPU times: user 21.5 ms, sys: 8.64 ms, total: 30.1 ms
Wall time: 3.45 s


(BaseWorkerMixin pid=666185) 2022-05-26 11:18:54,587	INFO torch.py:335 -- Setting up process group for: env:// [rank=0, world_size=1]
(BaseWorkerMixin pid=666185) [W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:58269 (errno: 97 - Address family not supported by protocol).
(BaseWorkerMixin pid=666185) [W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.1.57]:58269 (errno: 97 - Address family not supported by protocol).
(BaseWorkerMixin pid=666185) [W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.1.57]:58269 (errno: 97 - Address family not supported by protocol).


In [10]:
%%time
results = trainer.run(train_func_distributed)

2022-05-26 11:18:54,657	INFO trainer.py:229 -- Run results will be logged in: /home/mcliffor/ray_results/train_2022-05-26_11-18-51/run_001
(BaseWorkerMixin pid=666185) 2022-05-26 11:18:54,739	INFO torch.py:92 -- Moving model to device: cpu


(BaseWorkerMixin pid=666185) epoch: 0, loss: 1.0563119649887085
(BaseWorkerMixin pid=666185) epoch: 100, loss: 1.0061821937561035
(BaseWorkerMixin pid=666185) epoch: 200, loss: 1.0025410652160645
(BaseWorkerMixin pid=666185) epoch: 300, loss: 1.0011948347091675
(BaseWorkerMixin pid=666185) epoch: 400, loss: 1.0004525184631348
CPU times: user 119 ms, sys: 27.8 ms, total: 146 ms
Wall time: 17.9 s


In [11]:
%%time
trainer.shutdown()

CPU times: user 5.41 ms, sys: 5.84 ms, total: 11.3 ms
Wall time: 952 ms


Times will vary depending on where you are running this notebook, the sample size you selected above, the number of epochs, and the number of workers. If everything worked correctly and you are using a distributed Ray cluster, the wall time for the `trainer.run()` call above should be significantly less than that of the non-distributed training run. In the output recorded here, the run used a single local worker (`world_size=1`), so the two wall times are nearly identical.
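
To make the comparison concrete, the two wall times can be reduced to a single ratio. The sketch below uses a hypothetical helper named `speedup`; the numbers are the wall times printed by the `%%time` cells in this recorded run, and you would substitute your own.

```python
def speedup(baseline_seconds, distributed_seconds):
    """Return how many times faster the distributed run was than the baseline."""
    if baseline_seconds <= 0 or distributed_seconds <= 0:
        raise ValueError("wall times must be positive")
    return baseline_seconds / distributed_seconds


# Wall times recorded above: train_func() and trainer.run(...).
ratio = speedup(18.1, 17.9)
print(f"speed-up: {ratio:.2f}x")
```

A ratio well above 1 indicates that the distributed run finished faster; a ratio near 1 means the two runs took about the same time.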

## Evaluate

We've trained a model! Now we need to make sure we can use it for inference. Below we'll perform two quick examples of using the trained model. First, we'll generate a brand new dataset the same way we did above and use MSE to evaluate the model's performance. Second, we'll perform inference on a single new input and print the result.

In both cases, since the data in this example is random, we don't really care what the MSE or inference values are; we are simply illustrating that we can perform inference with our newly trained model.

In [12]:
input = torch.randn(num_samples, input_size)
labels = torch.randn(num_samples, output_size)

model = results[0]
model.eval()
y_pred = model(input)

MSE = nn.MSELoss()
mse = MSE(labels,y_pred)
print(f'MSE for predictions on new data: {mse.data.numpy()}')

MSE for predictions on new data: 1.0010573863983154


In [13]:
input_sample = torch.randn(1, input_size)
model(input_sample).data.numpy()

array([[-0.06040192, -0.01437957, -0.07994152,  0.04162318, -0.00674879]],
      dtype=float32)

If you are reading this cell and there are no errors above, then you have successfully run a Ray Train PyTorch notebook in a distributed environment with Open Data Hub! Yeahh!! :)