<a href="https://colab.research.google.com/github/ShaswataJash/Ray/blob/master/Ray_progam_on_Google_Cloud_Platform.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**INTRODUCTION: This notebook demonstrates how to create a ray-cluster (https://docs.ray.io/en/latest/installation.html) on Google Cloud Platform (GCP) and then run a ray-program on it remotely from a Google Colab notebook.**

In [None]:
!whoami
!python -V

**Install the ray library. This notebook was tested against ray version 0.8.6.**

In [None]:
!pip install ray==0.8.6

**Run the following cell only during development on the local node. It creates a local ray-cluster with a single node, which acts as the head node of the cluster. Develop your ray-program against this local cluster. Once the ray-program works, save it as a Python (.py) file, which can later be submitted to a remote ray-cluster running on Google Cloud Platform (GCP) compute instances.**

In [None]:
!ray stop
!rm -Rf /tmp/ray
!ray start --head --port=6379 --object-manager-port=8076 --include-webui=True --webui-host=0.0.0.0 &
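
**Optional check, as a minimal sketch: the head node above is started with `--port=6379`, so a plain TCP connection attempt can confirm that something is listening there. The helper name `is_port_open` and the one-second timeout are assumptions made for illustration; they are not part of ray.**

```python
import socket


def is_port_open(host, port, timeout_seconds=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts connections on the given port.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


# 6379 is the port passed to `ray start --head` in the cell above.
print('Head node port reachable:', is_port_open('127.0.0.1', 6379))
```

**This only shows that some process accepts connections on the port; it does not verify that the process is ray.**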

**Create the following directory on the local node. The ray-program writes its final output to it. While developing the ray-program locally, write the output to this directory on the local node. The same directory path is mapped on the remote ray-cluster, so the ray-program does not need to change whether it runs locally or remotely.**

In [None]:
!mkdir -p /tmp/ray_local_directory_mount

**Note the `%%writefile` Jupyter cell magic used in the following cell. While developing the ray-program locally, comment it out as:**

`#%%writefile ray_test.py`

**Once the ray-program works on the local node, re-enable `%%writefile` so that the content of the following cell is written to a Python (.py) file, which can then be submitted to the remote ray-cluster.**


In [None]:
%%writefile ray_test.py
# Test program taken from https://towardsdatascience.com/how-to-scale-python-on-every-major-cloud-provider-5e5df3e88274
# Note: a cell magic such as %%writefile must be the first line of the cell.

from collections import Counter
import socket
import time
import ray

ray.shutdown()
ray.init(address='auto', ignore_reinit_error=True)

print('This cluster consists of {} nodes and {} CPU resources in total.'.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

with open('/tmp/ray_local_directory_mount/task_stat.txt', 'w') as writer:
    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        rowToWrite = '    {} tasks on {}\n'.format(num_tasks, ip_address)
        print(rowToWrite)
        writer.write(rowToWrite)

ray.shutdown()

**Run the following cell only during development on the local node. It shows the content written by the ray-program and then stops the local ray-cluster (a single-node cluster consisting of only the head node).**

In [None]:
!cat /tmp/ray_local_directory_mount/task_stat.txt
!ray stop -v

**First, install the google-api-python-client and cryptography libraries, which ray needs to set up the remote cluster on GCP.**

In [None]:
!pip install google-api-python-client==1.7.8
!pip install cryptography==2.9.2 #needed internally by ray up

**The following YAML file is the overall configuration that defines the topology of the remote ray-cluster. Note the `{{GCP_PROJECT_ID}}` placeholder, which is later replaced by the GCP project id extracted from your service-account key file (in JSON form). The total number of CPUs you can allocate is limited by the quota of your GCP account. For example, a free-trial account may not be able to allocate more than 8 CPUs. The configuration below assumes that billing is enabled in your GCP account. In a billable account, GCP allows 24 CPUs per region by default. Given that, the following configuration creates 1 head node with 4 CPUs and 5 additional worker nodes with 4 CPUs each.**

In [None]:
%%writefile ray_gcp_cluster.yaml
# Refer: https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp
# Note: a cell magic such as %%writefile must be the first line of the cell.

# A unique identifier for the head node and workers of this cluster.
cluster_name: "ray-exp1"

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 5

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 5

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 5

# Whether or not to autoscale aggressively. If this is enabled, if at any point
#   we would start more workers, we start at least enough to bring us to
#   initial_workers.
autoscaling_mode: aggressive

# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
#target_utilization_fraction: 0.05

# If a node is idle for this many minutes, it will be removed.
#idle_timeout_minutes: 1

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: {{GCP_PROJECT_ID}} # Globally unique project id; replaced later with your own GCP project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray_user
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
head_node:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images (Check 'selfLink' in 'Equivalent REST response')
          sourceImage: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20200729

    # Additional options can be found in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

    # If the network interface is specified as below in both head and worker
    # nodes, the manual network config is used.  Otherwise an existing subnet is
    # used.  To use a shared subnet, ask the subnet owner to grant permission
    # for 'compute.subnetworks.use' to the ray autoscaler account...
    # networkInterfaces:
    #   - kind: compute#networkInterface
    #     subnetwork: path/to/subnet
    #     aliasIpRanges: []

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images (Check 'selfLink' in 'Equivalent REST response')
          sourceImage: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20200729
    # Run workers on preemptible instances by default.
    # Comment this out to use non-preemptible ones. Preemptible instances are cheaper, but GCP can shut them down abruptly.
    # For that reason, never run the head node on a preemptible VM instance.
    scheduling:
      - preemptible: true

    # Additional options can be found in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
   "/tmp/ray_local_directory_mount": "/tmp/ray_local_directory_mount",
   "~/gcp_service_account.json": "/content/gcp_service_account.json"
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    #disable ipv6 (https://itsfoss.com/disable-ipv6-ubuntu-linux/)
    - >-
      sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1;
      sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1;
      sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
    #firewall is not enabled by default in ubuntu (list out iptable rules to verify that)
    - sudo iptables -L
    #setup netstat command
    - sudo apt -y install net-tools
    #link 'python' command to default python3
    - sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.8 1;
    #ensure apt automatically retry installation if internet disconnects (https://askubuntu.com/questions/875213/apt-get-to-retry-downloading)
    - sudo touch /etc/apt/apt.conf.d/80-retries
    - echo "APT::Acquire::Retries \"10\";" >> /tmp/80-retries
    - sudo cp /tmp/80-retries /etc/apt/apt.conf.d/80-retries
    - sudo apt update
    #install pip3
    - sudo apt -y install python3-pip

# List of shell commands to run to set up nodes.
setup_commands:
    # Install ray
    - sudo pip3 install ray==0.8.6

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - sudo pip3 install google-api-python-client==1.7.8
    - sudo pip3 install --upgrade --force-reinstall cryptography==2.9.2
    - export GOOGLE_APPLICATION_CREDENTIALS=~/gcp_service_account.json

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start --head --port=6379 --object-manager-port=8076 --include-webui=True --webui-host=0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
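
**The CPU budget behind this configuration (1 head node plus 5 workers, each an n1-standard-4 with 4 vCPUs, against a default regional quota of 24 CPUs) can be checked with simple arithmetic. The following is a minimal sketch; the function names are hypothetical and the quota values are the ones quoted in the text above, not values read from your GCP account.**

```python
def total_cluster_cpus(cpus_per_node, worker_count, head_count=1):
    """Total vCPUs requested by the head node(s) plus all worker nodes."""
    return cpus_per_node * (head_count + worker_count)


def fits_quota(cpus_per_node, worker_count, regional_cpu_quota):
    """True if the whole cluster fits into the regional CPU quota."""
    return total_cluster_cpus(cpus_per_node, worker_count) <= regional_cpu_quota


# Values from the configuration above: n1-standard-4 has 4 vCPUs, max_workers is 5.
print(total_cluster_cpus(cpus_per_node=4, worker_count=5))
# Billable account with the default quota of 24 CPUs per region.
print(fits_quota(cpus_per_node=4, worker_count=5, regional_cpu_quota=24))
# Free-trial account limited to 8 CPUs (the example quoted above).
print(fits_quota(cpus_per_node=4, worker_count=5, regional_cpu_quota=8))
```

**With an 8-CPU limit, the same machine type leaves room for only one worker next to the head node, so `min_workers`, `max_workers` and `initial_workers` would need to be lowered accordingly.**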


In [None]:
# Refer: https://cloud.google.com/iam/docs/service-accounts (see the 'User-managed service accounts' section)
# Create a new user-managed service account for the ray experiment from 'IAM & Admin' > 'Service Accounts'.
# Give it only the 'Editor' role for the GCP project (it is good practice not to grant 'Owner' access to the project).
# Refer: https://cloud.google.com/iam/docs/creating-managing-service-account-keys for how to generate the service-account JSON key file.
# Rename that file to 'gcp_service_account.json' and upload it to Google Colab using the 'Upload to session storage' icon.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/gcp_service_account.json"

In [None]:
import json

with open(os.environ["GOOGLE_APPLICATION_CREDENTIALS"], 'r') as gcp_service_acc_file:
    data = json.load(gcp_service_acc_file)
    with open('ray_gcp_cluster.yaml', 'r') as ray_cluster_config_file:
        config = ray_cluster_config_file.read().replace('{{GCP_PROJECT_ID}}', data['project_id'])
    with open('ray_gcp_cluster.yaml', 'w') as ray_cluster_config_file:
        ray_cluster_config_file.write(config)
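
**The placeholder substitution above silently does nothing if the key file lacks a `project_id` or if the placeholder was already replaced by an earlier run. A stricter variant could look like the sketch below. The function name `render_cluster_config` and the sample values are hypothetical; only the `{{GCP_PROJECT_ID}}` placeholder and the `project_id` field come from this notebook.**

```python
import json

PLACEHOLDER = '{{GCP_PROJECT_ID}}'


def render_cluster_config(template_text, key_data):
    """Replace the project-id placeholder, failing loudly on bad input."""
    project_id = key_data.get('project_id')
    if not project_id:
        raise ValueError("service-account key has no 'project_id' field")
    if PLACEHOLDER not in template_text:
        raise ValueError('placeholder {} not found in config'.format(PLACEHOLDER))
    return template_text.replace(PLACEHOLDER, project_id)


# Hypothetical stand-ins for the key file and the YAML written earlier.
example_key = json.loads('{"type": "service_account", "project_id": "my-demo-project"}')
example_template = 'provider:\n    type: gcp\n    project_id: {{GCP_PROJECT_ID}}\n'
print(render_cluster_config(example_template, example_key))
```

**Raising on a missing placeholder is a design choice: it makes an accidental second run visible, at the cost of having to re-run the `%%writefile ray_gcp_cluster.yaml` cell before substituting again.**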

In [None]:
!rm -rf /tmp/*
!mkdir -p /tmp/ray_local_directory_mount
!echo "Content of /tmp"
!ls -la /tmp
!rm -rf /root/.ssh/*
!mkdir -p /root/.ssh
!echo "Content of /root"
!ls -la /root
!echo "Content of /root/.ssh"
!ls -la /root/.ssh

In [None]:
!ray up ray_gcp_cluster.yaml

In [None]:
!ray exec ray_gcp_cluster.yaml 'sudo chmod -R ugo+rwx /tmp/ray/'

In [None]:
!ray exec ray_gcp_cluster.yaml 'sudo netstat -tulpn | grep LISTEN'

In [None]:
!ray submit ray_gcp_cluster.yaml ray_test.py

In [None]:
!ray rsync-down ray_gcp_cluster.yaml 

In [None]:
!cat /tmp/ray_local_directory_mount/task_stat.txt
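
**To see how evenly the tasks were spread over the nodes, the downloaded file can be parsed back into numbers. This is a minimal sketch that assumes the row format written by `ray_test.py` (`<count> tasks on <ip>`); the helper name `parse_task_stats` and the sample IP addresses and counts are made up for illustration and are not actual results.**

```python
import re

# Matches rows such as '    6000 tasks on 10.128.0.2'.
ROW_PATTERN = re.compile(r'^\s*(\d+) tasks on (\S+)\s*$')


def parse_task_stats(text):
    """Return a dict mapping each IP address to its number of executed tasks."""
    counts = {}
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


# Made-up sample; for real data, read /tmp/ray_local_directory_mount/task_stat.txt.
sample = '    6000 tasks on 10.128.0.2\n    4000 tasks on 10.128.0.3\n'
counts = parse_task_stats(sample)
total = sum(counts.values())
for ip_address, num_tasks in counts.items():
    print('{}: {} tasks ({:.1f}%)'.format(ip_address, num_tasks, 100.0 * num_tasks / total))
```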

In [None]:
!ray down ray_gcp_cluster.yaml