<h1>Distributed vs Non-Distributed Benchmark</h1>

This notebook benchmarks the processing time of the same task on a single-user instance using NumPy and/or PyTorch against a Dask worker cluster.

<h2>Dask Gateway</h2>
Dask Gateway provides a secure, multi-tenant server for managing Dask clusters. It allows users to launch and use Dask clusters in a shared, centrally managed cluster environment, without requiring users to have direct access to the underlying cluster backend (e.g. Kubernetes, Hadoop/YARN, HPC Job queues, etc…).

Dask Gateway is one of many options for deploying Dask clusters; see "Deploying Dask" in the Dask documentation for an overview of the other options.

<h3>Highlights</h3>

* Centrally Managed: Administrators do the heavy lifting of configuring the Gateway, users simply connect to the Gateway to get a new cluster. Eases deployment, and allows enforcing consistent configuration across all users.

* Secure by Default: Cluster communication is automatically encrypted with TLS. All operations are authenticated with a configurable protocol, allowing you to use what makes sense for your organization.

* Flexible: The gateway is designed to support multiple backends, and runs equally well in the cloud as on-premise. Natively supports Kubernetes, Hadoop/YARN, and HPC Job Queueing systems.

* Robust to Failure: The gateway can be restarted or experience failover without losing existing clusters. Allows for seamless upgrades and restarts without disrupting users.

<h3>Architecture Overview</h3>
Dask Gateway is divided into three separate components:

* Multiple active Dask clusters (potentially more than one per user)

* A Proxy that proxies both the connection between the user’s client and their respective scheduler, and the Dask Web UI for each cluster

* A central Gateway that manages authentication and cluster startup/shutdown



<h2>NumPy</h2>
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Documentation for NumPy: https://numpy.org/doc/stable/

<h2>PyTorch</h2>
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. In this benchmark we use PyTorch on the CPU only.

Documentation for PyTorch: https://pytorch.org/docs/stable/index.html

<h2>Testing</h2>

In [1]:
# import the Dask Gateway client, time, and torch, then connect to the gateway
from dask_gateway import Gateway
import time
import torch
gateway = Gateway("http://<ip-address-for-dask-gateway-loadBalancer>", auth="jupyterhub")
options = gateway.cluster_options()

In [2]:
# create new cluster in dask
cluster = gateway.new_cluster(options)

In [3]:
# let the cluster scale adaptively between 2 and 10 workers
cluster.adapt(minimum=2, maximum=10)
# display the gateway cluster information; open the dashboard link it shows for more details about your cluster
cluster

<h4>Client for the cluster. Its dashboard shows the running tasks in real time.</h4>

In [4]:
from dask.distributed import Client
client = Client(cluster)
client 


+-------------+----------------+----------------+----------------+
| Package     | Client         | Scheduler      | Workers        |
+-------------+----------------+----------------+----------------+
| dask        | 2024.4.1       | 2024.1.0       | 2024.1.0       |
| distributed | 2024.4.1       | 2024.1.0       | 2024.1.0       |
| lz4         | 4.3.3          | None           | None           |
| numpy       | 1.26.4         | 1.26.3         | 1.26.3         |
| pandas      | 2.2.2          | 2.1.4          | 2.1.4          |
| python      | 3.11.8.final.0 | 3.11.7.final.0 | 3.11.7.final.0 |
| toolz       | 0.12.1         | 0.12.0         | 0.12.0         |
+-------------+----------------+----------------+----------------+


Connection method: Cluster object
Cluster type: dask_gateway.GatewayCluster
Dashboard: http://10.106.233.41/clusters/opal.96ae9e7b29ea475ab566118fba522856/status


2024-04-11 20:43:17,492 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
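The table above reports different package versions on the client than on the scheduler and workers. The sketch below shows one way to list such differences from plain dictionaries. The version strings are copied from the table; the helper name `find_mismatches` is hypothetical and not part of Dask. Whether these differences contributed to the reconnect error is not established here.

```python
# Version strings copied from the table above (scheduler and workers report the same values).
client_versions = {
    "dask": "2024.4.1",
    "distributed": "2024.4.1",
    "lz4": "4.3.3",
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "python": "3.11.8.final.0",
    "toolz": "0.12.1",
}
scheduler_versions = {
    "dask": "2024.1.0",
    "distributed": "2024.1.0",
    "lz4": None,
    "numpy": "1.26.3",
    "pandas": "2.1.4",
    "python": "3.11.7.final.0",
    "toolz": "0.12.0",
}


def find_mismatches(local, remote):
    """Return {package: (local_version, remote_version)} for every package that differs.

    Hypothetical helper for illustration; it only compares two dictionaries.
    """
    return {
        name: (local.get(name), remote.get(name))
        for name in sorted(set(local) | set(remote))
        if local.get(name) != remote.get(name)
    }


mismatches = find_mismatches(client_versions, scheduler_versions)
for name, (local, remote) in mismatches.items():
    print(f"{name}: client={local} scheduler/workers={remote}")
```

On a live cluster, `client.get_versions(check=True)` from `distributed` performs a comparable check against the running scheduler and workers.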


<h3>Here is a test that creates a random array in Dask, NumPy, and PyTorch and graphically compares the runtime of each</h3>

<h4>Dask cluster random array mean and sum</h4>

In [None]:
%%time
import dask.array as da

# Create a 15000 x 15000 Dask array of random values (the shape must be passed as a tuple)
x = da.random.random((15000, 15000))

# Perform an operation on the Dask array
#y = x + x

# Submit the sum and mean to the cluster, then gather the results
s = client.compute(x.sum())
m = client.compute(x.mean())
p = client.gather(s)
me = client.gather(m)
print(float(p))
print(float(me))
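A Dask array is split into chunks, so a reduction such as a sum or mean can be computed per chunk and then combined. The pure-Python sketch below illustrates that idea only; it is not Dask's implementation, and the function name `chunked_sum_and_mean` and the chunk size are assumptions made for the example.

```python
import random


def chunked_sum_and_mean(values, chunk_size):
    """Reduce `values` chunk by chunk, then combine the partial results."""
    partials = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        # Each chunk yields a (partial sum, element count) pair; on a cluster
        # these pairs could be produced by different workers.
        partials.append((sum(chunk), len(chunk)))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total, total / count


random.seed(0)
data = [random.random() for _ in range(10_000)]
total, mean = chunked_sum_and_mean(data, chunk_size=1_000)
print(total, mean)
```

Carrying the element count alongside each partial sum is what lets the mean be combined correctly when the last chunk is smaller than the others.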

<h4>Pytorch random array mean and sum</h4>

In [None]:
%%time
# Create a 15000 x 15000 random tensor on the CPU
cpu_a = torch.rand(15000,15000)
result1 = torch.sum(cpu_a)
mean1 = torch.mean(cpu_a)
print(result1)
print(mean1)

<h4>Numpy random array mean and sum</h4>

In [None]:
%%time
import numpy as np

# Create a 15000 x 15000 array of random values
arr = np.random.rand(15000, 15000)

# Calculate the sum and mean over all elements
result = np.sum(arr)
mean = np.mean(arr)

# Print the results
print(result)
print(mean)
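The cells above each report their own `%%time` output. To collect the timings in one place and compare them side by side, a small harness along the following lines could be used. This is a sketch under assumptions: `time_call` and `text_bar_chart` are hypothetical helpers, and the two workloads are stand-ins so the example runs without a cluster.

```python
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def text_bar_chart(timings, width=40):
    """Render {label: seconds} as lines of '#' bars scaled to the slowest entry."""
    slowest = max(timings.values())
    lines = []
    for label, seconds in timings.items():
        # Guard against a zero-length run so the scaling never divides by zero.
        length = max(1, round(width * seconds / slowest)) if slowest > 0 else 1
        lines.append(f"{label:<10} {'#' * length} {seconds:.4f}s")
    return lines


# Stand-in workloads (assumptions); replace them with callables that wrap
# the Dask, PyTorch, and NumPy computations from the cells above.
timings = {
    "sum": time_call(lambda: sum(range(200_000))),
    "squares": time_call(lambda: [i * i for i in range(200_000)]),
}
for line in text_bar_chart(timings):
    print(line)
```

Taking the fastest of several runs reduces the effect of one-off delays such as worker startup, though for the Dask case the first run may still include scale-up time from `cluster.adapt`.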

<h3>Always close down your cluster</h3>

In [10]:
# make sure to shut down the cluster
cluster.shutdown()
print('Cluster is shutdown')

Cluster is shutdown
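If a benchmark cell raises an error, the shutdown cell may never run. One way to guard against that is a `try`/`finally` wrapper, sketched below. `StubCluster` is a hypothetical stand-in that only records the `shutdown()` call so the example runs without a gateway; in the notebook the real `cluster` object would take its place.

```python
class StubCluster:
    """Hypothetical stand-in for the gateway cluster; only records shutdown calls."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


def run_with_cleanup(cluster, work):
    """Run work(cluster) and shut the cluster down even if work raises."""
    try:
        return work(cluster)
    finally:
        cluster.shutdown()


def failing_work(cluster):
    # Simulates a benchmark step that fails partway through.
    raise RuntimeError("benchmark failed")


cluster_stub = StubCluster()
try:
    run_with_cleanup(cluster_stub, failing_work)
except RuntimeError as exc:
    print(f"work raised: {exc}")
print("shut down:", cluster_stub.is_shut_down)
```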
