# **Cluster and Queue Processing in Python**


A cluster is a set of computers configured to work together on a common task. With a cluster we can access a very large pool of resources: many CPUs and large amounts of RAM and storage space.

One of the major benefits of clustering is that it lets us scale resources to match our requirements.
The ability to add more machines also improves reliability, since work can continue when one machine fails.
On the other hand, clusters are costly to maintain, and programming for a cluster takes additional effort.

> BTW read about Netflix's `Chaos Monkey` tool, which randomly terminates running instances (and the processes on them) to check resiliency. Very cool!
> Also read about `Circus` and `supervisord`, Python process managers that help start cluster components reliably.

Python has a few tools available for clustering. Some of them are explained below.

## IPython Parallel

This package (`ipyparallel`) provides interfaces to local and remote processing engines, so data and jobs can be shared among remote machines. (Its engines extend the same IPython kernel that Jupyter notebooks use.) Configuring an IPython cluster properly requires configuration changes (for security, MPI, etc.), so get some experience with it before deploying a cluster into a critical environment.

<center><image src="./img/26.png" width="250"/></center>

The project consists of the four major components below.

* Engine: An extension of the IPython kernel. An engine blocks while it runs user code, so it behaves as a synchronous interpreter. With multiple engines available, distributed/parallel computing becomes possible.
* Controller: The controller processes provide an interface for working with a set of engines. The controller is responsible for distributing work and hosts the load-balanced work scheduler.
* Hub: The central component of an IPython cluster. It keeps track of engine connections, schedulers, and clients, as well as all task requests and results.
* Schedulers: Every action performed on an engine goes through a scheduler. Schedulers hide the blocking nature of the engines and provide an asynchronous interface to users.

Beyond these, we need a Client to connect to a cluster. For each execution model there is a corresponding View. Views let users (through the Client) interact with a set of engines.
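To make the "blocking engines, asynchronous interface" idea concrete, here is a minimal sketch that uses only the standard library's `concurrent.futures`. It is an analogy, not `ipyparallel` itself: the thread pool stands in for the controller and schedulers, each worker thread stands in for an engine, and the names `blocking_task` and `run_async` are made up for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def blocking_task(x):
    """Stand-in for user code on one engine: the worker is busy until it returns."""
    time.sleep(0.05)
    return x * x


def run_async(inputs, n_workers=4):
    """Submit every input at once and gather results as workers finish."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # submit() returns a Future immediately, even though each worker blocks
        futures = {pool.submit(blocking_task, x): x for x in inputs}
        # as_completed() yields futures in completion order, not submission order
        return {futures[f]: f.result() for f in as_completed(futures)}


results = run_async(range(8))
print(sorted(results.items()))
```

The caller never waits on a single worker directly. It holds futures and collects them when they are ready, which is roughly the role a View and its async results play for a set of engines.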

Below is a simple usage of IPython Parallel on a local machine.

You can read the documentation [here](https://ipyparallel.readthedocs.io/en/latest/tutorial/index.html) for more details.

In [5]:
import ipyparallel as ipp
cluster = ipp.Cluster(n=4)
await cluster.start_cluster()

Starting 4 engines with <class 'ipyparallel.cluster.launcher.LocalEngineSetLauncher'>


<Cluster(cluster_id='1655642958-max5', profile='default', controller=<running>, engine_sets=['1655642959'])>

In [6]:
rc = cluster.connect_client_sync()
rc.wait_for_engines(n=4)
rc.ids

[0, 1, 2, 3]

In [10]:
rc[:].apply_sync(lambda: ("Hello, World"))

['Hello, World', 'Hello, World', 'Hello, World', 'Hello, World']

In [13]:
cluster.stop_cluster_sync()

Stopping controller
Stopping engine(s): 1655642959
Controller stopped: {'exit_code': 1, 'pid': 19472, 'identifier': 'ipcontroller-1655642958-max5-17084'}
engine set stopped 1655642959: {'engines': {'0': {'exit_code': 1, 'pid': 8268, 'identifier': '0'}, '1': {'exit_code': 1, 'pid': 19620, 'identifier': '1'}, '2': {'exit_code': 1, 'pid': 6384, 'identifier': '2'}, '3': {'exit_code': 1, 'pid': 8732, 'identifier': '3'}}, 'exit_code': 1}


The IPython Parallel module is interesting and easy to use. We can execute the functions we created earlier without much trouble and get the results.

In [14]:
import time
import ipyparallel as ipp
from ipyparallel import require
import random

@require('random')
def calc_point_inside_circle(num_of_estimates):
    
    print(f"Executing calc_point_inside_circle with {num_of_estimates:,}")

    trials_inside_circle = 0

    for step in range(int(num_of_estimates)):
        x = random.uniform(0,1)
        y = random.uniform(0,1)

        is_inside_circle = 1 if (x**2 + y**2) <= 1 else 0
        trials_inside_circle += is_inside_circle

    return trials_inside_circle


if __name__ == "__main__":
    total_trials = 1e8
    num_workers = 4
    trials_per_worker = total_trials/num_workers
    trials_per_processes = [trials_per_worker]*num_workers

    with ipp.Cluster() as rc:

        print(f'Cluster Initialized with {len(rc.ids)} engines')
        view = rc.load_balanced_view()

        # submit the tasks
        start_time = time.time()
        asyncresult = view.map_async(calc_point_inside_circle, trials_per_processes)
        asyncresult.wait_interactive()
        result = asyncresult.get()

        print(f"Estimates of {result} made with {len(result)} engines.")

        pi_estimate = sum(result) * 4 / total_trials
        print("Estimated pi", pi_estimate)
        print(f"Time consumed: {time.time()-start_time}")



Starting 12 engines with <class 'ipyparallel.cluster.launcher.LocalEngineSetLauncher'>
100%|██████████| 12/12 [00:05<00:00,  2.12engine/s]
Cluster Initialized with 12 engines
calc_point_inside_circle: 100%|██████████| 4/4 [00:22<00:00,  5.61s/tasks]
Estimates of [19633519, 19631933, 19634736, 19634490] made with 4 engines.
Estimated pi 3.14138712
Time consumed: 22.4538357257843
Stopping engine(s): 1655644321
engine set stopped 1655644321: {'engines': {'0': {'exit_code': 1, 'pid': 17644, 'identifier': '0'}, '1': {'exit_code': 1, 'pid': 2692, 'identifier': '1'}, '2': {'exit_code': 1, 'pid': 17504, 'identifier': '2'}, '3': {'exit_code': 1, 'pid': 18236, 'identifier': '3'}, '4': {'exit_code': 1, 'pid': 6688, 'identifier': '4'}, '5': {'exit_code': 1, 'pid': 6120, 'identifier': '5'}, '6': {'exit_code': 1, 'pid': 23264, 'identifier': '6'}, '7': {'exit_code': 1, 'pid': 16880, 'identifier': '7'}, '8': {'exit_code': 1, 'pid': 19876, 'identifier': '8'}, '9': {'exit_code': 1, 'pid': 19096, 'identi

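Assuming the cluster is initialized properly across multiple remote machines, execution is super easy (though here we don't share any data between processes and they don't communicate. XD). Note that `ipp.Cluster()` was called without `n`, so 12 engines were started on this machine while only 4 tasks were submitted. The "made with 4 engines" message in the output therefore counts tasks, not engines.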
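The factor of 4 in `pi_estimate` comes from the geometry of the sampling. Points are drawn uniformly from the unit square, and the quarter of the unit circle inside that square has area $\pi/4$. Writing $N$ for the total number of trials and $N_{\text{in}}$ for the number of points that land inside the circle (this notation is introduced here for illustration and does not appear in the code):

$$
\frac{N_{\text{in}}}{N} \approx \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx \frac{4\,N_{\text{in}}}{N}
$$

For the run shown in the output, $N_{\text{in}} = 19633519 + 19631933 + 19634736 + 19634490 = 78534678$ and $N = 10^{8}$, so

$$
\pi \approx \frac{4 \times 78534678}{10^{8}} = 3.14138712
$$

which matches the printed estimate.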

## Pandas and Dask

`Dask` can be considered a lite version of Spark. If you don't need replicated writes or multi-machine fault tolerance, and don't want to support a storage environment like Hadoop, Dask is the way to go for processing larger-than-RAM datasets.

Just like Spark, Dask builds up a task graph before executing the actual transformation/process (lazy evaluation). I am not going to test that here, but the package is worth noting.
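To illustrate what "build a task graph first, execute later" means, here is a minimal pure-Python sketch. It is not Dask and does not use Dask's API. It only borrows the general idea of describing work as a dictionary of tasks, and the helper names `build_graph` and `compute` are hypothetical.

```python
from operator import add, mul


def build_graph():
    """Describe the work without running it.

    Each entry is either a literal value or a tuple (function, *argument keys).
    """
    return {
        "a": 2,
        "b": 3,
        "sum": (add, "a", "b"),         # a + b
        "result": (mul, "sum", "sum"),  # (a + b) * (a + b)
    }


def compute(graph, key, cache=None):
    """Evaluate only what `key` depends on, reusing already computed nodes."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *deps = node
        value = func(*(compute(graph, dep, cache) for dep in deps))
    else:
        value = node
    cache[key] = value
    return value


graph = build_graph()             # nothing has been calculated yet
answer = compute(graph, "result")  # work happens only here
print(answer)  # 25
```

Because the whole graph is known before anything runs, a scheduler is free to decide the execution order, skip unused nodes, and run independent nodes in parallel.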


### Swifter

Swifter is a package built on top of Dask that adds parallelization to your pandas operations. It first tries to vectorize the function you apply to your dataframe. If that works it uses the vectorized form, otherwise it falls back to executing with Dask.
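The "try vectorized first, fall back otherwise" decision can be sketched in plain Python. This is a toy illustration under stated assumptions, not Swifter's implementation: `Vec` is a hypothetical stand-in for a vectorized column, `smart_apply` is a made-up helper, and the fallback here is a simple element-by-element loop rather than a Dask computation.

```python
class Vec:
    """Tiny stand-in for a vectorized column (hypothetical, not a pandas Series)."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        return Vec(v + other for v in self.values)

    def __mul__(self, other):
        return Vec(v * other for v in self.values)


def smart_apply(func, column):
    """Try one whole-column call first; fall back to element-by-element on failure."""
    try:
        out = func(column)
        if isinstance(out, Vec) and len(out.values) == len(column.values):
            return out.values, "vectorized"
    except (TypeError, ValueError):
        pass  # the function cannot operate on the whole column at once
    return [func(v) for v in column.values], "elementwise"


col = Vec([1, 2, 3, 4])

# Pure arithmetic works on the whole column in one call
fast = smart_apply(lambda x: x * 2 + 1, col)

# A branch on each value cannot be evaluated for a whole column, so it falls back
slow = smart_apply(lambda x: x * 2 if x > 2 else 0, col)

print(fast)
print(slow)
```

The first function succeeds on the whole column, while the second raises inside the vectorized attempt and is quietly rerun one element at a time.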

> There's a newer library named `Vaex` which provides a pandas-like interface for handling larger-than-RAM datasets. Apparently it is especially good at manipulating strings, so it is worth a shot!

## NSQ library for queues and pub/sub

NSQ is a highly performant, distributed, and robust messaging platform. It exposes an HTTP API that can be called from any language, which makes it language agnostic. The important thing about NSQ is that it provides fundamental guarantees about message delivery: a message is delivered at least once. It offers two simple, well-known design patterns for that: queues and pub/sub.
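The difference between the two patterns can be shown with a small in-memory sketch. It is not an NSQ client and does not talk to a real NSQ server. `ToyTopic` and `make_consumer` are hypothetical names, the strict round-robin delivery is a simplification, and delivery guarantees such as retries are not modelled. What it does reflect is NSQ's topic/channel idea: every channel on a topic receives its own copy of each message (pub/sub), and within one channel each message goes to only one consumer (queue).

```python
from collections import defaultdict, deque


class ToyTopic:
    """In-memory toy: one topic, many channels, many consumers per channel."""

    def __init__(self):
        self.channels = {}                  # channel name -> pending messages
        self.consumers = defaultdict(list)  # channel name -> consumer callbacks

    def subscribe(self, channel, consumer):
        self.channels.setdefault(channel, deque())
        self.consumers[channel].append(consumer)

    def publish(self, message):
        # pub/sub: every channel gets its own copy of the message
        for pending in self.channels.values():
            pending.append(message)

    def deliver(self):
        # queue: inside a channel, each message goes to exactly one consumer
        for name, pending in self.channels.items():
            consumers = self.consumers[name]
            turn = 0
            while pending:
                consumers[turn % len(consumers)](pending.popleft())
                turn += 1


received = defaultdict(list)


def make_consumer(label):
    """Build a consumer that records what it receives under `label`."""
    return lambda message: received[label].append(message)


topic = ToyTopic()
topic.subscribe("metrics", make_consumer("metrics-1"))
topic.subscribe("archive", make_consumer("archive-1"))
topic.subscribe("archive", make_consumer("archive-2"))

for n in range(4):
    topic.publish(f"msg-{n}")
topic.deliver()

print(dict(received))
```

The `metrics` channel sees all four messages, while the two consumers sharing the `archive` channel split the same four messages between them.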
