# Phys 581 Winter 2019
# Report #1: Cluster computing
## Alexander Hickey, 10169582

Note that the contents of this notebook were created and tested on an Ubuntu 18.04.2 machine, using Python 2.7.16. This machine is connected to the physics junior lab network, which is the network used to construct the computing cluster. The parallel python library used in this notebook is not supported in Python 3+.

In [2]:
import sys
sys.version

'2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 15:42:17) [MSC v.1500 64 bit (AMD64)]'

### Introduction

As explored in Assignment 5, the multiprocessing module offers a convenient means of parallelizing the execution of a function across multiple input values. This allows one to side-step the notorious Python global interpreter lock by spawning subprocesses rather than threads. The simplest, and arguably most useful, feature of the multiprocessing package is the Pool object, which runs multiple independent computations in parallel. This is ideal when one is performing repetitive computations that are independent of one another, for example, applying a function to each element of a large list of values.

One of the downsides of any simple implementation using the multiprocessing module is that the achievable speedup is limited by the number of physical cores present on the device. Often, however, one has access to many computers that are part of the same network. It is therefore of interest to be able to run tasks in parallel across an entire network of CPUs, rather than just on a single device. This technique of getting multiple computers across a network to work together is known as cluster computing.

This report will explore the construction and efficiency of a computing cluster, using the Parallel Python package. In particular, I will use this framework to construct a cluster with the maximum number of CPUs available over the network in the physics junior labs.

### Construction of the cluster

The first step to constructing the computing cluster is to generate a list of all of the valid IP addresses in the junior labs. This can be obtained readily by attempting to establish an ssh connection using the paramiko library in Python. Both the parallel python and paramiko libraries can be installed in anaconda using:
    
    conda install -c geneko pp
    conda install -c anaconda paramiko

Additionally, the ability to ssh using the paramiko library will allow us to set up the parallel python server on each machine in the network, which effectively adds the machine as a node to the computing cluster.

In [6]:
#Import useful libraries
import pp, paramiko
import numpy as np
import matplotlib.pyplot as plt
import time
import getpass
%matplotlib inline

In order to establish an ssh connection, my Ucalgary username and password will be required for authentication. There is a handy library built in to Python called "getpass" that prompts the user for their password and saves it to a variable without displaying it on screen. Additionally, the getpass library allows one to retrieve their current username.

In [4]:
#Retrieve username and prompt user for password
user = getpass.getuser()
passw = getpass.getpass('Password for '+user+': ')

Password for alexander.hickey: ········


All of the computers in the junior labs have an IP address of the form 136.159.54.X, where X is an integer between 0 and 255. Using paramiko, we can attempt to interface with each device in the lab by trying to open an ssh connection. If this connection is successful, I add that IP to a list of valid hosts that can later be used in our cluster. If the connection fails, I ignore this IP. This will give us an idea of the number of machines available over the network. The cell below scans only the addresses with X from 30 to 39.

In [6]:
#Set network base IP
IP_base = '136.159.54.'

#List of valid host IPs
hostlist = []

#Search through junior lab IPs and attempt to connect
for j in range(30,40):
    
    #Current IP
    hostname = IP_base+str(j)
    
    #Try to connect, with statement used so that client is not stored.
    try:
        with paramiko.SSHClient() as ssh:
            
            #Set host key policy
            ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            
            #Connect using current IP, timeout after 0.1s
            ssh.connect(hostname, username=user, password=passw,timeout=.1)
            
            #Close connection and add to list of valid IPs
            hostlist.append(hostname)
    
    #Move on to next IP if an exception is raised
    except Exception:
        pass

print 'Successfully connected to %s hosts:' %len(hostlist) 
print(hostlist)



Successfully connected to 9 hosts:
['136.159.54.30', '136.159.54.31', '136.159.54.33', '136.159.54.34', '136.159.54.35', '136.159.54.36', '136.159.54.37', '136.159.54.38', '136.159.54.39']
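Since every failed ssh attempt costs a timeout, one cheap pre-filter is to check whether TCP port 22 accepts a connection before handing the address to paramiko. The following is a minimal sketch using only the standard library. The helper names `candidate_hosts`, `port_is_open` and `reachable_hosts` are hypothetical (they are not part of paramiko or parallel python), and an open port does not guarantee that the login itself will succeed.

```python
import socket

def candidate_hosts(base, start, stop):
    """Return the addresses base+str(j) for j in [start, stop)."""
    return [base + str(j) for j in range(start, stop)]

def port_is_open(host, port=22, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True

def reachable_hosts(candidates, probe=port_is_open):
    """Keep only the candidates for which the probe returns True."""
    return [host for host in candidates if probe(host)]

# The probe can be swapped out, which makes the filter easy to check offline
print(reachable_hosts(['a', 'b', 'c'], probe=lambda h: h != 'b'))
```

For the scan above, this would be called as `reachable_hosts(candidate_hosts('136.159.54.', 30, 40))`, and only the surviving addresses would then be passed to the ssh loop.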


As we see, we were able to successfully ssh into 9 devices over the network. With an average of 4 cores per computer in the lab, this should give us a significant amount of computational power. Next, we set up the parallel python server on each one of the available machines. By running the ppserver.py file included in the parallel python distribution, a machine will become available to interface with the parallel python server and become a node in the computing cluster.

In [8]:
#Terminal command to activate the parallel python server.
#Python27 corresponds to a conda environment on my account
#that is used to run Python 2.7.16.
command = 'conda activate Python27 && ppserver.py'

#Counter
cnt = 0

#Start client on each valid host
for host in hostlist:
    try:
        with paramiko.SSHClient() as ssh:
            
            #Set host key policy
            ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            
            #Connect to host
            ssh.connect(host, username=user, password=passw,timeout=2)
            
            #Execute command on host
            ssh.exec_command(command)
            
            #Update counter
            cnt += 1

    #Skip hosts on which the server could not be started
    except Exception:
        pass
        
print 'ppserver.py executed on %s hosts'%cnt

ppserver.py executed on 9 hosts


Now that the servers are running, we are ready to create a so-called job server to add each of these machines as a node in our computing cluster.

In [9]:
#Activate job server using available machines
job_server = pp.Server(ppservers = tuple(hostlist))

In [11]:
#Show active nodes in the cluster
job_server.get_active_nodes()

{'136.159.54.30:60000': 4,
 '136.159.54.31:60000': 4,
 '136.159.54.33:60000': 4,
 '136.159.54.34:60000': 4,
 '136.159.54.35:60000': 4,
 '136.159.54.36:60000': 4,
 '136.159.54.37:60000': 4,
 '136.159.54.38:60000': 4,
 '136.159.54.39:60000': 4,
 'local': 4}
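The dictionary returned by `get_active_nodes` maps each node to its number of worker processes, so the size of the cluster can be tallied by summing its values. A short sketch, with the dictionary copied from the output above (the variable names are only illustrative):

```python
# Node table copied from the get_active_nodes() output above
active_nodes = {
    '136.159.54.30:60000': 4,
    '136.159.54.31:60000': 4,
    '136.159.54.33:60000': 4,
    '136.159.54.34:60000': 4,
    '136.159.54.35:60000': 4,
    '136.159.54.36:60000': 4,
    '136.159.54.37:60000': 4,
    '136.159.54.38:60000': 4,
    '136.159.54.39:60000': 4,
    'local': 4,
}

# Total number of workers, and the share contributed by remote machines
total_workers = sum(active_nodes.values())
remote_workers = total_workers - active_nodes['local']

print(total_workers)   # 40
print(remote_workers)  # 36
```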

In [34]:
d = 120
%timeit np.linalg.eigvals(np.random.rand(d,d))

The slowest run took 8.38 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 8.06 ms per loop
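This single-task timing gives a rough lower bound on what the cluster could achieve. As a sketch, assume that every task costs the same 8.06 ms measured above (for d = 120), that the 40 workers listed by the job server run in perfect batches, and that network and serialization overhead are zero. None of these assumptions hold exactly in practice, so the numbers below are a best case, not a prediction. The function names are hypothetical.

```python
import math

def ideal_wall_time(n_tasks, n_workers, t_task):
    """Best-case wall time when equal-cost tasks run in batches of n_workers."""
    batches = int(math.ceil(n_tasks / float(n_workers)))
    return batches * t_task

def ideal_speedup(n_tasks, n_workers):
    """Best-case speedup over running the same tasks one after another."""
    batches = int(math.ceil(n_tasks / float(n_workers)))
    return n_tasks / float(batches)

t_task = 8.06e-3  # seconds per task, taken from the %timeit output above
for n in (1, 40, 41, 200):
    print('%3d tasks: %.4f s, speedup %.1f'
          % (n, ideal_wall_time(n, 40, t_task), ideal_speedup(n, 40)))
```

Note that the speedup only approaches the number of workers when the number of tasks is a multiple of it: 41 tasks on 40 workers need two batches, so the best-case speedup drops to about 20.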


In [35]:
def useless_task(d):
    '''
    Some computationally intensive task. This function
    will generate a random dxd matrix with elements in
    the unit interval, and subsequently compute the
    eigenvalues of it.
    
    Args:
        d: Matrix dimension
        
    Return:
        1, a dummy value (the eigenvalues are discarded)
        
    '''
    
    #Import inside the function so that numpy is available on remote nodes
    import numpy as np
    
    matrix = np.random.rand(d,d)
    np.linalg.eigvals(matrix)
    
    return 1

d = 100
num_tasks = np.arange(200)
tlist = []

for n in num_tasks:
    
    #Time the submission and completion of n tasks
    t0 = time.time()
    jobs = [job_server.submit(useless_task, (d,)) for i in range(n)]
    
    #Calling a job blocks until its result is available
    for job in jobs:
        job()
        
    tlist.append(time.time()-t0)
    
plt.plot(num_tasks,tlist)

NameError: name 'f' is not defined

In [None]:
import time
def f(x):
    
    time.sleep(1.0)
    
    return x

t0 = time.time()

inputs = (1,2,3,4,5,6,7,8)
jobs = [(i, job_server.submit(f,(i,), modules = ("time",))) for i in inputs]
#Calling job() blocks until the result is available
for i, job in jobs:
    print i, job()

t = time.time()-t0
    
print('Time: ' +str(t))
job_server.print_stats()
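The timing one would hope to see from this sleep test can be previewed without a cluster. The sketch below is a local stand-in, not parallel python: it uses the thread-based `multiprocessing.dummy.Pool` from the standard library, shortens the sleep to 0.05 s, and introduces a hypothetical helper `timed_map`. Because sleeping tasks do not compete for the CPU, eight of them on eight workers should take roughly as long as one.

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool, standard library

def f(x):
    # Stand-in for a slow task; shortened from 1.0 s to keep the demo quick
    time.sleep(0.05)
    return x

def timed_map(func, inputs, n_workers):
    """Map func over inputs with n_workers threads; return (results, seconds)."""
    pool = Pool(n_workers)
    try:
        t0 = time.time()
        results = pool.map(func, inputs)
        elapsed = time.time() - t0
    finally:
        pool.close()
        pool.join()
    return results, elapsed

inputs = (1, 2, 3, 4, 5, 6, 7, 8)
serial_results, serial_time = timed_map(f, inputs, 1)
parallel_results, parallel_time = timed_map(f, inputs, 8)

print('1 worker:  %.2f s' % serial_time)
print('8 workers: %.2f s' % parallel_time)
```

On the cluster, the same comparison applies with the job server in place of the thread pool, except that the measured time also includes the cost of shipping each job over the network.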

Finally, it is probably a good idea to kill the ppserver.py process on each of the machines in the computer lab, otherwise it will essentially be running indefinitely.

In [12]:
command = 'pkill -f ppserver.py'
cnt = 0
for host in hostlist:
    
    try:
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(host, username=user, password=passw,timeout=2)
        ssh.exec_command(command)
        ssh.close()
        cnt += 1
        
    #Skip hosts that can no longer be reached
    except Exception:
        pass
        
print 'disconnected from %s hosts'%cnt

disconnected from 9 hosts
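Because the shutdown step is easy to forget, or to skip when a benchmark cell raises an error, one option is to tie start-up and shutdown together in a context manager. The following is only a sketch of that pattern: `running_servers` is a hypothetical helper, and the `start` and `stop` callables stand in for the paramiko loops used in this notebook.

```python
from contextlib import contextmanager

@contextmanager
def running_servers(hosts, start, stop):
    """Call start(host) on each host, and always call stop(host) on exit."""
    started = []
    try:
        for host in hosts:
            start(host)
            started.append(host)
        yield started
    finally:
        # Runs even if the body of the with statement raises
        for host in started:
            stop(host)

# Demonstration with stand-in callables that only record what happened
log = []
try:
    with running_servers(['a', 'b'],
                         lambda h: log.append(('start', h)),
                         lambda h: log.append(('stop', h))):
        raise RuntimeError('benchmark failed')
except RuntimeError:
    pass

print(log)
```

In this demonstration both hosts are stopped even though the body of the `with` statement fails.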


In [None]:
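import os

#Spawn node
os.system("konsole -e ssh munin 'conda activate Python27 && ppserver.py'")

#Kill process. Note that pkill -u terminates every process owned by
#the user on munin, not only ppserver.py
os.system("konsole -e ssh munin 'pkill -u alexander.hickey'")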

### Conclusion

In this report, we examined the construction of a computing cluster using the parallel python and paramiko modules in Python.