# Cloud Computing Assignment 2022-2023
Implementation of an application that processes large data sets in parallel in a distributed cloud environment (i.e. AWS).

© Copyright 2022, All rights reserved to Hans Haller, CSTE-CIDA Student at Cranfield Uni. SATM, Cranfield, UK.

https://www.github.com/Hnshlr

### Solution setup - Pre-requisites:
1. Make sure the AWS credentials taken from the Learner Lab are up to date in the ~/.aws/credentials file (test the connection locally with aws sts get-caller-identity).
2. Specify the path to the "labsuser.pem" key file (taken from the Learner Lab), which paramiko needs to connect to the EC2 instances and execute SSH commands.
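
As a complement to the aws sts get-caller-identity check, the credentials file can be inspected locally before any AWS call is made. The sketch below is not part of the solution: the helper name missing_credential_keys is hypothetical, and it assumes the Learner Lab credentials are stored under the [default] profile with an access key, a secret key and a session token.

```python
import configparser
from pathlib import Path

# Keys that temporary (session-based) credentials are assumed to provide.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key", "aws_session_token")


def missing_credential_keys(path, profile="default"):
    """Return the required keys that are absent or empty in the given profile."""
    # Interpolation is disabled so that special characters in values are read as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    section = parser[profile]
    return [key for key in REQUIRED_KEYS if not section.get(key, "").strip()]


if __name__ == "__main__":
    credentials_path = Path.home() / ".aws" / "credentials"
    missing = missing_credential_keys(credentials_path)
    print("Missing keys:", missing if missing else "none")
```

This only checks that the keys are present. It cannot tell whether the session has expired, so the aws sts get-caller-identity test is still needed.
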
### Solution setup steps (Using the environment setup function):
1. Make sure the AWS environment of the account whose credentials were imported is empty.
2. If the AWS environment is not empty, execute the kill_all() function following the imports.
3. Verify that the backend path in the second cell is correct.
4. Execute the environment_setup() function.
5. Verify that the AWS environment is fully set up using the view_all() function.
6. Execute the solution_execution() function to run the solution, and view the results.

### IMPORTS:

The following controllers define functions that use Boto3 to make AWS API calls. Importing the controllers automatically creates a Boto3 resource for each AWS service the solution needs (EC2, SQS, SSM, S3, etc.) so that these functions work.

The Boto3 resources use the AWS credentials located in the .aws folder of the user who runs this software.

These credentials must therefore be up to date before running the cells below. If they have expired (i.e. the Learner Lab session ended), restart the kernel and re-execute the imports.

In [None]:
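# MODULES=
# os, time and numpy are used directly in the cells below, so they are imported explicitly
# rather than relying on the wildcard import to provide them.
import os
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt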

# SERVICES=
from backend.app_service import *       # My app only needs to import the app_service service, as it imports all the required controllers.

## AWS - SOLUTION SETUP AND TASKS EXECUTION:

### ENVIRONMENT SETTINGS:

In [None]:
# SETTINGS=             [IMPORTANT: Update the following settings before running my solution]
worker_amount = 8
backend_path = os.path.join(os.getcwd(), 'backend')
print('IMPORTANT: Please verify that the backend path is correct: ', backend_path)

# NAMES=
instances_names = ['master'] + ['worker' + str(i) for i in range(1, worker_amount + 1)]
queues_names = ['main-protected-jobs.fifo', 'main-protected-results.fifo']
bucket_names = ['main-protected-bucket', 'main-protected-ssm-outputs']
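
To make the naming scheme concrete, the sketch below rebuilds the instance names for a small worker count and checks the queue names. The helper build_instance_names is hypothetical and only mirrors the expression used in the cell above. The suffix check reflects the SQS rule that FIFO queue names end in ".fifo".

```python
def build_instance_names(worker_amount):
    """Return ['master', 'worker1', ..., 'workerN'] for N workers."""
    return ["master"] + [f"worker{i}" for i in range(1, worker_amount + 1)]


queues_names = ["main-protected-jobs.fifo", "main-protected-results.fifo"]

print(build_instance_names(3))  # ['master', 'worker1', 'worker2', 'worker3']

# FIFO queues in SQS must carry the '.fifo' suffix in their name.
print(all(name.endswith(".fifo") for name in queues_names))  # True
```
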

### ENVIRONMENT SETUP:
Environment setup should be executed in an empty AWS account to avoid conflicts with existing resources.

To do so, delete any existing resources with the kill_all() function. (Use at your own risk.)
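
Because kill_all() is destructive, one option is to guard it behind an explicit confirmation. The wrapper below is a hypothetical sketch and not part of the backend: run_if_confirmed and its parameters are assumptions made for illustration.

```python
def run_if_confirmed(action, ask=input, expected="delete everything"):
    """Call action() only when the user types the expected phrase."""
    answer = ask(f"Type '{expected}' to continue: ")
    if answer.strip() != expected:
        print("Aborted: nothing was deleted.")
        return False
    action()
    return True


# In the notebook, this could replace a bare kill_all() call:
# run_if_confirmed(kill_all)
```

The ask parameter defaults to input, so in the notebook the prompt appears under the cell. Passing another callable makes the guard usable from a script.
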

In [None]:
# CLEANUP PRIOR TO SETTING UP THE ENVIRONMENT:
kill_all()

In [None]:
# START TIMER:
print('Beginning AWS environment setup. Starting timer...')
envsetup_timer = time.time()

# SETUP ENVIRONMENT:
environment_setup(queues_names, bucket_names, instances_names, backend_path)

# STOP TIMER:
print('Environment setup took: ' + str(np.round(time.time() - envsetup_timer, 2)) + ' seconds.')
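
The start/stop timer pattern above recurs throughout this notebook. A small context manager can express it once. This is a sketch rather than part of the solution: the name timed is hypothetical, and time.perf_counter() is swapped in for time.time() because it is intended for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Measure the wall-clock duration of the enclosed block and print it."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # The duration is recorded even if the enclosed block raises.
        result["seconds"] = time.perf_counter() - start
        print(f"{label} took: {result['seconds']:.2f} seconds.")


with timed("Environment setup") as timing:
    time.sleep(0.1)  # stand-in for environment_setup(...)
```

After the block exits, timing["seconds"] holds the measured duration, so it can also be stored for later comparison.
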

### ENVIRONMENT VERIFICATION:

In [None]:
# VERIFY ENVIRONMENT STATUS:
view_all()

In [None]:
# UPDATE BACKEND FOLDER ON S3, AND EC2 INSTANCES:
update_backend(get_instance_ids_by_names(instances_names), bucket_names[0], backend_path)

### SOLUTION EXECUTION:

In [None]:
# PREFERENCES=
matrix_shape = 1000
used_workers = 8
online_monitor = True

# EXECUTE SOLUTION (MONITORED=True):
solution_execution(matrix_shape, used_workers, worker_amount, instances_names, queues_names, bucket_names, online_monitor)
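
This notebook does not show how solution_execution() divides the work between workers. One common approach for a matrix product is row-block partitioning, where each worker multiplies a contiguous block of rows of A by the full matrix B. The pure-Python sketch below illustrates that general technique under this assumption. All function names are hypothetical, and it is not the backend's implementation.

```python
def split_rows(n_rows, n_workers):
    """Split range(n_rows) into at most n_workers contiguous (start, stop) blocks."""
    base, extra = divmod(n_rows, n_workers)
    blocks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        if size == 0:
            continue  # more workers than rows: this worker gets nothing
        blocks.append((start, start + size))
        start += size
    return blocks


def multiply_block(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a_rows]


def blockwise_product(a, b, n_workers):
    """Compute A x B block by block, as independent jobs could do."""
    result = []
    for start, stop in split_rows(len(a), n_workers):
        result.extend(multiply_block(a[start:stop], b))
    return result


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0], [0, 1]]
print(split_rows(3, 2))            # [(0, 2), (2, 3)]
print(blockwise_product(a, b, 2))  # [[1, 2], [3, 4], [5, 6]]
```

In a distributed setting, each (start, stop) block would become one job, and the partial results would be concatenated in block order.
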

### VERIFY:

In [None]:
# SETTINGS=
op_ids = view_all_s3_buckets_filenames(bucket_names[0], 'backend/data/output/mx/')
op_type = 'mx'

# VERIFY MULTIPLE COMPUTATIONS OF A TYPE (RUN THE COMPARISON ONLINE USING AN INSTANCE):
verify_multiple_jobs(bucket_names[0], instances_names[0], op_ids, op_type)

## BULK TESTS - TIME COMPARISON:

In [None]:
# SETTINGS=
matrix_shapes = [100, 250, 500, 1000, 1250, 1500, 1750, 2000]
used_workers = 8
op_type = 'mx'
online_monitor = False

# BULK TESTS:
aws_times = []
for matrix_shape in matrix_shapes:
    print('Testing a computation (mx) for a matrix of shape: ' + str(matrix_shape) + '...')
    aws_times.append(solution_execution(matrix_shape, used_workers, worker_amount, instances_names, queues_names, bucket_names, online_monitor))
    print('\n')

In [None]:
# BULK VERIFY COMPUTATIONS:
op_ids = view_all_s3_buckets_filenames(bucket_names[0], 'backend/data/output/' + op_type + '/')
verify_multiple_jobs(bucket_names[0], instances_names[0], op_ids, op_type)

# COMPARISON WITH NUMPY COMPUTATION TIME:
numpy_times = []
for matrix_shape in matrix_shapes:
    timer_start = time.time()
    np.dot(np.random.randint(0, 10, size=(matrix_shape, matrix_shape)), np.random.randint(0, 10, size=(matrix_shape, matrix_shape)))
    numpy_times.append(np.round(time.time() - timer_start, 2))

# SAVE BOTH TIMES AND THE MATRIX SHAPES IN A CSV FILE:
df = pd.DataFrame({'matrix_shape': matrix_shapes, 'numpy_times': numpy_times, 'aws_times': aws_times})
df.to_csv('data/times.csv', index=False)

# PLOT RESULTS - TIME COMPARISON:
plt.plot(matrix_shapes, aws_times, 'o-', color='orange', label='Distributed computation')
plt.plot(matrix_shapes, numpy_times, 'o-', color='blue', label='NumPy computation')
plt.xlabel('Matrix shape')
plt.ylabel('Time (seconds)')
plt.title('Time comparison between NumPy and distributed computation')
plt.legend()
plt.show()

# TABLE OF TIMES - USING THE CSV FILE:
df = pd.read_csv('data/times.csv')
df
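
Beyond the plot, the two time series can be summarised as a per-size ratio. The sketch below is a hypothetical helper, not part of the solution. It returns None where the ratio is not meaningful, which can happen here because the NumPy times are rounded to two decimals and may come out as 0.0 for small matrices. The numbers in the usage lines are placeholders, not measurements.

```python
def speedups(baseline_times, distributed_times):
    """Return baseline / distributed for each pair, or None when either time is not positive."""
    ratios = []
    for baseline, distributed in zip(baseline_times, distributed_times):
        if baseline > 0 and distributed > 0:
            ratios.append(baseline / distributed)
        else:
            ratios.append(None)  # e.g. a time that rounded down to 0.0
    return ratios


# Placeholder values only, chosen to show both cases.
print(speedups([0.0, 2.0], [4.0, 8.0]))  # [None, 0.25]
```

A ratio below 1 means the distributed run took longer than the local NumPy run for that matrix size.
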

### CLEAN UP:

In [None]:
# CLEAN THE AWS ENVIRONMENT OF ALL SERVICES:
kill_all()