# Assignment 4
## Understanding scaling of linear algebra operations on Apache Spark using Apache SystemML

In this assignment you will learn how to scale linear algebra operations from a single machine to multiple machines, more memory and more CPU cores using Apache SystemML. To do so, you will migrate a numpy program to a SystemML DML program. Don't worry, we will give you plenty of hints. You won't need this knowledge if you stick to Keras only, but once you go beyond that point you'll be glad to understand what's going on behind the scenes. Please make sure you run this notebook from an Apache Spark 2.3 notebook.

First we need to make sure that we are on the latest version of SystemML, which is 1.2.0 (as of Feb. '19).
Please use the code block below to check whether you are already on 1.2.0 or higher.

In [1]:
from systemml import MLContext
ml = MLContext(spark)
ml.version()

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190331161224-0001
KERNEL_ID = db473c1a-d3ec-429f-9c04-19bb23a74378


ImportError: No module named 'systemml'

If you are below version 1.2.0, please execute the next two code blocks.

In [None]:
!pip install systemml

Collecting systemml
  Downloading https://files.pythonhosted.org/packages/b1/94/62104cb8c526b462cd501c7319926fb81ac9a5668574a0b3407658a506ab/systemml-1.2.0.tar.gz (9.7MB)
Collecting numpy>=1.8.2 (from systemml)
  Downloading https://files.pythonhosted.org/packages/e3/18/4f013c3c3051f4e0ffbaa4bf247050d6d5e527fe9cb1907f5975b172f23f/numpy-1.16.2-cp35-cp35m-manylinux1_x86_64.whl (17.2MB)
Collecting scipy>=0.15.1 (from systemml)
  Downloading https://files.pythonhosted.org/packages/f0/30/526bee2ce18c066f9ff13ba89603f6c2b96c9fd406b57a21a7ba14bf5679/scipy-1.2.1-cp35-cp35m-manylinux1_x86_64.whl (24.7MB)
Collecting pandas (from systemml)
  Downloading https://files.

Now we need to create two symlinks so that the newest version is picked up. This is a workaround and will be removed soon.

In [None]:
!ln -s -f ~/user-libs/python3/systemml/systemml-java/systemml-1.2.0-extra.jar ~/user-libs/spark2/systemml-1.2.0-extra.jar
!ln -s -f ~/user-libs/python3/systemml/systemml-java/systemml-1.2.0.jar ~/user-libs/spark2/systemml-1.2.0.jar

Now please restart the kernel and make sure the version is correct

In [1]:
from systemml import MLContext
ml = MLContext(spark)
ml.version()

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190331161315-0002
KERNEL_ID = db473c1a-d3ec-429f-9c04-19bb23a74378


'0.14.0-incubating'

If you see version 1.2.0 or higher, congratulations! Please continue with the notebook.

In [2]:
from systemml import MLContext, dml
import numpy as np
import time

Then we create an MLContext to interface with Apache SystemML. Note that we pass a SparkSession object as a parameter so that SystemML knows how to talk to the Apache Spark cluster.

In [3]:
ml = MLContext(spark)

Now we create some large random matrices for numpy and SystemML to crunch on.

In [4]:
u = np.random.rand(1000,10000)
s = np.random.rand(10000,1000)
w = np.random.rand(1000,1000)

Now we implement a short one-liner to define a very simple linear algebra operation

In case you are unfamiliar with matrix-matrix multiplication: https://en.wikipedia.org/wiki/Matrix_multiplication

sum(U' * (W . (U * S)))


| Legend | Meaning |
| ------ | ------- |
| ' | transpose of a matrix |
| * | matrix-matrix multiplication |
| . | element-wise (scalar-by-scalar) multiplication |
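To make the notation concrete, here is a minimal sketch that evaluates the same expression on tiny matrices whose values were chosen arbitrarily for illustration (they are not part of the assignment). It shows which steps are matrix-matrix products (`dot`) and which step is the element-wise product (`*`), together with the shape of each intermediate result.

```python
import numpy as np

# Tiny, hand-checkable matrices (illustrative values, not from the assignment).
u = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
s = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # shape (3, 2)
w = np.array([[2.0, 0.0],
              [0.0, 2.0]])        # shape (2, 2)

us = u.dot(s)            # U * S : matrix-matrix product, shape (2, 2)
masked = w * us          # W . (U * S) : element-wise product, shape (2, 2)
prod = u.T.dot(masked)   # U' * (...) : matrix-matrix product, shape (3, 2)
res = np.sum(prod)       # sum over all entries of the result

print(us.shape, masked.shape, prod.shape, res)
```

Note that `w` and `u.dot(s)` must have the same shape for the element-wise step, which is why W is square with as many rows as U.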



In [5]:
start = time.time()
res = np.sum(u.T.dot(w * u.dot(s)))
print (time.time()-start)

1.6442468166351318


As you can see, this executes perfectly fine. It is even a very efficient execution, because numpy uses a C/C++ backend which is known for its performance. But what happens if U, S or W become so large that the available main memory cannot hold them? Let's give it a try:

In [6]:
#u = np.random.rand(10000,100000)
#s = np.random.rand(100000,10000)
#w = np.random.rand(10000,10000)
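Before running the cell above with the lines uncommented, it can help to estimate how much memory the three arrays would need. The following is a rough sketch, assuming dense arrays of 64-bit floats (the dtype returned by `np.random.rand`); the helper name `dense_bytes` is made up for this illustration.

```python
import numpy as np

def dense_bytes(rows, cols, dtype=np.float64):
    """Bytes needed to store a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize

# Shapes of the enlarged matrices from the commented-out cell.
shapes = {"u": (10000, 100000), "s": (100000, 10000), "w": (10000, 10000)}

# Size of each input in gigabytes (1 GB = 1e9 bytes).
sizes_gb = {name: dense_bytes(r, c) / 1e9 for name, (r, c) in shapes.items()}
total_gb = sum(sizes_gb.values())

for name in sorted(sizes_gb):
    print(name, sizes_gb[name], "GB")
print("total", total_gb, "GB")
```

This counts only the inputs. Intermediate results of the expression, such as the final matrix product, need additional memory on top of that.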

If you uncomment and run the lines above, you should see a memory error after a short while. This is because the operating system process was not able to allocate enough memory for storing the numpy array on the heap. Now it's time to re-implement the very same operations as DML in SystemML, and this is your task. Just replace all ###your_code_goes_here sections with proper code. The following table contains all the DML syntax you need:

| Syntax | Meaning |
| ------ | ------- |
| t(M) | transpose of a matrix, where M is the matrix |
| %*% | matrix-matrix multiplication |
| * | element-wise (scalar-by-scalar) multiplication |

## Task

In order to show you the advantage of SystemML over numpy we've blown up the sizes of the matrices. Unfortunately, on a 1-2 worker Spark cluster this takes quite some time to complete. Therefore we've stripped the example down to smaller matrices below, but we've kept the code in case you are curious to check it out. You might want to use more workers, which you can easily configure in the environment settings of the project within Watson Studio. Just be aware that you're currently limited to 50 free capacity unit hours per month, which are consumed by the additional workers.

In [7]:
script = """
U = rand(rows=1000,cols=10000)
S = rand(rows=10000,cols=1000)
W = rand(rows=1000,cols=1000)
res = sum( t(U) %*% (W * (U %*% S)))
"""

To get consistent results we switch from a random matrix initialization to something deterministic

In [8]:
#prog = dml(script).input('U', u).input('S', s).input('W', w).output('res')
prog = dml(script).output('res')
res = ml.execute(prog).get('res')
print(res)

TypeError: 'JavaPackage' object is not callable

If everything runs fine you should get *6244089899151.321* as result. Feel free to submit your DML script to the grader now!

### Submission

In [9]:
!rm -f rklib.py
!wget https://raw.githubusercontent.com/romeokienzler/developerWorks/master/coursera/ai/rklib.py

--2019-03-31 16:13:50--  https://raw.githubusercontent.com/romeokienzler/developerWorks/master/coursera/ai/rklib.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2289 (2.2K) [text/plain]
Saving to: 'rklib.py'


2019-03-31 16:13:50 (56.1 MB/s) - 'rklib.py' saved [2289/2289]



In [10]:
from rklib import submit
key = "esRk7vn-Eeej-BLTuYzd0g"


email = "kumarpushkar007@gmail.com"

In [11]:
part = "fUxc8"
secret = "c5yswu0BGFC7aERX"
submit(email, secret, key, part, [part], script)

Submission successful, please check on the coursera grader page for the status
-------------------------
{"elements":[{"itemId":"P1p3F","id":"tE4j0qhMEeecqgpT6QjMdA~P1p3F~-vRKCFPPEem6dhKKDyfzXA","courseId":"tE4j0qhMEeecqgpT6QjMdA"}],"paging":{},"linked":{}}
-------------------------
