<a href="https://colab.research.google.com/github/Rudresh99/System-ML/blob/master/SystemML_Work.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scripted By : Rudresh Joshi { Tuscon } 

# LinkedIn  : https://www.linkedin.com/in/rudresh-joshi-b1397517a/

# GitHub : [Tuscon](https://github.com/Rudresh99)
## Understanding scaling of linear algebra operations on Apache Spark using Apache SystemML

In this notebook we learn how to scale linear algebra operations from a single machine to multiple machines, memory and CPU cores using Apache SystemML. To do so, we look at how to migrate a numpy program to a SystemML DML program.

You won't need this knowledge if you are sticking to Keras only, but once you go beyond that point you will be happy to know what's going on behind the scenes.

So the first thing we need to ensure is that we are on the latest version of SystemML, which is 1.2.0.

The steps are:
- pip install
- start execution at the cell with the version check

In [0]:
!pip install pyspark==2.4.5



In [0]:
!pip install systemml



In [0]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
# run Spark locally, using all available CPU cores
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession \
    .builder \
    .getOrCreate()

In [0]:
!mkdir -p /home/dsxuser/work/systemml

In [0]:
from systemml import MLContext, dml
import numpy as np
import time
ml = MLContext(spark)
# point SystemML at the scratch directory created in the previous cell
ml.setConfigProperty("sysml.localtmpdir", "/home/dsxuser/work/systemml")
print(ml.version())

if ml.version() != '1.2.0':
    raise ValueError('please upgrade to SystemML 1.2.0, or restart your Kernel (Kernel->Restart & Clear Output)')

1.2.0


Congratulations, we now see version 1.2.0, so we can continue with the notebook.

We use an MLContext to interface with Apache SystemML. Note that we passed a SparkSession object as a parameter, so SystemML now knows how to talk to the Apache Spark cluster.

Now we create some large random matrices for numpy and SystemML to crunch on.

In [0]:
u = np.random.rand(1000,10000)
s = np.random.rand(10000,1000)
w = np.random.rand(1000,1000)

Now we implement a short one-liner to define a very simple linear algebra operation.

In case you are unfamiliar with matrix-matrix multiplication: https://en.wikipedia.org/wiki/Matrix_multiplication

sum(U' * (W . (U * S)))


| Symbol        | Meaning           |   
| ------------- |-------------| 
| '      | transpose of a matrix | 
| * | matrix-matrix multiplication      |  
| . | element-wise multiplication      |   
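
To make the notation concrete, here is a small sketch that evaluates the same expression on tiny matrices. The values are made up for illustration and are not part of the original notebook; they are just small enough to check by hand. Note that the `.` step combines two matrices of the same shape entry by entry, which is what `*` does between numpy arrays.

```python
import numpy as np

# Tiny illustrative matrices (hypothetical values, chosen to be hand-checkable)
U = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
S = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # shape (3, 2)
W = np.array([[1, 2],
              [3, 4]])           # shape (2, 2)

US = U.dot(S)                    # matrix-matrix multiplication, shape (2, 2)
weighted = W * US                # element-wise multiplication, shape (2, 2)
result = int(np.sum(U.T.dot(weighted)))  # transpose, multiply, then sum all entries

print(US.tolist())
print(weighted.tolist())
print(result)
```

The shapes are the part worth watching: `U.dot(S)` must have the same shape as `W` for the element-wise step to work, which is also true of the large matrices used in this notebook.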



In [0]:
start = time.time()
res = np.sum(u.T.dot(w * u.dot(s)))
print(time.time() - start)

1.0806496143341064


As you can see, this executes perfectly fine. It is even a very efficient execution, because numpy uses a C/C++ backend which is known for its performance. But what happens if U, S or W become so big that the available main memory cannot cope with them? Let's give it a try:

In [0]:
#u = np.random.rand(10000,100000)
#s = np.random.rand(100000,10000)
#w = np.random.rand(10000,10000)
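
A rough estimate shows why these shapes are a problem. The sketch below only does arithmetic and allocates nothing; it assumes 8 bytes per entry, since `np.random.rand` returns float64 arrays. The helper name `footprint_bytes` is made up for this illustration.

```python
BYTES_PER_FLOAT64 = 8  # np.random.rand produces float64 values

# Shapes from the commented-out cell above
shapes = {
    "u": (10000, 100000),
    "s": (100000, 10000),
    "w": (10000, 10000),
}

def footprint_bytes(shape):
    """Memory needed to hold a dense float64 matrix of the given shape."""
    rows, cols = shape
    return rows * cols * BYTES_PER_FLOAT64

sizes = {name: footprint_bytes(shape) for name, shape in shapes.items()}
total_gb = sum(sizes.values()) / 1e9

for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.1f} GB")
print(f"total: {total_gb:.1f} GB")
```

This counts only the three inputs. The intermediate products of the expression need additional memory on top of that.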

If you uncomment and run the cell above, after a short while you should see a memory error. This is because the operating system process was not able to allocate enough memory for storing the numpy arrays on the heap. Now it's time to re-implement the very same operations as DML in SystemML.
Consider the following table, which contains all the DML syntax you need:

| Syntax        | Meaning           |   
| ------------- |-------------| 
| t(M)      | transpose of a matrix, where M is the matrix | 
| %*% | matrix-matrix multiplication      |  
| * | element-wise multiplication      |   


In order to show the advantage of SystemML over numpy we've blown up the sizes of the matrices. Unfortunately, on a 1-2 worker Spark cluster this takes quite some time to complete, so we've stripped the example down to smaller matrices below. If you want more workers, you can easily configure them in the environment settings of the project within IBM Watson Studio.

In [0]:
script = """
U = rand(rows=1000,cols=10000)
S = rand(rows=10000,cols=1000)
W = rand(rows=1000,cols=1000)
res = sum(t(U) %*% (W * (U%*%S)))
"""

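Note that the script above still initializes the matrices randomly, so the result varies slightly from run to run. To get consistent results, switch from random matrix initialization to something deterministic.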
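
One way to get a deterministic initialization on the numpy side is to seed the random number generator. The sketch below is an illustration only: the seed value 42, the helper name `run_once` and the reduced matrix sizes are assumptions, chosen so that it runs quickly. DML's `rand` is documented as accepting a `seed` argument too, but that is not exercised here.

```python
import numpy as np

def run_once(seed):
    """Evaluate sum(U' * (W . (U * S))) on small matrices drawn from a seeded generator."""
    rng = np.random.default_rng(seed)   # same seed, same sequence of random numbers
    u = rng.random((10, 100))
    s = rng.random((100, 10))
    w = rng.random((10, 10))
    return float(np.sum(u.T.dot(w * u.dot(s))))

first = run_once(42)
second = run_once(42)

print(first)
print(first == second)   # identical seeds give identical results
```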

In [0]:
prog = dml(script).output('res')
res = ml.execute(prog).get('res')
print(res)

ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
SystemML Statistics:
Total execution time:		10.522 sec.
Number of executed Spark inst:	0.


6245157736010.988


# Final result or output
If everything runs fine you should get something in the ballpark of *6252492444241.075*, which is a typical value for this calculation. Because the matrices are random, the exact number differs between runs, and our output of 6245157736010.988 is pretty close to the expected one. The SystemML statistics also show the time required for the calculation, which is very nice.
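
To put a number on "in that ballpark", the sketch below compares the two values quoted in this notebook and also estimates the expected value of the sum. The estimate assumes every matrix entry is drawn independently and uniformly from [0, 1), which is what `np.random.rand` does and, to my understanding, what DML's `rand` does by default. The variable names are made up for this illustration.

```python
observed = 6245157736010.988    # result printed by the SystemML run above
reference = 6252492444241.075   # value quoted as the typical result

rel_diff = abs(observed - reference) / reference
print(f"relative difference: {rel_diff:.4%}")

# The sum expands to: sum over k, i, j, l of U[k,i] * W[k,j] * U[k,l] * S[l,j]
# Under the uniform [0, 1) assumption:
#   E[W[k,j]] = E[S[l,j]] = 1/2
#   E[U[k,i] * U[k,l]] = 1/4 when i != l, and 1/3 when i == l
n_k, n_i, n_j = 1000, 10000, 1000   # rows of U, columns of U, columns of S
expected = n_k * n_j * 0.25 * (n_i * (n_i - 1) * 0.25 + n_i / 3)
print(f"expected value: {expected:.6e}")
```

Both quoted values lie within roughly 0.1% of this estimate, and they differ from each other by about 0.12%, which is the kind of run-to-run variation you would expect from random inputs.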

Thank you


