# Serialization
by: Joseph Armstrong

# Unit Testing
Unit testing, supported in Python by the `unittest` module (sometimes referred to as "PyUnit"):
* involves breaking your program into pieces and subjecting each piece to a series of tests.
    
### Requirements

``import unittest``

### Advantages
* provides documentation of a system
* simplifies the debugging process
* lets you know exactly WHERE the code fails, not just IF it does

### Strategies
***Statement Testing*** - A test strategy in which each statement of a program is executed at least once.

In [None]:
def my_contains(elem, lst):
    """(object, list) -> bool
    
    Return True if and only if elem is in lst.
    """
    return elem in lst

def my_first(lst):
    """ (list) -> object

    Return the first element in lst.
    """
    return lst[0]

In [None]:
def my_contains(elem, lst):
    return elem in lst

def my_first(lst):
    return lst[0]

----------------------------------------------------------------------
python unittest_basic.py
----------------------------------------------------------------------

from myfunctions import my_contains, my_first
import unittest

class TestMyFunctions(unittest.TestCase):
    def test_contains_simple_true(self):
        self.assertTrue(my_contains(3, [1, 2, 3]))

    def test_first_numbers(self):
        self.assertEqual(my_first([1, 2, 3]), 1)

    def test_first_empty(self):
        self.assertRaises(IndexError, my_first, [])

if __name__ == "__main__":
    unittest.main(exit=False)


In [None]:
...
----------------------------------------------------------------------
Ran 3 tests in 0.000s

OK
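
The statement testing strategy described above can be sketched on a function with more than one branch. This is only an illustrative example: `classify` and `TestClassify` are hypothetical names that do not come from `myfunctions`. Each test drives a different `return` statement, so the three tests together execute every statement at least once.

```python
import unittest


def classify(n):
    """Return 'negative', 'zero', or 'positive' for an integer n."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


class TestClassify(unittest.TestCase):
    # Together, these three tests execute every statement in classify().
    def test_negative(self):
        self.assertEqual(classify(-5), "negative")

    def test_zero(self):
        self.assertEqual(classify(0), "zero")

    def test_positive(self):
        self.assertEqual(classify(7), "positive")


# Load and run the tests explicitly so the example also works in a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestClassify)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Dropping any one of the three tests would leave a `return` statement unexecuted, which is the gap statement testing is meant to expose.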

# Performance Profiling

A profile is a set of statistics that describes how often and how long various parts of the program executed.

### Requirements

python -m cProfile *filename*

### Advantages
Profiling breaks the program's run time down by function, showing exactly which parts of the code are slowest. The per-function breakdown also gives a picture of how the code is structured.

In [12]:
mpirun -np 2 python example.py

----------------------------------------------------------------------

from mpi4py import MPI

comm = MPI.COMM_WORLD
name=MPI.Get_processor_name()

print("hello world")
print(("my rank is",comm.rank))

hello world
('my rank is', 0)


In [None]:
python -m cProfile example.py

----------------------------------------------------------------------

7 function calls in 0.059 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 __init__.py:25(<module>)
        1    0.000    0.000    0.000    0.000 atexit.py:6(<module>)
        1    0.059    0.059    0.059    0.059 example.py:1(<module>)
        1    0.000    0.000    0.000    0.000 {hasattr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {mpi4py.MPI.Get_processor_name}


***ncalls*** - the number of calls

***tottime*** - the total time spent in the given function (excluding time spent in calls to sub-functions)

***percall*** - the quotient of tottime divided by ncalls

***cumtime*** - the cumulative time spent in this function and all sub-functions (from invocation until exit). This figure is accurate even for recursive functions.

***percall*** - the quotient of cumtime divided by primitive calls

***filename:lineno(function)*** - the file name, line number, and name of each function
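
The same columns can also be produced from inside a script rather than from the command line. The following is a minimal sketch using the standard `cProfile` and `pstats` modules. The function `slow_sum` is a hypothetical workload invented for illustration and is not part of `example.py`.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Sum the squares below n with an explicit loop (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Profile only the region between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
answer = slow_sum(100000)
profiler.disable()

# Sort by cumulative time and print at most five entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by `"cumulative"` puts the functions with the largest cumtime first, which is usually the quickest way to find where a run spends its time.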

# Distributed Computing

A distributed system is a collection of independent computers, interconnected via a network, that are capable of collaborating on a task.

### Advantages
* By combining the processing and storage capacity of many nodes, a distributed system can reach performance levels that are beyond the reach of centralised machines
* Resources such as processing and storage capacity can be increased incrementally

### Challenges
* Nodes can differ in hardware, network, OS, programming language, and in implementations written by different developers
* Understanding the complexity of a distributed system that was designed by someone else

# Object Serialization

### Protobuf
* Protobuf requires a schema to be defined up front, which would lead most to assume it is the more efficient option, yet it was the least efficient of the three choices in the timings below.
    
### Pickle
* Pickle serializes the object before writing it to a file, which seems very useful for what we're trying to do.
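
As a rough illustration of the serialize-then-restore round trip, here is a minimal sketch using only the standard library. The `sample` data is a hypothetical stand-in for `realStuff`, whose actual contents are not shown in this notebook, and the timings it prints are not comparable to the ones reported here.

```python
import pickle
import timeit

# Hypothetical stand-in for realStuff: a list of small dictionaries.
sample = [{"key%d" % i: i for i in range(10)} for _ in range(1000)]


def write_read_pickle():
    """Serialize sample to bytes, then rebuild an equal object from them."""
    serialized = pickle.dumps(sample, protocol=2)
    return pickle.loads(serialized)


restored = write_read_pickle()
elapsed = timeit.timeit(write_read_pickle, number=10)
print("round trip preserved data:", restored == sample)
print("10 round trips took %.4f seconds" % elapsed)
```

Passing the function object directly to `timeit.timeit` avoids the need for a separate setup string.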
   
### MsgPack
* MsgPack can distinguish between string and binary types, yet it doesn't work well with Python 2, which could be a hindrance. However, it does produce the shortest serialized string and is the quickest of the three.

In [19]:
import cPickle
import msgpack
import timeit
from test_pb2 import BunchOfTestDicts, TestDict, Pair

def writeReadPB():
    bOTD = BunchOfTestDicts()
    for thisDict in realStuff:
        tD = bOTD.dicts.add()
        for k, v in thisDict.items():
            pair = tD.pairs.add()
            pair.key = k
            pair.value = v
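    # NOTE: newBOTD is the same in-memory message; no SerializeToString() /
    # ParseFromString() round trip is performed here.
    newBOTD = bOTD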
    thisDictList = [{thisPair.key: thisPair.value
                             for thisPair in thisBufferedDict.pairs}
                    for thisBufferedDict in newBOTD.dicts]
    return thisDictList


with open('realstuff.pkl', 'rb') as f:  # binary mode: pickle data is bytes
    realStuff = cPickle.load(f)

setupStatement="""\
from __main__ import writeReadPB, realStuff
"""

print 'writeRead: %s' % timeit.timeit("writeReadPB()", setup=setupStatement, number=10)

writeRead: 26.319947958


In [19]:
def writeReadPkl():
    serializedPkl = cPickle.dumps(realStuff, protocol=2)
    #print ('pickle string length: %s'%len(serializedPkl))
    rslt = cPickle.loads(serializedPkl)
    return rslt

setupStatement="""\
from __main__ import writeReadPkl, realStuff
"""

print 'writeRead:  %s' % timeit.timeit("writeReadPkl()", setup=setupStatement, number=10)

writeRead:  2.80171298981


In [21]:
def writeReadMSG():
    serializedMSG = msgpack.dumps(realStuff)
    #print ('MsgPack length: %s'%len(serializedMSG))
    rslt = msgpack.loads(serializedMSG)
    return rslt

setupStatement="""\
from __main__ import writeReadMSG
"""

print 'writeRead:  %s' % timeit.timeit("writeReadMSG()", setup=setupStatement, number=10)

writeRead:  2.12104797363


# MPI4PY (MPI for Python)
## MPI stands for Message Passing Interface

* MPI assigns a rank to each process and lets you send and receive messages/data between the various nodes in the cluster
* MPI also allows a program to be executed in parallel, with the nodes exchanging messages

In [None]:
import numpy as np
from mpi4py import MPI
import pickle
import timeit
import msgpack
import sys

with open('realstuff.pkl', 'rb') as f:
    realStuff = pickle.load(f)

def writeReadMSG():
    serializedMSG = msgpack.dumps(realStuff)
    rslt = msgpack.loads(serializedMSG)
    return rslt

comm = MPI.COMM_WORLD
name = MPI.Get_processor_name()
size = comm.Get_size()
rank = comm.Get_rank()

start = MPI.Wtime()

if rank == 0:
    data = writeReadMSG()
    comm.send(data, dest = 1)
    print ("From rank", rank, "we sent:", len(data))

elif rank == 1:
    data = comm.recv(source = 0)
    print ("on node", rank, "we received:", len(data))

end = MPI.Wtime()
print (end - start)

In [None]:
('From rank', 0, 'we sent:', 100000)
0.332388162613
('on node', 1, 'we received:', 100000)
0.25637793541

# Acknowledgements
* Eli Zenkov
* Jay DePasse
* Shawn Brown
* Leila Haidari
* Joel Welling
* Jenn Bakal
* Jim Leonard
* Dave Kapcin