<h1> Threading / MultiProcessing in Python </h1>

Quick Overview:
- Thread: A set of instructions that can be managed independently by a scheduler. Threads within one process share the same namespace. 
- Threading: launches one or many thread(s), in the same namespace. 
    - Threading is useful for I/O-bound tasks
- Process: An instance of a program that is being executed. Has its own namespace. 
- Multiprocessing: launches one or many new process(es), each with its own namespace. 
    - MultiProcessing is useful for CPU-bound tasks
    
Note that Python (specifically CPython, the standard interpreter) has a 'Global Interpreter Lock (GIL)', which synchronizes threads and ensures that only one thread executes Python code at a time. This means that multiple threads don't actually run across cores in parallel, so threading won't really help with a task for which the CPU is the limiting factor. For I/O tasks it can help, because the CPU can run other threads while a particular thread is waiting on an I/O operation.
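
As a rough illustration of the I/O case, here is a minimal sketch in which `time.sleep` stands in for an I/O wait (an assumption made for the demo; the function names `fake_io`, `run_sequential` and `run_threaded` are made up here). It is written to run on both Python 2 and Python 3.

```python
import threading
import time


def fake_io(delay):
    # time.sleep stands in for an I/O wait; the GIL is released while sleeping
    time.sleep(delay)


def run_sequential(n, delay):
    # Run the waits one after another and return the elapsed time
    start = time.time()
    for _ in range(n):
        fake_io(delay)
    return time.time() - start


def run_threaded(n, delay):
    # Run the same waits in n threads and return the elapsed time
    start = time.time()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish
    return time.time() - start


sequential = run_sequential(4, 0.2)
threaded = run_threaded(4, 0.2)
print("sequential: %.2fs, threaded: %.2fs" % (sequential, threaded))
```

The sequential version waits for each sleep in turn, while the threaded version overlaps the waits. A CPU-bound function in place of `fake_io` would not be expected to show the same gain.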

Module Docs:
- MultiProcessing: https://docs.python.org/2/library/multiprocessing.html
- Threading: https://docs.python.org/2/library/threading.html

Relevant overviews of parallel computing for Python:
- http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html
- http://chriskiehl.com/article/parallelism-in-one-line/

Notes on differences:
- http://blogs.datalogics.com/2013/09/25/threads-vs-processes-for-program-parallelization/
- http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python

Notes: 
- Here I do not do extensive timing tests to figure out if, when, and how much speed is actually provided by these methods. 
- If you are looking for a quick and easy way to get going with parallelization, the pool/map approach in 'multiprocessing' is probably best/easiest

In [1]:
# Python Multi-Threading / Multiprocessing
import thread             # A basic, low-level module for starting threads
import threading          # A more powerful module for multi-threading
import multiprocessing    # A module for multiprocessing
import time               # A module for using time

In [2]:
# Q: How many cores do I have?
print 'I have this many cores:', multiprocessing.cpu_count()

I have this many cores: 4


<h3> Thread Module </h3>

'thread' is a basic, low-level module for using threads. 

It's simple to use, but has fewer capabilities than 'threading'. 

Logic: Use 'start_new_thread' to open a new thread. It takes a function as its first parameter and a tuple of arguments for that function as its second parameter. The call opens a new thread and launches the given function with the given arguments. 

In [4]:
# Define a test function to pass to threads.
def test_function(num, thread_name):
    time.sleep(num)
    print 'Finished', thread_name

In [4]:
# Launch a couple threads
thread.start_new_thread(test_function, (2,'Thread_1'))
thread.start_new_thread(test_function, (5, 'Thread_2'))

123145329688576
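
The number printed above is the thread identifier returned by 'start_new_thread'; the call gives you no object to wait on. One possible way to make the main program wait for such a thread is to use a lock from the same module. This is only a sketch: `worker`, `results` and `done` are illustrative names, and the `_thread` fallback is there so the snippet also runs on Python 3, where the module was renamed.

```python
try:
    import thread                # Python 2 name
except ImportError:
    import _thread as thread     # Python 3 name (assumed fallback)

import time

results = []
done = thread.allocate_lock()    # low-level lock used as a "finished" signal


def worker(label, delay):
    time.sleep(delay)
    results.append(label)
    done.release()               # signal the main thread that we are finished


done.acquire()                   # hold the lock before launching the thread
thread.start_new_thread(worker, ('Thread_A', 0.1))
done.acquire()                   # blocks here until worker() releases the lock
print(results)
```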

<h3> Threading Module </h3>

'threading' is another module for using threads, with more capabilities (but also more overhead). 

Logic: The core of 'threading' is the Thread class, which is used to define thread objects that you then get to do your bidding. 
- First, define a subclass of the Thread class to set up threads that will do what you want them to do
    - Do this by overriding the run method
- Now, you can initialize and run threads that will do your custom jobs
    - Note that initializing a thread does NOT start it running

Notes:
- Using a 'Lock' is a way to enforce synchronization between threads
- The Queue module provides queues that can be used to pass work between threads and organize them

Example adapted from:
http://www.tutorialspoint.com/python/python_multithreading.htm

In [5]:
# Define a function to use for Threading example
def print_time(name, delay, counter):
    while counter:
        time.sleep(delay)
        print "%s: %s" % (name, time.ctime(time.time()))
        counter -= 1

In [6]:
# Define a subclass of the thread class that implements threads set up as you want them

class myThread (threading.Thread):
    
    # Initialize the thread, with ID, name and counter
    # Note that parameters can be passed when initializing the thread to be used in functions called in 'run'.
    #  Basically: parameters are passed when initializing, not when running the thread.
    # ^ In this case, for example, the 'delay' parameter.
    def __init__(self, threadID, name, delay):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.delay = delay
        
    # Override the run method to describe custom behaviour for your thread
    # In this case, each thread prints start & end, and runs the 'print_time' function. 
    def run(self):
        print "Starting " + self.name + ". "
        
        # Do something (this can be anything)
        print_time(self.name, self.delay, 3)
        
        print "Ending " + self.name + ". "

In [7]:
# Initialize threads - create new threads, providing the parameters you require based on __init__
#   ^ In this case, an ID number, a name, and a delay time. 
# Initializing threads does not automatically run them. 
thread1 = myThread(1, "John", 1  )
thread2 = myThread(2, "Paul", 1.5)

In [8]:
# Start (run) threads. They will automatically terminate when the 'run' method ends. 
thread1.start()
thread2.start()

Starting John. 
Starting Paul. 


In [9]:
# Notice above that the output is intermixed, indicating they were running at the same time. 
# This is different than if you run them sequentially, in which case the second call only starts after the first has finished. 

print_time('George', 1  , 3)
print_time('Ringo' , 1.5, 3)

# Launching a new thread also doesn't pause the main thread.
# So if you launch this cell and the previous together, then you'll see what looks like them all running concurrently. 
# ^ Actually, there are 3 threads (2 extras, and 1 main), so three jobs can run at once. 
#   George and Ringo both run in the main thread, so Ringo still has to wait for George. 

John: Sun May 22 12:56:57 2016
Finished Thread_1
George: Sun May 22 12:56:57 2016
Paul: Sun May 22 12:56:58 2016
John: Sun May 22 12:56:58 2016
George: Sun May 22 12:56:58 2016
Paul: Sun May 22 12:56:59 2016
John: Sun May 22 12:56:59 2016
Ending John. 
George: Sun May 22 12:56:59 2016
Finished Thread_2
Paul: Sun May 22 12:57:01 2016
Ending Paul. 
Ringo: Sun May 22 12:57:01 2016
Ringo: Sun May 22 12:57:02 2016
Ringo: Sun May 22 12:57:04 2016
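
If the main program needs to wait for threads and collect something from them, 'join' gives a definite wait point. The sketch below follows the subclassing pattern used above; `SumThread` and its `result` attribute are hypothetical names chosen for this illustration.

```python
import threading


class SumThread(threading.Thread):
    """Hypothetical Thread subclass that stores its result on the instance."""

    def __init__(self, lo, hi):
        threading.Thread.__init__(self)
        self.lo = lo
        self.hi = hi
        self.result = None

    def run(self):
        # The work done by the thread; the result is kept for later use
        self.result = sum(range(self.lo, self.hi))


t1 = SumThread(0, 1000)
t2 = SumThread(1000, 2000)
t1.start()
t2.start()

# 'join' blocks until the thread's run method has finished
t1.join()
t2.join()

total = t1.result + t2.result
print(total)
```

Reading `result` only after `join` avoids looking at a value the thread has not produced yet.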


In [10]:
# As done above, it's a bit haphazard what finishes when. 
# To gain control over scheduling and organization, you can use 'Lock' from the threading module. 

# This creates a Lock object that can be used to synchronize / organize scheduling
threadLock = threading.Lock()

# This acquires the lock. Any other thread that calls 'acquire' will block and wait until the lock is released. 
threadLock.acquire()

# This releases the lock, allowing a waiting thread to proceed. 
threadLock.release()

# Note that Threads can also be used with Queues
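
To make the acquire / release pattern concrete, here is a minimal sketch in which several threads update a shared counter. The names `counter`, `counter_lock` and `add_many` are made up for this example; the `with` statement is used as a shorthand for calling 'acquire' and 'release' around the update.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(n):
    global counter
    for _ in range(n):
        # 'with' acquires the lock on entry and releases it on exit,
        # so only one thread at a time performs the read-modify-write
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Without the lock, increments from different threads could in principle interleave and some updates could be lost.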

In [11]:
# Threading also has a number of housekeeping functions to keep track of threads

# 'active_count' returns the number of Thread objects currently alive
print "Active Count: \n", threading.active_count(), "\n"

# 'currentThread' returns the Thread object for the current thread. 
print "Current Thread: \n", threading.currentThread(), "\n"

# 'enumerate' returns a list of all Thread objects currently alive
print "Enumerate: \n", threading.enumerate(), "\n"

# Call 'isAlive' on a specific thread to check if it is still running
print "Is Alive: \n", thread1.isAlive()

Active Count: 
5 

Current Thread: 
<_MainThread(MainThread, started 140735299985408)> 

Enumerate: 
[<_MainThread(MainThread, started 140735299985408)>, <Thread(Thread-2, started daemon 123145307582464)>, <HistorySavingThread(IPythonHistorySavingThread, started 123145317068800)>, <Heartbeat(Thread-3, started daemon 123145311789056)>, <ParentPollerUnix(Thread-1, started daemon 123145321275392)>] 

Is Alive: 
False
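
The notes above mention that threads can be combined with queues. A common pattern, sketched here under assumptions, is a set of worker threads that pull jobs from one queue and push results to another. The names `worker`, `jobs`, `results` and `squares` are illustrative, `None` is used as a stop signal by convention only, and the import fallback is there because the module is called 'Queue' in Python 2 and 'queue' in Python 3.

```python
try:
    import Queue as queue    # Python 2 name
except ImportError:
    import queue             # Python 3 name

import threading


def worker(jobs, results):
    while True:
        item = jobs.get()            # blocks until a job is available
        if item is None:             # sentinel: no more work for this worker
            jobs.task_done()
            break
        results.put((item, item * item))
        jobs.task_done()


jobs = queue.Queue()
results = queue.Queue()

workers = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):
    jobs.put(n)
for _ in workers:
    jobs.put(None)                   # one sentinel per worker

jobs.join()                          # wait until every job is marked done
for w in workers:
    w.join()

# All workers have exited, so the results queue can be drained safely
squares = {}
while not results.empty():
    n, sq = results.get()
    squares[n] = sq
print(squares)
```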


<h3> MultiProcessing Module </h3>

MultiProcessing is similar in usage to Threading, but by spawning processes instead of threads, it sidesteps the GIL, offering real parallel computing across multiple cores. 
Basically: Useful if CPU is the bottleneck, and independent processes can be split across cores. 

Logic: 
- Initialize a new process (or Pool of processes), passing in functions to do and arguments to use
- Run the process
- Organize / keep track of processes with Pools, Queues and using map

Example from:
- http://stackoverflow.com/questions/21031372/python-multiprocessing-troubleshooting

In [6]:
# Similar to Threading, you can launch, and then run new processes

# Initialize a new process. This does not start the process running. 
#  For processes, the function & arguments to be run by the process are passed in at initialization. 
#  'args' must be a tuple. (Even if only one argument - for example: args=('lonely_argument',) )
test_process = multiprocessing.Process(target=print_time, args=('Test', 1.5, 3))

# Start the process. 
test_process.start()

# Like a new thread, a new process does not interrupt the flow of the main program. 

Test: Mon Jul  4 14:53:57 2016


In [7]:
# Like Threading, there are Housekeeping functions to keep track of what's what. 

# Check the name of a process
print "Name: ", test_process.name

# Check if a process is active
print "Active: ", test_process.is_alive()

Name:  Process-1
Active:  True
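
Processes also have a 'join' method and an 'exitcode' attribute, which together let the main program wait for a process and check how it ended. A minimal sketch, assuming a made-up `nap` function as the job; the `main` function and the `__name__` guard are a common convention for scripts that start processes, not something the cells above use.

```python
import multiprocessing
import time


def nap(seconds):
    # Stand-in job for the child process
    time.sleep(seconds)


def main():
    proc = multiprocessing.Process(target=nap, args=(0.2,))
    proc.start()
    alive_while_running = proc.is_alive()   # checked while the child is still sleeping

    proc.join()                              # wait for the child process to exit

    # After join, the process is no longer alive; exitcode 0 means a normal exit
    return alive_while_running, proc.is_alive(), proc.exitcode


if __name__ == "__main__":
    print(main())
```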


In [8]:
# Since each process exists in its own namespace, if there is output expected, you have to set up a way to collect it
# One way to do so is to set up a Queue, which is basically a list-like object that is used to collect things

# Queue sets up a queue object that processes can send their outputs to
q = multiprocessing.Queue()

In [9]:
# To use a Queue, set up a function such that instead of returning a variable, it sends it to the Queue object

# Test Function. Does a sum between given numbers and puts the answer on the Queue. 
def sum_queue(a, b):
    su = 0
    for i in range(a, b):
        su+=i
    q.put(su)

In [10]:
# Set up some processes
p1 = multiprocessing.Process(target=sum_queue, args=(1      , 1*10**6))
p2 = multiprocessing.Process(target=sum_queue, args=(1*10**6, 1*10**7))

In [11]:
# Run the processes
p1.start()
p2.start()

In [12]:
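# 'get' blocks until an item is available, so these calls wait for the processes to put their results on the queue
# Results come off the queue in the order they were put, which is not necessarily the order the processes were started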
r1 = q.get()
r2 = q.get()
print r1, r2, r1+r2

499999500000 49499995500000 49999995000000


In [13]:
# Note that Queue.get() retrieves and removes an item from the Queue
# Queue, in default mode, is First In, First Out (FIFO)
q.empty()

True
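
Because results arrive in completion order, it can help to tag each result with the job it belongs to. The sketch below is one way to do that, and it passes the queue to each process as an argument instead of relying on a global. The names `sum_to_queue`, `job_id` and `out_queue` are hypothetical, and the ranges are kept small so it runs quickly.

```python
import multiprocessing


def sum_to_queue(job_id, a, b, out_queue):
    # Put a (job_id, result) pair on the queue so the parent can tell results apart
    out_queue.put((job_id, sum(range(a, b))))


def main():
    out_queue = multiprocessing.Queue()
    jobs = [(0, 1, 1000), (1, 1000, 2000)]

    procs = [
        multiprocessing.Process(target=sum_to_queue, args=(job_id, a, b, out_queue))
        for job_id, a, b in jobs
    ]
    for p in procs:
        p.start()

    # Collect one result per process before joining
    results = dict(out_queue.get() for _ in procs)

    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(main())
```

Results are read from the queue before 'join' is called, so a child process is never left waiting to hand over data the parent has not read.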

### Using Pool

Another way to use multiprocessing is to use 'Pool'. 

Instead of having to manage many individual processes, Pool lets you set up a group of processes. 

In [14]:
# Test function. Same sum function as above, except it returns instead of sending to Queue. 
def sum_return(vec):
    su = 0
    for i in range(vec[0], vec[1]):
        su += i
    return su

In [18]:
# Start a pool with 2 worker processes
pool = multiprocessing.Pool(processes=2)

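# 'map' applies the function to each item in the list of jobs, splitting the jobs across the pool's processes
# The results are returned as a list, in the same order as the inputs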
results = pool.map(sum_return, [[1, 1*10**6], [1*10**6, 1*10**7]])

# Check the results
print results

[499999500000, 49499995500000]


In [19]:
# Another example function - for testing.
def f(x):
    return x*x, x*x*x

In [22]:
# You can set a pool with as many processes as you want. 
pool2 = multiprocessing.Pool(processes=4)

# The number of arguments you pass in does not have to match the number of processes you have. 
# The outputs from pool will be a list where each item in the list is the outputs from the function you mapped in. 
results = pool2.map(f, range(20000))

# There are also functions you can call on your pool for control flow. 
# 'close' prevents any more tasks from being submitted to the pool. Once finished, the worker processes will exit. 
pool2.close()

# 'join' waits for the worker processes to exit (pauses the main program). 
# ^ Must call 'close' or 'terminate' before join. 'terminate' stops workers immediately and quits. 
pool2.join()

# Check results
print results[0:10]

[(0, 0), (1, 1), (4, 8), (9, 27), (16, 64), (25, 125), (36, 216), (49, 343), (64, 512), (81, 729)]


In [21]:
# Example function - can be used for testing (naive recursive Fibonacci, assumes n >= 1). 
def fib(n):
    if n in (1, 2):
        return 1
    return fib(n-1) + fib(n-2)
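
One possible way to use 'fib' for testing is to map it over a few inputs with a Pool. This is a sketch only: the inputs, the pool size and the `main` wrapper are arbitrary choices for illustration, and the function is repeated so the snippet is self-contained.

```python
import multiprocessing


def fib(n):
    # Naive recursive Fibonacci (assumes n >= 1); deliberately CPU-heavy
    if n in (1, 2):
        return 1
    return fib(n - 1) + fib(n - 2)


def main():
    inputs = [20, 21, 22]
    pool = multiprocessing.Pool(processes=2)
    try:
        # Each input is handed to one of the worker processes;
        # results come back in the same order as the inputs
        parallel = pool.map(fib, inputs)
    finally:
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for the worker processes to exit
    return parallel


if __name__ == "__main__":
    print(main())
```

Larger inputs make the CPU cost more visible, at which point timing the Pool version against a plain loop shows whether the extra processes pay off on a given machine.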