# Multithreading and Multiprocessing

## Threading
* typically, concurrency is created so that we can do some task while I/O is happening (e.g., a server can start processing a new request while waiting for data from a previous request to arrive)
* we can create objects that appear to run independently and simultaneously
* the job of threading is to keep an application responsive
* CPython, the default implementation of Python, has a Global Interpreter Lock (GIL), which prevents more than one thread from executing Python bytecode at a time; instead of running in parallel, your threads take turns, with CPU time rationed among them

## Simple Threading Example

In [4]:
from threading import Thread

class InputReader(Thread):
    """Thread example, extends Thread class"""

    def run(self):
        """
        Whatever is in the run method (or called from
        it) is executed in a separate thread
        """
        self.line_of_text = input('Enter some text: ')

input('Are you ready? When you hit return the thread will start.')
thread = InputReader() # create thread object
thread.start() # cf. thread.run() for no concurrency

count, result = 1, 1

while thread.is_alive():
    result = count * count
    count += 1

print('calculated squares up to {0:_} * {0:_} = {1:,}'
      .format(count, result))
print('while you typed "{}"'.format(thread.line_of_text))

Are you ready? When you hit return the thread will start.
Enter some text: hello
calculated squares up to 1 * 1 = 1
while you typed "hello"


In [17]:
# Does not work inside Jupyter (but I think it does now)
from threading import Thread
import json
from urllib.parse import quote
from urllib.request import urlopen
import time

cities = ['Boulder', 'Atlanta', 'San Francisco',
          'Reno', 'Honolulu', 'Zurich', 'Dubai',
          'Dublin', 'Hyderabad', 'Rome']

class TempGetter(Thread):
    def __init__(self, city):
        """Initialize our thread

        In the previous example, our class which extended
        Thread did not need an __init__ method because
        there was no per-thread information to store, so
        the __init__ method from the superclass (Thread)
        was called automatically. Here, because we need to
        store per-thread information (the city), we have to
        explicitly call the __init__ method of Thread.
        """
        super().__init__()
        self.city = city

    def run(self):
        url_template = (
            'http://api.openweathermap.org/data/2.5/'
            'weather?q={}&units=imperial'
            '&APPID=10d4440bbaa8581bb8da9bd1fbea5617')
        # quote() percent-encodes the space in names like 'San Francisco'
        response = urlopen(url_template.format(quote(self.city)))
        data = json.loads(response.read().decode())
        self.temperature = data['main']['temp']
        
threads = [TempGetter(c) for c in cities] # creates 10 threads
start = time.time()

# start all 10 threads
for thread in threads:
    thread.start() # not run()

# wait for all 10 threads to complete
for thread in threads:
    thread.join()

for thread in threads:
    print("it is {0.temperature:.0f}°F in {0.city}"
          .format(thread))
print("Got {} temps in {} seconds"
      .format(len(threads), time.time() - start))

it is 57°F in Boulder
it is 47°F in Atlanta
it is 56°F in San Francisco
it is 39°F in Reno
it is 76°F in Honolulu
it is 33°F in Zurich
it is 74°F in Dubai
it is 56°F in Dublin
it is 81°F in Hyderabad
it is 44°F in Rome
Got 10 temps in 0.4813971519470215 seconds


In [None]:
%%bash
python3 getweather.py

## Threading (cont'd)
* the main problem with threads is also their primary advantage: shared memory
 * all threads have access to all the memory
 * what if two threads access the same data at the same time?
* synchronization is the solution, but it's tricky
 * bugs due to incorrect synchronization can be very difficult to find because they depend on the order in which the threads happen to run
* one solution is to force communication between threads to occur using a data structure that has built-in locking, such as `queue.Queue`
* these disadvantages can be outweighed by the fact that sharing memory is fast, although the GIL still keeps CPU-bound threads from running in parallel
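
As a minimal sketch of the `queue.Queue` approach mentioned above: worker threads receive tasks from one queue and report results through another, so no thread touches shared data directly. The names `worker` and `square_all`, and the use of `None` as a stop signal, are assumptions made for this illustration.

```python
import queue
import threading


def worker(tasks, results):
    """Square numbers taken from `tasks` until a None sentinel arrives."""
    while True:
        item = tasks.get()  # blocks until something is available
        if item is None:    # sentinel: no more work for this thread
            break
        results.put((item, item * item))


def square_all(numbers, num_workers=3):
    """Square every number using worker threads that talk only via queues."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for number in numbers:
        tasks.put(number)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    # all workers have finished, so draining the queue here is safe
    squares = {}
    while not results.empty():
        number, square = results.get()
        squares[number] = square
    return squares


if __name__ == "__main__":
    print(square_all(range(1, 6)))
```

The queues do all the locking internally, so the worker code contains no explicit synchronization.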

## Lab: threads
* create a program which uses threads to simulate a database server
* your "database server" should simply be a thread which sleeps for a random interval (check out __`time.sleep()`__ and __`random.randint()`__ if you're not familiar with them)
* your main thread should get input from the user and respond to it (perhaps reversing the input given by the user) while the database thread is busy

## Lab Solution: threads
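
One possible solution is sketched below; it is not necessarily the intended one. The class name `FakeDatabase`, the function `serve`, and the 1 to 5 second sleep range are all assumptions made for this sketch. The `read_line` and `write` parameters default to `input` and `print` and exist only so the loop can be exercised without a keyboard.

```python
import random
import threading
import time


class FakeDatabase(threading.Thread):
    """Pretend database server: sleeps for a random interval."""

    def __init__(self, min_seconds=1, max_seconds=5):
        super().__init__()
        self.min_seconds = min_seconds
        self.max_seconds = max_seconds
        self.result = None

    def run(self):
        time.sleep(random.randint(self.min_seconds, self.max_seconds))
        self.result = "query finished"


def serve(db, read_line=input, write=print):
    """Echo reversed user input for as long as the database thread is busy."""
    db.start()
    while db.is_alive():
        text = read_line("Type something (the database is busy): ")
        write(text[::-1])
    db.join()
    write("database says: {}".format(db.result))
```

Calling `serve(FakeDatabase())` from a script runs the lab interactively. Because `input()` blocks, the final message appears only after the user answers the last prompt.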

# Multiprocessing
* the Python multiprocessing library is designed for cases where CPU-bound jobs need to happen in parallel and multiple cores are available
* advantages
 * separate memory space for each process
 * code is usually more straightforward than threaded code
 * avoids GIL limitation
 * eliminates synchronization (assuming no shared memory)

## A Simple Multiprocessing Example

In [24]:
from multiprocessing import Process, cpu_count
import time
import os

class MuchCPU(Process):
    def run(self):
        print(os.getpid()) # get process ID
        print(__name__)
        for i in range(40_000_000):
            result = i * i
            
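# the guard keeps child processes from re-running this block
# on platforms that start them by importing the main module
if __name__ == '__main__':
    print('Running...')
    procs = [MuchCPU() for _ in range(cpu_count())]
    t = time.time()
    for p in procs:
        p.start()

    for p in procs:
        p.join()

    print('work took {} seconds'.format(time.time() - t))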

Running...
16978
16979
__main__
16980
__main__
16981
__main__
__main__
work took 5.753094911575317 seconds


## Multiprocessing (cont'd)
* for CPU-bound work, there is no reason to run more processes than there are processors
 * only `cpu_count()` procs can run simultaneously
 * each proc consumes resources with a full copy of the Python interpreter
 * interproc communication is expensive
 * creating procs takes a nonzero amount of time
* so we create at most `cpu_count()` processes when the program starts and have them execute tasks as needed
* a basic set of communicating processes that does this is easy to implement, but tricky to debug, test, and get right; we don't have to do all this work because the Python developers have already done it for us: multiprocessing pools


## Multiprocessing Pools
* pools abstract away the overhead of figuring out what code is running in the main process and what code is running in a subprocess
* the abstraction restricts the number of places where code in different processes interacts, making it easier to keep track of
* pools also hide the passing of data between processes
 * using a pool looks much like a function call: you pass data into a function, it's executed in another process or processes, and when the work is complete, a value is returned
 * under the hood, a lot of work is being done to support this: objects in one process are pickled (serialized) and passed into a pipe, then another process retrieves the data from the pipe and unpickles it. The work is done in the subprocess and a result is produced. The result is pickled and passed into a pipe. Eventually, the original process unpickles it and returns it.
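
The round trip described above can be imitated by hand. This is a simplified sketch, not the actual `Pool` implementation: both ends of each pipe live in one process here, and the function name `simulate_pool_call` is an assumption made for illustration.

```python
import math
import pickle
from multiprocessing import Pipe


def simulate_pool_call(func, arg):
    """Imitate a pool call: pickle the task, pipe it, unpickle it, run it,
    then send the pickled result back the same way."""
    # with duplex=False, the first connection receives and the second sends
    task_recv, task_send = Pipe(duplex=False)
    result_recv, result_send = Pipe(duplex=False)

    # "main process" side: serialize the task and write it to the pipe
    task_send.send_bytes(pickle.dumps((func, arg)))

    # "subprocess" side: read the task, unpickle it, and do the work
    task_func, task_arg = pickle.loads(task_recv.recv_bytes())
    result_send.send_bytes(pickle.dumps(task_func(task_arg)))

    # "main process" side again: unpickle the result and return it
    result = pickle.loads(result_recv.recv_bytes())

    for conn in (task_recv, task_send, result_recv, result_send):
        conn.close()
    return result


if __name__ == "__main__":
    print(simulate_pool_call(math.factorial, 5))  # prints 120
```

Note that functions are pickled by reference (module and name), which is why a pool can only run functions that the subprocess is able to import.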

## Multiprocessing Pool Example

In [30]:
import random
import math
import os
from multiprocessing.pool import Pool

def prime_factor(value):
    factors = []
    #print('prime_factor(', value, ')', os.getpid())
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors

if __name__ == '__main__':
    to_factor = [
        random.randint(100_000, 50_000_000)
        for i in range(40)
    ]
    print(to_factor)
    # the context manager shuts the pool down when the block exits
    with Pool() as pool:
        results = pool.map(prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))

[48417975, 43517475, 7961327, 36301960, 14874335, 17444282, 45065960, 42746163, 11305911, 659649, 15357121, 8619180, 12898027, 4694526, 29508322, 1360013, 37393428, 43559324, 49225351, 17697069, 11221067, 9216451, 15089572, 31818672, 44944394, 40783181, 26075596, 26034684, 3754459, 3659960, 148419, 13152877, 43831494, 7289907, 36747565, 10114097, 24020989, 36448377, 23441654, 41362295]
The factors of 48417975 are [3, 3, 5, 5, 215191]
The factors of 43517475 are [3, 3, 5, 5, 269, 719]
The factors of 7961327 are [11, 31, 37, 631]
The factors of 36301960 are [2, 2, 2, 5, 907549]
The factors of 14874335 are [5, 7, 67, 6343]
The factors of 17444282 are [2, 173, 50417]
The factors of 45065960 are [2, 2, 2, 5, 1126649]
The factors of 42746163 are [3, 149, 95629]
The factors of 11305911 are [3, 29, 129953]
The factors of 659649 are [3, 31, 41, 173]
The factors of 15357121 are [13, 31, 53, 719]
The factors of 8619180 are [2, 2, 3, 5, 143653]
The factors of 12898027 are [53, 163, 1493]
The facto

## Lab: Multiprocessing Pool
* write a program to compute 1!…48! using a multiprocessing pool
* won't be much of a parallelism example, but it's easy to code
* use previous example as a template

## Multiprocessing Issues/What Else
* primary drawback: sharing data between processes is expensive, since all communication between processes requires serializing (pickling) the data
* what we didn't cover
 * futures: objects that wrap threading or multiprocessing depending on what kind of concurrency we need (I/O vs. CPU)
 * AsyncIO: current state of the art in Python concurrent programming
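
For a taste of futures, here is a minimal sketch using the standard library's `concurrent.futures` module. The helper name `run_all` is an assumption made for illustration. The point is that `ThreadPoolExecutor` and `ProcessPoolExecutor` share one interface, so the kind of concurrency becomes a single argument.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def run_all(executor_class, func, values, max_workers=4):
    """Submit func(value) for every value and collect results in order."""
    with executor_class(max_workers=max_workers) as executor:
        # submit() returns a Future immediately; result() blocks until done
        futures = [executor.submit(func, value) for value in values]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # threads suit I/O-bound work; for CPU-bound work, pass
    # concurrent.futures.ProcessPoolExecutor as executor_class instead
    print(run_all(ThreadPoolExecutor, math.factorial, [3, 4, 5]))
    # prints [6, 24, 120]
```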