# Python Concurrency
Part of this lecture is based on the material by [Dr. Gregory Watson](https://nyu-cds.github.io/python-itertools/)

For this class you will need to register at **imgur.com**:
- create an account at [imgur.com](https://imgur.com)
- register your application [here](https://api.imgur.com/oauth2/addclient)
  - Authorization Type: __OAuth 2 authorization with a callback URL__
  - Authorization Callback URL: __https://www.getpostman.com/oauth2/callback__
  - email:
  - Description:

---
The functions below fetch a list of images from the __imgur__ repository and download them. We will split the functionality into three separate functions:
- get_links
- download_link
- setup_download_dir

In [10]:
import json
import os
from pathlib import Path
from urllib.request import urlopen, Request
from time import time

types = {'image/jpeg', 'image/png', 'image/gif'}

def get_links(client_id):
    '''used to obtain a list of available images'''
    headers = {'Authorization': 'Client-ID {}'.format(client_id)}
    req = Request('https://api.imgur.com/3/gallery/random/random/', headers=headers, method='GET')
    with urlopen(req) as resp:
        data = json.loads(resp.read().decode('utf-8'))
    return [item['link'] for item in data['data'] if 'type' in item and item['type'] in types]

def download_link(directory, link):
    '''downloads the image given by the URL _link_ into _directory_'''
    download_path = directory / os.path.basename(link)
    with urlopen(link) as image, download_path.open('wb') as f:
        f.write(image.read())

def setup_download_dir():
    '''creates a download destination directory if it doesn’t already exist'''
    download_dir = Path('images')
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

In [12]:
#from download import setup_download_dir, get_links, download_link

# Replace with your own client ID
CLIENT_ID = '7d7d80b0e389c61'

ts = time()
download_dir = setup_download_dir()
links = get_links(CLIENT_ID)
for link in links:
    download_link(download_dir, link)
print('Took {}s'.format(time() - ts))

Took 27.127142906188965s


To improve the performance of the image downloader we can run multiple copies of the program at the same time. 

However, the copies would need to coordinate so that no process downloads an image that another process has already downloaded. Fortunately, the multiprocessing module takes care of distributing the work for us.

To use multiple processes we need a multiprocessing _Pool_. The Pool class provides a map method that applies a function to every element of a supplied iterable, spreading the calls across its worker processes. The iterable is divided into a number of chunks, so that each process gets roughly the same number of elements.

In our example, we will pass the list of URLs to the pool, which, in turn, will start 8 new processes and use each one to download the images in parallel.

In [21]:
from functools import partial
from multiprocessing.pool import Pool

def multi_processes_download():
   ts = time()
   download_dir = setup_download_dir()
   links = get_links(CLIENT_ID)

   # functools.partial makes a new version of a function 
   # with one or more arguments already filled in
   download = partial(download_link, download_dir)
   
   with Pool(8) as p:
       p.map(download, links)
   print('Took {}s'.format(time() - ts))

multi_processes_download()

Took 9.001381874084473s
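
The two ideas the pool example relies on, filling in an argument with functools.partial and mapping a one-argument function over a list, can be tried without starting any processes. The sketch below is illustrative only: save_item is a hypothetical stand-in for download_link and touches neither the network nor the disk.

```python
from functools import partial

def save_item(directory, name):
    """Hypothetical stand-in for download_link: returns the path it would write."""
    return f"{directory}/{name}"

# Fix the first argument; the result is a callable that takes only the name.
save_to_images = partial(save_item, "images")

names = ["a.jpg", "b.png", "c.gif"]

# The built-in map applies the function to each element in order, in one process.
# Pool.map has the same shape but spreads the calls across worker processes.
paths = list(map(save_to_images, names))
print(paths)
```

Because Pool.map passes exactly one element of the iterable to each call, any extra arguments have to be bound beforehand, which is why the downloader wraps download_link with partial.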


Although easy to implement, this form of parallelism has some drawbacks:
- each process holds its own copy of the program's memory
- it does not handle tasks that depend on each other

These issues can be tackled with shared memory and message passing mechanisms, which we will cover in later lessons.

---
## Using Threads

Threading is a well-known approach to attaining concurrency:
- typically threads are lighter weight than processes
- lower memory requirements, as they share the same memory space

We will rely on the __Thread class__, which provides a _run_ method that a subclass overrides to do the actual work of the thread.
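
As a minimal sketch of overriding the run method, the class below squares a number in its own thread. SquareWorker and its attributes are hypothetical names chosen for this illustration; they are not part of the downloader.

```python
from threading import Thread

class SquareWorker(Thread):
    """Hypothetical worker that squares one number in its own thread."""

    def __init__(self, value):
        super().__init__()
        self.value = value
        self.result = None

    def run(self):
        # run() holds the work; it executes in the new thread after start().
        self.result = self.value ** 2

workers = [SquareWorker(n) for n in range(5)]
for w in workers:
    w.start()   # start() creates the thread, which then calls run()
for w in workers:
    w.join()    # wait for each thread to finish before reading its result

results = [w.result for w in workers]
print(results)
```

Note that the code calls start(), not run(): calling run() directly would execute the work in the current thread.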

### Thread Safety
Variables in the program are shared by all the threads, so they cannot safely be read and updated the way they would be in a single-threaded program. One thread may change a variable while another thread is reading it, or worse, two threads may try to update the variable at the same time. This is known as a race condition. It is one of the leading sources of errors in threaded programs and needs to be addressed properly.
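
A lock is the most direct tool against a race condition. The sketch below is an illustration under assumed names (counter, counter_lock, and add_many are not from the lecture code): four threads increment a shared counter, and the lock ensures that each read-modify-write completes before another thread starts its own.

```python
from threading import Lock, Thread

counter = 0
counter_lock = Lock()   # one lock object, shared by every thread

def add_many(times):
    global counter
    for _ in range(times):
        # Only one thread at a time can be inside this block, so the
        # read-modify-write of counter cannot be interleaved.
        with counter_lock:
            counter += 1

threads = [Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4 threads x 10,000 increments
```

The lock must be created once and shared. A lock created inside the loop would be a different object on every iteration and would protect nothing.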

A way to deal with thread safety without writing the locking yourself is the __Queue class__, which handles the required locking internally.

In [22]:
# Understanding Queue 

from queue import Queue

def do_stuff(q):
    while not q.empty():
        print(q.get())
        q.task_done()  # this is important when combining Queue with Threads

q = Queue()

for x in range(20):
    q.put(x)

do_stuff(q)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


Here is a simpler threaded example before going back to the image downloader code:

In [24]:
# in this example each thread prints an element of the queue

from queue import Queue
from threading import Thread
import logging  

#logging.basicConfig(format='(%(threadName)-9s) %(message)s',)

def do_stuff(q):
  while True:
    m = q.get()
    #logging.debug(str(m)+' ')
    print(str(m)+' ')
    q.task_done()

q = Queue()
num_threads = 10

for i in range(num_threads):
  worker = Thread(target=do_stuff, args=(q,), name='t'+str(i))
  worker.daemon = True   # daemon threads are stopped when the program quits
  worker.start()         # start the thread

for x in range(20):
  q.put(x)

q.join() # wait until every queued item has been processed

2 3 6 0 1 4 5 7 8 10 









12 18 13 16 17 9 





11 14 

15 
19 
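
The workers above loop forever and only stop because they are daemon threads. An alternative, sketched here under assumed names (STOP, processed, and worker are hypothetical), is to put one sentinel value per worker on the queue so that each thread leaves its loop and can be joined.

```python
from queue import Queue
from threading import Thread

STOP = None          # hypothetical sentinel value meaning "no more work"
processed = []       # list.append is thread-safe in CPython

def worker(q):
    while True:
        item = q.get()
        if item is STOP:
            q.task_done()
            break            # leave the loop so the thread can end
        processed.append(item * 2)
        q.task_done()

q = Queue()
threads = [Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()

for x in range(20):
    q.put(x)
for _ in threads:
    q.put(STOP)      # one sentinel per worker, queued after all the real work

q.join()             # every item, sentinels included, has been marked done
for t in threads:
    t.join()         # every worker has actually exited

print(sorted(processed))
```

Because the queue is first-in, first-out, the sentinels are only reached after all real items have been taken, and each worker consumes exactly one of them.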


In [22]:
from queue import Queue
from threading import Thread

class DownloadWorker(Thread):
   def __init__(self, queue):
       Thread.__init__(self)
       self.queue = queue

   def run(self):
       while True:
           # Get the work from the queue and expand the tuple
           directory, link = self.queue.get()
           download_link(directory, link)
           self.queue.task_done()

def threaded_download():
   ts = time()
   download_dir = setup_download_dir()
   links = get_links(CLIENT_ID)
   # Create a queue to communicate with the worker threads
   queue = Queue()
   # Create 8 worker threads
   for x in range(8):
       worker = DownloadWorker(queue)
       # Setting daemon to True will let the main thread exit 
       # even though the workers are blocking
       worker.daemon = True
       worker.start()
   # Put the tasks into the queue as a tuple
   for link in links:
       print('Queueing {}'.format(link))
       queue.put((download_dir, link))
   # Causes the main thread to wait for the queue to finish processing all the tasks
   queue.join()
   print('Took {}'.format(time() - ts))

threaded_download()

Queueing https://i.imgur.com/oc5WY.jpg
Queueing https://i.imgur.com/oc92wCQ.jpg
Queueing https://i.imgur.com/ocMWKr5.jpg
Queueing http://i.imgur.com/ocXFDU8h.gif
Queueing https://i.imgur.com/oct46Ys.jpg
Queueing https://i.imgur.com/ocAx4.jpg
Queueing http://i.imgur.com/oc8Nhm3h.gif
Queueing https://i.imgur.com/ocR4jMt.gif
Queueing https://i.imgur.com/oc3nwC7.png
Queueing https://i.imgur.com/ocGrCaF.jpg
Queueing http://i.imgur.com/ocIF4Ejh.gif
Queueing https://i.imgur.com/ocPW0Uy.jpg
Queueing https://i.imgur.com/ocsmD0k.jpg
Queueing https://i.imgur.com/ocnrU5N.jpg
Queueing https://i.imgur.com/ocIVqwb.gif
Queueing https://i.imgur.com/oczZ2g4.jpg
Queueing https://i.imgur.com/ocu63.jpg
Queueing http://i.imgur.com/ocvAsDah.gif
Queueing https://i.imgur.com/ocCkzFi.jpg
Queueing https://i.imgur.com/ocEz20y.jpg
Queueing https://i.imgur.com/oc7Nlae.jpg
Queueing https://i.imgur.com/ocAzLRy.jpg
Queueing https://i.imgur.com/ocox0pe.jpg
Queueing https://i.imgur.com/ocSNQNn.gif
Queueing https://i.img

#### Not really parallel!
Python (more precisely CPython, the standard implementation) has a Global Interpreter Lock (GIL), which allows only one thread to execute Python code at a time within a process. Therefore, this code is concurrent but not parallel. It is still faster because the image downloader is an input/output (I/O) bound task. The processor is not breaking a sweat while downloading these images, and the majority of the time is spent waiting for the network. This is why threading can provide a large speed increase: the processor can switch between the threads whenever one of them is ready to do some work.

If the program were performing a CPU-bound task, using the threading module in Python, or in any other language implementation with a GIL, could actually result in reduced performance. For CPU-bound tasks and truly parallel execution in Python, the multiprocessing module is a better option.
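
The effect on an I/O-bound task can be reproduced without the network. In this sketch, fake_download is a hypothetical stand-in that only sleeps, and ThreadPoolExecutor from the standard library is used in place of the hand-written Thread and Queue code, purely to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep, time

def fake_download(i):
    """Hypothetical stand-in for a network request: it only waits."""
    sleep(0.1)       # a sleeping thread releases the GIL, as a waiting socket does
    return i

# Sequential: the waits add up.
ts = time()
sequential = [fake_download(i) for i in range(8)]
sequential_time = time() - ts

# Threaded: the eight waits overlap.
ts = time()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_download, range(8)))
threaded_time = time() - ts

print('sequential: {:.2f}s, threaded: {:.2f}s'.format(sequential_time, threaded_time))
```

The threaded run finishes in roughly the time of a single wait because the threads spend almost all of their time blocked, not executing Python code.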

---
#### Sum of array elements in parallel

In [8]:
# Sequential version
from time import time

n = int(1e6)

ts = time()
s = 0
for i in range(n):
    s = s + i
print(s, '-->', time()-ts, 's')  # time() returns seconds

499999500000 --> 0.11732816696166992 s


In [9]:
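# Thread version
from queue import Queue
from threading import Thread
from threading import Lock
from time import time

n = int(1e6)
x = 0

def sum_chunk(q):
    global x
    while True:
        m = q.get()
        for i in m:
            # x = x + i is a read-modify-write and is not atomic, so updates
            # from different threads can be lost (a race condition), which is
            # why the total printed below falls short of 499999500000.
            # A single Lock shared by all threads would be needed here;
            # writing "with Lock():" creates a new lock on every iteration
            # and synchronizes nothing.
            x = x + i
        q.task_done()

chunks = [list(range(i,i+100)) for i in range(0,n,100)]
ts = time()

q = Queue()
num_threads = 10

for i in range(num_threads):
  worker = Thread(target=sum_chunk, args=(q,))
  worker.daemon = True   # daemon threads are stopped when the program quits
  worker.start()         # start the thread

for i in chunks:
  q.put(i)

q.join()
print(x, '-->', time()-ts, 's')  # time() returns seconds

484944095455 --> 0.27000904083251953 s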
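
The thread version prints a total that differs from the sequential result because updates to the shared variable are lost. One possible correction, sketched here with assumed names (total, total_lock, and sum_chunk_safely are not from the original cell), sums each chunk into a local variable and takes a single shared lock only for the final update.

```python
from queue import Queue
from threading import Lock, Thread

n = int(1e6)
total = 0
total_lock = Lock()      # created once and shared by all workers

def sum_chunk_safely(q):
    global total
    while True:
        chunk = q.get()
        partial_sum = sum(chunk)   # local to this thread, no lock needed
        with total_lock:           # one short locked update per chunk
            total += partial_sum
        q.task_done()

q = Queue()
for _ in range(10):
    Thread(target=sum_chunk_safely, args=(q,), daemon=True).start()

for start in range(0, n, 100):
    q.put(range(start, start + 100))

q.join()                 # wait until every chunk has been processed
print(total)
```

Taking the lock once per chunk instead of once per element keeps the time spent waiting for the lock small. The GIL still prevents this CPU-bound loop from running in parallel, so the lock restores correctness, not speed.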


In [14]:
# Multiprocessing version
from time import time
from multiprocessing.pool import Pool

n = int(1e6)

def sum_multi_processes(chunk):
    y = 0
    for i in chunk:
        y = y + i
    return y

chunks = [list(range(i,i+100)) for i in range(0,n,100)]

ts = time()
with Pool(10) as p:
    results = p.map(sum_multi_processes, chunks)

print(sum(results), '-->', time()-ts, 's')  # time() returns seconds

499999500000 --> 0.14169001579284668 s
