Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

https://github.com/pnavaro/big-data/blob/master/03.ParallelComputation.ipynb

[Display on nbviewer](http://nbviewer.jupyter.org/github/pnavaro/big-data/blob/master/03.ParallelComputation.ipynb)



# Parallel Computation

## Parallel computers
- Multiprocessor/multicore: several processors work on data stored in shared memory
- Cluster: several processor/memory units work together by exchanging data over a network
- Co-processor: a general-purpose processor delegates specific tasks to a special-purpose processor (GPU, Xeon Phi,...)


## Parallel Programming
- Decomposition of the complete task into independent subtasks and the data flow between them.
- Distribution of the subtasks over the processors minimizing the total execution time.
- For clusters: distribution of the data over the nodes minimizing the communication time.
- For multiprocessors: optimization of the memory access patterns minimizing waiting times.
- Synchronization of the individual processes.
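
The steps above can be illustrated with a minimal sketch that runs sequentially. The helper names `split` and `subtask` are hypothetical and only show the shape of a decomposition: the task is cut into independent subtasks, each subtask could be handed to a different processor, and the partial results are combined at a single synchronization point.

```python
def split(data, n_parts):
    """Decompose data into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks receive one additional element
        stop = start + size + (1 if i < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


def subtask(chunk):
    """Independent subtask: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


data = list(range(10))
chunks = split(data, 3)                       # decomposition
partials = [subtask(c) for c in chunks]       # each call could run on its own processor
total = sum(partials)                         # synchronization: combine partial results

print(chunks)    # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(partials)  # [14, 77, 194]
print(total)     # 285
```

Here the subtasks share no data, so no communication is needed until the final sum.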

## MapReduce

In [8]:
from time import sleep
def f(x):
    sleep(1)
    return x*x
L = list(range(8))
L

[0, 1, 2, 3, 4, 5, 6, 7]

In [9]:
%time sum([f(x) for x in L])

CPU times: user 933 µs, sys: 1.12 ms, total: 2.05 ms
Wall time: 8.01 s


140

In [10]:
%time sum(map(f,L))

CPU times: user 937 µs, sys: 987 µs, total: 1.92 ms
Wall time: 8.02 s


140
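
The two cells above apply a function to every element and then combine the results with `sum`. A small sketch of the same pattern with an explicit reduce step, using `functools.reduce` from the standard library, may make the two phases easier to see. The function `square` stands in for `f` without the `sleep` call.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


L = list(range(8))

mapped = map(square, L)          # map step: one independent call per element
total = reduce(add, mapped, 0)   # reduce step: fold the results into one value

print(total)  # 140
```

Only the map step is trivially parallel, since each call is independent of the others.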

## Multiprocessing 

<p>
<font color=red> This first part with multiprocessing does not work
    on Windows </font>
    </p>

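The `multiprocessing` module allows the programmer to fully leverage multiple processors.
- The `Pool` object parallelizes the execution of a function across multiple input values.
- The `if __name__ == '__main__'` guard is necessary in a script: platforms that start workers with `spawn` (Windows, macOS) import the main module in each worker, and the guard keeps them from creating pools of their own.
- The `Pool` class provides a `map` method that automatically partitions the input and distributes it to a user-specified function running in a pool of worker processes.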

In [11]:
from multiprocessing import cpu_count

cpu_count()

8

In [12]:
%%time 
from multiprocessing import Pool

if __name__ == '__main__': # Executed only on main process.
    with Pool() as p:
        print(sum(p.map(f, L))) # Apply f on L sequence and sum


140
CPU times: user 13.2 ms, sys: 22.7 ms, total: 35.9 ms
Wall time: 1.07 s


- `Pool()` launches one worker process per CPU reported by `cpu_count()`, which counts logical cores rather than physical processors.
- `pool.map(...)` divides the input list into chunks and puts the tasks (function + chunk) on a queue.
- Each worker process takes a task (function + a chunk of data), runs `map(function, chunk)`, and puts the result on a result list.
- `pool.map` on the main process waits until all tasks are handled and returns the concatenation of the result lists.
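
A minimal sketch, written as a standalone script rather than a notebook cell, can make the chunking visible. The helper `square_with_pid` is hypothetical, and the values `processes=2` and `chunksize=4` are arbitrary choices for illustration. `chunksize` is the optional third argument of `Pool.map`.

```python
import os
from multiprocessing import Pool


def square_with_pid(x):
    """Return the square of x and the id of the process that computed it."""
    return x * x, os.getpid()


def main():
    with Pool(processes=2) as pool:
        # chunksize=4: the 8 inputs are queued as 2 tasks of 4 items each
        results = pool.map(square_with_pid, range(8), chunksize=4)
    for value, pid in results:
        print(value, "computed by process", pid)


if __name__ == "__main__":
    main()
```

The printed values keep the order of the input, whichever worker handled each chunk, because `pool.map` concatenates the result lists in order.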

### Exercise 3.1

- Use the `paragraph` function from the `lorem` module to create a text
- Create a list of words from it
- Use `map` function from `multiprocessing.Pool` to compute each word length
- Compare time with sequential version.


In [13]:
%%time
from lorem import paragraph

words_list = paragraph().lower().replace('.','').split()
print(*map(len,words_list))

10 6 5 5 6 2 7 7 5 7 10 5 3 7 3 11 5 5 7 6 5 4 11 7 4 3 4 2 5 3 7 10 5 6 7 4 7 7 2 7 11 4 8 10 7
CPU times: user 2.88 ms, sys: 2.12 ms, total: 5.01 ms
Wall time: 3.55 ms


In [14]:
%%time 
from multiprocessing import Pool

if __name__ == '__main__': # Executed only on main process.
    with Pool() as p:
        results = p.map(len, words_list)  # Apply len to each word of words_list

print(*results)

10 6 5 5 6 2 7 7 5 7 10 5 3 7 3 11 5 5 7 6 5 4 11 7 4 3 4 2 5 3 7 10 5 6 7 4 7 7 2 7 11 4 8 10 7
CPU times: user 13.4 ms, sys: 22.2 ms, total: 35.7 ms
Wall time: 134 ms


## Thread and Process: Differences

- A process is an instance of a running program.
- A process may contain one or more threads, but a thread cannot contain a process.
- A process has a self-contained execution environment, with its own memory space.
- An application running on your computer may be a set of cooperating processes.

- A thread exists within a process; every process has at least one.
- Multiple threads in a process share resources, which makes communication between threads efficient.
- Threads can run concurrently on a multi-core system, with each core executing a separate thread simultaneously.
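
The shared memory of threads can be shown with a short sketch using the standard `threading` module. The names `counter` and `add_many` are hypothetical. Four threads update the same dictionary, and a `Lock` protects the update so that no increment is lost.

```python
import threading

counter = {"value": 0}      # one object, visible to every thread of the process
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times."""
    for _ in range(n):
        with lock:          # only one thread at a time modifies the counter
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 4000
```

Separate processes running the same function would each increment their own copy of `counter`, since every process has its own memory space.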




## The Global Interpreter Lock (GIL)

- The Python interpreter is not thread safe.
- A few critical internal data structures may only be accessed by one thread at a time. Access to them is protected by the GIL.
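- Attempts at removing the GIL failed for a long time, mainly because of the difficulty of maintaining the C API for extension modules. Since Python 3.13 an optional free-threaded build (PEP 703) can run without the GIL, but the default build still uses it.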
- Multiprocessing avoids the GIL by having separate processes which each have an independent copy of the interpreter data structures.
- The price to pay: serialization of tasks, arguments, and results.
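
The serialization cost mentioned in the last point comes from `pickle`: a task and its argument are turned into bytes, sent to another process, and rebuilt there. The sketch below performs that round trip inside a single process, purely for illustration. It also shows why an anonymous function cannot be sent to a worker process.

```python
import pickle

# A task is a function plus its argument. Builtins are pickled by reference (by name).
task = (len, "lorem")
payload = pickle.dumps(task)        # what would travel to the worker process
func, arg = pickle.loads(payload)   # what the worker would rebuild
result = func(arg)

print(type(payload).__name__, result)  # bytes 5

# A lambda has no importable name, so the standard pickle module refuses it.
try:
    pickle.dumps(lambda x: x * x)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False

print(lambda_picklable)  # False
```

Threads skip this step entirely because they share the memory of their process.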

## Futures

The `concurrent.futures` module provides a high-level interface for asynchronously executing callables.



The asynchronous execution can be performed with:
- **threads**, using `ThreadPoolExecutor`,
- separate **processes**, using `ProcessPoolExecutor`.

Both implement the same interface, which is defined by the abstract `Executor` class.
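
Besides `map`, the `Executor` interface offers `submit`, which returns a `Future` object for each call. The sketch below uses `ThreadPoolExecutor` with a hypothetical `square` function and collects the results with `as_completed` in the order in which they finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x):
    return x * x


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit returns immediately with a Future; remember which input it belongs to
    futures = {executor.submit(square, x): x for x in range(8)}
    results = {}
    for future in as_completed(futures):     # yields futures as they finish
        results[futures[future]] = future.result()

total = sum(results.values())
print(total)  # 140
```

Since both executors share the interface, the same code with `ProcessPoolExecutor` differs only in the class name, provided the submitted function can be pickled.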

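`concurrent.futures` is available on Windows, but `ProcessPoolExecutor` cannot run functions defined interactively in a notebook there, because the spawned worker processes cannot import them. Windows users can install
[loky](https://github.com/tomMoral/loky), which provides a compatible `ProcessPoolExecutor`.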

In [15]:
#!pip install loky  # Windows users will need to install loky

In [16]:
%%time
from concurrent.futures import ProcessPoolExecutor
# from loky import ProcessPoolExecutor  # for Windows users
e = ProcessPoolExecutor()

results = sum(e.map(f, L))
print(results)

140
CPU times: user 12.3 ms, sys: 22.4 ms, total: 34.7 ms
Wall time: 1.03 s


In [17]:
%%time
from concurrent.futures import ThreadPoolExecutor
e = ThreadPoolExecutor()

results = sum(e.map(f, L))
print(results)

140
CPU times: user 3.73 ms, sys: 3.78 ms, total: 7.51 ms
Wall time: 1.01 s


### Exercise 3.2

Use `ProcessPoolExecutor` to compute each word length.

In [18]:
from lorem import text
from concurrent.futures import ProcessPoolExecutor
#from loky import ProcessPoolExecutor  # for Windows

texte = text()

e = ProcessPoolExecutor(4)
word_lengths = e.map(len, texte.split())
print(*word_lengths)

5 10 11 7 6 3 6 3 5 10 7 4 7 7 3 10 10 4 10 8 7 6 5 5 11 6 5 7 6 7 3 4 3 6 4 3 7 7 5 8 5 8 4 7 8 10 8 10 6 10 11 7 4 2 5 8 8 7 4 4 3 5 5 8 8 6 7 5 7 5 4 2 7 4 7 10 7 6 3 5 3 3 12 7 5 3 4 8 8 7 10 6 7 7 5 5 8 7 7 5 6 12 2 7 2 7 8 6 4 6 10 6 7 8 3 4 6 7 4 7 5 5 6 3 8 5 5 3 5 7 6 3 3 10 7 3 11 10 5 3 3 8 10 6 7 7 8 5 4 7 7 10 9 5 5 5 6 11 7 6 7 5 10 10 6 3 3 5 7 11 10 5 7 11 6 7 7 5 7 6 7 5 4 7 12 3 10 4 4 8 3 7 7 5 5 4 4 7 5 7 6 7 5 6 5 3 5 10 5 6 8 3 4 2 3 9 5 8 3 6 8 10 5 3 10 3 4 7 8 8 6 6 4 11 5 6 4 2 5 7 7 5 8


### Exercise 3.3

Same as exercise 3.2 but use `ThreadPoolExecutor`.

In [19]:
from lorem import text
from concurrent.futures import ThreadPoolExecutor

texte = text()

e = ThreadPoolExecutor(4)
word_lengths = e.map(len, texte.split())
print(*word_lengths)

2 5 5 3 3 10 8 5 7 10 4 5 2 8 8 7 3 5 7 6 6 6 7 3 7 5 3 5 4 7 4 7 6 9 7 2 4 2 3 4 7 5 6 5 6 8 6 6 5 8 5 3 6 6 8 4 3 7 3 11 5 6 3 3 3 12 5 5 10 4 7 6 4 7 3 8 7 6 5 3 6 6 4 5 5 5 10 3 6 7 6 7 5 7 6 7 7 7 10 5 7 7 5 7 5 5 3 9 11 2 11 3 4 6 5 5 7 8 3 3 5 3 6 11 8 5 5 4 8 5 3 5 10 6 11 4 7 6 6 8 10 7 6 7 3 3 3 7 2 8 9 7 4 8 7 7 10 4 5 8 10 6 6 11 6 10 5 3 5 3 6 7 7 3 3 6 3 6 4 7 7 7 8 4 5 7 6 7 4 12 4 8 5 6 6 3 6 5 8 5 5 10 6 12 5 7 5 10 4 5 7 3 4 5 3 5 5 5 6 5 2 3 4 3 6 10 4 7 4 7 3 3 11 5 7 3 7 6 7 8 6 7 7 5 11 7 4 5 6 7 5 3 6 5 3 5 3 11 6 3 4 6 7 5 7 3 9 4 6 8 3 6 5 4 3 5 4


# Map

This version of the `words` function contains some improvements and prints the
name of the process in which the function is executed.

In [20]:
import string
import multiprocessing as mp  # Windows users should comment this line
def words_mp(file):
    """
    Check if file is utf8
    Read a text file and return a sorted list of (word, 1) values.
    """
    # Windows users should comment this line below
    print(mp.current_process().name, 'reading', file)
    translator = str.maketrans('', '', string.punctuation)
    output = []
    try:
        with open(file) as f:
            for line in f:   
                line = line.strip()
                line = line.translate(translator)
                for word in line.split():
                    if word.isalpha():
                        word = word.lower()
                        output.append((word, 1))
                        
    except UnicodeDecodeError as err:
        print("Some error occurred decoding file %s: %s" % (file, err))
                
    output.sort()
    return output

words_mp('sample.txt')

MainProcess reading sample.txt


[('adipisci', 1),
 ('adipisci', 1),
 ('adipisci', 1),
 ('adipisci', 1),
 ('adipisci', 1),
 ('adipisci', 1),
 ('adipisci', 1),
 ('aliquam', 1),
 ('aliquam', 1),
 ('aliquam', 1),
 ('amet', 1),
 ('amet', 1),
 ('amet', 1),
 ('consectetur', 1),
 ('consectetur', 1),
 ('consectetur', 1),
 ('consectetur', 1),
 ('dolor', 1),
 ('dolor', 1),
 ('dolor', 1),
 ('dolore', 1),
 ('dolore', 1),
 ('dolore', 1),
 ('dolore', 1),
 ('dolore', 1),
 ('dolore', 1),
 ('dolorem', 1),
 ('dolorem', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('eius', 1),
 ('est', 1),
 ('est', 1),
 ('est', 1),
 ('est', 1),
 ('etincidunt', 1),
 ('etincidunt', 1),
 ('etincidunt', 1),
 ('etincidunt', 1),
 ('ipsum', 1),
 ('ipsum', 1),
 ('ipsum', 1),
 ('labore', 1),
 ('labore', 1),
 ('labore', 1),
 ('labore', 1),
 ('magnam', 1),
 ('magnam', 1),
 ('modi', 1),
 ('modi', 1),
 ('modi', 1),
 ('modi', 1),
 ('neque', 1),
 ('neque', 1),
 ('neque', 1),
 ('neque', 1),
 ('nequ

# Partition
Before the parallel reduce operation, the data must be grouped by key in a container. We create a function named `partition_mp` that stores the key/value pairs from `words_mp` in a `defaultdict` from the `collections` module. The output has the form:
[('word1', [1, 1]), ('word2', [1]), ('word3', [1, 1, 1])]

In [21]:
import collections
def partition_mp(mapped_values):
    """
        Organize the mapped values by their key.
        Returns an unsorted sequence of tuples 
        with a key and a sequence of values.
    """
    partitioned_data = collections.defaultdict(list)
    for key, value in mapped_values:
        partitioned_data[key].append(value)
    return partitioned_data.items()

In [22]:
partition_mp(words_mp('sample.txt'))

MainProcess reading sample.txt


dict_items([('adipisci', [1, 1, 1, 1, 1, 1, 1]), ('aliquam', [1, 1, 1]), ('amet', [1, 1, 1]), ('consectetur', [1, 1, 1, 1]), ('dolor', [1, 1, 1]), ('dolore', [1, 1, 1, 1, 1, 1]), ('dolorem', [1, 1]), ('eius', [1, 1, 1, 1, 1, 1, 1, 1, 1]), ('est', [1, 1, 1, 1]), ('etincidunt', [1, 1, 1, 1]), ('ipsum', [1, 1, 1]), ('labore', [1, 1, 1, 1]), ('magnam', [1, 1]), ('modi', [1, 1, 1, 1]), ('neque', [1, 1, 1, 1, 1, 1]), ('non', [1, 1, 1, 1, 1, 1]), ('numquam', [1, 1, 1, 1, 1, 1]), ('porro', [1, 1, 1, 1, 1, 1, 1, 1]), ('quaerat', [1, 1, 1]), ('quiquia', [1, 1, 1, 1, 1]), ('quisquam', [1, 1, 1]), ('sed', [1, 1]), ('sit', [1, 1, 1, 1, 1, 1, 1]), ('tempora', [1, 1, 1, 1, 1]), ('ut', [1, 1, 1]), ('velit', [1]), ('voluptatem', [1, 1])])
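
As an alternative sketch, not the approach used in this notebook, the same grouping can be done with `itertools.groupby` when the pairs are already sorted by key, which is the case for the output of a single `words_mp` call. The function name `partition_sorted` and the toy input are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def partition_sorted(mapped_values):
    """Group (key, value) pairs that are already sorted by key."""
    return [(key, [value for _, value in group])
            for key, group in groupby(mapped_values, key=itemgetter(0))]


pairs = sorted([("ut", 1), ("amet", 1), ("ut", 1), ("sit", 1), ("amet", 1), ("ut", 1)])
print(partition_sorted(pairs))
# [('amet', [1, 1]), ('sit', [1]), ('ut', [1, 1, 1])]
```

`groupby` only merges adjacent equal keys, so the `defaultdict` version remains the safer choice when the lists of several files are chained without sorting them again.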

# Reduce

In [23]:
def reduce_mp(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurrences.
    """
    word, occurrences = item
    return (word, len(occurrences))

In [24]:
for occurrences in partition_mp(words_mp('sample.txt')):
    print(reduce_mp(occurrences))

MainProcess reading sample.txt
('adipisci', 7)
('aliquam', 3)
('amet', 3)
('consectetur', 4)
('dolor', 3)
('dolore', 6)
('dolorem', 2)
('eius', 9)
('est', 4)
('etincidunt', 4)
('ipsum', 3)
('labore', 4)
('magnam', 2)
('modi', 4)
('neque', 6)
('non', 6)
('numquam', 6)
('porro', 8)
('quaerat', 3)
('quiquia', 5)
('quisquam', 3)
('sed', 2)
('sit', 7)
('tempora', 5)
('ut', 3)
('velit', 1)
('voluptatem', 2)


In [28]:
%%time
import itertools   
import glob

# Sequential version

filenames = glob.glob("sample0*.txt")
mapped_values = map(words_mp, filenames)
partitioned_data = partition_mp(itertools.chain(*mapped_values))
results = map(reduce_mp, partitioned_data)

MainProcess reading sample00.txt
MainProcess reading sample01.txt
MainProcess reading sample02.txt
MainProcess reading sample03.txt
MainProcess reading sample04.txt
MainProcess reading sample05.txt
MainProcess reading sample06.txt
MainProcess reading sample07.txt
CPU times: user 5.63 ms, sys: 4.34 ms, total: 9.96 ms
Wall time: 7.02 ms
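
One caveat when reading this timing: the built-in `map` is lazy in Python 3. The map step above is consumed by `itertools.chain(*mapped_values)`, but `results = map(reduce_mp, ...)` does no work until `results` is iterated, so the reduce step is not part of the measured time. The sketch below shows the laziness with a hypothetical `traced_square` function that records its calls.

```python
calls = []


def traced_square(x):
    calls.append(x)      # record that the function really ran
    return x * x


lazy = map(traced_square, range(4))
print(len(calls))        # 0: nothing has been computed yet

values = list(lazy)      # iterating forces the computation
print(len(calls))        # 4
print(values)            # [0, 1, 4, 9]
```

Wrapping the last `map` call in `list(...)` inside the timed cell would include the reduce step in the measurement.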


In [33]:
%%time
import itertools   
import glob
from concurrent.futures import ThreadPoolExecutor

# Parallel version
e = ThreadPoolExecutor()
filenames = glob.glob("sample0*.txt")
mapped_values = e.map(words_mp, filenames)
partitioned_data = partition_mp(itertools.chain(*mapped_values))
results = e.map(reduce_mp, partitioned_data)

MainProcessMainProcess MainProcess MainProcessMainProcessreading reading   reading readingreadingsample00.txt sample01.txt  
sample02.txt
sample04.txtsample05.txtMainProcess


 MainProcessMainProcessreading   readingreadingsample03.txt  
sample06.txtsample07.txt

CPU times: user 11.5 ms, sys: 8.62 ms, total: 20.2 ms
Wall time: 17.1 ms


In [34]:
print(*results)

('adipisci', 48) ('aliquam', 50) ('amet', 48) ('consectetur', 58) ('dolor', 64) ('dolore', 60) ('dolorem', 63) ('eius', 59) ('est', 57) ('etincidunt', 73) ('ipsum', 59) ('labore', 68) ('magnam', 55) ('modi', 58) ('neque', 53) ('non', 57) ('numquam', 63) ('porro', 49) ('quaerat', 55) ('quiquia', 60) ('quisquam', 64) ('sed', 53) ('sit', 73) ('tempora', 53) ('ut', 49) ('velit', 50) ('voluptatem', 54)


### Exercise 3.4

Write a parallel program that uses the three functions above with the `multiprocessing` module. It reads all the "sample\*.txt" files. Some hints:
- The map and reduce steps are parallel.
- See how `itertools.chain(*mapped_values)` is used in notebook exercise 01.6.
- Compare the execution time with the notebook 01 version.

### Exercise 3.5

- Replace `multiprocessing` by `concurrent.futures` functions.
- Try  `ProcessPoolExecutor` and `ThreadPoolExecutor`

For multi-process computations you can use either `multiprocessing.Pool` or the `concurrent.futures` executors, which behave more or less identically.

However, most library designers now coordinate around the second interface, so it is wise to move over.

`concurrent.futures.ProcessPoolExecutor` is suitable for simple parallelism across many files and you gain some speed boost. 
Describing each task as a function call helps use tools like map for parallelism.
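
A minimal sketch of this "one task = one function call" style is given below. It is self-contained, so it creates a few hypothetical demo files in a temporary directory. The function `count_words` is an assumption made for illustration, and `ThreadPoolExecutor` is used here only to keep the sketch easy to run in a notebook.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_words(path):
    """One task: count the whitespace-separated words of one file."""
    return len(Path(path).read_text(encoding="utf-8").split())


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical demo files, created only for this sketch
    paths = []
    for i in range(4):
        path = Path(tmp) / f"demo{i}.txt"
        path.write_text("lorem ipsum dolor " * (i + 1), encoding="utf-8")
        paths.append(path)

    with ThreadPoolExecutor(max_workers=4) as executor:
        # map calls count_words once per file and keeps the input order
        counts = list(executor.map(count_words, paths))

print(counts)  # [3, 6, 9, 12]
```

Because the executors share one interface, replacing `ThreadPoolExecutor` with `ProcessPoolExecutor` requires no other change when the code runs as a script.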

## Increase volume of data

### Getting the data

[The Latin Library](http://www.thelatinlibrary.com/) contains a huge collection of freely accessible Latin texts. We collect the links on the Latin Library's homepage, ignoring those that are not associated with a particular author.




In [35]:
from bs4 import BeautifulSoup
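from urllib.request import Request, urlopen
from urllib.error import HTTPError

base_url = "http://www.thelatinlibrary.com/"
request = Request(base_url)

# Behind a proxy only: uncomment and set the host and port of your proxy.
# set_proxy is a method of Request, not of the response returned by urlopen.
# proxy_host = '192.168.192.17:8080'
# request.set_proxy(proxy_host, 'http')

home_content = urlopen(request)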

soup = BeautifulSoup(home_content, "lxml")
author_page_links = soup.find_all("a")
author_pages = [ap["href"] for i, ap in enumerate(author_page_links) if i < 49]

Create a list of all links pointing to Latin texts. The Latin Library uses a special format which makes it easy to find the corresponding links: All of these links contain the name of the text author.

In [36]:
ap_content = list()
for ap in author_pages:
    ap_content.append(urlopen(base_url + ap))

In [37]:
book_links = list()
for path, content in zip(author_pages, ap_content):
    author_name = path.split(".")[0]
    ap_soup = BeautifulSoup(content, "lxml")
    book_links += ([link for link in ap_soup.find_all("a", {"href": True}) if author_name in link["href"]])

print(book_links[:5])

[<a href="ammianus/14.shtml">Liber XIV</a>, <a href="ammianus/15.shtml">Liber XV</a>, <a href="ammianus/16.shtml">Liber XVI</a>, <a href="ammianus/17.shtml">Liber XVII</a>, <a href="ammianus/18.shtml">Liber XVIII</a>]


In [38]:
texts = list()
num_pages = 100

for i, bl in enumerate(book_links[:num_pages]):
    print("Getting content " + str(i + 1) + " of " + str(num_pages), end="\r", flush=True)
    try:
        content = urlopen(base_url + bl["href"]).read() 
        texts.append(content)
    except HTTPError as err:
        print("Unable to retrieve " + bl["href"] + ".")
        continue

Getting content 100 of 100

We split the text at periods to convert it into sentences.

In [39]:
%%time
sentences = list()

for i, text in enumerate(texts):
    print("Document " + str(i + 1) + " of " + str(len(texts)), end="\r", flush=True)
    textSoup = BeautifulSoup(text, "lxml")
    paragraphs = textSoup.find_all("p", attrs={"class":None})
    prepared = ("".join([p.text.strip().lower() for p in paragraphs[1:-1]]))
    for t in prepared.split("."):
        part = "".join([c for c in t if c.isalpha() or c.isspace()])
        sentences.append(part.strip())

print(sentences[200])

infamabat autem haec suspicio latinum domesticorum comitem et agilonem tribunum stabuli atque scudilonem scutariorum rectorem qui tunc ut dextris suis gestantes rem publicam colebantur
CPU times: user 7.44 s, sys: 87.7 ms, total: 7.53 s
Wall time: 7.51 s


### Exercise 3.6

Parallelize this last step using `concurrent.futures`.
Do not try to print out too much text, the notebook will raise an error.

## References

- [Using Conditional Random Fields and Python for Latin word segmentation](https://medium.com/@felixmohr/using-python-and-conditional-random-fields-for-latin-word-segmentation-416ca7a9e513)