
# How to make things faster !!! (using your PC)
* Multiprocessing

Multiprocessing refers to the ability of a system to use more than one processor at the same time. Applications in a multiprocessing system are broken into smaller routines that run independently. The operating system allocates these routines to the processors, which improves the performance of the system.
* without multiprocessing, the operating system decides which CPU runs which task on your computer. For example, one CPU runs your notebook while another one runs Firefox (and even with these two tasks the CPUs keep switching: CPU0 and CPU1 may exchange tasks)

* a very old concept (C, Fortran, etc.): if you really need speed for physics and simulations, you have two options
  * MPI (Message Passing Interface): you can run your code on the processors of different machines distributed over a network; more complicated, but great
  * OpenMP (Open Multi-Processing): an application programming interface that is much simpler to implement, but runs only on the local machine
  
  * all data for this class is here: https://mega.nz/folder/LAcGHJ4I#_uJ79tPCc4i5uWa0ps2itQ

#### Example Fortran OpenMP code:

```
program omp_par_do
  implicit none

  integer, parameter :: n = 100
  real, dimension(n) :: dat, result
  integer :: i

  !$OMP PARALLEL DO
  do i = 1, n
     result(i) = my_function(dat(i))
  end do
  !$OMP END PARALLEL DO

contains

  function my_function(d) result(y)
    real, intent(in) :: d
    real :: y

    ! do something complex with data to calculate y
    y = d  ! placeholder so that y is defined; replace with the real calculation
  end function my_function

end program omp_par_do
```

* multiprocessing in Python

As CPU manufacturers keep adding more cores to their processors, writing parallel code is a great way to improve performance. Python provides the `multiprocessing` module for writing parallel code. To understand the main motivation for this module, we need to know some basics of parallel programming. After reading this section, you should have some working knowledge of the topic.

# Python `Pool` vs `Process`

Crash course. Python multiprocessing is a separate topic and it may take time to learn. So this is just a brief introduction with a couple of examples:

* The `Pool` (like a swimming pool): lets each worker process handle multiple jobs, which may make it easier to parallelize your program. If you have a million tasks to execute in parallel, you can create a `Pool` with as many processes as CPU cores and then pass the list of the million tasks to `pool.map`. The pool distributes those tasks to the worker processes (typically as many as there are available cores), collects the return values in a list, and passes it to the parent process. Launching a million separate processes would be much less practical (it would probably break your OS). `Pool` distributes the tasks by itself.


* The `Process`: on the other hand, if you have a small number of tasks to execute in parallel, and each task only needs to run once, it may be perfectly reasonable to use a separate `multiprocessing.Process` for each task rather than setting up a `Pool`. `Process` is controlled manually.


* How do they compare: a `Pool` keeps only the currently executing tasks in memory, while with `Process` every task gets its own process in memory at once. So when the number of tasks is small we can use the `Process` class, and when it is large we can use a `Pool`. With a large number of tasks, using `Process` may cause memory problems and destabilize the system. A `Pool`, in turn, has some overhead when it is created, so with a small number of tasks using a `Pool` can hurt performance.

* multiprocessing can be used in science for simulating differential equations and for linear algebra (splitting arrays into parts), but also in data science.
* as this is a data science course, let's look at examples related to the topic

Problem: Imagine you have 4 (or 400) large files that you want to analyze. They are around 200MB each and you want to find the sum of all the values in each file. You can do it
* sequentially: load the files one by one, find the sum of each, then calculate the sum of sums

# This is slow

In [3]:
%%time
import pandas as pd

def run(task,L):
    df = pd.read_csv(task)
    total_sum=df.sum().sum()
    L.append(total_sum)

L= []
tasks = ['file1.csv','file1.csv','file1.csv','file1.csv']
#tasks = ['file1.csv','file1.csv','file1.csv','file1.csv','file1.csv','file1.csv','file1.csv','file1.csv']
for i in tasks:
    run(i,L)
print(L)   

[50000039990980.88, 50000039990980.88, 50000039990980.88, 50000039990980.88]
CPU times: user 8.87 s, sys: 933 ms, total: 9.8 s
Wall time: 10.9 s
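
`%%time` is an IPython/Jupyter cell magic, so it is not available in a plain Python script. A minimal sketch of measuring wall time there, assuming a hypothetical helper `timed` and a stand-in workload (neither is part of the original notebook):

```python
import time

def timed(func, *args):
    """Call func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Stand-in workload (hypothetical): sum the first million integers.
total, seconds = timed(sum, range(1_000_000))
print(f"total={total}, wall time={seconds:.3f} s")
```

`time.perf_counter()` measures elapsed wall-clock time, which corresponds to the "Wall time" line that `%%time` prints.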


* or you can do it using multiprocessing
* you can split the work across the CPUs so that they do the tasks simultaneously

![](imgs/parallel_code.png)

# `multiprocessing` - checking the number of CPUs

In [18]:
import multiprocessing

In [19]:
print("Number of cpu : ", multiprocessing.cpu_count())

Number of cpu :  4
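
`multiprocessing.cpu_count()` reports the logical CPUs of the machine, the same number as `os.cpu_count()`. On shared machines or inside containers a process may be allowed to use fewer of them. A small sketch of checking both, assuming Linux for the `os.sched_getaffinity` call (the variable names are only illustrative):

```python
import os

# Logical CPUs installed in the machine (may be None if it cannot be determined).
total_cpus = os.cpu_count()

# CPUs this process is actually allowed to run on. This call exists only on
# some platforms (e.g. Linux), so fall back to the total elsewhere.
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = total_cpus or 1

print("installed:", total_cpus, "usable:", usable_cpus)
```

The usable count is probably the better default for the `processes` argument of a pool.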


# Python  `map` function

* first without map

In [3]:
org_list = [1, 2, 3, 4, 5]
fin_list = []

def cube(num):
    return num**3

for num in org_list:
    fin_list.append(cube(num))

print(fin_list) # [1, 8, 27, 64, 125]

[1, 8, 27, 64, 125]


* now with `map`

In [4]:
org_list = [1, 2, 3, 4, 5]

def cube(num):
    return num**3
   
fin_list = list(map(cube, org_list))
print(fin_list)

[1, 8, 27, 64, 125]
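
The reason for looking at `map` first is that `Pool.map` has the same shape: a function and a list of inputs go in, a list of results comes out, but the calls are spread over worker processes. A minimal sketch with the same `cube` function; the wrapper `parallel_cubes` and the choice of 2 workers are assumptions for illustration:

```python
import multiprocessing as mp

def cube(num):
    return num ** 3

def parallel_cubes(numbers, workers=2):
    """Cube every number using a pool of worker processes."""
    with mp.Pool(processes=workers) as pool:
        # Like the built-in map, results come back in input order.
        return pool.map(cube, numbers)

if __name__ == "__main__":
    print(parallel_cubes([1, 2, 3, 4, 5]))  # [1, 8, 27, 64, 125]
```

For a function this cheap the pool is slower than the plain loop, because starting processes and moving data between them costs more than the arithmetic. The pattern pays off when each call does heavy work.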


# This is fast (`pool - map` - pool CPU)

In [26]:
%%time
import multiprocessing as mp
import pandas as pd

def run(task):
    df = pd.read_csv(task)
    total_sum=df.sum().sum()
    print(total_sum)
    
if __name__ == "__main__":
    pool = mp.Pool(processes=4)
    tasks = ['file1.csv','file2.csv','file3.csv','file4.csv']
    pool.map(run,tasks)
    pool.close()
    pool.join()

50000039990980.8850000039990980.8850000039990980.8850000039990980.88



CPU times: user 21 ms, sys: 28 ms, total: 49 ms
Wall time: 5.37 s


* but we would like to save the results somewhere

# This is fast (`pool - apply_async` - pool CPU)
* `apply_async` allows passing extra arguments, so we can save the results in a shared list

In [6]:
%%time
import multiprocessing as mp
import pandas as pd

def run(task,L):
    df = pd.read_csv(task)
    total_sum=df.sum().sum()
    L.append(total_sum)  # L is the shared manager list passed in as an argument
    
if __name__ == "__main__":
    manager = mp.Manager()
    L2 = manager.list()        
    
    pool = mp.Pool(processes=8)
    tasks = ['file1.csv','file1.csv','file1.csv','file1.csv']
    # submit one asynchronous job per file
    for n in tasks:
        pool.apply_async(run, args=(n, L2))
    pool.close()
    pool.join()
    print(L2)

[50000039990980.88, 50000039990980.88, 50000039990980.88, 50000039990980.88]
CPU times: user 20 ms, sys: 37 ms, total: 57 ms
Wall time: 5.39 s


In [30]:
sum(list(L2))

200000159963923.53
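
A shared `Manager` list is one way to collect results. Another is to let the worker return its value and read it from the `AsyncResult` object that `apply_async` hands back. A minimal sketch under that assumption; `total_of` and `parallel_totals` are hypothetical names, and small in-memory lists stand in for the CSV files:

```python
import multiprocessing as mp

def total_of(values):
    """Stand-in for the per-file work: return the sum of one batch."""
    return sum(values)

def parallel_totals(batches, workers=2):
    with mp.Pool(processes=workers) as pool:
        # apply_async returns immediately with an AsyncResult handle.
        handles = [pool.apply_async(total_of, args=(batch,)) for batch in batches]
        # get() blocks until that task finishes and re-raises any exception
        # raised in the worker, so failures are not silently lost.
        return [handle.get() for handle in handles]

if __name__ == "__main__":
    batches = [[1, 2, 3], [10, 20], [100]]
    totals = parallel_totals(batches)
    print(totals, sum(totals))  # [6, 30, 100] 136
```

Calling `get()` also has a debugging benefit: if `apply_async` results are never read, an error inside a worker (for example a missing file) goes unnoticed.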

# This is also fast (`process` - individual CPU)

In [31]:
%%time
import multiprocessing as mp
from multiprocessing import Process
import pandas as pd

def run(task,L):
    df = pd.read_csv(task)
    total_sum=df.sum().sum()
    L.append(total_sum)  # results go into the shared list; a Process target's return value is discarded

if __name__ == "__main__":
    manager = mp.Manager()
    L2 = manager.list()
    
    tasks = ['file1.csv','file1.csv','file1.csv','file1.csv']
    # one Process per task, each appending its result to the shared list
    processes = [Process(target=run, args=(task, L2)) for task in tasks]

    # start all processes first so that they run at the same time
    for p in processes:
        p.start()

    # then wait for every process to finish
    for p in processes:
        p.join()
        

    print ("Done")
    print(L2)

Done
[50000039990980.88, 50000039990980.88, 50000039990980.88, 50000039990980.88]
CPU times: user 15 ms, sys: 25 ms, total: 40 ms
Wall time: 5.25 s


# This is not so fast (`map - partial`)

# How to make things actually possible

Situation: I have a file that is 5GB in size, and I would not dare to open it on my computer right now. So how can I analyze it on my personal PC with only 4-8GB of RAM?
* pandas has functionality for this

* You don’t have to read it all

As an alternative to reading everything into memory, Pandas allows you to read data in chunks. In the case of CSV, we can load only some of the lines into memory at any given time.

In particular, if we use the `chunksize` argument to `pandas.read_csv`, we get back an iterator over DataFrames rather than one single DataFrame. For example, with `chunksize=1000` each DataFrame holds the next 1000 lines of the CSV.
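
A minimal sketch of that behaviour, using a tiny in-memory CSV as a hypothetical stand-in for a large file (the `price` column and the chunk size of 4 are chosen only for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows with a single price column.
csv_text = "price\n" + "\n".join(str(p) for p in range(1, 11))

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
total = 0
for chunk in reader:
    chunk_sizes.append(len(chunk))   # rows held in memory right now
    total += chunk["price"].sum()

print(chunk_sizes)  # [4, 4, 2]
print(total)        # 55
```

Only one chunk is in memory at a time, and the last chunk is simply shorter when the row count is not a multiple of `chunksize`.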

* it is good to check first how many lines the file has
  * Linux: `$ wc -l filename`
  * result: `42448765`, i.e. about 42 million records
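
Where `wc` is not available (for example on Windows), the same count can be sketched in Python without loading the file into memory. The helper `count_lines` and the 1 MB block size are assumptions, and a small temporary file stands in for the real data:

```python
import os
import tempfile

def count_lines(path, block_size=1024 * 1024):
    """Count newline characters in a file, reading one block at a time."""
    lines = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            lines += block.count(b"\n")
    return lines

# Small temporary file standing in for the real CSV.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("price\n1.5\n2.5\n")

n_lines = count_lines(tmp.name)
print(n_lines)  # 3
os.remove(tmp.name)
```

Like `wc -l`, this counts newline characters, so the header line is included in the result.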

https://mega.nz/folder/LAcGHJ4I#_uJ79tPCc4i5uWa0ps2itQ

# Fragment of the file - classic way

In [2]:
import pandas as pd
# read the small sample file in one go (no chunksize needed)
df_tiny = pd.read_csv('tiny.csv')
df_tiny.head()

| Unnamed: 0 | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-01 00:00:00 UTC | view | 44600062 | 2103807459595387724 |  | shiseido | 35.79 | 541312140 | 72d76fde-8bb3-4e00-8c23-a032dfed738c |
| 1 | 2019-10-01 00:00:00 UTC | view | 3900821 | 2053013552326770905 | appliances.environment.water_heater | aqua | 33.2 | 554748717 | 9333dfbd-b87a-4708-9857-6336556b0fcc |
| 2 | 2019-10-01 00:00:01 UTC | view | 17200506 | 2053013559792632471 | furniture.living_room.sofa |  | 543.1 | 519107250 | 566511c2-e2e3-422b-b695-cf8e6e792ca8 |
| 3 | 2019-10-01 00:00:01 UTC | view | 1307067 | 2053013558920217191 | computers.notebook | lenovo | 251.74 | 550050854 | 7c90fc70-0e80-4590-96f3-13c02c18c713 |
| 4 | 2019-10-01 00:00:04 UTC | view | 1004237 | 2053013555631882655 | electronics.smartphone | apple | 1081.98 | 535871217 | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |


In [33]:
df_tiny['price'].sum()

3379.2599999999998

# Fragment of the file - chunksize way

In [35]:
import pandas as pd
# read the small sample file one row at a time (chunksize=1)
df_chunk = pd.read_csv('tiny.csv', chunksize=1)

In [36]:
chunk_list = [] 
for chunk in df_chunk:  
    sum_chunk=chunk['price'].sum()
    print(sum_chunk)
    chunk_list.append(sum_chunk)

35.79
33.2
543.1
251.74
1081.98
908.62
380.96
41.16
102.71


In [37]:
sum(chunk_list)

3379.2599999999998

# Entire file

In [39]:
import pandas as pd
# read the large csv file with specified chunksize 
df_chunk = pd.read_csv('huge.csv', chunksize=1000000)
# wc -l filename
#42448765

In [41]:
chunk_list = [] 
for chunk in df_chunk:  
    sum_chunk=chunk['price'].sum()
    print(sum_chunk)
    chunk_list.append(sum_chunk)

295982470.51
297902352.25000006
301375830.57000005
297913145.99999994
303370516.4999999
295255042.6499999
301581415.9699999
301568463.49999976
294448155.87
271216902.56
282985331.38999987
285571944.2100001
290204096.22
279731917.1499999
283156063.01000017
279884552.8299999
275989306.26
293831560.08000016
299158011.86
295262903.45000017
293112490.53999996
293027094.5799999
288816986.2000001
290844328.4999999
288347991.2799999
288987846.3400001
285289654.45
291676301.64
285775966.9099999
286290020.6600001
289038436.53000003
288175044.58
288575785.63
287692333.57999986
288871527.39
289770061.83
292651105.28000015
291729216.77999985
288960815.64000016
288385370.20000005
287038985.34999996
291532345.50000006
132900863.8


In [42]:
sum(chunk_list)

12323880556.029999
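
Sums combine across chunks directly, as above. Other statistics need a little more care: for example, averaging the per-chunk means would give the wrong answer when the last chunk is shorter. A minimal sketch of computing an overall mean from running totals, again with a hypothetical in-memory CSV in place of `huge.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in data: five prices, read two rows at a time.
csv_text = "price\n" + "\n".join(str(p) for p in [10.0, 20.0, 30.0, 40.0, 50.0])

running_sum = 0.0
running_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    running_sum += chunk["price"].sum()
    running_count += chunk["price"].count()  # counts non-missing values only

mean_price = running_sum / running_count
print(mean_price)  # 30.0
```

The same idea (keep a small running state, update it per chunk) should carry over to minimum, maximum, and counts per category.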