### Python Study Group week 5

### Theme: Speeding up a web crawler with multiprocessing

------------------------------------------------------------

### Let's get started!

```
First, consider the following simple function:
it splits an English sentence into words using Python's built-in split() method.
```

In [14]:
def split_it(sent):
    split_sent = sent.split(' ')
    return split_sent

```
Next, we feed a few simple sentences to the function.
To do that, we iterate over the list and pass the sentences in one at a time.
```

In [15]:
import time

sentences = [
    "I like to eat apple.",
    "I don't like it",
    "Why! It is juicy"
]

start = time.time()

for sent in sentences:
    split_sent = split_it(sent)
    print(split_sent)

print()
print('Elapsed: %f s' % (time.time() - start))

['I', 'like', 'to', 'eat', 'apple.']
['I', "don't", 'like', 'it']
['Why!', 'It', 'is', 'juicy']

Elapsed: 0.000888 s


```
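The code above iterates over the list and runs the function on each item in turn.
Because the list is very short (len(sentences) = 3)
and the work inside the function is trivial, the total time is very small.

What if the function takes much longer to run?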
```

In [16]:
# Example: sending requests to the PTT NBA board

base_url = 'https://www.ptt.cc/bbs/NBA/index'


def generate_urls(base_url):
    page = 6506
    all_urls = []
    for i in range(1, 11):
        link = str(page-i) + '.html'
        all_urls.append(base_url + str(link))
        
    return all_urls
        
all_urls = generate_urls(base_url)
all_urls

['https://www.ptt.cc/bbs/NBA/index6505.html',
 'https://www.ptt.cc/bbs/NBA/index6504.html',
 'https://www.ptt.cc/bbs/NBA/index6503.html',
 'https://www.ptt.cc/bbs/NBA/index6502.html',
 'https://www.ptt.cc/bbs/NBA/index6501.html',
 'https://www.ptt.cc/bbs/NBA/index6500.html',
 'https://www.ptt.cc/bbs/NBA/index6499.html',
 'https://www.ptt.cc/bbs/NBA/index6498.html',
 'https://www.ptt.cc/bbs/NBA/index6497.html',
 'https://www.ptt.cc/bbs/NBA/index6496.html']

In [17]:
import requests

def get_web_page(url):
    resp = requests.get(url)
    if resp.status_code != 200:
        print('Invalid url: ', resp.url)
        return None
    else:
        return resp.url

In [18]:
start = time.time()

for url in all_urls:
    url = get_web_page(url)
    print(url)
    
print()
print('Elapsed: %f s' % (time.time() - start))

Invalid url:  https://www.ptt.cc/bbs/NBA/index6505.html
None
https://www.ptt.cc/bbs/NBA/index6504.html
https://www.ptt.cc/bbs/NBA/index6503.html
https://www.ptt.cc/bbs/NBA/index6502.html
https://www.ptt.cc/bbs/NBA/index6501.html
https://www.ptt.cc/bbs/NBA/index6500.html
https://www.ptt.cc/bbs/NBA/index6499.html
https://www.ptt.cc/bbs/NBA/index6498.html
https://www.ptt.cc/bbs/NBA/index6497.html
https://www.ptt.cc/bbs/NBA/index6496.html

Elapsed: 0.527810 s


```
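In the test above, sending the 10 requests one after another took about 0.5 seconds
(the exact figure depends on network conditions).
The execution flow is:
    for url in all_urls:
        request -> response

In other words, we only send the next request after the server has returned the response to the current one.
When the number of URLs is very large, the total run time becomes very long.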
```
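The request -> response loop described above can be sketched without touching the network. This is only an illustration: `fake_request` and its fixed 0.05 s latency are assumptions standing in for `requests.get()`, chosen to show that in a sequential loop the waiting times simply add up.

```python
import time


def fake_request(url, latency=0.05):
    """Hypothetical stand-in for requests.get(): just waits `latency` seconds."""
    time.sleep(latency)
    return url


def fetch_sequentially(urls, latency=0.05):
    """Send one request at a time; return (results, elapsed seconds)."""
    start = time.perf_counter()
    # The next request is only sent after the previous one has returned.
    results = [fake_request(url, latency) for url in urls]
    return results, time.perf_counter() - start


urls = ['page%d' % i for i in range(5)]
results, elapsed = fetch_sequentially(urls)
print('%d requests took %.2f s (roughly %d x 0.05 s)' % (len(urls), elapsed, len(urls)))
```

With real requests the latency varies from call to call, but the total is still the sum of the individual waits.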

------------------------------------------------------------

### Intro to multiprocessing

```
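multiprocessing:
The system can use more than one processor at the same time.

With multiprocessing, a job is split into several independent routines (processes).
The operating system assigns these processes to different processors to improve overall performance.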
```

```
multiprocessing system:
 - The computer must have more than one processor (or processor core).
```

```
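A multi-core CPU can work on several tasks at the same time, each task handled by its own processor.

In Python, the multiprocessing module provides a simple and intuitive API for this.
The following cells introduce how to use it: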
```

In [22]:
import multiprocessing

cores = multiprocessing.cpu_count()
cores

4

In [7]:
import multiprocessing

def print_cube(num):
    print('Cube: {}'.format(num*num*num))
    
def print_square(num):
    print('Square: {}'.format(num*num))
    
if __name__ == '__main__':
    # creating processes
    p1 = multiprocessing.Process(target=print_square, args=(10, ))
    p2 = multiprocessing.Process(target=print_cube, args=(10, ))
    
    # starting process 1
    p1.start()
    # starting process 2
    p2.start()
    
    # wait until process1 is finished
    p1.join()
    # wait until process2 is finished
    p2.join()
    
    # both processes finished
    print('Done!')

Square: 100
Cube: 1000
Done!



```
Note:

Step1.
  Instantiate the Process class to create a process.

  """
    p1 = multiprocessing.Process(target=print_square, args=(10, ))
    p2 = multiprocessing.Process(target=print_cube, args=(10, ))

  """
    Input arguments:
      1. target: the function the process will run
      2. args: a tuple of arguments passed to the target function

Step2.
Start the processes.

"""
  p1.start()
  p2.start()

"""

Step3.
Wait for the processes to finish.

Once a process is started, it runs alongside the main program (the py/ipynb file).
join() does not stop a process: it blocks the main program until that process has finished.
Here the program waits for p1, then for p2, before moving on to the statements that follow.

"""
p1.join()
p2.join()

"""
```
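The three steps in the note (create, start, join) extend naturally to any number of processes by keeping them in a list. The sketch below is one possible way to write that; `format_power`, `print_power` and `run_all` are hypothetical helper names introduced here, not part of the multiprocessing module.

```python
import multiprocessing


def format_power(num, exponent):
    """Build the message in a plain function so it is easy to check."""
    return '{} ** {} = {}'.format(num, exponent, num ** exponent)


def print_power(num, exponent):
    # Target function: this runs inside the child process.
    print(format_power(num, exponent))


def run_all(jobs):
    """Step 1: create one Process per job. Step 2: start all. Step 3: join all."""
    processes = [multiprocessing.Process(target=print_power, args=job) for job in jobs]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    # exitcode is 0 for a process that finished without an error.
    return [p.exitcode for p in processes]


if __name__ == '__main__':
    exit_codes = run_all([(10, 2), (10, 3), (10, 4)])
    print('Exit codes:', exit_codes)
```

Starting every process before joining any of them matters: calling `start()` and `join()` back to back inside a single loop would run the processes one after another.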

------------------------------------------------------------

### Another Example:

In [8]:
import os

def worker1():
    # print process id
    print('ID of process running worker1: {}'.format(os.getpid()))
    
def worker2():
    # print process id
    print('ID of process running worker2: {}'.format(os.getpid()))

In [23]:
if __name__ == '__main__':
    # print main program process id
    print('ID of main process: {}'.format(os.getpid()))
    
    # creating processes
    p1 = multiprocessing.Process(target=worker1)
    p2 = multiprocessing.Process(target=worker2)
    
    # starting processes
    p1.start()
    p2.start()
    print()
    
    # process IDs
    print('ID of process p1: {}'.format(p1.pid)) 
    print('ID of process p2: {}'.format(p2.pid))
    
    
    # wait until processes are finished
    p1.join()
    p2.join()
    
    # both processes finished 
    print('Both processes finished execution!') 
    print()
    # check if processes are alive
    print('Process p1 is alive: {}'.format(p1.is_alive()))
    print('Process p2 is alive: {}'.format(p2.is_alive()))

ID of main process: 1777
ID of process running worker2: 2134
ID of process running worker1: 2133

ID of process p1: 2133
ID of process p2: 2134
Both processes finished execution!

Process p1 is alive: False
Process p2 is alive: False


<br>

```
Note:
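1. The process ID of the main process differs from the IDs of the processes created through the multiprocessing module.
2. Above, os.getpid() returns the ID of the process that is running the target function.
   Inside worker1, os.getpid() == p1.pid (likewise worker2 and p2.pid).
3. Each process is independent and has its own memory space.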
```

------------------------------------------------------------

### Another Example:

In [10]:
import multiprocessing 

# empty list with global scope 
result = [] 
  
def square_list(mylist): 
    """ 
    function to square a given list 
    """
    global result 
    # append squares of mylist to global list result 
    for num in mylist: 
        result.append(num * num) 
    # print global list result 
    print("Result(in process p1): {}".format(result))

In [11]:
if __name__ == "__main__": 
    # input list 
    mylist = [1,2,3,4] 
  
    # creating new process 
    p1 = multiprocessing.Process(target=square_list, args=(mylist,)) 
    # starting process 
    p1.start() 
    # wait until process is finished 
    p1.join() 
  
    # print global result list 
    print("Result(in main program): {}".format(result))

Result(in process p1): [1, 4, 9, 16]
Result(in main program): []


```
Note:
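In multiprocessing, every newly created process:
    1. runs independently
    2. has its own memory space
That is why the child process appended to its own copy of result, while result in the main program is still empty.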
```
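Since each process has its own memory, a result has to be sent back explicitly. One common option, sketched here under the assumption that a single result needs to travel from the child to the main program, is `multiprocessing.Queue`. The helper names `squares` and `square_worker` are made up for this example.

```python
import multiprocessing


def squares(numbers):
    """Pure computation, independent of any process machinery."""
    return [n * n for n in numbers]


def square_worker(numbers, queue):
    # Runs in the child process: send the result back through the queue
    # instead of appending to a global list the parent cannot see.
    queue.put(squares(numbers))


if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=square_worker, args=([1, 2, 3, 4], queue))
    p1.start()
    # Read the result before join() so the child can flush its data and exit.
    result = queue.get()
    p1.join()
    print('Result(in main program): {}'.format(result))
```

Unlike the global-list version, the main program now receives the squared list, because the data is copied across the process boundary rather than shared.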

------------------------------------------------------------

### Multiprocessing a for loop?

```
Earlier, we iterated over the URLs and sent the GET requests to the site one at a time.

-> Next, we run that for loop with multiprocessing.
```
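Before applying this to the crawler, here is a minimal sketch of how a for loop maps onto `Pool.map`. The function `cube` is a hypothetical stand-in for the real per-URL work, and the pool size of 2 is arbitrary.

```python
from multiprocessing import Pool


def cube(n):
    """Stand-in for the work done on one item (for example, one URL)."""
    return n ** 3


if __name__ == '__main__':
    numbers = [1, 2, 3, 4, 5]

    # Sequential version: one item after another.
    looped = [cube(n) for n in numbers]

    # Pool version: the same function applied to the same items,
    # with the items spread across 2 worker processes.
    with Pool(2) as pool:
        mapped = pool.map(cube, numbers)

    # map() returns results in input order, so both lists match.
    print(looped == mapped)
```

The function passed to `map` must be defined at module level so the worker processes can locate it.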

### Review:

In [12]:
import time
import requests

def generate_urls(base_url):
    page = 6506
    for i in range(1, 11):
        link = str(page-i) + '.html'
        all_urls.append(base_url + str(link))

def get_web_page(url):
    resp = requests.get(url)
    if resp.status_code != 200:
        print('Invalid url: ', resp.url)
        return None
    else:
        return resp.url

if __name__ == "__main__": 
    all_urls = []
    base_url = 'https://www.ptt.cc/bbs/NBA/index'
    generate_urls(base_url)
    
    start = time.time()

    for url in all_urls:
        url = get_web_page(url)
        print(url)

    print()
    print('Elapsed: %f s' % (time.time() - start))

Invalid url:  https://www.ptt.cc/bbs/NBA/index6505.html
None
https://www.ptt.cc/bbs/NBA/index6504.html
https://www.ptt.cc/bbs/NBA/index6503.html
https://www.ptt.cc/bbs/NBA/index6502.html
https://www.ptt.cc/bbs/NBA/index6501.html
https://www.ptt.cc/bbs/NBA/index6500.html
https://www.ptt.cc/bbs/NBA/index6499.html
https://www.ptt.cc/bbs/NBA/index6498.html
https://www.ptt.cc/bbs/NBA/index6497.html
https://www.ptt.cc/bbs/NBA/index6496.html

Elapsed: 0.653050 s


<br>

### With multiprocessing

In [13]:
import time
from multiprocessing import Pool
import requests

def generate_urls(base_url):
    page = 6506
    for i in range(1, 11):
        link = str(page-i) + '.html'
        all_urls.append(base_url + str(link))
        
def scrape(url):
    resp = requests.get(url)
    return resp.url, resp.status_code

if __name__ == '__main__':
    all_urls = []
    base_url = 'https://www.ptt.cc/bbs/NBA/index'
    generate_urls(base_url)
    
    start = time.time()
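    # Initialize a pool of 10 worker processes
    p = Pool(10)
    # map() returns the outputs as a list, in the same order as all_urls
    results = p.map(scrape, all_urls)
    # No more tasks will be submitted: let the workers exit, then wait for them
    p.close()
    p.join()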
    
    for result in results:
        print(result)
    print()
    print('Elapsed: %f s' % (time.time() - start))

('https://www.ptt.cc/bbs/NBA/index6505.html', 500)
('https://www.ptt.cc/bbs/NBA/index6504.html', 200)
('https://www.ptt.cc/bbs/NBA/index6503.html', 200)
('https://www.ptt.cc/bbs/NBA/index6502.html', 200)
('https://www.ptt.cc/bbs/NBA/index6501.html', 200)
('https://www.ptt.cc/bbs/NBA/index6500.html', 200)
('https://www.ptt.cc/bbs/NBA/index6499.html', 200)
('https://www.ptt.cc/bbs/NBA/index6498.html', 200)
('https://www.ptt.cc/bbs/NBA/index6497.html', 200)
('https://www.ptt.cc/bbs/NBA/index6496.html', 200)

Elapsed: 0.282672 s


------------------------------------------------------------

```
Note:
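1. p = Pool(n)
-> n: the pool holds n worker processes, so up to n URLs are processed at the same time.

2. The work done is the same as in the sequential loop; it is simply spread across several processes that run at the same time.

3. If there are more URLs than n, the pool works through them in batches: a worker takes more URLs as soon as it is free.
-> Ex: suppose there are 100 URLs
       p = Pool(5)
       At most 5 URLs are being processed at any moment,
       so each worker handles about 20 URLs on average (100/5).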
```
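The 100-URL example in the note can be checked with a small simulation. This is a simplified model, not how `Pool` is implemented internally: it assumes every request takes the same 0.05 s and that each task goes to whichever worker becomes free first. `simulate_pool` is a hypothetical helper written for this illustration.

```python
import heapq


def simulate_pool(durations, n_workers):
    """Greedy model of a worker pool.

    Each task is given to the worker that becomes free earliest.
    Returns (total elapsed time, number of tasks handled by each worker).
    """
    # Heap of (time at which the worker is free, worker index).
    free_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    handled = [0] * n_workers
    for duration in durations:
        t, w = heapq.heappop(free_at)
        handled[w] += 1
        heapq.heappush(free_at, (t + duration, w))
    return max(t for t, _ in free_at), handled


durations = [0.05] * 100          # assumed: 100 URLs, 0.05 s each
total, handled = simulate_pool(durations, 5)
print('sequential: %.2f s' % sum(durations))
print('5 workers : %.2f s, tasks per worker: %s' % (total, handled))
```

Under these assumptions each of the 5 workers ends up with 20 tasks, and the total time is about one fifth of the sequential time. Real requests have uneven latencies, so the actual split per worker will differ.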