# Crawler Concurrency

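### Simple serial loop

This is the slowest approach: the requests run one at a time, so the total time is the sum of all the individual request times.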

In [1]:
import time
import requests

url_list = [
    'http://www.baidu.com',
    'http://www.pythonsite.com',
    'http://www.cnblogs.com/'
]

start_time = time.time()
print('Start time: {}'.format(start_time))

# Issue the requests one at a time; each call blocks until the response arrives
for i, url in enumerate(url_list):
    start = time.time()
    result = requests.get(url)
    print(result.status_code)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()

print('Total time: {}'.format(end_time-start_time))


Start time: 1533104896.6427982
200
Request 0 took: 0.04388260841369629
200
Request 1 took: 3.041865110397339
200
Request 2 took: 0.23140478134155273
Total time: 3.317152500152588


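### Thread pool concurrency

With a thread pool the requests run in parallel, so the overall time is roughly that of the slowest single request, which is much faster than the serial loop. Note that `pool.submit` returns immediately, so the timings printed below measure only how long it takes to queue each task, not how long the requests themselves take; the status codes appear later, as the worker threads finish.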

In [4]:
import time
import requests
from concurrent.futures import ThreadPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch(url):
    result = requests.get(url)
    print(result.status_code)
    

pool = ThreadPoolExecutor(10)

start_time = time.time()
print('Start time: {}'.format(start_time))

for i, url in enumerate(url_list):
    start = time.time()
    # submit() only queues the task and returns a Future right away
    pool.submit(fetch, url)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))

Start time: 1533105726.9810228
Request 0 took: 0.005985260009765625
Request 1 took: 0.0019941329956054688
Request 2 took: 0.004986286163330078
Total time: 0.013963699340820312
200
200
200
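
The timings above cover only the time needed to queue each task. To observe the "slowest request decides the total" behavior, the total has to be taken after the pool has finished its work. The following is a minimal sketch under stated assumptions: `fake_fetch`, `timed_pool_run`, and the delay values in `DELAYS` are hypothetical stand-ins (a `time.sleep` replaces `requests.get`) so that the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-URL delays in seconds (made-up values standing in for network latency).
DELAYS = {
    'http://www.baidu.com': 0.2,
    'http://www.bing.com': 0.3,
    'http://www.cnblogs.com/': 0.25,
}


def fake_fetch(url):
    """Stand-in for requests.get(url): sleeps for the configured delay."""
    time.sleep(DELAYS[url])
    return url


def timed_pool_run(urls, max_workers=10):
    """Run fake_fetch for every URL in a thread pool; return (results, elapsed seconds)."""
    start = time.time()
    # Leaving the with-block calls shutdown(wait=True), so every task has
    # finished before the elapsed time is taken.
    with ThreadPoolExecutor(max_workers) as pool:
        results = list(pool.map(fake_fetch, urls))
    return results, time.time() - start


results, elapsed = timed_pool_run(list(DELAYS))
print('Total time: {:.2f}'.format(elapsed))
```

With these made-up delays the total should land close to the longest delay rather than the sum of all three. Swapping `fake_fetch` for a function that calls `requests.get` would give the same structure against real URLs.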


### Thread pool + callback

In [10]:
import time
import requests
from concurrent.futures import ThreadPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result().status_code)
    

start_time = time.time()
print('Start time: {}'.format(start_time))

pool = ThreadPoolExecutor(5)

print('Thread pool creation took: {}'.format(time.time() - start_time))

for i, url in enumerate(url_list):
    start = time.time()
    v = pool.submit(fetch_async, url)
    # The callback receives the Future once the request has completed
    v.add_done_callback(callback)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))

# shutdown() waits for the pending requests, so this is where the real waiting happens
pool.shutdown()

print('Thread pool shutdown took: {}'.format(time.time()-end_time))

Start time: 1533106206.701956
Thread pool creation took: 0.0
Request 0 took: 0.0
Request 1 took: 0.0029916763305664062
Request 2 took: 0.0019936561584472656
Total time: 0.008974552154541016
200
200
200
Thread pool shutdown took: 0.32712578773498535
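
As an alternative to callbacks, results can also be collected in the main thread with `concurrent.futures.as_completed`, which yields each future as soon as it finishes. This is only a sketch: `fake_fetch` and the delays in `DELAYS` are hypothetical stand-ins for `requests.get` and real network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical delays in seconds (made-up values standing in for network latency).
DELAYS = {
    'http://www.baidu.com': 0.3,
    'http://www.bing.com': 0.1,
    'http://www.cnblogs.com/': 0.2,
}


def fake_fetch(url):
    """Stand-in for requests.get(url): sleeps, then returns a fake status code."""
    time.sleep(DELAYS[url])
    return 200


finished = []
with ThreadPoolExecutor(5) as pool:
    # Map each Future back to the URL it was submitted for.
    future_to_url = {pool.submit(fake_fetch, url): url for url in DELAYS}
    # as_completed yields futures in the order they finish, not the order submitted.
    for future in as_completed(future_to_url):
        finished.append((future_to_url[future], future.result()))

print(finished)
```

Compared with `add_done_callback`, this keeps the result handling in one place in the calling thread, which can be easier to follow when the results feed into later steps.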


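### Process pool concurrency

A process pool behaves the same way: the overall time is determined by the slowest request. Processes consume more resources than threads, however, and fetching a URL is an I/O-bound operation, so a thread pool is the better choice here.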

In [11]:
import time
import requests
from concurrent.futures import ProcessPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch(url):
    result = requests.get(url)
    print(result.status_code)
    
start_time = time.time()
print('Start time: {}'.format(start_time))

pool = ProcessPoolExecutor(5)

print('Process pool creation took: {}'.format(time.time()-start_time))

for i, url in enumerate(url_list):
    start = time.time()
    pool.submit(fetch, url)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))

# Wait for all submitted tasks to finish before continuing
pool.shutdown(True)
print('Process pool shutdown took: {}'.format(time.time()-end_time))

Start time: 1533106519.7000778
Process pool creation took: 0.004985809326171875
Request 0 took: 0.016955137252807617
Request 1 took: 0.0
Request 2 took: 0.0
Total time: 0.02293848991394043
Process pool shutdown took: 0.3749985694885254


### Process pool + callback

In [16]:
import time
import requests
from concurrent.futures import ProcessPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result().status_code)
    

start_time = time.time()
print('Start time: {}'.format(start_time))

pool = ProcessPoolExecutor(5)

print('Process pool creation took: {}'.format(time.time() - start_time))

for i, url in enumerate(url_list):
    start = time.time()
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))


pool.shutdown(True)

print('Process pool shutdown took: {}'.format(time.time()-end_time))

Start time: 1533106840.277305
Process pool creation took: 0.001994609832763672
Request 0 took: 0.034906625747680664
Request 1 took: 0.0
Request 2 took: 0.0009970664978027344
Total time: 0.03789830207824707


exception calling callback for <Future at 0x1677d9cfa90 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "<ipython-input-16-6246141e29e4>", line 16, in callback
    print(future.result().status_code)
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 425, in result
    return self.__get_result()
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
exception calling callback for <Future at 0x1677da04eb8 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "<ipython-i

Process pool shutdown took: 0.2652909755706787
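
The traceback above comes from the callback itself: `future.result()` re-raises whatever exception the task ended with, here `BrokenProcessPool`. A likely cause is that the cell was run from an interactive session on Windows; the Python documentation notes that `ProcessPoolExecutor` does not work in the interactive interpreter, because worker processes must be able to import `__main__`. Whatever the cause, a callback can check `future.exception()` before touching the result. The sketch below uses a thread pool so that it runs anywhere (the `Future` API is the same for both executors); `ok_task`, `failing_task`, and `safe_callback` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def ok_task(url):
    """Hypothetical task that succeeds and returns a fake status code."""
    return 200


def failing_task(url):
    """Hypothetical task that always fails, standing in for a crashed worker."""
    raise RuntimeError('could not fetch {}'.format(url))


messages = []


def safe_callback(future):
    # future.exception() returns the exception raised by the task (or None)
    # without re-raising it the way future.result() does.
    error = future.exception()
    if error is not None:
        messages.append('failed: {!r}'.format(error))
    else:
        messages.append('status: {}'.format(future.result()))


with ThreadPoolExecutor(2) as pool:
    for task in (ok_task, failing_task):
        pool.submit(task, 'http://www.baidu.com').add_done_callback(safe_callback)

print(sorted(messages))
```

With this guard, a failed task produces a message instead of an "exception calling callback" traceback.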


### Single-threaded coroutine concurrency

* asyncio
* gevent
* Twisted
* Tornado

### Using asyncio

In [None]:
import time
import asyncio
import requests

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

async def fetch_async(url):
    start = time.time()
    # requests.get is a blocking call: it never yields to the event loop,
    # so these coroutines still run one after another
    response = requests.get(url)
    print(response.status_code)
    end = time.time()
    print('Request url: {}, took: {}'.format(url, end-start))


tasks = [fetch_async(url) for url in url_list if url]

start_time = time.time()
print('Start time: {}'.format(start_time))

loop = asyncio.get_event_loop()

loop.run_until_complete(asyncio.gather(*tasks))

loop.close()
print('Total time: {}'.format(time.time()-start_time))

$ python c:/Users/JS-E-PC-10182/Desktop/test/python/aaa.py
Start time: 1533110011.8857687
200
Request url: http://www.cnblogs.com/, took: 0.13364171981811523
200
Request url: http://www.bing.com, took: 0.3001983165740967
200
Request url: http://www.baidu.com, took: 0.025927066802978516
Total time: 0.4747307300567627
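
In the run above the total (about 0.47 s) is close to the sum of the three request times, because `requests.get` blocks the single thread the event loop runs on. One way to get real overlap while keeping a blocking client is to hand the call to a thread with `loop.run_in_executor`. The following is a minimal sketch, assuming Python 3.7+ (for `asyncio.run` and `asyncio.get_running_loop`); `blocking_fetch` and the delays in `DELAYS` are hypothetical stand-ins for `requests.get` and real latency.

```python
import asyncio
import time

# Hypothetical delays in seconds (made-up values standing in for network latency).
DELAYS = {
    'http://www.baidu.com': 0.2,
    'http://www.bing.com': 0.3,
    'http://www.cnblogs.com/': 0.25,
}


def blocking_fetch(url):
    """Stand-in for requests.get(url): blocks the calling thread, returns a fake status."""
    time.sleep(DELAYS[url])
    return 200


async def fetch_async(url):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the loop's default thread pool and await it,
    # so the event loop is free to start the other coroutines in the meantime.
    status = await loop.run_in_executor(None, blocking_fetch, url)
    return url, status


async def main():
    # gather returns the results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fetch_async(url) for url in DELAYS))


start_time = time.time()
results = asyncio.run(main())
elapsed = time.time() - start_time
print(results)
print('Total time: {:.2f}'.format(elapsed))
```

This still uses threads under the hood; a fully single-threaded version would need a non-blocking HTTP client built for asyncio instead of `requests`.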