# Crawler Concurrency

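### Simple sequential loop

This is the slowest approach: the requests run one after another, so the total time is the sum of all the individual request times.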

In [1]:
import time
import requests

url_list = [
    'http://www.baidu.com',
    'http://www.pythonsite.com',
    'http://www.cnblogs.com/'
]

start_time = time.time()
print('Start time: {}'.format(start_time))

for i, url in enumerate(url_list):
    start = time.time()
    result = requests.get(url)
    print(result.status_code)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()

print('Total time: {}'.format(end_time-start_time))


Start time: 1533104896.6427982
200
Request 0 took: 0.04388260841369629
200
Request 1 took: 3.041865110397339
200
Request 2 took: 0.23140478134155273
Total time: 3.317152500152588


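### Thread pool concurrency

With a thread pool the requests run in parallel, so the overall time is roughly that of the slowest single request, which is much faster than the sequential loop. Note that `pool.submit` returns immediately: the timings printed by this cell only measure how long it takes to submit each task, not how long the requests take, which is why the status codes appear after the total.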

In [4]:
import time
import requests
from concurrent.futures import ThreadPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch(url):
    result = requests.get(url)
    print(result.status_code)
    

pool = ThreadPoolExecutor(10)

start_time = time.time()
print('Start time: {}'.format(start_time))

for i, url in enumerate(url_list):
    start = time.time()
    pool.submit(fetch, url)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))

Start time: 1533105726.9810228
Request 0 took: 0.005985260009765625
Request 1 took: 0.0019941329956054688
Request 2 took: 0.004986286163330078
Total time: 0.013963699340820312
200
200
200
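
To see the "slowest request" behaviour directly, the measurement has to wait for the workers to finish. The following is only a minimal sketch under stated assumptions: `fake_fetch`, `timed_pool_fetch`, the `DELAYS` table and the `*.example` URLs are hypothetical stand-ins for `requests.get` and real sites, used so that the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-URL delays (seconds) standing in for real network latency.
DELAYS = {'http://a.example': 0.1, 'http://b.example': 0.3, 'http://c.example': 0.2}

def fake_fetch(url):
    """Simulate a blocking request; a real crawler would call requests.get(url)."""
    time.sleep(DELAYS[url])
    return url, 200

def timed_pool_fetch(urls, max_workers=10):
    """Run fake_fetch for every URL in a thread pool and return (results, elapsed)."""
    start = time.time()
    results = []
    # Leaving the with-block waits for all submitted tasks to finish.
    with ThreadPoolExecutor(max_workers) as pool:
        futures = [pool.submit(fake_fetch, url) for url in urls]
        for future in as_completed(futures):
            results.append(future.result())
    return results, time.time() - start

results, elapsed = timed_pool_fetch(list(DELAYS))
print(results)
print('Total time: {:.2f}s'.format(elapsed))
```

With these assumed delays the elapsed time should land near the largest delay rather than near the sum of all three.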


### Thread pool + callback

In [10]:
import time
import requests
from concurrent.futures import ThreadPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result().status_code)
    

start_time = time.time()
print('Start time: {}'.format(start_time))

pool = ThreadPoolExecutor(5)

print('Time to create the thread pool: {}'.format(time.time() - start_time))

for i, url in enumerate(url_list):
    start = time.time()
    v = pool.submit(fetch_async, url)
    # The callback runs once the request has finished
    v.add_done_callback(callback)
    end = time.time()
    # submit() returns immediately, so this only measures the submission
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))

# shutdown() waits for the pending requests to complete
pool.shutdown()

print('Time to shut down the thread pool: {}'.format(time.time()-end_time))

Start time: 1533106206.701956
Time to create the thread pool: 0.0
Request 0 took: 0.0
Request 1 took: 0.0029916763305664062
Request 2 took: 0.0019936561584472656
Total time: 0.008974552154541016
200
200
200
Time to shut down the thread pool: 0.32712578773498535
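
A done-callback receives the finished `Future`, and `future.result()` re-raises whatever exception the worker raised, so a failed request would blow up inside the callback. The following is a minimal sketch of a more defensive callback; `flaky_fetch`, `make_callback` and the `*.example` URLs are hypothetical names used for illustration, not part of the notebook above.

```python
from concurrent.futures import ThreadPoolExecutor

outcomes = []  # filled in by the callbacks

def flaky_fetch(url):
    """Hypothetical stand-in for requests.get: fails for URLs containing 'bad'."""
    if 'bad' in url:
        raise ConnectionError('cannot reach {}'.format(url))
    return 200

def make_callback(url):
    """Build a callback that remembers which URL its future belongs to."""
    def callback(future):
        error = future.exception()  # None when the task succeeded
        if error is not None:
            outcomes.append((url, 'error', str(error)))
        else:
            outcomes.append((url, 'ok', future.result()))
    return callback

# Leaving the with-block waits for the tasks and their callbacks to finish.
with ThreadPoolExecutor(5) as pool:
    for url in ['http://good.example', 'http://bad.example']:
        future = pool.submit(flaky_fetch, url)
        future.add_done_callback(make_callback(url))

for outcome in sorted(outcomes):
    print(outcome)
```

Checking `future.exception()` first keeps one failing URL from hiding the results of the others.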


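### Process pool concurrency

With a process pool the total time is likewise determined by the slowest request. Processes cost more resources than threads, however, and fetching a URL is I/O-bound, so a thread pool is the better choice here.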

In [11]:
import time
import requests
from concurrent.futures import ProcessPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch(url):
    result = requests.get(url)
    print(result.status_code)
    
start_time = time.time()
print('Start time: {}'.format(start_time))

pool = ProcessPoolExecutor(5)

print('Time to create the process pool: {}'.format(time.time()-start_time))

for i, url in enumerate(url_list):
    start = time.time()
    pool.submit(fetch, url)
    end = time.time()
    # submit() returns immediately, so this only measures the submission
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))

# Wait for the worker processes to finish
pool.shutdown(True)
print('Time to shut down the process pool: {}'.format(time.time()-end_time))

Start time: 1533106519.7000778
Time to create the process pool: 0.004985809326171875
Request 0 took: 0.016955137252807617
Request 1 took: 0.0
Request 2 took: 0.0
Total time: 0.02293848991394043
Time to shut down the process pool: 0.3749985694885254


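### Process pool + callback

The run recorded in this section fails with `BrokenProcessPool`. On Windows each worker process is a freshly spawned interpreter that has to import the submitted function, so `ProcessPoolExecutor` does not work with functions defined in an interactive session such as this notebook. Putting the code in a module and guarding the entry point with `if __name__ == '__main__':` avoids the problem.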

In [16]:
import time
import requests
from concurrent.futures import ProcessPoolExecutor

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result().status_code)
    

start_time = time.time()
print('Start time: {}'.format(start_time))

pool = ProcessPoolExecutor(5)

print('Time to create the process pool: {}'.format(time.time() - start_time))

for i, url in enumerate(url_list):
    start = time.time()
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
    end = time.time()
    print('Request {} took: {}'.format(i, end-start))

end_time = time.time()
print('Total time: {}'.format(end_time-start_time))


pool.shutdown(True)

print('Time to shut down the process pool: {}'.format(time.time()-end_time))

Start time: 1533106840.277305
Time to create the process pool: 0.001994609832763672
Request 0 took: 0.034906625747680664
Request 1 took: 0.0
Request 2 took: 0.0009970664978027344
Total time: 0.03789830207824707


exception calling callback for <Future at 0x1677d9cfa90 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "<ipython-input-16-6246141e29e4>", line 16, in callback
    print(future.result().status_code)
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 425, in result
    return self.__get_result()
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
exception calling callback for <Future at 0x1677da04eb8 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "c:\program files\python36\lib\concurrent\futures\_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "<ipython-i

Time to shut down the process pool: 0.2652909755706787


### Single-threaded coroutine concurrency

* asyncio
* gevent
* Twisted
* Tornado

### Using asyncio

In [None]:
import time
import asyncio
import requests

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'http://www.cnblogs.com/'
]

async def fetch_async(url):
    start = time.time()
    # requests.get is a blocking call, so these coroutines still run one after another
    response = requests.get(url)
    print(response.status_code)
    end = time.time()
    print('Requested url: {}, took: {}'.format(url, end-start))
    
    
tasks = [fetch_async(url) for url in url_list if url]    

start_time = time.time()
print('Start time: {}'.format(start_time))

loop = asyncio.get_event_loop()

loop.run_until_complete(asyncio.gather(*tasks))

loop.close()
# Measure from start_time (end_time is never defined in this cell)
print('Total time: {}'.format(time.time()-start_time))

$ python c:/Users/JS-E-PC-10182/Desktop/test/python/aaa.py

Start time: 1533110011.8857687

200

Requested url: http://www.cnblogs.com/, took: 0.13364171981811523

200

Requested url: http://www.bing.com, took: 0.3001983165740967

200

Requested url: http://www.baidu.com, took: 0.025927066802978516

Total time: 0.4747307300567627
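
In the run above the total is close to the sum of the three request times, because a blocking call inside a coroutine stalls the whole event loop. The following is a minimal sketch of that effect under stated assumptions: `time.sleep` stands in for a blocking client such as `requests.get`, `asyncio.sleep` stands in for a non-blocking one, and `DELAYS`, `blocking_fetch`, `cooperative_fetch` and `timed` are hypothetical names. It uses `asyncio.run`, which needs Python 3.7 or later and must be run as a script rather than inside a notebook.

```python
import asyncio
import time

DELAYS = [0.1, 0.2, 0.3]  # hypothetical request latencies in seconds

async def blocking_fetch(delay):
    # time.sleep blocks the whole event loop, like requests.get does
    time.sleep(delay)

async def cooperative_fetch(delay):
    # asyncio.sleep hands control back to the event loop while waiting
    await asyncio.sleep(delay)

async def timed(fetch):
    """Run one coroutine per delay with gather and return the elapsed time."""
    start = time.time()
    await asyncio.gather(*(fetch(d) for d in DELAYS))
    return time.time() - start

blocking_elapsed = asyncio.run(timed(blocking_fetch))
cooperative_elapsed = asyncio.run(timed(cooperative_fetch))
print('blocking: {:.2f}s'.format(blocking_elapsed))
print('cooperative: {:.2f}s'.format(cooperative_elapsed))
```

The blocking variant should take about as long as all the delays added together, while the cooperative one should take about as long as the largest delay.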

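### asyncio example 2

asyncio itself does not provide a way to send HTTP requests, but we can open a TCP connection and write the HTTP request by hand, awaiting each I/O step (older coroutine code used `yield from` at these points).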

In [3]:
import asyncio

async def fetch_async(host, url='/'):
    print("----",host, url)
    reader, writer = await asyncio.open_connection(host, 80)

    # Build the request header
    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')
    # Send the request
    writer.write(request_header_content)
    await writer.drain()
    text = await reader.read()
    print(host, url)
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/zhaof/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

RuntimeError: This event loop is already running

---- dig.chouti.com /pic/show?nid=4073644713430508&lid=10273091
---- www.cnblogs.com /zhaof/
www.cnblogs.com /zhaof/
dig.chouti.com /pic/show?nid=4073644713430508&lid=10273091
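
The cell above reads the raw response into `text` but never looks at it. The following is a minimal sketch of splitting such bytes into a status code, headers and body; `parse_response` and the `sample` bytes are hypothetical, and the parser assumes a well-formed HTTP/1.0 response without repeated header names.

```python
def parse_response(raw):
    """Split raw HTTP response bytes into (status_code, headers, body)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    # The status line looks like: HTTP/1.0 200 OK
    version, status_code, *reason = lines[0].split(' ')
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return int(status_code), headers, body

# Hypothetical response bytes, shaped like what reader.read() would return.
sample = b'HTTP/1.0 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello'
status, headers, body = parse_response(sample)
print(status, headers['content-type'], body)
```

Inside `fetch_async` this could be called as `parse_response(text)` right after `await reader.read()`.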


### asyncio + aiohttp

In [4]:
import aiohttp
import asyncio

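async def fetch_async(url):
    print(url)
    # A ClientSession used as a context manager is closed on exit; the bare
    # aiohttp.request() call used originally is what produced the
    # "Unclosed client session" warnings recorded in the output of this cell.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            print(url, response.status)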


tasks = [fetch_async('http://baidu.com/'), fetch_async('http://www.chouti.com/')]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()

RuntimeError: This event loop is already running

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x00000230A9B7AD68>
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x00000230A9B7AD68>


http://www.chouti.com/
http://baidu.com/


### asyncio + requests example

In [6]:
import asyncio
import requests



async def fetch_async(func, *args):
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = await future
    print(response.url, response.status_code)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

RuntimeError: This event loop is already running

http://www.cnblogs.com/wupeiqi/ 200
https://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091 403
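
When `run_in_executor` is used with many URLs, it is often useful to cap how many requests are in flight at once. The following is a minimal sketch using `asyncio.Semaphore`; `blocking_fetch` is a hypothetical stand-in for `requests.get`, and `MAX_CONCURRENT`, the counters and the `example.com` URLs are assumptions made for illustration. It uses `asyncio.run` (Python 3.7 or later) and should be run as a script.

```python
import asyncio
import threading
import time

MAX_CONCURRENT = 2  # hypothetical cap, chosen only for illustration
lock = threading.Lock()
active = 0  # blocking calls currently running
peak = 0    # highest value `active` ever reached

def blocking_fetch(url):
    """Hypothetical stand-in for requests.get that tracks overlapping calls."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.1)
    with lock:
        active -= 1
    return url, 200

async def fetch_async(semaphore, func, *args):
    loop = asyncio.get_running_loop()
    # At most MAX_CONCURRENT coroutines get past this point at once.
    async with semaphore:
        return await loop.run_in_executor(None, func, *args)

async def main(urls):
    # Create the semaphore inside the running loop.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [fetch_async(semaphore, blocking_fetch, url) for url in urls]
    return await asyncio.gather(*tasks)

urls = ['http://example.com/{}'.format(i) for i in range(5)]
responses = asyncio.run(main(urls))
print(responses)
print('peak concurrency:', peak)
```

`asyncio.gather` returns the results in the same order as the input URLs, regardless of which request finishes first.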


### gevent + requests example

In [7]:
# Patch the standard library before importing requests so that its sockets become cooperative
from gevent import monkey
monkey.patch_all()

import gevent
import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.status_code)

# ##### Send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### Send the requests (a pool caps the number of concurrent greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

  


get https://www.python.org/ {}
get https://www.yahoo.com/ {}
get https://github.com/ {}
https://github.com/ 200
https://www.yahoo.com/ 200
https://www.python.org/ 200


[<Greenlet "Greenlet-0" at 0x230aa0fb248: _run>,
 <Greenlet "Greenlet-1" at 0x230aa0fb148: _run>,
 <Greenlet "Greenlet-2" at 0x230aa0fb048: _run>]



### grequests example

grequests is a wrapper that combines requests and gevent.

In [10]:
import grequests


request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]


# ##### Send the requests and collect the list of responses #####
response_list = grequests.map(request_list)
print(response_list)


# ##### Send the requests and collect the list of responses (with an exception handler) #####
# def exception_handler(request, exception):
#     print("Request failed", request,exception)

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

[None, None, <Response [500]>]




### twisted example

In [12]:
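# getPage plays the role of the requests module, defer provides the Deferred return values, and reactor runs the event loop
# Note: getPage is deprecated in newer Twisted releases
from twisted.web.client import getPage
from twisted.internet import defer, reactor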

def all_done(arg):
    reactor.stop()

def callback(contents):
    print(contents)

deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)
# Collect the deferreds so we can tell when every request has finished
dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()



ReactorNotRestartable: 



### tornado example

In [13]:
from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


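def handle_response(response):
    """
    Handle the response (a counter would be needed to know when to stop the IO loop by calling ioloop.IOLoop.current().stop())
    :param response: 
    :return: 
    """
    if response.error:
        print("Error:", response.error)
    else:
        # HTTPResponse has no `title` attribute; print the body instead
        print(response.body)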


def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

RuntimeError: This event loop is already running



http://www.baidu.com


Exception in worker
Traceback (most recent call last):
  File "c:\program files\python36\lib\concurrent\futures\thread.py", line 67, in _worker
    work_item = work_queue.get(block=True)
  File "c:\program files\python36\lib\queue.py", line 164, in get
    self.not_empty.wait()
  File "c:\program files\python36\lib\threading.py", line 295, in wait
    waiter.acquire()
  File "c:\program files\python36\lib\site-packages\gevent\thread.py", line 84, in acquire
    return BoundedSemaphore.acquire(self, blocking, timeout)
  File "src\gevent\_semaphore.py", line 211, in gevent.__semaphore.Semaphore.acquire
  File "src\gevent\_semaphore.py", line 239, in gevent.__semaphore.Semaphore.acquire
  File "src\gevent\_semaphore.py", line 179, in gevent.__semaphore.Semaphore._do_wait
  File "src\gevent\_greenlet_primitives.py", line 59, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src\gevent\_greenlet_primitives.py", line 59, in gevent.__greenlet_primitives.SwitchOutGreenle

http://www.bing.com


Exception in thread ThreadPoolExecutor-0_5:
Traceback (most recent call last):
  File "c:\program files\python36\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "c:\program files\python36\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "c:\program files\python36\lib\concurrent\futures\thread.py", line 84, in _worker
    _base.LOGGER.critical('Exception in worker', exc_info=True)
  File "c:\program files\python36\lib\logging\__init__.py", line 1353, in critical
    self._log(CRITICAL, msg, args, **kwargs)
  File "c:\program files\python36\lib\logging\__init__.py", line 1442, in _log
    self.handle(record)
  File "c:\program files\python36\lib\logging\__init__.py", line 1452, in handle
    self.callHandlers(record)
  File "c:\program files\python36\lib\logging\__init__.py", line 1522, in callHandlers
    lastResort.handle(record)
  File "c:\program files\python36\lib\logging\__init__.py", line 861, in handle
    self.acquir