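# Speeding Up: Asynchronous Crawlers

* Understand how asynchronous crawling achieves its speed-up and how to implement it

## Assignment Goal

* Compare asynchronous crawlers with multithreaded crawlers: how do they differ, and what are the strengths and weaknesses of each?

Asynchronous crawler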

In [13]:
URL = 'https://morvanzhou.github.io/'

In [4]:
pip install aiohttp

Collecting aiohttp
  Downloading aiohttp-3.7.4.post0-cp38-cp38-win_amd64.whl (635 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-5.1.0-cp38-cp38-win_amd64.whl (48 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.6.3-cp38-cp38-win_amd64.whl (125 kB)
Collecting async-timeout<4.0,>=3.0
  Downloading async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
Note: you may need to restart the kernel to use updated packages.
Installing collected packages: multidict, yarl, async-timeout, aiohttp
Successfully installed aiohttp-3.7.4.post0 async-timeout-3.0.1 multidict-5.1.0 yarl-1.6.3


In [6]:
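# Not needed on Python 3.4+: asyncio is part of the standard library.
# The "asyncio" package on PyPI is an outdated backport.
pip install asyncio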

Collecting asyncio
  Downloading asyncio-3.4.3-py3-none-any.whl (101 kB)
Installing collected packages: asyncio
Successfully installed asyncio-3.4.3
Note: you may need to restart the kernel to use updated packages.


In [8]:
pip install nest_asyncio

Collecting nest_asyncio
  Downloading nest_asyncio-1.5.1-py3-none-any.whl (5.0 kB)
Installing collected packages: nest-asyncio
Successfully installed nest-asyncio-1.5.1
Note: you may need to restart the kernel to use updated packages.


In [14]:
import aiohttp, asyncio
import nest_asyncio
import time
nest_asyncio.apply()


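async def job(session):
    response = await session.get(URL)                               # wait for the response; control switches to other tasks meanwhile
    return str(response.url)

async def main(loop):
    async with aiohttp.ClientSession() as session:                  # the official docs recommend creating a Session; request can also be used directly
        tasks = [loop.create_task(job(session)) for _ in range(2)]
        finished, unfinished = await asyncio.wait(tasks)            # returns the finished and unfinished tasks, only after all of them are done
        all_results = [r.result() for r in finished]                # collect every result (finished is a set, so order is not guaranteed)
        print(all_results)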

t1 = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
#loop.close()
print("Async total time:", time.time() - t1)

['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.36077022552490234
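
As a rough illustration of why the async run takes about as long as one request rather than two, here is a minimal sketch that swaps the network call for `asyncio.sleep`. The names `fake_fetch` and `crawl` and the 0.2-second delay are assumptions made for this example, not part of the notebook. It uses `asyncio.gather` instead of `asyncio.wait` because `gather` returns results in the order the coroutines were passed in.

```python
import asyncio
import time

async def fake_fetch(url, delay=0.2):
    # Stand-in for session.get(url): awaiting hands control back to the event loop
    await asyncio.sleep(delay)
    return url

async def crawl(urls):
    # Schedule all coroutines at once; results come back in input order
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

if __name__ == "__main__":
    start = time.time()
    results = asyncio.run(crawl(["page-1", "page-2", "page-3"]))
    print(results)
    print("elapsed:", time.time() - start)  # close to one delay, not three
```

Inside a notebook, where an event loop is already running, the same coroutine could be driven with `loop.run_until_complete(crawl(...))` after `nest_asyncio.apply()`, as in the cell above.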


Multithreaded crawler

In [10]:
URL = 'https://morvanzhou.github.io/'

In [15]:
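import requests
import _thread

startTime = time.time()                                   # time was imported in the async cell above

for i in range(2):
    # start_new_thread returns immediately; nothing here waits for the request to finish
    _thread.start_new_thread( requests.get, (URL, ) )
    print(URL)

finishTime = time.time()
print(finishTime - startTime)                             # measures thread start-up only, not the downloads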

https://morvanzhou.github.io/
https://morvanzhou.github.io/
0.0006721019744873047
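
The 0.0007 s figure above only reflects how long it takes to start the threads, since nothing waits for them to finish. For a timing that can be compared with the async run, the threads have to be joined before the clock stops. Below is a minimal sketch, assuming the standard `threading` module in place of `_thread`. `run_threaded` and the injected `fetch` callable are hypothetical names introduced here so that the example runs without network access.

```python
import threading
import time

def run_threaded(fetch, url, n=2):
    """Call fetch(url) in n threads, wait for all of them, return (results, elapsed)."""
    results = [None] * n

    def worker(i):
        results[i] = fetch(url)   # each thread writes only to its own slot

    start = time.time()
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # block until this thread has finished
    return results, time.time() - start

if __name__ == "__main__":
    def slow_echo(url):
        time.sleep(0.2)           # stand-in for a blocking requests.get(url)
        return url

    results, elapsed = run_threaded(slow_echo, "https://example.com/")
    print(results, elapsed)
```

With `requests` imported, a call such as `run_threaded(lambda u: requests.get(u).url, URL)` would time the real downloads instead of the stand-in.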


Conclusion

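1. An asynchronous crawler uses the time the program would otherwise spend waiting for a response to let the CPU handle other steps. It runs in a single thread and spends little time idly waiting.

2. A multithreaded crawler runs its requests in parallel threads. When one thread finishes, the program still has to wait for the others to complete, so some extra waiting time remains.

3. The asynchronous approach makes better use of waiting time, while multithreading's parallel processing can be faster; combining the two works best.

4. The drawback of multithreading is the overhead that context switching places on the system, and threads that share variables need locks, which can lead to deadlock.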
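
The shared-variable concern in the last point can be made concrete with a small sketch of a counter guarded by `threading.Lock`. The names `counter`, `add_many` and `run`, and the iteration counts, are assumptions for illustration. Taking the lock in a `with` block releases it even if an exception is raised. A real deadlock usually involves two or more locks acquired in inconsistent order, which this sketch does not attempt to show.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread at a time may update counter
            counter += 1

def run(threads=4, n=10_000):
    global counter
    counter = 0
    workers = [threading.Thread(target=add_many, args=(n,)) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()            # wait for every worker before reading counter
    return counter

if __name__ == "__main__":
    print(run())            # threads * n increments, none lost
```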