# 4.1 - Asynchronous processes


![async](images/async.png)




**[Documentation](https://docs.python.org/3/library/asyncio.html)**


**asyncio** is a library for writing [concurrent](https://es.wikipedia.org/wiki/Concurrencia_(inform%C3%A1tica)) code using the async/await syntax. It is used as the foundation for multiple asynchronous Python frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc.

It is often a good fit for I/O-bound operations and high-level structured network code. It also provides a set of high-level APIs to:

+ run Python coroutines concurrently and have full control over their execution

+ perform network I/O and inter-process communication (IPC)

+ control subprocesses

+ distribute tasks through queues

+ synchronize concurrent code
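
As a minimal sketch of two of the points above (running coroutines concurrently and distributing tasks through a queue), the following uses `asyncio.Queue` with a few worker coroutines. The names `worker` and `main`, and the simulated work done with `asyncio.sleep`, are illustrative assumptions and not part of the original material.

```python
import asyncio


async def worker(name, queue, results):
    # Each worker keeps taking items from the shared queue until it is cancelled.
    while True:
        item = await queue.get()
        await asyncio.sleep(0.01)          # simulated I/O wait
        results.append((name, item * 2))
        queue.task_done()


async def main():
    queue = asyncio.Queue()
    results = []

    for n in range(6):
        queue.put_nowait(n)

    # Three workers share the six queued items.
    workers = [asyncio.create_task(worker(f'w{i}', queue, results)) for i in range(3)]

    await queue.join()                     # wait until every item has been processed

    # The workers loop forever, so they are cancelled once the queue is empty.
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

    return results


results = asyncio.run(main())    # in Jupyter: results = await main()
print(sorted(value for _, value in results))
```

Which worker handles which item is not fixed, so only the sorted values are printed.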

Additionally, there are low-level APIs for library and framework developers to:

+ create and manage event loops, which provide asynchronous APIs for networking, running subprocesses, handling operating system signals, etc.

+ implement efficient protocols using transports

+ bridge callback-based libraries and code written with the async/await syntax
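
To illustrate the last point, here is a small sketch of bridging a callback-based function into async/await code with a low-level `Future`. The function `legacy_api` is hypothetical and only stands in for a library that reports its result through a callback.

```python
import asyncio


def legacy_api(value, callback):
    # Hypothetical callback-based function: it reports its result via `callback`.
    callback(value * 2)


async def main():
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    # Schedule the callback-style call; the callback stores the result in the future.
    loop.call_soon(legacy_api, 21, future.set_result)

    # Awaiting the future suspends this coroutine until the callback has fired.
    return await future


result = asyncio.run(main())    # in Jupyter: result = await main()
print(result)
```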


We will focus on using event loops to extract data from the web.


### Hello World

In [1]:
import asyncio


# asynchronous function
async def saludar():
    
    print('Holaaaa..')
    
    await asyncio.sleep(3)
    
    print('pero que te pashaaa')
    
    
await saludar()            # in Jupyter
#asyncio.run(saludar())    # in a .py script

Holaaaa..
pero que te pashaaa
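
A single coroutine does not show much benefit on its own. As a hedged sketch, the following runs three coroutines at once with `asyncio.gather`; the function `greet` and its parameters are illustrative and not part of the original notebook. Because the waits overlap, the total time should be close to one wait rather than the sum of the three.

```python
import asyncio
import time


async def greet(name, delay):
    await asyncio.sleep(delay)    # non-blocking wait: other coroutines run meanwhile
    return f'hello {name}'


async def main():
    start = time.perf_counter()
    # The three coroutines wait at the same time, not one after another.
    greetings = await asyncio.gather(
        greet('a', 0.2),
        greet('b', 0.2),
        greet('c', 0.2),
    )
    return greetings, time.perf_counter() - start


greetings, elapsed = asyncio.run(main())    # in Jupyter: await main()
print(greetings, round(elapsed, 1))
```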


### Response example

Checking the response of three URLs.

In [2]:
import requests as req

In [3]:
url='https://s3-eu-west-1.amazonaws.com/'

req.get(url)

<Response [200]>

In [4]:
urls=[
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/breakfast.jpg',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/forbidden',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/the-html5-breakfast-site.html'
]

In [5]:
for e in urls:
    print(req.get(e))

<Response [200]>
<Response [403]>
<Response [200]>


In [6]:
req.get(e).status_code

200

In [7]:
# asynchronously

async def comprobar():
    
    bucle = asyncio.get_running_loop()   # the event loop running this coroutine
    
    futuros=[]
    
    for e in urls:
        
        # run the blocking req.get in the default thread pool
        promesa=bucle.run_in_executor(None, req.get, e)
        
        futuros.append(promesa)
        
    for res in await asyncio.gather(*futuros): # wait here for all the responses
        print(res.status_code)
        
        
await comprobar()

200
403
200
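
On Python 3.9 or later, `asyncio.to_thread` offers a shorter way to write the same pattern. The sketch below is an assumption-laden variant: `fetch_status` is a hypothetical stand-in for a blocking call such as `req.get(url).status_code`, and the `example.com` URLs are placeholders, so that it runs without network access.

```python
import asyncio
import time


def fetch_status(url):
    # Hypothetical stand-in for a blocking call such as req.get(url).status_code
    time.sleep(0.1)
    return 403 if url.endswith('forbidden') else 200


async def check(urls):
    # asyncio.to_thread (Python 3.9+) runs each blocking call in a worker thread
    tasks = [asyncio.to_thread(fetch_status, u) for u in urls]
    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*tasks)


urls = [
    'https://example.com/a',
    'https://example.com/forbidden',
    'https://example.com/b',
]

codes = asyncio.run(check(urls))    # in Jupyter: codes = await check(urls)
print(codes)
```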


### ESPN example

Let's return to the example of scraping the ESPN page. We will make multiple requests to get the data for all the teams.


https://www.espn.com/soccer/competitions

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By

import time

import pandas as pd

from webdriver_manager.chrome import ChromeDriverManager

PATH=ChromeDriverManager().install()

import warnings
warnings.filterwarnings('ignore')

In [9]:
url='https://www.espn.com/soccer/competitions'

In [10]:
driver=webdriver.Chrome(PATH)

driver.get(url)

# accept cookies
aceptar=driver.find_element(By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')
aceptar.click()

time.sleep(4)

In [11]:
# select LaLiga teams
equipos=driver.find_element(By.XPATH, '//*[@id="fittPageContainer"]/div[3]/div/div[1]/div/div[2]/div[2]/div/div[4]/div/section/div/div/span[2]/a')
equipos.click()

time.sleep(2)

In [12]:
stats = driver.find_elements(By.CSS_SELECTOR, 'a.AnchorLink')

stats[12].get_attribute('href')

'http://www.espn.com/watch/'

In [13]:
stats[12].get_attribute('tabindex')

'0'

In [14]:
team_stats = []

for e in stats:
    
    try:
        link = e.get_attribute('href')

        # keep only the links to team statistics pages
        if 'soccer/team/stats' in link:
            team_stats.append(link)
    except Exception:   # href may be None, or the element may have gone stale
        continue
        
driver.quit()

team_stats

['https://www.espn.com/soccer/team/stats/_/id/349/afc-bournemouth',
 'https://www.espn.com/soccer/team/stats/_/id/13884/afc-fylde',
 'https://www.espn.com/soccer/team/stats/_/id/3802/afc-wimbledon',
 'https://www.espn.com/soccer/team/stats/_/id/2731/accrington-stanley',
 'https://www.espn.com/soccer/team/stats/_/id/21711/alvechurch',
 'https://www.espn.com/soccer/team/stats/_/id/359/arsenal',
 'https://www.espn.com/soccer/team/stats/_/id/362/aston-villa',
 'https://www.espn.com/soccer/team/stats/_/id/280/barnet',
 'https://www.espn.com/soccer/team/stats/_/id/397/barnsley',
 'https://www.espn.com/soccer/team/stats/_/id/642/barrow',
 'https://www.espn.com/soccer/team/stats/_/id/392/birmingham-city',
 'https://www.espn.com/soccer/team/stats/_/id/365/blackburn-rovers',
 'https://www.espn.com/soccer/team/stats/_/id/346/blackpool',
 'https://www.espn.com/soccer/team/stats/_/id/358/bolton-wanderers',
 'https://www.espn.com/soccer/team/stats/_/id/2595/boreham-wood',
 'https://www.espn.com/socc

In [15]:
len(team_stats)

124

**Asynchronous extraction**

In [16]:
help(asyncio.get_event_loop().run_in_executor)

Help on method run_in_executor in module asyncio.base_events:

run_in_executor(executor, func, *args) method of asyncio.unix_events._UnixSelectorEventLoop instance



In [17]:
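import functools


def asincrono(funcion):
    
    @functools.wraps(funcion)
    def eventos(*args, **kwargs):
        # run_in_executor only forwards positional arguments, so bind kwargs with partial
        llamada = functools.partial(funcion, *args, **kwargs)
        return asyncio.get_event_loop().run_in_executor(None, llamada)
    
    return eventos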
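
The following sketch shows how a function wrapped this way behaves: each call returns a pending `Future` straight away, and the results only become available once the futures are awaited. The decorator is repeated (binding keyword arguments with `functools.partial`, since `run_in_executor` only forwards positional ones) so that the block is self-contained, and `slow_square` is a hypothetical blocking function used only for illustration.

```python
import asyncio
import functools
import time


def asincrono(funcion):
    # Same idea as the decorator above, repeated so the sketch runs on its own.
    @functools.wraps(funcion)
    def eventos(*args, **kwargs):
        llamada = functools.partial(funcion, *args, **kwargs)
        return asyncio.get_event_loop().run_in_executor(None, llamada)
    return eventos


@asincrono
def slow_square(n, delay=0.1):
    # Hypothetical blocking function
    time.sleep(delay)
    return n * n


async def main():
    # Each call returns immediately with a Future that is still pending...
    futures = [slow_square(n, delay=0.1) for n in range(4)]
    done_flags = [f.done() for f in futures]
    # ...and the results only exist once the futures are awaited.
    results = await asyncio.gather(*futures)
    return done_flags, results


done_flags, results = asyncio.run(main())    # in Jupyter: await main()
print(done_flags, results)
```

Note that the decorated function is called from inside a coroutine here, so `asyncio.get_event_loop()` resolves to the loop that is already running, as it does in a Jupyter cell.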

In [18]:
DATOS=[]

CABECERAS=[]

In [19]:
@asincrono
def extraer(url):
    
    global DATOS, CABECERAS
    
    # start the driver
    driver=webdriver.Chrome(PATH)

    try:
        driver.get(url)

        time.sleep(2)

        # accept cookies, if the banner appears
        try:
            aceptar=driver.find_element(By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')
            aceptar.click()

            time.sleep(2)
        except Exception:
            time.sleep(1)

        # open the discipline tab
        dis=driver.find_element(By.XPATH,'//*[@id="fittPageContainer"]/div[2]/div[5]/div/div[1]/section/div/div[2]/nav/ul/li[2]/a')
        dis.click()

        time.sleep(2)

        tabla=driver.find_element(By.TAG_NAME,'tbody')

        filas=tabla.find_elements(By.TAG_NAME, 'tr')

        data=[]

        for f in filas:

            elementos=f.find_elements(By.TAG_NAME, 'td')

            # cell texts of the row, plus the team name taken from the URL
            tmp=[e.text for e in elementos]
            tmp.append(url.split('/')[-1])
            data.append(tmp)

        cabeceras=driver.find_element(By.TAG_NAME, 'thead')

        cabeceras=[c.text for c in cabeceras.find_elements(By.TAG_NAME, 'th')]+['TEAM']

        DATOS+=data

        CABECERAS=cabeceras

    finally:
        # always close the browser, even when an element is not found
        driver.quit()


In [25]:
%%time

for url in team_stats[:10]:
    
    res=extraer(url)
    display(res)

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/asyncio/futures.py:384]>

CPU times: user 22.8 ms, sys: 61.3 ms, total: 84.2 ms
Wall time: 83.4 ms


Future exception was never retrieved
future: <Future finished exception=NoSuchElementException()>
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/folders/95/ms6dwls51ls1jq0t456d3r200000gn/T/ipykernel_34554/889091157.py", line 29, in extraer
    tabla=driver.find_element(By.TAG_NAME,'tbody')
  File "/opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 830, in find_element
    return self.execute(Command.FIND_ELEMENT, {"using": by, "value": value})["value"]
  File "/opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
    self.error_handler.check_response(response)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/clase/lib/python3.9/site-packages/selenium/webdriver/remot
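
The "Future exception was never retrieved" message above appears because the futures returned by `extraer` are displayed but never awaited, so the `NoSuchElementException` raised in one of the threads is only reported when the future is discarded. A minimal sketch of one way to collect both results and failures follows. It uses a hypothetical `scrape` function instead of Selenium, returns the rows instead of writing to global variables, and relies on `asyncio.gather(..., return_exceptions=True)`; the URLs are placeholders.

```python
import asyncio
import time


def scrape(url):
    # Hypothetical stand-in for the Selenium scraper: returns rows or raises
    time.sleep(0.05)
    if 'missing' in url:
        raise LookupError(f'no table found in {url}')
    team = url.split('/')[-1]
    return [['player', team]]


async def scrape_all(urls):
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(None, scrape, u) for u in urls]
    # return_exceptions=True hands exceptions back as values instead of raising
    outcomes = await asyncio.gather(*futures, return_exceptions=True)

    rows, failed = [], []
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, Exception):
            failed.append(url)      # remember which pages could not be scraped
        else:
            rows.extend(outcome)
    return rows, failed


urls = ['https://example.com/team/arsenal', 'https://example.com/team/missing']
rows, failed = asyncio.run(scrape_all(urls))    # in Jupyter: await scrape_all(urls)
print(rows, failed)
```

Awaiting the futures also means the following cells only run once the data is complete, instead of depending on how long the background threads happen to take.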

In [26]:
df=pd.DataFrame(DATOS, columns=CABECERAS)

df.shape

(288, 7)

In [27]:
df.head()

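|   | RK | NAME | P | YC | RC | PTS | TEAM |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Danny Waldron | 2 | 1 | 0 | 1 | alvechurch |
| 1 |  | Jediael Abbey | 2 | 1 | 0 | 1 | alvechurch |
| 2 |  | Jamie Willets | 2 | 1 | 0 | 1 | alvechurch |
| 3 | 4.0 | Tyrell Hamilton | 2 | 0 | 0 | 0 | alvechurch |
| 4 |  | Leo Brown | 1 | 0 | 0 | 0 | alvechurch |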


In [28]:
df.TEAM.unique()

array(['alvechurch', 'afc-wimbledon', 'afc-bournemouth',
       'accrington-stanley', 'arsenal', 'aston-villa', 'barrow',
       'barnsley', 'barnet'], dtype=object)
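
Only ten URLs were processed above. When scaling to the full list, it may help to cap how many blocking jobs (here, browsers) run at the same time. The sketch below is one possible approach under stated assumptions: it passes a dedicated `ThreadPoolExecutor` with a fixed `max_workers` to `run_in_executor`, and `visit` is a hypothetical stand-in for the scraper that simply records how many calls overlap.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

active = 0
peak = 0
lock = threading.Lock()


def visit(url):
    # Hypothetical stand-in for a blocking scraper; tracks how many run at once
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.05)
    with lock:
        active -= 1
    return url


async def visit_all(urls, max_workers=3):
    loop = asyncio.get_running_loop()
    # The pool size bounds how many blocking calls run simultaneously
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, visit, u) for u in urls]
        return await asyncio.gather(*futures)


urls = [f'https://example.com/team/{i}' for i in range(9)]
visited = asyncio.run(visit_all(urls))    # in Jupyter: visited = await visit_all(urls)
print(len(visited), peak)
```

The value of `max_workers` is a tuning choice; three is used here only to keep the example small.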