# 4.1 - Asynchronous processes


![async](images/async.png)




**[Documentation](https://docs.python.org/3/library/asyncio.html)**


**asyncio** is a library for writing [concurrent](https://es.wikipedia.org/wiki/Concurrencia_(inform%C3%A1tica)) code using the async/await syntax. It is the foundation for multiple asynchronous Python frameworks that provide high-performance networking and web servers, database connection libraries, distributed task queues, etc.

It is usually a very good fit for I/O-bound operations and high-level structured network code. It also provides a set of high-level APIs to:

+ run Python coroutines concurrently and have full control over their execution

+ perform network I/O and inter-process communication (IPC)

+ control subprocesses

+ distribute tasks via queues

+ synchronize concurrent code
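
As an illustration of two of the items above (running coroutines concurrently and distributing tasks via queues), here is a minimal sketch. The names `worker` and `main`, the three workers and the doubling "work" are all assumptions made for the example, and `asyncio.sleep` stands in for a real I/O-bound operation.

```python
import asyncio


async def worker(name, queue, results):
    # Each worker pulls items from the shared queue until it is cancelled.
    while True:
        item = await queue.get()
        await asyncio.sleep(0.01)          # stand-in for an I/O-bound operation
        results.append((name, item * 2))
        queue.task_done()


async def main():
    queue = asyncio.Queue()
    results = []

    for n in range(6):
        queue.put_nowait(n)

    # Three workers consume the same queue concurrently.
    workers = [asyncio.create_task(worker(f"w{i}", queue, results)) for i in range(3)]

    await queue.join()                     # wait until every item has been processed

    for w in workers:
        w.cancel()                         # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)

    return sorted(value for _, value in results)


if __name__ == "__main__":
    print(asyncio.run(main()))             # in Jupyter, use `await main()` instead
```

`queue.join()` returns only after `task_done()` has been called once per item, which is why the workers can be cancelled safely afterwards.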

Additionally, there are low-level APIs for library and framework developers to:

+ create and manage event loops, which provide asynchronous APIs for networking, running subprocesses, handling operating system signals, etc.

+ implement efficient protocols using transports

+ bridge callback-based libraries and code with async/await syntax


We will focus on using the event loop to extract data from the web.


### Hello World

In [1]:
import asyncio

# asynchronous function (coroutine)
async def saludar():

    print('Hola...')

    await asyncio.sleep(3)   # pauses this coroutine without blocking the event loop

    print('Pero que pasa')


await saludar()           # in Jupyter
#asyncio.run(saludar())   # in a .py file

Hola...
Pero que pasa
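
A single coroutine does not show the benefit of `await` by itself. The following sketch, with a hypothetical `greet` coroutine and arbitrary short delays, runs three coroutines with `asyncio.gather` and measures the total time, which should be close to the longest single delay rather than the sum of all three.

```python
import asyncio
import time


async def greet(name, delay):
    # While this coroutine sleeps, the event loop is free to run the others.
    await asyncio.sleep(delay)
    return f"Hello, {name}"


async def main():
    start = time.perf_counter()

    # gather schedules the three coroutines concurrently and keeps the input order.
    greetings = await asyncio.gather(
        greet("a", 0.2),
        greet("b", 0.2),
        greet("c", 0.2),
    )

    elapsed = time.perf_counter() - start
    return greetings, elapsed


if __name__ == "__main__":
    print(asyncio.run(main()))   # in Jupyter, use `await main()` instead
```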


### Response example

Checking the response of three URLs.

In [2]:
import requests as req

In [3]:
url='https://s3-eu-west-1.amazonaws.com/'

req.get(url)

<Response [200]>

In [4]:
urls=[
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/breakfast.jpg',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/forbidden',
    'https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/documents/the-html5-breakfast-site.html'
]

In [8]:
for url in urls:        # one at a time
    print(req.get(url).status_code)

200
403
200


In [10]:
# asynchronous version

async def comprobar():
    
    bucle=asyncio.get_event_loop()   # event loop that will schedule the blocking calls
    
    #futuros=[bucle.run_in_executor(None, req.get, url) for url in urls]
    futuros=[]
    
    for url in urls:
        
        promesa=bucle.run_in_executor(None, req.get, url)
        
        futuros.append(promesa)
        
    for res in await asyncio.gather(*futuros):  # wait here for all the responses
        print(res.status_code)
        
        
await comprobar()
    

200
403
200
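
On Python 3.9 or later, `asyncio.to_thread` is a shorter way to express the same idea as `run_in_executor(None, ...)`. The sketch below is not the notebook's code: `blocking_fetch` is a hypothetical stand-in for a blocking call such as `req.get(url)`, so that the example runs without network access.

```python
import asyncio
import time


def blocking_fetch(url):
    # Hypothetical stand-in for a blocking call such as requests.get(url).
    time.sleep(0.1)
    return len(url)


async def check(urls):
    # asyncio.to_thread (Python 3.9+) runs each blocking call in a worker thread;
    # gather returns the results in the same order as the input URLs.
    return await asyncio.gather(*(asyncio.to_thread(blocking_fetch, u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(check(["a", "bb", "ccc"])))   # in Jupyter, use `await check(...)`
```

Because the calls run in threads, this pattern suits blocking libraries that have no native async interface.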


### ESPN example

Let's return to the ESPN page scraping example. We will make multiple requests to obtain the data for all the teams.


https://www.espn.com/soccer/competitions

In [11]:
from selenium import webdriver

import time

import pandas as pd

from webdriver_manager.chrome import ChromeDriverManager

PATH=ChromeDriverManager().install()

import warnings
warnings.filterwarnings('ignore')



Current google-chrome version is 105.0.5195
Get LATEST chromedriver version for 105.0.5195 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/105.0.5195.52/chromedriver_mac64_m1.zip
Driver has been saved in cache [/Users/iudh/.wdm/drivers/chromedriver/mac64_m1/105.0.5195.52]


In [12]:
url='https://www.espn.com/soccer/competitions'

In [19]:
driver=webdriver.Chrome(PATH)

driver.get(url)

# cookies
aceptar=driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
aceptar.click()

time.sleep(4)

# select the LaLiga teams link
equipos=driver.find_element_by_xpath('//*[@id="fittPageContainer"]/div[3]/div/div[1]/div/div[2]/div[2]/div/div[4]/div/section/div/div/span[2]/a')
equipos.click()

time.sleep(2)

In [23]:
stats=driver.find_elements_by_css_selector('a.AnchorLink')

teams_stats=[]

for e in stats:
    
    link=e.get_attribute('href')
    
    # keep only team statistics links; href can be None for anchors without a target
    if link and '/team/stats/' in link:
        teams_stats.append(link)


driver.quit()

teams_stats

['https://www.espn.com/soccer/team/stats/_/id/6832/almeria',
 'https://www.espn.com/soccer/team/stats/_/id/93/athletic-club',
 'https://www.espn.com/soccer/team/stats/_/id/1068/atletico-madrid',
 'https://www.espn.com/soccer/team/stats/_/id/83/barcelona',
 'https://www.espn.com/soccer/team/stats/_/id/85/celta-vigo',
 'https://www.espn.com/soccer/team/stats/_/id/3842/cadiz',
 'https://www.espn.com/soccer/team/stats/_/id/3751/elche',
 'https://www.espn.com/soccer/team/stats/_/id/88/espanyol',
 'https://www.espn.com/soccer/team/stats/_/id/2922/getafe',
 'https://www.espn.com/soccer/team/stats/_/id/9812/girona',
 'https://www.espn.com/soccer/team/stats/_/id/84/mallorca',
 'https://www.espn.com/soccer/team/stats/_/id/97/osasuna',
 'https://www.espn.com/soccer/team/stats/_/id/101/rayo-vallecano',
 'https://www.espn.com/soccer/team/stats/_/id/244/real-betis',
 'https://www.espn.com/soccer/team/stats/_/id/86/real-madrid',
 'https://www.espn.com/soccer/team/stats/_/id/89/real-sociedad',
 'https

**Asynchronous extraction**

In [24]:
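import functools

def asincrono(funcion):
    # decorator: each call to the wrapped function runs in the default thread pool
    @functools.wraps(funcion)
    def eventos(*args, **kwargs):
        # run_in_executor only forwards positional arguments, so bind everything with partial
        llamada=functools.partial(funcion, *args, **kwargs)
        return asyncio.get_event_loop().run_in_executor(None, llamada)
    
    return eventos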
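
To see what a decorator of this kind returns, here is a minimal self-contained sketch. The names `to_background` and `slow_square` are hypothetical and are not part of the notebook. It uses `asyncio.get_running_loop()`, so the decorated function must be called from inside a coroutine, and it binds the arguments with `functools.partial` because `run_in_executor` accepts only positional arguments.

```python
import asyncio
import functools
import time


def to_background(func):
    # Hypothetical decorator following the same idea as `asincrono`:
    # calling the wrapped function returns a Future instead of the result.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return loop.run_in_executor(None, functools.partial(func, *args, **kwargs))
    return wrapper


@to_background
def slow_square(n, delay=0.1):
    time.sleep(delay)        # blocking work, executed in a worker thread
    return n * n


async def main():
    # Each call starts immediately in the thread pool and returns a Future.
    futures = [slow_square(n, delay=0.05) for n in range(4)]
    # Awaiting the futures is what retrieves the results (or the exceptions).
    return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(main()))   # in Jupyter, use `await main()` instead
```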

In [53]:
DATOS=[]

CABECERAS=[]

In [54]:
@asincrono
def extraer(url):
    
    global DATOS, CABECERAS
    
    # start the driver
    driver=webdriver.Chrome(PATH)
    driver.get(url)

    time.sleep(2)

    # accept cookies
    aceptar=driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
    aceptar.click()

    time.sleep(2)
    
    # open the discipline tab
    dis=driver.find_element_by_xpath('//*[@id="fittPageContainer"]/div[2]/div[5]/div/div[1]/section/div/div[2]/nav/ul/li[2]/a')
    dis.click()

    time.sleep(2)
    
    tabla=driver.find_element_by_tag_name('tbody')

    filas=tabla.find_elements_by_tag_name('tr')


    data=[]

    for f in filas:

        elementos=f.find_elements_by_tag_name('td') 

        tmp=[]

        for e in elementos:

            tmp.append(e.text)
            
        tmp.append(url.split('/')[-1])
        data.append(tmp)
        

    cabeceras=driver.find_element_by_tag_name('thead')

    cabeceras=[c.text for c in cabeceras.find_elements_by_tag_name('th')]+['TEAM']
    
    
    driver.quit()   # close this browser instance once its data has been read

    DATOS+=data

    CABECERAS=cabeceras

In [55]:
for url in teams_stats[:5]:
    
    res=extraer(url)
    display(res)

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /Users/iudh/miniforge3/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /Users/iudh/miniforge3/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /Users/iudh/miniforge3/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /Users/iudh/miniforge3/envs/clase/lib/python3.9/asyncio/futures.py:384]>

<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /Users/iudh/miniforge3/envs/clase/lib/python3.9/asyncio/futures.py:384]>

2022-09-01 09:49:23,506 [5743] ERROR    asyncio:1738: [JupyterRequire] Future exception was never retrieved
future: <Future finished exception=StaleElementReferenceException('stale element reference: element is not attached to the page document\n  (Session info: chrome=105.0.5195.52)', None, ['0   chromedriver                        0x0000000100fc5a90 chromedriver + 3889808', '1   chromedriver                        0x0000000100f54b54 chromedriver + 3427156', '2   chromedriver                        0x0000000100c46238 chromedriver + 221752', '3   chromedriver                        0x0000000100c48df8 chromedriver + 232952', '4   chromedriver                        0x0000000100c48c60 chromedriver + 232544', '5   chromedriver                        0x0000000100c48e94 chromedriver + 233108', '6   chromedriver                        0x0000000100c726b4 chromedriver + 403124', '7   chromedriver                        0x0000000100c6da10 chromedriver + 383504', '8   chromedriver               
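
The "Future exception was never retrieved" message appears because the futures are created but never awaited, so a failure inside one of them is only logged. One possible way to handle this, sketched below under stated assumptions, is to await all the futures with `asyncio.gather(..., return_exceptions=True)` and separate successes from failures. `flaky_extract` is a hypothetical stand-in for `extraer` that returns its rows instead of writing to global variables, and the team that fails is chosen arbitrarily for the example.

```python
import asyncio
import time


def flaky_extract(team):
    # Hypothetical stand-in for `extraer`: returns rows, or raises for one team.
    time.sleep(0.05)
    if team == "elche":
        raise RuntimeError(f"could not scrape {team}")
    return [[team, 1]]


async def collect(teams):
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(None, flaky_extract, t) for t in teams]

    # return_exceptions=True puts each exception in the result list
    # instead of aborting the whole gather on the first failure.
    outcomes = await asyncio.gather(*futures, return_exceptions=True)

    rows, failed = [], []
    for team, outcome in zip(teams, outcomes):
        if isinstance(outcome, Exception):
            failed.append(team)
        else:
            rows.extend(outcome)
    return rows, failed


if __name__ == "__main__":
    print(asyncio.run(collect(["almeria", "elche", "getafe"])))
```

The `failed` list makes it easy to retry only the URLs that raised, rather than discovering the gap later from the resulting DataFrame.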

In [58]:
df=pd.DataFrame(DATOS, columns=CABECERAS)

df.head()

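|   | RK | NAME | P | YC | RC | PTS | TEAM |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Iker Muniain | 3 | 1 | 0 | 1 | athletic-club |
| 1 |  | Dani Vivian | 3 | 1 | 0 | 1 | athletic-club |
| 2 |  | Iñigo Lekue | 1 | 1 | 0 | 1 | athletic-club |
| 3 |  | Yeray | 3 | 1 | 0 | 1 | athletic-club |
| 4 |  | Yuri Berchiche | 2 | 1 | 0 | 1 | athletic-club |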


In [59]:
df.TEAM.unique()

array(['athletic-club', 'almeria', 'celta-vigo', 'barcelona'],
      dtype=object)

In [61]:
df.shape

(85, 7)