<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Selenium" data-toc-modified-id="Selenium-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Selenium</a></span><ul class="toc-item"><li><span><a href="#Los-drivers" data-toc-modified-id="Los-drivers-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>The drivers</a></span></li><li><span><a href="#Capturar-elementos-con-Selenium" data-toc-modified-id="Capturar-elementos-con-Selenium-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Locating elements with Selenium</a></span></li></ul></li><li><span><a href="#Manos-a-la-obra" data-toc-modified-id="Manos-a-la-obra-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Getting to work</a></span></li></ul></div>

# Selenium

Web scraping has become a powerful tool for extracting data that we can then feed into our own applications. This time we will use it to collect historical weather data for Madrid from Weather Underground.
We will use Selenium because it lets us explore the HTML tree dynamically, retrieve elements from it and interact with the page as it responds to our actions.

In [None]:
#!pip install selenium
#!pip install webdriver-manager

## The drivers

Selenium relies on "drivers" to access the content of a website. A driver is a small executable that Selenium uses to control a real browser, and each browser has its own, for example ChromeDriver for Chrome or geckodriver for Firefox. The drivers can be downloaded here:


- [Chrome driver](https://chromedriver.storage.googleapis.com/index.html?path=2.43/)


- [Firefox driver](https://github.com/mozilla/geckodriver/releases)


- [Opera driver](https://github.com/operasoftware/operachromiumdriver/releases)


> Once the driver is downloaded we will have an executable (a `.exe` file on Windows) that we need to save in the folder we are working in.

We can load the driver in two ways. The first is to point Selenium to the downloaded executable:

```python
driver = webdriver.Chrome("/RutaAlDriver/chromedriver.exe")
```


We can also do it without downloading the driver by hand, letting `webdriver-manager` fetch it for us:

```python
driver = webdriver.Chrome(ChromeDriverManager().install())
```

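## Locating elements with Selenium

- To find a single element (these helpers belong to the older Selenium API used throughout this notebook; Selenium 4.3 removed them in favour of the generic `find_element(By.ID, ...)` style):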

```python
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
```
- To find multiple elements (these methods return a list):

```python
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
```

- Other methods that can be useful:

```python
.click() # simulates a click on the element
.send_keys # types text into a field
.implicitly_wait # tells the WebDriver how long to keep looking for an element before raising an exception; once set, it applies to every element lookup
.text # returns the text of a Selenium element
```

In [1]:
import requests
import pandas as pd
from time import sleep
import numpy as np


from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options




import warnings
warnings.filterwarnings('ignore')


# Manos a la obra 

The first thing we are going to do is define a series of options for working with Selenium.

In [3]:
opciones = Options()
# hide the automation flags so that we look less like a robot
opciones.add_experimental_option('excludeSwitches', ['enable-automation'])
opciones.add_experimental_option('useAutomationExtension', False)
opciones.add_argument('--start-maximized')  # start with a maximized window
opciones.add_argument('--user-data-dir=selenium')  # profile folder, so that cookies are kept
opciones.add_argument('--incognito')  # incognito window

In [15]:
# start the driver (the options defined above only take effect if passed as options=opciones)
driver = webdriver.Chrome(ChromeDriverManager().install())

# open the web page
driver.get("https://www.wunderground.com/history")
# wait up to 10 seconds for elements to appear before raising an exception
driver.implicitly_wait(10)
# accept the cookies
driver.find_element_by_css_selector("#truste-consent-button").click()

sleep(5)
# type the city we want to search for into the search box
driver.find_element_by_css_selector("#historySearch").send_keys("Madrid, Madrid, Spain", Keys.TAB)
sleep(5)
# for some reason the submit button has to be clicked twice
driver.find_element_by_xpath('//*[@id="dateSubmit"]').click()
sleep(2)
driver.find_element_by_xpath('//*[@id="dateSubmit"]').click()
sleep(5)

# switch to the monthly view
driver.find_element_by_xpath('//*[@id="inner-content"]/div[2]/div[1]/div[1]/div[1]/div/lib-link-selector/div/div/div/a[3]').click()
# grab the text of the observations table
resultado = driver.find_element_by_css_selector("#inner-content > div.region-content-main > div.row > div:nth-child(5) > div:nth-child(1) > div > lib-city-history-observation > div > div.observation-table.ng-star-inserted").text



Current google-chrome version is 98.0.4758
Get LATEST chromedriver version for 98.0.4758 google-chrome
Driver [/Users/anagarciagarcia/.wdm/drivers/chromedriver/mac64/98.0.4758.102/chromedriver] found in cache
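The fixed `sleep` calls in the cell above always wait the full time, even when the page is ready sooner, and still fail if the page takes longer. One possible alternative is to poll until the element is available. The following is a minimal sketch in plain Python: `wait_until` and `fake_lookup` are hypothetical names introduced here, and the lookup is a stand-in so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            value = condition()
            if value:
                return value
        except Exception as error:  # e.g. the element is not in the page yet
            last_error = error
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout} s") from last_error
        time.sleep(interval)


# Stand-in (assumption) for a page whose table only shows up on the third poll
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("table not rendered yet")
    return "observation table"


found = wait_until(fake_lookup, timeout=2.0, interval=0.01)
print(found, "after", attempts["count"], "attempts")
```

With the notebook's driver, the condition could be a `lambda` that wraps one of the `find_element_by_...` calls. Selenium also ships its own take on this idea, `WebDriverWait`, in `selenium.webdriver.support.ui`.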


In [16]:
resultado

'Time Temperature (° F) Dew Point (° F) Humidity (%) Wind Speed (mph) Pressure (Hg) Precipitation (in)\nMar\n1\n2\nMax Avg Min\n64 50.0 34\n63 54.0 45\nMax Avg Min\n43 36.6 32\n43 39.0 32\nMax Avg Min\n93 64.7 32\n87 59.2 36\nMax Avg Min\n12 4.4 1\n20 8.5 1\nMax Avg Min\n28.2 28.1 28.1\n28.1 28.1 28.0\nTotal\n0.00\n0.00'

We have retrieved the data for one month (here, the first days of March). Perfect! But what if we want more months?


In [18]:
# build the list of all the URLs we need (one per month)

url_list = []

for year in range(2010, 2021):
    for month in range(1, 13):
        url_list.append(f"https://www.wunderground.com/history/monthly/LEMD/date/{year}-{month}")

In [20]:
len(url_list)

132

In [21]:
url_list[:3]

['https://www.wunderground.com/history/monthly/LEMD/date/2010-1',
 'https://www.wunderground.com/history/monthly/LEMD/date/2010-2',
 'https://www.wunderground.com/history/monthly/LEMD/date/2010-3']

In [31]:
driver = webdriver.Chrome(ChromeDriverManager().install())
result_list = []

for i in url_list[:3]: 
    driver.get(i)
    try:
        # first visit: wait for the page, accept the cookies, then read the table
        sleep(5)
        driver.find_element_by_css_selector('#truste-consent-button').click()
        sleep(2)
        resultado = driver.find_element_by_css_selector("#inner-content > div.region-content-main > div.row > div:nth-child(5) > div:nth-child(1) > div > lib-city-history-observation > div > div.observation-table.ng-star-inserted").text
        result_list.append(resultado)
    except Exception:
        # no cookie banner (or the page was slow): wait a bit longer and read the table
        sleep(5)
        resultado = driver.find_element_by_css_selector("#inner-content > div.region-content-main > div.row > div:nth-child(5) > div:nth-child(1) > div > lib-city-history-observation > div > div.observation-table.ng-star-inserted").text
        result_list.append(resultado)
driver.quit()



Current google-chrome version is 98.0.4758
Get LATEST chromedriver version for 98.0.4758 google-chrome
Driver [/Users/anagarciagarcia/.wdm/drivers/chromedriver/mac64/98.0.4758.102/chromedriver] found in cache
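The loop above retries the lookup once when something goes wrong, but a second failure stops the whole run and the remaining months are lost. One possible variation records which URLs failed and carries on. This is only a sketch: `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the browser so the example runs on its own.

```python
def scrape_all(urls, fetch):
    """Apply `fetch` to every URL and keep going when one of them fails."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as error:
            # remember what went wrong so the URL can be retried later
            failures[url] = repr(error)
    return results, failures


def fake_fetch(url):
    """Stand-in (assumption) for the Selenium lookup: fails for February."""
    if url.endswith("2010-2"):
        raise RuntimeError("table not found")
    return f"table for {url[-6:]}"


base = "https://www.wunderground.com/history/monthly/LEMD/date/"
urls = [f"{base}2010-{month}" for month in range(1, 4)]

results, failures = scrape_all(urls, fake_fetch)
print(len(results), "scraped,", len(failures), "failed")
```

Keeping the failed URLs in a separate dictionary makes it easy to run a second pass over just those months.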


In [30]:
result_list

['Time Temperature (° F) Dew Point (° F) Humidity (%) Wind Speed (mph) Pressure (Hg) Precipitation (in)\nJan\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\nMax Avg Min\n48 42.9 39\n46 41.7 37\n46 42.3 36\n50 46.6 45\n52 46.8 43\n46 42.4 39\n39 36.3 34\n37 34.3 32\n36 32.8 30\n32 26.9 21\n34 27.5 23\n45 16.9 0\n46 40.8 36\n52 49.5 45\n48 41.7 34\n50 44.1 39\n50 46.5 45\n59 51.5 46\n55 52.8 50\n57 47.7 37\n52 41.4 32\n50 42.0 34\n54 47.5 43\n55 47.4 41\n48 42.9 37\n50 38.1 0\n43 37.4 34\n54 37.5 25\n55 42.0 28\n52 44.3 32\n52 45.7 36\nMax Avg Min\n41 36.2 34\n41 35.8 0\n43 39.3 36\n48 44.9 43\n46 44.1 39\n39 23.0 0\n37 22.8 0\n23 19.0 16\n21 18.1 14\n27 21.0 18\n30 25.3 21\n36 16.2 0\n45 38.4 0\n46 32.1 0\n39 26.7 0\n43 40.3 37\n50 45.8 43\n50 46.9 46\n52 49.3 41\n46 38.9 30\n41 31.6 0\n43 38.5 34\n45 41.1 39\n41 39.5 36\n36 20.2 0\n34 14.4 0\n28 23.6 19\n30 24.7 19\n37 26.6 0\n34 19.7 0\n36 13.5 0\nMax Avg Min\n81 77.0

In [None]:
# we already have the list of URLs we want



In [None]:
# get the table for each of the URLs above


In [None]:
# each element of our list will be one table



We convert the whole list into a dataframe

In [33]:
df = pd.DataFrame(result_list)
df

Unnamed: 0,0
0,Time Temperature (° F) Dew Point (° F) Humidit...
1,Time Temperature (° F) Dew Point (° F) Humidit...
2,Time Temperature (° F) Dew Point (° F) Humidit...


In [36]:
df.iloc[0][0]

'Time Temperature (° F) Dew Point (° F) Humidity (%) Wind Speed (mph) Pressure (Hg) Precipitation (in)\nJan\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\nMax Avg Min\n48 42.9 39\n46 41.7 37\n46 42.3 36\n50 46.6 45\n52 46.8 43\n46 42.4 39\n39 36.3 34\n37 34.3 32\n36 32.8 30\n32 26.9 21\n34 27.5 23\n45 16.9 0\n46 40.8 36\n52 49.5 45\n48 41.7 34\n50 44.1 39\n50 46.5 45\n59 51.5 46\n55 52.8 50\n57 47.7 37\n52 41.4 32\n50 42.0 34\n54 47.5 43\n55 47.4 41\n48 42.9 37\n50 38.1 0\n43 37.4 34\n54 37.5 25\n55 42.0 28\n52 44.3 32\n52 45.7 36\nMax Avg Min\n41 36.2 34\n41 35.8 0\n43 39.3 36\n48 44.9 43\n46 44.1 39\n39 23.0 0\n37 22.8 0\n23 19.0 16\n21 18.1 14\n27 21.0 18\n30 25.3 21\n36 16.2 0\n45 38.4 0\n46 32.1 0\n39 26.7 0\n43 40.3 37\n50 45.8 43\n50 46.9 46\n52 49.3 41\n46 38.9 30\n41 31.6 0\n43 38.5 34\n45 41.1 39\n41 39.5 36\n36 20.2 0\n34 14.4 0\n28 23.6 19\n30 24.7 19\n37 26.6 0\n34 19.7 0\n36 13.5 0\nMax Avg Min\n81 77.0 

In [37]:
# take the first row to see how the information is laid out in each row

x = df.iloc[0][0].split("\n")

In [38]:
# what is x?

x

['Time Temperature (° F) Dew Point (° F) Humidity (%) Wind Speed (mph) Pressure (Hg) Precipitation (in)',
 'Jan',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 'Max Avg Min',
 '48 42.9 39',
 '46 41.7 37',
 '46 42.3 36',
 '50 46.6 45',
 '52 46.8 43',
 '46 42.4 39',
 '39 36.3 34',
 '37 34.3 32',
 '36 32.8 30',
 '32 26.9 21',
 '34 27.5 23',
 '45 16.9 0',
 '46 40.8 36',
 '52 49.5 45',
 '48 41.7 34',
 '50 44.1 39',
 '50 46.5 45',
 '59 51.5 46',
 '55 52.8 50',
 '57 47.7 37',
 '52 41.4 32',
 '50 42.0 34',
 '54 47.5 43',
 '55 47.4 41',
 '48 42.9 37',
 '50 38.1 0',
 '43 37.4 34',
 '54 37.5 25',
 '55 42.0 28',
 '52 44.3 32',
 '52 45.7 36',
 'Max Avg Min',
 '41 36.2 34',
 '41 35.8 0',
 '43 39.3 36',
 '48 44.9 43',
 '46 44.1 39',
 '39 23.0 0',
 '37 22.8 0',
 '23 19.0 16',
 '21 18.1 14',
 '27 21.0 18',
 '30 25.3 21',
 '36 16.2 0',
 '45 38

In [42]:
# can we convert it into a dataframe?

pd.DataFrame(np.array(x[1:]).reshape(7, 32)).T


Unnamed: 0,0,1,2,3,4,5,6
0,Jan,Max Avg Min,Max Avg Min,Max Avg Min,Max Avg Min,Max Avg Min,Total
1,1,48 42.9 39,41 36.2 34,81 77.0 66,29 16.6 7,28.0 27.8 27.7,0.00
2,2,46 41.7 37,41 35.8 0,100 84.0 66,10 3.9 0,28.1 28.1 28.0,0.00
3,3,46 42.3 36,43 39.3 36,100 89.7 76,7 1.5 0,28.1 28.0 28.0,0.00
4,4,50 46.6 45,48 44.9 43,100 94.0 87,8 4.3 0,28.0 27.7 27.6,0.00
5,5,52 46.8 43,46 44.1 39,100 88.9 76,5 1.7 0,27.6 27.6 27.5,0.00
6,6,46 42.4 39,39 23.0 0,93 70.9 53,15 7.3 0,27.7 27.7 27.6,0.00
7,7,39 36.3 34,37 22.8 0,93 87.0 65,23 11.7 0,27.6 27.5 27.5,0.00
8,8,37 34.3 32,23 19.0 16,65 54.4 44,28 18.5 7,27.9 27.7 27.6,0.00
9,9,36 32.8 30,21 18.1 14,64 55.4 47,20 7.2 0,28.0 27.9 27.9,0.00


To do the `reshape` we need to take into account the number of days in each month. We can get it from the index of the first `"Max Avg Min"` value, which comes right after the month label and the day numbers.

**Remember the `index` method?** 👇🏽

In [44]:
# returns the position of the value in our list
x[1:].index("Max Avg Min")


32
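The idea can be sketched without pandas. The first `"Max Avg Min"` marks where the first column (month label plus day numbers) ends, so its position is the number of rows per column, and the flat list can be cut into equal chunks of that size. `parse_month` is a hypothetical helper, and the sample reuses the two-day March values scraped earlier, with the header line shortened.

```python
def parse_month(text):
    """Turn the flat text of one monthly table into a list of rows."""
    lines = text.split("\n")
    body = lines[1:]  # drop the header line with the column titles
    rows_per_column = body.index("Max Avg Min")  # month label + one line per day
    if len(body) % rows_per_column != 0:
        raise ValueError("the text does not split into equal columns")
    # cut the flat list into columns, then transpose so every day is a row
    columns = [
        body[start:start + rows_per_column]
        for start in range(0, len(body), rows_per_column)
    ]
    return [list(row) for row in zip(*columns)]


sample = "\n".join([
    "Time Temperature ...",  # header line, shortened for this sketch
    "Mar", "1", "2",
    "Max Avg Min", "64 50.0 34", "63 54.0 45",
    "Max Avg Min", "43 36.6 32", "43 39.0 32",
    "Max Avg Min", "93 64.7 32", "87 59.2 36",
    "Max Avg Min", "12 4.4 1", "20 8.5 1",
    "Max Avg Min", "28.2 28.1 28.1", "28.1 28.1 28.0",
    "Total", "0.00", "0.00",
])

rows = parse_month(sample)
for row in rows:
    print(row)
```

Each row ends up with seven values, one per column of the original table, whatever the number of days in the month.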

In [45]:
len(df)

3

In [51]:
x[1]

'Mar'

In [54]:
# build one big dataframe with all the months
df_clima = pd.DataFrame()

for i in range(len(df)):
    x = df.iloc[i][0].split("\n")
    
    # rows per column: the month label plus one row per day
    mi_ind = x[1:].index("Max Avg Min")
    print(mi_ind)
    
    # 7 columns of mi_ind values each, transposed so that every day is a row
    df_solo = pd.DataFrame(np.array(x[1:]).reshape(7, mi_ind)).T
    
    df_solo.columns = ["Time", "Temperature (° F)", "Dew Point (° F)", "Humidity (%)", "Wind Speed (mph)", "Pressure (Hg)", "Precipitation (in)"]
    df_solo["month"] = x[1]
    df_clima = pd.concat([df_clima, df_solo], axis = 0)
    

32
29
32


In [56]:
df_clima.head(34)

Unnamed: 0,Time,Temperature (° F),Dew Point (° F),Humidity (%),Wind Speed (mph),Pressure (Hg),Precipitation (in),month
0,Jan,Max Avg Min,Max Avg Min,Max Avg Min,Max Avg Min,Max Avg Min,Total,Jan
1,1,48 42.9 39,41 36.2 34,81 77.0 66,29 16.6 7,28.0 27.8 27.7,0.00,Jan
2,2,46 41.7 37,41 35.8 0,100 84.0 66,10 3.9 0,28.1 28.1 28.0,0.00,Jan
3,3,46 42.3 36,43 39.3 36,100 89.7 76,7 1.5 0,28.1 28.0 28.0,0.00,Jan
4,4,50 46.6 45,48 44.9 43,100 94.0 87,8 4.3 0,28.0 27.7 27.6,0.00,Jan
5,5,52 46.8 43,46 44.1 39,100 88.9 76,5 1.7 0,27.6 27.6 27.5,0.00,Jan
6,6,46 42.4 39,39 23.0 0,93 70.9 53,15 7.3 0,27.7 27.7 27.6,0.00,Jan
7,7,39 36.3 34,37 22.8 0,93 87.0 65,23 11.7 0,27.6 27.5 27.5,0.00,Jan
8,8,37 34.3 32,23 19.0 16,65 54.4 44,28 18.5 7,27.9 27.7 27.6,0.00,Jan
9,9,36 32.8 30,21 18.1 14,64 55.4 47,20 7.2 0,28.0 27.9 27.9,0.00,Jan


In [57]:
# split the columns that hold more than one value

df_clima[["Tmax", "Avg", "Tmin"]] = df_clima["Temperature (° F)"].str.split(" ", expand = True)


In [74]:
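def separar_columnas(lista, columna):
    # split the "max avg min" text of `columna` into three new columns of df_clima,
    # named after the entries of `lista`, and return the split values
    partes = df_clima[columna].str.split(" ", expand = True)
    df_clima[lista] = partes
    return partes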
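Splitting on spaces leaves every value as text (`"48"`, `"42.9"`), so the new columns still need a numeric conversion before any calculation. The sketch below shows that step for a single cell in plain Python. `split_triple` is a hypothetical helper and not part of the notebook.

```python
def split_triple(value):
    """Turn 'max avg min' text such as '48 42.9 39' into three floats."""
    parts = value.split()
    if len(parts) != 3:
        raise ValueError(f"expected three values, got {value!r}")
    maximum, average, minimum = (float(part) for part in parts)
    return maximum, average, minimum


print(split_triple("48 42.9 39"))

# The repeated header rows cannot be converted, which is one more
# reason to drop them from the dataframe before doing any maths.
try:
    split_triple("Max Avg Min")
except ValueError as error:
    print("not numeric:", error)
```

In pandas, the same conversion could be applied to a whole column once the header rows are gone, for example with `astype(float)`.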


In [75]:
columnas = df_clima.columns[1:6]
columnas

Index(['Temperature (° F)', 'Dew Point (° F)', 'Humidity (%)',
       'Wind Speed (mph)', 'Pressure (Hg)'],
      dtype='object')

In [76]:
listas = [["Tmax", "Avg", "Tmin"], ["Dmax", "DAvg", "Dmin"], ["Hmax", "HAvg", "Hmin"], ["Wmax", "WAvg", "Wmin"], 
         ["Pmax", "PAvg", "Pmin"]]

# pair each list of new column names with the column it is split from
for nombres, columna in zip(listas, columnas):
    print(separar_columnas(nombres, columna))

     0     1   2
1   48  42.9  39
2   46  41.7  37
3   46  42.3  36
4   50  46.6  45
5   52  46.8  43
..  ..   ...  ..
27  63  53.3  45
28  68  53.6  41
29  59  53.0  45
30  57  50.8  45
31  59  51.1  45

[90 rows x 3 columns]
     0     1   2
1   41  36.2  34
2   41  35.8   0
3   43  39.3  36
4   48  44.9  43
5   46  44.1  39
..  ..   ...  ..
27  45  33.9   0
28  43  38.2  34
29  50  43.8  41
30  50  40.2  34
31  41  21.5   0

[90 rows x 3 columns]
      0     1   2
1    81  77.0  66
2   100  84.0  66
3   100  89.7  76
4   100  94.0  87
5   100  88.9  76
..  ...   ...  ..
27   87  57.0  34
28   87  60.4  28
29   88  71.4  55
30   88  68.3  47
31   67  53.4  41

[90 rows x 3 columns]
     0     1  2
1   29  16.6  7
2   10   3.9  0
3    7   1.5  0
4    8   4.3  0
5    5   1.7  0
..  ..   ... ..
27  20   8.5  2
28   8   3.1  0
29  23  10.8  0
30  28  14.6  6
31  21  13.6  6

[90 rows x 3 columns]
       0     1     2
1   28.0  27.8  27.7
2   28.1  28.1  28.0
3   28.1  28.0  28.0
4   28.0

In [58]:
df_clima.head(2)

Unnamed: 0,Time,Temperature (° F),Dew Point (° F),Humidity (%),Wind Speed (mph),Pressure (Hg),Precipitation (in),month,Tmax,Avg,Tmin
0,Jan,Max Avg Min,Max Avg Min,Max Avg Min,Max Avg Min,Max Avg Min,Total,Jan,Max,Avg,Min
1,1,48 42.9 39,41 36.2 34,81 77.0 66,29 16.6 7,28.0 27.8 27.7,0.00,Jan,48,42.9,39


In [59]:
df_clima.drop([0], axis = 0, inplace = True)

In [61]:
df_clima.head(34)

Unnamed: 0,Time,Temperature (° F),Dew Point (° F),Humidity (%),Wind Speed (mph),Pressure (Hg),Precipitation (in),month,Tmax,Avg,Tmin
1,1,48 42.9 39,41 36.2 34,81 77.0 66,29 16.6 7,28.0 27.8 27.7,0.0,Jan,48,42.9,39
2,2,46 41.7 37,41 35.8 0,100 84.0 66,10 3.9 0,28.1 28.1 28.0,0.0,Jan,46,41.7,37
3,3,46 42.3 36,43 39.3 36,100 89.7 76,7 1.5 0,28.1 28.0 28.0,0.0,Jan,46,42.3,36
4,4,50 46.6 45,48 44.9 43,100 94.0 87,8 4.3 0,28.0 27.7 27.6,0.0,Jan,50,46.6,45
5,5,52 46.8 43,46 44.1 39,100 88.9 76,5 1.7 0,27.6 27.6 27.5,0.0,Jan,52,46.8,43
6,6,46 42.4 39,39 23.0 0,93 70.9 53,15 7.3 0,27.7 27.7 27.6,0.0,Jan,46,42.4,39
7,7,39 36.3 34,37 22.8 0,93 87.0 65,23 11.7 0,27.6 27.5 27.5,0.0,Jan,39,36.3,34
8,8,37 34.3 32,23 19.0 16,65 54.4 44,28 18.5 7,27.9 27.7 27.6,0.0,Jan,37,34.3,32
9,9,36 32.8 30,21 18.1 14,64 55.4 47,20 7.2 0,28.0 27.9 27.9,0.0,Jan,36,32.8,30
10,10,32 26.9 21,27 21.0 18,100 79.2 55,6 1.9 0,28.0 27.9 27.9,0.0,Jan,32,26.9,21


In [None]:
# drop every row whose index is 0 (the repeated header rows)



In [None]:
# check the results again

