# Web Data Extraction - Selenium

Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.

- [Official Documentation](https://www.selenium.dev/documentation/)

- [Unofficial Documentation](https://selenium-python.readthedocs.io/index.html)

In [1]:
import random
import time
import pandas as pd

import requests
import bs4

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

### [WebDriver](https://www.selenium.dev/documentation/webdriver/)

Responsible for controlling the actual browser. Most drivers are created by the browser vendors themselves. Drivers are generally executable modules that run on the system with the browser itself, not on the system executing the test suite.


[Install browser drivers](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/)

In [2]:
path = './requirements/chromedriver.exe'
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)
type(driver)

selenium.webdriver.chrome.webdriver.WebDriver

### [Browser info & actions](https://www.selenium.dev/documentation/webdriver/browser/)

You can get browser info and interact with the browser (e.g. navigation, alerts, cookies, frames, windows).

In [3]:
# Navigate to...

driver.get('https://toogoodtogo.es/es/blog')
driver.maximize_window()

In [4]:
# Browser info

print(driver.title)
print(driver.current_url)

Blog | Too Good To Go
https://toogoodtogo.es/es/blog


In [5]:
# Cookies

print(driver.get_cookies())

[{'domain': 'toogoodtogo.es', 'httpOnly': False, 'name': 'time_stamp', 'path': '/', 'secure': False, 'value': '1657130501695'}, {'domain': 'toogoodtogo.es', 'httpOnly': False, 'name': 'accepted_localizations', 'path': '/es', 'secure': False, 'value': '%5B%22es%22%5D'}, {'domain': 'toogoodtogo.es', 'httpOnly': False, 'name': 'localization_code', 'path': '/es', 'secure': False, 'value': 'es'}, {'domain': 'toogoodtogo.es', 'httpOnly': False, 'name': 'country_code', 'path': '/es', 'secure': False, 'value': 'es'}]
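
The cookie list is easier to work with as a name-to-value mapping, for example if you later want to pass the same cookies to `requests`. The following is a minimal sketch, not part of the original notebook: `cookies_to_dict` is a hypothetical helper, and it assumes every cookie dict has `name` and `value` keys, as in the output above. Values are URL-decoded with the standard library.

```python
from urllib.parse import unquote


def cookies_to_dict(cookies):
    """Map cookie names to URL-decoded values.

    `cookies` is a list of dicts shaped like the output of driver.get_cookies().
    If two cookies share a name, the later one wins.
    """
    return {cookie["name"]: unquote(cookie["value"]) for cookie in cookies}


# Two entries copied (and shortened) from the output above
sample = [
    {"domain": "toogoodtogo.es", "name": "localization_code", "path": "/es", "value": "es"},
    {"domain": "toogoodtogo.es", "name": "accepted_localizations", "path": "/es", "value": "%5B%22es%22%5D"},
]
print(cookies_to_dict(sample))
```

With a live session you would call `cookies_to_dict(driver.get_cookies())` instead of using the sample list.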


### [Web elements](https://www.selenium.dev/documentation/webdriver/elements/)

You can find web elements, interact with them, and extract information from them.

In [6]:
# Find the cookie banner's decline button

cookie_button = driver.find_element(by=By.CLASS_NAME, value='coi-banner__decline')
cookie_button

<selenium.webdriver.remote.webelement.WebElement (session="03d0848e7def821ccbd2c19a4a4c7608", element="aa8ba60b-c525-41af-a3c4-13def2b56f6d")>

In [7]:
# Dismiss the cookie banner (the class name suggests this button declines cookies)

cookie_button.click()

In [8]:
# Find blog posts

blog_posts = driver.find_elements(by=By.TAG_NAME, value='a')

In [9]:
len(blog_posts)

225

In [10]:
# Get info from post (e.g.: url)

hrefs = [post.get_attribute('href') for post in blog_posts]
urls = [href for href in hrefs if href and 'blog/' in href]  # skip <a> tags that have no href
urls

['https://toogoodtogo.es/es/blog/tgtg-westfield',
 'https://toogoodtogo.es/es/blog/ww-junio',
 'https://toogoodtogo.es/es/blog/fruta-temporada',
 'https://toogoodtogo.es/es/blog/tgtg-westfield',
 'https://toogoodtogo.es/es/blog/ww-junio',
 'https://toogoodtogo.es/es/blog/fruta-temporada',
 'https://toogoodtogo.es/es/blog/sorteo-restaurante-coque',
 'https://toogoodtogo.es/es/blog/chefs-contra-el-desperdicio',
 'https://toogoodtogo.es/es/blog/pepa-munoz']
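
The list above contains repeated links, because the same post can be linked from more than one place on the page. If you only need the URLs, you can filter and deduplicate them in one step while keeping their order. This is a sketch under that assumption; `blog_urls` is a hypothetical helper name. Do not use it if you plan to line the URLs up with other per-post lists, since removing items changes the positions.

```python
def blog_urls(hrefs):
    """Keep hrefs that contain 'blog/', dropping None values and repeats."""
    matching = [href for href in hrefs if href and "blog/" in href]
    # dict.fromkeys keeps the first occurrence of each key, in order
    return list(dict.fromkeys(matching))


hrefs = [
    None,  # an <a> tag without an href attribute
    "https://toogoodtogo.es/es/blog",
    "https://toogoodtogo.es/es/blog/tgtg-westfield",
    "https://toogoodtogo.es/es/blog/ww-junio",
    "https://toogoodtogo.es/es/blog/tgtg-westfield",
]
print(blog_urls(hrefs))
```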

#### Now the Selenium magic!!!

In [11]:
# Click the "Ver más" button

driver.find_elements(by=By.TAG_NAME, value="button")[-1].click()

In [12]:
blog_posts = driver.find_elements(by=By.TAG_NAME, value='a')
hrefs = [post.get_attribute('href') for post in blog_posts]
urls = [href for href in hrefs if href and 'blog/' in href]  # skip <a> tags that have no href
len(urls)

15

---

### Get all the posts that you need

You may want to establish a [waiting strategy](https://www.selenium.dev/documentation/webdriver/waits/)

In [13]:
driver.implicitly_wait(10)
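
An implicit wait only affects element lookups. For any other condition (a counter changing, a list growing) you need an explicit wait, which polls a condition until it holds or a timeout expires. Selenium provides `WebDriverWait` for this; the plain-Python sketch below only illustrates the polling idea. `wait_until` and its parameters are hypothetical names, not Selenium API.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page that becomes ready on the third check
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=2, poll=0.01))
```

In the notebook, the condition could be something like "more blog links than before the click", evaluated against the live `driver`.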

In [14]:
%%time
# Click the button as many times as you wish, but be careful: too many requests can get you banned!

for click in range(100):
    try:
        driver.find_elements(by=By.TAG_NAME, value="button")[-1].click()
        secs = random.randint(1, 4)  # random pause between clicks
        time.sleep(secs)
    except Exception:  # the button is gone or can no longer be clicked
        print(f'You have reached the total amount of clicks: {click}')
        break

You have reached the total amount of clicks: 49
Wall time: 2min 22s
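
The loop above can be wrapped in a small function so it is reusable and can be tried out without a browser. This is a sketch, not the notebook's method: `click_until_exhausted`, `get_button`, and `FakeButton` are hypothetical names, and the function treats any exception as the signal that there is nothing left to click.

```python
import random
import time


def click_until_exhausted(get_button, max_clicks=100, pause_range=(1, 4), sleep=time.sleep):
    """Click the element returned by `get_button` up to `max_clicks` times.

    Stops early the first time looking up or clicking the button raises,
    and returns the number of successful clicks.
    """
    for clicks in range(max_clicks):
        try:
            get_button().click()
        except Exception:  # broad on purpose: any failure ends the loop
            return clicks
        # Random pause so clicks are not sent at a fixed rate
        sleep(random.randint(*pause_range))
    return max_clicks


class FakeButton:
    """Stand-in for a WebElement that disappears after `limit` clicks."""

    def __init__(self, limit):
        self.remaining = limit

    def click(self):
        if self.remaining == 0:
            raise RuntimeError("button is gone")
        self.remaining -= 1


button = FakeButton(limit=3)
print(click_until_exhausted(lambda: button, sleep=lambda secs: None))
```

With the notebook's session, `get_button` could be `lambda: driver.find_elements(by=By.TAG_NAME, value="button")[-1]`, leaving `sleep` at its default.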


In [15]:
%%time
# Get posts info

blog_posts = driver.find_elements(by=By.TAG_NAME, value='a')

# Read each element's text and href once: every access is a round trip to the browser
texts = [post.text for post in blog_posts]
texts = [text for text in texts if '|' in text]
hrefs = [post.get_attribute('href') for post in blog_posts]

posts = [text.split('\n')[1] for text in texts]

urls = [href for href in hrefs if href and 'blog/' in href]

date = [text.split('\n')[2].split(' | ')[0] for text in texts]

author = [text.split('\n')[2].split(' | ')[1] for text in texts]

Wall time: 1min 17s
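
Building the four lists separately only works if the `'|'` filter and the `'blog/'` filter select exactly the same links. An alternative is to turn each link into one record, so a title always stays next to its own URL. This is a sketch that assumes the text layout the cell above relies on (title on the second line, `date | author` on the third); `parse_post` is a hypothetical helper and the sample text is made up apart from the title, date, and author taken from the results.

```python
def parse_post(text, href):
    """Split a link's text into a record, or return None if it is not a post card.

    Assumes the title is on the second line and 'date | author' on the third.
    """
    lines = text.split("\n")
    if len(lines) < 3 or " | " not in lines[2]:
        return None
    date, author = lines[2].split(" | ", 1)
    return {"Blog Posts": lines[1], "Links": href, "Fecha": date, "Autor": author}


# Hypothetical link text; the first line is a made-up placeholder
text = "BLOG\nTALLERES ANTI DESPERDICIO DE LA MANO DE URW\nhace 2 días | Rocío Abella Ruiz"
print(parse_post(text, "https://toogoodtogo.es/es/blog/tgtg-westfield"))
```

A list of these records, with the `None` entries filtered out, can be passed straight to `pd.DataFrame`.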


In [16]:
# Pandas!!!

df = pd.DataFrame({'Blog Posts':posts,
                   'Links':urls,
                   'Fecha':date,
                   'Autor':author}).drop_duplicates()

df

| | Blog Posts | Links | Fecha | Autor |
|---|---|---|---|---|
| 0 | TALLERES ANTI DESPERDICIO DE LA MANO DE URW | https://toogoodtogo.es/es/blog/tgtg-westfield | hace 2 días | Rocío Abella Ruiz |
| 1 | #WASTEWARRIORDELMES DE JUNIO 🍕 | https://toogoodtogo.es/es/blog/ww-junio | hace 9 días | Elena Melchor |
| 2 | ¿POR QUÉ ES IMPORTANTE CONSUMIR FRUTA Y VERDUR... | https://toogoodtogo.es/es/blog/fruta-temporada | hace 19 días | Gisela Casanovas |
| 6 | CONSIGUE UNA EXPERIENCIA 2 ESTRELLAS MICHELÍN ... | https://toogoodtogo.es/es/blog/sorteo-restaura... | hace 22 días | Lorena Diaz-Santos |
| 7 | DESCUBRE LA SORPRESA DE PEPA MUÑOZ EN #CHEFSCO... | https://toogoodtogo.es/es/blog/chefs-contra-el... | hace un mes | Elena Melchor |
| ... | ... | ... | ... | ... |
| 303 | LOS 5 REGALOS NAVIDEÑOS SOSTENIBLES | https://toogoodtogo.es/es/blog/regalos-naviden... | hace 4 años | Samuel Asenjo |
| 304 | GREEN FINDE: 2X1 EN TUS PACKS SALVADOS | https://toogoodtogo.es/es/blog/green-finde-2x1... | hace 4 años | Jonathan Zarzalejo |
| 305 | ¡YA SOMOS MÁS DE 25.000 WASTE WARRIORS! | https://toogoodtogo.es/es/blog/ya-somos-mas-de... | hace 4 años | Jonathan Zarzalejo |
| 306 | ¡SALVAR COMIDA TIENE PREMIO! #DÍAMUNDIALDELAAL... | https://toogoodtogo.es/es/blog/en-el-dia-mundi... | hace 4 años | Jonathan Zarzalejo |


In [17]:
df['Autor'].unique()

array(['Rocío Abella Ruiz', 'Elena Melchor', 'Gisela Casanovas',
       'Lorena Diaz-Santos', 'Marianne Costa', 'Helena Calvo',
       'Carmen Huidobro', 'Jonathan Zarzalejo', 'Samuel Asenjo',
       'Franziska Lienert', 'Tanja Andersen', 'Nora Di Cesare',
       'Nicole van Brummelen', 'Laetitia Ramé', 'Carlos García'],
      dtype=object)

### End the session

This ends the driver process, which by default closes the browser as well. No more commands can be sent to this driver instance.

In [18]:
driver.quit()
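
If a cell fails half way through, `driver.quit()` may never run and the browser process stays open. One way to guard against that is a context manager that always quits on exit. This is a minimal sketch: `browser_session` and `FakeDriver` are hypothetical names, and the only thing assumed about the driver is that it has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver built by `make_driver` and always call quit() afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a WebDriver; records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with browser_session(lambda: fake) as session:
        raise ValueError("scraping failed half way")
except ValueError:
    pass
print(fake.closed)
```

With Selenium, `make_driver` could be `lambda: webdriver.Chrome(service=service)`, reusing the `service` object created earlier.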

### Let's build some robots!!!

![robot](https://media.giphy.com/media/5YEgnkjeryvwA/giphy.gif)