# Web Data Extraction - Selenium

Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.

- [Official Documentation](https://www.selenium.dev/documentation/)

- [Unofficial Documentation](https://selenium-python.readthedocs.io/index.html)

In [None]:
import random
import time
import pandas as pd

import requests
import bs4

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

### [WebDriver](https://www.selenium.dev/documentation/webdriver/)

A driver is responsible for controlling the actual browser. Most drivers are created by the browser vendors themselves. Drivers are generally executable modules that run on the same system as the browser, which is not necessarily the system executing the test suite.


[Install browser drivers](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/)

In [None]:
path = './requirements/chromedriver.exe'
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)
type(driver)

### [Browser info & actions](https://www.selenium.dev/documentation/webdriver/browser/)

You can get browser info and interact with the browser (e.g., navigation, alerts, cookies, frames, windows).

In [None]:
# Navigate to...

driver.get('https://toogoodtogo.es/es/blog')
driver.maximize_window()

In [None]:
# Browser info

print(driver.title)
print(driver.current_url)

In [None]:
# Cookies

print(driver.get_cookies())
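
`get_cookies()` returns a list of dictionaries, one per cookie, each with at least a `name` and a `value` key. A minimal sketch of flattening that list into a plain name-to-value mapping, which can be easier to inspect; the helper name `cookies_to_dict` and the sample cookies are hypothetical, not taken from the site above:

```python
def cookies_to_dict(cookies):
    """Map cookie names to values from a list of Selenium-style cookie dicts."""
    return {cookie['name']: cookie['value'] for cookie in cookies}


# Hypothetical sample in the shape returned by driver.get_cookies()
sample_cookies = [
    {'name': 'session', 'value': 'abc123', 'domain': 'example.com'},
    {'name': 'lang', 'value': 'es', 'domain': 'example.com'},
]

cookie_map = cookies_to_dict(sample_cookies)
print(cookie_map)
```

With a live session, `cookies_to_dict(driver.get_cookies())` could be passed as the `cookies` argument of `requests.get` to reuse the browser's cookies in a plain HTTP request.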

### [Web elements](https://www.selenium.dev/documentation/webdriver/elements/)

You can find web elements, interact with them, and extract information from them.

In [None]:
# Find cookies button

cookie_button = driver.find_element(by=By.CLASS_NAME, value='coi-banner__decline')
cookie_button

In [None]:
# Decline cookies (the button located above is the banner's decline button)

cookie_button.click()

In [None]:
# Find blog posts

blog_posts = driver.find_elements(by=By.TAG_NAME, value='a')

In [None]:
# Get info from post (e.g.: url)

# Anchors without an href return None, so fall back to '' before testing
urls = [a.get_attribute('href') for a in blog_posts if 'blog/' in (a.get_attribute('href') or '')]
urls
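
Not every `<a>` element carries an `href` (in that case `get_attribute('href')` returns `None`), and the same post may be linked more than once on the page. A sketch of a filtering step under those assumptions; `filter_blog_links` and the sample hrefs are hypothetical:

```python
def filter_blog_links(hrefs, marker='blog/'):
    """Keep hrefs containing `marker`, skipping None and duplicates, in order."""
    seen = set()
    links = []
    for href in hrefs:
        if href and marker in href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Hypothetical hrefs, including a missing one and a duplicate
sample_hrefs = [
    None,
    'https://example.com/es/blog/post-a',
    'https://example.com/es/contact',
    'https://example.com/es/blog/post-a',
    'https://example.com/es/blog/post-b',
]

blog_links = filter_blog_links(sample_hrefs)
print(blog_links)
```

With the live page, the input could be `[a.get_attribute('href') for a in blog_posts]`.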

#### Now the selenium magic!!!

In [None]:
# Click the "Ver más" button

driver.find_elements(by=By.TAG_NAME, value="button")[-1].click()

In [None]:
blog_posts = driver.find_elements(by=By.TAG_NAME, value='a')
urls = [a.get_attribute('href') for a in blog_posts if 'blog/' in (a.get_attribute('href') or '')]
len(urls)

---

### Get all the posts that you need

You may want to establish a [waiting strategy](https://www.selenium.dev/documentation/webdriver/waits/)

In [None]:
driver.implicitly_wait(10)
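
An implicit wait applies to every element lookup. The other strategy in the linked docs, an explicit wait, polls one specific condition until it holds or a timeout expires. The sketch below reproduces that polling idea in plain Python so it runs without a browser; `wait_until` is a hypothetical helper, not part of Selenium, whose own equivalent is `WebDriverWait(driver, timeout).until(condition)`:

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll)


# Hypothetical condition that only succeeds on the third check
attempts = {'count': 0}

def ready_on_third_try():
    attempts['count'] += 1
    return attempts['count'] >= 3

result = wait_until(ready_on_third_try, timeout=2.0, poll=0.01)
print(result, attempts['count'])
```

The Selenium documentation advises against mixing implicit and explicit waits in the same session, so it is probably best to pick one strategy.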

In [None]:
%%time
# Click the button as many times as you wish, but be careful: too many rapid requests can get you banned!

for click in range(100):
    try:
        driver.find_elements(by=By.TAG_NAME, value="button")[-1].click()
        # Uncomment to pause a random 1-4 seconds between clicks
        #secs = random.randint(1,4)
        #time.sleep(secs)
    except Exception:
        print(f'Clicking stopped after {click} successful clicks')
        break
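
The loop above can be wrapped in a reusable function that pauses a random interval between clicks, which lowers the risk of being rate-limited. A minimal sketch; `click_repeatedly`, its parameters, and the fake button are hypothetical, and the click action is passed in as a callable so the example runs without a browser:

```python
import random
import time


def click_repeatedly(click, max_clicks=100, delay_range=(1.0, 4.0), sleep=time.sleep):
    """Call `click` up to `max_clicks` times, pausing randomly between calls.

    Stops at the first exception raised by `click` and returns the number
    of successful clicks.
    """
    done = 0
    for _ in range(max_clicks):
        try:
            click()
        except Exception:
            break
        done += 1
        sleep(random.uniform(*delay_range))
    return done


# Hypothetical stand-in for a button that disappears after three clicks
remaining = [3]

def fake_click():
    if remaining[0] == 0:
        raise IndexError('no button left')
    remaining[0] -= 1

# Skip the real pauses in this demonstration
clicks = click_repeatedly(fake_click, sleep=lambda seconds: None)
print(clicks)
```

With the live driver, the `click` argument could be `lambda: driver.find_elements(by=By.TAG_NAME, value="button")[-1].click()`.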

In [None]:
%%time
# Get posts info

blog_posts = driver.find_elements(by=By.TAG_NAME, value='a')

# Filter once so the four lists stay aligned (same length, same order)
cards = [(a.text.split('\n'), a.get_attribute('href')) for a in blog_posts
         if '|' in a.text and 'blog/' in (a.get_attribute('href') or '')]

posts = [lines[1] for lines, _ in cards]
urls = [href for _, href in cards]
date = [lines[2].split(' | ')[0] for lines, _ in cards]
author = [lines[2].split(' | ')[1] for lines, _ in cards]
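
The indexing above assumes each post link's text has the title on its second line and `date | author` on its third. A sketch of that parsing as a standalone function that returns `None` for text in any other shape; `parse_post_text` and the sample text are hypothetical:

```python
def parse_post_text(text):
    """Return title, date and author from a post link's text, or None."""
    lines = text.split('\n')
    # Expect at least three lines, the third shaped like "date | author"
    if len(lines) < 3 or ' | ' not in lines[2]:
        return None
    date, author = lines[2].split(' | ', 1)
    return {'title': lines[1], 'date': date, 'author': author}


# Hypothetical text in the assumed three-line layout
sample_text = 'Category\nA sample post title\n01/01/2024 | Jane Doe'

parsed = parse_post_text(sample_text)
print(parsed)
```

Parsing each link once like this keeps title, date, and author tied to the same element, so the columns cannot drift out of step.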

In [None]:
# Pandas!!!

df = pd.DataFrame({'Blog Posts':posts,
                   'Links':urls,
                   'Date':date,
                   'Author':author}).drop_duplicates()

df

### End the session

This ends the driver process, which by default closes the browser as well. No more commands can be sent to this driver instance.

In [None]:
driver.quit()
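
If a cell fails midway, `driver.quit()` may never run and the browser process is left behind. One way to guard against that is a small context manager that always quits on exit. A minimal sketch; `browser_session` and `FakeDriver` are hypothetical names, and the fake driver stands in for a real one so the example runs without a browser:

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the `with` body, and always quit it."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in exposing only the quit() method."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with browser_session(lambda: fake) as session:
        raise RuntimeError('scraping failed')
except RuntimeError:
    pass

# quit() ran even though the body raised
print(fake.closed)
```

With Selenium, the factory could be `lambda: webdriver.Chrome(service=service)`.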

### Let's build some robots!!!

![robot](https://media.giphy.com/media/5YEgnkjeryvwA/giphy.gif)