**Disclaimer**: This educational content, including any code examples, is provided for instructional purposes only. The author does not endorse or encourage the unauthorised or illegal scraping of websites.

While Python with relevant libraries can be used for web scraping, it's crucial to conduct scraping activities in compliance with applicable laws, the website's terms of service, and ethical considerations. Always review and respect the rules set by the website you are scraping to ensure legal and responsible data collection practices.

# Scraping reviews using Selenium

Here is another example of how Selenium can be used to interact with websites that use Ajax (Asynchronous JavaScript and XML):

## Selenium is a browser automation framework

It will enable us to tell Chrome to:
* go to the page bbc.co.uk/weather
* "click the word 'next'"
* scroll down

Selenium opens a Chrome window that it controls, uses it for a few seconds, and closes it afterwards. You might even see it flash on your screen. We will then use BeautifulSoup to parse the page's HTML.

## BeautifulSoup is an HTML parsing framework

It will enable us to:
* copy the HTML of tags, e.g. `div`, `table`
* extract text from these tags
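
The two operations above can be sketched on a small HTML fragment. The fragment and its class names (`forecast`, `day`, `time`) are made up for illustration and are not taken from any real page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example
html = """
<div class="forecast">
  <span class="day">Sunday</span>
  <span class="time">08:29</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. copy the HTML of a tag (here: the first div with class "forecast")
forecast_div = soup.find("div", class_="forecast")

# 2. extract text from tags inside it
day = forecast_div.find("span", class_="day").get_text(strip=True)
times = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="time")]

print(day, times)
```

`find` returns the first matching tag, while `find_all` returns a list of every match, which is why the times end up in a list.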

## Getting Selenium (don't skip this!): you need to download ChromeDriver yourself

1. Find out which version of Chrome you have: in Chrome, open the page chrome://settings/help
2. Note your Chrome version (e.g. 131.0.6778.265). The older ChromeDriver download site only goes up to version 114.0.5735.90.
**If you are using Chrome version 115 or newer, please consult the [Chrome for Testing availability dashboard](https://googlechromelabs.github.io/chrome-for-testing/). This page provides convenient JSON endpoints for downloading specific ChromeDriver versions.**
If your version is 114.0.5735.90 or older, find the matching version of ChromeDriver on https://chromedriver.storage.googleapis.com/index.html
3. Go into the folder for your version and download the zip file for your operating system (most likely `chromedriver_mac64.zip` or `chromedriver_win32.zip`).
4. Unzip that file on your machine and put the executable (`chromedriver.exe` on Windows, or `chromedriver` on macOS) in the folder where this notebook is.
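
The version rule in step 2 can be written down as a small helper. This is only a sketch: the function name `chromedriver_source` is hypothetical, and it only tells you which of the two pages mentioned above to visit; it does not download anything.

```python
# The two download pages named in the steps above
CFT_DASHBOARD = "https://googlechromelabs.github.io/chrome-for-testing/"
LEGACY_INDEX = "https://chromedriver.storage.googleapis.com/index.html"


def chromedriver_source(chrome_version: str) -> str:
    """Return the page to visit for a given Chrome version string.

    Hypothetical helper: the major version is the number before the first dot.
    """
    major = int(chrome_version.split(".")[0])
    # Chrome 115 and newer: Chrome for Testing dashboard; older: legacy index
    return CFT_DASHBOARD if major >= 115 else LEGACY_INDEX


print(chromedriver_source("131.0.6778.265"))
```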

In [1]:
## If you have not installed selenium yet, run the following line
!pip install selenium

Defaulting to user installation because normal site-packages is not writeable
Collecting selenium
  Downloading selenium-4.27.1-py3-none-any.whl.metadata (7.1 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.28.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting sortedcontainers (from trio~=0.17->selenium)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Collecting PySocks!=1.5.7,<2.0,>=1.5.6 (from urllib3[socks]<3,>=1.26->selenium)
  Downloading PySocks-1.7.1-py3-none-any.whl.metadata (13 kB)
Downloading selenium-4.27.1-py3-none-any.whl (9.7 MB)

In [2]:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

In [7]:
# define a function that creates a browser suitable for your operating system
import sys
from selenium.webdriver.chrome.service import Service

def get_a_browser():
    if sys.platform.startswith('win32') or sys.platform.startswith('cygwin'):
        return webdriver.Chrome()  # Windows
    else:
        # Selenium 4 takes the driver path via a Service object, not a bare string
        return webdriver.Chrome(service=Service('./chromedriver'))  # Mac

In [None]:
webdriver.Chrome()

AttributeError: 'str' object has no attribute 'capabilities'

**Important Note**: you must allow your system to run `chromedriver`. This needs to be done just once.

If you are on a Mac, you will need to allow your system to run `chromedriver`. Run the cell below; you will likely see a warning the first time. Click 'Cancel' (don't click 'Delete').

After you see the warning, go to `Settings > Security & Privacy > General` and click `"Allow Anyway"`.

On a Windows PC the process is simpler. When asked, allow the computer to use the `chromedriver.exe` in the folder.

## Task: let's try to scrape an interactive website

What will the weather be in Edinburgh next Sunday?

You need a web browser, pen and paper!

In this task you will be asked to do something by yourself (using your web browser, mouse and keyboard), and then you will see how you can program `Selenium` to do it for you.

**Use www.bbc.co.uk/weather to find out what time sunrise will be in EDINBURGH next Sunday.**

Do it at least 3 times and observe all the steps you are taking. Make a very detailed list of all the steps, as if you had to describe them to someone over the phone without seeing their screen. See example below.

It will look a bit like this:
* ok, go to www.bbc.co.uk/weather and wait for it to load
* scroll down, do you see a link with the word 'Edinburgh' on it? Click it.
* Wait a minute for it to load.
* ok, now scroll down and ...

When you are done with this exercise, we will try to instruct Selenium (a browser automation tool) to do it for us. Can you use Chrome DevTools to make your steps more specific? For example, instead of saying "copy text in that bold link next to the word Sunrise", try to say "copy text from the HTML `span` element with the class `wr-c-astro-data__time`".

**SERIOUSLY: Take a few minutes to do this. It will make you learn more from the below code!**

OK. Now let's get Python to do it for us.

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# start a Chrome session controlled by Selenium
browser = webdriver.Chrome()

# the url we want to open (the BBC Weather page)
url = 'https://www.bbc.co.uk/weather'

# load the webpage in the browser
browser.get(url)


# Handle cookies popup
try:
    WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    ).click()
    print("Cookies popup dismissed.")
except Exception as e:
    print("No cookies popup found:", e)

# Interact with Edinburgh element
try:
    element = WebDriverWait(browser, 10).until(
        EC.visibility_of_element_located((By.XPATH, "//span[@role='text' and contains(@class, 'ssrcss-1xjdod2-StyledLabel') and text()='Edinburgh']"))
    )
    # Scroll into view and click
    browser.execute_script("arguments[0].scrollIntoView(true);", element)
    browser.execute_script("arguments[0].click();", element)
    print("Clicked on Edinburgh successfully.")

    # pause briefly to let everything load
    time.sleep(1)

    # we select the HTML body (the main page content)
    body = browser.find_element(By.TAG_NAME, 'body')

    # we use Selenium's send_keys() function to scroll down to where we want to click
    body.send_keys(Keys.PAGE_DOWN)

    # search for the link to next Sunday's forecast
    try:
        # the link text will look like "Sun 12Dec", so we match on partial link text
        next_button = browser.find_element(By.PARTIAL_LINK_TEXT, 'Sun ')
        next_button.click()
    except NoSuchElementException:  # no link containing 'Sun ' was found on the page
        print("Something went wrong. There was no Sunday link.")
        
    # load current view of the page into a soup
    soup = BeautifulSoup(browser.page_source, 'html.parser')
except Exception as e:
    print("Error clicking Edinburgh:", e)





No cookies popup found: Message: 

In [25]:
"""
1. Find all the elements of class `wr-c-astro-data__time`.
2. These values are the sunrise and sunset times for today and the following 13 days.
3. `browser.page_source` always returns the whole page, so `find_all` returns the times for every day at once.
4. A simple, workable solution is to count the days between today and next Sunday
   and then pick the matching element of the sunrise_tag list.
"""

# The whole list
sunrise_tag = soup.find_all("span", {"class" : 'wr-c-astro-data__time'})
# How many days there are between today and next Sunday
# PLEASE KEEP THE BROWSER WINDOW OPEN: next_button is a live element and cannot be read once the browser is closed
diff = int(next_button.get_attribute('id')[-1])

print("Sunrise next Sunday: ", sunrise_tag[2*diff].text)

Sunrise next Sunday:  08:29


In [26]:
for i in range(16):
    print(i, sunrise_tag[i].text)

0 08:34
1 16:11
2 08:33
3 16:13
4 08:32
5 16:15
6 08:31
7 16:17
8 08:29
9 16:18
10 08:28
11 16:20
12 08:27
13 16:22
14 08:25
15 16:24
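
The listing above alternates sunrise and sunset times, which is why the sunrise for a given day sits at index `2*diff`. As a sketch, the flat list can be split into per-day pairs so that each day is addressed by a single index. The times below are copied from the output above; in the notebook they would come from `sunrise_tag`, and `diff = 4` is inferred from the printed result rather than read from the page.

```python
# First ten values from the output above (sunrise, sunset, sunrise, sunset, ...)
times = ["08:34", "16:11", "08:33", "16:13", "08:32",
         "16:15", "08:31", "16:17", "08:29", "16:18"]

sunrises = times[0::2]  # even positions: sunrise
sunsets = times[1::2]   # odd positions: sunset

# one (sunrise, sunset) pair per day, index 0 being the first day listed
days = list(zip(sunrises, sunsets))

diff = 4  # offset that matches the "Sunrise next Sunday" output above
print("Sunrise:", days[diff][0], "Sunset:", days[diff][1])
```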


In [27]:
soup

<html class="b-reith-sans-font b-pw-1280 no-touch wr-enhanced wr-svg id-svg wr-js-active b-reith-sans-loaded id-svg" data-location-id="2650225" data-location-name="Edinburgh" data-wr-unit--temperature="c" data-wr-unit--windspeed="mph" id="weather-forecast" lang="en"><head>
<meta content="width=device-width, initial-scale=1, user-scalable=1" name="viewport"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="on" http-equiv="cleartype"/>
<link href="https://ssl.bbc.co.uk/" rel="dns-prefetch"/>
<link href="http://sa.bbc.co.uk/" rel="dns-prefetch"/>
<link href="http://ichef-1.bbci.co.uk/" rel="dns-prefetch"/>
<link href="http://ichef.bbci.co.uk/" rel="dns-prefetch"/>
<style>
        [data-wr-unit--temperature="c"] .wr-c-map__temperature-f,
        [data-wr-unit--temperature="f"] .wr-c-map__temperature-c,
        [data-wr-unit--windspeed="mph"] .wr-c-map__wind-kph,
        [data-wr-unit--windspeed="kph"] .wr-c-map__wind-mph {
            

**If you are interested, try to find a better way to do this!**
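
One possible alternative, sketched under assumptions, is to compute the day offset from the calendar instead of reading it from the link's `id`. The function name `days_until_next_sunday` is hypothetical, and the sketch assumes that the first sunrise/sunset pair on the page belongs to today and that "next Sunday" means the first Sunday strictly after today.

```python
from datetime import date


def days_until_next_sunday(today: date) -> int:
    """Number of days from `today` to the first Sunday after it."""
    # date.weekday() returns 0 for Monday through 6 for Sunday
    offset = (6 - today.weekday()) % 7
    # Assumption: if today is already Sunday, "next Sunday" is a week away
    return offset or 7


# Example with a fixed date: 11 December 2024 was a Wednesday
diff = days_until_next_sunday(date(2024, 12, 11))
print(diff)  # 4
```

In the notebook you would pass `date.today()` and then index the list as before with `sunrise_tag[2 * diff]`, provided the assumptions above hold for the page you scraped.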

**You can now automatically extract other weather information for any city!**