## Scraping using Selenium

*Prepared by:*
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook shows how to scrape dynamic webpages using Selenium. We will scrape https://quotes.toscrape.com/scroll, a page built for practicing scraping, just like https://quotes.toscrape.com/.

**Reminder**

> *"With great power, comes great responsibility"*
    
Remember to perform web scraping with extra caution and to not abuse it. The boundaries are not so clear when it comes to what you can and cannot legally do with scraping. Use your own judgment to determine if what you are about to do is unethical or illegal.
<hr>

<sup>```Last run: 2021-07-12 11:28PM (GMT +8)```</sup>

### Import libraries

We will use the `requests` and `BeautifulSoup` libraries in the next cells. Together they give us what we need to fetch and parse a webpage. If BeautifulSoup is not already installed in your environment, you may use either of the following commands in your command line:

```conda install -c anaconda beautifulsoup4``` or
```pip install beautifulsoup4```

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
page = requests.get("https://quotes.toscrape.com/scroll")

#feed it into beautiful soup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quotes">
     </div>
    </div>
   </div>
   <div id="loading" style="background-color: #eeeecc">
    <h5>
     Loading...
    </h5>
   </div>
   <script src="/static/jquery.js">
   </script>
   <script>
    $(function(){
        var page = 1, tag = null, hasNextPage = true;
        function appendQuotes(quotes) {
            var $quotes = $('.quotes');
            var html = $.ma

### Inspect HTML code

Inspect the code we retrieved and compare it against the webpage. This is what we should be seeing.

<img src="../images/quotes-to-scrape-console.png">

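Why is it different? Why did we not get the contents shown on the actual webpage? Because the contents are generated dynamically: `requests` only downloads the initial HTML, and the quotes are added afterwards by JavaScript running in the browser. BeautifulSoup parses whatever HTML it is given and does not execute scripts, so it never sees them. Many webpages work this way.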
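
To make the gap concrete, we can count the quote blocks in the static HTML. The sketch below uses only the standard library's `html.parser` on a hard-coded fragment modeled on the response above; `STATIC_HTML` and `QuoteCounter` are illustrative names, not part of this notebook. With BeautifulSoup, `soup.find_all("div", class_="quote")` would be the analogous check.

```python
from html.parser import HTMLParser

# Hand-copied fragment modeled on the static response above (assumption:
# trimmed for brevity; the real page has more markup around it).
STATIC_HTML = """
<div class="row">
  <div class="col-md-8">
    <div class="quotes">
    </div>
  </div>
</div>
<div id="loading"><h5>Loading...</h5></div>
"""


class QuoteCounter(HTMLParser):
    """Counts <div> tags whose class attribute contains the word 'quote'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


counter = QuoteCounter()
counter.feed(STATIC_HTML)
print(counter.count)  # 0: the "quotes" container exists, but it is empty
```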

### Selenium to the rescue!

Selenium is an automation library that can be used to deal with dynamic webpages. To install it, you may use the following commands:

```conda install -c conda-forge selenium``` or
```pip install selenium```

You will also need a driver for your browser. See this section of the Selenium documentation for more details: https://selenium-python.readthedocs.io/installation.html#drivers

### Setup browser automation

You should see a new browser window open after executing the cell below. This browser is fully controlled by our code.

In [14]:
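from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

driver_path = "driver/chromedriver-v95.exe" # replace this with your own driver path
url = "https://quotes.toscrape.com/scroll"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driver_path))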
driver.get(url)
print(driver.page_source)

<html lang="en"><head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    
<div class="row">
    <div class="col-md-8">
        <div class="quotes"></div>
    </div>
</div>
<div id="loading" style="background-color: rgb(238, 238, 204);"><h5>Loading...</h5></div>
<script src="/static/jquery.js"></script>
<script>
    $(function(){
        var page = 1, tag = null, hasNextPage = true;
        function appendQuotes(quotes) {
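
In the output above, the `quotes` container is still empty and the "Loading..." box is still present, which suggests `page_source` was read before the page's script finished adding the quotes. A minimal sketch of a polling wait follows. `wait_for_elements` is a hypothetical helper, not a Selenium function, and it assumes only that the driver exposes `find_elements(by, value)` as Selenium's WebDriver does. Selenium also ships `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_for_elements(driver, xpath, timeout=10.0, poll=0.25):
    """Poll until at least one element matches `xpath` or `timeout` seconds pass.

    Returns the list of matching elements (empty if the timeout expired).
    """
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)


# Usage with the notebook's driver (commented out because it needs a live browser):
# quotes = wait_for_elements(driver, "//div[@class='quote']")
```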


### XPath: Getting the elements that we want

> XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. XPath supports the simple methods of locating by id or name attributes and extends them by opening up all sorts of new possibilities such as locating the third checkbox on the page.

For the XPath syntax, you may refer to the following link: https://www.w3schools.com/xml/xpath_syntax.asp
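
The kinds of path expressions used in this notebook can be tried offline first. As a rough sketch, the standard library's `xml.etree.ElementTree` understands a limited subset of XPath that happens to cover them. The `SNIPPET` markup below is a made-up, well-formed stand-in for one quote block, not the page's actual HTML.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration only.
SNIPPET = """
<body>
  <div class="quote">
    <span class="text">A sample quote.</span>
    <span>by <small class="author">Some Author</small></span>
    <div class="tags">
      <a class="tag">first</a>
      <a class="tag">second</a>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET)

# ".//" searches all descendants of the current node.
quote = root.find(".//div[@class='quote']")

# A bare tag name only matches direct children of the current node.
text = quote.find("span[@class='text']").text

# <small> is nested inside a <span>, so it needs the descendant search.
author = quote.find(".//small[@class='author']").text
tags = [a.text for a in quote.findall(".//a[@class='tag']")]

print(text, author, tags)
```

Note that `quote.find("small[@class='author']")` returns `None` here, because `<small>` is not a direct child of the quote `<div>`.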

In [15]:
# find_elements_by_xpath was removed in Selenium 4.3; "xpath" is the value of By.XPATH
quotes = driver.find_elements("xpath", "//div[@class='quote']")
quotes

[<selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="4e189e04-3ccf-4b18-8363-ff97492ba389")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="30dd14bd-e7d2-4120-9ed4-fb435726d94c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="2a66077e-9d0d-44f2-8701-6877d8a71226")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="f6c3a450-8389-40ce-a8cd-ae13f6775168")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="ec255e9e-a34b-4daa-8669-0fcf002815dd")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="16261a1a-5f5a-48e8-ab55-f1f147d809e6")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3dfe2ceb712a50287568637d0e7b3179", element="36a42e5a-6507-44cb-9357-dd

In [16]:
len(quotes)

10

In [17]:
quotes[0].text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\nby Albert Einstein\nTags: change deep-thoughts thinking world'

In [18]:
quotes_text = [quote.text for quote in quotes]
quotes_text

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\nby Albert Einstein\nTags: change deep-thoughts thinking world',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”\nby J.K. Rowling\nTags: abilities choices',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”\nby Albert Einstein\nTags: inspirational life live miracle miracles',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”\nby Jane Austen\nTags: aliteracy books classic humor',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”\nby Marilyn Monroe\nTags: be-yourself inspirational",
 '“Try not to become a man of success. Rather become a man of value.”\nby Albert Einstein\nTags: adulthood success value',
 '“It is better to be hated 

This returns all the text inside each element. How can we select specific parts of an element?

In [9]:
quotes[0].find_element("xpath", "span[@class='text']").text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

This returns only the quote itself, without the author and tags that follow it. How about the other parts?

In [19]:
quotes[0].find_element("xpath", ".//small[@class='author']").text

'Albert Einstein'

In [20]:
tags = quotes[0].find_elements("xpath", ".//a[@class='tag']")
tags = [tag.text for tag in tags]
tags

['change', 'deep-thoughts', 'thinking', 'world']
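
`find_element` raises `NoSuchElementException` when nothing matches, while `find_elements` returns an empty list. When a field might be absent, a small wrapper can make that case explicit. The helper below is a hypothetical sketch (`first_text` is not a Selenium function) and assumes only an element with a `find_elements(by, value)` method.

```python
def first_text(element, xpath, default=None):
    """Return the text of the first match for `xpath` under `element`, or `default`."""
    matches = element.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
    return matches[0].text if matches else default


# Usage with the notebook's elements (commented out because it needs a live browser):
# author = first_text(quotes[0], ".//small[@class='author']")
# missing = first_text(quotes[0], ".//span[@class='no-such-class']", default="")
```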

### Handle scrolling

You will notice that we are only getting the first 10 quotes on the page. This is because the page generates the remaining quotes only as we scroll down. The following code automates that scrolling: it keeps scrolling to the bottom until the page height stops growing. The code for handling infinite scrolling is taken from <a href="https://stackoverflow.com/questions/28928068/scroll-down-to-bottom-of-infinite-page-with-phantomjs-in-python/28928684#28928684">the answer to this Stack Overflow question</a>.

In [24]:
import time

pause = 0.5
lastHeight = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
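
The loop above stops only when the page height stops changing, so on a page that never runs out of content it would not terminate. One possible variation, sketched below, wraps it in a function with an upper bound on the number of scrolls. `scroll_to_bottom` and `max_scrolls` are illustrative names, and the JavaScript strings are the same ones used above.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_scrolls=50):
    """Scroll until the page height stops growing or `max_scrolls` is reached.

    Returns the number of scrolls that made the page grow.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    grown = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break
        last_height = new_height
        grown += 1
    return grown


# Usage with the notebook's driver (commented out because it needs a live browser):
# scroll_to_bottom(driver, pause=0.5)
```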

Now let's check the quotes on the page.

In [25]:
quotes = driver.find_elements("xpath", "//div[@class='quote']")
len(quotes)

100

We now get 100 quotes instead of 10!

### Exercise

Scrape the page and save the results into a pandas DataFrame with the following format:

| author | tags | quote |
| --- | --- | --- |
| Albert Einstein | ['change', 'deep-thoughts', 'thinking', 'world'] | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” |

In [56]:
rows = []

for quote in quotes:
    author = quote.find_element("xpath", ".//small[@class='author']").text
    tags = [tag.text for tag in quote.find_elements("xpath", ".//a[@class='tag']")]
    # use a separate name so the loop variable `quote` is not overwritten
    quote_text = quote.find_element("xpath", "span[@class='text']").text
    rows.append({'author': author, 'tags': tags, 'quote': quote_text})

# build the DataFrame once instead of appending row by row
df = pd.DataFrame(rows, columns=['author', 'tags', 'quote'])

In [57]:
df

| | author | tags | quote |
| --- | --- | --- | --- |
| 0 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 3 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| ... | ... | ... | ... |
| 95 | Harper Lee | [better-life-empathy] | “You never really understand a person until yo... |
| 96 | Madeleine L'Engle | [books, children, difficult, grown-ups, write,... | “You have to write the book that wants to be w... |
| 97 | Mark Twain | [truth] | “Never tell the truth to people who are not wo... |
| 98 | Dr. Seuss | [inspirational] | “A person's a person, no matter how small.” |
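
The browser opened by Selenium stays open until the session is ended with `driver.quit()`. As a small sketch, a context manager can guarantee that call even if scraping fails midway. `quitting` is a hypothetical helper name and relies only on the driver having a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver` and always call its quit() method afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


# Usage with the notebook's driver (commented out because it needs a live browser):
# with quitting(driver):
#     driver.get(url)
#     ...  # scrape here; the browser is closed when the block exits
```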


## References

1. https://selenium-python.readthedocs.io/

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> <href>judemichaelteves@gmail.com</href> or <href>jude.teves@dlsu.edu.ph</href></sup><br>