# Data 200: Data Systems for Data Analytics (Spring 2024)

# Homework 13: Web Scraping with Selenium

<font color='red'>**Due Date and Time:** 1:30pm on Friday, 4/26/2024 </font>
---
Enter your name in the markdown cell below.

# Name: Minh Trinh

In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING AND TO LOAD NumPy
import requests
import numpy as np
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Tasks

- Review pages 337-368 in the Course Notes.
- Complete the **CSS Locators, Chaining, and Responses** and **Spiders** chapters of the **Web Scraping in Python** course on DataCamp.
- E-mail me your completed Jupyter notebook.

# Exercises

In this homework, we'll sharpen our web scraping skills by exploring the website https://books.toscrape.com. This fictional bookstore was designed specifically for practicing web scraping.

We first need to import the required libraries.

**Study and run the code cell below**

In [2]:
import pandas as pd
import time
import random

from selenium import webdriver
from selenium.webdriver.common.by import By

Now we need to initialize the WebDriver service and launch the Chrome browser. 

**Study and run the code cell below**, but be sure that `chromedriver.exe` is in the same directory as this notebook, or update the code with the full path to the driver.

In [3]:
service = webdriver.ChromeService()
driver = webdriver.Chrome(service=service)

<div class="exercise"><b>Exercise 1:</b></div> 

The goal of this homework is to create a data frame that contains information about all the Fiction books, specifically, the titles, prices, and descriptions.

Toward that end, use the `driver.get()` method to open the browser to the page related to Fiction. To accomplish this, read over page 357 in the notes, and then
- Open a browser to https://books.toscrape.com
- **Select Fiction** from the menu on the left
- Copy the URL
- Paste it into the `driver.get()` method

You will know you did this correctly if the Chrome window controlled by `driver` opens to the Fiction page.

In [4]:
driver.get("https://books.toscrape.com/catalogue/category/books/fiction_10/index.html")

<div class="exercise"><b>Exercise 2:</b></div> 

Let's see if we can scrape the title of the first book listed, which is *Soumission*. To accomplish this, you will need to

- **Review pages 359-361 in the Course Notes**
- Right click on the title of the book in your browser and select Inspect
- In the developer tab, right click on the `title="Soumission"` attribute and select `Copy -> Copy XPath`
- Use the `driver.find_element()` function to obtain the Element containing the title (via the XPath you just copied) and assign it to `title_element`
- Print out the title with the command `print(title_element.text)`--note that I already included this code

The expected output should be

<code>
Soumission
</code>

In [5]:
# Your code here
title_element = driver.find_element('xpath', '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[1]/article/h3/a')
print(title_element.text)

Soumission


<div class="exercise"><b>Exercise 3:</b></div> 

Now let's see if we can scrape the price of the first book listed, which is *£50.10*. To accomplish this, you will need to

- Right click on the price of the book in your browser and select Inspect
- In the developer tab, right click on the price (£50.10) and select `Copy -> Copy XPath`
- Use the `driver.find_element()` function to obtain the Element containing the price (via the XPath you just copied) and assign it to `price_element`
- Print out the price with the command `print(price_element.text)`--note that I already included this code

The expected output should be

<code>
£50.10
</code>

In [6]:
# Your code here
price_element = driver.find_element('xpath', '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[1]/article/div[2]/p[1]')

print(price_element.text)

£50.10


<div class="exercise"><b>Exercise 4:</b></div> 

We are now interested in creating a Python list, `book_titles`, of <u>all</u> the Fiction titles (on the first page). By exploring the HTML code on this webpage we can see that the information related to each book is stored within an <code>&lt;article class="product_pod"&gt;</code> element. That is, there is a <code>&lt;article class="product_pod"&gt;</code> element for each book.

For example, all the information about the first book, *Soumission*, is stored in the following element: 

<code>
&lt;article class="product_pod"&gt;
  &lt;div class="image_container"&gt;
     &lt;a href="../../../soumission_998/index.html"&gt;
       &lt;img src="..2830.jpg" alt="Soumission" class="thumbnail"&gt;
     &lt;/a&gt;
  &lt;/div&gt;
  &lt;p class="star-rating One"&gt;...&lt;/p&gt;
  &lt;h3&gt;
     <b>&lt;a href="../../../soumission_998/index.html" title="Soumission"&gt;Soumission&lt;/a&gt;</b>
  &lt;/h3&gt;
  &lt;div class="product_price"&gt;
     &lt;p class="price_color"&gt;£50.10&lt;/p&gt;
     &lt;p class="instock availability"&gt;
        &lt;i class="icon-ok"&gt;&lt;/i&gt; In stock
     &lt;/p&gt;
     &lt;form&gt;...&lt;/form&gt;
  &lt;/div&gt;
&lt;/article&gt;
</code>
    
Furthermore, all the information about the second book, *Private Paris (Private #10)*, is stored in this element: 
    
<code>
&lt;article class="product_pod"&gt;
  &lt;div class="image_container"&gt;
      &lt;a href="../../../private-paris-private-10_958/index.html"&gt;
        &lt;img src="...c26.jpg" alt="Private Paris (Private #10)" class="thumbnail"&gt;
      &lt;/a&gt;
  &lt;/div&gt;
  &lt;p class="star-rating Five"&gt;...&lt;/p&gt;
  &lt;h3&gt;
      <b>&lt;a href="../../../private-paris-private-10_958/index.html" title="Private Paris (Private #10)"&gt;Private Paris (Private #10)&lt;/a&gt;</b>
  &lt;/h3&gt;
  &lt;div class="product_price"&gt;
      &lt;p class="price_color"&gt;£47.61&lt;/p&gt;
      &lt;p class="instock availability"&gt;
         &lt;i class="icon-ok"&gt;&lt;/i&gt; In stock
      &lt;/p&gt;
      &lt;form&gt;...&lt;/form&gt;
  &lt;/div&gt;
&lt;/article&gt;
</code> 
    
We can create a Python list of all the elements that match <code>&lt;article class="product_pod"&gt;</code> using the `find_elements(By.CLASS_NAME, "class name")` function, replacing `"class name"` with `"product_pod"`. Specifically,
    
**`book_elements = driver.find_elements(By.CLASS_NAME, "product_pod")`**
    
Thus, `book_elements` is a Python list of all the elements that have a class name `"product_pod"`. For example, `book_elements[0]` would be the element corresponding to the first book, *Soumission*, and it would have the structure above. Note that the relative XPath to the title is `h3/a` (make sure you understand where this path came from by studying the elements above before moving on). So we can find the title of the first book using the code below.

**Study and run the code cell below.**

In [7]:
book_elements = driver.find_elements(By.CLASS_NAME, "product_pod")

title_element = book_elements[0].find_element('xpath',"h3/a")
book_title = title_element.get_attribute('title')

print(book_title)

Soumission
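
To see what a relative path such as `h3/a` means without a browser, here is a minimal sketch that uses Python's standard `xml.etree.ElementTree` in place of Selenium. The snippet is a trimmed, well-formed copy of the first book's element; the image file name `cover.jpg` is a placeholder, and `article_xml` and `direct_link` are names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first book's <article> element. Real pages are HTML,
# which is not always valid XML; this simplified snippet is an assumption
# made so that the standard library can parse it.
article_xml = """
<article class="product_pod">
  <div class="image_container">
    <a href="../../../soumission_998/index.html">
      <img src="cover.jpg" alt="Soumission" class="thumbnail"/>
    </a>
  </div>
  <h3>
    <a href="../../../soumission_998/index.html" title="Soumission">Soumission</a>
  </h3>
</article>
""".strip()

article = ET.fromstring(article_xml)

# "h3/a" means: start at the article, step into its child <h3>,
# then into that <h3>'s child <a>.
title_link = article.find("h3/a")
title = title_link.get("title")

# "a" alone finds nothing, because no <a> is a direct child of the article.
direct_link = article.find("a")

print(title)
print(direct_link)
```

The same reasoning applies to `book_elements[0].find_element('xpath', "h3/a")`: the path is followed starting from the book's own element, not from the top of the page.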


**It is now your turn--complete the Python code below to create a list, `book_titles`, that contains the titles of all the books.** Please review the code cell you just ran along with page 362 in the Course Notes first.  The expected output is

<code>
['Soumission', 'Private Paris (Private #10)', 'We Love You, Charlie Freeman', 'Thirst', 'The Murder That Never Was (Forensic Instincts #5)', 'Tuesday Nights in 1980', 'The Vacationers', 'The Regional Office Is Under Attack!', 'Finders Keepers (Bill Hodges Trilogy #2)', 'The Time Keeper', 'The Testament of Mary', 'The First Hostage (J.B. Collins #2)', 'Take Me with You', 'Still Life with Bread Crumbs', 'Shtum', 'My Name Is Lucy Barton', 'My Mrs. Brown', 'Mr. Mercedes (Bill Hodges Trilogy #1)', 'I Am Pilgrim (Pilgrim #1)', 'Eligible (The Austen Project #4)']
</code>

In [8]:
book_elements = driver.find_elements(By.CLASS_NAME, "product_pod")

book_titles=[]

for element in book_elements:
    title_element = element.find_element('xpath',"h3/a")
    book_titles.append(title_element.get_attribute('title'))
    
print(book_titles)

['Soumission', 'Private Paris (Private #10)', 'We Love You, Charlie Freeman', 'Thirst', 'The Murder That Never Was (Forensic Instincts #5)', 'Tuesday Nights in 1980', 'The Vacationers', 'The Regional Office Is Under Attack!', 'Finders Keepers (Bill Hodges Trilogy #2)', 'The Time Keeper', 'The Testament of Mary', 'The First Hostage (J.B. Collins #2)', 'Take Me with You', 'Still Life with Bread Crumbs', 'Shtum', 'My Name Is Lucy Barton', 'My Mrs. Brown', 'Mr. Mercedes (Bill Hodges Trilogy #1)', 'I Am Pilgrim (Pilgrim #1)', 'Eligible (The Austen Project #4)']


<div class="exercise"><b>Exercise 5:</b></div> 

We are now interested in creating a Python list, `book_urls`, of all the weblinks (urls) to the book descriptions. Note that we can get this information from our list `book_elements`. Recall that the element for the first book, *Soumission*, is as follows.

<code>
&lt;article class="product_pod"&gt;
  &lt;div class="image_container"&gt;
     &lt;a href="../../../soumission_998/index.html"&gt;
       &lt;img src="..2830.jpg" alt="Soumission" class="thumbnail"&gt;
     &lt;/a&gt;
  &lt;/div&gt;
  &lt;p class="star-rating One"&gt;...&lt;/p&gt;
  &lt;h3&gt;
    <b>&lt;a href="../../../soumission_998/index.html" title="Soumission"&gt;Soumission&lt;/a&gt;</b>
  &lt;/h3&gt;
  &lt;div class="product_price"&gt;
     &lt;p class="price_color"&gt;£50.10&lt;/p&gt;
     &lt;p class="instock availability"&gt;
        &lt;i class="icon-ok"&gt;&lt;/i&gt; In stock
     &lt;/p&gt;
     &lt;form&gt;...&lt;/form&gt;
  &lt;/div&gt;
&lt;/article&gt;
</code>
    
Note that the relative XPath to the url (which is stored in the attribute `href`) is the same as for the title: `h3/a`. So we can find the url of the first book using the code below.

**Study and run the code cell below.**

In [9]:
url_element = book_elements[0].find_element('xpath',"h3/a")
book_url = url_element.get_attribute('href')

print(book_url)

https://books.toscrape.com/catalogue/soumission_998/index.html
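
Notice that the HTML stores a relative link (`../../../soumission_998/index.html`), while the value printed above is an absolute URL. The browser resolves relative links against the address of the current page, and Selenium reports the resolved value. As a rough sketch of that resolution, the standard library's `urllib.parse.urljoin` gives the same result; `page_url`, `relative_href`, and `absolute_url` are names chosen here for illustration.

```python
from urllib.parse import urljoin

# Address of the Fiction listing page opened in Exercise 1.
page_url = "https://books.toscrape.com/catalogue/category/books/fiction_10/index.html"

# Relative link exactly as it appears in the page's HTML.
relative_href = "../../../soumission_998/index.html"

# Each "../" climbs one directory up from the folder containing the page.
absolute_url = urljoin(page_url, relative_href)

print(absolute_url)
```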


**It is now your turn--complete the Python code below to create a list, `book_urls`, that contains the urls of all the books.** This is almost identical to what you did in Exercise 4.  The expected output is

<code>
['https://books.toscrape.com/catalogue/soumission_998/index.html', 'https://books.toscrape.com/catalogue/private-paris-private-10_958/index.html', 'https://books.toscrape.com/catalogue/we-love-you-charlie-freeman_954/index.html', 'https://books.toscrape.com/catalogue/thirst_946/index.html', 'https://books.toscrape.com/catalogue/the-murder-that-never-was-forensic-instincts-5_939/index.html', 'https://books.toscrape.com/catalogue/tuesday-nights-in-1980_870/index.html', 'https://books.toscrape.com/catalogue/the-vacationers_863/index.html', 'https://books.toscrape.com/catalogue/the-regional-office-is-under-attack_858/index.html', 'https://books.toscrape.com/catalogue/finders-keepers-bill-hodges-trilogy-2_807/index.html', 'https://books.toscrape.com/catalogue/the-time-keeper_766/index.html', 'https://books.toscrape.com/catalogue/the-testament-of-mary_765/index.html', 'https://books.toscrape.com/catalogue/the-first-hostage-jb-collins-2_749/index.html', 'https://books.toscrape.com/catalogue/take-me-with-you_741/index.html', 'https://books.toscrape.com/catalogue/still-life-with-bread-crumbs_738/index.html', 'https://books.toscrape.com/catalogue/shtum_733/index.html', 'https://books.toscrape.com/catalogue/my-name-is-lucy-barton_720/index.html', 'https://books.toscrape.com/catalogue/my-mrs-brown_719/index.html', 'https://books.toscrape.com/catalogue/mr-mercedes-bill-hodges-trilogy-1_717/index.html', 'https://books.toscrape.com/catalogue/i-am-pilgrim-pilgrim-1_703/index.html', 'https://books.toscrape.com/catalogue/eligible-the-austen-project-4_692/index.html']
</code>

In [10]:
book_urls=[]

for element in book_elements:
    url_element = element.find_element('xpath',"h3/a")
    book_urls.append(url_element.get_attribute('href'))
print(book_urls)

['https://books.toscrape.com/catalogue/soumission_998/index.html', 'https://books.toscrape.com/catalogue/private-paris-private-10_958/index.html', 'https://books.toscrape.com/catalogue/we-love-you-charlie-freeman_954/index.html', 'https://books.toscrape.com/catalogue/thirst_946/index.html', 'https://books.toscrape.com/catalogue/the-murder-that-never-was-forensic-instincts-5_939/index.html', 'https://books.toscrape.com/catalogue/tuesday-nights-in-1980_870/index.html', 'https://books.toscrape.com/catalogue/the-vacationers_863/index.html', 'https://books.toscrape.com/catalogue/the-regional-office-is-under-attack_858/index.html', 'https://books.toscrape.com/catalogue/finders-keepers-bill-hodges-trilogy-2_807/index.html', 'https://books.toscrape.com/catalogue/the-time-keeper_766/index.html', 'https://books.toscrape.com/catalogue/the-testament-of-mary_765/index.html', 'https://books.toscrape.com/catalogue/the-first-hostage-jb-collins-2_749/index.html', 'https://books.toscrape.com/catalogue/t

<div class="exercise"><b>Exercise 6:</b></div> 

We are now interested in creating a Python list, `book_prices`, of all the prices of the books. As before, we can get this information from our list `book_elements`. Recall that the element for the first book, *Soumission*, is as follows.

<code>
&lt;article class="product_pod"&gt;
  &lt;div class="image_container"&gt;
     &lt;a href="../../../soumission_998/index.html"&gt;
       &lt;img src="..2830.jpg" alt="Soumission" class="thumbnail"&gt;
     &lt;/a&gt;
  &lt;/div&gt;
  &lt;p class="star-rating One"&gt;...&lt;/p&gt;
  &lt;h3&gt;
    &lt;a href="../../../soumission_998/index.html" title="Soumission"&gt;Soumission&lt;/a&gt;
  &lt;/h3&gt;
  &lt;div class="product_price"&gt;
    <b>&lt;p class="price_color"&gt;£50.10&lt;/p&gt;</b>
     &lt;p class="instock availability"&gt;
        &lt;i class="icon-ok"&gt;&lt;/i&gt; In stock
     &lt;/p&gt;
     &lt;form&gt;...&lt;/form&gt;
  &lt;/div&gt;
&lt;/article&gt;
</code>
    
Note that the relative XPath to the price is `div/p` (make sure you understand where this came from by studying the element above before moving on). So we can find the cost of the first book, **which is stored between tags as opposed to being stored as an attribute**, using the `.text` property.

**Study and run the code cell below.**

In [11]:
price_element = book_elements[0].find_element('xpath',"div/p")
book_price = price_element.text

print(book_price)

£50.10


**It is now your turn--complete the Python code below to create a list, `book_prices`, that contains the prices of all the books.** Please review the code cell you just ran along with page 362 in the Course Notes first.  The expected output is

<code>
['£50.10', '£47.61', '£50.27', '£17.27', '£54.11', '£21.04', '£42.15', '£51.36', '£53.53', '£27.88', '£52.67', '£25.85', '£45.21', '£26.41', '£55.84', '£41.56', '£24.48', '£28.90', '£10.60', '£27.09']
</code>

In [12]:
book_prices=[]

for element in book_elements:
    price_element = element.find_element('xpath',"div/p")
    book_prices.append(price_element.text)
    
print(book_prices)

['£50.10', '£47.61', '£50.27', '£17.27', '£54.11', '£21.04', '£42.15', '£51.36', '£53.53', '£27.88', '£52.67', '£25.85', '£45.21', '£26.41', '£55.84', '£41.56', '£24.48', '£28.90', '£10.60', '£27.09']
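
The scraped prices are strings, so they cannot be averaged or sorted numerically as they stand. The homework does not require a conversion, but a minimal sketch of one possible approach follows. `sample_prices`, `parse_price`, and `numeric_prices` are hypothetical names, and the sketch assumes every price has the form `£` followed by a decimal number.

```python
# First few prices as scraped above (strings, not numbers).
sample_prices = ['£50.10', '£47.61', '£50.27', '£17.27']

def parse_price(price_text):
    """Convert a string such as '£50.10' to a float."""
    # Assumes the only non-numeric character is the leading pound sign.
    return float(price_text.replace('£', '').strip())

numeric_prices = [parse_price(p) for p in sample_prices]

print(numeric_prices)
print(max(numeric_prices))
```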


<div class="exercise"><b>Exercise 7:</b></div> 

Now we want to create a Python list, `book_descs`, of all the book descriptions. Recall that the book descriptions are not provided on the main webpage; instead, we need to follow the link associated with each book. Note that we can use the list of urls, `book_urls`, that we created in Exercise 5.

In the code cell below, I use a for-loop over all the urls in `book_urls` and extract the book descriptions. **The code is complete except for one piece--the XPath to the book description.** Use Inspect in your browser to get the XPath to the description and replace it in the code below.

The expected output is as follows.

<code>
Paris is burning--and only Private's Jack Morgan can put out the fire.When Jack Morgan stops by Private's Paris office, he envisions a quick hello during an otherwise relaxing trip filled with fine food and sightseeing. But Jack is quickly pressed into duty after a call from his client Sherman Wilkerson, asking Jack to track down his young granddaughter who is on the run f Paris is burning--and only Private's Jack Morgan can put out the fire.When Jack Morgan stops by Private's Paris office, he envisions a quick hello during an otherwise relaxing trip filled with fine food and sightseeing. But Jack is quickly pressed into duty after a call from his client Sherman Wilkerson, asking Jack to track down his young granddaughter who is on the run from a brutal drug dealer.Before Jack can locate her, several members of France's cultural elite are found dead--murdered in stunning, symbolic fashion. The only link between the crimes is a mysterious graffiti tag. As religious and ethnic tensions simmer in the City of Lights, only Jack and his Private team can connect the dots before the smoldering powder keg explodes. ...more
</code>
<code>
The number of book descriptions is 20
</code>

In [14]:
book_descs = []

for url in book_urls:
    driver.get(url)
    descrip = driver.find_element('xpath','//*[@id="content_inner"]/article/p').text      # Update the XPath here
    book_descs.append(descrip)

print(book_descs[1])
print('\nThe number of book descriptions is ' + str(len(book_descs)))

Paris is burning--and only Private's Jack Morgan can put out the fire.When Jack Morgan stops by Private's Paris office, he envisions a quick hello during an otherwise relaxing trip filled with fine food and sightseeing. But Jack is quickly pressed into duty after a call from his client Sherman Wilkerson, asking Jack to track down his young granddaughter who is on the run f Paris is burning--and only Private's Jack Morgan can put out the fire.When Jack Morgan stops by Private's Paris office, he envisions a quick hello during an otherwise relaxing trip filled with fine food and sightseeing. But Jack is quickly pressed into duty after a call from his client Sherman Wilkerson, asking Jack to track down his young granddaughter who is on the run from a brutal drug dealer.Before Jack can locate her, several members of France's cultural elite are found dead--murdered in stunning, symbolic fashion. The only link between the crimes is a mysterious graffiti tag. As religious and ethnic tensions s

# Build a Data Frame

We can now use the `book_titles`, `book_prices`, and `book_descs` lists from the previous exercises to create a data frame of this information.

**Study and run the code cell below**

In [18]:
bookDict = {'Title':book_titles,'Price':book_prices, 'Description':book_descs}

bookDF = pd.DataFrame(bookDict)
bookDF

| | Title | Price | Description |
|---|---|---|---|
| 0 | Soumission | £50.10 | Dans une France assez proche de la nôtre, un h... |
| 1 | Private Paris (Private #10) | £47.61 | Paris is burning--and only Private's Jack Morg... |
| 2 | We Love You, Charlie Freeman | £50.27 | The Freeman family--Charles, Laurel, and their... |
| 3 | Thirst | £17.27 | On a searing summer Friday, Eddie Chapman has ... |
| 4 | The Murder That Never Was (Forensic Instincts #5) | £54.11 | Given the opportunity, would you assume someon... |
| ... | ... | ... | ... |
| 60 | When I'm Gone | £51.96 | Dear Luke,First let me say—I love you…I didn’t... |
| 61 | The Silent Wife | £12.34 | A chilling psychological thriller about a marr... |
| 62 | The Bette Davis Club | £30.66 | The morning of her niece’s wedding, Margo Just... |
| 63 | Kitchens of the Great Midwest | £57.20 | “A sweet and savory treat.” —People“An impress... |
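
When `pd.DataFrame` is given a dictionary of lists, every list must have the same length; otherwise pandas raises a `ValueError`. If a scraping step fails partway through, the lists can end up out of step, so it can help to check them first. The following is a minimal sketch of such a check in plain Python. `check_equal_lengths` and `sample_columns` are hypothetical names introduced for illustration, and the descriptions are placeholders.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every column's length so the short one is easy to spot.
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Small stand-in for the scraped lists.
sample_columns = {
    'Title': ['Soumission', 'Thirst'],
    'Price': ['£50.10', '£17.27'],
    'Description': ['description placeholder', 'description placeholder'],
}

n_rows = check_equal_lengths(sample_columns)
print(n_rows)
```

Calling `check_equal_lengths(bookDict)` before `pd.DataFrame(bookDict)` would point to the mismatched column by name.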


# Looping Over all Pages

At this point we have extracted all the information from the first page of Fiction books. However, there are a total of four pages that we would like to extract. Luckily, the pages are numbered consecutively:

https://books.toscrape.com/catalogue/category/books/fiction_10/page-1.html<br>
https://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html<br>
https://books.toscrape.com/catalogue/category/books/fiction_10/page-3.html<br>
https://books.toscrape.com/catalogue/category/books/fiction_10/page-4.html<br>

We can use a for-loop to loop over each page, adding (concatenating) to our `book_titles`, `book_prices`, and `book_descs` lists.
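
Because the four addresses differ only in the page number, they can be generated rather than typed out. This short sketch builds them with a template string; `url_template` and `page_urls` are names chosen here for illustration. Note that `range(1, 5)` produces 1, 2, 3, 4 and stops before 5.

```python
# Template with a placeholder where the page number goes.
url_template = "https://books.toscrape.com/catalogue/category/books/fiction_10/page-{}.html"

# range(1, 5) yields the page numbers 1 through 4.
page_urls = [url_template.format(j) for j in range(1, 5)]

for page_url in page_urls:
    print(page_url)
```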

**Study and run the code cell below.**

In [22]:
# Create empty lists
book_titles=[]
book_urls=[]
book_prices=[]
book_descs=[]

# Loop over the four pages
for j in range(1, 5):
    try:
        print('Processing page ' + str(j)) # displays current page we are processing
        
        driver.get('https://books.toscrape.com/catalogue/category/books/fiction_10/page-' + str(j) + '.html')
        
        # wait between 2-3 seconds
        time.sleep(random.uniform(2,3))
        
        # Create list of elements
        book_elements = driver.find_elements(By.CLASS_NAME, "product_pod")

        for element in book_elements:
            # Extract the title and url (both of which are located at the same XPath)
            title_element = element.find_element('xpath',"h3/a")
            book_titles.append(title_element.get_attribute('title'))
            book_urls.append(title_element.get_attribute('href'))
            
            # Extract the price
            price_element = element.find_element('xpath',"div/p")
            book_prices.append(price_element.text)  

    except Exception:  # catch scraping errors, but still allow Ctrl-C to interrupt
        print('Error!')

print("Done processing")

# Extract the book descriptions using the urls
for url in book_urls:
    driver.get(url)
    descrip = driver.find_element('xpath','//*[@id="content_inner"]/article/p').text
    book_descs.append(descrip)

Processing page 1
Error!
Processing page 2
Error!
Processing page 3
Error!
Processing page 4
Error!
Done processing
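
The output above reports `Error!` for every page, but the message alone does not say what went wrong, because the `except` block discards the exception. One way to make such failures easier to diagnose is to catch the exception object and print its type and message. The sketch below illustrates the idea without Selenium: `fetch_page` is a hypothetical stand-in for the work done on one page, and it fails on purpose for page 3.

```python
def fetch_page(page_number):
    """Hypothetical stand-in for the Selenium work done on one page."""
    if page_number == 3:
        raise RuntimeError("simulated failure on page 3")
    return ["book-" + str(page_number)]

results = []
errors = []

for j in range(1, 5):
    try:
        results.extend(fetch_page(j))
    except Exception as exc:
        # Keep the exception type and message instead of a bare 'Error!'.
        message = f"Page {j} failed: {type(exc).__name__}: {exc}"
        errors.append(message)
        print(message)

print(results)
```

With a report like this, the loop still continues to the next page, and the recorded message shows which step needs attention.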


Finally, we can create a data frame of all the Fiction books.

**Study and run the code cell below.**

In [23]:
bookDict = {'Title':book_titles,'Price':book_prices, 'Description':book_descs}

bookDF = pd.DataFrame(bookDict)
bookDF

| | Title | Price | Description |
|---|---|---|---|


At this point we have successfully scraped the webpage and retrieved the information we needed to create our data frame. We can now close the browser window controlled by the WebDriver and terminate the session completely.  **Run the code cell below to close the browser window.**

In [24]:
driver.quit()

# Congratulations on Finishing Homework 13