Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email rdb104@case.edu.<br />
___

# Web Scraping: Exploring Website Structure and Getting Specific Data from Web Scraping

**Description:** This lesson introduces basic website structure and how to identify elements for use in web scraping.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 60 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** HTML Structure

**Data Format:** `html`, `txt`, `py`, `csv`

**Libraries Used:** `requests` `BeautifulSoup` 
___

## Introduction

As in the previous tutorial, we will be using [Books to Scrape](https://books.toscrape.com/) as our test website.  The site exists just to provide a platform for people to practice web scraping.  You may want to keep the website open in another browser tab so you can compare the data we are getting with the website as a regular user will see it. 

In this project you will:
1. Use the `Inspect` tool in your web browser to explore the structure of the <a href="https://books.toscrape.com/">Books to Scrape</a> website.
2. Understand how book titles on the site are tagged/classified. 
3. Understand and use a Python script to crawl the web page and extract only the data that meets the classification criteria we identified for titles in step 2.
4. Look at the list of titles we scraped.  Identify problems with the data and explore an alternative strategy of using Beautiful Soup to get the correct titles.
5. Write the list of correct titles to a file. 


### The `Inspect` Tool and how to use it.

Modern web browsers provide powerful tools for exploring the structure of web pages.  In Firefox (<a href='https://duckduckgo.com/?t=ffab&q=why+use+firefox'>you should be using Firefox</a>), right-clicking anywhere on a web page will open the context menu.  One of the options in the context menu is `Inspect`.  You can also find the `Inspect` option in Chrome, if you are still using that for some reason...

In your browser, right-click on a book title on [Books to Scrape](https://books.toscrape.com) and then click on the `Inspect` option in the context menu.  The Web Developer Tools will open and show the HTML structure in the Inspector panel.  

 ![title](img/inspect1.png)  ![title](img/inspect2.png)


You can see that as you move the mouse around the inspector panel, it will highlight the piece of the web page that the code you are hovering over is being used to create.  It should be obvious now how powerful this tool is for exploring the structure of pages.  

Now let's take a minute to talk about the HTML we are looking at in the Inspector panel.  

HTML elements make up the page.  Each element is written with tags and attributes that define how its content is displayed.

If we're looking at the title of the books, as shown in the images above, we see HTML like this.  


In [None]:
<h3>
    <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
</h3>

The `<h3>` tag and the `<a>` tag each mark up an element.  The `<h3>` tag has no attributes, but the `<a>` tag has an `href` attribute and a `title` attribute. The content they wrap is the text "A Light in the ...".  So when you see that text on the page, it is displayed as a level 3 heading because of the `<h3>` tag, and it is also displayed as a link to another page because of the `<a>` tag and its `href` attribute.  This image shows the structure of the `<a>` tag.


![title](img/htmlelements.png)

So!  Looking at the image above, we can tell that if we want to collect a list of book titles, we will have to use the `title` attribute of the `<a>` tag to get that information.  We know from the last lesson that we can scrape the entire HTML document, but how do we tell the web scraper to only look for the `title` attribute of the `<a>` tag?

We will use a Python package called Beautiful Soup to manage a process called parsing: turning the raw HTML text into a structure we can search. 
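To make "parsing" a little more concrete, here is a minimal sketch that parses the single line of HTML shown above instead of a live page. It assumes Python's built-in `html.parser`, so nothing beyond Beautiful Soup needs to be installed, and the names `snippet`, `demo_soup`, and `link` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# The same <h3>/<a> markup shown above, stored as a plain string.
snippet = (
    '<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
    'title="A Light in the Attic">A Light in the ...</a></h3>'
)

# Parsing turns the raw string into a tree of elements we can navigate.
# "html.parser" ships with Python (a choice made for this sketch only).
demo_soup = BeautifulSoup(snippet, "html.parser")

link = demo_soup.h3.a   # the <a> element nested inside the <h3>
print(link.name)        # the tag name
print(link.attrs)       # a dictionary holding the href and title attributes
print(link.text)        # the content between the opening and closing tags
```

Notice that the visible text and the `title` attribute are stored separately, which matters once we start collecting titles.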

Now, let's get back to scraping!

In [None]:
from bs4 import BeautifulSoup
import requests 

In [None]:
# 1.Fetch the page
results = requests.get("https://books.toscrape.com/")

# 2. Get the page content and assign it to the variable 'content'
content = results.text

This is what we always do when beginning a scraping project, but now we are going to start using Beautiful Soup to sift through that content for the pieces we are interested in. With Beautiful Soup, we use a "soup" object to find elements in a website. To create this object, execute the following code cell.

In [None]:

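# 3. Create the soup
# Note: the "lxml" parser needs the lxml package installed;
# Python's built-in "html.parser" can be used in its place.
soup = BeautifulSoup(content, "lxml")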

We created the `soup` so now we can look through it to find the data we are looking for.

There are two main ways to search the parsed results using Beautiful Soup: `find()` and `find_all()`.  `find()` gets the first element that matches a specific tag name, class name, and/or id.  `find_all()` gets all the elements that match those criteria and returns them in a list-like object.

The syntax is `variable = soup.find('tag', AttributeName='Value')`

We can omit any of the arguments of the `find` function if we don't need to specify an attribute value. 
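As a rough illustration of these options, the sketch below runs `find()` and `find_all()` (note the underscore) against a small made-up HTML string rather than the live site. The markup, the `price` and `note` classes, and every variable name are invented for this example. Because `class` is a reserved word in Python, Beautiful Soup spells that keyword argument `class_`.

```python
from bs4 import BeautifulSoup

# A tiny, invented page fragment used only for this demonstration.
sample_html = """
<div id="shelf">
  <p class="price">10.00</p>
  <p class="price">12.50</p>
  <p class="note">In stock</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag name only: the first <p> on the page.
first_p = sample_soup.find("p")

# Tag name plus an attribute value: the first <p> whose class is "note".
first_note = sample_soup.find("p", class_="note")

# find_all() returns every match instead of just the first.
all_prices = sample_soup.find_all("p", class_="price")

# Attribute only: whichever element has id="shelf".
shelf = sample_soup.find(id="shelf")

# When nothing matches, find() returns None rather than raising an error.
missing = sample_soup.find("table")

print(first_p.text, first_note.text, len(all_prices), shelf.name, missing)
```

The `None` result is why the code later in this lesson checks an element before using it.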

Let’s look at some examples of how to locate elements with Beautiful Soup. We’ll be using this HTML for the first book from [Books to Scrape](https://books.toscrape.com).

In [None]:
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>

We can also scrape the title information using the `h3` tags.  If you look at this line of HTML, you can see that the title, as well as a link to the product page for the specific book, are in between the `h3` tags.

`<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>`

We can get just the text from what we've scraped using the `text` property.

Let's use the syntax from above, `variable = soup.find('tag', AttributeName='Value')`. In this case we will omit the attribute argument and search by tag name only.  


In [None]:
# Get just the title element using the h3 tag.
title_element = soup.find('h3')

# Get the text from the title element.
title_text = title_element.text

Display the `title_text` variable using the code block below.  Let's see if we stored the book title there.

In [None]:
title_text


### Troubleshooting!
As you can see, we've got the title text, but it is not the full title of the book.  The full title is "A Light in the Attic", but we only scraped the abbreviated title that the web page displays.  We would probably want to scrape the complete title of the book in order to create a meaningful list.  What if later on we were looking for books about attics in our data?!

Luckily there is a solution.  

`<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>`

If we look at the `h3` tags again, we can see that the `<a>` tag also has a `title` attribute that contains the complete title of the book.  Just like we used the `find()` function on the `soup` we created, we can use `find()` on parts of the soup as well.

First we find the `h3` element. Second, we look in that element for the `a` element.  Finally, we use the `get` function to retrieve the `title` attribute.  Remember that there are nested tags in the HTML, so we are drilling down one level at a time to find what we need.

And we can use a short `if/else` statement to automatically check what we scraped.

In [None]:
h3_element = soup.find('h3')

# Find the <a> element within the <h3> element
a_element = h3_element.find('a')

# Extract the title attribute of the <a> element
if a_element:
    title_attribute = a_element.get('title')
    print("Title Attribute:", title_attribute)
else:
    print("No <a> element found within <h3>.")

AHA!  We have a complete title!  But that's only one...  
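One detail about `get()` is worth a quick aside: it returns `None` when the attribute is missing, whereas square-bracket access raises a `KeyError`. The sketch below shows the difference on a made-up link that has no `title` attribute; the HTML and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# An invented heading whose link has an href but no title attribute.
demo = BeautifulSoup(
    '<h3><a href="index.html">No title here</a></h3>', "html.parser"
)
link = demo.find("a")

# get() quietly returns None when the attribute does not exist...
print(link.get("title"))

# ...or a fallback value, if we supply one as the second argument.
print(link.get("title", "unknown"))

# Square brackets work like a dictionary lookup and raise KeyError instead.
raised_key_error = False
try:
    link["title"]
except KeyError:
    raised_key_error = True
print("KeyError raised:", raised_key_error)
```

Using `get()` keeps a scraper running when a page is inconsistent, at the cost of possibly collecting `None` values that need cleaning up later.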

In order to find all the titles, we need to modify our code so we are using `find_all()` instead of `find()`.

Additionally, we are now looking through multiple `h3` elements instead of one, so we are going to have to use a `for` loop.


Let's break down the loop:

`h3_elements = soup.find_all('h3')`: This line finds all the `<h3>` elements in the parsed HTML content using BeautifulSoup's find_all method and stores them in the h3_elements variable. This creates a collection (list-like object) containing all the `<h3>` elements found.

`for h3_element in h3_elements:`: The for loop starts here. It iterates over each `<h3>` element found in the h3_elements collection. For each iteration, the current `<h3>` element is stored in the variable h3_element.

`a_element = h3_element.find('a')`: Within each iteration of the loop, `h3_element.find('a')` looks for the first `<a>` element inside the current `<h3>` element (h3_element). If an `<a>` element is found, it is stored in the variable a_element. If no `<a>` element exists within the current `<h3>` element, a_element will be None.

`if a_element:`: This if statement checks if an `<a>` element was found within the current `<h3>` element. If a_element is not None, indicating that an `<a>` element exists, it proceeds to the next step.

`title_attribute = a_element.get('title')`: This line extracts the value of the title attribute from the found `<a>` element (a_element) using the get() method and stores it in the variable title_attribute.

`title_attributes.append(title_attribute)`: This line appends the title_attribute value to the title_attributes list. Note that the code does not check whether the `<a>` element actually had a title attribute; if it did not, `get()` would return None and None would be appended.

The loop continues this process for each `<h3>` element found in the HTML content, collecting the title attributes of the nested `<a>` elements and storing them in the title_attributes list.

Finally, `print("Title Attributes List:", title_attributes)` displays the list for us so we can check that the data we've scraped is indeed what we were looking for.  

In [None]:
# Find all <h3> elements
h3_elements = soup.find_all('h3')

# Initialize an empty list to store title attributes
title_attributes = []

# Iterate through each <h3> element to find and extract the title attributes of <a> elements
for h3_element in h3_elements:
    a_element = h3_element.find('a')  # Find the <a> element within each <h3>
    if a_element:
        title_attribute = a_element.get('title')  # Extract the title attribute
        title_attributes.append(title_attribute)  # Append the title attribute to the list

print("Title Attributes List:", title_attributes)

Excellent.  Now we have a list of titles!  
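For comparison, the same kind of list can be built more compactly with a CSS selector and a list comprehension. This is an alternative to the loop above, not the method this lesson relies on, and it is shown on a small made-up fragment so it runs without fetching the page. `select()` is Beautiful Soup's CSS selector method; the selector `h3 > a[title]` matches only `<a>` elements that sit directly inside an `<h3>` and actually have a `title` attribute. The second title and both `href` values are invented.

```python
from bs4 import BeautifulSoup

# A made-up fragment: two headings with titled links, one heading without a link.
fragment = """
<h3><a href="book-1.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="book-2.html" title="A Second, Made-Up Title">A Second, ...</a></h3>
<h3>No link in this heading</h3>
"""
fragment_soup = BeautifulSoup(fragment, "html.parser")

# The selector already filters out headings without a titled link,
# so square-bracket access is safe here.
titles = [a["title"] for a in fragment_soup.select("h3 > a[title]")]
print(titles)
```

The explicit loop is easier to read and debug while learning; the selector version is handy once the page structure is familiar.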

In order to write them to a file, we need to convert the items in the list to a string first.  The `write()` method expects strings as input.

A string is a sequence of characters, which can be letters, numbers, symbols, or spaces enclosed within either single or double quotes.  Strings are immutable, meaning their contents cannot be changed after creation.  

`str(title)` will convert each title in the list into a string.  `\n` will add a line break after every title, so each one will be on a separate line in the text file.  

Run the code below.  Check the contents of your file by clicking on it in the file explorer on the left.

In [None]:
with open('title_list.txt', 'w') as outfile:
    outfile.writelines((str(title)+'\n' for title in title_attributes))
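A quick way to confirm the write worked is to read the file back. The sketch below is self-contained: it uses a short made-up list and a temporary folder rather than the real `title_attributes` list and `title_list.txt`. It also passes `encoding="utf-8"` explicitly, which is an addition not in the lesson's code and can help when titles contain accented characters.

```python
import tempfile
from pathlib import Path

# Stand-in data; the second title is invented to include an accented character.
sample_titles = ["A Light in the Attic", "Café Stories (made-up title)"]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "title_list.txt"

    # Write one title per line, as in the cell above.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.writelines(title + "\n" for title in sample_titles)

    # Read the file back and strip the line breaks we added.
    with open(path, encoding="utf-8") as infile:
        titles_read = [line.rstrip("\n") for line in infile]

# True when every title survived the round trip unchanged.
print(titles_read == sample_titles)
```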

Well done!  That was a complex set of problems we explored.  In the next lesson, we will look at getting all the data connected to each title (price, rating, stock status, etc.), so we will have a dataset that can provide some real insight into the store's selection.