**Josh Hellings** - Practical Data Science for Economists 2024

First, let's import the required packages.

In [63]:
import requests                     # for making HTTP requests
from bs4 import BeautifulSoup       # for parsing HTML

<a href="https://colab.research.google.com/drive/1X-L1pjABev2rh32KNTJWBJ2HN-4v2U28?usp=sharing" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping

This notebook introduces web scraping with the `requests` and `BeautifulSoup` modules using some sample HTML, then we will move on to a real website.

<br>
<br>

# Part 1. A Basic Example
## Searching HTML with `BeautifulSoup`

BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML fetched from the web, but first we'll try it out with a simple HTML example.
- **Tip**: After working through this notebook, carry on learning about BeautifulSoup in this [tutorial](https://www.tutorialspoint.com/beautiful_soup/beautiful_soup_overview.htm).

In [44]:
sample_html = """
<html>
<body>
    <h1> BeautifulSoup </h1>
    <p> BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
        from the web.
        Here, we're just using it to parse some simple sample HTML.
    </p>
    <h2 class="important"> Searching the tree </h2>
    <p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
        id, by CSS class, and so on. </p>
    <ol>
        <li role="example_1"> By tag: we could search for every li </li>
        <li role="example_2"> By id: we could search for the p tag with id="searching_description" </li>
        <li role="example_3" class="important"> By class: we could search for every tag with a given class </li>
        <li> and so on... </li>
    </ol>
    <div>
        <p>Div tags are used to group block-elements and structure the web page.</p>
    </div>
</body>
</html>
"""

print("sample_html")

sample_html


---

<div class="alert alert-block alert-info">
<h4>HTML cheatsheet: Tags & Attributes</h4>

An HTML element is a part of an HTML document that is made up of a start tag, some content, and then a closing tag: `<div>Some text about anything</div>`. These elements can also include attributes, which go in the start tag: `<div id="unique_abc">Some text</div>`.

<br>

**Tags** are the building blocks of HTML. They are used to define the structure and content of a web page. In the **`sample_html`**, **html**, **body**, **h1**, **p**, **h2**, **ol**, **li**, and **div** are examples of tags.
- `<html>`: The root element that defines the whole HTML document.
- `<body>`: Contains the content of an HTML document.
- `<h1>`, `<h2>`: Heading tags used to define headings. `<h1>` represents the main heading, while `<h2>` represents a subheading (heading tags go all the way down to <b>h6</b>, which is the smallest heading).
- `<p>`: Defines a paragraph.
- `<ol>`: Defines an ordered list.
- `<li>`: Defines a list item within a list.
- `<div>`: A block-level element used as a container to group and style sections of HTML documents with CSS or manipulate them with JavaScript.

<br>

**Attributes** provide additional information about an element's properties. They are always specified in the start tag of an element and usually come in name/value pairs like `name="value"`. Common attributes include:
- `class`: An attribute used to specify a class name for an element. It is used by CSS and JavaScript to perform certain tasks for elements with the specified class name.
- `id`: An attribute used to specify a unique id for an element. It is used to identify a single element in the document. In the sample, `id="searching_description"` is used to uniquely identify a paragraph.
- `style`: Defines the inline CSS style for an element. For example, `style="color: red"` changes the text colour of the element to red.

<br>

**IDs vs. Classes**
- IDs are unique identifiers for elements. An ID can only be associated with a single element in an HTML document.
- Classes are not unique. Multiple elements can share the same class. This makes classes useful for applying the same styling or behaviour to multiple elements.

</div>

---
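
To make the cheatsheet concrete, here is a minimal sketch of how tags and attributes appear once BeautifulSoup has parsed them. The `snippet` string and the names `demo`, `div`, and `shared` are made up for this illustration and are not part of `sample_html`.

```python
from bs4 import BeautifulSoup

# A small, made-up snippet with one id and a class shared by two elements
snippet = """
<div id="unique_abc" class="note important">Some text</div>
<p class="important">Another element sharing the class</p>
"""

demo = BeautifulSoup(snippet, "html.parser")
div = demo.find("div")

print(div.name)        # the tag name: div
print(div.attrs)       # all attributes as a dictionary
print(div["id"])       # a single string: unique_abc
print(div["class"])    # a list, because an element can carry several classes

# Classes can be shared, so searching by class may return several elements
shared = demo.find_all(class_="important")
print(len(shared))     # 2
```

Note that `class` comes back as a list while `id` is a plain string, which mirrors the IDs vs. classes distinction above.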

<br>
<br>

### 1.1 **Parsing HTML with BeautifulSoup**

At present, the `sample_html` variable is just a string. In order to search through it, we'll create a **Soup** object with BeautifulSoup - a searchable object representation. Using BeautifulSoup, we can search and navigate this structure in various ways:

- By tag name: Finds all instances of a specified tag.
- By ID: Finds the element with the specified ID.
- By class name: Finds all elements that match the specified class.

The BeautifulSoup object, `soup`, represents the parsed HTML document as a whole. With `soup`, you can use its methods to search for and manipulate elements based on their tags, IDs, classes, and other attributes. This flexibility makes BeautifulSoup a powerful tool for web scraping.

In [45]:
soup = BeautifulSoup(sample_html, 'html.parser')    # Parsing the HTML using BeautifulSoup and the built-in HTML parser
print(soup.prettify())      # Print the HTML in a nicely formatted way (with indentation)

<html>
 <body>
  <h1>
   BeautifulSoup
  </h1>
  <p>
   BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
        from the web.
        Here, we're just using it to parse some simple sample HTML.
  </p>
  <h2 class="important">
   Searching the tree
  </h2>
  <p id="searching_description" style="color: red">
   BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
        id, by CSS class, and so on.
  </p>
  <ol>
   <li role="example_1">
    By tag: we could search for every li
   </li>
   <li role="example_2">
    By id: we could search for the p tag with id="searching_description"
   </li>
   <li class="important" role="example_3">
    By class: we could search for every tag with a given class
   </li>
   <li>
    and so on...
   </li>
  </ol>
  <div>
   <p>
    Div tags are used to group block-elements and structure the web page.
   </p>
  </div>
 </body>
</html>



Printing this `soup` object shows the same HTML, but it's now stored in a way that allows us to search through it by specifying certain characteristics we want to find.

<br>

### 1.2 **Searching the HTML**

We can now search the `soup` object to extract useful information. Let's try searching by tag, attribute, and by both.

Useful BeautifulSoup methods include:
- `.find()`: Searches for the **first** tag that matches a given name or filter, returning a single result.
- `.find_all()`: Searches for all tags that match a given name or filter, returning a list of results.
- `.select()`: Uses a CSS selector to search for matching tags, returning a list of results.
- `.get_text()`: Extracts all text from a tag and its children, returning a single string (`.text` also achieves this).
- `.get()`: Retrieves the value of a tag attribute, such as "href" in an `<a>` tag or "src" in an `<img>` tag.
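
As a rough illustration of what each of these methods returns, the sketch below runs them on a tiny made-up fragment. The `fragment` string and the names `demo` and `link` are assumptions for this example only.

```python
from bs4 import BeautifulSoup

# A made-up fragment with a link, used only for this illustration
fragment = '<p>Read the <a href="https://example.com/docs">docs</a> first.</p>'
demo = BeautifulSoup(fragment, "html.parser")

link = demo.find("a")             # a single tag (the first match)
print(link.get_text())            # docs
print(link.get("href"))           # https://example.com/docs

print(demo.find("table"))         # None: .find() found nothing
print(demo.find_all("table"))     # []: .find_all() gives an empty list instead
print(demo.select("p a"))         # a list holding the <a> nested inside the <p>
```

Because `.find()` returns `None` when nothing matches, it is worth checking the result before calling `.get_text()` or `.get()` on it.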

<br>

##### 1.2.1 **by tag**

Here's an example of using `.find()` to find the first `<p>` tag:

In [46]:
# Finding the first occurrence of the <p> tag
first_p_tag = soup.find('p')    ### NOTE: find() returns the first occurrence of the tag. If you want all occurrences, use find_all()
print(first_p_tag)

<p> BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
        from the web.
        Here, we're just using it to parse some simple sample HTML.
    </p>


So we've extracted the first paragraph (`<p>`) element, but it includes the opening and closing tags. To get just the text, we'll use the `.get_text()` method.

In [47]:
print(first_p_tag.get_text())

 BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
        from the web.
        Here, we're just using it to parse some simple sample HTML.
    


What if we want to find the text of every list (`<li>`) element on the page? We can use `.find_all()`.

In [52]:
list_elements = soup.find_all('li') # Finding all the list elements in the HTML
for element in list_elements:
    print(element.text) # Printing the text of each list element

 By tag: we could search for every li 
 By id: we could search for the p tag with id="searching_description" 
 By class: we could search for every tag with a given class 
 and so on... 


**Note:** when using `.find_all()`, any matching elements are returned as a list - even if only 1 element is returned.
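
The text printed above still carries the spaces and line breaks from the original HTML. One way to tidy it, sketched here on a made-up fragment (`html`, `demo`, `raw`, `clean`, and `one_line` are illustrative names), is to pass `strip=True` to `.get_text()` or to collapse whitespace with plain string methods.

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same kind of stray whitespace as sample_html
html = """
<p> Text that wraps
        over two lines. </p>
<ol>
    <li> By tag </li>
    <li> By id </li>
</ol>
"""
demo = BeautifulSoup(html, "html.parser")

items = demo.find_all("li")
raw = [li.get_text() for li in items]              # keeps surrounding spaces
clean = [li.get_text(strip=True) for li in items]  # strips them

print(raw)    # [' By tag ', ' By id ']
print(clean)  # ['By tag', 'By id']

# Collapse internal line breaks and repeated spaces into single spaces
paragraph = demo.find("p").get_text()
one_line = " ".join(paragraph.split())
print(one_line)  # Text that wraps over two lines.
```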

<br>

##### 1.2.2 **by id**

Let's find the paragraph (`<p>`) with `id="searching_description"`.

In [54]:
# instead of using find_all, we can use `find()` to find the first element that matches the search criteria
description_element = soup.find('p', {'id': 'searching_description'}) # Finding the paragraph with id="searching_description"
print(description_element)

print('\nJust the text:', description_element.text) # Printing the text of the paragraph with id="searching_description"

<p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
        id, by CSS class, and so on. </p>

Just the text:  BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
        id, by CSS class, and so on. 


As well as getting text *within* HTML elements, we can extract the attribute values of that element. Let's get the style attribute.

In [55]:
print('Paragraph style:', description_element['style']) # Printing the value of the style attribute of the paragraph with id="searching_description"

Paragraph style: color: red
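
Square-bracket access works when the attribute exists, but not every element has every attribute. The sketch below, on a made-up one-line document, contrasts it with `.get()`; the fallback value `"n/a"` is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p id="intro" style="color: red">Hello</p>', "html.parser")
p = demo.find("p")

print(p["style"])             # color: red
print(p.get("style"))         # color: red
print(p.get("lang"))          # None: the attribute is not present
print(p.get("lang", "n/a"))   # n/a: a fallback we chose

try:
    p["lang"]                 # square brackets raise KeyError when missing
except KeyError:
    print("no lang attribute")

print(p.has_attr("id"))       # True
```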


<br>

##### 1.2.3 **by class**

Let's find all elements with the `class` `important`.

In [56]:
important_elements = soup.find_all(class_='important') # Finding all the elements with class="important"
print(important_elements)

[<h2 class="important"> Searching the tree </h2>, <li class="important" role="example_3"> By class: we could search for every tag with a given class </li>]


Let's view these elements individually.

In [57]:
print(f"There are {len(important_elements)} elements with class='important'")
for element in important_elements:
    print(element) # Printing each element with class="important"

There are 2 elements with class='important'
<h2 class="important"> Searching the tree </h2>
<li class="important" role="example_3"> By class: we could search for every tag with a given class </li>
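
The class search above returned both an `<h2>` and an `<li>`. If only one kind of tag is wanted, the tag name and `class_` can be combined. This is a small sketch on made-up HTML (the `highlight` class and the variable names are assumptions for the example).

```python
from bs4 import BeautifulSoup

html = """
<h2 class="important"> Heading </h2>
<ul>
    <li class="important"> First </li>
    <li class="important highlight"> Second </li>
    <li> Third </li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

any_important = demo.find_all(class_="important")        # the h2 and two li
li_important = demo.find_all("li", class_="important")   # the two li only

print(len(any_important))                                # 3
print([li.get_text(strip=True) for li in li_important])  # ['First', 'Second']
```

An element with several classes, such as `class="important highlight"`, still matches a search for any one of them.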


<br>

#### 1.2.4 **Bonus**: custom searching

We can search in any way we want by supplying a function of our own.

Let's write a simple function that checks whether an element's `role` attribute contains `example_`.

In [13]:
def important_role(role_attribute):
    # role_attribute is None for tags that have no role attribute
    return role_attribute is not None and 'example_' in role_attribute

# Find the first <ol> element
ol_element = soup.find('ol')
ol_element.find_all('li', role=important_role) # Finding all the list elements with a role attribute that contains 'example_'

[<li role="example_1"> By tag: we could search for every li </li>,
 <li role="example_2"> By id: we could search for the p tag with id="searching_description" </li>,
 <li class="important" role="example_3"> By class: we could search for every tag with a given class </li>]
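
If the match should be stricter, so that the `role` must begin with `example_` rather than merely contain it, two possible variants are sketched below on made-up HTML. The function name `role_starts_with_example` and the `my_example_2` role are invented for the illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<ol>
    <li role="example_1">one</li>
    <li role="my_example_2">two</li>
    <li>three</li>
</ol>
"""
demo = BeautifulSoup(html, "html.parser")

def role_starts_with_example(role_attribute):
    # role_attribute is None for tags without a role attribute
    return role_attribute is not None and role_attribute.startswith("example_")

# Option 1: pass our own function as the attribute filter
by_function = demo.find_all("li", role=role_starts_with_example)

# Option 2: pass a compiled regular expression anchored at the start
by_regex = demo.find_all("li", role=re.compile(r"^example_"))

print([li["role"] for li in by_function])  # ['example_1']
print([li["role"] for li in by_regex])     # ['example_1']
```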

<br>

#### 1.2.5 **Bonus**: Searching using CSS selectors.

We can also use the `.select()` method to search for elements using CSS notation. Here's how we would define styles in a CSS file to apply to HTML tags, classes, and ids:
- `p {}` -> applies to all HTML `p` tags
- `.chart {}` -> applies to all HTML elements with the attribute `class="chart"`
- `#chart1 {}` -> applies to the HTML element with the attribute `id="chart1"`

`p`, `.chart`, and `#chart1` are each CSS selectors. We can pass any of them to the `.select()` method to extract matching elements.

- **Note**: we can achieve the same results with the `.find()` and `.find_all()` methods shown above, but using CSS selectors can be nice as it keeps the same system we're used to when using CSS on our own website.

- **Tip**: There are more useful features of the `.select()` method that allow for complex searching (such as by attribute or for the nth type). [More info here](https://www.tutorialspoint.com/beautiful_soup/beautiful_soup_find_element_using_css_selectors.htm).

In [58]:
soup.select('p')

[<p> BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
         from the web.
         Here, we're just using it to parse some simple sample HTML.
     </p>,
 <p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
         id, by CSS class, and so on. </p>,
 <p>Div tags are used to group block-elements and structure the web page.</p>]

In [15]:
# Search for elements with id="searching_description"
soup.select('#searching_description')

[<p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
         id, by CSS class, and so on. </p>]

In [59]:
# Search for p elements with a style attribute (it doesn't matter what the value of the style attribute is)
soup.select('p[style]')

[<p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
         id, by CSS class, and so on. </p>]
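
A few more selector patterns, sketched on a made-up fragment rather than the notebook's `soup` so the counts are easy to check by eye. All variable names here are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h2 class="important">Heading</h2>
<ol>
    <li role="example_1">one</li>
    <li role="example_2" class="important">two</li>
    <li>three</li>
</ol>
"""
demo = BeautifulSoup(html, "html.parser")

by_class = demo.select(".important")               # every element with class="important"
li_by_class = demo.select("li.important")          # only <li> elements with that class
direct_children = demo.select("ol > li")           # <li> elements directly inside an <ol>
role_prefix = demo.select('li[role^="example_"]')  # role attribute starting with example_
first_li = demo.select_one("li")                   # first match only (None if no match)

print(len(by_class), len(li_by_class), len(direct_children), len(role_prefix))  # 2 1 3 2
print(first_li.get_text())                                                      # one
```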

<br>

---

<br>
<br>

# Part 2. Live HTML

## Getting HTML from the web with `Requests`

<div style="display: flex; align-items: flex-start;">
    <div style="flex: 0 2 auto;">
        <img src="https://raw.githubusercontent.com/FM-ds/ScrapingWorkshop/main/notebook_images/sample_html_safari.png" width="400px">
    </div>
    <div style="flex: 1 1 auto; margin-top: 10px; margin-right: 150px">
        <p>We just extracted information from HTML which was defined locally in a string <code>sample_html</code>. Usually, we care about HTML found on the internet. As a simple example, the page defined in <code>sample_html</code> is available at <a href="http://www.fmcevoy.io/ScrapingWorkshop/sample_html">http://www.fmcevoy.io/ScrapingWorkshop/sample_html</a></p>
        <p>To download HTML (and any other resources) from the internet, we can use the <code>requests</code> module.</p>
    </div>
</div>


In [64]:
import requests                # for making HTTP requests

We'll use the `.get()` method with our target URL to make this request.

In [78]:
response = requests.get('http://www.fmcevoy.io/ScrapingWorkshop/sample_html') # Making a request for the sample HTML hosted on the web

# Let's check the content of our get request (showing only the first 400 characters)
print(response.text[:400], '...')

<html>
<body>
    <h1> BeautifulSoup </h1>
    <p> BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
        from the web.
        Here, we're just using it to parse some simple sample HTML. </p>

    <h2 class="important"> Searching the tree </h2>
    <p id="searching_description" style="color: red"> BeautifulSoup allows us to search th ...


We've successfully managed to use Python code to fetch HTML code from the web!
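
Requests to real websites can fail or hang, so it can help to wrap the call in a small helper. The following is only a sketch: `fetch_html` is a hypothetical name, and the 10-second default timeout is an arbitrary assumption.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as text, raising an error if the request fails."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                    # raise HTTPError for 4xx/5xx responses
    return response.text
```

For example, `fetch_html('http://www.fmcevoy.io/ScrapingWorkshop/sample_html')` should return the same text as `response.text` above, provided the page is reachable.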

<br>

**Parsing** - We need to convert the HTML code from normal text (i.e. string) into some format that we can search. Just as we did before, we'll use BeautifulSoup to parse the response into a special `soup` object.

In [19]:
soup = BeautifulSoup(response.text, 'html.parser') # Instead of using the sample HTML, we're using the HTML from the web that we fetched in the previous step
soup.find_all('li') # Finding all the list elements in the HTML

[<li role="example_1"> By tag: we could search for every li </li>,
 <li role="example_2"> By id: we could search for the p tag with id="searching_description" </li>,
 <li class="important" role="example_3"> By class: we could search for every tag with a given class </li>,
 <li> and so on... </li>,
 <li><a href="https://example.com/page1">Page 1</a></li>,
 <li><a href="https://example.com/page2">Page 2</a></li>,
 <li><a href="https://example.com/page3">Page 3</a></li>]
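
For analysis, it is often handier to turn the matched elements into plain Python records than to keep them as tags. A minimal sketch, using a made-up fragment in place of the fetched page (`records` and the dictionary keys are illustrative choices):

```python
from bs4 import BeautifulSoup

html = """
<ol>
    <li role="example_1"> By tag </li>
    <li role="example_2"> By id </li>
    <li> and so on... </li>
</ol>
"""
demo = BeautifulSoup(html, "html.parser")

# One dictionary per list item: its cleaned text and its role (None if absent)
records = [
    {"text": li.get_text(strip=True), "role": li.get("role")}
    for li in demo.find_all("li")
]

for record in records:
    print(record)
```

A list of dictionaries like this can be passed on to other tools, for instance to build a table.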

<br>
<br>

### <font color='Green'><strong>Exercises: </strong></font>

**EX 1** Retrieve the text of the paragraph (`<p>`) with the id "exercise1".

In [None]:
### 1. Add Solution Here ###
exercise1 = None  # TODO: replace None with your solution

<br>

**EX 2** Count the number of `<div>` elements on the page that have the class "exercise".

In [None]:
### 2. Add Solution Here ###
exercise_divs = None  # TODO: replace None with your solution

<br>

**EX 3** Extract and print all the URLs from the anchor (`<a>`) tags within the list in the div with the id "exercise3".

In [None]:
### 3. Add Solution Here ###

anchor_tags = []  # TODO: replace [] with your solution
for tag in anchor_tags:
    pass  # TODO: print the href attribute of each anchor tag
