# Beautiful Soup Workshop - NTU

## Workshop Contents

1. Setup + Ethics and legality of web scraping
2. Basics of HTML
3. Inspecting your website
4. Scraping HTML from a website with requests
5. Parsing HTML with BeautifulSoup




## Section 1: Setting up + Ethics and legality of web scraping
Locally, we would run the next cell to install the modules required for the workshop. Since we are using Google Colab, all the modules are already preinstalled on our Colab instance.

Running the next cell shows that all the modules are already installed.

In [None]:
# Install the module (already installed on Colab)
!pip install requests beautifulsoup4
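
As an optional check, the short sketch below imports both libraries and prints their versions. It assumes the install step above has already succeeded; the version numbers you see depend on your environment.

```python
# Confirm that both libraries import cleanly and report their versions.
import bs4
import requests

print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
```

If either import fails, the module is not installed in the environment the notebook is running in.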

Web scraping is the process of gathering information from the internet. Even copying and pasting the lyrics of your favorite song can be considered a form of web scraping! However, the term “web scraping” usually refers to a process that involves automation. While some websites don’t like it when automatic scrapers gather their data, which can lead to legal issues, others don’t mind it.

If you’re scraping a page respectfully for educational purposes, then you’re unlikely to have any problems. Still, it’s a good idea to do some research on your own to make sure you’re not violating any Terms of Service before you start a large-scale web scraping project.

"Using Beautiful Soup is legal because you only use it for parsing documents. Web scraping in general is also legal if you respect a website’s terms of service and copyright laws."

## Section 2: Basics of HTML

Hypertext Markup Language (HTML) is the standard markup language for Web pages.

An HTML element is defined by a start tag, some content, and an end tag.

```<tagname> Content goes here... </tagname>```

### HTML Attributes
Beyond just tags and content, HTML elements can have **attributes**. These provide additional information about the element. Attributes are usually key-value pairs inside the opening tag.

For example:

```html
<a href="https://www.example.com" class="external-link">Click here</a>
```
Here, `href` and `class` are attributes of the `<a>` (anchor) tag. `BeautifulSoup` uses these attributes (especially `id` and `class`) to find specific elements on a page.
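
To show how these attributes look from Python, here is a minimal sketch that parses the anchor tag above and reads its attributes. The variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

html = '<a href="https://www.example.com" class="external-link">Click here</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")

print(link["href"])       # https://www.example.com
print(link["class"])      # ['external-link'] (class can hold several values, so it is a list)
print(link.get("title"))  # None, because the tag has no title attribute
print(link.text)          # Click here
```

Square brackets raise a `KeyError` when the attribute is missing, while `.get()` returns `None`, so `.get()` is the safer choice for optional attributes.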

### Nesting of HTML Elements (Parent-Child Relationships)
HTML elements are often nested inside one another, creating a hierarchical structure, similar to a family tree. An element inside another is called a 'child', and the enclosing element is its 'parent'.

```html
<div> <!-- Parent -->
  <p> <!-- Child of div -->
    <span> <!-- Child of p -->
      Hello World
    </span>
  </p>
</div>
```

Understanding this nesting is useful because `BeautifulSoup` allows us to navigate up and down this tree (e.g., finding the 'parent' of an element, or children within a parent).
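
The sketch below walks the same `<div>` / `<p>` / `<span>` structure in both directions. The HTML is written on one line, without comments or whitespace, so that the only nodes in the tree are the three tags and the text.

```python
from bs4 import BeautifulSoup

html = "<div><p><span>Hello World</span></p></div>"
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")

# Walk up the tree: span -> p -> div
print(span.parent.name)         # p
print(span.parent.parent.name)  # div

# Walk down the tree: list the direct children of the div
div = soup.find("div")
child_names = [child.name for child in div.children]
print(child_names)              # ['p']
```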

### Common HTML Tags You'll Encounter
While there are many HTML tags, some are more common in web scraping:

*   `<p>`: Paragraph of text.
*   `<a>`: Link (anchor). The actual URL is usually in the `href` attribute.
*   `<div>`: Division or section. Often used to group other elements and define layout.
*   `<span>`: An inline container for small pieces of content.
*   `<h1>` to `<h6>`: Headings (from largest to smallest).
*   `<ul>`, `<ol>`, `<li>`: Unordered lists, ordered lists, and list items.
*   `<img>`: Image. The source URL is in the `src` attribute.

Knowing these helps you quickly identify what kind of content you're looking at when inspecting a web page.
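
As a rough illustration, the sketch below pulls content out of several of these tags at once. The HTML snippet and its URLs are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for this example
html = """
<div>
  <h1>Fruit shop</h1>
  <p>Fresh fruit, delivered daily.</p>
  <ul>
    <li>Apple</li>
    <li>Banana</li>
  </ul>
  <a href="https://www.example.com/order">Order now</a>
  <img src="https://www.example.com/logo.png" alt="Shop logo">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").text
items = [li.text for li in soup.find_all("li")]
link_urls = [a["href"] for a in soup.find_all("a")]     # URL lives in href
image_urls = [img["src"] for img in soup.find_all("img")]  # URL lives in src

print(heading)
print(items)
print(link_urls)
print(image_urls)
```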

### `id` and `class` Attributes
While any attribute can be used, `id` and `class` are exceptionally useful for pinpointing specific elements:

*   **`id`**: Stands for "identifier". An `id` attribute should be **unique** within an HTML document. This makes it perfect for finding a single, specific element quickly.
    *   Example: `<div id="main-content">...</div>`

*   **`class`**: Used to classify elements. Multiple elements can share the same `class` attribute. This is great for finding groups of similar elements (e.g., all job cards, all navigation links).
    *   Example: `<p class="job-title">...</p>`
    *   In BeautifulSoup: `soup.find_all("p", class_="job-title")` (Note the underscore after `class` in Python to avoid conflict with the `class` keyword).

When inspecting a web page, always look for these attributes first, as they often provide the easiest way to target the data you want.
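
The two lookups can be combined, as in this minimal sketch. The HTML is invented for illustration; only the `main-content` id and the `job-title` class come from the examples above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html = """
<div id="main-content">
  <p class="job-title">Data Analyst</p>
  <p class="job-title featured">Python Developer</p>
  <p class="location">Singapore</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() with id returns a single element
main = soup.find(id="main-content")

# find_all() with class_ returns every element that carries that class
titles = main.find_all("p", class_="job-title")
print([t.text for t in titles])  # ['Data Analyst', 'Python Developer']
```

Note that the second paragraph still matches even though it has an extra `featured` class: `class_` matches any one of an element's classes.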

### Reinforcing 'Inspect Element'
The most powerful tool for web scraping is your browser's **Developer Tools** (usually opened with `F12` or `Right Click + Inspect`). Practice two things:

1.  **Selecting an element:** To see its HTML code, tags, and attributes.
2.  **Navigating the HTML tree:** To understand parent-child relationships and identify unique selectors (`id`, `class`).

This hands-on inspection is how you figure out what to tell `BeautifulSoup` to find.

## Section 3: Inspecting your target website
We will use this website for this workshop: https://realpython.github.io/fake-jobs/

Open Developer Tools by pressing ```F12``` or using ```Right Click + Inspect```.

To find a specific element, highlight its text and use ```Right Click + Inspect```.

Using the Developer Tools, try to identify some of the structures described earlier, such as tags, attributes, and nesting.


## Section 4: Scraping HTML from a website
Run the following code cell and look at the output.

In [2]:
# module imports
import requests
from bs4 import BeautifulSoup

Here we send an HTTP GET request to the URL and save the response, which contains the HTML source code, in the ```page``` variable.

What is an HTTP GET request?
HTTP (Hypertext Transfer Protocol) is the protocol that web browsers and servers use to request and send data. It defines several methods, such as GET, POST, PUT, and DELETE.
We use GET to request data from the web server.
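
To make the idea of a GET request more concrete, the sketch below builds one with `requests.Request(...).prepare()` without sending it, so it runs offline. The `fetch_html` helper and its 10-second timeout are assumptions added for illustration and are not part of the workshop code.

```python
import requests

URL = "https://realpython.github.io/fake-jobs/"

# Build the request without sending it, so we can look at what a GET contains.
prepared = requests.Request("GET", URL).prepare()

print(prepared.method)  # GET
print(prepared.url)     # the address being requested
print(prepared.body)    # None: a GET request normally carries no body


def fetch_html(url, timeout=10):
    """Hypothetical helper: send a GET request and return the HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text
```

Calling `fetch_html(URL)` would send the request for real. The timeout stops the call from hanging forever, and `raise_for_status()` turns an error response into an exception instead of silently returning an error page.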

Run the next cell to see the page's source code.

In [3]:
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

print(page.text)

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Fake Python</title>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">
  </head>
  <body>
  <section class="section">
    <div class="container mb-5">
      <h1 class="title is-1">
        Fake Python
      </h1>
      <p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
    </div>
    <div class="container">
    <div id="ResultsContainer" class="columns is-multiline">
    <div class="column is-half">
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-

Here we use ```soup``` to find the element with ```id="ResultsContainer"``` and store it in the ```results``` variable.

Run the next cell to see what the ```results``` variable contains.

In [None]:
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")

print(results)
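
One thing to keep in mind is that `find()` returns `None` when nothing matches. The sketch below shows this on a small inline snippet; `NoSuchContainer` is a made-up id used only to trigger the miss.

```python
from bs4 import BeautifulSoup

html = '<div id="ResultsContainer"><p>3 jobs found</p></div>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchContainer")  # hypothetical id that is not on the page

print(found is not None)  # True
print(missing)            # None

# Guard before using the result, so a changed page fails with a clear message.
if missing is None:
    print("Container not found; check the id in Developer Tools.")
```

Without such a check, calling a method on the missing result would fail with an `AttributeError` on `NoneType`, which is harder to trace back to the wrong id.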

## Section 5: Parsing the HTML
In this section we will go through how to get the information you need from the HTML.

We search through the ```results``` HTML, find all the ```<h2>``` elements whose text contains "python" (in any letter case), and store them in the ```python_jobs``` list.

Run the next cell to see what's inside the ```python_jobs``` variable.



In [17]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

print(python_jobs)

[<h2 class="title is-5">Senior Python Developer</h2>, <h2 class="title is-5">Software Engineer (Python)</h2>, <h2 class="title is-5">Python Programmer (Entry-Level)</h2>, <h2 class="title is-5">Python Programmer (Entry-Level)</h2>, <h2 class="title is-5">Software Developer (Python)</h2>, <h2 class="title is-5">Python Developer</h2>, <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>, <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>, <h2 class="title is-5">Python Programmer (Entry-Level)</h2>, <h2 class="title is-5">Software Developer (Python)</h2>]
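
The sketch below, on made-up headings, shows why a function is passed to `string` rather than a plain string. It also adds a `None` check as a precaution, on the assumption that a heading containing nested tags has no single `.string` value.

```python
from bs4 import BeautifulSoup

# Hypothetical headings, invented for this example
html = """
<h2>Senior Python Developer</h2>
<h2>Energy engineer</h2>
<h2><span>Python</span> Programmer</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must match the whole text exactly, so this finds nothing.
exact = soup.find_all("h2", string="python")
print(exact)  # []

# A function lets us do a case-insensitive substring check instead.
# The None check protects against a tag whose .string is None,
# such as the third heading, which mixes a nested tag with text.
partial = soup.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)
print([h2.text for h2 in partial])  # ['Senior Python Developer']
```

The third heading is skipped because its text is split across a nested tag. On the workshop site every `<h2>` contains plain text only, so this case does not arise there.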


Now, for each ```<h2>``` element in the ```python_jobs``` list, we get the ancestor three levels up by calling ```.parent``` three times. We do this because the title, company, and location are all contained in that ancestor element.

The ```python_job_cards``` list holds the third-level parent of each ```<h2>``` element we found earlier.


Run the next cell to see what's inside the ```python_job_cards``` variable.

In [18]:
python_job_cards = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]
print(python_job_cards)

[<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
<div class="content">
<p class="location">
        Stewartbury, AA
      </p>
<p class="is-small has-text-grey">
<time datetime="2021-04-08">2021-04-08</time>
</p>
</div>
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
</div>, <div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://f

The following is a snippet of HTML from the website. The ```<div>``` element with the ```card-content``` class contains all the information you want. It’s a third-level parent of the ```<h2>``` title element that you found using your filter.
```html
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img
            src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg"
            alt="Real Python Logo"
          />
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-5">Senior Python Developer</h2>

        <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
      </div>
    </div>

    <div class="content">
      <p class="location">Stewartbury, AA</p>
      <p class="is-small has-text-grey">
        <time datetime="2021-04-08">2021-04-08</time>
      </p>
    </div>
    <footer class="card-footer">
      <a
        href="https://www.realpython.com"
        target="_blank"
        class="card-footer-item"
        >Learn</a
      >
      <a
        href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
        target="_blank"
        class="card-footer-item"
        >Apply</a
      >
    </footer>
  </div>
</div>
```
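
Counting `.parent` calls works, but it breaks if the site adds or removes a wrapper element. As an alternative sketch (not the approach used in the workshop cells), `find_parent()` looks the ancestor up by tag name and class. The HTML below is a shortened copy of the snippet above.

```python
from bs4 import BeautifulSoup

html = """
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-content">
        <h2 class="title is-5">Senior Python Developer</h2>
        <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
      </div>
    </div>
    <div class="content">
      <p class="location">Stewartbury, AA</p>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# Approach from the workshop: climb exactly three levels
by_counting = h2.parent.parent.parent

# Alternative: ask for the nearest enclosing div with the card-content class
by_name = h2.find_parent("div", class_="card-content")

print(by_counting is by_name)  # True
print(by_name.find("p", class_="location").text.strip())  # Stewartbury, AA
```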

Putting it all together, we have a script that scrapes the website, finds the jobs with "python" in the title, and prints each job's title, company, location, and application link.

Run the next cell to see the outcome.

In [19]:
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")


python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

python_job_cards = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]


for job_card in python_job_cards:
    title_element = job_card.find("h2", class_="title")
    company_element = job_card.find("h3", class_="company")
    location_element = job_card.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    link_url = job_card.find_all("a")[1]["href"]
    print(f"Apply here: {link_url}\n")

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA
Apply here: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

Software Engineer (Python)
Garcia PLC
Ericberg, AE
Apply here: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html

Python Programmer (Entry-Level)
Moss, Duncan and Allen
Port Sara, AE
Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html

Python Programmer (Entry-Level)
Cooper and Sons
West Victor, AE
Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html

Software Developer (Python)
Adams-Brewer
Brockburgh, AE
Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html

Python Developer
Rivera and Sons
East Michaelfort, AA
Apply here: https://realpython.github.io/fake-jobs/jobs/python-developer-50.html

Back-End Web Developer (Python, Django)
Stewart-Alexander
South Kimberly, AA
Apply here: https://rea
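
If you want to keep the results rather than only print them, one possible variation is to collect each card into a dictionary. The sketch below runs on a shortened copy of a single card from the site. The `card_to_dict` helper is a hypothetical name, and finding the link by its visible text "Apply" is a swapped-in alternative to the `[1]` index used in the script.

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <footer class="card-footer">
    <a class="card-footer-item" href="https://www.realpython.com">Learn</a>
    <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
  </footer>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def card_to_dict(job_card):
    """Turn one job card into a dictionary (hypothetical helper)."""
    # Look the link up by its visible text instead of its position.
    apply_link = job_card.find("a", string="Apply")
    return {
        "title": job_card.find("h2", class_="title").text.strip(),
        "company": job_card.find("h3", class_="company").text.strip(),
        "location": job_card.find("p", class_="location").text.strip(),
        "apply_url": apply_link["href"] if apply_link else None,
    }


jobs = [card_to_dict(card) for card in soup.find_all("div", class_="card-content")]
print(jobs)
```

A list of dictionaries like `jobs` is easy to pass on to other tools, for example to write a CSV file or build a table.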

References:
* https://realpython.com/beautiful-soup-web-scraper-python/
* https://www.w3schools.com/html/

Other resources for self-learning:
* [FreeCodeCamp's BeautifulSoup Crash Course](https://www.youtube.com/watch?v=XVv6mJpFOb0)
* [Tinkernut's Beginner Guide to Web Scraping with Python](https://www.youtube.com/watch?v=QhD015WUMxE&pp=ygUTd2ViIHNjcmFwaW5nIHB5dGhvbg%3D%3D)
* [Corey Schafer's Web Scraping with Beautiful Soup and Requests](https://youtu.be/ng2o98k983k)