# Introduction to Web Scraping in Python

Web scraping involves extracting and processing data from websites.

The web is one of the richest sources of information available today. Fields like data science, business intelligence, and journalism can gain valuable insights by collecting and analyzing web data.

In this tutorial, you will learn how to:

- Extract website data using string manipulation and regular expressions
- Parse HTML content with an HTML parser
- Interact with forms and dynamic web elements




**Before using Python for web scraping, always review the website's terms of service to confirm that automated access is allowed. Scraping a website without permission can be legally unclear, and violating its terms may get your access blocked or lead to legal trouble.**

---

## Building Our First Web Scraper

A useful package for web scraping in Python’s standard library is `urllib`, which provides tools for handling URLs. Specifically, the `urllib.request` module includes the `urlopen()` function, allowing you to open a URL directly within your program.


For this tutorial, we’ll use a page that’s hosted on Real Python’s server. The page that we’ll access has been set up for use with this kind of tutorial.

### 1. - Using `urllib`

### 1.1 - Import the `urlopen` Function from the `urllib.request` Module

In [13]:
from urllib.request import urlopen

### 1.2 - Define the URL to Scrape

In [14]:
url = "http://olympus.realpython.org/profiles/aphrodite"

### 1.3 - Open a URL Using `urlopen()`

In [15]:
page = urlopen(url)

In [16]:
page # urlopen() returns an HTTPResponse object

<http.client.HTTPResponse at 0x7b13575252d0>

### 1.4 - Extract the HTML Content of the Page

To extract the HTML from a webpage, first use the `.read()` method of the `HTTPResponse` object, which returns the data as a sequence of bytes. Then, apply the `.decode()` method to convert the bytes into a string, typically using UTF-8 encoding.


In [17]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")

In [18]:
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



The output you're viewing is the HTML code of the website, which your browser interprets and renders when you visit [http://olympus.realpython.org/profiles/aphrodite](http://olympus.realpython.org/profiles/aphrodite).

Using `urllib`, you accessed the website just like a browser would. However, instead of displaying the content visually, you retrieved the source code as text. Now that you have the HTML as text, you can extract information from it in several ways.
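
The `.decode("utf-8")` call assumes that the page is UTF-8 encoded. As a minimal sketch, the hypothetical helper `decode_body()` below decodes bytes with a declared charset when one is known and falls back to UTF-8 otherwise. The helper name and the fallback policy are illustrative assumptions, not part of the tutorial's code.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode response bytes, falling back to UTF-8 (an assumed default)."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


sample = "<title>Profile: Aphrodite</title>".encode("utf-8")
print(decode_body(sample))

# Bytes in another encoding decode correctly when the charset is passed in.
latin_bytes = "caf\u00e9".encode("latin-1")
print(decode_body(latin_bytes, "latin-1"))
```

With a live response, the declared charset can be read with `page.headers.get_content_charset()`, which returns `None` when the server does not declare one.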



---

## 2. - Extract Text From HTML With String Methods

One way to extract information from a webpage's HTML is by using string methods. For example, you can use `.find()` to search through the HTML text for the `<title>` tags and extract the page's title.

To begin, you'll extract the title from the webpage you requested earlier. If you know the index of the first character of the title and the index of the closing `</title>` tag, you can use a string slice to retrieve the title.

Since `.find()` returns the index of the first occurrence of a substring, you can get the index of the opening `<title>` tag by passing the string `"<title>"` to `.find()`:


### 2.1 - Extract the Title of the Page

In [19]:
title_index = html.find("<title>")
title_index

14

You don't actually want the index of the `<title>` tag itself, but rather the index of the title text. To get the index of the first letter in the title, simply add the length of the string `"<title>"` to the `title_index` value:


In [8]:
start_index = title_index + len("<title>")
start_index

21

Next, get the index of the closing `</title>` tag by passing the string `"</title>"` to `.find()`:


In [9]:
end_index = html.find("</title>")
end_index

39

Finally, you can extract the title by slicing the HTML string:

In [10]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'

Real-world HTML can be much more complex and unpredictable compared to the HTML on the Aphrodite profile page. Here’s [another profile page](http://olympus.realpython.org/profiles/poseidon) with messier HTML that you can scrape:


In [21]:
url = "http://olympus.realpython.org/profiles/poseidon"

In [22]:
page = urlopen(url)
html = page.read().decode("utf-8")
print(html)

<html>
<head>
<title >Profile: Poseidon</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/poseidon.jpg" />
<h2>Name: Poseidon</h2>
<br><br>
Favorite animal: Dolphin
<br><br>
Favorite color: Blue
<br><br>
Hometown: Sea
</center>
</body>
</html>



### 2.2 - Extract the Title of the New Page

In [26]:
url = "http://olympus.realpython.org/profiles/poseidon"
page = urlopen(url)
html = page.read().decode("utf-8")
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title

'\n<head>\n<title >Profile: Poseidon'

Whoops! There’s a bit of HTML mixed in with the title. Why is that?

The HTML for the `/profiles/poseidon` page looks similar to the `/profiles/aphrodite` page, but there’s a small difference: the opening `<title>` tag has an extra space before the closing angle bracket (`>`), rendering it as `<title >`.

As a result, `html.find("<title>")` returns -1 because the exact substring `"<title>"` doesn’t exist. When -1 is added to `len("<title>")`, which is 7, the `start_index` variable is assigned the value 6.

The character at index 6 of the `html` string is a newline character (`\n`), right before the opening angle bracket (`<`) of the `<head>` tag. This means that `html[start_index:end_index]` returns all the HTML starting from that newline and ending just before the `</title>` tag.

These types of issues can arise in many unpredictable ways, highlighting the need for a more reliable method to extract text from HTML.
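
One way to make this failure visible instead of silent is to check for the `-1` that `.find()` returns. The sketch below uses a hypothetical helper, `extract_between()`, and two shortened copies of the profile pages' HTML. It does not make string methods robust against messy HTML; it only reports a missing marker as `None` rather than returning a wrong slice.

```python
from typing import Optional


def extract_between(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between two markers, or None if either is missing."""
    marker_index = text.find(start_marker)
    if marker_index == -1:  # .find() signals "not found" with -1
        return None
    start_index = marker_index + len(start_marker)
    # Search for the end marker only after the start marker.
    end_index = text.find(end_marker, start_index)
    if end_index == -1:
        return None
    return text[start_index:end_index]


# Shortened copies of the two pages' HTML, so the sketch runs offline.
aphrodite = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>"
poseidon = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>"

print(extract_between(aphrodite, "<title>", "</title>"))  # Profile: Aphrodite
print(extract_between(poseidon, "<title>", "</title>"))   # None
```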


---

## 3. - Extract Text From HTML With Regular Expressions

Regular expressions—or regexes for short—are patterns used to search for text within a string. Python supports regular expressions through the standard library's `re` module.


### 3.1 - Import the `re` Module

In [27]:
import re

### 3.2 - Try to parse out the title [from another profile page](http://olympus.realpython.org/profiles/dionysus), which contains this rather carelessly written line of HTML:

```html
<TITLE >Profile: Dionysus</title  / >
```

In [28]:
import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)

Profile: Dionysus


Let's break down the first regular expression in the pattern string into three components:

1. `<title.*?>` matches the opening `<TITLE>` tag in the HTML. The `<title` portion aligns with `<TITLE` because `re.search()` is invoked with `re.IGNORECASE`. The `.*?>` portion matches any text that follows `<TITLE` up to the first occurrence of `>`.

2. `.*?` matches all text following the opening `<TITLE>`, but does so non-greedily, meaning it stops at the first instance of `</title.*?>`.

3. `</title.*?>` is similar to the first pattern, but it includes the `/` character, allowing it to match the closing `</title>` tag in the HTML.

The second regular expression, `<.*?>`, also employs the non-greedy `.*?` to match all HTML tags within the title string. By replacing any found matches with `""`, the `re.sub()` function effectively removes all tags, leaving only the text.
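
The two steps can be traced on the single line of HTML quoted earlier, without requesting the page. This is a small sketch; the `line` and `no_match` names are illustrative. It also shows that `re.search()` returns `None` when nothing matches, so calling `.group()` without a check would raise an `AttributeError` on a page that has no title.

```python
import re

line = "<TITLE >Profile: Dionysus</title  / >"
pattern = "<title.*?>.*?</title.*?>"

# Step 1: match the whole title element, ignoring case.
match = re.search(pattern, line, re.IGNORECASE)
print(match.group())

# Step 2: strip every tag from the matched text.
title = re.sub("<.*?>", "", match.group())
print(title)

# When the pattern is absent, re.search() returns None instead of a match.
no_match = re.search(pattern, "<h2>No title here</h2>", re.IGNORECASE)
print(no_match is None)
```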



---

## 4. - Check Your Understanding

### Write a program that grabs the full HTML from the following URL:

### [http://olympus.realpython.org/profiles/dionysus](http://olympus.realpython.org/profiles/dionysus)

### Next, use `.find()` to extract the text following "Name:" and "Favorite Color:". Ensure that you do not include any leading spaces or trailing HTML tags that may be present on the same line.


In [15]:
from urllib.request import urlopen

In [16]:
url = "http://olympus.realpython.org/profiles/dionysus"
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")

In [17]:
for string in ["Name: ", "Favorite Color:"]:
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)

    next_html_tag_offset = html_text[text_start_idx:].find("<")
    text_end_idx = text_start_idx + next_html_tag_offset

    raw_text = html_text[text_start_idx : text_end_idx]
    clean_text = raw_text.strip(" \r\n\t")
    print(clean_text)

Dionysus
Wine


---

## 5. - Parsing HTML Content With an HTML Parser

While regular expressions are powerful for pattern matching, using an HTML parser specifically designed for parsing HTML pages can often be more straightforward. There are several Python tools available for this purpose, but the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) library is an excellent choice for beginners.


### 5.1 - Install the `beautifulsoup4` Package

You can install Beautiful Soup using pip:

```bash
pip install beautifulsoup4
```

Because we defined a requirements file, you can install all the packages needed for this class by running the following command:

```bash
pip install -r requirements.txt
```

### 5.2 - Import the `BeautifulSoup` Class and Create a BeautifulSoup Object

In [18]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
soup

<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>

This code performs three main tasks:

1. Opens the URL [http://olympus.realpython.org/profiles/dionysus](http://olympus.realpython.org/profiles/dionysus) using `urlopen()` from the `urllib.request` module.

2. Reads the HTML from the page as a string and assigns it to the `html` variable.

3. Creates a BeautifulSoup object, assigning it to the `soup` variable.

The BeautifulSoup object created and assigned to `soup` is initialized with two arguments. The first argument is the HTML to be parsed, while the second argument, `"html.parser"`, specifies the parser to use. This indicates that Python's built-in HTML parser should be employed.
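
For a rough idea of what the standard library's `html.parser` module offers on its own, the sketch below subclasses `HTMLParser` to collect image sources from a shortened, hand-written fragment of the page. The `ImageSourceCollector` class is a hypothetical helper written for this illustration. Beautiful Soup builds a full, searchable tree on top of this kind of event-driven parsing, which is why you rarely need to write such a class yourself.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag == "img":
            self.sources.extend(value for name, value in attrs if name == "src")


fragment = (
    '<center><img src="/static/dionysus.jpg" />'
    "<h2>Name: Dionysus</h2>"
    '<img src="/static/grapes.png"></center>'
)

collector = ImageSourceCollector()
collector.feed(fragment)
print(collector.sources)
```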


### 5.3 - Using a BeautifulSoup Object

A BeautifulSoup object lets you query the parsed document in several ways. For instance, it includes a `.get_text()` method that extracts all the text from the document while automatically removing any HTML tags.

In [19]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






The output contains many blank lines, which are caused by newline characters in the HTML document's text. If necessary, you can remove these blank lines using the `.replace()` string method.
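
As an alternative to chained `.replace()` calls, the sketch below drops blank lines with `.splitlines()` and `.strip()`. The `text` value is a hand-copied approximation of the output above, so the example runs without requesting the page.

```python
# Approximation of the .get_text() output shown above.
text = (
    "\n\nProfile: Dionysus\n\n\n\n\nName: Dionysus\n\n"
    "Hometown: Mount Olympus\n\nFavorite animal: Leopard \n\n"
    "Favorite Color: Wine\n\n\n"
)

# Keep only lines that still contain something after trimming whitespace.
lines = [line.strip() for line in text.splitlines() if line.strip()]
cleaned = "\n".join(lines)
print(cleaned)
```

Stripping each line also removes the trailing space after "Leopard", which a plain `.replace("\n\n", "\n")` would leave in place.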

### 5.4 - Extracting Text From the HTML using BeautifulSoup

Frequently, you may want to extract only specific text from an HTML document. In such cases, using Beautiful Soup to extract the text and then applying the `.find()` string method can be easier than working directly with regular expressions.

However, there are times when the HTML tags themselves indicate the data you want to retrieve. For example, if you want to gather the URLs for all the images on a page, these links are found in the `src` attribute of `<img>` HTML tags.

In this scenario, you can use `find_all()` to return a list of all instances of that specific tag:


In [20]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This will return a list of all `<img>` tags in the HTML document. Although the objects in the list may appear to be strings representing the tags, they are actually instances of the Tag object provided by Beautiful Soup. Tag objects offer a straightforward interface for interacting with the information they contain.


In [21]:
image1, image2 = soup.find_all("img")

Each `Tag` object has a `.name` property that returns a string containing the HTML tag type:

In [22]:
image1.name

'img'

You can access the HTML attributes of a Tag object by placing their names within square brackets, similar to how you would access values in a dictionary.

For instance, the `<img src="/static/dionysus.jpg"/>` tag has a single attribute, `src`, with the value `"/static/dionysus.jpg"`. Similarly, an HTML tag like the link `<a href="https://realpython.com" target="_blank">` has two attributes, `href` and `target`.

To retrieve the source of an image on the Dionysus profile page, access the `src` attribute using the dictionary notation described above. For example, `image1["src"]` evaluates to `'/static/dionysus.jpg'`.


You can check the attributes of a Tag object by using the `.attrs` property:

In [23]:
image1.attrs

{'src': '/static/dionysus.jpg'}

Certain tags in HTML documents can be accessed through properties of the Tag object. For example, to retrieve the `<title>` tag in a document, you can use the `.title` property:

In [24]:
soup.title

<title>Profile: Dionysus</title>

In [25]:
soup.title.string

'Profile: Dionysus'

---

## 6. - Check Your Understanding

### Write a program that grabs the full HTML from the following URL:

### [http://olympus.realpython.org/profiles](http://olympus.realpython.org/profiles)

### Next, use Beautiful Soup to extract a list of all the links on the page by looking for HTML tags with the name `a` and retrieving the value of each tag's `href` attribute.

### The final output should look like this:

```shell
http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus
```


In [26]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [27]:
base_url = "http://olympus.realpython.org"

In [28]:
html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")

In [29]:
soup = BeautifulSoup(html_text, "html.parser")

In [30]:
for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus


---

## 7. - Interacting With Forms and Dynamic Web Elements

The `urllib` module you've been using in this tutorial is great for requesting the contents of a web page. However, there are instances when you need to interact with a web page to obtain the necessary content. For example, you might need to submit a form or click a button to reveal hidden content.

The Python standard library does not include built-in functionality for interacting with web pages, but many third-party packages are available on PyPI. One popular and relatively straightforward option is [MechanicalSoup](https://mechanicalsoup.readthedocs.io/en/stable/).

Essentially, MechanicalSoup acts as a headless browser—a web browser that operates without a graphical user interface. This headless browser can be controlled programmatically through a Python script.


### 7.1 - Install the `mechanicalsoup` Package

You can install MechanicalSoup using pip:

```bash
pip install MechanicalSoup
```

Because we defined a requirements file, you can install all the packages needed for this class by running the following command:

```bash
pip install -r requirements.txt
```

**Note: You may need to restart your Jupyter Notebook kernel after installing MechanicalSoup.**

### 7.2 - Create a Browser Object

In [33]:
import mechanicalsoup as ms

browser = ms.Browser()

Browser objects represent the headless web browser. You can utilize these objects to request a page from the Internet by passing a URL to their `.get()` method:

In [34]:
url = "http://olympus.realpython.org/login"
page = browser.get(url)
page

<Response [200]>

The number 200 represents the status code returned by the request. A status code of 200 indicates that the request was successful. Conversely, an unsuccessful request might return a status code of 404 if the URL does not exist, or 500 if there is a server error during the request.
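
The standard library's `http.HTTPStatus` enumeration maps these numeric codes to their standard reason phrases, which can be handy when logging responses. A short sketch:

```python
from http import HTTPStatus

# Look up the standard reason phrase for each code mentioned above.
phrases = {}
for code in (200, 404, 500):
    status = HTTPStatus(code)
    phrases[code] = status.phrase
    print(f"{status.value}: {status.phrase}")
```

Assuming the response object behaves like a `requests` response, as its `<Response [200]>` representation suggests, the numeric code itself should be available as `page.status_code`.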


MechanicalSoup utilizes BeautifulSoup to parse the HTML obtained from the request. The `page` object includes a `.soup` attribute, which represents a BeautifulSoup object:

In [35]:
type(page.soup)

bs4.BeautifulSoup

In [36]:
page.soup

<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<h2>Please log in to access Mount Olympus:</h2>
<br/><br/>
<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>
</center>
</body>
</html>

Notice this page has a `<form>` on it with `<input>` elements for a username and a password.

### 7.3 - Filling Out and Submitting a Form

Before proceeding, open the [/login](http://olympus.realpython.org/login) page from the previous example in a browser and take a look at it yourself.


Try entering a random username and password combination. If your guess is incorrect, the message "Wrong username or password!" will appear at the bottom of the page.

On the other hand, if you enter the correct login credentials, listed in the table below, you will be redirected to the [/profiles](http://olympus.realpython.org/profiles) page:

| Username | Password       |
|----------|----------------|
| zeus     | ThunderDude    |


In the following example, you'll learn how to use MechanicalSoup to fill out and submit this form using Python!

The key part of the HTML code is the login form, which includes everything inside the `<form>` tags. This form has its `name` attribute set to "login" and contains two `<input>` elements: one named `user` and the other named `pwd`. Additionally, there is a third `<input>` element for the Submit button.

With an understanding of the login form's structure and the required credentials, let's examine a program that fills out the form and submits it.


In [37]:
import mechanicalsoup

# 1
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup

# 2
form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

# 3
profiles_page = browser.submit(form, login_page.url)

In [38]:
profiles_page.url

'http://olympus.realpython.org/profiles'

Now, let's break down the example:

1. You create a `Browser` instance and use it to request the URL `http://olympus.realpython.org/login`. The HTML content of the page is assigned to the `login_html` variable using the `.soup` property.

2. `login_html.select("form")` returns a list of all `<form>` elements on the page. Since there is only one `<form>` element, you can access it by retrieving the element at index 0 of the list. Alternatively, if there's only one form on the page, you can also use `login_html.form`.

3. The next two lines select the username and password input fields and set their values to "zeus" and "ThunderDude", respectively.

4. You submit the form using `browser.submit()`. Note that you pass two arguments to this method: the form object and the URL of the login page, which you can access via `login_page.url`.

5. In the interactive window, you confirm that the submission successfully redirected to the `/profiles` page. If something had gone wrong, the value of `profiles_page.url` would still be `"http://olympus.realpython.org/login"`.
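
To get a feel for what submitting this form amounts to, the sketch below builds, but never sends, an equivalent POST request with the standard library. It assumes the default form encoding (`application/x-www-form-urlencoded`), since the form declares no `enctype`. This approximates the request; it does not describe MechanicalSoup's internals, which may add headers or other details.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

login_url = "http://olympus.realpython.org/login"

# Field names come from the name attributes of the two <input> elements.
fields = {"user": "zeus", "pwd": "ThunderDude"}

# Encode the fields the way a default HTML form submission is encoded.
body = urlencode(fields).encode("ascii")

# The form's action="/login" is resolved against the URL of the login page.
request = Request(urljoin(login_url, "/login"), data=body, method="POST")

print(request.get_method(), request.full_url)
print(body)
```

Nothing is sent over the network here; the request object is only constructed and inspected.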


Now that you have the `profiles_page` variable set, it's time to programmatically obtain the URL for each link on the `/profiles` page.

To achieve this, you can use the `.select()` method again, this time passing the string `"a"` to select all `<a>` anchor elements on the page:


In [39]:
links = profiles_page.soup.select("a")

In [40]:
for link in links:
    address = link["href"]
    text = link.text
    print(f"{text}: {address}")

Aphrodite: /profiles/aphrodite
Poseidon: /profiles/poseidon
Dionysus: /profiles/dionysus


The URLs contained in each `href` attribute are relative URLs, which aren't very useful if you want to navigate to them later using MechanicalSoup. If you know the base URL, you can easily construct the full URL.

In this case, the base URL is `http://olympus.realpython.org`. You can concatenate this base URL with the relative URLs found in the `href` attributes to form complete URLs.


In [41]:
base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")

Aphrodite: http://olympus.realpython.org/profiles/aphrodite
Poseidon: http://olympus.realpython.org/profiles/poseidon
Dionysus: http://olympus.realpython.org/profiles/dionysus
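
String concatenation works here because every `href` starts with a slash. As a sketch of a more general approach, `urljoin()` from `urllib.parse` resolves a link against the URL of the page it came from. The last two `href` values below are hypothetical and do not appear on the page.

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles"
hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# Root-relative links are resolved against the scheme and host of the page.
full_urls = [urljoin(page_url, href) for href in hrefs]
for full_url in full_urls:
    print(full_url)

# Hypothetical hrefs: an absolute URL is returned unchanged, and a bare
# name is resolved relative to the directory of the base URL.
absolute = urljoin(page_url, "https://realpython.com")
relative = urljoin("http://olympus.realpython.org/profiles/", "zeus")
print(absolute)
print(relative)
```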


---

## 8. - Check Your Understanding

### Use MechanicalSoup to provide the correct username (zeus) and password (ThunderDude) to the login form located at the URL http://olympus.realpython.org/login.

### Once the form is submitted, display the title of the current page to confirm that you’ve been redirected to the /profiles page.

### Your program should print the text `<title>All Profiles</title>`.



In [1]:
import mechanicalsoup

browser = mechanicalsoup.Browser()

In [2]:
login_url = "http://olympus.realpython.org/login"
login_page = browser.get(login_url)
login_html = login_page.soup

In [3]:
form = login_html.form
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

In [4]:
profiles_page = browser.submit(form, login_page.url)

In [5]:
print(profiles_page.soup.title)

<title>All Profiles</title>


In [7]:
links = profiles_page.soup.select("a")

In [8]:
base_url = "http://olympus.realpython.org"
all_links = []
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")
    all_links.append(address)
    
all_links

Aphrodite: http://olympus.realpython.org/profiles/aphrodite
Poseidon: http://olympus.realpython.org/profiles/poseidon
Dionysus: http://olympus.realpython.org/profiles/dionysus


['http://olympus.realpython.org/profiles/aphrodite',
 'http://olympus.realpython.org/profiles/poseidon',
 'http://olympus.realpython.org/profiles/dionysus']

In [10]:
for link in all_links:
    page = browser.get(link)
    print(page.soup.title.string)

Profile: Aphrodite
Profile: Poseidon
Profile: Dionysus


---

## 9. - Interacting With Websites in Real Time

Sometimes, you may want to fetch real-time data from a website that provides continually updated information.

In the past, before learning Python programming, you might have had to sit in front of a browser, clicking the Refresh button to reload the page every time you wanted to check for updated content. Now, you can automate this process using the `.get()` method of the MechanicalSoup Browser object.

To see this in action, open your preferred browser and navigate to the URL: [http://olympus.realpython.org/dice](http://olympus.realpython.org/dice).


The [/dice](http://olympus.realpython.org/dice) page simulates a roll of a six-sided die, updating the result with each browser refresh. Below, you'll write a program that repeatedly scrapes the page for a new result.

First, you need to identify which element on the page contains the die roll result. To do this, right-click anywhere on the page and select **View Page Source**. Look for an `<h2>` tag about halfway down the HTML code that appears as follows:

```html
<h2 id="result">3</h2>
```



### 9.1 - To begin, write a simple program that opens the `/dice` page, scrapes the result, and prints it to the console. Here’s a basic example to get you started:

In [11]:
import mechanicalsoup

browser = mechanicalsoup.Browser()
page = browser.get("http://olympus.realpython.org/dice")
page.soup

<html>
<head>
<title>Dice Roll</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<h1>Your dice roll result:</h1>
<br/>
<h2 id="result">3</h2>
<br/>
<p><a href="/dice">Roll it again</a></p>
<br/>
<br/>
<p id="time">October 18, 2024 09:49:37AM</p>
</center>
</body>
</html>

In [12]:
tag = page.soup.select("#result")[0]
result = tag.text

print(f"The result of your dice roll is: {result}")

The result of your dice roll is: 3


This example uses the `BeautifulSoup` object’s `.select()` method to find the element with `id=result`. The string `#result`, which you pass to `.select()`, utilizes the CSS ID selector `#` to indicate that `result` is an ID value.

To periodically get a new result, you’ll need to create a loop that loads the page at each step. Therefore, everything below the line `browser = mechanicalsoup.Browser()` in the above code needs to be placed inside the loop.

For this example, you want to roll the die four times at ten-second intervals. To achieve this, the last line of the loop needs to tell Python to pause for ten seconds. You can do this with `sleep()` from Python’s `time` module. The `sleep()` function takes a single argument: the number of seconds to sleep.


Here’s an example that illustrates how `sleep()` works:

In [49]:
import time

print("I'm about to wait for five seconds...")
time.sleep(5)
print("Done waiting!")

I'm about to wait for five seconds...
Done waiting!


When you run this code, you’ll notice that the "Done waiting!" message isn’t displayed until five seconds have passed since the first `print()` function was executed.

### 9.2 - Now, combine the code snippets above to create a program that rolls the die four times at ten-second intervals:

In [51]:
import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")
    time.sleep(10)

The result of your dice roll is: 5
The result of your dice roll is: 1
The result of your dice roll is: 5
The result of your dice roll is: 5


When you run the program, you’ll immediately see the first result printed to the console. After ten seconds, the second result is displayed, then the third, and finally the fourth. What happens after the fourth result is printed?

The program continues running for another ten seconds before it finally stops. That’s kind of a waste of time! You can stop it from doing this by using an `if` statement to run `time.sleep()` for only the first three requests:


In [52]:
import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)

The result of your dice roll is: 3
The result of your dice roll is: 6
The result of your dice roll is: 3
The result of your dice roll is: 2


**With techniques like this, you can scrape data from websites that periodically update their data. However, you should be aware that requesting a page multiple times in rapid succession can be seen as suspicious, or even malicious, use of a website.**


It’s even possible to crash a server with an excessive number of requests, so you can imagine that many websites are concerned about the volume of requests to their server! Always check the Terms of Use and be respectful when sending multiple requests to a website.
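
The pause-between-requests pattern can be factored into a small reusable function. The sketch below is one possible shape, not a standard API: `poll()` and its parameters are hypothetical names, and the page request is replaced by a stand-in so the example runs offline and without waiting.

```python
import time
from typing import Callable, List


def poll(fetch: Callable[[], str], times: int, interval: float,
         sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch() a fixed number of times, pausing between calls."""
    results = []
    for i in range(times):
        results.append(fetch())
        # Pause only between requests, never after the last one.
        if i < times - 1:
            sleep(interval)
    return results


# Stand-ins: canned rolls instead of a page request, and a list that
# records the requested pauses instead of actually sleeping.
rolls = iter(["3", "6", "3", "2"])
pauses = []
results = poll(lambda: next(rolls), times=4, interval=10, sleep=pauses.append)
print(results)
print(pauses)
```

In real use, `fetch` would wrap the `browser.get()` and `.select("#result")` calls shown earlier, and the default `sleep` argument would pause for real. Keeping the interval as an explicit parameter makes it easy to choose a delay that is considerate of the server.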


---

# Conclusion

Although it’s possible to parse data from the Web using tools in Python’s standard library, there are many tools on PyPI that can help simplify the process.

In this tutorial, you learned how to:

- Request a web page using Python’s built-in `urllib` module
- Parse HTML using Beautiful Soup
- Interact with web forms using MechanicalSoup
- Repeatedly request data from a website to check for updates

Writing automated web scraping programs is fun, and the Internet has no shortage of content that can lead to all sorts of exciting projects.

Just remember, not everyone wants you pulling data from their web servers. Always check a website’s Terms of Use before you start scraping, and be respectful about how you time your web requests so that you don’t flood a server with traffic.


---

# Additional Resources



For more information on web scraping with Python, check out the following resources:

- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
- [urllib Documentation](https://docs.python.org/3/library/urllib.html)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [MechanicalSoup documentation](https://mechanicalsoup.readthedocs.io/en/stable/)