In [8]:
# import urlopen(): 
from urllib.request import urlopen

In [9]:
# webpage to open
url = "http://olympus.realpython.org/profiles/aphrodite"

In [10]:
# To open the web page, pass url to urlopen():
page = urlopen(url)

In [14]:
# urlopen() returns an HTTPResponse object
page

<http.client.HTTPResponse at 0x24c76473730>

In [11]:
html_bytes = page.read()

In [12]:
html = html_bytes.decode("utf-8")

In [13]:
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



With urllib, you accessed the website similarly to how you would in your browser. However, instead of rendering the content visually, you grabbed the source code as text. Now that you have the HTML as text, you can extract information from it in a couple of different ways.
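As a hedged sketch, the fetch-and-decode steps above can be wrapped in one helper. The name fetch_html and the fallback to UTF-8 are assumptions for illustration, not part of the walkthrough. The with block closes the response when it ends, and the charset is read from the Content-Type header when the server sends one.

```python
from urllib.request import urlopen


def fetch_html(url, default_charset="utf-8"):
    """Return the body at url decoded to text (hypothetical helper)."""
    with urlopen(url) as page:  # closes the response when the block ends
        # The headers object parses "Content-Type: text/html; charset=..."
        charset = page.headers.get_content_charset() or default_charset
        return page.read().decode(charset)


# Example call (needs network access, so it is left commented out):
# html = fetch_html("http://olympus.realpython.org/profiles/aphrodite")
```

Falling back to UTF-8 is only a guess for pages that do not declare a charset; a page in another encoding would still need the right name passed in.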

In [16]:
# get the index of the opening <title> tag by passing the string "<title>" to .find():

title_index = html.find("<title>")
title_index

14

In [17]:
# To get the index of the first letter in the title, you can add the length of the string "<title>" to title_index:
start_index = title_index + len("<title>")
start_index

21

In [18]:
# get the index of the closing </title> tag by passing the string "</title>" to .find():
end_index = html.find("</title>")
end_index

39

In [19]:
# extract the title by slicing the html string:
title = html[start_index:end_index]
title

'Profile: Aphrodite'

### Try a slightly more complicated page

In [49]:
url1 = "http://olympus.realpython.org/profiles/poseidon"
page = urlopen(url1)
html = page.read().decode("utf-8")
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title  # the result includes stray HTML because the exact string "<title>" is never found

'\n<head>\n<title >Profile: Poseidon'
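The stray HTML in that result comes from how .find() reports a miss: it returns -1 rather than raising an error. The sketch below reproduces the effect on an inline string that is assumed to mimic the start of the Poseidon page, where the opening tag is written as `<title >` with a space.

```python
# Assumed sample that mimics the start of the Poseidon page
html = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>\n</html>"

title_index = html.find("<title>")
print(title_index)  # -1, because the page contains "<title >" with a space

start_index = title_index + len("<title>")  # -1 + 7 == 6, a meaningless offset
end_index = html.find("</title>")
print(repr(html[start_index:end_index]))  # '\n<head>\n<title >Profile: Poseidon'
```

Checking for -1 before slicing would turn this silent error into an explicit one.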

##### Regular expressions—or regexes for short—are patterns that you can use to search for text within a string. Python supports regular expressions through the standard library’s re module.
Note: Regular expressions aren’t particular to Python. They’re a general programming concept and are supported in many programming languages.

Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more instances of whatever comes just before the asterisk.

In [21]:
import re

In [22]:
re.findall("ab*c", "ac")

['ac']

The regular expression "ab*c" matches any part of the string that begins with "a", ends with "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.

In [24]:
re.findall("ab*c", "abcd")


['abc']

In [25]:
re.findall("ab*c", "acc")

['ac']

In [26]:
re.findall("ab*c", "abcac")

['abc', 'ac']

In [28]:
re.findall("ab*c", "abdc")  # If no match is found, re.findall() returns an empty list.

[]

##### Pattern matching is case sensitive.
If you want to match this pattern regardless of the case, then you can pass a third argument with the value re.IGNORECASE:



In [29]:
re.findall("ab*c", "ABC")

[]

In [30]:
re.findall("ab*c", "ABC", re.IGNORECASE)

['ABC']

You can use a period (.) to stand for any single character in a regular expression. For instance, "a.c" finds every substring in which "a" and "c" are separated by exactly one character. If zero characters or more than one character sit between them, there is no match and re.findall() returns an empty list:

In [31]:
re.findall("a.c", "abc")

['abc']

In [32]:
re.findall("a.c", "abbc")

[]

In [33]:
re.findall("a.c", "acc")

['acc']

In [34]:
re.findall("a.c", "ac")

[]

The pattern .* inside a regular expression stands for any character (except a newline, by default) repeated any number of times. For instance, you can use "a.*c" to find every substring that starts with "a" and ends with "c", regardless of which letter or letters are in between:

In [35]:
re.findall("a.*c", "abc")

['abc']

In [36]:
re.findall("a.*c", "acc")

['acc']

In [37]:
re.findall("a.*c", "abbc")

['abbc']

In [40]:
re.findall("a.*c", "aghjmmmcc")

['aghjmmmcc']

Use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because, instead of a list of strings, it returns a Match object that stores the first match and any groups captured within it, or None if the pattern is not found. Calling .group() on the Match object returns the matched text.

In [41]:
match_results = re.search("ab*c", "ABC", re.IGNORECASE)
match_results.group()

'ABC'
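Because re.search() returns None when nothing matches, calling .group() directly on the result can raise an AttributeError. A minimal sketch of a guarded lookup follows; the input strings are made up for illustration.

```python
import re

match = re.search("ab*c", "xxABCxx", re.IGNORECASE)
if match is not None:
    found = match.group()  # the matched text
    span = match.span()    # (start, end) indexes of the match
else:
    found, span = None, None

print(found, span)  # ABC (2, 5)

no_match = re.search("ab*c", "xyz")
print(no_match)  # None, so no_match.group() would raise AttributeError
```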

##### The re module also provides re.sub(), which behaves somewhat like the .replace() string method
re.sub(), short for substitute, lets you replace the text in a string that matches a regular expression with new text.

In [45]:
string = "Everything is <replaced>  if it's in <tags>.." # * is greedy, so "<.*>" finds the longest possible match
string = re.sub("<.*>", "ELEPHANTS", string)
string

'Everything is ELEPHANTS..'

In [46]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*?>", "ELEPHANTS", string) # this matches the shortest possible string of text
string

"Everything is ELEPHANTS if it's in ELEPHANTS."

This time, re.sub() finds two matches, <replaced> and <tags>, and substitutes the string "ELEPHANTS" for both matches.
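To see exactly what each pattern matched, the same two patterns can be passed to re.findall() instead of re.sub(). This is only an illustrative check on the string used above.

```python
import re

text = "Everything is <replaced> if it's in <tags>."

greedy = re.findall("<.*>", text)   # * grabs as much text as possible
lazy = re.findall("<.*?>", text)    # *? stops at the first closing >

print(greedy)  # ["<replaced> if it's in <tags>"]
print(lazy)    # ['<replaced>', '<tags>']
```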

#### Now Extract Text From HTML With Regular Expressions

In [None]:
<TITLE >Profile: Dionysus</title  / >

In [48]:
# regex_soup.py

import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags using regex

print(title)

Profile: Dionysus


Take a closer look at the first regular expression in the pattern string by breaking it down into three parts:

<title.*?> matches the opening <TITLE > tag in html. The <title part of the pattern matches with <TITLE because re.search() is called with re.IGNORECASE, and .*?> matches any text after <TITLE up to the first instance of >.

.*? non-greedily matches all text after the opening <TITLE >, stopping at the first match for </title.*?>.

    
</title.*?> differs from the first pattern only in its use of the / character, so it matches the closing </title  / > tag in html.
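The three parts can be put together in a small function that also copes with a missing title. This is a sketch under stated assumptions: extract_title is a hypothetical name, the sample string is an inline stand-in for the downloaded page, and re.DOTALL is an addition so that a title spanning several lines would still match.

```python
import re


def extract_title(html):
    """Return the text of the first <title> element, or None (hypothetical helper)."""
    # re.DOTALL lets . match newlines in case the title spans several lines
    match = re.search("<title.*?>.*?</title.*?>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Strip the opening and closing tags, then any surrounding whitespace
    return re.sub("<.*?>", "", match.group()).strip()


sample = "<html><head><TITLE >Profile: Dionysus</title  / ></head></html>"
print(extract_title(sample))             # Profile: Dionysus
print(extract_title("<p>no title</p>"))  # None
```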
 
    

### Mini task

In [50]:
url = "http://olympus.realpython.org/profiles/dionysus"
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")

Then use .find() to display the text following Name: and Favorite Color: (not including any leading spaces or trailing HTML tags that might appear on the same line).

In [51]:
for string in ["Name: ", "Favorite Color:"]:
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)

    next_html_tag_offset = html_text[text_start_idx:].find("<")
    text_end_idx = text_start_idx + next_html_tag_offset

    raw_text = html_text[text_start_idx : text_end_idx]
    clean_text = raw_text.strip(" \r\n\t")
    print(clean_text)

Dionysus
Wine
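The loop above assumes that every label is present in the page. A slightly more defensive version, sketched here with a hypothetical helper named text_after and an assumed inline fragment in place of the real page, returns None when .find() reports -1.

```python
def text_after(label, html_text):
    """Return the text between label and the next "<", or None (hypothetical helper)."""
    label_idx = html_text.find(label)
    if label_idx == -1:  # .find() signals "not found" with -1
        return None
    start = label_idx + len(label)
    end = html_text.find("<", start)  # search from start, no second slice needed
    if end == -1:
        end = len(html_text)
    return html_text[start:end].strip()


# Assumed fragment resembling the Dionysus profile page
sample = "<h2>Name: Dionysus</h2>\nFavorite Color: Wine\n</center>"
print(text_after("Name:", sample))            # Dionysus
print(text_after("Favorite Color:", sample))  # Wine
print(text_after("Hometown:", sample))        # None
```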


### Use an HTML Parser for Web Scraping in Python

#### Beautiful soup

In [58]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen() from the urllib.request module
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)

# Reads the HTML from the page as a string and assigns it to the html variable
html = page.read().decode("utf-8")

# Creates a BeautifulSoup object and assigns it to the soup variable
soup = BeautifulSoup(html, "html.parser")

In [55]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






In [59]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

In [60]:
image1, image2 = soup.find_all("img")

In [62]:
image1.name # Each Tag object has a .name property that returns a string containing the HTML tag type:

'img'

In [63]:
image1["src"]


'/static/dionysus.jpg'

In [64]:
image2["src"]

'/static/grapes.png'

In [65]:
soup.title

<title>Profile: Dionysus</title>

In [66]:
soup.title.string

'Profile: Dionysus'

Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.

In [67]:
soup.find_all("img", src="/static/dionysus.jpg")

[<img src="/static/dionysus.jpg"/>]
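Indexing a tag with ["src"] raises a KeyError when the attribute is missing, which is common on real pages. The following sketch uses an inline HTML string that is assumed for illustration, with one image that deliberately lacks a src, to show the more forgiving .get() and the src=True filter.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the second <img> has no src attribute on purpose
sample_html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<img src="/static/dionysus.jpg" />
<img alt="no source here" />
<h2>Name: Dionysus</h2>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# tag.get("src") returns None instead of raising KeyError
sources = [img.get("src") for img in soup.find_all("img")]
print(sources)  # ['/static/dionysus.jpg', None]

# src=True keeps only the tags that have a src attribute at all
with_src = soup.find_all("img", src=True)
print(len(with_src))  # 1

# find() returns the first match or None, unlike find_all(), which returns a list
heading = soup.find("h2")
heading_text = heading.get_text(strip=True) if heading is not None else None
print(heading_text)  # Name: Dionysus
```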

### Interact With HTML Forms

#### MechanicalSoup

In [76]:
import mechanicalsoup
browser = mechanicalsoup.Browser()

In [77]:
url = "http://olympus.realpython.org/login"
page = browser.get(url)

In [78]:
page

<Response [200]>

MechanicalSoup uses Beautiful Soup to parse the HTML from the request, and page has a .soup attribute that represents a BeautifulSoup object.

In [79]:
type(page.soup)

bs4.BeautifulSoup

In [80]:
page.soup

<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<h2>Please log in to access Mount Olympus:</h2>
<br/><br/>
<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>
</center>
</body>
</html>

In [81]:
import mechanicalsoup

# 1
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup

# 2
form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

# 3
profiles_page = browser.submit(form, login_page.url)

In [82]:
profiles_page.url

'http://olympus.realpython.org/profiles'
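Selecting the inputs by position, as in step 2, breaks if the page ever reorders its fields. One possible alternative is to select each input by its name attribute with a CSS attribute selector. The sketch below works on the form markup shown earlier, parsed directly with Beautiful Soup so that it runs without a network request; it is not the approach the walkthrough itself uses.

```python
from bs4 import BeautifulSoup

# Form markup copied from the login page shown earlier
login_html = BeautifulSoup(
    '<form action="/login" method="post" name="login">'
    'Username: <input name="user" type="text"/><br/>'
    'Password: <input name="pwd" type="password"/><br/><br/>'
    '<input type="submit" value="Submit"/>'
    "</form>",
    "html.parser",
)

form = login_html.select_one("form")
# Attribute selectors pick each field by its name instead of its position
form.select_one('input[name="user"]')["value"] = "zeus"
form.select_one('input[name="pwd"]')["value"] = "ThunderDude"

print(form.select_one('input[name="user"]')["value"])  # zeus
```

The filled-in form object could then be passed to browser.submit() exactly as in step 3.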

Now that you have the profiles_page variable set, it’s time to programmatically obtain the URL for each link on the /profiles page.
To do this, you use .select() again, this time passing the string "a" to select all the <a> anchor elements on the page.

In [83]:
links = profiles_page.soup.select("a")

In [84]:
for link in links:
    address = link["href"]
    text = link.text
    print(f"{text}: {address}")

Aphrodite: /profiles/aphrodite
Poseidon: /profiles/poseidon
Dionysus: /profiles/dionysus


The URLs contained in each href attribute are relative URLs, which aren’t very helpful if you want to navigate to them later using MechanicalSoup. If you know the site’s base URL, then you can prepend it to each relative URL to construct a full URL:

In [85]:
base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")

Aphrodite: http://olympus.realpython.org/profiles/aphrodite
Poseidon: http://olympus.realpython.org/profiles/poseidon
Dionysus: http://olympus.realpython.org/profiles/dionysus
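String concatenation works here because every href starts with a slash. As an alternative sketch, the standard library's urllib.parse.urljoin() resolves a relative URL against the URL of the page it was found on. The href values below are copied from the output above rather than fetched.

```python
from urllib.parse import urljoin

# URL of the page the links were found on, and the href values printed above
page_url = "http://olympus.realpython.org/profiles"
hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

absolute_urls = [urljoin(page_url, href) for href in hrefs]
for address in absolute_urls:
    print(address)
```

This prints the same three full URLs as the loop above, without a hard-coded base_url.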


### Task

Use MechanicalSoup to provide the correct username (zeus) and password (ThunderDude) to the login form located at the URL http://olympus.realpython.org/login.

Once the form is submitted, display the title of the current page to determine that you’ve been redirected to the /profiles page.

Your program should print the text <title>All Profiles</title>.

In [86]:
import mechanicalsoup

browser = mechanicalsoup.Browser()

login_url = "http://olympus.realpython.org/login"
login_page = browser.get(login_url)
login_html = login_page.soup

Because the page has only a single form on it, you can access the form via login_html.form. Using .select(), select the username and password inputs and fill them with the username "zeus" and the password "ThunderDude":

In [87]:
form = login_html.form
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

# Now that the form is filled out, you can submit it with browser.submit()
profiles_page = browser.submit(form, login_page.url)
print(profiles_page.soup.title)

<title>All Profiles</title>


### Interact With Websites in Real Time

Sometimes you want to be able to fetch real-time data from a website that offers continually updated information. You can automate this process using the .get() method of the MechanicalSoup Browser object.

In [None]:
# mech_soup.py

import mechanicalsoup

browser = mechanicalsoup.Browser()
page = browser.get("http://olympus.realpython.org/dice")
tag = page.soup.select("#result")[0]
result = tag.text

print(f"The result of your dice roll is: {result}")

This example uses the BeautifulSoup object’s .select() method to find the element with id=result. The string "#result", which you pass to .select(), uses the CSS ID selector # to indicate that result is an id value.

In [88]:
import time

print("I'm about to wait for five seconds...")
time.sleep(5)
print("Done waiting!")

I'm about to wait for five seconds...
Done waiting!


In [89]:
# mech_soup.py

import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")
    time.sleep(10)

The result of your dice roll is: 6
The result of your dice roll is: 4
The result of your dice roll is: 1
The result of your dice roll is: 5


In [90]:
import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)

The result of your dice roll is: 1
The result of your dice roll is: 4
The result of your dice roll is: 6
The result of your dice roll is: 2
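The "pause only between requests" logic can be pulled into a reusable function. In this sketch, poll is a hypothetical helper, and the dice request is replaced by a stand-in that yields fixed values so that the example runs instantly and without network access. The recorded pauses show that no wait follows the last request.

```python
import time


def poll(fetch, times, delay_seconds, sleep=time.sleep):
    """Call fetch() `times` times, pausing only between calls (hypothetical helper)."""
    results = []
    for i in range(times):
        results.append(fetch())
        if i < times - 1:  # no pause after the final request
            sleep(delay_seconds)
    return results


# Stand-in for the dice request so the sketch runs without network access
fake_rolls = iter(["6", "4", "1", "5"])
pauses = []  # records each requested delay instead of actually sleeping
rolls = poll(lambda: next(fake_rolls), times=4, delay_seconds=10, sleep=pauses.append)

print(rolls)   # ['6', '4', '1', '5']
print(pauses)  # [10, 10, 10]
```

In real use, fetch would be a function that calls browser.get() and returns the text of the #result element, and the sleep argument would be left at its default.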
