Web pages are written in **HyperText Markup Language** (HTML), a **markup language** with its own syntax and rules.

Here's the HTML for a very simple Web page:

<img src="https://dq-content.s3.amazonaws.com/6ne0anS.png" align=left />

HTML documents contain two major sections:
* The head section contains information that's useful to the Web browser rendering the page; most of it isn't displayed in the page itself.
* The body section contains the bulk of the content the user interacts with on the page.

Check out [MDN's HTML elements reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of all HTML tags.
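
To make the head/body split concrete before we fetch a real page, here is a minimal sketch that uses only Python's standard library. The page is written inline for illustration, and `SectionCollector` is a hypothetical helper name, not something from any library used later on.

```python
from html.parser import HTMLParser

# A small page written inline for illustration (not fetched from anywhere).
PAGE = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


class SectionCollector(HTMLParser):
    """Hypothetical helper: records which tags open inside <head> and <body>."""

    def __init__(self):
        super().__init__()
        self.current = None  # section we are currently inside, if any
        self.sections = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.sections:
            self.current = tag  # entering <head> or <body>
        elif self.current is not None:
            self.sections[self.current].append(tag)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None  # leaving the section


collector = SectionCollector()
collector.feed(PAGE)
print(collector.sections)
```

The title tag ends up under the head section and the paragraph tag under the body section, which matches the description above.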

In [1]:
import requests

In [2]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

content = response.content
content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
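
The `b'...'` prefix in the output shows that `response.content` is a `bytes` object rather than a string. As a rough sketch, the same bytes can be decoded by hand; the UTF-8 encoding is an assumption that happens to hold for this ASCII-only page. (`requests` also exposes the decoded string as `response.text`.)

```python
# The bytes below are copied from the output above, so no network access is needed.
content = b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

print(type(content))  # bytes, which is why the output starts with b'...'

# Assumption: the page is UTF-8 encoded (true for this ASCII-only example).
text = content.decode("utf-8")
print(type(text))
print(text)  # the \n escapes now show up as real line breaks
```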

We'll use the [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) library to parse the Web page with Python. This library allows us to extract tags from an HTML document.

In [3]:
from bs4 import BeautifulSoup

In [4]:
# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

print(type(parser))
parser

<class 'bs4.BeautifulSoup'>


<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [5]:
# Get the body tag from the document.
# The parser object represents the top level (root) of the document, so we pick a branch off the root.
# With BeautifulSoup, we can access branches by using tag names as attributes.
body = parser.body

body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [6]:
# Get the p tag from the body.
p = body.p

p

<p>Here is some simple content for this page.</p>
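
The "branches off the root" idea can be checked by walking back up the tree. This is a small sketch on an inline copy of the page (so it runs without a network connection); `name` and `parent` are standard attributes of BeautifulSoup tags.

```python
from bs4 import BeautifulSoup

# Inline copy of the example page, used here instead of a live request.
html = """<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""
soup = BeautifulSoup(html, "html.parser")

p = soup.body.p
print(p.name)                # tag type of the element itself
print(p.parent.name)         # the branch that p hangs off
print(p.parent.parent.name)  # one level further up

# Attribute lookups search all descendants, so the shortcut below finds the same paragraph.
print(soup.p.text)
```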

In [7]:
# Print the text inside the p tag.
# text is a property that returns the text inside a tag.
print(p.text)

Here is some simple content for this page.


In [8]:
# Print the text inside the title tag.
print(parser.head.title.text)

A simple example page
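
Attribute access has two behaviors worth knowing about: it returns only the first matching tag, and it returns `None` when no such tag exists. The sketch below shows both on a made-up page with two paragraphs; the HTML is an illustration, not one of the pages used in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up page with two paragraphs and no table.
html = """<html>
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access stops at the first <p>; the second one is never returned.
print(soup.body.p.text)

# There is no <table> in this page, so the attribute evaluates to None.
table = soup.body.table
print(table)

# Guard against None before reading .text, otherwise Python raises AttributeError.
if table is not None:
    print(table.text)
```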


Attribute access returns only the first matching tag, so it's usually better to be explicit and use the **find_all** method. This method finds every occurrence of a tag inside the current element and returns them in a list-like `ResultSet`.

In [9]:
# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

print(type(body))
body

<class 'bs4.element.ResultSet'>


[<body>
 <p>Here is some simple content for this page.</p>
 </body>]

In [10]:
# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

p

Here is some simple content for this page.


[<p>Here is some simple content for this page.</p>]
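
The example page has a single paragraph, so the list above has only one entry. As a sketch of the more general case, here is `find_all` on a made-up page with two paragraphs, alongside the related `find` method, which returns the first match directly.

```python
from bs4 import BeautifulSoup

# Made-up page with two paragraphs, written inline for illustration.
html = """<html>
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
print(len(paragraphs))
for p in paragraphs:
    print(p.text)

# No match gives an empty result rather than an error.
print(soup.find_all("table"))

# find returns the first match directly (or None), so no [0] is needed.
first = soup.find("p")
print(first.text)
```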

HTML allows elements to have **IDs**. Because an ID should be unique within a page, we can use it to refer to a specific element.

Here's an example page:

<img src="https://dq-content.s3.amazonaws.com/WBG4aCQ.png" align=left />

In [11]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

parser

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p id="first">
                First paragraph.
            </p>
</div>
<p id="second">
<b>
                Second paragraph.
            </b>
</p>
</body>
</html>

In [12]:
# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)


                First paragraph.
            


In [13]:
print(parser.find_all("p", id="second")[0].text)



                Second paragraph.
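
The printed text keeps the newlines and indentation from the page source. One possible way to tidy it up is sketched below on an inline copy of the relevant markup: either call `str.strip()` on the text, or ask BeautifulSoup to strip it with `get_text(strip=True)`. Since an ID identifies a single element, `find` is used here instead of `find_all(...)[0]`.

```python
from bs4 import BeautifulSoup

# Inline copy of the relevant part of the example page, indentation included.
html = """<body>
    <div>
        <p id="first">
            First paragraph.
        </p>
    </div>
    <p id="second">
        <b>
            Second paragraph.
        </b>
    </p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p", id="first")
print(repr(first.text))    # raw text, including newlines and indentation
print(first.text.strip())  # plain str.strip() removes the padding

second = soup.find(id="second")     # the tag name is optional when searching by ID
print(second.get_text(strip=True))  # get_text can strip each piece of text itself
```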
            



In HTML, elements can also have **classes**. Unlike IDs, classes aren't unique: many different elements can belong to the same class, usually because they share a common purpose or characteristic.

<img src="https://dq-content.s3.amazonaws.com/T2TguLL.png" align=left />

In [14]:
# Get the page that contains classes.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

parser

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [15]:
# Get the first inner paragraph.
# Find all the paragraph tags with the class inner-text.
# Then, take the first element in that list.
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)


                First paragraph.
            


In [16]:
print(parser.find_all("p", class_="inner-text")[1].text)
print(parser.find_all("p", class_="outer-text")[0].text)
print(parser.find_all("p", class_="outer-text")[1].text)


                Second paragraph.
            


                First outer paragraph.
            



                Second outer paragraph.
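
Because a class usually matches several elements, it is often more convenient to loop over the whole result than to index into it one element at a time. The sketch below does that on a condensed inline copy of the page; `texts_by_class` is a hypothetical helper written for this example. The keyword is spelled `class_` because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Condensed inline copy of the example page with classes.
html = """<body>
<div>
<p class="inner-text">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>"""
soup = BeautifulSoup(html, "html.parser")


def texts_by_class(soup, class_name):
    """Hypothetical helper: stripped text of every <p> with the given class."""
    return [p.get_text(strip=True) for p in soup.find_all("p", class_=class_name)]


inner = texts_by_class(soup, "inner-text")
outer = texts_by_class(soup, "outer-text")
print(inner)
print(outer)
```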
            



**Cascading Style Sheets**, or **CSS**, is a language for adding styles to HTML pages.