# Intro

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to you.

## HTML basics

HTML consists of elements, which we mark up with tags. We open a tag like this:

`<p>`

We close a tag like this:

`</p>`

Read more here: 
* https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics
* https://developer.mozilla.org/en-US/docs/Web/HTML/Element
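
To make the opening and closing tags concrete, here is a minimal sketch that uses Python's standard-library `html.parser` module to log each tag as it is read. The `TagLogger` class is a hypothetical helper written for this note, not part of any library.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records each opening and closing tag it sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for an opening tag such as <p>.
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        # Called for a closing tag such as </p>.
        self.events.append(("close", tag))

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
logger.feed("<p>Here is some simple content for this page.</p>")
logger.close()
print(logger.events)
```

The logged events show the structure of an element: an opening tag, the content, then the matching closing tag.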

In [14]:
import requests

response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content

print(content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
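
The `b'...'` prefix in the output means `response.content` is a `bytes` object, which is why the newlines show up as `\n`. As a small sketch, decoding the bytes gives a regular string that prints as readable HTML. The bytes below are copied from the output above so the example runs without a network call, and the UTF-8 encoding is an assumption.

```python
# The bytes printed above, copied here so no request is needed.
content = (
    b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n'
    b'    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n'
    b'    </body>\n</html>'
)

# Assumption: the page is UTF-8 encoded.
html_text = content.decode("utf-8")
print(html_text)
```

With a live response, `response.text` returns an already-decoded string.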


## Retrieving HTML elements with BeautifulSoup

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the **[BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/) to parse the Web page** with Python. This library allows us to extract tags from an HTML document.




In [15]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')


# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)

# Shorter version: chain the attributes to reach the title text.
title_text = parser.head.title.text
print(title_text)


Here is some simple content for this page.
A simple example page


## Using `find_all`

While it's nice to use the tag type as a property, it's not always a very robust way to parse a document. It's usually better to be more explicit by using the `find_all` method. This method will **find all occurrences of a tag in the current element, and return a list**.
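
A small sketch of the difference, using a hypothetical two-paragraph snippet rather than one of the pages from this notebook: attribute access stops at the first match and yields `None` when the tag is missing, while `find_all` returns every match (or an empty list).

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph snippet, not one of the pages used in this notebook.
html = "<html><body><p>First.</p><p>Second.</p></body></html>"
parser = BeautifulSoup(html, "html.parser")

# Attribute access stops at the first match.
print(parser.body.p.text)

# find_all returns every match, in document order.
paragraphs = parser.find_all("p")
print([p.text for p in paragraphs])

# A tag that is absent: attribute access gives None, find_all gives an empty list.
print(parser.table)
print(parser.find_all("table"))
```

Because a missing tag comes back as `None`, chaining `.text` onto it raises an `AttributeError`, which is one reason the explicit form is easier to guard.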




In [16]:
parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

Here is some simple content for this page.


In [17]:
# Shorter version to get the title text
title_text = parser.find_all("head")[0].find_all("title")[0].text
print(title_text)


A simple example page


## Element IDs
HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to a specific element.

E.g.: https://dataquestio.github.io/web-scraping-pages/simple_ids.html
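
Since an ID should match at most one element, a sketch with `find` (which returns a single element instead of a list) may read more naturally than indexing into `find_all`. The HTML below is a hypothetical snippet modeled on the linked page; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the page linked above; the real page may differ.
html = """
<body>
    <p id="first">
        First paragraph.
    </p>
    <p id="second">
        Second paragraph.
    </p>
</body>
"""
parser = BeautifulSoup(html, "html.parser")

# find returns the first matching element, or None if nothing matches.
second = parser.find("p", id="second")

# get_text(strip=True) drops the surrounding whitespace and newlines.
second_text = second.get_text(strip=True)
print(second_text)
```

Using `get_text(strip=True)` is optional; plain `.text` keeps the indentation and newlines from the source HTML.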




In [18]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)


                First paragraph.
            


In [20]:
second_paragraph_text = parser.find_all("p", id="second")[0].text
print(second_paragraph_text)



                Second paragraph.
            



## Classes

In HTML, elements can also have classes. Unlike IDs, classes aren't unique: many different elements can belong to the same class, usually because they share a common purpose or characteristic.

For example, you may want to create three dividers (`div` elements) to display three of your photographs. You can give these dividers a common look and feel, such as a shared border and caption style.

This is where classes come into play. You could create a class called "gallery," define a style for it once using CSS (which we'll talk about soon), and then apply that class to all of the dividers you'll use to display photos. One element can even have multiple classes.
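
The gallery idea can be sketched as follows. The markup and the class names (`gallery`, `featured`, `footer`) are made up for illustration. Searching by one class also matches elements that carry additional classes.

```python
from bs4 import BeautifulSoup

# Hypothetical gallery markup; the class names are made up for illustration.
html = """
<div class="gallery">Photo one</div>
<div class="gallery featured">Photo two</div>
<div class="gallery">Photo three</div>
<div class="footer">Contact</div>
"""
parser = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so BeautifulSoup uses the keyword class_.
gallery_divs = parser.find_all("div", class_="gallery")
print(len(gallery_divs))

# An element's classes come back as a list, one entry per class.
print(gallery_divs[1]["class"])
```

The second divider has two classes, so it is found by a search for either one.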



In [30]:
response = requests.get("https://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

print(content)

# Get every p tag that has the inner-text class.
inner_paragraphs = parser.find_all("p", class_="inner-text")
print(inner_paragraphs)


second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text
print(second_inner_paragraph_text)

first_outer_paragraph_text = parser.find_all("p", class_="outer-text")[0].text
print(first_outer_paragraph_text)

b'<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text">\n                First paragraph.\n            </p>\n            <p class="inner-text">\n                Second paragraph.\n            </p>\n        </div>\n        <p class="outer-text">\n            <b>\n                First outer paragraph.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                Second outer paragraph.\n            </b>\n        </p>\n    </body>\n</html>'
[<p class="inner-text">
                First paragraph.
            </p>, <p class="inner-text">
                Second paragraph.
            </p>]

                Second paragraph.
            


                First outer paragraph.
            



## CSS Selectors

Continue: https://www.dataquest.io/m/54/web-scraping/7/css-selectors