# Intro

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to you.

## HTML basics

HTML consists of tags. We open a tag like this:

`<p>`


We close a tag like this:


`</p>`

Read more here: 
* https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics
* https://developer.mozilla.org/en-US/docs/Web/HTML/Element
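
To make the opening tag, content, and closing tag concrete, here is a minimal sketch that uses Python's standard-library `html.parser` module to log what a parser sees in a single paragraph. The class name `TagLogger` is made up for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each opening tag, closing tag, and piece of text the parser sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


logger = TagLogger()
logger.feed("<p>Here is some simple content for this page.</p>")
logger.close()

for event in logger.events:
    print(event)
```

The three events (open, text, close) mirror the structure of the element: everything between `<p>` and `</p>` is its content.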

In [2]:
import requests

response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content

print(content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
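
The `b'...'` prefix shows that `response.content` is a `bytes` object, which is why the newlines appear as `\n`. A small sketch of decoding it into readable text follows. The bytes are copied from the output above so the example runs offline, and UTF-8 is an assumption about the page's encoding.

```python
# Same bytes as the output above, pasted in so no network request is needed.
content = (
    b'<!DOCTYPE html>\n<html>\n    <head>\n'
    b'        <title>A simple example page</title>\n    </head>\n'
    b'    <body>\n'
    b'        <p>Here is some simple content for this page.</p>\n'
    b'    </body>\n</html>'
)

# Assumption: the page is UTF-8 encoded.
html_text = content.decode("utf-8")
print(html_text)
```

With a live response, `response.text` gives a decoded string directly, using the encoding that `requests` detects.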


## Retrieving HTML elements with BeautifulSoup

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the [BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/) to **parse the Web page** with Python. This library allows us to extract tags from an HTML document.




In [3]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')


# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)

# Shorter version: chain the attributes to get the title text.
title_text = parser.head.title.text
print(title_text)


Here is some simple content for this page.
A simple example page


## The `find_all` method

While it's nice to use the tag type as a property, it's not always a very robust way to parse a document. It's usually better to be more explicit by using the `find_all` method. This method will **find all occurrences of a tag in the current element, and return a list**.
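
A short sketch of the difference, using a made-up two-paragraph document rather than one of the pages above: attribute access returns only the first match (or `None` when nothing matches), while `find_all` always returns a list.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration.
html = "<html><body><p>First</p><p>Second</p></body></html>"
parser = BeautifulSoup(html, "html.parser")

# Attribute access silently stops at the first <p>.
first_only = parser.body.p.text
print(first_only)

# find_all returns every <p>, in document order.
paragraphs = parser.find_all("p")
all_texts = [p.text for p in paragraphs]
print(all_texts)

# A missing tag is None via attribute access, but an empty list via find_all.
missing_attr = parser.body.table
missing_list = parser.find_all("table")
print(missing_attr, len(missing_list))
```

Because a missing tag comes back as `None`, chaining something like `parser.body.table.text` would raise an `AttributeError`, whereas an empty list can be checked or looped over safely.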




In [4]:
parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

Here is some simple content for this page.


In [5]:
# Shorter version to get the title text
title_text = parser.find_all("head")[0].find_all("title")[0].text
print(title_text)


A simple example page
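
When only the first match is needed, BeautifulSoup's `find` method avoids the `[0]` indexing. A minimal sketch on an inline copy of the simple page's markup (condensed onto one line here):

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
parser = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
title = parser.find("title")
title_text = title.text if title is not None else ""
print(title_text)

# There is no <h1> in this document, so this is None rather than an IndexError.
missing = parser.find("h1")
print(missing)
```

Checking for `None` is one way to handle pages where an element may be absent; indexing `find_all(...)[0]` would raise an `IndexError` in that case.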


## Element IDs
HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to one specific element.

Example page: https://dataquestio.github.io/web-scraping-pages/simple_ids.html




In [6]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)


                First paragraph.
            


In [7]:
second_paragraph_text = parser.find_all("p", id="second")[0].text
print(second_paragraph_text)



                Second paragraph.
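
The blank lines and indentation in these outputs come from the whitespace inside the `<p>` tags. A small sketch of two ways to trim it, using an inline snippet written to resemble the paragraph on the page (the exact markup here is an assumption):

```python
from bs4 import BeautifulSoup

# Inline snippet resembling the second paragraph of the page.
html = """
<p id="second">
                Second paragraph.
            </p>
"""
parser = BeautifulSoup(html, "html.parser")
second = parser.find_all("p", id="second")[0]

raw_text = second.text                        # keeps newlines and indentation
stripped = second.get_text(strip=True)        # BeautifulSoup trims the text
also_stripped = second.text.strip()           # plain Python string method

print(repr(raw_text))
print(stripped)
print(also_stripped)
```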
            



## Classes

In HTML, elements can also have classes. Unlike IDs, classes aren't unique: many different elements can belong to the same class, usually because they share a common purpose or characteristic.

For example, you may want to create three `div` elements (dividers) to display three of your photographs. You can give these dividers a common look and feel, such as a shared border and caption style.

This is where classes come into play. You could create a class called `gallery`, define a style for it once using CSS (which we'll talk about soon), and then apply that class to all of the dividers you'll use to display photos. One element can even have multiple classes.
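
A sketch of how this looks from the parsing side, using made-up gallery markup (the `featured` and `sidebar` class names are invented for illustration). Searching by `class_` matches any element that has the class, even when it has other classes too.

```python
from bs4 import BeautifulSoup

# Made-up markup: two gallery dividers, one of which has a second class.
html = """
<div class="gallery">Photo 1</div>
<div class="gallery featured">Photo 2</div>
<div class="sidebar">Not a photo</div>
"""
parser = BeautifulSoup(html, "html.parser")

gallery = parser.find_all("div", class_="gallery")
gallery_texts = [div.text for div in gallery]
print(gallery_texts)

# The class attribute is returned as a list, because one element can have several.
second_classes = gallery[1]["class"]
print(second_classes)
```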



In [8]:
response = requests.get("https://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

print(content)

inner_paragraphs = parser.find_all("p", class_="inner-text")
print(inner_paragraphs)


second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text
print(second_inner_paragraph_text)

first_outer_paragraph_text = parser.find_all("p", class_="outer-text")[0].text
print(first_outer_paragraph_text)

b'<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text">\n                First paragraph.\n            </p>\n            <p class="inner-text">\n                Second paragraph.\n            </p>\n        </div>\n        <p class="outer-text">\n            <b>\n                First outer paragraph.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                Second outer paragraph.\n            </b>\n        </p>\n    </body>\n</html>'
[<p class="inner-text">
                First paragraph.
            </p>, <p class="inner-text">
                Second paragraph.
            </p>]

                Second paragraph.
            


                First outer paragraph.
            



## CSS Selectors

**Cascading Style Sheets**, or CSS, is a language for adding styles to HTML pages. You may have noticed that our simple HTML pages from the previous sections didn't have any styling; all of the paragraphs had black text and the same font size. Most Web pages use CSS to display a lot more than basic black text.

CSS uses **selectors** to add styles to the elements and classes of elements you specify. You can use selectors to add background colors, text colors, borders, padding, and many other style choices to the elements on HTML pages.



### Tag based
This CSS will make all of the text **inside all paragraphs** red:

```CSS
p {
    color: red;
}
```

### Tag + class based
This CSS will make all of the text **inside all paragraphs with the class `inner-text`** red:

```CSS
p.inner-text {
    color: red;
}
```

### Tag + ID based
This CSS will change the text color to red for any paragraph that has the **ID `first`**. We select IDs with the pound or hash symbol (#):

```CSS
p#first {
    color: red;
}
```
 
### ID / class only

You can also style IDs and classes without using any specific tags. For example, this CSS will make the element with the **ID `first`** red (not just paragraphs):

```CSS
#first {
    color: red;
}
```


This CSS will make any element with the **class `inner-text`** red:

```CSS
.inner-text {
    color: red;
}
```

### Combinations / specificity

Selectors can combine tags, classes, and IDs. When several rules match the same element and set the same property, the rule with the most specific selector wins. See this example: https://codepen.io/Bolland/pen/pZWrMP

**CSS**
```CSS
p {
    color: red;
    font-weight: bold;
}

p.inner {
    color: blue;
}

p.inner#class-test {
    color: green;
}
```

**HTML**
```HTML
<p>p only</p> <!-- red text -->

<p class="inner">p and class</p> <!-- blue text -->

<p class="inner" id="class-test">p and class and id</p> <!-- green text -->
```
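
To see why the third rule wins, the idea of specificity can be sketched as a count of IDs, classes, and tags in each selector. The helper below is a simplified illustration and an assumption of this notebook, not how a browser is implemented: it handles only plain tags, `.classes`, and `#ids`, and ignores inline styles, `!important`, pseudo-classes, and source order.

```python
import re


def specificity(selector):
    """Return (id_count, class_count, tag_count) for a simple selector.

    Only handles tags, .classes and #ids, e.g. "p.inner#class-test".
    """
    ids = len(re.findall(r"#[\w-]+", selector))
    classes = len(re.findall(r"\.[\w-]+", selector))
    tags = len(re.findall(r"(?:^|\s)[a-zA-Z][\w-]*", selector))
    return (ids, classes, tags)


selectors = ["p", "p.inner", "p.inner#class-test"]
for selector in selectors:
    print(selector, specificity(selector))

# Tuples compare left to right, so the ID count is checked first,
# then the class count, then the tag count.
winner = max(selectors, key=specificity)
print(winner)
```

For the three rules above this gives `(0, 0, 1)`, `(0, 1, 1)`, and `(1, 1, 1)`, so the selector with the ID ranks highest.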

### Using CSS selectors in BeautifulSoup

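We can use **BeautifulSoup's `.select` method** to work with CSS selectors. It takes a CSS selector string and returns a list of the matching elements. The example below uses the page at http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html: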



In [18]:
# Get the page content.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content

# print(content)

# Initialize the BeautifulSoup parser.
parser = BeautifulSoup(content, "html.parser")

# Select all of the elements that have the first-item class.
first_items = parser.select(".first-item")

print(first_items)

# Print the text of the first paragraph (the first element with the first-item class).
print("First item text: " + first_items[0].text)


[<p class="inner-text first-item" id="first">
                First paragraph.
            </p>, <p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>]
First item text: 
                First paragraph.
            


In [19]:
first_outer_text = parser.select(".outer-text")[0].text
print(first_outer_text)

second_text = parser.select("#second")[0].text
print(second_text)



                First outer paragraph.
            



                First outer paragraph.
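
The selector forms from the CSS section all work with `.select`. The sketch below runs them against simplified inline markup modeled on the page above (the exact markup is an assumption), and also uses `select_one`, which returns the first match or `None` instead of a list.

```python
from bs4 import BeautifulSoup

# Simplified markup modeled on the ids_and_classes page.
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
parser = BeautifulSoup(html, "html.parser")

all_paragraphs = parser.select("p")               # tag
inner_paragraphs = parser.select("p.inner-text")  # tag + class
first_text = parser.select("p#first")[0].text     # tag + ID
second_text = parser.select_one("#second").text   # ID only, first match

print(len(all_paragraphs), len(inner_paragraphs))
print(first_text)
print(second_text)
```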
            



## CSS Nesting

We can nest CSS selectors similar to the way HTML nests tags. For example, we could use selectors to find all of the paragraphs inside the body tag. Nesting is a very powerful technique that enables us to use CSS to do complex Web scraping tasks.

This selector will target any **paragraph inside a `div` tag**:

```CSS
div p
```

This selector will target any **element with the class `first-item` that is inside a `div` tag**:
```CSS
div .first-item
```

This one is even more specific. It selects **any element with the ID `first`, but only if it is inside a `div` tag that is itself inside a `body` tag**:
```CSS
body div #first
```
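
A minimal sketch of these three nested selectors with BeautifulSoup's `.select`, run against simplified inline markup (an assumption modeled on the pages used earlier). Note that the outer paragraph has the class `first-item` but is not inside a `div`, so the nested selector skips it.

```python
from bs4 import BeautifulSoup

# Simplified markup for illustration.
html = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
</body></html>
"""
parser = BeautifulSoup(html, "html.parser")

in_div = parser.select("div p")                        # paragraphs inside a div
first_items_in_div = parser.select("div .first-item")  # first-item, but only inside a div
nested_first = parser.select("body div #first")        # ID first, inside div, inside body
all_first_items = parser.select(".first-item")         # no nesting: both paragraphs

print(len(in_div))
print([tag["id"] for tag in first_items_in_div])
print(nested_first[0].text)
print(len(all_first_items))
```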

### Example:
https://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html



In [36]:
# Get the Super Bowl box score data.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# In each row, cell 0 is the label, cell 1 is the Seahawks column,
# and cell 2 is the Patriots column.

# Find the number of turnovers the Seahawks committed.
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
print(seahawks_turnovers_count)

# Total plays for the New England Patriots.
patriots_total_plays_count = parser.select("tr#total-plays")[0].select("td")[2].text
print(patriots_total_plays_count)

# Total yards for the Seahawks.
seahawks_total_yards_count = parser.select("#total-yards")[0].select("td")[1].text
print(seahawks_total_yards_count)

1
72
396
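
Hard-coded cell indices such as `[1]` and `[2]` are easy to mix up between teams. One possible alternative is to look up the column from the header row. The sketch below uses a small made-up table: the team codes `AAA` and `BBB`, the numbers, and the helper name `stat_for` are invented for illustration and are not taken from the Super Bowl page.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration; NOT the real box score.
html = """
<table>
  <tr><th></th><th>AAA</th><th>BBB</th></tr>
  <tr id="total-plays"><td>Total Plays</td><td>10</td><td>20</td></tr>
  <tr id="turnovers"><td>Turnovers</td><td>3</td><td>4</td></tr>
</table>
"""
parser = BeautifulSoup(html, "html.parser")


def stat_for(parser, row_id, team):
    """Return the text of the cell in row `row_id` under the header `team`."""
    headers = [th.get_text(strip=True) for th in parser.select("tr th")]
    column = headers.index(team)  # raises ValueError if the team is not a header
    cells = parser.select("tr#" + row_id + " td")
    return cells[column].get_text(strip=True)


print(stat_for(parser, "total-plays", "BBB"))
print(stat_for(parser, "turnovers", "AAA"))
```

This assumes a single header row whose cells line up one-to-one with the cells of each data row, including the empty header above the label column.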
