A lot of data aren't accessible through data sets or APIs. They may exist on the Internet as Web pages, though. One way to access the data without waiting for the provider to create an API is to use a technique called Web scraping.
Web scraping allows us to load a Web page into Python and extract the information we want. We can then work with the data using standard analysis tools like pandas and numpy.
Before we can do Web scraping, we need to understand the structure of the Web page we're working with, then find a way to extract parts of that structure in a sensible way.
We'll use the requests library heavily as we learn about Web scraping. This library enables us to download a Web page. We'll also use the BeautifulSoup library (imported in Python as bs4) to extract the relevant parts of the Web page.

Web Page Structure

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to you.

Here's the HTML for a very simple Web page:
<html>
    <head>
        <title>A simple page</title>
    </head>
    <body>
        <p>Here is simple content for the page.</p>
    </body>
</html> 
Anything between the opening and closing of a tag is the content of that tag. We can nest tags to create complex formatting rules. Here's an example:
<p><b>This paragraph is bold.</b></p>
The b tag bolds the content inside it, and the p tag creates a new paragraph. The HTML above will display as a bold paragraph because the b tag is inside the p tag. In other words, the b tag is nested within the p tag.

HTML documents contain a few major sections. The head section contains information that's useful to the Web browser that's rendering the page; the user doesn't see it. The body section contains the bulk of the content the user interacts with on the page.

Different tags have different purposes. For example, the title tag tells the Web browser what page title to display at the top of your tab. The p tag indicates that the content inside it is a single paragraph.

In [1]:
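import requests

# Download the page. response.content holds the raw bytes of the body.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
# response.text is the same body decoded to a string, which prints as readable HTML.
print(response.text)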

<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>
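
The download step above assumes the request succeeds. A slightly more defensive version might look like the sketch below. The helper name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original walkthrough.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a decoded string.

    fetch_html is a hypothetical helper name, and the default timeout
    is an arbitrary choice made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # quietly handing an error page to the parser.
    response.raise_for_status()
    # .text is the body decoded to str; .content is the raw bytes.
    return response.text
```

Calling fetch_html with the simple.html URL used above should return the same HTML, provided the page is still available.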


Retrieving Elements from a Page

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.
We'll use the BeautifulSoup library to parse the Web page with Python. This library allows us to extract tags from an HTML document.
We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree). BeautifulSoup works the same way.
If we look at this page, for example, the root of the "tree" is the html tag:
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>

The html tag contains two "branches," head and body. head contains one "branch," title. body contains one branch, p. Drilling down through these multiple branches is one way to parse a Web page.
To extract the text inside the p tag, we would first need to get the body element, then the p element, and then finally the text inside the p element.

In [3]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')
# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)
# Get the title tag from the head and print its text.
head = parser.head
title = head.title
title_text = title.text
print(title_text)

Here is some simple content for this page.
A simple example page
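
To make the "tree" picture concrete, the sketch below lists the tags that sit directly under html, head, and body. It parses an inline copy of the page so that it runs without a network connection; child_tags is a hypothetical helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup, Tag

html = """<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")


def child_tags(element):
    """Return the names of the tags directly beneath element."""
    # .children also yields the whitespace between tags as strings,
    # so keep only the Tag objects.
    return [child.name for child in element.children if isinstance(child, Tag)]


root_branches = child_tags(soup.html)
head_branches = child_tags(soup.head)
body_branches = child_tags(soup.body)
print(root_branches, head_branches, body_branches)
```

Each list corresponds to one level of branches described earlier: html holds head and body, head holds title, and body holds p.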


Using Find All

While it's nice to use the tag type as a property, it's not always a very robust way to parse a document. It's usually better to be more explicit by using the find_all method. This method will find all occurrences of a tag in the current element, and return a list.
If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, it behaves the same way as passing in the tag type as an attribute.

In [4]:
from bs4 import BeautifulSoup

parser = BeautifulSoup(content, 'html.parser')
# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")  # This page has only one body.
# Get the paragraph tag.
p = body[0].find_all("p")  # This page has only one paragraph.
# Get the text.
para_text = p[0].text
print(para_text)

# Get the title.
head = parser.find_all("head")
title = head[0].find_all("title")
title_text = title[0].text
print(title_text)

Here is some simple content for this page.
A simple example page
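
The difference between attribute access and find_all is easier to see on a page with more than one paragraph. The sketch below uses a small made-up HTML string (an assumption for illustration, not one of the tutorial pages).

```python
from bs4 import BeautifulSoup

# Made-up page with two paragraphs and no table.
html = "<html><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching tag.
first_only = soup.body.p.text

# find_all returns every match, in document order.
all_texts = [p.text for p in soup.find_all("p")]

# When nothing matches, attribute access gives None, while find_all
# gives an empty list.
missing_attribute = soup.table
missing_list = soup.find_all("table")

print(first_only, all_texts, missing_attribute, len(missing_list))
```

Because a missing tag yields None with attribute access, chaining further attributes onto it raises an error, whereas the empty list from find_all can be checked before indexing.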


Element IDs

HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to one specific element.
Here's an example page:
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p id="first">First paragraph.</p>
        </div>
        <div>
            <p id="second"><b>Second paragraph.</b></p>
        </div>    
    </body>
</html>

You can see the page at http://dataquestio.github.io/web-scraping-pages/simple_ids.html.

HTML uses the div tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers hold a Web page's footer, sidebar, and horizontal menu.

There are two paragraphs on the page, each nested inside a div. Luckily, the paragraphs have IDs. This means we can access them easily, even though they're nested.

Let's use the find_all method to access those paragraphs, and pass in the additional id attribute.

In [5]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)
# Do the same for the paragraph with the ID second.
second_paragraph = parser.find_all("p", id="second")[0]
second_paragraph_text = second_paragraph.text
print(second_paragraph_text)


                First paragraph.
            


                Second paragraph.
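
The printed text above carries the line breaks and indentation that surround it in the page source. One way to tidy this up is sketched below, using an inline HTML string whose layout is an assumption meant to mimic that whitespace. It also uses find, which returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

# Assumed layout: each paragraph's text sits on its own indented line.
html = """<html><body>
<div>
    <p id="first">
        First paragraph.
    </p>
</div>
<p id="second">
    <b>
        Second paragraph.
    </b>
</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match (or None), so no [0] index is needed.
first = soup.find("p", id="first")

# .text keeps the surrounding whitespace exactly as it appears in the HTML.
raw_text = first.text
# get_text(strip=True) strips each piece of text before joining them.
clean_text = first.get_text(strip=True)

# The tag name can be omitted when searching by ID alone.
second_text = soup.find(id="second").get_text(strip=True)

print(repr(raw_text))
print(clean_text)
print(second_text)
```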
            



Element Classes

In HTML, elements can also have classes. Classes aren't globally unique. In other words, many different elements can belong to the same class, usually because they share a common purpose or characteristic.
For example, you may want to create three dividers to display three of your photographs. You can create a common look and feel for these dividers, such as a border and caption style.
This is where classes come into play. You could create a class called "gallery," define a style for it once using CSS (which we'll talk about soon), and then apply that class to all of the dividers you'll use to display photos. One element can even have multiple classes.
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text">First paragraph.</p>
            <p class="inner-text">Second paragraph.</p>
        </div>
        <div>
            <p class="outer-text"><b>First outer paragraph.</b></p>
            <p class="outer-text"><b>Second outer paragraph.</b></p>
        </div>    
    </body>
</html>
Take a look at http://dataquestio.github.io/web-scraping-pages/simple_classes.html to see how we've used classes to style paragraphs.

We can use find_all to select elements by class. We'll just need to pass in the class_ parameter.

In [6]:
# Get the website that contains classes.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Get the first inner paragraph.
# Find all the paragraph tags with the class inner-text.
# Then, take the first element in that list.
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)
# Get the second inner paragraph.
second_inner_paragraph = parser.find_all("p", class_="inner-text")[1]
second_inner_paragraph_text = second_inner_paragraph.text
print(second_inner_paragraph_text)
# Get the first outer paragraph.
first_outer_paragraph = parser.find_all("p", class_="outer-text")[0]
first_outer_paragraph_text = first_outer_paragraph.text
print(first_outer_paragraph_text)


                First paragraph.
            

                Second paragraph.
            


                First outer paragraph.
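
Two details of class matching are easy to trip over: an element can carry several classes at once, and class names are compared case-sensitively. The sketch below shows both on a small inline snippet (made up for illustration, reusing the class names from this section).

```python
from bs4 import BeautifulSoup

html = """<div>
    <p class="inner-text first-item">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# class can hold several values, so Beautiful Soup stores it as a list.
first_classes = soup.p["class"]

# class_ matches an element if any one of its classes equals the string.
inner_texts = [p.text for p in soup.find_all("p", class_="inner-text")]

# Matching is case-sensitive: "Inner-text" is treated as a different class.
wrong_case = soup.find_all("p", class_="Inner-text")

print(first_classes, inner_texts, len(wrong_case))
```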
            



Using CSS Selectors

We can use BeautifulSoup's .select method to work with CSS selectors. Here's the HTML we'll be working with in this section:

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">First paragraph.</p>
            <p class="inner-text">Second paragraph.</p>
        </div>
        <div>
            <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
            <p class="outer-text"><b>Second outer paragraph.</b></p>
        </div>    
    </body>
</html>

You may have noticed that the same element can have both an ID and a class. We can also assign multiple classes to a single element; we just separate the classes with a space.

In [7]:
# Get the website that contains classes and IDs.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Select all of the elements that have the first-item class.
# In a CSS selector, a leading . matches a class.
first_items = parser.select(".first-item")
# Print the text of the first paragraph (the first element with the first-item class).
print(first_items[0].text)
# Select all of the elements that have the class outer-text.
outer_items = parser.select(".outer-text")
first_outer_text = outer_items[0].text
print(first_outer_text)
# In a CSS selector, a leading # matches an ID.
# Select all of the elements that have the ID second.
second_items = parser.select("#second")
second_text = second_items[0].text
print(second_text)


                First paragraph.
            


                First outer paragraph.
            



                First outer paragraph.
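
Selectors can also be combined on a single element, and select_one saves the [0] index when only the first match is needed. The following is a minimal sketch on an inline snippet modeled on this section's HTML (the snippet itself is an assumption, so it runs offline).

```python
from bs4 import BeautifulSoup

html = """<body>
    <div>
        <p class="inner-text first-item" id="first">First paragraph.</p>
        <p class="inner-text">Second paragraph.</p>
    </div>
    <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
    <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# select always returns a list; select_one returns the first match or None.
by_id = soup.select_one("#second").text
no_match = soup.select_one("#third")

# Writing selectors with no space between them requires all of them to
# hold for the same element: a p tag with both classes.
outer_first = [p.text for p in soup.select("p.outer-text.first-item")]

# A tag name plus one class.
inner = [p.text for p in soup.select("p.inner-text")]

print(by_id, no_match, outer_first, inner)
```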
            



Nesting CSS Selectors

We can nest CSS selectors in much the same way that HTML nests tags. For example, we could use selectors to find all of the paragraphs inside the body tag. Nesting is a very powerful technique that enables us to use CSS to do complex Web scraping tasks.
This selector will target any paragraph inside a div tag:

div p

This selector will target any item inside a div tag that has the class first-item:

div .first-item

This one is even more specific. It selects any item that's inside a div tag inside a body tag, but only if it also has the ID first:

body div #first

This selector zeroes in on any items with the ID first that are inside any items with the class first-item:

.first-item #first

As you can see, we can combine nested CSS selectors in many different ways. This allows us to extract data from websites with complex layouts. You can test selectors by using the .select method as you write them. Because it's easy to write a selector that doesn't work the way you expect, we highly recommend doing this.
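
As a quick way to test selectors while writing them, the sketch below runs the four selectors from this section against an inline copy of the IDs-and-classes HTML and counts the matches. The inline string is an assumption modeled on the listing shown earlier, with both pairs of paragraphs inside a div.

```python
from bs4 import BeautifulSoup

html = """<html><body>
    <div>
        <p class="inner-text first-item" id="first">First paragraph.</p>
        <p class="inner-text">Second paragraph.</p>
    </div>
    <div>
        <p class="outer-text first-item" id="second">First outer paragraph.</p>
        <p class="outer-text">Second outer paragraph.</p>
    </div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

selectors = ["div p", "div .first-item", "body div #first", ".first-item #first"]

# Count how many elements each selector matches.
match_counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in match_counts.items():
    print(f"{selector!r}: {count} match(es)")

# Without the space, both conditions apply to the same element.
same_element = soup.select(".first-item#first")
```

On this snippet the last nested selector matches nothing: the element with the ID first carries the first-item class itself, rather than sitting inside another element that has it. Removing the space, as in .first-item#first, asks for both conditions on one element and does match.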

Now that we know about nested CSS selectors, let's try them out. We can use them with the same .select method we used for our CSS selectors.

We'll be practicing on this HTML:

<html><head lang="en">
        <meta charset="UTF-8">
        <title>2014 Superbowl Team Stats</title>
    </head>
    <body>

        <table class="stats_table nav_table" id="team_stats">
            <tbody>
                <tr id="teams">
                    <th></th>
                    <th>SEA</th>
                    <th>NWE</th>
                </tr>
                <tr id="first-downs">
                    <td>First downs</td>
                    <td>20</td>
                    <td>25</td>
                </tr>
                <tr id="total-yards">
                    <td>Total yards</td>
                    <td>396</td>
                    <td>377</td>
                </tr>
                <tr id="turnovers">
                    <td>Turnovers</td>
                    <td>1</td>
                    <td>2</td>
                </tr>
                <tr id="penalties">
                    <td>Penalties-yards</td>
                    <td>7-70</td>
                    <td>5-36</td>
                </tr>
                <tr id="total-plays">
                    <td>Total Plays</td>
                    <td>53</td>
                    <td>72</td>
                </tr>
                <tr id="time-of-possession">
                    <td>Time of Possession</td>
                    <td>26:14</td>
                    <td>33:46</td>
                </tr>
            </tbody>
        </table>

    
</body></html>

Rendered in a browser, the table looks like this:

                      SEA     NWE
First downs           20      25
Total yards           396     377
Turnovers             1       2
Penalties-yards       7-70    5-36
Total Plays           53      72
Time of Possession    26:14   33:46

In [8]:
# Get the Super Bowl box score data.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Find the number of turnovers the Seahawks committed.
# In each row, td index 0 is the label, 1 is SEA, and 2 is NWE.
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
print(seahawks_turnovers_count)
# Find the total plays for the New England Patriots.
total_plays = parser.select("#total-plays")[0]
nwe_plays = total_plays.select("td")[2]
patriots_total_plays_count = nwe_plays.text
print(patriots_total_plays_count)
# Find the total yards for the Seahawks.
total_yards = parser.select("#total-yards")[0]
sea_yards = total_yards.select("td")[1]
seahawks_total_yards_count = sea_yards.text
print(seahawks_total_yards_count)

1
72
396
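
The cell above selects a row first and then its cells. The same lookup can be written as one nested selector, and the whole table can be collected into a dictionary in a short loop. The sketch below does both on an inline excerpt of the table (only the first few rows, copied from the HTML above); the dictionary layout is just one possible choice.

```python
from bs4 import BeautifulSoup

html = """<table id="team_stats">
    <tr id="teams"><th></th><th>SEA</th><th>NWE</th></tr>
    <tr id="first-downs"><td>First downs</td><td>20</td><td>25</td></tr>
    <tr id="total-yards"><td>Total yards</td><td>396</td><td>377</td></tr>
    <tr id="turnovers"><td>Turnovers</td><td>1</td><td>2</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# One nested selector: every td inside the element with the ID turnovers.
seahawks_turnovers = soup.select("#turnovers td")[1].text

# Column labels come from the header row; its first cell is empty.
teams = [th.text for th in soup.select("#teams th")][1:]

stats = {}
for row in soup.select("#team_stats tr"):
    cells = [td.text for td in row.select("td")]
    if not cells:
        continue  # The header row has th cells only.
    # Map the row label to {team: value}.
    stats[cells[0]] = dict(zip(teams, cells[1:]))

print(seahawks_turnovers)
print(stats["Total yards"])
```

The values stay as strings, as they appear in the page, so they would need converting before any numeric analysis.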
