Web scraping is the process of gathering information from the Internet.

The Python libraries requests and Beautiful Soup are powerful tools for the job.

In [None]:
import requests

aphrodite_URL = 'http://olympus.realpython.org/profiles/aphrodite'
page = requests.get(aphrodite_URL)
print(page.status_code)

After running our request, we get a Response object. This object has a status_code attribute, which indicates whether the page was downloaded successfully; a value of 200 means success.
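As a minimal sketch of acting on the status code, we can let requests raise an error for failed downloads instead of inspecting the number by hand. The helper names ensure_ok and fetch_page, and the 10-second timeout, are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged if its status code signals success.

    raise_for_status() raises requests.HTTPError for 4xx and 5xx codes.
    """
    response.raise_for_status()
    return response


def fetch_page(url, timeout=10):
    """Download a page and fail loudly on HTTP errors (hypothetical helper)."""
    return ensure_ok(requests.get(url, timeout=timeout))
```

Calling fetch_page(aphrodite_URL) would then either return the Response or raise an exception, so later cells never parse an error page by accident.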

In [7]:
print(page.content)

b'<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n<body bgcolor="yellow">\n<center>\n<br><br>\n<img src="/static/aphrodite.gif" />\n<h2>Name: Aphrodite</h2>\n<br><br>\nFavorite animal: Dove\n<br><br>\nFavorite color: Red\n<br><br>\nHometown: Mount Olympus\n</center>\n</body>\n</html>\n'


We can use the Beautiful Soup library to parse this document and extract its text.

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print the HTML content of the page, nicely formatted, using the prettify method of the BeautifulSoup object.

In [9]:
print(soup.prettify())

<html>
 <head>
  <title>
   Profile: Aphrodite
  </title>
 </head>
 <body bgcolor="yellow">
  <center>
   <br/>
   <br/>
   <img src="/static/aphrodite.gif"/>
   <h2>
    Name: Aphrodite
   </h2>
   <br/>
   <br/>
   Favorite animal: Dove
   <br/>
   <br/>
   Favorite color: Red
   <br/>
   <br/>
   Hometown: Mount Olympus
  </center>
 </body>
</html>



In [None]:
soup.children

list(soup.children)

children returns an iterator rather than a list, so we need to call the list function on it to see the items.
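To see why the list call matters, here is a small sketch on a made-up one-line document (the HTML string and the name demo_soup are assumptions for illustration). It shows that the object returned by children can be consumed only once, while asking for children again starts over.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration
demo_soup = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

children_iter = demo_soup.children
first_child = next(children_iter)   # consume the only top-level node
print(first_child.name)

leftover = list(children_iter)      # the iterator is now exhausted
print(leftover)

# Asking for .children again gives a fresh iterator over the same nodes
all_children = list(demo_soup.children)
print(len(all_children))
```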

The Tag object allows us to navigate through an HTML document, and extract other tags and text.

In [13]:
[type(item) for item in list(soup.children)]

[bs4.element.Tag, bs4.element.NavigableString]
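The NavigableString items are pieces of text, often just the newlines between tags. A minimal sketch of separating the two kinds of children with isinstance follows; the HTML string and variable names are made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical snippet: two paragraphs separated by newlines
demo_soup = BeautifulSoup("<div>\n<p>One</p>\n<p>Two</p>\n</div>", "html.parser")
div = demo_soup.div

# Keep only real tags, skipping the whitespace text nodes between them
tags_only = [child for child in div.children if isinstance(child, Tag)]
text_nodes = [child for child in div.children if isinstance(child, NavigableString)]

print([tag.name for tag in tags_only])
print(len(text_nodes))
```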

We can now select the html tag by taking the first item in the list, and then find the children inside it.

In [None]:
html = list(soup.children)[0]

print(html)

In [None]:
list(html.children)

In [None]:
body = list(html.children)[3]
print(body.get_text())

print(body)
list(body.children)
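Picking the body with index 3 works only because of where the newline text nodes happen to sit. As a sketch of a less position-dependent route, Beautiful Soup also lets us reach a tag by name as an attribute. The document below is a made-up page shaped like the profile page, not the real one.

```python
from bs4 import BeautifulSoup

# Hypothetical page with the same layout as the profile page above
html_doc = """<html>
<head>
<title>Profile: Demo</title>
</head>
<body>
<h2>Name: Demo</h2>
Favorite color: Red
</body>
</html>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Positional navigation, as in the cells above
html_tag = list(demo_soup.children)[0]
body_by_index = list(html_tag.children)[3]

# Navigation by tag name: returns the first <body> tag in the document
body_by_name = demo_soup.body

print(body_by_index is body_by_name)
print(body_by_name.get_text())
```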

To find the first instance of a tag, you can use the find method, which returns a single Tag object (or None if nothing matches).

In [None]:
soup.find('h2')
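Because find returns None when no tag matches, calling get_text on the result can fail. The sketch below guards against that; first_text is a hypothetical helper name and the one-line document is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document
demo_soup = BeautifulSoup("<h2>Name: Demo</h2>", "html.parser")


def first_text(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or default."""
    tag = soup.find(tag_name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


print(first_text(demo_soup, "h2"))
print(first_text(demo_soup, "h3", default="n/a"))
```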

find_all returns a list-like ResultSet, so we have to loop through it or use list indexing to extract text:

In [None]:
soup.find_all('h2')[0].get_text()

In [27]:
def getAndParseURL(url):
    # Download the page and parse its HTML with the built-in html.parser
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return soup

In [34]:
dq_ids_classes_url = "http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html"

simple_ex_soup = getAndParseURL(dq_ids_classes_url)
simple_ex_soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Here too, find_all returns a list, so we have to loop through it or use list indexing to extract text:

In [None]:
simple_ex_soup.find_all('p')[0].get_text()

for x in simple_ex_soup.find_all("p"):
    print(x.get_text())

simple_ex_soup.find_all('p', class_='outer-text')

In [None]:
simple_ex_soup.find_all(class_='outer-text')

In [None]:
simple_ex_soup.find_all(id="first")
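The searches above can be tried without a network connection. The sketch below uses a condensed copy of the example page as an inline string (page_html and demo_soup are names assumed for this example) and shows that class_ matches a tag when any one of its classes equals the given name.

```python
from bs4 import BeautifulSoup

# Condensed inline copy of the example page, so no download is needed
page_html = """
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo_soup = BeautifulSoup(page_html, "html.parser")

# Both outer paragraphs match, including the one that has two classes
outer = demo_soup.find_all("p", class_="outer-text")
print([p.get("id") for p in outer])

# The class attribute itself is stored as a list of class names
print(outer[0]["class"])

# Searching by another class, and by id
first_items = demo_soup.find_all(class_="first-item")
by_id = demo_soup.find_all(id="first")
print([p["id"] for p in first_items])
print(by_id[0].get_text())
```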

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

In [None]:
simple_ex_soup.select("div p")
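A few more selectors can be sketched on a condensed inline copy of the same page (page_html and demo_soup are names assumed for this example). Alongside select, the select_one method returns only the first match, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Condensed inline copy of the example page, so no download is needed
page_html = """
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo_soup = BeautifulSoup(page_html, "html.parser")

# p tags nested inside a div
in_div = demo_soup.select("div p")
print([p.get_text() for p in in_div])

# p tags carrying the outer-text class
outer = demo_soup.select("p.outer-text")
print(len(outer))

# A single element looked up by id; select_one returns None if nothing matches
first = demo_soup.select_one("#first")
missing = demo_soup.select_one("table")
print(first.get_text())
print(missing)
```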