# Web Scraping

Now that we have looked at some HTML tags, let's explore how we can actually perform web scraping.


<div style="text-align: center"><h3>The Reality of Scraping</h3><img src="images/scraping_meme.png" style="width: 600px"></div>

## Why do we scrape the web?

* Realistically, data that you want to study won't always be available to you in the form of a curated data set.
* Often you need to go to the internet to find interesting data:
    * From an existing company
    * Text for NLP
    * Images


# Scraping from a Web Page with Python

Scraping a website comes down to two tasks: making a **request from Python** and **parsing the HTML** that each page returns. Each task has its own Python library: **`requests`** for the request and **`bs4`** (Beautiful Soup) for the parsing.
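As a rough sketch of how the two tasks fit together, the snippet below separates downloading from parsing. The function names `fetch_html` and `page_title`, the `timeout` value, and the placeholder URL are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def page_title(html):
    """Parse an HTML string and return the text of its <title> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title is not None else None


# Parsing works on any HTML string, so it can be tried without a network call.
print(page_title("<html><head><title>Demo</title></head></html>"))

# With network access the two steps chain together (the URL is a placeholder):
# print(page_title(fetch_html("https://example.com")))
```

Keeping the request and the parsing in separate functions makes it possible to test the parsing on a saved or hand-written HTML string, which is what the rest of this section does.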

### Getting Info from a Web Page

Suppose that we have access to the HTML for a web page. We need **some way to pull the desired content from it**. Luckily, there is already a system in place to do this: using HTML tags, we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

To try it out, let's define some sample HTML code.

In [1]:
from IPython.display import HTML

sample_html = '''<!DOCTYPE html>
<html>
<head>
<title>My Holiday Photos</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.jpg" width = "212" height = "107">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.jpg" alt="Venice" width = "212" height = "107"> <br />
<img src="venice2.jpg" alt="Venice" width = "212" height = "107"> <br />
<img src="rome.jpg" alt="Roma" width = "212" height = "107">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.jpg" alt="Berlin" width = "212" height = "107">
</div>
</body>
</html>
'''
HTML(sample_html)

## Beautiful Soup

First we have to import the `BeautifulSoup` class from the `bs4` package.

In [2]:
from bs4 import BeautifulSoup

# we create a soup object with the sample_html:
soup = BeautifulSoup(sample_html, 'html.parser')
type(soup)

bs4.BeautifulSoup

We can see that `soup` is a `BeautifulSoup` object. This represents the entire parsed document that we can search through.



We can use the `prettify()` method to print the HTML document tree.

In [3]:
# The prettify method returns the HTML DOM tree as a string, which shows
# there are actually many newlines '\n' in the document.
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <title>\n   My Holiday Photos\n  </title>\n </head>\n <body>\n  <h1>\n   My Photos\n  </h1>\n  <div class="intro">\n   <p>\n    These are some photos of my trips.\n   </p>\n   <img height="107" src="me.jpg" width="212"/>\n  </div>\n  <h3>\n   Italy\n  </h3>\n  <div class="country">\n   <img alt="Venice" height="107" src="venice1.jpg" width="212"/>\n   <br/>\n   <img alt="Venice" height="107" src="venice2.jpg" width="212"/>\n   <br/>\n   <img alt="Roma" height="107" src="rome.jpg" width="212"/>\n  </div>\n  <h3>\n   Germany\n  </h3>\n  <div class="country">\n   <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>\n  </div>\n </body>\n</html>\n'

Printing the result displays the HTML in indented form.

In [None]:
print(soup.prettify())

## Tag objects

Another type of object in `BeautifulSoup` is a `Tag` object. This corresponds to an HTML tag in the document. For example, we can access the `<title>` tag in the original HTML document.

In [None]:
title_tag = soup.title
print(title_tag)
print(type(title_tag))

In [None]:
title_tag.name

In [None]:
title_tag.string

Using the tag name returns the first tag that matches. For example, there are many `<img>` tags, but the first one will be returned.

In [None]:
soup.img

## Attributes

We can see from the output that the `<img>` tag contains a few attributes. We can access these attributes using `attrs`:

In [None]:
soup.img.attrs

It looks like the attributes are organised in a dictionary, so we can access the values by using the keys:

In [None]:
# How can we access the file name of the image?
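One possible approach, shown here on a one-tag stand-in for `sample_html` (the short `html` string is an assumption made to keep the example self-contained), is to read the `src` key either from `attrs` or directly from the tag:

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the first <img> tag in sample_html.
html = '<div><img src="me.jpg" width="212" height="107"></div>'
soup = BeautifulSoup(html, "html.parser")
img = soup.img

print(img.attrs["src"])  # look the value up in the attrs dictionary
print(img["src"])        # shorthand: index the tag like a dictionary
print(img.get("alt"))    # .get() returns None when the attribute is absent
```

Indexing with `img["alt"]` would raise a `KeyError` here because this tag has no `alt` attribute, so `.get()` is the safer choice when an attribute may be missing.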


## NavigableString objects

The text within a tag is returned by Beautiful Soup as a `NavigableString` object.


In [None]:
soup.p.string

In [None]:
info = soup.p.string
type(info)


In [None]:
# How are NavigableString methods different from those in normal Python Strings?
dir(info)
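One way to explore the question above is to compare the names available on a `NavigableString` with those on a plain `str`. This is a small sketch using a single-paragraph stand-in for `sample_html`; the variable `extra` is introduced only for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>These are some photos of my trips.</p>", "html.parser")
info = soup.p.string

# A NavigableString is a subclass of str, so ordinary string methods work.
print(isinstance(info, NavigableString), isinstance(info, str))
print(info.upper())

# It also knows where it sits in the tree, which a plain str does not.
print(info.parent.name)

# Names that exist on the NavigableString but not on str.
extra = sorted(set(dir(info)) - set(dir(str)))
print(extra)
```

The extra names are mostly navigation and search helpers, which is what makes the string "navigable".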

## Quick Exercises

In [None]:
# Get the first <h1> tag


In [None]:
# Get the first <h3> tag


In [None]:
# Get the first <h2> tag


In [None]:
# Get the first <div> tag


What did you notice about the values that are returned?
- if a tag contains other tags
- if a tag does not exist
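The two cases can be checked with a short sketch. The `html` string below is a trimmed stand-in for `sample_html`, and the fallback message is an illustrative assumption.

```python
from bs4 import BeautifulSoup

html = """<body>
<h1>My Photos</h1>
<div class="intro"><p>These are some photos of my trips.</p></div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# A tag that contains other tags is returned whole, nested tags included.
print(soup.div)

# A tag that does not exist in the document is returned as None.
print(soup.h2)

# Guard against None before reading attributes of a tag that may be missing.
h2 = soup.h2
heading = h2.string if h2 is not None else "(no <h2> found)"
print(heading)
```

Without the guard, `soup.h2.string` would raise an `AttributeError`, because `None` has no `string` attribute.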

## Navigating the HTML Tree

Since the HTML document is organized as a tree, we can navigate it. You can copy and paste the sample HTML code into the [live DOM viewer](https://software.hixie.ch/utilities/js/live-dom-viewer/) to see it better.

We can navigate the `BeautifulSoup` and `Tag` objects using the tag names.

In [None]:
# Get the first <div> tag
intro_tag = soup.div 

# Then get the <img> tag within this 
intro_tag.img

In [None]:
# or just get the img tag within the div tag
soup.div.img

## Getting all tags

As you can see, using the tag name only returns the first matching tag. To find _all_ matching tags, use the `find_all()` method, which returns them as a list.

In [None]:
soup.find_all('h3')

In [None]:
# Getting the text from the 2nd item in the list


In [None]:
# Find all using an attribute.
# Note that 'class' is a reserved keyword in Python, so to filter on this
# attribute we pass class_ (with a trailing underscore) to find_all()
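A minimal sketch of filtering by class, assuming a trimmed stand-in for `sample_html`. It also shows the `attrs` dictionary form, which is an alternative spelling that avoids the keyword clash.

```python
from bs4 import BeautifulSoup

html = """
<div class="intro"><p>These are some photos of my trips.</p></div>
<div class="country"><img src="rome.jpg" alt="Roma"></div>
<div class="country"><img src="berlin.jpg" alt="Berlin"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute.
by_keyword = soup.find_all("div", class_="country")

# The attrs dictionary is an equivalent way to express the same filter.
by_dict = soup.find_all("div", attrs={"class": "country"})

print(len(by_keyword))
print([div.img["alt"] for div in by_keyword])
```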


In [None]:
# Find all elements with tag <img>

# print the number of <img> tags found


In [None]:
# Print the first two tags



In [None]:
# Can you loop through the img_tags?


In [None]:
# Find all <img> tags with an alt attribute


In [None]:
# Find all image tags with attribute 'alt' equal to 'Venice'
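The `<img>` exercises above could be approached along these lines. This is a sketch on a trimmed copy of `sample_html`, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="intro"><img src="me.jpg"></div>
<div class="country">
<img src="venice1.jpg" alt="Venice">
<img src="venice2.jpg" alt="Venice">
<img src="rome.jpg" alt="Roma">
</div>
<div class="country"><img src="berlin.jpg" alt="Berlin"></div>
"""
soup = BeautifulSoup(html, "html.parser")

img_tags = soup.find_all("img")                # every <img> tag
with_alt = soup.find_all("img", alt=True)      # only those that have an alt attribute
venice = soup.find_all("img", alt="Venice")    # only those whose alt equals 'Venice'

print(len(img_tags), len(with_alt), len(venice))

# find_all() returns a list-like result, so it can be sliced and looped over.
for img in venice:
    print(img["src"])
```

Passing `True` as an attribute value matches any tag that has the attribute, whatever its value.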


### If I wanted to get a list of all of the countries visited, how would I do it?

In [None]:
# A:
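One possible answer, assuming (as in `sample_html`) that each country name sits in its own `<h3>` tag. The `html` string is a trimmed stand-in for the full sample.

```python
from bs4 import BeautifulSoup

html = """
<h1>My Photos</h1>
<h3>Italy</h3>
<div class="country"><img src="rome.jpg" alt="Roma"></div>
<h3>Germany</h3>
<div class="country"><img src="berlin.jpg" alt="Berlin"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h3> heading, stripped of surrounding whitespace.
countries = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
print(countries)  # ['Italy', 'Germany']
```

This relies on the page using `<h3>` only for country names. On a real site the headings would need checking first.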


## Navigating Nested Tags


We can use `.contents`, `.children` and `.descendants` to get nested tags.



In [None]:
header = soup.head
print(header)

In [None]:
header.contents

In [None]:
for c in header.children:
    print(c)

In [None]:
# find all <div> tags, then show the contents of the second one.



### Descendants

`.children` iterates over only a tag's direct children (nested tags and strings one level down), while `.descendants` iterates recursively over everything nested inside the tag: its children, their children, and so on.
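The difference is easiest to see on a tag with two levels of nesting. The `html` string below is a small illustrative example rather than part of `sample_html`, and strings are filtered out so that only tag names are compared.

```python
from bs4 import BeautifulSoup, Tag

html = '<div class="intro"><p>These are <b>some</b> photos.</p><img src="me.jpg"></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Keep only Tag objects so the text nodes do not clutter the comparison.
children = [c.name for c in div.children if isinstance(c, Tag)]
descendants = [d.name for d in div.descendants if isinstance(d, Tag)]

print(children)     # ['p', 'img']
print(descendants)  # ['p', 'b', 'img']
```

The `<b>` tag shows up only in `descendants`, because it is nested inside `<p>` rather than directly inside the `<div>`.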

In [None]:
soup.div

In [None]:
soup.div.contents

In [None]:
[(c,type(c)) for c in soup.div.children]

In [None]:
# .descendants also retrieves the children of each child, recursively
[(c,type(c)) for c in soup.div.descendants]

In [None]:
# To just obtain the strings 
[s for s in soup.strings]

In [None]:
# To just obtain the strings without extra whitespace
[s for s in soup.stripped_strings]

## Parents

The `.parent` attribute returns the tag that directly encloses an element.


In [None]:
# For example, if we get the first <img> tag
soup.img

In [None]:
# Find its parent

soup.img.parent

In [None]:
# Find all its parents
[p.name for p in soup.img.parents]

## Siblings

Similarly, we can get the siblings of a tag, which are those on the same level:

In [None]:
header3 = soup.h3
header3.next_sibling

As we can see, the immediate sibling is actually a newline string. To get every sibling that follows on the same level, we can use `next_siblings`, which returns a generator:

In [None]:
header3.next_siblings


In [None]:
# Ok, let's list them.
[s for s in header3.next_siblings]

In [None]:
# What about the previous sibling(s) of header 3?
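A sketch of looking backwards from the first `<h3>`, using a trimmed stand-in for `sample_html`. Because whitespace between tags counts as a sibling, `find_previous_sibling()` and `find_next_sibling()` with a tag name are a convenient way to skip over the newline strings.

```python
from bs4 import BeautifulSoup, Tag

html = """<body>
<h1>My Photos</h1>
<div class="intro"><p>These are some photos of my trips.</p></div>
<h3>Italy</h3>
<div class="country"><img src="rome.jpg" alt="Roma"></div>
</body>"""
soup = BeautifulSoup(html, "html.parser")
header3 = soup.h3

# The immediate previous sibling is the newline between </div> and <h3>.
print(repr(header3.previous_sibling))

# Searching by tag name skips the whitespace strings.
print(header3.find_previous_sibling("div")["class"])  # ['intro']
print(header3.find_next_sibling("div")["class"])      # ['country']

# previous_siblings walks backwards, nearest sibling first.
previous_tags = [s.name for s in header3.previous_siblings if isinstance(s, Tag)]
print(previous_tags)  # ['div', 'h1']
```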


For more examples and shortcuts, refer to the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Comment).