# Web Scraping

Now that we have looked at some HTML tags, let's explore how we can actually perform web scraping.


<div style="text-align: center"><h3>The Reality of Scraping</h3><img src="images/scraping_meme.png" style="width: 600px"></div>

## Why do we scrape the web?

* Realistically, the data you want to study won't always be available as a curated data set.
* Often you need to go to the internet to find interesting data:
    * From an existing company
    * Text for NLP
    * Images


# Scraping from a Web Page with Python

Scraping a web site basically comes down to making a **request from Python and parsing through the HTML** that is returned from each page. For each of these tasks we have a Python library, **`requests` and `bs4`**, respectively.
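As a rough sketch of those two steps, the cell below separates the download from the parsing. The helper names `fetch_html` and `page_title` and the URL `https://example.com` are placeholders made up for illustration, not part of either library. The parsing step is checked offline on a tiny hand-written page, so nothing is requested from the network unless you uncomment the last line.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def page_title(html):
    """Parse HTML and return the text of its <title> tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


# Offline check of the parsing step with a tiny hand-written page.
demo_html = "<html><head><title>Demo</title></head><body></body></html>"
demo_title = page_title(demo_html)
print(demo_title)

# With network access (the URL is only a placeholder):
# print(page_title(fetch_html("https://example.com")))
```

Keeping the two steps in separate functions makes it easy to test the parsing on saved HTML before sending requests to a real site.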

### Getting Info from a Web Page

Suppose that we have access to the HTML for a web page. We then need **some way to pull the desired content from it**. Luckily, there is already a system in place to do this: using HTML tags, we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

To try it out, let's define some sample HTML code.

In [1]:
from IPython.display import HTML

sample_html = '''<!DOCTYPE html>
<html>
<head>
<title>My Holiday Photos</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.jpg" width = "212" height = "107">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.jpg" alt="Venice" width = "212" height = "107"> <br />
<img src="venice2.jpg" alt="Venice" width = "212" height = "107"> <br />
<img src="rome.jpg" alt="Roma" width = "212" height = "107">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.jpg" alt="Berlin" width = "212" height = "107">
</div>
</body>
</html>
'''
HTML(sample_html)

## Beautiful Soup

First we have to import the `BeautifulSoup` class from the `bs4` package.

In [2]:
from bs4 import BeautifulSoup

# we create a soup object with the sample_html:
soup = BeautifulSoup(sample_html, 'html.parser')
type(soup)

bs4.BeautifulSoup

We can see that `soup` is a `BeautifulSoup` object. This represents the entire parsed document that we can search through.



We can use the `prettify()` method to print the HTML document tree.

In [3]:
# The prettify method returns the HTML DOM tree as a string, which shows
# there are actually many newlines '\n' in the document.
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <title>\n   My Holiday Photos\n  </title>\n </head>\n <body>\n  <h1>\n   My Photos\n  </h1>\n  <div class="intro">\n   <p>\n    These are some photos of my trips.\n   </p>\n   <img height="107" src="me.jpg" width="212"/>\n  </div>\n  <h3>\n   Italy\n  </h3>\n  <div class="country">\n   <img alt="Venice" height="107" src="venice1.jpg" width="212"/>\n   <br/>\n   <img alt="Venice" height="107" src="venice2.jpg" width="212"/>\n   <br/>\n   <img alt="Roma" height="107" src="rome.jpg" width="212"/>\n  </div>\n  <h3>\n   Germany\n  </h3>\n  <div class="country">\n   <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>\n  </div>\n </body>\n</html>\n'

Printing the result shows the HTML in indented form.

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   My Holiday Photos
  </title>
 </head>
 <body>
  <h1>
   My Photos
  </h1>
  <div class="intro">
   <p>
    These are some photos of my trips.
   </p>
   <img height="107" src="me.jpg" width="212"/>
  </div>
  <h3>
   Italy
  </h3>
  <div class="country">
   <img alt="Venice" height="107" src="venice1.jpg" width="212"/>
   <br/>
   <img alt="Venice" height="107" src="venice2.jpg" width="212"/>
   <br/>
   <img alt="Roma" height="107" src="rome.jpg" width="212"/>
  </div>
  <h3>
   Germany
  </h3>
  <div class="country">
   <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>
  </div>
 </body>
</html>



## Tag objects

Another type of object in `BeautifulSoup` is the `Tag` object, which corresponds to an HTML tag in the document. For example, we can access the `<title>` tag in the original HTML document.

In [5]:
title_tag = soup.title
print(title_tag)
print(type(title_tag))

<title>My Holiday Photos</title>
<class 'bs4.element.Tag'>


In [6]:
title_tag.name

'title'

In [7]:
title_tag.string

'My Holiday Photos'

Using the tag name returns the first tag that matches. For example, there are many `<img>` tags, but the first one will be returned.

In [8]:
soup.img

<img height="107" src="me.jpg" width="212"/>

## Attributes

We can see that the `<img>` tag contains a few attributes. We can access these attributes using `attrs`:

In [9]:
soup.img.attrs

{'src': 'me.jpg', 'width': '212', 'height': '107'}

It looks like the attributes are organised in a dictionary, so we can access the values by using the keys:

In [10]:
# How can we access the file name of the image?
soup.img['src']

'me.jpg'
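Because a tag behaves like a dictionary, asking for a missing key with square brackets raises a `KeyError`. The sketch below, which uses a small made-up snippet rather than the full sample page, shows the safer `.get()` and `.has_attr()` alternatives for attributes that may be absent, such as `alt` on the first image.

```python
from bs4 import BeautifulSoup

# Small illustrative snippet: the first image has no alt attribute.
html = '<div><img src="me.jpg" width="212"><img src="rome.jpg" alt="Roma"></div>'
soup = BeautifulSoup(html, "html.parser")
first_img = soup.img

src = first_img["src"]                                   # present, so indexing works
alt = first_img.get("alt")                               # missing, so .get() returns None
alt_or_default = first_img.get("alt", "no description")  # missing, so the default is used
has_alt = first_img.has_attr("alt")                      # explicit membership check

# Square-bracket access on a missing attribute raises KeyError.
try:
    first_img["alt"]
    raised = False
except KeyError:
    raised = True

print(src, alt, alt_or_default, has_alt, raised)
```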

## NavigableString objects

The text within a tag is returned by Beautiful Soup as a `NavigableString` object.


In [11]:
soup.p.string

'These are some photos of my trips.'

In [12]:
info = soup.p.string
type(info)


bs4.element.NavigableString
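A `NavigableString` is a subclass of Python's `str`, so the usual string methods work on it, and it additionally knows where it sits in the document tree. The short check below uses a one-paragraph snippet made up for illustration and compares the two sets of attribute names with `dir()`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>These are some photos of my trips.</p></div>", "html.parser")
info = soup.p.string

is_str = isinstance(info, str)   # NavigableString inherits from str
upper = info.upper()             # ordinary str methods return a plain str
parent_name = info.parent.name   # tree navigation: the tag that holds this text
plain = str(info)                # convert to a plain str, detached from the tree

# Names available on the NavigableString but not on a plain str
extra = sorted(set(dir(info)) - set(dir(str)))

print(is_str, type(upper).__name__, parent_name)
print(len(extra), "extra names, for example find_next_sibling:", "find_next_sibling" in extra)
```

Converting with `str()` is handy when you want to keep the text but not a reference to the whole parsed tree.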

In [16]:
name='Python'
dir(name)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',


In [13]:
# How are NavigableString methods different from those in normal Python Strings?
dir(info)

['PREFIX',
 'SUFFIX',
 '__add__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_strings',
 '_find_all',
 '_find_one',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 'append',
 'capitalize',
 'casefold',
 'center',
 'count',
 'decomposed',
 'default',
 'encode',
 'endswith',
 'expandtabs',
 'extend',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAllNext',
 'findAllPrevious',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findP

## Quick Exercises

In [17]:
# Get the first <h1> tag
soup.h1

<h1>My Photos</h1>

In [18]:
# Get the first <h3> tag
soup.h3

<h3>Italy</h3>

In [19]:
# Get the first <h2> tag
soup.h2

In [20]:
# Get the first <div> tag
soup.div

<div class="intro">
<p>These are some photos of my trips.</p>
<img height="107" src="me.jpg" width="212"/>
</div>

What did you notice about the values that are returned when:
- a tag contains other tags?
- a tag does not exist?
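One observation worth spelling out: a tag that does not exist comes back as `None`, so chaining another attribute onto it fails with an `AttributeError`. The sketch below uses a tiny made-up page and a hypothetical helper, `heading_text`, to show one way of guarding against that.

```python
from bs4 import BeautifulSoup

html = "<body><h1>My Photos</h1><div><p>Trips</p></div></body>"
soup = BeautifulSoup(html, "html.parser")

nested = soup.div   # the whole <div>, including the <p> nested inside it
missing = soup.h2   # there is no <h2>, so this is None


def heading_text(soup, name):
    """Return the text of the first tag called `name`, or None if absent."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else None


print(nested)
print(missing)
print(heading_text(soup, "h1"), heading_text(soup, "h2"))
```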

## Navigating the HTML Tree

Since the HTML document is organized as a tree, we can navigate it. You can copy and paste the sample HTML code into the [live DOM viewer](https://software.hixie.ch/utilities/js/live-dom-viewer/) to see it better.

We can navigate the `BeautifulSoup` and `Tag` objects using the tag names.

In [21]:
# Get the first <div> tag
intro_tag = soup.div 

# Then get the <img> tag within this 
intro_tag.img

<img height="107" src="me.jpg" width="212"/>

In [22]:
# or just get the img tag within the div tag
soup.div.img

<img height="107" src="me.jpg" width="212"/>

## Getting all tags

As you can see, using the tag name only returns the first tag. If you want to find _all_ tags, you have to use the `find_all()` method. 

In [35]:
soup.find_all('h3')

[<h3>Italy</h3>, <h3>Germany</h3>]

In [38]:
# Getting the text from the 2nd item in the list
soup.find_all('h3')[1].string

'Germany'

In [25]:
# Find all using an attribute.
# Note that 'class' is a reserved keyword in Python, so for this attribute
# Beautiful Soup uses the keyword argument class_ (with a trailing underscore)
soup.find_all('div', class_='country')

[<div class="country">
 <img alt="Venice" height="107" src="venice1.jpg" width="212"/> <br/>
 <img alt="Venice" height="107" src="venice2.jpg" width="212"/> <br/>
 <img alt="Roma" height="107" src="rome.jpg" width="212"/>
 </div>,
 <div class="country">
 <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>
 </div>]

In [28]:
# Find all elements with tag <img>
img_tags = soup.find_all('img')

# print the number of <img> tags found
print(len(img_tags))

5


In [None]:
# Print the first two tags

print(img_tags[0])
print(img_tags[1])

In [34]:
# Can you loop through the img_tags?
[img for img in img_tags]

[<img height="107" src="me.jpg" width="212"/>,
 <img alt="Venice" height="107" src="venice1.jpg" width="212"/>,
 <img alt="Venice" height="107" src="venice2.jpg" width="212"/>,
 <img alt="Roma" height="107" src="rome.jpg" width="212"/>,
 <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>]
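Looping over the result list is usually a means to pull one attribute out of every tag. As a small sketch on a shortened, made-up version of the sample page, the cell below collects every `src` value and pairs it with its `alt` text where one exists.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page
html = """<div>
<img src="me.jpg" width="212">
<img src="venice1.jpg" alt="Venice">
<img src="berlin.jpg" alt="Berlin">
</div>"""
soup = BeautifulSoup(html, "html.parser")
img_tags = soup.find_all("img")

# File names of all images, in document order
sources = [img["src"] for img in img_tags]

# Map each file name to its alt text, with a fallback for images that lack one
described = {img["src"]: img.get("alt", "(no alt text)") for img in img_tags}

for src, alt in described.items():
    print(src, "->", alt)
```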

In [33]:
# Find all <img> tags with an alt attribute
soup.find_all('img', alt=True)

[<img alt="Venice" height="107" src="venice1.jpg" width="212"/>,
 <img alt="Venice" height="107" src="venice2.jpg" width="212"/>,
 <img alt="Roma" height="107" src="rome.jpg" width="212"/>,
 <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>]

In [None]:
# Find all image tags with attribute 'alt' equal to 'Venice'
soup.find_all('img', alt='Venice')
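`find_all()` accepts a few other kinds of filters too. The sketch below, run on a shortened and made-up version of the sample page, shows a list of tag names, the `limit` argument, the `attrs` dictionary as an alternative to `class_`, and `find()`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page
html = """<body>
<h1>My Photos</h1>
<h3>Italy</h3>
<div class="country"><img src="venice1.jpg" alt="Venice"><img src="rome.jpg" alt="Roma"></div>
<h3>Germany</h3>
<div class="country"><img src="berlin.jpg" alt="Berlin"></div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, in document order
headings = soup.find_all(["h1", "h3"])

# limit stops the search after the given number of matches
first_two = soup.find_all("img", limit=2)

# attrs takes a dictionary, which avoids the class_ spelling
countries = soup.find_all("div", attrs={"class": "country"})

# find() returns the first match only (or None if there is none)
first_country = soup.find("div", class_="country")

print([h.get_text() for h in headings])
print([img["src"] for img in first_two])
print(len(countries))
```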

### If I wanted to get a list of all of the countries visited, how would I do it?

In [39]:
# A:
country_tags = soup.find_all('h3')

In [42]:
# get_text() returns the tag's text as a plain Python str
[c.get_text() for c in country_tags]

['Italy', 'Germany']

In [45]:
country_tags[0].get_text()

'Italy'

In [49]:
type(country_tags[0].text)

str

In [48]:
type(country_tags[0].string)

bs4.element.NavigableString

## Navigating Nested Tags


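We can use `.contents`, `.children`, and `.descendants` to get the tags (and strings) nested inside a tag. `.contents` is a list, while `.children` and `.descendants` are iterators.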



In [50]:
header = soup.head
print(header)

<head>
<title>My Holiday Photos</title>
</head>


In [51]:
header.contents

['\n', <title>My Holiday Photos</title>, '\n']

In [52]:
for c in header.children:
    print(c)



<title>My Holiday Photos</title>




In [53]:
# find all <div> tags, then show the contents of the second one.

country = soup.find_all('div')
country[1].contents

['\n',
 <img alt="Venice" height="107" src="venice1.jpg" width="212"/>,
 ' ',
 <br/>,
 '\n',
 <img alt="Venice" height="107" src="venice2.jpg" width="212"/>,
 ' ',
 <br/>,
 '\n',
 <img alt="Roma" height="107" src="rome.jpg" width="212"/>,
 '\n']

### Descendants

`.children` only yields a tag's direct children, while `.descendants` yields everything nested inside it at any depth: its children, their children, and so on. Both include strings as well as tags.

In [54]:
soup.div

<div class="intro">
<p>These are some photos of my trips.</p>
<img height="107" src="me.jpg" width="212"/>
</div>

In [55]:
soup.div.contents

['\n',
 <p>These are some photos of my trips.</p>,
 '\n',
 <img height="107" src="me.jpg" width="212"/>,
 '\n']

In [56]:
[(c,type(c)) for c in soup.div.children]

[('\n', bs4.element.NavigableString),
 (<p>These are some photos of my trips.</p>, bs4.element.Tag),
 ('\n', bs4.element.NavigableString),
 (<img height="107" src="me.jpg" width="212"/>, bs4.element.Tag),
 ('\n', bs4.element.NavigableString)]

In [57]:
# Descendants also include the children's own children (here, the string inside <p>)
[(c,type(c)) for c in soup.div.descendants]

[('\n', bs4.element.NavigableString),
 (<p>These are some photos of my trips.</p>, bs4.element.Tag),
 ('These are some photos of my trips.', bs4.element.NavigableString),
 ('\n', bs4.element.NavigableString),
 (<img height="107" src="me.jpg" width="212"/>, bs4.element.Tag),
 ('\n', bs4.element.NavigableString)]
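The newline strings can get in the way when only the tags matter. One possible way to skip them, sketched here on a small made-up `<div>` with an extra nested `<b>` tag, is to keep only the `Tag` instances or to call `find_all(True, recursive=False)`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative snippet with one extra level of nesting (<b> inside <p>)
html = '<div class="intro">\n<p>These are <b>some</b> photos.</p>\n<img src="me.jpg">\n</div>'
soup = BeautifulSoup(html, "html.parser")

# Direct children that are tags (whitespace strings are skipped)
child_tags = [c.name for c in soup.div.children if isinstance(c, Tag)]

# All nested tags at any depth
descendant_tags = [d.name for d in soup.div.descendants if isinstance(d, Tag)]

# find_all(True) matches every tag; recursive=False restricts it to direct children
direct_only = [t.name for t in soup.div.find_all(True, recursive=False)]

print(child_tags)
print(descendant_tags)
print(direct_only)
```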

In [58]:
# To just obtain the strings 
[s for s in soup.strings]

['\n',
 '\n',
 '\n',
 'My Holiday Photos',
 '\n',
 '\n',
 '\n',
 'My Photos',
 '\n',
 '\n',
 'These are some photos of my trips.',
 '\n',
 '\n',
 '\n',
 'Italy',
 '\n',
 '\n',
 ' ',
 '\n',
 ' ',
 '\n',
 '\n',
 '\n',
 'Germany',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

In [59]:
# To just obtain the strings without extra whitespace
[s for s in soup.stripped_strings]

['My Holiday Photos',
 'My Photos',
 'These are some photos of my trips.',
 'Italy',
 'Germany']

## Parents

The `.parent` attribute returns the element that directly contains a tag.


In [60]:
# For example, if we get the first <img> tag
soup.img

<img height="107" src="me.jpg" width="212"/>

In [66]:
# Find its parent

soup.img.parent

<div class="intro">
<p>These are some photos of my trips.</p>
<img height="107" src="me.jpg" width="212"/>
</div>

In [67]:
# Find all its parents
[p.name for p in soup.img.parents]

['div', 'body', 'html', '[document]']
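When only one particular ancestor is of interest, `find_parent()` searches upwards for the nearest ancestor matching a filter. A brief sketch on a shortened, made-up version of the sample page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page
html = """<html><body>
<div class="intro"><img src="me.jpg"></div>
<div class="country"><img src="berlin.jpg" alt="Berlin"></div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

berlin = soup.find("img", alt="Berlin")

# Nearest enclosing <div>; class is multi-valued, so it comes back as a list
container = berlin.find_parent("div")
container_class = container["class"]

# All ancestors, from the closest outwards
ancestors = [p.name for p in berlin.parents]

print(container_class)
print(ancestors)
```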

## Siblings

Similarly, we can get the siblings of a tag, which are those on the same level:

In [71]:
header3 = soup.h3
header3.next_sibling

'\n'

As we can see, the immediate sibling is actually a newline. To get all of the following siblings on the same level, we can use `next_siblings`:

In [72]:
header3.next_siblings


<generator object PageElement.next_siblings at 0x0000016664181F20>

In [75]:
# Ok, let's list them.
[s for s in header3.next_siblings]

['\n',
 <div class="country">
 <img alt="Venice" height="107" src="venice1.jpg" width="212"/> <br/>
 <img alt="Venice" height="107" src="venice2.jpg" width="212"/> <br/>
 <img alt="Roma" height="107" src="rome.jpg" width="212"/>
 </div>,
 '\n',
 <h3>Germany</h3>,
 '\n',
 <div class="country">
 <img alt="Berlin" height="107" src="berlin.jpg" width="212"/>
 </div>,
 '\n']

In [None]:
# What about the previous sibling(s) of header 3?
[s for s in header3.previous_siblings]
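Sibling navigation can also link each heading to the block that follows it. As a sketch, assuming every `<h3>` is followed by a `<div>` of photos as in the sample page (shortened here), `find_next_sibling('div')` skips over the newline strings and returns the matching tag directly.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page
html = """<body>
<h3>Italy</h3>
<div class="country">
<img src="venice1.jpg" alt="Venice">
<img src="rome.jpg" alt="Roma">
</div>
<h3>Germany</h3>
<div class="country">
<img src="berlin.jpg" alt="Berlin">
</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

photos_by_country = {}
for heading in soup.find_all("h3"):
    # The next <div> on the same level as this heading
    gallery = heading.find_next_sibling("div")
    photos_by_country[heading.get_text()] = [img["src"] for img in gallery.find_all("img")]

print(photos_by_country)
```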

For more examples and shortcuts, refer to the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Comment).