# Basic Webpage Scraping
Webpage scraping consists of two steps: crawling (downloading pages) and parsing (extracting data from the downloaded HTML). This tutorial focuses on parsing. Beautiful Soup is a Python library for parsing static HTML. More details can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

To keep things simple, we will use an example page from W3Schools: https://www.w3schools.com/howto/tryit.asp?filename=tryhow_css_example_website
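
As a rough illustration of the two steps, the sketch below separates crawling from parsing. The helper names `crawl` and `parse_title` are hypothetical and not part of this tutorial's notebook; the sketch uses the standard library's `urllib` for downloading and the built-in `html.parser` backend so that it does not depend on `lxml`.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def crawl(url, timeout=10):
    """Step 1 (hypothetical helper): download the raw HTML of a page."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_title(html):
    """Step 2 (hypothetical helper): parse the HTML and return the page title."""
    doc = BeautifulSoup(html, "html.parser")
    return doc.title.string if doc.title is not None else None


# Parsing works on any HTML string, so it can be tried without network access.
print(parse_title("<html><head><title>Page Title</title></head></html>"))
```

In a real scraper, the string passed to `parse_title` would come from `crawl(url)`; the rest of this tutorial reads the HTML from a local file instead.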

In [1]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !wget https://www.dropbox.com/s/w5khgpro1ym3icg/simple_page.html?dl=0 -O simple_page.html

In [2]:
with open('simple_page.html', encoding='utf-8') as f:
    html = f.read()

In [3]:
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Page Title</title>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<style>\n* {\n  box-sizing: border-box;\n}\n\n/* Style the body */\nbody {\n  font-family: Arial, Helvetica, sans-serif;\n  margin: 0;\n}\n\n/* Header/logo Title */\n.header {\n  padding: 80px;\n  text-align: center;\n  background: #1abc9c;\n  color: white;\n}\n\n/* Increase the font size of the heading */\n.header h1 {\n  font-size: 40px;\n}\n\n/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */\n.navbar {\n  overflow: hidden;\n  background-color: #333;\n  position: sticky;\n  position: -webkit-sticky;\n  top: 0;\n}\

In [4]:
from bs4 import BeautifulSoup
from bs4.element import Tag
from IPython.display import HTML

In [5]:
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Page Title
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style>
   * {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the nav

## Beautiful Soup DOM Tree
Beautiful Soup is built around the DOM (Document Object Model), the same concept that web browsers use. The DOM is a tree of all elements in the webpage. Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children
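
The parts listed above can be inspected on a small example. This is a minimal sketch on a made-up snippet (not taken from `simple_page.html`), parsed with the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
snippet = '<div id="box" class="card wide"><p>Hello <b>world</b></p></div>'
doc = BeautifulSoup(snippet, "html.parser")
div = doc.div

print(div.name)                         # tag name
print(div.get("id"))                    # id attribute
print(div.attrs)                        # all attributes as a dict
print(div.decode_contents())            # inner HTML
print(str(div))                         # outer HTML
print(doc.p.parent.name)                # parent of <p>
print([c.name for c in div.children])   # direct children of <div>
```

Note that `class` is returned as a list because an element may carry several classes.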

### Traversing simple HTML's DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +-- div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +-- div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```

In [6]:
# 'title' is the tag of one of the element nodes in the example.
# We can refer to a node by its tag name.
type(soup.title)

bs4.element.Tag

In [7]:
soup.title

<title>Page Title</title>

In [8]:
soup.head.style

<style>
* {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the navigation bar links */
.navbar a {
  float: left;
  display: block;
  color: white;
  text-align: center;
  padding: 14px 20px;
  text-decoration: none;
}


/* Right-aligned link */

In [9]:
# we can get the tag name of a node with 'name'
soup.title.name

'title'

In [10]:
# we can get outerHTML by converting node to string
str(soup.title)

'<title>Page Title</title>'

In [11]:
# 'string' returns the text of a node whose only child is text
# (for the title, this is the same as its innerHTML)
soup.title.string

'Page Title'
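
`string` only works when a node has a single child. The sketch below, using the `<p>` element from the page header and the built-in `html.parser` backend, compares it with `get_text()` and `decode_contents()`, which handle nodes with mixed content.

```python
from bs4 import BeautifulSoup

# This <p> is copied from the header of the example page.
doc = BeautifulSoup("<p>A <b>responsive</b> website created by me.</p>", "html.parser")
p = doc.p

print(p.string)             # None: <p> has more than one child
print(p.b.string)           # responsive
print(p.get_text())         # A responsive website created by me.
print(p.decode_contents())  # A <b>responsive</b> website created by me.
```

`get_text()` gives the text with all tags removed, while `decode_contents()` is the closest equivalent to innerHTML.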

In [12]:
# 'soup.title.id' would look for a child <id> tag, so read the id attribute with get('id')
# (it is None in this example)
soup.title.get('id')

In [13]:
# we can get attribute values with 'attrs'
soup.title.attrs

{}

In [14]:
# getting the parent node with 'parent'
soup.title.parent.name

'head'

In [15]:
# 'children' is an iterator over the direct children of a node
soup.title.children

<list_iterator at 0x11336a9d0>
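
Because `children` is an iterator, it is usually wrapped in `list()` or used in a loop. The sketch below uses a made-up list snippet and shows that whitespace between tags also counts as a child, which is why filtering on `Tag` is often needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet used only for illustration.
snippet = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
doc = BeautifulSoup(snippet, "html.parser")

children = list(doc.ul.children)
print(len(children))  # the newline strings between tags are children too

# Keep only element nodes.
tag_children = [c for c in children if isinstance(c, Tag)]
print([c.name for c in tag_children])

# find_all with recursive=False is another way to get direct child tags.
print(doc.ul.find_all(True, recursive=False))
```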

## DOM Structure

In [16]:
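def walk_dom(node, depth=None, indent='', only_tag=True):
    """Print the DOM tree under `node`, one line per node.

    depth: number of levels to print (None means no limit).
    only_tag: if True, skip non-tag nodes such as text.
    """
    if only_tag and (not isinstance(node, Tag)):
        return

    # node.name is None for text nodes (NavigableString)
    print('{}{} : {}'.format(indent, node.name, type(node)))
    if isinstance(node, Tag):
        if len(node.attrs) > 0:
            print(indent, '>>', node.attrs)
        if depth is None or depth > 1:
            indent += '    '
            for c in node.children:
                # pass the remaining depth down to each child
                next_depth = None if depth is None else depth - 1
                walk_dom(c, next_depth, indent=indent, only_tag=only_tag)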

In [17]:
walk_dom(soup.html, depth=2, only_tag=False)

html : <class 'bs4.element.Tag'>
 >> {'lang': 'en'}
    None : <class 'bs4.element.NavigableString'>
    head : <class 'bs4.element.Tag'>
    None : <class 'bs4.element.NavigableString'>
    body : <class 'bs4.element.Tag'>
    None : <class 'bs4.element.NavigableString'>


In [18]:
walk_dom(soup.html)

html : <class 'bs4.element.Tag'>
 >> {'lang': 'en'}
    head : <class 'bs4.element.Tag'>
        title : <class 'bs4.element.Tag'>
        meta : <class 'bs4.element.Tag'>
         >> {'charset': 'UTF-8'}
        meta : <class 'bs4.element.Tag'>
         >> {'name': 'viewport', 'content': 'width=device-width, initial-scale=1'}
        style : <class 'bs4.element.Tag'>
    body : <class 'bs4.element.Tag'>
        div : <class 'bs4.element.Tag'>
         >> {'class': ['header']}
            h1 : <class 'bs4.element.Tag'>
            p : <class 'bs4.element.Tag'>
                b : <class 'bs4.element.Tag'>
        div : <class 'bs4.element.Tag'>
         >> {'class': ['navbar']}
            a : <class 'bs4.element.Tag'>
             >> {'href': '#', 'class': ['active']}
            a : <class 'bs4.element.Tag'>
             >> {'href': '#'}
            a : <class 'bs4.element.Tag'>
             >> {'href': '#'}
            a : <class 'bs4.element.Tag'>
             >> {'href': '#', 'cla

In [19]:
walk_dom(soup.head)

head : <class 'bs4.element.Tag'>
    title : <class 'bs4.element.Tag'>
    meta : <class 'bs4.element.Tag'>
     >> {'charset': 'UTF-8'}
    meta : <class 'bs4.element.Tag'>
     >> {'name': 'viewport', 'content': 'width=device-width, initial-scale=1'}
    style : <class 'bs4.element.Tag'>


In [20]:
body_text = str(soup.body)
body_text[:300]

'<body>\n<div class="header">\n<h1>My Website</h1>\n<p>A <b>responsive</b> website created by me.</p>\n</div>\n<div class="navbar">\n<a class="active" href="#">Home</a>\n<a href="#">Link</a>\n<a href="#">Link</a>\n<a class="right" href="#">Link</a>\n</div>\n<div class="row">\n<div class="side">\n<h2>About Me</h2>'

In [21]:
HTML(body_text)

In [22]:
walk_dom(soup.body)

body : <class 'bs4.element.Tag'>
    div : <class 'bs4.element.Tag'>
     >> {'class': ['header']}
        h1 : <class 'bs4.element.Tag'>
        p : <class 'bs4.element.Tag'>
            b : <class 'bs4.element.Tag'>
    div : <class 'bs4.element.Tag'>
     >> {'class': ['navbar']}
        a : <class 'bs4.element.Tag'>
         >> {'href': '#', 'class': ['active']}
        a : <class 'bs4.element.Tag'>
         >> {'href': '#'}
        a : <class 'bs4.element.Tag'>
         >> {'href': '#'}
        a : <class 'bs4.element.Tag'>
         >> {'href': '#', 'class': ['right']}
    div : <class 'bs4.element.Tag'>
     >> {'class': ['row']}
        div : <class 'bs4.element.Tag'>
         >> {'class': ['side']}
            h2 : <class 'bs4.element.Tag'>
            h5 : <class 'bs4.element.Tag'>
            div : <class 'bs4.element.Tag'>
             >> {'id': 'my_photo', 'class': ['fakeimg'], 'style': 'height:200px;'}
            p : <class 'bs4.element.Tag'>
            h3 : <class 'bs4.

In [23]:
a = soup.a
a

<a class="active" href="#">Home</a>

In [24]:
a.attrs

{'href': '#', 'class': ['active']}

In [25]:
a.get('href')

'#'
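
A tag's attributes can be read in several ways. This minimal sketch reuses the first link of the example page and shows the difference between dictionary-style access and `get()` when an attribute is missing.

```python
from bs4 import BeautifulSoup

# The first <a> element of the example page.
link = BeautifulSoup('<a class="active" href="#">Home</a>', "html.parser").a

print(link["href"])               # dictionary-style access
print(link["class"])              # class is multi-valued, so this is a list
print(link.get("title"))          # None when the attribute is missing
print(link.get("title", "n/a"))   # or a default of our choice
print(link.has_attr("href"))      # True

try:
    link["title"]                 # dictionary-style access raises on a missing key
except KeyError:
    print("no title attribute")
```

Using `get()` avoids the exception when an attribute may be absent.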

In [26]:
soup.div

<div class="header">
<h1>My Website</h1>
<p>A <b>responsive</b> website created by me.</p>
</div>

## Finding Nodes

In [27]:
all_div = soup.find_all('div')

In [28]:
for n, div in enumerate(all_div):
    print('-- {} --'.format(n))
    print(div)

-- 0 --
<div class="header">
<h1>My Website</h1>
<p>A <b>responsive</b> website created by me.</p>
</div>
-- 1 --
<div class="navbar">
<a class="active" href="#">Home</a>
<a href="#">Link</a>
<a href="#">Link</a>
<a class="right" href="#">Link</a>
</div>
-- 2 --
<div class="row">
<div class="side">
<h2>About Me</h2>
<h5>Photo of me:</h5>
<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>
<p>Some text about me in culpa qui officia deserunt mollit anim..</p>
<h3>More Text</h3>
<p>Lorem ipsum dolor sit ame.</p>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div>
</div>
<div class="main" id="div_1">
<h2>TITLE HEADING</h2>
<h5>Title description, Dec 7, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p id="more_text">Some text..</p>
<p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmo

In [29]:
div8 = all_div[8]
HTML(str(div8))

In [30]:
walk_dom(div8, depth=2)

div : <class 'bs4.element.Tag'>
 >> {'class': ['main'], 'id': 'div_1'}
    h2 : <class 'bs4.element.Tag'>
    h5 : <class 'bs4.element.Tag'>
    div : <class 'bs4.element.Tag'>
     >> {'class': ['fakeimg'], 'style': 'height:200px;'}
    p : <class 'bs4.element.Tag'>
     >> {'id': 'more_text'}
    p : <class 'bs4.element.Tag'>
    br : <class 'bs4.element.Tag'>
    h2 : <class 'bs4.element.Tag'>
    h5 : <class 'bs4.element.Tag'>
    div : <class 'bs4.element.Tag'>
     >> {'class': ['fakeimg'], 'style': 'height:200px;'}
    p : <class 'bs4.element.Tag'>
    p : <class 'bs4.element.Tag'>


In [31]:
div8.attrs

{'class': ['main'], 'id': 'div_1'}

In [32]:
str(div8)

'<div class="main" id="div_1">\n<h2>TITLE HEADING</h2>\n<h5>Title description, Dec 7, 2017</h5>\n<div class="fakeimg" style="height:200px;">Image</div>\n<p id="more_text">Some text..</p>\n<p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>\n<br/>\n<h2>TITLE HEADING</h2>\n<h5>Title description, Sep 2, 2017</h5>\n<div class="fakeimg" style="height:200px;">Image</div>\n<p>Some text..</p>\n<p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>\n</div>'

In [33]:
div8.get('class')

['main']

In [34]:
div8.find_all('div')

[<div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>]

In [35]:
soup.find(id='more_text')

<p id="more_text">Some text..</p>

In [36]:
soup.find_all(attrs={'class': 'fakeimg'})

[<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>]

In [37]:
soup.find_all(attrs={'style': 'height:60px;'})

[<div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>]
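
The sketch below collects a few more ways to call `find` and `find_all`. It runs on a short fragment built from elements of the example page and parsed with the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# A few elements copied from the example page.
snippet = """
<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>
<div class="fakeimg" style="height:60px;">Image</div>
<p id="more_text">Some text..</p>
"""
doc = BeautifulSoup(snippet, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
by_class = doc.find_all("div", class_="fakeimg")
print(len(by_class))

# A list of tag names matches any of them.
divs_and_ps = doc.find_all(["div", "p"])
print(len(divs_and_ps))

# limit stops the search after the given number of matches.
first_only = doc.find_all("div", limit=1)
print(first_only[0].get("id"))

# find returns the first match, or None when nothing matches.
print(doc.find("span"))
```

Since `find` returns `None` when nothing matches, it is worth checking the result before accessing attributes on it.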

## CSS Selector

In [38]:
soup.select('p')

[<p>A <b>responsive</b> website created by me.</p>,
 <p>Some text about me in culpa qui officia deserunt mollit anim..</p>,
 <p>Lorem ipsum dolor sit ame.</p>,
 <p id="more_text">Some text..</p>,
 <p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>,
 <p>Some text..</p>,
 <p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>]

In [39]:
soup.select('#more_text')

[<p id="more_text">Some text..</p>]

In [40]:
soup.select('.fakeimg')

[<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>]

In [41]:
soup.select('#my_photo.fakeimg')

[<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>]

In [42]:
for node in soup.select('h2'):
    print(str(node))
    print('----')

<h2>About Me</h2>
----
<h2>TITLE HEADING</h2>
----
<h2>TITLE HEADING</h2>
----
<h2>Footer</h2>
----


In [43]:
node = soup.select('div div h2')

In [44]:
str(node)

'[<h2>About Me</h2>, <h2>TITLE HEADING</h2>, <h2>TITLE HEADING</h2>]'

A CSS selector can include:
- **tag**: select nodes with the specific *tag*, e.g. div for nodes with tag 'div'
- **.class**: select nodes with the specific *class*
- **#id**: select the node with the specific *id*
- **tag[attr]**: select nodes with the specific *tag* that have the attribute *attr*
- **A B**: select nodes matching B that are descendants of nodes matching A, e.g. div div h2
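
The selector forms above can be combined. The following sketch runs on a simplified, partly made-up fragment (the `#top` link and the `footer` class are hypothetical) and also shows `select_one`, which returns a single node instead of a list.

```python
from bs4 import BeautifulSoup

# Simplified fragment for illustration; not identical to the example page.
snippet = """
<div class="row">
  <div class="side"><h2>About Me</h2><a href="#">Link</a></div>
  <div class="main" id="div_1"><h2>TITLE HEADING</h2><a class="right" href="#top">Top</a></div>
</div>
<div class="footer"><h2>Footer</h2></div>
"""
doc = BeautifulSoup(snippet, "html.parser")

nested = doc.select("div div h2")         # h2 inside a div inside a div
direct = doc.select("div > h2")           # h2 that is a direct child of a div
with_href = doc.select("a[href]")         # a tags that have an href attribute
exact = doc.select('a[href="#top"]')      # a tags whose href equals "#top"
heading = doc.select_one("#div_1 h2")     # first match only
missing = doc.select_one(".missing")      # None when nothing matches

print([h.get_text() for h in nested])
print(len(direct), len(with_href), len(exact))
print(heading.get_text(), missing)
```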

## Advanced DOM Walk

In [45]:
walk_dom(div8.parent, depth=2)

div : <class 'bs4.element.Tag'>
 >> {'class': ['row']}
    div : <class 'bs4.element.Tag'>
     >> {'class': ['side']}
    div : <class 'bs4.element.Tag'>
     >> {'class': ['main'], 'id': 'div_1'}


In [46]:
div8.find_previous_sibling('div')

<div class="side">
<h2>About Me</h2>
<h5>Photo of me:</h5>
<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>
<p>Some text about me in culpa qui officia deserunt mollit anim..</p>
<h3>More Text</h3>
<p>Lorem ipsum dolor sit ame.</p>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div>
</div>

Beautiful Soup provides the following attributes for navigating the DOM:
- **node.children**: iterator for all children of a node
- **node.descendants**: iterator for all of a tag’s children, recursively: its direct children, the children of its direct children, and so on
- **node.parent**: parent of the current node
- **node.parents**: iterator for all of an element’s ancestors, up to the root of the DOM tree
- **node.next_sibling / node.previous_sibling**: navigate between page elements that are on the same level of the DOM tree
- **node.next_element / node.previous_element**: navigate between page elements in the DOM tree, regardless of the level
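
The navigation attributes above can be tried on a small tree. This is a minimal sketch on a made-up fragment that mimics the `row`, `side` and `main` divs of the example page. It also shows that `previous_sibling` may return a whitespace string, which is why `find_previous_sibling('div')` was used earlier.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment shaped like the row/side/main divs of the example page.
snippet = '<div class="row">\n<div class="side">S</div>\n<div class="main"><p>M</p></div>\n</div>'
doc = BeautifulSoup(snippet, "html.parser")
main = doc.find("div", class_="main")

# The node just before <div class="main"> is the newline between the two divs.
print(repr(main.previous_sibling))

# find_previous_sibling skips text nodes and returns the previous matching tag.
print(main.find_previous_sibling("div")["class"])

# parents walks upwards until it reaches the BeautifulSoup object itself.
print([p.name for p in main.parents])

# descendants yields tags and text nodes; keep only the tags here.
print([d.name for d in main.descendants if isinstance(d, Tag)])

# next_element follows document order, so it goes down into the children first.
print(main.next_element)
print(repr(main.p.next_element))
```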