# In this notebook we go over the basics of web scraping, covering *some* special cases

In [1]:
!pip install beautifulsoup4 requests lxml # it is do-able without lxml!


In [2]:
from bs4 import BeautifulSoup as BS
import requests

## Why scraping
Some websites contain information of high value. A good example of this is https://www.example.org/!
It would be useful to scrape the data from this page periodically, so that we are notified whenever anything changes.

Here we have a picture of the website.

![example-img](../src/basic-scraping-example-img.png)

What does a website consist of on the client side?

![example-html](../src/basic-scraping-example-html.png)

We can obtain a web page in Python using the `requests` library.

In [3]:
url = 'https://www.example.org/'
page = requests.get(url).text
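
The request above assumes the server answers quickly and successfully. A slightly more defensive version might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # quietly returning the text of an error page.
    response.raise_for_status()
    return response.text


# Example usage (needs network access):
# page = fetch_html('https://www.example.org/')
```

Without a timeout, `requests.get` can wait indefinitely on an unresponsive server, which matters once a scraper runs on a schedule.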

In [4]:
page

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

## Where BeautifulSoup comes in
The `requests` library on its own gives us no convenient way to query specific parts of a web page.

Say we want to look at the heading of the document, the `h1` element:

In [5]:
try:
    page.h1
except Exception as e:
    print(e)

'str' object has no attribute 'h1'


This results in an error, since `requests.get(url).text` is a plain string object.

#### In one line: the **BeautifulSoup** library parses *HTML text* into a *parsed tree* that can be used to extract data.

Think of the parsed tree as something like a Python dictionary, for example: `example_dict = {'company': 'Cape AI', 'fun factor': 'over 9000'}`

We can then access the fields of the dictionary: `example_dict['company']` returns `'Cape AI'`.

Commonly, the parsed tree object is called soup, since it is a combination of a bunch of stuff and at the end you're not really sure about everything you added anymore.

`soup = BS(page, parser)`

Let's create a soup object from the text of the page at https://www.example.org/:

In [6]:
soup = BS(page, 'lxml') # here you can also use 'html.parser' instead of 'lxml'

Now we can easily access the heading of the page using:

In [7]:
soup.h1

<h1>Example Domain</h1>

And the main paragraph using:

In [8]:
soup.p

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
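
Accessing `soup.h1` or `soup.p` returns the whole element, tags included. To keep only the text, `get_text()` can be used, as in the sketch below. The inline HTML string is a stand-in for the downloaded page, so the example runs without network access, and `'html.parser'` is used so that `lxml` is not required.

```python
from bs4 import BeautifulSoup as BS

# Inline stand-in for the downloaded page (an assumption for this sketch).
html = (
    "<div><h1>Example Domain</h1>"
    "<p>This domain is for use in\n    illustrative examples.</p></div>"
)
soup = BS(html, "html.parser")

# get_text() returns the text inside the element, without the tags.
heading_text = soup.h1.get_text()

# The paragraph text keeps the line break and indentation from the source,
# so split/join is used here to collapse runs of whitespace.
paragraph_text = " ".join(soup.p.get_text().split())

print(heading_text)
print(paragraph_text)
```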

Displaying the soup object is also much more readable than the raw string!

In [9]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

It is important to note that a page can contain many instances of any element/tag (`h1`, `div`, etc.). Using `soup.h1` will always return the first instance.
`soup.h1` is equivalent to `soup.find('h1')`.
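
The first-instance behaviour can be checked on a small made-up document with two headings. The HTML string below is an assumption for illustration only.

```python
from bs4 import BeautifulSoup as BS

# Hypothetical document with two h1 elements.
html = """
<div>
  <h1>First heading</h1>
  <h1>Second heading</h1>
</div>
"""
soup = BS(html, "html.parser")

first_by_attribute = soup.h1       # attribute-style access
first_by_find = soup.find("h1")    # explicit call, same result
all_headings = soup.find_all("h1") # every matching element, in document order

# An element that does not occur in the document gives None.
no_h2 = soup.h2

print(first_by_attribute)
print([h.get_text() for h in all_headings])
```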

### Nested elements
Elements are generally nested in HTML, especially inside `div` elements, as web pages are usually split up into a large number of divisions. A nested element can be accessed by tracing the nesting down to the element you are interested in.

In [10]:
soup.html.body.div.h1

<h1>Example Domain</h1>
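
One thing to watch for when chaining like this: if any link in the chain is missing, that step returns `None` and the next attribute access fails. The sketch below shows this on a small inline document, which is an assumption standing in for the real page.

```python
from bs4 import BeautifulSoup as BS

# Inline stand-in for a downloaded page.
html = "<html><body><div><h1>Example Domain</h1></div></body></html>"
soup = BS(html, "html.parser")

# Every element on the path exists, so the chain resolves to the h1.
heading = soup.html.body.div.h1

# There is no table in the body, so this step yields None.
missing = soup.html.body.table

# Chaining past a missing element raises AttributeError,
# because None has no attribute `tr`.
try:
    soup.html.body.table.tr
    chained_ok = True
except AttributeError:
    chained_ok = False

print(heading, missing, chained_ok)
```

Checking for `None` before going deeper avoids this failure on pages whose layout is not known in advance.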

## A more complicated task/website

Our company website https://cape-ai.com/ is quite a bit more complicated than https://www.example.org/. 

Let's see what happens when we try to scrape the `h1` element of this page.

In [11]:
url = 'https://cape-ai.com/'
page = requests.get(url).text
soup = BS(page, 'lxml')

In [12]:
h1 = soup.h1
print(h1.prettify())

<h1 class="display-3 font-weight-bold text-dark" id="welcomeHeadingSource">
 <span style="background: rgba(250, 250, 250, 0.4)">
  Artificial intelligence to
 </span>
 <br/>
 <span class="text-white bg-primary" data-options='{"strings": ["Stop poaching", "Create township jobs", "Understand customers", "Improve healthcare", "Fight climate change"]}' data-toggle="typed">
 </span>
</h1>



### Get
You can use `h1.get('attribute name')` to get the value of an attribute of an element.

In [13]:
h1.get('class')

['display-3', 'font-weight-bold', 'text-dark']
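
Note that `class` came back as a list. The sketch below, on a simplified copy of the heading (an assumption for illustration), shows how `get` behaves for a multi-valued attribute, a single-valued attribute, and an attribute that is absent.

```python
from bs4 import BeautifulSoup as BS

# Simplified stand-in for the h1 element shown above.
html = '<h1 class="display-3 text-dark" id="welcomeHeadingSource">Hello</h1>'
h1 = BS(html, "html.parser").h1

classes = h1.get("class")               # multi-valued attribute: list of strings
element_id = h1.get("id")               # single-valued attribute: plain string
missing = h1.get("title")               # absent attribute: None
fallback = h1.get("title", "no title")  # absent attribute with a default

# Dictionary-style access raises KeyError for an absent attribute.
try:
    h1["title"]
    index_ok = True
except KeyError:
    index_ok = False

print(classes, element_id, missing, fallback, index_ok)
```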

### find_all
As mentioned before, elements can have multiple instances. Here we see two `span` elements inside the `h1` element. Using `h1.span` returns only the first instance. What if we want a list of the varying strings in the second span element?

In [14]:
h1.span

# same as soup.h1.span

<span style="background: rgba(250, 250, 250, 0.4)">Artificial intelligence to</span>

Using `h1.find_all('span')`, we can find all instances of the `span` element inside `h1`

In [15]:
spans = h1.find_all('span')
spans

[<span style="background: rgba(250, 250, 250, 0.4)">Artificial intelligence to</span>,
 <span class="text-white bg-primary" data-options='{"strings": ["Stop poaching", "Create township jobs", "Understand customers", "Improve healthcare", "Fight climate change"]}' data-toggle="typed"></span>]

In [16]:
spans[0]

<span style="background: rgba(250, 250, 250, 0.4)">Artificial intelligence to</span>

In [17]:
spans[1]

<span class="text-white bg-primary" data-options='{"strings": ["Stop poaching", "Create township jobs", "Understand customers", "Improve healthcare", "Fight climate change"]}' data-toggle="typed"></span>

In [18]:
spans[1].get('data-options')

'{"strings": ["Stop poaching", "Create township jobs", "Understand customers", "Improve healthcare", "Fight climate change"]}'

In [24]:
try:
    dict(spans[1].get('data-options'))
except Exception as e:
    print(e)

dictionary update sequence element #0 has length 1; 2 is required

The call fails because `dict()` iterates over the string one character at a time and expects each item to be a key/value pair. The string has to be parsed instead, for example with `ast.literal_eval` from the standard library.


In [19]:
import ast

In [20]:
ast.literal_eval(h1.find_all('span')[1].get('data-options'))

{'strings': ['Stop poaching',
  'Create township jobs',
  'Understand customers',
  'Improve healthcare',
  'Fight climate change']}

In [21]:
text_fields = ast.literal_eval(h1.find_all('span')[1].get('data-options'))['strings']
text_fields

['Stop poaching',
 'Create township jobs',
 'Understand customers',
 'Improve healthcare',
 'Fight climate change']
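
Since the `data-options` value looks like JSON, `json.loads` from the standard library is a reasonable alternative to `ast.literal_eval`. The sketch below compares the two on a shortened copy of the attribute value. The second string, `raw_with_bool`, is a made-up example and does not come from the website.

```python
import ast
import json

# Shortened copy of the data-options value from the page.
raw = '{"strings": ["Stop poaching", "Create township jobs"]}'

from_json = json.loads(raw)
from_ast = ast.literal_eval(raw)
same = from_json == from_ast  # both parsers agree on this input

# Hypothetical value using a JSON literal that is not valid Python.
raw_with_bool = '{"loop": true}'
parsed_bool = json.loads(raw_with_bool)

# ast.literal_eval only understands Python literals, so `true` is rejected.
try:
    ast.literal_eval(raw_with_bool)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(from_json["strings"], parsed_bool, literal_eval_ok)
```

The two agree here because this particular string happens to be valid in both syntaxes. Attribute values containing `true`, `false`, or `null` would only parse with `json.loads`.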

## Shortcut!
Using `find_all` with the `attrs` argument, we can home in on the specific element we want based on one of its attributes.

In [22]:
specific_span = soup.find_all('span', attrs={'data-toggle':'typed'})
specific_span

[<span class="text-white bg-primary" data-options='{"strings": ["Stop poaching", "Create township jobs", "Understand customers", "Improve healthcare", "Fight climate change"]}' data-toggle="typed"></span>]

In [23]:
text_fields = ast.literal_eval(specific_span[0].get('data-options'))['strings']
text_fields

['Stop poaching',
 'Create township jobs',
 'Understand customers',
 'Improve healthcare',
 'Fight climate change']
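
When only one element is expected, `find` with the same `attrs` argument returns the element itself rather than a list, and `select_one` accepts the equivalent CSS selector. The sketch below uses a trimmed inline copy of the heading, which is an assumption so the example runs offline.

```python
from bs4 import BeautifulSoup as BS

# Trimmed stand-in for the heading on the page.
html = """
<h1>
  <span style="background: red">Artificial intelligence to</span>
  <span data-toggle="typed" data-options='{"strings": ["Stop poaching"]}'></span>
</h1>
"""
soup = BS(html, "html.parser")

# find returns the first matching element, or None if nothing matches.
by_find = soup.find("span", attrs={"data-toggle": "typed"})

# The same query written as a CSS attribute selector.
by_selector = soup.select_one('span[data-toggle="typed"]')

# No span carries this attribute value, so the result is None.
no_match = soup.find("span", attrs={"data-toggle": "missing"})

print(by_find.get("data-options"))
```

Because `find` returns `None` when nothing matches, it is worth checking the result before calling `.get` on it if the page layout might change.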

That covers basic scraping! On to bigger things.