# Web Scraping

In this notebook, you'll learn how to use the BeautifulSoup package to work with HTML.

Installation instructions: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Outline:

* Basic HTML elements
* HTML in a notebook: display vs render
* Parsing HTML as a string and from files
* Extracting data with BeautifulSoup
    * Simple examples with HTML
        * Headings
        * Paragraphs
        * Lists
        * Links
    * Working with real HTML
        * Exploring attributes
        * Identifying parent elements
        

## Basic HTML elements

First, let's look at a basic example of HTML, below. It is simply a Python string containing the HTML code.

What elements are in it?

`<html>` - the outermost element; everything else is nested inside it

`<head>` - contains information about the page (metadata), most of which is not displayed to the user

`<title>` - nested inside `<head>`; sets the text displayed in the browser's title bar and on the page's tab

`<body>` - contains all the content that is actually displayed

`<h1>` - the top-level (largest) heading; `<h2>` to `<h6>` are progressively smaller

`<p>` - a paragraph of text

`<ul>` - an unordered (bulleted) list. Nested inside are the actual list items, each marked with `<li>`

`<a>` - a link to another resource. The destination is given by the `href` attribute. The displayed text is placed between the opening and closing `<a>` tags

In [10]:
raw_html = '''
<html>
    <head>
        <title>A very basic web page</title>
    </head>
    <body>
        <h1>Welcome!</h1>
        <p>This is a very simple webpage, with nothing too fancy on it. The structure is pretty basic.</p>
        
        
        <p> Reasons to keep it simple:
            <ul>
                <li>Will load quickly</li>
                <li>Easy to read</li>
                <li>Quick to create</li>
                <li>Easy to update</li>
            </ul>
        </p>

        <h3>Other links to visit...</h3>
        <a href='http://www.site1.com'>This great site</a>
        <br>        
        <a href='http://www.site2.net'>This equally great site</a>
        <br>
        <a href='http://www.site3.org'>This one is actually quite bad</a>

        <h3>Get in touch...</h3>
        <a href='mailto:me@address.com'>You can email me here</a>

    </body>
</html>'''

## HTML: display vs render

You can display the raw HTML string using `print()`.

In [11]:
print(raw_html)


<html>
    <head>
        <title>A very basic web page</title>
    </head>
    <body>
        <h1>Welcome!</h1>
        <p>This is a very simple webpage, with nothing too fancy on it. The structure is pretty basic.</p>
        
        
        <p> Reasons to keep it simple:
            <ul>
                <li>Will load quickly</li>
                <li>Easy to read</li>
                <li>Quick to create</li>
                <li>Easy to update</li>
            </ul>
        </p>

        <h3>Other links to visit...</h3>
        <a href='http://www.site1.com'>This great site</a>
        <br>        
        <a href='http://www.site2.net'>This equally great site</a>
        <br>
        <a href='http://www.site3.org'>This one is actually quite bad</a>

        <h3>Get in touch...</h3>
        <a href='mailto:me@address.com'>You can email me here</a>

    </body>
</html>


But you might want to see how it would look in the browser. This can be useful if you want to see how things are formatted, without having to interpret the elements.

In [12]:
from IPython.display import HTML, display

rendered_html = HTML(raw_html) # Converts the string to something the notebook can render

display(rendered_html) # Actually displays it in the notebook

# Extracting data with BeautifulSoup

What data might we actually want to extract from the above HTML?

* Simple text
    * The text of the section headings
    * The main text of the paragraphs
* Structured text
    * The list of items
    * URLs plus any descriptive text

In [13]:
print(raw_html)


<html>
    <head>
        <title>A very basic web page</title>
    </head>
    <body>
        <h1>Welcome!</h1>
        <p>This is a very simple webpage, with nothing too fancy on it. The structure is pretty basic.</p>
        
        
        <p> Reasons to keep it simple:
            <ul>
                <li>Will load quickly</li>
                <li>Easy to read</li>
                <li>Quick to create</li>
                <li>Easy to update</li>
            </ul>
        </p>

        <h3>Other links to visit...</h3>
        <a href='http://www.site1.com'>This great site</a>
        <br>        
        <a href='http://www.site2.net'>This equally great site</a>
        <br>
        <a href='http://www.site3.org'>This one is actually quite bad</a>

        <h3>Get in touch...</h3>
        <a href='mailto:me@address.com'>You can email me here</a>

    </body>
</html>


In [14]:
from bs4 import BeautifulSoup

parsed_html = BeautifulSoup(raw_html, 'html.parser') # Parse the HTML from before, naming the parser explicitly
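Once the HTML is parsed, elements can be reached by name and attributes read like dictionary entries. The following is a minimal sketch rather than part of the original notebook: `snippet` and `doc` are illustrative names, and a small stand-in document is used so that the cell runs on its own.

```python
from bs4 import BeautifulSoup

# A small stand-in document so this cell is self-contained
snippet = """
<html>
    <head><title>A very basic web page</title></head>
    <body>
        <h1>Welcome!</h1>
        <a href='http://www.site1.com'>This great site</a>
    </body>
</html>"""

doc = BeautifulSoup(snippet, 'html.parser')

title_text = doc.title.string      # text inside <title>
first_heading = doc.h1.text        # text inside the first <h1>
first_link = doc.find('a')         # the first <a> element
link_target = first_link['href']   # attributes are read like dictionary items

print(title_text, '|', first_heading, '|', link_target)
```

The same calls should work on `parsed_html`, since it was built from a document with the same elements.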

## Headings and paragraphs

Now, get all the headings and paragraphs. Iterate through them, printing their content.

In [15]:
page = BeautifulSoup(raw_html, 'html.parser')

headings = page.find_all(['h1','h2','h3','h4','h5','h6'])
for n, h in enumerate(headings):
    print(n+1, h.name, h.text.strip())

paras = page.find_all('p')

for n, p in enumerate(paras):
    print(n+1, p.name, p.text.strip())
    

1 h1 Welcome!
2 h3 Other links to visit...
3 h3 Get in touch...
1 p This is a very simple webpage, with nothing too fancy on it. The structure is pretty basic.
2 p Reasons to keep it simple:
            
Will load quickly
Easy to read
Quick to create
Easy to update
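Notice that the text of the second paragraph includes the list items, because `.text` collects the text of every nested element. If only the paragraph's own text is wanted, one option is to keep just the strings that are direct children of the `p` element. This is a sketch on a stand-in snippet (`snippet`, `para` and `own_text` are illustrative names). It assumes the `html.parser` parser, which keeps the `ul` nested inside the `p`; stricter parsers may restructure this nesting.

```python
from bs4 import BeautifulSoup, NavigableString

snippet = """
<p> Reasons to keep it simple:
    <ul>
        <li>Will load quickly</li>
        <li>Easy to read</li>
    </ul>
</p>"""

para = BeautifulSoup(snippet, 'html.parser').find('p')

# Keep only the strings that are direct children of <p>,
# skipping text that belongs to nested elements such as <li>
direct_strings = [child.strip() for child in para.children
                  if isinstance(child, NavigableString)]
own_text = ' '.join(s for s in direct_strings if s)

print(own_text)
```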


## Lists

We know there is only one list on this page. Target it and extract its child elements using `find_all`. Print the contents of each child element, numbering the lines as you go.

In [19]:
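lists = page.find_all('ul')   # all the lists on the page
# len() of a single Tag counts its children, so count the result set instead
n_lists = len(lists)
print('n lists:', n_lists)

first_list = lists[0]
for n, l in enumerate(first_list.find_all('li')):
    print(n+1, l.name, l.text)

n lists: 1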
1 li Will load quickly
2 li Easy to read
3 li Quick to create
4 li Easy to update


## Links

Now, extract all the links (both mailto: and http:) plus the text that goes with them.

Create a list of dictionaries that could be used to generate a CSV file.

The columns of this CSV file should be:
* the link address (everything after `mailto:` or `http://`)
* the type of link data it is
* the text associated with the link

Hint: `href` is an attribute. The `Tag` class stores its attributes as a dictionary in `.attrs`.

Hint: use a regular expression to distinguish `mailto:` from `http://`.

In [20]:
import re

links = page.find_all('a')

# Compile the patterns once, outside the loop
site_pattern = re.compile(r'https?://.+\.\w+')
email_pattern = re.compile(r'mailto:.+@.+')

websites = []
emails_addresses = []
for l in links:
    link = l.get('href')
    if link is None:
        continue  # skip anchors that have no href attribute

    if site_pattern.search(link):
        websites.append(link)
    elif email_pattern.search(link):
        emails_addresses.append(link)


print('websites:', websites, '\nemails:', emails_addresses)

websites: ['http://www.site1.com', 'http://www.site2.net', 'http://www.site3.org'] 
emails: ['mailto:me@address.com']
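The cell above sorts the links into two lists. To get the list of dictionaries described in the task, one possible approach is sketched below. The column names `address`, `type` and `text`, the labels `website` and `email`, and the single combined pattern are illustrative choices, and a short stand-in snippet is used so the cell runs on its own.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

snippet = """
<a href='http://www.site1.com'>This great site</a>
<a href='mailto:me@address.com'>You can email me here</a>
"""

# Capture the scheme (http, https or mailto) and everything after it
link_pattern = re.compile(r'^(https?|mailto):(?://)?(.+)$')

records = []
for a in BeautifulSoup(snippet, 'html.parser').find_all('a', href=True):
    match = link_pattern.match(a['href'])
    if match is None:
        continue  # skip link types we are not interested in
    scheme, address = match.groups()
    records.append({
        'address': address,
        'type': 'email' if scheme == 'mailto' else 'website',
        'text': a.text.strip(),
    })

# Write the records as CSV text (to a string here; a file works the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['address', 'type', 'text'])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Replacing `snippet` with `raw_html` should give one record per link on the example page.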


## Working with real HTML - more difficult examples

The examples used so far are basic, to introduce the structure of HTML. You should now be familiar with tags, elements, nesting and attributes, as well as the common BeautifulSoup functions for working with these.

Now, load the file `data/breakeven.html`, a saved copy of the Wikipedia page for the song "Breakeven", and parse it with BeautifulSoup.

View the raw HTML in a text editor that will highlight the tags for you, to make it easier to understand. Some text editors will also prettify the raw HTML, to show the nested structure more clearly.

Alternatively, a parsed BeautifulSoup object has a `.prettify()` method, which returns the HTML as an indented string that you can `print()`.

(Note: the HTML file here has been prettified already.)
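As a quick illustration of `.prettify()`, here is a sketch on a one-line stand-in string (`compact` is an illustrative name). Each element ends up on its own line, indented according to its nesting depth.

```python
from bs4 import BeautifulSoup

# HTML written on a single line, with no indentation
compact = "<ul><li>Will load quickly</li><li>Easy to read</li></ul>"

pretty = BeautifulSoup(compact, 'html.parser').prettify()
print(pretty)
```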

In [138]:
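# Wikipedia pages are UTF-8 encoded, so state the encoding and the parser explicitly
with open('data/breakeven.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')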
    


## Exploring attributes

What parts of the page are likely to be of interest?

* The basic info in the box in the top right corner
* The track listing for this song
* The four tables showing chart performance
* The table showing certifications

Targeting them using simple elements (`ul`, `table`) will be hard, because the page contains a lot of these elements.

Instead, you can use attributes of these elements to target them.

To find out which attributes are best, inspect the relevant parts of the HTML to find the element and see which attributes it has. A quick CTRL+F for some text from the part you are interested in will help you locate it.

That can be hard to read. Instead, you could extract all the relevant elements and list their attributes and contents. Try this now.

* Get all the tables and lists
* Print their attributes
* Print how many rows each one has (look for children with element `tr` and remember that this includes the header row)
* Print the column names from the header row (look for `th`)

In [139]:
tables = soup.find_all('table')

for t in tables:
    print('\ntable attributes:', t.attrs)
    print('n rows inc headers:', len(t.find_all('tr')))
    print('column names:', t.find_all('th'))



table attributes: {'class': ['infobox', 'vevent'], 'style': 'width:22em'}
n rows inc headers: 16
column names: [<th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;background-color: khaki;">"Breakeven"</th>, <th class="description" colspan="2" style="text-align:center;background-color: khaki;"><a href="/wiki/Single_(music)" title="Single (music)">Single</a> by <a href="/wiki/The_Script" title="The Script">The Script</a></th>, <th class="description" colspan="2" style="text-align:center;background-color: khaki;">from the album <i><a href="/wiki/The_Script_(album)" title="The Script (album)">The Script</a></i></th>, <th scope="row"><span class="nowrap"><a href="/wiki/A-side_and_B-side" title="A-side and B-side">B-side</a></span></th>, <th scope="row">Released</th>, <th scope="row">Format</th>, <th scope="row">Recorded</th>, <th scope="row"><a href="/wiki/Music_genre" title="Music genre">Genre</a></th>, <th scope="row">Length</th>, <th scope="ro

# Tables

There are different classes of table, but the ones you want have a `class` of `wikitable` or `infobox`.

The `multicol` table seems to be a way of grouping multiple tables together inside one table. Other tables seem to contain navigation elements of the page.
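An element can carry several classes at once (the infobox above has both `infobox` and `vevent`), and BeautifulSoup stores them as a list. The sketch below shows how class matching behaves on a made-up three-table snippet. The `sortable` class and the table contents are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

snippet = """
<table class="infobox vevent"><tr><th>Released</th></tr></table>
<table class="wikitable sortable"><tr><th>Chart</th></tr></table>
<table class="multicol"><tr><td>layout only</td></tr></table>
"""

doc = BeautifulSoup(snippet, 'html.parser')

# class is multi-valued, so it is stored as a list of class names
infobox_classes = doc.find('table')['class']

# class_ matches a table if any one of its classes equals the given value
wikitables = doc.find_all('table', class_='wikitable')

# A list of values matches tables carrying at least one of them
wanted = doc.find_all('table', class_=['wikitable', 'infobox'])

print(infobox_classes, len(wikitables), len(wanted))
```

The `attrs={'class': 'wikitable'}` form used in this notebook matches in the same way as `class_='wikitable'`.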

Now, extract only the tables with class `wikitable`.

For each table, get the `th` elements and print the text they contain.

In [145]:
tables = soup.find_all('table', attrs={'class':'wikitable'})

for t in tables:
    # Print the stripped text of every header cell in this table
    print([r.text.strip() for r in t.find_all('th')])

['Chart (2008–12)', 'Peakposition']
['Chart (2009)', 'Position']
['Chart (2010)', 'Position']
['Chart (2011)', 'Position']
['Region', 'Certification', 'Certified units/sales', 'Australia (ARIA)[28]', 'Italy (FIMI)[29]', 'United Kingdom (BPI)[30]', 'United States (RIAA)[32]']


You can see that the first four tables have the two headers we expect, but the final one has some extra data. It looks like the contents of the first column's cells are being considered headers. Let's look at it directly.

Some of the `th` elements have an attribute `scope` with a value of `row` or `col`.

This is another way of laying out tables in HTML. Read about it here: https://www.w3schools.com/tags/att_th_scope.asp

We can adjust our code to deal with this.

In [146]:
tables = soup.find_all('table', attrs={'class':'wikitable'})

for table in tables:
    th_list = table.find_all('th', attrs={'scope':'col'})

    if not th_list:
        # No th elements with scope=col were found, so th_list is empty;
        # fall back to all the th elements in the table
        th_list = table.find_all('th')

    print([th.text.strip() for th in th_list])

['Chart (2008–12)', 'Peakposition']
['Chart (2009)', 'Position']
['Chart (2010)', 'Position']
['Chart (2011)', 'Position']
['Region', 'Certification', 'Certified units/sales']
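The `scope` attribute can also help when extracting the rows of a table laid out this way, because the first cell of each row is a `th` with `scope="row"` rather than a `td`. The following is a sketch on a made-up stand-in table; the region names and certification values are placeholders, not data from the page.

```python
from bs4 import BeautifulSoup

# Stand-in table with the same layout: column headers have scope="col",
# and the first cell of each data row is a <th scope="row">
snippet = """
<table class="wikitable">
  <tr><th scope="col">Region</th><th scope="col">Certification</th></tr>
  <tr><th scope="row">Region A</th><td>Gold</td></tr>
  <tr><th scope="row">Region B</th><td>Platinum</td></tr>
</table>"""

table = BeautifulSoup(snippet, 'html.parser').find('table')

header = [th.text.strip() for th in table.find_all('th', attrs={'scope': 'col'})]

rows = []
for tr in table.find_all('tr'):
    # Taking th[scope=row] and td together keeps the row label in column one
    cells = tr.find_all(lambda tag: tag.name == 'td'
                        or (tag.name == 'th' and tag.get('scope') == 'row'))
    if cells:  # the header row yields no matching cells, so it is skipped
        rows.append([c.text.strip() for c in cells])

print(header)
print(rows)
```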


As you did before, create a list of lists suitable for making a CSV file.

For each table, the first list should be the header. The subsequent lists should be the rows.

Just focus on the four "normal" tables for now.

Extract the header, extract each row and get the text.

In [237]:
import pandas as pd

tables = soup.find_all('table', attrs={'class':'wikitable'})[:4]

all_tables = []

for table in tables:
    # These four tables only use th for the column headers
    headers = [th.text.strip() for th in table.find_all('th')]

    rows = []
    for row in table.find_all('tr'):
        cells = [td.text.strip() for td in row.find_all('td')]

        if cells:  # the header row has no td cells, so skip it
            rows.append(cells)

    # Store each table as [header, rows]
    all_tables.append([headers, rows])


# Build one DataFrame per table
csvs = [pd.DataFrame(columns=header, data=rows) for header, rows in all_tables]

csvs[0]
    

|   | Chart (2008–12) | Peakposition |
|---|---|---|
| 0 | Australia (ARIA)[8] | 3 |
| 1 | Austria (Ö3 Austria Top 40)[9] | 75 |
| 2 | Canada (Canadian Hot 100)[10] | 20 |
| 3 | Germany (Official German Charts)[11] | 71 |
| 4 | Ireland (IRMA)[12] | 10 |
| 5 | Japan (Japan Hot 100)[13] | 77 |
| 6 | Netherlands (Single Top 100)[14] | 48 |
| 7 | Slovakia (Rádio Top 100)[15] | 97 |
| 8 | Switzerland (Schweizer Hitparade)[16] | 66 |
| 9 | UK Singles (Official Charts Company)[17] | 21 |
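A header plus a list of row lists can also be written out with the standard library's `csv` module, without going through pandas. This is a sketch using placeholder rows (the chart names and positions are made up) and an in-memory buffer in place of a real file.

```python
import csv
import io

# Same shape as one table extracted above: [header, rows]
table_data = [
    ['Chart (2009)', 'Position'],
    [['Chart A', '12'], ['Chart B', '40']],
]
header, rows = table_data

# An in-memory buffer; open('chart.csv', 'w', newline='') would write
# a real file instead (chart.csv is a hypothetical file name)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```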


#### Using Pandas to create DataFrames from HTML tables

In [190]:
import pandas as pd
tables = soup.find_all('table', attrs={'class':'wikitable'})

Pandas offers an alternative means by which to access table data from HTML.  

Use `pd.read_html` to create a list of DataFrames from the string of the `tables` object defined above, then display the final DataFrame in that list.
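One way to approach this is sketched below on a stand-in table with placeholder values. Two assumptions to note: `pd.read_html` needs an HTML parsing backend installed (`lxml`, or `html5lib` together with `bs4`), and recent pandas versions expect literal HTML to be wrapped in `StringIO` instead of being passed as a plain string.

```python
from io import StringIO

import pandas as pd

snippet = """
<table class="wikitable">
  <tr><th>Chart (2009)</th><th>Position</th></tr>
  <tr><td>Chart A</td><td>12</td></tr>
  <tr><td>Chart B</td><td>40</td></tr>
</table>"""

# read_html returns a list with one DataFrame per <table> it finds
frames = pd.read_html(StringIO(snippet))

final_frame = frames[-1]   # the last DataFrame in the list
print(final_frame)
```

In the notebook itself, `StringIO(str(tables))` should take the place of `StringIO(snippet)`.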

# Conclusion and next steps

You've practised using BeautifulSoup to apply what you know about HTML structure, in order to efficiently extract data from HTML.

In the first basic example, you used your knowledge of elements, before going on to extract data from a real-world example. This much more complicated document required you to use your knowledge of how HTML is structured in terms of attributes, nested relational positioning (children, parents) and relative position (previous siblings, next siblings).

The key steps to extracting data from HTML are:

* Know the structure of the HTML - read it!
* Determine what elements correspond to the content you want to extract
* Look for useful attributes you can use to target the content you want - `id` is especially great, but also `class`
* If you can't use attributes of the exact parts you want, look at the structure:
    * Can you target its parent, then select the correct child? Or vice-versa?
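The last step, targeting a nearby element and then navigating to the one you want, can be sketched as follows. The snippet and its `id` values are made up for illustration. The methods `find_parent`, `find_next_sibling` and `find_previous_sibling` belong to BeautifulSoup's navigation API.

```python
from bs4 import BeautifulSoup

snippet = """
<div id="content">
  <h2><span id="Certifications">Certifications</span></h2>
  <table><tr><td>no useful attributes here</td></tr></table>
  <h2><span id="References">References</span></h2>
</div>"""

doc = BeautifulSoup(snippet, 'html.parser')

# 1. Target something easy to find: the element with a known id
anchor = doc.find(id='Certifications')

# 2. Move up to its parent heading...
heading = anchor.find_parent('h2')

# 3. ...then across to the next table that follows that heading
table = heading.find_next_sibling('table')

# Going the other way: from the table back to the heading before it
previous_heading = table.find_previous_sibling('h2')

print(table.td.text, '|', previous_heading.text)
```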

To consolidate what you have covered here, try applying it to the `infobox` table and extracting the data from that. It has a slightly odd structure, compared to the other tables!

To continue practising, you can try the following:

* Get a collection of HTML documents that contain similar structures (e.g. multiple Wikipedia pages for songs) 
* Write functions that extract the data you want, process it, save it to disk
* Put them together into a script that can be used to process as many documents as you want
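As a starting point for such a script, here is a minimal sketch. The function names `extract_title` and `extract_all` are hypothetical, and it only pulls out each page's title. It leaves out the processing and saving steps, and demonstrates itself on two throwaway files in a temporary folder.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def extract_title(html_path):
    """Return the stripped <title> text of one HTML file, or None if absent."""
    with open(html_path, encoding='utf-8') as handle:
        doc = BeautifulSoup(handle, 'html.parser')
    return doc.title.text.strip() if doc.title else None


def extract_all(folder):
    """Apply extract_title to every .html file in a folder, keyed by file name."""
    return {path.name: extract_title(path)
            for path in sorted(Path(folder).glob('*.html'))}


# Demonstration on two throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ('song_a', 'song_b'):
        Path(tmp, f'{name}.html').write_text(
            f'<html><head><title>{name}</title></head></html>', encoding='utf-8')
    titles = extract_all(tmp)

print(titles)
```

Swapping `extract_title` for a function that extracts tables or links would follow the same pattern.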

BeautifulSoup can do a lot more than we have covered here. For example, you can use it to modify HTML too - perhaps you want to change all the headers of a particular class to a different size, while leaving others alone? Easy! Take a look at the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
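As a small taste of modifying HTML, the sketch below changes only the headings with a particular class. The class names `links` and `contact` are made up for illustration. Renaming a tag and setting an attribute are both done by plain assignment.

```python
from bs4 import BeautifulSoup

snippet = """
<h3 class="links">Other links to visit...</h3>
<h3 class="contact">Get in touch...</h3>
"""

doc = BeautifulSoup(snippet, 'html.parser')

# Change only the headings with class "links"; leave the others alone
for heading in doc.find_all('h3', class_='links'):
    heading.name = 'h2'                    # renaming the tag changes the element type
    heading['style'] = 'font-size: 150%'   # attributes are set like dictionary items

modified = str(doc)
print(modified)
```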

And be sure to familiarise yourself with the full range of elements in HTML: https://www.w3schools.com/html/default.asp