# Web scraping

Python has many libraries for reading and writing data in HTML and XML formats. Examples include `lxml` (http://lxml.de), `Beautiful Soup` (https://www.crummy.com/software/BeautifulSoup/) and `html5lib`. While lxml is much faster, Beautiful Soup and html5lib cope better with malformed HTML or XML files. `pandas` has a built-in function `read_html`, which uses Beautiful Soup and lxml under the hood. Since we might need to fix or parameterize things 'under the hood', we will also work through a Beautiful Soup case study in this course. Later on in this Programming 1 course we work with JSON, web APIs, *hierarchical data formats* like HDF5 files, and SQL databases.

Before we start working with web scraping libraries we need to install them. If the system does not have the libraries lxml, beautifulsoup4 and html5lib, we can install them in a virtual environment.

In [None]:
virtualenv -p /usr/bin/python3 venv
source venv/bin/activate

#install tools
pip3 install beautifulsoup4
pip3 install html5lib
pip3 install lxml
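
To check that the installation worked, one option is to ask Python whether each package can be found. The sketch below is only one way to do that; the helper name `check_packages` is made up for illustration. Note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# beautifulsoup4 is imported as 'bs4'; the other two keep their pip names
status = check_packages(['bs4', 'html5lib', 'lxml'])
for name, found in status.items():
    print(name, 'installed' if found else 'missing')
```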

## Scraping HTML

Now let us try to scrape the web with the built-in `pandas.read_html`. This does not always work, because issues such as SSL certificate verification can get in the way.

In [None]:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Tilburg_Trappers'
tables = pd.read_html(url)
tables[0]

Here we get a `URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)>` error.

We need to implement a workaround with BeautifulSoup directly. First we open the URL with `urllib.request.urlopen()`, passing the URL and an SSL context that does not verify the certificate. This is roughly what `wget --no-check-certificate` does in the terminal.


In [None]:
import urllib.request, urllib.parse, urllib.error
import ssl
from bs4 import BeautifulSoup
import re
import pandas as pd


def hack_ssl():
    """ ignores the certificate errors"""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def open_url(url):
    """ opens url"""
    ctx = hack_ssl()
    html = urllib.request.urlopen(url, context=ctx).read()
    return html
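
To see what `hack_ssl` actually changes, it can help to compare its context with the default one. The sketch below only inspects the contexts and makes no network request. It also shows that the order of the two assignments matters: Python refuses to set `CERT_NONE` while hostname checking is still enabled.

```python
import ssl

# Default context: certificates and hostnames are both verified
default_ctx = ssl.create_default_context()
print(default_ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(default_ctx.check_hostname)                    # True

# Same settings as hack_ssl(): hostname check off first, then verification off
unverified_ctx = ssl.create_default_context()
unverified_ctx.check_hostname = False
unverified_ctx.verify_mode = ssl.CERT_NONE

# Reversing the order fails, because check_hostname is still True
wrong_order_ctx = ssl.create_default_context()
try:
    wrong_order_ctx.verify_mode = ssl.CERT_NONE
    order_error = None
except ValueError as err:
    order_error = err
print('Wrong order raises:', order_error)
```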
    

`pd.read_html` fetches all the tables directly. With our own workaround we need another function to extract the tables from the HTML. When we inspect the HTML we can see that the table of interest is a `<table>` with the class `wikitable`, not a plain `<table>`. So we use a regex on the class attribute to find the wikitables.



In [None]:
def fetch_tables(html):
    """ parses the html with BeautifulSoup and returns the first element
        whose class matches 'wikitable'. input: html, output: first wikitable
    """
    soup = BeautifulSoup(html, 'html.parser')
    # find_all is the current name of the older findAll alias
    tables = soup.find_all(attrs={'class': re.compile(r".*\bwikitable\b.*")})
    return tables[0]
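
To get a feel for what the regular expression accepts, we can try it by hand on a few class strings. The class values below are made up for illustration, and the sketch only exercises the pattern itself with `re`, not Beautiful Soup.

```python
import re

pattern = re.compile(r".*\bwikitable\b.*")

# Hypothetical class attribute values, as they might appear on <table> tags
candidates = ['wikitable', 'wikitable sortable', 'infobox vcard',
              'mywikitable', 'wikitable-small']

# True where the pattern finds 'wikitable' as a separate word
results = {value: bool(pattern.search(value)) for value in candidates}
for value, matched in results.items():
    print(repr(value), matched)
```

Since `\b` treats the hyphen as a word boundary, a class such as `wikitable-small` matches as well, while `mywikitable` does not.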

We have now extracted the first table.

In [None]:
def main():
    html = open_url('https://en.wikipedia.org/wiki/Tilburg_Trappers')
    t = fetch_tables(html)
    print(t)
    return 0

    
main()

In the HTML documentation (https://www.w3schools.com/) we can see that `<table>` defines a table, `<tr>` defines a row and `<td>` defines a cell. We can use this to fetch the rows and columns and put them in a dataframe. Since the first `<tr>` row is used as a header (correct HTML would use `<th>`), we can use the first row as the column names and treat the following `<tr>` rows (`matrix[1:]`) as data. This is exactly the reason why we need to inspect the HTML code first. Most of the time it is crappy :-(

In [None]:
def table_df(table):
    """parses the html table to a pandas dataframe"""
    # fetch dimensions: number of rows and number of cells in the first row
    rows = table.find_all('tr')
    width = len(rows[0].find_all('td'))
    matrix = [['' for _ in range(width)] for _ in range(len(rows))]
    # fetch content, cell by cell
    for i, row in enumerate(rows):
        for j, column in enumerate(row.find_all('td')):
            matrix[i][j] = column.get_text().strip()
    # put in df making first row the header
    df = pd.DataFrame(matrix[1:], columns=matrix[0])
    return df
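
A quick way to check the row and cell logic without going online is to run the same steps on a small, hand-written table. The HTML below is made up for illustration (it is not the Wikipedia table), and the sketch assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Made-up table whose header row uses <td>, like the case described above
html = """
<table class="wikitable">
  <tr><td>Name</td><td>Goals</td></tr>
  <tr><td>Anna</td><td> 3 </td></tr>
  <tr><td>Bram</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# One list per <tr>, one stripped string per <td>
rows = [[cell.get_text().strip() for cell in tr.find_all('td')]
        for tr in table.find_all('tr')]
header, data = rows[0], rows[1:]
print(header)
print(data)
```

A row built from `<th>` cells would come back as an empty list here, which is one more reason to inspect the HTML before parsing it.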

Changing the main function will now print the dataframe.

In [None]:
def main():
    html = open_url('https://en.wikipedia.org/wiki/Tilburg_Trappers')
    t = fetch_tables(html)
    df = table_df(t)
    print(df)
    return 0

    
main()

Finally we have a dataframe to work with. We can now use NumPy to conduct the analysis and to perform calculations. Remember that the HTML is text, so we might need to convert the data to another data type first. More about NumPy and pandas in the next lecture.
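
Because every cell comes out of the HTML as a string, numeric columns usually have to be converted before any calculation. A minimal sketch with `pandas.to_numeric`, using made-up values rather than the scraped table:

```python
import pandas as pd

# Made-up data standing in for a scraped table: every value is a string
df = pd.DataFrame({'Name': ['Anna', 'Bram', 'Cor'],
                   'Goals': ['3', '5', 'n/a']})

# errors='coerce' turns values that cannot be parsed into NaN
df['Goals'] = pd.to_numeric(df['Goals'], errors='coerce')

print(df.dtypes)
print(df['Goals'].sum())  # NaN is skipped by sum()
```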

## Scraping XML

XML is another common structured data format supporting hierarchical, nested data with metadata. XML and HTML are structured similarly, but XML is more general. An example of XML is shown below.

<people>
  <person>
      <name>Fenna</name>
      <address>Maluslaan 116</address>
  </person>
  <person>
      <name>Kees</name>
      <address>Onbekend</address>
  </person>
</people>
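
The example above can be read with the standard library's `xml.etree.ElementTree`. A minimal sketch that walks over each `<person>` and collects the name and address:

```python
import xml.etree.ElementTree as ET

xml_text = """
<people>
  <person>
    <name>Fenna</name>
    <address>Maluslaan 116</address>
  </person>
  <person>
    <name>Kees</name>
    <address>Onbekend</address>
  </person>
</people>
"""

root = ET.fromstring(xml_text)

# findall returns the direct <person> children of the root element
people = [(p.find('name').text, p.find('address').text)
          for p in root.findall('person')]
print(people)
```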

We can fetch the XML tree by opening the URL, reading the data and decoding it.

In [None]:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://bioinf.nl/~fennaf/DSLS/plants.xml'
#url = 'http://www.phyloxml.org/examples/apaf.xml'
print('Retrieving', url)
uh = urllib.request.urlopen(url, context=ctx)

data = uh.read()  # read() returns bytes, not str
print('Retrieved', len(data), 'bytes')
print(data.decode())


If we need specific information, we can use ElementTree to fetch that data.

In [None]:
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Fenna</name>
  <phone type="intl">
    +31646080034
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))
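
Two details are easy to trip over with `find`: element text keeps its surrounding whitespace, and `find` returns `None` when a tag is absent. A small sketch on a similar record (the phone number is a placeholder, and the `fax` tag is deliberately missing):

```python
import xml.etree.ElementTree as ET

record = '''
<person>
  <name>Fenna</name>
  <phone type="intl">
    +31 000 000 000
  </phone>
  <email hide="yes" />
</person>'''

person = ET.fromstring(record)

phone = person.find('phone')
print(repr(phone.text))     # text includes the newlines and spaces
print(phone.text.strip())   # strip() removes them
print(phone.get('type'))    # attribute lookup

# A missing element gives None, so check before using .text
fax = person.find('fax')
fax_text = fax.text if fax is not None else 'no fax listed'
print(fax_text)

# findtext combines find and .text, with an optional default
print(person.findtext('fax', default='no fax listed'))
```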

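The same approach works on the data retrieved from the URL: we parse the downloaded bytes and loop over the child elements of the root.

In [None]: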
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://bioinf.nl/~fennaf/DSLS/plants.xml'
print('Retrieving', url)
uh = urllib.request.urlopen(url, context=ctx)

data = uh.read()
tree = ET.fromstring(data)
for child in tree:
    print('\n')
    for element in child:
        print(element.tag, element.text)
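
Instead of printing, the same nested loop can collect each record into a dictionary, which is a convenient input for a dataframe. The XML below is a made-up stand-in, since the layout of plants.xml is not shown here; the tag names and values are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: a root element with one child per record
xml_text = """
<catalog>
  <plant><common>Bloodroot</common><zone>4</zone></plant>
  <plant><common>Columbine</common><zone>3</zone></plant>
</catalog>
"""

root = ET.fromstring(xml_text)

# One dict per record: tag name -> text of that element
records = [{element.tag: element.text for element in child} for child in root]
print(records)
```

Passing `records` to `pd.DataFrame` would then give one row per record, with all values still stored as strings.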
