# Automating data querying with Python

Lucie Le Rolland

In [167]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Introduction and disclaimer

In this session, we are going to fetch data from remote sources, whether they are structured to be queried or a bit less so. 

One thing to always keep in mind is that somebody has to maintain the service and/or is paying for the bandwidth you're using for these queries. Always be considerate when querying an API or scraping a website. Make sure you're not asking for way more data than you need, and always remember that the following lines of code are your friends:

In [69]:
import time

time.sleep(1)  # Let's take a quick snooze

Even if you don't particularly care for the provider you're querying, remember that while a couple of well-timed requests won't alarm anyone, a sudden and brazen spike in requests may attract unwanted attention. 
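To make the "quick snooze" a habit, the pause can be built into the loop that does the querying. The sketch below is only one way to do it: `fetch_html` and `fetch_politely` are hypothetical helper names, and the default fetcher uses the standard library's `urllib` so that the block runs without extra packages. You could just as well pass a function that wraps `requests.get`.

```python
import time
from urllib.request import urlopen


def fetch_html(url):
    """Download one page and return its HTML as text (assumes UTF-8 content)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, fetch=fetch_html, delay=1.0):
    """Fetch each URL in turn, sleeping `delay` seconds between two requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # Let's take a quick snooze before the next request
        pages[url] = fetch(url)
    return pages
```

Calling `fetch_politely(list_of_urls, delay=2)` returns a dictionary mapping each URL to its HTML. A one-second delay is a starting point rather than a rule; some providers document their own rate limits.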

It also goes without saying that before scraping a website, you should check its terms of service to make sure that it's legal to do so. Before going through the trouble (or fun) of scraping the website, you can also just ask the owner if they'd be keen on sharing their data with you. Some might!

## Extracting data from the web with requests and BeautifulSoup

### What is a webpage?

Let's first take a few minutes to remind you of how HTML works. If you need in-depth information, there are plenty of guides available online.

- HTML is the language in which simple web pages are written. You can look up the HTML code of every page you're browsing by right-clicking anywhere on the page and clicking "Inspect".
- Information is organised through opening and closing tags (e.g. `<p>A paragraph</p>`). Among the most common, `<p>` denotes a paragraph, `<h1>` to `<h6>` are headers, and `<ul>` is a list where each entry is a `<li>`.
- Some of these tags have attributes (such as `id` or `class`), which can help you select them in the page.
- When you load a page, what your browser does is basically a GET API call. You can replicate it with requests. 
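To make the points above concrete, here is a small sketch that walks through a made-up HTML snippet and lists every opening tag with its attributes. It uses Python's built-in `html.parser` rather than the tools introduced later, only so that it runs with nothing installed. `SAMPLE_PAGE` and `TagLister` are names invented for this illustration.

```python
from html.parser import HTMLParser

# A tiny hand-written page: a header, a paragraph and a list
SAMPLE_PAGE = """
<h1>Shopping list</h1>
<p class="intro">Things to buy:</p>
<ul id="groceries">
  <li>Apples</li>
  <li>Bread</li>
</ul>
"""


class TagLister(HTMLParser):
    """Records every opening tag (with its attributes) and every piece of text."""

    def __init__(self):
        super().__init__()
        self.opened = []  # (tag name, attributes) pairs, in document order
        self.texts = []   # non-empty text found between tags

    def handle_starttag(self, tag, attrs):
        self.opened.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


lister = TagLister()
lister.feed(SAMPLE_PAGE)

for tag, attributes in lister.opened:
    print(tag, attributes)
print(lister.texts)
```

The `class` and `id` attributes that show up next to `p` and `ul` are exactly the kind of hooks you look for when deciding how to select an element.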

When scraping, the general strategy is to read the HTML code of your page with Inspect, understand its structure, then get the whole HTML with requests and extract the elements you need from it.

Scraping is an artisanal process that relies on trial and error. There isn't generally one true way to extract the data you need. You need to balance the amount of effort you put into writing the code with how reusable you need it to be, and how well you need it to generalize to new pages.

For this reason, when you scrape a big corpus (several hundred or several thousand different pages), where the `time.sleep()` constraint weighs on the time it takes to build the dataset, it's generally a good idea to save the HTML files as you go. This way, if it turns out that your code didn't anticipate a specific feature that appears on some pages, you won't have to download everything all over again.
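One possible way to save pages as you go is a small on-disk cache: each URL is downloaded once, written to a file, and read back from that file on later runs. This is a sketch under assumptions: `cached_fetch` and the `html_cache` folder are hypothetical names, and `fetch` stands for whatever function you use to download a page (for instance one that wraps `requests.get(url).text`).

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="html_cache"):
    """Return the HTML for `url`, downloading it only if it isn't saved yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # URLs contain characters that are awkward in file names, so we name
    # each file after a hash of the URL instead
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / file_name

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    html = fetch(url)  # Only reached the first time we see this URL
    cache_file.write_text(html, encoding="utf-8")
    return html
```

With this in place, fixing a bug in the parsing code and re-running the whole pipeline only costs disk reads, not new requests to the website.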

### Parsing HTML with BeautifulSoup

That extraction phase is where BeautifulSoup comes along. It does quite a bit of invisible stuff to help you, such as fixing broken HTML - which used to plague the web - and dealing with encodings. It also helps you navigate the HTML by building it as a tree. [This section of the documentation](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree) contains a lot of information to properly look things up in your HTML. 

Let's look at an example.

In [170]:
import requests

scraped_url = "https://www.w3docs.com/learn-html/table-of-html-tags.html"

list_of_tags = requests.get(scraped_url)

with open('list_of_tags.html', 'w') as file:  # This is how you'd save the file
    file.write(list_of_tags.text)

In [171]:
import pandas as pd
from bs4 import BeautifulSoup

souped_list_of_tags = BeautifulSoup(list_of_tags.text)

# Alternatively, we could read the html we saved

with open('list_of_tags.html', 'r') as file:
    souped_read_list_of_tags = BeautifulSoup(file.read())

list_of_tag_tables = []

for table in souped_list_of_tags.find_all('table'):    # We look up all tables
    headers = [header.string for header in table.find_all('th')]   # First let's extract the headers
    entries = {header: [] for header in headers}
    table_body = table.find('tbody')  # find looks up just the first element that matches the criterion
    for line in table_body.find_all('tr'):
        for cell_number in range(len(headers)):
            entries[headers[cell_number]].append(line.find_all('td')[cell_number].text)

    list_of_tag_tables.append(pd.DataFrame(entries))


In [172]:
list_of_tags_df = pd.concat(list_of_tag_tables)
print(list_of_tags_df.head())  # A bit annoying: one of the tables has a "Descriptions" column rather than "Description"

for table in list_of_tag_tables:
    table.rename({"Descriptions": "Description"}, axis=1, inplace=True)  # Harmonise the column name in every table

list_of_tags_df = pd.concat(list_of_tag_tables)

print(list_of_tags_df)

          Tag                                        Description Descriptions
0  <!DOCTYPE>                     Sets the type of the document.          NaN
1      <html>                            Sets an HTML document.           NaN
2      <head>  Contains general information (metadata) about ...          NaN
3     <title>                     Sets a title of the document.           NaN
4      <body>                Specifies the body of the document.          NaN
           Tag                                        Description
0   <!DOCTYPE>                     Sets the type of the document.
1       <html>                            Sets an HTML document. 
2       <head>  Contains general information (metadata) about ...
3      <title>                     Sets a title of the document. 
4       <body>                Specifies the body of the document.
..         ...                                                ...
1   <noscript>  Defines an alternate content to be displayed i...
2   

Now what if I weren't interested in all the tables, but just the basic tags? Then we'd need to look up only a single table.

It's a bit tricky because all tables seem to have the same attributes. The only leverageable thing seems to be the header right before. This is where this all becomes quite artisanal! (And you may have better ideas or ones that I didn't see!)

In [135]:
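# next_sibling is called twice: the first sibling of the header is presumably
# the whitespace between the tags, the second one is the table itself
basic_tags_html = souped_list_of_tags.find('h2', id='basic-tags-2').next_sibling.next_sibling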

headers = [header.string for header in basic_tags_html.find_all('th')]   # First let's extract the headers
entries = {header: [] for header in headers}
table_body = basic_tags_html.find('tbody')  # find looks up just the first element that matches the criterion
for line in table_body.find_all('tr'):
    for cell_number in range(len(headers)):
        entries[headers[cell_number]].append(line.find_all('td')[cell_number].text)
        
basic_tags_df = pd.DataFrame(entries)
basic_tags_df

|  | Tag | Description |
|---|---|---|
| 0 | `<!DOCTYPE>` | Sets the type of the document. |
| 1 | `<html>` | Sets an HTML document. |
| 2 | `<head>` | Contains general information (metadata) about ... |
| 3 | `<title>` | Sets a title of the document. |
| 4 | `<body>` | Specifies the body of the document. |
| 5 | `\n<h1> to <h6>\n` | `Defines \r\nHTML headings.\r\n` |
| 6 | `<p>` | Defines a paragraph. |
| 7 | `<br>` | Specifies a line break. |
| 8 | `<hr>` | Inserts a horizontal line or defines a themati... |
| 9 | `<!-- ... -->` | Defines a comment. |
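Two things in the lookup above could be made a little sturdier, and the sketch below shows one option for each on a small hand-written page. First, Beautiful Soup's `find_next_sibling` searches forward for the next sibling with a given tag name, so it doesn't depend on how many whitespace nodes sit between the header and the table. Second, the table-reading loop can live in a helper so it isn't copied around. The sample HTML, the `basic-tags` id and the `table_to_rows` helper are all made up for this illustration, and `get_text(strip=True)` is used to drop stray whitespace like the `\n` visible in row 5.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page: a header, some text, then a table
SAMPLE_HTML = """
<h2 id="basic-tags">Basic tags</h2>
<p>Some text between the header and the table.</p>
<table>
  <thead><tr><th>Tag</th><th>Description</th></tr></thead>
  <tbody>
    <tr><td>&lt;p&gt;</td><td>Defines a paragraph.</td></tr>
    <tr><td>&lt;br&gt;</td><td>
      Specifies a line break.
    </td></tr>
  </tbody>
</table>
"""


def table_to_rows(table):
    """Turn a <table> element into a list of {header: cell text} dictionaries."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for line in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in line.find_all("td")]
        rows.append(dict(zip(headers, cells)))
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header = soup.find("h2", id="basic-tags")
table = header.find_next_sibling("table")  # Skips the text and the <p> in between
rows = table_to_rows(table)
print(rows)
```

The resulting list of dictionaries can be handed to `pd.DataFrame(rows)` to get the same kind of dataframe as before.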


## Automating navigation on a website with Selenium

In its quest to scrape as much HTML as possible, requests has one enemy: interactivity. Basically, everything that happens after the page has loaded and that you can interact with (such as all this pesky JavaScript) is out of reach for requests.

Let's look at an example (courtesy of a student from last year; happy to switch to another source if you have one).

In [144]:
commercial_registry = 'https://cres.gov.ai/bereg/searchbusinesspublic#'

commercial_registry_html = BeautifulSoup(requests.get(commercial_registry).text)

In [145]:
commercial_registry_html   # The table is not there!!

<html><body><p>ï»¿<!DOCTYPE html>

</p>
<meta charset="utf-8"/>
<script src="js/base.js?_=13" type="application/javascript"></script>
<title>Anguilla Commercial Registry</title>
<meta content="description" name="description"/>
<meta content="DevOOPS" name="author"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="css/base.css?_=13" rel="stylesheet"/>
<link href="css/all.css?_=2962" rel="stylesheet"/>
<link href="apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="apple-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
<link href="apple-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
<link href="apple-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="apple-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
<link href="apple-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
<link href="apple-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
<link href="apple-icon-152x152.png" rel

Thankfully, there's a solution to that problem: Selenium. This module, which was primarily developed to test web applications, can also be leveraged to scrape some precious data.

Selenium basically lets your code drive a web browser. The first thing you need to do before using Selenium is to install a driver, the program through which Selenium controls that browser. There are [drivers for all major web browsers](https://selenium-python.readthedocs.io/installation.html#drivers). In this class, we'll use Firefox, but some pages may not be compatible with all drivers, so keep in mind that you can switch them up. Each OS has its own driver. I've provided compressed versions for the three main operating systems, so let's first install the right one.

Selenium allows you to find HTML objects that are the results of an interaction, and either investigate their contents or interact with them (click on something, enter some text in a field, etc). Let's first watch Selenium extract data using the same example. 

In [149]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get(commercial_registry)

In [160]:
# This example is very similar to what we did with the html tag example 
# I even copied, pasted and slightly modified some snippets from above to match the find_element vs find_elements
# structure of Selenium, and the (By.(something), 'something') logic.

table = driver.find_element(By.TAG_NAME, 'table')

headers = [header.text for header in table.find_elements(By.TAG_NAME, 'th')]
entries = {header: [] for header in headers}

table_body = table.find_element(By.TAG_NAME, 'tbody')

for line in table_body.find_elements(By.TAG_NAME, 'tr'):   # Each tr is a single line of the table
    for cell_number in range(len(headers)):
        entries[headers[cell_number]].append(line.find_elements(By.TAG_NAME, 'td')[cell_number].text)

entries_df = pd.DataFrame(entries)

Even if requests had been able to read the table, it wouldn't have been able to turn the pages on the website. With Selenium, you can do that. Let's find this button and click it.

In [166]:
page_turner = driver.find_element(By.CLASS_NAME, "jtable-page-number-next")
page_turner.click()  # click() returns None, so we keep the element and the click separate

And we're on page 2! 

With a well-constructed loop, we can scrape the whole thing. Let's try to build that loop. 

In [None]:
# Let's do this in class. Let's also tackle some other scraping examples you may have. 
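As a starting point for that loop, here is one possible shape for it. The loop itself knows nothing about Selenium: it takes two functions, one that reads the rows of the current page and one that moves to the next page and reports whether it could. This keeps the logic easy to try without a browser. Everything here is a sketch: `scrape_all_pages` and `selenium_callbacks` are hypothetical names, and the way the last page is detected (a "disabled" class on the next button) is an assumption to verify with Inspect.

```python
import time


def scrape_all_pages(read_page, go_to_next_page, delay=1.0, max_pages=500):
    """Collect rows page after page until there is no next page.

    read_page() returns the list of rows on the current page.
    go_to_next_page() moves forward and returns False on the last page.
    """
    all_rows = []
    for _ in range(max_pages):  # Hard cap, so a mistake can't loop forever
        all_rows.extend(read_page())
        if not go_to_next_page():
            break
        time.sleep(delay)  # Be polite, and give the table time to reload
    return all_rows


def selenium_callbacks(driver):
    """Hypothetical wiring of the two functions for the registry table."""
    # Imported here so the loop above can be tried without Selenium installed
    from selenium.webdriver.common.by import By

    def read_page():
        lines = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
        return [[cell.text for cell in line.find_elements(By.TAG_NAME, "td")]
                for line in lines]

    def go_to_next_page():
        button = driver.find_element(By.CLASS_NAME, "jtable-page-number-next")
        # Assumption: on the last page the button carries a "disabled" class
        if "disabled" in (button.get_attribute("class") or ""):
            return False
        button.click()
        return True

    return read_page, go_to_next_page
```

With an open `driver`, the call would look like `rows = scrape_all_pages(*selenium_callbacks(driver))`. A fixed `time.sleep` is the simplest way to wait for the next page; if the table loads slowly, an explicit wait on the new rows would be more reliable.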

Selenium is a very powerful tool. Do note that browsers can now be run "headless", meaning that they don't need to open a window for you to interact with them. We haven't used headless mode in class so that we could keep an eye on what we're doing, but you would probably switch it on when running your giant scraping function on thousands of pages. Not needing a window used to be one of the big upsides of Selenium compared to a browser by itself. In the future, we'll probably scrape with browsers directly rather than through emulation!

Do note that in spite of your best efforts, scraping code is fragile at best. Website structures change as content is added to them or as they are revamped. Before you invest a lot of time in scraping, make sure that (1) there's no other way to get the data and (2) you'll be able to stay flexible enough to adapt your code.
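For reference, here is a minimal sketch of what switching to headless mode could look like with Firefox. The helper name `open_headless_firefox` is made up, and the CSS selector it waits for (`table tbody tr`) is an assumption that fits a page whose data sits in a table. Since nobody is watching the window, the function also waits explicitly until at least one table row is present before handing back the driver.

```python
def open_headless_firefox(url, timeout=10):
    """Open `url` in a windowless Firefox and wait until a table row shows up."""
    # Imported inside the function so that defining it doesn't require Selenium
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # Run Firefox without opening a window

    driver = webdriver.Firefox(options=options)
    driver.get(url)

    # Raises a TimeoutException if no row appears within `timeout` seconds
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
    )
    return driver
```

You would use it as `driver = open_headless_firefox(commercial_registry)` and call `driver.quit()` once you're done, since a headless browser left running is easy to forget.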