# Week 5 - How to ethically scrape the web
*© 2020 Colin Conrad*

Welcome to Week 5 of INFO 6270! Last week we covered ways of working with files on your local computer using the csv and PyPDF2 libraries, which was our final module on basic Python programming. There will be no new programming concepts from this point onwards! Instead, this week we will start learning about data science topics in earnest by applying our programming skills to web data.

When designing this course I had envisioned that we would explore both Chapters 12 and 18 from [Sweigart (2020)](https://automatetheboringstuff.com/). However, I now believe it is important to _focus only on Chapter 12_ and leave Chapter 18 (on sending email and text messages with Python) to those who are interested.

**This week, we will achieve the following objectives:**
- Search for books using the Open Library API
- Use an API to retrieve Open Library collections data
- Retrieve and process webpage data
- Ethically scrape XKCD comics

Weekly reading: Sweigart (2020) Ch. 12.

**Note: There are 8 (mostly smaller) challenge questions this week, as opposed to the 5 given in previous weeks**

# Case: Open Library
Though there are virtually infinite uses for web application programming interfaces (APIs), one of the most tangible and easy to use is that provided by [Internet Archive's](https://archive.org/) Open Library. The vision of the [Open Library](https://openlibrary.org/) is to make all of humanity's published works freely available to everyone in the world. It does this by providing a digital collection of books in a variety of formats, ranging from text to Kindle.

The [Open Library's API](https://openlibrary.org/developers/api) gives detailed documentation about how to access and use data retained on their system. Though we will not use the API to retrieve book content, we will use it to navigate their collection and retrieve their library system data. While book content may be copyrighted, their system data is [freely available for web developers to use](https://openlibrary.org/developers/licensing). For more information about open data licenses, please refer to the documentation on [opensource.org](https://opensource.org/licenses).

There are many APIs which may be useful to you in your research. You may be interested in checking out some of the following free APIs:
- [Open Corporates](https://api.opencorporates.com/) - A large repository of company information;
- [NASA](https://api.nasa.gov/) - NASA images galore! (Requires a key);
- [Chuck Norris jokes](http://www.icndb.com/api/) - A simple Chuck Norris joke generator;
- [REST Countries](https://restcountries.eu/) - Country information;
- [Reddit](https://www.reddit.com/dev/api/) - Access to social media data.

# Objective 1: Search for books using the Open Library API
As many of you have likely encountered in the past, application programming interfaces (APIs) are a critical piece of computer infrastructure, particularly for web applications. APIs are interfaces that define how pieces of software communicate with each other. In the case of web APIs, they often govern how computers exchange data over the internet. These days, internet data is often exchanged in JavaScript Object Notation (JSON) format, which we explored briefly in Week 3.

Python has a few great libraries for retrieving and managing JSON data. The first library we will explore is `requests`. Before we proceed, please ensure that you have this library installed. If not, install it using `pip install requests`, as explored in previous weeks.

In [1]:
import requests # import the library

Though it might seem like wizardry at times, all the `requests` library does is allow us to make web requests in much the same way a web browser does. Building on Sweigart's example, we could use requests to retrieve a web page from a particular URL. For example, the following code retrieves the results of a request (in this case, the HTML code of the Python.org home page) and saves it in the specified variable.

In [2]:
resp = requests.get('https://python.org') # retrieves the Python.org home page
resp # tell us what this is

<Response [200]>

If you execute the code above successfully you should get something along the lines of `<Response [200]>`. This denotes a response object whose request succeeded (200 is HTTP's status code for success). If we wanted to see the content of the response we could try typing the following.

In [5]:
resp.text # give us the text result from the request



The text generated above is actually the HTML code for *Python.org*. When rendered by a web browser, it creates a nice interface, though here it does not. You can take a look at the code next to the page in Chrome below. 

### Figure 1 - Demonstration of the web page code
![alt text](img/5-1.png "Python.org")

Similarly, we can make web requests to an API to retrieve data. For those of you who took INFO 5590, you may recall a lab where we retrieved JSON data using an API through either a web browser or Linux shell, possibly using the Open Library API. We will explore this in more detail throughout this lab. 

Open a web browser then copy and paste the following URL and see what happens: `http://openlibrary.org/search.json?q=brave+new+world`. 

You will likely be given a wall of JSON text. These are the results of a request to the Open Library API for 'Brave New World', one of my all-time favorite novels. We can retrieve the same results using `requests` in Python.

In [6]:
# save the response as a variable and retrieve the JSON data

response = requests.get('http://openlibrary.org/search.json?q=brave+new+world')
response.json()

{'start': 0,
 'num_found': 463,
 'numFound': 463,
 'docs': [{'title_suggest': 'Brave New World',
   'edition_key': ['OL22123296M',
    'OL25413824M',
    'OL23757347M',
    'OL13806517M',
    'OL9229851M',
    'OL14130759M',
    'OL6770033M',
    'OL6474536M',
    'OL19834566M',
    'OL7275142M',
    'OL6289093M',
    'OL19303376M',
    'OL17349234M',
    'OL20879326M',
    'OL16824082M',
    'OL22604809M',
    'OL23266684M',
    'OL6504102M',
    'OL22810116M',
    'OL23757350M',
    'OL14243013M',
    'OL17727850M',
    'OL13547921M',
    'OL16602680M',
    'OL22756376M',
    'OL18286548M',
    'OL6068250M',
    'OL22830653M',
    'OL22605471M',
    'OL22406097M',
    'OL19856478M',
    'OL19104413M',
    'OL19303365M',
    'OL22605539M',
    'OL16824209M',
    'OL6199917M',
    'OL22802346M',
    'OL16162200M',
    'OL15052604M',
    'OL20150225M',
    'OL19856336M',
    'OL21070749M',
    'OL22111634M',
    'OL19132839M',
    'OL5572119M',
    'OL19333295M',
    'OL20344586M',
    

You can probably see where this is going. When we do this in Python we retrieve the data in JSON format easily and use it like a Python dictionary. This gives us a lot of power; in the wise words of Uncle Ben 'with great power comes great responsibility'.
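The search URL used above follows a simple pattern: the search address, then `?q=`, then the query with spaces replaced by `+`. As a minimal sketch, the standard library can assemble it from a plain-text query. The helper name `build_search_url` is my own invention for illustration and is not part of the Open Library API.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'http://openlibrary.org/search.json'  # the address used earlier in this lab


def build_search_url(query):
    """Return the search URL for a plain-text query (hypothetical helper)."""
    # urlencode turns spaces into '+' and escapes other unsafe characters
    return SEARCH_ENDPOINT + '?' + urlencode({'q': query})


print(build_search_url('brave new world'))
```

The resulting string can be passed straight to `requests.get()`. The `requests` library can also do this step for you if you pass a dictionary through the `params` argument, as in `requests.get(SEARCH_ENDPOINT, params={'q': 'brave new world'})`.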

### *Challenge Question 1 (1 point)*
The Open Library API provides [documentation on performing searches](https://openlibrary.org/dev/docs/api/search) using the API. In a previous example, we searched for Aldous Huxley's *Brave New World*. Modify the previously used code to conduct a search of your own, retrieve the results as JSON data, and display the JSON data in Jupyter.

In [7]:
response = requests.get('http://openlibrary.org/search.json?q=game+of+thrones')
response.json()

{'start': 0,
 'num_found': 139,
 'numFound': 139,
 'docs': [{'title_suggest': 'Game of Thrones',
   'edition_key': ['OL27912404M',
    'OL27912393M',
    'OL26641908M',
    'OL25420179M',
    'OL26426005M',
    'OL26425721M',
    'OL26425712M',
    'OL26425705M',
    'OL26425699M',
    'OL26425671M',
    'OL25140747M',
    'OL26425346M',
    'OL26425339M',
    'OL26425278M',
    'OL26425265M',
    'OL9213872M',
    'OL7826547M',
    'OL7829767M',
    'OL20507908M',
    'OL26724040M',
    'OL26745451M',
    'OL25447024M',
    'OL7255733M',
    'OL17217841M',
    'OL807276M',
    'OL26087035M',
    'OL7259157M',
    'OL7817274M',
    'OL8718362M',
    'OL26425703M',
    'OL24283280M',
    'OL9478797M',
    'OL7914095M',
    'OL7830295M',
    'OL7830296M',
    'OL26425327M',
    'OL25226226M',
    'OL27312631M',
    'OL27325765M',
    'OL27010745M',
    'OL27264224M',
    'OL26942180M',
    'OL27102457M',
    'OL26425337M',
    'OL25302821M',
    'OL26425330M',
    'OL26424812M',
    'OL2

### *Challenge Question 2 (1 point)*
Retrieve the number of items found from your search above. **Hint:** you should be able to do that by saving your response.json() as a variable and retrieving the dictionary value of the `num_found` key.

In [8]:
# saves the JSON results as a variable and retrieves the num_found

# insert code here
count= response.json()
count['num_found']

139
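The search results behave like dictionaries and lists nested inside each other. The sketch below uses a trimmed copy of the Brave New World results shown earlier, so it runs without an internet connection. The variable names are mine, and a real response contains many more keys than this sample.

```python
import json

# A trimmed copy of the Brave New World search results shown earlier
sample_text = '''
{"start": 0,
 "num_found": 463,
 "docs": [{"title_suggest": "Brave New World",
           "edition_key": ["OL22123296M", "OL25413824M", "OL23757347M"]}]}
'''

results = json.loads(sample_text)      # the same kind of object response.json() returns
first_doc = results['docs'][0]         # 'docs' is a list of dictionaries

print(results['num_found'])            # a top-level key
print(first_doc['title_suggest'])      # a key inside the first result
print(first_doc['edition_key'][0])     # the first OLID in the nested list
```

Each extra pair of square brackets moves one level deeper into the data, which is the same idea you will need when working with book queries later in this lab.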

### *Challenge Question 3 (1 point)*
This is probably the first time where you can tangibly see how your hard work has paid off. You now have a super power: the ability to search for web data results. You also already have a bunch of tools that you can use to extend this power. For instance, we can save our results to our computer if we would like.

#### Reminder: last time you used the following code to write a text file from PDF documents

In [9]:
pip install pypdf2

Note: you may need to restart the kernel to use updated packages.


In [15]:
import os
os.getcwd()

'C:\\Users\\Jaswanth\\Desktop\\Data Science\\Lab5'

In [17]:
councillors_text = open('data/councillors.txt', 'w') # opens a new write file called councillors.txt in the data folder
councillors_text.write(str(response.json())) # writes the search results to the txt file as text
councillors_text.close() # closes the txt file

#### *Modify the above code to write the JSON output to a text file here:*

In [19]:
import json
councillors_text = open('data/gameofthrones.txt', 'w') # opens a new write file called gameofthrones.txt in the data folder
councillors_text.write(json.dumps(count)) # writes the search results to the txt file as JSON text
councillors_text.close() # closes the txt file
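A saved file is most useful if you can read it back later. As a sketch, the standard `json` module can write a dictionary to disk and restore it unchanged. The small `results` dictionary and the file name here are stand-ins I made up for illustration, and the temporary folder keeps the sketch away from your real `data` folder.

```python
import json
import os
import tempfile

# Stand-in for the dictionary returned by response.json()
results = {'start': 0, 'num_found': 139, 'docs': [{'title_suggest': 'Game of Thrones'}]}

# A temporary folder keeps this sketch from touching your real data folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'gameofthrones.json')   # hypothetical file name

with open(path, 'w', encoding='utf-8') as out_file:
    json.dump(results, out_file)       # writes valid JSON text

with open(path, encoding='utf-8') as in_file:
    restored = json.load(in_file)      # reads it back into a dictionary

print(restored == results)
```

Unlike `str()`, which writes Python's own text representation of a dictionary, `json.dump` writes text that any JSON tool can read.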

# Objective 2: Use an API to retrieve Open Library collections data
Let's take a closer look at how the Open Library API works. [According to their documentation](https://openlibrary.org/dev/docs/api/books), we also have the ability to retrieve particular book information. The books are indexed by many keys, including ISBN numbers and a unique Open Library ID key (OLID). Using these keys we can retrieve data about the particular books in question.

When you conducted a search for Brave New World earlier, you retrieved a series of OLID keys, the first of which was `OL22123296M`. Using that key, we can retrieve the data for this particular collection item. Just like everything in this lab, we make a request over the internet using a specific URL. The structure of an Open Library query is as follows:
- It begins with the call for book data (rather than, say, a search): `https://openlibrary.org/api/books?` 
- It then adds the key information: `bibkeys=OLID:OL22123296M`
- And completes by stating the desired format: `&format=json`

This leads us to the following URL call: `https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json`. Try calling this request below.

In [20]:
import requests # retrieve the requests library
request = requests.get('https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json')
bnw_info = request.json() # as before
bnw_info

{'OLID:OL22123296M': {'bib_key': 'OLID:OL22123296M',
  'preview': 'noview',
  'preview_url': 'https://openlibrary.org/books/OL22123296M/Brave_new_world',
  'info_url': 'https://openlibrary.org/books/OL22123296M/Brave_new_world'}}

The data saved in the `bnw_info` variable is now accessible as a dictionary. If we want to retrieve the preview URL we can execute the code below. Consider copying the result into your web browser!

In [21]:
bnw_info['OLID:OL22123296M']['preview_url'] # note that there are two levels in this dictionary

'https://openlibrary.org/books/OL22123296M/Brave_new_world'
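The three parts of the book query URL described above, and the two-level dictionary lookup, can be wrapped into small functions. This is only a sketch: `build_book_url` and `get_preview_url` are hypothetical helpers of my own, and the response is copied from the output above so that the code runs offline.

```python
BOOKS_ENDPOINT = 'https://openlibrary.org/api/books?'


def build_book_url(olid):
    """Return the book query URL for one OLID (hypothetical helper)."""
    return BOOKS_ENDPOINT + 'bibkeys=OLID:' + olid + '&format=json'


def get_preview_url(book_info, olid):
    """Pull the preview URL out of a parsed book response (hypothetical helper)."""
    return book_info['OLID:' + olid]['preview_url']


# The response shown above, copied here so the sketch runs offline
bnw_info = {'OLID:OL22123296M': {
    'bib_key': 'OLID:OL22123296M',
    'preview': 'noview',
    'preview_url': 'https://openlibrary.org/books/OL22123296M/Brave_new_world',
    'info_url': 'https://openlibrary.org/books/OL22123296M/Brave_new_world'}}

print(build_book_url('OL22123296M'))
print(get_preview_url(bnw_info, 'OL22123296M'))
```

Note that the outer dictionary key includes the `OLID:` prefix, which is why the helper adds it back before looking up the book.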

### *Challenge Question 4 (1 point)*
In Challenge Question 1 you searched for a book and retrieved a series of OLID keys. Using one of these keys, in the cell below conduct a book query and provide the results using print or by calling the request. 

In [22]:
response = requests.get('http://openlibrary.org/api/books?bibkeys=OLID:OL27912393M&format=json')

book=response.json()
response.json()

{'OLID:OL27912393M': {'bib_key': 'OLID:OL27912393M',
  'preview': 'borrow',
  'thumbnail_url': 'https://covers.openlibrary.org/b/id/9269938-S.jpg',
  'preview_url': 'https://archive.org/details/gameofthronesaso00geor',
  'info_url': 'http://openlibrary.org/books/OL27912393M/A_Game_of_Thrones'}}

### *Challenge Question 5 (1 point)*
[Sweigart (2020)](https://automatetheboringstuff.com/2e/chapter12/) provides code for ordering Python to open your web browser to a specified URL. In theory, we could combine this code with the Open Library API to create a simple app for reading books. Retrieve the `preview_url` for your book as demonstrated above and use the `webbrowser.open()` function to order your web browser to open the book preview. 

You can use this skill in many different ways. I would quote Spiderman again, though it will lose its impact if done too much. 

In [23]:
import webbrowser
webbrowser.open(book['OLID:OL27912393M']['preview_url'])


True

# Objective 3: Retrieve and process webpage data
In addition to APIs, we can also use Python to retrieve and process regular web data. The last time we tried this using the `requests` module, we retrieved a wall of barely readable HTML text. It would be much easier to process this type of data if there were a library designed for it.

Fortunately, Python has `Beautiful Soup` which is designed exactly for this task. This library structures HTML data retrieved using requests in a way that is not only readable, but also manageable. For instance, if we wanted to retrieve the Open Library home page, we could execute the following code.

**Note: It is possible that the Beautiful Soup library (imported as `bs4`) is not installed. If not, use `pip install beautifulsoup4` before executing this code.**

In [25]:
import bs4 # import the Beautiful Soup library
res = requests.get('https://openlibrary.org/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
print(librarySoup) #print the HTML


<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="" name="title"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="OpenLibrary.org" name="author">
<meta content="OpenLibrary.org" name="creator"/>
<meta content="Original content copyright; 2007-2015" name="copyright"/>
<meta content="Global" name="distribution"/>
<link href="https://openlibrary.org/" rel="canonical"/>
<link href="/static/images/openlibrary-120x120.png" rel="apple-touch-icon"/>
<link href="/static/images/openlibrary-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<link href="/static/images/openlibrary-167x167.png" rel="apple-touch-icon" sizes="167x167"/>
<link href="/static/images/openlibrary-180x180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/static/images/openlibrary-192x192.png" rel="icon" sizes="192x192"/>
<link href="/static/images/openlibrary-12

Beautiful Soup has a few handy functions that greatly lighten our load when processing web data. We could save this HTML data by opening a file and writing the content of the retrieved page to it. For instance, the following code saves the Open Library page retrieved above to the `data` folder on our local computer.

In [27]:
exampleFile = open('data/example.html', 'w', encoding='utf-8') # we need to explicitly state UTF8 encoding
exampleFile.write(str(librarySoup)) # writes the file
exampleFile.close() # closes the html file

Try opening the file using a code editor such as Notepad++. You will see that you have just copied Open Library's web page; that is to say, you **scraped** Open Library's web page. This example illustrates how computers access and process web data. Web scrapers also form the backbone of search engine technology, as well as the Internet Archive's software.

Web scrapers are ubiquitous, though using them is not necessarily legal in every circumstance. Many (or perhaps even most) web materials are copyrighted (e.g. many newspaper articles), and their owners may not permit you to access their data in this way. Fortunately, the Internet Archive allows scholars to access their materials. Other sites may not be so generous.

### Retrieving specific web data
Using Beautiful Soup we can also access particular page elements. HTML documents consist of a series of elements, which are identified by tags (e.g. `<div>`) and can carry attributes such as a class (e.g. the logo class, selected with `.logo`). Beautiful Soup helps us to navigate these elements so that we can retrieve the data that we want, rather than whole web pages.

This is better expressed using an example. If we wanted to retrieve data from specific elements from the Open Library web page, we can use the `select` method to retrieve that data. The following code retrieves only data which is contained in their `page-banner` class (usually reserved for important catch phrases).  

In [28]:
res = requests.get('https://openlibrary.org/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
elems = librarySoup.select('.page-banner') # select only elements with the page-banner class
elems # print the data retrieved

[<div class="page-banner page-banner-black page-banner-center">
 <div class="iaBar">
 <div class="iaBarLogo">
 <a href="https://archive.org"><img alt="Internet Archive logo" src="/static/images/ia-logo.svg" width="160"/></a>
 </div>
 <div class="iaBarMessage">
 <a class="ghost-btn" data-ol-link-track="IABar|DonateButton" href="https://archive.org/donate?platform=ol" style="text-underline">Donate <span aria-hidden="true" class="heart">♥</span></a>
 </div>
 </div>
 </div>, <div class="page-banner page-banner-body">
 <strong>Together</strong>, let's build an Open Library for the World. <a class="cta-btn cta-btn--available" href="/sponsorship" style="display: inline;">Sponsor a Book</a>
 </div>]

Beautiful Soup detected two elements with this class. The first contains the donate button and the second contains their catch phrase "Together, let's build an Open Library for the World". Beautiful Soup returned these as a list, so we can retrieve the second of these elements using the following code.

In [29]:
elems[1]

<div class="page-banner page-banner-body">
<strong>Together</strong>, let's build an Open Library for the World. <a class="cta-btn cta-btn--available" href="/sponsorship" style="display: inline;">Sponsor a Book</a>
</div>

Beautiful Soup's element objects also have a `getText()` method for retrieving only their text. Using this we can retrieve the slogan from their web page. A picture of the exact element retrieved is provided for your reference.

In [30]:
elems[1].getText()

"\nTogether, let's build an Open Library for the World. Sponsor a Book\n"

### Figure 2 - The element retrieved
![alt text](img/5-2.png "Open Library")
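Selectors are easiest to practise on a small piece of HTML that lives in your notebook, since that makes no requests to anyone's server. The sketch below uses a cut-down copy of the two `page-banner` elements retrieved above. The variable names are mine, and the real page contains far more markup than this sample.

```python
import bs4

# A cut-down copy of the two page-banner elements retrieved above
sample_html = '''
<html><body>
<div class="page-banner page-banner-black">
<a href="https://archive.org/donate?platform=ol">Donate</a>
</div>
<div class="page-banner page-banner-body">
<strong>Together</strong>, let's build an Open Library for the World. <a href="/sponsorship">Sponsor a Book</a>
</div>
</body></html>
'''

soup = bs4.BeautifulSoup(sample_html, 'html.parser')
banners = soup.select('.page-banner')                  # every element carrying the page-banner class
slogan = banners[1].getText().strip()                  # strip() removes the surrounding newlines
link_target = banners[1].select('a')[0].get('href')    # read an attribute from a nested tag

print(len(banners))
print(slogan)
print(link_target)
```

The `get()` call at the end reads an attribute rather than text, which is the same approach used to find image and link addresses when scraping.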

### *Challenge Question 6 (1 point)*
Using the Beautiful Soup library, retrieve and print the HTML data from `https://dal.ca`. You can modify the code we used to retrieve the Open Library page for this task.

In [31]:
import bs4 # import the Beautiful Soup library
res = requests.get('https://www.dal.ca/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
print(librarySoup) 

<!DOCTYPE html>

<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="" name="keywords"/>
<meta content="With campuses located in Halifax and Truro, NS, Dalhousie is a research-intensive university offering over 180 degrees in 12 diverse faculties." name="description"/>
<meta content="2020-02-18T14:05:53Z" name="coveoDate"/><link href="https://cdn.dal.ca/etc/designs/dalhousie/clientlibs/global/default/images/favicon/favicon.ico.lt_cf5aed4779c03bd30d9b52d875efbe6c.res/favicon.ico" rel="shortcut icon"/>
<link href="https://cdn.dal.ca/etc/designs/dalhousie/clientlibs/global/default/images/favicon/apple-touch-icon-57x57.png.lt_ae6cddc0ec58c6ad0db492db183a7861.res/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="https://cdn.dal.ca/etc/designs/dalhousie/clientlibs/global/default/images/favicon/apple-touch-icon-114x114.png.lt_84b32f81f79631253e6176ec40ca2de9.res/apple-touch-icon-114x114.png" rel="apple-touch-icon" size

### *Challenge Question 7 (2 points)*
Again using Beautiful Soup, retrieve and print the text from Dalhousie University's `mainLogo` class. Your result should look something like `\nDalhousie University\n\n`.

In [32]:
res = requests.get('https://www.dal.ca/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
elems = librarySoup.select('.mainLogo') # select only elements with the mainLogo class
elems # print the data retrieved

[<div class="mainLogo"><h2>
 <a href="https://www.dal.ca/" title="Back to Dalhousie University Home Page">Dalhousie University</a>
 </h2>
 </div>]

In [33]:
elems[0].getText()

'\nDalhousie University\n\n'

# Objective 4: Ethically scrape XKCD comics
Though it probably feels like you achieved a lot this week, most of it was gained by applying previously developed skills (notably string manipulation, lists and how to use libraries) to new Python tools which gave you superpowers. In this last section you are asked to put these skills to the test. Fortunately, you have resources to guide you.

### *Challenge Question 8 (2 points)*
In Chapter 12, [Sweigart (2020)](https://automatetheboringstuff.com/2e/chapter12/) provides a really cool project to download all [XKCD comics](https://xkcd.com/), which are provided under the Creative Commons Attribution-NonCommercial license. We will complete a similar project. Rather than asking you to copy *all* of the XKCD comics, please write a script that retrieves 10 comics and saves them in the `data` folder (*not* a separate xkcd folder). If you get lost, you can refer back to Sweigart's book and today's exercise to help you.

In short, your script should:
- Download today's XKCD webpage;
- Find and download the comic image;
- Save the image to the data folder (**not**  the xkcd folder);
- Retrieve the previous button's url;
- Loop only 10 times (**not** for all of the XKCD comics, so that Dalhousie doesn't block our computers' IPs!!);
- Print 'Done!' when finished.

Please don't use today's skills for evil. Happy hacking!

In [51]:
import requests, os, bs4

url = 'https://xkcd.com'               # starting url
os.makedirs('data', exist_ok=True)     # store comics in ./data
pagesVisited = 0                       # counts how many times we have looped
while not url.endswith('#') and pagesVisited < 10:
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()
        # Save the image to ./data.
        imageFile = open(os.path.join('data', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')
    pagesVisited += 1                  # stop after 10 pages, whatever is already in ./data

print('Done!')

Downloading page https://xkcd.com...
Downloading image https://imgs.xkcd.com/comics/picking_bad_stocks.png...
Downloading page https://xkcd.com/2269/...
Downloading image https://imgs.xkcd.com/comics/phylogenetic_tree.png...
Downloading page https://xkcd.com/2268/...
Downloading image https://imgs.xkcd.com/comics/further_research_is_needed.png...
Downloading page https://xkcd.com/2267/...
Downloading image https://imgs.xkcd.com/comics/blockchain.png...
Downloading page https://xkcd.com/2266/...
Downloading image https://imgs.xkcd.com/comics/leap_smearing.png...
Downloading page https://xkcd.com/2265/...
Downloading image https://imgs.xkcd.com/comics/tax_ai.png...
Downloading page https://xkcd.com/2264/...
Downloading image https://imgs.xkcd.com/comics/satellite.png...
Downloading page https://xkcd.com/2263/...
Downloading image https://imgs.xkcd.com/comics/cicadas.png...
Downloading page https://xkcd.com/2262/...
Downloading image https://imgs.xkcd.com/comics/parker_solar_probe.png...
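
The script above builds full addresses by gluing strings together (`'https:' + ...` and `'https://xkcd.com' + ...`). As an alternative sketch, `urljoin` from the standard library resolves both kinds of partial address against the page they came from. The `image_src` and `prev_href` values below are my assumption about what the page attributes look like, inferred from the addresses printed in the run above.

```python
import os
from urllib.parse import urljoin

page_url = 'https://xkcd.com/2267/'                    # a page visited in the run above
image_src = '//imgs.xkcd.com/comics/blockchain.png'    # assumed form of the img src attribute
prev_href = '/2266/'                                   # assumed form of the Prev link's href

image_url = urljoin(page_url, image_src)   # borrows the scheme (https) from page_url
prev_url = urljoin(page_url, prev_href)    # keeps the site and swaps the path
save_path = os.path.join('data', os.path.basename(image_url))   # where the image would be saved

print(image_url)
print(prev_url)
print(save_path)
```

The benefit is that the same call works whether a site uses full, site-relative or scheme-relative links, so the scraper needs fewer special cases.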
