# Web scraping with Python


## Setup instructions

### Install the Anaconda Python distribution
If you are using your own computer, please install the Anaconda Python distribution from [https://www.anaconda.com/download/](https://www.anaconda.com/download/). (Note that Python 2 differs considerably from Python 3. For this workshop you will need Python 3.4 or later.)

Accepting the defaults proposed by the Anaconda installer is generally recommended. However, if it offers to install Microsoft Visual Studio Code you may safely skip this step.

### Download workshop materials
Download the materials from [http://tutorials.iq.harvard.edu/Python/PythonWebScrape.zip](http://tutorials.iq.harvard.edu/Python/PythonWebScrape.zip) and extract the zipped directory (Right-click => Extract All on Windows, double-click on Mac).




## Workshop goals and approach
In this workshop you will
- learn basic web scraping principles and techniques,
- learn how to use the requests package in Python,
- practice making requests and manipulating responses from the server.

This workshop is relatively *informal*, *example-oriented*, and *hands-on*. We will learn by working through an example web scraping project. Specifically, we will use Python to retrieve information about exhibitions and events at the Harvard Art Museums.

## Preliminary questions

### What is web scraping?
Web scraping is the activity of automating the retrieval of information from a web service designed for human interaction.


### Is web scraping legal? Is it ethical?
It depends. If you have legal questions, seek legal counsel. You can mitigate ethical issues by building delays and restrictions into your web scraping program, so that it does not reduce the availability of the web service for other users or raise hosting costs for the service provider.
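
As a rough illustration of "delays and restrictions", here is a minimal sketch of a helper that pauses between requests and caps the total number of requests. The names `fetch_politely`, `fetch`, `delay_seconds`, and `max_requests` are hypothetical and chosen only for this example; the default values are arbitrary assumptions, not recommendations from any service provider.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, max_requests=50, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between calls.

    All names here are hypothetical, chosen for this sketch:
    - fetch: any function that takes a URL and returns a result
    - delay_seconds: pause between consecutive requests
    - max_requests: hard cap on how many URLs are fetched
    """
    results = []
    for i, url in enumerate(list(urls)[:max_requests]):
        if i > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the `requests` package, this could be called as `fetch_politely(urls, lambda u: requests.get(u).json())`.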
 

## Example project overview and goals
I would like to know what time of day events at the Harvard Art Museums are held. Are more events held in the morning? Afternoon? Late in the evening? I don't know but I'm determined to find out. I can do that by scraping the page at <https://www.harvardartmuseums.org/visit/calendar>.

The basic strategy is much the same for most scraping projects. We will use our web browser (Chrome or Firefox recommended) to examine the page we wish to retrieve data from, and copy/paste information from the browser into our scraping program.

In this workshop I will demonstrate web scraping techniques using the Exhibitions page at <https://www.harvardartmuseums.org/visit/exhibitions> and let you use the skills you learn to retrieve event times from <https://www.harvardartmuseums.org/visit/calendar>.

## Examining the structure of our target web service
We wish to extract information from <https://www.harvardartmuseums.org/visit/exhibitions>. Like most modern web pages, a lot goes on behind the scenes to produce the page we see in our browser. Our goal is to pull back the curtain to see what the website does when we interact with it. Once we see how the website works we can often grab the data we want directly.

We start by opening that page in a web browser and inspecting it.

![dev_tools](img/dev_tools.png)

![dev_tools](img/dev_tools_pane.png)

If we scroll down to the bottom of the Exhibitions page, we'll see a button that says "Load More Exhibitions". Let's see what happens when we click on that button. To do so, click on "Network" in the developer tools window, then click the button. You should see a list of requests that were made as a result of clicking that button, as shown below.

![dev_tools](img/dev_tools_network.png)

If we look at that second request, the one to `load_next`, we'll see that it returns all the information we need, in a convenient format called `JSON`. All we need to do to retrieve exhibition data is make `GET` requests to <https://www.harvardartmuseums.org/search/load_next> with the correct parameters. For example, we can retrieve the first set of exhibitions in Python as follows:


In [5]:
import requests
from lxml import html

In [6]:
exhibitions0 = requests.get('https://www.harvardartmuseums.org/search/load_next?type=past-exhibition&year=&page=0').json()
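
The "correct parameters" in the URL above are the `key=value` pairs after the `?`. As a small sketch, the same URL can be assembled from a dictionary using the standard library. The variable names `base_url` and `params` are chosen only for this example.

```python
from urllib.parse import urlencode

base_url = "https://www.harvardartmuseums.org/search/load_next"

# The three query parameters that appear in the request URL used above.
params = {"type": "past-exhibition", "year": "", "page": 0}

# urlencode turns the dictionary into "type=past-exhibition&year=&page=0".
url = base_url + "?" + urlencode(params)
print(url)
```

The `requests` package can also do this step itself: `requests.get(base_url, params=params)` should produce an equivalent request, which makes it easier to change one parameter (such as `page`) at a time.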

In [14]:
exhibitions0["records"][0]

{'primaryimageurl': 'https://nrs.harvard.edu/urn-3:HUAM:GS010411', 'videos': [{'primaryurl': 'https://vimeo.com/277511290', 'description': 'Marina Isgro, the Nam June Paik Research Fellow at the Harvard Art Museums, introduces the museums’ new exhibition, “Nam June Paik: Screen Play,” on view from June 30 to August 5, 2018. Drawn almost entirely from our collections, the exhibition reveals the breadth of this groundbreaking global artist’s oeuvre from the 1960s through early 2000s.\r\n\r\nMusic by Chris Zabriskie (CC BY 4.0 creativecommons.org/licenses/by/4.0/legalcode)', 'videoid': 489238}], 'textiledescription': '<i>Nam June Paik: Screen Play</i> presents a group of works by groundbreaking global artist Nam June Paik (1932–2006). Paik was born in Korea but spent much of his life in the United States; his practice combined music, performance, sculpture, painting and drawing, video, and broadcast television, among other media. \r\n\r\nDrawn almost entirely from the Harvard Art Museums 

In [17]:
exhibitions0["records"][0].keys()

dict_keys(['primaryimageurl', 'videos', 'textiledescription', 'venues', 'shortdescription', 'people', 'url', 'id', 'lastupdate', 'title', 'temporalorder', 'exhibitionid', 'color', 'description', 'poster', 'images', 'enddate', 'begindate'])
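
It may help to see why the result behaves like ordinary Python data. Calling `.json()` on a response parses the JSON text of the body, roughly as `json.loads` does on a string, so the result is made of plain dictionaries and lists. The sketch below uses a heavily trimmed record with only three of the fields listed above; `raw_text` is a stand-in for the response body, not the real response.

```python
import json

# A trimmed stand-in for the response body, using three fields from the record shown above.
raw_text = (
    '{"records": [{"title": "Nam June Paik: Screen Play", '
    '"begindate": "2018-06-30", "enddate": "2018-08-05"}]}'
)

data = json.loads(raw_text)   # top level is a dict
first = data["records"][0]    # "records" is a list of dicts

print(type(data).__name__, type(data["records"]).__name__)
print(sorted(first.keys()))
```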

In [20]:
[{'title': x['title'], 'start': x['begindate'], 'end': x['enddate']} for x in exhibitions0["records"]]

[{'title': 'Nam June Paik: Screen Play', 'start': '2018-06-30', 'end': '2018-08-05'}, {'title': 'A.K. Burns: Survivor’s Remorse', 'start': '2018-05-19', 'end': '2018-08-14'}, {'title': 'Analog Culture: Printer’s Proofs from the Schneider/Erdman Photography Lab, 1981–2001', 'start': '2018-05-19', 'end': '2018-08-12'}, {'title': 'Inventur—Art in Germany, 1943–55', 'start': '2018-02-09', 'end': '2018-06-03'}, {'title': 'JODI: OXO', 'start': '2018-02-07', 'end': '2018-04-23'}, {'title': 'Looking Back: The Western Tradition in Retrospect', 'start': '2018-01-20', 'end': '2018-05-06'}, {'title': 'Fernando Bryce: The Book of Needs', 'start': '2018-01-20', 'end': '2018-05-06'}, {'title': 'Rome: Eternal City', 'start': '2018-01-20', 'end': '2018-05-06'}, {'title': 'No More, America', 'start': '2017-09-27', 'end': '2017-12-31'}, {'title': 'The Art of Drawing in the Early Dutch Golden Age, 1590–1630: Selected Works from the Abrams Collection', 'start': '2017-09-09', 'end': '2018-01-14'}]
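
The request above retrieves only `page=0`. One possible way to collect several pages is sketched below. It rests on assumptions that the source does not confirm: that increasing the `page` parameter returns the next set of records, and that a page with no `records` means there is nothing more to fetch. The names `collect_exhibitions`, `fetch_page`, `max_pages`, and `delay_seconds` are hypothetical.

```python
import time


def collect_exhibitions(fetch_page, max_pages=5, delay_seconds=1.0, sleep=time.sleep):
    """Gather title/start/end summaries from consecutive result pages.

    fetch_page is any function that takes a page number and returns the
    parsed JSON (a dict) for that page.
    """
    summaries = []
    for page in range(max_pages):
        if page > 0:
            # Pause between requests to limit load on the server.
            sleep(delay_seconds)
        records = fetch_page(page).get("records", [])
        if not records:
            # Assumption: an empty page means there are no more results.
            break
        summaries.extend(
            {"title": r["title"], "start": r["begindate"], "end": r["enddate"]}
            for r in records
        )
    return summaries
```

With `requests`, `fetch_page` could be something like `lambda page: requests.get('https://www.harvardartmuseums.org/search/load_next', params={'type': 'past-exhibition', 'year': '', 'page': page}).json()`. Keeping `max_pages` small while experimenting avoids sending more requests than needed.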