# Python Web-Scraping - An Introduction

## Topics

- Web basics
- Making web requests
- Inspecting web sites
- Retrieving JSON data
- Using XPath to retrieve HTML content
- Parsing HTML content
- Cleaning and storing text from HTML

## Prerequisites

Assumes knowledge of Python, including:
- Lists
- Dictionaries
- Logical indexing
- Iteration with for-loops

Assumes basic knowledge of web page structure.

## Goals

This workshop is organized into two main parts:
1. Retrieve information in JSON format.
2. Parse HTML files.


## Web Scraping Background

### What is web scraping?
Web scraping is the activity of automating the retrieval of information from a web service designed for human interaction.

### Is web scraping legal? Is it ethical?
It depends. Legal aspects vary, so if you have legal questions, seek legal counsel. Ethically, you can mitigate issues by building delays and restrictions into your web scraping program to avoid impacting the availability of the web service for other users or the hosting costs for the service provider.
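
One way to build in such delays and restrictions is sketched below. This is only an illustration: the function name `fetch_politely` and its parameters are made up for this example, and the `fetch` argument is a stand-in for whatever function actually retrieves a page (for instance `requests.get`).

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, max_requests=100):
    """Call fetch(url) for each URL, pausing between calls and capping the total."""
    results = []
    for count, url in enumerate(urls):
        if count >= max_requests:
            break  # restriction: never exceed the request budget
        if count > 0:
            time.sleep(delay_seconds)  # delay: pause before every request after the first
        results.append(fetch(url))
    return results


# Try it with a stand-in fetch function, so no network access is needed.
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=str.upper,
    delay_seconds=0.1,
    max_requests=2,
)
print(pages)  # ['PAGE-1', 'PAGE-2']
```

Appropriate values for the delay and the request cap depend on the service; the numbers here are placeholders, not recommendations.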

### Web Scraping Approaches
No two websites are identical — websites are built for different purposes by different people and thus have different underlying structures. Because they are heterogeneous, there is no single way to scrape a website. The scraping approach must be tailored to each individual site. Commonly used approaches include:
- Using requests to extract information from structured JSON files.
- Using requests to extract information from HTML.
- Automating a browser to retrieve information from HTML.

Remember, even once you've decided upon the best approach for a particular site, it will be necessary to modify that approach to suit your specific use-case.

### How does the web work?
#### Components
- **Clients** are the typical web user’s internet-connected devices (e.g., your computer connected to your Wi-Fi) and web-accessing software available on those devices (usually a web browser like Firefox or Chrome).
- **Servers** are computers that store webpages, sites, or apps. When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user’s web browser.
- **HTTP** is the language for clients and servers to speak to each other.

#### Process
When you type a web address into your browser:
1. The browser finds the address of the server that the website lives on.
2. The browser sends an HTTP request message to the server, asking it to send a copy of the website to the client.
3. If the server approves the client’s request, the server sends the client a "200 OK" message, and then the website starts displaying in the browser.
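
The "200 OK" message is one of many HTTP status codes a server can send back. As a small sketch using only Python's standard library, we can look up the standard phrase for a few codes; the helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)
    # Codes in the 200-299 range indicate that the request succeeded.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 500):
    print(describe_status(code))
```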


## Goal: Retrieve Data in JSON Format

The objective is to retrieve information in JSON format and organize it into a spreadsheet. Below are the steps we will follow to achieve this:

### Steps to Retrieve and Organize Data

1. **Inspect the Website**:
   - Check if the content at [Harvard Art Museums Collections](https://www.harvardartmuseums.org/collections) is stored in JSON format.

2. **Make a Request**:
   - Send a request to the website server to retrieve the JSON file. This involves using tools like `requests` in Python to access the data.

3. **Convert JSON to Dictionary**:
   - Once the JSON data is retrieved, convert it from JSON format into a Python dictionary using Python’s json library. This step is crucial for manipulating the data in Python.

4. **Extract Data and Store in CSV**:
   - Extract the necessary data from the Python dictionary and store it in a .csv file. This is done using Python’s csv library or pandas DataFrame to format and save the data.
   
![json-format.png](json-format.png)
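
Steps 3 and 4 can be tried out without contacting any server. The sketch below uses a small hypothetical JSON string in place of a real response (the field names `id` and `title` are assumptions made for illustration) and writes the CSV into an in-memory buffer rather than a file.

```python
import csv
import io
import json

# Hypothetical response body; a real service will return different fields.
response_text = '{"records": [{"id": 1, "title": "Vase"}, {"id": 2, "title": "Portrait"}]}'

# Step 3: convert the JSON text into a Python dictionary.
data = json.loads(response_text)
records = data["records"]  # a list of dictionaries, one per record

# Step 4: write the records as CSV (here into a string buffer instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To save to disk instead, the same `DictWriter` can be given a file opened with `open("records.csv", "w", newline="")`; the file name is just an example.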

### Understanding the Website's Backend

Like most modern web pages, a lot goes on behind the scenes at the Harvard Art Museums website to produce the page we see in our browser. By understanding how the website works when we interact with it, we can begin to retrieve data effectively.

If we are lucky, we'll find a resource on the website that returns the data we're looking for in a structured format like JSON. This is advantageous as it simplifies the process of converting data from JSON into a spreadsheet format such as CSV or Excel.

## Examine the Website's Structure for Data Retrieval

### Basic Strategy for Web Scraping

The approach to web scraping generally follows a consistent process across different projects. We will use a web browser (Chrome or Firefox recommended) to examine the page from which we wish to retrieve data. The key is to use the developer tools available in the browser to inspect the webpage and identify how data is loaded and presented.

### Step-by-Step Process

1. **Open the Collections Web Page**:
   - Begin by navigating to the collections page on a web browser and open the developer tools. This is typically done by right-clicking on the page and selecting "Inspect" or pressing `Ctrl+Shift+I` on your keyboard.

2. **Using Network Tools**:
   - Within the developer tools, switch to the "Network" tab. This tab is crucial as it displays network requests made by your browser to the server.

3. **Interact with the Page**:
   - Scroll to the bottom of the Collections page and click on the “Load More” button. Observe the network activity that occurs when you click the button.

4. **Analyze the Requests**:
   - A list of HTTP requests will appear in the Network tab when you click the “Load More” button. Review these requests to identify which one carries the data you need.

5. **Identify Data Retrieval Method**:
   - Pay attention to the second request made to a script named 'browse'. This request returns information in JSON format, which is what we need for scraping. 
   
![harvardart_1.png](harvardart_1.png)

![harvardart_2.png](harvardart_2.png)

### Retrieving Data

- **Endpoint for Collection Data**:
  - To retrieve the collection data, make GET requests to `https://www.harvardartmuseums.org/browse` with the correct parameters. This will fetch the data in JSON format, which can then be processed or converted into the desired format (e.g., CSV).

By following these steps, you can start to retrieve data from web pages that load additional content dynamically, such as through the "Load More" buttons or infinite scroll functionalities.


## Making Requests

To effectively retrieve information from a website, understanding the structure of the URL, or "web address," is crucial. This allows us to specify the location of the resources we want to collect, such as web pages.

### Understanding URL Structure

A URL is typically composed of several parts:

1. **Protocol**: The method of access (e.g., `https`, `http`).
2. **Domain**: The central web address of the site (e.g., `www.harvardartmuseums.org`).
3. **Path**: The specific address within the domain where resources are located (e.g., `/browse`).
4. **Parameters (Query String)**: Additional instructions for the server about what exactly to return, often in key-value pairs (e.g., `load_amount=10&offset=0`).
5. **Fragment**: An internal page reference that directs the browser to a specific part of the page (optional and may not be present).
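
To see these parts in practice, the standard library's `urllib.parse` module can split a URL into its components. This is a small sketch; the `#results` fragment is invented for the example and is not taken from the museum site.

```python
from urllib.parse import parse_qs, urlsplit

# The fragment "#results" is a made-up example.
example_url = "https://www.harvardartmuseums.org/browse?load_amount=10&offset=0#results"

parts = urlsplit(example_url)
print(parts.scheme)    # protocol: https
print(parts.netloc)    # domain: www.harvardartmuseums.org
print(parts.path)      # path: /browse
print(parts.query)     # query string: load_amount=10&offset=0
print(parts.fragment)  # fragment: results

# The query string can be decoded into a dictionary of parameter values.
query_params = parse_qs(parts.query)
print(query_params)    # {'load_amount': ['10'], 'offset': ['0']}
```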

In [1]:
museum_domain = 'https://www.harvardartmuseums.org'
collection_path = 'browse'

collection_url = (museum_domain
                  + "/"
                  + collection_path)

print(collection_url)

https://www.harvardartmuseums.org/browse



### Practical Tips for URL Management

- **Variable Usage**: It’s practical to create variables for commonly used domains and paths. This simplifies the process of changing out paths and parameters when needed.
- **Syntax Details**:
  - The path is separated from the domain by a `/`.
  - Parameters are appended to the URL after the path and start with a `?`.
  - If multiple parameters are used, they are separated by `&`.

Understanding these URL components and their structure helps in crafting precise requests to retrieve data from web servers efficiently.
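
The syntax rules above can be sketched as a small helper that assembles a full URL by hand. The function name `build_url` is made up for this illustration; in practice the `requests` library does the equivalent work when we pass it a `params` dictionary.

```python
from urllib.parse import urlencode


def build_url(domain, path, params=None):
    """Join domain and path with '/', then append '?key=value&key=value' if needed."""
    url = domain.rstrip("/") + "/" + path.lstrip("/")
    if params:
        # urlencode joins key=value pairs with '&' and escapes special characters.
        url += "?" + urlencode(params)
    return url


full_url = build_url(
    "https://www.harvardartmuseums.org",
    "browse",
    {"load_amount": 10, "offset": 0},
)
print(full_url)  # https://www.harvardartmuseums.org/browse?load_amount=10&offset=0
```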

In [6]:
import requests

collections1 = requests.get(
    collection_url,
    params={'load_amount': 10,
            'offset': 0}
)

In [7]:
collections1 = collections1.json()

In [8]:
# print(collections1)

That’s it. Really, we are done here. Everyone go home!

OK, not really; there is still more we can learn. But you have to admit that was pretty easy. If you can identify a service that returns the data you want in structured form, web scraping becomes a pretty trivial enterprise. We’ll discuss several other scenarios and topics, but for some web scraping tasks this is really all you need to know.

## Organizing & saving the data
The records we retrieved from https://www.harvardartmuseums.org/browse are arranged as a list of dictionaries. We can easily arrange these data into a pandas DataFrame to facilitate subsequent analysis.

In [10]:
import pandas as pd
records1 = pd.DataFrame.from_records(collections1['records'])

In [32]:
#records1

## Iterating to retrieve all the data
Of course we don’t want just the first page of collections. How can we retrieve all of them?

Now that we know the web service works, and how to make requests in Python, we can iterate in the usual way.

In [13]:
records = []
for offset in range(0, 50, 10):
    param_values = {'load_amount': 10, 'offset': offset}
    current_request = requests.get(collection_url, params = param_values)
    records.extend(current_request.json()['records'])
# Convert list of dicts to a `DataFrame`
records_final = pd.DataFrame.from_records(records)

In [19]:
#records_final
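
The loop above stops after a fixed number of pages. When the total number of records is not known in advance, one possible approach is to keep requesting pages until one comes back empty. The sketch below illustrates that idea with a stand-in data source so it runs without network access; `fetch_all_records`, `fake_fetch_page`, and the assumption that the service returns an empty list past the last record are all hypothetical.

```python
def fetch_all_records(fetch_page, page_size=10, max_pages=1000):
    """Collect records page by page until a page comes back empty."""
    all_records = []
    for page_number in range(max_pages):  # max_pages guards against an endless loop
        offset = page_number * page_size
        page = fetch_page(offset, page_size)
        if not page:
            break  # an empty page is taken to mean there is nothing left
        all_records.extend(page)
    return all_records


# Stand-in data source holding 25 fake records.
fake_records = [{"id": i} for i in range(25)]


def fake_fetch_page(offset, page_size):
    """Return one 'page' of fake records, like a paginated web service might."""
    return fake_records[offset:offset + page_size]


records = fetch_all_records(fake_fetch_page, page_size=10)
print(len(records))  # 25
```

With a real service, `fetch_page` would wrap a `requests.get` call that passes the offset and page size as parameters, ideally with a pause between requests.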

## Retrieve exhibits data

In this exercise, we will retrieve information about the art exhibitions at Harvard Art Museums from https://www.harvardartmuseums.org/exhibitions

Using a web browser (Firefox or Chrome recommended) inspect the page at https://www.harvardartmuseums.org/exhibitions. Examine the network traffic as you interact with the page. Try to find where the data displayed on that page comes from.

In [20]:
museum_domain = "https://www.harvardartmuseums.org"
exhibit_path = "search/load_more"
exhibit_url = museum_domain + "/" + exhibit_path
print(exhibit_url)

https://www.harvardartmuseums.org/search/load_more


Make a GET request in Python to retrieve the data from the URL identified above.

In [21]:
import requests
# Note: this rebinds print to pprint, so strings are displayed with quotes from here on
from pprint import pprint as print
exhibit1 = requests.get(exhibit_url, params = {'type': 'past-exhibition', 'page': 1})
print(exhibit1.headers["Content-Type"])

'application/json'


In [30]:
exhibit = exhibit1.json()
#print(exhibit)

In [28]:
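# URL queried in this cell. Note that this is the /browse (collections) endpoint,
# not exhibit_url, so the columns shown below are object-record fields.
base_url = "https://www.harvardartmuseums.org/browse"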

# List to hold all records from the first five pages
firstFivePages = []

# Loop through the first five pages
for page in range(1, 6):
    # Parameters for the GET request
    params = {
        'type': 'past-exhibition',
        'page': page
    }
    
    # Make the GET request
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Convert JSON response to Python dictionary
        data = response.json()
        
        # Check if 'records' key is in the JSON data
        if 'records' in data:
            # Extend the list with records
            firstFivePages.extend(data['records'])
        else:
            print(f"No records found on page {page}")
    else:
        print(f"Failed to retrieve data for page {page}: {response.status_code}")

# Create a DataFrame from the list of records
firstFivePages_records = pd.DataFrame(firstFivePages)

In [31]:
firstFivePages_records.columns

Index(['copyright', 'contextualtextcount', 'creditline', 'accesslevel',
       'dateoflastpageview', 'classificationid', 'division', 'markscount',
       'publicationcount', 'totaluniquepageviews', 'contact', 'colorcount',
       'rank', 'id', 'state', 'verificationleveldescription', 'period',
       'images', 'worktypes', 'imagecount', 'totalpageviews', 'accessionyear',
       'standardreferencenumber', 'signed', 'classification', 'relatedcount',
       'verificationlevel', 'primaryimageurl', 'titlescount', 'peoplecount',
       'style', 'lastupdate', 'commentary', 'periodid', 'technique', 'edition',
       'description', 'medium', 'lendingpermissionlevel', 'title',
       'accessionmethod', 'colors', 'provenance', 'groupcount', 'dated',
       'department', 'dateend', 'people', 'url', 'dateoffirstpageview',
       'century', 'objectnumber', 'labeltext', 'datebegin', 'culture',
       'exhibitioncount', 'imagepermissionlevel', 'mediacount', 'objectid',
       'techniqueid', 'dimension

## Document Object Model (DOM)

### Understanding DOM

The Document Object Model (DOM) is crucial for working with HTML or XML documents programmatically. It provides a structured tree representation of the document, allowing developers to navigate and modify the content effectively.

### Features of DOM

- **Tree Structure**: The DOM represents an HTML or XML document as a tree structure where each node is an object representing part of the document.
- **Language-Independent**: It is a cross-platform and language-independent interface, making it a standard tool for web development across different programming environments.
- **Nodes and Objects**: Each branch of the tree ends in a node, and each node can contain objects like elements, attributes, and text.

### Manipulating DOM

- **Programmatic Access**: DOM methods provide programmatic access to the tree, enabling changes to the document’s structure, style, and content.
- **Dynamic Interactions**: This allows web pages to be dynamic, as scripts can react to user events, modify the DOM, and update the display without needing to reload the page.

   
![dom_webscraping.png](dom_webscraping.png)
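
To make the tree idea concrete, here is a minimal sketch that parses a tiny, well-formed document with the standard library's `xml.etree.ElementTree` and walks its nodes. The snippet and its `event` class are invented for illustration. Real web pages are often not well-formed XML, so a dedicated HTML parser such as lxml is usually needed for actual scraping.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet standing in for a real page.
snippet = """
<html>
  <body>
    <h1>Events</h1>
    <ul>
      <li class="event">Gallery talk</li>
      <li class="event">Film screening</li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(snippet)


def show_tree(node, depth=0):
    """Print each element's tag, indented according to its depth in the tree."""
    print("  " * depth + node.tag)
    for child in node:
        show_tree(child, depth + 1)


show_tree(root)

# ElementTree supports a limited subset of XPath for selecting nodes.
event_titles = [li.text for li in root.findall(".//li[@class='event']")]
print(event_titles)  # ['Gallery talk', 'Film screening']
```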

## Retrieving HTML
When I inspect the network traffic while interacting with https://www.harvardartmuseums.org/calendar I don’t see any requests that return JSON data. The best we can do appears to be to retrieve HTML.

To retrieve data on the events listed in the calendar, the first step is the same as before: we make a GET request.

In [33]:
calendar_path = 'calendar'

calendar_url = (museum_domain  # recall that we defined museum_domain earlier
                + "/"
                + calendar_path)

print(calendar_url)

'https://www.harvardartmuseums.org/calendar'


In [34]:
events = requests.get(calendar_url)

In [36]:
events.headers['Content-Type']

'text/html; charset=UTF-8'
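
The `Content-Type` header tells us which kind of parsing to apply to a response. As a rough sketch, the helpers below (their names are made up for this example) split a header value into its media type and character set using only the standard library, so a script could decide between JSON and HTML handling.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset or None)."""
    message = Message()
    message["Content-Type"] = header_value
    # Both values are returned in lower case by the email library.
    return message.get_content_type(), message.get_content_charset()


def looks_like_json(header_value):
    """Return True if the header announces a JSON response body."""
    media_type, _ = parse_content_type(header_value)
    return media_type == "application/json"


html_info = parse_content_type("text/html; charset=UTF-8")
print(html_info)                            # ('text/html', 'utf-8')
print(looks_like_json("application/json"))  # True
```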

## Scrapy: for large / complex projects
Scraping websites by using the requests library to make GET and POST requests, and the lxml library to process HTML, is a good way to learn basic web scraping techniques. It is a good choice for small to medium-sized projects. For very large or complicated scraping tasks, the scrapy library offers a number of conveniences, including asynchronous retrieval, session management, convenient methods for extracting and storing values, and more. More information about scrapy can be found at https://doc.scrapy.org.

