# Webscraping with Python

## Instructors

- Scott Bailey
- Claire Cahoon
- Walt Gurley

## Learning objectives

By the end of our workshop today, we hope you'll have a sense of when and why to webscrape, and how to extract select information from websites into usable data.

## Topics

- what is webscraping?
- ethical and legal issues in webscraping
- HTML and CSS
- webscraping with requests-html

## Setup

With this Google Colab notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

## Zoom etiquette

Please make sure that your mic is muted during the workshop.

## Questions during the workshop

During the workshop, we have a second instructor who will be monitoring chat on Zoom. Please feel free to ask questions by chat throughout the workshop. Our second instructor will answer as able, and will aggregate questions with answers that might help everyone. 

At the end of each section of the workshop, the primary instructor will answer aggregated and new questions as time permits. If we aren't able to get to your question during the workshop, please follow up with us afterward. 

## Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code in an interactive way. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing that. There are a lot of ways that you can run Jupyter notebooks, including just locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop.  Colaboratory is “a Google research project created to help disseminate machine learning education and research.”  If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to drop by our walk-in consulting or schedule an appointment with us.

https://go.ncsu.edu/dvs-request


## Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, for this workshop you will need an environment with the following packages installed and available:
- `pandas`
- `requests-html`

Please note that we will likely not have time during the workshop to support you with problems related to a local environment, and we do recommend using the Colaboratory notebooks if you are at all unsure.



## What is webscraping?

**Question**: what types of tasks do you think of as webscraping?

Webscraping is the selective retrieval of information from HTML documents on the web. Expansively, we could include the process of directed, automated retrieval of other filetypes such as PDF and CSV from web servers. 

Webcrawling is the automated indexing of websites, and typically involves progressive processing of a site and its links, and the repetition of this process.  
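To make the crawling process concrete, here is a minimal sketch of the "process a page, follow its links, repeat" loop. It makes no network requests: the `SITE` dictionary is a hypothetical stand-in for a website, and `crawl` is an illustrative helper name, not part of any library.

```python
from collections import deque

# Hypothetical miniature "website": each page maps to the pages it links to
SITE = {
    "/": ["/about", "/books"],
    "/about": ["/"],
    "/books": ["/books/1", "/books/2", "/"],
    "/books/1": ["/books"],
    "/books/2": ["/books", "/books/1"],
}

def crawl(start, get_links):
    """Visit every page reachable from start, breadth first, once each."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        # Queue any linked page we have not already seen
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("/", lambda page: SITE.get(page, []))
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.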

## Ethical concerns in webscraping

First: ethics is not law, but you should be concerned with both. The legality of different kinds of webscraping, in different situations, continues to be debated and decided.

There are at least two things you need to check before starting to scrape a website. 

1. Is the content under copyright or licensed in such a way that you should not scrape it? Are there terms of service that limit your use of the site and/or its content?
2. Does the site have a robots.txt file that circumscribes what a robot/scraper should do on the site?
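As a rough illustration of the second check, Python's standard library includes `urllib.robotparser`, which can read robots.txt rules and answer whether a given path may be fetched. The robots.txt content, the `workshop-scraper` user agent name, and the `example.com` URLs below are all made up for this sketch. With a real site you would point the parser at the live file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our (hypothetical) scraper may fetch two different paths
allowed = parser.can_fetch("workshop-scraper", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("workshop-scraper", "https://example.com/private/data.html")

# Some sites also request a minimum delay between requests
delay = parser.crawl_delay("workshop-scraper")
print(allowed, blocked, delay)
```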

Further considerations:
- **How you scrape**: Is your webscraping going to negatively impact the site, especially due to frequency of requests? Are you identifying yourself in a header when you scrape? Are you publishing your code or redistributing that? Be good citizens of the web.
- **What you do with the data**: Are you giving correct attribution? Are you illegally or unethically redistributing content or data? 
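Two of the "how you scrape" points can be sketched in a few lines: identifying yourself, and spacing out your requests. The header value, the contact address, and the `seconds_to_wait` helper are hypothetical names chosen for illustration, and the two-second interval is an assumption rather than a rule.

```python
import time

# Hypothetical identifying header: replace the name and contact address.
# With requests-style sessions, headers can be passed as
# session.get(url, headers=HEADERS)
HEADERS = {"User-Agent": "workshop-scraper/0.1 (contact: you@example.edu)"}

def seconds_to_wait(last_request, now, min_interval=2.0):
    """Return how long to pause so requests are at least min_interval apart."""
    return max(0.0, min_interval - (now - last_request))

# Pretend the previous request was sent half a second ago
now = time.monotonic()
pause = seconds_to_wait(last_request=now - 0.5, now=now)
print(f"Pause for about {pause:.1f} seconds before the next request")
```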

There are plenty of resources online about law, ethics, and best practices around webscraping. Here are a few if you'd like to explore these concerns further:
 
- https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01
- https://gijn.org/2015/08/12/on-the-ethics-of-web-scraping-and-data-journalism/
- https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/

**Notice: I am not offering you legal advice on whether to scrape or not or what to scrape, just mentioning issues for consideration.**




## Common webscraping libraries

There are hundreds of tutorials online about webscraping with Python, of varying quality. Some of the libraries you are likely to encounter:

- [`selenium`](https://www.selenium.dev/selenium/docs/api/py/)
- [`scrapy`](https://github.com/scrapy/scrapy)
- [`beautifulsoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [`requests`](https://github.com/psf/requests)
- [`urllib`](https://docs.python.org/3/library/urllib.html)
- [`lxml`](https://github.com/lxml/lxml)
- [`MechanicalSoup`](https://github.com/MechanicalSoup/MechanicalSoup)
- [`requests-html`](https://github.com/psf/requests-html)

## Webscraping with requests-html

Why `requests-html`?

- Wraps commonly used libraries in a straightforward API that answers 80% of webscraping cases
- Provides asynchronous methods
- Provides support for dynamic sites rendered with JavaScript

[`requests-html` docs](https://requests.readthedocs.io/projects/requests-html/en/latest/)

In [None]:
# !pip install requests-html

In [None]:
# Import the particular class we need to make HTTP requests
# and parse the response
from requests_html import HTMLSession

In [None]:
# Create an instance of that class
session = HTMLSession()

We're going to learn first with a site set up by ScrapingHub for practicing webscraping: http://books.toscrape.com/. This site gives us well-formed, common-looking HTML and CSS. It is much cleaner than most sites you might try to scrape.

In [None]:
# Define a url variable for ease of use
url = "http://books.toscrape.com/"

In [None]:
# Send the get request to that url, and save the response in the variable r
r = session.get(url)

I've used the variable `r` to save the response we get from the server, which contains a lot of information. `requests-html` makes quite a lot of that information available to us as properties on the response. We'll take a look at a few here. 

In [None]:
# Check the status code of the response
# Docs on status codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
r.status_code
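A status code is just an integer, so it can help to turn it into something readable before deciding whether to continue. The sketch below uses the standard library's `http.HTTPStatus`. The `describe_status` helper is a hypothetical name, and treating only 2xx codes as success is a simplifying assumption.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```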

In [None]:
# Check the response headers
# For info: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
r.headers

In [None]:
# HTML object of response
r.html

In [None]:
# HTML content of html object
r.html.html

In [None]:
# Text from the <html> element
r.html.text

## An interlude on HTML and CSS

Why do we need to understand HTML and CSS to webscrape? Without HTML and CSS, we can't pinpoint the specific sections of the page or content we are interested in. Specifically, within HTML we need to understand `elements` or `tags`, and `attributes`. Within CSS, we definitely need to know `classes` and `ids`.
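To see how these pieces relate, here is a small sketch that lists every tag and its attributes in a fragment of HTML, using only the standard library's `html.parser`. The `SAMPLE` fragment is hypothetical (loosely modeled on a product listing) and `TagLister` is an illustrative class name. A class such as `product_pod` would be selected in CSS with `.product_pod`, and an id such as `book-1` with `#book-1`.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment, loosely modeled on a product listing
SAMPLE = """
<article class="product_pod" id="book-1">
  <h3><a href="catalogue/book-1.html" title="A Sample Book">A Sample Book</a></h3>
  <p class="price_color">10.00</p>
</article>
"""

class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

lister = TagLister()
lister.feed(SAMPLE)
for tag, attrs in lister.tags:
    print(tag, attrs)
```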

Let's take a look at our sample page with our browser's developer tools. In most browsers, you can right click on a part of the page, and click 'Inspect Element' or 'Inspect' to open the dev tools.

In Safari, you may need to first enable Developer Tools: Preferences -> Advanced -> "Show Develop menu in menu bar" 

In [None]:
# Find a piece of HTML by class
# Find returns all instances in a list
# If you want just the first, add first=True to your find method
r.html.find(".product_pod")

In [None]:
first_prod = r.html.find(".product_pod", first=True)
first_prod

**Question**: What's the difference in type between the two `find` calls?

In [None]:
# Once we find an element, we can get its HTML
first_prod.html

In [None]:
# Or we can get the text from it
first_prod.text

In [None]:
# We can also chain the find calls to find progressively narrower
# elements on the page
sidebar = r.html.find(".sidebar", first=True)

In [None]:
# Let's find all of the genre links within the sidebar
# Notice how the successive tags let us find nested elements
sidebar_genres = sidebar.find(".nav-list ul li")
sidebar_genres

In [None]:
# Once we have the elements, we could collect the text
genres = [item.text for item in sidebar_genres]
genres

`requests-html` pre-collects links for you, since it's such a common and easily findable element. 

In [None]:
# Notice the type of the absolute links
links = r.html.absolute_links
print(type(links))
links

**Question**: What's the difference between the absolute links and the links below? 

In [None]:
r.html.links

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
On the sample page linked above, pick out some feature on the page that you want to locate. Using the developer tools in your browser, try to find something that has a clear HTML or CSS identifier. In the cell below, write code that returns just that element.
</p>
</div>

In [None]:
# Write code here

## More webscraping

We'll run through another example, looking at how to extract other relevant information. 

In [None]:
# A Byte of Python
url = "https://python.swaroopch.com/"

In [None]:
r = session.get(url)

In [None]:
# Find the nav section, and get all links within it
nav = r.html.find("nav", first=True)
chapters = nav.absolute_links
chapters

In [None]:
# We could also find those links without the shortcut
chapter_link_els = nav.find("a")
chapter_link_els

In [None]:
# And then extract the href attribute to get the link url 
urls = [a.attrs["href"] for a in chapter_link_els]
urls

These are relative links, but we could recompose the absolute URLs if we needed to.

We used the `.attrs` property on the link elements. It is a dictionary of the element's HTML attributes, so you could use a different key in place of `href`, such as `title` or `id`, to get any other attribute.
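One way to recompose absolute URLs from relative links is `urljoin` from the standard library. In this sketch, the `relative_links` values are hypothetical examples standing in for whatever `urls` actually contains.

```python
from urllib.parse import urljoin

base = "https://python.swaroopch.com/"

# Hypothetical relative links, standing in for the values in `urls`
relative_links = ["about_python.html", "installation.html"]

# Resolve each relative link against the base URL
absolute = [urljoin(base, link) for link in relative_links]
print(absolute)
```

`urljoin` also copes with links that are already absolute, which it returns unchanged.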

In [None]:
# If we wanted just the chapter links, since this is a standard list, 
# we could use slicing to remove the first and last links
urls_filtered = urls[1:-1]
urls_filtered

We've learned how to extract specific information from a webpage. Let's build on that by writing a couple of functions to help us save some of that information.

Here's a standard workflow:
- generate a list of URLs
- iterate over those to scrape the content we're interested in
- write that content to a file that we could process in any way we want

In [None]:
# Create a directory to store data if it doesn't exist
import os
if not os.path.exists("data"):
    os.makedirs("data")

In [None]:
# A function to create a filename from a url and a directory name
def create_filename(url, dirname):
    chunks = url.split("/")
    name = chunks[-1].split(".")[0]
    return os.path.join(dirname, f"{name}.text")

In [None]:
# A function that takes a url, and returns the text of the page's main
# content section
# Note that this is written for this specific content
def get_para_text(url):
    r = session.get(url)
    try:
        text = r.html.find(".markdown-section", first=True).text
    except AttributeError:
        # find() returns None when nothing matches, so .text fails
        text = "Missing"
    return text


In [None]:
import time

In [None]:
sorted(chapters)[1:-1]

In [None]:
# Iterate over chapter urls, get the text, and write it to a file
# sorted() turns the chapters set into a list, which we can then slice
for url in sorted(chapters)[1:-1]:
    filename = create_filename(url, "data")
    text = get_para_text(url)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
    time.sleep(2)

**Question**: Why do you think that we put the `time.sleep` function here?

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
TODO
</p>
</div>

## Some examples

Let's look at a few concrete examples that also show different ways of using webscraping.

### Scraping text and metadata

#### Epistolae

My first suggestion is to email the folks who run this and find out if they would be willing to provide raw XML or txt files for your research purposes. 

But let's see how we would scrape it if we had to.

In [None]:
base_url = 'https://epistolae.ctl.columbia.edu/'
letter_url = 'https://epistolae.ctl.columbia.edu/letter/25967.html'

In [None]:
r = session.get(letter_url)

In [None]:
translated_letter = r.html.find('.pane-node-field-translated-letter')[0].find('.field-items')[0].text
translated_letter

In [None]:
original_letter = r.html.find('.pane-node-field-original-letter')[0].find('.field-items')[0].text
original_letter

In [None]:
printed_source = r.html.find('.pane-node-field-printed-source')[0].find('.field-items')[0].text
printed_source

In [None]:
letter_date = r.html.find('.pane-node-field-date')[0].find('.field-items')[0].text
letter_date

In [None]:
import pandas as pd

In [None]:
# Let's compile the data into a dictionary and create a pandas dataframe
# At this point, we're putting together content with metadata into 
# a format that we could easily export as csv for analysis
letter_dict = {
    'date': [letter_date],
    'printed-source': [printed_source],
    'translated-letter': [translated_letter],
    'original-letter': [original_letter]
}
df = pd.DataFrame(data=letter_dict)
df
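With pandas, the export mentioned above is a single call: `df.to_csv("letters.csv", index=False)`. As a sketch of what that step produces, the same dictionary-of-lists structure can be written with the standard library's `csv` module. The values below are placeholders, not text scraped from the site.

```python
import csv
import io

# Placeholder values standing in for the scraped strings
letter_dict = {
    "date": ["placeholder date"],
    "printed-source": ["placeholder source"],
    "translated-letter": ["placeholder translation"],
    "original-letter": ["placeholder original"],
}

# Turn the column-oriented dictionary into one dictionary per row
fieldnames = list(letter_dict)
rows = [dict(zip(fieldnames, values)) for values in zip(*letter_dict.values())]

# Write to an in-memory buffer; open a real file here to save to disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```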

If you really wanted to crawl and scrape the whole site, you would go to the table of letters, get all the links, and start scraping. You just have to handle the pagination, which in this case is tricky since it doesn't modify the url. 

You might have to actually spin up an instance of a headless browser and simulate a click. That takes us to a level of complexity beyond what we want to cover today though.

### Downloading PDFs or CSVs

In [None]:
url = "https://arxiv.org/search/?query=category+theory&searchtype=all&source=header"
pdf_url = "https://arxiv.org/pdf/1805.08795"
r = session.get(pdf_url)
with open('math.pdf', 'wb') as f:
    f.write(r.content)

You could do the exact same thing with a CSV file. The key piece for either a PDF or a CSV is that the file is opened in binary mode (`'wb'`) and that the object written is the raw content of the response (`r.content`) rather than the parsed HTML.
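Here is a small offline sketch of the CSV case. The `content` bytes are a made-up stand-in for `r.content`, and the file goes into a temporary directory so nothing in your working folder is overwritten.

```python
import csv
import os
import tempfile

# Placeholder bytes standing in for `r.content` from a CSV download
content = b"title,price\nA Sample Book,10.00\n"

# Binary mode: the bytes are written to disk exactly as received
path = os.path.join(tempfile.mkdtemp(), "download.csv")
with open(path, "wb") as f:
    f.write(content)

# Read the saved file back as text to confirm it is a usable CSV
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```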

### Scraping HTML data tables

Not all tables on the web are actually table elements. We'll go through how to handle a table, and you could take the process and apply it to other HTML structures.

In [None]:
table_url = "https://en.wikipedia.org/wiki/Table_(information)"

In [None]:
r = session.get(table_url)
r

In [None]:
# Find the table and see what it contains
data = r.html.find('.wikitable', first=True)
data.text

In [None]:
rows = data.find('tr')
rows

In [None]:
firsts = []
lasts = []
ages = []

In [None]:
for row in rows:
    # Rows without <td> cells (such as a header row of <th> cells) are skipped
    cells = row.find('td')
    if len(cells) > 0:
        firsts.append(cells[0].text)
        lasts.append(cells[1].text)
        ages.append(cells[2].text)

# Be aware that this is super bug prone code since it is so specific to 
# the precise data table I'm trying to scrape

In [None]:
firsts, lasts, ages

In [None]:
data = {'first_names': firsts, "last_names": lasts, "ages": ages}
df = pd.DataFrame(data=data)
df
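One possible way to make the row handling a little less brittle is to check each row's cell count against the expected columns and build one record per row. In this sketch, `scraped_rows` is hypothetical data standing in for `[cell.text for cell in row.find('td')]`, and the names in it are invented.

```python
# Expected column names, in table order
header = ["first_names", "last_names", "ages"]

# Hypothetical cell texts, one inner list per table row
scraped_rows = [
    [],                        # a header row has no <td> cells
    ["Ada", "Example", "40"],
    ["Incomplete"],            # malformed row with too few cells
    ["Ben", "Sample", "27"],
]

records = []
skipped = 0
for cells in scraped_rows:
    # Skip any row that does not match the expected shape
    if len(cells) != len(header):
        skipped += 1
        continue
    records.append(dict(zip(header, cells)))

print(records)
print(f"Skipped {skipped} rows")
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`.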

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
As a final activity, if you'd like, pick a site you've been wanting to scrape or have thought about. Figure out whether you could/should scrape the site. If you legally/ethically can, pick a part of a single page from the site, and scrape information from it. 
</p>
</div>

## Further topics

- headless browsers
- dynamic websites with JavaScript