# Webscraping with Python

## Instructors

- Scott Bailey
- Claire Cahoon
- Walt Gurley

## Learning objectives

By the end of our workshop today, we hope you'll have a sense of when and why to webscrape, and how to extract selected information from websites into usable data.

## Topics

- what is webscraping?
- ethical and legal issues in webscraping
- HTML and CSS
- webscraping with requests-html

## Setup

With this Google Colab notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

## Zoom etiquette

Please make sure that your mic is muted during the workshop.

## Questions during the workshop

During the workshop, we have a second instructor who will be monitoring chat on Zoom. Please feel free to ask questions by chat throughout the workshop. Our second instructor will answer as able, and will aggregate questions with answers that might help everyone. 

At the end of each section of the workshop, the primary instructor will answer aggregated and new questions as time permits. If we aren't able to get to your question during the workshop, please follow up with us afterward. 

## Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code interactively. They're quickly becoming a standard way of combining data, code, and written explanations or visualizations into a single document that can be shared. There are many ways to run Jupyter notebooks, including locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop. Colaboratory is “a Google research project created to help disseminate machine learning education and research.” If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to drop by our walk-in consulting or schedule an appointment with us.

https://go.ncsu.edu/dvs-request


## Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, for this workshop you will need an environment with the following packages installed and available:
- `pandas`
- `requests-html`

Please note that we will likely not have time during the workshop to support you with problems related to a local environment, and we do recommend using the Colaboratory notebooks if you are at all unsure.



## What is webscraping?

**Question**: what types of tasks do you think of as webscraping?

Webscraping is the selective retrieval of information from HTML documents on the web. Expansively, we could include the process of directed, automated retrieval of other filetypes such as PDF and CSV from web servers. 

Webcrawling is the automated indexing of websites, and typically involves progressive processing of a site and its links, and the repetition of this process.  
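To make the "process a page, follow its links, repeat" idea concrete, here is a minimal sketch of a crawl loop. It makes no network requests: the `SITE` dictionary and its page names are invented for this example and stand in for fetching a page and extracting its links.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
# In a real crawler, this lookup would be an HTTP request plus link extraction.
SITE = {
    "/": ["/catalogue", "/about"],
    "/catalogue": ["/catalogue/book-1", "/catalogue/book-2", "/"],
    "/catalogue/book-1": ["/catalogue"],
    "/catalogue/book-2": ["/catalogue"],
    "/about": [],
}


def crawl(start, get_links):
    """Visit every page reachable from start exactly once, breadth first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:  # skip pages we have already queued
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", lambda page: SITE.get(page, []))
print(visited)
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other, which is the main thing that distinguishes crawling a whole site from scraping a single page.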

## Ethical concerns in webscraping

First: ethics is not law, but you should be concerned with both. The legality of webscraping in different situations continues to be debated and decided.

There are at least two things you need to check before starting to scrape a website. 

1. Is the content under copyright or licensed in such a way that you should not scrape it? Are there terms of service that limit your use of the site and/or its content?
2. Does the site have a robots.txt file that circumscribes what a robot/scraper should do on the site?
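As a rough illustration of the second check, Python's standard library includes `urllib.robotparser` for reading robots.txt rules. The sketch below parses a made-up robots.txt (the rules, the `my-scraper` user agent name, and the `example.com` URLs are all assumptions for this example) rather than downloading a real one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() takes the file's lines; against a live site you would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

allowed_public = parser.can_fetch("my-scraper", "https://example.com/catalogue/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

A robots.txt file only covers what robots are asked to do; it does not replace reading the site's terms of service.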

Further considerations:
- **How you scrape**: Is your webscraping going to negatively impact the site, especially through the frequency of your requests? Are you identifying yourself in a header when you scrape? Are you publishing or redistributing your code? Be a good citizen of the web.
- **What you do with the data**: Are you giving correct attribution? Are you illegally or unethically redistributing content or data? 
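One possible way to act on the "how you scrape" questions is to send an identifying `User-Agent` header and pause between requests. The helper below is only a sketch: the name `polite_get`, the default delay, and the contact string are assumptions for illustration, and `session` can be any object with a requests-style `get(url, headers=...)` method.

```python
import time


def polite_get(session, urls, delay_seconds=1.0,
               user_agent="workshop-scraper (you@example.edu)"):
    """Fetch each URL in turn, identifying ourselves and pausing between requests.

    `session` is any object with a requests-style get(url, headers=...) method,
    such as the HTMLSession used later in this notebook.
    """
    headers = {"User-Agent": user_agent}
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        responses.append(session.get(url, headers=headers))
    return responses
```

With the `session` created later in this notebook, a call might look like `polite_get(session, ["http://books.toscrape.com/"])`. An appropriate delay depends on the site, so treat the one-second default as a placeholder.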

There are plenty of resources online about law, ethics, and best practices around webscraping. Here are a few if you'd like to think further about these concerns:
 
- https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01
- https://gijn.org/2015/08/12/on-the-ethics-of-web-scraping-and-data-journalism/
- https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/

**Notice: I am not offering you legal advice on whether to scrape or not or what to scrape, just mentioning issues for consideration.**




## Common webscraping libraries

There are hundreds of tutorials online about webscraping with Python, of varying quality. These are some of the libraries they commonly use:

- [`selenium`](https://www.selenium.dev/selenium/docs/api/py/)
- [`scrapy`](https://github.com/scrapy/scrapy)
- [`beautifulsoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [`requests`](https://github.com/psf/requests)
- [`urllib`](https://docs.python.org/3/library/urllib.html)
- [`lxml`](https://github.com/lxml/lxml)
- [`MechanicalSoup`](https://github.com/MechanicalSoup/MechanicalSoup)
- [`requests-html`](https://github.com/psf/requests-html)

## Webscraping with requests-html

Why `requests-html`?

- Wraps commonly used libraries in a straightforward API that covers most common webscraping cases
- Provides asynchronous methods
- Provides support for dynamic sites rendered with JavaScript

[`requests-html` docs](https://requests.readthedocs.io/projects/requests-html/en/latest/)

In [None]:
# !pip install requests-html

In [3]:
# Import the particular class we need to make HTTP requests
# and parse the response
from requests_html import HTMLSession

In [None]:
# Create an instance of that class
session = HTMLSession()

We're going to learn first with a site set up by ScrapingHub for practicing webscraping: http://books.toscrape.com/. This site gives us well-formed, common-looking HTML and CSS. It is much cleaner than most sites you might try to scrape.

In [5]:
# Define a url variable for ease of use
url = "http://books.toscrape.com/"

In [8]:
# Send the get request to that url, and save the response in the variable r
r = session.get(url)

I've used the variable `r` to save the response we get from the server, which contains a lot of information. `requests-html` makes quite a lot of that information available to us as properties on the response. We'll take a look at a few here. 

In [9]:
# Check the status code of the response
# Docs on status codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
r.status_code

200
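If you want to interpret a status code in code rather than by eye, the standard library's `http.HTTPStatus` enum maps codes to their standard phrases. This is a small sketch; the helper names `describe_status` and `is_success` are our own and not part of `requests-html`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib doesn't know
    return f"{status.value} {status.phrase}"


def is_success(code):
    """2xx codes mean the server fulfilled the request."""
    return 200 <= code < 300


print(describe_status(200), is_success(200))
print(describe_status(404), is_success(404))
```

In the notebook, you could pass `r.status_code` to either helper before trying to parse the response.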

In [10]:
# Check the response headers
# For info: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
r.headers

{'Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Mon, 12 Oct 2020 18:13:13 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 29 Jun 2016 21:39:03 GMT', 'X-Upstream': 'toscrape-books-master_web', 'Content-Encoding': 'gzip'}

In [16]:
# HTML object of response
r.html

<HTML url='http://books.toscrape.com/'>

In [17]:
# HTML content of html object
r.html.html

  \n            <h3><a href="catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html" title="The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics">The Boys in the ...</a></h3>\n        \n\n        \n            <div class="product_price">\n                \n\n\n\n\n\n\n    \n        <p class="price_color">£22.60</p>\n    \n\n<p class="instock availability">\n    <i class="icon-ok"></i>\n    \n        In stock\n    \n</p>\n\n                \n                    \n\n\n\n\n\n\n    \n    <form>\n        <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>\n    </form>\n\n\n                \n            </div>\n        \n    </article>\n\n</li>\n                    \n                        <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">\n\n\n\n\n\n\n    <article class="product_pod">\n        \n            <div class="image_contai

In [19]:
# Text from the <html> element
r.html.text



## An interlude on HTML and CSS

Why do we need to understand HTML and CSS to webscrape? Without HTML and CSS, we can't pinpoint the specific sections of the page or content we are interested in. Specifically, within HTML we need to understand `elements` or `tags`, and `attributes`. Within CSS, we definitely need to know `classes` and `ids`.
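To make those terms concrete, here is a small sketch that lists the elements and attributes in a hand-written snippet. It uses the standard library's `html.parser` instead of `requests-html` so it runs without a network connection; the snippet and the `book-1` id are invented for this example.

```python
from html.parser import HTMLParser

# Hand-written snippet modelled on the kind of markup discussed above.
SNIPPET = """
<article class="product_pod" id="book-1">
  <h3><a href="catalogue/example/index.html" title="An Example Book">An Example ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""


class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SNIPPET)
for tag, attributes in lister.tags:
    print(tag, attributes)
```

Here `article`, `h3`, `a`, and `p` are elements, while `class`, `id`, `href`, and `title` are attributes. In a CSS selector, `.product_pod` would match the class and `#book-1` would match the id.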

Let's take a look at our sample page with our browser's developer tools. In most browsers, you can right-click on a part of the page and click 'Inspect Element' or 'Inspect' to open the dev tools.

In Safari, you may need to first enable Developer Tools: Preferences -> Advanced -> "Show Develop menu in menu bar" 

In [20]:
# Find a piece of HTML by class
# Find returns all instances in a list
# If you want just the first, add first=True to your find method
r.html.find(".product_pod")

[<Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>,
 <Element 'article' class=('product_pod',)>]

In [25]:
first_prod = r.html.find(".product_pod", first=True)
first_prod

<Element 'article' class=('product_pod',)>

**Question**: What's the difference in type between the two `find` calls?

In [23]:
# Once we find an element, we can get its HTML
first_prod.html

'<article class="product_pod">\n<div class="image_container">\n<a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"/></a>\n</div>\n<p class="star-rating Three">\n<i class="icon-star"/>\n<i class="icon-star"/>\n<i class="icon-star"/>\n<i class="icon-star"/>\n<i class="icon-star"/>\n</p>\n<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>\n<div class="product_price">\n<p class="price_color">£51.77</p>\n<p class="instock availability">\n<i class="icon-ok"/>\n    \n        In stock\n    \n</p>\n<form>\n<button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>\n</form>\n</div>\n</article>'

In [24]:
# Or we can get the text from it
first_prod.text

'A Light in the ...\n£51.77\nIn stock\nAdd to basket'
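The text above is a single string, so one more step is needed to turn it into usable data. The sketch below assumes the fields always appear one per line in the order title, price, availability, as they do in this output; the function name `parse_product_text` and the `price_gbp` key are our own choices.

```python
def parse_product_text(text):
    """Split a product's text into named fields, assuming one field per line
    in the order title, price, availability (as in the output above)."""
    title, price, availability = text.split("\n")[:3]
    return {
        "title": title,
        "price_gbp": float(price.lstrip("£")),  # drop the currency symbol
        "availability": availability,
    }


sample_text = 'A Light in the ...\n£51.77\nIn stock\nAdd to basket'
record = parse_product_text(sample_text)
print(record)
```

In the notebook, you could call `parse_product_text(first_prod.text)`. Relying on line order is fragile, so on a real site it is usually safer to find each field by its own class, such as `.price_color`.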

In [27]:
# We can also chain the find calls to find progressively narrower
# elements on the page
sidebar = r.html.find(".sidebar", first=True)

In [32]:
# Let's find all of the genre links within the sidebar
# Notice how the successive tags let us find nested elements
sidebar_genres = sidebar.find(".nav-list ul li")
sidebar_genres

[<Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >,
 <Element 'li' >]

In [33]:
# Once we have the elements, we could collect the text
genres = [item.text for item in sidebar_genres]
genres

['Travel',
 'Mystery',
 'Historical Fiction',
 'Sequential Art',
 'Classics',
 'Philosophy',
 'Romance',
 'Womens Fiction',
 'Fiction',
 'Childrens',
 'Religion',
 'Nonfiction',
 'Music',
 'Default',
 'Science Fiction',
 'Sports and Games',
 'Add a comment',
 'Fantasy',
 'New Adult',
 'Young Adult',
 'Science',
 'Poetry',
 'Paranormal',
 'Art',
 'Psychology',
 'Autobiography',
 'Parenting',
 'Adult Fiction',
 'Humor',
 'Horror',
 'History',
 'Food and Drink',
 'Christian Fiction',
 'Business',
 'Biography',
 'Thriller',
 'Contemporary',
 'Spirituality',
 'Academic',
 'Self Help',
 'Historical',
 'Christian',
 'Suspense',
 'Short Stories',
 'Novels',
 'Health',
 'Politics',
 'Cultural',
 'Erotica',
 'Crime']

`requests-html` pre-collects links for you, since it's such a common and easily findable element. 

In [36]:
# Notice the type of the absolute links
links = r.html.absolute_links
print(type(links))
links

<class 'set'>


{'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/category/books/academic_40/index.html',
 'http://books.toscrape.com/catalogue/category/books/add-a-comment_18/index.html',
 'http://books.toscrape.com/catalogue/category/books/adult-fiction_29/index.html',
 'http://books.toscrape.com/catalogue/category/books/art_25/index.html',
 'http://books.toscrape.com/catalogue/category/books/autobiography_27/index.html',
 'http://books.toscrape.com/catalogue/category/books/biography_36/index.html',
 'http://books.toscrape.com/catalogue/category/books/business_35/index.html',
 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'http://books.toscrape.com/catalogue/category/books/christian-fiction_34/index.html',
 'http://books.toscrape.com/catalogue/category/books/christian_43/index.html',
 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'http://books.toscrape.com/catalogue/catego

**Question**: What's the difference between the absolute links and the links below? 

In [37]:
r.html.links

{'catalogue/a-light-in-the-attic_1000/index.html',
 'catalogue/category/books/academic_40/index.html',
 'catalogue/category/books/add-a-comment_18/index.html',
 'catalogue/category/books/adult-fiction_29/index.html',
 'catalogue/category/books/art_25/index.html',
 'catalogue/category/books/autobiography_27/index.html',
 'catalogue/category/books/biography_36/index.html',
 'catalogue/category/books/business_35/index.html',
 'catalogue/category/books/childrens_11/index.html',
 'catalogue/category/books/christian-fiction_34/index.html',
 'catalogue/category/books/christian_43/index.html',
 'catalogue/category/books/classics_6/index.html',
 'catalogue/category/books/contemporary_38/index.html',
 'catalogue/category/books/crime_51/index.html',
 'catalogue/category/books/cultural_49/index.html',
 'catalogue/category/books/default_15/index.html',
 'catalogue/category/books/erotica_50/index.html',
 'catalogue/category/books/fantasy_19/index.html',
 'catalogue/category/books/fiction_10/index.ht

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
On the sample page linked above, pick out some feature on the page that you want to locate. Using the developer tools in your browser, try to find something that has a clear HTML or CSS identifier. In the cell below, write code that returns just that element.
</p>
</div>

In [None]:
# Write code here

## More webscraping

We'll run through another example, looking at how to extract other relevant information. 

In [39]:
# A Byte of Python
url = "https://python.swaroopch.com/"

In [40]:
r = session.get(url)

In [41]:
# Find the nav section, and get all links within it
nav = r.html.find("nav", first=True)
chapters = nav.absolute_links
chapters

{'https://python.swaroopch.com/',
 'https://python.swaroopch.com/about.html',
 'https://python.swaroopch.com/about_python.html',
 'https://python.swaroopch.com/basics.html',
 'https://python.swaroopch.com/control_flow.html',
 'https://python.swaroopch.com/data_structures.html',
 'https://python.swaroopch.com/dedication.html',
 'https://python.swaroopch.com/exceptions.html',
 'https://python.swaroopch.com/feedback.html',
 'https://python.swaroopch.com/first_steps.html',
 'https://python.swaroopch.com/floss.html',
 'https://python.swaroopch.com/functions.html',
 'https://python.swaroopch.com/installation.html',
 'https://python.swaroopch.com/io.html',
 'https://python.swaroopch.com/modules.html',
 'https://python.swaroopch.com/more.html',
 'https://python.swaroopch.com/oop.html',
 'https://python.swaroopch.com/op_exp.html',
 'https://python.swaroopch.com/preface.html',
 'https://python.swaroopch.com/problem_solving.html',
 'https://python.swaroopch.com/revision_history.html',
 'https://p

In [42]:
# We could also find those links without the shortcut
chapter_link_els = nav.find("a")
chapter_link_els

[<Element 'a' href='./'>,
 <Element 'a' href='dedication.html'>,
 <Element 'a' href='preface.html'>,
 <Element 'a' href='about_python.html'>,
 <Element 'a' href='installation.html'>,
 <Element 'a' href='first_steps.html'>,
 <Element 'a' href='basics.html'>,
 <Element 'a' href='op_exp.html'>,
 <Element 'a' href='control_flow.html'>,
 <Element 'a' href='functions.html'>,
 <Element 'a' href='modules.html'>,
 <Element 'a' href='data_structures.html'>,
 <Element 'a' href='problem_solving.html'>,
 <Element 'a' href='oop.html'>,
 <Element 'a' href='io.html'>,
 <Element 'a' href='exceptions.html'>,
 <Element 'a' href='stdlib.html'>,
 <Element 'a' href='more.html'>,
 <Element 'a' href='what_next.html'>,
 <Element 'a' href='floss.html'>,
 <Element 'a' href='about.html'>,
 <Element 'a' href='revision_history.html'>,
 <Element 'a' href='translations.html'>,
 <Element 'a' href='translation_howto.html'>,
 <Element 'a' href='feedback.html'>,
 <Element 'a' href='https://www.gitbook.com' target='blank'

In [43]:
# And then extract the href attribute to get the link url 
urls = [a.attrs["href"] for a in chapter_link_els]
urls

['./',
 'dedication.html',
 'preface.html',
 'about_python.html',
 'installation.html',
 'first_steps.html',
 'basics.html',
 'op_exp.html',
 'control_flow.html',
 'functions.html',
 'modules.html',
 'data_structures.html',
 'problem_solving.html',
 'oop.html',
 'io.html',
 'exceptions.html',
 'stdlib.html',
 'more.html',
 'what_next.html',
 'floss.html',
 'about.html',
 'revision_history.html',
 'translations.html',
 'translation_howto.html',
 'feedback.html',
 'https://www.gitbook.com']

Most of these are relative links, but we could recompose the absolute URLs if we needed to.
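One way to recompose absolute URLs is `urljoin` from the standard library's `urllib.parse`. This sketch uses the base URL from this example and a few of the `href` values from the output above.

```python
from urllib.parse import urljoin

base_url = "https://python.swaroopch.com/"
# A few of the href values from the output above
sample_hrefs = ["./", "dedication.html", "https://www.gitbook.com"]

# Relative links are resolved against the base; absolute links pass through unchanged
absolute = [urljoin(base_url, href) for href in sample_hrefs]
print(absolute)
```

In the notebook, the same comprehension should work over the full `urls` list.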

We used the `.attrs` property on the link elements, which holds each element's HTML attributes as a dictionary. You could use a different key in place of `href` to get any other HTML attribute.

In [44]:
# If we wanted just the chapter links, since this is a standard list, 
# we could use slicing to remove the first and last links
urls_filtered = urls[1:-1]
urls_filtered

['dedication.html',
 'preface.html',
 'about_python.html',
 'installation.html',
 'first_steps.html',
 'basics.html',
 'op_exp.html',
 'control_flow.html',
 'functions.html',
 'modules.html',
 'data_structures.html',
 'problem_solving.html',
 'oop.html',
 'io.html',
 'exceptions.html',
 'stdlib.html',
 'more.html',
 'what_next.html',
 'floss.html',
 'about.html',
 'revision_history.html',
 'translations.html',
 'translation_howto.html',
 'feedback.html']
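Slicing works here because we know where the unwanted links sit in the list. If the positions might change, a possible alternative is to filter on the content of each link. The rule below (relative links that end in `.html`) is an assumption about what counts as a chapter on this site, and the name `is_chapter_link` is our own.

```python
from urllib.parse import urlparse


def is_chapter_link(href):
    """Treat a link as a chapter if it is relative and points at an .html file."""
    parts = urlparse(href)
    is_relative = not parts.scheme and not parts.netloc
    return is_relative and parts.path.endswith(".html")


# A shortened version of the urls list from above
sample_urls = ["./", "dedication.html", "preface.html", "feedback.html", "https://www.gitbook.com"]
chapter_urls = [href for href in sample_urls if is_chapter_link(href)]
print(chapter_urls)
```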

## Further topics:

- headless browsers
- dynamic websites with JavaScript