# Web Scraping with Python: An Introductory Tutorial

### By Rob Osterburg, RRCC

## Topics
* Use requests and beautifulsoup to gather data from web pages
* Cover key concepts, tools and techniques as we go
* Scrape some data
* Organize it using namedtuple
* Persist it as a JSON file

## Key Packages

* requests -- Issues HTTP requests to web servers and handles the response 

* beautifulsoup4 -- Creates a searchable tree from a web page

* These packages are *not* in Python's standard library; here is how to install them:
 
    - Standard Python: `pip install requests beautifulsoup4`

    - Anaconda Python: `conda install requests beautifulsoup4`


In [1]:
# Requests is 'HTTP for Humans'
import requests

# BeautifulSoup parses and builds tree from HTML
from bs4 import BeautifulSoup

## Warning! Warning!
* Web scraping is often frowned upon by site owners
* **Always check** 
    - **Terms of service**  
    - **Conditions of use**
    - **robots.txt file**   
* Aggressive scrapers can take a site down
* If in doubt, talk to a lawyer, especially for *anything* work related
* Related information and experiences: [Sustainable Scrapers, PyData DC, 2016](http://pyvideo.org/pydata-dc-2016/sustainable-scrapers.html)
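
One simple way to keep a scraper from becoming aggressive is to pause between requests. The sketch below is only an illustration and was not part of the original talk: `polite_fetch`, its `delay` parameter, and the stand-in `fake_fetch` function are hypothetical names. In real use you would pass something like `requests.get` as `fetch` and choose a delay the site can comfortably handle.

```python
import time


def polite_fetch(urls, fetch, delay=1.0):
    """Fetch each URL in turn, sleeping `delay` seconds between requests.

    `fetch` is any callable that takes a URL and returns a response.
    Passing it in keeps this sketch independent of the network.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Hypothetical stand-in for requests.get so the example runs offline
def fake_fetch(url):
    return 'fetched ' + url


pages = polite_fetch(
    ['http://pyvideo.org/tag/2to3/', 'http://pyvideo.org/tag/3d/'],
    fetch=fake_fetch,
    delay=0.1,
)
print(pages)
```

A fixed delay is the bluntest possible throttle, but it is often enough for a small, one-off scrape.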

* Very real legal issues can arise from web scraping
* Taking a site down can be treated as a *denial-of-service attack*, which is covered by the US Computer Fraud and Abuse Act
* What about PyVideo.org?

## PyVideo.org - Terms and Conditions?

None that I can see, the site's [About page](http://pyvideo.org/pages/about.html) says:

> ... PyVideo.org is a freely available index of freely available resources that seek to provide everyone with the opportunity to learn about Python.

## PyVideo.org - Robots.txt?
* A robots.txt file contains *restrictions* on the content that web crawlers are permitted to access

* [http://pyvideo.org/robots.txt](http://pyvideo.org/robots.txt) is empty, except for one comment

* It appears that there are no restrictions on scraping PyVideo.org  

* Let's minimize our impact on the site 

* To see *The Robots Exclusion Protocol* visit [http://www.robotstxt.org/](http://www.robotstxt.org/).
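
Python's standard library can read robots.txt rules for you. The sketch below feeds `urllib.robotparser` a small, hypothetical rule set (the `/private/` path and the `example.com` URLs are made up for illustration) and then an effectively empty file like the one described above, so it runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse() accepts an iterable of lines

allowed = parser.can_fetch('*', 'http://example.com/tags.html')
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)

# A file containing nothing but a comment places no restrictions
empty_parser = RobotFileParser()
empty_parser.parse(['# an empty robots.txt'])
empty_allows = empty_parser.can_fetch('*', 'http://example.com/tags.html')
print(empty_allows)
```

To check a live site instead, `set_url()` followed by `read()` downloads and parses the real file before you call `can_fetch()`.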

In [2]:
# Getting the Page
resp = requests.get('http://pyvideo.org/tags.html')
resp.status_code

200

## HTTP Responses

HTTP Verb | Effect | Typical Success | Typical Failure
--------- | ------ | --------------- | ---------------
POST      | Create | 200, 201        | 4XX, 5XX
**GET**   | **Read** | **200**       | **4XX, 5XX**
PUT       | Update | 200, 204        | 4XX, 5XX
DELETE    | Delete | 200, 204        | 4XX, 5XX

* [Comprehensive list of HTTP status codes](https://httpstatuses.com/)

* GET is all we will need today
* For our GET requests, any status other than 200 signals a problem of some sort
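
Status codes are grouped by their leading digit, which makes them easy to sort into broad categories. The helper below, `status_category`, is a hypothetical name introduced for this sketch. It relies only on the standard library's `http.HTTPStatus` to look up the official reason phrase for each code.

```python
from http import HTTPStatus


def status_category(status_code):
    """Classify an HTTP status code by its leading digit."""
    if 200 <= status_code < 300:
        return 'success'
    if 300 <= status_code < 400:
        return 'redirect'
    if 400 <= status_code < 500:
        return 'client error'
    if 500 <= status_code < 600:
        return 'server error'
    return 'other'


for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, '->', status_category(code))
```

With requests you rarely need to write this yourself: calling `resp.raise_for_status()` raises an exception for any 4XX or 5XX response.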

In [3]:
# Parse the Page
soup = BeautifulSoup(resp.text, 'html.parser')

## BeautifulSoup - Parses the page
 
* Models the page as a tree
* Similar to the Document Object Model
 ![tags page structure](./images/htmltree.png)

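* Parsers build the tree, pick one
 - html.parser -> Python 3's built-in parser, part of the standard library, so nothing extra to install
 - HTMLParser  -> Python 2's name for the same standard library module
 - lxml        -> Fast, requires installing a C library and a PyPI package
 - html5lib    -> Pure Python and parses pages the way a browser does, but it is a separate PyPI package, *not* part of the standard library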
 
 cite: https://interactivepython.org/runestone/static/pythonds/Trees/ExamplesofTrees.html

## BeautifulSoup - Well documented and easy to use API
* select('css selector') --> [List of Tags]

* find_all(tags, keyword_args, attrs={'attr': 'value'})  --> [List of Tags]
    
* find(tags, keyword_args, attrs={'attr': 'value'}) --> Tag (or None if nothing matches)
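
The three calls above can be compared side by side on a small page. The HTML snippet in this sketch is invented for illustration (it only loosely resembles the PyVideo tags page), and the `tag` class is an assumption, not something taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modeled on a list of links
html = """
<ul>
  <li><a href="/tag/2to3/" class="tag">2to3</a></li>
  <li><a href="/tag/3d/" class="tag">3d</a></li>
  <li><a href="/pages/about.html">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of Tags
css_links = soup.select('li > a')

# find_all() filters by tag name and attributes; attrs is a dictionary
tag_links = soup.find_all('a', attrs={'class': 'tag'})

# find() returns the first match, or None when nothing matches
about = soup.find('a', href='/pages/about.html')
missing = soup.find('table')

print(len(css_links), [a['href'] for a in tag_links], about.get_text(), missing)
```

Because `find()` can return `None`, it is worth checking the result before indexing into it.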
 

## Understanding Page Structure

* To extract data from a page, you need to understand its structure
* BeautifulSoup holds the tree in memory
* Look into it using your browser's developer tools
    - Mac -- Cmd + Opt + i
    - Linux -- Ctrl + Shift + i
    - Windows -- Ctrl + Shift + i

Visiting the [Tags page](http://pyvideo.org/tags), you can see the structure of the page  
![tags page structure](./images/structure-of-tags-page.png)

In [4]:
# Building a list of links using a CSS selector and a for loop 
topics = []
for topic in soup.select('li > a'):
    # extract the link using the href attribute
    topics.append(topic['href'])
topics[:10]

['http://pyvideo.org/index.html',
 'http://pyvideo.org/events.html',
 'http://pyvideo.org/tags.html',
 'http://pyvideo.org/speakers.html',
 'http://pyvideo.org/pages/about.html',
 'http://pyvideo.org/pages/thank-you-contributors.html',
 'http://pyvideo.org/pages/thanks-will-and-sheila.html',
 'http://pyvideo.org/tag/2to3/',
 'http://pyvideo.org/tag/3d/',
 'http://pyvideo.org/tag/3d-printing/']

In [5]:
# Building a list of <a> tags (not just their hrefs) using a CSS selector and a list comprehension
topics = [topic 
          for topic in soup.select('li > a')]
topics[:8]

[<a href="http://pyvideo.org/index.html"><i class="fa fa-fw fa-home"></i> <span>Start</span></a>,
 <a href="http://pyvideo.org/events.html"><i class="fa fa-fw fa-list-ul"></i> <span>Events</span></a>,
 <a href="http://pyvideo.org/tags.html"><i class="fa fa-fw fa-tags"></i> <span>Tags</span></a>,
 <a href="http://pyvideo.org/speakers.html"><i class="fa fa-fw fa-users"></i> <span>Speakers</span></a>,
 <a href="http://pyvideo.org/pages/about.html"><i class="fa fa-fw fa-info"></i> <span>About</span></a>,
 <a href="http://pyvideo.org/pages/thank-you-contributors.html"><i class="fa fa-fw fa-info"></i> <span>Thank You</span></a>,
 <a href="http://pyvideo.org/pages/thanks-will-and-sheila.html"><i class="fa fa-fw fa-info"></i> <span></span></a>,
 <a href="http://pyvideo.org/tag/2to3/">2to3</a>]

In [6]:
topic = topics[7]
topic, type(topic), topic.contents, topic['href']

(<a href="http://pyvideo.org/tag/2to3/">2to3</a>,
 bs4.element.Tag,
 ['2to3'],
 'http://pyvideo.org/tag/2to3/')

In [7]:
# clean up links: keep only the anchors that point at tag pages
topics = [topic 
          for topic in topics 
          if 'tag/' in topic['href']]
len(topics), type(topics[0]), topics[:5]

(1461,
 bs4.element.Tag,
 [<a href="http://pyvideo.org/tag/2to3/">2to3</a>,
  <a href="http://pyvideo.org/tag/3d/">3d</a>,
  <a href="http://pyvideo.org/tag/3d-printing/">3D Printing</a>,
  <a href="http://pyvideo.org/tag/abc/">abc</a>,
  <a href="http://pyvideo.org/tag/accelerate/">accelerate</a>])

In [8]:
# Let's visit the topic page
resp = requests.get('http://pyvideo.org/tag/scraping')
resp.status_code

200

In [9]:
# Parse the page
soup = BeautifulSoup(resp.text, 'html.parser')

In [10]:
# Find the links to video pages 
talks = soup.select('article > section > h4 > a')
talk_links = ['http://pyvideo.org' + talk['href'] 
              for talk in talks]
talk_links[:5]

['http://pyvideo.org/pydata-dc-2016/open-data-dashboards-python-web-scraping.html',
 'http://pyvideo.org/pydata-san-francisco-2016/how-soon-is-now-extracting-publication-dates-with-machine-learning.html',
 'http://pyvideo.org/pygotham-2016/introduction-to-web-scraping-using-scrapy.html',
 'http://pyvideo.org/pygotham-2016/webscraping-by-example-an-introduction-to-beautifulsoup.html',
 'http://pyvideo.org/chipy/scraping-with-python.html']
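
Prefixing `'http://pyvideo.org'` works when every `href` is root-relative, as it is here. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles root-relative, page-relative, and already-absolute links alike. In the sketch below the `page2.html` link is hypothetical and included only to show the page-relative case.

```python
from urllib.parse import urljoin

# The page the links were scraped from
base = 'http://pyvideo.org/tag/scraping/'

hrefs = [
    '/chipy/scraping-with-python.html',                    # root-relative
    'http://pyvideo.org/chipy/scraping-with-python.html',  # already absolute
    'page2.html',                                          # hypothetical page-relative link
]

# urljoin resolves each href against the base URL
absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```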

In [11]:
# Get a page for one of the talks 
resp = requests.get(talk_links[0])
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')

In [12]:
# Get the talk's title
titles = [title.contents[0].strip() 
          for title in soup.select('.entry-title > a')]
title = titles[0]

In [13]:
# Get the speaker's name 
names = [elem.contents[0] 
         for elem in soup.select('.url')]
names

['Marie Whittaker']

In [14]:
# Get the talk details
details = [detail 
           for detail in soup.select('.details-content > ul > li > a')]
details

[<a href="http://pyvideo.org/events/pydata-dc-2016.html">PyData DC 2016</a>,
 <a href="https://www.youtube.com/watch?v=kc676iLvib8" rel="external">YouTube</a>,
 <a href="http://pyvideo.org/tag/data/">Data</a>,
 <a href="http://pyvideo.org/tag/scraping/">scraping</a>,
 <a href="http://pyvideo.org/tag/web/">web</a>]

In [20]:
# Get the YouTube link
links = [detail['href'] 
         for detail in details 
         if 'youtube.com' in detail['href']]
link = links[0]

In [21]:
# Get the subject tags
tags = [detail.contents[0].lower() 
        for detail in details 
        if '/tag/' in detail['href']]
tags

['data', 'scraping', 'web']

In [22]:
# Get the description
paragraphs = soup.select('.entry-content > p')
description = '\n\n'.join([p.contents[0] for p in paragraphs])
print(description)

PyData DC 2016

Distilling a world of data down to a few key indicators can be an effective way of keeping an audience informed, and this concept is at the heart of a good dashboard. This talk will cover a few methods of scraping and reshaping open data for dashboard visualization, to automate the boring stuff so you have more time and energy to focus on the analysis and content.

This talk will cover a basic scenario of curating open data into visualizations for an audience. The main goal is to automate data scraping/downloading and reshaping. I use python to automate data gathering, and Tableau and D3 as visualization tools -- but the process can be applied to numerous analytical/visualization suites.

I'll discuss situations where a dashboard makes sense (and when one doesn't). I will make a case also that automation makes for a more seamless data gathering and updating process, but not always for smarter data analysis.

Some python packages I'll cover for web scraping and downloadi

##  Saving the Data

* Use namedtuple to capture information about a PyVideo talk

* What do we need to capture?
    - Talk title (string)
    - Name of presenter(s) (list)
    - Description (string)
    - Tags (list)
    - Link to video (string)
    
* Let's use JSON 
    - Awesome for persisting structured data 
    - Excellent data exchange format
    

In [35]:
# Storing related data in a namedtuple
from collections import namedtuple

# create the structure
pyvideo = namedtuple('pyvideo', 'title names description tags link')

# create an instance
talk_data = pyvideo(title=title, names=names, tags=tags, link=link, description=description)

# access the data and display it as a formatted string
fmt = 'title={}\nnames={}\ntags={}\nlink={}'
print(fmt.format(talk_data.title, talk_data.names, talk_data.tags, talk_data.link))

title=Open Data Dashboards & Python Web Scraping
names=['Marie Whittaker']
tags=['data', 'scraping', 'web']
link=https://www.youtube.com/watch?v=kc676iLvib8


In [41]:
# Writing the namedtuple to disk as JSON
import json

# Write to file
with open('a_video.json', 'w') as fout:
    # Note: preserve keys by saving a dictionary, not a list
    json.dump(talk_data._asdict(), fout)
    
# Read from file
with open('a_video.json', 'r') as fin:
    restored_talk_data = json.load(fin)

# json.load returns a plain dictionary
print(type(restored_talk_data))

# Deserialize -- Note: pass as keyword arguments using **
restored_talk_data = pyvideo(**restored_talk_data)

# Compare semantically
talk_data == restored_talk_data

<class 'dict'>


True
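
The same round trip extends to several talks by saving a list of dictionaries. This is a minimal sketch rather than part of the original notebook: it redefines the `pyvideo` namedtuple so it runs on its own, uses `json.dumps`/`json.loads` on a string instead of a file, and the second record is a placeholder invented purely to make the list longer than one.

```python
import json
from collections import namedtuple

pyvideo = namedtuple('pyvideo', 'title names description tags link')

talks = [
    pyvideo(title='Open Data Dashboards & Python Web Scraping',
            names=['Marie Whittaker'],
            description='PyData DC 2016',
            tags=['data', 'scraping', 'web'],
            link='https://www.youtube.com/watch?v=kc676iLvib8'),
    # Placeholder record, not a real talk
    pyvideo(title='Placeholder talk',
            names=['Example Speaker'],
            description='',
            tags=[],
            link='http://example.com/video'),
]

# Serialize: one dictionary per talk, so the field names survive
serialized = json.dumps([talk._asdict() for talk in talks], indent=2)

# Deserialize: rebuild each namedtuple from its dictionary
restored = [pyvideo(**record) for record in json.loads(serialized)]

print(restored == talks)
```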

## Writing web scraping code has advantages

* Excellent source of data 

* Scarce data are more valuable data

## ... and presents some problems too

* Your code breaks when the site is updated

* Read the terms and conditions, otherwise ...

* Getting blocked, banned, sued ...
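
One way to soften breakage when a site changes is to extract each field defensively, so a missing element yields a default value instead of an exception. The helper `first_text` below is a hypothetical name, and the HTML is a made-up fragment that reuses the `.entry-title` and `.url` selectors from the cells above.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector, default=None):
    """Return the stripped text of the first CSS match, or `default`."""
    element = soup.select_one(selector)
    if element is None:
        return default  # the selector no longer matches anything
    return element.get_text(strip=True)


# Hypothetical fragment: it has a title but no speaker element
html = '<h2 class="entry-title"><a href="#"> Open Data Dashboards </a></h2>'
soup = BeautifulSoup(html, 'html.parser')

title = first_text(soup, '.entry-title > a')
speaker = first_text(soup, '.url', default='unknown')
print(title, '|', speaker)
```

Logging whenever a default is used makes it easier to notice that the page layout has changed.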

## Acknowledgements

* [Galvanize](https://www.galvanize.com/pick-a-location?page=%2F) for hosting [Denver Data Science Day 2017](http://denverdatascienceday.com/) and all the work they did to make it happen.

* Bob Mickus and all the volunteers from [PyData Denver](https://www.meetup.com/PyData-Denver/) who helped organize [Denver Data Science Day](http://denverdatascienceday.com/).

## Resources
1. [Automate the Boring Stuff](https://automatetheboringstuff.com/), [Chapter 11 — Web Scraping](https://automatetheboringstuff.com/chapter11/) by Al Sweigart (Free PDF version online).  Takes you through topics step-by-step, includes using Selenium to fill out forms and simulate mouse clicks. 

1. [RealPython Blog -- Web Scraping With Scrapy and MongoDB](https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/) by Michael Herman. Scrapy is a Python package that makes scraping code easier to maintain. 

1. [Talk Python to Me Podcast -- Web scraping at scale with Scrapy and ScrapingHub](https://talkpython.fm/episodes/show/50/web-scraping-at-scale-with-scrapy-and-scrapinghub). Web scraping as a Service from the author of Scrapy.

1. [PyVideo.org](http://pyvideo.org/) -- Comprehensive catalog of over 8000 videos of Python-related presentations. Talks on scraping web pages can be found on the [Scraping page](http://pyvideo.org/tag/scraping/). 

1. [Web Scraping with Python: Collecting Data from the Modern Web](https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291) by Ryan Mitchell.  This 4.5 star book on Amazon covers scraping topics in depth.

1. [Awesome Python](https://awesome-python.com/) -- PyPI has over 100,000 packages.  Awesome Python is a curated list of the best, see their recommended web scraping packages [here](https://awesome-python.com/#web-crawling).

1. Practice Sites
    * [Books to Scrape](http://books.toscrape.com/)
    * [Quotes to Scrape](http://quotes.toscrape.com/) 