## Scraping

There is a lot of great data on the web. Unfortunately, not all of it is readily available via APIs, and even when an API exists, it may restrict the data we can access. Scraping usually refers to extracting content from web pages when an API is not available.

In the API section, we used urllib to call an API and save data. We can also use it to download webpages.

In [1]:
import urllib.request

In [2]:
html = urllib.request.urlopen("http://xkcd.com/1481/")
print(html.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<script>\n  (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n  })(window,document,\'script\',\'https://www.google-analytics.com/analytics.js\',\'ga\');\n\n  ga(\'create\', \'UA-25700708-7\', \'auto\');\n  ga(\'send\', \'pageview\');\n</script>\n<link rel="stylesheet" type="text/css" href="/s/b0dcca.css" title="Default"/>\n<title>xkcd: API</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<script type="text/javascript" src="/s/b66ed7.js" async></scr
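
The output above starts with `b'` because `read()` returns bytes rather than text. Here is a minimal sketch of turning such a body into a string. The helper name `decode_body` and the UTF-8 fallback are assumptions made for illustration, and the sample bytes are a shortened stand-in for the real page.

```python
def decode_body(raw, charset=None):
    """Decode a bytes response body into a str.

    charset would normally come from the response headers, e.g.
    response.headers.get_content_charset(); UTF-8 is an assumed fallback.
    """
    return raw.decode(charset or "utf-8", errors="replace")


# A shortened sample standing in for the bytes returned by html.read()
sample = b"<!DOCTYPE html>\n<html>\n<head>\n<title>xkcd: API</title>\n</head>\n</html>"
page = decode_body(sample)
print(page)
```

Once decoded, the page prints with real line breaks instead of `\n` escapes, which makes it much easier to read.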

We can use the urlretrieve function to retrieve a specific resource, such as a file, by its URL. This is basic web scraping.

If we look through the HTML above, we can see there is a URL for the image on the page.

In [3]:
urllib.request.urlretrieve("http://imgs.xkcd.com/comics/api.png", "api.png")

('api.png', <http.client.HTTPMessage at 0x105c3a5c0>)

The cell below this is markdown. Double-click on it so it is in editing mode, then execute it to display the file you downloaded with the previous command.

![](api.png)
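
The image URL used above was found by reading the page source by eye. The same lookup can be done programmatically. The sketch below uses only the standard library; the class name `ImageCollector` and the sample snippet are hypothetical and are not copied from the actual page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative and protocol-relative URLs against the page URL
                self.images.append(urljoin(self.base_url, src))


# Hypothetical snippet in the style of a page source, not copied from it
sample_html = '<div id="comic"><img src="//imgs.xkcd.com/comics/api.png" alt="API"/></div>'

collector = ImageCollector("http://xkcd.com/1481/")
collector.feed(sample_html)
print(collector.images)
```

Each collected URL could then be passed to urlretrieve, as in the cell above.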

Using these methods, we are treating the HTML as an unstructured string. If we want to work with the structured markup, we can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). As its documentation puts it: "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

In [5]:
from bs4 import BeautifulSoup
url = 'https://litemind.com/best-famous-quotes'

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
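# find_all is the current name for findAll; each matching div holds one quote
for quote in soup.find_all('div', {'class': 'wp_quotepage'}):
    children = quote.find_all()  # all tags inside the div: quote text first, then author
    text = children[0].renderContents()    # renderContents returns bytes, hence the b'...' output
    author = children[1].renderContents()
    print(text, author)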

b'1. You can do anything, but not everything.' b'\xe2\x80\x94David Allen'
b'2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.' b'\xe2\x80\x94Antoine de Saint-Exup\xc3\xa9ry'
b'3. The richest man is not he who has the most, but he who needs the least.' b'\xe2\x80\x94Unknown Author'
b'4. You miss 100 percent of the shots you never take.' b'\xe2\x80\x94Wayne Gretzky'
b'5. Courage is not the absence of fear, but rather the judgement that something else is more important than fear.' b'\xe2\x80\x94Ambrose Redmoon'
b'6. You must be the change you wish to see in the world.' b'\xe2\x80\x94Gandhi'
b'7. When hungry, eat your rice; when tired, close your eyes. Fools may laugh at me, but wise men will know what I mean.' b'\xe2\x80\x94Lin-Chi'
b'8. The third-rate mind is only happy when it is thinking with the majority. The second-rate mind is only happy when it is thinking with the minority. The first-rate mind is only happy when it is th
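
The values printed above are raw bytes, so the dash before each author shows up as `\xe2\x80\x94` and every quote still carries its list number. Below is a minimal sketch of one way to tidy them. The helper name `clean_quote` is an assumption for illustration; the two sample rows are copied from the output above.

```python
import re


def clean_quote(text_bytes, author_bytes):
    """Decode one scraped quote and strip the numbering and leading dash."""
    text = text_bytes.decode("utf-8")
    author = author_bytes.decode("utf-8")
    text = re.sub(r"^\d+\.\s*", "", text)       # drop the leading "1. "
    author = author.lstrip("\u2014").strip()    # drop the leading dash (U+2014)
    return text, author


# Two rows copied from the output above
rows = [
    (b'1. You can do anything, but not everything.', b'\xe2\x80\x94David Allen'),
    (b'2. Perfection is achieved, not when there is nothing more to add, '
     b'but when there is nothing left to take away.',
     b'\xe2\x80\x94Antoine de Saint-Exup\xc3\xa9ry'),
]
cleaned = [clean_quote(t, a) for t, a in rows]
for text, author in cleaned:
    print(f"{author}: {text}")
```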

Scraping takes work. You need to be able to read the page source to understand how the information is structured and how you can access it. The examples here have been fairly straightforward. Sometimes the markup is messy and poorly formed.

The next commands will look at a [restaurant inspection report](http://dc.healthinspections.us/webadmin/dhd_431/lib/mod/inspection/paper/_paper_food_inspection_report.cfm?inspectionID=105185&wguid=1367&wgunm=sysact&wgdmn=431) from the DC Department of Health.

The markup is messy. It took me a while to parse the data I wanted from it. And once I did, I spent a while cleaning it up too. Let's take a look. (To avoid bombarding the site, I have included one of the reports in the repo for this tutorial.)

In [16]:
# pull in the html and take a look at it
html_file = "https://raw.githubusercontent.com/nd1/staging/master/105185.html"
html_rpt = urllib.request.urlopen(html_file).read()
html_rpt

b'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>\n<title>Food Establishment Inspection Report</title>\n<link href="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/css/generic.css" rel="stylesheet" type="text/css" media="all, screen, print" />\n<style type="text/css">\ndiv.container {\n\tborder-bottom: 1px solid black;\n\tmargin-right:15px;\n  }\ndiv.container span {\n\tposition:relative;\n\tbottom:-2px;\n\tbackground:#FFFFFF;\n  }\n.checkboxRedN {\nfloat:left;\nborder:1px solid red;\nwidth:10px;\nheight:10px;\nfont-size:5px;\n}\n.checkboxN {\nborder:1px solid black;\nwidth:10px;\nheight:10px;\nfont-size:2px;\n}\n.boxwid2{\n\twidth:185px;\n}\n.spacer{\n\twidth:10px;\n}\n.line1{\n\twidth:400px;\n}\n.line2{\n\twidth:230px;\n}\n.line3{\n\twidth:245px;\n}\n.line5{\n\twidth:235px;\n}\n.hdrSpace{\n\twidth: 4px;\n}\n.hdrSpace2{\n\twidth: 4px;\n}\n.inspTypeSpace{\n\twidth: 15px;\n}\n.vioHeight {\n\t\theight: 675px;\n}\n.sigWidth {\n\twidth: 750px;\n
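
Keeping a local copy of a page, as was done with this report, is one way to avoid hitting a site repeatedly. Here is a rough sketch of a cached download with a pause after each real request. The function names, the one-second default delay, and the example.com URL are all assumptions for illustration, and the demonstration uses a stand-in fetcher so that no request is actually made.

```python
import os
import tempfile
import time
import urllib.request


def download(url):
    """Fetch a URL and return its body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_path, fetch=download, delay=1.0):
    """Return the page body, reading from cache_path when it already exists.

    Only a cache miss touches the network, and each miss is followed by a
    pause of `delay` seconds so repeated calls do not hammer the server.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    body = fetch(url)
    with open(cache_path, "wb") as f:
        f.write(body)
    time.sleep(delay)
    return body


# Demonstration with a stand-in fetcher so no real request is made
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html><title>Food Establishment Inspection Report</title></html>"


cache_file = os.path.join(tempfile.mkdtemp(), "105185.html")
first = fetch_cached("http://example.com/report", cache_file, fetch=fake_fetch, delay=0)
second = fetch_cached("http://example.com/report", cache_file, fetch=fake_fetch, delay=0)
print(len(calls))  # the second call is served from disk
```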

In [17]:
# Using Beautiful Soup involved a lot of trial and error. Here are some examples of what I parsed.
soup = BeautifulSoup(html_rpt, 'html.parser')
inspection = soup.find_all('tr')

In [18]:
# phone number
inspection[5].get_text()

'\n\n\nTelephone\n\xa0(202) 667-0010\n\xa0E-mail address\n\t\t\t\t\t\t\t\t\xa0kelvin.ferrufino@gmail.com\n\t\t\t\t\t\t\t\n\n'
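
The text above is full of newlines, tabs, and non-breaking spaces (`\xa0`). One possible cleanup step is to collapse all the whitespace and then pick out the labelled fields. In the sketch below, the helper name `parse_contact` and the regular expression are assumptions; the sample string has the same shape as the output above, with a placeholder e-mail address.

```python
import re


def parse_contact(raw):
    """Collapse whitespace, then pull the phone number and e-mail out of the row text."""
    # str.split() with no arguments splits on any whitespace, including \xa0
    normalized = " ".join(raw.split())
    match = re.search(r"Telephone (?P<phone>.*?) E-mail address (?P<email>\S+)", normalized)
    return match.groupdict() if match else None


# Same shape as the output above; the e-mail address is a placeholder
raw_row = ("\n\n\nTelephone\n\xa0(202) 667-0010\n\xa0E-mail address"
           "\n\t\t\t\t\t\t\t\t\xa0owner@example.com\n\t\t\t\t\t\t\t\n\n")
contact = parse_contact(raw_row)
print(contact)
```

The function returns None when the labels are not found, so the caller can flag reports that use a different layout.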

In [19]:
# inspection details
inspection[6].get_text()

'\n\n\nDate of Inspection\n\xa006\n/\n\xa026\n/\n\xa02013\n\xa0\xa0\xa0\xa0Time In\n\xa007\n:\n\xa000\nPM\n\xa0\xa0\xa0\xa0Time Out\n\xa008\n:\n\xa015\nAM\n\xa0\n\n\n'
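
The date and times above are split across many lines. Below is a sketch of turning them into datetime objects, assuming the date is written month/day/year and the times use a 12-hour clock. The helper name `parse_inspection_times` is hypothetical; the sample string is the output above.

```python
import re
from datetime import datetime


def parse_inspection_times(raw):
    """Return (time_in, time_out) as datetime objects, or None if a field is missing."""
    normalized = " ".join(raw.split())  # collapses newlines and \xa0 into single spaces
    date = re.search(r"Date of Inspection (\d+) / (\d+) / (\d+)", normalized)
    time_in = re.search(r"Time In (\d+) : (\d+) ([AP]M)", normalized)
    time_out = re.search(r"Time Out (\d+) : (\d+) ([AP]M)", normalized)
    if not (date and time_in and time_out):
        return None

    def to_datetime(t):
        # Assumes month/day/year order and a 12-hour clock
        stamp = "{}/{}/{} {}:{} {}".format(*date.groups(), *t.groups())
        return datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")

    return to_datetime(time_in), to_datetime(time_out)


raw_row = ("\n\n\nDate of Inspection\n\xa006\n/\n\xa026\n/\n\xa02013\n\xa0\xa0\xa0\xa0Time In"
           "\n\xa007\n:\n\xa000\nPM\n\xa0\xa0\xa0\xa0Time Out\n\xa008\n:\n\xa015\nAM\n\xa0\n\n\n")
time_in, time_out = parse_inspection_times(raw_row)
print(time_in, time_out)
print(time_out < time_in)
```

In this report the recorded Time Out (08:15 AM) falls before the Time In (07:00 PM), so the last line prints True. Scraped values often need this kind of sanity check before analysis.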

In [20]:
# the inspector name and badge number
tables = soup.find_all('table')
inspector_info = tables[11]
inspector = inspector_info.find_all("td")

print(inspector[1])
print(inspector[2])

<td style="width:225px; vertical-align: bottom;"> A. Jackson</td>
<td style="width:90px; vertical-align: bottom;">54 </td>
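
Printing a tag shows its full markup. To keep only the text, Beautiful Soup's `get_text(strip=True)` drops the tags and trims the surrounding whitespace. The sketch below assumes bs4 is installed and uses a small table built from the two cells printed above, so they sit at indexes 0 and 1 here rather than 1 and 2.

```python
from bs4 import BeautifulSoup

# A small table rebuilt from the two cells printed above
cells_html = (
    '<table><tr>'
    '<td style="width:225px; vertical-align: bottom;"> A. Jackson</td>'
    '<td style="width:90px; vertical-align: bottom;">54 </td>'
    '</tr></table>'
)
cells = BeautifulSoup(cells_html, "html.parser").find_all("td")
name = cells[0].get_text(strip=True)   # text only, surrounding whitespace removed
badge = cells[1].get_text(strip=True)
print(name, badge)
```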


There are many resources available for building scrapers. Take a look at the ones listed in the slides, and try out this tutorial for [building your first scraper](http://first-web-scraper.readthedocs.io/en/latest/).

Thanks!