## Scraping

There is a lot of great data out on the web. Unfortunately, not all of it is readily available via APIs, and even when an API exists, it may restrict the data we can access. Scraping usually refers to extracting content from web pages when an API is not available.

In the API section, we used urllib to call an API and save data. We can also use it to extract data from webpages.

In [21]:
import urllib.request

In [22]:
html = urllib.request.urlopen("http://xkcd.com/1481/")
print(html.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<script>\n  (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n  })(window,document,\'script\',\'https://www.google-analytics.com/analytics.js\',\'ga\');\n\n  ga(\'create\', \'UA-25700708-7\', \'auto\');\n  ga(\'send\', \'pageview\');\n</script>\n<link rel="stylesheet" type="text/css" href="/s/b0dcca.css" title="Default"/>\n<title>xkcd: API</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<script type="text/javascript" src="/s/b66ed7.js" async></scr

We can use the `urlretrieve` function to retrieve a specific resource, such as a file, by its URL. This is basic web scraping.

If we look through our html above, we can see there is a url for the image in the page. But before we go doing that, maybe we should check the robots.txt file first...

In [27]:
robot = urllib.request.urlopen("https://xkcd.com/robots.txt")
print(robot.read())

b'User-agent: *\nDisallow: /personal/'


Looks like we are good!
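
Reading robots.txt by eye works for a short file, but the standard library can also do the check for us. Here is a minimal sketch using `urllib.robotparser`. The rules are pasted in from the output above rather than fetched, and the helper name `is_allowed` is only an illustration, not something used elsewhere in this tutorial.

```python
from urllib.robotparser import RobotFileParser

# robots.txt rules copied from the xkcd output above
XKCD_RULES = "User-agent: *\nDisallow: /personal/"


def is_allowed(rules_text, url, user_agent="*"):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(XKCD_RULES, "http://xkcd.com/1481/"))           # True
print(is_allowed(XKCD_RULES, "http://xkcd.com/personal/notes"))  # False
```

For a live site, `RobotFileParser` also has `set_url()` and `read()` methods that download the file for you.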

In [28]:
urllib.request.urlretrieve("http://imgs.xkcd.com/comics/api.png", "api.png")

('api.png', <http.client.HTTPMessage at 0x10815c748>)

The cell below this is markdown. Double-click on it so it is in editing mode, then execute it to display the file you downloaded with the previous command.

![](api.png)
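
Above, we found the image URL by reading the page source ourselves. As a rough sketch of doing the same thing in code while still treating the page as a plain string, we could search for `<img>` tags with a regular expression. The `SAMPLE_HTML` fragment and the `find_image_urls` helper are assumptions made up for this example, and a regex like this is brittle on real pages.

```python
import re
from urllib.parse import urljoin

# Hypothetical fragment standing in for the page source printed earlier
SAMPLE_HTML = '<div id="comic">\n<img src="//imgs.xkcd.com/comics/api.png" alt="API"/>\n</div>'


def find_image_urls(html_text, base_url):
    """Return absolute URLs for every <img ... src="..."> found in html_text."""
    sources = re.findall(r'<img[^>]*\ssrc="([^"]+)"', html_text)
    # urljoin resolves relative and protocol-relative links against the page URL
    return [urljoin(base_url, src) for src in sources]


print(find_image_urls(SAMPLE_HTML, "http://xkcd.com/1481/"))
# ['http://imgs.xkcd.com/comics/api.png']
```

If you start from `urlopen(...).read()`, remember that it returns bytes, so decode it (for example with `.decode("utf-8")`) before searching.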

Using these methods, we are treating the html as an unstructured string. If we want to retrieve the structured markup, we can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

Let's look at [this page](https://litemind.com/best-famous-quotes). What if we wanted to extract the quotes and authors? First, are we allowed to?

In [29]:
robot = urllib.request.urlopen("https://litemind.com/robots.txt")
print(robot.read())

b'User-agent: *\nDisallow: /wp-admin\nDisallow: /wp-content/cache\nDisallow: /trackback\nDisallow: */trackback\n\nAllow: /wp-content/uploads\n\nDisallow: /manifests\nDisallow: /search\nDisallow: /newsletter-verify\nDisallow: /newsletter-welcome\nDisallow: /mind-explorations-ebook\nDisallow: /best-of-litemind-ebook\nDisallow: /wp-content/uploads/misc/best-of-litemind-ebook.pdf\n\n\n# BEGIN XML-SITEMAP-PLUGIN\nSitemap: http://litemind.com/sitemap.xml.gz\n# END XML-SITEMAP-PLUGIN\n'


The page we are scraping isn't excluded in the robots.txt file.

In [31]:
from bs4 import BeautifulSoup
url = "https://litemind.com/best-famous-quotes"

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
for quote in soup.findAll('div',{'class':'wp_quotepage'}):
    text = quote.findChildren()[0].renderContents()
    author = quote.findChildren()[1].renderContents()
    print(text, author)

b'1. You can do anything, but not everything.' b'\xe2\x80\x94David Allen'
b'2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.' b'\xe2\x80\x94Antoine de Saint-Exup\xc3\xa9ry'
b'3. The richest man is not he who has the most, but he who needs the least.' b'\xe2\x80\x94Unknown Author'
b'4. You miss 100 percent of the shots you never take.' b'\xe2\x80\x94Wayne Gretzky'
b'5. Courage is not the absence of fear, but rather the judgement that something else is more important than fear.' b'\xe2\x80\x94Ambrose Redmoon'
b'6. You must be the change you wish to see in the world.' b'\xe2\x80\x94Gandhi'
b'7. When hungry, eat your rice; when tired, close your eyes. Fools may laugh at me, but wise men will know what I mean.' b'\xe2\x80\x94Lin-Chi'
b'8. The third-rate mind is only happy when it is thinking with the majority. The second-rate mind is only happy when it is thinking with the minority. The first-rate mind is only happy when it is th
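
The output above is a series of byte strings because `renderContents()` returns bytes. If we would rather work with regular text, one option is `get_text()`. The sketch below runs on a small hand-written sample, not the live page: the `<p>` tags inside each `wp_quotepage` div are an assumption about the markup, and `extract_quotes` is a name invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each div holds the quote first and the author second
SAMPLE_HTML = """
<div class="wp_quotepage"><p>1. You can do anything, but not everything.</p><p>\u2014David Allen</p></div>
<div class="wp_quotepage"><p>4. You miss 100 percent of the shots you never take.</p><p>\u2014Wayne Gretzky</p></div>
"""


def extract_quotes(html_text):
    """Return a list of (quote, author) tuples as plain strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    pairs = []
    for quote in soup.find_all("div", class_="wp_quotepage"):
        # True matches every tag; recursive=False keeps only direct children
        children = quote.find_all(True, recursive=False)
        text = children[0].get_text(strip=True)
        # Drop the leading dash character in front of the author's name
        author = children[1].get_text(strip=True).lstrip("\u2014")
        pairs.append((text, author))
    return pairs


for text, author in extract_quotes(SAMPLE_HTML):
    print(text, "|", author)
```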

That seems pretty easy, right?

Scraping does take work. You need to be able to read the page source to understand how the information is structured and how you can access it. These examples here have been fairly straightforward. Sometimes the markup is messy and poorly formed.

Let's take a look at this [restaurant inspection report](http://dc.healthinspections.us/webadmin/dhd_431/lib/mod/inspection/paper/_paper_food_inspection_report.cfm?inspectionID=105185&wguid=1367&wgunm=sysact&wgdmn=431) from the DC Department of Health. What does the robots.txt file tell us? Let's use requests this time.

In [33]:
import requests

In [34]:
robot = requests.get("https://dc.healthinspections.us/robots.txt")

In [36]:
robot.status_code

200

In [37]:
robot.content

b'# Global robots.txt\r\n\r\nUser-agent: *\r\nDisallow: / \t\t\t\t\t# Protect Website\r\nDisallow: /_!DEMO_FILES\t\t\t# Protect DEMO\r\nDisallow: /_css/ \t\t\t\t# Protect StyleSheets\r\nDisallow: /_templates/ \t\t\t# Protect Templates\r\nDisallow: /articles/ \t\t\t# Protect Articles\r\nDisallow: /attachments/ \t\t# Protect Attachments\r\nDisallow: /banners/ \t\t\t# Protect Banners\r\nDisallow: /cont/ \t\t\t\t# Protect Content\r\nDisallow: /help/ \t\t\t\t# Protect Help Files\r\nDisallow: /images/ \t\t\t\t# Protect Images\r\nDisallow: /tasks/ \t\t\t\t# Protect Scheduled Tasks\r\nDisallow: /temp/ \t\t\t\t# Protect Temporary Files\r\nDisallow: /toolkit/ \t\t\t# Protect Toolkit\r\nDisallow: /weatherIcons/ \t\t# Protect Weather Icons\r\nDisallow: /webadmin/ \t\t\t# Protect Intranet Tools\r\nDisallow: /media/ \t\t\t# Protect Houston Media Site'

Well, that's not good. It isn't easy to read in this format, but the [robots.txt](https://dc.healthinspections.us/robots.txt) file has this right up top:

```
# Global robots.txt

User-agent: *
Disallow: / 					# Protect Website
```
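
One way to get from the raw bytes to something readable like the excerpt above is to decode them and tidy up the whitespace. This is only a sketch: `RAW` holds the first few lines of the response copied from the output, and `readable_lines` is a helper name made up for the example.

```python
# First lines of the response body shown above, as raw bytes
RAW = b"# Global robots.txt\r\n\r\nUser-agent: *\r\nDisallow: / \t\t\t\t\t# Protect Website\r\n"


def readable_lines(raw_bytes, encoding="utf-8"):
    """Decode bytes and return the non-empty lines with runs of whitespace collapsed."""
    text = raw_bytes.decode(encoding)
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


for line in readable_lines(RAW):
    print(line)
# # Global robots.txt
# User-agent: *
# Disallow: / # Protect Website
```

With requests, `robot.text` gives you the decoded string directly, so `print(robot.text)` is often all you need.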

We should not scrape or crawl this site.

However, it is a great example of messy pages to scrape so I have included one of the reports in the repo for this tutorial.

In [89]:
# pull in the html and take a look at it
html_file = "https://raw.githubusercontent.com/nd1/staging/master/105185.html"
html_rpt = requests.get(html_file)
html_rpt.content

b'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>\n<title>Food Establishment Inspection Report</title>\n<link href="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/css/generic.css" rel="stylesheet" type="text/css" media="all, screen, print" />\n<style type="text/css">\ndiv.container {\n\tborder-bottom: 1px solid black;\n\tmargin-right:15px;\n  }\ndiv.container span {\n\tposition:relative;\n\tbottom:-2px;\n\tbackground:#FFFFFF;\n  }\n.checkboxRedN {\nfloat:left;\nborder:1px solid red;\nwidth:10px;\nheight:10px;\nfont-size:5px;\n}\n.checkboxN {\nborder:1px solid black;\nwidth:10px;\nheight:10px;\nfont-size:2px;\n}\n.boxwid2{\n\twidth:185px;\n}\n.spacer{\n\twidth:10px;\n}\n.line1{\n\twidth:400px;\n}\n.line2{\n\twidth:230px;\n}\n.line3{\n\twidth:245px;\n}\n.line5{\n\twidth:235px;\n}\n.hdrSpace{\n\twidth: 4px;\n}\n.hdrSpace2{\n\twidth: 4px;\n}\n.inspTypeSpace{\n\twidth: 15px;\n}\n.vioHeight {\n\t\theight: 675px;\n}\n.sigWidth {\n\twidth: 750px;\n

That is a mess. Whatever they are doing is not easy to parse through. 

Luckily, this is not an uncommon task so there are libraries to help us out. [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) can parse data from HTML and XML. It represents the html document as a nested data structure that we can navigate.

In [55]:
soup = BeautifulSoup(html_rpt.content, 'html.parser')

In [56]:
print(soup.prettify())

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Food Establishment Inspection Report
  </title>
  <link href="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/css/generic.css" media="all, screen, print" rel="stylesheet" type="text/css"/>
  <style type="text/css">
   div.container {
	border-bottom: 1px solid black;
	margin-right:15px;
  }
div.container span {
	position:relative;
	bottom:-2px;
	background:#FFFFFF;
  }
.checkboxRedN {
float:left;
border:1px solid red;
width:10px;
height:10px;
font-size:5px;
}
.checkboxN {
border:1px solid black;
width:10px;
height:10px;
font-size:2px;
}
.boxwid2{
	width:185px;
}
.spacer{
	width:10px;
}
.line1{
	width:400px;
}
.line2{
	width:230px;
}
.line3{
	width:245px;
}
.line5{
	width:235px;
}
.hdrSpace{
	width: 4px;
}
.hdrSpace2{
	width: 4px;
}
.inspTypeSpace{
	width: 15px;
}
.vioHeight {
		height: 675px;
}
.sigWidth {
	width: 750px;
}
.blankFld {
	border-bottom: 1px solid black;
	display:inl

Beautiful Soup lets you access information through tags in the html. The tags are the same as the ones in the document. Let's start at the top.

In [83]:
soup.title

<title>Food Establishment Inspection Report</title>

All tags have names.

In [58]:
soup.title.name

'title'

Sometimes they have attributes too. 

In [86]:
soup.title.attrs

{}

But title does not have any, so we get an empty dictionary. It does contain a string though.

In [87]:
soup.title.string

'Food Establishment Inspection Report'
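
To see names, attributes, and strings side by side, here is a small sketch on a trimmed-down stand-in for the top of the report. The shortened `href` value is an assumption made to keep the example compact.

```python
from bs4 import BeautifulSoup

# Small stand-in for the head of the report, trimmed for the example
SAMPLE_HTML = (
    "<html><head><title>Food Establishment Inspection Report</title>"
    '<link href="generic.css" rel="stylesheet" type="text/css"/>'
    "</head><body></body></html>"
)

sample = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(sample.title.name)         # title
print(sample.title.attrs)        # {}
print(sample.title.string)       # Food Establishment Inspection Report

# A tag with attributes behaves like a dictionary
print(sample.link["href"])       # generic.css
print(sample.link.get("media"))  # None, because this sample has no media attribute
```

Using `.get()` instead of square brackets avoids a `KeyError` when an attribute is missing.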

The head of the document doesn't have much information we may want to parse.

In [94]:
soup.head

<head>
<title>Food Establishment Inspection Report</title>
<link href="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/css/generic.css" media="all, screen, print" rel="stylesheet" type="text/css"/>
<style type="text/css">
div.container {
	border-bottom: 1px solid black;
	margin-right:15px;
  }
div.container span {
	position:relative;
	bottom:-2px;
	background:#FFFFFF;
  }
.checkboxRedN {
float:left;
border:1px solid red;
width:10px;
height:10px;
font-size:5px;
}
.checkboxN {
border:1px solid black;
width:10px;
height:10px;
font-size:2px;
}
.boxwid2{
	width:185px;
}
.spacer{
	width:10px;
}
.line1{
	width:400px;
}
.line2{
	width:230px;
}
.line3{
	width:245px;
}
.line5{
	width:235px;
}
.hdrSpace{
	width: 4px;
}
.hdrSpace2{
	width: 4px;
}
.inspTypeSpace{
	width: 15px;
}
.vioHeight {
		height: 675px;
}
.sigWidth {
	width: 750px;
}
.blankFld {
	border-bottom: 1px solid black;
	display:inline-block;
}
@media print{
	body {
		margin-top:inherit;
		margin-left:inheri

It looks like everything is in the body, and is contained in a lot of tables. We can string tags together to access the tables in the body.

In [101]:
soup.body

<body class="pt10 arial" style="margin-left:0px;margin-right:0px;margin-top:0px;margin-bottom:0px;width:750px;">
<table cellpadding="0" cellspacing="0" style="width:750px;margin-bottom:5px;margin-top:10px;">
<tbody><tr>
<td style="vertical-align:bottom;width:110px;">
<img alt="image" height="40" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/new_bars_and_stars_blueText.jpg" width="150"/>
</td>
<td class="center" style="vertical-align:middle;width:530px;">
<span class="times strong" style="font-size:16pt;">Food Establishment Inspection Report</span><br/>
<span class="pt8">Pursuant to Title 25-A of the District of Columbia Municipal Regulations</span>
</td>
<td style="vertical-align:bottom;text-align:right;width:110px;">
<img alt="image" height="30" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/doh_logo.jpg" width="75"/>
</td>
</tr>
</tbody></table>
<div class="border times center" style="font-size:7.5pt;"

In [102]:
soup.body.table

<table cellpadding="0" cellspacing="0" style="width:750px;margin-bottom:5px;margin-top:10px;">
<tbody><tr>
<td style="vertical-align:bottom;width:110px;">
<img alt="image" height="40" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/new_bars_and_stars_blueText.jpg" width="150"/>
</td>
<td class="center" style="vertical-align:middle;width:530px;">
<span class="times strong" style="font-size:16pt;">Food Establishment Inspection Report</span><br/>
<span class="pt8">Pursuant to Title 25-A of the District of Columbia Municipal Regulations</span>
</td>
<td style="vertical-align:bottom;text-align:right;width:110px;">
<img alt="image" height="30" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/doh_logo.jpg" width="75"/>
</td>
</tr>
</tbody></table>

How many tables does the document contain?

In [103]:
len(soup.find_all('table'))

13
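
The difference between chaining tags and calling `find_all` is easier to see on a tiny document. This is a sketch on made-up markup; the `id` values are hypothetical and do not appear in the real report.

```python
from bs4 import BeautifulSoup

# Made-up two-table document for illustration
SAMPLE_HTML = """
<html><body>
<table id="header"><tr><td>Food Establishment Inspection Report</td></tr></table>
<table id="contact"><tr><td>Telephone</td><td>E-mail address</td></tr></table>
</body></html>
"""
sample = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_table = sample.body.table        # attribute access stops at the first match
all_tables = sample.find_all("table")  # find_all returns every match

print(first_table["id"])                       # header
print(len(all_tables))                         # 2
print([table["id"] for table in all_tables])   # ['header', 'contact']
print([td.get_text() for td in all_tables[1].find_all("td")])
```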

We can navigate up and down a document's structure by looking at a tag's parent and its children, using attributes such as `.parent` and `.contents`.
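
Before trying this on the full report, here is a minimal sketch of moving up and down a tree on a made-up document that is small enough to follow by eye.

```python
from bs4 import BeautifulSoup, Tag

# Made-up document for illustration
SAMPLE_HTML = (
    "<html><body>\n<table><tr><td>left</td><td>right</td></tr></table>\n"
    "<div>footer</div></body></html>"
)
sample = BeautifulSoup(SAMPLE_HTML, "html.parser")

row = sample.body.table.tr

# Going up: each tag knows its parent
print(row.parent.name)         # table
print(row.parent.parent.name)  # body

# Going down: .contents lists whitespace strings as well as tags
print(len(sample.body.contents))  # 4

# Keep only the child tags, skipping the newline strings
child_tags = [child.name for child in sample.body.children if isinstance(child, Tag)]
print(child_tags)  # ['table', 'div']
```

The newline strings in `.contents` are why indexing into it directly can give surprising results.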

In [106]:
soup.body.parent

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Food Establishment Inspection Report</title>
<link href="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/css/generic.css" media="all, screen, print" rel="stylesheet" type="text/css"/>
<style type="text/css">
div.container {
	border-bottom: 1px solid black;
	margin-right:15px;
  }
div.container span {
	position:relative;
	bottom:-2px;
	background:#FFFFFF;
  }
.checkboxRedN {
float:left;
border:1px solid red;
width:10px;
height:10px;
font-size:5px;
}
.checkboxN {
border:1px solid black;
width:10px;
height:10px;
font-size:2px;
}
.boxwid2{
	width:185px;
}
.spacer{
	width:10px;
}
.line1{
	width:400px;
}
.line2{
	width:230px;
}
.line3{
	width:245px;
}
.line5{
	width:235px;
}
.hdrSpace{
	width: 4px;
}
.hdrSpace2{
	width: 4px;
}
.inspTypeSpace{
	width: 15px;
}
.vioHeight {
		height: 675px;
}
.sigWidth {
	width: 750px;
}
.blankFld {
	border-bottom: 1px solid black;
	display:inline-block;
}
@media print{
	body {

In [100]:
soup.body.contents

['\n',
 <table cellpadding="0" cellspacing="0" style="width:750px;margin-bottom:5px;margin-top:10px;">
 <tbody><tr>
 <td style="vertical-align:bottom;width:110px;">
 <img alt="image" height="40" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/new_bars_and_stars_blueText.jpg" width="150"/>
 </td>
 <td class="center" style="vertical-align:middle;width:530px;">
 <span class="times strong" style="font-size:16pt;">Food Establishment Inspection Report</span><br/>
 <span class="pt8">Pursuant to Title 25-A of the District of Columbia Municipal Regulations</span>
 </td>
 <td style="vertical-align:bottom;text-align:right;width:110px;">
 <img alt="image" height="30" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/doh_logo.jpg" width="75"/>
 </td>
 </tr>
 </tbody></table>,
 '\n',
 <div class="border times center" style="font-size:7.5pt;">
 <em>Bureau of Community Hygiene <span class="hdrSpace" style="display: inline-bl

In [72]:
soup.body.table

<table cellpadding="0" cellspacing="0" style="width:750px;margin-bottom:5px;margin-top:10px;">
<tbody><tr>
<td style="vertical-align:bottom;width:110px;">
<img alt="image" height="40" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/new_bars_and_stars_blueText.jpg" width="150"/>
</td>
<td class="center" style="vertical-align:middle;width:530px;">
<span class="times strong" style="font-size:16pt;">Food Establishment Inspection Report</span><br/>
<span class="pt8">Pursuant to Title 25-A of the District of Columbia Municipal Regulations</span>
</td>
<td style="vertical-align:bottom;text-align:right;width:110px;">
<img alt="image" height="30" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/doh_logo.jpg" width="75"/>
</td>
</tr>
</tbody></table>

In [78]:
soup.body.table.tr

<tr>
<td style="vertical-align:bottom;width:110px;">
<img alt="image" height="40" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/new_bars_and_stars_blueText.jpg" width="150"/>
</td>
<td class="center" style="vertical-align:middle;width:530px;">
<span class="times strong" style="font-size:16pt;">Food Establishment Inspection Report</span><br/>
<span class="pt8">Pursuant to Title 25-A of the District of Columbia Municipal Regulations</span>
</td>
<td style="vertical-align:bottom;text-align:right;width:110px;">
<img alt="image" height="30" src="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/images/doh_logo.jpg" width="75"/>
</td>
</tr>

Chaining tags like this is how we need to access most of the data in this document. Using Beautiful Soup can involve a lot of trial and error. Here are some examples of what I parsed from this document.

In [68]:
inspection = soup.find_all('tr')

In [69]:
# phone number and email address
inspection[5].get_text()

'\n\n\nTelephone\n\xa0(202) 667-0010\n\xa0E-mail address\n\t\t\t\t\t\t\t\t\xa0kelvin.ferrufino@gmail.com\n\t\t\t\t\t\t\t\n\n'

In [70]:
# inspection details
inspection[6].get_text()

'\n\n\nDate of Inspection\n\xa006\n/\n\xa026\n/\n\xa02013\n\xa0\xa0\xa0\xa0Time In\n\xa007\n:\n\xa000\nPM\n\xa0\xa0\xa0\xa0Time Out\n\xa008\n:\n\xa015\nAM\n\xa0\n\n\n'

In [71]:
# the inspector name and badge number
tables = soup.find_all('table')
inspector_info = tables[11]
inspector = inspector_info.find_all("td")

print(inspector[1])
print(inspector[2])

<td style="width:225px; vertical-align: bottom;"> A. Jackson</td>
<td style="width:90px; vertical-align: bottom;">54 </td>


Even once you find the data you want to extract, you can still end up doing a lot of work to clean it up into a usable format.
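
As a sketch of what that cleanup can look like, the code below takes the inspection details string from above, collapses the whitespace, and pulls out the date. The helper names are made up for the example, and reading the date as month/day/year is an assumption based on the report being from the US.

```python
import re
from datetime import date

# get_text() output for the inspection details row shown above
RAW = (
    "\n\n\nDate of Inspection\n\xa006\n/\n\xa026\n/\n\xa02013\n\xa0\xa0\xa0\xa0Time In"
    "\n\xa007\n:\n\xa000\nPM\n\xa0\xa0\xa0\xa0Time Out\n\xa008\n:\n\xa015\nAM\n\xa0\n\n\n"
)


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    # str.split() with no arguments splits on any whitespace, including \xa0
    return " ".join(text.split())


def inspection_date(text):
    """Return the inspection date, or None if the pattern is not found."""
    match = re.search(r"Date of Inspection (\d{2}) / (\d{2}) / (\d{4})", normalize(text))
    if match is None:
        return None
    # Assumption: the report uses month / day / year order
    month, day, year = (int(part) for part in match.groups())
    return date(year, month, day)


print(normalize(RAW))
# Date of Inspection 06 / 26 / 2013 Time In 07 : 00 PM Time Out 08 : 15 AM
print(inspection_date(RAW))
# 2013-06-26
```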

There are a lot of resources out there for building scrapers. Do you have a page you want to scrape? If so, try it out now. If you want some more ideas, here are some resources to take a look at:

* Tutorial for [building your first scraper](http://first-web-scraper.readthedocs.io/en/latest/)
