# Working our way up to web scraping

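There's been some interest in web scraping.  Serious scraping is beyond us, but there are some things we can do:  read tables straight from a web page with Pandas, access pages with Requests, and extract pieces of them with Beautiful Soup.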

**Note: requires internet access to run.**  

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course [Data Bootcamp](http://databootcamp.nyuecon.com/).  

<a id=prelims></a>

## Preliminaries 

Import packages, etc.  

In [21]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import sys                      # system module, used to get Python version 
import os                       # operating system tools (check files)
import datetime as dt           # date tools, used to note current date  

# these are new 
import requests, io             # internet and input tools  
from bs4 import BeautifulSoup   # website parsing

%matplotlib inline 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())


Python version:  3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Requests version:  2.9.1
Today's date: 2016-04-25


<a id=lucky></a>

## Sometimes we get lucky

We sometimes find that we can access data straight from a web page with Pandas' `read_html`.  It works just like `read_csv` or `read_excel`.  

The first example is [baseball-reference.com](http://www.baseball-reference.com/).  The same people run similar sites for football and basketball.  Many of their pages are collections of tables.  See, for example, [this one](http://www.baseball-reference.com/players/m/mccutan01.shtml) for Pittsburgh's Andrew McCutchen.

In [3]:
# baseball reference
url = 'http://www.baseball-reference.com/players/m/mccutan01.shtml'
am  = pd.read_html(url)

print('Output has type', type(am), 'and length', len(am))
print('First element has type', type(am[0]))

Output has type <class 'list'> and length 10


**Question.** What do we have here?  A list of length 10?  Whose elements are dataframes?  Evidently this reads in all the tables from the page into dataframes and collects them in a list.  
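
To see the list-of-dataframes behavior without depending on a live site, here is a minimal sketch that feeds `read_html` a small HTML string.  The page and its two tables are made up for illustration, and the sketch assumes the lxml parser is installed.

```python
import io
import pandas as pd

# Made-up page with two tables (illustration only)
html = """
<html><body>
<table>
  <tr><th>Year</th><th>G</th></tr>
  <tr><td>2009</td><td>108</td></tr>
  <tr><td>2010</td><td>152</td></tr>
</table>
<table>
  <tr><th>Tm</th><th>Lg</th></tr>
  <tr><td>PIT</td><td>NL</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one dataframe per <table> it finds
tables = pd.read_html(io.StringIO(html), flavor='lxml')

print('Number of tables:', len(tables))
print('Type of first element:', type(tables[0]))
print(tables[0])
```

Each element of the list is an ordinary dataframe, so we can pick out the table we want by position, as in the `am[4]` cell that follows.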

In [4]:
am[4].head()

Unnamed: 0,Year,Tm,Lg,Age,Pos,G,GS,CG,Inn,Ch,...,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110,Unnamed: 111,Unnamed: 112,Unnamed: 113,Unnamed: 114
0,2009.0,PIT,NL,22,CF,108,108.0,106.0,952.2,275.0,...,,,,,,,,,,
1,2009.0,PIT,NL,22,OF,108,108.0,106.0,952.2,275.0,...,,,,,,,,,,
2,2010.0,PIT,NL,23,CF,152,152.0,140.0,1290.1,386.0,...,,,,,,,,,,
3,2010.0,PIT,NL,23,OF,152,152.0,140.0,1290.1,386.0,...,,,,,,,,,,
4,2011.0,PIT,NL,24,CF,155,153.0,146.0,1353.2,430.0,...,,,,,,,,,,


Here's another one:  Google's stock price from Yahoo Finance.

In [5]:
url = 'http://finance.yahoo.com/q/hp?s=GOOG+Historical+Prices'
ggl = pd.read_html(url)

In [7]:
type(ggl)

list

In [8]:
len(ggl)

12

In [14]:
ggl[8]

Unnamed: 0,0
0,Prices


In [16]:
url = 'http://databootcamp.nyuecon.com/'
url = 'google.com'
db  = pd.read_html(url)

ValueError: No tables found
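
When a page has no tables, `read_html` raises a `ValueError` like the one above.  One way to handle that is to catch the error and fall back to an empty list.  This is a sketch only:  the helper name `read_tables` is ours, the page is made up, and we assume the lxml parser is installed.

```python
import io
import pandas as pd

def read_tables(source):
    """Return the list of tables read_html finds, or an empty list if it fails."""
    try:
        # flavor='lxml' makes pandas use only the lxml parser
        return pd.read_html(source, flavor='lxml')
    except ValueError as err:
        # pandas raises ValueError when, for example, the page has no <table>
        print('Could not read tables:', err)
        return []

# Made-up page with no <table> element (illustration only)
no_tables = io.StringIO('<html><body><p>Just text here.</p></body></html>')
result = read_tables(no_tables)
print('Tables found:', len(result))
```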

## Scanning urls

Itamar adds:  

Walk through the following steps before running the code:

1) Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices
2) Enter the dates you want and hit the get prices button.
3) Once the results are shown, look at the url in the address bar.
4) The new url includes several parameters, each separated by the & character.
5) Try to work out the meaning of each parameter (s, a, b, c, d, e, f, and g).
6) After some trial and error you will see that each parameter represents something you entered as input:
    the day, month, and year, the stock symbol, and the frequency you chose (daily, weekly, etc.).
7) Scroll down to the bottom of the page.  There is a link that lets you download the data as a csv file.  Click on it.
8) Open the csv file in Excel and look at its structure.
9) Go back to the web page.  Instead of clicking on the csv link, right-click on it and copy the link address.
10) Paste the address into a notebook.  This is the url we can use to access the data from our coding environment.
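
Steps 3 to 6 amount to reading the query string of a url.  As a rough sketch, Python's standard `urllib.parse` module can split a url into its parameters and build a new one.  The url and parameter values below are made up for illustration.  What each letter means on the real site is something to confirm with the trial and error described above.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical url of the kind step 3 produces (values invented for illustration)
url = 'http://finance.yahoo.com/q/hp?s=AAPL&a=0&b=1&c=2015&d=11&e=31&f=2015&g=d'

# Split the url and turn the query string into a dictionary
parts = urlsplit(url)
params = dict(parse_qsl(parts.query))
print(params)

# Change one parameter and rebuild the url
params['s'] = 'GOOG'
new_url = urlunsplit(parts._replace(query=urlencode(params)))
print(new_url)
```

Building the url from a dictionary makes it easy to loop over several stock symbols or date ranges without editing the string by hand.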



## Accessing web pages 

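We use the Requests package again, this time to get a web page.  The `get` function returns a response object that holds the page's content along with information about the request:  the headers, the url, and the status code.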

In [17]:
url = 'http://databootcamp.nyuecon.com/'
db = requests.get(url)



In [19]:
db.headers

{'X-Timer': 'S1461615585.265324,VS0,VE0', 'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Content-Length': '3499', 'Server': 'GitHub.com', 'X-Served-By': 'cache-iad2120-IAD', 'X-Cache': 'HIT', 'Via': '1.1 varnish', 'Content-Type': 'text/html; charset=utf-8', 'X-Fastly-Request-ID': 'c50dae06b36deefb27c8129eb25e565635a49345', 'Age': '21', 'X-GitHub-Request-Id': '17EB2E29:6096:17E002:571E7BCC', 'Last-Modified': 'Thu, 21 Apr 2016 23:57:03 GMT', 'Connection': 'keep-alive', 'X-Cache-Hits': '1', 'Expires': 'Mon, 25 Apr 2016 20:29:24 GMT', 'Date': 'Mon, 25 Apr 2016 20:19:45 GMT', 'Cache-Control': 'max-age=600', 'Access-Control-Allow-Origin': '*', 'Vary': 'Accept-Encoding'}

In [22]:
db.url

'http://databootcamp.nyuecon.com/'

In [23]:
db.status_code

200
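
A status code of 200 means the request succeeded.  As a small sketch, the standard library's `http.HTTPStatus` can translate a code into words.  The helper name `describe_status` is ours, not part of Requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)
    return '{} {}'.format(status.value, status.phrase)

# A few common codes: success, page not found, server error
for code in [200, 404, 500]:
    print(describe_status(code))
```

With a response object such as `db`, `describe_status(db.status_code)` gives the same kind of description.  Requests also provides `db.raise_for_status()`, which raises an exception for 4xx and 5xx codes.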

## Extracting pieces of web pages 

We use Beautiful Soup to parse the page's html, which lets us pull out individual pieces such as the title or the head.



In [38]:
bs = BeautifulSoup(db.content, 'lxml')

print('Type and length:  ', type(bs), ', ', len(bs), sep='')
print('Title: ', bs.title)
print('First n characters:\n', bs.prettify()[0:500], sep='')

Type and length:  <class 'bs4.BeautifulSoup'>, 7
Title:  <title>Data Bootcamp </title>
First n characters:
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js">
 <!--<![endif]-->
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Data Bootcamp
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="True" name="HandheldFriendly"/>
  <meta content="320" name="MobileOptimized"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="" name="description"/>
  <meta content="" name="keywords"/>
  <meta content="Data Bootcamp " property="og:title"/>
  <meta content="Data Bootcamp" property="og:site_name"/>
  <meta content="http://databootcamp.nyuecon.com/" property="og:url"/>
  <meta content="en-

In [35]:
bs.find_all?
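
As a rough illustration of what `find_all` does, here is a sketch on a small made-up snippet modeled on a page header.  The html, the example.com urls, and the variable names are ours.  We use Python's built-in `html.parser` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up snippet modeled on a page header (illustration only)
html = """
<html><head>
<title>Example page</title>
<meta content="Example" property="og:title"/>
<link href="http://example.com/" rel="canonical"/>
<link href="http://example.com/index.xml" rel="alternate"/>
</head><body></body></html>
"""

# html.parser ships with Python
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like collection of every tag with the given name
links = soup.find_all('link')
print('Number of link tags:', len(links))

# Each tag gives access to its attributes, much like a dictionary
hrefs = [link.get('href') for link in links]
print(hrefs)

# Keyword arguments filter on attribute values
canonical = soup.find_all('link', rel='canonical')
print(canonical[0]['href'])
```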

In [39]:
bs.head

<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>Data Bootcamp </title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="" name="description"/>
<meta content="" name="keywords"/>
<meta content="Data Bootcamp " property="og:title"/>
<meta content="Data Bootcamp" property="og:site_name"/>
<meta content="http://databootcamp.nyuecon.com/" property="og:url"/>
<meta content="en-us" property="og:locale"/>
<meta content="website" property="og:type"/>
<link href="http://databootcamp.nyuecon.com/index.xml" rel="alternate" title="Data Bootcamp" type="application/rss+xml"/>
<link href="http://databootcamp.nyuecon.com/" rel="canonical"/>
<link href="http://databootcamp.nyuecon.com/touch-icon-144-precomposed.png" rel="apple-touch-icon-precomposed" sizes="144x144"/>
<link href="http://da

In [40]:
kids = [ c for c in bs.head.children]

In [41]:
kids

['\n',
 <meta content="text/html; charset=utf-8" http-equiv="content-type"/>,
 '\n',
 <title>Data Bootcamp </title>,
 '\n',
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>,
 '\n',
 <meta content="True" name="HandheldFriendly"/>,
 '\n',
 <meta content="320" name="MobileOptimized"/>,
 '\n',
 <meta content="width=device-width, initial-scale=1" name="viewport"/>,
 '\n',
 <meta content="" name="description"/>,
 '\n',
 <meta content="" name="keywords"/>,
 '\n',
 <meta content="Data Bootcamp " property="og:title"/>,
 '\n',
 <meta content="Data Bootcamp" property="og:site_name"/>,
 '\n',
 <meta content="http://databootcamp.nyuecon.com/" property="og:url"/>,
 '\n',
 <meta content="en-us" property="og:locale"/>,
 '\n',
 <meta content="website" property="og:type"/>,
 '\n',
 <link href="http://databootcamp.nyuecon.com/index.xml" rel="alternate" title="Data Bootcamp" type="application/rss+xml"/>,
 '\n',
 <link href="http://databootcamp.nyuecon.com/" rel="canonical"/>,
 '\n',
 <link href="ht