# BeautifulSoup
- a great 'screen scraping' package for Python
- web pages hold tons of interesting data
- makes it easy to extract information from complex web pages and XML documents
- you can work out what to extract by exploring a page interactively
- [doc](http://www.crummy.com/software/BeautifulSoup/)
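
The kind of interactive exploration mentioned above might look like the following minimal sketch. The HTML snippet is made up for illustration, and the stdlib `'html.parser'` backend is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this sketch (not from a real page)
html = """
<html><body>
<h2 class="story-heading"><a href="/a.html">First headline</a></h2>
<h2 class="story-heading"><a href="/b.html">Second headline</a></h2>
<p>Some other text</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.h2             # first <h2> element in the document
link = first.find("a")      # first <a> inside that <h2>

print(first.name)           # tag name
print(link["href"])         # attributes are accessed like dict keys
print(link.get_text())      # text inside the tag
```

Typing expressions like `soup.h2` or `link["href"]` one at a time in a notebook or REPL is usually the quickest way to learn a page's structure.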

# Example
# Goal: find all the headlines on the front page of the [New York Times](http://nyt.com)
- look at the page source: the HTML structure is quite complex
- extracting the headlines with str.find() or regular expressions would be very difficult

In [2]:
# 'lxml' is an XML parser (it parses HTML too)
# from_encoding tells soup which encoding to use when decoding the raw bytes
import urllib.request

from bs4 import BeautifulSoup
import lxml

nf2 = urllib.request.urlopen('http://nyt.com')
sp = BeautifulSoup(nf2, 'lxml', from_encoding='utf-8')
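
As a small offline sketch of what `from_encoding` does, the example below parses raw bytes instead of a live response. The byte string is a made-up stand-in for a downloaded page, and `'html.parser'` is assumed in place of `'lxml'` to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical page body: bytes as they might arrive from the network
raw = "<h2><a href='/x.html'>It’s a test</a></h2>".encode("utf-8")

# from_encoding applies only to bytes input; for str input it is ignored
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.a.get_text()
print(text)
print(soup.original_encoding)   # the encoding soup used to decode the bytes
```

If `from_encoding` is omitted, BeautifulSoup tries to detect the encoding itself, which usually works but can guess wrong on short or unusual documents.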

In [3]:
# headlines seem to be contained in 'h2' elements

sp.find_all('h2')[10:20]

[<h2 class="refer-heading"><a href="http://www.nytimes.com/2016/10/07/world/americas/hurricane-matthew-haiti.html">As Waters Recede, Haiti Staggers Under the Toll</a></h2>,
 <h2 class="story-heading"><a href="http://www.nytimes.com/2016/10/08/world/americas/nobel-peace-prize-juan-manuel-santos-colombia.html">Nobel Peace Prize Awarded to Colombian President</a></h2>,
 <h2 class="refer-heading"><a href="http://www.nytimes.com/interactive/2016/world/nobel-peace-prize-winners.html">Quiz: How Much Do You Know About Past Winners?</a></h2>,
 <h2 class="story-heading"><a href="http://www.nytimes.com/2016/10/08/upshot/looking-at-the-jobs-report-through-a-political-lens.html">Jobs Data Supports Clinton’s Vision — and Trump’s</a></h2>,
 <h2 class="story-heading"><i class="icon"></i><a href="http://www.nytimes.com/2016/10/08/business/economy/jobs-report-unemployment-wages.html">Job Market Shows Resilience, Despite Pockets of Weakness</a> <time class="timestamp" data-eastern-timestamp="10:20 AM" da

In [4]:
# first 'h2' element

h2 = sp.h2
h2

<h2 class="branding"><a href="http://www.nytimes.com/">
<svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
<image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.svg"></image>
</svg>
</a></h2>

In [5]:
# can pull the 'a' element out of the 'h2'
# this 'a' element wraps the logo image, not a headline

a = h2.find('a')
a

<a href="http://www.nytimes.com/">
<svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
<image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.svg"></image>
</svg>
</a>

In [6]:
# try pulling the 'a' out of all 'h2' elements
# looks like we get mostly headlines
# find() returns None for an 'h2' that contains no 'a'

al = [h2.find('a') for h2 in sp.find_all('h2')]
al[:20]

[<a href="http://www.nytimes.com/">
 <svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
 <image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.svg"></image>
 </svg>
 </a>,
 None,
 None,
 None,
 None,
 None,
 <a href="http://www.nytimes.com/2016/10/07/us/hurricane-matthew.html">Hurricane Still Off Coast, but Remains Threat to Jacksonville</a>,
 <a href="http://www.nytimes.com/2016/10/07/us/practical-tips-for-surviving-a-hurricane-learned-the-hard-way.html">How to Make Sure You’re Safe in a Hurricane</a>,
 <a href="http://www.nytimes.com/2016/10/07/us/hurricane-matthew-andrew-florida.html">What It’s Like to Be Trapped by a Category 5 Hurricane</a>,
 <a href="http://www.nytimes.com/2016/10/08/world/americas/after-hurricane-matthew-devast
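
The `None` entries above come from `h2` elements that contain no link. One way to skip them up front is a CSS selector, sketched below on made-up HTML with the `'html.parser'` backend assumed. This is an alternative to the list comprehension, not the notebook's own approach.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one 'h2' without a link, two with a link
html = """
<h2 class="section-heading">Section title without a link</h2>
<h2 class="story-heading"><a href="/one.html">Headline one</a></h2>
<h2 class="story-heading"><a href="/two.html">Headline two</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# find() yields None wherever an 'h2' has no 'a' child
with_none = [h2.find("a") for h2 in soup.find_all("h2")]

# select() with a descendant selector returns only 'a' tags inside an 'h2'
links = soup.select("h2 a")

print(with_none)
print(links)
```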

In [7]:
# pull out the 'a' link text
# .contents is the list of child nodes of each 'a'

[a.contents for a in al if a is not None][:30]

[['\n',
  <svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
  <image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20161003-111909/images/foundation/logos/nyt-logo-379x64.svg"></image>
  </svg>,
  '\n'],
 ['Hurricane Still Off Coast, but Remains Threat to Jacksonville'],
 ['How to Make Sure You’re Safe in a Hurricane'],
 ['What It’s Like to Be Trapped by a Category 5 Hurricane'],
 ['Images of the Hurricane’s Devastation in Southern Haiti'],
 ['As Waters Recede, Haiti Staggers Under the Toll'],
 ['Nobel Peace Prize Awarded to Colombian President'],
 ['Quiz: How Much Do You Know About Past Winners?'],
 ['Jobs Data Supports Clinton’s Vision — and Trump’s'],
 ['Job Market Shows Resilience, Despite Pockets of Weakness'],
 ['Pound Drops Again Amid Brexit Fears'],
 ['U.S. and Iraq Set to Begin 

In [8]:
# filter out images: keep only links with a single child node (the headline text)

[a.contents for a in al if a is not None and len(a.contents) == 1][:30]

[['Hurricane Still Off Coast, but Remains Threat to Jacksonville'],
 ['How to Make Sure You’re Safe in a Hurricane'],
 ['What It’s Like to Be Trapped by a Category 5 Hurricane'],
 ['Images of the Hurricane’s Devastation in Southern Haiti'],
 ['As Waters Recede, Haiti Staggers Under the Toll'],
 ['Nobel Peace Prize Awarded to Colombian President'],
 ['Quiz: How Much Do You Know About Past Winners?'],
 ['Jobs Data Supports Clinton’s Vision — and Trump’s'],
 ['Job Market Shows Resilience, Despite Pockets of Weakness'],
 ['Pound Drops Again Amid Brexit Fears'],
 ['U.S. and Iraq Set to Begin a Climactic Battle Against ISIS'],
 ['N.S.A. Isn’t Sure if Suspect Leaked Data or Just Hoarded It'],
 ['California Today: Hollywood History vs. ‘Souvenir Junk’'],
 ['8 New Books That Times Editors Think You Should Read'],
 ['How to Use a Standing Desk Without Annoying Your Co-Workers'],
 ['A Mother Is Shot Dead, and a Sea of Witnesses Goes Silent'],
 ['Town Hall Format Shapes Debate Prep for Round 2'],
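
The result above is a list of one-element lists. As an alternative sketch, `get_text()` gives plain strings directly and also drops image-only links, since they have no text. The HTML below is invented to mimic the structures seen earlier (a logo link, a heading without a link, and two headlines), and `'html.parser'` is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page structure explored above
html = """
<h2 class="branding"><a href="/"><svg aria-label="logo"></svg></a></h2>
<h2 class="section-heading">No link here</h2>
<h2 class="story-heading"><a href="/one.html">Headline one</a></h2>
<h2 class="story-heading"><i class="icon"></i><a href="/two.html"> Headline two </a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

headlines = []
for h2 in soup.find_all("h2"):
    a = h2.find("a")
    if a is None:                       # 'h2' without a link
        continue
    text = a.get_text(strip=True)       # text of the link, whitespace trimmed
    if text:                            # empty for image-only links such as the logo
        headlines.append((text, a["href"]))

for text, href in headlines:
    print(text, href)
```

Keeping the `href` next to each headline makes it easy to follow the links later, for example to fetch the article pages.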
