# BeautifulSoup
- great 'screen scraping' package
- tons of interesting data on webpages designed for people, not programs
- makes it easy to extract information from complex web pages and XML documents
- you can often work out the right query by experimenting interactively
- [doc](http://www.crummy.com/software/BeautifulSoup/)

# Example
Goal: find all the headlines on the front page of the [New York Times](http://nyt.com)
- look at the webpage source: the HTML structure is quite complex
- extracting the headlines with str.find() or regular expressions would be very difficult
- soup reads in the page of interest and parses it into a tree, which you can then query
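The read-then-query workflow can be tried offline first. The following is a minimal sketch, assuming a small made-up HTML string in place of a downloaded page and the built-in `'html.parser'` backend in place of `'lxml'`; the tag contents and URLs are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (illustrative only).
html = """
<html><body>
  <h2 class="branding"><a href="/"><img src="logo.png" alt="logo"></a></h2>
  <h2 class="story-heading"><a href="/a.html">First headline</a></h2>
  <h2 class="story-heading">No link here</h2>
  <h2 class="story-heading"><a href="/b.html">Second headline</a></h2>
</body></html>
"""

# 'html.parser' ships with Python; 'lxml' would also work if it is installed.
soup = BeautifulSoup(html, "html.parser")

# Query the parsed tree: every 'h2' element, in document order.
headings = soup.find_all("h2")

# 'class' is a multi-valued attribute, so each lookup returns a list.
classes = [h["class"] for h in headings]

print(len(headings))
print(classes)
```

Because the input is a string rather than raw bytes, no `from_encoding` argument is needed here.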

In [1]:
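# 'lxml' is an XML parser (it parses HTML too)
# from_encoding tells soup which encoding to use when decoding the page bytes

import urllib.request

from bs4 import BeautifulSoup
import lxml  # not called directly; soup selects it through the 'lxml' argument below

# open the page, then hand the response object to soup for parsing
nf2 = urllib.request.urlopen('http://nyt.com')
sp = BeautifulSoup(nf2, 'lxml', from_encoding='utf-8')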

In [2]:
# headlines seem to be contained in 'h2' elements

sp.find_all('h2')[10:20]

[<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/17/arts/design/friends-no-more-jorge-perez-and-donald-trump.html">Friends No More? Jorge Pérez and Donald Trump</a></h2>,
 <h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/18/us/politics/trump-candidates-national-security-adviser.html">Trump to Interview 4 Candidates for Security Adviser</a></h2>,
 <h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/18/us/a-trump-ally-in-congress-warns-his-state-california-to-make-nice.html">Trump Ally in Congress Warns California to Make Nice</a></h2>,
 <h2 class="story-heading"><i class="icon"></i><a href="https://www.nytimes.com/2017/02/17/world/europe/trump-europe.html">President’s Aides Try to Reassure Europe, but Many Are Wary</a> </h2>,
 <h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/18/us/california-storms.html">Southern California Reels From Floods After Heavy Rains</a></h2>,
 <h2 class="story-heading"><a href="https://www.n

In [3]:
# first 'h2' element

h2 = sp.h2
h2

<h2 class="branding"><a href="http://www.nytimes.com/">
<svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
<image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
</svg>
</a></h2>

In [4]:
# can pull the 'a' element out of the 'h2'
# this 'a' element wraps the site logo image, not a headline

a = h2.find('a')
a

<a href="http://www.nytimes.com/">
<svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
<image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
</svg>
</a>

In [5]:
# try pulling the 'a' out of every 'h2' element
# find() returns None when an 'h2' contains no 'a'; otherwise we get mostly headlines

al = [h2.find('a') for h2 in sp.find_all('h2')]
al[:20]

[<a href="http://www.nytimes.com/">
 <svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
 <image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
 </svg>
 </a>,
 None,
 None,
 None,
 None,
 None,
 <a href="https://www.nytimes.com/2017/02/18/us/politics/trump-candidates-top-posts.html">Struggling to Fill Jobs When Total Loyalty Is a Must</a>,
 <a href="https://www.nytimes.com/2017/02/17/us/politics/trump-program-eliminations-white-house-budget-office.html">Trump Budget Hit List Has Programs Long in G.O.P. Sights</a>,
 None,
 <a href="https://www.nytimes.com/2017/02/18/world/middleeast/trump-dubai-vancouver.html">Trump’s Dual Roles Collide With Openings in Dubai and Vancouver</a>,
 <a href="https://www.nytimes.com/2017/02/17/
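The `None` entries in this output come from 'h2' elements that contain no 'a'. The sketch below reproduces that on a made-up HTML string (an assumption, not the real page) and shows one possible alternative, a CSS selector via `select()`, which returns only the links that exist.

```python
from bs4 import BeautifulSoup

# Made-up HTML (illustrative only): the middle 'h2' has no link.
html = """
<h2><a href="/a.html">First headline</a></h2>
<h2>Section label without a link</h2>
<h2><a href="/b.html">Second headline</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None when an 'h2' has no 'a' descendant.
links = [h2.find("a") for h2 in soup.find_all("h2")]
has_link = [link is not None for link in links]

# A CSS selector matches only 'a' elements nested inside an 'h2',
# so there are no None placeholders to filter out afterwards.
selected = soup.select("h2 a")
hrefs = [a["href"] for a in selected]

print(has_link)
print(hrefs)
```

Note that `select("h2 a")` returns every 'a' inside an 'h2', not just the first one per heading, so the two approaches can differ when a heading holds several links.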

In [6]:
# pull out the contents (child nodes) of each 'a' link, skipping the None entries

[a.contents for a in al if a is not None][:30]

[['\n',
  <svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
  <image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
  </svg>,
  '\n'],
 ['Struggling to Fill Jobs When Total Loyalty Is a Must'],
 ['Trump Budget Hit List Has Programs Long in G.O.P. Sights'],
 ['Trump’s Dual Roles Collide With Openings in Dubai and Vancouver'],
 ['Friends No More? Jorge Pérez and Donald Trump'],
 ['Trump to Interview 4 Candidates for Security Adviser'],
 ['Trump Ally in Congress Warns California to Make Nice'],
 ['President’s Aides Try to Reassure Europe, but Many Are Wary'],
 ['Southern California Reels From Floods After Heavy Rains'],
 ['China Suspends All Coal Imports From North Korea'],
 ['How the Nuclear Threat From North Korea Has Gr

In [7]:
# filter out the logo link: keep only 'a' elements with exactly one child node

[a.contents for a in al if a is not None and len(a) == 1][:30]

[['Struggling to Fill Jobs When Total Loyalty Is a Must'],
 ['Trump Budget Hit List Has Programs Long in G.O.P. Sights'],
 ['Trump’s Dual Roles Collide With Openings in Dubai and Vancouver'],
 ['Friends No More? Jorge Pérez and Donald Trump'],
 ['Trump to Interview 4 Candidates for Security Adviser'],
 ['Trump Ally in Congress Warns California to Make Nice'],
 ['President’s Aides Try to Reassure Europe, but Many Are Wary'],
 ['Southern California Reels From Floods After Heavy Rains'],
 ['China Suspends All Coal Imports From North Korea'],
 ['How the Nuclear Threat From North Korea Has Grown'],
 ['The Murky Future of Nuclear Power in the United States'],
 ['‘They Are Dying’: Video Seems to Show Massacre in Congo'],
 ['Greeks Turn to Black Market as Bailout Showdown Looms'],
 ['Dutch Nationalist Politician Calls Moroccan Immigrants ‘Scum’'],
 ['North Korean Is Arrested in Killing of Leader’s Relative'],
 ['Bill Maher and Milo Yiannopoulos Find Common Ground'],
 ['Kraft Heinz Offers to Bu
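The `len(a) == 1` test counts child nodes, so it would keep an image-only link that has no surrounding whitespace and would drop a headline with nested markup. A possible alternative, sketched here on made-up HTML (an assumption, not the real page), is to filter on the link's text with `get_text()`, which also yields plain strings instead of one-element lists.

```python
from bs4 import BeautifulSoup

# Made-up HTML (illustrative only): a logo link, two headlines, one plain 'h2'.
html = """
<h2><a href="/"><img src="logo.png" alt="logo"></a></h2>
<h2><a href="/a.html">First headline</a></h2>
<h2>No link</h2>
<h2><a href="/b.html"> Second headline </a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

links = [h2.find("a") for h2 in soup.find_all("h2")]

# get_text(strip=True) returns the link text with surrounding whitespace removed.
# An image-only link yields an empty string and is dropped by the truthiness test.
headlines = [a.get_text(strip=True) for a in links
             if a is not None and a.get_text(strip=True)]

print(headlines)
```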