# BeautifulSoup
- great 'screen scraping' package
- tons of interesting data on webpages designed for people, not programs
- makes it easy to extract information from complex web pages and XML documents
- the soup parses the page of interest into a tree of elements, which you can then query
- you can often figure out what to do by playing interactively
- works in Unicode: incoming documents are converted to Unicode strings
- new code should use BeautifulSoup version 4
- [doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
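
The "read in the page, then query it" workflow can be tried without touching the network. The sketch below is only an illustration: the HTML fragment is made up, and it uses Python's built-in `'html.parser'` rather than `'lxml'` so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, made up for illustration.
html = """
<html><body>
  <h1>Site title</h1>
  <h2>First headline</h2>
  <p>Some text with a <a href="/story">link</a>.</p>
  <h2>Second headline</h2>
</body></html>
"""

# 'html.parser' ships with Python, so no extra parser is required.
soup = BeautifulSoup(html, "html.parser")

# Query the parsed tree: every <h2> element, then the first <a> element.
headlines = [str(h2.string) for h2 in soup.find_all("h2")]
link_target = soup.find("a")["href"]

print(headlines)
print(link_target)
```

Typing `soup.find_all(...)` with different tag names at the prompt is a quick way to explore an unfamiliar page.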

# Example:
# Want to find all the headlines on the front page of the [New York Times](http://nyt.com)
- but, key point: I don't want to work very hard!
    - look at the webpage source: the HTML structure is quite complex, and I'm not interested in understanding it
    - would be very difficult to do using the text tools we have seen so far, like str.find() and regular expressions

In [1]:
# 'lxml' is an XML parser (it parses HTML too)
# from_encoding tells soup which encoding to use when decoding the page bytes

import urllib.request

# from bs4 import BeautifulSoup
import bs4
import lxml

nf2 = urllib.request.urlopen('http://nyt.com')
sp = bs4.BeautifulSoup(nf2, 'lxml', from_encoding='utf-8')
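
To see what `from_encoding` does without fetching a live page, here is a minimal sketch. The bytes below are a made-up stand-in for what `urlopen(...)` would hand to the soup, and `'html.parser'` is used in place of `'lxml'` only to avoid an extra dependency.

```python
import bs4

# Hypothetical bytes standing in for the raw page returned by urlopen().
raw = "<html><body><h2>Café society</h2></body></html>".encode("utf-8")

# from_encoding applies to byte input: it tells the soup how to decode it.
soup = bs4.BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# After parsing, everything in the tree is a Unicode string.
heading = soup.h2.string
print(heading)

# The soup records which encoding it ended up using to decode the input.
print(soup.original_encoding)
```

If `from_encoding` is left out, the soup tries to detect the encoding on its own, which usually works but can guess wrong.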

In [2]:
# headlines seem to be contained in 'h2' elements

h2s = sp.find_all('h2')  # find_all is the bs4 name; findAll is the old BeautifulSoup 3 spelling
h2s
      

[<h2 class="css-1ypbx2a esl82me2">Listen to ‘The Daily’</h2>,
 <h2 class="css-1ypbx2a esl82me2">In the ‘Watching’ Newsletter</h2>,
 <h2 class="css-1ypbx2a esl82me2">Got a confidential news tip?</h2>,
 <h2 class="css-78b01r esl82me2"><span>4 Ways Fred Trump Made Donald Trump and His Siblings Rich</span></h2>,
 <h2 class="css-8uvv5f esl82me2">11 Takeaways From The Times’s Investigation Into Trump’s Wealth</h2>,
 <h2 class="css-8uvv5f esl82me2">Mr. Trump called The Times’s investigation into his family’s dubious tax schemes a boring hit-piece.</h2>,
 <h2 class="css-78b01r esl82me2"><span>Swing Republican Senators Condemn Trump’s Mockery of Kavanaugh Accuser</span></h2>,
 <h2 class="css-8uvv5f esl82me2">Senate Republicans Open New Attack on Kavanaugh Accuser</h2>,
 <h2 class="css-8uvv5f esl82me2">The Kavanaugh proceedings have exposed just how far the Senate has drifted from the rules of decorum.</h2>,
 <h2 class="css-78b01r esl82me2"><span>In Tennessee Senate Race, Financial Missteps Ling

In [3]:
# pull out the contents of the h2 elements

contents = [h2.contents for h2 in h2s]
contents

[['Listen to ‘The Daily’'],
 ['In the ‘Watching’ Newsletter'],
 ['Got a confidential news tip?'],
 [<span>4 Ways Fred Trump Made Donald Trump and His Siblings Rich</span>],
 ['11 Takeaways From The Times’s Investigation Into Trump’s Wealth'],
 ['Mr. Trump called The Times’s investigation into his family’s dubious tax schemes a boring hit-piece.'],
 [<span>Swing Republican Senators Condemn Trump’s Mockery of Kavanaugh Accuser</span>],
 ['Senate Republicans Open New Attack on Kavanaugh Accuser'],
 ['The Kavanaugh proceedings have exposed just how far the Senate has drifted from the rules of decorum.'],
 [<span>In Tennessee Senate Race, Financial Missteps Linger in the Background</span>],
 ['Missing in the G.O.P.: Black and Hispanic Nominees for Governor'],
 ['The White House has a message for Republican candidates: Stay close to President Trump.'],
 [<span>Vulgar Texts and Dancer Turmoil Force City Ballet to Look in the Mirror</span>],
 [<span>Should Art Be a Battleground for Social Just
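
The output above mixes two kinds of children: plain strings and `<span>` tags. A small sketch on made-up HTML shows how to tell them apart; the `describe` helper is hypothetical and exists only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up HTML that mirrors the two shapes seen in the headlines.
html = "<h2>Plain headline</h2><h2><span>Wrapped headline</span></h2>"
soup = BeautifulSoup(html, "html.parser")


def describe(node):
    """Hypothetical helper: name the kind of child found in .contents."""
    if isinstance(node, NavigableString):
        return "string"
    if isinstance(node, Tag):
        return "tag"
    return "other"


# .contents is a list of direct children; look at the first child of each h2.
kinds = [describe(h2.contents[0]) for h2 in soup.find_all("h2")]
print(kinds)
```

`NavigableString` is a subclass of `str`, which is why an `isinstance(..., str)` check is enough to separate the two cases.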

In [4]:
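# pull the strings out of the lists and out of the <span> tags
# note the use of a conditional expression ('ternary if')
# no backslash is needed: a line can continue freely inside brackets

[content[0] if isinstance(content[0], str) else content[0].contents[0]
 for content in contents]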

['Listen to ‘The Daily’',
 'In the ‘Watching’ Newsletter',
 'Got a confidential news tip?',
 '4 Ways Fred Trump Made Donald Trump and His Siblings Rich',
 '11 Takeaways From The Times’s Investigation Into Trump’s Wealth',
 'Mr. Trump called The Times’s investigation into his family’s dubious tax schemes a boring hit-piece.',
 'Swing Republican Senators Condemn Trump’s Mockery of Kavanaugh Accuser',
 'Senate Republicans Open New Attack on Kavanaugh Accuser',
 'The Kavanaugh proceedings have exposed just how far the Senate has drifted from the rules of decorum.',
 'In Tennessee Senate Race, Financial Missteps Linger in the Background',
 'Missing in the G.O.P.: Black and Hispanic Nominees for Governor',
 'The White House has a message for Republican candidates: Stay close to President Trump.',
 'Vulgar Texts and Dancer Turmoil Force City Ballet to Look in the Mirror',
 'Should Art Be a Battleground for Social Justice?',
 'Lady Gaga Isn’t Done Shape-Shifting Yet',
 'Donald Trump and the Se
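
As an alternative to unpacking `.contents` by hand, bs4 also provides `get_text()`, which collects all the text inside an element however deeply it is nested. This is a sketch on made-up HTML, not the approach used in the cells above.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a plain headline, a wrapped one, and one with mixed children.
html = """
<h2>Plain headline</h2>
<h2><span>Wrapped headline</span></h2>
<h2>Mixed <span>content</span> headline</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text(" ", strip=True) strips each piece of text and joins the pieces
# with a single space, so nested tags need no special handling.
titles = [h2.get_text(" ", strip=True) for h2 in soup.find_all("h2")]
print(titles)
```

The trade-off is that `get_text()` flattens everything, so it is less useful when the nested tags themselves carry information you want to keep.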