# Web Scraping with `requests` and `BeautifulSoup4`

## HTML can be quite complicated... but fortunately, Chrome has some great developer tools that let us look at the structure of pages without needing to read or understand HTML

## How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's try it out here!

## How do I webscrape?

We will be using two new libraries for our webscraping:
- **requests** - lets us acquire the HTML code in Python, like a web browser would
- **BeautifulSoup** - allows us to interact with the HTML efficiently and easily

Webscraping with Python can be broken down into a few simple steps. First, we need to access and then 'download' the page that we want to scrape.

We want to scrape [this example site](http://econpy.pythonanywhere.com/ex/001.html).


In [1]:
import requests 

In [2]:
html = requests.get('http://econpy.pythonanywhere.com/ex/001.html')

In [3]:
html.status_code

200
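
A status code of 200 means the download succeeded, but other codes are possible. The following is a minimal sketch of a download helper that fails loudly instead of returning an error page. The name `fetch_html`, the 10-second timeout, and the injectable `get` argument are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising if the request fails.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be exercised without network access (an illustrative choice).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so a failed download is not mistaken for a real page.
    response.raise_for_status()
    return response.text
```

With network access, `fetch_html('http://econpy.pythonanywhere.com/ex/001.html')` should return the same string as `html.text` in the next cell.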

In [4]:
html.text

'<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8">\n    <title>Items 1 to 20 -- Example Page 1</title>\n    <script type="text/javascript">\n      var _gaq = _gaq || [];\n      _gaq.push([\'_setAccount\', \'UA-23648880-1\']);\n      _gaq.push([\'_trackPageview\']);\n      _gaq.push([\'_setDomainName\', \'econpy.org\']);\n    </script>\n</head>\n<body>\n<div align="center">1, <a href="http://econpy.pythonanywhere.com/ex/002.html">[<font color="green">2</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/003.html">[<font color="green">3</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/004.html">[<font color="green">4</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/005.html">[<font color="green">5</font>]</a></div>\n<div title="buyer-info">\n  <div title="buyer-name">Carson Busses</div>\n  <span class="item-price">$29.95</span><br>\n</div>\n<div title="buyer-info">\n  <div title="buyer-name">Earl E. Byrd</div>\n  <span class="item-price">$8.37</span><br>

In [5]:
# convert HTML into a structured Soup object
from bs4 import BeautifulSoup
# Name the parser explicitly. Without it, bs4 warns and suggests changing
# BeautifulSoup(YOUR_MARKUP) to BeautifulSoup(YOUR_MARKUP, "lxml").
b = BeautifulSoup(html.text, 'lxml')
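
To make the parsing step concrete without touching the network, here is a small sketch that parses an inline snippet. The snippet in `sample_html` is a shortened stand-in for the real page, and `"html.parser"` is used only because it ships with Python; `"lxml"` should work the same way here if it is installed.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the downloaded page (made up for this illustration).
sample_html = """
<div title="buyer-info">
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span><br>
</div>
"""

# The second argument names the parser, which avoids the "no parser specified" warning.
soup = BeautifulSoup(sample_html, "html.parser")

print(type(soup).__name__)   # BeautifulSoup
print(soup.div["title"])     # buyer-info  (attributes are read like dictionary keys)
print(soup.span["class"])    # ['item-price']  (class is returned as a list)
print(soup.span.text)        # $29.95
```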


In [6]:
#what does b look like?
print(b)

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Items 1 to 20 -- Example Page 1</title>
<script type="text/javascript">
      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-23648880-1']);
      _gaq.push(['_trackPageview']);
      _gaq.push(['_setDomainName', 'econpy.org']);
    </script>
</head>
<body>
<div align="center">1, <a href="http://econpy.pythonanywhere.com/ex/002.html">[<font color="green">2</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/003.html">[<font color="green">3</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/004.html">[<font color="green">4</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/005.html">[<font color="green">5</font>]</a></div>
<div title="buyer-info">
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span><br/>
</div>
<div title="buyer-info">
<div title="buyer-name">Earl E. Byrd</div>
<span class="item-price">$8.37</span><br/>
</div>
<div title="buyer-info">
<div title="buy

In [7]:
# we can make it prettier:

print(b.prettify())

# This looks slightly better, but it's still pretty hard to interpret. How could I find the item I'm looking for?

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   Items 1 to 20 -- Example Page 1
  </title>
  <script type="text/javascript">
   var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-23648880-1']);
      _gaq.push(['_trackPageview']);
      _gaq.push(['_setDomainName', 'econpy.org']);
  </script>
 </head>
 <body>
  <div align="center">
   1,
   <a href="http://econpy.pythonanywhere.com/ex/002.html">
    [
    <font color="green">
     2
    </font>
    ]
   </a>
   ,
   <a href="http://econpy.pythonanywhere.com/ex/003.html">
    [
    <font color="green">
     3
    </font>
    ]
   </a>
   ,
   <a href="http://econpy.pythonanywhere.com/ex/004.html">
    [
    <font color="green">
     4
    </font>
    ]
   </a>
   ,
   <a href="http://econpy.pythonanywhere.com/ex/005.html">
    [
    <font color="green">
     5
    </font>
    ]
   </a>
  </div>
  <div title="buyer-info">
   <div title="buyer-name">
    Carson Busses
   </div>
   <span class="item-price">
  

## Finding items in the site

Now that we have the website text saved as a BeautifulSoup object, we can use bs4 methods to find things on the page for us.
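
The two search methods used in this section behave differently when nothing matches, which is worth seeing once on a tiny example. This is a sketch on a made-up two-line snippet, not the live page.

```python
from bs4 import BeautifulSoup

snippet = (
    '<div title="buyer-name">Carson Busses</div>'
    '<div title="buyer-name">Earl E. Byrd</div>'
)
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("div")          # a single Tag: the first match
every = soup.find_all("div")      # a list-like ResultSet of every match
missing = soup.find("table")      # None when nothing matches
empty = soup.find_all("table")    # an empty ResultSet when nothing matches

print(first.text)    # Carson Busses
print(len(every))    # 2
print(missing)       # None
print(len(empty))    # 0
```

Because `find` can return `None`, calling `.text` on its result raises an `AttributeError` when the tag is absent, so it is worth checking before use.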

In [8]:
# 'find' method returns the first matching Tag (and everything inside of it)
print(b.find(name='body').prettify())

<body>
 <div align="center">
  1,
  <a href="http://econpy.pythonanywhere.com/ex/002.html">
   [
   <font color="green">
    2
   </font>
   ]
  </a>
  ,
  <a href="http://econpy.pythonanywhere.com/ex/003.html">
   [
   <font color="green">
    3
   </font>
   ]
  </a>
  ,
  <a href="http://econpy.pythonanywhere.com/ex/004.html">
   [
   <font color="green">
    4
   </font>
   ]
  </a>
  ,
  <a href="http://econpy.pythonanywhere.com/ex/005.html">
   [
   <font color="green">
    5
   </font>
   ]
  </a>
 </div>
 <div title="buyer-info">
  <div title="buyer-name">
   Carson Busses
  </div>
  <span class="item-price">
   $29.95
  </span>
  <br/>
 </div>
 <div title="buyer-info">
  <div title="buyer-name">
   Earl E. Byrd
  </div>
  <span class="item-price">
   $8.37
  </span>
  <br/>
 </div>
 <div title="buyer-info">
  <div title="buyer-name">
   Patty Cakes
  </div>
  <span class="item-price">
   $15.26
  </span>
  <br/>
 </div>
 <div title="buyer-info">
  <div title="buyer-name">
   D

In [9]:
# .text will return the text without the extra tags
print(b.find(name='body').text)


1, [2], [3], [4], [5]

Carson Busses
$29.95


Earl E. Byrd
$8.37


Patty Cakes
$15.26


Derri Anne Connecticut
$19.25


Moe Dess
$19.25


Leda Doggslife
$13.99


Dan Druff
$31.57


Al Fresco
$8.49


Ido Hoe
$14.47


Howie Kisses
$15.86


Len Lease
$11.11


Phil Meup
$15.98


Ira Pent
$16.27


Ben D. Rules
$7.50


Ave Sectomy
$50.85


Gary Shattire
$14.26


Bobbi Soks
$5.68


Sheila Takya
$15.00


Rose Tattoo
$114.07


Moe Tell
$10.09

  (function() {
    var ga = document.createElement('script');     ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:'   == document.location.protocol ? 'https://ssl'   : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
    })();
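
The output above ends with JavaScript because, in the version used here, `.text` collected the contents of a `<script>` tag along with the visible text. One way to avoid that, sketched below on a made-up snippet, is to remove the script tags before extracting text. Whether script contents appear at all may depend on the Beautiful Soup version, so treat this as a defensive pattern rather than a required step.

```python
from bs4 import BeautifulSoup

page = """<body>
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>
<script type="text/javascript">var ga = document.createElement('script');</script>
</body>"""
soup = BeautifulSoup(page, "html.parser")

# Remove every <script> tag, together with its contents, from the tree.
for script in soup.find_all("script"):
    script.decompose()

# separator and strip tidy the whitespace between the remaining pieces of text.
visible_text = soup.body.get_text(separator="\n", strip=True)
print(visible_text)
# Carson Busses
# $29.95
```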
  



`find_all` returns every matching tag, not just the first one.

In [10]:
print(b.find_all('div'))

[<div align="center">1, <a href="http://econpy.pythonanywhere.com/ex/002.html">[<font color="green">2</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/003.html">[<font color="green">3</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/004.html">[<font color="green">4</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/005.html">[<font color="green">5</font>]</a></div>, <div title="buyer-info">
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span><br/>
</div>, <div title="buyer-name">Carson Busses</div>, <div title="buyer-info">
<div title="buyer-name">Earl E. Byrd</div>
<span class="item-price">$8.37</span><br/>
</div>, <div title="buyer-name">Earl E. Byrd</div>, <div title="buyer-info">
<div title="buyer-name">Patty Cakes</div>
<span class="item-price">$15.26</span><br/>
</div>, <div title="buyer-name">Patty Cakes</div>, <div title="buyer-info">
<div title="buyer-name">Derri Anne Connecticut</div>
<span class="item-price">$19.25</span

In [11]:
b.find_all('div', title='buyer-name')
# Beautiful Soup lets us filter tags by attribute: here, only the div tags whose title is 'buyer-name'.

[<div title="buyer-name">Carson Busses</div>,
 <div title="buyer-name">Earl E. Byrd</div>,
 <div title="buyer-name">Patty Cakes</div>,
 <div title="buyer-name">Derri Anne Connecticut</div>,
 <div title="buyer-name">Moe Dess</div>,
 <div title="buyer-name">Leda Doggslife</div>,
 <div title="buyer-name">Dan Druff</div>,
 <div title="buyer-name">Al Fresco</div>,
 <div title="buyer-name">Ido Hoe</div>,
 <div title="buyer-name">Howie Kisses</div>,
 <div title="buyer-name">Len Lease</div>,
 <div title="buyer-name">Phil Meup</div>,
 <div title="buyer-name">Ira Pent</div>,
 <div title="buyer-name">Ben D. Rules</div>,
 <div title="buyer-name">Ave Sectomy</div>,
 <div title="buyer-name">Gary Shattire</div>,
 <div title="buyer-name">Bobbi Soks</div>,
 <div title="buyer-name">Sheila Takya</div>,
 <div title="buyer-name">Rose Tattoo</div>,
 <div title="buyer-name">Moe Tell</div>]
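
Attribute filters can be written in a few equivalent ways. The sketch below shows them side by side on a made-up snippet; the extra `<span class="note">` is hypothetical and is there only to show why filtering by class can matter.

```python
from bs4 import BeautifulSoup

html_doc = """
<div title="buyer-info">
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
</div>
<span class="note">not a price</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword form, as in the cell above.
by_keyword = soup.find_all("div", title="buyer-name")
# Dictionary form; useful when the attribute name is not a valid Python keyword argument.
by_attrs = soup.find_all("div", attrs={"title": "buyer-name"})
# class is a reserved word in Python, so bs4 spells this keyword class_.
prices = soup.find_all("span", class_="item-price")
# The same name lookup written as a CSS selector.
via_css = soup.select('div[title="buyer-name"]')

print([t.text for t in by_keyword])   # ['Carson Busses']
print([t.text for t in prices])       # ['$29.95']
```

Searching for every `span` would also have returned the note in this snippet, whereas `class_="item-price"` returns only the price.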

In [13]:
#We can use for loops to select just the text
for i in b.find_all('div', title='buyer-name'):
    print(i.text)

Carson Busses
Earl E. Byrd
Patty Cakes
Derri Anne Connecticut
Moe Dess
Leda Doggslife
Dan Druff
Al Fresco
Ido Hoe
Howie Kisses
Len Lease
Phil Meup
Ira Pent
Ben D. Rules
Ave Sectomy
Gary Shattire
Bobbi Soks
Sheila Takya
Rose Tattoo
Moe Tell


In [14]:
# Now I have a list of all names on the page
# What if I wanted a list of the prices?
for i in b.find_all('span'):
    print(i.text)

$29.95
$8.37
$15.26
$19.25
$19.25
$13.99
$31.57
$8.49
$14.47
$15.86
$11.11
$15.98
$16.27
$7.50
$50.85
$14.26
$5.68
$15.00
$114.07
$10.09
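
The prices come back as strings such as `'$29.95'`. If they are needed as numbers, one possible conversion is sketched below; using `Decimal` rather than `float` is a choice made here to keep the cents exact, not something the notebook prescribes.

```python
from decimal import Decimal

# A few values copied from the output above.
price_strings = ["$29.95", "$8.37", "$114.07"]

# Drop the leading dollar sign, then convert the remaining digits.
prices = [Decimal(p.lstrip("$")) for p in price_strings]
total = sum(prices)

print(total)   # 152.39
```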


## Pair practice: Group work - build a function that uses the statements above to create a dictionary that pairs the names with prices:

In [15]:
keys = [i.text for i in b.find_all('div', title='buyer-name')]
values = [j.text for j in b.find_all('span')]

dictionary = dict(zip(keys,values))

In [16]:
dictionary

{'Al Fresco': '$8.49',
 'Ave Sectomy': '$50.85',
 'Ben D. Rules': '$7.50',
 'Bobbi Soks': '$5.68',
 'Carson Busses': '$29.95',
 'Dan Druff': '$31.57',
 'Derri Anne Connecticut': '$19.25',
 'Earl E. Byrd': '$8.37',
 'Gary Shattire': '$14.26',
 'Howie Kisses': '$15.86',
 'Ido Hoe': '$14.47',
 'Ira Pent': '$16.27',
 'Leda Doggslife': '$13.99',
 'Len Lease': '$11.11',
 'Moe Dess': '$19.25',
 'Moe Tell': '$10.09',
 'Patty Cakes': '$15.26',
 'Phil Meup': '$15.98',
 'Rose Tattoo': '$114.07',
 'Sheila Takya': '$15.00'}
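
`dict(zip(keys, values))` works well here because every name on the page is unique and the two lists have the same length. The sketch below, which uses a hypothetical duplicate name, shows what happens when those assumptions do not hold.

```python
# Hypothetical data: the first name appears twice.
names = ["Moe Dess", "Moe Dess", "Al Fresco"]
prices = ["$19.25", "$10.09", "$8.49"]

as_dict = dict(zip(names, prices))    # a repeated key keeps only the last value
as_pairs = list(zip(names, prices))   # a list of tuples keeps every row
shorter = list(zip(names, prices[:2]))  # zip stops at the shorter input

print(as_dict)        # {'Moe Dess': '$10.09', 'Al Fresco': '$8.49'}
print(len(as_pairs))  # 3
print(len(shorter))   # 2
```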

In [17]:
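def dict_generator(tag1, title1, tag2, title2 = None):
    # Pair the text of each tag1 (filtered by title1) with the text of each tag2 (filtered by title2).
    # Reads the global soup object b. title2 defaults to None because the price <span> tags carry no title attribute.
    keys = [i.text for i in b.find_all(tag1, title=title1)]
    values = [j.text for j in b.find_all(tag2, title = title2)]
    dictionary = dict(zip(keys,values))
    return dictionary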

In [18]:
dict_generator('div','buyer-name','span')

{'Al Fresco': '$8.49',
 'Ave Sectomy': '$50.85',
 'Ben D. Rules': '$7.50',
 'Bobbi Soks': '$5.68',
 'Carson Busses': '$29.95',
 'Dan Druff': '$31.57',
 'Derri Anne Connecticut': '$19.25',
 'Earl E. Byrd': '$8.37',
 'Gary Shattire': '$14.26',
 'Howie Kisses': '$15.86',
 'Ido Hoe': '$14.47',
 'Ira Pent': '$16.27',
 'Leda Doggslife': '$13.99',
 'Len Lease': '$11.11',
 'Moe Dess': '$19.25',
 'Moe Tell': '$10.09',
 'Patty Cakes': '$15.26',
 'Phil Meup': '$15.98',
 'Rose Tattoo': '$114.07',
 'Sheila Takya': '$15.00'}

In [19]:
dict_generator('div','buyer-info','span')

{'\nAl Fresco\n$8.49\n': '$8.49',
 '\nAve Sectomy\n$50.85\n': '$50.85',
 '\nBen D. Rules\n$7.50\n': '$7.50',
 '\nBobbi Soks\n$5.68\n': '$5.68',
 '\nCarson Busses\n$29.95\n': '$29.95',
 '\nDan Druff\n$31.57\n': '$31.57',
 '\nDerri Anne Connecticut\n$19.25\n': '$19.25',
 '\nEarl E. Byrd\n$8.37\n': '$8.37',
 '\nGary Shattire\n$14.26\n': '$14.26',
 '\nHowie Kisses\n$15.86\n': '$15.86',
 '\nIdo Hoe\n$14.47\n': '$14.47',
 '\nIra Pent\n$16.27\n': '$16.27',
 '\nLeda Doggslife\n$13.99\n': '$13.99',
 '\nLen Lease\n$11.11\n': '$11.11',
 '\nMoe Dess\n$19.25\n': '$19.25',
 '\nMoe Tell\n$10.09\n': '$10.09',
 '\nPatty Cakes\n$15.26\n': '$15.26',
 '\nPhil Meup\n$15.98\n': '$15.98',
 '\nRose Tattoo\n$114.07\n': '$114.07',
 '\nSheila Takya\n$15.00\n': '$15.00'}
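
The keys above contain newlines and the price because the text of a `buyer-info` div includes everything nested inside it. An alternative, sketched here on a made-up two-buyer snippet, is to loop over each `buyer-info` block and look up the name and price inside that block. The function name `pair_names_with_prices` is hypothetical.

```python
from bs4 import BeautifulSoup

html_doc = """
<div title="buyer-info">
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span><br>
</div>
<div title="buyer-info">
  <div title="buyer-name">Earl E. Byrd</div>
  <span class="item-price">$8.37</span><br>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def pair_names_with_prices(soup):
    """Map each buyer name to the price found in the same buyer-info block."""
    pairs = {}
    for block in soup.find_all("div", title="buyer-info"):
        name = block.find("div", title="buyer-name")
        price = block.find("span", class_="item-price")
        if name is None or price is None:
            continue  # skip blocks that are missing either piece
        pairs[name.get_text(strip=True)] = price.get_text(strip=True)
    return pairs


print(pair_names_with_prices(soup))
# {'Carson Busses': '$29.95', 'Earl E. Byrd': '$8.37'}
```

Searching inside each block keeps a name and its price together even if some block lacks one of them, which two independent lists joined with `zip` cannot guarantee.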

In [None]:
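pages = {}  # one dictionary per page, keyed by page number
for k in range(1,6):
    b = BeautifulSoup(requests.get('http://econpy.pythonanywhere.com/ex/%03d.html' % k).text, 'lxml')  # dict_generator reads the global b
    pages[k] = dict_generator('div','buyer-info','span')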
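
The page links point to `ex/002.html` through `ex/005.html`, so looping over pages 1 to 5 could be written as a pair of small functions. This is a sketch under assumptions: `page_urls`, `scrape_pages`, and the `fetch` argument are hypothetical names, and the download step is passed in as a function so the parsing logic can be tried without network access.

```python
from bs4 import BeautifulSoup

BASE_URL = "http://econpy.pythonanywhere.com/ex/{:03d}.html"


def page_urls(first=1, last=5):
    """Build the URLs for pages first..last (001.html, 002.html, ...)."""
    return [BASE_URL.format(k) for k in range(first, last + 1)]


def scrape_pages(urls, fetch):
    """Merge the name-to-price pairs from every page.

    fetch is any callable that takes a URL and returns that page's HTML text.
    """
    results = {}
    for url in urls:
        soup = BeautifulSoup(fetch(url), "html.parser")
        names = [d.text for d in soup.find_all("div", title="buyer-name")]
        prices = [s.text for s in soup.find_all("span", class_="item-price")]
        results.update(zip(names, prices))
    return results


print(page_urls()[0])   # http://econpy.pythonanywhere.com/ex/001.html
```

With network access, something like `scrape_pages(page_urls(), lambda url: requests.get(url).text)` would fetch all five pages; a name that appears on more than one page would keep only its last price.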