## Scotch

The scraping notebook walked through basic web scraping on HTML that was straightforward and well behaved, to show you the basics of the libraries involved. There are also plenty of nicely written tutorials on the web that show a polished version of someone's scraping work. But the question I struggled with while putting this together was: how do you really teach someone to scrape? And what is a good site to teach with?

While I was thinking about it, I received the weekly email from one of my favorite online retailers, [Drink Up NY](http://www.drinkupny.com/Default.asp). Since I am quite fond of their inventory and had scraping on the brain, I thought I would see if I could scrape the data (and I didn't see any good reason [why not to](http://www.drinkupny.com/robots.txt)). As I played around with this notebook, I realized it might be a good way to demonstrate a thought process for approaching scraping. This is not a definitive way to scrape; it is just the way *I* approached *this* page, sitting on my couch the week before this tutorial.
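
If you would rather do that robots.txt check in code than by eye, Python's standard library has `urllib.robotparser`. The sketch below parses a made-up robots.txt (the rules are hypothetical and are not the ones Drink Up NY actually serves) so it runs without touching the network. Against a real site you would point `set_url()` at the file and call `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /shoppingcart.asp
"""


def allowed(robots_text, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


scotch_ok = allowed(
    SAMPLE_ROBOTS,
    "https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm",
)
admin_ok = allowed(SAMPLE_ROBOTS, "https://www.drinkupny.com/admin/login.asp")
print(scotch_ok, admin_ok)
```

A passing check only tells you what the site asks of crawlers; it is still worth keeping the request volume low.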

I can read and write some HTML and I have broken some CSS on WordPress before, but I am not a web dev; I would call myself barely competent in that realm. If I were one, I imagine this would go a lot faster. I do like tinkering and trying things, though, so that is how I approach scraping and what I will demonstrate here.

So, let's see what is going on with Scotch prices on Drink Up NY!

In [48]:
import requests
from bs4 import BeautifulSoup

I am using [requests](http://docs.python-requests.org/en/master/) here, but I could just as easily have used [urllib](https://docs.python.org/3.5/library/urllib.html).

First, we need to get the page. If you navigate around the [Drink Up NY](http://www.drinkupny.com/Default.asp) site a bit, you can see they have a lot of drop-down menus. I am just interested in the [Scotch](https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm) page right now. I navigated there through the "SPIRITS ETC" - "Whisk(e)y" - "Scotch Whisky" links. For scraping, I am going to change the "per page" option to "120 per page" so I have everything on one page.

In [55]:
html_file = "https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm?searching=Y&sort=9&cat=77&show=120&page=1"
html_rpt = requests.get(html_file)
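
The query string on that URL is where the "120 per page" choice ends up. Here is a small sketch, using only the standard library, that splits the URL into its parameters and rebuilds it. My reading that `show` is the page size and `page` is the page number is an assumption based on the values, not something the site documents.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

html_file = (
    "https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm"
    "?searching=Y&sort=9&cat=77&show=120&page=1"
)

parts = urlsplit(html_file)

# parse_qs returns a list per key; each key appears once here, so keep item 0.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print(params)

# Rebuild the same URL from its pieces. Changing params["page"] here would
# (presumably) ask for a different page of results.
base = f"{parts.scheme}://{parts.netloc}{parts.path}"
rebuilt = base + "?" + urlencode(params)
print(rebuilt == html_file)
```

With `requests`, the same dictionary can be passed as `requests.get(base, params=params)` instead of building the string by hand.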

In [56]:
if html_rpt.status_code == 200:
    print(html_rpt.content)
else:
    print(html_rpt.status_code)



It worked. Let's parse and prettify it.

In [57]:
soup = BeautifulSoup(html_rpt.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lte IE 9]><html class="no-js lt-ie10" lang="en"><![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js gt-ie9" lang="en">
 <!--<![endif]-->
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Best Scotch Whiskey | Single Malt Scotches | Malt Scotch Whiskey
  </title>
  <meta content="Extensive selection of premium scotch whiskey! Save up to $20.00 on aged single malt scotches! Buy the best malt scotch whiskey from DrinkUpNY and save!" name="description"/>
  <meta content="scotch, scotch whiskey, scotch whisky, malt scotch, single malt scotches, single malt whiskey, scotch single malt, single malt scotch, best scotch, scotch brands, scotch online, malt scotch whiskey, single malt scotch whisky, blended scotch whisky, buy scotch, aberlour a bunadh, best scotch whiskey, top rated scotch whiskey, best single malt scotch, best scotch whisky, expensive scotch, glenfiddich scotch, scotch bourbon whiskey, best scotch brands, lagav

Now let's take a look through the structure and find the data. 
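
Before scrolling through the whole tree, it can help to get a rough census of what is in it. The sketch below counts tag names and class names on a tiny stand-in document that I made up (it is not the real page); on the real `soup` the same two lines would show which tags and classes dominate.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Made-up miniature page standing in for the real one.
SAMPLE = """
<html><head><title>Scotch</title></head>
<body>
<table><tr><td><a href="/">Home</a></td></tr></table>
<div class="v-product"><a href="/bottle-1">Bottle one</a></div>
<div class="v-product"><a href="/bottle-2">Bottle two</a></div>
</body></html>
"""
sample_soup = BeautifulSoup(SAMPLE, "html.parser")

# find_all(True) matches every tag in the document.
tag_counts = Counter(tag.name for tag in sample_soup.find_all(True))

# "class" is multi-valued, so tag["class"] is a list of class names.
class_counts = Counter(
    cls
    for tag in sample_soup.find_all(class_=True)
    for cls in tag["class"]
)

print(tag_counts.most_common(3))
print(class_counts.most_common(3))
```

A class name that repeats about as many times as there are items on the page is usually a good lead.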

In [58]:
# I wouldn't expect my data to be here, but there is some metadata if we want it.
soup.head

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Best Scotch Whiskey | Single Malt Scotches | Malt Scotch Whiskey</title>
<meta content="Extensive selection of premium scotch whiskey! Save up to $20.00 on aged single malt scotches! Buy the best malt scotch whiskey from DrinkUpNY and save!" name="description"/>
<meta content="scotch, scotch whiskey, scotch whisky, malt scotch, single malt scotches, single malt whiskey, scotch single malt, single malt scotch, best scotch, scotch brands, scotch online, malt scotch whiskey, single malt scotch whisky, blended scotch whisky, buy scotch, aberlour a bunadh, best scotch whiskey, top rated scotch whiskey, best single malt scotch, best scotch whisky, expensive scotch, glenfiddich scotch, scotch bourbon whiskey, best scotch brands, lagavulin scotch, buy scotch online, purchase scotch online, scotch buy online, buy scotch whiskey, buy scotch whisky, best scotch whiskey brands, top rated scotch, scotch whisky online

In [34]:
soup.body

<body>
<!--SCRIPT TO REMOVE HEADER ON 404 page-->
<script type="text/javascript"> 
//<![CDATA[ 
if (location.pathname == "/orderdetails.asp") 
document.writeln("\n<style type='text/css'>#content_area div.page-wrap header.header {display: none;}</style>\n\n"); 
//]]> </script>
<!--End SCRIPT TO REMOVE HEADER ON 404 page-->
<span id="svgIncludes" style="display:none;"></span>
<noscript id="no-js-notice">
      To take full advantage of this site, please enable your browser's JavaScript feature. <a href="http://www.activatejavascript.org/" target="_blank">Learn how</a>
</noscript>
<nav class="menu push-menu hidden-md hidden-lg" data-menu-type="slide-left">
<div class="push-menu__close-btn"><a class="close-menu" href="javascript:void(0);">
<span class="glyphicon glyphicon-vol-close"></span>
</a></div>
<div class="menu" id="display_menu_4"><script type="text/javascript">var breadCrumb="|77||8||278|";</script>
<link href="/a/c/vnav.css" rel="stylesheet" type="text/css">
<script src="/a/j/vna

In [35]:
soup.body.table

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="100%">
<tr>
<td>
<b>You are here: <a href="https://www.drinkupny.com/">Home</a> &gt; <a href="https://www.drinkupny.com/Online-Spirits-Store-s/278.htm" title="Online_Spirits_Store">Spirits etc</a> &gt; <a href="https://www.drinkupny.com/whiskey-s/8.htm" title="whiskey">Whisk(e)y</a> &gt; <a href="https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm" title="single malt scotch whisky">Scotch Whisky</a></b>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="colors_lines_light"><img height="1" src="/v/vspfiles/templates/DrinkUpNy/images/clear1x1.gif" width="1"/></td>
</tr>
<tr>
<td align="center" valign="top">
<span id="listOfErrorsSpan">
</span>
</td>
</tr>
</table>

We can certainly pull a lot of links easily with their navigation system.

In [60]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://www.activatejavascript.org/
javascript:void(0);
https://www.drinkupny.com/Wine-Types-s/286.htm
https://www.drinkupny.com/sparkling-wine-s/23.htm
https://www.drinkupny.com/Red-Wine-s/371.htm
https://www.drinkupny.com/cabernet-franc-wine-s/91.htm
https://www.drinkupny.com/wine-cabernet-sauvingnon-s/22.htm
https://www.drinkupny.com/carmenere-wine-s/159.htm
https://www.drinkupny.com/grenache-wine-s/105.htm
https://www.drinkupny.com/malbec-s/25.htm
https://www.drinkupny.com/merlot-wines-s/26.htm
https://www.drinkupny.com/nebbiolo-wine-s/88.htm
https://www.drinkupny.com/pinot-noir-wines-s/28.htm
https://www.drinkupny.com/sangiovese-wine-s/89.htm
https://www.drinkupny.com/syrah-and-shiraz-s/31.htm
https://www.drinkupny.com/tempranillo-red-wine-s/32.htm
https://www.drinkupny.com/Other-Red-Blends-s/69.htm
https://www.drinkupny.com/rose-wine-s/67.htm
https://www.drinkupny.com/sweet-fortified-dessert-wines-s/30.htm
https://www.drinkupny.com/best-vermouth-s/100.htm
https://www.drinkupny.com
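
That list mixes real pages with things like `javascript:void(0);` and links to other sites. A minimal sketch of tidying it up follows, assuming we only care about links on the shop's own domain. The sample markup is a hand-made mix of hrefs seen above plus a relative link and an anchor without an `href`, both of which I added for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

SAMPLE = """
<a href="http://www.activatejavascript.org/">Learn how</a>
<a class="close-menu" href="javascript:void(0);">x</a>
<a href="https://www.drinkupny.com/malbec-s/25.htm">Malbec</a>
<a href="/whiskey-s/8.htm">Whisk(e)y</a>
<a name="top">no href here</a>
"""
BASE = "https://www.drinkupny.com/"

sample_soup = BeautifulSoup(SAMPLE, "html.parser")

links = []
# href=True skips anchors that have no href attribute at all.
for anchor in sample_soup.find_all("a", href=True):
    url = urljoin(BASE, anchor["href"])  # resolves relative links
    if url.startswith(BASE):             # keep same-site pages only
        links.append(url)

print(links)
```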

It looks like there are a few tables in there.

In [61]:
len(soup.find_all('table'))

12

The Scotch data I want is probably in one of those tables. 
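
One way to narrow the search without printing every table in full is to summarize each one first. The sketch below does that on a small invented document: for each table it reports the row count, how many elements with a class of interest it contains, and a short text preview. Note that `find_all('table')` also returns tables nested inside other tables, so the same markup can show up under more than one index.

```python
from bs4 import BeautifulSoup

# Invented stand-in: one table wrapping a nested table, then a table
# that holds the items we actually want.
SAMPLE = """
<table><tr><td>
  <table><tr><td>You are here: Home</td></tr></table>
</td></tr></table>
<table><tr><td>
  <div class="v-product">Bottle one</div>
  <div class="v-product">Bottle two</div>
</td></tr></table>
"""
sample_soup = BeautifulSoup(SAMPLE, "html.parser")

summary = []
for index, table in enumerate(sample_soup.find_all("table")):
    n_rows = len(table.find_all("tr"))          # includes rows of nested tables
    n_products = len(table.select(".v-product"))
    preview = table.get_text(" ", strip=True)[:40]
    summary.append((index, n_rows, n_products))
    print(index, n_rows, n_products, repr(preview))
```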

In [62]:
soup.find_all('table')[0]

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="100%">
<tr>
<td>
<b>You are here: <a href="https://www.drinkupny.com/">Home</a> &gt; <a href="https://www.drinkupny.com/Online-Spirits-Store-s/278.htm" title="Online_Spirits_Store">Spirits etc</a> &gt; <a href="https://www.drinkupny.com/whiskey-s/8.htm" title="whiskey">Whisk(e)y</a> &gt; <a href="https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm" title="single malt scotch whisky">Scotch Whisky</a></b>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="colors_lines_light"><img height="1" src="/v/vspfiles/templates/DrinkUpNy/images/clear1x1.gif" width="1"/></td>
</tr>
<tr>
<td align="center" valign="top">
<span id="listOfErrorsSpan">
</span>
</td>
</tr>
</table>

Nope, that's not it.

In [63]:
soup.find_all('table')[1]

<table border="0" cellpadding="5" cellspacing="0" width="100%">
<tr>
<td>
<b>You are here: <a href="https://www.drinkupny.com/">Home</a> &gt; <a href="https://www.drinkupny.com/Online-Spirits-Store-s/278.htm" title="Online_Spirits_Store">Spirits etc</a> &gt; <a href="https://www.drinkupny.com/whiskey-s/8.htm" title="whiskey">Whisk(e)y</a> &gt; <a href="https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm" title="single malt scotch whisky">Scotch Whisky</a></b>
</td>
</tr>
</table>

Nope, not there either.

In [64]:
soup.find_all('table')[2]

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table border="0" cellpadding="10" cellspacing="10" width="100%">
<tr>
<td>
<div style="text-align: left;">
<inline style="font-size: 16px; font-family: Times New Roman;"><b>Purchase Scotch Whisky Online at DrinkUpNY!</b>
<!--<inline-->
<div style="text-align: justify;">
<inline style="font-size: 14px; font-family: Times New Roman;">There are so many Scotch whisky brands available today that sometimes it's difficult to find what you're looking for. We offer an extensive selection of some of the best Scotch Whiskies on the market today, including Single Malt Scotches, Blended Malt Scotch Whisky, Blended Whisky and more!
</inline>
</div>
</inline>
</div>
</td>
</tr>
</table>
</td>
</tr>
</table>
<form action="/searchresults.asp" class="search_results_section" id="MainForm" method="post" name="MainForm" onsubmit="return OnSubmitSearchForm(even

BINGO! Where we have tables, we have rows. 

In [68]:
soup.find_all('table')[2].tr

<tr>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table border="0" cellpadding="10" cellspacing="10" width="100%">
<tr>
<td>
<div style="text-align: left;">
<inline style="font-size: 16px; font-family: Times New Roman;"><b>Purchase Scotch Whisky Online at DrinkUpNY!</b>
<!--<inline-->
<div style="text-align: justify;">
<inline style="font-size: 14px; font-family: Times New Roman;">There are so many Scotch whisky brands available today that sometimes it's difficult to find what you're looking for. We offer an extensive selection of some of the best Scotch Whiskies on the market today, including Single Malt Scotches, Blended Malt Scotch Whisky, Blended Whisky and more!
</inline>
</div>
</inline>
</div>
</td>
</tr>
</table>
</td>
</tr>
</table>
<form action="/searchresults.asp" class="search_results_section" id="MainForm" method="post" name="MainForm" onsubmit="return OnSubmitSearchForm(event, this);">
<input name="Search" type="hidden" value="">
<input 

That wasn't very useful. The markup does have some promising class names we can use as CSS selectors, though, like this one:
```<div class="v-product">```

In [69]:
len(soup.select(".v-product"))

65

65 items for that selector. And 65 bottles on the page. That could be it. 

When we use select, Beautiful Soup gives us a list. So let's look at the list elements.

In [70]:
soup.select(".v-product")[0]

<div class="v-product">
<a bunadh="" class="v-product__img" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm" malt="" scotch="" single="" title="Aberlour a" whisky'="">
<img alt="Aberlour a'bunadh Single Malt Scotch Whisky" border="0" src="/v/vspfiles/photos/S0958-1.jpg" style="BORDER-RIGHT: #666666 1px solid; BORDER-TOP: #666666 1px solid; BORDER-LEFT: #666666 1px solid; BORDER-BOTTOM: #666666 1px solid"/></a>
<a class="v-product__title productnamecolor colors_productname" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm" title="Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml), S0958"> 
Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml)
</a>
<p class="text v-product__desc">
<div style="text-align: center;">
<span style="font-size: 16px; color: rgb(139, 69, 19);">
<inline style="font-size: 16px; font-family: Times New Roman;"><b>(96-100) - WE</b></inline></span><!--<inline-->
<inline style="font-size: 1

In [47]:
soup.select(".v-product")[0].select(".v-product__title")[0].text.split('\n', 1)[1].strip()

"Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml)"

In [46]:
soup.select(".v-product")[0].select(".product_productprice")[0].text.split('$', 1)[1].strip()

'130.00'

In [43]:
soup.select(".v-product")[0].select(".product_saleprice")[0].text

'Sale Price: $114.99     '

In [44]:
soup.select(".v-product")[0].select(".product_listprice")[0].text

'Reg Price: $144.99 '
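
Those strings still carry labels, dollar signs, and trailing spaces. A small sketch of turning them into numbers follows, using a regular expression and `Decimal` so cents stay exact. The helper name `parse_price` and the pattern are my own choices, and I am assuming prices always look like `$123.45`, optionally with thousands separators.

```python
import re
from decimal import Decimal

# A dollar sign, optional spaces, digits with optional commas, optional cents.
PRICE_RE = re.compile(r"\$\s*([\d,]+(?:\.\d{2})?)")


def parse_price(text):
    """Return the first dollar amount in `text` as a Decimal, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


sale = parse_price('Sale Price: $114.99     ')
reg = parse_price('Reg Price: $144.99 ')
print(sale, reg, reg - sale)
```

Returning `None` for text without a price keeps one odd listing from stopping a loop over the whole page.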

In [14]:
table = soup.select(".v-product")
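
With every product block in a list, the per-item lookups can be wrapped in a loop. This is only a sketch on invented markup: the second product and the `<span>` tags around the prices are made up, and the helper `text_or_none` is my own name. It guards against items that lack a sale price, which I assume can happen.

```python
from bs4 import BeautifulSoup

SAMPLE = """
<div class="v-product">
  <a class="v-product__title" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm">
  Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml)
  </a>
  <span class="product_listprice">Reg Price: $144.99 </span>
  <span class="product_saleprice">Sale Price: $114.99     </span>
</div>
<div class="v-product">
  <a class="v-product__title" href="https://www.drinkupny.com/hypothetical-bottle-p/x0000.htm">
  Hypothetical Bottle (750ml)
  </a>
  <span class="product_listprice">Reg Price: $50.00 </span>
</div>
"""


def text_or_none(parent, selector):
    """Stripped text of the first match under `parent`, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


records = []
for product in BeautifulSoup(SAMPLE, "html.parser").select(".v-product"):
    records.append({
        "name": text_or_none(product, ".v-product__title"),
        "url": product.select_one(".v-product__title")["href"],
        "list_price": text_or_none(product, ".product_listprice"),
        "sale_price": text_or_none(product, ".product_saleprice"),
    })

for record in records:
    print(record)
```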

In [45]:
scotch = table[0]
scotch

<div class="v-product">
<a bunadh="" class="v-product__img" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm" malt="" scotch="" single="" title="Aberlour a" whisky'="">
<img alt="Aberlour a'bunadh Single Malt Scotch Whisky" border="0" src="/v/vspfiles/photos/S0958-1.jpg" style="BORDER-RIGHT: #666666 1px solid; BORDER-TOP: #666666 1px solid; BORDER-LEFT: #666666 1px solid; BORDER-BOTTOM: #666666 1px solid"/></a>
<a class="v-product__title productnamecolor colors_productname" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm" title="Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml), S0958"> 
Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml)
</a>
<p class="text v-product__desc">
<div style="text-align: center;">
<span style="font-size: 16px; color: rgb(139, 69, 19);">
<inline style="font-size: 16px; font-family: Times New Roman;"><b>(96-100) - WE</b></inline></span><!--<inline-->
<inline style="font-size: 1

In [69]:
scotch.select(".product_listprice").get_text

AttributeError: 'list' object has no attribute 'get_text'
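
That error is the list from `select` talking: `get_text` lives on individual tags, not on the list that holds them. Two ways around it are sketched below on a one-line stand-in for the product block (the `<span>` is a guess at the real tag). Either index into the list, or use `select_one`, which returns the first matching tag directly. Also note the parentheses: `get_text` without them only refers to the method and never calls it.

```python
from bs4 import BeautifulSoup

SAMPLE = (
    '<div class="v-product">'
    '<span class="product_listprice">Reg Price: $144.99 </span>'
    '</div>'
)
scotch = BeautifulSoup(SAMPLE, "html.parser").select_one(".v-product")

# Option 1: select() returns a list, so pick an element before calling get_text().
matches = scotch.select(".product_listprice")
by_index = matches[0].get_text(strip=True)

# Option 2: select_one() returns the first matching tag itself.
by_select_one = scotch.select_one(".product_listprice").get_text(strip=True)

print(by_index)
print(by_select_one)
```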

In [25]:
scotch.select(".v-product__title")[0]

<a class="v-product__title productnamecolor colors_productname" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm" title="Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml), S0958"> 
Aberlour a'bunadh Non-Chillfiltered Speyside Single Malt Scotch Whisky (750ml)
</a>

In [43]:
soup.find_all('table')[2].find_all('href')

[]
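
The empty list makes sense once you remember that the first argument to `find_all` is a tag name, and `href` is an attribute, not a tag. A short sketch of two ways to ask for the attribute instead, on a made-up fragment:

```python
from bs4 import BeautifulSoup

SAMPLE = """
<table><tr><td>
  <a class="v-product__title" href="https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm">Aberlour a'bunadh</a>
  <a name="anchor-without-href">skip me</a>
</td></tr></table>
"""
table = BeautifulSoup(SAMPLE, "html.parser").table

# Looks for <href> tags, which do not exist, so the result is empty.
by_tag_name = table.find_all("href")

# Ask for <a> tags that have an href attribute, then read the attribute.
hrefs = [a["href"] for a in table.find_all("a", href=True)]

# The same idea written as a CSS attribute selector.
css_hrefs = [a["href"] for a in table.select("a[href]")]

print(by_tag_name)
print(hrefs)
```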