### Webscraping Workshop
#### Cornell Data Science
##### Author: Varun Gande

In [1]:
import numpy as np
import pandas as pd

#### What is web scraping?
Web scraping is the process of extracting specific information from websites. The Internet has *truckloads* of information available, and websites often have the most up-to-date data (sports statistics, book prices, etc.). However, getting this data from a website into a convenient format (pandas!) is usually much harder than downloading a .csv file. Fortunately, web scraping is a powerful way to obtain and parse this data so that it is usable for you!

#### What is HTML?
HTML (HyperText Markup Language) is a descriptive language that specifies web page structure. There are only a few key aspects of HTML you need to understand for this workshop.

An HTML document is structured as nested *elements*, each surrounded by matching opening and closing *tags*. Within the opening tag, we can add *attributes*, which provide additional information that affects how the browser interprets that element.

Here's an example of an HTML element: `<p class="nice">Hello! Welcome to INFO 1998!</p>`

The opening tag is of type 'p', and has an attribute 'class' with value 'nice'. The enclosed text content is 'Hello! Welcome to INFO 1998!'. And lastly, the closing tag of type 'p' denotes the end of the enclosed content. We designate the closing tag by putting a forward slash (/) in front of the type.
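
The same anatomy can be inspected from Python. This is a minimal sketch that parses the example element with BeautifulSoup (introduced later in this workshop); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The example element described above, as a plain string
html = '<p class="nice">Hello! Welcome to INFO 1998!</p>'
element = BeautifulSoup(html, 'html.parser').find('p')

tag_type = element.name         # the tag type
class_value = element['class']  # 'class' may hold several values, so bs4 returns a list
content = element.text          # the enclosed text content

print(tag_type)
print(class_value)
print(content)
```

Note that the 'class' attribute comes back as a list rather than a single string, which becomes useful later when an element has more than one class.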

Here are some common HTML tags for web scraping:
* head
* title
* body
* p (paragraph)
* div (block of content)
* a (link)
* li (list item)
* table

And many more...

Go to the website (http://books.toscrape.com/catalogue/page-1.html). Notice the structure. There are 20 books listed on the page, each with a title, a rating, and a price. What if we liked this data so much that we wanted to use it for some kind of project? We don't have a .csv file available, so we will instead try to scrape this webpage!

To see the raw HTML of the website, right click anywhere on the page and select **"Inspect"**. (On Macs, I believe you can also use the keyboard shortcut **CMD+Option+I**). The HTML seems very complicated! Fortunately, Python has libraries that allow us to parse through this mess. The one we will primarily be using for this workshop is **BeautifulSoup**. Let's go ahead and import that!

In [3]:
#pip (or conda) install requests
import requests
#pip (or conda) install beautifulsoup4
from bs4 import BeautifulSoup

The "requests" library allows us to make HTTP requests in Python. Websites are typically hosted on a *server*. Essentially, in order to get access to a webpage, we need to send a *request*. Once the server receives the request, it sends back information in the form of a *response*.

We want to read/parse this website, so we will send a **"GET"** request.
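
To make the idea of a request a little more concrete, here is a small sketch that builds a GET request without sending it, so nothing touches the network. It uses `requests.Request(...).prepare()`; the variable names are illustrative.

```python
import requests

url = 'http://books.toscrape.com/catalogue/page-1.html'

# Build the request object only; nothing is sent to the server here
prepared = requests.Request('GET', url).prepare()

print(prepared.method)  # the HTTP method the server will see
print(prepared.url)     # the address being requested
```

Once a request is actually sent with `requests.get(url)`, the returned response carries a `status_code`; a value of 200 means the server fulfilled the request successfully.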

## Sending a GET request

In [4]:
url ='http://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
print(soup.prettify()) # prettify() returns the HTML as a nice, neatly indented string (note the parentheses!)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:30" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../static/oscar/css/styles.css" rel="styles

The BeautifulSoup object assigned to 'soup' is created with two arguments. The first argument is the HTML to be parsed (response.text), and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.

Let us start by just extracting the names of the 20 books on this page. First, as an example, we will extract the first book title: "A Light in the Attic".

In [5]:
book_title = soup.find('li',{'class':'col-xs-6 col-sm-4 col-md-3 col-lg-3'}).find('article',{'class':'product_pod'}).find('h3')
print(book_title)

<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>


Nice! We've found the tag that contains the title of the first book. Now, we just need to extract the title. 

### Question

What part of the above HTML tag do we want to extract?

We can see that one part of the tag contains the full title "A Light in the Attic". But another part (the inner contents of the 'a' tag) contains "A Light in the ...". That's not what we want. We want to extract the full title.

In [6]:
print(book_title.text)
print(book_title.find('a')['title'])

A Light in the ...
A Light in the Attic


To extract the titles of the remaining books, it will be the exact same process, with the exact same tags, etc. Can anyone say why?

Remember, we used the find() function to find the first occurrence of the book 'container'. Fortunately, BeautifulSoup has a **find_all()** function as well. No prizes for guessing what that does!
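
The difference between the two functions is easiest to see on a tiny page. The HTML below is a made-up three-book snippet (the class name 'book' and the titles are hypothetical, loosely modeled on the real page), so this sketch runs without any network access.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page with three book containers
html = """
<ol>
  <li class="book"><h3><a title="Book One">Book ...</a></h3></li>
  <li class="book"><h3><a title="Book Two">Book ...</a></h3></li>
  <li class="book"><h3><a title="Book Three">Book ...</a></h3></li>
</ol>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

first = soup_demo.find('li', {'class': 'book'})          # only the first match
all_items = soup_demo.find_all('li', {'class': 'book'})  # a list of every match

titles = [li.find('a')['title'] for li in all_items]
print(first.find('a')['title'])
print(len(all_items))
print(titles)
```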

In [9]:
tags = soup.find_all('li',{'class':'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
book_titles = []
for i in tags:
    book_titles.append(i.find('article',{'class':'product_pod'}).find('h3'))
print(book_titles[0]) # This is the exact same HTML tag as we had above!
print(book_titles[1]) # This is the next HTML tag, corresponding to the next book!

<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>


Now we want to extract each book's title, using the same process we used for the first book. We can do this with a for loop!

In [10]:
books = []
for tag in book_titles:
    books.append(tag.find('a')['title'])
print(books)

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]


Great! We have the titles, so now let's extract the prices. Can anyone tell me which 'div' tag corresponds to the container which holds the price for the first book?

In [11]:
print(tags[0])

<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>


In [13]:
book_prices = []
for i in tags:
    book_prices.append(i.find('article',{'class':'product_pod'}).find('div',{'class': 'product_price'}).find('p',{'class': 'price_color'}).text[2:])
print(book_prices)

['51.77', '53.74', '50.10', '47.82', '54.23', '22.65', '33.34', '17.93', '22.60', '52.15', '13.99', '20.66', '17.46', '52.29', '35.02', '57.25', '23.88', '37.59', '51.33', '45.17']
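
Why slice with `[2:]`? The price text shows up as 'Â£51.77' rather than '£51.77', most likely because the page's UTF-8 bytes were decoded as Latin-1, which turns the single '£' into two characters. The sketch below reproduces that effect and shows a hypothetical helper, `parse_price`, that pulls out the number with a regular expression so it does not depend on how many characters precede it. It also returns a float instead of a string.

```python
import re

# '£' encoded as UTF-8 and then misread as Latin-1 becomes two characters
garbled = '£'.encode('utf-8').decode('latin-1')

def parse_price(text):
    """Return the first decimal number found in text as a float (illustrative helper)."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'no price found in {text!r}')
    return float(match.group())

print(repr(garbled))
print(parse_price(garbled + '51.77'))  # the garbled form seen on this page
print(parse_price('£51.77'))           # the correctly decoded form
```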


### Challenge

Can you extract a list of the ratings (on the 5 star scale, so each entry is either '1','2',...,'5')? This one is a bit trickier...

### Solution to Challenge

In [15]:
book_ratings = []
for i in tags:
    tag = i.find('article',{'class':'product_pod'}).find('p')['class']
    if tag[1] == 'One':
        book_ratings.append('1')
    elif tag[1] == 'Two':
        book_ratings.append('2')
    elif tag[1] == 'Three':
        book_ratings.append('3')
    elif tag[1] == 'Four':
        book_ratings.append('4')
    else:
        book_ratings.append('5')
book_ratings

['3',
 '1',
 '1',
 '4',
 '5',
 '1',
 '4',
 '3',
 '4',
 '1',
 '2',
 '4',
 '5',
 '5',
 '5',
 '3',
 '1',
 '1',
 '2',
 '2']
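
As an alternative to the if/elif chain, the word-to-number conversion can be written as a dictionary lookup. This is only a sketch: `RATING_WORDS` and `rating_from_classes` are hypothetical names, and unlike the solution above it returns None for an unrecognized class instead of defaulting to '5'.

```python
# Maps the rating word found in the 'class' attribute to the rating string
RATING_WORDS = {'One': '1', 'Two': '2', 'Three': '3', 'Four': '4', 'Five': '5'}

def rating_from_classes(classes):
    """Take the list bs4 returns for 'class', e.g. ['star-rating', 'Three'],
    and return the matching rating string, or None if no rating word is present."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None

print(rating_from_classes(['star-rating', 'Three']))
print(rating_from_classes(['star-rating']))
```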

### Putting it all together: Making a DataFrame!!!

In [16]:
url ='http://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
data = []
tags = soup.find_all('li',{'class':'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
for i in tags:
    container = i.find('article',{'class':'product_pod'})
    name = container.find('h3').find('a')['title']
    price = container.find('div',{'class': 'product_price'}).find('p',{'class': 'price_color'}).text[2:]
    rating_name = container.find('p')['class'][1]
    if rating_name == 'One':
        rating = '1'
    elif rating_name == 'Two':
        rating = '2'
    elif rating_name == 'Three':
        rating = '3'
    elif rating_name == 'Four':
        rating = '4'
    else:
        rating = '5'
    data.append([name, price, rating])
df = pd.DataFrame(data, columns = ['Name','Price','Rating'])
df

| | Name | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 3 |
| 1 | Tipping the Velvet | 53.74 | 1 |
| 2 | Soumission | 50.10 | 1 |
| 3 | Sharp Objects | 47.82 | 4 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 5 |
| 5 | The Requiem Red | 22.65 | 1 |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | 4 |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | 3 |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.60 | 4 |
| 9 | The Black Maria | 52.15 | 1 |
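
One thing to keep in mind: the scraper stores every price and rating as a string, so pandas cannot do arithmetic on those columns yet. Here is a minimal sketch of converting them, using a small stand-in DataFrame (`df_demo`, built from the first three books scraped above) so that it runs without scraping anything.

```python
import pandas as pd

# Stand-in for the scraped data: every value is a string, as in the scraper
df_demo = pd.DataFrame(
    [['A Light in the Attic', '51.77', '3'],
     ['Tipping the Velvet', '53.74', '1'],
     ['Soumission', '50.10', '1']],
    columns=['Name', 'Price', 'Rating'],
)

df_demo['Price'] = pd.to_numeric(df_demo['Price'])  # strings -> floats
df_demo['Rating'] = df_demo['Rating'].astype(int)   # strings -> integers

mean_price = df_demo['Price'].mean()
print(df_demo.dtypes)
print(round(mean_price, 2))
```

After the conversion, operations such as sorting by price or averaging ratings behave numerically rather than alphabetically.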


### Bonus: Extracting all 1000 books!!!

If you look carefully, when cycling through the pages of books, the only thing in the URL that changes is the page number! We can exploit this!

In [19]:
base_url ='http://books.toscrape.com/catalogue/page-'
html_ext = '.html'
data = []
for pg in range(1,51):
    url = base_url + str(pg) + html_ext
    response = requests.get(url,timeout=10) # What's this timeout parameter?
    soup = BeautifulSoup(response.text,'html.parser')
    tags = soup.find_all('li',{'class':'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
    for i in tags:
        container = i.find('article',{'class':'product_pod'})
        name = container.find('h3').find('a')['title']
        price = container.find('div',{'class': 'product_price'}).find('p',{'class': 'price_color'}).text[2:]
        rating_name = container.find('p')['class'][1]
        if rating_name == 'One':
            rating = '1'
        elif rating_name == 'Two':
            rating = '2'
        elif rating_name == 'Three':
            rating = '3'
        elif rating_name == 'Four':
            rating = '4'
        else:
            rating = '5'
        data.append([name, price, rating])
df = pd.DataFrame(data, columns = ['Name','Price','Rating'])
df

| | Name | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 3 |
| 1 | Tipping the Velvet | 53.74 | 1 |
| 2 | Soumission | 50.10 | 1 |
| 3 | Sharp Objects | 47.82 | 4 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 5 |
| ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's Adventures in Won... | 55.53 | 1 |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | 57.06 | 4 |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | 16.97 | 5 |
| 998 | 1st to Die (Women's Murder Club #1) | 53.98 | 1 |


### A note about this 'timeout' thing and ethics:

As humans browsing the web, we are limited in how many requests we can send at a time. However, with a loop like the one above, we can fire off requests far faster than any person could. If a server receives more requests than it can handle, it can become overloaded and stop fulfilling legitimate requests altogether. Overwhelming a server this way, even by accident, amounts to a Denial of Service (DoS) attack.

So what does the "timeout" parameter in requests.get() do? It sets how many seconds to wait for the server to respond before giving up and raising an exception; without it, a request can hang indefinitely. It does not slow your requests down. If you do need to make a ton of requests, send them slowly instead of all at once, for example by pausing with time.sleep() between requests.
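
One simple way to send requests slowly is to pause between them with `time.sleep()`. The sketch below is only illustrative: `fetch_all` is a hypothetical helper, and it takes the fetching function as an argument so the example can run with a fake fetcher instead of hitting the real site.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping delay_seconds between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause so the server is not flooded
        results.append(fetch(url))
    return results

base_url = 'http://books.toscrape.com/catalogue/page-'
urls = [base_url + str(pg) + '.html' for pg in range(1, 4)]

# A fake fetcher keeps this example offline. In real use it could be
# something like: lambda u: requests.get(u, timeout=10).text
pages = fetch_all(urls, fetch=lambda u: 'fake page for ' + u, delay_seconds=0.01)
print(len(pages))
```

A delay of about one second per request is a common courtesy, though the right value depends on the site.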

Additionally, some websites (Amazon, Facebook, etc.) don't want people scraping their content, and they may try to detect and block whoever is doing it. Just be mindful of this! (This is why I used a fake bookstore website for this workshop!)
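
Many sites publish a robots.txt file stating which paths automated clients may visit, and Python's standard library can read it. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are hypothetical, not taken from any real site) to show the idea without any network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines
robots_lines = ['User-agent: *', 'Disallow: /private/']

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic client ('*') may fetch each URL under these rules
allowed = parser.can_fetch('*', 'http://example.com/catalogue/page-1.html')
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` would download the actual file instead of using a hand-written list.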

### Thank you!