# Web Scraping Example


## The Requests Library

Now that we know how to use the Beautiful Soup library, let's walk through a complete web scraping workflow for data analysis.

In this notebook, we will:
1. Use the `Requests` library to get the HTML source for a web page.
2. Scrape the data using `Beautiful Soup`.
3. Save the results into a CSV file with `Pandas`.

First we have to import the required libraries.

### A Note about web scraping

Although you can, in principle, scrape any website that you can read online, a scraper may send many requests to a site in a short time, which increases the load on its web server.
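One way to keep that load low is to pause between requests. The sketch below is only an illustration and is not part of the original notebook: `fetch_politely` and `delay_seconds` are hypothetical names, and the `fetch` argument stands in for a function such as `requests.get`.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function (no network access needed)
demo_pages = fetch_politely(
    ["page-1.html", "page-2.html"],
    fetch=lambda url: f"fetched {url}",
    delay_seconds=0.1,
)
print(demo_pages)
```

In a real scraper, `fetch` could be `requests.get`, with a delay chosen to suit the site being scraped.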

Some companies have terms and conditions that restrict web scraping, so always check them before you start.
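Besides the terms and conditions, many sites publish rules for automated clients in a `robots.txt` file. As a rough sketch, Python's standard `urllib.robotparser` module can evaluate such rules. The helper name `is_allowed`, the user agent `my-scraper` and the sample rules below are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except paths under /private/
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "my-scraper", "http://example.com/private/page.html"))
print(is_allowed(sample_rules, "my-scraper", "http://example.com/catalogue/index.html"))
```

Passing this check does not replace reading the site's terms; it only covers the rules the site publishes for crawlers.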

### Scraping Sandbox

We are going to use a site that is specifically developed to help us learn web scraping, called [toscrape.com](https://toscrape.com/). 



# Using the Requests Library

We can use the `Requests` library to send a `get` request to a web page. If the response is of type HTML, then we can create a `BeautifulSoup` object by parsing the content.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Let's set up the URL then use the `Requests` library to get the web page which represents a [bookstore](http://books.toscrape.com/).

In [2]:
# Send a request to get the books.toscrape web page

web_url = "http://books.toscrape.com/"
response = requests.get(web_url)

# Check that the request succeeded and show the content type of the response
if response.status_code == 200:
    print(response.headers['Content-Type'])

    print()
else:
    print(response.reason)

text/html



# Using the Beautiful Soup Library

If the content type of the response is "text/html", then we can proceed to create the `BeautifulSoup` object for scraping.
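The `Content-Type` header can carry extra parameters, for example `text/html; charset=utf-8`, so a plain equality test against `"text/html"` may miss valid HTML responses. A minimal sketch of a more tolerant check follows; `is_html` is a hypothetical helper name, not part of `Requests`.

```python
def is_html(content_type):
    """Return True if a Content-Type header value describes an HTML document."""
    # Drop optional parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


print(is_html("text/html"))
print(is_html("text/html; charset=utf-8"))
print(is_html("application/json"))
```

With a real response, the helper could be called as `is_html(response.headers['Content-Type'])` before creating the `BeautifulSoup` object.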

In [3]:
soup = BeautifulSoup(response.content,'html.parser')

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

Let's go through the list of categories. Inspecting the HTML through the `prettify()` output or *Inspect Element* in Google Chrome, we find that the categories sit inside a `<div>` tag with the attribute `class="side_categories"`, so let's search for that.


In [5]:
category_list = soup.find('div', class_='side_categories')
category_list

<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/historical-fiction_4/index.html">
                            
                                Historical Fiction
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/sequential-art_5/index.html">
                            
                                Sequential Art
          

Great! We have found it. Let's get all the `<a>` (*anchor*) tags, which link to each category.

In [11]:
link_tags = category_list.find_all('a')
links = [l.attrs for l in link_tags]

In [12]:
links

[{'href': 'catalogue/category/books_1/index.html'},
 {'href': 'catalogue/category/books/travel_2/index.html'},
 {'href': 'catalogue/category/books/mystery_3/index.html'},
 {'href': 'catalogue/category/books/historical-fiction_4/index.html'},
 {'href': 'catalogue/category/books/sequential-art_5/index.html'},
 {'href': 'catalogue/category/books/classics_6/index.html'},
 {'href': 'catalogue/category/books/philosophy_7/index.html'},
 {'href': 'catalogue/category/books/romance_8/index.html'},
 {'href': 'catalogue/category/books/womens-fiction_9/index.html'},
 {'href': 'catalogue/category/books/fiction_10/index.html'},
 {'href': 'catalogue/category/books/childrens_11/index.html'},
 {'href': 'catalogue/category/books/religion_12/index.html'},
 {'href': 'catalogue/category/books/nonfiction_13/index.html'},
 {'href': 'catalogue/category/books/music_14/index.html'},
 {'href': 'catalogue/category/books/default_15/index.html'},
 {'href': 'catalogue/category/books/science-fiction_16/index.html'},
 

In [26]:
URL = web_url+links[1]['href']
print(URL)

http://books.toscrape.com/catalogue/category/books/travel_2/index.html
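String concatenation works here because `web_url` ends with `/` and the link does not start with one. As an alternative sketch, `urljoin` from the standard library resolves a link against a base URL and handles the slashes itself. The variable names below are chosen for illustration.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"
relative_link = "catalogue/category/books/travel_2/index.html"

# urljoin resolves the relative link against the base URL
travel_url = urljoin(base_url, relative_link)
print(travel_url)

# A link that starts with "/" is resolved against the site root
tag_url = urljoin("http://quotes.toscrape.com", "/tag/life/")
print(tag_url)
```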


In [27]:
len(links)

51

In [28]:
#for i in range(len(links)):
    #URL = web_url+links[i]['href']
    #print(URL)

In [29]:
link_request = requests.get(URL)


In [30]:
# Check that the request succeeded and show the content type of the response
if link_request.status_code == 200:
    print(link_request.headers['Content-Type'])

    print()
else:
    print(link_request.reason)

text/html



In [31]:
category_soup = BeautifulSoup(link_request.content,'html.parser')

In [32]:
category_soup.title

<title>
    Travel | 
     Books to Scrape - Sandbox

</title>

In [34]:
# We can get the name of the category we are in from the <h1> tag.
# Inspect the element to check!
current_category = category_soup.h1.string
current_category

'Travel'

Now that we are in the Travel books category, let's get the title, price and star rating of each book.

We can inspect the element to find the relevant tag.

<img src="images/books_inspect.png" alt="drawing" width="200"/>


In [35]:
# Match the full class attribute string to find the tags that contain the
# information about books (a single token such as 'col-xs-6' also works)
books = category_soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
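To see how `class_` treats a single token versus a full class string, here is a small sketch on made-up HTML. The markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="col-xs-6 col-sm-4">Book A</li>
  <li class="col-xs-6">Book B</li>
  <li class="sidebar">Not a book</li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# A single token matches every tag that carries that class
by_token = demo_soup.find_all("li", class_="col-xs-6")

# A multi-class string only matches that exact attribute value
by_exact_string = demo_soup.find_all("li", class_="col-xs-6 col-sm-4")

print([str(li.string) for li in by_token])
print([str(li.string) for li in by_exact_string])
```

Searching by a single token is usually less brittle, because it still matches if the site adds or reorders the other classes.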

In [36]:
books

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../../../its-only-the-himalayas_981/index.html"><img alt="It's Only the Himalayas" class="thumbnail" src="../../../../media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg"/></a>
 </div>
 <p class="star-rating Two">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
 <div class="product_price">
 <p class="price_color">£45.17</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod"

In [37]:
# Let's just get one book to inspect the tags closely
one_book = books[0]
one_book.contents

['\n',
 <article class="product_pod">
 <div class="image_container">
 <a href="../../../its-only-the-himalayas_981/index.html"><img alt="It's Only the Himalayas" class="thumbnail" src="../../../../media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg"/></a>
 </div>
 <p class="star-rating Two">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
 <div class="product_price">
 <p class="price_color">£45.17</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 '\n']

In [42]:
# Find the title
one_book.h3.string

"It's Only the Himalayas"

In [49]:
# Find the price
one_book.find('p', class_='price_color').string

'£45.17'

In [63]:
# Find the star rating
one_book.find('p', class_='star-rating')['class'][1]


'Two'

Now that we can get the data from one book, let's loop through every category link and store the category, title, price and star rating of each book in separate lists.

In [70]:
categories = []
titles = []
prices = []
ratings = []
ratings_dict = {'One':1, 'Two':2, 'Three':3, 'Four':4, 'Five':5}
for l in links[1:]:
    URL = web_url + l['href']  # web_url already ends with '/'
    link_request = requests.get(URL)
    if (link_request.status_code == 200):
        category_soup = BeautifulSoup(link_request.content,'html.parser')
        current_category = category_soup.h1.string
        books = category_soup.find_all('li', class_="col-xs-6")
        for b in books:
            categories.append(current_category)
            titles.append(b.h3.string)
            prices.append(b.find('p', class_='price_color').string)
            ratings_s = b.find('p', class_='star-rating')['class'][1]
            ratings.append(ratings_dict[ratings_s])
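The per-book extraction in the loop could also be factored into a small helper that returns one dictionary per book. The sketch below runs on a trimmed copy of the book markup shown earlier, so it needs no request. `parse_book` and `RATING_WORDS` are hypothetical names, and reading the `title` attribute of the link assumes that it holds the full title.

```python
from bs4 import BeautifulSoup

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_book(book_tag):
    """Extract the title, price and rating from one book's tag."""
    rating_word = book_tag.find("p", class_="star-rating")["class"][1]
    return {
        # Assumption: the title attribute holds the full title
        "Title": book_tag.h3.a["title"],
        "Price": str(book_tag.find("p", class_="price_color").string),
        "Rating": RATING_WORDS[rating_word],
    }


sample_html = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
  <div class="product_price">
   <p class="price_color">£45.17</p>
  </div>
 </article>
</li>
"""
sample_book = BeautifulSoup(sample_html, "html.parser").li
parsed_book = parse_book(sample_book)
print(parsed_book)
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, which avoids keeping several parallel lists in step.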

In [66]:
# Create a dataframe from the lists
books_df = pd.DataFrame({'Category':categories,
                        'Title': titles,
                        'Price':prices,
                        'Rating':ratings})

In [67]:
books_df

|   | Category | Title | Price | Rating |
|---|----------|-------|-------|--------|
| 0 | Travel | It's Only the Himalayas | £45.17 | 2 |
| 1 | Travel | Full Moon over Noah’s ... | £49.43 | 4 |
| 2 | Travel | See America: A Celebration ... | £48.87 | 3 |
| 3 | Travel | Vagabonding: An Uncommon Guide ... | £36.94 | 2 |
| 4 | Travel | Under the Tuscan Sun | £37.33 | 3 |
| 5 | Travel | A Summer In Europe | £44.34 | 2 |
| 6 | Travel | The Great Railway Bazaar | £30.54 | 1 |
| 7 | Travel | A Year in Provence ... | £56.88 | 4 |
| 8 | Travel | The Road to Little ... | £23.21 | 1 |
| 9 | Travel | Neither Here nor There: ... | £38.95 | 3 |


**Saving to File**

As you can see, pandas gives us a tidy data frame. It is not perfect: the prices are still strings with a currency symbol and some titles are truncated, so we will have to do some data cleaning with additional pandas and Python functions.
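As one example of such cleaning, the price strings can be converted to numbers so that they can be sorted or averaged. This is a minimal sketch on two rows copied from the output; `price_df` is an illustrative name.

```python
import pandas as pd

price_df = pd.DataFrame({
    "Title": ["It's Only the Himalayas", "Under the Tuscan Sun"],
    "Price": ["£45.17", "£37.33"],
})

# Remove the currency symbol, then convert the remaining text to numbers
price_df["Price"] = (
    price_df["Price"].str.replace("£", "", regex=False).astype(float)
)

print(price_df.dtypes)
print(price_df.sort_values("Price"))
```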

Now we can save this dataframe into a CSV file using the `to_csv()` method.


In [68]:
# Save dataframe to file
books_df.to_csv('scraped_books.csv')
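By default, `to_csv()` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If the index is not needed, passing `index=False` leaves it out. The sketch below uses an in-memory string instead of a file so that it has no side effects; `tiny_df` is an illustrative name.

```python
import io

import pandas as pd

tiny_df = pd.DataFrame({"Title": ["Under the Tuscan Sun"], "Rating": [3]})

# Default: the row index is written as an unnamed first column
with_index = tiny_df.to_csv()

# index=False leaves the row index out of the output
without_index = tiny_df.to_csv(index=False)

columns_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(columns_with_index)
print(columns_without_index)
```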

## Summary

As you can see, web scraping takes some exploration, inspection and filtering of the various tags and attributes. To obtain the books by category, we loop through each of the links, send a request and add the books from each response. A complete scrape would also need to navigate to the additional pages of the larger categories.
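Navigating to additional pages usually means finding a "next" link on each page and following it until there is none. The sketch below assumes that the link sits in an `<li class="next">` element; that markup is an assumption to verify by inspecting a category page, and `find_next_page` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page(page_soup, page_url):
    """Return the absolute URL of the next page, or None on the last page."""
    next_item = page_soup.find("li", class_="next")
    if next_item is None or next_item.a is None:
        return None
    # The link is relative to the current page, so resolve it against page_url
    return urljoin(page_url, next_item.a["href"])


# Assumed pager markup, used here so the example runs without a request
pager_html = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
pager_soup = BeautifulSoup(pager_html, "html.parser")
current_url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"

next_url = find_next_page(pager_soup, current_url)
last_page_result = find_next_page(BeautifulSoup("<ul></ul>", "html.parser"), current_url)
print(next_url)
print(last_page_result)
```

A scraping loop could call this helper after parsing each page and stop when it returns `None`.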



# Exercises

Let's try to extract some data from another sandbox site, [Quotes to Scrape](http://quotes.toscrape.com).

Q1. Create a `BeautifulSoup` object to scrape the content from this page.


In [72]:
# Q1 Answer

#Send a request to get the HTML page http://quotes.toscrape.com
web_url = "http://quotes.toscrape.com"
response = requests.get(web_url)

# Check that the request succeeded, then parse the response using Beautiful Soup
if response.status_code == 200:
    print(response.headers['Content-Type'])
    print()
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup.prettify())
else:
    print(response.reason)

text/html; charset=utf-8

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
    

Q2. Locate the tags listed under "Top Ten tags" (view the page to check) and create a list containing only these 10 tags.

In [94]:
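# Q2 Answer
# Assumption: the "Top Ten tags" box is the <div> with the class "tags-box"
# (inspect the page to confirm)
tags = soup.find('div', class_='tags-box')
tags_list = [a.text for a in tags.find_all('a')]
tags_list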

['love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']

Let's ask the user which tag they would like to view quotes for.

In [95]:
answer = input("Which types of quotes would you like to view?")

Which types of quotes would you like to view?life


Q3. Based on the user's answer, find the link to the page that contains the quotes for that tag. We can pass the text to search for as the `string` argument of `find_all()` or `find()`.


In [96]:
## Put in the suitable tag and argument
quote_tag = soup.find('a', string=answer)


Q4. Send a request to get the quotes from that tag's page (remember to append the link to the correct `web_url`).

In [100]:
# Build the full URL from the base URL and the tag's relative link
quote_response = requests.get(web_url + quote_tag['href'])
quote_response

<Response [200]>

Q5. Store each of the quotes and their authors into a dataframe.
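As a starting point for Q5, the sketch below parses a trimmed copy of the quote markup shown in the `prettify()` output and builds a dataframe from it. The column names `Quote` and `Author` and the variable names are chosen for illustration, and this is one possible approach rather than the only answer.

```python
from bs4 import BeautifulSoup
import pandas as pd

sample_html = """
<div class="quote">
 <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author">Albert Einstein</small></span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect one dictionary per quote, then build the dataframe in one step
rows = []
for quote in sample_soup.find_all("div", class_="quote"):
    rows.append({
        "Quote": quote.find("span", class_="text").get_text(strip=True),
        "Author": quote.find("small", class_="author").get_text(strip=True),
    })

quotes_df = pd.DataFrame(rows)
print(quotes_df)
```

With the real page, `sample_soup` would be replaced by a `BeautifulSoup` object created from `quote_response.content`.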

Q6. Can you loop through each of the top ten tags to store all those quotes in a dataframe, then save it to a CSV file?