# Web Scraping Example


## The Requests Library

Now that we know how to use the Beautiful Soup library, let's walk through a complete web scraping workflow for data analysis.

In this notebook, we will:
1. Use the `Requests` library to get the HTML source for a web page.
2. Scrape the data using `Beautiful Soup`.
3. Save the results into a CSV file with `Pandas`.

First we have to import the required libraries.

### A Note about web scraping

Although you can scrape almost any website that you can read online, a scraper may send many requests in a short time, which increases the load on the web server.

Some companies have terms and conditions that restrict web scraping, so always check them before you start.
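
Besides the terms and conditions, many sites publish a `robots.txt` file that states which paths crawlers may visit. The sketch below shows one way to check it with Python's standard `urllib.robotparser` module. The `ROBOTS_TXT` content and the `example.com` URLs are made up for illustration so that the example runs offline; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Return a RobotFileParser loaded from the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A path that no rule mentions is allowed; a path under /private/ is not.
print(parser.can_fetch("*", "http://example.com/catalogue/index.html"))
print(parser.can_fetch("*", "http://example.com/private/data.html"))
```

Against a live site, you would typically call `set_url()` with the address of the site's `robots.txt` and then `read()` instead of passing the text in by hand.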

### Scraping Sandbox

We are going to use a site that is specifically developed to help us learn web scraping, called [toscrape.com](https://toscrape.com/). 



# Using the Requests Library

We can use the `Requests` library to send a `get` request to a web page. If the response is of type HTML, then we can create a `BeautifulSoup` object by parsing the content.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Let's set up the URL then use the `Requests` library to get the web page which represents a [bookstore](http://books.toscrape.com/).

In [2]:
# Send a request to get the books.toscrape web page

web_url = "http://books.toscrape.com/"
response = requests.get(web_url)

# Check that the request succeeded and print the content type
if (response.status_code == 200):
    print(response.headers['Content-Type'])
    print()
else:
    print(response.reason)

text/html



# Using the Beautiful Soup Library

If the content type of the response is "text/html", then we can proceed to create the `BeautifulSoup` object for scraping.
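
Servers often append parameters to this header, for example `text/html; charset=utf-8`, so an exact string comparison can miss valid HTML responses. A minimal sketch of a more tolerant check follows; the helper name `is_html` is hypothetical and not part of `Requests`.

```python
def is_html(content_type):
    """Return True when a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case and spacing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"

print(is_html("text/html"))
print(is_html("text/html; charset=utf-8"))
print(is_html("application/json"))
```

With a real response, the value to pass in would be `response.headers.get('Content-Type')`.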

In [3]:
soup = BeautifulSoup(response.content,'html.parser')

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

Let's go through the list of categories. Inspecting the HTML, either in the `prettify()` output or with *Inspect* in Google Chrome, shows that the categories sit in a `<div>` tag with the attribute `class="side_categories"`, so let's search for that.


In [5]:
category_list = soup.find("div", class_='side_categories')
category_list

<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/historical-fiction_4/index.html">
                            
                                Historical Fiction
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/sequential-art_5/index.html">
                            
                                Sequential Art
          

Great! We have found it. Let's get all the `<a>` (*anchor*) tags, which will lead us to each category.

In [6]:
link_tags = category_list.find_all('a')
links = [l.attrs for l in link_tags]

In [7]:
links

[{'href': 'catalogue/category/books_1/index.html'},
 {'href': 'catalogue/category/books/travel_2/index.html'},
 {'href': 'catalogue/category/books/mystery_3/index.html'},
 {'href': 'catalogue/category/books/historical-fiction_4/index.html'},
 {'href': 'catalogue/category/books/sequential-art_5/index.html'},
 {'href': 'catalogue/category/books/classics_6/index.html'},
 {'href': 'catalogue/category/books/philosophy_7/index.html'},
 {'href': 'catalogue/category/books/romance_8/index.html'},
 {'href': 'catalogue/category/books/womens-fiction_9/index.html'},
 {'href': 'catalogue/category/books/fiction_10/index.html'},
 {'href': 'catalogue/category/books/childrens_11/index.html'},
 {'href': 'catalogue/category/books/religion_12/index.html'},
 {'href': 'catalogue/category/books/nonfiction_13/index.html'},
 {'href': 'catalogue/category/books/music_14/index.html'},
 {'href': 'catalogue/category/books/default_15/index.html'},
 {'href': 'catalogue/category/books/science-fiction_16/index.html'},
 

In [11]:
URL = web_url+links[1]['href']
print(URL)

http://books.toscrape.com/catalogue/category/books/travel_2/index.html
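
String concatenation works here because the `href` is relative to the site root and `web_url` ends with a slash. For links that start with `/` or contain `../`, the standard library's `urllib.parse.urljoin` resolves the address the way a browser would. A short sketch, using the category link above and the relative book link that appears in the Travel page markup:

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# A link relative to the site root, as found in the category sidebar.
category_url = urljoin(base_url, "catalogue/category/books/travel_2/index.html")
print(category_url)

# A link relative to the category page itself, containing "../" segments.
book_url = urljoin(category_url, "../../../its-only-the-himalayas_981/index.html")
print(book_url)
```

The second call shows why the base matters: the book link is resolved against the category page's URL, not against the site root.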


In [12]:
link_request = requests.get(URL)


In [13]:
# Check that the request succeeded and print the content type
if (link_request.status_code == 200):
    print(link_request.headers['Content-Type'])

    print()
else:
    print(link_request.reason)

text/html



In [14]:
category_soup = BeautifulSoup(link_request.content,'html.parser')

In [15]:
category_soup.title

<title>
    Travel | 
     Books to Scrape - Sandbox

</title>

In [17]:
# We can get the name of the category we are in from the <h1> tag.
# Inspect the element to check!
current_category = category_soup.h1.string
current_category

'Travel'

Now that we are in the Travel books category, let's get the title, rating, price and status from each book.

We can inspect the element to find the relevant tag.

<img src="images/books_inspect.png" alt="drawing" width="200"/>


In [20]:
# Use the first token in the class attribute to find the tags that contain
# the information about each book
books = category_soup.find_all('li', class_="col-xs-6")

In [21]:
books

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../../../its-only-the-himalayas_981/index.html"><img alt="It's Only the Himalayas" class="thumbnail" src="../../../../media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg"/></a>
 </div>
 <p class="star-rating Two">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
 <div class="product_price">
 <p class="price_color">£45.17</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod"

In [22]:
# Let's just get one book to inspect the tags closely
one_book = books[0]
one_book.contents

['\n',
 <article class="product_pod">
 <div class="image_container">
 <a href="../../../its-only-the-himalayas_981/index.html"><img alt="It's Only the Himalayas" class="thumbnail" src="../../../../media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg"/></a>
 </div>
 <p class="star-rating Two">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
 <div class="product_price">
 <p class="price_color">£45.17</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 '\n']

In [23]:
# Find the title
one_book.h3.string

"It's Only the Himalayas"

In [24]:
# Find the price
one_book.find('p', class_='price_color').string

'£45.17'

In [28]:
# Find the star rating
one_book.find('p', class_='star-rating')['class'][1]

'Two'

Now that we can get the data from one book, let's loop through all the books and store the title, price and star rating in respective lists.
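
The three lookups above can also be collected into one helper that returns a record per book. This is only a sketch: `parse_book` and `RATINGS` are hypothetical names, and `SAMPLE` is a trimmed copy of the markup shown earlier so that the example runs without a network request. It reads the title from the link's `title` attribute, on the assumption that this attribute holds the full title even when the visible link text is shortened.

```python
from bs4 import BeautifulSoup

RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Trimmed copy of the markup for one book, taken from the output above.
SAMPLE = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
  <div class="product_price">
   <p class="price_color">£45.17</p>
  </div>
 </article>
</li>
"""

def parse_book(book_tag):
    """Extract title, price text and numeric rating from one book <li> tag."""
    rating_classes = book_tag.find("p", class_="star-rating")["class"]
    # Search the class list for the rating word instead of relying on index 1.
    rating_word = next((c for c in rating_classes if c in RATINGS), None)
    return {
        "title": book_tag.h3.a["title"],
        "price": book_tag.find("p", class_="price_color").get_text(strip=True),
        "rating": RATINGS.get(rating_word),
    }

book = BeautifulSoup(SAMPLE, "html.parser").find("li", class_="col-xs-6")
record = parse_book(book)
print(record)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which avoids keeping several parallel lists in step.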

In [42]:
categories = []
titles = []
prices = []
ratings = []
# Map the rating word in the class attribute to a number
ratings_dict = {'One':1, 'Two':2, 'Three':3, 'Four':4, 'Five':5}
# Skip links[0], the parent "Books" link, and visit each category page
for link in links[1:]:
    URL = web_url + link['href']
    link_request = requests.get(URL)
    # Only parse pages that were fetched successfully
    if link_request.status_code == 200:
        category_soup = BeautifulSoup(link_request.content, 'html.parser')
        current_category = category_soup.h1.string
        books = category_soup.find_all('li', class_="col-xs-6")
        # Append one entry per book to each list so the lists stay aligned
        for b in books:
            categories.append(current_category)
            titles.append(b.h3.string)
            prices.append(b.find('p', class_='price_color').string)
            ratings_s = b.find('p', class_='star-rating')['class'][1]
            ratings.append(ratings_dict[ratings_s])
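
A loop like this sends one request per category in quick succession. In line with the earlier note about server load, one option is to pause between requests. The sketch below is a hypothetical helper, not part of `Requests`: `fetch_all` takes the fetch function as a parameter, and `fake_fetch` stands in for a real request so that the example runs offline.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Stand-in for a real fetch function so the sketch runs without network access.
def fake_fetch(url):
    return f"fetched {url}"

pages = fetch_all(["page-1", "page-2"], fake_fetch, delay_seconds=0)
print(pages)
```

With `Requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10)`, where the 10-second timeout is an arbitrary choice.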

    
    

In [43]:
# Create a dataframe from the lists
books_df = pd.DataFrame({'Category':categories,
                        'Title': titles,
                        'Price':prices,
                        'Rating':ratings})

In [47]:
books_df.tail(20)


| | Category | Title | Price | Rating |
|---|---|---|---|---|
| 497 | Self Help | You Are a Badass: ... | £12.08 | 3 |
| 498 | Self Help | How to Stop Worrying ... | £46.49 | 5 |
| 499 | Historical | All the Light We ... | £29.87 | 5 |
| 500 | Historical | The Girl You Left ... | £15.79 | 1 |
| 501 | Christian | (Un)Qualified: How God Uses ... | £54.00 | 5 |
| 502 | Christian | Crazy Love: Overwhelmed by ... | £47.72 | 2 |
| 503 | Christian | Blue Like Jazz: Nonreligious ... | £25.77 | 1 |
| 504 | Suspense | Silence in the Dark ... | £58.33 | 3 |
| 505 | Short Stories | The Grownup | £35.88 | 1 |
| 506 | Novels | Suzie Snowflake: One beautiful ... | £54.81 | 5 |


In [41]:
books_df.tail(10)

| | Category | Title | Price | Rating |
|---|---|---|---|---|
| 527 | Health | The Bulletproof Diet: Lose ... | £49.05 | 3 |
| 528 | Health | Eat Fat, Get Thin | £54.07 | 2 |
| 529 | Health | 10-Day Green Smoothie Cleanse: ... | £49.71 | 5 |
| 530 | Health | The Art and Science ... | £52.98 | 5 |
| 531 | Politics | Libertarianism for Beginners | £51.33 | 2 |
| 532 | Politics | Why the Right Went ... | £52.65 | 4 |
| 533 | Politics | Equal Is Unfair: America's ... | £56.86 | 1 |
| 534 | Cultural | Amid the Chaos | £36.58 | 1 |
| 535 | Erotica | Dark Notes | £19.19 | 5 |
| 536 | Crime | The Long Shadow of ... | £10.97 | 1 |


**Saving to File**

As you can see, pandas gives us a tidy data frame. It's not perfect: for example, the prices are still strings with a currency symbol, so some data cleaning with additional pandas and Python functions will be needed.
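
As one example of such cleaning, the scraped prices are strings like `'£45.17'` and need converting before any arithmetic. The helper below is a minimal sketch with a hypothetical name, `parse_price`. It assumes prices use `.` as the decimal separator and contain no thousands separators.

```python
def parse_price(price_text):
    """Convert a price string such as '£45.17' to a float."""
    # Keep only digits and the decimal point, dropping the currency symbol
    # and any stray characters left over from encoding problems.
    cleaned = "".join(ch for ch in price_text if ch.isdigit() or ch == ".")
    return float(cleaned)

print(parse_price("£45.17"))
```

Applied to the data frame, this could look like `books_df['Price'] = books_df['Price'].map(parse_price)`.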

Now we can save this dataframe into a CSV file using the `to_csv()` method.


In [48]:
# Save dataframe to file
books_df.to_csv('scraped_books.csv')
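
By default `to_csv()` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. If the index carries no meaning, passing `index=False` leaves it out. The sketch below illustrates the difference with a small made-up frame and in-memory buffers instead of files.

```python
import io

import pandas as pd

df = pd.DataFrame({"Title": ["The Grownup"], "Rating": [1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

For the notebook's data frame, the equivalent call would be `books_df.to_csv('scraped_books.csv', index=False)`.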

## Summary

As you can see, web scraping takes some exploration, inspecting and filtering of the various tags and attributes. To collect the books by category, we looped through each category link, sent a request and added the books from each response. To obtain every book, we would also have to navigate to the additional pages of any category that spans more than one page.
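
Following additional pages usually means finding the "next" link on each page and requesting it until no such link remains. The sketch below assumes the pager marks that link with `<li class="next">`. This is an assumption about the site's markup that should be confirmed by inspecting a multi-page category. `SAMPLE_PAGER` is hypothetical markup, and `next_page_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical pager markup, used so the example runs offline.
SAMPLE_PAGER = """
<ul class="pager">
  <li class="current">Page 1 of 2</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""

def next_page_url(page_soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    next_item = page_soup.find("li", class_="next")
    if next_item is None or next_item.a is None:
        return None
    # Resolve the relative link against the page it was found on.
    return urljoin(current_url, next_item.a["href"])

pager_soup = BeautifulSoup(SAMPLE_PAGER, "html.parser")
current = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
print(next_page_url(pager_soup, current))
```

A scraper could call this helper in a `while` loop, requesting each returned URL until it gets `None`.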



# Exercises

Let's try to extract some data from another sandbox site, [Quotes to Scrape](http://quotes.toscrape.com/).

Q1. Create a `BeautifulSoup` object to scrape the content from this page.


In [49]:
# Q1 Answer

# Send a request to get the Quotes to Scrape home page

web_url = "http://quotes.toscrape.com/"
response = requests.get(web_url)
soup = BeautifulSoup(response.content,'html.parser')

In [50]:
soup.title


<title>Quotes to Scrape</title>

Q2. Locate the tags listed under "Top Ten tags" (view the page to check) and create a list with only these 10 tags.

In [51]:
top_ten = soup.find('div', class_='col-md-4 tags-box')
[s for s in top_ten.stripped_strings]

['Top Ten tags',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']
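
Note that `stripped_strings` also returns the heading text, so the list above has eleven entries. One way to keep only the tag names is to select the `<a class="tag">` elements inside the box. The sketch below uses `SAMPLE_BOX`, a shortened copy of the box markup with two tags, so that it runs without a request.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "Top Ten tags" box with only two tags.
SAMPLE_BOX = """
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
</span>
</div>
"""

box = BeautifulSoup(SAMPLE_BOX, "html.parser").find("div", class_="tags-box")

# Only the anchors carry tag names, so the <h2> heading is skipped.
tag_names = [a.get_text(strip=True) for a in box.find_all("a", class_="tag")]
print(tag_names)
```

On the real page, the same list comprehension applied to `top_ten` should yield just the tag names.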

In [52]:
top_ten

<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
</span>
<span class="tag-i

Let's ask the user which tag they would like to view quotes for.

In [53]:
answer = input("Which types of quotes would you like to view?")

Which types of quotes would you like to view?inspirational


Q3. Based on the user's answer, find the link to the page that contains the quotes for that tag. We can pass the text to search for through the `string` argument of `find_all()` or `find()`.


In [54]:
## Put in the suitable tag and argument
#quote_tag = soup.find(???, string=???)
quote_tag = soup.find('a', string=answer)

In [56]:
quote_tag

<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>

Q4. Send a request to get the quotes from that tag's page (remember to append the link to the correct web_url)

In [55]:
quote_tag['href']

'/tag/inspirational/page/1/'

In [57]:
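from urllib.parse import urljoin

# The href starts with "/", so urljoin avoids a double slash in the URL
quote_response = requests.get(urljoin(web_url, quote_tag['href']))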

In [58]:
quote_soup = BeautifulSoup(quote_response.content,'html.parser')

Q5. Store each of the quotes and their authors into a dataframe.

In [59]:
quotes = quote_soup.find_all('div', class_='quote')

In [61]:
quotes_list = []
quotes_author=[]
for q in quotes:
    quotes_list.append(q.find('span', class_='text').string)
    quotes_author.append(q.find('small', class_='author').string)

In [62]:
quotes_df = pd.DataFrame({'Quote':quotes_list, 'Author':quotes_author})

In [63]:
quotes_df['Quote']

0    “There are only two ways to live your life. On...
1    “Imperfection is beauty, madness is genius and...
2    “I have not failed. I've just found 10,000 way...
3    “This life is what you make it. No matter what...
4    “The opposite of love is not hate, it's indiff...
5    “To the well-organized mind, death is but the ...
6    “It is never too late to be what you might hav...
7    “You can never get a cup of tea large enough o...
8        “Only in the darkness can you see the stars.”
9    “When one door of happiness closes, another op...
Name: Quote, dtype: object

In [68]:
tags = [tag for tag in top_ten.stripped_strings]

In [69]:
tags

['Top Ten tags',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']

In [73]:
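from urllib.parse import urljoin

quotes_list = []
quotes_author = []
# Skip tags[0], which is the "Top Ten tags" heading rather than a tag
for tag in tags[1:]:
    print(tag)
    quote_tag = soup.find('a', string=tag)
    # The href starts with "/", so urljoin avoids a double slash in the URL
    quote_response = requests.get(urljoin(web_url, quote_tag['href']))
    quote_soup = BeautifulSoup(quote_response.content, 'html.parser')
    quotes = quote_soup.find_all('div', class_='quote')
    # Collect the text and author of every quote on this tag's page
    for q in quotes:
        quotes_list.append(q.find('span', class_='text').string)
        quotes_author.append(q.find('small', class_='author').string)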
        
        

love
inspirational
life
humor
books
reading
friendship
friends
truth
simile


In [75]:
quotes_df = pd.DataFrame({'quote':quotes_list, 'author':quotes_author})

In [76]:
quotes_df

| | quote | author |
|---|---|---|
| 0 | “It is better to be hated for what you are tha... | André Gide |
| 1 | “This life is what you make it. No matter what... | Marilyn Monroe |
| 2 | “You may not be her first, her last, or her on... | Bob Marley |
| 3 | “The opposite of love is not hate, it's indiff... | Elie Wiesel |
| 4 | “It is not a lack of love, but a lack of frien... | Friedrich Nietzsche |
| ... | ... | ... |
| 68 | “The truth." Dumbledore sighed. "It is a beaut... | J.K. Rowling |
| 69 | “Never tell the truth to people who are not wo... | Mark Twain |
| 70 | “A day without sunshine is like, you know, nig... | Steve Martin |
| 71 | “Life is like riding a bicycle. To keep your b... | Albert Einstein |


In [79]:
quotes_df['quote'][71]

'“Life is like riding a bicycle. To keep your balance, you must keep moving.”'