# Web Scraping Example


## The Requests Library

Now that we know how to use the Beautiful Soup library, let's walk through a complete web scraping workflow for data analysis.

In this notebook, we will:
1. Use the `Requests` library to get the HTML source for a web page.
2. Scrape the data using `Beautiful Soup`.
3. Save the results into a CSV file with `Pandas`.

First we have to import the required libraries.

### A Note about web scraping

Although you can scrape almost any website that you can read in a browser, a scraper may send many requests in a short time, which increases the load on the web server.

Some companies have terms and conditions that restrict web scraping, so you should always check them before you start.
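
To make the note above concrete, here is a minimal sketch of two common courtesy habits: consulting a site's `robots.txt` rules and pausing between requests. It uses only the standard library. The `robots.txt` content, the `example.com` URLs, and the helper names `build_robot_parser` and `polite_pause` are assumptions made up for illustration, and `robots.txt` is a separate convention that does not replace reading a site's terms and conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumption: a made-up robots.txt, used only so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    robot_parser = RobotFileParser()
    robot_parser.parse(robots_txt.splitlines())
    return robot_parser


def polite_pause(seconds: float = 1.0) -> None:
    """Wait between requests; the one-second default is an arbitrary choice."""
    time.sleep(seconds)


robots = build_robot_parser(SAMPLE_ROBOTS_TXT)
print(robots.can_fetch("*", "http://example.com/catalogue/page-1.html"))
print(robots.can_fetch("*", "http://example.com/private/data.html"))
```

In a real scraper, you could call `polite_pause()` inside the loop that sends requests so that they are spread out over time.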

### Scraping Sandbox

We are going to use a site that is specifically developed to help us learn web scraping, called [toscrape.com](https://toscrape.com/). 



# Using the Requests Library

We can use the `Requests` library to send a `get` request to a web page. If the response is of type HTML, then we can create a `BeautifulSoup` object by parsing the content.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Let's set up the URL, then use the `Requests` library to get the web page, which represents a [bookstore](http://books.toscrape.com/).

In [None]:
# Send a request to get the books.toscrape web page

web_url = "http://books.toscrape.com/"
response = requests.get(????)

# Check the status code and print the content type before parsing
if (response.status_code == 200):
    print(response.headers['Content-Type'])

    print()
else:
    print(response.reason)
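
The request-and-check step can be sketched as two small helpers. This is one possible arrangement, not the notebook's official answer: the names `is_html_response` and `fetch_page` and the 10-second timeout are assumptions chosen for illustration. Only the check is exercised here, so the block runs without network access.

```python
import requests


def is_html_response(status_code: int, content_type: str) -> bool:
    """Return True for a 200 response whose Content-Type starts with text/html."""
    return status_code == 200 and content_type.lower().startswith("text/html")


def fetch_page(url: str, timeout: float = 10.0) -> requests.Response:
    """Send a GET request; the timeout value is an arbitrary assumption."""
    return requests.get(url, timeout=timeout)


# The check itself needs no network access
print(is_html_response(200, "text/html; charset=utf-8"))
print(is_html_response(404, "text/html"))

# Usage with a live request (needs network access):
# response = fetch_page("http://books.toscrape.com/")
# if is_html_response(response.status_code, response.headers.get("Content-Type", "")):
#     print("HTML received")
# else:
#     print(response.status_code, response.reason)
```

Using `startswith` rather than `==` allows for headers that carry a charset suffix, such as `text/html; charset=utf-8`.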

# Using the Beautiful Soup Library

If the content type of the response is "text/html", then we can proceed to create the `BeautifulSoup` object for scraping.

In [None]:
soup = BeautifulSoup(response.content,'html.parser')

In [None]:
print(soup.prettify())

Let's go through the list of categories. Inspecting the HTML through the `prettify()` output or *Inspect Element* in Google Chrome, we can see that the categories sit inside a `<div>` tag with the attribute `class="side_categories"`, so let's look for that.


In [None]:
category_list = soup.find(???, class_=???)
category_list

Great! We have found it. Let's get all the `<a>`, or *anchor* tags which will lead us to each category.

In [None]:
link_tags = category_list.find_all(???)
links = [tag.attrs for tag in link_tags]

In [None]:
links
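
The two lookups above can be tried on a small inline snippet before running them on the live page. The HTML below is a hand-written stand-in that assumes the sidebar is a `<div class="side_categories">` wrapping nested lists of `<a>` tags. The `sample_` names are hypothetical and are used so the notebook's own variables are not overwritten.

```python
from bs4 import BeautifulSoup

# Assumption: a simplified imitation of the bookstore's category sidebar.
SAMPLE_HTML = """
<div class="side_categories">
  <ul class="nav nav-list">
    <li><a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html">Travel</a></li>
        <li><a href="catalogue/category/books/mystery_3/index.html">Mystery</a></li>
      </ul>
    </li>
  </ul>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the sidebar container, then every anchor tag inside it
sample_category_list = sample_soup.find("div", class_="side_categories")
sample_link_tags = sample_category_list.find_all("a")

# Each tag's attributes come back as a dictionary, e.g. {'href': '...'}
sample_links = [tag.attrs for tag in sample_link_tags]
sample_names = [tag.get_text(strip=True) for tag in sample_link_tags]

print(sample_names)
print(sample_links[1]["href"])
```

Because `find_all()` returns tags in document order, the outer "Books" link comes first and the individual categories follow.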

In [None]:
# links[1] is the link to the Travel category
URL = web_url+links[1]['href']
print(URL)
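
String concatenation works here because the base URL ends with a slash and the link is relative to the site root. As a hedged alternative, `urllib.parse.urljoin` from the standard library also resolves links that contain `../`. The `href` values below are sample values assumed for illustration.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"
# Assumption: sample relative links of the kind found in the page's anchor tags
category_href = "catalogue/category/books/travel_2/index.html"
book_href = "../../../its-only-the-himalayas_981/index.html"

# Resolve the category link against the home page
category_url = urljoin(base_url, category_href)
print(category_url)

# A link found on the category page is resolved against that page's own URL
book_url = urljoin(category_url, book_href)
print(book_url)
```

Resolving each link against the URL of the page it was found on avoids having to track by hand how many directory levels a `../` prefix climbs.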

In [None]:
link_request = requests.get(URL)


In [None]:
# Check the status code and print the content type before parsing
if (link_request.status_code == 200):
    print(link_request.headers['Content-Type'])

    print()
else:
    print(link_request.reason)

In [None]:
category_soup = BeautifulSoup(link_request.content,'html.parser')

In [None]:
category_soup.title

In [None]:
# We can get the name of the category we are in from the <h1> tag.
# Inspect the element to check!
current_category = category_soup.h1.string

Now that we are in the Travel books category, let's get the title, rating, price and status from each book.

We can inspect the element to find the relevant tag.

<img src="images/books_inspect.png" alt="drawing" width="200"/>


In [None]:
# Use the first token in the class attribute to find the tags that contain the
# information about books
books = category_soup.find_all(???, class_=????)

In [None]:
books

In [None]:
# Let's just get one book to inspect the tags closely
one_book = books[0]
one_book.contents

In [None]:
# Find the title


In [None]:
# Find the price


In [None]:
# Find the star rating
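
Here is one possible way to pull the fields out of a single book, shown on a hand-written snippet. It assumes each book is wrapped in an `<article class="product_pod">` tag, which may differ from the tag and class the cell above asks you to find. The markup, the price, and the `sample_` names are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a simplified imitation of one book entry on a category page.
SAMPLE_BOOK_HTML = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="../../../its-only-the-himalayas_981/index.html"
         title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
  <div class="product_price">
    <p class="price_color">£45.17</p>
    <p class="instock availability">In stock</p>
  </div>
</article>
"""

sample_book = BeautifulSoup(SAMPLE_BOOK_HTML, "html.parser").find(
    "article", class_="product_pod"
)

# The title attribute holds the full title even if the link text is shortened
sample_title = sample_book.h3.a["title"]

# The price is stored as text, including the currency symbol
sample_price = sample_book.find("p", class_="price_color").get_text(strip=True)

# The class attribute is a list such as ['star-rating', 'Two'];
# the second token is the rating expressed as a word
sample_rating_word = sample_book.find("p", class_="star-rating")["class"][1]

# class_ matches any single token, so "availability" finds "instock availability"
sample_status = sample_book.find("p", class_="availability").get_text(strip=True)

print(sample_title, sample_price, sample_rating_word, sample_status)
```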


Now that we can get the data from one book, let's loop through all the books and store the title, price and star rating in respective lists.

In [None]:
categories = []
titles = []
prices = []
ratings = []
ratings_dict = {'One':1, 'Two':2, 'Three':3, 'Four':4, 'Five':5}
for b in books:
    ## Get the category
    ## Get the title
    ## Get the price
    ## Get the rating
    ## Store the rating
    pass  # placeholder so the cell runs; replace with your own code

In [None]:
# Create a dataframe from the lists
books_df = pd.DataFrame({'Category':categories,
                        'Title': titles,
                        'Price':prices,
                        'Rating':ratings})

In [None]:
books_df
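
The loop and the dataframe step can also be written as a list of dictionaries, one per book. This is an alternative sketch rather than the notebook's intended solution. The sample HTML, the prices, the `product_pod` selector, and the names `books_to_rows` and `RATING_WORDS` are assumptions for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Assumption: a simplified imitation of a category page with two books.
SAMPLE_LISTING_HTML = """
<h1>Travel</h1>
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a title="It's Only the Himalayas">It's Only the Himalayas</a></h3>
  <p class="price_color">£45.17</p>
</article>
<article class="product_pod">
  <p class="star-rating Four"></p>
  <h3><a title="Full Moon over Noah's Ark">Full Moon over ...</a></h3>
  <p class="price_color">£49.43</p>
</article>
"""


def books_to_rows(listing_soup):
    """Return one dictionary per book found on a category page."""
    category = listing_soup.h1.get_text(strip=True)
    rows = []
    for book in listing_soup.find_all("article", class_="product_pod"):
        rating_word = book.find("p", class_="star-rating")["class"][1]
        rows.append({
            "Category": category,
            "Title": book.h3.a["title"],
            "Price": book.find("p", class_="price_color").get_text(strip=True),
            "Rating": RATING_WORDS[rating_word],  # convert the word to a number
        })
    return rows


sample_rows = books_to_rows(BeautifulSoup(SAMPLE_LISTING_HTML, "html.parser"))
sample_df = pd.DataFrame(sample_rows, columns=["Category", "Title", "Price", "Rating"])
print(sample_df)
```

Collecting rows as dictionaries keeps each book's fields together, so the columns cannot drift out of step the way four separate lists can if one `append` is missed.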

**Saving to File**

As you can see, pandas gives us a tidy data frame. It's not perfect, though, so we will still have to do some data cleaning with additional pandas and Python functions.

Now we can save this dataframe into a CSV file using the `to_csv()` method.


In [None]:
# Save dataframe to file
books_df.to_csv('scraped_books.csv')
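
As an example of the cleaning mentioned above, the sketch below converts a price column stored as text into numbers and writes the CSV without the index column. The sample rows are made up, and the output goes to an in-memory buffer so that nothing is written to disk. Passing a filename to `to_csv()` works the same way.

```python
import io

import pandas as pd

# Assumption: made-up rows shaped like the scraped data
sample_books_df = pd.DataFrame({
    "Category": ["Travel", "Travel"],
    "Title": ["Book A", "Book B"],
    "Price": ["£45.17", "£49.43"],
    "Rating": [2, 4],
})

# Remove everything except digits and the decimal point, then convert to float
sample_books_df["Price"] = (
    sample_books_df["Price"].str.replace(r"[^\d.]", "", regex=True).astype(float)
)

# index=False leaves out the row numbers that to_csv() writes by default
buffer = io.StringIO()
sample_books_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```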

## Summary

As you can see, web scraping takes some exploration: inspecting and filtering the various tags and attributes. To obtain all the books in every category, we would have to loop through each of the category links, send a request for each one, and add the books from the returned responses, including navigating to any additional pages.
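
Navigating to additional pages could look something like the sketch below. It assumes that a page with more results contains an `<li class="next">` element wrapping a link to the following page. The function name `iter_category_pages`, the `fetch_html` parameter, and the `example.com` pages are hypothetical. The fetch function is passed in so the loop can be tried offline with a dictionary instead of live requests.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_category_pages(start_url, fetch_html):
    """Yield (url, soup) for a category page and each page reached via 'next'."""
    url = start_url
    while url:
        page_soup = BeautifulSoup(fetch_html(url), "html.parser")
        yield url, page_soup
        # Assumption: the link to the following page sits inside <li class="next">
        next_item = page_soup.find("li", class_="next")
        if next_item is not None and next_item.a is not None:
            url = urljoin(url, next_item.a["href"])
        else:
            url = None  # no further pages, so the loop ends


# Offline stand-in for the live site: maps each URL to its HTML
FAKE_SITE = {
    "http://example.com/cat/index.html":
        '<h1>Page 1</h1><ul class="pager">'
        '<li class="next"><a href="page-2.html">next</a></li></ul>',
    "http://example.com/cat/page-2.html": "<h1>Page 2</h1>",
}

visited = [
    url
    for url, _ in iter_category_pages(
        "http://example.com/cat/index.html", FAKE_SITE.__getitem__
    )
]
print(visited)
```

With live requests, `fetch_html` could be something like `lambda url: requests.get(url).content`, ideally with a pause between calls.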



# Exercises

Let's try to extract some data from another practice site, [Quotes to Scrape](http://quotes.toscrape.com).

Q1. Create a `BeautifulSoup` object to scrape the content from this page.


In [None]:
# Q1 Answer

# Send a request to get the HTML page http://quotes.toscrape.com


Q2. Locate the tags listed under "Top Ten tags" (view the page to check) and create a list containing only these 10 tags.

In [None]:
# Q2 Answer

Let's ask the user which tag they would like to view quotes from.

In [None]:
answer = input("Which types of quotes would you like to view? ")

Q3. Based on the user's answer, find the link to the page that contains the quotes for that tag. We can pass the text to search for through the `string` argument of `find_all()` or `find()`.


In [None]:
## Put in the suitable tag and argument
quote_tag = soup.find(???, string=???)
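
To see how the `string` argument behaves, here is a sketch on a hand-written snippet. It assumes the top tags are `<a class="tag">` links inside a container with the class `tags-box`. The markup and the `sample_` names are assumptions for illustration, so check them against the real page.

```python
from bs4 import BeautifulSoup

# Assumption: a simplified imitation of the "Top Ten tags" box.
SAMPLE_TAGS_HTML = """
<div class="tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/inspirational/">inspirational</a></span>
</div>
"""

sample_quotes_soup = BeautifulSoup(SAMPLE_TAGS_HTML, "html.parser")

# Restrict the search to the tag box so tags shown under quotes are not included
sample_tag_box = sample_quotes_soup.find("div", class_="tags-box")
sample_top_tags = [
    a.get_text(strip=True) for a in sample_tag_box.find_all("a", class_="tag")
]

sample_answer = "love"  # stands in for the value returned by input()

# string= matches a tag whose text is exactly the given value
sample_quote_tag = sample_tag_box.find("a", string=sample_answer)
sample_href = sample_quote_tag["href"] if sample_quote_tag else None

print(sample_top_tags)
print(sample_href)
```

`find()` returns `None` when nothing matches, which is why the sketch checks the result before reading `href`. A user could easily type a tag that is not in the list.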


Q4. Send a request to get the quotes from that tag's page (remember to append the link to the correct `web_url`).

Q5. Store each of the quotes and their authors into a dataframe.

Q6. Can you loop through each of the top ten tags, store all those quotes in a dataframe, then save it to a CSV file?
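
As a starting point for Q5, here is a sketch that turns quote blocks into a dataframe. It assumes each quote sits in a `<div class="quote">` containing a `<span class="text">` and a `<small class="author">`. The sample quotes, the author names, and the function name `quotes_to_dataframe` are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Assumption: a simplified imitation of two quote blocks.
SAMPLE_QUOTES_HTML = """
<div class="quote">
  <span class="text">"First sample quote."</span>
  <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
  <span class="text">"Second sample quote."</span>
  <span>by <small class="author">Author Two</small></span>
</div>
"""


def quotes_to_dataframe(page_soup):
    """Collect the text and author of every quote block on a page."""
    rows = [
        {
            "Quote": quote.find("span", class_="text").get_text(strip=True),
            "Author": quote.find("small", class_="author").get_text(strip=True),
        }
        for quote in page_soup.find_all("div", class_="quote")
    ]
    return pd.DataFrame(rows, columns=["Quote", "Author"])


sample_quotes_df = quotes_to_dataframe(
    BeautifulSoup(SAMPLE_QUOTES_HTML, "html.parser")
)
print(sample_quotes_df)
```

For Q6, one option is to call a function like this once per tag page and combine the results with `pd.concat()` before saving.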