# Scraping the 'Books to Scrape' website
### http://books.toscrape.com/

Using the **requests** and **BeautifulSoup** modules to extract links from a webpage

The **Books to Scrape** website is a test site which provides example content.

First let’s use the **requests** module to get the HTML of the website’s main page.

Place your cursor in the following block and select 'Run' from the menu above.

In [None]:
import requests

url = "http://books.toscrape.com/index.html"

result = requests.get(url)

result.text[:1000]  # this shows the first 1000 characters of the result

This looks pretty messy. We can tidy it up with BeautifulSoup's prettify() method:

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(result.text, 'html.parser')

print(soup.prettify()[:1000])

## Viewing the source code

In your browser, go to http://books.toscrape.com/, right-click on the name of a product and click 'Inspect'. This shows the part of the page's HTML that corresponds to that element.

Note the structure of the HTML code:

![html structure](structure.png "html structure")


The information about this product is contained in an ‘article’ tag with the class value ‘product_pod’. This seems to be a reliable location to extract product URLs.

BeautifulSoup enables us to find those ‘article’ tags. We can call the find() function to find the first occurrence of this tag in the HTML:


In [None]:
soup.find("article", class_ = "product_pod")


We still have too much information.
Let’s go deeper into the tree by adding the child tags ‘div’ and ‘a’:


In [None]:
soup.find("article", class_ = "product_pod").div.a

Much better! But we only need the URL contained in the ‘href’ value.
We can get this by adding .get('href') to the previous instruction:

In [None]:
soup.find("article", class_ = "product_pod").div.a.get('href')
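To see the three steps above in isolation, here is a minimal sketch that runs them on a small hand-written HTML snippet instead of the live page. The snippet and the names `sample_html`, `sample_soup`, `article`, `link`, `href` and `missing` are assumptions made for illustration; the markup is a simplified stand-in, not a copy of the real site. It also shows that find() returns None when nothing matches, which is worth checking before chaining .div.a on the result.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified stand-in for one product entry
# (an assumption for illustration, not copied from the live page).
sample_html = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/sample-book_1/index.html">
      <img src="media/sample.jpg" alt="Sample Book">
    </a>
  </div>
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book">Sample Book</a></h3>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
article = sample_soup.find("article", class_="product_pod")
missing = sample_soup.find("article", class_="no_such_class")

# .div is the first <div> inside the article, .a the first <a> inside that div.
link = article.div.a

# .get('href') reads the value of the href attribute.
href = link.get("href")

print(link.name)
print(href)
print(missing)
```

Because `missing` is None, an expression such as `missing.div` would raise an AttributeError, so a quick `if article is not None:` check can make a scraper more robust when a page's layout differs from what you expect.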

Now let’s get **all** the product URLs on the main web page at once using the find_all() function (findAll() is an older name for the same function):

In [None]:
for x in soup.find_all("article", class_ = "product_pod"):
    print(x.div.a.get('href'))
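The href values printed above appear to be relative to the page they came from, so they cannot be passed to requests directly. One possible way to turn them into full URLs is urljoin() from Python's standard library. The sketch below uses a hypothetical list, `relative_links`, with made-up paths standing in for the scraped values; `base_url` is the page address used earlier.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
base_url = "http://books.toscrape.com/index.html"

# Hypothetical sample of relative hrefs (made-up paths for illustration).
relative_links = [
    "catalogue/sample-book_1/index.html",
    "catalogue/another-book_2/index.html",
]

# urljoin() resolves each relative path against the base page URL.
absolute_links = [urljoin(base_url, href) for href in relative_links]

for link in absolute_links:
    print(link)
```

In the notebook, the same call could be applied inside the loop, for example `urljoin(url, x.div.a.get('href'))`, assuming `url` still holds the address of the main page.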