# Web Scraping

Let's parse the front page of [Books To Scrape](http://books.toscrape.com/index.html).<br>
The server receives an URL as a request and sends an HTML file as a response.
The HTML contents are formatted by BeautifulSoup

In [2]:
import requests
from bs4 import BeautifulSoup

def get_parsed(url):
    response = requests.get(url)
    html = response.content
    return BeautifulSoup(html, 'html.parser')

parsed = get_parsed("http://books.toscrape.com/index.html")


Contents of the ***HTML*** file: 

In [12]:
print(parsed.body.article)

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>


Each **book** is an article with title and price elements

In [27]:
def get_elements(articles): 
    for book in in_books:
        title = book.h3.a["title"]
        str_price = book.find(class_="price_color")
        f_price = float(str_price.text.lstrip("£"))
        print("[" + title + "," + str(f_price) + "]")

articles = parsed.find_all("article") 
get_elements(articles)

[On the Road (Duluoz Legend),32.36]
[Old Records Never Die: One Man's Quest for His Vinyl and His Past,55.66]
[Off Sides (Off #1),39.45]
[Of Mice and Men,47.11]
[Myriad (Prentor #1),58.75]
[My Perfect Mistake (Over the Top #1),38.92]
[Ms. Marvel, Vol. 1: No Normal (Ms. Marvel (2014-2015) #1),39.39]
[Meditations,25.89]
[Matilda,28.34]
[Lost Among the Living,27.7]
[Lord of the Flies,24.89]
[Listen to Me (Fusion #1),58.99]
[Kitchens of the Great Midwest,57.2]
[Jane Eyre,38.43]
[Imperfect Harmony,34.74]
[Icing (Aces Hockey #2),40.44]
[Hawkeye, Vol. 1: My Life as a Weapon (Hawkeye #1),45.24]
[Having the Barbarian's Baby (Ice Planet Barbarians #7.5),34.96]
[Giant Days, Vol. 1 (Giant Days #1-4),56.76]
[Fruits Basket, Vol. 1 (Fruits Basket #1),40.28]


## Multi-pages Scraping

Let's parse the first ***2*** pages of the catalog

In [28]:
out_books = []
for page_no in range(3):
    url = "http://books.toscrape.com/catalogue/page-" + str(page_no) + ".html"
    parsed = get_parsed(url)
    articles = parsed.find_all("article") 
    get_elements(articles)

[On the Road (Duluoz Legend),32.36]
[Old Records Never Die: One Man's Quest for His Vinyl and His Past,55.66]
[Off Sides (Off #1),39.45]
[Of Mice and Men,47.11]
[Myriad (Prentor #1),58.75]
[My Perfect Mistake (Over the Top #1),38.92]
[Ms. Marvel, Vol. 1: No Normal (Ms. Marvel (2014-2015) #1),39.39]
[Meditations,25.89]
[Matilda,28.34]
[Lost Among the Living,27.7]
[Lord of the Flies,24.89]
[Listen to Me (Fusion #1),58.99]
[Kitchens of the Great Midwest,57.2]
[Jane Eyre,38.43]
[Imperfect Harmony,34.74]
[Icing (Aces Hockey #2),40.44]
[Hawkeye, Vol. 1: My Life as a Weapon (Hawkeye #1),45.24]
[Having the Barbarian's Baby (Ice Planet Barbarians #7.5),34.96]
[Giant Days, Vol. 1 (Giant Days #1-4),56.76]
[Fruits Basket, Vol. 1 (Fruits Basket #1),40.28]
[On the Road (Duluoz Legend),32.36]
[Old Records Never Die: One Man's Quest for His Vinyl and His Past,55.66]
[Off Sides (Off #1),39.45]
[Of Mice and Men,47.11]
[Myriad (Prentor #1),58.75]
[My Perfect Mistake (Over the Top #1),38.92]
[Ms. Marvel, 