# Python Web Scraping: Notebook from the Le Wagon workshop

This Notebook showcases the exercise from the July 21st Le Wagon Workshop I attended. 

We will __scrape data__ from the front page of a __fictional online bookstore__: the [Books To Scrape](http://books.toscrape.com/index.html) website.

We'll use Python and the __Beautiful Soup__ library. 

We will take note of the __syntax differences between HTML and Python__ while we are __exploring and organizing__ the data.


In [11]:
# Let's start by importing the necessary libraries and parsing the page

import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/index.html"
response = requests.get(url)
html = response.content
scraped = BeautifulSoup(html, 'html.parser')

In [12]:
# Let's check what we got :
# scraped

We have a page title, some books with titles, prices, category...

Let's explore. 

### Challenge 1: The title of the page

The page has an __h1 heading__, but it only holds the short heading text. The full page title lives in the __title tag__, and its text is padded with whitespace.

The following code shows how we got to what we wanted, from the h1 heading to the properly formatted page title.

In [13]:
# From the h1 heading to the clean page title

print('The h1 heading:\n',scraped.h1,'\n')
print('The title tag as text only:\n',scraped.title.text,'\n')
print('The title tag as text, stripped from blank space:\n',scraped.title.text.strip())

The h1 heading:
 <h1>All products</h1> 

The title tag as text only:
 
    All products | Books to Scrape - Sandbox
 

The title tag as text, stripped from blank space:
 All products | Books to Scrape - Sandbox
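
As a side note, Beautiful Soup can do the whitespace clean-up itself. The sketch below compares `.text`, `.text.strip()` and `get_text(strip=True)`. It runs on a small inline HTML string (`sample_html`, a made-up stand-in for the real page) so that it needs no network request.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, standing in for the real front page
sample_html = """
<html>
  <head>
    <title>
    All products | Books to Scrape - Sandbox
</title>
  </head>
  <body><h1>All products</h1></body>
</html>
"""

sample = BeautifulSoup(sample_html, "html.parser")

raw_title = sample.title.text                   # keeps the surrounding newlines and spaces
clean_title = sample.title.text.strip()         # Python's str.strip() removes them
same_title = sample.title.get_text(strip=True)  # Beautiful Soup strips the text for us

print(repr(raw_title))
print(clean_title)
print(clean_title == same_title)
```

For a tag that contains a single piece of text, as here, the last two approaches should give the same string.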


### Challenge 2: The *full* title of the first book on a page

The title of each book appears in an __h3 heading__: as with the page title, the tag's content is not exactly what we want, because the visible text is truncated.

The following code shows how we got to what we wanted, from the full h3 tag to the complete book title.

In [14]:
# From the full h3 tag to the complete book title

print('This is the full tag:\n',scraped.h3)
print()
print('This is what we get when using "text", like we did with the title:\n',scraped.h3.text)
print()
print('We don\'t get the full title. The full title is actually the value of the title attribute, rather than the content.' )
print()
print('We now print the value of the title attribute:\n',scraped.h3.a['title'])


This is the full tag:
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>

This is what we get when using "text", like we did with the title:
 A Light in the ...

We don't get the full title. The full title is actually the value of the title attribute, rather than the content.

We now print the value of the title attribute:
 A Light in the Attic
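
To make the difference between a tag's content and its attributes concrete, here is a minimal sketch. It works on the h3 tag printed above, pasted into a string, and `data-id` is a made-up attribute name used only to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

# The h3 tag from the output above, pasted in as a string
sample_html = (
    '<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
    'title="A Light in the Attic">A Light in the ...</a></h3>'
)
link = BeautifulSoup(sample_html, "html.parser").h3.a

print(link.text)            # the visible (truncated) content of the tag
print(link["title"])        # attribute lookup; raises KeyError if the attribute is absent
print(link.get("title"))    # same value, but returns None if the attribute is absent
print(link.get("data-id"))  # hypothetical attribute that this tag does not have
print(link.attrs)           # all attributes, as a Python dictionary
```

In HTML, attributes are written as `name="value"` inside the opening tag. In Python, Beautiful Soup exposes them with dictionary-style access on the tag object.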


### Challenge 3: *All* the full titles from the page

We will now use some Beautiful Soup methods that return a _collection_ of elements.

We will start with the __find_all__ method. 

We'll then loop over them to see all the titles.

In [15]:
# First we use the find_all method to get all the <a> tags that have a title attribute:
books = scraped.find_all("a", title=True)

# Then we loop through the list 
for book in books:
    print(book["title"])

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
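
The `title=True` filter matters because each book typically has more than one link. The sketch below shows this on a made-up two-book snippet (`sample_html` is an assumption that only loosely imitates the real markup) and compares `find_all` with an equivalent CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical two-book snippet: each book has an image link and a title link
sample_html = """
<ol>
  <li><article class="product_pod">
    <a href="a.html"><img alt="cover"/></a>
    <h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3>
  </article></li>
  <li><article class="product_pod">
    <a href="b.html"><img alt="cover"/></a>
    <h3><a href="b.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  </article></li>
</ol>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# Without the filter we get every link, including the image links
print(len(sample.find_all("a")))

# title=True keeps only the <a> tags that carry a title attribute
titles_find_all = [a["title"] for a in sample.find_all("a", title=True)]

# The same idea written as a CSS selector: <a> with a title, directly inside an <h3>
titles_select = [a["title"] for a in sample.select("h3 > a[title]")]

print(titles_find_all)
print(titles_select)
```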


### Challenge 4: All the *prices* from the page

For this collection, we will use another Beautiful Soup method: `select`, which takes a **CSS selector** (here, a class selector).

In [16]:
# On this site, each price sits in a <p> element with the class "price_color".
prices = scraped.select(".price_color")

for price in prices:
    price = float(price.text.lstrip("£"))  # some formatting: drop the £ symbol, convert the text to a float
    print(price)


51.77
53.74
50.1
47.82
54.23
22.65
33.34
17.93
22.6
52.15
13.99
20.66
17.46
52.29
35.02
57.25
23.88
37.59
51.33
45.17
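
`lstrip("£")` works as long as the text starts with exactly that symbol. If stray characters ever come before the number (for example a mis-decoded currency symbol), `float()` would raise a `ValueError`. As a sketch of a more tolerant approach, the hypothetical helper `parse_price` below (not part of the workshop) extracts the number with a regular expression and returns a `Decimal`, which also keeps trailing zeros such as the one in `50.10`.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first decimal number found in `text` as a Decimal."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return Decimal(match.group())


print(parse_price("£51.77"))
print(parse_price("Â£50.10"))         # stray character in front of the symbol
print(float(parse_price("£50.10")))   # convert to float when needed
```

Whether `Decimal` or `float` is the better choice depends on what happens next: `Decimal` is the usual choice for money arithmetic, while `float` is enough for a quick look at the data.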


### Challenge 5: Corresponding price for each title

We will now combine some of the data, and get the price (as a float) for each of the titles.

In [17]:
# Create an empty list
title_prices = []

# Get the info on all the articles, stored in class "product_pod" on the website:
articles = scraped.select(".product_pod")

# Loop over the articles and add one {title: price} dictionary per book to the list
for article in articles:
    title = article.h3.a["title"]  # get the title of the article
    price = article.find("p", class_="price_color")  # get the price tag of the article
    price_float = float(price.text.lstrip("£"))  # format the price obtained
    title_prices.append({title: price_float})  # append the dictionary to the list

# Print the result
print(title_prices)

[{'A Light in the Attic': 51.77}, {'Tipping the Velvet': 53.74}, {'Soumission': 50.1}, {'Sharp Objects': 47.82}, {'Sapiens: A Brief History of Humankind': 54.23}, {'The Requiem Red': 22.65}, {'The Dirty Little Secrets of Getting Your Dream Job': 33.34}, {'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull': 17.93}, {'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics': 22.6}, {'The Black Maria': 52.15}, {'Starving Hearts (Triangular Trade Trilogy, #1)': 13.99}, {"Shakespeare's Sonnets": 20.66}, {'Set Me Free': 17.46}, {"Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)": 52.29}, {'Rip it Up and Start Again': 35.02}, {'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991': 57.25}, {'Olio': 23.88}, {'Mesaerion: The Best Science Fiction Stories 1800-1849': 37.59}, {'Libertarianism for Beginners': 51.33}, {"It's Only the Himalayas": 45.17}]
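
The result is a list of one-entry dictionaries. If the goal is to look up a price by title, a single dictionary may be more convenient. The sketch below merges the first three entries from the output above (a shortened sample, to keep the example small). The names `price_by_title` and `cheapest` are introduced here for illustration.

```python
# Shortened sample: the first three entries from the output above
title_prices = [
    {"A Light in the Attic": 51.77},
    {"Tipping the Velvet": 53.74},
    {"Soumission": 50.1},
]

# Merge the one-entry dictionaries into a single title -> price mapping
price_by_title = {
    title: price
    for entry in title_prices
    for title, price in entry.items()
}

# Direct lookup by title
print(price_by_title["Soumission"])

# Title with the lowest price in the sample
cheapest = min(price_by_title, key=price_by_title.get)
print(cheapest)
```

One caveat: dictionary keys are unique, so if two books shared the same title, the later one would overwrite the earlier one. The list of dictionaries does not have that limitation.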


## Done!

This concludes the exercise from the workshop: we successfully __extracted data from a website__, and organised this data to suit our needs. 