# Web Scraping - Lab

## Introduction

Now that you've seen a more extensive example of developing a web scraping script, it's time to practice and formalize that knowledge. You'll write functions that parse specific pieces of information from a web page, then combine them in a loop that iterates over successive pages to build a complete dataset.

## Objectives

You will be able to:

* Write functions to parse specific information from a web page
* Iterate over successive web pages in order to create a dataset

## Lab Overview

This lab builds on the previous lesson. By the end, you'll write a script that iterates over all of the pages of the demo site and extracts the title, price, star rating, and availability of each book listed. Building up to that, you'll formalize the concepts from the lesson by writing functions that each extract a list of one of these features from a web page. You'll then combine these functions into the full script, which will look something like this:

```python
df = pd.DataFrame()
for i in range(1, 51):  # pages 1 through 50
    url = "http://books.toscrape.com/catalogue/page-{}.html".format(i)
    html_page = requests.get(url)  # download the page before parsing it
    soup = BeautifulSoup(html_page.content, 'html.parser')
    new_titles = retrieve_titles(soup)
    new_star_ratings = retrieve_ratings(soup)
    new_prices = retrieve_prices(soup)
    new_avails = retrieve_avails(soup)
    ...
```
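
One possible way to handle the combining step is sketched below. It zips four parallel lists into one record per book and fails loudly if the lists have different lengths, which can happen when a selector skips some books. The function name `combine_page_lists` and the sample values are hypothetical; the column names are borrowed from the DataFrame built later in this lab.

```python
def combine_page_lists(titles, ratings, prices, avails):
    """Zip four parallel lists into one record (dict) per book."""
    columns = {
        "title": titles,
        "rating_stars": ratings,
        "price_gbp": prices,
        "availability": avails,
    }
    # Every list must describe the same books, so the lengths must agree.
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    # zip(*...) walks the lists in step, producing one tuple per book.
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


# Made-up sample values, not scraped data.
records = combine_page_lists(["Book A", "Book B"], [3, 5], [10.0, 12.5], [1, 0])
print(records[0])
```

A list of records like this can be passed straight to `pd.DataFrame(records)` if you prefer to work with a DataFrame.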

## Connect to website

In [15]:
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup

In [12]:
url = "http://books.toscrape.com/catalogue/page-1.html"
site_download = requests.get(url)
site_download

<Response [200]>

In [271]:
soup = BeautifulSoup(site_download.content, 'html.parser')
# select() takes a CSS selector, so the class belongs in the selector string
pods = soup.select('article.product_pod')
# pods[0]

## Playing with the data

In [141]:
# print(soup.select('.product_pod .price_color')[0].prettify())
soup.select('.product_pod .price_color')[0].text

'£51.77'
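
To use the price as a number, the currency symbol has to come off first. Slicing away the first character works for `'£51.77'`, but it assumes the prefix is exactly one character long. The sketch below pulls out the digits with a regular expression instead. The function name `parse_price` is hypothetical, and the second sample string is an assumed example of a prefix that a wrongly decoded page might produce.

```python
import re


def parse_price(text):
    """Return the first decimal number found in a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError("no numeric price in {!r}".format(text))
    return float(match.group())


print(parse_price("£51.77"))   # 51.77
print(parse_price("Â£51.77"))  # 51.77, even with a two-character prefix
```

This sketch does not handle thousands separators such as `1,234.50`; the prices shown in this lab do not need them.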

In [269]:
# print(soup.select('.product_pod')[0].prettify())

In [244]:
# soup.select('.product_pod p')[0]['class'][1]
soup.select('.product_pod .instock.availability')[0].text

'\n\n    \n        In stock\n    \n'
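
The raw availability text is padded with newlines and spaces, so it helps to normalize it before testing it. Here is a minimal sketch, assuming the label contains the phrase "In stock" only when the book is available; the function name `is_in_stock` is hypothetical. It also shows why a test such as `text.find('In stock') > 0` is fragile: `find` returns 0 when the match sits at the very start of the string.

```python
def is_in_stock(raw_text):
    """Return 1 if the cleaned availability text says 'In stock', else 0."""
    cleaned = " ".join(raw_text.split())  # collapse all runs of whitespace
    return 1 if "in stock" in cleaned.lower() else 0


raw = '\n\n    \n        In stock\n    \n'
print(" ".join(raw.split()))        # In stock
print(is_in_stock(raw))             # 1
print("In stock".find("In stock"))  # 0, so a "> 0" test would report a miss
```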

In [331]:
url = "http://books.toscrape.com/catalogue/page-51.html"
site_download = requests.get(url)
site_download.status_code
# soup = BeautifulSoup(site_download.content)


404

## Retrieving Page Data Function

In [282]:
def retrieve_all_data(soup):
    """Return a DataFrame with one row per book listed on the parsed page."""
    str_to_int = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    title_list = [link['title'] for link in soup.select('.product_pod h3 a')]
    # Drop the leading currency symbol before converting, e.g. '£51.77' -> 51.77
    price_list = [float(tag.text[1:]) for tag in soup.select('.product_pod .price_color')]
    # The second class on each rating tag is the rating word, e.g. 'Three'
    star_list = [str_to_int[tag['class'][1]] for tag in soup.select('.product_pod .star-rating')]
    # Use `in` rather than find() > 0, which misses a match at index 0
    avail_list = [1 if 'In stock' in tag.text else 0
                  for tag in soup.select('.product_pod .instock.availability')]
    df_of_page = pd.DataFrame({'title': title_list, 'price_gbp': price_list,
                               'rating_stars': star_list, 'availability': avail_list})
    return df_of_page
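
The rating lookup above reads the rating word from a fixed position in the tag's class list. A slightly more defensive variant, sketched below, scans the whole class list for a known rating word so that the order of the classes does not matter. The names `RATING_WORDS` and `rating_from_classes` are hypothetical, and the sample class lists are assumed shapes rather than values copied from the site.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_classes(classes):
    """Return the numeric rating named in a tag's class list, or None."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None  # no rating word present


print(rating_from_classes(["star-rating", "Three"]))  # 3
print(rating_from_classes(["Three", "star-rating"]))  # 3
print(rating_from_classes(["star-rating"]))           # None
```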

## Retrieving Page Data (all pages)

In [350]:
list_of_pages = []
# Pages 49-54 exercise the stop condition; use range(1, 51) for the full catalogue
for x in range(49, 55):
    url = "http://books.toscrape.com/catalogue/page-{}.html".format(x)
    site_download = requests.get(url)
    # Stop at the first page that does not load (page 51 returns 404)
    if site_download.status_code != 200:
        break
    print("downloading page {}".format(url))
    soup = BeautifulSoup(site_download.content, 'html.parser')
    list_of_pages.append(retrieve_all_data(soup))
# ignore_index=True renumbers the rows 0..n-1 across all pages
all_pages = pd.concat(list_of_pages, ignore_index=True)

downloading page http://books.toscrape.com/catalogue/page-49.html
downloading page http://books.toscrape.com/catalogue/page-50.html


In [351]:
all_pages

| | title | price_gbp | rating_stars | availability |
|---|---|---|---|---|
| 0 | On the Road (Duluoz Legend) | 32.36 | 3 | 1 |
| 1 | Old Records Never Die: One Man's Quest for His... | 55.66 | 2 | 1 |
| 2 | Off Sides (Off #1) | 39.45 | 5 | 1 |
| 3 | Of Mice and Men | 47.11 | 2 | 1 |
| 4 | Myriad (Prentor #1) | 58.75 | 4 | 1 |
| 5 | My Perfect Mistake (Over the Top #1) | 38.92 | 2 | 1 |
| 6 | Ms. Marvel, Vol. 1: No Normal (Ms. Marvel (201... | 39.39 | 4 | 1 |
| 7 | Meditations | 25.89 | 2 | 1 |
| 8 | Matilda | 28.34 | 1 | 1 |
| 9 | Lost Among the Living | 27.7 | 4 | 1 |


## Level-Up: Write a new version of the script you just wrote. 

If you used URL hacking to generate each successive page URL, instead write a function that retrieves the link from the `"next"` button at the bottom of the page. Conversely, if you already used that approach above, use URL hacking (arguably the easier of the two methods in this case).
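
One detail to watch for with the `"next"` button approach: the link's `href` is likely to be relative to the current page rather than a full URL. The sketch below covers only that step, using the standard library's `urljoin`. It assumes you have already pulled the `href` string out of the page yourself, and that it looks like `"page-2.html"`; the function name `next_page_url` is hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve the 'next' link's href against the current page URL.

    Returns None when there is no next link, for example on the last page.
    """
    if not next_href:
        return None
    return urljoin(current_url, next_href)


current = "http://books.toscrape.com/catalogue/page-1.html"
print(next_page_url(current, "page-2.html"))
# http://books.toscrape.com/catalogue/page-2.html
print(next_page_url(current, None))
# None
```

Returning `None` when the link is missing gives your loop a natural stop condition: keep following links until the function returns `None`.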

In [None]:
#Your code here

## Summary

Well done! You just completed your first full web scraping project! You're ready to start harnessing the power of the web!