## Web Scraping Intro

*Prepared by:*  
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook shows how to scrape a webpage using a toy website. We will be scraping https://books.toscrape.com/, a site dedicated to practicing web scraping.

**Reminder**

> *"With great power, comes great responsibility"*
    
Remember to perform web scraping with extra caution and to not abuse it. The boundaries are not so clear when it comes to what you can and cannot legally do with scraping. Use your own judgment to determine if what you are about to do is unethical or illegal.
<hr>

### Import libraries

We will be using the `requests` and `BeautifulSoup` libraries in the succeeding cells. Together they give us the functionality we need to scrape a webpage: `requests` downloads the page and `BeautifulSoup` parses its HTML. If BeautifulSoup is not already installed in your environment, you may use either of the following commands in your command line:

```conda install -c anaconda beautifulsoup4``` or
```pip install beautifulsoup4```

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Read the webpage

We will be focusing on the Science books for now. If you select the Science book category on the left side of the screen, the URL will point to https://books.toscrape.com/catalogue/category/books/science_22/index.html. We will be using this URL as our main page.

In [2]:
# get the html of our main page
path = 'https://books.toscrape.com/catalogue/category/books/science_22/index.html'
page = requests.get(path)

# feed it into beautiful soup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   Science | 
     Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="
    
" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="../../../../static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="../../../../static/oscar/css/styles.

### Anatomy of HTML

<sup>This HTML anatomy section is taken from <a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics">MDN Web Docs</a>.</sup>

<center><img width=600 src="../images/grumpy-cat-small.png"></center>

The main parts of an HTML element are as follows:

- The opening tag: This consists of the name of the element (in this case, p), wrapped in opening and closing angle brackets. This states where the element begins or starts to take effect — in this case where the paragraph begins.
- The closing tag: This is the same as the opening tag, except that it includes a forward slash before the element name. This states where the element ends — in this case where the paragraph ends. Failing to add a closing tag is one of the standard beginner errors and can lead to strange results.
- The content: This is the content of the element, which in this case, is just text.
- The element: The opening tag, the closing tag, and the content together comprise the element.

We can also add more details to our elements through the use of attributes.

<center><img width=800 src="../images/grumpy-cat-attribute-small.png"></center>

Attributes contain extra information about the element that you don't want to appear in the actual content. Here, class is the attribute name and editor-note is the attribute value. The class attribute allows you to give the element a non-unique identifier that can be used to target it (and any other elements with the same class value) with style information and other things.

An attribute should always have the following:

- A space between it and the element name (or the previous attribute, if the element already has one or more attributes).
- The attribute name followed by an equal sign.
- The attribute value wrapped by opening and closing quotation marks.
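
The parts described above can be observed directly with Python's built-in `html.parser` module, the same parser that `BeautifulSoup` uses when given `'html.parser'`. The following is a minimal sketch: the `AnatomyRecorder` class is a hypothetical helper written for this illustration, and the sample paragraph is assumed to match the element shown in the images above.

```python
from html.parser import HTMLParser


class AnatomyRecorder(HTMLParser):
    """Hypothetical helper: records each part of an element as it is parsed."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.parts.append(('opening tag', tag, dict(attrs)))

    def handle_data(self, data):
        # the text between the opening and closing tags
        self.parts.append(('content', data))

    def handle_endtag(self, tag):
        self.parts.append(('closing tag', tag))


# sample element (assumed to match the one in the images above)
parser = AnatomyRecorder()
parser.feed('<p class="editor-note">My cat is very grumpy</p>')
parser.close()

for part in parser.parts:
    print(part)
```

Each printed tuple corresponds to one part of the element: the opening tag together with its attributes, the content, and the closing tag.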

### Scraping the contents

Use your browser's developer console to identify the elements you want to scrape. On Windows, it opens when you press `F12`.

<center><img width=800 src="../images/developer-console.png"></center>


**Extra help!!!**  
For a quick guide to scraping typical elements on a webpage, please refer to the following <a href="http://akul.me/blog/2016/beautifulsoup-cheatsheet/">article</a>.

In [3]:
# find returns the first matching element, tags included; use the .text attribute to get only the text inside it
print(soup.find('h1'))
print(soup.find('h1').text)

<h1>Science</h1>
Science
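
One detail worth knowing: `find` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`. The sketch below uses a small inline HTML snippet (an assumption made so that it runs without a network connection) to show a simple guard. The names `html_demo`, `soup_demo`, and `heading_text` are illustrative only.

```python
from bs4 import BeautifulSoup

# tiny stand-in for a downloaded page (assumed sample, not the real site)
html_demo = '<div><h1>Science</h1><p class="price_color">£42.96</p></div>'
soup_demo = BeautifulSoup(html_demo, 'html.parser')

found = soup_demo.find('h1')    # the page has an <h1>, so this is a Tag
missing = soup_demo.find('h2')  # the page has no <h2>, so this is None

print(found.text)
print(missing)

# guard against a missing element before reading .text
heading_text = missing.text if missing is not None else ''
print(repr(heading_text))
```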


In [4]:
print(soup.find('p', attrs={'class':'price_color'}))
print(soup.find('p', attrs={'class':'price_color'}).text)

<p class="price_color">Â£42.96</p>
Â£42.96
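
The stray `Â` in front of the pound sign is an encoding artifact rather than part of the page. The page declares `charset=utf-8` in its `<meta>` tag, but `page.text` was most likely decoded as ISO-8859-1, the fallback `requests` uses for text responses when the server's `Content-Type` header names no charset. The sketch below reproduces the effect with plain strings so that it runs without a network connection. The variable names are illustrative only.

```python
# the price as the server would send it: UTF-8 encoded bytes
raw_bytes = '£42.96'.encode('utf-8')

# decoding those bytes with the wrong codec reproduces the garbled output
garbled = raw_bytes.decode('iso-8859-1')
print(garbled)

# decoding with the codec the page declares gives the intended text
fixed = raw_bytes.decode('utf-8')
print(fixed)

# an already garbled string can be repaired by reversing the wrong decode
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)
```

In this notebook, one way to avoid the artifact would be to set `page.encoding = 'utf-8'` before reading `page.text`, or to pass the raw bytes in `page.content` to `BeautifulSoup` and let it detect the encoding.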


In [5]:
articles = soup.find_all('article')
article_links = [article.find('a').get('href') for article in articles]
article_links

['../../../the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html',
 '../../../immunity-how-elie-metchnikoff-changed-the-course-of-modern-medicine_900/index.html',
 '../../../sorting-the-beef-from-the-bull-the-science-of-food-fraud-forensics_736/index.html',
 '../../../tipping-point-for-planet-earth-how-close-are-we-to-the-edge_643/index.html',
 '../../../the-fabric-of-the-cosmos-space-time-and-the-texture-of-reality_572/index.html',
 '../../../diary-of-a-citizen-scientist-chasing-tiger-beetles-and-other-new-ways-of-engaging-the-world_517/index.html',
 '../../../the-origin-of-species_499/index.html',
 '../../../the-grand-design_405/index.html',
 '../../../peak-secrets-from-the-new-science-of-expertise_389/index.html',
 '../../../the-elegant-universe-superstrings-hidden-dimensions-and-the-quest-for-the-ultimate-theory_245/index.html',
 '../../../the-disappearing-spoon-and-other-true-tales-of-madness-love-and-the-history-of-the-world-from-the-periodic-table-of-the-elements_

Perform a sanity check! We should get the same number of items as the number of books shown on the webpage. Make sure that the retrieved values are correct too!

In [6]:
len(article_links)

14

#### Retrieve the first book

In [7]:
index = 0 # first book

The links are relative to the category page. We do not need the leading `../` segments, so we will remove them.

In [8]:
article_links[index].split("../../../")

['', 'the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html']

Note that the trimmed path is the second element of the list after splitting.

In [9]:
book_link = 'https://books.toscrape.com/catalogue/' + article_links[index].split("../../../")[1]
book_link

'https://books.toscrape.com/catalogue/the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html'
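
As an alternative to splitting the string by hand, the standard library's `urllib.parse.urljoin` can resolve a relative link against the URL of the page it came from. This is a minimal sketch, not the method used in the rest of this notebook. The variable names `category_url`, `relative_link`, and `book_url` are illustrative.

```python
from urllib.parse import urljoin

# URL of the page where the link was found
category_url = 'https://books.toscrape.com/catalogue/category/books/science_22/index.html'

# relative link exactly as it appears in the href attribute
relative_link = '../../../the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html'

# urljoin walks up one directory for every '../' segment
book_url = urljoin(category_url, relative_link)
print(book_url)
```

The result is the same URL as the one built above, and the approach keeps working if the number of `../` segments differs from page to page.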

In [10]:
# get the html of one of the books on the website
page_book = requests.get(book_link)

# feed it into beautiful soup for parsing
soup_book = BeautifulSoup(page_book.text, 'html.parser')

In [11]:
soup_book.find('h1')

<h1>The Most Perfect Thing: Inside (and Outside) a Bird's Egg</h1>

### Reading the table

In [12]:
table = soup_book.find('table', attrs={'class':'table table-striped'})
table

<table class="table table-striped">
<tr>
<th>UPC</th><td>aadee1c326d286e3</td>
</tr>
<tr>
<th>Product Type</th><td>Books</td>
</tr>
<tr>
<th>Price (excl. tax)</th><td>Â£42.96</td>
</tr>
<tr>
<th>Price (incl. tax)</th><td>Â£42.96</td>
</tr>
<tr>
<th>Tax</th><td>Â£0.00</td>
</tr>
<tr>
<th>Availability</th>
<td>In stock (16 available)</td>
</tr>
<tr>
<th>Number of reviews</th>
<td>0</td>
</tr>
</table>

### Parsing the table

In [13]:
table.find_all('th')

[<th>UPC</th>,
 <th>Product Type</th>,
 <th>Price (excl. tax)</th>,
 <th>Price (incl. tax)</th>,
 <th>Tax</th>,
 <th>Availability</th>,
 <th>Number of reviews</th>]

We only need the text, not the whole element.

In [14]:
header = [i.text for i in table.find_all('th')]
header

['UPC',
 'Product Type',
 'Price (excl. tax)',
 'Price (incl. tax)',
 'Tax',
 'Availability',
 'Number of reviews']

In [15]:
table.find_all('td')

[<td>aadee1c326d286e3</td>,
 <td>Books</td>,
 <td>Â£42.96</td>,
 <td>Â£42.96</td>,
 <td>Â£0.00</td>,
 <td>In stock (16 available)</td>,
 <td>0</td>]

In [16]:
value = [i.text for i in table.find_all('td')]
value

['aadee1c326d286e3',
 'Books',
 'Â£42.96',
 'Â£42.96',
 'Â£0.00',
 'In stock (16 available)',
 '0']

Turn it into a dataframe.

In [17]:
pd.DataFrame({'value': value}, index=header)

| |value|
|---|---|
|UPC|aadee1c326d286e3|
|Product Type|Books|
|Price (excl. tax)|Â£42.96|
|Price (incl. tax)|Â£42.96|
|Tax|Â£0.00|
|Availability|In stock (16 available)|
|Number of reviews|0|
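
Since every row of this table holds one `th` and one `td`, the header and its value can also be paired row by row into a dictionary. The sketch below assumes that one-pair-per-row layout and uses a shortened inline copy of the table so that it runs without a network connection. `table_to_dict` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# shortened copy of the product table (assumed sample for illustration)
sample_html = """
<table class="table table-striped">
<tr><th>UPC</th><td>aadee1c326d286e3</td></tr>
<tr><th>Product Type</th><td>Books</td></tr>
<tr><th>Availability</th>
<td>In stock (16 available)</td></tr>
</table>
"""


def table_to_dict(table):
    """Hypothetical helper: map each row's <th> text to its <td> text."""
    record = {}
    for row in table.find_all('tr'):
        record[row.find('th').text] = row.find('td').text
    return record


sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
record = table_to_dict(sample_table)
print(record)
```

Reading both cells from the same row keeps each value attached to its own header. A list of such dictionaries, one per book, can be passed to `pd.DataFrame` to get one row per book.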


### Pandas hack

`pandas` can parse an HTML table directly with `pd.read_html`. We will need the `lxml` library for this to work. If it is not already installed in your environment, you may use either of the following commands in your command line:

```conda install -c anaconda lxml``` or ```pip install lxml```

In [18]:
str(soup_book.find('table'))

'<table class="table table-striped">\n<tr>\n<th>UPC</th><td>aadee1c326d286e3</td>\n</tr>\n<tr>\n<th>Product Type</th><td>Books</td>\n</tr>\n<tr>\n<th>Price (excl. tax)</th><td>Â£42.96</td>\n</tr>\n<tr>\n<th>Price (incl. tax)</th><td>Â£42.96</td>\n</tr>\n<tr>\n<th>Tax</th><td>Â£0.00</td>\n</tr>\n<tr>\n<th>Availability</th>\n<td>In stock (16 available)</td>\n</tr>\n<tr>\n<th>Number of reviews</th>\n<td>0</td>\n</tr>\n</table>'

In [19]:
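from io import StringIO

# wrap the HTML string in StringIO; recent pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(str(soup_book.find('table'))))[0]
df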

| |0|1|
|---|---|---|
|0|UPC|aadee1c326d286e3|
|1|Product Type|Books|
|2|Price (excl. tax)|Â£42.96|
|3|Price (incl. tax)|Â£42.96|
|4|Tax|Â£0.00|
|5|Availability|In stock (16 available)|
|6|Number of reviews|0|


What if we want it in columnar format?

In [20]:
df.set_index(0).T

| |UPC|Product Type|Price (excl. tax)|Price (incl. tax)|Tax|Availability|Number of reviews|
|---|---|---|---|---|---|---|---|
|1|aadee1c326d286e3|Books|Â£42.96|Â£42.96|Â£0.00|In stock (16 available)|0|


### Exercise

Scrape all the books in the Science category and save them as a pandas DataFrame with the following fields:

| |Title|UPC|Product Type|Price (excl. tax)|Price (incl. tax)|Tax|Availability|Number of reviews|
|---|---|---|---|---|---|---|---|---|
|0|The Most Perfect Thing: Inside (and Outside) a Bird's Egg|aadee1c326d286e3|Books|Â£42.96|Â£42.96|Â£0.00|In stock (16 available)|0|
|1|Immunity: How Elie Metchnikoff Changed the Course of Modern Medicine|e4f74c16de34d440|Books|Â£57.36|Â£57.36|Â£0.00|In stock (16 available)|0|
|2|...| | | | | | | |

## References

1. https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> <href>judemichaelteves@gmail.com</href> or <href>jude.teves@dlsu.edu.ph</href></sup><br>