## Web Scraping Intro

*Prepared by:*  
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

<!-- <sup>```Last run: 2021-07-06 11:39PM (GMT +8)```</sup> -->

This notebook shows how to scrape a toy website. We will scrape https://books.toscrape.com/, a site built specifically for practicing web scraping.


### Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Read the webpage

In [2]:
path = 'https://books.toscrape.com/catalogue/category/books/science_22/index.html'
# Get the HTML of the Science category page
page = requests.get(path)

# Feed it into Beautiful Soup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   Science | 
     Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="
    
" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="../../../../static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="../../../../static/oscar/css/styles.
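
The request above assumes the download succeeds. A slightly more defensive version, sketched below, sets a timeout and raises on HTTP errors before parsing. The helper name `fetch_soup` and the 10-second default timeout are illustrative choices, not part of the original notebook.

```python
from bs4 import BeautifulSoup
import requests


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_soup(path)` should then return the same kind of object as `soup` above.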

### Scraping the contents

In [3]:
# find() returns the first matching element, tags included; use its .text attribute to get only the text inside
print(soup.find('h1'))
print(soup.find('h1').text)

<h1>Science</h1>
Science


In [4]:
print(soup.find('p', attrs={'class':'price_color'}))
print(soup.find('p', attrs={'class':'price_color'}).text)

<p class="price_color">Â£42.96</p>
Â£42.96
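
The `Â` in front of the pound sign is most likely an encoding mismatch rather than part of the page. When a response does not declare a charset, `requests` falls back to ISO-8859-1 for `page.text`, while the page bytes are presumably UTF-8. The sketch below reproduces the effect with plain strings, so it runs without a network connection. That the site serves UTF-8 without declaring it is an assumption here.

```python
# The pound sign takes two bytes in UTF-8
raw = '£42.96'.encode('utf-8')

# Decoding those bytes as ISO-8859-1 (Latin-1) yields the garbled text seen above
garbled = raw.decode('iso-8859-1')
print(garbled)  # Â£42.96

# Decoding with the matching codec recovers the intended text
fixed = raw.decode('utf-8')
print(fixed)    # £42.96
```

In the notebook, setting `page.encoding = 'utf-8'` before reading `page.text`, or passing `page.content` to `BeautifulSoup` instead, should give `£42.96`.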


In [5]:
articles = soup.find_all('article')
article_links = [article.find('a').get('href') for article in articles]
article_links

['../../../the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html',
 '../../../immunity-how-elie-metchnikoff-changed-the-course-of-modern-medicine_900/index.html',
 '../../../sorting-the-beef-from-the-bull-the-science-of-food-fraud-forensics_736/index.html',
 '../../../tipping-point-for-planet-earth-how-close-are-we-to-the-edge_643/index.html',
 '../../../the-fabric-of-the-cosmos-space-time-and-the-texture-of-reality_572/index.html',
 '../../../diary-of-a-citizen-scientist-chasing-tiger-beetles-and-other-new-ways-of-engaging-the-world_517/index.html',
 '../../../the-origin-of-species_499/index.html',
 '../../../the-grand-design_405/index.html',
 '../../../peak-secrets-from-the-new-science-of-expertise_389/index.html',
 '../../../the-elegant-universe-superstrings-hidden-dimensions-and-the-quest-for-the-ultimate-theory_245/index.html',
 '../../../the-disappearing-spoon-and-other-true-tales-of-madness-love-and-the-history-of-the-world-from-the-periodic-table-of-the-elements_

In [6]:
len(article_links)

14
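
The same loop over `articles` can collect more than links. The sketch below pulls the title, link, and price of each book from a small inline HTML snippet that imitates the listing markup. The snippet, the `title` attribute on the link, and the variable names are assumptions, so check them against the real page source before relying on them.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the listing markup (an assumption, not the real page)
html = """
<article class="product_pod">
  <h3><a href="../../../book-one_1/index.html" title="Book One">Book One</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="../../../book-two_2/index.html" title="Book Two">Book Two</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Build one dictionary per book
books = []
for article in demo_soup.find_all('article'):
    link = article.find('h3').find('a')
    books.append({
        'title': link.get('title'),
        'href': link.get('href'),
        'price': article.find('p', attrs={'class': 'price_color'}).text,
    })

print(books)
```

A list of dictionaries like `books` can be passed straight to `pd.DataFrame` to get one row per book.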

In [7]:
article_links[0].split("../../../")

['', 'the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html']

In [8]:
# Links on the category page are relative ('../../../<book>/index.html'),
# so strip the leading '../../../' and prepend the catalogue URL.
book_link = 'https://books.toscrape.com/catalogue/' + article_links[0].split("../../../")[1]
book_link

'https://books.toscrape.com/catalogue/the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html'
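
Splitting on `"../../../"` works for this page but depends on the exact number of `../` segments. As an alternative sketch, the standard library's `urljoin` resolves a relative link against the URL of the page it came from. The URL and the relative link are copied from the cells above; `relative_link` is an illustrative name.

```python
from urllib.parse import urljoin

path = 'https://books.toscrape.com/catalogue/category/books/science_22/index.html'
relative_link = '../../../the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html'

# Resolve the relative link against the category page URL
book_link = urljoin(path, relative_link)
print(book_link)
# https://books.toscrape.com/catalogue/the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html
```

Because `urljoin` handles any depth of `../`, the same call should work for links taken from other pages of the site.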

In [9]:
# Get the HTML of one of the book pages
page_book = requests.get(book_link)

# Feed it into Beautiful Soup for parsing
soup_book = BeautifulSoup(page_book.text, 'html.parser')

In [10]:
soup_book.find('h1')

<h1>The Most Perfect Thing: Inside (and Outside) a Bird's Egg</h1>

### Reading the table

In [30]:
table = soup_book.find('table', attrs={'class':'table table-striped'})
table

<table class="table table-striped">
<tr>
<th>UPC</th><td>aadee1c326d286e3</td>
</tr>
<tr>
<th>Product Type</th><td>Books</td>
</tr>
<tr>
<th>Price (excl. tax)</th><td>Â£42.96</td>
</tr>
<tr>
<th>Price (incl. tax)</th><td>Â£42.96</td>
</tr>
<tr>
<th>Tax</th><td>Â£0.00</td>
</tr>
<tr>
<th>Availability</th>
<td>In stock (16 available)</td>
</tr>
<tr>
<th>Number of reviews</th>
<td>0</td>
</tr>
</table>

### Parsing the table

In [31]:
table.find_all('th')

[<th>UPC</th>,
 <th>Product Type</th>,
 <th>Price (excl. tax)</th>,
 <th>Price (incl. tax)</th>,
 <th>Tax</th>,
 <th>Availability</th>,
 <th>Number of reviews</th>]

In [34]:
header = [i.text for i in table.find_all('th')]
header

['UPC',
 'Product Type',
 'Price (excl. tax)',
 'Price (incl. tax)',
 'Tax',
 'Availability',
 'Number of reviews']

In [32]:
table.find_all('td')

[<td>aadee1c326d286e3</td>,
 <td>Books</td>,
 <td>Â£42.96</td>,
 <td>Â£42.96</td>,
 <td>Â£0.00</td>,
 <td>In stock (16 available)</td>,
 <td>0</td>]

In [35]:
value = [i.text for i in table.find_all('td')]
value

['aadee1c326d286e3',
 'Books',
 'Â£42.96',
 'Â£42.96',
 'Â£0.00',
 'In stock (16 available)',
 '0']
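
Collecting the `th` and `td` cells in two separate lists relies on every row having exactly one of each. A sketch that pairs them row by row is shown below, using an inline copy of a few table rows so it runs offline. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A few rows of the product table, copied from the output above
html = """
<table class="table table-striped">
<tr><th>UPC</th><td>aadee1c326d286e3</td></tr>
<tr><th>Product Type</th><td>Books</td></tr>
<tr><th>Availability</th>
<td>In stock (16 available)</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, 'html.parser').find('table')

# Walk the rows so each header stays attached to its own value
rows = {}
for tr in demo_table.find_all('tr'):
    rows[tr.find('th').text] = tr.find('td').text

print(rows)
```

With this layout a missing cell raises an error at the offending row instead of silently shifting the later values.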

In [37]:
pd.DataFrame({'header': header, 'value': value})

|   | header | value |
|---|---|---|
| 0 | UPC | aadee1c326d286e3 |
| 1 | Product Type | Books |
| 2 | Price (excl. tax) | Â£42.96 |
| 3 | Price (incl. tax) | Â£42.96 |
| 4 | Tax | Â£0.00 |
| 5 | Availability | In stock (16 available) |
| 6 | Number of reviews | 0 |
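
The values in this DataFrame are still strings. As a sketch of one possible cleaning step, the snippet below turns the price and availability strings into numbers with regular expressions. The sample strings are copied from the table above; the helper names `parse_price` and `parse_stock` are hypothetical.

```python
import re


def parse_price(text):
    """Return the first decimal number in a price string as a float."""
    match = re.search(r'\d+\.\d+', text)
    return float(match.group()) if match else None


def parse_stock(text):
    """Return the integer count in a string such as 'In stock (16 available)'."""
    match = re.search(r'\((\d+) available\)', text)
    return int(match.group(1)) if match else None


print(parse_price('Â£42.96'))                  # 42.96
print(parse_stock('In stock (16 available)'))  # 16
```

Matching only the digits means the currency symbol, garbled or not, does not affect the result.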


### Pandas hack

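`pd.read_html` can parse an HTML table directly into a DataFrame. It needs an HTML parser such as `lxml`, which can be installed with ```conda install -c anaconda lxml``` or ```pip install lxml```.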

In [24]:
str(soup_book.find('table'))

'<table class="table table-striped">\n<tr>\n<th>UPC</th><td>aadee1c326d286e3</td>\n</tr>\n<tr>\n<th>Product Type</th><td>Books</td>\n</tr>\n<tr>\n<th>Price (excl. tax)</th><td>Â£42.96</td>\n</tr>\n<tr>\n<th>Price (incl. tax)</th><td>Â£42.96</td>\n</tr>\n<tr>\n<th>Tax</th><td>Â£0.00</td>\n</tr>\n<tr>\n<th>Availability</th>\n<td>In stock (16 available)</td>\n</tr>\n<tr>\n<th>Number of reviews</th>\n<td>0</td>\n</tr>\n</table>'

In [28]:
pd.read_html(str(soup_book.find('table')))[0]

|   | 0 | 1 |
|---|---|---|
| 0 | UPC | aadee1c326d286e3 |
| 1 | Product Type | Books |
| 2 | Price (excl. tax) | Â£42.96 |
| 3 | Price (incl. tax) | Â£42.96 |
| 4 | Tax | Â£0.00 |
| 5 | Availability | In stock (16 available) |
| 6 | Number of reviews | 0 |
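
Recent pandas releases warn when `pd.read_html` receives a literal HTML string and suggest wrapping it in `io.StringIO`. The sketch below does that with an inline two-row table so it runs offline, and then names the columns. It assumes `lxml` is installed as noted above; the column names are illustrative.

```python
from io import StringIO

import pandas as pd

# Two rows of the product table, copied from the output above
html = """
<table class="table table-striped">
<tr><th>UPC</th><td>aadee1c326d286e3</td></tr>
<tr><th>Product Type</th><td>Books</td></tr>
</table>
"""

# Wrap the markup in StringIO so pandas treats it as a file-like object
df = pd.read_html(StringIO(html))[0]

# read_html numbers the columns 0 and 1; give them readable names
df.columns = ['header', 'value']
print(df)
```

In the notebook, the equivalent call would be `pd.read_html(StringIO(str(soup_book.find('table'))))[0]`.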


## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> <href>judemichaelteves@gmail.com</href> or <href>jude.teves@dlsu.edu.ph</href></sup><br>