### **WEB SCRAPING**

Web scraping is a technique for extracting information from websites with automated tools, such as Python scripts.

It is used when no API is available, or when the data needed is not exposed through an API.

The most commonly used Python tools for web scraping:

- BeautifulSoup: a Python library for parsing HTML and XML and extracting information from them.
- Requests: a Python library for sending HTTP requests.
- Selenium: a browser-automation library with Python bindings, also used for web scraping.

In [1]:
import requests
from bs4 import BeautifulSoup

Scraping has two phases:
- Find the listing pages that link to the items
- Visit each item page and capture the information we are looking for
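
The two phases above can be sketched as a single function. This is only an outline under stated assumptions: `scrape_in_two_phases`, `fetch`, `extract_item_links`, and `parse_item` are hypothetical names, and the "site" is a toy dictionary so the example runs without network access.

```python
def scrape_in_two_phases(listing_urls, fetch, extract_item_links, parse_item):
    """Phase 1: collect item links from every listing page.
    Phase 2: visit each item page and parse it."""
    item_urls = []
    for listing_url in listing_urls:
        for item_url in extract_item_links(fetch(listing_url)):
            if item_url not in item_urls:  # drop duplicates, keep order
                item_urls.append(item_url)
    return [parse_item(fetch(item_url)) for item_url in item_urls]


# Toy stand-in for a website: each "URL" maps to its "page content".
FAKE_SITE = {
    "list-1": "item-a item-b",
    "list-2": "item-b item-c",
    "item-a": "Book A",
    "item-b": "Book B",
    "item-c": "Book C",
}

items = scrape_in_two_phases(
    ["list-1", "list-2"],
    fetch=FAKE_SITE.__getitem__,       # stands in for an HTTP request
    extract_item_links=str.split,      # stands in for HTML link extraction
    parse_item=lambda page: {"Name": page},
)
print(items)
```

In the rest of this notebook, `requests.get` plays the role of `fetch` and BeautifulSoup does the link extraction and item parsing.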

#### **GET THE HTML CODE** ####

In [2]:
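# Later pages follow the pattern https://books.toscrape.com/catalogue/page-2.html
result = requests.get("https://books.toscrape.com/", timeout=10)
result.raise_for_status()  # fail early on an HTTP error status
print(result.text)  # raw HTML; the same request is repeated for each page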

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="static/oscar/favicon.

In [3]:
soup = BeautifulSoup(result.content, 'html.parser')  # parse the HTML into a tree that can be searched and navigated
print(soup.prettify())  # print the HTML in readable form, with indentation

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

#### **ACQUIRE RELEVANT INFORMATION** ####

In [4]:
container = soup.find('ol')  # the <ol> element holds all the books on the page
links = container.find_all('a')  # every book is linked through <a> tags
links_list = list(set([link.get('href') for link in links]))  # set() removes duplicates but does not preserve order
links_list_complete = ["http://books.toscrape.com/" + link for link in links_list]  # turn relative links into full URLs
links_list_complete

['http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
 'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
 'http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/set-me-free_988/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/olio_984/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/cat

In [5]:
len(links_list_complete)

20
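
The links above are built by string concatenation and deduplicated with `set`, which does not keep the page order. A minimal alternative sketch uses `urllib.parse.urljoin` and `dict.fromkeys`. The helper name `build_absolute_links` is hypothetical, and the example assumes that hrefs on later catalogue pages are relative to the page's own URL.

```python
from urllib.parse import urljoin


def build_absolute_links(page_url, hrefs):
    """Resolve each href against the page it came from and
    drop duplicates while keeping the original order."""
    absolute = [urljoin(page_url, href) for href in hrefs]
    return list(dict.fromkeys(absolute))


# Each href appears twice here to imitate duplicated links on a page.
hrefs = [
    "catalogue/sharp-objects_997/index.html",
    "catalogue/sharp-objects_997/index.html",
    "catalogue/soumission_998/index.html",
    "catalogue/soumission_998/index.html",
]
book_links = build_absolute_links("https://books.toscrape.com/", hrefs)
print(book_links)
```

Because the order is preserved, `book_links[0]` is always the first book listed on the page, which is not guaranteed with `set`.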

The same steps would have to be repeated for the remaining 49 pages, changing the URL passed to `requests.get`.
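
A minimal sketch of generating those page URLs, assuming 50 catalogue pages in total and that every page, including the first, follows the `page-N.html` pattern seen earlier. The function name `catalogue_page_urls` is hypothetical.

```python
def catalogue_page_urls(last_page=50):
    """Build the URL of every catalogue page from 1 to last_page."""
    pattern = "https://books.toscrape.com/catalogue/page-{}.html"
    return [pattern.format(number) for number in range(1, last_page + 1)]


page_urls = catalogue_page_urls()
print(len(page_urls))
print(page_urls[1])
```

Each of these URLs can then be requested and parsed in the same way as the first page.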

#### **WEB SCRAPING ON THE FIRST BOOK** ####

In [6]:
result = requests.get(links_list_complete[0])
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   The Black Maria | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="
    Praise for Aracelis Girmay:&quot;[Girmay's] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin.&quot; — O, The Oprah Magazine &quot;In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandabl Praise for Aracelis Girmay:&quot;[Girmay's

- PRODUCT NAME

In [7]:
product_name = soup.find("h1")
product_name.text

'The Black Maria'

- PRODUCT DESCRIPTION

In [8]:
description = soup.find_all("p")
description[3].text  # on this page the description is the fourth <p>; the index depends on the page layout

'Praise for Aracelis Girmay:"[Girmay\'s] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin." — O, The Oprah Magazine "In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandabl Praise for Aracelis Girmay:"[Girmay\'s] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin." — O, The Oprah Magazine "In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandable that we think of her as the blessed curator of our collective histories. There is in her art the vulnerability of one who lives inside of the stories that she gathers in this remarkable collection. Her poems set off alarms even as they transfor
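
Selecting the description by position (`description[3]`) breaks if a page has a different number of `<p>` tags. A hedged alternative is to look for a marker element and take the paragraph that follows it. The sketch below assumes the description is the `<p>` right after an element with `id="product_description"`; that id, the sample HTML, and the helper name `extract_description` are assumptions to verify against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified product page used only for this example.
SAMPLE_HTML = """
<article class="product_page">
  <p class="price_color">£52.15</p>
  <p class="instock availability">In stock (19 available)</p>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>Praise for Aracelis Girmay ...</p>
</article>
"""


def extract_description(soup):
    """Return the text of the <p> following the description header,
    or None when the page has no description."""
    header = soup.find(id="product_description")
    if header is None:
        return None
    paragraph = header.find_next_sibling("p")
    return paragraph.get_text(strip=True) if paragraph else None


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_description(sample_soup))
```

Returning `None` instead of raising makes it easy to keep scraping when a book has no description.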

- PRODUCT INFORMATION

In [9]:
product_information = soup.find_all("tr")
product_information

[<tr>
 <th>UPC</th><td>1dfe412b8ac00530</td>
 </tr>,
 <tr>
 <th>Product Type</th><td>Books</td>
 </tr>,
 <tr>
 <th>Price (excl. tax)</th><td>£52.15</td>
 </tr>,
 <tr>
 <th>Price (incl. tax)</th><td>£52.15</td>
 </tr>,
 <tr>
 <th>Tax</th><td>£0.00</td>
 </tr>,
 <tr>
 <th>Availability</th>
 <td>In stock (19 available)</td>
 </tr>,
 <tr>
 <th>Number of reviews</th>
 <td>0</td>
 </tr>]

In [10]:
product_information[3].select("th")[0].text


'Price (incl. tax)'

In [11]:
product_information[3].select("td")[0].text

'£52.15'
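
The price comes back as a string that includes the currency symbol. If a numeric value is needed, one possible sketch is shown below; `parse_price` is a hypothetical helper, and `Decimal` is chosen here to avoid floating-point rounding on money amounts.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first number from a price string such as '£52.15'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())


price = parse_price("£52.15")
print(price)
```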

- STORE THE INFORMATION IN A DICTIONARY

In [12]:
dictionary = dict()

dictionary["Name"] = product_name.text
dictionary["Description"] = description[3].text

for attribute in product_information:
    dictionary[attribute.select("th")[0].text] = attribute.select("td")[0].text

dictionary

{'Name': 'The Black Maria',
 'Description': 'Praise for Aracelis Girmay:"[Girmay\'s] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin." — O, The Oprah Magazine "In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandabl Praise for Aracelis Girmay:"[Girmay\'s] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin." — O, The Oprah Magazine "In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandable that we think of her as the blessed curator of our collective histories. There is in her art the vulnerability of one who lives inside of the stories that she gathers in this remarkable collection. He
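
The loop above assumes every row has both a `<th>` and a `<td>`; `select(...)[0]` raises an `IndexError` otherwise. A slightly more defensive sketch follows. The helper name `table_to_dict` is hypothetical, the sample table reuses rows from the output above, and the last row is invented only to show a row being skipped.

```python
from bs4 import BeautifulSoup

SAMPLE_TABLE = """
<table>
  <tr><th>UPC</th><td>1dfe412b8ac00530</td></tr>
  <tr><th>Price (incl. tax)</th><td>£52.15</td></tr>
  <tr><th>Availability</th>
      <td>In stock (19 available)</td></tr>
  <tr><td>hypothetical row without a header cell</td></tr>
</table>
"""


def table_to_dict(soup):
    """Map each row's <th> text to its <td> text, skipping incomplete rows."""
    record = {}
    for row in soup.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue  # row lacks a header or a value cell
        record[header.get_text(strip=True)] = value.get_text(strip=True)
    return record


record = table_to_dict(BeautifulSoup(SAMPLE_TABLE, "html.parser"))
print(record)
```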

#### **LET'S CREATE A LIST WITH ALL BOOKS** ####

In [61]:
all_books = list()  # the goal is to store one dictionary per book in this list

for book_url in links_list_complete:

    result = requests.get(book_url)
    soup = BeautifulSoup(result.content, 'html.parser')

    product_name = soup.find("h1")
    description = soup.find_all("p")
    product_information = soup.find_all("tr")

    dictionary = dict()

    dictionary["Name"] = product_name.text
    dictionary["Description"] = description[3].text

    for attribute in product_information:
        dictionary[attribute.select("th")[0].text] = attribute.select("td")[0].text

    all_books.append(dictionary)

In [62]:
all_books

[{'Name': 'The Black Maria',
  'Description': 'Praise for Aracelis Girmay:"[Girmay\'s] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin." — O, The Oprah Magazine "In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandabl Praise for Aracelis Girmay:"[Girmay\'s] every loss—she calls them estrangements—is a yearning for connection across time and place; her every fragment is a bulwark against ruin." — O, The Oprah Magazine "In Aracelis Girmay we have a poet who collects, polishes, and shares stories with such brilliant invention, tenderness, and intellectual liveliness that it is understandable that we think of her as the blessed curator of our collective histories. There is in her art the vulnerability of one who lives inside of the stories that she gathers in this remarkable collection. 
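
When this loop runs over many pages, a single book with an unexpected layout (for example a missing `<h1>` or fewer `<p>` tags) stops the whole run. The sketch below shows one way to record such failures and continue, with a pause between requests. `scrape_books`, `scrape_one`, and `fake_scrape_one` are hypothetical names; `scrape_one` stands for the body of the loop above, and network errors are deliberately not caught here.

```python
import time


def scrape_books(urls, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url) for each URL, collecting results and failures."""
    books, failures = [], []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause between requests
        try:
            books.append(scrape_one(url))
        except (IndexError, AttributeError) as error:
            # Typical errors when an expected tag is missing from the page.
            failures.append((url, repr(error)))
    return books, failures


def fake_scrape_one(url):
    """Stand-in for the real per-book scraping code."""
    if url.endswith("broken"):
        raise IndexError("list index out of range")
    return {"Name": url}


books, failures = scrape_books(["a", "broken", "b"], fake_scrape_one, delay_seconds=0)
print(books)
print(failures)
```

The failed URLs can be inspected afterwards to decide whether the selectors need adjusting.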

#### **SELENIUM** ####