## **Web Scraping with Requests and BeautifulSoup**

Before scraping any website, note that many websites do not allow scraping, and some companies treat it as unlawful and have sued businesses and individuals over it. Check a site's terms of use and privacy policy before you scrape it.
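
Besides the written policies, many sites also publish a `robots.txt` file that states which paths crawlers may visit. The sketch below is one possible way to read such rules with the standard library's `urllib.robotparser`. The helper name `can_fetch`, the user agent `my-scraper`, and the sample rules are illustrative assumptions, not taken from any real site, and passing this check is not a substitute for reading the site's policies.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(can_fetch(sample_rules, "my-scraper", "https://example.com/catalogue/page-1.html"))  # True
print(can_fetch(sample_rules, "my-scraper", "https://example.com/private/data.html"))      # False
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.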

Once you know which website you want to extract information from, first examine the site to understand its structure and layout.

In [None]:
# !pip install requests 

In [None]:
# !pip install bs4

In [2]:
import pandas as pd
import requests 
import bs4 

In [None]:
data = requests.get('https://en.wikipedia.org/wiki/Artificial_intelligence')

In [None]:
# Page source code

data.text

In [None]:
# Parse the HTML (the 'lxml' parser needs the lxml package installed; 'html.parser' is the built-in alternative)

parsed = bs4.BeautifulSoup(data.text, 'lxml')

In [None]:
parsed

In [None]:
# Select every element with a given HTML tag, e.g. 'h2'

titles = parsed.select('h2')

In [None]:
titles[0].getText()

In [None]:
# Print the text of each 'h2' heading

for i in titles:
    print(i.text)

In [None]:
# Select every 'img' element on the page

images = parsed.select('img')
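
The selection above returns `img` tags, not image addresses. A minimal sketch of one way to turn such tags into absolute URLs follows; it reads each tag's `src` attribute and resolves it against the page URL with `urllib.parse.urljoin`. The helper name `image_urls`, the sample HTML, and the `example.org` addresses are assumptions made for illustration.

```python
from urllib.parse import urljoin

import bs4


def image_urls(html, base_url):
    """Return the absolute URL of every <img> that has a src attribute."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # 'img[src]' skips <img> tags that carry no src attribute
    return [urljoin(base_url, img["src"]) for img in soup.select("img[src]")]


# Hypothetical markup, for illustration only
sample_html = (
    '<p><img src="/static/a.png" alt="A">'
    '<img alt="no source">'
    '<img src="//upload.example.org/b.jpg"></p>'
)

print(image_urls(sample_html, "https://example.org/wiki/Page"))
# ['https://example.org/static/a.png', 'https://upload.example.org/b.jpg']
```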

## **Web Scraping Books to Scrape**

Source: http://books.toscrape.com/index.html

In [16]:
# Creating an empty list for each item we want

pages = []
prices = []
ratings = []
title = []
urls = []

In [17]:
# Set the number of pages we are going to scrape

n_pages = 1

In [18]:
# Build the URL of each catalogue page

for i in range(1, n_pages + 1):
    url = 'http://books.toscrape.com/catalogue/page-{}.html'.format(i)
    pages.append(url)

print('Number of pages:', len(pages))
print(pages)

Number of pages: 1
['http://books.toscrape.com/catalogue/page-1.html']


Next we fetch the data from each page URL and convert the **requests** `Response` object into a **BeautifulSoup** object, which is easier to navigate and read.

In [19]:
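# Fetch and parse each page; `soup` is overwritten on every pass,
# so after the loop it holds only the last page (fine while n_pages = 1)
for item in pages:
    page = requests.get(item)
    soup = bs4.BeautifulSoup(page.text, 'html.parser')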
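
If more than one page is scraped, it may help to keep one parsed object per URL and to stop early on HTTP errors. The following is a sketch under those assumptions, not the notebook's own method: the helper name `fetch_soups`, the 10-second timeout, and the choice of a dict keyed by URL are all illustrative.

```python
import bs4
import requests


def fetch_soups(urls, session=None, timeout=10):
    """Return a dict that maps each URL to its parsed BeautifulSoup object."""
    if session is None:
        session = requests.Session()  # reuses the connection across requests
    soups = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        soups[url] = bs4.BeautifulSoup(response.text, "html.parser")
    return soups
```

Calling `fetch_soups(pages)` with the `pages` list built earlier would return one entry per catalogue page, so no page is lost when `n_pages` is greater than 1.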

In [20]:
print(soup)


<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:30" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link 

In [21]:
# Real HTML is usually indented, and the nesting shows how the tags relate to each other.
# To get that indentation and nesting here, we use 'prettify()'.
# 'prettify()' is a method of BeautifulSoup objects that returns the parsed page source as a formatted string,
# i.e. it lays out the tags as an indented parse tree, which is easier to read.

print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:30" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="../static/oscar/css/styles.css" rel="stylesheet" typ

## **Scraping all the titles**

In [22]:
# Inspecting the page shows that each title sits inside an 'h3' tag

for i in soup.find_all('h3'):
    print(i)

<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
<h3><a href="soumission_998/index.html" title="Soumission">Soumission</a></h3>
<h3><a href="sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>
<h3><a href="sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>
<h3><a href="the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>
<h3><a href="the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>
<h3><a href="the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria 

In [23]:
# Append the text of each 'h3' to the `title` list

for i in soup.find_all('h3'):
    titles_h3 = i.getText()
    title.append(titles_h3)

print(title)

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]
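
Several of the titles in the list above are cut off with `...` because the link text on the page is shortened. The `h3` output printed earlier shows that each `a` tag also carries the full name in its `title` attribute. A minimal sketch of reading that attribute instead, using two of those `h3` lines as sample input (the helper name `full_titles` is an assumption for illustration):

```python
import bs4

# Two of the h3 elements printed earlier, reused as offline sample input
sample_html = """
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
"""


def full_titles(parsed_page):
    """Return the untruncated title of every book link inside an h3."""
    # 'h3 > a[title]' matches links that are direct children of an h3 and have a title attribute
    return [a["title"] for a in parsed_page.select("h3 > a[title]")]


sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
print(full_titles(sample_soup))
# ['A Light in the Attic', 'Tipping the Velvet']
```

Passing the notebook's own `soup` object to `full_titles` should give the complete names for the whole page.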
