## Lecture - 1

In this lecture we will see how to scrape data from a webpage using "BeautifulSoup".

The first step in scraping a webpage is to download it (load the HTML). To do this, we send a get() request to the webpage whose HTML content we want. The request returns a response object, which we save in a variable "response"; the HTML content is available from it as a string.

The HTML content is deeply nested, so plain string processing is not practical. We need a "parser" that builds a "nested tree" structure from the HTML, which makes traversing the content and finding the desired data easy. We can parse the HTML content with the Python library "BeautifulSoup".

After parsing the HTML content, we can simply locate and extract the desired data.

For example : I have created a demo webpage which I will be using to scrape data in this particular lecture. The webpage is present in my local system and I haven't hosted it anywhere.

In [1]:
html = '<!DOCTYPE html>\
<html>\
<head>\
<title> Testing Web Page </title>\
</head>\
<body>\
<h1> Web Scraping </h1>\
<p class = "abc" id = "first_para">\
Let \'s start learning\
<b>\
Web Scraping\
</b>\
</p>\
<p id = "def">\
You can read more about BeautifulSoup from <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a>\
</p>\
<p class = "abc">\
<a href = "https://codingninjas.in/"> Coding Ninjas </a>\
</p>\
</body>\
</html>'

I have already copied the HTML content and saved it as a string in the variable "html".

Currently I have skipped the first step which is to send a get() request to the desired webpage from where we want to download the desired HTML content.

To parse the HTML content, we pass it to the "BeautifulSoup()" constructor along with the name of the parser we want to use. Here the parser is "html.parser", which ships with Python.

In [2]:
# importing the "BeautifulSoup" class from the package "bs4"

from bs4 import BeautifulSoup

data = BeautifulSoup(html, 'html.parser')
data

<!DOCTYPE html>
<html><head><title> Testing Web Page </title></head><body><h1> Web Scraping </h1><p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p><p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></body></html>

In [3]:
type(data)

bs4.BeautifulSoup

We can store the parsed HTML content in some variable (here "data") and print it to see what result we have got.

The variable "data" is an object of type "BeautifulSoup".

We can print the parsed HTML content in a nicely formatted way (where we can visualise the hierarchy) with the help of the prettify() method. The hierarchy obtained will help us to locate and fetch the desired data.

In [4]:
print(data.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing Web Page
  </title>
 </head>
 <body>
  <h1>
   Web Scraping
  </h1>
  <p class="abc" id="first_para">
   Let 's start learning
   <b>
    Web Scraping
   </b>
  </p>
  <p id="def">
   You can read more about BeautifulSoup from
   <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">
    here
   </a>
  </p>
  <p class="abc">
   <a href="https://codingninjas.in/">
    Coding Ninjas
   </a>
  </p>
 </body>
</html>


Let's see the attributes which are available on the parsed HTML content stored in the variable "data" ...

##### 1) data.tag_name (any tag name from which I want to extract the data)

Here we get the details of the complete tag.

In [5]:
# <title>

data.title

<title> Testing Web Page </title>

In [6]:
# <head>

data.head

<head><title> Testing Web Page </title></head>

In [7]:
# <h1>

data.h1

<h1> Web Scraping </h1>

In [8]:
# <p>

data.p

<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>

There are 3 "paragraph" tags in the HTML content, but we are getting the details of only 1 "paragraph" tag (the first one in the hierarchy).

If a tag is not available in the HTML content, we get "None" (no error), which a notebook cell displays as blank output.
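Because a missing tag comes back as "None", chaining another attribute onto it fails. The following is a minimal sketch of guarding against that; the HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny document with an <h1> but no <table> (illustrative only).
soup = BeautifulSoup('<html><body><h1> Web Scraping </h1></body></html>', 'html.parser')

table = soup.table   # there is no <table>, so this is None
print(table)

# Check for None before reading anything from a tag that may be missing.
heading_text = soup.h1.string.strip() if soup.h1 is not None else ''
table_text = soup.table.get_text() if soup.table is not None else ''

print(repr(heading_text))
print(repr(table_text))
```

Without the check, "soup.table.get_text()" would raise an "AttributeError", since "None" has no such method.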

##### 2) data.tag_name.name

Here we get the name of the tag.

In [9]:
# <title>

data.title.name

'title'

In [10]:
# <head>

data.head.name

'head'

##### 3) data.tag_name.string

Here we get the content (in string) present in the hierarchy of the tag.

In [11]:
# <title>

data.title.string

' Testing Web Page '

In [12]:
# <head>

data.head.string

' Testing Web Page '

##### 4) data.tag_name.attrs

Here we get the attributes like class, id, etc. of the tag. It returns a Python Dictionary (set of  key, value pairs) where the "key" contains the "attribute names of the tag", the "value" contains the "attribute values of the tag".

In [13]:
# <title>

data.title.attrs

{}

We get an empty Python Dictionary because the tag has no attributes.

In [14]:
# <head>

data.head.attrs

{}

We get an empty Python Dictionary because the tag has no attributes.

In [15]:
# <p>

data.p.attrs

{'class': ['abc'], 'id': 'first_para'}

There are 3 "paragraph" tags in the HTML content, but we are getting the attributes of only 1 "paragraph" tag (the first one in the hierarchy).

Let's see how to get the value of the attributes ...

In [16]:
data.p['id']

'first_para'

In [17]:
data.p.get('id')

'first_para'

In [18]:
data.p['class']

['abc']

In [19]:
data.p.get('class')

['abc']

We can write it this way because a tag exposes its attributes like a dictionary.

If we use square brackets to get the value of an attribute that is not present in the tag, we get a "KeyError". The "get()" method returns "None" instead.
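The difference between the two ways of reading an attribute shows up when the attribute is missing. A small sketch, using a made-up one-line HTML snippet and a "title" attribute that the tag does not have:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="abc" id="first_para">Hello</p>', 'html.parser')
p = soup.p

# get() behaves like dict.get(): no exception for a missing attribute.
missing_via_get = p.get('title')            # None
fallback = p.get('title', 'no title')       # a default of our choice

# Square brackets behave like dict indexing and raise KeyError.
try:
    p['title']
    raised = False
except KeyError:
    raised = True

print(missing_via_get, fallback, raised)
```

So "get()" is the safer choice when we are not sure every tag carries the attribute.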

Let's see how to access all the text (without the tags) present in the webpage ...

In [20]:
data.get_text()

" Testing Web Page  Web Scraping Let 's start learningWeb ScrapingYou can read more about BeautifulSoup from  here  Coding Ninjas "

There are some methods like : "find()", "find_all()" that we use to find something from the webpage.

In [21]:
data.find('p')

<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>

Here we get the same result as "data.p". It takes a tag name as a "string" argument and returns the first occurrence of that tag (here the "paragraph" tag).

If no matching tag is found, it returns "None" (shown as blank output).

In [22]:
data.find_all('p')

[<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>,
 <p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p>,
 <p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p>]

Here we get the list of all the "paragraph" tags present in the HTML webpage. It takes a tag name as a "string" argument and returns all the occurrences of that tag.

We can iterate over the list to get each of the "paragraph" tags.

In [23]:
for i in data.find_all('p'):
  print(i)

<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>
<p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p>
<p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p>


## Lecture - 2

Let's see a few more methods and attributes with the help of which we can navigate the "parse tree" created by the library "BeautifulSoup".

Let's revisit the "find_all()" function and explore more about it ...

In [24]:
html = '<!DOCTYPE html>\
<html>\
<head>\
<title> Testing Web Page </title>\
</head>\
<body>\
<h1> Web Scraping </h1>\
<p class = "abc" id = "first_para">\
Let \'s start learning\
<b>\
Web Scraping\
</b>\
</p>\
<p id = "def">\
You can read more about BeautifulSoup from <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a>\
</p>\
<p class = "abc">\
<a href = "https://codingninjas.in/"> Coding Ninjas </a>\
</p>\
</body>\
</html>'

In [25]:
from bs4 import BeautifulSoup
data = BeautifulSoup(html, 'html.parser')

In [26]:
# list

data.find_all(['p', 'a'])

[<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>,
 <p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p>,
 <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a>,
 <p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p>,
 <a href="https://codingninjas.in/"> Coding Ninjas </a>]

Here we get the list of all the "paragraph" and "anchor" tags present in the webpage.

In [27]:
# True

data.find_all(True)

[<html><head><title> Testing Web Page </title></head><body><h1> Web Scraping </h1><p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p><p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></body></html>,
 <head><title> Testing Web Page </title></head>,
 <title> Testing Web Page </title>,
 <body><h1> Web Scraping </h1><p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p><p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></body>,
 <h1> Web Scraping </h1>,
 <p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>,
 <b>Web Scraping</b>,
 <p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/

Here we get the list of each and every tag (in hierarchical order) present in the HTML webpage.

In [28]:
# id (find all the tags with this attribute)

data.find_all(id = 'first_para')

[<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>]

In [29]:
# class (find all the tags with this attribute)

data.find_all(class_ = 'abc')

[<p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p>,
 <p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p>]
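The tag name and the attribute filters can also be combined in one "find_all()" call. The sketch below uses a shortened, made-up version of the demo HTML (the example.com links and the extra "div" are placeholders) to show the difference.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="abc" id="first_para">one</p>'
    '<p id="def"><a href="https://example.com/docs">docs</a></p>'
    '<p class="abc"><a href="https://example.com/home">home</a></p>'
    '<div class="abc">not a paragraph</div>'
)
soup = BeautifulSoup(html, 'html.parser')

any_abc = soup.find_all(class_='abc')          # every tag with class "abc"
p_abc = soup.find_all('p', class_='abc')       # only <p> tags with class "abc"

# href=True keeps only the <a> tags that actually have an href attribute.
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

print(len(any_abc), len(p_abc))
print(hrefs)
```

Restricting by tag name as well as class avoids picking up unrelated tags that happen to share the class.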

We can pass CSS selectors by the following syntax : 
#### data.select('selector')
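As a quick illustration of "select()", here is a minimal sketch on a shortened, made-up version of the demo HTML. The selectors follow normal CSS rules: "p.abc" is a tag with a class, "#first_para" is an id, and "p > a" is a direct child.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="abc" id="first_para">one</p>'
    '<p id="def"><a href="https://example.com/docs">docs</a></p>'
    '<p class="abc"><a href="https://example.com/home">home</a></p>'
)
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.select('p.abc')             # list of <p> tags with class "abc"
by_id = soup.select_one('#first_para')      # first match only, or None
links = soup.select('p > a')                # <a> tags directly inside a <p>

print(len(by_class), by_id.get_text(), [a['href'] for a in links])
```

"select()" always returns a list, while "select_one()" returns a single tag (or "None"), much like "find_all()" and "find()".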

1) We can navigate down the parse tree by using nested tag names.

In [30]:
data.head.title

<title> Testing Web Page </title>

This gives us the same result that we had got for "data.title".

Let's try to print the strings available in all the "paragraph" tags ...

In [31]:
li = data.find_all('p')

for i in li:
    print(i.string)

None
None
 Coding Ninjas 


The ".string" attribute returns a string only when the tag has a single "child". When multiple "children" are available, the ".string" attribute returns "None".

To solve this problem we use the ".strings" attribute, which gives a generator over all the strings inside each "paragraph" tag. We can convert this generator into a list. To also "strip" the extra spaces before and after each string, we use ".stripped_strings" instead.

In [32]:
li = data.find_all('p')

for i in li:
  print(list(i.stripped_strings))

["Let 's start learning", 'Web Scraping']
['You can read more about BeautifulSoup from', 'here']
['Coding Ninjas']


We observe that for every "paragraph" tag, we get a separate list.
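To compare the different ways of reading text side by side, here is a small sketch on two made-up "paragraph" tags. It also uses "get_text()" with a separator, which joins the pieces into one string.

```python
from bs4 import BeautifulSoup

html = (
    '<p> Let us start <b> Web Scraping </b></p>'
    '<p><a href="#"> Coding Ninjas </a></p>'
)
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('p')

# Two children (a string and a <b> tag), so .string gives up and returns None.
first_string = first.string
# One child, so .string reaches through <a> to its text.
second_string = second.string

raw_parts = list(first.strings)               # whitespace kept
clean_parts = list(first.stripped_strings)    # whitespace stripped
joined = first.get_text(' ', strip=True)      # one string, pieces joined by a space

print(first_string, repr(second_string))
print(raw_parts, clean_parts, repr(joined))
```

Which one to use depends on whether we want the pieces separately (a list) or as one readable string.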

2) We can navigate down the parse tree by using contents, children, descendants.

In [33]:
data.html.contents

[<head><title> Testing Web Page </title></head>,
 <body><h1> Web Scraping </h1><p class="abc" id="first_para">Let 's start learning<b>Web Scraping</b></p><p id="def">You can read more about BeautifulSoup from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></body>]

We get a list of length = 2. The "html" tag has 2 children : "head", "body" tag.

In [34]:
data.html.children 

<list_iterator at 0x23aca6c6040>

We get an iterator object. We can iterate over this iterator object to get the children.

In [35]:
data.html.descendants

<generator object Tag.descendants at 0x0000023ACA72B510>

Here we get a generator object. We can convert it into a list to view it. In this case, we get a list of length = 17.

Here the "html" tag has 2 children and 17 descendants: ".children" returns only the direct children of the "html" tag, while ".descendants" returns all the children, their children, and so on (including the strings inside the tags).
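The difference is easier to see on a very small document. The sketch below uses a made-up HTML string (not the demo page) and simply counts both.

```python
from bs4 import BeautifulSoup

html = '<html><head><title>T</title></head><body><p>Hi <b>there</b></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Direct children of <html>: only <head> and <body>.
children = list(soup.html.children)
child_names = [child.name for child in children]

# Descendants walk the whole subtree, and text nodes are counted too:
# head, title, 'T', body, p, 'Hi ', b, 'there'
descendants = list(soup.html.descendants)

print(child_names)
print(len(descendants))
```

Text nodes have no tag name, which is why a descendants list is usually longer than the number of tags we can see.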

## Lecture - 3 (Scraping a Webpage)

In this lecture we will be scraping the first ever website created. The link is : 'http://info.cern.ch/hypertext/WWW/TheProject.html'

In [36]:
# Load the HTML --> requests

from bs4 import BeautifulSoup
import requests

response = requests.get('http://info.cern.ch/hypertext/WWW/TheProject.html')

In [37]:
print(response.status_code)

200


Here the status code 200 means the HTTP request has been made successfully.
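Instead of checking the status code by hand every time, the download step can be wrapped in a small helper. This is only a sketch: the function name "fetch_html" and the 10 second timeout are our own choices, not something the lecture prescribes.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a webpage and return its HTML as a string.

    Raises requests.exceptions.HTTPError for 4xx/5xx responses and
    other requests exceptions for network problems or timeouts.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # turn a bad status code into an exception
    return response.text

# Example usage (needs network access):
# html_data = fetch_html('http://info.cern.ch/hypertext/WWW/TheProject.html')
```

With "raise_for_status()", a failed request stops the program immediately instead of passing an error page to the parser.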

In [38]:
# Parse the HTML --> BeautifulSoup

html_data = response.text
data = BeautifulSoup(html_data, 'html.parser')
print(data.prettify())

<header>
 <title>
  The World Wide Web project
 </title>
 <nextid n="55"/>
</header>
<body>
 <h1>
  World Wide Web
 </h1>
 The WorldWideWeb (W3) is a wide-area
 <a href="WhatIs.html" name="0">
  hypermedia
 </a>
 information retrieval
initiative aiming to give universal
access to a large universe of documents.
 <p>
  Everything there is online about
W3 is linked directly or indirectly
to this document, including an
  <a href="Summary.html" name="24">
   executive
summary
  </a>
  of the project,
  <a href="Administration/Mailing/Overview.html" name="29">
   Mailing lists
  </a>
  ,
  <a href="Policy.html" name="30">
   Policy
  </a>
  , November's
  <a href="News/9211.html" name="34">
   W3  news
  </a>
  ,
  <a href="FAQ/List.html" name="41">
   Frequently Asked Questions
  </a>
  .
  <dl>
   <dt>
    <a href="../DataSources/Top.html" name="44">
     What's out there?
    </a>
    <dd>
     Pointers to the
world's online information,
     <a href="../DataSources/bySubject/Overview.htm

#### Now we can locate and extract the desired data that we want :

In [39]:
# getting the complete text available in the webpage

print(data.get_text())


The World Wide Web project



World Wide WebThe WorldWideWeb (W3) is a wide-area
hypermedia information retrieval
initiative aiming to give universal
access to a large universe of documents.
Everything there is online about
W3 is linked directly or indirectly
to this document, including an executive
summary of the project, Mailing lists
, Policy , November's  W3  news ,
Frequently Asked Questions .

What's out there?
 Pointers to the
world's online information, subjects
, W3 servers, etc.
Help
 on the browser you are using
Software Products
 A list of W3 project
components and their current state.
(e.g. Line Mode ,X11 Viola ,  NeXTStep
, Servers , Tools , Mail robot ,
Library )
Technical
 Details of protocols, formats,
program internals etc
Bibliography
 Paper documentation
on  W3 and references.
People
 A list of some people involved
in the project.
History
 A summary of the history
of the project.
How can I help ?
 If you would like
to support the web..
Getting code
 Getting the cod

In [40]:
# extracting the string present in all the hyperlinks from the webpage

li = data.find_all('a')
for i in li:
  print(i.string)


hypermedia
executive
summary
Mailing lists
Policy
W3  news
Frequently Asked Questions
What's out there?
 subjects
W3 servers
Help
Software Products
Line Mode
Viola
NeXTStep
Servers
Tools
 Mail robot

Library
Technical
Bibliography
People
History
How can I help
Getting code

anonymous FTP


Note : Before scraping the webpage, visit the HTML code of the webpage (through Google Chrome), see what all tags are being used, what kind of data is present under each tag, etc. and then as per your required data, build some logic and start writing code to scrape the data. 

In [41]:
# extracting the string present in all the hyperlinks inside the <dl> tag

li = data.dl.find_all('dt')
for i in li:
  print(i.a.string)

What's out there?
Help
Software Products
Technical
Bibliography
People
History
How can I help
Getting code


Let's try to scrape another website : 'https://books.toscrape.com/'

In [42]:
# Load the HTML

import requests
from bs4 import BeautifulSoup

response = requests.get('https://books.toscrape.com/')

In [43]:
print(response.status_code)

200


Here the status code 200 means the HTTP request has been made successfully.

In [44]:
# Parse the HTML

html_data = response.text
data = BeautifulSoup(html_data, 'html.parser')
print(data.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

#### Now we can locate and extract the desired data that we want :

In [45]:
# extracting the title of the webpage

data.title.string

'\n    All products | Books to Scrape - Sandbox\n'

In [46]:
# extracting the title, url of the 1st book

book1 = data.find(class_ = 'product_pod')
url = book1.h3.a['href']
title = book1.h3.a['title']

In [47]:
title

'A Light in the Attic'

In [48]:
url

'catalogue/a-light-in-the-attic_1000/index.html'

The url that we have got is the "relative url". We have to append it to the "base url" = 'http://books.toscrape.com/' to get the complete url.
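Plain string concatenation works here, but it depends on getting the slashes exactly right. As an alternative sketch, the standard library function "urljoin()" resolves a relative url against the url of the page it was found on.

```python
from urllib.parse import urljoin

# Relative url of the first book, as found on the home page.
book_url = urljoin('http://books.toscrape.com/',
                   'catalogue/a-light-in-the-attic_1000/index.html')

# A relative link is resolved against the directory of the current page.
next_page = urljoin('https://books.toscrape.com/catalogue/page-1.html',
                    'page-2.html')

print(book_url)
print(next_page)
```

Passing the url of the current page as the first argument means we do not have to work out the right base url for each section of the website ourselves.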

In [51]:
# extracting the urls of all the books present in the first webpage

books = data.find_all(class_ = 'product_pod')

base_url = 'http://books.toscrape.com/'
urls = []

for i in books:
  urls.append(base_url + i.h3.a['href'])

for i in urls:
  print(i)

http://books.toscrape.com/frankenstein_20/index.html
http://books.toscrape.com/forever-rockers-the-rocker-12_19/index.html
http://books.toscrape.com/fighting-fate-fighting-6_18/index.html
http://books.toscrape.com/emma_17/index.html
http://books.toscrape.com/eat-pray-love_16/index.html
http://books.toscrape.com/deep-under-walker-security-1_15/index.html
http://books.toscrape.com/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
http://books.toscrape.com/charlie-and-the-chocolate-factory-charlie-bucket-1_13/index.html
http://books.toscrape.com/charitys-cross-charles-towne-belles-4_12/index.html
http://books.toscrape.com/bright-lines_11/index.html
http://books.toscrape.com/bridget-joness-diary-bridget-jones-1_10/index.html
http://books.toscrape.com/bounty-colorado-mountain-7_9/index.html
http://books.toscrape.com/blood-defense-samantha-brinkman-1_8/index.html
http://books.toscrape.com/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
http://books.

This website has a collection of 1000 books which are available in 50 different pages on the website.

I want to scrape the links of all 50 pages present in the website. For this, I have to load the HTML of every webpage, which can be done by sending a get() request to each webpage. Then I have to parse the HTML, look for the class that contains the relative url to the next page, append this relative url to the base url, and finally append the url we get to the list of urls.

In [50]:
# extracting the urls of all 50 webpages present in the website

urls = []
current_page = 'https://books.toscrape.com/catalogue/page-1.html'
base_url = 'https://books.toscrape.com/catalogue/'

response = requests.get(current_page)

while response.status_code == 200:
  data = BeautifulSoup(response.text, 'html.parser')
  next_page = data.find(class_ = 'next')
  if next_page is None:  # this is for the last webpage (has no class with name 'next') otherwise we will get an error  
    break
  next_url = base_url + next_page.a['href']
  urls.append(next_url)
  current_page = next_url
  response = requests.get(current_page)

for i in urls:
  print(i)

https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-3.html
https://books.toscrape.com/catalogue/page-4.html
https://books.toscrape.com/catalogue/page-5.html
https://books.toscrape.com/catalogue/page-6.html
https://books.toscrape.com/catalogue/page-7.html
https://books.toscrape.com/catalogue/page-8.html
https://books.toscrape.com/catalogue/page-9.html
https://books.toscrape.com/catalogue/page-10.html
https://books.toscrape.com/catalogue/page-11.html
https://books.toscrape.com/catalogue/page-12.html
https://books.toscrape.com/catalogue/page-13.html
https://books.toscrape.com/catalogue/page-14.html
https://books.toscrape.com/catalogue/page-15.html
https://books.toscrape.com/catalogue/page-16.html
https://books.toscrape.com/catalogue/page-17.html
https://books.toscrape.com/catalogue/page-18.html
https://books.toscrape.com/catalogue/page-19.html
https://books.toscrape.com/catalogue/page-20.html
https://books.toscrape.com/catalogue/page-21.html
https://
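The same "follow the next link" logic can be tried without touching the network. The following is a minimal sketch: the function name "collect_page_urls", the "fetch" parameter and the three fake pages on example.com are all made up for illustration. Unlike the cell above, it also keeps the starting page in the list.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def collect_page_urls(start_url, fetch):
    """Follow 'next' links from start_url; fetch(url) must return HTML text."""
    urls = [start_url]
    current = start_url
    while True:
        soup = BeautifulSoup(fetch(current), 'html.parser')
        next_item = soup.find(class_='next')
        if next_item is None or next_item.a is None:
            break                                  # last page: no 'next' link
        current = urljoin(current, next_item.a['href'])
        urls.append(current)
    return urls


# A tiny in-memory "website" standing in for real pages (hypothetical data).
fake_site = {
    'https://example.com/catalogue/page-1.html':
        '<li class="next"><a href="page-2.html">next</a></li>',
    'https://example.com/catalogue/page-2.html':
        '<li class="next"><a href="page-3.html">next</a></li>',
    'https://example.com/catalogue/page-3.html':
        '<p>last page</p>',
}

page_urls = collect_page_urls('https://example.com/catalogue/page-1.html',
                              fake_site.__getitem__)
for url in page_urls:
    print(url)
```

For the real website, "fetch" could be any function that downloads a url and returns its HTML text, for example one built on "requests.get()".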

## Lecture - 4 (Store data in CSV)

Let's try to extract multiple information of each and every book and store it in a CSV file.

In this website, for each and every book, there is a dedicated webpage. In the webpage of each book we can find information like Name, Price, Number of Copies (available), etc.

In [62]:
# Load the HTML

import requests
from bs4 import BeautifulSoup

response = requests.get('https://books.toscrape.com/')

In [63]:
print(response.status_code)

200


Here the status code 200 means the HTTP request has been made successfully.

In [64]:
# Parse the HTML

html_data = response.text
data = BeautifulSoup(html_data, 'html.parser')
print(data.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

#### Now we can locate and extract the desired data that we want :

In [65]:
# extracting title, url, price, quantity in stock for book1

book1 = data.find(class_ = 'product_pod')
base_url = 'http://books.toscrape.com/'

book1_url = base_url + book1.h3.a['href']

response = requests.get(book1_url)
data = BeautifulSoup(response.text, 'html.parser')

title = data.h1.string
price = data.find(class_ = 'price_color').string
quantity = data.find(class_ = 'instock availability').contents[-1].strip()   # ".contents" returns the list of all the children

From the variable "price" (string containing currency symbol and value) we need the value only.

From the variable "quantity" (string) we need the numeric value only.

For the 2 variables, we have to do some kind of string processing to get the desired data. For that we can take help of the Python Library "re" (Regular expression operations).

In [66]:
import re

# raw strings (r'...') stop Python from treating the backslash in \d as a string escape
price = float(re.search(r'[\d.]+', price).group())
quantity = int(re.search(r'\d+', quantity).group())

The "re" library is used to find the part of a string that matches a given pattern.

The pattern "\d" matches a single digit (0 - 9). To match one or more consecutive digits we use "\d+".

The "search()" function returns a match object, and its "group()" method returns the part of the string which matches the given pattern.

To get floating values, we can use "[\d.]+", which matches a run of digits and dots.
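To see what each pattern actually picks out, here is a short sketch. The two input strings are illustrative values in the same shape as the scraped "price" and "quantity" text, and the stricter pattern at the end is our own suggestion, not part of the lecture.

```python
import re

price_text = '£51.77'                      # illustrative value
quantity_text = 'In stock (22 available)'  # illustrative value

one_digit = re.search(r'\d', quantity_text).group()      # first digit only
all_digits = re.search(r'\d+', quantity_text).group()    # the whole run of digits
price_part = re.search(r'[\d.]+', price_text).group()    # digits and dots

# [\d.]+ also swallows a trailing full stop, which float() cannot convert.
loose = re.search(r'[\d.]+', 'Price: £51.77.').group()
# A stricter pattern: digits, optionally followed by a dot and more digits.
strict = re.search(r'\d+(?:\.\d+)?', 'Price: £51.77.').group()

print(one_digit, all_digits, price_part, loose, strict)
print(float(price_part), int(all_digits))
```

Note that "search()" returns "None" when nothing matches, so calling "group()" on text without any digits would raise an "AttributeError".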

In [68]:
# converting the title, url, price, quantity in stock for book1 into a CSV file

import pandas as pd

details = []
details.append([title, book1_url, price, quantity])

df = pd.DataFrame(details, columns = ['Title', 'URL', 'Price', 'Quantity in Stock'])
df.to_csv('Book1.csv', index = False)
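If pandas is not available, the same kind of file can be written with the standard library "csv" module. This is a sketch with placeholder rows (the titles, urls and numbers are made up), written to an in-memory buffer so that the result can be inspected directly.

```python
import csv
import io

header = ['Title', 'URL', 'Price', 'Quantity in Stock']
# Placeholder rows; in practice each row would come from one scraped book page.
details = [
    ['A Placeholder Title', 'http://example.com/book-1/index.html', 12.5, 3],
    ['Another Placeholder', 'http://example.com/book-2/index.html', 8.0, 10],
]

# To write a real file instead, use: open('Books.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)      # first line: the column names
writer.writerows(details)    # one line per book

csv_text = buffer.getvalue()
print(csv_text)
```

Appending one list per book to "details" inside a loop is all that is needed to extend this from a single book to every book on the website.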