# Web Scraping using Beautiful Soup
- Beautiful Soup is a Python library for pulling data out of `HTML` and `XML` files
- Beautiful Soup cannot fetch HTML from web sites itself. To get the HTML, we use the `requests` library and then pass the HTML to the Beautiful Soup constructor
- The three main features of Beautiful Soup are:
>- It generates a parse tree of the HTML and offers simple methods for navigating, searching and modifying that parse tree.
>- It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you don't have to worry about encodings
>- It supports different parsers, which Beautiful Soup uses to parse the HTML documents. Examples of parsers are: lxml, html5lib, html.parser
- Different parsers may create different parse trees and so give different results, depending on the HTML that you are trying to parse. If the HTML is perfectly formed, different parsers give almost the same output, but if there are mistakes in the HTML, different parsers fill in the missing information differently.
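
As a rough illustration of the last point, the sketch below parses the same invalid fragment with every parser that happens to be installed. The fragment `<a></p>` and the helper name `parse_with_available_parsers` are assumptions made for this example; parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_available_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse `markup` with every installed parser and return {parser: output}."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = str(BeautifulSoup(markup, parser))
        except FeatureNotFound:
            # This parser is not installed in the current environment; skip it.
            continue
    return results

# "<a></p>" is invalid HTML: a closing </p> with no opening <p>.
results = parse_with_available_parsers("<a></p>")
for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

With the built-in `html.parser` the stray `</p>` is simply dropped, while lxml and html5lib also wrap the result in `<html>` and `<body>` tags, so the three trees are not identical.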

### `urllib` 
- urllib is a package that collects several modules for working with URLs
>- `urllib.request` for opening and reading URLs, using a variety of protocols
>- `urllib.error` containing the exceptions raised by `urllib.request`
>- `urllib.parse` for parsing URLs
>- `urllib.robotparser` for parsing `robots.txt` files
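
The following is a minimal sketch of two of these modules, using only the standard library. The `robots.txt` rules and the `example.com` URLs are made up for the example and do not describe any real site.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components
parts = urlparse("https://arifpucit.github.io/bss2/index.html?page=1")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.parse: resolve a relative link against the page it was found on
absolute = urljoin("https://arifpucit.github.io/bss2/", "SP.html")
print(absolute)

# urllib.robotparser: check whether a path may be crawled.
# These rules are hypothetical; a real crawler would load the site's own robots.txt.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rules.can_fetch("*", "https://example.com/books/index.html")
blocked = rules.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)
```
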
#### Download & Install Beautiful Soup

In [2]:
import sys
!{sys.executable} -m pip install --upgrade pip -q
!{sys.executable} -m pip install requests -q
!{sys.executable} -m pip install beautifulsoup4 -q
!{sys.executable} -m pip install --upgrade lxml -q
!{sys.executable} -m pip install html5lib -q

In [3]:
import requests
import bs4 # import name of the beautifulsoup4 distribution (the PyPI package named "bs4" is only a dummy that prevents name squatting)
from bs4 import BeautifulSoup
import lxml
import html5lib
requests.__version__, bs4.__version__, lxml.__version__

('2.31.0', '4.11.1', '4.9.1')

### Fetching HTML Contents Using `requests` Library

In [5]:
import requests
print(dir(requests))



In [6]:
responce = requests.get("https://arifpucit.github.io/bss2")
responce.status_code

200

>- Codes in the 100 range are informational messages
>- Codes in the 200 range are success messages
>- Codes in the 300 range are redirections
>- Codes in the 400 range are client-side errors
>- Codes in the 500 range are server-side errors
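
The ranges above can be sketched as a small helper built on the standard library's `http.HTTPStatus`. The function name `describe_status` is an assumption made for this example, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Map an HTTP status code to its class, following the ranges listed above."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = classes.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase   # standard reason phrase for this code
    except ValueError:
        phrase = "non-standard code"
    return f"{code} {phrase} ({category})"

for code in (200, 301, 404, 503):
    print(describe_status(code))
```

In a scraper it is usually enough to check `responce.ok` or to call `responce.raise_for_status()`, which raises an exception for 4xx and 5xx responses; both names appear in the output of `dir(responce)`.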

In [7]:
print(dir(responce))

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [12]:
# return the url of the web page
print(responce.url)

https://arifpucit.github.io/bss2/


In [11]:
# return the headers information
print(responce.headers)

{'Connection': 'keep-alive', 'Content-Length': '2757', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'permissions-policy': 'interest-cohort=()', 'Last-Modified': 'Mon, 27 Jun 2022 12:32:49 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"62b9a371-33ad"', 'expires': 'Wed, 16 Aug 2023 05:03:09 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'BAD4:2F07:51BAC4:5D35C9:64DC5634', 'Accept-Ranges': 'bytes', 'Date': 'Wed, 16 Aug 2023 04:53:09 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-fjr990029-FJR', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1692161589.990542,VS0,VE317', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '5d0c1c16a219505dc2a7a65f0f7df544adbecc50'}


In [14]:
# Return the HTML of the web page as bytes
print(responce.content)

b'<!doctype html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>BSS2</title>\n    <!-- external style sheet -->\n    <link rel="stylesheet" href="./index.css">\n\n    <!--Bootstrap style sheet-->\n    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" crossorigin="anonymous">\n\n    <!--for icons of tick and cross ans star for in stock-->\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> \n\n  </head>\n  <body>\n    <header class="header d-flex align-items-center justify-content-between">\n        <img class="image-container" src="./images//arif.jpg" alt="arif"/>\n         <p> <span class="large_text">Books Scraping Site</span></p>\n        <img class="image-container" src="./images/pu

In [13]:
# Return the HTML of the web page as a string
print(responce.text)

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>BSS2</title>
    <!-- external style sheet -->
    <link rel="stylesheet" href="./index.css">

    <!--Bootstrap style sheet-->
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" crossorigin="anonymous">

    <!--for icons of tick and cross ans star for in stock-->
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> 

  </head>
  <body>
    <header class="header d-flex align-items-center justify-content-between">
        <img class="image-container" src="./images//arif.jpg" alt="arif"/>
         <p> <span class="large_text">Books Scraping Site</span></p>
        <img class="image-container" src="./images/pucit.jpg" alt="pucit"/>

### Creating the `Soup` Object from the `BeautifulSoup` Library
- The `BeautifulSoup()` constructor is used to create a BeautifulSoup object
>- BeautifulSoup(markup, 'lxml')
- The first argument to the BeautifulSoup constructor is a string or an open filehandle containing the markup you want to be parsed
- The second argument is how you would like the markup parsed. If you do not specify anything, you will get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, and then Python's built-in parser
- The constructor returns a BeautifulSoup object, which represents the parsed document and knows how to navigate through the DOM.
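
A minimal sketch of both kinds of first argument, using a small made-up document and the built-in `html.parser` so that it runs without extra installs. `io.StringIO` stands in for a real file opened with `open()`.

```python
import io
from bs4 import BeautifulSoup

markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1) Markup passed as a string, with the parser named explicitly
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) Markup passed as an open file-like object
with io.StringIO(markup) as handle:
    soup_from_file = BeautifulSoup(handle, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.p.string)
```

Naming the parser explicitly keeps the result the same on machines where a different set of parsers is installed.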

In [15]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(responce.text, 'lxml')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [16]:
# All the HTML code without indentation
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>BSS2</title>
<!-- external style sheet -->
<link href="./index.css" rel="stylesheet"/>
<!--Bootstrap style sheet-->
<link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" rel="stylesheet"/>
<!--for icons of tick and cross ans star for in stock-->
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
</head>
<body>
<header class="header d-flex align-items-center justify-content-between">
<img alt="arif" class="image-container" src="./images//arif.jpg"/>
<p> <span class="large_text">Books Scraping Site</span></p>
<img alt="pucit" class="image-container" src="./images/pucit.jpg"/>
</header>
<section>
<div class="main-container d-flex align-items-sta

In [18]:
# To get html code in proper format
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   BSS2
  </title>
  <!-- external style sheet -->
  <link href="./index.css" rel="stylesheet"/>
  <!--Bootstrap style sheet-->
  <link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" rel="stylesheet"/>
  <!--for icons of tick and cross ans star for in stock-->
  <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
 </head>
 <body>
  <header class="header d-flex align-items-center justify-content-between">
   <img alt="arif" class="image-container" src="./images//arif.jpg"/>
   <p>
    <span class="large_text">
     Books Scraping Site
    </span>
   </p>
   <img alt="pucit" class="image-container" src="./images/pucit.jpg"/>
  </header>

### Tag Objects

In [21]:
soup.header

<header class="header d-flex align-items-center justify-content-between">
<img alt="arif" class="image-container" src="./images//arif.jpg"/>
<p> <span class="large_text">Books Scraping Site</span></p>
<img alt="pucit" class="image-container" src="./images/pucit.jpg"/>
</header>

In [22]:
# First <p> tag in the document
soup.p

<p> <span class="large_text">Books Scraping Site</span></p>

#### 1) Name Object

In [23]:
soup.header.name

'header'

In [24]:
soup.img.name

'img'

In [25]:
soup.p.name

'p'

In [26]:
print(type(soup.header.name))
print(type(soup.img.name))

<class 'str'>
<class 'str'>


#### 2) Attribute Object

In [27]:
soup.p.attrs

{}

In [28]:
soup.img.attrs

{'class': ['image-container'], 'src': './images//arif.jpg', 'alt': 'arif'}

#### 3) NavigableString Object

In [29]:
soup.title.string

'BSS2'
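
The three kinds of objects above can be explored together in one short sketch. The markup is a shortened version of the page header shown earlier, and `html.parser` is used only to keep the example free of extra installs.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = ('<header><img alt="arif" class="image-container" src="./images/arif.jpg"/>'
        '<p><span class="large_text">Books Scraping Site</span></p></header>')
soup = BeautifulSoup(html, "html.parser")

img = soup.img
print(isinstance(img, Tag))     # each element is represented by a Tag object
print(img["alt"])               # dictionary-style access; raises KeyError if the attribute is missing
print(img.get("title"))         # .get() returns None for a missing attribute instead of raising
print(img["class"])             # class may hold several values, so it comes back as a list

text = soup.span.string
print(isinstance(text, NavigableString), isinstance(text, str))
```
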

### Navigating the Entire Tree of the Soup Object

In [31]:
# First anchor (<a>) tag in the body
soup.body.a

<a href="index.html">Operating System</a>

In [33]:
soup.body.a.parent

<li class="link book_type"><a href="index.html">Operating System</a></li>

In [34]:
soup.body.a.parent.parent

<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>

In [35]:
soup.body.ul

<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>

In [36]:
soup.body.ul.children

<list_iterator at 0x114e2aa3a90>

In [37]:
for i in soup.body.ul.children:
    print(i)



<div class="link text-center" id="book_title">Books Titles</div>


<li class="link book_type"><a href="index.html">Operating System</a></li>


<li class="link book_type"><a href="SP.html">System Programming</a></li>


<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
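
The blank lines in the output above appear because `.children` yields the newline strings between the tags as well as the tags themselves. A minimal sketch, on a shortened copy of the navigation list, of two ways to keep only the tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<ul class="nav-links">
<li><a href="index.html">Operating System</a></li>
<li><a href="SP.html">System Programming</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

children = list(soup.ul.children)                           # tags AND the newline strings between them
tag_children = [c for c in children if isinstance(c, Tag)]  # keep only the tags
direct_items = soup.ul.find_all("li", recursive=False)      # direct <li> children only

print(len(children), len(tag_children), len(direct_items))
print([li.a["href"] for li in direct_items])
```
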




### Using the `soup.find()` Method
- The `soup.find()` method returns the first tag that matches the given criteria, or `None` if nothing matches:
>- ***soup.find(name = None, attrs = {}, recursive = True, text = None, ** kwargs)***
- where
>- `name` is the tag name to search for
>- `attrs = {}` a dictionary of filters on attribute values
>- `recursive = True` If this is true, `find()` will perform a recursive search of this PageElement's descendants. Otherwise, only the direct children will be considered.
>- `text = None` is used if you want to search for a text string instead of a tag name (newer versions of Beautiful Soup call this argument `string`)
>- `find()` has no `limit` argument: it stops at the first result, like `find_all()` with `limit = 1`

***The `find()` method can be called on the entire soup object, or on a specific tag within the soup object to search only inside that tag***
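
A minimal sketch of calling `find()` on the whole document and on a single tag. The `footer` block in the markup is hypothetical and only there to give the scoped search something different to find.

```python
from bs4 import BeautifulSoup

html = """
<div class="navbar"><ul><li><a href="index.html">Operating System</a></li></ul></div>
<div class="footer"><a href="about.html">About</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() called on the whole soup returns the first match in the document
first_link = soup.find("a")

# find() called on a tag searches only inside that tag
footer = soup.find("div", class_="footer")
footer_link = footer.find("a")

# find() returns None when nothing matches, so check before using the result
missing = soup.find("table")
if missing is None:
    print("no <table> on this page")

print(first_link["href"], footer_link["href"])
```
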

In [38]:
# Find a specific tag (div) that has a specific attribute value (class = navbar)
soup.find('div', {'class':'navbar'})

<div class="navbar">
<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>
</div>

In [40]:
# The class_ keyword argument can also be used
soup.find('div', class_ = 'navbar')

<div class="navbar">
<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>
</div>

### Using the `soup.find_all()` Method
- The `soup.find_all()` method returns a list of all the tags or strings that match the given criteria:
>- ***soup.find_all(name = None, attrs = {}, recursive = True, string = None, limit = None, ** kwargs)***
- where
>- `name` is the name of the tag to return
>- `attrs = {}` a dictionary of filters on attribute values
>- `recursive = True` If this is true, `find_all()` will perform a recursive search of all the descendants. Otherwise, only the direct children will be considered
>- `string = None` is used if you want to search for a text string instead of a tag name (`text` is the older name of this argument)
>- `limit` is the maximum number of elements to return. Defaults to all matches (`find()` is similar to `find_all()` with `limit = 1`, except that it returns the tag itself rather than a list)
- ***Note:*** A class attribute containing a space-separated string means multiple classes, whereas an id attribute containing a space-separated string means a single id whose name contains spaces
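
Because of the way multi-valued class attributes are matched, the value given to `class_` matters. The sketch below uses three made-up `<p>` tags to compare a single class, an exact class string, a CSS selector, and `limit`.

```python
from bs4 import BeautifulSoup

html = """
<p class="price green">Rs.2000</p>
<p class="green price">Rs.5000</p>
<p class="price">Rs.6900</p>
"""
soup = BeautifulSoup(html, "html.parser")

any_price = soup.find_all("p", class_="price")            # every tag that has the class "price"
exact = soup.find_all("p", class_="price green")          # only the exact attribute string "price green"
both = soup.select("p.price.green")                       # CSS selector: both classes, in any order
first_two = soup.find_all("p", class_="price", limit=2)   # stop after two matches

print(len(any_price), len(exact), len(both), len(first_two))
```

Searching for `class_ = 'price green'` works on this page because every price tag carries exactly that string; a CSS selector is a more tolerant choice if the class order could vary.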

In [42]:
prices = soup.find_all('p', class_ = 'price green')
prices

[<p class="price green">Rs.2000</p>,
 <p class="price green">Rs.5000</p>,
 <p class="price green">Rs.6900</p>,
 <p class="price green">Rs.2700</p>,
 <p class="price green">Rs.1700</p>,
 <p class="price green">Rs.1800</p>,
 <p class="price green">Rs.6000</p>,
 <p class="price green">Rs.1000</p>,
 <p class="price green">Rs.1800</p>]

In [44]:
# prices is a list, so we iterate over it with a for loop
for i in prices:
    print(i.text)

Rs.2000
Rs.5000
Rs.6900
Rs.2700
Rs.1700
Rs.1800
Rs.6000
Rs.1000
Rs.1800


### Example # 01: Scraping Information from a Single Web Page

In [1]:
import requests
from bs4 import BeautifulSoup
import lxml

In [2]:
responce = requests.get('https://arifpucit.github.io/bss2')

In [3]:
soup = BeautifulSoup(responce.text, 'lxml')

#### 1) Extract Book Title and Author Name

In [4]:
sp_titles = soup.find_all('p', class_ = 'book_name')
sp_titles

[<p class="book_name"><a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">Operating System Concepts By Avi Silberschatz</a></p>,
 <p class="book_name"><a href="https://www.google.com/search?q=Unix+the+textbook+by+mansoor&amp;rlz=1C1CHBD_enPK987PK987&amp;oq=unix+the+textbook+by+mansoor&amp;aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&amp;sourceid=chrome&amp;ie=UTF-8" target="_blank">UNIX The Textbook By Syed Mansoor Sarwar</a></p>,
 <p class="book_name"><a href="https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092" target="_blank">Taxonomy of IDS By Arif Butt</a></p>,
 <p class="book_name"><a href="https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251" target="_blank">Understanding operating systems By Ida Flynn</a></p>,
 <p class="book_name"><a href="https://www.goodreads.com/book/show/829182.Computer_Systems" target="_blank">Computer Systems  By Randal E. Bryant </a></p>,
 <p class="book_name"><

In [5]:
titles = []
for i in sp_titles:
    titles.append(i.text)
print(titles)

['Operating System Concepts By Avi Silberschatz', 'UNIX The Textbook By Syed Mansoor Sarwar', 'Taxonomy of IDS By Arif Butt', 'Understanding operating systems By Ida Flynn', 'Computer Systems  By Randal E. Bryant ', 'Linux bible  Book By Christopher Negus', 'Advanced Programming in the UNIX Environment  By W. Stevans', 'Operating Systems: A Design-oriented Approach By Charles Patrick Crowley', 'Hands-On Network Programming with C  By Lewis Van Winkle']


#### 2) Extract Links of the Books

In [6]:
sp_links = soup.find_all('p', class_='book_name')
sp_links

[<p class="book_name"><a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">Operating System Concepts By Avi Silberschatz</a></p>,
 <p class="book_name"><a href="https://www.google.com/search?q=Unix+the+textbook+by+mansoor&amp;rlz=1C1CHBD_enPK987PK987&amp;oq=unix+the+textbook+by+mansoor&amp;aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&amp;sourceid=chrome&amp;ie=UTF-8" target="_blank">UNIX The Textbook By Syed Mansoor Sarwar</a></p>,
 <p class="book_name"><a href="https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092" target="_blank">Taxonomy of IDS By Arif Butt</a></p>,
 <p class="book_name"><a href="https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251" target="_blank">Understanding operating systems By Ida Flynn</a></p>,
 <p class="book_name"><a href="https://www.goodreads.com/book/show/829182.Computer_Systems" target="_blank">Computer Systems  By Randal E. Bryant </a></p>,
 <p class="book_name"><

In [7]:
for i in sp_links:
    print(i.find('a'))

<a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">Operating System Concepts By Avi Silberschatz</a>
<a href="https://www.google.com/search?q=Unix+the+textbook+by+mansoor&amp;rlz=1C1CHBD_enPK987PK987&amp;oq=unix+the+textbook+by+mansoor&amp;aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&amp;sourceid=chrome&amp;ie=UTF-8" target="_blank">UNIX The Textbook By Syed Mansoor Sarwar</a>
<a href="https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092" target="_blank">Taxonomy of IDS By Arif Butt</a>
<a href="https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251" target="_blank">Understanding operating systems By Ida Flynn</a>
<a href="https://www.goodreads.com/book/show/829182.Computer_Systems" target="_blank">Computer Systems  By Randal E. Bryant </a>
<a href="https://www.amazon.com/Linux-Bible-Christopher-Negus/dp/111821854X" target="_blank">Linux bible  Book By Christopher Negus</a>
<a href="https://www.a

In [8]:
links = []
for i in sp_links:
    links.append(i.find('a').get('href'))
links

['https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339',
 'https://www.google.com/search?q=Unix+the+textbook+by+mansoor&rlz=1C1CHBD_enPK987PK987&oq=unix+the+textbook+by+mansoor&aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&sourceid=chrome&ie=UTF-8',
 'https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092',
 'https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251',
 'https://www.goodreads.com/book/show/829182.Computer_Systems',
 'https://www.amazon.com/Linux-Bible-Christopher-Negus/dp/111821854X',
 'https://www.amazon.com/dp/0321637739?tag=uuid10-20',
 'https://www.amazon.com/s?k=Operating+Systems%3A+A+Design-oriented+Approach&i=stripbooks-intl-ship&ref=nb_sb_noss',
 'https://www.amazon.com/Hands-Network-Programming-programming-optimized/dp/1789349869/ref=sr_1_1?crid=11FC0M0GAFA21&amp&keywords=unix+network+programming+2019&amp&qid=1653381349&amp&s=books&amp&sprefix=unix+network+programming+2019%2Cstripbooks-intl-ship%2C356&amp

#### 3) Extract Prices

In [10]:
sp_prices = soup.find_all('p', class_ = 'price green')
sp_prices

[<p class="price green">Rs.2000</p>,
 <p class="price green">Rs.5000</p>,
 <p class="price green">Rs.6900</p>,
 <p class="price green">Rs.2700</p>,
 <p class="price green">Rs.1700</p>,
 <p class="price green">Rs.1800</p>,
 <p class="price green">Rs.6000</p>,
 <p class="price green">Rs.1000</p>,
 <p class="price green">Rs.1800</p>]

In [11]:
prices = []
for i in sp_prices:
    prices.append(i.text)
prices

['Rs.2000',
 'Rs.5000',
 'Rs.6900',
 'Rs.2700',
 'Rs.1700',
 'Rs.1800',
 'Rs.6000',
 'Rs.1000',
 'Rs.1800']
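
The prices are still strings at this point. If they are needed as numbers (for sorting or summing), one possible approach is sketched below; the helper name `parse_price` is an assumption for this example.

```python
import re

def parse_price(text):
    """Convert a price label such as 'Rs.2000' into an integer number of rupees."""
    match = re.search(r"\d+", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no digits found in {text!r}")
    return int(match.group())

sample_prices = ['Rs.2000', 'Rs.5000', 'Rs.6900']   # first three values from the output above
numeric_prices = [parse_price(p) for p in sample_prices]
print(numeric_prices, sum(numeric_prices))
```
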

#### 4) Extract Availability of Books

In [12]:
sp_availability = soup.find_all('p', class_ = 'stock')
sp_availability

[<p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock not_stock" data-stock="not in stock"><i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>,
 <p class="stock not_stock" data-stock="not in stock"><i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock not_stock" data-stock="not in stock"><i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></

In [13]:
availability = []
for i in sp_availability:
    availability.append(i.text)
availability

[' In stock',
 ' In stock',
 ' Not in stock',
 ' Not in stock',
 ' In stock',
 ' Not in stock',
 ' In stock',
 ' In stock',
 ' In stock']
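
Each value above starts with a space because the text follows the `<i>` icon inside the tag. A short sketch of two ways to get a clean value, using one tag copied from the output above:

```python
from bs4 import BeautifulSoup

html = ('<p class="stock in_stock" data-stock="in stock">'
        '<i aria-hidden="true" class="fa fa-check"></i> In stock</p>')
tag = BeautifulSoup(html, "html.parser").p

raw = tag.text                        # keeps the leading space that follows the <i> icon
clean = tag.get_text(strip=True)      # strips whitespace around the text
from_attr = tag["data-stock"]         # the same information, read from the data-stock attribute
in_stock = from_attr == "in stock"    # boolean flag, easier to filter on later

print(repr(raw), repr(clean), from_attr, in_stock)
```
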

#### 5) Extract Count of Reviews

In [14]:
sp_reviews = soup.find_all('p', class_ = 'review')
sp_reviews

[<p class="review green" data-rating="20">20 Reviews</p>,
 <p class="review green" data-rating="100">100 Reviews</p>,
 <p class="review green" data-rating="20">20 Reviews</p>,
 <p class="review green" data-rating="60">60 Reviews</p>,
 <p class="review green" data-rating="25">25 Reviews</p>,
 <p class="review green" data-rating="21">21 Reviews</p>,
 <p class="review green" data-rating="40">40 Reviews</p>,
 <p class="review green" data-rating="90">90 Reviews</p>,
 <p class="review green" data-rating="70">70 Reviews</p>]

In [15]:
reviews = []
for i in sp_reviews:
    reviews.append(i.get('data-rating'))
print(reviews)

['20', '100', '20', '60', '25', '21', '40', '90', '70']


#### 6) Extract Star Ratings

In [19]:
book = soup.find('div', class_ = 'book_container')
print(book.prettify())

<div class="book_container col-sm-4">
 <img alt="" src="images/OS concepts.jpg" title="The Linux Programming Interface (TLPI) is the definitive guide 
 to the Linux and UNIX programming interface—the interface
 employed by nearly every application that runs on a 
Linux or UNIX system."/>
 <p class="book_name">
  <a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">
   Operating System Concepts By Avi Silberschatz
  </a>
 </p>
 <div class="align-left">
  <p class="price green">
   Rs.2000
  </p>
  <p class="stock in_stock" data-stock="in stock">
   <i aria-hidden="true" class="fa fa-check">
   </i>
   In stock
  </p>
  <div>
   <span class="fa fa-star">
   </span>
   <span class="fa fa-star">
   </span>
   <span class="fa fa-star">
   </span>
   <span class="fa fa-star not_filled">
   </span>
   <span class="fa fa-star not_filled">
   </span>
  </div>
  <p class="review green" data-rating="20">
   20 Reviews
  </p>
  <button>
   Add

In [20]:
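stars = list()
# attrs must be a dictionary ({'class': ...}); {'class', ...} with a comma is a set
books = soup.find_all('div', {'class': 'book_container'})
for book in books:
    # stars = 5 minus the number of unfilled star icons in this book's container
    stars.append(5 - len(book.find_all('span', {'class': 'not_filled'})))
print(stars)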

[3, 5, 4, 2, 2, 1, 1, 3, 4]


#### Display Output on Screen

In [21]:
for i in range(len(links)):
    print("      Link: ",links[i])
    print("      Price: ",prices[i])
    print("      Stock: ",availability[i])
    print("      Reviews: ",reviews[i])
    print("      Stars: ",stars[i])

      Link:  https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339
      Price:  Rs.2000
      Stock:   In stock
      Reviews:  20
      Stars:  3
      Link:  https://www.google.com/search?q=Unix+the+textbook+by+mansoor&rlz=1C1CHBD_enPK987PK987&oq=unix+the+textbook+by+mansoor&aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&sourceid=chrome&ie=UTF-8
      Price:  Rs.5000
      Stock:   In stock
      Reviews:  100
      Stars:  5
      Link:  https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092
      Price:  Rs.6900
      Stock:   Not in stock
      Reviews:  20
      Stars:  4
      Link:  https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251
      Price:  Rs.2700
      Stock:   Not in stock
      Reviews:  60
      Stars:  2
      Link:  https://www.goodreads.com/book/show/829182.Computer_Systems
      Price:  Rs.1700
      Stock:   In stock
      Reviews:  25
      Stars:  2
      Link:  https://www.amazon.com/Linux-Bible-C

#### Saving data in the CSV File

In [22]:
import pandas as pd
data = {'Title/Author':titles, "Price":prices, 'Availability':availability, 'Reviews':reviews, 'Links': links, 'Stars':stars}
df = pd.DataFrame(data, columns = ['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books.csv', index = False)
df = pd.read_csv('books.csv')
df

,Title/Author,Price,Availability,Reviews,Links,Stars
0,Operating System Concepts By Avi Silberschatz,Rs.2000,In stock,20,https://www.amazon.com/Operating-System-Concep...,3
1,UNIX The Textbook By Syed Mansoor Sarwar,Rs.5000,In stock,100,https://www.google.com/search?q=Unix+the+textb...,5
2,Taxonomy of IDS By Arif Butt,Rs.6900,Not in stock,20,https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d...,4
3,Understanding operating systems By Ida Flynn,Rs.2700,Not in stock,60,https://www.amazon.com/Understanding-Operating...,2
4,Computer Systems By Randal E. Bryant,Rs.1700,In stock,25,https://www.goodreads.com/book/show/829182.Com...,2
5,Linux bible Book By Christopher Negus,Rs.1800,Not in stock,21,https://www.amazon.com/Linux-Bible-Christopher...,1
6,Advanced Programming in the UNIX Environment ...,Rs.6000,In stock,40,https://www.amazon.com/dp/0321637739?tag=uuid1...,1
7,Operating Systems: A Design-oriented Approach ...,Rs.1000,In stock,90,https://www.amazon.com/s?k=Operating+Systems%3...,3
8,Hands-On Network Programming with C By Lewis ...,Rs.1800,In stock,70,https://www.amazon.com/Hands-Network-Programmi...,4
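
If pandas is not available, the standard library's `csv` module can write the same kind of file. This is a sketch under that assumption, not the approach used in this notebook; the two rows are taken from the table above and `io.StringIO` stands in for a real file such as `open('books.csv', 'w', newline='')`.

```python
import csv
import io

rows = [
    {"Title/Author": "Operating System Concepts By Avi Silberschatz", "Price": "Rs.2000", "Stars": 3},
    {"Title/Author": "UNIX The Textbook By Syed Mansoor Sarwar", "Price": "Rs.5000", "Stars": 5},
]

buffer = io.StringIO()      # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["Title/Author", "Price", "Stars"])
writer.writeheader()
writer.writerows(rows)

# Read the data back to confirm the round trip
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["Title/Author"], read_back[1]["Stars"])
```

Values read back from a CSV file are strings, so numeric columns such as `Stars` need converting again after loading.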


### Consolidating in a Single Script

In [23]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]

def books(soup):
    sp_titles = soup.find_all('p', class_="book_name")
    sp_prices = soup.find_all('p', class_="price green")
    sp_availability = soup.find_all('p', class_='stock')
    sp_reviews = soup.find_all('p', {'class': 'review'})
    # The link of each book is the href of the <a> tag inside its title paragraph
    sp_links=[]
    for val in sp_titles:
        sp_links.append(val.find('a').get('href'))
    # Stars = 5 minus the number of unfilled star icons in each book container
    containers = soup.find_all('div', {'class': 'book_container'})
    for book in containers:
        stars.append(5 - len(book.find_all('span', {'class': 'not_filled'})))
    
    for i in range(len(sp_titles)):
        titles.append(sp_titles[i].text)
        prices.append(sp_prices[i].text)
        availability.append(sp_availability[i].text)
        reviews.append(sp_reviews[i].text)
        links.append(sp_links[i])

        
resp = requests.get("https://arifpucit.github.io/bss2")
soup = BeautifulSoup(resp.text, 'lxml')
books(soup)


data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, 
        'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df

,Title/Author,Price,Availability,Reviews,Links,Stars
0,Operating System Concepts By Avi Silberschatz,Rs.2000,In stock,20 Reviews,https://www.amazon.com/Operating-System-Concep...,3
1,UNIX The Textbook By Syed Mansoor Sarwar,Rs.5000,In stock,100 Reviews,https://www.google.com/search?q=Unix+the+textb...,5
2,Taxonomy of IDS By Arif Butt,Rs.6900,Not in stock,20 Reviews,https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d...,4
3,Understanding operating systems By Ida Flynn,Rs.2700,Not in stock,60 Reviews,https://www.amazon.com/Understanding-Operating...,2
4,Computer Systems By Randal E. Bryant,Rs.1700,In stock,25 Reviews,https://www.goodreads.com/book/show/829182.Com...,2
5,Linux bible Book By Christopher Negus,Rs.1800,Not in stock,21 Reviews,https://www.amazon.com/Linux-Bible-Christopher...,1
6,Advanced Programming in the UNIX Environment ...,Rs.6000,In stock,40 Reviews,https://www.amazon.com/dp/0321637739?tag=uuid1...,1
7,Operating Systems: A Design-oriented Approach ...,Rs.1000,In stock,90 Reviews,https://www.amazon.com/s?k=Operating+Systems%3...,3
8,Hands-On Network Programming with C By Lewis ...,Rs.1800,In stock,70 Reviews,https://www.amazon.com/Hands-Network-Programmi...,4
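
The script above fills six parallel lists, which stay aligned only if every book has every field. A hedged alternative is to build one record per `book_container`, as sketched below. The markup is a trimmed-down imitation of the container printed earlier, and the `example.com` links and the helper name `parse_book` are assumptions for this example.

```python
from bs4 import BeautifulSoup

FILLED = '<span class="fa fa-star"></span>'
EMPTY = '<span class="fa fa-star not_filled"></span>'

html = f"""
<div class="book_container">
  <p class="book_name"><a href="https://example.com/os">Operating System Concepts</a></p>
  <p class="price green">Rs.2000</p>
  <p class="stock in_stock" data-stock="in stock"><i class="fa fa-check"></i> In stock</p>
  <div>{FILLED * 3}{EMPTY * 2}</div>
  <p class="review green" data-rating="20">20 Reviews</p>
</div>
<div class="book_container">
  <p class="book_name"><a href="https://example.com/unix">UNIX The Textbook</a></p>
  <p class="price green">Rs.5000</p>
  <p class="stock not_stock" data-stock="not in stock"><i class="fa fa-times"></i> Not in stock</p>
  <div>{FILLED * 5}</div>
  <p class="review green" data-rating="100">100 Reviews</p>
</div>
"""

def parse_book(container, max_stars=5):
    """Build one record from a single book_container <div>."""
    link = container.find("p", class_="book_name").find("a")
    return {
        "Title/Author": link.get_text(strip=True),
        "Price": container.find("p", class_="price").get_text(strip=True),
        "Availability": container.find("p", class_="stock").get_text(strip=True),
        "Reviews": int(container.find("p", class_="review")["data-rating"]),
        "Links": link["href"],
        "Stars": max_stars - len(container.find_all("span", class_="not_filled")),
    }

soup = BeautifulSoup(html, "html.parser")
records = [parse_book(div) for div in soup.find_all("div", class_="book_container")]
for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`.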


### Example # 02: Scraping Information from Multiple Web Pages

In [24]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]

def books(soup):
    sp_titles = soup.find_all('p', class_="book_name")
    sp_prices = soup.find_all('p', class_="price green")
    sp_availability = soup.find_all('p', class_='stock')
    sp_reviews = soup.find_all('p', {'class': 'review'})
    # The link of each book is the href of the <a> tag inside its title paragraph
    sp_links=[]
    for val in sp_titles:
        sp_links.append(val.find('a').get('href'))
    # Stars = 5 minus the number of unfilled star icons in each book container
    containers = soup.find_all('div', {'class': 'book_container'})
    for book in containers:
        stars.append(5 - len(book.find_all('span', {'class': 'not_filled'})))
    
    for i in range(len(sp_titles)):
        titles.append(sp_titles[i].text)
        prices.append(sp_prices[i].text)
        availability.append(sp_availability[i].text)
        reviews.append(sp_reviews[i].text)
        links.append(sp_links[i])


urls = ['https://arifpucit.github.io/bss2/index.html', 
        'https://arifpucit.github.io/bss2/SP.html', 
        'https://arifpucit.github.io/bss2/CA.html']                  
for url in urls:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')
    books(soup)

# Creating a dataframe and saving data in a csv file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, 
        'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df

,Title/Author,Price,Availability,Reviews,Links,Stars
0,Operating System Concepts By Avi Silberschatz,Rs.2000,In stock,20 Reviews,https://www.amazon.com/Operating-System-Concep...,3
1,UNIX The Textbook By Syed Mansoor Sarwar,Rs.5000,In stock,100 Reviews,https://www.google.com/search?q=Unix+the+textb...,5
2,Taxonomy of IDS By Arif Butt,Rs.6900,Not in stock,20 Reviews,https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d...,4
3,Understanding operating systems By Ida Flynn,Rs.2700,Not in stock,60 Reviews,https://www.amazon.com/Understanding-Operating...,2
4,Computer Systems By Randal E. Bryant,Rs.1700,In stock,25 Reviews,https://www.goodreads.com/book/show/829182.Com...,2
5,Linux bible Book By Christopher Negus,Rs.1800,Not in stock,21 Reviews,https://www.amazon.com/Linux-Bible-Christopher...,1
6,Advanced Programming in the UNIX Environment ...,Rs.6000,In stock,40 Reviews,https://www.amazon.com/dp/0321637739?tag=uuid1...,1
7,Operating Systems: A Design-oriented Approach ...,Rs.1000,In stock,90 Reviews,https://www.amazon.com/s?k=Operating+Systems%3...,3
8,Hands-On Network Programming with C By Lewis ...,Rs.1800,In stock,70 Reviews,https://www.amazon.com/Hands-Network-Programmi...,4
9,LINUX & UNIX Programming Tools By Syed Mansoo...,Rs.5000,In stock,200 Reviews,https://www.amazon.com/LINUX-UNIX-Programming-...,2
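
The `Stars` column is derived by counting the unfilled star icons of each book and subtracting that count from five. A minimal, self-contained sketch of that idea follows. The HTML snippet and the `book` and `filled` class names are made up for illustration (only `not_filled` comes from the code above), and `html.parser` is used so that the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each book shows five star <span>s, some marked 'not_filled'
html = """
<div class="book">
  <span class="filled"></span><span class="filled"></span><span class="filled"></span>
  <span class="not_filled"></span><span class="not_filled"></span>
</div>
<div class="book">
  <span class="filled"></span>
  <span class="not_filled"></span><span class="not_filled"></span>
  <span class="not_filled"></span><span class="not_filled"></span>
</div>
"""


def star_ratings(soup, max_stars=5):
    """Return one rating per book: max_stars minus the number of unfilled stars."""
    ratings = []
    for book in soup.find_all('div', class_='book'):
        unfilled = book.find_all('span', class_='not_filled')
        ratings.append(max_stars - len(unfilled))
    return ratings


soup = BeautifulSoup(html, 'html.parser')
ratings = star_ratings(soup)
print(ratings)  # [3, 1]
```

Passing the class as `class_='not_filled'` (or as the dictionary `{'class': 'not_filled'}`) makes the intent explicit: only spans carrying that class are counted.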


### Example # 03: Scraping Information from Multiple Web Pages (Pagination)

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
descriptions = []
links = []


def videos(soup):
    """Append the title, description and YouTube link of every video on the page."""
    articles = soup.find_all('div', class_='media-body')
    for article in articles:
        title = article.find('h4', class_="media-heading1").text
        titles.append(title)

        descr = article.find('p', align='justify').text
        descriptions.append(descr)

        # The iframe src is an embed URL: its fifth '/'-separated piece holds the
        # video id, and anything after a '?' (the query string) is dropped
        video_id = article.find('iframe')['src'].split('/')[4].split('?')[0]
        youtube_link = f'https://youtube.com/watch?v={video_id}'
        links.append(youtube_link)


# Only the first page of the category is fetched here
first_page = requests.get("http://www.arifbutt.me/category/sp-with-linux/")
soup = BeautifulSoup(first_page.text, 'lxml')
videos(soup)


# Creating a dataframe and saving data in a csv file
data = {'Title':titles, 'YouTube Link':links, 'Description':descriptions}
df = pd.DataFrame(data, columns=['Title', 'YouTube Link', 'Description'])
df.to_csv('spvideos.csv', index=False)
df = pd.read_csv('spvideos.csv')
df

| | Title | YouTube Link | Description |
|---|---|---|---|
| 0 | Lec01 Introduction to System Programming (Arif... | https://youtube.com/watch?v=qThI-U34KYs | This is the first session on the subject of Sy... |
| 1 | Lec02 C Compilation: A System Programmer Persp... | https://youtube.com/watch?v=a7GhFL0Gh6Y | This session starts with the C-Compilation pro... |
| 2 | Lec03 Working of Linkers: Creating your own Li... | https://youtube.com/watch?v=A67t7X2LUsA | Linking and loading a process (Behind the curt... |
| 3 | Lec04 UNIX make utility (Arif Butt @ PUCIT) | https://youtube.com/watch?v=8hG0MTyyxMI | This session deals with the famous UNIX make u... |
| 4 | Lec05 GNU autotools and cmake (Arif Butt @ PUCIT) | https://youtube.com/watch?v=Ncb_xzjGAwM | This session starts with a brief comparison be... |
| 5 | Lec06 Versioning Systems git-I (Arif Butt @ PU... | https://youtube.com/watch?v=TBqLJg6PmWQ | This session gives an overview of different mo... |
| 6 | Lec07 Versioning Systems git-II (Arif Butt @ P... | https://youtube.com/watch?v=3akXFcBDYc0 | This is a continuity of previous session and s... |
| 7 | Lec08 Exit Handlers and Resource Limits (Arif ... | https://youtube.com/watch?v=ujzom1OyPMY | This session describes as to how a C program s... |
| 8 | Lec09 Stack Behind the Curtain (Arif Butt @ PU... | https://youtube.com/watch?v=1XbTmmWxHzo | This session describes how a process is laid o... |
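
The cell above scrapes only the first page of the category. One common way to cover the remaining pages is to keep following the "next page" link until there is none. The sketch below illustrates that loop under stated assumptions: the `next` class on the pager link, the `example.com` URLs and the `fetch` parameter are all hypothetical, and the real site may mark its pagination differently, so inspect its HTML first. The network is replaced by a small dictionary so the sketch runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url and collect every <h4> title."""
    titles, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against pagers that loop back to an earlier page
        soup = BeautifulSoup(fetch(url), 'html.parser')
        titles += [h4.get_text(strip=True) for h4 in soup.find_all('h4')]
        # Assumption: the pager marks its next-page link with class 'next'
        next_link = soup.find('a', class_='next')
        url = urljoin(url, next_link['href']) if next_link else None
    return titles


# Stand-in for the network: two tiny hypothetical pages keyed by URL
fake_site = {
    'http://example.com/videos/': '<h4>Lec01</h4><a class="next" href="page/2/">Next</a>',
    'http://example.com/videos/page/2/': '<h4>Lec02</h4><h4>Lec03</h4>',
}
all_titles = scrape_all_pages('http://example.com/videos/', fake_site.__getitem__)
print(all_titles)  # ['Lec01', 'Lec02', 'Lec03']
```

Against a live site, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, and the body of the loop could call `videos(soup)` instead of collecting only the titles.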


### Limitations of the Requests and BeautifulSoup Libraries
- When you open a website in your browser, the browser sends a request to the page's server to retrieve the page content. That is usually some HTML code, some CSS, and some JavaScript.
- A key difference between loading a page in your browser and fetching it with `requests` is that the browser executes any JavaScript code the page comes with, whereas `requests` only downloads the raw response. In a browser you will sometimes see the initial page content (before the JavaScript runs) for a few moments, and then the JavaScript kicks in and changes the page.
- This is a very frequent problem in my courses: the HTML returned by `requests` is missing content that is visible in the browser. Unfortunately, the only way to get the page as it looks after the JavaScript has run is to run the JavaScript, and that requires a JavaScript engine. In other words, you need a browser or a browser-like program to get the final page.
- Solution:
    - Selenium is a browser automation tool, which means you can use it to control a browser. You can make Selenium load the page you are interested in, let the browser evaluate the JavaScript, and then read the resulting page content.
    - requests-html is another library that lets you evaluate the JavaScript after you have retrieved the page. It uses `requests` to get the page content and then runs the page through the Chrome browser engine (Chromium) to "calculate" the final page. However, it is not as mature as Selenium, and I have had a few problems with it.
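
The limitation described above can be reproduced without any network access. In the sketch below, the page content is a made-up HTML string whose heading is created by a script: Beautiful Soup sees the script only as text, so the heading is absent from the parse tree. The helper `render_with_selenium` is a hypothetical name. It is one possible way to obtain the rendered page, assuming Selenium 4 and a local Chrome installation, and it is not called here.

```python
from bs4 import BeautifulSoup

# Hypothetical page whose heading is created by JavaScript at load time
raw_html = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById('app').innerHTML = '<h1>Loaded by JS</h1>';
  </script>
</body></html>
"""

soup = BeautifulSoup(raw_html, 'html.parser')
static_heading = soup.find('h1')  # the script never ran, so no <h1> element exists
print(static_heading)  # None


def render_with_selenium(url):
    """Return the page source after a headless Chrome has run the page's JavaScript."""
    # Imported here so that the demonstration above runs without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless=new')  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

The string returned by `render_with_selenium(url)` can be passed to `BeautifulSoup` exactly like `resp.text`. Pages that load their data asynchronously may additionally need an explicit wait before `page_source` is read.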
    