---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Data-Acquisition</h1>

---  
<h1 align="center">Lecture 6 (Web Scraping using BeautifulSoup)</h1>

----


<img align="center" width="900" height="650"  src="images/scrap.PNG"  >

## Learning agenda of this notebook

<img align="right" width="400" src="images/webscraping.png"  >

1. **Overview of BeautifulSoup**
    - What is BeautifulSoup and how does it work?
    - Download and Install BeautifulSoup


2. **Playing with BeautifulSoup**<br>
    - Reviewing the Books Scraping Website
    - Fetching HTML Contents Using `requests` Library
    - Creating the Soup Object using `BeautifulSoup` Library
    - Accessing Attributes of `Soup` Object
    - Using the `soup.find()` Method
    - Using the `soup.find_all()` Method
    - Iterating Through the List returned by `soup.find_all()` Method


3. **Example 1: Scraping Information from a Single Web Page** https://arifpucit.github.io/bss2/ <br>
    - Extracting Book Titles/Authors
    - Extracting Book Prices
    - Extracting Book Availability (In-Stock)
    - Extracting Book Review Count
    - Extracting Book Star Ratings
    - Extracting Book Links
    - Saving data into CSV file on disk


4. **Example 1 (cont.): Scraping Information from Multiple Web Pages** https://arifpucit.github.io/bss2/ <br>
    - Extracting Book Titles/Authors, Prices, Availability, Review Count, Star Ratings and Links from multiple pages
    - Saving data into CSV file on disk


5. **Example 2: Scraping Information from Multiple Web Pages (Pagination)** http://www.arifbutt.me/category/sp-with-linux/ <br>
    - Extracting required information
    - The Concept of **Pagination**
    - How to extract information from Multiple Web Pages using **Pagination**?
    - Saving data into CSV file on disk
    

6. **Limitations of BeautifulSoup** <br>


7. **Some Coding Exercises** <br>

In [1]:
pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## 1. Overview of BeautifulSoup
- URLLIB3 Library: https://pypi.org/project/urllib3/
- Requests Library: https://requests.readthedocs.io/en/latest/
- Requests-html Library: https://requests.readthedocs.io/projects/requests-html/en/latest/
- Beautifulsoup4 Download: https://pypi.org/project/beautifulsoup4/
- Beautifulsoup4: https://www.crummy.com/software/BeautifulSoup/
- Beautifulsoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- LXML Parser: https://lxml.de/

### a. What is BeautifulSoup and How Does it Work?
- Beautiful Soup is a Python library for pulling data out of `HTML` and `XML` files. 
- BeautifulSoup cannot fetch HTML from a website. To get the HTML, we use the `requests` library and then pass the HTML to the `BeautifulSoup` constructor.
- The three main features of BeautifulSoup are:
    - It generates a parse tree of the HTML and offers simple methods for navigating, searching and modifying that parse tree.
    - It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you rarely have to worry about encodings.
    - It supports several parsers for reading HTML documents, for example `lxml`, `html5lib` and Python's built-in `html.parser`.
  
- Different parsers may create different parse trees and could return different results depending on the HTML that you are trying to parse. If you are parsing well-formed HTML, the different parsers give almost the same output, but if there are mistakes in the HTML, each parser fills in the missing information differently.
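
The effect of the parser choice is easiest to see on a tiny piece of invalid markup. The following is a minimal sketch, assuming `beautifulsoup4` is installed; `lxml` and `html5lib` are optional, so a missing parser is simply recorded as `None`. The `broken` string and the `results` dictionary are illustrative names, not part of the lecture code.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: a closing </p> that was never opened
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # str(soup) serialises the parse tree that this parser built
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser] = None

for parser, markup in results.items():
    print(f"{parser:12} -> {markup}")
```

Each installed parser repairs the stray `</p>` in its own way, which is why the same `find()` call can return different results under different parsers.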

### b. Download and Install BeautifulSoup

In [2]:
import sys
!"{sys.executable}" -m pip install --upgrade pip -q
!"{sys.executable}" -m pip install requests -q
!"{sys.executable}" -m pip install beautifulsoup4 -q
!"{sys.executable}" -m pip install --upgrade lxml -q
!"{sys.executable}" -m pip install html5lib -q


In [3]:
import requests
import bs4 # the beautifulsoup4 package installs the importable module bs4 (the separate PyPI package named "bs4" is only a placeholder against name squatting)
from bs4 import BeautifulSoup
import lxml
import html5lib

requests.__version__, bs4.__version__ , lxml.__version__

('2.31.0', '4.11.1', '5.1.0')

## 2. Playing with BeautifulSoup

### a. Reviewing the Books Scraping Website
https://arifpucit.github.io/bss2/

### b. Fetching HTML Contents Using `requests` Library
- A good practical tutorial on using Requests Library: https://www.jcchouinard.com/python-requests/

In [106]:
import requests
print(dir(requests))



In [107]:
resp = requests.get("https://arifpucit.github.io/bss2")
resp.status_code

200
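
A status code of 200 means the request succeeded, but a scraper should not assume that. The following is a minimal sketch of a more defensive fetch; the helper name `fetch_html` and its `session` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import requests

def fetch_html(url, timeout=10, session=None):
    """Return the decoded HTML of `url`, raising an exception on HTTP errors."""
    # Use the given session if there is one, otherwise the module-level requests API
    http = session if session is not None else requests
    # timeout stops the call from hanging forever on an unresponsive server
    resp = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_html("https://arifpucit.github.io/bss2")` would then return the same string as `resp.text` above, or raise an exception instead of silently returning an error page.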

In [108]:
print(dir(resp))

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [109]:
resp.url

'https://arifpucit.github.io/bss2/'

In [110]:
resp.json

<bound method Response.json of <Response [200]>>

In [111]:
resp.headers

{'Connection': 'keep-alive', 'Content-Length': '2757', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'permissions-policy': 'interest-cohort=()', 'Last-Modified': 'Mon, 27 Jun 2022 12:32:49 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"62b9a371-33ad"', 'expires': 'Sun, 10 Mar 2024 14:34:43 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'C256:3EFB99:6D0ED1A:6EF0414:65EDC2AB', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 10 Mar 2024 16:16:24 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-fra-eddf8230076-FRA', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1710087385.781544,VS0,VE93', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': 'd93fd1fda3e5323f941975c30ce50892958b807d'}

In [112]:
resp.content

b'<!doctype html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>BSS2</title>\n    <!-- external style sheet -->\n    <link rel="stylesheet" href="./index.css">\n\n    <!--Bootstrap style sheet-->\n    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" crossorigin="anonymous">\n\n    <!--for icons of tick and cross ans star for in stock-->\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> \n\n  </head>\n  <body>\n    <header class="header d-flex align-items-center justify-content-between">\n        <img class="image-container" src="./images//arif.jpg" alt="arif"/>\n         <p> <span class="large_text">Books Scraping Site</span></p>\n        <img class="image-container" src="./images/pu

In [113]:
print(resp.text)

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>BSS2</title>
    <!-- external style sheet -->
    <link rel="stylesheet" href="./index.css">

    <!--Bootstrap style sheet-->
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" crossorigin="anonymous">

    <!--for icons of tick and cross ans star for in stock-->
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> 

  </head>
  <body>
    <header class="header d-flex align-items-center justify-content-between">
        <img class="image-container" src="./images//arif.jpg" alt="arif"/>
         <p> <span class="large_text">Books Scraping Site</span></p>
        <img class="image-container" src="./images/pucit.jpg" alt="pucit"/>

### c. Creating the Soup Object using `BeautifulSoup` Library
- The `BeautifulSoup()` constructor is used to create a BeautifulSoup object.

##### <center> `BeautifulSoup(markup, "lxml")` </center>

- The first argument to the BeautifulSoup constructor is a string or an open filehandle containing the markup you want to be parsed. 
- The second argument is how you’d like the markup parsed. If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

- The constructor returns a BeautifulSoup object, which represents the parsed document and knows how to navigate through the DOM.
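
Both kinds of first argument can be tried without any network access. This is a small sketch under two assumptions: the `markup` string is made up for illustration, and `io.StringIO` stands in for a real open file such as `open("page.html")`. The built-in `html.parser` is used so that no extra parser needs to be installed.

```python
import io
from bs4 import BeautifulSoup

markup = "<html><head><title>BSS2</title></head><body><p>Hello</p></body></html>"

# 1) Markup passed as a string
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) Markup passed as a file-like object; Beautiful Soup calls .read() on it
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.title.string)
```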

In [114]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.text, 'lxml')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [115]:
print(dir(soup))



In [116]:
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>BSS2</title>
<!-- external style sheet -->
<link href="./index.css" rel="stylesheet"/>
<!--Bootstrap style sheet-->
<link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" rel="stylesheet"/>
<!--for icons of tick and cross ans star for in stock-->
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
</head>
<body>
<header class="header d-flex align-items-center justify-content-between">
<img alt="arif" class="image-container" src="./images//arif.jpg"/>
<p> <span class="large_text">Books Scraping Site</span></p>
<img alt="pucit" class="image-container" src="./images/pucit.jpg"/>
</header>
<section>
<div class="main-container d-flex align-items-sta

In [117]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   BSS2
  </title>
  <!-- external style sheet -->
  <link href="./index.css" rel="stylesheet"/>
  <!--Bootstrap style sheet-->
  <link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" rel="stylesheet"/>
  <!--for icons of tick and cross ans star for in stock-->
  <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
 </head>
 <body>
  <header class="header d-flex align-items-center justify-content-between">
   <img alt="arif" class="image-container" src="./images//arif.jpg"/>
   <p>
    <span class="large_text">
     Books Scraping Site
    </span>
   </p>
   <img alt="pucit" class="image-container" src="./images/pucit.jpg"/>
  </header>

> **Tag Objects**

In [118]:
soup.p

<p> <span class="large_text">Books Scraping Site</span></p>

In [119]:
soup.header

<header class="header d-flex align-items-center justify-content-between">
<img alt="arif" class="image-container" src="./images//arif.jpg"/>
<p> <span class="large_text">Books Scraping Site</span></p>
<img alt="pucit" class="image-container" src="./images/pucit.jpg"/>
</header>

> **Name Objects**

In [120]:
soup.header.name

'header'

In [121]:
soup.img

<img alt="arif" class="image-container" src="./images//arif.jpg"/>

In [122]:
soup.img.name

'img'

In [123]:
print(type(soup.header.name))
print(type(soup.img.name))

<class 'str'>
<class 'str'>


> **Attribute Objects**

In [124]:
soup.p.attrs

{}

In [125]:
soup.p.span.attrs

{'class': ['large_text']}

In [126]:
soup.img.attrs

{'class': ['image-container'], 'src': './images//arif.jpg', 'alt': 'arif'}

> **NavigableString Object**

In [127]:
soup.title

<title>BSS2</title>

In [128]:
soup.title.string

'BSS2'
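
The four kinds of objects seen so far (tag, name, attributes and string) can be inspected together on one small element. This is a sketch on a made-up `<p>` element, parsed with the built-in `html.parser`; it is not markup from the books site.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p id="intro" class="lead big">Hello <b>world</b></p>', "html.parser")
p = soup.p

print(isinstance(p, Tag))   # a tag object
print(p.name)               # its name, a plain str
print(p.attrs)              # its attributes; class is a list because it is multi-valued
print(p["id"])              # a single attribute, accessed like a dictionary key

inner = p.b.string          # <b> has exactly one child, so .string is that text
print(isinstance(inner, NavigableString), inner)

print(p.string)             # None: <p> has more than one child
print(p.get_text())         # all text inside <p>, joined together
```

Note that `.string` only works when a tag has a single child; `get_text()` (or `.text`) is the safer choice for tags with nested markup.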

> **You can Navigate the Entire Tree of Soup Object**

In [129]:
soup.body.a

<a href="index.html">Operating System</a>

In [130]:
soup.body.a.parent

<li class="link book_type"><a href="index.html">Operating System</a></li>

In [131]:
soup.body.a.parent.parent

<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>

In [132]:
soup.body.ul

<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>

In [133]:
soup.body.ul.children

<list_iterator at 0x216e6983730>

In [134]:
for tag in soup.body.ul.children:
    print(tag)



<div class="link text-center" id="book_title">Books Titles</div>


<li class="link book_type"><a href="index.html">Operating System</a></li>


<li class="link book_type"><a href="SP.html">System Programming</a></li>


<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
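
The blank lines in the output above come from the newline characters between the tags: `.children` yields those whitespace strings as well as the tags. A minimal sketch of filtering them out, using a shortened, made-up copy of the navigation list:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<ul class="nav-links">
<li><a href="index.html">Operating System</a></li>
<li><a href="SP.html">System Programming</a></li>
</ul>"""

ul = BeautifulSoup(html, "html.parser").ul

all_children = list(ul.children)                            # tags and whitespace strings
tag_children = [c for c in ul.children if isinstance(c, Tag)]  # tags only
direct_items = ul.find_all("li", recursive=False)           # same idea using find_all

print(len(all_children), len(tag_children), len(direct_items))
```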




### d. Using the `soup.find()` Method
- The `soup.find()` method returns the first tag that matches the search criteria, or `None` if nothing matches:

`soup.find(name=None, attrs={}, recursive=True, string=None, **kwargs)`

- Where
    - `name` is the tag name to search for.
    - `attrs={}` is a dictionary of filters on attribute values.
    - `recursive=True` makes `find()` search all descendants of the element. If it is `False`, only the direct children are considered.
    - `string=None` is a filter on the text content of an element. Older versions of Beautiful Soup call this argument `text`.
    - `find()` has no `limit` argument; it behaves like `find_all()` with `limit=1`, except that it returns the element itself rather than a list.
    
**The `find()` method can be called on the entire soup object or on a specific tag within it, in which case only that tag's descendants are searched.**

In [135]:
soup.find('div', {'class':'navbar'})

<div class="navbar">
<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>
</div>

In [136]:
soup.find('div', class_='navbar')

<div class="navbar">
<ul class="nav-links">
<div class="link text-center" id="book_title">Books Titles</div>
<li class="link book_type"><a href="index.html">Operating System</a></li>
<li class="link book_type"><a href="SP.html">System Programming</a></li>
<li class="link book_type"><a href="CA.html">Computer Architecture</a></li>
</ul>
</div>
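
Two details of `find()` matter in practice: it can be scoped to a tag, and it returns `None` when nothing matches. The sketch below uses a made-up two-`div` page with a placeholder `example.com` link.

```python
from bs4 import BeautifulSoup

html = """
<div class="navbar"><a href="index.html">Operating System</a></div>
<div class="content"><a href="https://example.com/book">A Book</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                    # first <a> in the whole document
content = soup.find("div", class_="content")   # narrow the search to one <div>
content_link = content.find("a")               # first <a> inside that <div> only

missing = soup.find("table")                   # no <table> on the page, so None
# Guard against None before touching attributes, or the scraper will crash
table_text = missing.get_text() if missing is not None else None

print(first_link["href"], content_link["href"], table_text)
```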

### e. Using the `soup.find_all()` Method
- The `soup.find_all()` method returns a list of all the tags or strings that match the given criteria.

`soup.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)`

- Where
    - `name` is the name of the tag to return.
    - `attrs={}` is a dictionary of filters on attribute values.
    - `recursive=True` performs a recursive search of all the descendants. If it is `False`, only the direct children are considered.
    - `string=None` is used if you want to search for a text string rather than a tag name. Older versions of Beautiful Soup call this argument `text`.
    - `limit=None` is the maximum number of elements to return; by default all matches are returned. The `find()` method is equivalent to `find_all()` with `limit=1`, except that it returns the element instead of a list.
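
The `limit` and `string` arguments can be sketched on three made-up price tags (the HTML below is illustrative, not fetched from the books site):

```python
from bs4 import BeautifulSoup

html = "<p class='price'>Rs.2000</p><p class='price'>Rs.5000</p><p class='price'>Rs.6900</p>"
soup = BeautifulSoup(html, "html.parser")

all_prices = soup.find_all("p", class_="price")            # every match
first_two = soup.find_all("p", class_="price", limit=2)    # stop after two matches
first = soup.find("p", class_="price")                     # a single Tag, not a list
by_string = soup.find_all(string="Rs.5000")                # matches text, returns strings

print(len(all_prices), len(first_two), first.text, by_string)
```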


**Note:** A `class` attribute containing a space-separated string lists multiple classes, so BeautifulSoup returns it as a list of class names. An `id` attribute is treated as a single value, even if it contains spaces.

In [137]:
prices = soup.find_all('p', class_='price green')
prices

[<p class="price green">Rs.2000</p>,
 <p class="price green">Rs.5000</p>,
 <p class="price green">Rs.6900</p>,
 <p class="price green">Rs.2700</p>,
 <p class="price green">Rs.1700</p>,
 <p class="price green">Rs.1800</p>,
 <p class="price green">Rs.6000</p>,
 <p class="price green">Rs.1000</p>,
 <p class="price green">Rs.1800</p>]

In [138]:
for price in prices:
    print(price.text)

Rs.2000
Rs.5000
Rs.6900
Rs.2700
Rs.1700
Rs.1800
Rs.6000
Rs.1000
Rs.1800


In [139]:
for price in prices:
    print(price.get('class'))

['price', 'green']
['price', 'green']
['price', 'green']
['price', 'green']
['price', 'green']
['price', 'green']
['price', 'green']
['price', 'green']
['price', 'green']
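
Because `class` is multi-valued, the way the filter is written changes what matches. The sketch below uses three made-up `<p>` tags to compare a single class name, the exact attribute string, and a CSS selector via `select()`.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="price green">Rs.2000</p>'
    '<p class="green price">Rs.5000</p>'
    '<p class="price">Rs.100</p>'
)
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that has that class
has_price = soup.find_all("p", class_="price")

# A space-separated value matches only the exact attribute string, in that order
exact = soup.find_all("p", class_="price green")

# A CSS selector matches tags that have both classes, in any order
both = soup.select("p.price.green")

print(len(has_price), len(exact), len(both))
```

On the books site every price tag is written as `class="price green"`, so the exact-string form used above works there; the selector form is the more robust choice if the order of classes could vary.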


## 3. Example 1: Scraping Information from a Single Web Page
<h3 align="center" style="color:green">https://arifpucit.github.io/bss2/</h3>
<br>

- Visit the above web page and scrape the following six items for each of the nine books on the index page:
    - Titles/Authors of the Book
    - Links of the Book
    - Price of the Book
    - Availability of the Book (In-Stock or Not in Stock)
    - Count of Reviews
    - Star ratings

In [140]:
import requests
from bs4 import BeautifulSoup
import lxml

In [141]:
resp = requests.get("https://arifpucit.github.io/bss2")

In [142]:
resp.text

'<!doctype html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>BSS2</title>\n    <!-- external style sheet -->\n    <link rel="stylesheet" href="./index.css">\n\n    <!--Bootstrap style sheet-->\n    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor" crossorigin="anonymous">\n\n    <!--for icons of tick and cross ans star for in stock-->\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> \n\n  </head>\n  <body>\n    <header class="header d-flex align-items-center justify-content-between">\n        <img class="image-container" src="./images//arif.jpg" alt="arif"/>\n         <p> <span class="large_text">Books Scraping Site</span></p>\n        <img class="image-container" src="./images/puc

In [143]:
soup = BeautifulSoup(resp.text, 'lxml')

In [144]:
soup.text

'\n\n\n\nBSS2\n\n\n\n\n\n\n\n\n\n\n Books Scraping Site\n\n\n\n\n\n\nBooks Titles\nOperating System\nSystem Programming\nComputer Architecture\n\n\n\nOperating Systems\n\n\n\nOperating System Concepts By Avi Silberschatz\n\nRs.2000\n In stock\n\n20 Reviews\nAdd to cart\n\n\n\n\nUNIX The Textbook By Syed Mansoor Sarwar\n\nRs.5000\n In stock\n\n100 Reviews\nAdd to cart\n\n\n\n\nTaxonomy of IDS By Arif Butt\n\nRs.6900\n Not in stock\n\n20 Reviews\nAdd to cart\n\n\n\n\nUnderstanding operating systems By Ida Flynn\n\nRs.2700\n Not in stock\n\n60 Reviews\nAdd to cart\n\n\n\n\nComputer Systems  By Randal E. Bryant \n\nRs.1700\n In stock\n\n25 Reviews\nAdd to cart\n\n\n\n\nLinux bible  Book By Christopher Negus\n\nRs.1800\n Not in stock\n\n21 Reviews\nAdd to cart\n\n\n\n\nAdvanced Programming in the UNIX Environment  By W. Stevans\n\nRs.6000\n In stock\n\n40 Reviews\nAdd to cart\n\n\n\n\nOperating Systems: A Design-oriented Approach By Charles Patrick Crowley\n\nRs.1000\n In stock\n\n90 Review

### a. Extract Book Title and Author Name
- Suppose you want to get the book titles and author names of all the books.
- Start by extracting the information for the first book. Once you are satisfied with the result, extend the same approach to all the books.

In [145]:
sp_titles = soup.find_all('p', class_="book_name")
sp_titles

[<p class="book_name"><a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">Operating System Concepts By Avi Silberschatz</a></p>,
 <p class="book_name"><a href="https://www.google.com/search?q=Unix+the+textbook+by+mansoor&amp;rlz=1C1CHBD_enPK987PK987&amp;oq=unix+the+textbook+by+mansoor&amp;aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&amp;sourceid=chrome&amp;ie=UTF-8" target="_blank">UNIX The Textbook By Syed Mansoor Sarwar</a></p>,
 <p class="book_name"><a href="https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092" target="_blank">Taxonomy of IDS By Arif Butt</a></p>,
 <p class="book_name"><a href="https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251" target="_blank">Understanding operating systems By Ida Flynn</a></p>,
 <p class="book_name"><a href="https://www.goodreads.com/book/show/829182.Computer_Systems" target="_blank">Computer Systems  By Randal E. Bryant </a></p>,
 <p class="book_name"><

In [146]:
titles = []
for title in sp_titles:
    titles.append(title.text)
print(titles)

['Operating System Concepts By Avi Silberschatz', 'UNIX The Textbook By Syed Mansoor Sarwar', 'Taxonomy of IDS By Arif Butt', 'Understanding operating systems By Ida Flynn', 'Computer Systems  By Randal E. Bryant ', 'Linux bible  Book By Christopher Negus', 'Advanced Programming in the UNIX Environment  By W. Stevans', 'Operating Systems: A Design-oriented Approach By Charles Patrick Crowley', 'Hands-On Network Programming with C  By Lewis Van Winkle']
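
Each scraped string holds the title and the author together. A possible way to separate them is sketched below, assuming that the last occurrence of `" By "` always marks the boundary; the helper name `split_title_author` is hypothetical, and the sample list is a subset of the output above.

```python
titles = [
    'Operating System Concepts By Avi Silberschatz',
    'Computer Systems  By Randal E. Bryant ',
    'Linux bible  Book By Christopher Negus',
]

def split_title_author(text):
    """Split 'Title By Author' on the last ' By ' and trim stray spaces."""
    title, sep, author = text.rpartition(" By ")
    if not sep:
        # No ' By ' found: treat the whole string as the title
        return text.strip(), None
    return title.strip(), author.strip()

pairs = [split_title_author(t) for t in titles]
for title, author in pairs:
    print(f"{title!r} by {author!r}")
```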


### b. Extract Links of Books

In [147]:
sp_titles = soup.find_all('p', class_="book_name")
sp_titles

[<p class="book_name"><a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">Operating System Concepts By Avi Silberschatz</a></p>,
 <p class="book_name"><a href="https://www.google.com/search?q=Unix+the+textbook+by+mansoor&amp;rlz=1C1CHBD_enPK987PK987&amp;oq=unix+the+textbook+by+mansoor&amp;aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&amp;sourceid=chrome&amp;ie=UTF-8" target="_blank">UNIX The Textbook By Syed Mansoor Sarwar</a></p>,
 <p class="book_name"><a href="https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092" target="_blank">Taxonomy of IDS By Arif Butt</a></p>,
 <p class="book_name"><a href="https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251" target="_blank">Understanding operating systems By Ida Flynn</a></p>,
 <p class="book_name"><a href="https://www.goodreads.com/book/show/829182.Computer_Systems" target="_blank">Computer Systems  By Randal E. Bryant </a></p>,
 <p class="book_name"><

In [148]:
for item in sp_titles:
    print(item.find('a'))

<a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">Operating System Concepts By Avi Silberschatz</a>
<a href="https://www.google.com/search?q=Unix+the+textbook+by+mansoor&amp;rlz=1C1CHBD_enPK987PK987&amp;oq=unix+the+textbook+by+mansoor&amp;aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&amp;sourceid=chrome&amp;ie=UTF-8" target="_blank">UNIX The Textbook By Syed Mansoor Sarwar</a>
<a href="https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092" target="_blank">Taxonomy of IDS By Arif Butt</a>
<a href="https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251" target="_blank">Understanding operating systems By Ida Flynn</a>
<a href="https://www.goodreads.com/book/show/829182.Computer_Systems" target="_blank">Computer Systems  By Randal E. Bryant </a>
<a href="https://www.amazon.com/Linux-Bible-Christopher-Negus/dp/111821854X" target="_blank">Linux bible  Book By Christopher Negus</a>
<a href="https://www.a

In [149]:
for item in sp_titles:
    print(item.find('a').get('href')) # print(item.find('a')['href'])

https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339
https://www.google.com/search?q=Unix+the+textbook+by+mansoor&rlz=1C1CHBD_enPK987PK987&oq=unix+the+textbook+by+mansoor&aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&sourceid=chrome&ie=UTF-8
https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092
https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251
https://www.goodreads.com/book/show/829182.Computer_Systems
https://www.amazon.com/Linux-Bible-Christopher-Negus/dp/111821854X
https://www.amazon.com/dp/0321637739?tag=uuid10-20
https://www.amazon.com/s?k=Operating+Systems%3A+A+Design-oriented+Approach&i=stripbooks-intl-ship&ref=nb_sb_noss
https://www.amazon.com/Hands-Network-Programming-programming-optimized/dp/1789349869/ref=sr_1_1?crid=11FC0M0GAFA21&amp&keywords=unix+network+programming+2019&amp&qid=1653381349&amp&s=books&amp&sprefix=unix+network+programming+2019%2Cstripbooks-intl-ship%2C356&amp&sr=1-1


In [150]:
links=[]
for item in sp_titles:
    links.append(item.find('a').get('href'))
links

['https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339',
 'https://www.google.com/search?q=Unix+the+textbook+by+mansoor&rlz=1C1CHBD_enPK987PK987&oq=unix+the+textbook+by+mansoor&aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&sourceid=chrome&ie=UTF-8',
 'https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092',
 'https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251',
 'https://www.goodreads.com/book/show/829182.Computer_Systems',
 'https://www.amazon.com/Linux-Bible-Christopher-Negus/dp/111821854X',
 'https://www.amazon.com/dp/0321637739?tag=uuid10-20',
 'https://www.amazon.com/s?k=Operating+Systems%3A+A+Design-oriented+Approach&i=stripbooks-intl-ship&ref=nb_sb_noss',
 'https://www.amazon.com/Hands-Network-Programming-programming-optimized/dp/1789349869/ref=sr_1_1?crid=11FC0M0GAFA21&amp&keywords=unix+network+programming+2019&amp&qid=1653381349&amp&s=books&amp&sprefix=unix+network+programming+2019%2Cstripbooks-intl-ship%2C356&amp

### c. Extract Price

In [151]:
sp_prices = soup.find_all('p', class_="price green")
sp_prices

[<p class="price green">Rs.2000</p>,
 <p class="price green">Rs.5000</p>,
 <p class="price green">Rs.6900</p>,
 <p class="price green">Rs.2700</p>,
 <p class="price green">Rs.1700</p>,
 <p class="price green">Rs.1800</p>,
 <p class="price green">Rs.6000</p>,
 <p class="price green">Rs.1000</p>,
 <p class="price green">Rs.1800</p>]

In [152]:
prices = []
for price in sp_prices:
    prices.append(price.text)
print(prices)

['Rs.2000', 'Rs.5000', 'Rs.6900', 'Rs.2700', 'Rs.1700', 'Rs.1800', 'Rs.6000', 'Rs.1000', 'Rs.1800']
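
The prices are still strings with an `Rs.` prefix, which makes sorting or arithmetic awkward. The following sketch converts them to integers; the helper name `parse_price` is hypothetical, and the sample list is a subset of the output above.

```python
import re

prices = ['Rs.2000', 'Rs.5000', 'Rs.6900']

def parse_price(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text.replace(",", ""))  # drop thousands separators first
    return int(match.group()) if match else None

numeric_prices = [parse_price(p) for p in prices]
print(numeric_prices)
print(max(numeric_prices))
```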


### d. Extract Availability of Books (In-Stock or Not in Stock)

In [153]:
sp_availability =  soup.find_all('p', class_='stock')
sp_availability

[<p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock not_stock" data-stock="not in stock"><i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>,
 <p class="stock not_stock" data-stock="not in stock"><i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock not_stock" data-stock="not in stock"><i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></i> In stock</p>,
 <p class="stock in_stock" data-stock="in stock"><i aria-hidden="true" class="fa fa-check"></

In [154]:
availability=[]
for aval in sp_availability:
    availability.append(aval.text)
print(availability)

[' In stock', ' In stock', ' Not in stock', ' Not in stock', ' In stock', ' Not in stock', ' In stock', ' In stock', ' In stock']
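
Every value above starts with a space, left over from the icon element inside each tag. Two ways of getting cleaner values are sketched below on two tags copied from the output of the previous cell: `get_text(strip=True)` for the label, and the `data-stock` attribute for a boolean flag.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="stock in_stock" data-stock="in stock">'
    '<i aria-hidden="true" class="fa fa-check"></i> In stock</p>'
    '<p class="stock not_stock" data-stock="not in stock">'
    '<i aria-hidden="true" class="fa fa-times"></i> Not in stock</p>'
)
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("p", class_="stock")

# strip=True removes leading and trailing whitespace from the text
labels = [tag.get_text(strip=True) for tag in tags]

# The data-stock attribute carries the same information in a fixed format
in_stock = [tag.get("data-stock") == "in stock" for tag in tags]

print(labels)
print(in_stock)
```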


### e. Extract Count of Reviews

In [155]:
sp_reviews = soup.find_all('p', class_='review')
sp_reviews

[<p class="review green" data-rating="20">20 Reviews</p>,
 <p class="review green" data-rating="100">100 Reviews</p>,
 <p class="review green" data-rating="20">20 Reviews</p>,
 <p class="review green" data-rating="60">60 Reviews</p>,
 <p class="review green" data-rating="25">25 Reviews</p>,
 <p class="review green" data-rating="21">21 Reviews</p>,
 <p class="review green" data-rating="40">40 Reviews</p>,
 <p class="review green" data-rating="90">90 Reviews</p>,
 <p class="review green" data-rating="70">70 Reviews</p>]

In [156]:
reviews = []
for review in sp_reviews:
    reviews.append(int(review.text.split()[0])) ## take the leading number from text such as '20 Reviews'
print(reviews)

[20, 100, 20, 60, 25, 21, 40, 90, 70]


In [157]:
reviews = []
for review in sp_reviews:
    reviews.append(review.get('data-rating')) ## get() method is passed an attribute name and returns its value as a string
print(reviews)

['20', '100', '20', '60', '25', '21', '40', '90', '70']


### f. Extract Star Ratings

In [158]:
book = soup.find('div', class_ = 'book_container')
print(book.prettify())

<div class="book_container col-sm-4">
 <img alt="" src="images/OS concepts.jpg" title="The Linux Programming Interface (TLPI) is the definitive guide 
 to the Linux and UNIX programming interface—the interface
 employed by nearly every application that runs on a 
Linux or UNIX system."/>
 <p class="book_name">
  <a href="https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339" target="_blank">
   Operating System Concepts By Avi Silberschatz
  </a>
 </p>
 <div class="align-left">
  <p class="price green">
   Rs.2000
  </p>
  <p class="stock in_stock" data-stock="in stock">
   <i aria-hidden="true" class="fa fa-check">
   </i>
   In stock
  </p>
  <div>
   <span class="fa fa-star">
   </span>
   <span class="fa fa-star">
   </span>
   <span class="fa fa-star">
   </span>
   <span class="fa fa-star not_filled">
   </span>
   <span class="fa fa-star not_filled">
   </span>
  </div>
  <p class="review green" data-rating="20">
   20 Reviews
  </p>
  <button>
   Add

In [159]:
book.find_all('span', class_ = 'not_filled')

[<span class="fa fa-star not_filled"></span>,
 <span class="fa fa-star not_filled"></span>]

In [160]:
len(book.find_all('span', class_ = 'not_filled'))

2

In [161]:
5 - len(book.find_all('span', {'class': 'not_filled'}))  # attrs must be a dict, not a set

3

In [162]:
stars = list()
books = soup.find_all('div', {'class': 'book_container'})  # attrs is a dict of attribute filters
for book in books:
    stars.append(5 - len(book.find_all('span', {'class': 'not_filled'})))  # filled stars = 5 - unfilled stars
print(stars) 

[3, 5, 4, 2, 2, 1, 1, 3, 4]
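
Since every book lives in its own `book_container`, all fields can also be read from one container at a time, which keeps the values of a book together instead of spreading them over parallel lists. This is a sketch under stated assumptions: the HTML is a trimmed copy of the container printed above with a placeholder `example.com` link, and the helper name `parse_book` is hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div class="book_container col-sm-4">
 <p class="book_name"><a href="https://example.com/os" target="_blank">Operating System Concepts By Avi Silberschatz</a></p>
 <div class="align-left">
  <p class="price green">Rs.2000</p>
  <p class="stock in_stock" data-stock="in stock"><i class="fa fa-check"></i> In stock</p>
  <div>
   <span class="fa fa-star"></span>
   <span class="fa fa-star"></span>
   <span class="fa fa-star"></span>
   <span class="fa fa-star not_filled"></span>
   <span class="fa fa-star not_filled"></span>
  </div>
  <p class="review green" data-rating="20">20 Reviews</p>
 </div>
</div>
"""

def parse_book(book):
    """Collect the six fields of one book_container tag into a dict."""
    link = book.find("p", class_="book_name").find("a")
    return {
        "title": link.get_text(strip=True),
        "link": link.get("href"),
        "price": book.find("p", class_="price").get_text(strip=True),
        "stock": book.find("p", class_="stock").get_text(strip=True),
        "reviews": int(book.find("p", class_="review").get("data-rating")),
        # filled stars = 5 minus the number of unfilled star icons
        "stars": 5 - len(book.find_all("span", class_="not_filled")),
    }

soup = BeautifulSoup(html, "html.parser")
records = [parse_book(b) for b in soup.find_all("div", class_="book_container")]
print(records)
```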


### g. Display output on screen

In [163]:
for i in range(len(titles)):  # one iteration per scraped book instead of a hard-coded 9
    print("",titles[i])
    print("      Link: ",links[i])
    print("      Price: ",prices[i])
    print("      Stock: ",availability[i])
    print("      Reviews: ",reviews[i])
    print("      Stars: ",stars[i])

 Operating System Concepts By Avi Silberschatz
      Link:  https://www.amazon.com/Operating-System-Concepts-Abridged-Companion/dp/1119456339
      Price:  Rs.2000
      Stock:   In stock
      Reviews:  20
      Stars:  3
 UNIX The Textbook By Syed Mansoor Sarwar
      Link:  https://www.google.com/search?q=Unix+the+textbook+by+mansoor&rlz=1C1CHBD_enPK987PK987&oq=unix+the+textbook+by+mansoor&aqs=chrome.0.69i59j69i57j69i59j69i60l5.4419j0j7&sourceid=chrome&ie=UTF-8
      Price:  Rs.5000
      Stock:   In stock
      Reviews:  100
      Stars:  5
 Taxonomy of IDS By Arif Butt
      Link:  https://www.amazon.in/Taxonomy-Ids-Arif-Butt/dp/3639294092
      Price:  Rs.6900
      Stock:   Not in stock
      Reviews:  20
      Stars:  4
 Understanding operating systems By Ida Flynn
      Link:  https://www.amazon.com/Understanding-Operating-Systems-Ann-McHoes/dp/1305674251
      Price:  Rs.2700
      Stock:   Not in stock
      Reviews:  60
      Stars:  2
 Computer Systems  By Randal E. Bryant

### h. Saving Data into a CSV File

#### Option 1:

In [169]:
import pandas as pd
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, 
        'Reviews':reviews, 'Links':links, 'Stars':stars}

df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('./Dataset/books1.csv', index=False)
df = pd.read_csv('./Dataset/books1.csv')
df

| | Title/Author | Price | Availability | Reviews | Links | Stars |
|---|---|---|---|---|---|---|
| 0 | Operating System Concepts By Avi Silberschatz | Rs.2000 | In stock | 20 | https://www.amazon.com/Operating-System-Concep... | 3 |
| 1 | UNIX The Textbook By Syed Mansoor Sarwar | Rs.5000 | In stock | 100 | https://www.google.com/search?q=Unix+the+textb... | 5 |
| 2 | Taxonomy of IDS By Arif Butt | Rs.6900 | Not in stock | 20 | https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d... | 4 |
| 3 | Understanding operating systems By Ida Flynn | Rs.2700 | Not in stock | 60 | https://www.amazon.com/Understanding-Operating... | 2 |
| 4 | Computer Systems By Randal E. Bryant | Rs.1700 | In stock | 25 | https://www.goodreads.com/book/show/829182.Com... | 2 |
| 5 | Linux bible Book By Christopher Negus | Rs.1800 | Not in stock | 21 | https://www.amazon.com/Linux-Bible-Christopher... | 1 |
| 6 | Advanced Programming in the UNIX Environment ... | Rs.6000 | In stock | 40 | https://www.amazon.com/dp/0321637739?tag=uuid1... | 1 |
| 7 | Operating Systems: A Design-oriented Approach ... | Rs.1000 | In stock | 90 | https://www.amazon.com/s?k=Operating+Systems%3... | 3 |
| 8 | Hands-On Network Programming with C By Lewis ... | Rs.1800 | In stock | 70 | https://www.amazon.com/Hands-Network-Programmi... | 4 |
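
The `Price` column above is stored as text such as `Rs.2000`. If numeric analysis is planned later, a small cleaning step may help. The following is only a minimal sketch: the helper name `price_to_int` is hypothetical, and it assumes every price is `Rs.` followed by digits (optionally with thousands separators).

```python
import re

def price_to_int(price_text):
    """Return the digits in a price string such as 'Rs.2000' as an int."""
    match = re.search(r'\d+', price_text.replace(',', ''))
    if match is None:
        raise ValueError(f'no digits found in {price_text!r}')
    return int(match.group())

# First three prices from the table above
sample_prices = ['Rs.2000', 'Rs.5000', 'Rs.6900']
numeric_prices = [price_to_int(p) for p in sample_prices]
print(numeric_prices)
```

With a DataFrame, the same helper could be applied as `df['Price'].apply(price_to_int)`.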


In [71]:
import csv
help(csv)

Help on module csv:

NAME
    csv - CSV parsing and writing.

MODULE REFERENCE
    https://docs.python.org/3.10/library/csv.html
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides classes that assist in the reading and writing
    of Comma Separated Value (CSV) files, and implements the interface
    described by PEP 305.  Although many CSV files are simple to parse,
    the format is not formally defined by a stable specification and
    is subtle enough that parsing lines of a CSV file with something
    like line.split(",") is bound to fail.  The module supports three
    basic APIs: reading, writing, and registration of dialects.
    
    
    DIALECT REGISTRATION:
    
    R

#### Option 2:

In [170]:
import csv
import pandas as pd

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('./Dataset/books2.csv', 'w', newline='', encoding='utf-8') as fd:
    csv_writer = csv.writer(fd)
    # Header row
    csv_writer.writerow(['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
    # One row per book
    for i in range(len(titles)):
        csv_writer.writerow([titles[i], prices[i], availability[i], reviews[i], links[i], stars[i]])

df = pd.read_csv('./Dataset/books2.csv')
df

| | Title/Author | Price | Availability | Reviews | Links | Stars |
|---|---|---|---|---|---|---|
| 0 | Operating System Concepts By Avi Silberschatz | Rs.2000 | In stock | 20 | https://www.amazon.com/Operating-System-Concep... | 3 |
| 1 | UNIX The Textbook By Syed Mansoor Sarwar | Rs.5000 | In stock | 100 | https://www.google.com/search?q=Unix+the+textb... | 5 |
| 2 | Taxonomy of IDS By Arif Butt | Rs.6900 | Not in stock | 20 | https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d... | 4 |
| 3 | Understanding operating systems By Ida Flynn | Rs.2700 | Not in stock | 60 | https://www.amazon.com/Understanding-Operating... | 2 |
| 4 | Computer Systems By Randal E. Bryant | Rs.1700 | In stock | 25 | https://www.goodreads.com/book/show/829182.Com... | 2 |
| 5 | Linux bible Book By Christopher Negus | Rs.1800 | Not in stock | 21 | https://www.amazon.com/Linux-Bible-Christopher... | 1 |
| 6 | Advanced Programming in the UNIX Environment ... | Rs.6000 | In stock | 40 | https://www.amazon.com/dp/0321637739?tag=uuid1... | 1 |
| 7 | Operating Systems: A Design-oriented Approach ... | Rs.1000 | In stock | 90 | https://www.amazon.com/s?k=Operating+Systems%3... | 3 |
| 8 | Hands-On Network Programming with C By Lewis ... | Rs.1800 | In stock | 70 | https://www.amazon.com/Hands-Network-Programmi... | 4 |
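
One reason to use `csv.writer` rather than joining fields with commas by hand is that it quotes any field that itself contains a comma. The sketch below illustrates this with an in-memory buffer instead of a file on disk; the second data row is made up purely for illustration.

```python
import csv
import io

rows = [
    ['Title/Author', 'Price'],
    ['Operating System Concepts By Avi Silberschatz', 'Rs.2000'],
    ['A Made-up Title, With a Comma By Some Author', 'Rs.100'],  # hypothetical row
]

buffer = io.StringIO()        # in-memory stand-in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerows(rows)
print(buffer.getvalue())

# Reading the text back yields the original fields, comma included
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back == rows)
```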


### i. Consolidating in a Single Script

In [171]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]

def books(soup):
    """Append the details of every book on one page to the global lists."""
    sp_titles = soup.find_all('p', class_='book_name')
    sp_prices = soup.find_all('p', class_='price green')
    sp_availability = soup.find_all('p', class_='stock')
    sp_reviews = soup.find_all('p', {'class': 'review'})
    # The link of each book is the href of the <a> inside its title paragraph
    sp_links = [p.find('a').get('href') for p in sp_titles]
    # Stars = 5 minus the number of unfilled star icons in the book's container
    for book in soup.find_all('div', {'class': 'book_container'}):
        stars.append(5 - len(book.find_all('span', {'class': 'not_filled'})))

    for i in range(len(sp_titles)):
        titles.append(sp_titles[i].text)
        prices.append(sp_prices[i].text)
        availability.append(sp_availability[i].text)
        reviews.append(sp_reviews[i].text)
        links.append(sp_links[i])

        
resp = requests.get("https://arifpucit.github.io/bss2")
soup = BeautifulSoup(resp.text, 'lxml')
books(soup)


data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, 
        'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('./Dataset/books3.csv', index=False)
df = pd.read_csv('./Dataset/books3.csv')
df

| | Title/Author | Price | Availability | Reviews | Links | Stars |
|---|---|---|---|---|---|---|
| 0 | Operating System Concepts By Avi Silberschatz | Rs.2000 | In stock | 20 Reviews | https://www.amazon.com/Operating-System-Concep... | 3 |
| 1 | UNIX The Textbook By Syed Mansoor Sarwar | Rs.5000 | In stock | 100 Reviews | https://www.google.com/search?q=Unix+the+textb... | 5 |
| 2 | Taxonomy of IDS By Arif Butt | Rs.6900 | Not in stock | 20 Reviews | https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d... | 4 |
| 3 | Understanding operating systems By Ida Flynn | Rs.2700 | Not in stock | 60 Reviews | https://www.amazon.com/Understanding-Operating... | 2 |
| 4 | Computer Systems By Randal E. Bryant | Rs.1700 | In stock | 25 Reviews | https://www.goodreads.com/book/show/829182.Com... | 2 |
| 5 | Linux bible Book By Christopher Negus | Rs.1800 | Not in stock | 21 Reviews | https://www.amazon.com/Linux-Bible-Christopher... | 1 |
| 6 | Advanced Programming in the UNIX Environment ... | Rs.6000 | In stock | 40 Reviews | https://www.amazon.com/dp/0321637739?tag=uuid1... | 1 |
| 7 | Operating Systems: A Design-oriented Approach ... | Rs.1000 | In stock | 90 Reviews | https://www.amazon.com/s?k=Operating+Systems%3... | 3 |
| 8 | Hands-On Network Programming with C By Lewis ... | Rs.1800 | In stock | 70 Reviews | https://www.amazon.com/Hands-Network-Programmi... | 4 |
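
The script above keeps six parallel lists in step, which works as long as every book has every field. An alternative that may be easier to keep aligned is to build one record per book container. The sketch below assumes simplified, hypothetical markup modelled on the book containers of the site (the `example.com` links and book names are placeholders) and uses the built-in `html.parser` so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the site's book containers
html = """
<div class="book_container">
  <p class="book_name"><a href="https://example.com/a">Book A By Author A</a></p>
  <p class="price green">Rs.2000</p>
  <span class="fa fa-star"></span>
  <span class="fa fa-star not_filled"></span>
</div>
<div class="book_container">
  <p class="book_name"><a href="https://example.com/b">Book B By Author B</a></p>
  <p class="price green">Rs.5000</p>
  <span class="fa fa-star not_filled"></span>
  <span class="fa fa-star not_filled"></span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
records = []
for book in soup.find_all('div', class_='book_container'):
    name = book.find('p', class_='book_name')
    records.append({
        'Title/Author': name.get_text(strip=True),
        'Price': book.find('p', class_='price').get_text(strip=True),
        'Links': name.find('a').get('href'),
        # Five stars in total, minus the unfilled ones found in this container
        'Stars': 5 - len(book.find_all('span', class_='not_filled')),
    })

for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(records)` would give the same kind of table as before, with one row per book.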


## 4. Example 1 (cont): Scraping Information from Multiple Web Pages:
<h3 align="center" style="color:green">https://arifpucit.github.io/bss2/</h3>
<br>

- Visit the above web page and scrape the following six items for the 27 books across all three web pages:
    - Titles/Authors of the Book
    - Links of the Book
    - Price of the Book
    - Availability of the Book (In-Stock or Not in Stock)
    - Count of Reviews
    - Star ratings

- Note that the HTML structure of all three pages of our Book Scraping Site is the same.
- We have already written the code to scrape the information from the first page.
- Now we need a way to visit multiple pages and run the same code in a loop on each of them to grab the data of interest.
- Generally, when a website spans multiple pages, it adds some extra elements to its URL and keeps the rest of the URL the same.
- By closely observing the structure of the URL and how it changes from page to page, one can devise a way to generate each page's URL by appending strings to the base URL.
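
As a small illustration of generating URLs from a base URL, the sketch below joins the base URL of the Book Scraping Site with the file names of its three pages using the standard library's `urljoin`. The variable names are only illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://arifpucit.github.io/bss2/'
page_names = ['index.html', 'SP.html', 'CA.html']   # the three pages of the site

# urljoin resolves each file name relative to the base URL
urls = [urljoin(base_url, name) for name in page_names]
for url in urls:
    print(url)
```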

In [172]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]

def books(soup):
    """Append the details of every book on one page to the global lists."""
    sp_titles = soup.find_all('p', class_='book_name')
    sp_prices = soup.find_all('p', class_='price green')
    sp_availability = soup.find_all('p', class_='stock')
    sp_reviews = soup.find_all('p', {'class': 'review'})
    # for links: the href of the <a> inside each title paragraph
    sp_links = [p.find('a').get('href') for p in sp_titles]
    # Stars = 5 minus the number of unfilled star icons in the book's container
    for book in soup.find_all('div', {'class': 'book_container'}):
        stars.append(5 - len(book.find_all('span', {'class': 'not_filled'})))

    for i in range(len(sp_titles)):
        titles.append(sp_titles[i].text)
        prices.append(sp_prices[i].text)
        availability.append(sp_availability[i].text)
        reviews.append(sp_reviews[i].text)
        links.append(sp_links[i])


urls = ['https://arifpucit.github.io/bss2/index.html', 
        'https://arifpucit.github.io/bss2/SP.html', 
        'https://arifpucit.github.io/bss2/CA.html']                  
for url in urls:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')
    books(soup)

# Creating a dataframe and saving data in a csv file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, 
        'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('./Dataset/books3.csv', index=False)
df = pd.read_csv('./Dataset/books3.csv')
df

| | Title/Author | Price | Availability | Reviews | Links | Stars |
|---|---|---|---|---|---|---|
| 0 | Operating System Concepts By Avi Silberschatz | Rs.2000 | In stock | 20 Reviews | https://www.amazon.com/Operating-System-Concep... | 3 |
| 1 | UNIX The Textbook By Syed Mansoor Sarwar | Rs.5000 | In stock | 100 Reviews | https://www.google.com/search?q=Unix+the+textb... | 5 |
| 2 | Taxonomy of IDS By Arif Butt | Rs.6900 | Not in stock | 20 Reviews | https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d... | 4 |
| 3 | Understanding operating systems By Ida Flynn | Rs.2700 | Not in stock | 60 Reviews | https://www.amazon.com/Understanding-Operating... | 2 |
| 4 | Computer Systems By Randal E. Bryant | Rs.1700 | In stock | 25 Reviews | https://www.goodreads.com/book/show/829182.Com... | 2 |
| 5 | Linux bible Book By Christopher Negus | Rs.1800 | Not in stock | 21 Reviews | https://www.amazon.com/Linux-Bible-Christopher... | 1 |
| 6 | Advanced Programming in the UNIX Environment ... | Rs.6000 | In stock | 40 Reviews | https://www.amazon.com/dp/0321637739?tag=uuid1... | 1 |
| 7 | Operating Systems: A Design-oriented Approach ... | Rs.1000 | In stock | 90 Reviews | https://www.amazon.com/s?k=Operating+Systems%3... | 3 |
| 8 | Hands-On Network Programming with C By Lewis ... | Rs.1800 | In stock | 70 Reviews | https://www.amazon.com/Hands-Network-Programmi... | 4 |
| 9 | LINUX & UNIX Programming Tools By Syed Mansoo... | Rs.5000 | In stock | 200 Reviews | https://www.amazon.com/LINUX-UNIX-Programming-... | 2 |


## 5. Example 2: Scraping Information from Multiple Web Pages (Pagination):
<h3 align="center" style="color:green">https://arifbutt.me/category/sp-with-linux</h3>
<br>

- Visit the above web page and scrape the following three items for the System Programming videos on the first page:
    - Video Lecture Title
    - Description
    - YouTube Video Link

### a. Scraping Data from the First Page: 
- http://www.arifbutt.me/category/sp-with-linux/page/1/

In [76]:
url = 'http://www.arifbutt.me/category/sp-with-linux/'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

In [94]:
resp.text

'<!DOCTYPE html>\n<html lang="en-US">\n<head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <link rel="profile" href="http://gmpg.org/xfn/11">\n    <link rel="pingback" href="https://www.arifbutt.me/xmlrpc.php">\n    <meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n\n\t<!-- This site is optimized with the Yoast SEO plugin v21.5 - https://yoast.com/wordpress/plugins/seo/ -->\n\t<title>SP with Linux Archives - Page 5 of 5 - Arif Butt</title>\n\t<link rel="canonical" href="https://www.arifbutt.me/category/sp-with-linux/page/5/" />\n\t<link rel="prev" href="https://www.arifbutt.me/category/sp-with-linux/page/4/" />\n\t<meta property="og:locale" content="en_US" />\n\t<meta property="og:type" content="article" />\n\t<meta property="og:title" content="SP with Linux Archives - Page 5 of 5 - Arif Butt" />\n\t<meta property="og:url" content="https://www.arifbutt.me/categ

In [77]:
articles = soup.find_all('div', class_='media-body')
articles    

[<div class="media-body">
 <a class="pull-left" href="https://www.arifbutt.me/lec01-introduction-system-programming-arif-butt-pucit-2/"> <h4 class="media-heading1">Lec01 Introduction to System Programming (Arif Butt @ PUCIT)</h4> </a>
 <div class="excroipt">
 <div class="description">
 <p><iframe allow="autoplay; encrypted-media" allowfullscreen="" frameborder="0" height="480" src="https://www.youtube.com/embed/qThI-U34KYs?list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW" width="100%"></iframe></p>
 <p align="justify">This is the first session on the subject of System Programming with Linux. It starts with a discussion on application vs system programmer perspective. Describes briefly about a system call and how it works. A detailed discussion on Course Matrix.</p>
 <p>Email: arif@pucit.edu.pk<br/>
 Example Codes: <a href="https://bitbucket.org/arifpucit/spvl-repo/src" target="_blank">https://bitbucket.org/arifpucit/spvl-repo/src</a> </p>
 </div> </div>
 </div>,
 <div class="media-body">
 <a cl

In [78]:
article = soup.find('div', class_='media-body')
article

<div class="media-body">
<a class="pull-left" href="https://www.arifbutt.me/lec01-introduction-system-programming-arif-butt-pucit-2/"> <h4 class="media-heading1">Lec01 Introduction to System Programming (Arif Butt @ PUCIT)</h4> </a>
<div class="excroipt">
<div class="description">
<p><iframe allow="autoplay; encrypted-media" allowfullscreen="" frameborder="0" height="480" src="https://www.youtube.com/embed/qThI-U34KYs?list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW" width="100%"></iframe></p>
<p align="justify">This is the first session on the subject of System Programming with Linux. It starts with a discussion on application vs system programmer perspective. Describes briefly about a system call and how it works. A detailed discussion on Course Matrix.</p>
<p>Email: arif@pucit.edu.pk<br/>
Example Codes: <a href="https://bitbucket.org/arifpucit/spvl-repo/src" target="_blank">https://bitbucket.org/arifpucit/spvl-repo/src</a> </p>
</div> </div>
</div>

In [79]:
article.find('h4', class_='media-heading1').text

'Lec01 Introduction to System Programming (Arif Butt @ PUCIT)'

In [80]:
article.find('p', align="justify").text

'This is the first session on the subject of System Programming with Linux. It starts with a discussion on application vs system programmer perspective. Describes briefly about a system call and how it works. A detailed discussion on Course Matrix.'

In [81]:
article.find('iframe').get('src')

'https://www.youtube.com/embed/qThI-U34KYs?list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW'

In [82]:
article.find('iframe').get('src').split('/')

['https:',
 '',
 'www.youtube.com',
 'embed',
 'qThI-U34KYs?list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW']

In [83]:
article.find('iframe').get('src').split('/')[4]

'qThI-U34KYs?list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW'

In [84]:
article.find('iframe').get('src').split('/')[4].split('?')

['qThI-U34KYs', 'list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW']

In [85]:
video_id = article.find('iframe').get('src').split('/')[4].split('?')[0]
video_id

'qThI-U34KYs'

In [86]:
f'https://youtube.com/watch?v={video_id}'

'https://youtube.com/watch?v=qThI-U34KYs'
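
The chain of `split()` calls above depends on the video id sitting at index 4 of the split URL. A slightly more defensive variant, sketched below with the standard library's `urlparse`, takes the last segment of the URL path instead, so the query string (`?list=...`) never needs to be split off by hand. The helper name `embed_to_watch_url` is hypothetical.

```python
from urllib.parse import urlparse

def embed_to_watch_url(embed_url):
    """Convert a YouTube embed URL into a watch URL."""
    path = urlparse(embed_url).path              # e.g. '/embed/<video id>'
    video_id = path.rstrip('/').split('/')[-1]   # last path segment
    return f'https://youtube.com/watch?v={video_id}'

src = 'https://www.youtube.com/embed/qThI-U34KYs?list=PL7B2bn3G_wfC-mRpG7cxJMnGWdPAQTViW'
print(embed_to_watch_url(src))
```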

### b. Scraping Data from All the Pages of System Programming: 
- http://www.arifbutt.me/category/sp-with-linux/page/1/
- http://www.arifbutt.me/category/sp-with-linux/page/2/ 
- http://www.arifbutt.me/category/sp-with-linux/page/3/
- http://www.arifbutt.me/category/sp-with-linux/page/4/
- http://www.arifbutt.me/category/sp-with-linux/page/5/
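
The five URLs listed above follow a single pattern, so they can also be generated rather than typed out. The sketch below assumes the number of pages (5) is known in advance; that number is taken from the list above and may change if more lectures are added to the site.

```python
base_url = 'http://www.arifbutt.me/category/sp-with-linux/'
last_page = 5   # assumption: taken from the list of pages above

# Build .../page/1/ through .../page/5/
page_urls = [f'{base_url}page/{n}/' for n in range(1, last_page + 1)]
for page_url in page_urls:
    print(page_url)
```

When the page count is not known beforehand, following the "Next Page" link of each page is the more general approach.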

In [87]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [88]:
def videos(soup):
    """Append the title, description and YouTube link of each video on a page to the global lists."""
    articles = soup.find_all('div', class_='media-body')
    for article in articles:
        title = article.find('h4', class_="media-heading1").text
        titles.append(title)

        descr = article.find('p', align='justify').text
        descriptions.append(descr)

        # The iframe src looks like https://www.youtube.com/embed/<id>?list=...
        video_id = article.find('iframe')['src'].split('/')[4].split('?')[0]
        youtube_link = f'https://youtube.com/watch?v={video_id}'
        links.append(youtube_link)

In [89]:
titles = []
descriptions = []
links=[]

first_page = requests.get("http://www.arifbutt.me/category/sp-with-linux/")
soup = BeautifulSoup(first_page.text,'lxml')
videos(soup)

In [90]:
titles

['Lec01 Introduction to System Programming (Arif Butt @ PUCIT)',
 'Lec02 C Compilation: A System Programmer Perspective (Arif Butt @ PUCIT)',
 'Lec03 Working of Linkers: Creating your own Libraries (Arif Butt @ PUCIT)',
 'Lec04 UNIX make utility (Arif Butt @ PUCIT)',
 'Lec05 GNU autotools and cmake (Arif Butt @ PUCIT)',
 'Lec06 Versioning Systems git-I (Arif Butt @ PUCIT)',
 'Lec07 Versioning Systems git-II (Arif Butt @ PUCIT)',
 'Lec08 Exit Handlers and Resource Limits (Arif Butt @ PUCIT)',
 'Lec09 Stack Behind the Curtain (Arif Butt @ PUCIT)']

In [91]:
pegination_code = soup.find('div',class_="navigation_pegination")
pegination_code

<div class="navigation_pegination"><ul>
<li class="active"><a href="https://www.arifbutt.me/category/sp-with-linux/">1</a></li>
<li><a href="https://www.arifbutt.me/category/sp-with-linux/page/2/">2</a></li>
<li><a href="https://www.arifbutt.me/category/sp-with-linux/page/3/">3</a></li>
<li>...</li>
<li><a href="https://www.arifbutt.me/category/sp-with-linux/page/5/">5</a></li>
<li><a href="https://www.arifbutt.me/category/sp-with-linux/page/2/">Next Page »</a></li>
</ul></div>

In [92]:
pegination_code = soup.find('div', class_="navigation_pegination")
all_links = pegination_code.find_all('li')

# The last <li> of the pagination bar holds the "Next Page »" link
last_link = all_links[-1]

next_url = last_link.find('a').get('href')

resp = requests.get(next_url)
soup = BeautifulSoup(resp.text,'lxml')
videos(soup)
print(next_url)

https://www.arifbutt.me/category/sp-with-linux/page/2/


In [173]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
descriptions = []
links=[]

first_page = requests.get("http://www.arifbutt.me/category/sp-with-linux/")
soup = BeautifulSoup(first_page.text,'lxml')
videos(soup)


while True:
    pegination_code = soup.find('div', class_="navigation_pegination")
    all_links = pegination_code.find_all('li')

    # The last <li> holds the "Next Page »" link on every page except the last one
    last_anchor = all_links[-1].find('a')
    if last_anchor is not None and last_anchor.text == "Next Page »":
        next_url = last_anchor.get('href')
        resp = requests.get(next_url)
        soup = BeautifulSoup(resp.text, 'lxml')
        videos(soup)
    else:
        break


# Creating a dataframe and saving data in a csv file
data = {'Title':titles, 'YouTube Link':links, 'Description':descriptions}
df = pd.DataFrame(data, columns=['Title', 'YouTube Link', 'Description'])
df.to_csv('./Dataset/spvideos.csv', index=False)
df = pd.read_csv('./Dataset/spvideos.csv')
df

| | Title | YouTube Link | Description |
|---|---|---|---|
| 0 | Lec01 Introduction to System Programming (Arif... | https://youtube.com/watch?v=qThI-U34KYs | This is the first session on the subject of Sy... |
| 1 | Lec02 C Compilation: A System Programmer Persp... | https://youtube.com/watch?v=a7GhFL0Gh6Y | This session starts with the C-Compilation pro... |
| 2 | Lec03 Working of Linkers: Creating your own Li... | https://youtube.com/watch?v=A67t7X2LUsA | Linking and loading a process (Behind the curt... |
| 3 | Lec04 UNIX make utility (Arif Butt @ PUCIT) | https://youtube.com/watch?v=8hG0MTyyxMI | This session deals with the famous UNIX make u... |
| 4 | Lec05 GNU autotools and cmake (Arif Butt @ PUCIT) | https://youtube.com/watch?v=Ncb_xzjGAwM | This session starts with a brief comparison be... |
| 5 | Lec06 Versioning Systems git-I (Arif Butt @ PU... | https://youtube.com/watch?v=TBqLJg6PmWQ | This session gives an overview of different mo... |
| 6 | Lec07 Versioning Systems git-II (Arif Butt @ P... | https://youtube.com/watch?v=3akXFcBDYc0 | This is a continuity of previous session and s... |
| 7 | Lec08 Exit Handlers and Resource Limits (Arif ... | https://youtube.com/watch?v=ujzom1OyPMY | This session describes as to how a C program s... |
| 8 | Lec09 Stack Behind the Curtain (Arif Butt @ PU... | https://youtube.com/watch?v=1XbTmmWxHzo | This session describes how a process is laid o... |
| 9 | Lec10 Heap Behind the Curtain (Arif Butt @ PUCIT) | https://youtube.com/watch?v=zpcPS27ZQr0 | This session start with a discussion on types ... |
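
The pagination loop can be tried without a network connection. The sketch below is an illustration under stated assumptions, not the site's actual markup: `PAGES` is a hypothetical two-page site held in a dictionary, the helper `next_page_url` is a made-up name, and the `visited` set is an extra guard (not in the code above) against a pagination bar that links back to a page already seen.

```python
from bs4 import BeautifulSoup

# Hypothetical two-page site held in memory so the loop runs offline
PAGES = {
    '/page/1/': '<h4 class="media-heading1">Lec01</h4>'
                '<div class="navigation_pegination"><ul>'
                '<li class="active"><a href="/page/1/">1</a></li>'
                '<li><a href="/page/2/">Next Page »</a></li></ul></div>',
    '/page/2/': '<h4 class="media-heading1">Lec02</h4>'
                '<div class="navigation_pegination"><ul>'
                '<li><a href="/page/1/">1</a></li>'
                '<li class="active"><a href="/page/2/">2</a></li></ul></div>',
}

def next_page_url(soup):
    """Return the href of the 'Next Page' link, or None on the last page."""
    nav = soup.find('div', class_='navigation_pegination')
    if nav is None:
        return None
    for anchor in nav.find_all('a'):
        if anchor.get_text(strip=True).startswith('Next Page'):
            return anchor.get('href')
    return None

titles, url, visited = [], '/page/1/', set()
while url is not None and url not in visited:
    visited.add(url)
    # PAGES[url] stands in for requests.get(url).text
    soup = BeautifulSoup(PAGES[url], 'html.parser')
    titles += [h4.get_text(strip=True)
               for h4 in soup.find_all('h4', class_='media-heading1')]
    url = next_page_url(soup)

print(titles)
```

Searching every anchor for the "Next Page" text, rather than looking only at the last `<li>`, keeps the loop working even if the site places other items after that link.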


## 6. Limitations of the Requests and BeautifulSoup Libraries
- When you load a website in your browser, the browser sends a request to the page's server to retrieve the page content, which is usually some HTML, some CSS, and some JavaScript.
- A key difference between loading a page in your browser and fetching it with `requests` is that the browser executes any JavaScript the page comes with. Sometimes you can see the initial page content (before the JavaScript runs) for a few moments, and then the JavaScript kicks in.
- I see this problem very frequently in my courses. Unfortunately, the only way to get the page as it looks after the JavaScript has run is to run the JavaScript, and that requires a JavaScript engine. In other words, you need a browser or a browser-like program to get the final page.
- Solution:
    - Selenium is a browser automation tool, which means you can use Selenium to control a browser. You can make Selenium load the page you are interested in, evaluate the JavaScript, and then get the page content.
    - `requests-html` is another library that lets you evaluate the JavaScript after you have retrieved the page. It uses `requests` to get the page content and then runs the page through the Chrome browser engine (Chromium) to "calculate" the final page. However, it is still very much under active development and I have had a few problems with it.

### a. You cannot Scrape JavaScript Driven Websites using BeautifulSoup (https://arifpucit.github.io/bss2/js)

In [174]:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://arifpucit.github.io/bss2/")
soup = BeautifulSoup(resp.text,'lxml')
price = soup.find_all('p', class_='price green')
price

[<p class="price green">Rs.2000</p>,
 <p class="price green">Rs.5000</p>,
 <p class="price green">Rs.6900</p>,
 <p class="price green">Rs.2700</p>,
 <p class="price green">Rs.1700</p>,
 <p class="price green">Rs.1800</p>,
 <p class="price green">Rs.6000</p>,
 <p class="price green">Rs.1000</p>,
 <p class="price green">Rs.1800</p>]

In [175]:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://arifpucit.github.io/bss2/js")
soup = BeautifulSoup(resp.text,'lxml')
prices = soup.find_all('p', class_='price green')
prices

[]
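
The empty list can be reproduced offline. The sketch below uses a hypothetical page (not the actual source of the site above) in which the price paragraph is created only when the script runs. Since BeautifulSoup parses the HTML text and never executes the script, the paragraph exists only as a string inside the `<script>` tag and is not found.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the price paragraph is created only when the script runs
html = """
<html><body>
  <div id="books"></div>
  <script>
    document.getElementById('books').innerHTML =
        '<p class="price green">Rs.2000</p>';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
prices = soup.find_all('p', class_='price green')
print(prices)                              # no <p> tag exists in the static HTML
print(soup.find('script') is not None)     # the script itself is present, just not executed
```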

### b. You cannot enter Text and click buttons using BeautifulSoup (https://arifpucit.github.io/bss2/login)

In [176]:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://arifpucit.github.io/bss2/login/")
soup = BeautifulSoup(resp.text,'lxml')
prices = soup.find_all('p', class_='green')
prices

[]