## BeautifulSoup
- If a tag is not present, `data.TAG` returns `None` instead of raising an error
- Syntax:
```
html=""" HTML DATA """
from bs4 import BeautifulSoup as bp
data=bp(html,"html.parser")
```
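
The behaviour for missing tags can be sketched as follows. This is a minimal example on a made-up HTML snippet (the markup and the `heading` variable are illustrative assumptions, not part of the notes above). It shows that attribute access and `find()` return `None`, `find_all()` returns an empty list, and chaining on a missing tag raises `AttributeError`.

```python
from bs4 import BeautifulSoup

# Illustrative markup with no <h1> tag.
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

# Attribute-style access and find() both return None for a missing tag.
missing = soup.h1
print(missing)               # None
print(soup.find("h1"))       # None

# find_all() returns an empty list instead of None.
print(soup.find_all("h1"))   # []

# Chaining on a missing tag fails, because None has no attributes.
try:
    soup.h1.string
except AttributeError as exc:
    print("AttributeError:", exc)

# A simple guard avoids the failure.
heading = soup.h1.string if soup.h1 is not None else "(no heading)"
print(heading)               # (no heading)
```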



### BeautifulSoup attributes
- data.TAG
    - data.TAG.name <b>DONE</b>
    - data.TAG.attrs <b>DONE</b>
    - data.TAG.string
    - data.TAG.strings
    - data.TAG.stripped_strings
    - data.TAG.children
    - data.TAG.contents
    - data.TAG.descendants
    - data.TAG.parent
    - data.TAG.parents
    - data.TAG.next_sibling
    - data.TAG.previous_sibling
    - data.TAG.next_siblings
    - data.TAG.previous_siblings
    - data.TAG.next_element
    - data.TAG.previous_element
    - data.TAG.next_elements
    - data.TAG.previous_elements
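
A short sketch of how some of these attributes behave, using a small made-up HTML string (the markup and variable names are assumptions for illustration). The snippet has no whitespace between tags, so siblings are tags rather than newline strings.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><h1>Title</h1><p>First <b>bold</b> text</p><p>Second</p></div>'
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

# .string is None when a tag has more than one child.
print(first_p.string)                      # None
print(soup.h1.string)                      # Title

# .strings yields every text node; .stripped_strings also trims whitespace.
print(list(first_p.strings))               # ['First ', 'bold', ' text']
print(list(first_p.stripped_strings))      # ['First', 'bold', 'text']

# .contents is a list of direct children; .children iterates over the same nodes.
print(len(first_p.contents))               # 3
# .descendants walks the whole subtree (children, grandchildren, ...).
print(len(list(first_p.descendants)))      # 4

# Moving up and sideways in the tree.
print(first_p.parent["id"])                # box
print([t.name for t in first_p.parents])   # ['div', '[document]']
print(first_p.next_sibling)                # <p>Second</p>
print(first_p.previous_sibling)            # <h1>Title</h1>

# .next_element follows parse order, so it enters the tag instead of skipping it.
print(first_p.next_element == "First ")    # True
```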

### BeautifulSoup functions
- data.prettify() <b>DONE</b>
- data.get_text() <b>DONE</b>
- data.find("TAG") <b>DONE</b>
- data.find_all("TAG") or data.find_all(["TAG1","TAG2"]) <b>DONE</b>


In [1]:
html="""
<html>
<head>
<title>Sample Title</title>
</head>
<body>
<h1 id="id_header" class="class_header">Sample Header</h1>
<p id="id_paragraph" class="class_paragraph">Sample Paragraph</p>
<p id="id_paragraph" class="class_paragraph">Second Sample Paragraph</p>
</body>
</html>
"""

In [2]:
from bs4 import BeautifulSoup as bp

In [3]:
data=bp(html,"html.parser")

In [4]:
data.prettify()

'<html>\n <head>\n  <title>\n   Sample Title\n  </title>\n </head>\n <body>\n  <h1 class="class_header" id="id_header">\n   Sample Header\n  </h1>\n  <p class="class_paragraph" id="id_paragraph">\n   Sample Paragraph\n  </p>\n  <p class="class_paragraph" id="id_paragraph">\n   Second Sample Paragraph\n  </p>\n </body>\n</html>\n'

In [5]:
data.title

<title>Sample Title</title>

In [6]:
data.title.name

'title'

In [59]:
data.h1.attrs

{'id': 'id_header', 'class': ['class_header']}

In [60]:
data.h1["id"]

'id_header'

In [61]:
data.get_text()

'\n\n\nSample Title\n\n\nSample Header\nSample Paragraph\nSecond Sample Paragraph\n\n\n'
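
The raw `get_text()` output keeps every newline from the markup. As a sketch, `get_text()` also accepts a separator and a `strip` flag, which may give cleaner text. The example below rebuilds the same sample HTML so that it runs on its own.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>Sample Title</title>
</head>
<body>
<h1 id="id_header" class="class_header">Sample Header</h1>
<p id="id_paragraph" class="class_paragraph">Sample Paragraph</p>
<p id="id_paragraph" class="class_paragraph">Second Sample Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Join the text nodes with a space and drop whitespace-only strings.
clean = soup.get_text(" ", strip=True)
print(clean)   # Sample Title Sample Header Sample Paragraph Second Sample Paragraph

# stripped_strings gives the same pieces as a sequence.
pieces = list(soup.stripped_strings)
print(pieces)  # ['Sample Title', 'Sample Header', 'Sample Paragraph', 'Second Sample Paragraph']
```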

In [62]:
data.find("p")

<p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>

In [63]:
data.find_all("p")

[<p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>,
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>]

In [64]:
arr = data.find_all("p")
for i in arr:
    print(i.get_text())

Sample Paragraph
Second Sample Paragraph


In [65]:
data.find_all(["p","h1"])

[<h1 class="class_header" id="id_header">Sample Header</h1>,
 <p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>,
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>]

In [66]:
data.find_all(True)

[<html>
 <head>
 <title>Sample Title</title>
 </head>
 <body>
 <h1 class="class_header" id="id_header">Sample Header</h1>
 <p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>
 </body>
 </html>,
 <head>
 <title>Sample Title</title>
 </head>,
 <title>Sample Title</title>,
 <body>
 <h1 class="class_header" id="id_header">Sample Header</h1>
 <p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>
 </body>,
 <h1 class="class_header" id="id_header">Sample Header</h1>,
 <p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>,
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>]

In [67]:
data.find_all(id="id_paragraph")

[<p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>,
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>]

In [68]:
data.find_all(class_="class_paragraph")

[<p class="class_paragraph" id="id_paragraph">Sample Paragraph</p>,
 <p class="class_paragraph" id="id_paragraph">Second Sample Paragraph</p>]
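
The filters shown above can also be combined. A minimal sketch on a trimmed copy of the sample HTML follows; the `limit` and `attrs` arguments are part of `find_all()`, while the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h1 id="id_header" class="class_header">Sample Header</h1>
<p id="id_paragraph" class="class_paragraph">Sample Paragraph</p>
<p id="id_paragraph" class="class_paragraph">Second Sample Paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name and attribute filters can be combined in one call.
both = soup.find_all("p", class_="class_paragraph")
print(len(both))                       # 2

# limit stops the search after the given number of matches.
first_only = soup.find_all("p", limit=1)
print(first_only[0].get_text())        # Sample Paragraph

# attrs takes a dict, which helps for attribute names that are not valid keywords.
headers = soup.find_all(attrs={"id": "id_header"})
print(headers[0].name)                 # h1
```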

### First web page with BeautifulSoup and requests
In this section we scrape the world's first web page, https://info.cern.ch/hypertext/WWW/TheProject.html, using the requests and BeautifulSoup libraries.

In [13]:
from bs4 import BeautifulSoup as bp
import requests

response = requests.get("https://info.cern.ch/hypertext/WWW/TheProject.html")
response.headers

{'Date': 'Sun, 24 Sep 2023 05:01:43 GMT', 'Server': 'Apache', 'Last-Modified': 'Thu, 03 Dec 1992 08:37:20 GMT', 'ETag': '"8a9-291e721905000"', 'Accept-Ranges': 'bytes', 'Content-Length': '2217', 'Connection': 'close', 'Content-Type': 'text/html'}

In [14]:
data = bp(response.text, "html.parser")
data.prettify()

'<header>\n <title>\n  The World Wide Web project\n </title>\n <nextid n="55"/>\n</header>\n<body>\n <h1>\n  World Wide Web\n </h1>\n The WorldWideWeb (W3) is a wide-area\n <a href="WhatIs.html" name="0">\n  hypermedia\n </a>\n information retrieval\ninitiative aiming to give universal\naccess to a large universe of documents.\n <p>\n  Everything there is online about\nW3 is linked directly or indirectly\nto this document, including an\n  <a href="Summary.html" name="24">\n   executive\nsummary\n  </a>\n  of the project,\n  <a href="Administration/Mailing/Overview.html" name="29">\n   Mailing lists\n  </a>\n  ,\n  <a href="Policy.html" name="30">\n   Policy\n  </a>\n  , November\'s\n  <a href="News/9211.html" name="34">\n   W3  news\n  </a>\n  ,\n  <a href="FAQ/List.html" name="41">\n   Frequently Asked Questions\n  </a>\n  .\n  <dl>\n   <dt>\n    <a href="../DataSources/Top.html" name="44">\n     What\'s out there?\n    </a>\n    <dd>\n     Pointers to the\nworld\'s online information

In [16]:
data.h1.string

'World Wide Web'

In [17]:
data.title.string

'The World Wide Web project'

In [20]:
data.get_text()

"\nThe World Wide Web project\n\n\n\nWorld Wide WebThe WorldWideWeb (W3) is a wide-area\nhypermedia information retrieval\ninitiative aiming to give universal\naccess to a large universe of documents.\nEverything there is online about\nW3 is linked directly or indirectly\nto this document, including an executive\nsummary of the project, Mailing lists\n, Policy , November's  W3  news ,\nFrequently Asked Questions .\n\nWhat's out there?\n Pointers to the\nworld's online information, subjects\n, W3 servers, etc.\nHelp\n on the browser you are using\nSoftware Products\n A list of W3 project\ncomponents and their current state.\n(e.g. Line Mode ,X11 Viola ,  NeXTStep\n, Servers , Tools , Mail robot ,\nLibrary )\nTechnical\n Details of protocols, formats,\nprogram internals etc\nBibliography\n Paper documentation\non  W3 and references.\nPeople\n A list of some people involved\nin the project.\nHistory\n A summary of the history\nof the project.\nHow can I help ?\n If you would like\nto supp

In [23]:
data.find_all("a")

[<a href="WhatIs.html" name="0">
 hypermedia</a>,
 <a href="Summary.html" name="24">executive
 summary</a>,
 <a href="Administration/Mailing/Overview.html" name="29">Mailing lists</a>,
 <a href="Policy.html" name="30">Policy</a>,
 <a href="News/9211.html" name="34">W3  news</a>,
 <a href="FAQ/List.html" name="41">Frequently Asked Questions</a>,
 <a href="../DataSources/Top.html" name="44">What's out there?</a>,
 <a href="../DataSources/bySubject/Overview.html" name="45"> subjects</a>,
 <a href="../DataSources/WWW/Servers.html" name="z54">W3 servers</a>,
 <a href="Help.html" name="46">Help</a>,
 <a href="Status.html" name="13">Software Products</a>,
 <a href="LineMode/Browser.html" name="27">Line Mode</a>,
 <a href="Status.html#35" name="35">Viola</a>,
 <a href="NeXT/WorldWideWeb.html" name="26">NeXTStep</a>,
 <a href="Daemon/Overview.html" name="25">Servers</a>,
 <a href="Tools/Overview.html" name="51">Tools</a>,
 <a href="MailRobot/Overview.html" name="53"> Mail robot</a>,
 <a href="S

In [29]:
arr = data.find_all(href=True)

In [33]:
for i in arr:
    print(i.attrs['href'])

WhatIs.html
Summary.html
Administration/Mailing/Overview.html
Policy.html
News/9211.html
FAQ/List.html
../DataSources/Top.html
../DataSources/bySubject/Overview.html
../DataSources/WWW/Servers.html
Help.html
Status.html
LineMode/Browser.html
Status.html#35
NeXT/WorldWideWeb.html
Daemon/Overview.html
Tools/Overview.html
MailRobot/Overview.html
Status.html#57
Technical.html
Bibliography.html
People.html
History.html
Helping.html
../README.html
LineMode/Defaults/Distribution.html
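
The `href` values printed above are relative, so they cannot be requested directly. One way to resolve them, sketched here with the standard library's `urljoin`, is to join each one with the URL of the page it came from. Only three of the links are repeated to keep the example short.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from.
page_url = "https://info.cern.ch/hypertext/WWW/TheProject.html"

# A few of the relative href values printed above.
hrefs = ["WhatIs.html", "../DataSources/Top.html", "Status.html#35"]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
# https://info.cern.ch/hypertext/WWW/WhatIs.html
# https://info.cern.ch/hypertext/DataSources/Top.html
# https://info.cern.ch/hypertext/WWW/Status.html#35
```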


### books.toscrape.com
In this section we scrape http://books.toscrape.com/ using requests and BeautifulSoup.

In [36]:
import requests
from bs4 import BeautifulSoup as bp

In [37]:
response = requests.get("http://books.toscrape.com/")
html = response.text

In [38]:
data = bp(html,"html.parser")
data.prettify()



In [42]:
data.title

<title>
    All products | Books to Scrape - Sandbox
</title>

In [43]:
data.a

<a href="index.html">Books to Scrape</a>

In [47]:
arr = data.find_all(class_="product_pod")
arr

[<article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="th
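
The `Â£51.77` in the output above is what the UTF-8 bytes for `£` look like when they are decoded as ISO-8859-1. This most likely happens because `response.text` falls back to that encoding when the server does not declare a charset. The sketch below reproduces the effect offline; the `raw` bytes are a stand-in for `response.content`, not real server output.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: the raw UTF-8 bytes a server might send.
raw = '<p class="price_color">£51.77</p>'.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 reproduces the garbled price.
wrong = BeautifulSoup(raw.decode("iso-8859-1"), "html.parser").p.string
print(wrong)    # Â£51.77

# Passing bytes plus the real encoding gives the intended text.
right = BeautifulSoup(raw, "html.parser", from_encoding="utf-8").p.string
print(right)    # £51.77
```

With requests, two options that should have the same effect are passing `response.content` to BeautifulSoup instead of `response.text`, or setting `response.encoding = "utf-8"` before reading `response.text`.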

In [60]:
for i in arr:
    print("Title:", i.h3.a.string, "URL:", i.h3.a.attrs['href'])


Title: A Light in the ... URL: catalogue/a-light-in-the-attic_1000/index.html
Title: Tipping the Velvet URL: catalogue/tipping-the-velvet_999/index.html
Title: Soumission URL: catalogue/soumission_998/index.html
Title: Sharp Objects URL: catalogue/sharp-objects_997/index.html
Title: Sapiens: A Brief History ... URL: catalogue/sapiens-a-brief-history-of-humankind_996/index.html
Title: The Requiem Red URL: catalogue/the-requiem-red_995/index.html
Title: The Dirty Little Secrets ... URL: catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
Title: The Coming Woman: A ... URL: catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
Title: The Boys in the ... URL: catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
Title: The Black Maria URL: catalogue/the-black-maria_991/index.html
Title: Starving Hearts (Triangular Trade ... URL: catalogue/starvin
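
Several titles above end in `...` because the link text on the page is shortened. In the markup shown earlier, the full title is stored in the `title` attribute of the same `<a>` tag. A minimal sketch, using one product entry trimmed from that output:

```python
from bs4 import BeautifulSoup

# One product entry, trimmed from the output shown earlier.
html = '''
<article class="product_pod">
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
'''
soup = BeautifulSoup(html, "html.parser")

for pod in soup.find_all(class_="product_pod"):
    link = pod.h3.a
    # The visible text is shortened; the title attribute holds the full name.
    print("Short:", link.string)     # Short: A Light in the ...
    print("Full: ", link["title"])   # Full:  A Light in the Attic
```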

### Scrape all categories from books.toscrape.com

In [81]:
arr = data.find_all(class_="side_categories")
arr

[<div class="side_categories">
 <ul class="nav nav-list">
 <li>
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>
 <ul>
 <li>
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>
 </li>
 <li>
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>
 </li>
 <li>
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>
 </li>
 <li>
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                        

In [105]:
for i in arr:
    print(i.ul.li.ul.get_text().split())


['Travel', 'Mystery', 'Historical', 'Fiction', 'Sequential', 'Art', 'Classics', 'Philosophy', 'Romance', 'Womens', 'Fiction', 'Fiction', 'Childrens', 'Religion', 'Nonfiction', 'Music', 'Default', 'Science', 'Fiction', 'Sports', 'and', 'Games', 'Add', 'a', 'comment', 'Fantasy', 'New', 'Adult', 'Young', 'Adult', 'Science', 'Poetry', 'Paranormal', 'Art', 'Psychology', 'Autobiography', 'Parenting', 'Adult', 'Fiction', 'Humor', 'Horror', 'History', 'Food', 'and', 'Drink', 'Christian', 'Fiction', 'Business', 'Biography', 'Thriller', 'Contemporary', 'Spirituality', 'Academic', 'Self', 'Help', 'Historical', 'Christian', 'Suspense', 'Short', 'Stories', 'Novels', 'Health', 'Politics', 'Cultural', 'Erotica', 'Crime']
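
Calling `split()` on the combined text breaks multi-word categories such as "Historical Fiction" into separate items. A possible alternative, sketched below on a trimmed copy of the sidebar markup, is to read one name per `<a>` tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sidebar markup shown above.
html = '''
<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
    Books
</a>
<ul>
<li><a href="catalogue/category/books/travel_2/index.html">
    Travel
</a></li>
<li><a href="catalogue/category/books/historical-fiction_4/index.html">
    Historical Fiction
</a></li>
</ul>
</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

sidebar = soup.find(class_="side_categories")
# Take one name per <a> so multi-word categories stay together.
categories = [a.get_text(strip=True) for a in sidebar.ul.li.ul.find_all("a")]
print(categories)   # ['Travel', 'Historical Fiction']
```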


### Scrape all the pages

In [135]:
from bs4 import BeautifulSoup as bp
import requests

In [136]:
response = requests.get("http://books.toscrape.com/")
html = response.text
html



In [137]:
data = bp(html,"html.parser")
data

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [141]:
base_url  = "http://books.toscrape.com/"
next_page = data.find_all(class_="next")[0].a.attrs["href"]

In [142]:
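# The first "next" link is relative to the site root.
print(base_url+next_page)
for i in range(1,49):
    # Fetch the page that the current "next" link points to.
    next_response = requests.get(base_url+next_page)
    next_data = bp(next_response.text,"html.parser")
    # From page 2 onward, "next" links are relative to /catalogue/.
    base_url  = "http://books.toscrape.com/catalogue/"
    next_page = next_data.find_all(class_="next")[0].a.attrs["href"]
    print(base_url+next_page)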

http://books.toscrape.com/catalogue/page-2.html
http://books.toscrape.com/catalogue/page-3.html
http://books.toscrape.com/catalogue/page-4.html
http://books.toscrape.com/catalogue/page-5.html
http://books.toscrape.com/catalogue/page-6.html
http://books.toscrape.com/catalogue/page-7.html
http://books.toscrape.com/catalogue/page-8.html
http://books.toscrape.com/catalogue/page-9.html
http://books.toscrape.com/catalogue/page-10.html
http://books.toscrape.com/catalogue/page-11.html
http://books.toscrape.com/catalogue/page-12.html
http://books.toscrape.com/catalogue/page-13.html
http://books.toscrape.com/catalogue/page-14.html
http://books.toscrape.com/catalogue/page-15.html
http://books.toscrape.com/catalogue/page-16.html
http://books.toscrape.com/catalogue/page-17.html
http://books.toscrape.com/catalogue/page-18.html
http://books.toscrape.com/catalogue/page-19.html
http://books.toscrape.com/catalogue/page-20.html
http://books.toscrape.com/catalogue/page-21.html
http://books.toscrape.com/ca
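
The loop above hard-codes the number of pages and switches `base_url` by hand. A more general sketch is to follow the "next" link until a page has none, and to let `urljoin` resolve each relative link. Here `iter_page_urls`, `fetch`, and `fake_site` are hypothetical names introduced for illustration, and the `<li class="next">` markup is an assumed stand-in for the real pages so that the example runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def iter_page_urls(start_url, fetch):
    """Yield each page URL, following the "next" link until none is left.

    fetch is a hypothetical callable that takes a URL and returns HTML text.
    """
    url = start_url
    while url is not None:
        yield url
        soup = BeautifulSoup(fetch(url), "html.parser")
        next_item = soup.find(class_="next")
        if next_item is not None and next_item.a is not None:
            # Resolve the relative link against the page it was found on.
            url = urljoin(url, next_item.a["href"])
        else:
            # No "next" element on this page, so stop.
            url = None


# Offline stand-in for the site: three tiny pages keyed by URL.
fake_site = {
    "http://books.toscrape.com/":
        '<li class="next"><a href="catalogue/page-2.html">next</a></li>',
    "http://books.toscrape.com/catalogue/page-2.html":
        '<li class="next"><a href="page-3.html">next</a></li>',
    "http://books.toscrape.com/catalogue/page-3.html":
        '<p>last page</p>',
}

pages = list(iter_page_urls("http://books.toscrape.com/", fake_site.__getitem__))
for page in pages:
    print(page)
# http://books.toscrape.com/
# http://books.toscrape.com/catalogue/page-2.html
# http://books.toscrape.com/catalogue/page-3.html
```

Against the live site, `fetch` could be something like `lambda u: requests.get(u).text`, though that has not been run here.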