# Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

There are two common ways to get data from a website:

- Use the site's API, if it offers one
- Scrape the HTML with a tool such as bs4

We will need Beautiful Soup, requests, and html5lib for this purpose:
- html5lib for parsing HTML (optional; the cells below use Python's built-in `html.parser`)
- requests for making HTTP requests
- bs4 for HTML tree traversal

### Step 1

Import the required libraries.


In [1]:
import requests
from bs4 import BeautifulSoup

### Step 2

Get the HTML

In [2]:
url = "https://codewithharry.com/videos/"
r = requests.get(url)              # making the get request to the URL
htmlContent = r.content
print(htmlContent)

b'<!DOCTYPE html><html lang="en"><head><meta name="viewport" content="width=device-width"/><meta charSet="utf-8"/><script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9655830461045889" crossorigin="anonymous"></script><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n        new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n        j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n        \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n        })(window,document,\'script\',\'dataLayer\',\'GTM-MCDDKRF\');</script><title>Courses | CodeWithHarry</title><meta name="description" content="Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free!"/><link rel="icon" href="/favicon.ico"/><meta name="next-head-count" content="7"/><link rel="preload" href="/_next/static/css/95fea
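The request above assumes the server answers quickly and with a success status. A minimal sketch of a more defensive fetch is shown below; the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an exception on HTTP errors."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would then give the same bytes as `r.content` in the cell above, but a 404 or 500 response surfaces as an exception instead of being parsed as if it were the page.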

### Step 3: Parse the HTML

This step is often called "creating the soup".

In [3]:
soup = BeautifulSoup(htmlContent, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [4]:
title = soup.title          # get the title of the html page
title, type(title)

(<title>Courses | CodeWithHarry</title>, bs4.element.Tag)

In [5]:
title.string

'Courses | CodeWithHarry'
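`.string` works for the title because that tag has exactly one child. As a small illustration on made-up markup (not taken from the page above), a tag with several children returns `None` from `.string`, while `get_text()` joins all the text inside it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> has two children, a text node and a <b> tag
demo = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(demo.b.string)      # the <b> tag has a single string child
print(demo.p.string)      # None, because <p> has more than one child
print(demo.p.get_text())  # all text inside <p>, joined together
```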

#### Commonly used types of objects

1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
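The four types can be seen side by side on a tiny document. This is only a sketch on made-up markup; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a text node and a comment
markup = "<p>Hello<!-- a hidden note --></p>"
demo = BeautifulSoup(markup, "html.parser")

p_tag = demo.p                              # a Tag
text_node, comment_node = p_tag.contents    # the two children of <p>

print(type(demo).__name__)          # BeautifulSoup: the whole document
print(type(p_tag).__name__)         # Tag: an element such as <p>
print(type(text_node).__name__)     # NavigableString: text inside a tag
print(type(comment_node).__name__)  # Comment: a special kind of string
```

`Comment` is a subclass of `NavigableString`, which is why a comment can turn up wherever a plain string is expected.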

In [6]:
# Get all the paragraphs from the page.
paras = soup.find_all('p')
paras

[<p class="text-sm text-gray-500 sm:ml-4 sm:pl-4 sm:border-l-2 sm:border-gray-200 sm:py-2 sm:mt-0 mt-4 text-center">Copyright © 2022 CodeWithHarry.com</p>]

In [7]:
# Get all the anchor tags from the page.
anchors = soup.find_all('a')
anchors

[<a href="/">CodeWithHarry</a>,
 <a href="/">Home</a>,
 <a href="/videos/">Courses</a>,
 <a href="/blog/">Blog</a>,
 <a href="/contact/">Contact</a>,
 <a class="text-gray-500" href="https://www.facebook.com/CodeWithHarry/" rel="noreferrer" target="_blank"><svg class="w-5 h-5" fill="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="2" viewbox="0 0 24 24"><path d="M18 2h-3a5 5 0 00-5 5v3H7v4h3v8h4v-8h3l1-4h-4V7a1 1 0 011-1h3z"></path></svg></a>,
 <a class="ml-3 text-gray-500" href="https://www.instagram.com/CodeWithHarry/" rel="noreferrer" target="_blank"><svg class="w-5 h-5" fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="2" viewbox="0 0 24 24"><rect height="20" rx="5" ry="5" width="20" x="2" y="2"></rect><path d="M16 11.37A4 4 0 1112.63 8 4 4 0 0116 11.37zm1.5-4.87h.01"></path></svg></a>]
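`find_all` also accepts filters, which helps when only some of the matching tags are wanted. The sketch below uses a small made-up snippet modelled on the anchors above; the markup itself is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely based on the navigation links above
html = """
<nav>
  <a href="/">Home</a>
  <a href="/blog/">Blog</a>
  <a class="text-gray-500" href="https://www.facebook.com/CodeWithHarry/">Facebook</a>
  <a id="top">No link here</a>
</nav>
"""
demo = BeautifulSoup(html, "html.parser")

with_href = demo.find_all("a", href=True)           # only anchors that have an href
gray = demo.find_all("a", class_="text-gray-500")   # filter by CSS class
first_two = demo.find_all("a", limit=2)             # stop after two matches

print(len(with_href), len(gray), len(first_two))
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.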

In [8]:
# Get the first paragraph from the page.
para_1 = soup.find('p')
para_1

<p class="text-sm text-gray-500 sm:ml-4 sm:pl-4 sm:border-l-2 sm:border-gray-200 sm:py-2 sm:mt-0 mt-4 text-center">Copyright © 2022 CodeWithHarry.com</p>

In [9]:
para_1['class']   # look up an attribute by name; class comes back as a list of class names

['text-sm',
 'text-gray-500',
 'sm:ml-4',
 'sm:pl-4',
 'sm:border-l-2',
 'sm:border-gray-200',
 'sm:py-2',
 'sm:mt-0',
 'mt-4',
 'text-center']
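Indexing with `tag['name']` raises a `KeyError` when the attribute is missing, so `.get()` is often safer. A short sketch on made-up markup, showing `.attrs`, indexing, and `.get()` together:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with an id and two classes
demo = BeautifulSoup('<p id="footer" class="text-sm text-center">Hi</p>', "html.parser")
p = demo.p

print(p.attrs)                      # all attributes as a dict
print(p["id"])                      # single-valued attribute: a plain string
print(p["class"])                   # multi-valued attribute: a list of strings
print(p.get("title"))               # missing attribute: None instead of KeyError
print(p.get("title", "no title"))   # or a default of your choice
```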

In [10]:
# Get the text inside a tag (get_text also works on the whole soup)
print(soup.find('p').get_text())

Copyright © 2022 CodeWithHarry.com


In [11]:
print(soup.get_text())

Courses | CodeWithHarryCodeWithHarryMenuHomeCoursesBlogContactLoginSignupCodeWithHarryCopyright © 2022 CodeWithHarry.com
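The output above runs together because `get_text()` joins the pieces with no separator by default. A minimal sketch on made-up markup shows how the `separator` and `strip` arguments, or the `stripped_strings` generator, keep the pieces apart:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three separate pieces of text
demo = BeautifulSoup("<div><h1>Courses</h1><p>Home</p><p>Blog</p></div>", "html.parser")

joined = demo.get_text()                            # pieces glued together
spaced = demo.get_text(separator=" ", strip=True)   # one space between pieces
pieces = list(demo.stripped_strings)                # each piece as its own string

print(joined)
print(spaced)
print(pieces)
```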


In [12]:
# Get all the links
anchors = soup.find_all('a')
for link in anchors:
    print(link.get('href'))

/
/
/videos/
/blog/
/contact/
https://www.facebook.com/CodeWithHarry/
https://www.instagram.com/CodeWithHarry/
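Several `href` values on the page are relative paths such as `/blog/`. One way to turn them into full URLs is `urljoin` from the standard library, sketched here on a made-up snippet; the sample markup is an assumption, and the base URL is the one requested earlier.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://codewithharry.com/videos/"

# Hypothetical anchors: two relative links, one absolute link, one without href
html = (
    '<a href="/">Home</a>'
    '<a href="/blog/">Blog</a>'
    '<a href="https://www.instagram.com/CodeWithHarry/">Instagram</a>'
    '<a>no href</a>'
)
demo = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute at all
links = [urljoin(base_url, a["href"]) for a in demo.find_all("a", href=True)]

for link in links:
    print(link)
```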


In [13]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
comment

'Hey, buddy. Want to buy a used parser?'

In [14]:
print(type(comment))

<class 'bs4.element.Comment'>
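To collect every comment in a document rather than just one, a common approach is to pass a function to the `string` argument of `find_all`. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup with two comments around a visible paragraph
markup = "<div><!-- first note --><p>Visible text</p><!-- second note --></div>"
demo = BeautifulSoup(markup, "html.parser")

# Keep only the string nodes that are Comment instances
comments = demo.find_all(string=lambda s: isinstance(s, Comment))
cleaned = [c.strip() for c in comments]

print(cleaned)
```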


In [20]:
# Get an element with a particular id; find() returns None when nothing matches
part_element = soup.find(id="search-content")
type(part_element)

NoneType
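Because `find` returns `None` when nothing matches, calling a method on the result without a check raises `AttributeError`. A minimal sketch of a guard follows; the helper name `text_by_id` and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that contains id="main" but not id="search-content"
demo = BeautifulSoup('<div id="main"><p>Content</p></div>', "html.parser")


def text_by_id(soup, element_id, default=""):
    """Return the text of the element with this id, or a default if it is absent."""
    element = soup.find(id=element_id)
    if element is None:
        return default
    return element.get_text(strip=True)


print(text_by_id(demo, "main"))
print(text_by_id(demo, "search-content", default="not found"))
```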