# **Web scraping using Beautiful Soup**

This notebook demonstrates web scraping. Given a website URL as input, it extracts the information listed below from that webpage.


1. Specific HTML tags, along with the title and meta description
2. Specific tags and all heading tags (h1-h6), along with the title and meta description
3. Image alt attributes
4. Word counts for a web page
5. Broken links on a web page
6. The source code of a web page, in Google Colab
7. All URLs from a website, without duplicates
8. The frontend and backend performance of a website






In [None]:
!pip install beautifulsoup4



**1. For scraping specific HTML tags along with titles and meta description**

In [None]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [None]:
# Getting the website URL from the user
urlinput = input("Enter url :")
print(" This is the website link that you entered", urlinput)

# For extracting specific tags from the webpage
def getTags(tag):
  s = ur.urlopen(urlinput)
  soup = BeautifulSoup(s.read(), "html.parser")  # name the parser explicitly
  return soup.find_all(tag)

# For extracting the title & meta description from the webpage
def titleandmetaTags():
    s = ur.urlopen(urlinput)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting Title from website ------#
    title = soup.title.string if soup.title else None  # a page may lack a <title>
    print ('Website Title is :', title)
    #-----  Extracting Meta description from website ------#
    meta_description = soup.find_all('meta')
    for tag in meta_description:
        if 'name' in tag.attrs and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
            # .get() avoids a KeyError when a meta tag has no content attribute
            print ('CONTENT :', tag.attrs.get('content'))

#------------- Main ---------------#
if __name__ == '__main__':
  titleandmetaTags()
  tags = getTags('h1')
  for tag in tags:
    print(tag) # display tags
    print(tag.contents) # display contents of the tags
        

Enter url :https://techoid.co/contact-us
 This is the website link that you entered https://techoid.co/contact-us
Website Title is : Contact Us | Techoid.co
CONTENT : Album Cover Design, Banner Ads, Book Design, Brochure Design, Building Information Modeling, Brand Style Design, Business & Stationary, Cartoon & Comics, Car wraps, Catalog Design, Game Design, Info Graphics Design, Interior Design, Invitation Design, Landscape Design, Logo Design, Menu Design, Packaging Design, Pattern Design, Photoshop Editing, Podcast Cover, Art Portraits, Postcard Design, Poster Design, Presentation Design, Social Media Design, Story boards, Tattoo Design, Trade Booth Design, Tshirts & Merchandise, Twitch Store, Vector Tracing, Flyer Design
<h1 class="elementor-heading-title elementor-size-default">Contact us</h1>
['Contact us']
<h1 class="elementor-heading-title elementor-size-default">Techoid Now</h1>
['Techoid Now']


**2. For extracting specific tags, all heading tags from h1-h6 along with titles and meta description**

In [None]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [None]:
# Getting the website URL from the user
url_input = input("Enter url :")
print(" This is the website link that you entered", url_input)


# For extracting specific tags from the webpage
def getTags(tag):
  s = ur.urlopen(url_input)
  soup = BeautifulSoup(s.read(), "html.parser")
  return soup.find_all(tag)

# For extracting all h1-h6 heading tags from the webpage
def headingTags():
  h = ur.urlopen(url_input)
  soup = BeautifulSoup(h.read(), "html.parser")
  print("List of headings from headingtags function h1, h2, h3, h4, h5, h6 :")
  for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name + ' ' + heading.text.strip())

# For extracting the title & meta description from the webpage
def titleandmetaTags():
    s = ur.urlopen(url_input)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting Title from website ------#
    title = soup.title.string if soup.title else None  # a page may lack a <title>
    print ('Website Title is :', title)
    #-----  Extracting Meta description from website ------#
    meta_description = soup.find_all('meta')
    for tag in meta_description:
        if 'name' in tag.attrs and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
            # .get() avoids a KeyError when a meta tag has no content attribute
            print ('CONTENT :', tag.attrs.get('content'))



#------------- Main ---------------#
if __name__ == '__main__':
  titleandmetaTags()
  tags = getTags('p')
  headingTags()  # prints the headings itself and returns nothing
  for tag in tags:
    print(" Here are the tags from getTags function:", tag.contents)
        



Enter url :https://techoid.co/contact-us
 This is the website link that you entered https://techoid.co/contact-us
Website Title is : Contact Us | Techoid.co
CONTENT : Album Cover Design, Banner Ads, Book Design, Brochure Design, Building Information Modeling, Brand Style Design, Business & Stationary, Cartoon & Comics, Car wraps, Catalog Design, Game Design, Info Graphics Design, Interior Design, Invitation Design, Landscape Design, Logo Design, Menu Design, Packaging Design, Pattern Design, Photoshop Editing, Podcast Cover, Art Portraits, Postcard Design, Poster Design, Presentation Design, Social Media Design, Story boards, Tattoo Design, Trade Booth Design, Tshirts & Merchandise, Twitch Store, Vector Tracing, Flyer Design
List of headings from headingtags function h1, h2, h3, h4, h5, h6 :
h1 Contact us
h3 Email
h3 info@techoid.co
h3 Call / Whatsapp
h3 +44 7718 307359
+92 311 0206987
h3 Request a Meeting
h3 Make an appointment
h3 Visit
h3 Office No. 303,
Batool Arcade, Gulshan-e-Iqba

**3. For extracting alt attributes (image alternative text)**

In [None]:
from bs4 import BeautifulSoup
import urllib.request as ur

url_input = input("Enter url :")
print("The website link that you entered is:", url_input)

def alt_tag():
  url = ur.urlopen(url_input)
  htmlSource = url.read()
  url.close()
  soup = BeautifulSoup(htmlSource, "html.parser")
  print('\n The alt tag along with the text in the web page')
  # alt=True keeps only <img> tags that have an alt attribute, even an empty one
  print(soup.find_all('img',alt= True))



#------------- Main ---------------#
if __name__ == '__main__':
  alt_tag()


Enter url :https://techoid.co/
The website link that you entered is: https://techoid.co/

 The alt tag along with the text in the web page
[<img alt="techoid.co" src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" title="techoid.co"/>, <img alt="techoid.co" src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" title="techoid.co"/>, <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/web.png" width="150"/>, <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/mob.png" width="150"/>, <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/marketing.png" width="150"/>, <img alt="" class="attachment-thumbnail size-thum

In [None]:
# For reviewing the alt attributes on separate lines
# (needs a module-level `soup`; the one inside alt_tag() is local to that function)
soup.find_all('img',alt= True)

[<img alt="techoid.co" src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" title="techoid.co"/>,
 <img alt="techoid.co" src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" title="techoid.co"/>,
 <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/web.png" width="150"/>,
 <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/mob.png" width="150"/>,
 <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/marketing.png" width="150"/>,
 <img alt="" class="attachment-thumbnail size-thumbnail" height="150" loading="lazy" src="https://techoid.co/wp-content/uploads/2021/03/iot.png" width="150"/>,
 <img alt="" class="atta
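
Many of the images in the output above carry an empty `alt=""`, which `alt=True` still matches. The sketch below separates images with useful alt text from those with empty or missing alt text. It is only an illustration: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs, and the class name `ImgAltAuditor` and the sample markup are hypothetical. With Beautiful Soup, the same split could be made by testing `img.get("alt")` on each tag.

```python
from html.parser import HTMLParser

class ImgAltAuditor(HTMLParser):
    """Sort <img> tags by whether their alt text is useful, empty, or missing."""

    def __init__(self):
        super().__init__()
        self.with_alt = []     # (src, alt) pairs with non-empty alt text
        self.empty_alt = []    # src of images written as alt=""
        self.missing_alt = []  # src of images with no alt attribute (or a valueless one)

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        alt = attributes.get("alt")
        if alt is None:
            self.missing_alt.append(src)
        elif alt.strip() == "":
            self.empty_alt.append(src)
        else:
            self.with_alt.append((src, alt))

# Hypothetical markup, loosely modelled on the output above
sample_html = (
    '<img alt="techoid.co" src="logo.png"/>'
    '<img alt="" src="web.png"/>'
    '<img src="mob.png"/>'
)
auditor = ImgAltAuditor()
auditor.feed(sample_html)
print("Useful alt text:", auditor.with_alt)
print("Empty alt text :", auditor.empty_alt)
print("No alt at all  :", auditor.missing_alt)
```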

**4. For counting words inside a web page**

In [None]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Getting content from web page
r = requests.get("https://techoid.co/contact-us")
soup = BeautifulSoup(r.content, "html.parser")

# For getting words within paragraphs
text_paragraph = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
count_paragraph = Counter((x.rstrip(punctuation).lower() for y in text_paragraph for x in y.split()))

# For getting words inside div tags
# Note: text inside nested divs is counted once for every div that encloses it
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
count_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))

# Adding the two counters to get a list of word counts (from most to least common)
total = count_div + count_paragraph
list_most_common_words = total.most_common()

In [None]:
# Number of distinct words in the webpage (len() of a Counter counts its unique keys)
len(total)

245

In [None]:
# List of common words
list_most_common_words

[('development', 221),
 ('', 160),
 ('management', 136),
 ('system', 136),
 ('us', 126),
 ('services', 103),
 ('to', 93),
 ('the', 84),
 ('message', 72),
 ('email', 71),
 ('project', 70),
 ('about', 68),
 ('name', 67),
 ('software', 65),
 ('our', 62),
 ('softwares', 52),
 ('your', 52),
 ('home', 51),
 ('web', 51),
 ('mobile', 51),
 ('app', 51),
 ('cms', 51),
 ('digital', 51),
 ('marketing', 51),
 ('graphics', 51),
 ('designing', 51),
 ('design', 51),
 ('content', 51),
 ('writing', 51),
 ('artificial', 51),
 ('intelligence', 51),
 ('iot', 51),
 ('based', 51),
 ('database', 51),
 ('it', 51),
 ('outsourcing', 51),
 ('careers', 51),
 ('contact', 51),
 ('get', 50),
 ('company', 48),
 ('request', 45),
 ('a', 45),
 ('in', 45),
 ('form', 43),
 ('you', 42),
 ('send', 37),
 ('subscribe', 36),
 ('ui/ux', 34),
 ('solutions', 34),
 ('asset', 34),
 ('tracking', 34),
 ('inventory', 34),
 ('enterprise', 34),
 ('resource', 34),
 ('planning', 34),
 ('(erp', 34),
 ('employee', 34),
 ('(ems', 34),
 ('hosp
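
The list above contains an empty string `''` as a "word", and its counts are inflated because text inside nested `div` tags is counted more than once. A minimal sketch of an alternative, assuming the page text has already been extracted once as a plain string (in this notebook that could be `soup.get_text(" ")`): the helper name `count_words` and the sample sentence are hypothetical, and the regular expression is just one possible definition of a word.

```python
import re
from collections import Counter

def count_words(text):
    """Count lowercase word tokens in text, skipping punctuation-only tokens."""
    # A word is a run of letters/digits, optionally joined by ', / or - (so "ui/ux" stays whole)
    words = re.findall(r"[a-z0-9]+(?:['/-][a-z0-9]+)*", text.lower())
    return Counter(words)

# Hypothetical sample text standing in for the text of a web page
sample_text = "Contact us. Contact us today - request a meeting!"
counts = count_words(sample_text)

print(counts.most_common(3))   # most frequent words first
print(sum(counts.values()))    # total number of words
print(len(counts))             # number of distinct words
```

Here `sum(counts.values())` gives the total word count, while `len(counts)` gives the number of distinct words.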

**5. For inspecting Broken links inside a webpage**

A working page returns the HTTP status code 200, while a page that cannot be found returns 404.
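
The two status codes above can be turned into a readable label with the standard library. This is a minimal sketch: the helper name `describe_status` is hypothetical, and treating every code of 400 or above as "broken" is a simplifying assumption rather than a rule taken from this notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label for an HTTP status code and whether it counts as broken."""
    try:
        phrase = HTTPStatus(code).phrase   # e.g. 404 -> "Not Found"
    except ValueError:
        phrase = "Unknown"                 # code is not defined in the standard library enum
    is_broken = code >= 400                # assumption: 4xx and 5xx responses both mean "broken"
    return f"{code} {phrase}", is_broken

for code in (200, 404):
    print(describe_status(code))
```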

In [None]:
# Importing libraries
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

# Getting URL from user
url = input("Enter your url: ")

def broken_page():
  # Request the page and parse it to find its links
  user_req_page = requests.get(url)
  soup = BeautifulSoup(user_req_page.text, "html.parser")

  # Request every link on the page and print that link's own status code,
  # i.e. 404 for PAGE NOT FOUND, 200 if the link is functional/available
  for link in soup.find_all('a', href=True):
    href = link['href']
    target = urljoin(url, href)  # resolve relative links such as "#content"
    if not target.startswith(('http://', 'https://')):
      continue  # skip mailto:, tel:, javascript: and similar links
    try:
      response_code = requests.get(target, timeout=10).status_code
    except requests.RequestException as error:
      response_code = f"request failed ({type(error).__name__})"
    print(f"Url: {href} | Status Code: {response_code}")


#----- NOTE ------#
# --------- TO SEE A PAGE NOT FOUND 404 ERROR, test a page that links to a missing URL such as the one below --------#
#https://roine.github.com/p1

#------------- Main ---------------#
if __name__ == '__main__':
  broken_page()

Enter your url: https://techoid.co/
Url: #content | Status Code: 200
Url: https://techoid.co | Status Code: 200
Url: https://techoid.co | Status Code: 200
Url: https://techoid.co/ | Status Code: 200
Url: https://techoid.co/about-us | Status Code: 200
Url: https://techoid.co/services | Status Code: 200
Url: https://techoid.co/web-development | Status Code: 200
Url: https://techoid.co/mobile-app-development | Status Code: 200
Url: https://techoid.co/cms-development | Status Code: 200
Url: https://techoid.co/digital-marketing | Status Code: 200
Url: https://techoid.co/graphics-designing | Status Code: 200
Url: https://techoid.co/ui-ux | Status Code: 200
Url: https://techoid.co/content-writing | Status Code: 200
Url: https://techoid.co/ai | Status Code: 200
Url: https://techoid.co/iot-based-solutions | Status Code: 200
Url: https://techoid.co/database-development | Status Code: 200
Url: https://techoid.co/it-outsourcing | Status Code: 200
Url: https://techoid.co/software | Status Code: 200

**6. For getting the source code of the webpage**

Here we use Selenium's `page_source` property to retrieve the source of the page that the browser currently has loaded.

*NOTE: The page source is the HTML markup, together with any inline CSS and JavaScript, behind a webpage.*

In [None]:
# install chromium, its driver, and selenium
!apt update
!apt install chromium-chromedriver
!pip install selenium

# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:10 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:12 http://security.ubuntu.com/ubuntu bionic-securi

In [None]:
#------------- FOR DISPLAYING SOURCE CODE OF THE WEBPAGE -------------#

# open it, go to a website, and get results
wd = webdriver.Chrome(options=options)

# Prompt user to enter the URL
url = input("Enter your url: ")

# For making request to get the URL
wd.get(url)

# To display code results
print(wd.page_source)  

Enter your url: https://techoid.co/contact-us
<html lang="en-US"><head>
	<meta charset="UTF-8">
	<style type="text/css" data-tippy-stylesheet="">.tippy-iOS{cursor:pointer!important;-webkit-tap-highlight-color:transparent}.tippy-popper{transition-timing-function:cubic-bezier(.165,.84,.44,1);max-width:calc(100% - 8px);pointer-events:none;outline:0}.tippy-popper[x-placement^=top] .tippy-backdrop{border-radius:40% 40% 0 0}.tippy-popper[x-placement^=top] .tippy-roundarrow{bottom:-7px;bottom:-6.5px;-webkit-transform-origin:50% 0;transform-origin:50% 0;margin:0 3px}.tippy-popper[x-placement^=top] .tippy-roundarrow svg{position:absolute;left:0;-webkit-transform:rotate(180deg);transform:rotate(180deg)}.tippy-popper[x-placement^=top] .tippy-arrow{border-top:8px solid #333;border-right:8px solid transparent;border-left:8px solid transparent;bottom:-7px;margin:0 3px;-webkit-transform-origin:50% 0;transform-origin:50% 0}.tippy-popper[x-placement^=top] .tippy-backdrop{-webkit-transform-origin:0 25%;

**7. Extraction of all URLs from a website without duplication**

In [1]:
#---- Importing libraries ----#
import re
import requests
from bs4 import BeautifulSoup

all_links = set() #------ Creating a unique set of links ------#

for i in range(7):
    r = requests.get("https://techoid.co/?page={}".format(i))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.find_all("a", href=re.compile('/')):
        link = link.get('href')
        #----- A set only stores distinct values, so adding each link to it removes duplicate URLs ------#
        if link not in all_links:
            print(link)
        all_links.add(link)

https://techoid.co
https://techoid.co/
https://techoid.co/about-us
https://techoid.co/services
https://techoid.co/web-development
https://techoid.co/mobile-app-development
https://techoid.co/cms-development
https://techoid.co/digital-marketing
https://techoid.co/graphics-designing
https://techoid.co/ui-ux
https://techoid.co/content-writing
https://techoid.co/ai
https://techoid.co/iot-based-solutions
https://techoid.co/database-development
https://techoid.co/it-outsourcing
https://techoid.co/software
https://techoid.co/asset-tracking
https://techoid.co/inventory-management-system
https://techoid.co/project-management-system
https://techoid.co/erp
https://techoid.co/employee-management-system
https://techoid.co/hospital-management-system
https://techoid.co/careers
https://techoid.co/contact-us
https://barcodes.pk/
https://leathersoutlet.com/
https://techoid.uk
https://techoid.co/services/software-development
https://techoid.co/team/hafiz-ibad-jabbar
https://techoid.co/team/muhammad-kamra
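
The first two entries in the output above, `https://techoid.co` and `https://techoid.co/`, point to the same page but are different strings, so the set keeps both. One way to handle this is to normalize each URL before adding it to the set. The sketch below is an illustration only: the helper name `normalize_url` is hypothetical, the last sample link is made up, and treating a trailing slash or a `#fragment` as insignificant is an assumption that does not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Return a canonical form of url so that trivial variants compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")  # assumption: "/about-us/" and "/about-us" are the same page
    # Lowercase the scheme and host, keep the query string, drop the #fragment
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

links = [
    "https://techoid.co",
    "https://techoid.co/",
    "https://techoid.co/about-us",
    "https://Techoid.co/about-us#team",  # hypothetical variant of the previous link
]
unique_links = {normalize_url(link) for link in links}
print(sorted(unique_links))
```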

**8. Measuring the frontend and backend performance of website**

In [2]:
#----- Installation of selenium and chromedriver in google colab -----#
!pip install selenium
!apt-get update 
!apt install chromium-chromedriver

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 3.7MB/s 
Installing collected packages: selenium
Successfully installed selenium-3.141.0
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:6 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/re

In [3]:
#---- Importing libraries ----#
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
import os.path

In [4]:
#---- Accessing chromedriver in google colab ----#
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# No browser is started here: the measurement loop creates a fresh driver for each iteration

In [5]:
#----- Creating csv file to write the calculated performance of the website
csv_path = "performance.csv"
file = open(csv_path, 'w', newline='')
writer = csv.writer(file)
writer.writerow(["backendPerformance_calc","frontendPerformance_calc"])


#----- Getting the website URL from the user
url = input("Enter url :")
print("This is the website link that you entered:", url)

#----- Setting iterations for testing the performance
iterations = 10
for i in range(iterations):
    driver = webdriver.Chrome(options=options)  # fresh browser for every iteration
    driver.get(url) #-- Passing url as parameter in Selenium method (driver.get)

    #-- Using the Navigation Timing API to calculate the timings.
    #-- driver.execute_script synchronously runs JavaScript in the current window or frame
    #-- and returns its value, e.g. 'return window.performance.timing.navigationStart'.
    navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
    responseStart = driver.execute_script("return window.performance.timing.responseStart")
    domComplete = driver.execute_script("return window.performance.timing.domComplete")

    backendPerformance_calc = responseStart - navigationStart
    frontendPerformance_calc = domComplete - responseStart

    #-- This will print iteration wise backend and front end performance for website
    print("Iteration no:", i)
    print("Back End performance in MS: %s" % backendPerformance_calc)
    print("Front End performance in MS: %s" % frontendPerformance_calc)
    print("------------------------")

    #-- Writing row wise data in the file
    writer.writerow([backendPerformance_calc,frontendPerformance_calc])
    driver.quit()  # end the browser session so that no chromedriver process is left running
    



Enter url :https://techoid.co/
This is the website link that you entered: https://techoid.co/
Iteration no: 0
Back End performance in MS: 700
Front End performance in MS: 1969
------------------------
Iteration no: 1
Back End performance in MS: 529
Front End performance in MS: 2404
------------------------
Iteration no: 2
Back End performance in MS: 666
Front End performance in MS: 2288
------------------------
Iteration no: 3
Back End performance in MS: 452
Front End performance in MS: 2549
------------------------
Iteration no: 4
Back End performance in MS: 580
Front End performance in MS: 2589
------------------------
Iteration no: 5
Back End performance in MS: 439
Front End performance in MS: 3080
------------------------
Iteration no: 6
Back End performance in MS: 654
Front End performance in MS: 2496
------------------------
Iteration no: 7
Back End performance in MS: 554
Front End performance in MS: 2419
------------------------
Iteration no: 8
Back End performance in MS: 440
Fr
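
The arithmetic behind each iteration can be written as a small function. This is a sketch for illustration: the function name `split_load_time` is hypothetical, and the timestamps are invented values chosen so that the result matches iteration 0 above. They are not measurements.

```python
def split_load_time(navigation_start, response_start, dom_complete):
    """Split a page load into backend and frontend durations, in milliseconds."""
    backend_ms = response_start - navigation_start   # time until the first byte of the response
    frontend_ms = dom_complete - response_start      # time from first byte until the DOM is complete
    return backend_ms, frontend_ms

# Hypothetical timestamps (ms since the epoch), chosen to reproduce iteration 0
backend, frontend = split_load_time(
    navigation_start=1_600_000_000_000,
    response_start=1_600_000_000_700,
    dom_complete=1_600_000_002_669,
)
print(backend, frontend)
```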

In [6]:
#---- For closing the CSV file and the WebDriver ----#
driver.quit()
file.close()


In [8]:
#---- To view performance in a dataframe ----#
import pandas as pd
df=pd.read_csv("performance.csv")

In [9]:
#----- Displaying DataFrames output ------#
df

Unnamed: 0,backendPerformance_calc,frontendPerformance_calc
0,700,1969
1,529,2404
2,666,2288
3,452,2549
4,580,2589
5,439,3080
6,654,2496
7,554,2419
8,440,2048
9,483,2550
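
Individual runs vary quite a bit, so a summary is easier to compare than ten separate rows. The sketch below uses the standard library's `statistics` module on the values shown in the table above. The helper name `summarize` is hypothetical.

```python
from statistics import mean, median

# Timings in milliseconds, copied from the performance.csv output above
backend_ms = [700, 529, 666, 452, 580, 439, 654, 554, 440, 483]
frontend_ms = [1969, 2404, 2288, 2549, 2589, 3080, 2496, 2419, 2048, 2550]

def summarize(samples):
    """Return basic summary statistics for a list of timings."""
    return {
        "mean": mean(samples),
        "median": median(samples),
        "min": min(samples),
        "max": max(samples),
    }

backend_summary = summarize(backend_ms)
frontend_summary = summarize(frontend_ms)
print("Backend :", backend_summary)
print("Frontend:", frontend_summary)
```

With the DataFrame already loaded, `df.describe()` reports similar statistics in one call.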
