# 📚 Lab 9: Web Scraping Using Python
This notebook covers basic web scraping using the `requests` and `BeautifulSoup` libraries.

## 📦 Install Required Libraries

In [1]:
!pip install requests
!pip install beautifulsoup4



## 📥 Import Libraries

In [4]:
import requests
from bs4 import BeautifulSoup

## 🌐 Define the URL and Send a Request

In [6]:
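url = 'https://www.geeksforgeeks.org/'
response = requests.get(url, timeout=10)  # timeout so a stalled connection cannot hang the notebook
print(response)  # <Response [200]> means the request succeeded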

<Response [200]>


## 📑 Parse the HTML Content

In [10]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify()[:1000])  # Print first 1000 characters

<!DOCTYPE html>
<html lang="en">
 <head>
  <script type="application/ld+json">
   {"@context":"http://schema.org","@type":"Organization","name":"GeeksforGeeks","url":"https://www.geeksforgeeks.org/","logo":"https://media.geeksforgeeks.org/wp-content/cdn-uploads/20200817185016/gfg_complete_logo_2x-min.png","description":"Your All-in-One Learning Portal. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.","founder":[{"@type":"Person","name":"Sandeep Jain","url":"https://in.linkedin.com/in/sandeep-jain-b3940815"}],"sameAs":["https://www.facebook.com/geeksforgeeks.org/","https://twitter.com/geeksforgeeks","https://www.linkedin.com/company/1299009","https://www.youtube.com/geeksforgeeksvideos/"]}
  </script>
  <link href="https://fonts.googleapis.com" rel="preconnect"/>
  <link crossorigin="true" href="https://fonts.gstatic.com" rel="preconnect"/>
  <meta content="widt

## 📝 Extract and Print the Title of the Webpage

In [5]:
title = soup.title
print(title)

<title>GeeksforGeeks | Your All-in-One Learning Portal</title>
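`soup.title` returns the whole `<title>` tag, markup included. A minimal sketch of pulling out only the text is shown here. The inline HTML string and the name `demo_soup` are assumptions, used so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page (illustrative only).
html = (
    "<html><head>"
    "<title>GeeksforGeeks | Your All-in-One Learning Portal</title>"
    "</head><body></body></html>"
)
demo_soup = BeautifulSoup(html, "html.parser")

title_tag = demo_soup.title          # the full <title> tag
title_text = demo_soup.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```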


## 🏷️ Extract and Print All h2 Headings (Optional Task)

In [6]:
headings = soup.find_all('h2')
for h in headings:
    print(h.text)

Explore
Courses
DSA
AI ML & Data Science
Web Development
Languages
CS Subjects 
Databases
DevOps
Tutorials
Free Courses
GfG School 
Must Explore
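Some headings in the output above carry trailing whitespace. One way to tidy them, sketched on a small inline HTML string (an assumption, not the real page), is `get_text(strip=True)`:

```python
from bs4 import BeautifulSoup

# Inline HTML with stray whitespace around the heading text (illustrative only).
html = "<h2>CS Subjects </h2><h2>\n  GfG School </h2><h2>Databases</h2>"
demo_soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from the text.
clean_headings = [h.get_text(strip=True) for h in demo_soup.find_all("h2")]
print(clean_headings)
```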


## ✅ Conclusion
We successfully fetched the webpage content and extracted specific information using BeautifulSoup.

## 🔍 Extra Example 1: Extract All Links from the Page

In [12]:
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

https://www.geeksforgeeks.org/
https://www.geeksforgeeks.org/learn-data-structures-and-algorithms-dsa-tutorial/
https://www.geeksforgeeks.org/explore
https://www.geeksforgeeks.org/c-programming-language/
https://www.geeksforgeeks.org/c-plus-plus/
https://www.geeksforgeeks.org/java/
https://www.geeksforgeeks.org/python-programming-language/
https://www.geeksforgeeks.org/data-science-for-beginners/
https://www.geeksforgeeks.org/machine-learning/
https://www.geeksforgeeks.org/courses
https://www.geeksforgeeks.org/linux-tutorial/
https://www.geeksforgeeks.org/devops-tutorial/
https://www.geeksforgeeks.org/sql-tutorial/
https://www.geeksforgeeks.org/web-development/
https://www.geeksforgeeks.org/system-design-tutorial/
https://www.geeksforgeeks.org/aptitude-questions-and-answers/
https://www.geeksforgeeks.org/computer-science-projects/
https://www.geeksforgeeks.org/geeksforgeeks-premium-subscription
https://www.geeksforgeeks.org/courses/full-stack-node
https://practice.geeksforgeeks.org/cou

# TASK LAB #09

In [14]:
# Task 2: Extract All Hyperlinks
# Scrape all the hyperlinks (href values) from 'https://www.geeksforgeeks.org/' and print them.
# find_all('a') returns every anchor tag; the loop prints each tag's href attribute.
weblinks = soup.find_all('a')
for link in weblinks:
    print(link.get('href'))



https://www.geeksforgeeks.org/
https://www.geeksforgeeks.org/learn-data-structures-and-algorithms-dsa-tutorial/
https://www.geeksforgeeks.org/explore
https://www.geeksforgeeks.org/c-programming-language/
https://www.geeksforgeeks.org/c-plus-plus/
https://www.geeksforgeeks.org/java/
https://www.geeksforgeeks.org/python-programming-language/
https://www.geeksforgeeks.org/data-science-for-beginners/
https://www.geeksforgeeks.org/machine-learning/
https://www.geeksforgeeks.org/courses
https://www.geeksforgeeks.org/linux-tutorial/
https://www.geeksforgeeks.org/devops-tutorial/
https://www.geeksforgeeks.org/sql-tutorial/
https://www.geeksforgeeks.org/web-development/
https://www.geeksforgeeks.org/system-design-tutorial/
https://www.geeksforgeeks.org/aptitude-questions-and-answers/
https://www.geeksforgeeks.org/computer-science-projects/
https://www.geeksforgeeks.org/geeksforgeeks-premium-subscription
https://www.geeksforgeeks.org/courses/full-stack-node
https://practice.geeksforgeeks.org/cou
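`link.get('href')` returns `None` for anchors without an `href`, and pages often mix absolute and relative links. A possible way to handle both is sketched here. The inline HTML and the names `base_url` and `absolute_links` are assumptions made for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.geeksforgeeks.org/"

# Inline HTML standing in for the downloaded page (illustrative only).
html = """
<a href="https://www.geeksforgeeks.org/java/">Java</a>
<a href="/courses">Courses</a>
<a name="top">No href here</a>
<a href="#footer">Footer</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for a in demo_soup.find_all("a", href=True):  # href=True skips anchors without an href
    href = a["href"]
    if href.startswith("#"):                  # skip same-page fragment links
        continue
    absolute_links.append(urljoin(base_url, href))  # resolve relative paths

for link in absolute_links:
    print(link)
```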

In [18]:
# Task 3: Extract All Paragraph Texts
# Scrape and print the text from all <p> tags on 'https://www.geeksforgeeks.org/'.
# find_all("p") returns every <p> tag; collect the text of each one and print it.
paragraphs = soup.find_all("p")
paragraph_texts = [p.text for p in paragraphs]
for text in paragraph_texts:
    print(text)



Interested in advertising with us?


In [32]:
# Task 4: Extract Data From a Specific Class
# Scrape and print the text inside all <div> tags having a specific class (example: 'head') on 'https://www.geeksforgeeks.org/'.
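# Note: find_all("div") with no filter matches every <div> on the page, so this prints the text of all of them.
# To restrict the match to one class, pass class_, e.g. soup.find_all("div", class_="head").
divs = soup.find_all("div")
for div in divs:
    print(div.text)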


CoursesTutorialsPracticeContestsData StructureJavaPythonHTMLInterview PreparationDSAPractice ProblemsC C++JavaPythonData ScienceMachine LearningCoursesLinuxDevOpsSQLWeb DevelopmentSystem DesignAptitudeProjectsGfG PremiumHello, What Do You Want To Learn?Full Stack Live ClassesDSA: Basic To Advanced CourseMaster DS & MLExploreData Structure and AlgorithmsView morePractice DSAView moreAI ML & Data ScienceView moreWeb DevelopmentView morePythonView moreMachine LearningView moreSystem DesignView moreDevOpsView moreInterested in advertising with us?Get in touchCoursesView All4.4DSA to Development: A Complete GuideBeginner to Advance549k+ interested GeeksExplore now4.7JAVA Backend Development - LiveIntermediate and Advance301k+ interested GeeksExplore now4.9Tech Interview 101 - From DSA to System Design for Working ProfessionalsBeginner to Advance331k+ interested GeeksExplore now4.7Full Stack Development with React & Node JS - LiveBeginner to Advance350k+ interested GeeksExplore now4.6Java Pr
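Task 4 asks for `<div>` tags with a specific class. A minimal sketch of that filter is shown here on inline HTML, since the class names on the live page are not confirmed in this notebook. The class `head` comes from the task statement; the markup itself is an assumption.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (illustrative only).
html = """
<div class="head">Courses</div>
<div class="body">Ignored</div>
<div class="head highlight">Tutorials</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# class_ matches any tag whose class list contains "head".
head_divs = demo_soup.find_all("div", class_="head")
head_texts = [d.get_text(strip=True) for d in head_divs]

for text in head_texts:
    print(text)
```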

In [34]:
# Task 5: Compare Parsing Speed
# Parse the same web page using both 'html.parser' and 'lxml'.
# Measure and compare the time taken by each parser to process the page.

import time

url = 'https://www.geeksforgeeks.org/'
response = requests.get(url, timeout=10)

html_content = response.content

def time_parsing(parser, html):
    start_time = time.perf_counter()
    BeautifulSoup(html, parser)
    end_time = time.perf_counter()
    return end_time - start_time

num_iterations = 5 # Number of times to repeat the parsing

lxml_times = [time_parsing("lxml", html_content) for _ in range(num_iterations)]
html_parser_times = [time_parsing("html.parser", html_content) for _ in range(num_iterations)]

avg_lxml_time = sum(lxml_times) / num_iterations
avg_html_parser_time = sum(html_parser_times) / num_iterations

print(f"Average parsing time with lxml: {avg_lxml_time:.4f} seconds")
print(f"Average parsing time with html.parser: {avg_html_parser_time:.4f} seconds")



Average parsing time with lxml: 0.0946 seconds
Average parsing time with html.parser: 0.0440 seconds
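Averages over five runs can be noisy, so the ordering above may not hold on another machine or page. A hedged alternative is sketched here: one untimed warm-up parse per parser, then the fastest of several `timeit` runs. The synthetic HTML and the helper `best_parse_time` are assumptions, and `lxml` is skipped if it is not installed.

```python
import timeit
from bs4 import BeautifulSoup, FeatureNotFound

# Synthetic markup so the comparison runs without a network request (illustrative only).
html = "<html><body>" + "<div><p>row</p><a href='/x'>x</a></div>" * 500 + "</body></html>"

def best_parse_time(parser, markup, repeats=5):
    """Return the fastest of `repeats` single-parse timings, in seconds."""
    BeautifulSoup(markup, parser)  # warm-up parse, not timed
    timings = timeit.repeat(
        lambda: BeautifulSoup(markup, parser), number=1, repeat=repeats
    )
    return min(timings)

results = {}
for parser in ("html.parser", "lxml"):
    try:
        results[parser] = best_parse_time(parser, html)
    except FeatureNotFound:
        results[parser] = None  # parser library not installed

for parser, seconds in results.items():
    if seconds is None:
        print(f"{parser}: not installed")
    else:
        print(f"{parser}: {seconds:.4f} seconds (best of 5)")
```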


In [None]:
# Task 6: Save Hyperlinks to CSV
# Extract all the hyperlinks from 'https://www.geeksforgeeks.org/' and save them in a CSV file.
# First, we extract all the hyperlinks;
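The cell for Task 6 is empty, so here is one possible approach using the standard `csv` module. It is a sketch under assumptions: the inline HTML stands in for the real page, and the helper `write_links_csv` and the file name `links.csv` are hypothetical.

```python
import csv
import io
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (illustrative only).
html = """
<a href="https://www.geeksforgeeks.org/">Home</a>
<a href="https://www.geeksforgeeks.org/java/">Java</a>
<a>No href</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")
hrefs = [a["href"] for a in demo_soup.find_all("a", href=True)]

def write_links_csv(links, file_obj):
    """Write a header row and one URL per row to an open text file."""
    writer = csv.writer(file_obj)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)

# Write to an in-memory buffer here; in the notebook, a real file could be used:
#   with open("links.csv", "w", newline="", encoding="utf-8") as f:
#       write_links_csv(hrefs, f)
buffer = io.StringIO()
write_links_csv(hrefs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```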



In [54]:
# Task 7: Scrape Article Headings from a News Website
# Scrape all the article headings (for example: <h3> tags) from 'https://www.bbc.com/news' or any accessible open news site.

url = "https://www.bbc.com/news"
response = requests.get(url, timeout=10)
soupObj = BeautifulSoup(response.text, 'lxml')
headings = soupObj.find_all("h3")  # <h3> tags often hold article headings; if nothing prints, the site may use a different tag
for heading in headings:
    print(heading.text)
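The cell above printed nothing, which may mean the page does not use `<h3>` for its headlines. `find_all` also accepts a list of tag names, so several heading levels can be searched at once. This sketch uses inline HTML as an assumption; the real tags on the news site are not confirmed here.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a news page (illustrative only).
html = """
<h2>First headline</h2>
<h3>Second headline</h3>
<h2> </h2>
<p>Not a heading</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Passing a list matches any of the named tags, in document order.
headline_tags = demo_soup.find_all(["h2", "h3"])

# Keep only headings that contain visible text.
headlines = [
    tag.get_text(strip=True)
    for tag in headline_tags
    if tag.get_text(strip=True)
]

for headline in headlines:
    print(headline)
```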



In [60]:
# Task 8: Scrape Image URLs
# Scrape all image URLs (src attributes from <img> tags) from 'https://www.geeksforgeeks.org/' and print them.
# The <img> tag embeds an image in a web page; its src attribute gives the image source.
# find_all("img") returns every <img> tag, and the loop prints each src value.
images = soup.find_all("img")
for image in images:
    print(image.get("src"))



https://media.geeksforgeeks.org/gfg-gg-logo.svg
https://media.geeksforgeeks.org/img-practice/prod/courses/504/Mobile/Other/Course_DSA_to_Dev_1720846081.webp
https://media.geeksforgeeks.org/img-practice/prod/courses/227/Mobile/Other/Course_Backend_1720846992.webp
https://media.geeksforgeeks.org/img-practice/prod/courses/458/Mobile/Other/Course_Tech_Int_1720846791.webp
https://media.geeksforgeeks.org/img-practice/prod/courses/241/Web/Content/FSRNL_1705410152.webp
https://media.geeksforgeeks.org/img-practice/prod/courses/270/Web/Content/CourseJavaProgrammin_1716371938.webp
https://media.geeksforgeeks.org/img-practice/prod/courses/221/Web/Content/cpp_1723009538.webp
https://media.geeksforgeeks.org/auth-dashboard-uploads/gfgFooterLogo.png
https://media.geeksforgeeks.org/img-practice/Location-1685004904.svg
https://media.geeksforgeeks.org/img-practice/Location-1685004904.svg
https://media.geeksforgeeks.org/auth-dashboard-uploads/googleplay-%281%29.png
https://media.geeksforgeeks.org/auth-das
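The output above repeats at least one URL. A small sketch of removing duplicates while keeping the original order follows. The inline HTML is an assumption built from URLs that appear in the output.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (illustrative only).
html = """
<img src="https://media.geeksforgeeks.org/gfg-gg-logo.svg">
<img src="https://media.geeksforgeeks.org/img-practice/Location-1685004904.svg">
<img src="https://media.geeksforgeeks.org/img-practice/Location-1685004904.svg">
<img alt="no src attribute">
"""
demo_soup = BeautifulSoup(html, "html.parser")

# src=True skips <img> tags that have no src attribute.
srcs = [img["src"] for img in demo_soup.find_all("img", src=True)]

# dict.fromkeys keeps the first occurrence of each URL, in order.
unique_srcs = list(dict.fromkeys(srcs))

for src in unique_srcs:
    print(src)
```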

In [92]:
# Task 9: Scrape Titles from First 5 Links
# Scrape the first 5 hyperlink URLs from 'https://www.geeksforgeeks.org/'
# Then visit each of these pages and scrape their titles.
# Collect every href on the page, then print the first five.
weblinks = soup.find_all('a')
links = [link.get('href') for link in weblinks]

for item in links[:5]:
    print(item)



https://www.geeksforgeeks.org/
https://www.geeksforgeeks.org/learn-data-structures-and-algorithms-dsa-tutorial/
https://www.geeksforgeeks.org/explore
https://www.geeksforgeeks.org/c-programming-language/
https://www.geeksforgeeks.org/c-plus-plus/
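The cell above stops at printing the first five links. The second half of Task 9, visiting each page and reading its title, could look roughly like this. The helpers `page_title`, `fetch_html`, and `titles_for` are hypothetical names, and the demonstration uses a fake fetcher with `example.com` URLs so it runs without network access.

```python
import requests
from bs4 import BeautifulSoup

def page_title(html):
    """Return the <title> text of an HTML document, or None if it has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None

def fetch_html(url):
    """Download a page; raises requests.HTTPError on a 4xx/5xx response."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def titles_for(urls, fetch=fetch_html, limit=5):
    """Map each of the first `limit` non-empty URLs to its page title."""
    titles = {}
    for url in [u for u in urls if u][:limit]:
        try:
            titles[url] = page_title(fetch(url))
        except requests.RequestException as exc:
            titles[url] = f"request failed: {exc}"
    return titles

# Offline demonstration with a fake fetcher (illustrative pages only).
fake_pages = {
    "https://example.com/a": "<html><head><title>Page A</title></head></html>",
    "https://example.com/b": "<html><head><title> Page B </title></head></html>",
    "https://example.com/c": "<html><body>No title here</body></html>",
}
demo_titles = titles_for([None, *fake_pages], fetch=fake_pages.get)

for url, title in demo_titles.items():
    print(url, "->", title)
```

In the notebook, `titles_for(links)` would use the real `fetch_html` and make one request per link.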


In [None]:
# Bonus Task: Save Links and Text to Excel
# Scrape all hyperlinks along with their link text from 'https://www.geeksforgeeks.org/'
# Save the data into an Excel file using pandas.
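The Bonus Task cell is empty, so this is only a sketch of one way to approach it. It assumes `pandas` is installed, plus an Excel engine such as `openpyxl` for the actual write. The inline HTML, the helper `save_links_to_excel`, and the file name `gfg_links.xlsx` are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (illustrative only).
html = """
<a href="https://www.geeksforgeeks.org/java/">Java</a>
<a href="https://www.geeksforgeeks.org/courses"> Courses </a>
<a>No href</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# One row per link: the visible text and the href value.
rows = [
    {"text": a.get_text(strip=True), "href": a["href"]}
    for a in demo_soup.find_all("a", href=True)
]
links_df = pd.DataFrame(rows, columns=["text", "href"])
print(links_df)

def save_links_to_excel(df, path):
    """Write the DataFrame to an .xlsx file (needs an engine such as openpyxl)."""
    df.to_excel(path, index=False)

# In the notebook, after `pip install pandas openpyxl`:
#   save_links_to_excel(links_df, "gfg_links.xlsx")
```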
