Name: Laxmi Saud
CRN: 19

Lab 5: Web Scraping Using Python

Aim
To understand the basics of web scraping using Python libraries such as requests and BeautifulSoup.

Question 1: Basic HTML Request & Parsing

Theory
Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting, a program fetches the HTML content of a webpage and extracts the required information.
The process starts with sending an HTTP request to a webpage and retrieving its HTML content. The **requests** library fetches web pages, while **BeautifulSoup** parses and navigates HTML documents. Handling exceptions is important so that the program does not crash because of network or server issues.

In [8]:
import requests
from bs4 import BeautifulSoup

try:
    response = requests.get("https://www.geeksforgeeks.org", timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Page Title:", soup.title.text)
except requests.exceptions.RequestException as e:
    print("Error:", e)

Page Title: GeeksforGeeks | Your All-in-One Learning Portal
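The cell above assumes the page has a `<title>` tag. If it does not, `soup.title` is `None` and `soup.title.text` raises `AttributeError`, which the `RequestException` handler does not catch. The following is a minimal sketch of a guarded lookup; the helper name `page_title` and the two sample HTML strings are illustrative assumptions, not part of the lab.

```python
from bs4 import BeautifulSoup


def page_title(html, default="(no title)"):
    """Return the stripped <title> text, or `default` when it is absent or empty."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag;
    # .string is None when the tag exists but has no text.
    if soup.title is None or soup.title.string is None:
        return default
    return soup.title.string.strip()


# Hypothetical inputs used only to exercise both branches.
with_title = "<html><head><title> Demo Page </title></head><body></body></html>"
without_title = "<html><body><p>No head section</p></body></html>"

print(page_title(with_title))
print(page_title(without_title))
```

In the original cell, `response.text` could be passed to this helper in place of the direct `soup.title.text` access.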


Question 2: Extract Links

Theory
Hyperlinks are represented using `<a>` tags in HTML. BeautifulSoup provides `.find()` and `.find_all()` methods to extract elements from a webpage. This helps in collecting URLs and anchor text for navigation or data analysis.

In [9]:
links = soup.find_all('a', limit=5)
for link in links:
    print(link.get_text(strip=True), "->", link.get('href'))

 -> https://www.geeksforgeeks.org/
DSA -> https://www.geeksforgeeks.org/dsa/dsa-tutorial-learn-data-structures-and-algorithms/
Practice Problems -> https://www.geeksforgeeks.org/explore
C -> https://www.geeksforgeeks.org/c/c-programming-language/
C++ -> https://www.geeksforgeeks.org/cpp/c-plus-plus/
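Pages often contain relative links and `<a>` tags with no `href` at all, for which `link.get('href')` returns a relative path or `None`. A small sketch of one way to handle both cases, using `urljoin` from the standard library. The base URL, the sample HTML, and the helper name `extract_links` are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for this illustration.
BASE_URL = "https://example.com/docs/"
sample_html = """
<a href="/about">About</a>
<a href="guide.html">Guide</a>
<a name="top">No href here</a>
<a href="https://python.org">Python</a>
"""


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only anchors that actually carry an href attribute.
    for a in soup.find_all("a", href=True):
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


for text, url in extract_links(sample_html, BASE_URL):
    print(text, "->", url)
```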


Question 3: Extract Headings

Theory
HTML headings such as `<h2>` represent section titles. Scraping headings is useful for content summarization and indexing. Extracted data can be stored in files such as CSV for further analysis.

In [10]:
import csv

headings = [h.text.strip() for h in soup.find_all('h2')]

with open('headings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for h in headings:
        writer.writerow([h])

print("Headings saved to headings.csv")

Headings saved to headings.csv
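Scraped headings frequently include empty strings and duplicates, and a CSV file without a header row is harder to load later. The sketch below cleans the list, writes a header row, and reads the data back to confirm the round trip. It writes to an in-memory `io.StringIO` buffer rather than a file, and the sample HTML is an assumption for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup containing an empty and a duplicate heading.
sample_html = "<h2> Intro </h2><h2></h2><h2>Setup</h2><h2>Intro</h2>"
soup = BeautifulSoup(sample_html, "html.parser")

# Strip whitespace, drop empty headings, and de-duplicate while keeping order.
seen = set()
headings = []
for h in soup.find_all("h2"):
    text = h.get_text(strip=True)
    if text and text not in seen:
        seen.add(text)
        headings.append(text)

# Write a header row followed by one heading per row.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["heading"])
writer.writerows([h] for h in headings)

# Read the CSV text back to check what was stored.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

Replacing the buffer with `open('headings.csv', 'w', newline='', encoding='utf-8')` would write the same rows to disk.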


Question 4: Scrape Wikipedia Table

Theory
Tables in HTML consist of `<table>`, `<tr>`, and `<td>` tags. Scraping tables allows structured data extraction. Proper encoding and exception handling ensure reliability.

In [11]:
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

try:
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable'})
    
    for row in table.find_all('tr'):
        cells = [cell.text.strip() for cell in row.find_all(['td','th'])]
        print(cells)
except Exception as e:
    print("Error:", e)

Error: 403 Client Error: Forbidden for url: https://en.wikipedia.org/wiki/List_of_countries_by_population
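A 403 response like the one above often means the server rejected the request itself rather than the URL. One common cause, assumed here and not verified in this lab, is the default `requests` User-Agent; Wikimedia's User-Agent policy asks clients to send a descriptive one. The sketch below sends a custom header and separates fetching from parsing so the parsing step can be checked offline. The User-Agent string, the helper names, and the sample table are all hypothetical.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical identifying string; replace with your own project and contact.
HEADERS = {"User-Agent": "Lab5Scraper/0.1 (student exercise; contact@example.com)"}


def fetch_html(url, timeout=10):
    """Fetch a page with a descriptive User-Agent and return its HTML text."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


def table_rows(html, css_class="wikitable"):
    """Return the first matching table as a list of rows; [] if none is found."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in table.find_all("tr")
    ]


# Made-up table used to test the parser without a network call.
sample_html = """<table class="wikitable">
<tr><th>Country</th><th>Population</th></tr>
<tr><td>Exampleland</td><td>100</td></tr>
</table>"""
print(table_rows(sample_html))

# Live usage (requires network access):
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_countries_by_population")
# print(table_rows(html)[:3])
```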


## Question 5: Selectors and Navigation

### Theory
BeautifulSoup supports DOM navigation such as parents, siblings, and attribute-based selection. This enables fine-grained traversal of HTML elements.

In [17]:
html = '''<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('p', class_='intro'))
print("Parent of <a>:", soup.find('a').parent)
print("Next sibling:", soup.find('p').find_next_sibling())

[<p class="intro">Welcome</p>, <p class="intro">Learn Python</p>]
Parent of <a>: <body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body>
Next sibling: <p class="intro">Learn Python</p>
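Attribute-based selection can also be written with CSS selectors through `select()` and `select_one()`. The sketch below reuses the HTML from the cell above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '''<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')

# Class selector: every <p> with class "intro".
intro_texts = [p.get_text() for p in soup.select('p.intro')]

# Attribute selector: the first <a> whose href starts with "https://".
external = soup.select_one('a[href^="https://"]')

# Walk backwards from the link to the nearest preceding <p> sibling.
previous = external.find_previous_sibling('p')

print(intro_texts)
print(external['href'])
print(previous.get_text())
```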


Question 6: Tag Manipulation

Theory
BeautifulSoup allows modification of HTML tags such as changing tag names, adding attributes, and updating text dynamically.

In [18]:
tag = BeautifulSoup('<b class="boldest">Hello</b>', 'html.parser').b
tag.name = 'strong'
tag['id'] = 'greeting'
tag.string = 'Hi there'
print(tag)

<strong class="boldest" id="greeting">Hi there</strong>
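Beyond editing an existing tag, new tags can be created and attached to the tree, and attributes can be deleted. A short sketch, assuming a small wrapper `<div>` that is not part of the original exercise:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the original <b> tag wrapped in a <div>.
soup = BeautifulSoup('<div id="box"><b class="boldest">Hello</b></div>', 'html.parser')
box = soup.find('div', id='box')

# Create a new <a> tag and append it as the last child of the <div>.
link = soup.new_tag('a', href='https://python.org')
link.string = 'Python'
box.append(link)

# Remove the class attribute from the existing <b> tag.
del box.b['class']

print(box)
```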


## Question 7: Advanced Navigation

### Theory
Advanced navigation techniques help locate text nodes and their related elements such as parents and siblings within an HTML structure.

In [14]:
html = '''<table><tr><td>Apple</td></tr><tr><td>Banana</td></tr></table>'''
soup = BeautifulSoup(html, 'html.parser')

apple = soup.find(string='Apple')
print("Parent td:", apple.parent)
print("Siblings:", apple.parent.find_next_siblings())

Parent td: <td>Apple</td>
Siblings: []
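The empty `Siblings: []` result above is expected: each `<td>` is the only cell in its row, so it has no siblings. The other fruit sits in a sibling `<tr>`, one level up. A sketch of reaching it by climbing to the row first:

```python
from bs4 import BeautifulSoup

html = '''<table><tr><td>Apple</td></tr><tr><td>Banana</td></tr></table>'''
soup = BeautifulSoup(html, 'html.parser')

apple = soup.find(string='Apple')

# Climb from the text node to its enclosing row, then look at later rows.
row = apple.find_parent('tr')
later_rows = row.find_next_siblings('tr')
names = [r.get_text(strip=True) for r in later_rows]

print("Row:", row)
print("Following rows:", names)
```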


## Question 8: Using SoupStrainer

### Theory
SoupStrainer improves efficiency by parsing only specific tags instead of the entire document.

In [15]:
from bs4 import SoupStrainer

html = '''<html><a href="page1.html">Page 1</a><p>Paragraph</p><a href="page2.html">Page 2</a></html>'''
only_a = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a)
print(soup)

<a href="page1.html">Page 1</a><a href="page2.html">Page 2</a>
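One consequence of `parse_only` is that elements outside the strainer are never added to the tree, so they cannot be searched afterwards. The sketch below, using the same HTML, collects the link targets and checks that the `<p>` tag is gone.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '''<html><a href="page1.html">Page 1</a><p>Paragraph</p><a href="page2.html">Page 2</a></html>'''

# Keep only <a> tags while parsing.
only_a = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a)

# The retained links can be queried as usual.
hrefs = [a['href'] for a in soup.find_all('a')]

# The paragraph was discarded during parsing, so this lookup returns None.
paragraph = soup.find('p')

print(hrefs)
print(paragraph)
```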


Question 9: Exception Handling

Theory
Robust web scraping requires handling exceptions such as timeouts, HTTP errors, missing elements, and request failures.

In [16]:
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise AttributeError('Table not found')
except requests.exceptions.Timeout:
    print('Timeout Error')
except requests.exceptions.HTTPError:
    print('HTTP Error')
except requests.exceptions.RequestException:
    print('Request Error')
except AttributeError as e:
    print(e)

HTTP Error
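The order of the `except` clauses matters: `Timeout` and `HTTPError` are both subclasses of `RequestException`, so the general handler must come last or it would catch everything first. The sketch below shows the same ordering in a function whose HTTP call can be swapped for a stand-in, which lets each branch be exercised without network access. The names `fetch_status`, `fake_timeout`, and `fake_connection_error` are hypothetical helpers introduced for this illustration.

```python
import requests


def fetch_status(url, get=requests.get, timeout=5):
    """Return a short label describing how the request ended."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        return "timeout"
    except requests.exceptions.HTTPError:
        return "http error"
    except requests.exceptions.RequestException:
        # Catch-all for remaining request failures; must come after the subclasses.
        return "request error"
    return "ok"


# Stand-ins that simulate failures instead of contacting a server.
def fake_timeout(url, timeout):
    raise requests.exceptions.Timeout("simulated timeout")


def fake_connection_error(url, timeout):
    raise requests.exceptions.ConnectionError("simulated connection failure")


print(fetch_status("https://example.com", get=fake_timeout))
print(fetch_status("https://example.com", get=fake_connection_error))
```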


Discussion
This lab demonstrated how Python can be effectively used for web scraping tasks. Students learned how to fetch web pages, parse HTML, extract structured and unstructured data, navigate DOM trees, manipulate tags, and handle exceptions gracefully.

Conclusion
Web scraping is a powerful technique for collecting data from the internet. Using requests and BeautifulSoup, a program can fetch and parse complex HTML documents efficiently. Proper exception handling and ethical scraping practices ensure reliability and maintainability of scraping programs.