🌟 Exercise 1 : Parsing HTML with BeautifulSoup
Instructions
Objective: Use urlopen() to fetch the HTML content of a webpage and then parse it using BeautifulSoup.

In [8]:
html = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sports World</title>
    <style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style>
</head>
<body>

    <header>
        <h1>Welcome to Sports World</h1>
        <p>Your one-stop destination for the latest sports news and videos.</p>
    </header>

    <nav>
        <a href="#football">Football</a>
        <a href="#basketball">Basketball</a>
        <a href="#tennis">Tennis</a>
    </nav>

    <section id="football">
        <h2>Football</h2>
        <article>
            <h3>Latest Football News</h3>
            <p>Read about the latest football matches and player news.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/football-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="basketball">
        <h2>Basketball</h2>
        <article>
            <h3>NBA Highlights</h3>
            <p>Watch highlights from the latest NBA games.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/basketball-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="tennis">
        <h2>Tennis</h2>
        <article>
            <h3>Grand Slam Updates</h3>
            <p>Get the latest updates from the world of Grand Slam tennis.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/tennis-video-id" frameborder="0" allowfullscreen></iframe>
            </div>
        </article>
    </section>

    <footer>
        <form action="mailto:contact@sportsworld.com" method="post" enctype="text/plain">
            <label for="name">Name:</label><br>
            <input type="text" id="name" name="name"><br>
            <label for="email">Email:</label><br>
            <input type="email" id="email" name="email"><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50"></textarea><br><br>
            <input type="submit" value="Send">
        </form>
    </footer>

</body>
</html>'''

Read the HTML content of the page.

Create a BeautifulSoup object to parse this HTML.

In [9]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser') 

type(soup) #returns the type of the variable soup
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Sports World</title>
<style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style>
</head>
<body>
<header>
<h1>Welcome to Sports World</h1>
<p>Your one-stop destination for the latest sports news and videos.</p>
</header>
<nav>
<a href="#football">Football</a>
<a href="#basketball">Basketball</a>
<a href="#tennis">Tennis</a>
</nav>
<section id="football">
<h2>Football</h2>
<article>
<h3>Latest Football News</h3>
<p>Read about the latest football matches and player news.</p>
<div class="video">
<iframe allowfu

Find the title of the webpage (the content inside the <title> tag).

In [12]:
soup.title.string

'Sports World'

Extract all paragraphs (<p> tags) from the page.

In [13]:
soup.find_all('p')

[<p>Your one-stop destination for the latest sports news and videos.</p>,
 <p>Read about the latest football matches and player news.</p>,
 <p>Watch highlights from the latest NBA games.</p>,
 <p>Get the latest updates from the world of Grand Slam tennis.</p>]

Retrieve all links (URLs in <a href=""> tags) on the page.

In [19]:
soup.find_all('a')

[<a href="#football">Football</a>,
 <a href="#basketball">Basketball</a>,
 <a href="#tennis">Tennis</a>]
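
The call above returns whole `<a>` tags rather than the URL strings the task asks for. A minimal sketch of pulling out just the `href` values, assuming a trimmed copy of the navigation markup from the sample page. The names `nav_html`, `nav_soup` and `links`, and the extra anchor without an `href`, are illustrative and not part of the original exercise.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <nav> block from the sample page, plus one
# hypothetical anchor without an href to show how it is skipped.
nav_html = '''
<nav>
    <a href="#football">Football</a>
    <a href="#basketball">Basketball</a>
    <a href="#tennis">Tennis</a>
    <a>No target</a>
</nav>
'''

nav_soup = BeautifulSoup(nav_html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute,
# so a["href"] cannot raise KeyError.
links = [a["href"] for a in nav_soup.find_all("a", href=True)]
print(links)
```

Filtering with `href=True` is one option; calling `a.get("href")` and dropping `None` values would work as well.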

🌟 Exercise 2 : Scraping robots.txt from Wikipedia
Instructions
Write a Python program to download and display the content of robots.txt for Wikipedia.

In [6]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())
soup.prettify()

Wikipedia, the free encyclopedia


'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-not-available" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Wikipedia, the free encyclopedia\n  </title>\n  <script>\n   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-li

In [8]:
import requests

url = "https://en.wikipedia.org/robots.txt"
headers = {"User-Agent": "Mozilla/5.0"} 

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  
text = response.text
print(text)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Z
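
Once the file is downloaded, its rules can also be evaluated in code. A minimal sketch using the standard library's `urllib.robotparser`, assuming a short excerpt of the rules printed above (the `sample_rules` string and the variable names are illustrative). It parses text that is already in memory, so no second request is needed.

```python
from urllib.robotparser import RobotFileParser

# Two entries copied from the robots.txt output shown above.
sample_rules = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, e.g. response.text.splitlines().
parser.parse(sample_rules.splitlines())

page = "https://en.wikipedia.org/wiki/Main_Page"
mj12_allowed = parser.can_fetch("MJ12bot", page)     # "Disallow: /" blocks everything
israbot_allowed = parser.can_fetch("IsraBot", page)  # empty Disallow allows everything
print(mj12_allowed, israbot_allowed)
```

With the real file, `response.text.splitlines()` from the cell above could be passed to `parse()` in place of the excerpt.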

🌟 Exercise 3 : Extracting Headers from Wikipedia’s Main Page
Instructions
Write a Python program to extract and display all the header tags from Wikipedia's main page.

In [15]:
url = "https://en.wikipedia.org/wiki/Main_Page"
r = requests.get(url, headers=headers, timeout=30)
soup = BeautifulSoup(r.text, "html.parser")

# Only <h1> and <h2> are collected here; add 'h3' through 'h6' to cover every heading level.
header_tags = soup.find_all(['h1', 'h2'])

for i, tag in enumerate(header_tags, start=1):
    print(f"{i:02d}. <{tag.name}> {tag.get_text(strip=True)}")

01. <h1> Main Page
02. <h1> Welcome toWikipedia
03. <h2> From today's featured article
04. <h2> Did you know ...
05. <h2> In the news
06. <h2> On this day
07. <h2> Today's featured picture
08. <h2> Other areas of Wikipedia
09. <h2> Wikipedia's sister projects
10. <h2> Wikipedia languages
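
The second line of the output reads "Welcome toWikipedia" because `get_text(strip=True)` joins the text of nested tags without a separator. A minimal sketch of one way around that, which also matches every heading level with a regular expression. The `sample` markup is a hypothetical stand-in for the live page, not the actual Wikipedia HTML.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a heading with a nested link.
sample = """
<h1>Welcome to <a href="/wiki/Wikipedia">Wikipedia</a></h1>
<h2>In the news</h2>
<h3>Hypothetical sub-heading</h3>
"""
heading_soup = BeautifulSoup(sample, "html.parser")

# A compiled regex passed as the tag name matches h1 through h6.
headings = heading_soup.find_all(re.compile(r"^h[1-6]$"))

# Passing a separator keeps the text of nested tags from running together.
texts = [(h.name, h.get_text(" ", strip=True)) for h in headings]
for name, text in texts:
    print(f"<{name}> {text}")
```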


🌟 Exercise 4 : Checking for Page Title
Instructions
Write a Python program to check whether a page contains a title or not.

In [19]:
title_tag = soup.title  # None when the page has no <title> tag
title_text = title_tag.get_text(strip=True) if title_tag else ""

if title_text:
    print("Title found:", title_text)
else:
    print("No title on the page.")

Title found: Wikipedia, the free encyclopedia


🌟 Exercise 5 : Analyzing US-CERT Security Alerts
Instructions
Write a Python program to get the number of security alerts issued by US-CERT in the current year.
Source

In [28]:
BASE_URL = ("https://www.cisa.gov/news-events/cybersecurity-advisories"
            "?f%5B0%5D=advisory_type%3A93")
HEADERS = {"User-Agent": "Mozilla/5.0"}


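r = requests.get(BASE_URL, headers=HEADERS, timeout=30)
r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")

# The code assumes each advisory in the listing carries a <time> element with its date.
times = soup.find_all("time")
year = '2025'
total = 0

# Note: this counts only the advisories on the first page of results.
on_this_page = sum(
    1 for t in times
    if year in ((t.get("datetime") or "") + t.get_text(strip=True))
)
total += on_this_page
print(f"Total advisories in {year}: {total}")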

Total advisories in 2025: 10
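
The substring test above counts any `<time>` element whose attribute or text contains "2025" anywhere. A slightly stricter variant parses the `datetime` attribute and compares the year. The sketch below assumes the attribute holds ISO 8601 values; the sample strings and the `count_in_year` helper are hypothetical and not taken from the CISA page.

```python
from datetime import datetime

def count_in_year(datetime_values, year):
    """Count the ISO 8601 strings whose year equals `year`."""
    count = 0
    for value in datetime_values:
        if not value:
            continue  # missing attribute
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            continue  # not a parseable date, skip it
        if parsed.year == year:
            count += 1
    return count

# Hypothetical attribute values, e.g. [t.get("datetime") for t in times].
sample_values = ["2025-10-21T12:00:00Z", "2024-12-30T08:00:00Z",
                 "2025-01-05", None, "not a date"]
matches = count_in_year(sample_values, 2025)
print(matches)
```

To cover the whole year rather than one page, the same count would have to be repeated for each page of the listing.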


🌟 Exercise 6 : Scraping Movie Details
Instructions
Write a Python program to get the movie name, year, and a brief summary of the top 10 random movies from this IMDb website.

In [31]:
url = 'https://www.imdb.com/list/ls091294718/'
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
soup

<!DOCTYPE html>
<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/"><head><meta charset="utf-8"/><meta content="width=device-width" name="viewport"/><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1761409083172);
        }
    })</script><title>Random Movie Roulette</title><meta content="Rules: Generate a number (from 1 to x) via: www.random.org

See how many number of films there are in the sidebar.

You can skip movies 10 times but never go back.

If you end up with a movie thats part of a series you c

In [39]:
import re, random
import pandas as pd

cards = soup.select("li.ipc-metadata-list-summary-item")

rows = []
for c in cards:

    t = c.find(class_="ipc-title__text")
    title = t.get_text(strip=True) if t else None

    meta = c.select(".dli-title-metadata-item")
    year = meta[0].get_text(strip=True) if meta else ""

    m = re.search(r"\b(19|20)\d{2}\b", year)
    year = m.group(0) if m else year

    s = (c.find(class_="ipc-html-content-inner-div")
         or c.find(class_="ipc-overflowText")
         or c.find("p"))
    summary = s.get_text(strip=True) if s else ""

    if title: 
        rows.append({"Movie": title, "Year": year, "Presentation": summary})

if len(rows) > 10:
    rows = random.sample(rows, 10)

data = pd.DataFrame(rows)
data.head(10)


    Movie             Year  Presentation
0   5. Top Gun        1986  The Top Gun Naval Fighter Weapons School is wh...
1   15. Cars          2006  On the way to the biggest race of his life, a ...
2   18. 50/50         2011  Inspired by a true story, a comedy centered on...
3   1. The Thing      1982  A research team in Antarctica is hunted by a s...
4   17. Ed Wood       1994  Ambitious but troubled movie director Edward D...
5   4. The Evil Dead  1981  Five friends travel to a cabin in the woods, w...
6   8. Dog Soldiers   2002  A routine military exercise turns into a night...
7   11. Alien         1979  After investigating a mysterious transmission ...
8   3. Jaws           1975  When a massive killer shark unleashes chaos on...
9   23. הגוניס        1985  A group of young misfits called The Goonies di...
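
The entries in the Movie column still carry the list rank as a prefix ("5. Top Gun"). A minimal sketch of separating the rank from the title with a regular expression, assuming the prefix always has the form "number, dot, space" as in the output above. `split_rank` is a hypothetical helper, not part of the original solution.

```python
import re

def split_rank(raw_title):
    """Split '5. Top Gun' into (5, 'Top Gun'); return (None, raw_title) if no rank prefix."""
    match = re.match(r"^(\d+)\.\s+(.*)$", raw_title)
    if match:
        return int(match.group(1)), match.group(2)
    return None, raw_title

# Titles copied from the table above, plus one without a prefix.
examples = ["5. Top Gun", "18. 50/50", "Alien"]
parsed = [split_rank(t) for t in examples]
print(parsed)
```

Applied to the DataFrame, the helper could feed two new columns, for example through `data["Movie"].map(split_rank)`.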
