**🌟 Exercise 1 : Parsing HTML with BeautifulSoup**
**Instructions**
Objective: Use urlopen() to fetch the HTML content of a webpage and then parse it using BeautifulSoup.



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sports World</title>
    <style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style>
</head>
<body>

    <header>
        <h1>Welcome to Sports World</h1>
        <p>Your one-stop destination for the latest sports news and videos.</p>
    </header>

    <nav>
        <a href="#football">Football</a>
        <a href="#basketball">Basketball</a>
        <a href="#tennis">Tennis</a>
    </nav>

    <section id="football">
        <h2>Football</h2>
        <article>
            <h3>Latest Football News</h3>
            <p>Read about the latest football matches and player news.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/football-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="basketball">
        <h2>Basketball</h2>
        <article>
            <h3>NBA Highlights</h3>
            <p>Watch highlights from the latest NBA games.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/basketball-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="tennis">
        <h2>Tennis</h2>
        <article>
            <h3>Grand Slam Updates</h3>
            <p>Get the latest updates from the world of Grand Slam tennis.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/tennis-video-id" frameborder="0" allowfullscreen></iframe>
            </div>
        </article>
    </section>

    <footer>
        <form action="mailto:contact@sportsworld.com" method="post" enctype="text/plain">
            <label for="name">Name:</label><br>
            <input type="text" id="name" name="name"><br>
            <label for="email">Email:</label><br>
            <input type="email" id="email" name="email"><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50"></textarea><br><br>
            <input type="submit" value="Send">
        </form>
    </footer>

</body>
</html>


- Read the HTML content of the page.
- Create a BeautifulSoup object to parse this HTML.
- Find the title of the webpage (the content inside the <title> tag).
- Extract all paragraphs (<p> tags) from the page.
- Retrieve all links (URLs in <a href=""> tags) on the page.

In [8]:
from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html><html lang="en"><head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sports World</title>
    <style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style></head><body>

    <header>
        <h1>Welcome to Sports World</h1>
        <p>Your one-stop destination for the latest sports news and videos.</p>
    </header>

    <nav>
        <a href="#football">Football</a>
        <a href="#basketball">Basketball</a>
        <a href="#tennis">Tennis</a>
    </nav>

    <section id="football">
        <h2>Football</h2>
        <article>
            <h3>Latest Football News</h3>
            <p>Read about the latest football matches and player news.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/football-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="basketball">
        <h2>Basketball</h2>
        <article>
            <h3>NBA Highlights</h3>
            <p>Watch highlights from the latest NBA games.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/basketball-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="tennis">
        <h2>Tennis</h2>
        <article>
            <h3>Grand Slam Updates</h3>
            <p>Get the latest updates from the world of Grand Slam tennis.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/tennis-video-id" frameborder="0" allowfullscreen></iframe>
            </div>
        </article>
    </section>

    <footer>
        <form action="mailto:contact@sportsworld.com" method="post" enctype="text/plain">
            <label for="name">Name:</label><br>
            <input type="text" id="name" name="name"><br>
            <label for="email">Email:</label><br>
            <input type="email" id="email" name="email"><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50"></textarea><br><br>
            <input type="submit" value="Send">
        </form>
    </footer></body></html>
"""

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_doc, 'html.parser')
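
The objective mentions `urlopen()`, while the cell above parses an inline string. A minimal sketch of the `urlopen()` route follows, assuming the page were reachable at some URL. The helper name `fetch_soup` and the temporary file are illustrative assumptions, not part of the exercise; the demo writes a tiny page to disk and opens it through a `file://` URL so it runs without network access.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a URL with urlopen() and return the parsed BeautifulSoup tree.

    Hypothetical helper for illustration only.
    """
    with urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
    # Passing bytes lets BeautifulSoup work out the encoding itself
    return BeautifulSoup(raw_bytes, "html.parser")


# Demo without network access: save a small page and open it via file://
with TemporaryDirectory() as tmp_dir:
    page_path = Path(tmp_dir) / "sports.html"
    page_path.write_text(
        "<html><head><title>Sports World</title></head>"
        "<body><a href='#football'>Football</a></body></html>",
        encoding="utf-8",
    )
    demo_soup = fetch_soup(page_path.as_uri())

demo_title = demo_soup.title.string
demo_links = [a.get("href") for a in demo_soup.find_all("a")]
print(demo_title, demo_links)
```

For a real site, the same helper would be called with an `http://` or `https://` address instead of the `file://` one.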

In [9]:
# Find the title of the webpage
page_title = soup.title.string
print("Page Title:", page_title)

Page Title: Sports World


In [10]:
# Extract all paragraphs (<p> tags) from the page
paragraphs = soup.find_all('p')
print("\nParagraphs:")
for i, p in enumerate(paragraphs, 1):
    print(f"{i}.", p.text)


Paragraphs:
1. Your one-stop destination for the latest sports news and videos.
2. Read about the latest football matches and player news.
3. Watch highlights from the latest NBA games.
4. Get the latest updates from the world of Grand Slam tennis.


In [13]:
# Retrieve all links (<a href="">)
links = soup.find_all('a')
print("\nLinks (URLs):")
for i, link in enumerate(links, 1):
    href = link.get('href')
    print(f"{i}.", href)


Links (URLs):
1. #football
2. #basketball
3. #tennis
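
All three links are fragment references that point to sections of the same page. If absolute URLs are wanted, `urllib.parse.urljoin` can resolve them against the page address. The sketch below assumes a made-up address, `https://example.com/sports.html`, since the exercise does not say where the page is hosted.

```python
from urllib.parse import urljoin

# Hypothetical address of the page; the exercise does not give one
page_url = "https://example.com/sports.html"

# The href values printed by the cell above
hrefs = ["#football", "#basketball", "#tennis"]

# Resolve each fragment against the page address
absolute_links = [urljoin(page_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```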


In [12]:
# Extracting iframe sources
print("\n--- All Video Sources (src in <iframe> tags) ---")
iframes = soup.find_all('iframe')
for i, iframe in enumerate(iframes):
    src = iframe.get('src')
    if src:
        print(f"Video Source {i+1}: {src}")


--- All Video Sources (src in <iframe> tags) ---
Video Source 1: https://www.youtube.com/embed/football-video-id
Video Source 2: https://www.youtube.com/embed/basketball-video-id
Video Source 3: https://www.youtube.com/embed/tennis-video-id


**🌟 Exercise 2 : Scraping robots.txt from Wikipedia**

**Instructions**

Write a Python program to download and display the content of robots.txt for Wikipedia.

In [14]:
import requests

# Wikipedia's robots.txt URL
url = "https://en.wikipedia.org/robots.txt"

# Send GET request to fetch the robots.txt content
response = requests.get(url)

# Display the response
if response.status_code == 200:
    print("robots.txt content:\n")
    print(response.text)
else:
    print(f"Failed to fetch robots.txt. Status code: {response.status_code}")

robots.txt content:

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Di
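
Once the file is downloaded, the rules can also be interpreted rather than just printed. The sketch below uses the standard library's `urllib.robotparser` on a few rules copied from the output above (not the full file) and asks whether a given user agent may fetch the main page. The agent name `SomeOtherBot` is made up to show what happens when no rule matches.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the output above; the real file is much longer
sample_rules = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:

User-agent: UbiCrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

page = "https://en.wikipedia.org/wiki/Main_Page"
# "SomeOtherBot" is a hypothetical agent with no matching entry in the sample
for agent in ("MJ12bot", "IsraBot", "UbiCrawler", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, page))
```

An empty `Disallow:` line means nothing is disallowed for that agent, and an agent with no matching entry (and no `*` entry in this sample) is allowed by default.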

**🌟 Exercise 3 : Extracting Headers from Wikipedia’s Main Page**

**Instructions**

Write a Python program to extract and display all the header tags from Wikipedia.

In [15]:
# Wikipedia main page URL
url = "https://en.wikipedia.org/wiki/Main_Page"

# Send GET request to fetch the HTML content
response = requests.get(url)

# Check for successful response
if response.status_code == 200:
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract all header tags: h1 to h6
    print("Header tags found on Wikipedia Main Page:\n")
    for level in range(1, 7):
        headers = soup.find_all(f'h{level}')
        for i, header in enumerate(headers, 1):
            print(f"h{level}-{i}: {header.text.strip()}")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

Header tags found on Wikipedia Main Page:

h1-1: Main Page
h1-2: Welcome to Wikipedia
h2-1: From today's featured article
h2-2: Did you know ...
h2-3: In the news
h2-4: On this day
h2-5: Today's featured picture
h2-6: Other areas of Wikipedia
h2-7: Wikipedia's sister projects
h2-8: Wikipedia languages
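
The loop above groups headers by level, so the output no longer reflects where each header sits on the page. When document order matters, `find_all` also accepts a list of tag names and returns matches in the order they appear. A small sketch on made-up sample HTML (not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only to show the ordering
sample_html = """
<h1>Main Page</h1>
<h2>In the news</h2>
<h3>Ongoing</h3>
<h2>On this day</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# One query for h1..h6; results come back in document order
header_names = [f"h{level}" for level in range(1, 7)]
headers_in_order = [
    (tag.name, tag.get_text(strip=True))
    for tag in sample_soup.find_all(header_names)
]

for name, text in headers_in_order:
    print(f"{name}: {text}")
```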


**🌟 Exercise 4 : Checking for Page Title**

**Instructions**

Write a Python program to check whether a page contains a title or not.

In [16]:
# URL to check
url = "https://en.wikipedia.org/wiki/Main_Page"

# Fetch the page
response = requests.get(url)

# Parse the page if response is successful
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Check if a title tag exists and is not empty
    title_tag = soup.title
    if title_tag and title_tag.string:
        print("Page contains a title:")
        print(title_tag.string.strip())
    else:
        print("Page does NOT contain a title.")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")


Page contains a title:
Wikipedia, the free encyclopedia
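
The check can be wrapped in a small function and tried on pages without fetching anything. The sketch below is one possible variant, with the hypothetical name `page_title`. It uses `get_text(strip=True)` so that a `<title>` containing only whitespace is also treated as missing, a case the `title_tag.string` test above would count as a title.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the page title, or None if it is missing or blank.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    text = soup.title.get_text(strip=True)
    return text or None


with_title = page_title("<html><head><title>Wikipedia</title></head></html>")
without_title = page_title("<html><head></head><body><p>Hi</p></body></html>")
blank_title = page_title("<html><head><title>   </title></head></html>")

print(with_title, without_title, blank_title)
```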


**🌟 Exercise 5 : Analyzing US-CERT Security Alerts**

**Instructions**

Write a Python program to get the number of security alerts issued by US-CERT in the current year.

In [18]:
from datetime import datetime

# Get current year
current_year = datetime.now().year

# URL for US-CERT (CISA) Alerts
url = "https://www.cisa.gov/news-events/alerts"

# Send GET request
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all date elements (this selector assumes each date sits in a
    # <span class="date-display-single">; adjust it if the markup differs)
    dates = soup.find_all('span', class_="date-display-single")

    # Count how many dates match the current year
    count = 0
    for date_tag in dates:
        date_text = date_tag.get_text(strip=True)
        try:
            alert_date = datetime.strptime(date_text, "%B %d, %Y")
            if alert_date.year == current_year:
                count += 1
        except ValueError:
            continue  # Skip dates with unexpected formats

    print(f"Number of US-CERT Security Alerts in {current_year}: {count}")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

Failed to fetch the page. Status code: 404
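
Because the request returned 404, the counting logic never ran. It can still be checked offline by moving it into a function and feeding it sample HTML. The sketch below reuses the selector and date format from the cell above. The function name `count_alerts_for_year` and the sample listing are made up for illustration, and the real page's markup may differ.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def count_alerts_for_year(html, year, date_format="%B %d, %Y"):
    """Count date elements in `html` whose parsed year equals `year`.

    Hypothetical helper; the selector is the one used in the cell above.
    """
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    for date_tag in soup.find_all("span", class_="date-display-single"):
        try:
            alert_date = datetime.strptime(
                date_tag.get_text(strip=True), date_format
            )
        except ValueError:
            continue  # Skip text that is not a date in the expected format
        if alert_date.year == year:
            count += 1
    return count


# Made-up listing, not real alert data
sample_listing = """
<span class="date-display-single">January 05, 2024</span>
<span class="date-display-single">March 12, 2024</span>
<span class="date-display-single">December 20, 2023</span>
<span class="date-display-single">not a date</span>
"""

alerts_2024 = count_alerts_for_year(sample_listing, 2024)
print(alerts_2024)
```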


**🌟 Exercise 6 : Scraping Movie Details**

**Instructions**

Write a Python program to get the movie name, year, and a brief summary of 10 random movies from this IMDb website.

In [19]:
import random

# IMDb Top 250 Movies URL
url = "https://www.imdb.com/chart/top/"

# IMDb base URL
base_url = "https://www.imdb.com"

# Send GET request
headers = {"Accept-Language": "en-US,en;q=0.5"}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

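    # Find the title links in the chart table
    movies = soup.select('td.titleColumn a')

    # Pick up to 10 random movies (random.sample raises ValueError
    # if asked for more items than the list contains)
    sampled_movies = random.sample(movies, min(10, len(movies)))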

    print("Top 10 Random IMDb Movies:\n")

    for i, movie_tag in enumerate(sampled_movies, 1):
        movie_name = movie_tag.text.strip()
        movie_url = base_url + movie_tag['href']

        # Visit the movie detail page
        movie_response = requests.get(movie_url, headers=headers)
        movie_soup = BeautifulSoup(movie_response.text, 'html.parser')

        # Extract release year
        year_tag = movie_soup.find('span', id='titleYear')
        year = year_tag.text.strip("()") if year_tag else "Unknown"

        # Extract summary
        summary_tag = movie_soup.find('span', {'data-testid': 'plot-xl'})
        summary = summary_tag.text.strip() if summary_tag else "Summary not available."

        # Output
        print(f"{i}. {movie_name} ({year})")
        print(f"   Summary: {summary}\n")

else:
    print(f"Failed to fetch IMDb Top 250 page. Status code: {response.status_code}")

Failed to fetch IMDb Top 250 page. Status code: 403
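
A 403 response often means the server rejected the request itself. One common cause is the default `python-requests` User-Agent, although that is an assumption here and not something the output confirms. The sketch below shows how a `requests.Session` can carry a browser-like User-Agent on every request. The User-Agent string is made up, and the request is only prepared, not sent, so the example runs offline.

```python
import requests

session = requests.Session()
session.headers.update({
    # Made-up User-Agent string, for illustration only
    "User-Agent": "Mozilla/5.0 (compatible; scraping-exercise/0.1)",
    "Accept-Language": "en-US,en;q=0.5",
})

# Prepare the request without sending it, to inspect the outgoing headers
request = requests.Request("GET", "https://www.imdb.com/chart/top/")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Language"])

# To actually send it: response = session.send(prepared, timeout=10)
```

Whether this is enough to get a 200 response is not guaranteed. The `td.titleColumn a` selector used above would also need to be checked against the markup the site returns.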
