
🌟 Exercise 1 : Parsing HTML with BeautifulSoup
Instructions
- Objective: Use urlopen() to fetch the HTML content of a webpage and then parse it using BeautifulSoup.

- Read the HTML content of the page.
- Create a BeautifulSoup object to parse this HTML.
- Find the title of the webpage (the content inside the <title> tag).
- Extract all paragraphs (<p> tags) from the page.
- Retrieve all links (URLs in <a href=""> tags) on the page.
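
The objective mentions `urlopen()`, while the cells below parse an inline HTML string. As a minimal sketch of the fetch step, assuming the standard-library `urllib.request.urlopen` and a hypothetical helper name `fetch_soup` (not part of bs4 or urllib):

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch `url` and return a parsed BeautifulSoup tree.

    `fetch_soup` is a hypothetical helper name used only for illustration.
    """
    with urlopen(url, timeout=timeout) as response:
        raw_html = response.read()  # bytes; BeautifulSoup detects the encoding
    return BeautifulSoup(raw_html, "html.parser")
```

Calling `fetch_soup("https://example.com")` (a placeholder URL) would return a tree that can be queried with `.title`, `.find_all('p')`, and `.find_all('a')`, the same way as the `soup` object in the cells that follow.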


In [22]:
html = '''
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sports World</title>
    <style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style>
</head>
<body>

    <header>
        <h1>Welcome to Sports World</h1>
        <p>Your one-stop destination for the latest sports news and videos.</p>
    </header>

    <nav>
        <a href="#football">Football</a>
        <a href="#basketball">Basketball</a>
        <a href="#tennis">Tennis</a>
    </nav>

    <section id="football">
        <h2>Football</h2>
        <article>
            <h3>Latest Football News</h3>
            <p>Read about the latest football matches and player news.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/football-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="basketball">
        <h2>Basketball</h2>
        <article>
            <h3>NBA Highlights</h3>
            <p>Watch highlights from the latest NBA games.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/basketball-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="tennis">
        <h2>Tennis</h2>
        <article>
            <h3>Grand Slam Updates</h3>
            <p>Get the latest updates from the world of Grand Slam tennis.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/tennis-video-id" frameborder="0" allowfullscreen></iframe>
            </div>
        </article>
    </section>

    <footer>
        <form action="mailto:contact@sportsworld.com" method="post" enctype="text/plain">
            <label for="name">Name:</label><br>
            <input type="text" id="name" name="name"><br>
            <label for="email">Email:</label><br>
            <input type="email" id="email" name="email"><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50"></textarea><br><br>
            <input type="submit" value="Send">
        </form>
    </footer>

</body>
</html>'''

from bs4 import BeautifulSoup

# Parse the HTML string with Python's built-in parser
soup = BeautifulSoup(html, 'html.parser')

# soup.title is shorthand for soup.html.head.title
soup.title.get_text()


'Sports World'

In [31]:
# Extract all paragraphs (<p> tags) from the page.
soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find_all('p')

# Print each paragraph
for index, paragraph in enumerate(paragraphs, start=1):
    print(f"Paragraph {index}: {paragraph.get_text(strip=True)}")



Paragraph 1: Your one-stop destination for the latest sports news and videos.
Paragraph 2: Read about the latest football matches and player news.
Paragraph 3: Watch highlights from the latest NBA games.
Paragraph 4: Get the latest updates from the world of Grand Slam tennis.


In [32]:
# Retrieve all links (URLs in <a href=""> tags) on the page.
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
print("Links (href attributes):")
for index, link in enumerate(links, start=1):
    url = link.get('href')  # Get the href attribute
    if url:
        print(f"Link {index}: {url}")

# Retrieve all tags with src attributes
tags_with_src = soup.find_all(src=True)
print("\nTags with src attributes:")
for index, tag in enumerate(tags_with_src, start=1):
    src_url = tag['src']  # Get the src attribute
    print(f"Tag {index}: {src_url}")



Links (href attributes):
Link 1: #football
Link 2: #basketball
Link 3: #tennis

Tags with src attributes:
Tag 1: https://www.youtube.com/embed/football-video-id
Tag 2: https://www.youtube.com/embed/basketball-video-id
Tag 3: https://www.youtube.com/embed/tennis-video-id
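
The `href` values printed above are fragment-only links, so they are meaningful only relative to the page they came from. A minimal sketch of resolving them to absolute URLs with `urllib.parse.urljoin`, assuming a hypothetical base URL (the notebook parses an inline string, so no real page address exists):

```python
from urllib.parse import urljoin

# Hypothetical address of the page, chosen only for illustration.
BASE_URL = "https://example.com/sports/"

# The href values extracted in the cell above.
hrefs = ["#football", "#basketball", "#tennis"]

# urljoin keeps the base path and appends the fragment.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs]
for link in absolute_links:
    print(link)
```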


🌟 Exercise 2 : Scraping robots.txt from Wikipedia

*Instructions*
- Write a Python program to download and display the content of robots.txt for Wikipedia.

In [25]:
import requests
response = requests.get("https://en.wikipedia.org/robots.txt")
# robots.txt is plain text, so it needs no HTML parsing
print(response.text[:100])


# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on
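
Once the file is downloaded, the standard library can interpret its rules. A minimal sketch using `urllib.robotparser.RobotFileParser`, run on made-up rules rather than Wikipedia's actual file; `build_parser` is a hypothetical helper name:

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; these are not Wikipedia's actual rules.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from already-downloaded text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(sample_robots_txt)
print(rules.can_fetch("*", "https://example.com/public/page"))   # allowed
print(rules.can_fetch("*", "https://example.com/private/page"))  # disallowed
```

With the notebook's `response.text`, the same call would be `build_parser(response.text)`.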


🌟 Exercise 3 : Extracting Headers from Wikipedia’s Main Page

*Instructions*
- Write a Python program to extract and display all the header tags from Wikipedia.

In [36]:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/Main_Page"
response = requests.get(url)
soup1 = BeautifulSoup(response.text, 'html.parser')

header_tags = soup1.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

for header_tag in header_tags:
    print(header_tag.text)

Main Page
Welcome to Wikipedia
From today's featured article
Did you know ...
In the news
On this day
Today's featured picture
Other areas of Wikipedia
Wikipedia's sister projects
Wikipedia languages
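
Listing six tag names by hand is easy to mistype. As an alternative sketch, `find_all` also accepts a compiled regular expression that is matched against tag names. The inline document below is a stand-in for a downloaded page, and `sample_soup` and `heading_pattern` are illustrative names:

```python
import re

from bs4 import BeautifulSoup

# Small inline document standing in for a downloaded page.
sample_html = """
<h1>Main</h1>
<h2>Section</h2>
<p>Not a heading</p>
<h5>Minor note</h5>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Matches exactly h1 through h6, and not tags such as <head> or <header>.
heading_pattern = re.compile(r"^h[1-6]$")

headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in sample_soup.find_all(heading_pattern)
]
for level, text in headings:
    print(level, text)
```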


🌟 Exercise 4 : Checking for Page Title

*Instructions*

- Write a Python program to check whether a page contains a title or not.


In [34]:
# soup1.title is None when the page has no <title> tag
print(soup1.title.get_text() if soup1.title else "No <title> tag found")

Wikipedia, the free encyclopedia
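
The check can also be wrapped in a function that returns a boolean. A minimal sketch, assuming an empty `<title>` should count as missing (a design choice, not something the exercise specifies); `has_title` is a hypothetical name:

```python
from bs4 import BeautifulSoup


def has_title(html_text):
    """Return True if the document has a <title> tag with non-empty text."""
    soup = BeautifulSoup(html_text, "html.parser")
    title_tag = soup.title  # None when there is no <title> tag
    return title_tag is not None and bool(title_tag.get_text(strip=True))


print(has_title("<html><head><title>Sports World</title></head></html>"))
print(has_title("<html><body><p>No title here</p></body></html>"))
```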


🌟 Exercise 5 : Analyzing US-CERT Security Alerts

*Instructions*
- Write a Python program to get the number of security alerts issued by US-CERT in the current year.

In [42]:
from bs4 import BeautifulSoup
import requests
from datetime import datetime

url = "https://www.cisa.gov/news-events/cybersecurity-advisories?f%5B0%5D=advisory_type%3A93"
response = requests.get(url)
Security = BeautifulSoup(response.text, 'html.parser')

In [44]:
from bs4 import BeautifulSoup
import requests
from datetime import datetime

def get_security_alerts_count(url):
    current_year = datetime.now().year
    response = requests.get(url)
    if response.status_code != 200:
        return 0
    soup = BeautifulSoup(response.text, 'html.parser')
    alerts = soup.find_all('div', class_='c-teaser')
    count = 0
    for alert in alerts:
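        # 'year' is not an HTML element, so find() returns None for every
        # teaser and the count stays 0. The date probably sits in another
        # element such as <time> (an assumption; inspect the page to confirm).
        date_element = alert.find('year')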
        if date_element:
            alert_date = date_element.get_text(strip=True)
            alert_year = datetime.strptime(alert_date, '%B %d, %Y').year
            if alert_year == current_year:
                count += 1
    return count

url = "https://www.cisa.gov/news-events/cybersecurity-advisories?f%5B0%5D=advisory_type%3A93"
alerts_count = get_security_alerts_count(url)
print(f"Number of security alerts issued by US-CERT in the current year: {alerts_count}")

Number of security alerts issued by US-CERT in the current year: 0
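
A count of 0 usually means the selectors did not match the page. The counting logic itself can be checked offline against a small inline listing. The sketch below assumes each teaser carries a `<time datetime="...">` element; that markup is hypothetical and not the verified structure of the CISA page, and `count_alerts_in_year` is an illustrative name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class name and <time> element are assumptions
# for illustration, not the verified structure of the CISA page.
sample_listing = """
<div class="c-teaser"><time datetime="2024-01-15T12:00:00Z">Jan 15, 2024</time></div>
<div class="c-teaser"><time datetime="2024-06-02T12:00:00Z">Jun 02, 2024</time></div>
<div class="c-teaser"><time datetime="2023-11-30T12:00:00Z">Nov 30, 2023</time></div>
<div class="c-teaser">No date here</div>
"""


def count_alerts_in_year(html_text, year):
    """Count teasers whose <time datetime="..."> value starts with `year`."""
    soup = BeautifulSoup(html_text, "html.parser")
    count = 0
    for teaser in soup.find_all("div", class_="c-teaser"):
        time_tag = teaser.find("time")
        if time_tag is None or not time_tag.has_attr("datetime"):
            continue  # skip teasers without a machine-readable date
        if time_tag["datetime"].startswith(f"{year}-"):
            count += 1
    return count


print(count_alerts_in_year(sample_listing, 2024))
```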


🌟 Exercise 6 : Scraping Movie Details

*Instructions*
- Write a Python program to get the movie name, year, and a brief summary of the top 10 random movies from this IMDb website.



In [39]:
from bs4 import BeautifulSoup
import requests
url = "https://www.scrapethissite.com/pages/forms/"
response = requests.get(url)
hockey = BeautifulSoup(response.text, 'html.parser')


In [40]:

from bs4 import BeautifulSoup
import requests

def scrape_top_teams(url):

    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve page. Status code: {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_='table')

    if not table:
        print("Could not find the team table.")
        return []

    rows = table.find_all('tr')[1:]  # skip the header row

    top_teams = []
    for row in rows[:5]:  # keep only the first five rows of the table
        cols = row.find_all('td')
        if len(cols) < 5:
            continue  
        
        team_name = cols[0].text.strip()
        year = cols[1].text.strip()
        wins = cols[2].text.strip()
        losses = cols[3].text.strip()
        ot_losses = cols[4].text.strip()

        top_teams.append({
            'Team Name': team_name,
            'Year': year,
            'Wins': wins,
            'Losses': losses,
            'OT Losses': ot_losses
        })

    return top_teams

url = "https://www.scrapethissite.com/pages/forms/"
top_teams = scrape_top_teams(url)

if top_teams:
    print("Top 5 Teams:")
    for team in top_teams:
        print(team)


Top 5 Teams:
{'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44', 'Losses': '24', 'OT Losses': ''}
{'Team Name': 'Buffalo Sabres', 'Year': '1990', 'Wins': '31', 'Losses': '30', 'OT Losses': ''}
{'Team Name': 'Calgary Flames', 'Year': '1990', 'Wins': '46', 'Losses': '26', 'OT Losses': ''}
{'Team Name': 'Chicago Blackhawks', 'Year': '1990', 'Wins': '49', 'Losses': '23', 'OT Losses': ''}
{'Team Name': 'Detroit Red Wings', 'Year': '1990', 'Wins': '34', 'Losses': '38', 'OT Losses': ''}
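
Every value in the output above is a string, and `OT Losses` is empty for these rows. A minimal sketch of converting the numeric columns to integers, assuming a blank cell should become `None` (a design choice); `to_int` is a hypothetical helper name:

```python
def to_int(text):
    """Convert a scraped cell to int, or None when the cell is blank."""
    text = text.strip()
    return int(text) if text else None


# One row copied from the output above.
row = {'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44',
       'Losses': '24', 'OT Losses': ''}

numeric_fields = ('Year', 'Wins', 'Losses', 'OT Losses')

# Convert only the numeric columns and leave the team name untouched.
typed_row = {key: (to_int(value) if key in numeric_fields else value)
             for key, value in row.items()}
print(typed_row)
```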
