# 🌟 Exercise 1 : Parsing HTML with BeautifulSoup
Instructions

Objective: Fetch the HTML content of a webpage and parse it using BeautifulSoup. The exercise suggests urlopen(); the solution below uses requests.get(), which serves the same purpose.


- Read the HTML content of the page.
- Create a BeautifulSoup object to parse this HTML.
- Find the title of the webpage (the content inside the `<title>` tag).
- Extract all paragraphs (`<p>` tags) from the page.
- Retrieve all links (URLs in `<a href="">` tags) on the page.


In [1]:
!pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime



In [2]:
# send a GET request to the webpage
url = "https://www.infobae.com/america/"
response = requests.get(url)

# create a beautifulsoup object to parse the html content
soup = BeautifulSoup(response.content, 'html.parser')

# find the title of the webpage
title = soup.title.string if soup.title else "Title not found"

print(f"The title of the webpage is: {title}")
print('=' * 50)


# find all paragraph elements on the page
print("Paragraphs on the page:")
paragraphs = soup.find_all('p')

# extract and print the text from each paragraph
for paragraph in paragraphs:
    print(paragraph.get_text())

print('=' * 50)

# retrieve all links (URLs in `<a href="">` tags) on the page.
print("Links on the page:")
links = soup.find_all('a')

for link in links:
    href = link.get('href')
    if href:
        print(href)

The title of the webpage is: Infobae América - Infobae
Paragraphs on the page:
15 Dic, 2024
The Economist
Julian E. Barnes
Jamey Keaten
Daniela Mérida
Francois Becker
Martina Cortés
Federico Sáenz Martínez
Maria Eugenia Capelo
Maria Eugenia Capelo
Dylan Escobar Ruiz
Dylan Escobar Ruiz
Jenifer Nava
Micaela Ragoy
Lisa Caamaño
Gustavo Robles
Gustavo Robles
Mirko Racovsky
Links on the page:
https://www.infobae.com/
https://www.infobae.com/?noredirect
https://www.infobae.com/colombia/
https://www.infobae.com/espana/
https://www.infobae.com/mexico/
https://www.infobae.com/peru/
https://www.infobae.com/estados-unidos/
https://www.infobae.com/america/
https://www.infobae.com/venezuela/
https://www.infobae.com/economist/
https://www.infobae.com/wapo/
https://www.infobae.com/america/realeza/
https://www.infobae.com/america/opinion/
https://www.infobae.com/ultimas-noticias-america/
https://www.infobae.com/entretenimiento/
https://www.infobae.com/deportes/
https://www.infobae.com/tendencias/
https
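
The loop above prints each `href` exactly as it appears in the markup, so relative paths, duplicates, and non-web schemes may show up on other pages. A minimal sketch of cleaning such a list with the standard library, assuming the hrefs have already been collected; the `normalize_links` helper and the sample values are illustrative and not part of the exercise:

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep http(s) links, drop duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        # skip mailto:, javascript: and other non-web schemes
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# hypothetical hrefs, shaped like the values returned by link.get('href')
sample_hrefs = ["/america/opinion/", "https://www.infobae.com/deportes/",
                "/america/opinion/", "mailto:someone@example.com"]
links = normalize_links("https://www.infobae.com/america/", sample_hrefs)
print(links)
```

In the notebook, the list passed in would be built from `link.get('href')` for each link that has an `href`.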

# 🌟 Exercise 2 : Scraping robots.txt from Wikipedia
Instructions

Write a Python program to download and display the content of robots.txt for Wikipedia.

In [3]:
# send a GET request to the webpage
url = "https://en.wikipedia.org/robots.txt"
response = requests.get(url)

# check if the request was successful
if response.status_code == 200:
    print("robots.txt content for Wikipedia:")
    print('=' * 40)
    print(response.text)
else:
    print(f"Failed to retrieve robots.txt. Status code: {response.status_code}")

robots.txt content for Wikipedia:
# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internet
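
Printing the file shows the rules, but a scraper usually needs to ask whether a given user agent may fetch a given URL. A short sketch using the standard library's `urllib.robotparser`, fed with a few rules copied from the output above rather than the live file (the real file is much longer, so results for other agents may differ):

```python
from urllib.robotparser import RobotFileParser

# a few rules copied from the output above; the full file is much longer
robots_txt = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:

User-agent: UbiCrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://en.wikipedia.org/wiki/Main_Page"
for agent in ("MJ12bot", "IsraBot", "UbiCrawler"):
    allowed = parser.can_fetch(agent, page)
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

An empty `Disallow:` line means nothing is disallowed for that agent, while `Disallow: /` blocks the whole site.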

# 🌟 Exercise 3 : Extracting Headers from Wikipedia’s Main Page
Instructions

Write a Python program to extract and display all the header tags (`<h1>` to `<h6>`) from Wikipedia's main page.

In [4]:
# send a GET request to the webpage
url = "https://en.wikipedia.org/wiki/Main_Page"
response = requests.get(url)

# create a beautifulsoup object to parse the html content
soup = BeautifulSoup(response.content, 'html.parser')

# collect all header elements (h1 through h6)
headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# print each header's tag name and text
print("Header tags from Wikipedia:")
for header in headers:
    print(f"{header.name}: {header.text.strip()}")

Header tags from Wikipedia:
h1: Main Page
h1: Welcome to Wikipedia
h2: From today's featured article
h2: Did you know ...
h2: In the news
h2: On this day
h2: Today's featured picture
h2: Other areas of Wikipedia
h2: Wikipedia's sister projects
h2: Wikipedia languages


# 🌟 Exercise 4 : Checking for Page Title
Instructions

Write a Python program to check whether a page contains a title or not.

In [5]:
# url to check
url = "https://9gag.com/"

try:
  # send a GET request to the webpage
  response = requests.get(url)
  response.raise_for_status() # raise an exception for a bad status code

  # create a BeautifulSoup object to parse the HTML content
  soup = BeautifulSoup(response.content, 'html.parser')

  # find the title tag
  title_tag = soup.title

  # check if a title tag was found
  if title_tag:
    # get_text() also works when .string is None (e.g. an empty <title>)
    title = title_tag.get_text(strip=True)
    print(f"The page '{url}' contains a title: {title}")
  else:
    print(f"The page '{url}' does not contain a title.")

except requests.exceptions.RequestException as e:
  print(f"An error occurred while accessing the page: {e}")

The page 'https://9gag.com/' contains a title: 9GAG - Best Funny Memes and Breaking News


# 🌟 Exercise 5 : Analyzing US-CERT Security Alerts
Instructions

Write a Python program to get the number of security alerts issued by US-CERT in the current year.


[Source](https://www.cisa.gov/news-events/cybersecurity-advisories?f%5B0%5D=advisory_type%3A93)

In [6]:
url = "https://www.cisa.gov/news-events/cybersecurity-advisories?f%5B0%5D=advisory_type%3A93"


response = requests.get(url)
html_content = response.text


soup = BeautifulSoup(html_content, 'html.parser')

#extract the current year
current_year = datetime.now().year  # Get the current year

#find all elements that contain security alert dates
alert_dates = []
for date_tag in soup.find_all('time'):
    date_text = date_tag.get('datetime') # extract the date as text
    if date_text and date_text.startswith(str(current_year)):  # check if the current year is in the date
        alert_dates.append(date_text)  # add the date to our list if it matches the current year

# count the security alerts from the current year
# (only those on the fetched page; the listing may span several pages)
alert_count = len(alert_dates)

#display the result
print(f"number of security alerts issued by us-cert in {current_year}: {alert_count}")

number of security alerts issued by us-cert in 2024: 10
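
The filter above relies on each `datetime` attribute starting with a four-digit year. A small sketch of the same idea that tallies every year at once, assuming ISO-style values such as `2024-12-12T12:00:00Z`; the `count_by_year` helper is hypothetical and the sample strings are made up for illustration, not scraped data:

```python
from collections import Counter


def count_by_year(datetime_values):
    """Tally ISO-style date strings by their leading four-digit year."""
    counts = Counter()
    for value in datetime_values:
        year = (value or "")[:4]
        # ignore missing or malformed values
        if len(year) == 4 and year.isdigit():
            counts[int(year)] += 1
    return counts


# made-up values, shaped like date_tag.get('datetime') results
sample = ["2024-12-12T12:00:00Z", "2024-11-05T12:00:00Z",
          "2023-12-19T12:00:00Z", None, ""]
counts = count_by_year(sample)
print(counts[2024])
```

Seeing the tally per year makes it easier to notice when a page of results mixes years or when only part of the listing has been fetched.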


# 🌟 Exercise 6 : Scraping Movie Details
Instructions

Write a Python program to get the movie name, year, and a brief summary of the top 10 random movies from [this IMDb list](https://www.imdb.com/list/ls091294718/).

In [7]:
# note: this cell scrapes hockey team stats from scrapethissite.com rather than the IMDb list
url = 'https://www.scrapethissite.com/pages/forms/'

try:
  # fetch the webpage content
  response = requests.get(url)
  response.raise_for_status() # raise an exception for a bad status code

  # create a BeautifulSoup object to parse the HTML content
  soup = BeautifulSoup(response.text, 'html.parser')

  teams = []
  for team in soup.find_all('tr', class_='team'):
    name = team.find('td', class_='name').text.strip()
    year = team.find('td', class_='year').text.strip()
    wins = team.find('td', class_='wins').text.strip()
    ot_losses = team.find('td', class_='ot-losses').text.strip()
    win_pct = team.find('td', class_='pct').text.strip()
    goals_for = team.find('td', class_='gf').text.strip()
    goals_against = team.find('td', class_='ga').text.strip()
    goal_diff = team.find('td', class_='diff').text.strip()

    teams.append({
        'name': name,
        'year': year,
        'wins': wins,
        'ot_losses': ot_losses,
        'win_pct': win_pct,
        'goals_for': goals_for,
        'goals_against': goals_against,
        'goal_diff': goal_diff
    })

  df = pd.DataFrame(teams)
  print(df.head(10))

except requests.RequestException as e:
    print(f"An error occurred while fetching the page: {e}")

                    name  year wins ot_losses win_pct goals_for goals_against  \
0          Boston Bruins  1990   44              0.55       299           264   
1         Buffalo Sabres  1990   31             0.388       292           278   
2         Calgary Flames  1990   46             0.575       344           263   
3     Chicago Blackhawks  1990   49             0.613       284           211   
4      Detroit Red Wings  1990   34             0.425       273           298   
5        Edmonton Oilers  1990   37             0.463       272           272   
6       Hartford Whalers  1990   31             0.388       238           276   
7      Los Angeles Kings  1990   46             0.575       340           254   
8  Minnesota North Stars  1990   27             0.338       256           266   
9     Montreal Canadiens  1990   39             0.487       273           249   

  goal_diff  
0        35  
1        14  
2        81  
3        73  
4       -25  
5         0  
6       -3
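
Every value in the table above is still a string, and blank cells such as `ot_losses` are empty strings, so sorting or arithmetic would not behave numerically. A minimal sketch of converting one scraped row, assuming cells hold either a number or nothing; the `to_number` helper is hypothetical and the row reuses values from the printed table:

```python
def to_number(text):
    """Convert scraped cell text to int or float; return None for blank cells."""
    text = text.strip()
    if not text:
        return None
    try:
        return int(text)
    except ValueError:
        # raises ValueError again if the text is not numeric at all
        return float(text)


# one row shaped like the dictionaries built above
row = {"name": "Boston Bruins", "year": "1990", "wins": "44",
       "ot_losses": "", "win_pct": "0.55", "goal_diff": "35"}
numeric = {key: (value if key == "name" else to_number(value))
           for key, value in row.items()}
print(numeric)
```

Applying the conversion before building the DataFrame would give numeric columns instead of text columns.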