# 🕸️ Web Scraping with BeautifulSoup

## Objective
- Scrape a web page using Python and BeautifulSoup.
- Extract the page title, H2 headings, and all hyperlinks.
- Filter links starting with "https" and save them to a text file.

## Step 1: Import Libraries

In [None]:
!pip install beautifulsoup4 requests

In [7]:
from bs4 import BeautifulSoup
import requests

## Step 2: Fetch HTML Content
Use `requests` to download the HTML of the Python homepage, then parse it with BeautifulSoup's built-in `html.parser`.

In [8]:
url = "https://www.python.org"
headers = {"User-Agent": "Mozilla/5.0"}

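r = requests.get(url, headers=headers, timeout=10)  # do not wait forever on a slow server
r.raise_for_status()  # stop here on a 4xx/5xx response instead of parsing an error page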

soup = BeautifulSoup(r.text, "html.parser")
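
As a rough illustration of what the parsed `soup` object exposes, the sketch below parses a small hand-written HTML string instead of the live page, so it runs without network access. The sample markup and the names `sample_html`, `sample_soup`, `page_title`, `heading_texts`, and `hrefs` are made up for this example. Only the BeautifulSoup calls mirror the ones used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from python.org
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h2>First Section</h2>
    <h2>Second Section</h2>
    <a href="https://example.com/docs">Docs</a>
    <a href="/about">About</a>
    <a>No href here</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> tag
page_title = sample_soup.title.string

# Visible text of every <h2>, with surrounding whitespace stripped
heading_texts = [h.get_text(strip=True) for h in sample_soup.find_all("h2")]

# href of every <a> that actually has one (the third anchor is skipped)
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(page_title)
print(heading_texts)
print(hrefs)
```

Passing `href=True` to `find_all` is what keeps anchors without an `href` attribute out of the result, so indexing with `a["href"]` cannot raise a `KeyError`.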

## Step 3: Extract Title, H2 Headings, and Hyperlinks


In [10]:
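title = soup.title.string   # text inside the <title> tag
h2 = soup.find_all("h2")    # every <h2> heading tag
a = soup.find_all("a")      # every <a> tag, with or without an href

# Collect the href value of each anchor that has one
hyp = [link["href"] for link in soup.find_all("a", href=True)]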

# print(hyp)

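# Keep only links that start with "https".
# A substring test ("https" in link) would also match URLs that merely contain it.
text = [link for link in hyp if link.startswith("https")]

# Join the filtered links, one per line
text_str = "".join(link + "\n" for link in text)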

print(text_str)

https://www.python.org/psf/
https://docs.python.org
https://pypi.org/
https://psfmember.org/civicrm/contribute/transact?reset=1&id=2
https://www.linkedin.com/company/python-software-foundation/
https://fosstodon.org/@ThePSF
https://twitter.com/ThePSF
https://docs.python.org/3/license.html
https://wiki.python.org/moin/BeginnersGuide
https://devguide.python.org/
https://docs.python.org/faq/
https://peps.python.org
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://docs.python.org
https://blog.python.org
https://pythoninsider.blogspot.com/2025/09/python-3140rc3-is-go.html
https://pyfound.blogspot.com/2025/09/announcing-2025-psf-board-election.html
https://pyfound.blogspot.com/2025/09/sprints-are-best-part-of-conference.html
https://pyfound.blogspot.com/2025/09/the-2025-psf-board-election-is-open.html
https://pyfound.blogspot.com/2025/08/pypistats-org-is-now-operated-by-the-psf.html
https://wi
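
Many `href` values on a page are relative (for example `/about/`), and a prefix filter on `"https"` drops them. One possible extension, not part of the original exercise, is to resolve each link against the page URL with the standard library's `urllib.parse.urljoin`, check the scheme with `urlparse`, and skip repeats such as the duplicated `https://docs.python.org` entry in the output. The helper name `absolute_https_links` and the sample links are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def absolute_https_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique https links in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative paths become full URLs
        if urlparse(absolute).scheme != "https":
            continue  # skips http:, mailto: and similar schemes
        if absolute not in seen:  # drop repeats, keep the first occurrence
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical sample input for illustration
sample_hrefs = [
    "/about/",
    "https://docs.python.org",
    "http://example.com",
    "https://docs.python.org",
    "mailto:someone@example.com",
]
links = absolute_https_links("https://www.python.org", sample_hrefs)
print("\n".join(links))
```

In the notebook, the same helper could be called with `url` and `hyp` in place of the sample values.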

## Step 4: Save Filtered Links to File


In [None]:
directory = r"C:\Repos\IBM-Data-Engineering-Journal\exercises\Python for Data Science, AI, & Devvelopment\Mod_5_Web Scraping with BeautifulSoup [BeautifulSoup, open]"

with open(fr"{directory}\links.txt", "w", encoding="utf-8") as file:
    file.write(text_str)
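
The absolute Windows path above only exists on the author's machine. A more portable sketch, assuming any writable folder is acceptable, builds the path with `pathlib` and creates missing folders first. The function name `save_links` and the sample data are hypothetical, and a temporary directory is used here only so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def save_links(links, out_path):
    """Write one link per line and return the number of links written."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    out_path.write_text("".join(link + "\n" for link in links), encoding="utf-8")
    return len(links)


# Hypothetical sample data; in the notebook this would be the filtered list `text`
sample_links = ["https://www.python.org/psf/", "https://docs.python.org"]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "output" / "links.txt"
    count = save_links(sample_links, target)
    # Read the file back to confirm what was written
    saved = target.read_text(encoding="utf-8").splitlines()

print(count, saved)
```

Replacing `target` with something like `Path("links.txt")` would write the file next to the notebook instead.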


## ✅ Conclusion
- Successfully extracted the title, H2 headings, and hyperlinks.
- Filtered links starting with "https".
- Saved filtered links to `links.txt` for reference.
