---

# Introduction to Web Scraping with BeautifulSoup

Welcome to this notebook!

Here, we'll explore **web scraping** with the **BeautifulSoup** library in Python. Web scraping extracts data from websites, turning the HTML of a page into structured, usable information.

## What is BeautifulSoup?

**BeautifulSoup** is a Python library that makes parsing HTML and XML documents easy. It builds a "parse tree" from the web page, allowing you to navigate, search, and modify the content in a simple and intuitive way. Think of it as having a detailed map of a website's structure!
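
To make the parse tree idea concrete, here is a minimal sketch that parses a small hand-written HTML string rather than a live page. The HTML snippet and the names `demo_html` and `demo_soup` are assumptions made up for illustration; only the BeautifulSoup calls themselves come from the library.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, for illustration only).
demo_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First <a href="/about">link</a>.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# Navigate: attribute access returns the first tag with that name.
title_text = demo_soup.title.string

# Search: find() returns the first tag matching the name and attributes.
heading = demo_soup.find("h1", id="main")
heading_before = heading.get_text()

# Read an attribute with dictionary-style access.
first_href = demo_soup.a["href"]

# Modify: replace the text inside the heading.
heading.string = "Hello, BeautifulSoup"
heading_after = heading.get_text()

print(title_text)
print(heading_before)
print(first_href)
print(heading_after)
```

Working on an inline string like this is a convenient way to try out navigation and search calls before pointing them at a real page.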

---

In [2]:
from bs4 import BeautifulSoup
import requests

In [None]:
# Make an HTTP GET request to the Python.org homepage.
# 'requests.get()' downloads the content of the specified URL; the timeout
# (in seconds) keeps the call from hanging if the server never responds.
pages = requests.get('https://www.python.org/', timeout=10)

# Raise an exception for 4xx/5xx responses instead of parsing an error page.
pages.raise_for_status()

# Parse the HTML with BeautifulSoup.
# 'pages.content' is the raw bytes of the response body.
# 'html.parser' selects Python's built-in HTML parser.
# 'soup' now holds the parsed document, ready to navigate and search.
soup = BeautifulSoup(pages.content, 'html.parser')

In [None]:
# Requires 'soup' from the previous cell: soup = BeautifulSoup(pages.content, 'html.parser')

# find_all() returns every occurrence of a given tag in the parsed document.
# 'a' is the HTML tag for hyperlinks (anchor tags).
# The result, 'links', is a ResultSet (a list subclass) of Tag objects,
# one for each <a> tag found in the HTML.
links = soup.find_all('a')
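
`find_all()` also accepts attribute filters, which can narrow the search before any looping happens. The sketch below runs on a hand-written snippet; the HTML and the names `demo_html`, `demo_soup`, `all_anchors`, and `anchors_with_href` are assumptions for illustration and are not part of the Python.org page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two anchors carry an href, one does not.
demo_html = (
    '<p><a href="/about">About</a> '
    '<a name="top">Anchor</a> '
    '<a href="https://pypi.org/">PyPI</a></p>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Every <a> tag, with or without an href attribute.
all_anchors = demo_soup.find_all("a")

# href=True keeps only the tags that actually have an href attribute.
anchors_with_href = demo_soup.find_all("a", href=True)
hrefs = [a["href"] for a in anchors_with_href]

print(len(all_anchors))
print(hrefs)
```

Filtering with `href=True` means later code can index `a["href"]` directly without checking for `None` first.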

In [6]:
# Requires 'links' from the previous cell: links = soup.find_all('a')

# Iterate over each <a> tag in 'links'.
for link in links:
    # link.get('href') returns the value of the 'href' attribute (the URL the
    # hyperlink points to), or None if the tag has no 'href'. Unlike
    # link['href'], it does not raise a KeyError for a missing attribute.
    href = link.get('href')

    # Keep only absolute URLs, i.e. those starting with 'http://' or 'https://'.
    # startswith() is stricter than "'http' in href", which would also match
    # relative links that merely contain 'http' somewhere in the string.
    if href and href.startswith(('http://', 'https://')):
        print(href)
    else:
        # The tag has no 'href', or the 'href' is relative (for example '/about/'),
        # a page fragment ('#content'), or uses another scheme such as 'mailto:'.
        print("No valid URL found")

No valid URL found
No valid URL found
No valid URL found
https://www.python.org/psf/
https://docs.python.org
https://pypi.org/
No valid URL found
No valid URL found
No valid URL found
No valid URL found
https://psfmember.org/civicrm/contribute/transact?reset=1&id=2
No valid URL found
No valid URL found
No valid URL found
No valid URL found
No valid URL found
No valid URL found
https://www.linkedin.com/company/python-software-foundation/
https://fosstodon.org/@ThePSF
No valid URL found
https://twitter.com/ThePSF
No valid URL found
No valid URL found
No valid URL found
No valid URL found
No valid URL found
http://brochure.getpython.info/
No valid URL found
No valid URL found
No valid URL found
No valid URL found
No valid URL found
No valid URL found
https://docs.python.org/3/license.html
No valid URL found
No valid URL found
No valid URL found
No valid URL found
https://wiki.python.org/moin/BeginnersGuide
https://devguide.python.org/
https://docs.python.org/faq/
http://wiki.python.org/mo
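
A substring test such as `'http' in href` also matches relative links that merely embed a URL in a query string. The sketch below compares that test with a stricter scheme-based check built on the standard library's `urllib.parse.urlparse`. The helper `is_absolute_http_url` and the sample `href` values are hypothetical, chosen only to show where the two checks disagree.

```python
from urllib.parse import urlparse


def is_absolute_http_url(href):
    """Return True only if href is an absolute http:// or https:// URL."""
    if not href:
        return False
    parsed = urlparse(href)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Hypothetical href values (not taken from the Python.org page).
sample_hrefs = [
    "https://docs.python.org",
    "/downloads/",
    "#content",
    "/redirect?to=http://example.com",
    "javascript:;",
    None,
]

for href in sample_hrefs:
    substring_match = bool(href) and "http" in href
    print(repr(href), substring_match, is_absolute_http_url(href))
```

Only the first sample passes the scheme-based check, while the redirect-style link passes the substring test even though it is a relative URL.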

In [9]:
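url = 'https://www.python.org/'
# Make an HTTP GET request to the specified URL.
# '.text' returns the response body decoded to a string
# (whereas '.content' returns the raw bytes).
data = requests.get(url, timeout=10).text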

In [10]:
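# Find every <img> tag and print the tag itself, then its 'src' attribute.
# 'src' may be a relative path, as in the output below.
for img in soup.find_all('img'):
    print(img)
    print(img.get('src'))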

<img alt="python™" class="python-logo" src="/static/img/python-logo.png"/>
/static/img/python-logo.png
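
The `src` printed above is a relative path, so it cannot be downloaded as-is. One possible way to turn it into a full URL is `urllib.parse.urljoin` from the standard library, sketched below. The variable names `base_url`, `relative_src`, and `absolute_src` are assumptions for illustration; the base URL is the page that was requested.

```python
from urllib.parse import urljoin

# The page the image path was scraped from.
base_url = "https://www.python.org/"

# The relative 'src' value printed in the output above.
relative_src = "/static/img/python-logo.png"

# urljoin resolves the relative path against the base URL.
absolute_src = urljoin(base_url, relative_src)
print(absolute_src)  # https://www.python.org/static/img/python-logo.png

# A value that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://pypi.org/")
print(already_absolute)  # https://pypi.org/
```

The same call works for relative `href` values from `<a>` tags, so relative links do not have to be discarded.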
