# 11. The `BeautifulSoup` Library: Parsing Digital Artifacts (HTML)

Often, the data we need to explore is located on the web, structured within HTML or XML documents. `BeautifulSoup` is a powerful Python library that allows us to parse these static documents, navigate their structure, and extract the valuable information within.

- `BeautifulSoup` is ideal for parsing static HTML and XML files (pages whose content is already present in the HTML the server returns, with no JavaScript needed to load it).
- For interacting with dynamic websites that rely heavily on JavaScript to load content, a browser automation tool like `Selenium` is often a better choice.
- **Installation** (in your active virtual environment): `pip install beautifulsoup4`
- **Documentation:** https://beautiful-soup-4.readthedocs.io/en/latest/

In [None]:
# Import requests (to download the page) and BeautifulSoup (to parse it)
import requests
from bs4 import BeautifulSoup

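# The URL of the page we want to scrape.
# Note: example.com is only a placeholder. The tag and class names used below
# (<h3><a> titles, 'product_pod') assume a catalogue-style page, so point this at the page you actually want to scrape.
url = "http://example.com/"
response = requests.get(url, timeout=10) # Give up after 10 seconds instead of waiting indefinitely

# First, confirm our request was successful (HTTP status code 200)
if response.status_code != 200:
    # Stop here: everything below needs `soup`, which would otherwise be undefined
    raise RuntimeError(f"Failed to download the webpage. Status code: {response.status_code}")

# Create a 'soup' object from the page's HTML text.
# 'html.parser' is the parser built into Python; it turns the raw HTML into a navigable tree.
soup = BeautifulSoup(response.text, 'html.parser')

# Example: Extract the text of all product titles (assumed to be inside <h3><a> tags on the target page)
book_title_tags = soup.find_all('h3') # Find all <h3> tags

for title_tag in book_title_tags:
    # Skip any <h3> that has no <a> child; title_tag.a would be None there
    if title_tag.a is not None:
        print(title_tag.a.text) # .text extracts the text inside the <a> tag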


# --- Finding individual tags ---
# Accessing a tag by name gets its *first* occurrence on the page, or None if the page has no such tag.
first_title_tag = soup.title # The complete <title>...</title> tag
title_name = soup.title.name # The name of the tag -> 'title'
title_text = soup.title.string # The text content inside the tag

first_h3_tag = soup.h3 # The first <h3> tag
first_link = soup.a # The first <a> tag


# --- Getting tag attributes and text ---
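# .get() returns the value of a tag attribute, or None if the attribute is missing.
link_url = first_link.get('href') # e.g. 'index.html'
# .get_text() returns all text inside the tag, the same as .text.
# .string differs: it is None when the tag has more than one child.
link_text = first_link.get_text() # e.g. 'Home'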


# --- Using find_all() ---
# .find_all() returns a list-like ResultSet of all matching tags, here every <a> tag.
all_links = soup.find_all('a')

for link in all_links: # We can iterate through the results
    print(f"Link Text: {link.get_text().strip()}, URL: {link.get('href')}")


# --- Using find() with filters ---
# .find() is like .find_all() but returns only the first match (or None if nothing matches).
# We can also filter by attributes like 'class'. Note: 'class' is a Python keyword,
# so we use 'class_' with a trailing underscore.
article_pod = soup.find('div', class_='product_pod') # Example find
print(article_pod)


# --- Using CSS Selectors with select() ---
# .select() and .select_one() find elements using CSS selector syntax.
first_p_link = soup.select_one("p a") # The first <a> tag inside a <p> tag, or None if there is no match
p_links = soup.select("p a") # All <a> tags inside <p> tags
product_pods = soup.select(".product_pod") # All tags with class="product_pod"
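
The calls above can also be tried without any network access. The following is a minimal sketch that parses a small inline HTML string. The markup, the `SAMPLE_HTML` name, and the page content are invented for illustration; they only mimic the catalogue-style structure that the examples above assume.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch (not taken from any real site).
SAMPLE_HTML = """
<html>
  <head><title>Sample Shop</title></head>
  <body>
    <p><a href="index.html">Home</a></p>
    <div class="product_pod"><h3><a href="book1.html">First Book</a></h3></div>
    <div class="product_pod"><h3><a href="book2.html">Second Book</a></h3></div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string                    # text of the <title> tag
first_href = soup.a.get("href")                   # href of the first <a> on the page
book_titles = [h3.a.get_text(strip=True) for h3 in soup.find_all("h3")]
pod_count = len(soup.select(".product_pod"))      # number of tags with that class
first_p_link = soup.select_one("p a").get_text()  # first <a> inside a <p>
missing = soup.find("table")                      # no <table> here, so this is None

print(page_title, first_href, book_titles, pod_count, first_p_link, missing)
```

Working from a fixed string like this makes it easier to check what each call returns before pointing the same code at a live page.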

## 11.1. The `robots.txt` Protocol: Rules of Engagement
- The `robots.txt` file is a standard text file located in the root directory of a website (e.g., `https://www.google.com/robots.txt`).
- It provides rules and directives for automated programs (`bots` or `spiders`) that visit the site. It outlines which parts of the site the owner does *not* want bots to access.
- It's like the "rules of engagement" or a set of access permissions left by the site's creators for automated reconnaissance bots.

- Common directives:
    - `User-agent: *` = The rules that follow apply to all bots.
    - `User-agent: Googlebot` = The rules that follow apply only to a specific bot, here Google's main crawler.
    - `Disallow:` (with no value) = Bots are allowed everywhere.
    - `Disallow: /` = Bots are not allowed anywhere on the site.
    - `Disallow: /search` = Bots should not access any URL whose path starts with `/search` (this covers `/search`, `/search/results`, and `/search?q=python`).
    - `Allow: /search/about` = An exception, allowing access to a specific sub-page even if its parent is disallowed.
    - `Disallow: /index.html?` = Disallows access to any URL starting with `/index.html?`, that is, `/index.html` followed by a query string (e.g., `/index.html?id=123`). The `?` denotes the start of a query string.

- **Note:** Respecting `robots.txt` is a matter of etiquette and good practice. It is **not technically enforceable**; a malicious bot can simply ignore it. Responsible programmers and systems, however, adhere to these rules.
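
A script can check these rules before it fetches anything. The sketch below uses Python's standard-library `urllib.robotparser` on a made-up rule set; the `ROBOTS_TXT` contents, the `is_allowed` helper, and the `MyScraperBot` user-agent name are all assumptions for illustration. The standard-library parser appears to apply the first matching rule in file order, so the `Allow` exception is listed before the broader `Disallow` here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Allow: /search/about
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines


def is_allowed(path, user_agent="MyScraperBot"):
    """Return True if the sample rules let `user_agent` fetch `path`."""
    return parser.can_fetch(user_agent, "https://example.com" + path)


print(is_allowed("/"))                # no rule matches, so access is allowed
print(is_allowed("/search/results"))  # matched by "Disallow: /search"
print(is_allowed("/search/about"))    # matched first by the Allow exception
```

For a live site, the same parser can load the real file with `parser.set_url(...)` followed by `parser.read()` instead of `parse()`.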

## Practice

**Scenario:** You are a data operative tasked with extracting specific intel from publicly available web sources (static web pages).

**1. Basic Reconnaissance:**
- Go to a simple, static web page. 
- Inspect its structure in your browser using DevTools (`F12`) to identify the HTML tags that contain the data you want.
- Using `requests` and `BeautifulSoup` in a Python script:
    - **a)** Extract all visible text from the main body of the page and print it to the console.
    - **b)** Extract all links, headlines, or other elements of your choice from the page and print them as a list.

---

**2. Challenge I: Modular Scraping Tool**
- Refactor your code from the previous exercise into one or more functions.
- Create a main function that accepts a `url` as a parameter to make your scraper reusable for different targets.
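
One possible shape for such a tool, offered as a starting point and not as the expected solution, is to separate downloading from parsing so the parsing step can be tested on a plain string. The function names `parse_page` and `scrape` and the returned `(text, links)` pair are assumptions made for this sketch.

```python
import requests
from bs4 import BeautifulSoup


def parse_page(html):
    """Return the body text and a list of (link text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    # Fall back to the whole document if the page has no <body> tag
    body = soup.body if soup.body is not None else soup
    text = body.get_text(separator="\n", strip=True)
    links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
    return text, links


def scrape(url):
    """Download `url` and hand its HTML to parse_page()."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return parse_page(response.text)
```

A call such as `text, links = scrape("http://example.com/")` would then work for any static page, while `parse_page()` can be exercised offline with a hand-written HTML string.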

---

**3. Challenge II: Data Archiving**
- Modify your function(s) to save the extracted results to a local file.
- If the file doesn't exist, it should be created. If it does exist, the new data should be appended.

---

**4. Challenge III: Timestamped Logging**
- Enhance your function(s) from the previous challenge.
- Each time you run a scrape on a URL, the new results should be appended to the existing log file.
- Before writing the new results for a scrape session, your script should first write a header line with the **current date and time** to timestamp when that specific data was gathered. This allows your log file to store a history of multiple reconnaissance runs over time.
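
The append-and-timestamp behaviour described in Challenges II and III can be sketched with the standard library alone. The function name `append_results`, the header layout, and the timestamp format below are assumptions chosen for illustration, not a required design.

```python
from datetime import datetime


def append_results(log_path, url, lines):
    """Append one timestamped scrape session to the file at `log_path`."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Mode "a" creates the file if it is missing and appends to it otherwise
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(f"=== {timestamp} | {url} ===\n")  # session header
        for line in lines:
            log_file.write(f"{line}\n")
        log_file.write("\n")  # blank line between sessions
```

Calling `append_results("scrape.log", url, results)` after each run would build up a history in which every block starts with the date and time it was gathered.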

---
#### © Jiří Svoboda (George Freedom)
- Web: https://GeorgeFreedom.com
- LinkedIn: https://www.linkedin.com/in/georgefreedom/
- Book me: https://cal.com/georgefreedom