# Chapter 12: Interacting With the Web 🌐

Web scraping is the process of collecting and parsing raw data from the Web. Many disciplines, such as data science, business intelligence, and investigative reporting, benefit from collecting and analyzing data from websites.

## What You'll Learn 📚
- Scrape and parse text from websites
- Use regular expressions for pattern matching
- Work with HTML parsers like Beautiful Soup
- Interact with web forms using MechanicalSoup
- Handle real-time website interactions

## Important Note ⚠️
Always check a website's Terms of Service before scraping. Some websites explicitly forbid automated data collection. Be respectful with request frequency to avoid overwhelming servers.

## 12.1 Scrape and Parse Text From Websites

### Basic Web Scraping with urllib

In [None]:
# Uncomment the code below and run it
#from urllib.request import urlopen

# Open a web page
#url = "http://olympus.realpython.org/profiles/aphrodite"
#page = urlopen(url)

# Extract HTML content
#html_bytes = page.read()
#html = html_bytes.decode("utf-8")

#print(html)
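
`page.read()` returns raw bytes, so the text has to be decoded with the right character encoding. The cell above hard-codes UTF-8. As a minimal sketch, the hypothetical helper `decode_html` below (not part of `urllib`) accepts an optional charset, such as the value returned by `page.headers.get_content_charset()`, and falls back to UTF-8 when none is given. That fallback is an assumption of this sketch. The example works on in-memory bytes, so it needs no network access.

```python
def decode_html(raw_bytes, charset=None):
    """Decode raw response bytes into text.

    `charset` is the encoding declared by the server, if any.
    Falling back to UTF-8 is an assumption made for this sketch.
    """
    encoding = charset or "utf-8"
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw_bytes.decode(encoding, errors="replace")


# Made-up sample data standing in for page.read()
sample = "<title>Profile: Aphrodite</title>".encode("utf-8")
print(decode_html(sample))

latin = "café".encode("latin-1")
print(decode_html(latin, charset="latin-1"))
```

With a real response, the call would look like `decode_html(page.read(), page.headers.get_content_charset())`.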


### Extracting Text with String Methods

In [None]:
# Extract title using string methods (requires `html` from the previous cell)
title_index = html.find("<title>")
start_index = title_index + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]

print("Extracted title:", title)
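
String methods only work when the markup matches the search string exactly. The sketch below uses two made-up HTML snippets (not real pages). The second has an extra space inside the opening tag, so `find("<title>")` returns `-1`. The helper name `extract_title` is hypothetical; it makes that failure explicit instead of silently slicing the wrong text.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    start_tag = "<title>"
    title_index = html.find(start_tag)
    end_index = html.find("</title>")
    # find() returns -1 when the substring is absent
    if title_index == -1 or end_index == -1:
        return None
    return html[title_index + len(start_tag):end_index]


clean_html = "<html><head><title>Profile: Aphrodite</title></head></html>"
messy_html = "<html><head><title >Profile: Poseidon</title></head></html>"

print(extract_title(clean_html))  # Profile: Aphrodite
print(extract_title(messy_html))  # None, because "<title >" is not "<title>"
```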

## 12.2 Regular Expressions Primer 🔍

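Regular expressions are patterns used to search for text within strings. In the patterns below, `*` matches zero or more repetitions of the preceding character, `.` matches any single character except a newline, and `*?` is the non-greedy form of `*`, which matches as little text as possible.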

In [None]:
import re

# Basic pattern matching
print("Pattern 'ab*c':")
print("ac:", re.findall("ab*c", "ac"))
print("abc:", re.findall("ab*c", "abc"))
print("abbbc:", re.findall("ab*c", "abbbc"))
print("abdc:", re.findall("ab*c", "abdc"))

# Case-insensitive matching
print("\nCase-insensitive 'ab*c':")
print("ABC:", re.findall("ab*c", "ABC", re.IGNORECASE))

# Using wildcards
print("\nPattern 'a.c':")
print("abc:", re.findall("a.c", "abc"))
print("abbc:", re.findall("a.c", "abbc"))
print("ac:", re.findall("a.c", "ac"))

# Greedy vs non-greedy matching
string = "Everything is <replaced> if it's in <tags>."
greedy = re.sub("<.*>", "ELEPHANTS", string)
nongreedy = re.sub("<.*?>", "ELEPHANTS", string)

print("\nGreedy substitution:", greedy)
print("Non-greedy substitution:", nongreedy)

Pattern 'ab*c':
ac: ['ac']
abc: ['abc']
abbbc: ['abbbc']
abdc: []

Case-insensitive 'ab*c':
ABC: ['ABC']

Pattern 'a.c':
abc: ['abc']
abbc: []
ac: []

Greedy substitution: Everything is ELEPHANTS.
Non-greedy substitution: Everything is ELEPHANTS if it's in ELEPHANTS.


### Extracting Text with Regular Expressions

In [None]:
from urllib.request import urlopen
import re

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

# Extract title using regex
pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)  # Remove HTML tags

print("Extracted title:", title)
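
The cell above calls `match_results.group()` directly, which raises `AttributeError` when the pattern is not found because `re.search` returns `None`. The sketch below shows one possible variation: a capturing group pulls out only the text between the tags, so no second `re.sub` pass is needed, and the result is checked before use. The HTML string is made up for illustration and needs no network access.

```python
import re

# Made-up HTML with inconsistent casing and spacing in the tags
html = "<head>\n<TITLE >Profile: Dionysus</title  >\n</head>"

# The parentheses capture only the text between the tags
pattern = r"<title.*?>(.*?)</title.*?>"
match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)

# re.search returns None when nothing matches, so check before using it
title = match.group(1).strip() if match else None
print("Extracted title:", title)
```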

## 12.3 Using HTML Parsers 🧰

### Beautiful Soup Basics

In [None]:
#!pip install beautifulsoup4
# Uncomment every line in this cell before running it
#from bs4 import BeautifulSoup
#from urllib.request import urlopen

#url = "http://olympus.realpython.org/profiles/dionysus"
#page = urlopen(url)
#html = page.read().decode("utf-8")
#soup = BeautifulSoup(html, "html.parser")

# Extract all text
#print("All text:\n", soup.get_text())

# Extract specific elements
#print("\nTitle:", soup.title)
#print("Title string:", soup.title.string)

# Find all images
#images = soup.find_all("img")
#print("\nImages:", images)

# Extract image sources
#for img in images:
#    print("Image source:", img["src"])
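
Beautiful Soup is a third-party package. For comparison, the sketch below collects image sources using only the standard library's `html.parser` module, which is the parser backend named in `BeautifulSoup(html, "html.parser")`. This is not Beautiful Soup's API. The class name `ImageSourceParser` and the markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImageSourceParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Made-up markup for illustration
html = '<body><img src="/static/dionysus.jpg"><IMG SRC="/static/grapes.png" /></body>'

parser = ImageSourceParser()
parser.feed(html)
print(parser.sources)
```

Beautiful Soup's `find_all("img")` does the same job in one line, which is the main reason to prefer it for anything beyond a quick script.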

## 12.4 Interacting With Forms 📝

### MechanicalSoup Installation and Setup

In [None]:
!pip install MechanicalSoup
import mechanicalsoup

# Create a browser object
browser = mechanicalsoup.Browser()

# Request a page
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
print("Status code:", login_page.status_code)
print("Login page HTML:\n", login_page.soup)

### Form Submission Example

In [None]:
# Select the form
login_html = login_page.soup
form = login_html.select("form")[0]

# Fill in credentials
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

# Submit form
profiles_page = browser.submit(form, login_page.url)

# Check if login was successful
print("Current URL:", profiles_page.url)
if profiles_page.url == "http://olympus.realpython.org/profiles":
    print("Login successful!")

    # Extract profile links (only meaningful after a successful login)
    links = profiles_page.soup.select("a")
    base_url = "http://olympus.realpython.org"
    print("\nAvailable profiles:")
    for link in links:
        address = base_url + link["href"]
        text = link.text
        print(f"{text}: {address}")
else:
    print("Login failed.")
    # Check for error message
    error = profiles_page.soup.find(string="Wrong username or password!")
    if error:
        print("Error:", error)
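
The loop above builds each address by concatenating `base_url` and `link["href"]`, which assumes every `href` starts with `/`. As a sketch of a more forgiving alternative, `urllib.parse.urljoin` from the standard library resolves root-relative, relative, and absolute links against a base URL. The `hrefs` list is hypothetical and stands in for the values a real page might contain.

```python
from urllib.parse import urljoin

base_url = "http://olympus.realpython.org"

# Hypothetical href values, similar to what link["href"] might return
hrefs = [
    "/profiles/aphrodite",                              # root-relative
    "profiles/poseidon",                                # relative, no leading slash
    "http://olympus.realpython.org/profiles/dionysus",  # already absolute
]

addresses = [urljoin(base_url, href) for href in hrefs]
for address in addresses:
    print(address)
```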

## 12.5 Real-Time Interaction ⏱️

### Monitoring a Changing Website

In [15]:
# Uncomment the code below and run it
#import time
#import mechanicalsoup

#browser = mechanicalsoup.Browser()

#for i in range(4):
    # Get the page
    #page = browser.get("http://olympus.realpython.org/dice")
    
    # Extract the result
    #tag = page.soup.select("#result")[0]
    #result = tag.text
    
    # Extract the time
    #time_tag = page.soup.select("p")[1]  # Assuming time is in the second paragraph
    #time_text = time_tag.text
    
    #print(f"Roll {i+1}: {result} at {time_text}")
    
    # Wait before next request (if not the last one)
    #if i < 3:
        #time.sleep(5)  # Wait 5 seconds between requests
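
The request-and-wait pattern above can be factored into a small reusable function. The sketch below is one way to do that, not part of MechanicalSoup: the hypothetical helper `poll` takes the fetching function and the sleep function as arguments, so it can be tried offline with stand-ins before pointing it at a real site.

```python
import time


def poll(fetch, times, delay, sleep=time.sleep):
    """Call fetch() `times` times, pausing `delay` seconds between calls."""
    results = []
    for i in range(times):
        results.append(fetch())
        # Skip the pause after the final request
        if i < times - 1:
            sleep(delay)
    return results


# Stand-ins for a real page request and a real pause, so the sketch runs instantly
fake_rolls = iter([4, 2, 6, 1])
pauses = []
rolls = poll(lambda: next(fake_rolls), times=4, delay=5, sleep=pauses.append)

print(rolls)   # [4, 2, 6, 1]
print(pauses)  # [5, 5, 5]
```

In real use, `fetch` would wrap the `browser.get(...)` call and the tag extraction from the cell above, and `sleep` would be left at its default.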

## Practice Exercises 💪

### Exercise 1: Basic Scraping
Scrape the profile page of Poseidon from http://olympus.realpython.org/profiles/poseidon and extract:
1. The page title
2. All text content
3. The favorite animal mentioned

In [None]:
# Your solution for Exercise 1

### Exercise 2: Form Handling
Create a script that:
1. Attempts to log in to http://olympus.realpython.org/login with incorrect credentials
2. Verifies that the login failed by checking for the error message
3. Then logs in with the correct credentials (zeus/ThunderDude)
4. Prints the title of the resulting page

In [None]:
# Your solution for Exercise 2

### Challenge Exercise: Data Aggregation
Write a program that:
1. Logs in to the Olympus site
2. Visits each profile page (Aphrodite, Poseidon, Dionysus)
3. Collects the following data from each profile:
   - Name
   - Hometown
   - Favorite animal
   - Favorite color
4. Stores the data in a dictionary
5. Prints the collected data in a formatted way

In [None]:
# Your solution for the Challenge Exercise

## Additional Resources 📚
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [MechanicalSoup Documentation](https://mechanicalsoup.readthedocs.io/)
- [Regular Expressions Guide](https://docs.python.org/3/howto/regex.html)
- [Web Scraping Best Practices](https://www.scrapehero.com/web-scraping-best-practices/)