# Week 1 (Optional): Web Scraping with Selenium

**Web and Social Network Analytics**

---

**Note**: This notebook is **optional** supplementary material. The main notebook (`Week1-notes.ipynb`) covers Playwright, which is recommended for JupyterHub environments. Use this notebook if:

- You're working on your **local machine**
- You need to learn Selenium for work/research (industry standard)
- You want to compare Selenium vs Playwright approaches

---

**Disclaimer**: This educational content is provided for instructional purposes only. Always respect website terms of service and legal requirements when scraping.

## When to Use Selenium vs Playwright

| Feature | Selenium | Playwright |
|---------|----------|------------|
| **Browser Support** | Chrome, Firefox, Safari, Edge | Chromium, Firefox, WebKit |
| **Setup Complexity** | Automatic driver download in Selenium 4.6+; manual matching driver before that | Auto-manages browsers |
| **Industry Adoption** | Very widely used | Growing rapidly |
| **Documentation** | Extensive, many tutorials | Modern, well-organized |
| **Virtual Environments** | Can be tricky | Works well |
| **Async Support** | No native async API in the Python bindings | Built-in (sync and async APIs) |

**Recommendation**: 
- Use **Playwright** for new projects and virtual environments
- Learn **Selenium** if you'll work with existing codebases

---

# Part 1: Setting Up Selenium

## Step 1: Install Selenium

In [None]:
# Uncomment to install
# !pip install selenium

## Step 2: Download ChromeDriver

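Selenium needs a **WebDriver** to control the browser. Selenium 4.6 and later include Selenium Manager, which normally downloads a matching driver automatically, so you can skip this step unless you are on an older Selenium version or the automatic download fails. To install the driver for Chrome manually: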

### 2.1 Find Your Chrome Version

1. Open Chrome
2. Go to `chrome://settings/help`
3. Note the version number (e.g., `131.0.6778.265`)

### 2.2 Download Matching ChromeDriver

**For Chrome version 115 or newer:**
- Go to [Chrome for Testing](https://googlechromelabs.github.io/chrome-for-testing/)
- Find the "chromedriver" row matching your Chrome version
- Download for your OS (win64, mac-x64, mac-arm64, linux64)

**For Chrome version 114 or older:**
- Go to [ChromeDriver Downloads](https://chromedriver.storage.googleapis.com/index.html)
- Find your version folder
- Download the appropriate zip file
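
The platform labels and the version check above can also be worked out from Python. This is a minimal sketch: `chrome_for_testing_platform` and `same_major_version` are hypothetical helper names, the mapping covers only the four labels listed above, and comparing major versions is a common rule of thumb rather than an official compatibility guarantee.

```python
import platform


def chrome_for_testing_platform(system=None, machine=None):
    """Return the download label for an OS/CPU pair (hypothetical helper)."""
    system = system or platform.system()  # 'Windows', 'Darwin' or 'Linux'
    machine = (machine or platform.machine()).lower()  # e.g. 'arm64', 'x86_64'

    if system == 'Windows':
        return 'win64'  # assumption: 64-bit Windows
    if system == 'Darwin':
        # Apple Silicon reports 'arm64'; Intel Macs report 'x86_64'
        return 'mac-arm64' if machine == 'arm64' else 'mac-x64'
    if system == 'Linux':
        return 'linux64'
    raise ValueError(f'Unsupported platform: {system} {machine}')


def same_major_version(chrome_version, driver_version):
    """Compare the first dotted component of two version strings."""
    return chrome_version.split('.')[0] == driver_version.split('.')[0]


print(chrome_for_testing_platform('Darwin', 'arm64'))
print(same_major_version('131.0.6778.265', '131.0.6778.204'))
```

Calling `chrome_for_testing_platform()` with no arguments uses the machine the notebook is running on.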

### 2.3 Install ChromeDriver

1. Unzip the downloaded file
2. Place `chromedriver` (Mac/Linux) or `chromedriver.exe` (Windows) in:
   - **Option A**: The same folder as this notebook
   - **Option B**: A folder in your system PATH
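
If you installed the driver manually, you can check where it ended up and hand that path to Selenium explicitly. The sketch below assumes the two locations described above. `find_chromedriver` and `chrome_with_driver` are hypothetical helper names; `Service(executable_path=...)` is the Selenium 4 way to pass a driver path.

```python
import shutil
import sys
from pathlib import Path

DRIVER_NAME = 'chromedriver.exe' if sys.platform.startswith('win') else 'chromedriver'


def find_chromedriver(folder='.'):
    """Return the driver path from `folder` (Option A) or PATH (Option B), else None."""
    local = Path(folder) / DRIVER_NAME
    if local.is_file():
        return str(local.resolve())
    return shutil.which(DRIVER_NAME)  # None when the driver is not on PATH


def chrome_with_driver(driver_path):
    """Start Chrome with an explicit driver path."""
    # Imported lazily so the lookup helper also works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))


path = find_chromedriver()
print(f'ChromeDriver found at: {path}' if path else 'ChromeDriver not found')
```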

## Step 3: Verify Setup

In [None]:
# Test if Selenium is installed
import selenium
print(f'Selenium version: {selenium.__version__}')

In [None]:
# Import required modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from bs4 import BeautifulSoup
import time
import sys

## Browser Initialization Function

This function creates a Chrome browser instance and works the same way on Windows, macOS, and Linux.

In [None]:
def get_browser(headless=False):
    """
    Create and return a Chrome browser instance.
    
    Args:
        headless: If True, runs browser without GUI (useful for servers)
    
    Returns:
        webdriver.Chrome instance
    """
    from selenium.webdriver.chrome.options import Options
    
    options = Options()
    if headless:
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')  # Required for some environments
        options.add_argument('--disable-dev-shm-usage')  # Overcome limited resource problems
    
    try:
        # Selenium 4.6+ finds or downloads a matching driver automatically
        browser = webdriver.Chrome(options=options)
    except Exception as e:
        print(f'Error starting browser: {e}')
        print('\nTroubleshooting:')
        print('1. Make sure Chrome is installed')
        print('2. For older Selenium, ensure chromedriver is in PATH or current directory')
        raise
    
    return browser

In [None]:
# Test browser creation
print('Starting browser...')
browser = get_browser(headless=True)  # Use headless=False to see the browser
browser.get('https://quotes.toscrape.com/')
print(f'Page title: {browser.title}')
browser.quit()
print('Browser closed successfully!')

## Common Setup Issues and Solutions

### Issue 1: "chromedriver" cannot be opened (Mac)

**Solution**: 
```bash
xattr -d com.apple.quarantine chromedriver
```
Or: System Settings > Privacy & Security > Allow Anyway (on older macOS versions: System Preferences > Security & Privacy)

### Issue 2: Version mismatch error

**Solution**: Download the ChromeDriver build that matches your installed Chrome version (see Step 2.2), or upgrade Selenium so that the driver is managed automatically.

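### Issue 3: "'str' object has no attribute 'capabilities'" error

**Solution**: This usually means a driver path string was passed as the first positional argument, as in `webdriver.Chrome('chromedriver')`, which recent Selenium 4 releases no longer accept. Call `webdriver.Chrome()` without a path and make sure Selenium is up to date: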
```bash
pip install --upgrade selenium
```

### Issue 4: Browser crashes immediately

**Solution**: Try headless mode or add these options:
```python
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
```

---

# Part 2: Selenium Fundamentals

## Core Concepts

### WebDriver
The main interface to control the browser. Think of it as a remote control.

### Locating Elements
Selenium provides multiple ways to find elements on a page.

## Finding Elements

| Method | Usage | Example |
|--------|-------|---------|
| By.ID | Unique element | `By.ID, 'submit-btn'` |
| By.CLASS_NAME | Elements with class | `By.CLASS_NAME, 'quote'` |
| By.TAG_NAME | HTML tag | `By.TAG_NAME, 'h1'` |
| By.CSS_SELECTOR | CSS selector | `By.CSS_SELECTOR, 'div.quote span.text'` |
| By.XPATH | XPath expression | `By.XPATH, '//div[@class="quote"]'` |
| By.LINK_TEXT | Exact link text | `By.LINK_TEXT, 'Next'` |
| By.PARTIAL_LINK_TEXT | Partial link text | `By.PARTIAL_LINK_TEXT, 'Nex'` |

In [None]:
# Example: Different ways to find elements
browser = get_browser(headless=True)
browser.get('https://quotes.toscrape.com/')

# By tag name
title = browser.find_element(By.TAG_NAME, 'h1')
print(f'Title (by tag): {title.text}')

# By class name (find_element raises NoSuchElementException if nothing matches)
first_quote = browser.find_element(By.CLASS_NAME, 'quote')
print(f'First quote found: <{first_quote.tag_name}> element')

# By CSS selector
quote_text = browser.find_element(By.CSS_SELECTOR, 'span.text')
print(f'Quote text: {quote_text.text[:50]}...')

# By XPath
author = browser.find_element(By.XPATH, '//small[@class="author"]')
print(f'Author: {author.text}')

browser.quit()

## Single vs Multiple Elements

```python
# Single element (first match)
element = browser.find_element(By.CLASS_NAME, 'quote')

# Multiple elements (list of all matches)
elements = browser.find_elements(By.CLASS_NAME, 'quote')
```
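
The two calls also differ when nothing matches: `find_element` raises `NoSuchElementException`, while `find_elements` returns an empty list. A minimal sketch of using that difference for optional elements follows. `find_or_none` is a hypothetical helper, not part of Selenium.

```python
def find_or_none(parent, by, value):
    """Return the first matching element, or None when nothing matches.

    `parent` can be the browser or any element; both expose find_elements().
    """
    matches = parent.find_elements(by, value)  # empty list when nothing matches
    return matches[0] if matches else None


# Usage with a real browser (assumes `browser` and `By` from the cells above;
# the id below is made up for illustration):
# banner = find_or_none(browser, By.ID, 'cookie-banner')
# if banner is not None:
#     banner.click()
```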

In [None]:
# Example: Getting multiple elements
browser = get_browser(headless=True)
browser.get('https://quotes.toscrape.com/')

# Get all quotes
quotes = browser.find_elements(By.CLASS_NAME, 'quote')
print(f'Found {len(quotes)} quotes on this page')

# Extract data from each
for i, quote in enumerate(quotes[:3]):  # First 3
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    print(f'\n{i+1}. {text[:60]}...')
    print(f'   - {author}')

browser.quit()

## Waiting for Elements

Web pages take time to load. Selenium provides two waiting strategies:

### Implicit Wait (Global)
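Sets a default timeout for every element search: `find_element` keeps retrying for up to that long before raising `NoSuchElementException`. Avoid combining implicit and explicit waits; the Selenium documentation warns that mixing them can lead to unpredictable wait times.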

```python
browser.implicitly_wait(10)  # Wait up to 10 seconds
```

### Explicit Wait (Specific)
Wait for specific conditions before proceeding.

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
element = wait.until(EC.presence_of_element_located((By.ID, 'myid')))
```
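
Conceptually, an explicit wait is a polling loop: it re-checks a condition at short intervals until the condition holds or the timeout expires. The sketch below illustrates that idea in plain Python. `wait_until` is a hypothetical helper written for this notebook, not Selenium's implementation, and it raises the built-in `TimeoutError` rather than Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Condition not met within {timeout} seconds')
        time.sleep(poll)  # pause before checking again


# Demo: a stand-in condition that succeeds on its third call
calls = {'count': 0}

def ready():
    calls['count'] += 1
    return calls['count'] >= 3

wait_until(ready, timeout=2, poll=0.01)
print(f"Condition met after {calls['count']} checks")
```

A fixed `time.sleep()` always waits the full duration, whereas a polling wait returns as soon as the condition is met.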

In [None]:
# Example: Using explicit waits
browser = get_browser(headless=True)

# Visit a page with JavaScript-loaded content
browser.get('https://quotes.toscrape.com/js/')

# Wait for quotes to appear (they're loaded by JavaScript)
wait = WebDriverWait(browser, 10)
try:
    quotes = wait.until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote'))
    )
    print(f'Found {len(quotes)} quotes after waiting')
except TimeoutException:
    print('Timeout: Quotes did not load in time')

browser.quit()
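
`WebDriverWait.until()` accepts any callable that takes the driver and returns a truthy value once the page is ready, so you are not limited to the built-in `expected_conditions`. Here is a sketch of a custom condition; `at_least_n_elements` is a hypothetical name chosen for this example.

```python
def at_least_n_elements(locator, n):
    """Build a wait condition that passes once `n` or more elements match.

    `locator` is a (by, value) tuple such as (By.CLASS_NAME, 'quote').
    """
    def condition(driver):
        elements = driver.find_elements(*locator)
        # Return the elements on success so wait.until() hands them back
        return elements if len(elements) >= n else False

    return condition


# Usage (assumes `browser`, `By` and `WebDriverWait` from the cells above):
# quotes = WebDriverWait(browser, 10).until(
#     at_least_n_elements((By.CLASS_NAME, 'quote'), 10)
# )
```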

## Interacting with Elements

In [None]:
# Example: Clicking and navigating
browser = get_browser(headless=True)
browser.get('https://quotes.toscrape.com/')

print(f'Starting page: {browser.current_url}')

# Click the 'Next' link to go to page 2
next_link = browser.find_element(By.PARTIAL_LINK_TEXT, 'Next')
next_link.click()

# Wait until the URL shows page 2 instead of sleeping for a fixed time
WebDriverWait(browser, 10).until(EC.url_contains('/page/2'))
print(f'After clicking Next: {browser.current_url}')

browser.quit()
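
The same click can be repeated to walk through every page. The sketch below is one possible approach, under a few assumptions: `follow_next_links` is a hypothetical helper, the locator for the Next link is supplied by the caller, and a short polling loop on `current_url` stands in for a proper explicit wait.

```python
import time


def follow_next_links(browser, next_locator, max_pages=20, timeout=10.0):
    """Click the 'next' link until it disappears; return the URLs visited."""
    urls = [browser.current_url]
    while len(urls) < max_pages:
        links = browser.find_elements(*next_locator)
        if not links:
            break  # no 'next' link means this is the last page
        previous = browser.current_url
        links[0].click()
        # Poll until the URL changes or the timeout runs out
        deadline = time.monotonic() + timeout
        while browser.current_url == previous and time.monotonic() < deadline:
            time.sleep(0.1)
        if browser.current_url == previous:
            break  # navigation did not happen, stop instead of looping forever
        urls.append(browser.current_url)
    return urls


# Usage (assumes `browser` and `By` from the cells above, and that the
# site's Next link sits inside an element with class "next"):
# browser.get('https://quotes.toscrape.com/')
# pages = follow_next_links(browser, (By.CSS_SELECTOR, 'li.next > a'))
```

The `max_pages` cap keeps the loop bounded on sites with very long or circular pagination.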

In [None]:
# Example: Filling forms
browser = get_browser(headless=True)
browser.get('https://quotes.toscrape.com/login')

# Find form fields
username_field = browser.find_element(By.ID, 'username')
password_field = browser.find_element(By.ID, 'password')

# Fill in the form
username_field.send_keys('test_user')
password_field.send_keys('test_password')

print('Form filled (not submitting in this example)')

# To submit: browser.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()

browser.quit()
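
Note that `send_keys()` appends to whatever the field already contains. When a field may be pre-filled, clearing it first avoids mixed values. A small sketch follows; `fill_field` is a hypothetical helper name.

```python
def fill_field(element, text):
    """Replace a text field's contents instead of appending to them."""
    element.clear()          # remove any pre-filled or previously typed value
    element.send_keys(text)  # then type the new value


# Usage (assumes `browser` and `By` from the cells above):
# fill_field(browser.find_element(By.ID, 'username'), 'test_user')
# fill_field(browser.find_element(By.ID, 'password'), 'test_password')
```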

## Scrolling

In [None]:
# Example: Scrolling the page
browser = get_browser(headless=True)
browser.get('https://quotes.toscrape.com/')

# Method 1: Using Keys
body = browser.find_element(By.TAG_NAME, 'body')
body.send_keys(Keys.PAGE_DOWN)
print('Scrolled down using PAGE_DOWN key')

time.sleep(0.5)

# Method 2: Using JavaScript
browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
print('Scrolled to bottom using JavaScript')

# Method 3: Scroll by specific amount
browser.execute_script('window.scrollBy(0, 500);')
print('Scrolled down 500 pixels')

browser.quit()
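
On pages that load more content as you scroll, a single scroll to the bottom is not enough. A common pattern is to keep scrolling until the page height stops growing. The sketch below shows one way to do it. `scroll_to_end` is a hypothetical helper, and the fixed pause is a simplifying assumption that may need tuning per site.

```python
import time


def scroll_to_end(browser, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return the rounds used."""
    get_height = 'return document.body.scrollHeight;'
    last_height = browser.execute_script(get_height)

    for round_number in range(1, max_rounds + 1):
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = browser.execute_script(get_height)
        if new_height == last_height:
            return round_number  # nothing new loaded, so the end was reached
        last_height = new_height

    return max_rounds  # stopped at the safety cap


# Usage (assumes `browser` from the cells above, already on a long page):
# rounds = scroll_to_end(browser)
# print(f'Reached the bottom after {rounds} scrolls')
```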

---

# Part 3: Practical Example - BBC Weather

Let's scrape weather information from BBC Weather, demonstrating real-world Selenium usage.

In [None]:
def scrape_bbc_weather():
    """
    Scrape sunrise and sunset times for Edinburgh from BBC Weather.
    
    This demonstrates:
    - Navigating to a page
    - Clicking elements
    - Waiting for content
    - Handling cookies consent
    - Extracting specific data
    """
    browser = get_browser(headless=True)
    
    try:
        # Step 1: Go to BBC Weather
        print('Step 1: Navigating to BBC Weather...')
        browser.get('https://www.bbc.co.uk/weather')
        time.sleep(2)
        
        # Step 2: Handle cookies consent (if present)
        print('Step 2: Checking for cookies dialog...')
        try:
            wait = WebDriverWait(browser, 5)
            accept_btn = wait.until(
                EC.element_to_be_clickable((By.ID, 'bbccookies-continue-button'))
            )
            accept_btn.click()
            print('   Accepted cookies')
            time.sleep(1)
        except TimeoutException:
            print('   No cookies dialog found')
        
        # Step 3: Click on Edinburgh
        print('Step 3: Clicking Edinburgh...')
        try:
            edinburgh_link = browser.find_element(
                By.XPATH, "//span[text()='Edinburgh']"
            )
            edinburgh_link.click()
            time.sleep(2)
            print('   Navigated to Edinburgh weather')
        except NoSuchElementException:
            print('   Edinburgh link not found on page')
            return None
        
        # Step 4: Get current page data
        print('Step 4: Extracting weather data...')
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        
        # Try to find sunrise/sunset data
        sunrise_elements = soup.find_all('span', {'class': 'wr-c-astro-data__time'})
        
        if sunrise_elements:
            sunrise = sunrise_elements[0].text  # the list is non-empty in this branch
            sunset = sunrise_elements[1].text if len(sunrise_elements) > 1 else 'N/A'
            print(f'\nResults for Edinburgh:')
            print(f'   Sunrise: {sunrise}')
            print(f'   Sunset: {sunset}')
            return {'sunrise': sunrise, 'sunset': sunset}
        else:
            print('   Could not find sunrise/sunset data')
            return None
            
    finally:
        browser.quit()
        print('\nBrowser closed.')

# Run the scraper
# Note: This may not work in all environments due to page structure changes
# result = scrape_bbc_weather()
print('Run scrape_bbc_weather() to test (may require adjustments for current page structure)')

---

# Part 4: Selenium vs Playwright Comparison

Let's do the same task with both tools to see the differences.

In [None]:
# SELENIUM VERSION
def scrape_quotes_selenium():
    """Scrape quotes using Selenium."""
    browser = get_browser(headless=True)
    quotes_data = []
    
    try:
        browser.get('https://quotes.toscrape.com/js/')
        
        # Wait for quotes to load
        wait = WebDriverWait(browser, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'quote')))
        
        # Extract quotes
        quotes = browser.find_elements(By.CLASS_NAME, 'quote')
        
        for quote in quotes:
            text = quote.find_element(By.CLASS_NAME, 'text').text
            author = quote.find_element(By.CLASS_NAME, 'author').text
            quotes_data.append({'text': text, 'author': author})
            
    finally:
        browser.quit()
    
    return quotes_data

# Test Selenium version
print('Selenium version:')
start = time.time()
selenium_quotes = scrape_quotes_selenium()
selenium_time = time.time() - start
print(f'Found {len(selenium_quotes)} quotes in {selenium_time:.2f} seconds')

In [None]:
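# PLAYWRIGHT VERSION (for comparison)
# Requires: pip install playwright, then: playwright install chromium
# Note: inside a Jupyter kernel the sync API can fail because an asyncio event
# loop is already running; in that case use playwright.async_api or run as a script.
from playwright.sync_api import sync_playwright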

def scrape_quotes_playwright():
    """Scrape quotes using Playwright."""
    quotes_data = []
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        page.goto('https://quotes.toscrape.com/js/')
        page.wait_for_selector('.quote')
        
        # Get HTML and parse with BeautifulSoup
        soup = BeautifulSoup(page.content(), 'html.parser')
        quotes = soup.find_all('div', {'class': 'quote'})
        
        for quote in quotes:
            text = quote.find('span', {'class': 'text'}).text
            author = quote.find('small', {'class': 'author'}).text
            quotes_data.append({'text': text, 'author': author})
        
        browser.close()
    
    return quotes_data

# Test Playwright version
print('\nPlaywright version:')
start = time.time()
playwright_quotes = scrape_quotes_playwright()
playwright_time = time.time() - start
print(f'Found {len(playwright_quotes)} quotes in {playwright_time:.2f} seconds')

In [None]:
# Compare results
print('\n--- Comparison ---')
print(f'Selenium: {selenium_time:.2f}s')
print(f'Playwright: {playwright_time:.2f}s')
print(f'Both found same number of quotes: {len(selenium_quotes) == len(playwright_quotes)}')
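
A single timed run mixes browser start-up, network latency, and the scraping itself, so the numbers above can vary a lot between runs. Repeating each measurement and taking the median gives a steadier comparison. The sketch below uses a hypothetical helper, `time_runs`, with a stand-in workload so that it runs without a browser.

```python
import statistics
import time


def time_runs(func, repeats=3):
    """Run `func` several times and return (median_seconds, all_timings)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Demo with a stand-in workload; swap in scrape_quotes_selenium or
# scrape_quotes_playwright from the cells above for a real comparison.
median, timings = time_runs(lambda: sum(range(100_000)), repeats=5)
print(f'Median of {len(timings)} runs: {median:.4f}s')
```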

## Key Syntax Differences

| Task | Selenium | Playwright |
|------|----------|------------|
| Launch | `webdriver.Chrome()` | `p.chromium.launch()` |
| Navigate | `browser.get(url)` | `page.goto(url)` |
| Find one | `browser.find_element(By.X, val)` | `page.query_selector(sel)` |
| Find all | `browser.find_elements(By.X, val)` | `page.query_selector_all(sel)` |
| Wait | `WebDriverWait(...).until(...)` | `page.wait_for_selector(sel)` |
| Click | `element.click()` | `page.click(sel)` |
| Get HTML | `browser.page_source` | `page.content()` |
| Close | `browser.quit()` | `browser.close()` |

---

# Part 5: XPath Basics

XPath is a powerful way to locate elements. Here are common patterns:

## Common XPath Patterns

```xpath
# By tag
//div                  # All div elements
//div/p                # p elements that are direct children of div
//div//p               # p elements anywhere inside div

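# By attribute
//div[@id='main']      # div with id="main"
//div[@class='card']   # div whose class attribute is exactly "card"
//a[@href]             # All links with href attribute

# By text content
//span[text()='Hello']           # span whose text is exactly 'Hello'
//span[contains(text(),'Hello')] # span whose text contains 'Hello'

# Position (evaluated relative to each parent, not the whole document)
//div[1]               # Every div that is the first div child of its parent
(//div)[1]             # The first div in the whole document
//div[last()]          # Every div that is the last div child of its parent
//div[position()<=3]   # Divs among the first 3 div children of their parent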

# Combining
//div[@class='quote']//span[@class='text']
```
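
Position predicates are easy to misread, because `[1]` and `[last()]` are evaluated relative to each parent element. You can experiment without a browser using Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath (browsers implement much more). The document below is a tiny stand-in, not a real page.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document for experimenting with paths
root = ET.fromstring(
    '<body>'
    '<div class="quote"><p>a</p><p>b</p></div>'
    '<div class="quote"><p>c</p><p>d</p></div>'
    '</body>'
)


def texts(path):
    """Return the text of every element matched by `path`."""
    return [element.text for element in root.findall(path)]


# ElementTree paths are relative to `root`, hence the leading '.'
all_p = texts(".//div[@class='quote']/p")
first_p_per_parent = texts('.//p[1]')
last_p_per_parent = texts('.//p[last()]')

print(all_p)               # ['a', 'b', 'c', 'd']
print(first_p_per_parent)  # ['a', 'c']
print(last_p_per_parent)   # ['b', 'd']
```

`.//p[1]` matches one `p` per parent rather than a single element for the whole document.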

In [None]:
# XPath examples
browser = get_browser(headless=True)
browser.get('https://quotes.toscrape.com/')

# Find first quote text
first_quote = browser.find_element(By.XPATH, '//span[@class="text"]')
print(f'First quote: {first_quote.text[:50]}...')

# Find all authors
authors = browser.find_elements(By.XPATH, '//small[@class="author"]')
print(f'\nAuthors on page: {[a.text for a in authors[:5]]}')

# Find quote containing specific text
einstein_quote = browser.find_element(
    By.XPATH, '//div[@class="quote"][.//small[text()="Albert Einstein"]]//span[@class="text"]'
)
print(f'\nEinstein quote: {einstein_quote.text[:50]}...')

browser.quit()

## Finding XPath in Browser

1. Right-click on element > Inspect
2. In DevTools, right-click the HTML > Copy > Copy XPath

**Note**: Auto-generated XPaths can be brittle. It's often better to write your own based on stable attributes.

---

# Summary

## When to Use Selenium

- Working with existing Selenium codebases
- Need to drive a browser that Playwright does not bundle, such as the real Safari application (Playwright ships a WebKit build instead)
- Team already knows Selenium
- Following tutorials/courses that use Selenium

## Key Selenium Concepts

1. **WebDriver**: The browser controller
2. **Locators**: By.ID, By.CLASS_NAME, By.XPATH, By.CSS_SELECTOR
3. **Waits**: Implicit (global) and Explicit (specific)
4. **Actions**: click(), send_keys(), execute_script()

## Best Practices

- Use explicit waits instead of `time.sleep()` where possible
- Always close the browser in a `finally` block
- Use headless mode for automation
- Handle exceptions gracefully
- Prefer stable locators (ID > class > XPath)
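
One way to make the "always close the browser" rule hard to forget is to wrap browser creation in a context manager. The sketch below assumes a factory function such as `get_browser` from Part 1. `managed_browser` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with `factory()` and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the body raises an exception


# Usage (assumes get_browser from Part 1):
# with managed_browser(lambda: get_browser(headless=True)) as browser:
#     browser.get('https://quotes.toscrape.com/')
#     print(browser.title)
```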

---

*End of Optional Selenium Notebook*