# Week 1: Introduction to Web Scraping

**Web and Social Network Analytics**

---

## Learning Objectives

By the end of this lab, you will be able to:

1. Understand when and why to scrape web data
2. Use `requests` + `BeautifulSoup` for static HTML pages
3. Use `Playwright` for JavaScript-rendered content
4. Use APIs as the preferred data source

---

**Disclaimer**: This educational content, including any code examples, is provided for instructional purposes only. The author does not endorse or encourage the unauthorised or illegal scraping of websites.

While Python with relevant libraries can be used for web scraping, it's crucial to conduct scraping activities in compliance with applicable laws, the website's terms of service, and ethical considerations. Always review and respect the rules set by the website you are scraping to ensure legal and responsible data collection practices.

---

# Part 1: Introduction to Web Scraping

## What is Web Scraping?

**Web scraping** is the automated process of extracting data from websites. Instead of manually copying information, we write programs that:

1. Download web pages
2. Parse the HTML content
3. Extract the data we need
4. Store it in a structured format (CSV, database, etc.)

### Common Use Cases

- **Price monitoring**: Track product prices across e-commerce sites
- **Research**: Collect data for academic studies
- **News aggregation**: Gather articles from multiple sources
- **Social media analysis**: Analyze public posts and trends
- **Job listings**: Aggregate job postings from various sites

## The Web Scraping Decision Tree

Before scraping, always follow this decision process:

```
Do you need data from a website?
            |
            v
1. Does an official API exist?
      |                |
     Yes               No
      |                |
      v                v
 Use the API!    2. Is the content static HTML?
                      |                |
                     Yes               No (JavaScript-rendered)
                      |                |
                      v                v
              Use BeautifulSoup   Use Playwright or Selenium
```

**Always prefer APIs** when available - they provide:
- Structured data (JSON/XML)
- Clearly defined access under the provider's terms of service
- Reliable and stable endpoints
- Rate limiting to prevent overload
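
The decision tree above can be written as a tiny function. This is only an illustrative sketch: `choose_tool` and its two boolean parameters are hypothetical names, and in practice you answer the two questions by reading the site's documentation and inspecting the page source.

```python
def choose_tool(has_official_api: bool, is_static_html: bool) -> str:
    """Return the recommended approach, mirroring the decision tree."""
    if has_official_api:
        return "API"                      # question 1: always prefer an API
    if is_static_html:
        return "BeautifulSoup"            # question 2: static HTML
    return "Playwright or Selenium"       # JavaScript-rendered content


print(choose_tool(has_official_api=True, is_static_html=True))
print(choose_tool(has_official_api=False, is_static_html=True))
print(choose_tool(has_official_api=False, is_static_html=False))
```

Note that the API check comes first: even when a page is simple static HTML, the API is still the preferred source.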

## HTTP Basics

The web works on the **HTTP protocol**. When you visit a website:

1. Your browser sends an **HTTP GET request** to the server
2. The server processes the request
3. The server sends back an **HTTP response** with:
   - **Status code** (200 = OK, 404 = Not Found, 500 = Server Error)
   - **Headers** (metadata about the response)
   - **Body** (the actual HTML content)
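
To see the three parts of a response without touching a real website, the sketch below starts a throwaway server on your own machine and sends it one GET request. It uses only the standard library; `HelloHandler` and `fetch_from_local_server` are hypothetical names chosen for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)                                # status code
        self.send_header("Content-Type", "text/html; charset=utf-8")  # headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                                 # body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    """Start a local server, send one GET request, return the response parts."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # skip any system proxy for localhost
        with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as response:
            status = response.status
            content_type = response.headers.get("Content-Type")
            body = response.read().decode("utf-8")
        return status, content_type, body
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = fetch_from_local_server()
print("Status code:", status)
print("Content-Type header:", content_type)
print("Body:", body)
```

A scraper does the same thing as the client half of this example, only against a remote server instead of `127.0.0.1`.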

### Common Status Codes (common examples, not an exhaustive list)

| Code | Meaning |
|------|---------|  
| 200 | OK - Request successful |
| 301 | Moved Permanently - Redirect |
| 403 | Forbidden - Access denied |
| 404 | Not Found - Page doesn't exist |
| 500 | Internal Server Error |
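
A scraper usually needs to react to the status code before parsing anything. The sketch below groups codes by their first digit using the standard library's `http.HTTPStatus`; the helper name `describe_status` and the category labels are assumptions made for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase, and a broad category."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a registered status code

    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```

As a rule of thumb, only parse the body on a success code; a client error usually means the request itself needs fixing, while a server error may be worth retrying later.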

---

# Part 2: HTML Fundamentals

## HTML Structure

HTML (HyperText Markup Language) structures web content using **tags**:

```html
<tagname attribute="value">Content</tagname>
```

### Key HTML Elements

| Tag | Purpose | Example |
|-----|---------|---------|  
| `<h1>` to `<h6>` | Headings | `<h1>Main Title</h1>` |
| `<p>` | Paragraph | `<p>Some text...</p>` |
| `<a>` | Hyperlink | `<a href="url">Link text</a>` |
| `<div>` | Division/container | `<div class="section">...</div>` |
| `<span>` | Inline container | `<span class="highlight">text</span>` |
| `<table>` | Table | Contains `<tr>`, `<td>` |
| `<ul>`, `<ol>` | Lists | Contains `<li>` items |

### Finding Elements

Elements can be identified by:

1. **Tag name**: `<div>`, `<p>`, `<a>`
2. **ID** (unique): `<div id="header">` - an ID should appear only once per page
3. **Class** (reusable): `<div class="card">` - multiple elements can share the same class
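
The three ways of identifying an element map directly onto CSS selectors, which BeautifulSoup supports through `select()` and `select_one()`. The HTML string below is made up for this illustration, so the snippet runs without any file or network access.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this demonstration
html = """
<div id="header"><h1>Shop</h1></div>
<div class="card"><p>Apples</p></div>
<div class="card"><p>Pears</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")          # by tag name
header = soup.select_one("#header")    # by ID (the "#" prefix)
cards = soup.select("div.card")        # by class (the "." prefix)

print(len(paragraphs))
print(header.get_text(strip=True))
print([card.get_text(strip=True) for card in cards])
```

The same selectors work in the browser's Developer Tools, which makes it easy to test a selector there before using it in Python.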

## Hands-on: Exploring HTML

Let's explore the `example_html.html` file in this folder. Open it in your browser and use **Developer Tools** (F12 or right-click > Inspect) to examine the structure.

In [None]:
# First, let's import the libraries we'll need
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [None]:
# Load the local HTML file (as_uri() builds a valid file:// URL on any OS)
from pathlib import Path

file_url = Path("example_html.html").resolve().as_uri()
website_source_code = urlopen(file_url)

# Parse with BeautifulSoup
soup = BeautifulSoup(website_source_code, 'html.parser')

# View the formatted HTML
print(soup.prettify())

---

# Part 3: Static Scraping with BeautifulSoup

**BeautifulSoup** is a Python library for parsing HTML and XML documents. It creates a parse tree that makes it easy to extract data.

## Core Pattern

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# 1. Fetch the page
html = urlopen(url)

# 2. Parse with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# 3. Find elements
elements = soup.find_all('tag', {'class': 'classname'})

# 4. Extract data
for element in elements:
    print(element.text)
```

## Key BeautifulSoup Methods

| Method | Returns | Example |
|--------|---------|---------|  
| `find(tag)` | First matching element | `soup.find('h1')` |
| `find_all(tag)` | List of all matches | `soup.find_all('p')` |
| `find(id='x')` | Element with ID | `soup.find(id='header')` |
| `find(class_='x')` | Element with class | `soup.find(class_='card')` |
| `find('tag', {'attr': 'val'})` | By attribute | `soup.find('div', {'class': 'main'})` |
| `.text` | Text content | `element.text` |
| `['attribute']` | Attribute value | `link['href']` |
| `.find_all()` on an element | Matching descendants (`.findChildren()` is a legacy alias) | `row.find_all('td')` |
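
The methods in the table can be tried on a small inline document before moving to a real page. The HTML below is invented for this sketch; only the BeautifulSoup calls themselves come from the table.

```python
from bs4 import BeautifulSoup

# Invented HTML used only to exercise the methods from the table
html = """
<h1>Fruit prices</h1>
<div id="header" class="main"><a href="/about">About</a></div>
<table>
  <tr id="first_row"><td>Apple</td><td>1.20</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1").text                      # first matching element, then its text
href = soup.find("a")["href"]                     # attribute value
main_div = soup.find("div", {"class": "main"})    # lookup by attribute dictionary
same_div = soup.find(id="header")                 # lookup by ID
row = soup.find(id="first_row")
cells = [td.text for td in row.find_all("td")]    # search inside one element

print(title)
print(href)
print(main_div is same_div)
print(cells)
```

Keep in mind that `find()` returns `None` when nothing matches, so calling `.text` on the result of a failed lookup raises an `AttributeError`.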

## Example 1: Working with Local HTML

Let's practice finding elements in our local HTML file.

In [None]:
# Find all h1 tags
h1_tags = soup.find_all('h1')

for h1 in h1_tags:
    print('Tag:', h1)
    print('Text:', h1.text)
    print('---')

In [None]:
# Find element by ID
middle_row = soup.find(id='middle_row')

print('Complete tag:', middle_row)
print('Text content:', middle_row.text)

In [None]:
# Find the <td> cells inside an element (find_all replaces the legacy findChildren alias)
cells = middle_row.find_all('td')

for cell in cells:
    print('Cell value:', cell.text)

In [None]:
# Find by class name
hipster_divs = soup.find_all('div', {'class': 'hipster'})

for div in hipster_divs:
    # Get the h2 inside each div
    header = div.find('h2').text
    paragraph = div.find('p').text.strip()
    print(f'Header: {header}')
    print(f'Content: {paragraph}')
    print('---')

In [None]:
# Extract all data from a table
table = soup.find('table')

for row_num, row in enumerate(table.find_all('tr')):
    print(f'Row {row_num}:')
    for cell in row.find_all('td'):
        print(f'  Value: {cell.text}')

## Example 2: Scraping quotes.toscrape.com

This is a website specifically designed for practicing web scraping. It's stable and won't change unexpectedly.

In [None]:
# Fetch the quotes page
url = 'https://quotes.toscrape.com/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Find all quote containers
quotes = soup.find_all('div', {'class': 'quote'})

print(f'Found {len(quotes)} quotes on this page\n')

# Extract data from each quote
for quote in quotes[:5]:  # First 5 quotes
    text = quote.find('span', {'class': 'text'}).text
    author = quote.find('small', {'class': 'author'}).text
    
    # Get tags
    tags = [tag.text for tag in quote.find_all('a', {'class': 'tag'})]
    
    print(f'Quote: {text}')
    print(f'Author: {author}')
    print(f'Tags: {tags}')
    print('---')

### Handling Pagination

Many websites split content across multiple pages. Let's scrape several of them.

In [None]:
import time

all_quotes = []

# Scrape first 3 pages
for page_num in range(1, 4):
    url = f'https://quotes.toscrape.com/page/{page_num}/'
    print(f'Scraping page {page_num}...')
    
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    
    quotes = soup.find_all('div', {'class': 'quote'})
    
    for quote in quotes:
        all_quotes.append({
            'text': quote.find('span', {'class': 'text'}).text,
            'author': quote.find('small', {'class': 'author'}).text,
            'tags': [tag.text for tag in quote.find_all('a', {'class': 'tag'})]
        })
    
    # Be respectful - wait between requests
    time.sleep(1)

print(f'\nTotal quotes collected: {len(all_quotes)}')

In [None]:
# Convert to a pandas DataFrame
import pandas as pd

df = pd.DataFrame(all_quotes)
df.head(10)
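
The loop above works because the page URLs follow a predictable pattern. When they do not, an alternative is to follow the "Next" link until it disappears. The sketch below shows that idea on a hypothetical in-memory site: the `PAGES` dictionary and the `fetch` helper stand in for real HTTP requests, and the `li.next a` selector is assumed to match the site's pagination markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical three-page site kept in memory so the example runs offline
PAGES = {
    "https://example.com/page/1/":
        '<div class="quote">A</div><li class="next"><a href="/page/2/">Next</a></li>',
    "https://example.com/page/2/":
        '<div class="quote">B</div><li class="next"><a href="/page/3/">Next</a></li>',
    "https://example.com/page/3/":
        '<div class="quote">C</div>',
}


def fetch(url):
    """Stand-in for an HTTP request; a real scraper would download the page."""
    return PAGES[url]


def scrape_all_pages(start_url, max_pages=10):
    """Follow 'Next' links until none is left or max_pages is reached."""
    results = []
    url = start_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        soup = BeautifulSoup(fetch(url), "html.parser")
        results.extend(div.text for div in soup.find_all("div", {"class": "quote"}))
        next_link = soup.select_one("li.next a")
        # urljoin turns a relative href such as "/page/2/" into a full URL
        url = urljoin(url, next_link["href"]) if next_link else None
        pages_seen += 1
    return results


print(scrape_all_pages("https://example.com/page/1/"))
```

The `max_pages` cap is a safety net against endless loops; against a real site you would also keep the `time.sleep(1)` pause between requests.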

## Example 3: Edinburgh University DRPS

Let's scrape real course information from the University of Edinburgh's DRPS (Degree Regulations and Programmes of Study).

In [None]:
# Fetch a course page
url = 'https://www.drps.ed.ac.uk/current/dpt/cxcmse11427.htm'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# The page title contains the course name
course_name = soup.find('h1').text if soup.find('h1') else 'Not found'
print(f'Course: {course_name}')

In [None]:
# Find information in the course table
table = soup.find('table', {'class': 'sitstablegrid'})

if table:
    for cell in table.find_all('td'):
        text = cell.text.strip()
        # Look for specific information
        if 'SCQF' in text:
            print(text)

### Try It Yourself!

Modify the code above to extract:
1. The course credits
2. The course organiser
3. The course description

In [None]:
# Your code here


---

# Part 4: Dynamic Scraping with Playwright

## Why Do We Need Playwright?

Many modern websites use **JavaScript** to load content dynamically. When you visit such a site:

1. The server sends minimal HTML
2. JavaScript code runs in your browser
3. The JavaScript fetches data and renders the content

**Problem**: BeautifulSoup only sees the initial HTML - not the JavaScript-rendered content!

**Solution**: Use a browser automation tool like **Playwright** that:
- Launches a real browser
- Executes JavaScript
- Waits for content to load
- Then extracts the rendered HTML

## Setting Up Playwright

Playwright works well in JupyterHub environments because:
- It manages browser binaries automatically
- It has excellent headless mode support
- It's designed for modern web applications

### Async vs Sync API: Which to Use?

| Environment | API to Use | Reason |
|-------------|------------|--------|
| **JupyterLab/Notebook** | `async_playwright` | JupyterLab already runs an event loop; async avoids conflicts |
| **.py scripts** | `sync_playwright` | Simpler syntax when no event loop is running |

**In this notebook**, we use the **async API** since you're running JupyterLab. The sync API equivalent is shown in comments for reference when writing standalone scripts.
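
The difference between the two environments can be seen without Playwright at all. In the sketch below, `fetch_title` is a stand-in coroutine (a real one would drive the browser) and `event_loop_is_running` is a hypothetical helper; both names are assumptions for this example.

```python
import asyncio


async def fetch_title():
    """Stand-in coroutine; a real version would drive Playwright here."""
    await asyncio.sleep(0)  # yield control once, as real I/O would
    return "Example Domain"


def event_loop_is_running():
    """Return True inside a running event loop (for example a notebook cell)."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False


if event_loop_is_running():
    # Notebook case: asyncio.run() would raise here, so use `await` in the cell
    title = None
    print("Event loop already running: use `title = await fetch_title()` instead.")
else:
    # Script case: no loop yet, so asyncio.run() creates one and runs the coroutine
    title = asyncio.run(fetch_title())
    print(title)
```

This is why the notebook cells use top-level `await`, while a standalone script needs either the sync API or `asyncio.run()`.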

In [None]:
# Install Playwright (run once)
# !pip install playwright

# Install browser binaries (run once)
# !playwright install chromium

In [None]:
# Import Playwright
# For JupyterLab/Notebook: Use async API (recommended)
from playwright.async_api import async_playwright

# For .py scripts: Use sync API (uncomment below)
# from playwright.sync_api import sync_playwright

## Core Pattern

### For JupyterLab/Notebook: Async API (Recommended)

JupyterLab runs an event loop, so using the async API avoids conflicts and is the recommended approach.

```python
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def scrape_page():
    async with async_playwright() as p:
        # Launch browser in headless mode (no GUI)
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        # Navigate to URL
        await page.goto('https://example.com')
        
        # Wait for content to load
        await page.wait_for_selector('.content-class')
        
        # Get the rendered HTML
        html = await page.content()
        
        # Now use BeautifulSoup to parse
        soup = BeautifulSoup(html, 'html.parser')
        
        # Close browser
        await browser.close()
        
        return soup

# Run in JupyterLab/Notebook
soup = await scrape_page()
```

### For .py Scripts: Sync API

When running standalone Python scripts (.py files), use the sync API:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch browser in headless mode (no GUI)
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    
    # Navigate to URL
    page.goto('https://example.com')
    
    # Wait for content to load
    page.wait_for_selector('.content-class')
    
    # Get the rendered HTML
    html = page.content()
    
    # Now use BeautifulSoup to parse
    soup = BeautifulSoup(html, 'html.parser')
    
    # Close browser
    browser.close()
```

## Key Playwright Methods

| Method | Purpose | Example |
|--------|---------|---------|  
| `page.goto(url)` | Navigate to URL | `page.goto('https://...')` |
| `page.wait_for_selector(sel)` | Wait for element | `page.wait_for_selector('.quote')` |
| `page.click(sel)` | Click element | `page.click('button.next')` |
| `page.fill(sel, text)` | Fill input field | `page.fill('#search', 'query')` |
| `page.content()` | Get HTML content | `html = page.content()` |
| `page.screenshot()` | Take screenshot | `page.screenshot(path='shot.png')` |
| `page.evaluate(js)` | Run JavaScript | `page.evaluate('window.scrollBy(0, 500)')` |

## Example: JavaScript-Rendered Quotes

The website `quotes.toscrape.com/js/` renders quotes using JavaScript. BeautifulSoup alone cannot see them!

In [None]:
# First, let's see what BeautifulSoup gets (without JavaScript)
url = 'https://quotes.toscrape.com/js/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

quotes = soup.find_all('div', {'class': 'quote'})
print(f'BeautifulSoup found: {len(quotes)} quotes')
print('(The quotes are loaded by JavaScript, so BeautifulSoup sees nothing!)')

In [None]:
# Now let's use Playwright to render the JavaScript
# ============================================
# ASYNC API (for JupyterLab/Notebook)
# ============================================
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

async def scrape_js_quotes():
    async with async_playwright() as p:
        # Launch headless browser
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        # Navigate to the page
        await page.goto('https://quotes.toscrape.com/js/')
        
        # Wait for quotes to load (JavaScript needs time to execute)
        await page.wait_for_selector('.quote')
        
        # Get the rendered HTML
        html = await page.content()
        
        # Close browser
        await browser.close()
        
        return html

# Run the async function in JupyterLab/Notebook
html = await scrape_js_quotes()

# Now parse with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
quotes = soup.find_all('div', {'class': 'quote'})

print(f'Playwright + BeautifulSoup found: {len(quotes)} quotes\n')

# Display first 3 quotes
for quote in quotes[:3]:
    text = quote.find('span', {'class': 'text'}).text
    author = quote.find('small', {'class': 'author'}).text
    print(f'Quote: {text}')
    print(f'Author: {author}')
    print('---')

# ============================================
# SYNC API (for .py scripts) - Uncomment to use
# ============================================
# from playwright.sync_api import sync_playwright
# from bs4 import BeautifulSoup
#
# with sync_playwright() as p:
#     browser = p.chromium.launch(headless=True)
#     page = browser.new_page()
#     page.goto('https://quotes.toscrape.com/js/')
#     page.wait_for_selector('.quote')
#     html = page.content()
#     browser.close()
#
# soup = BeautifulSoup(html, 'html.parser')
# quotes = soup.find_all('div', {'class': 'quote'})
# print(f'Found: {len(quotes)} quotes')

## Interacting with Pages

Playwright can click buttons, fill forms, scroll, and more.

In [None]:
# Example: Navigate through multiple pages
# ============================================
# ASYNC API (for JupyterLab/Notebook)
# ============================================
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import asyncio

async def scrape_multiple_pages():
    all_quotes = []
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        await page.goto('https://quotes.toscrape.com/js/')
        await page.wait_for_selector('.quote')
        
        # Scrape multiple pages
        for page_num in range(1, 4):  # 3 pages
            print(f'Scraping page {page_num}...')
            
            # Get current page content
            html = await page.content()
            soup = BeautifulSoup(html, 'html.parser')
            
            quotes = soup.find_all('div', {'class': 'quote'})
            for quote in quotes:
                all_quotes.append({
                    'text': quote.find('span', {'class': 'text'}).text,
                    'author': quote.find('small', {'class': 'author'}).text
                })
            
            # Try to click 'Next' button
            next_button = await page.query_selector('li.next a')
            if next_button:
                await next_button.click()
                await page.wait_for_selector('.quote')
                await asyncio.sleep(1)  # Be respectful
            else:
                break
        
        await browser.close()
    
    return all_quotes

# Run the async function in JupyterLab/Notebook
all_quotes = await scrape_multiple_pages()
print(f'\nTotal quotes collected: {len(all_quotes)}')

# ============================================
# SYNC API (for .py scripts) - Uncomment to use
# ============================================
# from playwright.sync_api import sync_playwright
# import time
#
# all_quotes = []
#
# with sync_playwright() as p:
#     browser = p.chromium.launch(headless=True)
#     page = browser.new_page()
#     page.goto('https://quotes.toscrape.com/js/')
#     page.wait_for_selector('.quote')
#     
#     for page_num in range(1, 4):
#         print(f'Scraping page {page_num}...')
#         html = page.content()
#         soup = BeautifulSoup(html, 'html.parser')
#         
#         quotes = soup.find_all('div', {'class': 'quote'})
#         for quote in quotes:
#             all_quotes.append({
#                 'text': quote.find('span', {'class': 'text'}).text,
#                 'author': quote.find('small', {'class': 'author'}).text
#             })
#         
#         next_button = page.query_selector('li.next a')
#         if next_button:
#             next_button.click()
#             page.wait_for_selector('.quote')
#             time.sleep(1)
#         else:
#             break
#     
#     browser.close()
#
# print(f'\nTotal quotes collected: {len(all_quotes)}')

## Scrolling for Infinite-Load Pages

Some pages load more content as you scroll ("infinite scroll"). Here's how to handle that:

In [None]:
# Example of scrolling (using quotes.toscrape.com/scroll as demo)
# ============================================
# ASYNC API (for JupyterLab/Notebook)
# ============================================
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import asyncio

async def scrape_with_scroll():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        await page.goto('https://quotes.toscrape.com/scroll')
        await page.wait_for_selector('.quote')
        
        # Scroll down multiple times to load more content
        for i in range(3):
            # Scroll to bottom
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await asyncio.sleep(2)  # Wait for content to load
            print(f'Scrolled {i+1} times')
        
        # Get all loaded content
        html = await page.content()
        await browser.close()
        
        return html

# Run the async function in JupyterLab/Notebook
html = await scrape_with_scroll()

soup = BeautifulSoup(html, 'html.parser')
quotes = soup.find_all('div', {'class': 'quote'})
print(f'\nTotal quotes after scrolling: {len(quotes)}')

# ============================================
# SYNC API (for .py scripts) - Uncomment to use
# ============================================
# from playwright.sync_api import sync_playwright
# import time
#
# with sync_playwright() as p:
#     browser = p.chromium.launch(headless=True)
#     page = browser.new_page()
#     
#     page.goto('https://quotes.toscrape.com/scroll')
#     page.wait_for_selector('.quote')
#     
#     for i in range(3):
#         page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
#         time.sleep(2)
#         print(f'Scrolled {i+1} times')
#     
#     html = page.content()
#     soup = BeautifulSoup(html, 'html.parser')
#     quotes = soup.find_all('div', {'class': 'quote'})
#     
#     print(f'\nTotal quotes after scrolling: {len(quotes)}')
#     
#     browser.close()

### Try It Yourself!

Use Playwright to:
1. Visit a JavaScript-rendered page of your choice
2. Wait for specific content to load
3. Extract and display the data

In [None]:
# Your code here


---

# Part 5: APIs - The Preferred Approach

## Why APIs Are Better Than Scraping

| Aspect | API | Web Scraping |
|--------|-----|-------------|
| **Data Format** | Structured (JSON/XML) | Unstructured HTML |
| **Reliability** | Stable endpoints | Pages can change anytime |
| **Legality** | Clear terms of service | Often a gray area |
| **Rate Limiting** | Documented limits | Risk of being blocked |
| **Data Quality** | Clean, complete | May need extensive cleaning |

## REST API Basics

**REST APIs** (Representational State Transfer) are the most common type:

- Use HTTP methods: GET (read), POST (create), PUT (update), DELETE (remove)
- Usually return data in JSON format
- Have documented endpoints (URLs)
- May require authentication (API keys)

### Making API Requests with `requests`

```python
import requests

# Simple GET request
response = requests.get('https://api.example.com/data')
data = response.json()  # Parse JSON response

# GET with parameters
params = {'city': 'Edinburgh', 'units': 'metric'}
response = requests.get('https://api.example.com/weather', params=params)
```
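
It helps to know what the `params` argument does under the hood: the dictionary is encoded into a query string and appended to the URL. The sketch below reproduces that step with the standard library, reusing the placeholder `api.example.com` address from the snippet above.

```python
from urllib.parse import urlencode

base_url = "https://api.example.com/weather"   # placeholder endpoint, not a real API
params = {"city": "Edinburgh", "units": "metric"}

# Encode the dictionary as key=value pairs joined by "&"
query_string = urlencode(params)
full_url = f"{base_url}?{query_string}"
print(full_url)

# Special characters are escaped automatically (a space becomes "+")
print(urlencode({"city": "Fort William"}))
```

Passing a dictionary instead of building the URL by hand avoids escaping mistakes when values contain spaces or symbols.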

## Example 1: Open-Meteo Weather API (No API Key Required)

Open-Meteo provides free weather data without requiring registration or an API key, which makes it perfect for learning!

In [None]:
import requests

# Edinburgh coordinates
latitude = 55.95
longitude = -3.19

# Build the API URL
url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": latitude,
    "longitude": longitude,
    "current_weather": True,
    "hourly": "temperature_2m,precipitation",
    "timezone": "Europe/London"
}

# Make the request
response = requests.get(url, params=params)

# Check if successful
print(f'Status Code: {response.status_code}')

# Parse JSON response
data = response.json()
print(f'\nResponse keys: {data.keys()}')

In [None]:
# Extract current weather
current = data['current_weather']

print('Current Weather in Edinburgh:')
print(f"  Temperature: {current['temperature']}°C")
print(f"  Wind Speed: {current['windspeed']} km/h")
print(f"  Time: {current['time']}")

In [None]:
# Get hourly forecast
import pandas as pd

hourly_df = pd.DataFrame({
    'time': data['hourly']['time'],
    'temperature': data['hourly']['temperature_2m'],
    'precipitation': data['hourly']['precipitation']
})

# Show first 24 hours
hourly_df.head(24)

### Fetch Weather for Multiple Cities

In [None]:
# Scottish cities coordinates
cities = {
    'Edinburgh': (55.95, -3.19),
    'Glasgow': (55.86, -4.25),
    'Aberdeen': (57.15, -2.11),
    'Dundee': (56.46, -2.97)
}

weather_data = []

for city, (lat, lon) in cities.items():
    params = {
        "latitude": lat,
        "longitude": lon,
        "current_weather": True
    }
    
    response = requests.get(url, params=params)
    data = response.json()
    
    weather_data.append({
        'city': city,
        'temperature': data['current_weather']['temperature'],
        'windspeed': data['current_weather']['windspeed']
    })

weather_df = pd.DataFrame(weather_data)
weather_df

## Example 2: JSONPlaceholder (Free Test API)

JSONPlaceholder is a free fake API for testing and prototyping.

In [None]:
# Get fake posts
response = requests.get('https://jsonplaceholder.typicode.com/posts')
posts = response.json()

print(f'Number of posts: {len(posts)}')
print(f'\nFirst post:')
print(f"  Title: {posts[0]['title']}")
print(f"  Body: {posts[0]['body'][:100]}...")

In [None]:
# Get users
response = requests.get('https://jsonplaceholder.typicode.com/users')
users = response.json()

users_df = pd.DataFrame(users)[['id', 'name', 'email', 'company']]
users_df['company'] = users_df['company'].apply(lambda x: x['name'])
users_df
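
The `apply` call above flattens a nested JSON object into a single column. The same step can be sketched in plain Python, which makes the structure easier to see. The record below is invented for illustration and only mimics the nested shape of a user with a `company` object.

```python
import json

# Invented record that mimics the nested shape of a user object
raw = """
[
  {"id": 1, "name": "Ada Example", "email": "ada@example.com",
   "company": {"name": "Example Ltd", "catchPhrase": "Illustrative only"}}
]
"""
users = json.loads(raw)

# Keep the flat fields and pull the company name out of the nested dictionary
rows = [
    {
        "id": user["id"],
        "name": user["name"],
        "email": user["email"],
        "company": user["company"]["name"],
    }
    for user in users
]

print(rows)
```

Each row is now a flat dictionary, which is the shape `pd.DataFrame` and CSV files expect.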

## Example 3: Google Maps Places API (Advanced - Optional)

**Note**: This requires an API key from Google Cloud Platform. The free tier allows limited requests.

### Getting an API Key

1. Go to [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project
3. Enable "Places API"
4. Create credentials > API Key
5. Restrict your key to only Places API

**Important**: The Places API returns at most 5 reviews per place, and the free tier has monthly quota limits.
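
Once you have a key, avoid pasting it directly into a notebook that you might share. One common approach, sketched below, is to read it from an environment variable. The variable name `GOOGLE_MAPS_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something Google requires.

```python
import os


def load_api_key(env_var="GOOGLE_MAPS_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return key


try:
    api_key = load_api_key()
    print("API key loaded (not printed, to keep it out of the notebook output).")
except RuntimeError as error:
    print(error)
```

The key then never appears in the saved notebook, so it cannot leak when the file is committed or shared.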

In [None]:
# Example code for Google Maps API (requires API key)
# Remove the surrounding triple quotes and add your API key to use

'''
import requests

api_key = 'YOUR_API_KEY_HERE'
place_id = 'ChIJ98CZIJrHh0gRWApM5esemkY'  # Edinburgh Castle

url = 'https://maps.googleapis.com/maps/api/place/details/json'
params = {
    'place_id': place_id,
    'fields': 'name,rating,reviews',
    'key': api_key
}

response = requests.get(url, params=params)
data = response.json()

if data['status'] == 'OK':
    result = data['result']
    print(f"Place: {result['name']}")
    print(f"Rating: {result['rating']}")
    print(f"\nReviews (max 5):")
    for review in result.get('reviews', []):
        print(f"  - {review['author_name']}: {review['rating']}/5")
        print(f"    {review['text'][:100]}...")
else:
    print(f"Error: {data['status']}")
'''

print('Google Maps API example - requires API key')
print('See the quoted code above for implementation details')

### Using the googlemaps Library

For easier Google Maps API access, you can use the official Python library:

In [None]:
# Install googlemaps library (uncomment to install)
# !pip install googlemaps

'''
import googlemaps

gmaps = googlemaps.Client(key='YOUR_API_KEY_HERE')

# Search for a place
result = gmaps.places('Edinburgh Castle')
place_id = result['results'][0]['place_id']

# Get place details including reviews
place = gmaps.place(place_id)
reviews = place['result'].get('reviews', [])

print(f"Found {len(reviews)} reviews (API limit: 5)")
'''

print('Google Maps library example - requires API key')

---

# Summary and Best Practices

## Decision Checklist

1. **Always check for an API first** - it's the cleanest solution
2. **For static HTML** - use `requests` + `BeautifulSoup`
3. **For JavaScript-rendered content** - use `Playwright`

## Best Practices

### Legal & Ethical
- Check `robots.txt` (e.g., `https://example.com/robots.txt`)
- Read the website's Terms of Service
- Don't scrape personal data without consent
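
Checking `robots.txt` can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules from a string so that it runs offline; the rules and the `EducationalScraper` user agent name are assumptions for this example.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file lives at https://<site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "EducationalScraper"  # hypothetical name for our scraper
public_allowed = parser.can_fetch(user_agent, "https://example.com/public/page.html")
private_allowed = parser.can_fetch(user_agent, "https://example.com/private/page.html")
delay = parser.crawl_delay(user_agent)

print("Public page allowed:", public_allowed)
print("Private page allowed:", private_allowed)
print("Requested delay (seconds):", delay)
```

For a live site, call `parser.set_url('https://example.com/robots.txt')` followed by `parser.read()` instead of `parse()`. A permissive `robots.txt` does not replace reading the Terms of Service.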

### Technical
- Add delays between requests (`time.sleep(1)`)
- Handle errors gracefully (`try/except`)
- Use appropriate User-Agent headers
- Cache results to avoid repeated requests
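
Caching and delays combine naturally: only pages that are not yet cached trigger a real request and a pause. The sketch below is one minimal way to do this under stated assumptions; `make_cached_fetcher` is a hypothetical helper, and `fake_fetch` stands in for a real download so the example runs offline.

```python
import time


def make_cached_fetcher(fetch, delay_seconds=1.0):
    """Wrap a fetch function so each URL is downloaded at most once."""
    cache = {}

    def cached_fetch(url):
        if url in cache:
            return cache[url]          # cache hit: no request, no delay
        time.sleep(delay_seconds)      # be polite before every real request
        cache[url] = fetch(url)
        return cache[url]

    return cached_fetch


calls = []


def fake_fetch(url):
    """Stand-in for a real download; records which URLs were requested."""
    calls.append(url)
    return f"<html>{url}</html>"


get_page = make_cached_fetcher(fake_fetch, delay_seconds=0)  # no delay for the demo
get_page("https://example.com/a")
get_page("https://example.com/a")   # served from the cache
get_page("https://example.com/b")

print(calls)
```

This in-memory cache disappears when the kernel restarts; for longer projects, saving the downloaded HTML to disk serves the same purpose.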

### Code Quality
- Store data in structured formats (CSV, JSON)
- Document your scraping logic
- Test with small samples first
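
Storing results in a structured format takes only a few lines with the standard library. The sketch below writes the same records as JSON and as CSV into in-memory text buffers. The two records are invented placeholders, and joining the tags with `;` is just one possible convention for fitting a list into a single CSV cell.

```python
import csv
import io
import json

# Invented placeholder records in the same shape as the scraped quotes
quotes = [
    {"text": "An example quote.", "author": "A. Author", "tags": ["example", "demo"]},
    {"text": "Another example.", "author": "B. Writer", "tags": ["demo"]},
]

# JSON keeps nested lists as they are
json_text = json.dumps(quotes, indent=2)

# CSV is flat, so the list of tags is joined into one string per row
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["text", "author", "tags"])
writer.writeheader()
for quote in quotes:
    writer.writerow({**quote, "tags": ";".join(quote["tags"])})
csv_text = csv_buffer.getvalue()

print(json_text)
print(csv_text)
```

To write real files, replace the buffer with `open('quotes.csv', 'w', newline='', encoding='utf-8')` and save `json_text` with a regular `open(...).write(...)` call.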

In [None]:
# Example: Good scraping practices template
import requests
from bs4 import BeautifulSoup
import time

def scrape_with_best_practices(url):
    """Example of responsible web scraping."""
    
    # Use a descriptive User-Agent
    headers = {
        'User-Agent': 'Educational Web Scraper (University Project)'
    }
    
    try:
        # Make request
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise error for bad status codes
        
        # Parse content
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Be respectful - wait before next request
        time.sleep(1)
        
        return soup
        
    except requests.exceptions.RequestException as e:
        print(f'Error fetching {url}: {e}')
        return None

# Test the function
soup = scrape_with_best_practices('https://quotes.toscrape.com/')
if soup:
    title = soup.find('title').text
    print(f'Successfully scraped: {title}')

---

## What's Next?

- Complete the exercises in `Week1-Exercise.ipynb`
- Try the assessment preparation challenge
- (Optional) Explore Selenium in `Week1-optional-selenium.ipynb`

---

*End of Week 1 Notes*