# Chapter 10: Data Acquisition from APIs and Web Sources

In earlier chapters, we focused on analyzing data that already exists in files or databases. In the real world, a lot of your work starts one step earlier: **getting the data**.

Data rarely comes to you in a perfect format. As a data analyst, you'll often need to:
- Pull live data from web services (APIs)
- Extract information from websites (web scraping)
- Handle different data formats (JSON, XML, HTML)
- Deal with authentication, rate limits, and errors

This chapter teaches you beginner-friendly, practical workflows to acquire data from external sources reliably and ethically.

---

## What you'll learn in this chapter

| Section | Topic | Key Skills |
|---------|-------|------------|
| 10.1 | Types of data sources | Identify where data comes from |
| 10.2 | REST API fundamentals | Understand how APIs work |
| 10.3 | Making API requests | Use Python to call APIs |
| 10.4 | Authentication and tokens | Secure API access |
| 10.5 | Handling JSON and XML | Parse common data formats |
| 10.6 | Web scraping principles | Ethical data extraction |
| 10.7 | HTML parsing | Extract data from web pages |
| 10.8 | Dynamic content scraping | Handle JavaScript-rendered pages |
| 10.9 | Rate limits and error handling | Build robust data pipelines |
| 10.10 | Legal and ethical considerations | Stay compliant and ethical |

---

## Learning goals

By the end of this chapter, you will be able to:

1. **Explain** the different types of data sources (files, databases, APIs, web pages)
2. **Understand** REST API basics: endpoints, methods, parameters, and status codes
3. **Make** reliable API requests with proper error handling and timeouts
4. **Store** API tokens safely using environment variables
5. **Parse** JSON, XML, and HTML data into clean DataFrames
6. **Recognize** when Selenium is needed for dynamic content
7. **Apply** polite rate-limiting and retry logic
8. **Describe** key legal and ethical considerations for data acquisition

---

## Why this matters

> "80% of a data analyst's time is spent getting and cleaning data."  
> (Common industry observation)

Understanding data acquisition is essential because:
- **Real-world data is messy**: It comes from multiple sources in different formats
- **APIs are everywhere**: Weather, finance, social media, and government data are all available via APIs
- **Automation saves time**: Once you can fetch data programmatically, you can automate reports
- **Ethics matter**: Knowing what you can and cannot do protects you and your organization

## Setup: Required Libraries

Before we begin, let's import all the libraries we'll use throughout this chapter.

### Libraries overview

| Library | Purpose | Installation |
|---------|---------|--------------|
| `requests` | Make HTTP requests to APIs | `pip install requests` |
| `beautifulsoup4` | Parse HTML/XML documents | `pip install beautifulsoup4` |
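| `pandas` | Data manipulation | `pip install pandas` (pre-installed in distributions such as Anaconda) |
| `matplotlib` | Quick validation plots | `pip install matplotlib` |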
| `json` | Parse JSON data | Built-in (no install needed) |
| `xml.etree.ElementTree` | Parse XML data | Built-in (no install needed) |

### Offline-friendly examples

This notebook is designed to work even without internet access. When we make API calls, we include fallback sample data so you can continue learning.

> **Tip:** If you get import errors, open a terminal and run:
> ```
> pip install requests beautifulsoup4 matplotlib
> ```
>
> Avoid running `pip` directly inside notebooks unless you understand your environment well, because it can install packages into a different Python environment than the one your notebook kernel uses.

In [None]:
# =============================================================================
# IMPORTS: Libraries used throughout this chapter
# =============================================================================

# Built-in libraries (no installation needed)
import json                          # Parse JSON data
import os                            # Access environment variables
import time                          # Add delays between requests
import xml.etree.ElementTree as ET   # Parse XML data
from typing import Any, Dict, Optional  # Type hints for cleaner code

# Data manipulation
import pandas as pd

# HTTP requests library (needs installation)
try:
    import requests
    print("✓ requests library loaded successfully")
except ImportError:
    requests = None
    print("✗ requests not installed. Run: pip install requests")

# HTML parsing library (needs installation)
try:
    from bs4 import BeautifulSoup
    print("✓ BeautifulSoup library loaded successfully")
except ImportError:
    BeautifulSoup = None
    print("✗ beautifulsoup4 not installed. Run: pip install beautifulsoup4")

# Visualization
try:
    import matplotlib.pyplot as plt
    print("✓ matplotlib library loaded successfully")
except ImportError:
    plt = None
    print("✗ matplotlib not installed. Run: pip install matplotlib")

# Configure pandas display options for better output
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

print("\n--- Setup complete ---")

---

## 10.1 Types of Data Sources

As a data analyst, you'll acquire data from many different places. Understanding where data comes from helps you choose the right tools and techniques.

### Common data sources

| Source Type | Examples | Typical Format | How to Access |
|-------------|----------|----------------|---------------|
| **Files** | CSV, Excel, JSON files | Structured / Semi-structured | `pandas.read_csv()`, `pandas.read_excel()` |
| **Databases** | SQLite, PostgreSQL, MySQL | Structured (tables) | SQL queries via Python |
| **APIs** | Weather APIs, Finance APIs, Social media | Usually JSON (sometimes XML) | HTTP requests |
| **Web pages** | News sites, e-commerce, dashboards | Unstructured HTML | Web scraping |
| **Logs / Events** | App logs, clickstream data | Semi-structured (text/JSON) | File parsing, streaming |

### Understanding data structure levels

Think of data structure as a spectrum:

```
STRUCTURED ◄─────────────────────────────────────────► UNSTRUCTURED
    │                      │                                │
  Tables               JSON/XML                        Raw HTML
  (rows & columns)     (nested but organized)          (must extract)
```

- **Structured data**: Neat rows and columns, like a spreadsheet. Easy to analyze directly.
- **Semi-structured data**: Has organization (like JSON objects), but not a flat table. Needs some transformation.
- **Unstructured data**: Raw text or HTML where you must identify and extract the pieces you need.
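
To make the middle of that spectrum concrete, here is a minimal sketch of turning one semi-structured record into a flat, table-ready row. The `station` record and the `flatten` helper are made up for illustration; in practice `pandas.json_normalize` does a similar job for you.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical semi-structured record (nested, but organized)
station = {
    "city": "London",
    "reading": {"temperature": 7.1, "unit": "celsius"},
    "meta": {"source": {"id": 42}},
}

row = flatten(station)
print(row)
```

Each flattened dictionary can become one row of a DataFrame. Lists inside the record are left untouched by this sketch; they usually deserve their own table.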

### Choosing the right approach

```
Do you need external data?
         │
         ▼
    Is there an API?
      /         \
    YES          NO
     │            │
     ▼            ▼
  Use API    Is the data in HTML?
                /         \
              YES          NO
               │            │
               ▼            ▼
           Scrape       Look for
           (carefully)   other sources
```

> **Best Practice:** Always prefer an **official API** when available. APIs are:
> - More reliable (designed for programmatic access)
> - More stable (less likely to change without notice)
> - Usually faster and cleaner
>
> Scraping should be your **last resort** when no API exists.
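
The decision flow above is simple enough to write down as a function. This is only a sketch of the flowchart; the function name and the returned labels are illustrative, not part of any library.

```python
def choose_method(has_api: bool, data_in_html: bool) -> str:
    """Mirror the flowchart: prefer an API, fall back to careful scraping."""
    if has_api:
        return "use the API"
    if data_in_html:
        return "scrape (carefully)"
    return "look for other sources"


print(choose_method(has_api=True, data_in_html=True))
print(choose_method(has_api=False, data_in_html=True))
print(choose_method(has_api=False, data_in_html=False))
```

Note that the API check comes first: even when the data is also visible in HTML, an available API wins.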

In [None]:
# =============================================================================
# EXAMPLE: Visualizing data source types
# =============================================================================

# Let's create a simple visualization showing different data source types
# and their relative usage in data analytics

data_sources = {
    'Source': ['Files (CSV/Excel)', 'Databases (SQL)', 'APIs', 'Web Scraping', 'Logs/Events'],
    'Ease of Use': [5, 4, 3, 2, 3],  # 1-5 scale (5 = easiest)
    'Data Quality': [4, 5, 4, 2, 3],  # 1-5 scale (5 = highest quality)
    'Common in Industry': [5, 5, 4, 2, 4]  # 1-5 scale (5 = most common)
}

df_sources = pd.DataFrame(data_sources)
df_sources.set_index('Source', inplace=True)

print("Data Source Comparison (1-5 scale, 5 = best):")
print("=" * 60)
display(df_sources)

# Create a simple bar chart
if plt:
    fig, ax = plt.subplots(figsize=(10, 5))
    df_sources.plot(kind='bar', ax=ax, rot=15)
    ax.set_title('Comparison of Data Source Types', fontsize=14)
    ax.set_ylabel('Score (1-5)')
    ax.set_ylim(0, 6)
    ax.legend(loc='upper right')
    plt.tight_layout()
    plt.show()

---

## 10.2 REST API Fundamentals

**API** stands for **Application Programming Interface**. It's a way for programs to talk to each other.

When you use an app on your phone to check the weather, that app is making an **API call** to a weather service to get the data. We can do the same thing with Python!

### What is REST?

**REST** (Representational State Transfer) is the most common style for web APIs. Most APIs you'll encounter as a data analyst follow REST conventions, at least loosely.

Think of a REST API like a restaurant:
- The **menu** = API documentation (tells you what's available)
- Your **order** = API request (what you want)
- The **waiter** = HTTP protocol (delivers your request)
- Your **food** = API response (the data you receive)

### Key REST Concepts

#### 1. Endpoints (URLs)

An **endpoint** is a specific URL that gives you access to a resource.

```
https://api.example.com/weather         ← Get weather data
https://api.example.com/users           ← Get user data
https://api.example.com/products/123    ← Get product #123
```

#### 2. HTTP Methods

The **method** tells the API what you want to do:

| Method | Purpose | Analogy |
|--------|---------|---------|
| `GET` | Read/retrieve data | "Show me the menu" |
| `POST` | Create new data | "Place a new order" |
| `PUT` | Update existing data | "Change my order" |
| `DELETE` | Remove data | "Cancel my order" |

> **For data analytics, you'll use `GET` almost all of the time** because you're reading data, not creating or modifying it.

#### 3. Query Parameters

**Parameters** filter or customize your request. They appear after a `?` in the URL:

```
https://api.example.com/weather?city=London&units=metric
                               └───────────┴────────────┘
                                    Query parameters
```
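
You can take a URL like this apart with Python's standard library, which is handy when you are reading API documentation or debugging a request. A small sketch using the same example URL:

```python
from urllib.parse import parse_qs, urlsplit

url = "https://api.example.com/weather?city=London&units=metric"

parts = urlsplit(url)          # Split the URL into its components
query = parse_qs(parts.query)  # Parse the query string into a dict of lists

print(f"Scheme: {parts.scheme}")
print(f"Host:   {parts.netloc}")
print(f"Path:   {parts.path}")
print(f"Query:  {query}")
```

`parse_qs` returns a list for every key because a parameter may legally appear more than once in a query string.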

#### 4. Status Codes

The server responds with a **status code** telling you what happened:

| Code | Meaning | What to Do |
|------|---------|------------|
| **200** | ✅ Success | Process the data |
| **400** | ❌ Bad request | Check your parameters |
| **401** | 🔐 Unauthorized | Check your API key |
| **403** | 🚫 Forbidden | You don't have permission |
| **404** | ❓ Not found | Check the endpoint URL |
| **429** | ⏱️ Rate limited | Slow down, wait before retrying |
| **500** | 💥 Server error | Try again later |

> **Warning:** Always set a **timeout** on your requests. The `requests` library applies no timeout by default, so without one your code can hang indefinitely if the server doesn't respond!
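
One way to turn the status code table into code is a small function that maps a code to the next action. This is a sketch, not a standard: the function name and action labels are invented here, and treating 502, 503, and 504 as retryable is an added assumption (they are common temporary gateway errors, but they are not in the table above).

```python
# Codes that usually signal a temporary problem worth retrying.
# 429 and 500 come from the table; 502/503/504 are an assumption.
RETRYABLE = {429, 500, 502, 503, 504}


def next_action(status_code: int) -> str:
    """Suggest what to do after receiving an HTTP status code."""
    if 200 <= status_code < 300:
        return "process"
    if status_code in RETRYABLE:
        return "retry_later"
    if status_code in (401, 403):
        return "check_credentials"
    if 400 <= status_code < 500:
        return "fix_request"
    return "investigate"


for code in (200, 400, 401, 404, 429, 500):
    print(code, "->", next_action(code))
```

The key distinction is between errors that waiting can fix (rate limits, server trouble) and errors that only a change to your request or credentials can fix.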

In [None]:
# =============================================================================
# EXAMPLE: Understanding URL structure
# =============================================================================

# Let's break down a typical API URL

example_url = "https://api.open-meteo.com/v1/forecast"
example_params = {
    'latitude': 51.5072,
    'longitude': -0.1276,
    'hourly': 'temperature_2m',
    'timezone': 'UTC'
}

print("Breaking down an API request:")
print("=" * 60)
print(f"\n1. BASE URL: {example_url}")
print(f"   - Protocol: https (secure)")
print(f"   - Domain: api.open-meteo.com")
print(f"   - Path: /v1/forecast")
print(f"\n2. PARAMETERS:")
for key, value in example_params.items():
    print(f"   - {key}: {value}")

# Show what the full URL would look like
# (urlencode is part of the standard library, so this works without requests)
from urllib.parse import urlencode
full_url = f"{example_url}?{urlencode(example_params)}"
print(f"\n3. FULL URL (what gets sent to the server):")
print(f"   {full_url}")

---

## 10.3 Making API Requests

Now let's actually make some API requests! We'll use the `requests` library, which is the standard tool for HTTP requests in Python.

### The basic pattern

```python
import requests

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # Raise an error if request failed
data = response.json()       # Convert JSON response to Python dict
```

### Why we need best practices

Without proper handling, your code can:
- ❌ Hang forever (no timeout)
- ❌ Crash on network errors
- ❌ Produce confusing error messages

Let's build a **safe helper function** that handles these issues.

> **Common Beginner Mistake:** Building URLs by hand with string concatenation.
> 
> ❌ Bad: `url = base + "?city=" + city + "&units=" + units`
> 
> ✅ Good: `requests.get(url, params={'city': city, 'units': units})`
>
> Using `params={}` lets Python handle URL encoding safely (special characters, spaces, etc.).
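
To see why hand-built URLs break, compare the two approaches on values that contain a space and an `&`. The sketch below uses the standard library's `urlencode` so it runs offline; passing `params=` to `requests` applies the same kind of encoding for you. The `note` parameter is made up for the demonstration.

```python
from urllib.parse import urlencode

base = "https://api.example.com/weather"
city = "New York"     # contains a space
note = "rain&wind"    # contains an ampersand (hypothetical parameter)

# Hand-built: the space is left raw and the '&' splits the value in two
by_hand = base + "?city=" + city + "&note=" + note

# Encoded: special characters are escaped so the server sees the intended values
encoded = base + "?" + urlencode({"city": city, "note": note})

print("By hand:", by_hand)
print("Encoded:", encoded)
```

In the hand-built URL the server would read `note=rain` followed by a separate, empty parameter called `wind`, which is not what was intended.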

In [None]:
# =============================================================================
# HELPER FUNCTION: Safe API request with timeout and error handling
# =============================================================================

def fetch_json(
    url: str,
    params: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
    timeout_s: int = 20
) -> Dict[str, Any]:
    """
    Fetch JSON data from an API safely.
    
    Parameters:
    -----------
    url : str
        The API endpoint URL
    params : dict, optional
        Query parameters to include in the request
    headers : dict, optional
        HTTP headers (for authentication, etc.)
    timeout_s : int
        Maximum seconds to wait for response (default: 20)
    
    Returns:
    --------
    dict
        The JSON response as a Python dictionary
    
    Raises:
    -------
    ImportError
        If requests library is not installed
    requests.HTTPError
        If the server returns an error status code
    requests.Timeout
        If the request takes longer than timeout_s
    """
    # Check if requests library is available
    if requests is None:
        raise ImportError('requests is not installed. Run: pip install requests')
    
    # Make the request with timeout
    response = requests.get(url, params=params, headers=headers, timeout=timeout_s)
    
    # Raise an exception if status code indicates an error (4xx or 5xx)
    response.raise_for_status()
    
    # Parse JSON and return
    return response.json()


# =============================================================================
# EXAMPLE: Call a real API (Open-Meteo weather API - no key required!)
# =============================================================================

# Open-Meteo is a free weather API that doesn't require registration
open_meteo_url = 'https://api.open-meteo.com/v1/forecast'

# Parameters for London, UK
open_meteo_params = {
    'latitude': 51.5072,      # London's latitude
    'longitude': -0.1276,     # London's longitude
    'hourly': 'temperature_2m',  # We want hourly temperature
    'timezone': 'UTC',        # Use UTC timezone
}

# Fallback sample data (so the notebook works without internet)
fallback_api_json = {
    'hourly': {
        'time': [
            '2026-01-01T00:00', '2026-01-01T01:00', '2026-01-01T02:00',
            '2026-01-01T03:00', '2026-01-01T04:00', '2026-01-01T05:00',
            '2026-01-01T06:00', '2026-01-01T07:00', '2026-01-01T08:00',
        ],
        'temperature_2m': [7.1, 6.9, 6.6, 6.4, 6.2, 6.0, 5.9, 6.1, 6.5],
    }
}

# Try to make the API call
try:
    api_json = fetch_json(open_meteo_url, params=open_meteo_params)
    print("✅ API call succeeded!")
    print(f"   Response contains {len(api_json.get('hourly', {}).get('time', []))} hourly records")
    print(f"   Top-level keys: {list(api_json.keys())}")
except Exception as e:
    print(f"⚠️ API call failed (this is OK if you're offline)")
    print(f"   Error: {type(e).__name__} - {e}")
    print("   Using fallback sample data instead...")
    api_json = fallback_api_json

### Converting API responses to DataFrames

APIs often return **nested JSON** (dictionaries inside dictionaries). For analysis, we need to extract the relevant data and put it into a flat table (DataFrame).

**Strategy:** Don't try to force ALL the nested JSON into a DataFrame. Instead:
1. Explore the response structure
2. Identify the data you need
3. Extract just that part
4. Convert to DataFrame

Let's look at what our API response contains:

In [None]:
# =============================================================================
# STEP 1: Explore the API response structure
# =============================================================================

print("Exploring the API response structure:")
print("=" * 60)

# Show the top-level keys
print(f"\nTop-level keys: {list(api_json.keys())}")

# Look at what's inside 'hourly'
if 'hourly' in api_json:
    hourly_data = api_json['hourly']
    print(f"\nKeys inside 'hourly': {list(hourly_data.keys())}")
    print(f"Number of time records: {len(hourly_data.get('time', []))}")
    print(f"\nFirst 3 timestamps: {hourly_data.get('time', [])[:3]}")
    print(f"First 3 temperatures: {hourly_data.get('temperature_2m', [])[:3]}")

In [None]:
# =============================================================================
# STEP 2: Create a function to convert this specific API response to DataFrame
# =============================================================================

def open_meteo_to_dataframe(data: Dict[str, Any]) -> pd.DataFrame:
    """
    Convert Open-Meteo API response to a clean pandas DataFrame.
    
    Parameters:
    -----------
    data : dict
        The JSON response from Open-Meteo API
    
    Returns:
    --------
    pd.DataFrame
        A DataFrame with 'time' and 'temperature_2m' columns
    """
    # Extract the hourly data section
    hourly = data.get('hourly', {})
    
    # Create DataFrame from the parallel lists
    df = pd.DataFrame({
        'time': hourly.get('time', []),
        'temperature_2m': hourly.get('temperature_2m', []),
    })
    
    # Convert time strings to proper datetime objects
    # errors='coerce' turns invalid dates into NaT (Not a Time) instead of crashing
    df['time'] = pd.to_datetime(df['time'], errors='coerce', utc=True)
    
    # Remove any rows where time conversion failed
    df = df.dropna(subset=['time'])
    
    return df

# Convert our API data to a DataFrame
df_weather = open_meteo_to_dataframe(api_json)

print("Converted to DataFrame:")
print("=" * 60)
print(f"Shape: {df_weather.shape[0]} rows × {df_weather.shape[1]} columns")
print(f"Columns: {list(df_weather.columns)}")
print(f"Data types:\n{df_weather.dtypes}")
print(f"\nFirst 5 rows:")
df_weather.head()

### Visual validation: Quick sanity check

Even during data acquisition, it's useful to quickly visualize the data. This helps you catch obvious issues:
- ❌ Missing time periods (gaps in the line)
- ❌ Wrong units (values way too high or low)
- ❌ Wrong timezone (times shifted unexpectedly)
- ❌ Data quality issues (sudden spikes or drops)

> **Common Mistake:** Plotting before converting timestamps properly. Always convert time strings to datetime objects first!
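
A plot makes gaps visible, but you can also test for them directly. The sketch below checks a list of hourly timestamps for missing periods using only the standard library. The `find_gaps` helper and the sample timestamps are invented for illustration, and the one-hour spacing is an assumption that matches the hourly data used in this chapter.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def find_gaps(
    timestamps: List[str],
    expected: timedelta = timedelta(hours=1),
) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs where consecutive timestamps are further apart than expected."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > expected]


# Hypothetical sample: 02:00 and 03:00 are missing
sample = ["2026-01-01T00:00", "2026-01-01T01:00", "2026-01-01T04:00", "2026-01-01T05:00"]

gaps = find_gaps(sample)
for start, end in gaps:
    print(f"Gap between {start} and {end}")
```

An empty result means the series has no gaps larger than the expected spacing. It does not tell you whether the values themselves are plausible, so the plot is still worth a look.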

In [None]:
# =============================================================================
# VISUAL VALIDATION: Plot the temperature data
# =============================================================================

if plt:
    # Sort by time to ensure proper line plot
    df_plot = df_weather.sort_values('time').copy()
    
    # Create the plot
    fig, ax = plt.subplots(figsize=(12, 5))
    
    ax.plot(df_plot['time'], df_plot['temperature_2m'], 
            linewidth=2, color='steelblue', marker='o', markersize=3)
    
    # Add labels and title
    ax.set_title('Temperature Over Time (API Data)', fontsize=14, fontweight='bold')
    ax.set_xlabel('Time (UTC)', fontsize=11)
    ax.set_ylabel('Temperature (°C)', fontsize=11)
    
    # Add grid for easier reading
    ax.grid(True, alpha=0.3)
    
    # Add a horizontal line at freezing point for reference
    ax.axhline(y=0, color='lightblue', linestyle='--', linewidth=1, label='Freezing point')
    ax.legend()
    
    # Rotate x-axis labels for readability
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Quick statistics
    print(f"\nQuick statistics:")
    print(f"  Min temperature: {df_plot['temperature_2m'].min():.1f}°C")
    print(f"  Max temperature: {df_plot['temperature_2m'].max():.1f}°C")
    print(f"  Mean temperature: {df_plot['temperature_2m'].mean():.1f}°C")
else:
    print("matplotlib not available for plotting")

---

## 10.4 Authentication and Tokens

Many APIs require you to prove who you are before they give you data. This is called **authentication**.

### Why APIs require authentication

- **Rate limiting**: Track how many requests you make
- **Billing**: Charge for API usage
- **Access control**: Limit what data you can see
- **Security**: Prevent abuse

### Common authentication methods

| Method | How it works | Example |
|--------|--------------|---------|
| **API Key in URL** | Pass key as query parameter | `?api_key=abc123` |
| **API Key in Header** | Pass key in HTTP header | `X-API-Key: abc123` |
| **Bearer Token** | OAuth-style token in header | `Authorization: Bearer abc123` |
| **Basic Auth** | `username:password`, Base64-encoded, in HTTP header | `Authorization: Basic <base64 of user:password>` (less common for APIs) |

### 🔐 CRITICAL: Keep secrets safe!

> **Warning:** Never hard-code API keys or tokens directly in your code!
>
> If you commit secrets to a shared or public Git repository, anyone with access can read them, and they stay in the commit history even after you delete the line. Assume any leaked token is **compromised** and rotate it immediately.

### Best practice: Use environment variables

Environment variables store sensitive values outside your code.

**Setting environment variables:**

```powershell
# Windows PowerShell (temporary - current session only)
$env:MY_API_TOKEN = "your_secret_token_here"

# Windows Command Prompt
set MY_API_TOKEN=your_secret_token_here
```

```bash
# Linux/Mac
export MY_API_TOKEN="your_secret_token_here"
```

**Reading environment variables in Python:**

```python
import os
token = os.environ.get('MY_API_TOKEN')  # Returns None if not set
```
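
When a script cannot work without a token, it is often clearer to fail early with a helpful message than to send unauthenticated requests and get a `401` later. A minimal sketch of that idea follows; the `require_env` helper is hypothetical, not part of the standard library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is missing."""
    # Strip surrounding whitespace, which often sneaks in through copy-paste
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Set it in your shell before running this script."
        )
    return value


try:
    token = require_env("MY_API_TOKEN")
    print("Token loaded (value not printed on purpose)")
except RuntimeError as err:
    print(f"Configuration problem: {err}")
```

Use `os.environ.get()` on its own when a token is optional, and a helper like this when the script cannot do anything useful without it.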

In [None]:
# =============================================================================
# EXAMPLE: Using environment variables for API tokens
# =============================================================================

# Try to read a token from environment variable
token = os.environ.get('MY_API_TOKEN')

print("Reading API token from environment:")
print("=" * 60)

if token:
    # Mask the token for display (show only first/last few characters)
    masked = (token[:4] + '...' + token[-4:]) if len(token) > 10 else '****'
    print(f"✅ Token found: {masked}")
else:
    print("ℹ️ No token found (MY_API_TOKEN environment variable not set)")
    print("   This is expected - we're just demonstrating the pattern")

# Build headers dictionary for authenticated requests
headers: Dict[str, str] = {}

if token:
    # Bearer token format (most common for modern APIs)
    headers['Authorization'] = f'Bearer {token}'
    print(f"\n   Headers prepared for authenticated request")
else:
    print(f"\n   Headers will be empty (no authentication)")

print(f"\nHeaders dict: {headers if headers else '(empty - no auth needed for Open-Meteo)'}")

In [None]:
# =============================================================================
# EXAMPLE: Different authentication patterns
# =============================================================================

def create_auth_headers(auth_type: str, token: str) -> Dict[str, str]:
    """
    Create headers for different authentication methods.
    
    Parameters:
    -----------
    auth_type : str
        One of: 'bearer', 'api_key_header', 'basic'
    token : str
        The authentication token or API key
    
    Returns:
    --------
    dict
        Headers dictionary ready to use with requests
    """
    if auth_type == 'bearer':
        # OAuth 2.0 Bearer Token (most common for modern APIs)
        return {'Authorization': f'Bearer {token}'}
    
    elif auth_type == 'api_key_header':
        # API Key in header (common for simpler APIs)
        return {'X-API-Key': token}
    
    elif auth_type == 'basic':
        # Basic authentication (username:password base64 encoded)
        import base64
        encoded = base64.b64encode(token.encode()).decode()
        return {'Authorization': f'Basic {encoded}'}
    
    else:
        raise ValueError(f"Unknown auth_type: {auth_type}")

# Demonstrate the different patterns (with a dummy token)
demo_token = "demo_token_12345"

print("Different authentication header formats:")
print("=" * 60)
print(f"\n1. Bearer Token:")
print(f"   {create_auth_headers('bearer', demo_token)}")
print(f"\n2. API Key Header:")
print(f"   {create_auth_headers('api_key_header', demo_token)}")
print(f"\n3. Basic Auth:")
print(f"   {create_auth_headers('basic', 'user:password')}")

---

## 10.5 Handling JSON and XML Data

APIs return data in standard formats. The two most common are:
- **JSON** (JavaScript Object Notation) - Modern, widely used
- **XML** (eXtensible Markup Language) - Older, still used in some systems

### JSON: The modern standard

JSON is the most common format for APIs today. It looks like this:

```json
{
    "city": "London",
    "temperature": 7.5,
    "conditions": ["cloudy", "mild"],
    "metadata": {
        "source": "weather-api",
        "updated": "2026-01-03T10:00:00Z"
    }
}
```

**Key characteristics:**
- Uses `{}` for objects (like Python dictionaries)
- Uses `[]` for arrays (like Python lists)
- Supports strings, numbers, booleans, null
- Easy to convert to/from Python objects
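
The characteristics above map directly onto Python types. The short sketch below parses a made-up JSON string and prints the Python type that `json.loads` produces for each value.

```python
import json

# Hypothetical JSON text covering every JSON value type
text = (
    '{"name": "London", "temp": 7.5, "count": 3, "ok": true, '
    '"note": null, "tags": ["a", "b"], "meta": {"k": "v"}}'
)

parsed = json.loads(text)

for key, value in parsed.items():
    print(f"{key:>5}: {type(value).__name__:<8} {value!r}")

# Converting back to JSON text restores true/null spelling
print(json.dumps({"ok": True, "note": None}))
```

Note the spelling differences: JSON's `true`, `false`, and `null` become Python's `True`, `False`, and `None`.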

### XML: The legacy format

XML uses tags and is more verbose:

```xml
<weather>
    <city>London</city>
    <temperature>7.5</temperature>
    <conditions>
        <condition>cloudy</condition>
        <condition>mild</condition>
    </conditions>
</weather>
```

> **Tip:** When working with XML APIs, always check if there's a JSON alternative. JSON is usually easier to work with in Python.

In [None]:
# =============================================================================
# WORKING WITH JSON
# =============================================================================

# JSON text (as you might receive from an API)
json_text = '''
{
    "city": "London",
    "country": "UK",
    "measurements": [
        {"type": "temperature", "value": 7.1, "unit": "celsius"},
        {"type": "humidity", "value": 80, "unit": "percent"},
        {"type": "wind_speed", "value": 15, "unit": "km/h"}
    ],
    "metadata": {
        "source": "weather-station-42",
        "last_updated": "2026-01-03T08:00:00Z"
    }
}
'''

# Parse JSON string into Python objects
data = json.loads(json_text)

print("Working with JSON data:")
print("=" * 60)
print(f"\nType of parsed data: {type(data)}")
print(f"Top-level keys: {list(data.keys())}")
print(f"\nAccessing simple values:")
print(f"  City: {data['city']}")
print(f"  Country: {data['country']}")
print(f"\nAccessing nested values:")
print(f"  Source: {data['metadata']['source']}")
print(f"\nAccessing list items:")
print(f"  First measurement: {data['measurements'][0]}")

In [None]:
# =============================================================================
# CONVERTING JSON TO DATAFRAME
# =============================================================================

# The 'measurements' list is perfect for a DataFrame
df_measurements = pd.DataFrame(data['measurements'])

print("Converting JSON list to DataFrame:")
print("=" * 60)
display(df_measurements)

# You can also add context from the parent object
df_measurements['city'] = data['city']
df_measurements['source'] = data['metadata']['source']

print("\nWith added context:")
display(df_measurements)

In [None]:
# =============================================================================
# WORKING WITH XML
# =============================================================================

# XML text (as you might receive from a legacy API)
xml_text = '''<?xml version="1.0" encoding="UTF-8"?>
<products>
    <product id="1">
        <name>Apple</name>
        <price currency="USD">1.20</price>
        <category>Fruit</category>
        <in_stock>true</in_stock>
    </product>
    <product id="2">
        <name>Banana</name>
        <price currency="USD">0.80</price>
        <category>Fruit</category>
        <in_stock>true</in_stock>
    </product>
    <product id="3">
        <name>Orange Juice</name>
        <price currency="USD">3.50</price>
        <category>Beverage</category>
        <in_stock>false</in_stock>
    </product>
</products>
'''

# Parse the XML string; ET.fromstring() returns the root Element
root = ET.fromstring(xml_text)

print("Working with XML data:")
print("=" * 60)
print(f"\nRoot element: <{root.tag}>")
print(f"Number of products: {len(root.findall('product'))}")

# Extract data into a list of dictionaries
rows = []
for product in root.findall('product'):
    rows.append({
        'id': product.get('id'),  # Attributes use .get()
        'name': product.findtext('name'),  # Text content uses .findtext()
        'price': float(product.findtext('price')),
        'currency': product.find('price').get('currency'),
        'category': product.findtext('category'),
        'in_stock': product.findtext('in_stock') == 'true'
    })

# Convert to DataFrame
df_products = pd.DataFrame(rows)

print("\nXML data converted to DataFrame:")
display(df_products)

### JSON vs XML: Quick comparison

| Aspect | JSON | XML |
|--------|------|-----|
| **Readability** | Compact, easy to read | Verbose, more tags |
| **Python parsing** | `json.loads()` → dict/list | `ET.fromstring()` → `Element` (the root node) |
| **Data types** | Native (string, number, bool, null) | Everything is text |
| **Attributes** | Not supported | Supported (e.g., `id="1"`) |
| **Modern APIs** | Most common | Less common |

> **Tip:** Always validate your assumptions about data types when parsing XML. Numbers and booleans come as strings and need manual conversion.
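
One defensive way to handle this is to route every XML value through a small conversion helper that returns `None` instead of crashing on missing or malformed text. The helpers `to_float` and `to_bool` and the sample XML below are illustrative; which strings count as true or false is an assumption you should adapt to your data source.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def to_float(text: Optional[str]) -> Optional[float]:
    """Convert XML text to float; return None if missing or not a number."""
    try:
        return float(text.strip())
    except (AttributeError, ValueError):
        # AttributeError: text is None (element missing); ValueError: not numeric
        return None


def to_bool(text: Optional[str]) -> Optional[bool]:
    """Convert XML text to bool; return None if missing or unrecognized."""
    if text is None:
        return None
    value = text.strip().lower()
    if value in ("true", "1", "yes"):
        return True
    if value in ("false", "0", "no"):
        return False
    return None


# Hypothetical product with inconsistent casing and a non-numeric weight
item = ET.fromstring(
    "<product><price>1.20</price><in_stock>TRUE</in_stock><weight>n/a</weight></product>"
)

print(to_float(item.findtext("price")))
print(to_bool(item.findtext("in_stock")))
print(to_float(item.findtext("weight")))
print(to_float(item.findtext("missing")))
```

Returning `None` keeps the row in your dataset so you can count and inspect the failures later, instead of silently losing data or stopping the whole run.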

---

## 10.6 Web Scraping Principles

**Web scraping** means downloading web pages and extracting data from the HTML. It's a powerful technique, but comes with responsibilities.

### When to scrape vs when NOT to scrape

| ✅ Good reasons to scrape | ❌ Bad reasons to scrape |
|--------------------------|-------------------------|
| No API available | There's an API you're ignoring |
| Public data for research | Private/personal data |
| One-time data collection | Continuous high-volume access |
| Terms of Service allow it | Terms explicitly forbid it |
| You respect rate limits | You want to "get all the data fast" |

### The polite scraping checklist

Before you scrape any website, go through this checklist:

1. **📋 Check Terms of Service (ToS)**
   - Many sites explicitly prohibit scraping
   - Violating ToS can have legal consequences

2. **🤖 Check robots.txt**
   - Visit `https://example.com/robots.txt`
   - It tells crawlers what's allowed/disallowed
   - Not legally binding, but a strong ethical guideline

3. **⏱️ Respect rate limits**
   - Add delays between requests (at least 1 second)
   - Don't overload servers

4. **🆔 Identify yourself**
   - Use a reasonable User-Agent header
   - Include contact info if doing research

5. **🔒 Avoid personal data**
   - Don't scrape emails, names, or private information
   - Consider GDPR and other privacy regulations

> **Warning:** Just because you *can* scrape something doesn't mean you *should*. When in doubt, ask for permission or look for an API.
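
The robots.txt step of the checklist can be automated with Python's built-in `urllib.robotparser`. The sketch below parses a small rule set from a string so it runs offline; the rules, the `example.com` URLs, and the `my-research-bot` user agent are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (normally downloaded from the site)
rules = """
User-agent: *
Allow: /public/
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # Parse rules from a list of lines

agent = "my-research-bot"  # hypothetical User-Agent name

print(parser.can_fetch(agent, "https://example.com/public/page.html"))
print(parser.can_fetch(agent, "https://example.com/private/data.html"))
print(parser.crawl_delay(agent))
```

For a live site you would call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`. A `True` from `can_fetch` only covers robots.txt; you still need to check the Terms of Service yourself.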

In [None]:
# =============================================================================
# EXAMPLE: Checking robots.txt
# =============================================================================

# Let's see what a robots.txt file looks like
sample_robots_txt = """
# Example robots.txt file
User-agent: *
Allow: /public/
Disallow: /private/
Disallow: /admin/
Disallow: /api/internal/
Crawl-delay: 10

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

print("Understanding robots.txt:")
print("=" * 60)
print(sample_robots_txt)
print("=" * 60)
print("\nKey points:")
print("• 'User-agent: *' applies to every crawler that has no section of its own")
print("• 'Disallow: /private/' means don't access /private/ paths")
print("• 'Crawl-delay: 10' means wait 10 seconds between requests")
print("• Always check the site's actual robots.txt before scraping")

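Python's standard library can evaluate these rules for you. The sketch below feeds a shortened copy of the sample file to `urllib.robotparser.RobotFileParser` and asks whether a hypothetical crawler named `CourseResearchBot` may fetch a few example paths. The crawler name and the paths are assumptions chosen for the demo.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the sample robots.txt shown above
robots_txt = """
User-agent: *
Allow: /public/
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse from text instead of downloading

user_agent = "CourseResearchBot"  # hypothetical crawler name
for path in ["/public/catalog.html", "/private/accounts.html", "/products"]:
    url = f"https://example.com{path}"
    print(f"{path}: allowed = {parser.can_fetch(user_agent, url)}")

print("Crawl delay:", parser.crawl_delay(user_agent))

# For a live site, load the real file instead of a string:
# parser.set_url("https://example.com/robots.txt")
# parser.read()
```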
---

## 10.7 HTML Parsing with BeautifulSoup

**BeautifulSoup** (installed as the `beautifulsoup4` package) is one of Python's most popular libraries for parsing HTML. It turns messy HTML into a navigable tree structure.

### How HTML is structured

HTML documents are made of nested **elements** (tags):

```html
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>Main Heading</h1>
    <p class="intro">This is a paragraph.</p>
    <table id="data">
      <tr><th>Name</th><th>Value</th></tr>
      <tr><td>Item 1</td><td>100</td></tr>
    </table>
  </body>
</html>
```

### Key BeautifulSoup methods

| Method | Purpose | Example |
|--------|---------|---------|
| `soup.find(tag)` | Find first matching element | `soup.find('h1')` |
| `soup.find_all(tag)` | Find all matching elements | `soup.find_all('tr')` |
| `soup.find(tag, {'attr': 'value'})` | Find by attribute | `soup.find('table', {'id': 'data'})` |
| `element.get_text()` | Get text content | `h1.get_text()` |
| `element.get('attr')` | Get attribute value | `link.get('href')` |
| `element.find_all('child')` | Find within element | `table.find_all('tr')` |

Let's practice with a sample HTML page:

In [None]:
# =============================================================================
# HTML PARSING: Sample web page
# =============================================================================

# Check if BeautifulSoup is available
if BeautifulSoup is None:
    raise ImportError('beautifulsoup4 not installed. Run: pip install beautifulsoup4')

# A sample HTML page (simulating what you'd get from requests.get())
sample_html = '''
<!DOCTYPE html>
<html>
  <head>
    <title>Online Store - Products</title>
  </head>
  <body>
    <h1>Welcome to Our Store</h1>
    
    <p class="description">Find the best products at great prices!</p>
    
    <table id="products" class="data-table">
      <thead>
        <tr>
          <th>Product</th>
          <th>Category</th>
          <th>Price</th>
          <th>Rating</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Laptop Pro 15</td>
          <td>Electronics</td>
          <td>$1299.99</td>
          <td>4.5</td>
        </tr>
        <tr>
          <td>Wireless Mouse</td>
          <td>Accessories</td>
          <td>$29.99</td>
          <td>4.2</td>
        </tr>
        <tr>
          <td>USB-C Hub</td>
          <td>Accessories</td>
          <td>$49.99</td>
          <td>4.7</td>
        </tr>
      </tbody>
    </table>
    
    <div class="links">
      <p>Useful links:</p>
      <ul>
        <li><a href="/about">About Us</a></li>
        <li><a href="/contact">Contact</a></li>
        <li><a href="https://external.com/reviews">Customer Reviews</a></li>
      </ul>
    </div>
    
    <footer>
      <p>Last updated: January 2026</p>
    </footer>
  </body>
</html>
'''

# Parse the HTML
soup = BeautifulSoup(sample_html, 'html.parser')

print("HTML document parsed successfully!")
print("=" * 60)

# Extract basic information
print(f"\nPage title: {soup.find('title').get_text()}")
print(f"Main heading: {soup.find('h1').get_text()}")
print(f"Description: {soup.find('p', {'class': 'description'}).get_text()}")

In [None]:
# =============================================================================
# EXTRACTING TABLE DATA
# =============================================================================

# Find the products table by its id attribute
table = soup.find('table', {'id': 'products'})

# Extract column headers from <th> elements
headers = [th.get_text(strip=True) for th in table.find_all('th')]
print(f"Table headers: {headers}")

# Extract data rows from <tbody>
tbody = table.find('tbody')
rows = []

for tr in tbody.find_all('tr'):
    # Get all <td> elements in this row
    cells = tr.find_all('td')
    
    row_data = {
        'product': cells[0].get_text(strip=True),
        'category': cells[1].get_text(strip=True),
        'price': cells[2].get_text(strip=True),
        'rating': cells[3].get_text(strip=True)
    }
    rows.append(row_data)

# Convert to DataFrame
df_scraped = pd.DataFrame(rows)

print("\nExtracted table data:")
display(df_scraped)

In [None]:
# =============================================================================
# CLEANING SCRAPED DATA
# =============================================================================

# The price column has $ symbols and the rating is a string
# Let's clean these up for analysis

df_clean = df_scraped.copy()

# Remove $ and convert to float
df_clean['price'] = df_clean['price'].str.replace('$', '', regex=False).astype(float)

# Convert rating to float
df_clean['rating'] = df_clean['rating'].astype(float)

print("Cleaned DataFrame:")
print("=" * 60)
display(df_clean)
print(f"\nData types after cleaning:")
print(df_clean.dtypes)

In [None]:
# =============================================================================
# EXTRACTING LINKS
# =============================================================================

# Find all <a> tags (hyperlinks)
all_links = soup.find_all('a')

# Extract link information
links_data = []
for link in all_links:
    links_data.append({
        'text': link.get_text(strip=True),
        'href': link.get('href'),
        'is_external': link.get('href', '').startswith('http')
    })

df_links = pd.DataFrame(links_data)

print("Extracted links:")
display(df_links)

# Filter to just external links
external_links = df_links[df_links['is_external']]
print(f"\nExternal links found: {len(external_links)}")

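The `startswith('http')` test above is a quick heuristic: it would also flag an absolute link that points back to the same site. A slightly more careful sketch resolves every `href` against the page's address with `urllib.parse.urljoin` and then compares host names. The base URL below is an assumption, since the sample HTML was never downloaded from a real address.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://store.example.com/products"  # assumed address of the scraped page
hrefs = ["/about", "/contact", "https://external.com/reviews"]

base_host = urlparse(base_url).netloc

resolved = []
for href in hrefs:
    absolute = urljoin(base_url, href)                     # turn relative links into full URLs
    is_external = urlparse(absolute).netloc != base_host   # compare host names
    resolved.append((href, absolute, is_external))
    print(f"{href} -> {absolute} (external: {is_external})")
```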
---

## 10.8 Dynamic Content and JavaScript-Rendered Pages

Some websites load their content using JavaScript **after** the initial HTML loads. This means:
- When you download the HTML, the data isn't there yet
- BeautifulSoup only sees the "skeleton" page
- The actual data gets filled in by JavaScript running in a browser

### How to detect dynamic content

1. **View page source** vs **Inspect element**
   - Right-click → "View Page Source" shows raw HTML (what `requests` sees)
   - Right-click → "Inspect" shows the DOM after JavaScript runs
   - If the data you want appears only in the inspected DOM, the page uses dynamic loading

2. **Look for JavaScript frameworks**
   - React, Vue, and Angular often load data dynamically
   - Single-page applications (SPAs) are almost always dynamic

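The "page source vs. inspect" comparison can be approximated in code: take a value you can see in the browser and check whether it occurs in the HTML that was actually downloaded. The helper name `is_in_raw_html` and both HTML strings below are made up for this sketch.

```python
def is_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """True if text visible in the browser is also present in the downloaded HTML."""
    return expected_text.casefold() in raw_html.casefold()

# Two invented responses: one rendered on the server, one filled in later by JavaScript
static_html = "<html><body><table id='data'><tr><td>Laptop Pro 15</td></tr></table></body></html>"
dynamic_html = "<html><body><div id='root'></div><script src='/app.js'></script></body></html>"

print(is_in_raw_html(static_html, "Laptop Pro 15"))   # True
print(is_in_raw_html(dynamic_html, "Laptop Pro 15"))  # False
```

If the check returns `False` for text you can clearly see in the browser, the content is probably added by JavaScript and `requests` plus BeautifulSoup alone will not find it.
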
### Your options for dynamic content

| Approach | Pros | Cons |
|----------|------|------|
| **Find the API** | Fast, clean data, most reliable | Requires investigation |
| **Selenium** | Can handle any dynamic content | Slow, resource-heavy, complex |
| **Playwright** | Modern alternative to Selenium | Still complex |
| **Look for a static version** | Some sites offer one | Not always available |

> **Best Practice:** Before using Selenium, open your browser's Developer Tools (F12) → Network tab. Look for XHR/Fetch requests that load data. You can often call those APIs directly!

In [None]:
# =============================================================================
# SELENIUM EXAMPLE (Demonstration - not executed)
# =============================================================================

# Selenium automates a real browser, so it can handle JavaScript
# Installation: pip install selenium
# A browser driver (ChromeDriver, GeckoDriver, etc.) is also needed;
# Selenium 4.6+ can download a matching driver automatically (Selenium Manager)

selenium_example_code = '''
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Start a Chrome browser
driver = webdriver.Chrome()

try:
    # Navigate to the page
    driver.get('https://example.com/dynamic-page')
    
    # Wait for a specific element to appear (max 10 seconds)
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'data-table'))
    )
    
    # The element is now present in the DOM - extract its HTML
    table_html = element.get_attribute('outerHTML')
    
    # You can then parse with BeautifulSoup
    soup = BeautifulSoup(table_html, 'html.parser')
    # ... extract data as before ...
    
finally:
    # Always close the browser
    driver.quit()
'''

print("Selenium example (not executed - for reference):")
print("=" * 60)
print(selenium_example_code)
print("=" * 60)
print("\nKey points about Selenium:")
print("• Needs a browser driver (ChromeDriver, etc.); Selenium 4.6+ can fetch one automatically")
print("• Much slower than direct requests (starts a real browser)")
print("• Uses more memory and resources")
print("• Can handle login forms, clicking, scrolling")
print("• Consider it a last resort after trying to find APIs")

---

## 10.9 Rate Limits and Error Handling

In real-world data acquisition, **things go wrong**. Networks fail, servers get busy, and APIs limit how fast you can make requests.

### Common problems and solutions

| Problem | Status Code | Solution |
|---------|-------------|----------|
| Network timeout | N/A | Set reasonable timeout, retry |
| Rate limited | 429 | Wait and retry (exponential backoff) |
| Server overloaded | 503 | Wait and retry |
| Bad request | 400 | Fix your parameters |
| Unauthorized | 401 | Check your API key |
| Not found | 404 | Check the URL |
| Server error | 500 | Retry later |

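The table can be condensed into a small decision helper. This is a sketch, not a standard: the function name `classify_status` is invented here, and treating 502 and 504 as retryable is an added assumption (they are not in the table, but are commonly handled as temporary gateway problems).

```python
# 429, 500 and 503 come from the table above; 502 and 504 are an added assumption
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def classify_status(status_code: int) -> str:
    """Map an HTTP status code to an action: 'ok', 'retry', or 'fail'."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE_STATUS_CODES:
        return "retry"
    return "fail"  # e.g. 400, 401, 404: repeating the same request will not help

for code in (200, 400, 401, 404, 429, 500, 503):
    print(code, "->", classify_status(code))
```

Keeping the retryable codes in one set makes the policy easy to review and change in a single place.
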
### Exponential backoff

Instead of retrying immediately (which can make things worse), **exponential backoff** waits longer after each failure:

```
Attempt 1 fails → wait 1 second
Attempt 2 fails → wait 2 seconds
Attempt 3 fails → wait 4 seconds
Attempt 4 fails → wait 8 seconds
...give up after max retries
```

This is polite to servers and more likely to succeed.

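The schedule above follows the formula `delay = base × 2^(attempt − 1)`. The sketch below computes it for any number of attempts. The upper limit (`cap`) and the random jitter are optional refinements added here as assumptions; they do not appear in the schedule above.

```python
import random

def backoff_delays(max_retries: int, base_sleep: float = 1.0, cap: float = 60.0) -> list:
    """Seconds to wait after each failed attempt: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base_sleep * 2 ** attempt) for attempt in range(max_retries)]

def with_jitter(delay: float, spread: float = 0.25) -> float:
    """Randomize a delay by +/- spread so many clients do not retry at the same instant."""
    return delay * random.uniform(1 - spread, 1 + spread)

delays = backoff_delays(4)
print(delays)  # [1.0, 2.0, 4.0, 8.0]
print([round(with_jitter(d), 2) for d in delays])  # values vary from run to run
```

With the default `base_sleep` of 1 second, four attempts wait 1 + 2 + 4 + 8 = 15 seconds in total, which is why a small `max_retries` is usually enough.
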
> **Common Beginner Mistake:** Retrying forever in a tight loop. This can get your IP blocked and wastes resources.

In [None]:
# =============================================================================
# ROBUST API FUNCTION: With retries and exponential backoff
# =============================================================================

def fetch_json_with_retries(
    url: str,
    params: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
    timeout_s: int = 20,
    max_retries: int = 3,
    base_sleep: float = 1.0
) -> Dict[str, Any]:
    """
    Fetch JSON from an API with automatic retry logic.
    
    Uses exponential backoff: waits longer after each failure.
    Handles rate limits (429) specially.
    
    Parameters:
    -----------
    url : str
        The API endpoint URL
    params : dict, optional
        Query parameters
    headers : dict, optional
        HTTP headers
    timeout_s : int
        Request timeout in seconds
    max_retries : int
        Maximum number of attempts, including the first request
    base_sleep : float
        Initial wait in seconds (doubles after each failed attempt)
    
    Returns:
    --------
    dict
        The JSON response as a Python dictionary
    
    Raises:
    -------
    The last exception if all retries fail
    """
    if requests is None:
        raise ImportError('requests not installed. Run: pip install requests')
    
    last_error: Optional[Exception] = None
    
    for attempt in range(1, max_retries + 1):
        try:
            # Make the request
            response = requests.get(url, params=params, headers=headers, timeout=timeout_s)
            
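            # Handle rate limiting specially
            if response.status_code == 429:
                # Record the failure so it is raised if every attempt is rate limited
                last_error = requests.exceptions.HTTPError(
                    f"429 Too Many Requests for url: {response.url}", response=response
                )
                # Only sleep if another attempt will follow
                if attempt < max_retries:
                    sleep_time = base_sleep * (2 ** (attempt - 1))
                    print(f"  ⏱️ Rate limited (429). Waiting {sleep_time:.1f}s (attempt {attempt}/{max_retries})...")
                    time.sleep(sleep_time)
                continue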
            
            # Raise exception for other HTTP errors
            response.raise_for_status()
            
            # Success!
            return response.json()
            
        except requests.exceptions.Timeout as e:
            last_error = e
            print(f"  ⏱️ Timeout on attempt {attempt}/{max_retries}")
            
        except requests.exceptions.ConnectionError as e:
            last_error = e
            print(f"  🔌 Connection error on attempt {attempt}/{max_retries}")
            
        except requests.exceptions.HTTPError as e:
            last_error = e
            print(f"  ❌ HTTP error on attempt {attempt}/{max_retries}: {e}")
            # Don't retry client errors (4xx except 429)
            if 400 <= response.status_code < 500 and response.status_code != 429:
                raise
            
        except Exception as e:
            last_error = e
            print(f"  ❌ Unexpected error on attempt {attempt}/{max_retries}: {e}")
        
        # If we're going to retry, wait with exponential backoff
        if attempt < max_retries:
            sleep_time = base_sleep * (2 ** (attempt - 1))
            print(f"  💤 Waiting {sleep_time:.1f}s before retry...")
            time.sleep(sleep_time)
    
    # All retries exhausted
    raise last_error if last_error else RuntimeError('Request failed')


# Test the function
print("Testing fetch_json_with_retries:")
print("=" * 60)

try:
    test_data = fetch_json_with_retries(
        open_meteo_url,
        params=open_meteo_params,
        max_retries=2
    )
    print("✅ Request succeeded!")
    print(f"   Got {len(test_data.get('hourly', {}).get('time', []))} hourly records")
except Exception as e:
    print(f"⚠️ Request failed (OK if offline): {type(e).__name__}")

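Many servers that answer with 429 or 503 also send a `Retry-After` header saying how long to wait, either as a number of seconds or as an HTTP date. The function above does not read that header; the helper below is a possible extension, sketched with the standard library only. The name `retry_after_seconds` and its `now` parameter (used to make the demo repeatable) are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def retry_after_seconds(header_value: Optional[str], default: float = 1.0,
                        now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():                      # form 1: delay in seconds, e.g. "120"
        return float(value)
    try:                                     # form 2: HTTP date
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default                       # unreadable header: fall back
    if retry_at.tzinfo is None:              # treat naive dates as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())

fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(retry_after_seconds(None, default=2.0))

# Inside a retry loop, the value would come from the response:
# wait = retry_after_seconds(response.headers.get("Retry-After"), default=sleep_time)
```
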
In [None]:
# =============================================================================
# POLITE SCRAPING: Adding delays between requests
# =============================================================================

def polite_delay(min_seconds: float = 1.0, max_seconds: float = 3.0) -> None:
    """
    Add a random delay between requests to be polite to servers.
    
    Using a random range prevents predictable patterns that might
    look like bot behavior.
    """
    import random
    delay = random.uniform(min_seconds, max_seconds)
    print(f"  💤 Polite delay: {delay:.2f}s")
    time.sleep(delay)

# Example of polite scraping pattern (not actually executed against a site)
print("Polite scraping pattern:")
print("=" * 60)
print("""
# Example: scraping multiple pages politely

urls_to_scrape = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
]

all_data = []

for url in urls_to_scrape:
    try:
        # Fetch the page
        response = requests.get(url, timeout=20)
        response.raise_for_status()
        
        # Parse and extract data
        soup = BeautifulSoup(response.text, 'html.parser')
        # ... extract data into `extracted_data` ...
        
        all_data.append(extracted_data)
        
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        # The loop simply moves on to the next URL
    
    # BE POLITE: wait before the next request, even after an error
    polite_delay(1.0, 3.0)
""")

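A random pause is one option. When a site states a fixed minimum gap between requests (for example through `Crawl-delay`), a small limiter that tracks the time of the previous request is another. The class below is a sketch under that assumption; the name `MinIntervalLimiter` is invented, and the 0.2 second interval is only there to keep the demo fast.

```python
import time

class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept

limiter = MinIntervalLimiter(min_interval=0.2)  # use e.g. 10.0 to honor "Crawl-delay: 10"
for page in range(3):
    slept = limiter.wait()
    print(f"request {page + 1}: waited {slept:.2f}s before sending")
```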
---

## 10.10 Legal and Ethical Considerations

Data acquisition comes with responsibilities. Just because you *can* access data doesn't mean you *should*.

### Key considerations

#### 1. 🔒 Privacy and Consent
- Avoid collecting personal data without explicit consent
- Consider: "Would this person expect their data to be collected?"
- Be especially careful with sensitive data (health, finances, etc.)

#### 2. 📋 Terms of Service
- Read the ToS of sites/APIs you use
- Many prohibit automated access or commercial use
- Violations can lead to legal action

#### 3. 🤖 robots.txt
- Not legally binding, but ethically important
- Shows what the site owner wants crawlers to access
- Respecting it shows good faith

#### 4. 📜 Copyright and Licensing
- Data may be protected by copyright
- Check licensing before republishing
- Attribution requirements may apply

#### 5. 🌍 Regulatory Compliance
- **GDPR** (EU): Strict rules on personal data
- **CCPA** (California): Consumer privacy rights
- **Other regional laws**: Many countries have data protection laws

#### 6. 🏢 Organizational Policies
- Your company may have approved data sources
- Ask your legal/compliance team when in doubt
- Document your data sources and methods

### Ethical decision framework

Ask yourself these questions before collecting data:

```
1. Is this data truly necessary for my analysis?
2. Am I authorized to access and use this data?
3. Could this data collection harm anyone?
4. Am I being transparent about my methods?
5. Would I be comfortable if this collection was made public?
```

> **Golden Rule:** Treat data collection the way you'd want your own data treated.

In [None]:
# =============================================================================
# CHECKLIST: Before starting any data acquisition project
# =============================================================================

data_acquisition_checklist = """
╔══════════════════════════════════════════════════════════════════════════════╗
║                    DATA ACQUISITION ETHICAL CHECKLIST                        ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  □ PURPOSE: Is this data necessary for my legitimate business purpose?       ║
║                                                                              ║
║  □ AUTHORIZATION: Do I have permission to access this data source?           ║
║                                                                              ║
║  □ TERMS OF SERVICE: Have I read and do I comply with the ToS?               ║
║                                                                              ║
║  □ ROBOTS.TXT: Does robots.txt allow my access pattern?                      ║
║                                                                              ║
║  □ RATE LIMITS: Am I respecting rate limits and being polite?                ║
║                                                                              ║
║  □ PERSONAL DATA: Am I avoiding unnecessary personal data collection?        ║
║                                                                              ║
║  □ PRIVACY LAWS: Do I comply with GDPR, CCPA, and other regulations?         ║
║                                                                              ║
║  □ DOCUMENTATION: Have I documented my data sources and methods?             ║
║                                                                              ║
║  □ SECURITY: Am I storing any credentials/tokens securely?                   ║
║                                                                              ║
║  □ ORGANIZATIONAL: Does this comply with my company's policies?              ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
"""

print(data_acquisition_checklist)

---

# Exercises

Practice what you've learned with these exercises. Try to solve them **before** looking at the solutions.

---

## Exercise 1: Build a Safe API Request Function

**Goal:** Create a function that fetches data from an API safely.

**Requirements:**
1. Accept a URL and optional parameters
2. Set a reasonable timeout (e.g., 15 seconds)
3. Raise an error if the request fails
4. Return the JSON response as a Python dictionary

**Test your function** with the Open-Meteo API for New York City:
- Latitude: 40.7128
- Longitude: -74.0060
- Hourly data: temperature_2m

---

## Exercise 2: Convert Nested JSON to DataFrame

**Goal:** Practice extracting specific data from nested JSON structures.

**Given this JSON structure:**
```python
api_response = {
    "status": "success",
    "location": {"city": "Tokyo", "country": "Japan"},
    "forecast": [
        {"date": "2026-01-03", "high": 10, "low": 3, "conditions": "sunny"},
        {"date": "2026-01-04", "high": 8, "low": 2, "conditions": "cloudy"},
        {"date": "2026-01-05", "high": 12, "low": 5, "conditions": "sunny"}
    ]
}
```

**Tasks:**
1. Extract the forecast list into a DataFrame
2. Add a column for the city name
3. Calculate the temperature range (high - low) as a new column

---

## Exercise 3: Parse HTML and Extract Table Data

**Goal:** Use BeautifulSoup to extract a table from HTML.

**Given HTML:** (provided in the code cell below)

**Tasks:**
1. Parse the HTML with BeautifulSoup
2. Find the table by its class name
3. Extract all rows into a list of dictionaries
4. Convert to a DataFrame
5. Clean the data types (convert strings to numbers where appropriate)

---

## Exercise 4: Thinking Exercise - API vs Scraping

**Goal:** Practice making decisions about data acquisition approaches.

**Scenario:** You want to get historical stock prices for analysis.

**Questions to answer:**
1. What are some potential sources for this data?
2. For each source, would you use an API or scraping?
3. What ethical/legal considerations apply?
4. What could go wrong and how would you handle it?

Write your answers in a markdown cell or comments.

In [None]:
# =============================================================================
# EXERCISE 1: Build a Safe API Request Function
# =============================================================================

# YOUR SOLUTION HERE:
# -------------------
# def my_fetch_json(url, params=None, timeout=15):
#     """Fetch JSON from an API safely."""
#     pass  # Implement this!

# Test parameters for New York City
exercise1_params = {
    'latitude': 40.7128,
    'longitude': -74.0060,
    'hourly': 'temperature_2m',
    'timezone': 'UTC',
}

# Test your function:
# result = my_fetch_json(open_meteo_url, params=exercise1_params)
# print(result.keys())

### Exercise 1 - Sample Solution

<details>
<summary>Click to reveal solution</summary>

```python
def my_fetch_json(url, params=None, timeout=15):
    """Fetch JSON from an API safely."""
    if requests is None:
        raise ImportError('requests not installed')
    
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

</details>

In [None]:
# =============================================================================
# EXERCISE 2: Convert Nested JSON to DataFrame
# =============================================================================

# Given data
api_response = {
    "status": "success",
    "location": {"city": "Tokyo", "country": "Japan"},
    "forecast": [
        {"date": "2026-01-03", "high": 10, "low": 3, "conditions": "sunny"},
        {"date": "2026-01-04", "high": 8, "low": 2, "conditions": "cloudy"},
        {"date": "2026-01-05", "high": 12, "low": 5, "conditions": "sunny"}
    ]
}

# YOUR SOLUTION HERE:
# -------------------
# 1. Extract the forecast list into a DataFrame
# df_forecast = ...

# 2. Add a column for the city name
# df_forecast['city'] = ...

# 3. Calculate temperature range
# df_forecast['temp_range'] = ...

# Display your result
# display(df_forecast)

In [None]:
# =============================================================================
# EXERCISE 2 - Sample Solution
# =============================================================================

print("Exercise 2 Solution:")
print("=" * 60)

# 1. Extract the forecast list into a DataFrame
df_forecast = pd.DataFrame(api_response['forecast'])

# 2. Add a column for the city name
df_forecast['city'] = api_response['location']['city']

# 3. Calculate temperature range
df_forecast['temp_range'] = df_forecast['high'] - df_forecast['low']

# Display result
display(df_forecast)

In [None]:
# =============================================================================
# EXERCISE 3: Parse HTML and Extract Table Data
# =============================================================================

# Given HTML
exercise_html = '''
<html>
<body>
    <h1>Sales Report</h1>
    <table class="sales-data">
        <tr>
            <th>Product</th>
            <th>Q1 Sales</th>
            <th>Q2 Sales</th>
            <th>Total</th>
        </tr>
        <tr>
            <td>Widget A</td>
            <td>$12,500</td>
            <td>$15,200</td>
            <td>$27,700</td>
        </tr>
        <tr>
            <td>Widget B</td>
            <td>$8,300</td>
            <td>$9,100</td>
            <td>$17,400</td>
        </tr>
        <tr>
            <td>Widget C</td>
            <td>$22,000</td>
            <td>$24,500</td>
            <td>$46,500</td>
        </tr>
    </table>
</body>
</html>
'''

# YOUR SOLUTION HERE:
# -------------------
# 1. Parse the HTML
# soup_ex = BeautifulSoup(exercise_html, 'html.parser')

# 2. Find the table by class name
# table_ex = ...

# 3. Extract rows into a list of dictionaries
# rows_ex = []
# for tr in ...:
#     ...

# 4. Convert to DataFrame
# df_sales = pd.DataFrame(rows_ex)

# 5. Clean the data (remove $ and , from numbers)
# ...

In [None]:
# =============================================================================
# EXERCISE 3 - Sample Solution
# =============================================================================

print("Exercise 3 Solution:")
print("=" * 60)

# 1. Parse the HTML
soup_ex = BeautifulSoup(exercise_html, 'html.parser')

# 2. Find the table by class name
table_ex = soup_ex.find('table', {'class': 'sales-data'})

# 3. Extract rows into a list of dictionaries
rows_ex = []
data_rows = table_ex.find_all('tr')[1:]  # Skip header row

for tr in data_rows:
    cells = tr.find_all('td')
    rows_ex.append({
        'product': cells[0].get_text(strip=True),
        'q1_sales': cells[1].get_text(strip=True),
        'q2_sales': cells[2].get_text(strip=True),
        'total': cells[3].get_text(strip=True)
    })

# 4. Convert to DataFrame
df_sales = pd.DataFrame(rows_ex)
print("\nRaw extracted data:")
display(df_sales)

# 5. Clean the data (remove $ and , from numbers)
def clean_currency(value):
    """Remove $ and commas, convert to float."""
    return float(value.replace('$', '').replace(',', ''))

for col in ['q1_sales', 'q2_sales', 'total']:
    df_sales[col] = df_sales[col].apply(clean_currency)

print("\nCleaned data:")
display(df_sales)
print(f"\nData types:\n{df_sales.dtypes}")

### Exercise 4 - Thinking Exercise: Sample Answer

**Scenario:** Getting historical stock prices

**Potential sources:**

| Source | API or Scrape? | Considerations |
|--------|---------------|----------------|
| Yahoo Finance | Unofficial library (yfinance) | Free and convenient, but not an official Yahoo API; it can break without notice, so check Yahoo's terms |
| Alpha Vantage | API | Free tier available, requires API key |
| Financial news sites | Scrape (last resort) | ToS often prohibit, data may be copyrighted |
| Bloomberg Terminal | API (if you have access) | Expensive, professional-grade |

**Ethical/Legal considerations:**
- Most financial data has copyright restrictions
- Redistribution may be prohibited
- Rate limits apply to free APIs
- Some data requires paid subscriptions

**What could go wrong:**
- API limits exceeded → Use rate limiting and caching
- API discontinued → Have backup data source
- Data format changes → Validate data structure
- Historical data gaps → Handle missing data gracefully

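Caching, mentioned in the first point, can be as simple as saving each response to a file named after the request. The sketch below assumes JSON responses and uses a stand-in `fake_fetch` function with invented prices so that it runs offline. The helper names, the URL, and the `ABC` symbol are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_path(cache_dir: Path, url: str, params: dict) -> Path:
    """Build a stable file name from the URL and its sorted query parameters."""
    key = url + "?" + json.dumps(params, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.json"

def get_with_cache(cache_dir: Path, url: str, params: dict, fetch) -> dict:
    """Return cached JSON if present; otherwise call `fetch` and store the result."""
    path = cache_path(cache_dir, url, params)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(url, params)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data

# Stand-in for a real API call; it records how often it is used
calls = []
def fake_fetch(url, params):
    calls.append(url)
    return {"symbol": params["symbol"], "close": [101.2, 102.8]}  # invented values

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    url = "https://api.example.com/prices"  # placeholder endpoint
    first = get_with_cache(cache_dir, url, {"symbol": "ABC"}, fake_fetch)
    second = get_with_cache(cache_dir, url, {"symbol": "ABC"}, fake_fetch)

print(first == second, len(calls))  # True 1 (the second call was served from disk)
```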
---

# Mini-Project: Complete Data Acquisition Pipeline

In this mini-project, you'll build a **complete, repeatable data pipeline** that:
1. Fetches data from an API (with error handling)
2. Converts the response to a clean DataFrame
3. Validates the data with a quick visualization
4. Saves the data to a CSV file for later analysis

This workflow mirrors what you'll do in real data analytics projects!

### Pipeline steps:

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   1. FETCH      │───▶│   2. TRANSFORM  │───▶│   3. VALIDATE   │
│   (API call)    │    │   (to DataFrame)│    │   (visualize)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                      │
                                                      ▼
                                              ┌─────────────────┐
                                              │   4. SAVE       │
                                              │   (to CSV)      │
                                              └─────────────────┘
```

In [None]:
# =============================================================================
# MINI-PROJECT: Complete Data Acquisition Pipeline
# =============================================================================

# Configuration
API_URL = 'https://api.open-meteo.com/v1/forecast'
PARAMS = {
    'latitude': 51.5072,      # London
    'longitude': -0.1276,
    'hourly': 'temperature_2m,relative_humidity_2m',  # Multiple variables
    'timezone': 'UTC',
}
OUTPUT_FILE = 'chapter10_weather_data.csv'

print("=" * 70)
print("MINI-PROJECT: Weather Data Acquisition Pipeline")
print("=" * 70)

# -----------------------------------------------------------------------------
# STEP 1: Fetch data from API (with fallback for offline use)
# -----------------------------------------------------------------------------
print("\n📥 STEP 1: Fetching data from API...")

fallback_data = {
    'hourly': {
        'time': [f'2026-01-01T{h:02d}:00' for h in range(24)],
        'temperature_2m': [5.0 + h * 0.3 for h in range(24)],
        'relative_humidity_2m': [80 - h * 1.5 for h in range(24)],
    }
}

try:
    raw_data = fetch_json_with_retries(API_URL, params=PARAMS, max_retries=2)
    print("   ✅ API call successful!")
except Exception as e:
    print(f"   ⚠️ API call failed: {e}")
    print("   Using fallback sample data...")
    raw_data = fallback_data

print(f"   Data contains {len(raw_data.get('hourly', {}).get('time', []))} hourly records")

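Before transforming the response, it can help to confirm that it has the shape the next step expects: pandas raises an error when the lists passed to `pd.DataFrame` have different lengths. The check below is a sketch; the function name `validate_hourly` is invented, and the required keys simply mirror the variables requested in `PARAMS`.

```python
def validate_hourly(raw: dict,
                    required=("time", "temperature_2m", "relative_humidity_2m")) -> list:
    """Return a list of problems found in the 'hourly' block (empty list means OK)."""
    hourly = raw.get("hourly")
    if not isinstance(hourly, dict):
        return ["missing 'hourly' object"]

    problems = [f"missing key: {key}" for key in required if key not in hourly]

    # All series that are present must have the same number of entries
    lengths = {key: len(hourly[key]) for key in required if key in hourly}
    if len(set(lengths.values())) > 1:
        problems.append(f"unequal lengths: {lengths}")
    return problems

# Two small made-up responses to exercise the check
good = {"hourly": {"time": ["2026-01-01T00:00"], "temperature_2m": [5.0],
                   "relative_humidity_2m": [80]}}
bad = {"hourly": {"time": ["2026-01-01T00:00", "2026-01-01T01:00"],
                  "temperature_2m": [5.0]}}

print(validate_hourly(good))
print(validate_hourly(bad))

# In the pipeline, the same check could run on the fetched data:
# problems = validate_hourly(raw_data)
```
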
In [None]:
# -----------------------------------------------------------------------------
# STEP 2: Transform to clean DataFrame
# -----------------------------------------------------------------------------
print("\n🔄 STEP 2: Transforming to DataFrame...")

def transform_weather_data(raw: Dict[str, Any]) -> pd.DataFrame:
    """Transform raw API response to a clean DataFrame."""
    hourly = raw.get('hourly', {})
    
    df = pd.DataFrame({
        'timestamp': hourly.get('time', []),
        'temperature_c': hourly.get('temperature_2m', []),
        'humidity_pct': hourly.get('relative_humidity_2m', []),
    })
    
    # Convert timestamp to proper datetime
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce', utc=True)
    
    # Drop rows whose timestamp could not be parsed (done before deriving
    # columns so that 'hour' stays an integer column)
    df = df.dropna(subset=['timestamp']).copy()
    
    # Add derived columns
    df['date'] = df['timestamp'].dt.date
    df['hour'] = df['timestamp'].dt.hour
    
    return df

pipeline_df = transform_weather_data(raw_data)

print(f"   ✅ Created DataFrame with {len(pipeline_df)} rows and {len(pipeline_df.columns)} columns")
print(f"   Columns: {list(pipeline_df.columns)}")
print("\n   First 5 rows:")
display(pipeline_df.head())

# -----------------------------------------------------------------------------
# STEP 3: Validate with visualization
# -----------------------------------------------------------------------------
print("\n📊 STEP 3: Validating data with visualization...")

if plt:
    fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True)
    
    df_plot = pipeline_df.sort_values('timestamp')
    
    # Plot 1: Temperature
    axes[0].plot(df_plot['timestamp'], df_plot['temperature_c'], 
                 color='orangered', linewidth=2)
    axes[0].set_ylabel('Temperature (°C)', fontsize=11)
    axes[0].set_title('Weather Data Validation', fontsize=14, fontweight='bold')
    axes[0].grid(True, alpha=0.3)
    axes[0].axhline(y=0, color='lightblue', linestyle='--', linewidth=1)
    
    # Plot 2: Humidity
    axes[1].plot(df_plot['timestamp'], df_plot['humidity_pct'], 
                 color='steelblue', linewidth=2)
    axes[1].set_ylabel('Humidity (%)', fontsize=11)
    axes[1].set_xlabel('Time (UTC)', fontsize=11)
    axes[1].grid(True, alpha=0.3)
    axes[1].set_ylim(0, 100)
    
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Quick statistics
    print("\n   📈 Quick statistics:")
    print(f"   Temperature: min={df_plot['temperature_c'].min():.1f}°C, "
          f"max={df_plot['temperature_c'].max():.1f}°C, "
          f"mean={df_plot['temperature_c'].mean():.1f}°C")
    print(f"   Humidity: min={df_plot['humidity_pct'].min():.0f}%, "
          f"max={df_plot['humidity_pct'].max():.0f}%, "
          f"mean={df_plot['humidity_pct'].mean():.0f}%")
else:
    print("   ⚠️ matplotlib not available for plotting")

# -----------------------------------------------------------------------------
# STEP 4: Save to CSV
# -----------------------------------------------------------------------------
print("\n💾 STEP 4: Saving to CSV...")

# Save the DataFrame
pipeline_df.to_csv(OUTPUT_FILE, index=False)

print(f"   ✅ Data saved to: {OUTPUT_FILE}")
print(f"   File size: {os.path.getsize(OUTPUT_FILE):,} bytes")

# Verify by reading back
df_verify = pd.read_csv(OUTPUT_FILE)
print(f"   Verification: Read back {len(df_verify)} rows")

print("\n" + "=" * 70)
print("✅ PIPELINE COMPLETE!")
print("=" * 70)

---

# Summary and Key Takeaways

## What we covered in this chapter

### 10.1 Types of Data Sources
- Data comes from files, databases, APIs, web pages, and logs
- Structured → Semi-structured → Unstructured spectrum
- **Always prefer APIs over scraping when available**

### 10.2 REST API Fundamentals
- REST APIs use endpoints (URLs), HTTP methods, parameters, and status codes
- GET method is used for reading data (most common in analytics)
- Status codes tell you if requests succeeded or failed

### 10.3 Making API Requests
- Use the `requests` library for HTTP calls
- **Always set a timeout** to prevent hanging
- Use `params={}` instead of building URLs manually
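
To illustrate the last point, here is a small sketch of what goes wrong when a URL is assembled by hand. It uses the standard-library `urllib.parse` module as a stand-in, on the assumption that `requests` applies a comparable encoding step when you pass `params=`. The query values are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

params = {"q": "rain & wind", "limit": 10}  # hypothetical query parameters

# Hand-built query string: the raw "&" inside the value splits the parameter.
naive_query = f"q={params['q']}&limit={params['limit']}"

# Encoded query string: spaces and "&" are escaped, so the value survives.
safe_query = urlencode(params)

print("naive:", parse_qs(naive_query)["q"])
print("safe: ", parse_qs(safe_query)["q"])
```

Parsing the two strings back shows that only the encoded version round-trips the original value, which is the main reason to let the library build the query string for you.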

### 10.4 Authentication and Tokens
- Many APIs require API keys or tokens
- **Never hard-code secrets in your code**
- Use environment variables to store credentials
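
A minimal sketch of reading a credential from the environment might look like the following. The variable name `WEATHER_API_TOKEN` and both helper functions are hypothetical, and the Bearer header is only one common scheme, so check your API's documentation for the format it actually expects.

```python
import os


def get_api_token(var_name: str = "WEATHER_API_TOKEN") -> str:
    """Read an API token from the environment (variable name is hypothetical)."""
    token = os.environ.get(var_name)
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


def build_auth_headers(token: str) -> dict:
    """Build a Bearer-style header (an assumed scheme; APIs differ)."""
    return {"Authorization": f"Bearer {token}"}
```

Because the token never appears in the source file, the notebook can be shared or committed without leaking it.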

### 10.5 Handling JSON and XML
- JSON is the most common format for modern APIs (parse it with `json.loads()` or `response.json()`)
- XML is more verbose (use `xml.etree.ElementTree`)
- Convert API responses to DataFrames for analysis
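
The contrast between the two formats can be sketched with the same records expressed both ways. The payloads below are made up for illustration and do not come from any real API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads carrying the same two readings in each format.
json_text = (
    '{"readings": [{"city": "London", "temp_c": 5.0},'
    ' {"city": "Paris", "temp_c": 7.5}]}'
)
xml_text = """
<readings>
  <reading city="London"><temp_c>5.0</temp_c></reading>
  <reading city="Paris"><temp_c>7.5</temp_c></reading>
</readings>
""".strip()

# JSON maps directly onto Python dicts and lists, with numbers already typed.
json_rows = json.loads(json_text)["readings"]

# XML needs explicit navigation, and every value arrives as a string.
root = ET.fromstring(xml_text)
xml_rows = [
    {"city": node.get("city"), "temp_c": float(node.findtext("temp_c"))}
    for node in root.findall("reading")
]

print(json_rows == xml_rows)
```

Either list of dictionaries can then be passed to `pd.DataFrame(...)` to get one row per reading.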

### 10.6 Web Scraping Principles
- Check ToS, robots.txt, and ethical considerations first
- Be polite: add delays, respect rate limits
- **Scraping is a last resort**
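
As a rough sketch of the robots.txt check, the standard-library `urllib.robotparser` module can answer "may I fetch this URL?" before any scraping starts. The rules, user-agent name, and URLs below are hypothetical; in practice you would load the site's real file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-course-bot"  # hypothetical user-agent name

allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/products")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/x")
delay_seconds = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay_seconds)
```

A passing robots.txt check is not permission on its own: the site's Terms of Service still apply.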

### 10.7 HTML Parsing
- BeautifulSoup makes HTML parsing easy
- Use `find()` and `find_all()` to search the parsed HTML tree
- Clean extracted data (remove currency symbols, convert types)
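
The cleaning step can be sketched with a small helper. `parse_price` is a hypothetical function, and it assumes prices use a dot as the decimal separator and a comma for thousands; other locales would need different handling.

```python
import re


def parse_price(text: str) -> float:
    """Convert a scraped price string such as '£1,299.50' to a float.

    Assumes '.' is the decimal separator and ',' groups thousands.
    """
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    return float(cleaned)


prices = [parse_price(raw) for raw in ["£1,299.50", " $45 ", "EUR 7.00"]]
print(prices)
```

Applying the helper to a scraped column before building the DataFrame gives a numeric column that supports sorting and arithmetic.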

### 10.8 Dynamic Content
- Some pages load content via JavaScript
- Try to find the underlying API first
- Use Selenium/Playwright as a last resort

### 10.9 Rate Limits and Error Handling
- Build robust code with retries and exponential backoff
- Handle specific HTTP errors appropriately
- Add polite delays between requests
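
The retry pattern can be sketched generically as below. This is not the chapter's `fetch_json_with_retries` function. `call_with_backoff` and `flaky` are hypothetical names, and the sketch retries only on `ConnectionError` to keep the example free of network calls.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            # Wait base_delay * 1, 2, 4, ... plus a little random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Usage: a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, base_delay=0.01, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

Passing the `sleep` function as a parameter is a design choice that keeps the example fast and testable; real code can leave the default `time.sleep` in place.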

### 10.10 Legal and Ethical Considerations
- Respect privacy, ToS, and copyright
- Know the regulations (GDPR, CCPA, etc.)
- When in doubt, ask for permission

---

## Quick Reference: Python Libraries

| Task | Library | Installation |
|------|---------|-------------|
| HTTP requests | `requests` | `pip install requests` |
| HTML parsing | `beautifulsoup4` | `pip install beautifulsoup4` |
| JSON parsing | `json` | Built-in |
| XML parsing | `xml.etree.ElementTree` | Built-in |
| Browser automation | `selenium` | `pip install selenium` |
| Data manipulation | `pandas` | `pip install pandas` |

---

## Further Reading

- **Requests documentation:** https://requests.readthedocs.io/
- **BeautifulSoup documentation:** https://www.crummy.com/software/BeautifulSoup/
- **HTTP status codes (MDN):** https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
- **robots.txt (MDN):** https://developer.mozilla.org/en-US/docs/Glossary/Robots.txt
- **Selenium documentation:** https://selenium-python.readthedocs.io/
- **REST API tutorial:** https://restfulapi.net/

---

## Next Steps

Now that you can acquire data from external sources, you're ready to:
1. Combine multiple data sources for richer analysis
2. Build automated data pipelines that run on schedules
3. Create dashboards that update with live API data
4. Handle larger datasets with the techniques from Chapter 11 (Big Data)

**Practice tip:** Find a free API that interests you (weather, sports, finance, government data) and build a complete data acquisition pipeline!