# Web Scraping Fundamentals: From HTML to Data 🌐

## What is Web Scraping? 🤔

**Web scraping** is the process of automatically extracting data from websites. Think of it as teaching your computer to "read" web pages like a human would, but much faster and more systematically.

**Imagine this scenario:** You want to collect product prices from an e-commerce site, track news headlines, or gather research data from multiple websites. Instead of manually copying and pasting for hours, web scraping automates this process!

## 🎯 Learning Objectives

By the end of this notebook, you'll understand:

1. **🔍 The anatomy of web pages** - HTML structure and how data is organized
2. **🛠️ Essential scraping tools** - requests, BeautifulSoup, and modern alternatives
3. **📊 Data extraction techniques** - Finding and extracting specific information
4. **⚖️ Legal and ethical considerations** - Scraping responsibly and respectfully
5. **🚀 Advanced concepts** - Handling dynamic content, APIs, and challenges
6. **💼 Real-world applications** - Practical examples and use cases

## 🌟 Why Web Scraping Matters

**In Data Science:**
- **Data Collection:** Gather datasets for analysis and machine learning
- **Market Research:** Monitor competitors, prices, and trends
- **Automation:** Replace manual data entry with automated processes

**In Business:**
- **Lead Generation:** Extract contact information and business data
- **Content Aggregation:** Collect news, reviews, and social media posts
- **Monitoring:** Track mentions, reviews, and brand sentiment

**The Big Picture:** Web scraping democratizes data access, turning the entire web into your database! 🚀

In [1]:
# Essential imports for web scraping
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
import urllib.parse
from urllib.robotparser import RobotFileParser
import warnings
warnings.filterwarnings('ignore')

# For visualization and data analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Set up display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("🔧 Web Scraping Toolkit Ready!")
print("📚 Libraries loaded:")
print("  • requests: HTTP requests and web communication")
print("  • BeautifulSoup: HTML parsing and navigation")
print("  • pandas: Data manipulation and analysis")
print("  • json: Handle JSON data from APIs")
print("  • urllib: URL handling and robots.txt checking")

🔧 Web Scraping Toolkit Ready!
📚 Libraries loaded:
  • requests: HTTP requests and web communication
  • BeautifulSoup: HTML parsing and navigation
  • pandas: Data manipulation and analysis
  • json: Handle JSON data from APIs
  • urllib: URL handling and robots.txt checking


## Chapter 1: Understanding HTML Structure 📝

**Before we scrape, we need to understand what we're scraping!** HTML (HyperText Markup Language) is the backbone of web pages.

### 🏗️ HTML Anatomy

**Think of HTML like a document outline:**
- **Tags:** Define structure (headings, paragraphs, lists)
- **Attributes:** Provide additional information (id, class, href)
- **Content:** The actual text and data we want to extract

**Common HTML Elements:**
```html
<html>                    <!-- Root element -->
  <head>                  <!-- Metadata -->
    <title>Page Title</title>
  </head>
  <body>                  <!-- Visible content -->
    <h1 id="main-title">Heading</h1>
    <p class="description">Paragraph text</p>
    <div class="container">
      <ul>                <!-- Unordered list -->
        <li>Item 1</li>
        <li>Item 2</li>
      </ul>
    </div>
    <a href="https://example.com">Link</a>
  </body>
</html>
```

### 🎯 CSS Selectors: Your Navigation Tool

**CSS selectors help us find specific elements:**
- **Element:** `p` (all paragraphs)
- **Class:** `.description` (elements with class="description")
- **ID:** `#main-title` (element with id="main-title")
- **Descendant:** `div p` (paragraphs inside divs)
- **Attribute:** `a[href]` (links with href attribute)
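
As a minimal sketch, each selector in the list above can be passed to Beautiful Soup's `select()` (all matches) or `select_one()` (first match). The HTML snippet below is a made-up stand-in that mirrors the example page; it is an assumption for illustration, not a real site.

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the example page above (illustration only)
html = """
<body>
  <h1 id="main-title">Heading</h1>
  <p class="description">Paragraph text</p>
  <div class="container">
    <p>Nested paragraph</p>
  </div>
  <a href="https://example.com">Link</a>
  <a>Anchor without href</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

all_paragraphs = soup.select("p")              # Element selector
descriptions = soup.select(".description")     # Class selector
main_title = soup.select_one("#main-title")    # ID selector
nested = soup.select("div p")                  # Descendant selector
links_with_href = soup.select("a[href]")       # Attribute selector

print(len(all_paragraphs))
print(descriptions[0].get_text())
print(main_title.get_text())
print([p.get_text() for p in nested])
print([a["href"] for a in links_with_href])
```

Note that `a[href]` skips the anchor that has no `href` attribute, which is often what you want when collecting links.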

In [2]:
# Example 1: Parsing HTML with BeautifulSoup (static sample, no network request is made yet)
print("🌐 MAKING HTTP REQUESTS")
print("=" * 30)

# Example HTML content (simulating a simple webpage)
sample_html = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Product Page</title>
</head>
<body>
    <div class="container">
        <h1 id="product-title">Amazing Laptop</h1>
        <p class="price">$999.99</p>
        <p class="description">High-performance laptop with 16GB RAM</p>
        <ul class="features">
            <li>16GB RAM</li>
            <li>512GB SSD</li>
            <li>Intel i7 Processor</li>
        </ul>
        <div class="reviews">
            <div class="review">
                <span class="rating">5 stars</span>
                <p class="comment">Excellent laptop!</p>
            </div>
            <div class="review">
                <span class="rating">4 stars</span>
                <p class="comment">Great value for money.</p>
            </div>
        </div>
    </div>
</body>
</html>
"""

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(sample_html, 'html.parser')

print("📄 HTML PARSING DEMONSTRATION")
print("=" * 35)

# Extract different types of information
product_title = soup.find('h1', id='product-title').text
price = soup.find('p', class_='price').text
description = soup.find('p', class_='description').text

print(f"Product Title: {product_title}")
print(f"Price: {price}")
print(f"Description: {description}")

# Extract multiple elements (features list)
# Scope the search to the features list so unrelated <li> elements are not picked up
features = soup.select('ul.features li')
feature_list = [feature.text for feature in features]
print(f"Features: {feature_list}")

# Extract reviews
reviews = soup.find_all('div', class_='review')
review_data = []
for review in reviews:
    rating = review.find('span', class_='rating').text
    comment = review.find('p', class_='comment').text
    review_data.append({'rating': rating, 'comment': comment})

print(f"\nReviews:")
for i, review in enumerate(review_data, 1):
    print(f"  {i}. {review['rating']} - {review['comment']}")

print(f"\n🎯 Key BeautifulSoup Methods:")
print(f"  • find() - Gets the first matching element")
print(f"  • find_all() - Gets all matching elements")
print(f"  • .text - Extracts text content")
print(f"  • .get() - Gets attribute values")

🌐 MAKING HTTP REQUESTS
📄 HTML PARSING DEMONSTRATION
Product Title: Amazing Laptop
Price: $999.99
Description: High-performance laptop with 16GB RAM
Features: ['16GB RAM', '512GB SSD', 'Intel i7 Processor']

Reviews:
  1. 5 stars - Excellent laptop!
  2. 4 stars - Great value for money.

🎯 Key BeautifulSoup Methods:
  • find() - Gets the first matching element
  • find_all() - Gets all matching elements
  • .text - Extracts text content
  • .get() - Gets attribute values
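
The `.get()` method is listed above but not exercised in the example, so here is a small sketch of it. The snippet, class name, and base URL are hypothetical; the point is that `.get()` returns `None` (or a default you supply) when an attribute is missing, and that relative links usually need resolving against the page URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical snippet and base URL, used only for illustration
base_url = "https://example.com/catalog/"
html = """
<a class="product-link" href="laptop.html" title="Amazing Laptop">Laptop</a>
<a class="product-link" href="/deals/phone.html">Phone</a>
<a class="product-link">No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a", class_="product-link"):
    href = a.get("href")  # None instead of a KeyError when the attribute is missing
    if href is None:
        continue
    links.append({
        "text": a.get_text(strip=True),
        "url": urljoin(base_url, href),          # Resolve relative links against the page URL
        "title": a.get("title", "(no title)"),   # .get() accepts a default, like dict.get()
    })

for link in links:
    print(link)
```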


## Chapter 2: Real-World Web Scraping 🌍

**Now let's scrape actual websites!** We'll start with a simple, scraper-friendly site and demonstrate best practices.

### 🛡️ Best Practices for Responsible Scraping

**Before scraping any website:**
1. **Check robots.txt** - See what's allowed: `website.com/robots.txt`
2. **Use proper headers** - Identify yourself as a legitimate user
3. **Respect rate limits** - Don't overwhelm servers with requests
4. **Handle errors gracefully** - Networks fail, pages change
5. **Consider APIs first** - Many sites offer data through APIs
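
To make the first point concrete, here is a minimal offline sketch of how `urllib.robotparser` interprets a robots.txt file. The rules and the user agent name `my-research-bot` are invented for illustration; a real file would be fetched from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content (illustration only)
robots_lines = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_lines)  # parse() takes an iterable of lines, so no network call is needed

agent = "my-research-bot"  # hypothetical user agent name
allowed_public = rp.can_fetch(agent, "https://example.com/page/1/")
allowed_private = rp.can_fetch(agent, "https://example.com/private/data")
delay = rp.crawl_delay(agent)  # None when the file sets no Crawl-delay

print(allowed_public, allowed_private, delay)
```

If a crawl delay is declared, it is a reasonable lower bound for the pause between your own requests.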

In [3]:
# Example 2: Professional web scraping with proper headers and error handling
print("🔧 PROFESSIONAL WEB SCRAPING SETUP")
print("=" * 40)

def create_session():
    """Create a properly configured requests session"""
    session = requests.Session()
    
    # Set headers to appear like a real browser
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
    })
    
    return session

def safe_request(url, session, timeout=10, retries=3):
    """Make a safe HTTP request with retries and error handling"""
    for attempt in range(retries):
        try:
            print(f"  Attempting request to: {url[:50]}...")
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # Raises an HTTPError for bad responses
            print(f"  ✅ Success! Status: {response.status_code}")
            return response
        
        except requests.exceptions.Timeout:
            print(f"  ⏰ Timeout on attempt {attempt + 1}")
        except requests.exceptions.ConnectionError:
            print(f"  🔌 Connection error on attempt {attempt + 1}")
        except requests.exceptions.HTTPError as e:
            print(f"  🚫 HTTP error: {e}")
            break
        except Exception as e:
            print(f"  ❌ Unexpected error: {e}")
        
        if attempt < retries - 1:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"  ⏳ Waiting {wait_time} seconds before retry...")
            time.sleep(wait_time)
    
    print(f"  ❌ Failed to fetch {url} after {retries} attempts")
    return None

def check_robots_txt(base_url):
    """Check robots.txt for scraping permissions"""
    try:
        robots_url = urllib.parse.urljoin(base_url, '/robots.txt')
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp
    except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
        print(f"  ⚠️ Could not fetch robots.txt for {base_url}")
        return None

# Example: Scraping quotes from a test website
print("\n📚 EXAMPLE: SCRAPING QUOTES FROM QUOTES.TOSCRAPE.COM")
print("=" * 55)

# This is a website specifically designed for scraping practice
base_url = "http://quotes.toscrape.com"
url = f"{base_url}/page/1/"

# Check robots.txt
print("🤖 Checking robots.txt...")
robots = check_robots_txt(base_url)
if robots:
    user_agent = '*'
    can_fetch = robots.can_fetch(user_agent, url)
    print(f"  Can fetch {url}: {can_fetch}")

# Create session and make request
session = create_session()
response = safe_request(url, session)

if response:
    # Parse the HTML
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract quotes data
    quotes_data = []
    quote_elements = soup.find_all('div', class_='quote')
    
    print(f"\n📖 Found {len(quote_elements)} quotes:")
    print("-" * 40)
    
    for i, quote_elem in enumerate(quote_elements, 1):
        # Extract quote text
        quote_text = quote_elem.find('span', class_='text').text
        
        # Extract author
        author = quote_elem.find('small', class_='author').text
        
        # Extract tags
        tag_elements = quote_elem.find_all('a', class_='tag')
        tags = [tag.text for tag in tag_elements]
        
        quote_data = {
            'quote': quote_text,
            'author': author,
            'tags': tags
        }
        quotes_data.append(quote_data)
        
        # Display first few quotes
        if i <= 3:
            print(f"{i}. \"{quote_text[:100]}...\" - {author}")
            print(f"   Tags: {', '.join(tags)}")
            print()
    
    # Convert to DataFrame for analysis
    quotes_df = pd.DataFrame(quotes_data)
    print(f"📊 SCRAPED DATA SUMMARY:")
    print(f"  • Total quotes: {len(quotes_df)}")
    print(f"  • Unique authors: {quotes_df['author'].nunique()}")
    print(f"  • Most common tags: {pd.Series([tag for tags in quotes_df['tags'] for tag in tags]).value_counts().head(3).to_dict()}")
    
    # Display the DataFrame
    print(f"\n📋 Sample DataFrame:")
    quotes_df['tags_str'] = quotes_df['tags'].apply(lambda x: ', '.join(x))
    display_df = quotes_df[['quote', 'author', 'tags_str']].copy()
    display_df['quote'] = display_df['quote'].str[:50] + '...'
    print(display_df.head())

else:
    print("❌ Could not fetch the webpage")

🔧 PROFESSIONAL WEB SCRAPING SETUP

📚 EXAMPLE: SCRAPING QUOTES FROM QUOTES.TOSCRAPE.COM
🤖 Checking robots.txt...
  Can fetch http://quotes.toscrape.com/page/1/: True
  Attempting request to: http://quotes.toscrape.com/page/1/...
  ✅ Success! Status: 200

📖 Found 10 quotes:
----------------------------------------
1. "“The world as we have created it is a process of our thinking. It cannot be changed without changing..." - Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. "“It is our choices, Harry, that show what we truly are, far more than our abilities.”..." - J.K. Rowling
   Tags: abilities, choices

3. "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as t..." - Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

📊 SCRAPED DATA SUMMARY:
  • Total quotes: 10
  • Unique authors: 8
  • Most common tags: {'inspirational': 3, 'life': 2, 'humor': 2}

📋 Sample DataFrame:
                                    

## Chapter 3: Advanced Scraping Techniques 🚀

**Beyond basic scraping:** Real websites often require more sophisticated approaches.

### 🔄 Handling Dynamic Content

**Many modern websites load content with JavaScript.** For these sites, you might need:

1. **Selenium WebDriver** - Controls a real browser
2. **Playwright** - Modern browser automation
3. **requests-html** - JavaScript support for requests

### 📋 Forms and Sessions

**Some data requires:**
- **Login authentication** - Maintaining sessions
- **Form submissions** - POST requests with CSRF tokens
- **Pagination** - Following "next page" links
- **AJAX requests** - API calls that load data dynamically
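
As a rough sketch of the form-submission case, the snippet below collects a form's fields, including a hidden CSRF token, into a payload that could then be posted with a `requests.Session`. The form markup and the field name `csrf_token` are assumptions; real sites name and place their tokens differently. Nothing is sent over the network here.

```python
from bs4 import BeautifulSoup

# Made-up login form; field names such as "csrf_token" vary from site to site
login_page_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>
"""

def build_form_payload(html, credentials):
    """Collect the named inputs of the first form and overlay user-supplied values."""
    form = BeautifulSoup(html, "html.parser").find("form")
    payload = {}
    for field in form.find_all("input"):
        name = field.get("name")
        if name:  # Skip unnamed inputs such as the submit button
            payload[name] = field.get("value", "")
    payload.update(credentials)
    return form.get("action"), payload

action, payload = build_form_payload(
    login_page_html, {"username": "demo", "password": "secret"}
)
print(action, payload)

# With a session, the payload would then be sent along the lines of:
# session.post(urllib.parse.urljoin(page_url, action), data=payload)
```

Using the same session for the GET that loads the form and the POST that submits it keeps cookies consistent, which is typically what the token is checked against.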

### 🛠️ Advanced CSS Selectors

**More powerful selection techniques:**
```css
/* Attribute selectors */
input[type="email"]           /* Input with type email */
a[href^="https"]             /* Links starting with https */
div[class*="product"]        /* Divs with "product" in class */

/* Pseudo-selectors */
li:first-child              /* First list item */
tr:nth-child(odd)           /* Odd table rows */
p:-soup-contains("price")   /* Paragraphs containing "price" (Beautiful Soup/soupsieve extension, not standard CSS) */

/* Combinators */
div > p                     /* Direct child paragraphs */
h2 + p                      /* Paragraph immediately after h2 */
h2 ~ p                      /* All sibling paragraphs that follow an h2 */
```
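
Most of these selectors can be tried directly with `soup.select()`. The sketch below runs the standard ones against a small invented snippet so the difference between the combinators is visible; the markup is an assumption made for the demo.

```python
from bs4 import BeautifulSoup

# Invented markup for the demo
html = """
<div class="product-card">
  <h2>Specs</h2>
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <section><p>Nested paragraph</p></section>
  <a href="https://example.com/a">Secure</a>
  <a href="http://example.com/b">Plain</a>
</div>
<ul><li>One</li><li>Two</li><li>Three</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Return the text of every element matching the selector."""
    return [el.get_text() for el in soup.select(selector)]

secure_links = texts('a[href^="https"]')              # href starts with "https"
product_divs = soup.select('div[class*="product"]')   # class contains "product"
direct_children = texts("div > p")                    # direct children only
adjacent = texts("h2 + p")                            # immediately after the h2
siblings = texts("h2 ~ p")                            # later siblings of the h2
first_item = texts("li:first-child")
odd_items = texts("li:nth-child(odd)")

print(secure_links, len(product_divs))
print(direct_children, adjacent, siblings)
print(first_item, odd_items)
```

The nested paragraph inside `<section>` is matched by neither `div > p` nor `h2 ~ p`, because it is not a direct child of the div and not a sibling of the heading.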

In [4]:
# Example 3: Advanced techniques - Pagination and session handling
print("🔄 ADVANCED SCRAPING: PAGINATION & SESSIONS")
print("=" * 45)

def scrape_multiple_pages(base_url, max_pages=3, delay=1):
    """
    Scrape multiple pages with proper pagination handling
    """
    session = create_session()
    all_quotes = []
    
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}/page/{page_num}/"
        print(f"\n📄 Scraping page {page_num}...")
        
        response = safe_request(url, session)
        if not response:
            print(f"  ❌ Failed to load page {page_num}")
            break
            
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Check if this page has quotes
        quote_elements = soup.find_all('div', class_='quote')
        if not quote_elements:
            print(f"  📭 No quotes found on page {page_num}, stopping...")
            break
            
        # Extract quotes from this page
        page_quotes = []
        for quote_elem in quote_elements:
            quote_data = {
                'quote': quote_elem.find('span', class_='text').text,
                'author': quote_elem.find('small', class_='author').text,
                'tags': [tag.text for tag in quote_elem.find_all('a', class_='tag')],
                'page': page_num
            }
            page_quotes.append(quote_data)
            all_quotes.append(quote_data)
        
        print(f"  ✅ Extracted {len(page_quotes)} quotes from page {page_num}")
        
        # Check for "Next" button
        next_btn = soup.find('li', class_='next')
        if not next_btn:
            print(f"  🏁 No 'Next' button found, reached end of quotes")
            break
            
        # Be respectful - add delay between requests
        if page_num < max_pages:
            print(f"  ⏳ Waiting {delay} seconds before next request...")
            time.sleep(delay)
    
    return all_quotes

# Scrape multiple pages
quotes_data = scrape_multiple_pages("http://quotes.toscrape.com", max_pages=3)

if quotes_data:
    # Convert to DataFrame for analysis
    df = pd.DataFrame(quotes_data)
    
    print(f"\n📊 COMPREHENSIVE SCRAPING RESULTS:")
    print(f"=" * 40)
    print(f"Total quotes scraped: {len(df)}")
    print(f"Pages scraped: {df['page'].nunique()}")
    print(f"Unique authors: {df['author'].nunique()}")
    
    # Author statistics
    author_counts = df['author'].value_counts()
    print(f"\nTop 5 authors by quote count:")
    for author, count in author_counts.head().items():
        print(f"  • {author}: {count} quotes")
    
    # Tag analysis
    all_tags = [tag for tags in df['tags'] for tag in tags]
    tag_counts = pd.Series(all_tags).value_counts()
    print(f"\nTop 5 most common tags:")
    for tag, count in tag_counts.head().items():
        print(f"  • {tag}: {count} occurrences")
    
    # Page distribution
    page_counts = df['page'].value_counts().sort_index()
    print(f"\nQuotes per page:")
    for page, count in page_counts.items():
        print(f"  • Page {page}: {count} quotes")

# Example: Advanced CSS selectors demonstration
print(f"\n🎯 ADVANCED CSS SELECTORS DEMO")
print("=" * 35)

# Create sample HTML with complex structure
complex_html = """
<div class="products">
    <div class="product" data-category="electronics">
        <h3 class="title">Laptop</h3>
        <span class="price" data-currency="USD">$999</span>
        <div class="rating">
            <span class="stars">★★★★☆</span>
            <span class="count">(124 reviews)</span>
        </div>
    </div>
    <div class="product featured" data-category="books">
        <h3 class="title">Python Programming</h3>
        <span class="price" data-currency="USD">$39</span>
        <div class="rating">
            <span class="stars">★★★★★</span>
            <span class="count">(89 reviews)</span>
        </div>
    </div>
</div>
"""

soup = BeautifulSoup(complex_html, 'html.parser')

print("Advanced selector examples:")

# Attribute selectors
electronics = soup.select('div[data-category="electronics"]')
print(f"Electronics products: {len(electronics)}")

# Class combinations
featured_products = soup.select('.product.featured')
print(f"Featured products: {len(featured_products)}")

# Descendant selectors
product_titles = soup.select('.product .title')
print(f"Product titles: {[title.text for title in product_titles]}")

# Pseudo-selectors (first, last, nth-child)
first_product = soup.select('.product:first-child')
print(f"First product title: {first_product[0].find('h3').text if first_product else 'None'}")

# Advanced attribute selectors
prices_usd = soup.select('span[data-currency="USD"]')
print(f"USD prices: {[price.text for price in prices_usd]}")

print(f"\n💡 Key Takeaways:")
print(f"  • Always add delays between requests")
print(f"  • Handle pagination systematically")
print(f"  • Use advanced selectors for precise targeting")
print(f"  • Monitor for anti-bot measures")

🔄 ADVANCED SCRAPING: PAGINATION & SESSIONS

📄 Scraping page 1...
  Attempting request to: http://quotes.toscrape.com/page/1/...
  ✅ Success! Status: 200
  ✅ Extracted 10 quotes from page 1
  ⏳ Waiting 1 seconds before next request...

📄 Scraping page 2...
  Attempting request to: http://quotes.toscrape.com/page/2/...
  ✅ Success! Status: 200
  ✅ Extracted 10 quotes from page 2
  ⏳ Waiting 1 seconds before next request...

📄 Scraping page 3...
  Attempting request to: http://quotes.toscrape.com/page/3/...
  ✅ Success! Status: 200
  ✅ Extracted 10 quotes from page 3

📊 COMPREHENSIVE SCRAPING RESULTS:
Total quotes scraped: 30
Pages scraped: 3
Unique authors: 20

Top 5 authors by quote count:
  • Albert Einstein: 6 quotes
  • J.K. Rowling: 3 quotes
  • Marilyn Monroe: 2 quotes
  • Dr. Seuss: 2 quotes
  • Bob Marley: 2 quotes

Top 5 most common tags:
  • life: 7 occurrences
  • love: 6 occurrences
  • inspirational: 5 occurrences
  • humor: 4 occurrences
  • friends: 3 occurrences

Quotes per page:
  • Page 1: 10 quotes
  • Page 2: 10 quotes
  • Page 3: 10 quotes
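
The function above builds each URL from a page number and only checks whether a "Next" button exists. An alternative, sketched here under assumptions, is to follow the button's `href` itself, which also works when page URLs are not numbered. The two pages in `FAKE_PAGES` are invented stand-ins for real responses, and the `fetch` parameter is a hypothetical hook where something like `safe_request(url, session).text` would go.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Two made-up pages standing in for real HTTP responses
FAKE_PAGES = {
    "http://example.com/page/1/":
        '<div class="quote">A</div><li class="next"><a href="/page/2/">Next</a></li>',
    "http://example.com/page/2/":
        '<div class="quote">B</div>',
}

def crawl(start_url, fetch, max_pages=10):
    """Follow 'next' links until none is left or max_pages is reached."""
    url, visited, items = start_url, [], []
    while url and len(visited) < max_pages:
        soup = BeautifulSoup(fetch(url), "html.parser")
        visited.append(url)
        items.extend(q.get_text() for q in soup.select("div.quote"))
        next_link = soup.select_one("li.next > a")
        # Resolve the (possibly relative) href against the current page URL
        url = urljoin(url, next_link["href"]) if next_link else None
    return visited, items

visited, items = crawl("http://example.com/page/1/", FAKE_PAGES.__getitem__)
print(visited)
print(items)
```

The `max_pages` cap guards against loops if a site's last page links back to an earlier one.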

## Chapter 4: APIs vs Web Scraping 🔄

**Before scraping, always check if an API exists!** APIs are usually faster, more reliable, and more respectful.

### 🆚 API vs Scraping Comparison

| **Aspect** | **API** | **Web Scraping** |
|------------|---------|------------------|
| **Speed** | ⚡ Very fast | 🐌 Slower (HTML parsing) |
| **Reliability** | 🎯 Stable structure | 🔄 Changes with website updates |
| **Legal** | ✅ Usually permitted | ⚖️ Check terms of service |
| **Rate Limits** | 📊 Clearly defined | 🚫 Risk of being blocked |
| **Data Format** | 📋 Structured (JSON/XML) | 🕸️ Semi-structured (HTML markup built for display) |
| **Availability** | 🎪 Not always available | 🌐 Most websites accessible |

### 🔍 Finding APIs

**How to discover if a site has an API:**
1. **Check developer documentation** - Look for "API" or "Developer" pages
2. **Inspect network traffic** - Use browser dev tools
3. **Look for JSON responses** - Many sites use internal APIs
4. **Search for "[site name] API"** - Google is your friend

### 📊 Working with JSON APIs

**JSON (JavaScript Object Notation) is the standard API format:**
```python
# Typical API response structure
{
    "status": "success",
    "data": [
        {"id": 1, "name": "Product A", "price": 29.99},
        {"id": 2, "name": "Product B", "price": 39.99}
    ],
    "pagination": {
        "page": 1,
        "total_pages": 10,
        "per_page": 20
    }
}
```
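
Given a response shaped like the one above, collecting every page usually comes down to looping until `pagination.page` reaches `pagination.total_pages`. The sketch below assumes exactly that structure; `fetch_page` is a hypothetical stand-in for the real HTTP call, and real APIs may paginate with cursors or `next` URLs instead.

```python
def fetch_page(page):
    """Hypothetical stand-in for an HTTP call that returns one page of results."""
    first_id = (page - 1) * 2 + 1
    return {
        "status": "success",
        "data": [{"id": i, "name": f"Product {i}"} for i in (first_id, first_id + 1)],
        "pagination": {"page": page, "total_pages": 3, "per_page": 2},
    }

def fetch_all(fetch, max_pages=100):
    """Collect records from every page, stopping at the last page or on an error status."""
    records, page = [], 1
    while page <= max_pages:
        payload = fetch(page)
        if payload.get("status") != "success":
            break
        records.extend(payload["data"])
        if page >= payload["pagination"]["total_pages"]:
            break
        page += 1
    return records

records = fetch_all(fetch_page)
print(len(records), [r["id"] for r in records])
```

In a real client, a short delay between pages and a check of the API's documented rate limit would belong inside the loop.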

In [5]:
# Example 4: Working with APIs and JSON data
print("🔗 WORKING WITH APIs AND JSON")
print("=" * 32)

def fetch_api_data(url, params=None):
    """
    Fetch data from a JSON API with proper error handling
    """
    session = create_session()
    try:
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()
        
        # Check if response is JSON
        content_type = response.headers.get('content-type', '')
        if 'application/json' in content_type:
            return response.json()
        else:
            print(f"⚠️ Response is not JSON: {content_type}")
            return None
            
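    except json.JSONDecodeError as e:
        # Checked first: recent versions of requests raise a JSONDecodeError that
        # also subclasses RequestException, so the order of these handlers matters
        print(f"❌ Invalid JSON response: {e}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"❌ API request failed: {e}")
        return None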

# Example: Using a public API (JSONPlaceholder - a fake API for testing)
print("📡 EXAMPLE: JSONPLACEHOLDER API")
print("=" * 35)

# Fetch sample posts from JSONPlaceholder
posts_url = "https://jsonplaceholder.typicode.com/posts"
posts_data = fetch_api_data(posts_url, params={'_limit': 5})

if posts_data:
    print(f"✅ Successfully fetched {len(posts_data)} posts")
    
    # Convert to DataFrame
    posts_df = pd.DataFrame(posts_data)
    print(f"\nPost data structure:")
    print(posts_df.head())
    
    # Fetch users data to join with posts
    users_url = "https://jsonplaceholder.typicode.com/users"
    users_data = fetch_api_data(users_url)
    
    if users_data:
        users_df = pd.DataFrame(users_data)
        
        # Join posts with user information
        merged_df = posts_df.merge(users_df[['id', 'name', 'email']], 
                                  left_on='userId', right_on='id', 
                                  suffixes=('_post', '_user'))
        
        print(f"\n📊 MERGED DATA (Posts + User Info):")
        print(merged_df[['title', 'name', 'email']].head())

# Example: Comparing API vs Scraping approaches
print(f"\n🆚 API VS SCRAPING COMPARISON DEMO")
print("=" * 40)

# Simulate API data extraction
api_start_time = time.time()
api_data = fetch_api_data("https://jsonplaceholder.typicode.com/posts", {'_limit': 10})
api_end_time = time.time()

if api_data:
    api_processing_time = api_end_time - api_start_time
    print(f"API approach:")
    print(f"  ✅ Fetched {len(api_data)} items in {api_processing_time:.3f} seconds")
    print(f"  📊 Clean, structured JSON data")
    print(f"  🎯 No HTML parsing required")

# Show the difference in data structure
print(f"\n📋 DATA STRUCTURE COMPARISON:")
print("=" * 35)

# API data (clean JSON)
if api_data:
    sample_api_item = api_data[0]
    print("API Response (JSON):")
    print(json.dumps(sample_api_item, indent=2))

print(f"\nHTML data (requires parsing):")
sample_html_data = """
<div class="post">
    <h2 class="title">sunt aut facere repellat</h2>
    <div class="body">quia et suscipit...</div>
    <span class="author">Leanne Graham</span>
    <span class="user-id" data-id="1">1</span>
</div>
"""
print(sample_html_data)

# Demonstrate JSON manipulation
print(f"\n🔧 JSON DATA MANIPULATION")
print("=" * 30)

# Working with nested JSON (common in APIs)
nested_json = {
    "user": {
        "id": 1,
        "profile": {
            "name": "John Doe",
            "contact": {
                "email": "john@example.com",
                "phone": "123-456-7890"
            }
        },
        "posts": [
            {"id": 1, "title": "First Post", "likes": 15},
            {"id": 2, "title": "Second Post", "likes": 23}
        ]
    }
}

print("Nested JSON structure:")
print(json.dumps(nested_json, indent=2))

# Extract data from nested JSON
user_name = nested_json['user']['profile']['name']
user_email = nested_json['user']['profile']['contact']['email']
total_likes = sum(post['likes'] for post in nested_json['user']['posts'])

print(f"\nExtracted information:")
print(f"  Name: {user_name}")
print(f"  Email: {user_email}")
print(f"  Total likes: {total_likes}")

# Flatten nested JSON for DataFrame
flattened_data = []
for post in nested_json['user']['posts']:
    flat_record = {
        'user_id': nested_json['user']['id'],
        'user_name': nested_json['user']['profile']['name'],
        'user_email': nested_json['user']['profile']['contact']['email'],
        'post_id': post['id'],
        'post_title': post['title'],
        'post_likes': post['likes']
    }
    flattened_data.append(flat_record)

flat_df = pd.DataFrame(flattened_data)
print(f"\nFlattened DataFrame:")
print(flat_df)

print(f"\n💡 API BEST PRACTICES:")
print(f"  • Always check API documentation first")
print(f"  • Respect rate limits and use API keys")
print(f"  • Handle pagination in API responses")
print(f"  • Cache API responses when appropriate")
print(f"  • Fall back to scraping only when APIs aren't available")

🔗 WORKING WITH APIs AND JSON
📡 EXAMPLE: JSONPLACEHOLDER API
✅ Successfully fetched 5 posts

Post data structure:
   userId  id                                              title  \
0       1   1  sunt aut facere repellat provident occaecati e...   
1       1   2                                       qui est esse   
2       1   3  ea molestias quasi exercitationem repellat qui...   
3       1   4                               eum et est occaecati   
4       1   5                                 nesciunt quas odio   

                                                body  
0  quia et suscipit\nsuscipit recusandae consequu...  
1  est rerum tempore vitae\nsequi sint nihil repr...  
2  et iusto sed quo iure\nvoluptatem occaecati om...  
3  ullam et saepe reiciendis voluptatem adipisci\...  
4  repudiandae veniam quaerat sunt sed\nalias aut...  

📊 MERGED DATA (Posts + User Info):
                                               title           name  \
0  sunt aut facere repellat provident occ
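
The manual flattening loop in the cell above can also be expressed with `pandas.json_normalize`. This is a sketch of one possible call rather than the only way to do it; the `post_` and `user_` prefixes are choices made here to avoid a clash between the two `id` fields.

```python
import pandas as pd

# Same nested structure as in the cell above
nested_json = {
    "user": {
        "id": 1,
        "profile": {
            "name": "John Doe",
            "contact": {"email": "john@example.com", "phone": "123-456-7890"},
        },
        "posts": [
            {"id": 1, "title": "First Post", "likes": 15},
            {"id": 2, "title": "Second Post", "likes": 23},
        ],
    }
}

flat_df = pd.json_normalize(
    nested_json["user"],
    record_path="posts",                      # One output row per post
    meta=["id", ["profile", "name"], ["profile", "contact", "email"]],
    record_prefix="post_",                    # post_id, post_title, post_likes
    meta_prefix="user_",                      # user_id, user_profile_name, ...
    sep="_",
)
print(flat_df)
```

Without the prefixes, pandas raises an error because the post-level `id` and the user-level `id` would collide.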

## Chapter 5: Summary & Best Practices 🎯

### 🏆 Key Takeaways

**You've learned the fundamentals of web scraping!** Here's what we covered:

1. **🔍 HTML Structure** - Understanding the anatomy of web pages
2. **🛠️ Basic Scraping** - Using requests and BeautifulSoup
3. **🚀 Advanced Techniques** - Pagination, sessions, and error handling
4. **📊 Data Processing** - Converting scraped data to DataFrames
5. **🔌 APIs vs Scraping** - Knowing when to use each approach

### ⚖️ Legal and Ethical Guidelines

**🚨 Always Remember:**
- **Check robots.txt** - Respect website guidelines
- **Read Terms of Service** - Understand legal restrictions
- **Be respectful** - Don't overload servers with requests
- **Add delays** - Space out your requests (1-2 seconds minimum)
- **Use APIs when available** - They're designed for data access
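
One way to enforce the delay guideline without scattering `time.sleep()` calls through your code is a small throttle object. The class below is a hypothetical helper written for this notebook, not part of `requests`; the 0.05-second interval is only there to keep the demo fast, and a real scraper would use a second or more.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

throttle = Throttle(min_interval=0.05)  # Short interval for the demo only
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # session.get(url) would go here
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f} seconds")
```

Because the throttle measures time since the previous call, slow responses do not add an unnecessary extra pause on top of the time already spent waiting.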

### 🛡️ Best Practices Checklist

**Before You Scrape:**
- [ ] Check if an API exists first
- [ ] Read the website's robots.txt
- [ ] Review terms of service
- [ ] Test with small samples first

**During Scraping:**
- [ ] Use proper headers (User-Agent, etc.)
- [ ] Implement rate limiting and delays
- [ ] Handle errors gracefully
- [ ] Monitor for blocking/captchas
- [ ] Cache responses when appropriate

**After Scraping:**
- [ ] Clean and validate your data
- [ ] Store data efficiently
- [ ] Document your scraping process
- [ ] Schedule updates responsibly
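
For the "clean and validate" step, a short pandas sketch is shown below. The rows are invented in the shape a scraper might produce (prices as strings with a currency symbol, ratings as words), so the column names and values are assumptions rather than real scraped data.

```python
import pandas as pd

# Made-up raw rows in the shape a scraper might produce
raw_rows = [
    {"title": " Book A ", "price": "£51.77", "rating": "Three"},
    {"title": "Book B", "price": "£13.99", "rating": "Five"},
    {"title": "Book B", "price": "£13.99", "rating": "Five"},  # Exact duplicate
    {"title": "Book C", "price": "N/A", "rating": "One"},      # Unparsable price
]
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

df = pd.DataFrame(raw_rows)
df["title"] = df["title"].str.strip()
# Keep only digits and the decimal point, then convert; failures become NaN
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[^\d.]", "", regex=True), errors="coerce"
)
df["rating"] = df["rating"].map(RATING_WORDS)

clean_df = df.drop_duplicates().dropna(subset=["price"]).reset_index(drop=True)
print(clean_df)
```

Whether to drop rows with missing prices or keep them for later inspection depends on the analysis; dropping is simply the shortest option to show here.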

### 🔮 Next Steps

**To become a scraping expert, explore:**

1. **Advanced Tools:**
   - **Selenium/Playwright** - For JavaScript-heavy sites
   - **Scrapy** - Professional scraping framework
   - **Other parsers for Beautiful Soup** - lxml, html5lib (alternatives to the built-in html.parser)

2. **Handling Challenges:**
   - **CAPTCHAs** - Detection and solving
   - **IP blocking** - Proxy rotation
   - **Dynamic content** - Browser automation
   - **Anti-bot measures** - Stealth techniques

3. **Data Pipeline:**
   - **Storage** - Databases, files, cloud storage
   - **Processing** - Data cleaning and validation
   - **Monitoring** - Automated scraping jobs
   - **Visualization** - Dashboards and reports

### 📚 Recommended Resources

**Documentation:**
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Requests Documentation](https://docs.python-requests.org/)
- [Scrapy Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)

**Practice Sites:**
- [Quotes to Scrape](http://quotes.toscrape.com/) - Beginner friendly
- [Books to Scrape](http://books.toscrape.com/) - More complex structure
- [Scrape This Site](https://scrapethissite.com/) - Various challenges

**Remember:** Web scraping is a powerful tool for data collection, but with great power comes great responsibility! Always scrape ethically and respectfully. 🌟

In [6]:
# 🎯 PRACTICAL EXERCISE: Your Turn to Scrape!
print("🎯 PRACTICAL EXERCISE")
print("=" * 25)

# Exercise: Scrape book information from books.toscrape.com
# This is a more complex scraping challenge!

def scrape_books_challenge():
    """
    Challenge: Scrape book data from books.toscrape.com
    Try to extract: title, price, rating, availability
    """
    
    print("📚 BOOKS SCRAPING CHALLENGE")
    print("Your mission: Scrape book data from http://books.toscrape.com/")
    print("\nWhat to extract:")
    print("  • Book titles")
    print("  • Prices") 
    print("  • Star ratings")
    print("  • Availability status")
    
    print("\n💡 HINTS:")
    print("  • Use inspector to examine the HTML structure")
    print("  • Look for class names like 'product_pod'")
    print("  • Ratings are in class names like 'star-rating Three'")
    print("  • Prices are in <p class='price_color'>")
    
    print("\n🚀 BONUS CHALLENGES:")
    print("  • Scrape multiple pages")
    print("  • Calculate average price by rating")
    print("  • Find the most expensive book")
    print("  • Create a visualization of price distribution")
    
    print("\n📝 CODE TEMPLATE:")
    template_code = '''
# Your solution here:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# 1. Make request to books.toscrape.com
# 2. Parse HTML with BeautifulSoup  
# 3. Find book containers
# 4. Extract title, price, rating, availability
# 5. Convert to DataFrame
# 6. Analyze the data

# url = "http://books.toscrape.com/"
# response = requests.get(url)
# soup = BeautifulSoup(response.content, 'html.parser')
# ... your code here ...
'''
    print(template_code)

# Run the challenge setup
scrape_books_challenge()

# Recap: a typical end-to-end scraping workflow (the exercise solution is left to you)
print("\n" + "="*50)
print("💼 REAL-WORLD SCRAPING WORKFLOW")
print("="*50)

workflow_steps = [
    "1. 🎯 Define your objective - What data do you need?",
    "2. 🔍 Explore the website - Understand the structure", 
    "3. 📋 Check legal requirements - robots.txt, ToS",
    "4. 🛠️ Choose your tools - requests + BeautifulSoup or alternatives",
    "5. 🧪 Start small - Test with single pages first",
    "6. 🔄 Scale up - Handle pagination and multiple pages", 
    "7. 🧹 Clean data - Remove duplicates, handle missing values",
    "8. 💾 Store results - Database, CSV, or other formats",
    "9. 🔄 Automate - Schedule regular scraping if needed",
    "10. 📊 Analyze - Turn data into insights!"
]

for step in workflow_steps:
    print(f"  {step}")

print(f"\n🎉 CONGRATULATIONS!")
print("You now have the foundation to scrape data from the web!")
print("Remember: Practice makes perfect, and always scrape responsibly! 🌟")

🎯 PRACTICAL EXERCISE
📚 BOOKS SCRAPING CHALLENGE
Your mission: Scrape book data from http://books.toscrape.com/

What to extract:
  • Book titles
  • Prices
  • Star ratings
  • Availability status

💡 HINTS:
  • Use inspector to examine the HTML structure
  • Look for class names like 'product_pod'
  • Ratings are in class names like 'star-rating Three'
  • Prices are in <p class='price_color'>

🚀 BONUS CHALLENGES:
  • Scrape multiple pages
  • Calculate average price by rating
  • Find the most expensive book
  • Create a visualization of price distribution

📝 CODE TEMPLATE:

# Your solution here:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# 1. Make request to books.toscrape.com
# 2. Parse HTML with BeautifulSoup  
# 3. Find book containers
# 4. Extract title, price, rating, availability
# 5. Convert to DataFrame
# 6. Analyze the data

# url = "http://books.toscrape.com/"
# response = requests.get(url)
# soup = BeautifulSoup(response.content, 'html.parser')
# ... your code here ...