# Web Scraping Demo

This notebook demonstrates how to scrape quotes from quotes.toscrape.com using Python, requests, and BeautifulSoup.

## Learning Objectives
- Understand HTTP requests and responses
- Parse HTML content with BeautifulSoup
- Extract structured data from web pages
- Handle pagination and multiple pages
- Save scraped data to CSV format

In [None]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import os
from urllib.parse import urljoin

print("Libraries imported successfully!")

In [None]:
# Import our custom web scraper from ../src (the path is relative to the notebook's working directory)
import sys
sys.path.append('../src')
from web_scraper import WebScraper

print("Web scraper imported successfully!")

## Step 1: Initialize the Web Scraper

Let's create an instance of our WebScraper class with the site's base URL and a 1-second delay between requests.

In [None]:
# Initialize the scraper
scraper = WebScraper("http://quotes.toscrape.com", delay=1.0)
print("Scraper initialized with 1-second delay between requests")
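
The `delay` argument is what keeps the scraper polite. The `WebScraper` source is not shown in this notebook, so the following is only a sketch of how a minimum gap between requests could be enforced. `RateLimiter`, `wait`, and the injectable `clock` and `sleep` parameters are hypothetical names, not part of `web_scraper`.

```python
import time


class RateLimiter:
    """Hypothetical helper: enforce a minimum gap between consecutive requests."""

    def __init__(self, delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough that calls are at least `delay` seconds apart."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


limiter = RateLimiter(delay=0.05)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # the first call returns immediately; later calls pause
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

Calling `wait()` immediately before each request means time already spent parsing the previous page counts toward the delay, which a fixed `time.sleep(delay)` after every request would not do.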

## Step 2: Test Single Page Scraping

Let's first test scraping a single page to understand the structure.

In [None]:
# Get the first page
soup = scraper.get_page("http://quotes.toscrape.com/page/1/")

if soup:
    quotes = soup.find_all('div', class_='quote')
    print(f"Found {len(quotes)} quotes on the first page")
    
    # Display first quote as example
    if quotes:
        first_quote = quotes[0]
        text = first_quote.find('span', class_='text').get_text()
        author = first_quote.find('small', class_='author').get_text()
        tags = [tag.get_text() for tag in first_quote.find_all('a', class_='tag')]
        
        print("\nFirst quote:")
        print(f"Text: {text}")
        print(f"Author: {author}")
        print(f"Tags: {', '.join(tags)}")
else:
    print("Failed to fetch the page")
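
The cell above relies on each quote being a `div.quote` that contains a `span.text`, a `small.author`, and zero or more `a.tag` links. As a sketch, the same extraction can be tried offline on a hand-written fragment. The HTML below is a simplified stand-in for the site's markup, and `parse_quote` is a hypothetical helper that returns `None` for a missing field instead of raising `AttributeError`.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for a page of quotes (not a copy of the live site).
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"A sample quote."</span>
  <small class="author">Jane Doe</small>
  <div class="tags">
    <a class="tag" href="/tag/sample/">sample</a>
    <a class="tag" href="/tag/demo/">demo</a>
  </div>
</div>
<div class="quote">
  <span class="text">"A quote with no author element."</span>
</div>
"""


def parse_quote(quote_div):
    """Extract text, author, and tags from one quote block, tolerating missing parts."""
    text_el = quote_div.find('span', class_='text')
    author_el = quote_div.find('small', class_='author')
    return {
        'text': text_el.get_text(strip=True) if text_el else None,
        'author': author_el.get_text(strip=True) if author_el else None,
        'tags': [a.get_text(strip=True) for a in quote_div.find_all('a', class_='tag')],
    }


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
parsed = [parse_quote(div) for div in sample_soup.find_all('div', class_='quote')]
for row in parsed:
    print(row)
```

The second block has no `small.author` element, so its `author` comes back as `None` and its `tags` list is empty, where the unguarded `.find(...).get_text()` chain would have raised.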

## Step 3: Scrape Multiple Pages

Now let's scrape all available quotes from multiple pages.

In [None]:
# Scrape all quotes
quotes_data = scraper.scrape_quotes()

print(f"\nTotal quotes scraped: {len(quotes_data)}")

# Display summary statistics
if quotes_data:
    # Convert to DataFrame for analysis
    df = pd.DataFrame(quotes_data)
    
    print(f"\nUnique authors: {df['author'].nunique()}")
    print(f"Most quoted author: {df['author'].mode().iloc[0]}")
    
    # Show top 5 authors
    print("\nTop 5 authors by number of quotes:")
    print(df['author'].value_counts().head())
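
`scrape_quotes()` hides the pagination logic, and its implementation is not shown here. One common approach, sketched below under that caveat, is to follow each page's "next" link until there is none, resolving relative links with `urljoin`. The `crawl` function, its `fetch` callable, and the in-memory `FAKE_SITE` are all hypothetical and exist only so the loop can run without network access.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url, collecting items from every page.

    `fetch(url)` is a hypothetical callable returning (items, next_href),
    where next_href is None on the last page.
    """
    results, url, seen = [], start_url, set()
    # Stop on the last page, on a link cycle, or after max_pages pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, next_href = fetch(url)
        results.extend(items)
        # Resolve relative links such as "/page/2/" against the current URL.
        url = urljoin(url, next_href) if next_href else None
    return results


# In-memory stand-in for a three-page site.
FAKE_SITE = {
    "http://example.com/page/1/": (["q1", "q2"], "/page/2/"),
    "http://example.com/page/2/": (["q3"], "/page/3/"),
    "http://example.com/page/3/": (["q4"], None),
}

all_items = crawl("http://example.com/page/1/", FAKE_SITE.__getitem__)
print(all_items)
```

In a real scraper, `fetch` would download the page, parse the quotes, and read the `href` of the pagination link. The `seen` set and `max_pages` cap guard against a site whose links loop back on themselves.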

## Step 4: Analyze the Scraped Data

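Let's perform some basic analysis on our scraped quotes. This cell uses the `df` DataFrame built in Step 3, so it only works if that step returned at least one quote.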

In [None]:
# Basic data analysis
import matplotlib.pyplot as plt

# Plot top authors
plt.figure(figsize=(12, 6))
df['author'].value_counts().head(10).plot(kind='bar')
plt.title('Top 10 Authors by Number of Quotes')
plt.xlabel('Author')
plt.ylabel('Number of Quotes')
plt.xticks(rotation=45, ha='right')  # right-align rotated labels so they line up with their bars
plt.tight_layout()
plt.show()

# Analyze quote lengths
df['quote_length'] = df['text'].str.len()
print("\nQuote length statistics:")
print(df['quote_length'].describe())
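
The tags extracted in Step 2 can be analyzed in the same spirit. The sketch below counts tag frequencies with `collections.Counter`. It assumes each record stores `tags` as a list of strings, which this notebook does not confirm for `quotes_data`, so it runs on a small hypothetical sample instead.

```python
from collections import Counter

# Hypothetical records in the assumed shape: one dict per quote,
# with 'tags' stored as a list of strings.
sample_quotes = [
    {'text': 'First', 'author': 'A', 'tags': ['life', 'humor']},
    {'text': 'Second', 'author': 'B', 'tags': ['life']},
    {'text': 'Third', 'author': 'A', 'tags': []},
]

# Flatten every quote's tag list into one stream and count occurrences.
tag_counts = Counter(tag for quote in sample_quotes for tag in quote['tags'])
print(tag_counts.most_common(2))
```

If the scraper stores tags as a single comma-separated string instead, split each string before counting.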

## Step 5: Save Data to CSV

Finally, let's save our scraped data to a CSV file.

In [None]:
# Save to CSV
filepath = scraper.save_to_csv(quotes_data, 'demo_scraped_quotes.csv')

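# Verify the saved file (assumes save_to_csv returns the path it wrote)
if filepath and os.path.exists(filepath):
    saved_df = pd.read_csv(filepath)
    print(f"Successfully saved {len(saved_df)} quotes to {filepath}")
    print("\nFirst 5 rows of saved data:")
    print(saved_df.head())
else:
    print("CSV file was not created; check the scraper output above")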
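
For readers without access to `web_scraper`, the sketch below shows one way such a save step could work using only the standard library. `save_quotes_csv` is a hypothetical stand-in, not the actual `save_to_csv` method, and it assumes records with `text`, `author`, and list-valued `tags` fields.

```python
import csv
import os
import tempfile


def save_quotes_csv(rows, path):
    """Hypothetical stand-in for save_to_csv: write quote dicts to `path` and return it."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['text', 'author', 'tags'])
        writer.writeheader()
        for row in rows:
            # A list does not fit in a single CSV cell, so join tags into one string.
            writer.writerow({**row, 'tags': ', '.join(row['tags'])})
    return path


sample_rows = [{'text': 'A "quoted" line, with a comma', 'author': 'A', 'tags': ['x', 'y']}]
out_path = save_quotes_csv(sample_rows, os.path.join(tempfile.mkdtemp(), 'sample_quotes.csv'))

# Read the file back to confirm that quotes and commas survive the round trip.
with open(out_path, newline='', encoding='utf-8') as f:
    round_trip = list(csv.DictReader(f))
print(round_trip)
```

Opening the file with `newline=''` lets the `csv` module control line endings, and the writer quotes any field containing commas or quotation marks, which matters for quote text.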

## Conclusion

In this demo, we've successfully:
1. Set up a web scraper with a 1-second delay between requests
2. Scraped quotes from multiple pages
3. Extracted structured data (text, author, tags)
4. Performed basic analysis on the scraped data
5. Saved the results to a CSV file

## Key Takeaways
- Always be respectful when scraping (use delays, check robots.txt)
- Handle errors gracefully (for example, check that a page was fetched before parsing it)
- Structure your data for easy analysis
- Save your work in standard formats like CSV
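
Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical robots.txt from memory so that it runs offline; the rules and the `example.com` URLs are invented for illustration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so the example runs offline.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("*", "http://example.com/page/1/"))
print(parser.can_fetch("*", "http://example.com/private/x"))
# The requested delay in seconds, or None if the file does not set one.
print(parser.crawl_delay("*"))
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file, and the value from `crawl_delay` can inform the scraper's `delay` setting.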

## Next Steps
- Try scraping different websites
- Add more sophisticated error handling
- Implement data validation
- Explore APIs as alternatives to scraping
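
As a starting point for the data validation item, the sketch below checks each scraped record before it is analyzed or saved. `validate_quote` is a hypothetical helper, and the rules it applies (non-empty `text` and `author`, `tags` as a list of strings) are assumptions about what a valid record looks like.

```python
def validate_quote(record):
    """Return a list of problems found in one scraped record (empty list means valid)."""
    problems = []
    for field in ('text', 'author'):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty {field}")
    tags = record.get('tags', [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("tags must be a list of strings")
    return problems


# Hypothetical records: the second one has an empty text field.
records = [
    {'text': 'A valid quote', 'author': 'A', 'tags': ['x']},
    {'text': '', 'author': 'B', 'tags': ['y']},
]
valid = [r for r in records if not validate_quote(r)]
print(f"{len(valid)} of {len(records)} records passed validation")
```

Returning a list of problems instead of a boolean makes it easy to log why a record was rejected.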