# Web Scraping for Beginners: Building Your First Data Pipeline

## What You'll Learn Today
By the end of this lesson, you'll be able to:
- Extract information from websites automatically
- Store that data in a database
- Analyze the data to find interesting patterns

## What is Web Scraping?
Web scraping is like copying information from a digital book, but instead of doing it by hand, 
we write programs to do it automatically. It's useful for:
- Collecting product prices from shopping websites
- Gathering news headlines from multiple sources
- Building datasets for research projects

## Understanding Web Pages: HTML Basics

Before we can scrape websites, we need to understand how web pages are structured.

### HTML is Like a Filing System
Think of HTML as a filing system where information is organized in containers (called "tags").

### Essential HTML Tags for Scraping
- `<h1>`, `<h2>`, `<h3>` - Headings (like chapter titles)
- `<p>` - Paragraphs (like regular text)
- `<div>` - Divisions (like folders that group things)
- `<a>` - Links (like bookmarks to other pages)
- `<table>` - Tables (like spreadsheets)

### See HTML in Action
Right-click on any webpage and select "View Page Source" to see the HTML code.

In [1]:
# This is HTML code (the structure of a webpage)
simple_html = """
<html>
    <body>
        <h1>Welcome to My Blog</h1>
        <p>This is my first blog post.</p>
        <p>This is my second blog post.</p>
    </body>
</html>
"""

In [2]:
print("Here is my HTML code:")
print(simple_html)

Here is my HTML code:

<html>
    <body>
        <h1>Welcome to My Blog</h1>
        <p>This is my first blog post.</p>
        <p>This is my second blog post.</p>
    </body>
</html>



In [3]:
from bs4 import BeautifulSoup

In [4]:
soup = BeautifulSoup(simple_html, 'html.parser')
soup


<html>
<body>
<h1>Welcome to My Blog</h1>
<p>This is my first blog post.</p>
<p>This is my second blog post.</p>
</body>
</html>

In [7]:
title = soup.find('h1')
print("The title is:", title.text)

The title is: Welcome to My Blog


In [None]:
# find_all() returns a list of every matching tag
paragraphs = soup.find_all('p')
print("We found", len(paragraphs), 'paragraphs')

# By contrast, find() returns only the first matching tag

We found 2 paragraphs
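
The difference between `find` and `find_all` matters most when nothing matches. Here is a minimal sketch that reuses the blog HTML from above; the `<table>` lookup is only an illustrative choice of a tag that the page does not contain.

```python
from bs4 import BeautifulSoup

simple_html = """
<html>
    <body>
        <h1>Welcome to My Blog</h1>
        <p>This is my first blog post.</p>
        <p>This is my second blog post.</p>
    </body>
</html>
"""

soup = BeautifulSoup(simple_html, "html.parser")

# find() returns the first match only, or None when nothing matches
first_paragraph = soup.find("p")
missing_table = soup.find("table")

# find_all() always returns a list, which is empty when nothing matches
all_paragraphs = soup.find_all("p")
all_tables = soup.find_all("table")

print(first_paragraph.text)
print(missing_table)
print(len(all_paragraphs), len(all_tables))
```

Because `find` can return `None`, it is usually worth checking the result before reading `.text` from it.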


In [11]:
news_html = """
<html>
    <body>
        <h1>Today's News</h1>
        <div class="article">
            <h2>Python Programming Growing in Popularity</h2>
            <p>Python continues to be one of the most popular programming languages.</p>
        </div>
        <div class="article">
            <h2>Web Scraping Helps Researchers</h2>
            <p>Scientists use web scraping to collect data for their studies.</p>
        </div>
    </body>
</html>
"""

In [14]:
# Parse the news page
soup = BeautifulSoup(news_html, 'html.parser')
# Find the main headline and print it
main_headline = soup.find("h1")
print("Main headline:", main_headline.text)
# Find all article headlines and print them
article_headlines = soup.find_all("h2")
for article_headline in article_headlines:
    print("-", article_headline.text)

# Find all paragraphs and print them
paragraphs = soup.find_all("p")
for paragraph in paragraphs:
    print("-", paragraph.text)


Main headline: Today's News
- Python Programming Growing in Popularity
- Web Scraping Helps Researchers
- Python continues to be one of the most popular programming languages.
- Scientists use web scraping to collect data for their studies.
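
The output above lists every headline first and every paragraph afterwards, so the link between a headline and its paragraph is lost. One possible way to keep them together is to search inside each `<div class="article">` separately. This is a sketch rather than the only approach, and the `articles` list and its dictionary keys are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

news_html = """
<html>
    <body>
        <h1>Today's News</h1>
        <div class="article">
            <h2>Python Programming Growing in Popularity</h2>
            <p>Python continues to be one of the most popular programming languages.</p>
        </div>
        <div class="article">
            <h2>Web Scraping Helps Researchers</h2>
            <p>Scientists use web scraping to collect data for their studies.</p>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(news_html, "html.parser")

articles = []
for article in soup.find_all("div", class_="article"):
    # Calling find() on the <div> limits the search to tags inside that article
    headline = article.find("h2")
    summary = article.find("p")
    articles.append({
        "headline": headline.text if headline else None,
        "summary": summary.text if summary else None,
    })

for item in articles:
    print(f"{item['headline']}: {item['summary']}")
```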


In [22]:
# Scrape a live page with requests and Beautiful Soup

import requests
from bs4 import BeautifulSoup

url = "https://httpbin.org/html"

# A timeout keeps the request from hanging if the server never answers
response = requests.get(url, timeout=10)

if response.status_code == 200:
    print("Successfully fetched the webpage")
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find("title")
    if title:
        print("Page title:", title.text)
    else:
        print("No <title> tag found in the page")

    paragraphs = soup.find_all("p")
    if paragraphs:
        for p in paragraphs:
            print("Paragraph:", p.text)
    else:
        print("No paragraphs found")

else:
    print("Failed to get the webpage")
    print("Status code:", response.status_code)

Successfully fetched the webpage
No <title> tag found in the page
Paragraph: 
          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petulance did come fro
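
The paragraph printed above starts on a new line with leading spaces because `.text` returns the text exactly as it sits in the HTML, indentation included. A minimal sketch of trimming it with `get_text(strip=True)` follows. It uses a short made-up snippet instead of the live page so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the indentation seen in the output above
sample_html = """
<div>
    <p>
          Availing himself of the mild, summer-cool weather.
    </p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
paragraph = soup.find("p")

raw_text = paragraph.text                    # keeps newlines and indentation
clean_text = paragraph.get_text(strip=True)  # trims whitespace around the text

print(repr(raw_text))
print(repr(clean_text))
```

Printing with `repr` makes the hidden newline and spaces visible, which helps when scraped text looks oddly aligned.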

## Scrapy: A More Powerful Scraping Framework

### What is Scrapy?
While Beautiful Soup is great for beginners, **Scrapy** is a more powerful framework for larger scraping projects. Think of it this way:
- **Beautiful Soup**: Like a manual can opener - simple, direct, perfect for small tasks
- **Scrapy**: Like an industrial food processor - more complex setup, but handles big jobs efficiently

### When to Use Scrapy vs Beautiful Soup
**Use Beautiful Soup when:**
- Learning web scraping basics
- Scraping a few pages occasionally
- Simple, one-time data extraction
- Parsing HTML you have already downloaded or saved

**Use Scrapy when:**
- Scraping hundreds or thousands of pages
- Following links automatically
- Exporting data with built-in feed exports (CSV, JSON, XML), or to a database through an item pipeline
- Building production scraping systems
- Handling cookies and sessions

In [None]:
# Simple Scrapy example (for comparison)
# Note: Scrapy usually requires more setup, but here's a basic concept

# First, you would install Scrapy:
# pip install scrapy

# Here's how a simple Scrapy spider looks:
"""
import scrapy

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['https://httpbin.org/html']
    
    def parse(self, response):
        # Extract title (None here: httpbin.org/html has no <title> tag)
        title = response.css('title::text').get()
        
        # Extract all paragraphs
        paragraphs = response.css('p::text').getall()
        
        # Return structured data
        yield {
            'title': title,
            'paragraphs': paragraphs,
            'url': response.url
        }
"""

# How Scrapy differs from Beautiful Soup:
# 1. Built-in HTTP handling (no need for the requests library)
# 2. CSS selectors and XPath support (Beautiful Soup has CSS selectors via select(), but no XPath)
# 3. Built-in data export to files
# 4. Built-in support for following links
# 5. Concurrent requests (faster for many pages)

print("Scrapy Example Code (above)")
print("Scrapy is more complex but more powerful for large projects")
print("For learning, Beautiful Soup is perfect!")
print("For production scraping, consider Scrapy")

## Quick Comparison: Beautiful Soup vs Scrapy

| Feature | Beautiful Soup | Scrapy |
|---------|---------------|---------|
| **Learning Curve** | Easy | Moderate to Hard |
| **Setup** | `pip install beautifulsoup4` | More configuration needed |
| **Best For** | Learning, small projects | Large-scale scraping |
| **Speed** | Good for small tasks | Faster for many pages |
| **Data Export** | Manual (you write the code) | Built-in (CSV, JSON, etc.) |
| **Error Handling** | Manual | Built-in retry logic |
| **Following Links** | Manual | Built-in (`response.follow`) |
| **Code Style** | Procedural (step by step) | Object-oriented (classes) |
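
To make the "Following Links: Manual" row concrete, here is a rough sketch of what that manual work can look like with Beautiful Soup: find the "next" link, then resolve it against the current page's URL. The pagination markup, the `li.next a` selector, and the `find_next_page_url` helper are all assumptions made for illustration, so a real site may need a different selector.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical pagination markup; real sites will differ
page_html = """
<ul class="pager">
    <li class="current">Page 1 of 50</li>
    <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""

def find_next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("li.next a")
    if next_link is None or not next_link.get("href"):
        return None
    # Links are often relative, so resolve them against the current page
    return urljoin(current_url, next_link["href"])

current_url = "https://books.toscrape.com/catalogue/page-1.html"
next_url = find_next_page_url(page_html, current_url)
print(next_url)
```

A scraper could call a helper like this in a loop and stop as soon as it returns `None`.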

### Our Recommendation for Beginners
**Start with Beautiful Soup** because:
- Easier to understand how web scraping works
- Simpler syntax and concepts
- Better for learning the fundamentals
- You can always upgrade to Scrapy later

**Move to Scrapy when** you need to:
- Scrape many websites regularly
- Handle complex navigation between pages
- Build production systems
- Need better performance and error handling

## Storing Scraped Data: Database Basics

Now that we can extract data from websites, we need to store it somewhere useful.

### Why Use a Database?
- **Organization**: Keep data structured and searchable
- **Persistence**: Data survives even if your program stops
- **Efficiency**: Queries return just the rows you ask for, without re-reading an entire file each time
- **Scalability**: Can handle large amounts of data

### SQLite: Perfect for Beginners
- **No setup required**: The `sqlite3` module ships with Python's standard library
- **Single file**: Your entire database is one file
- **SQL language**: Industry standard for working with data
- **Portable**: Easy to share and backup

### Basic Database Concepts
- **Table**: Like a spreadsheet with rows and columns
- **Column**: A category of data (like "name" or "price")
- **Row**: A single record (like one person's information)
- **Primary Key**: A unique identifier for each row
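
These four concepts can be seen together in a few lines of `sqlite3`. The sketch below uses an in-memory database so nothing is written to disk. The `products` table and its two sample rows are made up purely for illustration.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when closed
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# A table named "products" with three columns; "id" is the primary key
cursor.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        price REAL
    )
""")

# Each inserted tuple becomes one row (sample values are made up)
cursor.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("Notebook", 3.50), ("Pen", 1.20)],
)
conn.commit()

rows = cursor.execute(
    "SELECT id, name, price FROM products ORDER BY id"
).fetchall()
for row in rows:
    print(row)

conn.close()
```

Note that the insert never supplies an `id`: SQLite assigns the primary key values itself.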

## Comprehensive Example: Book Store Scraping

Now let's tackle a real-world scenario! We'll scrape a book catalog website that has:
- **Multiple books** on each page (not just one item)
- **Multiple pages** to navigate through
- **Structured data** (titles, prices, ratings, availability)
- **Perfect for analysis** (price trends, rating distributions, etc.)

### Our Target: Books.toscrape.com
This website is specifically designed for scraping practice - it's legal, ethical, and perfect for learning!

**What we'll do:**
1. Scrape multiple books from several pages
2. Store all data in a proper SQL database
3. Perform meaningful analysis on our dataset
4. Create visualizations of our findings

In [23]:
import sqlite3
import requests
from bs4 import BeautifulSoup
import time
import re
from datetime import datetime

# First, let's create a proper database for our books
def setup_books_database():
    """
    Create a database with a proper schema for book data
    """
    conn = sqlite3.connect('books_catalog.db')
    cursor = conn.cursor()
    
    # Create books table with all the fields we want to track
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            price_pounds REAL,
            rating INTEGER,
            availability TEXT,
            in_stock INTEGER,
            image_url TEXT,
            page_number INTEGER,
            scraped_at TEXT
        )
    ''')
    
    conn.commit()
    conn.close()
    print("✅ Books database created successfully!")

# Set up our database
setup_books_database()

✅ Books database created successfully!
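
One thing the schema above does not guard against is duplicates: running the scraper twice would store every book twice. A possible variation, shown on a throwaway in-memory table with fewer columns, is to add a `UNIQUE` constraint and insert with `INSERT OR IGNORE`. The `books_demo` table and `insert_book` helper are hypothetical names. Treating the title as the unique key is also an assumption made for illustration, since two different books can share a title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE books_demo (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL UNIQUE,
        price_pounds REAL
    )
""")

def insert_book(title, price_pounds):
    """Insert a book, silently skipping titles that are already stored."""
    cursor.execute(
        "INSERT OR IGNORE INTO books_demo (title, price_pounds) VALUES (?, ?)",
        (title, price_pounds),
    )

insert_book("A Light in the Attic", 51.77)
insert_book("A Light in the Attic", 51.77)  # simulates running the scraper again
conn.commit()

row_count = cursor.execute("SELECT COUNT(*) FROM books_demo").fetchone()[0]
print("Rows stored:", row_count)

conn.close()
```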


In [24]:
def convert_rating_to_number(rating_class):
    """
    Convert rating class (like 'Three') to number (like 3)
    """
    rating_map = {
        'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5
    }
    
    # Extract rating from class list
    for class_name in rating_class:
        if class_name in rating_map:
            return rating_map[class_name]
    return 0

def extract_price(price_text):
    """
    Extract numerical price from text like '£51.77'
    """
    # Match the number itself and ignore the currency symbol
    price_match = re.search(r'\d+(?:\.\d+)?', price_text)
    return float(price_match.group()) if price_match else 0.0

def extract_stock_info(availability_text):
    """
    Extract stock number from text like 'In stock (22 available)'.
    Returns 0 when the text contains no count in parentheses.
    """
    stock_match = re.search(r'\((\d+) available\)', availability_text)
    return int(stock_match.group(1)) if stock_match else 0

def scrape_books_from_page(page_url, page_number):
    """
    Scrape all books from a single page
    """
    print(f"Scraping page {page_number}: {page_url}")
    
    try:
        time.sleep(1)  # Be respectful to the server
        response = requests.get(page_url, timeout=10)  # Give up after 10 seconds
        
        if response.status_code != 200:
            print(f"Failed to get page {page_number}")
            return []
            
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all book containers
        book_containers = soup.find_all('article', class_='product_pod')
        books_data = []
        
        for book in book_containers:
            try:
                # Extract title
                title_element = book.find('h3').find('a')
                title = title_element.get('title', 'No title')
                
                # Extract price
                price_element = book.find('p', class_='price_color')
                price = extract_price(price_element.text) if price_element else 0.0
                
                # Extract rating
                rating_element = book.find('p', class_='star-rating')
                rating = convert_rating_to_number(rating_element.get('class', [])) if rating_element else 0
                
                # Extract availability
                availability_element = book.find('p', class_='instock availability')
                availability = availability_element.text.strip() if availability_element else 'Unknown'
                in_stock = extract_stock_info(availability)
                
                # Extract image URL
                image_element = book.find('div', class_='image_container').find('img')
                image_url = image_element.get('src', '') if image_element else ''
                
                book_data = {
                    'title': title,
                    'price_pounds': price,
                    'rating': rating,
                    'availability': availability,
                    'in_stock': in_stock,
                    'image_url': image_url,
                    'page_number': page_number,
                    'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                }
                
                books_data.append(book_data)
                
            except Exception as e:
                print(f"Error scraping individual book: {e}")
                continue
        
        print(f"✅ Successfully scraped {len(books_data)} books from page {page_number}")
        return books_data
        
    except Exception as e:
        print(f"Error scraping page {page_number}: {e}")
        return []

# Test scraping one page
books_page_1 = scrape_books_from_page('https://books.toscrape.com/catalogue/page-1.html', 1)
print(f"Found {len(books_page_1)} books on page 1")

# Show first book as example
if books_page_1:
    first_book = books_page_1[0]
    print("\nExample book:")
    print(f"Title: {first_book['title']}")
    print(f"Price: £{first_book['price_pounds']}")
    print(f"Rating: {first_book['rating']}/5 stars")
    print(f"In stock: {first_book['in_stock']}")

Scraping page 1: https://books.toscrape.com/catalogue/page-1.html
✅ Successfully scraped 20 books from page 1
Found 20 books on page 1

Example book:
Title: A Light in the Attic
Price: £51.77
Rating: 3/5 stars
In stock: 0
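
The `In stock: 0` above deserves a second look. The regular expression in `extract_stock_info` only matches when the text contains a count in parentheses, and the function falls back to 0 otherwise. The sketch below checks both cases. The helper is copied from the cell above so the snippet runs on its own. The second sample string is an assumption about what the listing page returns, inferred from the 0 in the output.

```python
import re

def extract_stock_info(availability_text):
    """Return the count from text like 'In stock (22 available)', else 0."""
    stock_match = re.search(r'\((\d+) available\)', availability_text)
    return int(stock_match.group(1)) if stock_match else 0

with_count = extract_stock_info("In stock (22 available)")
without_count = extract_stock_info("In stock")  # assumed listing-page text

print(with_count)
print(without_count)
```

A 0 in the `in_stock` column can therefore mean "no count shown" and not "sold out". Returning `None` for the no-match case would be one way to keep the two meanings apart.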


In [25]:
def store_books_in_database(books_list):
    """
    Store a list of books in our database
    """
    if not books_list:
        return 0
        
    conn = sqlite3.connect('books_catalog.db')
    cursor = conn.cursor()
    
    # Insert all books
    for book in books_list:
        cursor.execute('''
            INSERT INTO books (title, price_pounds, rating, availability, 
                             in_stock, image_url, page_number, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            book['title'], book['price_pounds'], book['rating'],
            book['availability'], book['in_stock'], book['image_url'],
            book['page_number'], book['scraped_at']
        ))
    
    conn.commit()
    conn.close()
    return len(books_list)

def scrape_multiple_pages(start_page=1, end_page=3):
    """
    Scrape multiple pages and store all books
    """
    all_books = []
    total_stored = 0
    
    print(f"🚀 Starting to scrape pages {start_page} to {end_page}")
    print("-" * 50)
    
    for page_num in range(start_page, end_page + 1):
        page_url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
        
        # Scrape books from this page
        books_from_page = scrape_books_from_page(page_url, page_num)
        
        if books_from_page:
            # Store books in database
            stored_count = store_books_in_database(books_from_page)
            total_stored += stored_count
            all_books.extend(books_from_page)
            print(f"📚 Stored {stored_count} books from page {page_num}")
        
        # Be respectful - add delay between pages
        time.sleep(2)
    
    print("-" * 50)
    print("🎉 Scraping complete!")
    print(f"📊 Total books scraped and stored: {total_stored}")
    return all_books

# Scrape 3 pages of books (about 60 books total)
all_scraped_books = scrape_multiple_pages(1, 3)

🚀 Starting to scrape pages 1 to 3
--------------------------------------------------
Scraping page 1: https://books.toscrape.com/catalogue/page-1.html
✅ Successfully scraped 20 books from page 1
📚 Stored 20 books from page 1
Scraping page 2: https://books.toscrape.com/catalogue/page-2.html
✅ Successfully scraped 20 books from page 2
📚 Stored 20 books from page 2
Scraping page 3: https://books.toscrape.com/catalogue/page-3.html
✅ Successfully scraped 20 books from page 3
📚 Stored 20 books from page 3
--------------------------------------------------
🎉 Scraping complete!
📊 Total books scraped and stored: 60
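
With rows in the database, SQL can start answering questions about them. Below is a minimal sketch of an aggregate query. It runs against an in-memory table holding a few made-up rows so that it works without the scraped file. Pointing `sqlite3.connect` at `books_catalog.db` and skipping the table setup should run the same query on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        price_pounds REAL,
        rating INTEGER
    )
""")

# Illustrative rows only, not real scraped values
sample_books = [
    ("Sample Book A", 10.0, 3),
    ("Sample Book B", 20.0, 3),
    ("Sample Book C", 30.0, 5),
]
cursor.executemany(
    "INSERT INTO books (title, price_pounds, rating) VALUES (?, ?, ?)",
    sample_books,
)
conn.commit()

# Count the books and average their price for each star rating
summary = cursor.execute("""
    SELECT rating, COUNT(*) AS book_count, AVG(price_pounds) AS avg_price
    FROM books
    GROUP BY rating
    ORDER BY rating
""").fetchall()

for rating, book_count, avg_price in summary:
    print(f"Rating {rating}: {book_count} book(s), average £{avg_price:.2f}")

conn.close()
```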
