# Web Scraping with Python

Web scraping is the automated extraction of information from web pages. This notebook shows how to do it with Python, which can save time and automate data collection.

This notebook covers:
- Setting up required tools (Requests and Beautiful Soup)
- Fetching and parsing HTML content
- Navigating HTML structure
- Custom data extraction
- Using pandas for table extraction

## Required Tools

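The examples below rely on two core modules: **Requests**, which downloads pages over HTTP, and **Beautiful Soup**, which parses the returned HTML. The table-extraction section also uses **pandas** together with an HTML parser such as `lxml`. Make sure these packages are installed in your Python environment.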

In [None]:
# Install required packages if needed
# !pip install requests beautifulsoup4 pandas lxml html5lib

# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

print("All required libraries imported successfully!")

## Fetching and Parsing HTML

To start web scraping, you need to fetch the HTML content of a webpage and parse it using Beautiful Soup. Here's a step-by-step example:

In [None]:
# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'

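# Send an HTTP GET request to the webpage.
# The timeout keeps the call from hanging indefinitely; the User-Agent string
# is a placeholder, so replace it with one that identifies your own script.
headers = {'User-Agent': 'Educational Web Scraping Bot 1.0 (contact@example.com)'}
response = requests.get(url, headers=headers, timeout=10)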

# Check if the request was successful
if response.status_code == 200:
    print(f"Successfully fetched the webpage! Status code: {response.status_code}")
else:
    print(f"Failed to fetch webpage. Status code: {response.status_code}")

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print("\nFirst 500 characters of HTML content:")
print(html_content[:500])

## Navigating the HTML Structure

BeautifulSoup represents HTML content as a tree-like structure, allowing for easy navigation. You can use methods like `find_all` to filter and extract specific HTML elements.
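
The tree idea can be tried on a small, self-contained snippet before touching a live page. The sketch below assumes a hypothetical HTML string (`sample_html`) invented for illustration, and uses the name `sample_soup` so it does not overwrite the `soup` object built from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
sample_html = """
<html>
  <body>
    <div id="content">
      <h1>Example Page</h1>
      <p class="intro">First paragraph.</p>
      <p>Second paragraph with a <a href="/about">link</a>.</p>
    </div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None if nothing matches
heading = sample_soup.find("h1")
print(heading.get_text())

# find_all() returns a list of every matching tag
paragraphs = sample_soup.find_all("p")
print(len(paragraphs))

# Every tag knows its parent, so you can walk up the tree
print(heading.parent.name, heading.parent.get("id"))

# Walk down the tree: True matches any tag, recursive=False limits
# the search to direct children of the <div>
container = sample_soup.find("div", id="content")
child_tags = [child.name for child in container.find_all(True, recursive=False)]
print(child_tags)
```

Because `find` can return `None`, real scraping code should check the result before reading attributes such as `.text`.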

In [None]:
# Find the page title (find() returns None when nothing matches)
title = soup.find('title')
if title:
    print(f"Page title: {title.text}")

# Find the main heading (h1)
main_heading = soup.find('h1')
if main_heading:
    print(f"Main heading: {main_heading.text}")

# Find all headings (h1, h2, h3)
headings = soup.find_all(['h1', 'h2', 'h3'])
print(f"\nFound {len(headings)} headings on the page")

# Display first 5 headings
print("\nFirst 5 headings:")
for i, heading in enumerate(headings[:5]):
    print(f"{i+1}. {heading.name}: {heading.text.strip()}")

In [None]:
# Find all <a> tags (anchor tags) in the HTML
links = soup.find_all('a')

print(f"Found {len(links)} links on the page")

# Print the text and href of the first 10 links that have visible text.
# Filtering before numbering keeps the printed list free of gaps.
print("\nFirst 10 links with text:")
links_with_text = [link for link in links if link.text.strip()]
for i, link in enumerate(links_with_text[:10]):
    link_text = link.text.strip()
    link_href = link.get('href', 'No href')
    print(f"{i+1}. Text: '{link_text}' | URL: {link_href}")
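
Many `href` values on a page are relative (for example `/wiki/...` or `#section`), so they usually need to be resolved against the page URL before they can be requested. A minimal sketch using the standard library's `urljoin` follows; the `sample_hrefs` list is hypothetical and only illustrates the common forms.

```python
from urllib.parse import urljoin

# Assumption: the page the links were scraped from
base_url = "https://en.wikipedia.org/wiki/IBM"

# Hypothetical href values of the kinds a page commonly contains
sample_hrefs = [
    "/wiki/Computer",                  # root-relative path
    "#History",                        # fragment on the same page
    "//commons.wikimedia.org/wiki/X",  # protocol-relative URL
    "https://www.ibm.com/",            # already absolute
]

# urljoin() resolves each href the way a browser would
absolute_urls = [urljoin(base_url, href) for href in sample_hrefs]
for href, full_url in zip(sample_hrefs, absolute_urls):
    print(f"{href!r:40} -> {full_url}")
```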

## Custom Data Extraction

Web scraping allows you to navigate the HTML structure and extract specific information based on your requirements. This process may involve finding specific tags, attributes, or text content within the HTML document.

In [None]:
# Extract specific information from the Wikipedia page

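# Find the first non-empty paragraph of the article.
# Guard against a missing container: chaining .find() on None raises AttributeError.
content_div = soup.find('div', class_='mw-parser-output')
paragraphs = content_div.find_all('p') if content_div else []
first_paragraph = next((p for p in paragraphs if p.text.strip()), None)
if first_paragraph:
    print("First paragraph of the article:")
    print(first_paragraph.text.strip()[:300] + "...")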

# Find the infobox (if it exists)
infobox = soup.find('table', class_='infobox')
if infobox:
    print("\n" + "="*50)
    print("INFOBOX DATA")
    print("="*50)
    
    # Extract key-value pairs from infobox
    rows = infobox.find_all('tr')
    for row in rows[:10]:  # First 10 rows
        cells = row.find_all(['th', 'td'])
        if len(cells) == 2:
            key = cells[0].text.strip()
            value = cells[1].text.strip()
            if key and value:
                print(f"{key}: {value[:100]}")

In [None]:
# Extract all images and their descriptions
images = soup.find_all('img')

print(f"Found {len(images)} images on the page")
print("\nFirst 5 images with descriptions:")

for i, img in enumerate(images[:5]):
    src = img.get('src', 'No source')
    alt = img.get('alt', 'No description')
    print(f"{i+1}. Source: {src}")
    print(f"   Description: {alt}")
    print()

## Using BeautifulSoup for Advanced HTML Parsing

Beyond `find` and `find_all`, Beautiful Soup can locate elements by CSS selector, by attribute, or by the text they contain, which makes it easier to target exactly the parts of a page you need.
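
As a rough illustration of selector-based and attribute-based lookups, the sketch below runs `select`, `select_one`, and `find_all` against a hypothetical product list (`catalog_html`) made up for this example rather than taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
catalog_html = """
<ul id="products">
  <li class="item featured" data-sku="A1"><a href="/a1">Alpha</a></li>
  <li class="item" data-sku="B2"><a href="/b2">Beta</a></li>
  <li class="item" data-sku="C3">Gamma (no link)</li>
</ul>
"""

catalog = BeautifulSoup(catalog_html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
items = catalog.select("ul#products > li.item")
item_names = [li.get_text(strip=True) for li in items]
print(item_names)

# select_one() returns the first match, or None when nothing matches
featured = catalog.select_one("li.featured")
print(featured["data-sku"])

# Attribute selector: every <a> whose href starts with "/"
hrefs = [a["href"] for a in catalog.select('a[href^="/"]')]
print(hrefs)

# find_all() can express a similar filter: tags that carry a data-sku attribute
with_sku = catalog.find_all("li", attrs={"data-sku": True})
print(len(with_sku))
```

CSS selectors tend to be more compact when the target depends on nesting or several classes, while `find_all` keyword filters are often easier to read for single-attribute checks.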

In [None]:
# Advanced BeautifulSoup techniques

# 1. Find elements by CSS selectors
print("Using CSS selectors:")
css_headings = soup.select('h2')
print(f"Found {len(css_headings)} h2 elements using CSS selector")

# 2. Find elements by attribute
print("\nFinding elements by attribute:")
elements_with_id = soup.find_all(attrs={'id': True})
print(f"Found {len(elements_with_id)} elements with 'id' attribute")

# 3. Find text nodes containing specific text
# (`string=` replaces the older `text=` argument, which is deprecated)
print("\nFinding text nodes containing 'IBM':")
ibm_elements = soup.find_all(string=lambda text: text and 'IBM' in text)
print(f"Found {len(ibm_elements)} text elements containing 'IBM'")

# Show first 3 examples
for i, text in enumerate(ibm_elements[:3]):
    clean_text = text.strip()
    if clean_text:
        print(f"{i+1}. {clean_text[:100]}...")

In [None]:
# Extract structured data: all external links
print("External links from the page:")
print("="*40)

external_links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    # Treat absolute URLs (starting with http) as external; this simple check
    # also matches absolute links that point back to the same site
    if href.startswith('http'):
        link_text = link.text.strip()
        if link_text:  # Only include links with text
            external_links.append({
                'text': link_text,
                'url': href
            })

# Remove duplicates and show first 10
unique_links = {link['url']: link for link in external_links}
for i, (url, link_data) in enumerate(list(unique_links.items())[:10]):
    print(f"{i+1}. {link_data['text'][:50]}")
    print(f"   URL: {url}")
    print()

## Using Pandas read_html for Table Extraction

pandas provides a function called `read_html`, which finds the `<table>` elements in an HTML document and returns them as a list of DataFrames, one per table. It is similar to copying a table from a webpage into a spreadsheet for further analysis. The function depends on an HTML parser such as `lxml` (or `beautifulsoup4` with `html5lib`).
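
To see the return value without any network access, the sketch below parses a tiny hypothetical table (`table_html`, with made-up place names and numbers). It assumes pandas and a supported parser such as `lxml` are installed, and wraps the HTML in `StringIO`, which is how recent pandas versions expect literal HTML to be passed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML table used only for illustration (the values are made up)
table_html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Exampleland</td><td>1000</td></tr>
  <tr><td>Sampleonia</td><td>2500</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> it finds
frames = pd.read_html(StringIO(table_html))
df = frames[0]

print(len(frames))
print(df)
print(df["Population"].sum())
```

The leading row made up entirely of `<th>` cells becomes the column header, and numeric-looking cells are converted to numbers, so the result can be aggregated straight away.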

In [None]:
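# Use pandas to extract tables from the webpage
from io import StringIO

try:
    # Parse the HTML fetched earlier instead of downloading the page again.
    # Wrapping it in StringIO avoids the deprecation warning newer pandas
    # versions emit when given a literal HTML string.
    tables = pd.read_html(StringIO(html_content))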
    
    print(f"Found {len(tables)} tables on the webpage")
    
    # Display information about each table
    for i, table in enumerate(tables):
        print(f"\nTable {i+1}:")
        print(f"  Shape: {table.shape} (rows, columns)")
        print(f"  Columns: {list(table.columns)}")
        
        # Show first few rows if the table has reasonable size
        if table.shape[0] > 0 and table.shape[1] <= 10:
            print(f"  First few rows:")
            print(table.head(3).to_string(index=False))
        print("-" * 50)

except Exception as e:
    print(f"Error extracting tables: {e}")
    print("This might happen if no tables are found or if there are parsing issues.")

In [None]:
# Let's try a different URL with more obvious tables
# Wikipedia's list of countries by population
table_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

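from io import StringIO

try:
    print("Extracting tables from countries by population page...")
    # Fetch the page with requests so a timeout and User-Agent can be set
    # (the User-Agent string is a placeholder), then hand the HTML to pandas
    page = requests.get(
        table_url,
        headers={'User-Agent': 'Educational Web Scraping Bot 1.0 (contact@example.com)'},
        timeout=10,
    )
    page.raise_for_status()
    population_tables = pd.read_html(StringIO(page.text))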
    
    print(f"Found {len(population_tables)} tables")
    
    # The main table is usually the first one
    if len(population_tables) > 0:
        main_table = population_tables[0]
        print(f"\nMain table shape: {main_table.shape}")
        print(f"Columns: {list(main_table.columns)}")
        
        # Display top 10 countries by population
        print("\nTop 10 entries:")
        print(main_table.head(10).to_string(index=False))
        
        # Save the table to CSV for further analysis
        main_table.to_csv('countries_population.csv', index=False)
        print("\nTable saved as 'countries_population.csv'")

except Exception as e:
    print(f"Error extracting population tables: {e}")

## Best Practices and Ethics

When web scraping, it's important to follow best practices and ethical guidelines:

In [None]:
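# Best practices for web scraping
from urllib.parse import urlsplit

def ethical_scraping_example(url, delay=1):
    """
    Example of ethical web scraping with proper practices
    """
    # 1. Check robots.txt (manually check before scraping).
    # robots.txt lives at the site root, not under the page path
    parts = urlsplit(url)
    print(f"Remember to check: {parts.scheme}://{parts.netloc}/robots.txt")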
    
    # 2. Add delay between requests
    time.sleep(delay)
    
    # 3. Use proper headers to identify your bot
    headers = {
        'User-Agent': 'Educational Web Scraping Bot 1.0 (contact@example.com)'
    }
    
    # 4. Handle errors gracefully
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raises an exception for bad status codes
        
        print(f"Successfully scraped: {url}")
        print(f"Status code: {response.status_code}")
        print(f"Content length: {len(response.content)} bytes")
        
        return response
        
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return None

# Example usage
print("Ethical scraping example:")
result = ethical_scraping_example('https://httpbin.org/user-agent')

if result:
    # Parse the response
    data = result.json()
    print(f"\nServer saw our User-Agent as: {data['user-agent']}")
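
The robots.txt check in step 1 can also be done programmatically with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical robots.txt (`sample_robots_txt`) from a string so that it runs without a network request; the rules and the `bot_name` value are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no request is made
sample_robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

bot_name = "Educational Web Scraping Bot"  # hypothetical user agent name

# can_fetch() reports whether the rules allow this agent to request a URL
allowed_public = parser.can_fetch(bot_name, "https://example.com/public/page.html")
allowed_private = parser.can_fetch(bot_name, "https://example.com/private/data.html")
print(allowed_public, allowed_private)

# crawl_delay() returns the requested delay in seconds, or None if unspecified
delay_seconds = parser.crawl_delay(bot_name)
print(delay_seconds)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the real file instead of a local string.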

## Summary

In this notebook, we covered:

1. **Setting up tools**: Installing and importing `requests` and `BeautifulSoup`
2. **Fetching HTML**: Using `requests.get()` to retrieve webpage content
3. **Parsing HTML**: Creating BeautifulSoup objects and navigating the HTML structure
4. **Extracting data**: Finding specific elements using various methods
5. **Advanced techniques**: CSS selectors, attribute searches, and text filtering
6. **Table extraction**: Using `pandas.read_html()` for automatic table parsing
7. **Best practices**: Ethical scraping guidelines and error handling

### Key Takeaways:
- Always respect websites' `robots.txt` and terms of service
- Add delays between requests to avoid overwhelming servers
- Handle errors gracefully and use appropriate headers
- Use the right tool for the job: BeautifulSoup for complex parsing, pandas for tables
- Test your scraping code thoroughly before running it at scale
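
One possible way to apply the delay guideline when fetching several pages is a small throttle object that sleeps only as long as needed. This is a sketch, not part of Requests or Beautiful Soup: the `Throttle` class, its `min_interval` parameter, and the page names are all hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last_call = None            # time of the previous wait()

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            elapsed_since_last = time.monotonic() - self._last_call
            remaining = self.min_interval - elapsed_since_last
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for page in ["page-1", "page-2", "page-3"]:  # placeholder page names
    throttle.wait()
    # A real scraper would call requests.get(...) here
    print(f"fetching {page} at +{time.monotonic() - start:.2f}s")
elapsed = time.monotonic() - start
```

Compared with a fixed `time.sleep(delay)` before every request, this approach does not add extra waiting when parsing a page already took longer than the interval.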