# Assignment 4 - Web Scraping Solutions

This notebook contains solutions for all questions in Assignment 4.

**Topics Covered:**
- Web scraping with BeautifulSoup and Requests
- Dynamic content scraping with Selenium
- Data extraction and CSV export
- Handling pagination

In [None]:
# Install required packages
# !pip install requests beautifulsoup4 pandas selenium webdriver-manager

In [None]:
# Import all required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

---
# Question 1: Scrape Books from Books to Scrape

**Website**: https://books.toscrape.com/

**Task**: Scrape all available books with:
1. Title
2. Price
3. Availability (In stock / Out of stock)
4. Star Rating

**Note**: Handle pagination to scrape books from ALL pages

## Step 1: Understanding the Website Structure

In [None]:
# Base URL for the website
base_url = "https://books.toscrape.com/"

# Test connection
response = requests.get(base_url, timeout=10)
print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.content)} bytes")

# Parse initial page
soup = BeautifulSoup(response.content, 'html.parser')

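# Find total pages (fall back to a single page if the pager is missing,
# so total_pages is always defined for the later cells)
pager = soup.find('li', class_='current')
total_pages = int(pager.text.strip().split()[-1]) if pager else 1
print(f"Total pages to scrape: {total_pages}")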
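
Taking the last whitespace-separated token works as long as the pager text keeps its current wording. A slightly more defensive sketch is shown here, assuming the pager reads like "Page 1 of 50"; `parse_total_pages` is a hypothetical helper and not part of the original solution.

```python
import re

def parse_total_pages(pager_text, default=1):
    """Return the page count from text like 'Page 1 of 50' (assumed wording).

    Falls back to `default` when the pattern is not found.
    """
    match = re.search(r"of\s+(\d+)", pager_text)
    return int(match.group(1)) if match else default

print(parse_total_pages("\n    Page 1 of 50\n"))  # 50
print(parse_total_pages(""))                       # 1 (fallback)
```

In the cell above, this could be called as `parse_total_pages(pager.text)` once `pager` has been found.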

## Step 2: Define Scraping Function

In [None]:
def scrape_books_page(url):
    """
    Scrape all books from a single page.
    
    Parameters:
    - url: URL of the page to scrape
    
    Returns:
    - List of dictionaries containing book data
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Surface HTTP errors to the caller's try/except
    soup = BeautifulSoup(response.content, 'html.parser')
    
    books_data = []
    
    # Find all book containers
    books = soup.find_all('article', class_='product_pod')
    
    for book in books:
        # Extract title
        title = book.find('h3').find('a')['title']
        
        # Extract price
        price = book.find('p', class_='price_color').text
        
        # Extract availability
        availability = book.find('p', class_='instock availability').text.strip()
        
        # Extract star rating (stored in class name like "star-rating Three")
        rating_class = book.find('p', class_='star-rating')['class']
        rating = rating_class[1]  # Get the second class which is the rating
        
        books_data.append({
            'Title': title,
            'Price': price,
            'Availability': availability,
            'Rating': rating
        })
    
    return books_data

print("Scraping function defined!")
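
The function stores the price and rating as the raw strings found in the HTML (for example a price with a currency symbol and a rating word such as "Three"). If numeric columns are wanted later, a small cleaning step along these lines may help. This is a sketch only: `clean_price`, `rating_to_int`, and `RATING_WORDS` are hypothetical names, and the input formats are assumptions about what the page returns.

```python
# Assumed mapping from the rating class word to a number of stars
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_price(price_text):
    """Keep only ASCII digits and the decimal point, then convert to float."""
    digits = "".join(ch for ch in price_text if ch in "0123456789.")
    return float(digits)

def rating_to_int(rating_word):
    """Convert a rating word to an integer, or None if the word is unknown."""
    return RATING_WORDS.get(rating_word)

print(clean_price("£51.77"))
print(rating_to_int("Three"))
```

Applied to the DataFrame, this could look like `df_books['Price'].map(clean_price)`, leaving the original string columns untouched if they are still needed.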

## Step 3: Scrape All Pages with Pagination Handling

In [None]:
print("="*60)
print("SCRAPING BOOKS FROM ALL PAGES")
print("="*60)

all_books = []

# Scrape all pages (every page, including the first, follows the same URL pattern)
for page in range(1, total_pages + 1):
    url = f"{base_url}catalogue/page-{page}.html"
    
    print(f"Scraping page {page}/{total_pages}...", end=" ")
    
    try:
        page_books = scrape_books_page(url)
        all_books.extend(page_books)
        print(f"Found {len(page_books)} books")
    except Exception as e:
        print(f"Error: {e}")
    
    # Be respectful - add small delay
    time.sleep(0.5)

print(f"\nTotal books scraped: {len(all_books)}")
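
The loop above builds each URL from a known page count. An alternative is to follow the "next" link on each page until none is left, which does not depend on knowing the total in advance. The sketch below shows only the URL-resolution part, using the standard library. It assumes the site exposes relative links such as `catalogue/page-2.html` on the home page and `page-3.html` inside the catalogue; `resolve_next_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

def resolve_next_url(current_url, next_href):
    """Resolve a relative 'next' link against the page it was found on.

    Returns None when there is no next link, which signals the last page.
    """
    return urljoin(current_url, next_href) if next_href else None

# Relative links resolve differently depending on the page they appear on
print(resolve_next_url("https://books.toscrape.com/", "catalogue/page-2.html"))
print(resolve_next_url("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
```

In a scraping loop, `next_href` would come from the page's "next" anchor (assumed here to sit inside an `li` element with class `next`), and the loop would stop when that element is absent.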

## Step 4: Create DataFrame and Export to CSV

In [None]:
# Create DataFrame
df_books = pd.DataFrame(all_books)

print("="*60)
print("BOOKS DATA SUMMARY")
print("="*60)
print(f"\nTotal books: {len(df_books)}")
print(f"\nColumns: {df_books.columns.tolist()}")

print("\nRating Distribution:")
print(df_books['Rating'].value_counts())

print("\nAvailability Distribution:")
print(df_books['Availability'].value_counts())

# Display sample
print("\nSample Data (first 10 books):")
df_books.head(10)

In [None]:
# Export to CSV
df_books.to_csv('books.csv', index=False)
print("Data exported to 'books.csv' successfully!")

---
# Question 2: Scrape IMDB Top 250 Movies

**Website**: https://www.imdb.com/chart/top/

**Task**: Scrape all 250 movies with:
1. Rank (1-250)
2. Movie Title
3. Year of Release
4. IMDB Rating

**Note**: Use Selenium for dynamic content loading

## Step 1: Setup Selenium WebDriver

In [None]:
# Import Selenium components
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

print("Selenium components imported!")

## Step 2: Launch Browser and Navigate to IMDB

In [None]:
print("="*60)
print("SCRAPING IMDB TOP 250 MOVIES")
print("="*60)

# Setup Chrome options
chrome_options = Options()
# chrome_options.add_argument("--headless")  # Uncomment for headless mode

# Launch browser
print("\nLaunching browser...")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

# Navigate to IMDB Top 250
url = "https://www.imdb.com/chart/top/"
driver.get(url)
print("Navigated to IMDB Top 250 page")

# Wait for page to load
time.sleep(3)

## Step 3: Scroll to Load All Movies

In [None]:
print("\nScrolling to load all movies...")

# Scroll down multiple times to load all content
scroll_pause_time = 2
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)
    
    # Calculate new scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print("Scrolling complete!")
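
The `while True` loop stops only when the page height stops changing, so a page that keeps growing would never exit. One possible variation is to cap the number of scroll rounds. The sketch below takes the height lookup and the scroll action as plain callables so it can be run without a browser; `scroll_until_stable`, `max_rounds`, and `FakePage` are hypothetical names introduced for illustration.

```python
def scroll_until_stable(get_height, scroll_to_bottom, pause=lambda: None, max_rounds=20):
    """Scroll until the height stops changing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    for round_number in range(1, max_rounds + 1):
        scroll_to_bottom()
        pause()
        new_height = get_height()
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds

class FakePage:
    """Stand-in for a browser page that grows twice and then stops."""
    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def height(self):
        return self.heights[self.index]

    def scroll(self):
        self.index = min(self.index + 1, len(self.heights) - 1)

page = FakePage()
rounds = scroll_until_stable(page.height, page.scroll)
print(f"Stopped after {rounds} scroll rounds")  # 3
```

With Selenium, the two callables would wrap the same `driver.execute_script(...)` calls used in the cell above, and `pause` would be something like `lambda: time.sleep(scroll_pause_time)`.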

## Step 4: Parse Page Content and Extract Movie Data

In [None]:
# Get page source after all content is loaded
page_source = driver.page_source

# Close browser
driver.quit()
print("Browser closed")

# Parse with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")

# Extract movie data
movies_data = []

movie_list_items = soup.select(".ipc-metadata-list-summary-item")
print(f"\nFound {len(movie_list_items)} movies")

for rank, item in enumerate(movie_list_items, 1):
    try:
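        # Extract title (format: "1. The Shawshank Redemption")
        title_tag = item.find("h3", class_="ipc-title__text")
        full_title = title_tag.text.strip()
        # Strip the "N." prefix only when it is numeric, so titles that
        # contain a period of their own are not truncated
        prefix, sep, rest = full_title.partition(".")
        title = rest.strip() if sep and prefix.strip().isdigit() else full_title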
        
        # Extract year from metadata
        metadata_div = item.find("div", class_="cli-title-metadata")
        metadata_items = metadata_div.find_all("span", class_="cli-title-metadata-item")
        year = metadata_items[0].text.strip() if metadata_items else "N/A"
        
        # Extract rating
        rating_span = item.find("span", class_="ipc-rating-star")
        rating = rating_span.text.split()[0] if rating_span else "N/A"
        
        movies_data.append({
            'Rank': rank,
            'Title': title,
            'Year': year,
            'Rating': rating
        })
    except Exception as e:
        print(f"Error parsing movie {rank}: {e}")

print(f"Successfully extracted {len(movies_data)} movies")
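
The extracted fields are all strings, and the title handling depends on the "rank. title" format noted in the comment above. A regular-expression version of that split, plus a tolerant numeric conversion for the rating, might look like the following. This is a sketch under that format assumption; `split_ranked_title` and `to_float` are hypothetical helpers.

```python
import re

def split_ranked_title(full_title):
    """Split '1. The Shawshank Redemption' into (1, 'The Shawshank Redemption').

    Titles without a leading numeric 'N.' prefix come back with rank None.
    """
    match = re.match(r"\s*(\d+)\.\s+(.*)", full_title)
    if match:
        return int(match.group(1)), match.group(2).strip()
    return None, full_title.strip()

def to_float(text):
    """Convert text to float, returning None for values such as 'N/A'."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

print(split_ranked_title("1. The Shawshank Redemption"))
print(split_ranked_title("2001: A Space Odyssey"))
print(to_float("9.3"), to_float("N/A"))
```

Returning `None` instead of raising keeps one malformed entry from discarding the whole row, at the cost of needing a null check before any numeric analysis.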

## Step 5: Create DataFrame and Export to CSV

In [None]:
# Create DataFrame
df_movies = pd.DataFrame(movies_data)

print("="*60)
print("IMDB TOP 250 MOVIES SUMMARY")
print("="*60)
print(f"\nTotal movies: {len(df_movies)}")
print(f"Columns: {df_movies.columns.tolist()}")

print("\nTop 10 Movies:")
print(df_movies.head(10).to_string(index=False))

print("\nBottom 10 Movies:")
print(df_movies.tail(10).to_string(index=False))

In [None]:
# Export to CSV
df_movies.to_csv('imdb_top250.csv', index=False)
print("\nData exported to 'imdb_top250.csv' successfully!")

---
# Question 3: Scrape Weather Information

**Website**: https://www.timeanddate.com/weather/

**Task**: Scrape weather for top world cities:
1. City Name
2. Temperature
3. Weather Condition (Clear, Cloudy, Rainy, etc.)

## Step 1: Fetch Weather Page

In [None]:
print("="*60)
print("SCRAPING WORLD WEATHER DATA")
print("="*60)

# URL for weather page
weather_url = "https://www.timeanddate.com/weather/"

# Set headers to mimic a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Fetch the page
response = requests.get(weather_url, headers=headers, timeout=10)
print(f"Status Code: {response.status_code}")

soup = BeautifulSoup(response.content, "html.parser")

## Step 2: Extract Weather Data

In [None]:
weather_data = []

# Find the weather table
weather_table = soup.find("table", class_="zebra")

if weather_table:
    # Find all city rows
    city_rows = weather_table.find_all("tr")
    
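    print(f"\nFound {len(city_rows)-1} data rows in the table\n")
    
    # Skip header row, process data rows
    # (cell positions below assume a city | condition | temperature layout;
    # check them against the live page if the output looks wrong)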
    for row in city_rows[1:]:
        cells = row.find_all("td")
        
        if len(cells) >= 3:
            try:
                # Extract city name
                city = cells[0].text.strip()
                
                # Extract weather condition
                weather_cell = cells[1]
                img_tag = weather_cell.find('img')
                if img_tag and 'title' in img_tag.attrs:
                    weather = img_tag['title']
                else:
                    weather = weather_cell.text.strip()
                
                # Extract temperature
                temperature = cells[2].text.strip()
                
                weather_data.append({
                    'City': city,
                    'Weather': weather,
                    'Temperature': temperature
                })
            except Exception as e:
                print(f"Error parsing row: {e}")
else:
    print("Weather table not found; the page layout may have changed.")

print(f"Extracted weather data for {len(weather_data)} cities")
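
The temperature is stored exactly as it appears in the table cell. If a numeric value is needed, the text can be split into a number and a unit. The sketch below assumes cell text of the form "72 °F" or "-5 °C", possibly with a non-breaking space; `parse_temperature` is a hypothetical helper, and the format should be checked against the actual scraped values.

```python
import re

def parse_temperature(text):
    """Parse text like '72 °F' into (72.0, 'F'); return None if it does not match."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°\s*([CF])", text)
    if not match:
        return None
    return float(match.group(1)), match.group(2)

print(parse_temperature("72\xa0°F"))
print(parse_temperature("-5 °C"))
print(parse_temperature("N/A"))
```

Keeping the unit alongside the number avoids silently mixing Celsius and Fahrenheit readings if the site serves different units to different visitors.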

## Step 3: Create DataFrame and Export to CSV

In [None]:
# Create DataFrame
df_weather = pd.DataFrame(weather_data)

print("="*60)
print("WORLD WEATHER DATA SUMMARY")
print("="*60)
print(f"\nTotal cities: {len(df_weather)}")
print(f"Columns: {df_weather.columns.tolist()}")

print("\nSample Data:")
df_weather.head(15)

In [None]:
# Weather condition distribution
if len(df_weather) > 0:
    print("\nWeather Condition Distribution:")
    print(df_weather['Weather'].value_counts())

In [None]:
# Export to CSV
df_weather.to_csv('weather.csv', index=False)
print("\nData exported to 'weather.csv' successfully!")

---
# Summary

This notebook worked through three web scraping tasks:

**Q1 - Books to Scrape:**
- Used `requests` and `BeautifulSoup` for static content
- Handled pagination to scrape all 50 pages
- Extracted: Title, Price, Availability, Star Rating
- Output: `books.csv`

**Q2 - IMDB Top 250:**
- Used `Selenium` for dynamic JavaScript-rendered content
- Implemented scrolling to load all 250 movies
- Extracted: Rank, Title, Year, Rating
- Output: `imdb_top250.csv`

**Q3 - World Weather:**
- Used `requests` with custom headers
- Parsed HTML table structure
- Extracted: City, Weather Condition, Temperature
- Output: `weather.csv`

**Key Takeaways:**
- Use `requests` + `BeautifulSoup` for static HTML content
- Use `Selenium` when JavaScript renders content dynamically
- Set appropriate request headers and pause between requests so the target server is not overloaded
- Handle exceptions gracefully for robust scraping
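
The last two takeaways (delays and graceful exception handling) can be combined in a small retry wrapper. The following is a minimal sketch rather than a definitive implementation: `fetch_with_retry` is a hypothetical helper, and the linearly growing delay is just one possible policy.

```python
import time

def fetch_with_retry(fetch, attempts=3, delay=0.5, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing longer after each failure.

    Re-raises the last exception if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)
    raise last_error

# Demonstration with a function that fails twice before succeeding
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retry(flaky, delay=0)
print(result, "after", calls["count"], "calls")
```

In this notebook, `fetch` could be something like `lambda: scrape_books_page(url)`, so that a transient network error on one page does not leave a gap in the results.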