This assignment will help you practice web scraping techniques by extracting structured data
from a live practice website. You will learn how to navigate HTML structures, extract relevant
information, and save it in a structured format for analysis.

# **QUES 1**

Write a Python program to scrape all available books from Books to Scrape
(https://books.toscrape.com/), a live site built for practicing scraping (safe, legal, no anti-bot measures). For each book, extract the following details:

1. Title

2. Price

3. Availability (In stock / Out of stock)

4. Star Rating (One, Two, Three, Four, Five)

Store the scraped results into a Pandas DataFrame and export them to a CSV file named books.csv.

(Note: Use the requests library to fetch each HTML page and BeautifulSoup to parse it and extract the book details. Handle pagination so that books from all pages are scraped.)
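
One way to handle pagination, instead of probing page numbers until a request fails, is to follow the site's own "next" link. The sketch below covers only the URL-resolution step and uses the standard library. It assumes the next-page link carries a relative `href` such as `page-2.html`, which is an assumption about the site's markup rather than something the assignment states; `resolve_next_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, next_href):
    """Return the absolute URL of the next page, or None when there is no next link.

    `next_href` is the (possibly relative) href read from the page's "next" link,
    e.g. via soup.select_one("li.next a"); that selector is an assumption here.
    """
    if not next_href:
        return None
    # urljoin resolves a relative href against the URL of the page it came from
    return urljoin(current_url, next_href)


print(resolve_next_url("https://books.toscrape.com/catalogue/page-1.html", "page-2.html"))
print(resolve_next_url("https://books.toscrape.com/catalogue/page-2.html", None))
```

In a scraping loop, the returned URL would take the place of an incremented page counter, and the loop would stop once the helper returns `None`.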

In [1]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the website
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

# Empty list to store book data
books_data = []

# Loop through all pages
page = 1
while True:
    # Fetch the page
    url = base_url.format(page)
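    response = requests.get(url, timeout=10)

    # Stop when the page does not exist (any non-200 status ends the loop)
    if response.status_code != 200:
        break

    # Decode as UTF-8 so the £ sign in prices is not garbled
    # (requests may otherwise fall back to a different encoding)
    response.encoding = 'utf-8'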

    # Parse HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all books on the page
    books = soup.find_all('article', class_='product_pod')

    # Extract details for each book
    for book in books:
        title = book.h3.a['title']  # Book title
        price = book.find('p', class_='price_color').text  # Price
        availability = book.find('p', class_='instock availability').text.strip()  # Availability
        star_rating = book.p['class'][1]  # Star rating (class contains 'star-rating One', 'Two', etc.)

        # Add book data to the list
        books_data.append({
            'Title': title,
            'Price': price,
            'Availability': availability,
            'Star Rating': star_rating
        })

    print(f"Page {page} scraped successfully...")
    page += 1  # Move to the next page

# Convert list to Pandas DataFrame
df = pd.DataFrame(books_data)

# Export DataFrame to CSV
df.to_csv('books.csv', index=False)

print("Scraping completed! Data saved to books.csv")


Page 1 scraped successfully...
Page 2 scraped successfully...
Page 3 scraped successfully...
Page 4 scraped successfully...
Page 5 scraped successfully...
Page 6 scraped successfully...
Page 7 scraped successfully...
Page 8 scraped successfully...
Page 9 scraped successfully...
Page 10 scraped successfully...
Page 11 scraped successfully...
Page 12 scraped successfully...
Page 13 scraped successfully...
Page 14 scraped successfully...
Page 15 scraped successfully...
Page 16 scraped successfully...
Page 17 scraped successfully...
Page 18 scraped successfully...
Page 19 scraped successfully...
Page 20 scraped successfully...
Page 21 scraped successfully...
Page 22 scraped successfully...
Page 23 scraped successfully...
Page 24 scraped successfully...
Page 25 scraped successfully...
Page 26 scraped successfully...
Page 27 scraped successfully...
Page 28 scraped successfully...
Page 29 scraped successfully...
Page 30 scraped successfully...
Page 31 scraped successfully...
Page 32 scraped s
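
The scraped Price and Star Rating columns are text, which is awkward for analysis. A minimal sketch of converting them to numbers follows. `parse_price` and `rating_to_int` are hypothetical helper names, and the sketch assumes prices look like `£51.77` and ratings are the words One to Five.

```python
import re

# Map the rating words used in the star-rating CSS class to integers
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77' as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in price text: {text!r}")
    return float(match.group())


def rating_to_int(word):
    """Convert a rating word to an integer, or None if the word is not recognised."""
    return RATING_WORDS.get(word)


print(parse_price("£51.77"))
print(rating_to_int("Three"))
```

With the DataFrame built in the cell above, these could be applied as `df['Price'].map(parse_price)` and `df['Star Rating'].map(rating_to_int)`.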

# **QUES 2**

Write a Python program to scrape the IMDB Top 250 Movies list
(https://www.imdb.com/chart/top/).

For each movie, extract the following details:

1. Rank (1-250)

2. Movie Title

3. Year of Release

4. IMDB Rating

Store the results in a Pandas DataFrame and export it to a CSV file named imdb_top250.csv.

(Note: Use Selenium/Playwright to scrape the required details from this website)

In [2]:
!pip install selenium
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time

# Configure Selenium options
chrome_opts = Options()
chrome_opts.add_argument("--headless")
chrome_opts.add_argument("--no-sandbox")
chrome_opts.add_argument("--disable-dev-shm-usage")

# Spoof user-agent to avoid 403 Forbidden
chrome_opts.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/115.0.0.0 Safari/537.36"
)

# Launch browser
browser = webdriver.Chrome(options=chrome_opts)

# Open IMDb Top 250
browser.get("https://www.imdb.com/chart/top/")
time.sleep(5)  # wait for page to load

film_list = []
movie_cards = browser.find_elements(By.CSS_SELECTOR, ".ipc-metadata-list-summary-item")

# Extract movie details
for rank, card in enumerate(movie_cards, start=1):
    try:
        # Note: the h3 text includes the rank prefix, e.g. "1. The Shawshank Redemption"
        name = card.find_element(By.CSS_SELECTOR, "h3").text
        release_year = card.find_element(By.CSS_SELECTOR, ".cli-title-metadata-item").text
        score = card.find_element(By.CSS_SELECTOR, ".ipc-rating-star--imdb").text.split()[0]
        film_list.append([rank, name, release_year, score])
    except Exception as e:
        print(f"Skipping a card due to error: {e}")

browser.quit()

# Save as DataFrame
imdb_table = pd.DataFrame(film_list, columns=["Position", "Movie", "Release Year", "Rating"])
imdb_table.to_csv("imdb_top250.csv", index=False)
print(imdb_table.head())

   Position                        Movie Release Year Rating
0         1  1. The Shawshank Redemption         1994    9.3
1         2             2. The Godfather         1972    9.2
2         3           3. The Dark Knight         2008    9.1
3         4     4. The Godfather Part II         1974    9.0
4         5              5. 12 Angry Men         1957    9.0
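
The output above shows that the `h3` text carries the rank as a prefix (`1. The Shawshank Redemption`), so the Movie column is not a clean title. One way to separate the two is a small regular expression. `strip_rank_prefix` is a hypothetical helper, and the pattern assumes the prefix is always digits followed by a period and whitespace.

```python
import re

# Leading rank such as "1. " or "250. " at the start of the title text
_RANK_PREFIX = re.compile(r"^\d+\.\s+")


def strip_rank_prefix(title):
    """Remove a leading 'N. ' rank prefix, leaving titles without one unchanged."""
    return _RANK_PREFIX.sub("", title, count=1)


for raw in ["1. The Shawshank Redemption", "5. 12 Angry Men", "12 Angry Men"]:
    print(strip_rank_prefix(raw))
```

Anchoring the pattern at the start of the string and requiring the period keeps titles that begin with a number, such as "12 Angry Men", intact.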


# **QUES 3**

Write a Python program to scrape the weather information for top world cities from the given website (https://www.timeanddate.com/weather/). For each city, extract the following details:

1. City Name

2. Temperature

3. Weather Condition (e.g., Clear, Cloudy, Rainy, etc.)

Store the results in a Pandas DataFrame and export it to a CSV file named weather.csv.

In [27]:
!pip install selenium
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time

# Configure Selenium options
chrome_opts = Options()
chrome_opts.add_argument("--headless")
chrome_opts.add_argument("--no-sandbox")
chrome_opts.add_argument("--disable-dev-shm-usage")

# Spoof user-agent to avoid 403 Forbidden
chrome_opts.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/115.0.0.0 Safari/537.36"
)

# Launch browser
browser = webdriver.Chrome(options=chrome_opts)

# Open the timeanddate.com world weather page
browser.get("https://www.timeanddate.com/weather/?sort=1&low=4")
time.sleep(5)  # wait for page to load

weather_list = []
weather_rows = browser.find_elements(By.CSS_SELECTOR, "table.zebra.fw.tb-theme tbody tr")

# Extract weather details for each table row
for row in weather_rows:
    try:
        city_name = row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text

        cond_td = row.find_element(By.CSS_SELECTOR, "td:nth-child(3)")
        condition_img = cond_td.find_element(By.TAG_NAME, "img")
        condition = condition_img.get_attribute("title")

        temperature = row.find_element(By.CSS_SELECTOR, "td:nth-child(4)").text

        weather_list.append([city_name, condition, temperature])
    except Exception as e:
        print(f"Skipping a row due to error: {e}")

browser.quit()

# Save as DataFrame
weather = pd.DataFrame(weather_list, columns=["City_Name", "Condition", "Temperature"])
weather.to_csv("weather.csv", index=False)
print(weather.head())

                           City_Name               Condition Temperature
0                  Albania, Tirana *   Passing clouds. Warm.       82 °F
1                   Algeria, Algiers  Scattered clouds. Hot.       91 °F
2                     Angola, Luanda     Partly sunny. Mild.       73 °F
3  Antigua and Barbuda, Saint John's   Passing clouds. Warm.       86 °F
4            Argentina, Buenos Aires     Partly sunny. Cool.       59 °F
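
The values above mix several pieces of information: `City_Name` holds "Country, City" (sometimes with a trailing `*`), and `Temperature` holds a number together with its unit. A hedged sketch of splitting these for analysis follows. The helper names are hypothetical, and the formats are assumed from the five sample rows only.

```python
import re


def split_location(text):
    """Split 'Country, City *' into (country, city), dropping a trailing '*' marker."""
    cleaned = text.rstrip("* ").strip()
    # Split on the first comma only, since the sample rows use 'Country, City'
    country, _, city = cleaned.partition(",")
    if not city:
        # No comma present: treat the whole string as the city name
        return None, cleaned
    return country.strip(), city.strip()


def parse_temperature(text):
    """Parse a string such as '82 °F' into (value, unit), e.g. (82.0, 'F')."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*°\s*([CF])\s*", text)
    if match is None:
        raise ValueError(f"Unrecognised temperature text: {text!r}")
    return float(match.group(1)), match.group(2)


def fahrenheit_to_celsius(value_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (value_f - 32.0) * 5.0 / 9.0


print(split_location("Albania, Tirana *"))
print(parse_temperature("82 °F"))
print(round(fahrenheit_to_celsius(59.0), 1))
```

Keeping the numeric value and the unit separate makes it possible to sort or average temperatures, which the raw strings do not allow.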
