
# 🧹 Task 1: Data Collection & Web Scraping (Google Colab)

This notebook shows you how to collect data from the web using **requests**, **BeautifulSoup**, and **pandas**, handle **pagination**, and save results to **CSV/JSON**.  
It also includes an **optional Selenium** section for sites that use dynamic (JavaScript-rendered) content.

> Demo site: **Books to Scrape** (static) — perfect for practice.



## 1) Setup (Install & Imports)

Run the cell below to make sure required libraries are installed in Colab.


In [None]:

# On Google Colab most of these are preinstalled; this installs any that are missing.
# (Upgrading preinstalled packages such as pandas can require a runtime restart.)
!pip -q install beautifulsoup4 lxml pandas requests

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
print("Libraries ready ✅")



## 2) Helper Utilities (Polite Scraping)

- Use a **User-Agent** header
- Add small delays if needed
- Always check the website's **robots.txt** and **Terms of Service** before scraping
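
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical robots.txt string (`SAMPLE_ROBOTS` and the `example.com` URLs are made up so the example runs offline) and asks whether a given path may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without a network call.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS)
print(parser.can_fetch("*", "https://example.com/catalogue/page-1.html"))
print(parser.can_fetch("*", "https://example.com/private/data.html"))
print(parser.crawl_delay("*"))
```

Against a live site you would instead call `parser.set_url(".../robots.txt")` followed by `parser.read()`. Note that robots.txt does not replace reading the Terms of Service.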


In [None]:

import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

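def get_soup(url, wait=0.0):
    """Fetch a URL and return parsed HTML, optionally pausing afterwards."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    if wait:
        time.sleep(wait)  # polite delay before the caller's next request
    # Pass bytes so BeautifulSoup detects the encoding itself; resp.text can
    # mis-decode "£" when the server sends no charset header.
    return BeautifulSoup(resp.content, "lxml")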



## 3) Static Site Demo: Books to Scrape

We'll collect **book title, price, availability, product URL** across all pages.

**Selectors (from Inspect Element):**
- Each book: `article.product_pod`
- Title: the `title` attribute of `h3 > a`
- Price: `p.price_color`
- Availability: `p.instock.availability`
- Detail link: the `href` attribute of `h3 > a`
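
To see how these selectors behave before running the full scraper, the sketch below applies them to a small hand-written snippet. `SAMPLE_HTML` only imitates one book card and is an assumption for illustration; the live page contains more markup. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating a single book card (not copied from the live site).
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card = soup.select_one("article.product_pod")
link = card.select_one("h3 > a")  # the element; attributes are read separately

record = {
    "title": link["title"],  # full title lives in the attribute, not the link text
    "price": card.select_one("p.price_color").get_text(strip=True),
    "availability": card.select_one("p.instock.availability").get_text(strip=True),
    "href": link["href"],
}
print(record)
```

The visible link text is truncated, which is why the title is read from the `title` attribute.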


In [None]:

BASE = "https://books.toscrape.com/"
START = urljoin(BASE, "catalogue/page-1.html")

def parse_page(soup):
    rows = []
    for card in soup.select("article.product_pod"):
        title = card.h3.a.get("title", "").strip()
        price = card.select_one("p.price_color").get_text(strip=True)
        availability = card.select_one("p.instock.availability").get_text(strip=True)
        rel_link = card.h3.a.get("href", "")
        product_url = urljoin(BASE + "catalogue/", rel_link)
        rows.append({
            "title": title,
            "price": price,
            "availability": availability,
            "product_url": product_url
        })
    return rows

def find_next_page(soup, current_url):
    next_link = soup.select_one("li.next > a")
    if next_link is None:
        return None
    href = next_link.get("href")
    # The "next" href is relative to the current page's URL
    return urljoin(current_url, href)

all_rows = []
page_url = START
page_num = 1

while page_url:
    print(f"Scraping page {page_num}: {page_url}")
    soup = get_soup(page_url, wait=0.3)
    all_rows.extend(parse_page(soup))
    page_url = find_next_page(soup, page_url)
    page_num += 1

df = pd.DataFrame(all_rows)
print(f"Collected {len(df)} rows")
df.head(10)
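
Both `parse_page` and `find_next_page` rely on `urljoin` to turn relative links into absolute ones. The short sketch below shows the resolution rules that matter here, including why the base passed to `urljoin` needs its trailing slash. The variable names are illustrative only.

```python
from urllib.parse import urljoin

current = "https://books.toscrape.com/catalogue/page-1.html"

# A relative href replaces the last path segment of the current URL.
next_url = urljoin(current, "page-2.html")

# A directory base must end with "/"; otherwise its last segment is dropped.
with_slash = urljoin("https://books.toscrape.com/catalogue/", "page-2.html")
without_slash = urljoin("https://books.toscrape.com/catalogue", "page-2.html")

# "../" climbs one directory level.
parent = urljoin(current, "../index.html")

for label, value in [("next", next_url), ("with slash", with_slash),
                     ("without slash", without_slash), ("parent", parent)]:
    print(f"{label:>14}: {value}")
```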



## 4) Save Data (CSV & JSON)

Run the cell below to write CSV and JSON files that you can download from the Colab file browser.  
**Tip:** To keep copies in Google Drive, run the next section afterwards; it mounts Drive and copies these files there.


In [None]:

csv_path = "books_data.csv"
json_path = "books_data.json"

df.to_csv(csv_path, index=False)
df.to_json(json_path, orient="records", force_ascii=False, indent=2)
print("Saved files:")
print(" -", csv_path)
print(" -", json_path)



## 5) (Optional) Save to Google Drive

Run this cell to mount your Drive and copy the output files there.
You'll be asked to authorize the connection.


In [None]:

import os
import shutil

from google.colab import drive

drive.mount('/content/drive')

target_folder = "/content/drive/MyDrive/web_scraping_outputs"
os.makedirs(target_folder, exist_ok=True)

# Copy the files saved in section 4 into the Drive folder
for p in [csv_path, json_path]:
    shutil.copy(p, os.path.join(target_folder, os.path.basename(p)))

print("Copied to:", target_folder)



## 6) (Optional) Dynamic Content with Selenium

If a site loads data with JavaScript, `requests` won't see it. Use **Selenium** to automate a headless browser and grab the rendered HTML.

Below is a minimal setup intended for Colab; it assumes the runtime has a Chrome browser that the downloaded driver can launch.  
**Demo target:** `https://quotes.toscrape.com/scroll` (AJAX infinite scroll).


In [None]:

# Install Selenium & driver manager
!pip -q install selenium webdriver-manager --upgrade

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time

# Configure headless Chrome (works in Colab)
chrome_options = Options()
chrome_options.add_argument("--headless=new")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

try:
    driver.get("https://quotes.toscrape.com/scroll")
    time.sleep(2)

    # Scroll a few times to load more content
    for _ in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1.5)

    # Parse the rendered DOM
    # (You can also pass driver.page_source into BeautifulSoup if you prefer)
    quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
    scraped = []
    for q in quotes:
        text = q.find_element(By.CSS_SELECTOR, ".text").text
        author = q.find_element(By.CSS_SELECTOR, ".author").text
        scraped.append({"text": text, "author": author})

    df_dyn = pd.DataFrame(scraped)
    print("Dynamic rows:", len(df_dyn))
    # A bare expression inside try: is not displayed by the notebook, so print it
    print(df_dyn.head(10))
finally:
    driver.quit()
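
The cell above scrolls a fixed five times, which may load too little or wait longer than needed. One alternative, sketched below, is to keep scrolling until the page height stops growing. `scroll_until_stable` and `FakeDriver` are hypothetical names introduced for this example; the function only needs an object with an `execute_script` method, so a stand-in driver lets it run without a browser.

```python
import time

def scroll_until_stable(driver, pause=1.5, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (capped at max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds

class FakeDriver:
    """Stand-in for a Selenium driver: each scroll moves to the next height."""
    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # Scroll request: advance to the next height while any remain.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None

print(scroll_until_stable(FakeDriver([1000, 2000, 3000]), pause=0))
```

With a real driver you would call `scroll_until_stable(driver)` in place of the `for _ in range(5)` loop. The `max_rounds` cap guards against pages that never stop growing.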



## 7) (Optional) Save Dynamic Results

If you ran the Selenium section, run this to save the results as files too.


In [None]:

if 'df_dyn' in globals():
    df_dyn.to_csv("quotes_dynamic.csv", index=False)
    df_dyn.to_json("quotes_dynamic.json", orient="records", indent=2, force_ascii=False)
    print("Saved quotes_dynamic.csv & quotes_dynamic.json")
else:
    print("Skip: dynamic section wasn't run.")



## 8) Tips & Troubleshooting

- **Robots.txt & ToS:** Always confirm you have permission to scrape.
- **Selectors break?** Right-click → *Inspect* to re-check classes/structure.
- **Pagination strategies:** 
  - Look for a *Next* button and follow its link (used above).
  - Increment a `page=` query parameter until no results return.
- **Rate limiting:** Add `time.sleep()` between requests and send a `User-Agent` header.
- **Errors:** Wrap requests in `try/except requests.RequestException`, and call `resp.raise_for_status()` so HTTP error codes raise instead of passing silently.
- **Prefer APIs:** If the site has a public API endpoint, use it instead of scraping.
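
The second pagination strategy (incrementing a `page=` parameter) combined with error handling can be sketched as follows. `scrape_by_page_param`, `fake_fetch`, `FAKE_SITE`, and the `example.com` URL are all hypothetical; the fetch function is passed in so the loop can be tried without network access.

```python
from urllib.parse import urlencode

def scrape_by_page_param(fetch_rows, base_url, max_pages=100):
    """Request ?page=1, ?page=2, ... until a page returns no rows or fails."""
    collected = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?{urlencode({'page': page})}"
        try:
            rows = fetch_rows(url)
        except Exception as exc:  # narrow to requests.RequestException in real use
            print(f"Stopping at {url}: {exc}")
            break
        if not rows:
            break  # an empty page marks the end of the results
        collected.extend(rows)
    return collected

# Hypothetical stand-in for "download and parse one page".
FAKE_SITE = {1: ["a", "b"], 2: ["c"], 3: []}

def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    return FAKE_SITE.get(page, [])

print(scrape_by_page_param(fake_fetch, "https://example.com/items"))
```

In this notebook, a real fetcher could be `lambda url: parse_page(get_soup(url, wait=0.3))`, assuming the target site accepts a `page` query parameter.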
