
# Task 1: Data Collection & Web Scraping (Google Colab)

This notebook shows you how to collect data from the web using **requests**, **BeautifulSoup**, and **pandas**, handle **pagination**, and save results to **CSV/JSON**.  
It also includes an **optional Selenium** section for sites that use dynamic (JavaScript-rendered) content.

> Demo site: **Books to Scrape** (static) — perfect for practice.



## 1) Setup (Install & Imports)

Run the cell below to make sure required libraries are installed in Colab.


In [1]:

!pip -q install beautifulsoup4 lxml pandas requests --upgrade

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
print("Libraries ready")


[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.2/91.2 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.3/5.3 MB[0m [31m51.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m61.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.7/64.7 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.3.2 which is incompatible.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.
dask-cudf-cu12 25.6.0 requires pandas<2.2.4dev0,>=2.0, but you have pandas 2.3.2 which is incompatible.
cudf-cu12 25.6.0 requires pandas<2.2.4dev0,>=2


## 2) Helper Utilities (Polite Scraping)

- Use a **User-Agent** header
- Add small delays if needed
- Always check the website's **robots.txt** and **Terms of Service** before scraping
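
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inlined, hypothetical robots.txt so that it runs offline; the rules, the `my-scraper` agent name, and the `example.com` URLs are all assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a made-up agent name; the "*" rules apply to it.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/catalogue/page-1.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")

print("public page allowed:", allowed_public)
print("private page allowed:", allowed_private)
print("requested crawl delay (seconds):", delay)
```

Against a live site, the parser would instead be pointed at the site's own file with `parser.set_url(...)` followed by `parser.read()`. A robots.txt check does not replace reading the Terms of Service.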


In [2]:

import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

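def get_soup(url, wait=0.0):
    """Fetch a URL and return parsed HTML, optionally pausing afterwards."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    if wait:
        time.sleep(wait)  # small delay so consecutive requests are spaced out
    # Pass raw bytes so BeautifulSoup detects the page's declared encoding;
    # resp.text can mis-decode "£" as "Â£" when the server sends no charset.
    return BeautifulSoup(resp.content, "lxml")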
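
Because `get_soup` raises on HTTP errors, one failed request stops the whole scrape. One possible way to soften that is a small retry wrapper with exponential backoff. The helper below is a hypothetical addition, not part of the original notebook; the name `with_retries` and its parameters are assumptions, and the flaky function only simulates a temporary failure.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n, then try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated flaky call: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = with_retries(flaky, attempts=4, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the notebook this could wrap the existing helper, for example `with_retries(lambda: get_soup(page_url, wait=0.3), exceptions=(requests.RequestException,))`, so that only network-related errors are retried.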



## 3) Static Site Demo: Books to Scrape

We'll collect the **title, price, availability, and product URL** of every book across all pages.

**Selectors (from Inspect Element):**
- Each book: `article.product_pod`
- Title: `h3 > a` (read the `title` attribute)
- Price: `p.price_color`
- Availability: `p.instock.availability`
- Detail link: `h3 > a` (read the `href` attribute)


In [3]:

BASE = "https://books.toscrape.com/"
START = urljoin(BASE, "catalogue/page-1.html")

def parse_page(soup):
    rows = []
    for card in soup.select("article.product_pod"):
        title = card.h3.a.get("title", "").strip()
        price = card.select_one("p.price_color").get_text(strip=True)
        availability = card.select_one("p.instock.availability").get_text(strip=True)
        rel_link = card.h3.a.get("href", "")
        product_url = urljoin(BASE + "catalogue/", rel_link)
        rows.append({
            "title": title,
            "price": price,
            "availability": availability,
            "product_url": product_url
        })
    return rows

def find_next_page(soup, current_url):
    # The "next" button is an <a> inside <li class="next">; absent on the last page
    next_link = soup.select_one("li.next > a")
    if not next_link:
        return None
    href = next_link.get("href")
    # next links are relative to current page
    return urljoin(current_url, href)

all_rows = []
page_url = START
page_num = 1

while page_url:
    print(f"Scraping page {page_num}: {page_url}")
    soup = get_soup(page_url, wait=0.3)
    all_rows.extend(parse_page(soup))
    page_url = find_next_page(soup, page_url)
    page_num += 1

df = pd.DataFrame(all_rows)
print(f"Collected {len(df)} rows")
df.head(10)


Scraping page 1: https://books.toscrape.com/catalogue/page-1.html
Scraping page 2: https://books.toscrape.com/catalogue/page-2.html
Scraping page 3: https://books.toscrape.com/catalogue/page-3.html
Scraping page 4: https://books.toscrape.com/catalogue/page-4.html
Scraping page 5: https://books.toscrape.com/catalogue/page-5.html
Scraping page 6: https://books.toscrape.com/catalogue/page-6.html
Scraping page 7: https://books.toscrape.com/catalogue/page-7.html
Scraping page 8: https://books.toscrape.com/catalogue/page-8.html
Scraping page 9: https://books.toscrape.com/catalogue/page-9.html
Scraping page 10: https://books.toscrape.com/catalogue/page-10.html
Scraping page 11: https://books.toscrape.com/catalogue/page-11.html
Scraping page 12: https://books.toscrape.com/catalogue/page-12.html
Scraping page 13: https://books.toscrape.com/catalogue/page-13.html
Scraping page 14: https://books.toscrape.com/catalogue/page-14.html
Scraping page 15: https://books.toscrape.com/catalogue/page-15.htm

| | title | price | availability | product_url |
|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock | https://books.toscrape.com/catalogue/a-light-i... |
| 1 | Tipping the Velvet | £53.74 | In stock | https://books.toscrape.com/catalogue/tipping-t... |
| 2 | Soumission | £50.10 | In stock | https://books.toscrape.com/catalogue/soumissio... |
| 3 | Sharp Objects | £47.82 | In stock | https://books.toscrape.com/catalogue/sharp-obj... |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock | https://books.toscrape.com/catalogue/sapiens-a... |
| 5 | The Requiem Red | £22.65 | In stock | https://books.toscrape.com/catalogue/the-requi... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 | In stock | https://books.toscrape.com/catalogue/the-dirty... |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 | In stock | https://books.toscrape.com/catalogue/the-comin... |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 | In stock | https://books.toscrape.com/catalogue/the-boys-... |
| 9 | The Black Maria | £52.15 | In stock | https://books.toscrape.com/catalogue/the-black... |
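
The `price` column is scraped as text, so it cannot be sorted or averaged as a number yet. A minimal sketch of one way to convert it follows; the helper name `parse_price` is hypothetical, and the regular expression assumes prices are written as plain decimals such as `51.77`.

```python
import re

# Matches the first decimal number in a string, e.g. "51.77" in "£51.77".
PRICE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)")


def parse_price(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    return float(match.group(1)) if match else None


clean = parse_price("£51.77")
garbled = parse_price("Â£51.77")  # still works if the currency symbol was mis-decoded
missing = parse_price("n/a")
print(clean, garbled, missing)
```

In the notebook this could be applied with `df["price_gbp"] = df["price"].map(parse_price)`, where `price_gbp` is an assumed column name.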



## 4) Save Data (CSV & JSON)

Run to create files you can download.  
**Tip:** To keep a copy in Google Drive, run the next section after this one; it mounts Drive and copies these files there.


In [4]:

csv_path = "books_data.csv"
json_path = "books_data.json"

df.to_csv(csv_path, index=False)
df.to_json(json_path, orient="records", force_ascii=False, indent=2)
print("Saved files:")
print(" -", csv_path)
print(" -", json_path)


Saved files:
 - books_data.csv
 - books_data.json



## 5) (Optional) Save to Google Drive

Run this cell to mount your Drive and copy the output files there.
You'll be asked to authorize the connection.


In [5]:

from google.colab import drive
drive.mount('/content/drive')

target_folder = "/content/drive/MyDrive/web_scraping_outputs"
import os, shutil

os.makedirs(target_folder, exist_ok=True)
for p in [csv_path, json_path]:
    shutil.copy(p, os.path.join(target_folder, os.path.basename(p)))

print("Copied to:", target_folder)


Mounted at /content/drive
Copied to: /content/drive/MyDrive/web_scraping_outputs



## 6) (Optional) Dynamic Content with Selenium

If a site loads its data with JavaScript, `requests` only receives the initial HTML and never runs the page's scripts, so that data is missing from the response. Use **Selenium** to automate a headless browser and grab the rendered HTML.

Below is a minimal setup that works in Colab.  
**Demo target:** `https://quotes.toscrape.com/scroll` (AJAX infinite scroll).


In [1]:
# Install Chrome & Selenium dependencies
!apt-get update -y
!apt-get install -y wget unzip curl
!wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!dpkg -i google-chrome-stable_current_amd64.deb || apt-get -fy install

# Install selenium + driver manager
!pip install -q selenium webdriver-manager

# ---- Python part ----
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time
import pandas as pd

# Configure Chrome (headless mode for Colab)
chrome_options = Options()
chrome_options.add_argument("--headless=new")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

# Start Chrome
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=chrome_options
)

try:
    driver.get("https://quotes.toscrape.com/scroll")
    time.sleep(2)

    # Scroll a few times to load more content
    for _ in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1.5)

    # Extract data
    quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
    scraped = []
    for q in quotes:
        text = q.find_element(By.CSS_SELECTOR, ".text").text
        author = q.find_element(By.CSS_SELECTOR, ".author").text
        scraped.append({"text": text, "author": author})

    df_dyn = pd.DataFrame(scraped)
    print("Dynamic rows:", len(df_dyn))
    display(df_dyn.head(10))
finally:
    driver.quit()



| | text | author |
|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein |
| 6 | “It is better to be hated for what you are tha... | André Gide |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin |



## 7) (Optional) Save Dynamic Results

If you ran the Selenium section, run this to save the results as files too.


In [2]:

if 'df_dyn' in globals():
    df_dyn.to_csv("quotes_dynamic.csv", index=False)
    df_dyn.to_json("quotes_dynamic.json", orient="records", indent=2, force_ascii=False)
    print("Saved quotes_dynamic.csv & quotes_dynamic.json")
else:
    print("Skip: dynamic section wasn't run.")


Saved quotes_dynamic.csv & quotes_dynamic.json



## 8) Tips & Troubleshooting

- **Robots.txt & ToS:** Always confirm you have permission to scrape.
- **Selectors break?** Right-click → *Inspect* to re-check classes/structure.
- **Pagination strategies:**
  - Look for a *Next* button and follow its link (used above).
  - Increment a `page=` query parameter until a page returns no results.
- **Rate limiting:** Add `time.sleep()` between requests and set headers.
- **Errors:** Wrap requests in `try/except`, and use `response.raise_for_status()`.
- **Prefer APIs:** If the site has a public API endpoint, use it instead of scraping.
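
The `page=` strategy from the pagination tip can be sketched as follows. This is an illustration under stated assumptions: `build_page_url`, `scrape_all_pages`, the `example.com` URLs, and the in-memory `FAKE_SITE` are all hypothetical stand-ins so the example runs offline. In practice, `fetch_items` would request the URL and parse the items out of the response.

```python
from urllib.parse import urlencode


def build_page_url(base_url, page, param="page"):
    """Return base_url with a page query parameter, e.g. ...?page=3."""
    return f"{base_url}?{urlencode({param: page})}"


def scrape_all_pages(base_url, fetch_items, max_pages=100):
    """Request page 1, 2, 3, ... and stop at the first page with no items.

    max_pages is a safety cap in case the site never returns an empty page.
    """
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_items(build_page_url(base_url, page))
        if not items:
            break
        all_items.extend(items)
    return all_items


# Hypothetical stand-in for a real site: two pages of results, then nothing.
FAKE_SITE = {
    "https://example.com/items?page=1": ["a", "b"],
    "https://example.com/items?page=2": ["c"],
}


def fake_fetch(url):
    return FAKE_SITE.get(url, [])


items = scrape_all_pages("https://example.com/items", fake_fetch)
print(items)
```

Passing the fetch function in as an argument keeps the loop independent of any one site, so it can be tested without network access.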
