<a href="https://colab.research.google.com/github/EmmanuelKnows/DS-Codveda/blob/main/Data_Collection_and_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection and Web Scraping

## Step 1: Install and Import Necessary Libraries

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Step 2: Sending a Request and Parsing HTML
The requests library downloads the webpage's HTML, and BeautifulSoup turns that raw text into a searchable "tree" of objects.

In [None]:
# 1. Identify the URL
url = "http://books.toscrape.com/catalogue/page-1.html"

# 2. Send an HTTP request
response = requests.get(url)

# 3. Check if the request was successful (Status Code 200)
if response.status_code == 200:
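    # 4. Parse the HTML content. Passing the raw bytes (response.content) lets
    #    BeautifulSoup detect the page's encoding; response.text can mis-decode "£"
    soup = BeautifulSoup(response.content, 'html.parser')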
    print("Page parsed successfully!")
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Page parsed successfully!


## Step 3: Extracting Specific Data
Now we locate the HTML elements that hold the data we want. In this example, each book is contained within an `<article class="product_pod">` element.

In [None]:
books_data = []

# Find all book containers
books = soup.find_all('article', class_='product_pod')

for book in books:
    # Extract Title (found inside an <a> tag inside an <h3>)
    title = book.h3.a['title']

    # Extract Price
    price = book.find('p', class_='price_color').text

    # Extract Availability
    stock = book.find('p', class_='instock availability').text.strip()

    # Store in a list of dictionaries
    books_data.append({
        'Title': title,
        'Price': price,
        'Stock': stock
    })

print(f"Extracted {len(books_data)} books.")


Extracted 20 books.
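
The `Price` values collected above are strings that still carry a currency symbol, so they cannot be summed or sorted numerically as they are. Below is a minimal sketch of one way to convert them, assuming every price contains a single decimal number. `parse_price` is a hypothetical helper, not part of the original notebook.

```python
import re

def parse_price(text):
    """Return the first decimal number found in a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return float(match.group())

# Sample strings, including one with a mis-decoded currency symbol
sample_prices = ["£51.77", "Â£53.74", "£50.10"]
cleaned = [parse_price(p) for p in sample_prices]
print(cleaned)
```

Because the pattern only looks for digits, the same helper also tolerates a mis-decoded symbol in front of the number.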


## Step 4: Handling Pagination
To scrape multiple pages, you wrap your logic in a loop and dynamically change the URL.

In [None]:
all_books = []

# Loop through the first 3 pages
for i in range(1, 4):
    url = f"http://books.toscrape.com/catalogue/page-{i}.html"
    res = requests.get(url)
    sp = BeautifulSoup(res.content, 'html.parser')  # raw bytes let BeautifulSoup detect the encoding

    items = sp.find_all('article', class_='product_pod')
    for item in items:
        all_books.append({
            'Title': item.h3.a['title'],
            'Price': item.find('p', class_='price_color').text
        })

print(f"Total books collected: {len(all_books)}")

Total books collected: 60
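
The loop above hard-codes three pages. When the number of pages is unknown, one option is to keep requesting until a page is missing. The sketch below shows that idea offline: `collect_pages` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return the page HTML, or `None` when the page does not exist.

```python
def collect_pages(fetch, url_template, max_pages=100):
    """Call fetch(url) for page 1, 2, ... until it returns None or max_pages is reached."""
    pages = []
    for page_number in range(1, max_pages + 1):
        html = fetch(url_template.format(page_number))
        if html is None:  # fetch signals "no such page" with None
            break
        pages.append(html)
    return pages

# Offline stand-in for a real fetcher: a pretend site with exactly 3 pages
fake_site = {
    f"http://example.test/page-{n}.html": f"<html>page {n}</html>"
    for n in (1, 2, 3)
}
pages = collect_pages(fake_site.get, "http://example.test/page-{}.html")
print(len(pages))
```

With requests, `fetch` could be a small function that returns `response.content` when the status code is 200 and `None` otherwise. The `max_pages` cap keeps a misbehaving site from causing an endless loop.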


## Step 5: Storing Data in CSV/JSON
Pandas makes it incredibly easy to convert a list of dictionaries into a structured file.

In [None]:
# Create a DataFrame
df = pd.DataFrame(all_books)

# Save to CSV
df.to_csv('books_data.csv', index=False)

# Save to JSON
df.to_json('books_data.json', orient='records')

print("Data saved successfully to CSV and JSON!")

Data saved successfully to CSV and JSON!
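
For comparison, the same list of dictionaries can be written without pandas. This sketch swaps in the standard library's `csv.DictWriter` and `json.dumps`, and writes to in-memory strings rather than files so it can run anywhere. The two records are taken from the scraped data in this notebook.

```python
import csv
import io
import json

records = [
    {"Title": "A Light in the Attic", "Price": "£51.77"},
    {"Title": "Tipping the Velvet", "Price": "£53.74"},
]

# CSV: the header row comes from the dictionary keys
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["Title", "Price"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: ensure_ascii=False keeps the pound sign readable instead of escaping it
json_text = json.dumps(records, ensure_ascii=False)

print(csv_text)
print(json_text)
```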


## Handling Common Challenges
1. Dynamic Content (JavaScript)
If the data you see in your browser doesn't appear in the requests response, it is likely being loaded dynamically via JavaScript.

Solution: Use Selenium or Playwright. These tools control a real web browser to render JavaScript before scraping.

2. User-Agent Headers
Some websites block "bot-like" requests. You can bypass this by mimicking a real browser header.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

3. Rate Limiting
If you scrape too fast, the website might block your IP.

Solution: Use time.sleep(2) between requests to slow down your scraper.
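
A fixed pause can be combined with retries that wait longer after each refusal. Here is a minimal sketch of exponential backoff, assuming the site signals throttling with HTTP status 429 (Too Many Requests). `fetch_with_backoff` and the `fetch` callable are hypothetical names; `fetch` is assumed to return a `(status, body)` pair so the example can run offline.

```python
import time

def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry fetch(url) with exponentially growing pauses while it reports HTTP 429."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return status, body

# Offline demonstration: two throttled replies, then success.
# The delays are recorded in a list instead of actually sleeping.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
result = fetch_with_backoff(lambda url: next(responses), "http://example.test/", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is a design choice that keeps the function testable; in a real scraper the default `time.sleep` would be used.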



# Python Script Handling Common Challenges in Simple Web Scraping
**Using User-Agent and Rate Limiting**

Many websites inspect the User-Agent string to see what kind of client is visiting. By default, requests identifies itself as "python-requests", which marks the traffic as automated. Sending a browser-style string instead makes the request look like it comes from an ordinary browser.

Rate limiting (time.sleep) ensures you don't overwhelm the website's server, which is both ethical and a way to avoid IP bans.

In [None]:
import requests
import time
import random  # Used to vary the sleep time
import pandas as pd  # Needed for the DataFrame created after the loop
from bs4 import BeautifulSoup

# Define a real browser header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}

# Alternatively, loop over an explicit list of URLs:
# urls = ["http://books.toscrape.com/catalogue/page-1.html", "http://books.toscrape.com/catalogue/page-2.html"]

all_books = []

# Loop through the first 3 pages
for i in range(1, 4):
    url = f"http://books.toscrape.com/catalogue/page-{i}.html"
    # Pass the headers into the get request
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        # Parse only successful responses; passing the raw bytes lets
        # BeautifulSoup detect the page's encoding
        sp = BeautifulSoup(response.content, 'html.parser')
        print("Page parsed successfully!")

        # Scraping the data
        items = sp.find_all('article', class_='product_pod')
        for item in items:
            all_books.append({
                'Title': item.h3.a['title'],
                'Price': item.find('p', class_='price_color').text
            })

        print("Successfully scraped.")
    else:
        print(f"Failed to retrieve page {i}. Status code: {response.status_code}")

    # Rate Limiting: Wait for 2 to 5 seconds before the next request
    # Using random makes your behavior less predictable to bot-detectors
    wait_time = random.uniform(2, 5)
    print(f"Sleeping for {wait_time:.2f} seconds...")
    time.sleep(wait_time)

print(f"Total books collected: {len(all_books)}")

# Create a DataFrame
df = pd.DataFrame(all_books)

# Save to CSV
df.to_csv('books_data.csv', index=False)

# Save to JSON
df.to_json('books_data.json', orient='records')

print("Data saved successfully to CSV and JSON!")


Page parsed successfully!
Successfully scraped.
Sleeping for 2.34 seconds...
Page parsed successfully!
Successfully scraped.
Sleeping for 3.95 seconds...
Page parsed successfully!
Successfully scraped.
Sleeping for 3.83 seconds...
Total books collected: 60


In [None]:
df

|   | Title | Price |
|---|-------|-------|
| 0 | A Light in the Attic | £51.77 |
| 1 | Tipping the Velvet | £53.74 |
| 2 | Soumission | £50.10 |
| 3 | Sharp Objects | £47.82 |
| 4 | Sapiens: A Brief History of Humankind | £54.23 |
| 5 | The Requiem Red | £22.65 |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 |
| 9 | The Black Maria | £52.15 |


## Scraping Dynamic Content with Selenium
If a website uses JavaScript to load data (like a "Load More" button or infinite scroll), requests won't see that data because it doesn't "run" JavaScript.

Selenium opens a real browser (Chrome, Firefox, etc.) and waits for the page to fully render.

Prerequisites
You need the library and a driver (though modern Selenium manages drivers automatically):

In [None]:
pip install selenium

Collecting selenium
  Downloading selenium-4.39.0-py3-none-any.whl.metadata (7.5 kB)
Collecting trio<1.0,>=0.31.0 (from selenium)
  Downloading trio-0.32.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket<1.0,>=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting sortedcontainers (from trio<1.0,>=0.31.0->selenium)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting outcome (from trio<1.0,>=0.31.0->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket<1.0,>=0.12.2->selenium)
  Downloading wsproto-1.3.2-py3-none-any.whl.metadata (5.2 kB)
Downloading selenium-4.39.0-py3-none-any.whl (9.7 MB)

This script opens a browser, waits for the page to load, and then grabs the content. 👇🏻👇🏻👇🏻

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

# 1. Setup Chrome Options (Add User-Agent here too!)
chrome_options = Options()
chrome_options.add_argument("--headless") # Run without a visible window
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

# 2. Initialize the Driver
driver = webdriver.Chrome(options=chrome_options)

try:
    url = "https://example-dynamic-site.com"  # Placeholder: replace with the real target URL
    driver.get(url)

    # 3. Rate Limiting / Waiting
    # Selenium needs time for JavaScript to execute
    time.sleep(5)

    # 4. Find elements using Selenium's methods
    # Example: Finding book titles by their tag name (<h3>)
    titles = driver.find_elements(By.TAG_NAME, "h3")

    for title in titles:
        print(title.text)

finally:
    # Always close the browser when finished
    driver.quit()