# 🎓 Lesson 12: Polite Scraping (Delay, Headers, User-Agent)

## 🎯 Goal

In this lesson, you'll learn how to:

1. Avoid overwhelming websites with too many requests

2. Use `time.sleep()` and random delays

3. Set headers like User-Agent to mimic real browsers

4. Stay respectful and avoid getting blocked

## 🤖 Why Do Bots Get Blocked?

Web servers may block your requests if you:

1. Scrape too fast (many requests per second)

2. Don’t send browser-like headers (looks like a bot)

3. Hit restricted endpoints repeatedly

4. Violate robots.txt or overload the server
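
The last point mentions `robots.txt`. Here is a minimal sketch of checking it with the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `my-scraper` agent name, and the `example.com` URLs are made up so the example runs offline; against a real site you would point the parser at the live file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
agent = "my-scraper"  # hypothetical bot name

for url in ("https://example.com/tag/life/", "https://example.com/private/data"):
    verdict = "allowed" if rules.can_fetch(agent, url) else "disallowed"
    print(f"{url} -> {verdict}")

# crawl_delay() returns None when the file sets no Crawl-delay for this agent.
print("Requested delay:", rules.crawl_delay(agent))
```

If a site publishes a crawl delay, treating it as the minimum pause between requests is a reasonable default.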

## ✅ Example: Add Delay and Headers to Quote Scraper

```python
import random
import time

import requests
from bs4 import BeautifulSoup

# User-defined tag to filter quotes
tag = "life"
base_url = f"https://quotes.toscrape.com/tag/{tag}/page/{{}}/"

# 🧢 Custom headers to simulate a real browser (defined once, reused for every request)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36"
}

# Start from page 1
page = 1

while True:
    url = base_url.format(page)
    print(f"📄 Scraping page {page}...")

    # Send GET request with headers; the timeout keeps a stalled server from hanging the loop
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"⚠️ Got status {response.status_code}. Stopping.")
        break

    soup = BeautifulSoup(response.text, "lxml")

    # Extract all quote blocks
    quotes = soup.select("div.quote")
    if not quotes:
        print("✅ No more quotes found. Stopping.")
        break

    for quote in quotes:
        text = quote.select_one("span.text").text.strip()
        author = quote.select_one("small.author").text.strip()
        print(f"📝 {text} — {author}")

    # Add a polite random delay between 3.5 and 5.0 seconds
    sleep_time = round(random.uniform(3.5, 5.0), 2)
    print(f"⏳ Sleeping for {sleep_time} seconds...\n")
    time.sleep(sleep_time)

    # Move to the next page
    page += 1
```

### Explanation of Key Elements

| Concept                    | Purpose                                |
| -------------------------- | -------------------------------------- |
| `headers['User-Agent']`    | Pretend you're a normal browser        |
| `random.uniform(3.5, 5.0)` | Adds randomness to delay               |
| `time.sleep()`             | Slows down scraping to avoid detection |
| `print(...)`               | Good for debugging and pacing          |
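
The delay logic from the scraper can be wrapped in a small reusable helper. This is a sketch, not part of the original scraper: `polite_sleep` and its `sleeper` and `rng` parameters are hypothetical names, added so the pause can be swapped out when you want to try the function without actually waiting.

```python
import random
import time


def polite_sleep(min_s=3.5, max_s=5.0, sleeper=time.sleep, rng=random):
    """Pause for a random duration in [min_s, max_s] and return that duration."""
    if min_s < 0 or max_s < min_s:
        raise ValueError("expected 0 <= min_s <= max_s")
    delay = round(rng.uniform(min_s, max_s), 2)
    sleeper(delay)
    return delay


# Demo with a stand-in sleeper that records each delay instead of waiting.
recorded = []
for _ in range(3):
    polite_sleep(sleeper=recorded.append)
print(recorded)
```

In the scraper loop you would call `polite_sleep()` with no arguments, which falls back to the real `time.sleep`.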


## What Is an HTTP Header?

When your browser or script sends a request to a website (like `GET /page`), it includes HTTP headers, little pieces of metadata that describe:

- Who’s making the request

- What kind of content is accepted

- What type of connection is used

- Whether it's part of a session (cookies, tokens, etc.)

Headers help the server understand who you are and how to respond.
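
To see this metadata for yourself, you can print the headers that `requests` attaches when you don't supply any. A small sketch, assuming the `requests` package is installed; it makes no network call.

```python
import requests

# Headers that requests sends by default when you pass none of your own.
defaults = requests.utils.default_headers()

for name, value in defaults.items():
    print(f"{name}: {value}")
```

The `User-Agent` line in the output starts with `python-requests/`, which is an easy signal for a server to spot.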

## What Is a User-Agent?

One of the most important headers is `User-Agent`. It tells the server what browser or device is making the request.

For example, here’s a User-Agent for Google Chrome on Windows 10:

`Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36`

💡 Without a browser-like value in this header, the server may assume you're a bot, API client, or script and block or restrict you.

## Other Useful Headers (Optional)

| Header            | Purpose                                                           |
| ----------------- | ----------------------------------------------------------------- |
| `User-Agent`      | Fools the server into thinking you're a browser                   |
| `Accept`          | What type of content you accept (`text/html`, `application/json`) |
| `Referer`         | The page you "came from" — helps simulate real navigation         |
| `Accept-Language` | Tells the site your preferred language                            |
| `Connection`      | Whether to keep the connection alive                              |
| `DNT`             | “Do Not Track” header (optional)                                  |
| `Cache-Control`   | Can control or skip cached versions                               |


## ✅ Simulating a Real Browser

To look more human, combine:

```python
headers = {
    "User-Agent": "...",                  # Real browser UA string
    "Accept-Language": "en-US,en;q=0.9",  # Common browser language
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://google.com",      # Pretend we came from Google
    "DNT": "1",                           # Send the "Do Not Track" preference
    "Connection": "keep-alive"
}
```
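
One way to apply headers like these is to set them once on a `requests.Session`, so every request made through it carries them. The sketch below assumes `requests` is installed and reuses the User-Agent string from the quote scraper earlier in this lesson. It only builds the request with `prepare_request()` and never sends it, so you can inspect the final headers offline.

```python
import requests

BROWSER_HEADERS = {
    # UA string copied from the quote scraper earlier in this lesson
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/113.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://google.com",
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)  # applied to every request from this session

# prepare_request() merges the session headers into the request without sending it.
request = requests.Request("GET", "https://quotes.toscrape.com/tag/life/page/1/")
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

To actually fetch the page you would call `session.get(url, timeout=10)` instead; the session also keeps cookies between requests.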

## Extra Techniques to Avoid Detection

| Technique                           | Description                                       |
| ----------------------------------- | ------------------------------------------------- |
| `time.sleep()` + `random.uniform()` | Add human-like delay between requests             |
| Use sessions (`requests.Session()`) | Simulate browser sessions and reuse cookies       |
| Rotate headers                      | Use different User-Agents for each request        |
| Rotate IPs                          | Use proxy rotation or Tor (advanced topics)       |
| Avoid patterns                      | Don't always scrape at same time, order, or speed |
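
As an illustration of the "Avoid patterns" row, the sketch below shuffles a list of pages and assigns each one its own random delay. `build_schedule` is a hypothetical helper, and the three page URLs are only examples; nothing is fetched here.

```python
import random


def build_schedule(urls, min_delay=3.5, max_delay=5.0, rng=None):
    """Return (url, delay) pairs in shuffled order with a random delay for each."""
    rng = rng or random.Random()
    order = list(urls)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return [(url, round(rng.uniform(min_delay, max_delay), 2)) for url in order]


# Example page URLs, assumed for illustration.
pages = [f"https://quotes.toscrape.com/tag/life/page/{n}/" for n in range(1, 4)]
schedule = build_schedule(pages)

for url, delay in schedule:
    print(f"fetch {url}, then wait {delay}s")
```

Shuffling only makes sense when you know the URLs in advance; for open-ended pagination, varying the delay is the part that still applies.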


## 💡 Pro Tip: Use a Header Generator

You can find real User-Agents here:

- https://www.whatismybrowser.com/guides/the-latest-user-agent/

Use one for:

- Chrome
- Firefox
- Mobile Safari
- Edge

Rotate them every few requests if needed.
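
Rotating every few requests can be sketched with a generator. The User-Agent strings below are illustrative examples (the first is the one used earlier in this lesson); swap in current values from the page linked above. `rotating_user_agents` and `rotate_every` are hypothetical names.

```python
from itertools import cycle, islice

# Illustrative UA strings; replace with up-to-date ones before real use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/113.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.4 Safari/605.1.15",
]


def rotating_user_agents(agents, rotate_every=3):
    """Yield User-Agent strings forever, switching after `rotate_every` requests."""
    if rotate_every < 1:
        raise ValueError("rotate_every must be at least 1")
    for agent in cycle(agents):
        for _ in range(rotate_every):
            yield agent


ua_stream = rotating_user_agents(USER_AGENTS, rotate_every=2)
first_six = list(islice(ua_stream, 6))

for i, ua in enumerate(first_six, start=1):
    headers = {"User-Agent": ua}  # pass this dict as headers= on request i
    print(f"request {i}: {headers['User-Agent']}")
```

Inside a scraping loop you would call `next(ua_stream)` once per request to get the header value for that request.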

## Practice Tasks

1. Replace `tag = "life"` with user input, like:

```python
tag = input("Enter a tag to scrape: ").strip().lower()
```

2. Try changing the sleep delay to a different range, such as 2 to 5 seconds.

3. Modify the code to scrape from a static list of URLs instead of pagination.

💡 Remember: Being polite protects you from being blocked and helps keep scraping sustainable and ethical.

## 🔜 Next up: Lesson 13 – Saving Data to CSV, JSON, or SQLite

You’ll learn how to store scraped data for future use: clean, structured, and ready for analysis.