# Practice NB Using ZenRows Web-Scraping Tutorial
---
## Douglas Krouth
## [TUTORIAL URL](https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja#ip-rate-limit)

In [2]:
# IP Rate Limit
# We don't know in advance whether the site enforces an IP rate limit.
# The workaround is to route requests through a rotating pool of proxy IP addresses.
import requests 
 
response = requests.get('http://httpbin.org/ip') 
print(response.json()['origin'])

73.62.199.225


Free proxy list site : [URL](https://free-proxy-list.net/)
---
This gives us a list of *unreliable* proxies that we can practice with.

In [4]:
proxies = {'http': 'http://105.242.158.92:3129'} 
response = requests.get('http://httpbin.org/ip', proxies=proxies) 
print(response.json()['origin']) 

105.242.158.92
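
The cell above uses a single fixed proxy. Rotation could look roughly like the sketch below. It is not the tutorial's code: `PROXY_POOL`, `build_proxies`, and `fetch_origin` are hypothetical names, and the second and third addresses are placeholders from the reserved documentation IP ranges, not working proxies.

```python
import random

# Hypothetical pool; only the first address appears in this notebook.
PROXY_POOL = [
    "http://105.242.158.92:3129",
    "http://203.0.113.10:8080",   # placeholder (documentation IP range)
    "http://198.51.100.23:3128",  # placeholder (documentation IP range)
]


def build_proxies(proxy_url):
    """Return the mapping that requests expects for its `proxies` argument."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_origin(pool, attempts=3, timeout=5):
    """Try up to `attempts` randomly chosen proxies; return the IP httpbin sees, or None."""
    import requests  # third-party; imported here so build_proxies has no dependencies

    for _ in range(attempts):
        proxy_url = random.choice(pool)
        try:
            response = requests.get(
                "http://httpbin.org/ip",
                proxies=build_proxies(proxy_url),
                timeout=timeout,
            )
            response.raise_for_status()
            return response.json()["origin"]
        except (requests.RequestException, ValueError):
            continue  # free proxies fail often, so move on to another one
    return None
```

Calling `fetch_origin(PROXY_POOL)` should print a different origin on different runs when the pool holds working proxies. The timeout matters because free proxies often hang instead of failing fast.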


## User-Agent Header
Check the request headers to see whether we're sending a suspicious User-Agent (UA).

In [5]:
response = requests.get('http://httpbin.org/headers') 
print(response.json()['headers']['User-Agent']) 
# Prints the library's default UA, e.g. python-requests/2.28.2

python-requests/2.28.2


In [6]:
# Fake User-Agent header passed with requests
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"} 
response = requests.get('http://httpbin.org/headers', headers=headers) 
print(response.json()['headers']['User-Agent']) # Mozilla/5.0 ...

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36


## User Agent
Mitigate detection by rotating User-Agent headers at random. This is the same idea as rotating IP addresses: it lets us simulate traffic from many different users.

Online resources list real browser User-Agent strings. One challenge with simulating UA data is that browser updates will likely change the format of the UA header, so the list has to be kept current to accommodate them.

[USER AGENT DATABASE](https://explore.whatismybrowser.com/useragents/explore/)

In [7]:
import random 
 
user_agents = [ 
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 
	'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36', 
	'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148', 
	'Mozilla/5.0 (Linux; Android 11; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Mobile Safari/537.36' 
] 
user_agent = random.choice(user_agents) 
headers = {'User-Agent': user_agent} 
response = requests.get('https://httpbin.org/headers', headers=headers) 
print(response.json()['headers']['User-Agent']) 
# e.g. Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) ... (varies per run)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36


## Sample Browser Headers
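Simple User-Agent swaps are easily detected by good security systems, since the remaining headers may not match what that browser would actually send. A better approach is to send the full set of headers a real browser sends, which can be copied from a browser dev tool such as Chrome's web dev suite. This data is usually exported as a cURL command, so we need to convert it to JSON.

The two samples below differ in obvious ways (Chrome/Chromium vs. Firefox): for example, only Chrome sends the `Sec-Ch-Ua` headers, and the `Accept` and `Accept-Language` values are not the same.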

### Chrome sample header

```
{ 
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
	"Accept-Encoding": "gzip, deflate, br", 
	"Accept-Language": "en-US,en;q=0.9", 
	"Host": "httpbin.org", 
	"Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"", 
	"Sec-Ch-Ua-Mobile": "?0", 
	"Sec-Fetch-Dest": "document", 
	"Sec-Fetch-Mode": "navigate", 
	"Sec-Fetch-Site": "none", 
	"Sec-Fetch-User": "?1", 
	"Upgrade-Insecure-Requests": "1", 
	"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36" 
}
```
### Firefox sample header
```
{ 
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
	"Accept-Encoding": "gzip, deflate, br", 
	"Accept-Language": "en-US,en;q=0.5", 
	"Host": "httpbin.org", 
	"Sec-Fetch-Dest": "document", 
	"Sec-Fetch-Mode": "navigate", 
	"Sec-Fetch-Site": "none", 
	"Sec-Fetch-User": "?1", 
	"Upgrade-Insecure-Requests": "1", 
	"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0" 
}
```
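
A sketch of how the two samples above might be rotated as complete sets, so the User-Agent always travels with headers that match it. The names `CHROME_HEADERS`, `FIREFOX_HEADERS`, `HEADER_SETS`, and `pick_header_set` are hypothetical, and the sets are abridged from the samples above.

```python
import random

# Header sets abridged from the Chrome and Firefox samples above.
# "Host" is left out because requests derives it from the target URL, and
# "br" is dropped from Accept-Encoding on the assumption that no Brotli
# decoder is installed next to requests.
CHROME_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
}
FIREFOX_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0",
}
HEADER_SETS = [CHROME_HEADERS, FIREFOX_HEADERS]


def pick_header_set():
    """Return a copy of one complete header set, chosen at random."""
    return dict(random.choice(HEADER_SETS))


headers = pick_header_set()
print(headers["User-Agent"])
```

The result can be passed straight to `requests.get(url, headers=pick_header_set())`. Returning a copy keeps later per-request tweaks from leaking back into the shared sets.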

## To get browser headers
Visit the desired website in a browser with Dev Tools -> Network open, select the request, and copy it as cURL. Then convert the cURL command to JSON.

[cURL to JSON Converter](https://www.scrapingbee.com/curl-converter/json/)
### Sample header from httpbin.org
```
{
    "url": "http://httpbin.org",
    "raw_url": "http://httpbin.org/",
    "method": "get",
    "headers": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    },
    "insecure": false
}
```
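
The conversion can also be done locally. The sketch below assumes the copied command passes each header as `-H 'Name: value'`. The function name `headers_from_curl` and the `sample` command are hypothetical; the header values are taken from the JSON above.

```python
import shlex


def headers_from_curl(command):
    """Extract the -H/--header values from a copied cURL command as a dict."""
    # Join the backslash-continued lines, then split the way a shell would.
    tokens = shlex.split(command.replace("\\\n", " "))
    headers = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


# Hypothetical "Copy as cURL" output, shortened for illustration.
sample = """curl 'http://httpbin.org/' \\
  -H 'Accept-Language: en-US,en;q=0.9' \\
  -H 'Cache-Control: max-age=0' \\
  -H 'Upgrade-Insecure-Requests: 1' \\
  --compressed"""

print(headers_from_curl(sample))
```

The returned dict has the same shape as the `"headers"` object in the JSON above, so it can be passed to `requests.get(..., headers=...)` directly.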

## Cookies
Cookies are small blocks of data generated from web servers/hosts that are placed on a user's machine via web browser. The main benefit of cookies is that they enable stateful behaviors that can persist across browsing sessions.

**Examples** : Maintain login info, user selections, payment info

## Cookies w/ scraping
Because cookies can track individual user sessions, the scraper's design needs to send back the cookies a site provides. Missing or inconsistent cookies look suspicious and can get us blocked.

Rotating proxies can cause cookie problems: a site can recognize repeated traffic from various locations that doesn't present the cookie the site originally issued for that "browsing session".

## Whether or not to use Cookies when scraping
Simple cases / websites? Don't worry about cookies/maintaining a session.

Complex sites / lots of auth? Session cookies can make access easier and help get past security checks, provided the session details (IP, User Agent) are kept the same for the whole session.
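
One way to keep the session details stable is to pin each cookie-carrying session to a single proxy and a single User-Agent. This is a sketch, not the tutorial's code: `assign_identities` and `open_session` are hypothetical helpers built on `requests.Session`, which stores the cookies a site sets and sends them back on later requests.

```python
import random


def assign_identities(proxy_urls, user_agents):
    """Pair each proxy with one fixed User-Agent so the pairing never changes mid-session."""
    return {proxy: random.choice(user_agents) for proxy in proxy_urls}


def open_session(proxy_url, user_agent):
    """Return a requests.Session pinned to one proxy and one User-Agent.

    Cookies set by the site are kept on the session object and reused by
    every later request made through it.
    """
    import requests  # third-party; imported here so assign_identities has no dependencies

    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session
```

With `identities = assign_identities(proxy_urls, user_agents)`, each `open_session(proxy, identities[proxy])` yields a session whose IP, User-Agent, and cookies stay together, and rotation happens between sessions instead of between requests.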

## Playwright
For sites that require more rigorous auth (cookies, etc.) we can use a headless browser. The tutorial uses Playwright; Selenium will also work fine.

NOTE : To get Playwright to work in Jupyter, we need to use the async API (`playwright.async_api`) instead of the sync API, because Jupyter already runs an asyncio event loop. We also need to install the browser executable for the browser we want to drive.

1. ```pip install pytest-playwright```

2. ```playwright install chromium```

[Playwright for Python intro](https://playwright.dev/python/docs/intro)
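
A minimal sketch of the async API for use inside a notebook, assuming the `playwright` package and the Chromium executable are installed. The function name `fetch_title` is hypothetical.

```python
import asyncio


async def fetch_title(url="https://playwright.dev/"):
    """Open `url` in headless Chromium and return the page title."""
    from playwright.async_api import async_playwright  # third-party dependency

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title


# In a Jupyter cell (an event loop is already running):  title = await fetch_title()
# In a plain script:                                      title = asyncio.run(fetch_title())
```

Awaiting the coroutine directly in a cell avoids the conflict with Jupyter's running event loop that the sync API runs into.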

In [2]:
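# Playwright test, meant to be run with pytest rather than executed in the notebook:
# the `page` argument is a fixture supplied by the pytest-playwright plugin.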
import re
from playwright.sync_api import Page, expect


def test_homepage_has_Playwright_in_title_and_get_started_link_linking_to_the_intro_page(page: Page):
    page.goto("https://playwright.dev/")

    # Expect a title "to contain" a substring.
    expect(page).to_have_title(re.compile("Playwright"))

    # create a locator
    get_started = page.get_by_role("link", name="Get started")

    # Expect an attribute "to be strictly equal" to the value.
    expect(get_started).to_have_attribute("href", "/docs/intro")

    # Click the get started link.
    get_started.click()

    # Expects the URL to contain intro.
    expect(page).to_have_url(re.compile(".*intro"))