# Types of Techniques to Avoid Getting Blocked

To extract data from a webpage, your scraper needs to avoid detection. The main types of **techniques for that are imitating a real browser and simulating human behavior**.

### 1. Set Real Request Headers

As we mentioned, your scraper activity should look as similar as possible to a regular user browsing the target website. Web browsers usually send a lot of information that HTTP clients or libraries don’t.

One of the most important headers for web scraping is `User-Agent`. That string informs the server about the operating system, vendor, and version of the requesting client.

Copy the headers your own browser sends (you can see them in the Network tab of its developer tools), then set them using your preferred library so the target website thinks your web scraper is a regular web browser.
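
As a rough illustration of that step, the sketch below attaches a browser-like header set to a request with Python's standard `urllib`. The header values are assumptions modeled on what a desktop Chrome typically sends, and `build_browser_headers` is a hypothetical helper name; take the real values from your own browser.

```python
from urllib.request import Request

# Assumed example value; replace it with the string your own browser sends.
CHROME_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)


def build_browser_headers(user_agent: str) -> dict:
    """Return a header set resembling a desktop browser's (hypothetical helper)."""
    return {
        "User-Agent": user_agent,
        # The values below are illustrative assumptions, not a complete browser profile.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Upgrade-Insecure-Requests": "1",
    }


# Build the request object; nothing is sent until it is passed to urlopen().
request = Request(
    "https://httpbin.io/user-agent",
    headers=build_browser_headers(CHROME_USER_AGENT),
)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

Passing `request` to `urllib.request.urlopen()` would send it with these headers; the same dictionary could be handed to any other HTTP client that accepts custom headers.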

## What Is a User Agent?

A **User Agent (UA)** is a string that the user's web browser sends to a server. It travels in the `User-Agent` HTTP header and identifies the browser type and version as well as the operating system.

- On the client side, you can read it with JavaScript through the `navigator.userAgent` property.
- The remote web server uses this information to identify the client and render the content in a way that's compatible with it.

### Format of UA:

`Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>`
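
To make the layout concrete, here is a minimal sketch that splits a UA string into those four parts with a regular expression. The pattern and the `split_user_agent` helper are illustrative assumptions that cover the common desktop layouts, not a complete parser.

```python
import re

# Pattern following the layout above; platform details and extensions are optional.
UA_PATTERN = re.compile(
    r"^Mozilla/5\.0 \((?P<system_information>[^)]*)\)"
    r" (?P<platform>\S+)"
    r"(?: \((?P<platform_details>[^)]*)\))?"
    r"(?: (?P<extensions>.*))?$"
)


def split_user_agent(ua: str) -> dict:
    """Split a UA string into the parts of the format (hypothetical helper)."""
    match = UA_PATTERN.match(ua)
    if match is None:
        raise ValueError(f"Unrecognized User Agent layout: {ua!r}")
    return match.groupdict()


chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)

parts = split_user_agent(chrome_ua)
for name, value in parts.items():
    print(f"{name}: {value}")
```

For a Firefox string, which has no parenthesized platform details, the `platform_details` entry comes back as `None`.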

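#### Example: Execute JavaScript to get the user agent
```python
# Option 1: read it from the browser (assumes `driver` is a running Selenium WebDriver)
user_agent = driver.execute_script("return navigator.userAgent;")
print(f"User-Agent: {user_agent}")

# Option 2: ask an echo endpoint such as HTTPBin what it received (needs `import requests`)
response = requests.get("https://httpbin.io/user-agent")
user_agent = response.json().get("user-agent")
```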

### Why Is a User Agent Important for Web Scraping?
Since UA strings help web servers identify the type of browser (or bot) making a request, sending a real browser's UA string can help your spider pass as a web browser.
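
To see why this matters, the short sketch below prints the `User-Agent` that Python's standard `urllib` sends when you don't override it. It only inspects the default opener and sends nothing; the value begins with `Python-urllib/` followed by the interpreter version, which names the library rather than a browser.

```python
from urllib.request import build_opener

# A freshly built opener carries urllib's default headers.
opener = build_opener()
default_headers = dict(opener.addheaders)

# The default value names the library and Python version, not a browser.
print(default_headers["User-agent"])
```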

## What Are the Best User Agents for Scraping?

We compiled a list of the best ones to use while scraping. They can help you emulate a browser and avoid getting blocked:

- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36
- Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 OPR/109.0.0.0
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36
- Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0
- Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15
- Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 OPR/109.0.0.0
- Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36
- Mozilla/5.0 (X11; Linux i686; rv:124.0) Gecko/20100101 Firefox/124.0
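
Several strings in this list share tokens: the Edge and Opera strings also contain `Chrome/`, and every Chromium-based string contains `Safari/`. The sketch below, built around a hypothetical `browser_family` helper, shows one way to tell them apart by checking the most specific token first. It is a simplification for the strings listed here, not a general-purpose detector.

```python
def browser_family(ua: str) -> str:
    """Guess the browser family from tokens in the UA string (hypothetical helper)."""
    # Order matters: Edge and Opera also contain "Chrome/", and Chrome contains "Safari/".
    ordered_tokens = (
        ("Edg/", "Edge"),
        ("OPR/", "Opera"),
        ("Firefox/", "Firefox"),
        ("Chrome/", "Chrome"),
        ("Safari/", "Safari"),
    )
    for token, family in ordered_tokens:
        if token in ua:
            return family
    return "Unknown"


# A few strings taken from the list above.
samples = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
]

for ua in samples:
    print(f"{browser_family(ua)}: {ua}")
```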

## How to Check User Agents and Understand Them

The easiest way to do so is to visit [UserAgentString.com](https://useragentstring.com/). It automatically displays the user agent for your web browsing environment. You can also get comprehensive information on other user agents: paste any string into the input field and click "Analyze".


### How to Set a New User Agent Header in Python?
Let's run a quick example of changing a scraper's user agent in Python, using Selenium to drive Chrome. We'll use a string associated with Chrome:

`Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36`

To test your request headers, use an endpoint like HTTPBin, which is specifically designed for testing HTTP requests and responses. The following code snippet sets the `User-Agent` header, sends the request, and prints the value the server received:

```python
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# ...

# create a Chrome Options instance
options = Options()
options.add_argument("--headless=new")

# set a custom User Agent
custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
options.add_argument(f"user-agent={custom_user_agent}")

# Start the WebDriver instance with the options
driver = webdriver.Chrome(options=options)

# open the target website
driver.get("https://httpbin.io/user-agent")

# print the User Agent to verify
print(driver.find_element(By.TAG_NAME, "body").text)

# release the allocated resources
driver.quit()
```

Output:

```
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
```
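
Rather than comparing the printed value by eye, you could check it in code. The sketch below is one way to do that, assuming the endpoint returns a JSON body with a `user-agent` field as in the output above. `user_agent_was_applied` is a hypothetical helper, and the sample body is pasted in so the check runs without a browser.

```python
import json

custom_user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

# Sample body copied from the output above; in the Selenium script this would be
# driver.find_element(By.TAG_NAME, "body").text
body_text = """
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
"""


def user_agent_was_applied(body: str, expected: str) -> bool:
    """Return True if the echoed User Agent equals the one we set (hypothetical helper)."""
    return json.loads(body).get("user-agent") == expected


print(user_agent_was_applied(body_text, custom_user_agent))
```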


## Rotate User Agents

Sending every request with the same User Agent makes your traffic easy to single out, so keep a list of strings and pick one at random for each browser session:

```python
import random

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# list of User Agent strings
user_agents = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

# create a Chrome Options instance
options = Options()
options.add_argument("--headless=new")

# set a random User Agent
random_user_agent = random.choice(user_agents)
options.add_argument(f"user-agent={random_user_agent}")

# Start the WebDriver instance with the options
driver = webdriver.Chrome(options=options)

# open the target website
driver.get("https://httpbin.io/user-agent")

# print the User Agent to verify
print(driver.find_element(By.TAG_NAME, "body").text)

# release the allocated resources
driver.quit()
```

Output (the value changes from run to run):

```
{
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0"
}
```
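
Note that `random.choice` can return the same string several times in a row. If you would rather use every string once before any repeats, one possible alternative is a generator that reshuffles the pool on each pass. The `rotating_user_agents` helper below is a hypothetical sketch of that idea, not part of Selenium; each value it yields would be passed to `options.add_argument(f"user-agent={...}")` when starting a new driver session.

```python
import random
from typing import Iterator, Optional, Sequence


def rotating_user_agents(pool: Sequence[str], seed: Optional[int] = None) -> Iterator[str]:
    """Yield User Agents forever, reshuffling the pool on every pass (hypothetical helper)."""
    if not pool:
        # Raised on the first next() call, since generators start lazily.
        raise ValueError("pool must contain at least one User Agent")
    rng = random.Random(seed)
    while True:
        # Shuffle a copy so each pass uses every string exactly once.
        batch = list(pool)
        rng.shuffle(batch)
        yield from batch


user_agents = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

rotation = rotating_user_agents(user_agents, seed=42)

# Take one full pass: every string appears exactly once, in shuffled order.
first_pass = [next(rotation) for _ in range(len(user_agents))]
print(sorted(first_pass) == sorted(user_agents))
```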


The same rotation drops into a full scraper. The example below picks a random User Agent, scrapes the product listing, and writes the results to a CSV file:

```python
# import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import csv
import random

# list of user agents to rotate
user_agents = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

# randomly select a user agent from the list
random_user_agent = random.choice(user_agents)

# create a Chrome Options instance
options = Options()

# set the user agent in the options
options.add_argument(f"user-agent={random_user_agent}")

# set the options to use Chrome in headless mode
options.add_argument("--headless=new")

# initialize an instance of the chrome driver (browser) in headless mode
driver = webdriver.Chrome(
    options=options,
)

# visit your target site
driver.get("https://www.scrapingcourse.com/ecommerce")

# extract all the product containers
products = driver.find_elements(By.CSS_SELECTOR, ".product")

# declare an empty list to collect the extracted data
extracted_products = []

# loop through the product containers
for product in products:

    # extract the elements into a dictionary using the CSS selector
    product_data = {
        "Url": product.find_element(
            By.CSS_SELECTOR, ".woocommerce-LoopProduct-link"
        ).get_attribute("href"),
        "Image": product.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
            "src"
        ),
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".price").text,
    }

    # append the extracted data to the extracted_products list
    extracted_products.append(product_data)

print(extracted_products)

# specify the CSV file name
csv_file = "products.csv"

# write the extracted data to the CSV file
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
    # write the headers
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    # write the rows
    writer.writerows(extracted_products)

# confirm that the data has been written to the CSV file
print(f"Data has been written to {csv_file}")

# release the resources allocated by Selenium and shut down the browser
driver.quit()
```


[{'Url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/', 'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg', 'Name': 'Abominable Hoodie', 'Price': '$69.00'}, {'Url': 'https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/', 'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg', 'Name': 'Adrienne Trek Jacket', 'Price': '$57.00'}, {'Url': 'https://www.scrapingcourse.com/ecommerce/product/aeon-capri/', 'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp07-black_main.jpg', 'Name': 'Aeon Capri', 'Price': '$48.00'}, {'Url': 'https://www.scrapingcourse.com/ecommerce/product/aero-daily-fitness-tee/', 'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms01-blue_main.jpg', 'Name': 'Aero Daily Fitness Tee', 'Price': '$24.00'}, {'Url': 'https://www.scrapingcourse.com/ecommerce/product/aether-gym-pant/', 'Image': 'h