# 1. What is Scrapy?

Scrapy is an open-source Python framework designed for **extracting data from websites** (web scraping). It can also be used for other types of web crawling, like gathering data for analytics or indexing websites.

## Why Use Scrapy?
- **Fast**: Scrapy sends requests asynchronously, so it can fetch many pages concurrently. This usually makes a crawl faster than a sequential script that pairs a blocking HTTP client with a parser such as BeautifulSoup.
- **Built-in Tools**: Scrapy offers tools for handling requests, following links, handling pagination, and managing data pipelines.
- **Handling Complex Websites**: With Scrapy, you can deal with form submissions, cookies, and authentication. JavaScript-rendered (dynamic) content usually needs an additional rendering tool alongside Scrapy.


# 2. Installation

Install Scrapy using pip (prefix the command with ! when running it inside a notebook cell):<pre>
pip install scrapy</pre>

# 3. Setting Up a Scrapy Project

## 3.1 How to Create a Scrapy Project
To create a new Scrapy project:
- Open a terminal or command prompt.
- Navigate to the directory where you want to create the project.
- Run the following command:<pre>
scrapy startproject myproject</pre>
This will create a new Scrapy project in a directory called myproject.

## 3.2 Overview of the Scrapy Project
The project created by Scrapy will have the following structure:<pre>
_ myproject/
_   scrapy.cfg             # Configuration file for the project
_   myproject/             # Python module for the project
_       __init__.py        # Makes 'myproject' a Python package
_       items.py           # Defines item classes
_       middlewares.py     # For custom middlewares
_       pipelines.py       # For item processing pipelines
_       settings.py        # Project settings
_       spiders/           # Store spider classes (core scraping logic)
_           __init__.py    # Makes 'spiders' a Python package</pre><br>
**scrapy.cfg**: This is the project configuration file, which links the project's settings to the Scrapy tool.<br>
**items.py**: Defines the structure of the data you want to scrape (like a schema).<br>
**middlewares.py**: Contains custom middlewares that modify requests/responses.<br>
**pipelines.py**: Defines how the scraped data is processed or stored after being scraped.<br>
**settings.py**: This file contains all the configuration settings for the Scrapy project (such as enabling/disabling features, customizing behavior).<br>
**spiders/**: This folder stores all the spiders (the actual scraping scripts) you will write.<br>

## 3.3 Explaining Scrapy Spiders, Items & Item Pipelines
### **Spiders**:
A spider is a class responsible for defining how Scrapy will navigate through a website and extract the information you need.
Spiders are placed inside the spiders/ directory.
A basic spider looks like this:

In [None]:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {
            'title': response.css('title::text').get(),
            'url': response.url
        }

**start_urls**: The initial URL(s) the spider will start scraping.<br>
**parse method**: Defines how to extract and return the data from the page using Scrapy selectors (.css or .xpath).

### **Items**:
Items are used to define the data structure of what you’re scraping.
You create an item by defining it in items.py like this:

In [None]:
import scrapy
class MyItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

You can then yield a MyItem instance in the spider instead of a dictionary:

In [None]:
import scrapy
from myproject.items import MyItem

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com']

    def parse(self, response):
        item = MyItem()
        item['title'] = response.css('title::text').get()
        item['url'] = response.url
        yield item



### **Item Pipelines**:
Pipelines are used for post-processing the scraped items (e.g., cleaning the data, saving it to a database, etc.).
Pipelines are defined in pipelines.py. Here's an example:

In [None]:
class MyPipeline:
    def process_item(self, item, spider):
        # Process item (e.g., clean data, save to DB)
        return item


You need to enable the pipeline in settings.py by adding it to ITEM_PIPELINES:

In [None]:
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}


The number 300 determines the order in which pipelines run (lower numbers run first).
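
To make the ordering concrete, the sketch below simulates how priority numbers decide the sequence in which an item passes through pipelines. It is a plain-Python illustration, not Scrapy's internal code; StripPipeline, UppercasePipeline, and run_pipelines are hypothetical names introduced only for this example.

```python
class StripPipeline:
    def process_item(self, item, spider):
        # Remove surrounding whitespace from the title
        item['title'] = item['title'].strip()
        return item


class UppercasePipeline:
    def process_item(self, item, spider):
        # Upper-case the (already stripped) title
        item['title'] = item['title'].upper()
        return item


# Same shape as ITEM_PIPELINES, but mapping classes (not import paths) to priorities.
PIPELINES = {
    UppercasePipeline: 300,
    StripPipeline: 100,
}


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through each pipeline in ascending priority order."""
    order = []
    for cls, _priority in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = cls().process_item(item, spider)
        order.append(cls.__name__)
    return item, order


result, order = run_pipelines({'title': '  hello  '}, PIPELINES)
print(order)   # pipeline names in the order they ran
print(result)  # the item after both pipelines
```

Although UppercasePipeline is listed first in the dictionary, StripPipeline runs first because its number is lower.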

## 3.4 Explaining Scrapy Middlewares & Settings
### **Middlewares**:

Middlewares are hooks that allow you to modify Scrapy's requests and responses during the scraping process. They act between the engine, spiders, and the web.<br>
Middlewares can be customized in middlewares.py.
Examples of middlewares:
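- **Downloader Middleware**: Sits between the engine and the downloader. It can modify requests before they are sent and responses before they reach the spider (e.g., setting custom headers, retrying failed requests).
- **Spider Middleware**: Sits between the engine and the spider. It can process the responses passed to the spider and the items and requests the spider yields.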


In [None]:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Modify request (e.g., set user agent)
        request.headers['User-Agent'] = 'my-custom-user-agent'
        return None

You enable the middleware in settings.py:

In [None]:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyCustomMiddleware': 543,
}


### **Settings**:
Scrapy's behavior can be customized using the settings.py file.
#### Common settings include:
- **USER_AGENT**: Change the default user agent.
- **DOWNLOAD_DELAY**: Delay between requests to avoid being blocked.
- **ROBOTSTXT_OBEY**: Whether Scrapy should respect robots.txt rules.
- **CONCURRENT_REQUESTS**: The number of simultaneous requests Scrapy can send.

Example configuration in settings.py:<pre>
USER_AGENT = 'my-scrapy-bot'
DOWNLOAD_DELAY = 2  # Wait 2 seconds between requests
ROBOTSTXT_OBEY = True  # Respect robots.txt rules
CONCURRENT_REQUESTS = 8  # Max number of concurrent requests</pre>


## 3.5 Crawl inside Python script

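You can also start a crawl from a regular Python script instead of the scrapy crawl command:

In [None]:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myproject.spiders.quotes import QuotesSpider  # Spider created in section 4.1

def run_spider():
    # Load settings.py so pipelines and middlewares are applied
    process = CrawlerProcess(get_project_settings())
    process.crawl(QuotesSpider)
    process.start()  # Blocks until the crawl finishes

if __name__ == "__main__":
    run_spider()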

# 4. Build your First Scrapy Spider

## 4.1 How to Create a Scrapy Spider
Creating a Scrapy spider is easy using the genspider command. This command sets up the spider for you without needing to manually create a file.<br>
Navigate to your Scrapy project directory in your terminal:<pre>
cd path/to/your/project</pre>
Create a new spider using the genspider command:<pre>
scrapy genspider quotes http://quotes.toscrape.com</pre>
**quotes**: This is the name of the spider.<br>
**http://quotes.toscrape.com**: The starting URL that the spider will scrape.<br>
This command creates a new spider file under the spiders/ directory. The file will look something like this:

In [None]:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass


Edit the parse method to define what data you want to scrape and how to extract it:

In [None]:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

**name**: The name of the spider (used to run it).<br>
**allowed_domains**: Restricts the spider to this domain.<br>
**start_urls**: The URLs the spider starts scraping from.<br>
**parse()**: This method processes the response, extracts the data, and yields it.<br>

## 4.2 Using Scrapy Shell to Find CSS Selectors
The Scrapy shell is a great tool for experimenting with extracting data using CSS or XPath selectors before implementing them in your spider. Below are some useful and common commands you can run inside the Scrapy shell.<br>
Open the Scrapy shell by running the following command in your terminal:<pre>
scrapy shell 'http://quotes.toscrape.com/page/1/'</pre>

### Common Scrapy Shell Commands:
- **View the page's HTML**:<pre>
view(response)</pre>

This command opens the current HTML page in your browser so you can inspect it.

- **Extract all elements matching a CSS selector**:<pre>
response.css('div.quote')</pre>

Returns all elements that match the div.quote selector.

- **Extract text from a CSS selector**:<pre>
response.css('span.text::text').getall()</pre>

Extracts the text from all span.text elements.

- **Extract a single result**:<pre>
response.css('span.text::text').get()</pre>

Extracts only the first matching text element.

- **Extract attributes (e.g., URLs)**:<pre>
response.css('a::attr(href)').getall()</pre>

Extracts all href attributes from a (anchor) tags.

- **XPath selectors (an alternative to CSS selectors)**:<pre>
response.xpath('//div[@class="quote"]/span[@class="text"]/text()').getall()</pre>

Extracts all quote text elements using XPath.

- **Extracting an element by its ID**:<pre>
response.css('#unique-id::text').get()</pre>

Extracts the text from an element with the ID unique-id.

- **Check response status**:<pre>
response.status</pre>

Checks the HTTP status of the response (e.g., 200 OK).

- **Extracting JSON response (if scraping an API)**:<pre>
response.json()</pre>

Converts the response to JSON format if the page returns JSON data.

- **Following links**:<pre>
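fetch(response.follow(response.css('a::attr(href)').get()))</pre>

Builds a request for the first link found and fetches it, so response then refers to the new page. Inside a spider you would instead yield response.follow(..., callback=self.parse); self does not exist in the shell.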

- **Getting the full URL**:<pre>
response.url</pre>

Returns the URL of the current page.

- **Closing the shell**:<pre>
exit()</pre>


## 4.3 Using CSS Selectors in Our Spider to Get Data
Once you have tested your selectors in the Scrapy shell, you can now integrate them into your spider's parse method. Here’s an example:

In [None]:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

This will extract the quote text, author, and tags from each quote on the page.

## 4.4 Getting Our Spider to Navigate to Multiple Pages
To scrape multiple pages, modify the spider to follow the "Next" page link.<br>

In the parse method, find the "Next" page link and follow it:

In [None]:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

    # Get the link to the next page and follow it if it exists
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

This spider will now scrape all pages by following the "Next" button.
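
The href on the "Next" button is a relative path, which is why the spider resolves it before requesting it. As a rough illustration, the standard library's urljoin resolves a relative link against the current page URL in much the same way as response.urljoin; the URLs below are taken from the example site and the variable names are only for this sketch.

```python
from urllib.parse import urljoin

current_page = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'  # Relative href, as it appears in the page's HTML

# Resolve the relative href against the URL of the page it was found on
next_url = urljoin(current_page, next_href)
print(next_url)
```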

# 5. Build Discovery & Extraction Spider

## 5.1 How to Crawl Pages with Scrapy Spiders
Scrapy spiders can crawl multiple pages by either following links on each page or defining a set of URLs to visit.<br>
Define Starting URLs: In your spider, list the URLs where you want to start the crawl.

In [None]:
import scrapy
class DiscoverySpider(scrapy.Spider):
    name = 'discovery'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # This method will handle the response and extract data from it
        pass


#### Following Links: 
You can follow links to crawl through multiple pages. For example, to follow the "Next" page link:

In [None]:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }
    # Follow the "Next" page link
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)


This spider starts at the first page, extracts data, and then follows the "Next" page link until there are no more pages to follow.

## 5.2 Using XPath Queries to Extract Data
XPath (XML Path Language) is a powerful query language used to select nodes from an XML document or HTML page. It allows you to navigate through elements and attributes, making it extremely useful for web scraping when HTML structures get complex.

Here’s a detailed breakdown of common and useful XPath queries in Scrapy:

### 1. Basic XPath Syntax
XPath is written as a path-like structure. Here are some fundamental concepts:

- **Selecting Nodes**:
    - / selects from the root node.
    - // selects nodes anywhere in the document.
    - .// selects descendants of the current node.

Example:<pre>
//div[@class="quote"]  # Select all \<div> elements with class="quote"</pre>
### 2. Selecting Elements
- **Select all elements of a specific type**:<pre>
//a      # Select all \<a> elements</pre>
- **Select an element by attribute**:<pre>
//div[@id="content"]  # Select a \<div> element with id="content"</pre>
- **Select multiple attributes**:<pre>
//div[@class="quote" and @id="unique"]  # Select a \<div> with both class and id</pre>
### 3. Selecting Text Content
- **Select text within an element**:<pre>
//span[@class="text"]/text()  # Extract text from \<span> with class="text"</pre>
- **Extract text even when it's nested**:<pre>
//div[@class="quote"]//text()  # Get all text inside \<div class="quote"></pre>
- **Select normalized text (no extra spaces)**:<pre>
normalize-space(//span[@class="text"])  # Trim the text and collapse repeated whitespace (first matching \<span> only)</pre>
### 4. Selecting Attributes
- **Extract an attribute's value**:<pre>
//a/@href   # Extract the 'href' attribute from \<a> elements</pre>
- **Extract multiple attributes**:<pre>
//img/@src | //img/@alt   # Extract both src and alt attributes from \<img></pre>
### 5. Using Wildcards
- **Select elements regardless of tag type**:<pre>
//*[@class="quote"]   # Select any element with class="quote"</pre>
- **Wildcard for partial attribute matching**:
    - **contains()** function checks for partial matches:<pre>
//a[contains(@href, 'page')]   # Select all \<a> tags with 'page' in href attribute</pre>
- **Using * to match any element type**:<pre>
//div/*    # Select all child elements of \<div></pre>
### 6. Traversing Nodes
- **Select direct child nodes**:<pre>
//div[@class="quote"]/span    # Select only \<span> elements directly under \<div></pre>
- **Select all descendant nodes (direct + indirect children)**:<pre>
//div[@class="quote"]//span   # Select all \<span> elements under \<div>, regardless of depth</pre>
- **Select parent nodes**:<pre>
//span[@class="text"]/..      # Select the parent of \<span> with class="text"</pre>
- **Select sibling nodes**:<pre>
//h2[@class="heading"]/following-sibling::p   # Select every \<p> sibling that comes after the \<h2></pre>
### 7. Indexing and Positional Selection
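- **Select the first/last element**:<pre>
//div[@class="quote"][1]        # Select each \<div class="quote"> that is the first one under its parent
//div[@class="quote"][last()]   # Select each \<div class="quote"> that is the last one under its parent
(//div[@class="quote"])[1]      # Select the first \<div class="quote"> in the whole document</pre>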
- **Select a specific element by position**:<pre>
//div[@class="quote"][position()=3]   # Select the third \<div> with class="quote"</pre>
- **Select all elements after a specific one**:<pre>
//div[@class="quote"][position()>2]   # Select all \<div> after the second one</pre>
### 8. Conditional Expressions
XPath supports conditional logic to refine selections.

- **Select elements with specific conditions**:<pre>
//div[@class="quote" and contains(., "life")]   # Select \<div> with class="quote" that contains the text "life"</pre>
- **Combining conditions**:<pre>
//div[@class="quote"]//span[contains(text(), 'wisdom') or contains(text(), 'life')]
# Select \<span> inside \<div class="quote"> where text contains either 'wisdom' or 'life'</pre>
### 9. Using Functions
XPath provides various functions to manipulate and extract data.

- **contains()**: Matches elements or attributes containing a certain substring.<pre>
//a[contains(@href, 'example.com')]   # Select links that contain 'example.com'</pre>
- **starts-with()**: Matches elements or attributes that start with a certain string.<pre>
//a[starts-with(@href, 'http')]   # Select links that start with 'http'</pre>
- **text()** and **normalize-space()**:<pre>
normalize-space(//span[@class="text"])   # Remove leading/trailing whitespace from text</pre>
- **Arithmetic operations**:<pre>
//li[position() mod 2 = 0]   # Select all even-numbered \<li> elements</pre>
### 10. Scrapy-Specific Usage of XPath
In Scrapy, XPath queries are used within the spider's **parse()** method to extract data:

- **Basic example**:

In [None]:
def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        yield {
            'text': quote.xpath('span[@class="text"]/text()').get(),
            'author': quote.xpath('span/small/text()').get(),
            'tags': quote.xpath('div[@class="tags"]/a[@class="tag"]/text()').getall(),
        }
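
To try the same nested-query pattern without running a crawl, the sketch below applies similar paths to a small hand-written snippet using the standard library's xml.etree.ElementTree. This is a stand-in, not Scrapy's selector API: ElementTree supports only a limited XPath subset (no text() or contains()) and needs well-formed markup. The sample markup and the extract_quotes helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample markup (not real scraped data)
SAMPLE = """
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <span>by <small class="author">Author A</small></span>
    <div class="tags"><a class="tag">life</a><a class="tag">wisdom</a></div>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <span>by <small class="author">Author B</small></span>
    <div class="tags"><a class="tag">humor</a></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def extract_quotes(root):
    """Select each quote block, then run relative queries inside it."""
    results = []
    for quote in root.findall(".//div[@class='quote']"):
        results.append({
            "text": quote.find("span[@class='text']").text,
            "author": quote.find("span/small").text,
            "tags": [a.text for a in quote.findall("div[@class='tags']/a[@class='tag']")],
        })
    return results


quotes = extract_quotes(root)
for q in quotes:
    print(q)
```

The key idea carries over to Scrapy: select the repeating container first, then use relative paths (no leading //) for the fields inside it.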

## 5.3 Saving the Data to CSV or JSON Format
Once you’ve crawled and extracted your data, Scrapy can automatically save it to CSV, JSON, or other formats. This is done using the command line when running your spider.<br>

- **Save Data as JSON**:
To run your spider and save the output as a JSON file, use the following command:<pre>
scrapy crawl discovery -o output.json</pre>
Scrapy will create an output.json file and save all the scraped data into it.

- **Save Data as CSV**:
To save your data as a CSV file, run:<pre>
scrapy crawl discovery -o output.csv</pre>
This will output the data in a CSV format with one line per scraped item.

- **Save Data as XML**:
To save data as XML, run:<pre>
scrapy crawl discovery -o output.xml</pre>
The output format is inferred from the extension of the file name passed to -o, so you can switch formats by changing the file name.
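
To give a feel for what the exported files contain, the sketch below serializes two sample items to JSON and CSV with the standard library. It only illustrates the shape of the output; it does not use Scrapy's exporters, and the sample items and helper names (to_json, to_csv) are made up for this example.

```python
import csv
import io
import json

# Sample items, shaped like the dictionaries a spider yields
items = [
    {'text': 'First quote', 'author': 'Author A'},
    {'text': 'Second quote', 'author': 'Author B'},
]


def to_json(items):
    # JSON output: one array containing every item
    return json.dumps(items, indent=2)


def to_csv(items):
    # CSV output: a header row, then one line per item
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['text', 'author'], lineterminator='\n')
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


json_output = to_json(items)
csv_output = to_csv(items)
print(json_output)
print(csv_output)
```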

# 6. Cleaning Data with Item Pipelines

Scrapy provides a feature called Item Pipelines that allows you to process and clean the data after it's been scraped but before it's stored or exported. Pipelines can be used for tasks such as cleaning data, validating fields, removing duplicates, or saving the data in various formats (e.g., databases, JSON, CSV).

## 6.1 What are Scrapy Items?
Scrapy Items are containers for storing the scraped data. They are similar to Python dictionaries but accept only the fields you declare, so a misspelled field name raises an error instead of passing silently.<br>
You define items in the items.py file within your Scrapy project.<br>

- **Creating an Item**:

In [None]:
import scrapy
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

- **Purpose**:
    - Items help ensure that your scraped data is organized and structured.
    - They make it easier to apply validation and transformations through pipelines.


## 6.2 Using Scrapy Items to Structure Our Data
Once you've defined your Item class, you can use it in your spider to structure the data you're scraping.

- **Example of Using Items in Spider**:

In [None]:
from myproject.items import QuoteItem

def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        item = QuoteItem()
        item['text'] = quote.xpath('span[@class="text"]/text()').get()
        item['author'] = quote.xpath('span/small/text()').get()
        item['tags'] = quote.xpath('div[@class="tags"]/a[@class="tag"]/text()').getall()
        yield item


- **Benefits**:
    - Structured data output.
    - Easier to work with in pipelines and when saving data.

## 6.3 What are Scrapy Pipelines?
Scrapy Pipelines are used to process items after they’ve been scraped but before they’re saved/exported. Each pipeline is a class that processes the data through the **process_item()** method, using ItemAdapter to access fields in a unified way.

##### **Key Pipeline Functions**:
- **process_item(self, item, spider)**: The main method that processes each scraped item.
- **open_spider(self, spider)**: Runs when the spider opens (useful for setup tasks).
- **close_spider(self, spider)**: Runs when the spider closes (useful for cleanup tasks).

Now, Scrapy uses ItemAdapter to standardize how data is accessed, regardless of whether it's an Item, dictionary, or other data structure.
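
One of the "other data structures" is a standard-library dataclass, which recent Scrapy versions list among the supported item types. The sketch below shows such a class on its own, without Scrapy; QuoteRecord and its sample values are hypothetical and only illustrate the idea.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class QuoteRecord:
    # Field names mirror the QuoteItem fields used in this section
    text: str = ''
    author: str = ''
    tags: list = field(default_factory=list)


record = QuoteRecord(text='A sample quote.', author='Jane Doe', tags=['life'])

# A dataclass converts to a plain dictionary when needed
record_dict = asdict(record)
print(record_dict)
```

Because pipelines go through ItemAdapter, the same process_item code can read record fields by name whether the spider yields a dictionary, an Item, or a dataclass like this one.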


## 6.4 Cleaning Our Data with Item Pipelines (Updated Example)
#### Example: Cleaning Data in a Pipeline:

In [None]:
from itemadapter import ItemAdapter

class CleanQuotePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Clean text
        adapter['text'] = adapter['text'].strip().replace('\n', '')

        # Clean author
        adapter['author'] = adapter['author'].strip()

        # Clean tags (remove empty tags, strip whitespace)
        adapter['tags'] = [tag.strip() for tag in adapter['tags'] if tag.strip()]

        return item

#### Explanation:
- **ItemAdapter(item)**: This creates a unified interface for accessing the fields in the item, whether it's a Scrapy Item or a plain dictionary.
The rest of the code performs basic cleaning: stripping whitespace and filtering empty tags.
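
The pipeline above assumes every field is present; calling .strip() on a missing value (None) would raise an error. A None-tolerant variant of the same cleaning steps might look like the sketch below. It is written as a plain function on dictionaries so it can be tried without Scrapy; clean_quote and the sample item are hypothetical, and stripping the curly quotation marks is an extra assumption about how the quote text is wrapped.

```python
def clean_quote(item):
    """Return a cleaned copy of a quote dictionary, tolerating missing fields."""
    # Treat a missing value as an empty string; newlines become spaces here
    text = (item.get('text') or '').replace('\n', ' ').strip()
    # Assumption: the text may be wrapped in curly quotation marks
    text = text.strip('\u201c\u201d')

    author = (item.get('author') or '').strip()

    # Drop None and whitespace-only tags, strip the rest
    tags = [tag.strip() for tag in (item.get('tags') or []) if tag and tag.strip()]

    return {'text': text, 'author': author, 'tags': tags}


raw = {
    'text': '  \u201cA sample quote.\u201d\n',
    'author': ' Jane Doe ',
    'tags': ['life', ' ', None],
}
cleaned = clean_quote(raw)
print(cleaned)
```

Inside a pipeline, the same logic could be applied through the adapter in process_item.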

#### Enabling Pipelines
After creating the pipeline, don’t forget to enable it in the **settings.py** file:

In [None]:
ITEM_PIPELINES = {
    'myproject.pipelines.CleanQuotePipeline': 300,
}

# 7. Saving Data to Files & Databases

## 7.1 Saving Data via Command Line
Scrapy allows you to save scraped data directly to files (such as JSON, CSV, or XML) from the command line.

##### **Basic Command**: You can run the spider and save the output to a file format of your choice using the following command:<pre>
scrapy crawl spider_name -o output_file.format</pre>
##### **Example**:<pre>
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -O quotes.csv
</pre>

- **-O** -> Overwrite: replaces the file if it already exists.
- **-o** -> Append: adds the new items to the end of an existing file.

- **Supported Formats**:
JSON, JSON Lines (for streaming JSON records), CSV, XML, and more.

## 7.2 Saving Data via Feed Settings
Scrapy’s settings allow you to configure output feeds in settings.py, so every time your spider runs, it automatically saves the data to a file.

##### **Example**: 
- In your project’s settings.py, add the following code to save the scraped data to a JSON file:

In [None]:
FEEDS = {
  'output.json': {
      'format': 'json',
      'encoding': 'utf8',
      'overwrite': True  # Set this to False to append instead of overwrite
  },
}

- You can save to multiple formats or even multiple files:

In [None]:
FEEDS = {
  'output.json': {
      'format': 'json',
      'encoding': 'utf8',
      'overwrite': True
  },
  'output.csv': {
      'format': 'csv',
      'fields': ['field1', 'field2'],  # Optional: Select specific fields
  },
}

##### **Advantages**:
- This method is convenient when running spiders regularly.
- You can control the format, encoding, and more through configuration.
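
When a spider runs on a schedule, a fixed file name means each run either overwrites or appends to the previous export. One option is to put placeholders in the feed path so each run writes its own file. The fragment below is a sketch for settings.py: the exports/ directory and the file-name pattern are assumptions, while %(name)s (spider name) and %(time)s (timestamp) are placeholders that Scrapy fills in when it creates the feed. The last two lines only preview the expansion with an illustrative timestamp.

```python
# settings.py fragment (sketch): one JSON Lines file per spider run
FEEDS = {
    'exports/%(name)s_%(time)s.jsonl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}

# Preview of how the placeholders expand (values here are illustrative only)
example_path = 'exports/%(name)s_%(time)s.jsonl' % {'name': 'quotes', 'time': '2024-01-01T00-00-00'}
print(example_path)
```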

## 7.3 Saving Data Into Databases
You can also save the scraped data directly into databases such as MySQL, PostgreSQL, or SQLite using pipelines.

#### Setting Up Database Connection
- **Install Database Drivers**: Depending on the database you're using, you need the appropriate driver:
    - For MySQL: pip install pymysql
    - For PostgreSQL: pip install psycopg2
    - For SQLite: No need to install anything extra (SQLite is part of Python's standard library).
- **Writing a Database Pipeline**: Here’s an example of how to insert data into a SQLite database:

In [None]:
import sqlite3
from itemadapter import ItemAdapter

class SQLitePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('quotes.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS quotes (
                text TEXT,
                author TEXT,
                tags TEXT
            )
        ''')

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute('''
            INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)
        ''', (
            adapter.get('text'),
            adapter.get('author'),
            ','.join(adapter.get('tags', []))  # Convert list of tags to a comma-separated string
        ))
        return item

- **Explanation**:
    - **open_spider()**: Opens the database connection and creates the quotes table if it doesn't exist.
    - **close_spider()**: Commits any remaining transactions and closes the connection when the spider finishes.
    - **process_item()**: Inserts each scraped item into the quotes table.

- **Enabling the Database Pipeline**<br>
Activate the database pipeline in settings.py by adding it to the ITEM_PIPELINES section:<pre>
ITEM_PIPELINES = {
    'myproject.pipelines.SQLitePipeline': 300,
}</pre>
- **The number in ITEM_PIPELINES**<br>
The number in the ITEM_PIPELINES setting sets the order in which the pipelines run: lower numbers run first. For example:

In [None]:
ITEM_PIPELINES = {
    'myproject.pipelines.CleanQuotePipeline': 200,  # Executes first
    'myproject.pipelines.SQLitePipeline': 300,      # Executes second
}

In this case:
- CleanQuotePipeline with priority 200 runs first and cleans the scraped data.
- SQLitePipeline with priority 300 runs next and saves the cleaned data to the database.
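
To check the database logic from this section without running a crawl, the sketch below drives a pipeline of the same shape by hand against an in-memory SQLite database. InMemorySQLitePipeline, its fetch_all helper, and the sample item are hypothetical; the item is a plain dictionary, so ItemAdapter is not needed here.

```python
import sqlite3


class InMemorySQLitePipeline:
    def open_spider(self, spider):
        # ':memory:' keeps the database in RAM, so nothing is written to disk
        self.connection = sqlite3.connect(':memory:')
        self.cursor = self.connection.cursor()
        self.cursor.execute(
            'CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)'
        )

    def process_item(self, item, spider):
        # Parameterized query: values are passed separately from the SQL text
        self.cursor.execute(
            'INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)',
            (item.get('text'), item.get('author'), ','.join(item.get('tags', []))),
        )
        return item

    def fetch_all(self):
        # Helper for this sketch only: read back everything stored so far
        self.connection.commit()
        return self.cursor.execute('SELECT text, author, tags FROM quotes').fetchall()

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()


pipeline = InMemorySQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {'text': 'A sample quote.', 'author': 'Jane Doe', 'tags': ['life', 'humor']},
    spider=None,
)
rows = pipeline.fetch_all()
pipeline.close_spider(spider=None)
print(rows)
```

The tags come back as one comma-separated string, so code that reads the table needs to split it again.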

# 8. Fake User-Agents & Browser Headers

When web scraping, it's common to encounter restrictions, blocks, or CAPTCHAs because websites detect and block automated scripts. In this section, we'll explore how to avoid getting blocked using techniques such as fake user agents and browser headers.

## 8.1 Why We Get Blocked When Web Scraping
Websites use several methods to detect and block web scrapers, including:

- **IP address blocking**: Sites detect multiple requests from the same IP and block it.
- **Rate limiting**: Sending too many requests in a short period can trigger a block.
- **User-agent detection**: Websites check the User-Agent header to detect if the requests come from a browser or an automated script.
- **CAPTCHAs**: Sometimes, websites show CAPTCHAs to verify if the visitor is human.
- **Other headers**: Websites inspect the HTTP request headers to identify unusual patterns, such as missing cookies, referrers, or other browser-specific headers.


## 8.2 Explaining & Using User Agents to Bypass Getting Blocked
The User-Agent string identifies the type of browser making the request. Websites use this information to customize the content for specific browsers. When scraping, if no user agent or a generic user agent is provided, websites may detect the request as coming from a bot and block it.

- **Example of a User-Agent string**:<pre>
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3</pre>
- **Adding User-Agents to Your Scrapy Spider**:
You can set a fixed User-Agent in Scrapy using the USER_AGENT setting, or use a middleware to rotate User-Agents automatically.<br>
Setting a Single User-Agent: In **settings.py**, add:

In [None]:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'

### Rotating User-Agents with ScrapeOps
A single fixed User-Agent is still easy to spot when it sends many requests. Rotating user agents makes your requests appear to come from different browsers or devices. This section uses the ScrapeOps Fake User-Agents service as the source of user-agent strings.
####  Step 1: Setting Up ScrapeOps
To set up ScrapeOps for the Fake Headers API via their website, follow these steps:

- **Sign Up / Log In**
    - Visit the ScrapeOps website: Go to scrapeops.io.
    - Sign up or log in: Create a new account or log in to your existing account.
- **Access the Dashboard**
    - Once logged in, navigate to your Dashboard.
    - Here you can view your usage stats, API keys, and available tools.
- **Generate an API Key**
    - In the dashboard, find the section for API Keys.
    - Generate a new API key if you don’t have one already.
    - Copy this key for later use.
- **Navigate to Fake Headers API**
    - From the dashboard, locate the Fake Headers API section or navigate through the menu.
    - Click on the Fake Headers API to access its details and documentation.
- **Get API Endpoint and Parameters**
    - Review the provided API endpoint for the **Fake User-Agents service**.
    - Take note of any required parameters for requests (like user-agent, accept-language, etc.).

#### Step 2: Writing a ScrapeOps User-Agents Middleware

In [None]:
from random import choice
import requests

class ScrapeOpsFakeUserAgentMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = settings.get('SCRAPEOPS_FAKE_USER_AGENT_ENDPOINT', 'https://headers.scrapeops.io/v1/user-agents')
        self.scrapeops_fake_user_agents_active = settings.getbool('SCRAPEOPS_FAKE_USER_AGENT_ENABLED', False)
        self.scrapeops_num_results = settings.get('SCRAPEOPS_NUM_RESULTS')
        self.user_agents_list = []
        self._scrapeops_fake_user_agents_enabled()
        # Only call the API when the middleware is actually enabled
        if self.scrapeops_fake_user_agents_active:
            self._get_user_agents_list()

    def _get_user_agents_list(self):
        # Fetch the list of user agents once, when the middleware is created
        payload = {'api_key': self.scrapeops_api_key}
        if self.scrapeops_num_results is not None:
            payload['num_results'] = self.scrapeops_num_results
        response = requests.get(self.scrapeops_endpoint, params=payload, timeout=10)
        json_response = response.json()
        self.user_agents_list = json_response.get('result', [])

    def _get_random_user_agent(self):
        return choice(self.user_agents_list)

    def _scrapeops_fake_user_agents_enabled(self):
        # Active only when an API key is set and the enabled flag is True
        if not self.scrapeops_api_key or not self.scrapeops_fake_user_agents_active:
            self.scrapeops_fake_user_agents_active = False
        else:
            self.scrapeops_fake_user_agents_active = True

    def process_request(self, request, spider):
        # Leave the request untouched when disabled or when no user agents were fetched
        if not self.scrapeops_fake_user_agents_active or not self.user_agents_list:
            return None
        request.headers['User-Agent'] = self._get_random_user_agent()
        return None

#### Step 3: Modifying Settings.py
- **Adding ScrapeOps Keys**<br>
    - **SCRAPEOPS_API_KEY**: Your unique key to authenticate requests to ScrapeOps.
    - **SCRAPEOPS_FAKE_USER_AGENT_ENDPOINT**: URL to fetch fake user agent strings from ScrapeOps.
    - **SCRAPEOPS_FAKE_USER_AGENT_ENABLED**: Enables (True) or disables (False) the use of fake user agents in your Scrapy project.
    - **SCRAPEOPS_NUM_RESULTS**: Specifies how many fake user agents to retrieve (e.g., 50).

In [None]:
SCRAPEOPS_API_KEY = 'your-api-key'
SCRAPEOPS_FAKE_USER_AGENT_ENABLED = True
SCRAPEOPS_NUM_RESULTS = 50


- **Enabling the Middleware**<br>
Activate the ScrapeOps user-agents middleware in settings.py by adding it to the DOWNLOADER_MIDDLEWARES section:

In [None]:
DOWNLOADER_MIDDLEWARES = {
   #"myproject.middlewares.ExampleDownloaderMiddleware": 543,
   "myproject.middlewares.ScrapeOpsFakeUserAgentMiddleware":400,
}
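
The rotation step itself does not depend on where the user-agent strings come from. The sketch below shows it in isolation, picking from a local list instead of calling an API. FakeRequest is a hypothetical stand-in for Scrapy's request object (it only carries a headers dictionary), and LocalUserAgentRotator and the placeholder agent strings are made up for this example.

```python
import random


class FakeRequest:
    """Hypothetical stand-in for a Scrapy request: only carries headers."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


class LocalUserAgentRotator:
    def __init__(self, user_agents, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # With an empty list, leave the request unchanged instead of failing
        if self.user_agents:
            request.headers['User-Agent'] = self.rng.choice(self.user_agents)
        return None


# Placeholder strings; a real list would hold genuine browser user agents
USER_AGENTS = ['example-agent/1.0', 'example-agent/2.0', 'example-agent/3.0']

rotator = LocalUserAgentRotator(USER_AGENTS, rng=random.Random(0))
seen = set()
for page in range(1, 21):
    request = FakeRequest(f'http://quotes.toscrape.com/page/{page}/')
    rotator.process_request(request, spider=None)
    seen.add(request.headers['User-Agent'])

print(sorted(seen))  # the distinct user agents used across 20 requests
```

Returning None from process_request tells Scrapy to continue handling the request normally, which is why both branches end that way.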

## 8.3 Explaining & Using Request Headers to Bypass Getting Blocked

Besides User-Agents, websites rely on other HTTP headers to detect bots. Some key headers that scrapers should set include:

- **Referer**: This header specifies the URL of the page that led to the request.<br>
  Example: `Referer: https://example.com`
- **Accept-Language**: This tells the website which languages the browser prefers.<br>
  Example: `Accept-Language: en-US,en;q=0.9`
- **Accept-Encoding**: Specifies the content encodings (compression formats) the browser can decode.<br>
  Example: `Accept-Encoding: gzip, deflate, br`
- **Cookie**: This header passes session data to the server. Websites may block requests if cookies are missing.
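
As a minimal sketch, static values for some of these headers can be set project-wide through Scrapy's `DEFAULT_REQUEST_HEADERS` setting in settings.py. The header values below are illustrative assumptions, not values required by any particular site. `Accept-Encoding` and `Cookie` are left out because Scrapy's built-in compression and cookie middlewares manage them.

```python
# settings.py (sketch): header values are illustrative placeholders
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com",
}
```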

### Rotating Browser Headers with ScrapeOps
#### Step 1: Setting Up ScrapeOps
Follow the same steps as for fake user agents, but use the **Fake Browser Headers service** this time.
#### Step 2: Writing a ScrapeOps Browser Headers Middleware

In [None]:
from urllib.parse import urlencode
from random import randint
import requests

class ScrapeOpsFakeBrowserHeaderAgentMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = settings.get('SCRAPEOPS_FAKE_BROWSER_HEADER_ENDPOINT', 'https://headers.scrapeops.io/v1/browser-headers') 
        self.scrapeops_fake_browser_headers_active = settings.get('SCRAPEOPS_FAKE_BROWSER_HEADER_ENABLED', False)
        self.scrapeops_num_results = settings.get('SCRAPEOPS_NUM_RESULTS')
        self.headers_list = []
        self._get_headers_list()
        self._scrapeops_fake_browser_headers_enabled()

    def _get_headers_list(self):
        payload = {'api_key': self.scrapeops_api_key}
        if self.scrapeops_num_results is not None:
            payload['num_results'] = self.scrapeops_num_results
        response = requests.get(self.scrapeops_endpoint, params=urlencode(payload))
        json_response = response.json()
        self.headers_list = json_response.get('result', [])

    def _get_random_browser_header(self):
        random_index = randint(0, len(self.headers_list) - 1)
        return self.headers_list[random_index]

    def _scrapeops_fake_browser_headers_enabled(self):
        if self.scrapeops_api_key is None or self.scrapeops_api_key == '' or self.scrapeops_fake_browser_headers_active == False:
            self.scrapeops_fake_browser_headers_active = False
        else:
            self.scrapeops_fake_browser_headers_active = True
    
    def process_request(self, request, spider):
        # Leave the request untouched if no header sets were fetched
        # (randint would raise ValueError on an empty list)
        if not self.headers_list:
            return
        random_browser_header = self._get_random_browser_header()

        # Copy each browser header from the randomly chosen set onto the request
        request.headers['accept-language'] = random_browser_header['accept-language']
        request.headers['sec-fetch-user'] = random_browser_header['sec-fetch-user']
        request.headers['sec-fetch-mode'] = random_browser_header['sec-fetch-mode']
        request.headers['sec-fetch-site'] = random_browser_header['sec-fetch-site']
        request.headers['sec-ch-ua-platform'] = random_browser_header['sec-ch-ua-platform']
        request.headers['sec-ch-ua-mobile'] = random_browser_header['sec-ch-ua-mobile']
        request.headers['sec-ch-ua'] = random_browser_header['sec-ch-ua']
        request.headers['accept'] = random_browser_header['accept']
        request.headers['user-agent'] = random_browser_header['user-agent']
        request.headers['upgrade-insecure-requests'] = random_browser_header.get('upgrade-insecure-requests')

#### Step 3: Modifying Settings.py
- **Adding ScrapeOps Keys**<br>
    - **SCRAPEOPS_API_KEY**: Your unique key to authenticate requests to ScrapeOps.
    - **SCRAPEOPS_FAKE_BROWSER_HEADER_ENDPOINT**: URL to fetch fake browser headers from ScrapeOps.
    - **SCRAPEOPS_FAKE_BROWSER_HEADER_ENABLED**: Enables (True) or disables (False) the use of fake browser headers in your Scrapy project.
    - **SCRAPEOPS_NUM_RESULTS**: Specifies how many sets of fake browser headers to retrieve (e.g., 50).

In [None]:
SCRAPEOPS_API_KEY = 'your-api-key'
SCRAPEOPS_FAKE_BROWSER_HEADER_ENABLED = True
SCRAPEOPS_NUM_RESULTS = 50
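
Following the same pattern as the user-agent middleware, the browser-header middleware would also need to be registered in DOWNLOADER_MIDDLEWARES. A sketch, assuming the class lives in `myproject/middlewares.py` and reusing the priority 400 from the user-agent example:

```python
# settings.py (sketch): module path and priority are assumptions
# carried over from the user-agent example
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ScrapeOpsFakeBrowserHeaderAgentMiddleware": 400,
}
```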

# 9. Rotating Proxies & Proxy APIs

## 9.1 What Are Proxies and Why Do We Need Them?
A proxy acts as an intermediary between your computer (the client) and the web server. When you use a proxy, your requests to a target website pass through a third-party server (the proxy) before reaching the destination.<br> This masks your IP address and can help you bypass rate limits, geo-restrictions, and anti-bot measures.

#### Why use proxies?
- **Avoid getting blocked**: Many websites block scrapers that send too many requests from a single IP address.
- **Bypass geo-restrictions**: Some content is only available to users in certain locations.
- **Distribute traffic**: By spreading your requests across multiple IP addresses, you reduce the chances of triggering rate-limiting rules.

## 9.2 Three Most Popular Proxy Integration Methods
There are three common ways to integrate proxies in Scrapy:

- **Manual Proxy Lists**: You maintain a list of proxies and rotate through them.
- **Rotating/Backconnect Proxies**: These services provide a pool of IPs that rotate automatically on every request.
- **Proxy APIs**: Some proxy providers expose an HTTP API: you send the target URL to their endpoint, and the provider fetches the page through a proxy it selects and rotates for you.

## 9.3 How to Integrate and Rotate Proxy Lists in Scrapy

### 1. Install scrapy-rotating-proxies
First, install the scrapy-rotating-proxies library using pip:

In [None]:
pip install scrapy-rotating-proxies

This library rotates proxies automatically while scraping, which lowers the risk of being blocked or blacklisted by websites.


### 2. Add Proxies to Scrapy Settings
Once the library is installed, you need to configure Scrapy to use proxy rotation. Open your Scrapy project’s **settings.py** file, and add the following configurations:

In [None]:
# Enable the rotating proxies middleware
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# List of proxies you want to use for rotation
# These are just examples; you should replace them with your own proxies
ROTATING_PROXY_LIST = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
]

# Or use an external file with the proxies:
# ROTATING_PROXY_LIST_PATH = '/path/to/proxy/list.txt'

# Optional settings
# Number of times to retry a page with a different proxy before treating it as a page failure
ROTATING_PROXY_PAGE_RETRY_TIMES = 5
# Base backoff time, in seconds, before a dead proxy is re-checked
ROTATING_PROXY_BACKOFF_BASE = 300

Replace the proxy IPs and ports with actual proxies that you have. You can use free proxies (though they may be less reliable) or purchase dedicated proxy services.

### 3. How to Rotate Proxy Lists
The scrapy-rotating-proxies middleware will automatically handle rotating through the list of proxies you've provided. However, here are a few additional configurations that can help manage proxy rotation:

- **ROTATING_PROXY_PAGE_RETRY_TIMES**: The number of times to retry downloading a page using a different proxy. After this many retries, the failure is treated as a page failure rather than a proxy failure. The default is 5; adjust it based on the reliability of your proxies.
- **ROTATING_PROXY_BACKOFF_BASE**: The base backoff time (in seconds) before a proxy that was marked as dead is tried again. The default is 300, i.e. 5 minutes.
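
To make the role of a backoff base concrete, here is a generic capped exponential backoff with jitter. This illustrates the general technique and is not claimed to be the library's exact code; the function name `backoff_delay` and the cap of 3600 seconds are assumptions.

```python
import random

def backoff_delay(attempt, base=300, cap=3600, rng=random):
    """Return a randomized wait (in seconds) before a proxy is tried again.

    The upper bound starts at `base`, doubles with every failed attempt,
    and never exceeds `cap`.
    """
    upper = min(cap, base * 2 ** attempt)
    return rng.uniform(0, upper)

# Upper bounds for the first few attempts with base=300 and cap=3600
for attempt in range(5):
    print(attempt, min(3600, 300 * 2 ** attempt))
```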

### 4. Adding Proxy List from an External File
If you have a large proxy list, it might be better to store it in an external file. To do that, save your proxy list in a text file (e.g., proxy_list.txt) with one proxy per line.

In your settings.py, point to this file using:

In [None]:
ROTATING_PROXY_LIST_PATH = 'path/to/proxy_list.txt'

Scrapy will load the proxies from the file and rotate through them during scraping.
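
As a sketch, such a file could be generated from a Python list. The helper name `write_proxy_list` is hypothetical, and the proxy addresses in the usage note are placeholders.

```python
from pathlib import Path

def write_proxy_list(path, proxies):
    """Write one proxy URL per line, skipping blanks and duplicates."""
    unique = []
    for proxy in proxies:
        proxy = proxy.strip()
        if proxy and proxy not in unique:
            unique.append(proxy)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

For example, `write_proxy_list('proxy_list.txt', ['http://proxy1.example.com:8000', 'http://proxy2.example.com:8000'])` produces a two-line file that ROTATING_PROXY_LIST_PATH can point to.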

## Note: The Remaining Methods Require Paid Proxy Services

## 9.4 How to Use Rotating/Backconnect Proxies
Rotating/Backconnect proxies automatically rotate your IP address for each request without you having to manage a list. You usually get a single proxy endpoint from the provider, and the rotation happens on the server side.<br>

For example, if your proxy provider gives you a backconnect proxy, you can configure it in settings.py:

In [None]:
PROXY = 'http://backconnectproxy.example.com:8000'

Then, add a downloader middleware (in middlewares.py) that attaches this proxy to every request:

In [None]:
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = spider.settings.get('PROXY')

This way, every request will go through the backconnect proxy, and the provider will rotate the IP for you.
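
For the middleware to take effect, it has to be enabled in settings.py. A sketch, assuming the class is defined in `myproject/middlewares.py`; the priority 350 is an assumption, chosen so that it runs before Scrapy's built-in HttpProxyMiddleware (750 by default):

```python
# settings.py (sketch): module path and priority are assumptions
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
}
```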

## 9.5 How to Use Proxy APIs
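
A proxy API is typically used by sending the target URL to the provider's endpoint, which then fetches the page through its own proxy pool. The sketch below shows one way to wrap URLs for such a service. The endpoint `https://proxy.example.com/v1/`, the parameter names `api_key` and `url`, and the helper `get_proxy_url` are all hypothetical and will differ between providers.

```python
from urllib.parse import urlencode

PROXY_API_ENDPOINT = "https://proxy.example.com/v1/"  # hypothetical endpoint
PROXY_API_KEY = "your-api-key"                        # placeholder

def get_proxy_url(url):
    """Wrap a target URL so the request is sent to the proxy API instead."""
    payload = {"api_key": PROXY_API_KEY, "url": url}
    return PROXY_API_ENDPOINT + "?" + urlencode(payload)

print(get_proxy_url("https://example.com/page?id=1"))
```

In a spider, you would then yield `scrapy.Request(get_proxy_url(url), callback=self.parse)` instead of requesting `url` directly.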

# 10. Handling Form Submissions

## 10.1 Understanding Form Handling in Scrapy
In Scrapy, you can interact with forms by using the **FormRequest** class, which allows you to submit data through HTML forms. This request class can send **POST** or **GET** requests depending on the form's method attribute.

- **GET** requests are typically used for search forms or when data is passed via the URL.
- **POST** requests are typically used when submitting sensitive data (e.g., login credentials) via forms.

You will need to find the form fields (input fields, buttons, etc.) and populate them with the appropriate data.
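
The difference between the two methods is where the encoded form fields end up. The following standard-library sketch (independent of Scrapy; the URL and field names are made up) shows the same fields encoded for a GET and for a POST submission:

```python
from urllib.parse import urlencode

fields = {"q": "scrapy tutorial", "page": "2"}  # made-up form fields
encoded = urlencode(fields)

# GET: the fields travel in the query string of the URL
get_url = "https://example.com/search?" + encoded

# POST: the URL stays the same and the fields travel in the request body
post_url = "https://example.com/search"
post_body = encoded.encode("utf-8")
post_headers = {"Content-Type": "application/x-www-form-urlencoded"}

print(get_url)
print(post_body)
```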

## 10.2 Basic Example of Handling Form Submissions

In [None]:
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Get CSRF token or any hidden fields if necessary
        csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()
        
        # Create a dictionary with login form data
        formdata = {
            'username': 'myuser',
            'password': 'mypassword',
            'csrf_token': csrf_token  # If the website requires it
        }

        # Sending POST request to log in
        yield FormRequest.from_response(
            response,
            formdata=formdata,
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login is successful
        if 'Welcome' in response.text:
            self.log('Login successful!')
            # Continue scraping protected pages
            yield scrapy.Request(url='https://example.com/protected', callback=self.parse_protected_page)
        else:
            self.log('Login failed')
    
    def parse_protected_page(self, response):
        # Scrape data from a page after logging in
        data = response.css('.data-class::text').get()
        yield {'data': data}

- **FormRequest.from_response**: A method that pre-fills the request with the fields found in the page's form (including hidden ones), applies the values from formdata, and submits the form.
- **callback**: After the form is submitted, the after_login method processes the resulting page.

# 11. Handling CAPTCHAs

## 11.1 Understanding CAPTCHA Types

There are several types of CAPTCHAs that websites use, each with different levels of complexity:

- **Image-based CAPTCHAs**: These require users to identify distorted characters in an image.
- **ReCAPTCHA**: Google's popular CAPTCHA that includes a checkbox (I'm not a robot) and may require image identification (e.g., selecting pictures of traffic lights).
- **Invisible ReCAPTCHA**: A more user-friendly version of reCAPTCHA that uses behavioral analysis to determine if the user is a human.
- **Mathematical CAPTCHAs**: Simple addition or subtraction problems that users need to solve.
- **Puzzle CAPTCHAs**: Users may need to solve a puzzle, like dragging a slider or matching pieces.

In this tutorial, we’ll focus on the first two types: image-based CAPTCHAs and ReCAPTCHA.

## 11.2 Bypassing CAPTCHAs with Scrapy

Scrapy alone doesn't have the capability to solve CAPTCHAs. However, we can bypass CAPTCHAs using a few strategies:

### Strategy 1: Using CAPTCHA Solving Services
You can use third-party CAPTCHA-solving services like 2Captcha or AntiCaptcha. These services solve CAPTCHAs for you by either using human workers or machine learning to interpret CAPTCHA images.

### Strategy 2: Using Selenium to Bypass CAPTCHAs
If a website has a CAPTCHA challenge that Scrapy can't handle, Selenium can be used to control a web browser to complete the CAPTCHA manually. This is often necessary for complex CAPTCHAs like Google’s reCAPTCHA.

## 11.3 Using CAPTCHA Solving Services (e.g., 2Captcha, AntiCaptcha)

You can integrate third-party CAPTCHA-solving services with Scrapy to handle CAPTCHA challenges automatically.

#### Example: Using 2Captcha API for Image-based CAPTCHA
2Captcha is a popular service that allows you to send CAPTCHA images for solving. You need to send the CAPTCHA image, get the solution, and then use it to fill in the CAPTCHA form.

### Step 1: Install the 2Captcha Python Client
You can install the 2Captcha client using pip:

In [None]:
pip install 2captcha-python

### Step 2: Create a Spider that Uses 2Captcha

In [None]:
import scrapy
from twocaptcha import TwoCaptcha
from scrapy.http import FormRequest

class CaptchaSpider(scrapy.Spider):
    name = 'captcha_spider'
    start_urls = ['https://example.com/captcha-form']

    def parse(self, response):
        # Extract the CAPTCHA image URL and make it absolute
        captcha_src = response.css('img#captcha::attr(src)').get()
        captcha_url = response.urljoin(captcha_src)
        
        # Solve the CAPTCHA using 2Captcha
        captcha_solution = self.solve_captcha(captcha_url)
        
        # Create the form data and submit it
        formdata = {
            'username': 'myuser',
            'password': 'mypassword',
            'captcha': captcha_solution  # Solved CAPTCHA
        }

        yield FormRequest.from_response(
            response,
            formdata=formdata,
            callback=self.after_submission
        )

    def solve_captcha(self, captcha_url):
        # Using 2Captcha API to solve the CAPTCHA
        solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

        # Send captcha image URL to 2Captcha for solving
        try:
            result = solver.normal(captcha_url)
            return result['code']  # The CAPTCHA solution
        except Exception as e:
            self.log(f"Error solving CAPTCHA: {e}")
            return None

    def after_submission(self, response):
        # Process the form submission result
        if 'Welcome' in response.text:
            self.log('Form submitted successfully!')
        else:
            self.log('Form submission failed')


- **Captcha Image URL**: The URL of the CAPTCHA image is extracted from the page.
- **2Captcha API**: The TwoCaptcha client is used to send the image to the 2Captcha API for solving.
- **Form Submission**: Once the CAPTCHA is solved, the form is submitted with the solution.


# 12. Dealing with File Downloads (Images, PDFs, etc.)

## 12.1 Understanding File Downloads in Scrapy

- In Scrapy, file downloads are handled by the **FilesPipeline**, a built-in item pipeline that you enable in your project settings. It lets you download files from URLs, store them locally or remotely, and organize them to fit the requirements of your project.

- The **FilesPipeline** can be used for a variety of file types including:
    - **Images**: Images from websites (e.g., product images, logos).
    - **PDFs**: PDF documents such as eBooks, forms, reports, etc.
    - **Audio and Video Files**: Other types of multimedia.
    - **Any other file types**: Files with different extensions like **.txt**, **.csv**, **.xlsx**, etc.

## 12.2 Setting Up File Download Pipelines

### 12.2.1 Enable Files Pipeline
To start handling file downloads, you need to enable the FilesPipeline and define a FILES_STORE where the files will be saved.

In [None]:
# settings.py

# Enable the FilesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# Define the directory where files will be stored
FILES_STORE = '/path/to/your/local/directory'  # Local directory for storing downloaded files

# Optional: Set the field where file URLs are stored
FILES_URLS_FIELD = 'file_urls'  # Default is 'file_urls', you can change it as needed

# Optional: Define the field to store downloaded file paths
FILES_RESULT_FIELD = 'files'  # Default is 'files', you can change it as needed

Here, **FILES_STORE** points to the directory where you want the downloaded files to be stored locally. You can also specify remote storage if required, such as Amazon S3 or FTP.
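
For remote storage, FILES_STORE takes a URI instead of a local path. A sketch for Amazon S3, where the bucket name and credentials are placeholders (S3 support also requires the `botocore` package):

```python
# settings.py (sketch): bucket name and credentials are placeholders
FILES_STORE = "s3://my-bucket/scraped-files/"
AWS_ACCESS_KEY_ID = "your-access-key"
AWS_SECRET_ACCESS_KEY = "your-secret-access-key"
```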

### 12.2.2 Customizing File Storage Path
By default, Scrapy saves each file under a **full/** subdirectory of FILES_STORE, using the SHA-1 hash of the file's URL as the filename. You can customize the storage path by overriding the file_path method in a subclass of FilesPipeline (overriding it in the spider has no effect) and registering that subclass in ITEM_PIPELINES instead of the stock pipeline.

In [None]:
import scrapy
from scrapy.pipelines.files import FilesPipeline

class CustomFolderFilesPipeline(FilesPipeline):
    # Register in settings.py, e.g.:
    # ITEM_PIPELINES = {'myproject.pipelines.CustomFolderFilesPipeline': 1}

    def file_path(self, request, response=None, info=None, *, item=None):
        # Customize the file path, e.g., organize files by domain or file type
        filename = request.url.split('/')[-1]
        return f"custom_folder/{filename}"  # Save in custom_folder with the original filename

class FileDownloadSpider(scrapy.Spider):
    name = 'file_download_spider'
    
    def start_requests(self):
        # Example: start with a URL containing a file to download
        urls = ['https://example.com/somefile.pdf']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    
    def parse(self, response):
        # Use the 'file_urls' field to pass the file URL to the pipeline
        yield {
            'file_urls': [response.url]
        }

With the custom pipeline enabled, this spider downloads a PDF file and saves it in a subdirectory of FILES_STORE called **custom_folder**.
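
For comparison, when `file_path` is not overridden, FilesPipeline derives the filename from a SHA-1 hash of the URL and stores the file under `full/`. A simplified standard-library sketch of that scheme follows; the helper name `default_file_path` is hypothetical, and the real implementation handles unusual extensions differently.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse

def default_file_path(url):
    """Approximate the default naming: full/<sha1 of URL><extension>."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    extension = PurePosixPath(urlparse(url).path).suffix
    return f"full/{digest}{extension}"

print(default_file_path("https://example.com/somefile.pdf"))
```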

## 12.3 Downloading Multiple Files (e.g., Images)

Scrapy can also download multiple files from a single item. For example, if you're scraping product pages that contain multiple images, you can pass a list of absolute image URLs to the **file_urls** field in your item.

In [None]:
import scrapy

class ImageDownloadSpider(scrapy.Spider):
    name = 'image_download_spider'
    
    def start_requests(self):
        urls = ['https://example.com/product-page']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    
    def parse(self, response):
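        # Assuming images are contained within <img> tags with 'src' attributes;
        # urljoin turns relative paths into the absolute URLs the pipeline needs
        image_urls = [response.urljoin(src) for src in response.css('img::attr(src)').getall()]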
        
        yield {
            'file_urls': image_urls
        }

## 12.4 Handling Files in the Pipeline

Once files are downloaded, Scrapy’s FilesPipeline stores them on the disk, but you may want to process them further. For example, you might want to:

- **Rename** the files based on metadata or a custom scheme.
- **Process** the files (e.g., resize images, convert PDFs to text).
- **Move** files to different locations after downloading.

You can implement this in the **FilesPipeline** by subclassing it and overriding the **file_path** or **item_completed** methods.

#### 12.4.1 Customizing File Path

In [None]:
from scrapy.pipelines.files import FilesPipeline
import os

class CustomFilesPipeline(FilesPipeline):
    
    def file_path(self, request, response=None, info=None, *, item=None):
        # Generate a custom file path for downloaded files
        filename = request.url.split('/')[-1]
        return os.path.join('downloads', filename)  # Save in 'downloads' folder


This custom pipeline saves files in a **downloads** folder with their original filename.

#### 12.4.2 Post-processing Downloaded Files
You can also override the **item_completed** method to handle additional tasks after the files have been downloaded.

In [None]:
from scrapy.pipelines.files import FilesPipeline

class CustomFilesPipeline(FilesPipeline):
    
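    def item_completed(self, results, item, info):
        # results is a list of (success, file_info_or_failure) tuples
        for ok, file_info in results:
            if ok:
                # File successfully downloaded; 'path' is relative to FILES_STORE
                file_path = file_info['path']
                # Perform custom post-processing on the file
                self.process_file(file_path)
        # Let the parent class populate the item's result field ('files' by default)
        return super().item_completed(results, item, info)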
    
    def process_file(self, file_path):
        # Your custom logic for processing the file (e.g., image resizing)
        print(f'Processing file: {file_path}')

In this example, after the files are downloaded, they are processed with the **process_file** method, which you can customize to suit your needs.
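
As one possible body for `process_file`, the sketch below computes a SHA-256 checksum of a downloaded file. Because the `path` reported by the pipeline is relative to FILES_STORE, the helper takes the store directory as a separate argument. The function name `file_checksum` is an assumption for illustration.

```python
import hashlib
from pathlib import Path

def file_checksum(files_store, relative_path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file stored under files_store."""
    digest = hashlib.sha256()
    with open(Path(files_store) / relative_path, "rb") as handle:
        # Read in chunks so large downloads are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```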

# Sources
- <a href="https://www.youtube.com/watch?v=mBoX_JCKZTE&t=10397s&pp=ygUVZGF0YSBzY2llbmNlIGJvb3RjYW1w">Scrapy Course – Python Web Scraping for Beginners by freeCodeCamp.org</a>