# Session 23 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 183. Scrapy
`Scrapy` is a powerful and popular web scraping framework for Python. It provides a complete toolkit for extracting data from websites, processing it, and storing it in your preferred format. Let's explore Scrapy in detail.

# 184. Core Components

***

## 184-1. Spiders
Spiders are classes that define how to scrape a website. Here's a basic spider example:

In [None]:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract data from the page
        title = response.css('h1::text').get()
        yield {'title': title}

        # Follow links to other pages
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)

***

## 184-2. Items
Items define the structure of the scraped data:

In [None]:
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()  # page the item was scraped from

***

## 184-3. Item Pipelines
Pipelines process the scraped items:

In [None]:
import json

class PriceConversionPipeline:
    def process_item(self, item, spider):
        if 'price' in item:
            item['price'] = float(item['price'].replace('$', ''))
        return item

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
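
To see what a pipeline chain does without running a crawl, `process_item` can be called by hand. The sketch below repeats `PriceConversionPipeline` so it runs on its own, and uses `JsonLinePipeline`, a hypothetical in-memory stand-in for `JsonWriterPipeline`. The item data is made up, and `spider=None` is a placeholder because neither pipeline uses that argument.

```python
import json


class PriceConversionPipeline:
    """Same logic as above: '$19.99' becomes 19.99."""

    def process_item(self, item, spider):
        if 'price' in item:
            item['price'] = float(item['price'].replace('$', ''))
        return item


class JsonLinePipeline:
    """Hypothetical stand-in that keeps JSON lines in memory instead of a file."""

    def __init__(self):
        self.lines = []

    def process_item(self, item, spider):
        self.lines.append(json.dumps(dict(item)))
        return item


# Scrapy hands each pipeline the item returned by the previous one,
# in ascending priority order. Emulate that by hand:
pipelines = [PriceConversionPipeline(), JsonLinePipeline()]
item = {'name': 'Widget', 'price': '$19.99'}  # illustrative data
for pipeline in pipelines:
    item = pipeline.process_item(item, spider=None)

print(pipelines[1].lines[0])
```

Because each pipeline receives the previous one's return value, the JSON line already contains the numeric price.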

***

## 184-4. Middleware
Middleware can process requests and responses:

In [None]:
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'My Custom User Agent'
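
A middleware can also read its configuration from the project settings through the `from_crawler` class method. The sketch below is one way to do that under stated assumptions: `CUSTOM_USER_AGENT` is a hypothetical setting name, and `SimpleNamespace` objects stand in for the real crawler and request so the snippet runs without Scrapy.

```python
from types import SimpleNamespace


class CustomUserAgentMiddleware:
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # CUSTOM_USER_AGENT is a hypothetical setting used for illustration
        user_agent = crawler.settings.get('CUSTOM_USER_AGENT', 'My Custom User Agent')
        return cls(user_agent)

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.user_agent
        return None  # None lets Scrapy continue handling the request


# Stand-ins for the objects Scrapy would normally pass in
fake_crawler = SimpleNamespace(settings={'CUSTOM_USER_AGENT': 'MyBot/1.0'})
fake_request = SimpleNamespace(headers={})

middleware = CustomUserAgentMiddleware.from_crawler(fake_crawler)
result = middleware.process_request(fake_request, spider=None)
```

In a real project the class would also need an entry in `DOWNLOADER_MIDDLEWARES` in `settings.py`, for example `'myproject.middlewares.CustomUserAgentMiddleware': 543` (both the module path and the priority are assumptions).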

***

# 185. Project Structure
A typical Scrapy project looks like this:

In [None]:
myproject/
    scrapy.cfg            # Deployment configuration
    myproject/            # Project's Python module
        __init__.py
        items.py          # Item definitions
        middlewares.py    # Project middlewares
        pipelines.py      # Project pipelines
        settings.py       # Project settings
        spiders/          # Spiders directory
            __init__.py
            spider1.py    # Your spider implementations

***

# 186. Running a Spider
Run a spider from the command line:

In [None]:
scrapy crawl myspider -o output.json

Or programmatically:

In [None]:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)  # the spider class defined earlier
process.start()          # blocks until the crawl finishes

***

# 187. Selectors
Scrapy provides powerful selectors for extracting data:

In [None]:
# CSS selectors
response.css('title::text').get()          # Get first match
response.css('p::text').getall()           # Get all matches

# XPath selectors
response.xpath('//title/text()').get()
response.xpath('//p/text()').getall()

# Combining selectors
response.css('div.product').xpath('.//h2/text()').get()

***

# 188. Settings
Configure Scrapy in settings.py:

In [None]:
BOT_NAME = 'myproject'

USER_AGENT = 'Mozilla/5.0 (compatible; MyBot/1.0)'

ROBOTSTXT_OBEY = True

# Pipelines run in ascending order of these numbers (customarily 0-1000)
ITEM_PIPELINES = {
    'myproject.pipelines.PriceConversionPipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

DOWNLOAD_DELAY = 2  # 2 seconds delay between requests

***

# 189. Advanced Features

***

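## 189-1. Link Extractors
A `LinkExtractor` collects the links on a page that match the given rules:

In [None]:
import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    # ... (name and start_urls as before)

    # Only keep links whose URL matches /category/
    link_extractor = LinkExtractor(allow=r'/category/')

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse_category)

    def parse_category(self, response):
        # Extract category data here
        ...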

***

## 189-2. Form Handling
`FormRequest` sends form data as a POST request, which lets a spider log in before it starts scraping:

In [None]:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        return [scrapy.FormRequest(
            'http://example.com/login',
            formdata={'user': 'john', 'pass': 'secret'},
            callback=self.after_login
        )]

    def after_login(self, response):
        # Check login success before continuing
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
        # Continue scraping

***

## 189-3. Exporting Data
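Scrapy supports multiple export formats and infers the format from the output file's extension:

In [None]:
# -o appends to an existing file; -O (capital letter) overwrites it
scrapy crawl myspider -o items.json
scrapy crawl myspider -o items.csv
scrapy crawl myspider -o items.xml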

***

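## 189-4. Using Feed Exporters
Feed exports can also be configured in `settings.py`. Since Scrapy 2.1, the `FEEDS` setting replaces the deprecated `FEED_FORMAT` and `FEED_URI`:

In [None]:
FEEDS = {
    # %(name)s is the spider name, %(time)s the crawl timestamp
    's3://mybucket/%(name)s/%(time)s.json': {'format': 'jsonlines'},
}
FEED_EXPORT_FIELDS = ['name', 'price']  # Control field order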

***

# 190. Example Complete Spider
Here's a complete e-commerce spider example:

In [None]:
import scrapy
from myproject.items import ProductItem

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    allowed_domains = ['example-store.com']
    start_urls = ['https://example-store.com/categories']

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    }

    def parse(self, response):
        # Extract category links
        categories = response.css('div.categories a::attr(href)').getall()
        for category in categories:
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Extract product links
        products = response.css('div.product-card a::attr(href)').getall()
        for product in products:
            yield response.follow(product, callback=self.parse_product)

        # Pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_product(self, response):
        item = ProductItem()
        # default='' avoids an AttributeError when a selector matches nothing
        item['name'] = response.css('h1.product-name::text').get(default='').strip()
        item['price'] = response.css('span.price::text').get(default='').replace('$', '').strip()
        item['description'] = ' '.join(response.css('div.description ::text').getall()).strip()
        # ProductItem must declare a `url` field, otherwise this raises KeyError
        item['url'] = response.url

        yield item
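
The `replace('$', '')` call in `parse_product` leaves thousands separators and any other text in place, so a value such as `1,299.00` would still fail a later `float()` conversion. A minimal sketch of a more tolerant helper follows. `parse_price` is a hypothetical name, and the pattern assumes US-style formatting (`,` for thousands, `.` for decimals).

```python
import re

# A digit, then digits or commas, then an optional decimal part
_PRICE_RE = re.compile(r'\d[\d,]*(?:\.\d+)?')


def parse_price(text):
    """Return the first price-like number in `text` as a float, or None."""
    if not text:
        return None
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))


print(parse_price(' $1,299.00 '))
print(parse_price('Sold out'))
```

One option is to call it inside `PriceConversionPipeline` in place of `float(item['price'].replace('$', ''))`, which keeps the spider itself unchanged.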

***

# Some Exercises

**1.** Create a spider that scrapes quotes from http://quotes.toscrape.com:
- Extract quote text, author, and tags
- Follow pagination links
- Save results to quotes.json

***

**2.** Modify Exercise 1 to:

1. Create a custom Item class for quotes

2. Add a pipeline that:
    - Converts all tags to lowercase
    - Filters out quotes with fewer than 3 tags
    - Stores results in both JSON and CSV

***

**3.** Scrape http://quotes.toscrape.com/login:

- First perform login (username/password can be anything)
- After successful login, scrape the protected quotes
- Handle login failure cases

***

**4.** Create a custom middleware that:

1. Rotates user agents from a list of 5 different browsers

2. Logs all 404 responses to a file

3. Retries failed requests with exponential backoff

***

**5.** Scrape books.toscrape.com and:

1. Extract book title, price, rating, and image URL

2. Use Scrapy's ImagesPipeline to download all book covers

3. Save images in folders organized by book rating

***

**6.** Create a spider that:

1. Uses Scrapy's FormRequest to interact with a mock API (https://httpbin.org/post)

2. Sends paginated POST requests with different parameters

3. Processes JSON responses and extracts specific data fields

***

**7.** Modify any previous exercise to:

1. Use Redis for a distributed request queue

2. Implement duplicate request filtering

3. Run multiple spider instances that coordinate through Redis

***

**8.** Create a complete e-commerce scraper for a real site (like Amazon or eBay) that:

1. Handles search results pagination

2. Extracts product details with error handling

3. Uses proxies and proper throttling

4. Implements a cache for development

5. Stores data in a MySQL database through a pipeline

6. Includes a monitoring/logging system

***

#                                                        🌞 https://github.com/AI-Planet 🌞