# **Scrapy**
One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: find all links on a page, evaluate the difference between internal and external links, go to new pages. These basic patterns are useful to know and to be able to write from scratch, but the Scrapy library handles many of these details for you. Of course, Scrapy isn’t a mind reader. You still need to define page templates, give it locations to start scraping from, and define URL patterns for the pages that you’re looking for. But in these cases, it provides a clean framework to keep your code organized.
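
To make the repeated tasks above concrete, here is a minimal from-scratch sketch, using only the standard library, of finding all links on a page and separating internal from external ones. The names `LinkCollector` and `split_links`, the sample HTML, and the example.com URLs are all hypothetical and chosen only for illustration; a real crawler would also need to deal with things like `mailto:` links and subdomains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def split_links(html, page_url):
    """Return (internal, external) lists of absolute URLs found in html."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == site:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


# Hypothetical page content, used only for illustration.
sample_html = (
    '<a href="/about">About</a> '
    '<a href="https://other.example.org/x">Other</a>'
)
internal, external = split_links(sample_html, 'https://example.com/index.html')
print(internal)  # ['https://example.com/about']
print(external)  # ['https://other.example.org/x']
```

This is the kind of bookkeeping that Scrapy takes over once a spider is defined.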

## **Installing Scrapy**
Scrapy can be downloaded from its website, which also provides instructions for installing it with third-party installation managers such as pip.

Because of its relatively large size and complexity, Scrapy pulls in more dependencies than most libraries, and a pip installation can occasionally run into dependency conflicts. In many environments, though, the usual command works:

In [1]:
! pip install Scrapy





## **Initializing a New Spider**

Once you’ve installed the Scrapy framework, a small amount of setup needs to be done for each spider. A spider is a Scrapy project that, like its arachnid namesake, is designed to crawl webs. Throughout this chapter, I use “spider” to describe a Scrapy project in particular, and “crawler” to mean “any generic program that crawls the web, using Scrapy or not.”

To create a new spider in the current directory, run the following from the command line:

In [4]:
! scrapy startproject wikiSpider .

New Scrapy project 'wikiSpider', using template directory 'C:\Users\anoop\anaconda3\Lib\site-packages\scrapy\templates\project', created in:
    C:\Users\anoop\Learnings\Webscraping\Basics

You can start your first spider with:
    cd .
    scrapy genspider example example.com


## **Writing a Simple Scraper**
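To create a crawler, add a new file to the spiders directory. Because the project was created with a trailing “.”, that directory sits at wikiSpider/spiders relative to the current directory, so the file is wikiSpider/spiders/article.py. In your newly created article.py file, write the following: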

In [5]:
import scrapy

class ArticleSpider(scrapy.Spider):
    # Name that identifies this spider within the project
    name = 'article'

    # Scrapy calls start_requests() (note the plural) to get the initial requests
    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python'
        ]

        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    # Called once for each downloaded page
    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
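
The first URL in the spider is written as two adjacent string literals, which Python joins into one string, and its %28 and %29 are the percent-encoded forms of “(” and “)”. The following small sketch, using only the standard library, shows the round trip; the variable names are chosen just for this example.

```python
from urllib.parse import quote, unquote

# Page title as it appears on Wikipedia, with literal parentheses.
page_title = 'Python_(programming_language)'

# quote() percent-encodes characters that are not letters, digits, or _.-~
encoded = quote(page_title)
print(encoded)  # Python_%28programming_language%29

# Adjacent string literals are concatenated, as in the spider's URL list.
url = ('http://en.wikipedia.org/wiki/Python_'
       '%28programming_language%29')
print(url.endswith(encoded))  # True

# unquote() reverses the encoding.
print(unquote(encoded) == page_title)  # True
```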

In [8]:
! scrapy runspider ./wikiSpider/spiders/article.py

2024-10-15 10:05:26 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: wikiSpider)
2024-10-15 10:05:26 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:03:56) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Windows-11-10.0.26120-SP0
2024-10-15 10:05:26 [scrapy.addons] INFO: Enabled addons:
[]
2024-10-15 10:05:26 [asyncio] DEBUG: Using selector: SelectSelector
2024-10-15 10:05:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-10-15 10:05:26 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-10-15 10:05:26 [scrapy.extensions.telnet] INFO: Telnet Password: 36a373c2c74bf473
2024-10-15 10:05:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'sc

## **Spidering with Rules**
The spider in the previous section isn’t much of a crawler, confined to scraping only the list of URLs it’s provided. It has no ability to seek new pages on its own. To turn it into a fully fledged crawler, you need to use the CrawlSpider class provided by Scrapy.
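
As a rough illustration of what “seeking new pages on its own” involves, the sketch below crawls a tiny in-memory link graph: it keeps a queue of pages to visit and a set of pages already seen, and follows each newly discovered link once. The `LINKS` dictionary and the `crawl` function are hypothetical stand-ins, no network access is involved, and this is not a description of how Scrapy schedules requests internally.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A', '/wiki/D'],
    '/wiki/C': ['/wiki/D'],
    '/wiki/D': [],
}


def crawl(start, links, max_pages=10):
    """Visit pages breadth-first, following each newly discovered link once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in links.get(page, []):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append(link)
    return visited


order = crawl('/wiki/A', LINKS)
print(order)  # ['/wiki/A', '/wiki/B', '/wiki/C', '/wiki/D']
```

The `max_pages` cap matters in practice: on a site as densely linked as Wikipedia, a crawler that follows every link will not stop on its own for a very long time.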

CrawlSpider ships with Scrapy itself, so nothing extra needs to be installed. Older tutorials import these classes from scrapy.contrib, but that subpackage is not part of current Scrapy releases, and an attempt to add it with pip install scrapy.contrib failed in this environment (Python 3.12) with “No matching distribution found.” Import from scrapy.linkextractors and scrapy.spiders instead:

In [None]:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ArticleSpider(CrawlSpider):
    name = 'articles'
    # Links that lead outside these domains are not followed
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Benevolent_dictator_for_life']
    # Follow every extracted link and pass each response to parse_items
    rules = [Rule(LinkExtractor(allow=r'.*'), callback='parse_items', follow=True)]

    def parse_items(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        # extract_first() returns None when nothing matches, so guard before cleaning
        if lastUpdated is not None:
            lastUpdated = lastUpdated.replace('This page was last edited on ', '')

        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
        print('Text is: {}'.format(text))
        print('Last updated: {}'.format(lastUpdated))
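
The allow argument in the rule above is a regular expression, and r'.*' matches every URL, so the spider follows every link it finds within the allowed domain. The sketch below approximates that filtering with the standard re module, on the assumption that the pattern is applied as a regex search against each absolute URL. The sample URLs, the `matching` helper, and the stricter pattern are hypothetical and are not taken from the spider above.

```python
import re

# Hypothetical URLs a crawler might discover on a Wikipedia page.
urls = [
    'https://en.wikipedia.org/wiki/Guido_van_Rossum',
    'https://en.wikipedia.org/wiki/Talk:Python_(programming_language)',
    'https://en.wikipedia.org/w/index.php?title=Python&action=edit',
]


def matching(pattern, candidates):
    """Return the candidates in which the regular expression is found."""
    regex = re.compile(pattern)
    return [url for url in candidates if regex.search(url)]


# '.*' accepts everything, like the rule in the spider.
everything = matching(r'.*', urls)
print(len(everything))  # 3

# A stricter, illustrative pattern: /wiki/ pages whose title has no colon,
# which leaves out Talk: pages and the index.php edit link.
articles_only = matching(r'/wiki/[^:/]+$', urls)
print(articles_only)  # ['https://en.wikipedia.org/wiki/Guido_van_Rossum']
```

Narrowing the pattern this way is one option for keeping a crawl focused on article pages rather than every link on the site.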