# 1. Components of a web crawling script

## High level components of a crawler

A web crawler script has three components:

1. The `import` statements that load the required modules

2. A spider class definition that inherits from `scrapy.Spider` and describes one specific spider

3. A `scrapy.crawler.CrawlerProcess` object that defines how the spider should crawl the web


Sample code:

```python
# section 1, imports
import scrapy
from scrapy.crawler import CrawlerProcess

# section 2, define spider
class SpiderClassName(scrapy.Spider):
    name = 'spider_name'
    # code for your spider

# section 3, define process to run spider
process = CrawlerProcess()

# pass the spider class defined in section 2
process.crawl(SpiderClassName)

process.start()
```

## Defining a `Spider` class

A `Spider` class is built from three components:

1. A `name` attribute, used to refer to the spider internally

2. A method called `.start_requests()`, which kicks off the site visits

3. A method called `.parse()`, which parses the returned web content

The example below defines a `Spider` class in which `.start_requests()` kicks off the URL visits and `.parse()` saves the response to a local file.

```python
class WebSpider(scrapy.Spider):
    
    name = 'web_spider'
    
    # generator function, only creating one Request at a time
    def start_requests(self):
        urls = ['url_1', 'url_2', ...]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        # define how the result should be processed;
        # here we just save the raw response body to disk
        html_file = 'website_page1.html'
        with open(html_file, 'wb') as fout:
            fout.write(response.body)
```

**Details on the `.start_requests(self)` method**

* Use `yield` rather than `return` so that `.start_requests()` becomes a generator function. Combined with a `for` loop, it produces one request per URL in the list.
* For each URL, create a `scrapy.Request()` with at least the following parameters:
    * `url`: the URL to visit in this iteration of the generator
    * `callback`: the parsing method that will process the response downloaded for this request
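
The laziness that `yield` provides can be seen without Scrapy at all. The following is a minimal plain-Python sketch, not Scrapy code: `start_requests_sketch`, the `built` list, and the `example.com` URLs are hypothetical names chosen for illustration, and a plain dict stands in for `scrapy.Request`.

```python
built = []  # records which URLs have had a "request" built so far


def start_requests_sketch(urls):
    """Generator: builds one request-like dict per URL, only when asked."""
    for url in urls:
        built.append(url)
        # A plain dict stands in for scrapy.Request in this sketch
        yield {"url": url, "callback": "parse"}


# Hypothetical placeholder URLs
urls = ["https://example.com/a", "https://example.com/b"]

requests = start_requests_sketch(urls)  # creating the generator runs no loop body
built_before = list(built)              # nothing has been built yet

first = next(requests)                  # runs the loop body exactly once
built_after_first = list(built)         # only the first URL has been processed

remaining = list(requests)              # consume the rest of the generator
```

Calling the function only creates the generator; each request is built when the consumer asks for the next one, which is what the "one `Request` at a time" comment in the spider refers to.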

**Details on the `.parse(self, response)` method**

* The `response` parameter receives the response object (a `scrapy.http.Response`) that Scrapy downloads for each request yielded by `.start_requests()`. It is passed in automatically because `.parse()` was given as the `callback` of the `scrapy.Request()` above.
* The parsing method can have any name, as long as that same method is passed in the `callback` parameter of `scrapy.Request()`.
* The `.parse()` method (or any further parsing method) can itself be a generator function, which lets the spider go through multiple layers of links if needed. **Notice that for any further requests we use the existing `Response` object's `.follow()` method instead, which builds the new request and resolves relative links against the URL of the current page.**
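
To make the request/callback flow concrete, here is a toy model in plain Python. It is only a sketch of the idea and not how Scrapy is implemented: `ToyRequest`, `ToyResponse`, `ToySpider`, and `run` are hypothetical names, the "download" is faked, and the `example.com` URLs are placeholders.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRequest:
    url: str
    callback: Callable  # method that will receive the response


@dataclass
class ToyResponse:
    url: str
    body: str


class ToySpider:
    def start_requests(self):
        # The callback can be any method; its name is not special
        yield ToyRequest("https://example.com/", self.parse_index)

    def parse_index(self, response):
        # A parsing method may itself be a generator that yields new requests
        for path in ("a", "b"):
            yield ToyRequest(response.url + path, self.parse_page)

    def parse_page(self, response):
        # ... or yields scraped data
        yield {"url": response.url}


def run(spider):
    """Pull requests, fake a download, hand the response to the callback."""
    items = []
    queue = deque(spider.start_requests())
    while queue:
        request = queue.popleft()
        response = ToyResponse(request.url, body="<html></html>")  # fake download
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)   # follow-up request: visit it later
            else:
                items.append(result)   # scraped data: collect it
    return items


items = run(ToySpider())
```

The point of the sketch is the division of labor: the spider only yields requests and data, while the surrounding loop decides when each callback is invoked and with which response.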

## Example 1: Scraping Author Names from the DataCamp Website

```python
# Import the scrapy library
import scrapy

# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'

  # start_requests method
  def start_requests( self ):
    # url_short is a global variable defined already
    yield scrapy.Request( url = url_short, callback = self.parse )

  # parse method
  def parse(self, response):
    # Create an extracted list of course author names
    author_names = response.css('p.course-block__author-name::text').extract()
    # Yield the list wrapped in a dict, since Scrapy expects callbacks
    # to produce items or requests rather than bare strings
    yield {'author_names': author_names}
```

## Example 2: Scraping Content from Links Found on the Starting URL

```python

# Import the scrapy library
import scrapy

# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  
  # start_requests method
  def start_requests( self ):
    # url_short is a global variable defined already
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow(url=link, callback=self.parse_descr)
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description wrapped in a dict item
    yield {'course_descr': course_descr}
```
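
The `href` values extracted in `parse()` are often relative paths rather than full URLs. `response.follow()` accepts them because it resolves each link against the URL of the page it came from. The standard-library sketch below shows that kind of resolution with `urllib.parse.urljoin`; the page URL and the links are hypothetical placeholders, not real course pages.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were extracted from
page_url = "https://example.com/courses/all"

# Hypothetical href values: root-relative, path-relative, and already absolute
links = [
    "/courses/intro-to-python",
    "details?page=2",
    "https://other.example.org/x",
]

# Resolve every link against the page URL
absolute = [urljoin(page_url, link) for link in links]

for link, resolved in zip(links, absolute):
    print(f"{link!r:35} -> {resolved}")
```

With a plain `scrapy.Request()`, the same links would have to be made absolute by hand before being requested.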