**Note:** A spider doesn't see a webpage the way we do: it sees the HTML as delivered by the server, without any JavaScript being executed. So to scrape data from JavaScript-rendered fields or to interact with the webpage, we use a tool such as Splash or Selenium.

## Scrapy Architecture 

Scrapy has 5 main components:
- **spiders:** this is the file where we define what we want to scrape from a webpage. Scrapy has 5 built-in spider classes (scrapy.Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider).
- **pipelines:** this is where the extracted data is processed (cleaning the data, removing duplicates, storing the data in databases or other external storage).
- **middlewares:** these have everything to do with the requests sent and the responses received. If we want to inject custom headers or proxies, we have to do it through middlewares.
- **engine:** it keeps the whole operation consistent, meaning it coordinates the work of the other components.
- **scheduler:** it queues the requests it receives from the engine and hands them back when the engine asks for the next one, which determines the order in which requests are processed.

## Scrapy Fundamentals

### Basic Scrapy commands 

In [1]:
# !scrapy

Scrapy 2.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands      
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command


### Creating a Scrapy Project

**scrapy startproject project_name** creates a folder named "project_name" in the current working directory and fills it with the skeleton of a new Scrapy project.

In [1]:
# ! scrapy startproject worldometer

New Scrapy project 'worldometer', using template directory '/home/maidul-hasan/miniconda3/envs/scraper_env/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /home/maidul-hasan/Work/Web Scraping/worldometer

You can start your first spider with:
    cd worldometer
    scrapy genspider example example.com


### Listing the files and folders inside the project directory 

In [6]:
import os

for dirpath, directories, files in os.walk("./worldometer/"):
    print("dirpath: ", dirpath)
    print("directories: ", directories)
    print("files: ", files, "\n")

dirpath:  ./worldometer/
directories:  ['worldometer']
files:  ['scrapy.cfg'] 

dirpath:  ./worldometer/worldometer
directories:  ['spiders']
files:  ['__init__.py', 'settings.py', 'items.py', 'pipelines.py', 'middlewares.py'] 

dirpath:  ./worldometer/worldometer/spiders
directories:  []
files:  ['__init__.py'] 



- ./worldometer/worldometer/spiders/: this is where all the spiders live.
- ./worldometer/scrapy.cfg: the project's configuration file. Scrapy needs it to find the project settings when we execute our spiders. It is also used when deploying our spiders to Heroku or other hosting services.
- ./worldometer/worldometer/items.py: this is where we define the item classes, that is, the fields in which the scraped data is stored. Cleaning rules for the scraped data can also be attached to these fields.
- ./worldometer/worldometer/middlewares.py: it handles everything that has to do with the requests we send and the responses we get. We can write our own middleware class to manipulate the actions associated with requests and responses.
- ./worldometer/worldometer/pipelines.py: used to channel the items we have scraped to a database.
- ./worldometer/worldometer/settings.py: used for extra tweaking and configuration of the project.

### Generating a Spider 

In [2]:
# cd project_folder
# scrapy genspider -t template_name spider_name allowed_domain_name

The default template is basic; it generates a .py file that contains a bare-bones spider class (which inherits from scrapy.Spider). We can have multiple spiders in a project, but their names must be unique. The domain we supply should be given without the leading 'https://' and the trailing '/', as Scrapy will handle that.

Remember that Scrapy relies on the predefined names (variable, attribute, class or method names) in the main spider file, that is, the .py file created when you generated the spider. They can be changed by tweaking the settings, but changing them on their own will break the spider, so it is best to leave them as they are.

## Before you start scraping Data 

Use Chrome or Chromium for web scraping, as the Chrome Web Store has some pretty good add-ons that will help you in the whole process.

- First disable JavaScript. 
    - Open the developer tools window (ctrl+shift+i). This lets you inspect an element.
    - Go to the command palette (ctrl+shift+p).
    - Type in javascript and select Disable JavaScript. You can re-enable JavaScript by following the same steps.
- Select the inspection tool and then select the element to be inspected (extracted) to see the HTML tags of that element.
- To see if we can extract the desired data, open the developer tools window (ctrl+shift+i) and press ctrl+f to try out a CSS or XPath selector.

Disabling JavaScript is necessary to make sure that we are seeing the same HTML page as the spider does. This makes it easier to write a CSS or XPath selector, and it tells us whether an element can be extracted with Scrapy alone.

## CSS Selectors 

In html web pages all elements are tagged with some html tags. For example, a paragraph is tagged with the \<p> tag.

### Selecting elements by their tags

1. To select all the elements with a particular <u>tag name</u>, **syntax: tagName**. It's as simple as that. But if we want to select one particular element or some certain elements then this is not the clever way to do it. Instead we should target elements either by their <u>class attribute, id or by position</u> so we can limit the scope of the CSS selector.

2. To select any element by its <u>class attribute value</u>, **syntax: .className**.

3. To target an element by its <u>id attribute value</u>, **syntax: #id**.

Note: The same class attribute value can be assigned to more than one element. An id, however, should be unique: it identifies one particular element on the page.

**Example:** Let’s say we want to select the “p” element that has a class attribute equal to “intro” and an id value "outside". In this case we can use either of the following CSS selectors, **p.intro** or **#outside**, to select the element.

4. To select an element that has <u>multiple classes</u> (for example, 'bold italic'), **syntax: tagName.firstClassName.secondClassName** and so on.

5. To select an element that has <u>some other attribute name and value</u>, **syntax: tagName[attributeName="attributeValue"]**. There must be no space between the tag name and the opening bracket; with a space, the selector would match descendants of tagName instead.

6. To select an element where <u>an attribute (for example the 'href' attribute)</u>, 
    - <u>starts with some certain letters</u>, **syntax: tagName[href^="https"]**
    - <u>ends with some certain letters</u>, **syntax: tagName[href$="com"]**
    - <u>contains some certain letters</u>, <b>syntax: tagName[href*="google"]</b>

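7. To <u>extract the link within a tag (usually the 'a' tag)</u>, **syntax: a::attr(href)**. The ::attr() pseudo-element (like ::text) is an extension provided by Scrapy's selectors, not standard CSS, so it will not work in the browser's ctrl+f search.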

### CSS combinators, Selecting elements by their relative position 

1. To get all the elements that are placed inside a particular element, **syntax: outerElementTag.outerElementClass innerElementTag**.

2. To get all the elements that are immediate children of a specific element, **syntax: .specificElementClass > childElementsTag**

3. To get an element that is the n'th child of its parent, **syntax: elementTag:nth-child(position)**. To get several of these at once, separate the selectors with commas: **elementTag:nth-child(position1), elementTag:nth-child(position2), ...**

4. To select the element that is placed immediately after a particular element (its next sibling), **syntax: #firstElementId + #secondElementId**

5. To select all the sibling elements that are placed after a particular element, but not necessarily immediately after it, **syntax: #firstElementId ~ .otherElementsClass**

## Modifying the settings.py file

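    # do not restrict the crawl to what robots.txt allows (use with care).
    ROBOTSTXT_OBEY = False
    FEED_EXPORT_ENCODING = 'utf-8'
    
    # adjust the download delay automatically based on how the server responds.
    AUTOTHROTTLE_ENABLED = True
    # cache responses locally to avoid repeated requests to the same link.
    HTTPCACHE_ENABLED = True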

## Avoid getting banned

### User-Agent Rotator Middleware

Modify the middlewares.py file to choose a random user agent from a list of user agents for each request, to reduce the chance of getting banned.

    import logging
    import random
    
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


    class UserAgentRotatorMiddleware(UserAgentMiddleware):
        user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 "
            "Safari/537.36 Edge/12.246",
            "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 "
            "Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 "
            "Safari/601.3.9",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36",
            "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
        ]

        def __init__(self, user_agent=""):
            self.user_agent = user_agent

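        def process_request(self, request, spider):
            try:
                # pick a new user agent for every outgoing request.
                self.user_agent = random.choice(self.user_agents)
                # setdefault() only sets the header if the request does not
                # already carry a User-Agent header.
                request.headers.setdefault("User-Agent", self.user_agent)
            except IndexError:
                # random.choice() raises IndexError when the list is empty.
                logging.error("Couldn't fetch the User Agent")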
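
The middleware only takes effect once it is enabled in settings.py. A possible configuration is sketched below, assuming the project package is named `worldometer` (as created earlier) and the class lives in its middlewares.py; the priority value 400 is an arbitrary choice for this example.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # None disables Scrapy's built-in user-agent middleware,
    # so that only the rotator sets the User-Agent header.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # assumed module path and priority for the custom middleware.
    "worldometer.middlewares.UserAgentRotatorMiddleware": 400,
}
```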

## Debugging the spider

### Using scrapy's default functions

Read the docs at: <https://docs.scrapy.org/en/latest/topics/debug.html>

### Using the Debug option of an IDE

First create a Python file, e.g. debug.py. The debug.py file should look like this (Project_Name, Spider_Name and Spider_Class_Name are placeholders for your own names):
    
    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from Project_Name.spiders.Spider_Name import Spider_Class_Name

    process = CrawlerProcess(settings=get_project_settings())
    process.crawl(Spider_Class_Name)
    process.start()

Next, set a breakpoint in the spider itself and then run the debug.py file in the IDE's Start Debugging mode. 

## Available templates to generate different types of spiders

Available templates:
- basic : The basic template generates a basic spider that inherits from the scrapy.Spider class. Out of the box it contains the spider name, the allowed domains, the start_urls list (or a start_requests method that uses scrapy.Request to send the initial request) and a parse method.
- crawl : The crawl template generates a crawl spider that inherits from the CrawlSpider class. The crawl spider contains a special tuple named "rules". In this tuple we define the Rule(s) that the crawler obeys when extracting links. We can tell it to extract a link only if the href contains a certain word, or we can restrict the part of the page it looks at with a css/xpath expression. Note that the callback method should never be named "parse", because CrawlSpider relies on its own parse method.
  The CrawlSpider is very useful when a css/xpath selector matches many href links but you only want to follow the links that have certain properties. In this case you can define a Rule object to do that without hassle.
- csvfeed : Generates a spider that inherits from CSVFeedSpider, used to scrape CSV files.
- xmlfeed : Generates a spider that inherits from XMLFeedSpider, used to scrape XML files.