# Scrapy basics

## What you will learn in this course 🧐🧐

Now that you have learned how to parse HTML pages, it is time to go to the next level and scrape websites automatically. The best way to do so is by using spiders from Scrapy. In this course, we'll learn how to:

* Create basic crawlers
* Target specific tags and attributes in a webpage
* Follow links to scrape multiple pages
* Simulate user log-in
* Run multiple crawlers at the same time
* Avoid being banned from websites

If Scrapy isn't installed yet in your environment, just execute the cell below:

In [1]:
# Add '!' only if you are running this command in a notebook
# It tells Jupyter that the command should be interpreted as a shell command
!pip3 install Scrapy


## Create your first spider 🕷️🕷️

Basically, Scrapy works with *Spiders* that describe the successive steps necessary to get the data you're interested in at a given URL. To make a scraping engine, you will need to:

- declare your own class that inherits from `scrapy.Spider`,
- declare two attributes: the `name` of your crawler and `start_urls`, the list of URLs at which you will start crawling,
- declare a `parse` method with an argument called `response` (the object holding the HTML response for each URL in `start_urls`).
- The `response` object has ONE method that you ABSOLUTELY need to know, because it will get you what you are looking for most of the time: `.xpath()`. You just copy an XPath from the webpage's source code and pass it to `.xpath()` to scrape the element. Easy, right?!

A Spider always looks somewhat like this:

```python
import scrapy


class RandomQuoteSpider(scrapy.Spider):
    # Name of your spider
    name = "mySpider"

    # URLs to start your spider from
    start_urls = [
        'http://my.url.to.scrape',
    ]

    # Callback called with the response downloaded for each URL in start_urls
    def parse(self, response):
        # Placeholder XPaths: replace them with the ones copied from the page
        return {
            'result1': response.xpath("/some/xpath").get(),
            'result2': response.xpath('/some/other/xpath').get(),
        }
```

Let's begin with a [simple example](src/scrapy1.py)


## The Crawler Process

Once your spider has been set up, you have to declare a `CrawlerProcess` that will run the spider and save the results in a `json` file (configured through the `FEEDS` setting).

All you have to do is run the Python script using `!python src/scrapy1.py` here in the notebook. We typically write scraping code in scripts rather than in the notebook, because scripts are easier to run repeatedly (every day, for example) or asynchronously (see the optional lecture from module 4 day 1).

The crawler process will always look like this:

```python
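import logging
from scrapy.crawler import CrawlerProcess

# Name of the results file (the one linked later in this lecture)
filename = "1_randomquote.json"

process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/97.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename : {"format": "json"},
    }
})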
```

Let's study this in detail.

### User agent

Scrapy scrapes the web by acting like a web browser (the client) that sends HTTP requests to a web server. The question is: which browser are you claiming to be? Ideally, it should be the same browser you use to inspect the websites and get the XPaths.

The reason is that a web server may send different responses depending on the browser making the request. For example, older browsers do not necessarily support JavaScript, so the server may serve them a much simpler page, which changes all the XPaths in the source code.

In most cases, the user agent can be set like this:

`'USER_AGENT' : 'Name_of_the_browser/version_number'`

for example:

`'USER_AGENT': 'Chrome/97.0'`

If you are using Chrome, you should be able to find your browser version at [chrome://settings/help](chrome://settings/help)
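
To make this concrete: the user agent is simply the value of the `User-Agent` HTTP header attached to each request. The sketch below does not use Scrapy; as an assumption made so that it runs without a crawler, it builds (but never sends) a request with the standard library's `urllib.request`, using the quotes page scraped later in this lecture as the URL.

```python
import urllib.request

# Build a request carrying the same user agent as our CrawlerProcess.
# Nothing is sent over the network: we only inspect the request object.
request = urllib.request.Request(
    "http://quotes.toscrape.com/page/1/",
    headers={"User-Agent": "Chrome/97.0"},
)

# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```

The web server reads this header to decide which version of the page to send back.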

### LOG_LEVEL and FEEDS

The other two settings we are giving the `CrawlerProcess` are the following:

* `LOG_LEVEL`: indicates which messages will be displayed in the logs. Log messages are classified by level, in increasing order of severity: DEBUG, INFO, WARNING, ERROR, CRITICAL. Choosing `logging.INFO` displays all the messages of level INFO and higher.
* `FEEDS`: indicates the destination and file format in which the results are saved.
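
The levels are plain integers in Python's `logging` module, which is why the run output in this lecture shows `'LOG_LEVEL': 20` under "Overridden settings". A minimal sketch with the standard library only (the logger name `log_level_demo` is a made-up name for illustration):

```python
import logging

# Each level name maps to an integer; a higher number means a more severe message
for level_name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(level_name, getattr(logging, level_name))

# A logger set to INFO keeps INFO and above, and drops DEBUG
logger = logging.getLogger("log_level_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
print(logger.isEnabledFor(logging.DEBUG))    # False
print(logger.isEnabledFor(logging.WARNING))  # True
```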

It is now time to run our first scraping code, let's go!

In [2]:
!python src/scrapy1.py

2022-06-15 19:24:05 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-15 19:24:05 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n  15 Mar 2022), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-06-15 19:24:05 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/97.0'}
2022-06-15 19:24:05 [scrapy.extensions.telnet] INFO: Telnet Password: efcded54877601a9
2022-06-15 19:24:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-06-15 19:24:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermi

You may take a look at the result in [this file](src/1_randomquote.json).
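
A JSON feed is a list containing one dictionary per scraped item, so it can be loaded back with the `json` module. The sketch below is only an illustration: since it cannot rely on the crawl having run, it first writes a stand-in file with placeholder values (not real scraped data) in a temporary directory, then reads it the same way you would read the real feed.

```python
import json
import os
import tempfile

# Placeholder item standing in for what the spider returned (not real scraped data)
stand_in_items = [{"result1": "first value", "result2": "second value"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    feed_path = os.path.join(tmp_dir, "1_randomquote.json")

    # Write the stand-in feed so the example runs without crawling anything
    with open(feed_path, "w", encoding="utf-8") as feed_file:
        json.dump(stand_in_items, feed_file)

    # Reading a JSON feed back: a list with one dict per scraped item
    with open(feed_path, encoding="utf-8") as feed_file:
        items = json.load(feed_file)

print(len(items), items[0]["result1"])
```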

**WARNING**: A `CrawlerProcess` can only be started once per Python process, because the Twisted reactor it relies on cannot be restarted. Therefore, in this course, each script contains a single crawler. This is also why we do not run Scrapy within the notebook: it is not the usage it was designed for. Plus, it will make you practice writing scripts instead of notebooks!


## Scraping multiple items per page 🛍️🛍️

Let's see an example where we parse multiple elements with a `for` loop and Python's `yield` instruction (see appendix 1 of this lecture for details).

If you take a look at the following [webpage](http://quotes.toscrape.com/page/1/), you will see that many quotes are available. Let's take a look at the XPath for the first quote:

`/html/body/div/div[2]/div[1]/div[1]/span[1]/text()`

Now let's take a look at the XPath for the second quote:

`/html/body/div/div[2]/div[1]/div[2]/span[1]/text()` 

Only the index of the 4th `div` tag changes, so the general XPath for the quotes is:

`/html/body/div/div[2]/div[1]/div/span[1]/text()`

We could take advantage of this, or we could loop over the indices up to the last element, whose XPath is:

`/html/body/div/div[2]/div[1]/div[10]/span[1]/text()`
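
Both approaches can be tried offline on a toy page. The sketch below is not the content of the solution scripts. It assumes a hypothetical miniature page with the same nesting as the XPaths above, and it uses the standard library's `xml.etree.ElementTree` (which supports only a small subset of XPath, without `text()`) as a stand-in for `response.xpath()`. The function names and the `text` key are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature page with the same nesting as the XPaths above (not the real site)
PAGE = """
<html><body><div>
  <div>header</div>
  <div>
    <div>
      <div><span>Quote A</span><span>Author A</span></div>
      <div><span>Quote B</span><span>Author B</span></div>
      <div><span>Quote C</span><span>Author C</span></div>
    </div>
  </div>
</div></body></html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element, so paths start at ./body


def parse_with_general_path(page_root):
    """Drop the index on the 4th div: one path matches every quote."""
    for span in page_root.findall("./body/div/div[2]/div[1]/div/span[1]"):
        yield {"text": span.text}


def parse_with_index_loop(page_root, last_index):
    """Keep the index and loop from 1 up to the last element."""
    for i in range(1, last_index + 1):
        span = page_root.find(f"./body/div/div[2]/div[1]/div[{i}]/span[1]")
        yield {"text": span.text}


print(list(parse_with_general_path(root)))
print(list(parse_with_index_loop(root, last_index=3)))  # 3 quotes here, 10 on the real page
```

Both functions yield one dictionary per quote, which is the shape a spider's `parse` method produces when it scrapes several items from one page.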


### Solution 1

In [3]:
!python src/scrapy2.py

2022-06-15 19:24:06 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-15 19:24:07 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n  15 Mar 2022), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-06-15 19:24:07 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/97.0'}
2022-06-15 19:24:07 [scrapy.extensions.telnet] INFO: Telnet Password: 270f70707f21c012
2022-06-15 19:24:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-06-15 19:24:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermi

### Solution 2

In [4]:
!python src/scrapy2-alt.py

2022-06-15 19:24:09 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-15 19:24:09 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n  15 Mar 2022), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-06-15 19:24:09 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/97.0'}
2022-06-15 19:24:09 [scrapy.extensions.telnet] INFO: Telnet Password: f8c011e1676a05ba
2022-06-15 19:24:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-06-15 19:24:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermi

## Appendix 1 - What is the yield keyword for? 💐

You might have noticed that we used the `yield` keyword in Scrapy, which may be new and confusing. Technically speaking, a function that contains `yield` is called a *generator function*, and calling it returns a *generator*.

In a nutshell, `yield` is a very useful keyword for returning a collection of data one item at a time, without holding the whole collection in memory.

Let's look at an example with two functions:

In [5]:
# Simple function with return keyword
def return_list(a_list):
    for i in range(len(a_list)):
        a_list[i] *= 2
    return a_list

# Function with yield keyword
def return_with_yield(a_list):
    for i in range(len(a_list)):
        yield a_list[i] * 2

Now let's apply these two functions to our `random_list`:

In [6]:
# Create a list of numbers from 0 to 9
random_list = [x for x in range(10)]
# Returns a list
return_list(random_list)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [7]:
# Create a list of numbers from 0 to 9
random_list = [x for x in range(10)]
# Function with yield
return_with_yield(random_list)

<generator object return_with_yield at 0x0000022E0DABCAC0>

In the first example, `return_list` directly returned the full list, whereas in the second example, `return_with_yield` returned a `generator`. Generators are very cool because the loop has not actually been executed yet: values are computed only when something asks for them. Therefore, we haven't spent memory on storing the results.
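
To see that the loop really has not run yet, here is a minimal sketch (the function name `doubled` is made up for illustration) that prints a message each time a value is computed:

```python
def doubled(numbers):
    for n in numbers:
        print("computing", n * 2)
        yield n * 2


gen = doubled([1, 2, 3])   # nothing is printed here: the body has not started
first = next(gen)          # prints "computing 2" and pauses at the yield
second = next(gen)         # resumes the loop and prints "computing 4"
remaining = list(gen)      # runs the rest of the loop: prints "computing 6"
print(first, second, remaining)  # 2 4 [6]
```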

So if, instead of a list of 10 items, you had one of 1000000 items, it would make a huge difference in terms of memory usage, since the doubled values are never all stored at once.
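
The difference can be measured with `sys.getsizeof`. This is a rough sketch, assuming CPython: `getsizeof` only counts the container itself (not the integers it references), and the two helper functions are illustrative names.

```python
import sys


def double_all(numbers):
    # Builds and returns the complete list of results
    return [n * 2 for n in numbers]


def double_lazily(numbers):
    # Produces the same results one at a time
    for n in numbers:
        yield n * 2


big_list = double_all(range(1_000_000))
big_gen = double_lazily(range(1_000_000))

print(sys.getsizeof(big_list))  # several megabytes for the list object alone
print(sys.getsizeof(big_gen))   # a small, fixed size whatever the input length
```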

Now if you need to get the actual values of your generator, you can simply use a `for` loop or a list comprehension:

In [8]:
# Using a for loop will just print the output:
for number in return_with_yield(random_list):
    print("output", number)

# Using a list comprehension will create a list:
[i for i in return_with_yield(random_list)]

output 0
output 2
output 4
output 6
output 8
output 10
output 12
output 14
output 16
output 18


[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

If you simply need to yield from a list without doing any manipulation, you can use `yield from` instead of creating a loop. 
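
As a short sketch (both function names are made up for illustration), the two generators below produce exactly the same values:

```python
def numbers_with_loop(a_list):
    # Explicit loop: yield the items one by one
    for item in a_list:
        yield item


def numbers_with_yield_from(a_list):
    # Same behavior, delegated to the iterable in a single line
    yield from a_list


random_list = [x for x in range(10)]
print(list(numbers_with_loop(random_list)) == list(numbers_with_yield_from(random_list)))  # True
```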

## Appendix 2 - Crash course on XPath ⚔️

The best way to learn XPath is to follow this great tutorial from <a href="http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths" target="_blank">http://Zvon.org</a>.
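
In the meantime, here is a small sketch of three constructs you will use all the time. It runs on a hypothetical snippet (the tags and `class` values are assumptions, not copied from a real page) and uses the standard library's `xml.etree.ElementTree`, which understands only a subset of XPath, rather than Scrapy's `.xpath()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet used only to illustrate the syntax
SNIPPET = """
<div>
  <div class="quote"><span class="text">Quote A</span><small class="author">Author A</small></div>
  <div class="quote"><span class="text">Quote B</span><small class="author">Author B</small></div>
</div>
"""
root = ET.fromstring(SNIPPET)

# .//tag            : every descendant with that tag, at any depth
authors = [el.text for el in root.findall(".//small")]
print(authors)   # ['Author A', 'Author B']

# tag[@attr='value'] : only the elements whose attribute has that value
texts = [el.text for el in root.findall(".//span[@class='text']")]
print(texts)     # ['Quote A', 'Quote B']

# tag[n]            : the n-th matching child (counting starts at 1, not 0)
second = root.find("./div[2]/span").text
print(second)    # Quote B
```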

## Resources 📚📚

* <a href="https://docs.scrapy.org/en/latest/index.html" target="_blank">Scrapy Documentation</a>
* <a href="https://docs.python.org/3/library/logging.html" target="_blank">Logging</a>
* <a href="https://docs.scrapy.org/en/latest/topics/logging.html#topics-logging" target="_blank">Logging in Scrapy</a>
