# Scrapy advanced

In this course we will dig further into Scrapy's many capabilities, which are good enough to let you scrape even websites that try to protect themselves from this practice. Sites like Google or Facebook, for example, are particularly reluctant to be scraped because they do not want other people to leverage their users' data.

## What you will learn in this course 🧐🧐

This course will introduce the following concepts:

* Callbacks
* Web navigation with scrapy
* Post requests with scrapy
* Multiple spiders
* Avoid being banned
    * Scrapy projects
    * Autothrottle
    * Rotate user agent
    * Rotate IP address

## Callbacks

Callbacks are a way to call functions in our code at specific moments or when a specific task is done, depending on constraints and events that are partly or fully independent of the code itself. For example, we may want to use callbacks when:

* We want a function to run when a user clicks a certain button on an app
* We want a function to run when the response from the webserver has been fully loaded
* We want a function to run when a user lands on a specific url

The list goes on.
Scrapy relies heavily on callbacks, both for tasks that repeat across different pages and for following link after link to explore a website.
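
To make the idea concrete, here is a minimal sketch in plain Python, with no Scrapy involved. The `fetch` and `parse` functions are hypothetical names chosen for illustration: `fetch` pretends to download a page and then hands the result to whichever function was passed as `callback`.

```python
def fetch(url, callback):
    """Pretend to download `url`, then hand the result to `callback`."""
    # A stand-in for a real HTTP response (no network access here)
    fake_response = {"url": url, "body": f"<html>content of {url}</html>"}
    # The caller decides what happens next by choosing the callback
    return callback(fake_response)


def parse(response):
    """Callback: runs only once the (fake) response is available."""
    return {"url": response["url"], "size": len(response["body"])}


result = fetch("http://example.com", callback=parse)
print(result)
```

Scrapy works along the same lines: you give a request a `callback`, and Scrapy calls it with the response once the download is complete.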

## Following pagination links 📄📄

Scrapy sends requests to web servers much like a web browser does, so it can do many of the things a browser can do, such as navigating the web using links! All you have to do is get the XPath of the link you are looking to click and pass the extracted URL to the `response.follow` method.
[This code example](src/scrapy3.py) shows how to use links to iterate over multiple pages:

In [None]:
!pip install scrapy

In [None]:
!python src/scrapy3.py

The result of this scraping process can be found in [this file](src/3_quotesmultiplepages.json).

## Authentication on a website 🔐🔐

A very useful feature of Scrapy: you can automate authentication! More generally, you may fill out forms and submit them like any normal web user would!

This can be done by using `scrapy.FormRequest.from_response()` to send a POST request with your login and password to the website.

This method looks for a `<form>` tag in the response object (the HTML code of the URL you requested) and populates its fields with data of your choice. Pass the data to the `formdata` argument as a dictionary `{key: value}`, where `key` is the `name` attribute of the form field and `value` is what you wish to write there. By default, the method also simulates a click on the first clickable element of the form, such as a submit button.

You can take a look at the [documentation](https://docs.scrapy.org/en/latest/topics/request-response.html#formrequest-objects) to get a better understanding.

You may see this in [this script](src/scrapy4.py).

In [None]:
!python src/scrapy4.py

The result of this query is available in [this file](src/4_quotesauthentication.json).

## Run spiders with multiple starting urls 🕸️ 🕷️

It is possible to run a spider using several starting urls. Take a look at the [code example](src/scrapy5.py).

In [None]:
!python src/scrapy5.py

You may take a look at the [result file](src/5_quotesmultiplespiders.json).

## Avoid being banned

Many websites try to protect themselves from scrapers because they do not want their data (which in most cases accounts for much of their added value) to be captured by other people. This part of the lecture focuses on Scrapy features that make your scraper harder to detect, and therefore harder for reluctant websites to stop.

### Scrapy projects

Most of the techniques that make your scraping processes harder to detect have nothing to do with the actual scraping code, which depends entirely on the webpage you are trying to scrape and its source code. They have everything to do with the configuration of everything that surrounds your spider.

In order to easily access the configuration files of the scraping process and set up everything painlessly we can use what we call scrapy projects.

In [None]:
import os
os.chdir("src")
os.getcwd()

In [None]:
!scrapy startproject tutorial

Here's how the scrapy project will appear in your current working directory:

```
├── tutorial
│   ├── tutorial
│   │   ├── spiders
│   │   │   └── __init__.py
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   └── settings.py
│   └── scrapy.cfg
```   
  
We will define the spiders we wish to use to scrape the web in the `spiders` folder. The other Python files and the `scrapy.cfg` file are used to configure how the spiders are run.

### Autothrottle 🚫🚫

The more you scrape, the more requests you make. Well-protected websites may ban you for exceeding their request limits.

You can avoid that by automatically spacing out your requests with the `AutoThrottle` extension.

As stated in the documentation, the `AutoThrottle` extension is designed to:

- *Be nicer to sites instead of using default download delay of zero.*
- *Automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.*

Enabling AutoThrottle is as simple as adding `"AUTOTHROTTLE_ENABLED": True` to your crawler's settings. If you are working with Scrapy projects (as you should for any intensive scraping), you only need to uncomment the appropriate line in the [settings.py](src/tutorial/tutorial/settings.py) file.
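
For orientation, the AutoThrottle section of a project's `settings.py` could look like the fragment below once uncommented. Only the first line is needed to enable the extension. The other values are, to the best of my knowledge, Scrapy's defaults, and are shown only to indicate what can be tuned.

```python
# Turn the AutoThrottle extension on
AUTOTHROTTLE_ENABLED = True
# Initial download delay, in seconds
AUTOTHROTTLE_START_DELAY = 5
# Maximum download delay allowed when the server responds slowly
AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests sent in parallel to each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Set to True to log throttling stats for every response received
AUTOTHROTTLE_DEBUG = False
```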

In [None]:
os.chdir("tutorial")
os.getcwd()

To launch a spider contained in a scrapy project, all you have to do is `cd` into the project directory (in our case `tutorial`) then run the following command:

`!scrapy crawl name_of_the_spider -O filename_to_save_the_results`

In [None]:
!scrapy crawl autothrottle -O results/6_quotesautothrottle.json

### Rotate the user agent

Some websites do not expect the same browser to make many requests in a short amount of time. This is why rotating user agents can help you avoid getting caught scraping.

All you will have to do is install the `scrapy-user-agents` library and then include this code in the `settings.py` file:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```

In [None]:
!pip install scrapy-user-agents

In [None]:
os.chdir("..")
os.getcwd()

In [None]:
!scrapy startproject rotate_user_agent

In [None]:
os.chdir("rotate_user_agent")
os.getcwd()

In [None]:
!scrapy crawl rotate_user_agent -O results/rotate_user_agent.json

### Rotate IP addresses

A very easy way to detect scrapers is to count the number of requests made from the same IP address in a given amount of time. A simple workaround is to rotate between proxies that act as intermediaries between your web client and the web server, so that the server sees IP addresses that are not yours.

First we need to install the `scrapy-rotating-proxies` library.

All you have to do is find a list of proxies and add it to the `settings.py` file in the following way:

```python
ROTATING_PROXY_LIST = [
    'Proxy_IP:port',
    'Proxy_IP:port',
    # ...
]
```

Alternatively, you can load the proxy list from a file:

```python
ROTATING_PROXY_LIST_PATH = 'listofproxies.txt'
```
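
The proxy list is not enough on its own: as far as I recall from the `scrapy-rotating-proxies` README, the library's downloader middlewares must also be enabled in `settings.py`. The fragment below uses the priorities suggested there. Treat it as a sketch and check the README of the version you install.

```python
DOWNLOADER_MIDDLEWARES = {
    # Picks a working proxy from the list for each request
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # Marks proxies as dead when responses look like a ban
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```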

We'll use proxies given in the [proxy list file](src/Free_Proxy_List.csv).

In [None]:
!pip install scrapy-rotating-proxies

In [None]:
os.chdir("..")
os.getcwd()

In [None]:
import pandas as pd

proxy = pd.read_csv("Free_Proxy_List.csv")
proxy

In [None]:
# Build "ip:port" strings from the two columns of the CSV file
proxy_list = [f"{ip}:{port}" for ip, port in zip(proxy["ip"], proxy["port"])]
proxy_list[:3]

In [None]:
# Write one "ip:port" entry per line; the file is closed automatically
with open("proxy.txt", "w") as textfile:
    for element in proxy_list:
        textfile.write(element + "\n")

In [None]:
!scrapy startproject rotate_proxy

In [None]:
os.chdir("rotate_proxy")

In [None]:
!scrapy crawl rotate_proxy -O results/8_rotate_proxy.json

## Resources 📚📚

[Rotate user agent](https://python.plainenglish.io/rotating-user-agent-with-scrapy-78ca141969fe)

[Rotate proxies](https://medium.com/@TeraCrawler.io/how-to-rotate-proxies-in-scrapy-2bccf38439f7)