# Introduction to crawlers/spiders in Python

This notebook contains a short introduction to working with crawlers/spiders with `Scrapy`:

- What are crawlers/spiders?
- Defining functions in Python
- What is a "class" in Python?
- Building a simple crawler using `Scrapy`

## What are crawlers/spiders?

Where "web scraping" refers to the (mostly) automated collection of data and material from websites, crawlers and spiders are bots/programs specifically developed to traverse several websites and perform scraping tasks along the way.

If we are interested in scraping the content of several websites without knowing the exact URLs of those websites, a crawler can be used to go from site to site and perform the necessary web scraping task.

Developing crawlers can be especially tricky if they have to traverse several domains. This is because the web is connected in such a way that a few dominant sites are linked to from most other websites (just think of how often you see links to Google, Twitter, Facebook etc. on a website). If we imagine the web as an ocean with layers, as in the figure below, a crawler will always drift towards the surface because the websites located there are referenced so often.

Obviously we want to avoid the surface with a crawler, as it will then end up trying to crawl the entire web.

![websea](./img/web_sea.png)

*Source unknown*

### Constructing a crawler

The following should be considered when constructing a crawler:
- Where should the crawler start?
- What sites are of interest?
- What scraping task should the crawler do?
- How should the crawler be limited?

In Python, a sensible way of constructing a crawler is to use relevant data structures to define the starting points and any sites to avoid. The scraping tasks can be defined as functions to be integrated in the crawler.
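As a minimal sketch of this idea, the starting points can be kept in a list, the domains to avoid in a set, and the limiting rule in a small function. The URL `https://example.org/news` and the names `start_urls`, `blocked_domains` and `should_visit` are illustrative assumptions and not part of any library.

```python
from urllib.parse import urlparse

# Hypothetical starting points and domains the crawler should stay away from
start_urls = ["https://example.org/news"]
blocked_domains = {"google.com", "twitter.com", "facebook.com"}

def should_visit(url):
    """Return True if the domain of the URL is not in the blocked set."""
    domain = urlparse(url).netloc.lower()
    # Strip a leading "www." so "www.google.com" matches "google.com"
    if domain.startswith("www."):
        domain = domain[4:]
    return domain not in blocked_domains

print(should_visit("https://example.org/about"))
print(should_visit("https://www.google.com/search"))
```

A set is used for the blocked domains because checking membership in a set is fast, even when the list of sites to avoid grows long.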

## Functions in Python

It is easy to define your own functions in Python. When working with crawlers, it makes sense to think of the scraping tasks to be performed as functions to be integrated.

The syntax for defining a function is as follows:

```{python}
def my_function(x):
    result = x + 2
    return(result)
```

- `def` is used to define functions. 
- `my_function` is the name of the function. Make sure not to overwrite existing functions.
- The `x` in parentheses is the input argument. A function takes from 0 to any number of arguments, separated by commas.
- The indented lines following `:` are the commands for the function.
- `result` is a variable created inside the function. It only exists inside the function. `x` refers to the input argument `x`.
- `return` is used to specify the output of the function. Note: Commands following a return line in the function are ignored. 
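The point about the number of arguments can be illustrated with a short sketch. The function names `greet` and `add_numbers` are made up for this example.

```python
def greet():
    # A function with no input arguments
    return "Hello!"

def add_numbers(x, y):
    # A function with two input arguments, separated by a comma
    return x + y

print(greet())
print(add_numbers(2, 3))
```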

In [1]:
def my_function(x):
    result = x + 2
    return(result)

Notice that running the cell above returns nothing. We have simply defined a function which in itself does not have an output.

The function is now available to use:

In [2]:
my_function(3)

5

The function adds 2 to the input argument (in this case `3`) and returns it (stored inside the function as `result`).

Notice that `result` does not exist outside the function:

In [3]:
print(result)

NameError: name 'result' is not defined

**The return statement**

If the function is meant to have an output, the return statement is used to specify what should be returned.

Any lines following a return statement in a function are ignored:

In [4]:
def a_function():

    print("You see me!")
    
    return
    
    print("But you don't see me!")
    
a_function()

You see me!


### Error handling

An integral part of Python programming is the ability to handle errors in the code.

In most data analysis tasks, we have little use for error handling. However, when working with crawlers we are writing commands that have to process information we do not know beforehand. Error handling can therefore be necessary to ensure that the crawler does not simply terminate when encountering an error.

When we provide a function with an input that it is not able to handle, it will return an error:

In [5]:
my_function("hello")

TypeError: can only concatenate str (not "int") to str

Because the function performs addition on the input argument, and Python cannot add a number to text, it raises an error.

In a crawler setting, the errors we encounter could also be related to data types or to trying to access attributes in HTML code that are not present.

There are two main ways of including error handling:
1. Using if-else statements
2. Using try-except statements

#### Using if-else statements as error handling

if-else is simply used to write commands that should only be run when certain conditions are met:

In [6]:
def if_function(x):
    if x > 2:
        return("Above 2!")
    else:
        return("Not above 2!")

In [7]:
if_function(1)

'Not above 2!'

If-else can be used to check the length of a data structure (number of elements), whether a certain attribute or tag is present and so on.
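As a sketch of such a check, the function below only looks up a link if the attribute is present. A plain dictionary stands in for the attributes of an HTML tag here; the name `get_link` and the example values are assumptions made for illustration.

```python
def get_link(tag_attributes):
    # Only look up "href" if the attribute is actually present
    if "href" in tag_attributes:
        return tag_attributes["href"]
    else:
        return None

print(get_link({"href": "/news/article-1", "class": "title"}))
print(get_link({"class": "title"}))
```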

#### Using try-except as error handling

Sometimes we may not know the specific conditions that have to be met to avoid the error but we instead know what errors could occur.
try-except allows one to write commands that account for specific errors.

The logic is as follows:

- *try* to do something.
- *except* if you encounter this error: then do something else.

Below, `my_function` is redefined to account for a `TypeError`:

In [8]:
import numpy as np

def my_function(x):
    try:
        result = x + 2
    except TypeError:
        result = np.nan
        
    return(result)

In [9]:
my_function("hello")

nan

The function now no longer throws an error but instead returns a missing value (`np.nan`) when encountering a TypeError.

#### Except all errors (beware)

try-except statements do not need a specified error to work. It is possible to simply catch all errors:

In [10]:
def my_function(x):
    try:
        result = x + 2
    except:
        result = np.nan
        
    return(result)

In [11]:
my_function("hello")

nan

As a rule, this practice is highly discouraged, as you run the risk of completely overlooking glaring errors in your code:

In [12]:
def my_function(y):
    try:
        result = x + 2
    except:
        result = np.nan
        
    return(result)

In [13]:
my_function(5)

nan

In the function above, there is a mismatch between the input argument (`y`) and the variable used in the function (`x`). Normally Python would raise a `NameError` when running a function like this, but the error is captured by the `except` statement (as it captures all errors).
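For comparison, here is a sketch of the same buggy function where only `TypeError` is caught. It uses `math.nan` from the standard library in place of `np.nan` so that it runs without extra packages; that swap is a choice made for this example only.

```python
import math

def my_function(y):
    try:
        result = x + 2  # Bug kept on purpose: this should have been y
    except TypeError:
        result = math.nan
    return result

# The NameError is not caught inside the function, so the bug surfaces
try:
    my_function(5)
except NameError as error:
    print("The bug is no longer hidden:", error)
```

Because the `except` statement names the error it expects, anything unexpected still stops the code and points to the mistake.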

## Classes in Python

Everything in Python is essentially an object, and every object belongs to a class. The class determines what is possible with a specific object and what the object can contain.

Classes consist (mainly) of three components:
- The content of the class itself
- Attributes stored within the class
- Methods callable with the class

A `pandas Series` is, for example, a class:

In [14]:
import pandas as pd

a_series = pd.Series([1,4,6,8])
type(a_series)

pandas.core.series.Series

### Attributes

Attributes in a class are accessed using `.` followed by the attribute.

Pandas series contain the attribute `size`, which returns the number of elements.

In [15]:
a_series.size

4

Because the `size` attribute is specific to the pandas Series class, it is, for example, not available on a list:

In [16]:
a_list = [1,4,6,8]
a_list.size

AttributeError: 'list' object has no attribute 'size'

### Methods

Methods in a class are accessed using `.` followed by the name of the method, followed by parentheses.

Methods are similar to functions in that some may accept input arguments (in addition to the object itself). These are specified in the parentheses.

Pandas series contain the method `mean()` to calculate the mean. This method is specific to pandas series and is also not callable from a list:

In [None]:
a_series.mean()

In [None]:
a_list.mean()

### Constructing a class

It is possible to specify your own class in Python. This allows one to combine functions and data in *one* specific object.

Classes are simply constructed using `class` followed by a name for the class and `:`. 

The indented lines specify what the class should contain.

- Attributes are simply created by specifying a variable to include in the class
- Methods are created by specifying a function in the class. Notice that methods always need the instance itself as their first argument (`self`)

In [17]:
class my_class:
    number = 5
    
    def say_hello(self):
        print("Hello!")

The class is now defined. To use it, an instance of the class has to be created and assigned to a variable:

In [18]:
a_class = my_class()

`a_class` is now an instance of `my_class` with the attributes and methods available:

In [19]:
a_class.number

5

In [20]:
a_class.say_hello()

Hello!


#### Input in a class

To specify input for a class, it needs a "constructor". This is done by specifying a function named `__init__`. This function specifies the required arguments and what to do when an instance of the class is created.

Below we create a custom list class with a mean method. The mean itself can be calculated with a simple loop:

In [21]:
total = 0
for i in a_list:
    total = total + i
mean = total/len(a_list)

    

In [22]:
class numbers_list:
    def __init__(self, numbers):
        self.data = numbers
    def mean(self):
        total = 0
        for number in self.data:
            total = total + number
        mean = total/len(self.data)
        return(mean)        

In [23]:
my_list = numbers_list([2,5,6,7])

In [24]:
my_list.data

[2, 5, 6, 7]

In [25]:
my_list.mean()

5.0
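One case the class above does not handle is an empty list, where `len(self.data)` is 0 and the division raises a `ZeroDivisionError`. The sketch below combines the class with an if-else check from the error handling section. The class name `safe_numbers_list` and the choice of returning `None` are assumptions made for this example.

```python
class safe_numbers_list:
    def __init__(self, numbers):
        self.data = numbers

    def mean(self):
        # An empty list has no mean, so return None instead of dividing by zero
        if len(self.data) == 0:
            return None
        else:
            return sum(self.data) / len(self.data)

print(safe_numbers_list([2, 5, 6, 7]).mean())
print(safe_numbers_list([]).mean())
```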

## Building a scraper (using `Scrapy`)

The package [`scrapy`](https://docs.scrapy.org/en/latest/) is used for various web scraping purposes. 

One major challenge when crawling is the massive amount of request handling needed to crawl across various sites (the crawler has to keep sending new requests and not just stop if it encounters a timeout). Another thing to be aware of is crawler restrictions on the page (`robots.txt`) and avoiding sending too many requests to a server too quickly.

Luckily, `scrapy` has a lot of existing functions and classes that are created to account for common problems in scraping. Using scrapy, one can focus on the actual scraping tasks that need to be performed.
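The concerns about `robots.txt` and request rate correspond to Scrapy settings. As a sketch, a settings dictionary like the one below could be passed to `CrawlerProcess`. `USER_AGENT`, `ROBOTSTXT_OBEY`, `DOWNLOAD_DELAY` and `AUTOTHROTTLE_ENABLED` are documented Scrapy settings; the name `crawler_settings` and the specific values are assumptions and should be adjusted to the site being crawled.

```python
# Hypothetical values; tune them to the site being crawled
crawler_settings = {
    "USER_AGENT": "Mozilla/5.0",
    "ROBOTSTXT_OBEY": True,        # respect the restrictions in robots.txt
    "DOWNLOAD_DELAY": 1.0,         # seconds to wait between requests to the same site
    "AUTOTHROTTLE_ENABLED": True,  # adapt the delay to how the server responds
}

# The dictionary would be used as: CrawlerProcess(crawler_settings)
print(crawler_settings)
```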

Here is a boiled down version of how to create a simple scraper using `scrapy`:
- Create a crawler class that inherits from the base class `scrapy.Spider` (e.g. `my_crawler`)
    - Name the spider by creating a `name` attribute (this is used to call it later)
    - Specify the URLs to scrape in a `start_urls` attribute
    - (Optional) Specify how the scraper should initially process the URLs in `start_urls` (by default, it sends a GET request for each and returns a response object)
    - Specify how each response from the requests sent should be processed by defining a `parse` method
- Create a data structure for the scraped info to be stored in
- Call the `CrawlerProcess()` from `scrapy`: `process = CrawlerProcess()`
- Define what crawler the `CrawlerProcess()` should use: `process.crawl(my_crawler)`
- Start the crawling: `process.start()`

**NOTE ON RESTARTING CRAWLERS**

A Scrapy crawler can only be run once in a given notebook instance, because the Twisted reactor that Scrapy runs on cannot be restarted. To run the crawler again, you have to restart the kernel of the notebook as well.

In [26]:
import requests
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

In [None]:
class eu_crawler(scrapy.Spider):
    name = "eu_crawler"
    main_url = 'https://ec.europa.eu/clima/news-your-voice/news_en'
    start_urls = ['https://ec.europa.eu/clima/news-your-voice/news_en']
    
    def parse(self, response):
        soup = bs(response.text, "html.parser") # Notice that HTML content is referred to as .text in a scrapy response
        
        article_rows_soup = soup.find_all("div", class_ = "ecl-content-item-block__item")
        
        for row in article_rows_soup:
            article_dict = {}

            article_title_soup = row.find("div", class_ = "ecl-content-item__title").find("a")
            article_title = article_title_soup.get_text()
            article_link = article_title_soup['href']

            article_dict['title'] = article_title
            article_dict['link'] = article_link

            article_list.append(article_dict)
        
        try:
            next_page_url = urljoin(self.main_url, soup.find("a", attrs = {'aria-label': "Go to next page"})['href'])
        except (TypeError, KeyError): # No "next page" link found (find returned None) or the link has no href
            next_page_url = None
            
        if next_page_url is not None:
            yield scrapy.Request(url = next_page_url, callback=self.parse)

article_list = []
process = CrawlerProcess(
    {'USER_AGENT': 'Mozilla/5.0'}
)
process.crawl(eu_crawler)
process.start()
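The links collected from `href` attributes may be relative to the site rather than full URLs. As a sketch of a possible follow-up step, `urljoin` can be used to turn them into absolute URLs. The entries in `example_articles` are made up for illustration and are not actual results from the crawler.

```python
from urllib.parse import urljoin

main_url = "https://ec.europa.eu/clima/news-your-voice/news_en"

# Made-up entries standing in for what the crawler might have collected
example_articles = [
    {"title": "Example article", "link": "/clima/news-your-voice/news/example_en"},
    {"title": "Another example", "link": "https://ec.europa.eu/clima/other-example_en"},
]

for article in example_articles:
    # urljoin resolves relative links and leaves absolute URLs untouched
    article["link"] = urljoin(main_url, article["link"])
    print(article["title"], "->", article["link"])
```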