# Introduction to crawlers/spiders in Python

This notebook contains a short introduction to working with crawlers/spiders with `Scrapy`:

- What are crawlers/spiders?
- Defining functions in Python
- What is a "class" in Python?
- Building a simple crawler using `Scrapy`

## What are crawlers/spiders?

Where "web scraping" refers to the (mostly) automated collection of data and material from websites, crawlers and spiders are bots/programs specifically developed to traverse several websites and perform scraping tasks along the way.

If we are interested in scraping the content of several websites without knowing the exact URLs of those websites, a crawler can be used to go from site to site and perform the necessary web scraping task.

Developing crawlers can be especially tricky if they have to traverse several domains. This is because the web is connected in such a way that a few dominant sites are linked to from most other websites (just think of how often you see links to Google, Twitter, Facebook etc. on a website). If we imagine the web as an ocean with layers, as in the figure below, an unrestricted crawler will tend to drift towards the surface because the websites located there are referenced so often.

Obviously we want to avoid the surface with a crawler, as it will then end up trying to crawl the entire web.

![websea](./img/web_sea.png)

*Source unknown*

### Constructing a crawler

The following should be considered when constructing a crawler:
- Where should the crawler start?
- What sites are of interest?
- What scraping task should the crawler do?
- How should the crawler be limited?

In Python, a practical way of constructing a crawler is to use suitable data structures (such as lists or sets) to define the starting points and the sites to avoid. The scraping tasks can then be defined as functions that are integrated into the crawler.
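The four considerations above can be sketched without any network access. The example below is a minimal illustration, not a production crawler: the `LINKS` dictionary is a hypothetical stand-in for real pages, and the names `START_URLS`, `ALLOWED_DOMAINS`, `MAX_DEPTH`, `is_allowed`, `scrape` and `crawl` are made up for this sketch.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on that page
LINKS = {
    "https://example.org/": ["https://example.org/news", "https://twitter.com/example"],
    "https://example.org/news": ["https://example.org/news/1", "https://www.google.com/"],
    "https://example.org/news/1": ["https://example.org/"],
}

START_URLS = ["https://example.org/"]  # where the crawler starts
ALLOWED_DOMAINS = {"example.org"}      # which sites are of interest
MAX_DEPTH = 2                          # how the crawler is limited


def is_allowed(url):
    """Return True if the host of the URL is one of the allowed domains."""
    return urlparse(url).netloc in ALLOWED_DOMAINS


def scrape(url):
    """Placeholder scraping task: here we only record the path of the URL."""
    return urlparse(url).path


def crawl(start_urls, max_depth):
    """Breadth-first crawl that never leaves ALLOWED_DOMAINS."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    results = []
    while queue:
        url, depth = queue.popleft()
        results.append(scrape(url))
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in LINKS.get(url, []):
            if link not in seen and is_allowed(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return results


visited_paths = crawl(START_URLS, MAX_DEPTH)
print(visited_paths)
```

The links to Twitter and Google are never followed because their domains are not in `ALLOWED_DOMAINS`, which is what keeps this crawler away from the "surface".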

## Building a scraper (using `Scrapy`)

The package [`scrapy`](https://docs.scrapy.org/en/latest/) is used for various web scraping purposes. 

One major challenge when crawling is the large amount of request handling needed to crawl across various sites (the crawler has to keep sending new requests and not just stop if it encounters a timeout). Another thing to be aware of is the crawler restrictions a site states in its `robots.txt` file, and the need to avoid sending too many requests to a server too quickly.
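To get a feel for what `robots.txt` restrictions look like in practice, the sketch below uses `urllib.robotparser` from the standard library. The `robots.txt` content, the URLs and the user agent name `my_crawler` are hypothetical; a real crawler would download the file from the site it wants to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from a real site)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules from a list of lines


def may_fetch(url, user_agent="my_crawler"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


news_allowed = may_fetch("https://example.org/news")
private_allowed = may_fetch("https://example.org/private/report")
delay_seconds = parser.crawl_delay("my_crawler")  # requested pause between requests

print(news_allowed, private_allowed, delay_seconds)
```

With `scrapy`, this kind of check does not have to be written by hand: the settings `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` can be passed in the settings dictionary given to `CrawlerProcess`.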

Luckily, `scrapy` has a lot of existing functions and classes created to handle common problems in scraping. Using `scrapy`, one can focus on the actual scraping tasks that need to be performed.

Here is a boiled-down version of how to create a simple scraper using `scrapy`:
- Create a crawler class that inherits from the base class `scrapy.Spider` (e.g. `my_crawler`)
    - Name the spider by creating a `name` attribute (this is used to call it later)
    - Specify the URLs to scrape in a `start_urls` attribute
    - (Optional) Specify how the scraper should initially process the URLs in `start_urls` (by default, it sends a GET request for each and returns a response object)
    - Specify how each response to the requests sent should be processed by defining a `parse` method
- Create a data structure for the scraped info to be stored in
- Create a `CrawlerProcess` instance from `scrapy`: `process = CrawlerProcess()`
- Define which crawler the `CrawlerProcess` should use: `process.crawl(my_crawler)`
- Start the crawling: `process.start()`

**NOTE ON RESTARTING CRAWLERS**

A Scrapy crawler can only be run once in a given notebook instance, because the Twisted reactor that `CrawlerProcess` runs on cannot be restarted. To run the crawler again, you have to restart the kernel of the notebook as well.

In [None]:
import requests
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

In [None]:
class eu_crawler(scrapy.spider):  # Intentional error to avoid mass crawling: change to scrapy.Spider to actually run the crawler
    name = "eu_crawler"
    main_url = 'https://ec.europa.eu/clima/news-your-voice/news_en'
    start_urls = ['https://ec.europa.eu/clima/news-your-voice/news_en']
    
    def parse(self, response):
        soup = bs(response.text, "html.parser")  # Notice that the HTML content is referred to as .text in a scrapy response
        
        article_rows_soup = soup.find_all("article", class_ = "ecl-content-item")
        
        for row in article_rows_soup:
            article_dict = {}

            article_title_soup = row.find("div", class_ = "ecl-content-item__title").find("a")
            article_title = article_title_soup.get_text()
            article_link = article_title_soup['href']

            article_date = row.find("time")["datetime"]

            article_summary_soup = row.find("div", class_ = "ecl-content-item__description")
            try:
                article_summary = article_summary_soup.get_text(strip = True)
            except AttributeError:  # No summary found: .find() returned None
                article_summary = ""

            article_dict['title'] = article_title
            article_dict['link'] = article_link
            article_dict['date'] = article_date
            article_dict['summary'] = article_summary

            article_list.append(article_dict)
        
        try:
            next_page_url = urljoin(self.main_url, soup.find("a", attrs = {'aria-label': "Go to next page"})['href'])
        except (TypeError, KeyError):  # No "next page" link (or no href) found: stop paginating
            next_page_url = None
            
        if next_page_url is not None:
            yield scrapy.Request(url = next_page_url, callback=self.parse)

article_list = []
process = CrawlerProcess(
    {'USER_AGENT': 'Mozilla/5.0'}
)
process.crawl(eu_crawler)
process.start()
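The crawler above relies on `urljoin` to turn the `href` of the "next page" link into a full URL before requesting it. The sketch below shows how `urljoin` resolves a few kinds of links against `main_url`. The `href` values are hypothetical examples; the links on the actual page may look different.

```python
from urllib.parse import urljoin

main_url = "https://ec.europa.eu/clima/news-your-voice/news_en"

# Hypothetical href values, chosen to show three common cases
query_only = urljoin(main_url, "?page=1")                                   # only a query string
root_relative = urljoin(main_url, "/clima/news-your-voice/news_en?page=2")  # path from the site root
absolute = urljoin(main_url, "https://example.org/other")                   # already a full URL

print(query_only)
print(root_relative)
print(absolute)
```

Because `urljoin` leaves absolute URLs untouched, the same call works whether the site uses relative or absolute links for pagination.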