# Introduction to crawlers/spiders in Python

This notebook contains a short introduction to working with crawlers/spiders with `Scrapy`:

- What are crawlers/spiders?
- Defining functions in Python
- What is a "class" in Python?
- Building a simple crawler using `Scrapy`

## What are crawlers/spiders?

Where "web scraping" refers to the (mostly) automated collection of data and material from websites, crawlers and spiders are bots/programs specifically developed to traverse several websites and perform some scraping task.

If we are interested in scraping the content of several websites without knowing the exact URLs of those websites, a crawler can be used to go from site to site and perform the necessary web scraping task.

Developing crawlers can be especially tricky if they have to traverse several domains. This is because the web is connected in such a way that a few dominant sites are linked to from most other websites (just think of how often you see links to Google, Twitter, Facebook etc. on a website). If we imagine the web as an ocean with layers, as in the figure below, a crawler will tend to drift towards the surface because the websites located there are referenced so often.

Obviously we want to avoid the surface with a crawler, as it will then end up trying to crawl the entire web.

![websea](./img/web_sea.png)

*Source unknown*

### Constructing a crawler

The following should be considered when constructing a crawler:
- Where should the crawler start?
- What sites are of interest?
- What scraping task should the crawler do?
- How should the crawler be limited?

In Python, a practical way of constructing a crawler is to use suitable data structures (e.g. lists or sets) to define the starting points and any sites to avoid. The scraping tasks can be defined as functions to be integrated in the crawler.
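
As a minimal sketch of these considerations, the example below crawls a small, made-up link graph instead of the real web. The names `toy_web`, `crawl`, `avoid_domains` and `max_pages` are hypothetical and only serve to illustrate the idea: a list defines the starting points, a set defines the domains to avoid, and a page limit keeps the crawler from running indefinitely.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for the web: each URL maps to the links found on that page
toy_web = {
    "https://site-a.example/": ["https://site-a.example/about", "https://bigplatform.example/"],
    "https://site-a.example/about": ["https://site-b.example/"],
    "https://site-b.example/": ["https://site-a.example/", "https://bigplatform.example/"],
    "https://bigplatform.example/": ["https://bigplatform.example/everything"],
}

def crawl(start_urls, avoid_domains, max_pages=10):
    """Visit pages breadth-first, skipping avoided domains and already visited pages."""
    queue = deque(start_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or urlparse(url).netloc in avoid_domains:
            continue
        visited.append(url)                 # the scraping task would be called here
        queue.extend(toy_web.get(url, []))  # follow the links found on the page
    return visited

visited_pages = crawl(
    start_urls=["https://site-a.example/"],
    avoid_domains={"bigplatform.example"},
)
print(visited_pages)
```

Because the heavily linked `bigplatform.example` domain is in the set of domains to avoid, the crawl stays on the two smaller sites and stops once no new pages are left.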

## Functions in Python

It is easy to define your own functions in Python. When working with crawlers, it makes sense to think of the scraping tasks to be performed as functions to be integrated.

The syntax for defining a function is as follows:

```{python}
def my_function(x):
    result = x + 2
    return(result)
```

- `def` is used to define functions. 
- `my_function` is the name of the function. Make sure not to overwrite existing functions.
- The `x` in parentheses is the input argument. A function takes any number of arguments (including none), separated by commas.
- The indented lines following `:` are the commands for the function.
- `result` is a variable created inside the function. It only exists inside the function. `x` refers to the input argument `x`.
- `return` is used to specify the output of the function. Note: Commands following a return line in the function are ignored. 
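
To illustrate the points above, the sketch below defines one function without arguments and one with two arguments. The names `greet` and `add_numbers` are made up for this example.

```python
def greet():
    # A function without input arguments
    return "Hello!"

def add_numbers(x, y):
    # A function with two input arguments, separated by a comma
    result = x + y
    return result

print(greet())
print(add_numbers(2, 3))
```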

In [4]:
def my_function(x):
    result = x + 2
    return(result)

Notice that running the cell above returns nothing. We have simply defined a function which in itself does not have an output.

The function is now available to use:

In [3]:
my_function(3)

5

The function adds 2 to the input argument (in this case `3`) and returns the sum (stored inside the function as `result`).

Notice that `result` does not exist outside the function:

In [1]:
print(result)

NameError: name 'result' is not defined

**The return statement**

If the function is meant to have an output, the return statement is used to specify what is returned.

Any lines following a return statement in a function are ignored:

In [2]:
def a_function():

    print("You see me!")
    
    return
    
    print("But you don't see me!")
    
a_function()

You see me!
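
A related detail that is easy to miss: a function that ends without a `return` statement (or uses a bare `return`, as above) gives back `None`. The sketch below, using the hypothetical function names `no_output` and `with_output`, shows the difference.

```python
def no_output(x):
    # The sum is computed but never returned
    result = x + 2

def with_output(x):
    result = x + 2
    return result

print(no_output(3))    # None
print(with_output(3))  # 5
```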


### Error handling

An integral part of Python programming is the ability to handle errors in the code.

In most data analysis tasks, we have little use for error handling. However, when working with crawlers we are writing commands that have to be able to process information that we do not know beforehand. Error handling can therefore be necessary to ensure that the crawler does not simply terminate when it encounters an error.

When we provide a function with an input that it is not able to handle, it will raise an error:

In [5]:
my_function("hello")

TypeError: can only concatenate str (not "int") to str

The function performs addition on the input argument, and Python raises an error because it cannot add a number to text.

In a crawler setting, the errors we encounter could likewise be related to data types, or to trying to access attributes in HTML code that are not present.

There are two main ways of including error handling:
1. Using if-else statements
2. Using try-except statements

#### Using if-else statements as error handling

if-else is simply used to write commands that should only be run when certain conditions are met:

In [8]:
def if_function(x):
    if x > 2:
        return("Above 2!")
    else:
        return("Not above 2!")

In [7]:
if_function(1)

'Not above 2!'

If-else can be used to check the length of a data structure (number of elements), whether a certain attribute or tag is present and so on.
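
As a sketch of such checks, the example below uses a plain dictionary as a stand-in for the attributes of an HTML tag and a list as a stand-in for the links found on a page. The function names `get_href` and `first_link` are hypothetical.

```python
def get_href(tag_attributes):
    # Only access 'href' if the attribute is present
    if "href" in tag_attributes:
        return tag_attributes["href"]
    else:
        return None

def first_link(links):
    # Only index into the list if it contains at least one element
    if len(links) > 0:
        return links[0]
    else:
        return None

print(get_href({"href": "/about", "class": "menu"}))
print(get_href({"class": "menu"}))
print(first_link([]))
```

Without the checks, the second call would raise a `KeyError` and the third an `IndexError`.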

#### Using try-except as error handling

Sometimes we may not know the specific conditions that have to be met to avoid an error, but we do know which errors could occur.
try-except allows one to write commands that account for specific errors.

The logic is as follows:

- *try* to do something.
- *except* if you encounter this error: then do something else.

Below, `my_function` is redefined to account for a `TypeError`:

In [10]:
import numpy as np

def my_function(x):
    try:
        result = x + 2
    except TypeError:
        result = np.nan
        
    return(result)

In [12]:
my_function("hello")

nan

The function now no longer throws an error but instead returns a missing value (`np.nan`) when encountering a TypeError.
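
The same pattern applies to the errors mentioned earlier for crawlers. As a sketch, the hypothetical function `link_length` below expects a dictionary with an `href` key (a stand-in for the attributes of an HTML tag). A missing key raises a `KeyError` and a value without a length raises a `TypeError`. Both are caught by name, so any other error would still surface.

```python
def link_length(tag_attributes):
    try:
        return len(tag_attributes["href"])
    except (KeyError, TypeError):
        # KeyError: there is no 'href' entry
        # TypeError: the 'href' value has no length (for example None)
        return 0

print(link_length({"href": "/about"}))
print(link_length({"class": "menu"}))
print(link_length({"href": None}))
```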

#### Except all errors (beware)

try-except statements do not need a specified error to work. It is possible to simply catch all errors:

In [13]:
def my_function(x):
    try:
        result = x + 2
    except:
        result = np.nan
        
    return(result)

In [14]:
my_function("hello")

nan

As a rule, this practice is highly discouraged, as you run the risk of completely overlooking glaring errors in your code:

In [19]:
def my_function(y):
    try:
        result = x + 2
    except:
        result = np.nan
        
    return(result)

In [20]:
my_function(5)

nan

In the function above, there is a mismatch between the input argument (`y`) and the variable used in the function (`x`). Normally Python would raise a `NameError` when running a function like this, but here the error is captured by the `except` statement (as it captures all errors).
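
For comparison, the sketch below keeps the same kind of bug but only catches `TypeError`. The variable name `undefined_input` is deliberately one that is not defined anywhere, and `buggy_function` is a hypothetical name. Because `NameError` is not caught inside the function, the mistake surfaces. It is caught outside the function here only so that the message can be printed.

```python
def buggy_function(y):
    try:
        result = undefined_input + 2  # bug: the argument is called y
    except TypeError:
        result = float("nan")
    return result

try:
    buggy_function(5)
    error_name = None
except NameError as error:
    error_name = type(error).__name__
    print(error_name, "-", error)
```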

In [None]:
import re
from datetime import datetime
from urllib.parse import urljoin, urlparse

import scrapy
from bs4 import BeautifulSoup as bs
from scrapy.crawler import CrawlerProcess


def kw_scraper(start_urls, keywords):
    # Lists shared with the spider through the enclosing function scope
    site_list = []     # one dictionary per page that matches at least one keyword
    scraped_urls = []  # URLs of pages that have already been parsed

    # Create class
    class kw_spider(scrapy.Spider):
        name = "kw_spider"

        def start_requests(self):
            # One request per start URL; each response is handed to parse()
            for start_url in start_urls:
                yield scrapy.Request(url=start_url, callback=self.parse)

        # Parsing
        def parse(self, response):
            print(response.url)

            page_url = response.url
            domain_url = urlparse(page_url).netloc

            # Parse the HTML that Scrapy has already downloaded
            # (no need to request the page a second time)
            page_soup = bs(response.body, 'html.parser')

            # The text is lowercased, so keywords should be given in lowercase
            page_text = page_soup.get_text().lower()

            # Unique links without trailing slash; drop empty and one-character links
            page_links = list(set(re.sub(r'/$', '', tag['href']) for tag in page_soup.find_all('a') if tag.has_attr('href')))
            page_links = [link for link in page_links if len(link) > 1]

            # Store the page if at least one keyword occurs in its text
            matches = [keyword for keyword in keywords if keyword in page_text]
            if matches:
                site_list.append({
                    'url': page_url,
                    'links': page_links,
                    'date-of-access': str(datetime.now().date()),
                    'keywords_matched': matches,
                })

            scraped_urls.append(page_url)

            # Keep links on the same domain (links without "http" are treated as relative)
            internal_urls = [link for link in page_links if domain_url in link or "http" not in link]

            # Resolve relative links and skip pages that have already been parsed
            new_urls = set(urljoin(page_url, link) for link in internal_urls) - set(scraped_urls)

            for url in new_urls:
                yield scrapy.Request(url=url, callback=self.parse)

    # Run spider
    process = CrawlerProcess()
    process.crawl(kw_spider)
    process.start()  # blocks until the crawl has finished

    return site_list
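
The step in `parse` that limits the crawler to the starting domain can be tried in isolation, without running Scrapy. The sketch below is a simplified variant rather than the exact logic above: the hypothetical function `internal_links` first resolves every link against the page URL and then compares domains. The URLs are made up for the example.

```python
from urllib.parse import urljoin, urlparse

def internal_links(page_url, links):
    """Return the absolute URLs in `links` that stay on the domain of `page_url`."""
    domain = urlparse(page_url).netloc
    absolute = [urljoin(page_url, link) for link in links]
    return sorted({url for url in absolute if urlparse(url).netloc == domain})

links = ["/about", "contact", "https://example.org/news", "https://twitter.com/example"]
print(internal_links("https://example.org/start", links))
```

Comparing the parsed domain instead of searching for the domain as a substring avoids treating a link such as `https://twitter.com/share?url=example.org` as internal.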