<a href="https://colab.research.google.com/github/RazvanPorojan/scrapydo/blob/master/notebooks/scrapydo-clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ScrapyDo Overview

[ScrapyDo](https://github.com/darkrho/scrapydo) is a [crochet](https://github.com/itamarst/crochet)-based blocking API for [Scrapy](http://scrapy.org). It lets you use Scrapy as a library and is aimed mainly at prototyping spiders and exploring data in [IPython notebooks](http://ipython.org/notebook.html).

This notebook shows how to use `scrapydo` and how it helps you crawl and explore data quickly. Our main premise is that crawling the web is a means of analyzing data, not an end in itself.

## Initialization

Call `scrapydo.setup()` before calling any other `scrapydo` function.

In [0]:
!pip install scrapydo

In [0]:
import scrapydo
scrapydo.setup()

## The `fetch` function and highlight helper

The `fetch` function blocks until the download finishes and returns a `scrapy.http.Response` object for the given URL.

In [0]:
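from IPython.display import display, HTML

# Fetch the page and render the returned HTML inline in the notebook.
response = scrapydo.fetch("https://www.linkedin.com/in/cairo-cavalcante-53064738/")
display(HTML(response.text))
# print(response.text)  # uncomment to inspect the raw HTML source instead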

In [0]:
import scrapy
from loginform import fill_login_form

class MySpiderWithLogin(scrapy.Spider):
    name = 'my-spider'

    start_urls = [
        'https://www.linkedin.com/in/cairo-cavalcante-53064738/',
        'https://www.linkedin.com/in/marius-istrate-b8ab0566/',
    ]

    login_url = 'https://www.linkedin.com/uas/login'

    login_user = 'your-username'
    login_password = 'secret-password-here'

    def start_requests(self):
        # let's start by sending a first request to login page
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        # got the login page, let's fill the login form...
        data, url, method = fill_login_form(response.url, response.body,
                                            self.login_user, self.login_password)

        # ... and send a request with our login data
        return scrapy.FormRequest(url, formdata=dict(data),
                                  method=method, callback=self.start_crawl)

    def start_crawl(self, response):
        # OK, we're in, let's start crawling the protected pages
        for url in self.start_urls:
            yield scrapy.Request(url)

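    def parse(self, response):
        # do stuff with the logged in response; as a placeholder (an
        # assumption, not part of the original), yield the URL and page title
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }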
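
The `parse_login` step above relies on `fill_login_form` to turn the login page into a `(data, url, method)` triple. To make that step less opaque, here is a minimal standard-library sketch of the general idea: find the first form, keep its existing field values (such as hidden tokens), and fill in the username and password fields. This is not how the `loginform` package is implemented, which may pick forms and fields differently; the names `LoginFormParser` and `fill_first_form`, the sample HTML, and the `example.com` URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the action, method and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "GET"
        self.fields = []          # (name, value, type) for each named <input>
        self._in_form = False
        self._done = False        # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._in_form and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "GET").upper()
        elif tag == "input" and self._in_form and attrs.get("name"):
            kind = (attrs.get("type") or "text").lower()
            self.fields.append((attrs["name"], attrs.get("value") or "", kind))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def fill_first_form(page_url, html, username, password):
    """Return (data, url, method) for the first form found in `html`."""
    parser = LoginFormParser()
    parser.feed(html)
    data = []
    user_filled = False
    for name, value, kind in parser.fields:
        if kind == "password":
            value = password
        elif kind in ("text", "email") and not user_filled:
            # Assumption: the first text-like field is the username field.
            value = username
            user_filled = True
        data.append((name, value))
    # Resolve a relative form action against the URL of the login page.
    return data, urljoin(page_url, parser.action), parser.method


# Hypothetical login page used only to exercise the sketch.
sample_html = """
<form action="/login-submit" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="session_key">
  <input type="password" name="session_password">
  <input type="submit" value="Sign in">
</form>
"""

data, url, method = fill_first_form(
    "https://example.com/login", sample_html,
    "your-username", "secret-password-here",
)
print(data, url, method)
```

The resulting triple has the same shape the spider passes to `scrapy.FormRequest`: `formdata=dict(data)`, the resolved `url`, and the form's `method`. Hidden fields are kept because login forms commonly carry tokens that the server expects to receive back.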