# Scrapy (Cheatsheet, not actual Crawler code)

<code>$ scrapy startproject articleSpider</code>

### Help, I get an error like "AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'"

Try:
$ pip install cryptography==38.0.4

### Help! I get another error

Try using conda?

### The directory structure of the articleSpider project looks like this:

- scrapy.cfg
- articleSpider
  - \_\_init\_\_.py
  - middlewares.py
  - settings.py
  - items.py
  - pipelines.py
  - spiders
     - \_\_init\_\_.py

<code>$ cd articleSpider</code>

## Generate a basic spider

<code>$ scrapy genspider wikipedia wikipedia.org</code>

Now you have a wikipedia.py file inside your spiders directory! Let's fill out the parse function:

In [None]:
from bs4 import BeautifulSoup

# ... 
def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    print(f'TITLE IS: {soup.title}')

Run with 
<code>$ scrapy runspider articleSpider/spiders/wikipedia.py</code>
and rejoice

### Help! I get an error like "MemoryError: Cannot allocate write+execute memory for ffi.callback()" 

Try this? Sure, why not.

$ pip uninstall cffi

$ pip install --upgrade pip 

$ pip install cffi

### About the Scrapy Response object (slides)

For an HTML page, the `response` passed to `parse` is an instance of:

<class 'scrapy.http.response.html.HtmlResponse'>

### Scrapy Response Object Parsing

In [None]:

start_urls = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/Java_(programming_language)",
    "https://en.wikipedia.org/wiki/Monty_Python"
    ]

#...

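# CSS selector: text of the first matching node, or None if nothing matches
response.css('span.mw-page-title-main::text').extract_first()
# XPath: list of every text node under the span
response.xpath('//span[@class="mw-page-title-main"]//text()').extract()
# BeautifulSoup: parse the raw body, then read the span's text
soup = BeautifulSoup(response.body, 'html.parser')
print(soup.find('span', {'class': 'mw-page-title-main'}).text)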
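To try the title extraction without running a spider, here is a minimal sketch that uses only the standard library. `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for a Wikipedia page, and `TitleSpanParser` and `page_title` are illustrative names, not part of Scrapy or BeautifulSoup. It assumes the title span contains no nested spans.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down stand-in for a Wikipedia article page
SAMPLE_HTML = (
    "<html><head><title>Monty Python - Wikipedia</title></head><body>"
    '<h1><span class="mw-page-title-main">Monty Python</span></h1>'
    "<p>Body text</p></body></html>"
)


class TitleSpanParser(HTMLParser):
    """Collect the text inside <span class="mw-page-title-main">."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are inside the title span
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "mw-page-title-main" in classes:
            self.inside = True

    def handle_endtag(self, tag):
        # Assumes no nested spans: the first closing </span> ends the title
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def page_title(html):
    """Return the title text, or None when the span is missing."""
    parser = TitleSpanParser()
    parser.feed(html)
    return "".join(parser.chunks) or None


print(page_title(SAMPLE_HTML))
```

Like `extract_first()`, `page_title` returns `None` when the span is absent, so callers can check for a missing title instead of catching an exception.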


## Generate a crawler (slides)

<code>$ scrapy genspider wikipedia2 wikipedia.org -t crawl</code>

In [None]:
rules = [
        Rule(
            LinkExtractor(allow='(/wiki/)((?!:).)*$'),
            callback='parse',
            follow=True,
            cb_kwargs={'is_article': True}
        ),
        Rule(
            LinkExtractor(allow='.*'),
            callback='parse',
            cb_kwargs={'is_article': False}
        )
    ]
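The `allow` pattern in the first rule is easier to trust after trying it on a few URLs. The sketch below assumes `LinkExtractor` applies `allow` patterns as a regex search against the absolute URL; `SAMPLE_URLS` and `is_article_url` are illustrative names, not Scrapy API.

```python
import re

# Same pattern as the first Rule: "/wiki/" followed by no colon up to the end
ARTICLE_PATTERN = re.compile(r'(/wiki/)((?!:).)*$')

# Hypothetical sample links of the kinds found on a Wikipedia page
SAMPLE_URLS = [
    "https://en.wikipedia.org/wiki/Monty_Python",
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/Talk:Monty_Python",
    "https://en.wikipedia.org/wiki/File:Example.jpg",
    "https://en.wikipedia.org/w/index.php?title=Monty_Python&action=history",
]


def is_article_url(url):
    """True when the URL looks like a plain article page."""
    return ARTICLE_PATTERN.search(url) is not None


for url in SAMPLE_URLS:
    print(is_article_url(url), url)
```

The colon in `https:` does not matter because the negative lookahead only applies to what follows `/wiki/`. Namespaced pages such as `Talk:` and `File:` are rejected, which is why they fall through to the second, catch-all rule.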

In [None]:
def parse(self, response, is_article):
    if not is_article:
        print(f'Discarding this: {response.css("h1::text").extract()}')
    else:
        print(f'VALUABLE CONTENT: {response.css("span.mw-page-title-main::text").extract()}')


### Adding Items

items.py

In [None]:
import scrapy


class ArticlespiderItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

wikipedia2.py

In [None]:
    def parse(self, response, is_article):
        if not is_article:
            print(f'Discarding this: {response.css("h1::text").extract()}')
        else:
            article = ArticlespiderItem()
            title = response.css("span.mw-page-title-main::text").extract_first()
            article['title'] = title
            article['url'] = response.url
            return article

### Saving your data

$ scrapy runspider articleSpider/spiders/wikipedia2.py -o articles.csv

$ scrapy runspider articleSpider/spiders/wikipedia2.py -o articles.json

$ scrapy runspider articleSpider/spiders/wikipedia2.py -o articles.xml

Scrapy infers the output format from the file extension, so the `-t` flag (deprecated in recent Scrapy releases) is not needed.

Note: If you kill the spider before it flushes its output buffer, you're going to get an empty file. To make it stop cleanly after a fixed number of pages, append this setting to the command:
-s CLOSESPIDER_PAGECOUNT=100
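A quick way to check that an export is not empty is to read it back. This is a minimal sketch, assuming the JSON feed is a single array of objects with the `title` and `url` fields from `ArticlespiderItem`. `load_articles` is an illustrative helper, and the demo writes a stand-in file to a temporary directory rather than using real spider output.

```python
import json
import os
import tempfile


def load_articles(path):
    """Return the list of exported items, or [] if the file is empty."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    return json.loads(text) if text else []


# Stand-in data shaped like the items the spider returns (hypothetical values)
sample = [
    {"title": "Monty Python", "url": "https://en.wikipedia.org/wiki/Monty_Python"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "articles.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    articles = load_articles(path)

print(len(articles), articles[0]["title"])
```

Against a real run, point `load_articles` at `articles.json` in the directory where you ran the spider.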

## Make a cool project! (slides)

In [None]:
$ scrapy genspider wikipedia3 wikipedia.org -t crawl

In [None]:
    rules = [
        Rule(
            LinkExtractor(allow='(/wiki/)((?!:).)*$'),
            callback='parse',
            follow=True
        )
    ]

In [None]:
    # Needs `import json` and `import scrapy` at the top of wikipedia3.py.
    def parse(self, response):
        # Last path segment of the article URL, e.g. "Monty_Python"
        url_title = response.url.split('/')[-1]
        history_url = f'https://en.wikipedia.org/w/index.php?title={url_title}&action=history'
        yield scrapy.Request(
            history_url,
            cb_kwargs={'title': response.css("span.mw-page-title-main::text").extract(), 'language': 'en'},
            callback=self.parse_history,
            priority=1,
        )

    def parse_history(self, response, **kwargs):
        # Anonymous editors show up in the history as IP addresses
        for ip in response.css('.mw-anonuserlink bdi::text').extract():
            yield scrapy.Request(f'http://ip-api.com/json/{ip}', cb_kwargs=kwargs, callback=self.parse_ip, priority=2)

    def parse_ip(self, response, title=None, language=None):
        # The geolocation lookup comes back as JSON
        r = json.loads(response.body)
        print(r)
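The history URL construction can be checked on its own, outside the spider. Here is a minimal sketch; `history_url` is an illustrative helper name, and it reads the title from the URL path instead of splitting the raw URL, on the assumption that article links may carry a `#fragment` or query string.

```python
from urllib.parse import urlsplit


def history_url(article_url):
    """Build the revision-history URL for an English Wikipedia article URL."""
    # Use only the path so "?query" and "#fragment" parts are ignored
    url_title = urlsplit(article_url).path.rsplit("/", 1)[-1]
    return f"https://en.wikipedia.org/w/index.php?title={url_title}&action=history"


print(history_url("https://en.wikipedia.org/wiki/Monty_Python"))
print(history_url("https://en.wikipedia.org/wiki/Monty_Python#History"))
```

Both calls print the same history URL. Titles that contain a slash would need different handling, since only the last path segment is kept.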