## Scrapy in a jupyter notebook
#### What is Scrapy?
<img src = "scrapy_diagram.png" width = 700 align = "left"></img><br>
- <b>Engine: </b>Orchestration of the data flow between all components of the framework and triggering events when certain actions occur.
- <b>Spider: </b>What we define, how we tell scrapy which parts of the website to gather information from (structured data)
- <b>Scheduler: </b>handles concurrency, throughput, and other policies
- <b>Downloader: </b>gets the url and passes back to engine
- <b>Pipeline: </b> Where the data retrieved by the downloader and processed by the Spider goes. Can be processed further, validated, or stored (persistence).


Source: https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Show Python version
import platform
platform.python_version()

'3.8.3'

In [2]:
try:
    import scrapy
except:
    !pip install scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess


### imports

In [3]:
import json
import logging
import re
from datetime import datetime

### set up pipeline
This class creates a simple pipeline that writes all found items to a JSON file, where each line contains one JSON element.

In [4]:
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('cplresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

### Define Spider

In [7]:
class CplSpider(scrapy.Spider):
    name = "cpl"
    start_urls = [
        'https://cpl.org/board-agendas/'
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEED_FORMAT':'json',                                 # Used for pipeline 2
        'FEED_URI': 'cplresult.json'                        # Used for pipeline 2
    }
    
    def _parse_classification(self, title):
        if "committee" in title.lower():
            return "COMMITTEE"
        if "commission" in title.lower():
            return "COMMISSION"
        return "BOARD"
    
    def _parse_location(self, response):
        elems = response.css('.has-text-align-center::text').extract()
        if "abundance" in elems[0].lower() and "zoom" not in elems[0].lower():
            meeting_name = response.css('.has-text-align-center::text').extract()[0]+"Zoom"
        else:
            meeting_name = response.css('.has-text-align-center::text').extract()[0]
        if "zoom" not in meeting_name.lower():
            address = "17109 Lake Shore Blvd Cleveland, OH 44110"
        else:
            address = response.css('.has-text-align-center').css('a::attr(href)').extract()
            if len(address)<2:
                add = [x for x in response.css('.has-text-align-center::text').extract() if "meeting id" in x.lower()][0]
                _phone = add.split("(")[0]
                _meeting_id = add.split("(")[1]
                
                meeting_id = [x for x in _meeting_id if x.isnumeric()]
                zoom_link = "https://cpl.zoom.us/j/" + "".join(meeting_id)
                phone = f"tel:{''.join([x for x in _phone if x.isnumeric()])}"
                if phone not in address and len(phone)>len("tel:"):
                    address.append(phone) 
                if zoom_link not in address and len(zoom_link)>len("https://cpl.zoom.us/j/"):
                    address.append(zoom_link) 
        return {"name": name, "address": address}
    
    
    def _parse_links(self, response):
        result = {}
        for text, link in zip(response.css(".entry-content").xpath("ol").css("a::text").extract(), 
                              response.css(".entry-content").xpath("ol").css("a::attr(href)").extract()):
            result[text]=link
        return result
    
    
    def _parse_times(self, title, summary):
        """may not be accurate AM/PM"""
        date_match = re.search(r"[a-zA-Z]{3,10} \d{1,2},? \d{4}", title)
        time = re.findall(r"(\d{1,2}:\d{2})", summary)
        return datetime.strptime(" ".join([date_match.group(0), time]), "%B %d, %Y %I:%M")
    
    def _parse_meetings(self, response):
        title = response.css('.entry-title::text').get()
        times = response.css('.has-text-align-center::text').extract()[3]
        location = response.css('.has-text-align-center::text').extract()[0]+"Zoom."
        yield {"title": title.replace("Agenda","").strip(),
               #"time": str(self._parse_times(title, times)),
               "classification": self._parse_classification(title),
               "location": self._parse_location(response),
               "links": self._parse_links(response)
              }
    
    def parse(self, response):
        #A Response object represents an HTTP response, which is usually downloaded (by the Downloader) 
        # and fed to the Spiders for processing.
        for meeting_link in response.xpath('//h2[@class="entry-title"]//@href').getall():
            yield scrapy.Request(
                meeting_link,
                callback = self._parse_meetings,
                dont_filter = True
            )

### Start the crawler

In [8]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(CplSpider)
process.start()

2021-04-12 17:39:07 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-04-12 17:39:07 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 11:26:31) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform macOS-10.15.7-x86_64-i386-64bit
2021-04-12 17:39:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-04-12 17:39:07 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
  exporter = cls(crawler)



<Deferred at 0x7fb4be031dc0>

2021-04-12 17:39:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://cpl.org/board-agendas/january-21-2021-board-of-trustees-2021-organizational-meeting-agenda/> (referer: https://cpl.org/board-agendas/)
Traceback (most recent call last):
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Users/nmolivo/anaconda3/lib/python3.8/si

2021-04-12 17:39:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://cpl.org/board-agendas/april-13-2021-joint-finance-human-resources-committee-meeting-agenda/> (referer: https://cpl.org/board-agendas/)
Traceback (most recent call last):
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/Users/nmolivo/anaconda3/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Users/nmolivo/anaconda3/lib/python3.8/si