# Web Scraping and Dataset Creation

This notebook adapts an existing Scrapy project to run inside a Jupyter notebook. It installs the required packages, patches asyncio so that a crawl can start while Jupyter's own event loop is running, adds the project directory to the Python path, loads the project's settings, and runs the spider. Running the crawl from a notebook is useful for development, testing, and teaching, where a full IDE setup may not be available or preferred.






In [None]:
#Installs scrapy, scrapy-playwright, and playwright packages,
#which are essential for running Scrapy spiders and integrating Playwright for JavaScript-rendered page scraping
!pip install scrapy
!pip install scrapy-playwright
!pip install playwright
!playwright install

In [None]:
#Applies nest_asyncio to allow the asynchronous loop to run in the Jupyter environment,
#which is necessary because Jupyter already runs an event loop.
import nest_asyncio
nest_asyncio.apply()
from scrapy.crawler import CrawlerProcess
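
To see why the patch is needed, the restriction can be reproduced with the standard library alone: calling `asyncio.run()` while a loop is already running raises a `RuntimeError`. The sketch below is illustrative only; the coroutine names `inner` and `outer` are hypothetical and not part of the project. It is meant to be run as a plain Python script outside the notebook, since `outer` simulates the already-running loop that Jupyter provides.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # While this coroutine executes, an event loop is already running,
    # which mirrors the situation inside a Jupyter kernel.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second, nested loop is rejected by default
    except RuntimeError as exc:
        coro.close()  # close the unused coroutine to avoid a "never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted, which is what lets the crawl start from a notebook cell.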

In [None]:
#Imports modules required for setting up and running a Scrapy spider.
#This includes Scrapy's CrawlerProcess for running spiders, Settings for configuring the spider, and other utilities.

import sys
import os
from importlib import import_module
from scrapy.settings import Settings
import scrapy
from scrapy_playwright.page import PageMethod
from scrapy.http import HtmlResponse
from scrapy.crawler import CrawlerProcess


#Adds the project directory to Python's path so that modules from the Scrapy project can be imported.
#Without this step, Colab cannot find the project's package and settings.
project_path = os.path.abspath('/content/drive/MyDrive/Pwspider')
if project_path not in sys.path:
    sys.path.append(project_path)


# Import the project's settings module by its dotted path
settings_module_path = 'Pwspider.settings'

imported_settings = import_module(settings_module_path)


# Convert imported settings to a Scrapy Settings object
settings = Settings()
for setting_name in dir(imported_settings):
    if setting_name.isupper():
        setting_value = getattr(imported_settings, setting_name)
        settings.set(setting_name, setting_value)
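
The loop above copies every upper-case attribute of the settings module into the `Settings` object. The same filtering rule can be tried in isolation on a stand-in module. This is a minimal sketch using only the standard library; `fake_settings`, `collect_settings`, and the `ROBOTSTXT_OBEY` value are hypothetical and chosen for illustration (only the bot name `Pwspider` comes from this notebook).

```python
import types


def collect_settings(module):
    """Return a dict of the module's upper-case attributes (hypothetical helper)."""
    return {
        name: getattr(module, name)
        for name in dir(module)
        if name.isupper()
    }


# Stand-in for a project settings module; the values are illustrative.
fake_settings = types.ModuleType("fake_settings")
fake_settings.BOT_NAME = "Pwspider"
fake_settings.ROBOTSTXT_OBEY = True
fake_settings.helper = lambda: None  # lower-case name, so it is skipped

collected = collect_settings(fake_settings)
print(collected)
```

Scrapy's `Settings.setmodule()` applies the same upper-case rule to a module, so it may be a shorter alternative to the explicit loop.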


In [13]:
#Initializes and starts CrawlerProcess with the configured settings and the specified spider.
#This command begins the scraping process.

process = CrawlerProcess(settings=settings)
process.crawl('pwspidey')
process.start()
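
In the output that follows, most records appear twice (once in the form `DEBUG:scrapy-playwright:...` and once with a timestamp), which suggests that two log handlers are attached. Most of the volume comes from the `scrapy-playwright` logger at DEBUG level. One way to reduce the noise is to raise that logger's threshold with the standard `logging` module before starting the crawl. This is a sketch, not the notebook's own configuration; the `WARNING` threshold and the `NOISY_LOGGERS` name are assumptions.

```python
import logging

# Logger name taken from the log records; the WARNING threshold is an assumption.
NOISY_LOGGERS = ("scrapy-playwright",)

for name in NOISY_LOGGERS:
    logging.getLogger(name).setLevel(logging.WARNING)

playwright_logger = logging.getLogger("scrapy-playwright")
# DEBUG records from this logger are now dropped before reaching any handler.
print(playwright_logger.isEnabledFor(logging.DEBUG))
```

The spider's own `pwspidey` logger is left untouched, so its "Processing request" messages would still be shown.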

INFO:scrapy.utils.log:Scrapy 2.11.1 started (bot: Pwspider)
2024-02-16 02:44:11 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: Pwspider)
INFO:scrapy.utils.log:Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 23.10.0, Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.58+-x86_64-with-glibc2.35
2024-02-16 02:44:11 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 23.10.0, Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.58+-x86_64-with-glibc2.35
INFO:scrapy.addons:Enabled addons:
[]
2024-02-16 02:44:11 [scrapy.addons] INFO: Enabled addons:
[]
DEBUG:scrapy.utils.log:Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-16 02:44:11 [scrapy.utils.log] DEBUG: Using re

Found 100 links on page 1


2024-02-16 02:44:40 [pwspidey] DEBUG: Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-commodities-commodity-derivatives
DEBUG:scrapy-playwright:[Context=default] Response: <204 https://www.google-analytics.com/g/collect?v=2&tid=G-FD9VH0194T&gtm=45je42e0v9100071815za200&_p=1708051462164&gcd=13l3l3l3l1&npa=0&dma=0&cid=1028909078.1708051467&ul=en-us&sr=1280x720&pscdl=noapi&_eu=AEA&_s=2&sid=1708051466&sct=1&seg=0&dl=https%3A%2F%2Fwww.cfainstitute.org%2Fen%2Fmembership%2Fprofessional-development%2Frefresher-readings&dt=Refresher%20Readings&en=scroll&epn.percent_scrolled=90&_et=9033&tfd=19508>
2024-02-16 02:44:41 [scrapy-playwright] DEBUG: [Context=default] Response: <204 https://www.google-analytics.com/g/collect?v=2&tid=G-FD9VH0194T&gtm=45je42e0v9100071815za200&_p=1708051462164&gcd=13l3l3l3l1&npa=0&dma=0&cid=1028909078.1708051467&ul=en-us&sr=1280x720&pscdl=noapi&_eu=AEA&_s=2&sid=1708051466&sct=1&seg=0&dl=https%3A%2F%2Fwww

Checking for next page button  JSHandle@node
Next page button found. Clicking...


DEBUG:scrapy-playwright:[Context=default] New page created, page count is 4 (4 for all contexts)
2024-02-16 02:44:45 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 4 (4 for all contexts)
DEBUG:scrapy-playwright:[Context=default] New page created, page count is 5 (5 for all contexts)
2024-02-16 02:44:46 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 5 (5 for all contexts)
DEBUG:scrapy-playwright:[Context=default] Request: <GET https://www.cfainstitute.org/membership/professional-development/refresher-readings/time-series-analysis> (resource type: document)
2024-02-16 02:44:46 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.cfainstitute.org/membership/professional-development/refresher-readings/time-series-analysis> (resource type: document)
DEBUG:scrapy-playwright:[Context=default] Request: <GET https://www.cfainstitute.org/membership/professional-development/refresher-readings/credit-analysis-models> (re

Moved to page 2.


DEBUG:scrapy-playwright:[Context=default] Response: <200 https://www.cfainstitute.org/membership/professional-development/refresher-readings/pricing-and-valuation-of-forward-commitments>
2024-02-16 02:45:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.cfainstitute.org/membership/professional-development/refresher-readings/pricing-and-valuation-of-forward-commitments>
DEBUG:scrapy-playwright:[Context=default] Request: <GET https://uxpatterns.cfainstitute.org/globalbundles/styles/global.css?v=KdNELzb1y2vxWM8E2EL6QwKRgtgsmoCqgS-UIJMeWjM1> (resource type: stylesheet, referrer: https://www.cfainstitute.org/)
2024-02-16 02:45:29 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://uxpatterns.cfainstitute.org/globalbundles/styles/global.css?v=KdNELzb1y2vxWM8E2EL6QwKRgtgsmoCqgS-UIJMeWjM1> (resource type: stylesheet, referrer: https://www.cfainstitute.org/)
DEBUG:scrapy-playwright:[Context=default] Request: <GET https://www.cfainstitute.org/bundles/style

New response URL: https://www.cfainstitute.org/en/membership/professional-development/refresher-readings#first=100&sort=%40refreadingcurriculumyear%20descending&numberOfResults=100
Found 100 links on the new page.


DEBUG:scrapy-playwright:[Context=default] Response: <200 https://static.cloudflareinsights.com/beacon.min.js/v84a3a4012de94ce1a686ba8c167c359c1696973893317>
2024-02-16 02:45:41 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://static.cloudflareinsights.com/beacon.min.js/v84a3a4012de94ce1a686ba8c167c359c1696973893317>
DEBUG:scrapy-playwright:[Context=default] Response: <200 https://uxpatterns.cfainstitute.org/globalbundles/styles/global.css?v=KdNELzb1y2vxWM8E2EL6QwKRgtgsmoCqgS-UIJMeWjM1>
2024-02-16 02:45:41 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://uxpatterns.cfainstitute.org/globalbundles/styles/global.css?v=KdNELzb1y2vxWM8E2EL6QwKRgtgsmoCqgS-UIJMeWjM1>
DEBUG:pwspidey:Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/industry-company-analysis
2024-02-16 02:45:41 [pwspidey] DEBUG: Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/industr

Checking for next page button  JSHandle@node
Next page button found. Clicking...


2024-02-16 02:46:10 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://static.cloud.coveo.com/searchui/v2.10089/css/CoveoFullSearch.css> (resource type: stylesheet, referrer: https://www.cfainstitute.org/)
DEBUG:scrapy-playwright:[Context=default] Request: <GET https://static.cloud.coveo.com/coveoforsitecore/ui/v0.55.8/css/CoveoForSitecore.css> (resource type: stylesheet, referrer: https://www.cfainstitute.org/)
2024-02-16 02:46:10 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://static.cloud.coveo.com/coveoforsitecore/ui/v0.55.8/css/CoveoForSitecore.css> (resource type: stylesheet, referrer: https://www.cfainstitute.org/)
DEBUG:scrapy-playwright:[Context=default] Request: <GET https://static.cloud.coveo.com/searchui/v2.10089/js/CoveoJsSearch.Lazy.min.js> (resource type: script, referrer: https://www.cfainstitute.org/)
2024-02-16 02:46:10 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://static.cloud.coveo.com/searchui/v2.10089/js/Cove

Moved to page 3.


DEBUG:scrapy-playwright:[Context=default] Request: <GET https://fonts.googleapis.com/css?family=Lato:300,400,700> (resource type: stylesheet, referrer: https://static.cloud.coveo.com/)
2024-02-16 02:47:57 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://fonts.googleapis.com/css?family=Lato:300,400,700> (resource type: stylesheet, referrer: https://static.cloud.coveo.com/)
DEBUG:scrapy-playwright:[Context=default] Response: <200 https://static.cloud.coveo.com/coveoforsitecore/ui/v0.55.8/css/CoveoForSitecore.css>
2024-02-16 02:47:57 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://static.cloud.coveo.com/coveoforsitecore/ui/v0.55.8/css/CoveoForSitecore.css>
DEBUG:scrapy-playwright:[Context=default] Response: <200 https://static.cloud.coveo.com/searchui/v2.10089/js/CoveoJsSearch.Lazy.min.js>
2024-02-16 02:47:57 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://static.cloud.coveo.com/searchui/v2.10089/js/CoveoJsSearch.Lazy.min.js>
DEBUG

New response URL: https://www.cfainstitute.org/en/membership/professional-development/refresher-readings#first=200&sort=%40refreadingcurriculumyear%20descending&numberOfResults=100
Found 24 links on the new page.


DEBUG:pwspidey:Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2018/multinational-operations
2024-02-16 02:48:24 [pwspidey] DEBUG: Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2018/multinational-operations
DEBUG:pwspidey:Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Natural-Resources
2024-02-16 02:48:24 [pwspidey] DEBUG: Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Natural-Resources
DEBUG:pwspidey:Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Real-Estate-and-Infrastructure
2024-02-16 02:48:24 [pwspidey] DEBUG: Processing request: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Real-Estate-and-Infrastructure
DEBUG:pwspidey:Processing request: https://www.cfainstitute

Checking for next page button  None
No next page button found. Ending pagination 2.


Streaming output truncated to the last 5000 lines.
    return (yield download_func(request=request, spider=spider))
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/defer.py", line 1248, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.10/dist-packages/scrapy_playwright/handler.py", line 314, in _download_request
    page = await self._create_page(request=request, spider=spider)
  File "/usr/local/lib/python3.10/dist-packages/scrapy_playwright/handler.py", line 236, in _create_page
    ctx_wrapper = await self._create_browser_context(
  File "/usr/local/lib/python3.10/dist-packages/scrapy_playwright/handler.py", line 196, in _create_browser_context
    context = await self.browser.new_con