#**Inheriting the Spider**
When learning about Scrapy spiders, we saw that the part of the code we mainly adjust is the spider class. To build some familiarity with that class, you will complete a short piece of code that forms a toy model of a spider class. We've omitted the code that would actually run the spider and included only the pieces necessary to create the class.

As mentioned in the lesson, a class is roughly a collection of related variables and functions housed together. Often one class needs to use methods that already exist in another class, so it inherits them. That's what we do in the spider class: it inherits its methods from scrapy.Spider.
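To see inheritance in isolation, here is a minimal sketch in plain Python. The names ToyParent, ToyChild and describe are made up for illustration; ToyParent only plays the role that scrapy.Spider plays in the exercise.

```python
# A made-up parent class standing in for scrapy.Spider
class ToyParent:
    def describe(self):
        # Uses self.name, which each child class is expected to define
        return "I am a spider called " + self.name


# The child class lists the parent in parentheses to inherit from it
class ToyChild(ToyParent):
    name = "toy_child"


child = ToyChild()
# describe was never written in ToyChild; it comes from ToyParent
message = child.describe()
print(message)
print("describe" in dir(child))
```

The inspect_class helper in this exercise relies on the same idea: it looks in dir() for a method that only the parent class defines.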

We wrote the function inspect_class to look at your class once you're done, in case you'd like to test your solution!

In [3]:
def inspect_class(c):
  newc = c()
  meths = dir(newc)
  if 'name' in meths:
    print("Your spider class name is:", newc.name)
  if 'from_crawler' in meths:
    print("It seems you have inherited methods from scrapy.Spider -- NICE!")
  else:
    print("Oh no! It doesn't seem that you are inheriting the methods from scrapy.Spider!!")

In [4]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider(scrapy.Spider):
  name = "your_spider"
  # start_requests method
  def start_requests(self):
    pass
  # parse method
  def parse(self, response):
    pass
  
# Inspect Your Class
inspect_class(YourSpider)

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!


#**Hurl the URLs**
In the next lesson we will talk about the start_requests method within the spider class. In this quick exercise, we ask you to change a variable within the start_requests method, which foreshadows some of what we will be learning in the next lesson. Basically, we want you to start becoming comfortable turning some of the wheels within a spider class; in this case, by making a list of urls within the start_requests method.

We've written a function inspect_class which will print out the list of elements you have in the urls variable within the start_requests method.

Note: in the next several exercises, you will write code to complete your spider class, but the code does not yet include the pieces to actually run the spider; that will come at the end.
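If the yield keyword is new to you, the following sketch in plain Python may help. The function toy_start_requests is a hypothetical stand-in for the method in your class; it is not part of scrapy.

```python
def toy_start_requests():
    # Hypothetical stand-in for a spider's start_requests method
    urls = ["https://www.datacamp.com", "https://scrapy.org"]
    for url in urls:
        # yield hands back one url at a time and pauses until the next is requested
        yield url


gen = toy_start_requests()   # nothing inside the function has run yet
first = next(gen)            # runs up to the first yield
remaining = list(gen)        # collects whatever is left
print(first)
print(remaining)
```

This is why inspect_class can simply loop over the result of start_requests: calling the method returns a generator, and the loop pulls the urls out of it one by one.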

In [5]:
def inspect_class( c ):
  newc = c()
  meths = dir( newc )
  if 'start_requests' in meths:
    print( "The start_requests method yields the following urls:" )
    for u in newc.start_requests():
      print(  "\t-", u )

In [6]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    urls = ["https://www.datacamp.com", "https://scrapy.org"]
    for url in urls:
      yield url
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

The start_requests method yields the following urls:
	- https://www.datacamp.com
	- https://scrapy.org


#**Self Referencing is Classy**
You have probably noticed that within the spider class, we always include the argument self in the start_requests and parse methods (just look at the sample code in this exercise!). This lets the methods within the class refer to one another. That is, if we want to refer to the method parse within the start_requests method, we need to write self.parse rather than just parse; what writing self does is tell the code: "Look in the same class as start_requests for a method called parse to use."

In this exercise you will get a chance to play with this "self referencing".
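Here is a small sketch of what goes wrong without self, using a made-up class called Greeter (nothing here comes from scrapy).

```python
class Greeter:
    def shout(self, msg):
        return msg.upper() + "!"

    def with_self(self):
        # self.shout is looked up on this instance and its class
        return self.shout("hello")

    def without_self(self):
        # A bare name is looked up as a local, global or builtin, not on the class
        return shout("hello")


g = Greeter()
good = g.with_self()
try:
    g.without_self()
    error_name = None
except NameError as err:
    error_name = type(err).__name__

print(good)
print(error_name)
```

The second call fails because Python has no variable named shout outside the class; only self.shout tells it where to look.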

In [7]:
def inspect_class( c ):
  newc = c()
  try:
    newc.start_requests()
  except Exception:
    print( "Oh No! Something is wrong with the code! Keep trying." )

In [8]:
# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    self.print_msg( "Hello World!" )
  # parse method
  def parse( self, response ):
    pass
  # print_msg method
  def print_msg( self, msg ):
    print( "Calling start_requests in YourSpider prints out:", msg )
  
# Inspect Your Class
inspect_class( YourSpider )

Calling start_requests in YourSpider prints out: Hello World!


#**Starting with Start Requests**
In the last lesson we learned about setting up the start_requests method within a scrapy spider. Here we have another toy-model spider which doesn't actually scrape anything, but gives you a chance to play with the start_requests method. What we want is for you to start becoming familiar with the arguments you pass into the scrapy.Request call within start_requests.

As before, we have created the function inspect_class to examine what you are yielding in start_requests.
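The two arguments can be pictured with a stripped-down stand-in. ToyRequest and ToySpider below are hypothetical and only store what they are given; the real scrapy.Request does much more, so treat this as a sketch of the idea rather than of the library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRequest:
    # Hypothetical stand-in for scrapy.Request; it only stores its two arguments
    url: str
    callback: Callable


class ToySpider:
    def start_requests(self):
        # Pass the method itself (self.parse), not the result of calling it (self.parse())
        yield ToyRequest(url="https://www.datacamp.com", callback=self.parse)

    def parse(self, response):
        return "parsed " + response


spider = ToySpider()
request = next(spider.start_requests())
print(request.url)
print(request.callback.__name__)
# Whoever receives the request can call the stored method later
print(request.callback("a fake response"))
```

Reading the url and the callback name off the first yielded object is the same check that inspect_class performs in this exercise.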

In [9]:
def inspect_class( c ):
  newc = c()
  try:
    y = list( newc.start_requests() )
    first_yield = y[0]
    print( "The url you would scrape is:", first_yield.url )
    cb = first_yield.callback
    print( "The name of the callback method you called is:", cb.__name__ )
  except Exception:
    print( "Oh No! Something is wrong with the code! Keep trying." )

In [10]:
# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = "https://www.datacamp.com", callback = self.parse )
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

The url you would scrape is: https://www.datacamp.com
The name of the callback method you called is: parse


#**Pen Names**
In this exercise, we have set up a spider class which, when finished, will retrieve the author names from a shortened version of the DataCamp course directory. The URL for the shortened version is stored in the variable url_short. Your job will be to create the list of extracted author names in the parse method of the spider.

Two things you should know:

* You will be using the response object and the css method here.
* The course author names are defined by the text within the paragraph p elements belonging to the class course-block__author-name.
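To picture what "the text within the p elements belonging to a class" means, here is a sketch that uses only the standard library instead of the response object. The markup is made up to imitate two course blocks and is not the real page. The path below matches the class attribute exactly, which coincides with a CSS class match only because each element here has a single class.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup imitating two course blocks (not the real page)
sample = """
<div>
  <div class="course-block">
    <h4 class="course-block__title">Course A</h4>
    <p class="course-block__author-name">Author One</p>
  </div>
  <div class="course-block">
    <h4 class="course-block__title">Course B</h4>
    <p class="course-block__author-name">Author Two</p>
  </div>
</div>
"""

root = ET.fromstring(sample)
# Every p element, at any depth, whose class attribute equals the given value
matches = root.findall(".//p[@class='course-block__author-name']")
author_names = [p.text for p in matches]
print(author_names)
```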

In [3]:
# Import scrapy library
import scrapy

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'


# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  # parse method
  def parse( self, response):
    # Create an extracted list of course author names
    author_names = response.css('p.course-block__author-name ::text').extract()
    # Here we will just return the list of Authors
    return author_names

#**Crawler Time**
This will be your first chance to play with a spider which will crawl between sites (by first collecting links from one site, and following those links to parse new sites). This spider starts at the shortened DataCamp course directory, then extracts the links of the courses in the parse method; from there, it will follow those links to extract the course descriptions from each course page in the parse_descr method, and put these descriptions into the list course_descrs. Your job is to complete the code so that the spider runs as desired!
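The two-stage flow described above can be imitated without any network access. Everything in this sketch is made up: PAGES is an in-memory stand-in for the site, and the small loop at the bottom is a hypothetical scheduler (Scrapy's own engine works asynchronously and is far more involved).

```python
from collections import deque

# Made-up in-memory "site": a directory page listing links, plus two course pages
PAGES = {
    "/all": {"links": ["/course-a", "/course-b"]},
    "/course-a": {"descr": "Description of course A"},
    "/course-b": {"descr": "Description of course B"},
}

course_descrs = []


def parse(page):
    # First stage: schedule one follow-up per link, to be handled by parse_descr
    for link in page["links"]:
        yield (link, parse_descr)


def parse_descr(page):
    # Second stage: collect the description and schedule nothing further
    course_descrs.append(page["descr"])
    return []


# A tiny scheduler loop (hypothetical; Scrapy's engine does this asynchronously)
queue = deque([("/all", parse)])
while queue:
    url, callback = queue.popleft()
    for follow_up in callback(PAGES[url]):
        queue.append(follow_up)

print(course_descrs)
```

The point to notice is that parse never calls parse_descr itself; it only says which method should handle each link, and something else does the calling.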

In [None]:
# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow( url = link, callback = self.parse_descr )
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description
    yield course_descr

#**Time to Run**
In the last lesson, we went through creating an entire web-crawler to access course information from each course in the DataCamp course directory. However, the lesson seemed to stop without a climax, because we didn't play with the code after finishing the parsing methods.

The point of this exercise is to remedy that!

The code we give you to look at in this and the next exercise is long, because it's the entire spider that took us the whole lesson to create! However, don't be intimidated! The point of these two exercises is to give you a very easy task to complete, with the hope that you will look at and run the code for this spider. That way, even though it is long, you will have a grasp of it!

In [1]:
def previewCourses( dc_dict, n = 3 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    for i,ct in enumerate(dc_dict[t]):
      print("\tChapter %d: %s" % (i+1,ct) )
    print("")

In [2]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

2022-04-28 19:34:53 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-28 19:34:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-04-28 19:34:53 [scrapy.crawler] INFO: Overridden settings:
{}
2022-04-28 19:34:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-04-28 19:34:53 [scrapy.extensions.telnet] INFO: Telnet Password: 2ec7baeb457b69a0
2022-04-28 19:34:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-04-28 19:34:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.download

In [3]:
previewCourses(dc_dict)

A preview of DataCamp Courses:
---------------------------------------

TITLE: Intermediate R
	Chapter 1: Conditionals and Control Flow
	Chapter 2: Functions
	Chapter 3: Utilities
	Chapter 4: Loops
	Chapter 5: The apply family
	Chapter 6: Conditionals and Control Flow
	Chapter 7: Loops
	Chapter 8: Functions
	Chapter 9: The apply family
	Chapter 10: Utilities

TITLE: Data Analysis in R, the data.table Way
	Chapter 1: Data.table novice
	Chapter 2: Data.table yeoman
	Chapter 3: Data.table expert

TITLE: Reporting with R Markdown
	Chapter 1: Authoring R Markdown Reports
	Chapter 2: Embedding Code
	Chapter 3: Compiling Reports
	Chapter 4: Configuring R Markdown (optional)



#**DataCamp Descriptions**
Like the previous exercise, the code here is long since you are working with an entire web-crawling spider! But again, don't let the amount of code intimidate you. You have a handle on how spiders work now, and you are perfectly capable of completing the easy task we have for you here!

As in the previous exercise, we have created a function previewCourses which lets you preview the output of the spider, but you can always just explore the dictionary dc_dict too after you run the code.

In this exercise, you are asked to create a CSS Locator string that points directly to the text of the course description. All you need to know is that on the course page, the course description text is within a paragraph p element which belongs to the class course__description (two underscores).
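One detail of the spider code is worth a quick look: dc_dict is created outside the class, and after the class definition, yet the parsing method can still write to it. A minimal sketch of why this works, with made-up names ToyCollector and results:

```python
class ToyCollector:
    def store(self, title, descr):
        # results is not defined yet when this class is created; the name
        # is only looked up when store() actually runs
        results[title] = descr


# Created outside the class, after the class definition but before any call
results = dict()

ToyCollector().store("Course A", "A short description")
print(results)
```

The only requirement is that the dictionary exists by the time the method is called, which in the exercises means before the spider is run.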

In [11]:
def previewCourses( dc_dict, n = 1 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    print("\tDescription: %s" % dc_dict[t] )
    print("")

In [12]:
# Import scrapy
import scrapy
import sys     

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    # Create a SelectorList of the course titles text
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    # Extract the text and strip it clean
    crs_title_ext = crs_title.extract_first().strip()
    # Create a SelectorList of course descriptions text
    crs_descr = response.css( 'p.course__description ::text' )
    # Extract the text and strip it clean
    crs_descr_ext = crs_descr.extract_first().strip()
    # Fill in the dictionary
    dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
if "twisted.internet.reactor" in sys.modules: 
  del sys.modules["twisted.internet.reactor"]

process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

2022-04-28 19:44:14 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-28 19:44:14 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-04-28 19:44:14 [scrapy.crawler] INFO: Overridden settings:
{}
2022-04-28 19:44:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-04-28 19:44:14 [scrapy.extensions.telnet] INFO: Telnet Password: 189b4a6e610689e4
2022-04-28 19:44:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-04-28 19:44:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.download

In [13]:
# Print a preview of courses
previewCourses(dc_dict)

A preview of DataCamp Courses:
---------------------------------------

TITLE: Intro to Python for Data Science
	Description: Python is a general-purpose programming language that is becoming more and more popular for doing data science. Companies worldwide are using Python to harvest insights from their data and get a competitive edge. Unlike any other Python tutorial, this course focuses on Python specifically for data science. In our Intro to Python class, you will learn about powerful ways to store and manipulate data as well as cool data science tools to start your own analyses. Enter DataCamp’s online Python curriculum.



#**Capstone Crawler**
This exercise gives you a chance to show off what you've learned! In this exercise, you will write the parse function for a spider and then fill in a few blanks to finish off the spider. On the course directory page of DataCamp, each listed course has a title and a short course description. This spider will be used to scrape the course directory to extract the course titles and short course descriptions. You will not need to follow any links this time. Everything you need to know is:

* The course titles are defined by the text within an h4 element whose class contains the string block__title (double underscore).
* The short course descriptions are defined by the text within a paragraph p element whose class contains the string block__description (double underscore).
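Once you have one list of titles and one list of descriptions, they still need to be paired up. The sketch below shows one way to do that with zip, assuming both lists come back in the same order and with the same length. The sample strings are made up, shaped like scraped text with the surrounding whitespace left in on purpose.

```python
# Sample values shaped like the scraped text (whitespace included on purpose)
crs_titles = ["Introduction to R", "Data Manipulation in R with dplyr"]
crs_descrs = [
    "\n          Master the basics of data analysis.\n        ",
    "\n          Master techniques for data manipulation.\n        ",
]

# zip pairs the first title with the first description, and so on;
# strip removes the leading and trailing whitespace
courses = {title: descr.strip() for title, descr in zip(crs_titles, crs_descrs)}
print(courses["Introduction to R"])
```

Stripping is optional, but without it the stored descriptions keep the line breaks and indentation of the page source.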

In [20]:
def previewCourses( dc_dict, n = 3 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    print( "\tDESCRIPTION: %s" % dc_dict[t] )
    print("")

In [21]:
# Import scrapy
import scrapy
# sys is needed for the reactor check before running the spider
import sys

# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class YourSpider( scrapy.Spider ):
  name = 'yourspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request(url = url_short, callback = self.parse)
      
  def parse(self, response):
    # My version of the parser you wrote in the previous part
    crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
    crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
    for crs_title, crs_descr in zip( crs_titles, crs_descrs ):
      dc_dict[crs_title] = crs_descr
    
# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
if "twisted.internet.reactor" in sys.modules: 
  del sys.modules["twisted.internet.reactor"]
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

2022-04-28 19:47:40 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-28 19:47:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 37.0.1, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-04-28 19:47:40 [scrapy.crawler] INFO: Overridden settings:
{}
2022-04-28 19:47:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-04-28 19:47:40 [scrapy.extensions.telnet] INFO: Telnet Password: eca1afbad0829fd9
2022-04-28 19:47:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-04-28 19:47:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.download

In [22]:
# Print a preview of courses
previewCourses(dc_dict)

A preview of DataCamp Courses:
---------------------------------------

TITLE: Introduction to R
	DESCRIPTION: 
          Master the basics of data analysis by manipulating common data structures such as vectors, matrices and data frames.
        

TITLE: Data Analysis in R, the data.table Way
	DESCRIPTION: 
          Master core concepts in data manipulation such as subsetting, updating, indexing and joining your data using data.table.
        

TITLE: Data Manipulation in R with dplyr
	DESCRIPTION: 
          Master techniques for data manipulation using the select, mutate, filter, arrange, and summarise functions in dplyr.
        

