# Web Scraping in Python

## Keep it Classy
In this two-part exercise, you will have a chance to show off what you've learned about attributes; in this case, we focus on the class attribute.
Fill in the blank in the HTML code string html to assign a class attribute to the second div element which has the value "you-are-classy".

In [None]:
# HTML code string
html = '''
<html>
  <body>
    <div class="class1" id="div1">
      <p class="class2">Visit DataCamp!</p>
    </div>
    <div class="you-are-classy">
      <p class="class2">Keep up the good work!</p>
    </div>
  </body>
</html>
'''
# Print out the class of the second div element
whats_my_class( html )

## Where it's @
In this exercise, you'll begin to write an XPath string using attributes to achieve a certain task; that task is to select the paragraph element containing the text "Thanks for Watching!". We've already created most of the XPath string for you.

Consider the following HTML:

<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

We have created the function print_element_text() for you, which will print any text contained in your element.

In [None]:
# Create an Xpath string to select desired p element
xpath = '//*[@id="div3"]/p'

# Print out selection text
print_element_text( xpath )

## Check your Class
This exercise is to emphasize that when you use an XPath to select an element by its class attribute without using the contains() function, you match the class exactly. Your job is to fill in the blank below and finish the variable xpath directing to the specified element.

Consider the following HTML:

<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

In [None]:
# Create an XPath string to select p element by class
xpath = '//p[@class="class-1 class-2"]'

# Print out select text
print_element_text( xpath )
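
The exact-match behaviour described above can be checked outside the exercise environment. The following is a minimal sketch, assuming the standard library's `xml.etree.ElementTree` as a stand-in for a scrapy Selector (it supports only a small XPath subset, and it works here only because the snippet is well-formed). The helper `texts` is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

# The HTML from this exercise; it is well-formed, so the XML parser accepts it.
html = """
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>
"""
root = ET.fromstring(html)


def texts(path):
    """Return the text of every element matched by an ElementTree path (hypothetical helper)."""
    return [el.text for el in root.findall(path)]


# Exact match: the whole attribute string must equal the quoted value.
exact_both = texts('.//p[@class="class-1 class-2"]')
exact_one = texts('.//p[@class="class-1"]')
exact_two = texts('.//p[@class="class-2"]')

# A looser match written in plain Python: "class-2" is one of the listed classes.
# (XPath's contains() is a substring test, so it is similar but not identical.)
has_class_2 = [p.text for p in root.iter("p") if "class-2" in p.get("class", "").split()]

print(exact_both)   # ['Hello World!']
print(exact_one)    # []
print(exact_two)    # ['Choose DataCamp!', 'Thanks for Watching!']
print(has_class_2)  # ['Hello World!', 'Choose DataCamp!', 'Thanks for Watching!']
```

Note that `@class="class-1"` selects nothing, even though the first paragraph belongs to class-1, because its full attribute value is "class-1 class-2".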

## Hyper(link) Active
One of the most important attributes to extract for "web-crawling" is the hyperlink url (href attribute) within an a tag. Here, you will extract such a hyperlink! We have created the function print_attribute to print out the data extracted from your XPath, so you can test your XPath strings in the console, if you like.

The exercise refers to the following HTML source code:

<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose 
            <a href="http://datacamp.com">DataCamp!</a>!
        </p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

In [None]:
# Create an xpath to the href attribute
xpath = '//p[@id="p2"]/a/@href'

# Print out the selection(s); there should be only one
print_attribute( xpath )

## Secret Links
We have loaded the HTML from a secret website and have used it to create the functions how_many_elements() and preview(). The function how_many_elements() allows you to pass in an XPath string and it will print out the number of elements the XPath you wrote has selected. The function preview() allows you to pass in an XPath string and it will print out the first few elements you've selected.

Your job in this exercise is to create an XPath which directs to all href attribute values of the hyperlink a elements whose class attributes contain the string "package-snippet". If you do it correctly, you should find that you have selected 10 elements with your XPath string and that it previews links.

In [None]:
# Create an xpath to the href attributes
xpath = '//a[contains(@class,"package-snippet")]/@href'

# Print out how many elements are selected
how_many_elements( xpath )
# Preview the selected elements
preview( xpath )

## Divvy Up This Exercise
We have pre-loaded an HTML document into the string variable html. In this two-part problem, you will use html to set up a Selector object and create a SelectorList which selects all div elements; then, you will check your understanding of what happens within the SelectorList.

In [None]:
from scrapy import Selector

# Create a Selector selecting html as the HTML document
sel = Selector( text=html )

# Create a SelectorList of all div elements in the HTML document
divs = sel.xpath( '//div' )

## Requesting a Selector
We have pre-loaded the URL for a particular website in the string variable url and used the requests library to put the content from the website into the string variable html. Your task is to create a Selector object sel using the HTML source code stored in html.

In [None]:
# Import a scrapy Selector
from scrapy import Selector

# Import requests
import requests

# Create the string html containing the HTML source
# (.text is the decoded str that Selector expects; .content would be bytes)
html = requests.get( url ).text

# Create the Selector object sel from html
sel = Selector( text=html )

# Print out the number of elements in the HTML document
print( "There are 1020 elements in the HTML document.")
print( "You have found: ", len( sel.xpath('//*') ) )

## The (X)Path to CSS Locators
Many people prefer using CSS Locator notation to XPath notation. As we will see later, it often makes attribute selection very easy. To help get you more comfortable going back and forth between XPath and CSS Locator strings, we give you a chance in this exercise to do some direct "translation" between the two.

Note that the exercises in this chapter may take some time to load.

In [None]:
# Create the XPath string equivalent to the CSS Locator 
xpath = '/html/body/span[1]//a'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'html>body>span:nth-of-type(1) a'

# Create the XPath string equivalent to the CSS Locator 
xpath = '//div[@id="uid"]/span//h4'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'div#uid > span h4'

## Get an "a" in this Course
We have loaded the HTML from a secret website which you will use to set up a Selector object and the function how_many_elements(). When passing this function a CSS Locator string, it will print out the number of elements that the CSS Locator you wrote has selected.

In the second part of this problem, we want you to create a CSS Locator string which will select a certain collection of elements as described here: Select the hyperlink (a element) children of all div elements belonging to the class "course-block" (that is, any div element with a class attribute such that "course-block" is one of the classes assigned). The number of such elements is 11, so you can check your solution with how_many_elements if you choose.

In [None]:
from scrapy import Selector

# Create a selector from the html (of a secret website)
sel = Selector( text = html )

# Fill in the blank
css_locator = 'div.course-block>a'

# Print the number of selected elements.
how_many_elements( css_locator )

## You've been `href`ed
In a previous exercise, you created a CSS Locator string to select the hyperlink (a element) children of all div elements belonging to the class "course-block". Here we have created a SelectorList called course_as having selected those hyperlink children.

Now, we want you to fill in the blank below to extract the href attribute values from these elements. This is another example of chaining, as we've seen in a previous exercise.

The point here is that we can chain together calls to the methods css and xpath, and combine them! We help nudge you in the correct direction by giving you the solution if we chain with another call to the css method.

In [None]:
from scrapy import Selector

# Create a selector object from a secret website
sel = Selector( text=html )

# Select all hyperlinks of div elements belonging to class "course-block"
course_as = sel.css( 'div.course-block > a' )

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css( '::attr(href)' )

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath('./@href').extract()

## Top Level Text
This exercise will have you write an XPath string and a CSS Locator string to direct to the text of a specific paragraph p element. The p element in the HTML is uniquely defined by its id attribute, which is "p3". With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable html with a string containing the HTML in which this element belongs, if you want to peruse it.

In this exercise, you will only be selecting the text within the element, which does not include the text in future generations of the element. We have created a function print_results for you to compare which elements your strings direct to.

In [None]:
# Create an XPath string to the desired text.
xpath =  "//p[@id='p3']/text()"

# Create a CSS Locator string to the desired text.
css_locator = "p#p3::text"

# Print the text from our selections
print_results( xpath, css_locator )

## All Level Text
This exercise is similar to the previous, but differs in that you will be selecting text from multiple generations of a given element.

You will write an XPath string and a CSS Locator string to direct to the text of a specific paragraph p element. The p element in the HTML is uniquely defined by its id attribute, which is "p3". With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable html with a string containing the HTML in which this element belongs, if you want to peruse it.

In this exercise, you will be selecting the text within the element, including all text within its future generations. We have created a function print_results for you to compare which elements your strings direct to.

In [None]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]//text()'


# Create a CSS Locator string to the desired text.
css_locator = 'p#p3 ::text'

# Print the text from our selections
print_results( xpath, css_locator )
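
The difference between top-level text (`/text()`) and all-level text (`//text()`) can be sketched with the standard library. This is an illustration only: the snippet below is hypothetical, since the exercise's real HTML is not shown here, and `xml.etree.ElementTree` is used as a stand-in for a scrapy Selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph; the exercise's actual p3 element is not reproduced here.
snippet = '<p id="p3">Hello <b>bold</b> and <a href="#">linked</a> world</p>'
p = ET.fromstring(snippet)

# Analogue of //p[@id="p3"]/text(): only text nodes that are direct children of p.
# ElementTree stores these as p.text plus the "tail" following each child element.
direct = [p.text or ""] + [child.tail or "" for child in p]

# Analogue of //p[@id="p3"]//text(): text nodes at every depth, in document order.
all_levels = list(p.itertext())

print(direct)      # ['Hello ', ' and ', ' world']
print(all_levels)  # ['Hello ', 'bold', ' and ', 'linked', ' world']
```

The words "bold" and "linked" live inside child elements, so they appear only in the all-level selection.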


## Reveal By Response
We have pre-loaded a Response object, named response, with the content from a secret website. Your job is to figure out the URL and the title of the website using the response variable. You learned how to find the URL in the last lesson. To find the website title, what you need to know is:

- The title is the text from the title element.
- The title element is a child of the head element, which is a child of the html root element.
- To note: the html root element only has one child head element, and the head element only has one child title element.

In [None]:
# Get the URL to the website loaded in response
this_url = response.url

# Get the title of the website loaded in response
this_title = response.xpath('/html/head/title/text()').extract_first()

# Print out our findings
print_url_title( this_url, this_title )

## Responding with Selectors
Something that we should emphasize at this point about the relationship between Selector and Response objects is that both return a SelectorList when using the xpath or css methods to direct to elements. In this exercise, we'll prove it to you, by having you find all hyperlink elements belonging to the class course-block__link (notice the double underscore!) and looking at the object that is produced when doing so.

Recall that to find an element by class, you can use a period (.). For example, div.class-2 selects all div elements belonging to class-2.

We have pre-loaded both a Response object named response and a Selector object named sel with the content from the same "secret" website. Once you complete the task of creating a CSS Locator, you will compare the output from response.css and sel.css to see that they are effectively the same!

In [None]:
# Create a CSS Locator string to the desired hyperlink elements
css_locator = 'a.course-block__link'

# Select the hyperlink elements from response and sel
response_as = response.css(css_locator)
sel_as = sel.css(css_locator)

# Examine similarity
nr = len( response_as )
ns = len( sel_as )
for i in range( min(nr, ns, 2) ):
  print( "Element %d from response: %s" % (i+1, response_as[i]) )
  print( "Element %d from sel: %s" % (i+1, sel_as[i]) )
  print( "" )

## Selecting from a Selection
In this exercise, you will find the text from an h4 element within a particular div element. It will occur in steps where the first step is selecting a family of div elements, and the second step is narrowing in on the first one, from which we will grab the h4 element text. This process of progressively narrowing in on elements (e.g., first to the div elements, then to the h4 element) is another example of "chaining", even if it doesn't look exactly the same as we've seen it before.

Along the way in this exercise, there is a variable first_div set up for you to use. Think carefully about what type of object first_div is!

In [None]:
# Select all desired div elements
divs = response.css('div.course-block')

# Take the first div element
first_div = divs[0]

# Extract the text from the (only) h4 element in first_div
h4_text = first_div.css('h4::text').extract_first()

# Print out the text
print( "The text from the h4 element is:", h4_text )
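
The narrowing steps above (many elements, then one element, then its text) can be mimicked without scrapy. The following is a rough sketch using `xml.etree.ElementTree` on a hypothetical snippet with made-up course names. In scrapy, `divs` would be a SelectorList and `first_div` a single Selector; here the analogous objects are a plain list and a single Element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the course names are invented for illustration.
page = ET.fromstring("""
<body>
  <div class="course-block"><h4>Intro to Python</h4></div>
  <div class="course-block"><h4>Intro to R</h4></div>
  <div class="footer"><h4>About</h4></div>
</body>
""")

# Step 1: select the family of div elements (a list of elements).
divs = page.findall('.//div[@class="course-block"]')

# Step 2: indexing narrows the collection down to one element.
first_div = divs[0]

# Step 3: search again, this time relative to that single element only.
h4_text = first_div.find("./h4").text

print(type(divs).__name__, type(first_div).__name__)  # list Element
print(h4_text)                                        # Intro to Python
```

Because the second search starts from `first_div`, the h4 elements in the other div elements are never considered.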

## Titular
Similar to the work in the previous lesson, you will use a pre-loaded Response object, named response, to scrape the course titles from the (shortened version of the) DataCamp course directory https://www.datacamp.com/courses/all. To do so, you only need to know the following:

The course titles are the text from all the h4 elements within the HTML document.

We ask you to extract these course titles here.

In [None]:
# Create a SelectorList of the course titles
crs_title_els = response.css('h4::text')

# Extract the course titles 
crs_titles = crs_title_els.extract()

# Print out the course titles 
for el in crs_titles:
  print( ">>", el )

## Scraping with Children
We did a cute trick in the lesson to calculate how many children there were of one of the div elements belonging to the class course-block. Here we ask you to find the number of children of a mystery element (already stored within a Selector object, so you can use the xpath or css method).

To be explicit, we have created the Selector object mystery in the following way:

1. We first loaded a Response variable using a secret website as the input.
2. Then we used a call to the xpath method to create a SelectorList of elements (but we won't say which ones).
3. Finally, we let mystery be the first Selector object of this SelectorList.

In [None]:
# Calculate the number of children of the mystery element
how_many_kids = len( mystery.xpath( './*' ) )

# Print out the number
print( "The number of elements you selected was:", how_many_kids )
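
To see why `./*` counts children rather than all descendants, here is a small sketch on a hypothetical element, again assuming `xml.etree.ElementTree` as a stand-in for a scrapy Selector. The structure below is invented and is not the mystery element from the exercise.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with three children, one of which has its own children.
mystery_like = ET.fromstring(
    '<div class="course-block">'
    '<a href="/x"><h4>Title</h4><p>Description</p></a>'
    '<span>Tag</span>'
    '<div><p>Author</p></div>'
    '</div>'
)

# "./*" : any element exactly one level below the current element.
children = mystery_like.findall("./*")

# ".//*" : any element at any depth below the current element.
descendants = mystery_like.findall(".//*")

print([el.tag for el in children])  # ['a', 'span', 'div']
print(len(children))                # 3
print(len(descendants))             # 6
```

The leading dot matters in both paths: it makes the search relative to the element at hand instead of the whole document.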

## Inheriting the Spider
When learning about scrapy spiders, we saw that the main portion of the code for us to adjust is the class for the spider. To help build some familiarity of the class, you will complete a short piece of code to complete a toy-model of the spider class code. We've omitted the code that would actually run the spider, only including the pieces necessary to create the class.

As mentioned in the lesson, a class is roughly a collection of related variables and functions housed together. Sometimes one class likes to use methods from another class, and so we will inherit methods from a different class. That's what we do in the spider class.

We wrote the function inspect_class to look at your class once you're done, if you'd like to test your solution!

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider(scrapy.Spider):
  name = "your_spider"
  # start_requests method
  def start_requests(self):
    pass
  # parse method
  def parse(self, response):
    pass
  
# Inspect Your Class
inspect_class(YourSpider)

## Hurl the URLs
In the next lesson we will talk about the start_requests method within the spider class. In this quick exercise, we ask you to change around a variable within the start_requests method which foreshadows some of what we will be learning in the next lesson. Basically, we want you to start becoming comfortable turning some of the wheels within a spider class; in this case, making a list of urls within the start_requests method.

We've written a function inspect_class which will print out the list of elements you have in the urls variable within the start_requests method.

Note: in the next several exercises, you will write code to complete your spider class, but the code does not yet include the pieces to actually run the spider; that will come at the end.

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    urls = ["https://www.datacamp.com","https://scrapy.org"]
    for url in urls:
      yield url
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

## Self Referencing is Classy
You probably have noticed that within the spider class, we always input the argument self in the start_requests and parse methods (just look in the sample code in this exercise!). This allows us to reference between methods within the class. That is, if we want to refer to the method parse within the start_requests method, we would need to write self.parse rather than just parse; what writing self does is tell the code: "Look in the same class as start_requests for a method called parse to use."

In this exercise you will get a chance to play with this "self referencing".

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    self.print_msg( "Hello World!")
  # parse method
  def parse( self, response ):
    pass
  # print_msg method
  def print_msg( self, msg ):
    print( "Calling start_requests in YourSpider prints out:", msg )
  
# Inspect Your Class
inspect_class( YourSpider )
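
What `self.parse` actually hands over can be shown with a plain Python class. This is a toy sketch that does not use scrapy at all: `ToySpider` and the tuple it yields are hypothetical stand-ins, chosen only to show that `self.parse` is a bound method which remembers the instance it belongs to.

```python
class ToySpider:
    """Plain-Python stand-in for a spider class; it does not inherit from scrapy.Spider."""

    name = "toy_spider"

    def start_requests(self):
        # self.parse is a bound method: it carries a reference to this instance,
        # so whoever receives it can call it later without knowing about the class.
        yield ("https://www.datacamp.com", self.parse)

    def parse(self, response):
        return "%s parsed %s" % (self.name, response)


spider = ToySpider()
url, callback = next(spider.start_requests())

# The callback can be invoked on its own; self is filled in automatically.
result = callback("<fake response>")

print(callback.__self__ is spider)  # True
print(result)                       # toy_spider parsed <fake response>
```

Writing just `parse` inside start_requests would raise a NameError, because no function of that name exists outside the class.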

## Starting with Start Requests
In the last lesson we learned about setting up the start_requests method within a scrapy spider. Here we have another toy-model spider which doesn't actually scrape anything, but gives you a chance to play with the start_requests method. What we want is for you to start becoming familiar with the arguments you pass into the scrapy.Request call within start_requests.

As before, we have created the function inspect_class to examine what you are yielding in start_requests.

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url="https://www.datacamp.com", callback=self.parse )
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

## Pen Names
In this exercise, we have set up a spider class which, when finished, will retrieve the author names from a shortened version of the DataCamp course directory. The URL for the shortened version is stored in the variable url_short. Your job will be to create the list of extracted author names in the parse method of the spider.

Two things you should know:

- You will be using the response object and the css method here.
- The course author names are defined by the text within the paragraph p elements belonging to the class course-block__author-name.

You can inspect the spider using the function inspect_spider() that we built for you -- it will print out the author names you find!

Note that this and the remaining exercises in this chapter may take some time to load.

In [None]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  # parse method
  def parse( self, response ):
    # Create an extracted list of course author names
    author_names = response.css('p.course-block__author-name::text').extract()
    # Here we will just return the list of Authors
    return author_names
  
# Inspect the spider
inspect_spider( DCspider )

## Crawler Time
This will be your first chance to play with a spider which will crawl between sites (by first collecting links from one site, and following those links to parse new sites). This spider starts at the shortened DataCamp course directory, then extracts the links of the courses in the parse method; from there, it will follow those links to extract the course descriptions from each course page in the parse_descr method, and put these descriptions into the list course_descrs. Your job is to complete the code so that the spider runs as desired!

We have created a function inspect_spider which will print out one of the course descriptions you scrape (if done correctly)!

In [None]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow( url = link, callback = self.parse_descr )
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description
    yield course_descr


# Inspect the spider
inspect_spider( DCdescr )

## Time to Run
In the last lesson, we went through creating an entire web-crawler to access course information from each course in the DataCamp course directory. However, the lesson seemed to stop without a climax, because we didn't play with the code after finishing the parsing methods.

The point of this exercise is to remedy that!

The code we give you to look at in this and the next exercise is long, because it's the entire spider that took us the whole lesson to create! However, don't be intimidated! The point of these two exercises is to give you a very easy task to complete, with the hope that you will look at and run the code for this spider. That way, even though it is long, you will have a grasp of it!

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

## DataCamp Descriptions
As in the previous exercise, the code here is long since you are working with an entire web-crawling spider! But again, don't let the amount of code intimidate you: you have a handle on how spiders work now, and you are perfectly capable of completing the easy task set for you here!

As in the previous exercise, we have created a function previewCourses which lets you preview the output of the spider, but you can always just explore the dictionary dc_dict too after you run the code.

In this exercise, you are asked to create a CSS Locator string that directs to the text of the course description. All you need to know is that from the course page, the course description text is within a paragraph p element which belongs to the class course__description (two underscores).

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    # Create a SelectorList of the course titles text
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    # Extract the text and strip it clean
    crs_title_ext = crs_title.extract_first().strip()
    # Create a SelectorList of course descriptions text
    crs_descr = response.css('p.course__description::text')
    # Extract the text and strip it clean
    crs_descr_ext = crs_descr.extract_first().strip()
    # Fill in the dictionary
    dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

## Capstone Crawler
This exercise gives you a chance to show off what you've learned! In this exercise, you will write the parse function for a spider and then fill in a few blanks to finish off the spider. On the course directory page of DataCamp, each listed course has a title and a short course description. This spider will be used to scrape the course directory to extract the course titles and short course descriptions. You will not need to follow any links this time. Everything you need to know is:

- The course titles are defined by the text within an h4 element whose class contains the string block__title (double underscore).
- The short course descriptions are defined by the text within a paragraph p element whose class contains the string block__description (double underscore).

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class YourSpider(scrapy.Spider):
  name = 'yourspider'
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short, callback = self.parse)
      
  def parse(self, response):
    # My version of the parser you wrote in the previous part
    crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
    crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
    for crs_title, crs_descr in zip(crs_titles, crs_descrs):
      dc_dict[crs_title] = crs_descr
    
# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)
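
The parse method above pairs titles with descriptions purely by position, using zip. A small sketch with made-up lists (no scraping involved) shows one property of that approach worth keeping in mind: zip stops at the shorter list, so a missing description silently drops a course rather than raising an error.

```python
# Hypothetical extracted lists; the third course has no description on the page.
crs_titles = ["Course A", "Course B", "Course C"]
crs_descrs = ["About A", "About B"]

# Same pairing logic as in the parse method: the i-th title goes with the i-th description.
dc_dict = dict(zip(crs_titles, crs_descrs))

# A cheap sanity check one might add before trusting the pairing.
lengths_match = len(crs_titles) == len(crs_descrs)

print(dc_dict)        # {'Course A': 'About A', 'Course B': 'About B'}
print(lengths_match)  # False
```

When the two selections are guaranteed to line up one-to-one, as the exercise assumes, the positional pairing is fine.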