In [None]:
import requests
import scrapy
from scrapy import Selector

### xpath

### Ex 1: 

Create an `XPath` string, using `single` forward slashes and brackets, that navigates to the paragraph `p` element containing the text `"Where am I?"`.

In [None]:
html = '''<html>
  <body>
    <div>
      <p>Good Luck!</p>
      <p>Not here...</p>
    </div>
    <div>
      <p>Where am I?</p>
    </div>
  </body>
</html> '''

In [None]:
xpath = "/html/body/div[2]/p"

In [None]:
xpath

### Ex 2:

In this exercise, you will select all paragraph `p` elements within the HTML. Because the goal is to reach `all paragraph` elements, you do not need to know the structure of the HTML code: a simple XPath string using the double forward-slash notation does the job.

In [None]:
xpath_all_p = "//p"

In [None]:
html = '''

<html>
  <body>
    <div>
      <p>Good Luck!</p>
      <p>Not here...</p>
    </div>
    <div>
      <p>Where am I?</p>
    </div>
  </body>
</html> 

'''

### Slashes and Brackets

-Single forward slash `/` looks forward `one` generation

-Double forward slash `//` looks forward through `all` future generations

-Square brackets `[]` help narrow in on `specific` elements


`Tips`:

1.The number of elements selected with the XPath string `xpath = "/html/body/*"` is equal to the `number of children` of the `body element`; whereas the number of elements selected with the XPath string `xpath = "/html/body//*"` is equal to the `total number of descendants` of the `body element`.

2.The number of elements selected by the XPath string `xpath = "/*"` is equal to the number of `root elements` within the `HTML document`, which is typically the `1 html root element`.

3.The number of elements selected by the XPath string `xpath = "//*"` is equal to the `total number of elements` in the `entire HTML document`.

### Ex 3:

In this exercise, we want to give you the opportunity to create your own `XPath` string to achieve a certain task; the task is to select the paragraph element containing the text `"Choose DataCamp!"`.


In [None]:
'''Consider the following HTML:

<html>
  <body>
    <div>
      <p>Hello World!</p>
      <div>
        <p>Choose DataCamp!</p>
      </div>
    </div>
    <div>
      <p>Thanks for Watching!</p>
    </div>
  </body>
</html>

'''

In [None]:
xpath = "/html/body/div/div/p"

### Attribute

`@` represents `"attribute"`

For example: @class, @id, @href


### Ex 4:

In this exercise, you'll begin to write XPath strings using attributes. The task is to select, with one XPath string each, the paragraph elements containing the texts `Hello World!`, `Choose DataCamp!` and `Thanks for Watching!`.


In [None]:
'''Consider the following HTML:

<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

'''

In [None]:
## Hello world 

xpath1 = '//*[@class="class-1"]/p' 

In [None]:
## Choose Datacamp

xpath2 = '//*[@id="p2"]'

In [None]:
## Thanks for watching

xpath3 = '//*[@id="div3"]/p'

### Hyper(link) Active

One of the most important attributes to extract for `"web-crawling"` is the hyperlink URL (the `href` attribute) within an `a` tag. Here, you will extract such a hyperlink!

### Ex 5:

Complete the variable xpath to select the `href` attribute value from the `DataCamp` hyperlink.

In [None]:
"""The exercise refers to the following HTML source code:

<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose 
            <a href="http://datacamp.com">DataCamp!</a>!
        </p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

"""

In [None]:
xpath = '//*[@id="p2"]/a/@href'

### Contains

XPath `Contains` Notation:

`contains(@attribute-name, "string-expression")`

### Ex 6:

Assign an XPath string to the variable `xpath` which directs to all `href` attribute values of the hyperlink `a` elements whose `class` attributes `contain` the string `"package-snippet"`. Remember that we use the `contains` call within the XPath string to check whether an attribute value contains a particular string.

In [None]:
xpath = '//a[contains(@class, "package-snippet")]/@href'

### CSS Locators

1.`/` is replaced by `>` (`except` the `first` character, which is dropped)

XPath: `/html/body/div`

CSS Locator: `html > body > div`

2.`//` is replaced by a `blank` space (`except` the `first` characters, which are dropped)

XPath: `//div/span//p`

CSS Locator: `div > span p`

3.`[N]` is replaced by `:nth-of-type(N)`

XPath: `//div/p[2]`

CSS Locator: `div > p:nth-of-type(2)`
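The three rules are mechanical enough to write down as a small helper. The function `xpath_to_css` below is a hypothetical toy, not part of scrapy: it only handles tag names, `/`, `//` and `[N]`, and it will not translate attributes, `contains`, or `text()`.

```python
import re


def xpath_to_css(xpath: str) -> str:
    """Translate a simple XPath (tags, /, // and [N] only) into a CSS Locator."""
    # Rule 3: [N] becomes :nth-of-type(N)
    css = re.sub(r"\[(\d+)\]", r":nth-of-type(\1)", xpath)
    # The leading / or // has no CSS counterpart, so drop it
    css = css.lstrip("/")
    # Rule 2: // becomes a blank space (must run before rule 1)
    css = css.replace("//", " ")
    # Rule 1: / becomes " > "
    css = css.replace("/", " > ")
    return css


print(xpath_to_css("/html/body/div"))  # html > body > div
print(xpath_to_css("//div/span//p"))   # div > span p
print(xpath_to_css("//div/p[2]"))      # div > p:nth-of-type(2)
```

Replacing `//` before `/` matters: doing it the other way round would turn every `//` into two `>` separators.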

### Ex 7: The (X)Path to CSS Locators

Assign to the variable `css_locator` a CSS Locator string which is equivalent to the `XPath` string given.

`xpath = '/html/body/span[1]//a'`

In [None]:
# Create the CSS Locator string equivalent to the XPath

css_locator = "html > body > span:nth-of-type(1) a"

### Ex 8: 

Assign to the variable `xpath` an `XPath` string which is equivalent to the CSS Locator string given.

`css_locator = 'div#uid > span h4'`

In [None]:
# Create the XPath string equivalent to the CSS Locator 

xpath = '//div[@id="uid"]/span//h4'

### The CSS Wildcard

You can use the wildcard `*` in CSS Locators too! In fact, we can use it in a similar way, when we want to ignore the tag type. For example:

1.The CSS Locator string `'*'` selects `all elements` in the HTML document.

2.The CSS Locator string `'*.class-1'` selects `all elements which belong to class-1`, but this is unnecessary since the string `'.class-1'` will also do the `same` job.

3.The CSS Locator string `'*#uid'` selects the `element with id attribute equal to uid`, but this is unnecessary since the string `'#uid'` will also do the same job.

### Ex 9:

Assign to the variable `css_locator` a CSS Locator string which will select `all children` (regardless of tag-type) of the unique element in the HTML document that has its `id attribute` equal to `uid`.

In [None]:
## Create the CSS Locator to all children of the element whose id is uid

css_locator = '#uid > *'

### XPath and CSS locator Chaining

`Selector` and `SelectorList` objects allow for chaining when using the `xpath` method. This means you can call `xpath` again on the result of an earlier `xpath` call. For example, if `sel` is the name of our Selector, then

`sel.xpath('/html/body/div[2]')`

is the same as

`sel.xpath('/html').xpath('./body/div[2]')`

which is also the same as

`sel.xpath('/html').xpath('./body').xpath('./div[2]')`

### Ex 10:

1.Assign to the variable `css_locator` a CSS Locator string which directs to the `hyperlink (a element)` children of all `div` elements belonging to the class `"course-block"`.

2.Assign to the variable `hrefs_from_xpath` the `href` attribute values from the elements in `course_as`. 

3.Do the same with a CSS Locator, assigning the result to `hrefs_from_css`.

In [None]:
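css_locator = 'div.course-block > a'

# sel is assumed to be a Selector already holding the course page HTML
course_as = sel.css(css_locator)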

In [None]:
# Selecting all href attributes chaining with xpath

hrefs_from_xpath = course_as.xpath('./@href')

In [None]:
# Selecting all href attributes chaining with css

hrefs_from_css = course_as.css('::attr(href)')

### HTML text to Selector

We can create a scrapy `Selector` object from a string containing the HTML code.

1.The `Selector` selects the `entire` HTML document when that string is passed to its `text` argument

2.`Selector` and `SelectorList` objects allow for `chaining` when using the `xpath` method. This means you can call `xpath` again on the result of an earlier `xpath` call.

In [None]:
## Example

html = '''
<html>
    <body>
        <div class="hello datacamp">
            <p>Hello World!</p>
        </div>
        <p>Enjoy DataCamp!</p>
    </body>
</html>
'''

In [None]:
# Setting up a Selector
# The selector sel has selected the entire html document

sel = Selector( text = html )

# Selecting Selectors
# We can use the xpath call within a Selector to create new Selectors of specific pieces of the html code
# The return is a SelectorList of Selector objects

print("Selector List: \n",sel.xpath("//p"))

In [None]:
# For Extracting Data from a SelectorList, Use the .extract() method

sel.xpath("//p").extract()

In [None]:
# We can use extract_first() to get the first element of the list

first_p = sel.xpath("//p").extract_first()
first_p

In [None]:
# Another way 

ps = sel.xpath('//p')
ps

In [None]:
## extract by indexing

second_p = ps[1]
second_p

In [None]:
second_p.extract()

### Ex 11:

Store the URL of the DataCamp website in the string variable `url` and use the `requests` library to put the content from the website into the string variable `html`. Your task is to:

1.Create the string `html` containing the HTML source

2.Set up the Selector object `sel` with the `html` variable passed as the `text` argument.

3.Print out the number of `elements` in the HTML document

4.Create a `SelectorList` of all `div` elements in the HTML document

In [None]:
url = "https://app.datacamp.com/learn/courses"

# Create the string html containing the HTML source

html = requests.get(url).content

# Create the Selector object sel from html

sel = Selector(text=html)

print(f"You have found total {len(sel.xpath('//*'))} elements") 

In [None]:
# Create a SelectorList of all div elements in the HTML document

divs = sel.xpath("//div")

### Attribute and Text Selection

In [None]:
### Example 
html = '''
<p id="p-example">
    Hello world!
    Try <a href="http://www.datacamp.com">DataCamp</a> today!
</p>

'''

In [None]:
sel = Selector(text=html)

In [None]:
## In XPath, /text() selects the element's own text and does not include the text of future generations

sel.xpath('//p[@id="p-example"]/text()').extract()

In [None]:
# In a CSS Locator, ::text directly after the element does not include the text of future generations

sel.css('p#p-example::text').extract()

In [None]:
## In XPath, //text() also includes the text of future generations

sel.xpath('//p[@id="p-example"]//text()').extract()

In [None]:
## In a CSS Locator, a space before ::text also includes the text of future generations

sel.css('p#p-example ::text').extract()

### Ex 12:

1.Assign to the variable xpath an `XPath` string directing to the `text` within the paragraph `p` element with `id` equal to `p3`, which does `not` include the text of `future generations` of this `p` element.

2.Assign to the variable `css_locator` a CSS Locator string directing to this same text.

In [None]:
## Create an XPath string to the desired text.
xpath = '//p[@id="p3"]/text()'

## Create a CSS Locator string to the desired text.
css_locator = 'p#p3::text'

### Ex 13:

1.Assign to the variable xpath an `XPath` string directing to the text within the paragraph p element with id equal to p3, which `includes` the `text` of `future generations` of this `p` element.

2.Assign to the variable `css_locator` a CSS Locator string directing to this same text.

In [None]:
## Create an XPath string to the desired text.
xpath = '//p[@id="p3"]//text()'

## Create a CSS Locator string to the desired text.
css_locator = 'p#p3 ::text'

### Response:

The Response has `all` the tools we learned with Selectors:

1.The `xpath` and `css` methods, and also the `extract` and `extract_first` methods. `Chaining` works the same way as with a `Selector`.

2.The response `keeps track` of the `URL` it came from in the `response.url` attribute

3.The Response helps us move from one site to another, so that we can `"crawl"` the web while scraping.

4.The response lets us "follow" a new link with the `follow()` method

### Ex 14:

The Response object, named response, is from a `secret` website. Your job is to figure out the `URL` and the `title` of the website using the response variable. To find the website title, what you need to know is:

1.The title is the text from the `title` element

2.The `title` element is a `child` of the `head` element, which is a `child` of the `html` root element.

In [None]:
url = 'https://www.datacamp.com/courses/all'

In [None]:
html = requests.get(url).content

In [None]:
sel = Selector(text=html)

In [None]:
##  Get the title of the website loaded in response

the_title = sel.xpath('/html/head/title/text()').extract_first()
the_title

In [2]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
    name = "dc_chapter_spider"

    # start_requests method
    # url_short is defined in cell In [1], shown at the end of this notebook
    def start_requests(self):
        yield scrapy.Request(url=url_short,
                             callback=self.parse_front)

    # First parsing method
    def parse_front(self, response):
        course_blocks = response.css('div.course-block')
        course_links = course_blocks.xpath('./a/@href')
        links_to_follow = course_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url,
                                  callback=self.parse_pages)

    # Second parsing method
    def parse_pages(self, response):
        # Create a SelectorList of the course titles text
        crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
        # Extract the text and strip it clean
        crs_title_ext = crs_title.extract_first().strip()
        # Create a SelectorList of course descriptions text
        crs_descr = response.css('p.course__description ::text')
        # Extract the text and strip it clean
        crs_descr_ext = crs_descr.extract_first().strip()
        # Fill in the dictionary
        dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print the start URL that was crawled

print(url_short)

2021-11-02 23:05:31 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-11-02 23:05:31 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.18362-SP0
2021-11-02 23:05:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-02 23:05:31 [scrapy.crawler] INFO: Overridden settings:
{}
2021-11-02 23:05:31 [scrapy.extensions.telnet] INFO: Telnet Password: 14af77892de7746a
2021-11-02 23:05:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-11-02 23:05:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 's

https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short


In [1]:
url_short = "https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short"