# HyperText Markup Language (HTML)

Web scraping pipeline

1. Setup
    * Understand what we want to do
    * Find sources to help us to do it
2. Acquisition
    * Read in the raw data from online
    * Format the data to be usable
    * Options to use: Scrapy via Python
3. Processing
    * Many options


## Overview

What does the code look like behind the scenes?

* `<tag-name>` ... `</tag-name>`
    
Nesting gives rise to a hierarchy, which can be visualized as a hierarchy tree:

<html>

    <body>
    
        <div>
            <p> Text wrapped in div </p>
        </div>
        
        <p> Text wrapped in body </p>
        
    </body>
    
</html>
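
The hierarchy can also be made visible programmatically. The sketch below uses Python's standard-library `html.parser` (not Scrapy) and a hypothetical helper class `TreePrinter` to print one indented line per opening tag, so that the indentation mirrors the depth in the tree.

```python
from html.parser import HTMLParser

html = """
<html>
    <body>
        <div>
            <p> Text wrapped in div </p>
        </div>
        <p> Text wrapped in body </p>
    </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Hypothetical helper: records one indented line per opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by the current depth, then go one generation deeper
        self.lines.append("    " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one generation
        self.depth -= 1


printer = TreePrinter()
printer.feed(html)
tree_lines = printer.lines
print("\n".join(tree_lines))
```

The two `p` elements end up at different depths: one is a grandchild of `body` (via `div`), the other a direct child.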

### 1. Attributes

The full list of existing attributes can be found in online HTML references.

Tags can have attributes, written as

```<tag-name attribute-name="attrib info">```

    ... element contents
    
```</tag-name>```

Example 1: In this example, the div element belongs to both classes "some" and "class". Class names do not have to be unique, whereas an `id` should be.

```<div id="unique-id" class="some class">```

    ... div element contents

```</div>```

Example 2: The a-tag is used for hyperlinks. Its most important attribute is `href`, which identifies the URL the hyperlink redirects to.

```<a href="https://www.datacamp.com">```

    "This text links to Datacamp!"

```</a>```
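
To see how attributes look once parsed, here is a minimal sketch using the standard-library `html.parser` (an assumption for illustration; the rest of these notes use Scrapy). The class `AttributeCollector` is a hypothetical helper, and the HTML string combines the two examples above.

```python
from html.parser import HTMLParser

html = (
    '<div id="unique-id" class="some class">'
    '<a href="https://www.datacamp.com">This text links to DataCamp!</a>'
    '</div>'
)


class AttributeCollector(HTMLParser):
    """Hypothetical helper: stores the attributes of each tag by tag name."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found[tag] = dict(attrs)


collector = AttributeCollector()
collector.feed(html)

# The class attribute is one string; splitting on whitespace gives the classes
div_classes = collector.found["div"]["class"].split()
link_target = collector.found["a"]["href"]

print(div_classes)
print(link_target)
```

Splitting the `class` value shows why the div belongs to two classes rather than to one class named "some class".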

### 2. XPath notation - Elements to direct to

In the following examples, a simple XPath moves forward through the generations of the tag tree. Square brackets tell us which of the selected siblings to choose; counting starts at 1, so `div[2]` is the second `div` among its siblings.

* Single forward slash ```/``` : Looks forward 1 generation
* Double forward slash ```//``` : Looks forward all generations
* Square brackets ```[]``` : Narrow in on specific elements
* The wildcard ```*``` : Matches any element, regardless of its tag type

The number of elements selected with the XPath string ```xpath = "/html/body/*"``` is equal to the number of children of the body element, whereas the number of elements selected with the XPath string ```xpath = "/html/body//*"``` is equal to the total number of descendants of the body element.
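
The difference between children and descendants can be checked on the small document from the overview. This sketch uses the standard-library `xml.etree.ElementTree` as a stand-in for Scrapy; it supports only a subset of XPath and only relative paths, so `*` and `.//*` are evaluated relative to the `body` element instead of using the absolute strings above.

```python
import xml.etree.ElementTree as ET

html = """
<html>
    <body>
        <div>
            <p>Text wrapped in div</p>
        </div>
        <p>Text wrapped in body</p>
    </body>
</html>
""".strip()

root = ET.fromstring(html)   # the <html> element
body = root.find("body")

# Counterpart of /html/body/* : direct children of body only
children = body.findall("*")

# Counterpart of /html/body//* : all descendants of body
descendants = body.findall(".//*")

print([el.tag for el in children])
print([el.tag for el in descendants])
```

Here `body` has two children (`div`, `p`) but three descendants, because the `p` inside the `div` is counted as well.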

#### Path selection

In [None]:
xpath = '/html/body/div[2]'

The double forward slash tells us to look forward to all future generations.

In [None]:
xpath = '//table'

To restrict the search to a specific div element and navigate to all 'table' elements that are descendants of that div element:

In [None]:
xpath = '/html/body/div[2]//table'
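
A small sketch of the three path selections above, again with the standard-library `xml.etree.ElementTree` as a stand-in (it needs relative paths, so `./body/...` replaces `/html/body/...`). The document and the `id` values `t1` to `t3` are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<html><body>
  <div><table id="t1"/></div>
  <div>
    <table id="t2"/>
    <section><table id="t3"/></section>
  </div>
</body></html>""")

# Counterpart of //table : every table in the document
all_tables = [t.get("id") for t in root.findall(".//table")]

# Counterpart of /html/body/div[2] : the second div child of body
second_div = root.findall("./body/div[2]")

# Counterpart of /html/body/div[2]//table : tables below the second div only
tables_in_second_div = [t.get("id") for t in root.findall("./body/div[2]//table")]

print(all_tables)
print(tables_in_second_div)
```

The table `t3` is nested one level deeper inside a `section`, yet it is still found because `//` looks through all generations below the second div.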

#### Attribute selection

When searching for a specific attribute, use square brackets on the tag of interest. E.g. to select all div elements whose ```id``` attribute equals ```uid```:

In [None]:
xpath = '//div[@id="uid"]'

In [None]:
xpath = '//p[text()="Choose DataCamp!"]'

To match a substring within the full class attribute, use the function ```contains(@attr-name, "string-expr")```

In [None]:
# matches every element whose class attribute contains "class-1"
xpath = '//*[contains(@class, "class-1")]'

In [None]:
# matches only elements whose entire class attribute is equal to "class-1"
xpath = '//*[@class="class-1"]'
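
The difference between the two matches can be sketched in plain Python. The standard-library `xml.etree.ElementTree` used here supports the exact-match predicate but has no `contains()`, so the substring test is emulated with Python's `in` operator; the sample elements are hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<body>
  <p class="class-1">exact</p>
  <p class="class-1 class-2">two classes</p>
  <p class="class-12">different class, same prefix</p>
  <p class="other">no match</p>
</body>""")

# Counterpart of //*[@class="class-1"] : whole attribute must be equal
exact = [p.text for p in root.findall(".//*[@class='class-1']")]

# Emulation of //*[contains(@class, "class-1")] : plain substring test
substring = [p.text for p in root.iter() if "class-1" in p.get("class", "")]

print(exact)
print(substring)
```

Because `contains()` is a substring test, it also picks up `class-12`, which is a different class. This is worth keeping in mind when class names share a prefix.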

To select the attribute itself, append a forward slash and ```@attr-name``` to the XPath, e.g. ```@class```.

In [None]:
xpath = '/html/body/div[2]//table/@class'

#### Show the text

In [None]:
# Text of the first generation only (example)
xpath = '//p[@id="p3"]/text()'

# Text of all generations (example)
xpath = '//p[@id="p3"]//text()'
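
What "first generation" versus "all generations" means for text can be sketched with the standard-library `xml.etree.ElementTree`, which has no `text()` step but exposes the same pieces as `.text`, `.tail` and `itertext()`. The paragraph content is a made-up example.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p id="p3">Hello <b>big</b> world</p>')

# Like /text(): only the text nodes that sit directly inside <p>.
# In ElementTree these are p.text plus the tail after each child element.
direct_text = [t for t in [p.text] + [child.tail for child in p] if t]

# Like //text(): text of <p> and of all its descendants, in document order
all_text = list(p.itertext())

print(direct_text)
print(all_text)
```

The word inside `<b>` belongs to a child element, so it only shows up when all generations are included.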

### 3. CSS (Cascading Style Sheets) Locators

CSS describes how the elements are displayed on the screen.

Many people prefer CSS Locator notation to XPath notation, as it often makes attribute selection very easy. Knowing both, and combining them, is powerful.

* ```*``` selects all elements in HTML document
* ```*.class-1``` selects all elements which belong to ```class-1```, which is equivalent to ```.class-1```
* ```*#uid``` selects the element with ```id``` attribute equal to ```uid```, which is equivalent to ```#uid```

#### Path selection

In [10]:
# xpath style
xpath = '/html/body//div/p[2]'

# CSS style
css = 'html > body div > p:nth-of-type(2)'

In [11]:
# equivalent
xpath = '//div[@id="uid"]/span//h4'
css_locator = 'div#uid > span h4'

In [12]:
# equivalent
xpath = '/html/body/span[1]//a'
css_locator = 'html > body > span:nth-of-type(1) a'

In [13]:
# Select paragraph elements of class1 that are children of the div with id "uid"
css_locator = 'div#uid > p.class1'

# # is an id selector
# . is a class selector

In [14]:
# Select all elements that belong to class1
css_locator = '.class1'

#### Attribute selection

In [None]:
# xpath
xpath_attr = '<xpath-to-element>/@attr-name'

# CSS
css_attr = '<css-to-element>::attr(attr-name)'

In [None]:
# equivalent
xpath_attr = '//div[@id="uid"]/a/@href'
css_attr = 'div#uid > a::attr(href)'

#### Show the text

In [None]:
# Text of the first generation only (example)
css_locator = 'p#p3::text'

# Text of all generations (example)
css_locator = 'p#p3 ::text'

## Scrapy package

## 1. Selector object

In [1]:
from scrapy import Selector

# html
html = ''' 
<html> 
    <body> 
        <div>
            <p>Hello world!</p>
        </div>
        <p>Enjoy DataCamp</p>
    </body> 
</html> 
'''

# Selector selects the entire HTML document
sel = Selector(text = html)

In [2]:
print(sel)

<Selector xpath=None data='<html> \n    <body> \n        <div>\n   ...'>


To obtain the HTML from a URL, the requests package is useful.

In [None]:
import requests

# Extract html from url
url = 'https://'
# .text returns the decoded string that Selector(text=...) expects
html = requests.get(url).text

# Selector selects the entire HTML document
sel = Selector(text = html)

### 1A. Set up a scrapy selector (xpath)

In [3]:
# select xpath
sel.xpath("//p")

[<Selector xpath='//p' data='<p>Hello world!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp</p>'>]

In [6]:
# Access the data in the selectorlist
sel.xpath("//p").extract()

['<p>Hello world!</p>', '<p>Enjoy DataCamp</p>']

In [7]:
# Only extract the first data in the list
sel.xpath("//p").extract_first()

'<p>Hello world!</p>'

In [17]:
# Access the text of the data in the selectorlist
sel.xpath("//p/text()").extract()

['Hello world!', 'Enjoy DataCamp']

In [None]:
# Chaining fragmented xpath calls selects the same elements as one full xpath
sel.xpath('/html/body/div[2]')
sel.xpath('/html').xpath('./body/div[2]')
sel.xpath('/html').xpath('./body').xpath('./div[2]')

### 1B. Set up a scrapy selector (CSS)
The CSS workflow mirrors the XPath one.

In [19]:
# Select css locator
sel.css("p")

[<Selector xpath='descendant-or-self::p' data='<p>Hello world!</p>'>,
 <Selector xpath='descendant-or-self::p' data='<p>Enjoy DataCamp</p>'>]

In [20]:
# Access the data in the selectorlist
sel.css("p").extract()

['<p>Hello world!</p>', '<p>Enjoy DataCamp</p>']

In [25]:
# Access the text of the data in the selectorlist
sel.css("p::text").extract()

['Hello world!', 'Enjoy DataCamp']

## 2. Response object

Everything learned about selectors also applies to response objects.

Advantages of a response object over a selector object:
* It keeps track of the URL the HTML code was loaded from
* It helps us move from one site to another, so we can crawl between links and scrape multiple sites automatically.
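
Keeping track of the URL matters because links on a page are often relative. The sketch below uses the standard-library `urllib.parse.urljoin` to show the kind of resolution a response can perform when it follows a link; it illustrates the idea and is not Scrapy's internal code. The course paths are hypothetical.

```python
from urllib.parse import urljoin

# URL the page was loaded from (what a response keeps track of)
page_url = "https://www.datacamp.com/courses/all"

# Hypothetical href values as they might appear in the page's <a> tags
hrefs = [
    "/courses/intro-to-python",                     # relative link
    "https://www.datacamp.com/courses/intro-to-r",  # already absolute
]

# Resolve every link against the page URL
absolute_links = [urljoin(page_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

A bare selector built from an HTML string has no such base URL, which is why a relative `href` alone is not enough to request the next page.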

In [None]:
response.xpath('//div/span[@class="bio"]')

In [None]:
response.css('div > span.bio')

In [None]:
response.xpath('//div').css('span.bio').extract()

In [None]:
# URL the response was loaded from
response.url
# Follow a link found on the page
response.follow(next_url)

## 3. Example: put it all together

General layout to scrape a website

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

# what websites to scrape and how
class YourSpiderClassName(scrapy.Spider):
    
    name = 'your_spider'
    
    # code for your spider
    def start_requests(self):
        pass
    
    def parse(self, response):
        pass
    
# Initiate a CrawlerProcess
process = CrawlerProcess()

# Tell the process which spider to use
process.crawl(YourSpiderClassName)

# Start the crawling process
process.start()

The general template filled in with some methods:

In [None]:
# Creating the actual spider:
# which sites to scrape and how to scrape them

class DCspider(scrapy.Spider):
    
    name = 'dc_spider'
    
    # Which site to scrape
    def start_requests(self):
        
        # URLs to be scraped
        urls = ['https://datacamp.com/courses/all'] 
        
        for url in urls:
            
            # The callback tells Scrapy where to send the response for parsing;
            # here it refers to the parse method defined below
            yield scrapy.Request(url = url, callback = self.parse)
            
    # Parse the response of the scraped site
    def parse(self, response):
        
        # response is the variable passed on from the scrapy.Request call
        
        # Links to the course pages
        links = response.css('div.course-block > a::attr(href)').extract()
        
        # Path of the HTML file
        filepath = 'DC_courses.html'
        
        # Write the HTML to a file
        with open(filepath, 'wb') as fout:
            fout.write(response.body)
            


Scrape the linked sites (taking the previous example as a starting point):

In [None]:
# Creating the actual spider:
# which sites to scrape and how to scrape them

class DCspider(scrapy.Spider):
    
    name = 'dc_spider'
    
    # Which site to scrape
    def start_requests(self):
        
        # URLs to be scraped
        urls = ['https://datacamp.com/courses/all'] 
        
        for url in urls:
            
            # The callback tells Scrapy where to send the response for parsing;
            # here it refers to the parse method defined below
            yield scrapy.Request(url = url, callback = self.parse)
            
            
    # The spider crawls between different sites and scrapes them
    def parse(self, response):
        
        # response is the variable passed on from the scrapy.Request call
        
        # Links to the course pages
        links = response.css('div.course-block > a::attr(href)').extract()
        
        # The spider follows those links and parses the linked sites in a second method.
        # .follow works similarly to scrapy.Request; its callback points the spider
        # to the parsing method to use next.
        for link in links:
            
            # Follow each of those links and scrape those sites
            yield response.follow(url = link, callback = self.parse2)
        
        
    def parse2(self, response):
        
        # Parse the course sites here!
        pass
            


Below is an actual example:

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class DC_Chapter_Spider(scrapy.Spider):
    
    name = 'dc_chapter_spider'
    
    def start_requests(self):
        url = 'https://datacamp.com/courses/all'

        yield scrapy.Request(url = url, callback = self.parse_front)
            
            
    def parse_front(self, response):
        # Parse the front courses page
        
        # Narrow in on the course blocks
        course_blocks = response.css('div.course-block')
        
        # Direct to the course links
        course_links = course_blocks.xpath('./a/@href')
        
        # Extract the links (as a list of strings)
        links_to_follow = course_links.extract()
        
        # Follow the links to the next parser
        for link in links_to_follow:
            yield response.follow(url = link, callback = self.parse_pages)
        
        
    def parse_pages(self, response):
        # Parse the individual course pages
        
        # --------------- Course -----------------
        
        # Direct to the course title text
        crs_title = response.xpath('//h1[contains(@class, "title")]/text()')
        
        # Extract and clean the course title text
        crs_title_ext = crs_title.extract_first().strip()
        
        # -------------- Chapter -----------------
        
        # Direct to the chapter title text
        ch_titles = response.css('h4.chapter__title::text')
        
        # Extract and clean the chapter titles text
        ch_titles_ext = [t.strip() for t in ch_titles.extract()]
        
        # -------------- Store -------------------
        
        # Store this in our dictionary
        dc_dict[crs_title_ext] = ch_titles_ext
            
dc_dict = dict()

process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()
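
Once the crawl has finished, `dc_dict` maps each course title to its list of chapter titles. The sketch below shows one way such a dictionary could be inspected. The entries are hypothetical placeholders standing in for whatever the spider collects, not real scraped data.

```python
# Hypothetical contents, standing in for what the spider would collect
dc_dict = {
    "Course A": ["Chapter 1", "Chapter 2", "Chapter 3"],
    "Course B": ["Chapter 1", "Chapter 2"],
}

# Number of chapters per course
chapter_counts = {title: len(chapters) for title, chapters in dc_dict.items()}

# Total number of chapters over all courses
total_chapters = sum(chapter_counts.values())

for title, count in chapter_counts.items():
    print(f"{title}: {count} chapters")
print(f"Total: {total_chapters} chapters")
```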