
# Web Scraping in Python

* Instructor = Thomas Laetsch, DS at NYU
* [Course Link](https://learn.datacamp.com/courses/web-scraping-with-python)
* Notes taken = Aug 4, 2021



# Chapter 1 - Intro to HTML


## 1.1 - HyperText Markup

```
<html>
    <body>
        <div>
            <p>Hello World!</p>
            <p>Enjoy DataCamp!</p>
        </div>
        <p>Thanks for Watching!</p>
    </body>
</html>
```

## 1.2 - HTML Tags and Attributes

1. Tag names: **html**, **div**, and **p**
  * `<tag-name attrib-name="attrib info"> ..element contents.. </tag-name>`
  * `<div id="unique-id" class="some class"> ..div element contents.. </div>`
  * **id** attribute should be unique
  * **class** attribute doesn't need to be unique
2. **a** tags are for hyperlinks; **href** attribute tells what link to go to
  * `<a href="https://www.datacamp.com"> This text links to DataCamp! </a>`

For a full list of HTML tags, see the tag reference on w3schools.

## 1.3 - Crash Course X

1. Direct to all `table` elements within the entire HTML code: `xpath = '//table'`
2. Direct to all `table` elements which are descendants of the 2nd `div` child of the `body` element: `xpath = '/html/body/div[2]//table'`

# Chapter 2 - XPaths and Selectors

## 2.1 - XPathology

### Slashes and Brackets

* `/` looks forward ONE generation
* `//` looks forward all future generations
* `[]` narrows in specific elements
* `*` asterisk is the wildcard



## 2.2 - Off the Beaten XPath

### (At)tribute
* `@` represents "attribute": `@class`, `@id`, `@href`
* Examples:
  + `xpath = '//p[@class="class-1"]'`
  + `xpath = '//*[@id="uid"]'`
  + `xpath = '//div[@id="uid"]/p[2]'`

### Example
```
<html>
  <body>
    <div id="uid">
      <p class="class-1">Hello World!</p>
      <p class="class-2">Enjoy DataCamp!</p>
    </div>
    <p class="class-1">Thanks for Watching!</p>
  </body>
</html>
```

### Content with Contains
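XPath `contains` notation: `contains(@attr-name, "string-expr")`

1. `xpath = '//*[contains(@class,"class-1")]'` selects every element whose `class` attribute contains the substring `class-1`
2. `xpath = '//*[@class="class-1"]'` selects every element whose `class` attribute is exactly `class-1`
3. `xpath = '/html/body/div/p[2]/@class'` selects the `class` attribute of the second `p` child of the `div`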

## 2.3 - scrapy Selector Objects

### Setting up a Selector
```
from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector( text = html )
```

### Selecting Selectors

Use **`xpath`** within **`Selector`**:
```
sel.xpath("//p")
# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
```

### Extracting Data from a SelectorList

**`extract()`**
```
>>> sel.xpath("//p").extract()
out: [ '<p>Hello World!</p>',
       '<p>Enjoy DataCamp!</p>' ]
```

**`extract_first()`**
```
>>> sel.xpath("//p").extract_first()
out: '<p>Hello World!</p>'
```

## 2.4 - Inspecting the Source

### HTML text to Selector

```
from scrapy import Selector
import requests

url = 'https://en.wikipedia.org/wiki/Web_scraping'
html = requests.get( url ).text  # .text is a str, which Selector( text = ... ) expects
sel = Selector( text = html )
```

# Chapter 3 - CSS Locators, Chaining, and Responses



## 3.1 - CSS Locators

1. General rules
  + `/` replaced by `>` (except the first character)
    + XPath: `/html/body/div`
    + CSS Locator: `html > body > div`
  + `//` replaced by a blank space (except the first character)
    + XPath: `//div/span//p`
    + CSS Locator: `div > span p`
  + `[N]` replaced by `:nth-of-type(N)`
    + XPath: `//div/p[2]`
    + CSS Locator: `div > p:nth-of-type(2)`
2. Conversion example
  + XPATH: `xpath = '/html/body//div/p[2]'`
  + CSS: `css = 'html > body div > p:nth-of-type(2)'`
3. Attributes in CSS
  + `.` to find an element by *class*: `p.class-1` selects all paragraph elements belonging to `class-1`
  + `#` to find an element by *id*: `div#uid` selects the `div` element with `id` = `uid`
  + Examples:
    + Select paragraph elements of class `class1` that are children of the `div` with id `uid`: `css_locator = 'div#uid > p.class1'`
    + Select all elements whose class attribute includes `class1`: `css_locator = '.class1'`

### Selectors with CSS

```
>>> sel.css("div > p")
out: [<Selector xpath='...' data='<p>Hello World!</p>'>]

>>> sel.css("div > p").extract()
out: [ '<p>Hello World!</p>' ]
```

### More conversion examples:

```
xpath = '/html/body/span[1]//a'
css_locator = 'html > body > span:nth-of-type(1) a'

xpath = '//div[@id="uid"]/span//h4'
css_locator = 'div#uid > span h4'
```

## 3.2 - Attribute and Text Selection

1. XPath: `<xpath-to-element>/@attr-name` - e.g., `xpath = '//div[@id="uid"]/a/@href'`
2. CSS Locator: `<css-to-element>::attr(attr-name)` - e.g., `css_locator = 'div#uid > a::attr(href)'`

### Text Extraction

```
<p id="p-example">
  Hello world!
  Try <a href="http://www.datacamp.com">DataCamp</a> today!
</p>
```

* In XPath use `text()`

```
sel.xpath('//p[@id="p-example"]/text()').extract()
# result: ['\n Hello world!\n Try ',' today!\n']

sel.xpath('//p[@id="p-example"]//text()').extract()
# result: ['\n Hello world!\n Try ','DataCamp',' today!\n']
```

* In CSS Locator use `::text`

```
sel.css('p#p-example::text').extract()
# result: ['\n Hello world!\n Try ',' today!\n']

sel.css('p#p-example ::text').extract()
# result: ['\n Hello world!\n Try ','DataCamp',' today!\n']
```


## 3.3 - `response`

Review:
1. `xpath` method works like a Selector - `response.xpath('//div/span[@class="bio"]')`
2. `css` method works like a Selector - `response.css( 'div > span.bio' )`
3. Chaining works like a Selector - `response.xpath('//div').css('span.bio')`
4. Data extraction works like a Selector - `response.xpath('//div').css('span.bio').extract_first()`

New:
1. The `response` keeps track of the URL it was loaded from in its `url` attribute
```
response.url
>>> 'http://www.DataCamp.com/courses/all'
```
2. The `response` lets us `follow()` a new link
```
# next_url is the string path of the next url we want to scrape
response.follow( next_url )
```

## 3.4 - DataCamp Course Link Scraping

DataCamp Site: https://www.datacamp.com/courses/all  
(When the course was created there were 185 DataCamp courses; when this notebook was written there were 356.)

Response loaded with HTML from https://www.datacamp.com/courses/all
```
course_divs = response.css('div.course-block')
print( len(course_divs) )
>>> 185
```

### Inspecting course-block
```
first_div = course_divs[0]
children = first_div.xpath('./*')
print( len(children) )
>>> 3
```
### The first child
```
first_div = course_divs[0]
children = first_div.xpath('./*')
first_child = children[0]
print( first_child.extract() )
>>> <a class=... />
```

### The second child
```
first_div = course_divs[0]
children = first_div.xpath('./*')
second_child = children[1]
print( second_child.extract() )
>>> <div class=... />
```

### The forgotten child
```
first_div = course_divs[0]
children = first_div.xpath('./*')
third_child = children[2]
print( third_child.extract() )
>>> <span class=... />
```

### Listful
* In one CSS Locator
```
links = response.css('div.course-block > a::attr(href)').extract()
```
* Stepwise
```
# step 1: course blocks
course_divs = response.css('div.course-block')
# step 2: hyperlink elements
hrefs = course_divs.xpath('./a/@href')
# step 3: extract the links
links = hrefs.extract()
```

### Get Schooled

```
for l in links:
    print( l )

>>> /courses/free-introduction-to-r
>>> /courses/data-table-data-manipulation-r-tutorial
>>> /courses/dplyr-data-manipulation-r-tutorial
>>> /courses/ggvis-data-visualization-r-tutorial
>>> /courses/reporting-with-r-markdown
>>> /courses/intermediate-r

...
```

# Chapter 4 - Spiders


## 4.1 - A Classy Spider

```
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderClassName(scrapy.Spider):
    name = "Spider_name"
    # the code for your spider

# Initiate a CrawlerProcess
process = CrawlerProcess()

# tell the process which spider to use
process.crawl(SpiderClassName)

# start the crawling process
process.start()
```

### Weaving the Web

```
class DCspider( scrapy.Spider ):

    name = 'dc_spider'

    def start_requests( self ):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )
    
    def parse( self, response ):
        # simple example: write out the html
        html_file = 'DC_courses.html'
        with open( html_file, 'wb' ) as fout:
            fout.write( response.body )
```

* The spider needs a method called `start_requests` that yields the initial requests
* It needs at least one parser method to handle the HTML code of each response

## Section 4.3

### DataCamp Course Links: Save to File

```
class DCspider( scrapy.Spider ):
    name = "dcspider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse)
    def parse( self, response ):
        links = response.css('div.course-block > a::attr(href)').extract()
        filepath = 'DC_links.csv'
        with open( filepath, 'w' ) as f:
            f.writelines( [link + '\n' for link in links] )
```
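
The file-writing step can be tried on its own, without a crawl. `writelines` does not add line endings, so each link needs a newline character (`'\n'`) appended. This is a small sketch with made-up links, written to a temporary directory rather than the working directory:

```python
import os
import tempfile

# Hypothetical links standing in for the scraped hrefs
links = ['/courses/free-introduction-to-r', '/courses/intermediate-r']

# Write to a temporary directory so the sketch leaves no file behind in the cwd
filepath = os.path.join(tempfile.mkdtemp(), 'DC_links.csv')
with open(filepath, 'w') as f:
    f.writelines(link + '\n' for link in links)

# Read the file back: one link per line
with open(filepath) as f:
    read_back = [line.rstrip('\n') for line in f]

print(read_back == links)  # True
```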

### DataCamp Course Links: Parse Again

```
class DCspider( scrapy.Spider ):
    name = "dcspider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse)
    def parse( self, response ):
        links = response.css('div.course-block > a::attr(href)').extract()
        for link in links:
            yield response.follow( url = link, callback = self.parse2)
    def parse2( self, response ):
        # parse the course sites here
        pass
```


## Section 4.4

### Inspecting Elements

```
import scrapy
from scrapy.crawler import CrawlerProcess

class DC_Chapter_Spider(scrapy.Spider):

    name = "dc_chapter_spider"

    def start_requests( self ):
        url = 'https://www.datacamp.com/courses/all'
        yield scrapy.Request( url = url, callback = self.parse_front)

    def parse_front( self, response ):
        ## Code to parse the front courses page
        pass

    def parse_pages( self, response ):
        ## Code to parse course pages
        ## Fill in dc_dict here
        pass

dc_dict = dict()

process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()
```

### Parsing the Front Page

```
def parse_front( self, response ):

    # Narrow in on the course blocks
    course_blocks = response.css( 'div.course-block' )

    # Direct to the course links
    course_links = course_blocks.xpath( './a/@href' )
    
    # Extract the links (as a list of strings)
    links_to_follow = course_links.extract()

    # Follow the links to the next parser
    for url in links_to_follow:
        yield response.follow(url = url, callback = self.parse_pages)
```

### Parsing the Course Pages

```
def parse_pages( self, response ):

    # Direct to the course title text
    crs_title = response.xpath( '//h1[contains(@class,"title")]/text()' )

    # Extract and clean the course title text
    crs_title_ext = crs_title.extract_first().strip()

    # Direct to the chapter titles text
    ch_titles = response.css( 'h4.chapter__title::text' )

    # Extract and clean the chapter titles text
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    
    # Store this in our dictionary
    dc_dict[ crs_title_ext ] = ch_titles_ext
```


```
# Import scrapy
import scrapy
# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# NOTE: url_short and previewCourses are not defined in these notes;
# they are assumed to be supplied by the course environment.

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)
```

## Parse DataCamp Course Descriptions

```
# Import scrapy
import scrapy
# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# NOTE: url_short and previewCourses are not defined in these notes;
# they are assumed to be supplied by the course environment.

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    # Create a SelectorList of the course titles text
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    # Extract the text and strip it clean
    crs_title_ext = crs_title.extract_first().strip()
    # Create a SelectorList of course descriptions text
    crs_descr = response.css( 'p.course__description::text' )
    # Extract the text and strip it clean
    crs_descr_ext = crs_descr.extract_first().strip()
    # Fill in the dictionary
    dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)
```

## Capstone Crawler

```
# Import scrapy
import scrapy
# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# NOTE: url_short and previewCourses are not defined in these notes;
# they are assumed to be supplied by the course environment.

# Create the Spider class
class YourSpider( scrapy.Spider ):
  name = 'yourspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request(url = url_short, callback = self.parse)

  def parse(self, response):
    # My version of the parser you wrote in the previous part
    crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
    crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
    for crs_title, crs_descr in zip( crs_titles, crs_descrs ):
      dc_dict[crs_title] = crs_descr

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)
```