
# Web Scraping in Python

* Instructor = Thomas Laetsch, DS at NYU
* [Course Link](https://learn.datacamp.com/courses/web-scraping-with-python)
* Notes taken = Aug 4, 2021



# Chapter 1 - Intro to HTML


## 1.1 - HyperText Markup

```
<html>
    <body>
        <div>
            <p>Hello World!</p>
            <p>Enjoy DataCamp!</p>
        </div>
        <p>Thanks for Watching!</p>
    </body>
</html>
```

## 1.2 - HTML Tags and Attributes

1. Tag names: **html**, **div**, and **p**
  * `<tag-name attrib-name="attrib info"> ..element contents.. </tag-name>`
  * `<div id="unique-id" class="some class"> ..div element contents.. </div>`
  * **id** attribute should be unique
  * **class** attribute doesn't need to be unique
2. **a** tags are for hyperlinks; **href** attribute tells what link to go to
  * `<a href="https://www.datacamp.com"> This text links to DataCamp! </a>`

See the HTML tag reference on w3schools.
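
To see the tag/attribute anatomy above in action, here is a minimal sketch that uses the standard library's `html.parser` to list each opening tag with its attributes. The `TagLister` class and the `snippet` string are made up for this illustration; they are not part of the course material.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag name, attributes) for every opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute name, attribute value) pairs
        self.seen.append((tag, dict(attrs)))


# Illustrative snippet built from the two patterns in the notes above
snippet = '''
<div id="unique-id" class="some class">
  <a href="https://www.datacamp.com"> This text links to DataCamp! </a>
</div>
'''

lister = TagLister()
lister.feed(snippet)
for tag, attrs in lister.seen:
    print(tag, attrs)
```

Each printed line pairs a tag name with a dictionary of its attributes, which is the same information an XPath or CSS locator later keys on.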

## 1.3 - Crash Course X

1. Select all `table` elements anywhere in the HTML document: `xpath = '//table'`
2. Select all `table` elements that are descendants of the 2nd `div` child of the `body` element: `xpath = '/html/body/div[2]//table'`
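
The two XPaths above can be tried without any extra installs. This is a rough sketch that uses the standard library's `xml.etree.ElementTree` rather than scrapy; that module understands only a subset of XPath and needs paths relative to an element, so `//table` is written `.//table` and the leading `/html` becomes `.`. The sample page and its `id` values are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: three tables, one of them nested inside the 2nd div of body
html = """
<html>
  <body>
    <div><table id="t1"></table></div>
    <div><div><table id="t2"></table></div></div>
    <table id="t3"></table>
  </body>
</html>
"""

root = ET.fromstring(html)  # root is the <html> element

# XPath '//table': every table in the document
all_tables = [t.get("id") for t in root.findall(".//table")]

# XPath '/html/body/div[2]//table': tables anywhere below the 2nd div child of body
second_div_tables = [t.get("id") for t in root.findall("./body/div[2]//table")]

print(all_tables)         # ['t1', 't2', 't3']
print(second_div_tables)  # ['t2']
```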

# Chapter 2 - XPaths and Selectors

## 2.1 - XPathology

### Slashes and Brackets

* `/` looks forward ONE generation
* `//` looks forward all future generations
* `[]` narrows in specific elements
* `*` asterisk is the wildcard
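
A small sketch of the four pieces of notation, run against the HTML from section 1.1. It assumes the standard library's `xml.etree.ElementTree` as a stand-in for scrapy; that module supports only part of XPath and evaluates paths relative to the root `<html>` element, hence the leading `.` in each path.

```python
import xml.etree.ElementTree as ET

html = """
<html>
    <body>
        <div>
            <p>Hello World!</p>
            <p>Enjoy DataCamp!</p>
        </div>
        <p>Thanks for Watching!</p>
    </body>
</html>
"""

root = ET.fromstring(html)  # root is the <html> element

# '/' moves ONE generation: only p elements that are direct children of body
child_p = [p.text for p in root.findall("./body/p")]

# '//' moves through all generations: every p below the root
all_p = [p.text for p in root.findall(".//p")]

# '[]' narrows the selection: the 2nd p child of the div
second_p = [p.text for p in root.findall("./body/div/p[2]")]

# '*' is the wildcard: every direct child of body, whatever its tag
body_children = [el.tag for el in root.findall("./body/*")]

print(child_p)        # ['Thanks for Watching!']
print(all_p)          # ['Hello World!', 'Enjoy DataCamp!', 'Thanks for Watching!']
print(second_p)       # ['Enjoy DataCamp!']
print(body_children)  # ['div', 'p']
```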



## 2.2 - Off the Beaten XPath

### (At)tribute
* `@` represents "attribute": `@class`, `@id`, `@href`
* Examples:
  + `xpath = '//p[@class="class-1"]'`
  + `xpath = '//*[@id="uid"]'`
  + `xpath = '//div[@id="uid"]/p[2]'`

### Example
```
<html>
  <body>
    <div id="uid">
      <p class="class-1">Hello World!</p>
      <p class="class-2">Enjoy DataCamp!</p>
    </div>
    <p class="class-1">Thanks for Watching!</p>
  </body>
</html>
```
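
The three attribute XPaths listed earlier can be checked against this example document. The sketch below assumes the standard library's `xml.etree.ElementTree` instead of scrapy; it handles `[@attr="value"]` predicates but needs relative paths, so each `//` is written `.//`.

```python
import xml.etree.ElementTree as ET

html = """
<html>
  <body>
    <div id="uid">
      <p class="class-1">Hello World!</p>
      <p class="class-2">Enjoy DataCamp!</p>
    </div>
    <p class="class-1">Thanks for Watching!</p>
  </body>
</html>
"""

root = ET.fromstring(html)  # root is the <html> element

# XPath '//p[@class="class-1"]': p elements whose class is exactly "class-1"
class_1_text = [p.text for p in root.findall('.//p[@class="class-1"]')]

# XPath '//*[@id="uid"]': any element, of any tag, whose id is "uid"
uid_tags = [el.tag for el in root.findall('.//*[@id="uid"]')]

# XPath '//div[@id="uid"]/p[2]': the 2nd p child of the div with id "uid"
second_p_text = [p.text for p in root.findall('.//div[@id="uid"]/p[2]')]

print(class_1_text)   # ['Hello World!', 'Thanks for Watching!']
print(uid_tags)       # ['div']
print(second_p_text)  # ['Enjoy DataCamp!']
```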

### Content with Contains
XPath contains notation: `contains(@attr-name, "string-expr")`

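1. `xpath = '//*[contains(@class,"class-1")]'` selects every element whose `class` attribute contains the substring `class-1` (so `class-12` would match too)
2. `xpath = '//*[@class="class-1"]'` selects only elements whose `class` attribute is exactly `class-1`
3. `xpath = '/html/body/div/p[2]/@class'` selects the `class` attribute itself of the 2nd `p` child of the `div`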
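
The difference between `contains(...)` and `=` is easiest to see on class names that share a prefix. The standard library's `ElementTree` has no `contains()` function, so this sketch only emulates the two predicates in plain Python; `class_contains` and `class_equals` are hypothetical helpers, and the class values `class-12` and `class-1 extra` are invented to make the difference visible.

```python
import xml.etree.ElementTree as ET

html = """
<html>
  <body>
    <div id="uid">
      <p class="class-1">Hello World!</p>
      <p class="class-12">Enjoy DataCamp!</p>
    </div>
    <p class="class-1 extra">Thanks for Watching!</p>
  </body>
</html>
"""

root = ET.fromstring(html)


def class_contains(root, needle):
    """Emulate //*[contains(@class, needle)]: substring test on the class attribute."""
    return [el.text for el in root.iter() if needle in el.get("class", "")]


def class_equals(root, value):
    """Emulate //*[@class=value]: the whole attribute value must match."""
    return [el.text for el in root.iter() if el.get("class") == value]


print(class_contains(root, "class-1"))
# ['Hello World!', 'Enjoy DataCamp!', 'Thanks for Watching!']
print(class_equals(root, "class-1"))
# ['Hello World!']
```

The substring test also picks up `class-12`, which is the usual reason to prefer the exact match when class names overlap.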

## 2.3 - scrapy Selector Objects

### Setting up a Selector
```
from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector( text = html )
```

### Selecting Selectors

Use **`xpath`** within **`Selector`**:
```
sel.xpath("//p")
# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
```

### Extracting Data from a SelectorList

**`extract()`**
```
>>> sel.xpath("//p").extract()
out: [ '<p>Hello World!</p>',
       '<p>Enjoy DataCamp!</p>' ]
```

**`extract_first()`**
```
>>> sel.xpath("//p").extract_first()
out: '<p>Hello World!</p>'
```

## 2.4 - Inspecting the Source

### HTML text to Selector

```
from scrapy import Selector
import requests

url = 'https://en.wikipedia.org/wiki/Web_scraping'
html = requests.get( url ).content
sel = Selector( text = html )
```

# Chapter 3 - CSS Locators, Chaining, and Responses



## 3.1 - CSS Locators

1. General rules
  + `/` is replaced by `>` (a leading `/` is simply dropped)
    + XPath: `/html/body/div`
    + CSS Locator: `html > body > div`
  + `//` is replaced by a blank space (a leading `//` is simply dropped)
    + XPath: `//div/span//p`
    + CSS Locator: `div > span p`
  + `[N]` is replaced by `:nth-of-type(N)`
    + XPath: `//div/p[2]`
    + CSS Locator: `div > p:nth-of-type(2)`
2. Conversion example
  + XPath: `xpath = '/html/body//div/p[2]'`
  + CSS Locator: `css = 'html > body div > p:nth-of-type(2)'`
3. Attributes in CSS
  + `.` to find an element by *class*: `p.class-1` selects all paragraph elements belonging to `class-1`
  + `#` to find an element by *id*: `div#uid` selects the `div` element with `id` = `uid`
  + Examples:
    + Select the `p` elements of class `class1` that are children of the `div` with id `uid`: `css_locator = 'div#uid > p.class1'`
    + Select all elements whose `class` attribute includes `class1`: `css_locator = '.class1'`
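
The three general rules can be written down as a tiny converter. This is only a sketch: `xpath_to_css` is a hypothetical helper, not part of scrapy, and it handles nothing beyond `/`, `//`, and `[N]` (attribute predicates such as `[@id="uid"]` are out of scope).

```python
import re


def xpath_to_css(xpath):
    """Translate a simple XPath into a CSS locator using the three rules above."""
    # The leading '/' or '//' has no CSS counterpart, so drop it
    body = re.sub(r"^//?", "", xpath)
    # Split on '//' first so the single '/' rule cannot touch it
    parts = body.split("//")
    # '/' (direct child) becomes ' > '
    parts = [part.replace("/", " > ") for part in parts]
    # '//' (any descendant) becomes a blank space
    css = " ".join(parts)
    # '[N]' becomes ':nth-of-type(N)'
    return re.sub(r"\[(\d+)\]", r":nth-of-type(\1)", css)


print(xpath_to_css("/html/body/div"))        # html > body > div
print(xpath_to_css("//div/span//p"))         # div > span p
print(xpath_to_css("/html/body//div/p[2]"))  # html > body div > p:nth-of-type(2)
```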

### Selectors with CSS

```
>>> sel.css("div > p")
out: [<Selector xpath='...' data='<p>Hello World!</p>'>]

>>> sel.css("div > p").extract()
out: [ '<p>Hello World!</p>' ]
```

### More conversion examples:

```
xpath = '/html/body/span[1]//a'
css_locator = 'html > body > span:nth-of-type(1) a'

xpath = '//div[@id="uid"]/span//h4'
css_locator = 'div#uid > span h4'
```

## 3.2 - Attribute and Text Selection

1. XPath: `<xpath-to-element>/@attr-name` - e.g., `xpath = '//div[@id="uid"]/a/@href'`
2. CSS Locator: `<css-to-element>::attr(attr-name)` - e.g., `css_locator = 'div#uid > a::attr(href)'`
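
Both locators return the attribute values themselves, not the elements. As a dependency-free sketch of that result, the snippet below uses the standard library's `xml.etree.ElementTree`, which has no `/@href` step, so it first locates the `a` elements and then reads the attribute with `.get`. The sample links are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two links inside div#uid, one outside it
html = """
<html>
  <body>
    <div id="uid">
      <a href="https://www.datacamp.com">DataCamp</a>
      <a href="https://www.w3schools.com">w3schools</a>
    </div>
    <a href="https://example.com">outside the div</a>
  </body>
</html>
"""

root = ET.fromstring(html)

# Element part of '//div[@id="uid"]/a/@href': a children of the div with id "uid"
links = root.findall('.//div[@id="uid"]/a')

# Attribute part: read href from each matched element
hrefs = [a.get("href") for a in links]

print(hrefs)  # ['https://www.datacamp.com', 'https://www.w3schools.com']
```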

### Text Extraction

```
<p id="p-example">
  Hello world!
  Try <a href="http://www.datacamp.com">DataCamp</a> today!
</p>
```

* In XPath use `text()`

```
sel.xpath('//p[@id="p-example"]/text()').extract()
# result: ['\n Hello world!\n Try ',' today!\n']

sel.xpath('//p[@id="p-example"]//text()').extract()
# result: ['\n Hello world!\n Try ','DataCamp',' today!\n']
```

* In CSS Locator use `::text`

```
sel.css('p#p-example::text').extract()
# result: ['\n Hello world!\n Try ',' today!\n']

sel.css('p#p-example ::text').extract()
# result: ['\n Hello world!\n Try ','DataCamp',' today!\n']
```


## 3.3 - 

# Chapter 4 - 


## 4.1 - 

## 4.2 - 

## 4.3 - 