## Attribution

These slides were adapted from [the companion notebooks](https://github.com/REMitchell/python-scraping) for [Web Scraping in Python](http://shop.oreilly.com/product/0636920034391.do), which are open sourced and provided for free.  If you are interested in a more detailed presentation of web scraping in Python, this book is a great source.

In [None]:
!pip install composable
!pip install composablesoup

In [None]:
!pip install composable --upgrade
!pip install composablesoup --upgrade

In [4]:
from composablesoup import find, find_all, get_text, has_attr
from composable.sequence import slice, head
from composable.strict import map, filter
from composable.string import replace
from composable import from_toolz as tlz

In [5]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/page3.html')
items_for_sale = BeautifulSoup(r.content, 'html.parser')

## CSS and Styling HTML Pages

In this section, we introduce styling web pages with **Cascading Style Sheets (CSS)**, which is common practice in modern web design.  As a consequence, most, if not all, HTML tags carry attributes that classify and group them, often in a meaningful, contextual way.  These attributes are useful when web scraping, as we will see in the following sections.

### Exploration

1. Go to [this page](http://www.pythonscraping.com/pages/warandpeace.html)
2. Notice that
    1. All of the quotes are colored <font color="#ff5555">red</font>
    2. All of the character names are colored <font color="#55ff55">green</font>
3. Now right click and view the page source.  Look at the `<style>` tag at the top of the page.  *These entries are CSS selectors, which apply style to all matching tags*.
4. Finally, note that
    1. Each quotation is surrounded by `<span class="red">...</span>`
    2. Each name is surrounded by `<span class="green">...</span>`

### CSS Selectors

* A **CSS selector** applies a style to all matching tags.
* The following selector
    * is named `green`
    * applies a <font color="#55ff55">green</font> font color

```
.green{
	color:#55ff55;
}
```

### Applying CSS selectors to HTML tags

* Apply a selector with the `class` attribute.
* We can apply the `green` selector using

```
<span class="green">...</span>
```
* Imagine that `class="green"` has the same effect as the inline style
```
<span style="color:#55ff55;">...</span>
```
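To see how this markup looks from the scraping side, here is a minimal sketch using plain `bs4` on a small inline HTML string. The snippet is made up for illustration and only loosely modeled on the War and Peace page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one style rule and two classed spans.
html = """
<style>.green { color: #55ff55; }</style>
<p><span class="green">Anna Pavlovna</span> said
<span class="red">Well, Prince</span></p>
"""
soup = BeautifulSoup(html, "html.parser")

# `class` can hold several names, so Beautiful Soup returns it as a list.
classes = [span["class"] for span in soup.find_all("span")]
print(classes)
```

The class names that drive the styling are the same values we will search on later.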


### Reading War and Peace

In [None]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/warandpeace.html')
war_and_peace = BeautifulSoup(r.content, "html.parser")

In [None]:
war_and_peace

## Searching for HTML Attributes

We can search for any HTML tag by attribute using `find` and `find_all`.  This method of searching is particularly advantageous on pages styled with CSS, since most or all tags are marked with a `class` attribute, and these attributes often relate to the context of the content.

In this section, we illustrate searching by tag attribute using `find` and `find_all`.

### A note on `find` and `find_all`

* `soup.find` returns the first matching tag
* `soup.find_all` returns a list of all matching tags
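
The difference between the two can be sketched on a tiny inline document. The HTML string here is an assumption made for illustration, and the calls are the plain `bs4` methods rather than the pipeable helpers.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><span class="green">Anna</span> met <span class="green">Vasili</span></p>',
    "html.parser",
)

first = soup.find("span")      # a single Tag: the first match
every = soup.find_all("span")  # a list-like collection of every match

print(first.get_text())
print([tag.get_text() for tag in every])
```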

In [None]:
war_and_peace.find('span')

In [None]:
help(war_and_peace.find)

In [None]:
war_and_peace.find_all('span')[:2]

### Pipeable `find` and `find_all`

The module `composablesoup` provides pipeable helper versions of both functions, which we will use exclusively for readability and composability.
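
For intuition about how a `>>` pipeline can work in Python, here is a minimal sketch built on the reflected right-shift operator. The names `Pipeable`, `head`, and `map_` below are hypothetical stand-ins written for this example; this is not the actual `composable` implementation, which may differ.

```python
class Pipeable:
    """Hypothetical wrapper: `value >> Pipeable(f)` evaluates `f(value)`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, value):
        # Python falls back to this when the left operand cannot handle `>>`.
        return self.func(value)


def head(n):
    """Hypothetical stand-in: keep the first n items of a sequence."""
    return Pipeable(lambda seq: seq[:n])


def map_(func):
    """Hypothetical stand-in: apply func to every item, returning a list."""
    return Pipeable(lambda seq: [func(item) for item in seq])


result = ["Anna", "Vasili", "Pierre"] >> map_(str.upper) >> head(2)
print(result)
```

Each step receives the value produced by the step to its left, so the pipeline reads top to bottom in the order the work happens.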

In [None]:
(war_and_peace 
 >> find('span')
)

In [None]:
(war_and_peace
 >> find_all('span')
 >> head(2)
)

### Use `find_all` when 

* There might be multiple instances
* (almost always, it's a safer option)

### Use `find` when 

* You know there is exactly one instance
* You know you really only want the first
* (almost never, `find_all` is almost always better)
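
One reason `find_all` is the safer default can be sketched with a page that has no matches at all. The inline HTML is made up for illustration, and the calls are plain `bs4`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>No spans here</p>", "html.parser")

missing_one = soup.find("span")      # None when nothing matches
missing_all = soup.find_all("span")  # empty list-like result when nothing matches

# Looping over an empty result is harmless; calling a method on None is not.
texts = [tag.get_text() for tag in missing_all]
try:
    missing_one.get_text()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(missing_one, texts, error_name)
```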

### Two ways to search tag attributes

* Dictionary: `bs.find_all('span', {'class': 'green'})`
* Keyword: `bs.find_all('span', class_='green')`

**Note:** We use the keyword `class_` here because `class` is a reserved Python keyword, used only to define classes.  Other attributes, like `src`, do not need the trailing `_`.
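
As a quick check that the two forms agree, here is a short sketch with plain `bs4` on an inline HTML string invented for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<span class="green">Anna</span><span class="red">Well, Prince</span>',
    "html.parser",
)

by_dict = soup.find_all("span", {"class": "green"})  # attribute dictionary
by_keyword = soup.find_all("span", class_="green")   # keyword argument

print(by_dict == by_keyword)
```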

### Getting all names using an attribute dictionary

In [None]:
(war_and_peace
 >> find_all('span', attrs = {'class':'green'})
 >> head(3)
)

### Cleaning up the name tags

In [None]:
(war_and_peace
 >> find_all('span', attrs = {'class':'green'})
 >> map(get_text)
 >> head(3)
)

In [None]:
(war_and_peace
 >> find_all('span', attrs = {'class':'green'})
 >> map(get_text)
 >> map(replace('\n', ' '))
 >> head(3)
)
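
For comparison, the same cleanup can be sketched as an ordinary list comprehension with plain `bs4` calls. The HTML string below is a made-up stand-in for the real page, with a line break inside each name.

```python
from bs4 import BeautifulSoup

html = (
    '<p><span class="green">Anna\nPavlovna</span> and '
    '<span class="green">Prince\nVasili</span></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Extract the text of each name tag and replace embedded newlines with spaces.
names = [
    tag.get_text().replace("\n", " ")
    for tag in soup.find_all("span", {"class": "green"})
][:3]
print(names)
```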

In [None]:
from composable.string import split
from composable import pipeable

lower = pipeable(lambda s: s.lower())

(war_and_peace
 #>> head(5)
 >> find_all('span', attrs = {'class':'green'})
 >> map(get_text)
 >> map(replace('\n', ' '))
 >> map(split(' '))
 >> map(map(lower))
)

### Getting all quotes using an attribute dictionary

In [None]:
(war_and_peace
 >> find_all('span', attrs = {'class':'red'})
 >> head(2)
)

<font color="red"><h2>Exercise 1</h2></font>

Write a list comprehension to 

1. Pull each quote out of the `span` tag.
2. Wrap the quote in `"`

In [None]:

quote_concat = pipeable(lambda s: '"' + s + '"')

(war_and_peace
>> find_all('span', attrs = {'class':'red'})
>> map(get_text)
>> map(quote_concat)
>> head(3)
)


## Getting Data From Tag Attributes

Other, non-CSS attributes also carry useful information. For example,

* the `src` attribute in `img` tags
* the `href` attribute in `a` tags

In this section, we will look at pulling this information out of a tag.

### Reading the Wikipedia Web Scraping Page

In [None]:
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Start a session
r = s.get('https://en.wikipedia.org/wiki/Web_scraping') # Get a static page
web_scraping = BeautifulSoup(r.content, "html.parser")

### Step 1 - Search for All `a` Tags

In [None]:
(web_scraping
 >> find_all('a')
 >> head(10)
)

### Accessing Attribute Data Looks Like Indexing

* **Syntax:** `tag[attribute_string]`
* This returns the corresponding data
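
A minimal sketch of this indexing syntax with plain `bs4`, using a single hypothetical link rather than the live Wikipedia page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a>',
    "html.parser",
)
link = soup.find("a")

print(link["href"])  # index the tag by attribute name
print(link.attrs)    # all attributes of the tag as a dictionary
```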

In [None]:
example_a_tag1 = (web_scraping
                 >> find_all('a')
                 >> head(3)
                 >> tlz.get(1)
                )
example_a_tag1

In [None]:
#example_a_tag1['href']
example_a_tag1 >> tlz.get('href')

### Searching for Nonexistent Attributes is BAD

* If the attribute doesn't exist, indexing raises a `KeyError`

In [None]:
example_a_tag2 = (web_scraping
                 >> find_all('a')
                 >> head(3)
                 >> tlz.get(0)
                )
example_a_tag2

In [None]:
example_a_tag2['href']
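
The failure can be reproduced offline with a small sketch. The anchor tag below is invented for illustration; `Tag.get` is shown as one possible way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="top"></a>', "html.parser")
anchor = soup.find("a")

try:
    anchor["href"]  # this tag has no href attribute
    raised = False
except KeyError:
    raised = True

fallback = anchor.get("href")      # None instead of an exception
default = anchor.get("href", "#")  # or a default value of our choosing
print(raised, fallback, default)
```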

### Using a filter to avoid exceptions

* We can use a filter to skip tags that lack the attribute, which avoids the exception
* Use the `has_attr` Tag method

In [None]:
(web_scraping
 >> find_all('a')
 >> filter(has_attr('href'))
 >> head(3)
)

In [None]:
(web_scraping
 >> find_all('a')
 >> filter(has_attr('href'))
 >> map(tlz.get('href'))
 >> head(10)
)
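
The same filtering can be sketched in plain `bs4`, either with `has_attr` inside a comprehension or by passing `href=True` so that only tags carrying the attribute are returned. The three links below are made up for this example.

```python
from bs4 import BeautifulSoup

html = '<a id="top"></a><a href="/wiki/HTML">HTML</a><a href="/wiki/CSS">CSS</a>'
soup = BeautifulSoup(html, "html.parser")

# Option 1: filter explicitly with has_attr.
with_filter = [a["href"] for a in soup.find_all("a") if a.has_attr("href")]

# Option 2: let find_all keep only tags that have an href attribute.
with_keyword = [a["href"] for a in soup.find_all("a", href=True)]

print(with_filter)
```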

<font color="red"><h2>Exercise 2</h2></font>

Write a list comprehension to get the `src` for all `img` tags on the Wikipedia site.

In [None]:
# Keep only img tags that carry a src attribute, then pull out each src.
imgs = web_scraping >> find_all('img') >> filter(has_attr('src'))
[img['src'] for img in imgs]


<font color="red"><h2>Exercise 3</h2></font>

Get all image `src` and link `href` from your Assignment 1 website.

## More Complicated Searches

Next, we will

* Search for multiple tags at once
* Search for more than one class

### Searching for a list of tags

Using a list of tags with `find_all` returns all such tags.

In [None]:
(war_and_peace
 >> find_all(['h1', 'h2','h3','h4','h5','h6'])
)
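
A small offline sketch of the same idea with plain `bs4`, on an inline document invented for this example:

```python
from bs4 import BeautifulSoup

html = "<h1>War and Peace</h1><p>intro</p><h2>Chapter 1</h2><p>text</p>"
soup = BeautifulSoup(html, "html.parser")

# A list of tag names matches any tag whose name is in the list.
headings = soup.find_all(["h1", "h2", "h3"])
found = [(h.name, h.get_text()) for h in headings]
print(found)
```

Matches come back in document order, regardless of the order of names in the list.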

### Matching more than one attribute value

We can match more than one `class` using a set of attribute values.

In [None]:
(war_and_peace
 >> find_all('span', attrs = {'class':{'green', 'red'}})
 >> head(3)
)
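
Here is a comparable sketch in plain `bs4`. It passes a list of class names rather than a set, and the three spans are hypothetical.

```python
from bs4 import BeautifulSoup

html = (
    '<span class="green">Anna</span>'
    '<span class="red">Well, Prince</span>'
    '<span class="blue">aside</span>'
)
soup = BeautifulSoup(html, "html.parser")

# A tag matches when its class is any one of the listed values.
matches = soup.find_all("span", class_=["green", "red"])
matched_text = [tag.get_text() for tag in matches]
print(matched_text)
```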

### Searching tag text only

We can search the text inside tags, rather than the tags themselves, using the `text` keyword.

In [None]:
(war_and_peace
 >> find_all(None, text='the prince')
)
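
One detail worth knowing is that a plain string must equal the whole piece of text, while a compiled regular expression can match part of it. The sketch below uses plain `bs4` and the `string` keyword, which is the name used for the `text` argument since Beautiful Soup 4.4.0. The sentence is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = '<p>I met <span class="green">the prince</span> today; the prince is kind.</p>'
soup = BeautifulSoup(html, "html.parser")

# A plain string only matches text that is exactly "the prince".
exact = soup.find_all(string="the prince")

# A regular expression matches any text that contains the phrase.
partial = soup.find_all(string=re.compile("the prince"))

print(len(exact), len(partial))
```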

### Text search returns a `NavigableString`

* More than plain text
* Allows access to the surrounding tags

In [None]:
(war_and_peace
 >> find_all(None, text='the prince')
 >> map(type)
)

### Getting the surrounding tag with `parent`

More information on parent tags is on the way.

In [None]:
(war_and_peace
 >> find_all(None, text='the prince')
 >> map(lambda ns: ns.parent)
)
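
As a self-contained sketch of the last two ideas, the following uses plain `bs4` on a made-up sentence to show that a text match behaves like a string and still knows its enclosing tag.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p>I met <span class="green">the prince</span> today.</p>'
soup = BeautifulSoup(html, "html.parser")

hit = soup.find(string="the prince")

# A NavigableString is a str subclass that remembers its place in the tree.
is_navigable = isinstance(hit, NavigableString)
is_plain_str = isinstance(hit, str)

parent = hit.parent  # the tag that directly contains the text
print(parent.name, parent["class"])
```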