## Attribution

These slides were adapted from [the companion notebooks](https://github.com/REMitchell/python-scraping) for [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), which are open source and freely available. If you are interested in a more detailed treatment of web scraping in Python, this book is a great source.

In [None]:
# Install if needed
!pip install composable
!pip install composablesoup

In [None]:
# Upgrade if already installed
!pip install composable --upgrade
!pip install composablesoup --upgrade

In [None]:
from composable import pipeable
from composable.strict import map, filter
from composablesoup import find, find_all, get_text, has_attr
from composablesoup.soup import (find_parent, parents, children,
                                 find_previous_sibling, find_previous_siblings,
                                 find_next_sibling, find_next_siblings)
from composable.sequence import to_list, head
from composable.string import strip
from composable import from_toolz as tlz

## Parents, Children and Siblings

Beautiful Soup objects keep references to all surrounding tags, and we will need to exploit these relationships when we can't find a tag through a direct search. In this section, we will investigate these relationships and use them to access the desired tags.

### Definitions

Tags can be related to one another in the following ways.

* **Parent:** The closest surrounding tag
* **Children:** All tags immediately inside a tag
    * EXACTLY one level deep
* **Descendants:** All embedded tags
    * ANY depth
* **Siblings:** All tags on the same level
    * i.e. all children of the surrounding tag.
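
The four relationships above can be sketched on a tiny document. This is a minimal illustration using Beautiful Soup's own attributes and methods rather than the `composablesoup` pipes; the HTML snippet and variable names are made up for the example and are not taken from the working page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, written without whitespace between tags so that
# no whitespace-only text nodes show up among the children.
html = '<div id="outer"><ul><li>a</li><li><b>b</b></li><li>c</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.find('ul')

# Parent: the closest surrounding tag
parent_name = ul.parent.name

# Children: exactly one level deep
child_names = [c.name for c in ul.children]

# Descendants: any depth (text nodes are filtered out here)
descendant_names = [d.name for d in ul.descendants if isinstance(d, Tag)]

# Siblings: the other children of the same parent
middle = ul.find_all('li')[1]
sibling_texts = ([s.get_text() for s in middle.find_previous_siblings()]
                 + [s.get_text() for s in middle.find_next_siblings()])

print(parent_name, child_names, descendant_names, sibling_texts)
```

Note that the `<b>` tag is a descendant of `<ul>` but not a child, because it sits two levels down.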

### Working example

Please visit [this page](http://www.pythonscraping.com/pages/page3.html) and inspect the source.

In [None]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/page3.html')
items_for_sale = BeautifulSoup(r.content, 'html.parser')

### Plotting the DOM

* HTML
    * body
        * div#wrapper
            * h1
            * div.content
            * table#giftList
                * tr
                    * th
                    * th
                    * th
                    * th
                * tr.gift#gift1
                    * td
                    * td
                        * span.excitingNote
                    * td
                    * td
                        * img
                *  ... table continues ...
            * div.footer

<font color="red"><h2>Exercise 1</h2></font>

Identify the parents of

1. `table#giftList`
2. `span.excitingNote`

> Your answer here

<font color="red"><h2>Exercise 2</h2></font>

Identify the children of

1. `table#giftList`
2. `tr.gift#gift1`

> Your answer here

<font color="red"><h2>Exercise 3</h2></font>

Describe (in words) the descendants of `table#giftList`

> Your answer here

<font color="red"><h2>Exercise 4</h2></font>

Identify the siblings of `tr.gift#gift1`

> Your answer here

### Stepping up a level with `find_parent`

We can access the parent of any tag using `find_parent`.

In [None]:
(items_for_sale
 >> find('tr', class_ = 'gift')
 >> find_parent
)

### Stepping up two levels with 2*`find_parent`

Applying `find_parent` twice will step us up two levels.

In [None]:
(items_for_sale
 >> find('tr', class_ = 'gift')
 >> find_parent
 >> find_parent
)
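
For comparison, here is a rough sketch of the same one-level and two-level steps in plain Beautiful Soup, where each tag exposes a `.parent` attribute and a `find_parent()` method. The HTML is a hypothetical stand-in for the gift table, not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the gift table (assumed structure).
html = ('<div id="wrapper"><table id="giftList">'
        '<tr class="gift" id="gift1"><td>Basket</td></tr>'
        '</table></div>')
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr', class_='gift')

# One level up: the table that contains the row
one_up = row.parent

# Two levels up: the div that contains the table
two_up = row.parent.parent

print(one_up.name, two_up['id'])
```

With no arguments, `find_parent()` returns the same tag as `.parent`, so either spelling can be chained to climb the tree.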

### Searching for a specific parent

We can also use `find_parent` to search for the closest parent that fits some description.

In [None]:
(items_for_sale
 >> find('tr', class_ = 'gift')
 >> find_parent(name='div', attrs={'id':'wrapper'})
)
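
A small sketch of how a filtered parent search behaves in plain Beautiful Soup: the search walks upward and returns the first enclosing tag that matches, or `None` if nothing matches. The nested `div#inner` is an assumption added to the example so that the difference is visible; it does not exist on the working page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two nested divs around the table.
html = ('<div id="wrapper"><div id="inner"><table id="giftList">'
        '<tr class="gift" id="gift1"><td>Basket</td></tr>'
        '</table></div></div>')
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr', class_='gift')

# Closest enclosing div of any kind
nearest_div = row.find_parent('div')

# Closest enclosing div with a specific id, skipping div#inner
wrapper = row.find_parent('div', attrs={'id': 'wrapper'})

# No enclosing tag matches, so the result is None
missing = row.find_parent('section')

print(nearest_div['id'], wrapper['id'], missing)
```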

### Searching for children

Note that we are using `find` (why?) before piping into `children`.

In [None]:
(items_for_sale
 >> find('table',attrs={'id':'giftList'})
 >> children
)
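
One thing to watch for when listing children: in plain Beautiful Soup, `.children` yields text nodes as well as tags, so the newlines between rows show up as children too. The sketch below uses a hypothetical two-row table to show this and one way to keep only the tags; whether the `children` pipe filters these out is not assumed here.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with newlines between the rows, as in typical page source.
html = """<table id="giftList">
<tr><th>Item</th></tr>
<tr class="gift" id="gift1"><td>Basket</td></tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'id': 'giftList'})

# Every direct child, including whitespace-only strings
all_children = list(table.children)

# Only the direct children that are tags
tag_children = [c for c in table.children if isinstance(c, Tag)]

print(len(all_children), [c.name for c in tag_children])
```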

### Accessing the previous siblings

* `find_previous_sibling` returns the closest previous sibling
* `find_previous_siblings` returns all previous siblings

In [None]:
(items_for_sale 
 >> find('tr', id = 'gift3')
 >> find_previous_sibling
)

In [None]:
(items_for_sale 
 >> find('tr', id = 'gift3')
 >> find_previous_siblings
)

### Accessing the next siblings

* `find_next_sibling` returns the closest following sibling
* `find_next_siblings` returns all following siblings

In [None]:
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_sibling
)

In [None]:
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_siblings
)
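
The four sibling lookups can be compared side by side in plain Beautiful Soup. This is a sketch on a hypothetical table with made-up row ids; note that the previous siblings come back nearest first, which is the reverse of document order.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row followed by four gift rows.
html = ('<table id="giftList">'
        '<tr id="header"></tr>'
        '<tr class="gift" id="gift1"></tr>'
        '<tr class="gift" id="gift2"></tr>'
        '<tr class="gift" id="gift3"></tr>'
        '<tr class="gift" id="gift4"></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr', id='gift3')

prev_one = row.find_previous_sibling()['id']
prev_all = [r['id'] for r in row.find_previous_siblings()]
next_one = row.find_next_sibling()['id']
next_all = [r['id'] for r in row.find_next_siblings()]

print(prev_one, prev_all, next_one, next_all)
```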

### Searching the previous and next siblings

We can also pass search criteria to these four functions to find specific tags.

In [None]:
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_sibling(attrs={'id':'gift4'})
)

In [None]:
import re
four_or_five = re.compile('(gift4|gift5)')
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_siblings(attrs={'id':four_or_five})
)
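
A caveat when filtering with a compiled pattern: Beautiful Soup tests the attribute value with the pattern's `search`, so an unanchored pattern such as `(gift4|gift5)` also matches longer ids that merely contain those strings. The sketch below uses a hypothetical `gift40` row (not on the working page) to show the difference an anchored pattern makes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical rows; gift40 is invented to show the over-match.
html = ('<table id="giftList">'
        '<tr class="gift" id="gift1"></tr>'
        '<tr class="gift" id="gift4"></tr>'
        '<tr class="gift" id="gift40"></tr>'
        '<tr class="gift" id="gift5"></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')
first = soup.find('tr', class_='gift')

unanchored = re.compile('(gift4|gift5)')
anchored = re.compile(r'^gift[45]$')

loose = [r['id'] for r in first.find_next_siblings(attrs={'id': unanchored})]
strict = [r['id'] for r in first.find_next_siblings(attrs={'id': anchored})]

print(loose, strict)
```

On the working page the unanchored pattern is fine as long as no other id contains `gift4` or `gift5`; anchoring is only a safeguard.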

<font color="red"><h2>Exercise 5</h2></font>

* Look at the site source again, 
    * specifically item prices.
* How can we get to these prices?

### Using relationships to find unlabeled data


* tr.gift#gift1
    * td
    * td
    * td
        * "$15.00"
    * td
        * `<img src="/img/gifts/img1.jpg"/>`

In [None]:
(items_for_sale
 >> find('tr', id = 'gift1')
)

In [None]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
)

In [None]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
)

In [None]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
 >> find_previous_sibling
)

In [None]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
 >> find_previous_sibling
 >> get_text
)

In [None]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
 >> find_previous_sibling
 >> get_text
 >> strip
)
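
For reference, the same walk (image, up to its cell, back one cell, text, strip) might look like this in plain Beautiful Soup. The row below is a hypothetical reconstruction of `gift1` with whitespace added between cells; it also shows why the `find_previous_sibling` form is used, since the bare `.previous_sibling` attribute can land on a whitespace string instead of a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the gift1 row (cell contents assumed).
html = ('<table id="giftList"><tr class="gift" id="gift1">'
        '<td>Vegetable Basket</td><td>Description</td>'
        '<td>\n$15.00\n</td>\n'
        '<td><img src="/img/gifts/img1.jpg"/></td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')

image_cell = soup.find('tr', id='gift1').find('img').parent

# The attribute returns the newline that sits between the two cells
raw_previous = image_cell.previous_sibling

# The method skips text nodes and returns the previous tag
price_cell = image_cell.find_previous_sibling()
price = price_cell.get_text().strip()

print(repr(raw_previous), price)
```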

<font color="red"><h2>Exercise 6</h2></font>

See if you can get all of the prices with one pipe.