## Attribution

These slides were adapted from [the companion notebooks](https://github.com/REMitchell/python-scraping) for [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), which are open source and freely available.  If you are interested in a more detailed presentation of web scraping in Python, this book is a great resource.

In [None]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.30.0-py3-none-any.whl.metadata (7.5 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.29.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting attrs>=23.2.0 (from trio~=0.17->selenium)
  Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading selenium-4.30.0-py3-none-any.whl (9.4 MB)

In [1]:
!pip install composable



In [2]:
!pip install toolz --upgrade
!pip install composable --upgrade

Collecting toolz
  Downloading toolz-1.0.0-py3-none-any.whl.metadata (5.1 kB)
Downloading toolz-1.0.0-py3-none-any.whl (56 kB)
Installing collected packages: toolz
  Attempting uninstall: toolz
    Found existing installation: toolz 0.12.0
    Uninstalling toolz-0.12.0:
      Successfully uninstalled toolz-0.12.0
Successfully installed toolz-1.0.0


In [3]:
from composable.strict import map, filter
from composable.utility import get, apply
from composable.object import obj, attr
import composable.tuples as tup

## Topic 1 - Two types of web pages

1. A **static** webpage (A) contains all of its data in the initial HTML response and (B) does not use JavaScript to load or modify that content.
2. A **dynamic** webpage (A) uses JavaScript (via `script` tags) to change the initial page, so (B) the desired data might only appear after the initial load.

### How to load each type of page

1. **Static pages.** Load using the `requests` library, as it is less cumbersome and doesn't require interfacing with a browser.
2. **Dynamic pages.** For a dynamic page, you might
   1. Be able to get the desired data using just `requests`, but it is *much* more likely that you will need to
   2. Use `selenium` to programmatically open a browser and get the HTML after the scripts have run.

### Examples - Inspect the source of each page

1. **Static page.** [War and Peace](http://www.pythonscraping.com/pages/warandpeace.html)
2. **Dynamic page.** [My Gas Buddy](https://www.gasbuddy.com/gasprices/illinois/chicago)

**Tasks.** 

1. Use a browser to view the page source of each page,
2. Note the presence/absence of any script tags,
3. Note the similarity/difference between the loaded content and page source.
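
The script-tag check in task 2 can also be done programmatically. Below is a minimal sketch using only the standard library's `html.parser`; the `ScriptCounter` class, the `count_scripts` helper, and the two sample HTML strings are illustrative assumptions and not part of the original notebooks. A page with no `script` tags is probably static, but a page with scripts is not necessarily dynamic, so treat the counts as a hint only.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Count <script> tags, and how many of them load an external file."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "script":
            self.total += 1
            if any(name == "src" for name, _ in attrs):
                self.external += 1


def count_scripts(html):
    """Return (total script tags, script tags with a src attribute)."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.total, counter.external


# Hypothetical sample pages, used only for illustration
static_html = "<html><body><h1>War and Peace</h1></body></html>"
dynamic_html = (
    '<html><head><script async src="https://example.com/app.js"></script>'
    "<script>var x = 1;</script></head><body></body></html>"
)

print(count_scripts(static_html))   # (0, 0)
print(count_scripts(dynamic_html))  # (2, 1)
```
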

### Reading War and Peace

#### Example 1 - Reading with `requests`

This page is static, so we can access it with `requests`.

In [4]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/warandpeace.html')
(war_and_peace := BeautifulSoup(r.content, "html.parser"))

<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the firs

#### Example 2 - Loading a dynamic page with `selenium`

Next, we will load the My Gas Buddy page for Chicago, which is a dynamic page that needs `selenium`.

In [5]:
# Show that requests alone doesn't work (the site refuses the request)
s = requests.Session()
r = s.get('https://www.gasbuddy.com/gasprices/illinois/chicago')
(gb_chicago := BeautifulSoup(r.content, "html.parser"))

Forbidden

In [6]:
# Now with selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Initialize the Selenium WebDriver
driver = webdriver.Chrome()

# Navigate to the desired URL
driver.get('https://www.gasbuddy.com/gasprices/illinois/chicago')

# Crude fixed pause to give the page's scripts time to run
# (a fixed sleep does not guarantee the page has finished loading)
time.sleep(1)

# Get the page source, which will be passed to BeautifulSoup
gb_html = driver.page_source

# Close the browser
driver.close()

In [7]:
(gb_chicago := BeautifulSoup(gb_html, "html.parser"))

<html class="smartbanner-margin-top" lang="en-US"><head>
<!-- OneTrust Cookies Consent Notice start -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-V2LZ5H9RW6&amp;cx=c&amp;_slc=1" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/analytics.js" type="text/javascript"></script><script async="" src="https://www.googletagmanager.com/gtag/js?id=G-V2LZ5H9RW6&amp;cx=c&amp;gtm=45He53o2h2v78727790za200&amp;tag_exp=102482433~102788824~102803279~102813109~102887800~102926327" type="text/javascript"></script><script async="" src="//www.googletagmanager.com/gtm/js?id=GTM-N3CG6XK"></script><script src="//web.localytics.com/v4/localytics.min.js" type="text/javascript"></script><script async="" src="//c.amazon-adsystem.com/aax2/apstag.js"></script><script charset="UTF-8" data-domain-script="5912756a-9c9a-429f-9243-93932a946a02" src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" type="text/javascript"></script>
<script type="text/ja

## Topic 2 - Searching for HTML Tags and Attributes

We can search for any HTML tag by attribute using `find` and `find_all`.  This method of searching is particularly advantageous when dealing with pages that are styled using CSS, as most or all tags will be marked with a `class` attribute, and these attributes are often related to the meaning of the content.

In this section, we will illustrate searching with tag attributes using `find` and `find_all`.

### A note on `find` and `find_all`

* `soup.find` returns the first matching tag
* `soup.find_all` returns a list of all matching tags

In [8]:
war_and_peace.find('span')

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In [9]:
war_and_peace.find_all('span')[:4]

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>]

### pipeable `find` and `find_all`

Since all `bs4` objects come with the `find` and `find_all` methods, we can apply these methods in a pipe by using the pipeable `obj` to make the method call. 

In [10]:
(war_and_peace 
 >> obj.find('span')
)

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In [11]:
(war_and_peace
 >> obj.find_all('span')
) >> tup.head(4)

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>]

### `find` VERSUS `find_all`

Next, we will look at picking between the two find methods.

#### Use `find_all` when 

* There might be multiple instances
* (almost always, it's a safer option)

#### Use `find` when 

* You know there is exactly one instance
* You know you really only want the first
* (almost never, `find_all` is almost always better)
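
One reason `find_all` is the safer choice shows up when nothing matches. The sketch below uses a small made-up HTML string (an assumption for illustration, not one of the pages above): `find` returns `None`, so chaining a method call fails, while `find_all` returns an empty list that downstream code can loop over harmlessly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = '<div><span class="green">Anna Pavlovna</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find returns None when nothing matches ...
missing = soup.find("table")
print(missing)  # None

# ... so chaining a method call onto the result raises AttributeError
try:
    soup.find("table").get_text()
except AttributeError as error:
    print("find failed:", error)

# find_all returns an empty list instead, so the comprehension simply does nothing
tables = soup.find_all("table")
texts = [t.get_text() for t in tables]
print(texts)  # []
```
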

### Two ways to search tag attributes

* Dictionary: `bs.find_all('span', {'class': 'green'})`
* Keyword: `bs.find_all('span', class_ = 'green')`

**Note:** We use the keyword `class_` here because `class` is a reserved Python keyword that is only used to define classes.  Other attributes, like `src`, do not need the trailing `_`.
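
As a quick check of the two forms, the sketch below runs both against a small made-up HTML string (the markup and variable names are assumptions for illustration). It also shows attributes such as `href` and `id` being passed as plain keywords with no trailing underscore.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = """
<div id="text">
  <span class="red">Well, Prince</span>
  <span class="green">Anna Pavlovna</span>
  <a href="/wiki/Main_Page">Main page</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary form and keyword form select the same tags
by_dict = soup.find_all("span", {"class": "green"})
by_keyword = soup.find_all("span", class_="green")
same = [t.get_text() for t in by_dict] == [t.get_text() for t in by_keyword]
print(same)  # True

# Attribute names that are not Python keywords need no trailing underscore
main_links = soup.find_all("a", href="/wiki/Main_Page")
text_divs = soup.find_all("div", id="text")
print(main_links[0].get_text())  # Main page
```
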

### Searching for specific attributes

`bs4` provides two ways to search through HTML tag attributes, both passed as keyword arguments.

1. Using `attrs = {'ATTR':'VALUE', ...}`.
2. Using `ATTR = VALUE`.

#### Pure Python using `attrs = ...`

In [12]:
(proper_names := 
[obj for obj in war_and_peace.find_all('span', attrs = {'class':'green'})][:2]
)[:2]

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>]

#### Pure Python using `class_ = ...`

In [13]:
(proper_names := 
[obj for obj in war_and_peace.find_all('span', class_ = 'green')][:2]
)[:2]

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>]

#### Composable pipe using `attrs = ...`

In [14]:
(war_and_peace
 >> obj.find_all('span', attrs = {'class':'green'})
) >> tup.head(3)

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>]

#### Composable pipe using `class_ = ...`

In [15]:
(war_and_peace
 >> obj.find_all('span', class_ = 'green')
) >> tup.head(3)

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>]

### We can use `attrs = {...}` to search multiple attributes

#### Pure Python using multiple `attrs`

In [16]:
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Start a session
r = s.get('https://en.wikipedia.org/wiki/Web_scraping') # Get a static page
web_scraping = BeautifulSoup(r.content, "html.parser")

In [17]:
web_scraping.find_all('a', 
                      attrs = {'class':'cdx-button',
                               'href':'#'
                               }
                      )

[<a class="cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" data-event-name="talk-sticky-header" href="#" id="ca-talk-sticky-header" tabindex="-1"><span class="vector-icon mw-ui-icon-speechBubbles mw-ui-icon-wikimedia-speechBubbles"></span>
 <span></span>
 </a>,
 <a class="cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" data-event-name="subject-sticky-header" href="#" id="ca-subject-sticky-header" tabindex="-1"><span class="vector-icon mw-ui-icon-article mw-ui-icon-wikimedia-article"></span>
 <span></span>
 </a>,
 <a class="cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" data-event-name="history-sticky-header" href="#" id="ca-history-sticky-header" tabindex="-1"><span class="vector-icon mw-ui-icon-wikimedia-history mw-ui-icon-wikimedia-wikimedia-history"></span>
 <span></span>
 </a>,
 <

### Using `get_text` to extract all text between the outer tags.

Another important `bs4` method is `get_text`, which allows us to grab all the text contained within the outer tags of our object.

In [18]:
(proper_names := 
[obj.get_text() for obj in war_and_peace.find_all('span', attrs = {'class':'green'})][:2]
)[:2]

['Anna\nPavlovna Scherer', 'Empress Marya\nFedorovna']

In [19]:
(war_and_peace
 >> obj.find_all('span', attrs = {'class':'green'})
 >> map(obj.get_text())
) >> tup.head(3)

['Anna\nPavlovna Scherer', 'Empress Marya\nFedorovna', 'Prince Vasili Kuragin']

### Cleaning up the names

Next, let's clean up the names by replacing the newline characters with spaces.

#### Pure Python

In [20]:
(proper_names := 
    [obj.get_text().replace('\n', ' ')
    for obj in war_and_peace.find_all('span', attrs = {'class':'green'})][:2]
)[:2]

['Anna Pavlovna Scherer', 'Empress Marya Fedorovna']

#### Composable pipe

In [21]:
(proper_names := 
  war_and_peace
  >> obj.find_all('span', attrs = {'class':'green'})
  >> map(obj.get_text())
  >> map(obj.replace('\n', ' '))
) >> tup.head(2)

['Anna Pavlovna Scherer', 'Empress Marya Fedorovna']
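
Replacing `'\n'` works for this page, but scraped text can also contain tabs, carriage returns, or runs of spaces. A slightly more general alternative, sketched here with a hypothetical helper named `normalize_whitespace`, splits on any whitespace and rejoins with single spaces. The sample names are copied from the `get_text` output above.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Raw names as returned by get_text in the cells above
raw_names = ['Anna\nPavlovna Scherer', 'Empress Marya\nFedorovna', 'Prince Vasili Kuragin']

clean_names = [normalize_whitespace(name) for name in raw_names]
print(clean_names)
# ['Anna Pavlovna Scherer', 'Empress Marya Fedorovna', 'Prince Vasili Kuragin']
```
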

<font color="red"><h2>Exercise 5.2.1</h2></font>

**Goal.** Find and clean up all the quotes.

**Tasks.** Use both pure Python and a composable pipe to perform the following.
1. Use `find_all` to grab all the relevant `span` tags,
2. Pull each quote out of the `span` tag, and
3. Wrap the quote in `"`.

In [22]:
# Your pure python solution

In [35]:
wrap_in_quote = lambda s: f'"{s}"'

(quotes :=
    [wrap_in_quote(t.get_text().replace('\n', ' ')) for t in war_and_peace.find_all('span', class_='red')]
)

['"Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don\'t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist- I really believe he is Antichrist- I will have nothing more to do with you and you are no longer my friend, no longer my \'faithful slave,\' as you call yourself! But how do you do? I see I have frightened you- sit down and tell me all the news."',
 '"If you have nothing better to do, Count [or Prince], and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10- Annette Scherer."',
 '"Heavens! what a virulent attack!"',
 '"First of all, dear friend, tell me how you are. Set your friend\'s mind at rest,"',
 '"Can one be well while suffering morally? Can one be calm in times like these if one has any feeling?"',
 '"You are staying the whole evening, I hope?"',
 '"And the fete at th

In [36]:
# Your composable pipe solution

In [37]:
(quotes :=
 war_and_peace
 >> obj.find_all('span', class_ = 'red')
 >> map(obj.get_text())
 >> map(obj.replace('\n', ' '))
 >> map(lambda s: f'"{s}"')
 )

['"Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don\'t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist- I really believe he is Antichrist- I will have nothing more to do with you and you are no longer my friend, no longer my \'faithful slave,\' as you call yourself! But how do you do? I see I have frightened you- sit down and tell me all the news."',
 '"If you have nothing better to do, Count [or Prince], and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10- Annette Scherer."',
 '"Heavens! what a virulent attack!"',
 '"First of all, dear friend, tell me how you are. Set your friend\'s mind at rest,"',
 '"Can one be well while suffering morally? Can one be calm in times like these if one has any feeling?"',
 '"You are staying the whole evening, I hope?"',
 '"And the fete at th

## Topic 3 - Getting Data From Tag Attributes

Other tags have information embedded in their non-CSS attributes. For example,

* `src` attribute in `img` tags
* `href` attribute in `a` tags.

In this section, we will look at pulling this information out of a tag.

### Reading the Wikipedia Web Scraping Page

In [38]:
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Start a session
r = s.get('https://en.wikipedia.org/wiki/Web_scraping') # Get a static page
web_scraping = BeautifulSoup(r.content, "html.parser")

### Step 1 - Search For All `a` Tags

In [39]:
(web_scraping
 >> obj.find_all('a')
) >> tup.head(3)

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>]

### Accessing Attribute Data Looks Like Indexing

* **Syntax:** `tag[attribute_string]`
* This returns the corresponding data

#### First, let's grab an example tag

In [40]:
# Pure Python
(example_a_tag1 := web_scraping.find_all('a')[1])

<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>

In [41]:
# Composable pipe
(example_a_tag1 := 
 web_scraping
  >> obj.find_all('a')
  >> get(1)
 )


<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>

#### Access the attribute data by keyword (like in a `dict`)

In [42]:
# Pure Python
example_a_tag1['href']

'/wiki/Main_Page'

In [43]:
# Composable pipe
example_a_tag1 >> get('href')

'/wiki/Main_Page'

### Accessing Nonexistent Attributes is BAD

* If the attribute doesn't exist, we will get an exception

In [44]:
example_a_tag1

<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>

In [45]:
# Error: this tag has no `class` attribute, so indexing raises a KeyError (as with a missing dict key)
example_a_tag1['class']

KeyError: 'class'

In [None]:
example_a_tag1 >> get('class')

KeyError: 'class'

### How to deal with (possibly) missing attributes

1. YOLO,
2. Apply a `filter` using the `has_attr` method,
3. Add a default value to `get`.
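
Options 2 and 3 can be previewed on a tiny example before running them on the full page. The sketch below is an assumption-laden illustration: the two-link HTML string is made up, and it uses `bs4`'s own `Tag.get` method (which behaves like `dict.get`) rather than the `get` helper imported from `composable`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one link with a class attribute, one without
html = (
    '<a class="mw-jump-link" href="#bodyContent">Jump to content</a>'
    '<a href="/wiki/Main_Page">Main page</a>'
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# Option 2: keep only the tags that actually have the attribute
with_class = [a["class"] for a in links if a.has_attr("class")]
print(with_class)  # [['mw-jump-link']]

# Option 3: keep every tag, substituting a default when the attribute is missing
all_classes = [a.get("class", []) for a in links]
print(all_classes)  # [['mw-jump-link'], []]
```

Filtering shortens the result, so positions no longer line up with the original list of tags; supplying a default keeps one entry per tag.
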

### Pure Python solutions

1. **Filter A.** Use a comprehension with a filter using the `has_attr` method, or
2. **Filter B.** Use a conditional expression with `None` or some other default,
3. **Get w/ default.** Use `get` with a default (cleaner version of 2.)

#### 1. **Filter A.** Filter with the comprehension

In [None]:
# 1. Apply a filter
[ tag["class"] 
 for tag in web_scraping.find_all('a')
 if tag.has_attr('class')] # extra code to filter

[['mw-jump-link'],
 ['mw-logo'],
 ['cdx-button',
  'cdx-button--fake-button',
  'cdx-button--fake-button--enabled',
  'cdx-button--weight-quiet',
  'cdx-button--icon-only',
  'search-toggle'],
 [],
 [],
 [],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],


#### 2. **Filter B.** Use a conditional expression to keep the failures as `None` (or other default)

In [None]:
[ tag["class"] if tag.has_attr('class') else None 
 for tag in web_scraping.find_all('a') ]

[['mw-jump-link'],
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 ['mw-logo'],
 ['cdx-button',
  'cdx-button--fake-button',
  'cdx-button--fake-button--enabled',
  'cdx-button--weight-quiet',
  'cdx-button--icon-only',
  'search-toggle'],
 [],
 [],
 [],
 None,
 None,
 None,
 None,
 None,
 None,
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 [

#### 3. **Get w/ default.** Use `get` with a `default=None`

In [None]:
[ get("class", tag, default=None) 
 for tag in web_scraping.find_all('a') ]

[['mw-jump-link'],
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 ['mw-logo'],
 ['cdx-button',
  'cdx-button--fake-button',
  'cdx-button--fake-button--enabled',
  'cdx-button--weight-quiet',
  'cdx-button--icon-only',
  'search-toggle'],
 [],
 [],
 [],
 None,
 None,
 None,
 None,
 None,
 None,
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 [

In [None]:
# Maybe `[]` is a better default
[ get("class", tag, default=[]) for tag in web_scraping.find_all('a') ]

[['mw-jump-link'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['mw-logo'],
 ['cdx-button',
  'cdx-button--fake-button',
  'cdx-button--fake-button--enabled',
  'cdx-button--weight-quiet',
  'cdx-button--icon-only',
  'search-toggle'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['int

### Composable solutions

1. Apply a `filter` using `has_attr`,
2. `map` the `get` function with a default.

#### 1. Filter to remove failures

In [None]:
(web_scraping
 >> obj.find_all('a')
 >> filter(obj.has_attr('class'))
 >> map(get('class'))
) 

[['mw-jump-link'],
 ['mw-logo'],
 ['cdx-button',
  'cdx-button--fake-button',
  'cdx-button--fake-button--enabled',
  'cdx-button--weight-quiet',
  'cdx-button--icon-only',
  'search-toggle'],
 [],
 [],
 [],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],


#### 2. Map `get` with a default to keep failures.

In [None]:
(web_scraping
 >> obj.find_all('a')
 >> map(get('class', default=[]))
) 

[['mw-jump-link'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['mw-logo'],
 ['cdx-button',
  'cdx-button--fake-button',
  'cdx-button--fake-button--enabled',
  'cdx-button--weight-quiet',
  'cdx-button--icon-only',
  'search-toggle'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['vector-toc-link'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['interlanguage-link-target'],
 ['int

<font color="red"><h2>Exercise 5.2.2</h2></font>

**Goal.** Get the `src` attribute of every `img` tag on the Wikipedia page.

**Tasks.** 

1. Redo the task using each of the approaches outlined above.
2. Compare and contrast the upsides and downsides of each solution.

In [None]:
# Your code here (add additional cells as needed.)

In [None]:
[t['src'] for t in web_scraping.find_all('img')]

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/48px-Ambox_globe_content.svg.png',
 'https://login.wikimedia.org/wiki/Special:CentralAutoLogin/start?useformat=desktop&type=1x1&usesul3=0',
 '/static/images/footer/wikimedia.svg',
 '/w/resources/assets/mediawiki_compact.svg']

In [None]:
(web_scraping.find_all('img')
 >> map(get('src', default=None))
)

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/48px-Ambox_globe_content.svg.png',
 'https://login.wikimedia.org/wiki/Special:CentralAutoLogin/start?useformat=desktop&type=1x1&usesul3=0',
 '/static/images/footer/wikimedia.svg',
 '/w/resources/assets/mediawiki_compact.svg']
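
Several of the `src` values above are relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`), so they cannot be requested as-is. One possible follow-up step, sketched here as an optional extra and not part of the exercise, is to resolve each value against the page URL with the standard library's `urllib.parse.urljoin`. The `PAGE_URL` constant and variable names are assumptions; the sample values are copied from the output above.

```python
from urllib.parse import urljoin

# URL of the page the tags were scraped from
PAGE_URL = 'https://en.wikipedia.org/wiki/Web_scraping'

# A few src values copied from the output above
srcs = [
    '/static/images/icons/wikipedia.png',
    '//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png',
    'https://login.wikimedia.org/wiki/Special:CentralAutoLogin/start?useformat=desktop&type=1x1&usesul3=0',
]

# urljoin fills in whatever the src is missing (scheme and/or host)
absolute_srcs = [urljoin(PAGE_URL, src) for src in srcs]
for url in absolute_srcs:
    print(url)
```
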

<font color="orange">
Your comparison here.
</font>

## Topic 4 - More Complicated Searches

Next, we will

* Search for multiple tags at once
* Search just the text
* Search for more than one class
* Deal with `class` attributes that have multiple values

### Searching for a list of tags

Using a list of tags with `find_all` returns all such tags.

#### Example 1 - All header HTML tags

In [None]:
(war_and_peace
 >> obj.find_all(['h1', 'h2','h3','h4','h5','h6'])
)

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
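
As an alternative to spelling out every header tag, `find_all` also accepts a compiled regular expression for the tag name. The sketch below assumes a small made-up HTML string and a hypothetical `header_pattern` name; it is one way to express the same search, not the approach used in the original notebooks.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = "<h1>War and Peace</h1><h2>Chapter 1</h2><p>It was in July, 1805.</p>"
soup = BeautifulSoup(html, "html.parser")

# Match tag names h1 through h6 and nothing else
header_pattern = re.compile(r"^h[1-6]$")
headers = soup.find_all(header_pattern)

header_names = [tag.name for tag in headers]
print(header_names)  # ['h1', 'h2']
```
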

### Searching tag text only

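We can search text only using the `string` keyword (called `text` in older versions of `bs4`). A plain string matches only text nodes that equal it exactly.

In [None]:
(war_and_peace
 >> obj.find_all(None, string='the prince')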
)

['the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince']
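
A plain string only matches text nodes that are exactly equal to it, which is why every result above is identical. For partial or case-insensitive matches, a compiled regular expression can be passed instead. The sketch below uses a made-up sentence (an assumption for illustration) and the `string` keyword, which is the current name for `text` in recent `bs4` releases.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = '<p>He greeted <span class="green">the prince</span> warmly. The prince bowed.</p>'
soup = BeautifulSoup(html, "html.parser")

# Exact match: only the text node that equals 'the prince'
exact = [str(s) for s in soup.find_all(string="the prince")]
print(exact)  # ['the prince']

# Regex match: any text node that contains 'prince', ignoring case
partial = [str(s) for s in soup.find_all(string=re.compile("prince", re.IGNORECASE))]
print(partial)  # ['the prince', ' warmly. The prince bowed.']
```
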

### Matching more than one attribute value

We can match more than one `class` by passing a set of attribute values.

In [None]:
(war_and_peace
 >> obj.find_all('span', attrs = {'class':{'green', 'red'}})
) >> tup.head(5)

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>]

### Using a `lambda` to search the `class` attribute.

`bs4` gives us the option of passing a function (here a `lambda`) instead of a static value as a search filter.

In [None]:
war_and_peace.find_all('span', class_=lambda cls: cls != 'green')

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span class="red">Heavens! what a virulent attack!</span>,
 <span class="red">First of all, dear friend, tell me how you are. Set your friend's
 mind at rest,</span>,
 <span class="red">Can one be well while suffering morally? Can one be calm in times
 l

### Dealing with multiple class values
While the previous example may seem silly, this ability allows us to deal with a *common* problem: multiple CSS classes.

1. It is common for the `class` attribute to hold multiple CSS classes, e.g., one for font size and another for font color.
2. Many times, we want to check whether any one of these classes fully or partially matches a target string.

#### Example - Finding the price span in My Gas Buddy.

Inspecting one of the items on the [My Gas Buddy Chicago page](https://www.gasbuddy.com/gasprices/illinois/chicago) shows that the closest tag looks like


```{html}
<span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.65</span>
```

Let's investigate the spans, narrowing our search to those whose class contains `'price'`.

In [None]:
# BEWARE of None! A tag with no class attribute passes cls=None, so `'price' in cls` raises a TypeError
gb_chicago.find_all('span', class_=lambda cls: 'price' in cls)

TypeError: argument of type 'NoneType' is not iterable

In [None]:
# The solution: check for None first [see https://en.wikipedia.org/wiki/Null_object_pattern]
# `cls is not None` must come first so that `'price' in cls` only runs on actual class values
gb_chicago.find_all('span', class_=lambda cls: cls is not None and 'price' in cls)

[<span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.65</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.65</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.79</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.79</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.82</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.83</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.84</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.84</span>,
 <span class="text__xl___2MXGo text__left___1iOw3 StationDisplayPrice-module__price___3rARL">$2.84</span>,
 <span class="text__xl___2MXGo text__

In [None]:
(gb_chicago
 >> obj.find_all('span', class_=lambda cls: cls is not None and 'price' in cls)
 >> map(obj.get_text())
)

['$2.65',
 '$2.65',
 '$2.79',
 '$2.79',
 '$2.82',
 '$2.83',
 '$2.84',
 '$2.84',
 '$2.84',
 '$2.85']
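
The `None` check in the searches above works because `and` short-circuits: when `cls` is `None`, the right-hand side is never evaluated. Here is a minimal sketch with plain Python values; the function name `has_price` and the sample list are made up for illustration.

```python
# The predicate from the cells above, given a name so it can be reused.
def has_price(cls):
    # `and` short-circuits, so `'price' in cls` only runs when cls is not None.
    return cls is not None and 'price' in cls

# Hypothetical class values; None stands for a tag with no class attribute.
sample_classes = ['StationDisplayPrice-module__price___3rARL', None, 'text__xl___2MXGo']

results = [has_price(c) for c in sample_classes]
print(results)  # [True, False, False]

# Without the guard, a missing class raises a TypeError.
missing = None
try:
    'price' in missing
except TypeError as err:
    print('TypeError:', err)
```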

<font color="red"><h2>Exercise 5.2.3</h2></font>

**Goal.** Extract the tags containing the station names.

**Tasks.** Use both pure Python and a composable pipe to perform the following.
1. Inspect the [GasBuddy Chicago page](https://www.gasbuddy.com/gasprices/illinois/chicago) to find an identifiable HTML tag close to the station name.
2. Use a lambda to perform the filter without checking for `None` to verify it fails, then
3. Use a `lambda` to perform the filter while correctly checking for `None`.

<a href="/station/177845" style="color: inherit; font-weight: 700; text-decoration: inherit;">Costco</a>

In [None]:
# Your pure Python comprehension code here

In [90]:
gb_chicago.find_all('h3', class_=lambda cls: cls is not None and 'station' in cls)

[<h3 class="header__header3___1b1oq header__header___1zII0 header__midnight___1tdCQ header__snug___lRSNK StationDisplay-module__stationNameHeader___1A2q8"><a href="/station/177845" style="color: inherit; font-weight: 700; text-decoration: inherit;">Costco</a><span> <span><img alt=" " class="image__image___1ZUby BadgeImage-module__image___1OkSK BadgeImage-module__clickable___2RX-5" loading="lazy" src="//static.gasbuddy.com/web/consumer/Verified_Icon.svg" title="This station is verified!"/></span></span></h3>,
 <h3 class="header__header3___1b1oq header__header___1zII0 header__midnight___1tdCQ header__snug___lRSNK StationDisplay-module__stationNameHeader___1A2q8"><a href="/station/123369" style="color: inherit; font-weight: 700; text-decoration: inherit;">Sam's Club</a><span> </span></h3>,
 <h3 class="header__header3___1b1oq header__header___1zII0 header__midnight___1tdCQ header__snug___lRSNK StationDisplay-module__stationNameHeader___1A2q8"><a href="/station/197663" style="color: inherit

In [107]:
gb_chicago.find('h3')

<h3 class="header__header3___1b1oq header__header___1zII0 header__midnight___1tdCQ header__snug___lRSNK StationDisplay-module__stationNameHeader___1A2q8"><a href="/station/177845" style="color: inherit; font-weight: 700; text-decoration: inherit;">Costco</a><span> <span><img alt=" " class="image__image___1ZUby BadgeImage-module__image___1OkSK BadgeImage-module__clickable___2RX-5" loading="lazy" src="//static.gasbuddy.com/web/consumer/Verified_Icon.svg" title="This station is verified!"/></span></span></h3>

In [None]:
# What is the point of specifying "class_= " in the following code?
# What about the situation without an attribute? 

(stations :=
    [t.get_text().replace('\xa0', '') for t in gb_chicago.find_all('h3', class_=lambda cls: cls is not None and 'station' in cls)]
)

['Costco',
 "Sam's Club",
 'Kwik Trip',
 "Woodman's",
 'Marathon',
 'BP',
 'Amoco',
 'Gulf',
 'Mobil',
 'Billionaire Choice Gas Mart']

In [92]:
# Your composable functional code here


In [93]:
(gb_chicago
 >> obj.find_all('h3', class_=lambda cls: cls is not None and 'station' in cls)
 >> map(obj.get_text().replace('\xa0', ''))
)

['Costco',
 "Sam's Club",
 'Kwik Trip',
 "Woodman's",
 'Marathon',
 'BP',
 'Amoco',
 'Gulf',
 'Mobil',
 'Billionaire Choice Gas Mart']

## More searching with `lambda` functions

We can also pass a `lambda` function as the first argument to perform more complicated searches. The function receives each tag and returns a Boolean.

* **Syntax:** `bs.find_all(lambda tag: bool_expr)`
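
A minimal sketch of this syntax on a made-up HTML snippet follows; the markup and variable names are assumptions for illustration only. The function is called with each `Tag`, and the tag is kept when the function returns a truthy value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the call pattern.
html = '<div><p id="a">one</p><p>two</p><span id="b">three</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Keep every tag, whatever its name, that carries an `id` attribute.
with_id = soup.find_all(lambda tag: tag.has_attr('id'))

names = [tag.name for tag in with_id]
print(names)  # ['p', 'span']
```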

### Example 3

In the following examples, we will be using `lambda` functions to perform more advanced searches on [this page](http://www.pythonscraping.com/pages/page3.html)

In [94]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/page3.html')
items_for_sale = BeautifulSoup(r.content, 'html.parser')

#### Example 3.1 - Filter by the number of attributes.

Let's find all tags with exactly 2 attributes on [this page](http://www.pythonscraping.com/pages/page3.html).
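
This filter relies on `tag.attrs`, a dictionary that maps each attribute name to its value, so `len(tag.attrs)` counts the attributes. The sketch below uses a shortened, hand-written row modeled on the page's markup (an assumption, not the full page) to show what the dictionary holds.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modeled on one gift row; not the real page.
html = ('<table><tr class="gift" id="gift1"><td>Vegetable Basket</td>'
        '<td><img src="../img/gifts/img1.jpg"/></td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')

# Show the attribute dictionary of every tag in the fragment.
for tag in soup.find_all(True):
    print(tag.name, tag.attrs)

# Only the <tr> has exactly two attributes (class and id).
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)
print([tag.name for tag in two_attrs])  # ['tr']
```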

In [95]:
(items_for_sale
 >> obj.find_all(lambda tag: len(tag.attrs) == 2)
) >> tup.head(5)

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td>

#### Example 3.2

Let's find all tags containing a specific piece of text.

In [96]:
# Filtering with a function on the tag returns the tag that wraps the text.
(items_for_sale
 >> obj.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
)

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [97]:
# ... vs. using `text=...` (named `string=` in newer bs4 releases), which returns only the text
(items_for_sale
 >> obj.find_all('', text='Or maybe he\'s only resting?')
)


["Or maybe he's only resting?"]
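
The difference between the two searches can be sketched on a small hand-written snippet. The markup is an assumption for illustration, and the sketch uses `string=`, the current name of the `text=` argument.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup wrapping the same sentence used above.
html = '<p>Parrot: <span class="excitingNote">Or maybe he\'s only resting?</span></p>'
soup = BeautifulSoup(html, 'html.parser')
target = "Or maybe he's only resting?"

# A function filter is applied to tags, so the result is the wrapping tag.
by_tag = soup.find_all(lambda tag: tag.get_text() == target)

# A string filter is applied to text nodes, so the result is the text itself.
by_string = soup.find_all(string=target)

print(type(by_tag[0]).__name__, by_tag[0])
print(type(by_string[0]).__name__, by_string[0])
```

The `<p>` is not returned by the first search because its full text also includes `Parrot: `.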

### Searching with regular expressions

Regular expressions are the most flexible tool for complex text searches, and they will be our next topic of discussion. Here, we pass a compiled pattern as the filter for the `src` attribute.

In [98]:
import re
gift_img = re.compile(r'\.\./img/gifts/img.*\.jpg')  # "/" needs no escaping in Python

(items_for_sale
 >> obj.find_all('img', attrs={'src': gift_img})
 >> map(get('src', default='')) # why is '' a good default here?
)

['../img/gifts/img1.jpg',
 '../img/gifts/img2.jpg',
 '../img/gifts/img3.jpg',
 '../img/gifts/img4.jpg',
 '../img/gifts/img6.jpg']

In [99]:
(items_for_sale
 >> obj.find_all('img', src=gift_img)
 >> map(get('src', default='')) # why is '' a good default here?
)


['../img/gifts/img1.jpg',
 '../img/gifts/img2.jpg',
 '../img/gifts/img3.jpg',
 '../img/gifts/img4.jpg',
 '../img/gifts/img6.jpg']
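
To see why the logo image is excluded, we can test an equivalent pattern directly against two `src` values that appear in the outputs above. This is a small sketch; `bs4` checks a compiled pattern with its `search()` method, so the same call is used here.

```python
import re

# Equivalent to the pattern used in the cells above.
gift_img = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Two paths taken from the page: the logo and the first gift image.
paths = ['../img/gifts/logo.jpg', '../img/gifts/img1.jpg']

matches = [bool(gift_img.search(p)) for p in paths]
print(matches)  # [False, True]
```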

## Text searches return a `NavigableString`

* A `NavigableString` is more than plain text.
* It allows access to the surrounding tags.

In [100]:
(war_and_peace
 >> obj.find_all(None, text='the prince')
 >> map(type)
)

[bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString]

### Getting the surrounding tag with `parent`

More information on parent tags is on the way.  **Note.** In this case, we are accessing an attribute (not a method), so we use the pipeable attribute `attr`.
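
For comparison, here is a sketch of the same idea in plain `bs4` without the pipe syntax. The snippet is hand-written and only loosely modeled on the War and Peace page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a text node inside a span inside a div.
html = '<div id="text">said <span class="green">the prince</span>.</div>'
soup = BeautifulSoup(html, 'html.parser')

# A string search returns the text node itself ...
hit = soup.find(string='the prince')
print(type(hit).__name__)  # NavigableString

# ... and `.parent` walks up one level at a time.
print(hit.parent)                # the enclosing <span>
print(hit.parent.parent['id'])   # the id of the enclosing <div>
```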

In [101]:
(war_and_peace
 >> obj.find_all(None, text='the prince')
)


['the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince']

In [102]:
(war_and_peace
 >> obj.find_all(None, text='the prince')
 >> map(attr.parent)
#  >> map(type)
)

[<span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>]

In [103]:
(war_and_peace
 >> obj.find_all(None, text='the prince')
 >> map(attr.parent)
 >> map(attr.parent)
#  >> map(type)
) # >> apply(len)

[<div id="text">
 "<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>"
 <p></p>
 It was in July, 1805, and the speaker was the well-known <span class="green">Anna
 Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
 Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
 of high rank and importance, who was the first to arrive at her
 reception. <span class="green">Anna Pavlovna</span> had had a cough for some days. She was, as
 she said, suffering from la gr

In [104]:
(war_and_peace
 >> obj.find_all(None, text='the prince')
 >> map(attr.parent)
 >> map(attr.parent)
 >> map(attr.contents)
) # >> apply(len)

[['\n"',
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
  Buonapartes. But I warn you, if you don't tell me that this means war,
  if you still try to defend the infamies and horrors perpetrated by
  that Antichrist- I really believe he is Antichrist- I will have
  nothing more to do with you and you are no longer my friend, no longer
  my 'faithful slave,' as you call yourself! But how do you do? I see
  I have frightened you- sit down and tell me all the news.</span>,
  '"\n',
  <p></p>,
  '\nIt was in July, 1805, and the speaker was the well-known ',
  <span class="green">Anna
  Pavlovna Scherer</span>,
  ', maid of honor and favorite of the ',
  <span class="green">Empress Marya
  Fedorovna</span>,
  '. With these words she greeted ',
  <span class="green">Prince Vasili Kuragin</span>,
  ', a man\nof high rank and importance, who was the first to arrive at her\nreception. ',
  <span class="green">Anna Pavlovna</span>,
  ' had had a cough for

In [105]:
(war_and_peace
 >> obj.find_all(None, text='the prince')
 >> map(attr.parent)
 >> map(attr.parent)
 >> map(attr.children)
 >> map(list)
) # >> apply(len)

[['\n"',
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
  Buonapartes. But I warn you, if you don't tell me that this means war,
  if you still try to defend the infamies and horrors perpetrated by
  that Antichrist- I really believe he is Antichrist- I will have
  nothing more to do with you and you are no longer my friend, no longer
  my 'faithful slave,' as you call yourself! But how do you do? I see
  I have frightened you- sit down and tell me all the news.</span>,
  '"\n',
  <p></p>,
  '\nIt was in July, 1805, and the speaker was the well-known ',
  <span class="green">Anna
  Pavlovna Scherer</span>,
  ', maid of honor and favorite of the ',
  <span class="green">Empress Marya
  Fedorovna</span>,
  '. With these words she greeted ',
  <span class="green">Prince Vasili Kuragin</span>,
  ', a man\nof high rank and importance, who was the first to arrive at her\nreception. ',
  <span class="green">Anna Pavlovna</span>,
  ' had had a cough for
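
The last two cells give the same result because `.contents` and `.children` expose the same direct children: `.contents` is a list, while `.children` is an iterator that has to be consumed, for example with `list`. A minimal sketch on a hand-written snippet (an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three direct children: text, a span, text.
html = '<div id="text">said <span class="green">the prince</span>.</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# `.contents` is already a list of the direct children.
print(type(div.contents).__name__)  # list
print(len(div.contents))            # 3

# `.children` yields the same items lazily.
kids = list(div.children)
print(kids == div.contents)         # True
```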