# Web Scraping
3 min video - What is Web Scraping: https://www.youtube.com/watch?v=Ct8Gxo8StBU
  
Some websites can contain a very large amount of invaluable data: stock prices, product details, sports stats, company contacts, you name it.  

**Web scraping, web harvesting**, or **web data extraction** is data scraping used for extracting data from websites.  Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
  
Scraping a web page involves **fetching** it and **extracting** from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping).
  
https://en.wikipedia.org/wiki/Web_scraping
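
The fetch-then-extract flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the "fetched" page is a hard-coded string rather than a real download, and the markup, names, and addresses are made up for the example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for the "fetch" step: a real scraper would download this markup
# over HTTP. The markup, names, and addresses here are invented.
fetched_html = """
<ul>
  <li class="contact"><span class="name">Ada Example</span>
      <a href="mailto:ada@example.com">email</a></li>
  <li class="contact"><span class="name">Bob Sample</span>
      <a href="mailto:bob@example.com">email</a></li>
</ul>
"""

# "Extract" step: parse the page and pull out the fields of interest.
soup = BeautifulSoup(fetched_html, "html.parser")
rows = []
for item in soup.find_all("li", class_="contact"):
    name = item.find("span", class_="name").get_text(strip=True)
    href = item.find("a")["href"]
    rows.append((name, href[len("mailto:"):]))  # drop the "mailto:" prefix

# Copy the extracted data into a spreadsheet-friendly format (CSV).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "email"])
writer.writerows(rows)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch self-contained; a real script would more likely write to a file or a database.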

# Beautiful Soup  
https://www.crummy.com/software/BeautifulSoup/  

Beautiful Soup is a Python library designed for screen-scraping: it pulls data out of HTML and XML files and provides ways of navigating, searching, and modifying parse trees.

## Agenda
Quick Start.  
Installing the Soup.  
Making the Soup.  
Kinds of Objects:
- [Tag](#Tag)
- [NavigableString](#NavigableString)
- [BeautifulSoup](#BeautifulSoup)
- [Comment](#Comments-and-other-special-strings)

**[Navigating the Tree:](#Navigating-the-tree)**
- [Going Down](#Going-down)
    - By using a tag name, e.g. *soup.head*
    - .*contents* and .*children*
    - .*descendants*
    - .*string*, .*strings* and .*stripped_strings*  
      
    
- [Going Up](#Going-up)
    - .*parent*
    - .*parents*  
      
    
- [Going Sideways](#Going-sideways)
    - .*next_sibling* and .*previous_sibling*
    - .*next_siblings* and .*previous_siblings*
      
    
- [Going Back and Forth](#Going-back-and-forth)
    - .*next_element* and .*previous_element*
    - .*next_elements* and .*previous_elements*  
      
      
    
**[Searching the Tree](#Searching-the-tree):**
- [Kinds of Filters](#Kinds-of-filters)
    - A string
    - A regular expression
    - A list
    - True
    - A function
   
- [find_all()](#find_all())
    - The name argument
    - The keyword arguments
    - Searching by CSS class
    - The string argument
    - The limit argument
    - The recursive argument  
  
  
- [Calling a tag is like calling *find_all()*](#Calling-a-tag-is-like-calling-find_all())
- [*find()*](#find())
- [*find_parents()* and *find_parent()*](#find_parents()-and-find_parent())
- [*find_next_siblings()* and *find_next_sibling()*](#find_next_siblings()-and-find_next_sibling())
- [*find_previous_siblings()* and *find_previous_sibling()*](#find_previous_siblings()-and-find_previous_sibling())
- [*find_all_next()* and *find_next()*](#find_all_next()-and-find_next())
- [*find_all_previous()* and *find_previous()*](#find_all_previous()-and-find_previous())

## Quick Start
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

In [171]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

In [172]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Here are some simple ways to navigate that data structure:

In [173]:
# title tag
soup.title

<title>The Dormouse's story</title>

In [174]:
# tag name
soup.title.name

'title'

In [175]:
# tag string
soup.title.string

"The Dormouse's story"

In [176]:
soup.title.parent.name

'head'

In [177]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [178]:
# class of a p tag
soup.p['class']

['title']

In [179]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [180]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [181]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page’s &lt;a&gt; tags:

In [182]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


Another common task is extracting all the text from a page:

In [183]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
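
The two tasks above (collecting URLs and collecting text) are often combined. A minimal sketch, assuming a shortened two-link version of the document and a hypothetical variable name `links`:

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the "three sisters" markup.
html = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> and '
    '<a href="http://example.com/lacie" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Map each link's visible text to its target URL.
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}
print(links)
```

Using `a.get("href")` rather than `a["href"]` means a link without an `href` attribute yields `None` instead of raising `KeyError`.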



## Installing Beautiful Soup

```bash
conda install -c anaconda beautifulsoup4
```

https://anaconda.org/anaconda/beautifulsoup4

## Making the soup
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

In [344]:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
print(soup.prettify())

<html>
 a web page
</html>


```python
# an open filehandle
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
```
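
To try the filehandle form without an existing `index.html`, the sketch below writes a small page to a temporary directory first and then parses it from the open file. The file contents and the variable names (`file_soup`, `page_title`) are assumptions made for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")

    # Create a tiny HTML file to parse.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("<html><head><title>Saved page</title></head></html>")

    # Pass the open filehandle straight to the constructor.
    with open(path, encoding="utf-8") as fp:
        file_soup = BeautifulSoup(fp, "html.parser")

page_title = file_soup.title.string
print(page_title)
```

The constructor reads the whole file while it is open, so the parsed tree remains usable after the `with` block has closed the file.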

In [185]:
import requests
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print(soup.title)

<title>Online Tutorials Library</title>


## Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: **Tag, NavigableString, BeautifulSoup, and Comment**.
### Tag
A Tag object corresponds to an XML or HTML tag in the original document:

In [186]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

bs4.element.Tag

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
#### Tag Name
Every tag has a name, accessible as .name:

In [187]:
tag.name

'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [188]:
tag.name = "blockquote"
tag

<blockquote class="boldest">Extremely bold</blockquote>

#### Tag Attributes
A tag may have any number of attributes. The tag &lt;b id="boldest">&lt;/b> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [189]:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']

'boldest'

You can access that dictionary directly as .attrs:

In [190]:
tag.attrs

{'id': 'boldest'}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [191]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

<b another-attribute="1" id="verybold">bold</b>

In [192]:
del tag['id']
del tag['another-attribute']
tag

<b>bold</b>

Multi-valued attributes are discussed in the Beautiful Soup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
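
As a brief illustration of the difference, the sketch below parses a tag with a two-value `class` and an `id` that contains a space. It also shows `.get()`, which returns `None` for a missing attribute where dictionary-style access raises `KeyError`. The markup is made up for the example.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup(
    '<p class="body strikeout" id="my id">text</p>', "html.parser"
).p

classes = tag["class"]     # class is multi-valued, so this is a list
tag_id = tag["id"]         # id is single-valued, so this stays one string
missing = tag.get("href")  # no href attribute on this tag

try:
    tag["href"]
    raised = False
except KeyError:
    raised = True

print(classes, repr(tag_id), missing, raised)
```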

### NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

In [193]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string


'Extremely bold'

In [194]:
type(tag.string)

bs4.element.NavigableString

You can’t edit a string in place, but you can replace one string with another, using replace_with():

In [195]:
tag.string.replace_with("No longer bold")
tag

<b class="boldest">No longer bold</b>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call str() on it to turn it into a normal Python string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
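
In Python 3 the conversion is done with the built-in `str()`. A minimal sketch, with `nav` and `plain` as illustrative variable names:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

nav = soup.b.string  # still attached to the parse tree
plain = str(nav)     # an ordinary Python string, detached from the tree

print(type(nav).__name__, type(plain).__name__)
```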
### BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like **combine two parsed documents**:

In [340]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")

# replace the placeholder text with the footer tag
# (string= is the current name of the older text= argument)
doc.find(string="INSERT FOOTER HERE").replace_with(footer)

'INSERT FOOTER HERE'

In [197]:
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>


Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

In [198]:
soup.name

'[document]'

### Comments and other special strings

In [199]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)

bs4.element.Comment

The Comment object is just a special type of NavigableString:

In [200]:
comment

'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

In [201]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>


See Beautiful Soup Documentation for other special strings: 
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
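
Because a Comment is a kind of NavigableString, an `isinstance` check is one way to tell comments apart from ordinary text. The sketch below collects the comments in a made-up snippet and then removes them from the tree; it is one possible approach rather than the only one.

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!--first note--><p>Visible text</p><!--second note--></div>"
soup = BeautifulSoup(markup, "html.parser")

# string=True matches every string in the tree, comments included.
comments = [s for s in soup.find_all(string=True) if isinstance(s, Comment)]
comment_texts = [str(c) for c in comments]
print(comment_texts)

# Detach each comment from the tree.
for c in comments:
    c.extract()

cleaned = str(soup)
print(cleaned)
```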

## Navigating the tree

Here’s the “Three sisters” HTML document again:

In [341]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


### Going down
**Tags may contain strings and other tags**. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a **string can’t have children**.
#### Navigating using tag names
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the &lt;head> tag, just say soup.head:

In [203]:
soup.head

<head><title>The Dormouse's story</title></head>

In [204]:
soup.title

<title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first &lt;b&gt; tag beneath the &lt;body&gt; tag:

In [205]:
soup.body.b

<b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

In [206]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the &lt;a&gt; tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():

In [207]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#### .contents and .children
A tag’s children are available in a list called .contents:

In [208]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [209]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [210]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

In [211]:
title_tag.contents

["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the &lt;html&gt; tag is the child of the BeautifulSoup object:

In [212]:
len(soup.contents)

1

In [213]:
soup.contents[0].name

'html'

A string does not have .contents, because it can’t contain anything:

In [214]:
text = title_tag.contents[0]
# text.contents - will give AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

In [215]:
for child in title_tag.children:
    print(child)

The Dormouse's story


If you want to modify a tag’s children, use the methods described in Modifying the tree. Don’t modify the .contents list directly: that can lead to problems that are subtle and difficult to spot.
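
In practice a tag's children are usually a mix of tags and whitespace strings. The sketch below, using a made-up list, shows `.contents` returning everything and an `isinstance` filter over `.children` keeping only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>one</li> <li>two</li></ul>", "html.parser")
ul = soup.ul

# .contents is a plain list: two <li> tags with a one-space string between them.
print(ul.contents)

# .children yields the same items lazily; keep only the Tag objects.
child_tags = [child for child in ul.children if isinstance(child, Tag)]
texts = [li.string for li in child_tags]
print(texts)
```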

#### .descendants
The .contents and .children attributes only consider a tag’s direct children. For instance, the &lt;head&gt; tag has a single direct child–the &lt;title&gt; tag:

In [216]:
head_tag.contents

[<title>The Dormouse's story</title>]

But the &lt;title&gt; tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the &lt;head&gt; tag. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [343]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


The &lt;head&gt; tag has only one child, but it has two descendants: the &lt;title&gt; tag and the &lt;title&gt; tag’s child. The BeautifulSoup object only has one direct child (the &lt;html&gt; tag), but it has a whole lot of descendants:

In [218]:
len(list(soup.children))

1

In [219]:
len(list(soup.descendants))

26
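
The same comparison is easier to follow on a document small enough to count by hand. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.div

direct = list(div.children)         # only the <p> tag
everything = list(div.descendants)  # <p>, 'Hello ', <b>, 'world'

print(len(direct), len(everything))

# Descendants come back in document order, tags and strings interleaved.
kinds = [type(node).__name__ for node in everything]
print(kinds)
```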

#### .string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

In [220]:
title_tag.string

"The Dormouse's story"

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

In [221]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [222]:
head_tag.string

"The Dormouse's story"

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

In [223]:
print(soup.html.string)

None
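
When .string is None because a tag has several children, `get_text()` is one way to still obtain the combined text. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

single = p.string       # None: <p> holds a string and a <b> tag
combined = p.get_text() # concatenates every string beneath <p>

print(single)
print(combined)
```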


#### .strings and .stripped_strings
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

In [224]:
for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

In [225]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
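
A common use of .stripped_strings is to rebuild clean, single-spaced text. The sketch below compares the two generators on a made-up snippet and joins the stripped pieces; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>\n  Hello  <b> world </b>\n</p>", "html.parser")

raw = list(soup.strings)             # whitespace preserved
clean = list(soup.stripped_strings)  # whitespace-only strings dropped, rest trimmed
sentence = " ".join(clean)

print(raw)
print(clean)
print(sentence)
```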

### Going up
Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.
#### .parent
You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the &lt;head&gt; tag is the parent of the &lt;title&gt; tag:

In [226]:
title_tag = soup.title
title_tag

<title>The Dormouse's story</title>

In [227]:
title_tag.parent

<head><title>The Dormouse's story</title></head>

The title string itself has a parent: the &lt;title&gt; tag that contains it:

In [228]:
title_tag.string.parent

<title>The Dormouse's story</title>

The parent of a top-level tag like &lt;html&gt; is the BeautifulSoup object itself:

In [229]:
html_tag = soup.html
type(html_tag.parent)

bs4.BeautifulSoup

#### .parents
You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an &lt;a&gt; tag buried deep within the document, to the very top of the document:

In [230]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [231]:
for parent in link.parents:
    print(parent.name)

p
body
html
[document]
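
Iterating over .parents also makes it easy to build a path to an element, and `find_parent()` (one of the search methods listed in the agenda) jumps directly to the nearest matching ancestor. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><body><div id='main'><p><a href='#'>x</a></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Names of every ancestor, from the closest one up to the document itself.
ancestors = [parent.name for parent in link.parents]
print(ancestors)

# Nearest enclosing <div>, skipping the <p> in between.
container_id = link.find_parent("div")["id"]
print(container_id)
```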


### Going sideways
Consider a simple document like this:

In [232]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>


The &lt;b&gt; tag and the &lt;c&gt; tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

#### .next_sibling and .previous_sibling
You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

In [233]:
sibling_soup.b.next_sibling

<c>text2</c>

In [234]:
sibling_soup.c.previous_sibling

<b>text1</b>

The &lt;b&gt; tag has a .next_sibling, but no .previous_sibling, because there’s nothing before the &lt;b&gt; tag on the same level of the tree. For the same reason, the &lt;c&gt; tag has a .previous_sibling but no .next_sibling:

In [235]:
print(sibling_soup.b.previous_sibling)

None


In [236]:
print(sibling_soup.c.next_sibling)

None


The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

In [237]:
sibling_soup.b.string

'text1'

In [238]:
print(sibling_soup.b.string.next_sibling)

None


In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:  
  
&lt;a href="http://example.com/elsie" class="sister" id="link1">Elsie&lt;/a>  
&lt;a href="http://example.com/lacie" class="sister" id="link2">Lacie&lt;/a>  
&lt;a href="http://example.com/tillie" class="sister" id="link3">Tillie&lt;/a>  

You might think that the .next_sibling of the first &lt;a> tag would be the second &lt;a> tag. But actually, it’s a string: the comma and newline that separate the first &lt;a> tag from the second:


In [239]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [240]:
link.next_sibling

',\n'

The second &lt;a> tag is actually the .next_sibling of the comma:

In [241]:
link.next_sibling.next_sibling

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
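
If the goal is the next sibling tag rather than the intervening string, `find_next_sibling()` (listed in the agenda under searching) skips over text nodes. A sketch on a shortened, made-up version of the markup:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_sibling = first.next_sibling            # the string between the two tags
tag_sibling = first.find_next_sibling("a")  # the next <a> tag on the same level

print(repr(raw_sibling))
print(tag_sibling["id"])
```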

#### .next_siblings and .previous_siblings
You can iterate over a tag’s siblings with .next_siblings or .previous_siblings:

In [242]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [243]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'


### Going back and forth
Take a look at the beginning of the “three sisters” document:

In [244]:
# <html><head><title>The Dormouse's story</title></head>
# <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: “open an &lt;html> tag”, “open a &lt;head> tag”, “open a &lt;title> tag”, “add a string”, “close the &lt;title> tag”, “open a &lt;p> tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

#### .next_element and .previous_element
The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.
  
Here’s the final &lt;a> tag in the “three sisters” document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the &lt;a> tag:

In [245]:
last_a_tag = soup.find("a", id="link3")
last_a_tag

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [246]:
last_a_tag.next_sibling

';\nand they lived at the bottom of a well.'

But the .next_element of that &lt;a> tag, the thing that was parsed immediately after the &lt;a> tag, is not the rest of that sentence: it’s the word “Tillie”:

In [247]:
last_a_tag.next_element

'Tillie'

That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an &lt;a> tag, then the word “Tillie”, then the closing &lt;/a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the &lt;a> tag, but the word “Tillie” was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

In [248]:
last_a_tag.previous_element

' and\n'

In [249]:
last_a_tag.previous_element.next_element

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

#### .next_elements and .previous_elements
You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

In [250]:
for element in last_a_tag.next_elements:
    print(repr(element))

'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
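
Since .next_elements yields both tags and strings in parse order, it is often filtered by type. The sketch below lists the names of all tags parsed after a starting point; the markup is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><h2>Title</h2><p>One <b>bold</b></p><p>Two</p></div>"
soup = BeautifulSoup(html, "html.parser")
start = soup.h2

# Everything parsed after <h2>: 'Title', <p>, 'One ', <b>, 'bold', <p>, 'Two'
following = list(start.next_elements)

# Keep only the tags, in the order the parser met them.
following_tags = [el.name for el in following if isinstance(el, Tag)]
print(following_tags)
```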


## Searching the tree
Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: find() and find_all(). The other methods take almost exactly the same arguments, so I’ll just cover them briefly.

Once again, I’ll be using the “three sisters” document as an example:

In [251]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

By passing in a filter to a method like find_all(), you can zoom in on the parts of the document you’re interested in.

### Kinds of filters
Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
#### A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the &lt;b> tags in the document:

In [252]:
soup.find_all('b')

[<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
#### A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter “b”; in this case, the &lt;body> tag and the &lt;b> tag:

In [253]:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

body
b


This code finds all the tags whose names contain the letter ‘t’:

In [254]:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

html
title


#### A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the &lt;a> tags **and** all the &lt;b> tags:

In [255]:
soup.find_all(["a", "b"])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#### True
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

In [256]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p


#### A function
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.

Here’s a function that returns True if a tag defines the “class” attribute but doesn’t define the “id” attribute:

In [257]:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into find_all() and you’ll pick up all the &lt;p> tags:

In [258]:
soup.find_all(has_class_but_no_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

This function only picks up the &lt;p> tags. It doesn’t pick up the &lt;a> tags, because those tags define both “class” and “id”. It doesn’t pick up tags like &lt;html> and &lt;title>, because those tags don’t define “class”.

If you pass in a function to filter on a specific attribute like href, the argument passed into the function will be the attribute value, not the whole tag. Here’s a function that finds all &lt;a> tags whose href attribute does not match a regular expression:

In [259]:
import re
def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The function can be as complicated as you need it to be. Here’s a function that returns True if a tag is surrounded by string objects:

In [260]:
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)

body
p
a
a
a
p
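
As one more illustration of a function filter, the sketch below keeps only &lt;a> tags whose href is an absolute URL. The helper name `is_absolute_link` and the markup are hypothetical, chosen for this example.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="/about">About</a>
<a href="https://example.org/docs">Docs</a>
<a name="top">Anchor without href</a>
<span href="https://example.org/odd">Not a link</span></p>
"""
soup = BeautifulSoup(html, "html.parser")


def is_absolute_link(tag):
    # Hypothetical helper: True for <a> tags whose href starts with http(s)://
    href = tag.get("href", "")
    return tag.name == "a" and href.startswith(("http://", "https://"))


absolute = [tag["href"] for tag in soup.find_all(is_absolute_link)]
print(absolute)
```

Checking `tag.name` inside the function is what excludes the &lt;span>, which carries an absolute href but is not a link.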


Now we’re ready to look at the search methods in detail.

### find_all()
 
Method signature: `find_all(name, attrs, recursive, string, limit, **kwargs)`

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:

In [261]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [262]:
soup.find_all("p", "title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [263]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [264]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [265]:
import re
soup.find(string=re.compile("sisters"))

'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for string, or id? Why does find_all("p", "title") find a &lt;p> tag with the CSS class “title”? Let’s look at the arguments to find_all().

#### The name argument
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.

This is the simplest usage:

In [266]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

Recall from Kinds of filters that the value to name can be a string, a regular expression, a list, a function, or the value True.

#### The keyword arguments
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:

In [267]:
soup.find_all(id='link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:

In [268]:
soup.find_all(href=re.compile("elsie"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

This code finds all tags whose id attribute has a value, regardless of what the value is:

In [269]:
soup.find_all(id=True)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
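
The list and function filters work on attributes too. This is a small sketch, assuming a cut-down version of the story paragraph; points_to_lacie is a made-up helper name. The function guards against None because it may be called for tags that lack the attribute.

```python
from bs4 import BeautifulSoup

# Assumption: a cut-down version of the story paragraph.
html_doc = """<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# A list matches a tag whose attribute equals any one of its items.
by_list = soup.find_all(id=["link1", "link3"])

def points_to_lacie(value):
    """Illustrative filter: receives the attribute value, or None if absent."""
    return value is not None and value.endswith("lacie")

by_function = soup.find_all(href=points_to_lacie)

print([tag["id"] for tag in by_list])
print([tag["id"] for tag in by_function])
```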

You can filter multiple attributes at once by passing in more than one keyword argument:

In [270]:
soup.find_all(href=re.compile("elsie"), id='link1')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:

In [271]:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# data_soup.find_all(data-foo="value") - not going to work

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

In [272]:
data_soup.find_all(attrs={"data-foo": "value"})

[<div data-foo="value">foo!</div>]

You can’t use a keyword argument to search for HTML’s ‘name’ attribute, because Beautiful Soup uses the name argument to contain the name of the tag itself. Instead, you can give a value to ‘name’ in the attrs argument:

In [273]:
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
name_soup.find_all(name="email") # does not work

[]

In [274]:
name_soup.find_all(attrs={"name": "email"})

[<input name="email"/>]

#### Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

In [275]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:

In [276]:
soup.find_all(class_=re.compile("itl")) # for our class "title"

[<p class="title"><b>The Dormouse's story</b></p>]

In [277]:
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

In [278]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")

[<p class="body strikeout"></p>]

In [279]:
css_soup.find_all("p", class_="body")

[<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

In [280]:
css_soup.find_all("p", class_="body strikeout")

[<p class="body strikeout"></p>]

But searching for variants of the string value won’t work:

In [281]:
css_soup.find_all("p", class_="strikeout body")

[]

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

In [282]:
css_soup.select("p.strikeout.body")

[<p class="body strikeout"></p>]
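
If you would rather stay with find_all(), an order-independent check can also be written as a function filter. The sketch below is one possible way to do it, not the library’s recommended approach; has_both_classes is a hypothetical helper, and the three-paragraph snippet is made up for the comparison.

```python
from bs4 import BeautifulSoup

# Assumption: a made-up snippet with the classes in different orders.
css_soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="strikeout body"></p><p class="body"></p>',
    "html.parser",
)

def has_both_classes(tag):
    """Illustrative filter: a <p> tag carrying both classes, in any order."""
    classes = tag.get("class") or []
    return tag.name == "p" and {"body", "strikeout"} <= set(classes)

with_function = css_soup.find_all(has_both_classes)
with_selector = css_soup.select("p.strikeout.body")

# Both approaches pick the first two paragraphs and skip the third.
print(len(with_function), len(with_selector))
```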

In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:

In [283]:
soup.find_all("a", attrs={"class": "sister"})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#### The string argument
With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:


In [284]:
soup.find_all(string="Elsie")

['Elsie']

In [285]:
soup.find_all(string=["Tillie", "Elsie", "Lacie"])

['Elsie', 'Lacie', 'Tillie']

In [286]:
soup.find_all(string=re.compile("Dormouse"))

["The Dormouse's story", "The Dormouse's story"]

In [287]:
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)

["The Dormouse's story",
 "The Dormouse's story",
 'Elsie',
 'Lacie',
 'Tillie',
 '...']

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the &lt;a> tags whose .string is “Elsie”:

In [288]:
soup.find_all("a", string="Elsie")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:

In [289]:
soup.find_all("a", text="Elsie")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

#### The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.

There are three links in the “three sisters” document, but this code only finds the first two:

In [290]:
soup.find_all("a", limit=2)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

#### The recursive argument
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider **direct children**, you can pass in recursive=False. See the difference here:

In [291]:
soup.html.find_all("title")

[<title>The Dormouse's story</title>]

In [292]:
soup.html.find_all("title", recursive=False)

[]

Here’s that part of the document:

In [293]:
'''<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
'''

"<html>\n <head>\n  <title>\n   The Dormouse's story\n  </title>\n </head>\n...\n"

The &lt;title> tag is beneath the &lt;html> tag, but it’s not directly beneath the &lt;html> tag: the &lt;head> tag is in the way. Beautiful Soup finds the &lt;title> tag when it’s allowed to look at all descendants of the &lt;html> tag, but when recursive=False restricts it to the &lt;html> tag’s immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, string, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn’t be very useful.
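
A common use of recursive=False is listing only the top-level tags inside a container. Here is a minimal sketch, assuming a shortened version of the document:

```python
from bs4 import BeautifulSoup

# Assumption: a shortened version of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a id="link1">Elsie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Every tag below <body>, however deeply nested.
all_tags = [tag.name for tag in soup.body.find_all(True)]

# Direct children only: the nested <b> and <a> tags are skipped.
direct = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# find() accepts recursive=False as well; <title> sits under <head>, not <html>.
direct_title = soup.html.find("title", recursive=False)

print(all_tags, direct, direct_title, sep="\n")
```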

#### Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:

In [294]:
soup.find_all("a")
soup("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

These two lines are also equivalent:

In [295]:
soup.title.find_all(string=True)
soup.title(string=True)

["The Dormouse's story"]

#### find()
Method signature: find(name, attrs, recursive, string, **kwargs)

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one &lt;body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method. These two lines of code are nearly equivalent:

In [296]:
soup.find_all('title', limit=1)

[<title>The Dormouse's story</title>]

In [297]:
soup.find('title')

<title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, and find() just returns the result.

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:

In [298]:
print(soup.find("nosuchtag"))

None


Remember the soup.head.title trick from Navigating using tag names? That trick works by repeatedly calling find():

In [299]:
soup.head.title

<title>The Dormouse's story</title>

In [300]:
soup.find("head").find("title")

<title>The Dormouse's story</title>
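
Because find() returns None when nothing matches, a chained call such as soup.find("head").find("title") raises AttributeError as soon as one link in the chain is missing. One way to guard against that is sketched below; title_text is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

def title_text(document):
    """Illustrative helper: the <title> text, or None if <head> or <title> is missing."""
    head = document.find("head")
    if head is None:
        return None
    title = head.find("title")
    return title.get_text() if title is not None else None

print(title_text(soup))
print(title_text(BeautifulSoup("<p>no head here</p>", "html.parser")))
```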

#### find_parents() and find_parent()

Method signature: find_parents(name, attrs, string, limit, **kwargs)

Method signature: find_parent(name, attrs, string, **kwargs)

I spent a lot of time above covering find_all() and find(). The Beautiful Soup API defines ten other methods for searching the tree, but don’t be afraid. Five of these methods are basically the same as find_all(), and the other five are basically the same as find(). The only differences are in what parts of the tree they search.

First let’s consider find_parents() and find_parent(). Remember that find_all() and find() work their way down the tree, looking at a tag’s descendants. These methods do the opposite: **they work their way up the tree**, looking at a tag’s (or a string’s) parents. Let’s try them out, starting from a string buried deep in the “three sisters” document:

In [301]:
a_string = soup.find(string="Lacie")
a_string

'Lacie'

In [302]:
a_string.find_parents("a")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [303]:
a_string.find_parent("p")

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

In [304]:
a_string.find_parents("p", class_="title")

[]

One of the three &lt;a> tags is the direct parent of the string in question, so our search finds it. One of the three &lt;p> tags is an indirect parent of the string, and our search finds that as well. There’s a &lt;p> tag with the CSS class “title” somewhere in the document, but it’s not one of this string’s parents, so we can’t find it with find_parents().

You may have made the connection between find_parent() and find_parents(), and the .parent and .parents attributes mentioned earlier. The connection is very strong. These search methods actually use .parents to iterate over all the parents, and check each one against the provided filter to see if it matches.
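
To make that connection concrete, here is a rough sketch of the idea using .parents directly. It only handles the simplest case, matching on a tag name; parents_named is a made-up helper, and the real methods accept all the filters described above.

```python
from bs4 import BeautifulSoup

# Assumption: a minimal document with one link nested in a paragraph.
html_doc = """<html><body><p class="story">
<a class="sister" id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

def parents_named(element, name):
    """Illustrative helper: the ancestors of element that have the given tag name."""
    return [parent for parent in element.parents if parent.name == name]

a_string = soup.find(string="Lacie")

# .parents walks from the closest ancestor up to the BeautifulSoup object itself.
print([parent.name for parent in a_string.parents])

# The hand-rolled version and find_parents() agree on this document.
print(parents_named(a_string, "p") == a_string.find_parents("p"))
```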

#### find_next_siblings() and find_next_sibling()
Method signature: find_next_siblings(name, attrs, string, limit, **kwargs)

Method signature: find_next_sibling(name, attrs, string, **kwargs)

These methods use .next_siblings to iterate over the rest of an element’s siblings in the tree. The find_next_siblings() method returns all the siblings that match, and find_next_sibling() only returns the first one:

In [306]:
first_link = soup.a
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [307]:
first_link.find_next_siblings("a")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [308]:
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")

<p class="story">...</p>
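
One practical reason to prefer find_next_sibling() over the plain .next_sibling attribute is that siblings include strings. The sketch below, which assumes a shortened story paragraph, shows the difference:

```python
from bs4 import BeautifulSoup

# Assumption: a shortened story paragraph.
html_doc = """<p class="story"><a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# The very next sibling is the text between the two links, not a tag.
raw_sibling = first_link.next_sibling

# Filtering by tag name skips over the strings in between.
next_tag = first_link.find_next_sibling("a")

print(repr(raw_sibling))
print(next_tag["id"])
```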

#### find_previous_siblings() and find_previous_sibling()

Method signature: find_previous_siblings(name, attrs, string, limit, **kwargs)

Method signature: find_previous_sibling(name, attrs, string, **kwargs)

These methods use .previous_siblings to iterate over an element’s siblings that precede it in the tree. The find_previous_siblings() method returns all the siblings that match, and find_previous_sibling() only returns the first one:

In [309]:
last_link = soup.find("a", id="link3")
last_link

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [310]:
last_link.find_previous_siblings("a")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [311]:
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")

<p class="title"><b>The Dormouse's story</b></p>

#### find_all_next() and find_next()
Method signature: find_all_next(name, attrs, string, limit, **kwargs)

Method signature: find_next(name, attrs, string, **kwargs)

These methods use .next_elements to iterate over **whatever tags and strings come after an element in the document**. The find_all_next() method returns all matches, and find_next() only returns the first match:

In [312]:
first_link = soup.a
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [313]:
first_link.find_all_next(string=True)

['Elsie',
 ',\n',
 'Lacie',
 ' and\n',
 'Tillie',
 ';\nand they lived at the bottom of a well.',
 '\n',
 '...',
 '\n']

In [314]:
first_link.find_next("p")

<p class="story">...</p>

In the first example, the string “Elsie” showed up, even though it was contained within the &lt;a> tag we started from. In the second example, the last &lt;p> tag in the document showed up, even though it’s not in the same part of the tree as the &lt;a> tag we started from. For these methods, all that matters is that an element match the filter, and show up later in the document than the starting element.
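
The same behaviour can be sketched by walking .next_elements by hand. This is only an illustration of the idea for a plain tag-name match; tags_after is a hypothetical helper and the snippet is a shortened document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened document with two paragraphs.
html_doc = """<body><p class="story"><a id="link1">Elsie</a>,
<a id="link2">Lacie</a></p>
<p class="story">...</p></body>"""
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

def tags_after(element, name):
    """Illustrative helper: every later tag with this name, in document order."""
    return [el for el in element.next_elements
            if isinstance(el, Tag) and el.name == name]

later_paragraphs = tags_after(first_link, "p")

# The last <p> is found even though it is not a sibling of the starting <a>.
print([p.get_text() for p in later_paragraphs])
print(later_paragraphs == first_link.find_all_next("p"))
```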

#### find_all_previous() and find_previous()

Method signature: find_all_previous(name, attrs, string, limit, **kwargs)

Method signature: find_previous(name, attrs, string, **kwargs)

These methods use .previous_elements to iterate over the tags and strings that come before an element in the document. The find_all_previous() method returns all matches, and find_previous() only returns the first match:

In [315]:
first_link = soup.a
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [316]:
first_link.find_all_previous("p")

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="title"><b>The Dormouse's story</b></p>]

In [317]:
first_link.find_previous("title")

<title>The Dormouse's story</title>

The call to find_all_previous("p") found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the &lt;p> tag that contains the &lt;a> tag we started with. This shouldn’t be too surprising: we’re looking at all the tags that show up earlier in the document than the one we started with. A &lt;p> tag that contains an &lt;a> tag must have shown up before the &lt;a> tag it contains.

### CSS selectors
BeautifulSoup has a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements. Tag has a similar method which runs a CSS selector against the contents of a single tag.

(The SoupSieve integration was added in Beautiful Soup 4.7.0. Earlier versions also have the .select() method, but only the most commonly-used CSS selectors are supported. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don’t have to do anything extra.)

CSS selectors are patterns that pick out elements of a document; in a stylesheet they choose which elements to style. For background, see:  
https://www.w3schools.com/cssref/css_selectors.asp  
http://web.simmons.edu/~grabiner/comm244/weekfour/selectors.html  

The SoupSieve documentation lists all the currently supported CSS selectors, but here are some of the basics.

You can find tags:

In [318]:
soup.select("title")

[<title>The Dormouse's story</title>]

In [319]:
soup.select("p:nth-of-type(3)")

[<p class="story">...</p>]

Find tags beneath other tags:

In [320]:
soup.select("body a") # a beneath body

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [321]:
soup.select("html head title")

[<title>The Dormouse's story</title>]

Find tags directly beneath other tags:

In [322]:
soup.select("head > title")

[<title>The Dormouse's story</title>]

In [323]:
soup.select("p > a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [324]:
soup.select("p > a:nth-of-type(2)")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [325]:
soup.select("p > #link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [326]:
soup.select("body > a")

[]

Find the siblings of tags:

In [327]:
# Selects every element of class "sister" that comes after the element with id "link1" and shares its parent
soup.select("#link1 ~ .sister")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [328]:
# Selects the element of class "sister" that comes immediately after the element with id "link1"
soup.select("#link1 + .sister")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

In [329]:
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [330]:
# Selects all elements whose class attribute contains the word "sister"
soup.select("[class~=sister]")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID:

In [331]:
soup.select("#link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [332]:
soup.select("a#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags that match any selector from a list of selectors:

In [333]:
soup.select("#link1,#link2")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

In [334]:
soup.select('a[href]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

In [335]:
soup.select('a[href="http://example.com/elsie"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [336]:
# Selects every <a> element whose href attribute value begins with "http://example.com/"
soup.select('a[href^="http://example.com/"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [337]:
# Selects every <a> element whose href attribute value ends with "tillie"
soup.select('a[href$="tillie"]')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [338]:
# Selects every <a> element whose href attribute value contains the substring ".com/el"
soup.select('a[href*=".com/el"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

There’s also a method called select_one(), which finds only the first tag that matches a selector:

In [339]:
# The first element with the CSS class "sister"
soup.select_one(".sister")

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
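
In a scraping script you usually want the data inside the matched tags rather than the tags themselves. Here is a small sketch, assuming a two-link snippet, that pulls the href values out of a select() result and shows that select_one() returns None when nothing matches. The selector "a.brother" is made up to force a miss.

```python
from bs4 import BeautifulSoup

# Assumption: a two-link snippet from the story paragraph.
html_doc = """<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Requiring [href] in the selector makes the a["href"] lookup safe.
hrefs = [a["href"] for a in soup.select("a.sister[href]")]

# No tag has the class "brother", so there is nothing to return.
missing = soup.select_one("a.brother")

print(hrefs)
print(missing)
```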

If you’ve parsed XML that defines namespaces, you can use them in CSS selectors. (See the Beautiful Soup documentation.)