# Searching for Tags by Attributes

<p>Let’s create an example web scraper that scrapes the page located at <a href="http://www.pythonscraping.com/pages/warandpeace.html"><em>http://www.pythonscraping.com/pages/warandpeace.html</em></a>.</p>

<p>You can grab the entire page and create a <code>BeautifulSoup</code> object using the following program:</p>


In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
web_address = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(web_address)
bs = BeautifulSoup(html.read(), 'html.parser')

With this `BeautifulSoup` object, you can use the <font color=blue> find_all() </font> and <font color=blue> get_text() </font> methods to extract a **Python list** of the proper nouns that appear in green font. To do so, select only the text within <code> &lt;span class="green"&gt;&lt;/span&gt;</code> tags. Two notes about this code:

**<font color='red'>Note 1:</font>**  Previously, you called <code>bs.tagName</code> to get the first occurrence of that tag. Now, you are calling <code>bs.find_all(tagName, tagAttributes)</code>, which returns a **list of all of the matching tags** on the page, rather than just the first (we will discuss <code>find_all</code> in more detail later in this lecture).

**<font color='red'>Note 2:</font>** After getting a list of names, the program iterates through all names in the list, and prints <code>name.get_text()</code> in order to separate the content from the tags.
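
Before running the cell against the live page, the two notes can be tried offline. The sketch below is only an illustration: the HTML string is a made-up stand-in for the real page, and the variable names `first_span`, `green_spans`, and `green_names` are assumptions introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page (not the actual page source)
html = ('<p><span class="green">Anna</span> met '
        '<span class="red">a guest</span> and '
        '<span class="green">the prince</span>.</p>')
bs = BeautifulSoup(html, 'html.parser')

# Note 1: bs.span gives only the first <span>; find_all gives every match
first_span = bs.span
green_spans = bs.find_all('span', {'class': 'green'})

# Note 2: get_text() separates the content from the surrounding tag
green_names = [tag.get_text() for tag in green_spans]

print(first_span)        # the first <span> tag, markup included
print(len(green_spans))  # number of green spans found
print(green_names)       # text only, no tags
```

The red span is skipped because its `class` attribute does not match the dictionary passed as the second argument.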

In [2]:
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
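
Some names in the output above are split across two lines (for example, `Anna` followed by `Pavlovna Scherer`) because the text inside those spans contains a line break. If one name per line is preferred, one possible approach is to collapse runs of whitespace after calling `get_text()`. This is a minimal sketch on a made-up HTML string; `clean_names` is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first span contains a line break, as on the real page
html = ('<span class="green">Anna\nPavlovna Scherer</span> '
        '<span class="green">Prince Vasili Kuragin</span>')
bs = BeautifulSoup(html, 'html.parser')

clean_names = []
for name in bs.find_all('span', {'class': 'green'}):
    # str.split() with no arguments splits on any whitespace, including newlines;
    # joining with a single space puts the name back on one line
    clean_names.append(' '.join(name.get_text().split()))

print(clean_names)
```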


In [3]:
nameList

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="green">the prince</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">Prince Vasili</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">the prince</span>,
 <span class="green">Wintzingerode</span>,
 <span class="green">King of Prussia</span>,
 <span class="green">le Vicomte de Mortemart</span>,
 <span class="green">Montmorencys</span>,
 <span class="green">Rohans</span>,
 <span class="green">Abbe Morio</span>,
 <span class="green">the Emperor</span>,
 <span class="green">the prince</span>,
 

## `get_text()` method

`.get_text()` strips all tags from the object it is called on (the whole document or a single tag) and returns a Unicode string (a Python `str`) containing the text only.
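
As a small illustration of this behavior, the sketch below calls `get_text()` on a paragraph that contains nested tags. The HTML string is invented for this example and is not taken from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with nested tags
html = '<p>Well, <span class="green">Prince <b>Vasili</b></span> has arrived.</p>'
bs = BeautifulSoup(html, 'html.parser')

# get_text() drops <p>, <span>, and <b> and keeps only the text they contain
text = bs.p.get_text()

print(text)
print(type(text))  # a plain Python string, no longer a tag object
```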


Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Thus, calling `.get_text()` should always be the last thing you do, immediately before you print, store, or manipulate your final data.
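
The sketch below shows why the order matters, using a made-up HTML string. Once `get_text()` has been called, the result is a plain string with no `find_all` method, so any selection by tag or attribute has to happen before the text is extracted. The names `text_first` and `names` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one green and one red span
html = ('<p><span class="green">Anna Pavlovna</span> spoke to '
        '<span class="red">someone</span>.</p>')
bs = BeautifulSoup(html, 'html.parser')

# Calling get_text() too early: the tags (and their classes) are gone
text_first = bs.get_text()
print(text_first)
print(hasattr(text_first, 'find_all'))  # a str cannot be searched by tag

# Searching first, then extracting text as the last step
names = [span.get_text() for span in bs.find_all('span', {'class': 'green'})]
print(names)
```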

___

**Now go to "Session 5 Class Exercise" notebook and complete Exercise 1.**

___