---
title: Lesson 2. Meet the Animals
format:
  html:
    toc: true
    toc-expand: 2
    toc-title: CONTENTS
---

This lesson continues to explore the diverse features of BeautifulSoup, a Python library designed for parsing XML and HTML documents. We will utilize BeautifulSoup to extract information about a select group of animals showcased on the __[Meet the Animals](https://nationalzoo.si.edu/animals)__ webpage of Smithsonian's National Zoo and Conservation Biology Institute. Additionally, we will explore Pandas, a powerful Python library used for structuring, analyzing, and manipulating data.

## Data skills | concepts
- Search parameters
- HTML
- Web scraping
- Pandas data structures

## Learning objectives
1. Identify search parameters and understand how they are inserted into a URL.
2. Navigate document, element, attribute, and text nodes in a Document Object Model (DOM).
3. Extract and store HTML elements.
4. Export data to .csv.

This tutorial is designed to support multi-session __[workshops](https://library.osu.edu/events?combine=&tid=All&field_location_code_value=10&sort_bef_combine=field_end_date_value_ASC)__ hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts visit the [Python - Mastering the Basics](python_basics.ipynb) tutorial.

# LESSON 2

# Step 1. Copyright | Terms of Use
Before starting any web scraping or API project, you must:

## Review and understand the terms of use.

- Do the terms of service include any restrictions or guidelines?
- Are permissions/licenses needed to scrape data? If yes, have you obtained these permissions/licenses?
- Is the information publicly available?
- If a database, is the database protected by copyright? Or in the public domain?

## Fair Use 
Limited use of copyrighted materials is allowed under certain conditions for journalism, scholarship, and teaching. [Use the Resources for determining fair use](https://library.osu.edu/copyright/fair-use) to verify your project is within the scope of fair use. Contact University Libraries [Copyright Services](https://library.osu.edu/copyright) if you have any questions.

## Check for robots.txt directives
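Directives in a robots.txt file tell automated crawlers and scrapers which parts of a website they may access. Look for this file in the root directory of the website by adding /robots.txt to the end of the site's base URL (for example, https://nationalzoo.si.edu/robots.txt). Respect these directives.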
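
The check described above can also be done in Python with the standard library's `urllib.robotparser`. The sketch below parses a small, hypothetical robots.txt (the rules and the `example.org` URLs are assumptions made so the example runs offline; they are not the zoo's actual directives) and asks whether two paths may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so the example runs offline.
sample_robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /animals/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(useragent, url) returns True when the rules allow the request.
allowed = parser.can_fetch("*", "https://example.org/animals/meerkat")
blocked = parser.can_fetch("*", "https://example.org/admin/settings")

print(allowed)  # True
print(blocked)  # False
```

To check a live site instead, call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()` before calling `can_fetch()`.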

<div class="accordion" id="accordionExercise1">

  <div class="accordion-item"><h2 class="accordion-header" id="ex1-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex1-collapseOne" aria-expanded="true" aria-controls="ex1-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 1: Examine Copyright | Terms of Use</button></h2><div id="ex1-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex1-headingOne" data-bs-parent="#accordionExercise1"> <div class="accordion-body fs-4"><p>Locate and read the terms of use for the <a href="https://nationalzoo.si.edu/">Smithsonian's National Zoo & Conservation Biology Institute</a>.</p><ol><li>What are the copyright restrictions for this resource?</li><li>What are the terms of use?</li></ol></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex1-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex1-collapseTwo" aria-expanded="false" aria-controls="ex1-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex1-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex1-headingTwo" data-bs-parent="#accordionExercise1"> <div class="accordion-body"><p>The copyright restrictions for the <a href="https://nationalzoo.si.edu/">Smithsonian's National Zoo & Conservation Biology Institute</a> are listed in the <a href="https://www.si.edu/termsofuse">Terms of Use</a>, which is linked from the center of the footer at the bottom of the webpage.</p>

  </div>
  </div>
  </div>

</div>

# Step 2. Is an API available?

**Technically, yes**. The Smithsonian Institution provides an __[Open Access API](https://edan.si.edu/openaccess/apidocs/#api)__ that allows developers to access a wide range of data.

However, for learning purposes, we'll focus on **scraping a small sample** from the __[Meet the Animals](https://nationalzoo.si.edu/animals/list)__ HTML page. This will help us practice how to:

- Navigate a webpage's structure
- Extract specific HTML elements
- Store the data for further use

This hands-on approach is a great way to build foundational web scraping skills before working with APIs.

# Step 3. Examine the URL

<div class="accordion" id="accordionExercise2">

  <div class="accordion-item"><h2 class="accordion-header" id="ex2-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex2-collapseOne" aria-expanded="true" aria-controls="ex2-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 2: Examine the URL</button></h2><div id="ex2-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex2-headingOne" data-bs-parent="#accordionExercise2"> <div class="accordion-body fs-4"><p>Go to <a href="https://nationalzoo.si.edu/animals/list">Meet the Animals</a> and choose an animal to examine from the list. Note the structure of the URL. Return to <a href="https://nationalzoo.si.edu/animals/list">Meet the Animals</a> and select another animal to examine. Confirm the structure of the URL.</p></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex2-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex2-collapseTwo" aria-expanded="false" aria-controls="ex2-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex2-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex2-headingTwo" data-bs-parent="#accordionExercise2"> <div class="accordion-body"><p>The base URL for the <a href="https://nationalzoo.si.edu/animals/list">Meet the Animals</a> webpage is <strong>https://nationalzoo.si.edu/animals/list</strong>. Note the URL ends with the word `list`. If meerkat is chosen, the <a href="https://nationalzoo.si.edu/animals/meerkat">Meet the Animals</a> URL changes to <strong>https://nationalzoo.si.edu/animals/meerkat</strong>. The URL now ends with `meerkat`, the name of the animal.</p>

  </div>
  </div>
  </div>

</div>
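
Because each animal page differs only in the final path segment, the page addresses can be built in code. The sketch below is a minimal illustration using `urllib.parse.urljoin`; the list name `animal_slugs` is a hypothetical name chosen for this example. Note that this site identifies each animal with a path segment rather than a `?key=value` query string.

```python
from urllib.parse import urljoin

# The trailing slash matters: urljoin replaces the last path segment
# when the base URL does not end with "/".
base_url = "https://nationalzoo.si.edu/animals/"

# Hypothetical list of URL-friendly animal names for this example.
animal_slugs = ["meerkat", "patagonian-mara"]

urls = [urljoin(base_url, slug) for slug in animal_slugs]
for url in urls:
    print(url)
# https://nationalzoo.si.edu/animals/meerkat
# https://nationalzoo.si.edu/animals/patagonian-mara
```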

# Step 4. Inspect the elements

Both **XML** and **HTML** are structured as trees, where elements are nested within one another. When you request a URL, the server returns an HTML or XML document. Your browser then downloads and parses this document to display it visually.

In [Lesson 1](./lantern.ipynb) we worked with well-structured XML, which made it easy to navigate:

- Each article was uniquely identified by the `<LogicalSectionID>` tag.
- Titles appeared in the  `<LogicalSectionTitle>` tag.
- Category type was included in the  `<LogicalSectionType>` tag.


In contrast, **HTML** documents can be more complex and less predictable. Fortunately, Google Chrome's __[Developer Tools](https://developer.chrome.com/docs/devtools/dom)__ make it easier to explore and understand the structure of a webpage.

## Example:
Find the common name for `meerkat`.

1. **Open the [meerkat](https://nationalzoo.si.edu/animals/meerkat) Meet the Animals webpage** in Chrome.
2. **Right-click** on the element you want to inspect (e.g., the common name).
3. Select **Inspect**.

![meerkat_inspect.png](images/meerkat_inspect.png "Screenshot showing the word Inspect at the bottom of options in window that opens after right clicking webpage")

This opens the __[Developer Tools](https://developer.chrome.com/docs/devtools/dom)__ panel, typically on the right of the screen. 

- The default **Elements** tab shows the HTML structure (DOM).
- Scroll through the rendered HTML to explore more content.
  
- ![inspect icon](images/inspect_icon.png "Decorative") Click the **inspect icon** in the top-left corner of the Developer Tools panel.

- Hover over elements on the webpage to highlight them in the HTML. 

As you hover, Chrome will:

- Highlight the corresponding element on the page
- Show a tooltip with tag details (e.g., class, ID)
- Reveal the element's location in the HTML tree

![meerkat_inspect_element](images/meerkat_inspect_element.png "the Developer Tools highlight where the element is located in the HTML and provide a tooltip with additional information about the element")

This process helps you identify the exact tags and attributes you'll need to target when scraping data from the page.

## Viewing an Element's HTML Structure
To examine an element's exact location within the DOM:

1. In **Chrome Developer Tools**, right-click on the highlighted element.
2. Select **Copy > Copy element**.
3. Paste the copied HTML into **Notepad** or any text editor to view its full structure and attributes.

This is especially helpful for identifying tags, classes, and nesting when preparing to extract data through web scraping.

![meerkat_copy_element.png](images/meerkat_copy_element.png "Screenshot showing location of copy element")  

![meerkat_notepad.png](images/meerkat_notepad.png "Screenshot of HTML snippet for selected element")
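
Once you have copied an element's HTML, you can practice targeting it with BeautifulSoup before scraping the live page. The snippet below is a simplified, hypothetical stand-in for copied HTML: the `page-title` class and the placeholder text are assumptions for illustration, so compare them against what you actually see in Developer Tools.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML standing in for a snippet copied from
# Developer Tools. Class names and text are illustrative assumptions.
html = """
<div>
  <h1 class="page-title">Meerkat</h1>
  <h3>Suricata suricatta</h3>
  <h2 class="block-title">Size</h2>
  <div class="body">Placeholder size text</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Target an element by tag name and class attribute.
common_name = soup.find("h1", class_="page-title").get_text(strip=True)

# soup.h3 is shorthand for the first <h3> in the document.
scientific_name = soup.h3.get_text(strip=True)

# Find a heading by its text, then move to the next matching element.
size_heading = soup.find("h2", class_="block-title", string="Size")
size = size_heading.find_next("div", class_="body").get_text(strip=True)

print(common_name, "|", scientific_name, "|", size)
```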

<div class="accordion" id="accordionExercise3">

  <div class="accordion-item"><h2 class="accordion-header" id="ex3-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex3-collapseOne" aria-expanded="true" aria-controls="ex3-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 3: Inspect the Elements</button></h2><div id="ex3-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex3-headingOne" data-bs-parent="#accordionExercise3"> <div class="accordion-body fs-4"><p>Go to <a href="https://nationalzoo.si.edu/animals/list">Meet the Animals</a> and choose an animal to examine from the list. Inspect the following elements, select <span class="text-primary">Copy > Copy element</span>, and then paste the text into Notepad or a similar text editor.</p><ul><li>Common name</li><li>Scientific name</li><li>Taxonomic information</li><ul><li>Class</li><li>Order</li><li>Family</li><li>Genus and species</li></ul><li>Physical description</li><li>Size</li><li>Native habitat</li><li>Conservation status</li><li>Fun facts</li></ul></div></div>
  </div>

</div>


# Step 5. Identify Python libraries for project
## [requests](https://requests.readthedocs.io/en/latest/)
The [requests](https://requests.readthedocs.io/en/latest/) library retrieves HTML or XML documents from a server and processes the response. 

## [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)

[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) parses HTML and XML documents, helping you search for and extract elements from the DOM. 

## [pandas](https://pandas.pydata.org/docs/user_guide/index.html)
Pandas is a large Python library used for manipulating and analyzing tabular data. Helpful Pandas methods include:

### pd.DataFrame
A Pandas **DataFrame** is one of the most powerful and commonly used data structures in Python for working with **tabular data**: data that is organized in rows and columns, similar to a spreadsheet or SQL table.

A **DataFrame** is a 2-dimensional labeled data structure with:

- **Rows** (each representing an observation or record)
- **Columns** (each representing a variable or feature)

Think of it like an Excel sheet or a table in a database.

In [None]:
import pandas as pd

df = pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)  # all arguments are optional

🔗 See [Pandas DataFrame](https://pandas.pydata.org/docs/reference/frame.html) documentation.
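
As a minimal sketch of the row-and-column structure described above, the example below builds a small DataFrame from a dictionary of columns. The `checked` column and its values are made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list position becomes a row.
# The "checked" column is a hypothetical flag added for this example.
data = {
    "animal": ["elds-deer", "false-water-cobra", "patagonian-mara"],
    "checked": [False, False, False],
}

df = pd.DataFrame(data)

print(df.shape)          # (3, 2) -> 3 rows, 2 columns
print(list(df.columns))  # ['animal', 'checked']
print(list(df.index))    # [0, 1, 2] -> default integer row labels
```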

### pd.read_csv( )
The `pd.read_csv()` function is used to **read data from a CSV (Comma-Separated Values) file** and load it into a **DataFrame**.

In [None]:
pd.read_csv('INSERT FILEPATH HERE')

**Example**:

In [1]:
import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')  #df is a common abbreviation for DataFrame
df

|   | animal |
|---|--------|
| 0 | black-throated-blue-warbler |
| 1 | elds-deer |
| 2 | false-water-cobra |
| 3 | hooded-merganswer |
| 4 | patagonian-mara |


🔗 See [Pandas .read_csv( )](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) documentation.

### [.tolist( )](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html)
The `.tolist()` method converts a **Series** (a single column of data) into a **Python list**.

In [None]:
Series.tolist()

**Example**:

In [2]:
import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')
animals=df.animal.tolist()
animals

['black-throated-blue-warbler',
 'elds-deer',
 'false-water-cobra',
 'hooded-merganswer',
 'patagonian-mara']

🔗 See [.tolist( )](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html) documentation.

### .dropna( )

The `dropna()` method is used to **remove missing values (NaN)** from a DataFrame or Series. It's a fast and effective way to clean your data, but it should be used with care.

In [None]:
DataFrame.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)
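
The sketch below shows why `dropna()` should be used with care: depending on the arguments, it can discard most of a small dataset. The DataFrame contents are placeholder values invented for this example.

```python
import pandas as pd

# Placeholder values for illustration only; None marks a missing value.
df = pd.DataFrame({
    "common_name": ["elds-deer", "false-water-cobra", "patagonian-mara"],
    "size": ["size text", None, None],
    "status": ["status text", "status text", None],
})

complete_rows = df.dropna()                # drop rows with ANY missing value
has_status = df.dropna(subset=["status"])  # only require the status column
at_least_two = df.dropna(thresh=2)         # keep rows with 2+ non-missing values

print(len(df), len(complete_rows), len(has_status), len(at_least_two))
# 3 1 2 2
```

The default call keeps only one of the three rows, while `subset` and `thresh` give finer control over what counts as too incomplete.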

### .fillna( )

The `.fillna()` method is used to **replace NaN (missing) values** with a value you specify.

In [None]:
Series.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=<no_default>)

This is especially useful when you want to:

- Fill in missing data with a default value
- Use statistical values like the mean or median
- Forward-fill or backward-fill based on surrounding data

🔗 See [.fillna( )](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html#pandas.Series.fillna) documentation.
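
The three uses listed above can be sketched on a small Series. The numbers are made-up values for illustration. Recent pandas releases deprecate the `method` argument of `.fillna()`, so this example uses `.ffill()` for forward-filling instead.

```python
import pandas as pd

# Made-up numeric values; None becomes NaN in a float Series.
sizes = pd.Series([10.0, None, 30.0])

filled_default = sizes.fillna(0)             # fixed default value
filled_mean = sizes.fillna(sizes.mean())     # mean of non-missing values (20.0)
filled_forward = sizes.ffill()               # copy the previous value forward

print(filled_default.tolist())  # [10.0, 0.0, 30.0]
print(filled_mean.tolist())     # [10.0, 20.0, 30.0]
print(filled_forward.tolist())  # [10.0, 10.0, 30.0]
```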

### .iterrows( )

The `.iterrows()` method allows you to **iterate over each row** in a DataFrame as a pair:

- The **index** of the row
- The **row data** as a pandas Series

In [None]:
DataFrame.iterrows()

**Example**:

In [3]:
import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')
for idx, row in df.iterrows():
    print(row.animal)

black-throated-blue-warbler
elds-deer
false-water-cobra
hooded-merganswer
patagonian-mara


This is useful when you need to process rows one at a time, especially for tasks like conditional logic or row-wise operations.

<div class="alert alert-dismissible alert-warning" style="max-width: 100%;">
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/bullhorn_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Caution!</div>
  <div class="card-body"><p><span class="text-primary">.iterrows()</span> is not the most efficient method for large datasets. For better performance, consider using vectorized operations or <span class="text-primary">.itertuples()</span>.</p>
  </div>
</div>

🔗 See [.iterrows( )](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows) documentation.
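
The alternatives named in the caution can be sketched as follows. This is a small illustration on an in-memory DataFrame (built here rather than read from the lesson's .csv file so it runs on its own).

```python
import pandas as pd

df = pd.DataFrame({"animal": ["elds-deer", "patagonian-mara"]})

# .itertuples() yields one lightweight namedtuple per row;
# index=False leaves the row index out of each tuple.
names = [row.animal for row in df.itertuples(index=False)]

# A vectorized operation works on the whole column at once, with no loop.
name_lengths = df["animal"].str.len().tolist()

print(names)         # ['elds-deer', 'patagonian-mara']
print(name_lengths)  # [9, 15]
```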


### .iloc

The `.iloc` property is used to **select rows (and columns)** by their **integer position** (i.e., by index number, not label).



In [None]:
DataFrame.iloc[start:end]

**Example**:

In [4]:
import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')
for idx, row in df.iloc[0:1].iterrows():
    print(row.animal)

black-throated-blue-warbler


- `.iloc[row_index]` accesses a specific row
- `.iloc[row_index, column_index]` accesses a specific cell
- You can also use slicing to select multiple rows or columns
  
Use `.iloc` when:

- You want to access data by position, not by label
- You're working with numeric row/column indices
- You're iterating or slicing through rows or columns

🔗 See [.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc) documentation.
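
The access patterns listed above can be sketched on a small in-memory DataFrame. The `checked` column is a hypothetical addition so there is more than one column to select from.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["black-throated-blue-warbler", "elds-deer", "false-water-cobra"],
    "checked": [False, False, False],  # hypothetical column for illustration
})

first_row = df.iloc[0]      # one row, returned as a Series
cell = df.iloc[1, 0]        # row 1, column 0 -> a single value
last_two = df.iloc[-2:]     # slice: the last two rows
first_column = df.iloc[:, 0]  # all rows, first column

print(first_row["animal"])          # black-throated-blue-warbler
print(cell)                         # elds-deer
print(last_two["animal"].tolist())  # ['elds-deer', 'false-water-cobra']
print(len(df.iloc[0:1]))            # 1 (the end of a slice is excluded)
```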

### .concat( )

The `pandas.concat` function is used to **join two or more DataFrames** along a specific axis:

- **axis=0** → stacks DataFrames vertically (adds rows)
- **axis=1** → stacks DataFrames horizontally (adds columns)


In [None]:
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)

**Example**:

In [5]:
import pandas as pd

results=pd.DataFrame(columns=['common_name','size'])
df=pd.read_csv('data/meet_the_animals.csv')
for idx, row in df.iterrows():
    common_name=row.animal
    size=10
    data_row={
        'common_name':common_name,
        'size':size     
    }
    data=pd.DataFrame(data_row, index=[0])
    results=pd.concat([data, results], axis=0, ignore_index=True)

results

|   | common_name | size |
|---|-------------|------|
| 0 | patagonian-mara | 10 |
| 1 | hooded-merganswer | 10 |
| 2 | false-water-cobra | 10 |
| 3 | elds-deer | 10 |
| 4 | black-throated-blue-warbler | 10 |


🔗 See [.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas.concat) documentation.
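
Calling `pd.concat` inside a loop copies the growing DataFrame on every pass. One common alternative, offered here as a sketch rather than as the lesson's method, is to collect each row as a dictionary in a list and build the DataFrame once at the end. The `size` value of 10 is the same placeholder used in the example above.

```python
import pandas as pd

animals = ["black-throated-blue-warbler", "elds-deer", "false-water-cobra"]

rows = []  # one dictionary per animal
for animal in animals:
    rows.append({"common_name": animal, "size": 10})  # 10 is a placeholder

# Build the DataFrame in a single step; rows keep their original order.
results = pd.DataFrame(rows, columns=["common_name", "size"])

print(results["common_name"].tolist())
# ['black-throated-blue-warbler', 'elds-deer', 'false-water-cobra']
```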

---
# BONUS: try/except

Even with well-written code, things can go wrong, such as missing HTML tags on a webpage or inconsistent data formats. That's where Python's try / except blocks come in.

They allow your program to **handle errors gracefully** instead of crashing.

## 🧪 How It Works
- The code inside the **try** block is executed first.
- If an error occurs, Python jumps to the except block.
- Your program continues running without stopping unexpectedly.

**Example**:

In [6]:
import pandas as pd

results=pd.DataFrame(columns=['common_name','size'])
for idx, row in df.iterrows():
    try:
        common_name=row.animal
        size=10
        data_row={
            'common_name':common_name,
            'size':size     
        }
        data=pd.DataFrame(data_row, index=[0])
        results=pd.concat([data, results], axis=0, ignore_index=True)
    except Exception:  # catches ordinary errors, but not KeyboardInterrupt or SystemExit
        common_name='no name found'
        size=0
        data_row={
            'common_name':common_name,
            'size':size
        }
        data=pd.DataFrame(data_row, index=[0])
        results=pd.concat([data, results], axis=0, ignore_index=True)

results

|   | common_name | size |
|---|-------------|------|
| 0 | patagonian-mara | 10 |
| 1 | hooded-merganswer | 10 |
| 2 | false-water-cobra | 10 |
| 3 | elds-deer | 10 |
| 4 | black-throated-blue-warbler | 10 |


<div class="card border-primary mb-3 p-1" >
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/idea_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Tip:</div>
  <div class="card-body"><p>For a more detailed explanation with examples, ask Copilot to <span class="text-primary">explain try except python</span>.</p><img src="images/microsoft_copilot_icon.svg" alt="copilot icon" >
  </div>
</div>
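
To see the except block actually run, the sketch below parses two tiny, hypothetical HTML strings, one of which has no `<h3>` tag. When a tag is missing, `soup.h3` is `None`, and asking for `.text` on `None` raises an `AttributeError`, which the except block catches. Naming the specific exception keeps unrelated errors visible.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragments: the second one is missing its <h3> tag.
pages = {
    "with_h3": "<h3>Suricata suricatta</h3>",
    "without_h3": "<p>No heading here</p>",
}

scientific_names = {}
for label, html in pages.items():
    soup = BeautifulSoup(html, "html.parser")
    try:
        # soup.h3 is None when the tag is absent, so .text raises AttributeError.
        scientific_names[label] = soup.h3.text
    except AttributeError:
        scientific_names[label] = "no name found"

print(scientific_names)
# {'with_h3': 'Suricata suricatta', 'without_h3': 'no name found'}
```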

# Step 6. Write and test code
<div class="accordion" id="accordionExercise4">

  <div class="accordion-item"><h2 class="accordion-header" id="ex4-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex4-collapseOne" aria-expanded="true" aria-controls="ex4-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 4: Meet the Animals</button></h2><div id="ex4-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex4-headingOne" data-bs-parent="#accordionExercise4"> <div class="accordion-body fs-4"><p>Use pandas to read the <span class="text-primary">meet_the_animals.csv</span> file into a DataFrame and create a list of animal common_names. Then iterate through the list of common_names to gather the following elements from the webpages for each animal. Store the values for each variable in a Pandas DataFrame. Export DataFrame to .csv.</p><ul><li>Common name</li><li>Scientific name</li><li>Taxonomic information</li><ul><li>Class</li><li>Order</li><li>Family</li><li>Genus and species</li></ul><li>Physical description</li><li>Size</li><li>Native habitat</li><li>Conservation status</li><li>Fun facts</li></ul></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex4-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex4-collapseTwo" aria-expanded="false" aria-controls="ex4-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex4-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex4-headingTwo" data-bs-parent="#accordionExercise4"> <div class="accordion-body">

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

#1. Read in data/meet_the_animals.csv and create a list of animals to search
df = pd.read_csv('data/meet_the_animals.csv')
animals = df.animal.tolist()

# 2. Create a DataFrame for the search results
results = pd.DataFrame(columns=['common_name', 'scientific_name', 'class',
                                'order', 'family', 'genus_species', 'physical_description',
                                'size', 'native_habitat', 'status', 'fun_facts'])

# 3. Identify the base url
base_url = 'https://nationalzoo.si.edu/animals/'

# 4. Iterate through the list of animals. Construct a url for each animal's
# website. Create a dictionary to store variables for each animal
# Then request and parse the HTML for each website, extract the variables and
# store variables in dictionary. 

count = 1
for animal in animals:
    print(f"Starting #{count} {animal}")
    count += 1
    row={} #dictionary to store variables for each animal
    url=base_url+animal
    response=requests.get(url).text
    soup=BeautifulSoup(response, 'html.parser')
    common_name=animal
    scientific_name = soup.h3.text
    row['common_name']=common_name
    row['scientific_name']=scientific_name
    block_titles=soup.find_all('h2',{'class':'block-title'})
    # # find_taxonomic_information=soup.find_all('div',{'class':'views-element-container'})
    for each_tag in block_titles:
        # print(each_tag.text)
        if each_tag.text == 'Taxonomic Information':
            # print(each_tag.text)
            biological_classifications=each_tag.find_all_next('span',{'class':'italic'})
            biological_class=biological_classifications[0].text  #named this biological_class because class alone is reserved word in Python
            biological_order=biological_classifications[1].text
            biological_family=biological_classifications[2].text
            biological_genus=biological_classifications[3].text
            row['class']=biological_class
            row['order']=biological_order
            row['family']=biological_family
            row['genus_species']=biological_genus
        elif each_tag.text == 'Physical Description':
            physical_description=each_tag.find_next('div',{'class':'body'}).text.strip()
            row['physical_description']=physical_description
        elif each_tag.text == 'Size':
            size=each_tag.find_next('div',{'class':'body'}).text.strip()
            row['size']=size
        elif each_tag.text == 'Native Habitat':
            habitat=each_tag.find_next('div',{'class':'body'}).text.strip()
            row['native_habitat']=habitat
        elif each_tag.text == 'Conservation Status':  
            status=each_tag.find_next('ul')['data-designation']
            row['status']=status  # key must match the 'status' column defined in results
        elif each_tag.text == 'Fun Facts':  
            facts=[]
            facts_list=each_tag.find_next('ol').find_all('li')
            for each_fact in facts_list:
                facts.append(each_fact.text)
            facts=(' ').join(facts)
            row['fun_facts']=facts
            
    each_row=pd.DataFrame(row, index=[0])
    
    #5. Concatenate each row to results.
    results=pd.concat([each_row, results], axis=0, ignore_index=True)

#6. Write results to csv    
results.to_csv('data/animals.csv') 
```
  </div>
  </div>
  </div>

</div>
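
By default, `.to_csv()` also writes the DataFrame's row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back. The sketch below demonstrates this with an in-memory buffer (`io.StringIO`) standing in for a file on disk, and shows that passing `index=False` leaves the index out.

```python
import io
import pandas as pd

results = pd.DataFrame({"common_name": ["elds-deer", "patagonian-mara"]})

# Default behavior: the row index is written as an extra, unnamed column.
with_index = io.StringIO()
results.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
results.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)     # ['Unnamed: 0', 'common_name']
print(columns_without_index)  # ['common_name']
```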