---
title: Lesson 3. Wikipedia
format:
  html:
    toc: true
    toc-expand: 2
    toc-title: CONTENTS
---

This lesson introduces `pandas.read_html`, a useful tool for extracting tables from HTML, and continues to explore BeautifulSoup, a Python library designed for parsing XML and HTML documents. We will start by gathering artists found on the __[List of Rock and Roll Hall of Fame inductees](https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees)__ webpage in Wikipedia. We will then assemble discographies for 2-3 of our favorite artists.

## Data skills | concepts
- Search parameters
- HTML
- Web scraping
- Pandas 

## Learning objectives
1. Extract and store tables and other HTML elements in a structured format
2. Apply best practices for managing data

This tutorial is designed to support multi-session __[workshops](https://library.osu.edu/events?combine=&tid=All&field_location_code_value=10&sort_bef_combine=field_end_date_value_ASC)__ offered by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts visit the [Python - Mastering the Basics](python_basics.ipynb) tutorial.

# LESSON 3

[Lesson 1](lantern.ipynb) and [Lesson 2](animals.ipynb) introduced the basic steps for any web scraping or API project:

1. Review and understand copyright and terms of use.
2. Check to see if an API is available.
3. Examine the URL.
4. Inspect the elements.
5. Identify Python libraries for the project.
6. Write and test code.

<div class="accordion" id="accordionExercise1">

  <div class="accordion-item"><h2 class="accordion-header" id="ex1-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex1-collapseOne" aria-expanded="true" aria-controls="ex1-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 1: Examine Copyright | Terms of Use</button></h2><div id="ex1-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex1-headingOne" data-bs-parent="#accordionExercise1"> <div class="accordion-body fs-4"><p>Locate and read the terms of use for the  <a href="https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees">Rock and Roll Hall of Fame</a> inductees found on Wikipedia.</p><ol><li>What are the copyright restrictions for this resource? </li><li>What are the terms of use?</li></ol></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex1-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex1-collapseTwo" aria-expanded="false" aria-controls="ex1-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex1-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex1-headingTwo" data-bs-parent="#accordionExercise1"> <div class="accordion-body"><p>Read Wikipedia <a href="https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_4.0_International_License">Creative Commons Attribution-ShareAlike 4.0 License</a></p>


  </div>
  </div>
  </div>

</div>

<div class="accordion" id="accordionExercise2">

  <div class="accordion-item"><h2 class="accordion-header" id="ex2-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex2-collapseOne" aria-expanded="true" aria-controls="ex2-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 2: API available?</button></h2><div id="ex2-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex2-headingOne" data-bs-parent="#accordionExercise2"> <div class="accordion-body fs-4"><p>Is an API available?</p></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex2-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex2-collapseTwo" aria-expanded="false" aria-controls="ex2-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex2-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex2-headingTwo" data-bs-parent="#accordionExercise2"> <div class="accordion-body"><p>Yes, but it may take some time to learn how to use.  It is not necessary to use for this lesson.</p>


  </div>
  </div>
  </div>

</div>

<div class="accordion" id="accordionExercise3">

  <div class="accordion-item"><h2 class="accordion-header" id="ex3-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex3-collapseOne" aria-expanded="true" aria-controls="ex3-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 3: Examine the URL</button></h2><div id="ex3-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex3-headingOne" data-bs-parent="#accordionExercise3"> <div class="accordion-body fs-4"><p>Select 2-3 inductees from the Performers category and locate their album discography on Wikipedia. Examine the structure of the URL for each album discography page.</p></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex3-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex3-collapseTwo" aria-expanded="false" aria-controls="ex3-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex3-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex3-headingTwo" data-bs-parent="#accordionExercise3"> <div class="accordion-body"><h3>Example: Parliament-Funkadelic</h3><p><strong>Parliament discography</strong></p><p><a href="https://en.wikipedia.org/wiki/Parliament_discography">https://en.wikipedia.org/wiki/Parliament_discography</a></p><p><em>https:// + en.wikipedia.org/ + wiki/ + Parliament_discography</em></p><p><strong>Funkadelic discography</strong></p><p><a href="https://en.wikipedia.org/wiki/Funkadelic_discography">https://en.wikipedia.org/wiki/Funkadelic_discography</a></p><p><em>https:// + en.wikipedia.org/ + wiki/ + Funkadelic_discography</em></p>


  </div>
  </div>
  </div>

</div>

<div class="accordion" id="accordionExercise4">

  <div class="accordion-item"><h2 class="accordion-header" id="ex4-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex4-collapseOne" aria-expanded="true" aria-controls="ex4-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 4: Inspect the elements</button></h2><div id="ex4-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex4-headingOne" data-bs-parent="#accordionExercise4"> <div class="accordion-body fs-4"><p>Go to the discography page for each artist you selected. Are there differences in how the tables are structured and presented on each page? For example, do tables have a title or header? Inspect the elements and think about how you might use BeautifulSoup to gather the headers for each table.</p></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex4-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex4-collapseTwo" aria-expanded="false" aria-controls="ex4-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex4-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex4-headingTwo" data-bs-parent="#accordionExercise4"> <div class="accordion-body">

```python
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Parliament_discography'
# Adding a User-Agent is sometimes necessary; some sites reject requests that do not identify a client.
# Ask Copilot why identifying a user-agent for the headers parameter in requests.get is often necessary when scraping web data.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
# Each section heading sits inside a <div class="mw-heading2"> wrapper.
# A separate name is used here so the request headers above are not overwritten.
section_headings = soup.find_all('div', {'class': 'mw-heading2'})
for heading in section_headings:
    print(heading.find('h2').text)
```
  </div>
  </div>
  </div>

</div>

<div class="accordion" id="accordionExercise5">

  <div class="accordion-item"><h2 class="accordion-header" id="ex5-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex5-collapseOne" aria-expanded="true" aria-controls="ex5-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 5: Identify Python libraries</button></h2><div id="ex5-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex5-headingOne" data-bs-parent="#accordionExercise5"> <div class="accordion-body fs-4"><p>List the Python libraries you may need to import for this project.</p></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex5-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex5-collapseTwo" aria-expanded="false" aria-controls="ex5-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex5-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex5-headingTwo" data-bs-parent="#accordionExercise5"> <div class="accordion-body">

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
```
  </div>
  </div>
  </div>

</div>

# Pandas
## .read_html() 

Read HTML tables directly into DataFrames with __[.read_html()](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)__. This tool extracts all tables present in a specified URL or file and returns them as a list of DataFrames, so each table can be accessed using standard list indexing and slicing syntax.

The following code instructs Python to go to the Wikipedia __[List of states and territories of the United States](https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States)__ and retrieve the second table.

```python
import requests
import pandas as pd
from io import StringIO

url='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
headers = {'User-Agent': 'Mozilla/5.0'}
response=requests.get(url, headers=headers).text
tables=pd.read_html(StringIO(response))
tables[1]
```
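
To see how the list of tables behaves without making a web request, here is a minimal sketch that parses a small HTML string. The HTML snippet and its values are made up for illustration and are not taken from Wikipedia. The `match` parameter is part of `pandas.read_html` and keeps only tables whose text matches the given string or regular expression, which can be more robust than relying on a table's position.

```python
from io import StringIO
import pandas as pd

# Hypothetical HTML with two small tables (illustration only)
html = """
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Ohio</td><td>Columbus</td></tr>
</table>
<table>
  <tr><th>Territory</th><th>Capital</th></tr>
  <tr><td>Guam</td><td>Hagatna</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table found
tables = pd.read_html(StringIO(html))
states = tables[0]

# match= keeps only the tables containing the given text
territories = pd.read_html(StringIO(html), match="Territory")

print(len(tables))
print(states)
print(territories[0])
```

Selecting by position (`tables[1]`) is quick, but a page edit that adds or removes a table shifts the index. Matching on text that appears in the table is one way to guard against that.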

<div class="accordion" id="accordionExercise6">

  <div class="accordion-item"><h2 class="accordion-header" id="ex6-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex6-collapseOne" aria-expanded="true" aria-controls="ex6-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 6: .read_html()</button></h2><div id="ex6-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex6-headingOne" data-bs-parent="#accordionExercise6"> <div class="accordion-body fs-4">Visit the Wikipedia <a href="https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees">List of Rock and Roll Hall of Fame inductees</a> and extract the Performers table.</div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex6-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex6-collapseTwo" aria-expanded="false" aria-controls="ex6-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex6-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex6-headingTwo" data-bs-parent="#accordionExercise6"> <div class="accordion-body">

```python
import requests
import pandas as pd
from io import StringIO

url='https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees'
headers = {'User-Agent': 'Mozilla/5.0'}
response=requests.get(url, headers=headers).text
tables=pd.read_html(StringIO(response))
performers=tables[0]
performers
```
  </div>
  </div>
  </div>

</div>


# BeautifulSoup
## .find_previous( ) and .find_all_previous( )

Similar to __[.find_next( ) and .find_all_next( )](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all-next-and-find-next)__, __[.find_previous( )](https://beautiful-soup-4.readthedocs.io/en/latest/#find-all-previous-and-find-previous)__ and __[.find_all_previous( )](https://beautiful-soup-4.readthedocs.io/en/latest/#find-all-previous-and-find-previous)__ search backward through the document from a given element. `.find_previous( )` returns the closest preceding instance of a named tag, while `.find_all_previous( )` returns every preceding instance.
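
The difference between the two methods can be sketched on a small HTML string. The snippet below is a simplified, made-up imitation of how a discography page might wrap its headings (the `id` values `t1` and `t2` are assumptions added for illustration), not a copy of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page structure (illustration only)
html = """
<div class="mw-heading2"><h2>Albums</h2></div>
<div class="mw-heading3"><h3>Studio albums</h3></div>
<table id="t1"><tr><td>1983</td></tr></table>
<div class="mw-heading3"><h3>Live albums</h3></div>
<table id="t2"><tr><td>2004</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
second_table = soup.find("table", id="t2")

# find_previous returns the single closest match before the table
nearest_h3 = second_table.find_previous("h3").text
nearest_h2 = second_table.find_previous("h2").text

# find_all_previous returns every earlier match, closest first
all_h3 = [h.text for h in second_table.find_all_previous("h3")]

print(nearest_h2, nearest_h3, all_h3)
```

Note that `.find_previous( )` keeps searching backward past section boundaries, so a table with no heading of its own will pick up the heading of an earlier section.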

<div class="accordion" id="accordionExercise7">

  <div class="accordion-item"><h2 class="accordion-header" id="ex7-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex7-collapseOne" aria-expanded="true" aria-controls="ex7-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 7: Gather discographies</button></h2><div id="ex7-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex7-headingOne" data-bs-parent="#accordionExercise7"> <div class="accordion-body fs-4"><p>Gather the discography tables from the discography Wikipedia pages for the 2-3 artists you identified in <strong>Exercise 3</strong>.</p><ol><li>Identify base url</li><li>Store the secondary_url parts for each artist in a list, such as ...</li>
```python
artists=['Cyndi_Lauper_discography','Joe_Cocker_discography','The_White_Stripes_discography']
```
<li>Create a for loop to iterate through each artist in your list.</li><li>Use .read_html to gather the discography tables for each artist on your list.</li>
<div class="card border-primary mb-3 p-1" style="max-width: 100%;">
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/idea_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Tip</div>
  <div class="card-body">Use index slicing to focus on one artist page first as you work through steps 4 and 5.
  </div>
</div>
<li>Create an empty list for headers to gather table headers for each artist.</li><li>Use requests with BeautifulSoup and .find_previous() to gather the headers for each table and append each header to your headers list. Check to see if your table headers match the HTML tables you gathered with .read_html.</li>
<div class="card border-primary mb-3 p-1" style="max-width: 100%;">
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/idea_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Tip</div>
  <div class="card-body">Headers do not always exist for each table. You may need to use a try-except block and/or if statements to accurately construct your header list.
  </div>
</div>



</ol></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex7-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex7-collapseTwo" aria-expanded="false" aria-controls="ex7-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex7-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex7-headingTwo" data-bs-parent="#accordionExercise7"> <div class="accordion-body">

```python
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

artists=['Cyndi_Lauper_discography','Joe_Cocker_discography','The_White_Stripes_discography']
base_url='https://en.wikipedia.org/wiki/'

for artist in artists[0:1]:

    url=base_url+artist
    headers = {'User-Agent': 'Mozilla/5.0'}
    response=requests.get(url, headers=headers)
    response.encoding = 'utf-8' #requests.get() does not accept an encoding parameter, but you can set encoding on the response object after the request.
    text=response.text
    soup=BeautifulSoup(text, 'html.parser')

    #FIND TABLES
    #Wrap the HTML string in StringIO; recent pandas versions deprecate passing a literal HTML string.
    html_tables=pd.read_html(StringIO(text))

    #FIND TABLE_HEADERS
    #Each table contributes two entries to table_headers: its nearest h2 and its nearest h3.
    table_headers=[]
    table_number=0
    tables=soup.find_all('table')
    for table in tables:
        if table.find_previous('div',{'class':'mw-heading2'}) is not None:
            #The heading text ends with "[edit]", so keep only the part before "["
            h2=table.find_previous('div',{'class':'mw-heading2'}).text.split('[')[0]
            table_headers.append(h2.lower().replace(' ','_'))
        else:
            h2='no_header'
            table_headers.append(h2)
        if table.find_previous('div',{'class':'mw-heading3'}) is not None:
            h3=table.find_previous('div',{'class':'mw-heading3'}).text.split('[')[0]
            table_headers.append(h3.lower().replace(' ','_'))
        else:
            h3='no_header'
            table_headers.append(h3)
        print(f"table_number_{table_number}: {h2}, {h3}")
        table_number += 1
```
  </div>
  </div>
  </div>

</div>



# Managing files
There are several best practices and considerations for effectively __[managing research data](https://guides.osu.edu/rdm-best-practice/organization)__ files. When extracting and locally storing data in individual files, using standardized file naming conventions not only helps you organize and utilize your files efficiently but also facilitates sharing and collaboration with others in future projects. 

- Use short, descriptive names.
- Use `_` underscores or `-` dashes instead of spaces in your file names. Use leading zeros for sequential numbers to ensure proper sorting. 

`file_001_20250506.txt`\
`file_002_20250506.txt`

- Use all lowercase for directory and filenames if possible. 
- Avoid special characters, including `~!@#$%^&*()[]{}?:;<>|\/`
- Use standardized dates `YYYYMMDD` to track versions and updates.
- Include version control numbers to keep track of projects.
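
As a minimal sketch of these conventions, the helper below builds a filename from a descriptive name, a sequence number, and a date. The function name `make_filename` and its parameters are hypothetical and chosen only for illustration.

```python
import re
from datetime import date

def make_filename(stem, number, when, ext="txt"):
    """Build a lowercase filename with a zero-padded number and a YYYYMMDD date."""
    # Lowercase the name and replace spaces and special characters with underscores
    clean = re.sub(r"[^a-z0-9_-]+", "_", stem.lower()).strip("_")
    # :03d pads the number with leading zeros; %Y%m%d formats the date as YYYYMMDD
    return f"{clean}_{number:03d}_{when:%Y%m%d}.{ext}"

first = make_filename("file", 1, date(2025, 5, 6))
second = make_filename("Studio Albums", 2, date(2025, 5, 6), ext="csv")
print(first)
print(second)
```

Passing `date.today()` as the `when` argument would stamp files with the current date instead of a fixed one.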

## os module
The __[os](https://docs.python.org/3/library/os.html)__ module provides functions for working with the file system, such as creating directories and telling Python where to find and save files.

### os.mkdir('path')
Creates a new directory in your project folder or another specified location. If a directory with the same name already exists at the specified path, `os.mkdir` raises a `FileExistsError` (a subclass of `OSError`). Use a try-except block to handle the error.

```python
import os
artist = "Cyndi_Lauper"

try:
    os.mkdir(artist)
except FileExistsError:
    print(f"Directory '{artist}' already exists.")
except Exception as e:
    print(f"An error occurred: {e}")
```
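
An alternative to the try-except pattern, offered here as a sketch rather than the lesson's required approach, is `os.makedirs` with `exist_ok=True`, which does nothing if the directory is already there. The example works inside a temporary folder so it does not touch your project directory. The directory name `cyndi_lauper` is only an illustration.

```python
import os
import tempfile

# Work inside a throwaway folder so the example leaves no files behind in the project
base = tempfile.mkdtemp()
target = os.path.join(base, "cyndi_lauper")

# exist_ok=True suppresses the error when the directory already exists
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # second call is harmless

created = os.path.isdir(target)
print(created)
```

Unlike `os.mkdir`, `os.makedirs` also creates any missing parent directories in the path.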

<div class="accordion" id="accordionExercise8">

  <div class="accordion-item"><h2 class="accordion-header" id="ex8-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex8-collapseOne" aria-expanded="true" aria-controls="ex8-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 8</button></h2><div id="ex8-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex8-headingOne" data-bs-parent="#accordionExercise8"> <div class="accordion-body fs-4"><ol><li>Make a directory for each artist in your project folder.</li><li>Use the DataFrame .to_csv() method to create a file for each table. Incorporate the table number and header in the filename.</li>
  <div class="card border-primary mb-3 p-1" style="max-width: 100%;">
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/idea_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Tip</div>
  <div class="card-body">Increment counter variables for table numbers.
  </div>
</div>
  </ol></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex8-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex8-collapseTwo" aria-expanded="false" aria-controls="ex8-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex8-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex8-headingTwo" data-bs-parent="#accordionExercise8"> <div class="accordion-body">

```python
import os
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

artists=['Cyndi_Lauper_discography','Joe_Cocker_discography','The_White_Stripes_discography']
base_url='https://en.wikipedia.org/wiki/'

for artist in artists[0:1]:

    url=base_url+artist
    headers = {'User-Agent': 'Mozilla/5.0'}
    response=requests.get(url, headers=headers)
    response.encoding = 'utf-8' #requests.get() does not accept an encoding parameter, but you can set encoding on the response object after the request.
    text=response.text
    soup=BeautifulSoup(text, 'html.parser')

    #FIND TABLES
    #Wrap the HTML string in StringIO; recent pandas versions deprecate passing a literal HTML string.
    html_tables=pd.read_html(StringIO(text))

    #FIND TABLE_HEADERS
    #Prefer the nearest h3 heading, fall back to the nearest h2, otherwise record 'no_header'
    table_headers=[]
    table_number=0
    tables=soup.find_all('table')
    for table in tables:
        if table.find_previous('div',{'class':'mw-heading3'}) is not None:
            h3=table.find_previous('div',{'class':'mw-heading3'}).text.split('[')[0]
            table_headers.append(h3.lower().replace(' ','_'))
            print(f"table_number_{table_number}: {h3}")
        elif table.find_previous('div',{'class':'mw-heading2'}) is not None:
            h2=table.find_previous('div',{'class':'mw-heading2'}).text.split('[')[0]
            table_headers.append(h2.lower().replace(' ','_'))
            print(f"table_number_{table_number}: {h2}")
        else:
            h='no_header'
            table_headers.append(h)
            print(f"table_number_{table_number}: {h}")
        table_number += 1

    #CREATE A DIRECTORY FOR EACH ARTIST AND OUTPUT TABLES TO THE DIRECTORY
    artist_directory=artist.replace('_discography','').lower()
    try:
        os.mkdir(artist_directory)
    except FileExistsError:
        print(f"Directory '{artist_directory}' already exists.")
    except Exception as e:
        print(f"An error occurred: {e}")

    position=0
    for each_header in table_headers:
        #Skip tables without a header; the value must match the 'no_header' string set above
        if each_header != 'no_header':
            filename=artist_directory+'/table_number_'+str(position)+'_'+each_header+'.csv'
            print(filename)
            html_table=html_tables[position]
            html_table.to_csv(filename)
        position += 1
```
  </div>
  </div>
  </div>

</div>





