### Name: Asha Cumberbatch 
### Date: April 8th
### Assignment: Project 2 part 2 - Minimum wage table
### Purpose: The aim of this notebook is to pull the data from the tables on the List of US states by minimum wage Wikipedia page (https://en.wikipedia.org/wiki/List_of_US_states_by_minimum_wage). The page will be web scraped and the data cleaned. 

### The columns of particular interest are State and the 2024 minimum wage, which will be saved to a csv file. That csv file will be merged with other csv files, created from similar pages, and saved as a new data frame. The resulting data frame will be used to gain insight into how these metrics differ by state.

##### The first step is to import the necessary packages: requests, BeautifulSoup (imported as bs), and pandas (imported as pd).

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Minimum wage

##### Before attempting to web scrape the page, Wikipedia's robots.txt file was checked to ensure that scraping was allowed.
##### The code that pulls the page also contains an if statement, which prints an error message if the request does not return status code 200.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_US_states_by_minimum_wage'
response = requests.get(url)
status = response.status_code
if status == 200:
    page = response.text
    soup = bs(page, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
else:
    print(f"Oops! Received status code {status}")
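
The robots.txt check described above can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser. The rules in sample_rules are a hypothetical stand-in, not Wikipedia's actual robots.txt, so that the example runs without a network call.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules used only for illustration;
# the real file is served at https://en.wikipedia.org/robots.txt
sample_rules = """
User-agent: *
Disallow: /w/
Disallow: /api/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # parse rules from a list of lines instead of fetching them

page_url = 'https://en.wikipedia.org/wiki/List_of_US_states_by_minimum_wage'
blocked_url = 'https://en.wikipedia.org/w/index.php'

# can_fetch(useragent, url) reports whether the rules allow that agent to request the url
page_allowed = parser.can_fetch('*', page_url)
blocked_allowed = parser.can_fetch('*', blocked_url)
print(page_allowed, blocked_allowed)
```

To check the live file instead, parser.set_url('https://en.wikipedia.org/robots.txt') followed by parser.read() would fetch and parse it before calling can_fetch.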

In [3]:
print(soup.prettify())
type(soup)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of US states by minimum wage - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-lim

bs4.BeautifulSoup

##### To help identify and select the right table, a loop prints the number of tables on the page and the first 500 characters of each table.

In [4]:
tables = soup.find_all('table')  
print(len(tables))  

for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.prettify()[:500])  
    print("\n")

6
Table 0:
<table class="wikitable sortable mw-datatable static-row-numbers sticky-table-head">
 <caption>
  US state minimum wage rates. 2025.
  <sup class="reference" id="cite_ref-paycom_11-1">
   <a href="#cite_note-paycom-11">
    <span class="cite-bracket">
     [
    </span>
    11
    <span class="cite-bracket">
     ]
    </span>
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th>
    State
   </th>
   <th data-sort-type="currency" style="text-align:left">
    2025
   </th>
  </tr>
  <tr>
   <td s


Table 1:
<table class="wikitable sortable mw-datatable static-row-numbers sticky-table-col1 sticky-table-head" style="text-align:right;">
 <caption>
  US state minimum wage rates. 2022-2024.
  <sup class="reference" id="cite_ref-DOL_1-4">
   <a href="#cite_note-DOL-1">
    <span class="cite-bracket">
     [
    </span>
    1
    <span class="cite-bracket">
     ]
    </span>
   </a>
  </sup>
  <sup class="reference" id="cite_ref-DOL-historical_2-2">
   <a href="#cite_note-DOL-histo
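
Selecting a table by position works as long as the page layout does not change. As an alternative, the sketch below picks a table by its caption text. The helper find_table_by_caption and the small sample_html string are hypothetical and written only for this illustration; the caption wording is copied from the output printed above and may differ on the live page.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page: two tables with captions like those printed above
sample_html = """
<table><caption>US state minimum wage rates. 2025.</caption>
<tbody><tr><th>State</th><th>2025</th></tr></tbody></table>
<table><caption>US state minimum wage rates. 2022-2024.</caption>
<tbody><tr><th>State</th><th>2022</th><th>2023</th><th>2024</th></tr></tbody></table>
"""

def find_table_by_caption(soup, caption_text):
    """Return the first table whose caption contains caption_text, or None."""
    for table in soup.find_all('table'):
        caption = table.find('caption')
        if caption is not None and caption_text in caption.get_text():
            return table
    return None

sample_soup = BeautifulSoup(sample_html, 'html.parser')
wanted = find_table_by_caption(sample_soup, '2022-2024')

# Read the header cells of the selected table to confirm it is the right one
headers = [th.get_text(strip=True) for th in wanted.find_all('th')]
print(headers)
```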

#### An empty list is created to hold the rows that will be pulled from the table. The table to be scraped, the second on the page (index 1), is also displayed.

In [5]:
minimum_wage_list = []
minimum_wage_table = tables[1].tbody  # Select the second table (index 1)
minimum_wage_table

<tbody><tr>
<th>State
</th>
<th data-sort-type="currency">2022
</th>
<th data-sort-type="currency">2023
</th>
<th data-sort-type="currency">2024<sup class="reference" id="cite_ref-DOL_1-5"><a href="#cite_note-DOL-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup><sup class="reference" id="cite_ref-mw-org_4-3"><a href="#cite_note-mw-org-4"><span class="cite-bracket">[</span>4<span class="cite-bracket">]</span></a></sup>
</th></tr>
<tr>
<td style="text-align:left"><span class="flagicon" style="display:inline-block;width:25px;text-align:left"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="400" data-file-width="600" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/40px-Flag_of_Alabama.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/60px-Flag_of_Alabama.svg.png 2x" width="23"/></span></span></span> <a 

In [6]:
# determines the maximum number of columns in the table (minimum_wage_table)
max_columns = max([len(row.find_all(['th', 'td'])) for row in minimum_wage_table.find_all('tr')])

for row in minimum_wage_table.find_all('tr'): # loops over every row in the table, including the header (slice with [1:] to skip the header)
    cells = row.find_all(['th', 'td']) # finds all the header (th) and data (td) elements in the row
    row_data = [cell.text.strip() for cell in cells]  # pulls the text from each cell and removes leading and trailing whitespace
    
    # in case any of the columns are missing, this loop inserts None at the beginning of the row
    while len(row_data) < max_columns: # checks for missing column entries by comparing against the value in max_columns
        row_data.insert(0, None)  # 0 tells the code to add the placeholder None as the first column
    
    minimum_wage_list.append(row_data)

for row in minimum_wage_list:
    print(row)

['State', '2022', '2023', '2024[1][4]']
['Alabama', '$7.25[note 1]', '$7.25[note 1]', '$7.25[note 1]']
['Alaska', '$10.34', '$10.85', '$11.73']
['Arizona', '$12.80', '$13.85', '$14.35']
['Arkansas', '$11.00', '$11.00', '$11.00/7.25[f][1]']
['California', '$15.00', '$15.50', '$16.00']
['Colorado', '$12.56', '$13.65', '$14.42']
['Connecticut', '$14.00', '$15.00[14]', '$15.69']
['Delaware', '$10.50', '$11.75', '$13.25']
['Florida', '$11.00', '$12.00', '$12.00. $13.00 on Sept 30, 2024']
['Georgia', '$7.25[note 2]', '$7.25[note 2]', '$7.25[note 2]']
['Hawaii', '$10.10', '$12.00', '$14.00']
['Idaho', '$7.25', '$7.25', '$7.25']
['Illinois', '$12.00', '$13.00', '$14.00']
['Indiana', '$7.25', '$7.25', '$7.25']
['Iowa', '$7.25', '$7.25', '$7.25']
['Kansas', '$7.25', '$7.25', '$7.25']
['Kentucky', '$7.25', '$7.25', '$7.25']
['Louisiana', '$7.25[note 1]', '$7.25[note 1]', '$7.25[note 1]']
['Maine', '$12.75', '$13.80', '$14.15']
['Maryland', '$12.50', '$13.25', '$15.00']
['Massachusetts', '$14.25',
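
The padding step above can be easier to follow on a small example. The sketch below applies the same idea to toy rows; the helper pad_rows is a hypothetical name, and the short third row is invented to trigger the padding (the real table may never produce one). Note that padding on the left assumes the missing cells are the leading ones.

```python
def pad_rows(rows, fill=None):
    """Left-pad every row with `fill` so all rows match the longest row."""
    width = max(len(row) for row in rows)
    return [[fill] * (width - len(row)) + row for row in rows]

toy_rows = [
    ['State', '2022', '2023', '2024'],
    ['Alaska', '$10.34', '$10.85', '$11.73'],
    ['$7.25', '$7.25'],  # hypothetical short row, not taken from the real table
]

padded = pad_rows(toy_rows)
for row in padded:
    print(row)
```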

In [7]:
unsorted_minimum_wage_df = pd.DataFrame(minimum_wage_list)
unsorted_minimum_wage_df

Unnamed: 0,0,1,2,3
0,State,2022,2023,2024[1][4]
1,Alabama,$7.25[note 1],$7.25[note 1],$7.25[note 1]
2,Alaska,$10.34,$10.85,$11.73
3,Arizona,$12.80,$13.85,$14.35
4,Arkansas,$11.00,$11.00,$11.00/7.25[f][1]
5,California,$15.00,$15.50,$16.00
6,Colorado,$12.56,$13.65,$14.42
7,Connecticut,$14.00,$15.00[14],$15.69
8,Delaware,$10.50,$11.75,$13.25
9,Florida,$11.00,$12.00,"$12.00. $13.00 on Sept 30, 2024"


#### The scraped rows have been saved to unsorted_minimum_wage_df, with the header still stored as row 0. The table is already sorted alphabetically by state, so it will be in the same order as the data pulled from other pages when merging.

In [8]:
# creates a pandas dataframe from minimum_wage_list, using the entries in row 0 as the column names
# the entries from row 1 onward become the data that fills those columns
minimum_wage_df = pd.DataFrame(minimum_wage_list[1:], columns=minimum_wage_list[0]) 

minimum_wage_df

Unnamed: 0,State,2022,2023,2024[1][4]
0,Alabama,$7.25[note 1],$7.25[note 1],$7.25[note 1]
1,Alaska,$10.34,$10.85,$11.73
2,Arizona,$12.80,$13.85,$14.35
3,Arkansas,$11.00,$11.00,$11.00/7.25[f][1]
4,California,$15.00,$15.50,$16.00
5,Colorado,$12.56,$13.65,$14.42
6,Connecticut,$14.00,$15.00[14],$15.69
7,Delaware,$10.50,$11.75,$13.25
8,Florida,$11.00,$12.00,"$12.00. $13.00 on Sept 30, 2024"
9,Georgia,$7.25[note 2],$7.25[note 2],$7.25[note 2]


#### The scraped table contains some entries, like totals for the United States or Washington, D.C., which will not be included in the analysis. To check whether any of these entries need to be removed, the number of rows in the data frame is checked first.

In [9]:
minimum_wage_df.shape[0]

51

In [10]:
minimum_wage_df = minimum_wage_df[~minimum_wage_df['State'].isin(["District of Columbia", "United States", "Washington, D.C."])]
print(minimum_wage_df)
minimum_wage_df.shape[0]

             State                2022                 2023  \
0          Alabama       $7.25[note 1]        $7.25[note 1]   
1           Alaska              $10.34               $10.85   
2          Arizona              $12.80               $13.85   
3         Arkansas              $11.00               $11.00   
4       California              $15.00               $15.50   
5         Colorado              $12.56               $13.65   
6      Connecticut              $14.00           $15.00[14]   
7         Delaware              $10.50               $11.75   
8          Florida              $11.00               $12.00   
9          Georgia       $7.25[note 2]        $7.25[note 2]   
10          Hawaii              $10.10               $12.00   
11           Idaho               $7.25                $7.25   
12        Illinois              $12.00               $13.00   
13         Indiana               $7.25                $7.25   
14            Iowa               $7.25                $

50
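
The filter above can be illustrated on a small data frame. This is a sketch with toy data: the wage values for the three states are copied from the notebook, while the value in the Washington, D.C. row is a placeholder. It also shows that the removed row leaves a gap in the index labels, which matters for any later lookups by label.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'State': ['Virginia', 'Washington', 'Washington, D.C.', 'West Virginia'],
    '2024': ['$12.00', '$16.28', 'placeholder', '$8.75'],  # D.C. value is a placeholder
})

excluded = ['District of Columbia', 'United States', 'Washington, D.C.']
is_excluded = toy_df['State'].isin(excluded)  # True for rows that are not states

removed = toy_df[is_excluded]   # rows that will be dropped, useful for a quick check
kept = toy_df[~is_excluded]     # ~ inverts the mask, keeping only the states

print(removed['State'].tolist())
print(kept.index.tolist())      # the original labels are kept, so label 2 is now missing
```

Calling kept.reset_index(drop=True) would renumber the rows from 0 if continuous labels are preferred.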

#### Many entries carry notes that give additional details about the minimum wage. The '2024[1][4]' column should be cleaned so that it contains only numeric values. For consistency, the highest value listed for each state in the 2024 column will be used.
#### Modifying the filtered data frame directly can prompt a pandas SettingWithCopyWarning, because minimum_wage_df was created by filtering another data frame. Making an explicit copy with .copy() avoids the warning.
#### The column will also be renamed to 2024, for easier referencing.

In [11]:
minimum_wage_df = minimum_wage_df.copy()
minimum_wage_df.rename(columns={'2024[1][4]': '2024'}, inplace=True)
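
As a small illustration of why the explicit copy is taken, the sketch below filters a toy data frame, copies the result, and edits it. The names source_df and states_only are hypothetical, and the 'United States' value is a placeholder. Whether a warning appears without .copy() depends on the pandas version, so the example only demonstrates that the copy is independent of the original.

```python
import pandas as pd

source_df = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'United States'],
    '2024': ['$7.25[note 1]', '$11.73', 'placeholder'],  # last value is a placeholder
})

# Filter, then take an explicit copy so later edits clearly target a new object
states_only = source_df[source_df['State'] != 'United States'].copy()
states_only.loc[0, '2024'] = '$7.25'

print(states_only.loc[0, '2024'])  # edited value in the copy
print(source_df.loc[0, '2024'])    # the original data frame is unchanged
```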

#### Each state's 2024 entry is overwritten by hand with its cleaned value, addressed by row label. Label 47 is skipped because that row was removed by the filter above.

In [12]:
minimum_wage_df.loc[0, '2024'] = '$7.25' # sets the value in the 2024 column for the row labelled 0
minimum_wage_df.loc[1, '2024'] = '$11.73'
minimum_wage_df.loc[2, '2024'] = '$14.35'
minimum_wage_df.loc[3, '2024'] = '$11.00'
minimum_wage_df.loc[4, '2024'] = '$16.00'
minimum_wage_df.loc[5, '2024'] = '$14.42'
minimum_wage_df.loc[6, '2024'] = '$15.69'
minimum_wage_df.loc[7, '2024'] = '$13.25'
minimum_wage_df.loc[8, '2024'] = '$13.00'
minimum_wage_df.loc[9, '2024'] = '$7.25'
minimum_wage_df.loc[10, '2024'] = '$14.00'
minimum_wage_df.loc[11, '2024'] = '$7.25'
minimum_wage_df.loc[12, '2024'] = '$14.00'
minimum_wage_df.loc[13, '2024'] = '$7.25'
minimum_wage_df.loc[14, '2024'] = '$7.25'
minimum_wage_df.loc[15, '2024'] = '$7.25'
minimum_wage_df.loc[16, '2024'] = '$7.25'
minimum_wage_df.loc[17, '2024'] = '$7.25'
minimum_wage_df.loc[18, '2024'] = '$14.15'
minimum_wage_df.loc[19, '2024'] = '$15.00'
minimum_wage_df.loc[20, '2024'] = '$15.00'
minimum_wage_df.loc[21, '2024'] = '$10.33'
minimum_wage_df.loc[22, '2024'] = '$10.85'
minimum_wage_df.loc[23, '2024'] = '$7.25'
minimum_wage_df.loc[24, '2024'] = '$12.30'
minimum_wage_df.loc[25, '2024'] = '$10.30'
minimum_wage_df.loc[26, '2024'] = '$12.00'
minimum_wage_df.loc[27, '2024'] = '$12.00'
minimum_wage_df.loc[28, '2024'] = '$7.25'
minimum_wage_df.loc[29, '2024'] = '$15.13'
minimum_wage_df.loc[30, '2024'] = '$12.00'
minimum_wage_df.loc[31, '2024'] = '$15.00'
minimum_wage_df.loc[32, '2024'] = '$7.25'
minimum_wage_df.loc[33, '2024'] = '$7.25'
minimum_wage_df.loc[34, '2024'] = '$10.45'
minimum_wage_df.loc[35, '2024'] = '$7.25'
minimum_wage_df.loc[36, '2024'] = '$14.70'
minimum_wage_df.loc[37, '2024'] = '$7.25'
minimum_wage_df.loc[38, '2024'] = '$14.00'
minimum_wage_df.loc[39, '2024'] = '$7.25'
minimum_wage_df.loc[40, '2024'] = '$11.20'
minimum_wage_df.loc[41, '2024'] = '$7.25'
minimum_wage_df.loc[42, '2024'] = '$7.25'
minimum_wage_df.loc[43, '2024'] = '$7.25'
minimum_wage_df.loc[44, '2024'] = '$13.67'
minimum_wage_df.loc[45, '2024'] = '$12.00'
minimum_wage_df.loc[46, '2024'] = '$16.28'
minimum_wage_df.loc[48, '2024'] = '$8.75'
minimum_wage_df.loc[49, '2024'] = '$7.25'
minimum_wage_df.loc[50, '2024'] = '$7.25'
minimum_wage_df

Unnamed: 0,State,2022,2023,2024
0,Alabama,$7.25[note 1],$7.25[note 1],$7.25
1,Alaska,$10.34,$10.85,$11.73
2,Arizona,$12.80,$13.85,$14.35
3,Arkansas,$11.00,$11.00,$11.00
4,California,$15.00,$15.50,$16.00
5,Colorado,$12.56,$13.65,$14.42
6,Connecticut,$14.00,$15.00[14],$15.69
7,Delaware,$10.50,$11.75,$13.25
8,Florida,$11.00,$12.00,$13.00
9,Georgia,$7.25[note 2],$7.25[note 2],$7.25
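
The manual assignments above could in principle be replaced by a rule. The sketch below is one possible approach, not the method used in this notebook: a hypothetical helper, highest_wage, strips the bracketed notes and returns the largest amount written with cents. It is shown on three raw cells taken from the scraped output; other cells may follow patterns this rule does not cover, so results should be compared against the hand-entered values.

```python
import re

def highest_wage(raw):
    """Return the largest dollar amount found in a scraped cell, or None."""
    without_notes = re.sub(r'\[[^\]]*\]', '', raw)      # drop [note 1], [f], [1], ...
    amounts = re.findall(r'\d+\.\d{2}', without_notes)  # amounts written with cents
    return max(float(a) for a in amounts) if amounts else None

samples = ['$7.25[note 1]', '$11.00/7.25[f][1]', '$12.00. $13.00 on Sept 30, 2024']
cleaned = [highest_wage(s) for s in samples]
print(cleaned)  # [7.25, 11.0, 13.0]
```

Requiring two decimal places keeps the rule from picking up the day and year in a cell like the Florida entry. Applied to the raw scraped column with .map(highest_wage), the helper would return floats directly, making the later dollar-sign removal unnecessary.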


In [13]:
minimum_wage_df['2024'] = minimum_wage_df['2024'].str.replace('$', '', regex=False).str.strip()
# removes the dollar signs and any leading or trailing spaces from the strings in the '2024' column

minimum_wage_df['2024'] = pd.to_numeric(minimum_wage_df['2024'], errors='coerce')
# converts the values in the '2024' column from object to float so they can be used as numeric data; entries that cannot be parsed become NaN

print(minimum_wage_df.dtypes)
#print(minimum_wage_df.head())
minimum_wage_df

State     object
2022      object
2023      object
2024     float64
dtype: object


Unnamed: 0,State,2022,2023,2024
0,Alabama,$7.25[note 1],$7.25[note 1],7.25
1,Alaska,$10.34,$10.85,11.73
2,Arizona,$12.80,$13.85,14.35
3,Arkansas,$11.00,$11.00,11.0
4,California,$15.00,$15.50,16.0
5,Colorado,$12.56,$13.65,14.42
6,Connecticut,$14.00,$15.00[14],15.69
7,Delaware,$10.50,$11.75,13.25
8,Florida,$11.00,$12.00,13.0
9,Georgia,$7.25[note 2],$7.25[note 2],7.25
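
Because errors='coerce' silently turns unparseable entries into NaN, it can help to count the missing values after the conversion. The sketch below shows this on a toy Series; the names raw, stripped, numeric and n_missing are chosen for illustration, and the third entry is an uncleaned cell from the scraped output included to show the coercion.

```python
import pandas as pd

raw = pd.Series(['$7.25', ' $11.73 ', '$11.00/7.25'])

# Same two steps as the cell above: drop the dollar sign, then trim whitespace
stripped = raw.str.replace('$', '', regex=False).str.strip()

# Anything that is not a plain number becomes NaN instead of raising an error
numeric = pd.to_numeric(stripped, errors='coerce')

n_missing = int(numeric.isna().sum())
print(numeric.tolist(), n_missing)
```

A count of zero on the real column would suggest that every entry was converted successfully.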


#### The 2024 column is renamed to be more descriptive, so it is clear what is being referenced when the data sets are merged.
#### The State and minimum wage 2024 columns are then kept as the final data frame, which is saved to a new csv file.

In [14]:
minimum_wage_df = minimum_wage_df.rename(columns={'2024': 'minimum wage 2024'})

In [15]:
minimum_wage_df = minimum_wage_df[['State','minimum wage 2024']]
minimum_wage_df

Unnamed: 0,State,minimum wage 2024
0,Alabama,7.25
1,Alaska,11.73
2,Arizona,14.35
3,Arkansas,11.0
4,California,16.0
5,Colorado,14.42
6,Connecticut,15.69
7,Delaware,13.25
8,Florida,13.0
9,Georgia,7.25


In [16]:
minimum_wage_df.to_csv('minimum_wage_df.csv', index=False)
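
The merge planned for the next stage could look roughly like the sketch below. It is an assumption about how the files might be combined, not part of this notebook: other_df, its column name and its values are placeholders standing in for a data set scraped from a similar page.

```python
import pandas as pd

# First three rows of the data frame saved above
wage_df = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona'],
    'minimum wage 2024': [7.25, 11.73, 14.35],
})

# Hypothetical second data set; the column name and values are placeholders
other_df = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona'],
    'other metric': [1.0, 2.0, 3.0],
})

# Join on State; validate raises an error if a state appears more than once on either side
merged = wage_df.merge(other_df, on='State', how='inner', validate='one_to_one')
print(merged.columns.tolist(), len(merged))
```

Merging on the State column rather than relying on row order keeps the result correct even if one of the files is sorted differently.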

### References
#### Wikipedia contributors. (n.d.). List of US states by minimum wage. Wikipedia, The Free Encyclopedia. Retrieved April 8, 2025, from https://en.wikipedia.org/wiki/List_of_US_states_by_minimum_wage