### Name: Asha Cumberbatch 
### Date: April 8th
### Assignment: Project 2 part 2 - Household income table
### Purpose: The aim of this notebook is to pull the data from the tables on the List of U.S. states and territories by income Wikipedia page (https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income#States_and_territories_ranked_by_median_household_income). The page will be web scraped and the data cleaned. 

### The columns of particular interest are State and the data for the year 2023; these will be saved to a csv file. That csv file will be merged with other csv files created from similar pages, and the result saved as a new data frame. The resulting data frame will be used to gain insight into how these metrics differ by state.

##### The first step is to import the necessary packages: requests, BeautifulSoup (imported as bs), and pandas (imported as pd).

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Household Income

##### Before attempting to web scrape the page, Wikipedia's robots.txt file was checked to ensure that scraping was allowed.
##### The instruction to pull the page also contains an if statement, which prints an error message if the request does not return status code 200.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income#States_and_territories_ranked_by_median_household_income'
response = requests.get(url)
status = response.status_code
if status == 200:
    page = response.text
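    soup = bs(page, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed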
else:
    print(f"Oops! Received status code {status}")
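
##### The robots.txt check described above can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser. The helper name is_allowed and the sample rules are assumptions made for illustration; the rules are made up and are not Wikipedia's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Made-up sample rules for illustration; NOT Wikipedia's actual robots.txt.
sample_rules = [
    "User-agent: *",
    "Disallow: /w/",
    "Allow: /wiki/",
]

page_url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income"
print(is_allowed(sample_rules, "*", page_url))

# To check the live file instead (requires network access):
# live = RobotFileParser("https://en.wikipedia.org/robots.txt")
# live.read()
# print(live.can_fetch("*", page_url))
```

##### Parsing rules from a list keeps the sketch runnable offline; the commented lines show how the same check could be pointed at the live file.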

In [3]:
print(soup.prettify())
type(soup)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of U.S. states and territories by income - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector

bs4.BeautifulSoup

##### To assist with identifying and selecting the right table, a loop is used. The number of tables on the page is printed, followed by the first 500 characters of each table.

In [4]:
tables = soup.find_all('table')  
print(len(tables))  

for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.prettify()[:500])  
    print("\n")

8
Table 0:
<table class="sidebar sidebar-collapse nomobile nowraplinks vcard hlist">
 <tbody>
  <tr>
   <td class="sidebar-pretitle">
    This article is part of a series on
   </td>
  </tr>
  <tr>
   <th class="sidebar-title-with-pretitle" style="background:lavender">
    <a href="/wiki/Income_in_the_United_States" title="Income in the United States">
     <small>
      Income in the
     </small>
     <br/>
     United States of America
    </a>
   </th>
  </tr>
  <tr>
   <td class="sidebar-image">
    <


Table 1:
<table class="wikitable sortable mw-datatable sticky-header sort-under static-row-numbers" style="overflow-x:">
 <caption>
  States and territories ranked by median household income. Average annual growth rate 2013–2023, %
 </caption>
 <tbody>
  <tr>
   <th>
    States and Washington, D.C.
   </th>
   <th>
    2023
   </th>
   <th>
    2022
   </th>
   <th>
    2021
   </th>
   <th>
    2019
   </th>
   <th>
    2018
   </th>
   <th>
    2017
   </th>
   <th>
    2016
   <
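
##### The table is later selected by its position (index 1). As an alternative, a table can be looked up by its caption text, which keeps working if tables are added or reordered on the page. This is a sketch only: the helper find_table_by_caption and the small sample_html document are hypothetical stand-ins, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def find_table_by_caption(soup, phrase):
    """Return the first <table> whose <caption> contains phrase, else None."""
    for table in soup.find_all("table"):
        caption = table.find("caption")
        if caption and phrase.lower() in caption.get_text().lower():
            return table
    return None


# Tiny stand-in document; the real page has more tables and columns.
sample_html = """
<table class="sidebar"><tr><td>Series box</td></tr></table>
<table class="wikitable">
  <caption>States and territories ranked by median household income</caption>
  <tr><th>State</th><th>2023</th></tr>
  <tr><td>Alabama</td><td>$62,212</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
match = find_table_by_caption(sample_soup, "median household income")
print(match["class"])
```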

#### An empty list is created to hold the rows that will be pulled from the table. The table that needs to be pulled is also printed.

In [5]:
household_income_list = []
household_income_table = tables[1].tbody  # Select the second table (index 1)
print(household_income_table)

<tbody><tr>
<th>States and Washington, D.C.
</th>
<th>2023
</th>
<th>2022
</th>
<th>2021
</th>
<th>2019
</th>
<th>2018
</th>
<th>2017
</th>
<th>2016
</th>
<th>2015
</th>
<th>2014
</th>
<th>2013
</th>
<th>Growth rate
</th></tr>
<tr class="static-row-numbers-norank">
<td><i><b><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/40px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/60px-Flag_of_the_United_States.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/United_States" title="United States">United States</a></b></i>
</td>
<td><b>$77,719</b>
</td>
<td><b>$74,755</b>
</td>
<td><b>$69,717</b>
</td>
<td><b>$65,712</b>
</td>
<td><b>$63,179</b>
</td>
<td><b>$60,336</b>
</td>
<td><b>$57,617</b>


In [6]:
# determines the maximum number of columns in the table (household_income_table)
max_columns = max([len(row.find_all(['th', 'td'])) for row in household_income_table.find_all('tr')])

for row in household_income_table.find_all('tr'): #uses each row in the table, including the header. (use [1:]: to skip header)
    cells = row.find_all(['th', 'td']) # finds all the header (th) and data (td) elements from each row
    row_data = [cell.text.strip() for cell in cells]  # pulls the text within each cell and strips leading/trailing whitespace
    
    # in case any of the columns are missing, this loop inserts None at the beginning of the row
    while len(row_data) < max_columns: # checks for missing column entries by comparing against the value in max_columns
        row_data.insert(0, None)  # index 0 adds the placeholder None at the start of the row
    
    household_income_list.append(row_data)

for row in household_income_list:
    print(row)

['States and Washington, D.C.', '2023', '2022', '2021', '2019', '2018', '2017', '2016', '2015', '2014', '2013', 'Growth rate']
['United States', '$77,719', '$74,755', '$69,717', '$65,712', '$63,179', '$60,336', '$57,617', '$55,775', '$53,657', '$52,250', '3.07%']
['Washington, D.C.', '$108,210', '$101,027', '$90,088', '$92,266', '$85,203', '$82,372', '$75,506', '$75,628', '$71,648', '$67,572', '4.82%']
['Massachusetts', '$99,858', '$94,488', '$89,645', '$85,843', '$79,835', '$77,385', '$75,297', '$70,628', '$69,160', '$66,768', '4.11%']
['New Jersey', '$99,781', '$96,346', '$89,296', '$85,751', '$81,740', '$80,088', '$76,126', '$72,222', '$71,919', '$70,165', '3.58%']
['Maryland', '$98,678', '$94,991', '$90,203', '$86,738', '$83,242', '$80,776', '$78,945', '$75,847', '$73,971', '$72,483', '3.13%']
['New Hampshire', '$96,838', '$89,992', '$88,465', '$77,933', '$74,991', '$73,381', '$70,936', '$70,303', '$66,532', '$64,230', '4.19%']
['California', '$95,521', '$91,551', '$84,907', '$80,4
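
##### The padding step above can be isolated in a small helper to see its effect on a short row. This is a sketch under assumptions: pad_rows is a hypothetical name, and the third sample row is invented to show a row with a missing first cell (no such row appears in the scraped output).

```python
def pad_rows(rows, fill=None):
    """Left-pad every row with fill so all rows match the longest row."""
    width = max(len(row) for row in rows)
    return [[fill] * (width - len(row)) + list(row) for row in rows]


sample_rows = [
    ["State", "2023", "2022"],
    ["Alabama", "$62,212", "$59,674"],
    ["$88,121", "$86,631"],  # hypothetical short row, first cell missing
]

for row in pad_rows(sample_rows):
    print(row)
```

##### Left-padding assumes the missing cell is always the first one. If a cell were missing elsewhere in a row, the remaining values would land under the wrong column headers, so it is worth checking which rows were padded.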

In [7]:
unsorted_household_income_df = pd.DataFrame(household_income_list)
print(unsorted_household_income_df)

                             0         1         2        3        4   \
0   States and Washington, D.C.      2023      2022     2021     2019   
1                 United States   $77,719   $74,755  $69,717  $65,712   
2              Washington, D.C.  $108,210  $101,027  $90,088  $92,266   
3                 Massachusetts   $99,858   $94,488  $89,645  $85,843   
4                    New Jersey   $99,781   $96,346  $89,296  $85,751   
5                      Maryland   $98,678   $94,991  $90,203  $86,738   
6                 New Hampshire   $96,838   $89,992  $88,465  $77,933   
7                    California   $95,521   $91,551  $84,907  $80,440   
8                        Hawaii   $95,322   $92,458  $84,857  $83,102   
9                    Washington   $94,605   $91,306  $84,247  $78,687   
10                         Utah   $93,421   $89,168  $79,449  $75,780   
11                     Colorado   $92,911   $89,302  $82,254  $77,127   
12                  Connecticut   $91,665   $88,429

#### The output from the webpage has been saved (to unsorted_household_income_df). To ensure the scraped data from each webpage will be in the same order when merging the various files, it is sorted alphabetically.

In [8]:
# creates a pandas dataframe from household_income_list, using the entries in row 0 as the column names 
# the entries, starting from row 1 will become the data to fill those columns
household_income_df = pd.DataFrame(household_income_list[1:], columns=household_income_list[0]) #_list[0] is the header row

household_income_df = household_income_df.sort_values(by='States and Washington, D.C.', ascending=True) # sorts the dataFrame alphabetically by the state column

household_income_df


Unnamed: 0,"States and Washington, D.C.",2023,2022,2021,2019,2018,2017,2016,2015,2014,2013,Growth rate
45,Alabama,"$62,212","$59,674","$53,913","$51,734","$49,861","$48,123","$46,257","$44,765","$42,830","$42,849",3.80%
13,Alaska,"$88,121","$86,631","$77,845","$75,463","$74,346","$73,181","$76,440","$73,355","$71,583","$72,237",2.01%
21,Arizona,"$77,315","$74,568","$69,056","$62,055","$59,246","$56,581","$53,558","$51,492","$50,068","$48,510",4.77%
48,Arkansas,"$58,700","$55,432","$52,528","$48,952","$47,062","$45,869","$44,334","$41,995","$41,262","$40,511",3.78%
6,California,"$95,521","$91,551","$84,907","$80,440","$75,277","$71,805","$67,739","$64,500","$61,933","$60,190",4.73%
10,Colorado,"$92,911","$89,302","$82,254","$77,127","$71,953","$69,117","$65,685","$63,909","$61,303","$58,823",4.68%
11,Connecticut,"$91,665","$88,429","$83,771","$78,833","$76,348","$74,168","$73,433","$71,346","$70,048","$67,098",3.17%
16,Delaware,"$82,174","$81,361","$71,091","$70,176","$64,805","$62,852","$61,757","$61,255","$59,716","$57,846",3.57%
31,Florida,"$73,311","$69,303","$63,062","$59,227","$55,462","$52,594","$50,860","$49,426","$47,463","$46,036",4.76%
26,Georgia,"$74,632","$72,837","$66,559","$61,980","$58,756","$56,183","$53,559","$51,244","$49,321","$47,829",4.55%


#### The scraped table contains some entries, like the totals for the United States or Washington, D.C., which will not be included in the analysis. To confirm whether any of these entries need to be removed, the number of rows in the data frame is checked; anything above 50 indicates non-state rows.

In [9]:
household_income_df.shape[0]

52

In [10]:
household_income_df = household_income_df[~household_income_df['States and Washington, D.C.'].isin(["District of Columbia", "United States", "Washington, D.C."])]
print(household_income_df)
household_income_df.shape[0]

   States and Washington, D.C.     2023     2022     2021     2019     2018  \
45                     Alabama  $62,212  $59,674  $53,913  $51,734  $49,861   
13                      Alaska  $88,121  $86,631  $77,845  $75,463  $74,346   
21                     Arizona  $77,315  $74,568  $69,056  $62,055  $59,246   
48                    Arkansas  $58,700  $55,432  $52,528  $48,952  $47,062   
6                   California  $95,521  $91,551  $84,907  $80,440  $75,277   
10                    Colorado  $92,911  $89,302  $82,254  $77,127  $71,953   
11                 Connecticut  $91,665  $88,429  $83,771  $78,833  $76,348   
16                    Delaware  $82,174  $81,361  $71,091  $70,176  $64,805   
31                     Florida  $73,311  $69,303  $63,062  $59,227  $55,462   
26                     Georgia  $74,632  $72,837  $66,559  $61,980  $58,756   
7                       Hawaii  $95,322  $92,458  $84,857  $83,102  $80,212   
25                       Idaho  $74,942  $72,785  $6

50

In [11]:
print(household_income_df.dtypes)
# shows the data type of each column

States and Washington, D.C.    object
2023                           object
2022                           object
2021                           object
2019                           object
2018                           object
2017                           object
2016                           object
2015                           object
2014                           object
2013                           object
Growth rate                    object
dtype: object


#### To make the column containing the list of states easier to reference, it will be renamed State. The 2023 column will also be renamed household income 2023 so it is clear what the column contains.

In [12]:
household_income_df = household_income_df.rename(columns={'States and Washington, D.C.': 'State', '2023': 'household income 2023'})
household_income_df

Unnamed: 0,State,household income 2023,2022,2021,2019,2018,2017,2016,2015,2014,2013,Growth rate
45,Alabama,"$62,212","$59,674","$53,913","$51,734","$49,861","$48,123","$46,257","$44,765","$42,830","$42,849",3.80%
13,Alaska,"$88,121","$86,631","$77,845","$75,463","$74,346","$73,181","$76,440","$73,355","$71,583","$72,237",2.01%
21,Arizona,"$77,315","$74,568","$69,056","$62,055","$59,246","$56,581","$53,558","$51,492","$50,068","$48,510",4.77%
48,Arkansas,"$58,700","$55,432","$52,528","$48,952","$47,062","$45,869","$44,334","$41,995","$41,262","$40,511",3.78%
6,California,"$95,521","$91,551","$84,907","$80,440","$75,277","$71,805","$67,739","$64,500","$61,933","$60,190",4.73%
10,Colorado,"$92,911","$89,302","$82,254","$77,127","$71,953","$69,117","$65,685","$63,909","$61,303","$58,823",4.68%
11,Connecticut,"$91,665","$88,429","$83,771","$78,833","$76,348","$74,168","$73,433","$71,346","$70,048","$67,098",3.17%
16,Delaware,"$82,174","$81,361","$71,091","$70,176","$64,805","$62,852","$61,757","$61,255","$59,716","$57,846",3.57%
31,Florida,"$73,311","$69,303","$63,062","$59,227","$55,462","$52,594","$50,860","$49,426","$47,463","$46,036",4.76%
26,Georgia,"$74,632","$72,837","$66,559","$61,980","$58,756","$56,183","$53,559","$51,244","$49,321","$47,829",4.55%


In [13]:
print(household_income_df['household income 2023'].unique())  


['$62,212' '$88,121' '$77,315' '$58,700' '$95,521' '$92,911' '$91,665'
 '$82,174' '$73,311' '$74,632' '$95,322' '$74,942' '$80,306' '$69,477'
 '$71,433' '$70,333' '$61,118' '$58,229' '$73,733' '$98,678' '$99,858'
 '$69,183' '$85,086' '$54,203' '$68,545' '$70,804' '$74,590' '$76,364'
 '$96,838' '$99,781' '$62,268' '$82,095' '$76,525' '$67,769' '$62,138'
 '$80,160' '$73,824' '$84,972' '$67,804' '$71,810' '$67,631' '$75,780'
 '$93,421' '$81,211' '$89,931' '$94,605' '$55,948' '$74,631' '$72,415']


##### Removing the commas and dollar signs from the 'household income 2023' column modifies the data frame. This prompts a warning that the data frame being modified was derived from another data frame rather than being an independent copy. The warning can be avoided by first making an explicit copy of household_income_df.

In [14]:
household_income_df = household_income_df.copy()

household_income_df['household income 2023'] = household_income_df['household income 2023'].astype(str).str.replace(',', '').str.replace('$', '', regex=False).str.strip()
# converts everything in the 'household income 2023' column to a string, then removes the dollar signs, commas and extra spaces

household_income_df['household income 2023'] = pd.to_numeric(household_income_df['household income 2023'], errors='coerce')
# converts all the values in the 'household income 2023' column from object to int so it can be used as numeric data


print(household_income_df.dtypes)
#print(household_income_df.head())
household_income_df

State                    object
household income 2023     int64
2022                     object
2021                     object
2019                     object
2018                     object
2017                     object
2016                     object
2015                     object
2014                     object
2013                     object
Growth rate              object
dtype: object


Unnamed: 0,State,household income 2023,2022,2021,2019,2018,2017,2016,2015,2014,2013,Growth rate
45,Alabama,62212,"$59,674","$53,913","$51,734","$49,861","$48,123","$46,257","$44,765","$42,830","$42,849",3.80%
13,Alaska,88121,"$86,631","$77,845","$75,463","$74,346","$73,181","$76,440","$73,355","$71,583","$72,237",2.01%
21,Arizona,77315,"$74,568","$69,056","$62,055","$59,246","$56,581","$53,558","$51,492","$50,068","$48,510",4.77%
48,Arkansas,58700,"$55,432","$52,528","$48,952","$47,062","$45,869","$44,334","$41,995","$41,262","$40,511",3.78%
6,California,95521,"$91,551","$84,907","$80,440","$75,277","$71,805","$67,739","$64,500","$61,933","$60,190",4.73%
10,Colorado,92911,"$89,302","$82,254","$77,127","$71,953","$69,117","$65,685","$63,909","$61,303","$58,823",4.68%
11,Connecticut,91665,"$88,429","$83,771","$78,833","$76,348","$74,168","$73,433","$71,346","$70,048","$67,098",3.17%
16,Delaware,82174,"$81,361","$71,091","$70,176","$64,805","$62,852","$61,757","$61,255","$59,716","$57,846",3.57%
31,Florida,73311,"$69,303","$63,062","$59,227","$55,462","$52,594","$50,860","$49,426","$47,463","$46,036",4.76%
26,Georgia,74632,"$72,837","$66,559","$61,980","$58,756","$56,183","$53,559","$51,244","$49,321","$47,829",4.55%
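
##### The cleaning step can be wrapped in a reusable function, which also makes the effect of errors='coerce' visible. This is a minimal sketch: currency_to_number is a hypothetical helper, and the 'N/A' entry is a made-up bad value that does not appear in the scraped table.

```python
import pandas as pd


def currency_to_number(series):
    """Strip '$' and ',' from a string Series and convert it to numbers."""
    cleaned = (
        series.astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")


sample = pd.Series(["$62,212", " $88,121 ", "N/A"])  # "N/A" is a made-up bad value
converted = currency_to_number(sample)
print(converted)
print(converted.isna().sum(), "value(s) could not be converted")
```

##### Because coerced failures become NaN silently, counting missing values after the conversion is a quick way to confirm that every entry was parsed.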


#### The State and household income 2023 columns will be kept as the final data frame, and a new csv file created.

In [15]:
household_income_df = household_income_df[['State','household income 2023']]
household_income_df

Unnamed: 0,State,household income 2023
45,Alabama,62212
13,Alaska,88121
21,Arizona,77315
48,Arkansas,58700
6,California,95521
10,Colorado,92911
11,Connecticut,91665
16,Delaware,82174
31,Florida,73311
26,Georgia,74632


In [16]:
household_income_df.to_csv('household_income_df.csv', index=False)
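
##### The purpose section mentions merging this csv file with others built from similar pages. A sketch of that step is shown here with in-memory data frames. The second table, its column name 'other metric', and its values are placeholders invented for illustration; only the three income figures come from the table above.

```python
import pandas as pd

income = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "household income 2023": [62212, 88121, 77315],
})

# Hypothetical second table; the column name and values are placeholders.
other = pd.DataFrame({
    "State": ["Alaska", "Arizona", "Alabama"],
    "other metric": [1.0, 2.0, 3.0],
})

# Merging on the State key aligns rows by name, so row order does not matter.
# validate="one_to_one" raises an error if a state appears twice in either table.
merged = income.merge(other, on="State", how="inner", validate="one_to_one")
print(merged)
```

##### Merging on a shared key is one option; it relies on the state names being spelled identically in every file, which the alphabetical sort alone does not guarantee.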

### References
#### Wikipedia contributors. (n.d.). List of U.S. states and territories by income. Wikipedia, The Free Encyclopedia. Retrieved April 8, 2025, from https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income#States_and_territories_ranked_by_median_household_income