### Name: Asha Cumberbatch 
### Date: April 8th
### Assignment: Project 2 part 2 - GDP table
### Purpose: The aim of this notebook is to pull the data from the tables on the List of U.S. states and territories by GDP Wikipedia page (https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP). The page will be web scraped and the data cleaned. 

### The columns of particular interest are State and the data for the year 2024. These will be saved to a csv file, which will be merged with other csv files created from similar pages and saved as a new data frame. The resulting data frame will be used to gain insight into how these metrics differ by state.

##### The first step is to import the necessary packages: requests, BeautifulSoup (imported as bs), and pandas (imported as pd).

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# GDP

##### Before attempting to web scrape the page, the robots.txt file for Wikipedia was checked to ensure that scraping was allowed.
##### The code that pulls the page also contains an if statement, which prints an error message if there is an issue when attempting to pull the page.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP'
response = requests.get(url)
status = response.status_code
if status == 200:
    page = response.text
    soup = bs(page, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
else:
    print(f"Oops! Received status code {status}")
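
##### The robots.txt check described above could also be scripted. The following is a minimal sketch using Python's built-in urllib.robotparser. The rules in sample_rules are a hypothetical excerpt written for illustration, not Wikipedia's actual robots.txt, and the user agent name is also an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excerpt for illustration only (NOT Wikipedia's real file)
sample_rules = [
    "User-agent: *",
    "Disallow: /w/",
    "Allow: /wiki/",
]

parser = RobotFileParser()
parser.parse(sample_rules)  # load the rules from a list of lines instead of the network

page_url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP"
blocked_url = "https://en.wikipedia.org/w/index.php"

# "my-notebook" is an assumed user agent name; it falls under the "*" rules
allowed = parser.can_fetch("my-notebook", page_url)
blocked = parser.can_fetch("my-notebook", blocked_url)

print(f"Article page allowed: {allowed}")
print(f"Script path allowed: {blocked}")
```

##### To check the live file instead, set_url() followed by read() on the same parser object would fetch the rules from the site.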

In [3]:
print(soup.prettify())
type(soup)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of U.S. states and territories by GDP - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-fe

bs4.BeautifulSoup

##### To assist with identifying and selecting the right table, a loop is used. The number of tables on the page and the first 500 characters of each table are printed.

In [4]:
tables = soup.find_all('table') 
print(len(tables))  

for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.prettify()[:500])  
    print("\n")

7
Table 0:
<table class="sortable wikitable sticky-header-multi static-row-numbers sort-under" style="text-align:right">
 <tbody>
  <tr>
   <th rowspan="2">
    State or federal district
   </th>
   <th colspan="2" style="max-width:10em">
    <div style="display: inline-block; line-height: 1.2em; padding: .1em 0;">
     Nominal GDP at current prices 2024 (millions of U.S. dollars)
     <sup class="reference" id="cite_ref-GDPByState_1-2">
      <a href="#cite_note-GDPByState-1">
       <span class="cite-bra


Table 1:
<table class="wikitable sortable static-row-numbers" style="text-align:right;">
 <tbody>
  <tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
   <th>
    Territory
   </th>
   <th style="max-width:10em">
    Nominal GDP
    <br/>
    at Current Prices
    <br/>
    (millions of
    <br/>
    U.S. dollars)
   </th>
   <th style="max-width:5em">
    Real GDP growth rate
   </th>
   <th style="max-width:5em">
    GDP per capita
   </th>
   <th sty
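
##### Selecting a table by its index works as long as the page layout does not change. As a hedged alternative, the sketch below picks a table by the text of its first header cell. The helper name find_table_by_header and the small HTML sample (with placeholder names and values) are assumptions made for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page: two tables with placeholder content
sample_html = """
<table><tr><th>Territory</th><th>Nominal GDP</th></tr>
<tr><td>Example Territory</td><td>0</td></tr></table>
<table><tr><th>State or federal district</th><th>Nominal GDP</th></tr>
<tr><td>Example State</td><td>0</td></tr></table>
"""


def find_table_by_header(soup, header_text):
    """Return the first <table> whose first <th> contains header_text, else None."""
    for table in soup.find_all("table"):
        first_header = table.find("th")
        if first_header is not None and header_text in first_header.get_text(strip=True):
            return table
    return None


sample_soup = BeautifulSoup(sample_html, "html.parser")
state_table = find_table_by_header(sample_soup, "State or federal district")
missing_table = find_table_by_header(sample_soup, "No such header")

first_cell = state_table.find("td").get_text(strip=True)
print(first_cell)
```

##### Matching on header text keeps the selection stable if another table is added above the one of interest, at the cost of breaking if the header wording changes.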

#### An empty list is created to hold the rows that will be pulled from the table. The table that needs to be pulled is also displayed.

In [5]:
gdp_list = []
gdp_table = tables[0].tbody  # Select the first table (index 0)
gdp_table

<tbody><tr>
<th rowspan="2">State or federal district
</th>
<th colspan="2" style="max-width:10em"><div style="display: inline-block; line-height: 1.2em; padding: .1em 0;"> Nominal GDP at current prices 2024 (millions of U.S. dollars)<sup class="reference" id="cite_ref-GDPByState_1-2"><a href="#cite_note-GDPByState-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup></div>
</th>
<th colspan="2" rowspan="2" style="max-width:10em"><div style="display: inline-block; line-height: 1.2em; padding: .1em 0;"> Annual GDP change at current prices (2023–2024)<sup class="reference" id="cite_ref-GDPByState_1-3"><a href="#cite_note-GDPByState-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup></div>
</th>
<th rowspan="2" style="max-width:10em"><div style="display: inline-block; line-height: 1.2em; padding: .1em 0;"> <a href="/wiki/Real_gross_domestic_product" title="Real gross domestic product">Real GDP</a><br/>growth rate<br/>(2023–20

In [6]:
# determines the maximum number of cells in any row of gdp_table (for reference only; not used below)
max_columns = max(len(row.find_all(['th', 'td'])) for row in gdp_table.find_all('tr'))

for row in gdp_table.find_all('tr'):  # uses each row in the table, including the header rows (use [1:] to skip the first one)
    cells = row.find_all(['th', 'td'])  # finds all the header (th) and data (td) elements in each row
    row_data = [cell.text.strip() for cell in cells]  # pulls the text within each cell and strips surrounding whitespace
    
    gdp_list.append(row_data)

for row in gdp_list:
    print(row)

['State or federal district', 'Nominal GDP at current prices 2024 (millions of U.S. dollars)[1]', 'Annual GDP change at current prices (2023–2024)[1]', 'Real\xa0GDPgrowth rate(2023–2024)[1]', 'Nominal GDP per capita[1][2]', '% of national[1]']
['2022', '2024', '2022', '2024', '2022', '2024']
['California', '3,641,643', '4,103,124', '438,535', '5.7%', '2.0%', '$93,460', '$104,916', '14.69%', '14.14%']
['Texas', '2,402,137', '2,709,393', '292,387', '6.0%', '7.4%', '$78,750', '$86,987', '8.69%', '9.34%']
['New York', '2,048,403', '2,297,028', '235,961', '5.8%', '1.5%', '$104,660', '$117,332', '8.11%', '7.92%']
['Florida', '1,439,065', '1,705,565', '256,208', '9.2%', '4.3%', '$63,640', '$73,784', '5.37%', '5.87%']
['Illinois', '1,025,667', '1,137,244', '106,476', '5.6%', '1.0%', '$81,730', '$90,449', '4.11%', '3.92%']
['Pennsylvania', '911,813', '1,024,206', '105,444', '6.2%', '2.5%', '$70,350', '$78,544', '3.67%', '3.52%']
['Ohio', '825,990', '927,740', '96,786', '6.3%', '2.1%', '$70,080'
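
##### The printed header rows have six cells each, while the data rows have ten, because some header cells span two columns. The sketch below is one possible way to line the two header rows up with the data. The colspan and rowspan values for the first four header cells come from the markup shown earlier; the spans for the last two cells are assumptions inferred from the six-cell sub-header and ten-cell data rows, since that part of the markup is cut off. The function name flatten_header and the shortened labels are hypothetical.

```python
# (label, colspan, rowspan) for each cell of the first header row
top_header = [
    ("State or federal district", 1, 2),
    ("Nominal GDP", 2, 1),
    ("Annual GDP change", 2, 2),
    ("Real GDP growth rate", 1, 2),
    ("Nominal GDP per capita", 2, 1),  # spans assumed (markup truncated)
    ("% of national", 2, 1),           # spans assumed (markup truncated)
]
# Second header row exactly as printed above
sub_header = ["2022", "2024", "2022", "2024", "2022", "2024"]


def flatten_header(top, sub):
    """Expand spanning header cells into one name per data column."""
    sub_iter = iter(sub)
    names = []
    for label, colspan, rowspan in top:
        for position in range(colspan):
            if rowspan == 2:
                # Cell covers both header rows, so it has no year label beneath it
                suffix = f" ({position + 1})" if colspan > 1 else ""
                names.append(f"{label}{suffix}")
            else:
                # Cell sits above sub-header cells; pair it with the next year label
                names.append(f"{label} {next(sub_iter)}")
    return names


columns = flatten_header(top_header, sub_header)

# California row exactly as printed above
california = ['California', '3,641,643', '4,103,124', '438,535', '5.7%',
              '2.0%', '$93,460', '$104,916', '14.69%', '14.14%']
labelled = dict(zip(columns, california))

for position, name in enumerate(columns):
    print(position, name, labelled[name])
```

##### Under this reading, position 1 lines up with the 2022 label and position 2 with the 2024 label, so it may be worth confirming against the live page which position holds the year needed before selecting columns by number.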

In [7]:
unsorted_gdp_df = pd.DataFrame(gdp_list)
unsorted_gdp_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,State or federal district,Nominal GDP at current prices 2024 (millions o...,Annual GDP change at current prices (2023–2024...,Real GDPgrowth rate(2023–2024)[1],Nominal GDP per capita[1][2],% of national[1],,,,
1,2022,2024,2022,2024,2022,2024,,,,
2,California,3641643,4103124,438535,5.7%,2.0%,"$93,460","$104,916",14.69%,14.14%
3,Texas,2402137,2709393,292387,6.0%,7.4%,"$78,750","$86,987",8.69%,9.34%
4,New York,2048403,2297028,235961,5.8%,1.5%,"$104,660","$117,332",8.11%,7.92%
5,Florida,1439065,1705565,256208,9.2%,4.3%,"$63,640","$73,784",5.37%,5.87%
6,Illinois,1025667,1137244,106476,5.6%,1.0%,"$81,730","$90,449",4.11%,3.92%
7,Pennsylvania,911813,1024206,105444,6.2%,2.5%,"$70,350","$78,544",3.67%,3.52%
8,Ohio,825990,927740,96786,6.3%,2.1%,"$70,080","$78,120",3.22%,3.20%
9,Georgia,767378,882535,110368,6.7%,1.9%,"$69,570","$78,754",2.99%,3.04%


#### The output from the webpage has been saved to unsorted_gdp_df. The state names are currently in column 0 and the first nominal GDP column is column 1. For easier reference, these will be renamed State and nominal gdp 2024.

#### There are also multiple columns in the data frame that will not be used. To make the data easier to read, the two columns needed for analysis will be saved to a new data frame.


In [8]:
gdp_df = unsorted_gdp_df[[0, 1]]
gdp_df

Unnamed: 0,0,1
0,State or federal district,Nominal GDP at current prices 2024 (millions o...
1,2022,2024
2,California,3641643
3,Texas,2402137
4,New York,2048403
5,Florida,1439065
6,Illinois,1025667
7,Pennsylvania,911813
8,Ohio,825990
9,Georgia,767378


In [9]:
gdp_df = gdp_df.rename(columns={0: 'State', 1: 'nominal gdp 2024'})
gdp_df.shape[0]

54

#### The scraped table also contains some entries, like totals for the United States or Washington D.C., which will not be included in the analysis. The row count above (54) is higher than the 50 states because the data frame still holds the two header rows as well as these non-state entries. These rows are removed, and the data frame is then sorted alphabetically to match the order of the other scraped pages.

In [10]:
gdp_df = gdp_df[~gdp_df["State"].isin(["District of Columbia", "United States", "Washington, D.C.", "State or federal district", "2022"])]

gdp_df = gdp_df.sort_values(by= "State", ascending=True)

print(gdp_df)
gdp_df.shape[0]


             State nominal gdp 2024
28         Alabama          281,569
50          Alaska           65,699
17         Arizona          475,654
35        Arkansas          165,989
2       California        3,641,643
16        Colorado          491,289
24     Connecticut          319,345
44        Delaware           90,208
5          Florida        1,439,065
9          Georgia          767,378
42          Hawaii          101,083
40           Idaho          110,871
6         Illinois        1,025,667
20         Indiana          470,324
33            Iowa          238,342
34          Kansas          209,326
30        Kentucky          258,981
27       Louisiana          291,952
45           Maine           85,801
19        Maryland          480,113
13   Massachusetts          691,461
15        Michigan          622,563
21       Minnesota          448,032
38     Mississippi          139,976
23        Missouri          396,890
47         Montana           67,072
36        Nebraska          

50
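
##### After filtering and sorting, the rows keep their original index labels (28, 50, 17 and so on). The sketch below shows the same filter and sort on a small sample, followed by reset_index so the index runs from 0 again. The sample values are placeholders rather than figures from the page, and the variable names are assumptions.

```python
import pandas as pd

# Placeholder rows standing in for the scraped data
sample = pd.DataFrame({
    "State": ["State or federal district", "2022", "Texas",
              "United States", "Alaska", "District of Columbia"],
    "value": ["header", "2024", "b", "total", "a", "dc"],
})

# Labels that are not states and should be dropped
non_states = ["District of Columbia", "United States",
              "State or federal district", "2022"]

cleaned = (
    sample[~sample["State"].isin(non_states)]  # keep rows whose label is NOT in the list
    .sort_values(by="State")                    # alphabetical order
    .reset_index(drop=True)                     # renumber rows 0..n-1, discarding the old index
)

print(cleaned)
```

##### Resetting the index is optional here, since the csv file is later written with index=False, but it can make the displayed data frame easier to read.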

In [11]:
print(gdp_df.dtypes)


State               object
nominal gdp 2024    object
dtype: object


In [12]:
print(gdp_df['nominal gdp 2024'].unique())  
# shows the unique values in the column exactly as they are stored

['281,569' '65,699' '475,654' '165,989' '3,641,643' '491,289' '319,345'
 '90,208' '1,439,065' '767,378' '101,083' '110,871' '1,025,667' '470,324'
 '238,342' '209,326' '258,981' '291,952' '85,801' '480,113' '691,461'
 '622,563' '448,032' '139,976' '396,890' '67,072' '164,934' '222,939'
 '105,025' '754,948' '125,541' '2,048,403' '715,968' '72,651' '825,990'
 '242,739' '297,309' '911,813' '72,771' '297,546' '68,782' '485,657'
 '2,402,137' '256,370' '40,831' '663,106' '738,101' '97,417' '396,209'
 '49,081']


##### Removing the commas from the 'nominal gdp 2024' column updates the data frame. Because gdp_df was created as a selection from unsorted_gdp_df, pandas can warn that a value is being set on a copy of a slice. This warning can be avoided by first saving an explicit copy of gdp_df.

In [13]:
gdp_df = gdp_df.copy()

gdp_df['nominal gdp 2024'] = gdp_df['nominal gdp 2024'].str.replace(',', '').str.strip()
gdp_df

Unnamed: 0,State,nominal gdp 2024
28,Alabama,281569
50,Alaska,65699
17,Arizona,475654
35,Arkansas,165989
2,California,3641643
16,Colorado,491289
24,Connecticut,319345
44,Delaware,90208
5,Florida,1439065
9,Georgia,767378


In [14]:
gdp_df['nominal gdp 2024'] = pd.to_numeric(gdp_df['nominal gdp 2024'], errors='coerce')


In [15]:
gdp_df['nominal gdp 2024'] = gdp_df['nominal gdp 2024'].fillna(0)  # assign the result back instead of using inplace on a column selection
gdp_df

Unnamed: 0,State,nominal gdp 2024
28,Alabama,281569
50,Alaska,65699
17,Arizona,475654
35,Arkansas,165989
2,California,3641643
16,Colorado,491289
24,Connecticut,319345
44,Delaware,90208
5,Florida,1439065
9,Georgia,767378


In [16]:
gdp_df.dtypes

State               object
nominal gdp 2024     int64
dtype: object
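
##### With errors='coerce', any value that cannot be parsed becomes NaN, and filling those with 0 would hide the problem. The int64 dtype above suggests nothing was coerced in this run, since a column containing NaN would be stored as float64. As a hedged sketch, the check below lists the raw values that failed to parse before any filling is done. The value 'n/a' is a made-up example of a bad entry, not something found on the page.

```python
import pandas as pd

# Two values from the output above plus one hypothetical unparseable entry
raw = pd.Series(["281,569", "65,699", "n/a"], name="nominal gdp 2024")

# Strip commas and whitespace, then convert; failures become NaN
stripped = raw.str.replace(",", "", regex=False).str.strip()
numeric = pd.to_numeric(stripped, errors="coerce")

# Raw entries that could not be converted
failed = raw[numeric.isna()]

print(numeric.dtype)
print(failed.tolist())
```

##### If the list of failed entries is empty, the fillna step has nothing to replace and the column can be used as is.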

#### The data frame will be saved to a csv file so it can be merged with the other data collected from similar pages.

In [17]:
gdp_df.to_csv('gdp_df.csv', index=False)
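
##### The merge with the other csv files is done elsewhere, but a minimal sketch of how it might look is given below. The second file, its column name 'other metric' and its values are hypothetical, and in-memory text is used in place of real files so the example runs on its own. The two GDP values are taken from the output above.

```python
import io

import pandas as pd

# Stand-ins for csv files on disk; the second dataset is hypothetical
gdp_csv = io.StringIO("State,nominal gdp 2024\nAlabama,281569\nAlaska,65699\n")
other_csv = io.StringIO("State,other metric\nAlabama,1\nAlaska,2\n")

gdp = pd.read_csv(gdp_csv)
other = pd.read_csv(other_csv)

# Outer merge with an indicator column to reveal states missing from either file
merged = gdp.merge(other, on="State", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

print(merged)
print(f"Rows without a match in both files: {len(unmatched)}")
```

##### An outer merge with indicator=True is one way to catch spelling differences in state names between pages, since mismatched rows show up as left_only or right_only instead of being dropped silently.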

### References 

#### Wikipedia contributors. (n.d.). List of U.S. states and territories by GDP. Wikipedia, The Free Encyclopedia. Retrieved April 8, 2025, from https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP