### In-Class Assignment: Web Scraping and Data Extraction from a New Webpage
- Use the requests library to fetch a new webpage.
- Parse the HTML content using BeautifulSoup.
- Extract various elements such as figures, tables, and text.
- Work collaboratively in groups to practice web scraping and present findings.

### Task 1: Select a Webpage
Select a webpage of interest (e.g., a news article, an educational resource, or a data-driven website). Ensure that it contains a variety of elements, such as tables, figures, and text content.

### Task 2: Fetch and Parse the Webpage

In [3]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP'
response = requests.get(url)

In [4]:

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the webpage!")
else:
    print("Failed to fetch the webpage.")

Successfully fetched the webpage!
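
The status check above only prints a message. One possible alternative, assuming you would rather have the notebook stop on a failed request, is a small helper that sets a timeout and calls `raise_for_status()`. The helper name `fetch_html`, its `session` parameter, and the User-Agent string are illustrative assumptions, not part of the assignment.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, raising on HTTP errors instead of printing."""
    # 'session' may be any object with a requests-style .get();
    # when omitted, the requests module itself is used.
    client = session or requests
    # Hypothetical identifier; some sites reject requests without a User-Agent.
    headers = {"User-Agent": "in-class-scraping-exercise/0.1"}
    response = client.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Example use (performs a real network request, so it is left commented out):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP')
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` exactly as `response.text` is in the next cell.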


In [5]:

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

### Task 3: Extract Elements

In [37]:
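# Find all images and extract their src attributes.
from IPython.display import Image, display

images = soup.find_all('img')
image_urls = [img['src'] for img in images if 'src' in img.attrs]

# Display one specific image from the list (index 4 here) instead of all of them.
img = image_urls[4]
display(Image(url=img))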
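
The `src` values collected above are not always full URLs; pages often use relative or protocol-relative paths. A minimal sketch of resolving them against the page URL with the standard library's `urljoin` follows. The helper name `absolute_image_urls` and the sample `src` values are hypothetical and were not taken from the actual page.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve relative or protocol-relative src values against the page URL."""
    return [urljoin(page_url, src) for src in srcs]


page_url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP"
sample_srcs = [  # hypothetical examples of the three common forms
    "//upload.wikimedia.org/example/map.png",  # protocol-relative
    "/static/images/icon.png",                 # relative to the site root
    "https://example.org/photo.jpg",           # already absolute
]
resolved = absolute_image_urls(page_url, sample_srcs)
print(resolved)
```

Resolved URLs can be handed to `Image(url=...)` or downloaded with `requests.get` without depending on how the notebook front end treats relative paths.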

In [38]:
#image display


#from IPython.display import Image, display
#for url in image_urls:
    #display(Image(url=url))


In [44]:
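# Locate the tables on the webpage and convert the first one into a pandas DataFrame.
from io import StringIO

import pandas as pd

tables = soup.find_all('table')
if not tables:
    raise ValueError("No <table> elements found on the page.")
table = tables[0]
# Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated.
df = pd.read_html(StringIO(str(table)))[0]
df.head()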


Unnamed: 0_level_0,State or federal district,Nominal GDP at current prices 2023 (millions of U.S. dollars)[1],Nominal GDP at current prices 2023 (millions of U.S. dollars)[1],Annual GDP change at current prices (2022–2023)[1],Annual GDP change at current prices (2022–2023)[1],Real GDP growth rate (2022–2023)[1],Nominal GDP per capita[1][3],Nominal GDP per capita[1][3],% of national[1],% of national[1]
Unnamed: 0_level_1,State or federal district,2022,2024,Annual GDP change at current prices (2022–2023)[1],Annual GDP change at current prices (2022–2023)[1].1,Real GDP growth rate (2022–2023)[1],2022,2024,2022,2023
0,California *,3641643,3987285,220528,6.1%,2.1%,"$93,460","$102,527",14.69%,14.11%
1,Texas *,2402137,2664144,161371,6.7%,5.7%,"$78,750","$86,004",8.69%,9.37%
2,New York *,2048403,2226903,103859,5.1%,0.7%,"$104,660","$114,380",8.11%,7.86%
3,Florida *,1439065,1647446,140438,9.8%,5.0%,"$63,640","$71,703",5.37%,5.77%
4,Illinois *,1025667,1107087,57301,5.6%,1.3%,"$81,730","$88,447",4.11%,3.96%
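
The output above has a two-level header, and values such as `$93,460` and `14.69%` are read in as strings. A possible cleanup step, sketched on a small stand-in frame built from two rows of that output, flattens the header and converts those columns to numbers. The helper names `flatten_columns` and `to_number` are illustrative assumptions.

```python
import pandas as pd

# Stand-in frame: two rows and three columns copied from the output above.
columns = pd.MultiIndex.from_tuples([
    ("State or federal district", "State or federal district"),
    ("Nominal GDP per capita[1][3]", "2022"),
    ("% of national[1]", "2022"),
])
df_small = pd.DataFrame(
    [["California *", "$93,460", "14.69%"], ["Texas *", "$78,750", "8.69%"]],
    columns=columns,
)


def flatten_columns(frame):
    """Join the two header levels, keeping one copy when both levels match."""
    flat = frame.copy()
    flat.columns = [
        top if top == sub else f"{top} {sub}" for top, sub in frame.columns
    ]
    return flat


def to_number(series):
    """Strip $, commas, and % signs, then convert the strings to numbers."""
    return pd.to_numeric(series.str.replace(r"[$,%]", "", regex=True))


flat = flatten_columns(df_small)
for col in ["Nominal GDP per capita[1][3] 2022", "% of national[1] 2022"]:
    flat[col] = to_number(flat[col])
print(flat.dtypes)
```

With numeric columns, sorting and arithmetic on the full table behave as expected instead of comparing strings.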


In [45]:
# Convert every table on the page into a DataFrame and preview each one.
from io import StringIO

for i, table in enumerate(tables):
    df = pd.read_html(StringIO(str(table)))[0]
    print(f"Table {i+1}:\n", df.head(), "\n")

In [52]:
# Extract the main text content, such as paragraphs or headings.
paragraphs = soup.find_all('p')
text_content = ' '.join([para.get_text() for para in paragraphs])
print(text_content[:926])  # Print the first 926 characters


 This is a list of U.S. states and territories by gross domestic product (GDP). This article presents the 50 U.S. states and the District of Columbia and their nominal GDP at current prices.
 The data source for the list is the Bureau of Economic Analysis (BEA) in 2024. The BEA defined GDP by state as "the sum of value added from all industries in the state."[1]
 Nominal GDP does not take into account differences in the cost of living in different countries, and the results can vary greatly from one year to another based on fluctuations in the exchange rates of the country's currency. Such fluctuations may change a country's ranking from one year to the next, even though they often make little or no difference in the standard of living of its population.[2]
 Overall, in the calendar year 2024, the United States' Nominal GDP at Current Prices totaled at $28.269 trillion, as compared to $25.744 trillion in 2022.
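
The extracted text above still carries citation markers such as `[1]` and `[2]`. If those are unwanted, one hedged option is a small regular-expression cleanup; the helper name `strip_citations` is an assumption introduced for illustration.

```python
import re


def strip_citations(text):
    """Remove bracketed numeric markers like [1] and collapse repeated whitespace."""
    without_refs = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", without_refs).strip()


# Sample sentence copied from the extracted text above.
sample = ('The BEA defined GDP by state as "the sum of value added '
          'from all industries in the state."[1]')
print(strip_citations(sample))
```

The same function could be applied to `text_content` before printing or further analysis.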
 


### Task 4: Analyze and Discuss Findings
Each group will analyze the extracted data and discuss the following:
- What figures (images) were extracted and what do they represent?
- What information is contained in the tables, and how does it contribute to the overall content of the webpage?
- What is the main focus of the text content extracted? How does it relate to the images and tables?
- Discuss the challenges faced during extraction, such as dealing with complex HTML structures or incomplete data.

The image, table, and text all come from the Wikipedia article *List of U.S. states and territories by GDP*, which shows how each U.S. state performed economically in 2024 compared with 2022. The extracted image is a color-coded map of the U.S. in which darker shades of red mark states with higher nominal GDP and lighter shades mark states with lower nominal GDP. From the image, it can be inferred which states contribute the most to nominal GDP and which contribute the least. The extracted table supports the same conclusion, but because it attaches monetary values to each state, it also shows that California had the highest GDP in both 2022 and 2024, while Texas grew at a higher rate between those two years. This inference could not have been made from the image alone. The extracted text gives an overview of the data that the image and table present and adds information such as the national total across all states. The biggest challenge during extraction was displaying the desired image in the notebook. The remedy discovered was to assign the URL of the desired image to a variable and then display that specific image from the list of images collected from the source.
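
The growth comparison in the analysis above can be checked with a few lines of arithmetic. This sketch uses the California and Texas values from the columns labeled 2022 and 2024 in the `df.head()` output; the helper name `percent_growth` is an illustrative assumption.

```python
# Nominal GDP in millions of U.S. dollars, copied from the df.head() output.
gdp = {
    "California": (3641643, 3987285),
    "Texas": (2402137, 2664144),
}


def percent_growth(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100


growth = {state: percent_growth(*values) for state, values in gdp.items()}
for state, pct in growth.items():
    print(f"{state}: {pct:.2f}%")
```

Texas comes out ahead of California on percentage growth even though California's GDP is larger in both columns, which matches the observation in the write-up.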

### Task 5: Present Findings
Share your analysis of the extracted elements.
Discuss any patterns, relationships, or insights gained from the data.

Each group should submit their Jupyter notebook (or Python script) with the code, analysis, and any additional notes or reflections on the exercise.