In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup


In [2]:
# Step #1: Web Scraping
# Web scraping was performed to extract data from the Wikipedia page, making it accessible for analysis.
url = "https://en.wikipedia.org/wiki/Air_quality_index"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the page could not be fetched
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table containing the data
table = soup.find('table', {'class': 'wikitable'})
data = []



In [3]:
# Step #2: Remove Headers and Step #3: Format Data
# Headers were removed to eliminate unnecessary titles or labels that could interfere with data analysis.
# Data formatting was necessary to standardize data types and formats for consistency and readability.
for row in table.find_all('tr')[1:]:
    columns = row.find_all('td')
    if len(columns) >= 3:  # Ensure it's a data row
        city = columns[0].get_text(strip=True)
        aqi = columns[1].get_text(strip=True)
        category = columns[2].get_text(strip=True)
        data.append([city, aqi, category])

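# Note: as the printed output shows, the first 'wikitable' on this page holds pollutants,
# averaging periods, and concentrations, so the labels below do not describe the contents.
df = pd.DataFrame(data, columns=['City', 'AQI', 'Category'])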

# Print the DataFrame after Step #2 and Step #3
print("Step #2 and Step #3 - Removing Headers and Formatting Data:")
print(df.head())


Step #2 and Step #3 - Removing Headers and Formatting Data:
               City      AQI    Category
0   Carbon monoxide  8 hours       9 ppm
1  Nitrogen dioxide   1 hour    0.12 ppm
2             Ozone   1 hour    0.10 ppm
3   Sulphur dioxide   1 hour    0.20 ppm
4              Lead   1 year  0.50 μg/m3


In [4]:
# Step #5: Find Duplicates
# Duplicates were removed to avoid redundant information and maintain a clean dataset.
df.drop_duplicates(inplace=True)

# Print the DataFrame after Step #5
print("\nStep #5 - Removing Duplicates:")
print(df.head())




Step #5 - Removing Duplicates:
               City      AQI    Category
0   Carbon monoxide  8 hours       9 ppm
1  Nitrogen dioxide   1 hour    0.12 ppm
2             Ozone   1 hour    0.10 ppm
3   Sulphur dioxide   1 hour    0.20 ppm
4              Lead   1 year  0.50 μg/m3
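
The output above is identical to the previous step, so it is not obvious whether any rows were dropped. A minimal sketch of reporting the duplicate count before removal, using a small hypothetical sample frame (the rows below are illustrative, not the scraped data):

```python
import pandas as pd

# Hypothetical sample rows for illustration only; not the scraped table.
sample = pd.DataFrame(
    [
        ["Ozone", "1 hour", "0.10 ppm"],
        ["Ozone", "1 hour", "0.10 ppm"],  # exact duplicate of the row above
        ["Lead", "1 year", "0.50 μg/m3"],
    ],
    columns=["City", "AQI", "Category"],
)

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset the index for readability.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed; {len(deduplicated)} remain")
```

Recording the count alongside the cleaning step documents what the step actually changed.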


In [5]:
# Step #6: Fix Casing (Converting City names to title case)
# Casing was fixed to ensure consistent capitalization of city names and avoid inconsistencies in data.
df['City'] = df['City'].str.title()

# Print the DataFrame after Step #6
print("\nStep #6 - Fixing Casing:")
print(df.head())


Step #6 - Fixing Casing:
               City      AQI    Category
0   Carbon Monoxide  8 hours       9 ppm
1  Nitrogen Dioxide   1 hour    0.12 ppm
2             Ozone   1 hour    0.10 ppm
3   Sulphur Dioxide   1 hour    0.20 ppm
4              Lead   1 year  0.50 μg/m3
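
Title casing is not always a neutral transformation. pandas documents `Series.str.title()` as equivalent to Python's built-in `str.title()`, which capitalizes the letter after every non-letter character, including apostrophes. The sketch below compares it with a hypothetical helper, `capitalize_words`, that only touches the first character of each space-separated word; the sample names are made up for illustration.

```python
def capitalize_words(name: str) -> str:
    """Uppercase only the first character of each space-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in name.split(" "))


# Made-up sample names, chosen to show where the two approaches differ.
samples = ["o'neil's bay", "carbon monoxide"]

for s in samples:
    # str.title() restarts capitalization after the apostrophe;
    # capitalize_words() leaves the rest of each word unchanged.
    print(repr(s.title()), repr(capitalize_words(s)))
```

Neither approach is correct for every name, so the choice is worth documenting next to the cleaning step.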


Data wrangling, including web scraping and data cleaning, raises several ethical questions when the data comes from a source like Wikipedia, as with the Air Quality Index (AQI) page used here.

First, excessive scraping can strain a website's servers and disrupt the site's normal operation. Responsible scraping practices, such as identifying the client with a proper user agent and adhering to the site's scraping policies, are essential.

Second, cleaning steps such as removing duplicates and fixing casing are generally straightforward, but any step that involves a subjective decision can introduce unintentional bias. For example, converting city names to title case can impose a Western-centric capitalization convention that is not appropriate for every context. Documenting and justifying each cleaning decision keeps the process transparent and makes concerns about bias or data quality easier to address.

Ultimately, ethical data wrangling should prioritize accuracy, fairness, and respect for data sources, and any analysis should be conducted responsibly, with full awareness of potential biases and their implications.
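
The responsible scraping practices mentioned above can be sketched as follows. This is a minimal illustration, not the method used in this notebook: the `USER_AGENT` string, the helper names, and the sample robots.txt text are all assumptions made for the example. It uses only the standard library's `urllib.robotparser` to check a robots.txt policy before fetching.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifying string; replace with your own project and contact details.
USER_AGENT = "aqi-notebook-example/0.1 (contact: you@example.org)"


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_headers(user_agent: str = USER_AGENT) -> dict:
    """Build request headers that identify the client to the server."""
    return {"User-Agent": user_agent}


def wait_between_requests(seconds: float = 1.0) -> None:
    """Pause between requests so repeated fetches do not burden the server."""
    time.sleep(seconds)


# Illustrative robots.txt content; this is not Wikipedia's actual file.
example_robots = "User-agent: *\nDisallow: /private/\n"

print(allowed_by_robots(example_robots, USER_AGENT, "https://example.org/wiki/Air_quality_index"))
print(allowed_by_robots(example_robots, USER_AGENT, "https://example.org/private/data"))
```

With these helpers, the headers could be passed to the fetch as `requests.get(url, headers=polite_headers())`, after the target URL has been checked against the site's real robots.txt.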