**Context**
- This dataset spans 170 countries and 300+ cities and reports air quality for four pollutants: Carbon Monoxide, Ozone, Nitrogen Dioxide, and Particulate Matter (PM2.5). It is intended for environmental scientists, policymakers, and researchers who want to analyze air quality trends and inform policy decisions.

**Content**
- The columns give the country name, city name, overall Air Quality Index (AQI) value, and an AQI value and category for each pollutant. Note that the pollutant columns hold AQI sub-index values, not raw concentrations. This layout supports correlation studies: comparing each pollutant's AQI against the overall AQI shows which pollutants drive air quality in a given city or country.

Dataset Structure:
The dataset (`global_air_pollution_data.csv`) covers the year 2024 and includes the following columns:

- `country_name`: Name of the country
- `city_name`: Name of the city
- `aqi_value`: Overall AQI value of the city
- `aqi_category`: Overall AQI category of the city
- `co_aqi_value`: AQI value of Carbon Monoxide for the city
- `co_aqi_category`: AQI category of Carbon Monoxide for the city
- `ozone_aqi_value`: AQI value of Ozone for the city
- `ozone_aqi_category`: AQI category of Ozone for the city
- `no2_aqi_value`: AQI value of Nitrogen Dioxide for the city
- `no2_aqi_category`: AQI category of Nitrogen Dioxide for the city
- `pm2.5_aqi_value`: AQI value of Particulate Matter (diameter of 2.5 micrometers or less) for the city
- `pm2.5_aqi_category`: AQI category of Particulate Matter (diameter of 2.5 micrometers or less) for the city
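
Each `*_aqi_value` column is paired with a `*_aqi_category` label. The dataset description does not say which AQI scale produces those labels, so the following is only a minimal sketch that assumes the US EPA bands (0-50 Good, 51-100 Moderate, and so on). The names `AQI_UPPER_BOUNDS`, `AQI_LABELS`, and `aqi_category` are illustrative and not part of the dataset or notebook.

```python
from bisect import bisect_left

# ASSUMPTION: US EPA AQI bands; each number is the inclusive upper bound of a band.
# The dataset description does not confirm that this is the scale it uses.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",  # everything above 300
]


def aqi_category(aqi_value: int) -> str:
    """Map an AQI value to its (assumed) US EPA category label."""
    if aqi_value < 0:
        raise ValueError("AQI values are non-negative")
    # bisect_left returns the index of the first band whose upper bound >= aqi_value
    return AQI_LABELS[bisect_left(AQI_UPPER_BOUNDS, aqi_value)]


for value in (22, 51, 66, 150, 301):
    print(value, aqi_category(value))
```

A mapping like this can be compared against the `aqi_category` column to check whether the labels in the file are consistent with the assumed scale.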

In [9]:
import pandas as pd

# Read the CSV file into a DataFrame
global_air_pollution_data = pd.read_csv('global_air_pollution_data.csv')

# Display the first few rows of the DataFrame to verify the data
global_air_pollution_data.head()

Unnamed: 0,country_name,city_name,aqi_value,aqi_category,co_aqi_value\t,co_aqi_category,ozone_aqi_value,ozone_aqi_category,no2_aqi_value,no2_aqi_category,pm2.5_aqi_value,pm2.5_aqi_category
0,Russian Federation,Praskoveya,51,Moderate,1,Good,36,Good,0,Good,51,Moderate
1,Brazil,Presidente Dutra,41,Good,1,Good,5,Good,1,Good,41,Good
2,Italy,Priolo Gargallo,66,Moderate,1,Good,39,Good,2,Good,66,Moderate
3,Poland,Przasnysz,34,Good,1,Good,34,Good,0,Good,20,Good
4,France,Punaauia,22,Good,0,Good,22,Good,0,Good,6,Good
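
The header row printed above lists `co_aqi_value\t`, which suggests that the column name in the CSV carries a trailing tab. If that is the case, `df['co_aqi_value']` would raise a `KeyError`. The sketch below shows one way to normalize the names; it uses a tiny hand-built stand-in frame (`demo`, a hypothetical name) rather than the real file.

```python
import pandas as pd

# Stand-in for the loaded DataFrame; the trailing tab mirrors the printed header.
demo = pd.DataFrame(
    {
        "country_name": ["Italy"],
        "city_name": ["Priolo Gargallo"],
        "co_aqi_value\t": [1],
        "co_aqi_category": ["Good"],
    }
)

# Before cleaning, the plain name is not a valid column label.
print("co_aqi_value" in demo.columns)

# Strip leading and trailing whitespace (including tabs) from every column name.
demo.columns = demo.columns.str.strip()

# After cleaning, the column can be addressed by its documented name.
print("co_aqi_value" in demo.columns)
```

Applying the same `columns.str.strip()` step to `global_air_pollution_data` right after `pd.read_csv` would keep later cells from depending on the stray whitespace.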


In [10]:
# Check for missing values in the DataFrame
missing_values = global_air_pollution_data.isnull().sum()

# Calculate the percentage of missing values
total_rows = global_air_pollution_data.shape[0]
missing_percentage = (missing_values / total_rows) * 100

# Fraction of rows that contain at least one missing value
rows_with_missing = global_air_pollution_data.isnull().any(axis=1).mean()

if rows_with_missing < 0.02:
    # Fewer than 2% of rows are affected, so drop them
    global_air_pollution_data_cleaned = global_air_pollution_data.dropna()
else:
    # Otherwise impute: column mean for numeric data, most frequent value for everything else
    global_air_pollution_data_cleaned = global_air_pollution_data.apply(
        lambda col: col.fillna(col.mean()) if col.dtype.kind in 'biufc' else col.fillna(col.mode()[0])
    )

# Display the cleaned DataFrame
global_air_pollution_data_cleaned.head()

Unnamed: 0,country_name,city_name,aqi_value,aqi_category,co_aqi_value\t,co_aqi_category,ozone_aqi_value,ozone_aqi_category,no2_aqi_value,no2_aqi_category,pm2.5_aqi_value,pm2.5_aqi_category
0,Russian Federation,Praskoveya,51,Moderate,1,Good,36,Good,0,Good,51,Moderate
1,Brazil,Presidente Dutra,41,Good,1,Good,5,Good,1,Good,41,Good
2,Italy,Priolo Gargallo,66,Moderate,1,Good,39,Good,2,Good,66,Moderate
3,Poland,Przasnysz,34,Good,1,Good,34,Good,0,Good,20,Good
4,France,Punaauia,22,Good,0,Good,22,Good,0,Good,6,Good
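
In the five rows printed above, the overall `aqi_value` coincides with the largest of the four pollutant AQI values. The sketch below checks that observation on those five rows only; whether it holds for the whole file is an assumption that would need to be verified on the full DataFrame. The column name `co_aqi_value` is written without the trailing tab, and `sample`, `max_pollutant_aqi`, and `dominant_pollutant` are illustrative names.

```python
import pandas as pd

# The five rows shown by head(), copied by hand.
sample = pd.DataFrame(
    {
        "city_name": ["Praskoveya", "Presidente Dutra", "Priolo Gargallo", "Przasnysz", "Punaauia"],
        "aqi_value": [51, 41, 66, 34, 22],
        "co_aqi_value": [1, 1, 1, 1, 0],
        "ozone_aqi_value": [36, 5, 39, 34, 22],
        "no2_aqi_value": [0, 1, 2, 0, 0],
        "pm2.5_aqi_value": [51, 41, 66, 20, 6],
    }
)

pollutant_cols = ["co_aqi_value", "ozone_aqi_value", "no2_aqi_value", "pm2.5_aqi_value"]

# Row-wise maximum of the pollutant sub-indices, and the column it comes from.
sample["max_pollutant_aqi"] = sample[pollutant_cols].max(axis=1)
sample["dominant_pollutant"] = sample[pollutant_cols].idxmax(axis=1)

# True only if the overall AQI equals the largest sub-index in every row.
matches = (sample["aqi_value"] == sample["max_pollutant_aqi"]).all()

print(sample[["city_name", "aqi_value", "max_pollutant_aqi", "dominant_pollutant"]])
print("overall AQI equals max pollutant AQI in all sample rows:", bool(matches))
```

If the relationship holds more widely, the `dominant_pollutant` column gives a quick way to see which pollutant sets the overall AQI for each city.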


In [12]:
import plotly.express as px
import plotly.graph_objects as go

# Summary statistics
summary_stats = global_air_pollution_data_cleaned.describe()

# Distribution of AQI values
fig_aqi_dist = px.histogram(global_air_pollution_data_cleaned, x='aqi_value', nbins=50, title='Distribution of AQI Values')
fig_aqi_dist.update_layout(xaxis_title='AQI Value', yaxis_title='Count')

# Box plots for AQI values by category
fig_aqi_box = px.box(global_air_pollution_data_cleaned, x='aqi_category', y='aqi_value', title='Box Plot of AQI Values by Category')
fig_aqi_box.update_layout(xaxis_title='AQI Category', yaxis_title='AQI Value')

# Scatter plot for AQI values by city
fig_aqi_scatter = px.scatter(global_air_pollution_data_cleaned, x='city_name', y='aqi_value', color='aqi_category', title='AQI Values by City')
fig_aqi_scatter.update_layout(xaxis_title='City Name', yaxis_title='AQI Value')

# Correlation heatmap
# Select numeric columns only, leaving out the 'Unnamed: 0' row-index column,
# which is not an air pollution metric
numeric_cols = (
    global_air_pollution_data_cleaned
    .select_dtypes(include='number')
    .columns
    .drop('Unnamed: 0', errors='ignore')
)
corr_matrix = global_air_pollution_data_cleaned[numeric_cols].corr()
fig_corr_heatmap = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
    colorscale='Viridis'
))
fig_corr_heatmap.update_layout(title='Correlation Heatmap of Air Pollution Metrics')

# Display the plots
fig_aqi_dist.show()
fig_aqi_box.show()
fig_aqi_scatter.show()
fig_corr_heatmap.show()

In [14]:
# Aggregate AQI values by country
aqi_by_country = global_air_pollution_data_cleaned.groupby('country_name')['aqi_value'].mean().reset_index()

# Sort the data by AQI values
aqi_by_country = aqi_by_country.sort_values(by='aqi_value', ascending=False)

# Bar plot for AQI values by country
fig_aqi_country = px.bar(aqi_by_country, x='country_name', y='aqi_value', title='Average AQI Values by Country (Sorted)')
fig_aqi_country.update_layout(xaxis_title='Country Name', yaxis_title='Average AQI Value')

# Display the plot
fig_aqi_country.show()
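
A country-level mean can rest on very few cities, which makes the sorted ranking above sensitive to single observations. As a hedged sketch, the aggregation could also record how many cities contribute to each mean and filter on that count. The toy frame `toy`, the threshold `min_cities`, and the output column names are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data: country "A" has three cities, country "B" has one.
toy = pd.DataFrame(
    {
        "country_name": ["A", "A", "A", "B"],
        "aqi_value": [40, 60, 50, 120],
    }
)

# Named aggregation: mean AQI and number of contributing cities per country.
summary = (
    toy.groupby("country_name")["aqi_value"]
    .agg(mean_aqi="mean", city_count="count")
    .reset_index()
)

# ASSUMPTION: an arbitrary cutoff; pick a value that suits the real data.
min_cities = 2
reliable = summary[summary["city_count"] >= min_cities].sort_values(
    "mean_aqi", ascending=False
)

print(summary)
print(reliable)
```

The same `agg` call works on `global_air_pollution_data_cleaned`, and the resulting `city_count` column can be passed to `px.bar` as hover data so that thinly sampled countries are easy to spot.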