<a href="https://colab.research.google.com/github/Medynal/Pollution/blob/main/pollution_(2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project analyses air quality data from major cities in India between 2015 and 2020 to understand pollution trends and predict air quality levels. The dataset includes key pollutants such as PM2.5, PM10, NO₂, SO₂, CO, O₃, and related compounds, along with the Air Quality Index (AQI) and AQI categories.

The project involves data cleaning, exploratory data analysis, feature engineering, and machine learning modeling to predict AQI and AQI categories. Pollutant levels, patterns and correlations are visualised, and a Streamlit application predicts future AQI and AQI categories using models trained on the dataset. Project documentation is managed using GitHub.

In [36]:
# import os to access the dataset cloned from GitHub
import os

repo_url = "https://github.com/Medynal/Pollution.git"
folder_path = "/content/Pollution"

if not os.path.exists(folder_path):
  !git clone {repo_url} {folder_path}
print(f"Current working directory: {os.getcwd()}")


Current working directory: /content


In [37]:
%ls

cleaned_pollution_dataset.csv  Pollution/  sample_data/


In [38]:
# importing the necessary libraries for data analysis
#import pandas and numpy for Data cleaning and analysis
#matplotlib, plotly and seaborn for visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

In [39]:
#extracting the datasets into a list
pollution_dataset= '/content/Pollution/data_folder'
dataframes = []
for filename in os.listdir(pollution_dataset):
    if filename.endswith('.csv'):  # Check if the file is a CSV file
        file_path = os.path.join(pollution_dataset, filename)
        df = pd.read_csv(file_path)  # Read the CSV file into a DataFrame
        dataframes.append(df)  # Add the DataFrame to the list

In [40]:
#concatenating Datasets
pollution_df = pd.concat(dataframes, ignore_index=True)
pollution_df

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Patna,01/06/2015,,,14.41,25.06,39.32,,1.56,1.80,8.89,0.00,0.29,0.00,,
1,Patna,02/06/2015,,,25.00,22.48,47.50,,2.35,9.69,9.90,0.08,0.83,0.09,,
2,Patna,03/06/2015,,,14.29,17.16,29.81,,1.69,20.61,12.63,0.00,0.33,0.00,,
3,Patna,04/06/2015,,,13.03,15.62,28.63,,1.20,4.35,9.77,0.01,0.28,0.00,,
4,Patna,05/06/2015,,,10.40,10.36,20.14,,1.29,7.22,11.90,0.00,0.15,0.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29526,Visakhapatnam,27/06/2020,15.02,50.94,7.68,25.06,19.54,12.47,0.47,8.55,23.30,2.24,12.07,0.73,41.0,Good
29527,Visakhapatnam,28/06/2020,24.38,74.09,3.42,26.06,16.53,11.99,0.52,12.72,30.14,0.74,2.21,0.38,70.0,Satisfactory
29528,Visakhapatnam,29/06/2020,22.91,65.73,3.45,29.53,18.33,10.71,0.48,8.42,30.96,0.01,0.01,0.00,68.0,Satisfactory
29529,Visakhapatnam,30/06/2020,16.64,49.97,4.05,29.26,18.80,10.03,0.52,9.84,28.30,0.00,0.00,0.00,54.0,Satisfactory


In [41]:
#Handling Duplicates
duplicate= pollution_df.duplicated().sum()
if duplicate > 0:
  pollution_df.drop_duplicates(inplace= True)
  print(f'{duplicate} duplicates have been removed')
else:
  print('No duplicates found')

No duplicates found


In [42]:
#check for missing values
def missing_values_table(df):
    mis_val = df.isnull().sum() # Total missing values
    mis_val_percent = 100 * mis_val / len(df)  # Percentage of missing values
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table = mis_val_table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table = mis_val_table.sort_values('% of Total Values', ascending=False)  # Sort the table by percentage of missing descending
    return mis_val_table

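# Import display explicitly so the call does not depend on a notebook-level name
# that an earlier cell may have overwritten
from IPython.display import display

missing_values = missing_values_table(pollution_df)
display(missing_values.style.background_gradient(cmap='Blues'))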

TypeError: 'list' object is not callable

In [None]:
missing_values_per_city = pollution_df.groupby('City').apply(lambda x: x.isnull().sum())
missing_values_per_city

**Missing Value Analysis**

The dataset shows significant variation in the proportion of missing values across different pollutants and target variables:

High missingness (>50%): Xylene (61.32%)

Moderate missingness (20–40%): PM10 (37.72%), NH3 (34.97%), Toluene (27.23%)

Lower missingness (10–20%): Benzene (19.04%), AQI and AQI_Bucket (15.85%), PM2.5 (15.57%), NOx (14.17%), O3 (13.62%), SO2 (13.05%), NO2 (12.14%), NO (12.13%)

Minimal missingness (<10%): CO (6.97%)

No missing values: Date and City

**Observations:**

Xylene, PM10, NH3, and Toluene have the highest missing percentages, indicating sparse monitoring or inconsistent recording for these pollutants across cities.

Target variables (AQI and AQI_Bucket) have ~15–16% missing values, which may impact supervised learning unless handled properly.

Missing values vary significantly by city: for example, Ahmedabad and Mumbai have higher counts for certain pollutants, whereas cities like Aizawl and Guwahati have very sparse measurements.

**Implications for analysis:**

Imputation strategies are necessary, especially for features with moderate missingness.

Options include mean/median imputation, forward/backward filling for time series, or model-based imputation.

Features with extremely high missingness (e.g., Xylene) may need to be excluded or treated cautiously to avoid bias.

For machine learning modeling, careful handling of missing AQI and AQI_Bucket values is critical since they are the prediction targets.

**Next steps:**

Visualize missingness by city and pollutant to identify patterns.

Decide on imputation vs exclusion for each variable based on missing percentage and importance.

Ensure imputation methods respect temporal and spatial dependencies in the data.
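
The imputation-versus-exclusion decision described above can be sketched as a small helper. This is only an illustration: the function name `plan_missing_value_handling`, the 50% cut-off, and the toy values are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def plan_missing_value_handling(df, drop_threshold=50.0):
    """Split columns into 'drop', 'impute' and 'complete' by percentage missing.

    drop_threshold is a hypothetical cut-off chosen for illustration.
    """
    pct_missing = df.isnull().mean() * 100
    return {
        "drop": pct_missing[pct_missing > drop_threshold].index.tolist(),
        "impute": pct_missing[(pct_missing > 0) & (pct_missing <= drop_threshold)].index.tolist(),
        "complete": pct_missing[pct_missing == 0].index.tolist(),
    }


# Toy frame with made-up values standing in for pollution_df
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "PM2.5": [10.0, np.nan, 30.0, 40.0],          # 25% missing
    "Xylene": [np.nan, np.nan, np.nan, 1.0],      # 75% missing
})

plan = plan_missing_value_handling(toy)
print(plan)
```

Applied to the real data, the same call would be `plan_missing_value_handling(pollution_df)`. The threshold is a judgment call and should be weighed against how important each pollutant is for the model.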

In [None]:
# calculate percentage missingness in each city
missing_values_per_city_percent= pollution_df.groupby('City').apply(lambda x: round(x.isnull().sum() *100 /len(x),2))
missing_values_per_city_percent

In [None]:
#visualizing percentage of missingness per city
plt.figure(figsize= (10,6))
sns.heatmap(missing_values_per_city_percent)
plt.xlabel('Pollutant')
plt.show()

In [None]:
#sort dataset
pollution_df= pollution_df.sort_values(['City', 'Date'],ascending= [True, True])

In [None]:
#feature engineering: datetime features
pollution_df['Date']= pd.to_datetime(pollution_df['Date'], errors= 'raise',format= '%d/%m/%Y')
# re-sort now that Date is a real datetime: sorting dd/mm/YYYY strings is not chronological
pollution_df= pollution_df.sort_values(['City', 'Date'])
pollution_df['year']= pollution_df['Date'].dt.year
pollution_df['month']= pollution_df['Date'].dt.month
pollution_df['day']= pollution_df['Date'].dt.day
pollution_df['Month name']= pollution_df['Date'].dt.month_name()


In [None]:
pollution_df['AQI_Bucket'].unique()

In [None]:
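#handling Missing Values: pollutants and AQI
pollutants = ["PM2.5","PM10","NO","NO2","NOx","NH3","CO","SO2","O3",
              "Benzene","Toluene","Xylene", "AQI"]
# forward fill, then backward fill, both inside each (City, month) group;
# a bare .bfill() chained after the grouped ffill runs over the whole frame
# and can pull values across city and month boundaries
pollution_df[pollutants] = pollution_df.groupby(['City', 'month'])[pollutants].ffill()
pollution_df[pollutants] = pollution_df.groupby(['City', 'month'])[pollutants].bfill()

#handling Missing Values: AQI_Bucket
def missing_bucket(row):
    aqi = row['AQI']
    if pd.isna(aqi):
        return np.nan  # keep the bucket missing instead of defaulting to 'Severe'
    if aqi <= 50:
        return 'Good'
    elif aqi <= 100:
        return 'Satisfactory'
    elif aqi <= 200:
        return 'Moderate'
    elif aqi <= 300:
        return 'Poor'
    elif aqi <= 400:
        return 'Very Poor'
    else:
        return 'Severe'

pollution_df['AQI_Bucket'] = pollution_df['AQI_Bucket'].fillna(
    pollution_df.apply(missing_bucket, axis=1))

# any remaining NaNs come from (City, month) groups with no observations at all
pollution_df.isna().sum()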


**Handling Missing Values**

Reviews by Gonzalez et al. (2016) and Jiang et al. (2020) emphasize that the choice of imputation method should consider both the percentage of missing data and the spatial-temporal patterns of monitoring stations. Because Xylene is 100% missing in almost all cities, it may need to be excluded or handled with particular care to avoid bias, whereas features with lower missingness can be imputed more reliably.

The dataset was first sorted by city and date to maintain a proper temporal order within each location. Missing values for pollutant columns and AQI were then filled using forward fill (ffill), which carries the last available observation forward into subsequent missing entries. A backward fill (bfill) then covers gaps at the start of a series, where no earlier observation exists.

To ensure that the imputation does not mix data across different locations or seasonal patterns, the forward fill was applied within groups defined by city and month. This approach prevents values from spilling over from one city or month to another, preserving the spatio-temporal integrity of the dataset.
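
The difference between a plain forward fill and a grouped one can be seen on a toy frame. The values and the two-city layout below are made up for illustration; only the `groupby(...).ffill()` pattern mirrors the notebook.

```python
import numpy as np
import pandas as pd

# Two hypothetical cities, same month, already sorted by city and date
toy = pd.DataFrame({
    "City":  ["A", "A", "A", "B", "B", "B"],
    "month": [1, 1, 1, 1, 1, 1],
    "PM2.5": [5.0, np.nan, np.nan, np.nan, 9.0, np.nan],
})

# Plain forward fill ignores group boundaries: city A's last value leaks into city B
leaky = toy["PM2.5"].ffill()

# Grouped forward fill restarts for every (City, month) group
grouped = toy.groupby(["City", "month"])["PM2.5"].ffill()

print(leaky.tolist())
print(grouped.tolist())
```

In `grouped`, the first row of city B stays missing because there is no earlier observation inside that group. Such leading gaps need a separate step, for example a backward fill applied within the same groups.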

**Rationale:**

Many pollutants show short-term persistence, so forward filling within the same city and month provides a reasonable estimate for missing measurements.

Grouping by city and month ensures that the imputed values reflect the local and seasonal context, rather than introducing bias from unrelated locations or months.

This method is particularly useful when missing values are scattered rather than clustered, which is the case for variables such as PM2.5, PM10, and AQI.

**Impact on Analysis:**

Missing values are now accounted for, enabling accurate exploratory data analysis (EDA), correlation studies, and machine learning modeling.

The method maintains temporal and spatial consistency, which is critical when predicting AQI_Bucket or analyzing pollutant trends.
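
As an alternative to a row-wise function, the AQI thresholds used in `missing_bucket` can be expressed with `pd.cut`. This is a sketch on made-up AQI values, not the notebook's own method. One design difference is worth noting: a missing AQI stays missing here instead of falling into a catch-all category.

```python
import numpy as np
import pandas as pd

# Bin edges follow the thresholds used in missing_bucket (50, 100, 200, 300, 400)
bins = [0, 50, 100, 200, 300, 400, np.inf]
labels = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

# Made-up AQI values, including one missing entry
aqi = pd.Series([41.0, 70.0, 150.0, 250.0, 350.0, 450.0, np.nan])

# Intervals are closed on the right; include_lowest=True also admits AQI == 0
buckets = pd.cut(aqi, bins=bins, labels=labels, include_lowest=True)
print(buckets.tolist())
```

The result is a categorical Series with an ordering from 'Good' to 'Severe', which can be convenient when the bucket is later used as a classification target.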

In [None]:

pollution_df.to_csv('cleaned_pollution_dataset.csv', index= False)

In [None]:

pollution_df.head(20)

In [None]:
# Check observations from each City: number of rows and number of distinct years

overview_df = (
    pollution_df.groupby('City')
    .agg(Data_Length=('Date', 'size'), Years_Covered=('year', 'nunique'))
    .reset_index()
    .sort_values('Data_Length', ascending=False)
)

overview_df


In [None]:
pollution_df.describe().T

In [None]:
plt.style.use('ggplot')

pollutant_columns = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene']

# Group by city and calculate mean pollutant concentrations
mean_pollutant_by_city =pollution_df.groupby('City')[pollutant_columns].mean()
#print(mean_pollutant_by_city)

# Find the top 5 cities for each pollutant
top_city = {}
for pollutant in pollutant_columns:
    top_city[pollutant] = mean_pollutant_by_city[pollutant].sort_values(ascending=False).head(5)
#print(top_city)

# Plot one horizontal bar chart per pollutant
fig, axes = plt.subplots(len(pollutant_columns), 1, figsize=(10, 20))

for i, pollutant in enumerate(pollutant_columns):
    axes[i].barh(top_city[pollutant].index, top_city[pollutant].values, color='red')
    axes[i].set_title(f'Top 5 Cities by {pollutant}')
    axes[i].set_xlabel(f'{pollutant}')
    axes[i].invert_yaxis()  # Highest values on top

plt.tight_layout()
plt.show()

'''Figure 1: Visualizing the top 5 cities by mean pollutant level.'''

Figure 1 shows the top five cities for each pollutant, ranked by average concentration. The dataset was aggregated by city to compute the mean concentration of each pollutant.

**Observations:**

Cities such as Delhi, Ahmedabad, Gurugram, Shillong and Lucknow consistently showed higher pollutant concentrations across multiple variables.

Particulate matter (PM2.5, PM10) levels were notably high in these cities, which is consistent with dense urbanization and industrial activity.

**Implications:**

These findings can guide targeted air quality monitoring and pollution mitigation efforts.
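
The ranking step behind this figure can be sketched with `groupby` and `nlargest`. The three cities and their values below are invented for illustration; the real notebook works on `pollution_df` and takes the top five.

```python
import pandas as pd

# Made-up daily readings for three hypothetical cities
toy = pd.DataFrame({
    "City":  ["A", "A", "B", "B", "C", "C"],
    "PM2.5": [100.0, 120.0, 40.0, 60.0, 80.0, 70.0],
    "PM10":  [150.0, 170.0, 210.0, 190.0, 90.0, 110.0],
})

# Mean concentration per city, then the two highest cities per pollutant
city_means = toy.groupby("City")[["PM2.5", "PM10"]].mean()
top2 = {col: city_means[col].nlargest(2) for col in city_means.columns}

for pollutant, ranking in top2.items():
    print(pollutant, ranking.to_dict())
```

`nlargest(n)` gives the same ordering as sorting in descending order and taking `head(n)`, so either form can be used.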


In [None]:
# Visualize average pollutant level by city
pollutants = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene']
cities = pollution_df['City'].unique()

for city in cities:
    city_data = pollution_df[pollution_df['City'] == city]
    plt.figure(figsize=(10, 6))
    for pollutant in pollutants:
        plt.bar(pollutant, city_data[pollutant].mean(), label=pollutant)
    plt.title(f'Average Pollutant Levels {city}')
    plt.xlabel('Pollutant')
    plt.ylabel('Average Concentration')
    plt.legend()
    plt.show()

'''Figure 2: Visualizing the average pollutant level by city.'''


**Average Pollutant Levels by City**

The average concentration of each pollutant was calculated for every city to compare overall air quality across locations. This aggregation highlights clear differences in pollution intensity between cities.

The results show that larger and more industrialised cities such as Mumbai, Ahmedabad and Gurugram tend to have higher average levels of particulate matter (PM2.5 and PM10) and gaseous pollutants such as NO₂, NOx, and CO. These high averages reflect sustained pollution exposure. In contrast, smaller or less industrialised cities generally exhibit lower average pollutant concentrations.

This city-level comparison provides a clear overview of long-term pollution patterns and helps identify high-risk urban areas that may require targeted air quality management and policy interventions.

In [None]:
city_stats = pollution_df.groupby('City')[pollutants].mean().reset_index()

import plotly.express as px

# Melt the DataFrame to create a 'Pollutant' column
city_stats_melted = city_stats.melt(id_vars='City',
                               value_vars=pollutants,
                               var_name='Pollutant',
                               value_name='Average Concentration')

# Create the stacked bar chart
fig_cities= px.bar(city_stats_melted,
                     x='City',
                     y='Average Concentration',
                     color='Pollutant', # Use 'Pollutant' for color differentiation
                     title='Average Pollution Levels by City (Stacked)',
                     labels={'City': 'Cities', 'Average Concentration': 'Average Concentration'},
                     barmode='stack')

fig_cities.show()

'''Figure 3: Visualizing the average pollutant levels by city using a stacked bar chart.'''

Figure 3 shows substantial differences in air pollution levels across cities, with particulate matter (PM2.5 and PM10) contributing the largest share of total pollution in most locations. Major metropolitan cities such as Delhi, Ahmedabad, and Lucknow stand out with noticeably higher overall concentrations, indicating more severe air quality challenges. In contrast, smaller cities display relatively lower pollutant levels, highlighting the strong link between urbanisation, traffic density, and industrial activity on the one hand and increased air pollution on the other.

In [None]:
pollutants = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene']
total_pollutant_concentration = pollution_df[pollutants].mean().sum()  # Calculates total concentration across all pollutants

# Calculate percentage for each pollutant
pollutant_percentages = [(pollution_df[pollutant].mean() / total_pollutant_concentration) * 100 for pollutant in pollutants]

# Create the donut-style pie chart with Plotly (no matplotlib figure is needed here)
perc_pol= px.pie(values= pollutant_percentages,names=pollutants, hole= 0.3,title= 'Percentage of Pollutants')
perc_pol.show()

'''Figure 4: Visualizing the percentage of each pollutant in the dataset.'''

The pie chart illustrates the percentage contribution of each pollutant to the sum of mean pollutant concentrations. PM10 (about 35.7%) and PM2.5 (around 19.1%) dominate the pollution profile, together accounting for more than half of the total. Gaseous pollutants such as NOx, O₃, and NO₂ contribute a moderate share, while CO, SO₂, benzene, toluene, and xylene make relatively minor contributions. Overall, the chart highlights that particulate matter is the primary driver of air pollution, suggesting that control strategies should prioritise reducing PM emissions.
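
The arithmetic behind these percentages is each pollutant's mean divided by the sum of all pollutant means. A minimal sketch on invented numbers follows. It treats every column as if it were measured in the same unit, which is an assumption worth checking on the real data.

```python
import pandas as pd

# Made-up concentrations for four pollutants (two observations each)
toy = pd.DataFrame({
    "PM2.5": [60.0, 40.0],
    "PM10":  [120.0, 80.0],
    "NO2":   [30.0, 20.0],
    "O3":    [20.0, 30.0],
})

means = toy.mean()                    # mean concentration per pollutant
shares = means / means.sum() * 100    # percentage share of the summed means

print(shares.round(1).to_dict())
```

Because the shares are normalised by the same total, they always sum to 100, so a pollutant's slice can shrink simply because another pollutant's mean is large.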

In [None]:
key_variables = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']

plt.figure(figsize=(15, 10))
for i, column in enumerate(key_variables, 1): # the subplot starts from index 1 and not from 0
  plt.subplot(4, 3, i)
  sns.histplot(pollution_df[column], kde=True, bins=30)
  plt.title(f'Distribution of {column}')
  plt.xlabel(column)
  plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

The set of histograms shows that most air pollutants have highly right-skewed distributions, meaning that low to moderate concentrations occur frequently, while extreme pollution events are less common but significant. Particulate matter (PM2.5 and PM10) and gaseous pollutants such as NO₂ and NOx exhibit long tails, indicating occasional very high concentration spikes. Pollutants like CO, benzene, toluene, and xylene are concentrated at low levels with sharp peaks near zero, suggesting sporadic but intense emissions. Overall, the AQI distribution mirrors this pattern, highlighting that air quality is generally moderate but periodically deteriorates to hazardous levels, which has important public health implications.
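
The visual impression of right skew can be backed by a number using the sample skewness from `DataFrame.skew()`. The sketch below uses made-up values: one column with a single extreme spike and one symmetric column. A positive value indicates a longer right tail.

```python
import pandas as pd

# Made-up readings: PM2.5 has one extreme spike, O3 is symmetric around 30
toy = pd.DataFrame({
    "PM2.5": [10.0, 12.0, 11.0, 13.0, 12.0, 95.0],
    "O3":    [28.0, 30.0, 32.0, 29.0, 31.0, 30.0],
})

skewness = toy.skew()   # sample skewness per column; > 0 means a right-hand tail
print(skewness.round(2).to_dict())
```

On the real data, `pollution_df[key_variables].skew()` would give one value per pollutant, which makes it easier to compare how heavy the tails are across columns.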

In [None]:
key_variables = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']

plt.figure(figsize=(15, 10))
for i, column in enumerate(key_variables, 1): # the subplot starts from index 1 and not from 0
  plt.subplot(4, 3, i)
  sns.boxplot(x= pollution_df[column])
  plt.title(f'Boxplot of {column}')
  plt.xlabel(column)
plt.tight_layout()
plt.show()

In [None]:
# Defining a function to determine the season based on the month
def get_season(month):
  if month<= 2 or month==12:
    return 'Winter'
  elif 3<= month <= 6:
    return 'Summer'
  elif 7 <= month <= 9:
    return 'Monsoon'
  else:
    return 'Post Monsoon'

# Apply the function to the 'month' column and create a new 'Season' column
pollution_df['Season'] = pollution_df['month'].apply(get_season)

pollution_df.reset_index(drop=True, inplace=True) # renumber rows 0..n-1 without keeping the old index as a new column
#pollution_df.head()


In [None]:
# Group by season and calculate average pollutant concentrations
pollutants = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene']
seasonal_avg = pollution_df.groupby('Season')[pollutants].mean().reset_index()
# Melt the DataFrame for visualization
seasonal_avg_melted = seasonal_avg.melt(id_vars='Season', var_name='Pollutant', value_name='Average Concentration')
seasons= pollution_df['Season'].unique()
plt.figure(figsize= (15, 8))
for i, season in enumerate(seasons, 1):
  plt.subplot(2,2,i)
  val= seasonal_avg_melted[seasonal_avg_melted['Season']== season].sort_values('Average Concentration')
  sns.barplot(val, x='Average Concentration', y= 'Pollutant')
  plt.title(f'Average Pollutant Level in {season}')
plt.tight_layout()
plt.show()





The visualisation compares the average concentration of major air pollutants across four distinct Indian seasons: Winter, Summer, Monsoon, and Post-Monsoon. This seasonal breakdown aligns with the climatological classification commonly used in Indian air quality studies (Guttikunda & Gurjar, 2012).

**Winter**

During winter, concentrations of PM10, PM2.5, NOx, and NO₂ are highest, reflecting poor atmospheric dispersion caused by low temperatures and weak winds which trap pollutants near the surface (Tiwari et al., 2019). These conditions are known to increase particulate pollution in Indian cities.

**Summer**

In summer, particulate matter levels decline due to stronger winds and increased boundary-layer height; however, ozone (O₃) concentrations rise as higher temperatures and solar radiation enhance photochemical reactions involving NOx and volatile organic compounds (Sharma et al., 2017). This seasonal shift highlights the role of meteorology in pollutant formation rather than emissions alone.

**Monsoon**

The monsoon season shows the lowest pollution levels, particularly for PM10 and PM2.5, due to effective wet deposition and atmospheric cleansing by rainfall (Kumar et al., 2020).

**Post Monsoon**

The post-monsoon period records a resurgence in particulate matter and nitrogen oxides, largely attributed to agricultural residue burning and reduced rainfall (Cusworth et al., 2018).
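
The month-to-season assignment used for this breakdown can also be written as a lookup table and applied with `Series.map`. This is a sketch of an equivalent alternative to the `get_season` function; the dictionary name `SEASON_BY_MONTH` is hypothetical.

```python
import pandas as pd

# Same month ranges as get_season: Dec-Feb, Mar-Jun, Jul-Sep, Oct-Nov
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Summer", 4: "Summer", 5: "Summer", 6: "Summer",
    7: "Monsoon", 8: "Monsoon", 9: "Monsoon",
    10: "Post Monsoon", 11: "Post Monsoon",
}

months = pd.Series(range(1, 13))
seasons = months.map(SEASON_BY_MONTH)          # one season label per month
counts = seasons.value_counts().to_dict()      # months per season
print(counts)
```

The seasons cover different numbers of months (four for Summer, two for Post Monsoon), so seasonal averages are based on unequal amounts of data.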

In [None]:
# relationship between AQI and each pollutant, pooled across all cities
pollutants = ['PM2.5','PM10','NO2','NOx','NH3','CO', 'SO2','O3', 'Benzene', 'Toluene', 'Xylene']
plt.figure(figsize= (15, 10))
for i, pollutant in enumerate(pollutants, 1):
  plt.subplot(4,3,i)
  sns.scatterplot(data=pollution_df, x= pollutant, y='AQI')
  plt.title(f'{pollutant} vs AQI')
plt.tight_layout()
plt.show()



