# Project - Analyzing the Air Quality trends in India (2015 - 2020)

This project analyzes air quality trends in India (2015-2020) using a Kaggle dataset, as part of the "Data Analysis with Python: Zero to Pandas" course by Jovian. The dataset includes city-level AQI data, pollutant levels (PM2.5, PM10, NO2, CO, etc.), and station data. **My goal is to study seasonal trends over time**, **answering 10 data-driven questions with Pandas, Matplotlib, and Seaborn**. I'm using Google Colab for analysis, applying skills from the course to clean, visualize, and extract insights from real-world data.

## Downloading the Dataset

For this project, I used open-source data from Kaggle, specifically the "Air Quality Data in India" dataset. I downloaded it using the opendatasets Python library and prepared it for analysis. The dataset includes key air quality indicators like AQI values, PM2.5, PM10, NO2, and CO, which I’m exploring to uncover meaningful patterns and insights.

**include the details of rows and columns in the datasets**

In [None]:
!pip install jovian opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [None]:
dataset_url = 'https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india?select=city_day.csv'

In [None]:
import opendatasets as od
od.download(dataset_url)

The dataset has been downloaded and extracted.

In [None]:
data_dir = './air-quality-data-in-india'

In [None]:
import os
os.listdir(data_dir)
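
The row and column counts of each file can be collected in one pass. This is a minimal sketch, assuming the CSV files sit directly inside one directory. `summarize_csvs` is a hypothetical helper (not part of `opendatasets`), and the demo runs on a throwaway directory with made-up data rather than on the real download.

```python
import os
import tempfile

import pandas as pd


def summarize_csvs(directory):
    """Return one row per CSV file with its row and column counts."""
    records = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(directory, name))
        records.append({"file": name, "rows": df.shape[0], "columns": df.shape[1]})
    return pd.DataFrame(records, columns=["file", "rows", "columns"])


# Demo on a temporary directory holding one tiny, made-up CSV
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"City": ["A", "B"], "AQI": [100, 150]}).to_csv(
        os.path.join(demo_dir, "demo.csv"), index=False
    )
    summary = summarize_csvs(demo_dir)

print(summary)
```

Calling `summarize_csvs(data_dir)` on the downloaded folder would give the same kind of table for the real files.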

Let us save and upload our work to Jovian before continuing.

In [None]:
project_name = "analyzing-air-quality-in-india"

## Data Preparation and Cleaning


In this segment, I loaded the "Air Quality Data in India" dataset into a Pandas DataFrame for analysis. First, I explored the dataset by checking the number of rows and columns, inspecting data types, and analyzing value ranges. I then handled missing, incorrect, and invalid data through techniques like imputation and removal of inconsistencies. Additionally, I performed data preprocessing steps such as parsing dates and merging multiple datasets where necessary to enhance the analysis.




In [None]:
import pandas as pd

In [None]:
city_day_df = pd.read_csv('./air-quality-data-in-india/city_day.csv')

In [None]:
city_day_df

In [None]:
city_day_df.shape

In [None]:
city_day_df.info()

In [None]:
# Range of values for categorical features
city_day_df['AQI_Bucket'].unique()

In [None]:
# Range of values for categorical features
city_day_df['City'].unique()

In [None]:
# Ranges of values for numerical features
numerical_features = city_day_df.select_dtypes(include=['number'])
for col in numerical_features.columns:
  min_val = city_day_df[col].min()
  max_val = city_day_df[col].max()
  print(f"Range of values for '{col}': [{min_val}, {max_val}]")

In [None]:
# Display missing values before imputation
city_day_df.isnull().sum()

In [None]:
import pandas as pd
import numpy as np


# List of numerical and categorical columns
numerical_cols = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']
categorical_cols = ['City', 'AQI_Bucket']

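# Median imputation for numerical columns.
# The result is assigned back to the column: calling fillna(..., inplace=True)
# on a column selection is a chained update that newer pandas versions warn about.
for col in numerical_cols:
    city_day_df[col] = city_day_df[col].fillna(city_day_df[col].median())

# Mode imputation for categorical columns
for col in categorical_cols:
    city_day_df[col] = city_day_df[col].fillna(city_day_df[col].mode()[0])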

# Display missing values after imputation
print("Missing Values After Imputation:\n", city_day_df.isnull().sum())

# Save the cleaned dataset
city_day_df.to_csv("cleaned_city_day.csv", index=False)

print("Data Cleaning Completed Successfully!")


In [None]:
city_day_df.isnull().sum()
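
A single global median gives every gap the same value, whatever the city. The sketch below contrasts that with a per-city median on made-up data. It is an illustrative alternative, not the method used in this notebook, and the frame `toy` and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up readings for two cities; NaN marks a missing value
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "B"],
    "PM2.5": [10.0, np.nan, 30.0, 100.0, 120.0, np.nan],
})

# Global median: every gap gets the same value regardless of city
global_filled = toy["PM2.5"].fillna(toy["PM2.5"].median())

# Per-city median: each gap gets the median of its own city
city_median = toy.groupby("City")["PM2.5"].transform("median")
city_filled = toy["PM2.5"].fillna(city_median)

print(pd.DataFrame({"global": global_filled, "per_city": city_filled}))
```

With the global median, the gap in the cleaner city is filled with a value well above its other readings; the per-city version keeps each fill close to local levels.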

In [None]:
city_day_df['Date'] = pd.to_datetime(city_day_df['Date'])

In [None]:
city_day_df['Date']

In [None]:
station_day_df = pd.read_csv('./air-quality-data-in-india/station_day.csv')

In [None]:
station_day_df

In [None]:
stations_df = pd.read_csv('./air-quality-data-in-india/stations.csv')

In [None]:
stations_df

In [None]:
merged_data = pd.merge(stations_df, city_day_df, on="City", how="left")

In [None]:
merged_data
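
A left merge from the stations table onto daily city records repeats every city-day row once per station in that city, which changes row counts and the weights behind summary statistics. A small sketch on made-up tables (the names `stations`, `daily`, and `merged` are illustrative):

```python
import pandas as pd

# Made-up tables: city "A" has two stations, city "B" has one
stations = pd.DataFrame({"StationId": ["S1", "S2", "S3"], "City": ["A", "A", "B"]})
daily = pd.DataFrame({
    "City": ["A", "A", "B"],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-01"],
    "AQI": [100, 110, 50],
})

merged = pd.merge(stations, daily, on="City", how="left")

# Each daily row of city "A" now appears once per station
print(len(daily), "daily rows ->", len(merged), "merged rows")
print(merged.groupby("City").size())
```

Comparing `len()` before and after a merge is a quick way to spot this kind of row multiplication.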

## Computing the mean, sum and other interesting statistics for numeric columns.

In [None]:
merged_data.describe()

##  Identifying and Handling Outliers


In [None]:
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]

# Detect outliers in AQI
outliers_iqr = detect_outliers_iqr(merged_data, 'AQI')
print("Outliers detected using IQR:\n", outliers_iqr)
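
To see how the IQR rule behaves, here is the same calculation on a handful of made-up values with one obvious extreme. The numbers are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Made-up AQI-like values with one extreme reading
values = pd.Series([10, 12, 14, 16, 18, 20, 200])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [lower_bound, upper_bound] is flagged
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=[{lower_bound}, {upper_bound}]")
print(outliers.tolist())
```

Only the extreme reading falls outside the bounds; the rest of the values are untouched.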


In [None]:
import numpy as np

Q1 = merged_data['AQI'].quantile(0.25)
Q3 = merged_data['AQI'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

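# Option 1 (used here): keep only rows whose AQI lies within the IQR bounds.
# The mask is built from the same frame that is filtered, so the indexes align.
# From here on, merged_data holds the filtered city-level rows (one per city per day).
aqi_in_range = city_day_df['AQI'].between(lower_bound, upper_bound)
merged_data = city_day_df[aqi_in_range].copy()

# Option 2: Winsorization (capping AQI at the bounds).
# An alternative to Option 1; after Option 1 it changes nothing.
merged_data.loc[:, 'AQI'] = np.clip(merged_data['AQI'], lower_bound, upper_bound)

# Option 3: Replace values above the upper bound with the median.
# Also an alternative; after Option 1 no rows match the condition.
median_aqi = city_day_df['AQI'].median()
merged_data.loc[merged_data['AQI'] > upper_bound, 'AQI'] = median_aqi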

In [None]:
merged_data.describe()

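The max AQI value is now 291 (previously it was 2049).

The number of records is now 25,648 (previously it was 186,670). Most of that drop comes from the filter selecting rows of `city_day_df` (one row per city per day) instead of the station-merged table, so it should not be read as the number of outliers removed.

The mean AQI has dropped to 128.38 and is no longer pulled up by a few extreme readings.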

## Exploratory Analysis and Visualization

I computed key statistics like mean, sum, and range for numeric columns to understand the dataset’s overall trends. To visualize distributions, I used histograms for pollutants and AQI values, revealing variations across different cities and time periods. Relationships between variables were explored using scatter plots and bar charts, highlighting correlations between pollutants and AQI. From this exploratory analysis, I identified patterns such as seasonal pollution spikes and variations across cities, providing valuable insights for further investigation.



Let's begin by importing matplotlib.pyplot and seaborn.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

# Exploring distributions of numeric columns using histograms

In [None]:
# Explore distributions of numeric columns using histograms

# Define numerical columns
numerical_cols = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']

# Set up subplots
num_cols = len(numerical_cols)
rows = (num_cols // 3) + (num_cols % 3 > 0)  # Arrange in rows of 3
fig, axes = plt.subplots(rows, 3, figsize=(15, rows * 4))  # Adjust size

# Flatten axes array for easy iteration
axes = axes.flatten()

# Plot histograms for each numeric column
for i, col in enumerate(numerical_cols):
    sns.histplot(city_day_df[col], kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()



Right-Skewed Distributions (Most Pollutants & AQI)

Almost all pollutants, including PM2.5, PM10, NO, NO2, NOx, CO, SO2, Benzene, Toluene, and Xylene, exhibit a strong right-skewed (positively skewed) distribution.

This suggests that most of the data points have low concentrations, but there are extreme pollution spikes occasionally.

PM2.5 and PM10 Have Some Extreme Outliers

Both PM2.5 and PM10 show very high peaks at lower values, with a long tail extending to the right.

This means that while particulate concentrations stay low to moderate on most days, there are occasional severe pollution events, possibly during winter or festival seasons.

NO2 and NOx Show Bimodal Trends

The NO2 and NOx distributions appear bimodal (two peaks), indicating two different levels of pollution events—one during normal conditions and another during high-pollution periods.

Ozone (O3) Distribution is Relatively Wide

Unlike other pollutants, Ozone (O3) has a broader spread in its distribution.

This suggests more variability in ozone levels, possibly due to meteorological effects like sunlight, temperature, and photochemical reactions.

CO, SO2, and Benzene Are Typically Low but Can Occasionally Spike

The CO, SO2, and Benzene distributions indicate low median levels, but occasional high concentrations are seen, likely due to industrial emissions or traffic congestion.

AQI Distribution Shows Peaks at Specific Ranges

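AQI has a sharp peak around 100, meaning that moderate air quality is the most frequent. Because missing AQI values were filled with the median earlier, part of this peak may come from imputed rows.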

However, the long right tail suggests that pollution events push AQI into unhealthy ranges at times.
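
Skewness can also be put into numbers rather than read off the histograms. The sketch below uses synthetic samples (not the dataset) to show how `Series.skew()` and the gap between mean and median behave for a right-skewed and a symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: a log-normal sample (long right tail) and a normal sample
right_skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=5000))
symmetric = pd.Series(rng.normal(loc=50.0, scale=10.0, size=5000))

summary = pd.DataFrame(
    {
        "mean": [right_skewed.mean(), symmetric.mean()],
        "median": [right_skewed.median(), symmetric.median()],
        "skew": [right_skewed.skew(), symmetric.skew()],
    },
    index=["right_skewed", "symmetric"],
)
print(summary.round(2))
```

A clearly positive skew with the mean above the median is the numeric signature of the long right tails described above; applying `.skew()` to the pollutant columns would give one such number per pollutant.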

# Visualizing the Impact of PM2.5 on AQI Using a Scatter Plot and Trend Line to Identify Linear Trends and Outliers

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PM2.5', y='AQI', data=merged_data, alpha=0.5)
sns.regplot(x='PM2.5', y='AQI', data=merged_data, scatter=False, color='red')
plt.title('Relationship between PM2.5 and AQI')
plt.xlabel('PM2.5 Concentration')
plt.ylabel('Air Quality Index (AQI)')
plt.show()

The ***scatter plot*** visualizes the relationship between PM2.5 concentration and Air Quality Index (AQI). Each point represents a data entry from the dataset, showing how AQI changes with varying levels of PM2.5.

The blue dots in the scatter plot represent individual observations, illustrating the spread of data points.

The ***red trend line (regression line)*** is fitted using sns.regplot(), which helps in identifying the linear relationship between PM2.5 and AQI.

The red trend line confirms that **as PM2.5 concentration increases, AQI also rises.**

A majority of data points are clustered at low PM2.5 values (0-150) and AQI (0-300).
This indicates that most days have moderate pollution levels.
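
The strength and slope of a linear trend like this can be quantified alongside the plot. A minimal sketch on synthetic data standing in for the PM2.5 and AQI columns; the generated relationship is an assumption chosen for illustration, not a property of the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the PM2.5 and AQI columns
pm25 = rng.uniform(0, 150, size=500)
aqi = 1.5 * pm25 + 40 + rng.normal(0, 20, size=500)
toy = pd.DataFrame({"PM2.5": pm25, "AQI": aqi})

# Pearson correlation coefficient between the two columns
r = toy["PM2.5"].corr(toy["AQI"])

# Least-squares line: AQI = slope * PM2.5 + intercept
slope, intercept = np.polyfit(toy["PM2.5"].to_numpy(), toy["AQI"].to_numpy(), deg=1)

print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.1f}")
```

The slope reads as the average change in AQI per unit of PM2.5, and `r` says how tightly the points follow the line.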

# Density Distribution of PM10 vs AQI using a KDE (Kernel Density Estimation) Plot to analyze their Correlation and assess the influence of other pollutants on AQI Calculation.

In [None]:
plt.figure(figsize=(10, 6))
sns.kdeplot(x=merged_data['PM10'], y=merged_data['AQI'], fill=True, cmap="magma", levels=50)
plt.title('Density Plot of PM10 vs AQI')
plt.xlabel('PM10 Concentration')
plt.ylabel('Air Quality Index (AQI)')
plt.show()

The plot shows that as **PM10 concentration increases, AQI also rises**, indicating that **PM10 is a major contributor to air pollution**. However, the relationship is not perfectly linear, suggesting that **other pollutants also influence AQI calculations.**

*Most points are concentrated where PM10 is below 100 and AQI is below 250,* indicating that in many cities, PM10 pollution remains at moderate levels. This also suggests that on most days, air quality is within manageable limits.

In a **few extreme cases, PM10 approaches 400 and AQI approaches 2000**. AQI values that high occur only before the outlier-removal step, after which the maximum AQI is 291. These could correspond to severe pollution events like **smog, dust storms, or festival-related pollution spikes.**

In [None]:
pollutants = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']

# Compute the correlation matrix
correlation_matrix = merged_data[pollutants].corr()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="Reds", fmt=".2f", linewidths=0.5)

# Title for the heatmap
plt.title("Correlation Heatmap of Air Pollutants and AQI")

# Show the plot
plt.show()


**PM2.5 has one of the highest correlations with AQI (0.63)** - Fine particulate matter tracks AQI closely.

CO has a similarly strong correlation with AQI (0.65), slightly above PM2.5 - Indicates significant impact from *vehicular & industrial emissions*.

NO2 (0.52) and PM10 (0.48) are moderately correlated with AQI - *Pollution from traffic and dust particles*.

SO2 (0.45) and NOx (0.44) also contribute - Mostly from *industrial and fuel combustion sources.*

**NH3, O3, and Benzene have weak correlations** - In this dataset, these pollutants move largely independently of AQI.

***PM2.5 and CO are the biggest air quality degraders.***

Traffic and industrial emissions play a major role in pollution levels.

Reducing fine particulate matter and carbon monoxide emissions can improve air quality the most.
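
Reading values off a large heatmap is error-prone, so it can help to list the correlations with AQI in sorted order. A sketch on synthetic columns; `strong`, `weak`, and `noise` are made-up names, not pollutants from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000

# Synthetic columns: "strong" drives AQI, "weak" contributes a little, "noise" not at all
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": rng.normal(size=n),
    "AQI": 3.0 * strong + 1.0 * weak + rng.normal(size=n),
})

# Correlation of every column with AQI, strongest first
aqi_corr = toy.corr()["AQI"].drop("AQI").sort_values(ascending=False)
print(aqi_corr.round(2))
```

The same pattern applied to `correlation_matrix["AQI"]` would rank the pollutants directly. Correlation measures how closely two columns move together, not how much one causes the other.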



# Visualizing how air quality is distributed across AQI buckets using a count plot


In [None]:
# using seaborn for a countplot
plt.figure(figsize=(12, 6))
sns.countplot(x='AQI_Bucket', data=merged_data)
plt.title('Distribution of AQI Buckets')
plt.xlabel('AQI Bucket')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Each bar height shows how many times a particular AQI_Bucket appears in `merged_data`.

The bar for **"Moderate" AQI is taller** than that of any other AQI bucket, which means ***"Moderate" was the most frequent daily AQI category.***

This helps visualize how air quality was distributed over time.
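
The counts behind the bars can be turned into shares and listed in severity order. A minimal sketch on made-up labels; the bucket names in `bucket_order` are assumed to match the values of the `AQI_Bucket` column.

```python
import pandas as pd

# Bucket labels in increasing severity; assumed to match the AQI_Bucket column
bucket_order = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

# Made-up sample of daily buckets
toy = pd.DataFrame({
    "AQI_Bucket": ["Moderate", "Good", "Moderate", "Poor", "Satisfactory",
                   "Moderate", "Severe", "Satisfactory", "Moderate", "Poor"]
})

# Share of days per bucket, reported in severity order rather than by frequency
shares = (
    toy["AQI_Bucket"]
    .value_counts(normalize=True)
    .reindex(bucket_order, fill_value=0.0)
)
print((shares * 100).round(1))
```

Passing the same list as `order=bucket_order` to `sns.countplot` would arrange the bars from best to worst air quality.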

# Relationship between various pollutants and the AQI through a violin plot

In [None]:
plt.figure(figsize=(12, 6))
sns.violinplot(data=merged_data[numerical_cols])
plt.xticks(rotation=90)
plt.title("Violin Plot of Air Pollutants and AQI")
plt.show()


Stable and narrow distributions for CO, SO2, and volatile organic compounds (Benzene, Toluene, Xylene) imply consistent, low levels.

Similar patterns in NO, NO2, and NOx suggest common sources, possibly vehicle emissions.

The right skewness in AQI and particulate matter indicates occasional severe pollution incidents.

# City-wise and state-wise AQI trends

In [None]:
# Grouping data by city and calculating average AQI
city_aqi = merged_data.groupby('City')['AQI'].mean().sort_values(ascending=False)


# Plotting
plt.figure(figsize=(14, 6))
sns.barplot(x=city_aqi.index, y=city_aqi.values)
plt.xticks(rotation=90)
plt.xlabel("City")
plt.ylabel("Average AQI")
plt.title("Most Polluted & Least Polluted Cities Based on Average AQI")
plt.show()


***Highly polluted cities:*** *Delhi, Ahmedabad, Patna, Gurugram* are major urban hubs with high industrial activity, vehicular emissions, and population density.

***Moderately polluted cities:*** *Mumbai, Kolkata, Hyderabad, Kochi, and Bengaluru* have average AQI values between 100 and 180.

***Least polluted cities:*** *Aizawl, Shillong, Thiruvananthapuram* are located in regions with rich green cover and less industrialisation.

Cities like ***Kochi and Bengaluru***, despite being metropolitan areas, show relatively better air quality, possibly due to better environmental policies and, in Kochi's case, coastal influence.

***State wise trends***





***Calculating the city which experienced the highest pollution spike in a single month***

In [None]:
city_month_aqi = (
    merged_data.groupby(['City', merged_data['Date'].dt.to_period('M')])['AQI']
    .max()
    .reset_index()
)

# Convert period back to datetime format for better visualization
city_month_aqi['Date'] = city_month_aqi['Date'].dt.to_timestamp()

# Find the city and month with the highest AQI spike
highest_spike_city = city_month_aqi.loc[city_month_aqi['AQI'].idxmax()]

# Display the result
print(highest_spike_city)

**Highly Polluted City**

***Ahmedabad*** stands out with extremely high AQI values in 2017-2018, which suggests a sharp deterioration in air quality during these years.
It is also the city that experienced the highest pollution spike in a single month.



# Asking and Answering Questions


#### **How does AQI vary between weekdays and weekends?**

In [None]:
# Create a new column for Weekday/Weekend
merged_data["Weekday/Weekend"] = merged_data["Date"].dt.dayofweek.apply(lambda x: "Weekend" if x >= 5 else "Weekday")

In [None]:
# Plot a box plot
plt.figure(figsize=(8,5))
# The "Weekday/Weekend" column was added to merged_data, so plot from that frame
sns.boxplot(data=merged_data, x="Weekday/Weekend", y="AQI", palette=["blue", "red"])
plt.xlabel("Day Type")
plt.ylabel("AQI")
plt.title("Comparison of AQI on Weekdays vs. Weekends")
plt.show()
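
A box plot shows the spread, and a short summary table puts numbers next to it. The sketch below applies the same weekday/weekend split to two weeks of made-up readings; the AQI values are illustrative only.

```python
import pandas as pd

# Made-up daily readings covering two full weeks (2020-01-06 is a Monday)
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-06", periods=14, freq="D"),
    "AQI": [120, 125, 130, 128, 122, 100, 95, 118, 126, 131, 127, 124, 102, 98],
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
is_weekend = toy["Date"].dt.dayofweek >= 5
toy["Weekday/Weekend"] = is_weekend.map({True: "Weekend", False: "Weekday"})

summary = toy.groupby("Weekday/Weekend")["AQI"].agg(["count", "mean", "median"])
print(summary)
```

The `count` column is worth keeping in view: weekdays outnumber weekend days 5 to 2, so the two groups are never the same size.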

#### **How does AQI vary across different seasons?**
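
One possible way to approach this question is to map each month to a season and average AQI per season. This is a sketch under stated assumptions: the month-to-season mapping in `SEASON_BY_MONTH` is one of several conventions, and the readings are made up, so the output says nothing about the real data.

```python
import pandas as pd

# Assumed month-to-season mapping; other definitions exist
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Summer", 4: "Summer", 5: "Summer",
    6: "Monsoon", 7: "Monsoon", 8: "Monsoon", 9: "Monsoon",
    10: "Post-monsoon", 11: "Post-monsoon",
}

# Made-up readings, one per month of 2019
toy = pd.DataFrame({
    "Date": pd.date_range("2019-01-01", periods=12, freq="MS"),
    "AQI": [300, 250, 180, 150, 140, 100, 80, 70, 90, 200, 320, 310],
})

# Label each row with its season, then average AQI per season
toy["Season"] = toy["Date"].dt.month.map(SEASON_BY_MONTH)
season_aqi = toy.groupby("Season")["AQI"].mean()
print(season_aqi.round(1))
```

The same two lines (map, then group) would work on `merged_data`, since its `Date` column is already parsed as datetime.
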

#### **What is the trend of AQI before, during, and after major festivals (e.g., Diwali, New Year)?**

#### **Has air quality improved or worsened over the years?**

In [None]:
merged_data["Year-Month"] = merged_data["Date"].dt.to_period("M")

# Compute monthly average AQI
monthly_aqi = merged_data.groupby("Year-Month")["AQI"].mean()

In [None]:
# Plot AQI trends over time
plt.figure(figsize=(10,4))
monthly_aqi.plot(marker='o', color='red', linestyle='-')
plt.xlabel("Time")
plt.ylabel("Average AQI")
plt.title("AQI Trends Over Time")
plt.xticks(rotation=45)
plt.show()
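
Monthly averages mix the long-term trend with the yearly cycle. A 12-month rolling mean or yearly means can separate the two. The sketch below uses a synthetic series (a slow decline plus a seasonal cycle) purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly AQI: a yearly cycle on top of a slow decline
months = pd.period_range("2015-01", "2019-12", freq="M")
t = np.arange(len(months))
monthly = pd.Series(150 - 0.5 * t + 40 * np.cos(2 * np.pi * t / 12), index=months)

# A centered 12-month rolling mean averages out the seasonal cycle
trend = monthly.rolling(window=12, center=True).mean()

# Yearly means give a coarser view of the same trend
yearly = monthly.groupby(monthly.index.year).mean()
print(yearly.round(1))
```

Applied to `monthly_aqi`, the rolling mean would show whether the series drifts up or down once seasonal swings are smoothed out. Incomplete years at either end of the data should be read with care.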

In [None]:
# Pivot the data to get years on x-axis and cities on y-axis
merged_data['Year'] = merged_data['Date'].dt.year

heatmap_data = merged_data.pivot_table(index='City', columns='Year', values='AQI', aggfunc='mean')

plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, cmap='viridis', annot=False, linewidths=0.5)

plt.title('Yearly AQI Heatmap for All Cities', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.show()


Highly Polluted City

Delhi and Ahmedabad stand out with extremely high AQI values, which suggests a sharp deterioration in air quality during their worst years.

Cities with Low AQI

Cities like Aizawl, Shillong, and Bengaluru have consistently lower AQI. These cities seem to have relatively clean air compared to others.

AQI Peak Year

2017-2018 appears to be the worst period for air quality in many cities. After that, AQI values seem to decrease in 2020, possibly due to the COVID-19 lockdown effect.

Cities with Fluctuating AQI

Cities like Delhi, Kolkata, and Lucknow show mixed patterns, where AQI fluctuates over time rather than following a clear trend. This could be due to seasonal variations, meteorological conditions, or pollution control measures.

Overall Trend suggests that some cities show improvement, while others remain stagnant or worsen.

#### **Which Pollutant Contributes the Most to the Most and Least Polluted Cities?**

In [None]:
# Compute average levels of key pollutants per city
city_pollution = city_day_df.groupby("City")[["PM2.5", "PM10", "NO2", "CO", "SO2"]].mean()

# Display average pollutant levels for the first 10 cities
# (groupby sorts cities alphabetically; this is not a pollution ranking)
print(city_pollution.head(10))

In [None]:
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
city_pollution.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title("Pollution Levels Across Cities")
plt.xlabel("City")
plt.ylabel("Average concentration (stacked)")
plt.show()

In the most polluted city (Ahmedabad), PM10 is the highest pollutant, followed by PM2.5 and NO2, indicating that particulate matter is the dominant contributor to poor air quality.

In the least polluted city (Aizawl), PM10 and PM2.5 are the major pollutants, but their values are significantly lower compared to other cities, making Aizawl's air much cleaner.

CO and SO2 levels are relatively low in both cities, suggesting that particulate matter (PM2.5 and PM10) plays the most significant role in determining air pollution levels.
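
The dominant pollutant per city can be read from the averages directly instead of from the stacked bars. A sketch on made-up city averages; `CityX` and `CityY` are hypothetical, and the comparison only makes sense for pollutants measured in the same units.

```python
import pandas as pd

# Made-up city averages for illustration
city_means = pd.DataFrame(
    {"PM2.5": [80.0, 15.0], "PM10": [140.0, 25.0], "NO2": [40.0, 30.0], "CO": [2.0, 0.3]},
    index=pd.Index(["CityX", "CityY"], name="City"),
)

# Pollutant with the largest average in each city
dominant = city_means.idxmax(axis=1)

# Each pollutant's share of the city's row total
shares = city_means.div(city_means.sum(axis=1), axis=0)

print(dominant)
print(shares.round(2))
```

Running `city_pollution.idxmax(axis=1)` would give the same kind of answer for every city in the dataset.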

## Inferences and Conclusion

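Most pollutant distributions are strongly right-skewed: concentrations are low on most days, with occasional severe spikes.

PM2.5 (0.63) and CO (0.65) show the strongest correlations with AQI, while NH3, O3, and Benzene show weak ones.

"Moderate" is the most frequent AQI bucket.

Delhi, Ahmedabad, Patna, and Gurugram have the highest average AQI, while Aizawl, Shillong, and Thiruvananthapuram have the lowest.

2017-2018 appears to be the worst period for air quality in many cities, and AQI values decrease in 2020, possibly due to the COVID-19 lockdown.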

## References and Future Work

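Questions left open in this notebook are natural next steps: how AQI varies across seasons, how it changes before, during, and after major festivals (e.g., Diwali, New Year), and state-wise AQI trends.

Resources used:

- Dataset: "Air Quality Data in India" on Kaggle, https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india
- Course: "Data Analysis with Python: Zero to Pandas" by Jovian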



