# Breathe India: A Data-Driven Exploration of Air Quality Across Indian Cities

*By [Your Name], Aspiring Data Scientist*

---

## Introduction

Air pollution is a silent crisis affecting millions across India. As an aspiring data scientist and a concerned citizen, I wanted to dig deep into real air quality data to uncover patterns, highlight pollution hotspots, and tell a compelling story with data. This notebook is my journey through the numbers, visualizations, and insights that matter.

---

## Project Motivation

- Understand the state of air quality across Indian cities
- Practice and showcase data science skills: cleaning, EDA, visualization, and storytelling
- Share actionable insights and raise awareness


In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
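import warnings
from IPython.display import display  # explicit import so the cells also run outside Jupyter
warnings.filterwarnings('ignore')
sns.set_theme(style='whitegrid')  # consistent seaborn styling for all Matplotlib plots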

In [None]:
# Load the dataset
file_path = 'india_air_quality_2025.csv'
df = pd.read_csv(file_path)
df.head()

## Data Overview & First Impressions

Before diving in, let's take a look at the data. What stories might these columns tell? What surprises await in the numbers?

In [None]:
# DataFrame info and missing values
print('--- DataFrame Info ---')
df.info()
print('\n--- Describe ---')
display(df.describe(include='all'))
print('\n--- Missing Values by Column ---')
display(df.isnull().sum())

## Data Cleaning

Real-world data is messy! Let's clean up types and handle missing values so our analysis is trustworthy. I want to make sure every number tells the truth.

In [None]:
# Convert columns to the correct types
for col in ['pollutant_avg', 'pollutant_min', 'pollutant_max']:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # non-numeric entries become NaN
df['last_update'] = pd.to_datetime(df['last_update'], errors='coerce')
# Fill missing averages with the mean of the same pollutant,
# so a PM2.5 gap is never filled with a value on another pollutant's scale
df['pollutant_avg'] = df['pollutant_avg'].fillna(
    df.groupby('pollutant_id')['pollutant_avg'].transform('mean')
)
# Optional: fill other missing values or drop rows if needed
df.head()
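
When a table mixes several pollutants, a missing average is usually better filled from readings of the same pollutant than from one overall mean. Here is a minimal sketch on made-up numbers, assuming only the `pollutant_id` and `pollutant_avg` columns used in this notebook. The `toy` frame and its values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, not taken from the real dataset)
toy = pd.DataFrame({
    'pollutant_id': ['PM2.5', 'PM2.5', 'PM2.5', 'CO', 'CO', 'CO'],
    'pollutant_avg': [80.0, 120.0, np.nan, 2.0, 4.0, np.nan],
})

# Mean of each pollutant, broadcast back to the row level
group_mean = toy.groupby('pollutant_id')['pollutant_avg'].transform('mean')

# Fill each gap with the mean of the same pollutant only
toy['filled_by_pollutant'] = toy['pollutant_avg'].fillna(group_mean)

# For comparison: a single global mean blends PM2.5 and CO together
toy['filled_globally'] = toy['pollutant_avg'].fillna(toy['pollutant_avg'].mean())

print(toy)
```

In the toy table, the global mean gives the missing CO reading a value far above any observed CO reading, while the per-pollutant fill stays on the right scale.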

## Exploratory Data Analysis (EDA)

Let's explore the data visually and statistically. What are the most polluted cities? How do pollutants vary across India? What patterns and surprises can we find?

In [None]:
# PM2.5 Distribution (Static and Interactive)
pm25 = df[df['pollutant_id'].str.upper() == 'PM2.5']
plt.figure(figsize=(10,6))
plt.hist(pm25['pollutant_avg'], bins=30, color='#e74c3c', edgecolor='black', alpha=0.7)
plt.title('Distribution of PM2.5 Across All Cities', fontsize=16)
plt.xlabel('PM2.5 (µg/m³)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.axvline(pm25['pollutant_avg'].mean(), color='blue', linestyle='dashed', linewidth=2, label=f"Mean: {pm25['pollutant_avg'].mean():.1f}")
plt.legend()
plt.tight_layout()
plt.show()

# Interactive Plotly Histogram
fig = px.histogram(pm25, x='pollutant_avg', nbins=30, color_discrete_sequence=['#e67e22'])
fig.update_layout(title='Interactive Distribution of PM2.5', xaxis_title='PM2.5 (µg/m³)', yaxis_title='Count')
fig.show()

In [None]:
## Top 10 Most Polluted Cities (by PM2.5)

# Calculate city-wise average PM2.5
top_cities = pm25.groupby('city')['pollutant_avg'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(12,7))
top_cities.plot(kind='bar', color='#8e44ad', edgecolor='black')
plt.title('Top 10 Most Polluted Cities by Average PM2.5', fontsize=16)
plt.xlabel('City', fontsize=12)
plt.ylabel('Average PM2.5 (µg/m³)', fontsize=12)
plt.xticks(rotation=45, ha='right')
for i, v in enumerate(top_cities):
    plt.text(i, v + 1, f"{v:.1f}", ha='center', color='black', fontsize=10)
plt.tight_layout()
plt.show()

# Narrative
print("These cities stand out as PM2.5 hotspots. What factors might contribute to their high pollution levels? Industrialization, traffic, geography? This invites deeper investigation.")
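
A city ranked from a single monitoring station may not represent the whole city. One way to check is to carry the station count alongside the mean. This is a sketch on made-up rows, assuming the `station` column identifies a monitor. The `min_stations` cutoff is an arbitrary illustrative choice, not something the dataset prescribes.

```python
import pandas as pd

# Toy PM2.5 rows (made-up values)
toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'C', 'C'],
    'station': ['a1', 'a2', 'a3', 'b1', 'c1', 'c2'],
    'pollutant_avg': [90.0, 110.0, 100.0, 200.0, 50.0, 70.0],
})

# Mean PM2.5 and number of distinct stations per city, highest mean first
ranking = (
    toy.groupby('city')
       .agg(mean_pm25=('pollutant_avg', 'mean'),
            n_stations=('station', 'nunique'))
       .sort_values('mean_pm25', ascending=False)
)

# Keep only cities backed by at least `min_stations` monitors
min_stations = 2
robust = ranking[ranking['n_stations'] >= min_stations]

print(ranking)
print(robust)
```

Here city B tops the raw ranking on the strength of one station and drops out once the cutoff is applied.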

In [None]:
## State-wise PM2.5 Analysis

# Calculate state-wise average PM2.5
state_pm25 = pm25.groupby('state')['pollutant_avg'].mean().sort_values(ascending=False)
plt.figure(figsize=(14,7))
state_pm25.plot(kind='bar', color='#16a085', edgecolor='black')
plt.title('Average PM2.5 by State', fontsize=16)
plt.xlabel('State', fontsize=12)
plt.ylabel('Average PM2.5 (µg/m³)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Narrative
print("Some states consistently show higher PM2.5 levels. Are these states more urbanized, or do they have unique environmental challenges?")

In [None]:
## PM2.5 Trends Over Time (Interactive)

# Group by update timestamp and plot the average PM2.5 at each one
pm25_time = pm25.groupby('last_update')['pollutant_avg'].mean().reset_index()
fig = px.line(pm25_time, x='last_update', y='pollutant_avg', title='Average PM2.5 Over Time (All Cities)', markers=True, line_shape='spline', color_discrete_sequence=['#e74c3c'])
fig.update_layout(xaxis_title='Date', yaxis_title='Average PM2.5 (µg/m³)')
fig.show()

# Narrative
print("Are there seasonal patterns or sudden spikes in PM2.5? This time series invites us to look for events or policies that may have influenced air quality.")
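
If `last_update` holds several timestamps per day, averaging per calendar day can make longer-term patterns easier to see. This sketch uses made-up readings and assumes the column has already been parsed to datetimes. Whether the real file spans enough days for this to be useful is not something the notebook confirms.

```python
import pandas as pd

# Toy readings (made-up values): two timestamps on each of two days
toy = pd.DataFrame({
    'last_update': pd.to_datetime([
        '2025-01-01 08:00', '2025-01-01 20:00',
        '2025-01-02 08:00', '2025-01-02 20:00',
    ]),
    'pollutant_avg': [100.0, 140.0, 60.0, 80.0],
})

# Index by timestamp, then average all readings within each calendar day
daily = (
    toy.set_index('last_update')['pollutant_avg']
       .resample('D')
       .mean()
       .reset_index()
)

print(daily)
```

The resulting `daily` frame can be passed to `px.line` in the same way as the timestamp-level series.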

In [None]:
## Correlation Analysis: Min, Max, Avg

corr = df[['pollutant_min', 'pollutant_max', 'pollutant_avg']].corr()
plt.figure(figsize=(6,4))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap (Min, Max, Avg)', fontsize=14)
plt.tight_layout()
plt.show()

# Narrative
print("Strong correlations between min, max, and avg values are expected, but outliers or weak correlations could signal data quality issues or interesting phenomena.")

In [None]:
## Outlier Detection: PM2.5 (IQR Method)

q1 = pm25['pollutant_avg'].quantile(0.25)
q3 = pm25['pollutant_avg'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = pm25[(pm25['pollutant_avg'] < lower) | (pm25['pollutant_avg'] > upper)]
print(f"Number of PM2.5 outliers: {len(outliers)}")
display(outliers[['city', 'station', 'pollutant_avg']].head(10))

# Narrative
print("Outliers can reveal data entry errors, rare events, or cities facing extreme pollution. Each outlier is a story worth investigating.")
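
The same IQR rule can be wrapped in a small function so it can be reused for other pollutants. This is a sketch only: `iqr_bounds` is a hypothetical helper name, and the default multiplier of 1.5 is the conventional one already used in this notebook.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values (made up): one reading sits far above the rest
readings = pd.Series([40.0, 45.0, 50.0, 55.0, 60.0, 300.0])

lower, upper = iqr_bounds(readings)
flagged = readings[(readings < lower) | (readings > upper)]

print(f"Fences: {lower} to {upper}")
print(flagged)
```

A larger `k` (3.0 is a common choice for "extreme" outliers) flags fewer points. That may suit air quality data, which tends to be right-skewed.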

In [None]:
## Interactive Map: PM2.5 by City

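# Prepare data for the map: one row per city, with PM2.5 and coordinates averaged over its stations
city_map = pm25.groupby('city', as_index=False)[['pollutant_avg', 'latitude', 'longitude']].mean()
city_map = city_map.dropna(subset=['pollutant_avg', 'latitude', 'longitude'])  # markers need a position and a size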
fig = px.scatter_mapbox(city_map, lat='latitude', lon='longitude', size='pollutant_avg', color='pollutant_avg',
                        color_continuous_scale='YlOrRd', size_max=20, zoom=4,
                        mapbox_style='carto-positron',
                        hover_name='city',
                        title='Average PM2.5 by City (Map)')
fig.show()

# Narrative
print("This map brings the data to life. The size and color of each marker show which cities are struggling most with PM2.5. Geography, wind, and urbanization all play a role.")

## Key Insights & Next Steps

**Key Insights:**
- Several cities and states face alarmingly high PM2.5 levels, far exceeding safe limits.
- Pollution is not evenly distributed: city and state averages differ widely, and urbanization, industry, and geography are likely contributors.
- Outliers and spikes in the data may signal events or data quality issues worth further study.
- Interactive maps and time series reveal both chronic and acute pollution problems.
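
To put a number on "exceeding safe limits", city averages can be compared against published reference values. The sketch below uses made-up city averages and two 24-hour PM2.5 limits: the WHO 2021 guideline (15 µg/m³) and India's national standard (60 µg/m³). It assumes the readings are comparable to 24-hour averages, which this dataset may not guarantee. Check the limits against the current official documents before quoting them.

```python
import pandas as pd

# Reference 24-hour PM2.5 limits in µg/m³ (verify against official sources)
WHO_24H = 15.0    # WHO 2021 air quality guideline
NAAQS_24H = 60.0  # India's National Ambient Air Quality Standard

# Toy city averages (made-up values)
city_avg = pd.Series({'City A': 12.0, 'City B': 45.0, 'City C': 95.0, 'City D': 150.0})

summary = pd.DataFrame({
    'pm25': city_avg,
    'times_who': (city_avg / WHO_24H).round(1),  # multiple of the WHO guideline
    'above_naaqs': city_avg > NAAQS_24H,         # exceeds the national standard?
})

# Fraction of cities above the national standard
share_above_naaqs = summary['above_naaqs'].mean()

print(summary)
print(f"Share of cities above NAAQS: {share_above_naaqs:.0%}")
```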

**Next Steps:**
- Investigate causes for the worst-affected cities (policy, industry, traffic, geography).
- Build predictive models for air quality forecasting.
- Deploy a public dashboard to share findings and raise awareness.
- Integrate real-time data for ongoing monitoring.

---

*Thank you for exploring this data journey with me! As an aspiring data scientist, I believe every dataset is a chance to make a difference. What will you discover next?*