# Problem Statement 
**Objective**
Delhi faces severe air pollution, affecting public health and the environment. The goal of this project is to predict the Air Quality Index (AQI) using historical data on pollutants like PM2.5, PM10, NO₂, and others. By building and evaluating regression models, we aim to understand key pollution trends and enable early warnings through data-driven AQI forecasting.

In [None]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


### Load the Dataset

In [None]:
df = pd.read_csv('../data/Delhi.csv')
df.head()


### Preprocess the Data

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')  # Convert to datetime
df.set_index('Date', inplace=True)                          # Time-series index
df.drop(columns=['City', 'AQI_Bucket'], inplace=True)       # Drop unused columns
df.fillna(df.median(numeric_only=True), inplace=True)       # Handle missing values
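
The median fill above is computed over the whole dataset, so it also uses rows that are later held out for testing, and it imputes the `AQI` target itself. One possible alternative, sketched here on a small synthetic frame (the `toy` data, its values, and the 75% split point are assumptions for illustration, not the Delhi dataset), is to drop rows with a missing target and learn the medians from the training rows only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the pollutant table (values are made up for illustration)
toy = pd.DataFrame(
    {
        "PM2.5": [100.0, np.nan, 140.0, 160.0, np.nan],
        "PM10": [200.0, 220.0, np.nan, 260.0, 280.0],
        "AQI": [250.0, 270.0, 300.0, np.nan, 340.0],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Share of missing values per column, to see how much work imputation is doing
missing_share = toy.isna().mean()

# Drop rows with a missing target instead of imputing it
toy = toy.dropna(subset=["AQI"])

# Chronological split, then medians learned from the training rows only
split = int(len(toy) * 0.75)
train, test = toy.iloc[:split].copy(), toy.iloc[split:].copy()
feature_cols = ["PM2.5", "PM10"]
train_medians = train[feature_cols].median()
train[feature_cols] = train[feature_cols].fillna(train_medians)
test[feature_cols] = test[feature_cols].fillna(train_medians)

print(missing_share)
print(train_medians)
```

Whether this changes the results noticeably depends on how many values are missing, which `missing_share` makes visible.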


### Exploratory Data Analysis (EDA)

#### Visualize Pollutants Over Time

In [None]:
plt.figure(figsize=(15, 6))
for col in ['PM2.5', 'PM10', 'NO2', 'SO2', 'O3']:
    plt.plot(df.index, df[col], label=col)
plt.title("Pollutants Over Time")
plt.xlabel("Year")
plt.ylabel("Concentration (µg/m³)")
plt.legend()
plt.grid(True)
plt.show()


**Insights:**
This graph shows:
- How pollutant concentrations in Delhi changed from 2015 to 2020.
- PM10 and PM2.5 have the highest peaks, especially in winter, which points to particulate matter as the main contributor to poor air quality.
- Other pollutants such as NO₂, SO₂, and O₃ stayed lower and more stable.
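
The winter pattern described above can also be checked numerically by averaging a pollutant by calendar month. The sketch below uses a synthetic daily PM2.5 series with a built-in winter peak (the `toy` frame and its values are assumptions for illustration); with the real data, the same `groupby` would be applied to `df['PM2.5']`.

```python
import numpy as np
import pandas as pd

# Synthetic daily PM2.5 with a winter peak (illustrative values, not the Delhi data)
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
pm25 = 120 + 80 * np.cos(2 * np.pi * day_of_year / 365)
toy = pd.DataFrame({"PM2.5": pm25}, index=dates)

# Average by calendar month to expose the seasonal pattern
monthly_mean = toy["PM2.5"].groupby(toy.index.month).mean()
peak_month = int(monthly_mean.idxmax())
low_month = int(monthly_mean.idxmin())

print(monthly_mean.round(1))
print("Highest month:", peak_month, "Lowest month:", low_month)
```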



#### Correlation Heatmap

In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation between Pollutants and AQI")
plt.show()


**Insights:**
This heatmap shows:
- How strongly each pollutant is related to AQI. PM2.5 (0.88) and PM10 (0.85) are strongly correlated with AQI.
- Gases such as SO₂ and O₃ have weaker correlations.
- This suggests that particulate matter is the main driver of AQI in Delhi, although correlation alone does not establish causation.
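
Reading individual values off a large heatmap is error-prone, so it can help to pull out the `AQI` column of the correlation matrix and sort it. A minimal sketch on synthetic data (the `toy` frame and the coefficients used to generate it are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data where AQI depends strongly on PM2.5 and weakly on SO2
rng = np.random.default_rng(0)
n = 500
pm25 = rng.normal(120, 40, n)
so2 = rng.normal(15, 5, n)
toy = pd.DataFrame(
    {
        "PM2.5": pm25,
        "SO2": so2,
        "AQI": 2.0 * pm25 + 0.5 * so2 + rng.normal(0, 10, n),
    }
)

# Correlation of every feature with AQI, strongest first
aqi_corr = toy.corr()["AQI"].drop("AQI").sort_values(ascending=False)
print(aqi_corr.round(2))
```

On the notebook's data, the equivalent expression would be `df.corr()['AQI'].drop('AQI').sort_values(ascending=False)`.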


### Train-Test Split

In [None]:
# Features (X) and target (y)
X = df.drop(columns=['AQI'])
y = df['AQI']

# 80-20 time-based split (no shuffle to preserve order)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
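
With `shuffle=False`, `train_test_split` keeps the row order, so the last 20% of rows become the test set. A single split gives only one estimate of performance; scikit-learn's `TimeSeriesSplit` is one way to get several chronological folds. The sketch below uses toy arrays (an assumption for illustration) to show both behaviours:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Toy data: row i holds the value i, so order is easy to inspect
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.arange(20)

# Same call pattern as the notebook: the tail of the data becomes the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)
print("Test rows:", y_te)

# Expanding-window folds: each fold trains on the past and tests on the next block
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))
for train_idx, test_idx in folds:
    print("train up to", train_idx.max(), "| test", test_idx.min(), "to", test_idx.max())
```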


### Train Random Forest Regressor


In [None]:
# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


Random Forest handles non-linear relationships and tends to work well on noisy, real-world datasets such as pollution data.
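
The claim about non-linear relationships is easiest to judge against a simple baseline. `LinearRegression` is already imported at the top of the notebook, so a side-by-side comparison is cheap. The sketch below uses synthetic data with a deliberately non-linear target (the data and the `scores` dictionary are assumptions for illustration); on the notebook's data the same loop would use `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic features with a non-linear effect on the target
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(600, 2))
y_toy = 50 * np.sin(X_toy[:, 0]) + 5 * X_toy[:, 1] + rng.normal(0, 2, 600)

# Ordered 80-20 split, mirroring the notebook's shuffle=False split
split = 480
X_tr, X_te = X_toy[:split], X_toy[split:]
y_tr, y_te = y_toy[:split], y_toy[split:]

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and record its test R²
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))

print(scores)
```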



### Predict and Evaluate Model

In [None]:
# Predict AQI for test data
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Random Forest Performance")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.4f}")


**Interpretation:**
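- A low MSE means predictions are close to the actual values. MSE is expressed in squared AQI units, so its square root (RMSE) is easier to interpret.
- A high R² means the model explains most of the variation in AQI.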
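
Because MSE is in squared units, it is often reported alongside RMSE and MAE, which are on the same scale as AQI. A minimal sketch with hand-picked numbers (the `y_true_toy` and `y_pred_toy` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual and predicted AQI values
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 320.0, 380.0])

mse_toy = mean_squared_error(y_true_toy, y_pred_toy)
rmse_toy = np.sqrt(mse_toy)  # back on the AQI scale
mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)
r2_toy = r2_score(y_true_toy, y_pred_toy)

print(f"MSE:  {mse_toy:.2f}")
print(f"RMSE: {rmse_toy:.2f}")
print(f"MAE:  {mae_toy:.2f}")
print(f"R²:   {r2_toy:.4f}")
```

Here the errors are ±10 and ±20, giving an MSE of 250, an RMSE of about 15.81, an MAE of 15, and an R² of 0.98.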

### Plot Actual vs Predicted AQI

In [None]:
# Visual comparison of actual vs predicted AQI
plt.figure(figsize=(15, 6))
plt.plot(y_test.values, label='Actual AQI', color='black')
plt.plot(y_pred, label='Predicted AQI', color='green')
plt.title("Actual vs Predicted AQI")
plt.xlabel("Test Set Index")
plt.ylabel("AQI")
plt.legend()
plt.tight_layout()
plt.show()


**Insights:**
- The Random Forest predictions follow the actual AQI closely.
- The model captures seasonal spikes and dips well.



### Feature Importance

In [None]:
# Feature importance plot
importances = model.feature_importances_
features = X.columns
indices = np.argsort(importances)

plt.figure(figsize=(10, 6))
plt.title('Feature Importances from Random Forest')
plt.barh(range(len(indices)), importances[indices], color='skyblue')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.tight_layout()
plt.show()


**Insights:**
- PM2.5 and PM10 have the highest importance in the model's AQI predictions.
- Other gases such as CO and NO₂ also influence the predictions, but less so.
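
The `feature_importances_` attribute is impurity-based and computed on the training data, which can distort the ranking when features are correlated. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data (the `X_toy` frame, the `forest` model, and the split point are assumptions for illustration); with the notebook's objects the call would take `model`, `X_test`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on PM2.5 only, SO2 is pure noise
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    {"PM2.5": rng.normal(120, 40, 400), "SO2": rng.normal(15, 5, 400)}
)
y_toy = 2 * X_toy["PM2.5"] + rng.normal(0, 5, 400)

# Ordered 80-20 split
split = 320
X_tr, X_te = X_toy.iloc[:split], X_toy.iloc[split:]
y_tr, y_te = y_toy.iloc[:split], y_toy.iloc[split:]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X_toy.columns).sort_values(ascending=False)
print(perm.round(3))
```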

#### Usage Example

In [None]:
# Predict AQI from a new pollutant input (example row)
sample_input = pd.DataFrame([{
    'PM2.5': 180, 'PM10': 250, 'NO': 30, 'NO2': 45, 'NOx': 50,
    'NH3': 20, 'CO': 1.5, 'SO2': 8, 'O3': 15, 'Benzene': 5,
    'Toluene': 7, 'Xylene': 2
}])

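# Align column order with the training features before predicting
sample_input = sample_input[X.columns]

predicted_aqi = model.predict(sample_input)
print("Predicted AQI:", round(predicted_aqi[0]))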


### Conclusion
We used a Random Forest model to predict the Air Quality Index (AQI) in Delhi from pollutant data. The model fit the held-out data well, with an R² score above 0.90 and low error. PM2.5 and PM10 were the strongest predictors of poor air quality, especially during winter, when pollution levels are highest. The predictions also followed the actual AQI trend closely. These results suggest that machine learning can support air quality forecasting, early health warnings, and better-informed planning to reduce pollution.
