
# 🌍 Exploratory Data Analysis (EDA) on Air Quality Dataset  
**Student Roll No:** FM1105  
**Dataset:** AirQuality.csv  
**Objective:** Perform a detailed exploratory data analysis to understand air quality trends and the relationships between pollution levels and environmental parameters.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid", palette="Set2")


In [None]:

# Load dataset
df = pd.read_csv('AirQuality.csv', sep=';')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df.head()
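
The columns converted in the cleaning step below suggest that this file writes decimals with a comma. If that is the case (an assumption; the rows in this sketch are made up for illustration and are not taken from AirQuality.csv), `read_csv` can parse them at load time through its `decimal` parameter:

```python
import io
import pandas as pd

# Hypothetical rows imitating the assumed layout: ';' separator, decimal commas,
# and trailing empty fields that pandas names "Unnamed: N"
sample = (
    "Date;Time;CO(GT);T;RH;;\n"
    "10/03/2004;18.00.00;2,6;13,6;48,9;;\n"
    "10/03/2004;19.00.00;2,0;13,3;47,7;;\n"
)

demo = pd.read_csv(io.StringIO(sample), sep=';', decimal=',')
demo = demo.loc[:, ~demo.columns.str.contains('^Unnamed')]
print(demo.dtypes)
```

With this option the measurement columns arrive as floats, so a later `pd.to_numeric` pass becomes a safety net rather than a requirement.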


## 1️⃣ Dataset Overview

In [None]:

print("Shape:", df.shape)
print("\nColumns:", list(df.columns))
print("\nData Types:\n", df.dtypes)
print("\nMissing Values:\n", df.isnull().sum())
print("\nUnique Values:\n", df.nunique())
df.tail()
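
`isnull().sum()` only counts values that pandas already treats as NaN. Some sensor exports encode missing readings with a numeric placeholder instead. The sketch below assumes a placeholder of -200 (an assumption about this file, to be checked against its documentation) and counts it per column on a made-up frame:

```python
import pandas as pd

MISSING_SENTINEL = -200  # assumed placeholder for missing readings

# Made-up frame for illustration; not taken from AirQuality.csv
demo = pd.DataFrame({
    'CO(GT)': [2.6, -200.0, 2.2, -200.0],
    'T': [13.6, 13.3, -200.0, 11.0],
    'RH': [48.9, 47.7, 54.0, 60.0],
})

nan_counts = demo.isnull().sum()
sentinel_counts = (demo == MISSING_SENTINEL).sum()
print(pd.DataFrame({'NaN': nan_counts, 'sentinel': sentinel_counts}))
```

If the sentinel column is non-zero while the NaN column is zero, the missing-value report above understates the gaps in the data.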


## 2️⃣ Data Cleaning & Type Conversion

In [None]:

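# Convert numeric columns to proper type.
# Assumption: the file uses decimal commas (e.g. "2,6"), so these columns load as
# strings; swap the comma for a dot first, otherwise to_numeric coerces them to NaN.
for col in ['CO(GT)', 'C6H6(GT)', 'T', 'RH', 'AH']:
    as_text = df[col].astype(str).str.replace(',', '.', regex=False)
    df[col] = pd.to_numeric(as_text, errors='coerce')

# Assumption: missing readings are tagged with -200 (as in the UCI Air Quality file);
# mark them as NaN so they do not distort the means used below
df = df.replace(-200, np.nan)

# Drop fully empty rows
df = df.dropna(how='all')

# Fill missing numeric values with the column mean
# (assigning back avoids the chained inplace fillna that newer pandas warns about)
for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = df[col].fillna(df[col].mean())

# Convert date column (assumption: dates are day-first, e.g. 10/03/2004)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
print("Cleaned Shape:", df.shape)
print("Duplicate Rows:", df.duplicated().sum())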
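
For trend analysis it is often convenient to merge date and time into a single timestamp index. The sketch below works on the raw text columns (before `Date` is converted) and assumes day-first dates and a `Time` column formatted like `18.00.00`; both formats are assumptions, and the rows are made up:

```python
import pandas as pd

# Made-up rows for illustration; not taken from AirQuality.csv
demo = pd.DataFrame({
    'Date': ['10/03/2004', '10/03/2004', '11/03/2004'],
    'Time': ['18.00.00', '19.00.00', '00.00.00'],
    'CO(GT)': [2.6, 2.0, 1.2],
})

# Join the two text columns and parse them with an explicit format
demo['Timestamp'] = pd.to_datetime(
    demo['Date'] + ' ' + demo['Time'],
    format='%d/%m/%Y %H.%M.%S',
    errors='coerce',
)
demo = demo.set_index('Timestamp').sort_index()

# Daily mean of one pollutant column
daily_mean = demo['CO(GT)'].resample('D').mean()
print(daily_mean)
```

A timestamp index makes `resample` and rolling-window summaries available, which a plain `Date` column does not.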


## 3️⃣ Descriptive Statistics

In [None]:

desc_stats = df.describe().T
desc_stats
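
`describe()` does not report distribution shape. As a possible extension, the sketch below adds skewness and the coefficient of variation to the transposed summary; the frame and its values are made up for illustration:

```python
import pandas as pd

# Made-up values: one right-skewed column, one symmetric column
demo = pd.DataFrame({
    'CO(GT)': [1.0, 2.0, 2.0, 3.0, 12.0],
    'T': [10.0, 12.0, 14.0, 16.0, 18.0],
})

summary = demo.describe().T
summary['skew'] = demo.skew()
summary['cv'] = summary['std'] / summary['mean']  # coefficient of variation
print(summary[['mean', '50%', 'std', 'skew', 'cv']])
```

A mean well above the median together with a positive skew points to a long right tail, which is worth knowing before reading the histograms.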


## 4️⃣ Outlier Detection (IQR Method)

In [None]:

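# Restrict to numeric columns: quantiles and comparisons are undefined for Date/Time
num_df = df.select_dtypes(include=[np.number])
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1
# Count, per column, the values outside the 1.5 * IQR fences
outliers = ((num_df < (Q1 - 1.5 * IQR)) | (num_df > (Q3 + 1.5 * IQR))).sum()
outliers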
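
To make the IQR rule concrete, here is a small worked example on a made-up series. The helper name `iqr_fences` is hypothetical and only serves this illustration:

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for one column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_fences(values)
flagged = values[(values < lower) | (values > upper)]
print(lower, upper, flagged.tolist())
```

Here Q1 = 2 and Q3 = 4, so the IQR is 2 and the fences sit at -1 and 7; only the value 100 falls outside them.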


## 5️⃣ Univariate Analysis

In [None]:

num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols].hist(figsize=(15, 12), bins=30, edgecolor='black')
plt.suptitle('Distribution of Numerical Features', fontsize=16)
plt.show()


## 6️⃣ Correlation Heatmap

In [None]:

plt.figure(figsize=(10, 6))
sns.heatmap(df[num_cols].corr(), annot=False, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
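
With `annot=False` the heatmap shows colors only, so the strongest relationships are easier to read from a ranked list. The sketch below is one way to extract them; `top_correlations` is a hypothetical helper, and the demo frame is made up:

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})
top = top_correlations(demo, n=3)
print(top)
```

In the notebook this could be called as `top_correlations(df[num_cols])` once the numeric columns are cleaned.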


## 7️⃣ Pairplot (Multivariate Analysis)

In [None]:

sns.pairplot(df, vars=['CO(GT)', 'NOx(GT)', 'NO2(GT)', 'T', 'RH', 'AH'], corner=True)
plt.show()



## 8️⃣ Key Insights & Interpretation

- **CO(GT)**, **NOx(GT)**, and **NO2(GT)** are the key air pollutants in this dataset, and their concentrations vary across samples.  
- In the pairplot, higher pollutant concentrations tend to occur at **lower temperatures** and **higher humidity levels**.  
- The pollution sensor readings (PT08 series) are strongly correlated with one another in the heatmap.  
- After cleaning, the checks above found no duplicate rows and no extreme anomalies.  
- The dataset is ready for modeling or deeper trend analysis.
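
The temperature and humidity observation above can be checked numerically by averaging a pollutant within quantile bins of an environmental variable. The sketch below uses a hypothetical helper, `mean_by_quantile_bin`, on made-up data in which the pollutant falls as temperature rises:

```python
import pandas as pd

def mean_by_quantile_bin(frame, value_col, by_col, q=4):
    """Mean of value_col within q equal-sized quantile bins of by_col."""
    bins = pd.qcut(frame[by_col], q=q, duplicates='drop')
    return frame.groupby(bins, observed=True)[value_col].mean()

# Made-up data for illustration; not taken from AirQuality.csv
demo = pd.DataFrame({
    'T': [0, 1, 2, 3, 4, 5, 6, 7],
    'CO(GT)': [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
})

by_temp = mean_by_quantile_bin(demo, 'CO(GT)', 'T')
print(by_temp)
print(demo['CO(GT)'].corr(demo['T']))  # Pearson correlation
```

On the real data, a steadily falling mean across temperature bins (or a rising one across humidity bins) would support the observation; a flat or irregular pattern would not.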



## ✅ Conclusion

This EDA provided a clear understanding of air quality trends, sensor relationships, and environmental impacts.  
The cleaned dataset is consistent, free from duplicates, and ready for further predictive modeling or research.
