# Comprehensive Exploratory Analysis of Global Air Quality Trends

## Project Overview

- Problem Statement: Conduct an exploratory data analysis (EDA) of air quality data to uncover patterns, peak pollution periods, and differences in pollution levels across various cities.

- Scope: Utilize public air quality datasets, focusing on pollutants such as PM2.5, PM10, NO2, and O3, to perform a comprehensive EDA.
---

## 1. Imports

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

## 2. Loading the data

In [2]:
air = pd.read_csv('./datasets/AirQuality_data.csv')
# air = air.drop(columns = ['Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19'])
air.head()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,10/03/2004,18.00.00,26.0,1360.0,150.0,119.0,1046.0,166.0,1056.0,113.0,1692.0,1268.0,136.0,489.0,7578.0
1,10/03/2004,19.00.00,2.0,1292.0,112.0,94.0,955.0,103.0,1174.0,92.0,1559.0,972.0,133.0,477.0,7255.0
2,10/03/2004,20.00.00,22.0,1402.0,88.0,90.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,119.0,540.0,7502.0
3,10/03/2004,21.00.00,22.0,1376.0,80.0,92.0,948.0,172.0,1092.0,122.0,1584.0,1203.0,110.0,600.0,7867.0
4,10/03/2004,22.00.00,16.0,1272.0,51.0,65.0,836.0,131.0,1205.0,116.0,1490.0,1110.0,112.0,596.0,7888.0


## 3. Data Cleaning

In [3]:
# Inspect column dtypes and non-null counts
air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   float64
 4   NMHC(GT)       9357 non-null   float64
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   float64
 7   NOx(GT)        9357 non-null   float64
 8   PT08.S3(NOx)   9357 non-null   float64
 9   NO2(GT)        9357 non-null   float64
 10  PT08.S4(NO2)   9357 non-null   float64
 11  PT08.S5(O3)    9357 non-null   float64
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(13), object(2)
memory usage: 1.1+ MB


The datatypes look reasonable. However, the DataFrame has 9471 entries while every column has only 9357 non-null values, so each column is missing exactly 114 values.

In [4]:
# Check null values per column
air.isnull().sum()

Date             114
Time             114
CO(GT)           114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
AH               114
dtype: int64

Taking a closer look at the rows where `Date` is null:

In [5]:
air[air['Date'].isnull()]

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
9357,,,,,,,,,,,,,,,
9358,,,,,,,,,,,,,,,
9359,,,,,,,,,,,,,,,
9360,,,,,,,,,,,,,,,
9361,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9466,,,,,,,,,,,,,,,
9467,,,,,,,,,,,,,,,
9468,,,,,,,,,,,,,,,
9469,,,,,,,,,,,,,,,


So the null values all come from redundant, completely empty rows below the actual entries, starting at index 9357. We will drop them all.

In [6]:
# Drop null rows
air.dropna(inplace = True)
# Check info again
air.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   float64
 4   NMHC(GT)       9357 non-null   float64
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   float64
 7   NOx(GT)        9357 non-null   float64
 8   PT08.S3(NOx)   9357 non-null   float64
 9   NO2(GT)        9357 non-null   float64
 10  PT08.S4(NO2)   9357 non-null   float64
 11  PT08.S5(O3)    9357 non-null   float64
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(13), object(2)
memory usage: 1.1+ MB
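
A side note on the drop above: `dropna()` with default arguments removes a row if any single column is missing. That is harmless here because the 114 null rows are empty in every column, but restricting the drop to fully empty rows with `how="all"` may be a safer choice if partial gaps ever appear. The sketch below illustrates the difference on a small made-up frame; `toy`, `dropped_any`, and `dropped_all` are hypothetical names, and the values are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (NOT the real dataset): one partially missing row
# followed by two completely empty trailing rows.
toy = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004", None, None],
    "CO(GT)": [26.0, np.nan, np.nan, np.nan],
    "T": [136.0, 133.0, np.nan, np.nan],
})

# Default behaviour: drop every row that has at least one missing value.
dropped_any = toy.dropna()

# how="all": drop only the rows in which every column is missing.
dropped_all = toy.dropna(how="all")

print(len(dropped_any), len(dropped_all))
```

With `how="all"`, the partially filled row is kept for later inspection instead of being silently discarded.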


In [7]:
# Summary statistics for the numeric columns
air.describe()

Unnamed: 0,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0
mean,-36.996687,1048.990061,-159.090093,18.656834,894.595276,168.616971,794.990168,58.148873,1391.479641,975.072032,168.190232,465.260981,9846.342524
std,211.793927,329.83271,139.789093,413.802064,342.333252,257.433866,321.993552,126.940455,467.210125,456.938184,114.081191,216.407635,4447.196714
min,-2000.0,-200.0,-200.0,-2000.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0
25%,4.0,921.0,-200.0,40.0,711.0,50.0,637.0,53.0,1185.0,700.0,109.0,341.0,6923.0
50%,14.0,1053.0,-200.0,79.0,895.0,141.0,794.0,96.0,1446.0,942.0,172.0,486.0,9768.0
75%,25.0,1221.0,-200.0,136.0,1105.0,284.0,960.0,133.0,1662.0,1255.0,241.0,619.0,12962.0
max,119.0,2040.0,1189.0,637.0,2214.0,1479.0,2683.0,340.0,2775.0,2523.0,446.0,887.0,22310.0


In [12]:
# Identify columns to check for negative values
columns_to_check = ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)',
                    'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 
                    'T', 'RH', 'AH']

# Filter the DataFrame to include rows with negative values in any of the specified columns
negative_values_subset = air[air[columns_to_check].lt(0).any(axis=1)]

# Display the subset
print(negative_values_subset.head())
negative_values_subset.shape

          Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
9   11/03/2004  03.00.00     6.0       1010.0      19.0      17.0   
10  11/03/2004  04.00.00  -200.0       1011.0      14.0      13.0   
33  12/03/2004  03.00.00     8.0        889.0      21.0      19.0   
34  12/03/2004  04.00.00  -200.0        831.0      10.0      11.0   
39  12/03/2004  09.00.00  -200.0       1545.0    -200.0     221.0   

    PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  \
9           561.0   -200.0        1705.0   -200.0        1235.0        501.0   
10          527.0     21.0        1818.0     34.0        1197.0        445.0   
33          574.0   -200.0        1680.0   -200.0        1187.0        512.0   
34          506.0     21.0        1893.0     32.0        1134.0        384.0   
39         1353.0   -200.0         767.0   -200.0        2058.0       1588.0   

        T     RH      AH  
9   103.0  602.0  7517.0  
10  101.0  605.0  7465.0  
33   70.0  623.0  6261.

(8530, 15)
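
The shape above indicates that 8530 of the 9357 rows contain at least one negative value. In the `describe()` output, -200 is the minimum of nearly every column, which suggests (as an assumption, not something confirmed in this notebook) that it acts as a placeholder rather than a real measurement. A per-column count can show which columns drive the total. The following is a minimal sketch on a made-up frame; `toy`, `negative_counts`, and `rows_with_negative` are illustrative names, and on the real data the same expressions could be applied to `air[columns_to_check]`.

```python
import pandas as pd

# Hypothetical toy frame (NOT the real dataset) with -200 standing in
# for the negative readings seen above.
toy = pd.DataFrame({
    "CO(GT)": [6.0, -200.0, 8.0, -200.0],
    "NMHC(GT)": [19.0, 14.0, -200.0, -200.0],
    "T": [103.0, 101.0, 70.0, 68.0],
})

# Boolean mask of negative entries, reused for each summary below.
is_negative = toy.lt(0)

# Number of negative entries in each column.
negative_counts = is_negative.sum()

# Fraction of rows that are negative in each column.
negative_share = is_negative.mean()

# Number of rows with a negative value in at least one column.
rows_with_negative = int(is_negative.any(axis=1).sum())

print(negative_counts)
print(negative_share)
print(rows_with_negative)
```

Columns with a very high share of negatives may be candidates for exclusion, while the others could have those entries treated as missing.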