### **Data Science Project Introduction: Air Quality Analysis in Madrid**

#### **Problem Statement:** 
The project analyzes air quality in Madrid using hourly measurements from the Escuelas Aguirre air quality station. The primary focus is to understand the trends, patterns, and potential correlations among the pollutants recorded there.

#### **Objectives:** 
1. **Comprehensive Analysis:** To analyze and interpret the hourly air quality data collected from January 2001 to March 2022.
2. **Insight Generation:** To derive valuable insights regarding the levels of different pollutants present in Madrid's atmosphere over two decades.
3. **Modeling and Visualization:** To employ data science techniques to model pollutant levels and create meaningful visualizations that aid in understanding the air quality dynamics.

#### **About Dataset:**
- **Description:** 172,622 rows of data from the Escuelas Aguirre air quality station, Madrid. The data span from January 2001 to March 2022 and include the following variables:
  - BEN Benzene (µg/m³)
  - CH4 Methane (mg/m³)
  - CO Carbon monoxide (mg/m³)
  - EBE Ethylbenzene (µg/m³)
  - NMHC Non-methane hydrocarbons (mg/m³)
  - NO Nitrogen monoxide (µg/m³)
  - NO2 Nitrogen dioxide (µg/m³)
  - NOx Nitrogen oxides (µg/m³)
  - O3 Ozone (µg/m³)
  - PM10 Particles < 10 µm (µg/m³)
  - PM25 Particles < 2.5 µm (µg/m³)
  - SO2 Sulfur dioxide (µg/m³)
  - TCH Total hydrocarbons (mg/m³)
  - TOL Toluene (µg/m³)
- **Context:** This dataset is an hourly record of pollution levels at a single Madrid station over roughly 21 years; several pollutants have gaps in their series.

#### **Team Members:**
- Turki Alkazman - (220010077)
- Turki Alqou - (220011703)

#### **Context and Insights Aimed:**
- Contextualizing variations in pollutant levels over a 21-year period.
- Identifying potential correlations between different pollutants.
- Exploring seasonal patterns and their impact on air quality.
- Uncovering long-term trends to better comprehend Madrid's environmental conditions.
- Employing machine learning techniques to model and predict pollutant levels, if applicable and beneficial for the analysis.
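
The correlation, seasonal, and long-term aims above can be sketched on a small synthetic frame. This is only an illustration under stated assumptions: the values are randomly generated, the frame name `demo` is hypothetical, and only the column names `NO`, `NO2`, and `O3` come from the dataset description.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): two years of hourly
# timestamps with made-up values for three of the pollutant columns.
rng = np.random.default_rng(0)
time = pd.date_range("2001-01-01", periods=2 * 365 * 24, freq="h", tz="UTC")
no = rng.gamma(shape=2.0, scale=20.0, size=len(time))
demo = pd.DataFrame(
    {
        "NO": no,
        "NO2": 0.5 * no + rng.normal(0, 5, len(time)),  # built to track NO
        "O3": rng.normal(40, 10, len(time)),  # independent noise
    },
    index=time,
)

# Pairwise Pearson correlations between pollutant columns.
corr = demo.corr()

# Mean level per calendar month (1-12), pooled across years,
# as a rough seasonal profile.
monthly_profile = demo.groupby(demo.index.month).mean()

# Mean level per year, as a first look at long-term trends.
yearly_mean = demo.groupby(demo.index.year).mean()

print(corr.round(2))
print(monthly_profile.round(1))
print(yearly_mean.round(1))
```

On the real data the same three calls would apply once `Time` is parsed and set as the index; `corr()` ignores missing values pairwise, so columns with many gaps are compared on fewer rows.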

In [28]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("MadridPolution2001-2022.csv")

df.shape


(172622, 15)

In [29]:
df.head(10)

Unnamed: 0,Time,BEN,CH4,CO,EBE,NMHC,NO,NO2,NOx,O3,PM10,PM25,SO2,TCH,TOL
0,2001-01-01 00:00:00+00:00,4.0,,0.0,2.0,,66.0,67.0,168.0,7.0,32.0,,26.0,,11.0
1,2001-01-01 01:00:00+00:00,9.0,,0.0,5.0,,146.0,71.0,294.0,7.0,41.0,,21.0,,21.0
2,2001-01-01 02:00:00+00:00,9.0,,0.0,5.0,,190.0,73.0,364.0,7.0,50.0,,22.0,,24.0
3,2001-01-01 03:00:00+00:00,10.0,,0.0,5.0,,170.0,75.0,335.0,7.0,55.0,,19.0,,25.0
4,2001-01-01 04:00:00+00:00,8.0,,0.0,4.0,,102.0,67.0,224.0,8.0,42.0,,14.0,,21.0
5,2001-01-01 05:00:00+00:00,3.0,,1.0,2.0,,63.0,60.0,157.0,8.0,21.0,,10.0,,10.0
6,2001-01-01 06:00:00+00:00,2.0,,0.0,1.0,,28.0,47.0,90.0,17.0,14.0,,8.0,,6.0
7,2001-01-01 07:00:00+00:00,2.0,,1.0,1.0,,33.0,43.0,93.0,21.0,13.0,,8.0,,6.0
8,2001-01-01 08:00:00+00:00,2.0,,1.0,1.0,,36.0,49.0,105.0,20.0,17.0,,8.0,,6.0
9,2001-01-01 09:00:00+00:00,2.0,,1.0,1.0,,25.0,43.0,82.0,24.0,15.0,,7.0,,5.0


In [30]:
df.tail(10)

Unnamed: 0,Time,BEN,CH4,CO,EBE,NMHC,NO,NO2,NOx,O3,PM10,PM25,SO2,TCH,TOL
172612,2022-03-31 14:00:00+00:00,0.0,,0.0,0.0,,9.0,27.0,40.0,58.0,21.0,13.0,4.0,,2.0
172613,2022-03-31 15:00:00+00:00,0.0,,0.0,0.0,,7.0,22.0,33.0,57.0,24.0,16.0,4.0,,1.0
172614,2022-03-31 16:00:00+00:00,0.0,,0.0,0.0,,6.0,21.0,30.0,57.0,5.0,2.0,4.0,,2.0
172615,2022-03-31 17:00:00+00:00,0.0,,0.0,0.0,,11.0,34.0,52.0,40.0,13.0,9.0,4.0,,3.0
172616,2022-03-31 18:00:00+00:00,0.0,,0.0,0.0,,13.0,35.0,54.0,35.0,6.0,4.0,4.0,,2.0
172617,2022-03-31 19:00:00+00:00,0.0,,0.0,0.0,,12.0,43.0,62.0,20.0,2.0,1.0,4.0,,2.0
172618,2022-03-31 20:00:00+00:00,0.0,,0.0,0.0,,7.0,43.0,54.0,20.0,2.0,1.0,4.0,,2.0
172619,2022-03-31 21:00:00+00:00,0.0,,0.0,0.0,,4.0,32.0,39.0,29.0,7.0,5.0,4.0,,1.0
172620,2022-03-31 22:00:00+00:00,0.0,,0.0,0.0,,5.0,32.0,40.0,25.0,7.0,3.0,4.0,,1.0
172621,2022-03-31 23:00:00+00:00,0.0,,0.0,0.0,,4.0,30.0,36.0,25.0,2.0,1.0,4.0,,1.0


In [31]:
df.describe()

Unnamed: 0,BEN,CH4,CO,EBE,NMHC,NO,NO2,NOx,O3,PM10,PM25,SO2,TCH,TOL
count,164850.0,140053.0,172187.0,164787.0,139973.0,171916.0,171922.0,171918.0,170897.0,168229.0,106052.0,171855.0,140051.0,164483.0
mean,0.457295,1.00045,0.086034,0.757311,0.002208,40.855109,58.26234,120.906624,37.205088,26.801937,11.833091,9.924547,1.038422,4.283847
std,1.421051,0.160304,0.340024,1.400775,0.063157,62.581025,32.161441,121.13416,27.777307,23.401042,8.792781,8.281144,0.21139,5.897935
min,0.0,0.0,0.0,0.0,0.0,1.0,4.0,5.0,0.0,1.0,0.0,1.0,0.0,0.0
25%,0.0,1.0,0.0,0.0,0.0,7.0,35.0,49.0,13.0,12.0,6.0,5.0,1.0,1.0
50%,0.0,1.0,0.0,0.0,0.0,20.0,54.0,86.0,33.0,20.0,10.0,8.0,1.0,3.0
75%,0.0,1.0,0.0,1.0,0.0,47.25,75.0,149.0,55.0,34.0,15.0,12.0,1.0,5.0
max,43.0,4.0,10.0,81.0,9.0,1041.0,402.0,1910.0,199.0,367.0,215.0,158.0,10.0,174.0


In [32]:
# show column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172622 entries, 0 to 172621
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    172622 non-null  object 
 1   BEN     164850 non-null  float64
 2   CH4     140053 non-null  float64
 3   CO      172187 non-null  float64
 4   EBE     164787 non-null  float64
 5   NMHC    139973 non-null  float64
 6   NO      171916 non-null  float64
 7   NO2     171922 non-null  float64
 8   NOx     171918 non-null  float64
 9   O3      170897 non-null  float64
 10  PM10    168229 non-null  float64
 11  PM25    106052 non-null  float64
 12  SO2     171855 non-null  float64
 13  TCH     140051 non-null  float64
 14  TOL     164483 non-null  float64
dtypes: float64(14), object(1)
memory usage: 19.8+ MB
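
The `info()` output lists `Time` as `object`, meaning the timestamps are still plain strings. A minimal sketch of parsing them follows, using three rows copied from the `head()` output; the frame name `sample` is hypothetical, and whether the project later indexes by time is an assumption.

```python
import pandas as pd

# Three timestamps copied from the head() output, with the NO2 values shown there.
sample = pd.DataFrame(
    {
        "Time": [
            "2001-01-01 00:00:00+00:00",
            "2001-01-01 01:00:00+00:00",
            "2001-01-01 02:00:00+00:00",
        ],
        "NO2": [67.0, 71.0, 73.0],
    }
)

# Parse the strings into timezone-aware timestamps (UTC, per the +00:00 offset).
sample["Time"] = pd.to_datetime(sample["Time"], utc=True)

# Index by time so that date-based slicing and grouping become available.
sample = sample.set_index("Time").sort_index()

print(sample.index.dtype)
print(sample.index.hour.tolist())
```

With a datetime index, attributes such as `.index.month` and `.index.year` can be used directly for seasonal and yearly grouping.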


In [33]:
# check for missing values
df.isnull().sum()

Time        0
BEN      7772
CH4     32569
CO        435
EBE      7835
NMHC    32649
NO        706
NO2       700
NOx       704
O3       1725
PM10     4393
PM25    66570
SO2       767
TCH     32571
TOL      8139
dtype: int64
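
Raw counts are easier to compare as a share of all rows. The sketch below recomputes percentages from the counts printed above and the row count from `df.shape`; the names `missing_pct` and `small_gap_cols` are illustrative, not part of the notebook.

```python
import pandas as pd

n_rows = 172622  # row count reported by df.shape

# Missing-value counts copied from the isnull().sum() output above.
missing = pd.Series(
    {
        "BEN": 7772, "CH4": 32569, "CO": 435, "EBE": 7835, "NMHC": 32649,
        "NO": 706, "NO2": 700, "NOx": 704, "O3": 1725, "PM10": 4393,
        "PM25": 66570, "SO2": 767, "TCH": 32571, "TOL": 8139,
    }
)

# Share of rows missing per column, in percent, largest first.
missing_pct = (missing / n_rows * 100).round(2).sort_values(ascending=False)

# Columns with at most 2,000 missing values.
small_gap_cols = missing[missing <= 2000].index.tolist()

print(missing_pct)
print(small_gap_cols)
```

On the live frame, `df.isnull().mean() * 100` should give the same percentages without copying the counts by hand.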

In [34]:
# count missing values per pollutant column (Time has none, so it is excluded)
missing_counts = df.drop(columns='Time').isnull().sum().to_dict()

In [35]:
# make a copy of the dataset
df1 = df.copy()

In [36]:
# for columns with at most 2,000 missing values, fill each gap with the next
# valid value, then fill any remaining trailing gaps with the previous valid value
for col, n_missing in missing_counts.items():
    if n_missing <= 2000:
        df1[col] = df1[col].bfill().ffill()
df1.isnull().sum()



Time        0
BEN      7772
CH4     32569
CO          0
EBE      7835
NMHC    32649
NO          0
NO2         0
NOx         0
O3          0
PM10     4393
PM25    66570
SO2         0
TCH     32571
TOL      8139
dtype: int64
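
The result shows that only the six columns with at most 2,000 gaps (CO, NO, NO2, NOx, O3, SO2) were filled. How the back-fill followed by forward-fill behaves can be seen on a toy series; the values here are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap, and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Back-fill: each gap takes the next valid value; the trailing gap has none.
after_bfill = s.bfill()

# Forward-fill: the remaining trailing gap takes the previous valid value.
after_both = after_bfill.ffill()

print(after_bfill.tolist())
print(after_both.tolist())
```

Because the back-fill runs first, the forward-fill only ever touches gaps at the end of a series. Note that a long run of missing hours is filled with a single repeated value, which may flatten real variation in that stretch.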