# General Public

### Data Information

- **Name of dataset:** Air Quality (UCI Machine Learning Repository)

- **Dataset link:** [Air Quality](https://archive.ics.uci.edu/dataset/360/air+quality)

- **Dataset download:** [Download Dataset](https://archive.ics.uci.edu/static/public/360/air+quality.zip)

- **License:** Dataset is from UCI Machine Learning Repository, licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

- **Size:** AirQualityUCI.xlsx is 1.3MB

In [89]:
import pandas as pd
import bqplot
import bqplot.pyplot as plt
import json
import re
import numpy as np
import random
from scipy.interpolate import interp1d
from functools import reduce
from ast import literal_eval
from matplotlib.colors import Normalize
import seaborn as sns
from scipy.interpolate import make_interp_spline
from scipy.ndimage import gaussian_filter1d
from ucimlrepo import fetch_ucirepo 
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# Introduction

The UCI Machine Learning Repository is a collection of databases, domain theories, and datasets widely used by the machine learning community for experimentation, research, and education. The datasets hosted on the UCI repository cover diverse domains such as classification, regression, clustering, and more. Researchers and practitioners can access these datasets to develop, evaluate, and compare machine learning models across different domains and applications.

The Air Quality dataset from the UCI Machine Learning Repository contains the responses of a gas multisensor device deployed in the field in an Italian city. Its features include carbon monoxide, non-methane hydrocarbons, benzene, nitrogen oxides, nitrogen dioxide, an ozone-targeted sensor response, temperature, relative humidity, and absolute humidity. Hourly averages of the sensor responses are recorded along with reference gas concentrations from a certified analyzer. The dataset is used here to analyze the relationships between air pollutants and meteorological conditions.

Source:
 
- http://www.archive.ics.uci.edu/

- De Vito, Saverio, et al. “On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario.” Sensors and Actuators B: Chemical 129 (2008): 750-757.

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The experiment ran for one year, from March 2004 to February 2005. The fields are as follows:

1.  **Date (DD/MM/YYYY):** Date of the measurements.
2.  **Time (HH.MM.SS):** Time of the measurements.
3.  **True hourly averaged concentration CO in mg/m^3 (reference analyzer):** Hourly averaged concentration of Carbon Monoxide in milligrams per cubic meter measured by the reference analyzer.
4.  **PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted):** Hourly averaged sensor response of the tin oxide sensor, targeting CO.
5.  **True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer):** Hourly averaged concentration of overall Non-Methanic Hydrocarbons in micrograms per cubic meter measured by the reference analyzer.
6.  **True hourly averaged Benzene concentration in microg/m^3 (reference analyzer):** Hourly averaged concentration of Benzene in micrograms per cubic meter measured by the reference analyzer.
7.  **PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted):** Hourly averaged sensor response of the titania sensor, targeting NMHC.
8.  **True hourly averaged NOx concentration in ppb (reference analyzer):** Hourly averaged concentration of Nitrogen Oxides in parts per billion measured by the reference analyzer.
9.  **PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted):** Hourly averaged sensor response of the tungsten oxide sensor, targeting NOx.
10. **True hourly averaged NO2 concentration in microg/m^3 (reference analyzer):** Hourly averaged concentration of Nitrogen Dioxide in micrograms per cubic meter measured by the reference analyzer.
11. **PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted):** Hourly averaged sensor response of the tungsten oxide sensor, targeting NO2.
12. **PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted):** Hourly averaged sensor response of the indium oxide sensor, targeting Ozone.
13. **Temperature in °C:** Temperature measured in degrees Celsius.
14. **Relative Humidity (%):** Relative humidity measured as a percentage.
15. **AH Absolute Humidity:** Absolute humidity measurement.
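
The `Date` and `Time` formats listed above (DD/MM/YYYY and HH.MM.SS) matter when the data is read from a text file rather than the Excel workbook. The following is a minimal sketch, on a hypothetical two-row sample, of combining the two fields into one timestamp with an explicit format string. The frame name `sample` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical two-row sample in the documented formats (DD/MM/YYYY and HH.MM.SS).
sample = pd.DataFrame({
    "Date": ["10/03/2004", "11/03/2004"],
    "Time": ["18.00.00", "00.00.00"],
})

# An explicit format avoids day/month ambiguity: 10/03 is 10 March, not 3 October.
sample["DateTime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H.%M.%S"
)
print(sample["DateTime"])
```

When the workbook is read with `pd.read_excel`, the date column is usually parsed already, so this step is only needed for text input.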

In [5]:
# fetch dataset 
air_quality = fetch_ucirepo(id=360) 
  
# data (as pandas dataframes) 
X = air_quality.data.features 
y = air_quality.data.targets 
air_quality.metadata

{'uci_id': 360,
 'name': 'Air Quality',
 'repository_url': 'https://archive.ics.uci.edu/dataset/360/air+quality',
 'data_url': 'https://archive.ics.uci.edu/static/public/360/data.csv',
 'abstract': 'Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. ',
 'area': 'Computer Science',
 'tasks': ['Regression'],
 'characteristics': ['Multivariate', 'Time-Series'],
 'num_instances': 9358,
 'num_features': 15,
 'feature_types': ['Real'],
 'demographics': [],
 'target_col': None,
 'index_col': None,
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 2008,
 'last_updated': 'Sun Mar 10 2024',
 'dataset_doi': '10.24432/C59K5F',
 'creators': ['Saverio Vito'],
 'intro_paper': {'title': 'On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario',
  'authors': 'S. D. 

In [2]:
air_quality.variables 

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Date | Feature | Date | | | | no |
| 1 | Time | Feature | Categorical | | | | no |
| 2 | CO(GT) | Feature | Integer | | True hourly averaged concentration CO in mg/m^... | mg/m^3 | no |
| 3 | PT08.S1(CO) | Feature | Categorical | | hourly averaged sensor response (nominally CO... | | no |
| 4 | NMHC(GT) | Feature | Integer | | True hourly averaged overall Non Metanic Hydro... | microg/m^3 | no |
| 5 | C6H6(GT) | Feature | Continuous | | True hourly averaged Benzene concentration in... | microg/m^3 | no |
| 6 | PT08.S2(NMHC) | Feature | Categorical | | hourly averaged sensor response (nominally NMH... | | no |
| 7 | NOx(GT) | Feature | Integer | | True hourly averaged NOx concentration in ppb... | ppb | no |
| 8 | PT08.S3(NOx) | Feature | Categorical | | hourly averaged sensor response (nominally NOx... | | no |
| 9 | NO2(GT) | Feature | Integer | | True hourly averaged NO2 concentration in micr... | microg/m^3 | no |


# 1. Monthly value of pollutants in 2004
- I used a line chart and a heatmap to show the monthly mean of each variable in 2004 (from March to December). The line chart shows that the monthly values of the pollutants decreased, except for NOx(GT). The heatmap shows the same result.

In [90]:
df = pd.read_excel('/Users/lynn/Desktop/courses/dataVisual/air+quality/AirQualityUCI.xlsx')
df['Date'] = df['Date'].astype(str)
df['Time'] = df['Time'].astype(str)
df['DateTime'] =  pd.to_datetime(df['Date'] + ' ' + df['Time'])
data_day = df[df['DateTime'].dt.year == 2004]
unique_months = data_day['DateTime'].dt.month.unique()
mean_value =[]
for i, month in enumerate(unique_months):
    mean_value.append(data_day[data_day['DateTime'].dt.month == month].iloc[:,2:-1].mean().to_numpy())
ave = pd.DataFrame(mean_value,columns=data_day.iloc[:,2:-1].columns,)
ave.index = ['Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
ave

| | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mar | -4.847647 | 1222.685784 | 96.554902 | 9.935104 | 935.540686 | 128.676471 | 1029.058987 | 89.529412 | 1571.943301 | 1027.379902 | 14.390114 | 50.169559 | 0.789412 |
| Apr | -60.916111 | 1111.609954 | 120.938889 | 2.500562 | 905.970139 | 29.688889 | 892.929861 | 0.394444 | 1542.734606 | 958.785185 | 8.668588 | 41.305023 | -6.613757 |
| May | -39.316532 | 1052.53237 | -199.361559 | 6.26319 | 929.319444 | 58.918011 | 925.776994 | 35.622312 | 1567.100358 | 906.856967 | 16.13489 | 38.957952 | -2.824123 |
| Jun | -19.411667 | 956.241435 | -200.0 | -0.519794 | 904.700694 | 90.305556 | 842.273495 | 64.676389 | 1620.800694 | 877.460532 | 14.521481 | 27.189063 | -9.34749 |
| Jul | -48.66129 | 1044.621192 | -200.0 | 10.31522 | 969.761089 | 112.700269 | 803.41297 | 90.63172 | 1641.220206 | 992.716846 | 29.11017 | 32.741521 | 0.972224 |
| Aug | -70.953495 | 903.146729 | -200.0 | -6.639981 | 769.968862 | -13.862903 | 767.504704 | -18.119624 | 1463.533042 | 712.882616 | 14.039819 | 26.974899 | -11.659263 |
| Sep | -43.884306 | 1049.415394 | -200.0 | 6.266196 | 963.527199 | 96.352778 | 785.024653 | -10.0375 | 1502.940856 | 1005.385301 | 19.005671 | 37.889248 | -3.746709 |
| Oct | -92.915591 | 1182.93414 | -200.0 | 13.237625 | 1056.619512 | 79.850806 | 686.830197 | -49.270161 | 1634.360999 | 1161.890569 | 20.20009 | 61.557785 | 1.1933 |
| Nov | -8.856389 | 1132.038773 | -200.0 | 12.512374 | 1011.080093 | 370.438889 | 789.762269 | 94.758333 | 1372.112037 | 1171.240625 | 13.482384 | 59.272222 | 0.939992 |
| Dec | -25.590323 | 948.593974 | -200.0 | -13.027542 | 795.069444 | 315.928763 | 766.481631 | 71.215054 | 1064.518481 | 944.656586 | -11.565435 | 30.099037 | -20.80876 |
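
The negative means for CO(GT) and the -200.0 rows for NMHC(GT) in the table above arise because the dataset tags missing readings with -200, and those tags are averaged as if they were measurements. The following is a minimal sketch, on a small hypothetical hourly sample, of masking the tag before taking monthly means. The names `sample`, `naive`, and `clean` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sample; -200 marks a missing reading, as in the dataset.
sample = pd.DataFrame(
    {"CO(GT)": [2.6, -200.0, 2.2, 1.0, -200.0, 3.0]},
    index=pd.to_datetime([
        "2004-03-10 18:00", "2004-03-10 19:00", "2004-03-10 20:00",
        "2004-04-01 00:00", "2004-04-01 01:00", "2004-04-01 02:00",
    ]),
)

# Naive monthly mean: the -200 tags drag the average below zero.
naive = sample.groupby(sample.index.month).mean()

# Masked monthly mean: NaN values are skipped by mean().
masked = sample.replace(-200, np.nan)
clean = masked.groupby(masked.index.month).mean()

print(naive)
print(clean)
```

Applying the same `replace(-200, np.nan)` step to the full frame before the monthly loop would likely change the trends read from the charts, so the comparison is worth rerunning.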


In [91]:
# Draw one line per column of `ave`. Iterating over the columns (not the row count)
# ensures that all 13 variables are plotted.
fig = make_subplots(rows=1, cols=1)
for col in ave.columns:
    fig.add_trace(go.Scatter(x=ave.index.values, y=ave[col], mode='lines+markers', name=col))
fig.update_layout(title='Monthly Value in 2004', title_x=0.5, xaxis_title='Month', yaxis_title='Value', width=1000, height=600)
fig.show()


In [92]:
heatmap = go.Heatmap(z=ave.values,x=ave.columns,y=ave.index,colorscale='Oranges', showscale=True, hoverongaps=True)
layout = go.Layout(title='Monthly value of pollutants in 2004',title_x=0.5,xaxis=dict(title='Pollutant'),yaxis=dict(title='Month'),width=1000,height=600)
fig = go.Figure(data=[heatmap],layout=layout)
fig.show()
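
The heatmap above shares one color scale across columns whose magnitudes differ by orders of magnitude (sensor responses near 1000, absolute humidity near 1), so the large-valued columns dominate the colors. One possible remedy, sketched here on a hypothetical two-column slice named `ave_sample`, is to min-max scale each column to [0, 1] before plotting. This is an illustrative option, not what the notebook does.

```python
import pandas as pd

# Hypothetical slice of the monthly table with two very different scales.
ave_sample = pd.DataFrame(
    {"PT08.S1(CO)": [1200.0, 1100.0, 1000.0], "AH": [0.8, 1.0, 0.9]},
    index=["Mar", "Apr", "May"],
)

# Rescale each column independently to the range [0, 1].
# A constant column would produce NaN here (division by zero).
scaled = (ave_sample - ave_sample.min()) / (ave_sample.max() - ave_sample.min())
print(scaled)
```

The scaled frame can then be passed to `go.Heatmap` as `z=scaled.values`, so that each column's month-to-month pattern is visible.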

# 2. Air quality based on CO
- I divided air quality into 5 labels based on the hourly CO value: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, and Very Unhealthy. The purple dots (Very Unhealthy) cover the largest area in the scatter plot, but the bar chart shows that Good accounts for the largest number of hours.

In [93]:
def mapcat(x):
    if x<=1.0:
        return 'Good'
    elif x>1.0 and x <=1.5:
        return 'Moderate'
    elif x>1.5 and x<=2.0:
        return 'Unhealthy for Sensitive Groups'
    elif x>2.0 and x <=2.5:
        return 'Unhealthy'
    else :
        return 'Very Unhealthy'
    
df = pd.read_excel('/Users/lynn/Desktop/courses/dataVisual/air+quality/AirQualityUCI.xlsx')
df['Date'] = df['Date'].astype(str)
df['Time'] = df['Time'].astype(str)
df['DateTime'] =  pd.to_datetime(df['Date'] + ' ' + df['Time'])
df=df[df['CO(GT)']>0]
#color_map = [('Good','green'), ('Moderate', 'yellow'),('Unhealthy for Sensitive Groups', 'orange'), ('Unhealthy', 'red'), ('Very Unhealthy','purple')]
color_map = {'Good':'green', 'Moderate': 'yellow','Unhealthy for Sensitive Groups': 'orange', 'Unhealthy': 'red', 'Very Unhealthy':'purple'}
df['Label']=df['CO(GT)'].map(mapcat)

scatter_traces = []
for category in df['Label'].unique():
    scatter = go.Scatter(
        x=df[df['Label'] == category]['DateTime'],
        y=df[df['Label'] == category]['CO(GT)'],
        mode='markers',
        name=category,
        marker=dict(
            size=5,
            color=color_map[category],
            symbol='circle'
                    )
    )
    scatter_traces.append(scatter)
    
layout = go.Layout(title='Air Quality', 
                   title_x=0.5, 
                   xaxis=dict(title='Date'), 
                   yaxis=dict(title='Hourly averaged concentration CO'),hovermode='closest',showlegend=True,
                   width=1000,height=600)
fig = go.Figure(data=scatter_traces, layout=layout)
fig.show()
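
The threshold logic in `mapcat` can also be written with `pd.cut`, which returns an ordered categorical. The following is a minimal sketch using the same cut points as the cell above. The sample series `co` is hypothetical, and the thresholds are the notebook's own choice, not an official air quality index.

```python
import numpy as np
import pandas as pd

labels = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
          "Unhealthy", "Very Unhealthy"]
# Bins are right-closed by default: (-inf, 1.0], (1.0, 1.5], (1.5, 2.0], ...
bins = [-np.inf, 1.0, 1.5, 2.0, 2.5, np.inf]

# Hypothetical CO readings in mg/m^3, including values on the bin edges.
co = pd.Series([0.5, 1.0, 1.2, 2.0, 2.5, 3.1])
cut_labels = pd.cut(co, bins=bins, labels=labels)
print(cut_labels.tolist())
```

Unlike `mapcat`, which sends NaN to the final `else` branch and labels it 'Very Unhealthy', `pd.cut` leaves NaN unlabeled.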


In [72]:
data = pd.Series(df['Label'])
histogram = go.Bar(x=data.value_counts().index, y=data.value_counts().values, text=data.value_counts().values, 
                   hoverinfo='text+y', marker=dict(color='skyblue'))
layout = go.Layout(title='The total hours of different air quality', title_x=0.5, 
                   xaxis=dict(title='The air quality'), yaxis=dict(title='Total hours'),
                    width=1000,height=600)
fig = go.Figure(data=[histogram], layout=layout)

fig.show()
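
`value_counts()` sorts by frequency, so the bars above appear in count order rather than severity order. The following is a small sketch, on a hypothetical label series, of reindexing the counts into a fixed severity order. The names `order` and `label_sample` are assumptions for illustration.

```python
import pandas as pd

# Severity order used for the x-axis, from best to worst.
order = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
         "Unhealthy", "Very Unhealthy"]

# Hypothetical hourly labels.
label_sample = pd.Series(
    ["Good", "Very Unhealthy", "Good", "Moderate", "Very Unhealthy", "Good"]
)

# Reindex the counts into severity order; labels that never occur get 0.
counts = label_sample.value_counts().reindex(order, fill_value=0)
print(counts)
```

Passing `counts.index` and `counts.values` to `go.Bar` keeps the categories in a consistent order across charts.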

In [88]:
import plotly.express as px
data = pd.Series(df['Label'])
fig = go.Figure()
color_sequence = px.colors.sequential.Rainbow
fig.add_trace(go.Barpolar(
    r=data.value_counts().values,
    theta=data.value_counts().index,
    name='Example',
    marker_color=data.value_counts().values,
    marker_colorscale=color_sequence,
    opacity=0.8
))

fig.update_layout(
    title='The total hours of different air quality',
    title_x=0.5,
    template='plotly_dark',
    width=1000,height=600,
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, max(data.value_counts().values)]
        )
    ),
    showlegend=False
)

fig.show()

# 3. Daily value of pollutants
- This figure plots the value of PT08.S1(CO) over time. The marker size is set by the CO value, and the marker color shows the air quality label.

In [59]:
def mapcat(x):
    if x<=1.0:
        return 'Good'
    elif x>1.0 and x <=1.5:
        return 'Moderate'
    elif x>1.5 and x<=2.0:
        return 'Unhealthy for Sensitive Groups'
    elif x>2.0 and x <=2.5:
        return 'Unhealthy'
    else :
        return 'Very Unhealthy'

df = pd.read_excel('/Users/lynn/Desktop/courses/dataVisual/air+quality/AirQualityUCI.xlsx')
df.drop('Time',axis=1,inplace=True)
df.set_index('Date',inplace=True)
daily_mean = df.resample('D').mean()
#color_map = [('Good','green'), ('Moderate', 'yellow'),('Unhealthy for Sensitive Groups', 'orange'), ('Unhealthy', 'red'), ('Very Unhealthy','purple')]
color_map = {'Good':'green', 'Moderate': 'yellow','Unhealthy for Sensitive Groups': 'orange', 'Unhealthy': 'red', 'Very Unhealthy':'purple'}
df['Label']=df['CO(GT)'].map(mapcat)
df['colors'] =df['Label'].map(color_map)
df.replace(-200.0,0.0,inplace=True)
df

| Date | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | Label | colors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-03-10 | 2.6 | 1360.00 | 150 | 11.881723 | 1045.50 | 166.0 | 1056.25 | 113.0 | 1692.00 | 1267.50 | 13.600 | 48.875001 | 0.757754 | Very Unhealthy | purple |
| 2004-03-10 | 2.0 | 1292.25 | 112 | 9.397165 | 954.75 | 103.0 | 1173.75 | 92.0 | 1558.75 | 972.25 | 13.300 | 47.700000 | 0.725487 | Unhealthy for Sensitive Groups | orange |
| 2004-03-10 | 2.2 | 1402.00 | 88 | 8.997817 | 939.25 | 131.0 | 1140.00 | 114.0 | 1554.50 | 1074.00 | 11.900 | 53.975000 | 0.750239 | Unhealthy | red |
| 2004-03-10 | 2.2 | 1375.50 | 80 | 9.228796 | 948.25 | 172.0 | 1092.00 | 122.0 | 1583.75 | 1203.25 | 11.000 | 60.000000 | 0.786713 | Unhealthy | red |
| 2004-03-10 | 1.6 | 1272.25 | 51 | 6.518224 | 835.50 | 131.0 | 1205.00 | 116.0 | 1490.00 | 1110.00 | 11.150 | 59.575001 | 0.788794 | Unhealthy for Sensitive Groups | orange |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2005-04-04 | 3.1 | 1314.25 | 0 | 13.529605 | 1101.25 | 471.7 | 538.50 | 189.8 | 1374.25 | 1728.50 | 21.850 | 29.250000 | 0.756824 | Very Unhealthy | purple |
| 2005-04-04 | 2.4 | 1162.50 | 0 | 11.355157 | 1027.00 | 353.3 | 603.75 | 179.2 | 1263.50 | 1269.00 | 24.325 | 23.725000 | 0.711864 | Unhealthy | red |
| 2005-04-04 | 2.4 | 1142.00 | 0 | 12.374538 | 1062.50 | 293.0 | 603.25 | 174.7 | 1240.75 | 1092.00 | 26.900 | 18.350000 | 0.640649 | Unhealthy | red |
| 2005-04-04 | 2.1 | 1002.50 | 0 | 9.547187 | 960.50 | 234.5 | 701.50 | 155.7 | 1041.00 | 769.75 | 28.325 | 13.550000 | 0.513866 | Unhealthy | red |
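
The cell above computes `daily_mean` but keeps working with the hourly rows, which is why the same date appears several times in the output. The following is a minimal sketch of a true daily aggregation, assuming the -200 tags are masked first. The frame `hourly` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings over two days; -200 marks a missing CO value.
hourly = pd.DataFrame(
    {
        "CO(GT)": [2.6, -200.0, 2.2, 1.0, 1.4],
        "PT08.S1(CO)": [1360.0, 1292.0, 1402.0, 1000.0, 1100.0],
    },
    index=pd.to_datetime([
        "2004-03-10 18:00", "2004-03-10 19:00", "2004-03-10 20:00",
        "2004-03-11 00:00", "2004-03-11 01:00",
    ]),
)

# Mask the missing-value tag, then average each calendar day.
daily = hourly.replace(-200, np.nan).resample("D").mean()
print(daily)
```

Plotting `daily` instead of the hourly frame would give one marker per day, which matches the section title more closely.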


In [58]:
fig = go.Figure()

for category, color in color_map.items():
    df_category = df[df['Label'] == category]
    # Add a scatter plot for each category
    fig.add_trace(go.Scatter(
        x=df_category.index.values,
        y=df_category['PT08.S1(CO)'],
        mode='markers',
        marker=dict(
            color=color,
            size= df_category['CO(GT)']+4,
            colorscale='Viridis',
            #showscale=True
        ),
        text=['CO ' + str(i) for i in df_category['CO(GT)']],
        hoverinfo='text',
        name=category,  
        legendgroup=category  
    ))
    
fig.update_layout(
    title='Daily value of pollutants',
    xaxis=dict(title='Date'),
    yaxis=dict(title='value of PT08.S1(CO)'),
    plot_bgcolor='white',
    hovermode='closest',
    width=1000, height=600,
    title_x=0.5,
    showlegend=True 
)

fig.show()

The dataset is clearly a time series, and it is a classic dataset in air quality research. The underlying work was finished in 2007 and has been cited by 582 other scientific works. In it, the authors present a neural calibration for predicting benzene concentrations using a gas multi-sensor device designed to monitor urban pollution. The experiment was conducted over a 13-month interval. The authors conclude that overall performance degrades slightly over time, with significant degradation observed in winter.


This work and its dataset have contributed to the field: experimental data of this quality is hard to obtain, and the dataset is public. Commercial purposes are fully excluded. It gives other researchers the opportunity to do further scientific work based on this dataset. The dataset also suggests that air quality improves in winter.

Reference:

De Vito, S., Massera, E., Piga, M., Martinotto, L., & Di Francia, G. (2008). On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129, 750-757.