# Data Visual Analytics for PM2.5 Time-Series Data

## Step 1: What do you want to convey

For now, we simply want to explore the dataset.


## Step 2: Prepare Your Data

Yesterday, we acquired the air quality dataset from the website. Let's see if we can visualize it better. Since we have already cleaned it and stored it in the database, let's start from there.

In [None]:
import sqlite3
import pandas as pd
import numpy as np 
import os
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
warnings.filterwarnings('ignore')

# to display plot inline in this notebook
%matplotlib inline

In [None]:
# create connection
conn = sqlite3.connect('data/thailand_cities.db')

# To retrieve the table names inside the database
df_all = pd.read_sql_query('SELECT name FROM sqlite_master WHERE type=\'table\';', conn)
# Tables: cities, and pm_by_city

# Read the data and put it into the Pandas dataframe
df_cities = pd.read_sql_query('SELECT * FROM cities', conn)
df_pm = pd.read_sql_query('SELECT * FROM pm_by_city', conn)
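
The table-listing query and the `SELECT *` load above can be tried without the course database. The following is a minimal sketch that uses an in-memory SQLite database; only the table names `cities` and `pm_by_city` come from this notebook, while the column layout and the rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database with made-up rows (illustration only)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cities (city TEXT, country TEXT)')
conn.execute('CREATE TABLE pm_by_city (city TEXT, "PM2.5" REAL)')
conn.execute("INSERT INTO pm_by_city VALUES ('Bangkok', 35.0), ('Songkhla', 12.5)")
conn.commit()

# List the tables, then load one of them into a DataFrame
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn)
df_demo = pd.read_sql_query('SELECT * FROM pm_by_city', conn)

# Close the connection once the data is in memory
conn.close()

print(tables['name'].tolist())
print(df_demo)
```

Once the rows are in a DataFrame, the connection is no longer needed, so closing it early is a reasonable habit.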

In [None]:
# Display sample data
df_cities.head()

In [None]:
# Display sample data
df_pm.head()

In [None]:
df_pm.describe()

#### Filter Data
1) By a specific keyword

In [None]:
df_pm_bangkok = df_pm[(df_pm['city'] == 'Bangkok')]
df_pm_bangkok.describe()

**<font color='brown'>Exercise: Select all data for Songkhla and store it in df_pm_songkhla (this dataframe is used again later)</font>**

In [None]:
# fill your code here


2) By a list of keywords

In [None]:
options = ['Bangkok', 'Songkhla']
df_pm_bs = df_pm[df_pm['city'].isin(options)]
df_pm_bs.describe()
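
Both filters work by building a boolean mask, one True/False value per row, and using it to select rows. The sketch below shows the masks explicitly on a small made-up dataframe; the values are illustrative and not taken from the real dataset.

```python
import pandas as pd

# Made-up sample rows (illustration only)
df_demo = pd.DataFrame({
    'city': ['Bangkok', 'Songkhla', 'Chiang Mai', 'Bangkok'],
    'PM2.5': [40.0, 15.0, 60.0, 25.0],
})

# 1) Single keyword: the comparison yields one boolean per row
mask_one = df_demo['city'] == 'Bangkok'
one_city = df_demo[mask_one]

# 2) List of keywords: isin() is True where the value appears in the list
mask_many = df_demo['city'].isin(['Bangkok', 'Songkhla'])
many_cities = df_demo[mask_many]

print(mask_one.tolist())
print(len(one_city), len(many_cities))
```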

## Step 3: Pick the Best Plot

One of the first plots we will try is a <b>line plot</b>. It displays information as a series of data points, called 'markers', connected by straight line segments.

### Line Plot

In [None]:
df_pm_bangkok['PM2.5'].describe()
#df_pm_bangkok

In [None]:
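# To avoid "TypeError: Empty 'DataFrame': no numeric data to plot",
# convert the columns that hold numeric values to a numeric type first.
df_pm_bangkok = df_pm_bangkok.copy()  # work on a copy, not a slice of df_pm
for c in df_pm_bangkok.columns:
    try:
        df_pm_bangkok[c] = pd.to_numeric(df_pm_bangkok[c])
    except (ValueError, TypeError):
        pass  # leave non-numeric columns (such as city) unchanged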
    
df_pm_bangkok.head()
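
Values read from a database often arrive as text, which is why plotting fails until they are converted. The sketch below, on a made-up series, shows the conversion with `errors='coerce'`, which turns entries that cannot be parsed into NaN. That is a different choice from the loop above, which leaves non-numeric columns untouched.

```python
import pandas as pd

# Made-up values stored as text, as they might come back from a database
raw = pd.Series(['12.5', '30', 'n/a'])

# errors='coerce' replaces unparseable entries with NaN instead of raising
clean = pd.to_numeric(raw, errors='coerce')

print(raw.dtype, '->', clean.dtype)
print(clean.isna().sum(), 'value(s) could not be parsed')
```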

In [None]:
df_pm_bangkok.plot(y='PM2.5')
plt.show()

#### Adjust and Add Element

In [None]:
# to adjust the size of the chart
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(15,3))

# Add X and Y axis label
plt.xlabel('Hour')
plt.ylabel('PM2.5 Level')

# Add title
plt.title('PM2.5 Level in Bangkok from year 2016 - 2019')

# Add plot
plt.plot('PM2.5', data=df_pm_bangkok, color='skyblue', linewidth=1)

# Add legend
plt.legend()

plt.show()

**<font color="brown">Exercise: Create a line plot of PM2.5 in Songkhla (Note: you can use the df_pm_songkhla dataframe from the exercise in Step 2)</font>**

In [None]:
# fill your code here


### Multiple Line Plot

In [None]:
# Prepare Songkhla dataset
df_pm_songkhla.head()

In [None]:
# Convert data to numeric value (if you haven't done it in the exercise above)
#for c in df_pm_songkhla.columns:
#    df_pm_songkhla[c] = pd.to_numeric(df_pm_songkhla[c], errors='ignore')
    
# Reset index to start with 0
df_pm_songkhla.reset_index(drop=True, inplace=True)

df_pm_songkhla.head()

In [None]:
# multiple line plot
plt.figure(figsize=(15,5))
plt.plot('PM2.5', data=df_pm_bangkok, color='skyblue', linewidth=1)
plt.plot('PM2.5', data=df_pm_songkhla, color='blue', linewidth=1)
plt.legend(['Bangkok', 'Songkhla'])
plt.show()

**<font color="red">Does this chart represent data correctly?</font>**

### Histogram

In [None]:
data = df_pm_bangkok[df_pm_bangkok['Year'] == 2019]
plt.hist(data['PM2.5'], bins='auto')  # arguments are passed to np.histogram
plt.title("The histogram showing the distribution of PM2.5 Level in Year 2019 in Bangkok")
plt.xlabel('PM2.5 Level')
plt.ylabel('frequencies')
plt.show()

**<font color="brown">Exercise: Create a histogram plot showing distribution of PM2.5 in Songkhla in year 2018 </font>**

In [None]:
# fill your code here


### Time-based indexing

To treat your data as a time series in pandas, create a DateTime column and set it as the index (time-based indexing). You can then use pandas' time-series features to organize and access the data intuitively.

In [None]:
# Take a copy so that the columns added later do not write into a slice
df_bkk = df_pm_bangkok[['Year','Month','Day','UTC Hour', 'PM2.5', 'city']].copy()
df_bkk.head()

In [None]:
from datetime import datetime

# Create an additional date column
df_bkk['DateTime'] = df_bkk.apply(lambda row: datetime(int(row['Year']), int(row['Month']), int(row['Day']), int(row['UTC Hour'])), axis=1)
df_bkk.head()

In [None]:
# set date as index
df_bkk.set_index('DateTime', inplace = True)
df_bkk.head()
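
The row-wise `apply` above works, but it calls Python once per row. As an alternative sketch, `pd.to_datetime` can assemble timestamps from a dataframe whose columns are named `year`, `month`, `day` and `hour`. The sample rows below are made up; the source column names follow this notebook.

```python
import pandas as pd

# Made-up rows using the same column names as df_bkk
df_demo = pd.DataFrame({
    'Year': [2016, 2016], 'Month': [6, 6], 'Day': [24, 24],
    'UTC Hour': [0, 1], 'PM2.5': [30.0, 28.0],
})

# Rename the date parts to the names pd.to_datetime expects
parts = df_demo[['Year', 'Month', 'Day', 'UTC Hour']].rename(
    columns={'Year': 'year', 'Month': 'month', 'Day': 'day', 'UTC Hour': 'hour'})

# Build all timestamps in one vectorized call, then use them as the index
df_demo['DateTime'] = pd.to_datetime(parts)
df_demo = df_demo.set_index('DateTime')

print(df_demo.index)
```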

With time-based indexing, we can use date/time formatted strings to select data in our DataFrame with the loc accessor. The indexing works similar to standard label-based indexing with loc, but with a few additional features.

For example, we can select data for a single day using a string such as '2016-06-24', a range of days using a slice, or a whole month using a string such as '2016-06'.

In [None]:
print(df_bkk.loc['2016-06-24'])

print(df_bkk.loc['2016-06-24':'2016-06-26'])

print(df_bkk.loc['2016-06'])
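
To see how much each of these selections returns, the following sketch applies the same three patterns to three days of made-up hourly readings. Note that, unlike ordinary Python slices, a date-string slice includes its end point.

```python
import numpy as np
import pandas as pd

# Three days of made-up hourly readings (72 values)
idx = pd.date_range('2016-06-24', periods=72, freq=pd.Timedelta(hours=1))
s = pd.Series(np.arange(72, dtype=float), index=idx, name='PM2.5')

one_day = s.loc['2016-06-24']                # all 24 hours of one day
two_days = s.loc['2016-06-24':'2016-06-25']  # the end date is included
whole_month = s.loc['2016-06']               # every row in June 2016

print(len(one_day), len(two_days), len(whole_month))
```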

In [None]:
# Adding weekday name
# (weekday_name was removed from pandas; day_name() is its replacement)
df_bkk['Weekday Name'] = df_bkk.index.day_name()
df_bkk.sample(5, random_state = 0)

### Visualizing time series data

With pandas and matplotlib, we can easily visualize our time series data. In this section, we'll cover a few examples and some useful customizations for our time series plots.

In [None]:
# Set figure size
plt.figure(figsize=(15,3))

# Add X and Y axis label
plt.xlabel('Date')
plt.ylabel('PM2.5 Level')

# Add title
plt.title('PM2.5 Level in Bangkok from year 2016 - 2019')

# Add plot
plt.plot('PM2.5', data=df_bkk, color='skyblue', linewidth=0.5)

# Add legend
plt.legend()

plt.show()

#### Resampling from hourly to weekly using mean value

The ~26,000 hourly samples are far too dense for us to make much sense of. We can gain more insight by resampling the data to a coarser grid. Let's resample by week:

In [None]:
%matplotlib inline
import seaborn as sb; sb.set()
import matplotlib.dates as mdates

# Average only the PM2.5 column; text columns such as city cannot be averaged
bkk_weekly = df_bkk[['PM2.5']].resample('W').mean()
bkk_weekly['PM2.5'].plot(style=':')
plt.ylabel('Weekly Average PM2.5')
plt.show()
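
Resampling groups the timestamps into fixed calendar bins and reduces each bin to one value. The sketch below uses two weeks of made-up hourly data, constant within each week, so the weekly means are easy to check by eye. With `'W'`, pandas uses weeks that end on Sunday and labels each bin with that Sunday.

```python
import numpy as np
import pandas as pd

# Two weeks of made-up hourly data starting on Monday 2018-01-01:
# 10.0 throughout the first week, 20.0 throughout the second
idx = pd.date_range('2018-01-01', periods=14 * 24, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.repeat([10.0, 20.0], 7 * 24), index=idx, name='PM2.5')

# One mean value per week
weekly = hourly.resample('W').mean()

print(weekly)
```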

In [None]:
# add day of year in a new column
df_bkk['day#'] = df_bkk.index.dayofyear

# resample hourly data into daily data with mean value
bkk_day = df_bkk[['Year', 'Month', 'Day', 'PM2.5', 'day#']].resample('D').mean()
bkk_day.head()

In [None]:
# Plot daily data using line plot
fig, ax = plt.subplots()
ax.plot(bkk_day.loc['2017-01':'2017-02', 'PM2.5'], marker='o', linestyle='-')

# Add some additional information 
ax.set_ylabel('Daily PM2.5')
ax.set_title('Jan-Feb 2017: PM2.5 ')

# Set x-axis major ticks to weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))

# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));


In [None]:
# multiple line plot
plt.figure(figsize=(15,5))
plt.plot('PM2.5', data=bkk_day.loc['2016'], color='blue', linewidth=0.5)
plt.plot('PM2.5', data=bkk_day.loc['2017'], color='skyblue', linewidth=0.5)
plt.plot('PM2.5', data=bkk_day.loc['2018'], color='orange', linewidth=0.5)
plt.plot('PM2.5', data=bkk_day.loc['2019'], color='green', linewidth=0.5)
plt.legend(['2016','2017','2018','2019'])
plt.show()

***<font color="blue">What do you see from the chart above? Do you see any seasonal trend? How are you going to plot that?</font>***

**<font color="brown">Exercise: Create a plot similar to the one above for data from Songkhla</font>**

In [None]:
# fill your code here



#### Seasonality

Instead of one line plot, let's further explore the seasonality of our data with **multiple line plots**, one per year, plotted against the day of the year.

In [None]:
# multiple line plots
plt.figure(figsize=(15,5))
plt.plot('day#', 'PM2.5', data=bkk_day.loc['2016'], color='blue', linewidth=0.5)
plt.plot('day#', 'PM2.5', data=bkk_day.loc['2017'], color='skyblue', linewidth=0.5)
plt.plot('day#', 'PM2.5', data=bkk_day.loc['2018'], color='orange', linewidth=0.5)
plt.plot('day#', 'PM2.5', data=bkk_day.loc['2019'], color='green', linewidth=0.5)
plt.legend(['2016','2017','2018','2019'])
plt.show()
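
Another way to line the years up, sketched here on made-up daily data, is to pivot the table so that each year becomes a column indexed by day of year. The column name `year` is an illustrative choice; `day#` follows this notebook.

```python
import numpy as np
import pandas as pd

# Two years of made-up daily values (0.0, 1.0, 2.0, ...)
idx = pd.date_range('2017-01-01', '2018-12-31', freq='D')
daily = pd.DataFrame({'PM2.5': np.arange(len(idx), dtype=float)}, index=idx)

# Add the two keys used to reshape the table
daily['year'] = daily.index.year
daily['day#'] = daily.index.dayofyear

# Rows: day of year, columns: year, cells: PM2.5
by_year = daily.pivot(index='day#', columns='year', values='PM2.5')

print(by_year.head())
```

Calling `by_year.plot()` on such a table would draw one line per year on a shared day-of-year axis.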

The daily data provide good insight so far. Next, let's explore seasonality and distribution using **box plots**. We use seaborn's boxplot() function to group the data by time period and display the distribution of each group. We first group the data by month to visualize yearly seasonality.

In [None]:
plt.figure(figsize=(15,5))
sb.boxplot(data=bkk_day, x='Month', y='PM2.5')
plt.show()
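
Each box in a box plot is drawn from the quartiles of its group. As a rough sketch of the numbers behind the picture, the following computes the quartiles per month on made-up daily data. It uses pandas' default linear interpolation for quantiles, and it does not reproduce the whiskers or outlier points that seaborn also draws.

```python
import numpy as np
import pandas as pd

# Made-up daily data: January rises from 1 to 31, February is constant at 5
idx = pd.date_range('2018-01-01', '2018-02-28', freq='D')
values = np.concatenate([np.arange(1.0, 32.0), np.full(28, 5.0)])
daily = pd.DataFrame({'PM2.5': values}, index=idx)

# Quartiles per month: the box spans 0.25 to 0.75, the line inside is 0.5
summary = (daily.groupby(daily.index.month)['PM2.5']
                .quantile([0.25, 0.5, 0.75])
                .unstack())

print(summary)
```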


In [None]:
fig, axes = plt.subplots(2, 1, figsize=(15, 6), sharex=True)
for name, ax in zip([2017, 2018], axes):
    sb.boxplot(data=bkk_day[(bkk_day['Year'] == name)], x='Month', y='PM2.5', ax=ax)
    ax.set_ylabel('PM2.5')
    ax.set_title('PM2.5 Distribution in Year ' + str(name))

**<font color="brown">Exercise: Plot a graph to compare Bangkok and Songkhla</font>**

In [None]:
# fill your code here

### Bokeh

We use a Bokeh heatmap to show the maximum PM2.5 value for each year and month.

In [None]:
# Import only the names used below; the unused "import datetime" is dropped
# because it would shadow the datetime class imported earlier
from bokeh.plotting import figure, show, output_file, output_notebook
from bokeh.palettes import brewer
from bokeh.models import HoverTool, ColumnDataSource, LinearColorMapper, BasicTicker, PrintfTickFormatter, ColorBar

In [None]:
temp_df = df_bkk[['Year', 'Month', 'PM2.5']].groupby(['Year','Month']).max().reset_index()
temp_df.head()

In [None]:
# output to notebook
output_notebook()

TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom,tap"
hm = figure(title="Month-Year wise PM2.5", tools=TOOLS, toolbar_location='above')

source = ColumnDataSource(temp_df)
colors = brewer['BuGn'][9]
colors = colors[::-1]
mapper = LinearColorMapper(
    palette=colors, low=temp_df['PM2.5'].min(), high=temp_df['PM2.5'].max())
# width=1 so that each cell covers exactly one year and cells do not overlap
hm.rect(x="Year", y="Month",width=1,height=1,source = source,  
    fill_color={
        'field': 'PM2.5',
        'transform': mapper
    },
    line_color=None)
color_bar = ColorBar(
    color_mapper=mapper,
    major_label_text_font_size="10pt",
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(),
    label_standoff=6,
    border_line_color=None,
    location=(0, 0))

hm.add_layout(color_bar, 'right')
hm.xaxis.axis_label = 'Year'
hm.yaxis.axis_label = 'Month'
hm.select_one(HoverTool).tooltips = [
    ('Year', '@Year'),('Month', '@Month'), ('MAX PM2.5', '@PM2.5')
]

#output_file("heatmap.html", title="Heat Map")

show(hm)  

## Reference

https://www.kaggle.com/neerjad/time-series-visualization-using-bokeh/data

https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/
