<center>
<img src="https://www.iybssd2022.org/wp-content/uploads/ASAQ.jpg" width="150"/> 
</center>

        
<center>
<h1><font color= "orange" size="+2">ASAQ SEMINAR SERIES ON AIR QUALITY DATA ANALYSIS AND VISUALIZATION</font></h1>
</center>

---

<center><h1><font color="red" size="+2">Overview of Python Tools for Air Quality Data Analysis</font></h1></center>

## <font color="red">Objectives</font>

In this presentation:

- We introduce a few Python tools that you can use to analyze air quality data.
- We use a real-life, geolocated time-series dataset to perform tasks such as:
   - Descriptive statistics
   - Heatmaps
   - Simple visualizations
   - Interactive visualizations

The steps shown here are exploratory data analysis tasks, typically carried out before creating a model.

## <font color="red">Reference documents</font>


- [air-quality-analysis](https://github.com/binh-bk/air-quality-analysis): Jupyter notebooks and Python code for analyzing air quality (fine particles, PM2.5)
- [Practical Application of Python for Air Quality Data Analysis and Modeling](https://www.cmascenter.org/conference//2018/slides/kim_practical_application_2018.pdf)
- [Visualization with Seaborn](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb)

----

## <font color="red">Required modules/packages</font>

- __Seaborn__: A data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- __Skimpy__: A lightweight tool for creating summary statistics from dataframes.
- __Matplotlib__: for basic plots.
- __Pandas__: Manipulation and exploratory data analysis of tabular data.
- __Shapely__: For manipulation and analysis of planar geometric objects.
- __GeoPandas__: Combines the capabilities of Pandas and Shapely for geospatial operations.
- __MovingPandas__: Handling the movement of geospatial objects.
- __Plotly__: Python graphing library that makes interactive graphs.

---

# <font color="green">Uncomment the cell below if using Google Colab</font>

In [1]:
#!pip install movingpandas
#!pip install hvplot
#!pip install holoviews
#!pip install skimpy
#!pip install plotly

---

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.ticker as mticker
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable

In [None]:
from shapely import geometry as shpgeom
from shapely import wkt as shpwkt

In [None]:
import pandas as pd
import geopandas as gpd
import movingpandas as mpd

In [None]:
import skimpy

In [None]:
import seaborn as sns
#sns.set_context("notebook", font_scale=1.3)
#sns.set_style('whitegrid')

In [None]:
import holoviews as hv

In [None]:
import hvplot.pandas 

In [None]:
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"

In [None]:
mpd.show_versions()

## <font color="red">Data Access</font>

File location (URL):

In [None]:
#file_name = "L2-01-08-2023evening.xlsx"
#data_url = "/".join(["../sample_data", file_name])
data_url = "https://github.com/JulesKouatchou/asaq_py/raw/main/sample_data/L2-01-08-2023evening.xlsx"

### <font color="blue">Read the file</font>

- We use `Pandas` to read the Excel file.
- We obtain a `DataFrame`, which holds data organized in labeled rows and columns.
  - Each row is considered a data point.
  - Each column can be seen, for instance, as the set of latitudes or the measurements of a specific field.
     - All the values of a given column are of the same data type (integer, float, boolean, etc.).
     - Each column is a `Pandas` `Series`, typically backed by a one-dimensional `NumPy` array.
- A `DataFrame` can be thought of as a collection of such one-dimensional columns sharing the same row index.
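
As a small illustration of these points, the sketch below builds a toy `DataFrame` and shows that each column has a single data type and can be viewed as a one-dimensional `NumPy` array. The column names and values are invented for the example; they are not taken from the ASAQ file.

```python
import numpy as np
import pandas as pd

# Toy data: names and values are invented for illustration only.
toy = pd.DataFrame({
    "lat": [3.86, 3.87, 3.88],
    "n_particles": [120, 135, 128],
    "valid": [True, True, False],
})

# One data type per column.
print(toy.dtypes)

# A column is a pandas Series; its values can be viewed as a 1-D NumPy array.
lat_values = toy["lat"].to_numpy()
print(type(lat_values), lat_values.shape)
```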

In [None]:
df = pd.read_excel(data_url, sheet_name="Feuil1")

In [None]:
type(df)

In [None]:
df

#### Quick observations
- There are 27 labeled columns
   - The first two columns appear to be related to the time
   - Three columns contain the latitude, longitude and altitude information.
   - One column has the speed.
   - The remaining columns have measurement related data
- There are 4732 rows (data points)
   - Each row has an index, from 0 to 4731.
   - Each data point consists of 27 values.


### <font color="blue"> Obtain basic data information</font>
We can get the column count, number of values in each column, data type of each column, etc.:

In [None]:
df.info()

### <font color="blue">Conversion of the time</font>

- Note that the columns _datetime_ and _time_ hold the same values: the time at which each measurement was taken, in the format `HH:MM:SS`.
- We are dealing here with time series data.
- When we deal with such data, it is important that the date/time values are converted into Python `datetime` objects.
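
To make the conversion concrete, here is a minimal sketch using `pd.to_datetime` on a few made-up `HH:MM:SS` strings. The sample values and the calendar date attached to them are assumptions for this example only; they are not read from the ASAQ file.

```python
import pandas as pd

# Hypothetical time stamps in the HH:MM:SS format.
raw_times = pd.Series(["18:05:01", "18:05:02", "18:06:10"])

# A time of day alone is not a full timestamp, so we attach a date.
# The date below is a placeholder chosen for this example.
stamps = pd.to_datetime("2023-01-01 " + raw_times, format="%Y-%m-%d %H:%M:%S")

print(stamps.dtype)
print(stamps.dt.minute.tolist())  # [5, 5, 6]
```

Once the values are true timestamps, components such as the hour or the minute are available through the `.dt` accessor, and time-based operations such as resampling become possible.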


We read the file again:

- Transform the values of the _datetime_ column into Python `datetime` objects.
- Rename the column to _t_.

In [None]:
df = pd.read_excel(data_url, 
                   sheet_name="Feuil1",
                  parse_dates={'t': [0]}
                  )
df

#### Make the time the index of the DataFrame

In [None]:
df.set_index('t', inplace=True)

In [None]:
df.info()

We can remove the column `time`:

In [None]:
df = df.drop(columns=['time'])
df

In [None]:
df.info()

## <font color="red"> Obtain Descriptive Statistics</font>

This only applies to numeric columns.

In [None]:
df.describe().T
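
For reference, the sketch below runs `describe()` on a toy `DataFrame` (invented values and an invented `site` column) to show that a text column is left out by default and that `.T` places one variable per row.

```python
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "PM2.5": [10.0, 12.0, 14.0, 16.0],
    "CO2": [400, 410, 420, 430],
    "site": ["a", "a", "b", "b"],   # non-numeric column, skipped by default
})

summary = toy.describe().T
print(summary[["count", "mean", "min", "max"]])
```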

__We can also use `Skimpy` to provide summary statistics.__

- It gives a richer view of the information returned by the `describe()` function.

In [None]:
skimpy.skim(df)

## <font color="red">Basic Visualization with `Matplotlib` and `Pandas`</font>

#### Line plot

In [None]:
df["PM2.5"].plot(figsize=(15,5))

Averaging over one-minute intervals makes the graph less noisy:

In [None]:
dft = df[['PM2.5']].resample('1min')

In [None]:
dft.mean().plot(figsize=(15,5), kind='line')
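
To see what `resample('1min').mean()` does, the sketch below applies it to six invented readings taken 20 seconds apart. Each one-minute bin collects three readings and is replaced by their average.

```python
import pandas as pd

# Six invented readings, one every 20 seconds.
idx = pd.date_range("2023-01-01 18:00:00", periods=6,
                    freq=pd.Timedelta(seconds=20))
toy = pd.DataFrame({"PM2.5": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]}, index=idx)

# Group the readings into one-minute bins, then average each bin.
per_minute = toy.resample("1min").mean()
print(per_minute)  # two rows: 20.0 for 18:00 and 50.0 for 18:01
```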

#### Combine mean and standard deviation

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(dft.mean())
ax.plot(dft.std())

In [None]:
std = dft.std()

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
pm_mean = dft.mean()["PM2.5"]  # minute-averaged PM2.5 (Series)
pm_std = std["PM2.5"]          # matching standard deviation (Series)
# Shade the band mean +/- one standard deviation.
ax.fill_between(pm_mean.index,
                pm_mean - pm_std,
                pm_mean + pm_std,
                color='gray',
                alpha=0.5)
ax.plot(pm_mean.index, pm_mean.values)
ax.set_xlabel('Datasource: ASAQ', fontsize=10)
ax.set_title('Minute averaged $PM_{2.5}$',
             color='navy',
             fontsize=20, y=1.05)
# Raw string so that \mu is passed to Matplotlib's math renderer unchanged.
ax.set_ylabel(r'Concentration, $\mu g/m^3$');
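
As an alternative way to obtain the bounds of the shaded band, the sketch below computes the mean and standard deviation in one call with `agg` and derives the lower and upper limits as new columns. The data are invented; with the real data one would start from `df["PM2.5"]` instead.

```python
import pandas as pd

# Six invented readings, one every 20 seconds.
idx = pd.date_range("2023-01-01 18:00:00", periods=6,
                    freq=pd.Timedelta(seconds=20))
pm = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=idx, name="PM2.5")

# One pass over the resampler gives both statistics as columns.
stats = pm.resample("1min").agg(["mean", "std"])
stats["lower"] = stats["mean"] - stats["std"]
stats["upper"] = stats["mean"] + stats["std"]
print(stats)
```

The `lower` and `upper` columns can then be passed directly to `ax.fill_between(stats.index, stats["lower"], stats["upper"])`.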

## <font color="red">Perform visualization with `Seaborn`</font>

- `Seaborn` is a Python library for data visualization that offers a user-friendly interface for producing visually appealing and informative statistical graphics.
- It is designed to work with Pandas dataframes, making it easy to visualize and explore data quickly and effectively.

#### Line plot

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.lineplot(data=dft.mean(), ax=ax)
ax.set_xlabel('Datasource: ASAQ', fontsize=10)
plt.title('Minute averaged $PM_{2.5}$ ', fontsize=15, y=1.05)
plt.ylabel(r'Concentration, $\mu g/m^3$');

#### Scatterplot with varying point sizes and hues

In [None]:
# relplot returns a FacetGrid (not a Matplotlib Axes), so we name it g.
g = sns.relplot(data=df, x="RH", y='CO2',
                hue='altitude',
                size='TPM',
                alpha=.75,
                #palette="muted",
                height=5,
               )
g.set(ylabel="CO2");

#### Scatterplot Matrix

In [None]:
sns.pairplot(df);

#### Joint distribution with hexagonal bins

In [None]:
g = sns.jointplot(
    data=df,
    x="TPM", y="CO2",
    kind="hex", color="#4CB391"
)

## <font color="red">Correlation</font>

#### Create the heatmap with `Seaborn`

In [None]:
fig, ax = plt.subplots(figsize=(12, 11))
sns.heatmap(df.corr(), cmap='seismic', ax=ax)
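
Each cell of the heatmap is a correlation coefficient between two columns. By default, `df.corr()` uses the Pearson coefficient. The sketch below recomputes it from its definition on two invented columns and compares the result with the value returned by `Pandas`.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "PM2.5": [10.0, 12.0, 15.0, 19.0, 24.0],
    "RH": [80.0, 76.0, 71.0, 69.0, 60.0],
})

# Pearson r from its definition: sum of products of the centered values,
# divided by the square root of the product of their sums of squares.
x = toy["PM2.5"] - toy["PM2.5"].mean()
y = toy["RH"] - toy["RH"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity as reported by pandas.
r_pandas = toy.corr().loc["PM2.5", "RH"]
print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to -1 or +1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.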

#### Create the heatmap with `Plotly`

Note that this plot is interactive.

In [None]:
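# Compute the correlation matrix once and take the axis labels from it,
# so that the labels always match the rows and columns of the matrix.
corr = df.corr()
mat = px.imshow(corr, x=corr.columns, y=corr.index,
                title="Correlation matrix", width=1000, height=1000)
mat.show()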

#### Correlation only with PM2.5

In [None]:
fig, ax = plt.subplots(figsize=(6,10))
df.corr()['PM2.5'].sort_values().to_frame().drop('PM2.5').plot.barh(ax=ax)

#### Create a donut plot for pollutant concentrations

In [None]:
# Define pollutants and their colors
pollutants = ["PM1", "PM2.5", "PM5", "PM10", "N_1-2.5", "N_2.5-5", "N_0.5-1", "N_5-10", "N_10"]
pollutant_colors = px.colors.qualitative.Plotly

# Calculate the sum of pollutant concentrations
total_concentrations = df[pollutants].sum()

# Create a DataFrame for the concentrations
concentration_data = pd.DataFrame({
    "Pollutant": pollutants,
    "Concentration": total_concentrations
})

# Create a donut plot for pollutant concentrations
fig = px.pie(concentration_data, names="Pollutant", values="Concentration",
             title="Pollutant Concentrations",
             hole=0.4, color_discrete_sequence=pollutant_colors)

# Update layout for the donut plot
fig.update_traces(textinfo="percent+label")
fig.update_layout(legend_title="Pollutant")

## <font color="red">Data Manipulation with `MovingPandas`</font>

- A Python library for handling the movement of geospatial objects.
- Provides trajectory data structures and functions for movement data exploration and analysis.
- It is based on Pandas, GeoPandas, and HoloViz.

#### Convert the Pandas DataFrame into a MovingPandas trajectory


```python
Trajectory(df, traj_id, 
           obj_id=None, t=None, x=None, y=None, 
           crs='epsg:4326', parent=None)
```

- __df__ (GeoDataFrame or DataFrame) – GeoDataFrame with point geometry column and timestamp index
- __traj_id__ (any) – Trajectory ID
- __obj_id__ (any) – Moving object ID
- __t__ (string) – Name of the DataFrame column containing the timestamp
- __x__ (string) – Name of the DataFrame column containing the x coordinate
- __y__ (string) – Name of the DataFrame column containing the y coordinate
- __crs__ (string) – CRS of the x/y coordinates
- __parent__ (Trajectory) – Parent trajectory

In [None]:
traj = mpd.Trajectory(df, x="lng", y="lat", traj_id=1)

### <font color="blue">Processing the Trajectory</font>

- `MovingPandas` can compute the distance, speed, and acceleration of movement along the trajectory (between consecutive points). Here we compute the distance and the speed.
- The computed values are added as new columns.

#### Compute the distance and the speed

In [None]:
traj.add_distance(overwrite=True, name="distance (m)")

In [None]:
traj.add_speed(overwrite=True, name="speed (m/s)")

In [None]:
traj.df
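
To give a feel for what the new distance and speed columns represent, the sketch below computes them by hand for two invented GPS fixes, using the haversine (great-circle) formula from the standard library only. This is an illustration under stated assumptions, not the `MovingPandas` implementation: the helper `haversine_m`, the coordinates, and the 10-second gap are all made up, and `MovingPandas` may use a different distance formula, so its values can differ slightly.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))

# Two invented fixes (lat, lon) taken 10 seconds apart.
p0 = (3.8600, 11.5200)
p1 = (3.8610, 11.5200)
elapsed_s = 10.0

distance_m = haversine_m(*p0, *p1)   # distance between consecutive points
speed_ms = distance_m / elapsed_s    # speed = distance / elapsed time
print(f"{distance_m:.1f} m, {speed_ms:.2f} m/s")
```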

#### Perform visualization

In [None]:
traj.plot();

Plot CO2 along the path:

In [None]:
fig, ax = plt.subplots(figsize=(12,10))

traj.plot(legend=True, 
           column="CO2", 
           capstyle='round', 
              cmap="jet", ax=ax);

In [None]:
# hvplot draws with HoloViews/Bokeh, so no Matplotlib figure or axes is needed.
mytile = "EsriImagery"
traj.hvplot(tiles=mytile, c="CO2", line_width=5, cmap='Dark2')