<a href="https://colab.research.google.com/github/MLGlobalHealth/StatML4PopHealth/blob/main/assessments/groupwork_instruction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src="https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/main/practicals/resources/logos/imperial.png" width="250" vspace="8"/>
<img src="https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/main/practicals/resources/logos/mlgh.png" width="220" hspace="50" vspace="5"/>
<img src="https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/main/practicals/resources/logos/ammi.png" width="190"/>

<font size="6">Modern Statistics and Machine Learning<br> for Population Health in Africa </font>

<font size="4">24th - 28th March 2025</font>

</center>

## Data

Today we are going to consider an air pollution data set for eight cities in six African countries: Lagos, Accra, Nairobi, Yaounde, Bujumbura, Kisumu, Kampala, and Gulu. [Fine particulate matter of 2.5 microns or less in diameter (PM2.5) in particular is linked to poor health outcomes and millions of premature deaths globally](https://www.thelancet.com/journals/lanplh/article/PIIS2542-5196(24)00003-2/fulltext). [In the UK, PM2.5 emissions have decreased by $>85\%$ since 1970](https://www.gov.uk/government/statistics/emissions-of-air-pollutants/emissions-of-air-pollutants-in-the-uk-particulate-matter-pm10-and-pm25), but in many other parts of the world PM2.5 emissions have increased to extremely dangerous levels.

PM2.5 is most accurately measured through on-the-ground sensor networks, but these are expensive to deploy and maintain. [We will focus on predicting PM2.5 concentrations using satellite-based PM2.5 data that are themselves derived from Aerosol Optical Depth (AOD) measurements of the Sentinel 5P satellite instrument, available through the Google Earth Engine Data Catalog.](https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p)

In fact, many different atmospheric variables are measured and can be used as features to predict PM2.5 concentrations. Such predictions give communities access to crucial air quality information, provide the evidence needed to tackle local pollution challenges, and help improve public health. The data that we consider today were previously used as part of a [prediction challenge](https://zindi.africa/competitions/airqo-african-air-quality-prediction-challenge/data) to support the [Clean Africa Air network](https://www.airqo.net).

For our purposes today, we will use only the noisy satellite-based PM2.5 measurements to predict actual daily average PM2.5 concentrations using Gaussian process (GP) methodology. We will also focus on a single location to simplify matters. Our main objective is to estimate the number of days in a full year on which PM2.5 concentrations were at unhealthy levels, above 35 μg/m³, [posing a significant risk to the general population as per WHO guidelines](https://www.who.int/publications/i/item/9789240034228).

In [None]:
# Install CmdStanPy for Google Colab
!curl -O "https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/main/practicals/resources/scripts/utilities.py"
from utilities import custom_install_cmdstan, test_cmdstan_installation
custom_install_cmdstan()

In [None]:
import os
import pickle
from pathlib import Path

import arviz as az
from cmdstanpy import CmdStanModel
import numpy as np
import pandas as pd
import folium
from datetime import timedelta,datetime

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Aesthetics
sns.set_theme(style="whitegrid")
font = {"family": "sans-serif",
        "weight": "normal",
        "size": 10.5}
mpl.rc('font', **font)

In [None]:
from google.colab import drive
drive.mount('/content/drive')
output_dir = Path(*["drive", "MyDrive", "short_course", "output"])
output_dir.mkdir(parents=True, exist_ok=True)

In [None]:
# get the input data
!curl -O "https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/main/data/sentinel_5p_particulate_matter.csv"

## Select and explore PM2.5 data

Let us load and have a look at the satellite-based data of atmospheric variables in the eight cities:


In [None]:
# put WHO classification of PM2_5 health risks into table
dwho = {
    'risk': ['good','moderate','unhealthy for sensitive groups','unhealthy','very unhealthy'],
    'pm25_low': [0, 10, 25, 35, 55],
    'pm25_high': [10, 25, 35, 55,1000],
    'text': ['good air quality',
             'acceptable for short-term exposure but may affect sensitive groups over long-term exposure',
             'increased risk for vulnerable populations such as children, elderly, and people with pre-existing health conditions',
             'significant risk to the general population, especially with prolonged exposure',
             'severe health risk to the general public']
}
dwho = pd.DataFrame(dwho)
dwho
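
The table above can also be used to label individual PM2.5 values with their risk band. The following is a minimal sketch using `pandas.cut`, not part of the original analysis. The example values are hypothetical, and treating each band as `(low, high]` (so that exactly 35 μg/m³ is not yet "unhealthy") is an assumption chosen to match the "above 35" wording of the main objective.

```python
import pandas as pd

# Band edges and labels copied from the `dwho` table (μg/m³)
band_edges = [0, 10, 25, 35, 55, 1000]
band_labels = ['good', 'moderate', 'unhealthy for sensitive groups',
               'unhealthy', 'very unhealthy']

# Hypothetical daily PM2.5 values, for illustration only
pm25_example = pd.Series([4.2, 18.0, 35.0, 42.3, 80.1])

# Bands are (low, high]; include_lowest=True keeps a value of exactly 0 in 'good'
risk_example = pd.cut(pm25_example, bins=band_edges, labels=band_labels,
                      include_lowest=True)
print(pd.DataFrame({'pm2_5': pm25_example, 'risk': risk_example}))
```

Switching to `right=False` would instead treat the bands as `[low, high)`, which moves a value of exactly 35 into the "unhealthy" band.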

In [None]:
# load data
db = pd.read_csv("sentinel_5p_particulate_matter.csv")
db.head()

These data are very rich. We will subset them to Kampala, the capital of Uganda, with about 7 million people in the wider metropolitan area. The name Kampala goes back to "the hill of the impala", an antelope species that once grazed there in large numbers. Let us plot the locations for which atmospheric variables are available:

In [None]:
# subset to Kampala, Uganda; .copy() gives an independent DataFrame so new columns can be added safely
dp_kampala = db.loc[db['city'] == "Kampala",['city', 'country', 'date', 'hour', 'site_id', 'site_latitude', 'site_longitude', 'pm2_5']].copy()
dp_kampala['date'] = pd.to_datetime(dp_kampala['date'], format = '%Y-%m-%d')
dp_kampala['year'] = dp_kampala['date'].dt.strftime('%Y')
dp_kampala['month'] = dp_kampala['date'].dt.strftime('%m')
dp_kampala.head()


In [None]:
# name sites and merge
dp_sites = dp_kampala.loc[:,['city', 'site_id','site_latitude', 'site_longitude']].drop_duplicates()
dp_sites['site_name'] = ['site-' + str(i) for i in range(1,len(dp_sites)+1)]
dp_kampala = pd.merge(dp_kampala, dp_sites, on = ['city','site_id','site_latitude', 'site_longitude'])
dp_kampala.head()

In [None]:
# Create a map showing measurement locations in Kampala
# coordinates for the center of Kampala
lat_kampala = 0.3136
lon_kampala = 32.5818

# create map (named kampala_map to avoid shadowing Python's built-in `map`)
kampala_map = folium.Map(location=[lat_kampala, lon_kampala], zoom_start=12)

# Add markers for each site
for idx, row in dp_sites.iterrows():
    folium.Marker(
        location=[row['site_latitude'], row['site_longitude']],
        popup=row['site_name']
    ).add_to(kampala_map)

# Save and display the map
kampala_map.save(str(output_dir.joinpath("kampala_map.html")))
kampala_map

Let us plot the time series of the noisy PM2.5 measurements in two sites:

In [None]:
# Select specific sites and rename them (.copy() avoids modifying a view of dp_kampala)
dp = dp_kampala[dp_kampala['site_name'].isin(['site-4', 'site-9'])].copy()
dp['site_name'] = dp['site_name'].replace({'site-4': 'Buwate', 'site-9': 'Kyebando'})
dp

In [None]:
plt.figure(figsize=(10, 6))
custom_palette = sns.color_palette(["#073344FF", "#0B7C9AFF"])

# Draw one shaded band per WHO PM2.5 risk category
for i, row in dwho.iterrows():
    plt.fill_between(
        [dp['date'].min() - timedelta(days=10), dp['date'].max() + timedelta(days=10)],
        row['pm25_low'], row['pm25_high'],
        color=sns.color_palette("OrRd", len(dwho))[i], alpha=0.5, label=row['risk']
    )
# Plot PM2.5 data points
sns.scatterplot(
    data=dp, x='date', y='pm2_5', hue='site_name', palette=custom_palette, s=50
)

# Customizing the plot
plt.xlim([dp['date'].min() - timedelta(days=10), dp['date'].max() + timedelta(days=10)])
plt.ylim([0, max(dp['pm2_5']) * 1.05])
plt.xlabel('')
plt.ylabel('PM2.5 concentration (μg/m³)')
plt.title('PM2.5 Measurements in Kampala')
plt.legend(title='Location')

## Non-parametric modelling with GPs

Let us denote by $y_i$ the PM2.5 concentrations in the $i$th observation.

We model $y_i$ with
\begin{align*}
& y_i \sim \text{LogNormal}(\mu_i, \sigma^2) \\
& \mu_i = \beta_0 + f(\text{date}_i) \\
& \beta_0 \sim \text{Normal}(0, 2) \\
& f \sim \text{GP}(0,k) \\
\end{align*}
where the median is $\exp(\mu_i)$, the mean is $\exp(\mu_i + \sigma^2/2)$, the variance is
$(\exp(\sigma^2)-1)\exp(2\mu_i + \sigma^2)$, and $f$ is a random function that is
evaluated using dates as inputs and so captures time effects non-parametrically. The random function is given a zero-mean GP prior with squared exponential kernel with GP variance $\alpha$ and lengthscale $\rho$.
We specify hyper-priors by
\begin{align*}
& \alpha \sim \text{Half-Cauchy}(0, 1) \\
& \rho \sim \text{Inv-Gamma}(5, 1) \\
& \sigma \sim \text{Half-Cauchy}(0,1)
\end{align*}
The hyperparameters $\alpha$, $\rho$ are given default priors that are suitable for a standardised input domain $[0,1]$.

Below is the `Stan` model file, [based on the Stan v2.36.0 manual](https://mc-stan.org/docs/stan-users-guide/gaussian-processes.html#predictive-inference-with-a-gaussian-process).

Note how the variance to mean ratio in the LogNormal scales exponentially with $\sigma^2$, and for this reason
one typically attaches priors with relatively limited variance to the baseline parameter when compared
to other observation likelihood models.
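
To see how quickly the variance-to-mean ratio grows, the log-normal moment formulas given earlier can be evaluated numerically. This is a minimal sketch; the helper `lognormal_moments` and the log-median of $\log(30)$ are hypothetical choices made for illustration only.

```python
import numpy as np

def lognormal_moments(mu, sigma):
    """Median, mean and variance of LogNormal(mu, sigma^2)."""
    median = np.exp(mu)
    mean = np.exp(mu + sigma**2 / 2)
    variance = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
    return median, mean, variance

# Hypothetical log-median, corresponding to a median of 30 μg/m³
mu_example = np.log(30.0)

for sigma in (0.25, 0.5, 1.0, 2.0):
    median, mean, variance = lognormal_moments(mu_example, sigma)
    print(f"sigma={sigma}: median={median:.1f}, mean={mean:.1f}, "
          f"variance/mean={variance / mean:.1f}")
```

The median stays fixed while the mean and the variance-to-mean ratio increase rapidly with $\sigma$, which is why wide priors on the log scale can imply implausibly large concentrations.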

Note that the joint distribution of $f$ evaluated at a finite set of inputs is just a multivariate normal, and so we can straightforwardly generate samples from $f$ through linear transformation of iid standard normal random variables (through the line $f = L_f * z$).
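
The linear transformation can be illustrated outside Stan with NumPy. This is a sketch under stated assumptions rather than the course's model code: it assumes the kernel is parameterised as $\alpha^2 \exp(-(t - t')^2 / (2\rho^2))$, uses hypothetical values for $\alpha$ and $\rho$, and adds a small jitter to the diagonal for numerical stability.

```python
import numpy as np

def sq_exp_kernel(x1, x2, alpha, rho):
    """Squared exponential kernel alpha^2 * exp(-(x1 - x2)^2 / (2 * rho^2))."""
    diff = x1[:, None] - x2[None, :]
    return alpha**2 * np.exp(-0.5 * (diff / rho) ** 2)

rng = np.random.default_rng(1)
x_grid = np.linspace(0.0, 1.0, 50)           # standardised inputs
K = sq_exp_kernel(x_grid, x_grid, alpha=1.0, rho=0.2)

# Cholesky factor of the covariance matrix (jitter keeps it positive definite)
L_f = np.linalg.cholesky(K + 1e-6 * np.eye(len(x_grid)))

# Each column of z is a vector of iid standard normals; f = L_f z ~ N(0, K)
z = rng.standard_normal((len(x_grid), 5000))
f_draws = L_f @ z

# The empirical covariance of the draws should be close to K
emp_cov = np.cov(f_draws)
print("largest absolute deviation from K:", np.abs(emp_cov - K).max())
```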

## Group Project Instructions

### Objective
The goal of this group project is to analyze air pollution data from a selected site using a Gaussian Process (GP) model with Hilbert Space approximation. The primary objective is to estimate the number of days in a year on which PM2.5 concentrations were at unhealthy levels (above 35 $\mu$g/$m^3$).

-----
### Tasks

#### 1. Site Selection
Select **one site** from Kampala city. Make sure that there are not too many missing values.

#### 2. Model Implementation
Implement the **Gaussian Process (GP) model** as specified in the previous section, using **Hilbert Space approximation**. You may also consider changing the kernel from squared exponential to, for example, a Matérn-class kernel (optional).
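
As a reminder of the idea behind the Hilbert Space approximation, the NumPy sketch below checks that a finite sum of basis functions weighted by the kernel's spectral density reproduces the squared exponential kernel. It is an illustration, not the Stan implementation expected for the project. The function names, the boundary `L=2.5`, the number of basis functions `M=40`, and the values of `alpha` and `rho` are hypothetical, the kernel is assumed to be $\alpha^2 \exp(-(t - t')^2 / (2\rho^2))$, and the inputs are assumed to be centred around zero.

```python
import numpy as np

def se_spectral_density(omega, alpha, rho):
    """Spectral density of the 1D squared exponential kernel."""
    return alpha**2 * np.sqrt(2 * np.pi) * rho * np.exp(-0.5 * (rho * omega) ** 2)

def hsgp_basis(x, L, M):
    """Laplacian eigenfunctions on [-L, L] and square roots of their eigenvalues."""
    j = np.arange(1, M + 1)
    sqrt_lambda = j * np.pi / (2 * L)
    phi = np.sqrt(1 / L) * np.sin(sqrt_lambda[None, :] * (x[:, None] + L))
    return phi, sqrt_lambda

alpha, rho = 1.0, 0.5
x = np.linspace(-1.0, 1.0, 30)                # centred inputs
phi, sqrt_lambda = hsgp_basis(x, L=2.5, M=40)
spd = se_spectral_density(sqrt_lambda, alpha, rho)

# Approximate kernel: sum_j S(sqrt(lambda_j)) * phi_j(x) * phi_j(x')
K_approx = (phi * spd) @ phi.T
K_exact = alpha**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / rho) ** 2)
print("max abs error:", np.abs(K_approx - K_exact).max())

# One approximate GP prior draw: f = Phi @ (sqrt(S) * beta), with beta ~ N(0, I)
rng = np.random.default_rng(0)
beta = rng.standard_normal(len(sqrt_lambda))
f_draw = phi @ (np.sqrt(spd) * beta)
```

The approximation degrades if `L` is too close to the range of the inputs or if `M` is too small relative to the lengthscale, so both should be checked for the chosen site.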

#### 3. Model Diagnostics
Evaluate **convergence and mixing** of the Markov Chain Monte Carlo (MCMC) algorithm.  
- Obtain summary statistics, including **Rhat** and **Effective sample size (ESS)**.
- Since there may be many parameters, focus on the **trace plot of the model parameter with the lowest effective sample size**.
- Note that there may be a few divergent transitions; for the purpose of this project, you may proceed to the next steps if the diagnostics are deemed satisfactory.
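
To make concrete what Rhat measures, the sketch below computes a basic split R-hat with NumPy by comparing between-chain and within-chain variances. It is a simplified illustration: Stan and ArviZ report a more refined, rank-normalised version, so the values will not match their output exactly. The function name `split_rhat` and the simulated chains are hypothetical.

```python
import numpy as np

def split_rhat(draws):
    """Basic split R-hat for one parameter; `draws` has shape (n_chains, n_iterations)."""
    n_iter = draws.shape[1] // 2
    # Split every chain in half so that trends within a chain are also detected
    halves = np.concatenate([draws[:, :n_iter], draws[:, n_iter:2 * n_iter]], axis=0)
    chain_means = halves.mean(axis=1)
    within = halves.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n_iter * chain_means.var(ddof=1)        # B: between-chain variance
    var_hat = (n_iter - 1) / n_iter * within + between / n_iter
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(0)
well_mixed = rng.standard_normal((4, 1000))                 # four chains, same target
stuck = well_mixed + np.array([0.0, 0.0, 3.0, 3.0])[:, None]  # two chains shifted

print("well mixed:", split_rhat(well_mixed))
print("not mixed: ", split_rhat(stuck))
```

Values close to 1 indicate that the chains agree with each other, while values clearly above 1 indicate that they are exploring different regions.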

#### 4. Visualization
- Plot the **posterior median of the target variable (PM2.5 concentration)** over time.
- Include 95% credible intervals to visualize variability.

#### 5. Answer the Main Objective
- Calculate the number of days in a full year on which PM2.5 concentrations were at unhealthy levels, above $35 \mu g / m^3$.
- Summarize your findings and interpret the results in the context of public health.
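
One possible way to turn posterior draws into an answer is to count the days above the threshold separately for each draw, which gives a posterior distribution for the number of unhealthy days. The sketch below uses simulated log-normal values as a hypothetical stand-in for the posterior draws of daily PM2.5. In the project these would come from the fitted model, restricted to a 365-day window of your choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_days = 1000, 365

# Hypothetical stand-in for posterior draws of daily PM2.5 (rows: draws, columns: days)
pm25_draws = rng.lognormal(mean=np.log(30.0), sigma=0.4, size=(n_draws, n_days))

threshold = 35.0  # μg/m³
# Number of days above the threshold, computed separately for each draw
days_above = (pm25_draws > threshold).sum(axis=1)

median_days = np.median(days_above)
lower, upper = np.percentile(days_above, [2.5, 97.5])
print(f"days above {threshold}: median {median_days:.0f}, "
      f"95% interval [{lower:.0f}, {upper:.0f}]")
```

Summarising the per-draw counts, rather than thresholding the posterior median curve, carries the model uncertainty through to the final estimate.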


#### 6. Additional Sites (Optional)
- If time permits, **repeat the analysis for a few additional sites**.
- Compare the results across sites and discuss any spatial patterns or variability.

-----

### Deliverables
Prepare a short presentation with slides highlighting the tasks achieved, with a focus on the results and conclusions of your analysis. Each group member should present for at least 2-3 minutes to ensure balanced participation. Submit one Python notebook per group. Marks are available for completing tasks 1-6 (group mark) and for the content and delivery of the presentation (individual mark).

### Tips / Hints

* To select a site that is a good fit for the project, it is useful to plot all time series. Try the following code:
```
fig = px.line(dp_kampala.sort_values('date'), x='date', y='pm2_5', color='site_name')
fig.show()
```
Choose a location where the first and last observations are more than 1 year apart and where there are not too many missing values in between. Use **all** available observations in the model fitting phase.
* You will have to obtain GP predictions for the days on which the response variable (PM2.5 level) is not recorded.
* You can use the `lognormal` distribution to specify the log-normal likelihood and the `lognormal_rng` function to sample from the log-normal distribution. See also the [Stan manual](https://mc-stan.org/docs/2_21/functions-reference/lognormal.html).

In [None]:
fig = px.line(dp_kampala.sort_values('date'), x='date', y='pm2_5', color='site_name')
fig.show()

We show the data processing steps below. In this example, we focus on 'site-9'.


In [None]:
dp = dp_kampala[dp_kampala['site_name'] == 'site-9'].sort_values(by=['date']).copy()
print('Number of observations:', len(dp), 'between', dp['date'].min().strftime('%d %B, %Y'), 'and', dp['date'].max().strftime('%d %B, %Y'))

At the selected site, we have 229 observations over a period of 402 days
(from 21 January, 2023 to 26 February, 2024). We fit the GP model to these 229 points and make predictions for the remaining 173 days. This can be done in the `generated quantities` block of the Stan model.

Let's prepare the data. We want to standardize the variable $t$, which represents the date. Let $k_i$ denote the unstandardized day variable, ranging from 1 to 402 in this example.

To standardize $t$, it is common practice to use only the observed (training) data when calculating the mean ($m$) and standard deviation ($s$). This means that in the standardized variable, given by $t_i = (k_i - m ) / s$, $m$ and $s$ are calculated exclusively from the observed data, ensuring that the standardization process does not incorporate any information from the validation or test sets.

In [None]:
# converting the date to start from 1
day_num = (dp.date - min(dp.date)).dt.days + 1.0
# get mean and standard deviation of the observed days
mean_day = day_num.mean()
std_day = day_num.std()

Let's use this in `dp`

In [None]:
dp['day'] = day_num
dp['day_std'] = (dp['day'] - mean_day) / std_day

Now we can create a new DataFrame to include the remaining days that are not present in the `dp` DataFrame.

In [None]:
# create a DataFrame with all dates (the converted date is called `day`)
dp_all ={
    'date' : [dp.date.min() + timedelta(days=x) for x in range((dp.date.max()-dp.date.min()).days + 1)],
    'day' : range(1, (dp.date.max()-dp.date.min()).days + 2),
}
dp_all = pd.DataFrame(dp_all)
# standardise `day` using the mean and standard deviation calculated in the previous step
dp_all['day_std'] = (dp_all['day'] - mean_day) / std_day
# DataFrame for days without observations (.copy() so that columns can be added later)
dp_new = dp_all[~dp_all['day'].isin(day_num)].sort_values(by=['date']).copy()

We now have all the data we need to pass to the `Stan` model. For convenience, we also create a single DataFrame that combines the observed days and the days to be predicted.

In [None]:
dp.loc[:,['source']] = 'obs'
dp_new.loc[:,['source']] = 'pred'
dp_all = pd.concat([dp[['pm2_5','date','day','day_std','source']], dp_new], axis=0).sort_values(by=['date'])

In [None]:
plt.figure(figsize=(10, 6))

# Draw one shaded band per WHO PM2.5 risk category
for i, row in dwho.iterrows():
    plt.fill_between(
        [dp['date'].min() - timedelta(days=10), dp['date'].max() + timedelta(days=10)],
        row['pm25_low'], row['pm25_high'],
        color=sns.color_palette("OrRd", len(dwho))[i], alpha=0.5, label=row['risk']
    )
# Plot PM2.5 data points
sns.scatterplot(
    data= dp_all, x='date', y='pm2_5', s=50, color = "#0B7C9AFF"
)
# Customizing the plot
plt.xlim([dp['date'].min() - timedelta(days=10), dp['date'].max() + timedelta(days=10)])
plt.ylim([0, max(dp['pm2_5']) * 1.05])
plt.xlabel('')
plt.ylabel('PM2.5 concentration (μg/m³)')
plt.title('PM2.5 Measurements in Kampala')