# PCA and Clustering of Weather Data Features

The goal is to classify large-scale weather regimes (LSWRs) by performing a PCA on time series data containing multiple physical quantities on a large grid covering the whole of Europe.
Classification is done by applying a clustering algorithm to the data that assigns each individual time step to a cluster. Time steps that do not fit any cluster are allowed and are treated as outliers.

To allow classification of gridded time series data, we apply a dimensionality reduction algorithm (PCA) and transform the original data into PC space. The PC space is a multi-dimensional space that represents the phase space of the dynamical system described by the data. To avoid the curse of dimensionality, we use only a reduced number of PCs for the transformation, chosen such that these PCs still capture most of the variance of the data.
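The transformation into PC space can be sketched with plain NumPy. This is a minimal illustration on random data, not the implementation used by `lifetimes`. The array shape, the variable names and the use of an SVD are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for the gridded time series: (time, latitude, longitude).
n_time, n_lat, n_lon = 200, 10, 10
data = rng.normal(size=(n_time, n_lat, n_lon))

# Flatten the grid so that each time step becomes one sample (row).
samples = data.reshape(n_time, n_lat * n_lon)

# Center each grid point in time; PCA operates on anomalies.
anomalies = samples - samples.mean(axis=0)

# SVD of the anomaly matrix: the rows of `vt` are the PCs (spatial patterns).
_, singular_values, vt = np.linalg.svd(anomalies, full_matrices=False)

# Fraction of the total variance explained by each PC (sorted descending).
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Keep only the leading PCs and project every time step into PC space.
n_components = 3
components = vt[:n_components]
projected = anomalies @ components.T

print(projected.shape)
print(explained_variance_ratio[:n_components].sum())
```

Each row of `projected` is one state of the system in the reduced PC space, which is the representation that the clustering operates on.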

To cluster the states of the dynamical system in PC space, we use [Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)](https://arxiv.org/abs/1911.02282), which is a hierarchical extension of the [DBSCAN](https://dl.acm.org/doi/10.5555/3001460.3001507) algorithm.

## Data

The underlying data are from the ECMWF IFS HRES model. A detailed description can be found [here (pp. 21)](https://www.maelstrom-eurohpc.eu/content/docs/uploads/doc6.pdf).

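The data cover the years 2017-2020. The model output has an hourly temporal resolution, whereas the file loaded in this notebook holds daily averages (as its file name indicates), i.e. 1461 time steps. Hence, the data contain `~10^3` samples.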

## PCA

The PCA is performed on the whole dataset, but only 3 PCs are kept for the transformation because we only have `~10^3` data samples. Keeping more PCs would require more data (curse of dimensionality): as a rule of thumb, using `N` PCs requires at least `10^N` samples.
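The rule of thumb can be checked with a short calculation. The sketch assumes one sample per day for 2017-2020, which is what the file name of the dataset used in this notebook suggests. The variable names are chosen for illustration only.

```python
import math

# Assumption: daily samples for 2017-2020 (2020 is a leap year).
n_samples = 3 * 365 + 366

# Rule of thumb from the text: N PCs need at least 10**N samples,
# so the largest admissible N is floor(log10(n_samples)).
max_components = math.floor(math.log10(n_samples))

print(n_samples, max_components)
```

Under this assumption, going from 3 to 4 PCs would need roughly seven times as many samples as are available.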

## Clustering

Before applying the clustering, the data are transformed into the 3-D sub-space of the PC space. The result reflects the phase space containing all states of the dynamical system throughout the given time span.

Within this space, we perform a clustering to find recurring states of the system. Each cluster represents an LSWR, i.e. the clusters together represent the ensemble of LSWRs that our system resided in during the given time range.
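To illustrate how a density-based algorithm assigns time steps to clusters while leaving outliers unassigned, here is a minimal sketch on synthetic points. It uses scikit-learn's `DBSCAN` rather than HDBSCAN to keep the example small. The synthetic data and the values of `eps` and `min_samples` are assumptions chosen for this toy case only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(seed=0)

# Hypothetical states in a 3-D PC sub-space:
# two dense regions (recurring states) plus a few scattered points.
regime_a = rng.normal(loc=(0.0, 0.0, 0.0), scale=0.2, size=(100, 3))
regime_b = rng.normal(loc=(5.0, 5.0, 5.0), scale=0.2, size=(100, 3))
scattered = rng.uniform(low=-10.0, high=15.0, size=(10, 3))
states = np.vstack([regime_a, regime_b, scattered])

# One label per time step; -1 marks noise (outliers).
labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(states)

cluster_ids = sorted(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(cluster_ids, n_noise)
```

The label `-1` plays the same role in `hdbscan.HDBSCAN`: time steps carrying it belong to no regime.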

## Statistical Analysis of the Clusters

The clusters (LSWRs) are then statistically analyzed such that we retrieve information about the LSWRs in general (total abundance, mean and standard deviation of their duration) and the appearance of each individual LSWR in the time series.
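The duration statistics can be sketched as a run-length analysis of the cluster labels. This is an illustration with a made-up label sequence, not the implementation behind `determine_lifetimes_of_modes`. Counting abundance as the number of uninterrupted episodes is an assumption, and durations are given in time steps.

```python
import numpy as np

# Hypothetical cluster labels per time step; -1 marks noise (outliers).
labels = np.array([0, 0, 0, -1, 1, 1, 0, 0, -1, -1, 1, 1, 1, 1])

# Indices where the label changes mark the start of a new episode.
change_points = np.flatnonzero(np.diff(labels)) + 1
starts = np.concatenate(([0], change_points))
ends = np.concatenate((change_points, [labels.size]))
durations = ends - starts        # episode lengths in time steps
episode_labels = labels[starts]  # label of each episode

# Per-cluster statistics, skipping the noise label.
statistics = {}
for label in np.unique(labels[labels >= 0]):
    episode_durations = durations[episode_labels == label]
    statistics[int(label)] = {
        "abundance": int(episode_durations.size),
        "mean_duration": float(episode_durations.mean()),
        "std_duration": float(episode_durations.std()),
    }

print(statistics)
```

The `starts` and `ends` arrays also give the position of every individual episode in the time series.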

In [None]:
%matplotlib notebook
%load_ext dotenv
%dotenv mantik.env

import functools
import itertools

import hdbscan
import mantik
import mlflow
import sklearn.cluster as cluster

import lifetimes

lifetimes.utils.log_to_stdout()
mantik.init_tracking()

In [None]:
# Create fake dataset of two temporally variable elliptical data regions on a grid
#dataset = lifetimes.testing.create_dummy_ecmwf_ifs_hres_dataset(
#    grid_size=(10, 10)
#)
#ds = dataset.as_xarray()

# Or load from local file
path = '/home/fabian/Documents/MAELSTROM/data/pca/temperature_level_128_daily_averages_2017_2020.nc'
ds = lifetimes.datasets.EcmwfIfsHres(
    paths=[path],
    overlapping=False,
)

data = ds.as_xarray()["t"]
data

In [None]:
anim = lifetimes.plotting.animate_timeseries(data)

In [None]:
modes = [lifetimes.modes.Modes(feature=data)]
pca_partial_method = functools.partial(
    lifetimes.modes.methods.spatio_temporal_principal_component_analysis,
    time_coordinate="time",
    latitude_coordinate="latitude",
)
[pca] = lifetimes.modes.determine_modes(modes=modes, method=pca_partial_method)

In [None]:
lifetimes.plotting.plot_first_three_components_timeseries(pca)

In [None]:
lifetimes.plotting.plot_scree_test(pca, variance_ratio=0.95)

In [None]:
# Parameter ranges to scan; each currently contains a single value.
n_components_range = range(3, 4)
min_cluster_size_range = range(30, 31)
for (
    n_components,
    min_cluster_size,
) in itertools.product(
    n_components_range,
    min_cluster_size_range,
):
    # To track each run with MLflow, uncomment the `with` line and the
    # logging calls below; the body is already indented accordingly.
    #with mlflow.start_run():
        algorithm = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
        clusters = lifetimes.modes.methods.find_principal_component_clusters(
            algorithm=algorithm,
            pca=pca,
            n_components=n_components,
            use_varimax=False,
        )
        #mlflow.log_param("n_components", n_components)
        #mlflow.log_param("hdbscan_min_cluster_size", min_cluster_size)
        #mlflow.log_metric("n_clusters", clusters.n_clusters)

In [None]:
lifetimes.plotting.plot_first_three_components_timeseries_clusters(clusters)

In [None]:
lifetimes.plotting.plot_condensed_tree(clusters)

In [None]:
lifetimes.plotting.plot_single_linkage_tree(clusters)

In [None]:
clusters.labels.plot()

In [None]:
cluster_lifetimes = lifetimes.modes.methods.determine_lifetimes_of_modes(
    modes=clusters.labels,
    time_coordinate="time",
)
cluster_lifetimes

In [None]:
pca.components_in_original_shape[0].plot()

In [None]:
clusters.inverse_transformed_cluster(0).plot()

In [None]:
clusters.inverse_transformed_cluster(1).plot()