# 1. <a id='toc1_'></a>[Clustering](#toc0_)

**Table of contents**<a id='toc0_'></a>    
1. [Clustering](#toc1_)    
1.1. [Import dependencies](#toc1_1_)    
1.2. [Load the Train datasets](#toc1_2_)    
1.2.1. [Load the scaled-extended data](#toc1_2_1_)    
1.2.2. [Load the normalized-extended data: The data used for the AEs](#toc1_2_2_)    
1.2.3. [Load the data without scaling](#toc1_2_3_)    
1.2.4. [Summary](#toc1_2_4_)    
1.3. [Get the indices from ``train_df`` saved for follow-up](#toc1_3_)    
1.4. [Load the trained cluster models](#toc1_4_)    
1.4.1. [Load the ``DAE_gmm_20_features_7_clusters_model``](#toc1_4_1_)    
1.4.2. [Load the ``gmm_major_features_7_clusters_model``](#toc1_4_2_)    
1.4.3. [Load the ``gmm_selected_best_10_features_7_clusters_model``](#toc1_4_3_)    
1.5. [Cluster](#toc1_5_)    
1.5.1. [Cluster 100e3 samples from ``encoded_features_transformer_seq`` data using the ``DAE_gmm_20_features_7_clusters_model`` model](#toc1_5_1_)    
1.5.2. [Cluster 100e3 samples from ``scaled_train_df`` data using the ``gmm_major_features_7_clusters_model`` model](#toc1_5_2_)    
1.5.3. [Cluster 100e3 samples from ``scaled_train_df`` data using the ``gmm_selected_best_10_features_7_clusters_model`` model](#toc1_5_3_)    
1.6. [Pad the cluster labels to the sampled ``train_df``](#toc1_6_)    
1.7. [Behaviour Analysis](#toc1_7_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=true
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## 1.1. <a id='toc1_1_'></a>[Import dependencies](#toc1_1_)

In [1]:
%load_ext autoreload
%autoreload 2

In [12]:
##> import libraries
import sys
from pathlib import Path
# Add root directory to path for imports >
root_dir = Path.cwd().resolve().parent
if root_dir.exists():
    sys.path.append(str(root_dir))
else:
    raise FileNotFoundError('Root directory not found')

from pprint import pprint

import pickle
import numpy as np
import torch
import shap
import pandas as pd
import torch.optim as optim
import torch.nn as nn
from torch.utils.data import DataLoader

#> import custom libraries
from src.cluster import cluster_and_visualize, plot_bic_aic, flatten_trajectory_data, SHAPAnalysis, sample_features_and_indices, cluster, plot_cluster, describe_clusters
from src.models import DenoiseAutoEncoder, TransformerDenoiseAutoEncoder
from src.load import load_datasets, load_df_to_dataset
from src.plot import plot_mean_std_feature, plot_cluster_counts
from src.traj_dataloader import DenoiseAutoencoderSequencedDataset
from src.train import train_and_evaluate
from src.load import stratified_sample_df as ss_sample

import joblib
from collections import defaultdict

#> Plot
import matplotlib.pyplot as plt
import seaborn as sns
import scienceplots  # https://github.com/garrettj403/SciencePlots?tab=readme-ov-file
plt.style.use(['science', 'grid', 'notebook'])  # , 'ieee'

# %matplotlib inline
%matplotlib widget

In [4]:
data_dir = root_dir / 'data'
data_dir = data_dir.resolve()
if not data_dir.exists():
    raise FileNotFoundError('Data directory not found')

# assets_dir = data_dir / 'local' / 'aistraj' / 'tvt_assets'
assets_dir = root_dir.parents[2] / 'aistraj' / 'tvt_assets'
assets_dir = assets_dir.resolve()
if not assets_dir.exists():
    raise FileNotFoundError('Assets directory not found')

models_dir = root_dir / 'models'
models_dir = models_dir.resolve()
if not models_dir.exists():
    raise FileNotFoundError('Models directory not found')

scaled_tvt_data_import_assets_dir = assets_dir / 'scaled' 
scaled_tvt_data_import_assets_dir = scaled_tvt_data_import_assets_dir.resolve()
if not scaled_tvt_data_import_assets_dir.exists():
    raise FileNotFoundError('Train-Validate-Test Pickled Data directory not found')

extend_tvt_data_import_assets_dir = assets_dir / 'extended' 
extend_tvt_data_import_assets_dir = extend_tvt_data_import_assets_dir.resolve()
if not extend_tvt_data_import_assets_dir.exists():
    raise FileNotFoundError('Train-Validate-Test Pickled Data directory not found')

tvt_data_import_assets_dir = assets_dir / 'original' 
tvt_data_import_assets_dir = tvt_data_import_assets_dir.resolve()
if not tvt_data_import_assets_dir.exists():
    raise FileNotFoundError('Train-Validate-Test Pickled Data directory not found')

# Images dir
images_dir = root_dir / 'images_and_description'
images_dir = images_dir.resolve()
if not images_dir.exists():
    raise FileNotFoundError('images directory not found')
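
The resolve-then-check pattern repeated in the cell above could be factored into one helper. The following is a minimal sketch, assuming a hypothetical `require_dir` function that is not part of the project's `src` package; it is demonstrated on a temporary directory rather than the real project layout.

```python
from pathlib import Path
import tempfile


def require_dir(path: Path, label: str) -> Path:
    """Resolve `path` and raise if it is not an existing directory.

    Hypothetical helper, not part of the project's `src` package.
    """
    resolved = Path(path).resolve()
    if not resolved.is_dir():
        raise FileNotFoundError(f'{label} directory not found: {resolved}')
    return resolved


# Demonstration on a temporary directory instead of the real project layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / 'models').mkdir()
    demo_models_dir = require_dir(demo_root / 'models', 'Models')
    try:
        require_dir(demo_root / 'missing', 'Data')
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Including the resolved path in the error message makes it easier to see which of the several directories is missing.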

## 1.2. <a id='toc1_2_'></a>[Load the Train datasets](#toc1_2_)

Because the row index is preserved when training with the different feature sets, there is no need to descale or decode the data: each row can be traced back to the original dataset by following its index.
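
A small illustration of this idea, using toy frames with made-up values in place of ``train_df`` and ``scaled_train_df``: row positions chosen on the scaled frame select the matching unscaled rows directly.

```python
import pandas as pd

# Toy stand-ins for the unscaled and scaled training frames (hypothetical values).
toy_train_df = pd.DataFrame({'speed_c': [1.3, 4.1, 2.8, 5.2]})
toy_scaled_df = (toy_train_df - toy_train_df.min()) / (toy_train_df.max() - toy_train_df.min())

# Row positions picked while working on the scaled frame...
picked_positions = [3, 1]
scaled_rows = toy_scaled_df.iloc[picked_positions]

# ...select the matching unscaled rows directly; no inverse transform is needed.
original_rows = toy_train_df.iloc[picked_positions]
```

This only holds as long as both frames keep the same row order, which is the assumption this notebook relies on.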

### 1.2.1. <a id='toc1_2_1_'></a>[Load the scaled-extended data](#toc1_2_1_)

In [4]:
# Define the paths to the tvt files
train_pickle_path = scaled_tvt_data_import_assets_dir / 'scaled_cleaned_extended_train_df.parquet'

scaled_train_df = load_df_to_dataset(train_pickle_path).data

### 1.2.2. <a id='toc1_2_2_'></a>[Load the normalized-extended data: The data used for the AEs](#toc1_2_2_)

In [7]:
## Load all the data stages (input, latent, decoded) of the Denoising AE >>
pickle_save_path = models_dir / 'sga'
model_name = 'transformer_denoiseautoencoder_model_parquet'
pickle_dir = pickle_save_path / Path(model_name).stem
with open(pickle_dir / "encoded_features_train.pkl", "rb") as f:
    encoded_features_transformer_seq = pickle.load(f)
with open(pickle_dir / "all_inputs_train.pkl", "rb") as f:
    all_inputs_transformer_seq = pickle.load(f)
with open(pickle_dir / "all_reconstructions_train.pkl", "rb") as f:
    all_reconstructions_transformer_seq = pickle.load(f)

In [8]:
## Print the shapes of the data >>
print (encoded_features_transformer_seq.shape)
print (all_inputs_transformer_seq.shape)
print (all_reconstructions_transformer_seq.shape)

(57444, 256, 4)
(57444, 256, 20)
(57444, 256, 20)


In [9]:
## Flatten the data in 2D (as a table) >>
encoded_features_transformer_seq = flatten_trajectory_data(encoded_features_transformer_seq, flatten_mode='fine_grained_behavior')
all_inputs_transformer_seq = flatten_trajectory_data(all_inputs_transformer_seq, flatten_mode='fine_grained_behavior')
all_reconstructions_transformer_seq = flatten_trajectory_data(all_reconstructions_transformer_seq, flatten_mode='fine_grained_behavior')

In [10]:
## Print the shapes of the data >>
print (encoded_features_transformer_seq.shape)
print (all_inputs_transformer_seq.shape)
print (all_reconstructions_transformer_seq.shape)

(14705664, 4)
(14705664, 20)
(14705664, 20)


Now we have all the datasets needed for the AE in the following sequence:
1. ``all_inputs_transformer_seq`` : The (min-max normalised) input data for the Denoising AE (DAE) 
2. ``encoded_features_transformer_seq`` : The latent space of the DAE
3. ``all_reconstructions_transformer_seq``: The reconstructed (output) data from the DAE
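
The flattening step can be pictured with plain NumPy. This is a sketch only: it assumes that ``flatten_mode='fine_grained_behavior'`` merges the trajectory and time axes into one row axis, which is consistent with the printed shapes (57444 × 256 = 14705664) but is not confirmed by the source of ``flatten_trajectory_data``.

```python
import numpy as np

# Small stand-in for the (57444, 256, 4) latent tensor: 3 trajectories, 5 time steps, 4 features.
toy_latent = np.arange(3 * 5 * 4, dtype=float).reshape(3, 5, 4)

# Merge the trajectory and time axes so that every time step becomes one row.
toy_flat = toy_latent.reshape(-1, toy_latent.shape[-1])

# Row r of the flat table is time step r % 5 of trajectory r // 5.
row = 7
same_point = np.array_equal(toy_flat[row], toy_latent[row // 5, row % 5])
```

Under this assumption, a flat row number can always be mapped back to its trajectory and time step by integer division and remainder.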

### 1.2.3. <a id='toc1_2_3_'></a>[Load the data without scaling](#toc1_2_3_)

In [11]:
# Define the paths to the tvt (non-scaled - original) data files: Only the train is needed
train_pickle_path = extend_tvt_data_import_assets_dir / 'cleaned_extended_train_df.parquet'
# Load
train_df = load_df_to_dataset(train_pickle_path).data

### 1.2.4. <a id='toc1_2_4_'></a>[Summary](#toc1_2_4_)

+ **Scaled Train dataset**: ``scaled_train_df``
+ **AE-related Train dataset**
  1. ``all_inputs_transformer_seq`` : The (min-max normalised) input data for the Denoising AE (DAE) 
  2. ``encoded_features_transformer_seq`` : The latent space of the DAE
  3. ``all_reconstructions_transformer_seq``: The reconstructed (output) data from the DAE
+ **Train dataset without scaling**: ``train_df``

## 1.3. <a id='toc1_3_'></a>[Get the indices from ``train_df`` saved for follow-up](#toc1_3_)

In [12]:
# Reset the index of train_df and move the original index to a new column called 'original_index'
train_df.reset_index(inplace=True)
train_df.rename(columns={'index': 'original_index'}, inplace=True)

In [13]:
train_df

Unnamed: 0,original_index,epoch,stopped,cog_c,aad,rot_c,speed_c,distance_c,acc_c,cdd,...,lon,lat,obj_id,datetime,season,part_of_day,month_sin,month_cos,hour_sin,hour_cos
0,11605,1657070180,0,103.223207,1.914476,0.191448,1.305838,13.058376,0.007239,372.140836,...,10.14568,54.365526,255806357,2022-07-06 01:16:20,1,2,-0.5,-0.866025,0.258819,0.965926
1,11606,1657070190,0,104.746263,1.523056,0.152306,1.368473,13.68473,0.006264,385.825566,...,10.145883,54.365495,255806357,2022-07-06 01:16:30,1,2,-0.5,-0.866025,0.258819,0.965926
2,11607,1657070200,0,104.203357,0.542906,-0.054291,1.458833,14.588328,0.009036,400.413894,...,10.146101,54.365463,255806357,2022-07-06 01:16:40,1,2,-0.5,-0.866025,0.258819,0.965926
3,11608,1657070210,0,103.046189,1.157167,-0.115717,1.595331,15.953308,0.01365,416.367201,...,10.14634,54.36543,255806357,2022-07-06 01:16:50,1,2,-0.5,-0.866025,0.258819,0.965926
4,11609,1657070220,0,103.510316,0.464127,0.046413,1.679767,16.797667,0.008444,433.164869,...,10.146591,54.365395,255806357,2022-07-06 01:17:00,1,2,-0.5,-0.866025,0.258819,0.965926
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14705495,118315304,1683331160,0,199.900217,0.262915,0.026292,4.086556,40.865558,0.044867,4163.814006,...,10.208204,54.414996,210524000,2023-05-05 23:59:20,0,2,0.5,-0.866025,-0.258819,0.965926
14705496,118315305,1683331170,0,198.168437,1.731781,-0.173178,4.23632,42.363198,0.014976,4206.177204,...,10.208,54.414634,210524000,2023-05-05 23:59:30,0,2,0.5,-0.866025,-0.258819,0.965926
14705497,118315306,1683331180,0,196.815691,1.352745,-0.135275,4.466943,44.669433,0.023062,4250.846637,...,10.207801,54.41425,210524000,2023-05-05 23:59:40,0,2,0.5,-0.866025,-0.258819,0.965926
14705498,118315307,1683331190,0,197.284832,0.469141,0.046914,4.434698,44.346978,-0.003225,4295.193615,...,10.207597,54.41387,210524000,2023-05-05 23:59:50,0,2,0.5,-0.866025,-0.258819,0.965926


## 1.4. <a id='toc1_4_'></a>[Load the trained cluster models](#toc1_4_)

Here we have three models trained on the train data but with different feature sets. The models are:
1. ``DAE_gmm_20_features_7_clusters_model``: GMM trained model on the latent space data (``encoded_features_transformer_seq``)
2. ``gmm_major_features_7_clusters_model``: GMM trained model on the major features data (``scaled_train_df``)
3. ``gmm_selected_best_10_features_7_clusters_model``: GMM trained model on the selected best features data (``scaled_train_df``)

### 1.4.1. <a id='toc1_4_1_'></a>[Load the ``DAE_gmm_20_features_7_clusters_model``](#toc1_4_1_)

In [14]:
# Directory to the saved model >
gmm_dae_model_dir = models_dir / 'sga' /  'DAE_gmm_20_features_7_clusters_model.joblib'

# Load the model >
gmm_dae_model = joblib.load(gmm_dae_model_dir)

gmm_dae_model

### 1.4.2. <a id='toc1_4_2_'></a>[Load the ``gmm_major_features_7_clusters_model``](#toc1_4_2_)

In [15]:
# Directory to the saved model >
gmm_major_features_model_dir = models_dir / 'gaf' /  'gmm_major_features_7_clusters_model.joblib'

# Load the model >
gmm_major_features_model = joblib.load(gmm_major_features_model_dir)

gmm_major_features_model

### 1.4.3. <a id='toc1_4_3_'></a>[Load the ``gmm_selected_best_10_features_7_clusters_model``](#toc1_4_3_)

In [16]:
# Directory to the saved model >
gmm_selected_features_model_dir = models_dir / 'gaf' /  'gmm_selected_best_10_features_7_clusters_model.joblib'

# Load the model >
gmm_selected_features_model = joblib.load(gmm_selected_features_model_dir)

gmm_selected_features_model
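
A loaded model can assign cluster labels without being fitted again. The sketch below assumes the `.joblib` files hold scikit-learn `GaussianMixture` objects, which the notebook does not show; the toy data and the file name are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated toy blobs stand in for the real training features.
toy_X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                   rng.normal(5.0, 0.1, size=(50, 2))])

fitted = GaussianMixture(n_components=2, random_state=0).fit(toy_X)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / 'toy_gmm_model.joblib'  # hypothetical file name
    joblib.dump(fitted, model_path)
    reloaded = joblib.load(model_path)

# A reloaded model assigns labels without being fitted again.
toy_labels = reloaded.predict(toy_X)
same_labels = np.array_equal(toy_labels, fitted.predict(toy_X))
```

The input passed to `predict` must have the same columns, in the same order, as the data the model was trained on.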

## 1.5. <a id='toc1_5_'></a>[Cluster](#toc1_5_)

### 1.5.1. <a id='toc1_5_1_'></a>[Cluster 100e3 samples from ``encoded_features_transformer_seq`` data using the ``DAE_gmm_20_features_7_clusters_model`` model](#toc1_5_1_)

+ Sample from the ``encoded_features_transformer_seq`` data

In [17]:
sampled_features, sampled_indices = sample_features_and_indices (encoded_features_transformer_seq[:14705500], 
                                                                 train_df.index[:14705500], 
                                                                 100000)
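
What such a sampling step might look like is sketched below. `sample_rows_and_indices` is a hypothetical stand-in: whether the project's `sample_features_and_indices` samples uniformly or without replacement is an assumption here.

```python
import numpy as np


def sample_rows_and_indices(features, index, n_samples, seed=0):
    """Hypothetical stand-in for `sample_features_and_indices`.

    Draws `n_samples` distinct row positions and returns the sampled rows
    together with the matching index entries.
    """
    if len(features) != len(index):
        raise ValueError('features and index must have the same length')
    rng = np.random.default_rng(seed)
    positions = rng.choice(len(features), size=n_samples, replace=False)
    return features[positions], np.asarray(index)[positions]


toy_features = np.arange(40, dtype=float).reshape(10, 4)
toy_index = np.arange(10)
toy_sample, toy_sampled_indices = sample_rows_and_indices(toy_features, toy_index, 3)
```

Returning the indices alongside the rows is what allows the same sample to be selected later from ``scaled_train_df`` and ``train_df``.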

+ Predict and get scores

In [18]:
# GMM cluster
model_name = 'DAE_gmm_20_features_7_clusters_final'
cluster_method = 'gmm'  # 'kmeans', 'gmm', 'hdbscan'
n_clusters = 7
dae_df, dae_scores, _ = cluster(pd.DataFrame(sampled_features), 
                                n_clusters=n_clusters,
                                cluster_method=cluster_method, 
                                distance_metric='euclidean', 
                                append_to_df=True, 
                                save_model_path=models_dir,
                                save_model_name=model_name,
                                verbose=2)

The average silhouette_score is : 0.8497963871423618
The Calinski-Harabasz score is : 872078.3462158148
The Davies-Bouldin score is : 0.2726383688142084
Saving the trained model to  /data1/gfalouji/repos/behaviour_analysis/sv-nba-analysis/models/DAE_gmm_20_features_7_clusters_final_model.joblib
completed with success!
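
For reading the three scores: a higher silhouette score and a higher Calinski-Harabasz score indicate better-separated clusters, while a lower Davies-Bouldin score is better. The sketch below computes them with scikit-learn on toy data; that the project's `cluster` function uses these same scikit-learn functions is an assumption.

```python
import numpy as np
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three compact toy blobs centred at (0, 0), (4, 4) and (0, 4).
toy_X = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)),
                   rng.normal(4.0, 0.2, size=(100, 2)),
                   rng.normal([0.0, 4.0], 0.2, size=(100, 2))])

toy_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(toy_X)

toy_scores = {
    'silhouette': silhouette_score(toy_X, toy_labels, metric='euclidean'),
    'calinski_harabasz': calinski_harabasz_score(toy_X, toy_labels),
    'davies_bouldin': davies_bouldin_score(toy_X, toy_labels),
}
```

All three scores are distance-based, so they depend on the feature scaling of the data passed in.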


### 1.5.2. <a id='toc1_5_2_'></a>[Cluster 100e3 samples from ``scaled_train_df`` data using the ``gmm_major_features_7_clusters_model`` model](#toc1_5_2_)

+ From the ``scaled_train_df`` data, select the rows in ``sampled_indices``

In [19]:
major_features = ['season', 'part_of_day',
                  'distance_c', 'dist_ww', 'dist_ra', 'dist_cl', 'dist_ma', 
                  'cog_c', 'rot_c', 'speed_c', 'acc_c', 'lon', 'lat']

In [20]:
major_feat_df = scaled_train_df.iloc[sampled_indices,:][major_features]
major_feat_df

Unnamed: 0,season,part_of_day,distance_c,dist_ww,dist_ra,dist_cl,dist_ma,cog_c,rot_c,speed_c,acc_c,lon,lat
11177025,0,0,9.773429e-06,0.006345,0.002979,0.006413,0.002979,0.534125,-0.002248,0.156016,-0.981735,0.170650,0.068030
10922261,0,1,4.791735e-06,0.003453,0.003802,0.004347,0.003802,0.692489,0.000299,-0.257508,-0.212930,0.172225,0.071417
2798670,1,2,5.651457e-06,0.005522,0.002205,0.005845,0.002205,0.906041,-0.001545,-0.186144,-0.446655,0.170578,0.069383
3829547,1,2,4.464625e-08,0.005596,0.002296,0.005939,0.002296,0.896979,0.001407,-0.651557,-0.032515,0.170447,0.069419
10556381,0,0,1.654757e-05,0.003813,0.002729,0.004135,0.002729,0.490989,-0.000701,0.718328,-0.404028,0.172316,0.070213
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3747068,1,0,1.511057e-05,0.005956,0.002594,0.006027,0.002594,0.607893,0.000697,0.599045,-0.243445,0.170985,0.068290
2255389,1,1,1.690702e-05,0.003626,0.002860,0.003920,0.002860,0.043911,0.000100,0.748165,0.195251,0.172567,0.070240
9643563,0,1,1.543341e-05,0.005585,0.002207,0.005764,0.002207,0.249087,0.000588,0.625843,2.945784,0.170955,0.068861
366011,1,0,7.408287e-06,0.006722,0.003354,0.006784,0.003354,0.129738,0.000287,-0.040311,-1.312124,0.170337,0.067769


+ Predict and get scores

In [21]:
# GMM cluster
model_name = 'gmm_major_features_7_clusters_final'
cluster_method = 'gmm'  # 'kmeans', 'gmm', 'hdbscan'
n_clusters = 7
major_df, major_scores, _ = cluster(major_feat_df, 
                                    n_clusters=n_clusters,
                                    cluster_method=cluster_method, 
                                    distance_metric='euclidean', 
                                    append_to_df=True, 
                                    save_model_path=models_dir,
                                    save_model_name=model_name,
                                    verbose=2)

The average silhouette_score is : 0.4121503684743852
The Calinski-Harabasz score is : 17490.846924344078
The Davies-Bouldin score is : 5.265608735580207
Saving the trained model to  /data1/gfalouji/repos/behaviour_analysis/sv-nba-analysis/models/gmm_major_features_7_clusters_final_model.joblib
completed with success!


### 1.5.3. <a id='toc1_5_3_'></a>[Cluster 100e3 samples from ``scaled_train_df`` data using the ``gmm_selected_best_10_features_7_clusters_model`` model](#toc1_5_3_)

+ From the ``scaled_train_df`` data, select the targeted columns

In [22]:
## Load selected features average scores >>
selected_features_path = models_dir / 'selected_features' / 'all_selected_features.csv'
all_features_scores = pd.read_csv(selected_features_path)

display(all_features_scores)

Unnamed: 0,selected_features,feature_importance_pca,feature_importance_vtm,feature_importance_mlp,mean,variance
0,acc_c,1.032544e-13,1.0,0.794431,0.5981436,0.2788965
1,month_sin,1.0,6.78417e-07,0.585267,0.5284224,0.2524231
2,distance_c,1.724464e-07,0.0,1.0,0.3333334,0.3333333
3,dist_ra,9.316073e-12,1.205393e-12,0.992225,0.3307418,0.3281705
4,lat,0.0,1.879845e-12,0.987764,0.3292547,0.3252259
5,dist_ww,2.112695e-10,1.680394e-12,0.97262,0.3242066,0.3153297
6,dist_cl,3.995835e-12,1.156549e-12,0.971416,0.3238054,0.3145498
7,lon,0.0,1.845572e-12,0.967629,0.322543,0.3121019
8,dist_ma,7.556275e-13,1.205393e-12,0.960541,0.3201802,0.3075462
9,rot_c,1.990598e-07,1.42943e-10,0.954455,0.3181518,0.3036615


In [23]:
best_features_stop_indx = 10  # Take the first 10 of the sorted features
## Create a list of the selected features
selected_features = all_features_scores['selected_features'].values[:best_features_stop_indx]
print(f'The selected features are: {selected_features}')

The selected features are: ['acc_c' 'month_sin' 'distance_c' 'dist_ra' 'lat' 'dist_ww' 'dist_cl'
 'lon' 'dist_ma' 'rot_c']
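
The ``mean`` and ``variance`` columns in the scores table are consistent with a row-wise mean and sample variance (``ddof=1``) over the three importance columns. The sketch below reproduces this on three rows copied from the table and then keeps the top-ranked names; how the CSV was actually produced is not shown in this notebook, so treat this as an illustration.

```python
import pandas as pd

# Three rows copied from the table above.
toy_scores = pd.DataFrame({
    'selected_features': ['distance_c', 'acc_c', 'month_sin'],
    'feature_importance_pca': [1.724464e-07, 1.032544e-13, 1.0],
    'feature_importance_vtm': [0.0, 1.0, 6.78417e-07],
    'feature_importance_mlp': [1.0, 0.794431, 0.585267],
})
importance_cols = ['feature_importance_pca', 'feature_importance_vtm', 'feature_importance_mlp']

# Row-wise mean and sample variance (pandas default ddof=1) across the three methods.
toy_scores['mean'] = toy_scores[importance_cols].mean(axis=1)
toy_scores['variance'] = toy_scores[importance_cols].var(axis=1)

# Rank by mean importance and keep the top two feature names.
ranked = toy_scores.sort_values('mean', ascending=False).reset_index(drop=True)
top_features = ranked['selected_features'].head(2).tolist()
```

Slicing the first N names, as the cell above does, relies on the CSV already being sorted by ``mean`` in descending order.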


In [24]:
selected_df= scaled_train_df.iloc[sampled_indices,:][selected_features]
selected_df

Unnamed: 0,acc_c,month_sin,distance_c,dist_ra,lat,dist_ww,dist_cl,lon,dist_ma,rot_c
11177025,-0.981735,8.660254e-01,9.773429e-06,0.002979,0.068030,0.006345,0.006413,0.170650,0.002979,-0.002248
10922261,-0.212930,5.000000e-01,4.791735e-06,0.003802,0.071417,0.003453,0.004347,0.172225,0.003802,0.000299
2798670,-0.446655,-5.000000e-01,5.651457e-06,0.002205,0.069383,0.005522,0.005845,0.170578,0.002205,-0.001545
3829547,-0.032515,-5.000000e-01,4.464625e-08,0.002296,0.069419,0.005596,0.005939,0.170447,0.002296,0.001407
10556381,-0.404028,1.224647e-16,1.654757e-05,0.002729,0.070213,0.003813,0.004135,0.172316,0.002729,-0.000701
...,...,...,...,...,...,...,...,...,...,...
3747068,-0.243445,-5.000000e-01,1.511057e-05,0.002594,0.068290,0.005956,0.006027,0.170985,0.002594,0.000697
2255389,0.195251,-8.660254e-01,1.690702e-05,0.002860,0.070240,0.003626,0.003920,0.172567,0.002860,0.000100
9643563,2.945784,1.224647e-16,1.543341e-05,0.002207,0.068861,0.005585,0.005764,0.170955,0.002207,0.000588
366011,-1.312124,1.224647e-16,7.408287e-06,0.003354,0.067769,0.006722,0.006784,0.170337,0.003354,0.000287


+ Predict and get scores

In [25]:
# GMM cluster
model_name = 'gmm_selected_best_10_features_7_clusters_model'
cluster_method = 'gmm'  # 'kmeans', 'gmm', 'hdbscan'
n_clusters = 7
select_df, select_scores, _ = cluster(selected_df, 
                                      n_clusters=n_clusters,
                                      cluster_method=cluster_method, 
                                      distance_metric='euclidean', 
                                      append_to_df=True, 
                                      save_model_path=models_dir,
                                      save_model_name=model_name,
                                      verbose=2)

The average silhouette_score is : 0.4452951058288049
The Calinski-Harabasz score is : 36758.95156277374
The Davies-Bouldin score is : 2.6907939317885843
Saving the trained model to  /data1/gfalouji/repos/behaviour_analysis/sv-nba-analysis/models/gmm_selected_best_10_features_7_clusters_model_model.joblib
completed with success!


## 1.6. <a id='toc1_6_'></a>[Pad the cluster labels to the sampled ``train_df``](#toc1_6_)

+ Initialise the final dataframe with cluster labels

In [26]:
clustered_df = train_df.iloc[sampled_indices,:].copy()

+ Add to ``clustered_df`` the cluster labels from ``dae_df``

In [27]:
clustered_df['dae_gmm_clusters'] = dae_df['gmm_clusters'].values

+ Add to ``clustered_df`` the cluster labels from ``major_df``

In [30]:
clustered_df['major_gmm_clusters'] = major_df['gmm_clusters'].values

+ Add to ``clustered_df`` the cluster labels from ``select_df``

In [31]:
# Use the frame returned by cluster() (select_df), as done for dae_df and major_df
clustered_df['selected_gmm_clusters'] = select_df['gmm_clusters'].values
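
The `.values` in these assignments matters. ``clustered_df`` keeps the sampled row labels, while a frame such as ``dae_df`` was built from `pd.DataFrame(sampled_features)` and so carries a fresh 0..n-1 index. A toy illustration with made-up values shows the difference between label-aligned and positional assignment:

```python
import pandas as pd

# Sampled rows keep their original row labels (here 7, 2, 5).
toy_clustered = pd.DataFrame({'speed_c': [3.1, 1.5, 4.8]}, index=[7, 2, 5])
# Labels produced for the same rows, in the same order, but with a fresh 0..n-1 index.
toy_labels = pd.Series([2, 0, 1], name='gmm_clusters')

# Assigning the Series directly aligns on index labels: only label 2 matches.
aligned = toy_clustered.copy()
aligned['cluster'] = toy_labels

# Assigning the underlying array copies the labels by position instead.
positional = toy_clustered.copy()
positional['cluster'] = toy_labels.values
```

Positional assignment is only correct when both frames list the sampled rows in the same order, as they do here because all three were selected with the same ``sampled_indices``.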

+ Display the final ``clustered_df``

In [32]:
clustered_df

Unnamed: 0,original_index,epoch,stopped,cog_c,aad,rot_c,speed_c,distance_c,acc_c,cdd,...,datetime,season,part_of_day,month_sin,month_cos,hour_sin,hour_cos,dae_gmm_clusters,major_gmm_clusters,selected_gmm_clusters
11177025,88643178,1651062360,0,192.284979,8.092272,-0.809227,3.080386,30.803857,-0.014123,14867.092647,...,2022-04-27 12:26:00,0,0,8.660254e-01,-0.500000,1.224647e-16,-1.000000e+00,3,4,3
10922261,86165953,1651414800,0,249.296089,1.075787,0.107579,1.510257,15.102573,-0.00315,4200.399894,...,2022-05-01 14:20:00,0,1,5.000000e-01,-0.866025,-5.000000e-01,-8.660254e-01,1,4,0
2798670,22356903,1657307160,0,326.17491,5.561214,-0.556121,1.781224,17.81224,-0.006486,943.410321,...,2022-07-08 19:06:00,1,2,-5.000000e-01,-0.866025,-9.659258e-01,2.588190e-01,2,0,0
3829547,29843047,1657485750,0,322.912473,5.066277,0.506628,0.014072,0.140716,-0.000575,11299.647552,...,2022-07-10 20:42:30,1,2,-5.000000e-01,-0.866025,-8.660254e-01,5.000000e-01,2,0,0
10556381,83566900,1654516340,0,176.756161,2.522277,-0.252228,5.215458,52.154577,-0.005877,47245.947717,...,2022-06-06 11:52:20,0,0,1.224647e-16,-1.000000,2.588190e-01,-9.659258e-01,6,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3747068,29010285,1657454320,0,218.841638,2.508311,0.250831,4.762545,47.625448,-0.003585,36847.261715,...,2022-07-10 11:58:40,1,0,-5.000000e-01,-0.866025,2.588190e-01,-9.659258e-01,6,0,0
2255389,17677960,1659551120,0,15.808122,0.360303,0.03603,5.328749,53.287488,0.002676,7209.94456,...,2022-08-03 18:25:20,1,1,-8.660254e-01,-0.500000,-1.000000e+00,-1.836970e-16,5,0,0
9643563,75470836,1685898790,0,89.671164,2.116945,0.211695,4.864297,48.642969,0.041935,1173.095206,...,2023-06-04 17:13:10,0,1,1.224647e-16,-1.000000,-9.659258e-01,-2.588190e-01,1,4,3
366011,3058763,1687423930,0,46.70566,1.032495,0.103249,2.334941,23.349413,-0.018839,7685.923656,...,2023-06-22 08:52:10,1,0,1.224647e-16,-1.000000,8.660254e-01,-5.000000e-01,0,0,3


+ Save the final ``clustered_df``

In [33]:
target_save_path = data_dir / 'clusters' / 'final_clusters_df.csv'
clustered_df.to_csv(target_save_path, index=False)

## 1.7. <a id='toc1_7_'></a>[Behaviour Analysis](#toc1_7_)

### Import 'final_clusters_df.csv'

In [6]:
target_save_path = data_dir / 'clusters' / 'final_clusters_df.csv'

clustered_df = pd.read_csv (target_save_path)

### Define the feature lists
+ 'group_columns': List of column names to group by.
+ 'describe_features': List of features to describe in each group.
+ 'selected_features': List of features which represent the temporal and spatial patterns

In [13]:
group_columns = ['dae_gmm_clusters', 'major_gmm_clusters', 'selected_gmm_clusters']

describe_features = [
    'season', 'part_of_day',
    'aad', 'cdd', 'dir_ccs', 'cog_c', 'rot_c', 'distance_c', 'dist_ww', 'dist_ra',
    'dist_cl', 'dist_ma', 'speed_c', 'acc_c', 'lon', 'lat'
]

selected_features = ['part_of_day', 'season', 'cdd', 'distance_c']

### Get a new dataframe per method with describe_features and that method's cluster labels
+ used for plots

In [10]:
dae_cluster_df = clustered_df [describe_features+[group_columns[0]]]
major_cluster_df = clustered_df [describe_features+[group_columns[1]]]
selected_cluster_df = clustered_df [describe_features+[group_columns[2]]]

### Get the describe dataframe for each method

In [14]:
dae_describe_df = describe_clusters (clustered_df, group_columns[0], selected_features)

major_describe_df = describe_clusters (clustered_df, group_columns[1], selected_features)

selected_describe_df = describe_clusters (clustered_df, group_columns[2], selected_features)
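
The layout of the describe dataframes (one column per cluster, one row per feature and statistic) can be obtained with a pandas group-by. `describe_by_cluster` below is a hypothetical stand-in; the actual implementation of `describe_clusters` in `src.cluster` is not shown in this notebook.

```python
import pandas as pd


def describe_by_cluster(df, group_column, features):
    """Hypothetical stand-in for `describe_clusters`.

    Returns one column per cluster and one row per (feature, statistic) pair.
    """
    return df.groupby(group_column)[features].describe().T


toy_df = pd.DataFrame({
    'dae_gmm_clusters': [0, 0, 0, 1, 1, 2],
    'part_of_day': [0, 0, 1, 2, 2, 1],
    'season': [1, 0, 1, 0, 0, 3],
})
toy_describe = describe_by_cluster(toy_df, 'dae_gmm_clusters', ['part_of_day', 'season'])
```

A cluster with a single member has an undefined standard deviation (NaN), which is why the ``std`` entry is empty for a cluster whose ``count`` is 1.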

In [15]:
dae_describe_df

Unnamed: 0,dae_gmm_clusters,0,1,2,3,4,5,6
part_of_day,count,14243.0,18300.0,11390.0,18696.0,12244.0,12795.0,12332.0
part_of_day,mean,0.001685038,0.9955738,1.939596,0.01668806,1.999673,1.017663,0.0002432695
part_of_day,std,0.05680708,0.06801081,0.238244,0.1281033,0.02213525,0.131729,0.01559584
part_of_day,min,0.0,0.0,1.0,0.0,0.0,1.0,0.0
part_of_day,25%,0.0,1.0,2.0,0.0,2.0,1.0,0.0
part_of_day,50%,0.0,1.0,2.0,0.0,2.0,1.0,0.0
part_of_day,75%,0.0,1.0,2.0,0.0,2.0,1.0,0.0
part_of_day,max,2.0,2.0,2.0,1.0,2.0,2.0,1.0
season,count,14243.0,18300.0,11390.0,18696.0,12244.0,12795.0,12332.0
season,mean,0.7088394,0.7437705,0.9191396,0.7253958,0.840575,0.7092614,0.7326468


In [16]:
major_describe_df

Unnamed: 0,major_gmm_clusters,0,1,2,3,4,5,6
part_of_day,count,40748.0,760.0,6.0,3.0,51873.0,6609.0,1.0
part_of_day,mean,0.7940512,1.098684,0.6666667,0.3333333,0.7535134,0.8881828,2.0
part_of_day,std,0.7962424,0.8400379,0.5163978,0.5773503,0.7893544,0.8292809,
part_of_day,min,0.0,0.0,0.0,0.0,0.0,0.0,2.0
part_of_day,25%,0.0,0.0,0.25,0.0,0.0,0.0,2.0
part_of_day,50%,1.0,1.0,1.0,0.0,1.0,1.0,2.0
part_of_day,75%,1.0,2.0,1.0,0.5,1.0,2.0,2.0
part_of_day,max,2.0,2.0,1.0,1.0,2.0,2.0,2.0
season,count,40748.0,760.0,6.0,3.0,51873.0,6609.0,1.0
season,mean,1.587489,0.0,0.0,0.0,0.0,1.732486,3.0


In [17]:
selected_describe_df

Unnamed: 0,selected_gmm_clusters,0,1,2,3,4,5,6
part_of_day,count,49700.0,8.0,3.0,34121.0,221.0,13151.0,2796.0
part_of_day,mean,0.7378471,0.625,0.3333333,0.8021453,1.108597,0.8867006,0.7875536
part_of_day,std,0.7799811,0.9161254,0.5773503,0.8005948,0.87748,0.8321114,0.796191
part_of_day,min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
part_of_day,25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
part_of_day,50%,1.0,0.0,0.0,1.0,1.0,1.0,1.0
part_of_day,75%,1.0,1.25,0.5,1.0,2.0,2.0,1.0
part_of_day,max,2.0,2.0,1.0,2.0,2.0,2.0,2.0
season,count,49700.0,8.0,3.0,34121.0,221.0,13151.0,2796.0
season,mean,0.7317505,0.0,0.0,0.7731309,0.2217195,0.8430538,0.806867


### Plots for different features with cluster labels

In [None]:
save_path = images_dir / f'dae_part_of_day_mean_std.png'

plot_mean_std_feature(
    df=dae_cluster_df, 
    group_column='dae_gmm_clusters', 
    feature='part_of_day',
    title='Mean and Std of Part of Day across Clusters',
    xlabel='Cluster',
    ylabel='Part of Day',
    save_path=None
)

### Analyse for DAE method
+ count
+ temporal features
    - 'part_of_day'
    - 'season'
+ spatial features
    - 'cdd'
    - 'distance_c'


### Analyse for latent features method

In [None]:
save_path = images_dir / f'dae_part_of_day_mean_std.png'

plot_mean_std_feature(
    df=dae_cluster_df, 
    group_column='dae_gmm_clusters', 
    feature='part_of_day',
    title='Mean and Std of Part of Day across Clusters',
    xlabel='Cluster',
    ylabel='Part of Day',
    save_path=save_path
)


### Analyse for major features method

In [None]:
save_path = images_dir / f'major_part_of_day_mean_std.png'

plot_mean_std_feature(
    df=major_cluster_df, 
    group_column='major_gmm_clusters', 
    feature='part_of_day',
    title='Mean and Std of Part of Day across Clusters',
    xlabel='Cluster',
    ylabel='Part of Day',
    save_path=save_path,
    std=True
)

### Analyse for selected features method

In [None]:
save_path = images_dir / f'selected_part_of_day_mean_std.png'

plot_mean_std_feature(
    df=selected_cluster_df, 
    group_column='selected_gmm_clusters', 
    feature='part_of_day',
    title='Mean and Std of Part of Day across Clusters',
    xlabel='Cluster',
    ylabel='Part of Day',
    save_path=save_path
)