Bachelor's thesis project exploring clustering approaches for multivariate Ecological Momentary Assessment (EMA) time-series data.
This project implements a pipeline for analyzing EMA time-series data through feature extraction, dimensionality reduction, and clustering. The goal is to identify distinct behavioral patterns in longitudinal self-report data.
- Preprocessing — Data cleaning, transformation, and structured/unstructured representations of EMA time series
- Feature Extraction — Coefficient extraction using multiple regression models (Linear, Polynomial, Lasso, Ridge, XGBoost, Random Forest, SVR, MLP, VAR)
- Dimensionality Reduction — PCA and PLSR with cross-validation
- Clustering — KMeans, KMedoids, Agglomerative Clustering, Gaussian Mixture Models, and DTW-based time-series clustering
- Evaluation — Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index, BIC/AIC
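The stages listed above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the notebooks' actual code. The array shapes, the `extract_coefficients` helper, the per-item linear trend, and the range of cluster counts are all assumptions made for the example. It uses only NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for EMA data (assumed shape):
# 40 participants, 60 time points, 4 self-report items.
n_participants, n_timepoints, n_items = 40, 60, 4
series = rng.normal(size=(n_participants, n_timepoints, n_items))

# Give the first half of the participants an upward trend so two groups exist.
trend = np.linspace(0.0, 3.0, n_timepoints)[:, None]
series[: n_participants // 2] += trend


def extract_coefficients(participant_series):
    """Hypothetical helper: fit a linear trend per item, return slopes and intercepts."""
    t = np.arange(participant_series.shape[0]).reshape(-1, 1)
    model = LinearRegression().fit(t, participant_series)
    # coef_ has shape (n_items, 1); intercept_ has shape (n_items,)
    return np.concatenate([model.coef_.ravel(), model.intercept_])


# Feature extraction: one coefficient vector per participant.
features = np.vstack([extract_coefficients(s) for s in series])

# Dimensionality reduction: standardize, then project onto two components.
features = StandardScaler().fit_transform(features)
reduced = PCA(n_components=2, random_state=0).fit_transform(features)

# Clustering and evaluation over a small range of cluster counts.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = {
        "silhouette": silhouette_score(reduced, labels),  # higher is better
        "calinski_harabasz": calinski_harabasz_score(reduced, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(reduced, labels),  # lower is better
    }

# Pick the cluster count with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_k, scores[best_k])
```

Swapping `LinearRegression` for another regressor, or `KMeans` for a different clustering algorithm, changes only the corresponding step. The three indices do not always agree on the best number of clusters, so it helps to compare them side by side.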
├── Clustering.ipynb # Main analysis notebook (feature extraction, reduction, clustering)
├── Preprocessing.ipynb # Data preprocessing pipeline
├── clustering.py # Reusable clustering utility functions
└── Data/
    ├── Data/ # Raw data (codebook + original datasets)
    ├── preprocessed/ # Cleaned and transformed datasets
    ├── coefficients/ # Extracted regression coefficients
    └── cluster/ # Clustering results
- Python 3
- scikit-learn / scikit-learn-extra
- tslearn (DTW clustering)
- pandas / NumPy
- matplotlib
Open the Jupyter notebooks in order:
1. Preprocessing.ipynb: Run preprocessing steps
2. Clustering.ipynb: Run the full analysis pipeline (feature extraction → clustering)
The clustering.py module provides helper functions used by the notebooks.
This project is part of an academic thesis. Please cite appropriately if you use this work.