This project investigates scenarios where the relationship between features is non-linear or the data contains multiple clusters, each exhibiting different feature relationships. We develop methods for combining multiple Principal Component Analyses (PCAs) using the Expectation-Maximization (EM) algorithm and introduce constraints on an autoencoder neural network so that it mimics PCA. The project is implemented in Python and PyTorch and is evaluated on the Vehicle Silhouettes dataset.
The dataset used in this project is the Vehicle Silhouettes dataset, which contains geometric features extracted from vehicle silhouettes; these features are used to distinguish between vehicle classes.
Features:
- COMPACTNESS
- CIRCULARITY
- DISTANCE CIRCULARITY
- RADIUS RATIO
- PR.AXIS ASPECT RATIO
- MAX.LENGTH ASPECT RATIO
- SCATTER RATIO
- ELONGATEDNESS
- PR.AXIS RECTANGULARITY
- MAX.LENGTH RECTANGULARITY
- SCALED VARIANCE_MAJOR
- SCALED VARIANCE_MINOR
- SCALED RADIUS OF GYRATION
- SKEWNESS ABOUT_MAJOR
- SKEWNESS ABOUT_MINOR
- KURTOSIS ABOUT_MAJOR
- KURTOSIS ABOUT_MINOR
- HOLLOWS RATIO
To run this project locally, follow these steps:

- Clone the repository:

  ```bash
  git clone https://github.com/DaZeTw/EM_Algorithm-Autoencoder.git
  cd EM_Algorithm-Autoencoder
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
For an interactive version of this project, you can view and run the code in Google Colab: Google Colab Notebook
The project workflow consists of the following steps (illustrative sketches follow the list):
- Data Preprocessing: Load and preprocess the dataset.
- Dimensionality Reduction: Apply PCA and Kernel PCA.
- Clustering: Perform clustering using K-Means, EM algorithm, and Gaussian Mixture Models.
- Autoencoders: Train and evaluate autoencoders for dimensionality reduction.
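As a rough illustration of the preprocessing and dimensionality-reduction steps, the sketch below loads the data with pandas and applies PCA and Kernel PCA from scikit-learn. The file name `vehicle.csv`, the column layout, and the kernel parameters are assumptions for illustration and may not match the repository's actual code.

```python
# Minimal sketch of preprocessing and dimensionality reduction.
# Assumes a CSV file "vehicle.csv" whose last column holds the class label;
# the actual file name/format used by the project may differ.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

df = pd.read_csv("vehicle.csv")
X = df.iloc[:, :-1].values          # 18 geometric features
y = df.iloc[:, -1].values           # vehicle class labels

# Standardize features so PCA is not dominated by large-scale features
X_scaled = StandardScaler().fit_transform(X)

# Linear PCA: keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print("PCA components:", pca.n_components_,
      "explained variance:", pca.explained_variance_ratio_.sum())

# Kernel PCA with an RBF kernel to capture non-linear structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X_scaled)
```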
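The clustering step could look roughly like the following, using K-Means and a Gaussian Mixture Model (whose parameters are estimated with the EM algorithm). The choice of four components reflects the four vehicle classes in the standard Vehicle Silhouettes data and is an assumption here, as is clustering on the PCA-reduced features `X_pca` from the previous sketch.

```python
# Sketch of the clustering step on the PCA-reduced features from the
# previous snippet. n_clusters=4 assumes the four vehicle classes of the
# standard Vehicle Silhouettes data.
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_pca)

# GaussianMixture is fitted with the EM algorithm; init_params controls
# how the components are initialized (e.g. "kmeans" or "random").
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      init_params="kmeans", random_state=0)
gmm_labels = gmm.fit_predict(X_pca)
```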
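For the autoencoder step, one common way to constrain an autoencoder so that it mimics PCA is to use a single linear, tied-weight layer trained with a mean-squared-error reconstruction loss; the sketch below illustrates that idea in PyTorch and is not necessarily the exact architecture used in the project.

```python
# Sketch of a PCA-like autoencoder in PyTorch. The linear, tied-weight
# variant shown here is one way to make an autoencoder mimic PCA; the
# constraint actually used in the project may differ.
import torch
import torch.nn as nn

class PCAAutoencoder(nn.Module):
    def __init__(self, n_features: int, n_components: int):
        super().__init__()
        # A single linear encoding layer; the decoder reuses (ties) the
        # same weights, so the model learns the same subspace as PCA
        # when trained with an MSE reconstruction loss.
        self.encoder = nn.Linear(n_features, n_components, bias=False)

    def forward(self, x):
        z = self.encoder(x)              # project to the latent space
        x_hat = z @ self.encoder.weight  # tied-weight reconstruction
        return x_hat, z

def train(model, X, epochs=200, lr=1e-2):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        x_hat, _ = model(X)
        loss = loss_fn(x_hat, X)
        loss.backward()
        optimizer.step()
    return model

# Usage (X_scaled from the preprocessing sketch above):
# X_t = torch.tensor(X_scaled, dtype=torch.float32)
# model = train(PCAAutoencoder(X_t.shape[1], n_components=2), X_t)
```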
To run the main script, use:
```bash
python Clustering_and_Dimensionality_Reductio.py
```

The following metrics are used to evaluate clustering quality (a sketch computing them follows the list):

- Adjusted Rand Index (ARI): Measures the similarity between true labels and predicted labels.
- Normalized Mutual Information (NMI): Measures the mutual dependence between true labels and predicted labels.
- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters.
- Davies-Bouldin Index (DB): Measures the average similarity ratio of each cluster with its most similar cluster.
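A sketch of how these metrics can be computed with scikit-learn follows; `y`, `X_pca`, and `kmeans_labels` refer to the hypothetical variables from the earlier sketches.

```python
# Sketch of the evaluation metrics using scikit-learn; variables come
# from the earlier sketches (y = true labels, kmeans_labels = predictions).
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score,
                             davies_bouldin_score)

ari = adjusted_rand_score(y, kmeans_labels)           # higher is better
nmi = normalized_mutual_info_score(y, kmeans_labels)  # higher is better
sil = silhouette_score(X_pca, kmeans_labels)          # higher is better
db = davies_bouldin_score(X_pca, kmeans_labels)       # lower is better

print(f"ARI={ari:.3f}  NMI={nmi:.3f}  Silhouette={sil:.3f}  DB={db:.3f}")
```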
Key findings from the experiments:
- K-Means: Achieves the best silhouette score and DB index.
- K-Means Initialization: Shows the best ARI.
- Random Initialization: Provides high ARI and NMI.
- Hierarchical Initialization: Performs worse across most metrics.
- Gaussian Mixture Model: Exhibits high NMI and a good silhouette score.
- PCA: Effective for linear patterns with high explained variance.
- Simple Autoencoder: Capable of capturing complex, non-linear patterns but requires careful tuning.
- PCA Autoencoder: Balances capturing variance and maintaining linear interpretability.
- The choice of initialization method significantly impacts clustering performance.
- PCA remains a robust baseline for linear patterns, while advanced methods like autoencoders offer significant advantages for non-linear relationships.
The project demonstrates the importance of selecting appropriate methods and initialization strategies for effective clustering and dimensionality reduction. Future work could explore further optimization and application to different datasets.