Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different results from DBCV matlab implementation of Moulavi et al. #3

Closed
bakachan19 opened this issue Mar 3, 2024 · 5 comments
Closed
Labels
fixed The presented problem has been successfully fixed.

Comments

@bakachan19
Copy link

Hi,
thanks for the python implementation of DBCV.
I used your implementation on the dataset_1.txt provided by the authors of DBCV paper from this link: https://github.com/pajaskowiak/dbcv .

With your implementation I get a score of 0.85434 while the authors report a score of 0.6149 with their matlab implementation.
Any idea on what is causing this difference in the results?

Thank you!

@FelSiq
Copy link
Owner

FelSiq commented Mar 3, 2024

Hi @bakachan19,

Thanks for pointing it out. I'm going to investigate this issue as soon as possible.

Also, note that the distance metric used during the computation affects the results. For instance, the squared euclidean distance (currently the default value) holds a value of 0.85420, whereas the standard euclidean distance, 0.70895.

@bakachan19
Copy link
Author

Hi @FelSiq,
I wasn't aware of the different metrics available. Thanks for letting me know.
Also, I am looking forward to your investigation outcome.

Have an amazing weekend.

@FelSiq
Copy link
Owner

FelSiq commented Apr 11, 2024

TL;DR

The main branch has been updated with full support to the original MATLAB implementation. The default MST algorithm was kept as the Scipy's Kruskal's implementation, but you can now swap to the Prim's variant implemented in the original version.

If you require strictly equivalency to the original MATLAB implementation, the following setup produces equivalent results to the MATLAB original implementation:

import dbcv
X, y = ...
score = dbcv.dbcv(X, y, metric="sqeuclidean", noise_id=0, use_original_mst_implementation=True)

More info below.


Hi @bakachan19.

It appears that most of the results discrepancy stems from the examples in https://github.com/pajaskowiak/dbcv being mislabeled (noise instances are labeled as -1 instead of 0, which appears to be the assumption of their implementation).

They seem to be using squared Euclidean distance by default, just like my implementation, so the default metric value is not an issue.

Below, I provide a comparison of estimated DBCV values using their example files (after correcting noise labels):

EDIT: After some analysis comparing implementations, I found a little bug on my code while computing core distances between instances in the same cluster. I pushed a quick-fix to the main branch, so the results should be closer. However, they don't quite match yet.

EDIT 2: After further analysis, I've identified all differences between the two implementations. It turns out that their MST algorithm is not equivalent to Scipy's Kruskal's MST algorithm, as the former tends to generate more hub nodes, consequently affecting the estimated metric value. To address this issue, I've translated the MATLAB MST implementation and now offer it as an alternative to Kruskal's MST using a new user parameter (dbcv(..., use_original_mst_implementation=True)).

Dataset Python (deprecated) Python (main branch, Scipy's Kruskal's) Python (main branch, Translated MST algorithm) MATLAB
dataset_1.txt 0.8542 0.8566 0.8576 0.8576
dataset_2.txt 0.8939 0.5405 0.8103 0.8103
dataset_3.txt 0.6572 0.6308 0.6319 0.6319
dataset_4.txt 0.9116 0.8456 0.8688 0.8688

Both MST algorithms produces trees with the same total edge weight but varying structure, which affects the number of internal nodes/edges. Since DBCV depends on internal edges, the metric value is affected by the MST algorithm implementation.

For reference, I added below a table comparing the MST's total edge weight and internal node count for cluster and each implementation. Note that the Internal node count for the MATLAB implementation also varies depending on the node the algorithm start building the tree (which is set to the first index by default).

Dataset MST total sum (Kruskal's) Internal node count (Kruskal's) MST total sum (MATLAB) Internal node count (MATLAB; index 0 start)
dataset_1.txt 0.9826 115 0.9826 115
dataset_1.txt 3.9304 114 3.9304 114
dataset_1.txt 3.9304 92 3.9304 92
dataset_1.txt 0.9826 92 0.9826 92
dataset_2.txt 2049.8717 46 2049.8717 40
dataset_2.txt 2080.0746 43 2080.0746 41
dataset_2.txt 87754.3054 256 87754.3054 200
dataset_2.txt 122747.7005 300 122747.7005 216
dataset_3.txt 84.371 242 84.371 209
dataset_3.txt 93.5867 221 93.5867 192
dataset_4.txt 753.2395 31 753.2395 25
dataset_4.txt 809.8137 14 809.8137 13
dataset_4.txt 895.8691 20 895.8691 19
dataset_4.txt 3411.5984 40 3411.5984 38
dataset_4.txt 33179.6451 140 33179.6451 113
dataset_4.txt 39420.2701 123 39420.2701 100

The following code should be equivalent to the original MATLAB implementation:

import dbcv
X, y = ...
score = dbcv.dbcv(X, y, metric="sqeuclidean", noise_id=0, use_original_mst_implementation=True)

I'm keeping Kruskal's MST as the default value since it's more optimized. In theory, despite any potential discrepancies in results due to different parameter configurations (such as distance metric or MST algorithm), both implementations should convey similar insights.

Best regards,
Felipe.

@FelSiq
Copy link
Owner

FelSiq commented Apr 13, 2024

After the most recent updates, I believe this issue has been solved. If you find any other problem, please let me know.

Felipe.

@FelSiq FelSiq closed this as completed Apr 13, 2024
@FelSiq FelSiq added the fixed The presented problem has been successfully fixed. label Apr 29, 2024
@bakachan19
Copy link
Author

@FelSiq thank you for taking the time to look into this issue and fix it.

have an amazing day!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed The presented problem has been successfully fixed.
Projects
None yet
Development

No branches or pull requests

2 participants