Different results from the DBCV MATLAB implementation of Moulavi et al. #3
Comments
Hi @bakachan19, Thanks for pointing it out. I'm going to investigate this issue as soon as possible. Also, note that the distance metric used during the computation affects the results. For instance, the squared Euclidean distance (currently the default value) holds a value of
Hi @FelSiq, Have an amazing weekend.
TL;DR: The main branch has been updated with full support for the original MATLAB implementation. The default MST algorithm is still SciPy's Kruskal implementation, but you can now switch to the Prim variant implemented in the original version. If you require strict equivalence with the original MATLAB implementation, the following setup produces equivalent results:

    import dbcv
    X, y = ...
    score = dbcv.dbcv(X, y, metric="sqeuclidean", noise_id=0, use_original_mst_implementation=True)

More info below.

Hi @bakachan19. It appears that most of the discrepancy stems from the examples in https://github.com/pajaskowiak/dbcv being mislabeled (noise instances are labeled as -1 instead of 0, which appears to be the assumption of their implementation). They seem to use squared Euclidean distance by default, just like my implementation, so the default metric value is not an issue. Below, I provide a comparison of estimated DBCV values using their example files (after correcting noise labels):

EDIT: After comparing the implementations, I found a small bug in my code when computing core distances between instances in the same cluster. I pushed a quick fix to the main branch, so the results should be closer.

EDIT 2: After further analysis, I've identified all differences between the two implementations. It turns out that their MST algorithm is not equivalent to SciPy's Kruskal MST algorithm, as the former tends to generate more hub nodes, which in turn affects the estimated metric value. To address this issue, I've translated the MATLAB MST implementation and now offer it as an alternative to Kruskal's MST using a new user parameter (
Both MST algorithms produce trees with the same total edge weight but different structure, which affects the number of internal nodes/edges. Since DBCV depends on internal edges, the metric value is affected by the MST algorithm implementation. For reference, I added below a table comparing the MST's total edge weight and internal node count for each cluster and each implementation. Note that the internal node count for the MATLAB implementation also varies depending on the node from which the algorithm starts building the tree (which is set to the first index by default).
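To make the point about tree structure concrete, here is a toy sketch in pure Python. It is not the code from this package, from SciPy, or from the MATLAB implementation: `prim_mst`, `kruskal_mst`, the helper functions, and the 4-node weight matrix `W` are all made up for illustration, and the tie-breaking rules are simply the ones these small functions happen to use. It only shows that, when edge weights are tied, two valid MSTs can share the same total weight while having a different number of internal nodes (nodes with degree greater than 1).

```python
# Toy symmetric weight matrix with tied edge weights (hypothetical data).
W = [
    [0.0, 1.0, 2.0, 2.0],
    [1.0, 0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 0.0],
]


def prim_mst(W, start=0):
    """Dense Prim: grow the tree from `start`, ties go to the lowest index."""
    n = len(W)
    in_tree = [False] * n
    in_tree[start] = True
    best = list(W[start])   # cheapest known connection of each node to the tree
    parent = [start] * n    # tree node that realizes that cheapest connection
    edges = []
    for _ in range(n - 1):
        j = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        edges.append((parent[j], j, best[j]))
        in_tree[j] = True
        for v in range(n):
            # Strict comparison: an equally cheap edge does not replace the old one.
            if not in_tree[v] and W[j][v] < best[v]:
                best[v] = W[j][v]
                parent[v] = j
    return edges


def kruskal_mst(W):
    """Kruskal with union-find: edges sorted by (weight, i, j)."""
    n = len(W)
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    tree = []
    for w, i, j in sorted((W[i][j], i, j) for i in range(n) for j in range(i + 1, n)):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different components
            root[ri] = rj
            tree.append((i, j, w))
    return tree


def tree_weight(edges):
    return sum(w for _, _, w in edges)


def internal_node_count(edges, n):
    degree = [0] * n
    for i, j, _ in edges:
        degree[i] += 1
        degree[j] += 1
    return sum(1 for d in degree if d > 1)


for name, tree in [
    ("kruskal", kruskal_mst(W)),
    ("prim, start=0", prim_mst(W, start=0)),
    ("prim, start=3", prim_mst(W, start=3)),
]:
    print(name, tree_weight(tree), internal_node_count(tree, len(W)))
```

On this toy input all three trees have total weight 3.0, but the Kruskal tree and the Prim tree started at node 0 are stars with one internal node, while the Prim tree started at node 3 is a path with two internal nodes. Real data with squared Euclidean distances has fewer exact ties, so the effect there depends on the dataset.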
The following code should be equivalent to the original MATLAB implementation:

    import dbcv
    X, y = ...
    score = dbcv.dbcv(X, y, metric="sqeuclidean", noise_id=0, use_original_mst_implementation=True)

I'm keeping Kruskal's MST as the default since it's more optimized. In theory, despite any discrepancies in results due to different parameter configurations (such as distance metric or MST algorithm), both implementations should convey similar insights. Best regards,
After the most recent updates, I believe this issue has been solved. If you find any other problem, please let me know. Felipe.
@FelSiq Thank you for taking the time to look into this issue and fix it. Have an amazing day!
Hi,
thanks for the Python implementation of DBCV.
I used your implementation on the dataset_1.txt provided by the authors of the DBCV paper at this link: https://github.com/pajaskowiak/dbcv .
With your implementation I get a score of 0.85434, while the authors report a score of 0.6149 with their MATLAB implementation.
Any idea on what is causing this difference in the results?
Thank you!