Skip to content

Ikerlz/dcd

Repository files navigation

dcd: distributed spectral clustering

implemented with Apache Spark

Introduction

Community detection for large scale networks is of great importance in modern data analysis. In this work, we develop a distributed spectral clustering algorithm to handle this task. Specifically, we distribute a certain number of pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first conducted on the master to select pseudo centers. Next, the indexes of the pseudo centers are broadcasted to workers to complete the distributed community detection task using an SVD (singular value decomposition) type algorithm. The proposed distributed algorithm has three advantages. First, the communication cost is low, since only the indexes of pseudo centers are communicated. Second, no further iterative algorithm is needed on workers while a “one-shot” computation suffices. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. We develop a Python package DCD (The Python package is provided in https://github.com/Ikerlz/dcd.) to implement the distributed algorithm on a Spark system and establish theoretical properties with respect to the estimation accuracy and mis-clustering rates under the stochastic block model. Experiments on a variety of synthetic and empirical datasets are carried out to further illustrate the advantages of the methodology.


Dependencies


Run the PySpark code on the Spark platform

Simulation

python simulation.py

Real Data

  • AgBlog
python AGBlog/main.py
  • Pumbed
python Pumbed/main.py
  • Pokec
python Pokec/main.py
  • Cora
python Cora/main.py

Norm based method

This is an implement of A Divide and Conquer Framework for Distributed Graph Clustering

python DC_method/main.py

Comparison

Compare dcd with the norm based method

python Comparison/main.py

License

MIT License


References

Wu, S., Li, Z. & Zhu, X., (2023) A Distributed Community Detection Algorithm for Large Scale Networks Under Stochastic Block Models.

About

Distributed Community Detection Algorithm for Stochastic Block Model

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages