MSBD5001 Big Data Computing Project -- Algorithm Parallelization
[TOC]
Below is the source tree of our parallel implementation of DBSCAN:
```
src
├── ipynb
│   ├── parallel_dbscan.ipynb
│   ├── MatrixDBSCAN.ipynb
│   ├── NaiveDBSCAN.ipynb
│   ├── partitioning.ipynb
│   ├── rtree_fixpartition.ipynb
│   └── rtree_partitioning.ipynb
├── parallel
│   ├── __init__.py
│   ├── dbscan_general.py
│   └── dbscan_rtree.py
├── serial
│   ├── __init__.py
│   └── dbscan.py
├── settings.py
├── test
│   ├── playground.py
│   └── test.py
└── utils
    ├── __init__.py
    ├── logger.py
    └── utils.py
```
Our Python code is under the folders `parallel`, `serial`, `test`, and `utils`, while the `ipynb` notebooks record how we explored the implementation.

- `serial`: the serial DBSCAN algorithm and its improved variants; this module acts as the local DBSCAN run inside each partition.
- `parallel`: our partitioning and merging functions in PySpark. The two files `dbscan_general.py` and `dbscan_rtree.py` contain, respectively, our spatially even split strategy and our two R-tree-based partitioning strategies (see the PySpark sketch after this list). The whole folder also acts as a module that can be called from outside.
- `utils`: utility functions, such as the clustering evaluation function and the timer.
- `settings.py`: configuration and global status.
- `ipynb`: Jupyter notebooks used for exploratory work and for plotting clustering results. The final plots of our experiments mainly come from `parallel_dbscan.ipynb`.
- `test`: a place for test scripts.
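As a rough illustration of how the spatially even split and the merge step fit together, here is a minimal PySpark sketch. It is not the repo's code: `grid_keys`, `local_dbscan`, the use of sklearn's `DBSCAN` as a stand-in local clusterer, and all hyper-parameter values are assumptions for the example.

```python
import numpy as np
from pyspark import SparkContext
from sklearn.cluster import DBSCAN

EPS, MIN_PTS, CELL = 0.5, 5, 2.0   # assumed hyper-parameters (CELL > EPS)

def grid_keys(p):
    # Assign a point to its own grid cell plus every cell within EPS of it
    # (the halo), so clusters crossing a cell border can be stitched later.
    x, y = p
    keys = {(int((x + dx) // CELL), int((y + dy) // CELL))
            for dx in (-EPS, 0.0, EPS) for dy in (-EPS, 0.0, EPS)}
    return [(k, (x, y)) for k in keys]

def local_dbscan(kv):
    # Run the local clusterer inside one cell; tag each local cluster id
    # with the cell key so that it is globally unique.
    key, pts = kv
    pts = list(pts)
    labels = DBSCAN(eps=EPS, min_samples=MIN_PTS).fit_predict(np.array(pts))
    return [(p, (key, int(l))) for p, l in zip(pts, labels) if l != -1]

sc = SparkContext.getOrCreate()
data = [tuple(p) for p in np.random.rand(1000, 2) * 10]   # toy data
tagged = (sc.parallelize(data)
            .flatMap(grid_keys)
            .groupByKey()
            .flatMap(local_dbscan)
            .groupByKey())      # point -> all (cell, local id) tags it received

# Merge on the driver: a point carrying two different tags proves that the
# two local clusters belong to one global cluster (plain union-find).
parent = {}
def find(c):
    while parent.setdefault(c, c) != c:
        c = parent[c]
    return c
for _, tags in tagged.collect():
    tags = list(tags)
    for t in tags[1:]:
        parent[find(t)] = find(tags[0])
```

In a real run one would collect only the points that carry more than one tag, so the driver-side merge stays small; the R-tree variants in `dbscan_rtree.py` then replace the fixed grid with better-shaped cells.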
In `dbscan.py`, two versions of the serial DBSCAN algorithm are implemented: a naive method that redundantly recomputes distances for every region query, and an optimized method that precomputes the distance matrix.
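A minimal sketch of the matrix variant is below; the real code lives in `serial/dbscan.py`, so the names here are illustrative only, and 2-D numpy input is assumed.

```python
import numpy as np
from collections import deque

def matrix_dbscan(X, eps, min_pts):
    n = len(X)
    # Pre-compute all pairwise distances once, instead of re-computing them
    # inside every region query as the naive version does.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)            # -1 = unvisited / noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue                   # already assigned, or not a core point
        labels[i] = cluster            # fresh core point: start a new cluster
        queue = deque(neighbours[i])   # BFS expansion over the eps-graph
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])   # only core points expand
        cluster += 1
    return labels
```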
We tested both on the Spiral dataset from the Clustering basic benchmark, which contains 312 two-dimensional points forming 3 clusters. The running times in milliseconds were:
```
Naive DBSCAN:
predict: 1886.1682415008545
Matrix DBSCAN:
predict: 2.608060836791992
```
These results look quite acceptable, and the matrix method's speedup suggests that the neighbour map we planned could do even better.
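The neighbour map could be built along these lines (a hypothetical sketch: a KD-tree computes every eps-neighbourhood in one pass, without materialising the full n-by-n distance matrix):

```python
import numpy as np
from sklearn.neighbors import KDTree

def build_neighbour_map(X, eps):
    # index i -> indices of every point within eps of X[i]
    tree = KDTree(X)
    return dict(enumerate(tree.query_radius(X, r=eps)))

neighbours = build_neighbour_map(np.random.rand(312, 2), eps=0.1)
```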
Further work will focus on a proper evaluation method and on the parallel implementation.
- 2019/03/18 Proposal Integration
- 2019/03/22 Proposal Submission
- 2019/04/07 First Progress Check
- 2019/04/14 Overall Code Works Done
- Project Example: Implementation of DBSCAN Algorithm with Python Spark APIs
- MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data
- Research on the Parallelization of the DBSCAN Clustering Algorithm for Spatial Data Mining Based on the Spark Platform
- DBSCAN Wiki
- Use Spark's built-in job monitor to check job running time, resource allocation, and the DAG that Spark generates automatically.
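For example (a sketch; the app name and the log directory are assumed values), enabling the event log keeps those views available in the history server after a run:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallel-dbscan")                         # assumed app name
         .config("spark.eventLog.enabled", "true")           # persist job history
         .config("spark.eventLog.dir", "hdfs:///spark-logs") # assumed path
         .getOrCreate())
# While a job runs, the live Spark UI (default http://<driver-host>:4040)
# shows per-job timing, resource allocation, and the DAG Spark generates.
```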
The proposal of a deep project should contain the following contents:
- Description of the problem.
- Description of the algorithm (pseudo-code or language description).
- Brief plan on how to implement it in Spark.
- Introduction ---- Liu Jinyu
  - Problem Description
- Sequential DBSCAN ---- Wang Shen
  - Algorithm explanation
  - Pseudo-code
  - Figure illustration
- Parallel DBSCAN ---- Ling Liyang
  - Overview
  - Plans on parallelisation
- HDFS & Spark cluster deployment
- Sequential DBSCAN efficiency test
- Parallel DBSCAN
  - Space based partition
    - Naïve redundant computation method
    - Distance Matrix
    - Neighbour List
  - Cost based partition
- Description of the problem.
- Description of the algorithm (pseudo-code or language description).
- How you implemented the algorithm in Spark, including any optimizations that you have done.
- Experimental results, which should demonstrate the scalability of your implementation as the data size and the number of executors grow.
- Potential improvements, if any.
- Problem Introduction — 30s
- Local DBSCAN Description — 2 min
- Implementation in Spark — 6 min
  - General implementation — 2 min
    - Evenly partition
    - Merging
  - Optimizations — 4 min
    - Improvement on local DBSCAN — 1 min
      - distance matrix
      - adjacent list
    - Improvements on Partition — 3 min
      - RTree: Cost-based
      - RTree: Reduced boundary
- Experimental results — 4 min
  - Brief introduction on how to tune hyper-parameters
  - Efficiency with different data distributions
  - Comparison of the above implementations on each dataset
- Summary and Further work — 30s
- The source code of your implementation (as a separate file).
- References (including others' implementation of the same algorithm).