
pyspark-distributed-kmodes

Ensemble-based distributed K-modes clustering for PySpark

This repository contains the source code for the pyspark_kmodes package to perform K-modes clustering in PySpark. The package implements the ensemble-based algorithm proposed by Visalakshi and Arunprabha (IJERD, March 2015).

K-modes clustering is performed on each partition of a Spark RDD, and the resulting clusters are collected to the driver node. Local K-modes clustering is then performed on the centroids returned from each partition to yield a final set of cluster centroids.
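The two-stage procedure described above can be sketched in plain Python, with each Spark partition stood in for by an ordinary list. This is a minimal illustration only, not the package's API: the names `hamming`, `column_modes`, `k_modes`, and `ensemble_k_modes` are hypothetical, and the simple random initialization is an assumption rather than the method used by pyspark_kmodes.

```python
from collections import Counter
import random


def hamming(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent value of each attribute across rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, k, n_iter=10, seed=0):
    """Very small K-modes: rows are tuples of categorical values."""
    rng = random.Random(seed)
    # Assumption: initialize from up to k distinct records chosen at random.
    distinct = sorted(set(rows))
    modes = rng.sample(distinct, min(k, len(distinct)))
    for _ in range(n_iter):
        clusters = [[] for _ in modes]
        for row in rows:
            nearest = min(range(len(modes)), key=lambda i: hamming(row, modes[i]))
            clusters[nearest].append(row)
        # Keep the old mode if a cluster ends up empty.
        new_modes = [column_modes(c) if c else m for c, m in zip(clusters, modes)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes


def ensemble_k_modes(partitions, k):
    """Stage 1: K-modes per partition. Stage 2: K-modes over the collected modes."""
    local_modes = []
    for part in partitions:  # stands in for per-partition work on an RDD
        local_modes.extend(k_modes(part, k))
    return k_modes(local_modes, k)  # stands in for the driver-side step


if __name__ == "__main__":
    partitions = [
        [("a", "x", "p")] * 3 + [("b", "y", "q")] * 3,
        [("b", "y", "q")] * 2 + [("a", "x", "p")] * 4,
    ]
    print(ensemble_k_modes(partitions, k=2))
```

In Spark, the per-partition stage would typically be expressed with `RDD.mapPartitions` and the gathering stage with `collect`; how pyspark_kmodes structures these calls internally is not described here.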

This package was written by Marissa Saunders and relies on an adaptation of the KModes package by Nico de Vos (https://github.com/nicodv/kmodes) for the local iterations. Marissa Saunders describes using the package to cluster clickstream data in a YouTube video.

Installation

This module has been developed and tested on Spark 1.5.2 and 1.6.1 and should work under Python 2.7 and 3.5.

The module depends on scikit-learn 0.16+ (for check_array). See requirements.txt for this and other package dependencies.

After cloning or downloading the repository, run pip from the top-level directory to install the package:

$ ls
LICENSE                     README.rst              pyspark_kmodes          setup.cfg
MANIFEST.in         docs                    requirements.txt        setup.py

$ pip install .
[...]

Getting Started

The docs directory includes a sample Jupyter/IPython notebook that demonstrates how to use the package.

$ cd docs

$ jupyter notebook PySpark-Distributed-KModes-example.ipynb

