PySpark MLLib Introduction and Examples

Bartley Richardson
16 February 2017

An introduction to using MLLib in PySpark, created for the NOVA Data Science meetup on 16 February 2017. This notebook contains an example of supervised machine learning (logistic regression) using PySpark MLLib, as well as two unsupervised techniques (Word2Vec and k-means clustering). Some of the final analysis and plots were generated in Tableau; those plots are available as static images in this repo.

This notebook uses Spark 2.x. For those using Spark 1.6.x, some commands will differ.

The bank marketing dataset is replicated here; it was originally provided by the UCI Machine Learning Repository.

References and Acknowledgments

Inspiration taken from: https://github.com/jadianes/spark-py-notebooks

Word2Vec example: https://www.tensorflow.org/tutorials/word2vec/

Original Google Word2Vec code: https://code.google.com/archive/p/word2vec/

