BartleyR/PySparkMLLib

PySpark MLLib Introduction and Examples

Bartley Richardson
16 February 2017

An introduction to using MLLib in PySpark, created for the NOVA Data Science meetup on 16 February 2017. The notebook contains one example of supervised machine learning (logistic regression) using PySpark MLLib and two unsupervised techniques (Word2Vec and k-means clustering). Some of the final analysis and plots were produced in Tableau; those plots are included in this repo as static images.

This notebook uses Spark 2.x. Some commands will differ for those using Spark 1.6.x.

The bank marketing dataset is replicated here; it was originally provided by the UCI Machine Learning Repository.

References and Acknowledgments

Inspiration taken from: https://github.com/jadianes/spark-py-notebooks

Word2Vec example: https://www.tensorflow.org/tutorials/word2vec/

Original Google Word2Vec code: https://code.google.com/archive/p/word2vec/
