Bartley Richardson
16 February 2017
An introduction to using MLlib in PySpark, created for the NOVA Data Science meetup on 16 February 2017. This notebook contains an example of supervised machine learning (logistic regression) using PySpark MLlib, as well as two unsupervised techniques (Word2Vec and k-means clustering). Some of the final analysis and plots were generated using Tableau; in those cases, the resulting plots are available as static images in this repo.
This notebook uses Spark 2.x. Some commands will differ for those using Spark 1.6.x.
The bank marketing dataset is replicated here; it was originally provided by the UCI Machine Learning Repository.
Inspiration taken from: https://github.com/jadianes/spark-py-notebooks
Word2Vec example: https://www.tensorflow.org/tutorials/word2vec/
Original Google Word2Vec code: https://code.google.com/archive/p/word2vec/