## Machine Learning with MLlib
## *Introduction and Feature Extraction*

### University of California, Santa Barbara  
### PSTAT 135/235  
### Last Updated: Dec 12, 2018

---  

### Sources 

1. Learning Spark
2. Spark Documentation  
	https://spark.apache.org/docs/latest/mllib-data-types.html  
	http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html

### OBJECTIVES
1. Introduction to the machine learning library
2. Introduction to MLlib data types
3. Discuss Feature Extraction tools in MLlib


### CONCEPTS AND FUNCTIONS
- pipeline  
- supervised and unsupervised learning  
- learning tasks: classification, regression, clustering, dimensionality reduction  
- training set, testing set  
- feature extraction  

- MLlib data types:  
  - LabeledPoint  
  - sparse vector, dense vector  
  - sparse matrix, dense matrix  
  - Rating  

- Feature Extraction  
- TF-IDF  
- Word2Vec  
- Cosine Similarity  


---  

**MLlib**

MLlib is Spark’s machine learning library  
The `pyspark.mllib` API used in these notes works on RDDs  
Contains only algorithms that can be parallelized, since those run well on clusters  
Includes a pipeline API for building ML pipelines, similar to scikit-learn  


### Build a Logistic Regression Classifier to Predict Spam vs. Not Spam

In [2]:
data_path = '/home/jovyan/UCSB_BigDataAnalytics/data/mllib/'

In [7]:
import os
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()

In [4]:
spark

In [5]:
sc = spark.sparkContext

In [9]:
spam = sc.textFile(os.path.join(data_path, "spam.txt"))
ham = sc.textFile(os.path.join(data_path, "ham.txt"))

In [10]:
spam.collect()

['Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...',
 'Get Viagra real cheap!  Send money right away to ...',
 'Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...',
 'YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to this email with your password and SSN ...',
 'THIS IS NOT A SCAM!  Send money and get access to awesome stuff really cheap and never have to ...']

In [11]:
ham.collect()

['Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...',
 'Hi Mom, Apologies for being late about emailing and forgetting to send you the package.  I hope you and bro have been ...',
 'Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to take time to try it out immediately ...',
 'Hi Spark user list, This is my first question to this list, so thanks in advance for your help!  I tried running ...',
 "Thanks Tom for your email.  I need to refer you to Alice for this one.  I haven't yet figured out that part either ...",
 'Good job yesterday!  I was attending your talk, and really enjoyed it.  I want to try out GraphX ...',
 'Summit demo got whoops from audience!  Had to let you know. --Joe']

In [12]:
tf = HashingTF(numFeatures = 10000)

In [13]:
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = ham.map(lambda email: tf.transform(email.split(" ")))

In [14]:
# Build LabeledPoint datasets (1=spam, 0=ham)
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))

In [15]:
pos = positiveExamples.collect()

In [16]:
pos[0]

LabeledPoint(1.0, (10000,[0,365,455,509,1320,1363,1583,2321,2403,3289,3342,4995,5336,5706,5831,6052,6300,6582,6744,8971,8977,9232,9604,9646,9878],[1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))

In [17]:
neg = negativeExamples.collect()

In [18]:
neg[0]

LabeledPoint(0.0, (10000,[0,1162,2403,2809,3080,3317,4161,4770,5423,5651,5743,5831,6006,6827,6971,7069,7872,9150,9370,9521,9604],[1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))

In [19]:
# Build training set
trainData = positiveExamples.union(negativeExamples)
trainData.cache()

UnionRDD[6] at union at NativeMethodAccessorImpl.java:0

In [20]:
# Train LogReg model
model = LogisticRegressionWithSGD.train(trainData)

In [23]:
# push "not spam" example through classifier
Test1 = tf.transform("I like learning Spark programming more than you love Spark programming".split(" "))
Test1

SparseVector(10000, {1205: 1.0, 1605: 1.0, 3682: 1.0, 5195: 1.0, 7279: 2.0, 7779: 1.0, 8321: 2.0, 8517: 1.0, 8550: 1.0})

In [25]:
# Prediction
print("Prediction for example: {}".format(model.predict(Test1)))

Prediction for example: 0
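
The prediction step can be sketched in plain Python. A logistic regression model takes the dot product of its weights with the feature vector, passes it through the sigmoid function, and compares the result to a threshold (assumed here to be 0.5). The weights below are hypothetical values chosen for illustration; they are not the weights learned by `model` above, and `predict_label` is an illustrative helper, not an MLlib function.

```python
import math

def predict_label(weights, intercept, sparse_features, threshold=0.5):
    """Return 1 if the logistic score exceeds the threshold, else 0."""
    # Dot product over the nonzero entries only, as with a sparse vector
    margin = intercept + sum(
        weights.get(index, 0.0) * value
        for index, value in sparse_features.items()
    )
    score = 1.0 / (1.0 + math.exp(-margin))  # sigmoid
    return 1 if score > threshold else 0

# Hypothetical weights keyed by feature index (not taken from the trained model)
weights = {7279: -0.8, 8321: -0.5, 365: 1.2}

# A sparse feature vector written as {index: value}
features = {1205: 1.0, 7279: 2.0, 8321: 2.0}

print(predict_label(weights, 0.0, features))
```

A negative margin gives a score below 0.5 and therefore the label 0 (ham); a positive margin gives the label 1 (spam).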


**LabeledPoint**  
Stores feature vector together with label  
**Rating**  
Rating of product by a user. Used in recommendation, for instance.  
**Vector**  
Can be dense or sparse. A sparse vector stores only the nonzero values and their indices.  
When most entries are zero, the sparse form saves memory and runtime.  
**Matrix**  
A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine.  
MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.  
**Distributed matrix**  
A distributed matrix has long-typed row and column indices and double-typed values  
**Row matrix**  
A RowMatrix is a row-oriented distributed matrix without meaningful row indices  
**CoordinateMatrix**  
CoordinateMatrix is a distributed matrix backed by an RDD of its entries  
A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

https://en.wikipedia.org/wiki/Row-_and_column-major_order
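
The two local matrix layouts can be illustrated with a short plain-Python sketch. The helper names `to_column_major` and `to_csc` are hypothetical and only show how the arrays are laid out; they are not part of MLlib.

```python
def to_column_major(rows):
    """Flatten a list of rows into one array, column by column."""
    n_rows, n_cols = len(rows), len(rows[0])
    return [rows[r][c] for c in range(n_cols) for r in range(n_rows)]

def to_csc(rows):
    """Return (column pointers, row indices, values) for the nonzero entries."""
    n_rows, n_cols = len(rows), len(rows[0])
    col_ptrs, row_indices, values = [0], [], []
    for c in range(n_cols):
        for r in range(n_rows):
            if rows[r][c] != 0.0:
                row_indices.append(r)
                values.append(rows[r][c])
        # Each pointer marks where the next column starts in `values`
        col_ptrs.append(len(values))
    return col_ptrs, row_indices, values

dense = [[1.0, 2.0],
         [3.0, 4.0],
         [5.0, 6.0]]
sparse = [[9.0, 0.0],
          [0.0, 8.0],
          [0.0, 6.0]]

print(to_column_major(dense))  # [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
print(to_csc(sparse))          # ([0, 1, 3], [0, 1, 2], [9.0, 8.0, 6.0])
```

In the CSC form, column `c` owns the entries from position `col_ptrs[c]` up to (but not including) `col_ptrs[c + 1]`, so zero entries take no space.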

In [26]:
# Build sparse vector
from pyspark.mllib.linalg import Vectors

# create sparse vector [1.0 0.0 2.0 0.0]
sv1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})

In [27]:
sv1

SparseVector(4, {0: 1.0, 2: 2.0})

In [29]:
Vectors.dense(sv1)

DenseVector([1.0, 0.0, 2.0, 0.0])

### Feature Extraction

*mllib.feature*  
contains classes for common feature transformations:  
-  Term Frequency-Inverse Document Frequency (TF-IDF)  
Produces feature vectors from text documents

TF-IDF is computed in two steps, each with its own class:  

**1. HashingTF**  
	Computes term frequency vector from document  
	Can process one document or an RDD of documents  
	Each document needs to be an iterable sequence (a list in Python)  

To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is 1,048,576 (2^20).  
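
The hashing idea can be sketched in plain Python. This is an illustration only: `zlib.crc32` stands in for the hash function (an assumption, chosen because it is deterministic across runs), so the indices will not match those produced by `HashingTF`. The helper name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter(
        zlib.crc32(token.encode("utf-8")) % num_features
        for token in tokens
    )
    return dict(sorted(counts.items()))

tokens = "spark is fast and spark is fun".split(" ")

# With many buckets, distinct words rarely share an index
print(hashing_tf(tokens, 10000))

# With a single bucket, every word collides into index 0
print(hashing_tf(tokens, 1))
```

The total count is the same in both cases; only the number of buckets the counts are spread over changes, which is why a larger feature dimension reduces collisions.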

**2. IDF**  
	Computes inverse document frequency  
	Terms that appear in a high fraction of the documents are less informative  
	IDF will downweight such terms  

Good example of Feature Extraction here:  
http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html

**TF-IDF Example**
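
A minimal plain-Python sketch of the computation on a toy corpus, not MLlib code. It assumes the smoothed form `idf(t) = log((m + 1) / (df(t) + 1))`, where `m` is the number of documents and `df(t)` is the number of documents containing term `t`. For readability, terms are used directly as keys instead of hashed indices. The documents and the helper names `idf_weights` and `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

docs = [
    ["send", "money", "cheap"],
    ["spark", "summit", "talks"],
    ["send", "spark", "question"],
]

def idf_weights(docs):
    """Compute a smoothed inverse document frequency for every term."""
    m = len(docs)
    # Count each term once per document that contains it
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tf_idf(doc, idf):
    """Weight each term's raw count in one document by its IDF."""
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items()}

idf = idf_weights(docs)
print(tf_idf(docs[0], idf))
```

Here "send" appears in two of the three documents and so receives a lower weight than "money", which appears in only one.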

**Word2Vec**  
Computes distributed vector representation of words.  
Similar words are close in the vector space  
Useful in many NLP applications:  
named entity recognition, disambiguation, parsing, tagging and machine translation.  

### Fit Word2VecModel to some text data

In [None]:
from pyspark.mllib.feature import Word2Vec

inp = sc.textFile("C:/spark/spark-2.1.1-bin-hadoop2.7/data/text8_part1.txt").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)

synonyms = model.findSynonyms('china', 40)

# findSynonyms returns (word, cosine similarity) pairs; higher means more similar
for word, cosine_similarity in synonyms:
    print("{}: {}".format(word, cosine_similarity))

Top Records:
malaysia: 0.9055396899188917
eastern: 0.8834685956632131
africa: 0.8537198739068056
zambia: 0.8535407384161012
myanmar: 0.8475548784366893
predominantly: 0.8461926971224027
mongolia: 0.8371518611342739
countries: 0.8342705781501009
southeast: 0.8316274754770454
central: 0.8313670331856243
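
The scores above are cosine similarities between word vectors. A minimal sketch of the measure in plain Python follows; the vectors are toy values for illustration, not output of the Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The measure depends only on direction, not on vector length: parallel vectors score close to 1, orthogonal vectors score 0, and opposite vectors score -1.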

**StandardScaler**   

Standardization can improve the convergence rate during the optimization process, and also  
prevents features with very large variances from exerting an overly large influence during model training.  

For each feature, the scaler can  
1. Scale to unit variance (`withStd`, on by default)  
2. Center to mean zero (`withMean`, off by default)  

Useful or even essential for some models  
K-means works in Euclidean space, so all features should be on the same scale  
Tree models do not need this

Use this in a *Pipeline* so the statistics can be applied to datasets for scoring later
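
The idea of fitting statistics once and reusing them later can be sketched in plain Python. The sketch covers scaling to unit variance only (no centering) and assumes the sample standard deviation is used; the data and the helper names `fit_scaler` and `transform` are hypothetical.

```python
import statistics

def fit_scaler(rows):
    """Learn one standard deviation per feature (column) from training rows."""
    columns = list(zip(*rows))
    return [statistics.stdev(column) for column in columns]

def transform(row, stds):
    """Divide each feature by its learned standard deviation.

    A feature with zero standard deviation is mapped to 0.0 in this sketch.
    """
    return [x / s if s != 0.0 else 0.0 for x, s in zip(row, stds)]

train = [[1.0, 10.0],
         [2.0, 20.0],
         [3.0, 30.0]]

# Fit on the training data only
stds = fit_scaler(train)

# Reuse the same statistics on a new row at scoring time
new_row = [4.0, 40.0]
print(transform(new_row, stds))
```

The standard deviations are computed from the training rows only and then applied unchanged to the new row, so the two features land on the same scale at scoring time.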

### Standard Scaler  
Load a dataset in LIBSVM format and standardize the features so that the new features have unit variance and/or zero mean

In [30]:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler

In [32]:
data = MLUtils.loadLibSVMFile(sc, os.path.join(data_path, "sample_libsvm_data.txt"))

In [33]:
data.take(1)

[LabeledPoint(0.0, (692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0

In [34]:
type(data)

pyspark.rdd.PipelinedRDD

In [35]:
# extract labels and features; stored as RDDs
label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)

In [39]:
dd = data.map(lambda x: (x.features, x.label))

In [40]:
dd.take(1)

[(SparseVector(692, {127: 51.0, 128: 159.0, 129: 253.0, 130: 159.0, 131: 50.0, 154: 48.0, 155: 238.0, 156: 252.0, 157: 252.0, 158: 252.0, 159: 237.0, 181: 54.0, 182: 227.0, 183: 253.0, 184: 252.0, 185: 239.0, 186: 233.0, 187: 252.0, 188: 57.0, 189: 6.0, 207: 10.0, 208: 60.0, 209: 224.0, 210: 252.0, 211: 253.0, 212: 252.0, 213: 202.0, 214: 84.0, 215: 252.0, 216: 253.0, 217: 122.0, 235: 163.0, 236: 252.0, 237: 252.0, 238: 252.0, 239: 253.0, 240: 252.0, 241: 252.0, 242: 96.0, 243: 189.0, 244: 253.0, 245: 167.0, 262: 51.0, 263: 238.0, 264: 253.0, 265: 253.0, 266: 190.0, 267: 114.0, 268: 253.0, 269: 228.0, 270: 47.0, 271: 79.0, 272: 255.0, 273: 168.0, 289: 48.0, 290: 238.0, 291: 252.0, 292: 252.0, 293: 179.0, 294: 12.0, 295: 75.0, 296: 121.0, 297: 21.0, 300: 253.0, 301: 243.0, 302: 50.0, 316: 38.0, 317: 165.0, 318: 253.0, 319: 233.0, 320: 208.0, 321: 84.0, 328: 253.0, 329: 252.0, 330: 165.0, 343: 7.0, 344: 178.0, 345: 252.0, 346: 240.0, 347: 71.0, 348: 19.0, 349: 28.0, 356: 253.0, 357: 252.

In [37]:
label.take(5)

[0.0, 1.0, 1.0, 1.0, 1.0]

In [38]:
features.take(1)

[SparseVector(692, {127: 51.0, 128: 159.0, 129: 253.0, 130: 159.0, 131: 50.0, 154: 48.0, 155: 238.0, 156: 252.0, 157: 252.0, 158: 252.0, 159: 237.0, 181: 54.0, 182: 227.0, 183: 253.0, 184: 252.0, 185: 239.0, 186: 233.0, 187: 252.0, 188: 57.0, 189: 6.0, 207: 10.0, 208: 60.0, 209: 224.0, 210: 252.0, 211: 253.0, 212: 252.0, 213: 202.0, 214: 84.0, 215: 252.0, 216: 253.0, 217: 122.0, 235: 163.0, 236: 252.0, 237: 252.0, 238: 252.0, 239: 253.0, 240: 252.0, 241: 252.0, 242: 96.0, 243: 189.0, 244: 253.0, 245: 167.0, 262: 51.0, 263: 238.0, 264: 253.0, 265: 253.0, 266: 190.0, 267: 114.0, 268: 253.0, 269: 228.0, 270: 47.0, 271: 79.0, 272: 255.0, 273: 168.0, 289: 48.0, 290: 238.0, 291: 252.0, 292: 252.0, 293: 179.0, 294: 12.0, 295: 75.0, 296: 121.0, 297: 21.0, 300: 253.0, 301: 243.0, 302: 50.0, 316: 38.0, 317: 165.0, 318: 253.0, 319: 233.0, 320: 208.0, 321: 84.0, 328: 253.0, 329: 252.0, 330: 165.0, 343: 7.0, 344: 178.0, 345: 252.0, 346: 240.0, 347: 71.0, 348: 19.0, 349: 28.0, 356: 253.0, 357: 252.0

In [41]:
scaler1 = StandardScaler().fit(features)

In [42]:
type(scaler1)

pyspark.mllib.feature.StandardScalerModel

In [44]:
scaler1

<pyspark.mllib.feature.StandardScalerModel at 0x7fdfaa425b00>

In [46]:
# With the default settings, the features in data1 are scaled to unit variance (not centered)
data1 = label.zip(scaler1.transform(features))

In [47]:
data1.take(2)

[(0.0,
  SparseVector(692, {127: 0.5468, 128: 1.5923, 129: 2.4354, 130: 1.7081, 131: 0.7335, 154: 0.4346, 155: 2.0985, 156: 2.2563, 157: 2.2368, 158: 2.2269, 159: 2.2555, 181: 0.4713, 182: 2.0575, 183: 2.3318, 184: 2.3761, 185: 2.1237, 186: 2.0452, 187: 2.2657, 188: 0.6339, 189: 0.1022, 207: 0.1056, 208: 0.5395, 209: 1.9268, 210: 2.2383, 211: 2.3018, 212: 2.3568, 213: 1.8002, 214: 0.7116, 215: 2.2256, 216: 2.4032, 217: 1.5931, 235: 1.5394, 236: 2.188, 237: 2.1493, 238: 2.2924, 239: 2.3889, 240: 2.3155, 241: 2.2653, 242: 0.8445, 243: 1.7094, 244: 2.2496, 245: 1.8613, 262: 0.5062, 263: 2.0796, 264: 2.2201, 265: 2.199, 266: 1.7299, 267: 1.083, 268: 2.1786, 269: 2.0345, 270: 0.4392, 271: 0.7218, 272: 2.2177, 273: 1.6764, 289: 0.4794, 290: 2.214, 291: 2.3569, 292: 2.2283, 293: 1.6322, 294: 0.1087, 295: 0.6833, 296: 1.0411, 297: 0.1941, 300: 2.277, 301: 2.3083, 302: 0.5395, 316: 0.3967, 317: 1.6059, 318: 2.3539, 319: 2.1535, 320: 1.9834, 321: 0.8017, 328: 2.4941, 329: 2.3661, 330: 1.7473, 34