<img src="../ucsb_logo_seal.png"> 

## ML Pipelines
### PSTAT 135 / 235: Big Data Analytics
### University of California, Santa Barbara
### Last Updated: Sep 4, 2019

---  


**Sources:**  
Learning Spark, Chapter 11: Machine Learning with MLlib  
https://spark.apache.org/docs/latest/ml-pipeline.html  
http://blog.insightdatalabs.com/spark-pipelines-elegant-yet-powerful/  




### OBJECTIVES
- Introduction to ML Pipelines  


### CONCEPTS

- ML Pipeline
- `DataFrame`
- `Transformer`
- `Estimator`
- `Parameter`

---

**DataFrame**

The ML API uses the `DataFrame` from Spark SQL as its ML dataset. A DataFrame can hold a variety of data types: for example, columns can store text, feature vectors, true labels, and predictions. It is similar to the data frames in R and Python.

- Can be created from an RDD
- Columns are named

**Transformer**  
Transforms one DataFrame into another DataFrame, generally by appending one or more columns (e.g., a fitted model appends a column of predictions)

**Estimator**  
An algorithm that can be fit on a DataFrame to produce a `Transformer` (e.g., fitting Logistic Regression returns a fitted model that can make predictions)
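
The relationship between the two abstractions can be sketched in plain Python. This is a toy model and not Spark's implementation: a "DataFrame" is just a list of dicts, and the class names `MeanCenter` and `MeanCenterModel` are hypothetical. It only illustrates the contract: `fit` learns something from the data and returns an object whose `transform` appends a column.

```python
from dataclasses import dataclass


@dataclass
class MeanCenterModel:
    """Toy Transformer: holds a learned mean and appends a centered column."""
    input_col: str
    output_col: str
    mean: float

    def transform(self, rows):
        # Return new rows; the input "DataFrame" is left unchanged.
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


@dataclass
class MeanCenter:
    """Toy Estimator: fit() learns a statistic and returns a Transformer."""
    input_col: str
    output_col: str

    def fit(self, rows):
        values = [r[self.input_col] for r in rows]
        return MeanCenterModel(self.input_col, self.output_col, sum(values) / len(values))


train = [{"rating": 1}, {"rating": 3}, {"rating": 5}]
model = MeanCenter("rating", "centered").fit(train)  # Estimator -> Transformer
scored = model.transform([{"rating": 4}])            # DataFrame -> DataFrame
print(model.mean, scored)
```

The fitted object keeps the mean learned from the training rows, so new rows are centered with the training statistic rather than their own.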

**Parameter**  
Configurable properties of an `Estimator` or `Transformer` (e.g., maximum number of iterations, regularization parameter)

*Setter methods* are available for setting parameters. For example, to set some params on a logistic regression instance `lr`:

`lr.setMaxIter(10).setRegParam(0.01)`

**Pipeline**  
A sequential chain of multiple `Transformers` and `Estimators` to specify an ML workflow  

The pipeline in Spark is very similar to the pipeline in scikit-learn.  
It acts as a workflow that keeps all steps together from start to finish, for example:
- Data preprocessing
- Feature extraction
- Model tuning
- Model fitting

Keeping track of these steps manually can be painful and error-prone.  
For example, the analyst might train on the test set by accident.  
Pipelines make the process easily repeatable, and safer.  
When new data comes along for scoring, it can be pushed through the pipeline.  
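
How a pipeline keeps training and scoring separate can be sketched with a toy version in plain Python. This is an illustration under stated assumptions, not Spark's code: the names `ToyPipeline`, `ToyPipelineModel`, `Lowercase`, `VocabularyEstimator`, and `VocabularyModel` are hypothetical, and a "DataFrame" is a list of dicts. Fitting walks the stages in order, fits each estimator on the output of the earlier stages, and returns a model made only of transformers.

```python
class Lowercase:
    """Toy Transformer: has transform() only."""
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class VocabularyModel:
    """Toy fitted model: counts only the words seen during fit."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [{**r, "counts": [r["text"].split().count(w) for w in self.vocab]}
                for r in rows]


class VocabularyEstimator:
    """Toy Estimator: fit() learns the vocabulary and returns a Transformer."""
    def fit(self, rows):
        return VocabularyModel(sorted({w for r in rows for w in r["text"].split()}))


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the data produced by the previous stages.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


train = [{"text": "Good food"}, {"text": "bad FOOD"}]
test = [{"text": "Good service"}]

model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(train)
scored = model.transform(test)
print(model.stages[1].vocab, scored[0]["counts"])
```

The word "service" appears only in the test rows, so it never enters the vocabulary: everything learned comes from the data passed to `fit`.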

**Pipeline Schematic**  
In the schematic below, the cylinders represent DataFrames


<img src="ml_pipeline_graph.png">  

**Pipeline example**

In [None]:
# DATA OUTLINE
# train_df  DataFrame containing labels, restaurant reviews (string), ratings (integer);
#           will be used to train the LogReg model
# test_df   DataFrame with the same fields, set aside for model evaluation
# ----------------------------------------------------------------------------------------------

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, Word2Vec, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Configure pipeline stages
# process review data: split each review into words, then hash the words into term frequencies
tok = Tokenizer(inputCol="review", outputCol="words")
htf = HashingTF(inputCol="words", outputCol="tf", numFeatures=200)

# process review data: learn word embeddings
# (Word2Vec expects an array of strings, so it reads "words" rather than the raw "review" string)
w2v = Word2Vec(inputCol="words", outputCol="w2v")

# process rating data
ohe = OneHotEncoder(inputCol="rating", outputCol="rc")

# combine the feature columns into a single vector column
va = VectorAssembler(inputCols=["tf", "w2v", "rc"], outputCol="features")

# by default, LogisticRegression reads the "features" and "label" columns
lr = LogisticRegression(maxIter=10, regParam=0.01)

# Build the pipeline
pipeline = Pipeline(stages=[tok, htf, w2v, ohe, va, lr])

# Fit the pipeline
model = pipeline.fit(train_df)

# Make a prediction
prediction = model.transform(test_df)

# source: http://blog.insightdatalabs.com

At a high level, the pipeline outlines the steps that will take place sequentially: 

1. The data is processed into features  
2. The features are combined using `VectorAssembler`  
3. The combined features are input to the Logistic Regression model  
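
To make the feature steps concrete, here is a minimal plain-Python sketch of what tokenizing, hashing term frequencies, one-hot encoding, and assembling do to a single record. It is an approximation under stated assumptions and not Spark's code: the helper names are hypothetical, `zlib.crc32` stands in for the MurmurHash3 function that Spark's `HashingTF` uses, vectors are dense lists, no category is dropped (Spark's `OneHotEncoder` drops the last one by default), and the `Word2Vec` step is left out.

```python
import zlib


def tokenize(text):
    # Lowercase and split on whitespace.
    return text.lower().split()


def hashing_tf(words, num_features):
    # Hashing trick: each word maps to a bucket index; counts accumulate per bucket.
    vec = [0.0] * num_features
    for w in words:
        vec[zlib.crc32(w.encode("utf-8")) % num_features] += 1.0
    return vec


def one_hot(index, size):
    # Vector of zeros with a single 1.0 at the category index.
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def assemble(*vectors):
    # Concatenate feature vectors into one flat vector.
    return [x for v in vectors for x in v]


words = tokenize("Great tacos great salsa")
tf = hashing_tf(words, num_features=8)
rc = one_hot(4, size=6)  # a rating of 4, assuming ratings take values 0..5
features = assemble(tf, rc)
print(len(features), features)
```

The hashed vector has a fixed length no matter how many distinct words appear, which is why `numFeatures` is chosen up front. Different words can land in the same bucket, so a small `numFeatures` trades accuracy for size.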

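Defining the `Pipeline` only declares the stages. Calling `pipeline.fit(train_df)` actually executes the workflow and returns a fitted `PipelineModel`, whose `transform` method is then used to make predictions.  

Each step is either a `Transformer` or an `Estimator`  

`Tokenizer`, `HashingTF`, and `VectorAssembler` are `Transformers`  
`Word2Vec` and `LogisticRegression` are `Estimators`: they learn from the training data during `fit`  
`OneHotEncoder` is a `Transformer` in Spark 2.x and an `Estimator` in Spark 3.x  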

**Another Pipeline Example:**  
https://spark.apache.org/docs/1.6.0/ml-guide.html  

**Custom Transformers**  
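There are many transformers available in `MLlib`.  
Users can also create custom transformers.  

`Transformer` requirements:  

1. Implement the `transform` method (in PySpark, subclass `pyspark.ml.Transformer` and override `_transform`, which the public `transform` calls)  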
2. Specify an `inputCol` and `outputCol`  
3. Accept a DataFrame as input and return a DataFrame as output  
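
The three requirements above can be illustrated with a toy transformer in plain Python. This is a sketch, not PySpark code: the class `WordCounter` is hypothetical and a "DataFrame" is a list of dicts. A real PySpark version would subclass `pyspark.ml.Transformer` and build the new column with DataFrame operations inside `_transform`.

```python
class WordCounter:
    """Toy custom transformer: appends the number of words in a text column."""

    def __init__(self, inputCol, outputCol):
        # Requirement 2: the caller names the input and output columns.
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transform(self, rows):
        # Requirements 1 and 3: take a "DataFrame" and return a new one
        # with the output column appended; the input is not modified.
        return [{**r, self.outputCol: len(r[self.inputCol].split())} for r in rows]


reviews = [{"review": "great tacos"}, {"review": "slow service but good food"}]
out = WordCounter(inputCol="review", outputCol="n_words").transform(reviews)
print(out)
```

Because the column names are parameters rather than hard-coded, the same transformer can be reused at different points in a pipeline.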


**Saving and Loading Pipeline**  
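Pipelines and fitted pipeline models can be saved for later use. In PySpark, a fitted `PipelineModel` is written with `model.save(path)` and restored with `PipelineModel.load(path)`.  
This is helpful in several circumstances, including:  

1. The user wishes to return to model development at a later time  
2. The pipeline is called to score records in production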
