### Problem Statement: Create Spark ML Pipeline for Model Training for Sentiment Analysis

##### This lab introduces ML Pipelines: a standardised set of high-level APIs, built on top of DataFrames, that help users create and tune practical machine learning pipelines.

Before starting with the notebook, ensure PySpark is installed and working. If it is missing, install pyspark and findspark with pip. The cells below use findspark to locate the Spark installation and make it importable.

In [None]:
import findspark

In [None]:
print(findspark.find())
findspark.init()

Create a Spark Session

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Pipeline") \
    .master('local[3]') \
    .getOrCreate()

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concepts are:

1. DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

2. Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

3. Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

4. Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

5. Parameter: All Transformers and Estimators share a common API for specifying parameters.

##### DataFrame
Create a DataFrame with three columns: an identifier, a sentence, and a sentiment label (0: negative, 1: positive).

In [None]:
training = spark.createDataFrame([
     (0, 'i like apple pie for dessert', 1.0),
     (1, 'i dont drive fast cars', 0.0),
     (2, 'data science is fun', 1.0),
     (3, 'chocolate is not my favorite', 0.0),
     (4, 'my favorite movie is predator', 1.0)],
     ['id', 'text', 'label'])

Import the relevant PySpark classes: <br>
1. Pipeline: chains the stages into a single workflow used for training and testing.
2. Tokenizer: creates tokens from a sentence by converting the input string to lowercase and splitting it on whitespace.
3. HashingTF: generates features from the tokens by mapping a sequence of terms to their term frequencies using the hashing trick.
4. LogisticRegression: trains the classifier.
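
To make the Tokenizer and HashingTF descriptions above more concrete, here is a minimal pure-Python sketch of the same two ideas. It is not Spark's implementation: the function names `tokenize` and `hashing_tf` and the constant `NUM_FEATURES` are hypothetical, and `zlib.crc32` is only a stand-in for the hash function (Spark's HashingTF uses MurmurHash3 and defaults to 2^18 feature buckets).

```python
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical small size for readability; Spark defaults to 2**18


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a sparse {bucket index: count} term-frequency dictionary."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash for illustration only
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


tokens = tokenize("I like apple pie and I like cake")
features = hashing_tf(tokens)
print(tokens)
print(features)
```

Because each token is hashed straight to a bucket index, no vocabulary has to be built first. The trade-off is that two different tokens can collide in the same bucket, which becomes less likely as the number of buckets grows.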

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

##### Pipeline Components

1. Transformers: A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns.

2. Estimators: An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit().
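
The relationship between fit() and transform() can be sketched without Spark. The toy classes below are hypothetical and use a list of dictionaries in place of a DataFrame. They only illustrate the pattern: fitting an Estimator returns a model that is itself a Transformer, and a pipeline runs its stages in order.

```python
class LowercaseTransformer:
    """Hypothetical Transformer: appends a lowercased copy of the 'text' field."""

    def transform(self, rows):
        return [{**row, "lower": row["text"].lower()} for row in rows]


class MajorityLabelModel:
    """Hypothetical fitted model; it is itself a Transformer."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**row, "prediction": self.label} for row in rows]


class MajorityLabelEstimator:
    """Hypothetical Estimator: fit() learns the most common label."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return MajorityLabelModel(max(set(labels), key=labels.count))


def fit_pipeline(stages, rows):
    """Fit the stages in order and return the list of fitted Transformers."""
    fitted = []
    for stage in stages:
        # Estimators are fitted first; plain Transformers are used as they are
        transformer = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = transformer.transform(rows)
        fitted.append(transformer)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted Transformer in order."""
    for transformer in fitted:
        rows = transformer.transform(rows)
    return rows


train = [
    {"text": "Data science is fun", "label": 1.0},
    {"text": "I like pie", "label": 1.0},
    {"text": "Not my favorite", "label": 0.0},
]
fitted = fit_pipeline([LowercaseTransformer(), MajorityLabelEstimator()], train)
print(transform_pipeline(fitted, [{"text": "I like programming"}]))
```

The test row needs no label column, because the fitted stages only append columns. The same holds for the Spark pipeline in this notebook, where the test DataFrame contains only id and text.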

Initialize the Estimators and Transformers.

In [None]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01, featuresCol='features', labelCol='label')

Create the Pipeline, which is itself an Estimator, from the three stages.

In [None]:
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

Call fit() to run the pipeline on the training data and produce the trained model (a PipelineModel).

In [None]:
model = pipeline.fit(training)

Display the stages of the fitted pipeline.

In [None]:
model.stages

Initialize the test data.

In [None]:
test = spark.createDataFrame([
     (5, 'I like programming'),
     (6, 'I dont eat grapes')],
     ["id", "text"])

Use the fitted pipeline model, which is a Transformer, to generate predictions for the test data.

In [None]:
prediction = model.transform(test)

Display the predictions.

In [None]:
prediction.show(truncate=False, vertical=True)

<hr />
Extract only the prediction value from the output of the pipeline, returned as a JSON string for the first row.
<hr />

In [None]:
prediction.select("prediction").toJSON().first()
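
The call above returns a JSON string rather than a number. Here is a small sketch of how that string could be decoded and mapped back to a sentiment, using the 0/1 label convention from the training data. The `sample_row` value is a hypothetical stand-in for the string returned by the cell, not actual output from this notebook.

```python
import json

# Hypothetical sample; the real string comes from
# prediction.select("prediction").toJSON().first()
sample_row = '{"prediction":1.0}'

# Decode the JSON object and read the single field it contains
value = json.loads(sample_row)["prediction"]

# Labels follow the training data: 0 is negative, 1 is positive
sentiment = "positive" if value == 1.0 else "negative"
print(value, sentiment)
```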

<hr />
Stop the Spark Session.
<hr />

In [None]:
spark.stop()