# <font color=  #FF5733> Let's start building simple ELEMENTS of a PIPELINE for the Orange Churn dataset

## Importing Churn Data

###  Load churn-bigml-80.csv into a DataFrame

In [None]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

CV_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-80.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


# Nothing to do with Pipelines, but worth a quick look: PixieDust

<font color=blue size=5>PixieDust</font>  <font size=4> is a productivity tool for Python or Scala notebooks, which lets a developer encapsulate business logic into something easy for your customers to consume.


https://pypi.python.org/pypi/pixiedust

In [None]:
!pip install pixiedust


In [None]:
import pixiedust

In [None]:
# With PixieDust's display API, you can easily view and visualize the data.

display(CV_data)

## Spark: ML Pipelines
https://spark.apache.org/docs/2.2.0/ml-pipeline.html


##  <font color= #e38009> Transformer A: StringIndexer

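<font font-family: "calibri" size=3.5>StringIndexer encodes a string column of labels (here, the Churn column) as a column of numeric category indices that the machine learning algorithms in the ml library can consume. The indices are ordered by label frequency, so the most frequent label gets index 0.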

***Notice we provide the input column name and the output column name as parameters at the time of initialization of the StringIndexer.***
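
To make the indexing rule concrete, here is a minimal plain-Python sketch of the idea, not Spark's implementation. The helper name `fit_string_indexer` and the sample labels are made up for illustration, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# Hypothetical label column: "False" appears more often than "True"
churn = ["False", "False", "True", "False", "True", "False"]

mapping = fit_string_indexer(churn)        # the "fit" step learns the mapping
indexed = [mapping[v] for v in churn]      # the "transform" step applies it

print(mapping)
print(indexed)
```

The two steps mirror the `fit` and `transform` calls in the cell that follows: the mapping is learned once from the data and then applied row by row.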


In [None]:
from pyspark.ml.feature import StringIndexer

# Index labels, adding metadata to the label column
stringindexer = StringIndexer(inputCol='Churn',
                             outputCol='indexedLabel')

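# fit() scans the Churn column and returns a StringIndexerModel (a Transformer)
model = stringindexer.fit(CV_data)

# transform() appends the indexedLabel column to the DataFrame
dataframe_transformedA = model.transform(CV_data)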

In [None]:
# limit(n) returns a new DataFrame with at most n rows.
# This keeps toPandas() from collecting the whole dataset to the driver.


dataframe_transformedA.limit(20).toPandas()

##  <font color= #e38009> Transformer B: VectorAssembler

<font font-family: "calibri" size=3.5>After feature engineering, the resulting columns are combined by the VectorAssembler into a single vector column before being passed to an ML Estimator.

***Notice that the input is a list of columns (they must be numeric, boolean, or vector typed; string columns are rejected) and the output is a single column holding all of them assembled into one vector.***
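
As a rough picture of what "assembling" means, the following plain-Python sketch packs selected columns of each row into one list. It is only an illustration: the `assemble` helper and the toy rows are hypothetical, and Spark produces a `Vector` object in the output column rather than a Python list.

```python
def assemble(rows, input_cols, output_col="features"):
    """Return copies of the rows with input_cols packed into one list."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        new_row[output_col] = [float(row[name]) for name in input_cols]
        assembled.append(new_row)
    return assembled


# Toy rows with made-up values (not taken from the churn file)
rows = [
    {"Total day minutes": 100.0, "Total day calls": 10, "Customer service calls": 1},
    {"Total day minutes": 250.5, "Total day calls": 42, "Customer service calls": 4},
]
cols = ["Total day minutes", "Total day calls", "Customer service calls"]

result = assemble(rows, cols)
print(result[0]["features"])
print(result[1]["features"])
```

The order of `input_cols` fixes the position of each value inside the vector, which is why the same list has to be used for training and for scoring.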

###  For simplicity, we first drop:
* categorical columns
* numerical columns that are highly correlated with others

In [None]:
CV_data.printSchema()

### This will be our list of predictors

In [None]:
# A list of column names, as expected by VectorAssembler's inputCols
predictors = ['Number vmail messages',
              'Total day minutes',
              'Total day calls',
              'Total eve minutes',
              'Total eve calls',
              'Total night minutes',
              'Total night calls',
              'Total intl minutes',
              'Total intl calls',
              'Customer service calls']

In [None]:
from pyspark.ml.feature import VectorAssembler

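# Pack the predictor columns into a single vector column named 'features'
assembler = VectorAssembler(inputCols=predictors, outputCol='features')

# VectorAssembler is a pure Transformer, so there is no fit() step
dataframe_transformedB = (assembler.transform(dataframe_transformedA)
                          .select('indexedLabel', 'features'))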


##  <font color=#FF5733> Estimators

<font font-family: "calibri" size=3.5>
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. 

Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. <br><br>
***For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.***
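
The Estimator/Model contract can be sketched with two toy classes in plain Python. These are hypothetical stand-ins (`ThresholdEstimator`, `ThresholdModel`), not Spark classes, and the "learning" rule (midpoint between the two class means) is chosen only to keep the example short.

```python
class ThresholdEstimator:
    """Toy Estimator: fit() learns a cut-off on one numeric column."""

    def __init__(self, input_col, label_col):
        self.input_col = input_col
        self.label_col = label_col

    def fit(self, rows):
        # Learn the midpoint between the mean of each class
        pos = [r[self.input_col] for r in rows if r[self.label_col] == 1.0]
        neg = [r[self.input_col] for r in rows if r[self.label_col] == 0.0]
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
        return ThresholdModel(self.input_col, threshold)


class ThresholdModel:
    """Toy Model (a Transformer): transform() appends a 'prediction' column."""

    def __init__(self, input_col, threshold):
        self.input_col = input_col
        self.threshold = threshold

    def transform(self, rows):
        # Return new rows; the input rows are not modified
        return [
            dict(r, prediction=1.0 if r[self.input_col] > self.threshold else 0.0)
            for r in rows
        ]


# Made-up training rows
train = [
    {"calls": 1, "label": 0.0},
    {"calls": 2, "label": 0.0},
    {"calls": 5, "label": 1.0},
    {"calls": 6, "label": 1.0},
]

model = ThresholdEstimator("calls", "label").fit(train)   # Estimator -> Model
scored = model.transform(train)                           # Model acts as Transformer
print(model.threshold, [r["prediction"] for r in scored])
```

The point of the split is that everything learned from data lives in the object returned by `fit()`, so the same fitted model can later be applied to any DataFrame with the right input column.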


In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

# Configure a DecisionTree estimator (no training happens until fit() is called)
dTree_algorithm = DecisionTreeClassifier(maxDepth=2,
                                         labelCol='indexedLabel',
                                         featuresCol='features')

In [None]:
# fit() trains the tree and returns a DecisionTreeClassificationModel
dTree_model = dTree_algorithm.fit(dataframe_transformedB)

In [None]:
# toDebugString is a public property of the fitted tree model
print(dTree_model.toDebugString)

##  <font color= #e38009> Transformers include learned models

***For example, a learned model takes a DataFrame, reads the column containing feature vectors, predicts the label for each feature vector, and outputs a new DataFrame with the predicted labels appended as a column.***

In [None]:
predictions=dTree_model.transform(dataframe_transformedB)

In [None]:
predictions.printSchema()

In [None]:
import pandas as pd

pd.DataFrame(predictions.take(5), columns=predictions.columns)

## <font color=#938882>Model Evaluation

### ***For evaluation we will use the training CSV file, so what we measure is the training error***
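
The evaluator cells below use the `areaUnderROC` metric. One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The plain-Python sketch below computes it that way on made-up labels and scores; it illustrates the definition and is not how Spark computes the metric internally.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration
labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]

auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the metric should not be read as classification accuracy.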

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='indexedLabel',
                                          rawPredictionCol='rawPrediction',
                                          metricName='areaUnderROC')

In [None]:
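# Note: despite the variable name, this is the area under the ROC curve,
# not classification accuracy
accuracy = evaluator.evaluate(predictions)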

In [None]:
accuracy

##  Model selection via cross-validation

In this example we use CrossValidator to pick the best value from a grid of parameters for the decision tree model.
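
The selection procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: rows are assigned to folds deterministically (Spark assigns them at random), the helper names are hypothetical, and the toy "model" is a fixed threshold rule so that the example stays short.

```python
def k_fold_indices(n_rows, num_folds):
    """Assign row i to fold i % num_folds (deterministic, for illustration)."""
    return [[i for i in range(n_rows) if i % num_folds == fold]
            for fold in range(num_folds)]


def cross_validate(rows, param_values, fit, score, num_folds=3):
    """Return the parameter value with the best average held-out score."""
    folds = k_fold_indices(len(rows), num_folds)
    avg_scores = {}
    for value in param_values:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [r for i, r in enumerate(rows) if i not in held]
            valid = [rows[i] for i in held_out]
            model = fit(train, value)              # train on the other folds
            fold_scores.append(score(model, valid))  # score on the held-out fold
        avg_scores[value] = sum(fold_scores) / num_folds
    best = max(avg_scores, key=avg_scores.get)
    return best, avg_scores


def fit(train, threshold):
    # Toy estimator: the threshold is the hyperparameter being searched and
    # nothing is learned from `train`; a real estimator would use it.
    return lambda x: 1.0 if x > threshold else 0.0


def score(model, valid):
    # Fraction of held-out rows predicted correctly
    return sum(model(x) == y for x, y in valid) / len(valid)


# Made-up data: label is 1.0 exactly when x > 3
rows = [(x, 1.0 if x > 3 else 0.0) for x in range(9)]

best, avg_scores = cross_validate(rows, [1, 3, 5], fit, score, num_folds=3)
print(best, avg_scores)
```

Each candidate value is scored only on rows its model did not see during fitting, which is what makes the comparison between grid points fair.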

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Search through decision tree's maxDepth parameter for best model
paramGrid = ParamGridBuilder().addGrid(dTree_algorithm.maxDepth, [2,3,4,5,6,7]).build()

In [None]:
# Set up 3-fold cross validation
crossval = CrossValidator(estimator=dTree_algorithm,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

In [None]:
Cross_res=crossval.fit(dataframe_transformedB)

In [None]:
print(Cross_res.bestModel)

In [None]:
print(Cross_res.bestModel.toDebugString)

In [None]:
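# Fetch the best model. Without a Pipeline, any new data must first be passed
# by hand through every preprocessing step (StringIndexer, VectorAssembler)
# before this model can be applied to it; see the closing note below.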
Best_tree_model = Cross_res.bestModel
print(Best_tree_model)

In [None]:
predictions_CV=Best_tree_model.transform(dataframe_transformedB)

In [None]:
pd.DataFrame(predictions_CV.take(5), columns=predictions_CV.columns)

In [None]:
# As before, this is the area under the ROC curve on the training data
accuracy_CV = evaluator.evaluate(predictions_CV)

print(accuracy_CV)

## Now let's create a PIPELINE! See MSTC_Pipeline_PySpark_2.ipynb