In [1]:
displayHTML("<font size=8>Let's start building simple ELEMENTS of a <font size=8 color='green'>PIPELINE</font> for</font> <font color=orange size=8>Orange Churn dataset</font>")

![How to create a DataFrame](https://blog.cloudera.com/wp-content/uploads/2017/04/Spark.png)

### [MSTC](http://mstc.ssr.upm.es/big-data-track) and MUIT:

## Importing Churn Data

###  Load churn-bigml-80.csv into a DataFrame

In [5]:
%fs ls /FileStore/tables

In [6]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

CV_data = sqlContext.read.load('/FileStore/tables/churn_bigml_80-bf1a8.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


In [7]:
display(CV_data)

In [8]:
CV_data.printSchema()

The next cell simply illustrates how to apply a UDF (user-defined function) to a column of a Spark DataFrame.

In [10]:
from pyspark.sql.types import DoubleType, StringType
from pyspark.sql.functions import UserDefinedFunction

# Pass Churn through a UDF declared with a StringType return type so the column is typed as string
toStr = UserDefinedFunction(lambda k: k, StringType())
CV_data = CV_data.withColumn('Churn', toStr(CV_data['Churn']))

#binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
#toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
#CV_data = CV_data.withColumn('Churn', toNum(CV_data['Churn']))

In [11]:
CV_data.printSchema()

## Spark: ML Pipelines
https://spark.apache.org/docs/latest/ml-pipeline.html

##  <font color= #e38009> Transformer A: StringIndexer
  https://spark.apache.org/docs/latest/ml-features.html#stringindexer

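<font font-family: "calibri" size=3.5>StringIndexer encodes a column of string labels as a column of numeric category indices, which the machine learning algorithms in Spark's ML library can consume. By default the indices are ordered by label frequency, so the most frequent label gets index 0.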

***Notice we provide the input column name and the output column name as parameters at the time of initialization of the StringIndexer.***
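
To make the fit/transform split concrete, here is a minimal plain-Python sketch of frequency-ordered label indexing. It is not Spark's implementation: the helper names `fit_label_index` and `transform_labels` and the sample values are made up for illustration, and the alphabetical tie-break is an assumption chosen only to keep the result deterministic.

```python
from collections import Counter

def fit_label_index(values):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_labels(values, mapping):
    """Replace every string with the index learned during fitting."""
    return [mapping[v] for v in values]

# Made-up sample of a Churn-like string column
churn = ["False", "False", "True", "False", "True", "False"]
mapping = fit_label_index(churn)
indexed = transform_labels(churn, mapping)
print(mapping)
print(indexed)
```

The "fit" step only learns the mapping; the "transform" step applies it, which mirrors how `stringindexer.fit(...)` returns a model that is then used to transform the DataFrame.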

In [14]:
from pyspark.ml.feature import StringIndexer

# Index labels: use StringIndexer to encode the string column Churn ("True"/"False" strings, NOT Booleans) as a column of label indices, indexedChurn

stringindexer = StringIndexer(inputCol='Churn',
                             outputCol='indexedChurn')

model=stringindexer.fit(CV_data)


dataframe_transformedA=model.transform(CV_data)

In [15]:
display(dataframe_transformedA)

##  <font color= #e38009> Transformer B: VectorAssembler

### After “feature engineering”, the resulting feature columns are combined into a single vector column using the VectorAssembler, before being passed to an ML Estimator

###  For simplicity, we first drop two groups of columns:
* categorical columns
* numerical columns that are highly correlated with other columns
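
One way to spot "highly correlated" numerical columns is the Pearson correlation coefficient. The sketch below is plain Python with made-up numbers. It assumes, purely for illustration, a charge column that is a fixed rate times the minutes column, and the `pearson` helper is hypothetical rather than part of the notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: the charge is assumed to be a fixed rate times the minutes
day_minutes = [100.0, 150.0, 200.0, 250.0]
day_charge = [0.17 * m for m in day_minutes]
day_calls = [90, 110, 80, 105]

r_charge = pearson(day_minutes, day_charge)  # very close to 1: redundant column
r_calls = pearson(day_minutes, day_calls)    # weak correlation: keep both columns
print(r_charge, r_calls)
```

A column whose correlation with another predictor is close to 1 or -1 carries almost no extra information, which is the reason for dropping it here.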

### This will be our list of predictors

In [19]:
predictors=['Number vmail messages',
 'Total day minutes',
 'Total day calls',
 'Total eve minutes',
 'Total eve calls',
 'Total night minutes',
 'Total night calls',
 'Total intl minutes',
 'Total intl calls',
 'Customer service calls']

#### Notice that *VectorAssembler* takes a list of input columns (they MUST BE NUMERIC!) and assembles all of them into a single vector-valued output column
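
Conceptually, the assembler builds one feature vector per row. The following plain-Python sketch shows the idea for a single row. The `assemble` helper and the sample row are hypothetical, and the sketch only handles plain numbers, which is a simplification of what Spark does.

```python
def assemble(row, input_cols):
    """Collect the named numeric fields of one row into a single feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if not isinstance(value, (int, float)):
            raise TypeError(f"column {col!r} is not numeric: {value!r}")
        features.append(float(value))
    return features

# Made-up row with one non-numeric column that is left out of the feature vector
row = {"Total day minutes": 265.1, "Total day calls": 110,
       "Customer service calls": 1, "State": "KS"}
cols = ["Total day minutes", "Total day calls", "Customer service calls"]
features = assemble(row, cols)
print(features)
```

Asking the sketch to assemble the string column raises a `TypeError`, which is the plain-Python analogue of the "must be numeric" requirement stated above.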

In [21]:
from pyspark.ml.feature import VectorAssembler

assembler=VectorAssembler(inputCols=predictors,outputCol='features')

dataframe_transformedB=assembler.transform(dataframe_transformedA).select('indexedChurn','features')


In [22]:
dataframe_transformedB.take(5)

##  <font color=#FF5733> Estimators

<font font-family: "calibri" size=3.5>
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. 

Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. <br><br>
***For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.***
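
The Estimator/Model relationship can be sketched in a few lines of plain Python. The classes below are hypothetical stand-ins, not Spark classes: the "learning algorithm" just memorises the most common label, which is enough to show that `fit()` returns an object whose only job is `transform()`.

```python
from collections import Counter

class MajorityModel:
    """Plays the role of a Model (a Transformer): appends a 'prediction' field."""
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        # Return new rows; the input rows are left untouched.
        return [dict(row, prediction=self.label) for row in rows]

class MajorityEstimator:
    """Plays the role of an Estimator: fit() learns from data and returns a Model."""
    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        counts = Counter(row[self.label_col] for row in rows)
        most_common_label = counts.most_common(1)[0][0]
        return MajorityModel(most_common_label)

# Made-up training rows
train = [{"label": 0.0}, {"label": 0.0}, {"label": 1.0}]
model = MajorityEstimator(label_col="label").fit(train)
predictions = model.transform(train)
print(predictions)
```

The same shape appears in the cells that follow: the classifier object is the Estimator, the fitted tree is the Model, and calling `transform` on it appends prediction columns.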

In [24]:
from pyspark.ml.classification import DecisionTreeClassifier

# Train a DecisionTree model
dTree_algorithm = DecisionTreeClassifier(maxDepth=2,
                                        labelCol='indexedChurn', featuresCol='features')

In [25]:
dTree_model=dTree_algorithm.fit(dataframe_transformedB)

In [26]:
# toDebugString is a public property of the fitted tree model, so the private _call_java helper is not needed
print(dTree_model.toDebugString)

##  <font color= #e38009> Transformers include learned models

***E.g., a learned model takes a DataFrame, reads the column containing feature vectors, predicts the label for each feature vector, and outputs a new DataFrame with the predicted labels appended as a column.***

In [28]:
predictions=dTree_model.transform(dataframe_transformedB)

In [29]:
predictions.printSchema()

In [30]:
display(predictions)

In [31]:
import pandas as pd

pd.DataFrame(predictions.take(5), columns=predictions.columns)

## <font color=#938882>Model Evaluation

### ***For evaluation we will use the training CSV file, so the result is the training error***

In [33]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [34]:
evaluator=BinaryClassificationEvaluator(labelCol='indexedChurn',
                                        rawPredictionCol='rawPrediction',
                                        metricName='areaUnderROC')

In [35]:
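# Note: with metricName='areaUnderROC' this value is the area under the ROC curve, not classification accuracy
accuracy=evaluator.evaluate(predictions)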

In [36]:
accuracy
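
The evaluator is configured with `areaUnderROC`, so it helps to recall what that number means: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The plain-Python sketch below computes it by comparing every positive/negative pair. This illustrates the meaning of the metric and is not how Spark computes it internally; the `area_under_roc` helper and the sample scores are made up.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. A depth-2 tree has at most four leaves, so it produces at most four distinct scores and many ties.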

In [37]:
# Since dTree_model is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# DecisionTreeClassifier instance.
print("dTree_model was fit using parameters: ")
print(dTree_model.extractParamMap())

In [38]:
dTree_model.extractParamMap().keys()

In [39]:
dTree_model.maxDepth

In [40]:
# We may alternatively specify parameters using a Python dictionary as a paramMap.
# The keys are Params of the estimator (dTree_algorithm), the object that fit() is called on.
paramMap = {dTree_algorithm.maxDepth: 1}
paramMap[dTree_algorithm.maxDepth] = 7  # Specify 1 Param, overwriting the earlier maxDepth value.


# Now learn a new model using the paramMap parameters.
# paramMap overrides all parameters set earlier via dTree_algorithm.set* methods.
dTree_model2=dTree_algorithm.fit(dataframe_transformedB, paramMap)

dTree_model2.extractParamMap()

In [41]:
predictions2=dTree_model2.transform(dataframe_transformedB)

accuracy2=evaluator.evaluate(predictions2)
print(accuracy2)



In [42]:
print(dTree_model2.toDebugString)

##  Model selection via cross-validation

In this example we will use CrossValidator to select the best value of the tree model's maxDepth parameter from a grid of candidate values.
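
The mechanics of k-fold grid search can be sketched in plain Python. Everything here is a stand-in chosen for illustration: a tiny 1-D nearest-neighbour classifier replaces the decision tree, its `k` replaces `maxDepth`, accuracy replaces the evaluator's metric, the data are made up, and folds are assigned round-robin, whereas Spark assigns rows to folds randomly.

```python
def fit_knn(train, k):
    """Return a predictor that takes a majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda r: abs(r[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if votes * 2 > k else 0
    return predict

def accuracy(model, rows):
    """Fraction of rows whose label is predicted correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def cross_validate(rows, param_grid, fit, score, num_folds=3):
    """Average the validation score of every candidate over num_folds folds."""
    folds = [[i for i in range(len(rows)) if i % num_folds == f]
             for f in range(num_folds)]
    averages = {}
    for param in param_grid:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [r for i, r in enumerate(rows) if i not in held]
            valid = [rows[i] for i in held_out]
            fold_scores.append(score(fit(train, param), valid))
        averages[param] = sum(fold_scores) / num_folds
    best_param = max(param_grid, key=lambda p: averages[p])
    return best_param, averages

# Made-up (x, label) pairs; (3.1, 1) is a deliberately noisy point
rows = [(1.0, 0), (2.0, 0), (3.0, 0), (3.1, 1), (4.0, 0),
        (5.0, 0), (10.0, 1), (11.0, 1), (12.0, 1)]
best_param, averages = cross_validate(rows, [1, 3, 5], fit_knn, accuracy)
print(best_param, averages)
```

Each candidate is scored only on rows it was not trained on, which is what distinguishes this from the training-error evaluation used earlier. After choosing the best parameter map, Spark's CrossValidator refits the estimator on the whole dataset, and that refitted model is what `bestModel` holds.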

In [44]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Search through decision tree's maxDepth parameter for best model
paramGrid = ParamGridBuilder().addGrid(dTree_algorithm.maxDepth, [2,3,4,5,6,7]).build()

In [45]:
# Set up 3-fold cross validation
crossval = CrossValidator(estimator=dTree_algorithm,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

In [46]:
Cross_res=crossval.fit(dataframe_transformedB)

In [47]:
print(Cross_res.bestModel)

In [48]:
print(Cross_res.bestModel.toDebugString)

In [49]:
# Fetch the best model to make predictions with it:
Best_tree_model = Cross_res.bestModel
print(Best_tree_model)

In [50]:
predictions_CV=Best_tree_model.transform(dataframe_transformedB)

In [51]:
pd.DataFrame(predictions_CV.take(5), columns=predictions_CV.columns)

In [52]:
accuracy_CV=evaluator.evaluate(predictions_CV)

print(accuracy_CV)

## Now let's create a PIPELINE! See MSTC_Pipeline_PySpark_2.ipynb