# <font color=  #FF5733> Extending the PIPELINE with more features / transformers than those in<br>
## <font color=  #FA5733> MSTC_Pipeline_PySpark_2.ipynb

## Importing Churn Data

###  Load churn-bigml-80.csv into a DataFrame

# <font color=red>Add cache()

In [3]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

CV_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-80.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true').cache()


## Spark: ML Pipelines
https://spark.apache.org/docs/2.2.0/ml-pipeline.html


##  <font color= #e38009> Transformer A: StringIndexer

<font font-family: "calibri" size=3.5>StringIndexer encodes a column of string values as numeric category indices, which the algorithms in the ml library can then use. By default the indices follow label frequency, so the most frequent value gets index 0.
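
As a rough illustration of what StringIndexer does, the following pure-Python sketch (not Spark code) assigns indices by descending frequency, which is Spark's default ordering. The function name `fit_string_indexer`, the sample values, and the alphabetical tie-break are assumptions made for this example only.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically here
    # (an assumption chosen to keep this sketch deterministic).
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(labels)}


# Hypothetical sample column, not taken from the churn data
states = ["OH", "KS", "OH", "NJ", "OH", "KS"]
mapping = fit_string_indexer(states)        # the "fit" step learns the mapping
indexed = [mapping[s] for s in states]      # the "transform" step applies it
print(mapping)
print(indexed)
```

The split between learning the mapping and applying it mirrors why StringIndexer has to be fitted before it can transform a DataFrame.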


##  <font color= #e38009> Transformer B: VectorAssembler

<font font-family: "calibri" size=3.5>After feature engineering, the resulting columns are combined into a single vector column by the VectorAssembler before being passed to the ML Estimator.

***Notice that the input is a list of columns (they MUST BE NUMERIC, or already vectors) and the output column assembles all of them into a single column/vector.***

### <font color= #C70039 > list with predictors to Assemble

In [4]:
predictors=['Number vmail messages',
 'Total day minutes',
 'Total day calls',
 'Total eve minutes',
 'Total eve calls',
 'Total night minutes',
 'Total night calls',
 'Total intl minutes',
 'Total intl calls',
 'Customer service calls',
 'IntlPlan',
 'VmailPlan',
 'StateOHE']

##  <font color=#FF5733> Estimators

<font font-family: "calibri" size=3.5>
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. 

Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. <br><br>
***For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.***
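
The Estimator / Model relationship can be sketched in plain Python, without Spark. The classes `MeanCenterer` and `MeanCenterModel` below are hypothetical names invented for this illustration; they only mimic the `fit()` / `transform()` contract described above and are not part of any Spark API.

```python
class MeanCenterModel:
    """Plays the role of a Model (a Transformer): applies a learned parameter."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Subtract the mean learned during fit() from every value
        return [v - self.mean for v in values]


class MeanCenterer:
    """Plays the role of an Estimator: fit() learns from data, returns a Model."""

    def fit(self, values):
        return MeanCenterModel(sum(values) / len(values))


estimator = MeanCenterer()
model = estimator.fit([1.0, 2.0, 3.0])      # learns mean = 2.0
centered = model.transform([2.0, 4.0])      # applies it to new data
print(model.mean, centered)
```

A Pipeline follows the same contract: fitting it fits each Estimator stage in order and yields a model whose `transform()` runs every fitted stage in sequence.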


In [7]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import OneHotEncoder

# Index labels, adding metadata to the label column
stringindexer = StringIndexer(inputCol='Churn',
                             outputCol='indexedLabel')

stringindexerIntlPlan = StringIndexer(inputCol='International plan',
                             outputCol='IntlPlan')

stringindexerVmailPlan = StringIndexer(inputCol='Voice mail plan',
                             outputCol='VmailPlan')

# Add a categorical feature using one-hot encoding (OHE).
# First, index the string column 'State' into numeric category indices
stringindexerStateNum = StringIndexer(inputCol='State',
                             outputCol='StateNum')

# Then apply OHE on 'StateNum' (OneHotEncoder is imported above).
# dropLast=False keeps one slot per category instead of dropping the last one
OHEencoderState=OneHotEncoder(dropLast=False, inputCol='StateNum',
                             outputCol='StateOHE')

# Assemble all predictors into a single 'features' vector
# (the dense / sparse vector outputs are inspected further down)
assembler=VectorAssembler(inputCols=predictors,outputCol='features')

# Train a DecisionTree model
dTree_algorithm = DecisionTreeClassifier(maxDepth=2,
                                        labelCol='indexedLabel', featuresCol='features')


# Chain indexers, encoder, assembler and tree in a Pipeline

In [8]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[stringindexer,
                            stringindexerIntlPlan,
                            stringindexerVmailPlan,
                            stringindexerStateNum,
                            OHEencoderState,
                            assembler, dTree_algorithm])

## <font color=#938882>Model Evaluation using:

* Hyperparameter selection
* Cross-validation

In [10]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator=BinaryClassificationEvaluator(labelCol='indexedLabel',
                                        rawPredictionCol='rawPrediction',
                                        metricName='areaUnderROC')


# Search through decision tree's maxDepth parameter for best model
paramGrid = ParamGridBuilder().addGrid(dTree_algorithm.maxDepth, [2,3,4,5,6,7]).build()

# Set up 3-fold cross validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
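
To make the cost of this search concrete, here is a minimal pure-Python sketch of grid search with k-fold cross-validation. It is not Spark's implementation: the helper names (`k_fold_indices`, `grid_search_cv`, `toy_fit`, `toy_score`), the modulo-based fold assignment, and the toy scoring rule are all assumptions for illustration. CrossValidator splits the rows randomly, but the loop structure is the same: every parameter value is trained once per fold, and the best value is then refit on all the data.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train, validation) index lists; row i goes to fold i % num_folds."""
    for fold in range(num_folds):
        train = [i for i in range(n_rows) if i % num_folds != fold]
        valid = [i for i in range(n_rows) if i % num_folds == fold]
        yield train, valid


def grid_search_cv(rows, param_values, num_folds, fit, score):
    """Average the validation score of each parameter value over the folds."""
    avg_score = {}
    for param in param_values:
        fold_scores = []
        for train_idx, valid_idx in k_fold_indices(len(rows), num_folds):
            model = fit([rows[i] for i in train_idx], param)
            fold_scores.append(score(model, [rows[i] for i in valid_idx]))
        avg_score[param] = sum(fold_scores) / num_folds
    best_param = max(avg_score, key=avg_score.get)
    best_model = fit(rows, best_param)   # final refit on the full data
    return best_param, best_model, avg_score


fit_calls = []  # records every training run


def toy_fit(train_rows, max_depth):
    # Toy "training": the model is just the depth it was given
    fit_calls.append(max_depth)
    return max_depth


def toy_score(model, valid_rows):
    # Toy metric that pretends depth 4 is optimal (purely illustrative)
    return -abs(model - 4)


best_depth, _, scores = grid_search_cv(
    list(range(30)), [2, 3, 4, 5, 6, 7], 3, toy_fit, toy_score)
print(best_depth, len(fit_calls))
```

With 6 candidate depths and 3 folds, the sketch trains 6 × 3 = 18 models plus one final refit, which helps explain why fitting the cross-validator takes noticeably longer than fitting the pipeline once.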

In [12]:
from time import time

t0 = time()

CrossvalModel=crossval.fit(CV_data)

tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))


Task completed in 24.781 seconds


## <font color=magenta> Let's see the OHE implications ....

In [18]:
stringindexerStateNum = StringIndexer(inputCol='State',
                             outputCol='StateNum')

model1=stringindexerStateNum.fit(CV_data)

TransformState_Num=model1.transform(CV_data)

In [24]:
import pandas as pd

pd.DataFrame(TransformState_Num.select('State','StateNum').take(5), columns=('State','StateNum'))

|   | State | StateNum |
|---|-------|----------|
| 0 | KS    | 23.0     |
| 1 | OH    | 4.0      |
| 2 | NJ    | 27.0     |
| 3 | OH    | 4.0      |
| 4 | OK    | 24.0     |


In [26]:
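# In Spark 2.x OneHotEncoder is a Transformer, so transform() can be called directly.
# From Spark 3.0 on it is an Estimator and must be fit() before transforming.
OHEencoderState=OneHotEncoder(dropLast=False, inputCol='StateNum',
                             outputCol='StateOHE')

TransformState_OHE=OHEencoderState.transform(TransformState_Num)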

## <font color= #6fb92d >NOTE : <font color=red>SparseVector</font> format

In [38]:
TransformState_OHE.select('State','StateNum','StateOHE')\
             .take(5)

[Row(State='KS', StateNum=23.0, StateOHE=SparseVector(51, {23: 1.0})),
 Row(State='OH', StateNum=4.0, StateOHE=SparseVector(51, {4: 1.0})),
 Row(State='NJ', StateNum=27.0, StateOHE=SparseVector(51, {27: 1.0})),
 Row(State='OH', StateNum=4.0, StateOHE=SparseVector(51, {4: 1.0})),
 Row(State='OK', StateNum=24.0, StateOHE=SparseVector(51, {24: 1.0}))]
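
The rows above can be read as "a vector of length 51 with a single 1.0 at position StateNum". The following pure-Python sketch reproduces that encoding as a `(size, {index: value})` pair; the function `one_hot_sparse` is a hypothetical helper written for this note, not a Spark call.

```python
def one_hot_sparse(index, num_categories, drop_last=False):
    """Return (size, {position: 1.0}), mimicking SparseVector(size, {...})."""
    size = num_categories - 1 if drop_last else num_categories
    position = int(index)
    if position >= size:
        # With drop_last=True the last category becomes an all-zeros vector
        return size, {}
    return size, {position: 1.0}


ks = one_hot_sparse(23.0, 51)                        # 'KS' has StateNum 23.0
last_dropped = one_hot_sparse(50.0, 51, drop_last=True)
print(ks)
print(last_dropped)
```

Only the non-zero position is stored, which is why the sparse format is much more compact than 51 separate 0/1 columns.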

In [35]:
assembler=VectorAssembler(inputCols=('Total day calls','Total eve minutes','StateOHE'),\
                          outputCol='features')

VectorAssembled=assembler.transform(TransformState_OHE)

In [39]:
VectorAssembled.select('features').take(5)

[Row(features=SparseVector(53, {0: 110.0, 1: 197.4, 25: 1.0})),
 Row(features=SparseVector(53, {0: 123.0, 1: 195.5, 6: 1.0})),
 Row(features=SparseVector(53, {0: 114.0, 1: 121.2, 29: 1.0})),
 Row(features=SparseVector(53, {0: 71.0, 1: 61.9, 6: 1.0})),
 Row(features=SparseVector(53, {0: 113.0, 1: 148.3, 26: 1.0}))]
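
Note how the state index is shifted in the assembled vector: 'KS' had StateNum 23 but appears at position 25, because the two scalar columns occupy positions 0 and 1. A small pure-Python sketch of this concatenation follows; `assemble` is a hypothetical helper for illustration and represents sparse vectors as `(size, {index: value})` pairs.

```python
def assemble(columns):
    """Concatenate scalars and (size, {idx: val}) sparse vectors into one sparse vector."""
    offset = 0
    values = {}
    for col in columns:
        if isinstance(col, tuple):
            # Vector column: shift each stored index by the current offset
            size, entries = col
            for idx, val in entries.items():
                values[offset + idx] = val
            offset += size
        else:
            # Scalar column: takes exactly one slot (zeros are not stored)
            if col != 0:
                values[offset] = float(col)
            offset += 1
    return offset, values


# First row above: Total day calls = 110, Total eve minutes = 197.4, State = KS
first_row = assemble([110, 197.4, (51, {23: 1.0})])
print(first_row)
```

The total length 53 is simply 1 + 1 + 51, matching the `SparseVector(53, ...)` rows in the output.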

# <font face="calibri" color=#d63de2> Evaluate TEST DATA 

##  <font color= #e38009> Transformer : Making predictions with the TRAINED model


### <font color=red>Evaluation on TEST data

In [13]:
Test_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-20.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

In [14]:
# Make predictions and evaluate the result.
# Note: the evaluator's metric is areaUnderROC, so this value is an AUC, not an accuracy.
predictions_Test = CrossvalModel.transform(Test_data)
accuracy_Test=evaluator.evaluate(predictions_Test)

print(accuracy_Test)

0.7291958041958042


In [15]:
# Confusion matrix
predictions_Test.crosstab('Churn','prediction').show()

+----------------+---+---+
|Churn_prediction|0.0|1.0|
+----------------+---+---+
|            True| 24| 71|
|           False|556| 16|
+----------------+---+---+
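
The counts in this table can be turned into the usual classification metrics. The sketch below assumes that prediction 1.0 corresponds to Churn = True, which is what frequency-based indexing implies here since False is the majority class; `binary_metrics` is a hypothetical helper, not part of Spark.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts read from the test-data crosstab above
# (assumption: prediction 1.0 means "churn")
test_metrics = binary_metrics(tp=71, fn=24, tn=556, fp=16)
for name, value in test_metrics.items():
    print(name, round(value, 3))
```

These figures are different quantities from the value returned by the evaluator, which was configured with `metricName='areaUnderROC'`.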



In [None]:
# make predictions and evaluate result
#pipelineModel=pipeline.fit(CV_data)
#predictions_Test = pipelineModel.transform(Test_data)
#accuracy_Test=evaluator.evaluate(predictions_Test)

#print(accuracy_Test)

### <font color=red>Evaluation on <font color=green> TRAIN data

In [16]:
# Make predictions and evaluate the result.
# Note: the evaluator's metric is areaUnderROC, so this value is an AUC, not an accuracy.
predictions_Train = CrossvalModel.transform(CV_data)

accuracy_Train=evaluator.evaluate(predictions_Train)

print(accuracy_Train)

0.7764757926558837


In [17]:
# Confusion matrix
predictions_Train.crosstab('Churn','prediction').show()

+----------------+----+---+
|Churn_prediction| 0.0|1.0|
+----------------+----+---+
|            True|  66|322|
|           False|2264| 14|
+----------------+----+---+



#pipelineModel=pipeline.fit(CV_data)
#predictions_Train = pipelineModel.transform(CV_data)
#accuracy_Train=evaluator.evaluate(predictions_Train)

#print(accuracy_Train)

# <font color= #9e9b9e >..... ANALYZE BEST MODEL

In [None]:
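# Fetch the best model found by cross-validation. It is a fitted PipelineModel,
# and the trained decision tree is its last stage. To use that stage on its own,
# the data must first be processed by the preceding stages.
Best_tree_model = CrossvalModel.bestModel
print(Best_tree_model.stages[-1])

In [None]:
Best_tree_model.stages[-1]

In [None]:
# Print the learned tree as nested if/else rules
print(CrossvalModel.bestModel.stages[-1].toDebugString)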