# <font color=  #FF5733> Now we apply a Random Forest (RF) in the PIPELINE, using the same features as in<br>
## <font color=  #FA5733> MSTC_Pipeline_PySpark_3.ipynb

## Importing Churn Data

###  Load churn-bigml-80.csv into a DataFrame

# <font color=red>Add cache()

In [6]:
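from pyspark.sql import SQLContext
# `sc` is the SparkContext already created by the notebook environment
sqlContext = SQLContext(sc)

# Read the training CSV with a header row and inferred column types;
# cache() keeps the DataFrame in memory, since cross-validation reads it many times
CV_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-80.csv',
                          format='com.databricks.spark.csv',
                          header='true',
                          inferSchema='true').cache()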


## Spark: ML Pipelines
https://spark.apache.org/docs/2.2.0/ml-pipeline.html


##  <font color= #e38009> Transformer A: StringIndexer

<font font-family: "calibri" size=3.5>StringIndexer maps the distinct string values of a column to numeric category indices, which the machine learning algorithms in the ml library can then use.
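
As a rough illustration of what a fitted indexer does, the sketch below mimics the default ordering in plain Python (no Spark needed): labels are ranked by descending frequency, so the most frequent label receives index 0.0. The helper `fit_string_indexer` and the toy column are hypothetical and are not part of the notebook or of PySpark.

```python
from collections import Counter


def fit_string_indexer(values):
    """Return a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column (values made up for illustration).
toy_column = ["No", "No", "Yes", "No", "Yes", "No"]
mapping = fit_string_indexer(toy_column)
indexed = [mapping[value] for value in toy_column]
print(mapping)
print(indexed)
```

This ordering is why, later in the notebook, the majority class ends up as label 0.0 in the prediction columns.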


##  <font color= #e38009> Transformer B: VectorAssembler

<font font-family: "calibri" size=3.5>After feature engineering, the resulting columns are combined into a single feature vector by the VectorAssembler before being passed to the ML Estimator.

***Notice that the input is a list of columns (they MUST BE NUMERIC!) and the output is a single column that holds all of them assembled into one vector.***
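
Conceptually, the assembler does something like the following for every row. This is a plain-Python sketch, not the PySpark implementation; the function `assemble`, the row, and its values are hypothetical.

```python
def assemble(row, input_cols):
    """Concatenate the named numeric columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]


# Hypothetical row; the values are made up for illustration.
toy_row = {"Total day minutes": 180.5, "Total day calls": 97, "IntlPlan": 1.0}
toy_cols = ["Total day minutes", "Total day calls", "IntlPlan"]

features = assemble(toy_row, toy_cols)
print(features)
```

The order of `input_cols` fixes the position of each feature in the vector, which matters later when reading feature indices in the fitted trees.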

### <font color= #C70039 > List of predictors to assemble

In [7]:
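# Columns fed to the VectorAssembler; IntlPlan and VmailPlan are the
# indexed versions of the two string columns (created by the StringIndexers)
predictors=['Number vmail messages',
 'Total day minutes',
 'Total day calls',
 'Total eve minutes',
 'Total eve calls',
 'Total night minutes',
 'Total night calls',
 'Total intl minutes',
 'Total intl calls',
 'Customer service calls',
 'IntlPlan',
 'VmailPlan']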

##  <font color=#FF5733> Estimators

<font font-family: "calibri" size=3.5>
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. 

Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. <br><br>
***For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.***
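
The Estimator/Model contract can be sketched in a few lines of plain Python. The two classes below are hypothetical toys (a majority-class "learner") and only imitate the shape of the API: `fit()` returns a model, and the model's `transform()` adds a prediction column.

```python
class MajorityClassModel:
    """A 'Model': holds what was learned and acts as a Transformer."""

    def __init__(self, majority_label):
        self.majority_label = majority_label

    def transform(self, rows):
        # Return new rows with an added 'prediction' column.
        return [dict(row, prediction=self.majority_label) for row in rows]


class MajorityClassEstimator:
    """An 'Estimator': fit() learns from data and returns a Model."""

    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        labels = [row[self.label_col] for row in rows]
        majority = max(set(labels), key=labels.count)
        return MajorityClassModel(majority)


# Hypothetical toy data.
toy_rows = [{"Churn": False}, {"Churn": False}, {"Churn": True}]
model = MajorityClassEstimator(label_col="Churn").fit(toy_rows)
predicted = model.transform(toy_rows)
print(predicted)
```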


In [16]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

from pyspark.ml.classification import RandomForestClassifier

# Index labels, adding metadata to the label column
stringindexer = StringIndexer(inputCol='Churn',
                             outputCol='indexedLabel')

stringindexerIntlPlan = StringIndexer(inputCol='International plan',
                             outputCol='IntlPlan')

stringindexerVmailPlan = StringIndexer(inputCol='Voice mail plan',
                             outputCol='VmailPlan')

assembler=VectorAssembler(inputCols=predictors,outputCol='features')

# Define the RandomForest estimator (it is trained later, when the pipeline is fit)
rf_algorithm = RandomForestClassifier(labelCol='indexedLabel',
                                      featuresCol='features')

# Alternative: a DecisionTree estimator
#dTree_algorithm = DecisionTreeClassifier(maxDepth=2,
#                                        labelCol='indexedLabel', featuresCol='features')


## RF hyperparameters
https://spark.apache.org/docs/1.6.1/mllib-ensembles.html#usage-tips

# Chain indexers, assembler and random forest in a Pipeline

In [17]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[stringindexer,\
                            stringindexerIntlPlan,\
                            stringindexerVmailPlan,\
                            assembler, rf_algorithm])

#                            assembler, dTree_algorithm])
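
To make the chaining explicit, here is a simplified plain-Python sketch of what fitting a pipeline does: transformers are applied as they are, estimators are fit on the data produced so far, and the resulting model transforms the data for the next stage. All classes and functions below are hypothetical toys; Spark's actual `Pipeline` has more machinery (for example, it does not bother transforming after the last estimator).

```python
class AddOne:
    """Toy Transformer: adds 1 to every value."""

    def transform(self, data):
        return [x + 1 for x in data]


class CenterModel:
    """Toy Model: subtracts the mean learned during fit()."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class Center:
    """Toy Estimator: learns the mean and returns a CenterModel."""

    def fit(self, data):
        return CenterModel(sum(data) / len(data))


def fit_pipeline(stages, data):
    """Fit each estimator in order, feeding it the output of earlier stages."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):      # Estimator: replace it by its fitted model
            stage = stage.fit(data)
        fitted.append(stage)
        data = stage.transform(data)   # pass transformed data downstream
    return fitted


def transform_pipeline(fitted_stages, data):
    """Apply every fitted stage in order."""
    for stage in fitted_stages:
        data = stage.transform(data)
    return data


fitted_stages = fit_pipeline([AddOne(), Center()], [1, 2, 3])
result = transform_pipeline(fitted_stages, [1, 2, 3])
print(result)
```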

## <font color=#938882>Model Evaluation using:

* Hyperparameter selection
* Cross-validation

In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator=BinaryClassificationEvaluator(labelCol='indexedLabel',\
                                        rawPredictionCol='rawPrediction',\
                                       metricName='areaUnderROC')


# Search over the random forest's numTrees and maxDepth parameters for the best model
#paramGrid = ParamGridBuilder().addGrid(dTree_algorithm.maxDepth, [2,3,4,5,6,7]).build()
paramGrid = ParamGridBuilder().addGrid(rf_algorithm.numTrees, [100,200,400,800])\
                        .addGrid(rf_algorithm.maxDepth, [2,3,4,5,6,7])\
                        .build()

# Set up 3-fold cross validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
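
It helps to count how many models this setup trains before running it. The sketch below rebuilds the same grid in plain Python; the variable names are illustrative only. It assumes the usual cross-validation scheme: one fit per parameter combination per fold, plus one final refit of the best combination on the full training data.

```python
from itertools import product

# Same values as in the ParamGridBuilder call above.
num_trees_values = [100, 200, 400, 800]
max_depth_values = [2, 3, 4, 5, 6, 7]
num_folds = 3

grid = [{"numTrees": n, "maxDepth": d}
        for n, d in product(num_trees_values, max_depth_values)]

fits_during_cv = len(grid) * num_folds   # one fit per combination per fold
total_fits = fits_during_cv + 1          # plus the final refit of the best combination

print(len(grid), fits_during_cv, total_fits)
```

With forests of up to 800 trees, this many fits explains why the cross-validation cell takes minutes rather than seconds.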

In [19]:
from time import time

t0 = time()

CrossvalModel=crossval.fit(CV_data)

tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))


Task completed in 218.869 seconds


# <font face="calibri" color=#d63de2> Evaluate TEST DATA 

##  <font color= #e38009> Transformer : Making predictions with the TRAINED model


### <font color=red>Evaluation on TEST data

In [20]:
Test_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-20.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

In [21]:
# Make predictions on the test set and evaluate them.
# Note: the evaluator returns the area under the ROC curve (areaUnderROC), not accuracy.
predictions_Test = CrossvalModel.transform(Test_data)
auc_Test = evaluator.evaluate(predictions_Test)

print(auc_Test)

0.9287081339712927


In [22]:
# Confusion matrix: rows are the actual Churn values, columns are the predicted labels
predictions_Test.crosstab('Churn','prediction').show()

+----------------+---+---+
|Churn_prediction|0.0|1.0|
+----------------+---+---+
|            True| 35| 60|
|           False|570|  2|
+----------------+---+---+
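
The value printed by the evaluator is the area under the ROC curve, so accuracy, precision and recall have to be derived from the confusion matrix. A small sketch using the four counts from the table above, treating `Churn = True` (predicted label 1.0) as the positive class:

```python
# Counts copied from the test-set confusion matrix above.
tp, fn = 60, 35    # actual True:  predicted 1.0 / predicted 0.0
fp, tn = 2, 570    # actual False: predicted 1.0 / predicted 0.0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were found

print("accuracy  = {:.3f}".format(accuracy))
print("precision = {:.3f}".format(precision))
print("recall    = {:.3f}".format(recall))
```

The model rarely raises a false alarm, but it misses more than a third of the customers who actually churn.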



In [None]:
# make predictions and evaluate result
#pipelineModel=pipeline.fit(CV_data)
#predictions_Test = pipelineModel.transform(Test_data)
#accuracy_Test=evaluator.evaluate(predictions_Test)

#print(accuracy_Test)

### <font color=red>Evaluation on <font color=green> TRAIN data

In [23]:
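# Make predictions on the training set and evaluate them.
# Note: the evaluator returns the area under the ROC curve (areaUnderROC), not accuracy.
predictions_Train = CrossvalModel.transform(CV_data)

auc_Train = evaluator.evaluate(predictions_Train)

print(auc_Train)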

0.9740412552157359


In [24]:
# Confusion matrix: rows are the actual Churn values, columns are the predicted labels
predictions_Train.crosstab('Churn','prediction').show()

+----------------+----+---+
|Churn_prediction| 0.0|1.0|
+----------------+----+---+
|            True| 100|288|
|           False|2278|  0|
+----------------+----+---+



In [None]:
#pipelineModel=pipeline.fit(CV_data)
#predictions_Train = pipelineModel.transform(CV_data)
#accuracy_Train=evaluator.evaluate(predictions_Train)

#print(accuracy_Train)

# <font color= #9e9b9e >..... ANALYZE BEST MODEL

In [29]:
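# Fetch the best model: a fitted pipeline whose last stage (index 4) is the random forest.
# NOTE: to use that stage on its own, the data must first be processed by hand, without the Pipeline (see below)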
Best_tree_model = CrossvalModel.bestModel
print(Best_tree_model.stages[4])

RandomForestClassificationModel (uid=rfc_f33bc132f3b1) with 800 trees


In [28]:
Best_tree_model.stages

[StringIndexer_4fc3b4208a299d82f873,
 StringIndexer_45b2bc6214f9188f167c,
 StringIndexer_4ed6b61d639052250f28,
 VectorAssembler_40eea090a3f0c9962f50,
 RandomForestClassificationModel (uid=rfc_f33bc132f3b1) with 800 trees]
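
Feature numbers such as `feature 11` in a tree dump refer to positions in the vector built by the VectorAssembler stage, that is, to positions in the `predictors` sequence defined earlier. A small lookup table makes the trees easier to read; the names below are copied from that definition, and `feature_names` is just an illustrative helper.

```python
# Predictor names in the order given to the VectorAssembler.
predictors = ['Number vmail messages',
              'Total day minutes',
              'Total day calls',
              'Total eve minutes',
              'Total eve calls',
              'Total night minutes',
              'Total night calls',
              'Total intl minutes',
              'Total intl calls',
              'Customer service calls',
              'IntlPlan',
              'VmailPlan']

# Map each feature index to its column name.
feature_names = dict(enumerate(predictors))

for index in (11, 3, 9, 10):
    print("feature {} -> {}".format(index, feature_names[index]))
```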

In [30]:
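# Dump the learned trees; _call_java is a private PySpark helper that
# calls toDebugString on the underlying JVM model
print(CrossvalModel.bestModel.stages[4]._call_java("toDebugString"))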

RandomForestClassificationModel (uid=rfc_f33bc132f3b1) with 800 trees
  Tree 0 (weight 1.0):
    If (feature 11 in {0.0})
     If (feature 3 <= 224.8)
      If (feature 9 <= 3.0)
       If (feature 10 in {0.0})
        If (feature 1 <= 278.9)
         If (feature 6 <= 111.0)
          If (feature 3 <= 183.3)
           Predict: 0.0
          Else (feature 3 > 183.3)
           Predict: 0.0
         Else (feature 6 > 111.0)
          If (feature 6 <= 124.0)
           Predict: 0.0
          Else (feature 6 > 124.0)
           Predict: 0.0
        Else (feature 1 > 278.9)
         Predict: 1.0
       Else (feature 10 not in {0.0})
        If (feature 2 <= 69.0)
         If (feature 3 <= 208.9)
          Predict: 1.0
         Else (feature 3 > 208.9)
          Predict: 0.0
        Else (feature 2 > 69.0)
         If (feature 9 <= 1.0)
          If (feature 4 <= 93.0)
           Predict: 1.0
          Else (feature 4 > 93.0)
           Predict: 0.0
         Else (feature 9 > 1.0)
         