# <font color=  #FF5733> Let's start building a PIPELINE using the elements in<br>
## <font color=  #FA5733> MSTC_Pipeline_PySpark_1.ipynb

## Importing Churn Data

###  Load churn-bigml-80.csv into a DataFrame

In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

CV_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-80.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


## Spark: ML Pipelines
https://spark.apache.org/docs/2.2.0/ml-pipeline.html


##  <font color= #e38009> Transformer A: StringIndexer

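<font font-family: "calibri" size=3.5>StringIndexer encodes a column of string values as a column of numeric category indices, which the machine learning algorithms in Spark's ml library can consume. By default the indices are assigned by descending frequency, so the most frequent value gets index 0.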
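To make the indexing rule concrete, here is a minimal pure-Python sketch of a frequency-ordered label index. It is not Spark's implementation: `fit_string_indexer` is a hypothetical helper, the labels are toy values, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Return a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically here
    # (an assumption of this sketch, not a statement about Spark).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy label column (made-up values, not taken from the churn file)
churn = ["False", "False", "True", "False", "True", "False"]
mapping = fit_string_indexer(churn)
indexed = [mapping[v] for v in churn]

print(mapping)  # {'False': 0.0, 'True': 1.0}
print(indexed)  # [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

The mapping is learned once from the data and then applied row by row, which is why StringIndexer has to be fitted before it can transform anything.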


##  <font color= #e38009> Transformer B: VectorAssembler

<font font-family: "calibri" size=3.5>After feature engineering, the resulting columns are combined into a single vector column with VectorAssembler before being passed to an ML Estimator.

***Notice that the input is a list of columns (they MUST BE NUMERIC!) and that the output is a single column holding all of them as one vector.***
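As a rough illustration of what "assembling" means for one row, the sketch below collects named numeric columns into a single list, in the order given. `assemble` is a hypothetical helper and the row values are made up; Spark produces a vector column, not a Python list.

```python
def assemble(row, input_cols):
    """Collect the named numeric columns of one row into a single list."""
    vector = []
    for col in input_cols:
        value = row[col]
        if not isinstance(value, (int, float)):
            # String columns must be indexed or dropped before assembling
            raise TypeError(f"column {col!r} is not numeric: {value!r}")
        vector.append(float(value))
    return vector


# Toy row with made-up values
row = {"State": "KS", "Total day minutes": 265.1,
       "Total day calls": 110, "Customer service calls": 1}

features = assemble(row, ["Total day minutes", "Total day calls",
                          "Customer service calls"])
print(features)  # [265.1, 110.0, 1.0]
```

The position of each value in the output follows the order of the input column list, which matters later when a fitted tree refers to features by index.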

### <font color= #C70039 > list with predictors to Assemble

In [3]:
predictors = ['Number vmail messages',
              'Total day minutes',
              'Total day calls',
              'Total eve minutes',
              'Total eve calls',
              'Total night minutes',
              'Total night calls',
              'Total intl minutes',
              'Total intl calls',
              'Customer service calls']

##  <font color=#FF5733> Estimators

<font font-family: "calibri" size=3.5>
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. 

Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. <br><br>
***For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.***
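The fit()/transform() contract described above can be sketched without Spark. The two classes below are hypothetical toy stand-ins: rows are plain dicts rather than a DataFrame, and the "learning algorithm" only memorises the majority label.

```python
class MajorityClassModel:
    """Transformer: adds a 'prediction' field to every row."""

    def __init__(self, majority):
        self.majority = majority

    def transform(self, rows):
        return [dict(row, prediction=self.majority) for row in rows]


class MajorityClassEstimator:
    """Estimator: fit() learns the most common label and returns a model."""

    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        labels = [row[self.label_col] for row in rows]
        majority = max(set(labels), key=labels.count)
        return MajorityClassModel(majority)


train = [{"label": 0.0}, {"label": 0.0}, {"label": 1.0}]
model = MajorityClassEstimator(label_col="label").fit(train)  # Estimator -> Model
scored = model.transform([{"label": 1.0}])                    # Model is a Transformer
print(scored)  # [{'label': 1.0, 'prediction': 0.0}]
```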


In [6]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Index labels, adding metadata to the label column
stringindexer = StringIndexer(inputCol='Churn',
                              outputCol='indexedLabel')

# Assemble the predictor columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=predictors, outputCol='features')

# Configure a DecisionTree classifier (it is only trained when fit() is called)
dTree_algorithm = DecisionTreeClassifier(maxDepth=2,
                                         labelCol='indexedLabel', featuresCol='features')


# Chain indexers and tree in a Pipeline

In [8]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[stringindexer,
                            assembler, dTree_algorithm])
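What fitting a chain of stages amounts to can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's code: `fit_pipeline`, `run_pipeline` and the three toy stage classes are hypothetical, and rows are dicts instead of a DataFrame.

```python
def fit_pipeline(stages, rows):
    """Fit the stages in order and return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        # An estimator has fit(); a plain transformer is used as-is
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)  # the output feeds the next stage
        fitted.append(model)
    return fitted


def run_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


class AddOne:  # toy transformer
    def transform(self, rows):
        return [dict(r, x=r["x"] + 1) for r in rows]


class SubtractMean:  # toy model produced by MeanEstimator
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(r, centered=r["x"] - self.mean) for r in rows]


class MeanEstimator:  # toy estimator
    def fit(self, rows):
        return SubtractMean(sum(r["x"] for r in rows) / len(rows))


fitted = fit_pipeline([AddOne(), MeanEstimator()], [{"x": 1.0}, {"x": 3.0}])
print(run_pipeline(fitted, [{"x": 5.0}]))  # [{'x': 6.0, 'centered': 3.0}]
```

The point of the chain is that the estimator is fitted on data that has already passed through the earlier stages, and the same sequence is replayed unchanged on new data.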

## <font color=#938882>Model Evaluation using:

* Hyperparameter selection
* Cross-validation

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator = BinaryClassificationEvaluator(labelCol='indexedLabel',
                                          rawPredictionCol='rawPrediction',
                                          metricName='areaUnderROC')


# Search through decision tree's maxDepth parameter for best model
paramGrid = ParamGridBuilder().addGrid(dTree_algorithm.maxDepth, [2,3,4,5,6,7]).build()

# Set up 3-fold cross validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
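The search configured above tries every grid value on every fold: 6 values of maxDepth times 3 folds gives 18 pipeline fits, followed by a final refit of the best setting on the full data. The sketch below shows the general shape of such a loop in plain Python. The helper names, the round-robin fold assignment, and the toy threshold "model" are all assumptions made for illustration; Spark assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, valid_idx) pairs, assigning rows to folds round-robin."""
    for fold in range(num_folds):
        valid = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, valid


def cross_validate(param_grid, fit, score, rows, num_folds=3):
    """Return (best_params, average_scores) over the grid."""
    averages = []
    for params in param_grid:
        fold_scores = []
        for train_idx, valid_idx in k_fold_indices(len(rows), num_folds):
            model = fit([rows[i] for i in train_idx], params)
            fold_scores.append(score(model, [rows[i] for i in valid_idx]))
        averages.append(sum(fold_scores) / num_folds)
    best = max(range(len(param_grid)), key=averages.__getitem__)
    return param_grid[best], averages


# Toy problem: rows are (x, label); the "model" is just a threshold on x
rows = [(i, 1 if i >= 6 else 0) for i in range(12)]
grid = [{"threshold": 2}, {"threshold": 5}, {"threshold": 8}]


def fit(train_rows, params):
    return params["threshold"]  # nothing is learned in this toy example


def score(threshold, valid_rows):
    hits = sum((x > threshold) == bool(y) for x, y in valid_rows)
    return hits / len(valid_rows)


best, averages = cross_validate(grid, fit, score, rows)
print(best)      # {'threshold': 5}
print(averages)  # [0.75, 1.0, 0.75]
```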

In [52]:
CrossvalModel=crossval.fit(CV_data)

# <font face="calibri" color=#d63de2> Evaluate TEST DATA 

##  <font color= #e38009> Transformer : Making predictions with the TRAINED model


### <font color=red>Evaluation on TEST data

In [38]:
Test_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-20.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')

In [60]:
# Make predictions and evaluate them; the evaluator returns the area under the ROC curve, not accuracy
predictions_Test = CrossvalModel.transform(Test_data)
accuracy_Test=evaluator.evaluate(predictions_Test)

print(accuracy_Test)

0.6771623113728378
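The number printed by the evaluator is the area under the ROC curve, not an accuracy. One way to read it: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to a random ranking and values below 0.5 to a ranking that is mostly reversed. The sketch below computes that pairwise quantity directly in plain Python; `area_under_roc` is a hypothetical helper and the scores are made up. Spark derives the metric from the ROC curve rather than from this double loop.

```python
def area_under_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
auc_good = area_under_roc([0.1, 0.4, 0.35, 0.8], labels)
auc_reversed = area_under_roc([0.9, 0.6, 0.65, 0.2], labels)

print(auc_good)      # 0.75
print(auc_reversed)  # 0.25
```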


In [64]:
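# Baseline for comparison: fit the pipeline directly (maxDepth=2), without cross-validation.
# The lines are commented out; the value below is output kept from a run with them enabled.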
#pipelineModel=pipeline.fit(CV_data)
#predictions_Test = pipelineModel.transform(Test_data)
#accuracy_Test=evaluator.evaluate(predictions_Test)

#print(accuracy_Test)

0.2507453073242547


### <font color=red>Evaluation on <font color=green> TRAIN data

In [46]:
# Make predictions and evaluate them; the evaluator returns the area under the ROC curve, not accuracy
predictions_Train = CrossvalModel.transform(CV_data)

accuracy_Train=evaluator.evaluate(predictions_Train)

print(accuracy_Train)

0.7119336232723587


In [59]:
#pipelineModel=pipeline.fit(CV_data)
#predictions_Train = pipelineModel.transform(CV_data)
#accuracy_Train=evaluator.evaluate(predictions_Train)

#print(accuracy_Train)

0.2793619832915471


# <font color= #9e9b9e >..... ANALYZE BEST MODEL

In [30]:
# Fetch the best model. bestModel is the fitted PipelineModel, so the tree is stage 2;
# to use the tree on its own, the earlier stages must be applied by hand (see below)
Best_tree_model = CrossvalModel.bestModel
print(Best_tree_model.stages[2])

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4bf0b6a906cb9b2250b0) of depth 7 with 139 nodes


In [29]:
Best_tree_model.stages[2]

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4bf0b6a906cb9b2250b0) of depth 7 with 139 nodes
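The tree dump printed by the next cell refers to features by position ("feature 1", "feature 9"). Since the assembler keeps the order of its input columns, each index can be mapped back to a predictor name. A minimal sketch, assuming the `predictors` list defined earlier; `name_features` is a hypothetical helper and `sample` is a two-line excerpt of the dump.

```python
import re

predictors = ['Number vmail messages', 'Total day minutes', 'Total day calls',
              'Total eve minutes', 'Total eve calls', 'Total night minutes',
              'Total night calls', 'Total intl minutes', 'Total intl calls',
              'Customer service calls']


def name_features(debug_string, names):
    """Replace every 'feature N' with the N-th column name."""
    return re.sub(r"feature (\d+)",
                  lambda m: names[int(m.group(1))],
                  debug_string)


sample = "If (feature 1 <= 262.8)\n If (feature 9 <= 3.0)"
named = name_features(sample, predictors)
print(named)
# If (Total day minutes <= 262.8)
#  If (Customer service calls <= 3.0)
```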

In [31]:
print(CrossvalModel.bestModel.stages[2].toDebugString)

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4bf0b6a906cb9b2250b0) of depth 7 with 139 nodes
  If (feature 1 <= 262.8)
   If (feature 9 <= 3.0)
    If (feature 1 <= 220.8)
     If (feature 7 <= 13.0)
      If (feature 8 <= 2.0)
       If (feature 3 <= 259.6)
        If (feature 4 <= 103.0)
         Predict: 0.0
        Else (feature 4 > 103.0)
         Predict: 0.0
       Else (feature 3 > 259.6)
        If (feature 6 <= 77.0)
         Predict: 1.0
        Else (feature 6 > 77.0)
         Predict: 0.0
      Else (feature 8 > 2.0)
       If (feature 3 <= 183.3)
        If (feature 2 <= 73.0)
         Predict: 0.0
        Else (feature 2 > 73.0)
         Predict: 0.0
       Else (feature 3 > 183.3)
        If (feature 8 <= 6.0)
         Predict: 0.0
        Else (feature 8 > 6.0)
         Predict: 0.0
     Else (feature 7 > 13.0)
      If (feature 0 <= 38.0)
       If (feature 8 <= 17.0)
        If (feature 9 <= 0.0)
         Predict: 0.0
        Else (feature 9 > 0.0)
   