# Machine Learning Pipelines

Several classifiers from PySpark's `ml` library were selected as candidate models for customer churn. The data pipeline is already hosted on PySpark, and churn prediction is a binary classification problem, so the plan is to build the ml models in PySpark and use them to fit and predict customer churn.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructField
    , StringType
    , IntegerType
    , DoubleType
    , BooleanType
    , StructType
)
from pyspark.ml.feature import (
    VectorAssembler
    , OneHotEncoder
    , StringIndexer
    , Imputer
)
from pyspark.ml.classification import (
    LogisticRegression
    , DecisionTreeClassifier
    , GBTClassifier
    , RandomForestClassifier
)
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator
    , MulticlassClassificationEvaluator
)

In [0]:
spark = SparkSession.builder.appName('ML_Pipeline').getOrCreate()

In [0]:
my_schema = StructType(fields=[
    StructField('customer_id', IntegerType(), True)
    , StructField('surname', StringType(), True)
    , StructField('credit_score', IntegerType(), True)
    , StructField('geography', StringType(), True)
    , StructField('gender', StringType(), True)
    , StructField('age', IntegerType(), True)
    , StructField('tenure', IntegerType(), True)
    , StructField('balance', DoubleType(), True)
    , StructField('product_count', IntegerType(), True)
    , StructField('has_creditcard', IntegerType(), True)
    , StructField('active_member', IntegerType(), True)
    , StructField('estimated_salary', DoubleType(), True)
    , StructField('complain', IntegerType(), True)
    , StructField('satisfaction_score', IntegerType(), True)
    , StructField('card_type', StringType(), True)
    , StructField('points_earned', IntegerType(), True)
    , StructField('churn', IntegerType(), True)
])

data = spark.read.csv(
    '/FileStore/tables/pyspark_churn.csv'
    , header=True
    , schema=my_schema
)

data.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- surname: string (nullable = true)
 |-- credit_score: integer (nullable = true)
 |-- geography: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- balance: double (nullable = true)
 |-- product_count: integer (nullable = true)
 |-- has_creditcard: integer (nullable = true)
 |-- active_member: integer (nullable = true)
 |-- estimated_salary: double (nullable = true)
 |-- complain: integer (nullable = true)
 |-- satisfaction_score: integer (nullable = true)
 |-- card_type: string (nullable = true)
 |-- points_earned: integer (nullable = true)
 |-- churn: integer (nullable = true)



During testing, the ml objects threw many errors caused by a lack of data. Further inspection showed that several columns held no usable data: they were either entirely null or contained only a single value. These columns are removed so that the ml objects receive meaningful data. Rows that are entirely null are then dropped, and the labelled data is split into training and test sets. Following a prior EDA document, a third set, `unlabelled_data`, is also created for customers without a churn label, as these account for about 38% of the data (roughly 5/13).

In [0]:
kept_columns = []
for c in data.columns:
    if data.select(c).distinct().count() <= 1 or data.select(c).na.drop().count() == 0:
        print(f'Column {c} has no data.')
    else:
        kept_columns.append(c)
label = 'churn'
raw_features = [c for c in kept_columns if c not in {'customer_id', 'surname', label}]
ml_columns = raw_features + [label]
customer_columns = ['customer_id', 'surname'] + raw_features
print(f'''
The left over columns are features
{raw_features}
and label {[label]}.
''')

Column age has no data.
Column has_creditcard has no data.
Column active_member has no data.
Column complain has no data.
Column satisfaction_score has no data.
Column card_type has no data.
Column points_earned has no data.

The left over columns are features
['credit_score', 'geography', 'gender', 'tenure', 'balance', 'product_count', 'estimated_salary']
and label ['churn'].



In [0]:
nonnull_count = data.select(ml_columns).na.drop().count()
if nonnull_count < 1000:
    print(f'There are {nonnull_count} data points after dropping nulls. An imputer is required.')
else:
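    # how='all' drops only rows in which every column is null, so customers
    # with a null churn label are kept for the unlabelled set.
    data = data.na.drop(how='all')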
    print('Nulls have been dropped.')

Nulls have been dropped.


In [0]:
labelled_data = data.select(ml_columns).filter(data.churn.isNotNull())
unlabelled_data = data.select(customer_columns).filter(data.churn.isNull())
print(f'Total customers: {data.count()}')
print(f'Tracked customers: {labelled_data.count()}')
print(f'New customers: {unlabelled_data.count()}')
train, test = labelled_data.randomSplit([0.8,0.2])

Total customers: 13545
Tracked customers: 8430
New customers: 5115
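
The split above is unseeded, so the train and test sets change from run to run. `randomSplit` accepts an optional `seed` argument, and it samples rows rather than cutting at an exact boundary, so the 80/20 proportions are only approximate. The following pure-Python sketch imitates that behaviour under stated assumptions: `bernoulli_split` is a hypothetical helper, not a PySpark function, and the integer rows are stand-ins for the labelled customers.

```python
import random

def bernoulli_split(rows, weights, seed):
    """Assign each row independently to one split, in proportion to `weights`."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalised boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
    return splits

rows = list(range(8430))  # stand-in for the 8430 labelled customers
train_a, test_a = bernoulli_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = bernoulli_split(rows, [0.8, 0.2], seed=42)

print(train_a == train_b)              # same seed, same split
print(len(train_a) + len(test_a))      # every row lands in exactly one split
```

In the notebook itself, the equivalent change would be passing a fixed seed, for example `labelled_data.randomSplit([0.8, 0.2], seed=42)`, where the value 42 is arbitrary.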


A final check shows that each column has at least 2 distinct values.

In [0]:
for c in train.columns:
    population = train.select(c).distinct().count()
    assert population>1, f'The ml models will not work on this data: column {c} is invalid.'
    print(f'Column {c} has {population} distinct values.')

Column credit_score has 369 distinct values.
Column geography has 3 distinct values.
Column gender has 2 distinct values.
Column tenure has 11 distinct values.
Column balance has 3318 distinct values.
Column product_count has 4 distinct values.
Column estimated_salary has 6330 distinct values.
Column churn has 2 distinct values.


In [0]:
train.printSchema()

root
 |-- credit_score: integer (nullable = true)
 |-- geography: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- balance: double (nullable = true)
 |-- product_count: integer (nullable = true)
 |-- estimated_salary: double (nullable = true)
 |-- churn: integer (nullable = true)



Columns `geography` and `gender` are strings, which the ml objects do not accept. This is fixed by mapping each category to a numerical index with a `StringIndexer`. The index is then one-hot encoded so that the models treat the column as categorical rather than ordinal data.

In [0]:
categorical_columns = ['geography', 'gender']
feature_columns = [c if c not in categorical_columns else c+'_onehotencode' for c in raw_features]

geography_indexer = StringIndexer(inputCol='geography',outputCol='geography_index')
geography_encoder = OneHotEncoder(inputCol='geography_index',outputCol='geography_onehotencode')
gender_indexer = StringIndexer(inputCol='gender',outputCol='gender_index')
gender_encoder = OneHotEncoder(inputCol='gender_index',outputCol='gender_onehotencode')
assembler = VectorAssembler(inputCols=feature_columns,outputCol='features')
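
To make the indexing and encoding steps concrete, here is a pure-Python imitation of what the indexer, encoder, and assembler are expected to produce. It assumes the PySpark defaults: `StringIndexer` orders categories by descending frequency, and `OneHotEncoder` drops the last category (`dropLast=True`), so three geographies become two columns and two genders become one. The helper names and the toy frequency sample are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: i for i, category in enumerate(ordered)}

def one_hot_drop_last(index, size):
    """Encode `index` as a vector of length size - 1; the last category is all zeros."""
    return [1.0 if i == index else 0.0 for i in range(size - 1)]

# Hypothetical toy sample; only the relative frequencies matter here.
geo_index = fit_index(['France', 'France', 'France', 'Germany', 'Germany', 'Spain'])
gender_index = fit_index(['Male', 'Male', 'Female'])

def assemble(credit_score, geography, gender, tenure, balance, product_count, salary):
    """Concatenate numeric and encoded columns in the order of `feature_columns`."""
    return (
        [float(credit_score)]
        + one_hot_drop_last(geo_index[geography], len(geo_index))
        + one_hot_drop_last(gender_index[gender], len(gender_index))
        + [float(tenure), float(balance), float(product_count), float(salary)]
    )

features = assemble(669, 'Germany', 'Female', 6, 137946.75, 1, 82467.57)
print(len(features), features)
```

Under these assumptions the seven raw features expand to a vector of length 8: one slot for `credit_score`, two for `geography`, one for `gender`, and four for the remaining numeric columns.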

A dictionary holds the names and objects of the proposed classifiers.

In [0]:
ml_objects = {
    'logistic_regression': LogisticRegression(featuresCol='features',labelCol='churn')
    , 'decision_tree': DecisionTreeClassifier(featuresCol='features',labelCol='churn')
    , 'gbt': GBTClassifier(featuresCol='features',labelCol='churn')
    , 'random_forest': RandomForestClassifier(featuresCol='features',labelCol='churn')
}

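Several common metrics were selected to compare the models with each other: area under the ROC curve (AUC), accuracy, weighted precision, and weighted recall. Note that the AUC evaluator below is given the hard `prediction` column rather than `rawPrediction`, so the reported AUC reflects a single operating point (it equals balanced accuracy) rather than how well each model ranks customers.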

In [0]:
auc_eval = BinaryClassificationEvaluator(
    rawPredictionCol='prediction'
    , labelCol='churn'
)
acc_eval = MulticlassClassificationEvaluator(
    labelCol='churn'
    , predictionCol='prediction'
    , metricName='accuracy'
)
precision_eval = MulticlassClassificationEvaluator(
    labelCol='churn'
    , predictionCol='prediction'
    , metricName='weightedPrecision'
)
recal_eval = MulticlassClassificationEvaluator(
    labelCol='churn'
    , predictionCol='prediction'
    , metricName='weightedRecall'
)
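
The choice of `rawPredictionCol` matters for the AUC figure. The pure-Python sketch below is not Spark's implementation; it computes AUC as the probability that a random positive outranks a random negative, and compares the result on continuous scores with the result on thresholded 0/1 predictions. The labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def balanced_accuracy(labels, predictions):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return (tp / n_pos + tn / n_neg) / 2

# Toy data, not taken from the churn dataset.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
hard = [1 if s > 0.5 else 0 for s in scores]

auc_scores = roc_auc(labels, scores)   # ranking quality of the scores
auc_hard = roc_auc(labels, hard)       # collapses to one operating point
print(auc_scores, auc_hard, balanced_accuracy(labels, hard))
```

On hard predictions the AUC coincides with balanced accuracy, which is why it can sit well below the value obtained from scores. To evaluate ranking quality in the notebook, the evaluator would instead be built with `rawPredictionCol='rawPrediction'`, which is the column the classifiers above emit by default.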

Everything is put together at the end: each ml pipeline is created, fitted on the training data, and evaluated on the test data.

In [0]:
saved_models = dict()
for name, ml_obj in ml_objects.items():
    pipeline = Pipeline(stages=[
        geography_indexer
        , gender_indexer
        , geography_encoder
        , gender_encoder
        , assembler
        , ml_obj
    ])
    model = pipeline.fit(train)
    saved_models[name] = model
    result = model.transform(test)
    auc = auc_eval.evaluate(result)
    accuracy = acc_eval.evaluate(result)
    precision = precision_eval.evaluate(result)
    recall = recal_eval.evaluate(result)
    print(f'The {name} model has...')
    print(f'AUC: {auc}')
    print(f'Accuracy: {accuracy}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print('')

The logistic_regression model has...
AUC: 0.5480911089535416
Accuracy: 0.7816377171215881
Precision: 0.7368616102801628
Recall: 0.781637717121588

The decision_tree model has...
AUC: 0.680421172453045
Accuracy: 0.8207196029776674
Precision: 0.8053152123267633
Recall: 0.8207196029776676

The gbt model has...
AUC: 0.6858303480754852
Accuracy: 0.8213399503722084
Precision: 0.8067003737631921
Recall: 0.8213399503722084

The random_forest model has...
AUC: 0.5930898255716629
Accuracy: 0.8064516129032258
Precision: 0.7910396643620096
Recall: 0.8064516129032258
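
In the results above, accuracy and recall coincide for every model (up to floating-point noise). This follows from the `weightedRecall` metric: averaging per-class recall with class frequencies as weights reduces algebraically to overall accuracy. The sketch below shows this on a confusion matrix with hypothetical counts; `weighted_metrics` is an illustrative helper, not a PySpark function.

```python
def weighted_metrics(confusion):
    """Return (accuracy, weighted precision, weighted recall).

    confusion[actual][predicted] holds the count for each label pair.
    """
    classes = range(len(confusion))
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[c][c] for c in classes) / total
    weighted_precision = 0.0
    weighted_recall = 0.0
    for c in classes:
        support = sum(confusion[c])                        # rows with actual label c
        predicted = sum(confusion[a][c] for a in classes)  # rows predicted as c
        recall = confusion[c][c] / support if support else 0.0
        precision = confusion[c][c] / predicted if predicted else 0.0
        weight = support / total
        weighted_precision += weight * precision
        weighted_recall += weight * recall
    return accuracy, weighted_precision, weighted_recall

# Hypothetical counts: rows are actual labels 0/1, columns are predicted 0/1.
confusion = [[1200, 80], [220, 112]]
accuracy, w_precision, w_recall = weighted_metrics(confusion)
print(accuracy, w_precision, w_recall)
```

Because the weighted recall carries no information beyond accuracy, the recall of the churn class alone (label 1) may be the more informative number when churners are the minority.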




The "best" model can be selected and used on new customers. Here, the `gbt` classifier is selected because it has the highest accuracy and AUC of the four models. Optionally, the predictions can be saved to a database of your choosing.

In [0]:
model = saved_models['gbt']
predictions = model.transform(unlabelled_data)
export_columns = ['customer_id', 'surname', 'prediction'] +\
    raw_features +\
    ['features', 'probability', 'rawPrediction']
export = predictions.select(export_columns)
display(export.select('*'))

customer_id,surname,prediction,credit_score,geography,gender,tenure,balance,product_count,estimated_salary,features,probability,rawPrediction
15565819,Pinto,0.0,669,Germany,Female,6,137946.75,1,82467.57,"Map(vectorType -> dense, length -> 8, values -> List(669.0, 0.0, 1.0, 0.0, 6.0, 137946.75, 1.0, 82467.57))","Map(vectorType -> dense, length -> 2, values -> List(0.5487913657289122, 0.4512086342710878))","Map(vectorType -> dense, length -> 2, values -> List(0.0978942535233952, -0.0978942535233952))"
15566141,Onio,0.0,682,France,Female,2,130933.52,2,199644.6,"Map(vectorType -> dense, length -> 8, values -> List(682.0, 1.0, 0.0, 0.0, 2.0, 130933.52, 2.0, 199644.6))","Map(vectorType -> dense, length -> 2, values -> List(0.7782167055232293, 0.2217832944767707))","Map(vectorType -> dense, length -> 2, values -> List(0.6276521365376945, -0.6276521365376945))"
15566238,Mazzi,0.0,696,Spain,Female,2,171671.9,1,181419.29,"Map(vectorType -> dense, length -> 8, values -> List(696.0, 0.0, 0.0, 0.0, 2.0, 171671.9, 1.0, 181419.29))","Map(vectorType -> dense, length -> 2, values -> List(0.7570478950824082, 0.24295210491759178))","Map(vectorType -> dense, length -> 2, values -> List(0.5682810981101382, -0.5682810981101382))"
15566246,Onuoha,0.0,661,France,Female,1,0.0,1,4595.05,"Map(vectorType -> dense, length -> 8, values -> List(661.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 4595.05))","Map(vectorType -> dense, length -> 2, values -> List(0.7837321565150398, 0.2162678434849602))","Map(vectorType -> dense, length -> 2, values -> List(0.6437748346238933, -0.6437748346238933))"
15566726,Nkemdilim,0.0,602,Spain,Female,3,0.0,2,168814.32,"Map(vectorType -> sparse, length -> 8, indices -> List(0, 4, 6, 7), values -> List(602.0, 3.0, 2.0, 168814.32))","Map(vectorType -> dense, length -> 2, values -> List(0.920954514069936, 0.079045485930064))","Map(vectorType -> dense, length -> 2, values -> List(1.2276935947242553, -1.2276935947242553))"
15566878,Manna,0.0,649,France,Male,6,0.0,2,182495.85,"Map(vectorType -> dense, length -> 8, values -> List(649.0, 1.0, 0.0, 1.0, 6.0, 0.0, 2.0, 182495.85))","Map(vectorType -> dense, length -> 2, values -> List(0.9468886116470293, 0.05311138835297069))","Map(vectorType -> dense, length -> 2, values -> List(1.4403950443772902, -1.4403950443772902))"
15567111,Lucciano,0.0,554,France,Female,2,86977.96,2,109794.31,"Map(vectorType -> dense, length -> 8, values -> List(554.0, 1.0, 0.0, 0.0, 2.0, 86977.96, 2.0, 109794.31))","Map(vectorType -> dense, length -> 2, values -> List(0.9063347768009377, 0.09366522319906234))","Map(vectorType -> dense, length -> 2, values -> List(1.1348408893738795, -1.1348408893738795))"
15567138,Chukwumaobim,0.0,846,France,Male,3,0.0,2,152310.19,"Map(vectorType -> dense, length -> 8, values -> List(846.0, 1.0, 0.0, 1.0, 3.0, 0.0, 2.0, 152310.19))","Map(vectorType -> dense, length -> 2, values -> List(0.9407416141582423, 0.05925838584175769))","Map(vectorType -> dense, length -> 2, values -> List(1.3823806061216044, -1.3823806061216044))"
15567826,Zetticci,1.0,556,France,Female,9,0.0,4,87811.12,"Map(vectorType -> dense, length -> 8, values -> List(556.0, 1.0, 0.0, 0.0, 9.0, 0.0, 4.0, 87811.12))","Map(vectorType -> dense, length -> 2, values -> List(0.03418183116107065, 0.9658181688389293))","Map(vectorType -> dense, length -> 2, values -> List(-1.6706406675878054, 1.6706406675878054))"
15567829,L?,0.0,762,France,Female,9,136442.58,1,102760.96,"Map(vectorType -> dense, length -> 8, values -> List(762.0, 1.0, 0.0, 0.0, 9.0, 136442.58, 1.0, 102760.96))","Map(vectorType -> dense, length -> 2, values -> List(0.8013036541590359, 0.19869634584096407))","Map(vectorType -> dense, length -> 2, values -> List(0.6972311050028129, -0.6972311050028129))"
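
The `probability` and `rawPrediction` columns in the rows above appear to be linked by a logistic transform of twice the raw margin. This is inferred from the displayed values rather than stated in the source, so treat the formula as an assumption; `gbt_probability` is a hypothetical helper name. The sketch checks the relation against three of the exported rows.

```python
import math

def gbt_probability(raw_margin):
    """Assumed mapping from a GBT raw margin to a class probability."""
    return 1.0 / (1.0 + math.exp(-2.0 * raw_margin))

# (rawPrediction[0], probability[0]) pairs copied from the exported rows:
# customers 15565819, 15566141 and 15567826.
rows = [
    (0.0978942535233952, 0.5487913657289122),
    (0.6276521365376945, 0.7782167055232293),
    (-1.6706406675878054, 0.03418183116107065),
]

for raw, reported in rows:
    recovered = gbt_probability(raw)
    # The predicted label is 0 when the class-0 probability exceeds 0.5, else 1.
    label = 0.0 if recovered > 0.5 else 1.0
    print(f'{raw:+.4f} -> {recovered:.4f} (reported {reported:.4f}), prediction {label}')
```

If the relation holds, a customer's churn probability is simply the second entry of `probability`, which can be used to rank new customers by risk instead of relying on the 0/1 `prediction` alone.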
