## Binary Customer Churn

A marketing agency has many clients who use its service to produce ads for their websites. The agency has noticed considerable client churn. Account managers are currently assigned at random.

The task is to build a machine learning model that predicts which customers will churn (stop buying the service), so that the agency can assign account managers to the customers most at risk.

Here are the data fields and their definitions:

    Names: Name of the latest contact at the company
    Age: Customer Age
    Total_Purchase: Total Ads Purchased
    Account_Manager: Binary 0=No manager, 1=Account manager assigned
    Years: Total Years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date that the latest contact was onboarded
    Location: Client HQ Address
    Company: Name of Client Company
    
Once the model is trained, apply it to the new data in new_customers.csv. The client wants to know which of these unlabelled customers are most likely to churn.

In [2]:
import findspark
findspark.init('/home/matt/spark-3.1.1-bin-hadoop2.7')

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cust_churn').getOrCreate()

In [4]:
data = spark.read.csv('customer_churn.csv',inferSchema=True,header=True)

In [5]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [5]:
data.describe().show()

+-------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+
|summary|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|              Churn|
+-------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+
|  count|              900|              900|               900|              900|               900|                900|
|   mean|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|0.16666666666666666|
| stddev|6.127560416916251|2408.644531858096|0.4999208935073339|1.274449013194616|1.7648355920350969| 0.3728852122772358|
|    min|             22.0|            100.0|                 0|              1.0|               3.0|                  0|
|    max|             65.0|         18026.01|                 1|             9.15|              14.0|                  1|
+-------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+
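The `Churn` mean in the summary implies an imbalanced label, which is worth keeping in mind when reading the evaluation metrics later. A minimal arithmetic sketch in plain Python, using only the `count` and `mean` values printed above; it assumes `Churn` = 1 marks a churned customer, and the variable names are illustrative:

```python
# Values copied from the describe() output above.
n_customers = 900
churn_mean = 0.16666666666666666  # mean of the 0/1 Churn column

# The mean of a 0/1 column is the fraction of ones (assumed here: 1 = churned).
n_churned = round(n_customers * churn_mean)
n_retained = n_customers - n_churned

# Accuracy of a trivial model that always predicts "no churn".
baseline_accuracy = n_retained / n_customers

print(n_churned, n_retained, round(baseline_accuracy, 3))  # 150 750 0.833
```

Any classifier trained on this data should be judged against that roughly 83% baseline rather than against 50%.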

In [38]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

## Feature Vectors and Labels

In [8]:
from pyspark.ml.feature import VectorAssembler

In [12]:
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Years',
 'Num_Sites'],outputCol='features')

In [13]:
output = assembler.transform(data)

In [14]:
final_data = output.select('features','churn')

In [15]:
# split into training (70%) and test (30%) sets
train_churn,test_churn = final_data.randomSplit([0.7,0.3])
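As far as I understand it, `randomSplit` assigns each row to a split by an independent random draw, so the 70/30 proportions are approximate, and the result changes between runs unless a seed is passed. The plain-Python sketch below imitates that behaviour; `random_split` is a hypothetical helper written for illustration, not Spark's implementation, and the seed value 42 is arbitrary:

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent uniform draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalised boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier one.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(900))  # stand-ins for the 900 customer rows
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 630 and 270, but not exactly
```

In the notebook itself, a fixed `seed` argument to `randomSplit` should make the split, and therefore the metrics below, repeatable across runs.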

## Train model

In [16]:
from pyspark.ml.classification import LogisticRegression

In [17]:
lr_churn = LogisticRegression(labelCol='churn')

In [18]:
fitted_churn_model = lr_churn.fit(train_churn)

In [19]:
training_sum = fitted_churn_model.summary

In [26]:
training_sum.predictions.show(10)

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[25.0,9672.03,5.4...|  0.0|[4.35969249188806...|[0.98737900782025...|       0.0|
|[28.0,8670.98,3.9...|  0.0|[7.59276963631759...|[0.99949617097940...|       0.0|
|[28.0,9090.43,5.7...|  0.0|[1.6347259378615,...|[0.83681601676661...|       0.0|
|[28.0,11128.95,5....|  0.0|[4.24200064358779...|[0.98582502290541...|       0.0|
|[28.0,11204.23,3....|  0.0|[1.42009496911474...|[0.80535330413775...|       0.0|
|[28.0,11245.38,6....|  0.0|[3.29048688035686...|[0.96410100897801...|       0.0|
|[29.0,5900.78,5.5...|  0.0|[4.43583266888357...|[0.98829346753106...|       0.0|
|[29.0,8688.17,5.7...|  1.0|[2.85630606421819...|[0.94564373735889...|       0.0|
|[29.0,9617.59,5.4...|  0.0|[4.11006052021966...|[0.98385805579815...|       0.0|
|[29.0,10203.18,
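
The `probability` column is the logistic (sigmoid) function applied to `rawPrediction`, and the first component of each vector refers to class 0 (no churn). A quick check in plain Python, using the first components of two rows printed above; the helper name `sigmoid` is illustrative, and because the table truncates its values the comparison is only approximate:

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# (first rawPrediction component, first probability component) as printed above
rows = [
    (4.35969249188806, 0.98737900782025),
    (1.6347259378615, 0.83681601676661),
]

for raw, shown in rows:
    p = sigmoid(raw)
    print(f"raw={raw:.4f} -> p={p:.6f} (table shows {shown:.6f})")
```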

## Evaluate model

In [21]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [22]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [24]:
pred_and_labels.predictions.show(10)

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,4....|    0|[4.70612998523834...|[0.99104129026811...|       0.0|
|[26.0,8787.39,5.4...|    1|[0.75802431303083...|[0.68092463715970...|       0.0|
|[26.0,8939.61,4.5...|    0|[6.14850046495793...|[0.99786787228131...|       0.0|
|[27.0,8628.8,5.3,...|    0|[5.66892024191962...|[0.99656028234844...|       0.0|
|[29.0,9378.24,4.9...|    0|[4.46269472099668...|[0.98860020596405...|       0.0|
|[29.0,12711.15,5....|    0|[4.87992996466007...|[0.99245974135701...|       0.0|
|[29.0,13255.05,4....|    0|[4.10350308112556...|[0.98375358350138...|       0.0|
|[31.0,5304.6,5.29...|    0|[3.30368625953915...|[0.96455505483039...|       0.0|
|[31.0,5387.75,6.8...|    0|[2.39070844231502...|[0.91611602614143...|       0.0|
|[31.0,8829.83,4
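
The second row above is a churned customer that the model labels `0.0`: its class-0 probability is about 0.68, so the churn probability of roughly 0.32 falls below the default decision threshold of 0.5. A minimal sketch of that decision rule in plain Python; the function name `predict_label` and the alternative threshold 0.3 are illustrative assumptions, not values from the notebook:

```python
def predict_label(p_churn, threshold=0.5):
    """Return 1.0 when the churn probability exceeds the threshold, else 0.0."""
    return 1.0 if p_churn > threshold else 0.0

# Class-0 (no churn) probability of the second test row, as printed (truncated).
p_no_churn = 0.68092463715970
p_churn = 1.0 - p_no_churn

print(predict_label(p_churn))                 # 0.0 with the default threshold
print(predict_label(p_churn, threshold=0.3))  # 1.0 with a lower threshold
```

In Spark, `LogisticRegression` exposes this cut-off as its `threshold` parameter. Lowering it flags more customers as at risk, at the cost of more false alarms.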

In [27]:
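# Compute AUC (area under the ROC curve).
# Note: rawPredictionCol='prediction' scores the hard 0/1 predictions;
# pass 'rawPrediction' instead to compute AUC from the model's continuous scores.
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='churn')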

In [28]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

In [29]:
auc

0.8308742346682726
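
The evaluator above reads the hard 0/1 `prediction` column, whereas AUC is usually computed from continuous scores. The sketch below, in plain Python with a small made-up dataset, shows how the two inputs can give different values. All labels and scores are hypothetical, and `pairwise_auc` is an illustrative helper, not a Spark API:

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and churn scores, for illustration only.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.3, 0.6, 0.4, 0.9]
hard = [1 if s > 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

auc_from_scores = pairwise_auc(labels, scores)
auc_from_hard = pairwise_auc(labels, hard)
print(round(auc_from_scores, 4), round(auc_from_hard, 4))  # 0.8333 0.5833
```

Thresholding discards the ranking information within each predicted class, which is why the two numbers differ. With Spark, passing `rawPredictionCol='rawPrediction'` to `BinaryClassificationEvaluator` evaluates the continuous scores instead.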

## Predict new unlabeled data

Apply the model to the customers in new_customers.csv.

In [33]:
# retrain on all of the labelled data
final_lr_model = lr_churn.fit(final_data)

In [34]:
# import new data
new_customers = spark.read.csv('new_customers.csv',inferSchema=True,
                              header=True)

In [36]:
# apply the same feature assembler used for training
test_new_customers = assembler.transform(new_customers)

In [37]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [38]:
# Apply to new data
final_results = final_lr_model.transform(test_new_customers)

In [39]:
# show predicted companies that need an account manager
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+

