## PROJECT: Predicting Customer Churn

Losing customers costs any business money. If you identify unhappy customers early, you can offer them incentives to stay. We use a well-known example of churn: customers leaving a mobile phone operator.

It seems everyone can find something to criticize about their current provider! If the provider knows that a customer is considering leaving, it can offer timely incentives, such as a phone upgrade or the activation of a new feature, to encourage the customer to stay. Incentives are often much cheaper than losing and regaining a customer.

This notebook explains how to use machine learning (ML) to automate the identification of dissatisfied customers, a task known as customer churn prediction. Because ML models rarely give perfect predictions, it also discusses how to account for the relative costs of prediction errors when determining the financial outcome of using ML.
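
To make the idea of error costs concrete, the sketch below computes the total cost of a set of predictions from confusion-matrix counts. All costs and counts are hypothetical placeholders, not figures from this dataset. The only assumption that matters is that a missed churner (false negative) costs more than an unnecessary incentive (false positive).

```python
# Hypothetical per-customer costs (assumptions for illustration only).
COST_FALSE_NEGATIVE = 500.0  # churner we missed: customer is lost and must be regained
COST_FALSE_POSITIVE = 100.0  # loyal customer who received an unnecessary incentive
COST_TRUE_POSITIVE = 100.0   # churner correctly flagged: we pay for the incentive
COST_TRUE_NEGATIVE = 0.0     # loyal customer correctly left alone


def total_cost(tp, fp, fn, tn):
    """Return the total cost implied by confusion-matrix counts."""
    return (tp * COST_TRUE_POSITIVE + fp * COST_FALSE_POSITIVE
            + fn * COST_FALSE_NEGATIVE + tn * COST_TRUE_NEGATIVE)


# Made-up counts for two imaginary models scored on the same 1,000 customers.
model_a = total_cost(tp=400, fp=50, fn=100, tn=450)  # accuracy 0.85
model_b = total_cost(tp=470, fp=150, fn=30, tn=350)  # accuracy 0.82
print(model_a, model_b)
```

Under these assumed costs, the less accurate model is the cheaper one because it misses fewer churners, which is why accuracy alone is not enough to judge the financial outcome.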



In [24]:
# Import the findspark library so that the pyspark libraries can be imported easily.
import findspark
findspark.init()

import pandas as pd
# Spark Session module is imported with the pyspark.sql library
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace 

In [25]:
# the SparkSession builder and getOrCreate() methods are used respectively for building and creating the app 
spark = SparkSession.builder.appName('Churn_pred').getOrCreate()

# we will be reading our dataset which includes all the features to predict customers who are likely to churn 
raw = spark.read.csv('churn.txt',inferSchema=True, header=True)

### Let’s explore and clean the dataset

In [26]:
# Remove the dots and spaces from the values in the Churn? column,
# store the result in a new Churn column, then drop the old Churn? column
data = raw.withColumn("Churn", regexp_replace("Churn?", "[\\. ]", "")).drop('Churn?')

## Data Exploration

In [27]:
# Explore the loaded data by using the following Apache Spark DataFrame methods
data.printSchema()
print("Number of fields: %3g" % len(data.schema))

root
 |-- State: string (nullable = true)
 |-- Account Length: integer (nullable = true)
 |-- Area Code: integer (nullable = true)
 |-- Phone: string (nullable = true)
 |-- Int'l Plan: string (nullable = true)
 |-- VMail Plan: string (nullable = true)
 |-- VMail Message: integer (nullable = true)
 |-- Day Mins: double (nullable = true)
 |-- Day Calls: integer (nullable = true)
 |-- Day Charge: double (nullable = true)
 |-- Eve Mins: double (nullable = true)
 |-- Eve Calls: integer (nullable = true)
 |-- Eve Charge: double (nullable = true)
 |-- Night Mins: double (nullable = true)
 |-- Night Calls: integer (nullable = true)
 |-- Night Charge: double (nullable = true)
 |-- Intl Mins: double (nullable = true)
 |-- Intl Calls: integer (nullable = true)
 |-- Intl Charge: double (nullable = true)
 |-- CustServ Calls: integer (nullable = true)
 |-- Churn: string (nullable = true)

Number of fields:  21


As you can see, the data contains 21 fields. The "Churn" field is the one we want to predict (the label).

In [28]:
# describing the dataset
data.show(10)
data.describe().show()
print("Total number of records: " + str(data.count()))

+-----+--------------+---------+--------+----------+----------+-------------+------------------+---------+------------------+------------------+---------+------------------+------------------+-----------+------------------+------------------+----------+-------------------+--------------+-----+
|State|Account Length|Area Code|   Phone|Int'l Plan|VMail Plan|VMail Message|          Day Mins|Day Calls|        Day Charge|          Eve Mins|Eve Calls|        Eve Charge|        Night Mins|Night Calls|      Night Charge|         Intl Mins|Intl Calls|        Intl Charge|CustServ Calls|Churn|
+-----+--------------+---------+--------+----------+----------+-------------+------------------+---------+------------------+------------------+---------+------------------+------------------+-----------+------------------+------------------+----------+-------------------+--------------+-----+
|   PA|           163|      806|403-2562|        no|       yes|          300|   8.1622040217391|        3| 7.579173

In [29]:
# checking if all records have complete data.
complete = data.dropna()
print("Number of records with complete data: %3g" % complete.count())

# Inspect the class distribution in the label column.
data.groupBy('Churn').count().show()

Number of records with complete data: 5000
+-----+-----+
|Churn|count|
+-----+-----+
|False| 2502|
| True| 2498|
+-----+-----+



In [30]:
# Import necessary modules
from pyspark.sql.functions import count, when, round, desc

# Filter out irrelevant columns and rows
df = data.select("State", "Churn")
df = df.filter("Churn == 'True' or Churn == 'False'")

# Aggregate the data by state and calculate churn rate
churn_state = df.groupBy("State").agg(
    count("Churn").alias("total_churn"),
    count(when(df.Churn == "True", True)).alias("churned"),
)
churn_state = churn_state.withColumn( "churn_rate", round(churn_state.churned / churn_state.total_churn, 3))

# Sort the results in descending order of churn rate
sorted_churn = churn_state.orderBy(desc("churn_rate"))

# Show the top 10 states with the highest churn rate
sorted_churn.show(10)

+-----+-----------+-------+----------+
|State|total_churn|churned|churn_rate|
+-----+-----------+-------+----------+
|   NY|         98|     60|     0.612|
|   OH|        111|     63|     0.568|
|   AZ|         90|     51|     0.567|
|   MD|        113|     64|     0.566|
|   CT|         89|     50|     0.562|
|   NE|        109|     61|      0.56|
|   IA|        103|     57|     0.553|
|   NC|         95|     52|     0.547|
|   ME|         74|     40|     0.541|
|   AK|         85|     46|     0.541|
+-----+-----------+-------+----------+
only showing top 10 rows



In [31]:
# Import the Plotly package, used later to explore the prediction results.
import os
import sys
import plotly.graph_objs as go
import plotly.offline as pyo
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf





### Feature Selection
Feature selection is one of the most important steps in data preprocessing: we choose the features that, based on our knowledge of the data, are the best fit for model development. Here, all valid numerical columns are used.
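
As a rough sketch of how the numerical columns could be picked out programmatically, the helper below filters a list of `(column_name, type_string)` pairs, which is the shape of a Spark DataFrame's `dtypes` attribute. The function name `numeric_columns` and the set of type names are assumptions made for this illustration. The example pairs are a subset of the schema printed earlier.

```python
# Spark type names treated as numeric in this sketch (an assumption).
NUMERIC_TYPES = {"int", "bigint", "smallint", "tinyint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of columns whose Spark type string is numeric.

    `dtypes` is a list of (column_name, type_string) pairs.
    """
    return [name for name, dtype in dtypes
            if dtype in NUMERIC_TYPES or dtype.startswith("decimal")]


# A few columns copied from the schema printed earlier in this notebook.
example_dtypes = [
    ("State", "string"),
    ("Account Length", "int"),
    ("Phone", "string"),
    ("Day Mins", "double"),
    ("CustServ Calls", "int"),
    ("Churn", "string"),
]
print(numeric_columns(example_dtypes))
```

In the notebook this could be called as `numeric_columns(data.dtypes)`. A numeric type does not guarantee a meaningful numeric feature: `Area Code` is stored as an integer but behaves like a category.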

In [32]:
(train_data, test_data, predict_data) = data.randomSplit([0.8, 0.18, 0.02], 24)

print("Number of records for training: " + str(train_data.count()))
print("Number of records for evaluation: " + str(test_data.count()))
print("Number of records for prediction: " + str(predict_data.count()))

Number of records for training: 4005
Number of records for evaluation: 901
Number of records for prediction: 94
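
The split sizes are close to, but not exactly, the requested proportions, because `randomSplit` assigns rows to splits at random according to the weights. A small sketch comparing the expected and observed counts, using the figures printed above:

```python
# Weights passed to randomSplit and the total record count reported earlier.
weights = [0.8, 0.18, 0.02]
total_records = 5000

# Counts we would get if the split were exact.
expected = [round(w * total_records) for w in weights]

# Counts printed by the notebook cell above.
observed = [4005, 901, 94]

# Deviation of each split from its expected size.
differences = [o - e for o, e in zip(observed, expected)]
print(expected, observed, differences)
```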


In [33]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler, StandardScaler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

from pyspark.ml import Pipeline, Model

In [34]:
# Converting labels from strings into integers
indexer = StringIndexer(inputCol="Churn", outputCol= 'label').setHandleInvalid("keep")
indexer_model = indexer.fit( data)

In [35]:
# assemble feature columns into one column
assemble = VectorAssembler(inputCols=['Account Length', 'Area Code', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls'], outputCol='features')

# Normalize/scale the feature values
# Scaling has no effect on decision trees, which split on thresholds
scale = StandardScaler(inputCol='features', outputCol='scaled_features')

# Convert the predicted numerical labels back to the original churn labels
converter = IndexToString(inputCol="prediction", outputCol="predicted_churn", labels=indexer_model.labels)
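
The remark that scaling does not help decision trees can be checked with a small sketch. Standardizing a feature is a monotonic transformation, so the ordering of the samples, and therefore the set of threshold splits a tree can choose from, stays the same. The values below are made up for illustration.

```python
import statistics

# Made-up feature values, for illustration only.
values = [120.5, 80.0, 310.2, 45.3, 200.0]

# Standardize: subtract the mean and divide by the standard deviation.
mean = statistics.mean(values)
std = statistics.stdev(values)
scaled = [(v - mean) / std for v in values]

# Rank the samples before and after scaling.
order_raw = sorted(range(len(values)), key=lambda i: values[i])
order_scaled = sorted(range(len(scaled)), key=lambda i: scaled[i])

# The ranking is identical, so a tree can separate the samples the same way.
print(order_raw == order_scaled)
```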

#### Train the random forest model using a pipeline and the training data

In [36]:
# Define the random forest estimator used for classification.
rf = RandomForestClassifier(numTrees=1,maxDepth=30, featureSubsetStrategy="all", featuresCol="features", labelCol="label")

# Now build the pipeline. A pipeline consists of transformers and an estimator.
pipeline = Pipeline(stages=[indexer, assemble, scale, rf,converter])

# Train the random forest model by fitting the pipeline on the training data.
model = pipeline.fit(train_data)

In [37]:
# Evaluate the accuracy of the model
predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

# Print the accuracy of the model on the test data
print("Test dataset:")
print("Accuracy = %3.2f" % accuracy)
print("Error = %g" % (1.0 - accuracy))

Test dataset:
Accuracy = 0.87
Error = 0.126526
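
Accuracy alone does not show which kind of error the model makes. As a sketch, confusion-matrix counts, precision, and recall can be derived from (label, prediction) pairs. The pairs below are invented toy values, not results from this model, and the helper name `confusion_counts` is an assumption for this example. In the notebook, such pairs could be obtained with `predictions.select('label', 'prediction').collect()`.

```python
from collections import Counter


def confusion_counts(pairs, positive=1.0):
    """Count TP, FP, FN and TN from (label, prediction) pairs."""
    counts = Counter()
    for label, prediction in pairs:
        if prediction == positive:
            counts["tp" if label == positive else "fp"] += 1
        else:
            counts["fn" if label == positive else "tn"] += 1
    return counts


# Toy (label, prediction) pairs, invented for illustration.
toy_pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
c = confusion_counts(toy_pairs)

toy_accuracy = (c["tp"] + c["tn"]) / len(toy_pairs)
toy_precision = c["tp"] / (c["tp"] + c["fp"])  # of predicted churners, how many churned
toy_recall = c["tp"] / (c["tp"] + c["fn"])     # of actual churners, how many were caught
print(toy_accuracy, toy_precision, toy_recall)
```

For churn, recall is often the more important figure, since a missed churner is usually the more expensive error.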


#### Also train the Logistic Regression model using the previously defined pipeline and train data

In [38]:
# Define the logistic regression estimator used for classification.
lr = LogisticRegression(maxIter=10, featuresCol="features", labelCol="label")

# Now build the pipeline. A pipeline consists of transformers and an estimator.
lr_pipeline = Pipeline(stages=[indexer, assemble, scale, lr,converter])

# Train the logistic regression model by fitting the pipeline on the training data.
lr_model = lr_pipeline.fit(train_data)

# Evaluate the accuracy of the logistic regression model
lr_predictions = lr_model.transform(test_data)
lr_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
lr_accuracy = lr_evaluator.evaluate(lr_predictions)

# Print the accuracy of the logistic regression model on the test data
print("Test dataset:")
print("Accuracy = %3.2f" % lr_accuracy)
print("Error = %g" % (1.0 - lr_accuracy))


In [39]:
# Preview the random forest predictions by calling the show() method
predictions = model.transform(test_data)
predictions.show(5)

+-----+--------------+---------+--------+----------+----------+-------------+------------------+---------+-----------------+------------------+---------+------------------+------------------+-----------+------------------+------------------+----------+------------------+--------------+-----+-----+--------------------+--------------------+-------------+-------------+----------+---------------+
|State|Account Length|Area Code|   Phone|Int'l Plan|VMail Plan|VMail Message|          Day Mins|Day Calls|       Day Charge|          Eve Mins|Eve Calls|        Eve Charge|        Night Mins|Night Calls|      Night Charge|         Intl Mins|Intl Calls|       Intl Charge|CustServ Calls|Churn|label|            features|     scaled_features|rawPrediction|  probability|prediction|predicted_churn|
+-----+--------------+---------+--------+----------+----------+-------------+------------------+---------+-----------------+------------------+---------+------------------+------------------+-----------+-----

In [40]:
# Tabulate a count of the predicted classes to see how the predictions are split
predictions.select("prediction").groupBy("prediction").count().show(truncate=False)

+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |453  |
|1.0       |448  |
+----------+-----+
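
The counts in the table above can be turned into percentages with a few lines of arithmetic. This sketch copies the two counts from the output rather than querying Spark.

```python
# Prediction counts copied from the table above.
prediction_counts = {0.0: 453, 1.0: 448}

# Share of each predicted class as a percentage of all test records.
total = sum(prediction_counts.values())
shares = {label: 100.0 * n / total for label, n in prediction_counts.items()}

for label, share in shares.items():
    print(f"prediction {label}: {share:.1f}%")
```

The total of 901 matches the number of evaluation records reported earlier.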



### Data visualization using Plotly

In [41]:
# Use the Plotly package to explore the prediction results.
import os
import sys
import plotly.graph_objs as go
import plotly.offline as pyo
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf

In [43]:
# Plot a pie chart that shows the split of actual churn labels in the test data.
# Order by label so the counts line up with ['No', 'Yes'] (0.0 = False, 1.0 = True).
cumulative_stats = predictions.groupby(['label']).count().orderBy('label')
labels_data_plot = ['No','Yes']
values_data_plot = [cumulative_stats.select('count').collect()[x][0] for x in range(2)]

product_data = [
    {
        'type': 'pie',
        'labels': labels_data_plot,
        'values': values_data_plot,
    }
]
product_layout = { 'title': 'Churn'}

fig = go.Figure(data=product_data, layout=product_layout)
fig.show()

From the pie chart above, 50.7% of customers did not churn and 49.3% churned. Almost half of the customers in this sample left, so a substantial share of the company's revenue is at risk if churn is not addressed.

In [57]:
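# Calculate the mean day charge per churn class with Spark's groupBy,
# selecting the column by name and ordering by label to match ['No', 'Yes']
from pyspark.sql.functions import avg
mean_rows = predictions.groupBy('label').agg(avg('Day Charge').alias('mean_day_charge')).orderBy('label').collect()
mean_charges = [row['mean_day_charge'] for row in mean_rows]
age_data = [go.Bar(y= mean_charges,x= labels_data_plot)]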

# customize chart layout using Plotly's Layout object
age_layout = go.Layout(
    title='Average Day Charges per churn class',
    xaxis=dict(title = "Churn",showline=False,),
    yaxis=dict(title = "Day Charges",),)

# create a figure object and plot it with iplot() function from Plotly's offline module
fig = go.Figure(data=age_data, layout=age_layout)
iplot(fig)

On average, churned customers have higher day charges than customers who stayed.