# Predicting Customer Churn

A marketing agency has many customers who use its service to produce ads for their websites. The agency has noticed considerable churn among these clients. Account managers are currently assigned at random, but the agency wants a machine learning model that predicts which customers will churn (stop buying the service) so that account managers can be assigned to the customers most at risk.

Given historical data, we can train a logistic regression classifier that predicts whether a customer will churn. The company can then apply the model to incoming data for future customers, predict which of them are likely to churn, and assign those customers an account manager.
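
As a rough illustration of what a logistic regression classifier computes, the sketch below turns a weighted sum of features into a churn probability and applies a 0.5 cutoff. The weights and intercept are made-up values chosen for the example, not coefficients from the model fitted later in this notebook; only the customer's feature values come from the first record of the data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def churn_probability(features, weights, intercept):
    """Logistic regression score: sigmoid of intercept + weighted feature sum."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for (Age, Total_Purchase, Years, Num_Sites).
weights = [0.05, 0.0001, 0.5, 1.2]
intercept = -15.0

# Feature values of the first record shown in the data preview.
customer = [42.0, 11066.8, 7.22, 8.0]

p = churn_probability(customer, weights, intercept)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default cutoff
print(p, label)
```

Moving the cutoff below 0.5 would flag more customers as at risk, at the cost of more false alarms.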

### Importing Libraries

In [1]:
import os
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer, VectorIndexer
from pyspark.sql.functions import month, dayofyear, year

In [2]:
os.chdir('..')

### Creating SparkSession and importing data

In [3]:
spark = SparkSession.builder.appName('customerChurn').getOrCreate()

In [4]:
DATA_FILE = os.path.join(os.getcwd(), 'data', 'customer_churn.csv')
df = spark.read.csv(DATA_FILE, header=True, inferSchema=True)
df.show(n = 5, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------
 Names           | Cameron Williams                                   
 Age             | 42.0                                               
 Total_Purchase  | 11066.8                                            
 Account_Manager | 0                                                  
 Years           | 7.22                                               
 Num_Sites       | 8.0                                                
 Onboard_date    | 2013-08-30 07:00:40                                
 Location        | 10265 Elizabeth Mission Barkerburgh, AK 89518      
 Company         | Harvey LLC                                         
 Churn           | 1                                                  
-RECORD 1-------------------------------------------------------------
 Names           | Kevin Mueller                                      
 Age             | 41.0                                               
 Total

### EDA and summaries

In [5]:
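# Drop the columns we don't use as predictors: 'Account_Manager', 'Names', 'Location' and 'Onboard_date'.
# Note: going through set() does not preserve the original column order.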
df = df.select(list(set(df.columns) - set(['Names', 'Account_Manager', 'Location', 'Onboard_date'])))
df.columns

['Age', 'Years', 'Company', 'Total_Purchase', 'Churn', 'Num_Sites']

In [6]:
df.printSchema()

root
 |-- Age: double (nullable = true)
 |-- Years: double (nullable = true)
 |-- Company: string (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Churn: integer (nullable = true)
 |-- Num_Sites: double (nullable = true)



In [7]:
df.summary().show(vertical=True, truncate=False)

-RECORD 0-----------------------------------
 summary        | count                     
 Age            | 900                       
 Years          | 900                       
 Company        | 900                       
 Total_Purchase | 900                       
 Churn          | 900                       
 Num_Sites      | 900                       
-RECORD 1-----------------------------------
 summary        | mean                      
 Age            | 41.81666666666667         
 Years          | 5.27315555555555          
 Company        | null                      
 Total_Purchase | 10062.82403333334         
 Churn          | 0.16666666666666666       
 Num_Sites      | 8.587777777777777         
-RECORD 2-----------------------------------
 summary        | stddev                    
 Age            | 6.127560416916251         
 Years          | 1.274449013194616         
 Company        | null                      
 Total_Purchase | 2408.644531858096         
 Churn    

### Encoding variables

In [8]:
stringCols = [item[0] for item in df.dtypes if 'string' in item[1]]
numCols = [item[0] for item in df.dtypes if item[0] not in stringCols]
indep = list(set(numCols) - set(['Churn']))
print(stringCols)
print(numCols)
print(indep)

['Company']
['Age', 'Years', 'Total_Purchase', 'Churn', 'Num_Sites']
['Total_Purchase', 'Age', 'Num_Sites', 'Years']


In [9]:
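# Map each string column to a numeric category index.
indexers = [StringIndexer(inputCol=col, outputCol="{0}_indexed".format(col)) for col in stringCols]

# One-hot encode each indexed column into a sparse vector.
encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(),
                          outputCol="{0}_encoded".format(indexer.getOutputCol())) for indexer in indexers]

# Concatenate the encoded columns and the numeric predictors into a single "features" vector.
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders] + indep, outputCol="features")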
# print(indexers)
# print(encoders)
# print(assembler)
# indexers + encoders + [assembler]

In [10]:
pipeline = Pipeline(stages = indexers + encoders + [assembler])
model = pipeline.fit(df)
transformed = model.transform(df)
transformed.columns

['Age',
 'Years',
 'Company',
 'Total_Purchase',
 'Churn',
 'Num_Sites',
 'Company_indexed',
 'Company_indexed_encoded',
 'features']
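
To make the two new columns less opaque, here is a plain-Python sketch of what the indexing and one-hot steps do conceptually. It assumes the defaults as I understand them (index by descending frequency with ties broken alphabetically, and drop the last category so that it is encoded as all zeros); the helper names and the company list, apart from `Harvey LLC`, are invented for the example.

```python
from collections import Counter

def fit_string_index(values):
    """Assign index 0 to the most frequent value, 1 to the next, and so on."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))  # ties: alphabetical
    return {value: i for i, value in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """One-hot vector of length num_categories - 1; the last category is all zeros."""
    size = num_categories - 1
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Hypothetical toy column.
companies = ["Acme Ltd", "Harvey LLC", "Acme Ltd", "Zed Inc"]

mapping = fit_string_index(companies)
encoded = [one_hot_drop_last(mapping[c], len(mapping)) for c in companies]

print(mapping)
print(encoded)
```

Under the same assumption, an encoded vector with 872 slots would correspond to 873 distinct company names, which is close to one category per row in a 900-row dataset.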

In [11]:
transformed.show(n=5, vertical = True, truncate=False)

-RECORD 0----------------------------------------------------------------------------
 Age                     | 42.0                                                      
 Years                   | 7.22                                                      
 Company                 | Harvey LLC                                                
 Total_Purchase          | 11066.8                                                   
 Churn                   | 1                                                         
 Num_Sites               | 8.0                                                       
 Company_indexed         | 824.0                                                     
 Company_indexed_encoded | (872,[824],[1.0])                                         
 features                | (876,[824,872,873,874,875],[1.0,11066.8,42.0,8.0,7.22])   
-RECORD 1----------------------------------------------------------------------------
 Age                     | 41.0                       

In [12]:
final_df = transformed.select(['features', 'Churn'])
final_df.show(n = 5, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------
 features | (876,[824,872,873,874,875],[1.0,11066.8,42.0,8.0,7.22])   
 Churn    | 1                                                         
-RECORD 1-------------------------------------------------------------
 features | (876,[1,872,873,874,875],[1.0,11916.22,41.0,11.0,6.5])    
 Churn    | 1                                                         
-RECORD 2-------------------------------------------------------------
 features | (876,[272,872,873,874,875],[1.0,12884.75,38.0,12.0,6.67]) 
 Churn    | 1                                                         
-RECORD 3-------------------------------------------------------------
 features | (876,[21,872,873,874,875],[1.0,8010.76,42.0,10.0,6.71])   
 Churn    | 1                                                         
-RECORD 4-------------------------------------------------------------
 features | (876,[524,872,873,874,875],[1.0,9191.58,37.0,9.0,5.56])   
 Churn
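
The `features` values above are printed in sparse notation: `(size, [indices], [values])`, with every position not listed equal to zero. The sketch below expands the first record into a dense list in plain Python. The helper name is made up, and the mapping of the last four slots to column names assumes they follow the order of `indep` printed earlier.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# First record of the features column, copied from the output above.
size = 876
indices = [824, 872, 873, 874, 875]
values = [1.0, 11066.8, 42.0, 8.0, 7.22]

dense = to_dense(size, indices, values)

# Slots 0..871 hold the one-hot company encoding; slots 872..875 hold the
# numeric predictors (assumed order: same as `indep`).
numeric_names = ['Total_Purchase', 'Age', 'Num_Sites', 'Years']
numeric_part = dict(zip(numeric_names, dense[872:]))

print(len(dense), dense[824], numeric_part)
```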

### Train-Test split

In [13]:
train, test = final_df.randomSplit([0.7, 0.3])
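
The split above is random and unseeded, so each run produces different subsets, and the split sizes are only approximately 70/30 (the training summary below reports 623 rows rather than 630). The following plain-Python sketch illustrates the general idea of a per-row random split and how a fixed seed makes it repeatable. It is an illustration of the concept under my own assumptions, not Spark's actual implementation; in PySpark itself, `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random

def bernoulli_split(rows, weights, seed):
    """Assign each row to a split independently, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound for each split
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for k, bound in enumerate(bounds):
            # The last split catches any rounding leftovers.
            if u < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(900))  # stand-in for the 900 customer records
train, test = bernoulli_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Because every row is drawn independently, the training share hovers around 70% without hitting it exactly, and reusing the seed reproduces the same partition.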

### Building the LogisticRegression model

In [14]:
churn_model = LogisticRegression(labelCol='Churn')
churn_model_fit = churn_model.fit(train)

In [15]:
training_summary = churn_model_fit.summary
training_summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              Churn|         prediction|
+-------+-------------------+-------------------+
|  count|                623|                623|
|   mean|0.15248796147672553|0.15248796147672553|
| stddev|0.35978209656566007|0.35978209656566007|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



### Model Evaluation on Test data

In [16]:
results = churn_model_fit.transform(test)
results.printSchema()
results.select(['Churn', 'prediction']).show()

root
 |-- features: vector (nullable = true)
 |-- Churn: integer (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

+-----+----------+
|Churn|prediction|
+-----+----------+
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       1.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       1.0|
|    0|       1.0|
|    0|       0.0|
|    0|       0.0|
|    1|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       1.0|
+-----+----------+
only showing top 20 rows
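
Reading label/prediction pairs row by row is hard to summarise, so a confusion matrix helps. The sketch below tallies only the 20 rows displayed above (copied by hand), so the resulting figures describe this small sample and not the full test set.

```python
# (Churn, prediction) pairs copied from the 20 rows shown above.
pairs = [
    (0, 0), (0, 0), (1, 1), (0, 1), (0, 0),
    (1, 1), (0, 0), (0, 1), (0, 1), (0, 0),
    (0, 0), (1, 0), (1, 1), (0, 0), (0, 0),
    (1, 1), (0, 0), (0, 0), (0, 0), (0, 1),
]

tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # churners caught
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false alarms
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # churners missed
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # correctly cleared

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```

For the stated goal of assigning account managers to at-risk customers, recall (the share of actual churners that get flagged) is arguably the more relevant number, while precision indicates how many manager assignments would go to customers who would have stayed anyway.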



In [17]:
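# Passing the hard 0/1 'prediction' column as rawPredictionCol scores the model at a single
# decision threshold, so this is the area under a one-point ROC curve.
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Churn')
AUC = churn_eval.evaluate(results)
AUC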

0.6214987714987715

In [18]:
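# With the default rawPredictionCol ('rawPrediction'), the same areaUnderROC metric is
# computed from the model's continuous scores.
churn_eval2 = BinaryClassificationEvaluator(labelCol='Churn')
AUC_raw = churn_eval2.evaluate(results)
AUC_raw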

0.712366912366912