# Predict Customer Churn Using Logistic Regression & PySpark

This notebook fits a logistic regression model to a marketing agency's customer database to predict future customer churn, first on the existing data and then on new, unseen data.
The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Name: Name of the latest contact at Company
    Age: Customer Age
    Total_Purchase: Total Ads Purchased
    Account_Manager: Binary, 0 = No manager, 1 = Account manager assigned
    Years: Total Years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date that the name of the latest contact was onboarded
    Location: Client HQ Address
    Company: Name of Client Company

In [0]:
# Start a new Spark session
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('logistic').getOrCreate()

In [0]:
# Read the dataset from the file
data=spark.read.csv('dbfs:/FileStore/shared_uploads/hrishagni95@gmail.com/customer_churn.csv',inferSchema=True,header=True)

In [0]:
# Check out the columns in the dataframe
data.columns

In [0]:
# Sneak peek at the data that we are dealing with
data.show()

In [0]:
# Keep only the columns used for the regression and drop every row that contains a null value
df = data.select('Age',
                 'Total_Purchase',
                 'Account_Manager',
                 'Years',
                 'Num_Sites',
                 'Churn').na.drop()

In [0]:
# Sneak peek at the data after filtering the original dataset
df.show()

In [0]:
# Import VectorAssembler and LogisticRegression from pyspark
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

In [0]:
# Combine the fields below into the single 'features' vector column that the model expects
assembler = VectorAssembler(inputCols=['Age',
                                       'Total_Purchase',
                                       'Account_Manager',
                                       'Years',
                                       'Num_Sites'],
                            outputCol='features')

In [0]:
# Create an instance of Logistic Regression
logistic_model=LogisticRegression(featuresCol='features',labelCol='Churn')
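As a rough illustration of what the logistic regression stage computes for each customer, the sketch below applies the sigmoid function to a weighted sum of the five features and thresholds the resulting probability at 0.5 (the default threshold of Spark's `LogisticRegression`). The weights, intercept, and customer values are hypothetical numbers chosen for illustration, not coefficients fitted on this dataset.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(features, weights, intercept, threshold=0.5):
    """Return (churn probability, 0/1 label) for a single customer."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability > threshold)

# Hypothetical coefficients, NOT values fitted on customer_churn.csv.
# Order: Age, Total_Purchase, Account_Manager, Years, Num_Sites
weights = [0.05, 0.0, 0.4, 0.6, 1.2]
intercept = -19.0

# Hypothetical customer with the features in the same order.
customer = [42.0, 10000.0, 1, 5.0, 9.0]
prob, label = predict_churn(customer, weights, intercept)
print(f"churn probability = {prob:.3f}, predicted label = {label}")
```

In the fitted Spark model, the same two quantities appear in the `probability` and `prediction` columns of the transformed DataFrame.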

In [0]:
# Import the Pipeline class from pyspark
from pyspark.ml.pipeline import Pipeline

In [0]:
# Build the pipeline from its two stages: feature assembly followed by logistic regression
pipeline=Pipeline(stages=[assembler,logistic_model])

In [0]:
# Randomly split the filtered dataset into roughly 70% training data and 30% testing data
train_data,test_data=df.randomSplit([0.7,0.3])
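The split above is random, so both the exact sizes of the two sets and the rows that land in each can change from run to run. The sketch below is a conceptual pure-Python model of a weighted random split, not Spark's actual implementation: each row gets an independent uniform draw and is assigned to a part according to the cumulative weights. The `random_split` helper and the row values are hypothetical.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one part with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point leftover.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, [0.7, 0.3], seed=42)
train_again, test_again = random_split(rows, [0.7, 0.3], seed=42)

print(len(train), len(test))   # close to 700/300, but not exactly
print(train == train_again)    # same seed, same split
```

PySpark's `randomSplit` takes an optional `seed` argument for the same purpose, for example `df.randomSplit([0.7, 0.3], seed=42)`.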

In [0]:
# Fit the pipeline with the training data
fit_model=pipeline.fit(train_data)

In [0]:
# Use the fitted model to transform the testing data set
res=fit_model.transform(test_data)

In [0]:
# Import BinaryClassificationEvaluator to evaluate the predictions made on the testing data set
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
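# Create an instance of BinaryClassificationEvaluator (its default metric is areaUnderROC).
# Note that it is given the hard 0/1 'prediction' column rather than the continuous 'rawPrediction' column.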
evaluation=BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='Churn')

In [0]:
# Check the AUROC score from the evaluation object using the results from testing data transformation
evaluation.evaluate(res)

In [0]:
# The AUROC is ~0.75, which is decent. The exact value varies between runs because the train/test split is random.
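The evaluator above is configured with `rawPredictionCol='prediction'`, so the AUROC is computed from hard 0/1 predictions rather than from continuous scores. The sketch below uses made-up labels and probabilities (not results from this dataset) and a small hypothetical `auroc` helper to show how the two ways of scoring can differ.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example data: true labels and predicted churn probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.45, 0.3, 0.2, 0.1, 0.05]

# Hard predictions obtained by thresholding the probabilities at 0.5.
hard = [1 if p > 0.5 else 0 for p in probs]

auc_from_scores = auroc(labels, probs)
auc_from_hard = auroc(labels, hard)
print(f"AUROC from probabilities:    {auc_from_scores:.3f}")
print(f"AUROC from hard predictions: {auc_from_hard:.3f}")
```

Thresholding discards the ranking among customers on the same side of 0.5, which is why the two numbers differ here. To score the continuous output in Spark, the evaluator can be left at its default `rawPredictionCol='rawPrediction'`.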

In [0]:
# Sneak peek at the test-set predictions: which customers are predicted to churn and which are not
res.show()

## Prediction on unseen data

In [0]:
# Import the new data file
new_cust=spark.read.csv('dbfs:/FileStore/shared_uploads/hrishagni95@gmail.com/new_customers-2.csv',inferSchema=True,header=True)

In [0]:
new_cust.show()

In [0]:
# Use the previously fitted model to transform the new dataset and obtain predictions for the new customers
new_res=fit_model.transform(new_cust)

In [0]:
# Display the prediction of who might churn
new_res.show()