https://www.kaggle.com/datasets/hassanamin/customer-churn

## Introduction

Customer churn prediction is one of the most widely discussed problem statements today. Almost every business decision is ultimately made for the purpose of **making a profit**, and that profit comes from the customers a company retains through its products and services. An organisation's goal is therefore to **hold on to its existing customers and identify those who may switch to alternatives; losing customers in this way is known as customer churn**.


In this blog we will build a model to predict **customer churn** with the help of **PySpark's MLlib** library.

## Problem Statement

From now on, let's not treat this as just an article. Suppose we are working for a marketing agency that has hired us to predict which of its customers might stop buying its marketing services, i.e. which customers are most likely to churn.


## Approach

As we are working on a real-world project, let's understand its flow. First, one important thing to mention is that we have a separate **"new_customer"** dataset, which will eventually be used as the testing data after the **model development phase**. We also need to build a **classification model** that, based on the features we feed to it, classifies whether a **customer will churn or not**.



## About the dataset

This is the marketing agency's data, which has 9 features and 1 target variable (**Churn**). If you want to know more about this dataset, go through this [link](https://www.kaggle.com/datasets/hassanamin/customer-churn).

1. **Names:** Name of the company the customer is tagged to
2. **Age:** Age of the customer
3. **Total_Purchase:** Total ads purchased
4. **Account_Manager:** Binary; 0 = no manager, 1 = account manager assigned
5. **Years:** Total years the customer has been using the company's service
6. **Num_Sites:** Total number of websites that use this service
7. **Onboard_date:** Onboarding date of the latest contacted person
8. **Location:** Headquarters address of the client
9. **Company:** Name of the client's company

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=d2a0febb8a9f5811b50bd9075043ff4a13a02067863ac4dbf01ed8c9473e2fc9
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Importing libraries and starting the Spark Session

Here we start the first phase, where the required libraries are imported to **set up the Spark environment** and the **Spark Session** is started, which is always a mandatory step to get started with **PySpark**.

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('customer_churn_prediction').getOrCreate()
spark

**Inference:** In the first step, the **SparkSession** class is imported from the **pyspark.sql** module. The session is then configured through the **builder** attribute and created (or reused, if one already exists) with the **getOrCreate()** method.

Note that when we look at the notebook (GUI) rendering of the session, we can see the **App name**, the **Version** of Spark and the **location** where the session is running.

## Reading the dataset

In this section we will read our dataset, which includes all the **features required to predict which customers are most likely to churn** and move to other alternatives.

In [4]:
data = spark.read.csv('customer_churn.csv',inferSchema=True,
                     header=True)

**Inference:** In the above line of code we read the **CSV** formatted data using the **read.csv** function. We set the **inferSchema** and **header** parameters to True so that Spark detects each column's data type and uses the first row as the column names.

In [5]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



**Inference:** Printing the schema of the data is a good practice, because it shows the **type of each column**, i.e. what **kind of data it can hold**. The above output shows that **Onboard_date** is of string type, so if this feature is required later, it should first be converted to a proper date format.

**Let's do some statistical analysis of our dataset.** The describe method alone can provide a lot of insight into the statistics of the dataset.
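If **Onboard_date** is needed later, the string has to be parsed first. The following is a minimal plain-Python sketch of that conversion. The sample value and the format string are assumptions (they are not shown in the outputs above), so check a few rows of the CSV before relying on them.

```python
from datetime import datetime

# Assumed timestamp layout; adjust it if the CSV uses a different one.
ONBOARD_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_onboard_date(raw: str) -> datetime:
    """Turn an Onboard_date string into a datetime object."""
    return datetime.strptime(raw.strip(), ONBOARD_FORMAT)


# Hypothetical value used only for illustration.
sample = parse_onboard_date("2013-08-30 07:00:40")
print(sample.year, sample.month)
```

In PySpark itself, the analogous conversion is usually done on the column with `to_timestamp` from `pyspark.sql.functions`.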

In [6]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|       Onboard_date|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|               null|                null|                null|0.16666666666666666|
| stddev| 

**Inference:** The very first inference we can draw is that there are **no NULL values in the dataset**, as the count is **900** for all the features, so we do not have to deal with missing values. Looking at the **mean** of string columns such as **Names**, which is null, we can also conclude that **string-type** columns contribute nothing to this statistical analysis.
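The reasoning behind the count check is that `describe()` counts only non-null values per column, so a column with missing entries would show a count below the number of rows. A small plain-Python sketch of the same idea, using made-up rows rather than the real dataset:

```python
from typing import Dict, List, Optional


def count_missing(rows: List[Dict[str, Optional[object]]]) -> Dict[str, int]:
    """Count None values per column across a list of row dictionaries."""
    missing: Dict[str, int] = {}
    for row in rows:
        for column, value in row.items():
            # Adding a bool to an int increments the counter only for None.
            missing[column] = missing.get(column, 0) + (value is None)
    return missing


# Hypothetical rows, not taken from the dataset.
rows = [
    {"Age": 42.0, "Years": 7.2, "Churn": 1},
    {"Age": None, "Years": 5.1, "Churn": 0},
    {"Age": 38.0, "Years": 6.0, "Churn": 0},
]
print(count_missing(rows))
```

A column whose missing count is zero corresponds to a `describe()` count equal to the total number of rows.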

In [7]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

**Inference:** The **columns** attribute returns **the names of all the columns of the DataFrame held in the variable** (here `data`), as the above output shows.

## Feature Selection

As we are all well aware, **feature selection** is one of the most important steps in data preprocessing: we select the features that, based on our knowledge, are the best fit for the **model development phase**. Here, all the valid numerical columns will be taken into account.

In [8]:
from pyspark.ml.feature import VectorAssembler

In [9]:
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'],outputCol='features')

**Inference:** While working with MLlib we need to know the **format of data** the library accepts. We therefore use **VectorAssembler**, which **combines all the selected features** into a single vector column (a concatenation of the feature values, not a sum) that is treated as the feature column. The selected columns and the output column name can be seen in the parameters of the **assembler object**.
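To make the idea concrete, here is a plain-Python sketch of what assembling a single row amounts to conceptually. It is not how VectorAssembler is implemented; the `assemble` helper and the example row are hypothetical.

```python
from typing import Any, Dict, List

# Same column order as the inputCols passed to the assembler above.
FEATURE_COLUMNS = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"]


def assemble(row: Dict[str, Any], input_cols: List[str]) -> List[float]:
    """Concatenate the selected column values, in order, into one vector."""
    return [float(row[col]) for col in input_cols]


# Hypothetical customer row; the values are made up for illustration.
row = {
    "Names": "Example Name",
    "Age": 40.0,
    "Total_Purchase": 9500.0,
    "Account_Manager": 1,
    "Years": 5.5,
    "Num_Sites": 9.0,
}
print(assemble(row, FEATURE_COLUMNS))
```

Columns that are not listed, such as the name, are simply left out of the vector, and the order of the values follows the order of the listed columns.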

In [10]:
output = assembler.transform(data)

**Inference:** Creating the assembler does not change anything by itself. Calling the **transform** method applies it to the data and returns a new DataFrame that contains all the original columns plus the new **features** column, which is why this step is necessary.

In [31]:
# Keep the DataFrame itself; show() only prints and returns None
final_data = output.select('features','churn')
final_data.show()

+--------------------+-----+
|            features|churn|
+--------------------+-----+
|[42.0,11066.8,0.0...|    1|
|[41.0,11916.22,0....|    1|
|[38.0,12884.75,0....|    1|
|[42.0,8010.76,0.0...|    1|
|[37.0,9191.58,0.0...|    1|
|[48.0,10356.02,0....|    1|
|[44.0,11331.58,1....|    1|
|[32.0,9885.12,1.0...|    1|
|[43.0,14062.6,1.0...|    1|
|[40.0,8066.94,1.0...|    1|
|[30.0,11575.37,1....|    1|
|[45.0,8771.02,1.0...|    1|
|[45.0,8988.67,1.0...|    1|
|[40.0,8283.32,1.0...|    1|
|[41.0,6569.87,1.0...|    1|
|[38.0,10494.82,1....|    1|
|[45.0,8213.41,1.0...|    1|
|[43.0,11226.88,0....|    1|
|[53.0,5515.09,0.0...|    1|
|[46.0,8046.4,1.0,...|    1|
+--------------------+-----+
only showing top 20 rows



**Inference:** The above output makes clear what we were aiming for: the first column is **features**, which holds all the **selected columns**, followed by the label column, i.e. **churn**.

## Test Train Split

Now, if you have been following from the very beginning of the article, you might have a question: **if we already have separate testing data, why are we splitting this dataset?**

The answer is to treat this **split as validation of the model**. We do not have to repeat this routine when dealing with the new data, as it is **already split off** into a different CSV file.

In [12]:
train_churn,test_churn = final_data.randomSplit([0.7,0.3])

**Inference:** With the help of tuple unpacking we have stored roughly **70%** of the data in **train_churn** and roughly **30%** of it in **test_churn** by using PySpark's **randomSplit**() method.
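The split is random per row, so the proportions are only approximate; this is why the training summary further down reports 633 rows rather than exactly 630 (70% of 900). The sketch below illustrates the idea in plain Python. It is a conceptual illustration, not Spark's actual implementation, and `bernoulli_split` is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple


def bernoulli_split(
    rows: Sequence[int], train_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(900))  # stand-ins for the 900 customer rows
train, test = bernoulli_split(rows, 0.7, seed=42)
print(len(train), len(test))  # close to 630 / 270, but rarely exact
```

`randomSplit` also accepts an optional `seed` argument, which is worth setting when the split needs to be reproducible between runs.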

## Model development

Reaching this phase means that our data is **prepared** and ready to be fed to the classification algorithm (more specifically, **Logistic Regression**).

**Note that we have to do this model building again when we deal with the new customers' data.**


In [13]:
from pyspark.ml.classification import LogisticRegression

In [14]:
lr_churn = LogisticRegression(labelCol='churn')

In [15]:
fitted_churn_model = lr_churn.fit(train_churn)

In [16]:
training_sum = fitted_churn_model.summary

**Code breakdown:** Here is a complete explanation of the steps required in the model building phase using MLlib:

1. Importing the **LogisticRegression** class from the **ml.classification** module of PySpark.

2. Creating a Logistic Regression object and passing the **label column (churn).**

3. **Fitting the model**, i.e. training the model on the training dataset.

4. Getting the **summary of the training** from the summary attribute of the trained model.

In [17]:
training_sum.predictions.describe().show()



+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                633|                633|
|   mean| 0.1579778830963665|0.11848341232227488|
| stddev|0.36500869525442065|0.32343524011826996|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



**Inference:** The summary object returns many insights about the **trained logistic regression model**. The **mean and standard deviation** of **churn** (actual values) and **prediction** (predicted values) are fairly close (a mean of about 0.158 versus 0.118), which is an encouraging sign. However, matching summary statistics alone do not show that individual customers are classified correctly, so the model still has to be evaluated on held-out data.
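A row-by-row comparison is a stricter check than comparing means. The plain-Python sketch below counts the four outcomes of a binary classifier and derives accuracy, precision and recall from them. The labels and predictions are made up for illustration and are not the model's output.

```python
from typing import Dict, Sequence


def confusion_counts(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives, treating 1 (churn) as positive."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(labels, predictions):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


# Hypothetical values: the means are similar (0.3 vs 0.2),
# yet only one of the three churners is actually caught.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(labels, predictions)
accuracy = (counts["tp"] + counts["tn"]) / len(labels)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, accuracy, precision, recall)
```

Because churners are the minority class here (the mean of **Churn** is about 0.17 in the full dataset), recall on the churn class is usually more informative than plain accuracy.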

## Model Evaluation

In this stage of the churn prediction we should **analyze our model**, which was trained on about **70%** of the dataset. By evaluating it we can decide whether to **go with the model or whether some tweaks are required**.

In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [19]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

**Inference:** In the first step we imported **BinaryClassificationEvaluator**, a natural choice because the label column has binary values only. Note that the code here does not call it yet; the evaluation relies on the model's own **evaluate**() method.

The **evaluate**() method takes the **testing data** (about **30%** of the total dataset) as its parameter and returns a summary with multiple fields from which we can evaluate the model (manually).
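The **BinaryClassificationEvaluator** imported above reports the area under the ROC curve by default. One way to read that number is as the fraction of (churner, non-churner) pairs in which the churner receives the higher score. The plain-Python sketch below computes it that way; it is a conceptual illustration with made-up scores, not Spark's implementation, and `area_under_roc` is a hypothetical helper.

```python
from typing import Sequence


def area_under_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and churn scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(area_under_roc(labels, scores))  # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes the metric less sensitive to class imbalance than accuracy.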

In [20]:
pred_and_labels.predictions.show()



+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[26.0,8787.39,1.0...|    1|[0.26741618669989...|[0.56645847293786...|       0.0|
|[26.0,8939.61,0.0...|    0|[5.91744417715237...|[0.99731515594254...|       0.0|
|[27.0,8628.8,1.0,...|    0|[4.81589064770682...|[0.99196507828919...|       0.0|
|[28.0,11204.23,0....|    0|[1.80953499926464...|[0.85930566533014...|       0.0|
|[29.0,9378.24,0.0...|    0|[4.47613350870519...|[0.98875066870221...|       0.0|
|[29.0,9617.59,0.0...|    0|[4.20685116490178...|[0.98532536140902...|       0.0|
|[30.0,8403.78,1.0...|    0|[5.25931135316257...|[0.99482800491977...|       0.0|
|[30.0,10183.98,1....|    0|[2.55476995464296...|[0.92789331160039...|       0.0|
|[31.0,7073.61,0.0...|    0|[2.77243369914190...|[0.94116788740757...|       0.0|
|[31.0,8829.83,1

**Inference:** In the above output one can see the 5 columns returned by the evaluate method:

1. **Features:** All the feature values that were combined by VectorAssembler during the **feature selection phase**.

2. **Churn:** The actual values, i.e. the actual **label** column.

3. **rawPrediction:** The raw, untransformed scores of the model for each class, from which the probabilities are derived.

4. **Probability:** The **probability of each class** as estimated by the model.

5. **Prediction:** The values predicted by the model (**here 0 or 1**) on the testing data.
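The **rawPrediction** and **probability** columns in the output above are linked by the logistic (sigmoid) function. The sketch below checks this for the first entries of the first two rows, using the digits as printed (they are truncated in the output, so the match is only approximate). It assumes that the first entry of each vector refers to class 0 and that the default 0.5 threshold is in use.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First vector entries of the first two rows shown above (truncated digits).
raw_first = [0.26741618669989, 5.91744417715237]
prob_first = [0.56645847293786, 0.99731515594254]

for raw, prob in zip(raw_first, prob_first):
    p_class0 = sigmoid(raw)
    # Assumed default threshold of 0.5: predict churn only if P(class 1) > 0.5.
    predicted = 0.0 if p_class0 >= 0.5 else 1.0
    print(round(p_class0, 6), round(prob, 6), predicted)
```

Both rows come out as prediction 0.0, which agrees with the **prediction** column in the table; the first row is a borderline case (about 0.57) where the customer actually churned.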

## Predicting the new data (new customers)

Finally comes the last stage of the article. We have already **built and evaluated our model**, and now predictions will be made on **completely new data**, i.e. the new customers dataset, to see what the model predicts for them.

Note that in this stage the steps remain the same; only the dataset differs.

In [24]:
final_lr_model = lr_churn.fit(final_data)

**Inference:** There is nothing extra to discuss here, as we have already gone through this step. The main thing to notice is that we are **training on the complete dataset (final_data)**: the testing data already sits in its own CSV file, hence **no splitting of the dataset is required**.

In [25]:
new_customers = spark.read.csv('new_customers.csv',inferSchema=True,
                              header=True)

In [26]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



**Inference:** As the testing data is in a different file, it is necessary to **read** it in the same way as we did for the **customer_churn** dataset.

Looking at the schema of this new dataset, we can conclude that it has the **same schema**, except that the **Churn** label column is absent, since that is what we want to predict.
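The comparison of the two schemas can also be done programmatically rather than by eye. A small plain-Python sketch using the column names printed in the two schema outputs above:

```python
# Column names as printed by the two printSchema() outputs above.
train_columns = [
    "Names", "Age", "Total_Purchase", "Account_Manager", "Years",
    "Num_Sites", "Onboard_date", "Location", "Company", "Churn",
]
new_columns = [
    "Names", "Age", "Total_Purchase", "Account_Manager", "Years",
    "Num_Sites", "Onboard_date", "Location", "Company",
]

# Columns present in one dataset but not in the other.
missing_in_new = [c for c in train_columns if c not in new_columns]
extra_in_new = [c for c in new_columns if c not in train_columns]
print(missing_in_new, extra_in_new)
```

With live DataFrames, the same lists could be taken from `data.columns` and `new_customers.columns`.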

In [27]:
test_new_customers = assembler.transform(new_customers)

**Inference:** The assembler object was already created when the main features were selected, so the same **assembler object is now used to transform this new testing data.**

In [29]:
final_results = final_lr_model.transform(test_new_customers)

**Inference:** Just as we transformed the features with the assembler object, we also call **transform** on the **final model** to generate predictions for the **new customers**.

In [30]:
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+



**Inference:** This is the result we were aiming for: the model predicts that **Cannon-Benson, Barron-Robertson, Sexton-Golden and Parks-Robbins** are likely to churn, so these companies should be assigned an Account Manager to reduce the risk of churn.

## Conclusion

To wrap up, here is a brief recap of everything we did in this article, i.e. **how we were able to decide which customers should be assigned an Account Manager** to decrease the rate of churn in those particular companies.

1. First we read the **customer_churn** data and analyzed it both statistically and logically.

2. Then we **selected the main features** that could be the best fit for the model development phase and split the dataset (for this instance it was required).

3. After building the model, we evaluated it with the model's **evaluate**() method (having also imported **BinaryClassificationEvaluator**), which helped us to see how well the model performed on the **testing data.**

4. Finally, we repeated the process for the new dataset (**new testing data**), i.e. feature assembly, model building on the full data and making predictions, **which in the end told us which companies require an Account Manager.**