# Customer Churn Prediction

A marketing agency is losing clients and wants a way to identify the customers most likely to churn. Using historical data, we build a machine learning model that classifies clients by their likelihood of discontinuing the agency's services, so that account managers can be assigned to the clients most at risk.

The dataset comprises the following fields:

    Names: Name of the latest contact at the client's company.
    Age: Age of the customer.
    Total_Purchase: Total ads purchased by the client.
    Account_Manager: Whether an account manager is assigned (0 for No, 1 for Yes).
    Years: Duration of the client's association with the agency.
    Num_Sites: Number of websites using the agency's service.
    Onboard_date: Date when the latest contact was onboarded.
    Location: Address of the client's headquarters.
    Company: Name of the client company.
    Churn: 0 or 1, indicating whether the customer has churned.

Once the model is built, we apply it to a new, unlabeled set of customers (`new_customers.csv`) to predict which of them are likely to churn.

## Importing Dataset

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('log_reg').getOrCreate()

23/10/18 17:29:37 WARN Utils: Your hostname, SunKim.local resolves to a loopback address: 127.0.0.1; using 192.168.1.67 instead (on interface en0)
23/10/18 17:29:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/18 17:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
df = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)

In [4]:
df.show()

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|
|      Phillip White|42.0|       8010.76|              0| 6.71|     10.0|2014-04-22 12:43:12|13120 Daniel Moun...|           Smith Inc|    1|

In [5]:
df.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [7]:
df.describe().show()

23/10/18 20:30:29 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|                null|                null|0.16666666666666666|
| stddev|         null|6.127560416916251|2408.644531858096|0.4999208935073339|1.274449013194616|1.764835592035

## Vectorizing Feature Columns

In [9]:
df.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [10]:
from pyspark.ml.feature import VectorAssembler

In [11]:
assembler = VectorAssembler(inputCols=['Age',
                                       'Total_Purchase',
                                       'Account_Manager',
                                       'Years',
                                       'Num_Sites'],
                            outputCol='features')

In [12]:
output = assembler.transform(df)
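
As a rough mental model, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row. The plain-Python sketch below mimics that for the first row shown by `df.show()`. The helper `assemble_row` is hypothetical and is not part of PySpark; it only illustrates the idea.

```python
# Plain-Python illustration only; assemble_row is a hypothetical helper, not a PySpark API.
INPUT_COLS = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']


def assemble_row(row, input_cols=INPUT_COLS):
    """Concatenate the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the df.show() output (some non-feature columns omitted).
first_row = {
    'Names': 'Cameron Williams',
    'Age': 42.0,
    'Total_Purchase': 11066.8,
    'Account_Manager': 0,
    'Years': 7.22,
    'Num_Sites': 8.0,
    'Churn': 1,
}

features = assemble_row(first_row)
print(features)
```

String and timestamp columns such as `Names`, `Location`, and `Onboard_date` are left out of `inputCols`, so they do not reach the model.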

In [13]:
final_data = output.select('features', 'churn')

## Splitting Train-test Set and Fitting the Model

In [14]:
train_df, test_df = final_data.randomSplit([0.7, 0.3])
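
`randomSplit` assigns rows at random, so the 70/30 proportions are only approximate (the training summary further down reports 636 rows rather than exactly 630), and the split changes between runs unless the optional `seed` argument is passed. The plain-Python sketch below illustrates per-row random assignment under that simplified assumption. It is not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by comparing a uniform draw to cumulative weights."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the upper edge.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(900))  # stand-in for the 900 rows in customer_churn.csv
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # sizes near 630 and 270, not exact
```

Fixing the seed makes the split repeatable, which helps when comparing metrics across runs.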

In [15]:
from pyspark.ml.classification import LogisticRegression

In [16]:
lr_churn = LogisticRegression(labelCol='churn')

In [17]:
fitted_model = lr_churn.fit(train_df)

23/10/18 20:35:52 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/10/18 20:35:52 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS


In [18]:
train_summary = fitted_model.summary

In [19]:
train_summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                636|                636|
|   mean| 0.1650943396226415|0.12264150943396226|
| stddev|0.37155789142012735|0.32828344355697203|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
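
The summary means can be converted back into counts, which makes the class imbalance easier to see. The arithmetic below uses only the numbers printed in the table above. The majority-class baseline is added here as a point of comparison and is not something the notebook itself computes.

```python
# Values copied from the training summary table.
n_train = 636
mean_churn = 0.1650943396226415
mean_prediction = 0.12264150943396226

actual_churners = round(n_train * mean_churn)          # rows with churn == 1
predicted_churners = round(n_train * mean_prediction)  # rows with prediction == 1

# Accuracy of a trivial model that always predicts "no churn" on the training rows.
baseline_accuracy = (n_train - actual_churners) / n_train

print(actual_churners, predicted_churners, round(baseline_accuracy, 3))
```

On these numbers the model flags fewer churners than actually churned in the training rows, and a model that never predicts churn would already be right most of the time, so accuracy alone is a weak yardstick for this dataset.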



## Evaluating the Model

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [21]:
pred_and_labels = fitted_model.evaluate(test_df)

In [22]:
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|    0|[4.67959803614180...|[0.99080263234864...|       0.0|
|[27.0,8628.8,1.0,...|    0|[5.60216880113338...|[0.99632371248260...|       0.0|
|[28.0,9090.43,1.0...|    0|[1.41272016564104...|[0.80419463067848...|       0.0|
|[28.0,11245.38,0....|    0|[3.87244462333310...|[0.97961668380142...|       0.0|
|[29.0,5900.78,1.0...|    0|[4.19242074742122...|[0.98511523983712...|       0.0|
|[29.0,13240.01,1....|    0|[6.94424409809262...|[0.99903675985206...|       0.0|
|[29.0,13255.05,1....|    0|[4.19679160000439...|[0.98517919483065...|       0.0|
|[30.0,7960.64,1.0...|    1|[3.13297967860910...|[0.95823281066214...|       0.0|
|[30.0,8874.83,0.0...|    0|[3.28051078747448...|[0.96375413073490...|       0.0|
|[30.0,10960.52,
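
For binary logistic regression, the first entry of `probability` is the logistic (sigmoid) function of the first entry of `rawPrediction`, that is, the model's estimated probability of class 0 (no churn). The sketch below checks this against the first three rows of the table using only the standard library. The digits printed in the table are truncated, so the comparison is approximate.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# (first rawPrediction entry, first probability entry) copied from the first three rows.
rows = [
    (4.67959803614180, 0.99080263234864),
    (5.60216880113338, 0.99632371248260),
    (1.41272016564104, 0.80419463067848),
]

for raw, prob in rows:
    print(round(sigmoid(raw), 6), round(prob, 6))

# The churn (class 1) probability is the complement of the class 0 probability.
churn_probabilities = [1.0 - sigmoid(raw) for raw, _ in rows]
```

Sorting clients by churn probability rather than by the 0/1 `prediction` gives a ranking, which may be more useful when deciding where to assign account managers first.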

In [23]:
# Note: 'prediction' holds hard 0/1 labels, so this AUC reflects a single decision threshold.
churn_evaluation = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='churn')

In [24]:
AUC = churn_evaluation.evaluate(pred_and_labels.predictions)

In [25]:
AUC

0.7882800608828006
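
The evaluator here was pointed at the `prediction` column, which contains hard 0/1 labels, so the reported figure summarizes a single decision threshold. AUC is more commonly computed from continuous scores such as `rawPrediction`, which is the evaluator's default `rawPredictionCol`. The plain-Python sketch below shows how the two can differ, using the rank-based definition of AUC on a tiny made-up dataset. The labels and scores are hypothetical and are not taken from `customer_churn.csv`.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: label 1 = churned, score = estimated churn probability.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.30, 0.60, 0.40, 0.90]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

auc_scores = auc(labels, scores)  # uses the full ranking
auc_hard = auc(labels, hard)      # uses only the thresholded labels
print(auc_scores, auc_hard)
```

In this toy case the thresholded labels discard ranking information and give a lower AUC than the continuous scores. Whether the same holds for the churn model would need to be checked by re-running the evaluator with `rawPredictionCol='rawPrediction'`.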

## Prediction on New Data

In [26]:
final_model = lr_churn.fit(final_data)

In [27]:
new_customers = spark.read.csv('new_customers.csv', inferSchema=True, header=True)

In [28]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [29]:
test_new_customers = assembler.transform(new_customers)

In [30]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [31]:
final_results = final_model.transform(test_new_customers)

In [33]:
final_results.select('Company', 'prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+

