# Logistic Regression Consulting Project

## Binary Customer Churn

A marketing agency has many customers that use its service to produce ads for the customers' websites. The agency has noticed quite a bit of churn among its clients. Account managers are currently assigned essentially at random, but the agency wants a machine learning model that predicts which customers will churn (stop buying the service), so that account managers can be assigned to the customers most at risk. Luckily, some historical data is available.

Create a classification algorithm that will help classify whether or not a customer churned. Then the company can test this against incoming data for future customers to predict which customers will churn and assign them an account manager.

The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Names: Name of the latest contact at Company
    Age: Customer Age
    Total_Purchase: Total Ads Purchased
    Account_Manager: Binary 0=No manager, 1=Account manager assigned (NB. currently it is randomly assigned!)
    Years: Total Years as a customer
    Num_Sites: Number of websites that use the service.
    Onboard_date: Date that the name of the latest contact was onboarded
    Location: Client HQ Address
    Company: Name of Client Company
    
Once you've created the model and evaluated it, test out the model on some new data (you can think of this almost like a hold-out set) that your client has provided, saved under new_customers.csv. The client wants to know which customers are most likely to churn given this data (they don't have the label yet).


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MLlibLogisticRegProject').getOrCreate()

In [3]:
data = spark.read.csv('customer_churn.csv', header = True, inferSchema=True)

In [4]:
data.head(1)


[Row(Names='Cameron Williams', Age=42.0, Total_Purchase=11066.8, Account_Manager=0, Years=7.22, Num_Sites=8.0, Onboard_date='2013-08-30 07:00:40', Location='10265 Elizabeth Mission Barkerburgh, AK 89518', Company='Harvey LLC', Churn=1)]

In [5]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [36]:
#Count null/nan values for all columns:
from pyspark.sql.functions import isnan, isnull, when, count, col

data.select([count(when(isnull(c), c)).alias(c) for c in data.columns]).show()
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show()

+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+------+
|Names|Age|Total_Purchase|Account_Manager|Years|Num_Sites|Onboard_date|Location|Company|Churn|Weight|
+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+------+
|    0|  0|             0|              0|    0|        0|           0|       0|      0|    0|     0|
+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+------+

+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+------+
|Names|Age|Total_Purchase|Account_Manager|Years|Num_Sites|Onboard_date|Location|Company|Churn|Weight|
+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+------+
|    0|  0|             0|              0|    0|        0|           0|       0|      0|    0|     0|
+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+------+

### Exploring the feature importance:

In [6]:
df = data.groupby('Company').count()
df.filter(df['count']>2).show()


+--------------+-----+
|       Company|count|
+--------------+-----+
|    Wilson PLC|    3|
|Anderson Group|    4|
|  Williams PLC|    3|
+--------------+-----+



'Company' does not seem informative: only three company names appear more than twice in the 900 rows.

The same holds for 'Location'.

In [7]:
df = data.groupby('Names').count()
df.filter(df['count']>1).show()

+-------------+-----+
|        Names|count|
+-------------+-----+
|Jennifer Wood|    2|
+-------------+-----+



'Names' is almost never repeated (only one name appears twice), so it is not informative.

'Account_Manager' is assigned randomly, so it is not expected to be informative either.

#### In total, five features seem to be informative:

In [8]:
data.describe().select('summary','Age','Total_Purchase','Years',
                       'Num_sites','Onboard_date',).show()

+-------+-----------------+-----------------+-----------------+------------------+-------------------+
|summary|              Age|   Total_Purchase|            Years|         Num_sites|       Onboard_date|
+-------+-----------------+-----------------+-----------------+------------------+-------------------+
|  count|              900|              900|              900|               900|                900|
|   mean|41.81666666666667|10062.82403333334| 5.27315555555555| 8.587777777777777|               null|
| stddev|6.127560416916251|2408.644531858096|1.274449013194616|1.7648355920350969|               null|
|    min|             22.0|            100.0|              1.0|               3.0|2006-01-02 04:16:13|
|    max|             65.0|         18026.01|             9.15|              14.0|2016-12-28 04:07:38|
+-------+-----------------+-----------------+-----------------+------------------+-------------------+



# Imbalanced classes:


In [9]:
data.groupby('Churn').count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|    1|  150|
|    0|  750|
+-----+-----+



The classes are imbalanced: 150 samples belong to class 1 (churn), while 750 belong to class 0.

The minority-to-majority ratio is 150/750 = 0.2. As a fraction of all 900 samples, class 1 accounts for 150/900 ≈ 0.167 and class 0 for 750/900 ≈ 0.833. The weighting below uses 0.2 as an approximation of the minority share.

There are different ways to compensate for class imbalance. One is to undersample the majority class. Another is to assign a per-class weight (passed as **weightCol in LogisticRegression**), so that the majority class gets a lower weight and the minority class a higher one.

NB. weightCol only affects model training; it is not used at prediction time.

In [10]:
from pyspark.sql.functions import when

ratio = 0.2
def weightBalance(label):
    return when(label == 1, 1-ratio).otherwise(ratio)

data = data.withColumn('Weight', weightBalance(data['Churn']))
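
As a rough check on the weights chosen above, the sketch below compares them with weights that balance the two classes exactly. The helper name `balanced_binary_weights` is hypothetical and not part of the notebook; it assumes a binary label and uses the class counts reported earlier (150 churned, 750 not churned).

```python
def balanced_binary_weights(n_pos, n_neg):
    """Return (w_pos, w_neg) so that both classes carry the same total weight.

    Each class is weighted by the share of the *other* class, which makes
    n_pos * w_pos equal to n_neg * w_neg.
    """
    total = n_pos + n_neg
    w_pos = n_neg / total
    w_neg = n_pos / total
    return w_pos, w_neg


# Class counts taken from the groupby('Churn') output in the notebook.
w_pos, w_neg = balanced_binary_weights(150, 750)
print("exactly balanced weights:", w_pos, w_neg)
print("total weight per class:", 150 * w_pos, 750 * w_neg)

# Weights actually used in the notebook (ratio = 0.2).
print("notebook total weight per class:", 150 * 0.8, 750 * 0.2)
```

With these counts, exact balancing gives roughly 0.833 and 0.167, so the notebook's 0.8 and 0.2 leave class 0 with somewhat more total weight (about 150 against 120). That is still far closer to balance than the unweighted 750 against 150.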

## Preprocessing the 'Onboard_date' feature:

First, convert the column from string to a timestamp (datetime.datetime on the Python side):

In [11]:
data.select('Onboard_date').show()

+-------------------+
|       Onboard_date|
+-------------------+
|2013-08-30 07:00:40|
|2013-08-13 00:38:46|
|2016-06-29 06:20:07|
|2014-04-22 12:43:12|
|2016-01-19 15:31:15|
|2009-03-03 23:13:37|
|2016-12-05 03:35:43|
|2006-03-09 14:50:20|
|2011-09-29 05:47:23|
|2006-03-28 15:42:45|
|2016-11-13 13:13:01|
|2015-05-28 12:14:03|
|2011-02-16 08:10:47|
|2012-11-22 05:35:03|
|2015-03-28 02:13:44|
|2015-07-22 08:38:40|
|2006-09-03 06:13:55|
|2006-10-22 04:42:38|
|2015-10-07 00:27:10|
|2014-11-06 23:47:14|
+-------------------+
only showing top 20 rows



In [12]:
from pyspark.sql.functions import to_timestamp

dfdate = data.withColumn('Date', to_timestamp(data['Onboard_date'],'yyyy-MM-dd HH:mm:ss'))

In [13]:
dfdate.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)
 |-- Weight: double (nullable = false)
 |-- Date: timestamp (nullable = true)



Then, compute **delta (time difference) from today** to the Onboard_date ('Date' column):

In [14]:
import datetime
#dt = datetime.datetime.today()
dt = datetime.datetime(2020, 7, 9, 11, 12, 37, 597168)

In [15]:
dfdate.select('Date').head(1)

[Row(Date=datetime.datetime(2013, 8, 30, 7, 0, 40))]

Register the Python function as a UDF:

In [16]:
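import pyspark.sql.functions as F
# Whole days between the fixed reference date `dt` and each onboarding timestamp.
# F.udf returns a string column by default, hence the cast to int on the next line.
dtDeltafn = F.udf(lambda x: (dt-x).days)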
dfDur = dfdate.withColumn('Duration_date', dtDeltafn(dfdate['Date']).cast('int'))
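
To make the per-row computation concrete, here is a plain-Python sketch of what the lambda does for a single timestamp. The function name `days_since` and the `None` handling are assumptions added for illustration; the notebook's lambda has no such guard, which is fine here because the data contains no nulls.

```python
import datetime

# Fixed reference instant used in the notebook instead of datetime.today().
REFERENCE = datetime.datetime(2020, 7, 9, 11, 12, 37, 597168)


def days_since(onboard, reference=REFERENCE):
    """Whole days elapsed from `onboard` to `reference`; None if missing."""
    if onboard is None:
        return None
    # timedelta.days drops the partial day, so only complete days are counted.
    return (reference - onboard).days


# First and fourth onboarding timestamps shown in the notebook output.
print(days_since(datetime.datetime(2013, 8, 30, 7, 0, 40)))    # 2505
print(days_since(datetime.datetime(2014, 4, 22, 12, 43, 12)))  # 2269
```

The second value is one less than the calendar-date difference because the onboarding time of day (12:43) is later than the reference time of day (11:12), so the last day is incomplete.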

# Final dataset

In [17]:
featureColumns = ['Duration_date','Age','Total_Purchase','Years','Num_sites']
df = dfDur.withColumnRenamed('Churn','label').select(featureColumns+['Weight','label'])

df.show()

+-------------+----+--------------+-----+---------+------+-----+
|Duration_date| Age|Total_Purchase|Years|Num_sites|Weight|label|
+-------------+----+--------------+-----+---------+------+-----+
|         2505|42.0|       11066.8| 7.22|      8.0|   0.8|    1|
|         2522|41.0|      11916.22|  6.5|     11.0|   0.8|    1|
|         1471|38.0|      12884.75| 6.67|     12.0|   0.8|    1|
|         2269|42.0|       8010.76| 6.71|     10.0|   0.8|    1|
|         1632|37.0|       9191.58| 5.56|      9.0|   0.8|    1|
|         4145|48.0|      10356.02| 5.12|      8.0|   0.8|    1|
|         1312|44.0|      11331.58| 5.23|     11.0|   0.8|    1|
|         5235|32.0|       9885.12| 6.92|      9.0|   0.8|    1|
|         3206|43.0|       14062.6| 5.46|     11.0|   0.8|    1|
|         5216|40.0|       8066.94| 7.11|     11.0|   0.8|    1|
|         1333|30.0|      11575.37| 5.22|      8.0|   0.8|    1|
|         1868|45.0|       8771.02| 6.64|     11.0|   0.8|    1|
|         3431|45.0|     

In [18]:
df.describe().show()

+-------+------------------+-----------------+-----------------+-----------------+------------------+-------------------+-------------------+
|summary|     Duration_date|              Age|   Total_Purchase|            Years|         Num_sites|             Weight|              label|
+-------+------------------+-----------------+-----------------+-----------------+------------------+-------------------+-------------------+
|  count|               900|              900|              900|              900|               900|                900|                900|
|   mean|3376.3077777777776|41.81666666666667|10062.82403333334| 5.27315555555555| 8.587777777777777|0.29999999999999083|0.16666666666666666|
| stddev| 1171.897908168496|6.127560416916251|2408.644531858096|1.274449013194616|1.7648355920350969|0.22373112736634135| 0.3728852122772358|
|    min|              1289|             22.0|            100.0|              1.0|               3.0|                0.2|                  0|
|    m

### VectorAssembler:

In [19]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols = featureColumns,
                            outputCol = 'featuresAssem')

### Normalizing features:
Scale each feature to unit standard deviation. With withMean=False (as set below), the features are not centered to zero mean.

In [20]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="featuresAssem", outputCol="features",
                        withStd=True, withMean=False)
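
The effect of `withStd=True, withMean=False` can be sketched in plain Python on a single feature. This is only an illustration with made-up values, assuming (as the Spark documentation describes) that the scaler divides by the corrected sample standard deviation; `scale_to_unit_std` is a hypothetical helper, not a Spark function.

```python
import statistics


def scale_to_unit_std(values):
    """Divide by the sample standard deviation without subtracting the mean."""
    std = statistics.stdev(values)  # corrected sample std (divides by n - 1)
    return [v / std for v in values]


values = [2.0, 4.0, 6.0]  # illustrative numbers, not taken from the dataset
scaled = scale_to_unit_std(values)

print(scaled)                     # [1.0, 2.0, 3.0]
print(statistics.stdev(scaled))   # 1.0
print(statistics.mean(scaled))    # 2.0, the mean is rescaled but not zero
```

Skipping the centering step keeps sparse vectors sparse, which is one common reason to leave `withMean` off.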

### Define a LogisticRegression instance:

In [21]:
from pyspark.ml.classification import LogisticRegression
logr = LogisticRegression(featuresCol='features',labelCol='label', weightCol='Weight')

# No-op: 'Churn' was already renamed to 'label' in In [17], and the result is not assigned.
df.withColumnRenamed('Churn','label')

DataFrame[Duration_date: int, Age: double, Total_Purchase: double, Years: double, Num_sites: double, Weight: double, label: int]

### Pipeline creation and defining the stages:

In [22]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler,scaler,logr])

### Train/test split

In [23]:
train, test = df.randomSplit([0.7,.3])

### Training the model

In [24]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder() \
    .addGrid(logr.regParam, [0, 0.1, 0.01])\
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3)


In [25]:
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(train)
#cvModel includes the best model

In [26]:
# transform the test data
results = cvModel.transform(test)

In [27]:
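# Note: rawPredictionCol='prediction' computes the AUC from the hard 0/1 predictions;
# the default 'rawPrediction' column would use the continuous scores instead.
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                       labelCol='label', metricName="areaUnderROC" )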
AUC = my_eval.evaluate(results)
AUC

0.811113804387347
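
Because the evaluator above is pointed at the hard `prediction` column, the reported AUC may differ from one computed on continuous scores. The toy sketch below shows why, using a pairwise (rank-based) AUC on made-up labels and scores. The function `auc` is a hypothetical helper and not Spark's implementation, and the numbers are not from this dataset.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]                      # continuous scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]  # thresholded at 0.5

print(auc(labels, scores))  # 0.75
print(auc(labels, hard))    # 0.5
```

Thresholding collapses the scores to two values, so many pairs become ties and the ranking information inside each predicted class is lost.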

### To find the best set of params:

In [28]:
bestPipeline = cvModel.bestModel
bestLRModel = bestPipeline.stages[2]
bestParams = bestLRModel.extractParamMap()

#bestParams
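
To see how each grid setting performed rather than only the winner, one option is to pair the parameter maps with the cross-validated metrics. The sketch below assumes the fitted object exposes `getEstimatorParamMaps()` and `avgMetrics`, as `CrossValidatorModel` does, and that a larger metric is better (true for areaUnderROC). The helper name `summarize_cv` is hypothetical.

```python
def summarize_cv(cv_model):
    """Return (settings, metric) pairs sorted from best to worst.

    `settings` maps parameter names (e.g. 'regParam') to the value tried,
    and `metric` is the evaluator score averaged over the folds.
    """
    rows = []
    for param_map, metric in zip(cv_model.getEstimatorParamMaps(),
                                 cv_model.avgMetrics):
        settings = {param.name: value for param, value in param_map.items()}
        rows.append((settings, metric))
    # Assumes larger is better, as for areaUnderROC.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

In this notebook it would be called as `summarize_cv(cvModel)`, and the first entry should correspond to the `regParam` value used by `cvModel.bestModel`.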

### Predict on brand new unlabeled data

In [29]:
data_newCust = spark.read.csv('new_customers.csv', header = True, inferSchema=True)

In [30]:
data_newCust.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [31]:
dfDate_newCust = data_newCust.withColumn('Date', to_timestamp(data_newCust['Onboard_date'],'yyyy-MM-dd HH:mm:ss'))

dfDur_newCust = dfDate_newCust.withColumn('Duration_date', dtDeltafn(dfDate_newCust['Date']).cast('int'))

df_newCust = dfDur_newCust.select(featureColumns+['Company'])

In [32]:
results_newCust = cvModel.transform(df_newCust)

In [33]:
results_newCust.select('Company','probability','prediction').show()

+----------------+--------------------+----------+
|         Company|         probability|prediction|
+----------------+--------------------+----------+
|        King Ltd|[0.61001946597518...|       0.0|
|   Cannon-Benson|[0.03661577428857...|       1.0|
|Barron-Robertson|[0.09110822343468...|       1.0|
|   Sexton-Golden|[0.04535616983946...|       1.0|
|        Wood LLC|[0.49787560005344...|       1.0|
|   Parks-Robbins|[0.25682973707624...|       1.0|
+----------------+--------------------+----------+



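The above results show that King Ltd is the only company predicted not to churn, so it does not need an Account Manager; the other five companies are predicted to churn and should be assigned one. Note that Wood LLC sits almost exactly at the 0.5 decision threshold.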
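
Since the client asked which customers are most likely to churn, it may help to rank them by churn probability instead of reading only the 0/1 prediction. The sketch below does this in plain Python, assuming the first entry of each `probability` vector is P(no churn); the values are rounded from the table above, and `rank_by_churn_risk` is a hypothetical helper.

```python
# P(no churn) per company, rounded from the 'probability' column above.
p_no_churn = {
    "King Ltd": 0.610,
    "Cannon-Benson": 0.037,
    "Barron-Robertson": 0.091,
    "Sexton-Golden": 0.045,
    "Wood LLC": 0.498,
    "Parks-Robbins": 0.257,
}


def rank_by_churn_risk(p_no_churn):
    """Return (company, P(churn)) pairs sorted from highest to lowest risk."""
    risks = ((company, 1.0 - p) for company, p in p_no_churn.items())
    return sorted(risks, key=lambda pair: pair[1], reverse=True)


for company, risk in rank_by_churn_risk(p_no_churn):
    print(f"{company:<18} {risk:.3f}")
```

Under these rounded values, Cannon-Benson and Sexton-Golden come out as the highest-risk customers, followed by Barron-Robertson, Parks-Robbins, Wood LLC, and finally King Ltd.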