# Logistic Regression Consulting Project

## (Henri's Solution + Additional Comments)

You have been contacted by a top marketing agency to help them out with customer churn (the annual percentage rate at which customers stop subscribing to a service or employees leave a job).

The agency has many customers that use its service to produce ads for their own websites.

They've noticed that they have quite a bit of churn in clients.

They currently assign account managers at random, but want you to create a machine learning model that predicts which customers will churn (stop buying their service), so that they can assign an account manager to the customers most at risk of churning.

Luckily, they have some historical data. Can you help them out?

Create a classification algorithm that will help classify whether or not a customer churned.

Then the company can test this against incoming data for future customers to predict which customers will churn and assign them an account manager.

The data is under ```customer_churn.csv```.  Let's quickly go over the data and what your main task is.

- ```Names```: Name of the latest contact at Company.
- ```Age```: Customer Age
- ```Total_Purchase```: Total Ads Purchased
- ```Account_Manager```: Binary 0 -> No manager, 1 -> Account manager assigned
- ```Years```: Total Years as a customer
- ```Num_Sites```: Number of websites that use the service
- ```Onboard_date```: Date that the name of the latest contact was onboarded
- ```Location```: Client HQ Address
- ```Company```: Name of the Client Company
- ```Churn```: 0 or 1 indicating whether customer has churned.

Your goal is to create a model that can predict whether a customer will churn (0 or 1) based on the features.

Remember that the account manager is currently randomly assigned.

As always, treat this consulting project as a loosely guided exercise, or skip ahead and treat it as a code-along project.

Best of luck!

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Start a spark session
spark = SparkSession.builder.appName("churn").getOrCreate()

In [3]:
# Load in our data.
data = spark.read.csv("customer_churn.csv", inferSchema=True, header=True)

In [4]:
data.printSchema()
print("\n")
print("Number of samples is: {:.0f}.".format(data.count()))

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



Number of samples is: 900.


In [5]:
# EDA
data.head(1)[0]

Row(Names='Cameron Williams', Age=42.0, Total_Purchase=11066.8, Account_Manager=0, Years=7.22, Num_Sites=8.0, Onboard_date=datetime.datetime(2013, 8, 30, 7, 0, 40), Location='10265 Elizabeth Mission Barkerburgh, AK 89518', Company='Harvey LLC', Churn=1)

In [6]:
# EDA
for row in data.head(5):
    print(row)
    print("\n")

Row(Names='Cameron Williams', Age=42.0, Total_Purchase=11066.8, Account_Manager=0, Years=7.22, Num_Sites=8.0, Onboard_date=datetime.datetime(2013, 8, 30, 7, 0, 40), Location='10265 Elizabeth Mission Barkerburgh, AK 89518', Company='Harvey LLC', Churn=1)


Row(Names='Kevin Mueller', Age=41.0, Total_Purchase=11916.22, Account_Manager=0, Years=6.5, Num_Sites=11.0, Onboard_date=datetime.datetime(2013, 8, 13, 0, 38, 46), Location='6157 Frank Gardens Suite 019 Carloshaven, RI 17756', Company='Wilson PLC', Churn=1)


Row(Names='Eric Lozano', Age=38.0, Total_Purchase=12884.75, Account_Manager=0, Years=6.67, Num_Sites=12.0, Onboard_date=datetime.datetime(2016, 6, 29, 6, 20, 7), Location='1331 Keith Court Alyssahaven, DE 90114', Company='Miller, Johnson and Wallace', Churn=1)


Row(Names='Phillip White', Age=42.0, Total_Purchase=8010.76, Account_Manager=0, Years=6.71, Num_Sites=10.0, Onboard_date=datetime.datetime(2014, 4, 22, 12, 43, 12), Location='13120 Daniel Mount Angelabury, WY 30645-4695',

In [7]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [8]:
desired_cols = data.select(["Age", "Total_Purchase", "Years", "Num_Sites", "Churn"])

**In the following, we create an assembler.**

In [9]:
from pyspark.ml.feature import VectorAssembler

In [10]:
# Create your assembler object.
assembler = VectorAssembler(inputCols=["Age", "Total_Purchase", "Years", "Num_Sites"],
                            outputCol="features")
# Concerning inputCols:
# --- "Names": arbitrary text; it won't help our model.
# --- "Account_Manager": we don't expect this to help much, since managers are randomly assigned.
#     Yet the instructor includes it in his assembler...  Why?

In [11]:
# Generate the output.
# Notice that we only select "features" and "Churn".
proc_data = assembler.transform(dataset=desired_cols).select(["features", "Churn"])

In [12]:
# Basic EDA.
proc_data.head(1)[0]

Row(features=DenseVector([42.0, 11066.8, 7.22, 8.0]), Churn=1)

**Instantiate and train the model.**

In [13]:
from pyspark.ml.classification import LogisticRegression

In [14]:
# Instantiate a LogisticRegression model.
log_reg = LogisticRegression(labelCol="Churn")

In [15]:
# Split the data.
train_data, test_data = proc_data.randomSplit([0.7, 0.3], seed=123)

In [16]:
# Fit the model to the training data.
fitted_log_reg = log_reg.fit(dataset=train_data)

In [17]:
# Check out a summary of the trained model.
training_summary = fitted_log_reg.summary
# training_summary --> <pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x7f76ed11a860>

In [18]:
training_summary.predictions.describe().show()
# A useful check: compare the "mean" and "stddev" rows of the "Churn" and "prediction" columns.

+-------+-------------------+-------------------+
|summary|              Churn|         prediction|
+-------+-------------------+-------------------+
|  count|                633|                633|
|   mean|0.15639810426540285|0.11216429699842022|
| stddev| 0.3635195998705865|0.31581804295341376|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
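
The table above suggests the classes are imbalanced. Because "Churn" and "prediction" are 0/1 columns, their means are fractions of the 633 training rows. A minimal sketch in plain Python, using only the values copied from the table (the variable names are illustrative), recovers the underlying counts and the accuracy a trivial "never churns" classifier would reach:

```python
# Values copied from the describe() table above.
n_rows = 633
churn_rate = 0.15639810426540285       # mean of "Churn"
predicted_rate = 0.11216429699842022   # mean of "prediction"

# The mean of a 0/1 column is a fraction, so multiplying by the row count
# recovers the number of ones.
n_churned = round(churn_rate * n_rows)
n_predicted_churn = round(predicted_rate * n_rows)

# A classifier that always predicts 0 ("no churn") is right on every non-churner.
baseline_accuracy = 1 - churn_rate

print(n_churned, n_predicted_churn)
print(round(baseline_accuracy, 4))
```

So 99 of the 633 training customers churned, while the model flags only 71 of them as churners, and any accuracy figure should be read against a baseline of roughly 84%.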



**Evaluating the results.**

Below, we evaluate the model on the test set and then compute the ```auroc``` metric (area under the ROC curve).

In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [20]:
# Evaluate the fitted model on the held-out test dataset.
preds_and_labels = fitted_log_reg.evaluate(dataset=test_data)

In [21]:
# Grab the "predictions" DataFrame off `preds_and_labels`.
preds_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|Churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[25.0,9672.03,5.4...|    0|[4.45728553683191...|[0.98853908393873...|       0.0|
|[26.0,8787.39,5.4...|    1|[1.07747513654642...|[0.74601587906872...|       0.0|
|[26.0,8939.61,4.5...|    0|[6.08134633024471...|[0.99772011168408...|       0.0|
|[28.0,11204.23,3....|    0|[1.69011971259467...|[0.84423990259824...|       0.0|
|[29.0,5900.78,5.5...|    0|[4.44595421204329...|[0.98840999172866...|       0.0|
|[29.0,11274.46,4....|    0|[4.64709674466339...|[0.99050168115979...|       0.0|
|[29.0,12711.15,5....|    0|[5.00285442680424...|[0.99332609877076...|       0.0|
|[29.0,13240.01,4....|    0|[6.70687780527334...|[0.99877901764305...|       0.0|
|[30.0,10183.98,5....|    0|[3.15571935401620...|[0.95913348972409...|       0.0|
|[30.0,11575.37,
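
The "rawPrediction" and "probability" columns above are related by the logistic (sigmoid) function. The following plain-Python sketch checks this against the first two rows of the output; only the first vector element of each column is visible there, and the printed digits are truncated, so the comparison is approximate. The helper name `sigmoid` is illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# First elements of "rawPrediction" and "probability", copied from the first
# two rows shown above (the notebook output truncates the remaining digits).
rows = [
    (4.45728553683191, 0.98853908393873),
    (1.07747513654642, 0.74601587906872),
]

for raw, shown_probability in rows:
    print(f"{sigmoid(raw):.8f} vs {shown_probability:.8f}")
```

Both pairs agree to the printed precision. Since every displayed row has "prediction" 0.0, the first vector element evidently refers to class 0 (no churn).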

**Using the AUC.**

In [22]:
my_evaluator = BinaryClassificationEvaluator(labelCol="Churn",
                                             rawPredictionCol="probability")
# Henri's question is:  "Why does the instructor in his solution set rawPredictionCol to 'prediction'?!"

In [23]:
# Compute the AUROC (the evaluator's default metric) on the test-set predictions.
auroc = my_evaluator.evaluate(dataset=preds_and_labels.predictions)
print(auroc)

0.9139433551198269
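
On the question of which column to pass as `rawPredictionCol`: AUROC measures how well the scores rank churners above non-churners, so it needs a continuous score. The sketch below is a plain-Python illustration on hypothetical toy data (not the churn dataset), and `auroc` is an illustrative helper rather than Spark's implementation, which computes the area from the ROC curve. It shows that a monotone transform of the scores leaves AUROC unchanged, while hard 0/1 predictions lose ranking information.

```python
import math
from itertools import product

def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative.")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(positives, negatives))
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data, NOT taken from the churn dataset.
labels = [0, 0, 0, 1, 1, 1]
churn_probability = [0.05, 0.20, 0.45, 0.30, 0.70, 0.90]

# Log-odds are a strictly increasing function of the probability.
raw_margin = [math.log(p / (1 - p)) for p in churn_probability]

# Hard 0/1 predictions at a 0.5 cut-off discard the ranking within each group.
hard_prediction = [1.0 if p >= 0.5 else 0.0 for p in churn_probability]

print(auroc(labels, churn_probability))
print(auroc(labels, raw_margin))
print(auroc(labels, hard_prediction))
```

On this toy data the probabilities and the log-odds both give 8/9, whereas the hard predictions give 7.5/9. This is why "probability" or "rawPrediction" is a more informative choice than "prediction" for this metric.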


## Predict on Brand New Data that do _NOT_ have any labels

The file of interest here is  
```new_customers.csv```

The following mimics **deployment** in real life.

In [24]:
# Fit the logistic regression model on the ENTIRE dataset.
# (At least this is what the instructor is doing.)
final_log_reg = log_reg.fit(dataset=proc_data)

In [25]:
unseen = spark.read.csv("new_customers.csv", inferSchema=True, header=True)

In [26]:
unseen.printSchema()
print("\n")

print(unseen.head(1)[0])
print("\n")

print(unseen.count())

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



Row(Names='Andrew Mccall', Age=37.0, Total_Purchase=9935.53, Account_Manager=1, Years=7.71, Num_Sites=8.0, Onboard_date=datetime.datetime(2011, 8, 29, 18, 37, 54), Location='38612 Johnny Stravenue Nataliebury, WI 15717-8316', Company='King Ltd')


6


In [27]:
# We need to transform the "unseen" DataFrame.
unseen_processed = assembler.transform(dataset=unseen)

In [28]:
# Confirm that we now have an additional "features" column:
unseen_processed.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [29]:
# Generate predictions for the unseen customers with the final model.
unseen_transformed = final_log_reg.transform(dataset=unseen_processed)

In [30]:
# What we mostly care about, amidst the other columns in the DataFrame, is the "prediction" column.
unseen_transformed.select(["Company", "prediction"]).show()
# Notice that our "unseen" DataFrame is very small; it contains only 6 customers.

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+



***Therefore, we need to assign account managers to:***
- ```Cannon-Benson```
- ```Barron-Robertson```
- ```Sexton-Golden```
- ```Parks-Robbins```

Customers that are **not likely** to churn include:
- ```King Ltd```
- ```Wood LLC```
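
The 0/1 "prediction" column hides how confident the model is; by default it corresponds to a 0.5 cut-off on the churn probability. Since account managers are a limited resource, it may be more useful to rank customers by predicted churn probability and pick the cut-off to match capacity. A minimal sketch in plain Python, with hypothetical customer names and probabilities (in the notebook these would come from the "probability" column of `unseen_transformed`) and an illustrative helper `flag_at_risk`:

```python
# Hypothetical churn probabilities, NOT the model's actual output.
churn_probability = {
    "customer_a": 0.12,
    "customer_b": 0.81,
    "customer_c": 0.55,
    "customer_d": 0.47,
}

def flag_at_risk(probabilities, threshold=0.5):
    """Return (name, probability) pairs at or above the threshold,
    highest risk first."""
    flagged = [(name, p) for name, p in probabilities.items() if p >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

print(flag_at_risk(churn_probability))
print(flag_at_risk(churn_probability, threshold=0.4))
```

Lowering the threshold flags more customers at the cost of more false alarms, so the right value depends on how many account managers are available.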