# Logistic Regression

A client wants to reduce customer churn by determining how reliably churn can be predicted and intervening when a customer is likely to leave. We'll use the data in customer_churn.csv to fit and evaluate a model, then generate predictions for the data in new_customers.csv.

In [2]:
import findspark
findspark.init("/home/bryan/Documents/Code/spark-2.4.5-bin-hadoop2.7")

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_churn').getOrCreate()

# EDA

In [4]:
data = spark.read.csv("data/customer_churn.csv", inferSchema=True, header=True)

In [5]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [6]:
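# Fail fast if any row contains a null: dropping nulls must not change the row count.
assert data.count() == data.na.drop().count(), "Dataset contains rows with missing values."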

In [7]:
data.show(3)

+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|   Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|
|     Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|
+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
only showing top 3 rows



> ### According to the client, account managers are assigned to customers at random. Can we tell from the data whether having an account manager raises or lowers churn?
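One way to approach this question is to compare the churn rate of customers with and without an account manager. The sketch below does that in plain Python so it runs without a Spark session. The helper `churn_rate_by_group` and the `toy_pairs` rows are hypothetical and are not taken from customer_churn.csv. In the notebook, the pairs could instead come from `data.select('Account_Manager', 'Churn').collect()`, or the same aggregation could be written as `data.groupBy('Account_Manager').agg(F.count('*'), F.avg('Churn'))`.

```python
from collections import defaultdict

def churn_rate_by_group(pairs):
    """Map each group value to (row count, share of rows with churn == 1)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [rows, churned rows]
    for group, churned in pairs:
        totals[group][0] += 1
        totals[group][1] += churned
    return {g: (n, k / n) for g, (n, k) in totals.items()}

# Hypothetical (Account_Manager, Churn) pairs, NOT taken from customer_churn.csv.
toy_pairs = [(0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (1, 0)]

rates = churn_rate_by_group(toy_pairs)
for group in sorted(rates):
    n, rate = rates[group]
    print(f"Account_Manager={group}: n={n}, churn rate={rate:.2f}")
```

Because assignment is described as random, a gap between the two rates can be read as the effect of having an account manager. A gap on its own does not show that the difference is larger than sampling noise, so the group sizes should be reported alongside the rates.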

In [8]:
# Ratio of distinct Company values to total rows.
round(data.select('Company').distinct().count() / data.count(), 2)

0.97

> ### Distinct Company values amount to 97% of the row count, so the column is nearly unique per customer and is unlikely to help predict churn.

In [12]:
data.select('Location').show(10, False)

+-------------------------------------------------------+
|Location                                               |
+-------------------------------------------------------+
|10265 Elizabeth Mission Barkerburgh, AK 89518          |
|6157 Frank Gardens Suite 019 Carloshaven, RI 17756     |
|1331 Keith Court Alyssahaven, DE 90114                 |
|13120 Daniel Mount Angelabury, WY 30645-4695           |
|765 Tricia Row Karenshire, MH 71730                    |
|6187 Olson Mountains East Vincentborough, PR 74359     |
|4846 Savannah Road West Justin, IA 87713-3460          |
|25271 Roy Expressway Suite 147 Brownport, FM 59852-6150|
|3725 Caroline Stravenue South Christineview, MA 82059  |
|363 Sandra Lodge Suite 144 South Ann, WI 51655-7561    |
+-------------------------------------------------------+
only showing top 10 rows



> ### The locations appear to be U.S.-based. Would it be helpful to generate new State and/or Zip Code features?
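A minimal sketch of how State and Zip Code features might be extracted, using a few of the addresses printed above. It assumes every address ends with a comma, a two-letter code, and a 5-digit ZIP with an optional 4-digit extension. The rest of the dataset has not been checked against that layout. The names `STATE_ZIP` and `parse_state_zip` are hypothetical.

```python
import re

# Addresses copied from the Location output above.
locations = [
    "10265 Elizabeth Mission Barkerburgh, AK 89518",
    "6157 Frank Gardens Suite 019 Carloshaven, RI 17756",
    "13120 Daniel Mount Angelabury, WY 30645-4695",
    "25271 Roy Expressway Suite 147 Brownport, FM 59852-6150",
]

# Assumed layout: "..., <2-letter code> <5-digit ZIP>[-<4-digit extension>]" at the end.
STATE_ZIP = re.compile(r",\s([A-Z]{2})\s(\d{5})(?:-\d{4})?$")

def parse_state_zip(location):
    """Return (state, zip5), or (None, None) if the address does not match."""
    match = STATE_ZIP.search(location)
    return match.groups() if match else (None, None)

parsed = [parse_state_zip(loc) for loc in locations]
for loc, (state, zip5) in zip(locations, parsed):
    print(f"{state}  {zip5}  <- {loc}")
```

In the notebook, the same pattern could probably be passed to `F.regexp_extract('Location', pattern, 1)` for the state and group `2` for the ZIP. That function returns an empty string when nothing matches, so unmatched rows should be counted before the new columns are trusted. Note that codes such as PR, MH, and FM in the sample are territories rather than states.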

In [22]:
from pyspark.sql.functions import year

In [31]:
data = data.withColumn('Onboard_date_year', year('Onboard_date'))
data.select('Onboard_date_year').describe().show()

+-------+------------------+
|summary| Onboard_date_year|
+-------+------------------+
|  count|               900|
|   mean|2010.8011111111111|
| stddev|3.2072288498508783|
|    min|              2006|
|    max|              2016|
+-------+------------------+



> ### Onboard_date spans roughly 10 years, from 2006 to 2016. Below I check whether the Years and Onboard_date columns are consistent with each other.

In [58]:
from pyspark.sql import functions as F

In [59]:
# Whole calendar years from the onboarding year to 2020 (hard-coded reference year).
data = data.withColumn('Years_since_onboard', 2020 - F.col('Onboard_date_year'))

In [60]:
data.select(['Years', 'Years_since_onboard']).show(5)

+-----+-------------------+
|Years|Years_since_onboard|
+-----+-------------------+
| 7.22|                  7|
|  6.5|                  7|
| 6.67|                  4|
| 6.71|                  6|
| 5.56|                  4|
+-----+-------------------+
only showing top 5 rows



In [61]:
# Absolute gap between the reported tenure and the year-based estimate.
data = data.withColumn('Years_diff', F.abs(F.col('Years') - F.col('Years_since_onboard')))

In [63]:
data.select(['Years', 'Years_since_onboard', 'Years_diff']).show(5)

+-----+-------------------+-------------------+
|Years|Years_since_onboard|         Years_diff|
+-----+-------------------+-------------------+
| 7.22|                  7|0.21999999999999975|
|  6.5|                  7|                0.5|
| 6.67|                  4|               2.67|
| 6.71|                  6|               0.71|
| 5.56|                  4| 1.5599999999999996|
+-----+-------------------+-------------------+
only showing top 5 rows



In [65]:
data.select(['Years', 'Years_since_onboard', 'Years_diff']).describe().show()

+-------+-----------------+-------------------+--------------------+
|summary|            Years|Years_since_onboard|          Years_diff|
+-------+-----------------+-------------------+--------------------+
|  count|              900|                900|                 900|
|   mean| 5.27315555555555|  9.198888888888888|   4.249177777777782|
| stddev|1.274449013194616| 3.2072288498508783|   2.980322188381301|
|    min|              1.0|                  4|0.009999999999999787|
|    max|             9.15|                 14|               11.59|
+-------+-----------------+-------------------+--------------------+
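The comparison above subtracts whole calendar years from a hard-coded 2020, so even a consistent pair of columns could differ by up to roughly a year. A finer-grained check might measure elapsed time from the full timestamp. The sketch below does this in plain Python for the first three rows shown by `data.show(3)`. The reference date of 2020-01-01 and the helper `years_between` are assumptions, since the notebook does not say when Years was measured.

```python
from datetime import datetime

# Assumption: the reference date is not stated in the notebook; 2020-01-01 is a placeholder.
REFERENCE = datetime(2020, 1, 1)

def years_between(start, end=REFERENCE):
    """Elapsed time from start to end in fractional years (365.25-day years)."""
    return (end - start).total_seconds() / (365.25 * 24 * 3600)

# (Years, Onboard_date) for the first three rows shown by data.show(3).
rows = [
    (7.22, datetime(2013, 8, 30, 7, 0, 40)),
    (6.50, datetime(2013, 8, 13, 0, 38, 46)),
    (6.67, datetime(2016, 6, 29, 6, 20, 7)),
]

for years, onboard in rows:
    elapsed = years_between(onboard)
    print(f"Years={years:.2f}  elapsed={elapsed:.2f}  diff={abs(years - elapsed):.2f}")
```

If the gaps stay large for every plausible reference date, that would suggest Years measures something other than time since Onboard_date.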

