# Customer churn augment

This notebook is derived from the [customer churn augment notebook](https://github.com/NVIDIA/data-science-blueprints/blob/main/churn/augment.ipynb). Please refer to the [git repo](https://github.com/NVIDIA/data-science-blueprints/tree/main/churn) for more details.

 

In [20]:
# notebook parameters

import os

spark_master = "spark://ip:port"
app_name = "augment"
input_file = os.path.join("data", "WA_Fn-UseC_-Telco-Customer-Churn-.csv")
output_prefix = "your-output-path-"
output_mode = "overwrite"
output_kind = "parquet"
driver_memory = '12g'
executor_memory = '8g'

dup_times = 100


In [2]:
import churn.augment

churn.augment.register_options(
    spark_master = spark_master,
    app_name = app_name,
    input_file = input_file,
    output_prefix = output_prefix,
    output_mode = output_mode,
    output_kind = output_kind,
    driver_memory = driver_memory,
    executor_memory = executor_memory,
    dup_times = dup_times,
    use_decimal = True
)

# Sanity-checking

We'll first make sure we're running with a compatible JVM. If we run on macOS, we might get one that doesn't work with Scala.

In [3]:
from os import getenv

In [4]:
getenv("JAVA_HOME")

'/data/usr/lib/jvm/java-8-openjdk-amd64'
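
The cell above only prints the path. A slightly stricter check could confirm that the path actually contains a Java executable. The following is a minimal sketch, assuming a Unix-like layout with `bin/java` under `JAVA_HOME`; `check_java_home` is a hypothetical helper and not part of `churn.augment`, and it does not verify the JVM version.

```python
import os


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    Hypothetical helper for illustration only: it inspects the filesystem
    and assumes a Unix-like JDK layout (bin/java under JAVA_HOME).
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_bin):
        return False, "no java executable at %s" % java_bin
    return True, "found java executable at %s" % java_bin


ok, message = check_java_home()
print(message)
```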

# Spark setup

In [5]:
import pyspark

In [6]:
session = pyspark.sql.SparkSession.builder \
    .master(spark_master) \
    .appName(app_name) \
    .config("spark.driver.memory", driver_memory) \
    .config("spark.executor.memory", executor_memory) \
    .getOrCreate()
session

# Schema definition

Most of the fields are strings representing booleans or categoricals, but a few (`tenure`, `MonthlyCharges`, and `TotalCharges`) are numeric.

In [7]:
from churn.augment import load_supplied_data

df = load_supplied_data(session, input_file)

                                                                                

read 7043 records from source dataset (7032 non-null records)


# Splitting the data frame

The training data schema looks like this:

- customerID
- gender
- SeniorCitizen
- Partner
- Dependents
- tenure
- PhoneService
- MultipleLines
- InternetService
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies
- Contract
- PaperlessBilling
- PaymentMethod
- MonthlyCharges
- TotalCharges
- Churn

We want to divide the data frame into several frames that we can join together in an ETL job.

Those frames will look like this:

- **Customer metadata**
  - customerID
  - gender
  - date of birth (we'll derive age and senior citizen status from this)
  - Partner
  - Dependents
  - (nominal) MonthlyCharges
- **Billing events**
  - customerID
  - date (we'll derive tenure from the number/duration of billing events)
  - kind (one of "AccountCreation", "Charge", or "AccountTermination")
  - value (either a positive nonzero amount or 0.00; we'll derive TotalCharges from the sum of amounts and Churn from the existence of an AccountTermination event)
- **Customer phone features**
  - customerID
  - feature (one of "PhoneService" or "MultipleLines")
- **Customer internet features**
  - customerID
  - feature (one of "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies")
  - value (one of "Fiber", "DSL", "Yes", "No")
- **Customer account features**
  - customerID
  - feature (one of "Contract", "PaperlessBilling", "PaymentMethod")
  - value (one of "Month-to-month", "One year", "Two year", "No", "Yes", "Credit card (automatic)", "Mailed check", "Bank transfer (automatic)", "Electronic check")
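
To make the billing-events design concrete, here is a plain-Python sketch of how a downstream ETL job could recover `tenure`, `TotalCharges`, and `Churn` from the events of one customer. The customer ID and amounts are made up for illustration, the `date` field is omitted for brevity, and we assume that tenure is simply the number of "Charge" events; the real ETL job may compute it differently.

```python
from decimal import Decimal

# Hypothetical billing events for a single customer (illustrative values only).
events = [
    {"customerID": "0001-EXAMPLE", "kind": "AccountCreation", "value": Decimal("0.00")},
    {"customerID": "0001-EXAMPLE", "kind": "Charge", "value": Decimal("29.85")},
    {"customerID": "0001-EXAMPLE", "kind": "Charge", "value": Decimal("29.85")},
    {"customerID": "0001-EXAMPLE", "kind": "Charge", "value": Decimal("29.85")},
    {"customerID": "0001-EXAMPLE", "kind": "AccountTermination", "value": Decimal("0.00")},
]


def summarize(events):
    """Derive the wide-table fields from one customer's billing events."""
    charges = [e["value"] for e in events if e["kind"] == "Charge"]
    terminated = any(e["kind"] == "AccountTermination" for e in events)
    return {
        # Assumption: one charge per month, so tenure is the number of charges.
        "tenure": len(charges),
        "TotalCharges": sum(charges, Decimal("0.00")),
        "Churn": "Yes" if terminated else "No",
    }


summary = summarize(events)
print(summary)
```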

In [8]:
df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: double (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)



We'll start by generating a series of monthly charges, then a series of account creation events, and finally a series of churn events. `billingEvents` is the data frame containing all of these events: account activation, account termination, and individual payment events.

In [9]:
from churn.augment import billing_events
billingEvents = billing_events(df)
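
For intuition, the following plain-Python sketch shows one simplified way such events could be generated from a single source record. It is not the implementation of `billing_events`: we assume one charge per month of tenure, each equal to `MonthlyCharges`, and a hypothetical start date. The helper names (`add_months`, `sketch_billing_events`) and the sample record are made up for illustration.

```python
from datetime import date
from decimal import Decimal


def add_months(d, months):
    """Shift a date by whole months, pinning the day to 1 to avoid invalid dates."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def sketch_billing_events(record, start=date(2020, 1, 1)):
    """Return (customerID, date, kind, value) tuples for one source record.

    Assumption: the start date is arbitrary and every monthly charge equals
    MonthlyCharges; the real function may choose dates and amounts differently.
    """
    cid = record["customerID"]
    tenure = int(record["tenure"])
    events = [(cid, start, "AccountCreation", Decimal("0.00"))]
    for month in range(tenure):
        events.append((cid, add_months(start, month), "Charge", Decimal(record["MonthlyCharges"])))
    if record["Churn"] == "Yes":
        events.append((cid, add_months(start, tenure), "AccountTermination", Decimal("0.00")))
    return events


record = {"customerID": "0001-EXAMPLE", "tenure": 3, "MonthlyCharges": "29.85", "Churn": "Yes"}
sample_events = sketch_billing_events(record)
for event in sample_events:
    print(event)
```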



Our next step is to generate customer metadata, which includes the following fields:

  - gender
  - date of birth (we'll derive age and senior citizen status from this)
  - Partner
  - Dependents
  
We'll calculate date of birth by using the hash of the customer ID as a pseudorandom number and then assuming that ages are uniformly distributed between 18 and 65 and exponentially distributed over 65.
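
The idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `customer_meta`: we use SHA-256 instead of Spark's hash function, and the share of customers placed in the over-65 tail (`senior_share`), the mean of the exponential tail (`tail_mean_years`), and the reference date are all hypothetical parameters.

```python
import hashlib
import math
from datetime import date, timedelta


def unit_interval(customer_id, salt):
    """Map (customer_id, salt) deterministically to a float in [0, 1)."""
    digest = hashlib.sha256(("%s:%s" % (salt, customer_id)).encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / float(2 ** 53)


def sketch_age(customer_id, senior_share=0.2, tail_mean_years=5.0):
    """Return a pseudorandom age in years that is stable for a given customer ID."""
    u = unit_interval(customer_id, "age")
    if unit_interval(customer_id, "senior") < senior_share:
        # Inverse-transform sampling of an exponential tail above 65.
        return 65.0 - tail_mean_years * math.log(1.0 - u)
    # Uniform between 18 and 65.
    return 18.0 + u * (65.0 - 18.0)


def sketch_date_of_birth(customer_id, today=date(2022, 4, 5)):
    """Convert the sketched age into a date of birth relative to `today`."""
    return today - timedelta(days=int(sketch_age(customer_id) * 365.25))


print(sketch_date_of_birth("0001-EXAMPLE"))
```

Because the age depends only on the customer ID, repeated runs produce the same date of birth for the same customer, which keeps the generated tables joinable.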

In [10]:
from churn.augment import customer_meta
customerMeta = customer_meta(df)

2022-04-05 09:36:31,848 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
2022-04-05 09:36:31,849 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
2022-04-05 09:36:33,683 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
2022-04-05 09:36:33,683 WARN metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore yuanli@127.0.1.1
2022-04-05 09:36:33,811 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
2022-04-05 09:36:33,892 WARN rapids.GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> name#326 could run on GPU
  @Expression <AttributeReference> database#327 could 

Now we can generate customer phone features, which include:

  - customerID
  - feature (one of "PhoneService" or "MultipleLines")
  - value (always "Yes"; there are no records for "No" or "No Phone Service")
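
As a plain-Python illustration of this long format, the sketch below turns selected columns of one wide record into `(customerID, feature, value)` rows and drops every value that is not explicitly kept. `sketch_feature_rows` and the sample record are hypothetical; the actual `phone_features` function operates on Spark data frames.

```python
def sketch_feature_rows(record, features, keep):
    """Turn selected wide columns into (customerID, feature, value) rows,
    keeping only the values listed in `keep`."""
    return [
        (record["customerID"], name, record[name])
        for name in features
        if record[name] in keep
    ]


# Hypothetical customer with phone service but without multiple lines.
record = {"customerID": "0001-EXAMPLE", "PhoneService": "Yes", "MultipleLines": "No"}
phone_rows = sketch_feature_rows(record, ["PhoneService", "MultipleLines"], keep={"Yes"})
print(phone_rows)  # [('0001-EXAMPLE', 'PhoneService', 'Yes')]
```

A customer without phone service produces no rows at all under this scheme, so the phone-features table can contain fewer distinct customers than the source data.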

In [11]:
from churn.augment import phone_features
customerPhoneFeatures = phone_features(df)

Customer internet features include:

  - customerID
  - feature (one of "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies")
  - value (one of "Fiber", "DSL", or "Yes"; there are no records for "No" or "No internet service")

In [12]:
from churn.augment import internet_features
customerInternetFeatures = internet_features(df)

Customer account features include:

  - customerID
  - feature (one of "Contract", "PaperlessBilling", "PaymentMethod")
  - value (one of "Month-to-month", "One year", "Two year", "Yes", "Credit card (automatic)", "Mailed check", "Bank transfer (automatic)", "Electronic check")

In [13]:
from churn.augment import account_features
customerAccountFeatures = account_features(df)

# Write outputs

In [14]:
%%time

from churn.augment import write_df

write_df(billingEvents, "billing_events", partition_by="month")
write_df(customerMeta, "customer_meta", skip_replication=True)
write_df(customerPhoneFeatures, "customer_phone_features")
write_df(customerInternetFeatures.orderBy("customerID"), "customer_internet_features")
write_df(customerAccountFeatures, "customer_account_features")

2022-04-05 09:36:36,792 WARN rapids.GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> name#798 could run on GPU
  @Expression <AttributeReference> database#799 could run on GPU
  @Expression <AttributeReference> description#800 could run on GPU
  @Expression <AttributeReference> tableType#801 could run on GPU
  @Expression <AttributeReference> isTemporary#802 could run on GPU

2022-04-05 09:36:37,142 WARN rapids.GpuOverrides: 
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> format_string(%s-%s, customerID#0, u_value#337) AS customerID#816 could run on GPU
    ! <FormatString> format_string(%s-%s, customerID#0, u_value#337) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatString
      @Expression <Literal> %s-%s

2022-04-05 09:36:37,476 WARN rapids.GpuOverrides: 
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> format_string(%s-%s, customerID#0, u_value#337) AS customerID#816 could run on GPU
    ! <FormatString> format_string(%s-%s, customerID#0, u_value#337) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatString
      @Expression <Literal> %s-%s could run on GPU
      @Expression <AttributeReference> customerID#0 could run on GPU
      @Expression <AttributeReference> u_value#337 could run on GPU
  @Expression <AttributeReference> kind#133 could run on GPU
  @Expression <AttributeReference> value#136 could run on GPU
  @Expression <AttributeReference> date#156 could run on GPU
  @Expression <AttributeReference> month#315 could run on GPU
        !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
          @Expression <AttributeRefere

2022-04-05 09:40:21,129 WARN rapids.GpuOverrides:                               
  !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
    @Partitioning <RangePartitioning> could run on GPU
      @Expression <SortOrder> customerID#395 ASC NULLS FIRST could run on GPU
        @Expression <AttributeReference> customerID#395 could run on GPU
    !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
      @Expression <Alias> format_string(%s-%s, customerID#0, u_value#337) AS customerID#395 could run on GPU
        ! <FormatString> format_string(%s-%s, customerID#0, u_value#337) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatString
          @Expression <Literal> %s-%s could run on GPU
          @Expression <AttributeReference> customerID#0 could run on GPU
          @Expression <AttributeReference> u_value#337 could run on GPU

2022-04-05 09:40:23,697 WARN rapids.GpuOverrides: 
  !Exec <AQEShuffleReadExec> cannot run on GPU because Unable to replace CustomShuffleReader due to child not being columnar

2022-04-05 09:40:24,451 WARN rapids.GpuOverrides:                               
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> name#894 could run on GPU
  @Expression <AttributeReference> database#895 could run on GPU
  @Expression <AttributeReference> description#896 could run on GPU
  @Expression <AttributeReference> tableType#897 could run on GPU
  @Expression <AttributeReference> isTemporary#898 could run on GPU

2022-04-05 09:40:24,499 WARN rapids.GpuOverrides: 
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> format_string(%s-%s, customerID#0, u_value#337) AS customerID#910 could run on 

2022-04-05 09:40:28,911 WARN rapids.GpuOverrides:                               
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> name#998 could run on GPU
  @Expression <AttributeReference> database#999 could run on GPU
  @Expression <AttributeReference> description#1000 could run on GPU
  @Expression <AttributeReference> tableType#1001 could run on GPU
  @Expression <AttributeReference> isTemporary#1002 could run on GPU

2022-04-05 09:40:28,964 WARN rapids.GpuOverrides: 
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> format_string(%s-%s, customerID#721, u_value#337) AS customerID#1014 could run on GPU
    ! <FormatString> format_string(%s-%s, customerID#721, u_value#337) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatSt

CPU times: user 214 ms, sys: 34 ms, total: 248 ms
Wall time: 3min 54s


                                                                                

In [15]:
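# Count the distinct customers in each output table.
# Note: these paths do not include `output_prefix`; adjust them if the
# tables were written under a non-empty prefix.
for f in ["billing_events", "customer_meta", "customer_phone_features", "customer_internet_features", "customer_account_features"]:
    output_df = session.read.parquet("%s.parquet" % f)
    print(f, output_df.select("customerID").distinct().count())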

                                                                                

billing_events 703200
customer_meta 703200
customer_phone_features 635200
customer_internet_features 551200
customer_account_features 703200
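
These counts can be cross-checked with a little arithmetic. The sketch below uses only numbers already shown in this notebook (`dup_times = 100` and the 7032 non-null source records) and assumes that every source customer is replicated exactly `dup_times` times.

```python
dup_times = 100
non_null_records = 7032  # reported when the source dataset was loaded

# Exact distinct-customer counts printed by the cell above.
distinct_customers = {
    "billing_events": 703200,
    "customer_meta": 703200,
    "customer_phone_features": 635200,
    "customer_internet_features": 551200,
    "customer_account_features": 703200,
}

# Source customers per table, assuming each one appears dup_times times.
per_source = {
    table: divmod(count, dup_times) for table, count in distinct_customers.items()
}

for table, (source_customers, remainder) in per_source.items():
    print(table, source_customers, remainder)

print(non_null_records * dup_times == distinct_customers["billing_events"])
```

Under that assumption, the phone and internet tables cover 6352 and 5512 source customers respectively, which is consistent with those tables holding no rows for customers who lack the corresponding service.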


In [16]:
import pyspark.sql.functions as F
from functools import reduce

output_dfs = []

for f in ["billing_events", "customer_meta", "customer_phone_features", "customer_internet_features", "customer_account_features"]:
    output_dfs.append(
        session.read.parquet("%s.parquet" % f).select(
            F.lit(f).alias("table"),
            "customerID"
        )
    )

all_customers = reduce(lambda l, r: l.unionAll(r), output_dfs)

                                                                                

In [17]:

each_table = all_customers.groupBy("table").agg(F.approx_count_distinct("customerID").alias("approx_unique_customers"))
overall = all_customers.groupBy(F.lit("all").alias("table")).agg(F.approx_count_distinct("customerID").alias("approx_unique_customers"))

each_table.union(overall).show()

2022-04-05 09:41:25,790 WARN rapids.GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
      !Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced
        @Expression <AttributeReference> table#1189 could run on GPU
        @Expression <AggregateExpression> approx_count_distinct(customerID#1179, 0.05, 0, 0) could run on GPU
          ! <HyperLogLogPlusPlus> approx_count_distinct(customerID#1179, 0.05, 0, 0) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
            @Expression 

+--------------------+-----------------------+
|               table|approx_unique_customers|
+--------------------+-----------------------+
|      billing_events|                 699470|
|       customer_meta|                 699470|
|customer_phone_fe...|                 631148|
|customer_internet...|                 521053|
|customer_account_...|                 699470|
|                 all|                 699470|
+--------------------+-----------------------+



In [18]:
rows = each_table.union(overall).collect()

2022-04-05 09:42:47,133 WARN rapids.GpuOverrides: 
  !Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced
    @Expression <AttributeReference> table#1189 could run on GPU
    @Expression <AggregateExpression> approx_count_distinct(customerID#1179, 0.05, 0, 0) could run on GPU
      ! <HyperLogLogPlusPlus> approx_count_distinct(customerID#1179, 0.05, 0, 0) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
        @Expression <AttributeReference> customerID#1179 could run on GPU
    @Expression <AttributeReference> approx_count_distinct(customerID#1179, 0.05, 0, 0)#1352L could run on GPU
    @Expression <AttributeReference> table#1189 could run on GPU
    @Expression <Alias> approx_count_distinct(customerID#1179, 0.05, 0, 0)#1352L AS approx_unique_customers#1353L could run on GPU
      @Expression <AttributeReference> approx_count_distinct(customerID#1179, 

In [19]:
dict([(row[0], row[1]) for row in rows])

{'billing_events': 699470,
 'customer_meta': 699470,
 'customer_phone_features': 631148,
 'customer_internet_features': 521053,
 'customer_account_features': 699470,
 'all': 699470}
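
The approximate counts differ from the exact distinct counts computed earlier because `approx_count_distinct` is a HyperLogLog++ estimate; the logged plan shows it running with a relative standard deviation of 0.05. The sketch below compares the two sets of numbers from this notebook. The "all" row is omitted because no exact overall count was computed here.

```python
# Exact distinct counts, from the per-table distinct().count() cell.
exact = {
    "billing_events": 703200,
    "customer_meta": 703200,
    "customer_phone_features": 635200,
    "customer_internet_features": 551200,
    "customer_account_features": 703200,
}

# Approximate counts, from approx_count_distinct.
approx = {
    "billing_events": 699470,
    "customer_meta": 699470,
    "customer_phone_features": 631148,
    "customer_internet_features": 521053,
    "customer_account_features": 699470,
}

relative_error = {
    table: abs(approx[table] - exact[table]) / exact[table] for table in exact
}

for table, err in relative_error.items():
    print("%-28s %.2f%%" % (table, 100 * err))
```

Most tables are well within 1% of the exact count, while the internet-features estimate is off by a little more than 5%. The 0.05 setting describes the spread of the estimator rather than a hard bound, so an individual estimate can fall outside it.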