# Transforming and joining raw data

The "raw" data is divided among the following tables:

- **Customer metadata**
  - customerID
  - gender
  - date of birth (we'll derive age and senior citizen status from this)
  - Partner
  - Dependents
  - (nominal) MonthlyCharges
- **Billing events**
  - customerID
  - date (we'll derive tenure from the number/duration of billing events)
  - kind (one of "AccountCreation", "Charge", or "AccountTermination")
  - value (either a positive amount or 0.00; we'll derive TotalCharges from the sum of the amounts and Churn from the existence of an AccountTermination event)
- **Customer phone features**
  - customerID
  - feature (one of "PhoneService" or "MultipleLines")
  - value (a string, as in the other feature tables)
- **Customer internet features**
  - customerID
  - feature (one of "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies")
  - value (one of "Fiber optic", "DSL", "Yes", "No")
- **Customer account features**
  - customerID
  - feature (one of "Contract", "PaperlessBilling", "PaymentMethod")
  - value (one of "Month-to-month", "One year", "Two year", "No", "Yes", "Credit card (automatic)", "Mailed check", "Bank transfer (automatic)", "Electronic check")

We want to join these together to reconstitute a training data set with this schema:

- customerID
- gender
- SeniorCitizen
- Partner
- Dependents
- tenure
- PhoneService
- MultipleLines
- InternetService
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies
- Contract
- PaperlessBilling
- PaymentMethod
- MonthlyCharges
- TotalCharges
- Churn

In [1]:
# notebook parameters

import os

spark_master = "spark://ip:port"  # placeholder: replace with your Spark master URL
app_name = "churn-etl"
input_files = dict(
    billing="billing_events", 
    account_features="customer_account_features", 
    internet_features="customer_internet_features", 
    meta="customer_meta", 
    phone_features="customer_phone_features"
)
output_file = "churn-etl"
output_prefix = ""
output_mode = "overwrite"
output_kind = "parquet"
input_kind = "parquet"
driver_memory = '8g'
executor_memory = '8g'


In [2]:
import pyspark

# Build (or reuse) a session configured from the notebook parameters above
session = (
    pyspark.sql.SparkSession.builder
    .master(spark_master)
    .appName(app_name)
    .config("spark.eventLog.enabled", True)
    .config("spark.eventLog.dir", ".")
    .config("spark.driver.memory", driver_memory)
    .config("spark.executor.memory", executor_memory)
    .getOrCreate()
)
session

In [3]:
import churn.etl

churn.etl.register_options(
    spark_master = spark_master,
    app_name = app_name,
    input_files = input_files,
    output_prefix = output_prefix,
    output_mode = output_mode,
    output_kind = output_kind,
    input_kind = input_kind,
    driver_memory = driver_memory,
    executor_memory = executor_memory
)

# Reconstructing billing events and charges

In [4]:
from churn.etl import read_df
billing_events = read_df(session, input_files["billing"])
billing_events.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- kind: string (nullable = true)
 |-- value: decimal(8,2) (nullable = true)
 |-- date: date (nullable = true)
 |-- month: string (nullable = true)



In [5]:
from churn.etl import join_billing_data
customer_billing = join_billing_data(billing_events)

In [6]:
customer_billing

DataFrame[customerID: string, Churn: boolean, tenure: bigint, TotalCharges: decimal(18,2)]
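
The internals of `join_billing_data` are not shown here, so the following is only a plain-Python sketch of the aggregation described at the top of this notebook. It assumes that tenure is the number of `Charge` events, that `TotalCharges` is the sum of their values, and that `Churn` is true when an `AccountTermination` event exists. The sample rows and the `summarize_billing` helper are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample rows shaped like billing_events:
# (customerID, kind, value, date)
events = [
    ("c1", "AccountCreation", Decimal("0.00"), "2021-01-01"),
    ("c1", "Charge", Decimal("29.85"), "2021-01-01"),
    ("c1", "Charge", Decimal("29.85"), "2021-02-01"),
    ("c1", "AccountTermination", Decimal("0.00"), "2021-02-15"),
    ("c2", "AccountCreation", Decimal("0.00"), "2021-01-01"),
    ("c2", "Charge", Decimal("56.95"), "2021-01-01"),
]


def summarize_billing(rows):
    """Aggregate billing events into one summary record per customer."""
    summary = defaultdict(
        lambda: {"Churn": False, "tenure": 0, "TotalCharges": Decimal("0.00")}
    )
    for customer_id, kind, value, _date in rows:
        entry = summary[customer_id]
        if kind == "Charge":
            # Assumption: one Charge event corresponds to one unit of tenure
            entry["tenure"] += 1
            entry["TotalCharges"] += value
        elif kind == "AccountTermination":
            entry["Churn"] = True
    return dict(summary)


billing_summary = summarize_billing(events)
```

The result has the same columns as `customer_billing` above: `Churn`, `tenure`, and `TotalCharges`, keyed by `customerID`.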

When we aggregated billing data, we also captured a unique list of customers in a temporary view.  For convenience, we can access it as follows:

In [7]:
from churn.etl import customers as get_customers
customers = get_customers()

# Reconstructing phone features


In [8]:
phone_features = read_df(session, input_files["phone_features"])
phone_features.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- feature: string (nullable = true)
 |-- value: string (nullable = true)



In [9]:
from churn.etl import join_phone_features
customer_phone_features = join_phone_features(phone_features)

# Reconstructing internet features

Whereas the phone features only record whether a customer has phone service and multiple lines, accounts carry several internet-specific features:

- `InternetService` (one of `Fiber optic` or `DSL` in the "raw" data; its absence translates to `No` in the processed data)
- `OnlineSecurity` (`Yes` in the "raw" data if present; one of `No`, `Yes`, or `No internet service` in the processed data)
- `OnlineBackup` (`Yes` in the "raw" data if present; one of `No`, `Yes`, or `No internet service` in the processed data)
- `DeviceProtection` (`Yes` in the "raw" data if present; one of `No`, `Yes`, or `No internet service` in the processed data)
- `TechSupport` (`Yes` in the "raw" data if present; one of `No`, `Yes`, or `No internet service` in the processed data)
- `StreamingTV` (`Yes` in the "raw" data if present; one of `No`, `Yes`, or `No internet service` in the processed data)
- `StreamingMovies` (`Yes` in the "raw" data if present; one of `No`, `Yes`, or `No internet service` in the processed data)

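Because rows that are absent from the "raw" data have to be translated into explicit `No` or `No internet service` values, these joins are slightly more involved.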

In [10]:
internet_features = read_df(session, input_files["internet_features"])
internet_features.printSchema()
internet_features.show()

root
 |-- customerID: string (nullable = true)
 |-- feature: string (nullable = true)
 |-- value: string (nullable = true)



2022-04-05 09:59:39,224 WARN rapids.GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU



+--------------------+---------------+-----+
|          customerID|        feature|value|
+--------------------+---------------+-----+
|7590-VHVEG-Mg8VG5...|InternetService|  DSL|
|7590-VHVEG-5xLi5Z...|InternetService|  DSL|
|7590-VHVEG-ZePlJi...|InternetService|  DSL|
|7590-VHVEG-x9IoNd...|InternetService|  DSL|
|7590-VHVEG-Z9yCIk...|InternetService|  DSL|
|7590-VHVEG-K8kBya...|InternetService|  DSL|
|7590-VHVEG-4ZjnIU...|InternetService|  DSL|
|7590-VHVEG-0stTDJ...|InternetService|  DSL|
|7590-VHVEG-lqhKlh...|InternetService|  DSL|
|7590-VHVEG-4Y_zUA...|InternetService|  DSL|
|7590-VHVEG-34V86Q...|InternetService|  DSL|
|7590-VHVEG-GCNzU2...|InternetService|  DSL|
|7590-VHVEG-i0AFUE...|InternetService|  DSL|
|7590-VHVEG-F1ALBc...|InternetService|  DSL|
|7590-VHVEG-aEfHl7...|InternetService|  DSL|
|7590-VHVEG-eiqTDe...|InternetService|  DSL|
|7590-VHVEG-3K15yQ...|InternetService|  DSL|
|7590-VHVEG-iMYyeZ...|InternetService|  DSL|
|7590-VHVEG-rReekB...|InternetService|  DSL|
|7590-VHVE

In [11]:
from churn.etl import join_internet_features
customer_internet_features = join_internet_features(internet_features)
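
As a rough illustration of the translation rules listed above, here is a plain-Python sketch. It is not the `join_internet_features` implementation, which is not shown in this notebook; the `widen_internet_features` helper and the sample rows are hypothetical.

```python
ADD_ONS = [
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]


def widen_internet_features(customer_ids, rows):
    """Turn long (customerID, feature, value) rows into one record per customer."""
    raw = {}
    for customer_id, feature, value in rows:
        raw.setdefault(customer_id, {})[feature] = value

    wide = {}
    for customer_id in customer_ids:
        features = raw.get(customer_id, {})
        # An absent InternetService row means the customer has no internet service
        service = features.get("InternetService", "No")
        record = {"InternetService": service}
        for add_on in ADD_ONS:
            if service == "No":
                record[add_on] = "No internet service"
            else:
                # An absent add-on row means the customer did not select it
                record[add_on] = features.get(add_on, "No")
        wide[customer_id] = record
    return wide


# Hypothetical sample data: c3 has no rows at all in the raw table
sample_rows = [
    ("c1", "InternetService", "DSL"),
    ("c1", "OnlineBackup", "Yes"),
    ("c2", "InternetService", "Fiber optic"),
]
internet_wide = widen_internet_features(["c1", "c2", "c3"], sample_rows)
```

Note that the full customer list is needed as an input: a customer with no internet service has no rows in the raw table, so they can only appear in the result if we start from the complete set of customers.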

# Reconstructing account features

In [12]:
account_features = read_df(session, input_files["account_features"])
account_features.printSchema()
account_features.show()

root
 |-- customerID: string (nullable = true)
 |-- feature: string (nullable = true)
 |-- value: string (nullable = true)

+--------------------+-------------+----------------+
|          customerID|      feature|           value|
+--------------------+-------------+----------------+
|7590-VHVEG-Mg8VG5...|PaymentMethod|Electronic check|
|7590-VHVEG-5xLi5Z...|PaymentMethod|Electronic check|
|7590-VHVEG-ZePlJi...|PaymentMethod|Electronic check|
|7590-VHVEG-x9IoNd...|PaymentMethod|Electronic check|
|7590-VHVEG-Z9yCIk...|PaymentMethod|Electronic check|
|7590-VHVEG-K8kBya...|PaymentMethod|Electronic check|
|7590-VHVEG-4ZjnIU...|PaymentMethod|Electronic check|
|7590-VHVEG-0stTDJ...|PaymentMethod|Electronic check|
|7590-VHVEG-lqhKlh...|PaymentMethod|Electronic check|
|7590-VHVEG-4Y_zUA...|PaymentMethod|Electronic check|
|7590-VHVEG-34V86Q...|PaymentMethod|Electronic check|
|7590-VHVEG-GCNzU2...|PaymentMethod|Electronic check|
|7590-VHVEG-i0AFUE...|PaymentMethod|Electronic check|
|7590-VHVEG-


In [13]:
from churn.etl import join_account_features
customer_account_features = join_account_features(account_features)
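
The account features arrive in long form (one row per customer and feature) and need to end up in wide form (one row per customer). A minimal plain-Python sketch of that reshaping follows; the `pivot_features` helper and the sample rows are hypothetical, and how `join_account_features` actually does this in Spark is not shown here.

```python
ACCOUNT_COLUMNS = ["Contract", "PaperlessBilling", "PaymentMethod"]


def pivot_features(rows, columns):
    """Pivot long (customerID, feature, value) rows into one record per customer."""
    wide = {}
    for customer_id, feature, value in rows:
        # Start each customer with every expected column set to None
        record = wide.setdefault(customer_id, dict.fromkeys(columns))
        record[feature] = value
    return wide


# Hypothetical sample data: c2 has no PaperlessBilling row
sample_rows = [
    ("c1", "Contract", "Month-to-month"),
    ("c1", "PaperlessBilling", "Yes"),
    ("c1", "PaymentMethod", "Electronic check"),
    ("c2", "Contract", "Two year"),
    ("c2", "PaymentMethod", "Mailed check"),
]
account_wide = pivot_features(sample_rows, ACCOUNT_COLUMNS)
```

A feature with no row for a given customer stays `None` in this sketch, which mirrors the null a left outer join would produce.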

# Account metadata

In [14]:
account_meta = read_df(session, input_files["meta"])
account_meta.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- dateOfBirth: date (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- MonthlyCharges: decimal(8,2) (nullable = true)
 |-- now: timestamp (nullable = true)



In [15]:
from churn.etl import process_account_meta
customer_account_meta = process_account_meta(account_meta)
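
Earlier we said that age and senior citizen status are derived from the date of birth. The sketch below shows one way that derivation could work in plain Python, measured against a reference date such as the `now` column. The age threshold of 65 and both helper functions are assumptions made for illustration; `process_account_meta` may do this differently.

```python
from datetime import date

SENIOR_AGE = 65  # assumed threshold, not taken from churn.etl


def age_on(date_of_birth, today):
    """Return the age in whole years on the reference date."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def is_senior(date_of_birth, today):
    """Return True if the customer has reached the assumed senior age."""
    return age_on(date_of_birth, today) >= SENIOR_AGE


reference_date = date(2022, 4, 5)
age_day_before_birthday = age_on(date(1957, 4, 6), reference_date)
age_on_birthday = age_on(date(1957, 4, 5), reference_date)
```

Comparing `(month, day)` tuples avoids the off-by-one error that a plain subtraction of years would introduce for customers whose birthday has not yet occurred in the reference year.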

# Putting it all together

In [16]:
from churn.etl import chained_join
from churn.etl import forcefloat

wide_data = chained_join(
    "customerID",
    customers,
    [
        customer_billing,
        customer_phone_features,
        customer_internet_features,
        customer_account_features,
        customer_account_meta
    ]
).select(
    "customerID", 
    "gender", 
    "SeniorCitizen", 
    "Partner", 
    "Dependents", 
    "tenure", 
    "PhoneService", 
    "MultipleLines", 
    "InternetService", 
    "OnlineSecurity", 
    "OnlineBackup", 
    "DeviceProtection", 
    "TechSupport", 
    "StreamingTV", 
    "StreamingMovies", 
    "Contract", 
    "PaperlessBilling", 
    "PaymentMethod", 
    forcefloat("MonthlyCharges"),
    forcefloat("TotalCharges"), 
    "Churn"
)

In [17]:
wide_data.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [customerID#0, gender#265, SeniorCitizen#279, Partner#267, Dependents#268, tenure#61L, PhoneService#97, MultipleLines#101, InternetService#199, OnlineSecurity#200, OnlineBackup#201, DeviceProtection#202, TechSupport#203, StreamingTV#204, StreamingMovies#205, Contract#233, PaperlessBilling#258, PaymentMethod#239, cast(MonthlyCharges#269 as float) AS MonthlyCharges#366, cast(TotalCharges#62 as float) AS TotalCharges#367, Churn#41]
   +- BroadcastHashJoin [customerID#0], [customerID#263], LeftOuter, BuildRight, false
      :- Project [customerID#0, Churn#41, tenure#61L, TotalCharges#62, PhoneService#97, MultipleLines#101, InternetService#199, OnlineSecurity#200, OnlineBackup#201, DeviceProtection#202, TechSupport#203, StreamingTV#204, StreamingMovies#205, Contract#233, PaperlessBilling#258, PaymentMethod#239]
      :  +- SortMergeJoin [customerID#0], [customerID#324], LeftOuter
      :     :- Project [customerID#0, Churn#4
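
The plan above shows a sequence of `LeftOuter` joins on `customerID`. As a sketch of that idea, and not of the actual `chained_join` implementation, the following plain-Python code folds a list of tables into a base table with repeated left outer joins. The `left_join` and `chained_left_join` helpers are hypothetical and assume at most one row per key in each right-hand table.

```python
from functools import reduce


def left_join(left_rows, right_rows, key):
    """Left outer join of two lists of dict rows on a shared key column."""
    right_by_key = {row[key]: row for row in right_rows}
    right_columns = {col for row in right_rows for col in row if col != key}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for col in right_columns:
            # Unmatched left rows get None, like NULL in a SQL left outer join
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


def chained_left_join(key, base, others):
    """Left-join each table in `others` onto `base`, one after another."""
    return reduce(lambda acc, rows: left_join(acc, rows, key), others, base)


# Hypothetical sample tables
base = [{"customerID": "c1"}, {"customerID": "c2"}]
billing = [{"customerID": "c1", "tenure": 2}]
meta = [
    {"customerID": "c1", "gender": "Female"},
    {"customerID": "c2", "gender": "Male"},
]
wide_rows = chained_left_join("customerID", base, [billing, meta])
```

Starting from the complete customer list and using left outer joins means that no customer is dropped just because one of the feature tables has no row for them.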

In [18]:
%%time
from churn.etl import write_df
write_df(wide_data, output_file)



CPU times: user 1.15 s, sys: 188 ms, total: 1.34 s
Wall time: 2min 58s


# Inspecting individual tables

If we need to inspect individual components of our processing, we can: each constituent of these joins is registered as a temporary view. For example, we loaded `customers` earlier using a function from `churn.etl`, but it is also available as a table:

In [19]:
customers = session.table("customers")

In [20]:
customers.show()

+--------------------+
|          customerID|
+--------------------+
|9102-OXKFY-fCyaBG...|
|5478-JJVZK-pUoKfE...|
|1843-TLSGD-PjdWrt...|
|2027-FECZV-5HMwOd...|
|3793-MMFUH-FBa4QK...|
|5360-XGYAZ-F1ALBc...|
|1843-TLSGD-L@JxWt...|
|5872-OEQNH-5NXyac...|
|6773-LQTVT-XB@vuC...|
|3301-VKTGC-PjdWrt...|
|9251-AWQGT-fCyaBG...|
|9830-ECLEN-lqhKlh...|
|7969-FFOWG-fPARzA...|
|9451-WLYRI-0stTDJ...|
|4293-ETKAP-dkh3P1...|
|6281-FKEWS-0V3zMQ...|
|8220-OCUFY-PjdWrt...|
|0578-SKVMF-GSLp0h...|
|2165-VOEGB-K8kBya...|
|6754-WKSHP-rt81Nn...|
+--------------------+
only showing top 20 rows



We can see which tables are available by querying the session catalog:

In [21]:
tables = session.catalog.listTables()
[t.name for t in tables]

['churned',
 'contracts',
 'counts_and_charges',
 'customer_account_features',
 'customer_account_meta',
 'customer_billing',
 'customer_charges',
 'customer_internet_features',
 'customer_phone_features',
 'customers',
 'device_protection',
 'internet_service',
 'multiple_lines',
 'online_backup',
 'online_security',
 'paperless',
 'payment',
 'phone_service',
 'streaming_movies',
 'streaming_tv',
 'tech_support',
 'terminations']

# Finishing up

In [22]:
session.stop()