This notebook engineers a limited set of features from the transaction and user logs for use in predicting churn.  It should be run on a cluster using **Databricks ML 7.1+** and **CPU-based** nodes.

###Step 1: Engineer Features from the Transaction Log

When we access the transaction log, we must limit our results to information that would have been available just before the start of the period of interest and nothing later.  For example, if we are examining churn in February 2017, we would use transaction log data through January 31, 2017 but nothing dated on or after February 1, 2017.
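
As a minimal sketch of this cutoff rule, the Python snippet below filters a handful of made-up transactions against the start of the period. The records and variable names are hypothetical and are not drawn from the KKBox tables:

```python
from datetime import date

# Hypothetical transaction records: (subscription_id, transaction_date)
transactions = [
    (78, date(2017, 1, 1)),
    (78, date(2017, 1, 31)),
    (78, date(2017, 2, 1)),   # first day of the period of interest: excluded
    (78, date(2017, 2, 15)),  # inside the period of interest: excluded
]

period_start = date(2017, 2, 1)

# Keep only what would have been known before the period begins.
visible = [t for t in transactions if t[1] < period_start]

# The most recent visible transaction date for the subscription.
last_at = max(d for _, d in visible)
print(last_at)
```

The strict `<` comparison is what keeps February 1 itself out of the feature data.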

Knowing which subscriptions are still active heading into the period of interest and when those subscriptions started, we can define for each one a range of dates running from the start of the subscription through the day prior to the start of the period of interest.  Transaction log features are derived from within this range.  The ranges are calculated in the next cell, presented here in isolation so that the logic can be reviewed more easily before it is applied in the feature engineering query below it:

In [0]:
%sql

SELECT
  a.msno,
  b.subscription_id,
  b.subscription_start as start_at,
  c.last_at
FROM kkbox.train a  -- LIMIT ANALYSIS TO AT-RISK SUBSCRIBERS IN THE TRAINING PERIOD
LEFT OUTER JOIN (   -- subscriptions not yet churned heading into the period of interest
  SELECT *
  FROM kkbox.subscription_windows 
  WHERE subscription_start < '2017-02-01' AND subscription_end > DATE_ADD('2017-02-01', -30)
  )b
  ON a.msno=b.msno
LEFT OUTER JOIN (
  SELECT            -- last transaction date prior to the start of the at-risk period (we could also have just set this to the day prior to the start of the period of interest)
    subscription_id,
    MAX(transaction_date) as last_at
  FROM kkbox.transactions_enhanced
  WHERE transaction_date < '2017-02-01'
  GROUP BY subscription_id
  ) c
  ON b.subscription_id=c.subscription_id

msno,subscription_id,start_at,last_at
+AYYJsSTdy+yE0syWm6hCVRwfPIs70XzTSdGmFLMG8A=,78,2015-01-01,2017-01-01
+o4y9xpyRuBNtNToxlUHDeZ4IQD1qcVDUqmr7QNLp+M=,296,2015-01-01,2017-01-20
/NPUZNOWx2pYneDKcJjqmi35N8ogQBB1PTBm7NFJdK8=,513,2015-01-01,2017-01-24
/NcvrGvDJxRqPUt0gw/KRLR/Cu0nTHwHckXAEQ4NRDA=,516,2015-01-01,2017-01-12
/ZZqY4JVjPqsNEjCkH9C1Yc9FB49B1GcHu7RcIKf2Ec=,580,2015-01-01,2017-01-10
0Eeobh+ZuSEUrfTgwEyb5ORTE5VpQWoxnh6NR84Ytaw=,799,2015-01-01,2017-01-31
0L53joPKh8aWQLT2hKzuJwRgpyeMd+Zt61gNQCcI5d8=,833,2015-01-01,2017-01-02
0QKdErs5Evxxz6C7xRKcHsn33HzKvBKYvmWrR/RLvcg=,853,2015-01-01,2017-01-21
0WUK4180i/Lt5FB/xQkEBq8eR/AcvB7IDhZU3nEynIs=,883,2015-01-01,2017-01-26
0m5bofQVssHvDdcSkCBBIY9rHzQQzUv/8Xh5+ltuXjw=,970,2015-01-01,2017-01-18
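
The subscription filter in the query above can be restated as a small predicate. This is only a sketch: the function name and the `grace_days` parameter are our own labels for the literal `-30` in the `WHERE` clause, and the dates are invented:

```python
from datetime import date, timedelta

def is_candidate(subscription_start, subscription_end, period_start, grace_days=30):
    """Mirror of the WHERE clause: the subscription started before the period
    and had not ended more than `grace_days` before the period start."""
    return (
        subscription_start < period_start
        and subscription_end > period_start - timedelta(days=grace_days)
    )

period_start = date(2017, 2, 1)

# Ended on 2017-01-03, inside the 30-day allowance: still a candidate.
print(is_candidate(date(2015, 1, 1), date(2017, 1, 3), period_start))
# Ended on 2017-01-02, exactly 30 days before the period: filtered out.
print(is_candidate(date(2015, 1, 1), date(2017, 1, 2), period_start))
```

Both comparisons are strict, so a subscription that starts on the first day of the period is excluded as well.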


Using these date ranges, we can now derive features from the transaction log for the *current* subscription.  Note that it may also be worthwhile to derive information from all of a subscriber's prior subscriptions, but for this exercise we limit our feature engineering to information associated with the current subscription plus a simple count of the subscriber's subscriptions to date:

In [0]:
%sql
DROP TABLE IF EXISTS kkbox.train_trans_features;

CREATE TABLE kkbox.train_trans_features
USING DELTA
AS
  WITH transaction_window (  -- this is the query from above defined as a CTE
    SELECT
      a.msno,
      b.subscription_id,
      b.subscription_start as start_at,
      c.last_at
    FROM kkbox.train a
    LEFT OUTER JOIN (
      SELECT *
      FROM kkbox.subscription_windows 
      WHERE subscription_start < '2017-02-01' AND subscription_end > DATE_ADD('2017-02-01', -30)
      )b
      ON a.msno=b.msno
    LEFT OUTER JOIN (
      SELECT  
        subscription_id,
        MAX(transaction_date) as last_at
      FROM kkbox.transactions_enhanced
      WHERE transaction_date < '2017-02-01'
      GROUP BY subscription_id
      ) c
      ON b.subscription_id=c.subscription_id
      )
  SELECT
    a.msno,
    YEAR(b.start_at) as start_year,
    MONTH(b.start_at) as start_month,
    DATEDIFF(b.last_at, b.start_at) as subscription_age,
    c.renewals,
    c.total_list_price,
    c.total_amount_paid,
    c.total_discount,
    DATEDIFF('2017-02-01', LAST(a.transaction_date) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date)) as days_since_last_account_action,
    LAST(a.plan_list_price) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_plan_list_price,
    LAST(a.actual_amount_paid) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_actual_amount_paid,
    LAST(a.discount) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_discount,
    LAST(a.payment_plan_days) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_payment_plan_days,
    LAST(a.payment_method_id) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_payment_method,
    LAST(a.is_cancel) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_is_cancel,
    LAST(a.is_auto_renew) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_is_auto_renew,
    LAST(a.change_in_list_price) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_list_price,
    LAST(a.change_in_discount) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_discount,
    LAST(a.change_in_payment_plan_days) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_payment_plan_days,
    LAST(a.change_in_payment_method_id) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_payment_method_id,
    LAST(a.change_in_cancellation) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_cancellation,
    LAST(a.change_in_auto_renew) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_auto_renew,
    LAST(a.days_change_in_membership_expire_date) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_days_change_in_membership_expire_date,
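    -- note the sign: DATEDIFF(end, start) is negative here when the expiration date falls after the period start
    DATEDIFF('2017-02-01', LAST(a.membership_expire_date) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date)) as days_until_expiration,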
    d.total_subscription_count,
    e.city,
    CASE WHEN e.bd < 10 THEN NULL WHEN e.bd > 70 THEN NULL ELSE e.bd END as bd,
    CASE WHEN LOWER(e.gender)='female' THEN 0 WHEN LOWER(e.gender)='male' THEN 1 ELSE NULL END as gender,
    e.registered_via  
  FROM kkbox.transactions_enhanced a
  INNER JOIN transaction_window b
    ON a.subscription_id=b.subscription_id AND a.transaction_date = b.last_at
  INNER JOIN (
    SELECT  -- summary stats for current subscription
      x.subscription_id,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN x.plan_list_price ELSE 0 END) as total_list_price,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN x.actual_amount_paid ELSE 0 END) as total_amount_paid,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN x.discount ELSE 0 END) as total_discount,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN 1 ELSE 0 END) as renewals
    FROM kkbox.transactions_enhanced x
    INNER JOIN transaction_window y
      ON x.subscription_id=y.subscription_id AND x.transaction_date BETWEEN y.start_at AND y.last_at
    GROUP BY x.subscription_id
    ) c
    ON a.subscription_id=c.subscription_id
  INNER JOIN (
    SELECT  -- count of all unique subscriptions for each customer
      msno,
      COUNT(*) as total_subscription_count
    FROM kkbox.subscription_windows
    WHERE subscription_start < '2017-02-01'
    GROUP BY msno
    ) d
    ON a.msno=d.msno
  LEFT OUTER JOIN kkbox.members e
    ON a.msno=e.msno;
    
SELECT * FROM kkbox.train_trans_features;

msno,start_year,start_month,subscription_age,renewals,total_list_price,total_amount_paid,total_discount,days_since_last_account_action,last_plan_list_price,last_actual_amount_paid,last_discount,last_payment_plan_days,last_payment_method,last_is_cancel,last_is_auto_renew,last_change_in_list_price,last_change_in_discount,last_change_in_payment_plan_days,last_change_in_payment_method_id,last_change_in_cancellation,last_change_in_auto_renew,last_days_change_in_membership_expire_date,days_until_expiration,total_subscription_count,city,bd,gender,registered_via
+AYYJsSTdy+yE0syWm6hCVRwfPIs70XzTSdGmFLMG8A=,2015,1,731,24,3427,3576,-149,31,149,149,0,30,41,0,1,0,0,0,0,0,0,31,-1,1,1.0,,,7.0
+o4y9xpyRuBNtNToxlUHDeZ4IQD1qcVDUqmr7QNLp+M=,2015,1,750,25,2976,3125,-149,12,99,99,0,30,41,0,1,0,0,0,0,0,0,31,-19,1,1.0,,,7.0
/NPUZNOWx2pYneDKcJjqmi35N8ogQBB1PTBm7NFJdK8=,2015,1,754,25,3576,3725,-149,8,149,149,0,30,40,0,1,0,0,0,0,0,0,31,-23,1,13.0,,,7.0
/NcvrGvDJxRqPUt0gw/KRLR/Cu0nTHwHckXAEQ4NRDA=,2015,1,742,36,4566,4594,-28,20,99,99,0,30,41,0,1,0,0,0,0,0,0,31,-11,1,1.0,,,7.0
/ZZqY4JVjPqsNEjCkH9C1Yc9FB49B1GcHu7RcIKf2Ec=,2015,1,740,23,3402,3551,-149,22,180,180,0,30,36,0,0,0,0,0,0,0,0,31,-8,1,13.0,25.0,1.0,9.0
0Eeobh+ZuSEUrfTgwEyb5ORTE5VpQWoxnh6NR84Ytaw=,2015,1,761,25,3286,3165,121,1,129,129,0,30,41,0,1,0,0,0,0,0,0,28,-27,1,15.0,28.0,1.0,7.0
0L53joPKh8aWQLT2hKzuJwRgpyeMd+Zt61gNQCcI5d8=,2015,1,732,24,3427,3576,-149,30,149,149,0,30,37,0,1,0,0,0,0,0,0,31,-1,1,22.0,48.0,1.0,9.0
0QKdErs5Evxxz6C7xRKcHsn33HzKvBKYvmWrR/RLvcg=,2015,1,751,25,3456,3605,-149,11,129,129,0,30,41,0,1,0,0,0,0,0,0,31,-20,1,1.0,,,7.0
0WUK4180i/Lt5FB/xQkEBq8eR/AcvB7IDhZU3nEynIs=,2015,1,756,24,2727,2876,-149,6,99,99,0,30,41,0,1,0,0,0,0,0,0,31,-26,1,1.0,,,7.0
0m5bofQVssHvDdcSkCBBIY9rHzQQzUv/8Xh5+ltuXjw=,2015,1,748,24,2727,2876,-149,14,99,99,0,30,41,0,1,0,0,0,0,0,0,31,-17,1,1.0,,,7.0
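
To make the summary statistics in the query easier to follow, here is a rough Python equivalent of the conditional sums for a single subscription. The tuples are invented for illustration. Only transactions that push the membership expiration date forward contribute to the totals, which is what the `CASE WHEN x.days_change_in_membership_expire_date > 0` expressions do:

```python
# Hypothetical transactions for one subscription, as tuples of
# (days_change_in_membership_expire_date, plan_list_price, actual_amount_paid, discount).
transactions = [
    (30, 149, 149, 0),   # expiration moved forward: counted as a renewal
    (31, 149, 119, 30),  # expiration moved forward: counted as a renewal
    (0, 149, 0, 0),      # expiration date unchanged: ignored
    (-10, 0, 0, 0),      # expiration date pulled earlier: ignored
]

def summarize(transactions):
    """Sum prices, payments and discounts over expiration-extending transactions."""
    counted = [t for t in transactions if t[0] > 0]
    return {
        "renewals": len(counted),
        "total_list_price": sum(t[1] for t in counted),
        "total_amount_paid": sum(t[2] for t in counted),
        "total_discount": sum(t[3] for t in counted),
    }

summary = summarize(transactions)
print(summary)
```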


Modifying the dates, we can derive these same features for the test period, March 2017:

In [0]:
%sql
DROP TABLE IF EXISTS kkbox.test_trans_features;

CREATE TABLE kkbox.test_trans_features
USING DELTA
AS
  WITH transaction_window (
    SELECT
      a.msno,
      b.subscription_id,
      b.subscription_start as start_at,
      c.last_at
    FROM kkbox.test a  -- LIMIT ANALYSIS TO AT-RISK SUBSCRIBERS IN THE TESTING PERIOD
    LEFT OUTER JOIN (  -- subscriptions not yet churned heading into the period of interest
      SELECT *
      FROM kkbox.subscription_windows 
      WHERE subscription_start < '2017-03-01' AND subscription_end > DATE_ADD('2017-03-01', -30) 
      )b
      ON a.msno=b.msno
    LEFT OUTER JOIN (
      SELECT  
        subscription_id,
        MAX(transaction_date) as last_at
      FROM kkbox.transactions_enhanced
      WHERE transaction_date < '2017-03-01'
      GROUP BY subscription_id
      ) c
      ON b.subscription_id=c.subscription_id
      )
  SELECT
    a.msno,
    YEAR(b.start_at) as start_year,
    MONTH(b.start_at) as start_month,
    DATEDIFF(b.last_at, b.start_at) as subscription_age,
    c.renewals,
    c.total_list_price,
    c.total_amount_paid,
    c.total_discount,
    DATEDIFF('2017-03-01', LAST(a.transaction_date) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date)) as days_since_last_account_action,
    LAST(a.plan_list_price) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_plan_list_price,
    LAST(a.actual_amount_paid) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_actual_amount_paid,
    LAST(a.discount) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_discount,
    LAST(a.payment_plan_days) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_payment_plan_days,
    LAST(a.payment_method_id) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_payment_method,
    LAST(a.is_cancel) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_is_cancel,
    LAST(a.is_auto_renew) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_is_auto_renew,
    LAST(a.change_in_list_price) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_list_price,
    LAST(a.change_in_discount) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_discount,
    LAST(a.change_in_payment_plan_days) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_payment_plan_days,
    LAST(a.change_in_payment_method_id) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_payment_method_id,
    LAST(a.change_in_cancellation) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_cancellation,
    LAST(a.change_in_auto_renew) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_change_in_auto_renew,
    LAST(a.days_change_in_membership_expire_date) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date) as last_days_change_in_membership_expire_date,
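    -- note the sign: DATEDIFF(end, start) is negative here when the expiration date falls after the period start
    DATEDIFF('2017-03-01', LAST(a.membership_expire_date) OVER(PARTITION BY a.subscription_id ORDER BY a.transaction_date)) as days_until_expiration,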
    d.total_subscription_count,
    e.city,
    CASE WHEN e.bd < 10 THEN NULL WHEN e.bd > 70 THEN NULL ELSE e.bd END as bd,
    CASE WHEN LOWER(e.gender)='female' THEN 0 WHEN LOWER(e.gender)='male' THEN 1 ELSE NULL END as gender,
    e.registered_via  
  FROM kkbox.transactions_enhanced a
  INNER JOIN transaction_window b
    ON a.subscription_id=b.subscription_id AND a.transaction_date = b.last_at
  INNER JOIN (
    SELECT 
      x.subscription_id,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN x.plan_list_price ELSE 0 END) as total_list_price,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN x.actual_amount_paid ELSE 0 END) as total_amount_paid,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN x.discount ELSE 0 END) as total_discount,
      SUM(CASE WHEN x.days_change_in_membership_expire_date > 0 THEN 1 ELSE 0 END) as renewals
    FROM kkbox.transactions_enhanced x
    INNER JOIN transaction_window y
      ON x.subscription_id=y.subscription_id AND x.transaction_date BETWEEN y.start_at AND y.last_at
    GROUP BY x.subscription_id
    ) c
    ON a.subscription_id=c.subscription_id
  INNER JOIN (
    SELECT
      msno,
      COUNT(*) as total_subscription_count
    FROM kkbox.subscription_windows
    WHERE subscription_start < '2017-03-01'
    GROUP BY msno
    ) d
    ON a.msno=d.msno
  LEFT OUTER JOIN kkbox.members e
    ON a.msno=e.msno;
    
SELECT * FROM kkbox.test_trans_features;

msno,start_year,start_month,subscription_age,renewals,total_list_price,total_amount_paid,total_discount,days_since_last_account_action,last_plan_list_price,last_actual_amount_paid,last_discount,last_payment_plan_days,last_payment_method,last_is_cancel,last_is_auto_renew,last_change_in_list_price,last_change_in_discount,last_change_in_payment_plan_days,last_change_in_payment_method_id,last_change_in_cancellation,last_change_in_auto_renew,last_days_change_in_membership_expire_date,days_until_expiration,total_subscription_count,city,bd,gender,registered_via
+AYYJsSTdy+yE0syWm6hCVRwfPIs70XzTSdGmFLMG8A=,2015,1,762,25,3576,3725,-149,28,149,149,0,30,41,0,1,0,0,0,0,0,0,28,-1,1,1.0,,,7.0
+o4y9xpyRuBNtNToxlUHDeZ4IQD1qcVDUqmr7QNLp+M=,2015,1,781,26,3075,3224,-149,9,99,99,0,30,41,0,1,0,0,0,0,0,0,28,-19,1,1.0,,,7.0
/NPUZNOWx2pYneDKcJjqmi35N8ogQBB1PTBm7NFJdK8=,2015,1,785,26,3725,3874,-149,5,149,149,0,30,40,0,1,0,0,0,0,0,0,28,-23,1,13.0,,,7.0
/NcvrGvDJxRqPUt0gw/KRLR/Cu0nTHwHckXAEQ4NRDA=,2015,1,773,37,4665,4693,-28,17,99,99,0,30,41,0,1,0,0,0,0,0,0,28,-11,1,1.0,,,7.0
/ZZqY4JVjPqsNEjCkH9C1Yc9FB49B1GcHu7RcIKf2Ec=,2015,1,770,24,3582,3731,-149,20,180,180,0,30,36,0,0,0,0,0,0,0,0,30,-10,1,13.0,25.0,1.0,9.0
0Eeobh+ZuSEUrfTgwEyb5ORTE5VpQWoxnh6NR84Ytaw=,2015,1,789,26,3415,3294,121,1,129,129,0,30,41,0,1,0,0,0,0,0,0,31,-30,1,15.0,28.0,1.0,7.0
0L53joPKh8aWQLT2hKzuJwRgpyeMd+Zt61gNQCcI5d8=,2015,1,763,25,3576,3725,-149,27,149,149,0,30,37,0,1,0,0,0,0,0,0,28,-1,1,22.0,48.0,1.0,9.0
0QKdErs5Evxxz6C7xRKcHsn33HzKvBKYvmWrR/RLvcg=,2015,1,782,26,3585,3734,-149,8,129,129,0,30,41,0,1,0,0,0,0,0,0,28,-20,1,1.0,,,7.0
0WUK4180i/Lt5FB/xQkEBq8eR/AcvB7IDhZU3nEynIs=,2015,1,787,25,2826,2975,-149,3,99,99,0,30,41,0,1,0,0,0,0,0,0,28,-26,1,1.0,,,7.0
0m5bofQVssHvDdcSkCBBIY9rHzQQzUv/8Xh5+ltuXjw=,2015,1,779,25,2826,2975,-149,11,99,99,0,30,41,0,1,0,0,0,0,0,0,28,-17,1,1.0,,,7.0


Examining the transaction features above, you may recognize the opportunity to derive many more features.  Our goal in this exercise is not to provide an exhaustive review of feature types but instead to generate a meaningful subset of potential features against which to train our model.

Before going further, let's make sure we have features for all customers identified in our training and testing period datasets.  Each of these queries should return a count of zero unmatched records:

In [0]:
%sql

SELECT COUNT(*)
FROM kkbox.train a
LEFT OUTER JOIN kkbox.train_trans_features b
  ON a.msno=b.msno
WHERE b.msno IS NULL

count(1)
0


In [0]:
%sql

SELECT COUNT(*)
FROM kkbox.test a
LEFT OUTER JOIN kkbox.test_trans_features b
  ON a.msno=b.msno
WHERE b.msno IS NULL

count(1)
0
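
The two checks above are anti-joins: keep the rows from the left side that find no match on the right, then count them. A minimal Python sketch of the same idea, using made-up subscriber IDs and a hypothetical helper name:

```python
def unmatched(base_ids, feature_ids):
    """Return the IDs in base_ids that have no counterpart in feature_ids."""
    feature_set = set(feature_ids)
    return [msno for msno in base_ids if msno not in feature_set]

# Hypothetical subscriber IDs standing in for kkbox.train and its feature table.
train_ids = ["u1", "u2", "u3"]
feature_ids = ["u3", "u1", "u2"]

missing = unmatched(train_ids, feature_ids)
print(len(missing))
```

A non-zero count would mean some at-risk subscribers were dropped by one of the inner joins in the feature query.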


###Step 2: Engineer Features from the User Logs

As with our transaction log features, we need to define a range of dates within which we wish to examine user activity for the current, at-risk subscription.  This logic differs from the earlier logic in that we consider all user activity heading into the period of interest, including activity after the last known expiration date, because KKBox allows users to continue using their subscription for 30 days following expiration.  Knowing that an expired subscription is still in use should be a significant indicator of churn intent.  

In addition, note that we constrain our feature generation from the user logs to activity occurring no more than 30 days prior to the start of the period of interest. As with the transaction logs, there are many more features we could derive, such as those representing usage at the beginning of the subscription, usage throughout the subscription (ahead of the start of the period of interest), and usage over periods of differing durations heading into the period of interest.  Limiting the features in this way is arbitrary; again, our goal in this exercise is not to create an exhaustive set of features but to create a meaningful set which could be used in model training. 

With all that in mind, let's calculate the date ranges over which we will derive features from the user logs:

In [0]:
%sql

SELECT
  a.msno,
  b.subscription_id,
  CASE 
    WHEN b.subscription_start < DATE_ADD('2017-02-01', -30) THEN DATE_ADD('2017-02-01', -30) -- cap the window start at 30 days prior to the start of the period
    ELSE b.subscription_start 
    END as start_at,
  DATE_ADD('2017-02-01', -1) as end_at,
  c.last_at as last_exp_at
FROM kkbox.train a  -- LIMIT ANALYSIS TO AT-RISK SUBSCRIBERS IN THE TRAINING PERIOD
LEFT OUTER JOIN (   -- subscriptions not yet churned heading into the period of interest 
  SELECT *
  FROM kkbox.subscription_windows 
  WHERE subscription_start < '2017-02-01' AND subscription_end > DATE_ADD('2017-02-01', -30)
  )b
  ON a.msno=b.msno
LEFT OUTER JOIN (  -- last known expiration date headed into this period
  SELECT
    x.subscription_id,
    y.membership_expire_date as last_at
  FROM (
    SELECT  -- last subscription transaction before start of this period
      subscription_id,
      MAX(transaction_date) as transaction_date
    FROM kkbox.transactions_enhanced
    WHERE transaction_date < '2017-02-01'
    GROUP BY subscription_id
    ) x
  INNER JOIN kkbox.transactions_enhanced y
    ON x.subscription_id=y.subscription_id AND x.transaction_date=y.transaction_date
  ) c
  ON b.subscription_id=c.subscription_id  

msno,subscription_id,start_at,end_at,last_exp_at
+AYYJsSTdy+yE0syWm6hCVRwfPIs70XzTSdGmFLMG8A=,78,2017-01-02,2017-01-31,2017-02-02
+o4y9xpyRuBNtNToxlUHDeZ4IQD1qcVDUqmr7QNLp+M=,296,2017-01-02,2017-01-31,2017-02-20
/NPUZNOWx2pYneDKcJjqmi35N8ogQBB1PTBm7NFJdK8=,513,2017-01-02,2017-01-31,2017-02-24
/NcvrGvDJxRqPUt0gw/KRLR/Cu0nTHwHckXAEQ4NRDA=,516,2017-01-02,2017-01-31,2017-02-12
/ZZqY4JVjPqsNEjCkH9C1Yc9FB49B1GcHu7RcIKf2Ec=,580,2017-01-02,2017-01-31,2017-02-09
0Eeobh+ZuSEUrfTgwEyb5ORTE5VpQWoxnh6NR84Ytaw=,799,2017-01-02,2017-01-31,2017-02-28
0L53joPKh8aWQLT2hKzuJwRgpyeMd+Zt61gNQCcI5d8=,833,2017-01-02,2017-01-31,2017-02-02
0QKdErs5Evxxz6C7xRKcHsn33HzKvBKYvmWrR/RLvcg=,853,2017-01-02,2017-01-31,2017-02-21
0WUK4180i/Lt5FB/xQkEBq8eR/AcvB7IDhZU3nEynIs=,883,2017-01-02,2017-01-31,2017-02-27
0m5bofQVssHvDdcSkCBBIY9rHzQQzUv/8Xh5+ltuXjw=,970,2017-01-02,2017-01-31,2017-02-18
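
The window arithmetic in the query can be checked with a few lines of Python. The function name and the `lookback_days` parameter are ours, standing in for the literal `-30`; the `CASE` expression amounts to taking the later of the subscription start and the lookback date:

```python
from datetime import date, timedelta

def activity_window(subscription_start, period_start, lookback_days=30):
    """Return (start_at, end_at) for the user-log window."""
    earliest = period_start - timedelta(days=lookback_days)
    start_at = max(subscription_start, earliest)   # the CASE expression
    end_at = period_start - timedelta(days=1)      # day before the period
    return start_at, end_at

period_start = date(2017, 2, 1)

# A long-running subscription is capped at 30 days before the period.
print(activity_window(date(2015, 1, 1), period_start))
# A subscription that began inside the lookback keeps its own start date.
print(activity_window(date(2017, 1, 20), period_start))
```

For the February 2017 period, the capped window runs from 2017-01-02 through 2017-01-31, which agrees with the `start_at` and `end_at` values in the output above.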


Using these date ranges, we can now constrain our analysis of the user logs.  It's important to note that users may have multiple streaming sessions on a given date.  As such, we'll want to derive day-level statistics from the user logs to make them easier to consume.  In addition, we will want to join our day-level statistics with our date range dataset, *i.e.* kkbox.dates derived in the last notebook, so that we have one record for each day in the range of interest, including days with no activity.  Understanding patterns of activity as well as inactivity may be helpful in determining which subscriptions will churn:

In [0]:
%sql

WITH activity_window (
    SELECT
      a.msno,
      b.subscription_id,
      CASE 
        WHEN b.subscription_start < DATE_ADD('2017-02-01', -30) THEN DATE_ADD('2017-02-01', -30) 
        ELSE b.subscription_start 
        END as start_at,
      DATE_ADD('2017-02-01', -1) as end_at,
      c.last_at as last_exp_at
    FROM kkbox.train a
    LEFT OUTER JOIN (
      SELECT *
      FROM kkbox.subscription_windows 
      WHERE subscription_start < '2017-02-01' AND subscription_end > DATE_ADD('2017-02-01', -30)
      )b
      ON a.msno=b.msno
    LEFT OUTER JOIN (  -- last known expiration date headed into this period
      SELECT
        x.subscription_id,
        y.membership_expire_date as last_at
      FROM (
        SELECT  -- last subscription transaction before start of this period
          subscription_id,
          MAX(transaction_date) as transaction_date
        FROM kkbox.transactions_enhanced
        WHERE transaction_date < '2017-02-01'
        GROUP BY subscription_id
        ) x
      INNER JOIN kkbox.transactions_enhanced y
        ON x.subscription_id=y.subscription_id AND x.transaction_date=y.transaction_date
      ) c
      ON b.subscription_id=c.subscription_id  
    )    
SELECT
  a.subscription_id,
  a.msno,
  b.date,
  CASE WHEN b.date > a.last_exp_at THEN 1 ELSE 0 END as after_exp,
  CASE WHEN c.date IS NOT NULL THEN 1 ELSE 0 END as had_session,
  COALESCE(c.session_count, 0) as sessions_total,
  COALESCE(c.total_secs, 0) as seconds_total,
  COALESCE(c.num_uniq,0) as number_uniq,
  COALESCE(c.num_total,0) as number_total
FROM activity_window a
INNER JOIN kkbox.dates b
  ON b.date BETWEEN a.start_at AND a.end_at
LEFT OUTER JOIN (
  SELECT
    msno,
    date,
    COUNT(*) as session_count,
    SUM(total_secs) as total_secs,
    SUM(num_uniq) as num_uniq,
    SUM(num_25+num_50+num_75+num_985+num_100) as num_total
  FROM kkbox.user_logs
  GROUP BY msno, date
  ) c
  ON a.msno=c.msno AND b.date=c.date
ORDER BY subscription_id, date

subscription_id,msno,date,after_exp,had_session,sessions_total,seconds_total,number_uniq,number_total
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-02,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-03,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-04,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-05,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-06,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-07,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-08,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-09,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-10,0,0,0,0.0,0,0
1,++KbErD/TJoTzWzoMQzvaHHnRPE5GZLjXR2YbfbJ+ow=,2017-01-11,0,0,0,0.0,0,0
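
The shape of this query (aggregate sessions per day, then lay them over a complete calendar so inactive days appear as zeros) can be sketched in plain Python. The log rows, the user ID and the function name below are hypothetical, and only `total_secs` is carried through to keep the example short:

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical user-log rows: (msno, date, total_secs); two sessions on Jan 2.
logs = [
    ("u1", date(2017, 1, 2), 100.0),
    ("u1", date(2017, 1, 2), 50.0),
    ("u1", date(2017, 1, 4), 30.0),
]

def daily_activity(msno, start_at, end_at, last_exp_at, logs):
    """One record per calendar day in [start_at, end_at], zero-filled."""
    per_day = defaultdict(lambda: [0, 0.0])  # date -> [session_count, total_secs]
    for m, d, secs in logs:
        if m == msno:
            per_day[d][0] += 1
            per_day[d][1] += secs

    records = []
    day = start_at
    while day <= end_at:
        count, secs = per_day.get(day, (0, 0.0))  # .get avoids adding new keys
        records.append({
            "date": day,
            "after_exp": 1 if day > last_exp_at else 0,
            "had_session": 1 if day in per_day else 0,
            "sessions_total": count,
            "seconds_total": secs,
        })
        day += timedelta(days=1)
    return records

records = daily_activity("u1", date(2017, 1, 2), date(2017, 1, 5), date(2017, 1, 3), logs)
for r in records:
    print(r)
```

The calendar loop plays the role of the join to kkbox.dates: without it, days with no sessions would simply be absent rather than recorded as inactivity.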


With our daily activity records now constructed, we can create the summary statistics that will form our user activity features:

In [0]:
%sql
DROP TABLE IF EXISTS kkbox.train_act_features;

CREATE TABLE kkbox.train_act_features
USING DELTA 
AS
WITH activity_window (
    SELECT
      a.msno,
      b.subscription_id,
      CASE 
        WHEN b.subscription_start < DATE_ADD('2017-02-01', -30) THEN DATE_ADD('2017-02-01', -30) 
        ELSE b.subscription_start 
        END as start_at,
      DATE_ADD('2017-02-01', -1) as end_at,
      c.last_at as last_exp_at
    FROM kkbox.train a
    LEFT OUTER JOIN (
      SELECT *
      FROM kkbox.subscription_windows 
      WHERE subscription_start < '2017-02-01' AND subscription_end > DATE_ADD('2017-02-01', -30)
      )b
      ON a.msno=b.msno
    LEFT OUTER JOIN (  -- last known expiration date headed into this period
      SELECT
        x.subscription_id,
        y.membership_expire_date as last_at
      FROM (
        SELECT  -- last subscription transaction before start of this period
          subscription_id,
          MAX(transaction_date) as transaction_date
        FROM kkbox.transactions_enhanced
        WHERE transaction_date < '2017-02-01'
        GROUP BY subscription_id
        ) x
      INNER JOIN kkbox.transactions_enhanced y
        ON x.subscription_id=y.subscription_id AND x.transaction_date=y.transaction_date
      ) c
      ON b.subscription_id=c.subscription_id  
      ),
  activity (
    SELECT
      a.subscription_id,
      a.msno,
      b.date,
      CASE WHEN b.date > a.last_exp_at THEN 1 ELSE 0 END as after_exp,
      CASE WHEN c.date IS NOT NULL THEN 1 ELSE 0 END as had_session,
      COALESCE(c.session_count, 0) as sessions_total,
      COALESCE(c.total_secs, 0) as seconds_total,
      COALESCE(c.num_uniq,0) as number_uniq,
      COALESCE(c.num_total,0) as number_total
    FROM activity_window a
    INNER JOIN kkbox.dates b
      ON b.date BETWEEN a.start_at AND a.end_at
    LEFT OUTER JOIN (
      SELECT
        msno,
        date,
        COUNT(*) as session_count,
        SUM(total_secs) as total_secs,
        SUM(num_uniq) as num_uniq,
        SUM(num_25+num_50+num_75+num_985+num_100) as num_total
      FROM kkbox.user_logs
      GROUP BY msno, date
      ) c
      ON a.msno=c.msno AND b.date=c.date
    )
  
SELECT 
  subscription_id,
  msno,
  COUNT(*) as days_total,
  SUM(had_session) as days_with_session,
  COALESCE(SUM(had_session)/COUNT(*),0) as ratio_days_with_session_to_days,
  SUM(after_exp) as days_after_exp,
  SUM(had_session * after_exp) as days_after_exp_with_session,
  COALESCE(SUM(had_session * after_exp)/SUM(after_exp),0) as ratio_days_after_exp_with_session_to_days_after_exp,
  SUM(sessions_total) as sessions_total,
  COALESCE(SUM(sessions_total)/COUNT(*),0) as ratio_sessions_total_to_days_total,
  COALESCE(SUM(sessions_total)/SUM(had_session),0) as ratio_sessions_total_to_days_with_session,
  SUM(sessions_total * after_exp) as sessions_total_after_exp,
  COALESCE(SUM(sessions_total * after_exp)/SUM(after_exp),0) as ratio_sessions_total_after_exp_to_days_after_exp,
  COALESCE(SUM(sessions_total * after_exp)/SUM(had_session * after_exp),0) as ratio_sessions_total_after_exp_to_days_after_exp_with_session,
  SUM(seconds_total) as seconds_total,
  COALESCE(SUM(seconds_total)/COUNT(*),0) as ratio_seconds_total_to_days_total,
  COALESCE(SUM(seconds_total)/SUM(had_session),0) as ratio_seconds_total_to_days_with_session,
  SUM(seconds_total * after_exp) as seconds_total_after_exp,
  COALESCE(SUM(seconds_total * after_exp)/SUM(after_exp),0) as ratio_seconds_total_after_exp_to_days_after_exp,
  COALESCE(SUM(seconds_total * after_exp)/SUM(had_session * after_exp),0) as ratio_seconds_total_after_exp_to_days_after_exp_with_session,
  SUM(number_uniq) as number_uniq,
  COALESCE(SUM(number_uniq)/COUNT(*),0) as ratio_number_uniq_to_days_total,
  COALESCE(SUM(number_uniq)/SUM(had_session),0) as ratio_number_uniq_to_days_with_session,
  SUM(number_uniq * after_exp) as number_uniq_after_exp,
  COALESCE(SUM(number_uniq * after_exp)/SUM(after_exp),0) as ratio_number_uniq_after_exp_to_days_after_exp,
  COALESCE(SUM(number_uniq * after_exp)/SUM(had_session * after_exp),0) as ratio_number_uniq_after_exp_to_days_after_exp_with_session,
  SUM(number_total) as number_total,
  COALESCE(SUM(number_total)/COUNT(*),0) as ratio_number_total_to_days_total,
  COALESCE(SUM(number_total)/SUM(had_session),0) as ratio_number_total_to_days_with_session,
  SUM(number_total * after_exp) as number_total_after_exp,
  COALESCE(SUM(number_total * after_exp)/SUM(after_exp),0) as ratio_number_total_after_exp_to_days_after_exp,
  COALESCE(SUM(number_total * after_exp)/SUM(had_session * after_exp),0) as ratio_number_total_after_exp_to_days_after_exp_with_session
FROM activity
GROUP BY subscription_id, msno
ORDER BY msno;

SELECT *
FROM kkbox.train_act_features;

subscription_id,msno,days_total,days_with_session,ratio_days_with_session_to_days,days_after_exp,days_after_exp_with_session,ratio_days_after_exp_with_session_to_days_after_exp,sessions_total,ratio_sessions_total_to_days_total,ratio_sessions_total_to_days_with_session,sessions_total_after_exp,ratio_sessions_total_after_exp_to_days_after_exp,ratio_sessions_total_after_exp_to_days_after_exp_with_session,seconds_total,ratio_seconds_total_to_days_total,ratio_seconds_total_to_days_with_session,seconds_total_after_exp,ratio_seconds_total_after_exp_to_days_after_exp,ratio_seconds_total_after_exp_to_days_after_exp_with_session,number_uniq,ratio_number_uniq_to_days_total,ratio_number_uniq_to_days_with_session,number_uniq_after_exp,ratio_number_uniq_after_exp_to_days_after_exp,ratio_number_uniq_after_exp_to_days_after_exp_with_session,number_total,ratio_number_total_to_days_total,ratio_number_total_to_days_with_session,number_total_after_exp,ratio_number_total_after_exp_to_days_after_exp,ratio_number_total_after_exp_to_days_after_exp_with_session
2745735,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
406264,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1926236,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1937808,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1259237,++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1462853,++0wqjjQge1mBBe5r4ciHGKwtF/m322zkra7CK8I+Mw=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1280430,++1G0wVY14Lp0VXak1ymLhPUdXPSFJVBnjWwzGxBKJs=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
2178834,++1GCIyXZO7834NjDKmcK1lBVLQi9PsN6sOC7wfW+8g=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
2136980,++1Wu2wKBA60W9F9sMh15RXmh1wN1fjoVGzNqvw/Gro=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
2117333,++2Ju1OdxLSyexwhZ/C0glNK0DMIfUjsFpk9lt8Dll8=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
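
Every ratio in the query is wrapped in `COALESCE(..., 0)`. This guards against empty denominators, on the assumption (which we believe holds for the default, non-ANSI settings of this runtime) that division by zero yields NULL rather than an error. A Python sketch of the same guard, using invented day-level records and a hypothetical `ratio` helper:

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical day-level records for one subscription with no days after expiration.
days = [
    {"had_session": 1, "after_exp": 0, "sessions_total": 2},
    {"had_session": 0, "after_exp": 0, "sessions_total": 0},
    {"had_session": 1, "after_exp": 0, "sessions_total": 1},
    {"had_session": 0, "after_exp": 0, "sessions_total": 0},
]

days_total = len(days)
days_with_session = sum(d["had_session"] for d in days)
days_after_exp = sum(d["after_exp"] for d in days)
sessions_total = sum(d["sessions_total"] for d in days)
sessions_after_exp = sum(d["sessions_total"] * d["after_exp"] for d in days)

features = {
    "ratio_days_with_session_to_days": ratio(days_with_session, days_total),
    "ratio_sessions_total_to_days_with_session": ratio(sessions_total, days_with_session),
    "ratio_sessions_total_after_exp_to_days_after_exp": ratio(sessions_after_exp, days_after_exp),
}
print(features)
```

One consequence of this choice is that a ratio of 0 can mean either "no activity after expiration" or "no days after expiration at all". The companion count columns, such as days_after_exp, let a model tell the two apart.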


We can use the same logic to generate features for the testing period:

In [0]:
%sql
DROP TABLE IF EXISTS kkbox.test_act_features;

CREATE TABLE kkbox.test_act_features
USING DELTA 
AS
WITH activity_window (
    SELECT
      a.msno,
      b.subscription_id,
      CASE 
        WHEN b.subscription_start < DATE_ADD('2017-03-01', -30) THEN DATE_ADD('2017-03-01', -30) 
        ELSE b.subscription_start 
        END as start_at,
      DATE_ADD('2017-03-01', -1) as end_at,
      c.last_at as last_exp_at
    FROM kkbox.test a
    LEFT OUTER JOIN (
      SELECT *
      FROM kkbox.subscription_windows 
      WHERE subscription_start < '2017-03-01' AND subscription_end > DATE_ADD('2017-03-01', -30)
      )b
      ON a.msno=b.msno
    LEFT OUTER JOIN (  -- last known expiration date headed into this period
      SELECT
        x.subscription_id,
        y.membership_expire_date as last_at
      FROM (
        SELECT  -- last subscription transaction before start of this period
          subscription_id,
          MAX(transaction_date) as transaction_date
        FROM kkbox.transactions_enhanced
        WHERE transaction_date < '2017-03-01'
        GROUP BY subscription_id
        ) x
      INNER JOIN kkbox.transactions_enhanced y
        ON x.subscription_id=y.subscription_id AND x.transaction_date=y.transaction_date
      ) c
      ON b.subscription_id=c.subscription_id  
      ),
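  activity (  -- one row per subscription per calendar day in the activity window
    SELECT
      a.subscription_id,
      a.msno,
      b.date,
      CASE WHEN b.date > a.last_exp_at THEN 1 ELSE 0 END as after_exp,  -- day falls after the last known expiration date
      CASE WHEN c.date IS NOT NULL THEN 1 ELSE 0 END as had_session,  -- user logs record activity on this day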
      COALESCE(c.session_count, 0) as sessions_total,
      COALESCE(c.total_secs, 0) as seconds_total,
      COALESCE(c.num_uniq,0) as number_uniq,
      COALESCE(c.num_total,0) as number_total
    FROM activity_window a
    INNER JOIN kkbox.dates b
      ON b.date BETWEEN a.start_at AND a.end_at
    LEFT OUTER JOIN (
      SELECT
        msno,
        date,
        COUNT(*) as session_count,
        SUM(total_secs) as total_secs,
        SUM(num_uniq) as num_uniq,
        SUM(num_25+num_50+num_75+num_985+num_100) as num_total
      FROM kkbox.user_logs
      GROUP BY msno, date
      ) c
      ON a.msno=c.msno AND b.date=c.date
    )
  
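-- a ratio with a zero denominator evaluates to NULL (assuming Spark SQL ANSI mode is disabled); COALESCE converts it to 0
SELECT 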
  subscription_id,
  msno,
  COUNT(*) as days_total,
  SUM(had_session) as days_with_session,
  COALESCE(SUM(had_session)/COUNT(*),0) as ratio_days_with_session_to_days,
  SUM(after_exp) as days_after_exp,
  SUM(had_session * after_exp) as days_after_exp_with_session,
  COALESCE(SUM(had_session * after_exp)/SUM(after_exp),0) as ratio_days_after_exp_with_session_to_days_after_exp,
  SUM(sessions_total) as sessions_total,
  COALESCE(SUM(sessions_total)/COUNT(*),0) as ratio_sessions_total_to_days_total,
  COALESCE(SUM(sessions_total)/SUM(had_session),0) as ratio_sessions_total_to_days_with_session,
  SUM(sessions_total * after_exp) as sessions_total_after_exp,
  COALESCE(SUM(sessions_total * after_exp)/SUM(after_exp),0) as ratio_sessions_total_after_exp_to_days_after_exp,
  COALESCE(SUM(sessions_total * after_exp)/SUM(had_session * after_exp),0) as ratio_sessions_total_after_exp_to_days_after_exp_with_session,
  SUM(seconds_total) as seconds_total,
  COALESCE(SUM(seconds_total)/COUNT(*),0) as ratio_seconds_total_to_days_total,
  COALESCE(SUM(seconds_total)/SUM(had_session),0) as ratio_seconds_total_to_days_with_session,
  SUM(seconds_total * after_exp) as seconds_total_after_exp,
  COALESCE(SUM(seconds_total * after_exp)/SUM(after_exp),0) as ratio_seconds_total_after_exp_to_days_after_exp,
  COALESCE(SUM(seconds_total * after_exp)/SUM(had_session * after_exp),0) as ratio_seconds_total_after_exp_to_days_after_exp_with_session,
  SUM(number_uniq) as number_uniq,
  COALESCE(SUM(number_uniq)/COUNT(*),0) as ratio_number_uniq_to_days_total,
  COALESCE(SUM(number_uniq)/SUM(had_session),0) as ratio_number_uniq_to_days_with_session,
  SUM(number_uniq * after_exp) as number_uniq_after_exp,
  COALESCE(SUM(number_uniq * after_exp)/SUM(after_exp),0) as ratio_number_uniq_after_exp_to_days_after_exp,
  COALESCE(SUM(number_uniq * after_exp)/SUM(had_session * after_exp),0) as ratio_number_uniq_after_exp_to_days_after_exp_with_session,
  SUM(number_total) as number_total,
  COALESCE(SUM(number_total)/COUNT(*),0) as ratio_number_total_to_days_total,
  COALESCE(SUM(number_total)/SUM(had_session),0) as ratio_number_total_to_days_with_session,
  SUM(number_total * after_exp) as number_total_after_exp,
  COALESCE(SUM(number_total * after_exp)/SUM(after_exp),0) as ratio_number_total_after_exp_to_days_after_exp,
  COALESCE(SUM(number_total * after_exp)/SUM(had_session * after_exp),0) as ratio_number_total_after_exp_to_days_after_exp_with_session
FROM activity
GROUP BY subscription_id, msno
ORDER BY msno;

SELECT *
FROM kkbox.test_act_features;

subscription_id,msno,days_total,days_with_session,ratio_days_with_session_to_days,days_after_exp,days_after_exp_with_session,ratio_days_after_exp_with_session_to_days_after_exp,sessions_total,ratio_sessions_total_to_days_total,ratio_sessions_total_to_days_with_session,sessions_total_after_exp,ratio_sessions_total_after_exp_to_days_after_exp,ratio_sessions_total_after_exp_to_days_after_exp_with_session,seconds_total,ratio_seconds_total_to_days_total,ratio_seconds_total_to_days_with_session,seconds_total_after_exp,ratio_seconds_total_after_exp_to_days_after_exp,ratio_seconds_total_after_exp_to_days_after_exp_with_session,number_uniq,ratio_number_uniq_to_days_total,ratio_number_uniq_to_days_with_session,number_uniq_after_exp,ratio_number_uniq_after_exp_to_days_after_exp,ratio_number_uniq_after_exp_to_days_after_exp_with_session,number_total,ratio_number_total_to_days_total,ratio_number_total_to_days_with_session,number_total_after_exp,ratio_number_total_after_exp_to_days_after_exp,ratio_number_total_after_exp_to_days_after_exp_with_session
2745735,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
406264,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1926236,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1937808,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1259237,++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1462853,++0wqjjQge1mBBe5r4ciHGKwtF/m322zkra7CK8I+Mw=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
1280430,++1G0wVY14Lp0VXak1ymLhPUdXPSFJVBnjWwzGxBKJs=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
2178834,++1GCIyXZO7834NjDKmcK1lBVLQi9PsN6sOC7wfW+8g=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
2136980,++1Wu2wKBA60W9F9sMh15RXmh1wN1fjoVGzNqvw/Gro=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
2117333,++2Ju1OdxLSyexwhZ/C0glNK0DMIfUjsFpk9lt8Dll8=,30,0,0.0,0,0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0
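
To make the windowing and ratio logic in the query above easier to review, here is a minimal Python sketch of the same steps for a single subscriber. It is not part of the notebook's pipeline: the helper names (`activity_window`, `safe_ratio`, `summarize`) and the sample dates are hypothetical, and only a handful of the features are reproduced.

```python
from datetime import date, timedelta


def activity_window(subscription_start, period_start, lookback_days=30):
    """Return (start_at, end_at) for the window preceding period_start."""
    earliest = period_start - timedelta(days=lookback_days)
    start_at = max(subscription_start, earliest)  # mirrors the CASE expression
    end_at = period_start - timedelta(days=1)     # mirrors DATE_ADD(period_start, -1)
    return start_at, end_at


def safe_ratio(numerator, denominator):
    """Mirror COALESCE(numerator / denominator, 0): a zero denominator gives 0."""
    return numerator / denominator if denominator else 0.0


def summarize(start_at, end_at, last_exp_at, sessions_by_date):
    """Aggregate per-day session counts into a few of the features above."""
    days_total = days_with_session = days_after_exp = sessions_total = 0
    day = start_at
    while day <= end_at:  # one pass per calendar day, like the join to kkbox.dates
        days_total += 1
        if day in sessions_by_date:  # had_session
            days_with_session += 1
            sessions_total += sessions_by_date[day]
        if last_exp_at is not None and day > last_exp_at:  # after_exp
            days_after_exp += 1
        day += timedelta(days=1)
    return {
        "days_total": days_total,
        "days_with_session": days_with_session,
        "ratio_days_with_session_to_days": safe_ratio(days_with_session, days_total),
        "days_after_exp": days_after_exp,
        "sessions_total": sessions_total,
        "ratio_sessions_total_to_days_with_session": safe_ratio(sessions_total, days_with_session),
    }


# Hypothetical subscriber: long-standing subscription, two active days in February 2017
start_at, end_at = activity_window(date(2016, 5, 1), date(2017, 3, 1))
features = summarize(
    start_at,
    end_at,
    last_exp_at=date(2017, 2, 25),
    sessions_by_date={date(2017, 2, 1): 2, date(2017, 2, 27): 1},
)
print(start_at, end_at)        # 2017-01-30 2017-02-28
print(features["days_total"])  # 30
```

A subscriber with no sessions at all would get 0 for every ratio, which matches the all-zero rows shown in the sample output.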


Again, let's confirm that every at-risk subscriber in the training and testing datasets has a record in the corresponding activity feature table.  Each of the following queries should return a count of zero:

In [0]:
%sql

SELECT COUNT(*)
FROM kkbox.train a
LEFT OUTER JOIN kkbox.train_act_features b
  ON a.msno=b.msno
WHERE b.msno IS NULL

count(1)
0


In [0]:
%sql

SELECT COUNT(*)
FROM kkbox.test a
LEFT OUTER JOIN kkbox.test_act_features b
  ON a.msno=b.msno
WHERE b.msno IS NULL

count(1)
0
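
The two checks above follow an anti-join pattern: keep the rows from the left table that found no match on the right, then count them. A minimal Python sketch of that pattern is shown here, assuming the member IDs have already been collected into plain lists. The function name `count_missing` and the sample IDs are hypothetical.

```python
def count_missing(at_risk_msnos, feature_msnos):
    """Count at-risk members that have no matching feature record.

    Equivalent in spirit to: LEFT OUTER JOIN ... WHERE b.msno IS NULL, then COUNT(*).
    """
    feature_keys = set(feature_msnos)  # set membership keeps each lookup cheap
    return sum(1 for msno in at_risk_msnos if msno not in feature_keys)


# Hypothetical IDs: member "b" has no feature record, so one row is reported missing
print(count_missing(["a", "b", "c"], ["a", "c"]))  # 1
```

A result of zero means every at-risk member is covered, which is what both queries report.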
