The competition is to be done in groups of two. You must submit a 4-page report by December 14th presenting the methods you tested and used. For the defense, you will have 8 minutes of presentation followed by 7 minutes of questions, including one question on the labs that may involve writing a code snippet.


Churn prediction 25/26
**Predict churn from streaming service logs**

The goal of the competition is to predict whether the users whose user IDs appear in the test file will **churn within the 10 days following the given observations (i.e. after "2018-11-20")**. A user is considered to have churned when they visit the **'Cancellation Confirmation'** page.
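
To make the target concrete, here is a minimal sketch of the label on a made-up toy log. The helper `churn_labels`, the toy values, and the choice to treat the window as (cutoff, cutoff + 10 days] are assumptions for illustration, not part of the competition statement. In the real test set the events after the cutoff are of course not available.

```python
import pandas as pd

# Toy event log (made-up values) reusing the column names of the competition data.
logs = pd.DataFrame({
    "userId": ["a", "a", "b", "b", "c"],
    "page": ["NextSong", "Cancellation Confirmation", "NextSong", "Home",
             "Cancellation Confirmation"],
    "time": pd.to_datetime(["2018-11-18", "2018-11-25", "2018-11-19",
                            "2018-11-28", "2018-12-05"]),
})

def churn_labels(logs, cutoff, horizon_days=10):
    """Hypothetical helper: one 0/1 label per user.

    A user is labeled 1 if they hit 'Cancellation Confirmation'
    in the window (cutoff, cutoff + horizon_days].
    """
    cutoff = pd.Timestamp(cutoff)
    window_end = cutoff + pd.Timedelta(days=horizon_days)
    cancels = logs[(logs["page"] == "Cancellation Confirmation")
                   & (logs["time"] > cutoff)
                   & (logs["time"] <= window_end)]
    users = pd.Index(logs["userId"].unique(), name="userId")
    return pd.Series(users.isin(cancels["userId"]).astype(int),
                     index=users, name="churn")

labels = churn_labels(logs, cutoff="2018-11-20")
print(labels)
```

In this toy log, user `a` cancels inside the window, `b` never cancels, and `c` cancels only after the window has closed, so only `a` counts as churned.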


In [27]:
import pandas as pd

df_test = pd.read_parquet("data/test.parquet")
df_train = pd.read_parquet("data/train.parquet")

In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17499636 entries, 0 to 25661583
Data columns (total 19 columns):
 #   Column         Dtype         
---  ------         -----         
 0   status         int64         
 1   gender         object        
 2   firstName      object        
 3   level          object        
 4   lastName       object        
 5   userId         object        
 6   ts             int64         
 7   auth           object        
 8   page           object        
 9   sessionId      int64         
 10  location       object        
 11  itemInSession  int64         
 12  userAgent      object        
 13  method         object        
 14  length         float64       
 15  song           object        
 16  artist         object        
 17  time           datetime64[us]
 18  registration   datetime64[us]
dtypes: datetime64[us](2), float64(1), int64(4), object(12)
memory usage: 2.6+ GB


In [28]:
# Label each row: does the user cancel within the 10 days following this event?

import numpy as np

# First 'Cancellation Confirmation' time per user (one row per user, so the merge cannot duplicate events)
cancellation_events = (
    df_train.loc[df_train['page'] == 'Cancellation Confirmation', ['userId', 'time']]
    .rename(columns={'time': 'churn_time'})
    .groupby('userId', as_index=False)['churn_time'].min()
)

# Users who never cancel get churn_time = NaT
df_train = df_train.merge(cancellation_events, on='userId', how='left')

# NaT yields NaN here, and comparisons with NaN are False, so non-churners are labeled 0
df_train['days_until_churn'] = (df_train['churn_time'] - df_train['time']).dt.total_seconds() / (24 * 3600)

df_train['will_churn_10days'] = ((df_train['days_until_churn'] >= 0) & 
                                   (df_train['days_until_churn'] <= 10)).astype(int)

df_train = df_train.drop(['churn_time', 'days_until_churn'], axis=1)
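
A possible alternative, sketched on a made-up toy frame: the same row-level label can be built with `Series.map` instead of a merge, which avoids materializing a second copy of a 17-million-row frame and keeps the original index. The frame `events` and its values are illustrative assumptions. Only the column names come from the data above.

```python
import pandas as pd

# Toy events (made-up values) with the same column names as df_train.
events = pd.DataFrame({
    "userId": ["u1", "u1", "u1", "u2", "u2"],
    "page": ["NextSong", "NextSong", "Cancellation Confirmation",
             "NextSong", "Home"],
    "time": pd.to_datetime(["2018-10-01", "2018-10-20", "2018-10-25",
                            "2018-10-05", "2018-10-30"]),
})

# First cancellation time per user; users who never cancel are simply absent.
churn_time = (events.loc[events["page"] == "Cancellation Confirmation"]
              .groupby("userId")["time"].min())

# map() keeps the frame's length and index; users without a cancellation get NaT.
days_until_churn = ((events["userId"].map(churn_time) - events["time"])
                    / pd.Timedelta(days=1))

# between() is False for NaN, so non-churners are labeled 0.
events["will_churn_10days"] = days_until_churn.between(0, 10).astype(int)
print(events)
```

Here the first `u1` event is 24 days before the cancellation and gets 0, while the two later `u1` events fall within 10 days and get 1.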

In [37]:
df_train.describe()  # max time is 2018-11-20

# Rows from the last 10 days have an incomplete label window, so keep only rows
# dated before 2018-11-10, plus any row already labeled as churn.
df_train = df_train[(df_train["time"] < "2018-11-10") | (df_train["will_churn_10days"] == 1)]

In [30]:
df_train.describe()

Unnamed: 0,status,ts,sessionId,itemInSession,length,time,registration,will_churn_10days
count,17499640.0,17499640.0,17499640.0,17499640.0,14291430.0,17499636,17499636,17499640.0
mean,209.1387,1540428000000.0,84802.94,105.5937,248.7135,2018-10-25 00:47:01.161927,2018-08-25 04:40:21.543066,0.1034635
min,200.0,1538352000000.0,1.0,0.0,0.522,2018-10-01 00:00:01,2017-10-14 22:05:25,0.0
25%,200.0,1539340000000.0,25159.0,26.0,199.8885,2018-10-12 10:33:57.750000,2018-08-10 21:14:59,0.0
50%,200.0,1540397000000.0,79038.0,66.0,234.0828,2018-10-24 15:58:54,2018-09-05 18:35:50,0.0
75%,200.0,1541500000000.0,138368.0,144.0,276.8714,2018-11-06 10:25:35,2018-09-20 17:24:57,0.0
max,404.0,1542672000000.0,207003.0,1426.0,3024.666,2018-11-20 00:00:00,2018-11-19 23:34:34,1.0
std,30.2305,1233485000.0,61414.27,116.8854,97.22845,,,0.3045633


In [39]:
# Checking that the code worked as expected
df_train.sort_values(by='time', ascending=False).head(10)

# WARNING!
# Since we only keep rows that have churn True in the last 10 days, our model could learn that a later date means a higher chance of churn! Damn.


Unnamed: 0,status,gender,firstName,level,lastName,userId,ts,auth,page,sessionId,location,itemInSession,userAgent,method,length,song,artist,time,registration,will_churn_10days
2098954,200,M,Andrew,paid,Juarez,1741654,1542671709000,Cancelled,Cancellation Confirmation,190011,"Bloomington, IN",56,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",GET,,,,2018-11-19 23:55:09,2018-04-04 02:45:38,1
2098953,307,M,Andrew,paid,Juarez,1741654,1542671646000,Logged In,Cancel,190011,"Bloomington, IN",55,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,,,,2018-11-19 23:54:06,2018-04-04 02:45:38,1
2098952,200,M,Andrew,paid,Juarez,1741654,1542671645000,Logged In,Downgrade,190011,"Bloomington, IN",54,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",GET,,,,2018-11-19 23:54:05,2018-04-04 02:45:38,1
2098951,200,M,Andrew,paid,Juarez,1741654,1542671627000,Logged In,NextSong,190011,"Bloomington, IN",53,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,267.20608,Subterranean Homesick Alien,Radiohead,2018-11-19 23:53:47,2018-04-04 02:45:38,1
2098950,200,M,Andrew,paid,Juarez,1741654,1542671369000,Logged In,NextSong,190011,"Bloomington, IN",52,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,258.11546,Catch Hell Blues,The White Stripes,2018-11-19 23:49:29,2018-04-04 02:45:38,1
2098949,200,M,Andrew,paid,Juarez,1741654,1542671083000,Logged In,NextSong,190011,"Bloomington, IN",51,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,286.24934,Help I'm Alive,Metric,2018-11-19 23:44:43,2018-04-04 02:45:38,1
2098948,200,M,Andrew,paid,Juarez,1741654,1542670762000,Logged In,NextSong,190011,"Bloomington, IN",50,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,321.77587,Freewill,Rush,2018-11-19 23:39:22,2018-04-04 02:45:38,1
2098947,200,M,Andrew,paid,Juarez,1741654,1542670566000,Logged In,NextSong,190011,"Bloomington, IN",49,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,196.54485,Dip It Low,Christina Milian,2018-11-19 23:36:06,2018-04-04 02:45:38,1
2098946,200,M,Andrew,paid,Juarez,1741654,1542670252000,Logged In,NextSong,190011,"Bloomington, IN",48,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,314.53995,Thriller,Michael Jackson,2018-11-19 23:30:52,2018-04-04 02:45:38,1
2098945,200,M,Andrew,paid,Juarez,1741654,1542670013000,Logged In,NextSong,190011,"Bloomington, IN",47,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,239.17669,Teach Me How To Dougie,California Swag District,2018-11-19 23:26:53,2018-04-04 02:45:38,1
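
One way to sidestep the date leakage noted in the cell above is to build one example per user per cutoff date instead of filtering rows by label. The sketch below is an assumption about how this could be done, not the method used in this notebook: the helper `snapshot`, the two features, and the toy values are all hypothetical. Features come only from events up to the cutoff, the label comes only from the 10 days after it, and every cutoff must lie at least 10 days before the last observed event (2018-11-10 at the latest for this data) so that each label window is fully observed.

```python
import pandas as pd

CANCEL = "Cancellation Confirmation"

# Toy event log (made-up values) with the competition's column names.
logs = pd.DataFrame({
    "userId": ["a", "a", "a", "b", "b", "c", "c"],
    "page": ["NextSong", "NextSong", CANCEL,
             "NextSong", "NextSong", "NextSong", CANCEL],
    "time": pd.to_datetime(["2018-10-02", "2018-10-08", "2018-10-15",
                            "2018-10-03", "2018-11-05",
                            "2018-10-20", "2018-11-08"]),
})

def snapshot(logs, cutoff, horizon_days=10):
    """Hypothetical helper: one row per user still active at `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    past = logs[logs["time"] <= cutoff]
    future = logs[(logs["time"] > cutoff)
                  & (logs["time"] <= cutoff + pd.Timedelta(days=horizon_days))]
    # Users who cancelled before the cutoff are dropped from this snapshot.
    gone = past.loc[past["page"] == CANCEL, "userId"]
    churners = future.loc[future["page"] == CANCEL, "userId"]
    feats = (past[~past["userId"].isin(gone)]
             .groupby("userId")
             .agg(n_events=("page", "size"), last_seen=("time", "max")))
    feats["days_since_last_event"] = ((cutoff - feats["last_seen"])
                                      / pd.Timedelta(days=1))
    feats["churn"] = feats.index.isin(churners).astype(int)
    feats["cutoff"] = cutoff  # kept for time-based validation, not as a feature
    return feats.drop(columns="last_seen").reset_index()

cutoffs = ["2018-10-10", "2018-10-31"]
train = pd.concat([snapshot(logs, c) for c in cutoffs], ignore_index=True)
print(train)
```

Because positives and negatives are sampled at the same cutoffs, the date itself no longer separates the classes. Features are expressed relative to the cutoff (for example `days_since_last_event`) rather than as absolute timestamps.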
