## Churn prediction
#### Author: Aleksandra Kocot

Let's try to predict users churn based on dataset from Kaggle competition:
<br>
><b>"WSDM - KKBox's Churn Prediction Challenge
Can you predict when subscribers will churn?"</b>
<br>
https://www.kaggle.com/c/kkbox-churn-prediction-challenge/

<hr>
<h3> Introduction </h3>
The churning user is the one that does not renew the subscription within the 30 days after the current subscription expiration.

<code> sdfsdf </code>
$\sqrt{k}$

**Churn definition (in data description on Kaggle)** 
>The criteria of "churn" is no new valid service subscription within 30 days after the current membership expires.

In [3]:
import xgboost as xgb

### Non-contractual vs conctractual (subscription based) business model

In [1]:
import pandas as pd

## Reading the data

In [5]:
training_set_path = r"C:\Users\Olks\Desktop\churn_prediction\train_v2.csv"

In [6]:
training_set = pd.read_csv(training_set_path)

In [7]:
training_set.head(6)

Unnamed: 0,msno,is_churn
0,ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=,1
1,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,1
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1
3,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1
4,K6fja4+jmoZ5xG6BypqX80Uw/XKpMgrEMdG2edFOxnA=,1
5,ibIHVYBqxGwrSExE63/omeDD99M5vYB3CN2HzkEY+eM=,1


In [26]:
len(training_set)

970960

In [27]:
training_set.msno.nunique()

970960

In [9]:
user_logs_path = r"C:\Users\Olks\Desktop\churn_prediction\user_logs.csv"
user_logs_v2_path = r"C:\Users\Olks\Desktop\churn_prediction\user_logs_v2.csv"

In [11]:
user_logs_set = pd.read_csv(user_logs_path, nrows=1000, dtype = {"num_50": np.int8, "num_50": np.int8})
user_logs_v2_set = pd.read_csv(user_logs_v2_path, nrows=1000)

In [55]:
num_50 = pd.read_csv(user_logs_path, usecols = ["msno", "num_50"])

In [58]:
num_50.msno.nunique()

5234111

In [59]:
len(num_50)

392106543

In [60]:
num_50.num_50.max()

1710

In [12]:
user_logs_set.head(6)

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,rxIP2f2aN0rYNp+toI0Obt/N/FYQX8hcO1fTmmy2h34=,20150513,0,0,0,0,1,1,280.335
1,rxIP2f2aN0rYNp+toI0Obt/N/FYQX8hcO1fTmmy2h34=,20150709,9,1,0,0,7,11,1658.948
2,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150105,3,3,0,0,68,36,17364.956
3,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150306,1,0,1,1,97,27,24667.317
4,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150501,3,0,0,0,38,38,9649.029
5,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150702,4,0,1,1,33,10,10021.52


In [44]:
print(f"Table size: {user_logs_set.memory_usage().sum() / 2**10} KB")

Table size: 70.4375 KB


In [48]:
user_logs_set.describe()

Unnamed: 0,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20156590.0,6.085,1.257,0.884,0.92,28.573,28.314,7356.884146
std,6166.804,11.711231,2.487396,1.691569,2.323144,36.085549,30.275998,8630.169684
min,20150100.0,0.0,0.0,0.0,0.0,0.0,1.0,1.775
25%,20150700.0,0.0,0.0,0.0,0.0,6.75,7.0,1917.72775
50%,20160120.0,2.0,0.0,0.0,0.0,16.0,18.0,4430.198
75%,20160810.0,7.0,1.0,1.0,1.0,37.0,38.0,9653.27175
max,20170230.0,130.0,31.0,24.0,53.0,296.0,204.0,56870.736


In [None]:
print(f"Table size: {user_logs_set.memory_usage().sum() / 2**10} KB")

In [47]:
user_logs_set.dtypes

msno           object
date            int64
num_25          int64
num_50          int64
num_75          int64
num_985         int64
num_100         int64
num_unq         int64
total_secs    float64
dtype: object

In [13]:
user_logs_v2_set.head(6)

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=,20170331,8,4,0,1,21,18,6309.273
1,nTeWW/eOZA/UHKdD5L7DEqKKFTjaAj3ALLPoAWsU8n0=,20170330,2,2,1,0,9,11,2390.699
2,2UqkWXwZbIjs03dHLU9KHJNNEvEkZVzm69f3jCS+uLI=,20170331,52,3,5,3,84,110,23203.337
3,ycwLc+m2O0a85jSLALtr941AaZt9ai8Qwlg9n0Nql5U=,20170331,176,4,2,2,19,191,7100.454
4,EGcbTofOSOkMmQyN1NMLxHEXJ1yV3t/JdhGwQ9wXjnI=,20170331,2,1,0,1,112,93,28401.558
5,qR/ndQ5B+1cY+c9ihwLoiz+RFiqEnGyQKo32ZErEVKo=,20170331,3,0,0,0,39,41,9786.842


In [34]:
print(f"Min date in user logs set: {user_logs_set.date.min():>6}") 
print(f"Min date in user logs set v2: {user_logs_v2_set.date.min():>5}")

Min date in user logs set: 20150101
Min date in user logs set v2: 20170301


In [24]:
user_logs_set.shape[0]

18396362

In [25]:
len(user_logs_set)

18396362

In [35]:
user_logs_v2_set.msno.nunique()

997

 Let's base our solution on tips that the challenge winner, Bryian Gregory, gives in his article, <br>
**"Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data"**
<br>
https://medium.com/@bryan.gregory1/predicting-customer-churn-extreme-gradient-boosting-with-temporal-data-332c0d9f32bf

The data is too big to read all at once. However, to train a model we need user level data.
Therefore, we can read the data by chunks and update our user table.

Features: 
    1. Days passed from last day when user played at least 85% of the song

In [None]:
dpath = 'p_flg_tmp1.csv'

for pdf in pd.read_csv(dpath, chunksize=1000):
    *do something here*

In [7]:
transactions_path = r"C:\Users\Olks\Desktop\churn_prediction\transactions.csv"
transactions_v2_path = r"C:\Users\Olks\Desktop\churn_prediction\transactions_v2.csv"

In [36]:
transactions = pd.read_csv(transactions_path, usecols = ["msno", "membership_expire_date"])

In [38]:
transactions.msno.nunique()

2363626

In [43]:
expirations = transactions.groupby("msno").membership_expire_date.max()

In [44]:
expirations

msno
+++FOrTS7ab3tIgIh8eWwX4FqRv8w/FoiOuyXsFvphY=    20160914
+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=    20170104
+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=    20170315
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=    20170319
+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=    20170326
                                                  ...   
zzz9+ZF4+GMyt63oU8xfjo1EkvRqH5OINlES0RUJI6I=    20161113
zzzF1KsGfHH3qI6qiSNSXC35UXmVKMVFdxkp7xmDMc0=    20170304
zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=    20170204
zzztsqkufVj9DPVJDM3FxDkhlbCL5z4aiYxgPSGkIK4=    20150615
zzzyOgMk9MljCerbCCYrVtvu85aSCiy7yCMjAEgNYMs=    20150615
Name: membership_expire_date, Length: 2363626, dtype: int64

In [19]:
expirations.min()

20150101

In [22]:
expirations_march = expirations.loc[expirations >= 20170201]

In [23]:
expirations_march

msno
+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=    20170215
+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=    20170226
++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=    20170215
++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=    20170220
++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=    20170224
                                                  ...   
zzx4hKiyR9XFEGAr7SAjcCPbKJCZ+IqegWL7dPjPwZk=    20170218
zzxZeMFx2fjfKZigMnJa2w0EmloDbm8+8nTf/o/00GY=    20170226
zzxi7n5xoTYo9Q3VTygLWvl/rBDcexwaeAry0yK7Q0E=    20170218
zzzF1KsGfHH3qI6qiSNSXC35UXmVKMVFdxkp7xmDMc0=    20170205
zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=    20170202
Name: transaction_date, Length: 885566, dtype: int64

In [37]:
transactions_v2 = pd.read_csv(transactions_v2_path, usecols = ["msno", "membership_expire_date"])

In [13]:
transactions_v2.msno.nunique()

1197050

In [45]:
expirations_vs = transactions_v2.groupby("msno").membership_expire_date.max()

In [46]:
expirations_vs

msno
+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=    20180206
+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=    20170415
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=    20170519
+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=    20170426
++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=    20170415
                                                  ...   
zzy0oyiTnRTo5Mbg23oKbBkf9eoaS7+eU4V+d14bzfY=    20170527
zzy7iqSpfcRq7R4hmKKuhI+CJRs79a6pteqEggpiNO0=    20170401
zzyHq6TK2+cBkeGFUHvh12Z7UxFZiSM7dOOSllSBPDw=    20170410
zzz1Dc3P9s53HAowRTrm3fNsWju5yeN4YBfNDq7Z99Q=    20170524
zzzF1KsGfHH3qI6qiSNSXC35UXmVKMVFdxkp7xmDMc0=    20170404
Name: membership_expire_date, Length: 1197050, dtype: int64

In [47]:
expirations_vs_february = expirations_vs.loc[(expirations_vs >= 20170201) & (expirations_vs <= 20170228)]

In [48]:
expirations_vs_february

msno
+umumRhOZ0IVup9DS+caJIgZNks+ZiGJbss3GAhB8TM=    20170227
/1jSD2n7XV3ntAn8KsvkiyxMk6ZHJxhBvF/G7k4wBxo=    20170228
/6k3KjKec/1g8iQif8LGLmw1N/2ZUNlm/OzDPEgYpf0=    20170228
/a9leEtnr5OQYFQpXZq1c+cYNakhNxvYMQ8J9BW09pc=    20170228
0B11048lE+vvRBnfwLNmG4nHxYn6LvqH49F2bQP0g3Q=    20170228
                                                  ...   
xVqiP2TUx2h0wJDV84Lvcppqyb9T2GhkhsJ20/DQUoI=    20170228
xZ6xQpHYfinjfH6PntMdzeEQ4/U+lb1pbVKSF9KsrcY=    20170228
yPGGg/w2nzSPArf+sjXI3LaUWfFerNYdhmq349inj68=    20170227
yXeNejvmgWIStsP4C2ec1O1J+XZf4ilKU03FtG75hUM=    20170228
zmaCzj5ovJi8a553Hnd3ECXHH7TB7SXElV6uEVNLhfg=    20170228
Name: membership_expire_date, Length: 133, dtype: int64

In [35]:
expirations.loc["zRZce8QGUgg1PySSFfAevYSVwSBaL2Xly0UBcNCzCb4="]

20170128