<a href="https://colab.research.google.com/github/Dellainey/Predict-Customer-Churn/blob/update-1/predict%20customer%20churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Customer churn** is the loss of customers or clients; **customer attrition** and **customer turnover** are other terms for the same thing. Organizations track it as one measure of their performance. Losing a customer is a serious concern because retaining an existing customer generally takes far less effort than acquiring a new one.

The objective of this project is to **predict customer churn** so that the teams concerned can be alerted in time to take appropriate retention measures.

# **DATASET:**
To perform this task, we will use the dataset provided by KKBOX (Asia's leading music streaming service) for the WSDM challenge, available on Kaggle at https://www.kaggle.com/c/kkbox-churn-prediction-challenge (you have to accept the competition rules to use the dataset).

The following steps read in the data tables.
Since the dataset is 30+ GB, we will randomly select a small share of the users in the train data and trim the other tables to keep only the rows for those customers.

In [1]:
# import the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# 5 steps to load a dataset from Kaggle
# Thanks to Filemon for his post on Kaggle (https://www.kaggle.com/general/74235). It saved me a lot of time.

# Step 1
! pip install -q kaggle

# installing the updated version
!pip install --upgrade --force-reinstall --no-deps kaggle

# Step 2
from google.colab import files
files.upload()    # a prompt will appear; locate the kaggle.json file that you downloaded

# Step 3
! mkdir -p ~/.kaggle    # -p avoids an error if the directory already exists

# Step 4
! cp kaggle.json ~/.kaggle/

# Step 5
! chmod 600 ~/.kaggle/kaggle.json

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/62/ab/bb20f9b9e24f9a6250f95a432f8d9a7d745f8d24039d7a5a6eaadb7783ba/kaggle-1.5.6.tar.gz (58kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.6-cp36-none-any.whl size=72859 sha256=7667f7213910d3fa47c94b30300be366d8fece98f90415a818c39d6a4defcfef
  Stored in directory: /root/.cache/pip/wheels/57/4e/e8/bb28d035162fb8f17f8ca5d42c3230e284c6aa565b42b72674
Successfully built kaggle
Installing collected packages: k

Saving kaggle.json to kaggle.json


In [None]:
#sanity check
! kaggle datasets list

ref                                                               title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
Cornell-University/arxiv                                          arXiv Dataset                                     877MB  2020-08-14 23:50:57           1779        417  0.875            
agirlcoding/all-space-missions-from-1957                          All Space Missions from 1957                      101KB  2020-08-13 16:18:58            805         92  0.85294116       
landlord/handwriting-recognition                                  Handwriting Recognition                             1GB  2020-08-05 17:20:36            266         50  0.9411765        
jmmvutu/summer-products-and-sales-in-ecommerce-wish         

In [3]:
# copy the API command from the kaggle dataset
!kaggle competitions download -c kkbox-churn-prediction-challenge

Downloading kkbox-churn-prediction-challenge.zip to /content
100% 8.34G/8.34G [02:10<00:00, 80.9MB/s]
100% 8.34G/8.34G [02:10<00:00, 68.7MB/s]


In [4]:
# making a directory for the main folder and unzipping its content
! mkdir kkbox-churn-prediction-challenge
! unzip kkbox-churn-prediction-challenge.zip -d kkbox-churn-prediction-challenge

Archive:  kkbox-churn-prediction-challenge.zip
  inflating: kkbox-churn-prediction-challenge/WSDMChurnLabeller.scala  
  inflating: kkbox-churn-prediction-challenge/members_v3.csv.7z  
  inflating: kkbox-churn-prediction-challenge/sample_submission_v2.csv.7z  
  inflating: kkbox-churn-prediction-challenge/sample_submission_zero.csv.7z  
  inflating: kkbox-churn-prediction-challenge/train.csv.7z  
  inflating: kkbox-churn-prediction-challenge/train_v2.csv.7z  
  inflating: kkbox-churn-prediction-challenge/transactions.csv.7z  
  inflating: kkbox-churn-prediction-challenge/transactions_v2.csv.7z  
  inflating: kkbox-churn-prediction-challenge/user_logs.csv.7z  
  inflating: kkbox-churn-prediction-challenge/user_logs_v2.csv.7z  


In [None]:
# make a directory to save the unzipped content
# you don't have to make a directory if the format is .7z
# ! mkdir kkbox-churn-prediction-challenge/transactions_v2
# ! mkdir kkbox-churn-prediction-challenge/sample_submission_zero
# ! mkdir kkbox-churn-prediction-challenge/transactions
# ! mkdir kkbox-churn-prediction-challenge/user_logs_v2
# ! mkdir kkbox-churn-prediction-challenge/sample_submission_v2
# ! mkdir kkbox-churn-prediction-challenge/train_v2
# ! mkdir kkbox-churn-prediction-challenge/train
# ! mkdir kkbox-churn-prediction-challenge/members_v3
# ! mkdir kkbox-churn-prediction-challenge/user_logs

In [6]:
# unzipping the contents

# If the file were a plain .zip file, you would use the command below to unzip it into a directory
#! unzip kkbox-churn-prediction-challenge/transactions_v2.csv.7z -d transactions_v2

# Since the files are in .7z format, we extract them with 7z instead
# (train.csv and user_logs.csv are read by later cells, so they are extracted as well)
#! 7z e kkbox-churn-prediction-challenge/transactions_v2.csv.7z
! 7z e kkbox-churn-prediction-challenge/sample_submission_zero.csv.7z
! 7z e kkbox-churn-prediction-challenge/transactions.csv.7z
#! 7z e kkbox-churn-prediction-challenge/user_logs_v2.csv.7z
#! 7z e kkbox-churn-prediction-challenge/sample_submission_v2.csv.7z
#! 7z e kkbox-churn-prediction-challenge/train_v2.csv.7z
! 7z e kkbox-churn-prediction-challenge/train.csv.7z
! 7z e kkbox-churn-prediction-challenge/members_v3.csv.7z
! 7z e kkbox-churn-prediction-challenge/user_logs.csv.7z


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.00GHz (50653),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 32828332 bytes (32 MiB)

Extracting archive: kkbox-churn-prediction-challenge/sample_submission_zero.csv.7z
--
Path = kkbox-churn-prediction-challenge/sample_submission_zero.csv.7z
Type = 7z
Physical Size = 32828332
Headers Size = 162
Method = LZMA2:24
Solid = -
Blocks = 1

  0%    
Would you like to replace the existing file:
  Path:     ./sample_submission_zero.csv
  Size:     45635134 bytes (44 MiB)
  Modified: 2017-09-18 18:24:04
with the file from archive:
  Path:     sample_submission_zero.csv
  Size:     45635134 bytes (44 MiB)
  Modified: 2017-09-18 18:24:04
? (Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit? y

  6% - sample_submission_zero.csv      

In [None]:
# This is another method to load the data from your local disk.

# from google.colab import files
# uploaded = files.upload()       #It will prompt to choose the file location


In [None]:
train = pd.read_csv('train.csv')
train.shape

(992931, 2)

In [None]:
# Since the dataset is too large, let us first work with a small random sample (about 2% of each class).
# Once the algorithm works well on the sample, we can include all the data.
train_1 = train[train['is_churn']==0].reset_index(drop = True)
rv_1 = np.random.rand(train_1.shape[0])
train_2 = train[train['is_churn']==1].reset_index(drop = True)
rv_2 = np.random.rand(train_2.shape[0])
train_1 = train_1.loc[rv_1<0.02]
print(train_1.shape)
train_2 = train_2.loc[rv_2<0.02]
print(train_2.shape)
train = pd.concat([train_1, train_2], ignore_index=True)
train.shape

(18705, 2)
(1273, 2)


(19978, 2)
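
The per-class sampling above draws fresh random numbers on every run, so the selected users change each time. As a minimal sketch of a repeatable variant, assuming pandas 1.1 or later (for `groupby(...).sample`) and using a made-up `toy_train` table in place of the real `train.csv`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: 1,000 users, roughly 6% churners (made-up data).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "msno": [f"user_{i}" for i in range(1000)],
    "is_churn": (rng.random(1000) < 0.06).astype(int),
})

# Draw the same fraction from each class; random_state makes the draw repeatable.
sample = (
    toy_train.groupby("is_churn", group_keys=False)
    .sample(frac=0.1, random_state=42)
    .reset_index(drop=True)
)

print(sample["is_churn"].value_counts())
```

The fraction (0.1) and the seed (42) are arbitrary choices for the toy data; on the real table a fraction of 0.02 would match the 2% used above.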

In [None]:
# Let us see how many customers churn and how many don't in this smaller dataset
print(train[train['is_churn']==0].shape)
train[train['is_churn']==1].shape

(18705, 2)

Colab wipes the session storage when you disconnect, so all your data will be gone when you come back to this project. It is therefore wise to store your in-process data on your Google Drive (as shown below) or on your local disk.

In [None]:
# copying the newly created train dataset to the google drive 
# you will need to give access to your google file system

# from google.colab import drive
# drive.mount('/drive')
# train.to_csv('/drive/My Drive/Colab_datasets/Churn_prediction/train.csv')


In [None]:
# Here we save a copy of the smaller train set on our local disk
from google.colab import files
train.to_csv('train.csv',index=False) 
files.download('train.csv')


In [19]:
msno = train['msno'].to_list()    #'msno' stores the list of customers we will use in our model
len(msno)

19978

In [20]:
# Let us now extract the data for only these users from the other tables and save it for processing
transactions = pd.read_csv('transactions.csv')
members = pd.read_csv('members_v3.csv')
# user_logs.csv (about 28 GB) does not fit in Colab's memory; it is read in chunks further down
# user_logs = pd.read_csv('user_logs.csv')

In [21]:
# Keeping only the data pertaining to the customers in the train set
transactions = transactions[transactions['msno'].isin(msno)]
members = members[members['msno'].isin(msno)]

Let us save these two tables to Google Drive so we don't have to re-run the above steps.

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
# saving the two tables to Google Drive
transactions.to_csv('transactions.csv',index=False)
members.to_csv('members.csv',index=False)
!cp transactions.csv "drive/My Drive/Colab_datasets/Churn_prediction/transactions.csv"
!cp members.csv "drive/My Drive/Colab_datasets/Churn_prediction/members.csv"

Counting the customers in these two tables shows that every customer has at least one transaction, but not every customer has an entry in the members table.

In [23]:
# checking the number of customer details available in the two tables
print(transactions['msno'].nunique())
members['msno'].nunique()

19978


17575
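
To see which customers have no entry in the members table, a left merge with an indicator column is one option. The sketch below uses made-up toy tables (`toy_train`, `toy_members`) and an illustrative `city` column rather than the real data:

```python
import pandas as pd

# Toy stand-ins for the filtered tables (made-up IDs and values).
toy_train = pd.DataFrame({"msno": ["a", "b", "c", "d"], "is_churn": [0, 0, 1, 0]})
toy_members = pd.DataFrame({"msno": ["a", "c"], "city": [1, 5]})

# A left merge keeps every train user; _merge marks rows with no match in members.
merged = toy_train.merge(toy_members, on="msno", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "msno"].tolist()

print(missing)  # ['b', 'd']
```

The member columns of these users come out as NaN after the merge, which is worth keeping in mind when cleaning the data later.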

In [None]:
# Alternative: read only a sample of user_logs.csv by keeping every n-th line
# n = 100  # every 100th line = 1% of the lines

# user_logs = pd.read_csv('user_logs.csv', header=0, skiprows=lambda i: i % n != 0)
# user_logs.shape

(3921065, 9)

The free version of Colab provides only 12.5 GB of RAM, which is not enough for my 28 GB 'user_logs' table. To work around this, I will read the file in chunks, keep only the rows for the customers we are interested in from each chunk, and then concatenate the filtered chunks.

In [None]:
iter_csv = pd.read_csv('user_logs.csv', iterator=True, chunksize=1000)    # you can try with different chunk size
user_log = pd.concat([chunk[chunk['msno'].isin(msno)] for chunk in iter_csv])
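
The same chunk-and-filter pattern can be tried on a small scale before running it on the full file. This is only a sketch: the in-memory CSV, the `num_100` column, and the `wanted` IDs are made up for illustration.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for user_logs.csv (made-up rows).
csv_text = "msno,date,num_100\n" + "\n".join(
    f"user_{i % 5},201701{(i % 28) + 1:02d},{i}" for i in range(20)
)
wanted = {"user_1", "user_3"}   # IDs to keep

# Read 6 rows at a time, keep only the wanted users, then stitch the pieces together.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=6)
filtered = pd.concat(
    (chunk[chunk["msno"].isin(wanted)] for chunk in reader),
    ignore_index=True,
)

print(filtered.shape)  # (8, 3)
```

Only one chunk plus the rows kept so far are held in memory at a time, so the chunk size mainly trades speed against peak memory use.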

In [None]:
user_log.to_csv('user_log.csv', index=False) 
files.download('user_log.csv')


Before we get into data cleaning and modelling, let us understand the task and the dataset and make assumptions where required.

The dataset also contains an update from November 2017; for simplicity, we will not include it.

Task: Predict customer churn for the month of March 2017. **A customer is considered to have churned if he/she does not make a new service subscription transaction within 30 days after the current membership expiration date.**
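
The churn rule can be written down as a small labelling sketch. The table below is made up, the column `next_transaction_date` is a hypothetical helper, and treating a renewal exactly 30 days after expiry as "not churned" is an assumption:

```python
import pandas as pd

# Made-up example: one row per user with the current expiry date and the date of
# the next subscription transaction (NaT if there was none).
toy = pd.DataFrame({
    "msno": ["a", "b", "c"],
    "membership_expire_date": pd.to_datetime(["2017-03-05", "2017-03-10", "2017-03-20"]),
    "next_transaction_date": pd.to_datetime(["2017-03-15", "2017-04-20", None]),
})

# Days between expiry and the next transaction (NaN when there is no next transaction).
gap_days = (toy["next_transaction_date"] - toy["membership_expire_date"]).dt.days

# Churn = no renewal at all, or a renewal more than 30 days after expiry.
toy["is_churn"] = (gap_days.isna() | (gap_days > 30)).astype(int)

print(toy[["msno", "is_churn"]])
```

Here user `a` renews after 10 days and is not a churner, while `b` (41 days) and `c` (no renewal) are labelled as churned.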

Understanding the problem and the data first always helps me with modelling and with avoiding silly mistakes.

Certain key points:

1.   A customer can renew the subscription automatically or manually.
2.   A customer can cancel the subscription at any time.
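
These two points can be summarised per customer once the transactions are loaded. The sketch below assumes the transactions table carries 0/1 flag columns named `is_auto_renew` and `is_cancel`; the rows and the output column names are made up for illustration.

```python
import pandas as pd

# Made-up transactions; is_auto_renew and is_cancel are assumed to be 0/1 flags.
toy_tx = pd.DataFrame({
    "msno": ["a", "a", "b", "b", "b"],
    "is_auto_renew": [1, 1, 0, 0, 0],
    "is_cancel": [0, 0, 0, 1, 0],
})

# One row per customer: transaction count, share of auto-renewals, any cancellation.
per_user = toy_tx.groupby("msno").agg(
    n_transactions=("is_cancel", "size"),
    auto_renew_share=("is_auto_renew", "mean"),
    ever_cancelled=("is_cancel", "max"),
)

print(per_user)
```

In this toy example customer `a` always auto-renews and never cancels, while customer `b` renews manually and has cancelled once.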

Now that the needed files are reduced and saved in the drive, let us read them and do the necessary cleaning.


In [28]:
train = pd.read_csv('/content/drive/My Drive/Colab_datasets/Churn_prediction/train.csv')
members = pd.read_csv('/content/drive/My Drive/Colab_datasets/Churn_prediction/members.csv')
transactions = pd.read_csv('/content/drive/My Drive/Colab_datasets/Churn_prediction/transactions.csv')
user_log = pd.read_csv('/content/drive/My Drive/Colab_datasets/Churn_prediction/user_log.csv')

to be continued ...