## Setting up the environment (data + libraries)

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
# read the dataset
train = pd.read_csv('data/training_set.csv')

## Skimming through the data

In [None]:
# try to extract some data instances
train.head()

As seen from the sample of 5 extracted rows:
* there are 46 features
* missing values are denoted as NaN (and not -1, as in the Porto Seguro competition)

In [None]:
train.columns.values # all the features of training set

In [None]:
train.info()

From the above summary:
* total rows: **992,931**
* types of features:
    * **categorical**: city, bd, gender, registered_via, is_auto_renew_median, is_auto_renew_last, plan_list_price_mean, plan_list_price_last, is_cancel_mean, is_cancel_last
    * **object**: TimeSinceReg, msno
    * **numerical**: all the other features

In [None]:
train.describe()

In [None]:
train[['msno','is_churn']].groupby(['is_churn'], as_index=False).count()

'0' denotes clients who renew their service and '1' those who churn. Only about **6.82%** of customers churn after their subscription expires.
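The churn share can also be computed directly rather than read off the count table. The following is a minimal sketch on a hypothetical toy label series (the values are made up and stand in for `train['is_churn']`); it assumes the label is coded as 0/1, as described above.

```python
import pandas as pd

# Hypothetical toy labels standing in for train['is_churn'] (not the real data).
labels = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="is_churn")

# Share of each class; the entry for 1 is the churn rate.
class_share = labels.value_counts(normalize=True)

# Because the label is coded 0/1, its mean equals the churn rate.
churn_rate = labels.mean()

print(class_share)
print(f"churn rate: {churn_rate:.2%}")
```

On the real data, the same two calls applied to `train['is_churn']` should give the class proportions without any manual division.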

## Feature selection (by human understanding)

This section covers feature selection based only on human understanding, i.e. which features seem reasonable to remove manually, without considering any indicator of feature importance.

In [None]:
Y = train['is_churn'] # extract the label variables

In [None]:
# features to remove
to_rem = ['Unnamed: 0', 'msno', 'is_churn', 'membership_expire_date_last', 'transaction_date_last']

# please correct me if I'm wrong, but I don't think the transaction date records have an important effect on the prediction,
# except for TimeSinceReg
X = train.drop(to_rem, axis=1) # remove the row index, msno, is_churn and the two date columns from the data

## Feature transformation (TimeSinceReg) 

In [None]:
# extract the number of days from the attribute TimeSinceReg
# at first the data was '4000 days 00:00:00.000000...'
# at the end, we only need the concrete day such as 4000 --> replace the original data of TimeSinceReg by the number of days

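regexp = re.compile(r'(-?[0-9]+)')
tmp = []
for t in train['TimeSinceReg']:
    # missing values (NaN) are not strings: map them to 0 days
    if not isinstance(t, str):
        tmp.append(0)
        continue
    result = regexp.match(t)
    # strings that do not start with a number are also mapped to 0
    tmp.append(int(result.group(1)) if result else 0)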

X['TimeSinceReg'] = tmp

After this step, all features of the ambiguous type *object* (msno and TimeSinceReg) have been either removed or transformed into a type appropriate for machine learning methods.
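The same extraction can be written without an explicit loop. This is a sketch only, run on a hypothetical three-row sample that mimics the `'4000 days 00:00:00...'` format; the sample strings are assumptions, and missing values are mapped to 0 to match the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the 'N days HH:MM:SS...' strings; NaN marks a missing value.
time_since_reg = pd.Series(
    ["4000 days 00:00:00.000000000", np.nan, "-12 days +00:00:00.000000000"],
    dtype="object",
)

# Capture the leading (optionally negative) integer; rows without a match become NaN.
days = time_since_reg.str.extract(r"^(-?\d+)", expand=False)

# Convert to numbers and map missing values to 0, as the loop above does.
days = pd.to_numeric(days).fillna(0).astype(int)

print(days.tolist())
```

Mapping missing values to 0 keeps the original behaviour, but it makes a missing registration time indistinguishable from a genuine 0 days, which may be worth revisiting.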

## Features with missing values

In [None]:
X.isnull().any(axis=1).sum() # count the total number of rows that have one or more null values

There are 296,529 rows with a null value in one or more attributes. These rows make up approximately **29.86%** of the entire dataset.
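To see where the missing values are concentrated, the row-level count can be complemented by a per-column breakdown. The sketch below uses a hypothetical toy frame in place of `X`; the column names are borrowed for illustration and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X (values are made up).
toy = pd.DataFrame({
    "city": [1.0, np.nan, 5.0, 13.0],
    "bd": [28.0, 31.0, np.nan, 0.0],
    "count_1mo": [10.0, 3.0, 7.0, 2.0],
})

# Fraction of missing values per column, largest first.
null_share_by_col = toy.isnull().mean().sort_values(ascending=False)

# Fraction of rows with at least one missing value.
null_row_share = toy.isnull().any(axis=1).mean()

print(null_share_by_col)
print(f"rows with at least one null: {null_row_share:.2%}")
```

Applied to `X`, the second quantity should reproduce the percentage quoted above, while the first shows which features are responsible for it.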

## Correlated features

In [None]:
# the list of all categorical features
cat_feat = [
    "city",
    "bd",
    "gender", 
    'registered_via', 
    'is_auto_renew_median', 
    'is_auto_renew_last',
    'plan_list_price_mean', 
    'plan_list_price_last',
    'is_cancel_mean','is_cancel_last']

In [None]:
X_num = X.drop(cat_feat, axis=1) # extract the training set that contains only numerical features

In [None]:
# libraries to do pretty plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn style
sns.set_style("whitegrid")

In [None]:
# Getting correlation matrix
cor_matrix = X_num.corr().round(2)

# Plotting heatmap
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(cor_matrix, annot=True, center=0, cmap=sns.diverging_palette(250, 10, as_cmap=True), ax=ax)
plt.show()

The correlated features can be read off this correlation matrix. Features with a correlation coefficient of at least 0.8 are listed below:
* payment_plan_days_mean <==> payment_plan_days_last, actual_amount_paid_mean, actual_amount_paid_last
* payment_method_id_mean <==> payment_method_id_last
* num_25_avg_1mo <==> num_25_avg_3mo
* num_50_avg_1mo <==> num_50_avg_3mo
* num_75_avg_1mo <==> num_75_avg_3mo
* num_985_avg_1mo <==> num_985_avg_3mo
* num_100_avg_1mo <==> num_100_avg_3mo, total_secs_avg_1mo, total_secs_avg_3mo (there is an interesting pattern between the **1mo** and **3mo** features)
* num_unq_avg_1mo <==> total_secs_avg_1mo, num_unq_avg_3mo
* count_1mo <==> count_3mo
* num_100_avg_3mo <==> total_secs_avg_3mo
* num_unq_avg_3mo <==> num_unq_avg_6mo
* total_secs_avg_3mo <==> total_secs_avg_6mo
* num_100_avg_6mo <==> num_unq_avg_6mo, total_secs_avg_6mo
* num_unq_avg_6mo <==> total_secs_avg_6mo
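Reading pairs off a 20x20 heatmap is error-prone, so the list above could also be generated from the matrix itself. The following is a sketch under assumptions: `correlated_pairs` is a hypothetical helper (not part of pandas), the toy columns `a`, `b`, `c` are invented, and the comparison uses the signed coefficient, so strongly negative correlations are not reported.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Hypothetical helper: list (feature_a, feature_b, r) with r >= threshold."""
    corr = frame.corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # Masked (NaN) cells fail this comparison, so they are filtered out here.
    pairs = pairs[pairs >= threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs.items()]

# Hypothetical toy data: 'b' closely follows 'a', 'c' is only weakly related to both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

result = correlated_pairs(toy, threshold=0.8)
print(result)
```

Calling `correlated_pairs(X_num)` would be the analogous step on the numerical features. To include strong negative correlations as well, the comparison could be made on `pairs.abs()` instead.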

A quick look at the number of distinct values for categorical variables:

In [None]:
for v in cat_feat:
    print('%s has %d unique values' % (v, len(X[v].unique())))